Cluster Notes

Topic:

Understanding the differences between Grid Engine 5.3, 6.0 and Sun N1 Grid Engine 6 (N1GE 6)

Important Historical Note:

[2008] - The information contained in this document is VERY OUT OF DATE in that it does not mention any Grid Engine version after 6.0 including the quite different SGE 6.1 release (introduction of resource quotas) and SGE 6.2 (massive scaling improvements and sge_schedd becomes a qmaster thread). The only reason this document still exists is becuase peopel still seem to look for it via google etc. The original intent of this document was to describe the major changes in SGE 6.0 at a point in time where there was little other public information available. That time has now long past.

People looking for more up to date infromation may want to check out my blog (http://gridengine.info) or the newly deployed Sun Wiki site for Grid Engine (http://wikis.sun.com/display/GridEngine). For more updated information on differences and new features in modern SGE releases, check out DanT's excellent blog at http://blogs.sun.com/templedf/category/Grid - he has some great "Why upgrade?" articles that clearly talk about difference and new features.

Disclaimer:

I wrote this to facilitate my own work and drew upon many different sources. All mistakes are my own.

Author:

Chris Dagdigian, dag@sonsorol.org
BioTeam Inc.
http://bioteam.net


Version:

1.9 -- Last updated June 29, 2004
The most recent version of this document will always be available at http://bioteam.net/dag/


Revision History:

1.0 - Feedback including other useful 'qstat' options from Andy Schwierskott added
1.1 - Dumb typo corrected. Reporting Console exports data in 'CSV' not 'CVS' format!
1.2 - Added new content to the details section including cluster queues, etc.; cleared up some questions I had posed.
1.3 - More feature additions I picked out of emails to the developer list from Shannon (re: job priority) and Charu (re: job names)
1.4 - Added minor detail about how parallel environments (PE) configuration has changed slightly, fixed formatting, added some more screenshots
1.5 - Error corrections (advance reservation, etc.) and other content additions.
1.6 - Personal commentary added in some places, documented the default tuning profiles. Mentioned the new autoinstallation features.
1.7 - Minor additions now that N1GE has been announced. Added the "no more free grid engine!" section.
1.9 - More minor additions, completed some unfinished observations.

Table of Contents

What is Grid Engine 6.0?

What is Sun N1 Grid Engine 6 (N1GE)?

Sun says that there is no more 'free' Grid Engine and that 5.3 is being discontinued!

Why this document?

Grid Engine 6 Documentation and Manuals

Executive Summary -- The biggest changes in Grid Engine 6 and Sun N1GE 6

DETAIL

 

What is Grid Engine 6.0?

Grid Engine 6.0 is a distributed resource management (DRM) software layer developed and distributed under an open source license. The project lives at http://gridengine.sunsource.net.

What is Sun N1 Grid Engine 6 (N1GE)?

N1GE 6 is a branded and official Sun Microsystems software product commercially available worldwide. It was announced on June 1st, 2004.

The N1GE product is based from the same codebase as the open source Grid Engine project.

The primary differences between N1GE 6 and Grid Engine 6 are:

  • N1GE 6 is not available for free
  • Sun does additional QA and QC testing on each new N1GE release or revision
  • Sun formally offers support, training and professional services for N1GE worldwide
  • Sun adds non-English support to N1GE that is not available in Grid Engine
    • Sun ONE Grid Engine 5.3 was localized for: French, Japanese, Simplified Chinese, Traditional Chinese and Korean so it is safe to assume that these will also be targets for 6.0. It is unclear when the N1GE6 localization will be completed as english manuals and documentation are still being updated and written.
  • Sun adds monitoring, reporting and accounting features that are not available in Grid Engine
  • N1GE will play nicely with other Sun N1 software product and provisioning initiatives

Sun says that there is no more 'free' Grid Engine and that 5.3 is being discontinued!

Don't worry -- this is only true for the Sun branded versions of Grid Engine.

What is important to understand is that Sun, particularly in marketing, product and sales materials has to be very careful about the phrasing and language that is used. It helps if you understand that when Sun says something "official" about Grid Engine on their website and product materials they are almost always just talking about their own branded product(s) only and not the open source Grid Engine suite from http://gridengine.sunsource.net.

Previously, when the Grid Engine suite had both a "standard" and an "enterprise" operating mode Sun would offer "Grid Engine 5.3 Standard Edition" for free to its customers while charging a fee for use of "Grid Engine 5.3 Enterprise Edition".

Sun Grid Engine 5.3 Standard Edition - free, Sun Grid Engine 5.3 Enterprise Edition - not free.

Now that Grid Engine 6 ships with "Enterprise Edition" behavior as the default, what used to be two products within Sun has become a single product called N1 Grid Engine 6. This is only available to paying customers now, there is no "free" version of the Sun branded Grid Engine product.

With this new product announced and available, Sun is starting the process managing the end-of-life transition of Grid Engine 5.3 and 5.3 Enterprise Edition. Existing customers and people with support contracts will all be taken care of . New customers will be discouraged (and at some point, not allowed) to download or deploy the branded versions of the v5.3 product.

These actions have no effect at all on the versions of Grid Engine 5.3 and Grid Engine 6 that are hosted at the open source site. The latest version of Grid Engine available on the open source site is SGE-5.3p6 and the official release of SGE-6.0 is only days away.

Why this document?

The differences between the current Grid Engine 5.3 and the future Grid Engine 6.0 (and the equivalent Sun branded versions) are very significant and represent fundamental change in how Grid Engine is architected, deployed, used and configured. This is also the first time that the commercial version of Grid Engine from Sun will offer significant new functionality above and beyond what the open source version provides.

Information on the specific features and changes found in 6.0 is still sort of hard to come by as the technical documentation has lagged behind the code development. The best information still seems to be with the developers, the manpages downloaded from the CVS codebase repository and the various technical presentations made by Grid Engine developers that I've been given or found on the net.

I needed a real, plain-english listing of the major differences between 5.3 and 6.0 so I wrote one.

Grid Engine 6 Documentation and Manuals

Grid Engine 6 went into beta without any manuals or documentation which was pretty confusing.
Now Sun has posted the N1 Grid Engine full documentation suite including the release notes and the Admin, User and Installation guides. These are useful to both N1 Grid Engine users as well as users of the open source Grid Engine. The documentation collection is available online in browsable HTML or downloadable PDF form at http://docs.sun.com/db/coll/1017.3.

Executive Summary -- The biggest changes in Grid Engine 6 and Sun N1GE 6

Cluster Queues

Quite possibly the biggest change of all. Grid Engine now supports the concept of queues that can span multiple hosts or hostgroups (hostgroup objects are another new feature). This is the same behavior that similar products such as PBS and Platform LSF have used for a very long time. Previous versions of Grid Engine treated 'queues' as job containers that were associated with a specific machine or server host. Entire bodies of Grid Engine configuration and deployment best-practices are based upon host-associated queues. Initial impressions from technical docs and presentations suggest that the use of 'cluster queues' is now the default recommended behavior. If true, this is not just a nice new feature, it is a fundamental change in how Grid Engine is configured and used. Nothing relating to "classic queues" is being deprecated though -- in fact they are still present and 'queue instances' still run on every execution host. Think of cluster queues as an administrative aid and a way to group together systems that may share constants such as prolog/epilog scripts, etc.

Significant scheduler improvements - including unified ticketing, resource reservation and backfill capability

The scheduler has been improved and now supports resource reservation and backfilling. 'Advance reservation' is not implemented yet but will be. Powerful new scheduling algorithms have been added as well.

XML reporting output

The 'qstat' program now has a powerful "-xml" output option. For people who parse 'qstat' output to build monitoring scripts this is a significant new feature.

DRMAA 1.0 compliant

Grid Engine 6 meets the 1.0 specs from www.drmaa.org. This is a standard API meant to allow people to submit, control and monitor jobs across various cluster schedulers and grid types. This is not something that end-users will find useful. People developing cluster aware software or workflow mechanisms will finally have a public API that they can call upon for job submission and control. Even better, this API should work for PBS, PBSPro and Condor systems or any other DRM that supports the DRMAA 1.0 specification.

Daemon and communication changes

The Grid Engine daemons and communication methods have been re-architected. The sge_commd communication daemon is gone. The sge_qmaster, sge_schedd and sge_execd daemons all use separate TCP ports now. The sge_qmaster daemon is now multi-threaded.

BerkeleyDB spooling

Grid Engine now can use BerkeleyDB to log state and status data. The BerkeleyDB files can be written to local directories or sent via a RPC based mechanism to a remote spooling server. This change opens up room for significant performance and scaling gains but can potentially have negative impact on cluster performance, failover and/or security depending on what options are selected. People deploying Grid Engine 6 will have to strike a balance between shadow master failover, performance and security when trying to decide on using "classic spooling" vs local BerkeleyDB vs remote BerkeleyDB spooling. This is a major change.

Improved resource request syntax

Support for wildcards ('*') and ('?'), ranges, and additional regular expression type syntax such as the logical ('|') OR operator have been added.

Improved remote installation

Grid Engine can be fully deployed to remote execution hosts automatically from a management node. The system uses a configuration file to guide an automated installation that is performed via rsh or passwordless ssh (one of these mechanisms must be properly configured beforehand).

Future Windows support

For the first time there are formal mentions of supporting Windows operating systems. Nobody is on the record with any sort of official date. Some references refer to "end of 2004" as a rough guideline. It is very likely that Windows support will only appear in the commercial N1 Grid Engine product from Sun.

"Enterprise Edition" behaviour is the default

Grid Engine prior to 6.0 had a "standard" and an "enterprise" edition. Both editions used the same codebase and binaries. The only difference was that the enterprise edition 'activation' enabled new scheduling algorithims and resource allocation policies. As of Grid Engine 6.0 and N1GE 6.0 the functionality found in "enterprise edition" is now the default behavior.

Sun N1GE Only Features

To differentiate the commercial N1GE from the open source Grid Engine product, Sun plans to add value-added modules and features to N1GE that are not available in Grid Engine.

The first available module is a SQL based reporting, monitoring and accounting back-end with a web based query and report generation console. This system is being pitched to cluster operators as a very useful management aid but is also targeted towards organizations wishing to turn their clusters into a measurable "cost center" via accurate reporting of usage and utilization broken down by department, project, user etc. Similar arguments are being made to resellers and third parties interested in extending the mechanism to create interesting new services based on a "utility billing" model. N1GE will also be fully integrated into Sun N1 provisioning and management products such as Sun Web Console, etc.

It is also possible (noted above) that the rumored forthcoming support for Windows platforms may be a N1GE-only value added feature.

DETAIL

Cluster Queues

Grid Engine 6 now supports the concept of queues associated with multiple hosts or hostgroups. This is similar to how LSF and PBS behave and has long been a source of initial confusion to new Grid Engine users.

Additional Material and Detail:

'qmon' GUI Screenshots from N1GE 6beta2:

 
 

1

 
2
 
3

Click on any of the thumbnail images for a larger version

Image 1 shows how the previous 'iconic' views of queue states in previous versions of Grid Engine have been replaced by a tabular listing of 'queue instances'

Image 2 shows a 2nd tab in the queue configuration tool that lists tabular information about 'cluster queues'

Image 3 shows how the host configuration tool has been changed to support the concept of host groups

New Terminology Introduced:

'queue instance' -- this is how a Grid Engine 5.3 or "classic queue" is referred to

'cluster queue' -- a set of queue instances (effectively a queue that spans multiple hosts)

'cluster domain' -- all queue instances of a cluster queue in which the hosts belong to the same hostgroup

What it means:

The "classic" concept of queues (now called 'queue instances') ARE NOT GOING AWAY, in fact they are still critical to Grid Engine configuration and operation. Every execution host in the cluster is still going to run one or more 'queue instances'. The new cluster queues represent an attempt to simplify Grid Engine configuration and management by allowing for easy grouping of similar queues. Users can choose to submit jobs to certain queues or they can continue doing what they did in the past -- describing the resources they need for their job to be successful and letting the Grid Engine scheduler find and allocate the best possible queue ('queue instance' now).

What it does not mean:

Cluster queues in Grid Engine 6.0 are not 100% identical to what LSF or PBS calls a 'queue'. You will not be setting up 'default', 'high-priority', 'low-priority' cluster queues to manage resource allocation by controlling who has acccess to what cluster queue and what priority the cluster queue then assigns the pending jobs-- the best way to do that stuff is still via the various Grid Engine policy mechanisms.

How it was implemented:

The developers appear to have sensibly extended the existing queue mechanisms within Grid Engine so that a 'queue' is no longer just associated with a single host system. What was once a queue is now a "queue instance".

{Paraphrasing from the Cluster Queues Workshop Presentation here...}

Step 1 involved extending the concept of a 'queue' to cover a hostlist rather than just a hostname:

qname all.q
hostlist @allhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
qtype BATCH INTERACTIVE
ckpt_list NONE
rerun FALSE
slots 1,[compute1.private.sonsorol.net=1], \
[compute2.private.sonsorol.net=1], \
[compute3.private.sonsorol.net=1], \
[compute4.private.sonsorol.net=1]

tmpdir /tmp
shell /bin/csh
<...truncated...>

 

Step 2 involved making sure that queue level attributes could be extended to cover the fact that more than one host is now involved:

qname big
hostlist compute1 compute2 compute3 compute4
seq_no 0,[compute3=1],[compute4=1],[compute2=2],[compute1=2]
load_thresholds NONE
suspend_thresholds NONE
...

Step 3 was creating a entirely new Grid Engine object known as a "hostgroup"

group_name @transmeta_blades
hostlist compute1 compute2

group_name @pentium_blades
hostlist compute3 compute4

qname big
hostlist @transmeta_blades @pentium_blades
seq_no 0,[@pentium_blades=1],[@transmeta_blades=2]
load_thresholds NONE
suspend_thresholds NONE

Example useage:

qsub ./job.sh

qsub -q myQueue ./job.sh

Example usage with hostgroups:

qsub -q myQueue@@bladeServers ./job.sh

qsub -q myQueue@node23 ./myJob.sh

It appears that the use of cluster queues and hostgroups is now the preferred default behavior rather than the 5.3 method of having a queue associated with each host. Host-specific queues are now recommended "for different large SMP hosts".

Users can submit jobs either to:

  1. A "cluster queue" (which has its own assocated hostlist/hostgroup object)
  2. A "queue instance" (queue associated with a single host)
  3. A hostgroup (which will have one or more available queues associated with it)
  4. Nothing -- "qsub ./myJob.sh" still works and Grid Engine will still work to find and allocate the best 'queue instance'

 

BerkeleyDB Spooling

Additional material:
Sun Grid Engine 2003 workshop presentation

In general the use of BerkeleyDB is aimed at increasing both the performaance of grid engine as well as raising the scaling ceiling. Grid Engine 5.3 generally is considered to have a scaling ceiling of around 2000 CPUs at max. The new goals for Grid Engine 6.0 and beyond are to support 10,000+ hosts etc.

"Classic mode" does local spooling to text files, all data is ASCII text except for job data which is stored in binary form. It appears that the local file spooling has been changed/optimized between 5.3 and 6.0.

Classic mode file spooling across a NFS share allows for cluster qmaster failover between master hosts and shadow master hosts at the expense of "performance". Spooling to a NFS filesystem is ok for small clusters or systems that do not have massive job throughput requirements but in general it is considered "slow" for large or very high throughput systems.

A new spooling option method in Grid Engine 6.0 allows for the use of BerkeleyDB mechanisms to spool data to local disk. This has scaling and performance advantages. Apparently the stored spool data is structured much like the SGE internals.

One of the disadvantages to using BerkeleyDB for local spooling is that it must be done to local disk, it cannot be used across an NFS mounted filesystem. This is because "...Berkeley DB uses mechanisms like file locking and memory mapping to model its locking and transaction mechanism. Therefore a Berkeley DB cannot be accessed through NFS. All database access has to be done on a local filesystem..."

A new spooling option method in Grid Engine 6.0 allows for a client-server approah to BerkeleyDB spooling. In this scenario the qmaster uses a network RPC-based mechanism to send data to a remote RPC-server host. The advantage to this approach is, (1) All the performace and scaling gains of using BerkeleyDB instead of flatfiles, (2) Shadow master failover hosts do not need any sort of shared filesystem anymore (a big win). There are potentially significant drawbacks to this approach as the RPC mechanism has been described as having "no security at all".

Personal Commentary:

The transition to BerkeleyDB will aid performance and increase the overall scaling headroom of Grid Engine. It even lays the framework for a very nice failover and resiliancy system once Grid Engine can leverage the BerkeleyDB community tools to gain features such as the ability to safely replicate the Berkeley spooldb to waiting shadow master hosts.

However, there is still a HUGE hole in the overall fault tolerance and resiliancy mechanisms as they can be used in Grid Engine 6. People who set up remote RPC spooling are just moving their single point of failure from the qmaster host to the RPC server host (since there can only be 1 RPC host currently). People who choose spooling with BerkeleyDB to local disk loose the ability to have waiting shadow master(s) sharing an NFS mount.

At this point in time it seems that the only way to leverage the improvements that BerkeleyDB spooling brings while still providing a high availbility (HA) grid engine cluster is to run the Grid Engine qmaster with local BerkeleyDB spooling on a purpose-built clustered HA server. All of the 'big' vendors like Sun, IBM and HP already offer special hardware configurations and software stacks that allow multiple servers to act as a single ultra-reliable HA server host.

Just to be clear - nobody is crippling Grid Engine just to boost hardware HA sales. This issue is just an artifact of migrating to BerkeleyDB before all of the 'bells and whistles' such as spooldb replication are all in place. Expect these to appear in future releases.

People who can't afford a hardware HA solution for their qmaster machine are left with two choices -- give up some performance by dropping berkeleyDB entirely and using 'classic spooling' to local files that can be shared with backup shadow masters via NFS.

The 2nd choice involves using BerkeleyDB spooling to local disk on the qmaster node and coming to terms with the fact that this is going to be a potential single point of failure. Personally, this is the option I am probably going to be using in my work. Very few of the 5.3 based SGE clusters I've worked on even have shadow masters configured so this is not much of a problem for me moving forward.

The clusters I tend to build are used for discovery research purposes and all of the operators and managers understand that sometimes downtime will occur. A few hours of downtime (or even a day or more) to recover from a failed qmaster node is acceptable to everyone once the true financial costs of switching to a clustered-HA head node are explained. Spending 5-figure sums (or more) going from 99% uptime to 99.99% often can not be justified for experimental systems where every dollar spent on HA means less money for software, licenses or additional compute power.

Your experience may vary, especially across different industries and disciplines...

Scheduling Changes

The scheduler now supports resource reservations for any sort of resource including software licenses, etc.

The scheduler can now backfill to increase utilization, this is especially useful in reservation scheduling scenarios where scheduling to support reserved resources can result in 'gaps' or underutlized resources.

The use of the "-p" switch to indicate a job priority value has been greatly enhanced. In Grid Engine 5.3 and earlier versions the use of priority values applied only as a simple job ordering mechanism relative to other pending jobs . In Grid Engine 6.0 and later the useage of priority values has been unified and now can work in the context of ranking job order within the more advanced functional share and share tree policies. This is done via the "-js" argument.

The man page explains more:

...
In case of the Share Tree Policy, users can distribute the tickets to which they are currently entitled, among their jobs using different shares assigned via -js. If all jobs have the same job share value, the tickets are distributed evenly. Otherwise, jobs receive tickets relative to the different job shares. Job shares are treated like an additional level in the share tree in the latter case.


In connection with the Functional Policy, the job share can be used to weight jobs within the functional job category. Tickets are distributed relative to any uneven job share distribution treated as a virtual share distribution level underneath the functional job category. If both the Share Tree and the Functional Policy are active, the job shares will have an effect in both policies and the tickets independently derived in each of them are added up to the total number of tickets for each job.
...


Algorithms

Existing (in 5.3) algorithims:

  • Priority
  • Share Tree Policy
  • Functional Policy
  • Override Policy
  • Deadline

New in Grid Engine / Sun N1GE 6.0

  • Resource reservation & backfilling
  • Urgency Policy
  • Improved Priority based scheduling
  • Unified Ticketing

Unified Ticketing

All of the various policies that use 'tickets' (share tree, functional policy and override policy) are combined into a unified policy simply by normalizing each value to arrive at a relative priority. This is a very nice thing as it is easy to enable/disable and tweak each of the ticketing policies. Most people will use Share Tree or Functional Policy. 'Override' seems to be more of an artifact or deprecated system -- not many people seem to use it these days.

Urgency Scheduling

Three urgency related sub-policies (ResourceUrgency, WaitTimeUrgency and DeadlineUrgency) are normalized to sort jobs under the Urgency scheduling mechanism.

Expensive assets (software licenses, big SMP servers) can be assigned a higher static ResourceUrgency value to ensure they are used as frequently as possible. Any jobs requesting this resource will 'inherit' the ResourceUrgency value of the expensive asset and will rise accordingly in priority. Very resource hungry jobs can also be assigned ResourceUrgency values.

WaitUrgency increases as jobs pend waiting for execution. This can be used to prevent very low-priority jobs from being 'starved' or forgotten in a high volume cluster. Can also be used to ensure a steady stream of low priority of jobs are always getting some execution time. I suspect that some groups of Grid Engine users will find this very valuable.

DeadlineUrgency increases as a preset job dispatch deadline approaches. Jobs with approaching dispatch deadlines will rise in priority.

Wildcards and Boolean operators can now be used in resource request strings

  • "*" matches any char and any number of chars
  • "?" matches any single character except null
  • "[xx]" matches an array or range of allowed characters
  • "|" logical OR operator


Example:

qsub -hard -l "arch=lx24-x86|darwin" ./myJob.sh
{run this job only on a linux OR an Apple system }

Other details, fixes, features and functional changes I've been able to document

  1. A default cluster queue called "all.q" is created during installation, by default it contains all of the 'cluster instances'.

  2. A default "allhosts" hostgroup is created during installation, by default it contains all configured execution hosts.

  3. It is very interesting to see now that three different scheduler tuning options are presented during install. This is useful for helping people understand that Grid Engine (and the scheduler in particular) can be configured/tuned to support different usage types. This is how the selectable default profiles are presented to the user:


1) Normal -- Fixed interval scheduling, report scheduling information, actual + assumed load
2) High -- Fixed interval scheduling, report limited scheduling information, actual load
3) Max -- Scheduling on demand, report no scheduling information, actual load

The table below lists the specific Grid Engine tuning parameters that are changed for each install-time profile:

Grid Engine Parameter
Normal
High
Max
job_load_adjustments
np_load_avg=0.5
none
none
load_adjustment_decay_time
00:07:30
00:00:00
00:00:00
schedd_job_info
TRUE
FALSE
FALSE
schedule_interval
00:00:15
00:00:15
00:02:00
flush_submit_sec
0
0
4
flush_finish_sec
0
0
4
report_pjob_tickets
TRUE
TRUE
FALSE

Personal commentary:

None of these preconfigured settings are 100% perfect for the Grid Engine clusters I commonly see. I generally start with tuning settings that are similar to what is shown in the "Normal" column with the exception that I always tend to set $np_load_avg and $load_adjustment_decay_time to zero. Artifically inflating load and decaying it over time is a mechanism used to prevent overloading execution hosts configured with more job slots than CPUs -- the artificial short term boost to to np_load_average is great for making sure the node does not get slammed with too many jobs in too short a time period. However, this setup is almost never seen in the systems I work on which generally always have queue instance job slots set equal to the number of available system CPUs.

I can do away with artificial load adjustments which reduces some Grid Engine overhead. I do like the "extra" reporting information and tend to keep these configured in all but the most highly loaded Grid Engine systems.

The "Max" settings look interesting and are something I need to experiment with. By setting the shedule_interval to 2 minutes (effectively infinite) and making the flush_submit_sec and flush_finish_sec values set to 4 seconds Grid Engine is essentially configured for on-demand scheduling. The system will run a scheduler cycle 4 seconds after any job is submitted and 4 seconds after any job exits. Could be interesting but I need more exposure to see how this compares to just tightening up the value of $schedule_interval itself (we have set it as low as 2 seconds in some systems optimized for servicing qrsh queries coming from a webserver process).

 

  1. The single 'rcsge' startup script seen in Grid Engine 5.3 has been replaced by 2 scripts

    $SGE_ROOT/$SGE_CELL/common/sgemaster (qmaster and scheduler machine)
    $SGE_ROOT/$SGE_CELL/common/sgeexecd (cluster execution hosts)

  2. The 'qstat' query program has several new command-line arguments and features
  • [--explain] displays the reasons for 'ambigious' queue error states. The queue state of 'ambigious' is new in GE 6

  • [-qs {a|c|d|o|s|u|A|C|D|E|S}] can select for queues only in a given state

  • [-urg] displays job urgency information (urgency is a new scheduling algorithm type in GE 6)

  • [-pri] display job priority information

  • Option [-g] supports new output modifiers

    • [-g {c}] displays cluster queue summary
    • [-g {t}] displays all parallel job tasks (do not group)

  • [-xml] switch produces parsable XML status output, can be used with any and all other qstat related commandline arguments!
    supports multiple defined "output formats" to customize the range of info reported

Example 1:

# qstat -f


[root@bladebox n1ge]# qstat -f
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@compute1.private.sonsoro BIP   0/1       0.01     lx24-x86      
----------------------------------------------------------------------------
all.q@compute2.private.sonsoro BIP   0/1       0.02     lx24-x86      
----------------------------------------------------------------------------
all.q@compute3.private.sonsoro BIP   0/1       0.00     lx24-x86      
----------------------------------------------------------------------------
all.q@compute4.private.sonsoro BIP   0/1       0.00     lx24-x86      
----------------------------------------------------------------------------
pentium@compute3.private.sonso BIP   0/1       0.00     lx24-x86      
----------------------------------------------------------------------------
pentium@compute4.private.sonso BIP   0/1       0.00     lx24-x86      
----------------------------------------------------------------------------
transmeta@compute1.private.son BIP   0/1       0.01     lx24-x86      
----------------------------------------------------------------------------
transmeta@compute2.private.son BIP   0/1       0.02     lx24-x86      

  

The new output grouping modifiers that qstat supports are going to be very valuable. For instance in the output for Example 1 shown above, the default output for 'qstat -f' lists all configured 'queue instances'. The following example shows how the output modifier can be used to just show information on the status of 'cluster queues' -- this sort of output is more managable for very large clusters and will also be more familiar to Platform LSF and PBS users.

Example 2:

# qstat -f -g c


[root@bladebox n1ge]# qstat -f -g c
CLUSTER QU LOAD USED AVAIL TOTAL aoACDS cdsuE
-------------------------------------------------------------------------------
all.q 0.00 0 4 4 0 0
pentium 0.00 0 2 2 0 0

transmeta 0.00 0 2 2 0 0

Example 3:

# qstat -f -xml

<?xml version='1.0'?>
<job_info xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<queue_info>
<Queue-List>
<name>testqueue@chrisdag.local</name>
<qtype>BIP</qtype>
<slots-used>0</slots-used>
<slots-total>1</slots-total>
<load_avg>0.34326</load_avg>
<arch>darwin</arch>
</Queue-List>
</queue_info>
<job_info>
</job_info>
</job_info>

 

Job Name strings can be used everywhere now, not just in the context of:

$ qsub -hold_jid <name or JID>

Grid Engine 6 has full support for job names everywhere including qalter, qdel, qstat, etc.

 

  1. Changes in Parallel Environments (PE) and Checkpointing Environments -- they are standalone objects now

In Grid Engine 5.3, checkpoint and PE objects would have internal listings of the queues they were associated with. In Grid Engine 6.0 the reverse is occuring -- the queue intstances and cluster queues themselves are configured with a list of Parallel or Checkpointing environments that they are willing to support.

One side-effect of this is that there is no longer a "Parallel" queue type. Queues can now only be "BATCH" or "INTERACTIVE"

 

  1. Job submission is no longer constrained to scripts. It is possible to 'qsub' or 'qrsh' a binary now

When jobs are submitted via qsub, qrsh or the qmon GUI the default behavior of Grid Engine is to expect some sort of shell job script.

However, in the GUI there is now a "binary" option and on the command line one can submit qsub or qrsh jobs with the "-b y[es]" switch to allow a binary to be directly submitted.

 

  1. STDIN can be specified as an argument to 'qsub' now.

"Complexes" are essentially deprecated -- there is now one unified system (a single system complex) for configuring and tracking resources associated with the global cluster, cluster queues, queue instances and hosts. In Grid Engine 6 it makes more sense now to discuss just plain "resources" or "resource attributes" rather than "complexes". There are no separate Host, Global or Cluster Complex objects in Grid Engine 6.

Sun N1 Grid Engine 6 Only Features


1. Analysis, monitoring and accounting addon module

SWC Screenshots:

 
 

Sun Web Console

  Sun Web Console showing ARCo plugin    

Primary components are a standalone SQL repository for event and utilization data with a web based query and reporting console that speaks to the repository.

The SQL database is referred to as the "accounting and reporting db (ARDB)" and can be accessd via SQL, ODBC or JDBC. It stores both historical data as well as derived data (sums & averages). Running it on a separate hosts means that data mining usage data will not slow down or affect the running cluster system.

The ARDB system uses java program called a "reporting writer" that is loosely coupled to the SGE cluster via a 'reporting file'. The java program periodically reads the reporting file and sends data to the SQL database via JDBC calls. The user console is referred to as the "accounting and reporting console (ARCo)". The reporting interval is user configurable (defaults to 60seconds) and the java program is launched via a script called "sgedbwriter".

The ARCo:

  • Is a Java plugin to the Sun Web Console
  • Can visualize the data stored in the reporting db
  • Users can create simple and advanced SQL queries or choose from "canned" queries
  • Has a commandline report generating interface
  • Data is exportable via CSV or PDF


Some of the query options reported by ARCo include:

  • Hourly Accounting per Project
  • Project target
  • Share Target Project
  • Accounting per Department
  • Accounting per Project
  • Accounting per User
  • Statistics
  • Host Load
  • Average Job Turnaround Time
  • Average Job Wait Time
  • Job Log
  • Number of Jobs completed
  • Queue consumables
  • Share History