Cluster NotesTopic:Understanding the differences between Grid Engine 5.3, 6.0 and Sun N1 Grid Engine 6 (N1GE 6)Important Historical Note:[2008] - The information contained in this document is VERY OUT OF DATE in that it does not mention any Grid Engine version after 6.0 including the quite different SGE 6.1 release (introduction of resource quotas) and SGE 6.2 (massive scaling improvements and sge_schedd becomes a qmaster thread). The only reason this document still exists is becuase peopel still seem to look for it via google etc. The original intent of this document was to describe the major changes in SGE 6.0 at a point in time where there was little other public information available. That time has now long past. People looking for more up to date infromation may want to check out my blog (http://gridengine.info) or the newly deployed Sun Wiki site for Grid Engine (http://wikis.sun.com/display/GridEngine). For more updated information on differences and new features in modern SGE releases, check out DanT's excellent blog at http://blogs.sun.com/templedf/category/Grid - he has some great "Why upgrade?" articles that clearly talk about difference and new features. Disclaimer:I wrote this to facilitate my own work and drew upon many different sources. All mistakes are my own. Author: Chris Dagdigian, dag@sonsorol.org
|
|
2 |
3 |
Click on any of the thumbnail images for a larger version
Image 1 shows how the previous 'iconic' views of queue states in previous versions of Grid Engine have been replaced by a tabular listing of 'queue instances'
Image 2 shows a 2nd tab in the queue configuration tool that lists tabular information about 'cluster queues'
Image 3 shows how the host configuration tool has been changed to support the concept of host groups
New Terminology Introduced:
'queue instance' -- this is how a Grid Engine 5.3 or "classic queue" is referred to
'cluster queue' -- a set of queue instances (effectively a queue that spans multiple hosts)
'cluster domain' -- all queue instances of a cluster queue in which the hosts belong to the same hostgroup
What it means:
The "classic" concept of queues (now called 'queue instances') ARE NOT GOING AWAY, in fact they are still critical to Grid Engine configuration and operation. Every execution host in the cluster is still going to run one or more 'queue instances'. The new cluster queues represent an attempt to simplify Grid Engine configuration and management by allowing for easy grouping of similar queues. Users can choose to submit jobs to certain queues or they can continue doing what they did in the past -- describing the resources they need for their job to be successful and letting the Grid Engine scheduler find and allocate the best possible queue ('queue instance' now).
What it does not mean:
Cluster queues in Grid Engine 6.0 are not 100% identical to what LSF or PBS calls a 'queue'. You will not be setting up 'default', 'high-priority', 'low-priority' cluster queues to manage resource allocation by controlling who has acccess to what cluster queue and what priority the cluster queue then assigns the pending jobs-- the best way to do that stuff is still via the various Grid Engine policy mechanisms.
How it was implemented:
The developers appear to have sensibly extended the existing queue mechanisms within Grid Engine so that a 'queue' is no longer just associated with a single host system. What was once a queue is now a "queue instance".
{Paraphrasing from the Cluster Queues Workshop Presentation here...}
Step 1 involved extending the concept of a 'queue' to cover a hostlist rather than just a hostname:
qname all.q
hostlist @allhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
qtype BATCH INTERACTIVE
ckpt_list NONE
rerun FALSE
slots 1,[compute1.private.sonsorol.net=1], \
[compute2.private.sonsorol.net=1], \
[compute3.private.sonsorol.net=1], \
[compute4.private.sonsorol.net=1]
tmpdir /tmp
shell /bin/csh
<...truncated...>
Step 2 involved making sure that queue level attributes could be extended to cover the fact that more than one host is now involved:
qname big
hostlist compute1 compute2 compute3 compute4
seq_no 0,[compute3=1],[compute4=1],[compute2=2],[compute1=2]
load_thresholds NONE
suspend_thresholds NONE
...Step 3 was creating a entirely new Grid Engine object known as a "hostgroup"
group_name @transmeta_blades
hostlist compute1 compute2
group_name @pentium_blades
hostlist compute3 compute4qname big
hostlist @transmeta_blades @pentium_blades
seq_no 0,[@pentium_blades=1],[@transmeta_blades=2]
load_thresholds NONE
suspend_thresholds NONE
Example useage:
qsub ./job.sh
qsub -q myQueue ./job.shExample usage with hostgroups:
qsub -q myQueue@@bladeServers ./job.sh
qsub -q myQueue@node23 ./myJob.shIt appears that the use of cluster queues and hostgroups is now the preferred default behavior rather than the 5.3 method of having a queue associated with each host. Host-specific queues are now recommended "for different large SMP hosts".
Users can submit jobs either to:
- A "cluster queue" (which has its own assocated hostlist/hostgroup object)
- A "queue instance" (queue associated with a single host)
- A hostgroup (which will have one or more available queues associated with it)
- Nothing -- "qsub ./myJob.sh" still works and Grid Engine will still work to find and allocate the best 'queue instance'
Additional material:
Sun Grid Engine 2003 workshop
presentation
In general the use of BerkeleyDB is aimed at increasing both the performaance of grid engine as well as raising the scaling ceiling. Grid Engine 5.3 generally is considered to have a scaling ceiling of around 2000 CPUs at max. The new goals for Grid Engine 6.0 and beyond are to support 10,000+ hosts etc.
"Classic mode" does local spooling to text files, all data is ASCII text except for job data which is stored in binary form. It appears that the local file spooling has been changed/optimized between 5.3 and 6.0.
Classic mode file spooling across a NFS share allows for cluster qmaster failover between master hosts and shadow master hosts at the expense of "performance". Spooling to a NFS filesystem is ok for small clusters or systems that do not have massive job throughput requirements but in general it is considered "slow" for large or very high throughput systems.
A new spooling option method in Grid Engine 6.0 allows for the use of BerkeleyDB mechanisms to spool data to local disk. This has scaling and performance advantages. Apparently the stored spool data is structured much like the SGE internals.
One of the disadvantages to using BerkeleyDB for local spooling is that it must be done to local disk, it cannot be used across an NFS mounted filesystem. This is because "...Berkeley DB uses mechanisms like file locking and memory mapping to model its locking and transaction mechanism. Therefore a Berkeley DB cannot be accessed through NFS. All database access has to be done on a local filesystem..."
A new spooling option method in Grid Engine 6.0 allows for a client-server approah to BerkeleyDB spooling. In this scenario the qmaster uses a network RPC-based mechanism to send data to a remote RPC-server host. The advantage to this approach is, (1) All the performace and scaling gains of using BerkeleyDB instead of flatfiles, (2) Shadow master failover hosts do not need any sort of shared filesystem anymore (a big win). There are potentially significant drawbacks to this approach as the RPC mechanism has been described as having "no security at all".
The transition to BerkeleyDB will aid performance and increase the overall scaling headroom of Grid Engine. It even lays the framework for a very nice failover and resiliancy system once Grid Engine can leverage the BerkeleyDB community tools to gain features such as the ability to safely replicate the Berkeley spooldb to waiting shadow master hosts.
However, there is still a HUGE hole in the overall fault tolerance and resiliancy mechanisms as they can be used in Grid Engine 6. People who set up remote RPC spooling are just moving their single point of failure from the qmaster host to the RPC server host (since there can only be 1 RPC host currently). People who choose spooling with BerkeleyDB to local disk loose the ability to have waiting shadow master(s) sharing an NFS mount.
At this point in time it seems that the only way to leverage the improvements that BerkeleyDB spooling brings while still providing a high availbility (HA) grid engine cluster is to run the Grid Engine qmaster with local BerkeleyDB spooling on a purpose-built clustered HA server. All of the 'big' vendors like Sun, IBM and HP already offer special hardware configurations and software stacks that allow multiple servers to act as a single ultra-reliable HA server host.
Just to be clear - nobody is crippling Grid Engine just to boost hardware HA sales. This issue is just an artifact of migrating to BerkeleyDB before all of the 'bells and whistles' such as spooldb replication are all in place. Expect these to appear in future releases.
People who can't afford a hardware HA solution for their qmaster machine are left with two choices -- give up some performance by dropping berkeleyDB entirely and using 'classic spooling' to local files that can be shared with backup shadow masters via NFS.
The 2nd choice involves using BerkeleyDB spooling to local disk on the qmaster node and coming to terms with the fact that this is going to be a potential single point of failure. Personally, this is the option I am probably going to be using in my work. Very few of the 5.3 based SGE clusters I've worked on even have shadow masters configured so this is not much of a problem for me moving forward.
The clusters I tend to build are used for discovery research purposes and all of the operators and managers understand that sometimes downtime will occur. A few hours of downtime (or even a day or more) to recover from a failed qmaster node is acceptable to everyone once the true financial costs of switching to a clustered-HA head node are explained. Spending 5-figure sums (or more) going from 99% uptime to 99.99% often can not be justified for experimental systems where every dollar spent on HA means less money for software, licenses or additional compute power.
Your experience may vary, especially across different industries and disciplines...
The scheduler now supports resource reservations for any sort of resource including software licenses, etc.
The scheduler can now backfill to increase utilization, this is especially useful in reservation scheduling scenarios where scheduling to support reserved resources can result in 'gaps' or underutlized resources.
The use of the "-p" switch to indicate a job priority value has been greatly enhanced. In Grid Engine 5.3 and earlier versions the use of priority values applied only as a simple job ordering mechanism relative to other pending jobs . In Grid Engine 6.0 and later the useage of priority values has been unified and now can work in the context of ranking job order within the more advanced functional share and share tree policies. This is done via the "-js" argument.
The man page explains more:
...
In case of the Share Tree Policy, users can distribute the tickets to which
they are currently entitled, among their jobs using different shares assigned
via -js. If all jobs have the same job share value, the
tickets are distributed evenly. Otherwise, jobs receive tickets relative
to the different job shares. Job shares are treated like an additional level
in the share tree in the latter case.
In connection with the Functional Policy, the job share can be used to weight
jobs within the functional job category. Tickets are distributed relative
to any uneven job share distribution treated as a virtual share distribution
level underneath the functional job category. If both the Share Tree and
the Functional Policy are active, the job shares will have an effect in
both policies and the tickets independently derived in each of them are
added up to the total number of tickets for each job.
...
AlgorithmsExisting (in 5.3) algorithims:
- Priority
- Share Tree Policy
- Functional Policy
- Override Policy
- Deadline
New in Grid Engine / Sun N1GE 6.0
- Resource reservation & backfilling
- Urgency Policy
- Improved Priority based scheduling
- Unified Ticketing
Unified Ticketing
All of the various policies that use 'tickets' (share tree, functional policy and override policy) are combined into a unified policy simply by normalizing each value to arrive at a relative priority. This is a very nice thing as it is easy to enable/disable and tweak each of the ticketing policies. Most people will use Share Tree or Functional Policy. 'Override' seems to be more of an artifact or deprecated system -- not many people seem to use it these days.
Urgency Scheduling
Three urgency related sub-policies (ResourceUrgency, WaitTimeUrgency and DeadlineUrgency) are normalized to sort jobs under the Urgency scheduling mechanism.
Expensive assets (software licenses, big SMP servers) can be assigned a higher static ResourceUrgency value to ensure they are used as frequently as possible. Any jobs requesting this resource will 'inherit' the ResourceUrgency value of the expensive asset and will rise accordingly in priority. Very resource hungry jobs can also be assigned ResourceUrgency values.
WaitUrgency increases as jobs pend waiting for execution. This can be used to prevent very low-priority jobs from being 'starved' or forgotten in a high volume cluster. Can also be used to ensure a steady stream of low priority of jobs are always getting some execution time. I suspect that some groups of Grid Engine users will find this very valuable.
DeadlineUrgency increases as a preset job dispatch deadline approaches. Jobs with approaching dispatch deadlines will rise in priority.
Example:qsub -hard -l "arch=lx24-x86|darwin" ./myJob.sh
{run this job only on a linux OR an Apple system }
A default cluster queue called "all.q" is created during installation, by default it contains all of the 'cluster instances'.
A default "allhosts" hostgroup is created during installation, by default it contains all configured execution hosts.
It is very interesting to see now that three different scheduler tuning options are presented during install. This is useful for helping people understand that Grid Engine (and the scheduler in particular) can be configured/tuned to support different usage types. This is how the selectable default profiles are presented to the user:
1) Normal -- Fixed interval scheduling, report scheduling information, actual + assumed load
2) High -- Fixed interval scheduling, report limited scheduling information, actual load
3) Max -- Scheduling on demand, report no scheduling information, actual loadThe table below lists the specific Grid Engine tuning parameters that are changed for each install-time profile:
Grid Engine Parameter Normal High Maxjob_load_adjustments np_load_avg=0.5 none noneload_adjustment_decay_time 00:07:30 00:00:00 00:00:00schedd_job_info TRUE FALSE FALSEschedule_interval 00:00:15 00:00:15 00:02:00flush_submit_sec 0 0 4flush_finish_sec 0 0 4report_pjob_tickets TRUE TRUE FALSE
None of these preconfigured settings are 100% perfect for the Grid Engine clusters I commonly see. I generally start with tuning settings that are similar to what is shown in the "Normal" column with the exception that I always tend to set $np_load_avg and $load_adjustment_decay_time to zero. Artifically inflating load and decaying it over time is a mechanism used to prevent overloading execution hosts configured with more job slots than CPUs -- the artificial short term boost to to np_load_average is great for making sure the node does not get slammed with too many jobs in too short a time period. However, this setup is almost never seen in the systems I work on which generally always have queue instance job slots set equal to the number of available system CPUs.
I can do away with artificial load adjustments which reduces some Grid Engine overhead. I do like the "extra" reporting information and tend to keep these configured in all but the most highly loaded Grid Engine systems.
The "Max" settings look interesting and are something I need to experiment with. By setting the shedule_interval to 2 minutes (effectively infinite) and making the flush_submit_sec and flush_finish_sec values set to 4 seconds Grid Engine is essentially configured for on-demand scheduling. The system will run a scheduler cycle 4 seconds after any job is submitted and 4 seconds after any job exits. Could be interesting but I need more exposure to see how this compares to just tightening up the value of $schedule_interval itself (we have set it as low as 2 seconds in some systems optimized for servicing qrsh queries coming from a webserver process).
$SGE_ROOT/$SGE_CELL/common/sgemaster (qmaster and scheduler machine)
$SGE_ROOT/$SGE_CELL/common/sgeexecd (cluster execution hosts)
- [--explain] displays the reasons for 'ambigious' queue error states. The queue state of 'ambigious' is new in GE 6
- [-qs {a|c|d|o|s|u|A|C|D|E|S}] can select for queues only in a given state
- [-urg] displays job urgency information (urgency is a new scheduling algorithm type in GE 6)
- [-pri] display job priority information
- Option [-g] supports new output modifiers
- [-g {c}] displays cluster queue summary
- [-g {t}] displays all parallel job tasks (do not group)
- [-xml] switch produces parsable XML status output, can be used with any and all other qstat related commandline arguments!
supports multiple defined "output formats" to customize the range of info reported
Example 1:
# qstat -f
[root@bladebox n1ge]# qstat -f queuename qtype used/tot. load_avg arch states ---------------------------------------------------------------------------- all.q@compute1.private.sonsoro BIP 0/1 0.01 lx24-x86 ---------------------------------------------------------------------------- all.q@compute2.private.sonsoro BIP 0/1 0.02 lx24-x86 ---------------------------------------------------------------------------- all.q@compute3.private.sonsoro BIP 0/1 0.00 lx24-x86 ---------------------------------------------------------------------------- all.q@compute4.private.sonsoro BIP 0/1 0.00 lx24-x86 ---------------------------------------------------------------------------- pentium@compute3.private.sonso BIP 0/1 0.00 lx24-x86 ---------------------------------------------------------------------------- pentium@compute4.private.sonso BIP 0/1 0.00 lx24-x86 ---------------------------------------------------------------------------- transmeta@compute1.private.son BIP 0/1 0.01 lx24-x86 ---------------------------------------------------------------------------- transmeta@compute2.private.son BIP 0/1 0.02 lx24-x86The new output grouping modifiers that qstat supports are going to be very valuable. For instance in the output for Example 1 shown above, the default output for 'qstat -f' lists all configured 'queue instances'. The following example shows how the output modifier can be used to just show information on the status of 'cluster queues' -- this sort of output is more managable for very large clusters and will also be more familiar to Platform LSF and PBS users.
Example 2:
# qstat -f -g c
[root@bladebox n1ge]# qstat -f -g c
CLUSTER QU LOAD USED AVAIL TOTAL aoACDS cdsuE
-------------------------------------------------------------------------------
all.q 0.00 0 4 4 0 0
pentium 0.00 0 2 2 0 0
transmeta 0.00 0 2 2 0 0Example 3:
# qstat -f -xml
<?xml version='1.0'?>
<job_info xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<queue_info>
<Queue-List>
<name>testqueue@chrisdag.local</name>
<qtype>BIP</qtype>
<slots-used>0</slots-used>
<slots-total>1</slots-total>
<load_avg>0.34326</load_avg>
<arch>darwin</arch>
</Queue-List>
</queue_info>
<job_info>
</job_info>
</job_info>
Job Name strings can be used everywhere now, not just in the context of:
$ qsub -hold_jid <name or JID>
Grid Engine 6 has full support for job names everywhere including qalter, qdel, qstat, etc.
In Grid Engine 5.3, checkpoint and PE objects would have internal listings of the queues they were associated with. In Grid Engine 6.0 the reverse is occuring -- the queue intstances and cluster queues themselves are configured with a list of Parallel or Checkpointing environments that they are willing to support.
One side-effect of this is that there is no longer a "Parallel" queue type. Queues can now only be "BATCH" or "INTERACTIVE"
When jobs are submitted via qsub, qrsh or the qmon GUI the default behavior of Grid Engine is to expect some sort of shell job script.
However, in the GUI there is now a "binary" option and on the command line one can submit qsub or qrsh jobs with the "-b y[es]" switch to allow a binary to be directly submitted.
"Complexes" are essentially deprecated -- there is now one unified system (a single system complex) for configuring and tracking resources associated with the global cluster, cluster queues, queue instances and hosts. In Grid Engine 6 it makes more sense now to discuss just plain "resources" or "resource attributes" rather than "complexes". There are no separate Host, Global or Cluster Complex objects in Grid Engine 6.
1. Analysis, monitoring and accounting addon module
SWC Screenshots:
Sun Web Console
Sun Web Console showing ARCo plugin
Primary components are a standalone SQL repository for event and utilization data with a web based query and reporting console that speaks to the repository.
The SQL database is referred to as the "accounting and reporting db (ARDB)" and can be accessd via SQL, ODBC or JDBC. It stores both historical data as well as derived data (sums & averages). Running it on a separate hosts means that data mining usage data will not slow down or affect the running cluster system.
The ARDB system uses java program called a "reporting writer" that
is loosely coupled to the SGE cluster via a 'reporting file'. The java program
periodically reads the reporting file and sends data to the SQL database via
JDBC calls. The user console is referred to as the "accounting and reporting
console (ARCo)". The reporting interval is user configurable (defaults
to 60seconds) and the java program is launched via a script called "sgedbwriter".