Cluster Notes

Topic:

Command-line methods for
Department-Based Resource Allocation
within Grid Engine 6 & Sun N1 Grid Engine

Disclaimer:

I wrote this to facilitate my own work. All mistakes are my own.
Feedback and error corrections are always appreciated.

Author:

Chris Dagdigian, dag@sonsorol.org
BioTeam Inc.
http://bioteam.net

Version:

1.2 -- Updated September 22, 2005
1.1 -- Updated November 16, 2004
The most recent version of this document will always be available at http://bioteam.net/dag/

Revision History:

1.2 -- Comments & feedback from Ryan Thomas incorporated
1.1 -- Comments & feedback from Charu Chaubal and Stephan Grell incorporated
1.0 -- Original version, November 15, 2004

Table of Contents

Preface

Configuration Goals & Resource Allocation Policy Scope

Why a Functional Policy and not a Share-Tree Policy?

Implementation Step by Step

 

Preface

Grid Engine 6 is a distributed resource management (DRM) software layer developed and distributed under an open source license. A commercial version is sold by Sun Microsystems as "N1 Grid Engine 6". The project lives at http://gridengine.sunsource.net.

For more background on Grid Engine, refer to the online documentation collection hosted at http://docs.sun.com/app/docs/coll/1017.3. Readers of this particular document should probably be familiar with the "Managing Policies and the Scheduler" chapter of the Administrator's Guide.

It should also be noted that managing policies is one of the Grid Engine admin tasks that is often easier and more straightforward when done via the graphical "qmon" utility. The author of this document tends to work remotely on clusters via low-bandwidth SSH connections or VPN setups that do not allow X11 traffic. The purpose of this document is to highlight the specific command-line methods, which are sometimes under-documented in the official Grid Engine manuals.

Configuration Goals & Resource Allocation Policy Scope

(1) Create a Functional Share policy using command-line Grid Engine tools to enable resource allocation on a percentage basis between Departments.

(2) When the cluster is idle, anyone and any department can use cluster resources.

(3) When the cluster is busy, Departments get a percentage of available cluster resources.

(4) When contention for resources exists on a busy cluster, running jobs will not be killed or otherwise manipulated. The resource allocation will be done only within the pending job list, by bumping up the priority of pending jobs belonging to departments with higher entitlement. Essentially we can't muck with running jobs because we have no clean way of suspending, checkpointing or moving them.

(5) Users within each department should be considered equal from a resource allocation viewpoint.


Desired cluster resource allocation mix:

unassigned: 18% of cluster resources

Dept_A : 18% of cluster resources

Dept_B : 18% of cluster resources

Dept_C : 11% of cluster resources

Dept_D : 35% of cluster resources
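For reference, once the functional ticket pool is configured later in this document (100,000 tickets), these percentages map directly onto fshare ticket counts. Only Dept_A's value is shown explicitly in the implementation section; the others follow the same percentage-of-pool arithmetic:

unassigned (defaultdepartment):  0.18 x 100,000 = 18,000 tickets
Dept_A:                          0.18 x 100,000 = 18,000 tickets
Dept_B:                          0.18 x 100,000 = 18,000 tickets
Dept_C:                          0.11 x 100,000 = 11,000 tickets
Dept_D:                          0.35 x 100,000 = 35,000 tickets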

Why a Functional Policy and not a Share-Tree Policy?

In an ideal world, share-tree is the policy that most people probably should be using. It remembers past usage and works to average things out so that, over time, entitlements trend back into harmony with the configured policies. Users and groups with little past usage are compensated with higher resource allocation when they start submitting work. Heavy cluster users will find their current entitlements dropping so that under-represented users and groups can get up to speed more rapidly. It works, and it is fair.

Sadly though, even though users and managers understand share-tree when the method is explained to them, they tend to forget these details when they notice their jobs pending in the wait list. Users who have been told to expect a 50% entitlement to cluster resources get frustrated when they launch their jobs and don't get to take over half of the cluster instantly. Explaining to them that the 50% entitlement is a goal that the scheduler works to meet "as averaged over time..." falls upon deaf ears. Heavy users get upset to learn that their current entitlement is being "penalized" because their past usage greatly exceeded their allotted share. Cluster admins then spend far too much time attempting to "prove" to the user community that they are not getting short-changed.

For a cluster administrator, it is often less hassle to dump the share-tree and convert to a functional policy which has no concept or memory of past cluster usage and simply tries to meet resource allocation policies each time a scheduling run is performed. The resource allocation is far more obvious and users can watch the pending list to see how the scheduler bumps jobs up in the queue according to the configured entitlements.

I've given up using share-tree at customer sites and now pretty much use the functional policy exclusively.

Update: Charu Chaubal makes an excellent point about tweaking the halftime scheduler parameter to use the Share-Tree policy without the time based "memory" of past resource utilization:

"...but instead of going to the Functional policy, did you ever consider using Sharetree but just setting the half-life to zero? This effectively would behave like the Functional Policy (no memory or compensation), but it has the advantage that you can still do a tree structure..."

This is an excellent point. The main advantage with Charu's recommendation is that Grid Engine administrators can use the far more expressive and flexible tree structure to define entitlements between different levels (and layers) of departments, projects, userlists and users. The halftime parameter is under-documented in the Grid Engine manpages and official documentation -- setting it to zero to effectively disable it is not going to be problematic.

22 September 2005 - Update: It was pointed out on the Grid Engine users mailing list (and by Ryan Thomas) that the real effect of setting the half-life value to zero is to make the decay time infinite, which means that the SGE scheduler NEVER forgets about past usage (the opposite of what Charu was hoping would happen...). Further discussion on the list centered around the odd fact that half-life only accepts integer values, which makes it hard to set values reflecting a half-life of less than one hour. The full discussion thread is pretty interesting.
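For reference, halftime is expressed in integer hours in the scheduler configuration; the value of 168 visible in the configuration dump later in this document corresponds to a one-week half-life. Given the integer-only limitation discussed above, the closest a Share-Tree setup can get to "no memory of past usage" from the command line would be the smallest non-zero value, i.e. editing the parameter via "qconf -msconf" and setting:

halftime    1

The current value can be checked at any time with "qconf -ssconf | grep halftime".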

Implementation Step by Step

  1. Functional share policy activated within SGE scheduler
  2. 100,000 functional share tickets added to the pool
  3. Algorithm adjusted to make Department membership more important
  4. Algorithm adjusted to make user slightly more important
  5. Algorithm adjusted to make project and job less important
  6. User objects created within Grid Engine matching the given user list.
  7. Assign an arbitrary but equal number of user tickets to each user so that users are treated equally within a department.
  8. Departments created within Grid Engine matching the given list.
  9. Assign tickets to departments in proportion to the total number of available configured tickets.

Steps 1,2: Activate functional share resource allocation policy

The functional share policy is activated by adding tickets to the functional share pool. The pool is defined as weight_tickets_functional in the Grid Engine scheduler configuration.

Run the command:

qconf -msconf

Assign 100000 to the value of weight_tickets_functional.
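Once the scheduler configuration has been saved, the change can be confirmed without re-opening the editor by using the same "qconf -ssconf" command used later in this document to dump the full configuration:

qconf -ssconf | grep weight_tickets_functional

The output should show the value 100000.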

Steps 3,4,5: Adjust algorithm weights for Department and User

The functional share algorithm can assign relative weight or importance values to "user", "project", "department" and "job".

In the default configuration these values are all treated equally. The four weights must sum to 1. The defaults are defined in the scheduler configuration:

    weight_user       0.250000
    weight_project    0.250000
    weight_department 0.250000
    weight_job        0.250000

We want to make "Department" more important than anything else while also slightly raising the importance of "user" because we are going to give out some functional share tickets to users as well (to enforce user equality within a department).


The new values (changed via "qconf -msconf") are:

weight_user        0.200000
weight_project     0.100000
weight_department  0.600000
weight_job         0.100000
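As a quick sanity check (a small sketch, not something Grid Engine itself requires), the four functional weights can be pulled out of the live configuration and summed to confirm they still add up to 1:

qconf -ssconf | awk '$1 ~ /^weight_(user|project|department|job)$/ { sum += $2 } END { print sum }'

With the values above this should print 1.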

 

Update: Stephan Grell pointed out a huge weakness in the suggested configuration if one only adjusts the parameters shown above. By ignoring the other weight_* parameters (weight_ticket, weight_priority, weight_urgency, etc.) we enable a scenario by which a user can use the POSIX Priority policy to bypass the intended resource allocation mix. We need to either disable those mechanisms entirely or make them "less important" within the scheduler than the functional ticketing scheme.

Stephan comments:

"...In your described setting a "qsub -p 1000" or or a "qsub -pe make 10" will invert your fair scheduling policy. If your scheduling should only be based on the functional tickets, you need to set:

weight_ticket             1.0000
weight_waiting_time       0.0000
weight_deadline     3600000.0000
weight_urgency            0.0000
weight_priority           0.0000 
   

If you want to support the posix priority and/or urgency, their weight values have to be a lot smaller than the weight_ticket. Such as:

weight_ticket            10.0000
weight_waiting_time       0.00000
weight_deadline     3600000.00000
weight_urgency            0.01000
weight_priority           0.01000

This allows a user to set the priorities within his jobs and he will not exceed his percentage from the ticket setup. The weight parameters are difficult to handle and can completely compromise the ticket configuration."

Stephan's suggestions have been taken into consideration. Since we want users to be able to use the Priority mechanism to prioritize their own pending jobs we are going to make changes to the scheduler configuration that keep the weight_urgency and weight_priority mechanisms enabled but "less important" overall than the functional ticket policy.

Verified by running the command "qconf -ssconf" to view current config:

algorithm                         default
schedule_interval                 0:0:7
maxujobs                          0
queue_sort_method                 load
job_load_adjustments              np_load_avg=0.50
load_adjustment_decay_time        0:7:30
load_formula                      np_load_avg
schedd_job_info                   true
flush_submit_sec                  0
flush_finish_sec                  0
params                            none
reprioritize_interval             0:0:0
halftime                          168
usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor               5.000000
weight_user                       0.200000
weight_project                    0.100000
weight_department                 0.600000
weight_job                        0.100000
weight_tickets_functional         100000
weight_tickets_share              0
share_override_tickets            TRUE
share_functional_shares           TRUE
max_functional_jobs_to_schedule   200
report_pjob_tickets               TRUE
max_pending_tasks_per_job         50
halflife_decay_list               none
policy_hierarchy                  OFS
weight_ticket                     10.00000
weight_waiting_time               0.000000
weight_deadline                   3600000.000000
weight_urgency                    0.100000
weight_priority                   0.500000
max_reservation                   0
default_duration                  0:10:0 
        

Steps 6,7: Creating users

The command "qconf -auser" is run for each new username. We want to create user entries within Grid Engine where each user has been allocated 100 functional share tickets. Giving the users an equal number of shares should ensure that users are treated equally within Department groups when it comes to resource entitlements.

The default user values are:

name template
oticket 0
fshare 0
delete_time 0
default_project NONE

They need to be changed to:


name <username>
oticket 0
fshare 100
delete_time 0
default_project NONE

I threw together a simple Perl script to automate the process of adding users with 100 functional share tickets. The script writes a template to a temp location and then calls "qconf -Auser /path-to-template"; Grid Engine will read in the file and accept the new settings.

This is the script:

#!/usr/bin/perl
# Create a Grid Engine user object with 100 functional share tickets.
# Usage: ./create-sge-user.pl <username>
use POSIX qw(tmpnam);

my $tmp  = POSIX::tmpnam();                 # temp file for the user template
my $user = shift or die "Usage: $0 <username>\n";

# Write the qconf user template to the temp file
open(TMP, "> $tmp") or die "Cannot write $tmp: $!\n";
print TMP <<EOL;
name $user
oticket 0
fshare 100
delete_time 0
default_project NONE
EOL
close(TMP);

print "User=($user), Configfile=($tmp)\n";

# Load the template into Grid Engine, then remove the temp file
system("qconf -Auser $tmp");
unlink($tmp);


This is what the script looks like when run for several users:

fakehost:~ root# ./create-sge-user.pl userA
User=(userA), Configfile=(/var/tmp/tmp.0.zoTUio)
Creating user:root@fakehost.bioteam.net added "userA" to user list

fakehost:~ root# ./create-sge-user.pl userB
User=(userB), Configfile=(/var/tmp/tmp.0.MYYGB3)
Creating user:root@fakehost.bioteam.net added "userB" to user list

fakehost:~ root# ./create-sge-user.pl userC
User=(userC), Configfile=(/var/tmp/tmp.0.cy4SXR)
Creating user:root@fakehost.bioteam.net added "userC" to user list

fakehost:~ root# ./create-sge-user.pl userD
User=(userD), Configfile=(/var/tmp/tmp.0.YeU83Q)
Creating user:root@fakehost.bioteam.net added "userD" to user list
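The newly created user objects can be double-checked from the command line: "qconf -suserl" lists all configured Grid Engine user objects, and "qconf -suser <username>" shows an individual user's settings (the fshare value reported for each user should be 100).

qconf -suserl
qconf -suser userA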

 

Steps 8,9: Creating and defining Department lists

Within Grid Engine, Departments are treated as userlists, similar to access control lists. To create a new userlist of type department one would do:


qconf -mu <department>

For our example department "Dept_A":

qconf -mu Dept_A

And we set the values to:

name    Dept_A
type    DEPT
fshare  18000
oticket 0
entries userA
 

The important values are:

"type" -- need to make this a DEPT rather than ACL object
"fshare" -- 18000 is 18% of the 100,000 available functional share tickets
"entries" -- userA is the first configured member of the department named "Dept_A". Additional usernames are comma-seperated.


To show the contents of a departmental user list:

fakehost:~ root# qconf -su Dept_A
name Dept_A
type DEPT
fshare 18000
oticket 0
entries userA

To show a list of all userset objects:

cat:~ root# qconf -sul
deadlineusers
defaultdepartment
Dept_A
Dept_B
Dept_C
Dept_D

Note that the configuration goals called for roughly 18% of cluster resources to remain unassigned and available for general use.

This is what the pre-existing Department object "defaultdepartment" is for. Any user not assigned to a given Department will be considered, for scheduling purposes, to be a member of the "defaultdepartment" group.
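The configuration goals can therefore be completed by giving defaultdepartment its own 18% ticket allocation. This final step is not shown in the transcript above; a sketch, assuming defaultdepartment accepts an fshare value just like any other DEPT userset, would be to run "qconf -mu defaultdepartment" and set:

name    defaultdepartment
type    DEPT
fshare  18000
oticket 0
entries NONE

Note that no usernames are listed -- any user who is not a member of another department is placed into defaultdepartment automatically.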

Although this document deals with the command-line methods for manipulating the Department-based Functional Share policy, a screenshot is also available showing what these settings look like when viewed via the graphical 'qmon' program. The screenshot is quite large (~ 324KB) and is linked from the HTML version of this document.