I wrote this to facilitate my own work. All mistakes are my own.
Feedback and error corrections are always appreciated.
Grid Engine 6 is a distributed resource management (DRM) software layer developed
and distributed under an open source license. A commercial version is sold
by Sun Microsystems as "N1 Grid Engine 6". The project lives at http://gridengine.sunsource.net.
It should also be noted that managing policies is one of the Grid Engine
admin tasks that is often easier and more straightforward when done via the
graphical "qmon" utility. The author of this document tends to work
remotely on clusters via low-bandwidth SSH connections or VPN setups that
do not allow X11 traffic. The purpose of this document is to highlight the
specific command-line methods, which are sometimes under-documented in the
official Grid Engine manuals.
Configuration Goals & Resource Allocation Policy
Scope
(1) Create a Functional Share policy using command-line Grid Engine tools
to enable resource allocation on a percentage basis between Departments.
(2) When the cluster is idle, anyone and any department can use cluster resources.
(3) When the cluster is busy, Departments get a percentage of available cluster
resources.
(4) When contention for resources exists on a busy cluster, running jobs
will not be killed or otherwise manipulated. The resource allocation will
be done only within the pending job list, by bumping up the priority
of pending jobs belonging to departments with higher entitlements.
Essentially we can't muck with running jobs because we have no clean way of
suspending, checkpointing or moving them.
(5) Users within each department should be considered equal from a resource
allocation viewpoint.
Desired cluster resource allocation mix:
unassigned: 18% of cluster resources
Dept_A : 18% of cluster resources
Dept_B : 18% of cluster resources
Dept_C : 11% of cluster resources
Dept_D : 35% of cluster resources
In an ideal world, share-tree is the policy that most people probably should
be using. It nicely remembers past usage and works to average things out so
that, over time, entitlements trend back into harmony with the configured
policies. Users and groups with little past usage are compensated
with higher resource allocation when they start submitting work. Heavy cluster
users will find their current entitlements dropping so the under-represented
users and groups can get up to speed more rapidly. It works, and it is fair.
Sadly though, even though users and managers understand share-tree when the
method is explained to them, they tend to forget these details when they notice
their jobs pending in the wait list. Users who have been told to expect a
50% entitlement to cluster resources get frustrated when they launch their
jobs and don't get to take over half of the cluster instantly. Explaining
to them that the 50% entitlement is a goal that the scheduler is working to
meet "as averaged over time..." falls upon deaf ears. Heavy
users get upset to learn that their current entitlement is being "penalized"
because their past usage greatly exceeded their allotted share. Cluster admins
then spend far too much time attempting to "prove" to the user community
that they are not getting shortchanged.
For a cluster administrator, it is often less hassle to dump the share-tree
and convert to a functional policy which has no concept or memory of past
cluster usage and simply tries to meet resource allocation policies each time
a scheduling run is performed. The resource allocation is far more obvious
and users can watch the pending list to see how the scheduler bumps jobs up
in the queue according to the configured entitlements.
I've given up using share-tree at customer sites and now pretty much use
the functional policy exclusively.
Implementation Step by Step
- Functional share policy activated within the SGE scheduler
- 100,000 functional share tickets added to the pool
- Algorithm adjusted to make Department membership more important
- Algorithm adjusted to make user slightly more important
- Algorithm adjusted to make project and job less important
- User objects created within Grid Engine matching the given user list
- An arbitrary but equal number of user tickets assigned to each user so
they are each treated equally within their department
- Departments created within Grid Engine matching the given list
- Tickets assigned to departments in proportion to the total number
of available configured tickets
Steps 1,2: Activate functional share resource allocation policy
The functional share policy is activated by adding tickets to the functional
share pool. The pool is defined as weight_tickets_functional
in the Grid Engine scheduler configuration.
Run the command:
qconf -msconf
Assign 100000 as the value of weight_tickets_functional.
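For scripted installs, the same change can be made non-interactively by
exporting the scheduler configuration to a file, editing it, and loading it
back with "qconf -Msconf". A minimal sketch, assuming GNU sed and an
arbitrary temp file path:
# Export the current scheduler configuration to a temp file
qconf -ssconf > /tmp/sconf.$$
# Set the functional ticket pool to 100,000
sed -i 's/^weight_tickets_functional .*/weight_tickets_functional 100000/' /tmp/sconf.$$
# Load the modified configuration back into the scheduler
qconf -Msconf /tmp/sconf.$$
rm /tmp/sconf.$$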
Steps 3,4,5: Adjust algorithm weights for Department and User
The functional share algorithm can assign relative weight or importance values
to "user", "project", "department" and "job".
In the default configuration these values are all treated equally. The sum
of these four weights must add up to 1. The defaults are defined
in the scheduler configuration:
weight_user 0.250000
weight_project 0.250000
weight_department 0.250000
weight_job 0.250000
We want to make "Department" more important than anything else
while also slightly raising the importance of "user" because we
are going to give out some functional share tickets to users as well (to enforce
user equality within a department).
The new values (changed via "qconf -msconf") are:
weight_user 0.200000
weight_project 0.100000
weight_department 0.600000
weight_job 0.100000
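Once the new values are saved, a quick sanity check is to filter the
scheduler configuration and confirm the four weights still sum to 1:
qconf -ssconf | grep -E 'weight_(user|project|department|job) '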
Update: Stephan Grell pointed out a huge weakness in the
suggested configuration if one only adjusts the parameters shown above.
By ignoring the other weight_*
parameters (weight_ticket, weight_priority,
weight_urgency, etc.) we enable a scenario in which a user can use
the POSIX Priority policy to bypass the intended resource allocation mix.
We need to either disable those mechanisms entirely or make them "less
important" within the scheduler than the functional ticketing scheme.
Stephan comments:
"...In your described setting a "qsub
-p 1000" or or a "qsub
-pe make 10" will invert your fair scheduling policy. If your
scheduling should only be based on the functional tickets, you need to
set:
weight_ticket 1.0000
weight_waiting_time 0.0000
weight_deadline 3600000.0000
weight_urgency 0.0000
weight_priority 0.0000
If you want to support the posix priority and/or urgency, their weight
values have to be a lot smaller than the weight_ticket. Such as:
weight_ticket 10.0000
weight_waiting_time 0.00000
weight_deadline 3600000.00000
weight_urgency 0.01000
weight_priority 0.01000
This allows a user to set the priorities within his jobs and he will not exceed his percentage from the ticket setup. The weight parameters are difficult to handle and can completely compromise the ticket configuration."
Stephan's suggestions have been taken into consideration. Since we want users
to be able to use the Priority mechanism to prioritize their own pending jobs
we are going to make changes to the scheduler configuration that keep the
weight_urgency and weight_priority
mechanisms enabled but "less important" overall than the functional
ticket policy.
Verify the result by running the command "qconf -ssconf" to view the
current config:
algorithm default
schedule_interval 0:0:7
maxujobs 0
queue_sort_method load
job_load_adjustments np_load_avg=0.50
load_adjustment_decay_time 0:7:30
load_formula np_load_avg
schedd_job_info true
flush_submit_sec 0
flush_finish_sec 0
params none
reprioritize_interval 0:0:0
halftime 168
usage_weight_list cpu=1.000000,mem=0.000000,io=0.000000
compensation_factor 5.000000
weight_user 0.200000
weight_project 0.100000
weight_department 0.600000
weight_job 0.100000
weight_tickets_functional 100000
weight_tickets_share 0
share_override_tickets TRUE
share_functional_shares TRUE
max_functional_jobs_to_schedule 200
report_pjob_tickets TRUE
max_pending_tasks_per_job 50
halflife_decay_list none
policy_hierarchy OFS
weight_ticket 10.00000
weight_waiting_time 0.000000
weight_deadline 3600000.000000
weight_urgency 0.100000
weight_priority 0.500000
max_reservation 0
default_duration 0:10:0
Steps 6,7: Creating users
The command "qconf -auser"
is run for each new username. We want to create user entries within Grid Engine
where each user has been allocated 100 functional share tickets. Giving the
users an equal number of shares should ensure that users are treated equally
within Department groups when it comes to resource entitlements.
The default user values are:
name template
oticket 0
fshare 0
delete_time 0
default_project NONE
They need to be changed to:
name <username>
oticket 0
fshare 100
delete_time 0
default_project NONE
I threw together a simple Perl script to automate the process of adding users
with 100 functional share tickets. The script writes a template
to a temp location and then calls "qconf
-Auser /path-to-template"; Grid Engine reads
in the file and accepts the new settings.
This is the script:
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(tmpnam);

# Username is the only argument
my $user = shift or die "usage: $0 <username>\n";

# Write a user template to a temp file
my $tmp = POSIX::tmpnam();
open( TMP, ">", $tmp ) or die "cannot write $tmp: $!\n";
print TMP <<EOL;
name $user
oticket 0
fshare 100
delete_time 0
default_project NONE
EOL
close(TMP);

print "User=($user), Configfile=($tmp)\n";

# Have Grid Engine read the template and create the user
system("qconf -Auser $tmp");
unlink($tmp);
This is what the script looks like when run for several users:
fakehost:~
root# ./create-sge-user.pl userA
User=(userA), Configfile=(/var/tmp/tmp.0.zoTUio)
Creating user:root@fakehost.bioteam.net added "userA" to user list
fakehost:~
root# ./create-sge-user.pl userB
User=(userB), Configfile=(/var/tmp/tmp.0.MYYGB3)
Creating user:root@fakehost.bioteam.net added "userB" to user list
fakehost:~
root# ./create-sge-user.pl userC
User=(userC), Configfile=(/var/tmp/tmp.0.cy4SXR)
Creating user:root@fakehost.bioteam.net added "userC" to user list
fakehost:~
root# ./create-sge-user.pl userD
User=(userD), Configfile=(/var/tmp/tmp.0.YeU83Q)
Creating user:root@fakehost.bioteam.net added "userD" to user list
Steps 8,9: Creating and defining Department lists
Within Grid Engine, DEPARTMENTS are considered to be userlists similar to access
control lists. To create a new userlist of
type department one would do:
qconf -mu <department>
For our example department "Dept_A":
qconf -mu Dept_A
And we set the values to:
name Dept_A
type DEPT
fshare 18000
oticket 0
entries userA
The important values are:
"type" -- needs to be DEPT rather than
an ACL object
"fshare" -- 18000 is 18% of the 100,000 available
functional share tickets
"entries" -- userA is the first configured member
of the department named "Dept_A". Additional usernames are comma-separated.
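As with user creation, this step can be scripted by writing a department
template to a file and loading it with "qconf -Au" (add userset from file).
A minimal sketch for a hypothetical Dept_B definition; the 18,000 ticket
value follows the allocation table above and the member list is illustrative:
# Write a department template (values are examples, not from a live cluster)
cat > /tmp/dept_b <<EOF
name Dept_B
type DEPT
fshare 18000
oticket 0
entries userB
EOF
# Grid Engine reads the file and creates the userset
qconf -Au /tmp/dept_b
rm /tmp/dept_b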
To show the contents of a departmental user list:
fakehost:~
root# qconf -su Dept_A
name Dept_A
type DEPT
fshare 18000
oticket 0
entries userA
To show a list of all userset objects:
cat:~ root# qconf -sul
deadlineusers
defaultdepartment
Dept_A
Dept_B
Dept_C
Dept_D
Note that the configuration goals called for roughly 18% of cluster resources
to remain unassigned and available for general use.
This is what the pre-existing Department object "defaultdepartment"
is for. Any user not assigned to a given Department will be considered for
scheduling purposes to be a member of the "defaultdepartment" group.
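To give that unassigned pool its 18% entitlement, the remaining 18,000
functional share tickets would be assigned to the "defaultdepartment" object
itself; this follows from the allocation table above rather than from any
SGE requirement. The object can be inspected and edited with the same
commands used for the other departments:
qconf -su defaultdepartment
qconf -mu defaultdepartment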
Although this document deals with the command-line methods for manipulating
the Department-based Functional Share policy, a screenshot is available showing
what these settings would look like when viewed via the graphical 'qmon' program.
The screenshot is quite large (~324KB) and can be accessed by clicking
on this link.