Fairshare scheduling – How fair is fair?
Sharing is good. Whether we’re sharing a soda, an apartment or an HPC cluster, chances are good that sharing can save us money.
As readers of my previous blog will know, I’ve been playing around with OpenLava. OpenLava is an LSF compatible workload manager that is free for use and downloadable from http://openlava.org or http://teraproc.com.
One of the new features in OpenLava 3.0 is fairshare scheduling. I know a lot of clients see value in this, so I decided to set up another free cluster in the cloud to try out OpenLava 3.0's new fairshare scheduler.
For those not familiar with fairshare scheduling, it refers to sharing resources in accordance with a policy. If a cluster costs a million dollars, and department A contributed $800K while department B contributed $200K, sharing resources on an 80/20 basis (when there is contention) is probably considered fair. Fairshare does not mean “equal share” – but it does mean sharing according to a policy, usually in proportion to what each party has contributed. Department A may also have employees with different levels of seniority. If there is contention for resources between my chief scientist and an intern, I probably need some facility to ensure that a smart intern cannot monopolize the cluster and leave my chief scientist unproductive. I wanted to test this notion as well – essentially hierarchical sharing.
OpenLava gets Fairshare
Fairshare scheduling has existed in IBM Platform LSF for many years, and OpenLava 3.0 adds this capability to the open-source scheduler. To demonstrate a sharing scenario, I set up OpenLava as follows:
First, in lsb.users, an OpenLava configuration file, I define two groups of users. My company has both staff and partners, so I set up a group for each and declare its members. The three staff members will share resources on a 1:1:1 basis, and the two members of the partner group will share their allocation equally as well, although I could choose different ratios if I wanted. This represents my “intra-departmental sharing” policy.
Begin UserGroup
GROUP_NAME GROUP_MEMBER USER_SHARES
#G1 (david john zebra) ([david, 1] [john,1] [zebra, 1])
#G2 (crock wlu kluk) ([crock, 2] [wlu,1] [kluk,1])
staff (william david james) ([william, 1] [david, 1] [james, 1])
partners (gord dan) ([gord, 1] [dan, 1])
End UserGroup
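For these changes to take effect, the batch daemons need to re-read lsb.users. Below is a minimal sketch of how I reload the configuration and sanity-check the new groups, assuming badmin and bugroup behave in OpenLava as they do in LSF:
# Re-read the lsb.* configuration files, then list the configured user groups
badmin reconfig
bugroup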
We add an additional constraint in lsb.users: at no time do I want any individual partner to consume more than five slots on the cluster (meaning each can run at most five jobs concurrently across the cluster hosts). The syntax partners@ indicates that the limit applies to each member of the group rather than to the group collectively.
Begin User
USER_NAME MAX_JOBS JL/P
#develop@ 20 8
#support 50 -
partners@ 5 -
End User
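Once this limit is loaded, the busers command is a quick way to confirm it. This is only a sketch, assuming busers accepts user names and reports MAX_JOBS and JL/P as it does in LSF:
# Show job limits and current job counts for the two partner users
busers gord dan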
Next we define a queue called “share” that enforces an allocation policy between our two groups of users, staff and partners. This is done in lsb.queues with the syntax below. For every three slots allocated to staff, partners get only one, meaning that our staff are guaranteed 75% of the resources, and more if no partners are submitting work.
Begin Queue
QUEUE_NAME = share
PRIORITY = 30
NICE = 20
DESCRIPTION = Queue to demonstrate fairshare scheduling policy
FAIRSHARE = USER_SHARES[[staff,3] [partners,1]]
End Queue
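The same FAIRSHARE syntax covers the 80/20 departmental example from the introduction; only the ratio changes. The sketch below is hypothetical: the deptA and deptB group names would need matching UserGroup entries in lsb.users, and badmin reconfig must be run after editing lsb.queues.
Begin Queue
QUEUE_NAME = dept_share
PRIORITY = 30
DESCRIPTION = Hypothetical queue giving department A an 80/20 share over department B
FAIRSHARE = USER_SHARES[[deptA,4] [deptB,1]]
End Queue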
Simulating jobs
Next, in order to generate a reasonable workload, we write a script that submits work on behalf of five users, three of whom are “staff” and two of whom are “partners”. To make this credible, we generate a workload of 1,000 jobs per user, for a total of 5,000 jobs, each taking 100 seconds. Run sequentially on a single computer, 5,000 jobs at 100 seconds per job would take 500,000 seconds, or 138.9 hours, under ideal circumstances. Because our cluster can run 40 jobs concurrently (we have 40 job slots configured, as shown below), the theoretical time is reduced to 3.47 hours, almost exactly what I experienced actually running the workload and writing this blog.
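As a quick sanity check of that arithmetic, here is a small awk calculation using the figures above; it has nothing to do with OpenLava itself:
# Back-of-the-envelope timing for 5,000 jobs of 100 seconds each on 40 slots
awk 'BEGIN {
    jobs = 5000; secs = 100; slots = 40
    serial = jobs * secs
    printf "serial on one core: %d s (%.1f hours)\n", serial, serial / 3600
    printf "across %d slots:    %d s (%.2f hours)\n", slots, serial / slots, (serial / slots) / 3600
}'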
Below is the OpenLava bhosts command showing available scheduling slots across our four cluster hosts.
[root@ip-172-31-34-75 ~]# bhosts
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
ip-172-31-34-75    ok              -     10      0      0      0      0      0
ip-172-31-35-144   ok              -     10      0      0      0      0      0
ip-172-31-35-97    ok              -     10      0      0      0      0      0
ip-172-31-46-36    ok              -     10      0      0      0      0      0
[root@ip-172-31-34-75 ~]#
Given this cluster configuration and sharing policy, the next step is to submit 5,000 jobs on behalf of our five defined users. The script below is, I believe, self-explanatory: it submits five job arrays, one per user, each comprised of 1,000 jobs.
#!/bin/sh
sudo -u gord -i bsub -q share -J "gordArray[1-1000]" sleep 100
sudo -u dan -i bsub -q share -J "danArray[1-1000]" sleep 100
sudo -u william -i bsub -q share -J "williamArray[1-1000]" sleep 100
sudo -u david -i bsub -q share -J "davidArray[1-1000]" sleep 100
sudo -u james -i bsub -q share -J "jamesArray[1-1000]" sleep 100
We submit these 5,000 100-second jobs to the cluster by running the script.
[root@ip-172-31-34-75 ~]# ./fairshare_demo.sh
Job <311> is submitted to queue <share>.
Job <312> is submitted to queue <share>.
Job <313> is submitted to queue <share>.
Job <314> is submitted to queue <share>.
Job <315> is submitted to queue <share>.
After a few minutes, we issue a command to look at our 5,000 jobs to understand their status and how they are being allocated to cluster hosts. Note that the running jobs are in exactly the proportion we would expect: each of our staff members has ten running jobs, for a total of thirty (75% of the slots), while our two partners are each running five jobs, for the remaining 25%.
[root@ip-172-31-34-75 ~]# bjobs -u all | more
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
313 william RUN share ip-172-31-3 ip-172-31-3 *mArray[1] Mar 15 16:10
313 william RUN share ip-172-31-3 ip-172-31-3 *mArray[2] Mar 15 16:10
313 william RUN share ip-172-31-3 ip-172-31-3 *mArray[3] Mar 15 16:10
313 william RUN share ip-172-31-3 ip-172-31-3 *mArray[4] Mar 15 16:10
313 william RUN share ip-172-31-3 ip-172-31-3 *mArray[5] Mar 15 16:10
313 william RUN share ip-172-31-3 ip-172-31-3 *mArray[6] Mar 15 16:10
313 william RUN share ip-172-31-3 ip-172-31-3 *mArray[7] Mar 15 16:10
313 william RUN share ip-172-31-3 ip-172-31-3 *mArray[8] Mar 15 16:10
313 william RUN share ip-172-31-3 ip-172-31-3 *mArray[9] Mar 15 16:10
313 william RUN share ip-172-31-3 ip-172-31-3 *Array[10] Mar 15 16:10
314 david RUN share ip-172-31-3 ip-172-31-4 *dArray[1] Mar 15 16:10
314 david RUN share ip-172-31-3 ip-172-31-4 *dArray[2] Mar 15 16:10
314 david RUN share ip-172-31-3 ip-172-31-4 *dArray[3] Mar 15 16:10
314 david RUN share ip-172-31-3 ip-172-31-4 *dArray[4] Mar 15 16:10
314 david RUN share ip-172-31-3 ip-172-31-4 *dArray[5] Mar 15 16:10
314 david RUN share ip-172-31-3 ip-172-31-4 *dArray[6] Mar 15 16:10
314 david RUN share ip-172-31-3 ip-172-31-4 *dArray[7] Mar 15 16:10
314 david RUN share ip-172-31-3 ip-172-31-4 *dArray[8] Mar 15 16:10
314 david RUN share ip-172-31-3 ip-172-31-4 *dArray[9] Mar 15 16:10
314 david RUN share ip-172-31-3 ip-172-31-4 *Array[10] Mar 15 16:10
311 gord RUN share ip-172-31-3 ip-172-31-3 *dArray[1] Mar 15 16:10
312 dan RUN share ip-172-31-3 ip-172-31-3 *nArray[1] Mar 15 16:10
311 gord RUN share ip-172-31-3 ip-172-31-3 *dArray[2] Mar 15 16:10
312 dan RUN share ip-172-31-3 ip-172-31-3 *nArray[2] Mar 15 16:10
311 gord RUN share ip-172-31-3 ip-172-31-3 *dArray[3] Mar 15 16:10
312 dan RUN share ip-172-31-3 ip-172-31-3 *nArray[3] Mar 15 16:10
311 gord RUN share ip-172-31-3 ip-172-31-3 *dArray[4] Mar 15 16:10
312 dan RUN share ip-172-31-3 ip-172-31-3 *nArray[4] Mar 15 16:10
311 gord RUN share ip-172-31-3 ip-172-31-3 *dArray[5] Mar 15 16:10
312 dan RUN share ip-172-31-3 ip-172-31-3 *nArray[5] Mar 15 16:10
315 james RUN share ip-172-31-3 ip-172-31-3 *sArray[1] Mar 15 16:10
315 james RUN share ip-172-31-3 ip-172-31-3 *sArray[2] Mar 15 16:10
315 james RUN share ip-172-31-3 ip-172-31-3 *sArray[3] Mar 15 16:10
315 james RUN share ip-172-31-3 ip-172-31-3 *sArray[4] Mar 15 16:10
315 james RUN share ip-172-31-3 ip-172-31-3 *sArray[5] Mar 15 16:10
315 james RUN share ip-172-31-3 ip-172-31-3 *sArray[6] Mar 15 16:10
315 james RUN share ip-172-31-3 ip-172-31-3 *sArray[7] Mar 15 16:10
315 james RUN share ip-172-31-3 ip-172-31-3 *sArray[8] Mar 15 16:10
315 james RUN share ip-172-31-3 ip-172-31-3 *sArray[9] Mar 15 16:10
315 james RUN share ip-172-31-3 ip-172-31-3 *Array[10] Mar 15 16:10
311 gord PEND share ip-172-31-3 *dArray[6] Mar 15 16:10
311 gord PEND share ip-172-31-3 *dArray[7] Mar 15 16:10
311 gord PEND share ip-172-31-3 *dArray[8] Mar 15 16:10
311 gord PEND share ip-172-31-3 *dArray[9] Mar 15 16:10
311 gord PEND share ip-172-31-3 *Array[10] Mar 15 16:10
311 gord PEND share ip-172-31-3 *Array[11] Mar 15 16:10
311 gord PEND share ip-172-31-3 *Array[12] Mar 15 16:10
311 gord PEND share ip-172-31-3 *Array[13] Mar 15 16:10
Notice how the allocation directly aligns to our sharing policy.
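Rather than eyeballing the listing, the per-user split can be tallied with a short loop over the same bjobs output. This is only a convenience sketch; it assumes bjobs accepts -q as in LSF and that the STAT column prints RUN exactly as shown above:
#!/bin/sh
# Count running jobs per user in the share queue
for u in william david james gord dan; do
    count=$(bjobs -u "$u" -q share 2>/dev/null | grep -c " RUN ")
    echo "$u: $count running"
done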
The bqueues command shows how jobs are allocated by user for a specific queue.
[root@ip-172-31-34-75 ~]# bqueues -l share
QUEUE: share
-- Queue to demonstrate fairshare scheduling policy
PARAMETERS/STATISTICS
PRIO NICE STATUS MAX JL/U JL/P JL/H NJOBS PEND RUN SSUSP USUSP RSV
30 20 Open:Active - - - - 4720 4680 40 0 0 0
Interval for a host to accept two jobs is 0 seconds
SCHEDULING PARAMETERS
r15s r1m r15m ut pg io ls it tmp swp mem
loadSched - - - - - - - - - - -
loadStop - - - - - - - - - - -
SCHEDULING POLICIES: FAIRSHARE
TOTAL_SLOTS: 40 FREE_SLOTS: 0
USER/GROUP SHARES PRIORITY DSRV PEND RUN
staff/ 3 0.750 30 2760 30
william 1 0.333 10 920 10
david 1 0.333 10 920 10
james 1 0.333 10 920 10
partners/ 1 0.250 10 1920 10
gord 1 0.500 5 960 5
dan 1 0.500 5 960 5
USERS: all users
HOSTS: all hosts used by the LSF Batch system
Note that the bqueues -l output shows total fidelity to the sharing policy even with thousands of jobs in the cluster. Some time later, after more of the jobs have completed, the pending counts show that our staff members have finished over 50% of their jobs, while the deserved-slot (DSRV) allocation still matches the policy exactly.
SCHEDULING POLICIES: FAIRSHARE
TOTAL_SLOTS: 40 FREE_SLOTS: 40
USER/GROUP SHARES PRIORITY DSRV PEND RUN
staff/ 3 0.750 30 1380 0
william 1 0.333 10 460 0
david 1 0.333 10 460 0
james 1 0.333 10 460 0
partners/ 1 0.250 10 1460 0
gord 1 0.500 5 730 0
dan 1 0.500 5 730 0
USERS: all users
HOSTS: all hosts used by the LSF Batch system
How fair is OpenLava’s fairshare scheduling policy? This is a very simple example of course, but based on this test it looks pretty darned fair!