Fairshare scheduling – How fair is fair?

Sharing is good. Whether we’re sharing a soda, an apartment or an HPC cluster, chances are good that sharing can save us money.

As readers of my previous blog post will know, I’ve been playing around with OpenLava. OpenLava is an LSF-compatible workload manager that is free to use and downloadable from http://openlava.org or http://teraproc.com.

One of the new features in OpenLava 3.0 is fairshare scheduling. I know a lot of clients see value in this, so I decided to set up another free cluster in the cloud to try out OpenLava 3.0’s new fairshare scheduler.

For those not familiar with fairshare scheduling, the term refers to sharing resources in accordance with a policy. If a cluster costs a million dollars, and department A contributed $800K while department B contributed $200K, then sharing resources on an 80/20 basis (when there is contention) is probably considered fair. Fairshare does not mean “equal share” – it means sharing according to a policy, usually in proportion to each party’s stake.

Department A may also have employees with different levels of seniority. If there is contention for resources between my chief scientist and an intern, I probably need some facility to make sure that a smart intern cannot monopolize the cluster and leave my chief scientist unproductive. I wanted to test this notion as well – essentially hierarchical sharing.
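To make the arithmetic concrete, here is a tiny shell sketch of how a hierarchical policy cascades down to individual users. The numbers (a 40-slot cluster, an 80/20 departmental split, four equal members in department A) are made up for illustration and are not part of my actual setup:

#!/bin/sh
# Hypothetical hierarchical fairshare arithmetic (illustration only)
SLOTS=40                                   # total slots under contention
DEPT_A_SLOTS=$((SLOTS * 80 / 100))         # department A's 80% share -> 32 slots
DEPT_B_SLOTS=$((SLOTS * 20 / 100))         # department B's 20% share -> 8 slots
MEMBERS_A=4                                # suppose department A has 4 equal members
echo "Department A: $DEPT_A_SLOTS slots, Department B: $DEPT_B_SLOTS slots"
echo "Each department A member: $((DEPT_A_SLOTS / MEMBERS_A)) slots"   # 8 slots each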

OpenLava gets Fairshare

Fairshare scheduling has existed in IBM Platform LSF for many years, and with OpenLava 3.0 the open-source scheduler gains this capability. To demonstrate a sharing scenario, I set up OpenLava as follows:

First, in lsb.users, an OpenLava configuration file, I define two groups of users. My company has both staff and partners. I set up a group for each, and declare the members of each group. The three staff members will share resources on a 1:1:1 basis, and the two members of the partner group will also share their allocation equally, although I could choose different ratios if I wanted. This represents my “intra-departmental sharing” policy.

Begin UserGroup
GROUP_NAME       GROUP_MEMBER         USER_SHARES
#G1         (david john zebra)      ([david, 1] [john,1] [zebra, 1])
#G2         (crock wlu kluk)        ([crock, 2] [wlu,1] [kluk,1])
staff       (william david james)   ([william, 1] [david, 1] [james, 1])
partners    (gord dan)              ([gord, 1] [dan, 1])
End UserGroup
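With the groups defined, the batch daemons need to re-read the configuration. The sketch below shows how I would apply and sanity-check the change; badmin reconfig and bugroup are the LSF-style commands I am assuming OpenLava provides here, so treat this as a sketch rather than gospel:

# Re-read lsb.users (and the other lsb.* files) without restarting the cluster
badmin reconfig

# List the configured user groups and their members to confirm staff/partners exist
bugroup -l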

We add an additional constraint in lsb.users: at no time do I want any of my partners to individually consume more than five slots on the cluster (meaning each of them can run at most five jobs concurrently across the cluster hosts). The partners@ syntax indicates that the limit applies to each member of the group rather than to the group collectively.

Begin User
USER_NAME       MAX_JOBS        JL/P
#develop@        20              8
#support         50              -
partners@        5               -
End User
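Before running the large test later in this post, a quick sanity check of the per-partner cap is possible with nothing more than the bsub and bjobs commands used throughout. This is just a sketch – the job name and sleep duration are arbitrary:

#!/bin/sh
# On an otherwise idle cluster, submit 10 jobs as partner "gord"...
sudo -u gord -i bsub -J "capTest[1-10]" sleep 60
sleep 15                               # give the scheduler a dispatch cycle or two
# ...then count how many are actually running; MAX_JOBS should hold this to 5
sudo -u gord -i bjobs | grep -c RUN    # expect: 5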

Next we define a queue called “share” that enforces an allocation policy between our two groups of users – staff and partners. This is done in lsb.queues with the syntax below: for every three slots allocated to staff, partners get one, meaning that our staff are guaranteed 75% of the resources – and more if no partners are submitting work.

Begin Queue
QUEUE_NAME      = share
PRIORITY        = 30
NICE            = 20
DESCRIPTION     = Queue to demonstrate fairshare scheduling policy
FAIRSHARE       = USER_SHARES[[staff,3] [partners,1]]
End Queue
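As with the lsb.users change, I reconfigure and then look at the queue to confirm the policy took effect; the full bqueues -l output appears later in this post once jobs are flowing. Again, badmin reconfig is assumed to behave as it does in LSF:

# Pick up the new queue definition from lsb.queues
badmin reconfig

# The queue should now report SCHEDULING POLICIES: FAIRSHARE and the USER_SHARES table
bqueues -l share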

Simulating jobs

Next, in order to generate a reasonable workload, we write a script that submits work on behalf of five users – three of whom are “staff” and two of whom are “partners”. To make this credible, we generate a workload of 1,000 jobs per user for a total of 5,000 jobs, each taking 10 seconds. Run sequentially on a single computer, 5,000 jobs at 10 seconds apiece would take 50,000 seconds, or roughly 13.9 hours, in ideal circumstances. Because our cluster allows us to run 40 jobs concurrently (we have 40 job slots configured as shown below), the theoretical time drops to about 1,250 seconds – roughly 21 minutes – which is close to what I observed actually running the workload while writing this blog.
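The arithmetic is easy to double-check in the shell; these are just the figures from the paragraph above:

#!/bin/sh
# Sanity-check the runtime estimates (integer shell arithmetic, so values round down)
JOBS=5000; RUNTIME=10; SLOTS=40
TOTAL=$((JOBS * RUNTIME))                    # 50,000 seconds of work in total
echo "Sequential: $TOTAL seconds (~$((TOTAL / 3600)) hours)"
echo "Across $SLOTS slots: $((TOTAL / SLOTS)) seconds (~$((TOTAL / SLOTS / 60)) minutes)"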

Below is the output of the OpenLava bhosts command, showing the available scheduling slots across our four cluster hosts.

[root@ip-172-31-34-75 ~]# bhosts
HOST_NAME         STATUS       JL/U   MAX NJOBS   RUN SSUSP USUSP   RSV
ip-172-31-34-75   ok             -     10     0     0     0     0     0
ip-172-31-35-144   ok             -     10     0     0     0     0    0
ip-172-31-35-97   ok             -     10     0     0     0     0     0
ip-172-31-46-36   ok             -     10     0     0     0     0     0
[root@ip-172-31-34-75 ~]#

Given this cluster configuration and sharing policy, the next step is to submit 5,000 jobs on behalf of our five defined users. The script below is, I believe, self-explanatory: it submits five job arrays, one per user, each comprising 1,000 jobs.

#!/bin/sh
sudo -u gord     -i bsub -q share -J "gordArray[1-1000]"    sleep 10
sudo -u dan      -i bsub -q share -J "danArray[1-1000]"     sleep 10
sudo -u william  -i bsub -q share -J "williamArray[1-1000]" sleep 10
sudo -u david    -i bsub -q share -J "davidArray[1-1000]"   sleep 10
sudo -u james    -i bsub -q share -J "jamesArray[1-1000]"   sleep 10

We run the script, submitting these 5,000 ten-second jobs to our cluster.

[root@ip-172-31-34-75 ~]# ./fairshare_demo.sh
Job <311> is submitted to queue <share>.
Job <312> is submitted to queue <share>.
Job <313> is submitted to queue <share>.
Job <314> is submitted to queue <share>.
Job <315> is submitted to queue <share>.

After a few minutes, we issue a command to look at our 5,000 jobs and see their status and how they are being allocated to cluster hosts. Note that the running jobs are in exactly the proportion we would expect: each of our staff members has ten running jobs, for a total of thirty (75% of the slots), while our two partners are each running five jobs, for a total of ten (25% of the slots).

[root@ip-172-31-34-75 ~]# bjobs -u all | more
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
313     william RUN   share      ip-172-31-3 ip-172-31-3 *mArray[1] Mar 15 16:10
313     william RUN   share      ip-172-31-3 ip-172-31-3 *mArray[2] Mar 15 16:10
313     william RUN   share      ip-172-31-3 ip-172-31-3 *mArray[3] Mar 15 16:10
313     william RUN   share      ip-172-31-3 ip-172-31-3 *mArray[4] Mar 15 16:10
313     william RUN   share      ip-172-31-3 ip-172-31-3 *mArray[5] Mar 15 16:10
313     william RUN   share      ip-172-31-3 ip-172-31-3 *mArray[6] Mar 15 16:10
313     william RUN   share      ip-172-31-3 ip-172-31-3 *mArray[7] Mar 15 16:10
313     william RUN   share      ip-172-31-3 ip-172-31-3 *mArray[8] Mar 15 16:10
313     william RUN   share      ip-172-31-3 ip-172-31-3 *mArray[9] Mar 15 16:10
313     william RUN   share      ip-172-31-3 ip-172-31-3 *Array[10] Mar 15 16:10
314     david   RUN   share      ip-172-31-3 ip-172-31-4 *dArray[1] Mar 15 16:10
314     david   RUN   share      ip-172-31-3 ip-172-31-4 *dArray[2] Mar 15 16:10
314     david   RUN   share      ip-172-31-3 ip-172-31-4 *dArray[3] Mar 15 16:10
314     david   RUN   share      ip-172-31-3 ip-172-31-4 *dArray[4] Mar 15 16:10
314     david   RUN   share      ip-172-31-3 ip-172-31-4 *dArray[5] Mar 15 16:10
314     david   RUN   share      ip-172-31-3 ip-172-31-4 *dArray[6] Mar 15 16:10
314     david   RUN   share      ip-172-31-3 ip-172-31-4 *dArray[7] Mar 15 16:10
314     david   RUN   share      ip-172-31-3 ip-172-31-4 *dArray[8] Mar 15 16:10
314     david   RUN   share      ip-172-31-3 ip-172-31-4 *dArray[9] Mar 15 16:10
314     david   RUN   share      ip-172-31-3 ip-172-31-4 *Array[10] Mar 15 16:10
311     gord    RUN   share      ip-172-31-3 ip-172-31-3 *dArray[1] Mar 15 16:10
312     dan     RUN   share      ip-172-31-3 ip-172-31-3 *nArray[1] Mar 15 16:10
311     gord    RUN   share      ip-172-31-3 ip-172-31-3 *dArray[2] Mar 15 16:10
312     dan     RUN   share      ip-172-31-3 ip-172-31-3 *nArray[2] Mar 15 16:10
311     gord    RUN   share      ip-172-31-3 ip-172-31-3 *dArray[3] Mar 15 16:10
312     dan     RUN   share      ip-172-31-3 ip-172-31-3 *nArray[3] Mar 15 16:10
311     gord    RUN   share      ip-172-31-3 ip-172-31-3 *dArray[4] Mar 15 16:10
312     dan     RUN   share      ip-172-31-3 ip-172-31-3 *nArray[4] Mar 15 16:10
311     gord    RUN   share      ip-172-31-3 ip-172-31-3 *dArray[5] Mar 15 16:10
312     dan     RUN   share      ip-172-31-3 ip-172-31-3 *nArray[5] Mar 15 16:10
315     james   RUN   share      ip-172-31-3 ip-172-31-3 *sArray[1] Mar 15 16:10
315     james   RUN   share      ip-172-31-3 ip-172-31-3 *sArray[2] Mar 15 16:10
315     james   RUN   share      ip-172-31-3 ip-172-31-3 *sArray[3] Mar 15 16:10
315     james   RUN   share      ip-172-31-3 ip-172-31-3 *sArray[4] Mar 15 16:10
315     james   RUN   share      ip-172-31-3 ip-172-31-3 *sArray[5] Mar 15 16:10
315     james   RUN   share      ip-172-31-3 ip-172-31-3 *sArray[6] Mar 15 16:10
315     james   RUN   share      ip-172-31-3 ip-172-31-3 *sArray[7] Mar 15 16:10
315     james   RUN   share      ip-172-31-3 ip-172-31-3 *sArray[8] Mar 15 16:10
315     james   RUN   share      ip-172-31-3 ip-172-31-3 *sArray[9] Mar 15 16:10
315     james   RUN   share      ip-172-31-3 ip-172-31-3 *Array[10] Mar 15 16:10
311     gord    PEND  share      ip-172-31-3             *dArray[6] Mar 15 16:10
311     gord    PEND  share      ip-172-31-3             *dArray[7] Mar 15 16:10
311     gord    PEND  share      ip-172-31-3             *dArray[8] Mar 15 16:10
311     gord    PEND  share      ip-172-31-3             *dArray[9] Mar 15 16:10
311     gord    PEND  share      ip-172-31-3             *Array[10] Mar 15 16:10
311     gord    PEND  share      ip-172-31-3             *Array[11] Mar 15 16:10
311     gord    PEND  share      ip-172-31-3             *Array[12] Mar 15 16:10
311     gord    PEND  share      ip-172-31-3             *Array[13] Mar 15 16:10

Notice how the allocation aligns directly with our sharing policy.
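To summarize that long listing at a glance, a one-liner can tally the running jobs per user; awk is the only thing added here beyond the bjobs command already shown above:

# Count RUN jobs per user across the whole cluster
# expected: 10 each for william, david and james; 5 each for gord and dan
bjobs -u all | awk '$3 == "RUN" { run[$2]++ } END { for (u in run) print u, run[u] }'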

The bqueues command shows how jobs are allocated by user for a specific queue.

[root@ip-172-31-34-75 ~]# bqueues -l share

QUEUE: share
  -- Queue to demonstrate fairshare scheduling policy

PARAMETERS/STATISTICS
PRIO NICE     STATUS       MAX JL/U JL/P JL/H NJOBS  PEND  RUN  SSUSP USUSP  RSV
  30  20    Open:Active     -    -    -    -  4720  4680    40     0     0    0

Interval for a host to accept two jobs is 0 seconds

SCHEDULING PARAMETERS
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -

SCHEDULING POLICIES:  FAIRSHARE

TOTAL_SLOTS: 40 FREE_SLOTS: 0
USER/GROUP   SHARES   PRIORITY     DSRV     PEND      RUN
staff/           3       0.750       30     2760       30
william          1       0.333       10      920       10
david            1       0.333       10      920       10
james            1       0.333       10      920       10
partners/        1       0.250       10     1920       10
gord             1       0.500        5      960        5
dan              1       0.500        5      960        5

USERS:  all users
HOSTS:  all hosts used by the LSF Batch system

Note that the bqueues -l output shows total fidelity to the sharing policy even with thousands of jobs in the cluster. Some time later, after a couple of thousand more jobs have completed, the pending counts tell us that each staff member has finished over half of their jobs (460 of 1,000 still pending) while each partner still has 730 pending – roughly 540 completions per staff member versus 270 per partner, which works out to a 3:1 split between the two groups, exactly what the policy calls for.

SCHEDULING POLICIES:  FAIRSHARE

TOTAL_SLOTS: 40 FREE_SLOTS: 40
USER/GROUP   SHARES   PRIORITY     DSRV     PEND      RUN
staff/           3       0.750       30     1380        0
william          1       0.333       10      460        0
david            1       0.333       10      460        0
james            1       0.333       10      460        0
partners/        1       0.250       10     1460        0
gord             1       0.500        5      730        0
dan              1       0.500        5      730        0

USERS:  all users
HOSTS:  all hosts used by the LSF Batch system
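To watch the backlog drain for the remainder of the run, one could simply sample the same bqueues query in a loop – a rough sketch (press Ctrl-C to stop):

#!/bin/sh
# Print the per-user fairshare table once a minute
while true; do
    date
    bqueues -l share | grep -A 8 "USER/GROUP"
    sleep 60
done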

How fair is OpenLava’s fairshare scheduling policy? This is a very simple example of course, but based on this test it looks pretty darned fair!