Wednesday, January 9, 2019

2937 tasks

I feel like a big boy.

I just submitted an array job that consists of 2937 tasks to BU's shared computing cluster. Granted, they are embarrassingly parallel tasks, but this is still a job with almost 3000 tasks, so allow me to feel pretty pleased with myself.

Ok, let's back up a bit. Here's what I needed to do: I needed to run 89 yearly analyses (1922-2010) on 33 separate datasets (33 x 89 = 2937). I had been using the cluster to split my analysis by province, thus running 33 instances of the dofile at the same time. SGE has what amounts to a built-in FOR loop (the array job), so it's pretty straightforward. I just tacked this on:
#$ -t 1-33
This loop uses an environment variable in SGE, $SGE_TASK_ID. I can then pass this variable to a macro in Stata:
local taskID : env SGE_TASK_ID
But how do I nest a second loop to complete the job? The last example from here suggests that I can create bash shell variables from the SGE environment variable. The bash shell also has built-in arithmetic operations, so one way is to work backwards: calculate the total number of tasks I need to complete, split the task ID into two loop counters (provID going from 1-33 and yearID going from 1922-2010), then pass the two counter variables along to Stata a la passing along taskID above. Here's the arithmetic operation I did:
#$ -t 1-2937
year_index=$(( (SGE_TASK_ID - 1) % 89 + 1 ))
prov_index=$(( (SGE_TASK_ID - year_index)/89 +1 ))
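To see what the two formulas do at the edges, here is a minimal sketch that wraps them in a function and prints the split for a few task IDs. The function name split_task and the 1921 offset that turns the year index into a calendar year are illustrative assumptions, not part of the job script above.

```shell
#!/bin/bash
# split_task is a hypothetical helper wrapping the two formulas above.
# Prints: province index, year index, calendar year (assuming index 1 = 1922).
split_task() {
    local task_id=$1
    local year_index=$(( (task_id - 1) % 89 + 1 ))
    local prov_index=$(( (task_id - year_index) / 89 + 1 ))
    echo "$prov_index $year_index $(( 1921 + year_index ))"
}

split_task 1      # first task:  1 1 1922
split_task 89     # last year of province 1:  1 89 2010
split_task 90     # first year of province 2: 2 1 1922
split_task 2937   # last task:   33 89 2010
```

Each block of 89 consecutive task IDs maps to one province, and the year index cycles from 1 to 89 inside it.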
It turned out not to be as simple as that, however. Stata did not recognize my newly constructed variables. The suggested solutions on both Statalist and Stack Overflow called for either modifying the .bash_profile or launching AppleScript Editor, which I'm not comfortable doing and/or don't know how to do on the cluster (and I didn't feel like spending an extra three hours experimenting with a possibly disastrous outcome).
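One possible explanation, offered as an untested guess rather than a diagnosis: a plain assignment in bash creates a shell variable, and a child process only inherits it once it has been exported. The sketch below uses a child bash process as a stand-in for Stata, and the fallback task ID of 90 is an assumption so that it runs outside the cluster. Whether an export in the job script is enough for Stata's env macro on this cluster is not something I have verified.

```shell
#!/bin/bash
# Fall back to task 90 when not running under SGE (assumption for the demo).
SGE_TASK_ID=${SGE_TASK_ID:-90}

year_index=$(( (SGE_TASK_ID - 1) % 89 + 1 ))
prov_index=$(( (SGE_TASK_ID - year_index) / 89 + 1 ))

# A child process (standing in for Stata) cannot see unexported variables.
before=$(bash -c 'echo "${prov_index:-unset}"')

# Exporting places the variables in the environment of later child processes.
export year_index prov_index
after=$(bash -c 'echo "${prov_index:-unset}"')

echo "before export: $before"
echo "after export:  $after"
```

If this is indeed the cause, a line such as local provID : env prov_index in the dofile would be the natural counterpart, but treat that as a hypothesis.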

So here's what ended up working for me: I abandoned the idea of creating two loop counters in the bash shell, passed $SGE_TASK_ID straight to Stata, and only in Stata split it into the appropriate counters.
local taskID : env SGE_TASK_ID
local year_per_prov = 89
local yearID = mod((`taskID'-1),`year_per_prov')+1
local provID = floor((`taskID'-1)/`year_per_prov')+1
And this worked.
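As a sanity check on the arithmetic, the same split can be mirrored in bash and inverted for every task ID. This is a sketch, not the Stata code itself: it relies on bash integer division truncating, which matches floor() for the non-negative values involved here, and the variable names are illustrative.

```shell
#!/bin/bash
year_per_prov=89
n_prov=33
count=0
mismatches=0

for (( task_id = 1; task_id <= n_prov * year_per_prov; task_id++ )); do
    # Same split as the Stata code: mod for the year, floor division for the province.
    year_id=$(( (task_id - 1) % year_per_prov + 1 ))
    prov_id=$(( (task_id - 1) / year_per_prov + 1 ))

    # Rebuild the task ID from the pair; a unique, in-range pair must round-trip.
    rebuilt=$(( (prov_id - 1) * year_per_prov + year_id ))
    if (( rebuilt != task_id || prov_id < 1 || prov_id > n_prov )); then
        mismatches=$(( mismatches + 1 ))
    fi
    count=$(( count + 1 ))
done

echo "checked $count task IDs, $mismatches mismatches"
```

Note that yearID here runs from 1 to 89, so the calendar year is 1921 plus yearID if the dofile needs it.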


The only casualty is my inbox, which is groaning under the barrage of notification emails sent as the tasks complete successfully.



P.S. I was worried 3000 was a bit much, so I also limited the number of tasks running concurrently, first to 50, then to 150.
#$ -tc 150
With the cap at 50, the job took roughly 5 hours. Tripling the limit (so it's as if I have 150 different computers running my dofile!) made the job take just around a couple of hours!
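A back-of-the-envelope check of those timings, under the strong assumptions that every task takes about the same time and that the cap is always fully used, so the job runs in "waves" of 50 or 150 tasks:

```shell
#!/bin/bash
n_tasks=2937

# Ceiling division: number of full-or-partial waves at each concurrency cap.
waves_50=$(( (n_tasks + 50 - 1) / 50 ))
waves_150=$(( (n_tasks + 150 - 1) / 150 ))

# Roughly 5 hours at a cap of 50, scaled by the ratio of wave counts.
minutes_at_50=300
est_minutes_at_150=$(( minutes_at_50 * waves_150 / waves_50 ))

echo "waves at cap 50:  $waves_50"
echo "waves at cap 150: $waves_150"
echo "estimated minutes at cap 150: $est_minutes_at_150"
```

Under those assumptions the estimate comes out at a bit over an hour and a half, which is in the same ballpark as the couple of hours the job actually took.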
