(Single PC version)
Imagine you have a desktop or laptop PC with 4 or perhaps even 8 CPU cores available, and you want to run the Monte Carlo particle transport program
FLUKA on it using all of them.
The FLUKA execution script
rfluka, however, was designed to run in "serial" mode. That is, if you ask for many repetitions of your simulation (say, 100) by issuing the command
rfluka -N0 -M100 example, the cycles are run one after another instead of utilizing all available cores on your PC.
A solution is to use a job queuing system and a scheduler. Here I'll present one way to do it on a
Debian-based Linux system.
Ubuntu should work just as well, since Ubuntu is very similar to Debian. A nice feature of the method presented here is that it can easily be extended to cover several PCs on your network, so you can use the computing power of your colleagues when they are not using their PCs (e.g. at night). This post, however, keeps things very simple and only sets it up on your own PC. In less than 10 minutes you'll have it up and running...
The idea is to use
TORQUE in a very minimal configuration. There will be no fuss with
Maui or similar schedulers; we will only use packages we can get from the Debian/Ubuntu software repositories.
In order to be friendly to all the Ubuntu users out there, all commands issued as root are prefixed here with the "sudo" command. As a Debian user you can become root with the "su" command first.
First install these packages:
$ sudo apt-get install torque-server torque-scheduler
$ sudo apt-get install torque-common torque-mom libtorque2
and either
$ sudo apt-get install torque-client
or
$ sudo apt-get install torque-client-x11
After installation we need to set up torque properly. I assume here that your PC's hostname cannot be resolved by DNS, which is quite common on small local networks. You can test whether your hostname can be resolved with the "host" command. Assuming your PC is named "kepler", you may get an answer like:
$ host $HOSTNAME
Host kepler not found: 3(NXDOMAIN)
This means you may need to edit the
/etc/hosts file, so your PC can associate an IP address with your hostname. Debian-like distros tend to assign the hostname to 127.0.1.1, which will
not work with torque. Instead I looked up my IP address (which in my case is pretty static) using
/sbin/ifconfig and edited
/etc/hosts accordingly with my favourite text editor (emacs, gedit, vi...).
My
/etc/hosts file ended up looking like this:
127.0.0.1 localhost
#127.0.1.1 kepler.lan kepler
192.168.1.108 kepler
If the hostname of your PC can be resolved, you can omit the last line, but in all circumstances
you must comment out the line starting with 127.0.1.1.
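To check that the name now resolves, you can use getent, which (unlike the host command) also consults /etc/hosts; the output below is simply what it looks like with the example file above:
$ getent hosts $HOSTNAME
192.168.1.108   kepler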
Once this is done, execute the following commands to configure torque:
$ echo $HOSTNAME | sudo tee /etc/torque/server_name
$ echo $HOSTNAME | sudo tee /var/spool/torque/server_name
$ sudo pbs_server -t create
$ echo $HOSTNAME np=`grep proc /proc/cpuinfo | wc -l` | sudo tee /var/spool/torque/server_priv/nodes
$ sudo qterm
$ sudo pbs_server
$ sudo pbs_mom
(Update: If qterm fails, you probably have a problem with your /etc/hosts file. You can still kill the server with $ killall -r "pbs_*".)
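To double-check the node definition, you can also have a look at the nodes file we just created; on my 4-core machine it contains a single line:
$ cat /var/spool/torque/server_priv/nodes
kepler np=4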
Now let's see if things are running as expected:
$ pbsnodes -a
kepler
state = free
np = 4
ntype = cluster
status = rectime=1326926041,varattr=,jobs=,state=free,netload=3304768553,gres=,loadave=0.09,ncpus=4,physmem=3988892kb,availmem=6643852kb,totmem=7876584kb,idletime=2518,nusers=2,nsessions=8,sessions=1183 1760 2170 2271 2513 15794 16067 16607,uname=Linux kepler 3.1.0-1-amd64 #1 SMP Tue Jan 10 05:01:58 UTC 2012 x86_64,opsys=linux
and also
$ sudo momctl -d 0 -h $HOSTNAME
Host: kepler/kepler Version: 2.4.16 PID: 16835
Server[0]: kepler (192.168.1.108:15001)
Last Msg From Server: 279 seconds (CLUSTER_ADDRS)
Last Msg To Server: 9 seconds
HomeDirectory: /var/spool/torque/mom_priv
MOM active: 280 seconds
LogLevel: 0 (use SIGUSR1/SIGUSR2 to adjust)
NOTE: no local jobs detected
Now set up a queue, which here is called "batch":
$ sudo qmgr -c 'create queue batch'
$ sudo qmgr -c 'set queue batch queue_type = Execution'
$ sudo qmgr -c 'set queue batch resources_default.nodes = 1'
$ sudo qmgr -c 'set queue batch resources_default.walltime = 01:00:00'
$ sudo qmgr -c 'set queue batch enabled = True'
$ sudo qmgr -c 'set queue batch started = True'
$ sudo qmgr -c 'set server default_queue = batch'
$ sudo qmgr -c 'set server scheduling = True'
[Update: you may want to increase the walltime to 10:00:00 so jobs don't stop after 1 hour.]
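If you want to verify what was just configured, qmgr can print the whole server and queue configuration back as a list of qmgr commands, mirroring what was entered above:
$ qmgr -c 'print server'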
and start the scheduler:
$ sudo pbs_sched
The rest of the commands can be issued as a normal user (i.e. non-root).
Let's see if all servers are running:
$ ps -e | grep pbs
1286 ? 00:00:00 pbs_mom
1293 ? 00:00:00 pbs_server
2174 ? 00:00:00 pbs_sched
Anything in the queue?
$ qstat
$
Nope, it's empty.
Let's try to submit a simple job:
$ echo "sleep 20" | qsub
and within the next 20 seconds you can check whether it is in the queue:
$ qstat
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
0.kepler STDIN bassler 0 R batch
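When the job has finished, torque normally copies its stdout and stderr back to the directory you submitted from, as <jobname>.o<jobid> and <jobname>.e<jobid>; here that would be STDIN.o0 and STDIN.e0, both empty since sleep prints nothing:
$ ls STDIN.*
STDIN.e0  STDIN.o0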
Great, now we're ready to rock 'n roll! This is really a minimalistic setup which just works. For more bells and whistles, check the
torque manual.
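As an example of one such bell and whistle: instead of raising the queue default, you can request a longer walltime for a single submission directly on the qsub command line (myjob.sh is just a placeholder for your own script):
$ qsub -l walltime=10:00:00 myjob.sh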
All we need now is a simple FLUKA job submission script:
rtfluka.sh
#!/bin/bash
#
# how to use this
# change to the directory with the files you want to run
# and enter:
# $ qsub -V -t 0-9 -d . rtfluka.sh
#
#PBS -N FLUKA_JOB
#
start="$PBS_ARRAYID"
let stop="$start+1"
stop_pad=`printf "%03i\n" $stop`
#
# Init new random number sequence for each calculation.
# This may be a poor solution.
cp $FLUPRO/random.dat ranexample$stop_pad
sed -i '/RANDOMIZE 1.0/c\RANDOMIZE 1.0 '"${RANDOM}"'.0' example.inp
$FLUPRO/flutil/rfluka -N$start -M$stop example -e flukadpm3
Update: Note that the RANDOMIZE card in your own .inp file must match the sed expression above, otherwise you may repeat the exact same simulation over and over again...
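A quick way to check whether your input file will actually be touched is to grep for the exact string the sed command looks for; if this prints nothing, the seed will never be replaced:
$ grep 'RANDOMIZE 1.0' example.inp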
Let's submit 10 jobs:
$ qsub -V -t 0-9 -d . rtfluka.sh
And watch the blinkenlichts.
$ qstat
Job id Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
15-0.kepler FLUKA_JOB-0 bassler 0 R batch
15-1.kepler FLUKA_JOB-1 bassler 0 R batch
15-2.kepler FLUKA_JOB-2 bassler 0 R batch
15-3.kepler FLUKA_JOB-3 bassler 0 R batch
15-4.kepler FLUKA_JOB-4 bassler 0 Q batch
15-5.kepler FLUKA_JOB-5 bassler 0 Q batch
15-6.kepler FLUKA_JOB-6 bassler 0 Q batch
15-7.kepler FLUKA_JOB-7 bassler 0 Q batch
15-8.kepler FLUKA_JOB-8 bassler 0 Q batch
15-9.kepler FLUKA_JOB-9 bassler 0 Q batch
Surely this can be improved a lot; suggestions are most welcome in the comments below. One problem, for instance, is that the seed comes from bash's $RANDOM, which is only a 15-bit integer (0-32767) and thus covers only a very small fraction of the possible seeds for the RANDOMIZE card.
Update: There is also a very small risk that the same seed is occasionally used twice (or more often). Alternatively one could just add a random number to a starting seed after each run, as sketched below. (Any MC random number experts out there?)
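One simple variation along those lines (just a sketch, not what rtfluka.sh above does) is to derive the seed directly from the task number, so every job in the array is guaranteed a different value; BASE_SEED is an arbitrary number you pick yourself:
# Sketch: unique seed per array task instead of $RANDOM
BASE_SEED=12345
seed=$((BASE_SEED + PBS_ARRAYID))
sed -i '/RANDOMIZE 1.0/c\RANDOMIZE 1.0 '"${seed}"'.0' example.inp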
Output data can be processed in regular ways, using
flair.
Alternatively you may use some of the scripts in the
auflukatools package, which for instance can merge USRBIN output with a single command. Auflukatools also includes
rtfluka.sh as well as a CONDOR job submission script,
rcfluka.py, which is better suited for heterogeneous clusters.
Finally, here is a job script for
SHIELD_HITxxA, which is even shorter:
#!/bin/bash
#
# how to use
# change to the directory you want to run in
# $ qsub -V -t 0-9 -d . rtshield.sh
#
#PBS -N SHIELD_JOB
shield_exe -N$PBS_ARRAYID
Enjoy!
Totally unrelated:
englishrussia.com just posted some nice pics from the
Budker Institute of Nuclear Physics in Novosibirsk, Russia. Certainly worth a visit; have a look at:
http://englishrussia.com/2012/01/21/the-budker-institute-of-nuclear-physics/
:-) Heaps of pioneering accelerator technology were developed there, such as
electron cooling, the first collider, and lithium lenses (e.g. for capturing antiprotons), and they supplied the conventional magnets for the beam transfer lines to the LHC at CERN. I visited the center many years ago, but
my pics are not as good. :-/ The German
wiki about Budker himself is also worth reading.