====== SLURM ======
Our cluster runs [[https://slurm.schedmd.com/|SLURM]] as its batch queuing system.
The queuing system gives you access to computers owned by LCM, LTHC, LTHI, LINX and the IC Faculty; sharing the computational resources among as many groups as possible results in a more efficient use of the resources (including yours).
On the other hand, since the machines are shared with other users, your jobs may have to wait in the queue until the resources they request become available.

We have configured the system almost without access restrictions because the queuing system can make a more efficient use of the cluster if it does not have to satisfy too many constraints. We currently enforce only these constraints (see the example below for how to request them):
  - number of CPUs/cores: you must indicate the correct number of cores you are going to use;
  - Megabytes/Gigabytes of memory: you must indicate the maximum amount of memory your job is going to use;
  - execution time: if your job is not completed within the indicated time, it will be automatically terminated.
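As an illustration only (the values below are placeholders, not recommended settings), the three constraints map to standard ''sbatch'' options:
<code>
#SBATCH --cpus-per-task=4     # number of cores
#SBATCH --mem=2G              # memory for the whole job
#SBATCH --time=01:30:00       # maximum run time (HH:MM:SS)
</code>
The same options can also be given directly on the ''sbatch'' command line.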

Here we provide just a quick and dirty guide to the most basic commands and options; for anything more advanced please refer to the official SLURM documentation, for example:
  - [[https://slurm.schedmd.com/quickstart.html|Quick Start User Guide]]
  - [[https://slurm.schedmd.com/man_index.html|man pages of the SLURM commands]]
  - [[https://slurm.schedmd.com/documentation.html|full documentation]]
==== partitions (a.k.a. queues) ====
If you have used other types of cluster management systems, you probably know the concept of a //queue//: in SLURM the equivalent concept is called a //partition//. The available partitions are listed by ''sinfo'' (see below).
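For instance, to send a job to a specific partition (here ''slurm-ws'', one of the partitions shown by ''sinfo'' below; adapt it to your case), you can write in the job script:
<code>
#SBATCH --partition=slurm-ws
</code>
If no partition is requested, the default one (marked with a ''*'' in the ''sinfo'' output) is used.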
===== Mini User Guide =====
The most used/useful commands are:
  - ''sinfo'' shows the available partitions and the state of their nodes;
  - ''squeue'' shows the list of queued and running jobs;
  - ''sbatch'' submits a job script to the cluster;
  - ''scancel'' cancels a queued or running job.

  * ''sinfo'' lists the partitions, their time limits and the nodes they contain:
<code>
$ sinfo
PARTITION      AVAIL  TIMELIMIT  NODES  STATE  NODELIST
slurm-cluster* ...
slurm-ws       ...
slurm-ws       ...
</code>
  * ''squeue'' shows the jobs that are currently running or waiting in the queue:
<code>
$ squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   550 slurm-clu ...
   551 slurm-clu script.s rmarino1 PD ...
   549 slurm-clu ...
   548 slurm-clu ...
</code>
Here you can see that the command provides the ID of the jobs, the PARTITION used to run the jobs (hence the nodes where these jobs will run), the NAME assigned to the jobs, the name of the USER that submitted the jobs, the STATUS of the job (R=Run, PD=Waiting), the TIME used so far, and the node(s) allocated to the job (or the reason why it is still pending).
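When the queue is long it is handy to restrict the output; for example (standard ''squeue'' options, shown here only as a hint):
<code>
$ squeue -u $(whoami)     # only your own jobs
$ squeue -t PENDING       # only jobs that are still waiting
$ squeue -j 551           # only the job with ID 551
</code>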

  * ''sbatch'' is used to submit jobs, i.e. short scripts in which the program to be executed is launched.
Once a job is submitted (and accepted by the cluster), you'll receive the ID assigned to the job:
<code>
$ sbatch sheepit.slurm
Submitted batch job 552
</code>
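The job ID can be reused in other commands, for instance to inspect or cancel the job. As a hint, ''sbatch --parsable'' prints only the ID, which makes it easy to capture in a shell script (the file name is the example script above):
<code>
$ jobid=$(sbatch --parsable sheepit.slurm)
$ squeue -j "$jobid"
</code>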
  * ''squeue'' can then be used to check that the job has been accepted and is queued or running;

  * ''scancel'' followed by the job ID removes a job from the queue (and kills it if it is already running):
<code>
$ squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   552 slurm-clu  sheepit ...
   550 slurm-clu ...
   551 slurm-clu script.s rmarino1 PD ...
   549 slurm-clu ...
   548 slurm-clu ...
$ scancel 552
$ squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   550 slurm-clu ...
   551 slurm-clu script.s rmarino1 PD ...
   549 slurm-clu ...
   548 slurm-clu ...
</code>
=== Scripts (used with sbatch) ===
It is convenient to write the job script in a file, not only because the script can then be reused, but also because the ''sbatch'' options can be set directly inside the script itself. A minimal example (adapt the requested resources to your own job):
<code>
$ cat sheepit.slurm
#!/bin/bash
#
#SBATCH --job-name=sheepit
#SBATCH --output=sheepit-%j.out
#
# adjust the values below to the real needs of your job
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=100M
#SBATCH --time=00:05:00
#
echo "$(hostname) $(date)"
cd ${HOME}/
srun sleep 60
echo "$(hostname) $(date)"
</code>
<note>
At the beginning of the file you can read the line ''#!/bin/bash'': a job script is nothing more than a normal shell script.
</note>
Inside a script, all the lines that start with the ''#'' character are comments for the shell, but the lines that start with ''#SBATCH'' are read by ''sbatch'' as options for the submission.
The example above instructs the queuing system to:
  * ''#SBATCH --job-name=sheepit'': assign the name //sheepit// to the job;
  * ''#SBATCH --output=sheepit-%j.out'': write the output of the job to this file (''%j'' is replaced by the job ID);
  * ''#SBATCH --ntasks=1'' and ''#SBATCH --cpus-per-task=1'': reserve a single core;
  * ''#SBATCH --mem=100M'': reserve 100 MB of memory;
  * ''#SBATCH --time=00:05:00'': limit the run time to 5 minutes;
  * ''srun sleep 60'': launch the actual program (here just a 60 second ''sleep'') on the allocated resources.
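Every ''#SBATCH'' line can also be given (or overridden) on the command line at submission time; for example, to resubmit the same script with different limits (the values here are only illustrative):
<code>
$ sbatch --time=02:00:00 --mem=1G sheepit.slurm
</code>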
At the moment we have defined these resources:
  * ''...''
and these properties/features:
  * ''...''

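How a job asks for one of these is standard SLURM syntax; as a sketch (the resource and feature names below are placeholders, use the ones actually defined on our nodes):
<code>
#SBATCH --gres=gpu:1               # request one unit of a generic resource (here a GPU)
#SBATCH --constraint=somefeature   # request nodes that have a given property/feature
</code>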
<note>
Please pay attention that ''...'':
  * ''#SBATCH ...''
  * ''#SBATCH ...''
</note>

<note important>
It is **mandatory** to specify at least the estimated run time of the job and the memory needed, so the scheduler can optimize the nodes/cores usage and the overall cluster throughput. If your job exceeds the limits you set, it will be automatically killed by the cluster manager.

Please keep in mind that longer jobs are less likely to enter the queue when the cluster load is high. Therefore, don't be lazy and do not always ask for //a lot// of time: give a realistic estimate.
</note>
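When estimating, note that ''--time'' accepts several formats (plain minutes, ''HH:MM:SS'', or ''days-hours:minutes:seconds'') and ''--mem'' accepts ''K''/''M''/''G'' suffixes; a quick illustration (the values are arbitrary):
<code>
#SBATCH --time=30            # 30 minutes
## or, for longer jobs:
#SBATCH --time=2-12:00:00    # 2 days and 12 hours
#SBATCH --mem=4G             # 4 GB of memory for the whole job
</code>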
Here you can find some useful example scripts:
^Script ^How to submit it^
|{{base.slurm|Base example script}} contains most of the useful options|sbatch [sbatch options] base.slurm|
|{{matlab.slurm|Script example for running matlab computations}}|sbatch [sbatch options] matlab.slurm|
|{{mathematica.slurm|Script example for running Mathematica computations}}|sbatch [sbatch options] mathematica.slurm|
|{{wine.slurm|Script example for windows programs (executed under wine)}}|sbatch [sbatch options] wine.slurm|
\\
The shell running the sbatch script will have access to various SLURM environment variables that might be useful, for example:
  * ''SLURM_JOB_ID'': the ID of the job;
  * ''SLURM_JOB_NAME'': the name of the job;
  * ''SLURM_SUBMIT_DIR'': the directory from which the job was submitted;
  * ''SLURM_CPUS_PER_TASK'': the number of cores requested per task.
See the man page for more details.
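As a quick illustration (a hypothetical snippet, not part of the example scripts above), these variables can be used inside the job script like any other shell variable:
<code>
echo "job ${SLURM_JOB_ID} (${SLURM_JOB_NAME}) started in ${SLURM_SUBMIT_DIR}"
cd "${SLURM_SUBMIT_DIR}"
</code>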
===== Tips and Tricks =====
=== Delete all queued jobs ===
<code>
squeue -u $(whoami) -t PENDING -h -o "%i" | xargs -r scancel
</code>
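Depending on the SLURM version, the same can be achieved without the pipe, since ''scancel'' can filter by user and state itself (this shorter form is only a hint to try):
<code>
scancel --user=$(whoami) --state=PENDING
</code>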