How to handle Python job cancellation in the Slurm job manager
By Dmitry Kabanov
This post discusses how to use Python's built-in signal module to process the
signals that Slurm sends to inform a running job that its time is expiring.
If you use the Slurm job manager to run jobs on a shared cluster, it often happens
that your job reaches the time limit and is terminated by Slurm.
To allow a user to deal with the job termination, Slurm does this in two stages:
first, the job receives the SIGTERM signal, which indicates that the job will be
killed soon, and then the job receives the SIGKILL signal, which actually kills it.
The time interval between these two signals is specified via Slurm’s
configuration parameter KillWait. This information comes from the description of
the --time parameter in the documentation for the sbatch command.
Now, to actually handle the SIGTERM signal in Python, one should use the built-in
signal module and register a callback function that handles this signal:
import signal

class TimeLimitException(Exception):
    pass

def handle_signal(signal_num, frame):
    # Convert SIGTERM into an exception that the main code can catch.
    if signal_num == signal.SIGTERM:
        raise TimeLimitException()
Note that we raise a custom exception so that the main code of the script can catch this particular exception and end the job gracefully (write the results of the computations to a file, etc.).
In the __main__ part of the Python script we must register the handler and
wrap the computations in a try-except block:
if __name__ == "__main__":
    # Register the handler before starting the long-running work.
    signal.signal(signal.SIGTERM, handle_signal)
    try:
        pass  # do computations
    except TimeLimitException:
        pass  # handle approaching-time-limit event by saving files, etc.
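Putting the pieces together, the following is a minimal runnable sketch of the whole pattern. The function run_job, its parameters, the JSON checkpoint format, and the simulate_sigterm_at switch are hypothetical names introduced only for illustration. Since there is no Slurm here, the script sends SIGTERM to itself with os.kill to stand in for the scheduler (this assumes a POSIX system and that the code runs in the main thread).

```python
import json
import os
import signal
import tempfile


class TimeLimitException(Exception):
    """Raised by the handler when SIGTERM arrives."""


def handle_signal(signal_num, frame):
    if signal_num == signal.SIGTERM:
        raise TimeLimitException()


def run_job(n_steps, checkpoint_path, simulate_sigterm_at=None):
    """Compute squares step by step; save partial results on SIGTERM."""
    signal.signal(signal.SIGTERM, handle_signal)
    results = []
    finished = True
    try:
        for step in range(n_steps):
            if step == simulate_sigterm_at:
                # Stand-in for Slurm: deliver SIGTERM to this process.
                os.kill(os.getpid(), signal.SIGTERM)
            results.append(step * step)
    except TimeLimitException:
        # The time limit is approaching: stop computing.
        finished = False
    # Write whatever has been computed so far.
    with open(checkpoint_path, "w") as f:
        json.dump({"finished": finished, "results": results}, f)
    return finished, results


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "partial_results.json")
    print(run_job(10, path, simulate_sigterm_at=3))
```

Keeping the save step short matters in a real job, because everything in the except branch has to finish before SIGKILL arrives.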
That’s it! Now you should be able to write Python jobs for Slurm that can save their results when the job approaches the time limit.
This post was written with the help of wonderful people from Stack Overflow.
UPDATE 2020-12-21.
For some reason, the signal propagates to the child process (the Python script)
only when the process is started under srun, even if it is a serial computation.
That is, inside your job script you should start the computations
using a command such as
srun python run_my_computations.py