How to handle Python job cancellation in the Slurm job manager
By Dmitry Kabanov
This post discusses how to use Python's built-in signal module to handle the signals
that Slurm sends to warn a running job that its time limit is about to expire.
If you use the Slurm job manager to run jobs on a shared cluster, it often happens
that your job reaches its time limit and is terminated by Slurm.
To give the user a chance to react, Slurm terminates the job in two stages:
first, the job receives the SIGTERM signal, which indicates that the job will be
killed soon, and then the job receives the SIGKILL signal, which actually kills it.
The time interval between these two signals is specified via Slurm’s
configuration parameter KillWait. This information comes from the description
of the --time parameter in the documentation for the sbatch command.
Now, to actually handle the SIGTERM signal in Python, use the built-in signal
module and write a callback function that handles this
signal:
import signal


class TimeLimitException(Exception):
    pass


def handle_signal(signal_num, frame):
    if signal_num == signal.SIGTERM:
        raise TimeLimitException()
Note that we raise a custom exception so that the main code of the script can catch this particular exception and end the job gracefully (write the results of the computations to a file, etc.).
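To see the handler turn a signal into an exception without waiting for a real time limit, the process can send SIGTERM to itself. The following is a minimal sketch that assumes a POSIX system and that the code runs in the main thread; the helper name simulate_sigterm is made up for this illustration.

```python
import os
import signal
import time


class TimeLimitException(Exception):
    """Raised by the handler when SIGTERM arrives."""


def handle_signal(signal_num, frame):
    # Convert the asynchronous signal into an ordinary Python exception.
    if signal_num == signal.SIGTERM:
        raise TimeLimitException()


def simulate_sigterm():
    """Send SIGTERM to this process; return True if the exception was caught.

    Hypothetical helper for local experiments, not part of any Slurm API.
    """
    previous = signal.signal(signal.SIGTERM, handle_signal)
    try:
        os.kill(os.getpid(), signal.SIGTERM)
        # The handler normally fires as soon as os.kill returns; the short
        # sleep only gives the interpreter room to run it if it has not yet.
        time.sleep(1)
        return False
    except TimeLimitException:
        return True
    finally:
        # Restore whatever handler was installed before the experiment.
        signal.signal(signal.SIGTERM, previous)
```

Calling simulate_sigterm() should return True, which shows that the exception surfaces in the code that was running when the signal arrived.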
In the __main__ part of the Python script, we must register the handler and
wrap the computations in a try-except block:
if __name__ == "__main__":
    signal.signal(signal.SIGTERM, handle_signal)
    try:
        pass  # do computations
    except TimeLimitException:
        pass  # handle approaching-time-limit event by saving files, etc.
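As a slightly fuller illustration of the same pattern, the sketch below fills the try block with a toy computation (a running sum of squares) and writes whatever has been computed so far to a JSON file when the exception arrives. The function run_job, its parameters, and the result layout are assumptions made for this example; the interrupt_at argument merely stands in for Slurm by sending SIGTERM to the process itself, so a POSIX system is assumed.

```python
import json
import os
import signal


class TimeLimitException(Exception):
    pass


def handle_signal(signal_num, frame):
    if signal_num == signal.SIGTERM:
        raise TimeLimitException()


def run_job(output_path, n_steps, interrupt_at=None):
    """Sum squares step by step and save the (possibly partial) result."""
    results = {"completed_steps": 0, "total": 0, "finished": False}
    previous = signal.signal(signal.SIGTERM, handle_signal)
    try:
        for step in range(n_steps):
            if step == interrupt_at:
                # Stand-in for Slurm: deliver SIGTERM to this process.
                os.kill(os.getpid(), signal.SIGTERM)
            results["total"] += step * step
            results["completed_steps"] = step + 1
        results["finished"] = True
    except TimeLimitException:
        # Time is almost up: stop computing and fall through to saving.
        pass
    finally:
        signal.signal(signal.SIGTERM, previous)
    # Save whatever is available, complete or not.
    with open(output_path, "w") as f:
        json.dump(results, f)
    return results
```

In a real job the save step has to finish within the KillWait interval, so it is worth keeping it short.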
That’s it! Now you should be able to write Python jobs for Slurm that can save their results when the job approaches its time limit.
This post was written with the help of wonderful people from Stack Overflow.
UPDATE 2020-12-21.
For some reason, the signal propagates to the child process (the Python script) only
when the process is started under srun, even if it is a serial computation.
That is, inside your job script you should start the computations
using a command such as
srun python run_my_computations.py
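Since everything depends on SIGTERM actually reaching the Python process, it can help to check a script's reaction locally before submitting it. The sketch below plays the role of the sender: it starts a child Python process, waits until the child reports that its handler is installed, sends SIGTERM, and collects the child's output. The child source, the "ready" handshake, and the function name run_child_and_terminate are all assumptions made for this illustration (POSIX only); it does not involve Slurm or srun.

```python
import signal
import subprocess
import sys
import textwrap

# Source of a toy child script that follows the pattern from this post.
CHILD_SOURCE = textwrap.dedent("""
    import signal
    import time

    class TimeLimitException(Exception):
        pass

    def handle_signal(signal_num, frame):
        if signal_num == signal.SIGTERM:
            raise TimeLimitException()

    signal.signal(signal.SIGTERM, handle_signal)
    try:
        print("ready", flush=True)   # tell the parent the handler is set
        time.sleep(30)               # stand-in for a long computation
        print("finished", flush=True)
    except TimeLimitException:
        print("saved partial results", flush=True)
""")


def run_child_and_terminate():
    """Start the child, send it SIGTERM, return (output, exit code)."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE],
        stdout=subprocess.PIPE,
        text=True,
    )
    # Wait for the handshake so the signal is not sent too early.
    if child.stdout.readline().strip() != "ready":
        child.kill()
        raise RuntimeError("child did not start as expected")
    child.send_signal(signal.SIGTERM)
    remaining, _ = child.communicate(timeout=10)
    return remaining.strip(), child.returncode
```

If the handler works, the child reports that it saved its partial results and exits with code 0 instead of being killed by the signal.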