Commit 7135ac49 authored by Rasmus Ringdahl

feat: add job array example

Closes #3
parent 29e86c08
1 merge request: !3 feat: add job array example
# Job arrays
A job array is a way to queue multiple jobs with similar resource needs. Job arrays are suitable when the same type of computation needs to be run multiple times with different input data.
## How to run
To run the example, follow these steps:
1. Log in to Lundgren
2. Change directory to the example code
3. Run `sbatch job_array.sh`
4. Check queue status by running `squeue`
5. When the job is completed, check the files _job_array_XXX_YY.log_ (one per array task, where XXX is the job number and YY is the array index)
Try to extend the job array with one more run by adding a new file to the _data_ folder and updating the _config.txt_ file. Remember to extend the _--array_ argument in the batch script so that it covers the new index.
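As a minimal sketch of that exercise, the helper below creates a new data file and appends the matching row to _config.txt_. The function name `add_run`, the file name _parameters_fourth_job.txt_ and the sleep values are hypothetical and only serve as illustration; the demo runs in a temporary directory so it does not touch the real example files.

```python
import json
import tempfile
from pathlib import Path


def add_run(root: Path, index: int, data_name: str, sleep_times: list) -> str:
    """Create a data file under root/data and register it in root/config.txt.

    Hypothetical helper, not part of the example repository.
    Returns the row that was appended to config.txt.
    """
    data_path = Path("data") / data_name
    (root / "data").mkdir(parents=True, exist_ok=True)
    payload = {"name": data_name, "sleep": sleep_times}
    (root / data_path).write_text(json.dumps(payload))

    config = root / "config.txt"
    existing = config.read_text() if config.exists() else ""
    # Make sure the new row starts on its own line.
    separator = "" if (not existing or existing.endswith("\n")) else "\n"
    row = f"{index} {data_path.as_posix()}"
    config.write_text(existing + separator + row + "\n")
    return row


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "config.txt").write_text("task file\n3 data/parameters_third_job.txt")
        print(add_run(root, 4, "parameters_fourth_job.txt", [1, 2, 3]))
        print((root / "config.txt").read_text())
```

After adding a row with index 4, the batch script would also need `--array=1-4` for the new run to be scheduled.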
## Detailed description of the example
The batch script is the main file for the job allocation and preparation. Inside the Python script, a few environment variables are fetched and printed. Furthermore, there is a folder, _data_, that contains some illustrative input data, and a _config.txt_ file that maps the array index to the input data.
### The batch script
The batch script, _job_array.sh_, contains four sections. The first section contains input arguments to the Slurm scheduler. The second section loads Python into the environment so it is accessible. In the third section, the _config.txt_ file is read and the name of the file corresponding to the array index is stored. Lastly, the job step is performed with the relevant filename as input argument.
The input arguments are defined with a comment beginning with #SBATCH followed by the argument key and value. For easier readability the long form, prefixed with --, is used.
- __job-name:__ The name of the job is set to _demo_job_array_
- __time:__ The requested time is set to 5 minutes, _00:05:00_
- __ntasks:__ The number of tasks to be performed in this job is set to _1_.
- __cpus-per-task:__ The requested number of cores per task is set to _2_
- __mem-per-cpu:__ The requested memory per core is set to _50 MB_
- __output:__ The standard output should be sent to the file _job_array_%A_%a.log_; %A expands to the job number and %a expands to the array index.
- __array:__ The array is set to _1-3_. This represents the list of array ids that should be created. Each id will be a separate job. The array can use any numbering that suits the user.
_Note: Multiple similar jobs will be run and output files need to be handled in a way so they are not overwritten._
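To illustrate why the _%A_%a_ pattern keeps the log files apart, the sketch below mimics the expansion in plain Python. It is not Slurm's implementation: the job number `12345` is a made-up value, and `parse_array_range` only handles the simple `start-end` form used in this example, not the other forms Slurm accepts.

```python
def expand_output_pattern(pattern: str, job_id: int, array_index: int) -> str:
    """Imitate how %A (job number) and %a (array index) are filled in."""
    return pattern.replace("%A", str(job_id)).replace("%a", str(array_index))


def parse_array_range(spec: str) -> list:
    """Turn a simple 'start-end' array specification into a list of indices."""
    start, end = spec.split("-")
    return list(range(int(start), int(end) + 1))


if __name__ == "__main__":
    # 12345 is a hypothetical job number chosen for illustration.
    for index in parse_array_range("1-3"):
        print(expand_output_pattern("job_array_%A_%a.log", 12345, index))
```

Each array index produces a distinct file name, so the three jobs do not overwrite each other's output.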
Python needs to be loaded into the environment in order to be accessible; this is done in the next step with the __module__ command.
The job step with the single task is allocated and performed with the __srun__ command.
#### The configuration file and data files
The _config.txt_ is a text file containing a simple table: the first column contains the array index and the second column contains the file path to the data file to be loaded into the job. It is important that the indices in the file match the _--array_ argument.
The data files in this example are simple JSON objects but could be CSV files or other file formats.
For simpler applications the data files could be omitted and the _config.txt_ file could contain all relevant data.
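The lookup that the batch script performs on _config.txt_ can be sketched in Python as follows. The function names `lookup_file` and `missing_indices` are hypothetical. Unlike the awk expression in the batch script, this sketch compares the index as a string and returns only the first match, so treat it as an approximation of the same idea.

```python
CONFIG = """task file
1 data/parameters_first_job.txt
2 data/parameters_second_job.txt
3 data/parameters_third_job.txt
"""


def lookup_file(config_text: str, task_id: str):
    """Return the file path in the row whose first column equals task_id."""
    for line in config_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == task_id:
            return fields[1]
    return None  # No row for this array index.


def missing_indices(config_text: str, array_indices) -> list:
    """List the array indices that have no row in the configuration."""
    present = {line.split()[0] for line in config_text.splitlines() if line.split()}
    return [index for index in array_indices if str(index) not in present]


if __name__ == "__main__":
    print(lookup_file(CONFIG, "2"))
    print(missing_indices(CONFIG, range(1, 5)))
```

A check like `missing_indices` could be run before submitting the job to catch a mismatch between the _--array_ argument and the table.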
#### The Python script
The Python script represents the task to be done. In this case the task is to wait for a time given by the input data file and print a message when the waiting is done.
The environment variable __SLURM_CPUS_PER_TASK__ is used to restrict the worker pool to the allocated number of cores.
### How to set the number of cores in different programming languages and software
Most programming languages and software packages try to make use of all available cores. This can lead to oversubscription of the resources. On a shared resource one must match the maximum used resources with the allocated ones. This section gives a reference on how to do this in commonly used software.
- __CPLEX:__ Use the parameter _global thread count_. Read more in the [documentation](https://www.ibm.com/docs/en/icos/22.1.2?topic=parameters-global-thread-count)
- __Gurobi:__ Use the configuration parameter _ThreadLimit_. Read more in the [documentation](https://docs.gurobi.com/projects/optimizer/en/current/reference/parameters.html#threadlimit)
- __MATLAB:__ Create an instance of the parpool object with the _poolsize_ set to the number of cores and use the pool when running in parallel. Read more in the [documentation](https://se.mathworks.com/help/parallel-computing/parpool.html)
- __Python:__ If the multiprocessing package is used, create an instance of the Pool class with _processes_ set to the number of cores and use the pool when running in parallel. Read more in the [documentation](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool)
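For the Python case, a minimal sketch of sizing the pool from the allocation could look like this. The helper `allocated_cores` and its fallback to a single core are assumptions made for illustration; the example script in this repository exits instead when the variable is missing.

```python
import os
from multiprocessing import Pool


def allocated_cores(default: int = 1) -> int:
    """Read SLURM_CPUS_PER_TASK, falling back to `default` when it is unusable."""
    value = os.environ.get("SLURM_CPUS_PER_TASK")
    try:
        cores = int(value)
    except (TypeError, ValueError):
        # Variable not set (None) or not a number.
        return default
    return cores if cores > 0 else default


def square(x: int) -> int:
    return x * x


if __name__ == "__main__":
    # The pool never has more workers than the job was allocated.
    with Pool(processes=allocated_cores()) as pool:
        print(pool.map(square, range(5)))
```

Falling back to one core keeps the script usable outside Slurm without risking oversubscription.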
task file
1 data/parameters_first_job.txt
2 data/parameters_second_job.txt
3 data/parameters_third_job.txt
{"name": "First file", "sleep": [2,12,7,4,4]}
\ No newline at end of file
{"name": "Second file", "sleep": [3,10,4,11,2]}
\ No newline at end of file
{"name": "Third file", "sleep": [12,3,14,10,20]}
\ No newline at end of file
#! /bin/bash
#SBATCH --job-name=demo_job_array
#SBATCH --time=00:05:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=50MB
#SBATCH --output=job_array_%A_%a.log
#SBATCH --array=1-3
# Loading Python into the environment
module load python/anaconda3-2024.02-3.11.7
# Specify the path to the config file
config=config.txt
# Extract the file name for the current $SLURM_ARRAY_TASK_ID
file=$(awk -v task="$SLURM_ARRAY_TASK_ID" '$1==task {print $2}' "$config")
# Start job stage
srun python job_array_task.py "${file}"
\ No newline at end of file
from datetime import datetime
from multiprocessing import Pool
import json
import logging
import os
import sys
import time

logger = logging.getLogger(__name__)


def sleep(task):
    # Each task is a tuple of (task index, seconds to sleep).
    time.sleep(task[1])
    logger.info('Task %d done.', task[0])


def main(filename: str):
    # Read environment variables.
    NUMBER_OF_CORES = os.environ.get('SLURM_CPUS_PER_TASK', 'Unknown')
    if NUMBER_OF_CORES == 'Unknown':
        logger.error('Unknown number of cores, exiting.')
        return

    NUMBER_OF_CORES = int(NUMBER_OF_CORES)
    logger.info('Running program with %d cores.', NUMBER_OF_CORES)

    # Reading configuration file and create a list of tasks
    # This represents the reading of parameters and calculations
    logger.info('Reading configuration from %s.', filename)
    with open(filename, 'r') as file:
        data = json.load(file)

    tasks = []
    total_time = 0
    # "duration" avoids shadowing the imported time module.
    for i, duration in enumerate(data['sleep']):
        tasks.append((i, duration))
        total_time = total_time + duration

    # Creating a multiprocessing pool to perform the tasks
    with Pool(processes=NUMBER_OF_CORES) as pool:
        # Submitting the tasks to the worker pool
        tic = datetime.now()
        logger.info('Submitting tasks to pool.')
        pool.map(sleep, tasks)
        toc = datetime.now()

    logger.info('All tasks are done, took %d seconds, compared to %d seconds with single thread.',
                (toc-tic).seconds, total_time)


if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    filename = sys.argv[1]
    main(filename)
@@ -25,4 +25,9 @@ Learn more about the [example](https://gitlab.liu.se/rasri17/lundgren-examples/-
#### Example 2 - Multi core job
A multi core job is a job that splits the computation across multiple cores. This type of job is the most suitable and most common one to run on Lundgren. This includes optimization problems and heavy computations.
Learn more about the [example](https://gitlab.liu.se/rasri17/lundgren-examples/-/tree/main/2_multi_core_job).
\ No newline at end of file
#### Example 3 - Job arrays
A job array is a way to queue multiple jobs with similar resource needs. Job arrays are suitable when the same type of computation needs to be run multiple times with different input data.
Learn more about the [example](https://gitlab.liu.se/rasri17/lundgren-examples/-/tree/main/3_job_array).
\ No newline at end of file