The most detailed guide to installing a minimal Slurm
April 7, 2022

This article walks you through installing Slurm 19.05 on Ubuntu 20.04. You only need one machine, and no database is installed. Besides the installation itself, this article verifies that the installation works and gives an example of submitting a job.
Install
$ sudo apt install -y slurm-wlm
Some articles ask you to install munge[1], an authentication plugin that identifies the user originating a message[2]. The Slurm website says munge is a plugin, but it is unclear whether this plugin is mandatory.
I won’t settle whether you must install it; just follow this tutorial!
Download https://gist.github.com/gqqnbig/8a1e5082ec1c974a84fdd8abd1a4fbf6 and make sure the paths in its first three lines are correct. After running the script, do not run any systemctl-related commands!
As of now, Slurm has been installed and configured, but whether it actually works is still in question. We will not use systemctl, because systemctl adds an extra layer of wrapping: an error could come from systemctl or from Slurm, and the two are hard to tell apart. Hence, we are about to run Slurm manually.
According to the documentation, slurmctld runs on the master node and is responsible for monitoring resources and jobs; slurmd runs on the compute nodes (slave nodes) and is responsible for executing jobs. Since we only have one machine, we must run both slurmctld and slurmd on it. Note that the Slurm software does not ship a command named slurm.
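For orientation, here is a rough sketch of what a single-machine /etc/slurm-llnl/slurm.conf contains. The gist above is the authoritative version; the cluster name and CPU count below are illustrative, not taken from it.

ClusterName=localcluster
SlurmctldHost=localhost
NodeName=localhost CPUs=8 State=UNKNOWN
PartitionName=normal Nodes=localhost Default=YES MaxTime=INFINITE State=UP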
Verify
Run slurmctld
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up infinite 1 unk* localhost
Notice the state is unk*, short for unknown*. “unknown” means the controller node doesn’t know the state of localhost, which is expected because you haven’t started slurmctld yet.
The * means localhost is not reachable[3], which is expected because you haven’t started slurmd.
$ sudo slurmctld -c -D
slurmctld: error: Unable to open pidfile `/var/run/slurm/slurmctld.pid': No such file or directory
slurmctld: error: Configured MailProg is invalid
slurmctld: slurmctld version 19.05.5 started on cluster 35adae022fc4478592f75c3b4c97bce1
slurmctld: No memory enforcing mechanism configured.
slurmctld: layouts: no layout to initialize
slurmctld: layouts: loading entities/relations information
slurmctld: Recovered state of 0 reservations
slurmctld: _preserve_plugins: backup_controller not specified
slurmctld: Running as primary controller
slurmctld: No parameter for mcs plugin, default values set
An error pops up immediately on the first line. Opening /etc/slurm-llnl/slurm.conf, we find that this path is specified by SlurmctldPidFile, and that /var/run doesn’t contain a slurm folder.
In my experience, Slurm does not create folders by itself, so we change /etc/slurm-llnl/slurm.conf to
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
and run it again:
$ sudo slurmctld -c -D
slurmctld: error: Configured MailProg is invalid
slurmctld: slurmctld version 19.05.5 started on cluster 35adae022fc4478592f75c3b4c97bce1
slurmctld: No memory enforcing mechanism configured.
slurmctld: layouts: no layout to initialize
slurmctld: layouts: loading entities/relations information
slurmctld: Recovered state of 0 reservations
slurmctld: _preserve_plugins: backup_controller not specified
slurmctld: Running as primary controller
slurmctld: No parameter for mcs plugin, default values set
The pidfile error is gone. “error: Configured MailProg is invalid” is not a big issue, and we will save it for later. Slurm can send an email after a job finishes; this error means the mail program is misconfigured.
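For when we do get to it: the mail program is set by MailProg in /etc/slurm-llnl/slurm.conf. Pointing it at any existing executable silences the error; /bin/true below is my stand-in for people who don’t care about email, not something the install script configures.

MailProg=/bin/true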
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up infinite 1 down* localhost

The state changed from unk* to down*: the controller now knows localhost is down, and the * still marks it as unreachable.
Next we run slurmd.
Run slurmd
$ sudo slurmd -D
slurmd: Message aggregation disabled
slurmd: WARNING: A line in gres.conf for GRES gpu has 1 more configured than expected in slurm.conf. Ignoring extra GRES.
slurmd: slurmd version 19.05.5 started
slurmd: slurmd started on Thu, 28 Jan 2021 02:59:20 +0000
slurmd: CPUs=48 Boards=1 Sockets=2 Cores=12 Threads=2 Memory=128573 TmpDisk=111654 Uptime=2860 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
We also find “slurmctld: Node localhost now responding” in the slurmctld log. Then we run sinfo:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up infinite 1 down localhost
At this point the communication between the controller and the compute node works, except for the warning on the second line: “WARNING: A line in gres.conf for GRES gpu has 1 more configured than expected in slurm.conf. Ignoring extra GRES.”
Open /etc/slurm-llnl/gres.conf, and we see

Name=gpu File=/dev/nvidia[0-1]

which means the install script recognized two GPUs on this machine.
Open /etc/slurm-llnl/slurm.conf, and we see

NodeName=localhost Gres=gpu CPUs=8 Boards=1 SocketsPerBoard=2 ...

which is incorrect. Per “slurm.conf – Slurm configuration file”, the Gres value must carry a count, so we change it to
NodeName=localhost Gres=gpu:2 CPUs=8 Boards=1 SocketsPerBoard=2 ...
Restart slurmctld and slurmd; the error is gone.
Next we test whether submitting jobs works.
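To double-check the GRES change, scontrol can show the node as slurmctld sees it; and if sinfo still reports the node down after the restart, the second command returns it to service (only needed if it is stuck in that state):

$ scontrol show node localhost | grep -i gres
$ sudo scontrol update NodeName=localhost State=RESUME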
Submit jobs
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up infinite 1 idle localhost
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
$ scontrol show job
No jobs in the system
The output above shows the single-machine Slurm cluster is healthy: there are no jobs at the moment and the queue is empty.
Now we submit a job that runs the program hostname, a Linux built-in that prints the name of the current machine. -N1 means hostname should run on 1 node.
$ srun -N1 hostname
gqqnbig
If this command hangs and never returns, it may be a firewall problem.
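A hedged pointer for that case: by default slurmctld listens on TCP port 6817 and slurmd on 6818 (SlurmctldPort and SlurmdPort in slurm.conf), and srun additionally uses ephemeral ports. On an Ubuntu box with ufw you might open the daemon ports like this:

$ sudo ufw allow 6817:6818/tcp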
Next we submit another hostname job with --nodes=2-3, meaning the program should run on 2 to 3 nodes. Obviously, this article only configures a single-machine cluster without that many compute nodes, so the command will wait forever.
$ srun --nodes=2-3 hostname
srun: Requested partition configuration not available now
srun: job 17 queued and waiting for resources
Run this in another terminal window:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
17 normal hostname gqqnb PD 0:00 2 (PartitionNodeLimit)
Job 17 is waiting. Go back to the original terminal window and press Ctrl+C to cancel job 17. For what other arguments --nodes accepts, see man srun.
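All the jobs above use srun, which blocks until the job ends. sbatch instead submits a script and returns immediately. A minimal sketch, where the file name hello.sh and the job name are made up:

#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=1
#SBATCH --output=hello-%j.out

# Same payload as the srun example: print the machine name.
hostname

Submit it with sbatch hello.sh; squeue shows its progress and the output lands in hello-<jobid>.out.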
Test resource limits
We use the C++ code from 《浅谈Linux Cgroups机制》 (A Brief Discussion of the Linux Cgroups Mechanism).
#include <unistd.h>
#include <stdio.h>
#include <cstring>
#include <thread>

// Spin forever to keep one CPU at 100%.
void test_cpu() {
    printf("thread: test_cpu start\n");
    int total = 0;
    while (1) {
        ++total;
    }
}

// Allocate 10 MB per second, 20 times, without freeing.
void test_mem() {
    printf("thread: test_mem start\n");
    int step = 20;
    int size = 10 * 1024 * 1024; // 10 MB
    for (int i = 0; i < step; ++i) {
        char* tmp = new char[size];
        memset(tmp, i, size);
        sleep(1);
    }
    printf("thread: test_mem done\n");
}

int main(int argc, char** argv) {
    std::thread t1(test_cpu);
    std::thread t2(test_mem);
    t1.join();
    t2.join();
    return 0;
}
Compile and run it:

$ g++ -o test test.cc --std=c++11 -lpthread
$ ./test
htop
We see CPU usage at 100% and memory slowly climbing to about 400 MB. (If htop shows three test entries, press H to switch to process view; then only one is shown.) Now run srun ./test; the memory usage is the same.
Limit memory
Now we set up a memory limit. Write the following into /etc/slurm-llnl/cgroup.conf:

CgroupAutomount=yes
MaxRAMPercent=0.1
ConstrainRAMSpace=yes
In /etc/slurm-llnl/slurm.conf, add or modify:

ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
Restart slurmctld and slurmd, making sure no errors are reported. Now run srun ./test again: RES is capped at about 12 MB while VIRT remains large, which shows the memory limit works. (We did not limit virtual memory.)
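To see the limit Slurm actually applied, you can peek at the cgroup that slurmd creates while srun ./test is running. The path below assumes the cgroup v1 memory hierarchy of a stock Ubuntu 20.04; the layout may differ on your system:

$ cat /sys/fs/cgroup/memory/slurm/uid_$(id -u)/job_*/memory.limit_in_bytes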
Limit CPU
Create the file test-cpu.py below and run it. It prints the number of CPUs and creates that many busy threads.
#!/usr/bin/env python3
import threading, multiprocessing

print(multiprocessing.cpu_count())

def loop():
    # Busy-loop to keep one CPU occupied.
    x = 0
    while True:
        x = x ^ 1

# Start one busy thread per CPU.
for i in range(multiprocessing.cpu_count()):
    t = threading.Thread(target=loop)
    print(f'create a thread {i}...', flush=True)
    t.start()
$ ps -eLF | grep test-cpu | sort -n -k9

Column 9 shows which CPU each test-cpu thread is on; we see test-cpu.py running on several CPUs.
Now limit it with Slurm: run srun -c4 ./test-cpu.py, then run ps again; test-cpu.py now runs on only 4 CPUs.
This shows that when you request CPUs from Slurm, it allocates exactly that many. Although multiprocessing.cpu_count() still reports the real number of CPUs, the job cannot use them all.
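A job can also see its own limit: os.sched_getaffinity(0) reports the CPUs the process may actually run on, unlike multiprocessing.cpu_count(). A small sketch, where the file name show-cpus.py is made up:

#!/usr/bin/env python3
import multiprocessing
import os

# CPUs installed on the machine (not affected by slurm).
print(multiprocessing.cpu_count())
# CPUs this process may actually run on (reflects srun -c).
print(len(os.sched_getaffinity(0)))

srun -c4 ./show-cpus.py should print the machine’s CPU count followed by 4.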
You can also check whether -c took effect via scontrol show job.
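A sketch of that check, with an illustrative job id:

$ scontrol show job 18 | grep NumCPUs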
Request GPU
Run tf-GPU-test.py and request 4 GPUs:

$ srun --gres=gpu:4 python tf-GPU-test.py
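Independently of TensorFlow, you can verify the allocation through the CUDA_VISIBLE_DEVICES variable that Slurm sets for GPU jobs; with this request it should list four device indices:

$ srun --gres=gpu:4 env | grep CUDA_VISIBLE_DEVICES

Request memory
The following script allocates roughly 300 MB of memory.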
import time

if __name__ == '__main__':
    # Needs roughly 300 MB of memory.
    arr = [None] * 10000000
    time.sleep(10)
    for i in range(len(arr)):
        arr[i] = i
    time.sleep(10)
    print('done')
Related discussion: https://stackoverflow.com/questions/52421171/slurm-exceeded-job-memory-limit-with-python-multiprocessing

$ srun --mem=100G hostname
srun: error: Memory specification can not be satisfied
srun: error: Unable to allocate resources: Requested node configuration is not available
Our machine does not have 100 GB of memory, so this job cannot run. Change the command to srun --mem=10M hostname and it runs. This also shows that --mem does not limit the job’s actual memory usage.
References
neurokernel. gpu-cluster-config/slurm.conf. 2016-03-13 [2021-01-28].

Footnotes
1. SLURM single node install. [2021-01-27].
2. Download Slurm. Slurm. [2021-01-27].
3. sinfo - View information about Slurm nodes and partitions. [2021-01-28]. “NODE STATE CODES”