The most detailed guide to installing a minimal Slurm

April 7, 2022

This article walks you through installing Slurm 19.05 on Ubuntu 20.04. You only need one machine. It does not cover installing the database. Besides the installation itself, the article verifies that the installation works and gives an example of submitting a job.

Install

sudo apt install -y slurm-wlm

Some articles ask you to install munge[1], the authentication plugin that "identifies the user originating a message"[2]. The Slurm website says munge is a plugin, but it is unclear whether the plugin must be installed.
I won't tell you whether to install it or not; just follow this tutorial!

Download https://gist.github.com/gqqnbig/8a1e5082ec1c974a84fdd8abd1a4fbf6. Make sure the paths in its first three lines are correct. After running the script, do not run any systemctl-related commands!

At this point Slurm has been installed and configured, but whether it actually works is still an open question. We will not use systemctl, because systemctl adds an extra layer of wrapping: an error could come from systemctl or from Slurm itself, and the two are hard to tell apart. Instead, we will run Slurm manually.

According to the documentation, slurmctld runs on the master node and is responsible for monitoring resources and jobs; slurmd runs on the compute nodes (slave nodes) and is responsible for executing jobs. Since we only have one machine, we must run both slurmctld and slurmd on it. Note that the Slurm software does not ship a command named slurm.
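
For orientation, a single-machine setup simply means slurm.conf declares the same host as both the controller and the only compute node. A rough sketch of the relevant lines (the key names come from the slurm.conf man page; older configs use ControlMachine instead of SlurmctldHost, and the actual values on your machine are whatever the script above wrote):

SlurmctldHost=localhost
NodeName=localhost CPUs=8 State=UNKNOWN
PartitionName=normal Nodes=localhost Default=YES State=UP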

Verify

Run slurmctld

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up   infinite      1   unk* localhost

Notice the state is unk*. unk (unknown) means the controller does not know the state of localhost, which is expected because you haven't started slurmctld yet.
The * means localhost is not reachable[3], which is expected because you haven't started slurmd yet.

$ sudo slurmctld -c -D
slurmctld: error: Unable to open pidfile `/var/run/slurm/slurmctld.pid': No such file or directory
slurmctld: error: Configured MailProg is invalid
slurmctld: slurmctld version 19.05.5 started on cluster 35adae022fc4478592f75c3b4c97bce1
slurmctld: No memory enforcing mechanism configured.
slurmctld: layouts: no layout to initialize
slurmctld: layouts: loading entities/relations information
slurmctld: Recovered state of 0 reservations
slurmctld: _preserve_plugins: backup_controller not specified
slurmctld: Running as primary controller
slurmctld: No parameter for mcs plugin, default values set

An error pops up immediately on the first line. Open /etc/slurm-llnl/slurm.conf: the path is specified by SlurmctldPidFile, and /var/run does not contain a slurm folder.
In my experience Slurm does not create folders for you, so we change /etc/slurm-llnl/slurm.conf to

SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
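
(Alternatively, you could keep the original paths and create the folder yourself, for example

$ sudo mkdir -p /var/run/slurm
$ sudo chown slurm: /var/run/slurm

but /var/run is normally a tmpfs that is emptied on every reboot, so the folder would have to be recreated each time; pointing the pid files directly at /var/run is simpler. The slurm user and group names here are an assumption; check what the package created on your system.)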

Run again:

$ sudo slurmctld -c -D
slurmctld: error: Configured MailProg is invalid
slurmctld: slurmctld version 19.05.5 started on cluster 35adae022fc4478592f75c3b4c97bce1
slurmctld: No memory enforcing mechanism configured.
slurmctld: layouts: no layout to initialize
slurmctld: layouts: loading entities/relations information
slurmctld: Recovered state of 0 reservations
slurmctld: _preserve_plugins: backup_controller not specified
slurmctld: Running as primary controller
slurmctld: No parameter for mcs plugin, default values set

The pidfile error is gone. "error: Configured MailProg is invalid" is not a big issue; we will deal with it later. Slurm can send an email after a job finishes, and this error just means the configured mail program is not usable.
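
If you want to get rid of the message now, one option (a sketch, assuming the mailutils package provides /usr/bin/mail on your system) is:

$ sudo apt install -y mailutils

and then set MailProg=/usr/bin/mail in /etc/slurm-llnl/slurm.conf. Slurm only calls the mail program when a job requests mail notifications, so leaving the error alone is also fine.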

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up   infinite      1  down* localhost

Now the controller knows about localhost but reports it as down*; it still cannot reach it because slurmd has not been started.

Next we run slurmd.

Run slurmd

$ sudo slurmd -D
slurmd: Message aggregation disabled
slurmd: WARNING: A line in gres.conf for GRES gpu has 1 more configured than expected in slurm.conf. Ignoring extra GRES.
slurmd: slurmd version 19.05.5 started
slurmd: slurmd started on Thu, 28 Jan 2021 02:59:20 +0000
slurmd: CPUs=48 Boards=1 Sockets=2 Cores=12 Threads=2 Memory=128573 TmpDisk=111654 Uptime=2860 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

We also see in the slurmctld log "slurmctld: Node localhost now responding". Then we run sinfo:

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up   infinite      1   down localhost

So far the communication between the controller and the compute node works, except for the warning on the second line: "WARNING: A line in gres.conf for GRES gpu has 1 more configured than expected in slurm.conf. Ignoring extra GRES."

Open /etc/slurm-llnl/gres.conf: it contains Name=gpu File=/dev/nvidia[0-1], meaning the install script detected two GPUs on this machine.
Open /etc/slurm-llnl/slurm.conf: it contains NodeName=localhost Gres=gpu CPUs=8 Boards=1 SocketsPerBoard=2 ..., which is incorrect. Per the slurm.conf man page (slurm.conf – Slurm configuration file), the Gres value must include a count, so we change it to

NodeName=localhost Gres=gpu:2 CPUs=8 Boards=1 SocketsPerBoard=2 ...

Restart slurmctld and slurmd; the warning is gone.
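
To double-check that the controller now sees both GPUs, you can ask scontrol for the node's generic resources (a quick sanity check; output abbreviated):

$ scontrol show node localhost | grep -i gres

The Gres field in the output should read gpu:2.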

Next we test whether submitting jobs works.

Submit jobs

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up   infinite      1   idle localhost
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
$ scontrol show job
No jobs in the system

The output above shows that the single-machine Slurm cluster is healthy: there are currently no jobs and the queue is empty.

Now submit a job that runs the program hostname. hostname is a built-in Linux program that prints the name of the current machine. -N1 means hostname should run on 1 node.

$ srun -N1 hostname
gqqnbig

If this command does not return for a long time, it may be a firewall problem.

Next, submit another job running hostname. --nodes=2-3 means the program should run on 2 to 3 nodes. Obviously, this article only configures a single-machine cluster without that many compute nodes, so the command will wait forever.

$ srun --nodes=2-3 hostname
srun: Requested partition configuration not available now
srun: job 17 queued and waiting for resources

In another terminal window, run

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                17    normal hostname    gqqnb PD       0:00      2 (PartitionNodeLimit)

We see that job 17 is waiting. Go back to the original terminal and press Ctrl+C to cancel job 17. For what other arguments --nodes accepts, see man srun.
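
Besides srun, jobs are commonly submitted as batch scripts with sbatch. The following is only a sketch (the file name job.sh and the #SBATCH values are made-up examples, not something this article set up):

#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=hello-%j.out
hostname

Submit it with sbatch job.sh; sbatch prints the new job id, which you can then watch with squeue or inspect with scontrol show job.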

Test resource limits

We use the C++ test program from the article 《浅谈Linux Cgroups机制》 (A Brief Look at the Linux Cgroups Mechanism).

#include <unistd.h>
#include <stdio.h>
#include <cstring>
#include <thread>

// Burn one CPU core with a busy loop.
void test_cpu() {
    printf("thread: test_cpu start\n");
    int total = 0;
    while (1) {
        ++total;
    }
}

// Allocate 10 MiB every second for 20 seconds and never free it.
void test_mem() {
    printf("thread: test_mem start\n");
    int step = 20;
    int size = 10 * 1024 * 1024; // 10 MiB
    for (int i = 0; i < step; ++i) {
        char* tmp = new char[size];
        memset(tmp, i, size); // touch the pages so they are actually committed
        sleep(1);
    }
    printf("thread: test_mem done\n");
}

int main(int argc, char** argv) {
    std::thread t1(test_cpu);
    std::thread t2(test_mem);
    t1.join(); // test_cpu never returns, so the program runs until killed
    t2.join();
    return 0;
}

Compile and run it, then watch it with htop in another terminal (since ./test does not return):

$ g++ -o test test.cc --std=c++11 -lpthread
$ ./test
$ htop

CPU usage goes to 100% and memory slowly grows to about 400 MB. (If htop shows three test entries, press H to switch to process view and only one will be shown.) Running srun ./test gives the same memory usage.

Limit memory

Now let's enforce a memory limit. Write the following into /etc/slurm-llnl/cgroup.conf:

CgroupAutomount=yes
MaxRAMPercent=0.1
ConstrainRAMSpace=yes

In /etc/slurm-llnl/slurm.conf, add or modify:

ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

Restart slurmctld and slurmd and make sure nothing complains. Now run srun ./test again: RES is capped at about 12 MB while VIRT stays large, so the memory limit works. (We have not limited virtual memory.)
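
To confirm the new plugin settings were actually picked up after the restart, you can query the running configuration (just a sanity check):

$ scontrol show config | grep -E 'ProctrackType|TaskPlugin'

Both lines should report the cgroup plugins.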

Limit CPU

Create the file test-cpu.py and run it. test-cpu.py prints the number of CPUs and creates that many threads.

#!/usr/bin/env python3

import threading, multiprocessing

# Print the number of CPUs the system reports.
print(multiprocessing.cpu_count())

# Busy loop: each thread keeps one CPU fully occupied.
def loop():
    x = 0
    while True:
        x = x ^ 1

# Start one busy thread per reported CPU.
for i in range(multiprocessing.cpu_count()):
    t = threading.Thread(target=loop)
    print(f'create a thread {i}...', flush=True)
    t.start()

Run ps -eLF | grep test-cpu | sort -n -k9. The 9th column shows which CPU each test-cpu thread is on; we find that test-cpu.py is running on many CPUs.

Now limit it with Slurm: run srun -c4 ./test-cpu.py, then run ps again, and test-cpu.py now runs on only 4 CPUs.

This shows that Slurm allocates only as many CPUs as you request. Although multiprocessing.cpu_count() still reports the real number of CPUs, the job cannot use them all.

You can also check with scontrol show job whether the -c setting took effect.
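
Another quick check, as a sketch: SLURM_CPUS_PER_TASK is the environment variable Slurm sets when -c is given, and nproc only counts the processing units the process is actually allowed to use, so if the limit is enforced it should print a small number such as 4 rather than the machine's full CPU count.

$ srun -c4 bash -c 'echo $SLURM_CPUS_PER_TASK; nproc'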

Request GPU

Run tf-GPU-test.py and request 4 GPUs.

srun --gres=gpu:4 python tf-GPU-test.py
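
To see which GPUs Slurm actually granted to a job, one quick check (a sketch; with the gres/gpu plugin Slurm normally exports CUDA_VISIBLE_DEVICES inside the job) is:

$ srun --gres=gpu:1 bash -c 'echo $CUDA_VISIBLE_DEVICES'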
import time

if __name__ == '__main__':
    # needs roughly 300 MB of memory
    arr = [None] * 10000000

    time.sleep(10)
    for i in range(len(arr)):
        arr[i] = i

    time.sleep(10)
    print('done')

https://stackoverflow.com/questions/52421171/slurm-exceeded-job-memory-limit-with-python-multiprocessing

$ srun --mem=100G hostname
srun: error: Memory specification can not be satisfied
srun: error: Unable to allocate resources: Requested node configuration is not available

Our machine does not have 100 GB of memory, so this job cannot run. Change the command to srun --mem=10M hostname and it runs. This also shows that --mem does not limit the job's actual memory usage.

Released under CC BY-SA 3.0.
Republishing on any website under Baidu is not allowed.

References

neurokernel. gpu-cluster-config/slurm.conf. 2016-03-13 [2021-01-28].

  1. SLURM single node install. [2021-01-27].
  2. Download Slurm. Slurm. [2021-01-27].
  3. sinfo - View information about Slurm nodes and partitions. [2021-01-28]. "NODE STATE CODES"