Cluster infrastructure

HTCondor

The HTCondor infrastructure is composed of 3 main parts:
  • A Collector, which collects information from the worker nodes (CPU, RAM, …)

  • A Scheduler, which allows users to submit jobs to HTCondor

  • A Negotiator, which acts as the “bridge” between the workers (Collector) and the users (Scheduler).

In our case, the Collector and the Negotiator run on the same machine (called the Condor Manager), named condor.najah.edu. To make cluster usage easier, we will install the Scheduler service on every worker node and on the gateway.

HTCondor configuration files are stored in the /etc/condor/config.d directory. ALL files are stored here and no other file should be edited.

As HTCondor reads the files in alphanumeric order, a good practice is to start the file names with a number (like 01-default, 02-features, …).
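
If you are unsure which files are actually picked up, you can ask HTCondor to list its configuration sources; this prints the files in the order they are applied:

condor_config_val -config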

Basic Configuration

Condor Manager

The Condor Manager runs 2 services (Collector and Negotiator) plus a special service called Master. We also need to define the condor manager (called CONDOR_HOST in the configuration file) and add a special setting to allow other machines to join the condor cluster (ALLOW_WRITE).

/etc/condor/config.d/01-default:

CONDOR_HOST = condor.najah.edu
DAEMON_LIST = MASTER NEGOTIATOR COLLECTOR
ALLOW_WRITE = *
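
Once this file is in place, restart HTCondor on the manager and check that the daemons advertise themselves to the pool. The service name below assumes the standard packaging; adjust it to your installation.

# Apply the new configuration
sudo systemctl restart condor

# The Collector and Negotiator ads should now appear in the pool
condor_status -any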

Condor Worker

The worker node runs 2 services (Scheduler and Startd). Startd is the daemon that sends resource information to the Collector. As the Schedd also needs the ALLOW_WRITE definition, we define it here as well.

/etc/condor/config.d/01-worker:

CONDOR_HOST = condor.najah.edu
DAEMON_LIST = MASTER STARTD SCHEDD
ALLOW_WRITE = *
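
Once restarted, the worker should show up in the pool. The hostname below is only an example; use one of your own worker nodes.

# Apply the configuration on the worker
sudo systemctl restart condor

# From any machine in the pool, list the slots advertised by this worker
condor_status wn01.najah.edu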

Condor on the gateway

The gateway is a special machine where jobs should not run, but from which users should still be able to submit jobs. To configure that specific use case, you can define:

/etc/condor/config.d/01-schedd:

CONDOR_HOST = condor.najah.edu
DAEMON_LIST = MASTER SCHEDD
ALLOW_WRITE = *
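
To check that submission works from the gateway, you can submit a minimal test job; the file name test.sub is arbitrary and /bin/sleep is only used as a harmless executable.

test.sub:

executable = /bin/sleep
arguments  = 60
output     = test.out
error      = test.err
log        = test.log
queue

Submit it with condor_submit test.sub and follow it with condor_q.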

Advanced configuration of HTCondor

Multicore job

Sometimes, a user needs to start a threaded job that uses more than 1 CPU. As HTCondor reserves one CPU per slot by default, running this kind of job will overload your workers and can make your cluster unusable.

To avoid that, we can configure HTCondor to reserve more than 1 CPU per slot. There are multiple ways to do it, but the one documented here is the most convenient.

The main part of this configuration is done on every worker node. We state 2 different things:

  • We want only one slot per machine (instead of one per core)

  • A slot can be split to be shared by different jobs

To do that, we need to add a second file on every worker, called 02-multicore, with the following contents.

/etc/condor/config.d/02-multicore:

NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = true

To understand the configuration:

NUM_SLOTS:

We tell the worker to declare only 1 slot

NUM_SLOTS_TYPE_1:

We declare one slot of type 1 (referred to as TYPE_1 in the rest of the configuration file)

SLOT_TYPE_1:

The slot takes all the CPUs of the machine

SLOT_TYPE_1_PARTITIONABLE:

The slot can be shared by multiple jobs
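
After restarting HTCondor on the workers, you can check that each one now advertises a single partitionable slot; when a worker is idle, Cpus should equal TotalCpus:

condor_status -af Name Cpus TotalCpus PartitionableSlot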

A user can now request more than one CPU for their job by adding request_cpus to their job description file.
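
For example, a sketch of a submit description asking for 4 CPUs; the executable name my_threaded_app is hypothetical:

# Multi-threaded job that needs 4 CPUs on a single worker
executable   = my_threaded_app
request_cpus = 4
output       = multicore.out
error        = multicore.err
log          = multicore.log
queue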

However, it is harder for a multicore job to “enter” a worker. As it needs more than one free CPU, it must wait for a worker with enough CPUs available. When it competes with single-core jobs, every time a CPU is released a single-core job will take it, and your multicore job can stay stuck in the IDLE state.

To avoid that, we can add a new service on the condor manager called “DEFRAG”. Defrag periodically drains workers so that single-core jobs do not immediately refill every CPU a worker releases, leaving room for multicore jobs to start.

To do that, we add a file called 02-defrag on the HTCondor manager with the following contents.

/etc/condor/config.d/02-defrag:

DAEMON_LIST = $(DAEMON_LIST) DEFRAG

DEFRAG_SCHEDULE = graceful

# Definition of a "whole" machine:
# - anything with 8 free cores
# - empty machines
# - must be configured to actually start new jobs (otherwise machines which are deliberately being drained will be included)
DEFRAG_WHOLE_MACHINE_EXPR = ((Cpus == TotalCpus) || ((Cpus >= 8)&&(DynamicSlot=!=true))) && (Offline=!=True)

# Decide which machines to drain
# - must be Partitionable
# - must be online
# - must have more than 8 cores
DEFRAG_REQUIREMENTS = PartitionableSlot && Offline=!=True && TotalCpus>8

## Logs
MAX_DEFRAG_LOG = 104857600
MAX_NUM_DEFRAG_LOG = 10
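
After restarting HTCondor on the manager, you can check that the defrag daemon is up; the log location below assumes the usual /var/log/condor directory, adjust it to your installation.

# DEFRAG should now be part of the daemon list
condor_config_val DAEMON_LIST

# The condor_defrag process should be running and writing its log
pgrep -a condor_defrag
tail /var/log/condor/DefragLog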

Parallel Job (MPI)

By default, HTCondor is not able to run parallel (MPI) jobs. To enable them, we need to add some configuration on the worker side and decide which scheduler will be allowed to start MPI jobs. For our usage, the best choice is t3ps.najah.edu.

On every worker node, we must add a new configuration file called 03-mpi with the following contents.

Note: the DedicatedScheduler keyword is misleading, as the worker can still accept jobs from other schedulers. It only marks this specific scheduler as the one allowed to run parallel jobs on the worker.

/etc/condor/config.d/03-mpi:

DedicatedScheduler = "DedicatedScheduler@t3ps.najah.edu"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler
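
To apply this change without restarting the daemons, the configuration can simply be re-read on each worker:

condor_reconfig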

After reloading the condor service on the worker nodes, you can check the configuration with the following command:

condor_status -const '!isUndefined(DedicatedScheduler)' -format "%s\t" Machine -format "%s\n" DedicatedScheduler
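
A healthy output contains one line per worker, with the machine name followed by the dedicated scheduler it accepts; the hostnames below are purely illustrative:

wn01.najah.edu	DedicatedScheduler@t3ps.najah.edu
wn02.najah.edu	DedicatedScheduler@t3ps.najah.edu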

If the output is empty, then there is something wrong in your configuration.

To run an MPI job, you must add the following lines to the job description file:

# We will run a parallel job
universe = parallel
# The job will run on 2 different machines
machine_count = 2

The parallel universe can be used in the same way as a standard job.

Note: machine_count can be misleading: in fact, HTCondor chooses different slots. If a machine has more than one slot, the same “hardware” can be chosen by HTCondor.
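
Putting it together, a minimal parallel-universe submit description could look like the sketch below; mpi_wrapper.sh stands for a hypothetical wrapper script that starts your MPI program on the allocated nodes, and $(Node) lets each node write to its own output file.

universe      = parallel
executable    = mpi_wrapper.sh
machine_count = 2
output        = mpi.$(Node).out
error         = mpi.$(Node).err
log           = mpi.log
queue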