Condor Cluster

Condor creates a High Throughput Computing (HTC) environment that uses the Astronomy network of workstations to schedule computational jobs. It does so by monitoring the usage of every core on the available workstations and farming jobs out to unused cores. If a user comes back to their computer, Condor pulls the job from it.

The most common use case in our department is running a program that takes an hour or two on each of tens to thousands of stars/galaxies individually. The University of Wisconsin-Madison's Center for High Throughput Computing holds an annual HTC workshop and provides presentation materials online, such as this Introduction to using HTCondor.


Running a Job

Important Note

While Condor is good for running many individual jobs, it does not support parallel computing by default (i.e. running a single job across multiple cores simultaneously). Additionally, if a job gets pulled from a computer, Condor does not save the job's state to resume execution on a new core; instead, it restarts the job from the beginning once it finds a new core. If your jobs last more than 2-5 hours, you should build checkpointing into your code so that it periodically saves its progress and can resume after a restart, as in the sketch below.
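
A minimal checkpointing pattern in Python might look like the following; the checkpoint file name and the do_work function are illustrative placeholders, not part of any Condor API:

import os
import pickle

CHECKPOINT = 'checkpoint.pkl'

def do_work(i):
    return i * i  # stand-in for your per-star/galaxy computation

# resume from the last checkpoint if Condor restarted the job on a new node
if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT, 'rb') as f:
        start, results = pickle.load(f)
else:
    start, results = 0, []

for i in range(start, 100000):
    results.append(do_work(i))
    if (i + 1) % 1000 == 0:  # save progress every 1000 items
        with open(CHECKPOINT, 'wb') as f:
            pickle.dump((i + 1, results), f)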

  1. Make sure that your code can run in the background (this is usually only a problem for IDL code; most code should be fine).
  2. Submit jobs only from the condor server, condor.astro.washington.edu. Submission may still work from another computer, but it is buggy, and it is better if the condor nodes are not also acting as the master.
  3. Put your code and any data that your code requires on a disk that is visible to all of the Linux PCs in the Condor Cluster (your home directory is fine). Most disks in the department are visible from anywhere else in the department; the notable exceptions are your /local and /tmp directories.
  4. Create a Condor .cfg file. This is a set of instructions telling Condor how to run your code. We provide an example file in the next section.
  5. Submit the job using % condor_submit
    • % ssh condor
    • % cd /path/to/mydir
    • % condor_submit ./myjobs.cfg
  6. Check the progress of your Condor job(s).
    • % condor_q -sub username
  7. If necessary, remove jobs that you no longer need to run.
    • % condor_rm job# to remove a specific job, or % condor_rm username to remove all of your jobs.

Example cfg File

The idea of a cfg file is to specify a list of parameters (the log file, output file, executable, and arguments) and then submit that particular run. This is accomplished using the word Queue: when this word appears, Condor takes the current values for Executable, Arguments, etc. and sends them to a node to compute. A cfg file looks like the code block below.

# you can change this so condor will email you under certain circumstances.
Notification = never
# very necessary. This loads your .cshrc file and makes Condor aware of your home directory/the network before the job starts.
# the job will fail without this line.
getenv = true
# what file condor will run
Executable = /astro/users/username/directory/condor_run.csh
# which directory your code starts in (e.g. calls to ./ mean this directory)
Initialdir = /astro/users/username/directory/
# read the documentation before changing this
Universe = vanilla
# Condor log information
Log = /astro/users/username/directory/output/log1.txt
Output = /astro/users/username/directory/output/run1.out
Error = /astro/users/username/directory/output/run1.err
Arguments = /path/to/inputfile1 <arg1> <arg2>
Queue
Log = /astro/users/username/directory/output/log2.txt
Output = /astro/users/username/directory/output/run2.out
Error = /astro/users/username/directory/output/run2.err
Arguments = /path/to/inputfile2 <arg1> <arg2>
Queue
Log = /astro/users/username/directory/output/log3.txt
Output = /astro/users/username/directory/output/run3.out
Error = /astro/users/username/directory/output/run3.err
Arguments = /path/to/inputfile3 <arg1> <arg2>
Queue

The log file records Condor information: for example, which computer the job was running on, how long it ran, whether it got kicked off the machine, and why it died. The output file records stdout, or what usually gets printed to your terminal. If you want to discard anything printed to the screen, set Output = /dev/null

A note about running Python: the Executable line above tells Condor what script to run, and that script gets passed the appropriate arguments. For some reason, you cannot hand Condor a .py file and expect it to run. Instead, the usual approach is to make a short .csh file like the one below and use that as the Executable = line in your .cfg.

#!/bin/csh
python /astro/users/path/to/python/script.py $1 $2 $3

For some reason Condor is very particular and requires the shebang line telling it exactly which shell to use to interpret the script, so make sure you include it. The wrapper then passes the three arguments from your .cfg file into the Python script you want to execute.
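
On the Python side, the wrapper's $1 $2 $3 show up in sys.argv. A minimal sketch of the receiving script (the variable names are illustrative):

# script.py: receives the arguments the .csh wrapper passes through
import sys

infile = sys.argv[1]  # e.g. /path/to/inputfile1 from the Arguments line
arg1 = sys.argv[2]
arg2 = sys.argv[3]
print('processing %s with %s, %s' % (infile, arg1, arg2))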

Typical Workflow

Putting everything together, here is how large batches of Condor jobs are usually done:

  • Create a directory where most of your output will go.
  • Write the one line .csh file that calls your Python script with the appropriate arguments (see above).
  • Make sure that the .csh file, if made by a condor wrapper script, ends with a newline (\n) character.
  • Write a condor_setup.py wrapper file. This script will generate the .cfg file for your big job: loop through all of the jobs you want to run (thousands of different stars, for example), printing out the log information for each one and changing the arguments appropriately (a minimal sketch appears after this list).
  • Log into condor.astro and then run % condor_submit ./your_config_file.cfg. Alternatively, log onto condor.astro to run your condor_setup.py file, and have its last line call subprocess.check_call(['condor_submit', './your_config_file.cfg']) to submit your jobs automatically.
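
Here is a minimal sketch of what such a condor_setup.py generator might look like, assuming one job per input file; all paths and file names are illustrative placeholders:

# condor_setup.py: writes a .cfg file with one Queue block per input file
outdir = '/astro/users/username/directory/output'
infiles = ['star1.dat', 'star2.dat', 'star3.dat']  # your real list goes here

with open('myjobs.cfg', 'w') as cfg:
    # settings shared by every job
    cfg.write('Notification = never\n')
    cfg.write('getenv = true\n')
    cfg.write('Executable = /astro/users/username/directory/condor_run.csh\n')
    cfg.write('Initialdir = /astro/users/username/directory/\n')
    cfg.write('Universe = vanilla\n')
    # one Log/Output/Error/Arguments/Queue block per job
    for i, infile in enumerate(infiles):
        cfg.write('Log = %s/log%d.txt\n' % (outdir, i))
        cfg.write('Output = %s/run%d.out\n' % (outdir, i))
        cfg.write('Error = %s/run%d.err\n' % (outdir, i))
        cfg.write('Arguments = %s arg1 arg2\n' % infile)
        cfg.write('Queue\n')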

Note on matplotlib: matplotlib can sometimes have issues working in Condor jobs out of the box. If you run a job and see error messages relating to matplotlib, try changing your backend. Do this by adding the following to the top of your Python script, before any other references to matplotlib:

import matplotlib
matplotlib.use('agg')

Disk Usage

Knowing how the file system works is crucial to getting work done quickly and efficiently. What you think of as "disks" are the back-end hardware; "file systems" sit in front of the disks and translate your data into the on-disk format. Every year someone brings down the entire computer network because an IDL script that uses Condor reads or writes files in a non-optimal way. To avoid this, here are several guidelines that can improve your performance.

  • Don't use tight loops that repeatedly open, write to, and close files.
  • There's no reason to write out large single-column ASCII files. Once your arrays get bigger than 1000 elements, use a binary format.
  • Lots of small files are bad; use one big binary file instead (see the sketch after this list).
  • Directories with more than 10,000 files are effectively impossible for file systems to handle; any operation you do in such a directory will complete at a snail's pace.
  • Entire directory trees should not exceed 1 million files no matter how you structure them. While that sounds like a lot, you'd be surprised how quickly you will generate a large number of files.
  • Each condor job should run in a different output directory, and each process should do its own IO to its own directory.
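
To illustrate the binary-format guideline, here is a minimal sketch (using numpy, which is an assumption about your environment) of replacing many small ASCII writes with a single binary write:

import numpy as np

# illustrative computation: 100,000 values accumulated in memory
results = np.sqrt(np.arange(100000, dtype=float))

# Bad: one tiny ASCII file (and an open/write/close) per value
# for i, value in enumerate(results):
#     with open('out_%06d.txt' % i, 'w') as f:
#         f.write('%f\n' % value)

# Better: a single binary write at the end of the job
np.save('results.npy', results)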

Python Example

If you are running code on Condor that writes frequently, you can execute the program from inside a Python script that first copies all necessary files to the local machine and then runs the program there, preventing huge numbers of writes from crossing the network. Note that the following example will *not* allow you to do checkpointing: any time your job is kicked off a machine, it will have to start over on another.

First, import os and move to the local directory on whichever machine your job ends up on:

import os
os.chdir('/local/tmp')

Next, make a directory for yourself and copy your files over:

os.system('mkdir username')
os.chdir('username')
os.system('cp <yourfiles> .')

Then execute your program. You'll need to make sure that the code writes to the current directory and not some specific directory on the network.

os.system('<yourprogram> <arguments>')

Finally, clean up:

os.system('mv <outputfiles> <directoryname>')
os.chdir('..')
os.system('rm -r username')

This way the program executed by Condor (your Python script, probably inside a shell wrapper, unless you can figure out how to make Condor understand that yes, Python is a valid executable, in which case you should be writing this page and not me) only writes across the network twice, and you don't get 100+ computers trying to write 10,000+ times per second to a single disk.
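
Putting the pieces above together, a self-contained version of such a wrapper might look like the following; all paths, file names, and the program being run are illustrative placeholders:

import os
import shutil
import subprocess

scratch = '/local/tmp/username/job1'     # per-job scratch space on the local disk
netdir = '/astro/users/username/output'  # final destination on the network

# stage inputs onto the local disk
os.makedirs(scratch)  # will fail if a previous run left this directory behind
for f in ('input1.dat', 'input2.dat'):
    shutil.copy(os.path.join(netdir, f), scratch)

# run the program inside the scratch directory so all of its IO stays local
os.chdir(scratch)
subprocess.check_call(['/astro/users/username/bin/myprogram',
                       'input1.dat', 'input2.dat'])

# one write back across the network, then clean up the local disk
shutil.copy('results.dat', netdir)
os.chdir('/local/tmp')
shutil.rmtree(scratch)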

FAQs

  • How do I stop other people's Condor jobs on my computer? Condor is set up to exit when you start using your computer. However, since our computers have two processors, Condor does not always get the message that the computer is in use and continues to use the second processor, slowing you down. To force Condor to exit, type: % condor_vacate
  • How should I credit the creators of Condor in a journal paper? Point 8 of the Condor Academic License provides a sentence that you should include in the acknowledgments of any publication whose results were obtained with Condor.
  • Who should I ask for help with using Condor? If you've already carefully looked at all of the resources available from this page, and still can't find an answer to your question, try asking PACS. Or if this page hasn't been updated in a while, log onto condor.astro and run % condor_userprio -allusers -all. Find someone still here who has lots of accumulated usage, and odds are they can help you.
  • Why does my computer slow down so much? A process that manages your jobs runs on the computer from which you submitted them. If you submit lots of jobs, this Condor manager has to do a lot of work and can slow down that computer if it is not fast enough. condor.astro is a dedicated Condor server that you can submit jobs from instead of your desktop.
  • Are comments allowed in the .cfg files? Yes, any line beginning with the pound sign (#) is ignored by Condor.
  • How do I make Condor inherit my environment variables? In your .cfg file, set: getenv = true  Note that the total length of the environment is currently limited to 10240 characters. If your environment is larger than that, Condor will not allow you to submit your job, and you will have to set the environment explicitly instead, using a line of the form environment = name1=value1; name2=value2. Multiple environment variables are separated by semicolons (;). These environment variables will be placed into the job's environment before execution.
  • How do I kill a condor job that I've started? type the following: % condor_rm <your_user_name>
    This will remove all condor jobs under your username! If the job is listed in state 'X', do % condor_rm -forcex <jobID>
  • Why am I getting a "cannot connect to X server localhost" error? This error is related to matplotlib; make sure you add the backend fix outlined above to your code.
  • Why are my jobs being held due to "Permission denied?" You forgot to run % chmod +x filename.sh

Resources

The Caltech Bioinformatics Lab has a Wiki page on troubleshooting Condor. Also, the Instituto de Astrofísica de Canarias has a (now deprecated) list of useful HTCondor commands.
