Condor creates a High Throughput Computing (HTC) environment that uses the Astronomy network of workstations to schedule computational jobs. It does this by monitoring the usage of every core on the workstations and running jobs on machines that are available. The most common use case in our department is running a program that takes an hour or two on each of tens to thousands of stars or galaxies.
Important Note: While Condor is good for running a bunch of individual jobs, it does not allow parallel computing by default. Additionally, if a job gets pulled from a computer, Condor does not save the state of the job to resume execution on a new core. Instead, it will restart the job once it finds a new core. If your jobs last more than 2-5 hours, you should build checkpointing into your code that periodically saves its progress to resume upon a job restart.
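The checkpointing advice above could be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not a department-provided tool: the file name `checkpoint.json`, the `process` function, and the choice to save every 10 items are all hypothetical placeholders for your own code.

```python
import json
import os
import tempfile

CHECKPOINT = "checkpoint.json"  # hypothetical file name; keep it on a shared disk


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_index": 0, "results": []}


def save_checkpoint(path, state):
    """Write to a temporary file, then rename it over the old checkpoint,
    so a job pulled mid-write does not leave a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def process(item):
    """Hypothetical stand-in for the real per-object computation."""
    return item * item


def run(items, path=CHECKPOINT, every=10):
    """Process items in order, resuming from the checkpoint if one exists."""
    state = load_checkpoint(path)
    for i in range(state["next_index"], len(items)):
        state["results"].append(process(items[i]))
        state["next_index"] = i + 1
        if state["next_index"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # final save once everything is done
    return state["results"]
```

If Condor restarts the job, calling `run` again with the same checkpoint path picks up at `next_index` instead of starting over. The results must be JSON-serializable for this particular sketch to work; larger outputs would need a different storage format.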
Running a Job
- Make sure that your code can run in the background. (This is a problem for IDL code; most everything else should be fine.)
- Submit jobs only on the Condor server, condor.astro.washington.edu. Submission may still work from another computer, but it is buggy, and it is better if ordinary Condor nodes aren't acting as the master.
- Put your code and any data that your code requires on a disk that is visible to all of the Linux PCs in the Condor cluster (your home directory is fine). Most disks in the department are visible from anywhere else in the department. The notable exception is anything in your /local or /tmp directory.
- Create a Condor .cfg file. This is a set of instructions telling Condor how to run your code.
- Submit the job using
% ssh condor
% cd /path/to/mydir
% condor_submit ./myjobs.cfg
- Check the progress of your Condor job(s).
% condor_q -sub username
- If necessary, remove jobs that you no longer need to run.
% condor_rm job#
or
% condor_rm username
to remove all your jobs.
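For the "create a Condor .cfg file" step, writing one block per star or galaxy by hand gets tedious. The sketch below generates the text of a submit file with one job per target. The submit commands it emits (`universe`, `executable`, `arguments`, `output`, `error`, `log`, `getenv`, `queue`) are standard Condor ones, but the vanilla universe, the `logs` directory, the `myjobs.log` name, and the target names are assumptions made for illustration; adjust them to your own setup.

```python
def make_submit_file(executable, targets, log_dir="logs"):
    """Build the text of a Condor submit file that queues one job per target.

    executable: path to the script or program to run (must be executable).
    targets:    list of names, each passed to the executable as its argument.
    log_dir:    hypothetical directory for output; it must already exist.
    """
    lines = [
        "# Generated submit file; lines starting with # are comments",
        "universe   = vanilla",          # assumption: plain, non-parallel jobs
        f"executable = {executable}",
        "getenv     = true",             # inherit your environment variables
        f"log        = {log_dir}/myjobs.log",
        "",
    ]
    for name in targets:
        lines += [
            f"arguments = {name}",
            f"output    = {log_dir}/{name}.out",  # the job's stdout
            f"error     = {log_dir}/{name}.err",  # the job's stderr
            "queue",
            "",
        ]
    return "\n".join(lines)
```

Writing the returned string to `myjobs.cfg` gives a file that can be passed to `condor_submit` as in the steps above. Each `queue` line submits one job using the settings that precede it, which is why the per-target lines are repeated.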
- How do I stop other people's Condor jobs on my computer? Condor is set up to exit when you start using your computer. However, since our computers have two processors, Condor does not get the message that the computer is in use and continues to use the second processor, slowing you down. To force Condor to exit, type:
- How should I credit the creators of Condor in a journal paper? Point 8 of the Condor Academic License provides a sentence that you should include in the acknowledgments of any publication whose results were obtained with Condor.
- Who should I ask for help with using Condor? If you've already carefully looked at all of the resources available from this page, and still can't find an answer to your question, try asking PACS (?). Or if this page hasn't been updated in a while, log onto condor.astro and run
condor_userprio -allusers -all. Find someone still here who has lots of accumulated usage, and odds are they can help you.
- Why does my computer slow down so much? A process that manages your jobs runs on the computer from which you submitted them. If you submit lots of jobs, this Condor manager has to do a lot of work and can slow down that computer if it's not fast enough. "condor.astro" is a dedicated Condor server that you can submit jobs from instead of your desktop.
- Are comments allowed in the .cfg files? Yes, any line beginning with the pound sign (#) is ignored by Condor.
- How do I make Condor inherit my environment variables? In your .cfg file, set:
getenv = true
Be careful when using this feature, since the maximum allowed size of the environment in Condor is 10240 characters. If your environment is larger than that, Condor will not allow you to submit your job, and you will have to use the "environment" setting instead:
environment = <parameter_list>
Here <parameter_list> is a list of environment variables of the form <parameter>=<value>. Multiple environment variables can be specified by separating them with a semicolon (;). These environment variables will be placed into the job's environment before execution. The total length of the environment setting is likewise limited to 10240 characters.
- How do I kill a Condor job that I've started? Type the following:
% condor_rm <your_user_name>
This will remove all condor jobs under your username! If the job is listed in state 'X', do
condor_rm -forcex <jobID>.
- Why am I getting a "cannot connect to X server localhost" error? This error is related to matplotlib; make sure you add the precautionary measures outlined above to your code.
- Why are my jobs being held due to "Permission denied"? You forgot to make your script executable:
% chmod +x filename.sh