nqsgs(1)
NQSGS - Getting Started with NQS
Introduction
This document describes the Network Queuing System as
implemented at Monsanto on Unix Workstations. It provides
basic information on how to set up jobs, submit them, and
monitor their status as they run. This document is intended
as an overview for new users, not as a complete
description. For more detailed information, consult the
appropriate man pages.
NQS allows one to submit batch jobs to queues on local or
remote machines and have the log file returned to the
originating machine or another machine. After submitting
the request, the user can watch the progress of the request.
The user can also affect the job after it is submitted by
holding a queued job from being scheduled, releasing a held
job, suspending a running request, resuming a suspended
request, or deleting a queued or running request.
There are two main types of queues: batch and pipe. A batch
queue is an execution queue where the request actually runs.
A pipe queue provides routing capabilities; when a request
is submitted to a pipe queue it is passed on to another pipe
queue for further routing or a batch queue for execution on
the same or another machine.
The core NQS user commands are as follows:
Qsub - Submit an NQS job
Qdel - Delete an NQS job
Qstat - Determine the status of a job or a queue
The security mechanism for NQS is the .rhosts file, which is
checked to determine if a request from a user on a remote
system can be processed. This is checked when a request for
status or a job arrives from a remote system. NQS requires
that the system name and username both be present on the
line separated by whitespace. In some cases it is necessary
to have a line with the unqualified host name and a line
with the fully qualified host name. If this file is not
present on both the local and remote machines, then requests
may not transfer to the execution machine, or the log file
may not be returnable to the local machine.
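As an illustration of the file format described above, the following
sketch prints the two lines one might add to .rhosts for a single remote
host. The function name rhosts_lines is hypothetical and not part of NQS;
the host and user names are borrowed from the examples elsewhere in this
document and should be replaced with your own.

```shell
# Print the .rhosts lines for one remote host: the unqualified and the
# fully qualified host name, each followed by whitespace and the username.
# rhosts_lines is a hypothetical helper, not part of NQS.
rhosts_lines() {
    short_host=$1
    full_host=$2
    user=$3
    printf '%s %s\n' "$short_host" "$user"
    printf '%s %s\n' "$full_host" "$user"
}

# Example: lines that would let jrroma on beaker reach this account.
rhosts_lines beaker beaker.monsanto.com jrroma
```

Review the output before adding it to the .rhosts file in your home
directory on each machine involved.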
How do I submit a job to NQS?
NQS jobs are submitted using the Qsub command. Qsub accepts
a script which contains the shell commands to be executed
when the job runs. In addition, you can provide
instructions to Qsub to modify the characteristics of the
request, such as to give the request a name or to indicate
which queue to use. These instructions, called switches, can
be embedded at the beginning of the script, or placed on the
Qsub command line.
Here is a sample NQS script with embedded switches:
# QSUB -eo
# QSUB -r cvtabc
# QSUB -q batch
# QSUB
.
. Various script commands follow here
Note that the lines starting with "#" appear as comments to
the shell, but that Qsub interprets the lines starting with
"# QSUB" as indicators that a Qsub switch follows. This
script indicates that stdout and stderr should be combined
into one file (-eo), that the request should be called
"cvtabc" (-r), and that the job should be queued to the
queue called "batch" (-q). The final QSUB line without any
parameters indicates to Qsub that no more switches follow.
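To see which switches a script carries, the rule described above can be
sketched with a small shell function. This only approximates how Qsub
reads the script; the real parsing rules are in the Qsub man page. The
name embedded_switches is hypothetical.

```shell
# List the switches embedded in a script, one per line, following the rule
# described in the text: lines starting with "# QSUB" carry a switch, and
# the first "# QSUB" line without a switch ends the list.
# embedded_switches is a hypothetical helper, not an NQS command.
embedded_switches() {
    awk '
        /^# QSUB *$/ { exit }                          # bare line: no more switches
        /^# QSUB /   { sub(/^# QSUB +/, ""); print }   # strip the marker, keep the switch
    ' "$1"
}

# Try it on a temporary copy of the sample script.
sample=$(mktemp)
cat > "$sample" <<'EOF'
# QSUB -eo
# QSUB -r cvtabc
# QSUB -q batch
# QSUB
echo "script commands follow here"
EOF
embedded_switches "$sample"
rm -f "$sample"
```

For the sample script this lists -eo, -r cvtabc, and -q batch, one per
line.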
If this script was called scriptname.sh, it could be
submitted using the command:
qsub scriptname.sh
If there was a similar script called anotherscript.sh
without the embedded NQS commands, it could be submitted
using the following command and run exactly as the above
script:
qsub -eo -r cvtabc -q batch anotherscript.sh
It is also possible to have switches both embedded in the
script and on the command line.
Here are several of the most often used Qsub switches:
Switch    Action
-a        run request after stated time
-e        direct stderr output to the given destination
-eo       combine stdout and stderr in one file
-o        direct stdout output to the given destination
-r        give the request a name
-q        indicate to which queue to submit the job
All of the Qsub switches are explained in detail in the Qsub
man pages.
After a request has been submitted, you can modify the script
without affecting the request.
By default the sequence number of the request is printed
when Qsub completes its processing. This number combined
with the hostname makes up the unique identifier for the
request.
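The way the identifier is put together can be sketched as follows. The
helper request_id is hypothetical. Trimming the host name to its
unqualified form is an assumption based on the identifiers shown in this
document (for example 129.beaker); check what qstat prints at your site.

```shell
# Compose a request identifier from a sequence number and a host name.
# request_id is a hypothetical helper, not an NQS command.
# Assumption: the identifier uses the unqualified host name.
request_id() {
    seq=$1
    host=${2:-$(hostname)}          # default to the local machine
    printf '%s.%s\n' "$seq" "${host%%.*}"
}

request_id 129 beaker.monsanto.com
```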
How do I decide which queue to use?
The queue configuration on any set of machines is very
site-dependent. Therefore, one cannot describe a configuration
that applies to all locations. Instead, please read the
man pages on NQSCONFIG, which describe the local
configuration, or ask your local NQS system manager.
How do I find out what queues are available?
The general purpose command for determining the status of
queues and jobs is Qstat. To find out what queues are
present on the local machine, use the following command:
qstat -x
If you add the "-b" switch you will get a brief version of
the information, and if you add the "-l" switch you will see
much more detail. Sample output from qstat -x is:
batch@beaker.monsanto.com; type=BATCH; [ENABLED, INACTIVE]; pri=16 lim=1
0 exit; 0 run; 0 stage; 0 queued; 0 wait; 0 hold; 0 arrive;
User run limit= 1
helium@beaker.monsanto.com; type=PIPE; [ENABLED, INACTIVE]; pri=16 lim=1
0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
Destset = {batch@helium};
The first queue is a batch queue, and jobs actually run in
this queue. The second queue is a pipe queue, which means
jobs submitted to it are transferred to another queue either
on the same machine or another to execute. The destset on
the helium queue indicates that the jobs submitted to that
queue are transferred to the batch queue on the node helium
to run.
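When there are many queues, the header lines of this output can be
condensed to one line per queue. The following is a minimal sketch that
assumes the header layout shown in the sample above; the function name
queue_summary is hypothetical, and other NQS versions may format the
output differently.

```shell
# Condense "qstat -x" output to: queue name, queue type, queue state.
# Assumes each queue header looks like the sample in this document:
#   name@host; type=TYPE; [STATE, STATE]; pri=N lim=N
# queue_summary is a hypothetical helper, not an NQS command.
queue_summary() {
    awk -F'; *' '
        /type=/ {
            name = $1;  sub(/^[ \t]+/, "", name)
            type = $2;  sub(/^type=/, "", type)
            state = $3; gsub(/\[|\]/, "", state); gsub(/, */, "/", state)
            print name, type, state
        }
    '
}

# Run it on the sample output shown above.
queue_summary <<'EOF'
batch@beaker.monsanto.com; type=BATCH; [ENABLED, INACTIVE]; pri=16 lim=1
0 exit; 0 run; 0 stage; 0 queued; 0 wait; 0 hold; 0 arrive;
User run limit= 1
helium@beaker.monsanto.com; type=PIPE; [ENABLED, INACTIVE]; pri=16 lim=1
0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
Destset = {batch@helium};
EOF
```

On a live system the same function could be fed directly, as in
"qstat -x | queue_summary".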
If you want to learn more about queues on remote machines,
use the command of the form:
qstat -x @ddcs1
which indicates that the request should be forwarded to the
machine ddcs1 and the appropriate information printed on
your screen.
Again, Qstat is the command to get the status of NQS
requests. You can use various Qstat switches to select
which requests are shown. The default is to show only your
own jobs on the local machine, regardless of where they originated.
Additional switches can be used:
Qstat switch   Effect
-a             show all requests
-u username    show requests belonging to a specific user
-o             select jobs which originated on the local machine
-d             show jobs on all machines within the local NQS domain
There are also switches which control the format of the
output. The default Monsanto format is a single line for
each request. The -s switch gives the standard COSMIC NQS
format, and the -l switch provides much more detail in a
long format.
The systems in the local NQS domain are listed in the file
/usr/lib/nqs/nqs-domain (by default). This is a list of
systems which can be considered a unit; jobs can be
submitted between systems on the list. The -d switch then
requests information from each system on the list.
This list can be modified by having a file called .qstat in
your home directory which has the same format as the
system-wide file, but has only the systems in which you are
interested. Then you will get NQS status only from that
list of systems.
The Qcat utility is also available to get information on the
status of a job. It will list the spooled input script or
the available output or error files. Since applications may
not flush the stdout or stderr streams frequently, the
available information may be limited, but it can be helpful
in indicating how a job is progressing.
Here is an example of the default qstat output:
Request        I.D.   Owner    Queue    Start Time  Time Limit Total Time St
-------------- ------ -------- -------- ----------- ---------- ---------- --
example           129 jrroma   batch    4/30 10:11  4 04:00:00 0 00:00:00 R
The columns are self-explanatory, except perhaps for the
last one, which indicates the status of the request.
Possible statuses include R for running, Q for queued, H for
holding, W for waiting, and S for suspended.
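In a script, the status letter can be translated back into words with a
small lookup. This sketch covers only the letters listed above; the name
status_name is hypothetical, and any other letter is reported as unknown
rather than guessed.

```shell
# Translate the status letter from the default qstat format into a word.
# Only the letters listed in this document are covered.
# status_name is a hypothetical helper, not an NQS command.
status_name() {
    case $1 in
        R) echo running ;;
        Q) echo queued ;;
        H) echo holding ;;
        W) echo waiting ;;
        S) echo suspended ;;
        *) echo "unknown ($1)" ;;
    esac
}

# The status is the last field of a request line in the default format.
line='example 129 jrroma batch 4/30 10:11 4 04:00:00 0 00:00:00 R'
status_name "${line##* }"
```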
Here is an example of output from the -s switch, in the
standard COSMIC NQS format:
batch@beaker.monsanto.com; type=BATCH; [ENABLED, RUNNING]; pri=16 lim=1
0 exit; 1 run; 0 stage; 0 queued; 0 wait; 0 hold; 0 arrive;
User run limit= 1
REQUEST NAME REQUEST ID USER PRI STATE PGRP
1: example 129.beaker jrroma 31 RUNNING 7835
helium@beaker.monsanto.com; type=PIPE; [ENABLED, INACTIVE]; pri=16 lim=1
0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
And an example of output from the -l switch is as follows:
batch@beaker.monsanto.com; type=BATCH; [ENABLED, RUNNING]; pri=16 lim=1
0 exit; 1 run; 0 stage; 0 queued; 0 wait; 0 hold; 0 arrive;
User run limit= 1
Request 1: Name=example
Id=129.beaker Owner=jrroma Priority=31 RUNNING Pgrp=7835
Created at Thu Apr 30 10:11:09 CDT 1992
Mail = [NONE]
Mail address = jrroma@beaker
Owner user name at originating machine = jrroma
Request is not restartable, not recoverable.
Broadcast = [NONE]
Per-proc. core file size limit= [32 megabytes, 32 megabytes]<DEFAULT>
Per-proc. data size limit= [32 megabytes, 32 megabytes]<DEFAULT>
Per-proc. permanent file size limit= [500 megabytes, 500 megabytes]<DEFAULT>
Per-proc. execution nice priority = 0 <DEFAULT>
Per-proc. stack size limit= [32 megabytes, 32 megabytes]<DEFAULT>
Per-proc. CPU time limit= [360000.0, 360000.0]<DEFAULT>
Per-proc. working set limit= [32 megabytes, 32 megabytes]<DEFAULT>
Standard-error access mode = EO
Standard-output access mode = SPOOL
Standard-output name = beaker:/usr2/jrroma/tmp/example.o129
Shell = DEFAULT
Umask = 22
helium@beaker.monsanto.com; type=PIPE; [ENABLED, INACTIVE]; pri=16 lim=1
0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
Again, information on the status of jobs on remote machines
can be obtained by using the "@node" syntax to indicate
where to get the information.
Why is my request not running?
Occasionally, your job may be in the Waiting or Queued state
and it might not be clear why it is not running.
Determination of the reason can be complicated. NQS allows
system managers to set limits on the number of jobs that can
run at a time. There are queue run limits, which limit the
number of jobs that can run in a queue at a time, and
queue user run limits, which limit the number of jobs a
particular user can run at a time. In the same manner there
are global run and user run limits, which determine the
number of total jobs that can run on the system and the
number of jobs a person can have running at any one time,
respectively.
An investigation of the interactions of these limits and the
mix of jobs on the system should indicate the reason a
particular request is not running.
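One part of such an investigation can be sketched in shell: comparing the
number of running jobs in each batch queue with that queue's limit. This
is a rough sketch only. It assumes that "lim=" in the queue header is the
queue run limit and that the layout matches the samples in this document;
it does not look at user run limits or global limits. The name
queue_run_check is hypothetical.

```shell
# For each batch queue in "qstat -s" or "qstat -x" style output, report
# whether the number of running jobs has reached the queue's limit.
# Assumption: "lim=N" on the header line is the queue run limit.
# queue_run_check is a hypothetical helper, not an NQS command.
queue_run_check() {
    awk '
        /type=BATCH/ {
            queue = $1; sub(/;$/, "", queue)
            limit = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^lim=/) { limit = $i; sub(/^lim=/, "", limit) }
            }
            next
        }
        limit != "" && / run;/ {
            for (i = 2; i <= NF; i++) {
                if ($i == "run;") { running = $(i - 1) }
            }
            if (running + 0 >= limit + 0) {
                print queue ": at run limit (" running " of " limit ")"
            } else {
                print queue ": below run limit (" running " of " limit ")"
            }
            limit = ""
        }
    '
}

# Run it on sample output like that shown earlier in this document.
queue_run_check <<'EOF'
batch@beaker.monsanto.com; type=BATCH; [ENABLED, RUNNING]; pri=16 lim=1
0 exit; 1 run; 0 stage; 0 queued; 0 wait; 0 hold; 0 arrive;
User run limit= 1
helium@beaker.monsanto.com; type=PIPE; [ENABLED, INACTIVE]; pri=16 lim=1
0 depart; 0 route; 0 queued; 0 wait; 0 hold; 0 arrive;
EOF
```

A queue reported as at its run limit is one possible reason for a request
staying queued; if it is below the limit, the user or global limits are
the next things to check.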
How do I delete a job?
Qdel is the command that deletes NQS jobs. It takes as a
parameter the identifier of the job or jobs to be deleted.
The identifier consists of the sequence number and the
originating host of the job separated by a period. The
sequence number will be reported when you submit the job,
and it is shown when you do a qstat on the job. So the
identifier of a job which is sequence number 217 and was
originally submitted on beaker is 217.beaker.
If this job is queued on beaker, the appropriate command is:
qdel 217.beaker
If the job is running, you must add the -k switch, which
indicates that the running job is to be killed.
Local jobs can be deleted by the request name with the -r
switch. The argument to the -r switch is the request pattern
to delete. If the -c switch is used with the -r switch,
then the user is prompted to confirm the deletion of the
job.
If this job submitted from beaker is now running on a remote
machine, you will need to add the remote system name, as in:
qdel -k 217.beaker@ddcs1
where ddcs1 is the name of the remote machine where the job
is running. This will send a message to ddcs1 to delete the
request 217 which originated on beaker.
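The parts of such an identifier can be pulled apart with ordinary shell
parameter expansion. The following sketch uses a hypothetical helper
named split_request_id; it only illustrates the naming scheme described
above and does not contact NQS.

```shell
# Split a request identifier of the form SEQ.ORIGIN or SEQ.ORIGIN@REMOTE
# into its parts. split_request_id is a hypothetical helper.
split_request_id() {
    id=$1
    case $id in
        *@*) where=${id#*@}; id=${id%%@*} ;;   # machine where the job is now
        *)   where= ;;
    esac
    seq=${id%%.*}        # sequence number reported by Qsub
    origin=${id#*.}      # host on which the request was submitted
    printf 'sequence=%s origin=%s location=%s\n' "$seq" "$origin" "${where:-local}"
}

split_request_id 217.beaker
split_request_id 217.beaker@ddcs1
```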
Advanced Information:
There are other commands one can use to modify requests, but
which are not used as often. These are:
Qhold
Qhold holds all queued or waiting NQS requests given on
the command line. Qhold will not hold a running
request.
Qrls
Qrls releases a held request and makes it eligible to be
scheduled to run. Qrls cannot release a job which
is not being held.
Qsuspend
Qsuspend takes a running request and causes it to have
no access to the cpu. That is, it will no longer run.
One cannot suspend a request which is not in the NQS
Running state.
Qresume
Qresume is the inverse operation of Qsuspend. It takes
a suspended job and lets the cpu run the job again.
Qresume cannot resume a job which is not suspended.
Qsuspend and Qresume have only been tested on SGI
machines running IRIX and IBM RS6000s running AIX.
Qlimit
Qlimit does not modify the request, but shows the
supported batch limits and shell strategy for the local
or remote host. Here is example output:
Core file size limit (-lc)
Data segment size limit (-ld)
Per-process permanent file size limit (-lf)
Nice value (-ln)
Stack segment size limit (-ls)
Per-process cpu time limit (-lt)
Working set limit (-lw)
Shell strategy = FREE
For more information on these commands, consult their
respective man pages.
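The rules for the hold, release, suspend, and resume commands described
above can be summarized as a small table of state changes, using the
status letters from the default qstat output. This is a sketch of the
rules as stated in this document, not of NQS itself. The name next_state
is hypothetical, and the state a released request returns to (shown here
as Q) is an assumption.

```shell
# Given a request state (R, Q, H, W, S) and a command name, print the state
# the request would move to, or fail if the command does not apply.
# next_state is a hypothetical helper, not an NQS command.
next_state() {
    state=$1
    cmd=$2
    case "$cmd:$state" in
        qhold:Q|qhold:W) echo H ;;   # only queued or waiting requests can be held
        qrls:H)          echo Q ;;   # assumption: a released request is queued again
        qsuspend:R)      echo S ;;   # only a running request can be suspended
        qresume:S)       echo R ;;   # only a suspended request can be resumed
        *) echo "$cmd does not apply to a request in state $state" >&2
           return 1 ;;
    esac
}

next_state Q qhold
next_state R qsuspend
next_state R qhold || echo "refused"
```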
Who do I contact when something goes wrong?
Contact your local system manager.
History
Origin: Monsanto Company
May 1992 - John Roman, Monsanto
Original version
August 1994 - John Roman, Monsanto
Version 3.36