nqsgs(1)


	  NQSGS	- Getting Started with NQS

     Introduction
	  This document	describes the Network Queuing System as
	  implemented at Monsanto on Unix Workstations.	 It provides
	  basic	information on how to set up jobs, submit them,	and
	  monitor their status as they run.  This document is intended
	  as an overview for new users, not as a complete description.
	  For more detailed information, consult the appropriate man
	  pages.

	  NQS allows one to submit batch jobs to queues	on local or
	  remote machines and have the log file	returned to the
	  originating machine or another machine.  After submitting
	  the request, the user	can watch the progress of the request.
	  The user can also affect the job after it is submitted by
	  holding a queued job from being scheduled, releasing a held
	  job, suspending a running request, resuming a	suspended
	  request, or deleting a queued	or running request.

	  There	are two	main types of queues: batch and	pipe.  A batch
	  queue	is an execution	queue where the	request	actually runs.
	  A pipe queue provides	routing	capabilities; when a request
	  is submitted to a pipe queue it is passed on to another pipe
	  queue	for further routing or a batch queue for execution on
	  the same or another machine.

	  The core NQS user commands are as follows:

	       Qsub  - Submit an NQS job
	       Qdel  - Delete an NQS job
	       Qstat - Determine the status of a job or a queue

	  The security mechanism for NQS is the	.rhosts	file, which is
	  checked to determine if a request from a user	on a remote
	  system can be	processed.  This is checked when a request for
	  status or a job arrives from a remote	system.	 NQS requires
	  that the system name and username both be present on the
	  line separated by whitespace.	 In some cases it is necessary
	  to have a line with the unqualified host name	and a line
	  with the fully qualified host	name.  If this file is not
	  present on both the local and	remote machines, then requests
	  may not transfer to the execution machine, or	the log	file
	  may not be returnable	to the local machine.
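
	  As a rough check of the requirement above, the following
	  sketch tests whether an rhosts-style file contains a
	  "hostname username" line for both forms of a host name.  The
	  function name has_rhosts_entry is hypothetical, and the demo
	  uses a temporary file with the host and user names from the
	  examples in this document rather than a real .rhosts file.

```shell
# Hypothetical helper: succeed if FILE has a line whose first two
# whitespace-separated fields are HOST and USER.
has_rhosts_entry() {
    file=$1 host=$2 user=$3
    [ -r "$file" ] || return 1
    awk -v h="$host" -v u="$user" '
        $1 == h && $2 == u { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$file"
}

# Demo on a temporary file (illustrative contents only).
rhosts_demo=$(mktemp)
printf 'beaker jrroma\nbeaker.monsanto.com jrroma\n' > "$rhosts_demo"

for h in beaker beaker.monsanto.com helium; do
    if has_rhosts_entry "$rhosts_demo" "$h" jrroma; then
        echo "$h: entry present"
    else
        echo "$h: entry missing"
    fi
done
rm -f "$rhosts_demo"
```

	  To check a real file, pass its path (for example
	  "$HOME/.rhosts") as the first argument on each machine
	  involved.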

     How do I submit a job to NQS?
	  NQS jobs are submitted using the Qsub	command.  Qsub accepts
	  a script which contains the shell commands to	be executed
	  when the job runs.  In addition, you can provide
	  instructions to Qsub to modify the characteristics of the
	  request, such as to give the request a name or to indicate
	  the queue to which it is submitted.  These switches can
	  be embedded at the beginning of the script, or placed on the
	  Qsub command line.

	  Here is a sample NQS script with embedded switches:

	       # QSUB -eo
	       # QSUB -r cvtabc
	       # QSUB -q batch
	       # QSUB
	       .
	       .    Various script commands follow here

	  Note that the	lines starting with "#"	appear as comments to
	  the shell, but that Qsub interprets the lines	starting with
	  "# QSUB" as indicators that a	Qsub switch follows.  This
	  script indicates that	stdout and stderr should be combined
	  into one file	(-eo), that the	request	should be called
	  "cvtabc" (-r), and that the job should be queued to the
	  queue	called "batch" (-q).  The final	QSUB line without any
	  parameters indicates to Qsub that no more switches follow.

	  If this script was called scriptname.sh, it could be
	  submitted using the command:

	       qsub scriptname.sh
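
	  For reference, the sample script can be written out as a
	  complete file as sketched below.  The date and hostname
	  commands are placeholders chosen for illustration and are
	  not part of the original sample.

```shell
# Write the sample script to a file.  The quoted 'EOF' keeps the shell
# from expanding anything inside the here-document.
cat > scriptname.sh <<'EOF'
# QSUB -eo
# QSUB -r cvtabc
# QSUB -q batch
# QSUB
# Placeholder commands; replace with the real work of the job.
date
hostname
EOF

# Count the embedded switch lines, i.e. those starting with "# QSUB"
# (three switches plus the terminating line).
grep -c '^# QSUB' scriptname.sh
```

	  The resulting file is what would be passed to qsub as shown
	  above.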

	  If there were a similar script called anotherscript.sh
	  without the embedded NQS switches, it could be submitted
	  using the following command and would run exactly as the
	  above script:

	       qsub -eo -r cvtabc -q batch anotherscript.sh

	  It is also possible to have switches both embedded in the
	  script and on the command line.

	  Here are several of the most often used Qsub switches:

	       Switch    Action
	       -a        run request after stated time
	       -e        direct stderr output to the given destination
	       -eo       combine stdout and stderr in one file
	       -o        direct stdout output to the given destination
	       -r        give the request a name
	       -q        indicate to which queue to submit the job

	  All of the Qsub switches are explained in detail in the Qsub
	  man pages.

	  Qsub makes a spooled copy of the script at submission, so
	  you can modify the script after submission and not affect
	  the request.

	  By default the sequence number of the	request	is printed
	  when Qsub completes its processing.  This number combined
	  with the hostname makes up the unique	identifier for the
	  request.

     How do I decide which queue to use?
	  The queue configuration on any set of machines is very
	  site-dependent.  Therefore, one cannot describe a
	  configuration that applies to all locations.  Instead, please
	  read the man pages on NQSCONFIG, which describe the local
	  configuration, or ask your local NQS system manager.

     How do I find out what queues are available?
	  The general purpose command for determining the status of
	  queues and jobs is Qstat.  To	find out what queues are
	  present on the local machine,	use the	following command:

	       qstat -x

	  If you add the "-b" switch you will get a brief version of
	  the information, and if you add the "-l" switch you will see
	  a lot	more.  Sample output from qstat	-x is:

	  batch@beaker.monsanto.com;  type=BATCH;  [ENABLED, INACTIVE];	 pri=16	 lim=1
	    0 exit;   0	run;   0 stage;	  0 queued;   0	wait;	0 hold;	  0 arrive;
	    User run limit= 1

	  helium@beaker.monsanto.com;  type=PIPE;  [ENABLED, INACTIVE];	 pri=16	 lim=1
	    0 depart;	0 route;   0 queued;   0 wait;	 0 hold;   0 arrive;
	    Destset = {batch@helium};

	  The first queue is a batch queue, and	jobs actually run in
	  this queue. The second queue is a pipe queue,	which means
	  jobs submitted to it are transferred to another queue	either
	  on the same machine or another to execute.  The destset on
	  the helium queue indicates that the jobs submitted to	that
	  queue	are transferred	to the batch queue on the node helium
	  to run.
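
	  The queue header lines in this output can be picked out
	  mechanically.  The following is a minimal sketch, assuming
	  the header layout shown in the sample above; list_queues is
	  a hypothetical name, and the sample text stands in for real
	  qstat -x output.

```shell
# Hypothetical helper: read qstat -x style text on stdin and print
# "queue type" for each queue header line (the lines with "type=").
list_queues() {
    awk -F';' '/type=/ {
        name = $1; qtype = $2
        gsub(/^[ \t]+|[ \t]+$/, "", name)
        gsub(/^[ \t]*type=|[ \t]+$/, "", qtype)
        print name, qtype
    }'
}

# Sample text copied from the output above; in practice, pipe the
# output of "qstat -x" into list_queues instead.
list_queues <<'EOF'
batch@beaker.monsanto.com;  type=BATCH;  [ENABLED, INACTIVE];  pri=16  lim=1
  0 exit;   0 run;   0 stage;   0 queued;   0 wait;   0 hold;   0 arrive;
  User run limit= 1

helium@beaker.monsanto.com;  type=PIPE;  [ENABLED, INACTIVE];  pri=16  lim=1
  0 depart;   0 route;   0 queued;   0 wait;   0 hold;   0 arrive;
  Destset = {batch@helium};
EOF
```

	  For the sample text this prints one line per queue, giving
	  the queue name followed by BATCH or PIPE.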

	  If you want to learn more about queues on remote machines,
	  use the command of the form:

	       qstat -x	@ddcs1

	  which	indicates that the request should be forwarded to the
	  machine ddcs1	and the	appropriate information	printed	on
	  your screen.

	  Again, Qstat is the command to get the status	of NQS
	  requests.  You can use various Qstat switches	to select
	  which requests are shown.  The default is to show only your
	  own jobs on the local machine, regardless of where they
	  originated.  Additional switches can be used:

	       Qstat switch   Effect
	       -a             show all requests
	       -u username    show requests belonging to a specific user
	       -o             select jobs which originated on the local
	                      machine
	       -d             show jobs on all machines within the local
	                      NQS domain

	  There	are also switches which	control	the format of the
	  output.  The default Monsanto	format is a single line	for
	  each request.	 The -s	switch gives the standard COSMIC NQS
	  format, and the -l switch provides much more detail in a
	  long format.

	  The systems in the local NQS domain are listed in the	file
	  /usr/lib/nqs/nqs-domain (by default).	 This is a list	of
	  systems which	can be considered a unit; jobs can be
	  submitted between systems on the list.  The -d switch	then
	  requests information from each system	on the list.

	  This list can	be modified by having a	file called .qstat in
	  your home directory which has	the same format	as the
	  system-wide file, but	has only the systems in	which you are
	  interested.  Then you	will get NQS status only from that
	  list of systems.

	  The Qcat utility is also available to	get information	on the
	  status of a job.  It will list the spooled input script or
	  the available	output or error	files.	Since applications may
	  not flush the	stdout or stderr streams frequently, the
	  available information	may be limited,	but it can be helpful
	  in indicating	how a job is progressing.

	  Here is an example of	the default qstat output:

	  Request	  I.D.	Owner	 Queue	  Start	Time   Time Limit  Total Time St
	  -------------- ------	-------- -------- -----------  ----------  ---------- --
	  example	   129	jrroma	 batch	  4/30 10:11   4 04:00:00  0 00:00:00 R

	  The columns are self-explanatory, except perhaps for the
	  last one, which indicates the status of the request.
	  Possible statuses include R for running, Q for queued, H for
	  holding, W for waiting, and S	for suspended.
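
	  The status letters can be decoded in a script, as in the
	  following sketch.  It assumes the default single-line format
	  shown above with two header lines and a request name that
	  contains no spaces; status_name is a hypothetical helper, and
	  the sample text stands in for real qstat output.

```shell
# Map the one-letter status from the last column to a word.
status_name() {
    case $1 in
        R) echo running ;;
        Q) echo queued ;;
        H) echo holding ;;
        W) echo waiting ;;
        S) echo suspended ;;
        *) echo "unknown($1)" ;;
    esac
}

# Sample text copied from the example above; in practice this would be
# the output of qstat.
sample='Request         I.D.  Owner    Queue    Start Time   Time Limit  Total Time St
-------------- ------ -------- -------- -----------  ----------  ---------- --
example           129 jrroma   batch    4/30 10:11   4 04:00:00  0 00:00:00 R'

# Skip the two header lines, then print name, id, and decoded status.
printf '%s\n' "$sample" | awk 'NR > 2 { print $1, $2, $NF }' |
while read -r name id st; do
    echo "$name $id $(status_name "$st")"
done
```

	  Any status letter not listed above is reported as unknown
	  rather than guessed.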

	  An example of output from the -s switch, which gives the
	  standard COSMIC NQS format, is as follows:

	  batch@beaker.monsanto.com;  type=BATCH;  [ENABLED, RUNNING];	pri=16	lim=1
	    0 exit;   1	run;   0 stage;	  0 queued;   0	wait;	0 hold;	  0 arrive;
	    User run limit= 1

		   REQUEST NAME	       REQUEST ID	     USER  PRI	  STATE	    PGRP
	      1:	example	      129.beaker	    jrroma  31	RUNNING	    7835
	  helium@beaker.monsanto.com;  type=PIPE;  [ENABLED, INACTIVE];	 pri=16	 lim=1
	    0 depart;	0 route;   0 queued;   0 wait;	 0 hold;   0 arrive;

	  And an example of output from	the -l switch is as follows:

	  batch@beaker.monsanto.com;  type=BATCH;  [ENABLED, RUNNING];	pri=16	lim=1
	    0 exit;   1	run;   0 stage;	  0 queued;   0	wait;	0 hold;	  0 arrive;
	    User run limit= 1

	    Request    1:  Name=example
	    Id=129.beaker     Owner=jrroma  Priority=31	 RUNNING  Pgrp=7835
	    Created at Thu Apr 30 10:11:09 CDT 1992
	    Mail = [NONE]
	    Mail address = jrroma@beaker
	    Owner user name at originating machine = jrroma
	    Request is not restartable,	not recoverable.
	    Broadcast =	[NONE]
	    Per-proc. core file	size limit= [32	megabytes, 32 megabytes]<DEFAULT>
	    Per-proc. data size	limit= [32 megabytes, 32 megabytes]<DEFAULT>
	    Per-proc. permanent	file size limit= [500 megabytes, 500 megabytes]<DEFAULT>
	    Per-proc. execution	nice priority =	0 <DEFAULT>
	    Per-proc. stack size limit=	[32 megabytes, 32 megabytes]<DEFAULT>
	    Per-proc. CPU time limit= [360000.0, 360000.0]<DEFAULT>
	    Per-proc. working set limit= [32 megabytes,	32 megabytes]<DEFAULT>
	    Standard-error access mode = EO
	    Standard-output access mode	= SPOOL
	    Standard-output name = beaker:/usr2/jrroma/tmp/example.o129
	    Shell = DEFAULT
	    Umask =  22

	  helium@beaker.monsanto.com;  type=PIPE;  [ENABLED, INACTIVE];	 pri=16	 lim=1
	    0 depart;	0 route;   0 queued;   0 wait;	 0 hold;   0 arrive;

	  Again, information on the status of jobs on remote machines
	  can be obtained by using the "@node" syntax to indicate
	  where to get the information.

     Why is my request not running?
	  Occasionally,	your job may be	in the Waiting or Queued state
	  and it might not be clear why	it is not running.
	  Determination	of the reason can be complicated.  NQS allows
	  system managers to set limits	on the number of jobs that can
	  run at a time.  There are queue run limits, which limit the
	  number of jobs that can run in a queue at a time, and
	  queue user run limits, which limit the number of jobs a
	  particular user can run in that queue at a time.  In the
	  same manner there are global run and user run limits, which
	  determine the total number of jobs that can run on the
	  system and the number of jobs a person can have running at
	  any one time, respectively.

	  An investigation of the interactions of these	limits and the
	  mix of jobs on the system should indicate the	reason a
	  particular request is	not running.

     How do I delete a job?
	  Qdel is the command that deletes NQS jobs.  It takes as a
	  parameter the	identifier of the job or jobs to be deleted.
	  The identifier consists of the sequence number and the
	  originating host of the job separated	by a period. The
	  sequence number will be reported when	you submit the job,
	  and it is shown when you do a	qstat on the job.  So the
	  identifier of	a job which is sequence	number 217 and was
	  originally submitted on beaker is 217.beaker.

	  If this job is queued	on beaker, the appropriate command is:

	       qdel 217.beaker

	  If the job is	running, you must add the -k switch which
	  indicates that the running job is to be killed.

	  Local jobs can be deleted by request name with the -r
	  switch.  The argument to the -r switch is the request pattern
	  to delete.  If the -c switch is used with the -r switch,
	  then the user is prompted to confirm the deletion of the
	  job.

	  If this job submitted from beaker is now running on a remote
	  machine, you will need to add the remote system name, as in:

	       qdel -k 217.beaker@ddcs1

	  where	ddcs1 is the name of the remote	machine	where the job
	  is running.  This will send a	message	to ddcs1 to delete the
	  request 217 which originated on beaker.
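
	  The rules above for building a qdel command line can be
	  summarized in a small sketch.  The function qdel_command is
	  hypothetical; it only prints the command it would use and
	  does not run qdel.

```shell
# Hypothetical helper: print (not run) the qdel command for a request.
#   $1  sequence number
#   $2  originating host
#   $3  "running" if the job is running (adds -k), otherwise "queued"
#   $4  optional remote machine where the job is running
qdel_command() {
    seqno=$1 origin=$2 state=$3 remote=${4:-}
    cmd="qdel"
    if [ "$state" = running ]; then
        cmd="$cmd -k"
    fi
    cmd="$cmd $seqno.$origin"
    if [ -n "$remote" ]; then
        cmd="${cmd}@${remote}"
    fi
    echo "$cmd"
}

qdel_command 217 beaker queued          # job queued on beaker
qdel_command 217 beaker running ddcs1   # job running on ddcs1
```

	  The two calls print the same commands given in the examples
	  above.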

     Advanced Information:
	  There	are other commands one can use to modify requests, but
	  which	are not	used as	often.	These are:

	  Qhold
	       Qhold holds all queued or waiting NQS requests given on
	       the command line. Qhold will not	hold a running
	       request.

	  Qrls
	       Qrls releases a held request and makes it eligible to be
	       scheduled to run.  Qrls cannot release a job which is
	       not being held.

	  Qsuspend
	       Qsuspend	takes a	running	request	and causes it to have
	       no access to the	cpu.  That is, it will no longer run.
	       One cannot suspend a request which is not in the	NQS
	       Running state.

	  Qresume
	       Qresume is the inverse operation	of Qsuspend.  It takes
	       a suspended job and lets	the cpu	run the	job again.
	       Qresume cannot resume a job which is not	suspended.
	       Qsuspend and Qresume have only been tested on SGI
	       machines running IRIX and IBM RS6000s running AIX.

	  Qlimit
	       Qlimit does not modify the request, but shows the
	       supported batch limits and shell strategy for the local
	       or remote host.  Here is an example of its output:

	    Core file size limit (-lc)
	    Data segment size limit (-ld)
	    Per-process	permanent file size limit (-lf)
	    Nice value (-ln)
	    Stack segment size limit (-ls)
	    Per-process	cpu time limit (-lt)
	    Working set	limit (-lw)

	    Shell strategy = FREE

	  For more information on these	commands, consult their
	  respective man pages.

     Who do I contact when something goes wrong?
	  Contact your local system manager.

     History
	  Origin: Monsanto Company

	  May 1992 - John Roman, Monsanto
	  Original version

	  August 1994 -	John Roman, Monsanto
	  Version 3.36