The `$` sign denotes commands in the shell. Do not type in the dollar sign.
In some places you need to type in something that will vary for each of you. When that is the case, the thing that you should change is between angle brackets, like `<this>`.
We will use the program called `ssh` to log in and out of SAGA.
Task
- open a terminal window
- type in the following (don't include the `$`):

$ ssh <your_FEIDE_user_name>@saga.sigma2.no
It will then request a password. Type it in. Remember, passwords are case sensitive!
NOTE: If you have the same username on your virtual machine and on SAGA, you don't need to type in your username and the at sign in the ssh command.
Sometimes you might get confused about which computer you're currently on. You can use the command `hostname` to figure out where you are.
Task
- type in `hostname` in a terminal window on your computer
- type in `hostname` in a terminal window on SAGA
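For example (the exact names will vary; the output below is an illustration, not the real machine names):

$ hostname
login-1.saga.sigma2.no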
There are three main storage locations that you need to know about on SAGA (Fig. 1).
- Your home area (`$HOME`) is /cluster/home/username. When you log in, you will automatically land in what is called your home area. You will commonly not use this location much, and it should not be used to store sequence data that is from the Veterinary Institute (Fig. 1).
- The work area, /cluster/projects/nn9305k (Fig. 1). This is where the Veterinary Institute stores the databases we use and the software we install ourselves. You can store intermediate files here while you are analyzing data, but this area should not be used for storing RAW sequence data. Think of it as one of the common resource areas of the Veterinary Institute.
- The last one is the project's data and analysis backup area. This is /cluster/projects/nn9305k (Fig. 1). This is the area where all RAW sequence data that belongs to the Veterinary Institute should be stored.
Figure 1 SAGA data storage overview for the Veterinary Institute. Variables that can be used in slurm scripts or commands are indicated between brackets. (Figure was copied from https://github.com/uio-cees/hpc/wiki/Data-Storage and then modified for this website)
There is also a fourth area that will be discussed below, which is mainly used for data storage while analyzing data. It should not be used for long-term storage, and after analysis the data should be transferred to the project or the work area. For a detailed overview of these areas and how to use them, please look at the description of disks_areas at the Norwegian Veterinary Institute, to see the data workflow when launching/running jobs.
Task
- use `cd` to go to the work-project area
- read the `README` file that is in the top directory of the home area
- use the knowledge gained from the README file to figure out where adapter files are stored on SAGA
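A minimal sketch of these steps; the README file name is taken from the task above, and `less` is just one way of reading it:

$ cd /cluster/projects/nn9305k
$ ls
$ less README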
Next, we will do a bit of setup so that you are properly set up on SAGA.
Task
Go to this webpage and follow the instructions from "On first time login" and below.
NB: to log out of SAGA, type `exit` or `logout`.
Note: if you've done this already, you can skip this step.
During the setup process, you are creating a separate "home directory" inside the project area. You use this location for things that you don't collaborate with other people on. If it is a collaborative project, that is, if other people on SAGA will also use it, you use directories at the top level.
We will now explore a couple of methods for getting data to and from SAGA.
NOTE: any transfer of data to SAGA happens with permission from Karin!
Task
- right click on this link and download the file.
- go to the directory you downloaded the file to on your computer and locate the file
- open the file. How many fasta sequences are in the file?
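One quick way to count them, assuming the downloaded file is named `mb.fsa` (the name used below), is to count the fasta header lines:

$ grep -c ">" mb.fsa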
We will now transfer this file to SAGA.
Task
- on the virtual machine, type in

$ scp mb.fsa <your_user_name>@saga.sigma2.no:

- open a different terminal window and log in to SAGA
- find out where the file is now (hint: `ls`)
You need to have the colon on the end, otherwise you just end up copying the file to a local file named your_user_name@saga.sigma2.no.
Note: if you have the same username on your virtual machine as on SAGA, you don't need to type in your user name or the at sign.
You have now transferred a file from your computer to SAGA. That file ended up in your home area. What do you have to do to get it into your home area in the project area? If you want to put it in a different location than the default place you end up in when you log in, you need to specify the path to that location after the colon. That is:
$ scp filename your_user_name@saga.sigma2.no:/path/to/different/place
Task
Try putting the `mb.fsa` file into your project home area using `scp`.
We have now transferred a file to SAGA. How do we get files back to our local computer? To do that, we switch things around in the command:
$ scp your_user_name@saga.sigma2.no:/path/to/file/on/SAGA local_place_path
You need to do this on your local machine.
More often than not, you want to store things in the location you're currently at, which means that `local_place_path` would be a simple dot, i.e. `.`.
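For example, to fetch `mb.fsa` from your SAGA home area into the directory you are currently standing in:

$ scp your_user_name@saga.sigma2.no:mb.fsa .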
`rsync` is a command that lets you transfer things within and between computers, while ensuring that the data is not corrupted on the way.
This command has many options, as do other commands under unix. One set of options that is commonly used with `rsync` is `-rauPW`.
Task
Have a look at the Wikipedia page for `rsync`. Can you figure out the syntax for `rsync`?
You can either:
- redo `rsync` with the same options: if the transfer was successful, nothing will be synchronized (the content is the same).
- use hash programs that generate a code based on file content (for both the original and the transferred/copied file). If both codes are identical, the content of the files is identical, i.e. the file transfer was successful.

Example: `md5sum file_origin` and `md5sum file_transferred`. The hash codes should be identical.
It is better to automate the process if you have many files in the same folder (see the sketch after this list):
- create a temporary file, `tempfile_md5sum.txt`
- generate an md5sum for each file (use shell wildcards) and write the results to `tempfile_md5sum.txt`.
Example: `md5sum yourfiles | tee "tempfile_md5sum.txt"`
- check the files against their checksums; the output will indicate if there are differences:
`md5sum -c "tempfile_md5sum.txt"`
(if there are differences, restart the transfer for the files where the checksum did not match)
- delete your temporary file
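A minimal sketch of that workflow, assuming your files match the pattern `*.fastq.gz` (a hypothetical pattern; adjust it to your own data):

# generate a checksum for every matching file and save them to a temporary file
$ md5sum *.fastq.gz | tee tempfile_md5sum.txt
# after the transfer, run this in the destination directory to verify the copies
# (copy the checksum file along with the data)
$ md5sum -c tempfile_md5sum.txt
# remove the temporary file once everything reports OK
$ rm tempfile_md5sum.txt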
Use `rsync` to copy all your files from one directory to another (i.e. moving your directories on SAGA).
Example: from /work/projects/nn9305k/Mydirectory to Mydirectory in the parent directory /projects/nn9305k/.
NB: the parent directory is just the level above where you want Mydirectory to be.
- use `rsync -rauPW /origin_path/Mydirectory /parent_directory_destination`
- NB: do not write /parent_directory_destination/Mydirectory, otherwise it will create a subdirectory /../Mydirectory/Mydirectory
- Run it twice: if you get errors, it is probably a permissions problem -> contact Karin
- To check whether the copy/synchronisation of files from the origin to the destination folder worked, we build two temporary files (containing the hash codes for all files in the origin and destination folders, respectively).
- You need to do this from the respective parent directories of the origin and destination folders.
- The command finds all the files in your directory, executes `md5sum` on each file, sorts the hashes, and then writes all those hash codes into a file. The idea is that if the copy/synchronisation succeeded, the hash codes for all files should be identical in the origin and destination directories.

In the parent directory of the origin: `find origin_Mydirectory -type f -exec md5sum {} + | sort -k 2 > origin_temp.txt`
In the parent directory of the destination: `find destination_Mydirectory -type f -exec md5sum {} + | sort -k 2 > destination_temp.txt`

- Then we check whether the hash codes generated for all the files in the origin and destination directories are identical. It should report that the files are identical. If not, the file contents of your directories differ OR you did not run the previous commands from the parent folders. (Note that each line also contains the file's path, so the directory names need to match for the comparison to come out clean.)

diff -s origin_temp.txt destination_temp.txt

- When you are sure that the contents of both directories are identical, you can remove the origin directory and the temporary files you created.
`wget` is very useful for getting data from a website to either your local computer or to a remote computer.
We will now try to get the `mb.fsa` file onto SAGA without going via our own local machine.
Task
- right click on the url for the file (see above), and copy the link.
- go to your SAGA terminal window
- make sure that you are in your project home area
- type in the following, pasting in the link you copied, and hit enter:

$ wget <paste_the_link_here>
You will now have downloaded the file directly onto SAGA.
There is a lot of software available on SAGA. We cannot have all of it available at the same time, because the programs would step on each other's toes. To organize the software, SAGA uses the module system.
Task
- on SAGA, type in `module avail`
- wait for a bit
- can you figure out how many versions of blast are installed?
- try typing in `module avail blast`, and see what happens.
As you can see, one of these has the word (default) attached to it. This means that if you do
$ module load <software_name_before_slash>
that is the version of the program that will be loaded.
If you want a different version than the default version, you need to include what comes after the `/`.
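For example (the version string below is hypothetical; pick one that `module avail blast` actually lists):

$ module load BLAST+/2.8.1-intel-2018b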
NB: You can also look at a short description of the software on SAGA here.
We will now try our hands at running a quick blast command.
NOTE: running things like this on the login nodes is not really "done". However, small test commands can be run. If you run things that either use too much time (30 minutes) or too much memory, the command will be terminated.
When we are `blast`ing, we need sequences that we're searching with (the query), and we need a database to search in. On SAGA, there are versions of the ncbi databases that live here:
/work/databases/bio/ncbi
Task
1. on SAGA, type in

$ module load blast+

2. type in

$ blastn -help

Can you figure out which options you need to specify a query and a database?
3. on SAGA, type in

$ blastn -query mb.fsa -db /work/databases/bio/ncbi/refseqgene
You will get output fairly quickly. You might want to save that output; do that by repeating the command and adding `-out <a_new_filename>`.
The previous command gave results quite quickly.
We will now try to search against a different database, a database that is bigger, and where searches thus take a longer time to run. For this, we will use a `qlogin` shell. With `qlogin`, you get a "normal" prompt, but you are using slurm-allocated resources, so you can use more resources than you can on the login nodes.
The options we will specify to `qlogin` are:
- account: which bucket of cpu hours to use. Ours is nn9305k
- time: for how long we want this command line window to run
- ntasks: the number of cpus to use
- mem-per-cpu: how much memory we want per cpu
The more you ask for, the longer you have to wait for things to start.
Task
1. Start `qlogin` with the following options:

qlogin --account=nn9305k --time=00:30:00 --ntasks=4 --mem-per-cpu=4G
You might have to wait a bit to get the prompt back.
2. Run `ls`. You will see that you are now in your home area. Use `cd` and `ls` to go to the location where the `mb.fsa` file is. You also have to re-load the blast module.
3. Try running the same blast command as above, but with the following change:
- use multiple CPUs, by adding `-num_threads 4`
Note: we are here using 4 CPUs, which is the same number of CPUs that we asked for when we started `qlogin`. It is important that these two numbers are the same.
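Putting together the blast command from above with the extra option, the full command looks like this:

$ blastn -query mb.fsa -db /work/databases/bio/ncbi/refseqgene -num_threads 4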
We will now do the same using a script file instead. To work with the queueing system in this way, we need to know a few commands:
- sbatch - submit a slurm script to the queue
- squeue - look at the current queue
- scancel - cancel a job which is in the queue
Task
We will now create a slurm script, and put our blast command into it (a sketch of the result is shown after this list).
- make sure you are in your project-home area
- copy the file `sample_slurm.slurm` from the samplefiles directory to this location
- open the file using `nano`
- edit the file so that you are using the same options that you used with the `qlogin` command above
- add the command to load the blast module at the bottom
- add the blast command line you used above under the module load command
- submit the script to the queue, by typing in

$ sbatch sample_slurm.slurm
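A minimal sketch of what the edited script could end up looking like; the slurm header fields mirror the `qlogin` options above, the job name is a hypothetical choice, and your copy of `sample_slurm.slurm` may contain additional lines:

#!/bin/bash
#SBATCH --job-name=blast_mb        # hypothetical job name
#SBATCH --account=nn9305k          # same options as used with qlogin above
#SBATCH --time=00:30:00
#SBATCH --ntasks=4
#SBATCH --mem-per-cpu=4G

# load the blast module
module load blast+

# the blast command, using as many threads as tasks requested
blastn -query mb.fsa -db /work/databases/bio/ncbi/refseqgene -num_threads 4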
Task
We will now look at jobs submitted to the cluster.
- try out the command `squeue`
- try looking at just the jobs under our project, by adding `-A nn9305k`
- try looking at just your jobs, by adding `-u <your_user_name>`

The status of jobs shows up under `ST`.
- PD: the job is waiting for resources
- R: running
- CF: the system has allocated resources, but the job is waiting for them to become ready
There are also other statuses.
Another important feature that you can find here is the job id. If you need to cancel a job, you need this number.
Task
We will now look at the output file that slurm gives us.
- use `ls` to find a file in this directory that ends with `.out`
- look at it.
If you need to cancel a job, this is the command you would give:
scancel <jobid>
Task
You will now edit the slurm script so that, instead of searching the refseqgene database, you search the nt database. This will take a long time!
After editing the file, submit it to the queue system using `sbatch`.
Then, use `squeue` and `scancel` to stop the job.
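The edited blast line would then point at the nt database; the path below assumes it follows the same pattern as the refseqgene database used earlier:

blastn -query mb.fsa -db /work/databases/bio/ncbi/nt -num_threads 4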
Optional task
If there's time, restart the slurm job above, and let it finish. Have a look at the output, both from slurm and from blast.