Center for Health Data Science, July 2022
- Introduction
- A note on pseudo code
- Exercise 1: Navigating Files and Directories
- Exercise 2: Project Organization
- Exercise 3: Working with Files
- Exercise 4: More Bash Commands - Part 1: wc, sed & cut
- Exercise 4: More Bash Commands - Part 2: sort, paste, awk & grep
- Exercise 5: Redirection & Pipes
- Exercise 6: Part 1: Shell Scripts
- Exercise 6: Part 2: Loops
- Exercise 7: Software Installation, Upkeep & More
This markdown document contains exercises for the introductory workshop on command line use, entitled Just Bash It, developed and hosted by the Center for Health Data Science (HeaDS) at the Faculty of Health and Medical Sciences, in collaboration with the University Library (KUB) Data Lab.
This one-day workshop is targeted at biomedical and health researchers at the University of Copenhagen with no prior experience in bash command line use.
The workshop consists of a slideshow with slides for each section, in addition to hands-on presentations (code-along) and accompanying exercises.
Some exercises and explanations in this document contain what is called ‘pseudo code’. Pseudo code is an abstracted way to write an idea of code. It will not necessarily run when executed, so do not copy pseudo code straight into the command line. Rather it explains the idea of how your code should be structured. We will explicitly let you know when a snippet is pseudo code.
Pseudo code can look like this:
wc [your file]
In the above example, square brackets mean you need to replace what is between them with the actual file, expression, etc. you want to use. The square brackets themselves are not part of the code. For example, if you wanted to word count the readme file, you would replace [your file] with README.md:
wc README.md
On your command line, go to where you have downloaded the course materials. If the directory is zipped, unzip it. (This might be easier to do with the graphical user interface).
- List the files and directories in the top directory Just-Bash-It. Which file was last updated? Which file is the largest?
- Go to the Examples directory and list its contents.
- Move the text file mytextfile.txt from the images folder to the docs folder. Confirm that it is in the right place.
- Make a copy of the mytextfile.txt file (now in docs) and rename the copy to whatever you’d like.
- Move to the Examples directory and make a new folder here called TEMP. Now move your copied file from point 4. above to the TEMP folder.
- Remove the whole TEMP directory including the file within it. Do you run into any problems with trying to do this?
  You need a flag to remove a whole folder; try the manual help for the remove command, man rm, to figure out what flag this is.
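The file operations in this exercise can be rehearsed safely in a throwaway directory before touching the course materials. The following is a minimal sketch only: the sandbox location and the file name copy.txt are made up for illustration, and the layout merely mimics the images and docs folders mentioned above.

```shell
# Hypothetical sandbox under a temporary directory (not the course materials)
sandbox=$(mktemp -d)
mkdir -p "$sandbox/images" "$sandbox/docs"
echo "hello" > "$sandbox/images/mytextfile.txt"

# ls -l shows size and modification time; -t sorts by time, -S sorts by size
ls -lt "$sandbox/images"

# Move the file, then copy it under a new name
mv "$sandbox/images/mytextfile.txt" "$sandbox/docs/"
cp "$sandbox/docs/mytextfile.txt" "$sandbox/docs/copy.txt"

# Make a new folder and move the copy into it
mkdir "$sandbox/TEMP"
mv "$sandbox/docs/copy.txt" "$sandbox/TEMP/"

# rmdir refuses to delete a non-empty directory; rm needs a recursive flag for that
rmdir "$sandbox/TEMP" 2>/dev/null || echo "rmdir failed: directory not empty"
rm -r "$sandbox/TEMP"
```

Because everything lives under a temporary directory, a mistake here cannot delete any real files.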
Let’s get structured!
- Make a projects directory at /home/user/Desktop on your computer with all the sub-directories shown on slide 30 in the slideshow. You are free to name the project within the projects directory whatever you would like, e.g. Just_Bash_It, Intro_to_command_line, First_Project, etc.
  Cheat Sheet 1 in the slideshow will have the commands you need.
You should have downloaded the course materials, including the raw data files, to your computer. They may be in your downloads directory, or perhaps you moved them somewhere else.
- Using the command line, navigate to where you downloaded the course materials and go to the directory named Data. Here you should see three files, all with the extension .gz. Move these three files to the project directory you have made and place them in the correct sub-directory.
- How large (in bytes and disk space) are the data files you moved from the course materials?
- Check the permissions of the data files from the course materials. Do you have permission to read, write and execute them? If you are not allowed to execute them, what is the reason for this?
- Make a new file called Readme within your project directory. Check the permissions of the file you just made and modify the permission of this file so the ‘group’ can write to the file (HINT: chmod).
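As a rough illustration of how permissions and sizes can be inspected and changed, here is a small sketch on a scratch file. The temporary directory is an assumption for the example; the starting mode 644 is set explicitly so the output is predictable and may differ from what your own new files get.

```shell
# Hypothetical scratch file, not part of the course materials
workdir=$(mktemp -d)
touch "$workdir/Readme"

# Start from a known state: owner read/write, group and others read-only
chmod 644 "$workdir/Readme"
ls -l "$workdir/Readme"      # permission string starts with -rw-r--r--

# g+w adds (+) write permission (w) for the group (g)
chmod g+w "$workdir/Readme"
ls -l "$workdir/Readme"      # permission string now starts with -rw-rw-r--

# Size in bytes (wc -c) and space used on disk (du -h)
wc -c < "$workdir/Readme"
du -h "$workdir/Readme"
```

The same symbolic form works for the owner (u) and others (o), and a minus sign removes a permission instead of adding it.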
- Go to your project directory and edit the readme file you made in Exercise 2 using one of the editors nano, vim or vi, emacs. N.B. you might only have one of them installed!
  A readme file should contain information about what a certain directory contains, its purpose and who owns it/is the editor.
- Move to the sub-directory Data/Raw. Here you should have the three data files you moved from the course material directory.
Two of these files are fastq files; they contain RNAseq reads from a bulk-RNA sequencing experiment using Arabidopsis thaliana (thale cress). The suffixes _R1 and _R2 denote that the reads are paired-end (i.e. read 1 and read 2). The third file, with the extension .gff, contains annotations of genes and other genomic features from the organism of study, Arabidopsis thaliana.
- Expand the GCF_genomicAnnotation.gff.gz annotation file while making sure you keep a copy of the compressed version.
  (HINT: look into what flags/arguments need to be specified when decompressing.)
- Try out the commands less, cat, head, tail on the expanded .gff file.
The file contains the fields (columns) shown below, with information on annotation source, region, sequence start and end position, and strand, as well as a column containing various information on the sequences, i.e. entry id, gene name, locus, ids, tags etc.
NC ID | Source | Region | Start | End | Strand | Gene transcript etc. |
---|---|---|---|---|---|---|
NC_003070.9 | RefSeq | exon | 4706 | 5095 | + | gene=ARV1;locus_tag |
Note that the file header (first part of the file) is denoted by hashtags.
- Rename the expanded .gff to Annotation.gff. Move the Annotation.gff to your Scratch directory.
- Look at the content of one of the two fq.gz files with RNA sequencing reads, N.B. this time without expanding the file! How many lines are annotated for each read?
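One possible way to decompress a file while keeping the compressed original, and to peek into a compressed file without expanding it on disk, is sketched below. The toy file is invented for the example, and streaming through gzip -dc is just one of several options your system may offer.

```shell
# Hypothetical toy file standing in for the course data
workdir=$(mktemp -d)
printf 'line1\nline2\nline3\nline4\nline5\n' > "$workdir/toy.txt"
gzip "$workdir/toy.txt"                 # produces toy.txt.gz and removes toy.txt

# -d decompresses and -c writes to standard output, so the .gz file is left untouched
gzip -dc "$workdir/toy.txt.gz" > "$workdir/toy.txt"
ls "$workdir"                           # both toy.txt and toy.txt.gz are present

# Peek at the first lines of the compressed file without writing anything to disk
first_two=$(gzip -dc "$workdir/toy.txt.gz" | head -n 2)
echo "$first_two"
```

Streaming the decompressed text into head or less is handy for large sequencing files, where a full expanded copy would take up a lot of disk space.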
You will now test out some of the new commands introduced in the slide show and the command line presentation.
Go to your directory Scratch, where you should have a copy of the unzipped GCF_genomicAnnotation.gff.gz file named Annotation.gff, which you made in exercise 3.5 above. For inspiration on how to solve the questions below, have a look at slide 46.
N.B. If you are not sure what arguments (flags) a given command takes, you can always use man [name_of_command] to look them up.
- Figure out how many lines, words and characters the Annotation.gff file contains using wc.
- As we are not interested in the header lines, denoted by hashtags, remove these from the Annotation.gff and name the new file Annotation_tmp.gff. To do this you can employ the command sed, see pseudo code below. You will need to figure out the line numbers of the first line and last line to remove. What do you think the d refers to?
sed '[number first line],[number last line]d' Annotation.gff > Annotation_tmp.gff
Have a look at the Annotation_tmp.gff. Does it look correct now, i.e. no headers with hashtags? And does it have the number of lines you expect it to have?
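To see the line-range form of sed in action on something small, here is a sketch using an invented five-line file with three header lines. The file name toy.gff and its contents are assumptions for illustration; the line numbers in your real file will be different.

```shell
# Hypothetical toy file: three header lines followed by two data lines
workdir=$(mktemp -d)
printf '#header 1\n#header 2\n#header 3\nexon\t10\t20\ngene\t30\t45\n' > "$workdir/toy.gff"

# wc prints line, word and byte counts; -l restricts it to the line count
wc "$workdir/toy.gff"

# grep -n prefixes each match with its line number, which helps locate the header
grep -n '^#' "$workdir/toy.gff"

# Remove lines 1 through 3 and write the remaining lines to a new file
sed '1,3d' "$workdir/toy.gff" > "$workdir/toy_tmp.gff"
wc -l < "$workdir/toy_tmp.gff"
```

Comparing the line counts before and after is a quick sanity check: the difference should equal the number of header lines.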
- Use the command cut to make a new file which contains only fields 3,4,5 (i.e. columns with region, start and end) from the Annotation_tmp.gff and name it whatever you’d like. Check that the file looks correct.
We would also like to have the gene names. They are in field 7 in the Annotation_tmp.gff, but field 7 itself contains several entries (ids, gene name, locus, tags etc.) separated by a ;. We refer to such separating characters as delimiters. One could say that field 7 contains sub-fields delimited by ;.
- We want to get the sub-field starting with gene=, which is the gene name. To extract this information we will use cut twice. First, we will cut out column 7 from Annotation_tmp.gff and save it to a temporary file. Then, we cut the temporary file to get field 5.
  Run the code provided below. Try to understand what happens in each line. In particular:
  - What do the flags -d and -f do?
  - Why are we setting the -f flag to 5, not 7 (HINT: field vs sub-field)?
cut -f 7 Annotation_tmp.gff > col7.tmp
cut -d ';' -f 5 col7.tmp > gene_names
Check that the file looks correct after extraction, i.e. does it seem to be the field you are interested in?
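The two-step idea (first a tab-delimited field, then a sub-field within it) can be tried on a tiny made-up table. In this sketch the file toy.tsv, its three columns and the sub-field positions are all assumptions chosen for illustration; they do not match the layout of the real annotation file.

```shell
# Hypothetical table: three tab-separated fields, the third holds ';'-separated sub-fields
workdir=$(mktemp -d)
printf 'exon\t10\tID=a1;gene=ABC1;tag=x\n' >  "$workdir/toy.tsv"
printf 'gene\t30\tID=b2;gene=XYZ2;tag=y\n' >> "$workdir/toy.tsv"

# cut splits on tabs by default; -f selects fields by position
cut -f 1,2 "$workdir/toy.tsv"

# Step 1: isolate the third field. Step 2: split it on ';' and take sub-field 2
cut -f 3 "$workdir/toy.tsv" > "$workdir/col3.tmp"
cut -d ';' -f 2 "$workdir/col3.tmp" > "$workdir/genes.txt"
cat "$workdir/genes.txt"
```

The second cut counts positions within the already extracted column, which is why its field number is independent of the column number used in the first step.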
- We would like to remove the repetitive gene= in each line to obtain just the clean gene name. One way to do this is a search and replace with sed. The syntax is shown as pseudo code below. The sed command requires you to specify what pattern to match, [pattern to match], and what to replace it with, [Replace with]. The patterns are separated by slashes and the whole expression is enclosed in single quotes (').
sed 's/[pattern to match]/[Replace with]/g' [input file] > [output file]
For the pseudo code chunk, figure out:
- What should the input file be?
- Change [pattern to match] to the pattern you want to match and [Replace with] to what you want to replace it with (HINT: replace with empty/nothing).
- What do the s and g denote?
- Run the command and save the output to whatever file name you’d like.
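For a feel of how search and replace behaves, the following sketch strips a prefix from every line of an invented file. The prefix name= and the file names are assumptions picked for the example, so the pattern you need for the exercise will differ.

```shell
# Hypothetical input: each line carries a "name=" prefix we want to drop
workdir=$(mktemp -d)
printf 'name=ABC1\nname=XYZ2\n' > "$workdir/names.txt"

# The text between the first two slashes is matched, the text between the
# second and third slash (here: nothing) is put in its place
sed 's/name=//g' "$workdir/names.txt" > "$workdir/names_clean.txt"
cat "$workdir/names_clean.txt"
```

Writing to a new file rather than overwriting the input keeps the original available in case the pattern turns out to match more than intended.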
- You should now have one file containing only fields 3, 4 & 5 (question 2.) and one containing only the sub-field with gene names from field 7 (question 5.).
  Paste these two files together into a single file using the command paste, and name this file Annotation_Gene.gff. You will need to use the flag -d to specify what kind of delimiter should be used to paste the fields together; you should use a tab delimiter, denoted '\t'.
- Run and decipher the command below.
awk -F '\t' 'OFS="\t" {$5=$3-$2}{print}' Annotation_Gene.gff > Annotation_Gene_Len.gff
  - What is the output of the command?
  - What does the flag -F specify?
  - What does OFS="\t" mean? HINT: Google this!
  - What is going on inside the curly brackets?
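A small end-to-end sketch of joining two files column-wise and then deriving a new column is given below. The input files and their values are invented, and the awk program sets OFS once in a BEGIN block, which is a common alternative spelling and not the exact form used in the exercise command.

```shell
# Hypothetical inputs: coordinates in one file, names in another (same line order)
workdir=$(mktemp -d)
printf 'exon\t10\t25\ngene\t30\t30\n' > "$workdir/coords.tsv"
printf 'ABC1\nXYZ2\n' > "$workdir/names.txt"

# paste joins the files line by line; -d sets the delimiter placed between them
paste -d '\t' "$workdir/coords.tsv" "$workdir/names.txt" > "$workdir/merged.tsv"

# -F sets the input field separator; OFS is the separator used when awk prints.
# Assigning to $5 creates a fifth field (field 3 minus field 2), then the line is printed.
awk -F '\t' 'BEGIN { OFS = "\t" } { $5 = $3 - $2; print }' "$workdir/merged.tsv" > "$workdir/merged_len.tsv"
cat "$workdir/merged_len.tsv"
```

Note that paste relies purely on line order, so both inputs must have the same number of lines in the same order for the columns to line up.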
- Let’s have a look at the content of your final file.
  - Are there any gene sequences with length 0 in the annotation file? Try the command sort with flags -k 5.
  - What is the name of the gene with the longest sequence annotated in your file? Try the command sort with flags -k 5 -nr.
  - Does our organism of study, Arabidopsis thaliana, have the TERT gene? Try the command grep.
  - All living organisms have polymerase genes, including Arabidopsis thaliana. How many types of POL genes are annotated?
  - How many gene annotation lines in the file pertain to transfer RNA (tRNA)? Try the command grep with flag -c.
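The sort and grep flags mentioned above can be explored on a three-line toy table first. The table below is invented for illustration and only assumes the same five-column layout (Region, Start, End, GeneName, Length) as the file built in this exercise.

```shell
# Hypothetical five-column table: Region, Start, End, GeneName, Length
workdir=$(mktemp -d)
printf 'exon\t10\t25\tABC1\t15\n'   >  "$workdir/toy_len.tsv"
printf 'gene\t30\t30\tXYZ2\t0\n'    >> "$workdir/toy_len.tsv"
printf 'tRNA\t50\t150\tTRN3\t100\n' >> "$workdir/toy_len.tsv"

# -k 5 sorts on the fifth field; -n compares numerically, -r reverses the order.
# Without -n the values would be compared as text rather than as numbers.
longest=$(sort -k 5 -nr "$workdir/toy_len.tsv" | head -n 1 | cut -f 4)
echo "$longest"

# grep prints matching lines; with -c it prints the number of matching lines instead
trna_lines=$(grep -c 'tRNA' "$workdir/toy_len.tsv")
echo "$trna_lines"
```

Keep in mind that grep matches anywhere on the line, so a search term may also hit lines where it appears inside a longer name.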
- Your new file Annotation_Gene_Len.gff does not have any headers on each field. Make a new file called header.gff (use nano for this), copy the content below into the file and save/close it.
Region Start End GeneName Length
Now, use the command cat to bind header.gff and Annotation_Gene_Len.gff together. Name the output Annotation_[TodaysDate].gff.
When you have checked that the file looks correct, move the file to your Results directory. N.B. go to the sub-directory and check that you have moved it correctly! You can now remove (delete) all the temporary files in your Scratch directory. WELL DONE!
- Copy (cp) the GCF_genomicAnnotation.gff from your Data/Raw dir to the Scratch dir. This is done by specifying the path to where the copy should go. Move to the Scratch directory.
Let’s try some piping (chaining) of commands.
- You will chain four commands with pipes (|), step by step:
  - Remove the header rows (those beginning with hashtags) in the file, like you did in point 2, Exercise 4, above.
  - Extract (cut) the column that contains the annotation Region (exon, CDS, etc.).
  - Sort the extracted column with sort.
  - Get the unique elements from this column with the command uniq.
- Re-run the command line you used above, but this time redirect the output directly to the Data/Generated directory by specifying the path and the name you would like the output file to have. Check the Data/Generated directory to confirm that you have correctly made the file.
- Using a single command line, figure out the name of the microRNA (miRNA) gene which has the smallest genomic starting coordinate. You will need to combine the commands cut, sort and grep to achieve this. HINT: miRNA gene names begin with MIR.
- You will now redo points 2-6 from Exercise 4 (parts 1 & 2) above. You should end up with a file containing the 4 columns: Region, Start, End and GeneName (only). To achieve this, chain together commands from points 2-6.
  N.B.: You do NOT need to do everything in one command line, but try to reduce the number of intermediate files to as few as possible. It is possible to get the output file in two command lines (i.e. only one intermediate file). HINT: the commands do not need to be combined in the same order as in Exercise 4; in fact, they shouldn’t be.
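As an illustration of how pipes and redirection fit together, here is a sketch of a four-stage pipeline on an invented file. The file toy.gff has a single header line and the region in column 1, which are simplifying assumptions and not the layout of the real annotation file.

```shell
# Hypothetical file: one header line, then rows with the region in column 1
workdir=$(mktemp -d)
printf '#header\nexon\t10\t25\ngene\t30\t45\nexon\t50\t60\nCDS\t70\t90\n' > "$workdir/toy.gff"

# Each | feeds the output of the command on its left into the command on its right:
# drop the header line, keep column 1, sort it, and collapse repeated values.
# uniq only removes adjacent repeats, which is why sort comes first.
# The final > redirects the end result to a file at the given path.
sed '1d' "$workdir/toy.gff" | cut -f 1 | sort | uniq > "$workdir/regions.txt"
cat "$workdir/regions.txt"
```

No intermediate files are written: the data flows from one command to the next in memory, and only the last stage touches the disk.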
We will now save the commands from exercises 4 and 5 in a script so we don’t lose them and can easily re-run our analysis with a different gff file.
We will use the history command to easily get a record of what we have done so far.
First, let’s see how much is in your history:
history | wc
If you have a lot of commands saved in your history you might only want to see the last 20 or so. This command will show you the last 20:
history | tail -n 20
- Create a new file, either with the command line or the editor of your choice, and name it ex4.sh. This is the file which will contain your script. A script is just a collection of commands that are executed one after another.
- Using your history, copy the commands you used to solve exercise 4 into ex4.sh. Save it.
- Run ex4.sh by calling bash. Does it run without error? If not, correct what is wrong until it does. Check that it creates the correct files, i.e. they should look the same as when you manually executed the commands.
bash ex4.sh
- Add lines to your script ex4.sh that will remove the temporary files created. Check that the files are indeed removed, and no other files have been deleted by accident. If you did delete something you can just re-download the respective file from GitHub.
- Now replace the name of the input file, GCF_genomicAnnotation.gff.gz, with an argument so that you can run your script on different input files. Test that this works.
- Run your script on ecoli.gff.gz by passing the file name as an argument.
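How a script picks up a file name from the command line can be sketched with a deliberately tiny example. The script name count_lines.sh, its contents and the toy input are all hypothetical; the point is only that "$1" inside a script stands for the first argument given after the script name.

```shell
workdir=$(mktemp -d)

# Write a tiny hypothetical script; inside it, "$1" is the first command-line argument
cat > "$workdir/count_lines.sh" <<'EOF'
#!/usr/bin/env bash
input="$1"                        # file name passed on the command line
gzip -dc "$input" > "$input.tmp"  # temporary, decompressed copy
wc -l < "$input.tmp"
rm "$input.tmp"                   # clean up the temporary file
EOF

# Toy compressed input, then run the script with the file name as its argument
printf 'a\nb\nc\n' | gzip > "$workdir/toy.gff.gz"
n_lines=$(bash "$workdir/count_lines.sh" "$workdir/toy.gff.gz")
echo "$n_lines"
```

Quoting "$1" keeps file names with spaces intact, and removing only the specific temporary file the script created avoids deleting anything else by accident.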
- Make a short script with a for loop that does the following:
  - go to the data folder
  - for each zipped file, i.e. each file ending with .gz:
    - display the file name
    - display the first 10 lines without unzipping the entire file like you did in Exercise 3, question 6.
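The general shape of a for loop over files is sketched below on two invented compressed files. The directory, the file names and the log file preview.log are assumptions for the example, and streaming with gzip -dc is one possible way to read a compressed file.

```shell
# Hypothetical data folder with two small compressed files
workdir=$(mktemp -d)
printf 'a1\na2\n' | gzip > "$workdir/first.txt.gz"
printf 'b1\nb2\n' | gzip > "$workdir/second.txt.gz"

cd "$workdir"
# The glob *.gz expands to every matching file name; the loop body runs once per file
for f in *.gz; do
    echo "== $f =="
    gzip -dc "$f" | head -n 10    # stream-decompress and show at most 10 lines
done > "$workdir/preview.log"
cat "$workdir/preview.log"
```

The loop variable f takes each file name in turn, so the same two commands are applied to every file without having to type the names by hand.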
In this exercise we will first install a package manager and then use it to install a piece of software.
- Installer:
  - For OS X systems: Install Homebrew (N.B. this may take some time). Figure out how to do it here (it is a one-liner in the command line): https://brew.sh/
  - For Ubuntu and Linux systems, Windows (WSL) systems & MobaXterm users: You most likely already have apt-get installed. Figure out what version you have. HINT: --version.
Now that you have installed a package manager you will use it to install a command line tool on your laptop. We will install FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) which is a tool for quality control of DNA and RNA sequencing reads.
- FastQC:
  - For OS X systems, Ubuntu/Linux systems & Windows (WSL) systems: If you already have fastqc installed, then update it; otherwise install it using the appropriate install command for your system.
  - MobaXterm users: You will need to install perl before you can use fastqc:
    - Use apt-get to install perl. Check it is installed with --version.
    - Unfortunately fastqc is not available for MobaXterm with apt-get, so download fastqc here: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.9.zip.
    - Use
We will check that fastqc is correctly installed and works by running the tool on our two files (_R1.fastq and _R2.fastq) containing sequencing reads.
- Move to the Data/Raw/ directory. Modify and use the command below to run fastqc. You should specify the path to the directory you want the output to go to, in this case Data/Generated/.
Have a look at your output.
Extra: Try using open to open a file (xlsx, docx, etc.) and an app (browser, RStudio, etc.) via the command line.