General UNIX notebook

0. Introduction

This workflow gives a basic introduction to the command line.

Most of the steps work with some basic text files provided in the Input_docs folder. If you want to run this tutorial you can download these files from here.

If working on the servers please keep in mind:

General info on this tutorial

  • grey box: the code that we want to execute
  • red box: urgent comments ;-)
  • Code: in some cases the code might be hidden and you only see a little box on the right-hand side. Hidden code means that you should have enough knowledge from the previous sessions to do this step, or that it is a test of your skills. If you run into issues you can reveal the code by clicking on the Code button
  • hyperlink: if the text is highlighted like this during a text section, then you can click on it and a hyperlink gives you more info (either the publication or an online tutorial)
  • Exercise: this will appear if you should have the required background to finish all the steps yourself. If you run into trouble, you can find the answers by clicking the Code button, which will reveal the answer.
  • Sometimes we also just ask questions for you to think about the output of certain programs. The text is then hidden and you have to hover over the “spoiler” to see the answer to the question.

General

Tutorials for bioinformatics

One example of a shorter Unix tutorial can be found here. The tutorial below should cover all basic commands as well, but in case something is missing it is always worthwhile to check other sources.

Basic Unix introduction

Basics

What is the shell?

The shell is a program that takes commands from the keyboard and gives them to the operating system to perform.

What is the terminal?

The terminal is a program that opens a window and lets you interact with the shell. Nowadays, we have graphical user interfaces (GUIs) in addition to command line interfaces (CLIs) such as the shell.

  • Accessing the terminal (to work on our servers, this is our most important tool)

    • Mac users: Search for terminal in spotlight and you are done.
    • Windows users: You will need some software for using the command line. One option is Cygwin and another alternative is MobaXterm. Finally, newer Windows versions come with their own Windows Subsystem for Linux (WSL), which can be installed as explained here.

If all is working you should see something like this:

If you are working from a Mac, most of the important things (nano, perl, python) are already installed. Windows users might need to set up some more things (especially a text editor, like nano; python and perl are also good to have).

The file system

The basic file system on Unix looks like this:

  • Unlike what you might know from Windows, Unix does not use drives.
  • The first directory in the file system is called the root directory.
  • With Unix we do not split our system into different drives; Unix always has a single tree.
  • Different storage devices may contain different branches of the tree, but there is always a single tree, e.g. /users/john/ and /users/mary/.
  • Here, users is on the first level of the tree and john and mary are on the second.
  • /bin is the directory that contains binaries, that is, some of the applications and programs you can run.
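
To get a feeling for the tree, you can list what sits directly under the root directory (a minimal example; the exact output differs between systems):

#list the directories directly under the root directory
ls /

#look at some of the programs stored in /bin
ls /bin | head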

Moving around folders via the terminal

Now that we know what the file system looks like, we want to move around. Especially on the servers this becomes important, since we cannot use the mouse anymore.

  • pwd = print working directory, find out where we are. Usually when we login we start from our home directory
    • This will be something like /Users/username on your own computer
#check where we are
pwd

Now lets find out how to move around

  • cd = change directory
#move into the next directory, called Desktop
cd Desktop/

#move one directory backwards, back to our personal homedirectory
cd ..

#from whereever you are, go directly into your homedirectory
cd ~

#on the NIOZ server, move up one level and then into the spang_team and Projects directories
cd ../spang_team/Projects/

#change to our home dir
cd ~

The second to last example is a bit longer, but essentially what we are doing is:

  • move up into the parent directory
  • go into the spang_team folder
  • from the spang_team directory, go into the Projects directory

In the last example, we use the tilde, i.e. ~, as shortcut to the home folder.

Pathnames

Below is some important syntax that covers the difference between absolute and relative pathnames.

Absolute pathnames

An absolute pathname begins at the root directory and follows the tree branch by branch until the path to the desired directory or file is complete. For example, on your computer the full path to the Desktop is: /Users/username/Desktop

If we want to go to the desktop using an absolute path we do

cd /Users/username/Desktop

Relative pathnames

A relative pathname starts from the working directory (the directory you are currently in). Relative paths allow you to use a couple of special notations to represent relative positions in the file system tree.

  • . (dot) The working directory itself
  • .. (dot dot) Refers to the working directory’s parent directory

If we want to go to the desktop using a relative path we do (assuming we start from our home directory)

cd Desktop

General comment:

When recording your script it is useful to always start with the absolute path to set your working directory. Afterwards, you can work with relative paths.
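
As a small sketch of this habit (the path is just a placeholder, adjust it to your own system):

#set the working directory once, using an absolute path
cd /Users/username/Documents/Unix_tutorial

#afterwards, relative paths are enough
ls Input_docs/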

General structure of a Unix command

The general structure of a command looks like this:

command [options] [arguments]

  • command is the name of the command
  • options is one or more adjustments to the command’s behavior, e.g. for ls we can use
    • Short notation: “-a”
    • Long notation: “--all”
  • arguments is one or more “things” upon which the command operates.
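
For example, with ls this looks as follows (note that the long notation --all works with GNU ls, e.g. on Linux servers, but not with the BSD ls that ships with macOS):

#command without options or arguments
ls

#command with an option (short and long notation do the same)
ls -a
ls --all

#command with an option and an argument
ls -l Desktop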

Working with the terminal = Basics

Viewing what files, folders, etc are present in our directories

Here, we use the command ls with an option -l

  • ls stands for list directory contents
  • everything starting with a minus symbol is an optional argument we can use
  • -l = use a long listing format
#list everything in the current directory
ls -l

#if we want to check for what other options ls has we do (exit the manual with q)
man ls

In case you want to check what a program does or what options it has, there are different ways to do this depending on the program. The most common are:

  • man ls
  • ls --help
  • ls -h

Generating and viewing files

Let’s first make a file using nano.

Nano is a basic text editor that lets us view and generate text.

#open a new document and name it random.txt
nano random.txt

If we type this and press enter, we will open a new document.

  • Type something in there into this document
  • Close the document with control + X
  • Type y to save changes and press enter

If we now use ls -l again, we see that a new file was generated.

We can open it in nano again, but there are some other options that are useful with extremely big files.

less

less is a program that lets you view text files (by pagination!)

less random.txt

Once started, less will display the text file one page at a time.

  • You can use the Up/Down arrow and Page Up/Page Down keys to move through the text file.
  • To exit less, type q.
  • G Go to the end of the text file
  • 1G Go to the beginning of the text file (or to the Nth line with “NG”)
  • /characters Search forward in the text file for an occurrence of the specified characters
  • n Repeat the previous search
  • h Display a complete list of less commands and options

For files with a lot of columns we can use

less -S random.txt

In this mode we can also use the right and left arrow keys to view columns that are further to the right.

tail

If you want to check the last 10 lines of a file, use tail

tail random.txt

I/O redirection to new files

By using some special notations we can redirect the output of many commands to files, devices, and even to the input of other commands.

Standard output (stdout)

By default, standard output directs its contents to the display. To redirect standard output to a file, the “>” character is used like this:

#redirect the output from ls to a new file
ls > file_list.txt

#check what happened
nano file_list.txt

#Use this symbol twice (“>>”) to append the result to an existing file
ls >> file_list.txt

#check what happened
nano file_list.txt

Standard input (stdin)

By default, standard input gets its contents from the keyboard. To redirect standard input from a file, the “<” character is used like this:

#redirect standard input from a file
sort < file_list.txt

# redirect standard output to another file
sort < file_list.txt > sorted_file_list.txt

#check what happened
nano sorted_file_list.txt

Escaping characters

Certain characters are significant to the shell; escaping is a method of quoting single characters. The escape character (\) preceding a character tells the shell to interpret that character literally.
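
A minimal example of what the backslash does:

#without escaping, the shell replaces $HOME with the name of your home directory
echo $HOME

#the backslash tells the shell to take the $ literally
echo \$HOME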

Making new folders

  • mkdir - make a new directory
#go into the folder from which you want to work
cd Desktop

#make a new folder in the directory we are currently in, name it Unix_Tutorial
mkdir Unix_Tutorial

For the NIOZ tutorial do this:

#go into the folder from which you want to work
cd /export/lv3/scratch/workshop_2021/Users/UserName

#make a new folder in the directory we are currently in, name it Unix_Tutorial
mkdir Unix_Tutorial

Moving and copying files

  • cp - copy files and directories
  • mv - move or rename files and directories

Now let's move our test data (from wherever we downloaded it)

#make a file where we store some random text
ls  > random.txt

#copy our random file into our Unix_Tutorial folder
cp random.txt Unix_Tutorial

If we check the files after this command, we can see that we have a version in our home directory and our new folder

#move our random file into our Unix_Tutorial folder
mv random.txt Unix_Tutorial

If we do this again with mv, we see that we only have the file in the new folder.

Removing files and folders

To remove stuff, we use the rm command.

#remove a file
rm Unix_Tutorial/random.txt

#rm directory
rm -r Unix_Tutorial

For the rm command, we need to tell the command that we want to remove folders. Therefore we need to use the -r argument (to remove directories and their contents recursively).

Downloading data

We can download data using wget. With -P we specify where to download the data.

#make a folder for our downloads
mkdir downloads

#download a genome from ncbi using wget
wget -P downloads ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/728/275/GCA_002728275.1_ASM272827v1/GCA_002728275.1_ASM272827v1_genomic.fna.gz

File compression

Compressing data using gzip

gzip reduces the size of the named files using Lempel–Ziv coding (LZ77). Whenever possible, each file is replaced by one with the extension ‘.gz’, while keeping the same ownership modes, access and modification times.

#decompress gz data (-d = decompress)
gzip -d downloads/GCA_002728275.1_ASM272827v1_genomic.fna.gz

#compress
gzip downloads/GCA_002728275.1_ASM272827v1_genomic.fna

Compressing data using tar

  • tar is short for Tape Archive; a tar file, sometimes referred to as a tarball, is a file in the Consolidated Unix Archive format.
  • The TAR file format is common on Unix and Unix-like systems, but it is only for storing data, not for compressing it!
  • TAR files are often compressed after being created, becoming TGZ files with the tgz, tar.gz, or gz extension; see the small example below.
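
A small sketch of the difference between archiving and compressing, using the downloads folder from above:

#pack the folder into a tar archive (c=create, v=verbose, f=archive file name); this stores but does not compress
tar -cvf downloads.tar downloads

#adding the z option additionally compresses the archive with gzip
tar -cvzf downloads.tar.gz downloads

#compare the file sizes
ls -l downloads.tar downloads.tar.gz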

Exercise

  • Decompress the data we have downloaded using gzip
  • Make a new folder, name it to_compress
  • cp our decompressed data into this folder
#decompress gz data (-d = decompress)
gzip -d downloads/GCA_002728275.1_ASM272827v1_genomic.fna.gz

#make new folder for the data we want to compress
mkdir to_compress

#copy the decompressed data into the new folder
cp  downloads/GCA_002728275.1_ASM272827v1_genomic.fna to_compress

Now we have some files in a new folder to test file compression on. Always check between steps, what is happening with ls or head!

#create and compress a tar file from a directory
tar -cvzf to_compress.tar.gz to_compress

#extract the archive into the folder to_compress
tar -xvf to_compress.tar.gz -C to_compress

Basic UNIX commands

Preparing our data

Exercise

Now we first need to get our data. For this

  • make a new folder, name it Unix_tutorial
  • cp the folder with the Input docs from `` (beware that we copy a folder and need to set an option to do this)
  • go into the Unix tutorial folder
#make folder
mkdir Unix_tutorial

#cp test data
cp -r /export/lv3/scratch/workshop_2021/Sxx_Practice/Unix/Input_docs/ Unix_tutorial/

cd Unix_tutorial

WC: Counting lines, words and files

The wc (=wordcount) command in UNIX is a command line utility for printing newline, word and byte counts for files.

  • It can return the number of lines in a file, the number of characters in a file and the number of words in a file.
  • It can also be combined with pipes for general counting operations. We will explain pipes a bit later.
  • * Stands for a wildcard, and we will discuss below what exactly this does

This command is simple but essential for quality control and you will use it a lot to check whether your commands worked all right.


#count how many words we have
wc -w Input_docs/Experiment1.txt

#count how many lines we have in a file we have
wc -l Input_docs/Experiment1.txt

#count how many files we have that end with a certain extension
ls Input_docs/*.txt | wc -l

Grep: Finding patterns in files

The grep command is used to search text. It searches the given file for lines containing a match to the given strings or words.

This command is simple but essential for quality control.

#find the lines on which the pattern control occurs in our document
grep "control" Input_docs/Experiment1.txt

#only give the counts, not the lines
grep -c "control" Input_docs/Experiment1.txt

#grep a pattern only if it occurs at the beginning of a line
grep "^Ex" Input_docs/Experiment1.txt

#we can also count the number of sequences in a fasta file
grep -c ">" Input_docs/*faa

Using basic wildcards

Since the shell uses filenames so much, it provides special characters to help you rapidly specify groups of filenames.

A Wild-card character can be used as a substitute for any class of characters in a search

  1. * wildcard = the wildcard with the broadest meaning; it can represent 0 characters, all single characters or any string of characters. I.e. we can grep in any files that end with .txt
grep -c "Ex" Input_docs/*txt
  2. ? wildcard = matches exactly one character except a dot
grep "control" Input_docs/Experiment?.txt
  3. ‘.’ wildcard = any letter or number; because the dot is a wildcard, be careful when using it in some commands
grep "control." Input_docs/Experiment1.txt
  4. [012] wildcard = matches 0 or 1 or 2 exactly once
grep  "control" Input_docs/Experiment[012].txt
  5. [0-9] wildcard = matches any digit exactly once
grep  "control" Input_docs/Experiment[0-9].txt
  6. combining wildcards
  • [A-Z] wildcard = any capital letter occurring once
  • [a-z]* wildcard = any lower-case letter occurring many times
grep "control" Input_docs/[A-Z][a-z]*[12].txt
  • [a-z]\{7\} = we are looking for exactly 7 letters (as in ‘control’)
  • these 7 letters should be followed by either a 1 or 2
grep "[a-z]\{7\}[12]" Input_docs/Experiment[12].txt
  7. if we are not sure how many characters we have
  • [a-z]\{3,10\} matches 3-10 characters
grep "[a-z]\{3,10\}[12]" Input_docs/Experiment[12].txt

Grep and special symbols:

#this does not work
grep "control?" Input_docs/Experiment1.txt

#this works how we want it to work
grep -E "control?" Input_docs/Experiment1.txt
  • -E tells grep that the ‘?’ is not taken literally but as a wildcard

Unfortunately, different programs have slightly different ways of doing things, e.g. grep uses -E while sed uses different quoting. If you run into problems when using wildcards, check the manual or the web.

Exercise

  • In PF00900.faa, how many sequences do we have? Notice, sequences always start with a >
  • How many sequences do we have in PF00900.faa and PF01015.faa?
  • In PF00900.faa, how often do we have 3 consecutive A’s?
  • In PF00900.faa, how often do we have 2x M’s followed by a P?
  • In PF00900.faa, how often do we have 2x M’s followed by a P or an I?

Comment: If you are unsure what is happening, remove the -c to see what grep is matching.

#question1:
grep -c ">" Input_docs/PF00900.faa 

#question2
grep -c ">" Input_docs/*.faa

#question3
grep -c "[A]\{3\}" Input_docs/PF00900.faa 

#question4
grep -c "[M]\{2\}[P]" Input_docs/PF00900.faa

#question5
grep -c  "[M]\{2\}[PI]" Input_docs/PF00900.faa


The cut command

We can use cut to separate columns


#only print the second column of our table
cut -f2 Input_docs/Experiment1.txt

#change how we cut, i.e. split at a #
cut -f1 -d "#" Input_docs/Experiment1.txt

Options:

  • -f2 = keep the second element after the separators (tab, by default)
  • -d "#" = we use a # as delimiter

Exercise

  • In PF00900.faa, cut off the text after the first _ (i.e. keep the first element)
  • In PF00900.faa, cut to only keep the text after the first _ (i.e. keep the second element)
#question1
cut -f1 -d "_" Input_docs/PF00900.faa

#question2
cut -f2 -d "_" Input_docs/PF00900.faa

Why should we not do the second option?

If we cut away the > symbol, we break the fasta header format: fasta headers should always start with a >, so after such a command we would need to add the > back in.

Cat: Combining data

The cat command has three main functions related to manipulating text files:

  1. creating files
  2. displaying files
  3. combining files

Depending on how we use cat, we can use it in these 3 different contexts.

  1. Create a new file: type Hello!, press enter, and press “ctrl+d” to save the file
cat > file_test.txt
  2. Display the content of an existing file
cat file_test.txt
  3. Concatenate several files
#merge files
cat file_test.txt file_test.txt  > file_merged.txt
 
#check the content of the file
cat file_merged.txt

Exercise

  • View Input_docs/Experiment1.txt by using cat
  • Combine Input_docs/Experiment1.txt and Input_docs/Experiment2.txt.
  • Do the same as above but use a wildcard so you do not have to type everything, and store the output as Input_docs/Experiments.txt
#question1
cat Input_docs/Experiment1.txt

#question2
cat Input_docs/Experiment1.txt Input_docs/Experiment2.txt

#question3
cat Input_docs/Experiment*.txt > Input_docs/Experiments.txt

Pipes: Combining commands

Pipes are a powerful utility to connect multiple commands together. Basically, pipes allow us to feed the standard output of one command as input into another command.

In this example we first combine the files, then use wc -l to count the number of lines

cat Input_docs/Experiment[12].txt | wc -l

#if we did not use cat, we could still run wc -l, the output would just look a bit different
wc -l Input_docs/Experiment[12].txt 

Sort

sort – sort lines in a file from A-Z

There are a number of programs that require files to be sorted. We can sort files like this:

#sort using the fourth column
sort -k4 Input_docs/Experiment2.txt

#sort using the 5th and then the third column
sort -k5 -k3 Input_docs/Experiment1.txt

Uniq

uniq – can be used to remove or find duplicates. However, for this to work the file first needs to be sorted. To combine two commands in one go, we can use the pipe, i.e. | symbol.

#only print duplicated lines
sort Input_docs/Experiment2.txt | uniq -d

#only print lines that occur exactly once
sort Input_docs/Experiment2.txt| uniq -u

Exercise

  • Using a pipe: combine the faa files in Input_docs. How many sequences do we have in total?
  • Using a pipe: combine the faa files in Input_docs. Extract only the headers and then count how many duplicates we have.
  • Same as above, but how many sequences are not duplicated? Check the manual for uniq on what option allows you to do this.
#question1
cat Input_docs/*faa | grep -c ">"

#question 2
cat Input_docs/*faa | grep ">" | sort | uniq -d | wc -l 

#question 3
cat Input_docs/*faa | grep ">" | sort | uniq -u | wc -l 

diff: finding differences

This command compares the contents of two files and displays the differences. Lines beginning with a < come from file1, while lines beginning with a > come from file2.

#check the original files
cat Input_docs/Experiment2.txt
cat Input_docs/Experiment3.txt

#find the differences
diff Input_docs/Experiment2.txt Input_docs/Experiment3.txt

Exercise

  • Find the headers that are different for PF00900.faa and PF01015.faa
#option1, generate intermediate files that store the file header
grep ">" Input_docs/PF00900.faa > PF00900_header
grep ">" Input_docs/PF01015.faa > PF01015_header

diff PF00900_header PF01015_header

#option2, do this all in one command
diff <(grep ">" Input_docs/PF00900.faa) <(grep ">" Input_docs/PF01015.faa)

In the hidden code, there are 2 ways to do this: one generates intermediate files and the second does everything in one command. Here, we use a new type of syntax: <(command), which is called process substitution. Basically, we run the grep command and redirect its output into the diff command.
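
Process substitution also works with other commands; for example, we can count lines coming out of a command without creating an intermediate file:

#count the headers of PF00900.faa without writing them to a file first
wc -l <(grep ">" Input_docs/PF00900.faa)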

find

This searches through the directories for files and directories with a given name, date, size, or any other attribute you care to specify.


#find all files with a txt extension
find Input_docs/ -name "*txt"

#find files over a certain size and display results as a long list
find . -size +1M -ls

echo and variables

echo

echo displays a line of text

#print a string to the console
echo "Hello everyone"

variables

A variable is a character string to which we assign a value. The value assigned could be a number, text, filename, path, device, or any other type of data.

#store your name in a variable called x
x="Nina"

#print the variable we just have created using echo
echo "I am $x"

Something to be aware of is that echo treats quotes differently:

  • inside double quotes, $x is read as a VARIABLE
  • inside single quotes, $x is read literally
#store 10 in the variable x
x=10

#print the variable
echo "the value is $x"

#print a literal `$x`
echo 'the value is $x'

We can use these things in several ways:

  1. we can use it to add rows, i.e. a header to our data
echo -e 'Experiment\tTreatment\tValue1\tValue2\tcomment' |cat - Input_docs/Experiment1.txt

Here, the - is used as a pseudo file name to indicate standard input; in this case the input is the echo line from our first command. The -e is used to enable interpretation of backslash escapes.
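
The same - trick works with any input coming through a pipe; here is a minimal example reusing the file_list.txt we made earlier:

#cat reads the echo output (via -) first and then the file
echo "MyHeader" | cat - file_list.txt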

  2. If we write longer scripts, we can set variables for our working directory and most important paths at the beginning.

For the example below, change the path to your own directory.

#set variable for our wdir
wdir="/export/lv3/scratch/workshop_2021/"

#check if our variable is stored
echo $wdir

#set wdir
cd $wdir

#go back to your unix working dir
cd Users/NDombrowski/Unix_tutorial

If you set variables and use them like that they are stored until:

  • they are overwritten by a variable with the same name
  • you close your session, i.e. close the terminal

You can permanently store variables, but this is a topic we will not cover here.

SED: manipulating files

Sed is a stream editor. A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline).

The most basic pattern to use sed for is sed 's/find/replace/' file. An example file can be generated with echo "how now brown cow" > temp.txt

Basics: search and replace

# search for 'Ex' and replace it with 'Experiment'
sed 's/Ex/Experiment/' Input_docs/Experiment1.txt 

# search and replace the pattern across the whole file and not only the first occurrence per line
sed 's/Ex/Experiment/g' Input_docs/Experiment1.txt 

#we can also use wildcards to replace control1 and control2 with control
sed 's/control[0-9]/control/g' Input_docs/Experiment1.txt 

One important thing to remember is that certain symbols have specific meanings in UNIX. Examples are: commas, brackets, pipes. To search for these in files, we need to escape them with a backslash (\):

#replace the square bracket with a round bracket. 
sed 's/N\[0.4uM\]/N(0.4uM)/g' Input_docs/Experiment1.txt 

Exercise

  • In Input_docs/PF00900.faa , replace the GCA_ with Genome_
  • In Input_docs/PF00900.faa , replace the string of numbers after the GCA with a hello
#qst 1
sed 's/GCA_/Genome_/g'  Input_docs/PF00900.faa 

#qst 2
sed 's/GCA_[0-9]*/GCA_hello/g'  Input_docs/PF00900.faa 

Removing things (and using WILDCARDs)

  • '^ *$' indicates a line containing zero or more spaces. Hence, sed '/^ *$/d' will delete all lines which are either empty or contain only blank spaces.
# remove the first line
sed '1d' Input_docs/Experiment1.txt

#remove last line
sed '$d' Input_docs/Experiment1.txt

#remove lines 2-4
sed '2,4d' Input_docs/Experiment1.txt

#remove lines other than 2-4
sed '2,4!d' Input_docs/Experiment1.txt 

#remove the first and last line
sed '1d;$d' Input_docs/Experiment1.txt

#remove lines beginning with an L
sed '/^L/d' Input_docs/Experiment1.txt

#delete lines ending with d
sed '/d$/d' Input_docs/Experiment1.txt

#delete lines ending with d OR D
sed '/[dD]$/d' Input_docs/Experiment1.txt

#delete blank lines ('^$' indicates lines containing nothing)
sed '/^$/d' Input_docs/Experiment1.txt

#delete lines that start with capital letters
sed '/^[A-Z]/d' Input_docs/Experiment1.txt

#delete lines with the pattern Ex
sed '/Ex/d' Input_docs/Experiment1.txt

#delete lines with the pattern control or uM
sed '/control\|uM/d' Input_docs/Experiment1.txt

#remove the 2nd occurrence of a pattern per line (here o)
sed 's/o//2' Input_docs/Experiment1.txt

#remove all digits across the whole file
sed 's/[0-9]//g' Input_docs/Experiment1.txt

#remove all alphanumeric characters (letters and numbers)
sed 's/[a-zA-Z0-9]//g' Input_docs/Experiment1.txt

#remove character, here E,  regardless of the case
sed 's/[eE]//g' Input_docs/Experiment1.txt

removing text between patterns

If we have complicated headers and want to shorten them, sed is also useful. E.g. let's consider this example: >MBN1215629.1 4Fe-4S dicluster domain-containing protein [Candidatus Lokiarchaeota archaeon]

The ID and taxon are useful but the text in between might be a bit much. Here, we can use the first space and the [ as patterns and we want to remove everything in between.


sed 's/ [^\[]*/_/1' <(echo ">MBN1215629.1 4Fe-4S dicluster domain-containing protein [Candidatus Lokiarchaeota archaeon]")

Using a screen

Screen or GNU Screen is a terminal multiplexer. This means that you can start a screen session and then open any number of windows (virtual terminals) inside that session. Processes running in screen will continue to run when their window is not visible, even if you get disconnected. This is perfect if we start long-running processes on the server and want to shut down our computer.

The basic commands to know are

#start a screen
screen

We detach from a screen with control+a followed by d

#start a screen and give it a name
screen -S testrun

#see what screens you have running
screen -ls

#restart an existing screen
screen -r testrun

#detach a screen (from outside a screen)
screen -d
 
#completely close and remove the screen, type
exit

Running loops

A for loop is a bash programming language statement which allows code to be repeatedly executed, e.g. it allows us to run a command 100 times.

The bash while loop is a control flow statement that allows code or commands to be executed repeatedly based on a given condition. For example, run an echo command 5 times, read a text file line by line, or evaluate the options passed on the command line for a script.

Try running this example:

for i in 1 2 3 4 5; do echo "Welcome $i times"; done

Step by step, the loop does the following: the variable i takes the values 1 to 5 one after the other, and for each value the echo command between do and done is executed once.

We could also use a loop to run sed on all our txt files

#change the N against the P in all our files
for i in Input_docs/*txt; do sed 's/N/P/g' $i; done

We might also want to store the output of our loops

for i in Input_docs/*txt; do sed 's/N/P/g' $i > ${i}_2.txt; done

#check that all went alright with 
ls -l Input_docs/*txt

You see here that we added curly brackets to define the borders of our variable i; if we did not have the brackets, unix would look for a variable i_2, which does not exist.
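
A minimal sketch of the difference:

#store a file name in a variable
i="Experiment1"

#without brackets, unix looks for a variable called i_2 (which is empty)
echo "$i_2.txt"

#the brackets mark where the variable name ends
echo "${i}_2.txt"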

We can also use a file list for building a loop, which gives us much better control over where we store the new files. For example, in FileList we have a list of the two files we want to work with: Experiment1 and Experiment2.

for i in `cat Input_docs/FileList`; do sed 's/N/P/g' Input_docs/${i}.txt > Output_docs/${i}_2.txt; done

Notice that when using this option we need to add the file extension!

Exercise

  • make a list with all the faa files (do this with the command line). Ideally, we want to have PF00900 and PF01015 in one column.
  • Use this list to, in a loop, replace the GCA_ with Genome_. Store the files with the new ending renamed.faa
#question1
ls Input_docs/*faa | sed 's/Input_docs\///g' | sed 's/\.faa//g' > FaaList

#question2
for sample in `cat FaaList`; do sed 's/GCA_/Genome_/g' Input_docs/${sample}.faa > ${sample}_renamed.faa; done

#check file
head PF00900_renamed.faa

While loops

Comment: Need to extend this with a practice example later

#3. while loop example 1: read a file line by line

#!/bin/bash
file=/etc/resolv.conf
while IFS= read -r line
do
    #each line read from the file is stored in $line
    echo $line
done < "$file"

#4. while loop to move a list of genomes from folder A to folder B

while read from to; do
    echo "mv ${from}* $to"
done < to_replace.txt

Change file extensions

Some programs at times need very specific file endings. E.g. our files end with faa, but a script may require you to provide files with a fna ending. We can use our loop powers to change filenames.

#create a dummy test folder
mkdir test_folder

#cp our files into the dummy folder
cp Input_docs/*faa test_folder

#go into the test folder
cd test_folder

#change file ending
for f in *.faa; do  
mv -- "$f" "${f%.faa}.fna" 
done

#check that all went ok
ls *

#go back
cd ..

Exercise:

Change the filename back to faa.

#go into the test folder
cd test_folder

#change file ending
for f in *.fna; do  
mv -- "$f" "${f%.fna}.faa" 
done

#check that all went ok
ls *

#go back
cd ..

Change file names using rename

rename works as follows:

OldName - NewName - pattern used to match the files to rename

Warning: Depending on the system there are two versions of rename, and if you work on your own system you might need to install it first.

The two versions are:

# version 1
rename 's/OldName/NewName/' Bin*  

#version 2
rename OldName Newname Bin*

Lets test this on our dummy files:

#go into the dummy folder
cd test_folder

#change the PF to PFAM (the second version should work on ada, the first version works on my Mac)
rename 's/PF/PFAM/' PF*  
rename PF PFAM PF*

#check if we did everything ok
ls *

#go back
cd ..

A general tip:

Especially when testing commands, do this on a backup of your files (like we do here, with our dummy folder). More often than not, things happen that you do not want to happen.

Replacing names in files (works on every kind of file)

The file names_to_replace.txt needs to look as follows:

  • two columns, which are tab separated
  • first column = original name
  • second column = new name

Warning: Be careful with similar names. With a line like Bin1 <tab> Bin1_new, both Bin1 and Bin11 would be matched and replaced. In these cases either update your naming scheme OR add an end-of-line indicator.
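
For illustration, such a mapping file could be generated on the command line like this (the names are made up):

#printf interprets \t as a tab and \n as a newline
printf "Bin1\tBin1_new\nBin2\tBin2_new\n" > example_names_to_replace

#check the result
cat example_names_to_replace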

#rename
perl /export/data01/tools/scripts/perl/Replace_tree_names.pl Input_docs/names_to_replace Input_docs/PF00900.faa > Input_docs/PF00900_renamed.faa

#check what happened 
head Input_docs/PF00900_renamed.faa

Here, we learn something new and that is using custom scripts. I.e. perl or python scripts that we find online or write ourselves.

The script above is a bad example, but a lot of scripts have some information on how to use them. You can find usage info either by viewing the script header with head or by checking if there is a help function included with the script:

  • head /export/data01/tools/scripts/perl/Replace_tree_names.pl
  • /export/data01/tools/scripts/perl/Replace_tree_names.pl -h

JOIN: Merging files with common columns

For join to work, the input files need to be sorted. In the given example the files are sorted by the first column; if a different column is needed, it can be changed with -k.

Let’s imagine that we have some metadata (i.e. T and how long we ran the experiment). We can use join to add this information to our measurements stored in Input_docs/Experiment1.txt

#merge two of our experiment files
LC_ALL=C join -a1 -j1 -e'-' -o 0,1.2,1.3,1.4,2.2,2.3,2.4 <(LC_ALL=C sort Input_docs/Experiment1.txt) <(LC_ALL=C sort Input_docs/Metadata) -t $'\t' | LC_ALL=C sort  > MergedFile

#check what happened 
head MergedFile 

Used options:

  • LC_ALL=C: make sure that join and sort speak the same language (sometimes ada has issues with join and sort communicating otherwise)
  • -a1 = also print unpairable lines (print all lines from file 1 in this case)
  • -1 1 = in File1 use column 1 for merging
  • -2 1 = in File2 use column 1 for merging
  • -j = equivalent to ‘-1 1 -2 1’, i.e. in both File1 and File2 use column 1 for merging
  • -o: specify the order of the output format

Random but useful

The code below is a random collection of code the author found useful. It might be interesting for you, BUT it is so far not linked to example files and some programs might not be available on ada.

Excel/DOS to UNIX issue cleanup

  1. Hidden symbols

Sometimes when saving excel documents as text, hidden symbols (carriage returns) are inserted. These can be seen as a blue ^M when opening the file in vim (sometimes they result in odd errors while parsing tables). Most often you see these issues when you open your files and lines are merged that should not be merged.

This symbol can be removed as follows in vim (in the search pattern \r matches a carriage return, in the replacement it inserts a newline):

:%s/\r/\r/g
  2. Wrong file types

Files created on WINDOWS systems are not always compatible with UNIX. In case there is an issue it is always safer to convert. You can see if you have an issue by opening your file of interest with nano and checking the file format shown at the bottom.

If we see we are dealing with a dos file, we can clean files like this:

#dos to unix
awk '{ sub("\r$", ""); print }' winfile.txt > unixfile.txt

#unix to dos
awk 'sub("$", "\r")' unixfile.txt > winfile.txt

datamash: merging rows by keys

GNU datamash is a command-line program which performs basic numeric, textual and statistical operations on input textual data files.

datamash -sW -g1 collapse 2 collapse 4 < Unimarkers_KO_count_table.txt  > Unimarkers_KO_collapsed.txt
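
Since the input table above is not part of the tutorial data, here is a tiny self-contained sketch (assuming datamash is installed):

#group by column 1 and collapse the values of column 2 into comma-separated lists
printf "geneA\tKO1\ngeneB\tKO2\ngeneA\tKO3\n" | datamash -sW -g1 collapse 2

#expected output:
#geneA   KO1,KO3
#geneB   KO2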

Working with Conda Environments

Some quick words about conda.

Conda is a powerful package manager and environment manager that you use with command line commands at the Anaconda Prompt for Windows, or in a terminal window for macOS or Unix.

Importantly, it allows you to install software without admin rights and is a safe alternative that avoids messing up existing program setups.

#start conda (for NIOZ users)
source ~/.bashrc.conda3

#list all avail. environments
conda info --envs

#start environment
source activate myenv

#end environment
source deactivate

Using environmental variables

Simply put, environment variables are variables that are set up in your shell when you log in. They are called “environment variables” because most of them affect the way your Unix shell works for you. I.e. one points to your home directory and another to your history file.

#list of avail. variables
env | sort | head -10

# show the content of a single variable
echo $PATH

# change the value of a variable
export HOME=/home/shs

#add directories to the PATH variable
PATH=~/bin:$PATH:/apps/bin

Installing stuff

For installing things on the server, for bigger things talk to Hans Malschaert () since we do not have administrator rights.

When installing tools on your own computer you can check apt-get; however, since this requires admin rights, always be careful what you install.

apt-get install xx

For Mac users, brew is another, relatively safe alternative to install tools; conda is another option that is supported for Mac and Windows.

Dealing with PDFs –> CPDF

CPDF is a useful program if you want to merge pdfs, remodel them all to A4 etc.

This is not available on the server but easy to install in case you are interested.

For more information, see here

# merge pdfs
~/Desktop/Programs/cpdf-binaries-master/OSX-Intel/cpdf -merge *pdf -o All_Main_Figs.pdf

#convert to A4
~/Desktop/Programs/cpdf-binaries-master/OSX-Intel/cpdf -scale-to-fit a4portrait SI_Figures.pdf -o test.pdf

Access rights

Each file (and directory) has associated access rights, which may be found by typing ls -l. Also, ls -lg gives additional information as to which group owns the file


ls -lg Input_docs/Experiment1.txt 

-rw-r--r-- 1 staff 156B Sep 28 2019 Input_docs/Experiment1.txt

In the left-hand column is a 10 symbol string consisting of the symbols d, r, w, x, -, and, occasionally, s or S. If d is present, it will be at the left hand end of the string, and indicates a directory: otherwise - will be the starting symbol of the string.

The 9 remaining symbols indicate the permissions, or access rights, and are taken as three groups of 3.

  • The left group of 3 gives the file permissions for the user that owns the file (or directory)
  • the middle group gives the permissions for the group of people to whom the file (or directory) belongs
  • the rightmost group gives the permissions for all others.

The symbols r, w, etc., have slightly different meanings depending on whether they refer to a simple file or to a directory.

Access rights on files.

  • r (or -), indicates read permission (or otherwise), that is, the presence or absence of permission to read and copy the file
  • w (or -), indicates write permission (or otherwise), that is, the permission (or otherwise) to change a file
  • x (or -), indicates execution permission (or otherwise), that is, the permission to execute a file, where appropriate

Access rights on directories.

  • r allows users to list files in the directory;
  • w means that users may delete files from the directory or move files into it;
  • x means the right to access files in the directory. This implies that you may read files in the directory provided you have read permission on the individual files.

So, in order to read a file, you must have execute permission on the directory containing that file, and hence on any directory containing that directory as a subdirectory, and so on, up the tree.

Changing access rights

  • chmod = changing a file mode

Chmod options:

  • u = user
  • g = group
  • o = other
  • a = all
  • r = read
  • w = write (and delete)
  • x = execute (and access directory)
  • + = add permission
  • - = take away permission

For example, to remove read write and execute permissions on the file example.txt for the group and others, type


chmod go-rwx example.txt 

Extract sequence by pattern found in fasta header


perl extractSequence.pl File.faa <pattern> > File_Subset.faa

Counting things in a large number of files

Here: find all files that start with DN and have codon in their name. Once you have these files, use grep to count the number of sequences in each file.


find . -maxdepth 6 -name 'DN*codon*' -exec grep -c -H ">" {} \; > Count_hits.txt

Connecting to servers

Basics

SSH (Secure Shell) is a network protocol that enables secure remote connections between two systems.

Options:

  • -Y option enables trusted X11 forwarding in SSH (if we want to open i.e. a java interface, or view alignments). For this to work you might need to install X11 on your computer first. Trusted means: the remote machine is treated as a trusted client. This means that other graphical (X11) clients could take data from the remote machine (make screenshots, do keylogging and other nasty stuff) and it is even possible to alter those data.
  • -X option enables untrusted X11 forwarding in SSH. Untrusted means = your local client sends a command to the remote machine and receives the graphical output

#connect to a server
ssh -X username@server

Checking available resources

There are different methods we have to check how busy the servers are.

1a. top

Typing top into the terminal should give something like this:

  • PID: Unique process id.
  • USER: Task’s owner.
  • PR: It is the priority of the task.
  • NI: The nice value of the task. A negative nice value means higher priority, whereas a positive nice value means lower priority.
  • VIRT: Total amount of virtual memory used by the task.
  • RES: Resident size, the non-swapped physical memory a task has used.
  • SHR: Shared Mem size (kb), the amount of shared memory used by a task.
  • %CPU: It shows the CPU usage as a percentage of total CPU time.
  • %MEM: It shows the Memory usage, a task’s currently used share of available physical memory.
    • as a rule of thumb: in the example above the first process uses 1492/100, so roughly 15 of the 144 avail. cpus
  • S: Status of the process
  • TIME+: CPU Time
  • COMMAND: Display the command line used to start a task or the name of the associated program.

1b. htop

Htop gives similar info to top but makes it easier to see how many CPUs and how much memory are used in total:

The numbers from 1-144 are our 144 CPUs, and the fuller a bar is, the more that CPU is currently in use. This is also summarized under tasks. Another important line is the memory line, listing how much of the available memory is in use.

  2. df

Monitor avail. space on the different file systems.

  3. free

Monitor available and used memory on the server.

We can see that lv3, where scratch is, is almost full. So it is a good point to do data cleaning.

  4. du

Monitor how much space specific folders take up.

For example, we can ask how much space our desktop needs:

du -sh Desktop/
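
To check all folders in the current directory at once, we can combine du with the * wildcard:

#-s = summarize each argument, -h = human-readable sizes
du -sh *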

Transferring data from/to a server

Note: When transferring data, the transfer is started from the terminal of your local computer. There are file transfer programs that can make the job easier, e.g. FileZilla.

#from local to server
scp File username@server:/HomeDir

#from server to local
scp username@server:/HomeDir/File Desktop

Working on a server via slurm

As mentioned above, on the larger NIOZ servers we cannot run jobs directly but need to submit them via a job submission system called Slurm.

This section is for your information only.

Preparing a job script

To submit a job, we need to open a document in nano and describe what resources we need, i.e. with

nano jobscript.sh

Inside the script we can have something written like this:

#!/bin/sh
#SBATCH --partition=normal    # default "normal", if not specified
#SBATCH --nodelist=no1       # the node we want to work on
#SBATCH --time=0-06:30:00     # run time in days-hh:mm:ss
#SBATCH --nodes=1             # require 1 node
#SBATCH --ntasks-per-node=36  # (by default, "ntasks"="cpus")
#SBATCH --mem-per-cpu=4000    # MB RAM per CPU core (default 4 GB/core)
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out

# Executable commands :
iqtree -s my_aln.faa

The most important things are

  • the partition, esp. for jobs that run longer
  • the node we want to work on, i.e. only some allow for longer running jobs
  • the number of nodes, we usually use one for our jobs
  • --error and --output are good to keep in case you run into problems

Not absolutely necessary (at least on the NIOZ system but might be on other systems)

  • time = not necessary for the NIOZ server, just make sure you stay within the max limit
  • mem-per-cpu
  • nodes = not needed if you use nodelist

basic commands for slurm:

  • squeue is important to check and see whether your command is running ok but also to see how heavily used servers are.
#submitting your job
sbatch jobscript.sh

#checking for running jobs
squeue

#kill a job in case something is wrong (the job ID you can find via squeue)
scancel job#