General UNIX notebook
0. Introduction
This workflow gives a basic introduction to the command line.
A lot of the examples work with some basic text files provided in the Input_docs folder. If you want to run this tutorial you can download these files from here.
If working on the servers please keep in mind that we share resources with a lot of other users:
- Try not to use more than 30% of our available resources.
- If you need more, contact the other users.
General info on this tutorial
- grey box: the code that we want to execute
- red box: urgent comments ;-)
- Code button: in some cases the code might be hidden and you only see a little box at the right-hand side. Hidden code means that from the previous sessions you should have enough knowledge to do this step, or that it is a test of your skills. If you run into issues you can check how to run your command by clicking on the Code button to reveal the code.
- hyperlink: if the text is highlighted like this during a text section, then you can click on it and a hyperlink gives you more info (either the publication or an online tutorial)
- Exercise: this will appear if you should have the required background to finish all the steps yourself. If you run into trouble, you can find the answers by clicking the Code button, which will reveal the answer.
- Sometimes we also just ask questions for you to think about the output of certain programs. The text is then hidden and you have to hover over the “spoiler” to see the answer to the question.
General
Tutorials for bioinformatics
One example for a shorter unix tutorial can be found here. The tutorial that follows below should give insights into all basic commands as well but in case something is missing it is always worthwhile to check other sources as well.
Basic Unix introduction
Basics
What is the shell?
A program that takes commands from the keyboard and gives them to the operating system to perform. Nowadays, we have graphical user interfaces (GUIs) in addition to command line interfaces (CLIs) such as the shell.
What is the terminal?
A program that opens a window and lets you interact with the shell.
Accessing the terminal (to work on our servers, this is our most important tool)
- Mac users: Search for terminal in spotlight and you are done.
- Windows users: You will need some software for using the command line. One option is Cygwin and another alternative is MobaXterm. Finally, newer Windows versions come with their own Windows Subsystem for Linux (WSL), which can be installed as explained here.
If all is working you should see something like this:
If you are working from a Mac, most of the important things (nano, perl, python) are already installed. For Windows users you might need to set up some more things (especially a text editor, like nano; python/perl are also good to have).
The file system
The basic file system on Unix looks like this:
- Different from what you might know from Windows, Unix does not use drives.
- The first directory in the file system is called the root directory.
- Instead of splitting the system into different drives, Unix always has a single tree.
- Different storage devices may contain different branches of the tree, but there is always a single tree, e.g. /users/john/ and /users/mary/.
- Here, users is on the first level of the tree and john and mary are on the second.
- /bin is the directory that contains binaries, that is, some of the applications and programs you can run.
Moving around folders via the terminal
Now that we know what the file system looks like, we want to move around. Especially on the servers this becomes important, since we cannot use our mouse anymore.
pwd
= print working directory, i.e. find out where we are. Usually when we log in we start from our home directory. This will be something like /Users/username on your own computer.
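For example:
#print the full path of the directory we are currently in
pwd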
Now let's find out how to move around.
cd
= change directory
#move into the next directory, called Desktop
cd Desktop/
#move one directory up, back to our personal home directory
cd ..
#from whereever you are, go directly into your homedirectory
cd ~
#on the NIOZ server, move up one level and then into the spang_team and Projects directories
cd ../spang_team/Projects/
#change to our home dir
cd ~
The second-to-last example is a bit longer, but essentially what we are doing is:
- move up one level, to the parent directory
- go into the spang_team folder
- from the spang_team directory, go into the Projects directory
In the last example, we use the tilde, i.e. ~, as a shortcut to the home folder.
Pathnames
Below is some important syntax that discusses the difference between absolute and relative pathnames
Absolute pathnames
An absolute pathname begins with the root directory and follows the tree branch by branch until the path to the desired directory or file is completed. For example, on your computer the full path to the Desktop in your home directory is /Users/username/Desktop.
If we want to go to the desktop using an absolute path we do:
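A sketch, with username standing in for your own user name:
#go to the Desktop via its absolute path
cd /Users/username/Desktop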
Relative pathnames
A relative pathname starts from the working directory (the directory you are currently in). Using relative pathnames allows you to use a couple of special notations to represent relative positions in the file system tree:
- . (dot) refers to the working directory itself
- .. (dot dot) refers to the working directory’s parent directory
If we want to go to the desktop using a relative path we do (assuming we start from our home directory):
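The corresponding relative version:
#go to the Desktop, relative to the home directory we start in
cd Desktop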
General comment:
When recording your script it is useful to always start with the absolute path to set your working directory. Afterwards, you can work with relative paths.
General structure of a Unix command
The general structure of a command looks like this:
command [options] [arguments]
- command is the name of the command
- options is one or more adjustments to the command’s behavior. I.e.
for ls we can use
- Short notation: “-a”
- Long notation: “--all”
- arguments is one or more “things” upon which the command operates.
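For example, for GNU ls the two notations below are equivalent (note that the BSD ls shipped with macOS only supports the short form):
#short notation
ls -a
#long notation
ls --all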
Working with the terminal: Basics
Viewing what files, folders, etc are present in our directories
Here, we use the command ls with the option -l:
- ls stands for list directory contents
- everything starting with a minus symbol is an optional argument we can use
- -l = use a long listing format
#list everything in the current directory
ls -l
#if we want to check for what other options ls has we do (exit the manual with q)
man ls
In case you want to check what a program does or what options there are, there are different ways to do this depending on the program. The most common are:
- man ls
- ls --help
- ls -h (note: for ls itself, -h actually means human-readable sizes; many other programs use -h to print their help)
Generating and viewing files
Let’s first make a file using nano.
Nano is a basic text editor that lets us view and generate text.
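A sketch of the hidden command; file_test.txt is a suggested name (it is reused in the cat section further below):
#open (or create) a file in nano
nano file_test.txt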
If we type this and press enter, we will open a new document.
- Type something into this document
- Close the document with control + X
- Type y to save changes and press enter
If we now use ls -l again, we see that a new file was generated.
We can open it in nano again, but there are some other options that are useful with extremely big files.
less
less is a program that lets you view text files (by pagination!)
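For example, using the file we created with nano:
#view the file page by page (exit with q)
less file_test.txt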
Once started, less will display the text file one page at a time.
- You can use the arrow and page keys to move through the text file.
- q = exit less
- G = go to the end of the text file
- 1G = go to the beginning of the text file (or to the Nth line with “NG”)
- /characters = search forward in the text file for an occurrence of the specified characters
- n = repeat the previous search
- h = display a complete list of less commands and options
For files with lots of columns we can use less with the -S option, which chops long lines instead of wrapping them:
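#view a wide file without line wrapping
less -S file_test.txt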
In this mode we can also use the arrow right and left keys, to view columns that are further on the right.
I/O redirection to new files
By using some special notations we can redirect the output of many commands to files, devices, and even to the input of other commands.
Standard output (stdout)
By default, standard output directs its contents to the display. To redirect standard output to a file, the “>” character is used like this:
#redirect the output from ls to a new file
ls > file_list.txt
#check what happened
nano file_list.txt
#Use this symbol twice (“>>”) to append the result to an existing file
ls >> file_list.txt
#check what happened
nano file_list.txt
Standard input (stdin)
By default, standard input gets its contents from the keyboard. To redirect standard input from a file, the “<” character is used like this:
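A minimal sketch, reusing the file from above:
#feed file_list.txt to sort via standard input
sort < file_list.txt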
Escaping characters
Certain characters are significant to the shell; escaping is a method of quoting single characters. The escape character (\) preceding a character tells the shell to interpret that character literally.
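A small example of escaping the $ character:
#without the backslash the shell expands the variable
echo $HOME
#with the backslash the $ is taken literally
echo \$HOME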
Making new folders
mkdir
- make a new directory
#go into the folder from which you want to work
cd Desktop
#make a new folder in the directory we are currently in, and name it Unix_Tutorial
mkdir Unix_Tutorial
For the NIOZ tutorial do this:
Moving and copying files
cp = copy files and directories
mv = move or rename files and directories
Now let's move our test data (from wherever we downloaded it):
#make a file where we store some random text
ls > random.txt
#copy our random file into our new folder
cp random.txt Unix_Tutorial
If we check the files after this command, we can see that we have a version in our home directory and one in our new folder.
If we do this again with mv, we see that we only have the file in the new folder:
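A sketch of the mv version:
#move (rather than copy) the file into the folder
mv random.txt Unix_Tutorial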
Removing files and folders
To remove things, we use the rm command. To remove folders, we need to tell rm that we want to remove directories and their contents recursively, using the -r argument.
Unix does not have an undelete command. Once you delete something with rm, it’s gone!
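A minimal sketch (the names are placeholders for files you really no longer need):
#remove a single file
rm old_file.txt
#remove a folder and all of its contents
rm -r old_folder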
Downloading data
We can download data using wget. With -P we specify where to download the data to.
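A sketch of how this can look; the URL follows the NCBI FTP layout for the genome used in the next exercise, but treat it as an assumption and replace it with your actual download link:
#download a compressed genome into the downloads folder
wget -P downloads https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/728/275/GCA_002728275.1_ASM272827v1/GCA_002728275.1_ASM272827v1_genomic.fna.gz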
File compression
Compressing data using gzip
gzip reduces the size of the named files using Lempel–Ziv coding (LZ77). Whenever possible, each file is replaced by one with the extension ‘.gz’, while keeping the same ownership modes and access and modification times.
Compressing data using tar
- Short for Tape Archive, and sometimes referred to as tarball, a TAR file is a file in the Consolidated Unix Archive format.
- The TAR file format is common on Unix and Linux systems, but it is only for storing data, not compressing it!
- TAR files are often compressed after being created; the result is a TGZ file, using the tgz, tar.gz, or gz extension.
Exercise
- Decompress the data we have downloaded using gzip
- Make a new folder, name it
to_compress
- cp our decompressed data into this folder
#decompress gz data (-d = decompress)
gzip -d downloads/GCA_002728275.1_ASM272827v1_genomic.fna.gz
#make new folder for the data we want to compress
mkdir to_compress
#copy the decompressed data into the folder we want to compress
cp downloads/GCA_002728275.1_ASM272827v1_genomic.fna to_compress
Now we have some files in a new folder to test file compression on. Always check between steps, what is happening with ls or head!
#create and compress a tar file from a directory
tar -cvzf to_compress.tar.gz to_compress
#uncompress a file
tar -xvf to_compress.tar.gz -C to_compress
Basic UNIX commands
Preparing our data
Exercise
Now we first need to get our data. For this:
- make a new folder, name it Unix_tutorial
- cp the folder with the Input docs from `` (beware that we copy a folder and need to set an option to do this)
- go into the Unix tutorial folder
WC: Counting files
The wc (= word count) command in UNIX is a command line utility for printing newline, word and byte counts for files.
- It can return the number of lines in a file, the number of characters in a file and the number of words in a file.
- It can also be combined with pipes for general counting operations. We will explain pipes a bit later.
- The * stands for a wildcard; we will discuss below what exactly this does.
This command is simple but essential for quality control and you will use it a lot to check whether your commands worked all right.
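For example:
#count lines, words and characters in one file
wc Input_docs/Experiment1.txt
#count only the lines, for all txt files at once
wc -l Input_docs/*.txt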
Grep: Finding patterns in files
The grep command is used to search text. It searches the given file for lines containing a match to the given strings or words.
This command is simple but essential for quality control.
#show the lines on which the pattern control occurs in our document
grep "control" Input_docs/Experiment1.txt
#only give the counts, not the lines
grep -c "control" Input_docs/Experiment1.txt
#grep a pattern only if it occurs at the beginning of a line
grep "^Ex" Input_docs/Experiment1.txt
#we can also count the number of sequences in a fasta file
grep -c ">" Input_docs/*faa
Using basic wildcards
Since the shell uses filenames so much, it provides special characters to help you rapidly specify groups of filenames.
A wildcard character can be used as a substitute for any class of characters in a search:
- * wildcard = the wildcard with the broadest meaning; it can represent 0 characters, all single characters or any string of characters. I.e. we can grep any files that end with .txt
- ? wildcard = matches exactly one character except a dot
- . wildcard = any letter or number; because the dot is a wildcard, be careful when using it in some commands
- [012] wildcard = matches 0 or 1 or 2 exactly once
- [0-9] wildcard = matches any number exactly once
- combining wildcards:
- [A-Z] wildcard = any capital letter occurring once
- [a-z]* wildcard = any lower-case letter occurring many times
- [a-z]{7} = we are looking for exactly 7 letters (as in ‘control’)
- [a-z]{7}[12] = these 7 letters should be followed by either a 1 or a 2
- [a-z]{3,10} = if we are not sure how many characters we have; this matches 3-10 characters
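A sketch of how the last two patterns could be tried out with grep -E (explained just below):
#exactly 7 lower-case letters followed by a 1 or a 2
grep -E "[a-z]{7}[12]" Input_docs/Experiment1.txt
#between 3 and 10 lower-case letters
grep -E "[a-z]{3,10}" Input_docs/Experiment1.txt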
Grep and special symbols:
#this does not work
grep "control?" Input_docs/Experiment1.txt
#this works how we want it to work
grep -E "control?" Input_docs/Experiment1.txt
-E tells grep that the ‘?’ is not taken literally but as a wildcard. Unfortunately, different programs have slightly different ways to do things, i.e. grep uses -E while sed uses a slightly different quoting syntax. If you run into problems when using wildcards, check the manual or the web.
Exercise
- In PF00900.faa, how many sequences do we have? Notice, sequences always start with a >
- How many sequences do we have in PF00900.faa and PF01015.faa?
- In PF00900.faa, how often do we have 3 consecutive A’s?
- In PF00900.faa, how often do we have 2x M’s followed by a P?
- In PF00900.faa, how often do we have 2x M’s followed by a P or an I?
Comment: If you are unsure what is happening, remove the -c to see what grep is grepping.
#question1:
grep -c ">" Input_docs/PF00900.faa
#question2
grep -c ">" Input_docs/*.faa
#question3
grep -c "[A]\{3\}" Input_docs/PF00900.faa
#question4
grep -c "[M]\{2\}[P]" Input_docs/PF00900.faa
#question5
grep -c "[M]\{2\}[PI]" Input_docs/PF00900.faa
Hint: if you are unsure what is happening, redo the command without the -c option.
The cut command
We can use cut to separate columns.
#only print the second column of our table
cut -f2 Input_docs/Experiment1.txt
#change how we cut, ie cut after a #
cut -f1 -d "#" Input_docs/Experiment1.txt
Options:
- -f2 = keep the second element after the separators (tab, by default)
- -d "#" = we use a # as delimiter
Exercise
- In PF00900.faa, cut the text off after the first _ (i.e. keep the first element)
- In PF00900.faa, cut to only keep the text after the first _ (i.e. keep the second element)
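A sketch of the hidden answers:
#question1: keep the first element before the _
cut -f1 -d "_" Input_docs/PF00900.faa
#question2: keep the second element after the _ (see the warning below)
cut -f2 -d "_" Input_docs/PF00900.faa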
Why should we not do the second option?
If we cut away the > symbol we break how the header of a fasta file looks. I.e. fasta headers always have to start with this symbol, and if we used such a command we would need to add the > back in.
Cat: Combining data
The cat command has three main functions related to manipulating text files:
- creating files
- displaying files
- combining files
Depending on how we use cat, we can use it in these 3 different contexts:
- Create a new file: type Hello!, press enter, and press “ctrl+d” to save the file
- Display the content of an existing file
- Concatenate several files
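Sketches of the first two uses (file_test.txt is the file from the nano section):
#1. create a new file: type Hello!, press enter, then ctrl+d
cat > file_test.txt
#2. display the content of an existing file
cat file_test.txt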
#merge files
cat file_test.txt file_test.txt > file_merged.txt
#check the content of the file
cat file_merged.txt
Exercise
- View Input_docs/Experiment1.txt by using cat
- Combine Input_docs/Experiment1.txt and Input_docs/Experiment2.txt.
- Do the same as above but use a wildcard to not have to type everything, and store the output as All_Experiments.txt
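A sketch of the answers (the name of the combined output file is an assumption):
#question1
cat Input_docs/Experiment1.txt
#question2
cat Input_docs/Experiment1.txt Input_docs/Experiment2.txt
#question3, using a wildcard
cat Input_docs/Experiment[12].txt > All_Experiments.txt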
Pipes: Combining commands
Pipes are a powerful utility to connect multiple commands together. Basically, pipes allow us to feed the standard output of one command as input into another command.
In this example we first combine the files, then use grep to count the number of lines
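A sketch of such a pipe (the grep pattern is just an example):
#combine the experiment files, then count the lines containing control
cat Input_docs/Experiment1.txt Input_docs/Experiment2.txt | grep -c "control"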
Sort
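sort sorts the lines of a text file, alphabetically by default. For example:
#print the sorted lines of the file
sort Input_docs/Experiment2.txt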
Uniq
uniq – can be used to remove or find duplicates. However, for this to work the file first needs to be sorted. To combine two commands in one go, we can use the pipe symbol, i.e. |.
#only keep duplicates
sort Input_docs/Experiment2.txt | uniq -d
#remove duplicates
sort Input_docs/Experiment2.txt | uniq -u
Exercise
- Using a pipe: combine the faa files in Input_docs. How many sequences do we have in total?
- Using a pipe: combine the faa files in Input_docs. Extract only the headers and then count how many duplicates we have.
- Same as above, but how many sequences are not duplicated? Check the manual for uniq on what option allows you to do this.
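A sketch of possible answers:
#question1: total number of sequences
cat Input_docs/*.faa | grep -c ">"
#question2: number of duplicated headers
cat Input_docs/*.faa | grep ">" | sort | uniq -d | wc -l
#question3: number of headers occurring only once (-u = only print unique lines)
cat Input_docs/*.faa | grep ">" | sort | uniq -u | wc -l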
diff: finding differences
This command compares the contents of two files and displays the differences. Lines beginning with a < denote file1, while lines beginning with a > denote file2.
#check the original files
cat Input_docs/Experiment2.txt
cat Input_docs/Experiment3.txt
#find the differences
diff Input_docs/Experiment2.txt Input_docs/Experiment3.txt
Exercise
- Find the headers that are different for PF00900.faa and PF01015.faa
#option1, generate intermediate files that store the file header
grep ">" Input_docs/PF00900.faa > PF00900_header
grep ">" Input_docs/PF01015.faa > PF01015_header
diff PF00900_header PF01015_header
#option2, do this all in one command
diff <(grep ">" Input_docs/PF00900.faa) <(grep ">" Input_docs/PF01015.faa)
In the hidden code, there are 2 ways to do this: one generating intermediate files and a second command avoiding them. The second uses a new type of syntax: <(command), called process substitution. Basically, we run the grep command and redirect its output into the diff command.
find
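find searches a directory tree for files matching an expression. A minimal sketch:
#list all faa files below the current directory
find . -name "*.faa"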
echo and variables
variables
A variable is a character string to which we assign a value. The value assigned could be a number, text, filename, path, device, or any other type of data.
#store your name in a variable called ``x``
x="Nina"
#print the variable we just have created using echo
echo "I am $x"
Something to be aware of is that echo treats quotes differently:
- with double quotes, $x is read as a variable
- with single quotes, $x is read literally
#store 10 in the variable x
x=10
#print the variable
echo "the value is $x"
#print a literal `$x`
echo 'the value is $x'
We can use these things in several ways:
- we can use them to add rows, i.e. a header, to our data (see the sketch below). Here, the - is used as a pseudo file name to indicate standard input; in this case the input is the echo line from our first command. The -e is used to enable interpretation of backslash escapes.
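A sketch of that command; the header text itself is made up for the example:
#prepend a tab-separated header line to the table
echo -e "Sample\tValue1\tValue2" | cat - Input_docs/Experiment1.txt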
- If we write longer scripts, we can set variables for our working directory and most important paths at the beginning.
For the example below, change the path to your own working directory.
#set variable for our wdir
wdir="/export/lv3/scratch/workshop_2021/"
#check if our variable is stored
echo $wdir
#set wdir
cd $wdir
#go back to your unix working dir
cd Users/NDombrowski/Unix_tutorial
If you set variables and use them like that they are stored until:
- They are over-written by a variable with the same name
- you close your session, i.e. close the terminal
You can permanently store variables, but this is a topic we will not cover here.
SED: manipulating files
Sed is a stream editor. A stream editor is used to perform basic text transformations on an input stream (a file or input from a pipeline).
The most basic pattern to use sed for is sed 's/find/replace/' file. An example file can be generated with echo "how now brown cow" > temp.txt.
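A sketch of this in action (the replacement word is just an illustration):
#generate the example file
echo "how now brown cow" > temp.txt
#replace the first occurrence of brown
sed 's/brown/red/' temp.txt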
Basics: search and replace
# search for 'Ex' and replace it with 'Experiment'
sed 's/Ex/Experiment/' Input_docs/Experiment1.txt
# search and replace the pattern across the whole file and not only the first occurrence per line
sed 's/Ex/Experiment/g' Input_docs/Experiment1.txt
#we can also use wildcards to replace control1 and control2 with control
sed 's/control[0-9]/control/g' Input_docs/Experiment1.txt
One important thing to remember is that certain symbols have specific meanings in UNIX. Examples are: commas, brackets, pipes. To search for these in files, we need to escape them with a backslash (\):
#replace the square bracket with a round bracket.
sed 's/N\[0.4uM\]/N(0.4uM)/g' Input_docs/Experiment1.txt
Exercise
- In Input_docs/PF00900.faa, replace the GCA_ with Genome_
- In Input_docs/PF00900.faa, replace the string of numbers after the GCA_ with hello
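A sketch of the hidden answers:
#question1
sed 's/GCA_/Genome_/g' Input_docs/PF00900.faa
#question2: replace the digits following GCA_ with hello
sed 's/GCA_[0-9]*/GCA_hello/g' Input_docs/PF00900.faa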
Removing things (and using WILDCARDs)
- ‘ *$’ indicates a line containing zero or more spaces. Hence, deleting lines matching this pattern will delete all lines which are either empty or contain only blank spaces.
# remove the first line
sed '1d' Input_docs/Experiment1.txt
#remove last line
sed '$d' Input_docs/Experiment1.txt
#remove lines 2-4
sed '2,4d' Input_docs/Experiment1.txt
#remove lines other than 2-4
sed '2,4!d' Input_docs/Experiment1.txt
#remove the first and last line
sed '1d;$d' Input_docs/Experiment1.txt
#remove lines beginning with an **L**
sed '/^L/d' Input_docs/Experiment1.txt
#delete lines ending with d
sed '/d$/d' Input_docs/Experiment1.txt
#delete lines ending with d OR D
sed '/[dD]$/d' Input_docs/Experiment1.txt
#delete blank lines ('^$' indicates lines containing nothing)
sed '/^$/d' Input_docs/Experiment1.txt
#delete lines that start with capital letters
sed '/^[A-Z]/d' Input_docs/Experiment1.txt
#delete lines with the pattern **Ex**
sed '/Ex/d' Input_docs/Experiment1.txt
#delete lines with the pattern control or uM
sed '/control\|uM/d' Input_docs/Experiment1.txt
#remove the 2nd occurrence of a pattern per line (here: o)
sed 's/o//2' Input_docs/Experiment1.txt
#remove all digits across the whole file
sed 's/[0-9]//g' Input_docs/Experiment1.txt
#remove all letters and numbers across the whole file
sed 's/[a-zA-Z0-9]//g' Input_docs/Experiment1.txt
#remove character, here E, regardless of the case
sed 's/[eE]//g' Input_docs/Experiment1.txt
removing text between patterns
If we have complicated headers and want to shorten them, sed is also useful. I.e. let’s consider this example: >MBN1215629.1 4Fe-4S dicluster domain-containing protein [Candidatus Lokiarchaeota archaeon]
The ID and taxon are useful, but the text in between might be a bit much. Here, we can use the first space and the [ as patterns and remove everything in between:
sed 's/ [^\[]*/_/1' <(echo ">MBN1215629.1 4Fe-4S dicluster domain-containing protein [Candidatus Lokiarchaeota archaeon]")
Using a screen
Screen or GNU Screen is a terminal multiplexer. It means that you can start a screen session and then open any number of windows (virtual terminals) inside that session. Processes running in Screen will continue to run when their window is not visible, even if you get disconnected. This is perfect if we start longer-running processes on the server and want to shut down our computer.
The basic commands to know are listed below. We detach from a screen with control+a followed by d.
#start a screen and give it a name
screen -S testrun
#see what screens you have running
screen -ls
#restart an existing screen
screen -r testrun
#detach a screen (from outside a screen)
screen -d
#completely close and remove the screen, type
exit
Running loops
A for loop is a bash programming language statement which allows code to be repeatedly executed, i.e. it allows us to run a command 100 times.
The bash while loop is a control flow statement that allows code or commands to be executed repeatedly based on a given condition, for example to run an echo command 5 times, to read a text file line by line, or to evaluate the options passed on the command line to a script.
Try running this example:
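A minimal example:
#print a line for each of the three values of i
for i in 1 2 3; do echo "Welcome $i times"; done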
Step by step, the loop sets i to each listed value in turn and executes the command between do and done once per value.
We could also use a loop to run sed on all our txt files
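A sketch (printing to the screen only; the next example stores the output):
#run the same sed replacement on every txt file
for i in Input_docs/*txt; do sed 's/N/P/g' $i; done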
We might also want to store the output of our loops
for i in Input_docs/*txt; do sed 's/N/P/g' $i > ${i}_2.txt; done
#check that all went alright with
ll Input_docs/*txt
You see here that we added curly brackets to define the borders of our variable i, i.e. if we did not have the brackets, Unix would look for a variable i_2, which does not exist.
We can also use a file list for building a loop, which gives us much better control over where we want to store the new files. For example, in FileList we have a list of the two files we want to work with: Experiment1 and Experiment2.
for i in `cat Input_docs/FileList`; do sed 's/N/P/g' Input_docs/${i}.txt > Output_docs/${i}_2.txt; done
Notice that with this option we need to add the file extension ourselves!
Exercise
- make a list with all the faa files (do this with the command line). Ideally, we want to have PF00900 and PF01015 in one column.
- Use this list to, in a loop, replace the GCA_ with Genome_. Store
the files with the new ending
renamed.faa
#question1
ls Input_docs/*faa | sed 's/Input_docs\///g' | sed 's/\.faa//g' > FaaList
#question2
for sample in `cat FaaList`; do sed 's/GCA_/Genome_/g' Input_docs/${sample}.faa > ${sample}_renamed.faa; done
#check file
head PF00900_renamed.faa
While loops
Comment: Need to extend this with a practice example later
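In the meantime, a minimal sketch of a while loop, reading the FileList from above line by line:
#print every line of the file, one line per loop iteration
while read line; do echo "$line"; done < Input_docs/FileList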
Change file extensions
Some programs at times need very specific file endings. I.e. our files end with faa, but it might be that a script requires you to provide files with a fna ending. We can use our loop powers to change filenames.
#create a dummy test folder
mkdir test_folder
#cp our files into the dummy folder
cp Input_docs/*faa test_folder
#go into the test folder
cd test_folder
#change file ending
for f in *.faa; do
mv -- "$f" "${f%.faa}.fna"
done
#check that all went ok
ls *
#go back
cd ..
Exercise:
Change the filename back to faa.
Change file names using rename
rename works as follows: OldName - NewName - Pattern used to grep all files.
Warning: Depending on the system there are two versions of how to use rename, and if you work on your own system you might need to install it first.
The two versions are:
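In sketch form (the first is the Perl-based rename, the second the util-linux version; old/new/pattern are placeholders):
#version 1: Perl rename, takes a sed-style expression
rename 's/oldname/newname/' pattern*
#version 2: util-linux rename, takes plain strings
rename oldname newname pattern*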
Lets test this on our dummy files:
#go into the dummy folder
cd test_folder
#change the PF to PFAM (the second version should work on ada, the first version works on my Mac)
rename 's/PF/PFAM/' PF*
rename PF PFAM PF*
#check if we did everything ok
ls *
#go back
cd ..
A general tip: especially when testing commands, do this on a backup of your files (like we do here, with our dummy folder). More often than not, things happen that you do not want to happen.
Replacing names in files (works on every kind of file)
The file names_to_replace.txt needs to look as follows:
- two columns, which are tab separated
- first column = original name
- second column = new name
Warning: Be careful with similar names. With a replacement line like Bin1 <tab> Bin1_new, both Bin1 and Bin11 would be matched and renamed. In these cases either update your naming scheme OR add an end-of-line indicator.
#rename
perl /export/data01/tools/scripts/perl/Replace_tree_names.pl Input_docs/names_to_replace Input_docs/PF00900.faa > Input_docs/PF00900_renamed.faa
#check what happened
head Input_docs/PF00900_renamed.faa
Here, we learn something new and that is using custom scripts. I.e. perl or python scripts that we find online or write ourselves.
The script above is a bad example, but a lot of scripts do have some information on how to use them. You can find usage info either by looking at the top of the script with head or by checking if there is a help function included with the script:
head /export/data01/tools/scripts/perl/Replace_tree_names.pl
/export/data01/tools/scripts/perl/Replace_tree_names.pl -h
JOIN: Merging files with common columns
For join to work, the input files need to be sorted. In the given example the files are sorted by the first column; if another column should be used, change it with -k.
Let’s imagine that we have some metadata (i.e. T and how long we ran the experiment). We can use join to add this information to our measurements stored in Input_docs/Experiment1.txt.
#merge two of our experiment files
LC_ALL=C join -a1 -j1 -e'-' -o 0,1.2,1.3,1.4,2.2,2.3,2.4 <(LC_ALL=C sort Input_docs/Experiment1.txt) <(LC_ALL=C sort Input_docs/Metadata) -t $'\t' | LC_ALL=C sort > MergedFile
#check what happened
head MergedFile
Used options:
- LC_ALL=C = make sure that join and sort speak the same language (sometimes ada has issues with join and sort communicating otherwise)
- -a1 = also print unpairable lines (print all lines from file 1 in this case)
- -1 1 = in File1 use column 1 for merging
- -2 1 = in File2 use column 1 for merging
- -j = equivalent to ‘-1 1 -2 1’, i.e. in both File1 and File2 use column 1 for merging
- -e '-' = fill missing fields with a dash
- -o = specify the order of the output format
Random but useful
The code below is a random collection of code the author found useful. This means it might be interesting for you, BUT these snippets are so far not linked to example files and some programs might not be available on ada.
Excel/DOS to UNIX issue cleanup
- hidden symbols
Sometimes, saving excel documents as text inserts hidden symbols. These can be seen as a blue ^M when opening the file in vim (sometimes they result in odd errors while parsing tables). Most often you see these issues when you open your files and lines are merged that should not be merged.
This symbol can be removed as follows in vim:
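A sketch of the vim command (the \r matches the hidden carriage return; alternatively type the ^M by pressing ctrl+v followed by ctrl+m):
:%s/\r//g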
- wrong file types
Files created on WINDOWS systems are not always compatible with UNIX. In case there is an issue it is always safer to convert. You can see whether you have an issue by opening your file of interest with nano and checking the file format at the bottom.
If we see we are dealing with a dos file, we can clean files like this:
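Two common options, as a sketch (my_file.txt is a placeholder; dos2unix may need to be installed first):
#option 1: dedicated converter
dos2unix my_file.txt
#option 2: strip the carriage returns with GNU sed, editing in place
sed -i 's/\r$//' my_file.txt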
datamash: merging rows by keys
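As an illustration only: datamash can, for example, collapse rows that share a key in the first column. A sketch, assuming a sorted, tab-separated input file (my_table.txt is a made-up name):
#for every unique value in column 1, collapse the matching values of column 2 into one row
datamash -g 1 collapse 2 < my_table.txt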
Working with Conda Environments
Some quick words about conda.
Conda is a powerful package manager and environment manager that you use with command line commands at the Anaconda Prompt for Windows, or in a terminal window for macOS or Unix.
Importantly, it allows you to install software without admin rights and is a safe alternative that avoids messing up existing program setups.
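A few typical commands, as a sketch (the environment name is arbitrary):
#create a new environment with a specific python version
conda create -n my_env python=3.9
#activate and later deactivate the environment
conda activate my_env
conda deactivate
#list all available environments
conda env list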
Using environmental variables
Simply put, environment variables are variables that are set up in your shell when you log in. They are called “environment variables” because most of them affect the way your Unix shell works for you. I.e. one points to your home directory and another to your history file.
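For example:
#two environment variables that are always set
echo $HOME
echo $PATH
#list all environment variables
printenv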
Installing stuff
For installing things on the server, for bigger things talk to Hans Malschaert (hans.malschaert@nioz.nl) since we do not have administrator rights.
When installing tools on your own computer you can check apt-get; however, since this requires admin rights, always be careful what you install.
For Mac users, brew is another relatively safe alternative to install tools; conda is another option that is supported for Mac and Windows.
Dealing with PDFs –> CPDF
CPDF is a useful program if you want to merge pdfs, remodel them all to A4 etc.
This is not available on the server but easy to install in case you are interested.
For more information, see here
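A sketch of a typical call (the file names are placeholders):
#merge two pdfs into one
cpdf -merge file1.pdf file2.pdf -o merged.pdf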
Access rights
Each file (and directory) has associated access rights, which may be found by typing ls -l. Also, ls -lg gives additional information as to which group owns the file
-rw-r--r-- 1 staff 156B Sep 28 2019 Input_docs/Experiment1.txt
In the left-hand column is a 10 symbol string consisting of the symbols d, r, w, x, -, and, occasionally, s or S. If d is present, it will be at the left hand end of the string, and indicates a directory: otherwise - will be the starting symbol of the string.
The 9 remaining symbols indicate the permissions, or access rights, and are taken as three groups of 3.
- The left group of 3 gives the file permissions for the user that owns the file (or directory)
- the middle group gives the permissions for the group of people to whom the file (or directory) belongs
- the rightmost group gives the permissions for all others.
The symbols r, w, etc., have slightly different meanings depending on whether they refer to a simple file or to a directory.
Access rights on files.
- r (or -), indicates read permission (or otherwise), that is, the presence or absence of permission to read and copy the file
- w (or -), indicates write permission (or otherwise), that is, the permission (or otherwise) to change a file
- x (or -), indicates execution permission (or otherwise), that is, the permission to execute a file, where appropriate
Access rights on directories.
- r allows users to list files in the directory;
- w means that users may delete files from the directory or move files into it;
- x means the right to access files in the directory. This implies that you may read files in the directory provided you have read permission on the individual files.
So, in order to read a file, you must have execute permission on the directory containing that file, and hence on any directory containing that directory as a subdirectory, and so on, up the tree.
Changing access rights
chmod
= changing a file mode
Chmod options:
- u = user
- g = group
- o = other
- a = all
- r = read
- w = write (and delete)
- x = execute (and access directory)
- + = add permission
- - = take away permission
For example, to remove read write and execute permissions on the file example.txt for the group and others, type
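In code:
#remove read, write and execute permissions for group and others
chmod go-rwx example.txt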
Extract sequence by pattern found in fasta header
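A sketch, assuming each sequence sits on a single line below its header (for multi-line fasta files a dedicated tool is safer); my_pattern is a placeholder:
#print the matching header plus the sequence line after it
grep -A1 "my_pattern" Input_docs/PF00900.faa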
Connecting to servers
Basics
SSH (Secure Shell) is a network protocol that enables secure remote connections between two systems.
Options:
- -Y enables trusted X11 forwarding in SSH (if we want to open e.g. a java interface, or view alignments). For this to work you might need to install X11 on your computer first. Trusted means: the remote machine is treated as a trusted client. This means that other graphical (X11) clients could take data from the remote machine (make screenshots, do keylogging and other nasty stuff) and it is even possible to alter those data.
- -X enables untrusted X11 forwarding in SSH. Untrusted means: your local client sends a command to the remote machine and receives the graphical output.
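A typical call then looks like this (user and host are placeholders):
#connect to a server with trusted X11 forwarding
ssh -Y username@server.address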
Checking available resources
Since we share resources with a lot of other users, keep all of this in mind:
- Try not to use more than 30% of our available resources.
- If you need more, contact the other users.
There are different methods we have to check how busy the servers are.
1a. top
Typing top into the terminal should give something like this:
- PID: Unique process id.
- USER: Task’s owner.
- PR: It is the priority of the task.
- NI: The nice value of the task. A negative nice value means higher priority,whereas a positive nice value means lower priority.
- VIRT: Total amount of virtual memory used by the task.
- RES: Resident size, the non-swapped physical memory a task has used.
- SHR: Shared Mem size (kb), the amount of shared memory used by a task.
- %CPU: The CPU usage as a percentage of total CPU time.
- %MEM: The memory usage, a task’s currently used share of available physical memory.
- As a rule of thumb: in the example above the first process shows 1492 %CPU, i.e. 1492/100 ≈ 15 of the 144 available CPUs.
- S: Status of the process
- TIME+: CPU Time
- COMMAND: Display the command line used to start a task or the name of the associated program.
1b. htop
Htop gives similar info to top but makes it a bit easier to see how many CPUs and how much memory are in use in total:
The numbers from 1-144 are our 144 CPUs and the fuller the bar is, the more of a single CPU is currently in use. This is also summarized under tasks. Another important line is the memory listing how much in total of the avail. memory is in use.
df
Monitor the available space on the different file systems.
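For example:
#show all file systems with human-readable sizes
df -h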
free
Monitor the available and used memory of the system.
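For example:
#show total, used and free memory in human-readable units
free -h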
We can see that lv3, where scratch is, is almost full. So it is a good time to do some data cleaning.
du
Monitor how much space specific folders take up. In the example below we look at all the folders in the current directory
For example, we can ask how much space our desktop needs:
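A sketch of both uses:
#size of every folder in the current directory
du -sh *
#size of the Desktop folder
du -sh ~/Desktop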
Transferring data from/to a server
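The standard tools for this are scp and rsync; in sketch form (paths, user and host are placeholders):
#copy a local file to the server
scp my_file.txt username@server.address:/path/on/server/
#copy a folder from the server into the current local directory
scp -r username@server.address:/path/on/server/my_folder .
#rsync can resume interrupted transfers (-a = archive, -v = verbose)
rsync -av my_folder username@server.address:/path/on/server/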
Working on a server via slurm
As mentioned above, on the larger NIOZ servers we cannot directly run jobs but need to submit them via a job submission system called slurm.
This section is for your information only.
Preparing a job script
To submit a job, we need to open a script in nano and describe what resources we need.
Inside the script we can have something written like this:
#!/bin/sh
#SBATCH --partition=normal # default "normal", if not specified
#SBATCH --nodelist=no1 # the node we want to work on
#SBATCH --time=0-06:30:00 # run time in days-hh:mm:ss
#SBATCH --nodes=1 # require 1 node
#SBATCH --ntasks-per-node=36 # (by default, "ntasks"="cpus")
#SBATCH --mem-per-cpu=4000 # MB RAM per CPU core (default 4 GB/core)
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out
# Executable commands :
iqtree -s my_aln.faa
The most important things are
- the partition, esp. for jobs that run longer
- the node we want to work on, i.e. only some allow for longer running jobs
- the number of nodes; we usually use one for our jobs
- --error and --output are good to keep in case you run into problems
Not absolutely necessary (at least on the NIOZ system but might be on other systems)
- time = not necessary for the NIOZ server, just make sure you are in the max limit
- mem-per-cpu
- nodes = not needed if you use nodelist
basic commands for slurm:
squeue is important to check whether your command is running OK, but also to see how heavily used the servers are.
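In sketch form, the everyday slurm commands (the job id is a placeholder):
#submit a job script
sbatch my_job_script.sh
#check the queue; add -u username to only show your own jobs
squeue
#cancel a job via the id shown by squeue
scancel 12345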