Metagenomics: Annotation tutorial

1 Introduction

The goal of this tutorial is to teach students how to annotate proteins from genomes; it is part of the NIOZ bioinformatics workshop. However, if you are interested in the topic, feel free to follow along on your own: all required tools and databases are referenced.

Specifically, we will:

  • download two genomes from NCBI
  • annotate these genomes using several databases such as the COGs, arCOGS, KOs, PFAM, …
  • combine the results into one comprehensive table

General comments

  • A lot of programs in this workflow use for loops or GNU parallel to speed things up. While this might not make much sense for only two genomes, it becomes very useful once you work with thousands of files (a minimal comparison of the two patterns is sketched after this list).
  • Most annotation steps can make use of several CPUs. For this tutorial we only use a few CPUs, but this can be adjusted according to the resources you have available. More on how to check resources can be found in the Unix tutorial.
  • This script uses a lot of AWK for parsing. If you want some more insight into how to use AWK, we also have an AWK tutorial available.
  • Once you have the annotation table, we provide an R tutorial here that gives an example of how to summarize the data.
  • The individual database searches might take a while to run (esp. the interproscan and the diamond search against the NCBI database), so if you run the full workflow, have patience.
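As a minimal sketch of the difference between the two patterns (my_tool and the *.fna input files here are placeholders, not part of this workflow):

#run a command on every file, one file after the other
for f in *.fna; do my_tool "$f" > "${f%.fna}.out"; done

#run the same command on up to 4 files at a time with GNU parallel
parallel -j4 'my_tool {} > {.}.out' ::: *.fna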
Note

Whenever we ask a question the code solution will be hidden.
You can show the code by clicking on <Click me to see an answer>.

2 General

Below you find a list of the programs used. If you want to run this pipeline outside of the bioinformatics workshop on your own servers, please install the necessary tools and download the custom scripts and databases from the Zenodo repository.

General notice for people interested in eukaryotic genomes:

This tutorial is designed specifically with prokaryotic genomes in mind but can be used for eukaryotic proteins as well. This is especially easy if you already have protein files for your genomes of interest. If you do not, beware that the tool we use for protein calling, prokka, is specific to prokaryotic genomes. Tools for identifying eukaryotic proteins are, among others, Augustus for single genomes or MetaEuk for metagenomes.

Something to keep in mind with the databases used here is that they were often generated using only archaeal and/or bacterial genomes. Only a few of them, such as the KO database, also include eukaryotic genomes. A good alternative to check out are the KOGs, which are specifically designed for eukaryotes. The most recent profiles provided by EggNOG can be found here.

2.1 Version programs

The versions on your system might differ from the ones listed below; this should not be problematic, but for consistency we list the version numbers that were used to generate this workflow as well as links with install instructions.

  1. Python 2.7.5
  2. perl v5.16.3
  3. prokka 1.14-dev ; install instructions can be found here
  4. HMMER 3.1b2; install instructions can be found here
  5. blastp v2.7.1+; install instructions can be found here
  6. diamond 0.9.22; install instructions can be found here
  7. interproscan: InterProScan-5.29-68.0; install instructions can be found here
  8. GNU parallel 20181222; install instructions can be found here

2.2 Version databases

Databases (and mapping files) used in the current version can be downloaded from [here](https://zenodo.org/deposit/6489670).

  1. KOhmms: downloaded 90419 from here
  2. PGAP (former TIGRs): downloaded Nov 2021 from here
  3. Pfams: RELEASE 34.0, downloaded Nov 2021 from here
  4. CAZymes: dbCAN-HMMdb-V9 downloaded on 21 April 2020 from here
  5. Merops: downloaded from Merops server in November 2018 from here
  6. HydDB: downloaded from the HydDB webserver in July 2020 from here
  7. COGs: downloaded from NCBI October 2020 from here
  8. TransporterDB: downloaded from TCDB April 2021 here
  9. ncbi_nr (maintained and downloaded by Alejandro Abdala Asbun). The last update was done in August 2021. For more details on downloading the data on your own, check out this resource. Additionally, prot.accession2taxid.gz was downloaded from here. The nr database was converted into a diamond database with blastdbcmd -entry all -db nr/nr -out all.nr.masqued.fasta -line_length 10000000 -ctrl_a followed by diamond makedb --in all.nr.masqued.fasta --db diamond/nr --taxonmap prot.accession2taxid.gz
  10. ArCOGs: downloaded in 2019 and available here (update date: 2018)

2.3 External scripts needed

Custom scripts used in the current version can be downloaded from [Zenodo](https://zenodo.org/deposit/6489670) and Github.

python

  • fasta_length_gc.py
  • Parse_prokka_for_MAGs_from_gbk-file.py
  • Split_Multifasta.py
  • parse_IPRdomains_vs2_GO_2_ts_sigP.py
  • parse_diamond_blast_results_id_taxid.py
  • merge_df.py

3 Setup working environment

3.1 Define variables

Notice: If you have done the assembly and binning tutorials that we also provide, you will see that we do things a bit differently here. When annotating genomes, we use a lot of different databases, and there are two ways to deal with the many files that might be found in different locations:

  1. we can add the path for each database into the code itself
  2. we can define variables that store the paths at the beginning of our workflow

Especially if you work with collaborators, it is often easier to use the second option, as you only have to adjust the paths in one spot.

Instructions:

  • Change the first variable, wdir, to the path where you want to work from, and change the paths for the databases according to where you downloaded the databases and mapping files.
  • The two last lines of code are needed for the software prokka and are specific to the system the script was written on. Adjust them according to how prokka is set up on your system.

For people following the tutorial outside of the NIOZ bioinformatics workshop on their own servers: the necessary databases and scripts can be downloaded from Zenodo.

#set variable for the base directory and personal user directory
user_name="ndombrowski"
base_dir="/export/lv3/scratch/bioinformatics_workshop/Users/"
wdir=${base_dir}/${user_name}

#nr of cpus we generally want to use
cpus=2

#location of some custom python and perl scripts
export script_folder="/export/lv3/scratch/bioinformatics_workshop/S11b_annotations/scripts/"

#locations of the individual databases
cog_db="/export/data01/databases/cogs/hmms/NCBI_COGs_Oct2020.hmm"
arcog_db="/export/data01/databases/arCOG/arCOGs2019/All_Arcogs_2018.hmm"
ko_db="/export/data01/databases/ko/All_KO"
pfam_db="/export/data01/databases/pfam/Pfam-A.hmm"
tigr_db="/export/data01/databases/PGAP_ProtFam/hmm_PGAP.lib"
cazy_db="/export/data01/databases/cazy_spt/dbCAN-HMMdb-V7.txt"
transporter_db="/export/data01/databases/tcdb/tcdb"
hyd_db="/export/data01/databases/HydDB/HydDB_uniq"
ncbi_nr_db="/export/data01/databases/ncbi_nr/diamond/nr.dmnd"

#locations of the mapping files for the indiv. databases 
#these files link the database ID, i.e. a KO number like K00001, to a more useful description
cog_mapping="/export/data01/databases/cogs/hmms/cog_mapping.txt"
ko_mapping="/export/data01/databases/ko/ko_list_for_mapping_clean"
pfam_mapping="/export/data01/databases/pfam/Pfam-A.clans.cleaned.tsv"
cazy_mapping="/export/data01/databases/cazy_spt/CAZY_mapping_2.txt"
hyd_mapping="/export/data01/databases/HydDB/HydDB_mapping"
taxon_db="/export/data01/databases/taxmapfiles/ncbi_nr/prot.accession2taxid.gz"
arcog_mapping="/export/data01/databases/arCOG/arCOGs2019/ar14.arCOGdef18.nina.tab"
transporter_mapping="/export/data01/databases/tcdb/tcdb_definition.tsv"
tigr_mapping="/export/data01/databases/tigr/hmm_PGAP_clean.tsv"

#mapping file for ncbi taxonomy
ncbi_tax_mapping="/export/lv3/scratch/bioinformatics_workshop/S12_annotations/databases/taxonomy5.txt"

#Fix PERL5LIB PATH (required for using prokka on the NIOZ servers)
export PERL5LIB=/export/lv1/public_perl/perl5/lib/perl5

3.2 Set our working directory

Let’s first go to our working directory and make a new folder for the Annotation tutorial.

Note

Below you see how we use our variables, i.e. we use the path stored in the wdir variable to go to our working folder. If you are on another server or want to use a different folder, you don't have to change anything in the code below: there is only a single section where a user has to adapt the code, which keeps the rest of the code stable.

Environmental variables therefore make it easier to adapt our workflow to changing circumstances. For example, if we need to change the location of our input data, we can update the environmental variable for the input data path rather than having to update the path at several spots in the workflow individually.

#go to your working directory
cd $wdir

#make a folder for the Annotations and move into this directory
mkdir Annotations
cd Annotations

4 Prepare genomes of interest

In this workflow we will work with 2 genomes that we download from NCBI, but the steps can easily be adjusted for your own genomes. You could also work with the bins we assembled and binned in the previous workflows; if you feel comfortable with the command line, feel free to use those instead.

First, we will download two archaeal genomes from NCBI and rename them to avoid issues with the workflow. The bins from the previous workflows (inside ../Binning/4_FinalBins/renamed) already have a good header and should be ready to use in case you want to repeat the workflow with those genomes.

First a word on the importance of good file names and sequence headers

After we have downloaded the 2 genomes from NCBI the file name and file header should look something like this:

We can see that both the file name and the file header are long and contain a lot of symbols UNIX might have trouble with (brackets, spaces, dots…). Generally, things to watch out for when naming files are:

  • Have short but descriptive file names and file headers. I.e. if we look at the header of the NCBI genomes, you will see that it is rather long and contains information that is unnecessary for our purposes (i.e. genomic). Therefore we want to make it shorter.

  • Avoid extra symbols (especially avoid using spaces, commas, semicolons, quotes or brackets) as these can easily result in unwanted behavior. I.e. a very simple naming could only include two symbols:

    • underscores, _ , for everything where we normally would use a space
    • minus symbol - for key information that we might want to separate later (i.e. genome name and protein ID)

Another useful thing is to add information into the headers that allow us to concatenate multiple genomes and afterwards still know from what genome a sequence came from. Therefore, we recommend to:

  • Add the genome name into the fasta header and use a unique delimiter to separate it from the contig ID. That way you can concatenate your genomes and afterwards still know from what genome each contig came.
  • Have a consistent way to separate the genome and protein ID across different analyses. In phylogenies it's very common to concatenate several genes/proteins from individual genomes, and for this it is useful to remove the protein ID from the fasta header.
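To make this concrete, here is a hedged illustration of what we are aiming for (the exact wording of the original NCBI header may differ slightly):

#original NCBI-style header: long, with spaces and commas
#>AE017199.1 Nanoarchaeum equitans Kin4-M, complete genome

#cleaned header used in this tutorial: genome ID and contig ID separated by a "-"
#>GCA_000008085-AE017199.1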
Note

If you want to work with the files from the binning tutorial:

Your files already have a header that is good to go, so you can simply move them from the Binning tutorial to the fna/renamed folder.

mkdir -p fna/renamed

#create symbolic link
ln -sf $(pwd)/../Binning/4_FinalBins/renamed/NIOZ1* fna/renamed

#control that all bins are available
ll fna/renamed/* | wc -l

#control that the sequence header has the format: BinID-contigID
head $(ls -1 fna/renamed//*fna | head -1)

4.1 Download genomes from NCBI and cleanup

#make folders to store our genomes in
mkdir -p fna/renamed

#download genomes from NCBI
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/008/085/GCA_000008085.1_ASM808v1/GCA_000008085.1_ASM808v1_genomic.fna.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/017/945/GCA_000017945.1_ASM1794v1/GCA_000017945.1_ASM1794v1_genomic.fna.gz

#unzip files
gzip -d *gz

#confirm that we have as many files as we expect: 2
ls -l *.fna

#shorten the file name by cutting everything away after the dot
for f in *.fna; do mv $f "${f/.*}".fna; done

#check the file name and see if it changed
ls -l *.fna

#check how the header looks 
#--> we see that we have a long name and a lot of symbols that we do not want, lets remove these 
head $(ls -1 *fna | head -1)

#shorten the fasta headers by removing everything after the space
#we also move the file with the changed header into the fna folder
for f in *.fna; do cut -d " " -f1  $f > fna/${f}; done
head $(ls -1 fna/*fna | head -1)

#add the genome name into the header (this is useful if we want to concatenate a lot of genomes into one file)
cd fna
for i in *fna; do awk '/>/{sub(">","&"FILENAME"-");sub(/\.fna/,x)}1' $i > renamed/$i; done
head $(ls -1 renamed/*fna | head -1)

#remove intermediate files
rm -f *fna
cd ..
rm -f *fna

The cleaned up files can be found in fna/renamed and file names should look something like this afterwards:

4.1.1 The steps above work fine but what if we have hundreds of genomes that we want to download from NCBI?

Warning

Below we have some hidden, alternative code.

If you deal a lot with genomes from NCBI, you may want to have a file that links an NCBI taxonomy ID to a taxonomy string. For example, the NCBI taxonomy ID 7 is connected to this taxonomic information: Bacteria,Proteobacteria,Alphaproteobacteria,Rhizobiales,Xanthobacteraceae,Azorhizobium,Azorhizobium_caulinodans. The names in this string refer to the different taxonomic ranks: Domain, Phylum, Class, Order, Family, Genus and Species.

This information can be useful if you work with a genome of interest, such as E. coli, and want to include several closely related genomes as references in your analysis; NCBI is an excellent resource providing a multitude of genomes as well as metadata. The taxonomy string in this example can help you to more easily find related genomes.

The code is not yet fully implemented for people working outside the NIOZ but the code below should give you enough information to get you started.

To explore the code, click on Click me to look at the code.

Click me to look at the code

Notice:

  • The file that links the taxonomy ID with the taxa string (lineages-2020-12-10.csv) was generated with ncbitax2lin (instructions added below)
  • this part of the workflow was set up to run at servers at the NIOZ and the paths need to be adjusted to work on your system

The first part of this workflow is to get a list of all available archaeal and bacterial genomes from NCBI and add a taxonomic string into the file we downloaded.

#prepare some folders
mkdir ncbi_genomes
cd ncbi_genomes

#get a list of all avail. microbial genomes
wget https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt

This file includes a list of all genomes available at NCBI and relevant metadata such as the project ID, taxonomy ID, species name, download link for individual genome files etc.

Next, we will clean the file a bit and, since the species name might not be informative enough (i.e. we might want to know to what phylum a species belongs), we want to add in the full taxonomy string.

#clean up the file by removing the first row and any `#` and move the tax id to the front
sed 1d assembly_summary_genbank.txt | sed 's/# //g' | awk -F"\t" -v OFS="\t" '{print $6, $0} ' | sed 's/ /_/g' > assembly_summary_edit.txt

#replace empty rows with a "none" and remove the header
awk 'BEGIN { FS = OFS = "\t" } { for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i="none"}; 1' assembly_summary_edit.txt | sed '1d' > temp1

#add in the tax string (to be able to better select genomes)
#the code for how to generate the csv needed here can be found below
#merge assembly info with tax string
LC_ALL=C join -a1  -j1 -e'-' -o 0,1.2,1.3,1.4,1.5,1.6,1.8,1.9,1.10,1.11,1.13,1.16,1.17,1.18,1.19,1.21,2.2 <(LC_ALL=C sort temp1) <(LC_ALL=C sort  /export/lv1/user/spang_team/Databases/NCBI_taxonomy/Dec2020/lineages-2020-12-10.csv) -t $'\t' > temp2

#add in headers
echo -e "taxid\taccession\tbioproject\tbiosample\twgs\trefseq_category\tspeciesID\torganism\tintraspecific_name\tisolate\tassembly\trel_date\tasm\tsubmitter\tgcf_name\tftp_path\ttax" | cat -  temp2 >  assembly_summary_archaea_Dec2020_tax.txt

cd ..

In assembly_summary_archaea_Dec2020_tax.txt you can easily search for the genomes you are interested in. For example, you can subset for lineages of interest by selecting specific groups via their taxonomy, and you might run CheckM as well to select them based on their genome stats.

Once you have selected the genomes you are interested in, you can copy the ftp_links into a document called get_genomes.txt via nano and then make some small changes to the file with sed. Afterwards you can download all the genomes from NCBI using wget.
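As a hedged alternative to copying the links by hand, you could subset the table and extract the download links directly with awk (the taxon name Crenarchaeota is only an example; the column numbers follow the header line we added above, so double-check them on your own file):

#collect the ftp paths (column 16) for all genomes whose tax string (column 17) matches our group of interest
awk -F"\t" '$17 ~ /Crenarchaeota/ {print $16}' assembly_summary_archaea_Dec2020_tax.txt > get_genomes.txt

#check how many genomes we selected
wc -l get_genomes.txt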

# extend path to be able to download the genomes
sed -r 's|(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)(GCA/)([0-9]{3}/)([0-9]{3}/)([0-9]{3}/)(GCA_.+)|\1\2\3\4\5\6/\6_genomic.fna.gz|' get_genomes.txt  > get_genomes_v2.txt

# download genomes
for next in $(cat get_genomes_v2.txt); do wget  "$next"; done

Once you have the genomes, you can clean the names as suggested above and continue with the steps below.

In case you want to generate the csv file that links the NCBI taxID to a taxonomic string, which includes the phylum, class, order, etc information, this is the workflow (you will need to download ncbitax2lin):

#change to the directory in which we installed ncbitax2lin
cd /export/lv1/user/spang_team/Scripts/NCBI_taxdump/ncbitax2lin

#start bashrc that allows to run conda
source ~/.bashrc.conda2

#setup virtual environment (see more instructions here: https://github.com/zyxue/ncbitax2lin)
#virtualenv venv/

#re-do taxonomy mapping file
make

#unzip file
gzip -d  lineages-2020-12-10.csv.gz

#close virtual environment
#source deactivate venv/

#make a new directory for the most current tax
mkdir /export/lv1/user/spang_team/Databases/NCBI_taxonomy/Dec2020

#change dir
cd /export/lv1/user/spang_team/Databases/NCBI_taxonomy/Dec2020

#cp the taxa file into our new dir and replace spaces with underscore
sed 's/ /_/g'  ~/../spang_team/Scripts/NCBI_taxdump/ncbitax2lin/lineages-2020-12-10.csv  > temp1

#only print the levels until species level
awk -F',' -v OFS="\t" '{print $1,$2,$3,$4,$5,$6,$7,$8}' temp1 > temp2

#add in "none" whenever a tax level is empty
awk 'BEGIN { FS = OFS = "\t" } { for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i="none"}; 1' temp2 > temp3

#merge columns 2-8, thus we have 1 column with the tax number and one column with the tax string
awk ' BEGIN { FS = OFS = "\t" } {print $1,$2","$3","$4","$5","$6","$7","$8}' temp3 > temp4

#remove header
sed '1d' temp4 > lineages-2020-12-10.csv

#clean up
rm temp*

4.2 Make a filelist for all genomes of interest

Let us start by making a file that lists all the genomes we are working with. Such a file list is useful when working with for loops.

#make a folder for all the secondary files we need
mkdir FileLists

#list our genomes of interest
ls fna/renamed/* | sed 's/\.fna//g' | sed 's/fna\/renamed\///g' > FileLists/GenomeList

#check content of file
head FileLists/GenomeList

The last command should print something like this:

4.3 Call proteins with prokka and clean files

Now that we have the genomes organized, let's identify the coding regions and call the proteins with Prokka. Another common tool you might encounter for this step is prodigal. Prokka basically uses prodigal but adds some extra annotations, which we also want to look at.

Notice:

  • Our genomes are from archaea; if you have other genomes, change the --kingdom Archaea part below. This setting generates no major differences when calling proteins, but the Prokka annotations might change slightly.
  • Prokka is not designed to work with eukaryotic genomes; if you work with eukaryotic genomes, an alternative tool to look into is, for example, Augustus.
  • Prokka does very quick and dirty annotations. These are good for a first glimpse into what you have, but in our experience there can be several issues, so use them with caution.
  • Similarly, no database for annotating proteins is perfect. This is why we use multiple databases in our workflow and compare results to be able to spot misannotated proteins more easily (more on that in the last section).

4.3.1 Run prokka

Prokka is a tool for rapid prokaryotic genome annotation (Seemann, 2014).

#prepare new directories to store our data
mkdir faa
mkdir faa/Prokka

#run prokka
for sample in `cat FileLists/GenomeList`; do prokka fna/renamed/${sample}.fna --outdir faa/Prokka/${sample} --prefix $sample --kingdom Archaea --addgenes --force --increment 10 --compliant --centre UU --cpus $cpus --norrna --notrna ; done

#control number of faa files generated: 2
#this should be the same as the original number of genomes
ll faa/Prokka/*/*faa | wc -l

#check how the faa header looks like
head $(ls -1 faa/Prokka/*/*faa | head -1)

We have 2 genomes and the header of the first file should look something like this:

We can see that prokka gives a unique protein ID and adds an annotation to the file header.

Other than the protein faa file, Prokka generates several additional output files. These include:

Additionally, prokka does not only call proteins but also identifies tRNAs and tries to annotate proteins. The whole workflow is summarized in the image below:

Note

A small piece of advice for running things more efficiently

As you have seen above and in the following code we often use for loops to automatically run a tool multiple times on the same input files. Depending on the analysis done this can take a lot of time. If your system permits, a great alternative is GNU parallel.

More specifically, instead of running the two prokka runs after each other with:

for sample in `cat FileLists/GenomeList`; do prokka fna/renamed/${sample}.fna --outdir faa/Prokka/${sample} --prefix $sample --kingdom Archaea --addgenes --force --increment 10 --compliant --centre UU --cpus 2 --norrna --notrna ; done

With GNU parallel, we could run the 2 prokka jobs at the same time. If we had 1000 genomes and could run 40 analyses in parallel, this would make things even more efficient.

parallel -j2 'i={}; prokka fna/renamed/${i}.fna --outdir faa/Prokka/${i} --prefix ${i} --kingdom Archaea --addgenes --force --increment 10 --compliant --centre UU --cpus 2 --norrna --notrna' ::: `cat FileLists/GenomeList`

4.3.2 Clean faa header and combine genomes

If we look again at the protein sequence headers of the newly generated files, we see that they are not ideal yet. For example, a header looks something like this:
>CALJACFG_00010 ATP-dependent DNA helicase Hel308

This header is very long and uses a lot of extra symbols (a space, and a minus symbol, which we already use for other purposes). Therefore, we first want to clean things up a bit.

Additionally, since for the following workflow we want to concatenate the two genomes, i.e. merge the two files into one, it is useful to add the genome name into the fasta header before merging the files.

This is optional, but it is useful to use a special delimiter to later separate the genome name from the protein name. Therefore, we add the file name and separate it from the old sequence header with a -.

#go into the prokka folder
cd faa/Prokka/

#make a new folder in which we want to store our renamed genomes
mkdir renamed

#organize our original files
cp */*faa .

#add the filename into the fasta header and store them in our new folder
for i in *faa; do awk '/>/{sub(">","&"FILENAME"-");sub(/\.faa/,x)}1' $i > renamed/$i; done

#control if this worked 
head $(ls -1 renamed/*faa | head -1)

#remove old files
rm -f *faa

#concatenate files
cat renamed/*faa > All_Genomes.faa

#clean the header and shorten it by removing everything after the first space
cut -f1 -d " "  All_Genomes.faa > All_Genomes_clean.faa

#check if that worked 
head All_Genomes_clean.faa

#go back to the wdir
cd ../..

#make a blast db out of concatenated genomes (we need this whenever we want to run blast)
makeblastdb -in faa/Prokka/All_Genomes_clean.faa  -out faa/All_Genomes_clean   -dbtype prot -parse_seqids

Now our file header should look something like this:

5 If you work with your own genomes

This is a short reminder for people who already have their own protein files and want to start with the annotations right away:

  • Ideally have not only the protein ID but also the genome ID in the sequence header
  • Make sure the file header is concise, does not have ANY spaces, and ideally uses a ‘-’ (or any other unique delimiter) to separate the genome ID from the protein ID
  • If you have a concise header + the bin ID in the header, it is easy to concatenate the protein sequences of your genomes into one single file and still easily know from what genome each sequence originally came

More specifically you want to (a minimal sketch follows after this list):

  1. for the fna files:
  • generate a folder named fna/renamed
  • add the individual genomes in there
  2. for the faa files:
  • generate the file in the folder faa/Prokka
  • concatenate the proteins from all your genomes of interest and name the file All_Genomes_clean.faa
  • continue with the steps below
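If your protein files are already in this shape, a minimal sketch of the preparation steps could look like this (your_proteins/ is a placeholder for wherever your per-genome faa files live):

#make the folder structure used in this tutorial
mkdir -p fna/renamed faa/Prokka

#copy your per-genome protein files in (placeholder path)
cp your_proteins/*.faa faa/Prokka/

#concatenate everything into the file name expected by the rest of the workflow
cat faa/Prokka/*.faa > faa/Prokka/All_Genomes_clean.faa

#sanity check: headers should have no spaces and look like >genomeID-proteinID
grep "^>" faa/Prokka/All_Genomes_clean.faa | head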

6 Make a mapping file with all relevant genome info

Now that our files are ready we first want to make a table with the basic stats (i.e. contig length, contig gc, protein length, …).

6.1 Get contig stats

Prokka has the habit of renaming the fasta headers. Therefore, it is not straightforward to say from what contig a protein originally came. For example, a header from the contigs (i.e. the fna file) might look like this: >GCA_000008085-AE017199.1, while a protein on that contig might be named like this in the faa file: GCA_000008085-CALJACFG_00010.

The workflow below generates a mapping file that links the original contig name with the prokka contig name. Beware, the workflow is a bit convoluted: the original contig ID AE017199 is stored nowhere in the prokka output files, so we need to improvise a bit to recover this information.

The goal is to generate a file with this information:

  • old contig name, binID, contig GC, contig length

Notice

  • This part depends a lot on the files having short naming schemes and no extra symbols. So make sure that the final file has the 4 columns we would expect.
Note

Comment if working with your own genomes:

If working with your own genomes, this will only apply to you if you called proteins with prokka. Additionally, this workflow depends on a consistent naming of the sequence headers. Specifically, this step will ONLY work if your fasta header includes the binID and the contig ID separated with a -. If you do not have both pieces of information, have extra symbols, or if that symbol appears somewhere else in the filename/header, this part of the workflow will very likely not work (without adjustments).

If your file headers look different, it will be easiest to skip this step and do the following:
(A) generate the file FileLists/1_Bins_to_protein_list.txt in the section Prepare list that links proteins with binIDs
(B) continue with the step Do the annotations.
(C) During the first annotation step with the COG database, ignore the step #combine with previous data frame and do cp NCBI_COGs/A_NCBI_COGs.tsv merge/temp_A.tsv to generate the first table.

#make folder
mkdir contig_mapping

#calculate the length and gc for each contig
for i in `cat FileLists/GenomeList`; do  python $script_folder/fasta_length_gc.py fna/renamed/${i}.fna; done > contig_mapping/temp1

#make a separate column for the genome ID and proteinID and give a numbering to the contigs
awk 'BEGIN{OFS="\t"}{split($1,a,"-"); print a[2], a[1], ++count[a[1]], $2, $3}' contig_mapping/temp1 > contig_mapping/Contig_Old_mapping.txt

#check file: the header should have --> contigID, binID, contig NR, GC, contig length
head contig_mapping/Contig_Old_mapping.txt

#cleanup temp files
rm contig_mapping/*temp*

#create file for merging with prokka contigIDs. Here, we add the genome ID with the contig number together to recreate the prokka contig name
awk 'BEGIN{OFS="\t"}{print $2"_contig_"$3,$0}' contig_mapping/Contig_Old_mapping.txt > contig_mapping/Contig_Old_mapping_for_merging.txt

#check file: prokka contig name, prokka genome ID, genome ID, contig NR, GC, contig length
head contig_mapping/Contig_Old_mapping_for_merging.txt

In the file we see the contig name, contig GC and contig length. Since the two genomes are complete genomes we only work with two contigs in total. If you work with MAGs this table will get much longer.

6.2 Prepare list that links proteins with binIDs

Now, we want to generate a file that lists all proteins we are working with and a second column with the bin ID. So we want to generate a file with two columns: proteinID (i.e. accession) and genome ID.

#get list of protein accession numbers
grep "^>" faa/Prokka/All_Genomes_clean.faa  > temp

#remove the prokka annotations
cut -f1 -d " " temp > temp2

#remove the ``>``
sed 's/>//g' temp2 > FileLists/AllProteins_list.txt

#check file; we want one column with the proteinIDs
head FileLists/AllProteins_list.txt 

#remove temp files
rm -f temp*

#Modify protein list to add in a column with genome ID
awk -F'\t' -v OFS='\t' '{split($1,a,"-"); print $1, a[1]}' FileLists/AllProteins_list.txt | LC_ALL=C sort > FileLists/1_Bins_to_protein_list.txt

#check file content: we now have the accession and the binID
head FileLists/1_Bins_to_protein_list.txt

#check with how many proteins we work: 2069
wc -l FileLists/1_Bins_to_protein_list.txt

We now have a small table that lists all our 2,069 proteins and the respective genome each protein belongs to. The first few lines of the file should look something like this:

Note

Throughout the workflow (or whenever generating your own pipeline), check that the number of proteins stays consistent; this is a useful sanity check to ensure that the code does what you expect it to (a small sketch for such a check follows below this note).

Additionally, while running things for the first time it is always good to check the file content that was generated with nano or less to get a better grasp on:

  • What is happening at specific points in the script
  • Find potential bugs
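One hedged way to do such a consistency check later on, once the per-database result tables exist, is to compare their line counts in one go (the file pattern matches the tables generated in the annotation steps below):

#count the rows of every per-database result table; all counts should be identical
for f in */[A-Z]_*tsv; do echo -e "$(wc -l < $f)\t$f"; done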

6.3 Prepare a file with prokka annotations & Lengths & other genome info

Now, we want to generate a file that links the contig info and all the protein info.

#compress gbk files (generated by prokka)
gzip faa/Prokka/*/*gbk

#make list for looping
for f in faa/Prokka/*/*gbk.gz; do echo $f >> FileLists/list_of_gzips; done

#control that the nr of files we have is correct: 2
wc -l FileLists/list_of_gzips

#create a summary file based on the gbk file
#lists the genome ID, contig name, contig length, proteinID, prokka annotation
python $script_folder/Parse_prokka_for_MAGs_from_gbk-file_v2.py -i FileLists/list_of_gzips -t FileLists/Contig_Protein_list.txt

#check output of the script: Prokka binID, Prokka contig ID, contig length, gene start, gene stop, strand, accession, prokka annotation
head FileLists/Contig_Protein_list.txt

The output should look something like this

Next, we want to have a list that assigns the new ID prokka gives to a genome to the old genome name we got from NCBI:

#make prokka-to-genome-ID mapping file
awk 'BEGIN{OFS="\t"}{split($1,a,"-"); print a[2],$2}' FileLists/1_Bins_to_protein_list.txt | sed 's/_[0-9]*\t/\t/g' | sort | uniq > FileLists/Prokka_to_BinID.txt

#check file: Prokka bin ID, old bin ID
head FileLists/Prokka_to_BinID.txt

Finally, we can merge everything together.

#link old to new genome ID
awk 'BEGIN{FS="\t";OFS="\t"}FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' FileLists/Prokka_to_BinID.txt FileLists/Contig_Protein_list.txt | awk 'BEGIN{FS="\t";OFS="\t"}{print $1,$10,$2,$2,$3,$4,$5,$6,$7,$8}'  > temp_Bin_Contig_Protein_list.txt

#add in extra column for contig nr
awk -F'\t' -v OFS='\t' 'gsub("[A-Za-z0-9]*_","",$4)' temp_Bin_Contig_Protein_list.txt | awk -F'\t' -v OFS='\t' '{print $2"_contig_"$4,$1,$2,$3,$4,$5,$6,$7,$8,$9,$10}' > FileLists/Bin_Contig_Protein_list.txt

#check file
head FileLists/Bin_Contig_Protein_list.txt

#merge with old contig IDs
#headers: accession, genome ID, new ContigID, old ContigID, merged ContigID,new Contig Length, old Contig Length, GC,ProteinID, protein start, protein end, protein strand, prokka annotation
awk 'BEGIN{FS="\t";OFS="\t"}FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' contig_mapping/Contig_Old_mapping_for_merging.txt FileLists/Bin_Contig_Protein_list.txt | awk 'BEGIN{FS="\t";OFS="\t"}{print $3"-"$7,$3,$4,$13,$1,$6,$17,$16,$7,$8,$9,$10,$11 }' >  FileLists/Bin_Contig_Protein_list_merged.txt

#check file and number of proteins in file
head FileLists/Bin_Contig_Protein_list_merged.txt
wc -l FileLists/Bin_Contig_Protein_list_merged.txt

#prepare a file with protein length and gc to add protein length into our file
python $script_folder/fasta_length_gc.py faa/Prokka/All_Genomes_clean.faa > temp

#merge with contig/protein info
awk 'BEGIN{FS="\t";OFS="\t"}FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' temp FileLists/Bin_Contig_Protein_list_merged.txt | awk 'BEGIN{FS="\t";OFS="\t"}{print $1,$2,$3,$4,$6,$7,$8,$9,$10,$11,$12,$16,$13}' > temp2

#add a header
echo -e "accession\tBinID\\tNewContigID\tOldContigId\tContigNewLength\tContigOldLength\tGC\tProteinID\tProteinStart\tProteinEnd\tStrand\tProteinLength\tProkka" | cat - temp2 > FileLists/B_GenomeInfo.txt

#check file
less -S FileLists/B_GenomeInfo.txt

#control that all is ok and has the expected number of rows: 2070
#notice, we have one row extra since we count the header
wc -l FileLists/B_GenomeInfo.txt

#remove temp files
rm -f temp*

The file stored in FileLists/B_GenomeInfo.txt should look something like this:

7 Do the annotations

All searches that follow below have a similar workflow:

  • Run the search (with hmmer, diamond, blast, etc.)
  • For each protein, make sure we have just one unique hit (usually we select the best hit for each protein using the best e-value or bitscore). I.e. a single protein might be similar to several dehydrogenases and we only keep the best hit.
  • Clean up the output from the search.
  • Merge the output with our protein list (that way we make sure we always have the same number of rows in all the files we generate).
  • add gene descriptions to our table, i.e. when we do the search against the COGs we only have the COG IDs, but we want to add a gene description here.
  • add an informative header.

For all the searches, we keep the original input table and the final output table and delete the other intermediate files.

Once we have run our searches against several different databases, we merge all the results in a step-wise manner (using the protein IDs) and then in the final step clean the final data frame by removing redundant information.

We will add basic information on the different databases and search algorithms in the last section of this tutorial, so if you want more details check there.

Note

Depending on the size of your dataset (and the database) these searches might take a while to run.

Therefore, when you run this on your own genomes adjust the number of CPUs accordingly and depending on what system you run this on consider submitting it as a batch job or inside a screen in case your server connection gets disrupted.

7.3 KO search using HMMs

The KO (KEGG Orthology) database is a database of molecular functions represented in terms of functional orthologs and was generated using archaeal, bacterial and eukaryotic genomes.

Possible settings to change:

  • number of CPUs
  • e-value threshold
#setup directories
mkdir KO_hmm

#run KO search 
hmmsearch --tblout KO_hmm/sequence_results.txt -o KO_hmm/results_all.txt --domtblout KO_hmm/domain_results.txt --notextw --cpu $cpus $ko_db faa/Prokka/All_Genomes_clean.faa

#format the full table and only select sequences above a certain e-value
sed 's/ \+ /\t/g' KO_hmm/sequence_results.txt | sed '/^#/d'| sed 's/ /\t/g'| awk -F'\t' -v OFS='\t' '{print $1, $3, $6, $5}' | awk -F'\t' -v OFS='\t' '($4 + 0) <= 1E-3'  > KO_hmm/sequence_results_red_e_cutoff.txt

#get best hit based on bit score, and then e-value
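#the first sort ranks the hits by bit score (column 3, descending) and e-value (column 4, ascending);
#the stable unique sort then keeps only the first, i.e. best, line per accession (column 1)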
sort -t$'\t' -k3,3gr -k4,4g KO_hmm/sequence_results_red_e_cutoff.txt | sort -t$'\t' --stable -u -k1,1  | sort -t$'\t' -k3,3gr -k4,4g >  KO_hmm/All_KO_hmm.txt

#merge with protein list
LC_ALL=C join -a1 -t $'\t' -j1 -e'-' -o 0,2.2,2.3,2.4 <(LC_ALL=C sort FileLists/1_Bins_to_protein_list.txt) <(LC_ALL=C sort KO_hmm/All_KO_hmm.txt) | sort -u -k1 > KO_hmm/temp

#merge with KO_hmm names
LC_ALL=C join -a1 -1 2 -2 1 -e'-' -t $'\t' -o 1.1,1.2,1.4,1.3,2.2,2.12  <(LC_ALL=C sort -k2 KO_hmm/temp) <(LC_ALL=C sort -k1 $ko_mapping) | LC_ALL=C  sort > KO_hmm/temp3

#add in an extra column that lists whether hits have a high confidence score
awk  -v OFS='\t' '{ if ($4 > $5){ $7="high_score" }else{ $7="-" } print } ' KO_hmm/temp3 > KO_hmm/temp4

#add header
echo -e "accession\tKO_hmm\tKO_e_value\tKO_bit_score\tKO_bit_score_cutoff\tKO_Definition\tKO_confidence" | cat - KO_hmm/temp4 > KO_hmm/C_KO_hmm.tsv

#control lines and content 
wc -l KO_hmm/C_KO_hmm.tsv
head KO_hmm/C_KO_hmm.tsv

#clean up
rm -f KO_hmm/temp*

7.9 IPRscan

InterPro provides a functional analysis of proteins by classifying them into families and predicting domains and important sites, such as transmembrane motifs (TMs) or signal peptides.

Note

Both IPRscan and the Diamond search against NCBI_nr will take quite some time and are best run on the HPC (i.e. laplace). Therefore, we won't run the following two analyses during the workshop, but feel free to try this on your own.

An example batch script can be found in the script folder, the name is interproscan.sh.

7.9.1 Genome cleanup

IPRscan does not like to deal with * in the protein files, which is used for stop codons. If we work with programs that don’t recognize *, we should remove this symbol first.

#interproscan does not like if we have stars in our protein files, so we need to remove them
cd faa/Prokka
sed 's/*//g' All_Genomes_clean.faa >  All_Genomes_clean_interproscan.faa 
cd ../..

Before running IPRscan, we can split our big concatenated file into several smaller files. Then we can use GNU parallel to speed things up.

We set the number of processes to run with -j. Always make sure that you adjust this depending on the available resources.

7.9.2 run IPRscan

Possible settings to change:

  • Number of files we split the concatenated fasta file into via Split_Multifasta.py (default of -n = each split file will contain 1000 sequences)
  • number of parallel jobs to run via parallel (default setting for -j= 4)
  • e-value threshold (default setting for -evalue = 1e-20)
mkdir IPRscan

#split files --> 2069 proteins were split into 3 files
python $script_folder/Split_Multifasta.py -m faa/Prokka/All_Genomes_clean_interproscan.faa -n 1000

#move files into a better location
mkdir split_faa
mv File*.faa split_faa

#cp the batch script for running this job on an hpc
#specifically adjust:
#interproscan only works on no83-no89, so choose any of those
##SBATCH --ntasks-per-node=30: Adjust this to the total nr of CPUs you need
#-j3 : Adjust this according to how many jobs you want to run in parallel
cp $script_folder/interproscan_run.sh .

#start job
sbatch interproscan_run.sh

#content of the script
'''
parallel -j3 'i={}; interproscan.sh -i $i -d IPRscan/ -appl TIGRFAM,SFLD,SUPERFAMILY,Gene3D,Hamap,Coils,ProSiteProfiles,SMART,CDD,PRINTS,ProSitePatterns,Pfam,ProDom,MobiDBLite,PIRSF,TMHMM -T IPRscan/temp --iprlookup --goterms -cpu 5' ::: split_faa/File*.faa
'''

#Concat result files
cd IPRscan
mkdir single_files
mv File* single_files
mkdir Concat_results/

cat single_files/File*.faa.xml > Concat_results/All_bins_iprscan-results.xml
cat single_files/File*.faa.gff3 > Concat_results/All_bins_bins_iprscan-results.gff3
cat single_files/File*.faa.tsv > Concat_results/All_bins_bins_iprscan-results.tsv

#cleanup
rm -f single_files/File*

cd ..

#clean up fasta header so that it is exactly the same as the accession ID in the interproscan results
python $script_folder/parse_IPRdomains_vs2_GO_2_ts_sigP.py -s faa/Prokka/All_Genomes_clean.faa -i IPRscan/Concat_results/All_bins_bins_iprscan-results.tsv -o IPRscan/All_bins_bins_iprscan-results_parsed.tsv

#remove issue with spaces
sed 's/ /_/g' IPRscan/All_bins_bins_iprscan-results_parsed.tsv | LC_ALL=C  sort > IPRscan/temp.txt

#print only columns of interest
awk -F'\t' -v OFS="\t"  '{print $1, $2,$3,$4, $5, $8, $9}' IPRscan/temp.txt > IPRscan/temp2

#add header
echo -e "accession\tIPR\tIPRdescription\tIPR_PFAM\tIPR_PFAMdescription\tIPR_TMHMM\tIPR_SignalP" | cat - IPRscan/temp2 > IPRscan/I_IPR.tsv

#control lines: 2070
wc -l IPRscan/I_IPR.tsv

#view file content 
less -S IPRscan/I_IPR.tsv

#clean up
rm -f IPRscan/temp*

7.10 Diamond search against NCBI NR

The NCBI nr database is a protein sequence database maintained by the National Center for Biotechnology Information (NCBI). nr stands for non-redundant and means that it contains unique protein sequences that have been compiled from various sources, including GenBank, RefSeq, PDB, and Swiss-Prot.

Possible settings to change:

  • number of CPUs (--threads)
  • e-value threshold (--evalue)
  • sequence numbers to display (--seq)

Notice:

  • This step can take quite some time.
  • the NCBI nr database is constantly updated (and very large), which is why we are not providing the database with the other files in our repository. For information on how to download ncbi_nr check out the information here.
#make folder
mkdir NCBI_NR

#run diamond against NR database
diamond blastp -q faa/Prokka/All_Genomes_clean.faa --more-sensitive --evalue 1e-3 --threads $cpus --seq 50 --db $ncbi_nr_db --taxonmap $taxon_db --outfmt 6 qseqid qtitle qlen sseqid salltitles slen qstart qend sstart send evalue bitscore length pident staxids -o NCBI_NR/All_NCBInr.tsv

#Select columns of interest in diamond output file
awk -F'\t' -v OFS="\t" '{ print $1, $5, $6, $11, $12, $14, $15 }'  NCBI_NR/All_NCBInr.tsv | sed 's/ /_/g' > NCBI_NR/temp

#Parse Diamond Results
python $script_folder/parse_diamond_blast_results_id_taxid.py -i FileLists/AllProteins_list.txt -d NCBI_NR/temp -o NCBI_NR/temp2

#rm header
sed 1d NCBI_NR/temp2 > NCBI_NR/temp3

#add an '-' into empty columns
awk -F"\t" '{for(i=2;i<=NF;i+=2)gsub(/[[:blank:]]/,"_",$i)}1' OFS="\t" NCBI_NR/temp3 > NCBI_NR/temp4

#the python script above sometimes leaves an empty 7th column, this gets rid of that issue
awk -F'\t' -v OFS="\t"  '{if (!$7) {print $1,$2, $4 , $6, "-"} else {print $1, $2, $4, $6, $7}}' NCBI_NR/temp4 | LC_ALL=C sort > NCBI_NR/temp5

#split columns with two tax ids
awk -F'\t' -v OFS='\t' '{split($5,a,";"); print $1, $2, $3, $4, a[1]}' NCBI_NR/temp5 > NCBI_NR/temp6

#in column 2 remove everything after < (otherwise the name can get too long)
awk -F'\t' -v OFS='\t' '{split($2,a,"<"); print $1, a[1], $3, $4, $5}' NCBI_NR/temp6 > NCBI_NR/temp6a

#merge with tax names
LC_ALL=C join -a1 -1 5 -2 1 -e'-' -t $'\t'  -o1.1,1.2,1.3,1.4,1.5,2.2  <(LC_ALL=C sort -k5  NCBI_NR/temp6a) <(LC_ALL=C sort -k1 $ncbi_tax_mapping) | LC_ALL=C  sort > NCBI_NR/temp7

#add in header
echo -e "accession\tTopHit\tE_value\tPecID\tTaxID\tTaxString" | cat - NCBI_NR/temp7 > NCBI_NR/J_Diamond.tsv

#control lines
wc -l NCBI_NR/J_Diamond.tsv

#view file content
head NCBI_NR/J_Diamond.tsv

#clean up
rm -f NCBI_NR/temp*

8 Combine the individual dataframes

Now, let's clean up the final data frame, and then we've made it!

#activate conda environment to run py3
source ~/.bashrc.conda3

#combine tables
python3 $script_folder/merge_df.py */[A-Z]_*tsv accession Annotations.txt

#control that the number of rows is correct: 2070
wc -l Annotations.txt

#view content of file
#column allows to display the rows a bit cleaner allowing for easier reading
column -t -s $'\t' Annotations.txt | less -S

#close the conda environment
conda deactivate

Question/Homework --> Let's figure out what Ignicoccus is doing

  • Download the table Annotations.txt to your personal computer using scp.
  • Compare the different annotations, have a look at how they differ.
  • Make one separate table for each of the two genomes
  • For Ignicoccus (GCA_000017945), copy the accession numbers and KO IDs into a new file and upload or copy-paste them to the KEGG mapper. Use this tool to get some first insights into the metabolic capacities of that organism (a small sketch for these steps follows after this list).
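A hedged sketch for these steps (the scp paths are placeholders, and the column that holds the KO assignment should be double-checked against the header of your own Annotations.txt):

#on your own computer: copy the table from the server (adjust user, server and path)
scp user@server:/path/to/Annotations/Annotations.txt .

#find out which column of the table holds the KO assignment
head -1 Annotations.txt | tr '\t' '\n' | cat -n | grep "KO_hmm"

#extract the accession and the KO column (here assumed to be column 17) for Ignicoccus only
grep "^GCA_000017945" Annotations.txt | awk -F'\t' -v OFS='\t' '{print $1, $17}' > Igni_KOs.txt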

Based on this try to answer:

  • Does Igni use glycolysis or gluconeogenesis (or both)?
    This question allows us to figure out whether our organism might use carbon sources like sugars (via glycolysis; if so, we might look further into whether we have a fermenting organism) or uses acetyl-CoA to make its own sugars, which could then be used in other pathways (gluconeogenesis).

  • Is Igni likely an anaerobic or aerobic organism?
    This question can help us to decide what growth conditions we need and whether O2 is required for our organism to generate energy.

  • Is Igni an autotroph or heterotroph?
    Again, this question is important if we want to know how to grow our organism: do we need to feed it sugars so that it grows, or can it grow on CO2? If it is involved in CO2 and/or methane cycling it might also have an impact on global carbon cycles.

  • Can Igni reduce sulfur?
    If Igni grows on sulfur, we could add sulfur to a potential growth medium. Plus, on the environmental side of things, we want to look into lithotrophy and how organisms in our environment are involved in nitrogen, sulfur, metal or hydrogen cycling.

  • If Igni reduces sulfur, what does it oxidize?
    Same as above.

The Input for the website should look something like this:

Click me to see an answer

Answer for Does Igni use glycolysis or gluconeogenesis?

Ignicoccus has a gluconeogenic pathway. The key enzyme to distinguish glycolysis from gluconeogenesis is EC:3.1.3.11, which allows this organism to go from acetate to glucose. To figure this out, besides a literature search, there are two routes to take:
Option A:

Option B:

Answer for Is Igni likely an anaerobic or aerobic organism?

There are several things to check, but oxidative phosphorylation is a good start. We can see that complex IV, the terminal oxidase of the respiratory chain that transfers electrons from cyt c to O2, is absent. So this organism is likely an anaerobe.

Answer for Is Igni an autotroph or heterotroph?

Igni is an autotroph that encodes the dicarboxylate-hydroxybutyrate cycle (the big red circle highlights the genes of the pathway).

This one is a bit more tricky, since the malate dehydrogenase, 1.1.1.37, is absent if we look at the map (small red box). However, if we dig into the annotation tables, we can find that the blast hit to WP_012123403.1 (likely gene ID 13020) corresponds to a malate dehydrogenase, and we can see that most databases view the gene as a malate/lactate dehydrogenase (i.e. arCOG00246). Malate dehydrogenases are very easy to mix up with lactate dehydrogenases, and that is what happened here with the KO annotations.

As a general tip: if only a few genes of a pathway are lacking --> have a second look into the table and search the other annotations. If we have genes where it's hard to distinguish the substrate specificity, sometimes a tree might help, or looking at the catalytic site of a protein, depending on how well studied our protein of interest is:

Answer for Can Igni reduce sulfur?

Yes, Igni can reduce sulfur using K18367/arCOG01069, which encodes CoA-dependent NAD(P)H_sulfur_oxidoreductase [EC:1.8.1.18].

This one is a bit tricky, and you would not find this hit by looking at the pathway maps alone.

To find the issue with the maps: go to the metabolic pathway overview, click on show matched objects and search for K18367. You will see that we do not find a hit here, but if we search for this KO in our table we can find it, so what is the issue?

If we search for K18367 across the KEGG database, we find some info here. We first see that this enzyme is not linked to a pathway (a limitation of the pathway map tool) and also the explanation that this enzyme is involved in NAD(P)H-linked sulfur reduction. We can support this by noting that (a) this gene is similar to a characterized gene in Pyrococcus and (b) we also find sulfite transporters (i.e. arCOG02050/WP_011998105.1), which might be involved in getting the produced H2S outside of the cell.

Answer for If Igni reduces sulfur, what does it oxidize?

Igni gains energy by reduction of elemental sulfur to hydrogen sulfide using molecular hydrogen as the electron donor.

Again, this is not something the KEGG mapper will tell us. To figure this out we can again use different approaches: (a) search the literature and find out what related organisms, i.e. Pyrococcus, do and see whether our organism has a similar gene set, (b) search our annotation table for potential electron donors that might make sense and are also used by other S reducers.

The second approach (and some background reading in most cases) will lead us to K06281 and K06282. These hit a [NiFe]_Group_1g hydrogenase, and we can check here what exactly this hydrogenase typically does. We need to check this because, for hydrogenases, it is sometimes tricky to say whether they are involved in energy reactions. If we check this specific example, we find that it is involved in hydrogenotrophic respiration using sulfur as terminal electron acceptor, so perfect!

A word of warning: the HydDB for hydrogenases will never be perfect. So also here, for more uncertain hits it might make sense to build a tree of hydrogenases to see if we really have a group 1g hydrogenase.

9 What next?

Depending on the number of genomes, the final table will get rather large. Therefore, it makes a lot of sense to condense the information. The following ideas will not be part of this tutorial but will give you some pointers on how to do that:

9.1 R magic

We provide an R tutorial here. This tutorial starts with the basics of R but ends with an example of how to deal with the annotation tables generated here. More specifically, it will show you how to:

  • Make count tables for each genome, i.e. for MAG1 count how often we find arCOG0001. This calculation is done for each genome that is part of your analysis and thus provides you with a more condensed table
  • Condense the data on taxonomic or custom levels. I.e. the example in the R tutorial is for ~30 genomes of the same phylum. However, the genomes come from different environments. Using the tutorial you will be able to ask: how many of the aquifer genomes have arCOG0001? How many of the marine genomes have this gene?

9.2 Python magic

This section will be extended once the author of this tutorial has time. However, once you have to deal with 5000 genomes, Python becomes much more attractive for the parsing, especially compared to RStudio.

10 Background info on the used tools and databases

Here, you will find some background info on the used approaches, databases and tools. The idea is to provide links to the manuals and some other information that might (or might not) be useful.

Note, this is an on-going part of this document, so it might change over time.

10.1 Gene predictions

There are several tools to identify genes in our genomes

  • Prodigal: a prokaryotic gene recognition and translation initiation site identification tool (Hyatt et al., 2010)
  • Prokka: a rapid prokaryotic genome annotation tool (Seemann, 2014). This tool is based on prodigal but adds some basic genome annotations.

Something important to keep in mind is that these gene prediction tools are generally used for genomes from archaea and bacteria. If you are interested in eukaryotic genomes, things like splicing and alternative reading frames make gene prediction more tricky, and there are special tools out there that are better suited for eukaryotes.

10.2 Information on the different search tools

10.2.1 BLAST

BLAST (Basic Local Alignment Search Tool) finds regions of local similarity between protein or nucleotide sequences (Altschul 1990).

NCBI provides a quick tutorial explaining its general usage, and you can find some more details on the algorithm and the usage of the website here.

There are different versions of blast that you can use depending on what comparisons you want to make.

  • blastn = nucleotide blast searches with a nucleotide “query” against a database of nucleotide “subject” sequences.
  • blastp = protein blast searches with a protein “query” against a database of protein “subject” sequences.
  • translated blast searches use a genetic code to translate either the “query,” database “subjects,” or both, into protein sequences, which are then aligned as in “blastp.”
    • tblastn = a protein sequence query is compared to the six-frame translations of the sequences in a nucleotide database.
    • blastx = a nucleotide sequence query is translated in six reading frames, and the resulting six-protein sequences are compared, in turn, to those in a protein sequence database.
    • tblastx = both the “query” and database “subject” nucleotide sequences are translated in six reading frames, after which 36 (6 × 6) protein “blastp” comparisons are made.

Additionally, we can also work with alignments using psi-blast = protein sequence profile search method that builds off the alignments generated by a run of the BLASTp program. Some more info into this algorithm can be found [here](https://www.ncbi.nlm.nih.gov/books/NBK2590/#:~:text=PSI-BLAST%20(Position-Specific,threshold%20using%20protein–protein%20BLAST.)
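To connect this back to the tutorial, here is a hedged example of a blastp search against the database we built earlier from the concatenated proteins (query.faa is a placeholder for any protein file you want to search with):

#protein vs protein search against our concatenated genomes, with tabular output
blastp -query query.faa -db faa/All_Genomes_clean -outfmt 6 -evalue 1e-5 -num_threads $cpus -out query_vs_genomes.blastp.tsv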

10.2.2 Diamond

Diamond is an alternative to BLAST for protein and translated DNA searches (Buchfink et al., 2015). If you check the diamond website, you will find that some advantages over blast are listed:

  • Pairwise alignment of proteins and translated DNA at 100x-20,000x speed of BLAST.
  • Frameshift alignments for long read analysis.
  • Low resource requirements and suitable for running on standard desktops or laptops.
  • Various output formats, including BLAST pairwise, tabular and XML, as well as taxonomic classification.

In principle, diamond works very similarly to blast and uses the same input files.
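For comparison, a roughly equivalent diamond search could look like this (again a sketch; query.faa is a placeholder and we first build a diamond database from the same concatenated proteins):

#build a diamond database from the concatenated proteins
diamond makedb --in faa/Prokka/All_Genomes_clean.faa --db faa/All_Genomes_clean_diamond

#run the search; the tabular output (--outfmt 6) matches the blastp format
diamond blastp -q query.faa -d faa/All_Genomes_clean_diamond -e 1e-5 -p $cpus --outfmt 6 -o query_vs_genomes.diamond.tsv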

10.2.3 HMMER

We also use HMMER for searching databases and for generating alignments (although for alignments we generally prefer mafft-linsi ourselves).

The approach taken by HMMER is quite different from blast and diamond, as HMMER implements methods based on probabilistic models called profile hidden Markov models (profile HMMs).

If you look at some of the HMM databases used in this tutorial, i.e. with less arcog.hmm, you will see that the files look very different compared to what we use as input for blast and diamond. One of the biggest differences is that the profiles used in HMMER searches are built from multiple sequence alignments.

For more information you can check the HMMER website or the hmmer paper
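If you want to get a feel for what such a profile database looks like, here is a small sketch for peeking into the Pfam HMMs defined at the start of this tutorial (assuming the $pfam_db variable is still set):

#each profile starts with header lines such as NAME, ACC and LENG; show the first few
grep -m 20 -E "^(NAME|ACC|LENG)" $pfam_db

#print summary statistics (model name, number of sequences, model length, ...) for the profiles
hmmstat $pfam_db | head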

10.3 Information on the different databases

10.4 COGs

COG = Clusters of Orthologous Groups.

The COG protein database was generated by comparing predicted and known proteins in all completely sequenced microbial genomes to infer sets of orthologs. Each COG consists of a group of proteins found to be orthologous across at least three lineages.

The different COGs are characterized into different pathways and the pathway description can be found here

Some papers you might want to check if you want to know more:

10.5 arCOGs

arCOG = Archaeal Clusters of Orthologous Genes. arCOGs are generated in a similar manner as the COGs, just with a focus on archaeal genomes.

Some papers you might want to check if you want to know more:

10.6 KOs and KEGG

KO = KEGG ORTHOLOGY

The KO (KEGG Orthology) database is a database of molecular functions represented in terms of functional orthologs. We assign K numbers to our proteins by running HMMER (hmmsearch) against the KOfam HMM database.

Details on how the HMMs and cutoffs were generated can be found here

Something useful to know is that all KOs, i.e. K00161, are linked to metabolic maps that have quite good descriptions. For example, if you google K00161 you will come across this website, which gives:

  • a detailed description of what gene we have
  • if applicable the enzyme number (EC number)
  • The pathways this gene can be involved in. If you click on the pathway you see its exact position in a pathway
  • A paper that first describes the gene

If you click on the enzyme number (EC 1.2.4.1) you will get other useful information

  • alternative names
  • the “chemical” class it belongs to
  • the chemical reaction
  • useful comments about the reaction and whether it is part of an enzyme complex
  • if it is an enzyme complex, we get the KOs for the other subunits

10.7 Biocyc: Alternative to KEGG to see how a gene is linked to a pathway

If you want to find out more about a gene, you can also check metacyc. I.e. if we search for "pyruvate dehydrogenase" biocyc in Google, we should land here

Similar as with KEGG, we get some basic information about this enzyme, such as:

  • the different subunits
  • a basic description of what that enzyme does
  • the reaction, incl. the direction
  • Functions of the other subunits
  • some paper references

10.8 PFAM

PFAM is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).

10.9 TIGRFAMs

The TIGRFAMs database is a resource consisting of curated multiple sequence alignments, Hidden Markov Models (HMMs) for protein sequence classification, and associated information designed to support automated annotation of (mostly prokaryotic) proteins.

10.10 CAZy

CAZy = Carbohydrate-Active enZYmes

A database dedicated to the display and analysis of genomic, structural and biochemical information on Carbohydrate-Active Enzymes (CAZymes).

Resources:

  • Here we have a database with more information on indiv. CAZymes and potential substrates
  • dbCAN is a webserver for CAZYme annotation. We also get the sequences we use for our workflow from this website.
  • The paper describing the CAZY database

10.11 Transporter DB

This is the website that gives more information about potential transporters found in our search.

10.12 HydDB

HydDB = Hydrogenase database

Resources:

  • The original paper
  • The website we originally got the sequences from. Also provides more detailed information on the different hydrogenase subcategories.

10.13 IPRscan

Short for InterProScan

Interpro allows for the classification of protein families and predicting domains and important sites.

If we see an ID like IPR000013, this is a so-called InterPro domain that might be found on your protein. Notice, a protein can have more than one domain.

Resources:

  • The website from which to download all information
  • The original paper, which provides also an overview of the different databases integrated with InterPro.

10.14 NCBI-nr

As we have seen above, NCBI (the National Center for Biotechnology Information) provides a lot of tools, such as blast, and databases, such as the COGs and arCOGs. Additionally, it is one of the largest repositories of sequences from metagenomes, amplicon libraries and genomes.

The nr database (here: ncbi_nr) is compiled by the NCBI as a protein database for Blast searches. It contains non-identical sequences from GenBank CDS translations, PDB, Swiss-Prot, PIR, and PRF and is frequently updated. However, it is quite a large database, which makes searches more time consuming.

10.15 Uniprot - checking your results

Another good database, next to KEGG and metacyc, to check proteins is Uniprot. The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.

We can search for:

  • protein names, i.e. “pyruvate dehydrogenase”
  • proteins from certain taxa using the taxonomy option, i.e. taxonomy:bacteria pyruvate dehydrogenase
  • KOs, arCOGs, etc …

I.e. if we would search for taxonomy:bacteria "pyruvate dehydrogenase e1" we should get:

  • A list of all proteins classified as E1 subunit of the pyruvate dehydrogenase in bacteria
  • some of these sequences are reviewed (i.e. might have some more supporting info) and others are unreviewed (annotation only based on sequence similarity)
  • a list of taxa that have this protein (under “View by”, click on “Taxonomy”), so we can see in exactly which bacteria this enzyme is found

If we click on an individual example, i.e. from E.coli, we get:

  • Details on the reaction
  • Publications about this specific enzyme in E.coli
  • Co-factors
  • Information about potential functions and biological processes
  • a reference to MetaCyc
  • The links to several databases (i.e. protein databases, KEGG, IPR domains, etc.)