#set variable for the base directory and personal user directory
user_name="ndombrowski"
base_dir="/export/lv3/scratch/bioinformatics_workshop/Users/"
wdir=${base_dir}/${user_name}
#nr of cpus we generally want to use
cpus=2
#location of some custom python and perl scripts
export script_folder="/export/lv3/scratch/bioinformatics_workshop/S11b_annotations/scripts/"
#locations of the individual databases
cog_db="/export/data01/databases/cogs/hmms/NCBI_COGs_Oct2020.hmm"
arcog_db="/export/data01/databases/arCOG/arCOGs2019/All_Arcogs_2018.hmm"
ko_db="/export/data01/databases/ko/All_KO"
pfam_db="/export/data01/databases/pfam/Pfam-A.hmm"
tigr_db="/export/data01/databases/PGAP_ProtFam/hmm_PGAP.lib"
cazy_db="/export/data01/databases/cazy_spt/dbCAN-HMMdb-V7.txt"
transporter_db="/export/data01/databases/tcdb/tcdb"
hyd_db="/export/data01/databases/HydDB/HydDB_uniq"
ncbi_nr_db="/export/data01/databases/ncbi_nr/diamond/nr.dmnd"
#locations of the mapping files for the indiv. databases
#these files link the database ID, ie KO0001 to a more useful description
cog_mapping="/export/data01/databases/cogs/hmms/cog_mapping.txt"
ko_mapping="/export/data01/databases/ko/ko_list_for_mapping_clean"
pfam_mapping="/export/data01/databases/pfam/Pfam-A.clans.cleaned.tsv"
cazy_mapping="/export/data01/databases/cazy_spt/CAZY_mapping_2.txt"
hyd_mapping="/export/data01/databases/HydDB/HydDB_mapping"
taxon_db="/export/data01/databases/taxmapfiles/ncbi_nr/prot.accession2taxid.gz"
arcog_mapping="/export/data01/databases/arCOG/arCOGs2019/ar14.arCOGdef18.nina.tab"
transporter_mapping="/export/data01/databases/tcdb/tcdb_definition.tsv"
tigr_mapping="/export/data01/databases/tigr/hmm_PGAP_clean.tsv"
#mapping file for ncbi taxonomy
ncbi_tax_mapping="/export/lv3/scratch/bioinformatics_workshop/S12_annotations/databases/taxonomy5.txt"
#Fix PERL5LIB PATH (required for using prokka on the NIOZ servers)
export PERL5LIB=/export/lv1/public_perl/perl5/lib/perl5
Metagenomics: Annotation tutorial
1 Introduction
The goal of this tutorial is to teach students how to annotate proteins from genomes; it is part of the NIOZ bioinformatics workshop. However, if you are interested in the topic, feel free to check out the code on your own, as all required tools and databases are referenced.
Specifically, we will:
- download two genomes from NCBI
- annotate these genomes using several databases such as the COGs, arCOGS, KOs, PFAM, …
- combine the results into one comprehensive table
General comments
- A lot of steps in this workflow use for loops or GNU parallel to speed things up. While this might not make much sense for only two genomes, it becomes very useful once you work with thousands of files.
- Most annotation steps can make use of several CPUs. For this tutorial we only use a few CPUs, but this can be adjusted according to the resources you have available. More on how to check resources can be found in the Unix tutorial.
- This script uses a lot of AWK for parsing. If you want some more insight into how to use AWK, we also have an AWK tutorial available.
- Once you have the annotation table, we provide an R tutorial here that gives an example of how to summarize the data.
- The individual database searches might take a while to run (esp. the interproscan and the diamond search against the NCBI database), so if you run the full workflow, have patience.
2 General
Below you find a list of the programs used. If you want to run this pipeline outside of the bioinformatics workshop on your own servers, please install the necessary tools and download the custom scripts and databases from the Zenodo repository.
General notice for people interested in eukaryotic genomes:
This tutorial is designed specifically with prokaryotic genomes in mind but can be used for eukaryotic proteins as well. This is especially easy if you already have protein files for your genomes of interest. If you do not have such files, beware that the tool we use for protein calling, prokka, is specific to prokaryotic genomes. Tools for identifying eukaryotic proteins are, among others, Augustus for single genomes or MetaEuk for metagenomes.
Something to keep in mind with the databases used here is that they were often generated using only archaeal and/or bacterial genomes. Only a few of the databases used here, such as the KO database, also include eukaryotic genomes. A good alternative to check out are the KOGs, which are specifically designed for eukaryotes. The most recent profiles provided by EggNOG can be found here.
2.1 Version programs
The versions on your system might differ from the ones listed below; this should not be problematic, but for consistency we list the version numbers that were used to generate this workflow as well as links to install instructions.
- Python 2.7.5
- perl v5.16.3
- prokka 1.14-dev ; install instructions can be found here
- HMMER 3.1b2; install instructions can be found here
- blastp v2.7.1+; install instructions can be found here
- diamond 0.9.22; install instructions can be found here
- interproscan: InterProScan-5.29-68.0; install instructions can be found here
- GNU parallel 20181222; install instructions can be found here
2.2 Version databases
Databases (and mapping files) used in the current version can be downloaded from [here](https://zenodo.org/deposit/6489670).
- KOhmms: downloaded 90419 from here
- PGAP (former TIGRs): downloaded Nov 2021 from here
- Pfams: RELEASE 34.0, downloaded Nov 2021 from here
- CAZymes: dbCAN-HMMdb-V9 downloaded on 21 April 2020 from here
- Merops: downloaded from Merops server in November 2018 from here
- HydDB: downloaded form HydDB webserver in July 2020 from here
- COGs: downloaded from NCBI October 2020 from here
- TransporterDB: downloaded from TCDB April 2021 here
- ncbi_nr (maintained and downloaded by Alejandro Abdala Asbun). The last update was done in August 2021. For more details on downloading the data on your own check out this resource. Additionally, prot.accession2taxid.gz was downloaded from here. The nr database was then converted into a diamond database with blastdbcmd -entry all -db nr/nr -out all.nr.masqued.fasta -line_length 10000000 -ctrl_a followed by diamond makedb --in all.nr.masqued.fasta --db diamond/nr --taxonmap prot.accession2taxid.gz
- ArCOGs: downloaded in 2019 and available here (update date: 2018)
2.3 External scripts needed
Custom scripts used in the current version can be downloaded from [Zenodo](https://zenodo.org/deposit/6489670) and Github.
python
- fasta_length_gc.py
- Parse_prokka_for_MAGs_from_gbk-file.py
- Split_Multifasta.py
- parse_IPRdomains_vs2_GO_2_ts_sigP.py
- parse_diamond_blast_results_id_taxid.py
- merge_df.py
3 Setup working environment
3.1 Define variables
Notice: If you have done the assembly and binning tutorials that we provide as well, you will see that we do things a bit differently here. When annotating genomes, we use a lot of different databases, and there are two ways to deal with many files that might be found in different locations:
- we can add the path for each database into the code itself
- we can define variables that store the paths at the beginning of our workflow
Especially if you work with collaborators, it is often easier to use the second option, as you then only have to adjust the paths in one spot.
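As a minimal sketch of what the second option looks like in practice (the file names here are hypothetical; $pfam_db and $cpus are the variables defined at the top of this tutorial):
#instead of hard-coding /export/data01/databases/pfam/Pfam-A.hmm into every command,
#we reference the variable, so the path only ever needs to be changed in one spot
hmmsearch --cpu $cpus --tblout example_results.txt $pfam_db example_proteins.faa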
Instructions:
- Change the first variable, the wdir, to the path where you want to work from and change the paths for the databases according to where you downloaded the databases and mapping files.
- The two last lines of code are needed for the software prokka and are specific for the system the script was written on. Adjust according to how prokka is setup in your system.
For people following the tutorial outside of the NIOZ bioinformatics workshop on their own servers: the necessary databases and scripts can be downloaded from Zenodo.
3.2 Set our working directory
Let’s first go to our working directory and make a new folder for the Annotation tutorial.
#go to your working directory
cd $wdir
#make a folder for the Annotations and move into this directory
mkdir Annotations
cd Annotations
4 Prepare genomes of interest
In this workflow we will work with 2 genomes that we download from NCBI, but the steps can easily be adjusted for your own genomes. You could also work with the bins we have assembled and binned in the previous workflows; if you feel comfortable with the command line, feel free to use those instead.
First, we will download two archaeal genomes from NCBI and rename them to avoid issues with the workflow. The bins from the previous workflows (inside ../Binning/4_FinalBins/renamed) already have a good header and should be ready to use in case you want to repeat the workflow with those genomes.
First a word on the importance of good file names and sequence headers
After we have downloaded the 2 genomes from NCBI the file name and file header should look something like this:
We can see that both the file name and the file header are long and have a lot of symbols UNIX might have trouble with (brackets, spaces, dots…). Generally, things to watch out for when naming files are:
Have short but descriptive file names and file headers. I.e. if we look at the header of the NCBI genomes, you will see that the header is rather long and contains unnecessary information for our purposes (i.e. genomic). Therefore we want to make it shorter.
Avoid extra symbols (especially avoid using spaces, commas, semicolons, quotes or brackets) as these can easily result in unwanted behavior. I.e. a very simple naming scheme could include only two symbols:
- underscores (_) for everything where we normally would use a space
- the minus symbol (-) for key information that we might want to separate later (i.e. genome name and protein ID)
Another useful thing is to add information into the headers that allow us to concatenate multiple genomes and afterwards still know from what genome a sequence came from. Therefore, we recommend to:
- Add the genome name into the fasta header and use a unique delimiter to separate it from the contig id. That way you can concatenate your genomes and afterwards still know from what genome a contig came from
- have a consistent way to separate the genome and protein ID across different analyses. In phylogenies it is very common to concatenate several genes/proteins from individual genomes, and for this it is useful to remove the protein ID from the fasta header (see the small example below).
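As a small example of why a unique delimiter is convenient: a header such as GCA_000008085-CALJACFG_00010 (the format generated later in this tutorial) can be split back into genome ID and protein ID with standard Unix tools. A minimal sketch:
#extract the genome ID (everything before the first '-')
echo "GCA_000008085-CALJACFG_00010" | cut -d '-' -f1
#extract the protein ID (everything after the first '-')
echo "GCA_000008085-CALJACFG_00010" | cut -d '-' -f2-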
4.1 Download genomes from NCBI and cleanup
#make folders to store our genomes in
mkdir -p fna/renamed
#download genomes from NCBI
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/008/085/GCA_000008085.1_ASM808v1/GCA_000008085.1_ASM808v1_genomic.fna.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/017/945/GCA_000017945.1_ASM1794v1/GCA_000017945.1_ASM1794v1_genomic.fna.gz
#unzip files
gzip -d *gz
#confirm that we have as many files as we expect: 2
ls -l *.fna
#shorten the file name by cutting everything away after the dot
for f in *.fna; do mv $f "${f/.*}".fna; done
#check the file name and see if it changed
ls -l *.fna
#check how the header looks
#--> we see that we have a long name and a lot of symbols that we do not want, lets remove these
head $(ls -1 *fna | head -1)
#shorten the fasta headers by removing everything after the space
#we also move the file with the changed header into the fna folder
for f in *.fna; do cut -d " " -f1 $f > fna/${f}; done
head $(ls -1 fna/*fna | head -1)
#add the genome name into the header (this is useful if we want to concatenate a lot of genomes into one file)
cd fna
for i in *fna; do awk '/>/{sub(">","&"FILENAME"-");sub(/\.fna/,x)}1' $i > renamed/$i; done
head $(ls -1 renamed/*fna | head -1)
#remove intermediate files
rm -f *fna
cd ..
rm -f *fna
The cleaned-up files can be found in fna/renamed and the file names should look something like this afterwards:
4.1.1 The steps above work fine but what if we have hundreds of genomes that we want to download from NCBI?
Click me to look at the code
Notice:
- The file that links the taxonomy ID with the taxa string (lineages-2020-12-10.csv) was generated with ncbitax2lin (instructions added below)
- this part of the workflow was set up to run on servers at the NIOZ and the paths need to be adjusted to work on your system
The first part of this workflow is to get a list of all available archaeal and bacterial genomes from NCBI and add a taxonomic string into the file we downloaded.
#prepare some folders
mkdir ncbi_genomes
cd ncbi_genomes
#get a list of all avail. microbial genomes
wget https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt
This file includes a list of all genomes available at NCBI and relevant metadata such as the project ID, taxonomy ID, species name, download link for individual genome files etc.
Next, we will clean up the file a bit, and since the species name might not be informative enough (i.e. we might want to know to what phylum a species belongs), we add in the full taxonomy string.
#clean up the file by removing the first row and any `#` and move the tax id to the front
sed 1d assembly_summary_genbank.txt | sed 's/# //g' | awk -F"\t" -v OFS="\t" '{print $6, $0} ' | sed 's/ /_/g' > assembly_summary_edit.txt
#replace empty rows with a "none" and remove the header
awk 'BEGIN { FS = OFS = "\t" } { for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i="none"}; 1' assembly_summary_edit.txt | sed '1d' > temp1
#add in the tax string (to be able to better select genomes)
#the code for how to generate the csv needed here can be found below
#merge assembly info with tax string
LC_ALL=C join -a1 -j1 -e'-' -o 0,1.2,1.3,1.4,1.5,1.6,1.8,1.9,1.10,1.11,1.13,1.16,1.17,1.18,1.19,1.21,2.2 <(LC_ALL=C sort temp1) <(LC_ALL=C sort /export/lv1/user/spang_team/Databases/NCBI_taxonomy/Dec2020/lineages-2020-12-10.csv) -t $'\t' > temp2
#add in headers
echo -e "taxid\taccession\tbioproject\tbiosample\twgs\trefseq_category\tspeciesID\torganism\tintraspecific_name\tisolate\tassembly\trel_date\tasm\tsubmitter\tgcf_name\tftp_path\ttax" | cat - temp2 > assembly_summary_archaea_Dec2020_tax.txt
cd ..
In assembly_summary_archaea_Dec2020_tax.txt you can easily search for the genomes you are interested in. For example, you can subset for lineages of interest by selecting specific groups via their taxonomy, and you might run CheckM as well to select them based on their genome stats.
Once you have selected the genomes you are interested in, you can copy their ftp_links into a document called get_genomes.txt via nano (an alternative command-line approach is sketched below) and then make some small changes to the file with sed. Afterwards you can download all the genomes from NCBI using wget.
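If you would rather not pick the links by hand, the subsetting can also be done on the command line. A hedged sketch (assuming you are in the directory above ncbi_genomes, that the taxonomy string is the last (17th) column and the ftp path the 16th column of the table built above, and using Crenarchaeota purely as an example lineage):
#keep the ftp paths (column 16) of all genomes whose taxonomy string (column 17) matches the lineage of interest
awk -F'\t' '$17 ~ /Crenarchaeota/ {print $16}' ncbi_genomes/assembly_summary_archaea_Dec2020_tax.txt > get_genomes.txt
#check how many genomes were selected
wc -l get_genomes.txt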
# extend path to be able to download the genomes
sed -r 's|(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)(GCA/)([0-9]{3}/)([0-9]{3}/)([0-9]{3}/)(GCA_.+)|\1\2\3\4\5\6/\6_genomic.fna.gz|' get_genomes.txt > get_genomes_v2.txt
# download genomes
for next in $(cat get_genomes_v2.txt); do wget "$next"; done
Once you have the genomes, you can clean the names as suggested above and continue with the steps below.
In case you want to generate the csv file that links the NCBI taxID to a taxonomic string, which includes the phylum, class, order, etc information, this is the workflow (you will need to download ncbitax2lin):
#change to the directory in which we installed ncbitax2lin
cd /export/lv1/user/spang_team/Scripts/NCBI_taxdump/ncbitax2lin
#start bashrc that allows to run conda
source ~/.bashrc.conda2
#setup virtual environment (see more instructions here: https://github.com/zyxue/ncbitax2lin)
#virtualenv venv/
#re-do taxonomy mapping file
make
#unzip file
gzip -d lineages-2020-12-10.csv.gz
#close virtual environment
#source deactivate venv/
#make a new directory for the most current tax
mkdir /export/lv1/user/spang_team/Databases/NCBI_taxonomy/Dec2020
#change dir
cd /export/lv1/user/spang_team/Databases/NCBI_taxonomy/Dec2020
#cp the taxa file into our new dir and replace spaces with underscore
sed 's/ /_/g' ~/../spang_team/Scripts/NCBI_taxdump/ncbitax2lin/lineages-2020-12-10.csv > temp1
#only print the levels until species level
awk -F',' -v OFS="\t" '{print $1,$2,$3,$4,$5,$6,$7,$8}' temp1 > temp2
#add in "none" whenever a tax level is empty
awk 'BEGIN { FS = OFS = "\t" } { for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i="none"}; 1' temp2 > temp3
#merge columns 2-8, thus we have 1 column with the tax number and one column with the tax string
awk ' BEGIN { FS = OFS = "\t" } {print $1,$2","$3","$4","$5","$6","$7","$8}' temp3 > temp4
#remove header
sed '1d' temp4 > lineages-2020-12-10.csv
#clean up
rm temp*
4.2 Make a filelist for all genomes of interest
Let us start by making a file that lists all the genomes we are working with. Such a file list is useful when working with for loops.
#make a folder for all the secondary files we need
mkdir FileLists
#list our genomes of interest
ls fna/renamed/* | sed 's/\.fna//g' | sed 's/fna\/renamed\///g' > FileLists/GenomeList
#check content of file
head FileLists/GenomeList
The last command should print something like this:
4.3 Call proteins with prokka and clean files
Now that we have the genomes organized, let's identify the coding regions and call the proteins with Prokka. Another common tool you might encounter for this step is prodigal. Prokka basically uses prodigal under the hood but performs some extra annotations, which we also want to look at.
Notice:
- Our genomes are from archaea; if you have other genomes, change the --kingdom Archaea part below. This setting makes no major difference when calling proteins, but the Prokka annotations might change slightly.
- Prokka is not designed to work with eukaryotic genomes, if you work with eukaryotic genomes an alternative tool to look into is for example Augustus.
- Prokka does very quick and dirty annotations, these are good for a first glimpse into what you have but in our experience there can be several issues, so use them with caution.
- Similarly, no database for annotating proteins is perfect. This is why we use multiple databases in our workflow and compare results, so that we can spot mis-annotated proteins more easily (more on that in the last section).
4.3.1 Run prokka
Prokka is a tool for rapid prokaryotic genome annotation (Seemann, 2014).
#prepare new directories to store our data
mkdir faa
mkdir faa/Prokka
#run prokka
for sample in `cat FileLists/GenomeList`; do prokka fna/renamed/${sample}.fna --outdir faa/Prokka/${sample} --prefix $sample --kingdom Archaea --addgenes --force --increment 10 --compliant --centre UU --cpus $cpus --norrna --notrna ; done
#control number of faa files generated: 2
#this should be the same as the original number of genomes
ll faa/Prokka/*/*faa | wc -l
#check how the faa header looks like
head $(ls -1 faa/Prokka/*/*faa | head -1)
We have 2 genomes and the header of the first file should look something like this:
We can see that prokka gives a unique protein ID and adds an annotation to the file header.
Other than the protein faa file, Prokka generates several additional output files. These include:
Additionally, prokka does not only call proteins but also identifies tRNAs and tries to annotate proteins. The whole workflow is summarized in the image below:
4.3.2 Clean faa header and combine genomes
If we look again at the protein sequence headers of the newly generated files, we see that the header is not ideal yet. For example, a header might look something like this: >CALJACFG_00010 ATP-dependent DNA helicase Hel308. This header is rather long and uses extra symbols (a space, and a minus symbol, which we already use for other purposes). Therefore, we first want to clean things up a bit.
Additionally, since for the following workflow we want to concatenate the two genomes, i.e. merge the two files into one, it is useful to add the genome name into the fasta header before merging the files.
This is optional, but it is useful to use a special delimiter to later separate the genome name from the protein name. Therefore, we add the file name and separate it from the old sequence header with a -.
#go into the prokka folder
cd faa/Prokka/
#make a new folder in which we want to store our renamed genomes
mkdir renamed
#organize our original files
cp */*faa .
#add the filename into the fasta header and store them in our new folder
for i in *faa; do awk '/>/{sub(">","&"FILENAME"-");sub(/\.faa/,x)}1' $i > renamed/$i; done
#control if this worked
head $(ls -1 renamed/*faa | head -1)
#remove old files
rm -f *faa
#concatenate files
cat renamed/*faa > All_Genomes.faa
#clean the header and shorten it by removing everything after the first space
cut -f1 -d " " All_Genomes.faa > All_Genomes_clean.faa
#check if that worked
head All_Genomes_clean.faa
#go back to the wdir
cd ../..
#make a blast db out of concatenated genomes (we need this whenever we want to run blast)
makeblastdb -in faa/Prokka/All_Genomes_clean.faa -out faa/All_Genomes_clean -dbtype prot -parse_seqids
Now our file header should look something like this:
5 If you work with your own genomes
This is a short reminder for people who already have their own protein files and want to start with the annotations right away:
- Ideally, have not only the protein ID but also the genome ID in the sequence header
- Make sure the file header is concise, does not contain ANY spaces, and ideally uses a '-' (or any other unique delimiter) to separate the genome ID from the protein ID
- If you have a concise header plus the bin ID in the header, it is easy to concatenate the protein sequences of your genomes into one single file and still easily know from what genome each sequence originally came
More specifically you want to:
- for the fna files:
  - generate a folder named fna/renamed
  - add the individual genomes in there
- for the faa files:
  - generate the file in the folder faa/Prokka
  - concatenate the proteins from all your genomes of interest and name the file All_Genomes_clean.faa
- continue with the steps below
6 Make a mapping file with all relevant genome info
Now that our files are ready, we first want to make a table with the basic stats (i.e. contig length, contig GC, protein length, …).
6.1 Get contig stats
Prokka has the habit of renaming the fasta headers. Therefore, it is not straightforward to say from what contig a protein originally came. For example, a header in the contig file (i.e. the fna file) might look like this: >GCA_000008085-AE017199.1, while a protein on that contig might be named GCA_000008085-CALJACFG_00010 in the faa file.
The workflow below generates a mapping file that links the original contig name with the prokka contig name. Beware, the workflow is a bit convoluted since the original contig ID AE017199 is stored nowhere in the prokka output files and we need to improvise a bit to generate this information.
The goal is to generate a file with this information:
- old contig name, binID, contig GC, contig length
Notice
- This part depends a lot on the files having short naming schemes and no extra symbols. So make sure that the final file has the 4 columns we would expect.
#make folder
mkdir contig_mapping
#calculate the length and gc for each contig
for i in `cat FileLists/GenomeList`; do python $script_folder/fasta_length_gc.py fna/renamed/${i}.fna; done > contig_mapping/temp1
#make a separate column for the genome ID and proteinID and give a numbering to the contigs
awk 'BEGIN{OFS="\t"}{split($1,a,"-"); print a[2], a[1], ++count[a[1]], $2, $3}' contig_mapping/temp1 > contig_mapping/Contig_Old_mapping.txt
#check file: the header should have --> contigID, binID, contig NR, GC, contig length
head contig_mapping/Contig_Old_mapping.txt
#cleanup temp files
rm contig_mapping/*temp*
#create file for merging with prokka contigIDs. Here, we add the genome ID with the contig number together to recreate the prokka contig name
awk 'BEGIN{OFS="\t"}{print $2"_contig_"$3,$0}' contig_mapping/Contig_Old_mapping.txt > contig_mapping/Contig_Old_mapping_for_merging.txt
#check file: prokka contig name, prokka genome ID, genome ID, contig NR, GC, contig length
head contig_mapping/Contig_Old_mapping_for_merging.txt
In the file we see the contig name, contig GC and contig length. Since the two genomes are complete genomes we only work with two contigs in total. If you work with MAGs this table will get much longer.
6.2 Prepare list that links proteins with binIDs
Now, we want to generate a file that lists all proteins we are working with in the first column and the bin ID in a second column. So we want to generate a file with two columns: proteinID (i.e. accession) and genome ID.
#get list of protein accession numbers
grep "^>" faa/Prokka/All_Genomes_clean.faa > temp
#remove the prokka annotations
cut -f1 -d " " temp > temp2
#remove the ``>``
sed 's/>//g' temp2 > FileLists/AllProteins_list.txt
#check file; we want one column with the proteinIDs
head FileLists/AllProteins_list.txt
#remove temp files
rm -f temp*
#Modify protein list to add in a column with genome ID
awk -F'\t' -v OFS='\t' '{split($1,a,"-"); print $1, a[1]}' FileLists/AllProteins_list.txt | LC_ALL=C sort > FileLists/1_Bins_to_protein_list.txt
#check file content: we now have the accession and the binID
head FileLists/1_Bins_to_protein_list.txt
#check with how many proteins we work: 2069
wc -l FileLists/1_Bins_to_protein_list.txt
We now have a small table that lists all our 2,069 proteins and the respective genome each protein belongs to. The first few lines of the file should look something like this:
:::{.callout-note}
Throughout the workflow (or whenever generating your own pipeline), check that the number of proteins stays consistent; this is a useful sanity check to ensure that the code does what you expect it to (see the small example after this note).
Additionally, while running things for the first time it is always good to check the file content that was generated with nano or less to get a better grasp on:
- What is happening at specific points in the script
- Find potential bugs
:::
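For example, a quick sanity check at this point could be to compare the number of sequences in the concatenated protein file with the number of rows in the protein list (a small sketch using the files generated above; both numbers should be 2069 in this tutorial):
#count the protein sequences in the concatenated faa file
grep -c ">" faa/Prokka/All_Genomes_clean.faa
#count the rows in the protein list; the two numbers should match
wc -l < FileLists/AllProteins_list.txt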
6.3 Prepare a file with prokka annotations & Lengths & other genome info
Now, we want to generate a file that links the contig info and all the protein info.
#compress gbk files (generated by prokka)
gzip faa/Prokka/*/*gbk
#make list for looping
for f in faa/Prokka/*/*gbk.gz; do echo $f >> FileLists/list_of_gzips; done
#control that the nr of files we have is correct: 2
wc -l FileLists/list_of_gzips
#create a summary file based on the gbk file
#lists the genome ID, contig name, contig length, proteinID, prokka annotation
python $script_folder/Parse_prokka_for_MAGs_from_gbk-file_v2.py -i FileLists/list_of_gzips -t FileLists/Contig_Protein_list.txt
#check output of the script: Prokka binID, Prokka contig ID, contig length, gene start, gene stop, strand, accession, prokka annotation
head FileLists/Contig_Protein_list.txt
The output should look something like this
Next, we want to have a list that assigns the new ID prokka gives to a genome to the old genome name we got from NCBI:
#make prokka ID to genome ID mapping file
awk 'BEGIN{OFS="\t"}{split($1,a,"-"); print a[2],$2}' FileLists/1_Bins_to_protein_list.txt | sed 's/_[0-9]*\t/\t/g' | sort | uniq > FileLists/Prokka_to_BinID.txt
#check file: Prokka bin ID, old bin ID
head FileLists/Prokka_to_BinID.txt
Finally, we can merge everything together.
#link old to new genome ID
awk 'BEGIN{FS="\t";OFS="\t"}FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' FileLists/Prokka_to_BinID.txt FileLists/Contig_Protein_list.txt | awk 'BEGIN{FS="\t";OFS="\t"}{print $1,$10,$2,$2,$3,$4,$5,$6,$7,$8}' > temp_Bin_Contig_Protein_list.txt
#add in extra column for contig nr
awk -F'\t' -v OFS='\t' 'gsub("[A-Za-z0-9]*_","",$4)' temp_Bin_Contig_Protein_list.txt | awk -F'\t' -v OFS='\t' '{print $2"_contig_"$4,$1,$2,$3,$4,$5,$6,$7,$8,$9,$10}' > FileLists/Bin_Contig_Protein_list.txt
#check file
head FileLists/Bin_Contig_Protein_list.txt
#merge with old contig IDs
#headers: accession, genome ID, new ContigID, old ContigID, merged ContigID,new Contig Length, old Contig Length, GC,ProteinID, protein start, protein end, protein strand, prokka annotation
awk 'BEGIN{FS="\t";OFS="\t"}FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' contig_mapping/Contig_Old_mapping_for_merging.txt FileLists/Bin_Contig_Protein_list.txt | awk 'BEGIN{FS="\t";OFS="\t"}{print $3"-"$7,$3,$4,$13,$1,$6,$17,$16,$7,$8,$9,$10,$11 }' > FileLists/Bin_Contig_Protein_list_merged.txt
#check file and number of proteins in file
head FileLists/Bin_Contig_Protein_list_merged.txt
wc -l FileLists/Bin_Contig_Protein_list_merged.txt
#prepare a file with protein length and gc to add protein length into our file
python $script_folder/fasta_length_gc.py faa/Prokka/All_Genomes_clean.faa > temp
#merge with contig/protein info
awk 'BEGIN{FS="\t";OFS="\t"}FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' temp FileLists/Bin_Contig_Protein_list_merged.txt | awk 'BEGIN{FS="\t";OFS="\t"}{print $1,$2,$3,$4,$6,$7,$8,$9,$10,$11,$12,$16,$13}' > temp2
#add a header
echo -e "accession\tBinID\\tNewContigID\tOldContigId\tContigNewLength\tContigOldLength\tGC\tProteinID\tProteinStart\tProteinEnd\tStrand\tProteinLength\tProkka" | cat - temp2 > FileLists/B_GenomeInfo.txt
#check file
less -S FileLists/B_GenomeInfo.txt
#control that all is ok and has the expected number of rows: 2070
#notice, we have one row extra since we count the header
wc -l FileLists/B_GenomeInfo.txt
#remove temp files
rm -f temp*
The file stored in FileLists/B_GenomeInfo.txt should look something like this:
7 Do the annotations
All searches that follow below have a similar workflow:
- Run the search (with hmmer, diamond, blast, etc.)
- For each protein, make sure we have just one unique hit (usually we select the best hit for each protein based on the e-value or bit score). I.e. a single protein might be similar to several dehydrogenases and we only keep the best hit (a small worked example of this step is shown below).
- Clean up the output from the search.
- Merge the output with our protein list (that way we make sure we always have the same number of rows in all the files we generate).
- add gene descriptions to our table, i.e. when we do the search against the COGs we only have the COG IDs, but we want to add a gene description here.
- add an informative header.
For all the searches, we keep the original input table and the final output table; the other intermediate files we delete.
Once we have run our searches against several different databases, we merge all the results in a step-wise manner (using the protein IDs) and then in the final step clean the final data frame by removing redundant information.
We will add basic information on the different databases and search algorithms in the last section of this tutorial, so if you want more details check there.
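One step that recurs in every search below is the selection of a single best hit per protein using two rounds of sort. To illustrate how this idiom works, here is a small, self-contained sketch on a toy table (the file toy_hits.tsv and its values are made up purely for illustration; columns are protein, database ID, bit score, e-value):
#toy input: two hits for protein_1, one hit for protein_2
printf 'protein_1\tCOG0001\t50\t1e-10\nprotein_1\tCOG0002\t200\t1e-40\nprotein_2\tCOG0003\t80\t1e-20\n' > toy_hits.tsv
#first sort by bit score (descending) and e-value (ascending), so the best hit per protein comes first,
#then keep only the first line per protein (-u on column 1; --stable preserves the best-first order)
sort -t$'\t' -k3,3gr -k4,4g toy_hits.tsv | sort -t$'\t' --stable -u -k1,1
#--> protein_1 keeps only its COG0002 hit (higher bit score); protein_2 keeps its single hit
rm toy_hits.tsv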
7.1 COG search
The COGs (Clusters of Orthologous Genes) database consists of orthologous gene clusters generated from 1,187 bacterial and 122 archaeal genomes (2020 version) and is therefore useful if you work with organisms from these two groups.
Possible settings to change:
- number of CPUs
- e-value threshold
#Setup directories
mkdir NCBI_COGs
mkdir merge
#run hmmsearch against all COGs
hmmsearch --tblout NCBI_COGs/sequence_results.txt -o NCBI_COGs/results_all.txt --domtblout NCBI_COGs/domain_results.txt --notextw --cpu $cpus $cog_db faa/Prokka/All_Genomes_clean.faa
#format the full table and only select hits above a certain e-value
sed 's/ \+ /\t/g' NCBI_COGs/sequence_results.txt | sed '/^#/d'| sed 's/ /\t/g'| awk -F'\t' -v OFS='\t' '{print $1, $3, $6, $5}' | awk -F'\t' -v OFS='\t' '($4 + 0) <= 1E-3' > NCBI_COGs/sequence_results_red_e_cutoff.txt
#get best hit/protein based on bit score, and e-value
sort -t$'\t' -k3,3gr -k4,4g NCBI_COGs/sequence_results_red_e_cutoff.txt | sort -t$'\t' --stable -u -k1,1 | sort -t$'\t' -k3,3gr -k4,4g > NCBI_COGs/All_NCBI_COGs_hmm.txt
#merge with contig list
LC_ALL=C join -a1 -j1 -e'-' -t $'\t' -o 0,2.2,2.4 <(LC_ALL=C sort FileLists/1_Bins_to_protein_list.txt) <(LC_ALL=C sort NCBI_COGs/All_NCBI_COGs_hmm.txt) | LC_ALL=C sort > NCBI_COGs/temp4
#merge with COG mapping file
LC_ALL=C join -a1 -1 2 -2 1 -e'-' -t $'\t' -o1.1,0,2.3,2.2,2.5,1.3 <(LC_ALL=C sort -k2 NCBI_COGs/temp4) <(LC_ALL=C sort -k1 $cog_mapping) | LC_ALL=C sort > NCBI_COGs/temp5
#add headers
echo -e "accession\tCOG\tCOG_Description\tCOG_PathwayID\tCOG_Pathway\tCOG_evalue" | cat - NCBI_COGs/temp5 > NCBI_COGs/A_NCBI_COGs.tsv
#check file
head NCBI_COGs/A_NCBI_COGs.tsv
#count the number of lines we have
#this should be our number of proteins + 1 row for the header
wc -l NCBI_COGs/A_NCBI_COGs.tsv
#cleanup
rm -f NCBI_COGs/temp*
In step 2, hmmsearch will generate multiple outputs (for details on the individual data found in these tables see chapter 5, page 45 of the HMMER manual):
- The --tblout option produces the target hits table. The target hits table consists of one line for each different query/target comparison that met the reporting thresholds, ranked by decreasing statistical significance (increasing E-value).
- The --domtblout option produces the domain hits table. There is one line for each domain, and there may be more than one domain per sequence.
- The -o option produces a table with the raw output, including alignments. Notice: this file can get rather large; if you have space issues, it is not essential and can be deleted, or you can simply omit the -o option.
After parsing, the table stored in NCBI_COGs/A_NCBI_COGs.tsv contains the protein accession, COG ID, COG description and e-value. If other relevant data is available (in the case of the COGs, individual COGs are sorted into pathways), this information is added as well:
Exercise
Compare the files sequence_results.txt and sequence_results_red_e_cutoff.txt. What is the difference between them?
Click me to see an answer
The commands to check this are:
head NCBI_COGs/sequence_results.txt
head NCBI_COGs/sequence_results_red_e_cutoff.txt
#we can also check the row numbers
#we see that we go from 17888 to 11066 rows
wc -l NCBI_COGs/sequence_results*
Additionally, we can see that we removed some columns in the second table and that we excluded any rows starting with a #.
7.2 ArCOGs search
The arCOGs (archaeal Clusters of Orthologous Genes) are a database of orthologous groups built from archaeal genomes and are therefore especially useful when working with archaea.
Possible settings to change:
- number of CPUs (default = 2)
- e-value threshold (default = 1e-3)
#prepare folders
mkdir arcogs
#run hmmsearch
hmmsearch --tblout arcogs/sequence_results.txt -o arcogs/results_all.txt --domtblout arcogs/domain_results.txt --notextw --cpu $cpus $arcog_db faa/Prokka/All_Genomes_clean.faa
#format the full table and only select sequences above a certain e-value
sed 's/ \+ /\t/g' arcogs/sequence_results.txt | sed '/^#/d'| sed 's/ /\t/g'| awk -F'\t' -v OFS='\t' '{print $1, $3, $6, $5}' | awk -F'\t' -v OFS='\t' '($4 + 0) <= 1E-3' > arcogs/sequence_results_red_e_cutoff.txt
#get best hit based on bit score, and then e-value
sort -t$'\t' -k3,3gr -k4,4g arcogs/sequence_results_red_e_cutoff.txt | sort -t$'\t' --stable -u -k1,1 | sort -t$'\t' -k3,3gr -k4,4g > arcogs/All_arcogs_hmm.txt
#separate header with arcog and ID
awk -F'\t' -v OFS='\t' '{split($2,a,"."); print $1, a[1], $3,$4}' arcogs/All_arcogs_hmm.txt | LC_ALL=C sort > arcogs/temp3
#merge with contig list
LC_ALL=C join -a1 -j1 -e'-' -t $'\t' -o 0,2.2,2.4 <(LC_ALL=C sort FileLists/1_Bins_to_protein_list.txt) <(LC_ALL=C sort arcogs/temp3) | LC_ALL=C sort > arcogs/temp4
#merge with arcog names
LC_ALL=C join -a1 -1 2 -2 1 -e'-' -t $'\t' -o1.1,0,2.3,2.4,2.2,1.3 <(LC_ALL=C sort -k2 arcogs/temp4) <(LC_ALL=C sort -k1 $arcog_mapping) | LC_ALL=C sort > arcogs/temp5
#add in headers
echo -e "accession\tarcogs\tarcogs_geneID\tarcogs_Description\tarcogs_Pathway\tarcogs_evalue" | cat - arcogs/temp5 > arcogs/B_arcogs.tsv
#control counts
wc -l arcogs/B_arcogs.tsv
#check what happened
head arcogs/B_arcogs.tsv
#clean-up
rm -f arcogs/temp*
Now we have a second table with information from the arCOGs. If we merged the two tables already at this point, we could check whether each protein is assigned the same annotation when using different databases.
Again, it is useful to use more than one database to analyse your genome: no database is perfect, and having several databases to compare allows us to judge how confident we are in a certain annotation.
Unfortunately, you will also see that there is no consistent naming convention. I.e. in the table above you can see for CALJACFG_00080 that the COG links this protein to a tRNA_A37_methylthiotransferase and the arCOG links this protein to a 2-methylthioadenine_synthetase, but despite the different names both refer to the same enzyme.
Question
We now have 3 annotations for each protein from prokka, the arcogs and the COGs. This is a good time to look at whether or not we get consistent annotations across all these annotation results.
As an exercise, look at the gene with ID 00260 of the genome GCA_000008085 and check if all 3 databases give consistent annotations. If not, which annotation do you trust most? Since prokka generates different IDs for each run, we cannot give you the exact accession, but using grep to search for the genome and protein ID should be enough for you to find the protein.
Click me to see an answer
You can search for the term as follows:
#find the prokka annotation
grep 'GCA_000008085-.*00260' FileLists/B_GenomeInfo.txt
#find the hit in our arcog and cog annotations
grep 'GCA_000008085-.*00260' */[A-Z]_*.tsv
This should give you the following results:
- Prokka: GCA_000008085-CALJACFG_00260 Sulfide-quinone reductase
- A_NCBI_COGs.tsv:GCA_000008085-CALJACFG_00260 COG1252 NADH_dehydrogenase,_FAD-containing_subunit
- GCA_000008085-CALJACFG_00260 arCOG01065 HcaD NAD(FAD)-dependent_dehydrogenase
CALJACFG_00260 hits to a sulfide-quinone reductase with prokka and to a NADH dehydrogenase with the arCOGs and the COGs. Generally, the e-values are pretty good with the last two databases, while prokka by default does not give us any score to judge the confidence of this hit. We therefore should be skeptical about the annotation as sulfide-quinone reductase.
7.3 KO search using HMMs
The KO (KEGG Orthology) database is a database of molecular functions represented in terms of functional orthologs and was generated using archaeal, bacterial and eukaryotic genomes.
Possible settings to change:
- number of CPUs
- e-value threshold
#setup directories
mkdir KO_hmm
#run KO search
hmmsearch --tblout KO_hmm/sequence_results.txt -o KO_hmm/results_all.txt --domtblout KO_hmm/domain_results.txt --notextw --cpu $cpus $ko_db faa/Prokka/All_Genomes_clean.faa
#format the full table and only select sequences above a certain e-value
sed 's/ \+ /\t/g' KO_hmm/sequence_results.txt | sed '/^#/d'| sed 's/ /\t/g'| awk -F'\t' -v OFS='\t' '{print $1, $3, $6, $5}' | awk -F'\t' -v OFS='\t' '($4 + 0) <= 1E-3' > KO_hmm/sequence_results_red_e_cutoff.txt
#get best hit based on bit score, and then e-value
sort -t$'\t' -k3,3gr -k4,4g KO_hmm/sequence_results_red_e_cutoff.txt | sort -t$'\t' --stable -u -k1,1 | sort -t$'\t' -k3,3gr -k4,4g > KO_hmm/All_KO_hmm.txt
#merge with protein list
LC_ALL=C join -a1 -t $'\t' -j1 -e'-' -o 0,2.2,2.3,2.4 <(LC_ALL=C sort FileLists/1_Bins_to_protein_list.txt) <(LC_ALL=C sort KO_hmm/All_KO_hmm.txt) | sort -u -k1 > KO_hmm/temp
#merge with KO_hmm names
LC_ALL=C join -a1 -1 2 -2 1 -e'-' -t $'\t' -o 1.1,1.2,1.4,1.3,2.2,2.12 <(LC_ALL=C sort -k2 KO_hmm/temp) <(LC_ALL=C sort -k1 $ko_mapping) | LC_ALL=C sort > KO_hmm/temp3
#add in an extra column that lists whether hits have a high confidence score
awk -v OFS='\t' '{ if ($4 > $5){ $7="high_score" }else{ $7="-" } print } ' KO_hmm/temp3 > KO_hmm/temp4
#add header
echo -e "accession\tKO_hmm\tKO_e_value\tKO_bit_score\tKO_bit_score_cutoff\tKO_Definition\tKO_confidence" | cat - KO_hmm/temp4 > KO_hmm/C_KO_hmm.tsv
#control lines and content
wc -l KO_hmm/C_KO_hmm.tsv
head KO_hmm/C_KO_hmm.tsv
#clean up
rm -f KO_hmm/temp*
7.4 Pfam search
The Pfam database is a large collection of protein families generated from UniProt Reference Proteomes, which include archaea, bacteria and eukaryotes.
Possible settings to change:
- number of CPUs
- e-value threshold
#1. Setup directories
mkdir pfam
#2. run hmmsearch against all pfams
hmmsearch --tblout pfam/sequence_results.txt -o pfam/results_all.txt --domtblout pfam/domain_results.txt --notextw --cpu $cpus $pfam_db faa/Prokka/All_Genomes_clean.faa
#format the full table and only select sequences above a certain e-value
sed 's/ \+ /\t/g' pfam/sequence_results.txt | sed '/^#/d'| sed 's/ /\t/g'| awk -F'\t' -v OFS='\t' '{print $1, $4, $6, $5}' | awk -F'\t' -v OFS='\t' '($4 + 0) <= 1E-3' > pfam/sequence_results_red_e_cutoff.txt
#get best hit based on bit score, and then e-value
sort -t$'\t' -k3,3gr -k4,4g pfam/sequence_results_red_e_cutoff.txt | sort -t$'\t' --stable -u -k1,1 | sort -t$'\t' -k3,3gr -k4,4g > pfam/All_pfam.txt
#clean pfam id
awk -F'\t' -v OFS='\t' '{split($2,a,"."); print $1,a[1], $3, $4}' pfam/All_pfam.txt > pfam/temp
#merge with protein list
LC_ALL=C join -e'-' -a1 -t $'\t' -j1 -o 0,2.2,2.3,2.4 <(LC_ALL=C sort FileLists/1_Bins_to_protein_list.txt) <(LC_ALL=C sort pfam/temp) | sort -u -k1 > pfam/temp3
#merge with pfam names
LC_ALL=C join -a1 -1 2 -2 1 -t $'\t' -e'-' -o1.1,0,2.5,1.4 <(LC_ALL=C sort -k2 pfam/temp3) <(LC_ALL=C sort -k1 $pfam_mapping) | LC_ALL=C sort > pfam/temp5
#add in header
echo -e "accession\tPFAM_hmm\tPFAM_description\tPfam_Evalue" | cat - pfam/temp5 > pfam/D_Pfam.tsv
#control lines and content
wc -l pfam/D_Pfam.tsv
head pfam/D_Pfam.tsv
#clean up
rm -f pfam/temp*
7.5 TIGR search
The original TIGRFAMs database was a research project of The Institute for Genomic Research (TIGR) and its successor, the J. Craig Venter Institute (JCVI), and is now maintained by NCBI. TIGRFAMs is a collection of manually curated protein families focusing primarily on prokaryotic sequences.
Possible settings to change:
- number of CPUs
- e-value threshold
#1. Setup directories
mkdir TIGRs
#2. run hmmsearch against all tigrs
hmmsearch --tblout TIGRs/sequence_results.txt -o TIGRs/results_all.txt --domtblout TIGRs/domain_results.txt --notextw --cpu $cpus $tigr_db faa/Prokka/All_Genomes_clean.faa
#format the full table and only select sequences above a certain e-value
sed 's/ \+ /\t/g' TIGRs/sequence_results.txt | sed '/^#/d'| sed 's/ /\t/g'| awk -F'\t' -v OFS='\t' '{print $1, $4, $6, $5}' | awk -F'\t' -v OFS='\t' '($4 + 0) <= 1E-3' > TIGRs/sequence_results_red_e_cutoff.txt
#get best hit based on bit score, and then e-value
sort -t$'\t' -k3,3gr -k4,4g TIGRs/sequence_results_red_e_cutoff.txt | sort -t$'\t' --stable -u -k1,1 | sort -t$'\t' -k3,3gr -k4,4g > TIGRs/All_TIGR.txt
#merge with protein list
LC_ALL=C join -a1 -e'-' -t $'\t' -j1 -o 0,1.2,2.2,2.4 <(LC_ALL=C sort FileLists/1_Bins_to_protein_list.txt) <(LC_ALL=C sort TIGRs/All_TIGR.txt) | sort -u -k1 > TIGRs/temp
#merge with tigr names
LC_ALL=C join -a1 -1 3 -2 1 -e'-' -t $'\t' -o1.1,0,2.11,2.13,1.4 <(LC_ALL=C sort -k3 TIGRs/temp) <(LC_ALL=C sort -k1 $tigr_mapping)| LC_ALL=C sort > TIGRs/temp3
#add header
echo -e "accession\tTIRGR\tTIGR_description\tTIGR_EC\tTIGR_Evalue" | cat - TIGRs/temp3 > TIGRs/E_TIGR.tsv
#control lines and content
wc -l TIGRs/E_TIGR.tsv
head TIGRs/E_TIGR.tsv
#clean up
rm -f TIGRs/temp*
7.6 CazyDB search
The CAZy (Carbohydrate-Active enZYmes) database describes the families of structurally-related catalytic and carbohydrate-binding modules (or functional domains) of enzymes that degrade, modify, or create glycosidic bonds. As such, this database does not consist of clusters of orthologous groups but individual protein sequences that can be linked to CAZys.
Possible settings to change:
- number of CPUs
- e-value threshold (default = 1e-10). We are a bit stricter here because the database is smaller, so the chance of false positives is higher.
#make folder
mkdir CAZYmes
#2. run hmmsearch against the CAZy database
hmmsearch --tblout CAZYmes/sequence_results.txt -o CAZYmes/results_all.txt --domtblout CAZYmes/domain_results.txt --notextw --cpu $cpus $cazy_db faa/Prokka/All_Genomes_clean.faa
#format the full table and only select sequences above a certain e-value
sed 's/ \+ /\t/g' CAZYmes/sequence_results.txt | sed '/^#/d'| sed 's/ /\t/g'| awk -F'\t' -v OFS='\t' '{print $1, $3, $6, $5}' | awk -F'\t' -v OFS='\t' '($4 + 0) <= 1E-10' > CAZYmes/sequence_results_red_e_cutoff.txt
#get best hit based on bit score, and then e-value
sort -t$'\t' -k3,3gr -k4,4g CAZYmes/sequence_results_red_e_cutoff.txt | sort -t$'\t' --stable -u -k1,1 | sort -t$'\t' -k3,3gr -k4,4g > CAZYmes/All_vs_CAZYmes.txt
#clean things and remove .hmm from the file
sed s'/\.hmm//g' CAZYmes/All_vs_CAZYmes.txt > CAZYmes/temp
#merge with contig list
LC_ALL=C join -a1 -j1 -e'-' -t $'\t' -o 0,2.2,2.4 <(LC_ALL=C sort FileLists/1_Bins_to_protein_list.txt) <(LC_ALL=C sort CAZYmes/temp) | LC_ALL=C sort > CAZYmes/temp3
#merge with mapping file
LC_ALL=C join -a1 -1 2 -2 1 -e'-' -t $'\t' -o 1.1,1.2,1.3,2.2 <(LC_ALL=C sort -k2 CAZYmes/temp3) <(LC_ALL=C sort $cazy_mapping) | LC_ALL=C sort > CAZYmes/temp4
#add in headers
echo -e "accession\tCAZy\tCAZy_evalue\tCAZy_Description" | cat - CAZYmes/temp4 > CAZYmes/F_CAZy.tsv
#control lines and file content
wc -l CAZYmes/F_CAZy.tsv
head CAZYmes/F_CAZy.tsv
#clean up
rm -f CAZYmes/temp*
7.7 Transporter DB search
The transporter classification database (also tcdb) details a comprehensive classification system for membrane transport proteins known as the Transporter Classification (TC) system. The database consists of individual protein sequences linked to this classification system.
Possible settings to change:
- number of CPUs (default setting for -num_threads = 2)
- e-value threshold (default setting for -evalue = 1e-10). We are a bit stricter here because the database is smaller, so the chance of false positives is higher.
#make folder
mkdir TransporterDB
#2. run protein blast against all TransporterDB
blastp -num_threads $cpus -outfmt 6 -query faa/Prokka/All_Genomes_clean.faa -db $transporter_db -out TransporterDB/All_vs_TPDB.tsv -evalue 1e-10
#get best hit based on bit score, and then e-value
sort -t$'\t' -k12,12gr -k11,11g TransporterDB/All_vs_TPDB.tsv | sort -t$'\t' --stable -u -k1,1 | sort -t$'\t' -k12,12gr -k11,11g > TransporterDB/temp
#merge with contig list
LC_ALL=C join -a1 -j1 -e'-' -t $'\t' -o 0,2.2,2.11 <(LC_ALL=C sort FileLists/1_Bins_to_protein_list.txt) <(LC_ALL=C sort TransporterDB/temp) | LC_ALL=C sort > TransporterDB/temp3
#merge with mapping file
#no good mapping file existing yet, so we won't run this
#LC_ALL=C join -a1 -1 2 -2 1 -e'-' -t $'\t' -o 1.1,1.2,1.3,2.2 <(LC_ALL=C sort -k2 TransporterDB/temp3) <(LC_ALL=C sort -k1 $transporter_mapping | uniq -u) | LC_ALL=C sort > TransporterDB/temp4
#add in headers
echo -e "accession\tTCBD_ID\tTCBD_evalue\tTCBD_description" | cat - TransporterDB/temp3 > TransporterDB/G_TPDB.tsv
#control lines
wc -l TransporterDB/G_TPDB.tsv
head TransporterDB/G_TPDB.tsv
#clean up
rm -f TransporterDB/temp*
7.8 HydDB search
The hydrogenase database (HydDB) contains individual protein sequences from hydrogenases and is a curated database in which hydrogenases are classified by type.
Possible settings to change:
- number of CPUs (default setting for -num_threads = 2)
- e-value threshold (default setting for -evalue = 1e-10). We are a bit stricter here because the database is smaller, so the chance of false positives is higher.
#make folder
mkdir HydDB
#run search
blastp -num_threads $cpus -outfmt 6 -query faa/Prokka/All_Genomes_clean.faa -db $hyd_db -out HydDB/All_vs_HydDB.tsv -evalue 1e-10
#get best hit based on bit score, and then e-value
sort -t$'\t' -k12,12gr -k11,11g HydDB/All_vs_HydDB.tsv | sort -t$'\t' --stable -u -k1,1 | sort -t$'\t' -k12,12gr -k11,11g > HydDB/temp
#merge with contig list; control: grep for 05160
LC_ALL=C join -a1 -j1 -e'-' -t $'\t' -o 0,2.2,2.3,2.11,2.12 <(LC_ALL=C sort FileLists/1_Bins_to_protein_list.txt) <(LC_ALL=C sort HydDB/temp) | LC_ALL=C sort > HydDB/temp2
#merge with HydDB names
LC_ALL=C join -a1 -1 2 -2 1 -e'-' -t $'\t' -o1.1,0,2.2,1.4 <(LC_ALL=C sort -k2 HydDB/temp2) <(LC_ALL=C sort -k1 $hyd_mapping) | LC_ALL=C sort > HydDB/temp3
#add in headers
echo -e "accession\tHydDB\tDescription\tHydDB_evalue" | cat - HydDB/temp3 > HydDB/H_HydDB.tsv
#control lines and content
wc -l HydDB/H_HydDB.tsv
head HydDB/H_HydDB.tsv
#clean up
rm -f HydDB/temp*
7.9 IPRscan
InterPro provides a functional analysis of proteins by classifying them into families and predicting domains and important sites, such as transmembrane motifs (TMs) or signal peptides.
7.9.1 Genome cleanup
IPRscan does not like to deal with * in the protein files, which is used for stop codons. If we work with programs that don't recognize *, we should remove this symbol first.
#interproscan does not like if we have stars in our protein files, so we need to remove them
cd faa/Prokka
sed 's/*//g' All_Genomes_clean.faa > All_Genomes_clean_interproscan.faa
cd ../..
Before running IPRscan, we split our big concatenated file into several smaller files. Then we can use GNU parallel to speed things up.
We set the number of processes to run with -j. Always make sure that you adjust this depending on the available resources.
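As a minimal, self-contained illustration of the -j flag (the input names here are hypothetical placeholders, not files from this tutorial):
#run at most 2 jobs at the same time; each {} is replaced by one of the inputs
parallel -j 2 'echo processing {}' ::: genome_A.faa genome_B.faa genome_C.faa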
7.9.2 run IPRscan
Possible settings to change:
- Number of files we split the concatenated fasta file into via Split_Multifasta.py (default of -n = each split file will contain 1000 sequences)
- number of parallel jobs to run via parallel (default setting for -j= 4)
- e-value threshold (default setting for -evalue = 1e-20)
mkdir IPRscan
#split files --> 2069 proteins were split into 3 files
python $script_folder/Split_Multifasta.py -m faa/Prokka/All_Genomes_clean_interproscan.faa -n 1000
#move files into a better location
mkdir split_faa
mv File*.faa split_faa
#cp the batch script for running this job on an hpc
#specifically adjust:
#interproscan only works on no83-no89, so choose any of those
##SBATCH --ntasks-per-node=30: Adjust this to the total nr of CPUs you need
#-j3 : Adjust this according to how many jobs you want to run in parallel
cp $script_folder/interproscan_run.sh .
#start job
sbatch interproscan_run.sh
#content of the script
'''
parallel -j3 'i={}; interproscan.sh -i $i -d IPRscan/ -appl TIGRFAM,SFLD,SUPERFAMILY,Gene3D,Hamap,Coils,ProSiteProfiles,SMART,CDD,PRINTS,ProSitePatterns,Pfam,ProDom,MobiDBLite,PIRSF,TMHMM -T IPRscan/temp --iprlookup --goterms -cpu 5' ::: split_faa/File*.faa
'''
#Concat result files
cd IPRscan
mkdir single_files
mv File* single_files
mkdir Concat_results/
cat single_files/File*.faa.xml > Concat_results/All_bins_iprscan-results.xml
cat single_files/File*.faa.gff3 > Concat_results/All_bins_bins_iprscan-results.gff3
cat single_files/File*.faa.tsv > Concat_results/All_bins_bins_iprscan-results.tsv
#cleanup
rm -f single_files/File*
cd ..
#clean up fasta header so that it is exactly the same as the accession ID in the interproscan results
python $script_folder/parse_IPRdomains_vs2_GO_2_ts_sigP.py -s faa/Prokka/All_Genomes_clean.faa -i IPRscan/Concat_results/All_bins_bins_iprscan-results.tsv -o IPRscan/All_bins_bins_iprscan-results_parsed.tsv
#remove issue with spaces
sed 's/ /_/g' IPRscan/All_bins_bins_iprscan-results_parsed.tsv | LC_ALL=C sort > IPRscan/temp.txt
#print only columns of interest
awk -F'\t' -v OFS="\t" '{print $1, $2,$3,$4, $5, $8, $9}' IPRscan/temp.txt > IPRscan/temp2
#add header
echo -e "accession\tIPR\tIPRdescription\tIPR_PFAM\tIPR_PFAMdescription\tIPR_TMHMM\tIPR_SignalP" | cat - IPRscan/temp2 > IPRscan/I_IPR.tsv
#control lines: 2070
wc -l IPRscan/I_IPR.tsv
#view file content
less -S IPRscan/I_IPR.tsv
#clean up
rm -f IPRscan/temp*
7.10 Diamond search against NCBI NR
The NCBI nr database is a protein sequence database maintained by the National Center for Biotechnology Information (NCBI). nr stands for non-redundant and means that it contains unique protein sequences that have been compiled from various sources, including GenBank, RefSeq, PDB, and Swiss-Prot.
Possible settings to change:
- number of CPUs (--threads)
- e-value threshold (--evalue)
- Sequence numbers to display (--seq)
Notice:
- This step can take quite some time.
- the NCBI nr database is constantly updated (and very large), which is why we are not providing the database with the other files in our repository. For information on how to download ncbi_nr check out the information here.
#make folder
mkdir NCBI_NR
#run diamond against NR database
diamond blastp -q faa/Prokka/All_Genomes_clean.faa --more-sensitive --evalue 1e-3 --threads $cpus --seq 50 --db $ncbi_nr_db --taxonmap $taxon_db --outfmt 6 qseqid qtitle qlen sseqid salltitles slen qstart qend sstart send evalue bitscore length pident staxids -o NCBI_NR/All_NCBInr.tsv
#Select columns of interest in diamond output file
awk -F'\t' -v OFS="\t" '{ print $1, $5, $6, $11, $12, $14, $15 }' NCBI_NR/All_NCBInr.tsv | sed 's/ /_/g' > NCBI_NR/temp
#Parse Diamond Results
python $script_folder/parse_diamond_blast_results_id_taxid.py -i FileLists/AllProteins_list.txt -d NCBI_NR/temp -o NCBI_NR/temp2
#rm header
sed 1d NCBI_NR/temp2 > NCBI_NR/temp3
#add an '-' into empty columns
awk -F"\t" '{for(i=2;i<=NF;i+=2)gsub(/[[:blank:]]/,"_",$i)}1' OFS="\t" NCBI_NR/temp3 > NCBI_NR/temp4
#the python script above sometimes leaves an empty 7th column, this gets rid of that issue
awk -F'\t' -v OFS="\t" '{if (!$7) {print $1,$2, $4 , $6, "-"} else {print $1, $2, $4, $6, $7}}' NCBI_NR/temp4 | LC_ALL=C sort > NCBI_NR/temp5
#split columns with two tax ids
awk -F'\t' -v OFS='\t' '{split($5,a,";"); print $1, $2, $3, $4, a[1]}' NCBI_NR/temp5 > NCBI_NR/temp6
#in column 2 remove everything after < (otherwise the name can get too long)
awk -F'\t' -v OFS='\t' '{split($2,a,"<"); print $1, a[1], $3, $4, $5}' NCBI_NR/temp6 > NCBI_NR/temp6a
#merge with tax names
LC_ALL=C join -a1 -1 5 -2 1 -e'-' -t $'\t' -o1.1,1.2,1.3,1.4,1.5,2.2 <(LC_ALL=C sort -k5 NCBI_NR/temp6a) <(LC_ALL=C sort -k1 $ncbi_tax_mapping) | LC_ALL=C sort > NCBI_NR/temp7
#add in header
echo -e "accession\tTopHit\tE_value\tPecID\tTaxID\tTaxString" | cat - NCBI_NR/temp7 > NCBI_NR/J_Diamond.tsv
#control lines
wc -l NCBI_NR/J_Diamond.tsv
#view file content
head NCBI_NR/J_Diamond.tsv
#clean up
rm -f NCBI_NR/temp*
8 Combine the individual dataframes
Now, let's clean up the final dataframe, and then we are done!
#activate conda environment to run py3
source ~/.bashrc.conda3
#combine tables
python3 $script_folder/merge_df.py */[A-Z]_*tsv accession Annotations.txt
#control that the number of rows is correct: 2070
wc -l Annotations.txt
#view content of file
#column allows to display the rows a bit cleaner allowing for easier reading
column -t -s $'\t' Annotations.txt | less -S
#close the conda environment
conda deactivate
Question/Homework --> Let's figure out what Ignicoccus is doing
- Download the table Annotations.txt to your personal computer using scp.
- Compare the different annotations; have a look at how they differ.
- Make one separate table for each of the two genomes
- For Ignicoccus (GCA_000017945), copy the accession numbers and KO IDs into a new file and upload or copy and paste them into the KEGG mapper (a sketch of how to do this on the command line follows after this list). Use this tool to get some first insights into the metabolic capacities of that organism.
Based on this, try to answer:
- Does Igni use glycolysis or gluconeogenesis (or both)? This question allows us to figure out whether our organism might use carbon sources like sugars (via glycolysis; if we have this, we might look further into whether we have a fermenting organism) or uses acetyl-CoA to make its own sugars, which could then be used in other pathways (gluconeogenesis).
- Is Igni likely an anaerobic or aerobic organism? This question can help us decide what growth conditions we need and whether O2 is required for our organism to generate energy.
- Is Igni an autotroph or heterotroph? Again, this question is important if we want to know how to grow our organism: do we need to feed it sugars so that it grows, or can it grow on CO2? If it is involved in CO2 and/or methane cycling it might also have an impact on global carbon cycles.
- Can Igni reduce sulfur? If Igni grows on sulfur, we could add sulfur to a potential growth medium. Plus, from the environmental side of things, we want to look into lithotrophy and how organisms in our environment are involved in nitrogen, sulfur, metal or hydrogen cycling.
- If Igni reduces sulfur, what does it oxidize? Same as above.
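A hedged sketch for splitting the merged table per genome and preparing the KEGG mapper input (the column number of the KO IDs is an assumption; check it first as shown in the first command):
#find which column of the merged table holds the KO IDs (look for KO_hmm in the output)
head -1 Annotations.txt | tr '\t' '\n' | cat -n | grep "KO_hmm"
#make one table per genome; the accession in column 1 starts with the genome name
grep "^GCA_000008085-" Annotations.txt > Annotations_GCA_000008085.txt
grep "^GCA_000017945-" Annotations.txt > Annotations_GCA_000017945.txt
#extract accession and KO ID for Ignicoccus; replace 20 with the column number found above
awk -F'\t' -v OFS='\t' '{print $1, $20}' Annotations_GCA_000017945.txt > GCA_000017945_KO_input.txt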
The Input for the website should look something like this:
Click me to see an answer
Answer for Does Igni use glycolysis or gluconeogenesis?
Ignicoccus has a gluconeogenic pathway. The key enzyme distinguishing glycolysis from gluconeogenesis is EC:3.1.3.11, which allows this organism to go from acetate to glucose. To figure this out, in addition to a literature search, there are two routes to take:
Option A:
Option B:
Answer for Is Igni likely an anaerobic or aerobic organism?
There are several things to check, but oxidative phosphorylation is a good start. We can see that complex IV, the terminal oxidase of the respiratory chain that transfers electrons from cyt c to O2, is absent. So this organism is likely an anaerobe.
Answer for Is Igni an autotroph or heterotroph?
Igni is an autotroph that encodes the dicarboxylate-hydroxybutyrate cycle (the big red circle highlights the genes of the pathway).
This one is a bit more tricky, since the malate dehydrogenase, 1.1.1.37, is absent if we look at the map (small red box). However, if we dig into the annotation tables, we find that the blast hit to WP_012123403.1 (likely gene ID 13020) corresponds to a malate dehydrogenase, and we see that most databases view the gene as a malate/lactate dehydrogenase (i.e. arCOG00246). Malate dehydrogenases are very easy to mix up with lactate dehydrogenases, and that is what happened here with the KO annotations.
As a general tip: if only a few genes of a pathway are lacking, have a second look into the table and search the other annotations. For genes where it is hard to distinguish the substrate specificity, building a tree or looking at the catalytic site of the protein might help, depending on how well studied our protein of interest is.
Answer for Can Igni reduce sulfur?
Yes, Igni can reduce sulfur using K18367/arCOG01069, which encodes a CoA-dependent NAD(P)H sulfur oxidoreductase [EC:1.8.1.18].
This one is a bit tricky, and you will not find this hit by looking at the pathway maps alone.
To find the issue with the maps: go to the metabolic pathway overview, click on "show matched objects" and search for K18367. You will see that we do not find a hit here, but if we search for this KO in our table we can find it, so what is the issue?
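For example, a quick way to check this in our own table from the command line (assuming Annotations.txt sits in the current directory):
#search the merged annotation table for the KO of interest
grep "K18367" Annotations.txt | column -t -s $'\t' | less -S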
If we search for K18367 across the KEGG database, we find some info here. We first see that this enzyme is not linked to a pathway (which explains the problem with the pathway map tool) and also the explanation that this enzyme is involved in NAD(P)H-linked sulfur reduction. We can confirm this by showing that (a) this gene is similar to a characterized gene in Pyrococcus and (b) we find sulfite transporters (i.e. arCOG02050/WP_011998105.1) that might be involved in getting the produced H2S out of the cell.
Answer for If Igni reduces sulfur, what does it oxidize?
Igni gains energy by reduction of elemental sulfur to hydrogen sulfide using molecular hydrogen as the electron donor.
Again, this is not something the KEGG mapper will tell us. To figure this out we can again use different approaches: (a) search the literature to find out what related organisms, i.e. Pyrococcus, do and check whether our organism has a similar gene set, or (b) search our annotation table for potential electron donors that make sense and are also used by other sulfur reducers.
The second approach (plus, in most cases, some background reading) will lead us to K06281 and K06282. These hit a [NiFe] group 1g hydrogenase and we can check here what exactly this hydrogenase typically does. We need to check this because, for hydrogenases, it is sometimes tricky to say whether they are involved in energy-generating reactions. If we check this specific example we find that it is involved in hydrogenotrophic respiration using sulfur as the terminal electron acceptor, so perfect!
A word of warning: the HydDB for hydrogenases will never be perfect. So also here, for more uncertain hits it might make sense to build a hydrogenase tree to see whether we really have a group 1g hydrogenase.
9 What next?
Depending on the number of genomes we have, the final table can get rather large. Therefore, it makes a lot of sense to condense the information. The following ideas are not part of this tutorial but give you some pointers on how to do that:
9.1 R magic
We provide an R tutorial here. This tutorial starts with the basics of R but ends with an example of how to deal with the annotation tables generated here. More specifically, it will show you how to:
- Make count tables for each genome, i.e. for MAG1 count how often we find arCOG0001. This calculation is done for each genome that is part of your analysis and thus provides you with a more condensed table.
- Condense the data on taxonomic or custom levels. For example, the R tutorial works with ~30 genomes of the same phylum that come from different environments. Using the tutorial you will be able to ask: how many of the aquifer genomes have arCOG0001? How many of the marine genomes have this gene? A minimal command-line sketch for the first idea is shown below.
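If you want a quick look without leaving the command line, a minimal sketch of the first idea could look like this; the file name and the column header "arCOG" are assumptions, so adjust them to the headers in your own table:
#count how often each arCOG occurs in one genome table, ignoring missing values ("-")
awk -F'\t' 'NR==1{for(i=1;i<=NF;i++) if($i=="arCOG") col=i; next} col && $col!="-"{count[$col]++} END{for(k in count) print k"\t"count[k]}' Annotations_GCA_000017945.txt | sort -t$'\t' -k2,2nr | head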
9.2 Python magic
This section will be extended once the author of this tutorial has time. However, once you have to deal with 5000 genomes, python becomes much more attractive for the parsing, especially compared to RStudio.
10 Background info on the used tools and databases
Here, you will find some background info on the used approaches, databases and tools. The idea is to provide links to the manuals and some other information that might (or might not) be useful.
Note, this is an on-going part of this document, so it might change over time.
10.1 Gene predictions
There are several tools to identify genes in our genomes:
- Prodigal: a prokaryotic gene recognition and translation initiation site identification tool. Hyatt et al., 2010
- Prokka: a rapid prokaryotic genome annotation tool. Seemann 2014. This tool is based on prodigal but also performs some basic genome annotation.
Something important to keep in mind is that these gene prediction tools are generally used for genomes from archaea and bacteria. If you are interested in eukaryotic genomes, things like splicing and alternative reading frames make gene prediction more tricky, and there are specialized tools out there that are better suited for eukaryotes.
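As an illustration of such a tool (the file names are placeholders and this is not necessarily the exact call used earlier in this tutorial), a basic prodigal run for a single genome could look like this:
#predict protein-coding genes for one isolate genome; -p single is the default mode for single genomes
prodigal -i genome.fna -a proteins.faa -d genes.fna -f gff -o genes.gff -p single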
10.2 Information on the different search tools
10.2.1 BLAST
BLAST (Basic Local Alignment Search Tool) finds regions of local similarity between protein or nucleotide sequences (Altschul 1990).
NCBI provides a quick tutorial explaining its general usage and you can find some more details in the algorithm and usage of the website here.
There are different versions of blast that you can use depending on what comparisons you want to make.
- blastn = nucleotide blast searches with a nucleotide “query” against a database of nucleotide “subject” sequences.
- blastp = protein blast searches with a protein “query” against a database of protein “subject” sequences.
- translated blast searches use a genetic code to translate either the “query,” the database “subjects,” or both into protein sequences, which are then aligned as in “blastp.”
- tblastn = a protein sequence query is compared to the six-frame translations of the sequences in a nucleotide database.
- blastx = a nucleotide sequence query is translated in six reading frames, and the resulting six-protein sequences are compared, in turn, to those in a protein sequence database.
- tblastx = both the “query” and database “subject” nucleotide sequences are translated in six reading frames, after which 36 (6 × 6) protein “blastp” comparisons are made.
Additionally, we can also work with alignments using PSI-BLAST, a protein sequence profile search method that builds on the alignments generated by a run of the BLASTp program. Some more info on this algorithm can be found here: https://www.ncbi.nlm.nih.gov/books/NBK2590/#:~:text=PSI-BLAST%20(Position-Specific,threshold%20using%20protein–protein%20BLAST.)
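To give an idea of the syntax (the file names are placeholders), a protein blast against a custom database could look like this:
#turn a protein fasta file into a blast database
makeblastdb -in subject_proteins.faa -dbtype prot -out my_protein_db
#run blastp with tabular output (-outfmt 6) and an e-value cutoff
blastp -query query_proteins.faa -db my_protein_db -outfmt 6 -evalue 1e-5 -num_threads 2 -out blastp_results.tsv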
10.2.2 Diamond
Diamond is an alternative to BLAST for protein and translated DNA searches (Buchfink et al., 2015). If you check the diamond website you will find that some advantages over blast are listed:
- Pairwise alignment of proteins and translated DNA at 100x-20,000x speed of BLAST.
- Frameshift alignments for long read analysis.
- Low resource requirements and suitable for running on standard desktops or laptops.
- Various output formats, including BLAST pairwise, tabular and XML, as well as taxonomic classification.
In principle, diamond works very similarly to blast and uses the same input files.
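For comparison, a minimal diamond example could look like this (again, the file names are placeholders):
#format a protein fasta file as a diamond database
diamond makedb --in subject_proteins.faa --db my_diamond_db
#run the protein search with blast-like tabular output
diamond blastp --query query_proteins.faa --db my_diamond_db --out diamond_results.tsv --outfmt 6 --evalue 1e-5 --threads 2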
10.2.3 HMMER
We also use HMMER for searching databases and generating alignments (but generally we prefer to use mafft-linsi for alignments ourselves).
The approach taken by HMMER is quite different from blast and diamond as HMMER implements methods using probabilistic models called profile hidden Markov models (profile HMMs).
If you look at some of the HMM databases used in this tutorial, e.g. with less arcog.hmm, you will see that the files look very different compared to what we use as input for blast and diamond. One of the biggest differences is that the profiles used in HMMER searches are built from multiple sequence alignments.
For more information you can check the HMMER website or the hmmer paper.
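For illustration, a basic hmmsearch call against one of the HMM databases could look like this, using the $pfam_db and $cpus variables set at the beginning of this tutorial (the protein file name is a placeholder):
#search all proteins against the Pfam HMMs and write an easily parseable table with --tblout
hmmsearch --tblout pfam_hits.tbl -E 1e-5 --cpu $cpus $pfam_db proteins.faa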
10.3 Information on the different databases
10.4 COGs
COG = Clusters of Orthologous Groups.
The COG protein database was generated by comparing predicted and known proteins in all completely sequenced microbial genomes to infer sets of orthologs. Each COG consists of a group of proteins found to be orthologous across at least three lineages.
The different COGs are assigned to different functional categories and pathways, and the pathway descriptions can be found here.
Some papers you might want to check if you want to know more:
10.5 arCOGs
arCOG = Archaeal Clusters of Orthologous Genes. arCOGs are generated in a similar manner as the COGs, just with a focus on archaeal genomes.
Some papers you might want to check if you want to know more:
10.6 KOs and KEGG
KO = KEGG ORTHOLOGY
The KO (KEGG Orthology) database is a database of molecular functions represented in terms of functional orthologs. We assign K numbers to our proteins by running HMMER/hmmsearch against the KOfam database, a collection of HMM profiles for the KOs.
Details on how the HMMs and cutoffs were generated can be found here
Something useful to know is that all KOs, i.e. K00161, are linked to metabolic maps that have quite good descriptions. For example, if you google K00161 you will come across this website, which gives:
- a detailed description of what gene we have
- if applicable the enzyme number (EC number)
- The pathways this gene can be involved in. If you click on a pathway, you see the gene's exact position in that pathway
- A paper that first describes the gene
If you click on the enzyme number (EC 1.2.4.1) you will get other useful information:
- alternative names
- the “chemical” class it belongs to
- the chemical reaction
- useful comments about the reaction and whether it is part of an enzyme complex
- if it is an enzyme complex, we get the KOs for the other subunits
10.7 Biocyc: Alternative to KEGG to see how a gene is linked to a pathway
If you want to find out more about a gene, you can also check MetaCyc. I.e. if we search for "pyruvate dehydrogenase" biocyc in google we should land here.
Similar as with KEGG, we get some basic information about this enzyme, such as:
- the different subunits
- a basic description of what that enzyme does
- the reaction, incl. the direction
- Functions of the other subunits
- some paper references
10.8 PFAM
PFAM is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).
10.9 TIGRFAMs
The TIGRFAMs database is a resource consisting of curated multiple sequence alignments, Hidden Markov Models (HMMs) for protein sequence classification, and associated information designed to support automated annotation of (mostly prokaryotic) proteins.
10.10 CAZy
CAZy = Carbohydrate-Active enZYmes
A database dedicated to the display and analysis of genomic, structural and biochemical information on Carbohydrate-Active Enzymes (CAZymes).
Resources:
10.11 Transporter DB
This is the website of the Transporter Classification Database (TCDB), which gives more information about potential transporters found in our search.
10.12 HydDB
HydDB = Hydrogenase database
Resources:
- The original paper
- The website we originally got the sequences from. Also provides more detailed information on the different hydrogenase subcategories.
10.13 IPRscan
Short for InterProScan
InterPro allows for the classification of protein families and the prediction of domains and important sites.
If we see an ID like IPR000013, this is a so-called InterPro domain that might be found on your protein. Note that a protein can have more than one domain.
Resources:
- The website from which to download all information
- The original paper, which provides also an overview of the different databases integrated with InterPro.
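For illustration, a minimal interproscan run on a set of proteins could look roughly like this (the file names are placeholders and this is not necessarily the exact call used earlier in the tutorial):
#classify proteins into InterPro families/domains and write a tab-separated output table
interproscan.sh -i proteins.faa -f tsv -o iprscan_results.tsv --cpu 2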
10.14 NCBI-nr
As we have seen above, NCBI (the National Center for Biotechnology Information) provides a lot of tools, such as blast, and databases, such as the COGs and arCOGs. Additionally, it is one of the largest repositories of sequences from metagenomes, amplicon libraries and genomes.
The nr database (here: ncbi_nr) is compiled by the NCBI as a protein database for blast searches. It contains non-identical sequences from GenBank CDS translations, PDB, Swiss-Prot, PIR, and PRF and is frequently updated. However, it is a quite large database, making searches more time-consuming.
10.15 Uniprot - checking your results
Another good database, next to KEGG and metacyc, to check proteins is Uniprot. The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.
We can search for:
- protein names, i.e. “pyruvate dehydrogenase”
- proteins from certain taxa using the taxonomy option, i.e. taxonomy:bacteria pyruvate dehydrogenase
- KOs, arCOGs, etc …
I.e. if we search for taxonomy:bacteria "pyruvate dehydrogenase e1" we should get:
- A list of all proteins classified as E1 subunit of the pyruvate dehydrogenase in bacteria
- some of these sequences are reviewed (i.e. might have some more supporting info) and others are unreviewed (annotation only based on sequence similarity)
- a list of taxa that have this protein (under "View by" click on "Taxonomy"), where we can see in exactly which bacteria this enzyme is found
If we click on an individual example, e.g. from E. coli, we get:
- Details on the reaction
- Publications about this specific enzyme in E.coli
- Co-factors
- Information about potential functions and biological processes
- a reference to MetaCyc
- The links to several databases (i.e. protein databases, KEGG, IPR domains, etc.)