Metagenomics: Binning tutorial
1 Introduction
This tutorial is part of the NIOZ bioinformatics for ecology workshop but can be followed outside of this workshop as well.
The goal of this tutorial is to:
- Work with the contigs generated in the Assembly tutorial. While these two tutorials are designed to go hand in hand, this tutorial can stand on its own and you can either work on your own contigs or on contigs that were prepared and are provided via a repository.
- Do the read mapping and generate coverage files
- Adjust the coverage files for different binning tools
- Bin contigs into metagenome-assembled genomes (MAGs) with Metabat, Maxbin, Concoct, BinSanity
- Combine bins from the different binning tools using DASTool
- Get the genome stats using CheckM
- Get a first overview of the taxonomic assignment by using GTDB-Tk
- Create summary documents for the contigs and bins
For this tutorial we work with assemblies (and the raw reads) from 3 metagenomes. The reason to include more than one metagenome is that most binning tools take into account sequence information as well as coverage information. Coverage information gets more informative if we make use of data from several metagenomes (i.e. metagenomes covered across different depth profiles in our case).
If you generate your own assembly, we strongly recommend that your contigs have a consistent naming format. The format we assume for this workflow is NIOZ114_sc105_1, i.e. sampleID_contigID. Ideally, sequence headers are short and concise, include the sampleID and contain no symbols other than underscores.
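If your contigs do not follow such a pattern yet, a simple awk one-liner can rewrite the headers. This is only a sketch: the input file my_sample_contigs.fna and the sample ID are hypothetical, and the running number scheme just mimics the example format above, so adjust both to your data.
#example (not part of the workflow): rewrite fasta headers into a sampleID_contigID style format
#my_sample_contigs.fna is a hypothetical input file; every sequence gets a running number
awk -v sample="NIOZ114" '/^>/{count++; print ">" sample "_sc" count "_1"; next}{print}' my_sample_contigs.fna > my_sample_contigs_renamed.fna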
All the output data is provided in a Zenodo repository found here. Specifically, you can download the folder Example_Data.tar.gz and have a look at what the output from all individual steps looks like.
General info on this tutorial
During the course of the tutorial you will see examples of how the output can look. These numbers can differ slightly from the output you see yourself. This is OK, since different tool versions might give slightly different results; as long as your results are in a similar range all is good.
To fully understand this tutorial some good background in Bash and AWK is needed. If you want to refresh your background, we provide some intro tutorials for Bash and AWK.
1.1 Used programs and version numbers
Notice:
All of the required tools are pre-installed on the NIOZ servers, you only have to set these up if you work on a different system.
- python 2.7.5
- perl v5.16.3
- awk v4.0.2
- sed 4.2.2
- bwa v0.7.17-r1188
- samtools v1.8
- metabat v2.12.1
- maxbin v2.2.4
- concoct v1.1.0
- Binsanity v0.2.6.4
- DASTool v1.1.0
- CheckM v1.0.11
- GTDBtk v2.2.3 and taxonomy release 207_v2
- rename from util-linux 2.23.2
- join (GNU coreutils) 8.22
1.2 Prepare working directory
To get started, let's change to the directory we want to work in. Here, we assume that you also followed the assembly tutorial and have access to the raw reads as well as the three assemblies from Megahit.
If you work on this tutorial outside of the NIOZ workshop, download the data from Zenodo. Detailed instructions can be found below.
#set variable for the working directory, personal user directory and folder with the example data
#here: change the user_name to whatever your folder name is in the bioinfo_workshop folder
user_name="ndombrowski"
wdir="/export/lv3/scratch/bioinformatics_workshop/Users/"
download_dir="/export/lv3/scratch/bioinformatics_workshop/S11_binning"
#go to your working directory
cd ${wdir}/${user_name}
#generate a folder in which we want to generate our bins
mkdir Binning
cd Binning
#Prepare folders in which we store the data
mkdir -p 0_assembly_and_reads/1_reads/
mkdir 0_assembly_and_reads/2_assembly/
#create symbolic links to quality cleaned raw reads
#(the cleaned reads are part of the files generated during the assembly tutorial
# we just created a symbolic link to have all files organized in the same folder)
ln -sf $(pwd)/../Assemblies/3_trimmomatic/NIOZ1* 0_assembly_and_reads/1_reads/
#create symbolic link to the assembly we work with
#(which are part of the files generated during the assembly tutorial)
ln -sf $(pwd)/../Assemblies/4_assemblies/megahit/* 0_assembly_and_reads/2_assembly/
#create symbolic link to the scripts folder
ln -s $download_dir/scripts/ .
Click me if you use the tutorial outside of the workshop and want to use the Zenodo files
#set a variable to the directory from which we want to work
wdir="/export/lv1/user/ndombrowski/Binning"
#go to our working directory
cd $wdir
#download the data
curl "https://zenodo.org/record/7759615/files/0_assembly_and_reads.tar.gz?download=1" --output 0_assembly_and_reads.tar.gz
#if needed, this is the link to download the mapping data
#curl "https://zenodo.org/record/7759615/files/1_mapping.tar.gz?download=1" --output 1_mapping.tar.gz
#unzip the data
ls *.gz | xargs -n1 tar -xzf
#get custom scripts from github
mkdir scripts
wget https://raw.githubusercontent.com/ndombrowski/Binning_tutorial/main/scripts/length%2BGC.pl -O scripts/length+GC.pl
wget https://raw.githubusercontent.com/ndombrowski/Binning_tutorial/main/scripts/rename.pl -O scripts/rename.pl
wget https://raw.githubusercontent.com/ndombrowski/Binning_tutorial/main/scripts/number_bins_consistently.py -O scripts/number_bins_consistently.py
#cleanup folder
rm *gz
1.3 Prepare files that list the metagenomes we want to loop through
First, we prepare a file that lists the samples we work with. In our case we work with assemblies that were generated from three metagenomes coming from samples called NIOZ114, NIOZ118 and NIOZ130.
#prepare folder
mkdir FileLists
#generate a new document in which we list our samples of interest
echo -e "NIOZ114\nNIOZ118\nNIOZ130" > FileLists/SampleList
Next, we need to prepare another mapping file:
For the binning process, we want to know, for each assembly, how abundant each contig is in our 3 metagenomes. Therefore, we make a table with two columns that lists all 9 possible combinations. We can generate such a list automatically from the file we generated above, FileLists/SampleList.
#prepare a file with matching samples
awk -F'\t' -v OFS='\t' '{ a[$0] }END {for (i in a){for (j in a){ print (i, j)}}}' FileLists/SampleList | sort > FileLists/SampleList_matching
The file we generated should look like this (each column should be separated with a tab).
NIOZ114 NIOZ114
NIOZ114 NIOZ118
NIOZ114 NIOZ130
NIOZ118 NIOZ114
NIOZ118 NIOZ118
NIOZ118 NIOZ130
NIOZ130 NIOZ114
NIOZ130 NIOZ118
NIOZ130 NIOZ130
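As a quick, optional sanity check (not part of the original workflow), we can count the lines of the file; for three samples we expect 9 combinations:
#count the sample combinations; for 3 samples this should return 9
wc -l FileLists/SampleList_matching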
2 Do the read mapping
2.1 Map the reads
#create an index for our assemblies
for sample in `cat FileLists/SampleList`; do bwa index 0_assembly_and_reads/2_assembly/${sample}/${sample}_contigs_2500.fna; done
#prepare folders in which we want to store the results from bwa, one for each metagenome we are looking at
mkdir 1_mapping
for i in `cat FileLists/SampleList`; do mkdir 1_mapping/$i; done
#create the mapping files (this step is time consuming, so just do this on your own time, not during the actual exercise)
while read sample1 sample2; do bwa mem -t 4 0_assembly_and_reads/2_assembly/${sample1}/${sample1}_contigs_2500.fna 0_assembly_and_reads/1_reads/${sample2}_R1_paired.fastq.gz 0_assembly_and_reads/1_reads/${sample2}_R2_paired.fastq.gz | gzip -3 > 1_mapping/${sample1}/${sample1}_v_${sample2}.sam.gz; done < FileLists/SampleList_matching
#check what files were generated in this step
ll 1_mapping/NIOZ114/*sam.gz
For each metagenome, the bwa mem step writes its output into the corresponding folder in 1_mapping, and each folder now contains three files.
The three files are all the different mapping combinations we generated for that assembly. I.e., for the NIOZ114 folder: the NIOZ114 contigs with the reads from NIOZ114, NIOZ118 and NIOZ130 mapped against them.
Info on the used code:
BWA is a short read aligner that can take a reference genome or assembly from a metagenome and map single- or paired-end sequence data to it (Li et al., 2009). There are other tools out there, such as bowtie, so you can always switch to your favorite mapping tool.
To run bwa we only need to provide:
- Our assembly and an index for the assembly (the index is generated with bwa index in the step before running bwa mem)
- Our raw reads (R1 and R2 for the forward and reverse reads)
To explain why we need to index our assembly, let me cite part of a discussion from biostars: “Indexing a genome can be explained similar to indexing a book. If you want to know on which page a certain word appears or a chapter begins, it is much more efficient/faster to look it up in a pre-built index than going through every page of the book until you found it. Same goes for alignments. Indices allow the aligner to narrow down the potential origin of a query sequence within the genome, saving both time and memory.”
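For reference, bwa index writes five auxiliary files (with the extensions .amb, .ann, .bwt, .pac and .sa) next to the assembly, so a quick way to confirm that the index was created is:
#list the index files generated by bwa index for one assembly
ls 0_assembly_and_reads/2_assembly/NIOZ114/NIOZ114_contigs_2500.fna.*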
To keep our files small (the resulting sam files can get huge, watch out for this) it is useful to immediately zip the newly generated files up.
General info on read mapping:
Mapping the reads of a metagenome to a reference genome or assembly is important for the binning process but can also be used to get an idea about the relative abundance of a contig. During the mapping process each read is assigned to a specific location of the assembly and thus we can count how “abundant” each contig is.
In the example above, we can see three contigs: Contig 1 with relatively high coverage and Contigs 2 and 3 with lower coverage. Binning tools generally assume that contigs from the same organism have a similar abundance and thus might group Contigs 2 and 3, but not Contig 1, into the same MAG.
However, in real life read mapping can be much more variable, and some good discussion on this can be found here.
The sam file format
BWA produces a mapping file in sam format. If we look at the files, they start with a number of header lines carrying different kinds of information.
Below the header, each line describes one read: where the read mapped within the reference genome/assembly and the quality of the mapping.
2.2 Create sorted bam files and count reads as a control
Next, we want to convert the files from sam to bam format and count how many reads are recruited by each contig with samtools. The bam format is a compressed version of the sam format and is helpful for storing sam files and extracting information about the reads.
#change folder
cd 1_mapping
#create bam files and make index
ls N*/*sam.gz | parallel -j3 "samtools view -hb -F4 {} | samtools sort -o {.}_sort.bam - ; samtools index {.}_sort.bam"
#count how many reads mapped to our contigs
#the output file contains the following info: contig ID, contig length, # mapped reads and # unmapped reads
for sample in `ls N*/*_sort.bam`; do samtools idxstats ${sample} > ${sample}_mapped.txt ; done
#return to wdir
cd ..
#check what we did by looking at the output
#each column contains the following info: contig, contig length, mapped reads, unmapped reads
head 1_mapping/NIOZ114/NIOZ114_v_NIOZ114.sam_sort.bam_mapped.txt
The mapping results should look similar to the example below. The columns give the following info for each contig:
- the contig name
- the contig length
- the number of NIOZ114 reads mapped to that contig
- the number of unmapped reads (0 in our case, because we discarded unmapped reads when running samtools view -hb -F4; we explain this in a bit more detail below)
Info on the used code:
Samtools allows reading/writing/editing/indexing/viewing files in SAM/BAM/CRAM format.
- samtools view: Tool to view and convert sam/bam files.
- samtools sort: Sort alignments by leftmost coordinates, or by read name when -n is used.
- samtools index: Builds an index of the alignment.
- samtools idxstats: Reports alignment summary statistics; needs a sorted and indexed bam file as input.
Options used when running samtools view (this step is relatively time consuming, so during the workshop you might work with pre-computed files instead):
- h : Include the header in the output.
- b : Output in the BAM format.
- f : Only output alignments with the given FLAG.
- F : Do not output alignments with the given FLAG.
A flag value is a number that indicates, among other things, whether a read is mapped or unmapped. For example, 4 is the FLAG for unmapped reads. Using -F 4 means that we discard unmapped reads (which is why the fourth column of our mapping summary is 0). Using -f 4 means that we only keep unmapped reads. If you want to use a specific flag, you can find all the combinations here.
Once we have the bam file, we can also delete the original sam file, as it requires a lot of space and we can always recreate it from the bam file.
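As an optional sketch that ties in with the flag discussion above: this is how you could count unmapped reads with -f 4 and how to recreate a sam file (including the header) from a bam file. Note that the bam files we generated above were created with -F4, so the unmapped-read count will be 0 for them.
#count the unmapped reads in a bam file (flag 4 = read unmapped)
samtools view -c -f 4 1_mapping/NIOZ114/NIOZ114_v_NIOZ114.sam_sort.bam
#recreate a sam file, including the header, from a bam file
samtools view -h 1_mapping/NIOZ114/NIOZ114_v_NIOZ114.sam_sort.bam > NIOZ114_v_NIOZ114.sam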
#get the read depth for all positions of the assembly
samtools depth 1_mapping/NIOZ114/NIOZ114_v_NIOZ114.sam_sort.bam | gzip > 1_mapping/NIOZ114/NIOZ114_depth.txt.gz
#lets extract the depth info for a specific contig
zcat 1_mapping/NIOZ114/NIOZ114_depth.txt.gz | egrep '^NIOZ114_sc5544_1' | gzip > 1_mapping/NIOZ114/NIOZ114_single_contig_depth.txt.gz
#open R
R
#read the data in
x <- read.table('1_mapping/NIOZ114/NIOZ114_single_contig_depth.txt.gz', sep='\t', header=FALSE, strip.white=TRUE)
# Look at the beginning of the data we have just read in
head(x)
# calculate average depth
mean(x[,3])
# calculate the std dev
sqrt(var(x[,3]))
# mark areas that have a coverage below 5 in red
#plot(x[,2], x[,3], col = ifelse(x[,3] < 5,'red','black'), pch=19, xlab='position', ylab='coverage')
# to save a plot; mark areas that have a coverage below 5 in red
png('1_mapping//NIOZ114_sc1105_1.png', width = 1200, height = 500)
plot(x[,2], x[,3], col = ifelse(x[,3] < 5,'red','black'), pch=19, xlab='position', ylab='coverage')
dev.off()
#exit
quit()
#view plot (you might need to use a tool specific to your system to view the png)
xdg-open 1_mapping/NIOZ114_sc1105_1.png
If you run this, you would see something like this:
We can see that:
- we do have quite some variability (we provided a link that discusses these issues in the read mapping section above)
- Especially the ends of a contig have a low coverage. This might be due to losing read pairs, or this region might have an abnormal coverage due to genetic differences (i.e. we might hit a repeat region). This also explains why the contig breaks at this point: there are no reads linking the end of this contig to another contig.
Generally, it is good to keep in mind that aligners are not good at aligning partial reads, so we will lose a lot of reads at the contig ends because of that.
Homework
If you run the full workflow and generate your own bam files: try to generate a file that also includes the unmapped reads (remember to change the flags to select which reads you want to exclude or include). Open the mapping summary and check what the difference is compared to when we discarded the reads.
Click me to see an answer
You can do this by removing the -F4 flag in the samtools view command.
First our file should have looked like this:
NIOZ114_sc48_1 3180 476 0
Now, we get this:
NIOZ114_sc48_1 3180 476 2
Unmapped reads are given the mapping coordinates of their mapped mate. This means that in two cases only one read of a forward/reverse pair mapped. This could be a mapping error, or the mapped read sits at the end of our contig and during the assembly we were not able to get a longer contig that would also have included the other read of the pair.
2.3 Create depth files
Next, we generate so-called depth files using the script jgi_summarize_bam_contig_depths. The depth files summarize the data for all metagenomes into one file. This script also does some normalization of the reads.
#create depth file (we will use this for the binning with metabat)
for i in `cat FileLists/SampleList`; do jgi_summarize_bam_contig_depths --outputDepth 1_mapping/${i}/${i}_depth_metabat.txt --pairedContigs 1_mapping/${i}/paired.txt 1_mapping/${i}/*.bam ; done
#check the file generated
head 1_mapping/NIOZ114/NIOZ114_depth_metabat.txt
The file we generate looks like this and contains:
- the contig id
- the contig length
- the total average depth
- the depth calculated by mapping the NIOZ114 reads against the NIOZ114 contigs
- the variance in the data if we map the NIOZ114 reads against the NIOZ114 contigs
- the depth calculated by mapping the NIOZ118 reads against the NIOZ114 contigs
- the variance in the data if we map the NIOZ118 reads against the NIOZ114 contigs
- ….
Info on the used code:
jgi_summarize_bam_contig_depths (part of the MetaBAT package) normalizes the output of depth coverages by contig length and total number of reads as well as includes a correction of the raw count by a number of factors to give a more realistic metric considering the nature of reads, errors, and strain variations present in the data and samples:
- The edges of a contig are generally excluded from the coverage counts, up to a default of 75 bases or the average read length (--includeEdgeBases, --maxEdgeBases). This is because mappers generally have a difficult time aligning a partial read to a contig when it would extend off the edge, and the coverage ramps up from 0 to the true coverage in this region.
- Reads that map imperfectly are excluded when the %ID of the mapping drops below a threshold (--percentIdentity=97). MetaBAT is designed to resolve strain variation, and mapping reads with low %ID indicates that the read actually came from a different strain/species.
- Clips/insertions/deletions/mismatches are excluded from the coverage count: only the read bases that exactly match the reference are counted as coverage.
A comment on why we do not use the output from samtools that we have used before from the tool developer: We specifically do not use samtools depth because it does not generate a great estimate of the relative abundance that MetaBAT requires for good separation of samples. When a contig base has 0 coverage, then samtools depth neglects to report that, throwing off some calculations and when a read poorly maps samtools depth does count it, again foiling some of the species and strain variation nuances.
3 Prepare the depth files
Different binning tools generally need different input files. This is also the case for the binners we will use, which differ in how they want the coverage info formatted. We will use some bash to prepare the depth files for each specific tool.
3.1 Create a depth file for MetaBAT
The output we generated with jgi_summarize_bam_contig_depths is exactly what MetaBAT needs, so we do not need to do anything here.
3.2 Create a depth file for CONCOCT (using the metabat depth file as input)
Concoct does not care about the variance info, so we need to remove it from this depth profile.
for i in `cat FileLists/SampleList`; do awk 'NR > 1 {for(x=1;x<=NF;x++) if(x == 1 || (x >= 4 && x % 2 == 0)) printf "%s", $x (x == NF || x == (NF-1) ? "\n":"\t")}' 1_mapping/${i}/${i}_depth_metabat.txt > 1_mapping/${i}/${i}_depth_concoct.txt; done
#check file
head 1_mapping/NIOZ114/NIOZ114_depth_concoct.txt
The file looks like this and contains:
- the contig id
- NIOZ114 assembly mapped against NIOZ114 reads
- NIOZ114 assembly mapped against NIOZ118 reads
- NIOZ114 assembly mapped against NIOZ130 reads
3.3 Create a depth file for MaxBin (using metabat depth file as input)
Maxbin does not really use coverage info across samples, so we only keep the coverage info for matching samples and contigs. I.e. we only want the coverage of contig NIOZ114_sc48_1 against the raw reads of the NIOZ114 metagenome, or of contig NIOZ130_sc6732_1 against the NIOZ130 metagenome.
#get coverage info for the right samples
for i in `cat FileLists/SampleList`; do awk -F '\t' -v col="${i}_v_${i}.sam_sort.bam" 'NR==1{for (i=1; i<=NF; i++) if ($i == col){c=i; break}}{print $1"\t"$c}' 1_mapping/${i}/${i}_depth_metabat.txt | sed 1d > 1_mapping/${i}/${i}_depth_maxbin.txt; done
#check file
head 1_mapping/NIOZ114/NIOZ114_depth_maxbin.txt
The file looks like this and contains:
- the contig id
- NIOZ114 assembly mapped against NIOZ114 reads
3.4 Create depth file for BinSanity
Binsanity comes with its own algorithm to create depth profiles, so we will just use that.
Notice: Binsanity-profile (similar to the jgi script we used before) does some normalization using featureCounts. Additionally, it does some data transformation; by default the data is log transformed.
#create dirs for binsanity
mkdir 1_mapping/2_binning
for i in `cat FileLists/SampleList`; do mkdir 1_mapping/2_binning/${i}; done
for i in `cat FileLists/SampleList`; do mkdir 1_mapping/2_binning/${i}/BinSanity; done
#get contigs IDs in the format that BinSanity prefers
for i in `cat FileLists/SampleList`; do get-ids -f 0_assembly_and_reads/2_assembly/${i}/ -l ${i}_contigs_2500.fna -o 0_assembly_and_reads/2_assembly/${i}/${i}_ids.txt; done
#create BinSanity depth profiles
for i in `cat FileLists/SampleList`; do Binsanity-profile -i 0_assembly_and_reads/2_assembly/${i}/${i}_contigs_2500.fna -s 1_mapping/${i}/ -T 1 --ids 0_assembly_and_reads/2_assembly/${i}/${i}_ids.txt -c 1_mapping/${i}/${i}_depth_binsanity -o 1_mapping/2_binning/${i}/BinSanity; done
#check file
head 1_mapping/NIOZ114/NIOZ114_depth_binsanity.cov
If you run this on your own, the coverage files, i.e. 1_mapping/NIOZ114/NIOZ114_depth_binsanity.cov, should look something like this:
- the contig id
- NIOZ114 assembly mapped against NIOZ114 reads
- NIOZ114 assembly mapped against NIOZ118 reads
- NIOZ114 assembly mapped against NIOZ130 reads
3.5 Add depth info into assembly stats
Now, since we know the read depth for each contig, we can add this information to our general assembly statistics (if you followed the assembly tutorial).
Here, we just use our bash powers to make a summary for each contig that lists the GC, length and coverage info. Having such information in a table is always useful because, from experience, you will need it at a later point.
For each contig we already provide a file that includes the contig length and contig GC, so we only need to merge this with the depth info we calculated.
#combine mapping stats from the different assemblies into one document
cat 1_mapping/N*/*_depth_maxbin.txt > 0_assembly_and_reads/2_assembly/AllContigs/All_contigs_2500_depth.txt
#control how many counts we have
wc -l 0_assembly_and_reads/2_assembly/AllContigs/All_contigs_2500_depth.txt
The file contains 1245 rows, which should be consistent with the total number of contigs we have.
#merge info on depth and length and GC (these files we have generated during the assembly tutorial)
awk 'BEGIN{OFS="\t"}FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' 0_assembly_and_reads/2_assembly/AllContigs/All_contigs_2500_depth.txt 0_assembly_and_reads/2_assembly/AllContigs/All_contigs_2500_length_GC.txt | awk 'BEGIN{OFS="\t"}{print $1,$2,$3,$5}' > 0_assembly_and_reads/2_assembly/AllContigs/All_Contig_info.tsv
#add in a header
echo -e "contigName\tGC\tLength\tAbundanceAcrossSamples" | cat - 0_assembly_and_reads/2_assembly//AllContigs/All_Contig_info.tsv > 0_assembly_and_reads/2_assembly//AllContigs/All_Contig_info_header.tsv
#check file
head 0_assembly_and_reads/2_assembly//AllContigs/All_Contig_info_header.tsv
The file should look like this:
4 Start binning
Some tools take a bit longer to run, so instead of using all four binning tools we will only run 2 of the 4, and you can run Concoct and BinSanity on your own if you are interested.
The tools are (during this tutorial we will only use Metabat and Maxbin):
- Metabat
- Concoct
- Maxbin
- BinSanity
General notice:
You will notice that for each tool we have more or less the same workflow:
- Get the correct depth file format
- Use the binning tool
- Give the generated bins some more informative names
- Make a file that links each contig to the bin it was assigned to
As such, the workflow can be adjusted easily and you can remove or add binning tools as needed.
4.1 Prepare a folder for each sample
#prepare folder to store the binning results
mkdir 2_binning
#prepare folder to store the binning results for each metagenome
for i in `cat FileLists/SampleList`; do mkdir 2_binning/$i; done
4.2 Metabat
MetaBAT is a robust statistical framework for reconstructing genomes from metagenomic data (Kang et al., 2015)
4.2.1 Start binning
#prepare folders
for i in `cat FileLists/SampleList`; do mkdir 2_binning/${i}/Metabat; done
for i in `cat FileLists/SampleList`; do mkdir 2_binning/${i}/Metabat/bins; done
#run binning tool
for i in `cat FileLists/SampleList`; do metabat2 -t 2 -i 0_assembly_and_reads/2_assembly/${i}/${i}_contigs_2500.fna -a 1_mapping/${i}/${i}_depth_metabat.txt -o 2_binning/${i}/Metabat/bins; done
#clean files up by moving the bins into another folder
#the goal here is to have the same folder layout for all binning tools
for i in `cat FileLists/SampleList`; do mv 2_binning/${i}/Metabat/*fa 2_binning/${i}/Metabat/bins/; done
#check how bins are named
ll 2_binning/NIOZ1*/Metabat/bins/*
#check the content of a file to see how the headers look
head 2_binning/NIOZ114/Metabat/bins/bins.1.fa
#count number of bins generated per metagenome
for i in `cat FileLists/SampleList`; do ll 2_binning/${i}/Metabat/bins/*fa | wc -l; done | paste FileLists/SampleList -
We get the following number of bins per sample:
NIOZ114: 2
NIOZ118: 1
NIOZ130: 2
If we look at the bin names, they are not very informative, i.e. a name looks something like bins.1.fa.
So in the next step we add some additional information and also make sure, in the later steps, that every binning tool generates bin names in a consistent manner. For example, some tools produce files ending in .fa and others in .fasta, which is not ideal. We also want to remove extra symbols, such as dots, and add the binning tool that was used to the file name. The final name will look something like mb_b1.fa.
Similarly, when we check the file header we should see something like >NIOZ114_sc464_1. This is basically the contig name. However, it is best to also include a bin ID at a later step, which would then allow us to concatenate all genomes for downstream analyses.
In short: Think about naming as soon as possible in a project and make sure that you have informative file names as well as sequence headers.
4.2.2 Do cosmetics: Give bins a better name
#make backup in case renaming gets messed up
for i in `cat FileLists/SampleList`; do mkdir 2_binning/${i}/Metabat/bins/backup ; done
for i in `cat FileLists/SampleList`; do cp 2_binning/${i}/Metabat/bins/* 2_binning/${i}/Metabat/bins/backup ; done
#cleanup file names
for i in `cat FileLists/SampleList`; do rename bins. mb_b 2_binning/${i}/Metabat/bins/*fa; done
#check how the bins are named after our changes; we should now have mb_b1.fa instead of bins.1.fa
ll 2_binning/NIOZ1*/Metabat/bins/*fa
#check file header (nothing should have changed here)
head 2_binning/NIOZ114/Metabat/bins/mb_b1.fa
#count number of bins/metagenome (the numbers should be exactly the same)
for i in `cat FileLists/SampleList`; do ll 2_binning/${i}/Metabat/bins/*fa | wc -l; done | paste FileLists/SampleList -
This is a control step to see whether the genomes were renamed correctly. Therefore, the numbers should be exactly the same as in the example above.
NIOZ114: 2
NIOZ118: 1
NIOZ130: 2
If there is some issue and the file names do not include the binning tool, i.e. you do not have something like mb_b1.fa, then double check that the code is correct and what version of rename you are using.
4.2.3 Prepare a list of contigs - bins and cleanup
Next, we need a list that links the BinID with the contigs in each bin (for DAS Tool, which we will use later). Additionally, we will remove some files we do not need any more.
#make a list that lists for each MAG what contigs were included
for i in `cat FileLists/SampleList`; do awk '/>/{sub(">","&"FILENAME"-");sub(/\.fa/,x)}1' 2_binning/${i}/Metabat/bins/*fa | grep ">" | sed 's/>//g' | awk 'BEGIN{FS="\t";OFS="\t"}{split($1,a,"-")}{print a[2],a[1]}' | awk 'BEGIN{FS="\t";OFS="\t"}{split($2,a,"/")}{print $1,a[5]}' > 2_binning/${i}/Metabat/Metabat_contig_list.txt; done
#check how the contig list looks
head 2_binning/NIOZ114/Metabat/Metabat_contig_list.txt
#rm backup data
for i in `cat FileLists/SampleList`; do rm -fr 2_binning/${i}/Metabat/bins/backup ; done
#gzip fasta files
for i in `cat FileLists/SampleList`; do gzip 2_binning/${i}/Metabat/bins/*fa ; done
The contig list, which we stored in 2_binning/NIOZ114/Metabat/Metabat_contig_list.txt, should look like this and contains two columns:
- all binned contigs (i.e. NIOZ114_sc464_1)
- the bin each contig was assigned to (i.e. mb_b1)
When checking these files, make sure that the updated bin name is used, i.e. we want to see something like mb_b1 and nothing generic like bin1.
4.3 Concoct
CONCOCT is a program for unsupervised binning of metagenomic contigs by using nucleotide composition, coverage data in multiple samples and linkage data from paired end reads (Alneberg et al., 2014)
4.3.1 Start binning
#start concoct
source $HOME/.bashrc.conda3 concoct
#create directories
for i in `cat FileLists/SampleList`; do mkdir 2_binning/${i}/Concoct; done
for i in `cat FileLists/SampleList`; do mkdir 2_binning/${i}/Concoct/bins; done
#run concoct
for i in `cat FileLists/SampleList`; do concoct -t 4 --composition_file 0_assembly_and_reads/2_assembly/${i}/${i}_contigs_2500.fna --coverage_file 1_mapping/${i}/${i}_depth_concoct.txt -b 2_binning/${i}/Concoct; done
#prepare file to get the clustered contigs
for i in `cat FileLists/SampleList`; do merge_cutup_clustering.py 2_binning/${i}/Concoct/clustering_gt1000.csv > 2_binning/${i}/Concoct/clustering_merged.csv; done
#check how the merged clustering file looks
head 2_binning/NIOZ114/Concoct/clustering_merged.csv
#extract bins
for i in `cat FileLists/SampleList`; do extract_fasta_bins.py 0_assembly_and_reads/2_assembly/${i}/${i}_contigs_2500.fna 2_binning/${i}/Concoct/clustering_merged.csv --output_path 2_binning/${i}/Concoct/bins; done
#check what files were generated
ll 2_binning/NIOZ1*/Concoct/bins/*fa
#check how the fasta header looks of a file
head 2_binning/NIOZ114/Concoct/bins/0.fa
#count nr bins/sample
for i in `cat FileLists/SampleList`; do ll 2_binning/${i}/Concoct/bins/*fa | wc -l; done | paste FileLists/SampleList -
#deactivate conda and anaconda environments
conda deactivate
conda deactivate
We might have something like this:
NIOZ114: 13
NIOZ118: 8
NIOZ130: 23
4.3.2 Cosmetics: Make bin name more informative
#make backup
for i in `cat FileLists/SampleList`; do mkdir 2_binning/${i}/Concoct/bins/backup ; done
for i in `cat FileLists/SampleList`; do cp 2_binning/${i}/Concoct/bins/* 2_binning/${i}/Concoct/bins/backup ; done
#restore original files if needed
#for i in `cat FileLists/SampleList`; do cp 2_binning/${i}/Concoct/bins/backup/*fa 2_binning/${i}/Concoct/bins ; done
#cleanup file names
perl scripts/rename.pl 's/(.*)\/(.*)\/(.*)\//$1\/$2\/$3\/cc_b/' 2_binning/N*/Concoct/bins/*fa
#check if this went ok and how the file names look now
ll 2_binning/NIOZ1*/Concoct/bins/*fa
#count number of bins/metagenome to confirm all is good
for i in `cat FileLists/SampleList`; do ll 2_binning/${i}/Concoct/bins/*fa | wc -l; done | paste FileLists/SampleList -
After renaming, we should have exactly the same numbers as above:
NIOZ114: 13
NIOZ118: 8
NIOZ130: 23
4.3.3 Prepare Contig - Bin list and cleanup files
#make contig list
for i in `cat FileLists/SampleList`; do awk '/>/{sub(">","&"FILENAME"-");sub(/\.fa/,x)}1' 2_binning/${i}/Concoct/bins/*fa | grep ">" | sed 's/>//g' | awk 'BEGIN{FS="\t";OFS="\t"}{split($1,a,"-")}{print a[2],a[1]}' | awk 'BEGIN{FS="\t";OFS="\t"}{split($2,a,"/")}{print $1,a[5]}' > 2_binning/${i}/Concoct/Concoct_contig_list.txt; done
#control how the output looks
head 2_binning/NIOZ114/Concoct/Concoct_contig_list.txt
#clean up files
for i in `cat FileLists/SampleList`; do gzip 2_binning/${i}/Concoct/bins/*fa ; done
#rm backup
for i in `cat FileLists/SampleList`; do rm -rf 2_binning/${i}/Concoct/bins/backup ; done
4.4 Maxbin
MaxBin is an automated binning method that recovers individual genomes from metagenomes using an expectation-maximization algorithm (Wu et al., 2014)
4.4.1 Start binning
#create dirs
for i in `cat FileLists/SampleList`; do mkdir 2_binning/${i}/Maxbin; done
for i in `cat FileLists/SampleList`; do mkdir 2_binning/${i}/Maxbin/bins; done
#run Maxbin
for i in `cat FileLists/SampleList`; do run_MaxBin.pl -thread 2 -contig 0_assembly_and_reads/2_assembly/${i}/${i}_contigs_2500.fna -out 2_binning/${i}/Maxbin/mx_b -abund 1_mapping/${i}/${i}_depth_maxbin.txt; done
#check the file names
ll 2_binning/NIOZ1*/Maxbin/*fasta
#check the fasta header
head $(ls -1 2_binning/NIOZ1*/Maxbin/*fasta | head -1)
#count number of bins/metagenome
for i in `cat FileLists/SampleList`; do ll 2_binning/${i}/Maxbin/*fasta | wc -l; done | paste FileLists/SampleList -
We might have something like this:
NIOZ114: 2
NIOZ118: 2
NIOZ130: 2
4.4.2 Cosmetics: Rename bins
Specifically, we want to remove unwanted symbols, such as dots, and we want the file names to have the extension .fa instead of .fasta. That way we ensure that MAGs generated with the different binning tools all end with .fa.
#clean up
for i in `cat FileLists/SampleList`; do mv 2_binning/${i}/Maxbin/*fasta 2_binning/${i}/Maxbin/bins/; done
#make backup in case renaming gets messed up
for i in `cat FileLists/SampleList`; do mkdir 2_binning/${i}/Maxbin/bins/backup ; done
for i in `cat FileLists/SampleList`; do cp 2_binning/${i}/Maxbin/bins/* 2_binning/${i}/Maxbin/bins/backup ; done
#rename: instead of mx_b.001.fasta we want sth like: mx_b001.fa
for i in `cat FileLists/SampleList`; do perl scripts/rename.pl "s/mx_b\./mx_b/" 2_binning/${i}/Maxbin/bins/*.fasta; done
for i in `cat FileLists/SampleList`; do rename fasta fa 2_binning/${i}/Maxbin/bins/mx_*; done
#check if all is fine
ll 2_binning/NIOZ*/Maxbin/bins/*fa
#count number of bins/metagenome to confirm all is good
for i in `cat FileLists/SampleList`; do ll 2_binning/${i}/Maxbin/bins/*fa | wc -l; done | paste FileLists/SampleList -
After renaming, we want to have the same nr of bins as before:
NIOZ114: 2
NIOZ118: 2
NIOZ130: 2
4.4.3 Prepare Contig - Bin list and cleanup
#make contig list
for i in `cat FileLists/SampleList`; do awk '/>/{sub(">","&"FILENAME"-");sub(/\.fa/,x)}1' 2_binning/${i}/Maxbin/bins/*fa | grep ">" | sed 's/>//g' | awk 'BEGIN{FS="\t";OFS="\t"}{split($1,a,"-")}{print a[2],a[1]}' | awk 'BEGIN{FS="\t";OFS="\t"}{split($2,a,"/")}{print $1,a[5]}' > 2_binning/${i}/Maxbin/Maxbin_contig_list.txt; done
#check if the output looks ok
head 2_binning/NIOZ114/Maxbin/Maxbin_contig_list.txt
#clean up files
for i in `cat FileLists/SampleList`; do gzip 2_binning/${i}/Maxbin/bins/*fa ; done
#rm backup
for i in `cat FileLists/SampleList`; do rm -rf 2_binning/${i}/Maxbin/bins/backup ; done
4.5 BinSanity
BinSanity: unsupervised clustering of environmental microbial assemblies using coverage and affinity propagation (Graham et al., 2017)
4.5.1 Start binning
#prepare folders
for i in `cat FileLists/SampleList`; do mkdir 2_binning/${i}/BinSanity; done
for i in `cat FileLists/SampleList`; do mkdir 2_binning/${i}/BinSanity/REFINED-BINS; done
#bin
parallel -j3 'i={}; Binsanity-wf --threads 1 -x 2500 -f 0_assembly_and_reads/2_assembly/${i}/ -l ${i}_contigs_2500.fna -c 1_mapping/${i}/${i}_depth_binsanity.cov.x100.lognorm -o 2_binning/${i}/BinSanity' ::: `cat FileLists/SampleList`
#clean up
for i in `cat FileLists/SampleList`; do mv 2_binning/${i}/BinSanity/BinSanity-Final-bins 2_binning/${i}/BinSanity/bins; done
#check the file names
ll 2_binning/NIOZ1*/BinSanity/bins/*fna
#count number of bins/metagenome
for i in `cat FileLists/SampleList`; do ll 2_binning/${i}/BinSanity/bins/*fna | wc -l; done | paste FileLists/SampleList -
Afterwards, we might have something like this:
NIOZ114: 3
NIOZ118: 2
NIOZ130: 2
4.5.2 Rename bins
#make backup in case renaming gets messed up
for i in `cat FileLists/SampleList`; do mkdir 2_binning/${i}/BinSanity/bins/backup ; done
for i in `cat FileLists/SampleList`; do cp 2_binning/${i}/BinSanity/bins/* 2_binning/${i}/BinSanity/bins/backup ; done
#rename files
for i in `cat FileLists/SampleList`; do perl scripts/rename.pl "s/fna/fa/" 2_binning/${i}/BinSanity/bins/*.fna; done
for i in `cat FileLists/SampleList`; do perl scripts/rename.pl "s/${i}_contigs_2500_Bin-/bs_b/" 2_binning/${i}/BinSanity/bins/*.fa; done
for i in `cat FileLists/SampleList`; do perl scripts/rename.pl -- "s/\-refined_/r/g" 2_binning/${i}/BinSanity/bins/*fa; done
#check the file names
ll 2_binning/NIOZ1*/BinSanity/bins/*fa
#count and remove the low completion bins (here: 2)
ll 2_binning/NIOZ1*/BinSanity/bins/low_completion* | wc -l
rm -f 2_binning/NIOZ1*/BinSanity/bins/low_completion*
#count number of bins/metagenome to confirm all is good
for i in `cat FileLists/SampleList`; do ll 2_binning/${i}/BinSanity/bins/*fa | wc -l; done | paste FileLists/SampleList -
After renaming we should have the same number of bins as before (beware, we might have fewer bins since we removed some low completion bins):
NIOZ114: 2
NIOZ118: 1
NIOZ130: 2
4.5.3 Create Contig - Bin list and cleanup
#make contig list
for i in `cat FileLists/SampleList`; do awk '/>/{sub(">","&"FILENAME"-");sub(/\.fa/,x)}1' 2_binning/${i}/BinSanity/bins/bs* | grep ">" | sed 's/>//g' | awk 'BEGIN{FS="\t";OFS="\t"}{split($1,a,"-")}{print a[2],a[1]}' | awk 'BEGIN{FS="\t";OFS="\t"}{split($2,a,"/")}{print $1,a[5]}' > 2_binning/${i}/BinSanity/Binsanity_contig_list.txt; done
#rm backup
for i in `cat FileLists/SampleList`; do rm -rf 2_binning/${i}/BinSanity/bins/backup ; done
#gzip fasta files
for i in `cat FileLists/SampleList`; do gzip 2_binning/${i}/BinSanity/bins/*fa ; done
5 Combine bins using DASTool
Now that we ran the different binning tools, we want to find out which tool worked best and keep, for each genome, the representative bin with the best genome statistics.
5.1 Run DASTool
DAS: Recovery of genomes from metagenomes via a dereplication, aggregation and scoring strategy (Sieber et al., 2018)
Notice:
- Since we did not run BinSanity and Concoct, you need to change the DAS_Tool command below if you want to include these or other tools when you run this on your own
- This will take a bit to run, so have patience
#make dirs
mkdir 3_DASTool
for i in `cat FileLists/SampleList`; do mkdir 3_DASTool/${i}; done
#run DAS, if we already have the proteins we can add ``--proteins 3_DASTool/${i}_${i}_proteins.faa``
for i in `cat FileLists/SampleList`; do DAS_Tool -i 2_binning/${i}/Metabat/Metabat_contig_list.txt,2_binning/${i}/Maxbin/Maxbin_contig_list.txt -l Metabat,Maxbin -c 0_assembly_and_reads/2_assembly/${i}/${i}_contigs_2500.fna --db_directory /opt/biolinux/DAS_Tool -o 3_DASTool/${i}/DASTool_${i} --write_bins 1 -t 4; done
#gzip the protein files, since they can get rather large
gzip 3_DASTool/*/*_proteins.faa
#clean up folder name, ie the current path `3_DASTool/NIOZ130/DASTool_NIOZ130_DASTool_bins` is a bit long
for i in `cat FileLists/SampleList`; do mv 3_DASTool/${i}/DASTool_${i}_DASTool_bins/ 3_DASTool/${i}/bins ; done
#check the bin IDs
ll 3_DASTool/*/bins/*fa
#count number of bins/metagenome
for i in `cat FileLists/SampleList`; do ll 3_DASTool/${i}/bins/*fa | wc -l; done | paste FileLists/SampleList -
This could look something like this:
NIOZ114: 2
NIOZ118: 1
NIOZ130: 2
When we check the bin names, we can see that the bins come from both binning tools. If we were to use more binning tools, i.e. also include Concoct and BinSanity, it would become clear how hard it is to decide on a single best tool.
DAS also generates a lot of additional data in its folders:
- Summary of output bins including quality and completeness estimates (DASTool_summary.txt).
- Scaffolds to bin file of output bins (DASTool_scaffolds2bin.txt).
- Quality and completeness estimates of input bin sets, if --write_bin_evals 1 is set ([method].eval).
- Plots showing the number of high quality bins and the score distribution of bins per method, if --create_plots 1 is set (DASTool_hqBins.pdf, DASTool_scores.pdf).
- Bins in fasta format, if --write_bins 1 is set (DASTool_bins).
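To get a quick look at the quality and completeness estimates, we can open the summary table for one sample. The file name below follows from the output prefix we set with -o above; adjust it if you used a different prefix.
#inspect the DAS Tool summary table for NIOZ114
less -S 3_DASTool/NIOZ114/DASTool_NIOZ114_DASTool_summary.txt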
5.2 Cleanup bin names
Now, let's make some more informative bin names using our bash magic. Specifically, we want to:
- ensure all files end with the extension fna, which is more commonly used for MAGs
- remove unnecessary info that just makes the name longer than it needs to be, i.e. the contigs tag in the file name
- add the sample ID, so we easily know from which sample the metagenome was assembled
- When we check the file names, we also see that the name mx_b002.contigs.fa occurs twice, once for NIOZ130 and once for NIOZ114. Duplicates are not good and can easily create confusion, so that is something else we want to change.
#make backup of your files
for i in `cat FileLists/SampleList`; do mkdir 3_DASTool/${i}/bins/backup ; done
for i in `cat FileLists/SampleList`; do cp 3_DASTool/${i}/bins/* 3_DASTool/${i}/bins/backup ; done
#check initial file name
ll 3_DASTool/*/bins/*f*
#cleanup file names: adjust the extension, remove the contig info and add the sample ID into the file header
for i in `cat FileLists/SampleList`; do perl scripts/rename.pl "s/fa/fna/" 3_DASTool/${i}/bins/*.fa; done
for i in `cat FileLists/SampleList`; do perl scripts/rename.pl "s/_sub.contigs\./\./" 3_DASTool/${i}/bins/*.fna; done
for i in `cat FileLists/SampleList`; do perl scripts/rename.pl "s/\.contigs\./\./" 3_DASTool/${i}/bins/*.fna; done
for i in `cat FileLists/SampleList`; do perl scripts/rename.pl "s/(.*)\/(.*)\/(.*)\//3_DASTool\/$i\/bins\/${i}_/" 3_DASTool/${i}/bins/*.fna; done
#check new file names
ll 3_DASTool/*/bins/*f*
#control counts
for i in `cat FileLists/SampleList`; do ll 3_DASTool/${i}/bins/*fna | wc -l; done | paste FileLists/SampleList -
#if there is an issue with the renaming, we can get the old bins from the backup folder
#rm 3_DASTool/*/bins/*
#for i in `cat FileLists/SampleList`; do cp 3_DASTool/${i}/bins/backup/* 3_DASTool/${i}/bins/; done
This could look something like this:
NIOZ114: 2
NIOZ118: 1
NIOZ130: 2
5.3 Cleanup file header
Now we clean things up by:
- moving the MAGs into a final folder
- creating a new version, where we add the BinID into the fasta header (this is useful because it allows us to concatenate all genomes into one big file)
#copy bins into new folder
#if you have a lot of files, use mv or ln instead
mkdir 4_FinalBins
for i in `cat FileLists/SampleList`; do cp 3_DASTool/${i}/bins/*fna 4_FinalBins; done
#check file names
ll 4_FinalBins/*fna
#the script below needs some modification to work if you have more than 999 bins per binning tool
#have a consistent numbering, i.e. we want all numbers to have 3 digits
#python scripts/number_bins_consistently.py 4_FinalBins
#check file names again
#ll 4_FinalBins/*fna
#make a list with all the bins we have and count the nr of genomes in that list: 5
ls 4_FinalBins/*fna | cut -f2 -d "/" | sed 's/\.fna//g' > FileLists/BinList.txt
wc -l FileLists/BinList.txt
#count how many genomes we have --> this should be 5
ll 4_FinalBins/* | wc -l
#check the fasta header
head $(ls -1 4_FinalBins/*fna | head -1)
#add in binID into fasta header (that way we can concatenate genomes later into one large file)
mkdir 4_FinalBins/renamed
cd 4_FinalBins
for i in *fna; do awk '/>/{sub(">","&"FILENAME"-");sub(/\.fna/,x)}1' $i > renamed/$i; done
#check if everything went alright
head $(ls -1 renamed/*fna | head -1)
#if the header looks good, let's remove the old files
rm -f *fna
cd ..
#cleanup backup files
rm -rf 3_DASTool/*/bins/backup
We can now see that the header looks something like this: >NIOZ114_mb_b2-NIOZ114_sc48_1. It includes the BinID, separated from the contig ID with a - symbol. This is a general format we will stick with for everything we have, using only _ or - as separators. The less clutter, the easier most workflows will be.
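Because the bin ID is now part of every header, we could for example concatenate all renamed genomes into one multi-fasta file for downstream analyses. This is an optional sketch and not needed for the rest of the tutorial; we write the file outside the renamed folder so it does not interfere with CheckM or GTDB-Tk later on.
#concatenate all renamed bins into one multi-fasta file
cat 4_FinalBins/renamed/*fna > 4_FinalBins/All_bins_renamed.fna
#count the sequences; this should match the total number of binned contigs
grep -c ">" 4_FinalBins/All_bins_renamed.fna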
6 CheckM: Determine Bin Quality
CheckM: A tool for assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes (Parks et al., 2015)
#prepare folder
mkdir 5_CheckM
#run CheckM
checkm lineage_wf -x fna -f 5_CheckM/5_CheckM_table.txt -t 4 --pplacer_threads 1 4_FinalBins/renamed 5_CheckM
#check how the initial file looks
head 5_CheckM/5_CheckM_table.txt
#since the initial file is not formatted very nicely, let's clean things up
#rm rows with "-" and replace multiple spaces with a tab
sed '/-/d' 5_CheckM/5_CheckM_table.txt | sed 's/ \+ /\t/g' | cut -f 2- > 5_CheckM/temp1
#clean header name
awk 'BEGIN{FS="\t"; OFS="\t"}{if(NR==1) $1="BinID"} {print $0 }' 5_CheckM/temp1 > 5_CheckM/CheckM_table_clean.txt
#check results
less -S 5_CheckM/CheckM_table_clean.txt
The output from CheckM for our bins looks like this:
Question:
- How many archaeal bins do we have?
- How complete are these genomes?
Click me to see an answer
We can see that:
(a) We have 4 bacterial and 3 archaeal MAGs that are not classified at a lower level
(b) All but one have a completeness >80%, and all except one are less than 5% contaminated, which is pretty good
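If you want to shortlist bins by quality, a small awk filter over the cleaned table helps. A word of caution: the column positions are an assumption based on the default CheckM table layout (Completeness and Contamination ending up in columns 12 and 13 of CheckM_table_clean.txt); verify them before using the filter.
#list bins with >=50% completeness and <10% contamination (MIMAG medium-quality draft thresholds)
#columns 12 and 13 are assumed to be Completeness and Contamination; confirm with: head -1 5_CheckM/CheckM_table_clean.txt
awk -F'\t' 'NR>1 && $12>=50 && $13<10 {print $1, $12, $13}' 5_CheckM/CheckM_table_clean.txt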
General info on checkM:
A downside of CheckM is that it only searches for the absence/duplication of marker genes. The number of marker genes checked depends on the lineage we are looking at. For example, if CheckM thinks we work with a Deltaproteobacterium it might check 150 genes; however, if it only estimates that we work with a bacterium it might only check 50 genes. Thus keep in mind that:
- only a small fraction of the genome is checked and
- this fraction gets even smaller if we have novel lineages
Especially since a lot of genes sit in operons, such as ribosomal proteins, this can mean that we over- or underestimate genome completeness. Issues with the CheckM approach are also discussed here.
We have not tested this yet, but GUNC is an alternative that looks at the full gene content and might be something interesting to look at:
- the paper
- the github page
7 GTDB_tk: Assign a taxonomic label to the genomes
GTDB-Tk is a software toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes based on the Genome Database Taxonomy GTDB. It is designed to work with recent advances that allow hundreds or thousands of metagenome-assembled genomes (MAGs) to be obtained directly from environmental samples. It can also be applied to isolate and single-cell genomes (Chaumeil et al., 2019)
#start the conda env which we need for the tool to run
source $HOME/.bashrc.conda3 gtdbtk2
#set path to right gtdb database
export GTDBTK_DATA_PATH=/export/data01/databases/gtdb/release207_v2/
#prepare folders
mkdir 6_Gtdbtk
#run gtdbtk
gtdbtk classify_wf --genome_dir 4_FinalBins/renamed --out_dir 6_Gtdbtk --cpus 20 --skip_ani_screen
#reset env to default
conda deactivate
conda deactivate
The key information we get should look something like this:
Question:
Go into the folder that contains the GTDB results and try to answer:
- What taxa are our MAGs assigned to?
- Based on Gtdb did we find a new phylum, class, order, …?
Click me to see an answer
We check the files with:
less -S 6_Gtdbtk/classify/gtdbtk.ar53.summary.tsv
less -S 6_Gtdbtk/classify/gtdbtk.bac120.summary.tsv
We have bins from the Altarchaeota, Cloacimonadota and Planctomycetota. None of them is assigned to a cultured representative below the family level. We see this because there is no specific name assignment at the genus or species level; generic names, such as JACNJR01, are typically used as placeholders for as-yet uncultured microbes.
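To pull out just the genome IDs and their full classification strings (the first two columns of the summary tables), something like this works:
#extract genome ID and classification from the archaeal and bacterial summaries
cut -f1,2 6_Gtdbtk/classify/gtdbtk.ar53.summary.tsv 6_Gtdbtk/classify/gtdbtk.bac120.summary.tsv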
General info on gtdbtk
- The tool and links to all relevant references
- RED values and cut-offs are discussed here. Notice, there are some technical issues with how RED values are assigned, but whenever you describe a novel lineage many papers will ask you for this value, so note it down.
GTDBtk runs a phylogenetic analysis; however, this is a very quick analysis. This step is useful to get a first idea of what we are working with, and because it places our genomes into a tree with a lot of reference genomes we learn about their close relatives. However, the downside is that the phylogenetic tree generated is not the best, and we strongly recommend that you run your own tree with a suitable marker set and a good phylogenetic model.
The problems (in short) are that some of the markers used are prone to horizontal gene transfer, the alignment is trimmed a lot, and the genomes are added into a reference tree that was inferred using a simpler model.
8 Calculate the stats for the contigs and bins
8.1 Get summary stats for each contig
Now, let's get some basic stats for each contig, including:
- contig length
- GC content
#prep folders
mkdir 7_Stats
mkdir 7_Stats/Length
#calculate stats for each bin
for i in `cat FileLists/BinList.txt`; do perl scripts/length+GC.pl 4_FinalBins/renamed/${i}.fna > 7_Stats/Length/${i}_length.txt; done
#combine indiv file into one larger file and view content of the file
cat 7_Stats/Length/*txt > 7_Stats/Length_Bin_length.txt
head 7_Stats/Length_Bin_length.txt
#cleanup indiv. files
rm -f 7_Stats/Length/N*
#also add the binID into the file
awk 'BEGIN{FS="\t"; OFS="\t"}{split($1,a,"-")}{print a[2],a[1],$2,$3}' 7_Stats/Length_Bin_length.txt > 7_Stats/Length_Bin_length_final.txt
#add in headers
echo -e "contigName\tBinID\tGC\tlength" | cat - 7_Stats/Length_Bin_length_final.txt > 7_Stats/Length_Bin_length_final_header.txt
#check file
head 7_Stats/Length_Bin_length_final_header.txt
The file we get should look something like this:
Question
- How many of our original contigs were binned?
Click me to see an answer
We check this with:
grep -c ">" 0_assembly_and_reads/2_assembly/AllContigs/All_contigs_2500.fna
wc -l 7_Stats/Length_Bin_length_final.txt
Initially we had 1245 contigs; of these, 1195 contigs were binned.
8.2 Calculate: contig nr, average length, average gc and average coverage across bins
For the bins we want to calculate:
- the genome size
- the average gc
- the average depth
#make folder
mkdir 7_Stats/Coverage
#get the depth info for the indiv. sample
cat 1_mapping/N*/*_depth_maxbin.txt > 1_mapping/AllSamples_depth_maxbin.txt
#merge with the contig stats (header:contig, binID, gc, length, RA)
LC_ALL=C join -a1 -j1 -e'-' -t $'\t' -o 0,1.2,1.3,1.4,2.2 <(LC_ALL=C sort 7_Stats/Length_Bin_length_final.txt) <(LC_ALL=C sort 1_mapping/AllSamples_depth_maxbin.txt) | LC_ALL=C sort -k2 > 7_Stats/temp1
#add header
echo -e "contigName\tBinID\tAvGC\tLength\tRA" | cat - 7_Stats/temp1 > 7_Stats/temp2
#get list of the coverage info across all samples
cat 1_mapping/N*/*_depth_metabat.txt > 1_mapping/AllSamples_depth_metabat.txt
#check file
head 1_mapping/AllSamples_depth_metabat.txt
#combine tables with the abundances and contig stats
awk 'BEGIN{OFS="\t"}FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' 1_mapping/AllSamples_depth_metabat.txt 7_Stats/temp2 > 7_Stats/temp3
#control that the nr of lines is ok --> should be around 1200
wc -l 7_Stats/temp3
#subset data to only include relevant info and remove the variance info
awk '{for(x=1;x<=NF;x++) if(x <= 5 || (x >=9 && x % 2)) printf "%s", $x (x == NF || x == (NF-1) ? "\n":"\t")}' 7_Stats/temp3 > 7_Stats/temp4
#clean up the header names
sed 's/\.sam_sort\.bam//g' 7_Stats/temp4 | sed 's/[A-Za-z0-9]*_v_//g' > 7_Stats/contig_stats.txt
#check file
head 7_Stats/contig_stats.txt
#do math
awk -F'\t' -v OFS='\t' 'NR>1{ContigNr[$2]++; GenomeSize[$2]+=$4/1000000;AvRa[$2]+=$5;AvGC[$2]+=$3*100; }END{for(i in GenomeSize) print i,ContigNr[i],GenomeSize[i],AvGC[i]/ContigNr[i],AvRa[i]/ContigNr[i]}' 7_Stats/contig_stats.txt > 7_Stats/bin_stats.txt
#add headers
echo -e "BinID\tContigNR\tAverageGenomeSize\tAverageGC\tAverageDepth" | cat - 7_Stats/bin_stats.txt > 7_Stats/bin_stats_header.txt
#check output generated
head 7_Stats/bin_stats_header.txt
#clean up temp files
rm -f 7_Stats/temp*
8.3 Combine the different dataframes
#combine gtdb results for archaea and bacteria
awk 'FNR==1 && NR!=1 { while (/^user_genome/) getline; }1 {print}' 6_Gtdbtk/gtdbtk.*.summary.tsv > 6_Gtdbtk/gtdbtk.ALL.summary.tsv
#change column name user_genome to BinID (to make the merging in the next step possible)
awk 'BEGIN{FS="\t"; OFS="\t"}{if(NR==1) $1="BinID"} {print $0 }' 6_Gtdbtk/gtdbtk.ALL.summary.tsv > 6_Gtdbtk/gtdbtk.ALL.summary2.tsv
#combine gtdb stats and bin stats
awk 'BEGIN{OFS="\t"}FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' 6_Gtdbtk/gtdbtk.ALL.summary2.tsv 7_Stats/bin_stats_header.txt > 7_Stats/temp1
#combine with 5_CheckM results
awk 'BEGIN{OFS="\t"}FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' 5_CheckM/CheckM_table_clean.txt 7_Stats/temp1 > 7_Stats/bin_stats_final.txt
#check output generated
less -S 7_Stats/bin_stats_final.txt
This final data frame combines all the data we have looked at before and contains:
- the bin stats (the ones we calculated as well as the CheckM results)
- the results from gtdb