Metagenomics: Assembly tutorial
1 Introduction
The goal of this tutorial is to teach students how to assemble reads from metagenomic sequence data. During this tutorial, we will:
- take the raw reads from 3 “mini-metagenomes” that were recovered from 3 depth profiles of the water column of the Black Sea during a 2018 cruise (at 500 m, 1000 m and 2000 m water depth). These metagenomes are named NIOZ114, NIOZ118 and NIOZ130, respectively
- check the quality of the metagenomes
- remove reads that have a low quality
- assemble the reads into contigs using Megahit and Metaspades
- get the basic stats to characterize the assemblies
We work with three metagenomes rather than one because a pipeline that often follows assembly, binning, benefits from incorporating data from multiple samples. When binning contigs from an assembly into metagenome-assembled genomes (MAGs), most binning tools take coverage information into account. We get this information by comparing how many reads a contig recruits from the different metagenomes (i.e., in our case, metagenomes that come from different depths).
In this tutorial we will provide examples for how to run the code on a single metagenome or in a loop over all metagenomes of interest. The single-metagenome example will be commented out (we added a # in front of it); if you want to try it, remove the # and run the code.
To fully understand this tutorial some good background in bash and AWK is needed. If you want to refresh that background, we provide intro tutorials for both topics.
1.1 Running the same code on more than one file
1.1.1 Using for loops
You will see that we use a lot of for-loops during this workflow. The reason is that they let us process our 3 samples with one block of code instead of three nearly identical lines.
A general introduction into for-loops can be found here.
The general syntax is:
for var in sample1 sample2 sample3
do
echo $var
done
If we ran this inside a terminal, the following should be printed to the screen:
sample1
sample2
sample3
A for-loop can also be written in a more condensed way on a single line:
for var in sample1 sample2 sample3; do echo $var; done
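Later in this tutorial we will rarely type out the sample names; instead, we will read them from a file via command substitution. A minimal sketch, assuming a file SampleList that holds one sample name per line (we will create this file in section 3.1):
#loop over all names listed in a file
for var in `cat SampleList`; do echo $var; done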
1.1.2 while read loops
This kind of loop allows us to read a file into our loop. For example, we might have two samples and want to do all possible comparisons between them. Let's say the comparisons we want to make are stored in a text file named test.txt
like this:
Sample1 Sample2
Sample2 Sample1
We can then use a while loop to access the information from both columns as follows:
while read column1 column2; do cat $column1 $column2; done < test.txt
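If you replace cat with echo, you can verify the pairing without needing files named Sample1 and Sample2 to exist:
while read column1 column2; do echo "comparing $column1 with $column2"; done < test.txt
This prints:
comparing Sample1 with Sample2
comparing Sample2 with Sample1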
1.1.3 GNU parallel
GNU parallel is another efficient way to run things in parallel using more than one CPU. Apart from a few one-liners in section 6, we won't discuss it in detail during this tutorial.
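For completeness, a minimal example of its syntax (we will use a similar pattern in section 6):
#run echo on three inputs, with up to 2 jobs running at the same time
parallel -j2 'echo {}' ::: sample1 sample2 sample3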
1.2 Program versions
Notice: this workflow was written using the program versions listed below. If you are using the NIOZ server system there is nothing you need to do, but if you run this tutorial on your own system/cluster you will need to install these tools first.
- prokka 1.14-dev
- Python 2.7.5
- perl v5.16.3
- MultiQC v1.8
- FastQC v0.11.9
- megahit v1.1.3
- metaspades SPAdes-3.13.1
- BWA v0.7.17-r1188
- Quast v4.6.3
2 Addon: Prepare test data
#set wdir
cd /export/lv1/user/ndombrowski/Assembly_workflow
#combine the 3 genomes that we will use to recruit reads from each depth
cat bins_1/indiv/* > bins_1/combined/NIOZ_bins.fna
#prepare folders for read mapping
mkdir mapping_1
#make folder per sample
for i in `cat FileLists/Samples`; do mkdir mapping_1/$i; done
#build index for our concatenated genomes
bwa index bins_1/combined/*
#create mapping files
for sample1 in `cat FileLists/Samples`; do bwa mem -t 40 bins_1/combined/NIOZ_bins.fna /export/lv4/projects/Black_Sea18/Data/RawData/Illumina/*/${sample1}*_L001_R1_001.fastq.gz /export/lv4/projects/Black_Sea18/Data/RawData/Illumina/*/${sample1}*_L001_R2_001.fastq.gz > mapping_1/${sample1}/${sample1}_v_${sample1}.sam; done
#retain properly paired reads, sort and create bam file
for sample1 in `cat FileLists/Samples`; do samtools view -f 2 -b -u mapping_1/${sample1}/${sample1}_v_${sample1}.sam | samtools sort -n -@ 4 - -o mapping_1/${sample1}/${sample1}_v_${sample1}_sorted.bam; done
#extract paired reads
for sample1 in `cat FileLists/Samples`; do samtools fastq mapping_1/${sample1}/${sample1}_v_${sample1}_sorted.bam -1 mapping_1/${sample1}/${sample1}_R1.fastq -2 mapping_1/${sample1}/${sample1}_R2.fastq;done
#check read counts (each fastq record spans 4 lines, hence the division by 4)
for sample1 in `cat FileLists/Samples`; do echo $(cat mapping_1/${sample1}/${sample1}_R1.fastq|wc -l)/4|bc; done
for sample1 in `cat FileLists/Samples`; do echo $(cat mapping_1/${sample1}/${sample1}_R2.fastq|wc -l)/4|bc; done
#cleanup
gzip mapping_1/*/*R*.fastq
#mv final files
mkdir rawdata_1
mv mapping_1/*/*R[12].fastq.gz rawdata_1
To give you a feeling about the size of the data set, below you can see how many forward (R1) and reverse reads (R2) were recruited to each of the 3 metagenomes:
R1:
424,058
595,711
473,936
R2:
424,058
595,711
473,936
If you work with your own data, this can easily be several million reads. Run times will then be much longer and you can benefit from using more CPUs to run jobs.
3 Tutorial start: Prepare your data
If you work on this tutorial outside of the NIOZ workshop, download the data from zenodo. Have patience, the files take a while to download. Exchange the paths below for wherever you downloaded the data to.
Click me if you use the tutorial outside of the workshop and want to use the zenodo files
#set a variable to the directory from which we want to work
#change this according to your preferences
wdir="/export/lv3/scratch/bioinformatics_workshop/S10_Assembly"
#go to our working directory
cd $wdir
#download the data
curl "https://zenodo.org/record/7752831/files/rawdata_1.tar.gz?download=1" --output rawdata_1.tar.gz
curl "https://zenodo.org/record/7752831/files/adaptors.tar.gz?download=1" --output adaptors.tar.gz
#unzip the data
tar xvzf rawdata_1.tar.gz
tar xvzf adaptors.tar.gz
#prepare custom scripts
mkdir scripts
wget https://raw.githubusercontent.com/ndombrowski/Assembly_tutorial/main/scripts/length%2BGC.pl -O scripts/length+GC.pl
wget https://raw.githubusercontent.com/ndombrowski/Assembly_tutorial/main/scripts/screen_list_new.pl -O scripts/screen_list_new.pl
#cleanup folder
rm *gz
Exercise: Setting up the working directory and required files
- Go into your personal folder in
/export/lv3/scratch/bioinformatics_workshop/Users/<username>
- Make a new working directory, and name it
Assemblies
- Go into that folder; this is your working directory
- Inside that folder, set a symbolic link to a folder with the metagenomic data, which is found here:
/export/lv3/scratch/bioinformatics_workshop/S10_Assembly/rawdata_1/
- Set another symbolic link to a folder with secondary data we need from here:
/export/lv3/scratch/bioinformatics_workshop/S10_Assembly/adaptors/
- Set a third symbolic link to the folder with custom scripts found here:
/export/lv3/scratch/bioinformatics_workshop/S10_Assembly/scripts/
Click me to see an answer
#set variable for the User directory
user_name="ndombrowski"
#prepare working directory
cd /export/lv3/scratch/bioinformatics_workshop/Users/${user_name}
#generate a folder to run the assembly workflow in
mkdir Assemblies
cd Assemblies
#create symbolic links to the metagenome we work with
ln -s /export/lv3/scratch/bioinformatics_workshop/S10_Assembly/rawdata_1/ .
#create symbolic link to an adapter file we need to clean up the genomes
ln -s /export/lv3/scratch/bioinformatics_workshop/S10_Assembly/adaptors/ .
#create symbolic link to a folder with custom scripts
ln -s /export/lv3/scratch/bioinformatics_workshop/S10_Assembly/scripts/ .
3.1 Prepare a list with samples to loop through
As discussed in the introduction, we will use for-loops to run the same script over our 3 different metagenomes. Something useful to have is a file that lists the samples we are working with. So, let’s start by creating such a list.
echo -e "NIOZ114\nNIOZ118\nNIOZ130" > SampleList
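To double-check, print the contents of the file; it should list one sample name per line:
cat SampleList
This should print:
NIOZ114
NIOZ118
NIOZ130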
Questions
It is always good to explore your data before doing things with it. To do this:
- Check the data structure and open the file with less (ideally without unzipping the file, for example using zcat). Do you know what is stored in each of the four lines? If you are unsure, click the button below for a more detailed explanation.
- Answer how many sequences we have in each file.
- Find out whether the R1 and R2 files (which store the forward and the reverse reads, respectively) have the same number of sequences.
Click me to see an answer
#Question1:
zcat rawdata_1/NIOZ114_R1.fastq.gz | head
The four lines in a fastq file contain:
- A sequence identifier with information about the sequencing run and the cluster. The exact contents of this line vary based on the BCL to FASTQ conversion software used.
- The sequence (the base calls; A, C, T, G and N).
- A separator, which is simply a plus (+) sign.
- The base call quality scores. These are Phred +33 encoded, using ASCII characters to represent the numerical quality scores.
In case you are unsure about the format of fastq files, you can find some more info here.
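For illustration, a single fastq record might look like the made-up example below; the instrument name and cluster coordinates in the header line are invented:
@M00001:25:000000000-A1B2C:1:1101:15589:1332 1:N:0:1
GATTTGGAACGTTGCAAGTCCAGT
+
IIIIHHHGGGFFFEEDDCCBBAA!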
#Question2 and 3:
cd rawdata_1
for sample in *gz; do echo $(($(zcat $sample | wc -l)/4)); done
cd ..
You should get the following numbers:
774200 (NIOZ114_R1)
774200 (NIOZ114_R2)
354428 (NIOZ118_R1)
354428 (NIOZ118_R2)
288432 (NIOZ130_R1)
288432 (NIOZ130_R2)
We can see that:
- The R1 and R2 files have the same number of sequences. This is important: if you ever see that the numbers differ, this indicates that there might have been a problem when downloading the data or that the sequencing center might have done some cleaning. You should follow up on this and figure out why the numbers are not the same.
- You also see that we have many more reads for sample NIOZ114 than for the other samples, which might be important to keep in mind when interpreting the data.
4 Quality control
4.1 Check initial quality
4.1.1 FastQC to determine the quality info
FastQC is one of the many tools we can use to look at the quality of the data. Among other things it looks at:
- Overall read quality
- GC bias
- Number of Ns (i.e. unassigned bases)
- Over-represented sequences (often adapter sequences)
Some examples for good and bad reports can be found here
#make folder
mkdir 2_seq_quality
mkdir 2_seq_quality/fastqc
#run fastqc
for sample in `cat SampleList`; do fastqc -t 4 rawdata_1/${sample}* -o 2_seq_quality/fastqc; done
#view report
chrome 2_seq_quality/fastqc/NIOZ114_R1_fastqc.html
If you want to exit chrome press: Control + X
Notice: on ada this is a bit slow; it is often faster to download the data to your own computer and view it there.
The output should look something like this and among others we see:
- the general sequence statistics
- the quality scores across the sequence length. The Phred score goes from 0 to 40 (bad to good) and we can see that the sequences generally are a bit worse at the beginning. This has to do with the sequencing approach and is nothing to worry about. However, if you see that the complete sequence is bad, that is something to discuss, for example, with the sequencing center.
- The average sequence quality. If you have a bump at the beginning (i.e. at the low scores), that means we have more low-quality sequences that we should get rid of.
- The duplication level. Here, most of our sequences are singletons. If you work with 16S sequences this will look a bit different and you will see more duplicated sequences.
- Overrepresented sequences. Here, especially if we have adapter sequences left, you will see the sequence information. This is important to note down: we want to remove adapters in the cleaning step, and it is important to check whether we used the right adapter information for the cleaning.
4.1.2 MultiQC to summarize the fastqc results
MultiQC is a reporting tool that parses summary statistics from results and log files generated by other bioinformatics tools. MultiQC doesn’t run other tools for you - it’s designed to be placed at the end of analysis pipelines or to be run manually when you’ve finished running your tools (Ewels et al., 2016).
#make folder
mkdir 2_seq_quality/multiqc
#open conda environments
source ~/.bashrc.conda3
conda activate multiqc
#run MultiQC
multiqc -d 2_seq_quality/fastqc -o 2_seq_quality/multiqc
#view report
chrome 2_seq_quality/multiqc/multiqc_report.html
#close conda environment and reset environment back to default (yes, run it twice to first deactivate conda and then anaconda)
conda deactivate
conda deactivate
MultiQC summarizes the data across all metagenomes and should look something like this:
We also get the plots we saw before for fastqc, now with multiple lines, each representing one of our metagenomes.
Here we also get a zoom into the adapter content. If we hover over a line we see exactly what adapter was found as well.
4.2 Run trimmomatic
Next, we want to clean our sequences. For example, we want to remove adapters as well as low quality sequences. For this we can for example use Trimmomatic.
Trimmomatic is a flexible trimmer for Illumina sequence data (Bolger et al., 2014). We will use this tool to filter our sequences and only retain them if they have a high enough quality. Additionally, we will use it to remove Ns.
While this is running (this might take a bit):
- Check the trimmomatic manual and control what the individual options do
#prepare folder
mkdir 3_trimmomatic
#run trimmomatic
for sample in `cat SampleList`; do java -jar /opt/biolinux/Trinity/trinity-plugins/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 4 -trimlog 3_trimmomatic/${sample}.trimlog.txt rawdata_1/${sample}_R1.fastq.gz rawdata_1/${sample}_R2.fastq.gz 3_trimmomatic/${sample}_R1_paired.fastq.gz 3_trimmomatic/${sample}_R1_unpaired.fastq.gz 3_trimmomatic/${sample}_R2_paired.fastq.gz 3_trimmomatic/${sample}_R2_unpaired.fastq.gz ILLUMINACLIP:adaptors/TruSeq3-PE-2_modif_no_small.fa:1:30:10 SLIDINGWINDOW:5:22 MINLEN:100; done
#check files generated for one of our metagenomes
ll 3_trimmomatic/NIOZ114*
If we check the contents of the newly generated folder 3_trimmomatic, we should see something like the image below. Overall, 5 files are generated for each metagenome:
- A log file, which gives the read name, the surviving sequence length, the location of the first surviving base (i.e. the amount trimmed from the start), the location of the last surviving base in the original read, and the amount trimmed from the end
- The paired reads for the forward and reverse reads. In this scenario both the forward and the corresponding reverse read are retained when the quality of each is good enough
- The unpaired reads for the forward and reverse reads. For these reads, only one of the F and R reads had good quality stats and is retained
What is happening:
- ILLUMINACLIP: Cuts the adapter and other Illumina-specific sequences from the read
- SLIDINGWINDOW: Performs a sliding window trimming approach to identify low quality sequences. The window starts scanning at the 5' end and clips the read once the average quality within the window falls below a threshold.
- MINLEN: Drops the read if it is below a specified length
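As a reading aid, this is how the numeric arguments used in the trimmomatic command above break down (following the Trimmomatic manual):
#ILLUMINACLIP:adaptors/TruSeq3-PE-2_modif_no_small.fa:1:30:10
#  adapter file : seed mismatches (1) : palindrome clip threshold (30) : simple clip threshold (10)
#SLIDINGWINDOW:5:22
#  window size (5 bases) : required average quality within the window (22)
#MINLEN:100
#  drop reads shorter than 100 bp after trimming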
Using the argument -trimlog
creates a log of all read trimmings. It will contain the following details:
- the read name
- the surviving sequence length
- the location of the first surviving base, aka. the amount trimmed from the start
- the location of the last surviving base in the original read
- the amount trimmed from the end
Example output for one sample (printed to screen):
- Input Read Pairs: 262,355
- Both Surviving: 227,265 (86.62%)
- Forward Only Surviving: 21,581 (8.23%)
- Reverse Only Surviving: 7,569 (2.89%)
- Dropped: 5,940 (2.26%)
As a quick sanity check, the four categories add up to the number of input read pairs: 227,265 + 21,581 + 7,569 + 5,940 = 262,355.
4.3 Check quality of trimmed reads and check what happened
Homework
Once you are done with the full tutorial:
- Run fastqc and multiqc on the paired fastq files generated with trimmomatic. Since we will use only the paired data for the assembly, we will only look at these files. Specifically, compare the raw reads with the trimmed reads.
Click me to see an answer
#prepare folder
mkdir 2_seq_quality/fastqc_trimmed
#run fastqc
for sample in `cat SampleList`; do fastqc -t 2 3_trimmomatic/${sample}*_paired.fastq.gz -o 2_seq_quality/fastqc_trimmed; done
#make folder for multiqc
mkdir 2_seq_quality/multiqc_trimmed
#open conda environments
source ~/.bashrc.conda3
conda activate multiqc
#run multiqc
multiqc -d 2_seq_quality/fastqc* -o 2_seq_quality/multiqc_trimmed
#view report
chrome 2_seq_quality/multiqc_trimmed/multiqc_report.html
#close conda environment and reset environment back to default (yes, run it twice to first deactivate conda and then anaconda)
conda deactivate
conda deactivate
If we compare the results for e.g. NIOZ114_R1 we should see that:
- We lost a few sequences
- The few short sequences are gone
- We lost the few sequences with bad quality scores
- The adapters are gone
5 Run Assemblies
There are a number of different tools to assemble metagenomes, here, we will introduce two of them.
- Megahit
- Metaspades
5.1 Run Megahit
MEGAHIT is an ultra-fast and memory-efficient NGS assembler. It is optimized for metagenomes, but also works well on single genome and single-cell assemblies (Li et al., 2016). The benefit compared to Metaspades is that it works well while not using a lot of memory, which allows us to run it on less powerful computers.
Other features:
- Makes use of de Bruijn graphs. If you want to know more about the theory, you can check this paper
- Sequences are split into k-mers, subsequences of length k contained within a biological sequence; e.g. ATGG contains two 3-mers: ATG and TGG
- Megahit builds assemblies from k-mers with different lengths (for example: 21,29,39,59,79,99,119,141)
- In each k-mer iteration, potential errors in the graphs are cleaned
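As a toy illustration of splitting a sequence into k-mers, here is a short bash snippet (purely illustrative, not part of the workflow):
#print all 3-mers of a short example sequence
seq="ATGGC"
k=3
for ((i=0; i<=${#seq}-k; i++)); do echo ${seq:i:k}; done
This prints ATG, TGG and GGC.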
While this is running (it takes a bit), feel free to revisit your code or have a look at the assemblers' documentation pages.
#prepare folder
mkdir 4_assemblies
mkdir 4_assemblies/megahit
#run megahit on all metagenomes (one sample)
megahit -1 3_trimmomatic/NIOZ114_R1_paired.fastq.gz -2 3_trimmomatic/NIOZ114_R2_paired.fastq.gz -t 2 -o 4_assemblies/megahit/NIOZ114 --k-min 43 --k-max 127 --k-step 28
#run megahit on all metagenomes (all samples)
#for i in `cat SampleList`; do megahit -1 3_trimmomatic/${i}_R1_paired.fastq.gz -2 3_trimmomatic/${i}_R2_paired.fastq.gz -t 20 -o 4_assemblies/megahit/$i; done
#let's check how the content of an assembly looks
head 4_assemblies/megahit/NIOZ114/final.contigs.fa
#control how many contigs we get for each assembly
grep -c ">" 4_assemblies/megahit/N*/final.contigs.fa
The files generated from Megahit should look like this:
These files contain the following:
- final.contigs.fa = contains our assembly
- intermediate_contigs = a folder that contains the assemblies for the different kmer sizes
- log = log file listing what happened at each step. Here, the end is interesting as it gives the basic contig stats and should look something like this: 4512 contigs, total 7371931 bp, min 202 bp, max 138579 bp, avg 1634 bp, N50 10836 bp
Exercise
- Run Megahit also for the other two assemblies; we will need all of them to proceed with the binning. Depending on how busy the server is, you might run this as homework after today's session.
- Check how many contigs were generated for each assembly
Click me to see an answer
#run just the assemblies for the two missing metagenomes
#adjust the number of cpus used based on how busy the server is
for i in `echo -e "NIOZ118\nNIOZ130"`; do megahit -1 3_trimmomatic/${i}_R1_paired.fastq.gz -2 3_trimmomatic/${i}_R2_paired.fastq.gz -t 4 -o 4_assemblies/megahit/$i; done
#control how many contigs we get for each assembly
grep -c ">" 4_assemblies/megahit/N*/final.contigs.fa
To run this while you work through the rest of the tutorial, run the assembly for NIOZ118 and NIOZ130 in a screen or via slurm on the HPC. A short reminder on how to use screen:
#start a screen and give it a descriptive name:
screen -S testrun
#see what screens you have running with:
screen -ls
#restart an existing screen with:
screen -r testrun
#detach a screen (from outside a screen) with:
screen -d
#to completely close and remove the screen, type:
exit
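If you prefer slurm over screen, a job script could look like the minimal sketch below. Note that the resource requests are placeholders that you need to adapt to your cluster:
#!/bin/bash
#SBATCH --job-name=megahit_NIOZ118
#SBATCH --cpus-per-task=4
#SBATCH --mem=20G
megahit -1 3_trimmomatic/NIOZ118_R1_paired.fastq.gz -2 3_trimmomatic/NIOZ118_R2_paired.fastq.gz -t 4 -o 4_assemblies/megahit/NIOZ118
You would save this as, for example, run_megahit.sh and submit it with sbatch run_megahit.sh.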
5.2 Run Metaspades
Metaspades is another commonly used assembler; it is a version of the SPAdes assembler adjusted for metagenomes.
Features:
- metaSPAdes first constructs the de Bruijn graph of all reads using SPAdes
- similar to Megahit, this tool generates assemblies with different k-mers; the default is to automatically select the best range of k-mers to use
- Has an option to also run a read error correction (this is however memory intensive)
Here, we use fewer k-mers than for Megahit due to the longer run-time. When running this yourself you should try different parameters and compare how the assembly is affected by looking at the number of contigs generated, the average contig length and other parameters we will discuss a bit later.
#make folder
mkdir 4_assemblies/metaspades
#run spades on a single metagenome
#adjust the number of CPUs used accordingly
#spades.py --meta -t 4 -k 21,33,55,77 -1 3_trimmomatic/NIOZ114_R1_paired.fastq.gz -2 3_trimmomatic/NIOZ114_R2_paired.fastq.gz -o 4_assemblies/metaspades/NIOZ114
#run spades on all metagenomes
for sample in `cat SampleList`; do spades.py --meta -t 4 -k 21,39,59,99,127 -1 3_trimmomatic/${sample}_R1_paired.fastq.gz -2 3_trimmomatic/${sample}_R2_paired.fastq.gz -o 4_assemblies/metaspades/${sample}; done
#check content of a folder
#we use find to simplify the output and not list the content of subfolders
find 4_assemblies/metaspades/NIOZ114/ -maxdepth 1 -not -type d
Homework
Run Metaspades on either one or all three assemblies, view the content of the generated files and check how many contigs we have.
5.3 Run quast to evaluate the assembly
Quast can be used to view assembly statistics of a single assembly but also to compare individual assemblies. For now, we will just look at a single assembly, but on your own, do a comparison between Metaspades and Megahit. From the documentation: “MetaQUAST evaluates and compares metagenome assemblies based on alignments to close references. It is based on the QUAST genome quality assessment tool, but addresses features specific for metagenome datasets” (Mikheenko et al., 2016).
Something that is good to know if you work with single genomes: Quast allows you to load reference genomes and align your assembly against such a reference. That way, you can investigate your assembly for contamination and misassemblies. In our own experience this does not yet work well with metagenomes, because not enough reference data can be loaded.
#make folders
mkdir 5_quast/
#generate summary statistics for a single metagenome
metaquast.py 4_assemblies/megahit/NIOZ114/final.contigs.fa -t 2 --no-snps --space-efficient -o 5_quast --max-ref-number 0
#view report
nano 5_quast/report.tsv
Information on the output:
- N50: Imagine that you line up all the contigs in your assembly in the order of their sequence lengths. You have the longest contig first, then the second longest, and so on, with the shortest ones at the end. Then you start adding up the lengths of all contigs from the beginning, taking the longest contig + the second longest + the third longest and so on, until you reach the number that makes up 50% of your total assembly length. The length of the contig at which you stopped counting is your N50 number.
- L50: While N50 corresponds to a sequence length in base pairs, L50 represents a number of sequences. L50 is simply the rank of the contig that gives you the N50 length.
Often the N50 is viewed as a good score for a good assembly: ideally a good genome has few contigs and thus a high N50, while a bad, fragmented genome should have a low N50. This is why we generally view a large N50 as an indicator of a good assembly. Issues with this are discussed in this blog post. In brief, if you filter your assembly by removing short contigs you can artificially inflate the N50. Another issue with the N50 is that it does not evaluate the correctness of an assembly. This is discussed in the follow-up blog post. There are some proposed solutions, but these typically require reference genomes to compare our genomes to, which of course is more tricky for metagenomes. You can try to generate a reference database for Quast; however, to our knowledge, there is no ideal solution around this yet.
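To make the N50/L50 definitions concrete, here is a toy computation on seven made-up contig lengths (the awk one-liner is just for illustration):
#contig lengths, already sorted from longest to shortest; total length is 300
printf "%s\n" 80 70 50 40 30 20 10 | awk '{len[NR]=$1; total+=$1} END{for(i=1;i<=NR;i++){cum+=len[i]; if(cum>=total/2){print "N50="len[i], "L50="i; exit}}}'
The cumulative sum reaches 150 (half of 300) at the second contig, so this prints N50=70 L50=2.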
Homework
- Run all 3 assemblies both with Metaspades and Megahit. Watch the resources on ada when running this!
- Compare the statistics with quast
Click me to see an answer
#run megahit
#for i in `cat SampleList`; do megahit -1 3_trimmomatic/${i}_R1_paired.fastq.gz -2 3_trimmomatic/${i}_R2_paired.fastq.gz -t 10 -o 4_assemblies/megahit/$i; done
#run metaspades
#for sample in `cat SampleList`; do spades.py --meta -t 10 -k 21,39,59,99,127 -1 3_trimmomatic/${sample}_R1_paired.fastq.gz -2 3_trimmomatic/${sample}_R2_paired.fastq.gz -o 4_assemblies/metaspades/${sample}; done
#make folders
mkdir 5_quast/
mkdir 5_quast/all_comparisons
#compare all genomes w/o references as input
metaquast.py 4_assemblies/megahit/NIOZ*/final.contigs.fa 4_assemblies/metaspades/NIOZ*/contigs.fasta -t 10 --no-snps --space-efficient -o 5_quast/all_comparisons --max-ref-number 0
#view report
less -S 5_quast/all_comparisons/report.tsv
The output from the tool should look something like this:
Contigs:
Pre-processing…
1 4_assemblies/megahit/NIOZ114/final.contigs.fa.gz ==> NIOZ114_final.contigs
2 4_assemblies/megahit/NIOZ118/final.contigs.fa.gz ==> NIOZ118_final.contigs
3 4_assemblies/megahit/NIOZ130/final.contigs.fa.gz ==> NIOZ130_final.contigs
4 4_assemblies/metaspades/NIOZ114/contigs.fasta ==> NIOZ114_contigs
5 4_assemblies/metaspades/NIOZ118/contigs.fasta ==> NIOZ118_contigs
6 4_assemblies/metaspades/NIOZ130/contigs.fasta ==> NIOZ130_contigs
If we check the N50 and N75, Megahit and Metaspades give similar scores. Which tool we decide on might also depend on the available memory, where Megahit usually does better.
6 Assembly curation + basic stats (length, GC)
Notice:
Depending on how the sequence header looks after your assembly, the first step might need to be adjusted a bit if you run this on your own data.
The goal of this section is to generate a concise, informative but simple header. For example, if you generate assemblies from multiple metagenomes it is useful to not only have the contig number but also a sample ID in the header name. A good rule of thumb is to avoid spaces or other weird symbols when choosing sample IDs.
For this workflow:
- We check how the sequence header of our contigs looks. The initial header from Megahit should look something like this: k141_5 flag=1 multi=2.0000 len=570. This has some useful information, but if we deal with multiple assemblies it is not very informative and there is a chance that we easily mix things up. So we first want to add the sample ID into the sequence header as well (a one-line demo of the renaming follows after this list).
- We store the header information from Megahit in a new text document.
- We shorten the header and keep just the sample and the contig ID; a header might then look like this: NIOZ114_sc5_1 (note that we also rename k141_ to sc along the way).
- We calculate the length and GC of each contig. This is a bit redundant, since the Megahit header provides that info, but other tools, such as Metaspades, do not give these numbers by default.
- Generally, contigs shorter than 2500 bp are not very informative and are often misbinned (if you assemble these contigs into MAGs), so for now we remove them and will not work with them in later parts of the workflow.
- We count how many contigs we have left.
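To see what the renaming in the first command below does, you can test it on a single example header (with the sample ID hard-coded to NIOZ114):
echo ">k141_5 flag=1 multi=2.0000 len=570" | sed "s/>/>NIOZ114_/g" | sed "s/k141_/sc/g" | sed "s/ /_1 /1"
This returns: >NIOZ114_sc5_1 flag=1 multi=2.0000 len=570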
#add the sample id into the file header for the assembly file to avoid bugs and easily identify where the contig originally came from
parallel -j3 'i={}; sed "s/>/>${i}_/g" 4_assemblies/megahit/$i/final.contigs.fa | sed "s/k141_/sc/g" | sed "s/ /_1 /1" > 4_assemblies/megahit/${i}/${i}_temp' ::: `cat SampleList`
#check what we did
head 4_assemblies/megahit/NIOZ114/NIOZ114_temp
#store the header info in an additional file (to keep the length)
parallel -j3 'i={}; grep ">" 4_assemblies/megahit/${i}/${i}_temp > 4_assemblies/megahit/${i}/${i}_info_megahit.txt' ::: `cat SampleList`
#clean the header and remove everything after the space
parallel -j3 'i={}; cut -f1 -d " " 4_assemblies/megahit/${i}/${i}_temp > 4_assemblies/megahit/${i}/${i}_contigs.fna' ::: `cat SampleList`
#check how the header looks now
head 4_assemblies/megahit/NIOZ114/NIOZ114_contigs.fna
#check how many contigs we have --> 6,000 to 14,000 contigs
grep -c ">" 4_assemblies/megahit/NIOZ*/NIOZ*_contigs.fna
#make a table with the length and GC for each contig
parallel -j3 'i={}; perl scripts/length+GC.pl 4_assemblies/megahit/${i}/${i}_temp > 4_assemblies/megahit/${i}/${i}_length_gc.txt' ::: `cat SampleList`
#check output file
head 4_assemblies/megahit/NIOZ114/NIOZ114_length_gc.txt
#as a sanity check, count the number of contigs for each assembly --> 6,000 to 14,000 contigs
wc -l 4_assemblies/megahit/N*/N*_length_gc.txt
#make list for contigs above a certain length threshold (i.e. only keep contigs with >2500 bp)
for i in `cat SampleList`; do awk '$6>=2500 {print $1,$5,$6}' 4_assemblies/megahit/${i}/${i}_length_gc.txt > 4_assemblies/megahit/${i}/${i}_Contigs_2500.txt; done
#check how one of the output file looks like
head 4_assemblies/megahit/NIOZ114/NIOZ114_Contigs_2500.txt
#check how many contigs we have now
#for NIOZ114 we should have around 705 instead of the initial 6000 contigs
#we can also see that almost 90% of the contigs were removed
wc -l 4_assemblies/megahit/N*/N*_Contigs_2500.txt
#count nr of reads for >2500 and total length
for i in `cat SampleList`; do awk '{Length+=$3}END{print FILENAME,NR,Length}' 4_assemblies/megahit/${i}/${i}_Contigs_2500.txt; done
The output file, i.e. 4_assemblies/megahit/NIOZ114/NIOZ114_Contigs_2500.txt, should look something like this in the end; for each contig it gives the GC and length information:
NIOZ114_sc48_1 0.582 3180
NIOZ114_sc51_1 0.344 2979
NIOZ114_sc85_1 0.333 3851
NIOZ114_sc96_1 0.334 7741
The summary statistics for the number of contigs > 2500 bp and total sequence length should look something like this:
4_assemblies/megahit/NIOZ114/NIOZ114_Contigs_2500.txt 705 12904683
4_assemblies/megahit/NIOZ118/NIOZ118_Contigs_2500.txt 118 1374909
4_assemblies/megahit/NIOZ130/NIOZ130_Contigs_2500.txt 422 2038370
Now let's remove the short contigs from our actual sequence files:
#for each assembly make a new fna file that only includes contigs >2500bp
for i in `cat SampleList`; do perl scripts/screen_list_new.pl <(cut -f1 -d " " 4_assemblies/megahit/${i}/${i}_Contigs_2500.txt) 4_assemblies/megahit/${i}/${i}_contigs.fna keep > 4_assemblies/megahit/${i}/${i}_contigs_2500.fna; done
#control step
grep -c ">" 4_assemblies/megahit/N*/*_contigs_2500.fna
The files we get should contain the following numbers of contigs (these numbers need to be consistent with the numbers from the step #count nr of reads for >2500 and total length above). Again, if you just worked with NIOZ114, you might get some error messages for the missing samples that you can ignore:
4_assemblies/megahit/NIOZ114/NIOZ114_contigs_2500.fna:705
4_assemblies/megahit/NIOZ118/NIOZ118_contigs_2500.fna:118
4_assemblies/megahit/NIOZ130/NIOZ130_contigs_2500.fna:422
Now we can clean things up to not store large or unnecessary files:
#rm temp files
for i in `cat SampleList`; do rm 4_assemblies/megahit/${i}/${i}_temp; done
#gzip "old" contigs
for i in `cat SampleList`; do gzip 4_assemblies/megahit/${i}/final.contigs.fa; done
for i in `cat SampleList`; do gzip 4_assemblies/megahit/${i}/*_info_megahit.txt; done
#rm intermediate contig files, since they take up space
rm -f 4_assemblies/megahit/*/intermediate_contigs/*fa
6.1 Combine all contig info into one file
#prepare folder
mkdir 4_assemblies/megahit/AllContigs
#combine contigs (this can be useful if you would want to annotate the contigs, not the MAGs)
cat 4_assemblies/megahit/N*/N*_contigs_2500.fna > 4_assemblies/megahit/AllContigs/All_contigs_2500.fna
#count the total nr of contigs we have: 1245
grep -c ">" 4_assemblies/megahit/AllContigs/All_contigs_2500.fna
#combine length and GC statistics for each contig into one table
cat 4_assemblies/megahit/N*/N*_Contigs_2500.txt > 4_assemblies/megahit/AllContigs/All_contigs_2500_length_GC.txt
#check the number of contigs in that table: 1245
wc -l 4_assemblies/megahit/AllContigs/All_contigs_2500_length_GC.txt
#check the contents of the summary file
head 4_assemblies/megahit/AllContigs/All_contigs_2500_length_GC.txt
We have a total of 1245 contigs from our 3 metagenomes with a length greater than 2,500 bp. If we just worked with NIOZ114 we got 705 contigs with a length greater than 2,500 bp.