Metagenomics: Assembly tutorial
1 Introduction
The goal of this tutorial is to teach students how to assemble reads from metagenomic sequence data. During this tutorial, we will:
- take the raw reads from 3 “mini-metagenomes” that were recovered from 3 depth profiles of the water column of the Black Sea during a 2018 cruise (at 500 m, 1000 m and 2000 m water depth). These metagenomes are named NIOZ114, NIOZ118 and NIOZ130, respectively
- check the quality of the metagenomes
- remove reads that have a low quality
- assemble the reads into contigs using Megahit and Metaspades
- get the basic stats to characterize the assemblies
We work with three metagenomes rather than one because a pipeline that often follows assembly, binning, benefits from incorporating data from multiple samples. When binning contigs from an assembly into metagenome-assembled genomes (MAGs), most binning tools take coverage information into account. We get this information by comparing how many reads a contig recruits from the different metagenomes (i.e., in our case, metagenomes that come from different depths).
In this tutorial we will provide examples for how to run the code on a single metagenome or in a loop over all metagenomes of interest. The single-metagenome example will be commented out (we added a # in front of it); if you want to try it, remove the # and run the code.
To fully understand this tutorial some good background in bash and AWK is needed. If you want to refresh that background, we provide intro tutorials for both topics.
1.1 Running the same code on more than one file
1.1.1 Using for loops
You will see that we use a lot of for-loops during this workflow. The reason is that they let us process our 3 samples with one block of code instead of three nearly identical lines.
A general introduction into for-loops can be found here.
The general syntax is:
for var in sample1 sample2 sample3
do
echo $var
done
If we ran this inside a terminal, the following should be printed to the screen:
sample1
sample2
sample3
A for-loop can also be written in a more condensed way on a single line:
for var in sample1 sample2 sample3; do echo $var; done
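Later in this tutorial we will rarely type out the sample names; instead, we will read them from a file via command substitution. A minimal sketch, assuming a file SampleList that holds one sample name per line (we will create this file in section 3.1):
#loop over all names listed in a file
for var in `cat SampleList`; do echo $var; done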
1.1.2 while read loops
This kind of loop allows us to read a file into our loop. For example, we might have two samples and want to do all possible comparisons between them. Let's say the comparisons we want to make are stored in a text file named test.txt
like this:
Sample1 Sample2
Sample2 Sample1
We can then use a while loop to access the information from both columns as follows:
while read column1 column2; do cat $column1 $column2; done < test.txt
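If you replace cat with echo, you can verify the pairing without needing files named Sample1 and Sample2 to exist:
while read column1 column2; do echo "comparing $column1 with $column2"; done < test.txt
This prints:
comparing Sample1 with Sample2
comparing Sample2 with Sample1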
1.1.3 GNU parallel
GNU parallel is another efficient way to run things in parallel using more than one CPU. Apart from a few one-liners in section 6, we won't discuss it in detail during this tutorial.
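For completeness, a minimal example of its syntax (we will use a similar pattern in section 6):
#run echo on three inputs, with up to 2 jobs running at the same time
parallel -j2 'echo {}' ::: sample1 sample2 sample3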
1.2 Program versions
Notice: this workflow was written using the program versions listed below. If you are using the NIOZ server system there is nothing you need to do, but if you run this tutorial on your own system/cluster you will need to install these tools first.
- prokka 1.14-dev
- Python 2.7.5
- perl v5.16.3
- MultiQC v1.8
- FastQC v0.11.9
- megahit v1.1.3
- metaspades SPAdes-3.13.1
- BWA v0.7.17-r1188
- Quast v4.6.3
2 Addon: Prepare test data
#set wdir
cd /export/lv1/user/ndombrowski/Assembly_workflow
#combine the 3 genomes that we will use to recruit reads from each depth
cat bins_1/indiv/* > bins_1/combined/NIOZ_bins.fna
#prepare folders for read mapping
mkdir mapping_1
#make folder per sample
for i in `cat FileLists/Samples`; do mkdir mapping_1/$i; done
#build index for our concatenated genomes
bwa index bins_1/combined/*
#create mapping files
for sample1 in `cat FileLists/Samples`; do bwa mem -t 40 bins_1/combined/NIOZ_bins.fna /export/lv4/projects/Black_Sea18/Data/RawData/Illumina/*/${sample1}*_L001_R1_001.fastq.gz /export/lv4/projects/Black_Sea18/Data/RawData/Illumina/*/${sample1}*_L001_R2_001.fastq.gz > mapping_1/${sample1}/${sample1}_v_${sample1}.sam; done
#retain properly paired reads, sort and create bam file
for sample1 in `cat FileLists/Samples`; do samtools view -f 2 -b -u mapping_1/${sample1}/${sample1}_v_${sample1}.sam | samtools sort -n -@ 4 - -o mapping_1/${sample1}/${sample1}_v_${sample1}_sorted.bam; done
#extract paired reads
for sample1 in `cat FileLists/Samples`; do samtools fastq mapping_1/${sample1}/${sample1}_v_${sample1}_sorted.bam -1 mapping_1/${sample1}/${sample1}_R1.fastq -2 mapping_1/${sample1}/${sample1}_R2.fastq;done
#check read counts (each fastq record spans 4 lines, hence the division by 4)
for sample1 in `cat FileLists/Samples`; do echo $(cat mapping_1/${sample1}/${sample1}_R1.fastq|wc -l)/4|bc; done
for sample1 in `cat FileLists/Samples`; do echo $(cat mapping_1/${sample1}/${sample1}_R2.fastq|wc -l)/4|bc; done
#cleanup
gzip mapping_1/*/*R*.fastq
#mv final files
mkdir rawdata_1
mv mapping_1/*/*R[12].fastq.gz rawdata_1
To give you a feeling about the size of the data set, below you can see how many forward (R1) and reverse reads (R2) were recruited to each of the 3 metagenomes:
R1:
424,058
595,711
473,936
R2:
424,058
595,711
473,936
If you work with your own data, this can easily be several million reads. Run times will then be much longer and you can benefit from using more CPUs to run jobs.
3 Tutorial start: Prepare your data
If you work on this tutorial outside of the NIOZ workshop, download the data from zenodo. Have patience, the files take a while to download. Exchange the paths below for wherever you downloaded the data to.
Click me if you use the tutorial outside of the workshop and want to use the zenodo files
#set a variable to the directory from which we want to work
#change this according to your preferences
wdir="/export/lv3/scratch/bioinformatics_workshop/S10_Assembly"
#go to our working directory
cd $wdir
#download the data
curl "https://zenodo.org/record/7752831/files/rawdata_1.tar.gz?download=1" --output rawdata_1.tar.gz
curl "https://zenodo.org/record/7752831/files/adaptors.tar.gz?download=1" --output adaptors.tar.gz
#unzip the data
tar xvzf rawdata_1.tar.gz
tar xvzf adaptors.tar.gz
#prepare custom scripts
mkdir scripts
wget https://raw.githubusercontent.com/ndombrowski/Assembly_tutorial/main/scripts/length%2BGC.pl -O scripts/length+GC.pl
wget https://raw.githubusercontent.com/ndombrowski/Assembly_tutorial/main/scripts/screen_list_new.pl -O scripts/screen_list_new.pl
#cleanup folder
rm *gz
Exercise: Setting up the working directory and required files
- Go into your personal folder in
/export/lv3/scratch/bioinformatics_workshop/Users/<username>
- Make a new working directory, and name it
Assemblies
- Go into that folder; this is your working directory
- Inside that folder, set a symbolic link to a folder with the metagenomic data, which is found here:
/export/lv3/scratch/bioinformatics_workshop/S10_Assembly/rawdata_1/
- Set another symbolic link to a folder with secondary data we need from here:
/export/lv3/scratch/bioinformatics_workshop/S10_Assembly/adaptors/
- Set a third symbolic link to the folder with custom scripts found here:
/export/lv3/scratch/bioinformatics_workshop/S10_Assembly/scripts/
Click me to see an answer
#set variable for the User directory
user_name="ndombrowski"
#prepare working directory
cd /export/lv3/scratch/bioinformatics_workshop/Users/${user_name}
#generate a folder to run the assembly workflow in
mkdir Assemblies
cd Assemblies
#create symbolic links to the metagenome we work with
ln -s /export/lv3/scratch/bioinformatics_workshop/S10_Assembly/rawdata_1/ .
#create symbolic link to an adapter file we need to clean up the genomes
ln -s /export/lv3/scratch/bioinformatics_workshop/S10_Assembly/adaptors/ .
#create symbolic link to a folder with custom scripts
ln -s /export/lv3/scratch/bioinformatics_workshop/S10_Assembly/scripts/ .
3.1 Prepare a list with samples to loop through
As discussed in the introduction, we will use for-loops to run the same script over our 3 different metagenomes. Something useful to have is a file that lists the samples we are working with. So, let’s start by creating such a list.
echo -e "NIOZ114\nNIOZ118\nNIOZ130" > SampleList
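To double-check, print the contents of the file; it should list one sample name per line:
cat SampleList
This should print:
NIOZ114
NIOZ118
NIOZ130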
Questions
It is always good to explore your data before doing things with it. To do this:
- Check the data structure and open the file with less (ideally without unzipping the file, for example using zcat). Do you know what is stored in each of the four lines? If you are unsure, click the button below for a more detailed explanation.
- Answer how many sequences we have in each file.
- Find out whether the R1 and R2 files (which store the forward and the reverse reads, respectively) have the same number of sequences.
Click me to see an answer
#Question1:
zcat rawdata_1/NIOZ114_R1.fastq.gz | head
The four lines in a fastq file contain:
- A sequence identifier with information about the sequencing run and the cluster. The exact contents of this line vary based on the BCL to FASTQ conversion software used.
- The sequence (the base calls; A, C, T, G and N).
- A separator, which is simply a plus (+) sign.
- The base call quality scores. These are Phred +33 encoded, using ASCII characters to represent the numerical quality scores.
In case you are unsure about the format of fastq files, you can find some more info here.
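For illustration, a single fastq record might look like the made-up example below; the instrument name and cluster coordinates in the header line are invented:
@M00001:25:000000000-A1B2C:1:1101:15589:1332 1:N:0:1
GATTTGGAACGTTGCAAGTCCAGT
+
IIIIHHHGGGFFFEEDDCCBBAA!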
#Question2 and 3:
cd rawdata_1
for sample in *gz; do echo $(($(zcat $sample | wc -l)/4)); done
cd ..
You should get the following numbers:
774200 (NIOZ114_R1)
774200 (NIOZ114_R2)
354428 (NIOZ118_R1)
354428 (NIOZ118_R2)
288432 (NIOZ130_R1)
288432 (NIOZ130_R2)
We can see that:
- The R1 and R2 files have the same number of sequences. This is important: if you ever see that the numbers differ, this indicates that there might have been a problem when downloading the data or that the sequencing center might have done some cleaning. You should follow up on this and figure out why the numbers are not the same.
- You also see that we have many more reads for sample NIOZ114 than for the other samples, which might be important to keep in mind when interpreting the data.
4 Quality control
4.1 Check initial quality
4.1.1 FastQC to determine the quality info
FastQC is one of the many tools we can use to look at the quality of the data. Among other things it looks at:
- Overall read quality
- GC bias
- Number of Ns (i.e. unassigned bases)
- Over-represented sequences (often adapter sequences)
Some examples for good and bad reports can be found here
#make folder
mkdir 2_seq_quality
mkdir 2_seq_quality/fastqc
#run fastqc
for sample in `cat SampleList`; do fastqc -t 4 rawdata_1/${sample}* -o 2_seq_quality/fastqc; done
#view report
chrome 2_seq_quality/fastqc/NIOZ114_R1_fastqc.html
If you want to exit chrome press: Control + X
Notice: on ada this is a bit slow; it is often faster to download the data to your own computer and view it there.
The output should look something like this and among others we see:
- the general sequence statistics
- the quality scores across the sequence length. The Phred score goes from 0 to 40 (bad to good) and we can see that the sequences generally are a bit worse at the beginning. This has to do with the sequencing approach and is nothing to worry about. However, if you see that the complete sequence is bad, that is something to discuss, for example, with the sequencing center.
- The average sequence quality. If you have a bump at the beginning (i.e. at the low scores), that means we have more low-quality sequences that we should get rid of.
- The duplication level. Here, most of our sequences are singletons. If you work with 16S sequences this will look a bit different and you will see more duplicated sequences.
- Overrepresented sequences. Here, especially if we have adapter sequences left, you will see the sequence information. This is important to note down: we want to remove adapters in the cleaning step, and it is important to check whether we used the right adapter information for the cleaning.
4.1.2 MultiQC to summarize the fastqc results
MultiQC is a reporting tool that parses summary statistics from results and log files generated by other bioinformatics tools. MultiQC doesn’t run other tools for you - it’s designed to be placed at the end of analysis pipelines or to be run manually when you’ve finished running your tools (Ewels et al., 2016).
#make folder
mkdir 2_seq_quality/multiqc
#open conda environments
source ~/.bashrc.conda3
conda activate multiqc
#run MultiQC
multiqc -d 2_seq_quality/fastqc -o 2_seq_quality/multiqc
#view report
chrome 2_seq_quality/multiqc/multiqc_report.html
#close conda environment and reset environment back to default (yes, run it twice to first deactivate conda and then anaconda)
conda deactivate
conda deactivate
MultiQC summarizes the data across all metagenomes and should look something like this:
We also get the plots we saw before for fastqc, now with multiple lines, each representing one of our metagenomes.
Here we also get a zoom into the adapter content. If we hover over a line we see exactly what adapter was found as well.
4.2 Run trimmomatic
Next, we want to clean our sequences. For example, we want to remove adapters as well as low quality sequences. For this we can for example use Trimmomatic.
Trimmomatic is a flexible trimmer for Illumina sequence data (Bolger et al., 2014). We will use this tool to filter our sequences and only retain them if they have a high enough quality. Additionally, we will use it to remove Ns.
While this is running (this might take a bit):
- Check the trimmomatic manual and control what the individual options do
#prepare folder
mkdir 3_trimmomatic
#run trimmomatic
for sample in `cat SampleList`; do java -jar /opt/biolinux/Trinity/trinity-plugins/Trimmomatic-0.36/trimmomatic-0.36.jar PE -threads 4 -trimlog 3_trimmomatic/${sample}.trimlog.txt rawdata_1/${sample}_R1.fastq.gz rawdata_1/${sample}_R2.fastq.gz 3_trimmomatic/${sample}_R1_paired.fastq.gz 3_trimmomatic/${sample}_R1_unpaired.fastq.gz 3_trimmomatic/${sample}_R2_paired.fastq.gz 3_trimmomatic/${sample}_R2_unpaired.fastq.gz ILLUMINACLIP:adaptors/TruSeq3-PE-2_modif_no_small.fa:1:30:10 SLIDINGWINDOW:5:22 MINLEN:100; done
#check files generated for one of our metagenomes
ll 3_trimmomatic/NIOZ114*
If we check the contents of the newly generated folder 3_trimmomatic, we should see something like the image below. Overall, 5 files are generated for each metagenome:
- A log file, which gives the read name, the surviving sequence length, the location of the first surviving base (i.e. the amount trimmed from the start), the location of the last surviving base in the original read, and the amount trimmed from the end
- The paired reads for the forward and reverse reads. In this scenario both the forward and the corresponding reverse read are retained when the quality of each is good enough
- The unpaired reads for the forward and reverse reads. For these reads, only one of the F and R reads had good quality stats and is retained
What is happening:
- ILLUMINACLIP: Cuts the adapter and other Illumina-specific sequences from the read
- SLIDINGWINDOW: Performs a sliding window trimming approach to identify low quality sequences. The window starts scanning at the 5' end and clips the read once the average quality within the window falls below a threshold.
- MINLEN: Drops the read if it is below a specified length
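As a reading aid, this is how the numeric arguments used in the trimmomatic command above break down (following the Trimmomatic manual):
#ILLUMINACLIP:adaptors/TruSeq3-PE-2_modif_no_small.fa:1:30:10
#  adapter file : seed mismatches (1) : palindrome clip threshold (30) : simple clip threshold (10)
#SLIDINGWINDOW:5:22
#  window size (5 bases) : required average quality within the window (22)
#MINLEN:100
#  drop reads shorter than 100 bp after trimming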
Using the argument -trimlog
creates a log of all read trimmings. It will contain the following details:
- the read name
- the surviving sequence length
- the location of the first surviving base, aka. the amount trimmed from the start
- the location of the last surviving base in the original read
- the amount trimmed from the end
Example output for one sample (printed to screen):
- Input Read Pairs: 262,355
- Both Surviving: 227,265 (86.62%)
- Forward Only Surviving: 21,581 (8.23%)
- Reverse Only Surviving: 7,569 (2.89%)
- Dropped: 5,940 (2.26%)
As a quick sanity check, the four categories add up to the number of input read pairs: 227,265 + 21,581 + 7,569 + 5,940 = 262,355.
4.3 Check quality of trimmed reads and check what happened
Homework
Once you are done with the full tutorial:
- Run fastqc and multiqc on the paired fastq files generated with trimmomatic. Since we will use only the paired data for the assembly, we will only look at these files. Specifically, compare the raw reads with the trimmed reads.
Click me to see an answer
#prepare folder
mkdir 2_seq_quality/fastqc_trimmed
#run fastqc
for sample in `cat SampleList`; do fastqc -t 2 3_trimmomatic/${sample}*_paired.fastq.gz -o 2_seq_quality/fastqc_trimmed; done
#make folder for multiqc
mkdir 2_seq_quality/multiqc_trimmed
#open conda environments
source ~/.bashrc.conda3
conda activate multiqc
#run multiqc
multiqc -d 2_seq_quality/fastqc* -o 2_seq_quality/multiqc_trimmed
#view report
chrome 2_seq_quality/multiqc_trimmed/multiqc_report.html
#close conda environment and reset environment back to default (yes, run it twice to first deactivate conda and then anaconda)
conda deactivate
conda deactivate
If we compare the results for e.g. NIOZ114_R1 we should see that:
- We lost a few sequences
- The few short sequences are gone
- We lost the few sequences with bad quality scores
- The adapters are gone
5 Run Assemblies
There are a number of different tools to assemble metagenomes, here, we will introduce two of them.
- Megahit
- Metaspades
5.1 Run Megahit
MEGAHIT is an ultra-fast and memory-efficient NGS assembler. It is optimized for metagenomes, but also works well on single genome and single-cell assemblies (Li et al., 2016). The benefit compared to Metaspades is that it works well while not using a lot of memory, which allows us to run it on less powerful computers.
Other features:
- Makes use of de Bruijn graphs. If you want to know more about the theory, you can check this paper
- Sequences are split into k-mers, subsequences of length k contained within a biological sequence; e.g. ATGG contains two 3-mers: ATG and TGG
- Megahit builds assemblies from k-mers with different lengths (for example: 21,29,39,59,79,99,119,141)
- In each k-mer iteration, potential errors in the graphs are cleaned
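As a toy illustration of splitting a sequence into k-mers, here is a short bash snippet (purely illustrative, not part of the workflow):
#print all 3-mers of a short example sequence
seq="ATGGC"
k=3
for ((i=0; i<=${#seq}-k; i++)); do echo ${seq:i:k}; done
This prints ATG, TGG and GGC.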
While this is running (it takes a bit), feel free to revisit your code or have a look at the assemblers' documentation pages.
#prepare folder
mkdir 4_assemblies
mkdir 4_assemblies/megahit
#run megahit on all metagenomes (one sample)
megahit -1 3_trimmomatic/NIOZ114_R1_paired.fastq.gz -2 3_trimmomatic/NIOZ114_R2_paired.fastq.gz -t 2 -o 4_assemblies/megahit/NIOZ114 --k-min 43 --k-max 127 --k-step 28
#run megahit on all metagenomes (all samples)
#for i in `cat SampleList`; do megahit -1 3_trimmomatic/${i}_R1_paired.fastq.gz -2 3_trimmomatic/${i}_R2_paired.fastq.gz -t 20 -o 4_assemblies/megahit/$i; done
#let's check how the content of an assembly looks
head 4_assemblies/megahit/NIOZ114/final.contigs.fa
#control how many contigs we get for each assembly
grep -c ">" 4_assemblies/megahit/N*/final.contigs.fa
The files generated from Megahit should look like this:
These files contain the following:
- final.contigs.fa = contains our assembly
- intermediate_contigs = a folder that contains the assemblies for the different kmer sizes
- log = log file listing what happened at each step. Here, the end is interesting as it gives the basic contig stats and should look something like this: 4512 contigs, total 7371931 bp, min 202 bp, max 138579 bp, avg 1634 bp, N50 10836 bp
Exercise
- Run Megahit also for the other two assemblies; we will need all of them to proceed with the binning. Depending on how busy the server is, you might run this as homework after today's session.
- Check how many contigs were generated for each assembly
Click me to see an answer
#run just the assemblies for the two missing metagenomes
#adjust the number of cpus used based on how busy the server is
for i in `echo -e "NIOZ118\nNIOZ130"`; do megahit -1 3_trimmomatic/${i}_R1_paired.fastq.gz -2 3_trimmomatic/${i}_R2_paired.fastq.gz -t 4 -o 4_assemblies/megahit/$i; done
#control how many contigs we get for each assembly
grep -c ">" 4_assemblies/megahit/N*/final.contigs.fa
To run this while you work through the rest of the tutorial, run the assembly for NIOZ118 and NIOZ130 in a screen or via slurm on the HPC. A short reminder on how to use screen:
#start a screen and give it a descriptive name:
screen -S testrun
#see what screens you have running with:
screen -ls
#restart an existing screen with:
screen -r testrun
#detach a screen (from outside a screen) with:
screen -d
#to completely close and remove the screen, type:
exit
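If you prefer slurm over screen, a job script could look like the minimal sketch below. Note that the resource requests are placeholders that you need to adapt to your cluster:
#!/bin/bash
#SBATCH --job-name=megahit_NIOZ118
#SBATCH --cpus-per-task=4
#SBATCH --mem=20G
megahit -1 3_trimmomatic/NIOZ118_R1_paired.fastq.gz -2 3_trimmomatic/NIOZ118_R2_paired.fastq.gz -t 4 -o 4_assemblies/megahit/NIOZ118
You would save this as, for example, run_megahit.sh and submit it with sbatch run_megahit.sh.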
5.2 Run Metaspades
Metaspades is another commonly used assembler; it is a version of the SPAdes assembler adjusted for metagenomes.
Features:
- metaSPAdes first constructs the de Bruijn graph of all reads using SPAdes
- similar to Megahit, this tool generates assemblies with different k-mers; the default is to automatically select the best range of k-mers to use
- Has an option to also run a read error correction (this is however memory intensive)
Here, we use fewer k-mers than for Megahit due to the longer run-time. When running this yourself you should try different parameters and compare how the assembly is affected by looking at the number of contigs generated, the average contig length and other parameters we will discuss a bit later.
#make folder
mkdir 4_assemblies/metaspades
#run spades on a single metagenome
#adjust the number of CPUs used accordingly
#spades.py --meta -t 4 -k 21,33,55,77 -1 3_trimmomatic/NIOZ114_R1_paired.fastq.gz -2 3_trimmomatic/NIOZ114_R2_paired.fastq.gz -o 4_assemblies/metaspades/NIOZ114
#run spades on all metagenomes
for sample in `cat SampleList`; do spades.py --meta -t 4 -k 21,39,59,99,127 -1 3_trimmomatic/${sample}_R1_paired.fastq.gz -2 3_trimmomatic/${sample}_R2_paired.fastq.gz -o 4_assemblies/metaspades/${sample}; done
#check content of a folder
#we use find to simplify the output and not list the content of subfolders
find 4_assemblies/metaspades/NIOZ114/ -maxdepth 1 -not -type d
Homework
Run Metaspades on either one or all three assemblies, view the content of the generated files and check how many contigs we have.
5.3 Run quast to evaluate the assembly
Quast can be used to view assembly statistics of a single assembly but also to compare individual assemblies. For now, we will just look at a single assembly, but on your own, do a comparison between Metaspades and Megahit. From the documentation: “MetaQUAST evaluates and compares metagenome assemblies based on alignments to close references. It is based on the QUAST genome quality assessment tool, but addresses features specific for metagenome datasets” (Mikheenko et al., 2016).
Something that is good to know if you work with single genomes: Quast allows you to load reference genomes and align your assembly against such a reference. That way, you can investigate your assembly for contamination and misassemblies. In our own experience this does not yet work well with metagenomes, because not enough reference data can be loaded.
#make folders
mkdir 5_quast/
#generate summary statistics for a single metagenome
metaquast.py 4_assemblies/megahit/NIOZ114/final.contigs.fa -t 2 --no-snps --space-efficient -o 5_quast --max-ref-number 0
#view report
nano 5_quast/report.tsv
Information on the output:
- N50: Imagine that you line up all the contigs in your assembly in the order of their sequence lengths. You have the longest contig first, then the second longest, and so on, with the shortest ones at the end. Then you start adding up the lengths of all contigs from the beginning, taking the longest contig + the second longest + the third longest and so on, until you reach the number that makes up 50% of your total assembly length. The length of the contig at which you stopped counting is your N50 number.
- L50: While N50 corresponds to a sequence length in base pairs, L50 represents a number of sequences. L50 is simply the rank of the contig that gives you the N50 length.
Often the N50 is viewed as a good score for a good assembly: ideally a good genome has few contigs and thus a high N50, while a bad, fragmented genome should have a low N50. This is why we generally view a large N50 as an indicator of a good assembly. Issues with this are discussed in this blog post. In brief, if you filter your assembly by removing short contigs you can artificially inflate the N50. Another issue with the N50 is that it does not evaluate the correctness of an assembly. This is discussed in the follow-up blog post. There are some proposed solutions, but these typically require reference genomes to compare our genomes to, which of course is more tricky for metagenomes. You can try to generate a reference database for Quast; however, to our knowledge, there is no ideal solution around this yet.
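To make the N50/L50 definitions concrete, here is a toy computation on seven made-up contig lengths (the awk one-liner is just for illustration):
#contig lengths, already sorted from longest to shortest; total length is 300
printf "%s\n" 80 70 50 40 30 20 10 | awk '{len[NR]=$1; total+=$1} END{for(i=1;i<=NR;i++){cum+=len[i]; if(cum>=total/2){print "N50="len[i], "L50="i; exit}}}'
The cumulative sum reaches 150 (half of 300) at the second contig, so this prints N50=70 L50=2.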
Homework
- Run all 3 assemblies both with Metaspades and Megahit. Watch the resources on ada when running this!
- Compare the statistics with quast
Click me to see an answer
#run megahit
#for i in `cat SampleList`; do megahit -1 3_trimmomatic/${i}_R1_paired.fastq.gz -2 3_trimmomatic/${i}_R2_paired.fastq.gz -t 10 -o 4_assemblies/megahit/$i; done
#run metaspades
#for sample in `cat SampleList`; do spades.py --meta -t 10 -k 21,39,59,99,127 -1 3_trimmomatic/${sample}_R1_paired.fastq.gz -2 3_trimmomatic/${sample}_R2_paired.fastq.gz -o 4_assemblies/metaspades/${sample}; done
#make folders
mkdir 5_quast/
mkdir 5_quast/all_comparisons
#compare all genomes w/o references as input
metaquast.py 4_assemblies/megahit/NIOZ*/final.contigs.fa 4_assemblies/metaspades/NIOZ*/contigs.fasta -t 10 --no-snps --space-efficient -o 5_quast/all_comparisons --max-ref-number 0
#view report
less -S 5_quast/all_comparisons/report.tsv
The output from the tool should look something like this:
Contigs:
Pre-processing…
1 4_assemblies/megahit/NIOZ114/final.contigs.fa.gz ==> NIOZ114_final.contigs
2 4_assemblies/megahit/NIOZ118/final.contigs.fa.gz ==> NIOZ118_final.contigs
3 4_assemblies/megahit/NIOZ130/final.contigs.fa.gz ==> NIOZ130_final.contigs
4 4_assemblies/metaspades/NIOZ114/contigs.fasta ==> NIOZ114_contigs
5 4_assemblies/metaspades/NIOZ118/contigs.fasta ==> NIOZ118_contigs
6 4_assemblies/metaspades/NIOZ130/contigs.fasta ==> NIOZ130_contigs
If we check the N50 and N75, Megahit and Metaspades give similar scores. Which tool we decide on might also depend on the available memory, where Megahit usually does better.
6 Assembly curation + basic stats (length, GC)
Notice:
Depending on how the sequence header looks after your assembly, the first step might need to be adjusted a bit if you run this on your own data.
The goal of this section is to generate a concise, informative but simple header. For example, if you generate assemblies from multiple metagenomes it is useful to not only have the contig number but also a sample ID in the header name. A good rule of thumb is to avoid spaces or other weird symbols when choosing sample IDs.
For this workflow:
- We check how the sequence header of our contigs looks. The initial header from Megahit should look something like this: k141_5 flag=1 multi=2.0000 len=570. This has some useful information, but if we deal with multiple assemblies it is not very informative and there is a chance that we easily mix things up. So we first want to add the sample ID into the sequence header as well (a one-line demo of the renaming follows after this list).
- We store the header information from Megahit in a new text document.
- We shorten the header and keep just the sample and the contig ID; a header might then look like this: NIOZ114_sc5_1 (note that we also rename k141_ to sc along the way).
- We calculate the length and GC of each contig. This is a bit redundant, since the Megahit header provides that info, but other tools, such as Metaspades, do not give these numbers by default.
- Generally, contigs shorter than 2500 bp are not very informative and are often misbinned (if you assemble these contigs into MAGs), so for now we remove them and will not work with them in later parts of the workflow.
- We count how many contigs we have left.
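To see what the renaming in the first command below does, you can test it on a single example header (with the sample ID hard-coded to NIOZ114):
echo ">k141_5 flag=1 multi=2.0000 len=570" | sed "s/>/>NIOZ114_/g" | sed "s/k141_/sc/g" | sed "s/ /_1 /1"
This returns: >NIOZ114_sc5_1 flag=1 multi=2.0000 len=570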
#add the sample id into the file header for the assembly file to avoid bugs and easily identify where the contig originally came from
parallel -j3 'i={}; sed "s/>/>${i}_/g" 4_assemblies/megahit/$i/final.contigs.fa | sed "s/k141_/sc/g" | sed "s/ /_1 /1" > 4_assemblies/megahit/${i}/${i}_temp' ::: `cat SampleList`
#check what we did
head 4_assemblies/megahit/NIOZ114/NIOZ114_temp
#store the header info in an additional file (to keep the length)
parallel -j3 'i={}; grep ">" 4_assemblies/megahit/${i}/${i}_temp > 4_assemblies/megahit/${i}/${i}_info_megahit.txt' ::: `cat SampleList`
#clean the header and remove everything after the space
parallel -j3 'i={}; cut -f1 -d " " 4_assemblies/megahit/${i}/${i}_temp > 4_assemblies/megahit/${i}/${i}_contigs.fna' ::: `cat SampleList`
#check how the header looks now
head 4_assemblies/megahit/NIOZ114/NIOZ114_contigs.fna
#check how many contigs we have --> 6,000 to 14,000 contigs
grep -c ">" 4_assemblies/megahit/NIOZ*/NIOZ*_contigs.fna
#make a table with the length and GC for each contig
parallel -j3 'i={}; perl scripts/length+GC.pl 4_assemblies/megahit/${i}/${i}_temp > 4_assemblies/megahit/${i}/${i}_length_gc.txt' ::: `cat SampleList`
#check output file
head 4_assemblies/megahit/NIOZ114/NIOZ114_length_gc.txt
#as a sanity check, count the number of contigs for each assembly --> 6,000 to 14,000 contigs
wc -l 4_assemblies/megahit/N*/N*_length_gc.txt
#make list for contigs above a certain length threshold (i.e. only keep contigs with >2500 bp)
for i in `cat SampleList`; do awk '$6>=2500 {print $1,$5,$6}' 4_assemblies/megahit/${i}/${i}_length_gc.txt > 4_assemblies/megahit/${i}/${i}_Contigs_2500.txt; done
#check how one of the output file looks like
head 4_assemblies/megahit/NIOZ114/NIOZ114_Contigs_2500.txt
#check how many contigs we have now
#for NIOZ114 we should have around 705 instead of the initial 6000 contigs
#we can also see that almost 90% of the contigs were removed
wc -l 4_assemblies/megahit/N*/N*_Contigs_2500.txt
#count nr of reads for >2500 and total length
for i in `cat SampleList`; do awk '{Length+=$3}END{print FILENAME,NR,Length}' 4_assemblies/megahit/${i}/${i}_Contigs_2500.txt; done
The output file, i.e. 4_assemblies/megahit/NIOZ114/NIOZ114_Contigs_2500.txt, should look something like this in the end; for each contig it gives the GC and length information:
NIOZ114_sc48_1 0.582 3180
NIOZ114_sc51_1 0.344 2979
NIOZ114_sc85_1 0.333 3851
NIOZ114_sc96_1 0.334 7741
The summary statistics for the number of contigs > 2500 bp and total sequence length should look something like this:
4_assemblies/megahit/NIOZ114/NIOZ114_Contigs_2500.txt 705 12904683
4_assemblies/megahit/NIOZ118/NIOZ118_Contigs_2500.txt 118 1374909
4_assemblies/megahit/NIOZ130/NIOZ130_Contigs_2500.txt 422 2038370
Now let's remove the short contigs from our actual sequence files:
#for each assembly make a new fna file that only includes contigs >2500bp
for i in `cat SampleList`; do perl scripts/screen_list_new.pl <(cut -f1 -d " " 4_assemblies/megahit/${i}/${i}_Contigs_2500.txt) 4_assemblies/megahit/${i}/${i}_contigs.fna keep > 4_assemblies/megahit/${i}/${i}_contigs_2500.fna; done
#control step
grep -c ">" 4_assemblies/megahit/N*/*_contigs_2500.fna
The files we get should contain the following numbers of contigs (these numbers need to be consistent with the numbers from the step #count nr of reads for >2500 and total length above). Again, if you just worked with NIOZ114, you might get some error messages for the missing samples that you can ignore:
4_assemblies/megahit/NIOZ114/NIOZ114_contigs_2500.fna:705
4_assemblies/megahit/NIOZ118/NIOZ118_contigs_2500.fna:118
4_assemblies/megahit/NIOZ130/NIOZ130_contigs_2500.fna:422
Now we can clean things up to not store large or unnecessary files:
#rm temp files
for i in `cat SampleList`; do rm 4_assemblies/megahit/${i}/${i}_temp; done
#gzip "old" contigs
for i in `cat SampleList`; do gzip 4_assemblies/megahit/${i}/final.contigs.fa; done
for i in `cat SampleList`; do gzip 4_assemblies/megahit/${i}/*_info_megahit.txt; done
#rm intermediate contig files, since they take up space
rm -f 4_assemblies/megahit/*/intermediate_contigs/*fa
6.1 Combine all contig info into one file
#prepare folder
mkdir 4_assemblies/megahit/AllContigs
#combine contigs (this can be useful if you would want to annotate the contigs, not the MAGs)
cat 4_assemblies/megahit/N*/N*_contigs_2500.fna > 4_assemblies/megahit/AllContigs/All_contigs_2500.fna
#count the total nr of contigs we have: 1245
grep -c ">" 4_assemblies/megahit/AllContigs/All_contigs_2500.fna
#combine length and GC statistics for each contig into one table
cat 4_assemblies/megahit/N*/N*_Contigs_2500.txt > 4_assemblies/megahit/AllContigs/All_contigs_2500_length_GC.txt
#check the number of contigs in that table: 1245
wc -l 4_assemblies/megahit/AllContigs/All_contigs_2500_length_GC.txt
#check the contents of the summary file
head 4_assemblies/megahit/AllContigs/All_contigs_2500_length_GC.txt
We have a total of 1245 contigs from our 3 metagenomes with a length greater than 2,500 bp. If we just worked with NIOZ114 we got 705 contigs with a length greater than 2,500 bp.