6 Amplicon long-read analysis

Now that we have learned how to use the command line to issue commands and use an HPC we can now analyse our long-read amplicon data and:

Perform quality filtering by choosing appropriate cutoffs based on our quality assessment we did in the prior section
Align the quality filtered reads to a reference database
Generate a count matrix

Important: All of this should be run on the Crunchomics HPC, so be sure that you connected with ssh -X user@omics-h0.science.uva.nl.

6.1 `chopper`: Quality cleaning

In this section, you will learn how to perform quality filtering of your amplicon long-read data using chopper. Quality control is an essential first step to remove low quality reads that introduce errors in downstream analyses such as sequence alignment. Typical things you want to do are:

Remove sequence adapters, barcodes, primer sequences (this already has been done for your data)
Remove reads with a too low quality score
Remove reads that are too long or too short
Trim read ends if they have low quality scores

chopper is designed for filtering and trimming of long-read data and you can install chopper with mamba:

mamba create -n chopper_0.11.0 -c bioconda chopper=0.11.0

In this tutorial, you will always receive example code as a starting point, with sections containing ... for you to fill in. If you get stuck in this section, first check chopper’s documentation or use the built-in help via the CLI (chopper -h).

Tasks

Use your NanoPlot quality plots from the previous section and choose some quality and length cutoffs.

Task 1: Perform quality filtering using chopper

Do the following:

Create a new folder to store the results and name the output file barcodeX_filtered.fastq.gz
Request 2 CPUs and 20G of memory with srun, and ensure chopper itself also uses 2 threads
Investigate the Nanoplot LengthvsQualityScatterPlot.html that you generated in the last exercise and decide on the right thresholds to use with chopper to:
- Discard sequences that are too long or too short
- Discard reads with a too low phred score
Check the tools manual carefully for how to deal with .gz files. Hint: Closely check the example section

IMPORTANT: After chopper finished you will see something like “Kept 1985conda de reads out of 3785 reads”. If you see less than 50% of the reads remaining use less strict length and quality filters. You need enough reads for the next steps

Task 2: Assess the quality of the filtering step

Investigate the quality of the filtered FASTQ file by running:

NanoPlot
seqkit
- Important: Be sure that you name the output file something like barcodeX_filtered.tsv, we need this file later!

Afterwards open NanoPlots and seqkit. Investigate the outputs and record:

The total number of sequences that went into the analysis
The total number of sequences that passed the quality filtering

# Generate an output folder for the results
... 

# Activate the chopper conda environment
...

# Run chopper
srun --cpus-per-task 2 --mem=20G chopper ...

# Deactivate the conda environment 
...

# Make a folder for the Nanoplot results and run Nanoplot 
# (don't forget to activate the right conda environment first and deactivate it after)
srun ...

# Run seqkit
# Make sure that the output file is something like path/barcodeX_filtered.tsv
srun ...

Click to see the answer

# Generate an output folder for the results
mkdir results/chopper

# Activate the chopper conda environment
conda activate chopper_0.11.0

# Run chopper
srun --cpus-per-task 2 --mem=20G chopper \
    -i data/barcodeX.fastq.gz \
    -q 12 \
    --minlength 1400 --maxlength 1540 \
    --threads 2 | gzip \
    > results/chopper/barcodeX_filtered.fastq.gz

# Deactivate the conda environment 
conda deactivate

# Run Nanoplot
conda activate nanoplot

mkdir -p results/nanoplot/cleaned

srun --cpus-per-task 2 --mem=10G NanoPlot \
    --fastq results/chopper/barcodeX_filtered.fastq.gz \
    -t 2 \
    --tsv_stats \
    --plots dot \
    -o results/nanoplot/cleaned

conda deactivate

# Run seqkit 
srun --cpus-per-task=1 --mem=5G seqkit stats results/chopper/barcodeX_filtered.fastq.gz  \
    -Tao results/seqkit/barcodeX_filtered.tsv --threads 1

Below are example results from one of my analyses. These numbers will be different for you, however, this process will make you more familiar for what to look for in your own analyses.

Input: 29,468 reads
Passed filtering: 24,600 reads (~80%). You ideally want to keep 50%-90% of the reads for the down-stream analyses depending on the quality of the data and how many reads you generated.

6.2 `Minimap2`: Map reads to reference database

Nanopore long-read sequencing can produce reads that span nearly the entire 16S rRNA gene, making it useful for identifying microbial species. However, these long reads come with a trade-off: they have a higher error rate compared to short-read technologies.

Assume that your average read quality is Q10. The Phred score (Q) is a measure of how confident the sequencer is in each base call. The probability of a base being called incorrectly can be calculated as:

\[P=10^{-Q/10}\]

In R we can compute this with:

Q <- 12
P_error <- 10^(-Q/10)
percent_error <- P_error * 100
message("Error rate: ", round(percent_error, 2), "%")

Error rate: 6.31%

This shows that with a mean Phred quality of 12, each base has roughly a 6% chance of being incorrect. For a 1,500 bp 16S amplicon we expect 90 errors per sequence with such a quality score. While Nanopore long reads are useful because they capture the full 16S rRNA gene, a high level of errors can make species-level identification unreliable.

To classify our filtered Nanopore reads, we use Minimap2 to map each read against a reference database of 16S rRNA sequences from all the strains used in your lab experiments. Mapping identifies the reference sequence most similar to each read. The read is then aligned base-by-base to the corresponding reference sequence to check for exact matches or mismatches. Depending on the similarity, one or more hits may be recorded for each read.

Mapping tools like Minimap2 are well-suited for Nanopore data because they tolerate mismatches as well as small insertions and deletions, which are common in error-prone long reads. The output of Minimap2 is a PAF (Pairwise mApping Format) file, which contains information about how well each read maps to the reference sequences.

Some reads may map equally well to multiple reference sequences (multi-mappers), particularly if the strains have highly similar 16S sequences (for example the different Pseudomonas strains you worked with in the lab). These multi-mapping reads are important to keep in mind when summarizing taxonomic abundances.

Minimap2 is installed by default on Crunchomics.

Tasks

Task 1: Get the reference database

We have already prepared a non-redundant database of 16S rRNA gene sequences. Download it to your data/ folder using:

wget https://raw.githubusercontent.com/ndombrowski/MicEco2025/refs/heads/main/data/amplicons_nr.fasta -P data

This file (amplicons_nr.fasta) contains one to two representative 16S rRNA sequences per strain used in the lab. View the file with head and use grep to count how many sequences are in the file.

Task 2: Run minimap2

Next, you will map your filtered reads against the reference database using minimap2.

Before running, make sure to check all the available options with minimap2 -h and then make sure tha tyou:

Create a folder to store your results
Use srun with 2 CPUs and 20G of memory
Use -cx map-ont for accurate Nanopore vs reference mapping
Ensure that the output is in PAF format
Provide both
- data/amplicons_nr.fasta (the reference)
- barcodeX_filtered.fastq.gz (the cleaned fastq sequences)
Use > results/minimap2/barcodeX.paf to store the output

Task 3: Explore the output

After the alignment finishes:

Count the number of alignments with wc
View the file with less and scroll through the first few lines to inspect the structure.
Check the minimap2 manual to find out what the different columns mean.

Answer the following questions:

How many alignments are there? Is this number higher or lower than your total number of input reads? Why might that be?
Look at the first 12 columns of the PAF file. Which of these could help you filter unreliable alignments or identify multi-mapping reads?

# Count the number of reference sequences
... 

# Generate a suitable output folder 
...

# Run minimap2 
...

# Count the number of alignments
...

Click to see the answer

# Count the number of reference sequences
grep -c ">"  data/amplicons_nr.fasta

# Generate a suitable output folder 
mkdir results/minimap2

# Run minimap2 
srun --cpus-per-task 2 --mem=20G minimap2 \
        -cx map-ont -t 2 \
        data/amplicons_nr.fasta \
        results/chopper/barcodeX_filtered.fastq.gz \
        > results/minimap2/barcodeX.paf

# Count number of alignments: 2059 (-1 because of the header)
wc -l results/minimap2/barcodeX.paf

# Explore the file itself
less -S results/minimap2/barcodeX.paf

Answer:

There are 45,318 alignments for 24,600 reads. If you see more alignments then there are reads, there might be different reasons for that:
1. If your experiment included Pseudomonas strains you likely will have more alignments than reads. The higher number of alignments occurs because some reads might align equally well to multiple reference 16S sequences, these are multi-mapping reads.
2. Some reads might only partially or weakly align to highly similar regions in the 16S rRNA gene
Useful PAF columns for assessing alignment reliability include:
- Column 10: number of matching bases (can be use to calculate % identity)
- Column 11: total length of the alignment (including gaps; can be use to calculate % identity)
- Column 12: mapping quality. 0 is used to mark and thereby identify multi-mappers
- Columns 3–4: read start and end positions (can be used to calculate how much of the read aligned)

Information on the PAF output:

Col	Description
1	Query sequence name
2	Query sequence length
3	Query start coordinate (0-based)
4	Query end coordinate (0-based)
5	‘+’ if query/target on the same strand; ‘-’ if opposite
6	Target sequence name
7	Target sequence length
8	Target start coordinate on the original strand
9	Target end coordinate on the original strand
10	Number of matching bases in the mapping
11	Number bases, including gaps, in the mapping
12	Mapping quality (0-255 with 255 for missing)

6.3 Generate a count table

Now that we have our alignment results, our goal is to summarize them into a count table, a matrix showing how many reads aligned to each reference 16S rRNA sequence. Before creating the count table, we first need to filter our results to make sure we only count high-confidence matches by:

Removing low-confidence hits by filtering out alignments with:
- Low sequence identity (many mismatches)
- Low coverage (only part of the read mapped to the reference)
- Low mapping quality (reads could map equally well elsewhere)
Selecting the single best alignment for each read (to handle multi-mappers)

Afterwards, we can count how many reads map to each reference sequence. These counts can then be summarized at higher taxonomic levels (e.g., genus) to account for the uncertainty due to sequencing errors.

Instead of using a single pre-built program, we will use a custom Python script to perform this step. So why use a tool like Python or R for this?

Tools like Minimap2 produce large, text-based alignment files. While command-line tools are great for specific tasks (like counting lines or filtering by quality), programming languages like Python and R allow you to:

Combine multiple filtering criteria (e.g. identity, coverage, quality)
Parse complex data formats
Summarize and visualize results reproducible

Tasks

Task 1: Install the necessary software

To run the required software, we first need to install some packages to parse the data. You can do this by running the following:

mamba create -n python_3.10.18 -c conda-forge python=3.10.18 matplotlib pandas

Task 2: Get the python scripts

Download the two required files:

wget https://raw.githubusercontent.com/ndombrowski/MicEco2025/refs/heads/main/scripts/paf_to_matrix.py -P scripts
wget https://raw.githubusercontent.com/ndombrowski/MicEco2025/refs/heads/main/data/accession_to_genus.tsv -P data

paf_to_matrix.py is the python script that filters alignment results and generates count tables
accession_to_genus.tsv is a two-column tab-delimited file linking each 16S rRNA gene accession to its genus

You can inspect the script’s help menu to understand its required inputs by typing python scripts/paf_to_matrix.py -h. The main arguments are:

-i: path to the folder containing PAF alignment files
-o: output folder
-t: path to two-column TSV file linking accessions to genus names
-s: (optional) seqkit stats file generated with _Tao, used to calculate unmapped reads
Filtering thresholds for identity, coverage, and mapping quality

Task 3: Generate the count table

Now, use paf_to_matrix.py to convert your alignment results into a count table. You will need to:

Make sure to activate the conda environment we setup in Task 1
Provide the path to your PAF alignment folder (-i)
Add your seqkit stats file (-s)
Include the accession-to-genus mapping file (-t)
Choose the folder in which you want to save your results (-o)

Hint: Python scripts like this one are not computationally demanding, you can safely run it on the login node without using srun.

# Activate the python conda environment 
... 

# Make folder to store the results 
...

# Run the python script
python scripts/paf_to_matrix.py \
  -i ... \
  -s ... \
  -t ... \
  -o ...

Task 4: Explore the count table

paf_to_matrix.py will:

Print a brief summary of the filtering to the screen
Generate several tables:
- otu_table_multimappers.tsv: A count table based on all alignments, including multi-mappers
- otu_table.tsv: A count table based on only best (primary) hits per read
- otu_table_genus.tsv: A count table summarized at the genus rank

Inspect the outputs and answer the following:

How many reads were lost due to the filtering? Check the options of python scripts/paf_to_matrix.py -h and think about whether there are any parameters you would change in case too many reads were removed?
Are the taxa listed in your count tables consistent with the strains used in your experiment?
Do the relative abundances of taxa match your expectations? If not, what biological or technical factors could explain the differences?

Click to see the answer

# Activate the python conda environment 
conda activate python_3.10.18

# Make folder to store the results 
mkdir results/tables

# Run the python script
python scripts/paf_to_matrix.py \
  -i results/minimap2/ \
  -s results/seqkit/barcodeX_filtered.tsv \
  -t data/accession_to_genus.tsv \
  -o results/tables

conda deactivate

About 12% of reads were lost during the filtering. If you had more alignments than reads, the filtered reads were likely secondary (less confident) alignments. You will also loose potential contaminations during this step. That is because reads coming from contaminant bacteria might still map to sequences in the reference database but with much lower sequence identity and therefore will be discarded.
Most reads map to Streptococcus and Lactococcus, consistent with the experiment. A few unassigned reads may come from contamination or poor-quality reads.
Streptococcus had ~10× more reads than Lactococcus. This may indicate stronger growth due to better adaptation to liquid media or competitive advantage in the co-culture.

6.1 chopper: Quality cleaning

6.2 Minimap2: Map reads to reference database

6.3 Generate a count table

6.1 `chopper`: Quality cleaning

6.2 `Minimap2`: Map reads to reference database