mamba create -n chopper_0.11.0 -c bioconda chopper=0.11.06 Amplicon long-read analysis
Now that we have learned how to use the command line to issue commands and use an HPC we can now analyse our long-read amplicon data and:
- Perform quality filtering by choosing appropriate cutoffs based on our quality assessment we did in the prior section
- Align the quality filtered reads to a reference database
- Generate a count matrix
Important: All of this should be run on the Crunchomics HPC, so be sure that you connected with ssh -X user@omics-h0.science.uva.nl.
6.1 chopper: Quality cleaning
In this section, you will learn how to perform quality filtering of your amplicon long-read data using chopper. Quality control is an essential first step to remove low quality reads that introduce errors in downstream analyses such as sequence alignment. Typical things you want to do are:
- Remove sequence adapters, barcodes, primer sequences (this already has been done for your data)
- Remove reads with a too low quality score
- Remove reads that are too long or too short
- Trim read ends if they have low quality scores
chopper is designed for filtering and trimming of long-read data and you can install chopper with mamba:
In this tutorial, you will always receive example code as a starting point, with sections containing ... for you to fill in. If you get stuck in this section, first check chopper’s documentation or use the built-in help via the CLI (chopper -h).
Use your NanoPlot quality plots from the previous section and choose some quality and length cutoffs.
Task 1: Perform quality filtering using chopper
Do the following:
- Create a new folder to store the results and name the output file
barcodeX_filtered.fastq.gz - Request 2 CPUs and 20G of memory with
srun, and ensure chopper itself also uses 2 threads - Investigate the Nanoplot LengthvsQualityScatterPlot.html that you generated in the last exercise and decide on the right thresholds to use with chopper to:
- Discard sequences that are too long or too short
- Discard reads with a too low phred score
- Check the tools manual carefully for how to deal with
.gzfiles. Hint: Closely check the example section
IMPORTANT: After chopper finished you will see something like “Kept 1985conda de reads out of 3785 reads”. If you see less than 50% of the reads remaining use less strict length and quality filters. You need enough reads for the next steps
Task 2: Assess the quality of the filtering step
Investigate the quality of the filtered FASTQ file by running:
- NanoPlot
- seqkit
- Important: Be sure that you name the output file something like
barcodeX_filtered.tsv, we need this file later!
- Important: Be sure that you name the output file something like
Afterwards open NanoPlots and seqkit. Investigate the outputs and record:
- The total number of sequences that went into the analysis
- The total number of sequences that passed the quality filtering
# Generate an output folder for the results
...
# Activate the chopper conda environment
...
# Run chopper
srun --cpus-per-task 2 --mem=20G chopper ...
# Deactivate the conda environment
...
# Make a folder for the Nanoplot results and run Nanoplot
# (don't forget to activate the right conda environment first and deactivate it after)
srun ...
# Run seqkit
# Make sure that the output file is something like path/barcodeX_filtered.tsv
srun ...# Generate an output folder for the results
mkdir results/chopper
# Activate the chopper conda environment
conda activate chopper_0.11.0
# Run chopper
srun --cpus-per-task 2 --mem=20G chopper \
-i data/barcodeX.fastq.gz \
-q 12 \
--minlength 1400 --maxlength 1540 \
--threads 2 | gzip \
> results/chopper/barcodeX_filtered.fastq.gz
# Deactivate the conda environment
conda deactivate
# Run Nanoplot
conda activate nanoplot
mkdir -p results/nanoplot/cleaned
srun --cpus-per-task 2 --mem=10G NanoPlot \
--fastq results/chopper/barcodeX_filtered.fastq.gz \
-t 2 \
--tsv_stats \
--plots dot \
-o results/nanoplot/cleaned
conda deactivate
# Run seqkit
srun --cpus-per-task=1 --mem=5G seqkit stats results/chopper/barcodeX_filtered.fastq.gz \
-Tao results/seqkit/barcodeX_filtered.tsv --threads 1Below are example results from one of my analyses. These numbers will be different for you, however, this process will make you more familiar for what to look for in your own analyses.
- Input: 29,468 reads
- Passed filtering: 24,600 reads (~80%). You ideally want to keep 50%-90% of the reads for the down-stream analyses depending on the quality of the data and how many reads you generated.
6.2 Minimap2: Map reads to reference database
Nanopore long-read sequencing can produce reads that span nearly the entire 16S rRNA gene, making it useful for identifying microbial species. However, these long reads come with a trade-off: they have a higher error rate compared to short-read technologies.
Assume that your average read quality is Q10. The Phred score (Q) is a measure of how confident the sequencer is in each base call. The probability of a base being called incorrectly can be calculated as:
\[P=10^{-Q/10}\]
In R we can compute this with:
Q <- 12
P_error <- 10^(-Q/10)
percent_error <- P_error * 100
message("Error rate: ", round(percent_error, 2), "%")Error rate: 6.31%
This shows that with a mean Phred quality of 12, each base has roughly a 6% chance of being incorrect. For a 1,500 bp 16S amplicon we expect 90 errors per sequence with such a quality score. While Nanopore long reads are useful because they capture the full 16S rRNA gene, a high level of errors can make species-level identification unreliable.
To classify our filtered Nanopore reads, we use Minimap2 to map each read against a reference database of 16S rRNA sequences from all the strains used in your lab experiments. Mapping identifies the reference sequence most similar to each read. The read is then aligned base-by-base to the corresponding reference sequence to check for exact matches or mismatches. Depending on the similarity, one or more hits may be recorded for each read.
Mapping tools like Minimap2 are well-suited for Nanopore data because they tolerate mismatches as well as small insertions and deletions, which are common in error-prone long reads. The output of Minimap2 is a PAF (Pairwise mApping Format) file, which contains information about how well each read maps to the reference sequences.
Some reads may map equally well to multiple reference sequences (multi-mappers), particularly if the strains have highly similar 16S sequences (for example the different Pseudomonas strains you worked with in the lab). These multi-mapping reads are important to keep in mind when summarizing taxonomic abundances.
Minimap2 is installed by default on Crunchomics.
Task 1: Get the reference database
We have already prepared a non-redundant database of 16S rRNA gene sequences. Download it to your data/ folder using:
wget https://raw.githubusercontent.com/ndombrowski/MicEco2025/refs/heads/main/data/amplicons_nr.fasta -P dataThis file (amplicons_nr.fasta) contains one to two representative 16S rRNA sequences per strain used in the lab. View the file with head and use grep to count how many sequences are in the file.
Task 2: Run minimap2
Next, you will map your filtered reads against the reference database using minimap2.
Before running, make sure to check all the available options with minimap2 -h and then make sure tha tyou:
- Create a folder to store your results
- Use srun with 2 CPUs and 20G of memory
- Use
-cx map-ontfor accurate Nanopore vs reference mapping - Ensure that the output is in PAF format
- Provide both
data/amplicons_nr.fasta(the reference)barcodeX_filtered.fastq.gz(the cleaned fastq sequences)
- Use
> results/minimap2/barcodeX.pafto store the output
Task 3: Explore the output
After the alignment finishes:
- Count the number of alignments with
wc - View the file with
lessand scroll through the first few lines to inspect the structure. - Check the minimap2 manual to find out what the different columns mean.
Answer the following questions:
- How many alignments are there? Is this number higher or lower than your total number of input reads? Why might that be?
- Look at the first 12 columns of the PAF file. Which of these could help you filter unreliable alignments or identify multi-mapping reads?
# Count the number of reference sequences
...
# Generate a suitable output folder
...
# Run minimap2
...
# Count the number of alignments
...# Count the number of reference sequences
grep -c ">" data/amplicons_nr.fasta
# Generate a suitable output folder
mkdir results/minimap2
# Run minimap2
srun --cpus-per-task 2 --mem=20G minimap2 \
-cx map-ont -t 2 \
data/amplicons_nr.fasta \
results/chopper/barcodeX_filtered.fastq.gz \
> results/minimap2/barcodeX.paf
# Count number of alignments: 2059 (-1 because of the header)
wc -l results/minimap2/barcodeX.paf
# Explore the file itself
less -S results/minimap2/barcodeX.pafAnswer:
- There are 45,318 alignments for 24,600 reads. If you see more alignments then there are reads, there might be different reasons for that:
- If your experiment included Pseudomonas strains you likely will have more alignments than reads. The higher number of alignments occurs because some reads might align equally well to multiple reference 16S sequences, these are multi-mapping reads.
- Some reads might only partially or weakly align to highly similar regions in the 16S rRNA gene
- Useful PAF columns for assessing alignment reliability include:
- Column 10: number of matching bases (can be use to calculate % identity)
- Column 11: total length of the alignment (including gaps; can be use to calculate % identity)
- Column 12: mapping quality. 0 is used to mark and thereby identify multi-mappers
- Columns 3–4: read start and end positions (can be used to calculate how much of the read aligned)
Information on the PAF output:
| Col | Description |
|---|---|
| 1 | Query sequence name |
| 2 | Query sequence length |
| 3 | Query start coordinate (0-based) |
| 4 | Query end coordinate (0-based) |
| 5 | ‘+’ if query/target on the same strand; ‘-’ if opposite |
| 6 | Target sequence name |
| 7 | Target sequence length |
| 8 | Target start coordinate on the original strand |
| 9 | Target end coordinate on the original strand |
| 10 | Number of matching bases in the mapping |
| 11 | Number bases, including gaps, in the mapping |
| 12 | Mapping quality (0-255 with 255 for missing) |
6.3 Generate a count table
Now that we have our alignment results, our goal is to summarize them into a count table, a matrix showing how many reads aligned to each reference 16S rRNA sequence. Before creating the count table, we first need to filter our results to make sure we only count high-confidence matches by:
- Removing low-confidence hits by filtering out alignments with:
- Low sequence identity (many mismatches)
- Low coverage (only part of the read mapped to the reference)
- Low mapping quality (reads could map equally well elsewhere)
- Selecting the single best alignment for each read (to handle multi-mappers)
Afterwards, we can count how many reads map to each reference sequence. These counts can then be summarized at higher taxonomic levels (e.g., genus) to account for the uncertainty due to sequencing errors.
Instead of using a single pre-built program, we will use a custom Python script to perform this step. So why use a tool like Python or R for this?
Tools like Minimap2 produce large, text-based alignment files. While command-line tools are great for specific tasks (like counting lines or filtering by quality), programming languages like Python and R allow you to:
- Combine multiple filtering criteria (e.g. identity, coverage, quality)
- Parse complex data formats
- Summarize and visualize results reproducible
Task 1: Install the necessary software
To run the required software, we first need to install some packages to parse the data. You can do this by running the following:
mamba create -n python_3.10.18 -c conda-forge python=3.10.18 matplotlib pandasTask 2: Get the python scripts
Download the two required files:
wget https://raw.githubusercontent.com/ndombrowski/MicEco2025/refs/heads/main/scripts/paf_to_matrix.py -P scripts
wget https://raw.githubusercontent.com/ndombrowski/MicEco2025/refs/heads/main/data/accession_to_genus.tsv -P datapaf_to_matrix.pyis the python script that filters alignment results and generates count tablesaccession_to_genus.tsvis a two-column tab-delimited file linking each 16S rRNA gene accession to its genus
You can inspect the script’s help menu to understand its required inputs by typing python scripts/paf_to_matrix.py -h. The main arguments are:
-i: path to the folder containing PAF alignment files-o: output folder-t: path to two-column TSV file linking accessions to genus names-s: (optional) seqkit stats file generated with _Tao, used to calculate unmapped reads- Filtering thresholds for identity, coverage, and mapping quality
Task 3: Generate the count table
Now, use paf_to_matrix.py to convert your alignment results into a count table. You will need to:
- Make sure to activate the conda environment we setup in Task 1
- Provide the path to your PAF alignment folder (
-i) - Add your seqkit stats file (
-s) - Include the accession-to-genus mapping file (
-t) - Choose the folder in which you want to save your results (
-o)
Hint: Python scripts like this one are not computationally demanding, you can safely run it on the login node without using srun.
# Activate the python conda environment
...
# Make folder to store the results
...
# Run the python script
python scripts/paf_to_matrix.py \
-i ... \
-s ... \
-t ... \
-o ...Task 4: Explore the count table
paf_to_matrix.py will:
- Print a brief summary of the filtering to the screen
- Generate several tables:
otu_table_multimappers.tsv: A count table based on all alignments, including multi-mappersotu_table.tsv: A count table based on only best (primary) hits per readotu_table_genus.tsv: A count table summarized at the genus rank
Inspect the outputs and answer the following:
- How many reads were lost due to the filtering? Check the options of
python scripts/paf_to_matrix.py -hand think about whether there are any parameters you would change in case too many reads were removed? - Are the taxa listed in your count tables consistent with the strains used in your experiment?
- Do the relative abundances of taxa match your expectations? If not, what biological or technical factors could explain the differences?
# Activate the python conda environment
conda activate python_3.10.18
# Make folder to store the results
mkdir results/tables
# Run the python script
python scripts/paf_to_matrix.py \
-i results/minimap2/ \
-s results/seqkit/barcodeX_filtered.tsv \
-t data/accession_to_genus.tsv \
-o results/tables
conda deactivate- About 12% of reads were lost during the filtering. If you had more alignments than reads, the filtered reads were likely secondary (less confident) alignments. You will also loose potential contaminations during this step. That is because reads coming from contaminant bacteria might still map to sequences in the reference database but with much lower sequence identity and therefore will be discarded.
- Most reads map to Streptococcus and Lactococcus, consistent with the experiment. A few unassigned reads may come from contamination or poor-quality reads.
- Streptococcus had ~10× more reads than Lactococcus. This may indicate stronger growth due to better adaptation to liquid media or competitive advantage in the co-culture.