# Workflow for sequence classification

```bash
# Go to working directory
wdir="/home/ndombro/personal/teaching/2024/miceco"
cd $wdir
```
## Project description
NanoClass2 is used to classify 16S rRNA amplicon sequences from 21 samples. DNA samples were taken from Winogradsky columns with wood or paper as substrate.
For each substrate, 3 independent columns were sampled, and for each column there are 3-4 replicates (10 paper and 11 wood samples in total). DNA was sampled on two days, by two different groups of students (i.e. Practical_groups 1 and 2):
| #NAME | Sample | Carbon_source | Wino_Column | Practical_group |
|---|---|---|---|---|
| BC22 | GM11 | paper | No | 2 |
| BC17 | KB-P1 | paper | P1 | 1 |
| BC18 | WB-P1 | paper | P1 | 1 |
| BC2 | SB-P1 | paper | P1 | 2 |
| BC1 | LK-P2 | paper | P2 | 2 |
| BC16 | RvO-P2 | paper | P2 | 1 |
| BC20 | IV-P2 | paper | P2 | 1 |
| BC15 | EK-P3 | paper | P3 | 1 |
| BC19 | NK-P3 | paper | P3 | 1 |
| BC3 | PM-P3 | paper | P3 | 2 |
| BC10 | EP-SD1 | wood | SD1 | 1 |
| BC12 | RK-SD1 | wood | SD1 | 1 |
| BC23 | EW-SD1 | wood | SD1 | 2 |
| BC24 | DG-SD1 | wood | SD1 | 2 |
| BC5 | AW-SD2 | wood | SD2 | 2 |
| BC6 | UNZ-SD2 | wood | SD2 | 2 |
| BC7 | DH-SD2 | wood | SD2 | 2 |
| BC11 | AL-SD3 | wood | SD3 | 1 |
| BC13 | BS-SD3 | wood | SD3 | 1 |
| BC14 | DSR-SD3 | wood | SD3 | 1 |
| BC9 | BF-SD3 | wood | SD3 | 1 |
## Dependencies
- python v3.10.12
- sed (GNU sed) 4.5
- GNU Awk 4.2.1, API: 2.0 (GNU MPFR 3.1.6-p2, GNU MP 6.1.2)
- seqkit v2.7.0
- NanoPlot v1.42.0
- NanoClass2 v0.1
## Folder setup
Data was analysed on the UvA Crunchomics HPC, using the working directory set at the top of this document.
## Prepare input files
### Inspect data
We work with the following data:
- Sequencing data was provided as a zipped folder by Peter Kuperus on October 22, 2024 via SurfDrive.
- The file `filelists/barcode_to_sample` was generated manually from the mapping file provided by Gerard Muyzer and links each barcode to its sample ID.
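The relabeling step at the end of this workflow reads this file as two whitespace-separated columns (the barcode label as it appears in the OTU table, then the sample ID), so its rows presumably look something like this (illustrative, using two entries from the table above):

```
BC22    GM11
BC17    KB-P1
```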
The data was uploaded to Crunchomics and the barcode IDs were extracted from the file names to be able to loop through individual samples:
```bash
# Make data folders
mkdir data
mkdir filelists
# Upload sequencing data from local PC to Crunchomics
scp data/mic2024.zip crunchomics:/home/ndombro/personal/teaching/2024/miceco/data
# Unzip data folder
unzip data/mic2024.zip
mv mic2024 data
# Make a list of barcodes
ls data/mic2024/*/fastq_pass/barcode*/*fastq.gz | \
sed 's/.*barcode\([0-9]\+\).*/barcode\1/' \
| sort -u > filelists/barcodes.txt
# Count the number of barcodes we work with (21)
wc -l filelists/barcodes.txt
# Count number of files we work with per barcode
for i in `cat filelists/barcodes.txt`; do
count=$(ls data/mic2024/*/fastq_pass/${i}/*fastq.gz | wc -l)
echo "Barcode ${i} has ${count} fastq files"
done
```

#### Results
We work with 21 barcodes, each consisting of several fastq.gz files:
```
Barcode barcode01 has 145 fastq files
Barcode barcode02 has 145 fastq files
Barcode barcode03 has 145 fastq files
Barcode barcode05 has 145 fastq files
Barcode barcode06 has 144 fastq files
Barcode barcode07 has 145 fastq files
Barcode barcode09 has 132 fastq files
Barcode barcode10 has 132 fastq files
Barcode barcode11 has 132 fastq files
Barcode barcode12 has 132 fastq files
Barcode barcode13 has 132 fastq files
Barcode barcode14 has 132 fastq files
Barcode barcode15 has 132 fastq files
Barcode barcode16 has 132 fastq files
Barcode barcode17 has 132 fastq files
Barcode barcode18 has 132 fastq files
Barcode barcode19 has 3 fastq files
Barcode barcode20 has 132 fastq files
Barcode barcode22 has 89 fastq files
Barcode barcode23 has 145 fastq files
Barcode barcode24 has 144 fastq files
```
We can see that:
- for each sample, we have several fastq files. Therefore, we first want to combine them into one single file per sample before running NanoClass2.
- Barcode19 has far fewer files, so we need to keep this in mind and check the read counts for that sample more carefully.
### Combine individual files
```bash
# Generate folders
mkdir data/combined_data
# Combine individual files/sample
for i in `cat filelists/barcodes.txt`; do
cat data/mic2024/*/fastq_pass/${i}/*fastq.gz > data/combined_data/${i}.fastq.gz
done
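# Note: concatenating gzip files with cat is safe, as the result is a valid
# multi-member gzip stream. Optional integrity check on one combined file:
gzip -t data/combined_data/barcode01.fastq.gz && echo "barcode01 OK"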
# Sanity check: we should have 21 combined files
ls data/combined_data/* | wc -l
```

## Do quality control
### Calculate read counts
Next, we calculate the read counts before and after combining the individual files. This tells us how many reads we are working with and also verifies that the merge worked correctly.
```bash
# count total number of reads
total_after_reads=0
total_files=0
for i in $(cat filelists/barcodes.txt); do
# read counts before combining (a FASTQ record is 4 lines, hence the division by 4)
before_lines=$(zcat data/mic2024/*/fastq_pass/${i}/*fastq.gz | wc -l)
before_count=$((before_lines / 4))
# read counts after combining
after_lines=$(zcat data/combined_data/${i}.fastq.gz | wc -l)
after_count=$((after_lines / 4))
# accumulate total after reads and files
total_after_reads=$((total_after_reads + after_count))
total_files=$((total_files + 1))
echo "Total reads for ${i} - Before: ${before_count}, After: ${after_count}"
done
# calculate the average read count per barcode (integer division)
average_after_count=$((total_after_reads / total_files))
echo "Number of samples: ${total_files}"
echo "Total number of reads: ${total_after_reads}"
echo "Average read count for the combined barcodes: ${average_after_count}"
```

#### Results
- Number of samples: 21
- Total number of reads: 413,614
- Average read count: 19,695
- Notice that for barcodes 19 and 22 we retain only very few reads
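Samples this shallow are easy to miss in a long list; a small sketch that flags every combined file below the 200-read threshold mentioned in the summary further down:

```bash
# Flag combined samples with fewer than 200 reads (4 FASTQ lines = 1 read)
for f in data/combined_data/*.fastq.gz; do
    reads=$(( $(zcat "$f" | wc -l) / 4 ))
    [ "$reads" -lt 200 ] && echo "LOW: $(basename "$f") has only ${reads} reads"
done
```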
Read counts before and after merging the individual files (identical numbers confirm that the merge worked):
```
Total reads for barcode01 - Before: 60759, After: 60759
Total reads for barcode02 - Before: 78436, After: 78436
Total reads for barcode03 - Before: 28991, After: 28991
Total reads for barcode05 - Before: 4852, After: 4852
Total reads for barcode06 - Before: 17883, After: 17883
Total reads for barcode07 - Before: 18625, After: 18625
Total reads for barcode09 - Before: 12756, After: 12756
Total reads for barcode10 - Before: 15666, After: 15666
Total reads for barcode11 - Before: 15038, After: 15038
Total reads for barcode12 - Before: 7512, After: 7512
Total reads for barcode13 - Before: 11099, After: 11099
Total reads for barcode14 - Before: 6846, After: 6846
Total reads for barcode15 - Before: 14785, After: 14785
Total reads for barcode16 - Before: 18654, After: 18654
Total reads for barcode17 - Before: 16957, After: 16957
Total reads for barcode18 - Before: 30101, After: 30101
Total reads for barcode19 - Before: 3, After: 3
Total reads for barcode20 - Before: 16945, After: 16945
Total reads for barcode22 - Before: 109, After: 109
Total reads for barcode23 - Before: 31428, After: 31428
Total reads for barcode24 - Before: 6169, After: 6169
```
### Generate summary statistics
Next, we calculate the quality statistics (total read count, average read count, read length, etc.) to be able to screen the data for potential issues:
```bash
# Make data folders
mkdir -p results/seqkit
mkdir -p results/nanoplot
# Run seqkit
seqkit stats -a -To results/seqkit/seqkit_stats.tsv data/combined_data/*fastq.gz --threads 10
# Generate plots to visualize the statistics better
conda activate nanoplot_1.42.0
for file in data/combined_data/*fastq.gz; do
sample=$(echo $file | cut -d "/" -f3 | sed "s/\.fastq.gz//g")
echo "Starting analysis for "$file""
mkdir results/nanoplot/"$sample"
srun --cpus-per-task 10 --mem=50G \
NanoPlot --fastq $file -o results/nanoplot/"$sample" --threads 10
done
conda deactivate
```

#### Summary
- Reads are on average 1358 bp long, with Q1: 1374, Q2: 1425 and Q3: 1454.
- From the plots:
  - the majority of the reads cluster around 1400 bp
  - the read quality is more spread out, ranging from 10 to a maximum of 18
- Conclusions:
  - No sample had shorter-than-expected amplicons
  - Two samples had fewer than 200 reads and need to be removed at some point in the pipeline
  - I will use a Phred score cutoff of ~10
  - I will use length cutoffs of 1100 bp (min) and 1600 bp (max), based on the Q1 and Q3 values from seqkit
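The length cutoffs can also be pulled straight from the seqkit table rather than read off by eye; a minimal sketch, assuming the `Q1`/`Q3` header names that `seqkit stats -a -T` writes:

```bash
# Report the lowest Q1 and highest Q3 across all samples
# (columns are located via the header names, not fixed positions)
awk -F'\t' '
    NR==1 { for (i=1; i<=NF; i++) col[$i]=i; next }
    { q1 = $(col["Q1"]) + 0; q3 = $(col["Q3"]) + 0
      if (NR==2 || q1 < minq1) minq1 = q1
      if (q3 > maxq3) maxq3 = q3 }
    END { print "lowest Q1: " minq1 ", highest Q3: " maxq3 }
' results/seqkit/seqkit_stats.tsv
```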
## Run NanoClass2
NanoClass2 will be used for quality cleaning and to classify the sequence reads of each sample.
To run NanoClass2 we need:
- The sequence files in fastq.gz format (one file per sample, which we already generated)
- A csv mapping file that lists which samples should be analysed and how they are named
- A config file that specifies the parameters NanoClass2 is run with
- A bash script that allows us to run NanoClass2 on an HPC
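Concretely, with the names used in this walkthrough, the working directory ends up containing something like this (a sketch, not an exhaustive listing):

```
config.yaml                      # NanoClass2 run parameters (edited below)
jobscript.sh                     # batch script for the HPC
filelists/mapping_file.csv       # csv mapping file (generated below)
data/combined_data/*.fastq.gz    # one sequence file per sample
```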
### Prepare the mapping file
The csv mapping file needs to include the following:
- A run name
- A unique sample name
- The barcode name (optional)
- The path to your fastq.gz files (should be already demultiplexed)
- Note: sample and run labels may only contain letters and numbers
For this analysis the file looks something like this:
```
run,sample,barcode,path
mic2024,BC01,,/home/ndombro/personal/teaching/2024/miceco/data/combined_data/barcode01.fastq.gz
mic2024,BC02,,/home/ndombro/personal/teaching/2024/miceco/data/combined_data/barcode02.fastq.gz
mic2024,BC03,,/home/ndombro/personal/teaching/2024/miceco/data/combined_data/barcode03.fastq.gz
```
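Because of the letters-and-numbers rule, it is worth checking the generated file (next code block) for invalid labels; a minimal sketch:

```bash
# Flag sample labels (column 2) containing anything besides letters and digits
awk -F',' 'NR>1 && $2 !~ /^[A-Za-z0-9]+$/ { print "Invalid sample label:", $2 }' \
    filelists/mapping_file.csv
```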
This file can be generated in Excel, but we can also extract all the information we need from the file paths:
echo "run,sample,barcode,path" > filelists/mapping_file.csv
ls data/combined_data/*fastq.gz | \
while read path; do
run="mic2024" # Set static run name
sample=$(echo "$path" | cut -d "/" -f3 | sed "s/\.fastq.gz//g") # Extract sampleID
barcode=$(echo "$sample" | sed "s/barcode/BC/g")
fullpath=$(realpath "$path") # To get the full absolute path
echo "$run,$barcode,,$fullpath" # Combine all data
done >> filelists/mapping_file.csv
```

### Prepare the config file
The Snakemake configuration file, i.e. config.yaml, allows us to adjust parameters, such as the location of the mapping file or the settings used by the different tools, without touching the Snakemake code. The key things to change are:
- The location of your sample mapping file under `samples:`
- The methods you want to use
- The min and max read length to keep
- The quality (Phred) score cutoff
```bash
# Copy the required files from the NanoClass2 software folder
cp /zfs/omics/projects/amplicomics/bin/NanoClass2/jobscript.sh .
cp /zfs/omics/projects/amplicomics/bin/NanoClass2/config.yaml .
```

Changes made to the config file:
samples: "filelists/mapping_file.csv"
methods: ["minimap"]
Since I do not want to subsample (subsampling is useful for reducing the data size when testing more than one classifier), I also edited this section:
```yaml
subsample:
  skip: true
```
Based on the quality checking I also made some changes: I used the lowest Q1 and highest Q3 from the seqkit output to set the length thresholds.
```yaml
minlen: 1100
maxlen: 1600
quality: 10
```
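Put together, the edited parts of config.yaml presumably read as follows; this is a sketch, and the exact nesting of the length and quality keys may differ in the shipped config.yaml:

```yaml
samples: "filelists/mapping_file.csv"
methods: ["minimap"]
subsample:
  skip: true
minlen: 1100
maxlen: 1600
quality: 10
```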
### Run NanoClass2
To run NanoClass2 on the Crunchomics HPC, the jobscript.sh file can be used without any edits. This script is pre-configured to do everything for you, such as:
- Load the correct environments that have all required tools installed
- Start NanoClass2 using Snakemake, a workflow manager that takes care of what software is run when. Snakemake is useful as it runs steps in parallel as long as enough resources are available. It also allows a job to be re-started, simply picking things up from wherever things went wrong.
```bash
# Do a dry-run to see if we set everything correctly
# Since this is just a test, we can run this on the headnode
conda activate /zfs/omics/projects/amplicomics/miniconda3/envs/snakemake_nanoclass2
snakemake \
-s /zfs/omics/projects/amplicomics/bin/NanoClass2/Snakefile \
--configfile config.yaml --use-conda \
--conda-prefix /zfs/omics/projects/amplicomics/bin/NanoClass2/.snakemake/conda \
--cores 1 --nolock --rerun-incomplete -np
# Submit job via a job script to the compute node: job id is 74581
sbatch jobscript.sh
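# Optional: follow the job while it runs. The log file name assumes SLURM's
# default slurm-<jobid>.out pattern, so for job 74581:
squeue -u "$USER"
tail -f slurm-74581.out  # Ctrl-C to stop following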
# Generate a report out of this
snakemake --report report.html \
--configfile config.yaml \
-s /zfs/omics/projects/amplicomics/bin/NanoClass2/Snakefile
conda deactivate
# Run seqkit on the cleaned data
seqkit stats -a -To results/seqkit/seqkit_stats_filtered.tsv results/data/mic2024/chopper/BC*gz --threads 10
# Replace the barcode labels with sample IDs in the otu table:
# awk first reads barcode_to_sample into an array (barcode -> sample ID),
# then rewrites every field of the otu table that matches a barcode;
# the sed strips the mic2024_minimap_ prefix from the column names first
awk -v OFS="\t" 'NR==FNR {mapping[$1]=$2; next} {for (i=1; i<=NF; i++) if ($i in mapping) $i=mapping[$i]; print}' \
    filelists/barcode_to_sample \
    <(sed "s/mic2024_minimap_//g" results/tables/otu-table.tsv) \
    > results/tables/otu-table-updated.txt
# Make the header compatible for MicrobiomeAnalyst
sed -i "s/taxid/\#NAME/g" results/tables/otu-table-updated.txt
# Remove Gerard's sample
python scripts/filter_columns.py -i results/tables/otu-table-updated.txt \
-c GM11 \
-o results/tables/otu-table-forMA.txt
```

#### Comments
- Looking at the seqkit stats, we went from:
  - 19,695 to 17,513 reads on average
  - 413,614 to 367,773 reads in total
  - 561,031,890 to 520,739,003 bp in total
- Link to the report is here
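These totals can be recomputed from the two seqkit tables; a minimal sketch, assuming the `num_seqs` and `sum_len` header names of seqkit's tabular output:

```bash
# Sum reads and bases over all samples, before and after filtering
for tsv in results/seqkit/seqkit_stats.tsv results/seqkit/seqkit_stats_filtered.tsv; do
    awk -F'\t' 'NR==1 { for (i=1; i<=NF; i++) col[$i]=i; next }
                { reads += $(col["num_seqs"]); bases += $(col["sum_len"]) }
                END { printf "%s: %d reads, %d bp\n", FILENAME, reads, bases }' "$tsv"
done
```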