# Workflow for sequence classification

```bash
# Go to working directory
wdir="/home/ndombro/personal/teaching/2024/miceco"
cd $wdir
```
## Project description
NanoClass2 is used to classify 16S rRNA amplicon sequences from 21 samples. DNA samples were taken from Winogradsky columns with wood or paper as substrate.
For each substrate, 3 independent columns were sampled, and for each column there are 3-4 replicates (10 paper and 11 wood samples in total). DNA was sampled on two days, by two different groups of students (i.e. Practical_groups 1 and 2):
| #NAME | Sample | Carbon_source | Wino_Column | Practical_group |
|---|---|---|---|---|
| BC22 | GM11 | paper | No | 2 |
| BC17 | KB-P1 | paper | P1 | 1 |
| BC18 | WB-P1 | paper | P1 | 1 |
| BC2 | SB-P1 | paper | P1 | 2 |
| BC1 | LK-P2 | paper | P2 | 2 |
| BC16 | RvO-P2 | paper | P2 | 1 |
| BC20 | IV-P2 | paper | P2 | 1 |
| BC15 | EK-P3 | paper | P3 | 1 |
| BC19 | NK-P3 | paper | P3 | 1 |
| BC3 | PM-P3 | paper | P3 | 2 |
| BC10 | EP-SD1 | wood | SD1 | 1 |
| BC12 | RK-SD1 | wood | SD1 | 1 |
| BC23 | EW-SD1 | wood | SD1 | 2 |
| BC24 | DG-SD1 | wood | SD1 | 2 |
| BC5 | AW-SD2 | wood | SD2 | 2 |
| BC6 | UNZ-SD2 | wood | SD2 | 2 |
| BC7 | DH-SD2 | wood | SD2 | 2 |
| BC11 | AL-SD3 | wood | SD3 | 1 |
| BC13 | BS-SD3 | wood | SD3 | 1 |
| BC14 | DSR-SD3 | wood | SD3 | 1 |
| BC9 | BF-SD3 | wood | SD3 | 1 |
## Dependencies
- python v3.10.12
- sed (GNU sed) 4.5
- GNU Awk 4.2.1, API: 2.0 (GNU MPFR 3.1.6-p2, GNU MP 6.1.2)
- seqkit v2.7.0
- NanoPlot v1.42.0
- NanoClass2 v0.1
## Folder setup
Data was analysed on the UvA Crunchomics HPC, using the working directory set at the top of this document.
## Prepare input files
### Inspect data
We work with the following data:
- Sequencing data was provided as a zipped folder by Peter Kuperus on October 22, 2024 via SurfDrive.
- The file `filelists/barcode_to_sample` was generated manually from the mapping file provided by Gerard Muyzer and links each barcode to its sample ID.
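The relabeling step at the end of this workflow reads this file as two whitespace-separated columns (the barcode label as it appears in the OTU table, then the sample ID), so its rows presumably look something like this (illustrative, using two entries from the table above):

```
BC22    GM11
BC17    KB-P1
```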
The data was uploaded to Crunchomics and the barcode IDs were extracted from the file names to be able to loop through individual samples:
```bash
# Make data folders
mkdir data
mkdir filelists
# Upload sequencing data from local PC to Crunchomics
scp data/mic2024.zip crunchomics:/home/ndombro/personal/teaching/2024/miceco/data
# Unzip data folder
unzip data/mic2024.zip
mv mic2024 data
# Make a list of barcodes
ls data/mic2024/*/fastq_pass/barcode*/*fastq.gz | \
sed 's/.*barcode\([0-9]\+\).*/barcode\1/' \
| sort -u > filelists/barcodes.txt
# Count the number of barcodes we work with (21)
wc -l filelists/barcodes.txt
# Count number of files we work with per barcode
for i in `cat filelists/barcodes.txt`; do
count=$(ls data/mic2024/*/fastq_pass/${i}/*fastq.gz | wc -l)
echo "Barcode ${i} has ${count} fastq files"
done
```

#### Results
We work with 21 barcodes, each consisting of several fastq.gz files:
```
Barcode barcode01 has 145 fastq files
Barcode barcode02 has 145 fastq files
Barcode barcode03 has 145 fastq files
Barcode barcode05 has 145 fastq files
Barcode barcode06 has 144 fastq files
Barcode barcode07 has 145 fastq files
Barcode barcode09 has 132 fastq files
Barcode barcode10 has 132 fastq files
Barcode barcode11 has 132 fastq files
Barcode barcode12 has 132 fastq files
Barcode barcode13 has 132 fastq files
Barcode barcode14 has 132 fastq files
Barcode barcode15 has 132 fastq files
Barcode barcode16 has 132 fastq files
Barcode barcode17 has 132 fastq files
Barcode barcode18 has 132 fastq files
Barcode barcode19 has 3 fastq files
Barcode barcode20 has 132 fastq files
Barcode barcode22 has 89 fastq files
Barcode barcode23 has 145 fastq files
Barcode barcode24 has 144 fastq files
```
We can see that:
- for each sample, we have several fastq files. Therefore, we first want to combine them into one single file per sample before running NanoClass2.
- Barcode19 has far fewer files, so we need to keep this in mind and check the read counts for that sample more carefully.
### Combine individual files
```bash
# Generate folders
mkdir data/combined_data
# Combine individual files/sample
for i in `cat filelists/barcodes.txt`; do
cat data/mic2024/*/fastq_pass/${i}/*fastq.gz > data/combined_data/${i}.fastq.gz
done
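# Note: concatenating gzip files with cat is safe, as the result is a valid
# multi-member gzip stream. Optional integrity check on one combined file:
gzip -t data/combined_data/barcode01.fastq.gz && echo "barcode01 OK"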
# Sanity check: we should have 21 combined files
ls data/combined_data/* | wc -l
```

## Do quality control
### Calculate read counts
Next, we calculate the read counts before and after combining the individual files. This tells us how many reads we are working with and also verifies that the merge worked correctly.
```bash
# count total number of reads
total_after_reads=0
total_files=0
for i in $(cat filelists/barcodes.txt); do
# read counts before combining (a FASTQ record is 4 lines, hence the division by 4)
before_lines=$(zcat data/mic2024/*/fastq_pass/${i}/*fastq.gz | wc -l)
before_count=$((before_lines / 4))
# read counts after combining
after_lines=$(zcat data/combined_data/${i}.fastq.gz | wc -l)
after_count=$((after_lines / 4))
# accumulate total after reads and files
total_after_reads=$((total_after_reads + after_count))
total_files=$((total_files + 1))
echo "Total reads for ${i} - Before: ${before_count}, After: ${after_count}"
done
# calculate the average read count per barcode (integer division)
average_after_count=$((total_after_reads / total_files))
echo "Number of samples: ${total_files}"
echo "Total number of reads: ${total_after_reads}"
echo "Average read count for the combined barcodes: ${average_after_count}"
```

#### Results
- Number of samples: 21
- Total number of reads: 413,614
- Average read count: 19,695
- Notice that for barcodes 19 and 22 we retain only very few reads
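Samples this shallow are easy to miss in a long list; a small sketch that flags every combined file below the 200-read threshold mentioned in the summary further down:

```bash
# Flag combined samples with fewer than 200 reads (4 FASTQ lines = 1 read)
for f in data/combined_data/*.fastq.gz; do
    reads=$(( $(zcat "$f" | wc -l) / 4 ))
    [ "$reads" -lt 200 ] && echo "LOW: $(basename "$f") has only ${reads} reads"
done
```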
Read counts before and after merging the individual files (identical numbers confirm that the merge worked):
```
Total reads for barcode01 - Before: 60759, After: 60759
Total reads for barcode02 - Before: 78436, After: 78436
Total reads for barcode03 - Before: 28991, After: 28991
Total reads for barcode05 - Before: 4852, After: 4852
Total reads for barcode06 - Before: 17883, After: 17883
Total reads for barcode07 - Before: 18625, After: 18625
Total reads for barcode09 - Before: 12756, After: 12756
Total reads for barcode10 - Before: 15666, After: 15666
Total reads for barcode11 - Before: 15038, After: 15038
Total reads for barcode12 - Before: 7512, After: 7512
Total reads for barcode13 - Before: 11099, After: 11099
Total reads for barcode14 - Before: 6846, After: 6846
Total reads for barcode15 - Before: 14785, After: 14785
Total reads for barcode16 - Before: 18654, After: 18654
Total reads for barcode17 - Before: 16957, After: 16957
Total reads for barcode18 - Before: 30101, After: 30101
Total reads for barcode19 - Before: 3, After: 3
Total reads for barcode20 - Before: 16945, After: 16945
Total reads for barcode22 - Before: 109, After: 109
Total reads for barcode23 - Before: 31428, After: 31428
Total reads for barcode24 - Before: 6169, After: 6169
```
### Generate summary statistics
Next, we calculate the quality statistics (total read count, average read count, read length, etc.) to be able to screen the data for potential issues:
```bash
# Make data folders
mkdir -p results/seqkit
mkdir -p results/nanoplot
# Run seqkit
seqkit stats -a -To results/seqkit/seqkit_stats.tsv data/combined_data/*fastq.gz --threads 10
# Generate plots to visualize the statistics better
conda activate nanoplot_1.42.0
for file in data/combined_data/*fastq.gz; do
sample=$(echo $file | cut -d "/" -f3 | sed "s/\.fastq.gz//g")
echo "Starting analysis for "$file""
mkdir results/nanoplot/"$sample"
srun --cpus-per-task 10 --mem=50G \
NanoPlot --fastq $file -o results/nanoplot/"$sample" --threads 10
done
conda deactivate
```

#### Summary
- Reads are on average 1358 bp long, with Q1: 1374, Q2: 1425 and Q3: 1454.
- From the plots:
  - the majority of the reads cluster around 1400 bp
  - the read quality is more spread out, ranging from 10 to a maximum of 18
- Conclusions:
  - No sample had shorter-than-expected amplicons
  - Two samples had fewer than 200 reads and need to be removed at some point in the pipeline
  - I will use a Phred score cutoff of ~10
  - I will use length cutoffs of 1100 bp (min) and 1600 bp (max), based on the Q1 and Q3 values from seqkit
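The length cutoffs can also be pulled straight from the seqkit table rather than read off by eye; a minimal sketch, assuming the `Q1`/`Q3` header names that `seqkit stats -a -T` writes:

```bash
# Report the lowest Q1 and highest Q3 across all samples
# (columns are located via the header names, not fixed positions)
awk -F'\t' '
    NR==1 { for (i=1; i<=NF; i++) col[$i]=i; next }
    { q1 = $(col["Q1"]) + 0; q3 = $(col["Q3"]) + 0
      if (NR==2 || q1 < minq1) minq1 = q1
      if (q3 > maxq3) maxq3 = q3 }
    END { print "lowest Q1: " minq1 ", highest Q3: " maxq3 }
' results/seqkit/seqkit_stats.tsv
```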
## Run NanoClass2
NanoClass2 will be used for quality cleaning and to classify the sequence reads of each sample.
To run NanoClass2 we need:
- The sequence files in fastq.gz format (one file per sample, which we already generated)
- A csv mapping file that lists which samples should be analysed and how they are named
- A config file that specifies the parameters NanoClass2 is run with
- A bash script that allows us to run NanoClass2 on an HPC
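Concretely, with the names used in this walkthrough, the working directory ends up containing something like this (a sketch, not an exhaustive listing):

```
config.yaml                      # NanoClass2 run parameters (edited below)
jobscript.sh                     # batch script for the HPC
filelists/mapping_file.csv       # csv mapping file (generated below)
data/combined_data/*.fastq.gz    # one sequence file per sample
```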
### Prepare the mapping file
The csv mapping file needs to include the following:
- A run name
- A unique sample name
- The barcode name (optional)
- The path to your fastq.gz files (should be already demultiplexed)
- Note: sample and run labels may only contain letters and numbers
For this analysis the file looks something like this:
```
run,sample,barcode,path
mic2024,BC01,,/home/ndombro/personal/teaching/2024/miceco/data/combined_data/barcode01.fastq.gz
mic2024,BC02,,/home/ndombro/personal/teaching/2024/miceco/data/combined_data/barcode02.fastq.gz
mic2024,BC03,,/home/ndombro/personal/teaching/2024/miceco/data/combined_data/barcode03.fastq.gz
```
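Because of the letters-and-numbers rule, it is worth checking the generated file (next code block) for invalid labels; a minimal sketch:

```bash
# Flag sample labels (column 2) containing anything besides letters and digits
awk -F',' 'NR>1 && $2 !~ /^[A-Za-z0-9]+$/ { print "Invalid sample label:", $2 }' \
    filelists/mapping_file.csv
```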
This file can be generated in Excel, but we can also extract all the information we need from the file paths:
echo "run,sample,barcode,path" > filelists/mapping_file.csv
ls data/combined_data/*fastq.gz | \
while read path; do
run="mic2024" # Set static run name
sample=$(echo "$path" | cut -d "/" -f3 | sed "s/\.fastq.gz//g") # Extract sampleID
barcode=$(echo "$sample" | sed "s/barcode/BC/g")
fullpath=$(realpath "$path") # To get the full absolute path
echo "$run,$barcode,,$fullpath" # Combine all data
done >> filelists/mapping_file.csv
```

### Prepare the config file
The Snakemake configuration file, i.e. config.yaml, allows us to adjust parameters, such as the location of the mapping file or the settings used by the different tools, without touching the Snakemake code. The key things to change are:
- The location of your sample mapping file under `samples:`
- The methods you want to use
- The min and max read length to keep
- The quality (Phred) score cutoff
```bash
# Copy the required files from the NanoClass2 software folder
cp /zfs/omics/projects/amplicomics/bin/NanoClass2/jobscript.sh .
cp /zfs/omics/projects/amplicomics/bin/NanoClass2/config.yaml .
```

Changes made to the config file:
samples: "filelists/mapping_file.csv"
methods: ["minimap"]
Since I do not want to subsample (subsampling is useful for reducing the data size when testing more than one classifier), I also edited this section:
```yaml
subsample:
  skip: true
```
Based on the quality checking I also made some changes: I used the lowest Q1 and highest Q3 from the seqkit output to set the length thresholds.
```yaml
minlen: 1100
maxlen: 1600
quality: 10
```
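Put together, the edited parts of config.yaml presumably read as follows; this is a sketch, and the exact nesting of the length and quality keys may differ in the shipped config.yaml:

```yaml
samples: "filelists/mapping_file.csv"
methods: ["minimap"]
subsample:
  skip: true
minlen: 1100
maxlen: 1600
quality: 10
```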
### Run NanoClass2
To run NanoClass2 on the Crunchomics HPC, the jobscript.sh file can be used without any edits. This script is pre-configured to do everything for you, such as:
- Load the correct environments that have all required tools installed
- Start NanoClass2 using Snakemake, a workflow manager that takes care of what software is run when. Snakemake is useful as it runs steps in parallel as long as enough resources are available. It also allows a job to be re-started, simply picking things up from wherever things went wrong.
```bash
# Do a dry-run to see if we set everything correctly
# Since this is just a test, we can run this on the headnode
conda activate /zfs/omics/projects/amplicomics/miniconda3/envs/snakemake_nanoclass2
snakemake \
-s /zfs/omics/projects/amplicomics/bin/NanoClass2/Snakefile \
--configfile config.yaml --use-conda \
--conda-prefix /zfs/omics/projects/amplicomics/bin/NanoClass2/.snakemake/conda \
--cores 1 --nolock --rerun-incomplete -np
# Submit job via a job script to the compute node: job id is 74581
sbatch jobscript.sh
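# Optional: follow the job while it runs. The log file name assumes SLURM's
# default slurm-<jobid>.out pattern, so for job 74581:
squeue -u "$USER"
tail -f slurm-74581.out  # Ctrl-C to stop following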
# Generate a report out of this
snakemake --report report.html \
--configfile config.yaml \
-s /zfs/omics/projects/amplicomics/bin/NanoClass2/Snakefile
conda deactivate
# Run seqkit on the cleaned data
seqkit stats -a -To results/seqkit/seqkit_stats_filtered.tsv results/data/mic2024/chopper/BC*gz --threads 10
# Replace the barcode labels with sample IDs in the otu table:
# awk first reads barcode_to_sample into an array (barcode -> sample ID),
# then rewrites every field of the otu table that matches a barcode;
# the sed strips the mic2024_minimap_ prefix from the column names first
awk -v OFS="\t" 'NR==FNR {mapping[$1]=$2; next} {for (i=1; i<=NF; i++) if ($i in mapping) $i=mapping[$i]; print}' \
    filelists/barcode_to_sample \
    <(sed "s/mic2024_minimap_//g" results/tables/otu-table.tsv) \
    > results/tables/otu-table-updated.txt
# Make the header compatible for MicrobiomeAnalyst
sed -i "s/taxid/\#NAME/g" results/tables/otu-table-updated.txt
# Remove Gerard's sample
python scripts/filter_columns.py -i results/tables/otu-table-updated.txt \
-c GM11 \
-o results/tables/otu-table-forMA.txt
```

#### Comments
- Looking at the seqkit stats, we went from:
  - 19,695 to 17,513 reads on average
  - 413,614 to 367,773 reads in total
  - 561,031,890 to 520,739,003 bp in total
- Link to the report is here
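These totals can be recomputed from the two seqkit tables; a minimal sketch, assuming the `num_seqs` and `sum_len` header names of seqkit's tabular output:

```bash
# Sum reads and bases over all samples, before and after filtering
for tsv in results/seqkit/seqkit_stats.tsv results/seqkit/seqkit_stats_filtered.tsv; do
    awk -F'\t' 'NR==1 { for (i=1; i<=NF; i++) col[$i]=i; next }
                { reads += $(col["num_seqs"]); bases += $(col["sum_len"]) }
                END { printf "%s: %d reads, %d bp\n", FILENAME, reads, bases }' "$tsv"
done
```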