Running NanoClass2

NanoClass2 will take de-multiplexed and compressed fastq files as input and generate OTU tables and several summary statistics as output.

To run NanoClass2 please, provide a single fastq.gz file per sample, i.e. concatenate your files before running NanoClass2.

If you have no preference for a classifier, you first might want to test, which classifier performs best on your data. You can do this by first subsampling your dataset and running all 11 classification tools on this smaller subsample. Tests on data sets of variable complexity have shown that a small subset of data is enough to assess tool performance. Once you have decided which tool or tools to use you can run this tool on your entire dataset.

Edit the configuration file

In the configuration file, also called config.yaml, you can specify how NanoClass2 should be run.

To tell Snakemake where your data is located and to change the default settings start by copying the config.yaml file found in the NanoClass2 folder to the folder in which you want to analyse your data, i.e. like this:

cp <path_to_NanoClass2_folder>/config.yaml .

You can also run your analyses in the NanoClass2 folder you downloaded, but it is preferable to separate software from analyses.

Next, open the config.yaml with an editor, such as nano. There are several things you can modify:

1. The samples you want to analyse

Here, you need to provide the path to a comma-separated mapping file that describes the samples you want to analyse in the section: samples_file: "example_files/mapping.csv".

This file needs to contain the following columns:

  1. run: The run name of your sample. This is useful if you want to run NanoClass several times on the same data but with different settings. Results for each run will be generated in different folders based on the run name. You can use any label as long as it consists of only letters and numbers.
  2. sample: The names of your sample. This ID will be used to label all files created in subsequent steps. Your sample names should be unique and only contain letters and numbers. Do not use other symbols, such as spaces, dots or underscores in your sample names.
  3. barcode: The barcode column should remain empty as NanoClass2 currently only accepts demultiplexed data.
  4. path: Absolute path to the fastq.gz files. The workflow accepts one file per barcode, so if you have more than one file merge these files first, for example using the cat command.

An example file could look like this (exchange with the absolute path to your samples. Preferably don’t use the relative path unless you start your analysis relative to where the Snakefile is.):

run,sample,barcode,path
demo,R1A,,<absolute_path>/barcode01_merged.fastq.gz
demo,R2A,,<absolute_path>/barcode02_merged.fastq.gz
demo,R3A,,<absolute_path>/barcode03_merged.fastq.gz

2. The classifiers you want to use

You can choose what classifiers you want to use in methods: ["blastn","centrifuge","dcmegablast","idtaxa","megablast","minimap","mothur","qiime","rdp","spingo"]. If you want to first compare different classifiers you can use the full list if you just want to use one (or two, …) classifiers you could for example write: methods: ["blastn"]

3. Subsampling our reads

Important

In this section, we can specify whether we would like to subset the number of reads and how many reads per sample you like to include. This allows us to run NanoClass2 in two modes:

  1. With subsampling: Use this, if you want to compare several classifiers on a smaller subsample of your data. We can subsample our data, by setting skip: false in the subsample section of the yaml file. Additionally, we can specify how many reads we want to analyse per sample with samplesize: 100.
  2. Without subsampling: Use this, if you already know what classifier you want to use. You can toggle the subsampling off by setting using skip: true in the subsample section of the yaml file.

In the example below subsample is enabled and 100 random reads per sample will be included in the analyses.

subsample:
    skip:                          false
    samplesize:                    100
    environment:                   "preprocess.yml"

In the second example below, we disable subsampling and analyse all our reads.

subsample:
    skip:                          true
    samplesize:                    100
    environment:                   "preprocess.yml"

4. Output folder name

The name of the folder in which the results should be stored in.

5. Other settings

Finally, you can change tool specific parameters: If desired, there are several parameters that can be changed by the user, such as the numbers of threads to use for the different tools or the settings used for the read quality filtering (in qual_filtering). Please check out the config.yaml file, which provides more detailed explanations on each setting.

Running NanoClass2

Dry-run

To test whether the workflow is defined properly do a dry-run first. In the command below, do the following changes:

  1. Provide the path to where you installed NanoClass2 (and where the file specifying how the workflow is run, the Snakemake file, can be found) afer --s
  2. Provide the path to the edited config file after --configfile
  3. Provide the path to where you want Snakemake to install all program dependencies after --conda-prefix. We recommend to install these into the folder in which you downloaded NanoClass2 but you can change this if desired
#activate conda environment with your snakemake installation, i.e. 
conda activate snakemake_nanoclass

#perform a dry-run
snakemake --cores 1 \
    -s <path_to_NanoClass2_install>/Snakefile \
    --configfile <path_to_edited_config_file>/config.yaml \
    --use-conda --conda-prefix <path_to_NanoClass2_install>/.snakemake/conda \
    --nolock --rerun-incomplete -np

Run NanoClass2 interactively

If the dry-run was successful you can run Snakemake interactively with the command below. Adjust the cores according to your system.

Warning

When running NanoClass2 for the first time, you might get a warning to set condas channel priority to strict when installing some dependencies, for example QIIME 2. If this happens we for now recommend setting the conda channel priorities to flexible to not run into issue when installing software dependencies.

You can check your channel priorities in ~/.condarc (Linux-environment)

You can reset the channel priorities with conda config --set channel_priority flexible.

snakemake --cores 1 \
    -s <path_to_NanoClass2_install>/Snakefile \
    --configfile config.yaml \
    --use-conda --conda-prefix <path_to_NanoClass2_install>/.snakemake/conda \
    --nolock --rerun-incomplete

Generate a report

After a successful run, you can create a report with some of the key output files as follows:

snakemake --report report.html \
  --configfile config.yaml \
  -s <path_to_NanoClass2_install>/Snakefile

Afterwards, the report can be found in the output folder and includes:

  • A shematic of the steps executed
  • A PDF containing information about the reads after quality filtering. To view the full report, download the individual files as only the first page is shown in the preview
  • Bargraphs for different taxonomic ranks (and classifiers if more than one was used)
  • Information about the run-times for each sample and statistics on the runtimes for each tool/sample

Additionally, the output folder contains:

  • <output_dir>/data/<run>/chopper: The filtered reads (useful if you want to do something else with these reads)
  • <output_dir>/classifications/<run>/<method>: The OTU matrix for each sample and classifier
  • <output_dir>/tables: An OTU as well as a taxonomy table combined for each sample and classifier used