Introduction to Data Analysis in Microbial Ecology

Nina Dombrowski

Introduction to Bioinformatics

Bioinformatics applies computational methods to store, manage, and analyse biological data

Sequencing experiments generate too much data to analyse manually
Your Nanopore sequencing run generated ~950,000 sequencing reads
Computational tools allow you to store, filter and analyse such data

The Command Line Interface (CLI)

A text-based interface for giving instructions to a computer

Allows us to handle large datasets efficiently
Gives us access to many bioinformatic tools
Easy to document and reproduce workflows

High Performance Computing (HPC)

A shared computer system that provides more memory, CPUs, and storage space than a typical laptop

Real-life application

In this tutorial, you will use the CLI and an HPC to analyse your Nanopore sequencing reads to:

Determine which strains are present in the mixed communities
Quantify their relative abundances
Assess whether the counts fit with your assumptions about individual interactions

Workflow

flowchart LR
    classDef greenfill fill:#5B888C,stroke:#333,stroke-width:1,color:#fff;
    classDef dbfill fill:#E2F0F1,stroke:#333,stroke-width:1,color:#333;

    %% Workflow boxes
    A[Raw reads] -->|FASTQ| B[Quality control]
    B -->|FASTQ| C[Quality filtering]
    C -->|FASTQ| D[Mapping to <br> Reference DB]
    D -->|PAF| E[Quality filtering]
    E -->|PAF| F[Count table]

    %% Reference database node
    DB[(16S <br> Reference  <br> Database)] -->|FASTA| D  

    %% Assign classes after nodes exist
    class A,B,C,D,E,F greenfill
    class DB dbfill

    %% Arrow styles (index starts at 0 for first arrow)
    linkStyle 0,1,2,3,4,5 stroke:#5B888C,stroke-width:2,color:#000, fill: none

PAF:

Pairwise Alignment Format
A table with information about how each read maps to the reference
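
As a minimal sketch of how such a table can be summarised into counts: in the PAF specification the columns are tab-separated, column 1 holds the read (query) name and column 6 the reference (target) name. The file name `example.paf`, the strain names, and the three alignment lines below are made up for illustration and are not taken from the course data.

```shell
# Create a tiny, made-up PAF file (tab-separated, 12 mandatory columns)
printf 'read1\t1500\t10\t1490\t+\tstrainA\t1550\t20\t1500\t1400\t1480\t60\n'  > example.paf
printf 'read2\t1480\t5\t1470\t-\tstrainB\t1540\t15\t1480\t1390\t1465\t60\n' >> example.paf
printf 'read3\t1510\t0\t1505\t+\tstrainA\t1550\t10\t1515\t1450\t1505\t60\n' >> example.paf

# Count how many alignment lines hit each reference (column 6),
# then print the result as "reference<TAB>count"
count_table=$(cut -f 6 example.paf | sort | uniq -c | awk '{print $2 "\t" $1}')
printf '%s\n' "$count_table"
```

Note that this counts alignment lines, so a read with several alignments would be counted more than once. This is one reason why filtering the PAF before counting can matter.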

The FASTQ format

FASTQ format is a text-based format for storing a sequence and its corresponding quality scores
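
To make the format concrete, here is a small sketch that assumes the common four-line FASTQ layout: a header starting with `@`, the sequence, a separator line starting with `+`, and one quality character per base. The file `example.fastq` and its two reads are invented for illustration.

```shell
# Create a tiny, made-up FASTQ file with two reads (4 lines per read)
cat > example.fastq <<'EOF'
@read1
ACGTACGT
+
IIIIHHHH
@read2
TTGCA
+
##$%I
EOF

# Number of reads = number of lines divided by 4 (assumes 4-line records)
n_lines=$(wc -l < example.fastq)
n_reads=$((n_lines / 4))
echo "Reads: $n_reads"

# Print each read name (line 1 of a record, without the "@")
# together with its sequence length (line 2 of a record)
read_lengths=$(awk 'NR % 4 == 1 {name = substr($0, 2)}
                    NR % 4 == 2 {print name "\t" length($0)}' example.fastq)
printf '%s\n' "$read_lengths"
```

The same line-count approach can be used as a quick sanity check on a real file before running more thorough quality control.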

Phred scores

The Phred score is a measure of base quality. The larger the Phred value, the lower the probability that the base was called incorrectly, and so the better the quality of the base.
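
A Phred score Q corresponds to an error probability P = 10^(-Q/10), so Q10 means a 1 in 10 chance that a base is wrong and Q20 a 1 in 100 chance. The sketch below illustrates this. The helper names `phred_to_error` and `char_to_phred` are hypothetical, and the decoding step assumes the widely used Phred+33 encoding, where the score is the ASCII code of the quality character minus 33.

```shell
# Convert a Phred score to an error probability: P = 10^(-Q/10)
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

# Decode one FASTQ quality character into a Phred score
# (assumption: Phred+33 encoding)
char_to_phred() {
    ascii=$(printf '%d' "'$1")   # ASCII code of the character
    echo $((ascii - 33))
}

for q in 10 20 30; do
    echo "Q$q -> error probability $(phred_to_error "$q")"
done

echo "Quality character I -> Q$(char_to_phred I)"
```

Under these assumptions, the loop prints error probabilities of 0.1000, 0.0100, and 0.0010, and the character `I` decodes to Q40.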

Bringing things together

# Run NanoPlot on a FASTQ file to assess read quality
NanoPlot \
    --fastq data/barcode01.fastq \
    -o QC_plots \
    -t 2 \
    --tsv_stats

NanoPlot: defines which tool we want to use
--fastq, -o, -t, --tsv_stats: options that modify the tool's behavior

Practical part

You will have today and Monday to work on a practical and learn how to:

Navigate the command line
Work on an HPC
Analyse sequencing data

https://ndombrowski.github.io/MicEco2025/
