Introduction to Data Analysis in Microbial Ecology

Nina Dombrowski

Introduction to Bioinformatics

Bioinformatics applies computational methods to store, manage, and analyse biological data

  • Sequencing experiments generate too much data to analyse manually
  • Your Nanopore sequencing run generated ~950,000 sequencing reads
  • Computational tools allow you to store, filter and analyse such data

The Command Line Interface (CLI)

A text-based interface for giving instructions to a computer

  • Allows us to handle large datasets efficiently
  • Gives us access to many bioinformatic tools
  • Easy to document and reproduce workflows

High Performance Computing (HPC)

A shared computer system that provides more memory, CPUs, and storage space than a typical laptop


Real-life application

In this tutorial, you will use the CLI and an HPC to analyse your Nanopore sequencing reads to:

  • Determine which strains are present in the mixed communities
  • Quantify their relative abundances
  • Assess whether the counts fit with your assumptions about individual interactions
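
To illustrate what "relative abundance" means here, the following is a minimal sketch that converts read counts into percentages. The file name counts.tsv, the strain names, and the counts are all made up for illustration; the real count table in the practical may look different.

```shell
# Made-up count table: strain name and number of mapped reads (hypothetical values)
printf 'strainA\t600\nstrainB\t300\nstrainC\t100\n' > counts.tsv

# The file is read twice: the first pass sums all counts,
# the second pass divides each count by that total
rel_abund=$(awk -F '\t' 'NR == FNR { total += $2; next }
    { printf "%s\t%.1f\n", $1, 100 * $2 / total }' counts.tsv counts.tsv)
echo "${rel_abund}"
```

Each line of the output shows a strain and its share of all mapped reads in percent, so the values add up to 100.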

Workflow



flowchart LR
    classDef greenfill fill:#5B888C,stroke:#333,stroke-width:1,color:#fff;
    classDef dbfill fill:#E2F0F1,stroke:#333,stroke-width:1,color:#333;

    %% Workflow boxes
    A[Raw reads] -->|FASTQ| B[Quality control]
    B -->|FASTQ| C[Quality filtering]
    C -->|FASTQ| D[Mapping to <br> Reference DB]
    D -->|PAF| E[Quality filtering]
    E -->|PAF| F[Count table]

    %% Reference database node
    DB[(16S <br> Reference  <br> Database)] -->|FASTA| D  

    %% Assign classes after nodes exist
    class A,B,C,D,E,F greenfill
    class DB dbfill

    %% Arrow styles (index starts at 0 for first arrow)
    linkStyle 0,1,2,3,4,5 stroke:#5B888C,stroke-width:2,color:#000, fill: none


PAF:

  • Pairwise mApping Format
  • A table with information about how each read maps to the reference
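
As a rough sketch of how such a table can be filtered and turned into counts: in a PAF file, column 1 holds the read name, column 6 the name of the reference sequence, and column 12 the mapping quality. The file example.paf, the read and strain names, and the mapping-quality cutoff of 20 are assumptions made for illustration, not the settings used in the practical. The sketch also assumes one alignment line per read.

```shell
# Write a tiny, made-up PAF file (tab-separated; hypothetical reads and strains)
printf 'read1\t1500\t10\t1490\t+\tstrainA\t1550\t5\t1485\t1400\t1480\t60\n' >  example.paf
printf 'read2\t1480\t0\t1470\t-\tstrainB\t1540\t20\t1490\t1380\t1470\t60\n' >> example.paf
printf 'read3\t1510\t15\t1500\t+\tstrainA\t1550\t10\t1495\t1390\t1485\t55\n' >> example.paf
printf 'read4\t900\t0\t400\t+\tstrainB\t1540\t100\t500\t250\t400\t3\n'       >> example.paf

# Keep alignments with mapping quality (column 12) of at least 20 (assumed cutoff),
# then count how many remain per reference sequence (column 6)
counts=$(awk -F '\t' '$12 >= 20 { n[$6]++ } END { for (t in n) print t "\t" n[t] }' example.paf | sort)
echo "${counts}"
```

Here read4 is dropped because of its low mapping quality, so only three alignments contribute to the counts.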

The FASTQ format

FASTQ format is a text-based format for storing a sequence and its corresponding quality scores
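
A minimal sketch of what this looks like in practice: each FASTQ record normally spans four lines (a header starting with @, the sequence, a + separator, and one quality character per base). The file name example.fastq and the two reads are made up for illustration, and the read count below assumes unwrapped four-line records.

```shell
# Write a tiny, made-up FASTQ file with two reads (hypothetical example data)
cat > example.fastq << 'EOF'
@read1
ACGTACGT
+
IIIIHHHH
@read2
GGCATT
+
5555##
EOF

# Each record spans 4 lines, so the number of reads is the line count divided by 4
n_lines=$(wc -l < example.fastq)
n_reads=$((n_lines / 4))
echo "Number of reads: ${n_reads}"

# Print only the sequence lines (line 2 of every 4-line record)
seqs=$(awk 'NR % 4 == 2' example.fastq)
echo "${seqs}"
```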

Phred scores

The Phred score measures base-call quality: the higher the score, the lower the probability that the base was called incorrectly
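
As a small worked example: a Phred score Q relates to the probability P of a wrong base call as Q = -10 * log10(P), so P = 10^(-Q/10). The sketch below converts a few scores to error probabilities and decodes one quality character, assuming the common Phred+33 encoding (ASCII code minus 33). The chosen scores and the character "I" are arbitrary illustrations.

```shell
# Convert Phred scores to error probabilities: P = 10^(-Q/10)
phred_table=$(awk 'BEGIN {
    n = split("10 20 30", q, " ")
    for (i = 1; i <= n; i++) {
        p = 10 ^ (-q[i] / 10)
        printf "Q%d\t%g\n", q[i], p
    }
}')
echo "${phred_table}"

# Decode one quality character, assuming Phred+33 encoding
qual_char="I"
ascii=$(printf '%d' "'${qual_char}")   # numeric ASCII code of the character
phred=$((ascii - 33))
echo "Quality character ${qual_char} = Phred ${phred}"
```

A score of Q10 corresponds to 1 expected error in 10 bases, Q20 to 1 in 100, and Q30 to 1 in 1,000.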

Bringing things together

# Run NanoPlot on FASTQ files to assess quality
NanoPlot \
    --fastq data/barcode01.fastq \
    -o QC_plots \
    -t 2 \
    --tsv_stats

  • NanoPlot: defines which tool we want to use
  • --fastq and the options that follow: modify the tool's behavior

Practical part

You will have today and Monday to work on a practical and learn how to:

  • Navigate the command line
  • Work on an HPC
  • Analyse sequencing data

https://ndombrowski.github.io/MicEco2025/