Introduction to Data Analysis

in Microbial Ecology

Nina Dombrowski

Workflow



flowchart LR
    classDef greenfill fill:#5B888C,stroke:#333,stroke-width:1,color:#fff;
    classDef dbfill fill:#E2F0F1,stroke:#333,stroke-width:1,color:#333;

    %% Workflow boxes
    A[Raw reads] -->|FASTQ| B[Quality control]
    B -->|FASTQ| C[Quality filtering]
    C -->|FASTQ| D[Mapping to <br> Reference DB]
    D -->|PAF| E[Quality filtering]
    E -->|PAF| F[Count table]

    %% Reference database node
    DB[(16S <br> Reference  <br> Database)] -->|FASTA| D  

    %% Assign classes after nodes exist
    class A,B,C,D,E,F greenfill
    class DB dbfill

    %% Arrow styles (index starts at 0 for first arrow)
    linkStyle 0,1,2,3,4,5 stroke:#5B888C,stroke-width:2,color:#000, fill: none


Quality filtering



flowchart LR
    classDef greenfill fill:#5B888C,stroke:#333,stroke-width:1,color:#fff;
    classDef darkgreenfill fill:#365154,stroke:#333,stroke-width:1,color:#fff;
    classDef dbfill fill:#E2F0F1,stroke:#333,stroke-width:1,color:#333;

    %% Workflow boxes
    A[Raw reads] -->|FASTQ| B[Quality control]
    B -->|FASTQ| C[Quality filtering]
    C -->|FASTQ| D[Mapping to <br> Reference DB]
    D -->|PAF| E[Quality filtering]
    E -->|PAF| F[Count table]

    %% Reference database node
    DB[(16S <br> Reference  <br> Database)] -->|FASTA| D  

    %% Assign classes after nodes exist
    class A,B,C darkgreenfill
    class D,E,F greenfill
    class DB dbfill

    %% Arrow styles (index starts at 0 for first arrow)
    linkStyle 0,1,2,3,4,5 stroke:#5B888C,stroke-width:2,color:#000, fill: none


Useful tools:

  • Porechop (adapter removal)
  • Chopper (quality and length filtering of long reads)
  • Filtlong (quality filtering of long reads)
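
To make concrete what length and quality filtering does to a FASTQ file, here is a minimal base R sketch. It is not how any of the tools above are implemented: `filter_fastq` is a hypothetical helper, the thresholds are illustrative assumptions, and the input is assumed to consist of four-line FASTQ records with Phred+33 encoded qualities.

```r
# Hypothetical helper (not part of any tool above): keep reads within a length
# window and with a minimum mean quality. Assumes 4 lines per FASTQ record
# and Phred+33 encoded qualities.
filter_fastq <- function(lines, min_len = 1000, max_len = 2000, min_q = 10) {
  stopifnot(length(lines) %% 4 == 0)
  first <- 4 * seq_len(length(lines) / 4) - 3   # index of each header line
  seqs  <- lines[first + 1]
  quals <- lines[first + 3]

  # Average the per-base error probabilities, then convert back to a Phred score
  mean_q <- vapply(quals, function(q) {
    phred <- utf8ToInt(q) - 33
    -10 * log10(mean(10^(-phred / 10)))
  }, numeric(1), USE.NAMES = FALSE)

  keep <- nchar(seqs) >= min_len & nchar(seqs) <= max_len & mean_q >= min_q

  # Re-interleave the four lines of every kept record
  as.vector(rbind(lines[first][keep], seqs[keep],
                  lines[first + 2][keep], quals[keep]))
}

# Toy input (made-up reads, short thresholds so the example stays readable)
fq <- c("@r1", "ACGTACGT", "+", "IIIIIIII",   # Q40, length 8: kept
        "@r2", "ACG",      "+", "III",        # too short: dropped
        "@r3", "ACGTACGT", "+", "$$$$$$$$")   # Q3: dropped
filtered <- filter_fastq(fq, min_len = 5, max_len = 10, min_q = 10)
filtered
```

For real data the dedicated tools are the better choice; the sketch only illustrates which reads survive a given length window and quality cutoff.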

Read mapping



flowchart LR
    classDef greenfill fill:#5B888C,stroke:#333,stroke-width:1,color:#fff;
    classDef darkgreenfill fill:#365154,stroke:#333,stroke-width:1,color:#fff;
    classDef dbfill fill:#E2F0F1,stroke:#333,stroke-width:1,color:#333;

    %% Workflow boxes
    A[Raw reads] -->|FASTQ| B[Quality control]
    B -->|FASTQ| C[Quality filtering]
    C -->|FASTQ| D[Mapping to <br> Reference DB]
    D -->|PAF| E[Quality filtering]
    E -->|PAF| F[Count table]

    %% Reference database node
    DB[(16S <br> Reference  <br> Database)] -->|FASTA| D  

    %% Assign classes after nodes exist
    class DB,C,D,E darkgreenfill
    class A,B,F greenfill

    %% Arrow styles (index starts at 0 for first arrow)
    linkStyle 0,1,2,3,4,5 stroke:#5B888C,stroke-width:2,color:#000, fill: none


Multi-mappers


Multi-mapping reads (multi-mappers) are reads that map to multiple loci in the reference, for example to several similar 16S sequences in the reference database.

We can use mismatches and differences in read coverage to select the best match.
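
One way to apply this idea to PAF alignments is sketched below in base R. The alignments are made-up toy values, the column names are labels chosen here for a subset of the standard PAF columns, and "read coverage" is interpreted as the fraction of the read covered by the alignment, which is an assumption rather than the only possible reading.

```r
# Toy alignments (made-up values) using a subset of the standard PAF columns:
# query name, query length, query start/end, target name,
# number of matching bases, alignment block length
paf <- data.frame(
  query     = c("read1", "read1", "read2"),
  qlen      = c(1500, 1500, 1450),
  qstart    = c(10, 200, 0),
  qend      = c(1490, 1400, 1440),
  target    = c("Pseudomonas", "Flavobacterium", "Streptomyces"),
  n_match   = c(1450, 1000, 1400),
  block_len = c(1490, 1220, 1450)
)

# Fraction of matching bases: fewer mismatches and gaps give a higher value
paf$identity  <- paf$n_match / paf$block_len
# Fraction of the read that is covered by the alignment
paf$query_cov <- (paf$qend - paf$qstart) / paf$qlen

# Sort so that each read's best hit comes first, then keep one row per read
paf  <- paf[order(paf$query, -paf$identity, -paf$query_cov), ]
best <- paf[!duplicated(paf$query), ]

# Counting the best hits per target is the basis of a count table
counts <- table(best$target)
counts
```

Here read1 maps to two targets, and the hit with the higher identity and coverage is kept. Reads whose candidate hits tie on both measures would need an extra rule, such as discarding them.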

Count table and what comes next



flowchart LR
    classDef greenfill fill:#5B888C,stroke:#333,stroke-width:1,color:#fff;
    classDef darkgreenfill fill:#365154,stroke:#333,stroke-width:1,color:#fff;
    classDef dbfill fill:#E2F0F1,stroke:#333,stroke-width:1,color:#333;

    %% Workflow boxes
    A[Raw reads] -->|FASTQ| B[Quality control]
    B -->|FASTQ| C[Quality filtering]
    C -->|FASTQ| D[Mapping to <br> Reference DB]
    D -->|PAF| E[Quality filtering]
    E -->|PAF| F[Count table]

    %% Reference database node
    DB[(16S <br> Reference  <br> Database)] -->|FASTA| D  

    %% Assign classes after nodes exist
    class F darkgreenfill
    class A,B,C,D,E greenfill
    class DB dbfill

    %% Arrow styles (index starts at 0 for first arrow)
    linkStyle 0,1,2,3,4,5 stroke:#5B888C,stroke-width:2,color:#000, fill: none


Expected taxa composition:

culture  taxon
C1       Pseudomonas
C1       Flavobacterium
C2       Pseudomonas
C2       Flavobacterium
C2       Streptomyces

Count table (with the total per sample in the last row):

taxon           C1_rep1  C1_rep2  C1_rep3  C2_rep1  C2_rep2  C2_rep3
Pseudomonas         900      850      800      300      400      250
Streptomyces          0        0        0      800     3600      850
Flavobacterium      800      600     1200      900     4200      850
total              1700     1450     2000     2000     8200     1950
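
The two tables can be entered in R as plain data frames, which is one possible way to obtain the `counts_wide` and `expected` objects used in the code that follows. How these objects were actually created is not shown in the slides, so the construction below is an assumption. The values are copied from the tables above.

```r
# Count table: one row per taxon, one column per sample (no "total" row)
counts_wide <- data.frame(
  taxon   = c("Pseudomonas", "Streptomyces", "Flavobacterium"),
  C1_rep1 = c(900, 0, 800),
  C1_rep2 = c(850, 0, 600),
  C1_rep3 = c(800, 0, 1200),
  C2_rep1 = c(300, 800, 900),
  C2_rep2 = c(400, 3600, 4200),
  C2_rep3 = c(250, 850, 850)
)

# Expected taxa composition per culture
expected <- data.frame(
  culture = c("C1", "C1", "C2", "C2", "C2"),
  taxon   = c("Pseudomonas", "Flavobacterium",
              "Pseudomonas", "Flavobacterium", "Streptomyces")
)

# Column sums reproduce the "total" row of the count table
totals <- colSums(counts_wide[, -1])
totals
```

C2_rep2 has about four times as many reads as the other samples, which is why the counts are converted to relative abundances before samples are compared.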

Data wrangling

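# Load packages (tidyr, dplyr, forcats and ggplot2 are all part of the tidyverse)
library(tidyverse)

# Data re-structuring
## Convert wide to long format
## and add extra columns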
df <- counts_wide |> 
    pivot_longer(
        cols = starts_with("C"),
        names_to = "sample",
        values_to = "count"
    ) |> 
    separate_wider_delim(sample, delim = "_", names = c("culture", "rep"), cols_remove = FALSE) 

## Calculate relative abundance
## and order factors by taxa abundance 
df <- df |> 
    group_by(sample) |> 
    mutate(rel_abund = count / sum(count) * 100) |> 
    ungroup() |> 
    mutate(taxon = fct_reorder(taxon, rel_abund, .fun = sum))

# Plot data
p <- ggplot(df, aes(x = sample, y = rel_abund, fill = taxon)) +
  geom_col(width = 0.9) +
  scale_fill_manual(values = c('#CCEDB1', '#41B7C4', '#144348ff')) +
  labs(x = "", y = "Relative abundance (%)", fill = "Genus") +
  facet_wrap(~culture, scales = "free_x") +
  theme_classic()

Statistics

# Keep only the taxon-culture combinations listed in the expected composition
# (expected has the columns culture and taxon)
filtered_counts <- df |> 
  inner_join(expected, by = c("culture", "taxon"))

# Run ANOVA
res_aov <- aov(rel_abund ~ taxon * culture, data = filtered_counts)
summary(res_aov)
              Df Sum Sq Mean Sq F value   Pr(>F)    
taxon          2  924.5   462.2   9.974 0.004151 ** 
culture        1 1354.5  1354.5  29.228 0.000299 ***
taxon:culture  1 1012.6  1012.6  21.851 0.000875 ***
Residuals     10  463.4    46.3                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


# Run post hoc test (Tukey's HSD)
res_tukey <- TukeyHSD(res_aov)
res_tukey$`taxon:culture` |> 
  as.data.frame() |> 
  filter(`p adj` < 0.05)
                                      diff       lwr       upr        p adj
Pseudomonas:C2-Flavobacterium:C1 -38.57986 -57.88592 -19.27379 0.0004143459
Pseudomonas:C2-Pseudomonas:C1    -39.62110 -58.92717 -20.31504 0.0003321700
Pseudomonas:C2-Flavobacterium:C2 -35.70356 -55.00963 -16.39750 0.0007787975
Streptomyces:C2-Pseudomonas:C2    31.59787  12.29181  50.90394 0.0020231948
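
The `diff` column of the post hoc table is the difference between two group means. As a sanity check, the Pseudomonas:C2-Pseudomonas:C1 row can be recomputed by hand from the count table. The base R sketch below is an independent check under that reading, not the code used to produce the output above; the object names are chosen here for illustration.

```r
# Count table as a matrix (taxa in rows, samples in columns)
counts <- matrix(
  c(900, 850,  800, 300,  400, 250,    # Pseudomonas
      0,   0,    0, 800, 3600, 850,    # Streptomyces
    800, 600, 1200, 900, 4200, 850),   # Flavobacterium
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("Pseudomonas", "Streptomyces", "Flavobacterium"),
    c("C1_rep1", "C1_rep2", "C1_rep3", "C2_rep1", "C2_rep2", "C2_rep3")
  )
)

# Relative abundance (%) within each sample (column)
rel_abund <- prop.table(counts, margin = 2) * 100

# Mean relative abundance of Pseudomonas in each culture
culture      <- sub("_.*", "", colnames(rel_abund))
pseudo_means <- tapply(rel_abund["Pseudomonas", ], culture, mean)

# Difference C2 - C1: about -39.62, as in the Tukey table
diff_c2_c1 <- unname(pseudo_means["C2"] - pseudo_means["C1"])
diff_c2_c1
```

Pseudomonas drops from about 50.5 % in C1 to about 10.9 % in C2, which is the difference the test flags as significant.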

Data interpretation



flowchart LR
    classDef greenfill fill:#5B888C,stroke:#333,stroke-width:1,color:#fff;
    classDef dbfill fill:#E2F0F1,stroke:#333,stroke-width:1,color:#333;
    classDef darkgreenfill fill:#365154,stroke:#333,stroke-width:1,color:#fff;

    %% Workflow boxes
    A[Raw reads] -->|FASTQ| B[Quality control]
    B -->|FASTQ| C[Quality filtering]
    C -->|FASTQ| D[Mapping to <br> Reference DB]
    D -->|PAF| E[Quality filtering]
    E -->|PAF| F[Count table]

    %% Reference database node
    DB[(16S <br> Reference  <br> Database)] -->|FASTA| D  

    %% Assign classes after nodes exist
    class A,B,C,D,E,F darkgreenfill
    class DB dbfill

    %% Arrow styles (index starts at 0 for first arrow)
    linkStyle 0,1,2,3,4,5 stroke:#5B888C,stroke-width:2,color:#000, fill: none


Once you have visualized your data and run the statistics, check whether the results fit the hypotheses you formed from your interaction experiments.