Phylogeny tutorial

1 General introduction

This tutorial is the written-up version of a lecture given by Dr. Anja Spang and Dr. Nina Dombrowski as part of a bioinformatics workshop given for graduate students at the Royal Netherlands Institute for Sea Research (NIOZ) in the Netherlands.

The goal of this tutorial is for you to learn how to run a phylogeny analysis using single as well as concatenated marker genes.

During this tutorial we will work with three proteins that are found in ~40 archaeal genomes (all belonging to the same phylum) and we want to generate individual phylogenetic trees for each of these three single proteins as well as for a concatenation of those proteins.

This notebook consists of two parts:

A theoretical part discussing various aspects important for phylogenetic analyses
A practical part in which we describe how to generate phylogenetic trees

The tutorial was setup to work on the NIOZ servers, where all of the required tools are already installed. The exception are custom scripts, which will be provided as part of this tutorial. If you work on a different system you need to set up programs, such as mafft, yourself and change file paths if needed.

1.1 Version programs

The versions might differ from what you have available on your system and should not be problematic but for consistency we list the version numbers that were used to generate this workflow as well as links with install instructions.

Python 2.7.5
perl v5.16.3
HMMER 3.1b2; install instructions can be found here
GNU parallel 20181222; install instructions can be found here
Iqtree v2.1.1
BMGE v1.12
MAFFT v7.453

2 Theory

2.1 What is a phylogenetic tree and how do we read it?

A phylogenetic tree can be seen as:

Useful terms:

tip: Species or groups of interest that are represented by the sequences we collected and want to use for our phylogeny
node: point where lineages diverge, representing a speciation event from a common ancestor
branch length: amount of character change or genetic change. The longer the branch, the more changes have accumulated
root: the most recent common ancestor of all sequences in the phylogeny
outgroup: a taxon outside of the group of interest, often used to root your tree. Ideally, this should be general the most closely related sequence(s) outside the group of interest.
scale bar: usually when you see scientific figures showing phylogenetic trees you will see a scale bar, which shows the substitutions across sites

Sometimes trees are also represented as a cladogram in which branch lengths do not correspond to the amount of character change. Below we see a phylogram next to a cladogram. In cladograms branches are either equal in length or, as in the example below, all branches end flush. If you are unsure if you are dealing with a cladogram, it might be useful to check if there is a scale bar on the diagram. If there isn’t, it is probably a cladogram.

2.2 Some basic terminology

1. Monophyly

The yellow dots in the figure above show synapomorphies, which are defined as shared derived characters that are present in all members of a phylogenetic group and its ancestor.

2. Polyphyly

3. Paraphyly

A group is paraphyletic if it consists of the group’s last common ancestor and all descendants of that ancestor excluding a few (typically only one or two) monophyletic subgroups.

4. Divergent versus convergent evolution

Divergent evolution or divergent selection is the accumulation of differences between closely related populations within a species, leading to speciation (i.e. the the formation of two or more descendant species from a single ancestral species). Divergent evolution may occur as a response to changes in abiotic factors, such as a change in environmental conditions, or when a new niche becomes available.
Parallel evolution: the similar development of a trait in distinct species that are not closely related, but share a similar original trait in response to similar evolutionary pressure
Convergent evolution: Organisms not closely related independently evolve similar traits as a result of having to adapt to similar environments or ecological niches. Convergent evolution occurs when descendants resemble each other more than their ancestors did with respect to some feature.

5. Rooted versus unrooted trees

A rooted tree is a tree in which one of the nodes is defined as the root, and thus the direction of evolutionary change is determined. In contrast, an unrooted tree has no pre-determined root and therefore shows no hierarchy, i.e. the directionality of evolution is unknown. Rooting an unrooted tree involves inserting a new node, which will function as the root node.

The most widely known way to root is outgroup rooting. In this case, we need to use external knowledge (or make a hopefully reasonable assumption) to identify at least one taxon known to have diverged earlier then the group of interest. Howeverk the outgroup should not be too distant to establish character homology.

An alternative way to root is midpoint rooting. In this case, the longest distance between two tips on the tree is identified and the root is placed precisely in the middle of that distance. The assumption behind midpoint rooting is that character changes across the phylogenetic tree are approximately clock-like, i.e they happen approximately at the same speed in every lineage. Unfortunately, this is rarely the case and evolution is in general far from clock-like (e.g. different rates of evolution in different species, different selection pressure on different sites in a proteins etc). Whenever, the assumption of the clock is violated, midpoint rooting will be misleading.

2.3 How do we select marker genes?

Before we can construct a phylogenetic tree, we need to find genes with a shared ancestry, i.e. we need to find homologous genes.

Homologous sequences can be further distinguished into orthologs and paralogs:

Orthologs are genes that in different species evolved from a common ancestral gene
Paralogs are gene copies created by a duplication event within the same genome.
sometimes, paralogous genes evolve different functions due to decreased selective pressure on the copy of the duplicated gene

Selecting a good marker depends on the scientific question? Do we want to:

Approximate a species tree? Then we can use ribosomal RNA genes (i.e. the 16S rRNA gene) or universal, conserved and single-copy marker genes or proteins. -Study the evolution of enzyme families? Then we can look at genes or protein sequences belonging to a specific family, such as the ATP synthase and include all homologs of our gene of interest in a phylogenetic analysis
Use markers for concatenations when generating a species tree ? Then its important that the individual markers have the same underlying topology/evolutionary history and that the correct orthologs have been identified.

One important thing to remember is that a RNA/gene or a protein does not necessarily equal the species tree. For example, differences between the two can be due to horizontal gene transfers, incomplete lineage sorting, gene duplications or gene losses.

2.4 Alignments

Before phylogenetic analyses can be performed, sequences have to be aligned such that each column of the alignment contains those character states that have evolved from the same ancestral state. There are several algorithms to do this:

Progressive alignments: Here, the algorithm first calculates pairwise distances between all sequences. Then a guide tree is computed, usually using neighbour joining methods. Afterwards, the algorithm builds a progressive alignment using the branching order of the tree. This means that two species that are closest in the tree are aligned first. Then the sequences are treated as one pair of sequences (fixing gaps) and aligned to the next closest sequence. This is repeated until all sequences are aligned. Programs that use this approach are ClustalW and ClustalX.

The benefit of this approach is that it is fast and can be used for a large number of sequences. However, if errors are introduced early during the progressive alignment (e.g. misalignments, introduction of erroneous gaps etc), these will remain throughout the analysis. Especially, if errors are introduced during the first steps, they will be carried throughout the whole alignment.

Below we see an example of errors in the alignment, i.e. the sequences in the black box look very messy. There is the option to manually clean the alignment to remove these issues.

However, since manual cleaning takes a lot of work and is often not reproducible, it is recommended to first of all use better alignment tools, which decrease the amount of erroneously aligned sites. One option to do this are:

Iterative alignments: This approach gradually optimizes alignments. In contrast to progressive alignments, these alignments are based on a progressive alignment combined with a scoring system and refinement steps. These algorithms work similarly to progressive methods but repeatedly realign the initial sequences. Due to the refinement steps this approach is slower but much more accurate. Programs that use this approach are mafft and muscle.

2.5 Trimming and visualizing alignments

As we have mentioned above, alignments are not perfect. They can have gappy regions that are generated when certain regions are poorly conserved and therefore challenging to align accurately (i.e. this is especially the case for the beginning and end of the sequences).

To reduce the amount of erroneously aligned sites and gappy regions, it is recommended to implement a trimming step, which removes noisy regions.

Commonly used trimming tools, that remove certain columns from the alignment based on specific parameters of interest are:

TrimAL
BMGE

Due to the issues with alignments it is always recommended to visually inspect your alignment both before and after trimming using tools such as:

seaview
jalview
…

2.6 Reconstructing trees

There are many different algorithms to construct a phylogenetic tree:

2.6.1 Neighbour joining approaches:

Construct a tree by sequentially finding pairs of the nearest neighbors (with shortest pairwise distances) based on a distance matrix.
Such approaches correct for multiple substitutions, i.e. mutational saturation.
Use a clustering algorithm for tree reconstruction and optimize the length of internal branches
Are very fast but therefore have several disadvantages: e.g. instead of taking into account all character states of a sequence, these approaches only considers distance information (and as such don’t look at all the information a sequence provides). Furthermore, this approach does not implement an optimality criterion.

Due to the disadvantages, neighbour joining approaches are mainly used to generate quick guide tree, which serves as input for alternative treeing methods such as maximum likelihood or Bayesian. These latter methods subsequently improve the initial guide tree (see below).

2.6.2 Maximum parsimony (MP) approaches

Choose the simplest explanation that fits the evidence
Are very sensitive to long branch attraction (which is problematic, more on that a bit later)
Can not correct for multiple substitutions –> and as such underestimates true divergence
Can be used for small datasets, when we have slow rates of evolution or look at closely related sequences.

Long-branch attraction (LBA) is the erroneous grouping of highly divergent sequences as sisters. This grouping is usually based on the parallel accumulation of a large amount of substitutions. I.e. convergent changes along branches are misinterpreted as being similar due to common decent. While LBA often occurs in parsimony based methods, it is a common problem in most phylogenetic reconstructions and also occurs in methods implementing models of evolution, especially if these are too simple and do not well describe the evolutionary processes that have shaped the sequences being analysed. For example in the tree below, we can see how the long branches from taxa A and B are erroneously attracted to each other in the right tree:

2.6.3 Maximun likelihood (ML) approaches

ML algorithms search for the tree that maximizes the probability of observing character stats (i.e. the aligned positions) given a tree topology and a model of evolution
Involve numerical optimization techniques that find the combination of branch length and evolutionary parameters that maximize the likelihood.
This approach is computationally demanding and the resources needed dependent on the chosen model of evolution.

Commonly used tools:

FastTree: very fast, suitable for a first guide tree or trees for very large datasets, rough approximation
RaxML: slower, possible to use more complex models of evolution
IQ-tree: slower, possible to use more complex models of evolution, extremely well documented and constantly updated

2.6.4 Bayesian approaches

Character-state methods that use an optimally criterion
In contrast to MP and ML, Bayesian approaches do not try to find the best tree.
Instead, Bayesian approaches search for a set of plausible trees given the data. In turn, the posterior distribution holds a confidence estimate of any evolutionary relationship
The user needs to specify a prior believe, i.e. prior distribution on model parameters, branch length and tree topology.
This approach is very slow and recommended for smaller datasets.

Commonly used tools:

MrBayes
PhyloBayes

2.7 Models of sequence evolution

2.7.1 For DNA

Substitution model, or models of sequence evolution, are models that describe changes over evolutionary time.

There are many substitution models, one of these is the Jukes Cantor model:

The Jukes Cantor model often is too simple, and other models account for different base/amino acid frequencies, such as the Felsenstein model.

These frequencies are plotted in a substitution matrix : A two dimensional matrix with scores describing the probability of one amino acid (or nucleotide) being replaced by another.

We can also take into account that there are different frequencies of transversions versus transitions, such as in the Kimura 2 parameter model:

The general time reversible model (GTR) is the most general neutral, independent, finite-site, time-reversible model possible and is generally well suited for phylogenetic reconstructions based on DNA/RNA sequences.

2.7.2 For proteins

For proteins many substitution models exist, i.e. in IQ-tree we can use these: http://www.iqtree.org/doc/Substitution-Models

And there are further parameters that we can use specify different kinds of base frequencies with these settings:

For example GTR+FO optimizes base frequencies by ML whereas GTR+F counts base frequencies directly from the alignment.

While this sounds overwhelming there are some tools you can use to estimate what the best model for your data is and we will discuss this a bit later in the practical session.

Other things to consider are:

Rate heterogeneities
Protein mixture models

2.7.2.1 Rate heterogenity:

In more traditional phylogenetic models, all genes are assumed to evolve at the same rate. However, genes evolve at very different rates and it is important to account for this rate heterogeneity.

Some examples:

Nucleotide substitution rates vary for different positions in the sequence, i.e. the third position in codons often mutates faster.
Additionally, the genetic code is degenerate (i.e. redundant) and transitions are less likely to change amino acids.
Rates of mutation can vary among sites because of selective pressures associated with structural and functional constraints

So how can we account for rate heterogeneity?

We need to model rate distributions over sites
i.e. we can use gamma distributions in which we assume that the rate at each site is a random variable drawn from a statistical distribution (i.e. a gamma distribution)

Some options to account for this in IQ-TREE are:

2.7.2.2 Protein mixture models

Standard protein substitution models use a single amino acid replacement matrix but site evolution is highly heterogenous. I.e. depending on their position and role in the proteins shape and structure, sites will accept only a very set of amino acids, while selecting against other, unfavored, amino acids. Therefore, a single relpacement matrix is often not enough to represent all the complexity of evolutionary processes. To overcome this we can combine several amino acid replacement matrices to better fit protein evolution.

Options in IQ-TREE are:

Complex models usually generate a tree that will better fit the data, i.e. the tree will have a higher likelihood. But a more complex model will have more free parameters to estimate and thus might have a greater error (i.e. variance). Therefore, the simplest model that is good enough to model your data is the best model.

2.8 Choosing the best models

What model we choose depends on our data but we have to keep in mind that we can generate erroneous trees if our model is too simple to describe our data. However, if we chose models with to many parameters (i.e our model is too complex) then we can also get a tree with a wrong topology.

There are other tools, but for our purpose, let’s see how we can determine the best model using IQ-TREE (more on this in the practical part):

with the standard model selection (-m TEST option) or the new ModelFinder (-m MFP) option.
these models automatically select the best-fit model for our phylogenetic analyses
NOTE: to test the more complex C10-C60 models, these need to be specified individually

2.9 Confidence: Can I trust my trees?

Various methods allow to assess the confidence in branching patterns or branch supports.

Methods that are implemented in IQ-TREE:

Bootstrapping
Ultrafast bootstrap approximation
SH-like approximate likelihood tests

For Bayesian trees, i.e. PhyloBayes, we can use:

Posterior predictive (PP) tests

Below we see an example that explains the rationale behind bootstrapping, one of the most commonly used approaches.

3 2. Practical part

The basic workflows for phylogenetic analyses normally follow these steps:

Identify your marker genes/proteins
Align marker genes
Trim alignments
When working with more than one gene/protein AND when these have the same evolutionary history: concatenate alignments
Run phylogenetic analyses
Visualize results

In this tutorial we will go through all these steps.

For this workflows it is generally recommended to put files for different parts of the workflow into different subfolders. Here, our structure will look as follows:

3.1 Preparing your working directory

Getting started

Let’s first start with how to get to your files. For this to work:

Go to the directory you want to work with the files and make a new directory named Phylogeny_tutorial
Into this folder make a a symbolic link to the two folders containing the data we need: 0_Scripts_and_needed_files and 1_Protein_Seqs from /export/lv3/scratch/bioinformatics_workshop/S13_Phylogeny.

For people following the tutorial on their own, both folders are also part of the github repository and can be found in the folder Input_files.

#set variable for the base directory and personal user directory
user_name="ndombrowski"
base_dir="/export/lv3/scratch/bioinformatics_workshop/Users/"

#set variable to the folder with the data
download_dir="/export/lv3/scratch/bioinformatics_workshop/S13_Phylogeny"

#go to working directory
wdir=${base_dir}/${user_name}
cd $wdir 

#make a folder for the Phylogeny_tutorial and go into it
mkdir Phylogeny_tutorial
cd Phylogeny_tutorial

#make a symlink for the relevant data
ln -s ${download_dir}/0_Scripts_and_needed_files/ .
ln -s ${download_dir}/1_Protein_Seqs/ .

Questions:

Check what files you have in each of the two new directories. Specifically, answer with how many files we work based on what you see in the 1_Protein_Seqs folder
Look into a sequence file that is found in the folder 1_Protein_Seqs named arCOG00779.faa, what do the sequence names look like?
How many sequences do we have in each fasta file?

Click me to see an answer

#Qst 1: list all files in your directory and subdirectories
ll 0_Scripts_and_needed_files/*
ll 1_Protein_Seqs/*

#Qst2: if we want to check how our file looks we can view the first 10 lines with head
#We can see that we have a very short sequence header, in this case it is just the genome id
#This is not very descriptive and usually too short, i.e. we would want at least a protein ID in there
#but for our purposes this is a good format as it allows us to combine the sequences later
head 1_Protein_Seqs/arCOG00779.faa

#Qst3: check how many sequences we have in an individual file
grep -c ">" 1_Protein_Seqs/arCOG00779.faa

#Qst3: check how many sequences we have across all files file
grep -c ">" 1_Protein_Seqs/*.faa

3.2 Identify your marker genes/proteins

If we have a set of genomes we are interested, we often first need to identify phylogenetic markers in these genomes. There are several options how you can do this and we cover this in more detail in our Annotation tutorial.

Warning

Since identifying genes/proteins usually takes a bit longer, below you can find an example on how you can do this but we will not actually run this code during this tutorial.

Below All_Arcogs_2018.hmm is a database we also use in our Annotation tutorial and this database of clusters of orthologous groups is especially useful if you work with archaeal genomes. For bacterial genomes NCBI_COGs_Oct2020.hmm might be more suitable and we recommend some studies and useful marker proteins for your own research at the end of this tutorial.

Briefly, we extracted marker proteins from our genomes of interest as follows:

We work with 40 archaeal genomes, all belonging to the same phylum in the DPANN archaea
We identified proteins in these genomes using Prokka
Then we concantenated the proteins from all 40 genomes into a single file that we named Genomes.faa
In these 40 genomes, we searched for sequences that reflect marker genes we commonly use for phylogenetic analyses using a hmmsearch (see below). Specifically, since we work with archaeal genomes we annotate proteins in our genomes using the archaeal Clusters of Orthologous Genes (arCOGs).
Next, we extract proteins assigned to commonly used marker proteins, for example ribosomal proteins, and generate one fasta file for each marker of interest.

#identify arcogs in our genomes of interest
hmmsearch --tblout arcogs/sequence_results.txt --domtblout arcogs/domain_results.txt All_Arcogs_2018.hmm Genomes.faa > Output.txt

#format the full table (i.e. remove unnecessary lines) 
#and only select sequences above a certain e-value
sed 's/ \+ /\t/g' arcogs/sequence_results.txt | sed '/^#/d'| sed 's/ /\t/g'| awk -F'\t' -v OFS='\t' '{print $1, $3, $6, $5}' | awk -F'\t' -v OFS='\t' '($4 + 0) <= 1E-3'  > arcogs/sequence_results_red_e_cutoff.txt

#get best hit for each protein based on bit score, and then evalue (in case two sequences have the same bitscore)
sort -t$'\t' -k3,3gr -k4,4g arcogs/sequence_results_red_e_cutoff.txt | sort -t$'\t' --stable -u -k1,1  | sort -t$'\t' -k3,3gr -k4,4g >  arcogs/sequence_results_red_e_cutoff_uniq.txt

#if we want to extract the proteins for our phylogenetic markers we could do this as follows:
grep "arCOG00779" arcogs/sequence_results_red_e_cutoff_uniq.txt > MarkerLists/single_markers/arCOG00779.txt

#if we have a lot of markers we are looking for, we can store these in list and use a for loop:
for sample in `cat  Markers_of_interest.txt; do grep "$sample" arcogs/sequence_results_red_e_cutoff_uniq.txt > MarkerLists/single_markers/${sample}.txt; done

#extract the proteinID from our hmmer results(check that this ID is indeed in column1!) 
for sample in `cat Markers_of_interest.txt`; do awk -F'\t' -v OFS='\t' '{print $1 }' MarkerLists/split/$sample* > MarkerLists/protein_list_non_dedup/$sample |LC_ALL=C  sort ; done

#extract faa sequences based on the proteinID we extracted
mkdir Marker_Genes

for sample in `cat  Markers_of_interest.txt`; do perl 0_Scripts_and_needed_files/screen_list_new.pl MarkerLists/protein_list_non_dedup/$sample Genomes.faa keep > Marker_Genes/${sample}.faa; done

In the code above, we have a file with all our genome protein sequences and a database of hmm profiles that among others include the arCOG IDs we are interested in. More on hidden hmm profiles can be found here.

Note

A word on single-copy marker genes

Using this approach you might not end up with single-copy marker genes. For example, we might find that ribosomal protein 3 (rps3) occurs 2x in the genome of Nanoarchaeum equitans (Ne) despite ribosomal proteins usually only occurring as single-copy. This second copy can be explained by multiple things: It could be a contamination, a wrongly annotate protein or a true duplication.

When you use the code above we would extract not a single copy of rps3 but two copies for Ne. When running a phylogeny using single gene trees that is not an issue and we can even use the phylogenetic tree to help us investigate if the second copy is indeed a contamination, a wrongly annotated protein or a true duplication event. However, for concatenated phylogenies (more on this a bit later) this becomes problematic as we can only concatenate sequences from different markers if they occur 1x per genome.

If you a marker occurs more than once in your genome and you want to concatenate sequences there are different ways around this:

One way is to run a single gene tree and select the sequence that is most likely the true sequence and falls where you expect it to fall (this is the recommended approach but can be time-consuming if you work with many taxa).
Alternatively, you can keep the best hit based on scores such as the e-value by filtering the hmmsearch results further with awk or python. However, this approach is not recommended because you can not decided based on an e-value what of the two sequences is the best to keep.
We also have a script that decides what proteins to use by comparing sequence identities and will integrate it at a later point in the tutorial.

For this tutorial, this will not be a problem, as we cleaned the sequences for you beforehand.

3.3 Naming files and what are we actually working with?

For a lot of workflows it is very important how your genomes and especially your protein sequences are named. For example, you should ask yourself: how long is the name, are there extra symbols, is it informative, etc. Especially if you have a large number of files, or very large files, it is extremely helpful to have a consistent naming scheme.

Some very general hints for naming files:

Have a useful header that describes what is inside your file. I.e. for a protein fasta header, you might not only include the protein ID but also the genome ID
Be short, concise and do not use extra symbols (try to avoid using spaces, ideally use mainly - , or _)
If you want to concatenate files (which is important for phylogenetic analyses), your fasta files need to have a common part in the sequence header. I.e. your genome ID (i.e. B21_G17 in our files) should be part of the sequence header.

In our example we have a very simple header, for each protein sequence we just have the genome ID. This is over-simplified for the sake of this tutorial, more commonly you will also have the proteinID in the fasta header. The reason we removed this beforehand, using the cut command, is that we later want to concatenate the sequences and for this to work, we need a common name. If you work with your own files this is something to keep in mind.

The three proteins we are working with during this tutorial are:

arCOG00779: Ribosomal protein L15, average length of 150 amino acids
arCOG01762: DNA-directed RNA polymerase subunit B, average length of 850 amino acids
arCOG04050: 5’-3’ exonuclease, average length of 450 amino acids

If we would look at the sequences of one of these protein faa files with a sequence viewer we would see something like this:

You can look at the sequence yourself with:

seaview 1_Protein_Seqs/arCOG00779.faa

We can see that the sequence looks very messy and unorganized, so the first thing we have to do is to align our sequences.

3.4 Align sequences

There are different tools to align sequences, our recommended tool is MAFFT

Depending on how many sequences we have, we can run this in two ways: step by step or more efficiently in a for loop (or using GNU parallel).

Run MAFFT individually for each gene:

#create a list for the markers we work with
ls 1_Protein_Seqs/*faa | sed 's/1_Protein_Seqs\///g' | sed 's/\.faa//g' > MarkerList

#count if the numbers of markers stored in the list is correct: 3
wc -l  MarkerList

#prep folder
mkdir 2_Mafft

#run MAFFT
mafft-linsi --reorder --thread 2  1_Protein_seqs/arCOG00779.faa > 2_Mafft/arCOG00779.aln
mafft-linsi --reorder --thread 2  1_Protein_seqs/arCOG01762.faa > 2_Mafft/arCOG01762.aln
mafft-linsi --reorder --thread 2  1_Protein_seqs/arCOG04050.faa > 2_Mafft/arCOG04050.aln

Run MAFFT for each file using one command (for example using a for-loop):

Advanced exercise:

Run all 3 sequences using a loop or using GNU parallel.

Click me to see an answer

#create a list for the markers we work with
ls 1_Protein_Seqs/*faa | sed 's/1_Protein_Seqs\///g' | sed 's/\.faa//g' > MarkerList

#count if the numbers of markers stored in the list is correct: 3
wc -l  MarkerList

#prep folder
mkdir 2_Mafft

#using a for-loop
for i in `cat MarkerList`; do mafft-linsi --reorder --thread 2 1_Protein_Seqs/${i}.faa > 2_Mafft/${i}.aln; done

#using gnu parallel
parallel -j3 'i={};mafft-linsi --reorder --thread 2 1_Protein_Seqs/${i}.faa > 2_Mafft/${i}.aln' ::: `cat MarkerList`

Note

If you have less than 1000 sequences we STRONGLY recommend that you run mafft_linsi for your main analyses as it is much more accurate. For more sequences, this becomes too time consuming and you could just use mafft.

Info on the options used in the command above:

MAFFT is our tool we use for aligning sequences
MAFFT-linsi: Here, we run mafft with an iterative refinement method incorporating a local pairwise alignment. This works best for (in our experience) up to 1000 sequences. With more sequences, you should check the website for how to best run the aln in your case.
--reorder is an option that reorders our sequences based on how similar they are
--thread defines with how many CPUs we run our analysis
> redirects the output of a command to a file
In the GNU parallel command:
- -j3 defines how many jobs we run in parallel, here: 3

Exercise

Look at the files using seaview. How do they look compared to when we looked at the unaligned sequences?

Click me to see an answer

Answer: Our files should look something like this using seaview:

Looking at the highlighted areas we can already see that there are some potentially problematic regions:

The ends usually do not look too good because they are more difficult to align
Often we have a few sequences with insertions

Due to these issues it often makes sense to trim the alignment, i.e. remove problematic positions in the alignment.

3.5 Trimming

Same as for the trimming, there are different tools you can use, such as trimAL or BMGE. Generally, BMGE is more conservative than trimAL.

For this tutorial, we will run BMGE for each of our three proteins using a for loop, but as you have seen above, we can also use GNU parallel to speed things up even more:

#prepare folder
mkdir 3_BMGE

#trim our alignments
for i in `cat MarkerList`; do java -jar /opt/biolinux/BMGE-1.12/BMGE.jar -i 2_Mafft/${i}.aln -t AA -g 0.2 -m BLOSUM30 -h 0.55 -of 3_BMGE/${i}_trimmed.aln; done

The options here are:

-t the input file, which contains our trimmed sequence alignment
-g the maximum gap rate (range 1-0; 0.2 is used by default) = the proportion of gap characters (i.e., dashes or other characters indicating gaps) allowed at each position in the alignment. Setting a lower value for -g will result in a more stringent trimming of the alignment, as fewer gaps will be allowed at each position.
-m Which BLOSUM (BLOcks SUbstitution Matrix) matrix we use for aligning and trimming the alignment (default 62). BLOSUM matrices are a family of matrices that are commonly used in sequence alignment to quantify the similarity between two amino acid sequences. They are based on the observation that certain pairs of amino acids occur together more frequently than others in evolutionarily related proteins. The BLOSUM matrices are named according to the percentage of identity between sequences used to generate them. For example, BLOSUM62 is based on a set of sequences that are at least 62% identical to each other. If you are working with highly divergent sequences, you may want to use a more permissive matrix, such as BLOSUM30 or BLOSUM45.On the other hand, if you are working with very closely related sequences, you may want to use a more restrictive matrix, such as BLOSUM80 or BLOSUM90.
-h The maximum entropy threshold that is used to identify blocks of conserved positions in the alignment (range 1-0; default 0.5). Entropy is a measure of the amount of variation or diversity at a given position in the alignment. Positions with high entropy contain a mixture of different amino acids, while positions with low entropy are more conserved and contain fewer different amino acids.
-of the name of the output file and output format

BMGE will print to the screen how much shorter the alignment got after the trimming: For example, in the alignment for arCOG00779 we initially had 203 characters, which were reduced to 137 characters after trimming:

If we check our alignment with seaview after trimming, we should see something like this:

The highlighted areas are the regions we before labelled as “problematic”. We can see that they have been removed.

Depending on your settings and the type of proteins analysed, i.e. when we have a very short protein, BMGE sometimes can be very stringent. Here, you can always play with the settings and control the gap penalties, for more information see the paper and here

Another alternative you might want to check is Trimal.

3.6 Generate a phylogenetic tree

As discussed above there are several tools to use. We will work with IQ-TREE, which as some good documentation that can be found here:

http://www.iqtree.org/doc/ http://www.iqtree.org/doc/iqtree-doc.pdf

As well as slides from an IQ-TREE workshop:

http://www.iqtree.org/workshop/

Running phylogenies can be memory intensive, so depending on your job and the capacity of the server/computer you are using:

Check what resources you have available as some trees need to have >20 CPUs to run well.
Additionally, the longer the alignment and the more sequences you have, the more memory will be needed. Before starting the run, IQ-TREE will print to the screen (and the log file how much memory a job is expected to take and you can estimate the memory settings accordingly). I.e. you will see something like: NOTE: 2 MB RAM (0 GB) is required!
You do not want to clog the system and stop other users from running their job, so be mindful of these things when submitting a job to ada or laplace.

To run phylogenies on the NIOZ servers we first have to load the module for IQ-tree on ada (and laplace) to be able to use IQ-TREE:

#check avail modules on ada
module avail

#load module we need
module load iqtree/2.1.1

3.6.1 Examples for running iqtree with model selection

Warning

These are just some examples (and need some time to run). So you can test them on your own as an exercise after the workshop and we will NOT run them now. If you run them on your own, you can run them individually, in a for-loop or with GNU parallel.

The basic command to run an alignment and test for different models is:

#Run an extended model selection that additionally includes the FreeRate model
iqtree2 -s alignment.aln -m MF  -pre outputfolder/alignment

The command above will test all models known to by IQ-TREE, if you want to speed this up you can choose a certain set of models instead of testing all models.

#Run an extended model selection on the WAG, LG and JTT model that additionally includes FreeRate models
iqtree2 -s alignment.aln -m MF -mset WAG,LG,JTT

This is important: By default, substitution models are not included in these tests. If we want to test them we have to add them to the list of models we want to test. Generally, it is recommended to include them in the test and the following selection would be quite comprehensive for testing models.

#Run an extended model selection that additionally includes FreeRate model and  run tree with the best model found
iqtree2 -s alignment.aln -m MF -madd LG+C10,LG+C20,LG+C30,LG+C40,LG+C50,LG+C60,LG+C10+R+F,LG+C20+R+F,LG+C30+R+F,LG+C40+R+F,LG+C50+R+F,LG+C60+R+F --score-diff all

Used options:

-m MF Extended model selection (not followed by tree inference)
-m MFP Extended model selection followed by a tree interference
-mset WAG,LG,JTT Comma-separated model list in case we want to not test all models
-madd List of mixture models to consider
-pre Prefix for all outputfiles (useful if we need to rerun trees to i.e. test different models)

Exercise

Run a model test using the WAG, LG, JTT models on the trimmed alignment for arCOG04050

Click me to look at the code

mkdir -p 4_treefiles/model_test

#Run an extended model selection on the WAG, LG and JTT model that additionally includes FreeRate models
iqtree2 -s 3_BMGE/arCOG04050_trimmed.aln -m MF -mset WAG,LG,JTT -pre 4_treefiles/model_test/arCOG04050_model_test

When we check what gets printed to the screen, we see that IQ-TREE compares 71 different models and it looks for the best model using different Criteria (we won’t go into the specifics here) but we generally recommend using the Bayesian Information Criterion. The best model chosen can be found in the line Best-fit model: LG+I+G4 chosen according to BIC. The results are also stored in 4_treefiles/model_test/arCOG04050_model_test.log

Regardless of the criterion, the best model is the one with the lowest AIC or BIC score.

3.6.2 Run a tree on our protein files

We have seen, that for arCOG00779 the best model to use is LG+I+G4 (at least when not testing for mixture models).

However, for now, lets try to generate a phylogentic tree using one of our simpler models, which has the benefit that it runs quicker.

#run iqtree on all our three alignments
for i in `cat MarkerList`; do iqtree2 -s 3_BMGE/${i}_trimmed.aln -m JTT -nt 4 -B 1000 -pre 4_treefiles/${i}_JTT; done

The options are:

-s Input alignment that can be provided in PHYLIP/FASTA/NEXUS/CLUSTAL/MSF format
-m model to use (here: JTT, notice: We just run JTT because it is relatively fast but generally it is not a model you would use since it is too simple)
-nt number of threads (4 CPUS here, we can also use AUTO to determine best fit, which is recommended for larger trees)
-B Ultrafast bootstrapping (using 1000 replicates)

IQ-tree will provide you with several output files. In our case, these files will start with the marker name followed by the model and have the following file extensions:

.iqtree: Full results of the run, this is the main report file that records what was run when (among other things)
.log: The log file that records how we ran the command and everything that gets printed to the screen
.contree: Consensus tree with assigned branch supports where branch lengths are optimized on the original alignment; printed if Ultrafast Bootstrap is selected
.ckp.gz: Checkpoint file; useful if a job was stopped because of RAM/CPU limits because it allows us to pick up the run from the last checkpoint
.mldist: file with maximum likelihood distances
.splits.nex: support values in percentage for all splits (bipartitions), computed as the occurence frequencies in the bootstrap trees. This file is in NEXUS format, which can be viewed with the program SplitsTree to explore the conflicting signals in the data.
.uniqueseq.phy: among a group of identical sequences, IQ-TREE will keep the first two and ignore the rest and this file contains this “reduced” sequence set
.treefile: Maximum likelihood tree in NEWICK format –> we usually work with this one

Some more general comments:

The more complex the model, the more CPUs you would need. Setting -nt auto automatically chooses the best amount of CPUs. When using this make sure that you can use the maximum number of CPUs and do not overload it. Since this will take a bit if time to compute, do this only if you have extra time or after the workshop
Since we have very simple trees, i.e all genomes belong to the same phylum, we might not see many differences between the models. But generally, the more complex your data, the better it is to have an appropriate model
Using mixture models with short alignments, i.e. only ~200 columns, is tricky and this usually has not enough information to reliably estimate model parameters, so it is more useful to run mixture models on concatenated alignments.

Exercise/Homework

Note

The exercises below will take a bit longer to run, so if you are interested you can try this as homework:

Run a LG tree for all 3 alignments
Run a LG+C10+F+R tree for all 3 alignments, do you notice a difference in run time?
Run a model test with and without the C-series, does it make a difference for the shorter or longer proteins in how long the analysis takes?

Click me to see an answer

#run iqtree with different settings
for i in `cat MarkerList`; do iqtree2 -s 3_BMGE/${i}_trimmed.aln -m LG -nt 4 -B 1000 -pre 4_treefiles/${i}_LG; done

for i in `cat MarkerList`; do iqtree2 -s 3_BMGE/${i}_trimmed.aln -m LG+C10+F+R -nt 4 -B 1000 -pre 4_treefiles/${i}_LGC10FR; done

#prepare a folder to store our results in
mkdir 5_modeltest

#do a model test for the short protein and compare the standard models
iqtree2 -s 3_BMGE/arCOG00779_trimmed.aln -m MF -pre 5_modeltest/arCOG00779_model_test_MFonly 

#do a model test for the short protein and also test the C-series
iqtree2 -s 3_BMGE/arCOG00779_trimmed.aln -m MF -madd LG+C10,LG+C20,LG+C30,LG+C40,LG+C50,LG+C60,LG+C10+R+F,LG+C20+R+F,LG+C30+R+F,LG+C40+R+F,LG+C50+R+F,LG+C60+R+F --score-diff all -pre 5_modeltest/arCOG00779_model_test_all

#do a model test for the longest protein and compare the standard models
iqtree2 -s 3_BMGE/arCOG01762_trimmed.aln -m MF -pre 5_modeltest/arCOG01762_model_test_MFonly 

#do a model test for the longest protein and also test the C-series
iqtree2 -s 3_BMGE/arCOG01762_trimmed.aln -m MF -madd LG+C10,LG+C20,LG+C30,LG+C40,LG+C50,LG+C60,LG+C10+R+F,LG+C20+R+F,LG+C30+R+F,LG+C40+R+F,LG+C50+R+F,LG+C60+R+F --score-diff all -pre 5_modeltest/arCOG01762_model_test_all

3.6.3 What do tree files look like?

There are two general output files for phylogenetic trees:

1. Newick files

A newick tree would look like this (unrooted) if you would open it in nano:

(rhizobia:20, (pseudomonas:30, alteromonas:22):7, (shewanella:33, vibrio:12):40);

Or like this if it would be rooted:

(rhizobia:20, (pseudomonas:30, alteromonas:22):7):20

Note

You can see in the example above, that a the syntax of Newick files uses brackets, colons and dots. Therefore, it is recommended to avoid these symbols for your genome names/in your fasta headers.

2. Nexus files

Nexus files look slightly different and generally look like this:

 #NEXUS
 Begin data;
 Dimensions ntax=4 nchar=15;
 Format datatype=dna missing=? gap=-;
 Matrix
 Species1   {{DNA sequence|atgctagctagctcg}}
 Species2   {{DNA sequence|atgcta??tag-tag}}
 Species3   {{DNA sequence|atgttagctag-tgg}}
 Species4   {{DNA sequence|atgttagctag-tag}}           
 ;
 End;

What file format is generated depends a bit on the tool you use but there are scripts that can be found online to convert between the two file formats. For now we will work with newick files.

Exercise

Open one of the treefiles generated by IQ-TREE and find out what file format we have

Click me to see an answer

We can look at the files using head 4_treefiles/arCOG00779_JTT.treefile.

When doing this, we should see that we have a list of Genomes in between brackets indicating that we are working with a Newick file.

3.6.4 Renaming files

When opening our tree file, we see that the genome have a rather non-informative header. Often, having an easy and short header makes scripts easier to run but is not useful if we want to visualize the tree. However, if we provide a mapping file, we can replace the names with something more useful.

A mapping file should consist of two columns that are tab-separated. The first column should be the name exactly as it appears in our tree file, the second column has the name we want to replace it with. I.e. like this

The mapping file is provided in the first folder as well as the script we will use to search and then replace our names as follows:

for i in 4_treefiles/*treefile; do perl 0_Scripts_and_needed_files//Replace_tree_names.pl 0_Scripts_and_needed_files/names_to_replace ${i} > ${i}_renamed; done

#compare the output before and after
head 4_treefiles/arCOG04050_JTT.treefile
head 4_treefiles/arCOG04050_JTT.treefile_renamed

3.7 Visualizing tree files

There are several tools for this. One example is the web-based tool iTOL but for this example we have a look at the software Figtree, which is free to use, and ready to use on ada.

To open our file in Figtree we can use either the GUI interface or type

figtree 4_treefiles/arCOG00779_JTT.treefile_renamed

To view everything relevant for you:

Add the bootstrap values by going to the left hand side panel and extend the branch labels section. In there select display and here select the label,
Root the tree by extending the Trees panel and selecting Root tree (uses midpoint rooting). If we know an outgroup we can manually root using an outgroup as well
Order our tree, i.e. beautify the tree by extending the trees panel and ordering the nodes.

Additionally, with larger trees we can:

Change the color by: File –> Annotations –> load Annotations.txt an example how such a file would look like can be found in 0_Scripts_and_needed_files)

Now we should see something like this:

With Annotations.txt we could for example color the different phyla different and thus very quickly can see, whether our tree looks ok or not (i.e. do all members of the same phylum cluster together or are there signs of horizontal gene transfer).

3.8 Concatenating multiple proteins

If we want to run species trees it is recommended to concatenate several genes/proteins since this way we can increase the amount of information we have and thus better resolve deeper branches.

If you want to do this, the following requirements need to be fulfilled:

The fasta headers of our sequences must have the same name
The sequences must have the same length (this is why we will concatenate after the alignment and trimming step)
This approach is only suitable if your marker gene occurs only 1x in your genome (beware of paralogs and contaminations). If you a marker occurs more than once in your genome there are different ways around this. One way is to run a single gene tree and select the sequence that is most likely the true sequence and falls where you expect it to fall. Alternatively, you can keep the best hit. However, this is not recommended because you can not decided based on an e-value what sequences most likely is a contamination. We also have a script to help with this step and will integrate it at a later point in the tutorial.
Your proteins have the same evolutionary history, as is the case for example for ribosomal proteins

In theory if we would concatenate two proteins, we are doing something like this:

For this to work you often have to clean up the sequence header, i.e. often the header might look like this: GCA_002494525-GCA_002494525_04800 = it has the genome and the protein ID and to be able to concatenate your sequences the header should look like this: GCA_002494525

For this tutorial, the fasta header was already prepared for you. If you run this workflow by yourself, you would have to do this, so let’s briefly discuss how you could do this. Below you see an example code , how you can use the cut command to shorten fasta headers using the cut command:

#shorten header
#for sample in `cat MarkerList`; do cut -f1 -d "-" 1_Protein_Seqs/${sample}* > 1_Protein_Seqs/${sample}_clean.faa  ; done

As we said, for this tutorial, we do not need to do anything and can immediately concatenate two (or more) sequences using a script as shown below:

perl 0_Scripts_and_needed_files/catfasta2phyml.pl -f -c 3_BMGE/*.aln > 3_BMGE/Concat_seqs.aln

This will give you an output that will tell you in what order the sequences were combined and were exactly they are positioned:

Then we can run the tree on the concatenated alignment and rename the tree as we have done before:

#run tree
iqtree2 -s 3_BMGE/Concat_seqs.aln -m JTT -nt 2 -B 1000 -pre 4_treefiles/Concat_JTT

#using a better model (this runs for 15-20 min, so best run this after the workshop)
#iqtree2 -s 3_BMGE/Concat_seqs.aln -m LG+C10+F+R -nt 4 -B 1000 -pre 4_treefiles/Concat_LC_C10FR

#clean up genome names
perl 0_Scripts_and_needed_files/Replace_tree_names.pl 0_Scripts_and_needed_files/names_to_replace 4_treefiles/Concat_JTT.treefile > 4_treefiles/Concat_JTT.treefile_renamed

#perl 0_Scripts_and_needed_files/Replace_tree_names.pl 0_Scripts_and_needed_files/names_to_replace 4_treefiles/Concat_LC_C10FR.treefile > 4_treefiles/Concat_LC_C10FR.treefile_renamed

Exercise

Have a look at the tree and compare it with the other trees based on single marker genes that you have looked at before.

Since we have a tree with genomes all belonging to the same phylum the trees will look very similar but if we look closely we see that several taxa are not in their place, i.e. especially if we compare single gene trees with the concatenated gene tree.

We can also see, highlighted with the red circle, that even using a sequence alignment with 1000 amino acids (remember 16S just has ~ 1500 nucleotides) some genomes are not well resolved.

Finally, in the example above you see some sequences on very long branches, for example NIOZ134_mx_b281_2, which also can create artefacts in trees. This can be due to a lot of markers missing in this genome, contamination or that this is a fast evolving lineage.

Generally, you can improve phylogenetic trees by using better models as well as having a longer alignment. Currently, there are different marker sets that are used that include at least 14 and up to several hundred proteins.

4 Selecting marker genes to generated a species tree

You can run trees for every gene you are interested in but for species trees there are some things to watch out for:

Ideally you want to concatenate several genes with a similar history. More sequence information will help to resolve your tree better.
If you want to concatenate markers they should show only limited numbers of horizontal gene transfers
If you work with archaea and bacteria, those should be monophyletic in single gene trees. If they are not, the marker is not good for a species tree.

4.1 Commonly used marker sets:

Ribosomal marker sets, i.e. the 14 RP dataset (ribosomal proteins L2, L3, L4, L5, L6, L14, L16, L18, L22, L24, S3, S8, S10, S17 and S19) used in Hug et al 2016.
48 marker protein set described by Zaremba-Niedzwiedzka et al. 2017
GTDB marker sets for archaea and bacteria that can be found here

4.2 Personal thoughts on what marker set to use

As a general note we have to say that normally no marker set is perfect and especially for new taxa (esp. on higher levels, i.e. phylum rank) it might make sense to manually check single protein trees for issues, such as HGT. In our own work (Dombrowski N, Williams TA, Sun J et al. Undinarchaeota illuminate DPANN phylogeny and the impact of gene transfer on archaeal evolution. Nature Communications 2020;11:3939) we noticed that some commonly used marker sets can be problematic and we found markers:

Were archaea and bacteria are not monophyletic
That have signs of a lot of HGT

These effects mean that the genes do not have a common history, which can generate issues when running phylogenetic trees. Therefore, especially if you want to describe novel lineages, it often can be useful to:

Manually check single marker protein trees
Discard markers were monophyly is violated or that show signs of gene transfers
Generated a concatenated alignment only of “good” markers

For a start, we can however, recommend using the 14 RP marker set or the Zaremba marker sets that have been manually checked more extensively in previous work.