7  Working with annotation data

As mentioned in the beginning section, if working with genomes we often have large lists of proteins and their annotations, often from different databases.

Here, we might be interested in condensing this information for each genome or for each phylum etc . OR we want to merge our data with other tables that are ordered differently. We might also want to plot the data to compare the presence absence levels of different genomes. This section with discuss how to do this in a step by step manner.

7.1 Load essential packages

# | warning: false

library("ggplot2")
library("plyr")
library("dplyr")

Attaching package: 'dplyr'
The following objects are masked from 'package:plyr':

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library("grid")
library("gplots")

Attaching package: 'gplots'
The following object is masked from 'package:stats':

    lowess
library("gridExtra")

Attaching package: 'gridExtra'
The following object is masked from 'package:dplyr':

    combine
library("multcomp")
Loading required package: mvtnorm
Loading required package: survival
Loading required package: TH.data
Loading required package: MASS

Attaching package: 'MASS'
The following object is masked from 'package:dplyr':

    select

Attaching package: 'TH.data'
The following object is masked from 'package:MASS':

    geyser
library("reshape2")
library("RColorBrewer")
library('tidyr')

Attaching package: 'tidyr'
The following object is masked from 'package:reshape2':

    smiths
library('tidyverse')
-- Attaching packages --------------------------------------- tidyverse 1.3.2 --
v tibble  3.1.8     v stringr 1.4.1
v readr   2.1.2     v forcats 0.5.2
v purrr   0.3.4     
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::arrange()     masks plyr::arrange()
x gridExtra::combine() masks dplyr::combine()
x purrr::compact()     masks plyr::compact()
x dplyr::count()       masks plyr::count()
x dplyr::failwith()    masks plyr::failwith()
x dplyr::filter()      masks stats::filter()
x dplyr::id()          masks plyr::id()
x dplyr::lag()         masks stats::lag()
x dplyr::mutate()      masks plyr::mutate()
x dplyr::rename()      masks plyr::rename()
x MASS::select()       masks dplyr::select()
x dplyr::summarise()   masks plyr::summarise()
x dplyr::summarize()   masks plyr::summarize()
library(data.table)

Attaching package: 'data.table'

The following object is masked from 'package:purrr':

    transpose

The following objects are masked from 'package:reshape2':

    dcast, melt

The following objects are masked from 'package:dplyr':

    between, first, last
library(knitr)
library(kableExtra)

Attaching package: 'kableExtra'

The following object is masked from 'package:dplyr':

    group_rows

7.2 Deal with the mapping file for our genomes

7.2.1 Load mapping file

mapping file = a table that has the metadata for our genomes. In this specific example it lists our 46 DPANN genomes into clusters based on their phylogenetic placement.

We later want to summarize our data by individual bins and the phylogenetic clusters defined in the mapping file.

Notice: We can also sort our mapping file beforehand and use this order to order our tables, graph, etc…

#read in the mapping file
design <- read.table("../data/mapping.txt", sep="\t", header=T, fill=TRUE, quote = "")

#check the structure of the mapping file
head(design)

7.3 Clean mapping file

Here, we want to add an extra column that summarizes how many genomes are in each phylogenetic cluster.

First lets summarize how many bins we have for each cluster:

  • ddply() = summarize our data by counting the number of bins we have in each cluster
  • order() = next, we order our data based on the clusters with the greater number of bins
#add a new column that links the BinID and the taxonomic cluster
design$NewName2 <-paste(design$BinID, design$Cluster, sep = "-")
head(design)
#transform mapping file and summarize how many genomes we have/cluster
Number_of_taxa <- ddply(design, .(Cluster), summarize, NrGenomes = length(Cluster))

#order our data
Number_of_taxa <- Number_of_taxa[order(Number_of_taxa$NrGenomes, decreasing = TRUE),] 

#view data
head(Number_of_taxa)

Next, add this information into our original mapping file:

  • paste() = paste together different columns. With sep we can control what delimiter we want to use when combining info
#add a new column with the cluster count info into mapping file (then it can be added as a label into figures)
design_2 <- merge(design, Number_of_taxa, by = "Cluster" )

#view wether all went fine
head(design_2)

Next, let’s merge all relevant into one column using paste() again:

#generate a new column into the mapping file that links the cluster name with the number of genomes of clusters in each cluster
design_2$ClusterName <- paste(design_2$Cluster, " (", design_2$NrGenomes, ")", sep = "")

#view data
head(design_2)

7.4 Generate lists to order dataframes

The default behaviour of a lot of R functions is to order data alphabetically but this is not what we might want. I.e. we want to order by different columns or simply order by the initial order in our mapping file. To do this it is useful to have vectors that are order our bins, clusters, etc… like we want them.

In this example the mapping file was ordered like this

  • The basal group
  • Groups from an older dataset (marine and aquifer)
  • two black sea clades

Lets first check, whether our new mapping file still has that order

#check the old data
design$Cluster
 [1] "Basal"   "Basal"   "Aquifer" "Aquifer" "Aquifer" "Aquifer" "Marine" 
 [8] "Marine"  "Marine"  "Marine"  "Marine"  "Marine"  "Clade1"  "Clade1" 
[15] "Clade1"  "Clade1"  "Clade1"  "Clade1"  "Clade1"  "Clade1"  "Clade1" 
[22] "Clade1"  "Clade1"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2" 
[29] "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2" 
[36] "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2" 
[43] "Clade2"  "Clade2"  "Clade2"  "Clade2" 
#check the new data
design_2$Cluster
 [1] "Aquifer" "Aquifer" "Aquifer" "Aquifer" "Basal"   "Basal"   "Clade1" 
 [8] "Clade1"  "Clade1"  "Clade1"  "Clade1"  "Clade1"  "Clade1"  "Clade1" 
[15] "Clade1"  "Clade1"  "Clade1"  "Clade2"  "Clade2"  "Clade2"  "Clade2" 
[22] "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2" 
[29] "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2" 
[36] "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Marine"  "Marine" 
[43] "Marine"  "Marine"  "Marine"  "Marine" 

If we check the individual factors, we can see that the new dataframe is order by alphabet. Since this is not what we want, lets correct this

  • match(design_2$BinID, design$BinID) = We check what positions in design_2 match with design. Here, the first argument are the values to be matched and the second are the values to be matched against.
  • order() = we reorder our dataframe based on the results from match()
#reorder our old dataframe by the original one
design_2 <- design_2[ order(match(design_2$BinID, design$BinID)), ]

#check, if the basal clade is now in the correct order
design_2$Cluster
 [1] "Basal"   "Basal"   "Aquifer" "Aquifer" "Aquifer" "Aquifer" "Marine" 
 [8] "Marine"  "Marine"  "Marine"  "Marine"  "Marine"  "Clade1"  "Clade1" 
[15] "Clade1"  "Clade1"  "Clade1"  "Clade1"  "Clade1"  "Clade1"  "Clade1" 
[22] "Clade1"  "Clade1"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2" 
[29] "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2" 
[36] "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2" 
[43] "Clade2"  "Clade2"  "Clade2"  "Clade2" 

Now, that the basal clade is in the first position, lets make vectors of the bins and clusters in order this we use:

  • unique() = we use unique on all ClusterNames in our mapping file. That way instead of having repeated names, we only have the unique ones
  • as.character() = we make sure that our R object we generated is a character (so has words).
#make a list to order our bins
Bin_order <- as.character(unique(design_2$BinID))
Bin_order
 [1] "NIOZ132_cc_b51_2"   "NIOZ134_mx_b281_2"  "SRR2090153_bin1042"
 [4] "SRR2090153_bin461"  "SRR2090159_bin1129" "SRR2090159_bin1288"
 [7] "GCA_002494525"      "GCA_002495465"      "GCA_002501985"     
[10] "SRR4028224_bin17"   "SRR5007147_bin71"   "U_76725"           
[13] "NIOZ124_cc_b247_2"  "NIOZ125_mb_b254_2"  "NIOZ129_mb_b262_2" 
[16] "NIOZ126_mb_b401_2"  "NIOZ122_mb_b319_2"  "NIOZ134_mb_b41_2"  
[19] "NIOZ132_mb_b282_2"  "NIOZ127_mb_b161_2"  "NIOZ123_bs_b5_2"   
[22] "NIOZ121_cc_b94_2"   "NIOZ136_cc_b15_2"   "GCA_002502135"     
[25] "NIOZ134_mb_b293_2"  "NIOZ132_mb_b260_2"  "NIOZ126_mb_b137_2" 
[28] "NIOZ129_mb_b254_2"  "NIOZ127_mb_b282_2"  "NIOZ122_mb_b305_2" 
[31] "NIOZ136_mb_b335_2"  "NIOZ132_mb_b198_2"  "NIOZ136_mb_b104_2" 
[34] "NIOZ125_mb_b178_2"  "NIOZ121_mb_b48_2"   "NIOZ134_cc_b149_2" 
[37] "NIOZ134_mb_b361_2"  "NIOZ136_mb_b226_2"  "NIOZ123_bs_b392_2" 
[40] "NIOZ124_mb_b130_2"  "NIOZ132_cc_b149_2"  "NIOZ125_cc_b75_2"  
[43] "NIOZ132_mx_b314_2"  "NIOZ120_mb_b229_2"  "NIOZ126_mb_b304_2" 
[46] "NIOZ119_mb_b5_2"   
#make cluster order
Cluster_order <- as.character(unique(design_2$ClusterName))
Cluster_order
[1] "Basal (2)"   "Aquifer (4)" "Marine (6)"  "Clade1 (11)" "Clade2 (23)"

7.5 Deal with mapping files for the annotations

The mapping files (Arcog_mapping and Metabolism_file_KEGG) that are provided with this tutorial give extra information and order our KO and arCOG annotations. Genes_of_interest is a list of key metabolic genes we want to look at more closely.

7.5.1 Load and view the tables

#general mapping file for arcog IDs
Arcog_mapping <- read.table("../data/ar14_arCOGdef19.txt", sep="\t", header=T, fill=TRUE, quote = "")
head(Arcog_mapping)
#pathway mapping file
Metabolism_file_KEGG <- read.table("../data/Metabolism_Table_KO_Apr2020.txt", sep="\t", header=T, fill=TRUE, quote = "")
head(Metabolism_file_KEGG)
#load the genes of interest
Genes_of_interest <- read.table("../data/Genes_of_interest.txt", sep="\t", header=T, fill=TRUE, quote = "")
head(Arcog_mapping)

7.5.2 Make a vector to order our genes of interest

  • We use unique() to remove any duplicate gene names
  • arrange() can be used to order our dataframe by more than one column.

You notice here that the syntax for arrange() is a bit unusual and we use the %>% symbol (a so-called forward pipe operator). This symbol is commonly used in the dplyr and tidyr packages which are extremely useful to summarize data. This function passes the left hand side of the operator to the first argument of the right hand side of the operator. In the following example, the data frame Genes_of_interest gets passed to arrange()

#order metabolic genes
Genes_Metabolism_order_temp <- Genes_of_interest %>% arrange(Order, Order2)

#make a unique vector for our genes of interest
Genes_Metabolism_order <- as.character(unique(Genes_Metabolism_order_temp$Gene))
Genes_Metabolism_order
  [1] "HmgB"                     "HmgA"                    
  [3] "MvK"                      "M3k"                     
  [5] "M3P5K"                    "mvaD"                    
  [7] "Ipk"                      "GFPS"                    
  [9] "EgsA"                     "GGGP_synthase"           
 [11] "UbiA"                     "GGR"                     
 [13] "CarS"                     "PgsA"                    
 [15] "ASS"                      "Asd"                     
 [17] "Pol-B1"                   "Pol-B2"                  
 [19] "Pol-B3"                   "PolB-casposon-associated"
 [21] "PolD_small"               "PolD_large"              
 [23] "Pol4"                     "PolX"                    
 [25] "Orc1"                     "SSB/RPA1"                
 [27] "RPA2"                     "RPA3"                    
 [29] "GinS"                     "PriS"                    
 [31] "PriL"                     "DnaG"                    
 [33] "Lig "                     "LigN"                    
 [35] "Pcn1"                     "RfcS"                    
 [37] "RfcL"                     "FEN1"                    
 [39] "RnhA"                     "RnhB"                    
 [41] "MCM2"                     "MPH1"                    
 [43] "BRR2"                     "MUS81"                   
 [45] "SrmB"                     "Dna2"                    
 [47] "TopA"                     "GyrA"                    
 [49] "GyrB"                     "Rgy"                     
 [51] "TopB"                     "Top6A"                   
 [53] "Top6B"                    "RecJ/Cdc45"              
 [55] "SbcC"                     "SbcD"                    
 [57] "NurA"                     "Nth"                     
 [59] "Nfo"                      "Nfi"                     
 [61] "NucS"                     "XthA"                    
 [63] "DinG"                     "SSO0112"                 
 [65] "HerA"                     "RadA"                    
 [67] "RadB"                     "Udg1 "                   
 [69] "AlkD"                     "Ogt"                     
 [71] "Ogg"                      "hjc"                     
 [73] "UvrA"                     "UvrB"                    
 [75] "UvrC"                     "MthTI"                   
 [77] "MthZI"                    "MjaII"                   
 [79] "MjaIII"                   "MjaV"                    
 [81] "Dcm"                      "DNA_met"                 
 [83] "HmfAB"                    "HmvA"                    
 [85] "ScpA"                     "ScpB"                    
 [87] "MC1"                      "Smc"                     
 [89] "Alba1"                    "ELP3"                    
 [91] "FtsZ"                     "MinD"                    
 [93] "Cren-1"                   "actin_like"              
 [95] "Rkd1"                     "Rkd2"                    
 [97] "CetZ1"                    "CdvA"                    
 [99] "CdvB"                     "Vps4"                    
[101] "RpoC/Rpo3"                "RpoC/Rpo11"              
[103] "RpoB/Rpo2"                "RpoA/Rpo1"               
[105] "Rpo6/RpoZ"                "RPB11"                   
[107] "Rpo4"                     "RPB5"                    
[109] "RPB7"                     "RPB8"                    
[111] "RPB10"                    "RPC12"                   
[113] "Rpo13"                    "TBP"                     
[115] "TFB"                      "TFE"                     
[117] "TFS"                      "LysM"                    
[119] "Spt4"                     "NusG"                    
[121] "NusA"                     "EGD2"                    
[123] "L1"                       "L2"                      
[125] "L3"                       "L4"                      
[127] "L5"                       "L6P"                     
[129] "L7AE"                     "L10"                     
[131] "L10AE/L16"                "L11"                     
[133] "L12E/L44/L45/RPP1/RPP2"   "L13"                     
[135] "L13E"                     "L14"                     
[137] "L14E/L6E/L27E"            "L15"                     
[139] "L15E"                     "L18"                     
[141] "L18E"                     "L19E"                    
[143] "L20A"                     "L21E"                    
[145] "L22"                      "L23"                     
[147] "L24"                      "L24E"                    
[149] "L29"                      "L30"                     
[151] "L30E"                     "L31E"                    
[153] "L32E"                     "L34E"                    
[155] "L35AE/L33A"               "L37AE/L43A"              
[157] "L37E"                     "L38E"                    
[159] "L39E"                     "L40E"                    
[161] "L41E"                     "L44E"                    
[163] "S2"                       "S3"                      
[165] "S3AE"                     "S4"                      
[167] "S4E"                      "S5"                      
[169] "S6E/S10"                  "S7"                      
[171] "S8"                       "S8E"                     
[173] "S9"                       "S10"                     
[175] "S11"                      "S12"                     
[177] "S13"                      "S14"                     
[179] "S15P"                     "S17"                     
[181] "S17E"                     "S19"                     
[183] "S19E"                     "S24E"                    
[185] "S25"                      "S26"                     
[187] "S27AE"                    "S27E"                    
[189] "S28E/S33"                 "S30"                     
[191] "PelA"                     "RsgA"                    
[193] "HflX"                     "AlaS"                    
[195] "ArgS"                     "AsnS"                    
[197] "CysS"                     "GlnS"                    
[199] "GRS1"                     "HisS"                    
[201] "IleS"                     "LeuS"                    
[203] "LysS"                     "LysU"                    
[205] "MetG"                     "PheS"                    
[207] "SerS"                     "ProS"                    
[209] "ThrS"                     "TrpS"                    
[211] "TyrS"                     "ValS"                    
[213] "EIF1A"                    "InfB"                    
[215] "SUI1"                     "EIF2A"                   
[217] "EIF2B "                   "EIF2B-like"              
[219] "EIF2G"                    "EIF4A"                   
[221] "EIF5a"                    "EIF6"                    
[223] "Tuf"                      "EFB1"                    
[225] "FusA"                     "eRF1"                    
[227] "Rli1"                     "Csl4"                    
[229] "Rrp4"                     "Rrp41"                   
[231] "Rrp42"                    "Dph2"                    
[233] "Dph5"                     "Dph6"                    
[235] "Trm5"                     "TYW1"                    
[237] "TYW2"                     "TYW3"                    
[239] "EndA"                     "Fau1"                    
[241] "LigT"                     "Nob1"                    
[243] "RtcB"                     "Archaease"               
[245] "RCL1"                     "RnjA"                    
[247] "Rnz"                      "Rnp1"                    
[249] "Rnp2"                     "Rnp3"                    
[251] "Rnp4"                     "TgtA"                    
[253] "DnaJ"                     "DnaK"                    
[255] "GrpE"                     "GroL"                    
[257] "HtpX"                     "Hsp16"                   
[259] "Hsp19"                    "PsmA"                    
[261] "PRE1"                     "Pan1"                    
[263] "Pcm"                      "ThsA"                    
#define a order for metabolic pathways
Pathway_order <- as.character(unique(Genes_of_interest$pathway_2))
Pathway_order
 [1] "DNA_Polymerase"       "RFs"                  "Helicase"            
 [4] "Topoisomerase"        "Repair_Recombination" "Methylase"           
 [7] "Chromatin"            "Cell_cycle"           "RNA_Polymerase"      
[10] "TFs"                  "Transcription"        "Ribosome"            
[13] "tRNA_synthetase"      "Exosome"              "Diphthamide_BS"      
[16] "Wybutosine_BS"        "Translation"          "Posttranslation"     
[19] "Mevalonate"           "Lipids"              

7.6 Deal with annotation file

7.6.1 Read in table

You already here notice that this takes a bit longer and we just work with 46 bins. This is a good reason to keep python in mind as depending on your computer the more memory heavy operations might get challenging. Another alternative would be to run R on the clusters.

#read in data and view it
Input <- read.table("../data/UAP2_Annotation_table_u.txt", sep="\t", header=T, fill=TRUE, quote = "")
head(Input)

7.6.3 Parse table to make it easier to work with it

Here we:

  • Subset the data for the columns we are interested in. Esp. for larger dataframes this will make the operations a bit quicker. For very large dataframes, i.e. 5000 genomes, it might be better to switch to python
  • Convert data from wide to long format
  • Clean factors. After subsetting often factors are not removed, we clean them up in that step

Info:

Converting a wide to a long dataframe

  • Wide dataframe: The Input data in this example is considered as a wide dataframe. I.e. all the gene IDs we are interested in are spread out into different columns
  • Long dataframe: The gene IDs we are interested in are found all in the same column. Important, most R functions work with long dataframes.
#print the column names to subset our datatable
colnames(Input)
 [1] "accession"          "BinID"              "TaxString"         
 [4] "NewContigID"        "OldContigId"        "ContigIdMerge"     
 [7] "ContigNewLength"    "GC"                 "ProteinID"         
[10] "ProteinGC"          "ProteinLength"      "Prokka"            
[13] "arcogs"             "arcogs_geneID"      "arcogs_Description"
[16] "Pathway"            "arcogs_evalue"      "KO_hmm"            
[19] "e_value"            "bit_score"          "bit_score_cutoff"  
[22] "Definition"         "confidence"         "PFAM_hmm"          
[25] "PFAM_description"   "Pfam_Evalue"        "TIRGR"             
[28] "TIGR_description"   "EC"                 "TIGR_Evalue"       
[31] "CAZy"               "CAZy_evalue"        "Description"       
[34] "TDBD_ID"            "TPDB_evalue"        "HydDB"             
[37] "Description.1"      "HydDB_evalue"       "PFAM"              
[40] "PFAMdescription"    "IPR"                "IPRdescription"    
[43] "TopHit"             "E_value"            "PecID"             
[46] "TaxID"              "TaxString.1"       
#only keep the columns we actually want to work with
Input_subset = Input[,c('BinID','accession','arcogs','KO_hmm','PFAM_hmm','TIRGR','CAZy','Description.1' )]
kable((head(Input_subset)), format='markdown')
BinID accession arcogs KO_hmm PFAM_hmm TIRGR CAZy Description.1
NIOZ119_mb_b5_2 NIOZ119_mb_b5_2-PJFDGLDN_00010 arCOG00570 K17830 PF01494 TIGR02032 - -
NIOZ119_mb_b5_2 NIOZ119_mb_b5_2-PJFDGLDN_00020 - - - - - -
NIOZ119_mb_b5_2 NIOZ119_mb_b5_2-PJFDGLDN_00030 arCOG01358 K15865 PF04055 TIGR01578 - -
NIOZ119_mb_b5_2 NIOZ119_mb_b5_2-PJFDGLDN_00040 arCOG01169 K01689 PF00113 TIGR01060 - -
NIOZ119_mb_b5_2 NIOZ119_mb_b5_2-PJFDGLDN_00050 arCOG04245 K02998 PF00318 TIGR01012 - -
NIOZ119_mb_b5_2 NIOZ119_mb_b5_2-PJFDGLDN_00060 arCOG01728 K06990 PF01875 TIGR04336 - -
#convert dataframe from wide to long
Input_long <- reshape2::melt(Input_subset,  id=c("accession","BinID"))

#give informative headers
colnames(Input_long) <- c("accession", "BinID", "DB", "gene")

#clean factors, to remove issues when counting
Input_long$gene <- as.factor(Input_long$gene)
kable((head(Input_long)), format='markdown')
accession BinID DB gene
NIOZ119_mb_b5_2-PJFDGLDN_00010 NIOZ119_mb_b5_2 arcogs arCOG00570
NIOZ119_mb_b5_2-PJFDGLDN_00020 NIOZ119_mb_b5_2 arcogs -
NIOZ119_mb_b5_2-PJFDGLDN_00030 NIOZ119_mb_b5_2 arcogs arCOG01358
NIOZ119_mb_b5_2-PJFDGLDN_00040 NIOZ119_mb_b5_2 arcogs arCOG01169
NIOZ119_mb_b5_2-PJFDGLDN_00050 NIOZ119_mb_b5_2 arcogs arCOG04245
NIOZ119_mb_b5_2-PJFDGLDN_00060 NIOZ119_mb_b5_2 arcogs arCOG01728

7.7 Make count tables

7.7.1 Generate a count table for our genomes of interest

Now we want to count, how often does a genome (i.e. NIOZ134_mb_b41_2) have a gene. I.e. how often do we want arCOG00570, arCOG01358, …

7.7.1.0.1 Do this via a loop (not executed, just an example)

Notice:

Since we run this chunk with , eval = FALSE we can still see the code but it is not executed. This is done because some computations take some time, which we do not want to spend, but I still want to show the code to give some alternative examples.

#count the number of proteins for each genome of interest
y <- c()
for (i in Bin_order) {
  x <-  table(subset(Input_long, BinID %in% paste(i))$gene)
  y <- cbind (y,x)
}

#clean-up the table
Counts_Table_loop <- y
colnames(Counts_Table_loop) <- Bin_order
Counts_Table_loop <- as.data.frame(Counts_Table_loop)
kable((head(Counts_Table_loop)), format='markdown')

#the '-' (=not identified genes) is also counted and listed in the first column and removed at this step
Counts_Table_loop <- Counts_Table_loop[-1,]
kable((head(Counts_Table_loop)), format='markdown')
7.7.1.0.2 Do this via ddply ((not executed, just an example))

New functions:

  • spread() = converts our long to a wide dataframe by using the BinIDs as new column names, the count table as values to populate our dataframe and with missing values we print a 0.
#count data and clean header
Counts_Table_long <- ddply(Input_long, .(BinID, gene), summarize, GeneCount = length(gene))
colnames(Counts_Table_long) <- c("BinID", "geneID", "count")
kable((head(Counts_Table_long)), format='markdown')

#transform to wide format, with fill = 0 instead of a NA we add a 0
Counts_Table_wide <- spread(Counts_Table_long, BinID, count, fill = 0 )

#view data
kable((head(Counts_Table_wide)), format='markdown')
7.7.1.0.3 Do this via tidyr (usually a bit faster than ddplyr, which is why we use this way)

Here, we use the %>% symbol again: In the following example, the a subset of the Input_long data (only 3 columns, not the whole dataframe) gets passed to count()

New functions:

  • count() = A function of the dplyr package. Here, we count the unique protein IDs grouped by BinID and gene (i.e. roughly equivalent to the columns we want to keep)
#count data and clean header
Counts_Table_long <- Input_long[,c('accession', 'BinID','gene')] %>% count(BinID, gene, sort = FALSE)
colnames(Counts_Table_long) <- c("BinID", "geneID", "count")
kable((head(Counts_Table_long)), format='markdown')
BinID geneID count
GCA_002494525 - 2350
GCA_002494525 GT2_Glyco_tranf_2_3 1
GCA_002494525 GT2_Glycos_transf_2 2
GCA_002494525 GT4 2
GCA_002494525 GT66 1
GCA_002494525 GT83 1

When viewing the data we also see that proteins with no annotations are counted (the minus symbol), since we do not care about this at this stage, lets remove everything with a minus symbol

#delete rows with a minus symbol
Counts_Table_long <- Counts_Table_long[Counts_Table_long$geneID!= "-", ]

#clean factors
Counts_Table_long$geneID <- factor(Counts_Table_long$geneID)

#view data
kable((head(Counts_Table_long)), format='markdown')
BinID geneID count
2 GCA_002494525 GT2_Glyco_tranf_2_3 1
3 GCA_002494525 GT2_Glycos_transf_2 2
4 GCA_002494525 GT4 2
5 GCA_002494525 GT66 1
6 GCA_002494525 GT83 1
7 GCA_002494525 K00012 1

Now, we can convert the long to a wide table, since this format is a bit easier to read in excel later.

#transform to wide format, with fill = 0 instead of a NA we add a 0
Counts_Table_wide <- spread(Counts_Table_long, BinID, count, fill = 0 )
kable((head(Counts_Table_wide)), format='markdown')
geneID GCA_002494525 GCA_002495465 GCA_002501985 GCA_002502135 NIOZ119_mb_b5_2 NIOZ120_mb_b229_2 NIOZ121_cc_b94_2 NIOZ121_mb_b48_2 NIOZ122_mb_b305_2 NIOZ122_mb_b319_2 NIOZ123_bs_b392_2 NIOZ123_bs_b5_2 NIOZ124_cc_b247_2 NIOZ124_mb_b130_2 NIOZ125_cc_b75_2 NIOZ125_mb_b178_2 NIOZ125_mb_b254_2 NIOZ126_mb_b137_2 NIOZ126_mb_b304_2 NIOZ126_mb_b401_2 NIOZ127_mb_b161_2 NIOZ127_mb_b282_2 NIOZ129_mb_b254_2 NIOZ129_mb_b262_2 NIOZ132_cc_b149_2 NIOZ132_cc_b51_2 NIOZ132_mb_b198_2 NIOZ132_mb_b260_2 NIOZ132_mb_b282_2 NIOZ132_mx_b314_2 NIOZ134_cc_b149_2 NIOZ134_mb_b293_2 NIOZ134_mb_b361_2 NIOZ134_mb_b41_2 NIOZ134_mx_b281_2 NIOZ136_cc_b15_2 NIOZ136_mb_b104_2 NIOZ136_mb_b226_2 NIOZ136_mb_b335_2 SRR2090153_bin1042 SRR2090153_bin461 SRR2090159_bin1129 SRR2090159_bin1288 SRR4028224_bin17 SRR5007147_bin71 U_76725
CBM44 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
CE4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0
GH119 0 0 0 1 1 1 0 1 0 0 1 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0
GH130 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
GH95 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
GT2_Glyco_tranf_2_3 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1

Also, we want our geneIDs to be the new rownames and we do this by using the rownames() functions. We do this since some functions do not like to have characters in their dataframe.

#change the rownames
rownames(Counts_Table_wide) <- Counts_Table_wide$geneID

#view data
kable((head(Counts_Table_wide)), format='markdown')
geneID GCA_002494525 GCA_002495465 GCA_002501985 GCA_002502135 NIOZ119_mb_b5_2 NIOZ120_mb_b229_2 NIOZ121_cc_b94_2 NIOZ121_mb_b48_2 NIOZ122_mb_b305_2 NIOZ122_mb_b319_2 NIOZ123_bs_b392_2 NIOZ123_bs_b5_2 NIOZ124_cc_b247_2 NIOZ124_mb_b130_2 NIOZ125_cc_b75_2 NIOZ125_mb_b178_2 NIOZ125_mb_b254_2 NIOZ126_mb_b137_2 NIOZ126_mb_b304_2 NIOZ126_mb_b401_2 NIOZ127_mb_b161_2 NIOZ127_mb_b282_2 NIOZ129_mb_b254_2 NIOZ129_mb_b262_2 NIOZ132_cc_b149_2 NIOZ132_cc_b51_2 NIOZ132_mb_b198_2 NIOZ132_mb_b260_2 NIOZ132_mb_b282_2 NIOZ132_mx_b314_2 NIOZ134_cc_b149_2 NIOZ134_mb_b293_2 NIOZ134_mb_b361_2 NIOZ134_mb_b41_2 NIOZ134_mx_b281_2 NIOZ136_cc_b15_2 NIOZ136_mb_b104_2 NIOZ136_mb_b226_2 NIOZ136_mb_b335_2 SRR2090153_bin1042 SRR2090153_bin461 SRR2090159_bin1129 SRR2090159_bin1288 SRR4028224_bin17 SRR5007147_bin71 U_76725
CBM44 CBM44 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
CE4 CE4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0
GH119 GH119 0 0 0 1 1 1 0 1 0 0 1 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0
GH130 GH130 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
GH95 GH95 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
GT2_Glyco_tranf_2_3 GT2_Glyco_tranf_2_3 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1

When we change the rownames and view the data, we see that the geneID is now both in the rownames as well as the first column. Since that is a bit messy, we next remove the first column.

#delete the first column
Counts_Table_wide <- Counts_Table_wide[,-1]
kable((head(Counts_Table_wide)), format='markdown')
GCA_002494525 GCA_002495465 GCA_002501985 GCA_002502135 NIOZ119_mb_b5_2 NIOZ120_mb_b229_2 NIOZ121_cc_b94_2 NIOZ121_mb_b48_2 NIOZ122_mb_b305_2 NIOZ122_mb_b319_2 NIOZ123_bs_b392_2 NIOZ123_bs_b5_2 NIOZ124_cc_b247_2 NIOZ124_mb_b130_2 NIOZ125_cc_b75_2 NIOZ125_mb_b178_2 NIOZ125_mb_b254_2 NIOZ126_mb_b137_2 NIOZ126_mb_b304_2 NIOZ126_mb_b401_2 NIOZ127_mb_b161_2 NIOZ127_mb_b282_2 NIOZ129_mb_b254_2 NIOZ129_mb_b262_2 NIOZ132_cc_b149_2 NIOZ132_cc_b51_2 NIOZ132_mb_b198_2 NIOZ132_mb_b260_2 NIOZ132_mb_b282_2 NIOZ132_mx_b314_2 NIOZ134_cc_b149_2 NIOZ134_mb_b293_2 NIOZ134_mb_b361_2 NIOZ134_mb_b41_2 NIOZ134_mx_b281_2 NIOZ136_cc_b15_2 NIOZ136_mb_b104_2 NIOZ136_mb_b226_2 NIOZ136_mb_b335_2 SRR2090153_bin1042 SRR2090153_bin461 SRR2090159_bin1129 SRR2090159_bin1288 SRR4028224_bin17 SRR5007147_bin71 U_76725
CBM44 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
CE4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0
GH119 0 0 0 1 1 1 0 1 0 0 1 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0
GH130 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
GH95 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
GT2_Glyco_tranf_2_3 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1
#order our data so that the bins start first with the bins from the basal group
Counts_Table_wide <- Counts_Table_wide[,Bin_order]

#view data
kable((head(Counts_Table_wide)), format='markdown')
NIOZ132_cc_b51_2 NIOZ134_mx_b281_2 SRR2090153_bin1042 SRR2090153_bin461 SRR2090159_bin1129 SRR2090159_bin1288 GCA_002494525 GCA_002495465 GCA_002501985 SRR4028224_bin17 SRR5007147_bin71 U_76725 NIOZ124_cc_b247_2 NIOZ125_mb_b254_2 NIOZ129_mb_b262_2 NIOZ126_mb_b401_2 NIOZ122_mb_b319_2 NIOZ134_mb_b41_2 NIOZ132_mb_b282_2 NIOZ127_mb_b161_2 NIOZ123_bs_b5_2 NIOZ121_cc_b94_2 NIOZ136_cc_b15_2 GCA_002502135 NIOZ134_mb_b293_2 NIOZ132_mb_b260_2 NIOZ126_mb_b137_2 NIOZ129_mb_b254_2 NIOZ127_mb_b282_2 NIOZ122_mb_b305_2 NIOZ136_mb_b335_2 NIOZ132_mb_b198_2 NIOZ136_mb_b104_2 NIOZ125_mb_b178_2 NIOZ121_mb_b48_2 NIOZ134_cc_b149_2 NIOZ134_mb_b361_2 NIOZ136_mb_b226_2 NIOZ123_bs_b392_2 NIOZ124_mb_b130_2 NIOZ132_cc_b149_2 NIOZ125_cc_b75_2 NIOZ132_mx_b314_2 NIOZ120_mb_b229_2 NIOZ126_mb_b304_2 NIOZ119_mb_b5_2
CBM44 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
CE4 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
GH119 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 1 0 0 1 1 0 0 1 1 0 0 1 1 1 0 1
GH130 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
GH95 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
GT2_Glyco_tranf_2_3 0 0 0 0 0 0 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

If you run these three examples yourself, take not how different the speed is.

7.7.2 Generate a count table for our clusters of interest

Same as above, but now we want to know for our 4 aquifer genomes, how many have Gene Xx and show this as percent. I.e. if 1/4 genomes have a gene, then 25% have it.

First, lets merge in our taxa info into our count table, we need this to summarize our data by clusters.

#merge the count table with mapping file to add in the taxa info (might take a while depending on size)
Counts_Table_long_Tax <- merge(Counts_Table_long, design_2[,c("BinID", "ClusterName", "NrGenomes")], by = "BinID")
kable((head(Counts_Table_long_Tax)), format='markdown')
BinID geneID count ClusterName NrGenomes
GCA_002494525 GT2_Glyco_tranf_2_3 1 Marine (6) 6
GCA_002494525 GT2_Glycos_transf_2 2 Marine (6) 6
GCA_002494525 GT4 2 Marine (6) 6
GCA_002494525 GT66 1 Marine (6) 6
GCA_002494525 GT83 1 Marine (6) 6
GCA_002494525 K00012 1 Marine (6) 6

Next, whenever we have a value higher than one, we replace it with 1. That way we deal with our data like it is a presence/absence data. I.e. 0 = no genes present & 1 = gene present

#convert counts to presence/absence matrix (just using 0/1) (this is needed to calculate the percentage across clusters)
Counts_Table_long_Tax$count[Counts_Table_long_Tax$count > 1] <- 1
kable((head(Counts_Table_long_Tax)), format='markdown')
BinID geneID count ClusterName NrGenomes
GCA_002494525 GT2_Glyco_tranf_2_3 1 Marine (6) 6
GCA_002494525 GT2_Glycos_transf_2 1 Marine (6) 6
GCA_002494525 GT4 1 Marine (6) 6
GCA_002494525 GT66 1 Marine (6) 6
GCA_002494525 GT83 1 Marine (6) 6
GCA_002494525 K00012 1 Marine (6) 6

Now, we can use tidyr to count of how many genomes in a cluster have a gene.

#count data and clean header
Counts_Table_long_Tax_sum <- Counts_Table_long_Tax[,c('ClusterName', 'geneID','NrGenomes', 'count')] %>% count(ClusterName, geneID, NrGenomes, sort = FALSE)
colnames(Counts_Table_long_Tax_sum) <- c("ClusterName", "geneID", "NrGenomes", "quantity")
kable((head(Counts_Table_long_Tax_sum)), format='markdown')
ClusterName geneID NrGenomes quantity
Aquifer (4) GT2_Glycos_transf_2 4 4
Aquifer (4) GT4 4 3
Aquifer (4) GT66 4 2
Aquifer (4) GT83 4 3
Aquifer (4) K00003 4 2
Aquifer (4) K00008 4 3

Next, we want to calculate the percentage. I.e. in the first example, we have 4 aquifer genomes, two of which [FeFe]_Group_C3 (=50%).

#calculate of the percentage to answer of the total genomes per cluster how many have a certain gene
#notice: if running for your own data check here that your percentage makes sense. I.e. we do not want values above 100
Counts_Table_long_Tax_sum$percentage <- round(Counts_Table_long_Tax_sum$quantity/Counts_Table_long_Tax_sum$NrGenomes*100, digits = 0)
kable((head(Counts_Table_long_Tax_sum)), format='markdown')
ClusterName geneID NrGenomes quantity percentage
Aquifer (4) GT2_Glycos_transf_2 4 4 100
Aquifer (4) GT4 4 3 75
Aquifer (4) GT66 4 2 50
Aquifer (4) GT83 4 3 75
Aquifer (4) K00003 4 2 50
Aquifer (4) K00008 4 3 75

For printing this, we want to convert our long to a wide table by using spread(). Also, we want our geneIDs to be the new rownames and we do this by using the rownames() functions. We do this since some functions do not like to have characters in their dataframe.

#convert long to wide format and clean table (i.e. place the rownames)
Counts_Table_long_Tax_sum_wide <- spread(Counts_Table_long_Tax_sum[,c("geneID", "ClusterName", "percentage")], ClusterName, percentage)

#change the rownames
rownames(Counts_Table_long_Tax_sum_wide) <- Counts_Table_long_Tax_sum_wide$geneID

#view data
kable((head(Counts_Table_long_Tax_sum_wide)), format='markdown')
geneID Aquifer (4) Basal (2) Clade1 (11) Clade2 (23) Marine (6)
CBM44 CBM44 NA NA 18 NA NA
CE4 CE4 NA 50 NA NA NA
GH119 GH119 NA NA NA 52 NA
GH130 GH130 NA 50 NA NA NA
GH95 GH95 NA 50 NA NA NA
GT2_Glyco_tranf_2_3 GT2_Glyco_tranf_2_3 NA NA NA NA 83

When we change the rownames and view the data, we see that the geneID is now both in the rownames as well as the first column. Since that is a bit messy, we next remove the first column.

#delete the first column
Counts_Table_long_Tax_sum_wide <- Counts_Table_long_Tax_sum_wide[,-1]
kable((head(Counts_Table_long_Tax_sum_wide)), format='markdown')
Aquifer (4) Basal (2) Clade1 (11) Clade2 (23) Marine (6)
CBM44 NA NA 18 NA NA
CE4 NA 50 NA NA NA
GH119 NA NA NA 52 NA
GH130 NA 50 NA NA NA
GH95 NA 50 NA NA NA
GT2_Glyco_tranf_2_3 NA NA NA NA 83

Now we see that we have NAs for genes that are not present in some of our clades. If we want to do more math then NAs are not helpful and we instead want to have a 0 there instead.

#replace NAs with 0
Counts_Table_long_Tax_sum_wide[is.na(Counts_Table_long_Tax_sum_wide)] <- 0
kable((head(Counts_Table_long_Tax_sum_wide)), format='markdown')
Aquifer (4) Basal (2) Clade1 (11) Clade2 (23) Marine (6)
CBM44 0 0 18 0 0
CE4 0 50 0 0 0
GH119 0 0 0 52 0
GH130 0 50 0 0 0
GH95 0 50 0 0 0
GT2_Glyco_tranf_2_3 0 0 0 0 83

Finally, we want to sort our data, starting with the Basal clade and ending with the Black Sea Clades

#sort by cluster order (defined by the order of the mapping file)
Counts_Table_long_Tax_sum_wide <- Counts_Table_long_Tax_sum_wide[,Cluster_order]
kable((head(Counts_Table_long_Tax_sum_wide)), format='markdown')
Basal (2) Aquifer (4) Marine (6) Clade1 (11) Clade2 (23)
CBM44 0 0 0 18 0
CE4 50 0 0 0 0
GH119 0 0 0 0 52
GH130 50 0 0 0 0
GH95 50 0 0 0 0
GT2_Glyco_tranf_2_3 0 0 83 0 0

7.8 Merge our tables with the mapping data we have

Now, that we have our count tables both for the bins as well as for all the clusters, we now want to add some gene description and subset the data based on different categores.

7.8.1 For the bins

7.8.1.1 Add gene descriptions

Remember above, we made a list of descriptions that links all geneIDs with what is behind all the gene IDs? Now we want to add this info back in in order to print all the counts.

#merge
Counts_Table_final <- merge(All_Genes_Description, Counts_Table_wide, by.x="Gene", by.y="row.names", all.x = T, sort = F)
kable((head(Counts_Table_final)), format='markdown')
Gene Description NIOZ132_cc_b51_2 NIOZ134_mx_b281_2 SRR2090153_bin1042 SRR2090153_bin461 SRR2090159_bin1129 SRR2090159_bin1288 GCA_002494525 GCA_002495465 GCA_002501985 SRR4028224_bin17 SRR5007147_bin71 U_76725 NIOZ124_cc_b247_2 NIOZ125_mb_b254_2 NIOZ129_mb_b262_2 NIOZ126_mb_b401_2 NIOZ122_mb_b319_2 NIOZ134_mb_b41_2 NIOZ132_mb_b282_2 NIOZ127_mb_b161_2 NIOZ123_bs_b5_2 NIOZ121_cc_b94_2 NIOZ136_cc_b15_2 GCA_002502135 NIOZ134_mb_b293_2 NIOZ132_mb_b260_2 NIOZ126_mb_b137_2 NIOZ129_mb_b254_2 NIOZ127_mb_b282_2 NIOZ122_mb_b305_2 NIOZ136_mb_b335_2 NIOZ132_mb_b198_2 NIOZ136_mb_b104_2 NIOZ125_mb_b178_2 NIOZ121_mb_b48_2 NIOZ134_cc_b149_2 NIOZ134_mb_b361_2 NIOZ136_mb_b226_2 NIOZ123_bs_b392_2 NIOZ124_mb_b130_2 NIOZ132_cc_b149_2 NIOZ125_cc_b75_2 NIOZ132_mx_b314_2 NIOZ120_mb_b229_2 NIOZ126_mb_b304_2 NIOZ119_mb_b5_2
arCOG00570 “Geranylgeranyl_reductase,_flavoprotein” 2 4 3 1 3 2 1 1 1 1 1 1 2 3 3 3 2 3 3 3 3 2 1 1 2 1 1 0 1 1 1 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1
arCOG01358 2-methylthioadenine_synthetase 1 2 0 1 0 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 0 1 1 1 1 1 1 1
arCOG01169 Enolase 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1
arCOG04245 Ribosomal_protein_S2 1 0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1
arCOG01728 “Predicted_class_III_extradiol_dioxygenase,_MEMO1_family” 2 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 2 1 1 0 0 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1
arCOG00546 mRNA_degradation_ribonuclease_J1/J2 1 0 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 0 0 1 0 2 0 1 0 1 1 0 1 1 0 1
#print (and beautify elsewhere)
write.table(Counts_Table_final, "../output_examples/Counts_Table_final.txt",  sep = "\t", quote = F, row.names = T, na = "")

7.8.1.2 Merging with the arcog_table

#merge
Arcog_Data <- merge(Arcog_mapping, Counts_Table_wide, by.x="arcog", by.y="row.names", all.x = T, sort = F)
kable((head(Arcog_Data)), format='markdown')
arcog Pathway GeneID Gene COG PFAM CD TIGR NIOZ132_cc_b51_2 NIOZ134_mx_b281_2 SRR2090153_bin1042 SRR2090153_bin461 SRR2090159_bin1129 SRR2090159_bin1288 GCA_002494525 GCA_002495465 GCA_002501985 SRR4028224_bin17 SRR5007147_bin71 U_76725 NIOZ124_cc_b247_2 NIOZ125_mb_b254_2 NIOZ129_mb_b262_2 NIOZ126_mb_b401_2 NIOZ122_mb_b319_2 NIOZ134_mb_b41_2 NIOZ132_mb_b282_2 NIOZ127_mb_b161_2 NIOZ123_bs_b5_2 NIOZ121_cc_b94_2 NIOZ136_cc_b15_2 GCA_002502135 NIOZ134_mb_b293_2 NIOZ132_mb_b260_2 NIOZ126_mb_b137_2 NIOZ129_mb_b254_2 NIOZ127_mb_b282_2 NIOZ122_mb_b305_2 NIOZ136_mb_b335_2 NIOZ132_mb_b198_2 NIOZ136_mb_b104_2 NIOZ125_mb_b178_2 NIOZ121_mb_b48_2 NIOZ134_cc_b149_2 NIOZ134_mb_b361_2 NIOZ136_mb_b226_2 NIOZ123_bs_b392_2 NIOZ124_mb_b130_2 NIOZ132_cc_b149_2 NIOZ125_cc_b75_2 NIOZ132_mx_b314_2 NIOZ120_mb_b229_2 NIOZ126_mb_b304_2 NIOZ119_mb_b5_2
arCOG00009 E PotE Amino acid transporter COG00531 pfam00324 TIGR00909 2 4 1 1 1 1 2 1 2 1 1 1 3 3 2 1 3 3 3 3 2 1 0 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 0 1 1 0
arCOG00014 G RbsK “Sugar kinase, ribokinase family” COG00524 pfam00294 cd01174 TIGR02152 1 2 1 2 1 2 2 2 2 2 2 2 2 3 2 2 2 3 2 2 2 2 2 3 1 2 2 3 2 4 2 2 2 2 3 2 2 3 2 4 1 2 2 2 2 1
arCOG00017 R - Predicted transcriptional regulator COG02522 1 1 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
arCOG00018 F Nnr2 “NAD(P)H-hydrate repair enzyme Nnr, NAD(P)H-hydrate dehydratase domain” COG00063 “pfam03853,pfam01256” cd01171 “TIGR00197,TIGR00196” 1 1 1 0 1 0 1 1 1 2 2 0 1 1 1 1 1 1 0 2 1 2 1 2 1 1 1 1 1 1 1 1 0 1 1 2 0 2 1 1 1 0 0 1 0 0
arCOG00029 F PyrE Orotate phosphoribosyltransferase COG00461 pfam00156 cd06223 TIGR00336 1 0 1 0 1 0 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1
arCOG00030 F Apt Adenine/guanine phosphoribosyltransferase or related PRPP-binding protein COG00503 pfam00156 cd06223 TIGR01090 0 1 2 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 2 2 1 2 1 2 2 2 2 1 2 0 2 2 1 0 1 1 0 1
#print (and beautify elsewhere)
write.table(Arcog_Data, "../output_examples/ArCOG_Data.txt",  sep = "\t", quote = F, row.names = T, na = "")

7.8.1.3 Merging with the metabolism metadata file

#merge
KEGG_Metabolism <- merge(Metabolism_file_KEGG, Counts_Table_wide, by.x="KO", by.y="row.names", all.x = T, sort = F)
kable((head(KEGG_Metabolism)), format='markdown')
KO IDs pathway_A order Gene Comment.Uniprot.comment NIOZ132_cc_b51_2 NIOZ134_mx_b281_2 SRR2090153_bin1042 SRR2090153_bin461 SRR2090159_bin1129 SRR2090159_bin1288 GCA_002494525 GCA_002495465 GCA_002501985 SRR4028224_bin17 SRR5007147_bin71 U_76725 NIOZ124_cc_b247_2 NIOZ125_mb_b254_2 NIOZ129_mb_b262_2 NIOZ126_mb_b401_2 NIOZ122_mb_b319_2 NIOZ134_mb_b41_2 NIOZ132_mb_b282_2 NIOZ127_mb_b161_2 NIOZ123_bs_b5_2 NIOZ121_cc_b94_2 NIOZ136_cc_b15_2 GCA_002502135 NIOZ134_mb_b293_2 NIOZ132_mb_b260_2 NIOZ126_mb_b137_2 NIOZ129_mb_b254_2 NIOZ127_mb_b282_2 NIOZ122_mb_b305_2 NIOZ136_mb_b335_2 NIOZ132_mb_b198_2 NIOZ136_mb_b104_2 NIOZ125_mb_b178_2 NIOZ121_mb_b48_2 NIOZ134_cc_b149_2 NIOZ134_mb_b361_2 NIOZ136_mb_b226_2 NIOZ123_bs_b392_2 NIOZ124_mb_b130_2 NIOZ132_cc_b149_2 NIOZ125_cc_b75_2 NIOZ132_mx_b314_2 NIOZ120_mb_b229_2 NIOZ126_mb_b304_2 NIOZ119_mb_b5_2
K00362 ID14 Nitrogen N2 nirB; nitrite reductase (NADH) large subunit [EC:1.7.1.15] Dissimilatory 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
TIGR02374 ID15 Nitrogen N2 “NirB, nitrite reductase [NAD(P)H]” Dissimilatory 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
TIGR02378 ID17 Nitrogen N2 “NirD, nitrite reductase [NAD(P)H]” Dissimilatory 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
K04748 ID31 no_pathway norQ; nitric oxide reductase NorQ protein 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
K00376 ID36 Nitrogen N5 nosZ; nitrous-oxide reductase [EC:1.7.2.4] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
TIGR04246 ID37 Nitrogen N5 “nosZ, nitrous-oxide reductase, Sec-dependent” 0 0 0 0 0 0 1 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#print
write.table(KEGG_Metabolism, "../output_examples/KEGG_Metabolism.txt",  sep = "\t", quote = F, row.names = T, na = "")

7.8.2 For the clusters

The process works exactly the same as above. So try by yourself if you can merge things ;-).

7.9 Plot the data for our genes of interest

Here, we are not interested in plotting all genes but just want to plot things that are listed under the lipid pathway.

Since we are only interested in the Lipid pathway for the genes of interest (the table, among others, also lists genes involved in informational processing), we first need to make a gene list of just the genes we are interested in.

New functions:

  • subset() = subsets a dataframe. Syntax –> subset(dataframe, column, %in% pattern_we_look_for)
#subset genes of interest and clean factors
Genes_Lipids <- subset(Genes_of_interest, Pathway_1 %in% "Lipids")
Genes_Lipids$Gene <- factor(Genes_Lipids$Gene)
Genes_Lipids$arcog <- factor(Genes_Lipids$arcog)

#check how many genes we have
dim(Genes_Lipids)
[1] 16 12
kable((head(Genes_Lipids)), format='markdown')
Pathway_1 pathway_2 arcog xx Pathway Gene_short Gene_name Comment Gene Order Order2 Other
262 Lipids Mevalonate arCOG01767 NA I HmgB 3-hydroxy-3-methylglutaryl CoA synthase Catalyzes the condensation of acetyl-CoA with acetoacetyl-CoA to form 3-hydroxy-3-methylglutaryl-CoA (HMG-CoA). Functions in the mevalonate (MVA) pathway HmgB Lipid_1 262
263 Lipids Mevalonate arCOG04260 NA I HmgA “3-hydroxy-3-methylglutaryl-coenzyme A reductase, HMG-CoA reductase” Catalyzes the NADPH-dependent reductive deacylation of (S)-3-hydroxy-3-methylglutaryl-CoA (HMG-CoA) to (R)-mevalonate. Functions in the mevalonate (MVA) pathway HmgA Lipid_1 263
264 Lipids Mevalonate arCOG01028 NA I MvK Mevalonate kinase Catalyzes the phosphorylation of (R)-mevalonate (MVA) to (R)-mevalonate 5-phosphate (MVAP). MvK Lipid_1 264
265 Lipids Mevalonate K18689 NA I M3k Mevalonate 3-kinase Catalyzes the phosphorylation of mevalonate (MVA) to yield mevalonate-3-phosphate. Functions in an alternative mevalonate pathway M3k Lipid_1 265
266 Lipids Mevalonate arCOG01967 NA I M3P5K Mevalonate-3-phosphate 5-kinase “Phosphorylates mevalonate 3-phosphate to form mevalonate 3,5-bisphosphate. Functions in an alternative mevalonate pathway” M3P5K Lipid_1 266
267 Lipids Mevalonate arCOG02937 NA I mvaD Diphosphomevalonate decarboxylase Catalyzes the decarboxylation of mevalonate 5-phosphate (MVAP) to isopentenyl phosphate (IP). Functions in the mevalonate (MVA) pathway leading to IPP mvaD Lipid_1 267

Next, we want to make sure that the order is ok. In this specific example, we manually defined two columns for ordering (Order and Order2). We sort based on these columns and make a vector to order our genes of interest and our pathways of interest. For the lipid genes we look at the mevalonate pathway and general lipid biosynthesis genes

  • ``length()``` - lets us check the length of a vector, here it allows us to see that we would expect 16 genes
#define an order (we arange the dataframe based on two columsn, Order and Order2)
Genes_Lipids_order_temp <- Genes_Lipids %>% arrange(Order, Order2)
Genes_Lipids_order <- as.character(unique(Genes_Lipids_order_temp$Gene))
length(Genes_Lipids_order)
[1] 16
Genes_Lipids_order
 [1] "HmgB"          "HmgA"          "MvK"           "M3k"          
 [5] "M3P5K"         "mvaD"          "Ipk"           "GFPS"         
 [9] "EgsA"          "GGGP_synthase" "UbiA"          "GGR"          
[13] "CarS"          "PgsA"          "ASS"           "Asd"          
#The lipids belong to two different pathways, these 2 pathways we want to show in two separate heatmaps
Lipids_Pathway_order <- as.character(unique(Genes_Lipids$pathway_2))
Lipids_Pathway_order
[1] "Mevalonate" "Lipids"    

Now that we know what genes we are interested in, lets subset our original count table.

#subset our original count table for genes of interest and clean factors
Genes_Lipids_counts <- subset(Counts_Table_long_Tax_sum, geneID %in% as.character(Genes_Lipids$arcog))
Genes_Lipids_counts$geneID <- factor(Genes_Lipids_counts$geneID)

#control that all went fine
length(unique(Genes_Lipids_counts$geneID))
[1] 15
dim(Genes_Lipids_counts)
[1] 60  5
kable((head(Genes_Lipids_counts)), format='markdown')
ClusterName geneID NrGenomes quantity percentage
1924 Aquifer (4) arCOG00476 4 4 100
1947 Aquifer (4) arCOG00570 4 4 100
1967 Aquifer (4) arCOG00670 4 3 75
1968 Aquifer (4) arCOG00671 4 4 100
1997 Aquifer (4) arCOG00860 4 3 75
2032 Aquifer (4) arCOG00982 4 3 75

With length we see that now we just have 15 genes. How can we find out what gene is missing?

  • setdiff() = we compare two vectors and print the elements that differ.
setdiff(Genes_Lipids$arcog, Genes_Lipids_counts$geneID)
[1] "K18689"

We can see that we miss K18689. If we check our original annotations input, we can see that K18689 does not exist in that table. So we can nout pull information because of that. This gives a good example, why it is important to check your data as we do not know whether this is an issue with the code or what the problem could be.


Now, since our count data does not know that we categorize our different genes into different pathways, lets add this info in with merge()

#add in metadata (from the pathway info)
Key_Lipids_genes_cluster <- merge(Genes_Lipids_counts, Genes_Lipids, by.x ="geneID", by.y = 'arcog' , all.x = T )
kable((head(Key_Lipids_genes_cluster)), format='markdown')
geneID ClusterName NrGenomes quantity percentage Pathway_1 pathway_2 xx Pathway Gene_short Gene_name Comment Gene Order Order2 Other
arCOG00476 Aquifer (4) 4 4 100 Lipids Lipids NA L UbiA geranylgeranylglycerol-phosphate geranylgeranyltransferase [EC:2.5.1.42] Notice: only the homologs identified in the aquifer MAGs harbor the characteristic DGGGP synthase domain (IPR023547) UbiA Lipid_2 272
arCOG00476 Basal (2) 2 2 100 Lipids Lipids NA L UbiA geranylgeranylglycerol-phosphate geranylgeranyltransferase [EC:2.5.1.42] Notice: only the homologs identified in the aquifer MAGs harbor the characteristic DGGGP synthase domain (IPR023547) UbiA Lipid_2 272
arCOG00476 Clade1 (11) 11 11 100 Lipids Lipids NA L UbiA geranylgeranylglycerol-phosphate geranylgeranyltransferase [EC:2.5.1.42] Notice: only the homologs identified in the aquifer MAGs harbor the characteristic DGGGP synthase domain (IPR023547) UbiA Lipid_2 272
arCOG00476 Marine (6) 6 6 100 Lipids Lipids NA L UbiA geranylgeranylglycerol-phosphate geranylgeranyltransferase [EC:2.5.1.42] Notice: only the homologs identified in the aquifer MAGs harbor the characteristic DGGGP synthase domain (IPR023547) UbiA Lipid_2 272
arCOG00476 Clade2 (23) 23 19 83 Lipids Lipids NA L UbiA geranylgeranylglycerol-phosphate geranylgeranyltransferase [EC:2.5.1.42] Notice: only the homologs identified in the aquifer MAGs harbor the characteristic DGGGP synthase domain (IPR023547) UbiA Lipid_2 272
arCOG00570 Clade2 (23) 23 22 96 Lipids Lipids NA L GGR digeranylgeranylglycerophospholipid reductase [EC:1.3.1.101 1.3.7.11] “reduction of 2,3-digeranylgeranylglycerophospholipids (unsaturated archaeols) into 2,3-diphytanylglycerophospholipids (saturated archaeols)” GGR Lipid_2 273

7.9.1 Categorizing data

There are different ways to color code data. By default the way we do it, we use a gradual color scale from 0-100%. However, we could also define categories with the ifelse statement we learned before. Here, we define 4 categories (100%, 75-100 and, 33-75 and 0-33%)

#define color code (not used for the current figure, but can be changed)
#here , we define 3 color levels, which sometimes is useful to show very clear cutoffs
Key_Lipids_genes_cluster$category <- ifelse(Key_Lipids_genes_cluster$percentage == 100, "1",
                                         ifelse(Key_Lipids_genes_cluster$percentage >= 75, "0.75",
                                                ifelse(Key_Lipids_genes_cluster$percentage >= 33, "0.33", "0")))

7.9.2 Order our data

Remember the annoying thing that R sorts alphabetically? Let’s make sure we ahve the order we want.

#define order for the plot
Key_Lipids_genes_cluster$ClusterName2 <-  factor(Key_Lipids_genes_cluster$ClusterName, levels = rev(Cluster_order))
Key_Lipids_genes_cluster$Gene2 <-  factor(Key_Lipids_genes_cluster$Gene, levels = Genes_Lipids_order)
Key_Lipids_genes_cluster$pathway_2b <-  factor(Key_Lipids_genes_cluster$pathway_2, levels = Lipids_Pathway_order)

7.9.3 Plotting

In the example here, we use a gradual scale. If we would want to use our 4 categories we can use this code #scale_fill_manual(values= c("white", "blue", "blue", "dodgerblue")) and replacing the fill = percentage with fill = category.

#plot
p1_Lipids <- 
  ggplot(Key_Lipids_genes_cluster, aes(x=Gene2, y=ClusterName2)) + 
  geom_tile(aes(fill = percentage)) +
  geom_hline(yintercept=2.5) +
  facet_wrap( ~ pathway_2b, nrow = 1, scales='free_x') +
  scale_fill_distiller(palette = "Blues", direction = 1) +
  theme_bw() +
  #scale_fill_manual(values= c("white", "blue", "blue", "dodgerblue")) +
  labs(x="", y="", fill="Percentage") + 
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), 
        axis.line = element_line(colour = "black")) +
  theme(legend.position="left",
        axis.text.x=element_text(angle=45,vjust = 1, hjust=1, size=8),
        #axis.text.y=element_blank(),
        axis.ticks=element_blank(),
        axis.line=element_blank(),
        plot.margin=unit(c(0, 0, 0, 0), "mm"))

p1_Lipids

as.data.frame(levels(Key_Lipids_genes_cluster$ClusterName2))
p1_Lipids2 <- 
  ggplot(Key_Lipids_genes_cluster, aes(x=Gene2, y=ClusterName2)) + 
  geom_tile(aes(fill = percentage)) +
  geom_hline(yintercept=2.5) +
  facet_wrap( ~ pathway_2b, nrow = 1, scales='free_x') +
  scale_fill_distiller(palette = "Blues", direction = 1) +
  theme_bw() +
  #scale_fill_manual(values= c("white", "blue", "blue", "dodgerblue")) +
  labs(x="", y="", fill="Percentage") + 
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), 
        axis.line = element_line(colour = "black")) +
  theme(legend.position="left",
        axis.text.x=element_text(angle=45,vjust = 1, hjust=1, size=8),
        #axis.text.y=element_blank(),
        axis.ticks=element_blank(),
        axis.line=element_blank(),
        plot.margin=unit(c(0, 0, 0, 0), "mm"))

p1_Lipids2

If we plot with facets we can sometimes have the problem that different genes have different widths. We can correct this behaviour with ggplotGrob.

# convert ggplot object to grob object (used to rescale plot)
gp_lipid <- ggplotGrob(p1_Lipids2)

# optional: take a look at the grob object's layout
gtable::gtable_show_layout(gp_lipid)

# get gtable columns corresponding to the facets (5 & 9, in this case)
facet.columns <- gp_lipid$layout$l[grepl("panel", gp_lipid$layout$name)]

# get the number of unique x-axis values per facet (1 & 3, in this case)
x.var <- sapply(ggplot_build(p1_Lipids)$layout$panel_scales_x,
                function(l) length(l$range$range))

# change the relative widths of the facet columns based on
# how many unique x-axis values are in each facet
gp_lipid$widths[facet.columns] <- gp_lipid$widths[facet.columns] * x.var

# plot result
grid::grid.draw(gp_lipid)

#print
#pdf("2_Output/Figure_S64.pdf", paper="special", family="sans",width=8, height=7, useDingbats=FALSE)
#grid::grid.draw(gp_lipid)
#dev.off() 

In this example we see that only the aquifer and the basal clade have all required genes for the mevalonate pathway and the lipid pathway that is required to make a key archaeal lipid.