7 Working with annotation data

As mentioned in the beginning section, if working with genomes we often have large lists of proteins and their annotations, often from different databases.

Here, we might be interested in condensing this information for each genome or for each phylum etc . OR we want to merge our data with other tables that are ordered differently. We might also want to plot the data to compare the presence absence levels of different genomes. This section with discuss how to do this in a step by step manner.

7.1 Load essential packages

# | warning: false

library("ggplot2")
library("plyr")
library("dplyr")


Attaching package: 'dplyr'

The following objects are masked from 'package:plyr':

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library("grid")
library("gplots")


Attaching package: 'gplots'

The following object is masked from 'package:stats':

    lowess

library("gridExtra")


Attaching package: 'gridExtra'

The following object is masked from 'package:dplyr':

    combine

library("multcomp")

Loading required package: mvtnorm

Loading required package: survival

Loading required package: TH.data

Loading required package: MASS


Attaching package: 'MASS'

The following object is masked from 'package:dplyr':

    select


Attaching package: 'TH.data'

The following object is masked from 'package:MASS':

    geyser

library("reshape2")
library("RColorBrewer")
library('tidyr')


Attaching package: 'tidyr'

The following object is masked from 'package:reshape2':

    smiths

library('tidyverse')

-- Attaching packages --------------------------------------- tidyverse 1.3.2 --

v tibble  3.1.8     v stringr 1.4.1
v readr   2.1.2     v forcats 0.5.2
v purrr   0.3.4     
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::arrange()     masks plyr::arrange()
x gridExtra::combine() masks dplyr::combine()
x purrr::compact()     masks plyr::compact()
x dplyr::count()       masks plyr::count()
x dplyr::failwith()    masks plyr::failwith()
x dplyr::filter()      masks stats::filter()
x dplyr::id()          masks plyr::id()
x dplyr::lag()         masks stats::lag()
x dplyr::mutate()      masks plyr::mutate()
x dplyr::rename()      masks plyr::rename()
x MASS::select()       masks dplyr::select()
x dplyr::summarise()   masks plyr::summarise()
x dplyr::summarize()   masks plyr::summarize()

library(data.table)


Attaching package: 'data.table'

The following object is masked from 'package:purrr':

    transpose

The following objects are masked from 'package:reshape2':

    dcast, melt

The following objects are masked from 'package:dplyr':

    between, first, last

library(knitr)
library(kableExtra)


Attaching package: 'kableExtra'

The following object is masked from 'package:dplyr':

    group_rows

7.2 Deal with the mapping file for our genomes

7.2.1 Load mapping file

mapping file = a table that has the metadata for our genomes. In this specific example it lists our 46 DPANN genomes into clusters based on their phylogenetic placement.

We later want to summarize our data by individual bins and the phylogenetic clusters defined in the mapping file.

Notice: We can also sort our mapping file beforehand and use this order to order our tables, graph, etc…

#read in the mapping file
design <- read.table("../data/mapping.txt", sep="\t", header=T, fill=TRUE, quote = "")

#check the structure of the mapping file
head(design)

7.3 Clean mapping file

Here, we want to add an extra column that summarizes how many genomes are in each phylogenetic cluster.

First lets summarize how many bins we have for each cluster:

ddply() = summarize our data by counting the number of bins we have in each cluster
order() = next, we order our data based on the clusters with the greater number of bins

#add a new column that links the BinID and the taxonomic cluster
design$NewName2 <-paste(design$BinID, design$Cluster, sep = "-")
head(design)

#transform mapping file and summarize how many genomes we have/cluster
Number_of_taxa <- ddply(design, .(Cluster), summarize, NrGenomes = length(Cluster))

#order our data
Number_of_taxa <- Number_of_taxa[order(Number_of_taxa$NrGenomes, decreasing = TRUE),] 

#view data
head(Number_of_taxa)

Next, add this information into our original mapping file:

paste() = paste together different columns. With sep we can control what delimiter we want to use when combining info

#add a new column with the cluster count info into mapping file (then it can be added as a label into figures)
design_2 <- merge(design, Number_of_taxa, by = "Cluster" )

#view wether all went fine
head(design_2)

Next, let’s merge all relevant into one column using paste() again:

#generate a new column into the mapping file that links the cluster name with the number of genomes of clusters in each cluster
design_2$ClusterName <- paste(design_2$Cluster, " (", design_2$NrGenomes, ")", sep = "")

#view data
head(design_2)

7.4 Generate lists to order dataframes

The default behaviour of a lot of R functions is to order data alphabetically but this is not what we might want. I.e. we want to order by different columns or simply order by the initial order in our mapping file. To do this it is useful to have vectors that are order our bins, clusters, etc… like we want them.

In this example the mapping file was ordered like this

The basal group
Groups from an older dataset (marine and aquifer)
two black sea clades

Lets first check, whether our new mapping file still has that order

#check the old data
design$Cluster

 [1] "Basal"   "Basal"   "Aquifer" "Aquifer" "Aquifer" "Aquifer" "Marine" 
 [8] "Marine"  "Marine"  "Marine"  "Marine"  "Marine"  "Clade1"  "Clade1" 
[15] "Clade1"  "Clade1"  "Clade1"  "Clade1"  "Clade1"  "Clade1"  "Clade1" 
[22] "Clade1"  "Clade1"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2" 
[29] "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2" 
[36] "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2" 
[43] "Clade2"  "Clade2"  "Clade2"  "Clade2"

#check the new data
design_2$Cluster

 [1] "Aquifer" "Aquifer" "Aquifer" "Aquifer" "Basal"   "Basal"   "Clade1" 
 [8] "Clade1"  "Clade1"  "Clade1"  "Clade1"  "Clade1"  "Clade1"  "Clade1" 
[15] "Clade1"  "Clade1"  "Clade1"  "Clade2"  "Clade2"  "Clade2"  "Clade2" 
[22] "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2" 
[29] "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2" 
[36] "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Marine"  "Marine" 
[43] "Marine"  "Marine"  "Marine"  "Marine"

If we check the individual factors, we can see that the new dataframe is order by alphabet. Since this is not what we want, lets correct this

match(design_2$BinID, design$BinID) = We check what positions in design_2 match with design. Here, the first argument are the values to be matched and the second are the values to be matched against.
order() = we reorder our dataframe based on the results from match()

#reorder our old dataframe by the original one
design_2 <- design_2[ order(match(design_2$BinID, design$BinID)), ]

#check, if the basal clade is now in the correct order
design_2$Cluster

 [1] "Basal"   "Basal"   "Aquifer" "Aquifer" "Aquifer" "Aquifer" "Marine" 
 [8] "Marine"  "Marine"  "Marine"  "Marine"  "Marine"  "Clade1"  "Clade1" 
[15] "Clade1"  "Clade1"  "Clade1"  "Clade1"  "Clade1"  "Clade1"  "Clade1" 
[22] "Clade1"  "Clade1"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2" 
[29] "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2" 
[36] "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2"  "Clade2" 
[43] "Clade2"  "Clade2"  "Clade2"  "Clade2"

Now, that the basal clade is in the first position, lets make vectors of the bins and clusters in order this we use:

unique() = we use unique on all ClusterNames in our mapping file. That way instead of having repeated names, we only have the unique ones
as.character() = we make sure that our R object we generated is a character (so has words).

#make a list to order our bins
Bin_order <- as.character(unique(design_2$BinID))
Bin_order

 [1] "NIOZ132_cc_b51_2"   "NIOZ134_mx_b281_2"  "SRR2090153_bin1042"
 [4] "SRR2090153_bin461"  "SRR2090159_bin1129" "SRR2090159_bin1288"
 [7] "GCA_002494525"      "GCA_002495465"      "GCA_002501985"     
[10] "SRR4028224_bin17"   "SRR5007147_bin71"   "U_76725"           
[13] "NIOZ124_cc_b247_2"  "NIOZ125_mb_b254_2"  "NIOZ129_mb_b262_2" 
[16] "NIOZ126_mb_b401_2"  "NIOZ122_mb_b319_2"  "NIOZ134_mb_b41_2"  
[19] "NIOZ132_mb_b282_2"  "NIOZ127_mb_b161_2"  "NIOZ123_bs_b5_2"   
[22] "NIOZ121_cc_b94_2"   "NIOZ136_cc_b15_2"   "GCA_002502135"     
[25] "NIOZ134_mb_b293_2"  "NIOZ132_mb_b260_2"  "NIOZ126_mb_b137_2" 
[28] "NIOZ129_mb_b254_2"  "NIOZ127_mb_b282_2"  "NIOZ122_mb_b305_2" 
[31] "NIOZ136_mb_b335_2"  "NIOZ132_mb_b198_2"  "NIOZ136_mb_b104_2" 
[34] "NIOZ125_mb_b178_2"  "NIOZ121_mb_b48_2"   "NIOZ134_cc_b149_2" 
[37] "NIOZ134_mb_b361_2"  "NIOZ136_mb_b226_2"  "NIOZ123_bs_b392_2" 
[40] "NIOZ124_mb_b130_2"  "NIOZ132_cc_b149_2"  "NIOZ125_cc_b75_2"  
[43] "NIOZ132_mx_b314_2"  "NIOZ120_mb_b229_2"  "NIOZ126_mb_b304_2" 
[46] "NIOZ119_mb_b5_2"

#make cluster order
Cluster_order <- as.character(unique(design_2$ClusterName))
Cluster_order

[1] "Basal (2)"   "Aquifer (4)" "Marine (6)"  "Clade1 (11)" "Clade2 (23)"

7.5 Deal with mapping files for the annotations

The mapping files (Arcog_mapping and Metabolism_file_KEGG) that are provided with this tutorial give extra information and order our KO and arCOG annotations. Genes_of_interest is a list of key metabolic genes we want to look at more closely.

7.5.1 Load and view the tables

#general mapping file for arcog IDs
Arcog_mapping <- read.table("../data/ar14_arCOGdef19.txt", sep="\t", header=T, fill=TRUE, quote = "")
head(Arcog_mapping)

#pathway mapping file
Metabolism_file_KEGG <- read.table("../data/Metabolism_Table_KO_Apr2020.txt", sep="\t", header=T, fill=TRUE, quote = "")
head(Metabolism_file_KEGG)

#load the genes of interest
Genes_of_interest <- read.table("../data/Genes_of_interest.txt", sep="\t", header=T, fill=TRUE, quote = "")
head(Arcog_mapping)

7.5.2 Make a vector to order our genes of interest

We use unique() to remove any duplicate gene names
arrange() can be used to order our dataframe by more than one column.

You notice here that the syntax for arrange() is a bit unusual and we use the %>% symbol (a so-called forward pipe operator). This symbol is commonly used in the dplyr and tidyr packages which are extremely useful to summarize data. This function passes the left hand side of the operator to the first argument of the right hand side of the operator. In the following example, the data frame Genes_of_interest gets passed to arrange()

#order metabolic genes
Genes_Metabolism_order_temp <- Genes_of_interest %>% arrange(Order, Order2)

#make a unique vector for our genes of interest
Genes_Metabolism_order <- as.character(unique(Genes_Metabolism_order_temp$Gene))
Genes_Metabolism_order

  [1] "HmgB"                     "HmgA"                    
  [3] "MvK"                      "M3k"                     
  [5] "M3P5K"                    "mvaD"                    
  [7] "Ipk"                      "GFPS"                    
  [9] "EgsA"                     "GGGP_synthase"           
 [11] "UbiA"                     "GGR"                     
 [13] "CarS"                     "PgsA"                    
 [15] "ASS"                      "Asd"                     
 [17] "Pol-B1"                   "Pol-B2"                  
 [19] "Pol-B3"                   "PolB-casposon-associated"
 [21] "PolD_small"               "PolD_large"              
 [23] "Pol4"                     "PolX"                    
 [25] "Orc1"                     "SSB/RPA1"                
 [27] "RPA2"                     "RPA3"                    
 [29] "GinS"                     "PriS"                    
 [31] "PriL"                     "DnaG"                    
 [33] "Lig "                     "LigN"                    
 [35] "Pcn1"                     "RfcS"                    
 [37] "RfcL"                     "FEN1"                    
 [39] "RnhA"                     "RnhB"                    
 [41] "MCM2"                     "MPH1"                    
 [43] "BRR2"                     "MUS81"                   
 [45] "SrmB"                     "Dna2"                    
 [47] "TopA"                     "GyrA"                    
 [49] "GyrB"                     "Rgy"                     
 [51] "TopB"                     "Top6A"                   
 [53] "Top6B"                    "RecJ/Cdc45"              
 [55] "SbcC"                     "SbcD"                    
 [57] "NurA"                     "Nth"                     
 [59] "Nfo"                      "Nfi"                     
 [61] "NucS"                     "XthA"                    
 [63] "DinG"                     "SSO0112"                 
 [65] "HerA"                     "RadA"                    
 [67] "RadB"                     "Udg1 "                   
 [69] "AlkD"                     "Ogt"                     
 [71] "Ogg"                      "hjc"                     
 [73] "UvrA"                     "UvrB"                    
 [75] "UvrC"                     "MthTI"                   
 [77] "MthZI"                    "MjaII"                   
 [79] "MjaIII"                   "MjaV"                    
 [81] "Dcm"                      "DNA_met"                 
 [83] "HmfAB"                    "HmvA"                    
 [85] "ScpA"                     "ScpB"                    
 [87] "MC1"                      "Smc"                     
 [89] "Alba1"                    "ELP3"                    
 [91] "FtsZ"                     "MinD"                    
 [93] "Cren-1"                   "actin_like"              
 [95] "Rkd1"                     "Rkd2"                    
 [97] "CetZ1"                    "CdvA"                    
 [99] "CdvB"                     "Vps4"                    
[101] "RpoC/Rpo3"                "RpoC/Rpo11"              
[103] "RpoB/Rpo2"                "RpoA/Rpo1"               
[105] "Rpo6/RpoZ"                "RPB11"                   
[107] "Rpo4"                     "RPB5"                    
[109] "RPB7"                     "RPB8"                    
[111] "RPB10"                    "RPC12"                   
[113] "Rpo13"                    "TBP"                     
[115] "TFB"                      "TFE"                     
[117] "TFS"                      "LysM"                    
[119] "Spt4"                     "NusG"                    
[121] "NusA"                     "EGD2"                    
[123] "L1"                       "L2"                      
[125] "L3"                       "L4"                      
[127] "L5"                       "L6P"                     
[129] "L7AE"                     "L10"                     
[131] "L10AE/L16"                "L11"                     
[133] "L12E/L44/L45/RPP1/RPP2"   "L13"                     
[135] "L13E"                     "L14"                     
[137] "L14E/L6E/L27E"            "L15"                     
[139] "L15E"                     "L18"                     
[141] "L18E"                     "L19E"                    
[143] "L20A"                     "L21E"                    
[145] "L22"                      "L23"                     
[147] "L24"                      "L24E"                    
[149] "L29"                      "L30"                     
[151] "L30E"                     "L31E"                    
[153] "L32E"                     "L34E"                    
[155] "L35AE/L33A"               "L37AE/L43A"              
[157] "L37E"                     "L38E"                    
[159] "L39E"                     "L40E"                    
[161] "L41E"                     "L44E"                    
[163] "S2"                       "S3"                      
[165] "S3AE"                     "S4"                      
[167] "S4E"                      "S5"                      
[169] "S6E/S10"                  "S7"                      
[171] "S8"                       "S8E"                     
[173] "S9"                       "S10"                     
[175] "S11"                      "S12"                     
[177] "S13"                      "S14"                     
[179] "S15P"                     "S17"                     
[181] "S17E"                     "S19"                     
[183] "S19E"                     "S24E"                    
[185] "S25"                      "S26"                     
[187] "S27AE"                    "S27E"                    
[189] "S28E/S33"                 "S30"                     
[191] "PelA"                     "RsgA"                    
[193] "HflX"                     "AlaS"                    
[195] "ArgS"                     "AsnS"                    
[197] "CysS"                     "GlnS"                    
[199] "GRS1"                     "HisS"                    
[201] "IleS"                     "LeuS"                    
[203] "LysS"                     "LysU"                    
[205] "MetG"                     "PheS"                    
[207] "SerS"                     "ProS"                    
[209] "ThrS"                     "TrpS"                    
[211] "TyrS"                     "ValS"                    
[213] "EIF1A"                    "InfB"                    
[215] "SUI1"                     "EIF2A"                   
[217] "EIF2B "                   "EIF2B-like"              
[219] "EIF2G"                    "EIF4A"                   
[221] "EIF5a"                    "EIF6"                    
[223] "Tuf"                      "EFB1"                    
[225] "FusA"                     "eRF1"                    
[227] "Rli1"                     "Csl4"                    
[229] "Rrp4"                     "Rrp41"                   
[231] "Rrp42"                    "Dph2"                    
[233] "Dph5"                     "Dph6"                    
[235] "Trm5"                     "TYW1"                    
[237] "TYW2"                     "TYW3"                    
[239] "EndA"                     "Fau1"                    
[241] "LigT"                     "Nob1"                    
[243] "RtcB"                     "Archaease"               
[245] "RCL1"                     "RnjA"                    
[247] "Rnz"                      "Rnp1"                    
[249] "Rnp2"                     "Rnp3"                    
[251] "Rnp4"                     "TgtA"                    
[253] "DnaJ"                     "DnaK"                    
[255] "GrpE"                     "GroL"                    
[257] "HtpX"                     "Hsp16"                   
[259] "Hsp19"                    "PsmA"                    
[261] "PRE1"                     "Pan1"                    
[263] "Pcm"                      "ThsA"

#define a order for metabolic pathways
Pathway_order <- as.character(unique(Genes_of_interest$pathway_2))
Pathway_order

 [1] "DNA_Polymerase"       "RFs"                  "Helicase"            
 [4] "Topoisomerase"        "Repair_Recombination" "Methylase"           
 [7] "Chromatin"            "Cell_cycle"           "RNA_Polymerase"      
[10] "TFs"                  "Transcription"        "Ribosome"            
[13] "tRNA_synthetase"      "Exosome"              "Diphthamide_BS"      
[16] "Wybutosine_BS"        "Translation"          "Posttranslation"     
[19] "Mevalonate"           "Lipids"

7.6 Deal with annotation file

7.6.1 Read in table

You already here notice that this takes a bit longer and we just work with 46 bins. This is a good reason to keep python in mind as depending on your computer the more memory heavy operations might get challenging. Another alternative would be to run R on the clusters.

#read in data and view it
Input <- read.table("../data/UAP2_Annotation_table_u.txt", sep="\t", header=T, fill=TRUE, quote = "")
head(Input)

7.6.2 Make a mapping file that links all annotation IDs to their descriptions

What we do:

separate columns we are interested in for each Database of interest, i.e. arCOGs, and remove duplicate rows by using unique()
change the column names using colnames(). Here, we want to make sure that all the 6 new objects we generate have the same columns
combine our 6 dataframes using rbind(). For this to work we need the same number of columns.

In theory that would be a nice example for a loop as well, since we do exactly the same thing for 6x.

#generate Description table for all DBs of interest
Arcogs_Description <- unique(Input[,c("arcogs","arcogs_Description" )])
colnames(Arcogs_Description) <- c("Gene", "Description")
kable((head(Arcogs_Description)), format='markdown')

Gene	Description
arCOG00570	“Geranylgeranyl_reductase,_flavoprotein”
-	-
arCOG01358	2-methylthioadenine_synthetase
arCOG01169	Enolase
arCOG04245	Ribosomal_protein_S2
arCOG01728	“Predicted_class_III_extradiol_dioxygenase,_MEMO1_family”

KOs_Description <- Input[,c("KO_hmm","Definition" )]
colnames(KOs_Description) <- c("Gene", "Description")
kable((head(KOs_Description)), format='markdown')

Gene	Description
K17830	digeranylgeranylglycerophospholipid_reductase_[EC:1.3.1.101_1.3.7.11]
-	-
K15865	threonylcarbamoyladenosine_tRNA_methylthiotransferase_CDKAL1_[EC:2.8.4.5]
K01689	enolase_[EC:4.2.1.11]
K02998	small_subunit_ribosomal_protein_SAe
K06990	MEMO1_family_protein

Pfam_Description <- unique(Input[,c("PFAM_hmm","PFAM_description" )])
colnames(Pfam_Description) <- c("Gene", "Description")
kable((head(Pfam_Description)), format='markdown')

Gene	Description
PF01494	FAD_binding_domain
-	-
PF04055	Radical_SAM_superfamily
PF00113	“Enolase,_C-terminal_TIM_barrel_domain”
PF00318	Ribosomal_protein_S2
PF01875	Memo-like_protein

TIRGR_Description <- unique(Input[,c("TIRGR","TIGR_description" )])
colnames(TIRGR_Description) <- c("Gene", "Description")
kable((head(TIRGR_Description)), format='markdown')

Gene	Description
TIGR02032	geranylgeranyl_reductase_family
-	-
TIGR01578	“MiaB-like_tRNA_modifying_enzyme,_archaeal-type”
TIGR01060	phosphopyruvate_hydratase
TIGR01012	ribosomal_protein_uS2
TIGR04336	AmmeMemoRadiSam_system_protein_B

Cazy_Description <- unique(Input[,c("CAZy","Description" )])
colnames(Cazy_Description) <- c("Gene", "Description")
kable((head(Cazy_Description)), format='markdown')

	Gene	Description
1	-	-
81	GH119	__alpha-amylase_(EC_3.2.1.1)
245	GT4	“__sucrose_synthase_(EC_2.4.1.13);sucrose-phosphate_synthase(EC_2.4.1.14);alpha-glucosyltransferase(EC_2.4.1.52);lipopolysaccharide_N-acetylglucosaminyltransferase(EC_2.4.1.56);phosphatidylinositol_alpha-mannosyltransferase(EC_2.4.1.57);_GDP-Man:Man1GlcNAc2-PP-dolichol_alpha-1,3-mannosyltransferase(EC_2.4.1.132);_GDP-Man:Man3GlcNAc2-PP-dolichol/Man4GlcNAc2-PP-dolichol_alpha-1,2-mannosyltransferase(EC_2.4.1.131);digalactosyldiacylglycerol_synthase(EC_2.4.1.141);1,2-diacylglycerol_3-glucosyltransferase(EC_2.4.1.157);__diglucosyl_diacylglycerol_synthase_(EC_2.4.1.208);trehalose_phosphorylase(EC_2.4.1.231);_NDP-Glc:alpha-glucose_alpha-glucosyltransferase/alpha,alpha-trehalose_synthase(EC_2.4.1.245);_GDP-Man:Man2GlcNAc2-PP-dolichol_alpha-1,6-mannosyltransferase(EC_2.4.1.257);_UDP-GlcNAc:2-deoxystreptamine_alpha-N-acetylglucosaminyltransferase(EC_2.4.1.283);_UDP-GlcNAc:ribostamycin_alpha-N-acetylglucosaminyltransferase(EC_2.4.1.285);UDP-Gal_alpha-galactosyltransferase(EC_2.4.1.-);UDP-Xyl_alpha-xylosyltransferase(EC_2.4.2.-);UDP-GlcA_alpha-glucuronyltransferase(EC_2.4.1.-);UDP-Glc_alpha-glucosyltransferase(EC_2.4.1.-);_UDP-GalNAc:GalNAc-PP-Und_alpha-1,3-N-acetylgalactosaminyltransferase(EC_2.4.1.306);_UDP-GalNAc:N,N’-diacetylbacillosaminyl-PP-Und_alpha-1,3-N-acetylgalactosaminyltransferase(EC_2.4.1.290);ADP-dependent_alpha-maltose-1-phosphate_synthase(2.4.1.342)”
291	GT2_Glycos_transf_2	-
613	GT83	__undecaprenyl_phosphate-alpha-L-Ara4N:4-amino-4-deoxy-beta-L-arabinosyltransferase(EC_2.4.2.43);_dodecaprenyl_phosphate-beta-galacturonic_acid:lipopolysaccharide_core_alpha-galacturonosyl_transferase(EC_2.4.1.-)
1357	GT66	__dolichyl-diphosphooligosaccharide<83><80><9a><82><9d>protein_glycotransferase_(EC_2.4.99.18);undecaprenyl-diphosphooligosaccharide<83><80><9a><82><9d>protein_glycotransferase(EC_2.4.99.19)_

HydDB_Description <- unique(Input[,c("Description.1","Description.1" )])
colnames(HydDB_Description) <- c("Gene", "Description")
kable((head(HydDB_Description)), format='markdown')

	Gene	Description
1	-	-
4787	[FeFe]_Group_A1	[FeFe]_Group_A1
18935	[NiFe]_Group_4g	[NiFe]_Group_4g
19355	[NiFe]_Group_3b	[NiFe]_Group_3b
37081	[FeFe]_Group_C3	[FeFe]_Group_C3

#make a file with a description of all the ids for each search
All_Genes_Description <- rbind(Arcogs_Description,KOs_Description,Pfam_Description,TIRGR_Description, Cazy_Description, HydDB_Description)

7.6.3 Parse table to make it easier to work with it

Here we:

Subset the data for the columns we are interested in. Esp. for larger dataframes this will make the operations a bit quicker. For very large dataframes, i.e. 5000 genomes, it might be better to switch to python
Convert data from wide to long format
Clean factors. After subsetting often factors are not removed, we clean them up in that step

Info:

Converting a wide to a long dataframe

Wide dataframe: The Input data in this example is considered as a wide dataframe. I.e. all the gene IDs we are interested in are spread out into different columns
Long dataframe: The gene IDs we are interested in are found all in the same column. Important, most R functions work with long dataframes.

#print the column names to subset our datatable
colnames(Input)

 [1] "accession"          "BinID"              "TaxString"         
 [4] "NewContigID"        "OldContigId"        "ContigIdMerge"     
 [7] "ContigNewLength"    "GC"                 "ProteinID"         
[10] "ProteinGC"          "ProteinLength"      "Prokka"            
[13] "arcogs"             "arcogs_geneID"      "arcogs_Description"
[16] "Pathway"            "arcogs_evalue"      "KO_hmm"            
[19] "e_value"            "bit_score"          "bit_score_cutoff"  
[22] "Definition"         "confidence"         "PFAM_hmm"          
[25] "PFAM_description"   "Pfam_Evalue"        "TIRGR"             
[28] "TIGR_description"   "EC"                 "TIGR_Evalue"       
[31] "CAZy"               "CAZy_evalue"        "Description"       
[34] "TDBD_ID"            "TPDB_evalue"        "HydDB"             
[37] "Description.1"      "HydDB_evalue"       "PFAM"              
[40] "PFAMdescription"    "IPR"                "IPRdescription"    
[43] "TopHit"             "E_value"            "PecID"             
[46] "TaxID"              "TaxString.1"

#only keep the columns we actually want to work with
Input_subset = Input[,c('BinID','accession','arcogs','KO_hmm','PFAM_hmm','TIRGR','CAZy','Description.1' )]
kable((head(Input_subset)), format='markdown')

BinID	accession	arcogs	KO_hmm	PFAM_hmm	TIRGR	CAZy	Description.1
NIOZ119_mb_b5_2	NIOZ119_mb_b5_2-PJFDGLDN_00010	arCOG00570	K17830	PF01494	TIGR02032	-	-
NIOZ119_mb_b5_2	NIOZ119_mb_b5_2-PJFDGLDN_00020	-	-	-	-	-	-
NIOZ119_mb_b5_2	NIOZ119_mb_b5_2-PJFDGLDN_00030	arCOG01358	K15865	PF04055	TIGR01578	-	-
NIOZ119_mb_b5_2	NIOZ119_mb_b5_2-PJFDGLDN_00040	arCOG01169	K01689	PF00113	TIGR01060	-	-
NIOZ119_mb_b5_2	NIOZ119_mb_b5_2-PJFDGLDN_00050	arCOG04245	K02998	PF00318	TIGR01012	-	-
NIOZ119_mb_b5_2	NIOZ119_mb_b5_2-PJFDGLDN_00060	arCOG01728	K06990	PF01875	TIGR04336	-	-

#convert dataframe from wide to long
Input_long <- reshape2::melt(Input_subset,  id=c("accession","BinID"))

#give informative headers
colnames(Input_long) <- c("accession", "BinID", "DB", "gene")

#clean factors, to remove issues when counting
Input_long$gene <- as.factor(Input_long$gene)
kable((head(Input_long)), format='markdown')

accession	BinID	DB	gene
NIOZ119_mb_b5_2-PJFDGLDN_00010	NIOZ119_mb_b5_2	arcogs	arCOG00570
NIOZ119_mb_b5_2-PJFDGLDN_00020	NIOZ119_mb_b5_2	arcogs	-
NIOZ119_mb_b5_2-PJFDGLDN_00030	NIOZ119_mb_b5_2	arcogs	arCOG01358
NIOZ119_mb_b5_2-PJFDGLDN_00040	NIOZ119_mb_b5_2	arcogs	arCOG01169
NIOZ119_mb_b5_2-PJFDGLDN_00050	NIOZ119_mb_b5_2	arcogs	arCOG04245
NIOZ119_mb_b5_2-PJFDGLDN_00060	NIOZ119_mb_b5_2	arcogs	arCOG01728

7.7 Make count tables

7.7.1 Generate a count table for our genomes of interest

Now we want to count, how often does a genome (i.e. NIOZ134_mb_b41_2) have a gene. I.e. how often do we want arCOG00570, arCOG01358, …

7.7.1.0.1 Do this via a loop (not executed, just an example)

Notice:

Since we run this chunk with , eval = FALSE we can still see the code but it is not executed. This is done because some computations take some time, which we do not want to spend, but I still want to show the code to give some alternative examples.

#count the number of proteins for each genome of interest
y <- c()
for (i in Bin_order) {
  x <-  table(subset(Input_long, BinID %in% paste(i))$gene)
  y <- cbind (y,x)
}

#clean-up the table
Counts_Table_loop <- y
colnames(Counts_Table_loop) <- Bin_order
Counts_Table_loop <- as.data.frame(Counts_Table_loop)
kable((head(Counts_Table_loop)), format='markdown')

#the '-' (=not identified genes) is also counted and listed in the first column and removed at this step
Counts_Table_loop <- Counts_Table_loop[-1,]
kable((head(Counts_Table_loop)), format='markdown')

7.7.1.0.2 Do this via ddply ((not executed, just an example))

New functions:

spread() = converts our long to a wide dataframe by using the BinIDs as new column names, the count table as values to populate our dataframe and with missing values we print a 0.

#count data and clean header
Counts_Table_long <- ddply(Input_long, .(BinID, gene), summarize, GeneCount = length(gene))
colnames(Counts_Table_long) <- c("BinID", "geneID", "count")
kable((head(Counts_Table_long)), format='markdown')

#transform to wide format, with fill = 0 instead of a NA we add a 0
Counts_Table_wide <- spread(Counts_Table_long, BinID, count, fill = 0 )

#view data
kable((head(Counts_Table_wide)), format='markdown')

7.7.1.0.3 Do this via tidyr (usually a bit faster than ddplyr, which is why we use this way)

Here, we use the %>% symbol again: In the following example, the a subset of the Input_long data (only 3 columns, not the whole dataframe) gets passed to count()

New functions:

count() = A function of the dplyr package. Here, we count the unique protein IDs grouped by BinID and gene (i.e. roughly equivalent to the columns we want to keep)

#count data and clean header
Counts_Table_long <- Input_long[,c('accession', 'BinID','gene')] %>% count(BinID, gene, sort = FALSE)
colnames(Counts_Table_long) <- c("BinID", "geneID", "count")
kable((head(Counts_Table_long)), format='markdown')

BinID	geneID	count
GCA_002494525	-	2350
GCA_002494525	GT2_Glyco_tranf_2_3	1
GCA_002494525	GT2_Glycos_transf_2	2
GCA_002494525	GT4	2
GCA_002494525	GT66	1
GCA_002494525	GT83	1

When viewing the data we also see that proteins with no annotations are counted (the minus symbol), since we do not care about this at this stage, lets remove everything with a minus symbol

#delete rows with a minus symbol
Counts_Table_long <- Counts_Table_long[Counts_Table_long$geneID!= "-", ]

#clean factors
Counts_Table_long$geneID <- factor(Counts_Table_long$geneID)

#view data
kable((head(Counts_Table_long)), format='markdown')

	BinID	geneID	count
2	GCA_002494525	GT2_Glyco_tranf_2_3	1
3	GCA_002494525	GT2_Glycos_transf_2	2
4	GCA_002494525	GT4	2
5	GCA_002494525	GT66	1
6	GCA_002494525	GT83	1
7	GCA_002494525	K00012	1

Now, we can convert the long to a wide table, since this format is a bit easier to read in excel later.

#transform to wide format, with fill = 0 instead of a NA we add a 0
Counts_Table_wide <- spread(Counts_Table_long, BinID, count, fill = 0 )
kable((head(Counts_Table_wide)), format='markdown')

geneID	GCA_002494525	GCA_002495465	GCA_002501985	GCA_002502135	NIOZ119_mb_b5_2	NIOZ120_mb_b229_2	NIOZ121_mb_b48_2	NIOZ122_mb_b319_2	NIOZ123_bs_b392_2	NIOZ125_cc_b75_2	NIOZ125_mb_b178_2	NIOZ127_mb_b282_2	NIOZ132_mb_b282_2	NIOZ132_mx_b314_2	NIOZ134_mb_b293_2	NIOZ134_mx_b281_2	NIOZ136_mb_b226_2	NIOZ136_mb_b335_2	SRR4028224_bin17	U_76725
CBM44	0	0	0	0	0	0	0	1	0	0	0	0	1	0	0	0	0	0	0	0
CE4	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	2	0	0	0	0
GH119	0	0	0	1	1	1	1	0	1	1	1	1	0	1	1	0	1	1	0	0
GH130	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0
GH95	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0
GT2_Glyco_tranf_2_3	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1

Also, we want our geneIDs to be the new rownames and we do this by using the rownames() functions. We do this since some functions do not like to have characters in their dataframe.

#change the rownames
rownames(Counts_Table_wide) <- Counts_Table_wide$geneID

#view data
kable((head(Counts_Table_wide)), format='markdown')

	geneID	GCA_002494525	GCA_002495465	GCA_002501985	GCA_002502135	NIOZ119_mb_b5_2	NIOZ120_mb_b229_2	NIOZ121_mb_b48_2	NIOZ122_mb_b319_2	NIOZ123_bs_b392_2	NIOZ125_cc_b75_2	NIOZ125_mb_b178_2	NIOZ127_mb_b282_2	NIOZ132_mb_b282_2	NIOZ132_mx_b314_2	NIOZ134_mb_b293_2	NIOZ134_mx_b281_2	NIOZ136_mb_b226_2	NIOZ136_mb_b335_2	SRR4028224_bin17	U_76725
CBM44	CBM44	0	0	0	0	0	0	0	1	0	0	0	0	1	0	0	0	0	0	0	0
CE4	CE4	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	2	0	0	0	0
GH119	GH119	0	0	0	1	1	1	1	0	1	1	1	1	0	1	1	0	1	1	0	0
GH130	GH130	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0
GH95	GH95	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0
GT2_Glyco_tranf_2_3	GT2_Glyco_tranf_2_3	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1

When we change the rownames and view the data, we see that the geneID is now both in the rownames as well as the first column. Since that is a bit messy, we next remove the first column.

#delete the first column
Counts_Table_wide <- Counts_Table_wide[,-1]
kable((head(Counts_Table_wide)), format='markdown')

	GCA_002494525	GCA_002495465	GCA_002501985	GCA_002502135	NIOZ119_mb_b5_2	NIOZ120_mb_b229_2	NIOZ121_mb_b48_2	NIOZ122_mb_b319_2	NIOZ123_bs_b392_2	NIOZ125_cc_b75_2	NIOZ125_mb_b178_2	NIOZ127_mb_b282_2	NIOZ132_mb_b282_2	NIOZ132_mx_b314_2	NIOZ134_mb_b293_2	NIOZ134_mx_b281_2	NIOZ136_mb_b226_2	NIOZ136_mb_b335_2	SRR4028224_bin17	U_76725
CBM44	0	0	0	0	0	0	0	1	0	0	0	0	1	0	0	0	0	0	0	0
CE4	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	2	0	0	0	0
GH119	0	0	0	1	1	1	1	0	1	1	1	1	0	1	1	0	1	1	0	0
GH130	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0
GH95	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0
GT2_Glyco_tranf_2_3	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1

#order our data so that the bins start first with the bins from the basal group
Counts_Table_wide <- Counts_Table_wide[,Bin_order]

#view data
kable((head(Counts_Table_wide)), format='markdown')

	NIOZ134_mx_b281_2	GCA_002494525	GCA_002495465	GCA_002501985	SRR4028224_bin17	U_76725	NIOZ122_mb_b319_2	NIOZ132_mb_b282_2	GCA_002502135	NIOZ134_mb_b293_2	NIOZ127_mb_b282_2	NIOZ136_mb_b335_2	NIOZ125_mb_b178_2	NIOZ121_mb_b48_2	NIOZ136_mb_b226_2	NIOZ123_bs_b392_2	NIOZ125_cc_b75_2	NIOZ132_mx_b314_2	NIOZ120_mb_b229_2	NIOZ119_mb_b5_2
CBM44	0	0	0	0	0	0	1	1	0	0	0	0	0	0	0	0	0	0	0	0
CE4	2	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
GH119	0	0	0	0	0	0	0	0	1	1	1	1	1	1	1	1	1	1	1	1
GH130	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
GH95	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
GT2_Glyco_tranf_2_3	0	1	1	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0

If you run these three examples yourself, take not how different the speed is.

7.7.2 Generate a count table for our clusters of interest

Same as above, but now we want to know for our 4 aquifer genomes, how many have Gene Xx and show this as percent. I.e. if 1/4 genomes have a gene, then 25% have it.

First, lets merge in our taxa info into our count table, we need this to summarize our data by clusters.

#merge the count table with mapping file to add in the taxa info (might take a while depending on size)
Counts_Table_long_Tax <- merge(Counts_Table_long, design_2[,c("BinID", "ClusterName", "NrGenomes")], by = "BinID")
kable((head(Counts_Table_long_Tax)), format='markdown')

BinID	geneID	count	ClusterName	NrGenomes
GCA_002494525	GT2_Glyco_tranf_2_3	1	Marine (6)	6
GCA_002494525	GT2_Glycos_transf_2	2	Marine (6)	6
GCA_002494525	GT4	2	Marine (6)	6
GCA_002494525	GT66	1	Marine (6)	6
GCA_002494525	GT83	1	Marine (6)	6
GCA_002494525	K00012	1	Marine (6)	6

Next, whenever we have a value higher than one, we replace it with 1. That way we deal with our data like it is a presence/absence data. I.e. 0 = no genes present & 1 = gene present

#convert counts to presence/absence matrix (just using 0/1) (this is needed to calculate the percentage across clusters)
Counts_Table_long_Tax$count[Counts_Table_long_Tax$count > 1] <- 1
kable((head(Counts_Table_long_Tax)), format='markdown')

BinID	geneID	count	ClusterName	NrGenomes
GCA_002494525	GT2_Glyco_tranf_2_3	1	Marine (6)	6
GCA_002494525	GT2_Glycos_transf_2	1	Marine (6)	6
GCA_002494525	GT4	1	Marine (6)	6
GCA_002494525	GT66	1	Marine (6)	6
GCA_002494525	GT83	1	Marine (6)	6
GCA_002494525	K00012	1	Marine (6)	6

Now, we can use tidyr to count of how many genomes in a cluster have a gene.

#count data and clean header
Counts_Table_long_Tax_sum <- Counts_Table_long_Tax[,c('ClusterName', 'geneID','NrGenomes', 'count')] %>% count(ClusterName, geneID, NrGenomes, sort = FALSE)
colnames(Counts_Table_long_Tax_sum) <- c("ClusterName", "geneID", "NrGenomes", "quantity")
kable((head(Counts_Table_long_Tax_sum)), format='markdown')

ClusterName	geneID	NrGenomes	quantity
Aquifer (4)	GT2_Glycos_transf_2	4	4
Aquifer (4)	GT4	4	3
Aquifer (4)	GT66	4	2
Aquifer (4)	GT83	4	3
Aquifer (4)	K00003	4	2
Aquifer (4)	K00008	4	3

Next, we want to calculate the percentage. I.e. in the first example, we have 4 aquifer genomes, two of which [FeFe]_Group_C3 (=50%).

#calculate of the percentage to answer of the total genomes per cluster how many have a certain gene
#notice: if running for your own data check here that your percentage makes sense. I.e. we do not want values above 100
Counts_Table_long_Tax_sum$percentage <- round(Counts_Table_long_Tax_sum$quantity/Counts_Table_long_Tax_sum$NrGenomes*100, digits = 0)
kable((head(Counts_Table_long_Tax_sum)), format='markdown')

ClusterName	geneID	NrGenomes	quantity	percentage
Aquifer (4)	GT2_Glycos_transf_2	4	4	100
Aquifer (4)	GT4	4	3	75
Aquifer (4)	GT66	4	2	50
Aquifer (4)	GT83	4	3	75
Aquifer (4)	K00003	4	2	50
Aquifer (4)	K00008	4	3	75

For printing this, we want to convert our long to a wide table by using spread(). Also, we want our geneIDs to be the new rownames and we do this by using the rownames() functions. We do this since some functions do not like to have characters in their dataframe.

#convert long to wide format and clean table (i.e. place the rownames)
Counts_Table_long_Tax_sum_wide <- spread(Counts_Table_long_Tax_sum[,c("geneID", "ClusterName", "percentage")], ClusterName, percentage)

#change the rownames
rownames(Counts_Table_long_Tax_sum_wide) <- Counts_Table_long_Tax_sum_wide$geneID

#view data
kable((head(Counts_Table_long_Tax_sum_wide)), format='markdown')

	geneID	Aquifer (4)	Basal (2)	Clade1 (11)	Clade2 (23)	Marine (6)
CBM44	CBM44	NA	NA	18	NA	NA
CE4	CE4	NA	50	NA	NA	NA
GH119	GH119	NA	NA	NA	52	NA
GH130	GH130	NA	50	NA	NA	NA
GH95	GH95	NA	50	NA	NA	NA
GT2_Glyco_tranf_2_3	GT2_Glyco_tranf_2_3	NA	NA	NA	NA	83

When we change the rownames and view the data, we see that the geneID is now both in the rownames as well as the first column. Since that is a bit messy, we next remove the first column.

#delete the first column
Counts_Table_long_Tax_sum_wide <- Counts_Table_long_Tax_sum_wide[,-1]
kable((head(Counts_Table_long_Tax_sum_wide)), format='markdown')

	Aquifer (4)	Basal (2)	Clade1 (11)	Clade2 (23)	Marine (6)
CBM44	NA	NA	18	NA	NA
CE4	NA	50	NA	NA	NA
GH119	NA	NA	NA	52	NA
GH130	NA	50	NA	NA	NA
GH95	NA	50	NA	NA	NA
GT2_Glyco_tranf_2_3	NA	NA	NA	NA	83

Now we see that we have NAs for genes that are not present in some of our clades. If we want to do more math then NAs are not helpful and we instead want to have a 0 there instead.

#replace NAs with 0
Counts_Table_long_Tax_sum_wide[is.na(Counts_Table_long_Tax_sum_wide)] <- 0
kable((head(Counts_Table_long_Tax_sum_wide)), format='markdown')

	Basal (2)	Clade1 (11)	Clade2 (23)	Marine (6)
CBM44	0	18	0	0
CE4	50	0	0	0
GH119	0	0	52	0
GH130	50	0	0	0
GH95	50	0	0	0
GT2_Glyco_tranf_2_3	0	0	0	83

Finally, we want to sort our data, starting with the Basal clade and ending with the Black Sea Clades

#sort by cluster order (defined by the order of the mapping file)
Counts_Table_long_Tax_sum_wide <- Counts_Table_long_Tax_sum_wide[,Cluster_order]
kable((head(Counts_Table_long_Tax_sum_wide)), format='markdown')

	Basal (2)	Marine (6)	Clade1 (11)	Clade2 (23)
CBM44	0	0	18	0
CE4	50	0	0	0
GH119	0	0	0	52
GH130	50	0	0	0
GH95	50	0	0	0
GT2_Glyco_tranf_2_3	0	83	0	0

7.8 Merge our tables with the mapping data we have

Now, that we have our count tables both for the bins as well as for all the clusters, we now want to add some gene description and subset the data based on different categores.

7.8.1 For the bins

7.8.1.1 Add gene descriptions

Remember above, we made a list of descriptions that links all geneIDs with what is behind all the gene IDs? Now we want to add this info back in in order to print all the counts.

#merge
Counts_Table_final <- merge(All_Genes_Description, Counts_Table_wide, by.x="Gene", by.y="row.names", all.x = T, sort = F)
kable((head(Counts_Table_final)), format='markdown')

Gene	Description	NIOZ132_cc_b51_2	NIOZ134_mx_b281_2	SRR2090153_bin1042	SRR2090153_bin461	SRR2090159_bin1129	SRR2090159_bin1288	GCA_002494525	GCA_002495465	GCA_002501985	SRR4028224_bin17	SRR5007147_bin71	U_76725	NIOZ124_cc_b247_2	NIOZ125_mb_b254_2	NIOZ129_mb_b262_2	NIOZ126_mb_b401_2	NIOZ122_mb_b319_2	NIOZ134_mb_b41_2	NIOZ132_mb_b282_2	NIOZ127_mb_b161_2	NIOZ123_bs_b5_2	NIOZ121_cc_b94_2	NIOZ136_cc_b15_2	GCA_002502135	NIOZ134_mb_b293_2	NIOZ132_mb_b260_2	NIOZ126_mb_b137_2	NIOZ129_mb_b254_2	NIOZ127_mb_b282_2	NIOZ122_mb_b305_2	NIOZ136_mb_b335_2	NIOZ132_mb_b198_2	NIOZ136_mb_b104_2	NIOZ125_mb_b178_2	NIOZ121_mb_b48_2	NIOZ134_cc_b149_2	NIOZ134_mb_b361_2	NIOZ136_mb_b226_2	NIOZ123_bs_b392_2	NIOZ124_mb_b130_2	NIOZ132_cc_b149_2	NIOZ125_cc_b75_2	NIOZ132_mx_b314_2	NIOZ120_mb_b229_2	NIOZ126_mb_b304_2	NIOZ119_mb_b5_2
arCOG00570	“Geranylgeranyl_reductase,_flavoprotein”	2	4	3	1	3	2	1	1	1	1	1	1	2	3	3	3	2	3	3	3	3	2	1	1	2	1	1	0	1	1	1	1	1	1	1	1	1	2	1	1	2	1	1	1	1	1
arCOG01358	2-methylthioadenine_synthetase	1	2	0	1	0	1	0	0	0	0	0	0	0	1	1	1	1	1	1	1	1	1	0	1	1	1	1	1	1	1	1	1	1	1	1	2	1	1	0	1	1	1	1	1	1	1
arCOG01169	Enolase	1	1	0	1	1	1	1	1	1	1	1	0	1	1	1	1	1	1	1	1	1	0	1	1	1	1	1	1	1	1	1	1	1	1	0	1	1	1	1	1	1	1	1	1	1	1
arCOG04245	Ribosomal_protein_S2	1	0	0	1	1	1	1	1	1	1	1	0	1	1	1	1	1	1	1	1	1	0	1	1	1	1	1	1	1	1	1	1	1	1	0	0	1	1	1	1	1	1	1	1	1	1
arCOG01728	“Predicted_class_III_extradiol_dioxygenase,_MEMO1_family”	2	0	0	0	0	0	0	0	0	0	0	0	1	1	1	1	1	1	2	1	1	0	0	1	1	1	0	1	1	1	1	1	1	1	0	1	1	1	1	1	1	1	1	1	0	1
arCOG00546	mRNA_degradation_ribonuclease_J1/J2	1	0	0	0	1	0	1	0	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	0	0	0	0	0	0	1	0	0	1	0	2	0	1	0	1	1	0	1	1	0	1

#print (and beautify elsewhere)
write.table(Counts_Table_final, "../output_examples/Counts_Table_final.txt",  sep = "\t", quote = F, row.names = T, na = "")

7.8.1.2 Merging with the arcog_table

#merge
Arcog_Data <- merge(Arcog_mapping, Counts_Table_wide, by.x="arcog", by.y="row.names", all.x = T, sort = F)
kable((head(Arcog_Data)), format='markdown')

arcog	Pathway	GeneID	Gene	COG	PFAM	CD	TIGR	NIOZ132_cc_b51_2	NIOZ134_mx_b281_2	SRR2090153_bin1042	SRR2090153_bin461	SRR2090159_bin1129	SRR2090159_bin1288	GCA_002494525	GCA_002495465	GCA_002501985	SRR4028224_bin17	SRR5007147_bin71	U_76725	NIOZ124_cc_b247_2	NIOZ125_mb_b254_2	NIOZ129_mb_b262_2	NIOZ126_mb_b401_2	NIOZ122_mb_b319_2	NIOZ134_mb_b41_2	NIOZ132_mb_b282_2	NIOZ127_mb_b161_2	NIOZ123_bs_b5_2	NIOZ121_cc_b94_2	NIOZ136_cc_b15_2	GCA_002502135	NIOZ134_mb_b293_2	NIOZ132_mb_b260_2	NIOZ126_mb_b137_2	NIOZ129_mb_b254_2	NIOZ127_mb_b282_2	NIOZ122_mb_b305_2	NIOZ136_mb_b335_2	NIOZ132_mb_b198_2	NIOZ136_mb_b104_2	NIOZ125_mb_b178_2	NIOZ121_mb_b48_2	NIOZ134_cc_b149_2	NIOZ134_mb_b361_2	NIOZ136_mb_b226_2	NIOZ123_bs_b392_2	NIOZ124_mb_b130_2	NIOZ132_cc_b149_2	NIOZ125_cc_b75_2	NIOZ132_mx_b314_2	NIOZ120_mb_b229_2	NIOZ126_mb_b304_2	NIOZ119_mb_b5_2
arCOG00009	E	PotE	Amino acid transporter	COG00531	pfam00324		TIGR00909	2	4	1	1	1	1	2	1	2	1	1	1	3	3	2	1	3	3	3	3	2	1	0	1	1	2	1	1	1	1	1	1	1	1	1	1	1	2	1	1	1	1	0	1	1	0
arCOG00014	G	RbsK	“Sugar kinase, ribokinase family”	COG00524	pfam00294	cd01174	TIGR02152	1	2	1	2	1	2	2	2	2	2	2	2	2	3	2	2	2	3	2	2	2	2	2	3	1	2	2	3	2	4	2	2	2	2	3	2	2	3	2	4	1	2	2	2	2	1
arCOG00017	R	-	Predicted transcriptional regulator	COG02522				1	1	0	0	0	0	1	1	1	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0
arCOG00018	F	Nnr2	“NAD(P)H-hydrate repair enzyme Nnr, NAD(P)H-hydrate dehydratase domain”	COG00063	“pfam03853,pfam01256”	cd01171	“TIGR00197,TIGR00196”	1	1	1	0	1	0	1	1	1	2	2	0	1	1	1	1	1	1	0	2	1	2	1	2	1	1	1	1	1	1	1	1	0	1	1	2	0	2	1	1	1	0	0	1	0	0
arCOG00029	F	PyrE	Orotate phosphoribosyltransferase	COG00461	pfam00156	cd06223	TIGR00336	1	0	1	0	1	0	1	1	1	1	1	1	1	1	0	0	0	1	1	1	1	0	0	0	0	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	0	0	0	1
arCOG00030	F	Apt	Adenine/guanine phosphoribosyltransferase or related PRPP-binding protein	COG00503	pfam00156	cd06223	TIGR01090	0	1	2	1	1	1	1	1	1	1	1	1	0	0	0	0	0	0	0	0	0	0	0	1	0	1	2	2	1	2	1	2	2	2	2	1	2	0	2	2	1	0	1	1	0	1

#print (and beautify elsewhere)
write.table(Arcog_Data, "../output_examples/ArCOG_Data.txt",  sep = "\t", quote = F, row.names = T, na = "")

7.8.1.3 Merging with the metabolism metadata file

#merge
KEGG_Metabolism <- merge(Metabolism_file_KEGG, Counts_Table_wide, by.x="KO", by.y="row.names", all.x = T, sort = F)
kable((head(KEGG_Metabolism)), format='markdown')

KO	IDs	pathway_A	order	Gene	Comment.Uniprot.comment	NIOZ134_mx_b281_2	SRR2090153_bin461	SRR2090159_bin1129	GCA_002494525	GCA_002495465	SRR4028224_bin17	SRR5007147_bin71	U_76725
K00362	ID14	Nitrogen	N2	nirB; nitrite reductase (NADH) large subunit [EC:1.7.1.15]	Dissimilatory	1	0	0	0	0	0	0	0
TIGR02374	ID15	Nitrogen	N2	“NirB, nitrite reductase [NAD(P)H]”	Dissimilatory	2	0	0	0	0	0	0	0
TIGR02378	ID17	Nitrogen	N2	“NirD, nitrite reductase [NAD(P)H]”	Dissimilatory	1	1	1	0	0	0	0	0
K04748	ID31	no_pathway		norQ; nitric oxide reductase NorQ protein		0	0	1	0	0	0	0	0
K00376	ID36	Nitrogen	N5	nosZ; nitrous-oxide reductase [EC:1.7.2.4]		0	0	0	0	1	0	0	0
TIGR04246	ID37	Nitrogen	N5	“nosZ, nitrous-oxide reductase, Sec-dependent”		0	0	0	1	1	1	1	1

#print
write.table(KEGG_Metabolism, "../output_examples/KEGG_Metabolism.txt",  sep = "\t", quote = F, row.names = T, na = "")

7.8.2 For the clusters

The process works exactly the same as above. So try by yourself if you can merge things ;-).

7.9 Plot the data for our genes of interest

Here, we are not interested in plotting all genes but just want to plot things that are listed under the lipid pathway.

Since we are only interested in the Lipid pathway for the genes of interest (the table, among others, also lists genes involved in informational processing), we first need to make a gene list of just the genes we are interested in.

New functions:

subset() = subsets a dataframe. Syntax –> subset(dataframe, column, %in% pattern_we_look_for)

#subset genes of interest and clean factors
Genes_Lipids <- subset(Genes_of_interest, Pathway_1 %in% "Lipids")
Genes_Lipids$Gene <- factor(Genes_Lipids$Gene)
Genes_Lipids$arcog <- factor(Genes_Lipids$arcog)

#check how many genes we have
dim(Genes_Lipids)

[1] 16 12

kable((head(Genes_Lipids)), format='markdown')

	Pathway_1	pathway_2	arcog	xx	Pathway	Gene_short	Gene_name	Comment	Gene	Order	Order2
262	Lipids	Mevalonate	arCOG01767	NA	I	HmgB	3-hydroxy-3-methylglutaryl CoA synthase	Catalyzes the condensation of acetyl-CoA with acetoacetyl-CoA to form 3-hydroxy-3-methylglutaryl-CoA (HMG-CoA). Functions in the mevalonate (MVA) pathway	HmgB	Lipid_1	262
263	Lipids	Mevalonate	arCOG04260	NA	I	HmgA	“3-hydroxy-3-methylglutaryl-coenzyme A reductase, HMG-CoA reductase”	Catalyzes the NADPH-dependent reductive deacylation of (S)-3-hydroxy-3-methylglutaryl-CoA (HMG-CoA) to (R)-mevalonate. Functions in the mevalonate (MVA) pathway	HmgA	Lipid_1	263
264	Lipids	Mevalonate	arCOG01028	NA	I	MvK	Mevalonate kinase	Catalyzes the phosphorylation of (R)-mevalonate (MVA) to (R)-mevalonate 5-phosphate (MVAP).	MvK	Lipid_1	264
265	Lipids	Mevalonate	K18689	NA	I	M3k	Mevalonate 3-kinase	Catalyzes the phosphorylation of mevalonate (MVA) to yield mevalonate-3-phosphate. Functions in an alternative mevalonate pathway	M3k	Lipid_1	265
266	Lipids	Mevalonate	arCOG01967	NA	I	M3P5K	Mevalonate-3-phosphate 5-kinase	“Phosphorylates mevalonate 3-phosphate to form mevalonate 3,5-bisphosphate. Functions in an alternative mevalonate pathway”	M3P5K	Lipid_1	266
267	Lipids	Mevalonate	arCOG02937	NA	I	mvaD	Diphosphomevalonate decarboxylase	Catalyzes the decarboxylation of mevalonate 5-phosphate (MVAP) to isopentenyl phosphate (IP). Functions in the mevalonate (MVA) pathway leading to IPP	mvaD	Lipid_1	267

Next, we want to make sure that the order is ok. In this specific example, we manually defined two columns for ordering (Order and Order2). We sort based on these columns and make a vector to order our genes of interest and our pathways of interest. For the lipid genes we look at the mevalonate pathway and general lipid biosynthesis genes

``length()``` - lets us check the length of a vector, here it allows us to see that we would expect 16 genes

#define an order (we arange the dataframe based on two columsn, Order and Order2)
Genes_Lipids_order_temp <- Genes_Lipids %>% arrange(Order, Order2)
Genes_Lipids_order <- as.character(unique(Genes_Lipids_order_temp$Gene))
length(Genes_Lipids_order)

[1] 16

Genes_Lipids_order

 [1] "HmgB"          "HmgA"          "MvK"           "M3k"          
 [5] "M3P5K"         "mvaD"          "Ipk"           "GFPS"         
 [9] "EgsA"          "GGGP_synthase" "UbiA"          "GGR"          
[13] "CarS"          "PgsA"          "ASS"           "Asd"

#The lipids belong to two different pathways, these 2 pathways we want to show in two separate heatmaps
Lipids_Pathway_order <- as.character(unique(Genes_Lipids$pathway_2))
Lipids_Pathway_order

[1] "Mevalonate" "Lipids"

Now that we know what genes we are interested in, lets subset our original count table.

#subset our original count table for genes of interest and clean factors
Genes_Lipids_counts <- subset(Counts_Table_long_Tax_sum, geneID %in% as.character(Genes_Lipids$arcog))
Genes_Lipids_counts$geneID <- factor(Genes_Lipids_counts$geneID)

#control that all went fine
length(unique(Genes_Lipids_counts$geneID))

[1] 15

dim(Genes_Lipids_counts)

[1] 60  5

kable((head(Genes_Lipids_counts)), format='markdown')

	ClusterName	geneID	NrGenomes	quantity	percentage
1924	Aquifer (4)	arCOG00476	4	4	100
1947	Aquifer (4)	arCOG00570	4	4	100
1967	Aquifer (4)	arCOG00670	4	3	75
1968	Aquifer (4)	arCOG00671	4	4	100
1997	Aquifer (4)	arCOG00860	4	3	75
2032	Aquifer (4)	arCOG00982	4	3	75

With length we see that now we just have 15 genes. How can we find out what gene is missing?

setdiff() = we compare two vectors and print the elements that differ.

setdiff(Genes_Lipids$arcog, Genes_Lipids_counts$geneID)

[1] "K18689"

We can see that we miss K18689. If we check our original annotations input, we can see that K18689 does not exist in that table. So we can nout pull information because of that. This gives a good example, why it is important to check your data as we do not know whether this is an issue with the code or what the problem could be.

Now, since our count data does not know that we categorize our different genes into different pathways, lets add this info in with merge()

#add in metadata (from the pathway info)
Key_Lipids_genes_cluster <- merge(Genes_Lipids_counts, Genes_Lipids, by.x ="geneID", by.y = 'arcog' , all.x = T )
kable((head(Key_Lipids_genes_cluster)), format='markdown')

geneID	ClusterName	NrGenomes	quantity	percentage	Pathway_1	pathway_2	xx	Pathway	Gene_short	Gene_name	Comment	Gene	Order	Order2
arCOG00476	Aquifer (4)	4	4	100	Lipids	Lipids	NA	L	UbiA	geranylgeranylglycerol-phosphate geranylgeranyltransferase [EC:2.5.1.42]	Notice: only the homologs identified in the aquifer MAGs harbor the characteristic DGGGP synthase domain (IPR023547)	UbiA	Lipid_2	272
arCOG00476	Basal (2)	2	2	100	Lipids	Lipids	NA	L	UbiA	geranylgeranylglycerol-phosphate geranylgeranyltransferase [EC:2.5.1.42]	Notice: only the homologs identified in the aquifer MAGs harbor the characteristic DGGGP synthase domain (IPR023547)	UbiA	Lipid_2	272
arCOG00476	Clade1 (11)	11	11	100	Lipids	Lipids	NA	L	UbiA	geranylgeranylglycerol-phosphate geranylgeranyltransferase [EC:2.5.1.42]	Notice: only the homologs identified in the aquifer MAGs harbor the characteristic DGGGP synthase domain (IPR023547)	UbiA	Lipid_2	272
arCOG00476	Marine (6)	6	6	100	Lipids	Lipids	NA	L	UbiA	geranylgeranylglycerol-phosphate geranylgeranyltransferase [EC:2.5.1.42]	Notice: only the homologs identified in the aquifer MAGs harbor the characteristic DGGGP synthase domain (IPR023547)	UbiA	Lipid_2	272
arCOG00476	Clade2 (23)	23	19	83	Lipids	Lipids	NA	L	UbiA	geranylgeranylglycerol-phosphate geranylgeranyltransferase [EC:2.5.1.42]	Notice: only the homologs identified in the aquifer MAGs harbor the characteristic DGGGP synthase domain (IPR023547)	UbiA	Lipid_2	272
arCOG00570	Clade2 (23)	23	22	96	Lipids	Lipids	NA	L	GGR	digeranylgeranylglycerophospholipid reductase [EC:1.3.1.101 1.3.7.11]	“reduction of 2,3-digeranylgeranylglycerophospholipids (unsaturated archaeols) into 2,3-diphytanylglycerophospholipids (saturated archaeols)”	GGR	Lipid_2	273

7.9.1 Categorizing data

There are different ways to color code data. By default the way we do it, we use a gradual color scale from 0-100%. However, we could also define categories with the ifelse statement we learned before. Here, we define 4 categories (100%, 75-100 and, 33-75 and 0-33%)

#define color code (not used for the current figure, but can be changed)
#here , we define 3 color levels, which sometimes is useful to show very clear cutoffs
Key_Lipids_genes_cluster$category <- ifelse(Key_Lipids_genes_cluster$percentage == 100, "1",
                                         ifelse(Key_Lipids_genes_cluster$percentage >= 75, "0.75",
                                                ifelse(Key_Lipids_genes_cluster$percentage >= 33, "0.33", "0")))

7.9.2 Order our data

Remember the annoying thing that R sorts alphabetically? Let’s make sure we ahve the order we want.

#define order for the plot
Key_Lipids_genes_cluster$ClusterName2 <-  factor(Key_Lipids_genes_cluster$ClusterName, levels = rev(Cluster_order))
Key_Lipids_genes_cluster$Gene2 <-  factor(Key_Lipids_genes_cluster$Gene, levels = Genes_Lipids_order)
Key_Lipids_genes_cluster$pathway_2b <-  factor(Key_Lipids_genes_cluster$pathway_2, levels = Lipids_Pathway_order)

7.9.3 Plotting

In the example here, we use a gradual scale. If we would want to use our 4 categories we can use this code #scale_fill_manual(values= c("white", "blue", "blue", "dodgerblue")) and replacing the fill = percentage with fill = category.

#plot
p1_Lipids <- 
  ggplot(Key_Lipids_genes_cluster, aes(x=Gene2, y=ClusterName2)) + 
  geom_tile(aes(fill = percentage)) +
  geom_hline(yintercept=2.5) +
  facet_wrap( ~ pathway_2b, nrow = 1, scales='free_x') +
  scale_fill_distiller(palette = "Blues", direction = 1) +
  theme_bw() +
  #scale_fill_manual(values= c("white", "blue", "blue", "dodgerblue")) +
  labs(x="", y="", fill="Percentage") + 
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), 
        axis.line = element_line(colour = "black")) +
  theme(legend.position="left",
        axis.text.x=element_text(angle=45,vjust = 1, hjust=1, size=8),
        #axis.text.y=element_blank(),
        axis.ticks=element_blank(),
        axis.line=element_blank(),
        plot.margin=unit(c(0, 0, 0, 0), "mm"))

p1_Lipids

as.data.frame(levels(Key_Lipids_genes_cluster$ClusterName2))

p1_Lipids2 <- 
  ggplot(Key_Lipids_genes_cluster, aes(x=Gene2, y=ClusterName2)) + 
  geom_tile(aes(fill = percentage)) +
  geom_hline(yintercept=2.5) +
  facet_wrap( ~ pathway_2b, nrow = 1, scales='free_x') +
  scale_fill_distiller(palette = "Blues", direction = 1) +
  theme_bw() +
  #scale_fill_manual(values= c("white", "blue", "blue", "dodgerblue")) +
  labs(x="", y="", fill="Percentage") + 
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(), 
        axis.line = element_line(colour = "black")) +
  theme(legend.position="left",
        axis.text.x=element_text(angle=45,vjust = 1, hjust=1, size=8),
        #axis.text.y=element_blank(),
        axis.ticks=element_blank(),
        axis.line=element_blank(),
        plot.margin=unit(c(0, 0, 0, 0), "mm"))

p1_Lipids2

If we plot with facets we can sometimes have the problem that different genes have different widths. We can correct this behaviour with ggplotGrob.

# convert ggplot object to grob object (used to rescale plot)
gp_lipid <- ggplotGrob(p1_Lipids2)

# optional: take a look at the grob object's layout
gtable::gtable_show_layout(gp_lipid)

# get gtable columns corresponding to the facets (5 & 9, in this case)
facet.columns <- gp_lipid$layout$l[grepl("panel", gp_lipid$layout$name)]

# get the number of unique x-axis values per facet (1 & 3, in this case)
x.var <- sapply(ggplot_build(p1_Lipids)$layout$panel_scales_x,
                function(l) length(l$range$range))

# change the relative widths of the facet columns based on
# how many unique x-axis values are in each facet
gp_lipid$widths[facet.columns] <- gp_lipid$widths[facet.columns] * x.var

# plot result
grid::grid.draw(gp_lipid)

#print
#pdf("2_Output/Figure_S64.pdf", paper="special", family="sans",width=8, height=7, useDingbats=FALSE)
#grid::grid.draw(gp_lipid)
#dev.off()

In this example we see that only the aquifer and the basal clade have all required genes for the mevalonate pathway and the lipid pathway that is required to make a key archaeal lipid.