As mentioned in the beginning section, if working with genomes we often have large lists of proteins and their annotations, often from different databases.
Here, we might be interested in condensing this information for each genome or for each phylum etc . OR we want to merge our data with other tables that are ordered differently. We might also want to plot the data to compare the presence absence levels of different genomes. This section with discuss how to do this in a step by step manner.
v tibble 3.1.8 v stringr 1.4.1
v readr 2.1.2 v forcats 0.5.2
v purrr 0.3.4
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::arrange() masks plyr::arrange()
x gridExtra::combine() masks dplyr::combine()
x purrr::compact() masks plyr::compact()
x dplyr::count() masks plyr::count()
x dplyr::failwith() masks plyr::failwith()
x dplyr::filter() masks stats::filter()
x dplyr::id() masks plyr::id()
x dplyr::lag() masks stats::lag()
x dplyr::mutate() masks plyr::mutate()
x dplyr::rename() masks plyr::rename()
x MASS::select() masks dplyr::select()
x dplyr::summarise() masks plyr::summarise()
x dplyr::summarize() masks plyr::summarize()
library(data.table)
Attaching package: 'data.table'
The following object is masked from 'package:purrr':
transpose
The following objects are masked from 'package:reshape2':
dcast, melt
The following objects are masked from 'package:dplyr':
between, first, last
library(knitr)library(kableExtra)
Attaching package: 'kableExtra'
The following object is masked from 'package:dplyr':
group_rows
7.2 Deal with the mapping file for our genomes
7.2.1 Load mapping file
mapping file = a table that has the metadata for our genomes. In this specific example it lists our 46 DPANN genomes into clusters based on their phylogenetic placement.
We later want to summarize our data by individual bins and the phylogenetic clusters defined in the mapping file.
Notice: We can also sort our mapping file beforehand and use this order to order our tables, graph, etc…
#read in the mapping filedesign <-read.table("../data/mapping.txt", sep="\t", header=T, fill=TRUE, quote ="")#check the structure of the mapping filehead(design)
7.3 Clean mapping file
Here, we want to add an extra column that summarizes how many genomes are in each phylogenetic cluster.
First lets summarize how many bins we have for each cluster:
ddply() = summarize our data by counting the number of bins we have in each cluster
order() = next, we order our data based on the clusters with the greater number of bins
#add a new column that links the BinID and the taxonomic clusterdesign$NewName2 <-paste(design$BinID, design$Cluster, sep ="-")head(design)
#transform mapping file and summarize how many genomes we have/clusterNumber_of_taxa <-ddply(design, .(Cluster), summarize, NrGenomes =length(Cluster))#order our dataNumber_of_taxa <- Number_of_taxa[order(Number_of_taxa$NrGenomes, decreasing =TRUE),] #view datahead(Number_of_taxa)
Next, add this information into our original mapping file:
paste() = paste together different columns. With sep we can control what delimiter we want to use when combining info
#add a new column with the cluster count info into mapping file (then it can be added as a label into figures)design_2 <-merge(design, Number_of_taxa, by ="Cluster" )#view wether all went finehead(design_2)
Next, let’s merge all relevant into one column using paste() again:
#generate a new column into the mapping file that links the cluster name with the number of genomes of clusters in each clusterdesign_2$ClusterName <-paste(design_2$Cluster, " (", design_2$NrGenomes, ")", sep ="")#view datahead(design_2)
7.4 Generate lists to order dataframes
The default behaviour of a lot of R functions is to order data alphabetically but this is not what we might want. I.e. we want to order by different columns or simply order by the initial order in our mapping file. To do this it is useful to have vectors that are order our bins, clusters, etc… like we want them.
In this example the mapping file was ordered like this
The basal group
Groups from an older dataset (marine and aquifer)
two black sea clades
Lets first check, whether our new mapping file still has that order
If we check the individual factors, we can see that the new dataframe is order by alphabet. Since this is not what we want, lets correct this
match(design_2$BinID, design$BinID) = We check what positions in design_2 match with design. Here, the first argument are the values to be matched and the second are the values to be matched against.
order() = we reorder our dataframe based on the results from match()
#reorder our old dataframe by the original onedesign_2 <- design_2[ order(match(design_2$BinID, design$BinID)), ]#check, if the basal clade is now in the correct orderdesign_2$Cluster
The mapping files (Arcog_mapping and Metabolism_file_KEGG) that are provided with this tutorial give extra information and order our KO and arCOG annotations. Genes_of_interest is a list of key metabolic genes we want to look at more closely.
#load the genes of interestGenes_of_interest <-read.table("../data/Genes_of_interest.txt", sep="\t", header=T, fill=TRUE, quote ="")head(Arcog_mapping)
7.5.2 Make a vector to order our genes of interest
We use unique() to remove any duplicate gene names
arrange() can be used to order our dataframe by more than one column.
You notice here that the syntax for arrange() is a bit unusual and we use the %>% symbol (a so-called forward pipe operator). This symbol is commonly used in the dplyr and tidyr packages which are extremely useful to summarize data. This function passes the left hand side of the operator to the first argument of the right hand side of the operator. In the following example, the data frame Genes_of_interest gets passed to arrange()
#order metabolic genesGenes_Metabolism_order_temp <- Genes_of_interest %>%arrange(Order, Order2)#make a unique vector for our genes of interestGenes_Metabolism_order <-as.character(unique(Genes_Metabolism_order_temp$Gene))Genes_Metabolism_order
You already here notice that this takes a bit longer and we just work with 46 bins. This is a good reason to keep python in mind as depending on your computer the more memory heavy operations might get challenging. Another alternative would be to run R on the clusters.
#read in data and view itInput <-read.table("../data/UAP2_Annotation_table_u.txt", sep="\t", header=T, fill=TRUE, quote ="")head(Input)
7.6.2 Make a mapping file that links all annotation IDs to their descriptions
What we do:
separate columns we are interested in for each Database of interest, i.e. arCOGs, and remove duplicate rows by using unique()
change the column names using colnames(). Here, we want to make sure that all the 6 new objects we generate have the same columns
combine our 6 dataframes using rbind(). For this to work we need the same number of columns.
In theory that would be a nice example for a loop as well, since we do exactly the same thing for 6x.
#generate Description table for all DBs of interestArcogs_Description <-unique(Input[,c("arcogs","arcogs_Description" )])colnames(Arcogs_Description) <-c("Gene", "Description")kable((head(Arcogs_Description)), format='markdown')
#make a file with a description of all the ids for each searchAll_Genes_Description <-rbind(Arcogs_Description,KOs_Description,Pfam_Description,TIRGR_Description, Cazy_Description, HydDB_Description)
7.6.3 Parse table to make it easier to work with it
Here we:
Subset the data for the columns we are interested in. Esp. for larger dataframes this will make the operations a bit quicker. For very large dataframes, i.e. 5000 genomes, it might be better to switch to python
Convert data from wide to long format
Clean factors. After subsetting often factors are not removed, we clean them up in that step
Info:
Converting a wide to a long dataframe
Wide dataframe: The Input data in this example is considered as a wide dataframe. I.e. all the gene IDs we are interested in are spread out into different columns
Long dataframe: The gene IDs we are interested in are found all in the same column. Important, most R functions work with long dataframes.
#print the column names to subset our datatablecolnames(Input)
#only keep the columns we actually want to work withInput_subset = Input[,c('BinID','accession','arcogs','KO_hmm','PFAM_hmm','TIRGR','CAZy','Description.1' )]kable((head(Input_subset)), format='markdown')
BinID
accession
arcogs
KO_hmm
PFAM_hmm
TIRGR
CAZy
Description.1
NIOZ119_mb_b5_2
NIOZ119_mb_b5_2-PJFDGLDN_00010
arCOG00570
K17830
PF01494
TIGR02032
-
-
NIOZ119_mb_b5_2
NIOZ119_mb_b5_2-PJFDGLDN_00020
-
-
-
-
-
-
NIOZ119_mb_b5_2
NIOZ119_mb_b5_2-PJFDGLDN_00030
arCOG01358
K15865
PF04055
TIGR01578
-
-
NIOZ119_mb_b5_2
NIOZ119_mb_b5_2-PJFDGLDN_00040
arCOG01169
K01689
PF00113
TIGR01060
-
-
NIOZ119_mb_b5_2
NIOZ119_mb_b5_2-PJFDGLDN_00050
arCOG04245
K02998
PF00318
TIGR01012
-
-
NIOZ119_mb_b5_2
NIOZ119_mb_b5_2-PJFDGLDN_00060
arCOG01728
K06990
PF01875
TIGR04336
-
-
#convert dataframe from wide to longInput_long <- reshape2::melt(Input_subset, id=c("accession","BinID"))#give informative headerscolnames(Input_long) <-c("accession", "BinID", "DB", "gene")#clean factors, to remove issues when countingInput_long$gene <-as.factor(Input_long$gene)kable((head(Input_long)), format='markdown')
accession
BinID
DB
gene
NIOZ119_mb_b5_2-PJFDGLDN_00010
NIOZ119_mb_b5_2
arcogs
arCOG00570
NIOZ119_mb_b5_2-PJFDGLDN_00020
NIOZ119_mb_b5_2
arcogs
-
NIOZ119_mb_b5_2-PJFDGLDN_00030
NIOZ119_mb_b5_2
arcogs
arCOG01358
NIOZ119_mb_b5_2-PJFDGLDN_00040
NIOZ119_mb_b5_2
arcogs
arCOG01169
NIOZ119_mb_b5_2-PJFDGLDN_00050
NIOZ119_mb_b5_2
arcogs
arCOG04245
NIOZ119_mb_b5_2-PJFDGLDN_00060
NIOZ119_mb_b5_2
arcogs
arCOG01728
7.7 Make count tables
7.7.1 Generate a count table for our genomes of interest
Now we want to count, how often does a genome (i.e. NIOZ134_mb_b41_2) have a gene. I.e. how often do we want arCOG00570, arCOG01358, …
7.7.1.0.1 Do this via a loop (not executed, just an example)
Notice:
Since we run this chunk with , eval = FALSE we can still see the code but it is not executed. This is done because some computations take some time, which we do not want to spend, but I still want to show the code to give some alternative examples.
#count the number of proteins for each genome of interesty <-c()for (i in Bin_order) { x <-table(subset(Input_long, BinID %in%paste(i))$gene) y <-cbind (y,x)}#clean-up the tableCounts_Table_loop <- ycolnames(Counts_Table_loop) <- Bin_orderCounts_Table_loop <-as.data.frame(Counts_Table_loop)kable((head(Counts_Table_loop)), format='markdown')#the '-' (=not identified genes) is also counted and listed in the first column and removed at this stepCounts_Table_loop <- Counts_Table_loop[-1,]kable((head(Counts_Table_loop)), format='markdown')
7.7.1.0.2 Do this via ddply ((not executed, just an example))
New functions:
spread() = converts our long to a wide dataframe by using the BinIDs as new column names, the count table as values to populate our dataframe and with missing values we print a 0.
#count data and clean headerCounts_Table_long <-ddply(Input_long, .(BinID, gene), summarize, GeneCount =length(gene))colnames(Counts_Table_long) <-c("BinID", "geneID", "count")kable((head(Counts_Table_long)), format='markdown')#transform to wide format, with fill = 0 instead of a NA we add a 0Counts_Table_wide <-spread(Counts_Table_long, BinID, count, fill =0 )#view datakable((head(Counts_Table_wide)), format='markdown')
7.7.1.0.3 Do this via tidyr (usually a bit faster than ddplyr, which is why we use this way)
Here, we use the %>% symbol again: In the following example, the a subset of the Input_long data (only 3 columns, not the whole dataframe) gets passed to count()
New functions:
count() = A function of the dplyr package. Here, we count the unique protein IDs grouped by BinID and gene (i.e. roughly equivalent to the columns we want to keep)
#count data and clean headerCounts_Table_long <- Input_long[,c('accession', 'BinID','gene')] %>%count(BinID, gene, sort =FALSE)colnames(Counts_Table_long) <-c("BinID", "geneID", "count")kable((head(Counts_Table_long)), format='markdown')
BinID
geneID
count
GCA_002494525
-
2350
GCA_002494525
GT2_Glyco_tranf_2_3
1
GCA_002494525
GT2_Glycos_transf_2
2
GCA_002494525
GT4
2
GCA_002494525
GT66
1
GCA_002494525
GT83
1
When viewing the data we also see that proteins with no annotations are counted (the minus symbol), since we do not care about this at this stage, lets remove everything with a minus symbol
#delete rows with a minus symbolCounts_Table_long <- Counts_Table_long[Counts_Table_long$geneID!="-", ]#clean factorsCounts_Table_long$geneID <-factor(Counts_Table_long$geneID)#view datakable((head(Counts_Table_long)), format='markdown')
BinID
geneID
count
2
GCA_002494525
GT2_Glyco_tranf_2_3
1
3
GCA_002494525
GT2_Glycos_transf_2
2
4
GCA_002494525
GT4
2
5
GCA_002494525
GT66
1
6
GCA_002494525
GT83
1
7
GCA_002494525
K00012
1
Now, we can convert the long to a wide table, since this format is a bit easier to read in excel later.
#transform to wide format, with fill = 0 instead of a NA we add a 0Counts_Table_wide <-spread(Counts_Table_long, BinID, count, fill =0 )kable((head(Counts_Table_wide)), format='markdown')
geneID
GCA_002494525
GCA_002495465
GCA_002501985
GCA_002502135
NIOZ119_mb_b5_2
NIOZ120_mb_b229_2
NIOZ121_cc_b94_2
NIOZ121_mb_b48_2
NIOZ122_mb_b305_2
NIOZ122_mb_b319_2
NIOZ123_bs_b392_2
NIOZ123_bs_b5_2
NIOZ124_cc_b247_2
NIOZ124_mb_b130_2
NIOZ125_cc_b75_2
NIOZ125_mb_b178_2
NIOZ125_mb_b254_2
NIOZ126_mb_b137_2
NIOZ126_mb_b304_2
NIOZ126_mb_b401_2
NIOZ127_mb_b161_2
NIOZ127_mb_b282_2
NIOZ129_mb_b254_2
NIOZ129_mb_b262_2
NIOZ132_cc_b149_2
NIOZ132_cc_b51_2
NIOZ132_mb_b198_2
NIOZ132_mb_b260_2
NIOZ132_mb_b282_2
NIOZ132_mx_b314_2
NIOZ134_cc_b149_2
NIOZ134_mb_b293_2
NIOZ134_mb_b361_2
NIOZ134_mb_b41_2
NIOZ134_mx_b281_2
NIOZ136_cc_b15_2
NIOZ136_mb_b104_2
NIOZ136_mb_b226_2
NIOZ136_mb_b335_2
SRR2090153_bin1042
SRR2090153_bin461
SRR2090159_bin1129
SRR2090159_bin1288
SRR4028224_bin17
SRR5007147_bin71
U_76725
CBM44
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
CE4
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
2
0
0
0
0
0
0
0
0
0
0
0
GH119
0
0
0
1
1
1
0
1
0
0
1
0
0
0
1
1
0
0
0
0
0
1
0
0
0
0
0
0
0
1
0
1
0
0
0
0
0
1
1
0
0
0
0
0
0
0
GH130
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
GH95
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
GT2_Glyco_tranf_2_3
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
1
Also, we want our geneIDs to be the new rownames and we do this by using the rownames() functions. We do this since some functions do not like to have characters in their dataframe.
#change the rownamesrownames(Counts_Table_wide) <- Counts_Table_wide$geneID#view datakable((head(Counts_Table_wide)), format='markdown')
geneID
GCA_002494525
GCA_002495465
GCA_002501985
GCA_002502135
NIOZ119_mb_b5_2
NIOZ120_mb_b229_2
NIOZ121_cc_b94_2
NIOZ121_mb_b48_2
NIOZ122_mb_b305_2
NIOZ122_mb_b319_2
NIOZ123_bs_b392_2
NIOZ123_bs_b5_2
NIOZ124_cc_b247_2
NIOZ124_mb_b130_2
NIOZ125_cc_b75_2
NIOZ125_mb_b178_2
NIOZ125_mb_b254_2
NIOZ126_mb_b137_2
NIOZ126_mb_b304_2
NIOZ126_mb_b401_2
NIOZ127_mb_b161_2
NIOZ127_mb_b282_2
NIOZ129_mb_b254_2
NIOZ129_mb_b262_2
NIOZ132_cc_b149_2
NIOZ132_cc_b51_2
NIOZ132_mb_b198_2
NIOZ132_mb_b260_2
NIOZ132_mb_b282_2
NIOZ132_mx_b314_2
NIOZ134_cc_b149_2
NIOZ134_mb_b293_2
NIOZ134_mb_b361_2
NIOZ134_mb_b41_2
NIOZ134_mx_b281_2
NIOZ136_cc_b15_2
NIOZ136_mb_b104_2
NIOZ136_mb_b226_2
NIOZ136_mb_b335_2
SRR2090153_bin1042
SRR2090153_bin461
SRR2090159_bin1129
SRR2090159_bin1288
SRR4028224_bin17
SRR5007147_bin71
U_76725
CBM44
CBM44
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
CE4
CE4
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
2
0
0
0
0
0
0
0
0
0
0
0
GH119
GH119
0
0
0
1
1
1
0
1
0
0
1
0
0
0
1
1
0
0
0
0
0
1
0
0
0
0
0
0
0
1
0
1
0
0
0
0
0
1
1
0
0
0
0
0
0
0
GH130
GH130
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
GH95
GH95
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
GT2_Glyco_tranf_2_3
GT2_Glyco_tranf_2_3
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
1
When we change the rownames and view the data, we see that the geneID is now both in the rownames as well as the first column. Since that is a bit messy, we next remove the first column.
#delete the first columnCounts_Table_wide <- Counts_Table_wide[,-1]kable((head(Counts_Table_wide)), format='markdown')
GCA_002494525
GCA_002495465
GCA_002501985
GCA_002502135
NIOZ119_mb_b5_2
NIOZ120_mb_b229_2
NIOZ121_cc_b94_2
NIOZ121_mb_b48_2
NIOZ122_mb_b305_2
NIOZ122_mb_b319_2
NIOZ123_bs_b392_2
NIOZ123_bs_b5_2
NIOZ124_cc_b247_2
NIOZ124_mb_b130_2
NIOZ125_cc_b75_2
NIOZ125_mb_b178_2
NIOZ125_mb_b254_2
NIOZ126_mb_b137_2
NIOZ126_mb_b304_2
NIOZ126_mb_b401_2
NIOZ127_mb_b161_2
NIOZ127_mb_b282_2
NIOZ129_mb_b254_2
NIOZ129_mb_b262_2
NIOZ132_cc_b149_2
NIOZ132_cc_b51_2
NIOZ132_mb_b198_2
NIOZ132_mb_b260_2
NIOZ132_mb_b282_2
NIOZ132_mx_b314_2
NIOZ134_cc_b149_2
NIOZ134_mb_b293_2
NIOZ134_mb_b361_2
NIOZ134_mb_b41_2
NIOZ134_mx_b281_2
NIOZ136_cc_b15_2
NIOZ136_mb_b104_2
NIOZ136_mb_b226_2
NIOZ136_mb_b335_2
SRR2090153_bin1042
SRR2090153_bin461
SRR2090159_bin1129
SRR2090159_bin1288
SRR4028224_bin17
SRR5007147_bin71
U_76725
CBM44
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
CE4
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
2
0
0
0
0
0
0
0
0
0
0
0
GH119
0
0
0
1
1
1
0
1
0
0
1
0
0
0
1
1
0
0
0
0
0
1
0
0
0
0
0
0
0
1
0
1
0
0
0
0
0
1
1
0
0
0
0
0
0
0
GH130
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
GH95
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
GT2_Glyco_tranf_2_3
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
1
#order our data so that the bins start first with the bins from the basal groupCounts_Table_wide <- Counts_Table_wide[,Bin_order]#view datakable((head(Counts_Table_wide)), format='markdown')
NIOZ132_cc_b51_2
NIOZ134_mx_b281_2
SRR2090153_bin1042
SRR2090153_bin461
SRR2090159_bin1129
SRR2090159_bin1288
GCA_002494525
GCA_002495465
GCA_002501985
SRR4028224_bin17
SRR5007147_bin71
U_76725
NIOZ124_cc_b247_2
NIOZ125_mb_b254_2
NIOZ129_mb_b262_2
NIOZ126_mb_b401_2
NIOZ122_mb_b319_2
NIOZ134_mb_b41_2
NIOZ132_mb_b282_2
NIOZ127_mb_b161_2
NIOZ123_bs_b5_2
NIOZ121_cc_b94_2
NIOZ136_cc_b15_2
GCA_002502135
NIOZ134_mb_b293_2
NIOZ132_mb_b260_2
NIOZ126_mb_b137_2
NIOZ129_mb_b254_2
NIOZ127_mb_b282_2
NIOZ122_mb_b305_2
NIOZ136_mb_b335_2
NIOZ132_mb_b198_2
NIOZ136_mb_b104_2
NIOZ125_mb_b178_2
NIOZ121_mb_b48_2
NIOZ134_cc_b149_2
NIOZ134_mb_b361_2
NIOZ136_mb_b226_2
NIOZ123_bs_b392_2
NIOZ124_mb_b130_2
NIOZ132_cc_b149_2
NIOZ125_cc_b75_2
NIOZ132_mx_b314_2
NIOZ120_mb_b229_2
NIOZ126_mb_b304_2
NIOZ119_mb_b5_2
CBM44
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
CE4
0
2
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
GH119
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
0
1
0
1
0
0
1
1
0
0
1
1
0
0
1
1
1
0
1
GH130
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
GH95
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
GT2_Glyco_tranf_2_3
0
0
0
0
0
0
1
1
1
1
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
If you run these three examples yourself, take not how different the speed is.
7.7.2 Generate a count table for our clusters of interest
Same as above, but now we want to know for our 4 aquifer genomes, how many have Gene Xx and show this as percent. I.e. if 1/4 genomes have a gene, then 25% have it.
First, lets merge in our taxa info into our count table, we need this to summarize our data by clusters.
#merge the count table with mapping file to add in the taxa info (might take a while depending on size)Counts_Table_long_Tax <-merge(Counts_Table_long, design_2[,c("BinID", "ClusterName", "NrGenomes")], by ="BinID")kable((head(Counts_Table_long_Tax)), format='markdown')
BinID
geneID
count
ClusterName
NrGenomes
GCA_002494525
GT2_Glyco_tranf_2_3
1
Marine (6)
6
GCA_002494525
GT2_Glycos_transf_2
2
Marine (6)
6
GCA_002494525
GT4
2
Marine (6)
6
GCA_002494525
GT66
1
Marine (6)
6
GCA_002494525
GT83
1
Marine (6)
6
GCA_002494525
K00012
1
Marine (6)
6
Next, whenever we have a value higher than one, we replace it with 1. That way we deal with our data like it is a presence/absence data. I.e. 0 = no genes present & 1 = gene present
#convert counts to presence/absence matrix (just using 0/1) (this is needed to calculate the percentage across clusters)Counts_Table_long_Tax$count[Counts_Table_long_Tax$count >1] <-1kable((head(Counts_Table_long_Tax)), format='markdown')
BinID
geneID
count
ClusterName
NrGenomes
GCA_002494525
GT2_Glyco_tranf_2_3
1
Marine (6)
6
GCA_002494525
GT2_Glycos_transf_2
1
Marine (6)
6
GCA_002494525
GT4
1
Marine (6)
6
GCA_002494525
GT66
1
Marine (6)
6
GCA_002494525
GT83
1
Marine (6)
6
GCA_002494525
K00012
1
Marine (6)
6
Now, we can use tidyr to count of how many genomes in a cluster have a gene.
#count data and clean headerCounts_Table_long_Tax_sum <- Counts_Table_long_Tax[,c('ClusterName', 'geneID','NrGenomes', 'count')] %>%count(ClusterName, geneID, NrGenomes, sort =FALSE)colnames(Counts_Table_long_Tax_sum) <-c("ClusterName", "geneID", "NrGenomes", "quantity")kable((head(Counts_Table_long_Tax_sum)), format='markdown')
ClusterName
geneID
NrGenomes
quantity
Aquifer (4)
GT2_Glycos_transf_2
4
4
Aquifer (4)
GT4
4
3
Aquifer (4)
GT66
4
2
Aquifer (4)
GT83
4
3
Aquifer (4)
K00003
4
2
Aquifer (4)
K00008
4
3
Next, we want to calculate the percentage. I.e. in the first example, we have 4 aquifer genomes, two of which [FeFe]_Group_C3 (=50%).
#calculate of the percentage to answer of the total genomes per cluster how many have a certain gene#notice: if running for your own data check here that your percentage makes sense. I.e. we do not want values above 100Counts_Table_long_Tax_sum$percentage <-round(Counts_Table_long_Tax_sum$quantity/Counts_Table_long_Tax_sum$NrGenomes*100, digits =0)kable((head(Counts_Table_long_Tax_sum)), format='markdown')
ClusterName
geneID
NrGenomes
quantity
percentage
Aquifer (4)
GT2_Glycos_transf_2
4
4
100
Aquifer (4)
GT4
4
3
75
Aquifer (4)
GT66
4
2
50
Aquifer (4)
GT83
4
3
75
Aquifer (4)
K00003
4
2
50
Aquifer (4)
K00008
4
3
75
For printing this, we want to convert our long to a wide table by using spread(). Also, we want our geneIDs to be the new rownames and we do this by using the rownames() functions. We do this since some functions do not like to have characters in their dataframe.
#convert long to wide format and clean table (i.e. place the rownames)Counts_Table_long_Tax_sum_wide <-spread(Counts_Table_long_Tax_sum[,c("geneID", "ClusterName", "percentage")], ClusterName, percentage)#change the rownamesrownames(Counts_Table_long_Tax_sum_wide) <- Counts_Table_long_Tax_sum_wide$geneID#view datakable((head(Counts_Table_long_Tax_sum_wide)), format='markdown')
geneID
Aquifer (4)
Basal (2)
Clade1 (11)
Clade2 (23)
Marine (6)
CBM44
CBM44
NA
NA
18
NA
NA
CE4
CE4
NA
50
NA
NA
NA
GH119
GH119
NA
NA
NA
52
NA
GH130
GH130
NA
50
NA
NA
NA
GH95
GH95
NA
50
NA
NA
NA
GT2_Glyco_tranf_2_3
GT2_Glyco_tranf_2_3
NA
NA
NA
NA
83
When we change the rownames and view the data, we see that the geneID is now both in the rownames as well as the first column. Since that is a bit messy, we next remove the first column.
#delete the first columnCounts_Table_long_Tax_sum_wide <- Counts_Table_long_Tax_sum_wide[,-1]kable((head(Counts_Table_long_Tax_sum_wide)), format='markdown')
Aquifer (4)
Basal (2)
Clade1 (11)
Clade2 (23)
Marine (6)
CBM44
NA
NA
18
NA
NA
CE4
NA
50
NA
NA
NA
GH119
NA
NA
NA
52
NA
GH130
NA
50
NA
NA
NA
GH95
NA
50
NA
NA
NA
GT2_Glyco_tranf_2_3
NA
NA
NA
NA
83
Now we see that we have NAs for genes that are not present in some of our clades. If we want to do more math then NAs are not helpful and we instead want to have a 0 there instead.
#replace NAs with 0Counts_Table_long_Tax_sum_wide[is.na(Counts_Table_long_Tax_sum_wide)] <-0kable((head(Counts_Table_long_Tax_sum_wide)), format='markdown')
Aquifer (4)
Basal (2)
Clade1 (11)
Clade2 (23)
Marine (6)
CBM44
0
0
18
0
0
CE4
0
50
0
0
0
GH119
0
0
0
52
0
GH130
0
50
0
0
0
GH95
0
50
0
0
0
GT2_Glyco_tranf_2_3
0
0
0
0
83
Finally, we want to sort our data, starting with the Basal clade and ending with the Black Sea Clades
#sort by cluster order (defined by the order of the mapping file)Counts_Table_long_Tax_sum_wide <- Counts_Table_long_Tax_sum_wide[,Cluster_order]kable((head(Counts_Table_long_Tax_sum_wide)), format='markdown')
Basal (2)
Aquifer (4)
Marine (6)
Clade1 (11)
Clade2 (23)
CBM44
0
0
0
18
0
CE4
50
0
0
0
0
GH119
0
0
0
0
52
GH130
50
0
0
0
0
GH95
50
0
0
0
0
GT2_Glyco_tranf_2_3
0
0
83
0
0
7.8 Merge our tables with the mapping data we have
Now, that we have our count tables both for the bins as well as for all the clusters, we now want to add some gene description and subset the data based on different categores.
7.8.1 For the bins
7.8.1.1 Add gene descriptions
Remember above, we made a list of descriptions that links all geneIDs with what is behind all the gene IDs? Now we want to add this info back in in order to print all the counts.
The process works exactly the same as above. So try by yourself if you can merge things ;-).
7.9 Plot the data for our genes of interest
Here, we are not interested in plotting all genes but just want to plot things that are listed under the lipid pathway.
Since we are only interested in the Lipid pathway for the genes of interest (the table, among others, also lists genes involved in informational processing), we first need to make a gene list of just the genes we are interested in.
#subset genes of interest and clean factorsGenes_Lipids <-subset(Genes_of_interest, Pathway_1 %in%"Lipids")Genes_Lipids$Gene <-factor(Genes_Lipids$Gene)Genes_Lipids$arcog <-factor(Genes_Lipids$arcog)#check how many genes we havedim(Genes_Lipids)
[1] 16 12
kable((head(Genes_Lipids)), format='markdown')
Pathway_1
pathway_2
arcog
xx
Pathway
Gene_short
Gene_name
Comment
Gene
Order
Order2
Other
262
Lipids
Mevalonate
arCOG01767
NA
I
HmgB
3-hydroxy-3-methylglutaryl CoA synthase
Catalyzes the condensation of acetyl-CoA with acetoacetyl-CoA to form 3-hydroxy-3-methylglutaryl-CoA (HMG-CoA). Functions in the mevalonate (MVA) pathway
HmgB
Lipid_1
262
263
Lipids
Mevalonate
arCOG04260
NA
I
HmgA
“3-hydroxy-3-methylglutaryl-coenzyme A reductase, HMG-CoA reductase”
Catalyzes the NADPH-dependent reductive deacylation of (S)-3-hydroxy-3-methylglutaryl-CoA (HMG-CoA) to (R)-mevalonate. Functions in the mevalonate (MVA) pathway
HmgA
Lipid_1
263
264
Lipids
Mevalonate
arCOG01028
NA
I
MvK
Mevalonate kinase
Catalyzes the phosphorylation of (R)-mevalonate (MVA) to (R)-mevalonate 5-phosphate (MVAP).
MvK
Lipid_1
264
265
Lipids
Mevalonate
K18689
NA
I
M3k
Mevalonate 3-kinase
Catalyzes the phosphorylation of mevalonate (MVA) to yield mevalonate-3-phosphate. Functions in an alternative mevalonate pathway
M3k
Lipid_1
265
266
Lipids
Mevalonate
arCOG01967
NA
I
M3P5K
Mevalonate-3-phosphate 5-kinase
“Phosphorylates mevalonate 3-phosphate to form mevalonate 3,5-bisphosphate. Functions in an alternative mevalonate pathway”
M3P5K
Lipid_1
266
267
Lipids
Mevalonate
arCOG02937
NA
I
mvaD
Diphosphomevalonate decarboxylase
Catalyzes the decarboxylation of mevalonate 5-phosphate (MVAP) to isopentenyl phosphate (IP). Functions in the mevalonate (MVA) pathway leading to IPP
mvaD
Lipid_1
267
Next, we want to make sure that the order is ok. In this specific example, we manually defined two columns for ordering (Order and Order2). We sort based on these columns and make a vector to order our genes of interest and our pathways of interest. For the lipid genes we look at the mevalonate pathway and general lipid biosynthesis genes
``length()``` - lets us check the length of a vector, here it allows us to see that we would expect 16 genes
#define an order (we arange the dataframe based on two columsn, Order and Order2)Genes_Lipids_order_temp <- Genes_Lipids %>%arrange(Order, Order2)Genes_Lipids_order <-as.character(unique(Genes_Lipids_order_temp$Gene))length(Genes_Lipids_order)
#The lipids belong to two different pathways, these 2 pathways we want to show in two separate heatmapsLipids_Pathway_order <-as.character(unique(Genes_Lipids$pathway_2))Lipids_Pathway_order
[1] "Mevalonate" "Lipids"
Now that we know what genes we are interested in, lets subset our original count table.
#subset our original count table for genes of interest and clean factorsGenes_Lipids_counts <-subset(Counts_Table_long_Tax_sum, geneID %in%as.character(Genes_Lipids$arcog))Genes_Lipids_counts$geneID <-factor(Genes_Lipids_counts$geneID)#control that all went finelength(unique(Genes_Lipids_counts$geneID))
We can see that we miss K18689. If we check our original annotations input, we can see that K18689 does not exist in that table. So we can nout pull information because of that. This gives a good example, why it is important to check your data as we do not know whether this is an issue with the code or what the problem could be.
Now, since our count data does not know that we categorize our different genes into different pathways, lets add this info in with merge()
#add in metadata (from the pathway info)Key_Lipids_genes_cluster <-merge(Genes_Lipids_counts, Genes_Lipids, by.x ="geneID", by.y ='arcog' , all.x = T )kable((head(Key_Lipids_genes_cluster)), format='markdown')
“reduction of 2,3-digeranylgeranylglycerophospholipids (unsaturated archaeols) into 2,3-diphytanylglycerophospholipids (saturated archaeols)”
GGR
Lipid_2
273
7.9.1 Categorizing data
There are different ways to color code data. By default the way we do it, we use a gradual color scale from 0-100%. However, we could also define categories with the ifelse statement we learned before. Here, we define 4 categories (100%, 75-100 and, 33-75 and 0-33%)
#define color code (not used for the current figure, but can be changed)#here , we define 3 color levels, which sometimes is useful to show very clear cutoffsKey_Lipids_genes_cluster$category <-ifelse(Key_Lipids_genes_cluster$percentage ==100, "1",ifelse(Key_Lipids_genes_cluster$percentage >=75, "0.75",ifelse(Key_Lipids_genes_cluster$percentage >=33, "0.33", "0")))
7.9.2 Order our data
Remember the annoying thing that R sorts alphabetically? Let’s make sure we ahve the order we want.
#define order for the plotKey_Lipids_genes_cluster$ClusterName2 <-factor(Key_Lipids_genes_cluster$ClusterName, levels =rev(Cluster_order))Key_Lipids_genes_cluster$Gene2 <-factor(Key_Lipids_genes_cluster$Gene, levels = Genes_Lipids_order)Key_Lipids_genes_cluster$pathway_2b <-factor(Key_Lipids_genes_cluster$pathway_2, levels = Lipids_Pathway_order)
7.9.3 Plotting
In the example here, we use a gradual scale. If we would want to use our 4 categories we can use this code #scale_fill_manual(values= c("white", "blue", "blue", "dodgerblue")) and replacing the fill = percentage with fill = category.
If we plot with facets we can sometimes have the problem that different genes have different widths. We can correct this behaviour with ggplotGrob.
# convert ggplot object to grob object (used to rescale plot)gp_lipid <-ggplotGrob(p1_Lipids2)# optional: take a look at the grob object's layoutgtable::gtable_show_layout(gp_lipid)# get gtable columns corresponding to the facets (5 & 9, in this case)facet.columns <- gp_lipid$layout$l[grepl("panel", gp_lipid$layout$name)]# get the number of unique x-axis values per facet (1 & 3, in this case)x.var <-sapply(ggplot_build(p1_Lipids)$layout$panel_scales_x,function(l) length(l$range$range))# change the relative widths of the facet columns based on# how many unique x-axis values are in each facetgp_lipid$widths[facet.columns] <- gp_lipid$widths[facet.columns] * x.var# plot resultgrid::grid.draw(gp_lipid)
In this example we see that only the aquifer and the basal clade have all required genes for the mevalonate pathway and the lipid pathway that is required to make a key archaeal lipid.