1 Introduction to R

R is a statistical programming language and environment; it is free, open source and in active development. This tutorial introduces the basic concepts of R.

This tutorial will work with example data for two datasets:

1. Growth data

We have two data files that work with a similar experimental setup:

1.1. Growth_Data.txt

In these experiments, we performed plant growth assays in which we treated our plants with different microbes, asking whether any microbe affects plant growth in a positive way.

This file contains measurements of root length and shoot fresh weight for plants grown under control treatment (=MgCl) or after treatment with 4 different bacteria. For simplicity, only 1 biological experiment with 7-10 individual measurements per treatment was included.

1.2. Timecourse.txt

We found positive effects for some of the strains tested above and now we want to know how long it takes for this effect to appear. To answer this, we measured the root length of our plants after adding our microbe and compared it to control treatments at 5 different time points. A special feature of this dataframe is that it contains empty cells (=NAs), which we need to deal with because some R functions do not handle empty cells well.
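A minimal sketch of two common ways to handle such NAs (assuming the data has been read into an object called timecourse with a Rootlength column, as shown later in this tutorial):

#ignore NAs within a function
mean(timecourse$Rootlength, na.rm = TRUE)

#or drop all rows that contain NAs
timecourse_clean <- na.omit(timecourse)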

1.1 Good practices for coding

These practices are useful regardless of the computational language you use; a minimal script skeleton illustrating several of them follows the list.

  • Record what program versions you used
  • For each project, document who wrote the code, when, and why
  • Put dependencies (i.e. packages) at the beginning
  • Record the working directory (wdir)
  • Document ALL your code and comment it (using the # symbol)
  • Comment code in detail, so that you can still understand it after 5 years
  • Break code into smaller pieces for better readability
  • Test each line of code and build in control steps
  • If you work with random numbers, report the seed
  • Use sessionInfo() at the end of the script, which documents all the packages used within R for the current project
  • For larger files: save objects not workspaces (for space reasons)
  • Use descriptive names for objects: short and simple, but informative enough to understand what they mean
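A minimal script skeleton illustrating several of these points could look like this (the project description, packages and the seed are placeholders, not part of the example data):

#Project: plant growth experiments
#Author:  your name
#Date:    date written / last changed
#Purpose: compare root length between control (MgCl) and bacterial treatments

#dependencies go at the beginning
library(knitr)

#record the working directory
getwd()

#report the seed if random numbers are used
set.seed(123)

#... analysis code, commented with the # symbol ...

#document all packages used in this session
sessionInfo()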

1.2 The example data

Let’s have a look at the data structure for our first dataframe:

We can see that we test our bacteria under different nutrient conditions (noP and P, i.e. low and normal phosphorus concentrations added) and different treatments (=conditions), which are control treatments (MgCl) and different strains of microbes (Strain101, Strain28, etc.). For each of these treatments we measured the shoot fresh weight (in mg) and the root length (in cm).

The timecourse data looks similar; we just have an extra column for the different timepoints, and we only have measurements for the root length.

2. Annotation data

This file is specific to the output of the Spang_team annotation pipeline but this workflow can be used for any type of categorical data one wants to summarize.

Using this workflow, we generated the file UAP2_Annotation_table_u.txt, which includes annotations for a set of 46 DPANN genomes. This includes annotations across several databases (arCOG, KO, PFAM, …) for each individual protein found across all these 46 genomes.

Specifically, we want to learn how to:

  • Make a count table for each genome
  • Make a count table for clusters of interest
  • Make a heatmap for genes of interest
  • Merge our results with some pre-sorted tables

For this to work we have some additional files to make our job easier:

  • mapping.txt = a list that defines which cluster (i.e. a grouping based on a phylogenetic tree) each bin belongs to
  • Genes_of_interest = a list of genes we are interested in and that we want to plot in a heatmap
  • ar14_arCOGdef19.txt = metadata for the arCOG annotations
  • Metabolism_Table_KO_Apr2020.txt = metadata for KOs and sorted by pathways

The annotation table looks like this:

accession BinID TaxString NewContigID OldContigId ContigIdMerge ContigNewLength GC ProteinID ProteinGC ProteinLength Prokka arcogs arcogs_geneID arcogs_Description Pathway arcogs_evalue KO_hmm e_value bit_score bit_score_cutoff Definition confidence PFAM_hmm PFAM_description Pfam_Evalue TIRGR TIGR_description EC TIGR_Evalue CAZy CAZy_evalue Description TDBD_ID TPDB_evalue HydDB Description.1 HydDB_evalue PFAM PFAMdescription IPR IPRdescription TopHit E_value PecID TaxID TaxString.1
NIOZ119_mb_b5_2-PJFDGLDN_00010 NIOZ119_mb_b5_2 UAP2 PJFDGLDN_1 NIOZ119_sc1610046_1 NIOZ119_mb_b5_2_contig_1 2539 0.432 PJFDGLDN_00010 0.094 350 Digeranylgeranylglycerophospholipid reductase arCOG00570 - “Geranylgeranyl_reductase,_flavoprotein” I 2.80E-75 K17830 2.10E-69 243.3 246.87 digeranylgeranylglycerophospholipid_reductase_[EC:1.3.1.101_1.3.7.11] - PF01494 FAD_binding_domain 5.90E-09 TIGR02032 geranylgeranyl_reductase_family 1.3.1.- 1.40E-49 - - - - - - - - IPR036188;_IPR002938 FAD/NAD(P)-binding_domain_superfamily;_FAD-binding_domain PF01494 FAD_binding_domain KYK22416.1_hypothetical_protein_AYK24_02275_[Thermoplasmatales_archaeon_SG8-52-4] 2.80E-58 36.5 1803819 “Archaea,Euryarchaeota,Thermoplasmata,Thermoplasmatales,none,none,Thermoplasmatales_archaeon_SG8-52-4”
NIOZ119_mb_b5_2-PJFDGLDN_00020 NIOZ119_mb_b5_2 UAP2 PJFDGLDN_1 NIOZ119_sc1610046_1 NIOZ119_mb_b5_2_contig_1 2539 0.432 PJFDGLDN_00020 0.057 105 hypothetical protein - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
NIOZ119_mb_b5_2-PJFDGLDN_00030 NIOZ119_mb_b5_2 UAP2 PJFDGLDN_1 NIOZ119_sc1610046_1 NIOZ119_mb_b5_2_contig_1 2539 0.432 PJFDGLDN_00030 0.092 304 tRNA-2-methylthio-N(6)-dimethylallyladenosine synthase arCOG01358 MiaB 2-methylthioadenine_synthetase J 5.80E-90 K15865 1.20E-109 375.8 343.7 threonylcarbamoyladenosine_tRNA_methylthiotransferase_CDKAL1_[EC:2.8.4.5] high_score PF04055 Radical_SAM_superfamily 2.60E-21 TIGR01578 “MiaB-like_tRNA_modifying_enzyme,_archaeal-type” - 1.70E-104 - - - - - - - - IPR005839;_IPR002792;_IPR006466;_IPR006638;_IPR007197;_IPR020612;_IPR023404 “Methylthiotransferase;_TRAM_domain;_MiaB-like_tRNA_modifying_enzyme,_archaeal-type;_Elp3/MiaB/NifB;_Radical_SAM;_Methylthiotransferase,_conserved_site;_Radical_SAM,_alpha/beta_horseshoe” PF01938;_PF04055 TRAM_domain;_Radical_SAM_superfamily OIO63284.1_hypothetical_protein_AUJ83_01460_[Candidatus_Woesearchaeota_archaeon_CG1_02_33_12] 8.20E-91 55.6 1805422 “Archaea,Candidatus_Woesearchaeota,none,none,none,none,Candidatus_Woesearchaeota_archaeon_CG1_02_33_12”
NIOZ119_mb_b5_2-PJFDGLDN_00040 NIOZ119_mb_b5_2 UAP2 PJFDGLDN_2 NIOZ119_sc560284_1 NIOZ119_mb_b5_2_contig_2 4191 0.456 PJFDGLDN_00040 0.065 92 Enolase arCOG01169 Eno Enolase G 4.00E-26 K01689 5.10E-20 81 48.73 enolase_[EC:4.2.1.11] high_score PF00113 “Enolase,_C-terminal_TIM_barrel_domain” 1.20E-20 TIGR01060 phosphopyruvate_hydratase 4.2.1.11 1.10E-25 - - - - - - - - IPR020810;_IPR036849;_IPR000941;_IPR020809 “Enolase,_C-terminal_TIM_barrel_domain;_Enolase-like,_C-terminal_domain_superfamily;_Enolase;_Enolase,_conserved_site” PF00113 “Enolase,_C-terminal_TIM_barrel_domain” BAW30993.1_2-phosphoglycerate_dehydratase_[Methanothermobacter_sp._MT-2] 3.80E-23 70.7 1898379 “Archaea,Euryarchaeota,Methanobacteria,Methanobacteriales,Methanobacteriaceae,Methanothermobacter,Methanothermobacter_sp._MT-2”
NIOZ119_mb_b5_2-PJFDGLDN_00050 NIOZ119_mb_b5_2 UAP2 PJFDGLDN_2 NIOZ119_sc560284_1 NIOZ119_mb_b5_2_contig_2 4191 0.456 PJFDGLDN_00050 0.059 218 30S ribosomal protein S2 arCOG04245 RpsB Ribosomal_protein_S2 J 8.90E-64 K02998 1.60E-46 167.8 210.97 small_subunit_ribosomal_protein_SAe - PF00318 Ribosomal_protein_S2 7.00E-24 TIGR01012 ribosomal_protein_uS2 - 1.00E-72 - - - - - - - - IPR005707;_IPR023454;_IPR023591;_IPR018130;_IPR001865 “Ribosomal_protein_S2,_eukaryotic/archaeal;_Ribosomal_protein_S2,_archaeal;_Ribosomal_protein_S2,_flavodoxin-like_domain_superfamily;_Ribosomal_protein_S2,_conserved_site;_Ribosomal_protein_S2” PF00318 Ribosomal_protein_S2 A0B6E5.1_RecName:_Full=30S_ribosomal_protein_S2 1.20E-56 52.8 349307 “Archaea,Euryarchaeota,Methanomicrobia,Methanosarcinales,Methanotrichaceae,Methanothrix,Methanothrix_thermoacetophila”
NIOZ119_mb_b5_2-PJFDGLDN_00060 NIOZ119_mb_b5_2 UAP2 PJFDGLDN_2 NIOZ119_sc560284_1 NIOZ119_mb_b5_2_contig_2 4191 0.456 PJFDGLDN_00060 0.100 280 hypothetical protein arCOG01728 Mho1 “Predicted_class_III_extradiol_dioxygenase,_MEMO1_family” R 8.80E-74 K06990 1.40E-72 253.3 47 MEMO1_family_protein high_score PF01875 Memo-like_protein 4.50E-77 TIGR04336 AmmeMemoRadiSam_system_protein_B - 5.30E-91 - - - - - - - - IPR002737 MEMO1_family PF01875 Memo-like_protein OYT50994.1_AmmeMemoRadiSam_system_protein_B_[Candidatus_Bathyarchaeota_archaeon_ex4484_135] 1.30E-66 47 2012509 “Archaea,Candidatus_Bathyarchaeota,none,none,none,none,Candidatus_Bathyarchaeota_archaeon_ex4484_135”

1.3 Working in R

1.3.1 Opening R via the terminal

If you work with Linux or want to start R from the terminal, open your terminal, change your directory to the R_exercises folder and then simply type R.

Then, you should see something like this:

To check your R version, and to start and quit R, you can type the following (R is typed in the terminal; R.version and q() are run at the R prompt):

#ask what R version we have (from within R; in the terminal you can use R --version)
R.version

#start R
R

#exit R
q()

1.3.2 RStudio (everything in one place):

R itself is command-line only, while RStudio is a graphical user interface (GUI) built around R. Working in RStudio therefore makes everything a bit more interactive.

RStudio includes the following:

  • Script separate from command-line (left-hand screen)
  • Lists your variables (upper, right-hand corner)
  • Manual and an extensive help function
  • Easy install of new packages
  • Plots are shown within RStudio

1.4 Documenting code

1.4.1 Markdown

Markdown is a lightweight markup language that you can use to add formatting elements to plain text documents.

Some examples:

  • Headings are defined with ‘#’, ‘##’, or ‘###’ for first, second and third level.
  • Lists are created by using ‘*’ for bullets and ‘1.’, ‘2.’, … for numbered lists.

But why should we bother to write with Markdown when you can press buttons in an interface to format your text?

  • It is used for a lot of things, including code documentation and building websites
  • Files containing Markdown-formatted text can be opened by a lot of applications, making it extremely portable
  • You can work with Markdown on different operating systems
  • It is used by a lot of tools, such as GitHub, Jupyter or RStudio.

It is not the goal of this tutorial to introduce Markdown in detail, but there is some good material online:

1.4.2 R code in Markdown

  • The R code is embedded between the ```{r} and ``` symbols (see the example below).
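For example, a minimal chunk inside an .Rmd or .qmd document could look like this (the body is ordinary R code; cars is a small data set that ships with R):

```{r}
#this code is executed when the document is rendered
summary(cars)
```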

In RStudio, at the top right of such a chunk you will find three symbols. Pressing the middle one runs all code chunks above, while the right symbol runs the current chunk.

An important menu button is the Knit (or Render) button at the top left-hand corner of RStudio. Pressing this button creates the final, rendered document (i.e. an HTML or PDF file).

1.4.3 Software

1.4.3.1 Rmarkdown

One nice way to document code is to combine R code with informative text in R Markdown format (Rmd).

R Markdown supports dozens of static and dynamic output formats including: HTML, PDF, MS Word, Beamer, HTML5 slides, Tufte-style handouts, books, dashboards, shiny applications, scientific articles, websites, and more.

The R Markdown file specifies code chunks, which will be executed in R (or Python or bash), and plain text, which will be written to the report as is. A report is created by rendering the file in R: the R code is executed and the results are merged into a PDF or HTML output.

How to create an R Markdown document in RStudio:

  • Open RStudio
  • press File/New File/R Markdown

This will create an R Markdown file that already contains some example R code and text. You can also open this document (the .rmd file) in RStudio and see what the code looks like.

You can also open a new file in any text editor and save it with the .rmd extension.
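A minimal .Rmd file could look like this (the title and output format are just example values):

---
title: "My first report"
output: html_document
---

Some introductory text written in Markdown.

```{r}
#an R chunk that is executed during rendering
summary(cars)
```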

1.4.3.2 Quarto

Quarto is the successor of R Markdown. It is an open-source scientific and technical publishing system built on Pandoc that allows you to:

  • Create dynamic content with Python, R, Julia, and Observable.
  • Author documents as plain text markdown or Jupyter notebooks.
  • Publish high-quality articles, reports, presentations, websites, blogs, and books in HTML, PDF, MS Word, ePub, and more.
  • Author with scientific markdown, including equations, citations, crossrefs, figure panels, callouts, advanced layout, and more.

If you have installed the newest version of RStudio, Quarto is already included and we can create a Quarto document as follows:

  • Open RStudio
  • press File/New File/Quarto document

As with R Markdown, we document with Markdown (and HTML if we want), so knowing some basics is very useful.

1.4.4 Execution options

There is a wide variety of options available for customizing the output from executed code; a chunk using a few of them is sketched below.

  • include = FALSE — prevents code and results from appearing in the finished file. R Markdown still runs the code in the chunk, and the results can be used by other chunks.
  • echo = FALSE — prevents code, but not the results from appearing in the finished file. This is a useful way to embed figures.
  • message = FALSE — prevents messages that are generated by code from appearing in the finished file.
  • warning = FALSE — prevents warnings that are generated by code from appearing in the finished file.
  • fig.cap = "..." — adds a caption to graphical results.

For setting these options inside a Quarto document, see more here.
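A sketch of a chunk using a few of these options (assuming the growth data has already been read into growth_data, as shown further below):

```{r, echo=FALSE, warning=FALSE, fig.cap="Root length per treatment"}
#the code is hidden in the rendered report; only the figure and its caption appear
boxplot(Rootlength ~ Condition, data = growth_data)
```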

1.4.5 Using languages other than R

R Markdown/Quarto support several languages, such as bash and python, and you can call them in the same way as R code.

This is useful if you, for example, modify a dataframe in bash but then want to continue working on the data in R. With proper documentation you can keep the code for both steps in the same file.

Below is just an example; we see that we only need to “tell” the document to use bash instead of R at the top of the code chunk.

#run echo to print something to the screen
echo 'hello world'
hello world
#run echo and follow with a sed to modify text
echo 'a b c' | sed 's/ /\|/g'
a|b|c
#list qmd files we have in our directory
ls  *qmd
1_intro.qmd
2_misc.qmd
basic_operations.qmd
control_structures.qmd
data_transformations.qmd
data_types.qmd
parsing_output_from_annotations.qmd
plotting_basics.qmd
stats.qmd
test.qmd
#show what's in our data
head ../data/Growth_Data.txt 
SampleID    Nutrient    Condition   FW_shoot_mg     Rootlength
noP noP MgCl    10.26   5.931015152
noP noP MgCl    6.52    5.74344697
noP noP MgCl    12.17   6.834719697
noP noP MgCl    11.37   6.742734848
noP noP MgCl    9.8 6.736886364
noP noP MgCl    3.75    4.236348485
noP noP MgCl    5.38    4.753484848
noP noP MgCl    4.53    5.532333333
noP noP MgCl    7.75    5.484363636

A general introduction into bash and awk is provided in separate tutorials that are also available on github.

1.5 Getting help

  1. Some good places to check for things online are:
  • www.r-project.org
  • Stack Overflow
  • many more
  2. Inside of R, we can get help on functions and other things by typing either of the following:
help(mean)
?mean
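If you do not know the exact function name, you can also search the help system; a quick sketch:

#search all installed help pages for a keyword
??median

#list all objects whose names contain a given string
apropos("mean")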

1.6 What is a workspace?

The workspace is your current R working environment and includes any user-defined objects (i.e. vectors, matrices, data frames, lists, functions).

At the end of an R session, the user can save an image of the current workspace that is automatically reloaded the next time R is started.

We can check our workspace as follows:

#print the current working directory
getwd() 
[1] "/Users/ninadombrowski/Desktop/WorkingDir/Notebooks/Code_snippets/R/code"
#list the objects in the current workspace
ls()   
[1] "annotation_data" "growth_data"    

1.7 The working directory

The working directory is initially set to the location from which you start R or where the script resides (the latter is the case for this example), but it can be re-set so that your data is found more easily. Ideally, you create one working directory (wdir) per project and define the path in the script (see below). It is recommended to use a similar format for these project folders, i.e. consider creating subfolders for input and output files. From the wdir you set, you can load files using absolute or relative paths.

An example would be something with a structure like this:

In this example you see that we have 4 projects, and in each folder we have the R script and folders for the required input and output files. It is also useful to have a text file with the session info, and if you plan to share the project with others it is good to include a README file that provides some background on the analysis done.

Options to see the working dir and set the working directory in R are:

#print your wdir
getwd()

#setting your wdir
setwd(getwd())
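A hypothetical example of pointing R to a project folder (the path is a placeholder and needs to be adapted to your own system):

#set the wdir with an absolute path
setwd("~/R_exercises/Project1")

#check where we are and what the folder contains
getwd()
list.files()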

1.8 Packages

Packages are a collection of functions and tools that can be added to R that are often contributed by the community.

  • There might be incompatibilities, and packages are updated frequently; updating can break dependencies.
  • You need to install packages only once, but you have to load them EVERY TIME you want to use them. Therefore, ideally load them at the beginning of your scripts.

1.8.1 Installing packages

We have two ways to install packages:

  1. Via the console by typing:

install.packages("package-name")

This will download a package from one of the CRAN mirrors assuming that a binary is available for your operating system. If you have not set a preferred CRAN mirror in your options(), then a menu will pop up asking you to choose a location.

  2. Using RStudio:

Go to the lower right-hand window, click on Packages and then Install, and find the packages you are interested in.

Notice: If libraries come with their own data (i.e. example tables), the data needs to be loaded separately, e.g. via data(cars) to load the built-in cars data set.

1.8.2 Updating packages

  • Use old.packages() to list all your locally installed packages that are now out of date.
  • update.packages() will update all packages in the known libraries interactively. This can take a while if you haven't done it recently. To update everything without any user intervention, use the ask = FALSE argument.

1.8.3 Loading packages into your current R session

R will not remember which libraries you had loaded after you close R. Therefore, you need to load libraries every time you re-open R. Here, we will load some libraries that are usually quite helpful; it is recommended to make the libraries you load part of each of your scripts. For example like this:

#some example packages needed for Rmarkdown
library(knitr)
library(kableExtra)
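A common pattern, sketched here with kableExtra as an example, is to install a package only if it is missing and then load it:

#install kableExtra only if it is not yet installed, then load it
if (!requireNamespace("kableExtra", quietly = TRUE)) {
  install.packages("kableExtra")
}
library(kableExtra)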

1.9 The assignment operator <-

The <- symbol assigns a value to a variable.

General rules for the syntax R uses:

  • R is case sensitive
  • If a variable already exists, assigning to it will overwrite the old value without asking
  • If you work with characters, i.e. words like ‘hello’, then this needs to be written with quotes around it: “hello” (this will become clearer below)
  • ls() shows all the variables that are known by the system at the moment
  • you can remove individual objects with rm(object) and remove all objects with rm(list=ls())

We can store more or less everything in a variable and use it later. For example, we can store numbers and do some math with them:

#store some numbers
x <- 1
y <-4

#do some simple math with the numbers we have stored
x+y
[1] 5
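The same works for character strings (which need quotes), and it also shows that existing variables are silently overwritten:

#store a character string; quotes are required
greeting <- "hello"
greeting

#x from above is silently overwritten with a new value
x <- 10
x+y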

1.10 Use built-in functions

Functions are built-in pieces of code that we can use to make our life easier, e.g. to calculate the length of a vector, do math or run statistical analyses.

Base R already knows many useful functions but loading new packages greatly increases our repertoire.

A list of most used functions can be found here

A function consists of:

  1. Function name
  2. Arguments (optional, some might be set with a default) = control how exactly the function behaves
  3. Body of the function = defines what the function does

As an example, let's test some simple functions: print and log:

#use the print function
print(3+5)
[1] 8
#use the log function
log(10)
[1] 2.302585

1.10.1 Check the default values of a function

Every function comes with a set of arguments that you can set and that usually also have default values. In RStudio you can easily access all those details with the help function.

  • ? allows us to first of all check exactly what a function is doing. If you scroll down to the bottom of the help page you also get some examples on how to use a function.
  • More specifically the help function also allows us to get details on the arguments of a function.
  • For example, if we check the help page of read.table we see that by default this function does not expect a header; if our file has one, we have to change that argument (header = TRUE).
#let's check what **log** is doing
?log

#lets check the default arguments of a function
?read.table
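Besides the help page, we can also print a function's arguments and their default values directly in the console; a quick sketch:

#print the arguments (with default values) of read.table
args(read.table)

#the same works for any other function, e.g. log
args(log)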

Other useful functions:

Name Function
ls() List objects in your current workspace
rm(object) Remove object from your current workspace
rm(list = ls()) Remove all objects from your current workspace

1.11 Read data into R

To work with any kind of data we need to first read the data into R to be able to work with it.

For tables, there are some things to be aware of:

  • It matters what separator our table uses to separate individual columns, e.g. some programs store data using commas while others use tabs as the default delimiter. read.table splits columns at any whitespace by default; with the sep argument we tell it explicitly which delimiter to use.
  • Do not have any hash symbols (#) in your table. R reads these as the start of a comment and ignores everything on the line from that point onward.
  • Avoid empty cells, as these sometimes can mess up your data.

For now, let’s read in the table with our growth data and store it under the variable name growth_data. To read in this file we need to provide the correct path, as the file is not in the working directory but in a subdirectory named data.

Options that are good to keep in mind when reading in a table:

  • sep = defines our field separator, a tab, which is written as \t. If your data uses a space or comma, you can change that here.
  • header = tells R that our data comes with a header (the first row of the dataframe)
  • quote = deals with some annoying issues with data formatting in Excel files

General notice:

  • To view data, the head() command is extremely practical; use it whenever you modify data to check that everything went alright
  • dim() is another useful function that displays the dimensions of a table, i.e. how many rows and columns we have. Again, this is useful for verifying our data after we have transformed it.
  • colnames() allows us to view only the column names
  • rownames() allows us to view only the row names. Usually these are numbers, but we can also put anything else into the rows.
#read in data
timecourse <- read.table("../data/Timecourse.txt", sep="\t", header=T,  quote = "")
growth_data <- read.table("../data/Growth_Data.txt", sep="\t", header=T,  quote = "")

#check the first few lines of our data
head(growth_data)
#check the dimensions of our data
dim(growth_data)
[1] 105   5
#check the column names
colnames(growth_data)
[1] "SampleID"    "Nutrient"    "Condition"   "FW_shoot_mg" "Rootlength" 
#check the row names
rownames(growth_data)
  [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11"  "12" 
 [13] "13"  "14"  "15"  "16"  "17"  "18"  "19"  "20"  "21"  "22"  "23"  "24" 
 [25] "25"  "26"  "27"  "28"  "29"  "30"  "31"  "32"  "33"  "34"  "35"  "36" 
 [37] "37"  "38"  "39"  "40"  "41"  "42"  "43"  "44"  "45"  "46"  "47"  "48" 
 [49] "49"  "50"  "51"  "52"  "53"  "54"  "55"  "56"  "57"  "58"  "59"  "60" 
 [61] "61"  "62"  "63"  "64"  "65"  "66"  "67"  "68"  "69"  "70"  "71"  "72" 
 [73] "73"  "74"  "75"  "76"  "77"  "78"  "79"  "80"  "81"  "82"  "83"  "84" 
 [85] "85"  "86"  "87"  "88"  "89"  "90"  "91"  "92"  "93"  "94"  "95"  "96" 
 [97] "97"  "98"  "99"  "100" "101" "102" "103" "104" "105"

Useful comments:

Sometimes we have to deal with really large data that takes a long time to load with read.table. The function fread() from the data.table package is a very nice alternative; a minimal sketch is shown below.
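A minimal sketch of reading the same file with fread() (assuming the data.table package is installed; fread returns a data.table, which behaves much like a data frame):

#read the growth data with fread instead of read.table
library(data.table)
growth_dt <- fread("../data/Growth_Data.txt", sep = "\t", header = TRUE)
head(growth_dt)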

This script sometimes uses kable to make tables visually attractive in the HTML output. Whenever you see a function using kable, you can simply replace it with the head() function, i.e. write head(growth_data).
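For example, a kable-formatted table of the first rows could be produced like this (knitr needs to be loaded, as done at the beginning of the script):

#render the first rows of the growth data as a formatted table in the html report
kable(head(growth_data))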

1.12 Write data into a text file

Now, if we had modified the table, we might want to store it on our computer. We can do this using write.table(); below we use a different output directory. Notice that paths always start from the location we set as the working directory. A small control step for checking the written file is sketched after the argument list below.

write.table(growth_data, "../output_examples/growth_data_changed.txt",  sep = "\t", row.names = T, quote =F)

Arguments:

  • sep –> we define what delimiter we want to use
  • row.names = T –> we want to include whatever is in the rownames
  • quote = F –> we do not want R to add any quotes around our columns.
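As a control step, we can read the written file back in and check its dimensions (a sketch, assuming the output_examples folder exists as in the call above):

#read the written file back in and verify the dimensions
check <- read.table("../output_examples/growth_data_changed.txt", sep = "\t", header = TRUE)
dim(check)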

1.13 Useful functions in R

1.13.1 Base functions

R comes with the base package. This package contains the basic functions which let R function as a language: arithmetic, input/output, basic programming support, etc. Its contents are available through inheritance from any environment.

For a complete list of functions, use library(help = "base").

Apart from the elementary operations, common arithmetic functions are available: log, exp, sin, cos, tan, sqrt, etc. Other useful functions one can use on vectors are:

Name Function
max select largest element
min select smallest element
length gives the number of elements
sum sums all elements
mean obtains the mean value
var unbiased sample variance
sort sorts the elements of a vector (see exercise 2c)
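A quick sketch applying some of these functions to a small numeric vector:

#create a small numeric vector and apply some base functions
v <- c(4, 1, 7, 3)
max(v)
min(v)
length(v)
sum(v)
mean(v)
sort(v)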

1.13.2 The unique command

unique() removes duplicate rows, which allows us to summarize our data by certain categories. For very large dataframes this can often simplify things considerably.

Here, if we have a lot of treatments and did the experiment a long time ago, we might want to generate a table that simply lists the treatments.

#make a unique sample list that still contains the nutrient and treatment info
mapping_file <- unique(growth_data[,c("SampleID", "Nutrient", "Condition")])

#view data
head(mapping_file)

1.13.3 The merge command

We can also add additional metadata to our growth data.

One way to do this is with the cbind() or rbind() functions. However, these functions require the two dataframes to have exactly the same number of rows or columns, which we do not have.

Here, the merge() function (part of base R) is very useful to merge data with different dimensions as long as they share a common column (i.e. the SampleID).

First let's build an artificial mapping file that includes a comment column identifying the experiment:

#make mapping that contains our basic sample info
mapping_file <- unique(growth_data[,c("SampleID", "Nutrient", "Condition")])

#add a new column, where we list our experiment ID
mapping_file$Comment <- "FirstExperiment"

#view data
head(mapping_file)

Now we can use this mapping file and merge it with our growth data as follows:

#load the plyr package (not strictly required here; merge() itself is part of base R)
library(plyr)

#merge our mapping file with our growth data
new_data_frame <- merge(growth_data, mapping_file, by = "SampleID")

#view data
head(new_data_frame)

This is now a good example to check that all went fine and that the new dataframe has the same number of rows (=measurements) as the original dataframe.

#control that all went fine
dim(growth_data)
[1] 105   5
dim(new_data_frame)
[1] 105   8

With dim we see that we still have 105 rows (i.e. measurements) and that we now added 3 new columns.

#if there is no match between dataframe 1 and dataframe 2, non-matching rows will by default be dropped. If you want to keep all rows of the first dataframe do:
#new_data_frame <- merge(growth_data, mapping_file, by = "SampleID", all.x = T)

1.14 Combine commands into one line

While this makes code more difficult to read, it can sometimes be useful to combine several commands into one go to condense code. Generally, it is easier to just write line by line, especially if you read your code months later.

What we want to do:

  • in the example above we duplicate the columns for Nutrient and Condition; before merging we might therefore first subset the mapping file to only include the info we want to merge.
  • So our two steps are:
    • trim the mapping file
    • merge

To do this, we use these two lines of code:

#make mapping file more simple
mapping_reduced <- mapping_file[,c("SampleID", "Comment")]

#merge
new_data_frame <- merge(growth_data, mapping_reduced, by = "SampleID")
head(new_data_frame)

Now, this worked fine but requires a bit more code and we need to create one more object.

We could also combine these two lines of code into one line by subsetting our mapping file INSIDE the merge function as follows:

#clean mapping file and merge
new_data_frame <- merge(growth_data, mapping_file[,c("SampleID", "Comment")], by = "SampleID")

#view data
head(new_data_frame)