2  Data objects

In R, everything we work with, whether data or functions, is treated as an object.

Two principles are useful to keep in mind:

Everything that exists is an object.
Everything that happens is a function call.

Below, we introduce the different types of objects.

2.1 Data types

When programming, data, values, etc. are stored as different types:

R has 6 atomic vector types. Below you can find each type with an example:

  • character = "hello"
  • numeric (real or decimal numbers, stored as type double) = 3, 14, …
  • logical = TRUE
  • complex = 1+4i
  • integer = 2L (the L suffix denotes an integer)
  • raw = as.raw(2) (raw bytes, rarely needed for data analysis)

Here is a quick example of how we can find out some things about our objects, using:

  • c() = a function that creates a vector (a one-dimensional array), in our case storing 3 numbers. We need it every time we deal with more than one number, character, etc.
  • class() = what class is our data?
  • length() = how long is our data?
#create some objects
character_object <- "dataset"
number_object <- c(1,4,5)

#asking with what type we work
class(character_object)
[1] "character"
class(number_object)
[1] "numeric"
#ask how long our objects are
length(character_object)
[1] 1
length(number_object)
[1] 3
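The difference between the numeric class and the integer type can be checked directly. The following is a minimal sketch; the object names whole_number and integer_number are made up for illustration.

```r
# a whole number typed without a suffix is stored as a double
whole_number <- 2
# the L suffix asks R for a true integer
integer_number <- 2L

class(whole_number)      # "numeric"
typeof(whole_number)     # "double"
class(integer_number)    # "integer"
typeof(integer_number)   # "integer"

# both count as numbers
is.numeric(whole_number)
is.numeric(integer_number)
```

class() reports how R treats the object, while typeof() reports how it is stored internally.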

2.2 Data structures

There are many types of data structures, the most frequently used ones being:

  • Vectors
  • Factors
  • Matrices
  • Lists
  • Data frames

Certain operations only work on certain kinds of structures. It is therefore important to know what kind of data we are working with.

In R, you do not need to declare the type of data a variable will hold beforehand. You simply do the assignment, and R creates a so-called R object and assigns a data type automatically.

2.2.1 Vectors

A vector is a collection of items of the same type (e.g. characters or numbers). You can put numbers and characters into the same vector; however, if you mix classes, the numbers are converted to characters.

#let's create a vector
a_vector <- c(2, 3, 5, 7, 1) 

#show the vector we just created
a_vector
[1] 2 3 5 7 1
#asking how long your vector is
length(a_vector)
[1] 5
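What happens when classes are mixed in one vector can be seen with class(). This is a small sketch; the names mixed_vector and logical_mix are hypothetical.

```r
# mixing numbers and characters: everything becomes a character
mixed_vector <- c(1, "two", 3)
mixed_vector          # "1" "two" "3"
class(mixed_vector)   # "character"

# mixing logicals and numbers: TRUE/FALSE become 1/0
logical_mix <- c(TRUE, 2, FALSE)
logical_mix           # 1 2 0
class(logical_mix)    # "numeric"
```

R always converts to the most flexible type present, so a single character value is enough to turn a whole vector into characters.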

2.2.1.1 Vector indexing

If we want to only retrieve part of the data stored in a vector we can create a subset using the index as shown below.

  • square brackets [] = allow us to retrieve certain elements of a vector, i.e. [3] retrieves the 3rd element
  • we can combine c() and [] if we want to retrieve several elements of a vector.
#retrieve the third element stored in a vector
a_vector[3]
[1] 5
#retrieve the 1st and 3rd element by combining c() and []
a_vector[c(1,3)]
[1] 2 5
#retrieve the 1st to 3rd element
a_vector[1:3]
[1] 2 3 5
#we can also add vectors of the same length together
x <- c(1,2,3,4)
y <- c(1,2,3,4)

#and now we can combine our vectors
x + y
[1] 2 4 6 8

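Beware: If we add two vectors of different lengths, the elements of the shorter vector are recycled (repeated) until it matches the length of the longer one. This works silently only if the length of the longer vector is a multiple of the length of the shorter one; otherwise R still recycles but issues a warning.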

#adding vectors of different lengths
x <- c(1,2)
y <- c(1,2,3,4)

#and now we can combine our vectors
x + y
[1] 2 4 4 6
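The case where the lengths do not fit can be sketched as follows. The names short_vector, long_vector, warning_text and recycled_sum are made up for this illustration.

```r
# lengths 3 and 4: 4 is not a multiple of 3
short_vector <- c(1, 2, 3)
long_vector <- c(1, 2, 3, 4)

# capture the warning text instead of only printing it
warning_text <- tryCatch(
  short_vector + long_vector,
  warning = function(w) conditionMessage(w)
)
warning_text

# the sum is still computed: the 1st element of short_vector is reused
recycled_sum <- suppressWarnings(short_vector + long_vector)
recycled_sum   # 2 4 6 5
```

Because R only warns and still returns a result, it is worth checking length() of both vectors before combining them.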

Another way to extend vectors is:

  • append() = adds elements to a vector
#add another datapoint to our vector
a_vector <- append(a_vector, 13)
a_vector
[1]  2  3  5  7  1 13
#add 1 to every element of our vector
a_vector <- a_vector + 1
a_vector
[1]  3  4  6  8  2 14
#remove the first element of our vector
a_vector <- a_vector[-1]
a_vector
[1]  4  6  8  2 14

We can extract elements not only by position: if the elements of a vector have names, we can also use these names to retrieve data:

#create a vector and give it names (i.e. for counts from some microbes)
x <- c(300, 410, 531)
names(x) <- c("Ecoli","Archaeoglobus","Ignicoccus")

#check how our data looks
x
        Ecoli Archaeoglobus    Ignicoccus 
          300           410           531 
#now we can retrieve part of the vector using the names
x[c("Ecoli","Ignicoccus")]
     Ecoli Ignicoccus 
       300        531 

2.2.1.2 Changing vectors

We can also change elements in our vector:

#create a vector
x <- 1:10

#change the last two positions to 5 and 9
x[9:10] <- c(5,9)

#check if this worked
x
 [1] 1 2 3 4 5 6 7 8 5 9
#we can not only change elements, we can also remove them using the minus symbol
#e.g. let's remove the third element of our vector
x[-3]
[1] 1 2 4 5 6 7 8 5 9
#if we want to remove more than one element we can combine the minus symbol with c()
#lets remove elements 4 until (and including) 9
x[-c(4:9)]
[1] 1 2 3 9
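Note that removing elements with the minus symbol only returns a shortened copy. A minimal sketch, using a hypothetical vector called values:

```r
values <- 1:10

# negative indexing returns a shortened copy ...
values[-3]
# ... but the original vector still has all 10 elements
length(values)   # 10

# assign the result back to actually drop the element
values <- values[-3]
length(values)   # 9
```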

2.2.2 Matrix

Matrices are the R objects in which the elements are arranged in a two-dimensional rectangular layout. They contain elements of the same type. Although you can construct matrices with characters or logicals, matrices are generally used to store numeric data.

The basic syntax for creating a matrix is:

matrix(data, nrow, ncol, byrow, dimnames)

  • data: input vector whose components become the data elements from the matrix.
  • nrow: number of rows to be created.
  • ncol: number of columns to be created.
  • byrow: logical. If FALSE (the default), the matrix is filled by columns; otherwise the matrix is filled by rows.
  • dimnames: a ‘dimnames’ attribute for the matrix: NULL or a list of length 2 giving the row and column names respectively.

In contrast, the columns of a data frame (see below) can contain different types of data, while all elements of a matrix have the same type. A matrix in R is like a mathematical matrix, containing all the same type of thing (usually numbers). Many, but not all, R functions accept dataframes and matrices interchangeably.

  • Individual elements in a matrix can be printed using [row,column]. For example [2,3] would pull out the value in the 2nd ROW and third COLUMN.

  • dim() is extremely useful to check whether our data was transformed correctly during different operations. For example, after we merge two files we would like to know that they still have the same number of rows as when we started the analysis. Likewise, if we remove, for example, 10 samples, we want to make sure that this is indeed what happened.

  • head() is another useful function to check the first rows of a larger matrix (or dataframe).

  • tail() is the same as head() but shows the last rows.

Let’s start by creating a matrix with 3 columns and 4 rows (so 12 data points in total):

#define our row and column names
row_names <- c("row1", "row2", "row3", "row4")
col_names <- c("col1", "col2", "col3")

#create our matrix (check the help page with ?matrix to see what is happening)
matrix_A <- matrix(1:12, nrow = 4, byrow = TRUE, dimnames = list(row_names, col_names))

#check what our matrix looks like
matrix_A
     col1 col2 col3
row1    1    2    3
row2    4    5    6
row3    7    8    9
row4   10   11   12
#print the value in the 2nd row and 3rd column
matrix_A[2,3]
[1] 6
#print the values in the 3rd column
matrix_A[,3]
row1 row2 row3 row4 
   3    6    9   12 
#print everything except the 1st row
matrix_A[-1,]
     col1 col2 col3
row2    4    5    6
row3    7    8    9
row4   10   11   12
#print everything except the 2nd column
matrix_A[,-2]
     col1 col3
row1    1    3
row2    4    6
row3    7    9
row4   10   12
#see the dimensions of matrix, i.e. the nr of rows and columns
dim(matrix_A)
[1] 4 3
#check the first rows of our matrix, since our data is small, everything is shown
head(matrix_A)
     col1 col2 col3
row1    1    2    3
row2    4    5    6
row3    7    8    9
row4   10   11   12

2.2.3 Lists

Sometimes you need to store data of different types. For example, if you are collecting cell counts, you might want to store the cell counts (numeric), the microbes investigated (character), their status (logical, with TRUE for alive and FALSE for dead), and so on. This kind of data can be stored in lists. Lists are R objects that can contain elements of different types (numbers, strings, vectors, even another list or a matrix).

A list is created using the list() function.

For example, the following variable our_list is a list containing copies of three vectors n, s and b.

#define our vectors
n <- c(20, 30, 50)
s <- c("Ecoli", "Archaeoglobus", "Bacillus")
b <- c(TRUE, FALSE, TRUE)

#combine the vectors in a list
our_list <- list(counts = n, strain = s, status = b)

#show our list
our_list
$counts
[1] 20 30 50

$strain
[1] "Ecoli"         "Archaeoglobus" "Bacillus"     

$status
[1]  TRUE FALSE  TRUE
#retrieve the second element as a sub-list
our_list[2]
$strain
[1] "Ecoli"         "Archaeoglobus" "Bacillus"     
#retrieve the 2nd and 3rd member of our list
our_list[c(2, 3)] 
$strain
[1] "Ecoli"         "Archaeoglobus" "Bacillus"     

$status
[1]  TRUE FALSE  TRUE
#we can also retrieve elements of a list if we know the name using two different ways:
our_list$strain
[1] "Ecoli"         "Archaeoglobus" "Bacillus"     
our_list[["strain"]]
[1] "Ecoli"         "Archaeoglobus" "Bacillus"     

In the last example we use the $ (dollar) symbol to extract data by name, i.e. to extract a named element of a list or a column of a dataframe (it does not work on matrices). Above, the object we want to access is ‘our_list’ and the element we want to extract is strain.
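Single and double brackets behave differently on lists, which matters as soon as we want to calculate with an element. A minimal sketch with a hypothetical list called microbe_list:

```r
# a small list, mirroring the one above
microbe_list <- list(counts = c(20, 30, 50),
                     strain = c("Ecoli", "Archaeoglobus", "Bacillus"))

# single brackets return a (shorter) list
class(microbe_list["counts"])     # "list"

# double brackets and $ return the element itself
class(microbe_list[["counts"]])   # "numeric"
class(microbe_list$counts)        # "numeric"

# so calculations need [[ ]] or $
mean(microbe_list[["counts"]])
```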

2.2.4 Dataframes

Dataframes are tables in which each column contains values of one variable type and each row contains one set of values from each column. You can think of a data frame as a list of vectors of equal length. Most of our data very likely will be stored as dataframes.

A Dataframe usually follows these rules:

  • The top line of the table, called the header, contains the column names.
  • Column names (i.e. the header of our data) should be non-empty (if they are empty, R assigns default names).
  • Row names should be unique.
  • Each column should contain the same number of data items.
  • Each horizontal line after the header is a data row, which begins with the name of the row, and then followed by the actual data.
  • Each data member of a row is called a cell.

Importantly, most of the things we have learned before, i.e. how to subset data, apply here too.
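As a small illustration of a dataframe being a set of equal-length vectors, here is a sketch that builds one by hand. The object toy_data and its columns are made up and are not the growth data used in this chapter.

```r
# hypothetical toy data, not the growth data used in this chapter
toy_data <- data.frame(
  strain = c("Ecoli", "Archaeoglobus", "Bacillus"),
  counts = c(20, 30, 50),
  alive  = c(TRUE, FALSE, TRUE)
)

dim(toy_data)     # 3 rows, 3 columns
toy_data[2, ]     # the second row
toy_data$counts   # one column, selected by name

# rows where alive is TRUE, showing only the strain column
alive_strains <- toy_data[toy_data$alive, "strain"]
alive_strains
```

The bracket and dollar notation is the same as for vectors, matrices and lists above.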

The growth data that we have read into R will be used to explain how dataframes work.

2.2.4.1 Viewing data in Dataframes

  • We can use the brackets as before to extract certain rows or columns.
  • We can use the dollar sign to again extract information as long as we know the column names. For example, growth_data$FW_shoot_mg accesses the shoot fresh weight (FW_shoot_mg) in our ‘growth_data’ dataframe.
  • colnames() allows us to access the column names, i.e. the headers.
  • rownames() allows us to access the rownames of our data (usually these are numbered if not specified otherwise while reading the table).
  • dim() allows us to check the dimensions (i.e. the number of rows and columns). It is useful to check this regularly, especially after we have modified our data.
  • head() shows the first rows of our dataframe
#view our table
head(growth_data)
#check how many rows and columns our data has
dim(growth_data)
[1] 105   5
#extract the data from the 2nd row
growth_data[2,]
#extract the first three columns
head(growth_data[,1:3])
#extract a column of our data using the column name
#combine it with the unique function, to remove duplicates
unique(growth_data$Condition)
[1] "MgCl"      "Strain101" "Strain230" "Strain28" 
#print our headers
colnames(growth_data)
[1] "SampleID"    "Nutrient"    "Condition"   "FW_shoot_mg" "Rootlength" 
#print the rownames
rownames(growth_data)
  [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11"  "12" 
 [13] "13"  "14"  "15"  "16"  "17"  "18"  "19"  "20"  "21"  "22"  "23"  "24" 
 [25] "25"  "26"  "27"  "28"  "29"  "30"  "31"  "32"  "33"  "34"  "35"  "36" 
 [37] "37"  "38"  "39"  "40"  "41"  "42"  "43"  "44"  "45"  "46"  "47"  "48" 
 [49] "49"  "50"  "51"  "52"  "53"  "54"  "55"  "56"  "57"  "58"  "59"  "60" 
 [61] "61"  "62"  "63"  "64"  "65"  "66"  "67"  "68"  "69"  "70"  "71"  "72" 
 [73] "73"  "74"  "75"  "76"  "77"  "78"  "79"  "80"  "81"  "82"  "83"  "84" 
 [85] "85"  "86"  "87"  "88"  "89"  "90"  "91"  "92"  "93"  "94"  "95"  "96" 
 [97] "97"  "98"  "99"  "100" "101" "102" "103" "104" "105"

When we print the rownames, we see that we have numbers from 1 to 105. This is the default way rownames are generated when a table is read into R. As a general rule, if you want to have other rownames, these must be unique.

2.2.4.2 Adding new columns to Dataframes

Below is a very basic way to add a new column (we name it newColumn) and fill all rows with the word comment:

#add a new column to a dataframe (the functions data.frame() or cbind() can also be used)
growth_data$newColumn <- "comment"

#check if that worked
head(growth_data)

There are more sophisticated ways to add columns based on conditions or even merge dataframes. Some of these we will discuss later.
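As a first impression of a condition-based column, here is a sketch using ifelse() and cbind(). The dataframe toy_data, its columns and the threshold of 10 are invented for illustration only.

```r
# hypothetical toy data, not the growth data used in this chapter
toy_data <- data.frame(sample = c("s1", "s2", "s3"),
                       weight = c(10.3, 6.5, 12.2))

# fill a new column with one value per row, based on a condition
toy_data$size <- ifelse(toy_data$weight > 10, "large", "small")

# cbind() adds a column as well and returns a new dataframe
toy_data <- cbind(toy_data, batch = c(1, 2, 3))

toy_data
```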

2.3 Check the structure of our data

If we read in our own data, we should check what kind of object our table is stored as. We have several ways to do this:

  • class() = shows what kind of object we have
  • str() = displays the internal structure of an R object
#check what kind of data we have:
class(growth_data)
[1] "data.frame"
#check how the different parts of our data are stored
str(growth_data)
'data.frame':   105 obs. of  6 variables:
 $ SampleID   : chr  "noP" "noP" "noP" "noP" ...
 $ Nutrient   : chr  "noP" "noP" "noP" "noP" ...
 $ Condition  : chr  "MgCl" "MgCl" "MgCl" "MgCl" ...
 $ FW_shoot_mg: num  10.26 6.52 12.17 11.37 9.8 ...
 $ Rootlength : num  5.93 5.74 6.83 6.74 6.74 ...
 $ newColumn  : chr  "comment" "comment" "comment" "comment" ...

We see that

  • our data is stored in a dataframe
  • the columns are stored as different types, i.e. numeric and character
  • our data contains 105 observations and 6 variables

2.4 Factors

Factors are data objects that are used to represent categorical data and store it in its different levels. They are an important class for statistical analysis and for plotting. Factors are stored as integers, and have labels associated with these unique integers. Once created, factors can only contain a pre-defined set of values, known as levels. By default, R sorts levels in alphabetical order.

  • factor() allows us to create our own factor
#let's make a vector
Nutrients <- c("P", "P", "noP", "noP")

#let's make our own simple factor
Nutrients_factor <- factor(Nutrients)

#let's compare the vector and factor we generated
Nutrients
[1] "P"   "P"   "noP" "noP"
Nutrients_factor
[1] P   P   noP noP
Levels: P noP

When we check our factor, we see that R assigns one level to P and another level to noP. R sorts the levels alphabetically, so on many systems noP comes before P, even though in the initial code P came before noP.

Notice: The output above lists P before noP. The alphabetical order depends on the locale in which R runs (for example, the C locale sorts uppercase letters before lowercase ones), so the order on your system may differ.

2.4.1 Checking the behaviour of factors

Now, let's check how factors behave.

  • levels() = only prints the levels of a given factor. We can also run this on any column of our dataframe.
  • nlevels() = check how many levels we have.
  • While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings. We can test this by looking at what type of object we generated.
#only print the levels
levels(Nutrients_factor)
[1] "P"   "noP"
#check how many levels we have
nlevels(Nutrients_factor)
[1] 2
#what class do we have
class(Nutrients_factor)
[1] "factor"
typeof(Nutrients_factor)
[1] "integer"
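The integers under the hood can be made visible with as.integer(). In this sketch the levels are set explicitly so that the result does not depend on the system; the name nutrient_factor is hypothetical.

```r
# levels are given explicitly so the result does not depend on the locale
nutrient_factor <- factor(c("P", "P", "noP", "noP"), levels = c("noP", "P"))

# the integer codes: the position of each value in levels()
as.integer(nutrient_factor)   # 2 2 1 1

# the labels that belong to the codes
levels(nutrient_factor)[as.integer(nutrient_factor)]
```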

2.4.2 Ordering factor levels

For some analyses and plots the order of the levels matters, and then we need to order the levels ourselves.

#check our levels
levels(Nutrients_factor)
[1] "P"   "noP"
#reorder levels
Nutrients_factor_reordered <- factor(Nutrients_factor, levels = c("P", "noP"))

#check our levels
levels(Nutrients_factor_reordered)
[1] "P"   "noP"
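To see what a different level order changes, and what it leaves untouched, consider this sketch. The names nutrients, p_first and nop_first are made up for illustration.

```r
nutrients <- c("P", "P", "noP", "noP", "noP")

# two factors with the same values but a different level order
p_first <- factor(nutrients, levels = c("P", "noP"))
nop_first <- factor(nutrients, levels = c("noP", "P"))

# the values are identical ...
identical(as.character(p_first), as.character(nop_first))   # TRUE

# ... but summaries follow the level order
names(table(p_first))     # "P" "noP"
names(table(nop_first))   # "noP" "P"
```

Setting levels explicitly also makes the order independent of the system on which the code runs.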

2.4.3 Converting factors

Sometimes you need to explicitly convert factors to either text or numbers, or numbers to characters, etc. To do this, you use functions such as as.character() or as.numeric().

#convert our factor to a character
Nutrients_characters <- as.character(Nutrients_factor)
Nutrients_characters
[1] "P"   "P"   "noP" "noP"
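Care is needed when a factor holds labels that look like numbers, because as.numeric() returns the integer codes rather than the labels. A minimal sketch with a hypothetical factor called dose_factor:

```r
# a factor whose labels look like numbers
dose_factor <- factor(c("10", "50", "10", "100"), levels = c("10", "50", "100"))

# as.numeric() returns the integer codes, not the labels
as.numeric(dose_factor)                 # 1 2 1 3

# converting to character first recovers the original numbers
as.numeric(as.character(dose_factor))   # 10 50 10 100
```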