3 Data transformations
3.1 Dealing with NAs in our data
3.1.1 Removing NAs
NAs are generated when our data contains a missing value. This can become problematic for certain computations and we can decide to remove all NAs.
The function to do this is is.na
.
#create a vector that includes an NA
<- c(1,2,3,NA,5)
y
#check whether we have NAs
is.na(y)
[1] FALSE FALSE FALSE TRUE FALSE
#remove NAs from our data
!is.na(y)] y[
[1] 1 2 3 5
!=
is “not equal to.”- The function
is.na(z)
gives a logical vector of the same size as z with value TRUE if and only if the corresponding element in z is NA. - I.e. in this example we have FALSE FALSE FALSE TRUE FALSE
- When using
y[!is.na(y)]
, we retain the columns wereis.na
is False
3.1.1.1 Replacing NAs with something else
Another option might be to replace a NA with a 0 (or whatever else makes sense in a given context)
#create a vector that includes an NA
<- c(1,2,3,NA,5)
x
#check whether we have NAs
is.na(x)
[1] FALSE FALSE FALSE TRUE FALSE
#remove NAs from our data
is.na(x)] <- 0
x[
#check data
x
[1] 1 2 3 0 5
Notice, that here we use a similar syntax compared to before. Now we do
- Find the NAs (I.e. in this example we have FALSE FALSE FALSE TRUE FALSE)
- If NA is TRUE then replace it with a 0
3.1.2 Replacing NAs in our dataframe
Remember, in the beginning we said that their are NAs in our timecourse data. Let’s have a look at this data and find the rows that contain no values.
#check in which rows we have NAs inside the root length measurements
which(is.na(timecourse$Rootlength))
[1] 29 30 233 234 235 236 237 238 239 240 268 269 270 330 360 374 375 376
[19] 377 378 379 380 381 382 383 384 385 386 387 388 389 390 402 403 404 405
[37] 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 444 445 446
[55] 447 448 449 450 475 476 477 478 479 480 492 493 494 495 496 497 498 499
[73] 500 515 516 517 518 519 520 533 534 535 536 537 538 539 540 550 551 552
[91] 553 554 555 556 557 558 559 560 574 575 605
#check a row with a NA value
29,] timecourse[
#remove rows that contain NAs
<- timecourse[!is.na(timecourse$Rootlength), ]
timecourse_noNA
#check if that worked
which(is.na(timecourse_noNA$Rootlength))
integer(0)
3.2 Subsetting our data
We already have seen important ways to subset data:
- Use of the index by using the square brackets
#subsetting rows and columns using the index
1:3,2:4] growth_data[
- Use of operators
#subset using operators (only print rows if the Nutrient column equals P)
<- growth_data[growth_data$Nutrient == "P", ]
P_data head(P_data)
Now we want to look at some other examples:
3.2.1 grep and grepl
Oftentimes you may need to filter a data set based on a partial character string that is beyond the scope of comparison operators.
R provides such functions (grep and grepl) that match character patterns in specified vector. While both of these functions find patterns, they return different output types based on those patterns.
- grep returns numeric values that correspond to the indexed locations of the patterns
- grepl returns a logical vector in which ‘TRUE’ represents a pattern match.
In our growth data, we only want to print measurements of our controls (i.e. MgCL treatments).
#find the row index values with MgCL treatments using grep
grep("MgCl", growth_data$Condition)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 53 54 55 56 57 58 59 60 61 62 63 64
[26] 65 66 67
#use grepl to search for MgCL treatments, returning FALSE/TRUE statements for each row of our dataframe
#with "TRUE" representing matched patterns
grepl("Strain", growth_data$Condition)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[37] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[49] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
[73] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[85] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[97] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Now lets use this to actually filter our data table for a pattern.
3.2.1.1 Filter usign grep
#use grep
<-timecourse_noNA[grep("MgCl", timecourse_noNA$Condition),]
filter_for_value
#check the first rows of our data
head(filter_for_value)
#check the dimensions of the original dataframe
dim(timecourse_noNA)
[1] 526 5
#check the dimensions of our new dataframe
dim(filter_for_value)
[1] 235 5
3.2.1.2 Filter usign grepl
<-timecourse_noNA[grepl("MgC", timecourse_noNA$Condition),]
filter_for_value
#check the first rows of our data
head(filter_for_value)
#check the dimensions of the original dataframe
dim(timecourse_noNA)
[1] 526 5
#check the dimensions of our new dataframe
dim(filter_for_value)
[1] 235 5
3.2.2 Discard matching columns
#filter data set based on values that do not match the specified pattern (by using the ! symbol)
<- timecourse_noNA[!grepl("MgCl", timecourse_noNA$Condition),]
filter_for_not_a_value
#view dataframe
head(filter_for_not_a_value)
#check the dimensions of the original dataframe
dim(timecourse_noNA)
[1] 526 5
#check the dimensions of our new dataframe
dim(filter_for_not_a_value)
[1] 291 5
Other comments:
- Using regular expressions (programming symbol pattern) will increase their functionality
- Specified patterns are case sensitive (‘t’ does not equal ‘T’)
- Any matching pattern will be returned despite the context in which that pattern is located (i.e., grep(‘the’, data) with return matches for ‘the’, ‘theme’, ‘heather’, ‘breathe’, and so on. This is where regular expressions are useful for specifying where in a string the pattern should appear.
3.2.3 Using regular expressions
A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of regular expressions as wildcards on steroids. You are probably familiar with wildcard notations such as *.txt to find all text files in a file manager.
Regular expressions are explained in the AWK and General notebook. But just to give an example lets just grep Strains that have a 3 letter number
If you want to know more, see this cheat sheet here
<- growth_data[grepl("[0-9]{3}", growth_data$Condition),]
filter_3letters
#check the structure of our data
head(filter_3letters)
#check the dimensions of the original dataframe
dim(timecourse_noNA)
[1] 526 5
#check the dimensions of our new dataframe
dim(filter_3letters)
[1] 49 5
Here, [0-9]
searches for every number from 0-9 and we look for three numbers.
3.2.4 Levels and subsetting
One important thing with subsetting is that the levels are still kept. So with the command above we remove every row that is not Strain101 and Strain230. However, the levels are still kept. Let’s check this:
levels(filter_3letters$Condition)
[1] "MgCl" "Strain101" "Strain230" "Strain28"
So we see that MgCl and Strain28 are still in the levels even if they do not occur in our table itself. Sometimes when working with subsetted dataframes, i.e. when doing stats or plotting, this can interfere with our analysis. Here, it is useful to drop empty levels.
<- droplevels(filter_3letters)
filter_3letters_clean levels(filter_3letters_clean$Condition)
[1] "Strain101" "Strain230"
3.3 Add new columns to our data and combining values in different columns
We can also add new columns into original table, i.e. if we want to not show the fresh weight in mg but in g.
Below you can see that if we have numerical data in a column, we can use normal math operators (like +,-,/)
#convert mg to g
$FW_shoot_g <- growth_data$FW_shoot_mg/10
growth_datahead(growth_data)
#we can also round our data
$FW_shoot_g <- round(growth_data$FW_shoot_mg/10, digits = 2)
growth_datahead(growth_data)
We can also add (or substract, etc …) values from different columns. I.e. here we could calculate the ratios.
#we can also do math with the values in two columns, i.e. if we want to calculate the ration between root length and fresh weight
$ratio <- growth_data$Rootlength/growth_data$FW_shoot_mg
growth_datahead(growth_data)