5 Statistics in R
A lot of the helpful information used here was inspired by a more detailed site that can be found here
And please take note that the writer of this tutorial is no statistician, so if you spot issues feel free to let me know!
5.1 Summary stats
To do some basic stats we use the nutrient growth data set, which you find in the R_exercises folder.
mean()
= print the mean of a certain column
summary()
= print the mean, median, max, min, … of a certain column.
#get the means of our root length
mean(growth_data$Rootlength)
[1] 9.046117
#return summary stats for the root length
summary(growth_data$Rootlength)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.236 6.721 8.099 9.046 11.793 14.716
We can also summarize data of several columns
#return summary stats for the root length and the shoot fresh weight
summary(growth_data[,c("Rootlength", "FW_shoot_mg")])
Rootlength FW_shoot_mg
Min. : 4.236 Min. : 3.75
1st Qu.: 6.721 1st Qu.: 7.69
Median : 8.099 Median :11.62
Mean : 9.046 Mean :15.23
3rd Qu.:11.793 3rd Qu.:22.05
Max. :14.716 Max. :39.39
We can also run summary() on our complete dataframe.
#run summary on all our data
summary(growth_data)
SampleID Nutrient Condition FW_shoot_mg
Length:105 Length:105 Length:105 Min. : 3.75
Class :character Class :character Class :character 1st Qu.: 7.69
Mode :character Mode :character Mode :character Median :11.62
Mean :15.23
3rd Qu.:22.05
Max. :39.39
Rootlength
Min. : 4.236
1st Qu.: 6.721
Median : 8.099
Mean : 9.046
3rd Qu.:11.793
Max. :14.716
This way we not only get summary stats for our numeric values, but also some basic information (length, class, mode) for the character columns.
5.2 The table command
The function table builds a contingency table of the counts at each factor level, or combination of factor levels. This is quite useful to count the number of data points with certain metadata associated. So for example for our growth data we can ask how many measurements we made for each sample ID.
#summarize how many measurements we have for each treatment
table(growth_data$SampleID)
P P_101 P_230 P_28 noP noP_101 noP_230 noP_28
15 12 13 13 13 13 11 15
Here, we see that we have a slightly different number of measurements for each condition and timepoint. For this tutorial we can ignore this, but large differences might be relevant for some statistical analyses. For larger datasets, this approach allows you to easily check for samples that might be outliers in terms of the number of measurements per sample.
5.3 The ddply command
The table command can get rather slow for large datasets, and there are some useful packages to speed things up and run more complicated operations. One example is the ddply()
function of the plyr package. A useful feature of ddply() is that it stores the output as a dataframe instead of a table.
Below we see examples where we summarize the root length data across different nutrient conditions. I.e. we do not want the mean of all root lengths, but want to see whether roots have different lengths under low and high P conditions.
#load our package
library(plyr)
#calculate the mean root length across our nutrient conditions
growth_data_summarized <- ddply(growth_data, .(Nutrient), summarize, Mean_RootLength = mean(Rootlength))
#view data
head(growth_data_summarized)
The structure of the command is as follows:
ddply(Input_Data, .(Column_we_want_to_summarize_across), summarize, New_Column_name = mean(Column_we_want_to_calc_the_mean_for))
We can also summarize the data combining different conditions (i.e. nutrient and condition).
#Summarize our data across Nutrients and Conditions
growth_data_summarized <- ddply(growth_data, .(Nutrient, Condition), summarize, Mean_RootLength = mean(Rootlength))
#view data
head(growth_data_summarized)
We can also calculate the mean, sd and se in one line of code
#and we can use even fancier functions (i.e. get the se and sd), check the plyr package for more details
growth_data_summarized <- ddply(growth_data, .(Nutrient, Condition), summarize, RootLength = mean(Rootlength), sd = sd(Rootlength), se = sd(Rootlength) / sqrt(length(Rootlength)))
#view data
head(growth_data_summarized)
5.4 tidyr
tidyr is another package to summarize data but also to transform data. For now we will just discuss the basics of summarizing data, but we will try to add an extended section on this package later.
You notice here that the syntax is a bit unusual and we use the %>%
symbol (a so-called forward pipe operator). This symbol is commonly used in the dplyr and tidyr packages. It passes the left-hand side of the operator to the first argument of the function on the right-hand side.
In the following examples, the growth_data dataframe gets passed as the first argument to count()
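As a minimal toy illustration of how the pipe works (using the magrittr package, which provides %>% and is loaded automatically with dplyr/tidyverse; the vector here is made up for illustration):

```r
#load the package that provides %>% (also comes with dplyr/tidyverse)
library(magrittr)

#toy vector, just for illustration
x <- c(1, 4, 9)

#the pipe passes the left-hand side as the first argument of the function:
#x %>% sqrt() is the same as sqrt(x)
x %>% sqrt()
identical(x %>% sqrt(), sqrt(x))
```

The same idea applies below: growth_data %>% count(...) is just count(growth_data, ...).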
New functions:
count()
= a function of the dplyr package that counts the number of rows for each combination of the columns we specify (i.e. roughly, the columns we want to keep).
summarize()
= creates a new data frame with one (or more) rows for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input. It will contain one column for each grouping variable and one column for each of the summary statistics that you have specified.
mean()
= calculate the arithmetic mean.
#read in package
library(tidyverse)
#count how many measurements we have per conditions
growth_data %>% count(Condition, sort = FALSE)
#count how many measurements we have per nutrient and conditions
growth_data %>% count(Nutrient, Condition, sort = FALSE)
#get more comprehensive data stats and summarize for the whole dataset
growth_data %>% summarise(
  count = n(),
  mean_root = mean(Rootlength, na.rm = TRUE))
Good things about tidyr are:
- it is extremely fast, which is important when dealing with larger datasets
- we can combine several commands by using the pipe
I.e. below we first group the input data and then summarize the grouped data.
#get more comprehensive data stats and summarize and grouping for the different conditions
growth_data %>%
  group_by(Nutrient, Condition) %>%
  summarise(
    count = n(),
    mean_root = mean(Rootlength, na.rm = TRUE), .groups = 'drop')
We can also do other stats than just calculating the mean:
#get more comprehensive data stats and summarize and grouping for the different conditions
growth_data %>%
  group_by(Nutrient, Condition) %>%
  summarise(
    count = n(),
    mean_root = mean(Rootlength, na.rm = TRUE),
    sd_root = sd(Rootlength), .groups = 'drop')
Other useful summary functions:
mean(x)
: sum of x divided by the length
median(x)
: 50% of x is above and 50% is below
sd(x)
: standard deviation
IQR(x)
: interquartile range (robust equivalent of sd when outliers are present in the data)
mad(x)
: median absolute deviation (robust equivalent of sd when outliers are present in the data)
min(x)
: minimum value of x
max(x)
: maximum value of x
quantile(x, 0.25)
: 25% of x is below this value
first(x)
: equivalent to x[1]
nth(x, 2)
: equivalent to n <- 2; x[n]
last(x)
: equivalent to x[length(x)]
n()
: the number of elements in the current group
sum(!is.na(x))
: count non-missing values
n_distinct(x)
: count the number of unique values
sum(x > 10)
: count the number of elements where x > 10
mean(y == 0)
: proportion of elements where y = 0
5.5 More stats: tapply
tapply() is a base R function that applies a function to each cell of a ragged array, that is, to each (non-empty) group of values given by a unique combination of the levels of certain factors. It can be used as an alternative to ddply().
Suppose now we want to estimate the mean root length for each growth condition. Notice how we can vary the index.
#index = factors we want to select
tapply(X = growth_data$Rootlength, INDEX = growth_data$Nutrient, FUN = mean,na.rm = TRUE)
P noP
11.513931 6.530845
#same but for two-way tables (this is not so useful here, but might be handy when you have different conditions for the same organism or a timecourse)
tapply(growth_data$Rootlength, INDEX = list(growth_data$Nutrient, growth_data$Condition),FUN = mean, na.rm = TRUE)
MgCl Strain101 Strain230 Strain28
P 9.853457 12.729133 11.814678 12.007390
noP 5.805198 7.094616 7.172883 6.200309
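tapply() is not limited to mean(); any summary function can be plugged in via the FUN argument. A toy sketch (made-up numbers, not our growth data):

```r
#toy data: lengths for two groups
len <- c(5.1, 6.2, 7.3, 11.0, 12.5, 11.8)
grp <- c("noP", "noP", "noP", "P", "P", "P")

#FUN = length reproduces the counts that table(grp) would give
tapply(len, grp, FUN = length)

#FUN = sd gives the group-wise standard deviation
tapply(len, grp, FUN = sd)
```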
5.6 Linear models
5.6.1 Basics
It is very simple to investigate linear relationships among variables in R. We want to estimate how a quantitative dependent variable changes according to the levels of one or more independent variables. In the command below, the linear relationship between Rootlength (the dependent variable, i.e. the one we want to explain) and FW_shoot_mg (the independent variable) is estimated.
We need to use the function summary to see the results of that command; coef extracts the best-fit coefficients, anova performs an analysis of variance; there are many other extractor functions.
In this case let's work with the data where we do not have to deal with the different time points, to simplify things.
#read in data where we have several measurements
growth_data <- read.table("../data/Growth_Data.txt", sep="\t", header=T, fill=TRUE, quote = "")
#is there a correlation between freshweight and root length?
linearmodel <- lm(Rootlength ~ FW_shoot_mg, data = growth_data)
linearmodel
Call:
lm(formula = Rootlength ~ FW_shoot_mg, data = growth_data)
Coefficients:
(Intercept) FW_shoot_mg
5.2912 0.2465
#let's extract the entire t-table
summary(linearmodel)
Call:
lm(formula = Rootlength ~ FW_shoot_mg, data = growth_data)
Residuals:
Min 1Q Median 3Q Max
-4.5263 -0.9978 -0.0181 1.0094 3.9292
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.29124 0.30469 17.37 <2e-16 ***
FW_shoot_mg 0.24648 0.01712 14.40 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.615 on 103 degrees of freedom
Multiple R-squared: 0.6681, Adjusted R-squared: 0.6649
F-statistic: 207.4 on 1 and 103 DF, p-value: < 2.2e-16
Here, one of the values is the model p-value (bottom last line), and another is the p-value of the individual predictor variables (extreme right column, under Pr(>|t|)).
- Residuals: This section summarizes the residuals, the error between the prediction of the model and the actual results. Smaller residuals are better.
- Coefficients: For each variable and the intercept, a weight is produced and that weight has other attributes like the standard error, a t-value and significance.
- Estimate: This is the weight given to the variable. In the simple regression case (one variable plus the intercept), for every unit increase in shoot fresh weight, the model predicts an increase in root length of 0.24.
- Std. Error: Tells you how precisely the estimate was measured. It's really only useful for calculating the t-value.
- t-value and Pr(>|t|): The t-value is calculated by taking the coefficient divided by the Std. Error. It is then used to test whether or not the coefficient is significantly different from zero. If it isn't significant, then the coefficient really isn't adding anything to the model and could be dropped or investigated further. Pr(>|t|) is the significance level.
Performance Measures:
Three sets of measurements are provided.
- Residual Standard Error: This is the standard deviation of the residuals. Smaller is better.
- Multiple / Adjusted R-squared: For one variable, the distinction doesn't really matter. R-squared shows the amount of variance explained by the model. Adjusted R-squared takes into account the number of variables and is most useful for multiple regression.
- F-Statistic: The F-test checks if at least one variable's weight is significantly different from zero. This is a global test to help assess a model. If the p-value is not significant (e.g. greater than 0.05) then your model is essentially not doing anything.
We also can just print parts of the data:
#print only the coefficients
coef(summary(linearmodel))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.2912436 0.30469198 17.36588 2.278273e-32
FW_shoot_mg 0.2464783 0.01711628 14.40023 2.039703e-26
#print only the anova stats
anova(linearmodel)
#plot the data and add the best-fit line
with(growth_data, plot(Rootlength ~ FW_shoot_mg, col = 2))
abline(linearmodel)
If we look at the stats and the p value we see a nice correlation but also that we have two distinct clusters as well as more spread in the cluster that is more to the right. These clusters likely are the two different nutrient conditions and sometimes it might make sense to separate data to get a clearer picture. Something else to consider is to ask whether the data is normally distributed and based on that what statistical test to choose.
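A fitted model can also be used to predict new values with predict(). A minimal sketch on toy data (a perfectly linear relationship, so the coefficients come out exact; the numbers are made up, not our growth data):

```r
#toy data: y = 3 + 2x exactly
d <- data.frame(x = 1:10, y = 3 + 2 * (1:10))
fit <- lm(y ~ x, data = d)

coef(fit)                                  #intercept 3, slope 2
predict(fit, newdata = data.frame(x = 11)) #3 + 2 * 11 = 25
```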
5.6.2 Analysing residuals
Anyone can fit a linear model in R. The real test is analyzing the residuals (the error or the difference between actual and predicted results).
There are four things we are looking for when analyzing residuals.
- The mean of the errors is zero (and the sum of the errors is zero)
- The distribution of the errors are normal.
- All of the errors are independent.
- Variance of errors is constant (Homoscedastic)
In R, you pull out the residuals by referencing the model and then the resid variable inside the model. Below we use the simple linear regression model (stored as simple.fit) to plot and inspect them.
Below some examples:
simple.fit <- linearmodel
layout(matrix(c(1,1,2,3),2,2,byrow=T))
#Rootlength x Residuals Plot
plot(simple.fit$resid ~ growth_data$Rootlength,
     main="Rootlength x Residuals\nfor Simple Regression",
     xlab="Rootlength", ylab="Residuals")
abline(h=0,lty=2)
#Histogram of Residuals
hist(simple.fit$resid, main="Histogram of Residuals",
ylab="Residuals")
#Q-Q Plot
qqnorm(simple.fit$resid)
qqline(simple.fit$resid)
The histogram and Q-Q plot are the ways to visually evaluate whether the residuals fit a normal distribution.
- If the histogram looks like a bell-curve it might be normally distributed.
- If the QQ-plot has the vast majority of points on or very near the line, the residuals may be normally distributed.
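As a formal complement to these visual checks, the Shapiro-Wilk test (covered in section 5.7.3 below) can also be run directly on the residuals; a toy sketch on simulated data:

```r
#toy data: linear trend plus normal noise (seed set for reproducibility)
set.seed(1)
d <- data.frame(x = 1:30)
d$y <- d$x + rnorm(30)
fit <- lm(y ~ x, data = d)

#a large p-value means no evidence against normality of the residuals
shapiro.test(resid(fit))
```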
5.7 Normal distribution
5.7.1 Visualize our data via density plots
There are different ways to visualize this, one example is ggdensity of the ggpubr package.
library("ggpubr")
Attaching package: 'ggpubr'
The following object is masked from 'package:plyr':
mutate
#is the data for my different variables normally distributed
ggdensity(growth_data$Rootlength)
We see nicely that we have two tails that likely represent the two nutrient conditions. To test this, we can simply subset the data as we have done before.
ggdensity(growth_data, x = "Rootlength",
add = "mean", rug = TRUE,
color = "Nutrient", palette = c("#00AFBB", "#E7B800"))
Now we see that the two tails indeed seem to be due to our two nutrient conditions.
5.7.2 Visualize our data via Q-Q plots
Another way to represent the data is a Q-Q plot: a Q-Q plot (or quantile-quantile plot) draws the correlation between a given sample and the normal distribution. A 45-degree reference line is also plotted.
#plot all data
ggqqplot(growth_data$Rootlength)
#plot by group
ggqqplot(growth_data, x = "Rootlength",
color = "Nutrient", palette = c("#00AFBB", "#E7B800"))
Again, here we see that our data for the individual growth conditions fit quite nicely into a normal distribution.
5.7.3 Test for normality
#for all data
shapiro.test(growth_data$Rootlength)
Shapiro-Wilk normality test
data: growth_data$Rootlength
W = 0.91773, p-value = 6.744e-06
#separate nutrient conditions
noP_data <- growth_data[growth_data$Nutrient == "noP", ]
#test for noP only
shapiro.test(noP_data$Rootlength)
Shapiro-Wilk normality test
data: noP_data$Rootlength
W = 0.96403, p-value = 0.1171
We can use dplyr and the dlookr package for doing a group-wise comparison.
library(dplyr)
library(dlookr)
Attaching package: 'dlookr'
The following object is masked from 'package:tidyr':
extract
The following object is masked from 'package:base':
transform
growth_data %>%
  group_by(Nutrient) %>%
  normality()
The shapiro.test tests the null hypothesis that the samples came from a normal distribution. This means that if your p-value is <= 0.05, you would reject the null hypothesis that the samples came from a normal distribution.
From the output, the p-value < 0.05 for our complete dataset implies that the distribution of the data is significantly different from a normal distribution. In other words, we can not assume normality. However, we expect quite some differences depending on the growth pattern, and once we only look at our low P data we see that our data indeed is normally distributed.
Notice: Shapiro-Wilk only works for sample sizes between 3 and 5000, since the more data you feed it, the larger the chance of the null hypothesis being rejected becomes. An alternative is the Anderson-Darling test, which however has a similar problem as the Shapiro-Wilk test: for large samples, you are most likely to reject the null hypothesis, so be aware of this.
library(nortest)
ad.test(noP_data$Rootlength)
Anderson-Darling normality test
data: noP_data$Rootlength
A = 0.58838, p-value = 0.1187
5.8 ANOVA
When we visualize the data, we see that there is a difference between the nutrient conditions but we want to know whether it is significant and more importantly, whether there is also a difference based on our treatments with different strains of microbes.
#lets visually compare our data with ggpubr again
library("ggpubr")
ggboxplot(growth_data, x = "Condition", y = "Rootlength", color = "Nutrient",
palette = c("#00AFBB", "#E7B800"))
# We want to know whether root length depends on nutrient treatment
aov <- aov(Rootlength ~ Nutrient, data = growth_data)
summary(aov)
Df Sum Sq Mean Sq F value Pr(>F)
Nutrient 1 651.8 651.8 425 <2e-16 ***
Residuals 103 157.9 1.5
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here, we see that there are significant differences based on our nutrient treatments. Now let's see how we can look at both the nutrient treatment and the growth conditions.
# We want to know if root length depends on condition and nutrient
aov <- aov(Rootlength ~ Nutrient + Condition, data = growth_data)
summary(aov)
Df Sum Sq Mean Sq F value Pr(>F)
Nutrient 1 651.8 651.8 712.58 < 2e-16 ***
Condition 3 66.5 22.2 24.23 7.27e-12 ***
Residuals 100 91.5 0.9
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA table we can conclude that both nutrient condition and treatment are statistically significant. Nutrient treatment is the most significant factor variable.
Note that the above fitted model is called an additive model. It makes the assumption that the two factor variables are independent. If you think that these two variables might interact to create a synergistic effect, replace the plus symbol (+) with an asterisk (*), as follows.
# Two-way ANOVA with interaction effect
# These two calls are equivalent
aov <- aov(Rootlength ~ Nutrient * Condition, data = growth_data)
aov <- aov(Rootlength ~ Nutrient + Condition + Nutrient:Condition, data = growth_data)
#summarize the aov
summary(aov)
Df Sum Sq Mean Sq F value Pr(>F)
Nutrient 1 651.8 651.8 817.131 < 2e-16 ***
Condition 3 66.5 22.2 27.780 4.72e-13 ***
Nutrient:Condition 3 14.1 4.7 5.891 0.000976 ***
Residuals 97 77.4 0.8
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It can be seen that the two main effects (Nutrient and Condition) are statistically significant, as well as their interaction.
Note that, in the situation where the interaction is not significant you should use the additive model.
5.9 TUKEY
In an ANOVA test, a significant p-value indicates that some of the group means are different, but we don't know which pairs of groups differ.
As the ANOVA test is significant, we can compute Tukey HSD (Tukey Honest Significant Differences). The Tukey test is a single-step multiple comparison procedure and statistical test. It is a post-hoc analysis, which means that it is used in conjunction with an ANOVA.
#test with anova
aov <- aov(Rootlength ~ Nutrient * Condition, data = growth_data)
#run tukey
TukeyHSD(aov)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Rootlength ~ Nutrient * Condition, data = growth_data)
$Nutrient
diff lwr upr p adj
noP-P -4.983086 -5.329067 -4.637105 0
$Condition
diff lwr upr p adj
Strain101-MgCl 2.1029052 1.4604930 2.7453173 0.0000000
Strain230-MgCl 1.6836191 1.0341749 2.3330633 0.0000000
Strain28-MgCl 1.2784796 0.6545138 1.9024455 0.0000034
Strain230-Strain101 -0.4192861 -1.0864726 0.2479004 0.3597470
Strain28-Strain101 -0.8244256 -1.4668377 -0.1820134 0.0061445
Strain28-Strain230 -0.4051395 -1.0545837 0.2443048 0.3663501
$`Nutrient:Condition`
diff lwr upr p adj
noP:MgCl-P:MgCl -4.0482585 -5.0967311 -2.9997859 0.0000000
P:Strain101-P:MgCl 2.8756763 1.8040558 3.9472967 0.0000000
noP:Strain101-P:MgCl -2.7588405 -3.8073131 -1.7103679 0.0000000
P:Strain230-P:MgCl 1.9612207 0.9127482 3.0096933 0.0000023
noP:Strain230-P:MgCl -2.6805735 -3.7789218 -1.5822252 0.0000000
P:Strain28-P:MgCl 2.1539326 1.1054600 3.2024052 0.0000002
noP:Strain28-P:MgCl -3.6531484 -4.6634819 -2.6428149 0.0000000
P:Strain101-noP:MgCl 6.9239348 5.8162834 8.0315861 0.0000000
noP:Strain101-noP:MgCl 1.2894180 0.2041458 2.3746903 0.0087961
P:Strain230-noP:MgCl 6.0094793 4.9242070 7.0947515 0.0000000
noP:Strain230-noP:MgCl 1.3676850 0.2341551 2.5012149 0.0073003
P:Strain28-noP:MgCl 6.2021911 5.1169189 7.2874633 0.0000000
noP:Strain28-noP:MgCl 0.3951101 -0.6533625 1.4435827 0.9391499
noP:Strain101-P:Strain101 -5.6345168 -6.7421681 -4.5268654 0.0000000
P:Strain230-P:Strain101 -0.9144555 -2.0221069 0.1931958 0.1845494
noP:Strain230-P:Strain101 -5.5562498 -6.7112241 -4.4012755 0.0000000
P:Strain28-P:Strain101 -0.7217437 -1.8293950 0.3859077 0.4749256
noP:Strain28-P:Strain101 -6.5288247 -7.6004451 -5.4572042 0.0000000
P:Strain230-noP:Strain101 4.7200613 3.6347890 5.8053335 0.0000000
noP:Strain230-noP:Strain101 0.0782670 -1.0552629 1.2117969 0.9999989
P:Strain28-noP:Strain101 4.9127731 3.8275008 5.9980453 0.0000000
noP:Strain28-noP:Strain101 -0.8943079 -1.9427805 0.1541647 0.1537008
noP:Strain230-P:Strain230 -4.6417943 -5.7753242 -3.5082644 0.0000000
P:Strain28-P:Strain230 0.1927118 -0.8925604 1.2779841 0.9993212
noP:Strain28-P:Strain230 -5.6143692 -6.6628417 -4.5658966 0.0000000
P:Strain28-noP:Strain230 4.8345061 3.7009762 5.9680360 0.0000000
noP:Strain28-noP:Strain230 -0.9725749 -2.0709232 0.1257734 0.1223122
noP:Strain28-P:Strain28 -5.8070810 -6.8555536 -4.7586084 0.0000000
We can see that most differences are significant, with the exception of Strain28, which in most cases does not show an effect.
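Instead of reading the printed table, the adjusted p-values can also be extracted programmatically from the object TukeyHSD() returns; a toy one-way sketch (made-up groups, not our growth data):

```r
#toy data: groups A and B share a mean, C is shifted upwards
set.seed(2)
d <- data.frame(g = factor(rep(c("A", "B", "C"), each = 10)),
                y = c(rnorm(10, mean = 5), rnorm(10, mean = 5), rnorm(10, mean = 9)))
tk <- TukeyHSD(aov(y ~ g, data = d))

#one adjusted p-value per pairwise comparison (B-A, C-A, C-B)
tk$g[, "p adj"]
```

This is handy when you want to filter comparisons by significance level rather than scan a long table by eye.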
For some representations it is useful to plot significant letters. We can do this using some extra packages as follows:
#load library
library(agricolae)
Attaching package: 'agricolae'
The following objects are masked from 'package:dlookr':
kurtosis, skewness
#separate nutrient conditions
noP_data <- growth_data[growth_data$Nutrient == "noP", ]
#run an anova
aov_noP <- aov(Rootlength ~ Condition, data = noP_data)
#run test
HSD.test(aov_noP, "Condition", group=TRUE, console=TRUE)
Study: aov_noP ~ "Condition"
HSD Test for Rootlength
Mean Square Error: 0.606883
Condition, means
Rootlength std r Min Max
MgCl 5.805198 0.7738124 13 4.236348 6.834720
Strain101 7.094616 0.5939804 13 5.947965 7.973207
Strain230 7.172883 0.6117203 11 6.120089 8.197116
Strain28 6.200309 0.9988989 15 4.384756 7.428674
Alpha: 0.05 ; DF Error: 48
Critical Value of Studentized Range: 3.763749
Groups according to probability of means differences and alpha level( 0.05 )
Treatments with the same letter are not significantly different.
Rootlength groups
Strain230 7.172883 a
Strain101 7.094616 a
Strain28 6.200309 b
MgCl 5.805198 b
5.10 Check the homogeneity of variances
The residuals versus fits plot is used to check the homogeneity of variances. In the plot below, there is no evident relationship between residuals and fitted values (the means of each group), which is good. So, we can assume the homogeneity of variances. Only a few points (41, 58 and 77) are detected as outliers, which can severely affect normality and homogeneity of variance. It can be useful to remove outliers to meet the test assumptions.
#check for homogeneity
plot(aov, 1)
#Use Levene's test to check the homogeneity of variances.
library(car)
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
The following object is masked from 'package:purrr':
some
leveneTest(Rootlength ~ Nutrient * Condition, data = growth_data)
From the output above we can see that the p-value is not less than the significance level of 0.05. This means that there is no evidence to suggest that the variance across groups is statistically significantly different. Therefore, we can assume the homogeneity of variances in the different treatment groups.
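Base R also offers Bartlett's test for the same question; note that it is more sensitive to departures from normality than Levene's test. A toy sketch on simulated data:

```r
#toy data: two groups drawn from the same normal distribution
set.seed(3)
d <- data.frame(g = rep(c("A", "B"), each = 15), y = rnorm(30))

#a large p-value: no evidence that the group variances differ
bartlett.test(y ~ g, data = d)
```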
5.11 Check for normality v2
Normality plot of the residuals. In the plot below, the quantiles of the residuals are plotted against the quantiles of the normal distribution. A 45-degree reference line is also plotted. The normal probability plot of residuals is used to verify the assumption that the residuals are normally distributed.
The normal probability plot of the residuals should approximately follow a straight line, which we can see below. Again, we see the points marked as potential outliers.
## plot
plot(aov, 2)