Here I will present a real-life example of how to go from your OTU table to NMDS plots and hypothesis testing in R:
We recently concluded a project that sought to determine whether snail food sources (leaf-surface microbes) were significantly different between current snail sites and proposed enclosure locations. The issue for the Oahu Army Natural Resources Program was that these endangered snails are being eaten by predators (rats and other invasive snails). They have developed an effective way to build a structure that keeps predators out, but some of the places where these endangered snails currently live are on steep and remote mountain ridges which makes construction work impossible. The idea was to build safe spaces nearby on level ground and move snails into the enclosure, but there are a lot of factors that play into whether they will survive the move. One of these possible factors was their diet; If the microbial communities on leaf-surfaces were significantly different between current and proposed sites it might have negative consequences for snail fitness.
I will assume that you were able to import your BIOM file into R if you are using your own data (refer to this previous post for how to do this), so I will start with the OTU table in csv format. You can download the OTU table here and the metadata table here for our snail study to follow along with this tutorial.
First, let’s load the required packages and import the otu table and metadata table from the .csv files (make sure to change the file paths to wherever you downloaded the data):
library(ggplot2) library(vegan) otus = read.csv(file = "~/Desktop/EPIPHYTE_OTU_TABLE.csv", header = TRUE, check.names = FALSE, row.names = 1) metadata = read.csv(file = "~/Desktop/EPIPHYTE_METADATA.csv", header = TRUE, check.names = FALSE, row.names = 1)
I noticed that some of my samples don’t have any OTUs in them (zeros all the way down) so I want to remove those. We can check for, and remove, samples matching any criteria we like from both the OTU table and the metadata table at the same time.
good_samples <- colnames(otus[(colSums(decostand(otus,"pa")) >= 1)]) # decostand(x,"pa") counts presence/absence otus = otus[,good_samples] metadata = metadata[good_samples,]
There are more parsimonious ways of doing this, but this way you can easily adapt it to remove samples with richness less than any value you like…that’s why I bothered with decostand().
Now that we have gotten rid of useless samples, we need to transpose the OTU table so that samples = rows and OTUs = columns, because this is how the vegan package likes it.
t_otus <- as.data.frame(t(otus))
Next, we will rarefy our data set down to the lowest OTU abundance so that we can compare evenly between samples regardless of sampling depth artefacts. There has been some recent debate about the efficacy of rarefying community data (read a very thought-provoking paper here) and there is growing consensus that rarefying data down to a “lowest-common-denomenator” is not statistically valid and can lead to loss of statistical power. That said, it is currently still a common practice and for the purposes of this example, we will go ahead and rarefy rather than normalize the data in other ways in order to focus on the NMDS process. Note that rarefaction and rarefying are two completely different things!
min_depth = min(colSums(otus)) t_otus_rarefied <- as.data.frame(round(rrarefy(t_otus, min_depth)))
Above, we determined our OTU count for the lowest abundance sample and then rarefied the data to that (randomly selected only 3,679 hits from each sample). Next, we want to transform our rarefied OTU table (square-root) and determine the best method for calculating a distance matrix from it.
sqrt_t_otus_rarefied = sqrt(t_otus_rarefied) rank.totus <- rankindex(as.matrix(sqrt_t_otus_rarefied), t_otus_rarefied, indices = c("bray", "euclid", "manhattan", "horn"), method = "spearman") print(paste("The highest rank was given by the", names(sort(rank.totus, decreasing = TRUE)), "method."))
This prints out the method which is best for our data set, in this case “Bray-Curtis,” so we will use “bray” as the method to construct our community distance matrix:
otus_dist = as.matrix((vegdist(t_otus_rarefied, "bray")))
Now we are finally ready to perform NMDS on our data! The following will conduct NMDS and use the results to compose a data frame with NMDS coordinates and associated metadata for each sample:
#perform NMDS NMDS = metaMDS(otus_dist)
#build a data frame with NMDS coordinates and metadata MDS1 = NMDS$points[,1] MDS2 = NMDS$points[,2] NMDS = data.frame(MDS1 = MDS1, MDS2 = MDS2, Host = metadata$Host_Taxon, Location = metadata$Location)
Here is a peek at what our data frame should look like:
> head(NMDS) MDS1 MDS2 Host Location 69CV.5 -0.07177226 0.09103480 Mellicope Culvert_69 DS.5 0.06119335 -0.14087551 Myrica Dan_Sievert-Palikea SP.3 0.02909795 0.10252711 Unknown Skeet_Pass 3P.5 0.01576544 0.01717921 Antidesma 3_Points EKH.20 0.18278557 0.04607938 Unknown Ekahanui KB.9 -0.01055083 0.21422620 Lapalapa Kaala_Bog
It’s time to plot this and take a look at where our samples fall in “ordination space.” We will use ggplot2 because it’s lovely. It takes a bit of effort to get used to, but it’s an excellent package for plotting and comes with a ton of functionality. (Here is a nice intro tutorial for playing with ggplot)
ggplot(NMDS, aes(x=MDS1, y=MDS2, col=Location)) + geom_point() + stat_ellipse() + theme_bw() + labs(title = "NMDS Plot")
Here we see that our ellipses (representing 95% CI around the centroid) have a lot of overlap. To me, this looks like the epiphyte communities aren’t very different for each other: Boring news for scientists looking for a story, but great news for the snails who might have to move to a new location. Still, we need to run some statistical tests to make sure. There are two main options for this sort of project: adonis(), and anosim(). Adonis is the vegan implementation of a permutational analysis of variance, and anosim is similar, but is an analysis of similarity.
Here is some more detail about the two methods (taken from the QIIME help page), though you should read more about them in the R help files:
Adonis is a nonparametric statistical method that takes a distance matrix and a category to determine sample grouping from. It computes an R2 value (effect size) which shows the percentage of variation explained by the supplied category, as well as a p-value to determine the statistical significance. Adonis creates a set by first identifying the relevant centroids of the data and then calculating the squared deviations from these points. After that, significance tests are performed using F-tests based on sequential sums of squares from permutations of the raw data.
ANOSIM is a method that tests whether two or more groups of samples are significantly different (similar to adonis, above). You can specify a category in the metadata to separate samples into groups and then test whether there are significant differences between those groups. Since ANOSIM is nonparametric, statistical significance is determined through permutations.
Note: ANOSIM only works with a categorical variable that is used to do the grouping. Mantel is recommended for continuous variables.
To do this with our data in R:
anosim_location = anosim(otus_dist, metadata$Location) anosim_location # take a look at results summary(anosim_location) plot(anosim_location)
This performed anosim on our distance matrix with Location as a categorical predictor. Anosim has built-in summary() and plot() methods that provide a lot of helpful information for interpreting the results. Your anosim result should look like this:
anosim(dat = otus_dist, grouping = metadata$Location) Dissimilarity: user supplied square matrix ANOSIM statistic R: 0.4326 Significance: 0.001
This essentially says that our communities are statistically different from each other, contrary to what our plot seemed to show. Let’s take a look at PermANOVA, using adonis(), which is generally considered to be more robust than anosim().
adonis_location = adonis(otus_dist ~ Location, metadata) adonis_location # take a look at results; there are no summary() or plot() methods included
And the results I got were:
Call: adonis(formula = otus_dist ~ Location, data = metadata) Permutation: free Number of permutations: 999 Terms added sequentially (first to last) Df SumsOfSqs MeanSqs F.Model R2 Pr(>F) Location 6 6.069 1.01142 2.7649 0.16658 0.001 *** Residuals 83 30.362 0.36581 0.83342 Total 89 36.430 1.00000
Similar to anosim! Still, the real question we were asking with these data wasn’t about all seven sites together…we had some paired sites in mind. For example, the snails at the site called “Skeet Pass” are under consideration to be moved to the proposed site at “Kaala Bog.” So, the real question is whether each current site is different from its proposed site(s), not whether any of the communities are different. For a standard ANOVA we could just run a post-hoc test to determine which groups are different from each other (e.g., a Tukey test), but anosim and adonis do not have any valid post-hoc methods currently. In that case, what we can do is split up our data into paired sites and perform anosim on each subset, which should give us an idea as to whether there are differences between the sites that we care about.
# Get columns from each of the seven site locations CV69_cols = grep('^69CV',names(otus)) DSPK_cols = grep('^DS',names(otus)) Skeet_cols = grep('^SP',names(otus)) KK_cols = grep('^KK',names(otus)) ThreeP_cols = grep('^3P',names(otus)) KB_cols = grep('^KB',names(otus)) EKH_cols = grep('^EKH',names(otus)) # Generate a rarefied OTU table from each site: Culvert_69 <- t_otus_rarefied[CV69_cols,] DS_Palikea <- t_otus_rarefied[DSPK_cols,] Skeet_Pass <- t_otus_rarefied[Skeet_cols,] Kaaikukai <- t_otus_rarefied[KK_cols,] Three_Pts <- t_otus_rarefied[ThreeP_cols,] KaalaBog <- t_otus_rarefied[KB_cols,] Ekahanui <- t_otus_rarefied[EKH_cols,] #group OTUs from each set of donor/recipient sites...these will be the three comparisons skeet_kaala = rbind(Skeet_Pass,KaalaBog) three_pts_cv69 = rbind(Three_Pts, Culvert_69) palikea_sites = rbind(DS_Palikea, Kaaikukai, Ekahanui) #remove "empty" OTUs that pop up since we are subsetting the large OTU table skeet_kaala = skeet_kaala[,which(colSums(skeet_kaala) != 0)] three_pts_cv69 = three_pts_cv69[,which(colSums(three_pts_cv69) != 0)] palikea_sites = palikea_sites[,which(colSums(palikea_sites) != 0)] #perform anosim on each meaningful combination anosim(three_pts_cv69, metadata$Location) anosim(skeet_kaala, metadata$Location) anosim(palikea_sites, metadata$Location)
And here is what I got:
anosim(dat = three_pts_cv69, grouping = metadata$Location) Dissimilarity: bray ANOSIM statistic R: -0.05695 Significance: 0.713
anosim(dat = skeet_kaala, grouping = metadata$Location) Dissimilarity: bray ANOSIM statistic R: -0.04582 Significance: 0.72
anosim(dat = palikea_sites, grouping = metadata$Location) Dissimilarity: bray ANOSIM statistic R: -0.08068 Significance: 0.907