NMDS Example

Achatinella mustelina hanging out and eating leaf-surface microbes

Here I will present a real-life example of how to go from your OTU table to NMDS plots and hypothesis testing in R:

We recently concluded a project that sought to determine whether snail food sources (leaf-surface microbes) were significantly different between current snail sites and proposed enclosure locations.  The issue for the Oahu Army Natural Resources Program was that these endangered snails are being eaten by predators (rats and other invasive snails).  They have developed an effective way to build a structure that keeps predators out, but some of the places where these endangered snails currently live are on steep and remote mountain ridges which makes construction work impossible.  The idea was to build safe spaces nearby on level ground and move snails into the enclosure, but there are a lot of factors that play into whether they will survive the move.  One of these possible factors was their diet; If the microbial communities on leaf-surfaces were significantly different between current and proposed sites it might have negative consequences for snail fitness.

I will assume that you were able to import your BIOM file into R if you are using your own data (refer to this previous post for how to do this), so I will start with the OTU table in csv format.  You can download the OTU table here and the metadata table here for our snail study to follow along with this tutorial.

First, let’s load the required packages and import the otu table and metadata table from the .csv files (make sure to change the file paths to wherever you downloaded the data):

library(ggplot2)
library(vegan)
otus = read.csv(file = "~/Desktop/EPIPHYTE_OTU_TABLE.csv", header = TRUE, check.names = FALSE, row.names = 1)
metadata = read.csv(file = "~/Desktop/EPIPHYTE_METADATA.csv", header = TRUE, check.names = FALSE, row.names = 1)

I noticed that some of my samples don’t have any OTUs in them (zeros all the way down) so I want to remove those.  We can check for, and remove, samples matching any criteria we like from both the OTU table and the metadata table at the same time.

good_samples <- colnames(otus[(colSums(decostand(otus,"pa")) >= 1)])     # decostand(x,"pa") counts presence/absence
otus = otus[,good_samples]
metadata = metadata[good_samples,]

There are more parsimonious ways of doing this, but this way you can easily adapt it to remove samples with richness less than any value you like…that’s why I bothered with decostand().

Now that we have gotten rid of useless samples, we need to transpose the OTU table so that samples = rows and OTUs = columns, because this is how the vegan package likes it.

t_otus <- as.data.frame(t(otus))

Next, we will rarefy our data set down to the lowest OTU abundance so that we can compare evenly between samples regardless of sampling depth artefacts.  There has been some recent debate about the efficacy of rarefying community data (read a very thought-provoking paper here) and there is growing consensus that rarefying data down to a “lowest-common-denomenator” is not statistically valid and can lead to loss of statistical power.  That said, it is currently still a common practice and for the purposes of this example, we will go ahead and rarefy rather than normalize the data in other ways in order to focus on the NMDS process.  Note that rarefaction and rarefying are two completely different things!

min_depth = min(colSums(otus))
 t_otus_rarefied <- as.data.frame(round(rrarefy(t_otus, min_depth)))

Above, we determined our OTU count for the lowest abundance sample and then rarefied the data to that (randomly selected only 3,679 hits from each sample).  Next, we want to transform our rarefied OTU table (square-root) and determine the best method for calculating a distance matrix from it.

sqrt_t_otus_rarefied = sqrt(t_otus_rarefied)
rank.totus <- rankindex(as.matrix(t_otus_sqrt_dist), t_otus_rarefied, indices = c("bray", "euclid", "manhattan", "horn"), method = "spearman")
print(paste("The highest rank was given by the", names(sort(rank.totus, decreasing = TRUE)[1]), "method."))

This prints out the method which is best for our data set, in this case “Bray-Curtis,” so we will use “bray” as the method to construct our community distance matrix:

otus_dist = as.matrix((vegdist(t_otus_rarefied, "bray")))

Now we are finally ready to perform NMDS on our data!  The following will conduct NMDS and use the results to compose a data frame with NMDS coordinates and associated metadata for each sample:

#perform NMDS
 NMDS = metaMDS(otus_dist)
#build a data frame with NMDS coordinates and metadata
 MDS1 = NMDS$points[,1]
 MDS2 = NMDS$points[,2]
 NMDS = data.frame(MDS1 = MDS1, MDS2 = MDS2, Host = metadata$Host_Taxon, Location = metadata$Location)

Here is a peek at what our data frame should look like:

> head(NMDS)
              MDS1        MDS2      Host            Location
69CV.5 -0.07177226  0.09103480 Mellicope          Culvert_69
DS.5    0.06119335 -0.14087551    Myrica Dan_Sievert-Palikea
SP.3    0.02909795  0.10252711   Unknown          Skeet_Pass
3P.5    0.01576544  0.01717921 Antidesma            3_Points
EKH.20  0.18278557  0.04607938   Unknown            Ekahanui
KB.9   -0.01055083  0.21422620  Lapalapa           Kaala_Bog

It’s time to plot this and take a look at where our samples fall in “ordination space.”  We will use ggplot2 because it’s lovely.  It takes a bit of effort to get used to, but it’s an excellent package for plotting and comes with a ton of functionality.  (Here is a nice intro tutorial for playing with ggplot)

ggplot(NMDS, aes(x=MDS1, y=MDS2, col=Location)) +
 geom_point() +
 stat_ellipse() +
 theme_bw() +
 labs(title = "NMDS Plot")

EPIPHYTE_NMDS_PLOT

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Here we see that our ellipses (representing 95% CI around the centroid) have a lot of overlap.  To me, this looks like the epiphyte communities aren’t very different for each other: Boring news for scientists looking for a story, but great news for the snails who might have to move to a new location.  Still, we need to run some statistical tests to make sure.  There are two main options for this sort of project: adonis(), and anosim().  Adonis is the vegan implementation of a permutational analysis of variance, and anosim is similar, but is an analysis of similarity.

Here is some more detail about the two methods (taken from the QIIME help page), though you should read more about them in the R help files:

Adonis is a nonparametric statistical method that takes a distance matrix and a category to determine sample grouping from. It computes an R2 value (effect size) which shows the percentage of variation explained by the supplied category, as well as a p-value to determine the statistical significance. Adonis creates a set by first identifying the relevant centroids of the data and then calculating the squared deviations from these points. After that, significance tests are performed using F-tests based on sequential sums of squares from permutations of the raw data.

—–

ANOSIM is a method that tests whether two or more groups of samples are significantly different (similar to adonis, above). You can specify a category in the metadata to separate samples into groups and then test whether there are significant differences between those groups.  Since ANOSIM is nonparametric, statistical significance is determined through permutations.
Note: ANOSIM only works with a categorical variable that is used to do the grouping. Mantel is recommended for continuous variables.

 

To do this with our data in R:

anosim_location = anosim(otus_dist, metadata$Location)
 anosim_location # take a look at results
 summary(anosim_location)
 plot(anosim_location)

This performed anosim on our distance matrix with Location as a categorical predictor.  Anosim has built-in summary() and plot() methods that provide a lot of helpful information for interpreting the results.  Your anosim result should look like this:

 

Call:

anosim(dat = otus_dist, grouping = metadata$Location) 
Dissimilarity: user supplied square matrix 

ANOSIM statistic R: 0.4326 
      Significance: 0.001 

This essentially says that our communities are statistically different from each other, contrary to what our plot seemed to show.  Let’s take a look at PermANOVA, using adonis(), which is generally considered to be more robust than anosim().

adonis_location = adonis(otus_dist ~ Location, metadata)
adonis_location # take a look at results; there are no summary() or plot() methods included

And the results I got were:

 

Call:
adonis(formula = otus_dist ~ Location, data = metadata) 

Permutation: free
Number of permutations: 999

Terms added sequentially (first to last)

          Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
Location   6     6.069 1.01142  2.7649 0.16658  0.001 ***
Residuals 83    30.362 0.36581         0.83342           
Total     89    36.430                 1.00000

Similar to anosim!  Still, the real question we were asking with these data wasn’t about all seven sites together…we had some paired sites in mind.  For example, the snails at the site called “Skeet Pass” are under consideration to be moved to the proposed site at “Kaala Bog.”  So, the real question is whether each current site is different from its proposed site(s), not whether any of the communities are different.  For a standard ANOVA we could just run a post-hoc test to determine which groups are different from each other (e.g., a Tukey test), but anosim and adonis do not have any valid post-hoc methods currently.  In that case, what we can do is split up our data into paired sites and perform anosim on each subset, which should give us an idea as to whether there are differences between the sites that we care about.

# Get columns from each of the seven site locations
CV69_cols = grep('^69CV',names(otus)) 
DSPK_cols = grep('^DS',names(otus))
Skeet_cols = grep('^SP',names(otus))
KK_cols = grep('^KK',names(otus))
ThreeP_cols = grep('^3P',names(otus))
KB_cols = grep('^KB',names(otus))
EKH_cols = grep('^EKH',names(otus))

# Generate a rarefied OTU table from each site:
Culvert_69 <- t_otus_rarefied[CV69_cols,]
DS_Palikea <- t_otus_rarefied[DSPK_cols,]
Skeet_Pass <- t_otus_rarefied[Skeet_cols,]
Kaaikukai <- t_otus_rarefied[KK_cols,]
Three_Pts <- t_otus_rarefied[ThreeP_cols,]
KaalaBog <- t_otus_rarefied[KB_cols,]
Ekahanui <- t_otus_rarefied[EKH_cols,]

#group OTUs from each set of donor/recipient sites...these will be the three comparisons
skeet_kaala = rbind(Skeet_Pass,KaalaBog)
three_pts_cv69 = rbind(Three_Pts, Culvert_69)
palikea_sites = rbind(DS_Palikea, Kaaikukai, Ekahanui)

#remove "empty" OTUs that pop up since we are subsetting the large OTU table
skeet_kaala = skeet_kaala[,which(colSums(skeet_kaala) != 0)]
three_pts_cv69 = three_pts_cv69[,which(colSums(three_pts_cv69) != 0)]
palikea_sites = palikea_sites[,which(colSums(palikea_sites) != 0)]

#perform anosim on each meaningful combination
anosim(three_pts_cv69, metadata$Location)
anosim(skeet_kaala, metadata$Location)
anosim(palikea_sites, metadata$Location)

And here is what I got:

anosim(dat = three_pts_cv69, grouping = metadata$Location) 
Dissimilarity: bray 

ANOSIM statistic R: -0.05695 
      Significance: 0.713
anosim(dat = skeet_kaala, grouping = metadata$Location) 
Dissimilarity: bray 

ANOSIM statistic R: -0.04582 
      Significance: 0.72
anosim(dat = palikea_sites, grouping = metadata$Location) 
Dissimilarity: bray 

ANOSIM statistic R: -0.08068 
      Significance: 0.907 

So, there seems to be no significant differences between current and proposed site combinations, even though our first analysis showed that the sites were different from each other, in general.  This makes sense intellectually, since the current and proposed sites for the snails were pretty close together spatially, but the three current snail sites were further apart.  Bottom line is that as long as they don’t move the snail populations too far from where they are, the microbial community (food source) won’t be too different for them.

I hope this was a helpful example of how to get started doing (and plotting) NMDS and using the anosim() and adonis() functions to determine whether community compositions are statistically different.

Leave a Reply