Getting started with QIIME for fungal ITS – 3. Initial OTU analyses

DSCF29082

So you’ve picked your OTUs and have a nice biom file, probably called: otu_table_mc2_w_tax.biom
But what do you do with it?

At this point you can stay in QIIME for a bit longer and perform a suite of analyses or you can migrate onto another platform such as R.  For now, we’ll just look at what we can do in QIIME.  First, take a peek at your OTU table summary:

biom summarize-table -i otu_table_mc2_w_tax.biom

It will look something like this (truncated):

Num samples: 356
Num observations: 38438
Total count: 7645596
Table density (fraction of non-zero values): 0.014

Counts/sample summary:
 Min: 4.0
 Max: 63667.0
 Median: 21810.000
 Mean: 21476.393
 Std. dev.: 14864.453
 Sample Metadata Categories: None provided
 Observation Metadata Categories: taxonomy

Counts/sample detail:
PCR.Neg1: 4.0
GB.29: 7.0
Neg.Myc3: 7.0
PCR.Neg3: 9.0
. . .

Choose your rarefaction depth:

There is a lot of handy information in the summary and it really gives you a nice look at each sample’s richness.  That’s important for when we rarefy the data so we can find a meaningful “lowest common denominator.”  In my example the median reads/sample is 21810, but we see right off that there are several samples that are MUCH lower than this.  Of the four shown above, three are negative controls, so that is good, but one (GB.29) just appears to not have worked.  At this point we can’t be sure whether it’s an issue with the sample or with lab chemistry, but we will certainly not be interested in including it in further analyses.  As I scroll down the summary I find about fifteen samples (out of 356) with “low” read counts below 1000, and then it shoots up quickly.  You want to pick a rarefaction depth that won’t throw out too many of your samples (or any of the ones you really, really care about) but isn’t too shallow for robust statistical measures.  For my dataset I chose a depth of 5000 reads.  I will lose about 20 samples, but half of those are negative controls or less-important to my central question.  If you are looking at soil 16S amplicons or a similarly highly diverse community, 5000 reads probably wouldn’t be enough to see any levelling off of your rarefaction curves, but for fungal ITS it’s likely fine.

Generate rarefaction curves:

We can use a convenient workflow script called alpha_rarefaction.py to generate rarefied OTU tables, compare chosen alpha diversity metrics, and plot our curves all with one easy command.

alpha_rarefaction.py -i otu_table_mc2_w_tax.biom -m mapping_file.txt -o ~/Analyses/Alpha_Div/ -n 50 -e 5000

The mapping file is the same one we created for the add_qiime_labels.py script.  The -n flag tells it how many steps to take, to get from our minimum depth (10, by default) to our maximum depth (set by the -e flag, 5000 in this case).  Since we are using ITS data, we don’t have a phylogenetic tree and thus will be unable to generate phylogenetic diversity metrics like we could for alignable markers like 16S.  The output will include a nice HTML file that will let us explore rarefied alpha diversity graphically. Since we are unable to use phylogenetic diversity and Chao1 isn’t particularly robust to sparse data tables, such as are common in this type of study, here’s an example, looking at OTU counts for each of my sample sites:

Fungal ITS rarefaction curve

Observed OTUs by location. OTU counts of foliar epiphyte fungi at each of 7 sites associated with the endangered Oahu Tree Snail, Achatinella mustelina. Maximum rarefaction depth of 5000 reads.

The HTML file will allow you to explore this based on any of your metadata columns such as site, treatment, or even by sample.  It’s a nice way to get a quick and dirty overview of your community diversity and to see if your sampling depth was sufficient.

Take a look at beta diversity next:

QIIME has another handy workflow script that allows us to look at beta diversity (diversity BETWEEN samples).  It’s called core_diversity_analyses.py

This script will actually take care of generating our alpha diversity and rarefaction plots as well, but it’s nice to take a look at these ahead of time, as we just did, to determine whether our rarefaction depth is adequate to address our diversity.  If our rarefaction curves show no sign of levelling off, we ought to rarefy at a greater depth (even if we have to abandon a few more samples).  We can call this core diversity workflow as follows:

core_diversity_analyses.py -i otu_table_mc2_w_tax.biom -o ~/Analyses/Core_div/ -m mapping_file.txt -e 5000 -c Location --nonphylogenetic_diversity

Note that we must again tell the script to not use phylogenetic diversity measures since we are using ITS data.  The -c flag is our chance to give a comma-separated list of the metadata categories we want to use for statistical comparisons.  In my case, the question I’m asking is about community differences based on location, but you can test any categorical variable that you include in your mapping file, such as an experimental treatment, date, etc.  This flag must exactly match your column header(s).

What we get from this is a whole slew of files but QIIME is kind to us and outputs another HTML file called index.html that we can use to explore the rest of the output.  We get Bray-Curtis principle coordinates and distance measures (along with nifty emperor plots), alpha diversity and rarefaction plots, taxonomy summaries, and group significance for the categories we chose to test.  Here’s what the HTML output looks like:

QIIME core diversity output Fungal ITS

QIIME core diversity output

You can see this simple script accomplishes quite a bit!  You can even get a nice overview of your taxonomy, though if you are interested in statistical comparisons of specific taxa between samples/groups you will need some additional tools.
Keep in mind that taxonomic assignments are only as good as the database you use!  The UNITE fungal database is pretty good, but it is heavy on forest soil fungi so if you are sampling from a very different environment (like the tropical phyllosphere I am sampling) you may get rather large proportions of “Unassigned” taxa.  Additionally, if your samples have a lot of non-target ITS (plant, animal, etc.) the fungal UNITE database isn’t going to let you know about it.  It can only let you know about diversity that it “is aware of!”  To get around this it helps to tack on some meaningful outgroups to the UNITE database files.  In my case, I added a few fasta reads of plant ITS to the UNITE fasta file and their associated taxonomic assignments to the UNITE taxonomy txt file.  You can see how to format these by peeking at a few lines of each of these files, and then simply concatenate your outgroups to the end.  This is particularly useful if you didn’t use ITSx in order to isolate “only fungal” ITS sequences.

These core diversity measures might be all you need to answer your question, but in most cases you’ll want to dig a bit deeper.  In a future post I’ll go into cleaning up your OTU table to deal with spurious reads, as well as some additional tools to test hypotheses.

Leave a Reply