I discussed how to prepare all your reads and combine them into one fasta file in the previous post. Once you have that fasta file (called “combined_seqs.fna”) we can do a couple of optional cleanup steps before picking OTUs. First, we can run ITSx to extract fungal ITS sequences and remove the pesky 5.8S and LSU reads. You can download the program at that link, or ask your institutional HPC system admins to install it. Fact is, ITSx is really neat but it takes FOREVER to run on a big file like what we have. In fact, even using 20 cores on a supercomputer was going to take about 10 days for my small/average sized fasta file. In order to get around this I split up my combined_seqs.fna into ten pieces with fasta-splitter and ran ten separate one-day jobs. If you decide to use ITSx, I suggest you spend some time carefully reading the manual.
Another cleanup step we can take is to remove chimeras. Chimeric sequences are a product of incomplete PCR cycling, where an unfinished extension on a template DNA strand becomes a priming site for a different template and you wind up with an amalgam (or chimera) of two different templates joined into one. We can use VSEARCH, which is a free version of the excellent USEARCH, to detect and remove these chimeras. Here’s an example of the command to remove chimeras
vsearch -uchime_ref ~/combined_seqs.fna -db uchime_sh_refs_dynamic_develop_985_11.03.2015.ITS1.fasta -strand plus -nonchimeras ~/combined_seqs_no_chimeras.fna -threads 20
Note that we are pointing the program to a chimeric database. Since we are dealing with fungal ITS, we will need to download and supply this database (and others) at several points during our analyses. You can find the current chimeric database here.
While you are at it, go ahead and download the rest of the UNITE fungal database for use in QIIME since that will be our next step. This .zip file comes with three databases. We will be using the reference database called sh_refs_qiime_ver7_dynamic_s_31.01.2015.fasta and the taxonomy database called sh_taxonomy_qiime_ver7_dynamic_s_31.01.2015.txt
We will be picking OTUs from our combined_seqs_no_chimeras.fna file (the output from our VSEARCH script). We will use the open-reference approach. A comparison of OTU picking approaches can be found on the QIIME website. The basic command for OTU picking is (in our case):
pick_open_reference_otus.py -i combined_seqs_no_chimeras.fna -o ~/OTUs/ -r ~/YOUR_FILEPATH/sh_refs_qiime_ver7_dynamic_s_31.01.2015.fasta --suppress-align-and-tree -p ~/OTU_params.txt -a -O 20
The -a and -O flags are for running this in parallel. This means you get a minimum 20X speed boost if on the HPC. On a laptop, if you have 4 cores, you can pass -a -O 4 to run it on all four cores. If you don’t know, just leave off the -a and -O.
The –suppress-align-and-tree flag is there since ITS data cannot be reliably aligned like other common markers, and thus is not informative in that context.
The OTU_params.txt is a parameter file. It’s a simple text file listing changes to the default behavior in QIIME scripts. In our case, we want it to point QIIME to our UNITE database. Here’s a good example, but, as always, make sure the filepaths it references are adapted to your own filepaths so the scripts know where to look:
Running this pick_open_reference_otus.py script will take between 5 minutes and 2 weeks depending on your dataset size and computing environment. But when it is finished, the output file we are most interested in is called otu_table_mc2_w_tax.biom
This is essentially a compressed data frame that shows each OTU’s abundance in each sample. You can learn more about biom files and how to work with them here: http://biom-format.org/
I am personally not a fan of the biom format in this situation, but looks like we’re stuck with it for the time being. You can always convert it to the much more useful .tsv format (though this gets buggy sometimes). Anyway, One useful command to get an idea of what your otu table looks like is:
biom summarize-table -i otu_table_mc2_w_tax.biom -o otu_summary.txt
This txt file will give you the number of reads and otus for each sample and you will need this information for when we rarefy our data for meaningful comparisons. We’ll get to that in a future post when we take a look at what to do with our OTU table.