Getting started with QIIME for fungal ITS – 1. Data Prep through formatted fasta

DSCF28262

There are dozens of excellent resources and tutorials for working with 16S data in QIIME, but far fewer for ITS amplicons.  If you are looking at fungal communities, you will probably be working with ITS data and so there’s a bit steeper of a learning curve.  I highly recommend the tutorial created by Noah Fierer et al. as a great place to get started learning about the quirks specific to ITS datasets.  Here’s a link to the iPython notebook:  http://nbviewer.jupyter.org/github/biocore/qiime/blob/1.9.1/examples/ipynb/Fungal-ITS-analysis.ipynb

But for those just getting off the ground, I’ve put together the steps we just used for a survey of fungal foliar epiphytes in Hawaii.  This post starts with downloading your Illumina data and takes you through preparing a fasta file of your combined samples, properly formatted for QIIME and ready to start analyzing.  The most important difference to keep in mind when working with ITS as opposed to “easy” markers like 16S is that ITS sequences are highly variable.  This means they don’t align well and are generally not very useful for inferring phylogeny.  They are, however, our current best marker gene for fungi (though some other candidates may be emerging!).  See this paper for a full discussion:  http://www.pnas.org/content/109/16/6241.full.pdf

———————————————————————————————————————————————————————————-

 

First, like any bioinformatics project, it helps if you set up your project with a simple directory structure and carry out everything within that.  I like to use three main folders within each project,  ./Seqs/  ./OTUs/ ./Analyses/  but whatever you are comfortable with  as long as you can keep track of where things are.  I strongly recommend the excellent book Bioinformatics Data Skills, by Vince Buffalo if you are just getting started with bioinformatics.  It thoroughly covers best practices and skills that will apply to every project you will ever work on.

Good documentation for all QIIME scripts can be found at http://qiime.org/scripts

 

———————————————————————————————————————————————————————————-

1. Download your sequence files.  These are in .fastq.gz format and are probably contained within an annoying labyrinth of subdirectories.

2. Get those fastq.gz files all in the same place.  To do this from the command line, navigate to the parent directory that contains all of those read subdirectories and type:

find . -maxdepth 8 -type f -exec mv {} .. \;

This finds all the files buried within up to 8 levels of subdirectories and moves them all to the parent directory.

3. unzip them.  Navigate to the directory with all of those files and type:

gunzip *.fastq.gz

4. Merge the paired ends with PEAR (or don’t, depending on your samples).  Pear can be downloaded here: http://sco.h-its.org/exelixis/web/software/pear/

You can do it by hand for each pair of sequences (very tedious) or you can use a shell script or the perl script “run_pear.pl” from within the directory with your files to do them as a batch.  By default, Pear will do some quality control on your sequences (check the documentation on the download site for specifics) but you will probably want to do some further cleaning of low-quality reads.  There are many options, but I like the fastx-toolkit, from the Hannon Lab.    

perl run_pear.pl *.fastq -o ./merged_reads -g

If you get an error about “not being able to run in parallel,” it means your version of perl is missing Parallel::ForkManager.  In Linux, I was able to fix it simply by using:

cpan Parallel::ForkManager

5. Pear gives you 4 output files for each pair of reads.  For now, we mostly care about the ones that were successfully assembled.  Move them all into a new folder called ./assembled_reads by typing:

mv *.assembled.fastq ./assembled_reads

6. Convert those fastq files into fasta files.  QIIME has a built-in script for this called convert_fastaqual_fastq.py so if you have QIIME installed on your machine or you are running an interactive session on the HPC you can just navigate to the /assembled_reads/ directory and type:

for fn in *.fastq;  do;  convert_fastaqual_fastq.py -c fastq_to_fastaqual -o fastaquals -f $fn;  done

This will loop through all the .fastq files in your directory and run the third line on each one, converting it to a fasta file (along with an associated quality score file which we can ignore for our purposes).  These files will all be jumbled up together so type:

 mv ./assembled_reads/*.fna ./fastas

Now you should have a directory that is something like /YourProject/Seqs/fastas/ which contains all of your merged, quality-filtered, reads in fasta format. (they will be called *.fna, but that doesn’t matter)

7.  Make a mapping file.  (<–Link to example) Remember, you can make this in Excel, but be sure to save it as a TAB-separated plain text file.  If you want a nice list of filenames so you don’t have to type them by hand (you don’t want to do that), just navigate to the new fastas folder and type:

ls -1 >> filenames.txt

That will get you a list (you’ll have to manually edit out the entry that says “filenames.txt”) that you can import into your mapping file under a column named “InputFileName”

Be sure to include any potentially useful columns of metadata here such as various treatment levels, sample locations, etc., but remember that your final column must be titled “Description.”  That column is a good place to give each sample a memorable name that will help you pick them out in subsequent figures.  Since our sequences have been demultiplexed already the columns marked “BarcodeSequence” and “LinkerPrimerSequence” aren’t that important, but they must still be there.

Save this file as something like mapping_file.txt

If you did this in excel and it spits out a .csv file, that’s fine…we can just replace the commas with tabs by typing:

sed -i 's/,/\t/g' mapping_file.csv

8.  Make sure your mapping file is kosher.  Type:

validate_mapping_file.py -b -m mapping_file.txt

If you got something wrong it will let you know.  It will also try to fix errors and if it’s successful will create a new file called mapping_file_corrected.txt

9.  Combine all of those fasta files into one big file using your validated mapping file.
Navigate to the /Seqs/ directory and move your mapping file there.  Now type:

add_qiime_labels.py -i ./fastas/ -m mapping_file.txt -c InputFileName

This gives you one LARGE file called combined_seqs.fna with proper QIIME labels.  This is what you will use for OTU-picking.  My next post will cover some further pipeline options that are useful for fungal data, such as ITSx, and picking OTUs.

Leave a Reply