It seems that getting all of the programs and their dependencies installed on your machine correctly is perhaps THE major hurdle in starting a bioinformatics project. If you are lucky enough to have access to a good HPC server with a bioinformatician on staff, she or he will probably keep everything up-to-date and working properly, but you will: 1) Inevitably want to run some software that isn’t installed on your server, and 2) Get tired of logging on, transferring files, writing job scripts, and waiting “in line” to run simple tasks that your average laptop these days can handle just fine. You’ll need to have those “bread and butter” programs installed on your personal (or lab) machine and you’ll need to have them playing nicely together.
There is a veritable cornucopia of freely available programs, scripts, and languages to do most any data task you will need. Most of these are expertly crafted, but some are buggy at best. To add to this crippling variety of choices, each of these tools likely has several versions available and some of them won’t work on your computing platform, whether it be Mac, Windows, or Linux. In general you can find ways to get the job done on any of these systems, but Windows users will have a tougher time of it. If there’s something I’m evangelical about, it’s Linux (and MoBio soil DNA kits!) so I’ll be covering things from that perspective, but I’ll also include download links for Mac versions when I can find them. The setup should be similar since both are based on a Unix framework. I can’t hope to be comprehensive, and programs are constantly being updated/created, but here’s a brief list of the bread and butter stuff you’ll want on your machine:
The links above will take you to download sites, and these all have pretty good installation instructions, but I’ll briefly go over each one.
BASH is the basic shell that lets you interact with your computer; it’s what you are talking to whenever you work on the command line. There are other shells out there, but BASH is probably the best. If you have a Mac or a Linux machine, you already have BASH installed (check by typing “echo $SHELL” into your terminal). There are some neat tools to improve your command line experience and increase your productivity once you are used to the BASH interface. One of the best I’ve found is called tmux. It lets you run several shells side by side in one terminal, so you can be logged into a server and running a script on your own machine at the same time.
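As a quick sanity check, something like the snippet below reports your login shell and whether tmux is reachable. Treat it as a sketch: the helper name “have” is just something I made up for illustration, and the apt-get hint assumes a Debian/Ubuntu-style system. The tmux commands in the comments are the standard ones for starting, detaching from, and re-attaching to a session.

```shell
# Report the login shell and check for tmux.
# "have" is a hypothetical helper name used only in this example.

have() {
    # Succeeds if the named command can be found on PATH.
    command -v "$1" >/dev/null 2>&1
}

echo "Login shell: ${SHELL:-unknown}"

if have tmux; then
    echo "tmux found: $(command -v tmux)"
else
    echo "tmux not found (on Debian/Ubuntu: sudo apt-get install tmux)"
fi

# Typical tmux workflow, typed interactively:
#   tmux new -s work        # start a session named "work"
#   Ctrl-b c                # open another window (a second shell)
#   Ctrl-b d                # detach, leaving everything running
#   tmux attach -t work     # come back to it later
```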
Python is a pretty intuitive coding language and Anaconda is a distribution that sets it up with a bunch of useful tools for scientific computing. At this point it’s best to go with the latest version of Python v2, and avoid Python v3. This will probably change in the near future, but for now QIIME runs Python 2 so that’s what we want. Actually, it runs only Python versions “greater than 2.7, but less than 3.0” so, yeah.
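If you aren’t sure which Python your terminal picks up, a small check along these lines may help. This is only a sketch: the function name is hypothetical, and I am reading “greater than 2.7, but less than 3.0” as “any 2.7.x release”. Note that Python 2 prints its version to stderr, hence the “2>&1”.

```shell
# Succeeds only for version strings in the 2.7.x range (e.g. "2.7.18").
# version_ok_for_qiime is a hypothetical name, not part of QIIME.
version_ok_for_qiime() {
    v_major=${1%%.*}          # text before the first dot
    v_rest=${1#*.}            # text after the first dot
    v_minor=${v_rest%%.*}     # second component
    [ "$v_major" = "2" ] && [ "$v_minor" -ge 7 ] 2>/dev/null
}

if command -v python >/dev/null 2>&1; then
    # "python --version" prints e.g. "Python 2.7.18"; keep the second field.
    found=$(python --version 2>&1 | awk '{print $2}')
    if version_ok_for_qiime "$found"; then
        echo "Python $found should work with QIIME"
    else
        echo "Python $found is outside the 2.7.x range QIIME expects"
    fi
else
    echo "No 'python' command found on PATH"
fi
```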
Perl is probably already installed on your machine as well (check your version by typing “perl -v” into your terminal). Lots of useful scripts are floating around out there written in this language so it will come in handy for sure!
QIIME is the altogether far-too-useful wrapper for a great collection of bioinformatics scripts developed in Python. It is brought to us by the good people at the Knight and Caporaso labs, so send them a brief thank you note sometime! For now, don’t bother with trying to download the full version. We can do pretty much everything with the QIIME Base Install. If you are on a Mac, you’ll probably want to go with MacQIIME, developed by the Werner lab, but Linux users and those who want to do a native install on their system should do a QIIME Base Install. (Note: I have not used MacQIIME, but it is a self-contained package, similar in some ways to a virtual machine, and aspects of it will be invisible to your system unless you explicitly call them. People seem to like it just fine.)
In Linux, getting a base install of QIIME (once you have installed Python v2.7+) is as easy as typing the following into your terminal:
sudo pip install numpy
sudo pip install qiime
At this point you will be able to call QIIME commands straight from your terminal! Test it out by typing:
print_qiime_config.py -t
If you are all set up you’ll get a bunch of output describing your system and versions of the dependent software, along with QIIME’s current default parameters (these are very important to be aware of). It should look something like this:
[Screenshot: QIIME config test example]
RStudio is a collection of tools that makes working with R much more enjoyable and productive. You can keep track of all the values, variables, and objects in your environment, see plots in real-time, and edit scripts simultaneously. It also has some limited point-and-click functionality for loading data and browsing help files (being able to view help files while composing code is my favourite feature!). Make sure you install R before you install RStudio. Things will go more smoothly. Some versions of Linux come with R pre-installed; you can test this by typing “R --version” into your terminal. Here’s a glimpse of RStudio’s interface:
The Fastx-Toolkit is a great collection of easy-to-use scripts that can manipulate and perform quality control on your fastq and fasta sequence files. These were developed by the Hannon Lab, and they maintain them well. To get the most recent version you will need to download and compile the source code. This sounds more difficult than it is. Click this link to download version 0.0.14 (the most recent version as of this writing), and click this link to download the libgtextutils package (required to run Fastx-Toolkit). Extract the compressed files into a convenient location such as “~/Desktop/BioInfoTools” . . . Now enter your terminal, navigate to the uncompressed libgtextutils-0.7 directory and type:
./configure
make
sudo make install
Now, navigate to the fastx_toolkit-0.0.14 directory and type:
sudo apt-get install gnuplot
sudo apt-get install gcc
./configure
make
sudo make install
That was too easy!
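Source packages like these are usually built with the familiar configure, make, install sequence, so if you end up compiling several tools it can be handy to wrap that in a function. The one below is a minimal sketch, not anything shipped with the toolkit: the function name and the DRY_RUN switch are my own inventions, and it assumes the unpacked directory contains a “configure” script.

```shell
# Run the usual three build steps inside a source directory.
# build_and_install and DRY_RUN are hypothetical names for this example.
#   $1: directory containing the unpacked source
# Set DRY_RUN=1 to print the commands instead of running them.
build_and_install() {
    src_dir=$1
    if [ ! -d "$src_dir" ]; then
        echo "No such directory: $src_dir" >&2
        return 1
    fi
    for step in "./configure" "make" "sudo make install"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "(cd $src_dir && $step)"
        else
            # $step is left unquoted on purpose so it splits into words.
            (cd "$src_dir" && $step) || return 1
        fi
    done
}

# Example, using the directory suggested in the text (adjust to your own):
#   DRY_RUN=1 build_and_install ~/Desktop/BioInfoTools/fastx_toolkit-0.0.14
#   build_and_install ~/Desktop/BioInfoTools/fastx_toolkit-0.0.14
```

Doing a dry run first is a cheap way to confirm you are pointing at the right directory before sudo gets involved.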
Pear is not exactly essential, but if you are working with paired-end ITS reads (such as those from most Illumina runs) it’s the best option for merging your forward and reverse reads. Click here to download the source code. Uncompress the file somewhere handy again and type the following to compile and install:
./configure
make
sudo make install
Now, you’ll probably want to use the Perl script, run_pear.pl, to efficiently utilize Pear on lots of files at once. You can download it here by right-clicking the page at that link and selecting “Save page as…” All you have to do is save the file somewhere in one of your PATH directories. To see what is in your PATH variable, type “echo $PATH” into your terminal. Move run_pear.pl into one of those directories and type “exec bash” into your terminal to start a fresh shell. You can now call this script from any directory using “perl -S run_pear.pl . . .” (making sure to pass it the appropriate parameters, of course). The -S flag tells Perl to search your PATH for the script; a plain “perl run_pear.pl” only looks in the current directory. I got a funny error at first trying to run this Perl script because I was missing the “Parallel::ForkManager” module. If you get that error, you can fix it by typing:
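The exact command isn’t reproduced here, so take the following as a hedged sketch rather than the original fix. It lists your PATH directories (marking the ones you can write to without sudo) and checks whether Perl can load the module. The two function names are hypothetical; the install commands in the comments are two common routes, one via the cpan client and one via the Debian/Ubuntu package.

```shell
# Print each PATH directory on its own line, marking the writable ones
# (handy when deciding where run_pear.pl should live).
list_path_dirs() {
    printf '%s\n' "$PATH" | tr ':' '\n' | while IFS= read -r dir; do
        if [ -d "$dir" ] && [ -w "$dir" ]; then
            echo "writable  $dir"
        else
            echo "          $dir"
        fi
    done
}

# Succeeds if Perl can load the named module, e.g. Parallel::ForkManager.
perl_has_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

list_path_dirs

if perl_has_module Parallel::ForkManager; then
    echo "Parallel::ForkManager is installed"
else
    echo "Parallel::ForkManager is missing"
    # Two common ways to install it (pick one):
    #   sudo cpan Parallel::ForkManager
    #   sudo apt-get install libparallel-forkmanager-perl   # Debian/Ubuntu
fi
```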
That should fix it up!
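Once everything is installed, a short loop can confirm that each tool is actually reachable from your terminal. This is just a sketch: “check_tools” is a made-up name, and the command names in the last line are my assumptions about what each install puts on your PATH (fastx_trimmer stands in for the Fastx-Toolkit, print_qiime_config.py for QIIME).

```shell
# Report which commands are reachable; succeeds only if all of them are.
# check_tools is a hypothetical helper written for this example.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok       $tool"
        else
            echo "MISSING  $tool"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

check_tools bash python perl R print_qiime_config.py fastx_trimmer pear tmux \
    || echo "Some tools are not on your PATH yet"
```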
If you want to practice with some of this shiny new software on a tiny data set you can get one here. Inside this archive are small subsets of three samples of Hawaiian epiphytic fungi. Each sample consists of a forward “R1” and reverse “R2” read.
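To get a feel for how forward and reverse files pair up, here is a small sketch that matches each R1 file with its R2 mate and prints (but does not run) a Pear command for each pair. The file naming pattern (“_R1” and “_R2” somewhere in a name ending in “.fastq”) and the “_merged” output prefix are assumptions on my part, so check them against the actual file names in the archive. The -f, -r, and -o options are Pear’s forward, reverse, and output arguments.

```shell
# Given a forward-read filename containing "_R1", print the reverse-read name.
mate_of() {
    echo "$1" | sed 's/_R1/_R2/'
}

# Print one pear command per R1/R2 pair found in a directory.
#   $1: directory holding the fastq files
print_pear_commands() {
    dir=$1
    for fwd in "$dir"/*_R1*.fastq; do
        [ -e "$fwd" ] || continue              # the glob matched nothing
        name=$(basename "$fwd")
        rev="$dir/$(mate_of "$name")"
        if [ -e "$rev" ]; then
            # ${name%%_R1*} keeps the sample name before "_R1"
            echo "pear -f $fwd -r $rev -o $dir/${name%%_R1*}_merged"
        else
            echo "no mate found for $fwd" >&2
        fi
    done
}

# Example (hypothetical directory name):
#   print_pear_commands ~/Desktop/BioInfoTools/practice_data
```

Printing the commands first lets you eyeball the pairings before committing to a run.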