
When you cluster sequences into Operational Taxonomic Units (OTUs) you are almost certainly masking biologically meaningful information. Think about it… If you have 16S or ITS amplicon data, typically when you cluster them, sequences that have less than 3% variance from each other are possibly being binned into the same OTU. These small differences might actually denote different strains of the same organism or different species entirely. It’s certainly handy to bin these sequences and treat the bins like they are “species,” but it’s not necessarily the best course of action. The main headache comes a bit later down the road though: It isn’t possible to compare OTUs between studies unless ALL the raw sequences you are interested in are re-clustered together from scratch simultaneously. This is because the sequences that wind up binned together in an OTU depend on the other sequences in the pool when clustering occurs. There are a couple of great tools that are now readily available that allow amplicon data from Illumina to be processed WITHOUT needing to cluster OTUs. They operate on the principle of estimating a sequencing error profile from the data and inferring Ribosomal Sequence Variants (RSVs) from error-corrected read data. I won’t go into how they work, but here are links to their websites and associated papers.
UNOISE2:
DADA2: http://doi:10.1038/nmeth.3869
I’ll be discussing DADA2, because it is entirely open-source and available as a convenient R package. I won’t be going over how to use it since the tutorial at the DADA2 website is excellent on its own, but I will walk through a recent test I ran, comparing different ways of preparing data for DADA2 processing.
The primary DADA2 workflow takes your raw paired-end fastq files, filters them, estimates error profiles, and then merges Fwd and Rev reads after error correction. It is also possible to just use the Fwd reads (rather common due to significantly lower read quality on Illumina Rev reads). The question that I and some others had was how the outcomes of these two workflows compared to each other, and in particular, how did they compare to data that had already been merged outside of DADA2. Fungal ITS amplicon studies often use PEAR to merge Fwd and Rev reads (see previous posts). Could we simply follow this standard pipeline, merging with PEAR, and then using the assembled reads as input to DADA2?
To test this I used a rather random selection of 44 Illumina ITS1 amplicon samples from the Sequence Read Archive (SRA) (Accessions: SRR3480407 – SRR3480450) and ran the same DADA2 process on 1) Just the Fwd reads, 2) The reads merged with PEAR, and 3) The raw reads, merged within DADA2, as per the standard pipeline in the DADA2 tutorial. All code for these tests is below.
Let’s get to the results:
This table shows the RSV count (number of unique sequence variants), the read count (number of valid sequences assigned to an RSV), the proportion of chimeric sequences detected (and removed) and the mean Shannon alpha diversity index for the 44 samples tested.
Right away we can see that merging the data with PEAR prior to running it through DADA2 made some big changes in the outcome. Fewer chimeric sequences were found, but far fewer reads made it into far fewer RSVs.
Looking at species accumulation curves for each method, we see that the highest values come from using the forward reads only (Green). Close behind, the Fwd and Rev reads merged within the standard DADA2 pipeline (Black), and far below, the Fwd and Rev reads merged with PEAR prior to running through DADA2 (Red).
Shannon diversity shows a similar pattern. Those extra RSVs present in the Fwd read version of the data don’t account for much extra diversity within samples. But the pre-merged samples have significantly lower alpha diversity.
Beta diversity analysis also shows that the pre-merged samples come out looking very different.
Again, one of the benefits of skipping OTU clustering is that RSVs are directly comparable between analyses as long as the filtration parameters were the same. This Venn-Diagram shows the proportion of RSVs shared between the different merging strategies. Skipping the reverse reads entirely yielded roughly double the RSVs as opposed to merging within DADA2. These “extra” RSVs seem to be due to the DADA2 merging process dumping a lot of reads that didn’t overlap with sufficient quality after error correction. That’s not too surprising given the relatively junky quality of the reverse reads (see below). Merging with PEAR ahead of time seemed to dump even more reads. Only 119 RSVs were shared between all strategies.
Forward read quality stats.
Reverse read quality stats.
Well, that’s it for now. I don’t want to say anything stupid about whether it’s a good or bad thing that merging with PEAR prior to DADA2 seems to drop a lot of sequences, but preventing the loss of biological signal was one of the reasons for moving toward OTU-free bioinformatics in the first place. Until I am persuaded otherwise, I’ll probably stick with using only the forward reads when I use DADA2.
Here’s the R code I used. If you want to try this for your own data, just point this code to directories where you have the Fwd reads only, the successfully-assembled (PEAR) reads, and the raw Fwd and Rev reads.
##################################################################################### # Forward Reads Only ##################################################################################### library(dada2); packageVersion("dada2") library(ShortRead) # File parsing - For this, we will use only the forward illumina reads - make sure to move fwd reads into their own directory for simplest processing path <- "~/PATH/TO/FWD/READS/" # CHANGE to the directory containing your demultiplexed fastq files filtpath <- file.path(path, "filtered") # Filtered files go into the filtered/ subdirectory if(!file_test("-d", filtpath)) dir.create(filtpath) # make directory for filtered fqs if not already present fns <- list.files(path) fastqs <- fns[grepl(".fastq$", fns)] # CHANGE if different file extensions or to target only certain sequences fns = file.path(path, fns) fns = fns[grepl(".fastq$", fns)] # visualize a couple of fwd read quality profiles to help you decide reasonable filtration parameters plotQualityProfile(fns[[1]]) # Filter fastqs and store in new directory for(i in seq_along(fns)) { fastq <- fastqs[[i]] fastqFilter(fn = file.path(path,fastq), fout = file.path(filtpath, fastq), trimLeft=12, truncLen=240, maxEE=1, truncQ=2, maxN=0, rm.phix=TRUE, compress=TRUE, verbose=TRUE) } # File parsing fns = list.files(filtpath) fns = file.path(filtpath, fns) filts <- fns[grepl("fastq$", fns)] # CHANGE if different file extensions sample.names <- sapply(strsplit(basename(filts), "_"), `[`, 1) # Assumes filename = samplename_XXX.fastq.gz names(filts) <- sample.names drp.learn <- derepFastq(filts, verbose = TRUE) dd.learn <- dada(drp.learn, err=NULL, selfConsist=TRUE, multithread=TRUE) err <- dd.learn[[1]]$err_out rm(drp.learn);rm(dd.learn) # Sample inference derep <- derepFastq(filts) dds <- dada(derep, err=err, multithread=TRUE) # Construct sequence table and remove chimeras seqtab.fwd <- makeSequenceTable(dds) seqtab.nochimera.fwd <- removeBimeraDenovo(seqtab.fwd, multithread=TRUE) # This is the RSV table used for community analyses # fraction of chimeric sequences detected chimeras.fwd = 1-(sum(seqtab.nochimera.fwd)/sum(seqtab.fwd)) # make data frame df.fwd = as.data.frame(seqtab.nochimera.fwd) ##################################################################################### # Paired with PEAR first ##################################################################################### # File parsing - For this, we will use only the reads already merged with PEAR path <- "~/PATH/TO/PEAR-MERGED/READS/" # CHANGE to the directory containing your merged, assembled fastq files filtpath <- file.path(path, "filtered") # Filtered files go into the filtered/ subdirectory if(!file_test("-d", filtpath)) dir.create(filtpath) # make directory for filtered fqs if not already present fns <- list.files(path) fastqs <- fns[grepl(".fastq$", fns)] # CHANGE if different file extensions or to target only certain sequences fns = file.path(path, fns) fns = fns[grepl(".fastq$", fns)] # visualize a couple of fwd read quality profiles to help you decide reasonable filtration parameters plotQualityProfile(fns[[1]]) # Filter fastqs and store in new directory for(i in seq_along(fns)) { fastq <- fastqs[[i]] fastqFilter(fn = file.path(path,fastq), fout = file.path(filtpath, fastq), trimLeft=12, truncLen=240, maxEE=1, truncQ=2, maxN=0, rm.phix=TRUE, compress=TRUE, verbose=TRUE) } # File parsing fns = list.files(filtpath) fns = file.path(filtpath, fns) filts <- fns[grepl("fastq$", fns)] # CHANGE if different file extensions sample.names <- sapply(strsplit(basename(filts), "_"), `[`, 1) # Assumes filename = samplename_XXX.fastq.gz names(filts) <- sample.names drp.learn <- derepFastq(filts, verbose = TRUE) dd.learn <- dada(drp.learn, err=NULL, selfConsist=TRUE, multithread=TRUE) err <- dd.learn[[1]]$err_out rm(drp.learn);rm(dd.learn) # Sample inference derep <- derepFastq(filts) dds <- dada(derep, err=err, multithread=TRUE) # Construct sequence table and remove chimeras seqtab.pear <- makeSequenceTable(dds) seqtab.nochimera.pear <- removeBimeraDenovo(seqtab.pear, multithread=TRUE) # This is the RSV table used for community analyses # fraction of chimeric sequences detected chimeras.pear = 1-(sum(seqtab.nochimera.pear)/sum(seqtab.pear)) # make data frame df.pear = as.data.frame(seqtab.nochimera.pear) ##################################################################################### # Paired within DADA2 ##################################################################################### path <- "~/PATH/TO/RAW/READS/" # CHANGE to the directory containing your raw fwd and rev reads filt_path <- file.path(path, "filtered") # Filtered files go into the filtered/ subdirectory # Sort ensures forward/reverse reads are in same order fnFs <- sort(list.files(path, pattern="_R1_.fastq")) fnRs <- sort(list.files(path, pattern="_R2_.fastq")) # inspect fwd and rev reads plotQualityProfile(fnFs[[1]]) # representative fwd file plotQualityProfile(fnRs[[1]]) # representative rev file # Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq sample.names <- sapply(strsplit(fnFs, "_"), `[`, 1) # Specify the full path to the fnFs and fnRs fnFs <- file.path(path, fnFs) fnRs <- file.path(path, fnRs) filtFs <- file.path(filt_path, paste0(sample.names, "_F_filt.fastq.gz")) filtRs <- file.path(filt_path, paste0(sample.names, "_R_filt.fastq.gz")) #filter fwd and rev reads out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(240,160), maxN=0, maxEE=c(1,1), truncQ=2, rm.phix=TRUE, compress=TRUE, multithread=TRUE, trimLeft = 12) # learn errors errF <- learnErrors(filtFs, multithread=TRUE) errR <- learnErrors(filtRs, multithread=TRUE) # de-replicate derepFs <- derepFastq(filtFs, verbose=TRUE) derepRs <- derepFastq(filtRs, verbose=TRUE) # Name the derep-class objects by the sample names names(derepFs) <- sample.names names(derepRs) <- sample.names # infer sequence variants in each sample dadaFs <- dada(derepFs, err=errF, multithread=TRUE) dadaRs <- dada(derepRs, err=errR, multithread=TRUE) # merge paried end reads after error correction mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE) # construct RSV table seqtab.dd2_merged <- makeSequenceTable(mergers) # remove chimeras seqtab.nochimera.dd2_merged <- removeBimeraDenovo(seqtab.dd2_merged, multithread=TRUE) # fraction of chimeric sequences chimeras.dd2_merged = 1 - sum(seqtab.nochimera.dd2_merged)/sum(seqtab.dd2_merged) # make data frame df.dd2_merged = as.data.frame(seqtab.nochimera.dd2_merged) ##################################################################################### # Simple Analyses to compare methods ##################################################################################### # inspect dimensions dim(seqtab.nochimera.fwd); chimeras.fwd dim(seqtab.nochimera.pear); chimeras.pear dim(seqtab.nochimera.dd2_merged); chimeras.dd2_merged # Tables of sequence lengths (and counts) table(nchar(names(df.fwd))) table(nchar(names(df.pear))) table(nchar(names(df.dd2_merged))) # Remove sequences longer than 228 bp from DD2-merged seq table right.length = which(nchar(names(df.dd2_merged)) == 228) df.dd2_merged = df.dd2_merged[,right.length] library(vegan) spacc.fwd = specaccum(df.fwd) spacc.pear = specaccum(df.pear) spacc.dd2_merged = specaccum(df.dd2_merged) plot(spacc.fwd, col = 'Green') plot(spacc.pear, add = TRUE, col = 'Red') plot(spacc.dd2_merged, add = TRUE) # NMDS meta.fw = metaMDS(df.fwd) mds1.fw = meta.fw$points[,1] mds2.fw = meta.fw$points[,2] meta.pear = metaMDS(df.pear) mds1.pear = meta.pear$points[,1] mds2.pear = meta.pear$points[,2] meta.dd = metaMDS(df.dd2_merged) mds1.dd = meta.dd$points[,1] mds2.dd = meta.dd$points[,2] # build data frame NMDS = data.frame(MDS1 = c(mds1.fw,mds1.pear,mds1.dd), MDS2 = c(mds2.fw,mds2.pear,mds2.dd), Method = c(rep("Fwd_Reads",44),rep("PEAR",44),rep("DD2",44))) summary(NMDS$MDS1) summary(NMDS$MDS2) NMDS = NMDS[NMDS$MDS1 < 0,] # get rid of single crazy outlier to make plotting visible # plot library(ggplot2) ggplot(NMDS, mapping = aes(x=MDS1,y=MDS2,col=Method)) + geom_point() + theme_bw() + ggtitle("NMDS plot of Bray Dissimilarity Colored by Merging Method") # Add grouping metadata Trt = c(rep("Trt_1",22),rep("Trt_2",22)) # NMDS plot for each method colored by artificial groups ggplot(mapping = aes(x=mds1.fw[mds1.fw<0],y=mds2.fw[mds1.fw<0],col=Trt[mds1.fw<0])) + geom_point() ggplot(mapping = aes(x=mds1.pear[mds1.pear<0],y=mds2.pear[mds1.pear<0],col=Trt[mds1.pear<0])) + geom_point() ggplot(mapping = aes(x=mds1.dd[mds1.dd<0],y=mds2.dd[mds1.dd<0],col=Trt[mds1.dd<0])) + geom_point() # Adonis for each method by artificial groupings adon.fwd = adonis(vegdist(df.fwd) ~ Trt) adon.fwd adon.pear = adonis(vegdist(df.pear) ~ Trt) adon.pear adon.dd = adonis(vegdist(df.dd2_merged) ~ Trt) adon.dd # shannon diversity shannon.fwd = (diversity(df.fwd)) shannon.pear = diversity(df.pear) shannon.dd = diversity(df.dd2_merged) # build data frame Shannon = data.frame(Shannon = c(shannon.fwd,shannon.pear,shannon.dd), Method = c(rep("Fwd_Reads",44),rep("PEAR",44),rep("DD2",44))) #plot ggplot(Shannon, aes(x=Method,y=Shannon,fill=Method)) + geom_boxplot() + theme_bw() # how many RSVs from PEAR method are seen in other methods? library(VennDiagram) Fwd = names(df.fwd) PEAR = names(df.pear) DD2 = names(df.dd2_merged) # make VennDiagram venn.diagram(list(Fwd = Fwd, PEAR = PEAR, DD2 = DD2),"~/Desktop/dada2_test/output/Venn.tiff") # merge RSV tables / need to edit sample (row) names first to remove duplicates row.names(seqtab.nochimera.fwd) <- paste(row.names(seqtab.nochimera.fwd),"_Fwd", sep = "") row.names(seqtab.nochimera.pear) <- paste(row.names(seqtab.nochimera.pear),"_Pear", sep = "") row.names(seqtab.nochimera.dd2_merged) <- paste(row.names(seqtab.nochimera.dd2_merged),"_DD2", sep = "") # merge tables seqtab = mergeSequenceTables(seqtab.nochimera.fwd,seqtab.nochimera.pear,seqtab.nochimera.dd2_merged) # add grouping variables Method = c(rep("Fwd",44),rep("Pear",44),rep("DD2",44)) #beta dispersion beta_dispersion = betadisper(vegdist(seqtab, method = "jaccard", binary = TRUE),Method) plot(beta_dispersion) # presence absence seqtab.pa = decostand(seqtab, method = 'pa') #plot sample richness ggplot(mapping = aes(x=row.names(seqtab.pa),y=rowSums(seqtab.pa),col=Method)) + geom_point() + theme_bw() + ggtitle("Sample Richness") + labs(x="Samples",y="Sample Richness") + theme(axis.text.x = element_blank(), legend.title = element_text()) # comparison summary info # number of RSVs found with each method a = length(names(df.fwd)) b=length(names(df.pear)) c=length(names(df.dd2_merged)) # number of valid read counts found with each method d=sum(colSums(df.fwd)) e=sum(colSums(df.pear)) f=sum(colSums(df.dd2_merged)) # table of summary info data.frame(Method = c("Fwd_Only","PEAR_merged","DD_merged"), RSV_count = c(a,b,c), Read_Count = c(d,e,f), Percent_Chimeras = c(chimeras.fwd,chimeras.pear,chimeras.dd2_merged), Mean_Shannon_Div = c(mean(shannon.fwd),mean(shannon.pear),mean(shannon.dd))) # ===================