Breaking free from OTUs – If and when to merge paired-end reads

flower_day

When you cluster sequences into Operational Taxonomic Units (OTUs) you are almost certainly masking biologically meaningful information.  Think about it… If you have 16S or ITS amplicon data, typically when you cluster them, sequences that have less than 3% variance from each other are possibly being binned into the same OTU. These small differences might actually denote different strains of the same organism or different species entirely. It’s certainly handy to bin these sequences and treat the bins like they are “species,” but it’s not necessarily the best course of action. The main headache comes a bit later down the road though: It isn’t possible to compare OTUs between studies unless ALL the raw sequences you are interested in are re-clustered together from scratch simultaneously.  This is because the sequences that wind up binned together in an OTU depend on the other sequences in the pool when clustering occurs. There are a couple of great tools that are now readily available that allow amplicon data from Illumina to be processed WITHOUT needing to cluster OTUs. They operate on the principle of estimating a sequencing error profile from the data and inferring Ribosomal Sequence Variants (RSVs) from error-corrected read data. I won’t go into how they work, but here are links to their websites and associated papers.

UNOISE2:

DADA2:          http://doi:10.1038/nmeth.3869

I’ll be discussing DADA2, because it is entirely open-source and available as a convenient R package.  I won’t be going over how to use it since the tutorial at the DADA2 website is excellent on its own, but I will walk through a recent test I ran, comparing different ways of preparing data for DADA2 processing.

The primary DADA2 workflow takes your raw paired-end fastq files, filters them, estimates error profiles, and then merges Fwd and Rev reads after error correction. It is also possible to just use the Fwd reads (rather common due to significantly lower read quality on Illumina Rev reads). The question that I and some others had was how the outcomes of these two workflows compared to each other, and in particular, how did they compare to data that had already been merged outside of DADA2.  Fungal ITS amplicon studies often use PEAR to merge Fwd and Rev reads (see previous posts). Could we simply follow this standard pipeline, merging with PEAR, and then using the assembled reads as input to DADA2?

To test this I used a rather random selection of 44 Illumina ITS1 amplicon samples from the Sequence Read Archive (SRA) (Accessions: SRR3480407 – SRR3480450) and ran the same DADA2 process on 1) Just the Fwd reads, 2) The reads merged with PEAR, and 3) The raw reads, merged within DADA2, as per the standard pipeline in the DADA2 tutorial.  All code for these tests is below.

Let’s get to the results:

table

 

 

This table shows the RSV count (number of unique sequence variants), the read count (number of valid sequences assigned to an RSV), the proportion of chimeric sequences detected (and removed) and the mean Shannon alpha diversity index for the 44 samples tested.

Right away we can see that merging the data with PEAR prior to running it through DADA2 made some big changes in the outcome. Fewer chimeric sequences were found, but far fewer reads made it into far fewer RSVs.

spp_accum_curves

 

 

 

Looking at species accumulation curves for each method, we see that the highest values come from using the forward reads only (Green).  Close behind, the Fwd and Rev reads merged within the standard DADA2 pipeline (Black), and far below, the Fwd and Rev reads merged with PEAR prior to running through DADA2 (Red).

 

Shannon_div

 

 

 

 

 

Shannon diversity shows a similar pattern.  Those extra RSVs present in the Fwd read version of the data don’t account for much extra diversity within samples. But the pre-merged samples have significantly lower alpha diversity.

 

beta_dispersion

 

Beta diversity analysis also shows that the pre-merged samples come out looking very different.

 

Venn

 

Again, one of the benefits of skipping OTU clustering is that RSVs are directly comparable between analyses as long as the filtration parameters were the same.  This Venn-Diagram shows the proportion of RSVs shared between the different merging strategies.  Skipping the reverse reads entirely yielded roughly double the RSVs as opposed to merging within DADA2.  These “extra” RSVs seem to be due to the DADA2 merging process dumping a lot of reads that didn’t overlap with sufficient quality after error correction. That’s not too surprising given the relatively junky quality of the reverse reads (see below).  Merging with PEAR ahead of time seemed to dump even more reads.  Only 119 RSVs were shared between all strategies.

 

fwd-qual

Forward read quality stats.

 

 

 

 

 

 

 

rev-qual

Reverse read quality stats.

 

 

 

 

 

 

 

 

 

Well, that’s it for now. I don’t want to say anything stupid about whether it’s a good or bad thing that merging with PEAR prior to DADA2 seems to drop a lot of sequences, but preventing the loss of biological signal was one of the reasons for moving toward OTU-free bioinformatics in the first place. Until I am persuaded otherwise, I’ll probably stick with using only the forward reads when I use DADA2.

 

Here’s the R code I used. If you want to try this for your own data, just point this code to directories where you have the Fwd reads only, the successfully-assembled (PEAR) reads, and the raw Fwd and Rev reads.

#####################################################################################
# Forward Reads Only
#####################################################################################

library(dada2); packageVersion("dada2")
library(ShortRead)

# File parsing - For this, we will use only the forward illumina reads - make sure to move fwd reads into their own directory for simplest processing
path <- "~/PATH/TO/FWD/READS/" # CHANGE to the directory containing your demultiplexed fastq files
filtpath <- file.path(path, "filtered") # Filtered files go into the filtered/ subdirectory
if(!file_test("-d", filtpath)) dir.create(filtpath) # make directory for filtered fqs if not already present
fns <- list.files(path)
fastqs <- fns[grepl(".fastq$", fns)] # CHANGE if different file extensions or to target only certain sequences

fns = file.path(path, fns)
fns = fns[grepl(".fastq$", fns)]
# visualize a couple of fwd read quality profiles to help you decide reasonable filtration parameters
plotQualityProfile(fns[[1]])

# Filter fastqs and store in new directory
for(i in seq_along(fns)) {
  fastq <- fastqs[[i]]
  fastqFilter(fn = file.path(path,fastq), fout = file.path(filtpath, fastq),
              trimLeft=12, truncLen=240,
              maxEE=1, truncQ=2, maxN=0, rm.phix=TRUE,
              compress=TRUE, verbose=TRUE)
}

# File parsing
fns = list.files(filtpath)
fns = file.path(filtpath, fns)
filts <- fns[grepl("fastq$", fns)] # CHANGE if different file extensions
sample.names <- sapply(strsplit(basename(filts), "_"), `[`, 1) # Assumes filename = samplename_XXX.fastq.gz
names(filts) <- sample.names

drp.learn <- derepFastq(filts, verbose = TRUE)
dd.learn <- dada(drp.learn, err=NULL, selfConsist=TRUE, multithread=TRUE)
err <- dd.learn[[1]]$err_out
rm(drp.learn);rm(dd.learn)

# Sample inference
derep <- derepFastq(filts)
dds <- dada(derep, err=err, multithread=TRUE)

# Construct sequence table and remove chimeras
seqtab.fwd <- makeSequenceTable(dds)
seqtab.nochimera.fwd <- removeBimeraDenovo(seqtab.fwd, multithread=TRUE) # This is the RSV table used for community analyses

# fraction of chimeric sequences detected
chimeras.fwd = 1-(sum(seqtab.nochimera.fwd)/sum(seqtab.fwd))

# make data frame
df.fwd = as.data.frame(seqtab.nochimera.fwd)





#####################################################################################
# Paired with PEAR first
#####################################################################################

# File parsing - For this, we will use only the reads already merged with PEAR
path <- "~/PATH/TO/PEAR-MERGED/READS/" # CHANGE to the directory containing your merged, assembled fastq files
filtpath <- file.path(path, "filtered") # Filtered files go into the filtered/ subdirectory
if(!file_test("-d", filtpath)) dir.create(filtpath) # make directory for filtered fqs if not already present
fns <- list.files(path)
fastqs <- fns[grepl(".fastq$", fns)] # CHANGE if different file extensions or to target only certain sequences

fns = file.path(path, fns)
fns = fns[grepl(".fastq$", fns)]

# visualize a couple of fwd read quality profiles to help you decide reasonable filtration parameters
plotQualityProfile(fns[[1]])

# Filter fastqs and store in new directory
for(i in seq_along(fns)) {
  fastq <- fastqs[[i]]
  fastqFilter(fn = file.path(path,fastq), fout = file.path(filtpath, fastq),
              trimLeft=12, truncLen=240,
              maxEE=1, truncQ=2, maxN=0, rm.phix=TRUE,
              compress=TRUE, verbose=TRUE)
}

# File parsing
fns = list.files(filtpath)
fns = file.path(filtpath, fns)
filts <- fns[grepl("fastq$", fns)] # CHANGE if different file extensions
sample.names <- sapply(strsplit(basename(filts), "_"), `[`, 1) # Assumes filename = samplename_XXX.fastq.gz
names(filts) <- sample.names

drp.learn <- derepFastq(filts, verbose = TRUE)
dd.learn <- dada(drp.learn, err=NULL, selfConsist=TRUE, multithread=TRUE)
err <- dd.learn[[1]]$err_out
rm(drp.learn);rm(dd.learn)

# Sample inference
derep <- derepFastq(filts)
dds <- dada(derep, err=err, multithread=TRUE)

# Construct sequence table and remove chimeras
seqtab.pear <- makeSequenceTable(dds)
seqtab.nochimera.pear <- removeBimeraDenovo(seqtab.pear, multithread=TRUE) # This is the RSV table used for community analyses

# fraction of chimeric sequences detected
chimeras.pear = 1-(sum(seqtab.nochimera.pear)/sum(seqtab.pear))

# make data frame
df.pear = as.data.frame(seqtab.nochimera.pear)


#####################################################################################
# Paired within DADA2
#####################################################################################

path <- "~/PATH/TO/RAW/READS/" # CHANGE to the directory containing your raw fwd and rev reads
filt_path <- file.path(path, "filtered") # Filtered files go into the filtered/ subdirectory

# Sort ensures forward/reverse reads are in same order
fnFs <- sort(list.files(path, pattern="_R1_.fastq"))
fnRs <- sort(list.files(path, pattern="_R2_.fastq"))

# inspect fwd and rev reads
plotQualityProfile(fnFs[[1]]) # representative fwd file
plotQualityProfile(fnRs[[1]]) # representative rev file

# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq
sample.names <- sapply(strsplit(fnFs, "_"), `[`, 1)
# Specify the full path to the fnFs and fnRs
fnFs <- file.path(path, fnFs)
fnRs <- file.path(path, fnRs)
filtFs <- file.path(filt_path, paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(filt_path, paste0(sample.names, "_R_filt.fastq.gz"))

#filter fwd and rev reads
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(240,160),
                     maxN=0, maxEE=c(1,1), truncQ=2, rm.phix=TRUE,
                     compress=TRUE, multithread=TRUE, trimLeft = 12)

# learn errors
errF <- learnErrors(filtFs, multithread=TRUE)
errR <- learnErrors(filtRs, multithread=TRUE)

# de-replicate
derepFs <- derepFastq(filtFs, verbose=TRUE)
derepRs <- derepFastq(filtRs, verbose=TRUE)
# Name the derep-class objects by the sample names
names(derepFs) <- sample.names
names(derepRs) <- sample.names

# infer sequence variants in each sample
dadaFs <- dada(derepFs, err=errF, multithread=TRUE)
dadaRs <- dada(derepRs, err=errR, multithread=TRUE)

# merge paried end reads after error correction
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose=TRUE)

# construct RSV table
seqtab.dd2_merged <- makeSequenceTable(mergers)

# remove chimeras
seqtab.nochimera.dd2_merged <- removeBimeraDenovo(seqtab.dd2_merged, multithread=TRUE)

# fraction of chimeric sequences
chimeras.dd2_merged = 1 - sum(seqtab.nochimera.dd2_merged)/sum(seqtab.dd2_merged)

# make data frame
df.dd2_merged = as.data.frame(seqtab.nochimera.dd2_merged)



#####################################################################################
# Simple Analyses to compare methods
#####################################################################################

# inspect dimensions
dim(seqtab.nochimera.fwd); chimeras.fwd
dim(seqtab.nochimera.pear); chimeras.pear
dim(seqtab.nochimera.dd2_merged); chimeras.dd2_merged

# Tables of sequence lengths (and counts)
table(nchar(names(df.fwd)))
table(nchar(names(df.pear)))
table(nchar(names(df.dd2_merged)))

# Remove sequences longer than 228 bp from DD2-merged seq table
right.length = which(nchar(names(df.dd2_merged)) == 228)
df.dd2_merged = df.dd2_merged[,right.length]

library(vegan)
spacc.fwd = specaccum(df.fwd)
spacc.pear = specaccum(df.pear)
spacc.dd2_merged = specaccum(df.dd2_merged)

plot(spacc.fwd, col = 'Green')
plot(spacc.pear, add = TRUE, col = 'Red')
plot(spacc.dd2_merged, add = TRUE)



# NMDS
meta.fw = metaMDS(df.fwd)
mds1.fw = meta.fw$points[,1]
mds2.fw = meta.fw$points[,2]

meta.pear = metaMDS(df.pear)
mds1.pear = meta.pear$points[,1]
mds2.pear = meta.pear$points[,2]

meta.dd = metaMDS(df.dd2_merged)
mds1.dd = meta.dd$points[,1]
mds2.dd = meta.dd$points[,2]

# build data frame
NMDS = data.frame(MDS1 = c(mds1.fw,mds1.pear,mds1.dd), MDS2 = c(mds2.fw,mds2.pear,mds2.dd), Method = c(rep("Fwd_Reads",44),rep("PEAR",44),rep("DD2",44)))
summary(NMDS$MDS1)
summary(NMDS$MDS2)
NMDS = NMDS[NMDS$MDS1 < 0,] # get rid of single crazy outlier to make plotting visible

# plot
library(ggplot2)
ggplot(NMDS, mapping = aes(x=MDS1,y=MDS2,col=Method)) +
  geom_point() + theme_bw() + ggtitle("NMDS plot of Bray Dissimilarity Colored by Merging Method")

# Add grouping metadata
Trt = c(rep("Trt_1",22),rep("Trt_2",22))

# NMDS plot for each method colored by artificial groups
ggplot(mapping = aes(x=mds1.fw[mds1.fw<0],y=mds2.fw[mds1.fw<0],col=Trt[mds1.fw<0])) +
  geom_point()

ggplot(mapping = aes(x=mds1.pear[mds1.pear<0],y=mds2.pear[mds1.pear<0],col=Trt[mds1.pear<0])) +
  geom_point()

ggplot(mapping = aes(x=mds1.dd[mds1.dd<0],y=mds2.dd[mds1.dd<0],col=Trt[mds1.dd<0])) +
  geom_point()

# Adonis for each method by artificial groupings
adon.fwd = adonis(vegdist(df.fwd) ~ Trt)
adon.fwd

adon.pear = adonis(vegdist(df.pear) ~ Trt)
adon.pear

adon.dd = adonis(vegdist(df.dd2_merged) ~ Trt)
adon.dd


# shannon diversity
shannon.fwd = (diversity(df.fwd))
shannon.pear = diversity(df.pear)
shannon.dd = diversity(df.dd2_merged)

# build data frame
Shannon = data.frame(Shannon = c(shannon.fwd,shannon.pear,shannon.dd), Method = c(rep("Fwd_Reads",44),rep("PEAR",44),rep("DD2",44)))

#plot
ggplot(Shannon, aes(x=Method,y=Shannon,fill=Method)) + geom_boxplot() +
  theme_bw() 

# how many RSVs from PEAR method are seen in other methods?
library(VennDiagram)

Fwd = names(df.fwd)
PEAR = names(df.pear)
DD2 = names(df.dd2_merged)

# make VennDiagram
venn.diagram(list(Fwd = Fwd, PEAR = PEAR, DD2 = DD2),"~/Desktop/dada2_test/output/Venn.tiff")

# merge RSV tables / need to edit sample (row) names first to remove duplicates
row.names(seqtab.nochimera.fwd) <- paste(row.names(seqtab.nochimera.fwd),"_Fwd", sep = "")
row.names(seqtab.nochimera.pear) <- paste(row.names(seqtab.nochimera.pear),"_Pear", sep = "")
row.names(seqtab.nochimera.dd2_merged) <- paste(row.names(seqtab.nochimera.dd2_merged),"_DD2", sep = "")

# merge tables
seqtab = mergeSequenceTables(seqtab.nochimera.fwd,seqtab.nochimera.pear,seqtab.nochimera.dd2_merged)

# add grouping variables
Method = c(rep("Fwd",44),rep("Pear",44),rep("DD2",44))

#beta dispersion
beta_dispersion = betadisper(vegdist(seqtab, method = "jaccard", binary = TRUE),Method)
plot(beta_dispersion)

# presence absence
seqtab.pa = decostand(seqtab, method = 'pa')

#plot sample richness
ggplot(mapping = aes(x=row.names(seqtab.pa),y=rowSums(seqtab.pa),col=Method)) +
  geom_point() +
  theme_bw() + ggtitle("Sample Richness") + labs(x="Samples",y="Sample Richness") +
  theme(axis.text.x = element_blank(), legend.title = element_text()) 


# comparison summary info

# number of RSVs found with each method
a = length(names(df.fwd))
b=length(names(df.pear))
c=length(names(df.dd2_merged))

# number of valid read counts found with each method
d=sum(colSums(df.fwd))
e=sum(colSums(df.pear))
f=sum(colSums(df.dd2_merged))


# table of summary info
data.frame(Method = c("Fwd_Only","PEAR_merged","DD_merged"), RSV_count = c(a,b,c), Read_Count = c(d,e,f),
           Percent_Chimeras = c(chimeras.fwd,chimeras.pear,chimeras.dd2_merged),
           Mean_Shannon_Div = c(mean(shannon.fwd),mean(shannon.pear),mean(shannon.dd)))

# ===================