This document is a guide to transforming your HTS count data into a form that you can more easily work with. You also incidentally get some measures of OTU diversity. If you're not familiar with processing HTS datasets, this is a good place to start, as it begins with importing your sequences from mothur. I used a dataset of V4 16S sequences from North American bat fecal samples, sequenced on the Illumina MiSeq platform (PE300).

HTS datasets are inherently compositional: the OTU counts don’t have any meaning without the total read depth of the sample. In order to compare the abundance of OTUs between groups of samples, you must either scale your reads, normalize them, rarefy them, or transform them into a different number space.
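
To make this concrete, here's a toy example (with made-up counts): two samples that look wildly different as raw counts are identical once expressed as proportions of their read depth.

# Toy example with made-up counts: identical compositions, different depths
shallow <- c(OtuA=10,  OtuB=30,  OtuC=60)    # 100 reads total
deep    <- c(OtuA=100, OtuB=300, OtuC=600)   # 1000 reads total
shallow/sum(shallow)   # 0.1 0.3 0.6
deep/sum(deep)         # 0.1 0.3 0.6 -- the same once depth is divided out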

Dataset Loading

library(compositions)
library(zCompositions)
library(ape)
library(plyr)
library(dplyr)
library(ggplot2)
library(gplots)
library(lme4)
library(tidyr)
library(vegan)
library(scales)
library(grid)
library(reshape2)
library(phyloseq)
library(ALDEx2)
library(gridExtra)
library(stringr)
library(viridis)

m.g.colors <- c( "#999999", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7", "#CBD588", "#5F7FC7", "orange","#DA5724", "#508578", "#CD9BCD",
   "#AD6F3B", "#673770","#D14285", "#652926", "#C84248", 
  "#8569D5", "#5E738F","#D1A33D", "#8A7C64", "#599861" )

more.colors <- c("indianred1", "steelblue3",  "skyblue1", "mediumorchid","royalblue4", "olivedrab3",
   "pink", "#FFED6F", "mediumorchid3", "ivory2", "tan1", "aquamarine3", "#C0C0C0",
    "mediumvioletred", "#999933", "#666699", "#CC9933", "#006666", "#3399FF",
   "#993300", "#CCCC99", "#666666", "#FFCC66", "#6699CC", "#663366", "#9999CC", "#CCCCCC",
   "#669999", "#CCCC66", "#CC6600", "#9999FF", "#0066CC", "#99CCCC", "#999999", "#FFCC00",
   "#009999", "#FF9900", "#999966", "#66CCCC", "#339966", "#CCCC33", "#EDEDED"
)

theme_set(theme_classic(base_size=12, base_family="Avenir"))

Replace these file names with the names of your mothur shared file and your mothur constaxonomy file.

taxfile="bat16s.cons.taxonomy"
sharedfile="bat16s.shared"

You should have a metadata file with more information about your samples. If you don't already have this, make one where the first column corresponds to the names you used for your samples in mothur.
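
If you're building one from scratch, a minimal sketch might look like this. The sample names and columns here are hypothetical; yours must match the names in your shared file.

# Hypothetical sketch: row names must match your mothur sample names
example.mdf <- data.frame(row.names=c("Sample_001", "Sample_002"),
                          Species=c("MYLU", "EPFU"),
                          Year=c(2016, 2017))
write.csv(example.mdf, "metadata.csv")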

mdf <- read.csv("metadata_clip_0520.csv", header=TRUE, row.names=1)
mdf

This metadata file does not include a record for every sample.

This example arguably has too much metadata. I’m going to select only columns that I might be interested in analyzing.

mdf <- subset(mdf, select=c(Bat_Index, Geo.Section, Year, Species, SexReprod, Age, WingScore.))
mdf

Read in the mothur files.

mothur_data <- import_mothur(mothur_constaxonomy_file = taxfile, mothur_shared_file = sharedfile)

mothur_data
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 30903 taxa and 367 samples ]
## tax_table()   Taxonomy Table:    [ 30903 taxa by 6 taxonomic ranks ]

There is no information on blanks in my metadata file. We should pull those sample names out and create dummy rows in the metadata file. All of those sample names begin with Blank, so they should be easy to extract.

allsamples <- colnames(data.frame(otu_table(mothur_data)))
blanks <- allsamples[sapply(allsamples, function(x) grepl("Blank", x))]
blanks
##  [1] "Blank_005_D01" "Blank_005_H01" "Blank_006_D04" "Blank_006_G08"
##  [5] "Blank_006_H07" "Blank_007_D08" "Blank_007_D11" "Blank_008_D06"
##  [9] "Blank_008_E05" "Blank_008_E06" "Blank_008_H08"
blank.mdf <- data.frame(row.names=blanks)

blank.mdf$Bat_Index <- NA
blank.mdf$Geo.Section <- NA
blank.mdf$Year <- NA
blank.mdf$Species <- NA
blank.mdf$SexReprod <- NA
blank.mdf$Age <- NA
blank.mdf$WingScore. <- NA

blank.mdf

Add a SampleType column to both metadataframes

blank.mdf$SampleType <- "Blank"
mdf$SampleType <- "Bat"

Combine the metadata files.

mdf <- rbind(mdf, blank.mdf)
mdf

Now we can merge together the mothur data with the metadata.

moth_merge <- merge_phyloseq(mothur_data, sample_data(mdf))
moth_merge
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 30903 taxa and 294 samples ]
## sample_data() Sample Data:       [ 294 samples by 8 sample variables ]
## tax_table()   Taxonomy Table:    [ 30903 taxa by 6 taxonomic ranks ]

Fix the rank names on the taxonomy table and then remove any reads that aren’t bacteria or archaea.

colnames(tax_table(moth_merge)) <- c("Kingdom", "Phylum", "Class", 
  "Order", "Family", "Genus")

miseq.phylo <- moth_merge %>%
  subset_taxa(
    Kingdom %in% c("Bacteria", "Archaea") &
    Family  != "mitochondria" &
    Class   != "Chloroplast"
  )

miseq.phylo
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 30903 taxa and 294 samples ]
## sample_data() Sample Data:       [ 294 samples by 8 sample variables ]
## tax_table()   Taxonomy Table:    [ 30903 taxa by 6 taxonomic ranks ]

Examine Read Depths

Let’s see how varied the total read depth is between our samples.

read.depths <- sample_sums(miseq.phylo)
hist(read.depths, breaks=20)

summary(read.depths)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      25   12379   26219   26321   37333   73230

This sample set has high variance. It also includes some blanks, which should have very low read depths.

depth.df <- data.frame(row.names=names(read.depths), depth=read.depths)
depth.df$SampleType <- mdf[row.names(depth.df),"SampleType"]

depth.df %>% group_by(SampleType) %>% summarise(min=min(depth), max=max(depth), mean=mean(depth))

Quality Control

Generally, I prefer to remove singletons (OTUs with a total count of 1 across the dataset) before doing any other analysis. Some diversity measurements rely on the presence of singletons within a sample (e.g., Chao1), so you might choose to leave them in for those sorts of analyses only. Most downstream processing will improve, however, if you also drop single reads (counts of 1) within each sample.

We can also use the read counts from our Blank controls to understand what OTUs might be contaminants and what OTU count threshold should be considered a true detection.

Not all blank controls are equally useful. The blanks in this example are empty water wells that went through both PCR amplification and sequencing. This is arguably less useful than an empty water well that only went through sequencing, but we can work with it. You might also have blank controls from your DNA extraction step, which can be treated similarly.

otu.table <- data.frame(otu_table(miseq.phylo))
otu.blanks <- otu.table[,blanks]
nonzero.otu.blanks <- otu.blanks[which(rowSums(otu.blanks) > 0),]
nonzero.otu.blanks

We can see that the most common OTUs in our dataset are present at low levels in at least one blank. This isn't too surprising. Otu00001 is highly abundant overall, but is detected at a maximum of 25 reads in any one blank. Outside of that blank, it seems like there could be background detections that show up as counts of 2 or 3.

I'm most interested in the total detections across all blanks.

rowSums(nonzero.otu.blanks)
## Otu00001 Otu00002 Otu00003 Otu00004 Otu00005 Otu00006 Otu00007 Otu00008 
##      164       28       14       17       29       10       23        9 
## Otu00009 Otu00010 Otu00011 Otu00012 Otu00013 Otu00014 Otu00015 Otu00016 
##       12       15        9       11        9        4        5       10 
## Otu00017 Otu00018 Otu00019 Otu00020 Otu00021 Otu00022 Otu00024 Otu00025 
##        3        4        7       10        4        3        8        1 
## Otu00026 Otu00027 Otu00028 Otu00029 Otu00030 Otu00032 Otu00033 Otu00034 
##        3        2        6        4        2        2        3        2 
## Otu00035 Otu00038 Otu00041 Otu00043 Otu00044 Otu00045 Otu00046 Otu00047 
##        4       15        8        4        2        2        3        2 
## Otu00048 Otu00050 Otu00051 Otu00052 Otu00054 Otu00058 Otu00060 Otu00062 
##        6        1        2        1        1        1        1        1 
## Otu00066 Otu00068 Otu00069 Otu00070 Otu00076 Otu00077 Otu00080 Otu00082 
##        2        2        4        1        3        1        3        2 
## Otu00083 Otu00084 Otu00086 Otu00091 Otu00095 Otu00096 Otu00098 Otu00103 
##        2        3        1        1        1        6        2        1 
## Otu00109 Otu00110 Otu00111 Otu00112 Otu00113 Otu00114 Otu00115 Otu00116 
##        1        1        1        1        1        3        1        2 
## Otu00117 Otu00120 Otu00123 Otu00124 Otu00129 Otu00133 Otu00135 Otu00138 
##        1        1        1        1        1        1        2        1 
## Otu00143 Otu00145 Otu00146 Otu00148 Otu00157 Otu00158 Otu00163 Otu00164 
##        1        1        1        1        1        1        1        1 
## Otu00166 Otu00179 Otu00183 Otu00188 Otu00190 Otu00197 Otu00198 Otu00203 
##        2        2        1        1        1        1        1        1 
## Otu00206 Otu00208 Otu00214 Otu00234 Otu00237 Otu00242 Otu00245 Otu00247 
##        2        1        2        2        1        1        1        1 
## Otu00252 Otu00256 Otu00258 Otu00263 Otu00264 Otu00266 Otu00267 Otu00273 
##        1        1        1        2        2        1        3        1 
## Otu00282 Otu00287 Otu00288 Otu00291 Otu00293 Otu00294 Otu00300 Otu00302 
##        1        1        2        1        1        1        1        2 
## Otu00305 Otu00348 Otu00349 Otu00374 Otu00378 Otu00383 Otu00427 Otu00448 
##        1        1        1        1        1        1        1        1 
## Otu00489 Otu00503 Otu00521 Otu00537 Otu00560 Otu00566 Otu00607 Otu00707 
##        1        1        1        1        1        1        1        1 
## Otu00715 Otu00721 Otu00771 Otu00791 Otu00845 Otu00848 Otu00850 Otu00854 
##        1        1        1        1        1        1        1        1 
## Otu01104 Otu01282 Otu01335 Otu01441 Otu01445 Otu01617 Otu01736 Otu01939 
##        1        1        2        1        1        1        1        2 
## Otu02090 Otu02619 Otu03047 Otu04638 Otu05849 Otu06997 Otu07035 Otu07162 
##        1        1        1        1        1        1        1        1 
## Otu07798 Otu08558 Otu08764 Otu08792 Otu10537 Otu11326 Otu12971 Otu13286 
##        1        1        1        1        1        1        1        1 
## Otu13922 Otu13992 Otu14605 Otu15308 Otu15599 Otu15662 Otu15851 Otu16556 
##        1        1        1        2        1        1        1        1 
## Otu17131 Otu17893 Otu18065 Otu18218 Otu18664 Otu19703 Otu19763 Otu20019 
##        1        1        1        1        1        1        1        1 
## Otu20181 Otu22014 Otu23353 Otu24816 Otu24926 Otu24957 Otu25150 Otu25525 
##        1        1        1        1        1        1        1        1 
## Otu25759 Otu27127 Otu27577 Otu28338 Otu29260 
##        1        1        1        1        1

You could choose to drop OTUs that don't occur more than 164 times in total if you want to be really sure that you're not counting contaminants.
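
If you wanted to take that stricter route, a minimal sketch would look like this (one interpretation: an OTU's dataset-wide total must exceed the largest total any OTU reached across the blanks; the threshold choice is ultimately yours).

# Sketch: drop OTUs whose dataset-wide total doesn't exceed the largest blank total
blank.cutoff <- max(rowSums(otu.blanks))   # 164 in this dataset
otu.table.strict <- otu.table[which(rowSums(otu.table) > blank.cutoff),]
dim(otu.table.strict)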

I think for now I’m going to stick with dropping single reads. We’ll make a version of the phyloseq object with no singletons (across the whole dataset), a version with no single reads within samples, and a raw read version so you can see the difference.

otu.table.nosingle <- otu.table[which(rowSums(otu.table) > 1),]
otu.table.nosinglereads <- data.frame(otu.table)
otu.table.nosinglereads[otu.table.nosinglereads==1] <- 0
otu.table.nosinglereads <- otu.table.nosinglereads[which(rowSums(otu.table.nosinglereads)>0),]

phylo.nosingle <- merge_phyloseq(otu_table(otu.table.nosingle, taxa_are_rows=TRUE), tax_table(miseq.phylo), sample_data(miseq.phylo))
phylo.nosinglereads <- merge_phyloseq(otu_table(otu.table.nosinglereads,taxa_are_rows=TRUE),  tax_table(miseq.phylo), sample_data(miseq.phylo))

Rarefactions

Rarefactions illustrate how completely the sequences within each environmental sample were picked up by the sequencer… how well your sample was sampled! The rarefaction function takes a random subsample of a given size from each column in your otu table, starting with a very small subsample, and tallies how many unique OTUs were observed in that subsample. The analysis is repeated with increasing subsample sizes until the maximum actual read depth for your table is reached. The whole process is then repeated many times and averaged to smooth out the curve.

Typically, rarefaction curves are plotted with the subsampling depth on the x-axis and the total number of unique OTUs observed on the y-axis. In an ideal dataset, each curve would reach a horizontal asymptote, ideally around the same depth. In practice, this is almost never the case.

Mothur can calculate rarefaction curves for you, which is very convenient, but of course it will only calculate them on your unprocessed data. I like to calculate them in R so that I can compare the effects of dropping singletons, single reads, etc., but doing so is very processor intensive and takes a long time to run.

This function is optimized to run on multiple processors to speed things up, so if your personal computer has multiple cores (it probably has at least 2), the parallel version below may help. If not, and you still want to run this part in R, consider using whatever HPC resources you have available.

First make an array of sampling depths. It should extend to the maximum read depth in your dataset. This code creates 250 evenly spaced sampling depths from 2 to the maximum depth. You can adjust this depending on the resolution you want for your curves.

max.depth <- max(sample_sums(miseq.phylo))
max.depth
## [1] 73230
sampling.depth <- seq(from=2, to=max.depth, length.out=250)
sampling.depth <- sapply(sampling.depth, round)
library("doParallel")
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
cores <- detectCores()-1
cores
## [1] 3

If cores is less than 2, then don’t run this on your laptop! It won’t help improve the speed. You can either run a very slow version or run this section of code on the supercomputer. The core rarefaction function is the same.

make.rarefaction <- function(adf, col, perm, sampling.depth) {
  otu.weight <- adf[,col]
  otu.vec <- rep(rownames(adf), otu.weight)
  otu.counts <- sapply(sampling.depth[sampling.depth < length(otu.vec)], function(x) cust.sample.perm(otu.vec, x, perm))

  otu.avg <- colMeans(otu.counts)
  full.count <- c(otu.avg, rep(NA, length(sampling.depth)-length(otu.avg)))
  return(full.count)
}

cust.sample.perm <- function(samp.vec, n, perm){
  max=length(samp.vec)
  if( n > max){
    return(rep(NA, perm))
  }
  all.samples <- replicate(perm, sample(samp.vec, n, replace=FALSE))
  return(apply(all.samples, 2, function(x) length(unique(x))))
}

If you are using a version of Linux or macOS, run the code like so:

start <- proc.time()
cl <- makeCluster(cores, type = "FORK")
registerDoParallel(cl)
rarefaction.raw <- foreach(x=colnames(otu.table), .combine=cbind) %dopar% {
  make.rarefaction(otu.table, x, 25, sampling.depth)
}
stopCluster(cl)
rarefaction.raw <- data.frame(rarefaction.raw)
colnames(rarefaction.raw) <- colnames(otu.table)
rarefaction.raw$Sampling.Depth <- sampling.depth
print(proc.time() - start)

On my Mac, which has an i5 processor with 4 cores, this ran in about 22 minutes.

If you are running R on Windows, the code will necessarily run a little more slowly, since FORK type clusters are not available. PSOCK workers also do not inherit functions from your session, so the helper function must be exported explicitly. Run this chunk instead:

start <- proc.time()
cl <- makeCluster(cores)
registerDoParallel(cl)
rarefaction.raw <- foreach(x=colnames(otu.table), .combine=cbind, .export=c("make.rarefaction", "cust.sample.perm")) %dopar% {
  make.rarefaction(otu.table, x, 25, sampling.depth)
}
rarefaction.raw <- data.frame(rarefaction.raw)
colnames(rarefaction.raw) <- colnames(otu.table)
rarefaction.raw$Sampling.Depth <- sampling.depth
stopCluster(cl)
print(proc.time() - start)

If you do not have more than one core and want to run this on your personal computer, run the slow version like this:

start <- proc.time()
rarefaction.raw <- sapply(colnames(otu.table), function(x) make.rarefaction(otu.table, x, 25, sampling.depth))
rarefaction.raw <- data.frame(rarefaction.raw)  # sapply returns a matrix; convert it before adding a column
colnames(rarefaction.raw) <- colnames(otu.table)
rarefaction.raw$Sampling.Depth <- sampling.depth
print(proc.time() - start)

However you ran the rarefaction curve code, save the results. There’s no need to run it twice!

write.csv(rarefaction.raw, "rarefaction.curves.raw.csv")

To run the rarefaction code on the supercomputer, you will only need to upload a .csv version of your otu table. Save one like so, then upload it to your workspace using sftp, WinSCP, or FileZilla.

write.csv(otu.table, "raw.otu.table.csv")

You will also need to load the R module. For best results, load the same version of R that you use on your personal computer. Then start the R terminal. Do that like so:

module load R/4.0.0
R

You will need to have the doParallel and foreach packages installed. If you don't have those in your R libraries on the supercomputer, install doParallel, preferably using BiocManager. Install BiocManager first if you don't have it.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.12")
BiocManager::install("doParallel")

In the R terminal, copy and paste the following:

library("doParallel")
otu.table <- read.csv("raw.otu.table.csv", header=TRUE, row.names=1)
max.depth <- max(colSums(otu.table))
sampling.depth <- seq(from=2, to=max.depth, length.out=250)
sampling.depth <- sapply(sampling.depth, round)


make.rarefaction <- function(adf, col, perm, sampling.depth) {
  otu.weight <- adf[,col]
  otu.vec <- rep(rownames(adf), otu.weight)
  otu.counts <- sapply(sampling.depth[sampling.depth < length(otu.vec)], function(x) cust.sample.perm(otu.vec, x, perm))

  otu.avg <- colMeans(otu.counts)
  full.count <- c(otu.avg, rep(NA, length(sampling.depth)-length(otu.avg)))
  return(full.count)
}

cust.sample.perm <- function(samp.vec, n, perm){
  max=length(samp.vec)
  if( n > max){
    return(rep(NA, perm))
  }
  all.samples <- replicate(perm, sample(samp.vec, n, replace=FALSE))
  return(apply(all.samples, 2, function(x) length(unique(x))))
}

cores <- detectCores()

start <- proc.time()
cl <- makeCluster(cores, type = "FORK")
registerDoParallel(cl)
rarefaction.raw <- foreach(x=colnames(otu.table), .combine=cbind) %dopar% {
  make.rarefaction(otu.table, x, 25, sampling.depth)
}
stopCluster(cl)
rarefaction.raw <- data.frame(rarefaction.raw)
colnames(rarefaction.raw) <- colnames(otu.table)
rarefaction.raw$Sampling.Depth <- sampling.depth
print(proc.time() - start)

write.csv(rarefaction.raw, "rarefaction.curves.raw.csv")
print("done")

Then download the resulting csv.

On an MSI cluster with 24 processors and 16 GB of RAM, this took 76 seconds. (!!)

rarefaction.curve.df <- read.csv("rarefaction.curves.raw.csv", header=TRUE, row.names=1)
rarefaction.curve.df

To plot the curves, melt the rarefaction dataframe.

rare.df.melt <- data.frame(melt(rarefaction.curve.df, id="Sampling.Depth")) 

ggplot(rare.df.melt, aes(x=Sampling.Depth, y=value, group=variable)) + geom_line() + coord_cartesian(xlim=c(0,max.depth)) + ylab("Unique OTUs")
## Warning: Removed 47037 row(s) containing missing values (geom_path).

As you can see, the sampling depth is quite varied, and the rate of OTU gain near the end of each line is also varied.

You can break this up by any factor in your metadata file to find out if there are subsets of your data that might be more similar to each other and easier to analyze as groups.

colnames(rare.df.melt) <- c("Sampling.Depth", "SID", "OTUs")
rare.df.melt$SampleType <- mdf[as.vector(rare.df.melt$SID), "SampleType"]
rare.df.melt$Species <- mdf[as.vector(rare.df.melt$SID), "Species"]
ggplot(rare.df.melt, aes(x=Sampling.Depth, y=OTUs, group=SID, color=Species)) + geom_line(alpha=0.7)
## Warning: Removed 47037 row(s) containing missing values (geom_path).

In this case, the species grouping isn’t very useful.

Let's see how our interpretation of the rarefaction curves changes when we calculate them using the otu table without singletons and the one without single reads. I'm not displaying the code to run this; it's the same as the code we used to make rarefaction curves from the raw otu table, just with a different otu table as input. Run them however you'd like and resume here.

nosingle.rare <- read.csv("rarefaction.curves.nosingle.csv", header=TRUE, row.names=1)
nosingleread.rare <- read.csv("rarefaction.curves.nosinglereads.csv", header=TRUE, row.names=1)

rare.df.melt.ns <- data.frame(melt(nosingle.rare, id="Sampling.Depth"))
rare.df.melt.nsr <- data.frame(melt(nosingleread.rare, id="Sampling.Depth"))
f <- ggplot(rare.df.melt, aes(x=Sampling.Depth, y=OTUs, group=SID)) + geom_line() + coord_cartesian(xlim=c(0,max.depth), ylim = c(0, 2500)) + ylab("Unique OTUs") + ggtitle("Raw")

g <- ggplot(rare.df.melt.ns, aes(x=Sampling.Depth, y=value, group=variable)) + geom_line() + coord_cartesian(xlim=c(0,max.depth), ylim = c(0, 2500)) + ylab("Unique OTUs") + ggtitle("No Singleton")

h <- ggplot(rare.df.melt.nsr, aes(x=Sampling.Depth, y=value, group=variable)) + geom_line() + coord_cartesian(xlim=c(0,max.depth), ylim = c(0, 2500)) + ylab("Unique OTUs") + ggtitle("No Single Read")

grid.arrange(f, g, h, nrow=1)
## Warning: Removed 47037 row(s) containing missing values (geom_path).
## Warning: Removed 47052 row(s) containing missing values (geom_path).
## Warning: Removed 47131 row(s) containing missing values (geom_path).

Removing singletons has a modest but appreciable effect on the total number of raw OTUs, but does not flatten the sampling curves to the same degree that removing single reads does. I recommend using the no-singleton set for any broad diversity measurements (Simpson, Shannon, etc.) and the no-single-read set for any diversity ordinations. It is not necessary to calculate rarefaction curves for a no-single-read set, because these low count reads are removed at a later step in analysis; I've only made them here to illustrate the difference between these sets.

Diversity Measurement: OTU Richness

Rarefaction curves can themselves be used as measurements of OTU richness. If you want a tabular representation rather than a figure, consider using Chao1, the Chao abundance-based coverage estimator (ACE), or an incidence-based coverage estimator (ICE). Use ICE if you want to (or have to) ignore abundance and use presence/absence data only. For more about how these estimators work, see this article. These estimators rely on the proportion of singletons in each sample (not across all the data), so they should not be used with the no-single-read dataset. You can also choose to use them with the raw data.

I like the implementation in the package fossil. phyloseq also has convenience wrappers for these functions.

Here's Chao1, which is based on the numbers of singletons and doubletons in each sample.

library("fossil")
## Loading required package: sp
## Loading required package: maps
## 
## Attaching package: 'maps'
## The following object is masked from 'package:plyr':
## 
##     ozone
## Loading required package: shapefiles
## Loading required package: foreign
## 
## Attaching package: 'shapefiles'
## The following objects are masked from 'package:foreign':
## 
##     read.dbf, write.dbf
raw.chao1 <- apply(otu.table, 2, chao1)
ns.chao1 <- apply(otu.table.nosingle, 2, chao1)

raw.chao1[1:10]
## Blank_005_D01 Blank_005_H01 Blank_006_D04 Blank_006_G08 Blank_006_H07 
##     167.66667      93.50000     110.08333     101.54545     106.00000 
## Blank_007_D08 Blank_007_D11 Blank_008_D06 Blank_008_E05 Blank_008_E06 
##     123.10000     161.12500     161.25000      21.16667      54.00000
ns.chao1[1:10]
## Blank_005_D01 Blank_005_H01 Blank_006_D04 Blank_006_G08 Blank_006_H07 
##     157.50000      93.50000     110.08333      85.90909      96.25000 
## Blank_007_D08 Blank_007_D11 Blank_008_D06 Blank_008_E05 Blank_008_E06 
##      97.50000     120.50000     161.25000      18.00000      49.12500
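
If you want to see what's under the hood, the classic Chao1 estimate is just the observed richness plus a correction built from the singleton (F1) and doubleton (F2) counts. Here's a minimal by-hand version; fossil's implementation may use a bias-corrected variant, so values can differ slightly, especially when F2 is 0.

# Classic Chao1 by hand: S_obs + F1^2 / (2 * F2)
manual.chao1 <- function(counts){
  s.obs <- sum(counts > 0)    # observed OTUs
  f1 <- sum(counts == 1)      # singletons
  f2 <- sum(counts == 2)      # doubletons
  s.obs + (f1^2) / (2 * max(f2, 1))   # max() guards against dividing by zero
}
manual.chao1(otu.table[,"Blank_005_D01"])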

And here’s ACE.

raw.ace <- apply(otu.table, 2, ACE)
ns.ace <- apply(otu.table.nosingle, 2, ACE)

raw.ace[1:10]
## Blank_005_D01 Blank_005_H01 Blank_006_D04 Blank_006_G08 Blank_006_H07 
##     111.11765      51.15000     105.05382     121.87673      54.54919 
## Blank_007_D08 Blank_007_D11 Blank_008_D06 Blank_008_E05 Blank_008_E06 
##     117.44927      91.70057     113.87870      18.05556      91.13160
ns.ace[1:10]
## Blank_005_D01 Blank_005_H01 Blank_006_D04 Blank_006_G08 Blank_006_H07 
##     103.34531      51.15000     105.05382      98.13520      49.27215 
## Blank_007_D08 Blank_007_D11 Blank_008_D06 Blank_008_E05 Blank_008_E06 
##      88.38776      67.05179     113.87870      16.00000      79.85950

These methods both compute an extrapolated species richness based on something like your rarefaction curves.

It's important to note that these estimators are not designed for use with compositional data. These extrapolations give you an estimate of the true number of OTUs in your sample, not the true number of OTUs in the environment from which your sample was drawn! The method does not address underlying differences in total read depth – at the end you have a corrected value for total OTU richness, but it should still be interpreted in the context of the total read depth of each sample.

otu.freq <- data.frame(otu.table)
otu.freq[otu.freq > 1] <- 1
raw.richness <- colSums(otu.freq)

otu.freq <- data.frame(otu.table.nosingle)
otu.freq[otu.freq > 1] <- 1
raw.richness.ns <- colSums(otu.freq)

boxplot(raw.ace, ns.ace, raw.chao1, ns.chao1, raw.richness, raw.richness.ns, xaxt="n") 
axis(side=1, at=c(1,2,3,4, 5, 6), labels=FALSE)
text(x=c(1,2,3,4, 5, 6), pos=1, par("usr")[3], xpd=TRUE, srt=45, labels=c("ACE Raw", "ACE S", "Chao1 Raw", "Chao1 S", "Raw", "Raw S"))

All of these estimators are simply the end points of your rarefaction curves. The coverage estimators essentially extend the slope of the curves to the end, while the raw numbers draw a straight line. Let’s illustrate this with a small subset of samples.

sample.5 <- colnames(otu.table[,40:45])
sample.5
## [1] "MLG16_1_0183" "MLG16_1_0184" "MLG16_1_0185" "MLG16_1_0186" "MLG16_1_0188"
## [6] "MLG16_1_0192"

Here are our rarefaction curves for those samples.

ggplot(rare.df.melt[which(rare.df.melt$SID %in% sample.5 & !(is.na(rare.df.melt$OTUs))),], aes(x=Sampling.Depth, y=OTUs, group=SID)) + geom_line()

Let’s plot the raw OTU richness as points at the maximum sampling depth on our chart (about 40000 in this example).

mini.rare.df <- rare.df.melt[which(rare.df.melt$SID %in% sample.5 & !(is.na(rare.df.melt$OTUs))),]

richness.df <- data.frame(raw.rich = raw.richness, chao1=raw.chao1, ace=raw.ace, row.names=names(raw.richness), SID=names(raw.richness))

ggplot() + geom_line(data =mini.rare.df, aes(x=Sampling.Depth, y=OTUs, group=SID, color=SID)) + geom_point(data=richness.df[sample.5,], aes(x=40000, y=raw.rich, color=SID))

Geometrically, this is as if we have drawn a straight line from the end of each rarefaction curve.

max.rare <- rare.df.melt[which(! is.na(rare.df.melt$OTUs)),] %>% group_by(SID) %>% summarise(max=max(Sampling.Depth))
max.rare <- data.frame(max.rare)
row.names(max.rare) <- max.rare$SID

richness.df$max.depth <- max.rare[row.names(richness.df),"max"]
richness.df[sample.5,]
ggplot() + geom_line(data =mini.rare.df, aes(x=Sampling.Depth, y=OTUs, group=SID, color=SID)) + geom_point(data=richness.df[sample.5,], aes(x=60000, y=raw.rich, color=SID)) + geom_segment(data=richness.df[sample.5,], aes(x=max.depth, y=raw.rich, xend=60000, yend=raw.rich, color=SID), linetype=2)

Chao 1 extends the curve.

ggplot() + geom_line(data =mini.rare.df, aes(x=Sampling.Depth, y=OTUs, group=SID, color=SID)) + geom_point(data=richness.df[sample.5,], aes(x=60000, y=chao1, color=SID)) + geom_segment(data=richness.df[sample.5,], aes(x=max.depth, y=raw.rich, xend=60000, yend=chao1, color=SID), linetype=2)

You can see that Chao1 appears to extrapolate along the x axis as well.

ACE also extends the curve. It performs less well on samples with steep rarefaction curves.

ggplot() + geom_line(data =mini.rare.df, aes(x=Sampling.Depth, y=OTUs, group=SID, color=SID)) + geom_point(data=richness.df[sample.5,], aes(x=60000, y=ace, color=SID)) + geom_segment(data=richness.df[sample.5,], aes(x=max.depth, y=raw.rich, xend=60000, yend=ace, color=SID), linetype=2)

Whether extrapolated or not, you're really stuck with the fact that these sequencing samples are samples of samples. OTU richness is not easy to compare across studies. If you want to compare the relative richness of samples within your dataset, consider using a scaling method and then reporting richness relative to the maximum unique OTUs in your dataset, or to the total number of unique OTUs observed across the dataset. It's really important to scale or downsample if you're going to do that.

If you would like to assess differences in OTU richness between groups of samples, I suggest making confidence intervals around the mean values for your groups.

Let’s use the set without singletons and with single reads included. Assign your group variable to a column. I’m using “Species” from my metadata file. I’m also going to drop the Blank samples because they don’t have species values.

rare.df.melt.ns$Species <- mdf[as.vector(rare.df.melt.ns$variable),"Species"]
rare.df.melt.ns <- rare.df.melt.ns[which(!is.na(rare.df.melt.ns$Species)),]
rare.df.melt.ns.nonull <- rare.df.melt.ns[which(!is.na(rare.df.melt.ns$value)),]
rare.df.melt.ns.nonull

Now use dplyr to group the values at each sampling depth and take an average, along with the standard deviation and the number of non-null measurements at each depth.

ns.species.avgs <- rare.df.melt.ns.nonull %>% group_by(Sampling.Depth, Species) %>% summarise(OTUs=mean(value), OTUsd = sd(value, na.rm=TRUE), n=n())
## `summarise()` has grouped output by 'Sampling.Depth'. You can override using the `.groups` argument.
ns.species.avgs <- data.frame(ns.species.avgs)
ns.species.avgs

You can calculate a 95% confidence interval using the statistics we just summarized.

ns.species.avgs$u95 <- ns.species.avgs$OTUs + ((1.96*ns.species.avgs$OTUsd)/sqrt(ns.species.avgs$n))
ns.species.avgs$l95 <- ns.species.avgs$OTUs - ((1.96*ns.species.avgs$OTUsd)/sqrt(ns.species.avgs$n))

Plot those like this, with geom_ribbon.

ggplot(ns.species.avgs, aes(x=Sampling.Depth, y=OTUs, group=Species, color=Species)) + geom_line() + scale_color_manual(values=c("#999999", "#56B4E9", "#009E73"), name="Species") + geom_ribbon(aes(ymin=l95, ymax=u95, group=Species, fill=Species), alpha=0.5, linetype='blank') + scale_fill_manual(values=c( "#999999", "#56B4E9", "#009E73"), name="Species")+ xlab("Sampling Depth") + ylab("Mean OTU Richness")

You can test for overlap of these 95% confidence intervals at any depth you like. If you were trying to assert that the groups were statistically different, your argument would be stronger the more the intervals were separated along the entire length of sampling depth. In this case, there is only a weak argument that the EPFU bats have higher mean OTU richness than MYLU and MYSE bats.

95% confidence intervals calculated in this way rely on the assumption of a normal distribution at each sampling depth interval, which probably isn't borne out. You can also estimate a 95% confidence interval non-parametrically by calculating the quantiles directly:

ns.species.avgs.np <- rare.df.melt.ns.nonull %>% group_by(Sampling.Depth, Species) %>% summarise(OTUs=mean(value), u95=quantile(value, probs=0.975), l95=quantile(value, probs=0.025)) # 2.5th and 97.5th percentiles bound a 95% interval
## `summarise()` has grouped output by 'Sampling.Depth'. You can override using the `.groups` argument.
ns.species.avgs.np <- data.frame(ns.species.avgs.np)
ns.species.avgs.np

Plot this in the same way.

ggplot(ns.species.avgs.np, aes(x=Sampling.Depth, y=OTUs, group=Species, color=Species)) + geom_line() + scale_color_manual(values=c("#999999", "#56B4E9", "#009E73"), name="Species") + geom_ribbon(aes(ymin=l95, ymax=u95, group=Species, fill=Species), alpha=0.5, linetype='blank') + scale_fill_manual(values=c( "#999999", "#56B4E9", "#009E73"), name="Species")+ xlab("Sampling Depth") + ylab("Mean OTU Richness")

Much less rosy, huh?

Read Scaling

Simple arithmetic scaling

Read scaling is one way of addressing the different sequencing depths of your samples.

The simplest solution is to scale all of your reads down to the lowest total sequencing depth. The effect of this is to drop a great many OTUs in higher read samples, so some data filtering is necessary to make this a good strategy.

Here are all of our read depths:

summary(colSums(otu.table.nosingle))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      24   12358   26172   26280   37261   73169

Let’s drop the blanks and then see what it looks like.

otu.table.bats <- otu.table.nosingle[,!colnames(otu.table.nosingle) %in% blanks]
write.csv(otu.table.bats, "otu.table.bats.csv")
summary(colSums(otu.table.bats))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1486   14001   27461   27299   37888   73169

I’m going to start by dropping samples that had fewer than 2000 total reads. This number is highly arbitrary. The goal is to reduce the total range of values, so if your mean sample depth is low, you might be able to keep very low read samples.

otu.table.bats <- otu.table.bats[,which(colSums(otu.table.bats) > 2000)]
otu.table.bats <- otu.table.bats[which(rowSums(otu.table.bats) > 0),]
summary(colSums(otu.table.bats))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2154   14145   27870   27575   38212   73169
minval <- min(colSums(otu.table.bats))
minval
## [1] 2154

Simple read scaling:

scaled.otu.table <- apply(otu.table.bats, MARGIN=2, function(x) round(minval*x/sum(x)))
scaled.otu.table <- data.frame(scaled.otu.table)
scaled.otu.table <- scaled.otu.table[which(rowSums(scaled.otu.table) > 0),]

You can also accomplish this in phyloseq by using these commands:

bat.phylo <- miseq.phylo %>% prune_samples(sample_sums(.) > 2000,.) %>% transform_sample_counts(function(x) round(minval*x/sum(x))) %>% prune_taxa(taxa_sums(.) > 0,.)
bat.phylo
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 5002 taxa and 280 samples ]
## sample_data() Sample Data:       [ 280 samples by 8 sample variables ]
## tax_table()   Taxonomy Table:    [ 5002 taxa by 6 taxonomic ranks ]

What is the effect of simple read scaling on richness? I made some rarefaction curves. Let’s check them out.

rarefaction.curve.scaled <- read.csv("rarefaction.curves.scaled.csv", header=TRUE, row.names=1)
rare.df.melt.scaled <- data.frame(melt(rarefaction.curve.scaled, id="Sampling.Depth"))
colnames(rare.df.melt.scaled) <- c("Sampling.Depth", "SID", "OTUs")

ggplot(rare.df.melt.scaled, aes(x=Sampling.Depth, y=OTUs, group=SID)) + geom_line() + ylab("Unique OTUs")
## Warning: Removed 3840 row(s) containing missing values (geom_path).

The lines have been artificially compressed. It's hard to tease out what's happening with all of the lines, so let's look at just the six we saw before.

mini.rare.df.scaled <- rare.df.melt.scaled[which(rare.df.melt.scaled$SID %in% sample.5 & !(is.na(rare.df.melt.scaled$OTUs))),]

ggplot() + geom_line(data =mini.rare.df, aes(x=Sampling.Depth, y=OTUs, group=SID, color=SID)) + geom_line(data =mini.rare.df.scaled, aes(x=Sampling.Depth, y=OTUs, group=SID, color=SID), linetype=2) +  scale_color_manual(values=m.g.colors)

As you can see, these OTUs only represent a small fraction of the total OTUs in the dataset.

ggplot() + geom_line(data =mini.rare.df.scaled, aes(x=Sampling.Depth, y=OTUs, group=SID, color=SID), linetype=1) +  scale_color_manual(values=m.g.colors)

Notice that the relative OTU richness of samples has changed in the downscaling process!

This sort of distortion is less of a problem if you select a higher read depth to scale to, but that comes at the expense of dropping low read depth samples from your study.

Bootstrap samples

One alternative to simple scaling is to take bootstrapped samples at a common depth.

This function runs at a similar speed to the rarefaction curves. I am again going to use a multiprocessor version (on the supercomputer).

get.bootstrapped.sample <- function(adf, col, n, reps) {
  otu.weight <- adf[,col]
  otu.vec <- rep(rownames(adf), otu.weight)
  otu.subsamples = replicate(reps, sample(otu.vec, n, replace=TRUE))
  otu.rows = apply(otu.subsamples, 2, function(y) sapply(row.names(adf), function(x) length(which(y==x))))
  
  otu.means <- rowMeans(otu.rows)
  otu.round <- sapply(otu.means, function(x) round(x))
  return(otu.round)
}
otu.table <- read.csv("otu.table.bats.csv", header=TRUE, row.names=1)
otu.table.bats <- otu.table[,which(colSums(otu.table) > 2000)]
otu.table.bats <- otu.table.bats[which(rowSums(otu.table.bats) > 0),]
minval <- min(colSums(otu.table.bats))

get.bootstrapped.sample <- function(adf, col, n, reps) {
  otu.weight <- adf[,col]
  otu.vec <- rep(rownames(adf), otu.weight)
  otu.subsamples = replicate(reps, sample(otu.vec, n, replace=TRUE))
  otu.rows = apply(otu.subsamples, 2, function(y) sapply(row.names(adf), function(x) length(which(y==x))))
  
  otu.means <- rowMeans(otu.rows)
  otu.round <- sapply(otu.means, function(x) round(x))
  return(otu.round)
}

cores <- detectCores()

start <- proc.time()
cl <- makeCluster(cores, type = "FORK")
registerDoParallel(cl)
bootstrapped.counts <- foreach(x=colnames(otu.table.bats), .combine=cbind) %dopar% {
  get.bootstrapped.sample(otu.table.bats, x, minval, 25)
}
stopCluster(cl)

bootstrapped.counts <- data.frame(bootstrapped.counts)
colnames(bootstrapped.counts) <- colnames(otu.table.bats)
print(proc.time() - start)

write.csv(bootstrapped.counts, "bootstrapped.counts.2000.csv")
print("done")

If you don’t have access to the supercomputer, you can run the function like this on your own laptop, but it will run very slowly.

otu.table.bats <- otu.table[,which(colSums(otu.table) > 2000)]
otu.table.bats <- otu.table.bats[which(rowSums(otu.table.bats) > 0),]
minval <- min(colSums(otu.table.bats))

get.bootstrapped.sample <- function(adf, col, n, reps) {
  otu.weight <- adf[,col]
  otu.vec <- rep(rownames(adf), otu.weight)
  otu.subsamples = replicate(reps, sample(otu.vec, n, replace=T))
  otu.rows = apply(otu.subsamples, 2, function(y) sapply(row.names(adf), function(x) length(which(y==x))))
  
  otu.means <- rowMeans(otu.rows)
  otu.round <- sapply(otu.means, function(x) round(x))
  return(otu.round)
}

bootstrapped.counts <- sapply(colnames(otu.table.bats), function(x) get.bootstrapped.sample(otu.table.bats, x, minval, 25))

bootstrapped.counts <- data.frame(bootstrapped.counts)
colnames(bootstrapped.counts) <- colnames(otu.table.bats)

write.csv(bootstrapped.counts, "bootstrapped.counts.csv")
bootstrapped.otu.table <- read.csv("bootstrapped.counts.csv", header=TRUE, row.names=1)

What does this version look like as rarefaction curves?

rarefaction.curve.boot <- read.csv("rarefaction.curves.bootstrap.csv", header=TRUE, row.names=1)
rare.df.melt.boot <- data.frame(melt(rarefaction.curve.boot, id="Sampling.Depth"))
colnames(rare.df.melt.boot) <- c("Sampling.Depth", "SID", "OTUs")

g <- ggplot(rare.df.melt.boot, aes(x=Sampling.Depth, y=OTUs, group=SID)) + geom_line() + ylab("Unique OTUs") + ggtitle("Bootstrapped") + coord_cartesian(ylim=c(0,500))

f <- ggplot(rare.df.melt.scaled, aes(x=Sampling.Depth, y=OTUs, group=SID)) + geom_line() + ylab("Unique OTUs") + ggtitle("Scaled") + coord_cartesian(ylim=c(0,500))

grid.arrange(g, f, nrow=1)
## Warning: Removed 1559 row(s) containing missing values (geom_path).
## Warning: Removed 3840 row(s) containing missing values (geom_path).

Let's look more closely at the six samples we selected.

mini.rare.df.boot <- rare.df.melt.boot[which(rare.df.melt.boot$SID %in% sample.5 & !(is.na(rare.df.melt.boot$OTUs))),]

ggplot() + geom_line(data =mini.rare.df, aes(x=Sampling.Depth, y=OTUs, group=SID, color=SID)) + geom_line(data =mini.rare.df.scaled, aes(x=Sampling.Depth, y=OTUs, group=SID, color=SID), linetype=2) +  scale_color_manual(values=m.g.colors) + geom_line(data =mini.rare.df.boot, aes(x=Sampling.Depth, y=OTUs, group=SID, color=SID), linetype=3)

As you can see, we still lose hundreds of OTUs by down sampling.

ggplot() + geom_line(data =mini.rare.df.boot, aes(x=Sampling.Depth, y=OTUs, group=SID, color=SID), linetype=1) +  scale_color_manual(values=m.g.colors)

This method didn't help with the relative position of the low read depth yellow line, but it did fix the relative order of the others. If we zoom in on the original rarefaction curves, we can see why.

ggplot() + geom_line(data =mini.rare.df, aes(x=Sampling.Depth, y=OTUs, group=SID, color=SID)) +  scale_color_manual(values=m.g.colors) + coord_cartesian(xlim=c(0,3000), ylim=c(0,250))

The OTU gain rates of the dark blue and yellow lines swap at a sampling depth of around 2500. The best we can hope for is to perpetuate the same trends that are present in the rarefactions below that depth. Let's test that hypothesis by bootstrapping and scaling again to a depth of 4000.

otu.table.bats <- otu.table[,which(colSums(otu.table) > 4000)]
otu.table.bats <- otu.table.bats[which(rowSums(otu.table.bats) > 0),]
summary(colSums(otu.table.bats))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4267   15458   28265   28153   38396   73230
minval <- min(colSums(otu.table.bats))
minval
## [1] 4267
scaled.otu.table <- apply(otu.table.bats, MARGIN=2, function(x) round(minval*x/sum(x)))
scaled.otu.table <- data.frame(scaled.otu.table)
scaled.otu.table <- scaled.otu.table[which(rowSums(scaled.otu.table) > 0),]
write.csv(scaled.otu.table, "scaled.otu.table.4000.csv")
rarefaction.curve.boot <- read.csv("rarefaction.curves.boot.4000.csv", header=TRUE, row.names=1)
rare.df.melt.boot <- data.frame(melt(rarefaction.curve.boot, id="Sampling.Depth"))
colnames(rare.df.melt.boot) <- c("Sampling.Depth", "SID", "OTUs")

rarefaction.curve.scaled <- read.csv("rarefaction.curves.scaled.4000.csv", header=TRUE, row.names=1)
rare.df.melt.scaled <- data.frame(melt(rarefaction.curve.scaled, id="Sampling.Depth"))
colnames(rare.df.melt.scaled) <- c("Sampling.Depth", "SID", "OTUs")

g <- ggplot(rare.df.melt.boot, aes(x=Sampling.Depth, y=OTUs, group=SID)) + geom_line() + ylab("Unique OTUs") + ggtitle("Bootstrapped") + coord_cartesian(ylim=c(0,750))

f <- ggplot(rare.df.melt.scaled, aes(x=Sampling.Depth, y=OTUs, group=SID)) + geom_line() + ylab("Unique OTUs") + ggtitle("Scaled") + coord_cartesian(ylim=c(0,750))

grid.arrange(g, f, nrow=1)
## Warning: Removed 1302 row(s) containing missing values (geom_path).
## Warning: Removed 2813 row(s) containing missing values (geom_path).

mini.rare.df.boot <- rare.df.melt.boot[which(rare.df.melt.boot$SID %in% sample.5 & !(is.na(rare.df.melt.boot$OTUs))),]

mini.rare.df.scaled <- rare.df.melt.scaled[which(rare.df.melt.scaled$SID %in% sample.5 & !(is.na(rare.df.melt.scaled$OTUs))),]

g <- ggplot() + geom_line(data =mini.rare.df, aes(x=Sampling.Depth, y=OTUs, group=SID, color=SID)) + geom_line(data =mini.rare.df.scaled, aes(x=Sampling.Depth, y=OTUs, group=SID, color=SID), linetype=2) +  scale_color_manual(values=m.g.colors) + coord_cartesian(xlim=c(0,5000), ylim=c(0,350)) + ggtitle("Scaled") + theme(legend.position="none")

f <- ggplot() + geom_line(data =mini.rare.df, aes(x=Sampling.Depth, y=OTUs, group=SID, color=SID)) + geom_line(data =mini.rare.df.boot, aes(x=Sampling.Depth, y=OTUs, group=SID, color=SID), linetype=2) +  scale_color_manual(values=m.g.colors) + coord_cartesian(xlim=c(0,5000), ylim=c(0,350)) + ggtitle("Bootstrapped") + theme(legend.position="none")

grid.arrange(g,f, nrow=1)

When bootstrapped at a sampling depth of 4000, the curves resemble the original rarefactions, at least for this subset.

Diversity measurement: Richness

If you want to calculate relative richness, you can do it with either scaled or bootstrapped tables. I’m proceeding with the bootstrapped set, which is more faithful overall to the true distribution of OTUs. I’m going to refer to that dataset as an even depth dataset.

even.depth.otu.table <- read.csv("bootstrapped.counts.4000.csv", header=TRUE, row.names=1)

To calculate richness, transform the otu table into a presence/absence table.

even.depth.pa <- data.frame(even.depth.otu.table)
even.depth.pa[even.depth.pa > 1] <- 1
otu.richness <- colSums(even.depth.pa)

These numbers are comparable to one another, but they are also bounded by the maximum bootstrapping value.

That means it’s valid to make boxplots like this:

mdf <- mdf[names(otu.richness),]
mdf$otu.richness <- otu.richness[row.names(mdf)]

ggplot(mdf, aes(x=Species, y=otu.richness, fill=Species)) + geom_boxplot() + scale_fill_manual(values=m.g.colors)

But the numbers on the y axis really only have meaning if you know that this otu table was bootstrapped to around 4000 reads. You can put that in a caption, but that's really asking your reader to do your work for you. Transform the values relative to the maximum richness value to make them interpretable.

mdf$relative.otu.richness <- mdf$otu.richness/max(mdf$otu.richness)
ggplot(mdf, aes(x=Species, y=relative.otu.richness, fill=Species)) + geom_boxplot() + scale_fill_manual(values=m.g.colors)

Diversity measurement: Evenness

There are vegan implementations and phyloseq convenience wrappers of both Simpson and Shannon’s diversity, but I think it’s important to understand how they really work.

Evenness is a measurement of how dominated a sample is by only a few taxa. One convenient way of measuring this is with statistical entropy (\(H\)); in this context, it's known as the Shannon diversity index. If you speak math, this is the definition:

\(H = - \sum^S_{j=1}p_j \ln p_j\)

where \(S\) is the total number of OTUs and \(p_j\) is the proportion of OTU \(j\) within the sample.

There's some ambiguity about the Shannon index, since you can calculate it using any logarithm. I prefer to use the natural log, but I've seen implementations using log10, and computer scientists tend to prefer log2.
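
The choice of base only rescales \(H\) by a constant, so it doesn't change any relative comparisons. For example, Shannon entropy in bits (log2) is just the natural-log version divided by ln 2:

# Base conversion is just a constant factor: log2(x) = log(x) / log(2)
p <- c(0.5, 0.25, 0.25)
-sum(p * log(p))            # H in nats (natural log)
-sum(p * log(p)) / log(2)   # the same H expressed in bits
-sum(p * log2(p))           # computed directly with log2 -- identical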

A low value of \(H\) indicates low entropy, meaning that the sample is uneven or dominated by only a few abundant taxa. A high value of \(H\) indicates that all OTUs are evenly distributed through the sample.

The maximum value that \(H\) can take depends on the total number of nonzero OTUs in the sample. Because of this ambiguous bound, I prefer to report Shannon's equitability rather than the raw Shannon diversity index. \(E_H\) is \(H\) divided by the maximum possible value of \(H\) for a given number of nonzero OTU counts. It is bounded between 0 and 1, where 1 represents a perfectly even sample.

get.shannon.diversity <- function(otucounts){
  otus <- otucounts[otucounts > 0]
  proportion.otus <- otus/sum(otus)
  raw.shannon <- sapply(proportion.otus, function(x) x*log(x))
  total.shannon <- (sum(raw.shannon)*-1)
  max.shannon <- log(length(otus))
  return(total.shannon/max.shannon)
}
shannon.equitability <- apply(even.depth.otu.table, 2, get.shannon.diversity)
mdf$shannon.equitability <- shannon.equitability[row.names(mdf)]

Shannon diversity cannot handle 0s in your count table, because the definition of entropy relies on logarithms and log(0) is undefined. That can cause some counterintuitive behavior.

Zero counts are ignored, so vector a below will appear to be perfectly even, even though 50% of its OTUs are not present at all. It is indistinguishable from vector c, where all OTUs are present.

Even though vector b is almost perfectly even, the 5 is counted while the 0s are not, so it scores as less even than vector a.

a <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
b <- c(1, 1, 1, 1, 1, 1, 1, 5, 1, 1)
c <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1)

get.shannon.diversity(a)
## [1] 1
get.shannon.diversity(b)
## [1] 0.8964959
get.shannon.diversity(c)
## [1] 1

Simpson's diversity index (\(\lambda\)) is an alternative evenness metric that can account for zeros.

\(\lambda= \sum^S_{j=1}p^2_j\)

where \(S\) is the total number of OTUs and \(p_j\) is the proportion of OTU \(j\) within the sample.

Low values of \(\lambda\) represent even compositions. Because this is confusing, \(\lambda\) is sometimes reported as \(1-\lambda\), and sometimes as the inverse Simpson index, \(\dfrac{1}{\lambda}\). Both swap the interpretation so that low values represent uneven compositions, just like the Shannon index. For consistency's sake, I've chosen to report \(1-\lambda\).

get.simpson.diversity <- function(otus){
  proportion.otus <- otus/sum(otus)
  raw.simpson <- proportion.otus*proportion.otus
  total.simpson <- (sum(raw.simpson))
  return(1-total.simpson)
}
get.simpson.diversity(a)
## [1] 0.8
get.simpson.diversity(b)
## [1] 0.8265306
get.simpson.diversity(c)
## [1] 0.9
simpson.index <- apply(even.depth.otu.table, 2, function(x) get.simpson.diversity(x))
mdf$simpson.index<- simpson.index[row.names(mdf)]

The Simpson index and Shannon equitability are correlated.

ggplot(mdf, aes(x=simpson.index, y=shannon.equitability)) + geom_point()
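
You can put a number on that relationship; a rank correlation is a reasonable choice here, since neither measure is necessarily normally distributed.

cor(mdf$simpson.index, mdf$shannon.equitability, method="spearman", use="complete.obs")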

phyloseq and vegan convenience wrappers

Using the vegan or phyloseq versions is easy.

vegan.simpson <- vegan::diversity(even.depth.otu.table, index="simpson", MARGIN = 2)
vegan.shannon <- vegan::diversity(even.depth.otu.table, index="shannon", MARGIN = 2)
mdf$simpson.vegan <- vegan.simpson[row.names(mdf)]
mdf$shannon.vegan <- vegan.shannon[row.names(mdf)]

ggplot(mdf, aes(x=simpson.vegan, y=simpson.index)) + geom_point()

ggplot(mdf, aes(x=shannon.vegan, y=shannon.equitability)) + geom_point()

Note that the vegan implementation of Shannon's diversity is not the same as Shannon's equitability. The diversity function does not currently have that option and only reports the unscaled Shannon index.
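
If you want an equitability-style value from vegan's output, you can divide its Shannon index by the log of each sample's observed richness. This is Pielou's evenness, which is the same quantity as Shannon's equitability.

# Pielou's evenness: vegan's Shannon divided by log(observed OTU richness)
vegan.richness <- vegan::specnumber(even.depth.otu.table, MARGIN=2)
vegan.evenness <- vegan.shannon / log(vegan.richness)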

To use the phyloseq versions, first make a phyloseq object with your even depth table.

The name of the estimate_richness function doesn't quite make sense here, as evenness measurements are really not richness measurements.

even.phylo <- merge_phyloseq(otu_table(even.depth.otu.table, taxa_are_rows=TRUE), tax_table(miseq.phylo), sample_data(miseq.phylo))

pseq.measures <- phyloseq::estimate_richness(even.phylo, split=TRUE, measures=c("Shannon", "Simpson"))

mdf$pseq.simpson <- pseq.measures[row.names(mdf),"Simpson"]
mdf$pseq.shannon <- pseq.measures[row.names(mdf),"Shannon"]
ggplot(mdf, aes(x=simpson.vegan, y=pseq.simpson)) + geom_point()

ggplot(mdf, aes(x=shannon.vegan, y=pseq.shannon)) + geom_point()

The phyloseq versions are simply wrappers for the vegan versions and have exactly the same values.

Read Normalization

Thus far you have worked with count data. We made some attempt to normalize these counts by scaling or bootstrapping to a common depth, but lost many OTU records in the process. If we want to continue trying to treat each sample at a common depth, we can normalize each sample to its own read depth.

You could do this to an unscaled dataset. The argument against doing so is that some columns in your OTU table (the ones with high read depth) would benefit unfairly from having a larger sample size than the low read depth ones. If you are using an arithmetic normalization method, this is true. So for now, we'll use the even depth table.

Within phyloseq, normalizing to read depth is simple.

normalized.phylo <- even.phylo %>% transform_sample_counts(function(x) x/sum(x))
normalized.phylo
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 15497 taxa and 274 samples ]
## sample_data() Sample Data:       [ 274 samples by 8 sample variables ]
## tax_table()   Taxonomy Table:    [ 15497 taxa by 6 taxonomic ranks ]
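
Outside of phyloseq, the same normalization is a one-liner on the OTU table: divide each column by its column sum.

# Base R equivalent: divide each sample (column) by its total read count
normalized.otu.table <- sweep(even.depth.otu.table, 2, colSums(even.depth.otu.table), "/")
colSums(normalized.otu.table)[1:3]   # every sample now sums to 1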

You can use this normalized dataset to make figures – for some examples see the tutorial on major taxonomic groups.

Read Transformation

Each sample is a composition of reads with a very different total read count. The ideal transformation for comparing proportions of OTUs within each sample is a centered log-ratio (CLR) transform.

Normalizing the data to add up to one imposes an arbitrary sum on each sample: an increase in any one OTU must therefore be matched by a decrease in all of the others. Some of this problem is baked into the dataset, since the Illumina instrument itself imposes an arbitrary sum whether or not you choose to normalize each sample to 1.

Data that have been transformed with the CLR are all on the same scale, but columns do not sum to any arbitrary number, so increases in one OTU do not induce what looks like a decrease in others. Instead, each read count is expressed as a ratio to the geometric mean of all read counts within the sample.
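
On a single zero-free sample, the CLR works out to each log count minus the mean of the log counts, which centers the transformed sample at zero:

# Toy CLR on one zero-free sample
x <- c(2, 4, 8, 16)
clr.x <- log(x) - mean(log(x))
clr.x
sum(clr.x)   # CLR values within a sample always sum to (effectively) zero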

In order to use this logarithm based transform, the counts are first normalized and then zero values in the dataset must be replaced. This is one of my favorite topics in bioinformatics, but the short of it is that I recommend using the multiplicative replacement. This replaces zeros with a small value and then adjusts the proportions of other non-zero OTUs in the column so that everything still sums to 1.

transform.clr <- function(otu.df){
  print(dim(otu.df))
  # drop very rare OTUs to make the table less sparse before zero replacement
  d.1 <- data.frame(otu.df[which(rowSums(otu.df) > 10),], 
    check.names=F)
  print(dim(d.1))
  # multiplicative zero replacement (cmultRepl expects samples as rows, hence the transposes)
  d.czm <- t(cmultRepl(t(d.1), label=0, method="CZM"))
  print(dim(d.czm))
  # CLR: log of each value minus the mean log value within the sample
  d.clr <- apply(d.czm, 2, function(x){log(x) - mean(log(x))})
  print(dim(d.clr))
  return(d.clr)
}

To use this function, supply an otu table. The function assumes that rows are otus and columns are samples.

clr.table <- transform.clr(otu.table)
## [1] 30903   294
## [1] 5061  294
## No. corrected values:  370728 
## [1] 5061  294
## [1] 5061  294

You will still lose an enormous number of OTUs when making this transform: not because of the transform itself, but because of the zero replacement step and the necessity of making your dataframe less sparse.

Unless you are working with transcripts, I would suggest agglomerating your reads at the genus level before running a CLR transform. You will likely keep much more of your dataset. You might also feel better about the OTU loss if you first drop OTUs that were very poorly classified.

tax.table <- data.frame(tax_table(miseq.phylo))
tax.table[which(sapply(tax.table$Phylum, function(x) grepl("unclassified", x, ignore.case=TRUE))),]

More than 2,000 of these OTUs are unclassified at the phylum level. Let’s drop those and then agglomerate by genus.

new.tax.table <- tax.table[which(!sapply(tax.table$Phylum, function(x) grepl("unclassified", x, ignore.case=TRUE))),]

phylo.sub <- merge_phyloseq(tax_table(as.matrix(new.tax.table)), otu_table(miseq.phylo), sample_data(miseq.phylo))

genus.phylo <- tax_glom(phylo.sub, taxrank="Genus")
genus.phylo
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1745 taxa and 294 samples ]
## sample_data() Sample Data:       [ 294 samples by 8 sample variables ]
## tax_table()   Taxonomy Table:    [ 1745 taxa by 6 taxonomic ranks ]

We now have a much smaller dataframe. Typically, if there are differences this large between the number of genus-level assignments and the number of OTUs, clustering was not optimal.

Extract the OTU table from the phyloseq object.

gen.otu.table <- data.frame(otu_table(genus.phylo))

Now use the CLR transform.

gen.clr.table <- transform.clr(gen.otu.table)
## [1] 1745  294
## [1] 986 294
## No. corrected values:  78855 
## [1] 986 294
## [1] 986 294

About half of the OTUs were still dropped. For this particular very sparse dataset, this might be the best we can do if we want to preserve the abundance of unclassified genera.

You can use this transformed abundance table with any normal multivariate statistics technique, including PCAs. For some examples, see the ordination tutorial on the front page.