For this tutorial, raw sequences of the V5-V6 region of the 16S rRNA gene were obtained from BioProject PRJNA377760 via GenBank. These sequences were collected from fecal samples of 11 different animal species from the Northern Minnesota area (Brown et al. 2017). The sequences are derived from a mixture of Illumina MiSeq and HiSeq sources and were assembled separately using mothur (Schloss et al. 2009). Paired end sequences were trimmed to 150 bp and aligned with the SILVA ssRNA reference database (version 1.32, non-redundant).
This workflow covers how to make figures that illustrate the major taxonomic groups in your samples and also how to make statistically sound comparisons between them. It is never enough to stop at visualization. TL;DR: stacked bar charts are misleading, consider alternatives.
library(compositions)
library(zCompositions)
library(ape)
library(plyr)
library(dplyr)
library(ggplot2)
library(gplots)
library(lme4)
library(tidyr)
library(vegan)
library(scales)
library(grid)
library(reshape2)
library(phyloseq)
library(ALDEx2)
library(gridExtra)
library(stringr)
library(viridis)
m.g.colors <- c( "#999999", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7", "#CBD588", "#5F7FC7", "orange","#DA5724", "#508578", "#CD9BCD",
"#AD6F3B", "#673770","#D14285", "#652926", "#C84248",
"#8569D5", "#5E738F","#D1A33D", "#8A7C64", "#599861" )
more.colors <- c("indianred1", "steelblue3", "skyblue1", "mediumorchid","royalblue4", "olivedrab3",
"pink", "#FFED6F", "mediumorchid3", "ivory2", "tan1", "aquamarine3", "#C0C0C0",
"mediumvioletred", "#999933", "#666699", "#CC9933", "#006666", "#3399FF",
"#993300", "#CCCC99", "#666666", "#FFCC66", "#6699CC", "#663366", "#9999CC", "#CCCCCC",
"#669999", "#CCCC66", "#CC6600", "#9999FF", "#0066CC", "#99CCCC", "#999999", "#FFCC00",
"#009999", "#FF9900", "#999966", "#66CCCC", "#339966", "#CCCC33", "#EDEDED"
)
theme_set(theme_classic(base_size=12, base_family="Avenir"))
taxfile="srcmiseq.v132.cons.taxonomy"
sharedfile="srcmiseq.v132.shared"
This is the full metadata that accompanies the SRA download. We’ll keep only some information.
mdf.full <- read.csv("SraRunTable.txt", header=TRUE, row.names = 1)
mdf.full
mdf <- subset(mdf.full, select=c(Host, Individual, Instrument, Sample.Name, source_category))
mdf
We’ll first look at the Illumina MiSeq sequences and remove the HiSeq set.
mdf.miseq <- mdf[which(mdf$Instrument=="Illumina MiSeq"),]
mdf.miseq
There are 46 MiSeq sequencing samples.
mothur_data <- import_mothur(mothur_constaxonomy_file = taxfile, mothur_shared_file = sharedfile)
moth_merge <- merge_phyloseq(sample_data(mdf.miseq), tax_table(mothur_data), otu_table(mothur_data))
colnames(tax_table(moth_merge)) <- c("Kingdom", "Phylum", "Class",
"Order", "Family", "Genus")
miseq.phylo <- moth_merge %>%
subset_taxa(
Kingdom %in% c("Bacteria", "Archaea") &
Family != "mitochondria" &
Class != "Chloroplast"
)
miseq.phylo
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 2003 taxa and 46 samples ]
## sample_data() Sample Data: [ 46 samples by 5 sample variables ]
## tax_table() Taxonomy Table: [ 2003 taxa by 6 taxonomic ranks ]
This low OTU count is not surprising since each record is a phylotype, representing all reads identified as the same genus regardless of dissimilarity with one another. On this basis, we would expect rarefactions to be fairly complete.
min(sample_sums(miseq.phylo))
## [1] 15018
miseq.sub <- miseq.phylo %>% prune_samples(sample_sums(miseq.phylo) > 15018, .) %>% prune_taxa(taxa_sums(.) > 0,.)
min(sample_sums(miseq.sub))
## [1] 33723
To accurately compare the group composition for samples with different depths, reads should be relativized, but also drawn from a similar depth. This is a simple scaling function.
I’ll replace with the bootstrapped sample later.
min.miseq <- min(sample_sums(miseq.sub))
scaled.miseq <- miseq.sub %>% transform_sample_counts(function(x) round(min.miseq*x/sum(x))) %>% prune_taxa(taxa_sums(.) > 0,.)
scaled.miseq
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1604 taxa and 45 samples ]
## sample_data() Sample Data: [ 45 samples by 5 sample variables ]
## tax_table() Taxonomy Table: [ 1604 taxa by 6 taxonomic ranks ]
taxfile="srchiseq.cons.taxonomy"
sharedfile="srchiseq.shared"
Now remove the Illumina MiSeq sequences and keep the HiSeq set.
mdf.hiseq <- mdf[which(mdf$Instrument=="Illumina HiSeq 2500"),]
mdf.hiseq
There are 168 HiSeq sequencing samples.
mothur_data <- import_mothur(mothur_constaxonomy_file = taxfile, mothur_shared_file = sharedfile)
moth_merge <- merge_phyloseq(sample_data(mdf.hiseq), tax_table(mothur_data), otu_table(mothur_data))
colnames(tax_table(moth_merge)) <- c("Kingdom", "Phylum", "Class",
"Order", "Family", "Genus")
hiseq.phylo <- moth_merge %>%
subset_taxa(
Kingdom %in% c("Bacteria", "Archaea") &
Family != "mitochondria" &
Class != "Chloroplast"
)
hiseq.phylo
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 2450 taxa and 168 samples ]
## sample_data() Sample Data: [ 168 samples by 5 sample variables ]
## tax_table() Taxonomy Table: [ 2450 taxa by 6 taxonomic ranks ]
read.depths <- sample_sums(hiseq.phylo)
hist(read.depths)
summary(read.depths)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1225 154382 208400 220640 264612 729460
min(sample_sums(hiseq.phylo))
## [1] 1225
hiseq.sub <- hiseq.phylo %>% prune_samples(sample_sums(hiseq.phylo) > 33723, .) %>% prune_taxa(taxa_sums(.) > 0,.)
min(sample_sums(hiseq.sub))
## [1] 37528
hiseq.sub
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 2431 taxa and 164 samples ]
## sample_data() Sample Data: [ 164 samples by 5 sample variables ]
## tax_table() Taxonomy Table: [ 2431 taxa by 6 taxonomic ranks ]
If we drop three samples, we can rarefy to 10331 reads in the next step. If we drop sample reads that are nearer to the minimum miseq sample depth, we drop 4.
To accurately compare the group composition for samples with different depths, reads should be relativized, but also drawn from a similar depth. This is a simple scaling function.
min.hiseq <- min(sample_sums(hiseq.sub))
scaled.hiseq <- hiseq.sub %>% transform_sample_counts(function(x) round(min.hiseq*x/sum(x))) %>% prune_taxa(taxa_sums(.) > 0,.)
scaled.hiseq
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1713 taxa and 164 samples ]
## sample_data() Sample Data: [ 164 samples by 5 sample variables ]
## tax_table() Taxonomy Table: [ 1713 taxa by 6 taxonomic ranks ]
Use phyloseq.
merge.phylo <- merge_phyloseq(otu_table(scaled.hiseq), otu_table(scaled.miseq), tax_table(scaled.hiseq), tax_table(scaled.miseq), sample_data(scaled.hiseq), sample_data(scaled.miseq))
merge.phylo
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1823 taxa and 209 samples ]
## sample_data() Sample Data: [ 209 samples by 5 sample variables ]
## tax_table() Taxonomy Table: [ 1823 taxa by 6 taxonomic ranks ]
Super convenient.
Also merge together the unscaled versions.
merge.phylo.unscaled <- merge_phyloseq(otu_table(miseq.sub), otu_table(hiseq.sub), tax_table(miseq.sub), tax_table(hiseq.sub), sample_data(miseq.sub), sample_data(hiseq.sub))
merge.phylo.unscaled
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 2434 taxa and 209 samples ]
## sample_data() Sample Data: [ 209 samples by 5 sample variables ]
## tax_table() Taxonomy Table: [ 2434 taxa by 6 taxonomic ranks ]
otu.table <- data.frame(otu_table(merge.phylo))
tax.table <- data.frame(tax_table(merge.phylo))
mdf <- data.frame(sample_data(merge.phylo))
merge.phylo.genus <- tax_glom(merge.phylo.unscaled, taxrank="Genus")
write.csv(data.frame(otu_table(merge.phylo.genus)), "fecal.genus.library.unscaled.otu.table.csv")
write.csv(data.frame(tax_table(merge.phylo.genus)), "fecal.genus.library.tax.table.csv")
write.csv(data.frame(sample_data(merge.phylo.genus)), "fecal.genus.library.metadata.csv")
Stacked bar charts of the phyla are extremely common, but can be misleading and are easy to miscalculate. While they are an intuitive representation of compositional data (which this data inherently is!), grouping samples together into categories to represent as a stacked bar is a distortion of the data. If you want to compare the abundance of sets of samples, then reads must be relativized to each sample’s total depth. The read count only has meaning in context of that depth. Even when counts are transformed to relative abundance values, the relative proportions of taxonomic groups to one another within each sample are very sensitive to subsetting. See an example below.
ex.data <- data.frame("sampletype"=c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"), "bacteria"=c("Bacteroidetes", "Firmicutes", "Proteobacteria", "Verrucomicrobia", "Unclassified", "Bacteroidetes", "Firmicutes", "Proteobacteria", "Verrucomicrobia", "Unclassified"), "abundance"=c(10, 20, 30, 20, 20, 1000, 4000, 1000, 3500, 500))
#Create a vector of scaled relative abundance of all reads
relative.vals <- data.frame(ex.data %>% group_by(sampletype) %>% summarise(abundance=abundance/sum(abundance)))
## `summarise()` regrouping output by 'sampletype' (override with `.groups` argument)
ex.data$relative.abundance <- relative.vals$abundance
ggplot(ex.data, aes(x=sampletype, y=relative.abundance, fill=bacteria)) + geom_bar(stat="identity") + ylab("Relative Abundance")
Here it seems like Sample B has roughly twice as much Firmicutes as Sample A does. Samples A and B have about the same amount of Bacteroidetes.
But let’s say that we don’t care about unclassified reads and want to discard them… or a more likely scenario, they were unclassifiable period and removed at an earlier point in our sequencing pipeline.
ex.data.sub <- ex.data[which(ex.data$bacteria != "Unclassified"),]
relative.vals <- data.frame(ex.data.sub %>% group_by(sampletype) %>% summarise(abundance=abundance/sum(abundance)))
## `summarise()` regrouping output by 'sampletype' (override with `.groups` argument)
ex.data.sub$relative.abundance <- relative.vals$abundance
ggplot(ex.data.sub, aes(x=sampletype, y=relative.abundance, fill=bacteria)) + geom_bar(stat="identity") + ylab("Relative Abundance")
Now it seems like sample A has slightly more Bacteroidetes than sample B does. It also seems like sample B has only about 1/3 more Firmicutes than sample A does.
The subset has made the relative proportions between samples unstable. This is really important if you are for example trying to draw inferences about differential abundances between sample types.
Using proportional abundance is a great solution for the instability caused by subsetting, since it centers data relative to a geometric mean rather than an arithmetic sum. This means that each taxon is a vector in its own dimension, rather than all of them sharing a linear space.
To apply this transform, you use a logarithm, so you can’t have any zero values in your dataset. For this example, there are no zeros, but if there were, you would use the zcompositions package to replace them.
Here’s our original, transformed using proportional abundance
proportional.vals <- data.frame(ex.data %>% group_by(sampletype) %>% summarise(abundance=log(abundance)-mean(log(abundance))))
## `summarise()` regrouping output by 'sampletype' (override with `.groups` argument)
ex.data$proportional.abundance <- proportional.vals$abundance
ggplot(ex.data, aes(x=sampletype, y=proportional.abundance, fill=bacteria)) + geom_bar(stat="identity") + ylab("Relative Abundance")
And the subset version, subset after using a CLR transform on the initial values.
ex.data.sub <- ex.data[which(ex.data$bacteria != "Unclassified"),]
ggplot(ex.data.sub, aes(x=sampletype, y=proportional.abundance, fill=bacteria)) + geom_bar(stat="identity") + ylab("Relative Abundance")
Interpreting proportional abundance requires a little more work than relative abundance, but is very much worth it. These bar charts are weird because they dip below the y-axis. Negative proportional abundance values represent taxa that are less common. You might have noticed that the Bacteroidetes bar is now huge. But it’s the smallest value! It is the least common and is farthest away from the geometric mean (here centered at zero). Stacked bar charts for this reason are not the best choice. Instead, consider heat maps.
ggplot(ex.data, aes(x=sampletype, y=bacteria, fill=proportional.abundance)) + geom_tile() + scale_fill_gradient2(low=muted("blue"), high=muted("red")) + ggtitle("All Bacteria")
ggplot(ex.data.sub, aes(x=sampletype, y=bacteria, fill=proportional.abundance)) + geom_tile() + scale_fill_gradient2(low=muted("blue"), high=muted("red")) + ggtitle("Subset Bacteria")
Use a diverging heatmap to help make interpretation easy. One unit of proportional abundance is one fold change in abundance, relative to the geometric mean for the sample.
It’s still not an ideal visual presentation, but where it really counts, the actual differences between samples, you can see that the values are stable. This quality makes the CLR transform great for comparing samples togehter that have different sampling depths, or total counts.
ex.data[which(ex.data$sampletype=="A"),"proportional.abundance"] - ex.data[which(ex.data$sampletype=="B"),"proportional.abundance"]
## [1] -0.2464287 -0.9395759 0.8521836 -0.8060445 1.1398656
ex.data.sub[which(ex.data.sub$sampletype=="A"),"proportional.abundance"] - ex.data.sub[which(ex.data.sub$sampletype=="B"),"proportional.abundance"]
## [1] -0.2464287 -0.9395759 0.8521836 -0.8060445
In practice, I use this function to transform matrices using a CLR. This function expects rows to be taxa and columns to be samples, and it returns a matrix with the same orientation.
transform.clr <- function(otu.df){
d.1 <- data.frame(otu.df[which(rowSums(otu.df) > 10),],
check.names=F)
d.czm <- t(cmultRepl(t(d.1), label=0, method="CZM"))
d.clr <- apply(d.czm, 2, function(x){log(x) - mean(log(x))})
return(d.clr)
}
Interpreting proportional abundance requires a little more work than relative abundance, but is very much worth it.
However! Sometimes a stacked bar chart of relative abundance is what’s required. Introducing your data to non-experts or generally people who don’t want to hear about how to interpret proportional abundance is a regular feature of lab, committee, and meetings with collaborators. Here is how you make one that is not misleading. Avoid using this kind of chart completely if you are displaying arbitrary subsets of your taxa or more than one subset of your taxa.
Agglomerate your taxa by phylum and relativize sample counts to each sample’s total read depth.
phylum.phylo <- tax_glom(merge.phylo, taxrank="Phylum") %>% transform_sample_counts(function(x) x/sum(x))
phylum.melt <- psmelt(phylum.phylo)
The easiest phylum chart to make is one where each bar represents one sample. No additional transformations are needed.
ggplot(phylum.melt, aes(x=Sample, y=Abundance, fill=Phylum)) + geom_bar(stat="identity") + scale_fill_manual(values=c(m.g.colors, more.colors)) + ylab("Relative Abundance") + guides(fill=guide_legend(ncol=2)) + theme(axis.text.x = element_blank())
This is almost unreadable if you include all phyla. You can threshold by abundance.
ggplot(phylum.melt[which(phylum.melt$Abundance >= 0.02),], aes(x=Sample, y=Abundance, fill=Phylum)) + geom_bar(stat="identity") + scale_fill_manual(values=c(m.g.colors, more.colors)) + ylab("Relative Abundance Phyla >= 2%") + theme(axis.text.x = element_blank())
This chart only includes a record if its abundance is > 2% within each sample.
Note that this can be misleading. Taxa that are 1% abundant in some samples will show up as absent. Your bars will not add up to 100%, which becomes a more noticeable problem the more diverse your dataset is.
You can fill in that 100% like this:
phylum.remaining <- phylum.melt[which(phylum.melt$Abundance >= 0.02),] %>% group_by(Sample) %>% summarize(Abundance=1-sum(Abundance))
## `summarise()` ungrouping output (override with `.groups` argument)
phylum.remaining <- data.frame(phylum.remaining)
phylum.remaining$Phylum <- "Other"
phy.sub.melt <- rbind(phylum.remaining, subset(phylum.melt[which(phylum.melt$Abundance >= 0.02),], select=c(Sample, Abundance, Phylum)))
ggplot(phy.sub.melt, aes(x=Sample, y=Abundance, fill=Phylum)) + geom_bar(stat="identity") + scale_fill_manual(values=c(m.g.colors, more.colors)) + ylab("Relative Abundance Phyla >= 2%") + theme(axis.text.x = element_blank())
Label the X-axis with the host type.
host.labels <- as.vector(mdf$Host)
names(host.labels) <- row.names(mdf)
theme_set(theme_classic(base_size=9, base_family="Avenir"))
ggplot(phy.sub.melt, aes(x=Sample, y=Abundance, fill=Phylum)) + geom_bar(stat="identity") + scale_fill_manual(values=c(m.g.colors, more.colors)) + ylab("Relative Abundance Phyla >= 2%") + scale_x_discrete(labels=host.labels) + theme(axis.text.x = element_text(angle=45, hjust=1))
In this case it’s still horrible because there are so many samples!
Here’s how you would make a phylum bar chart where the relative heights of bars account for differences in the number of samples in a category. The way you do that is normalize the sum of sample abundance to the number of samples… otherwise known as taking the mean value!
Decide in advance what categorical variable(s) you want on your plot.
phy.avgs <- phylum.melt %>% group_by(Host, Phylum) %>% summarize(Abundance=mean(Abundance))
## `summarise()` regrouping output by 'Host' (override with `.groups` argument)
phy.avgs <- data.frame(phy.avgs)
ggplot(phy.avgs, aes(x=Host, y=Abundance, fill=Phylum)) + geom_bar(stat="identity") + scale_fill_manual(values=c(m.g.colors, more.colors)) + ylab("Mean Relative Abundance") + theme(axis.text.x = element_text(angle=45, hjust=1))
You can also threshold these averaged phylum values. You have choices on how to do this… you can use the original subset of taxa that you used for the sample graph, you can include only taxa with a mean abundance > 2% across the dataframe , or you can include only taxa with a sum abundance > 2% across the dataframe. They will all look a little bit different.
per.sample.98.percent <- unique(phy.sub.melt$Phylum)
across.98.percent <- phylum.melt %>% group_by(Phylum) %>% summarise(meanAbd = mean(Abundance), sumAbd = sum(Abundance))
## `summarise()` ungrouping output (override with `.groups` argument)
across.98.percent <- data.frame(across.98.percent)
mean.98.percent = across.98.percent[which(across.98.percent$meanAbd >= 0.02), "Phylum"]
sum.98.percent = across.98.percent[which(across.98.percent$sumAbd >= 0.02), "Phylum"]
#set common colors to make comparing graphs easier
phylum.colors <- c(m.g.colors, more.colors)[1:length(unique(phy.avgs$Phylum))]
names(phylum.colors) <- unique(phy.avgs$Phylum)
phylum.colors
## Acetothermia Acidobacteria Actinobacteria
## "#999999" "#56B4E9" "#009E73"
## Armatimonadetes Atribacteria Bacteria_unclassified
## "#F0E442" "#0072B2" "#D55E00"
## Bacteroidetes BRC1 Chlamydiae
## "#CC79A7" "#CBD588" "#5F7FC7"
## Chloroflexi Cloacimonetes Cyanobacteria
## "orange" "#DA5724" "#508578"
## Deferribacteres Deinococcus-Thermus Dependentiae
## "#CD9BCD" "#AD6F3B" "#673770"
## Elusimicrobia Entotheonellaeota Epsilonbacteraeota
## "#D14285" "#652926" "#C84248"
## FBP FCPU426 Fibrobacteres
## "#8569D5" "#5E738F" "#D1A33D"
## Firmicutes Fusobacteria Gemmatimonadetes
## "#8A7C64" "#599861" "indianred1"
## Kiritimatiellaeota Latescibacteria Lentisphaerae
## "steelblue3" "skyblue1" "mediumorchid"
## Nitrospinae Nitrospirae Omnitrophicaeota
## "royalblue4" "olivedrab3" "pink"
## Patescibacteria Planctomycetes Poribacteria
## "#FFED6F" "mediumorchid3" "ivory2"
## Proteobacteria Rokubacteria Spirochaetes
## "tan1" "aquamarine3" "#C0C0C0"
## Synergistetes Tenericutes Verrucomicrobia
## "mediumvioletred" "#999933" "#666699"
## WPS-2 WS2 Zixibacteria
## "#CC9933" "#006666" "#3399FF"
ggplot(phy.avgs[which(phy.avgs$Phylum %in% per.sample.98.percent),], aes(x=Host, y=Abundance, fill=Phylum)) + geom_bar(stat="identity") + scale_fill_manual(values=phylum.colors) + ylab("Mean Relative Abundance") + ggtitle("Abundance Per Sample >= 2%") + theme(axis.text.x = element_text(angle=45, hjust=1)) + guides(fill=guide_legend(ncol=2))
ggplot(phy.avgs[which(phy.avgs$Phylum %in% mean.98.percent),], aes(x=Host, y=Abundance, fill=Phylum)) + geom_bar(stat="identity") + scale_fill_manual(values=phylum.colors) + ylab("Mean Relative Abundance") + ggtitle("Mean Abundance >= 2%") + theme(axis.text.x = element_text(angle=45, hjust=1))
ggplot(phy.avgs[which(phy.avgs$Phylum %in% sum.98.percent),], aes(x=Host, y=Abundance, fill=Phylum)) + geom_bar(stat="identity") + scale_fill_manual(values=phylum.colors) + ylab("Mean Relative Abundance") + ggtitle("Sum Abundance >= 2%") + theme(axis.text.x = element_text(angle=45, hjust=1)) + guides(fill=guide_legend(ncol=2))
You might also decide at this point to be kind to your readers and print the Latin names as common names, or make some other more intelligible transformation of your categorical labels.
mdf$CommonNames <- factor(mdf$Host)
levels(mdf$CommonNames) <- c("Cow", "Goose", "Dog", "Beaver", "Deer", "Cat", "Chicken", "Gull", "Rabbit", "Turkey", "Pig")
c.names <- levels(mdf$CommonNames)
names(c.names) <- levels(factor(mdf$Host))
c.names
## Bos taurus Branta sp. Canis lupus Castor sp.
## "Cow" "Goose" "Dog" "Beaver"
## Cervidae sp. Felis silvestris Gallus gallus Larus sp.
## "Deer" "Cat" "Chicken" "Gull"
## Leporidae sp. Meleagris gallopavo Sus sp.
## "Rabbit" "Turkey" "Pig"
ggplot(phy.avgs[which(phy.avgs$Phylum %in% sum.98.percent),], aes(x=Host, y=Abundance, fill=Phylum)) + geom_bar(stat="identity") + scale_fill_manual(values=phylum.colors) + ylab("Sum Relative Abundance") + ggtitle("Sum Abundance >= 2%") + theme(axis.text.x = element_text(angle=45, hjust=1)) + scale_x_discrete(labels=c.names) + guides(fill=guide_legend(ncol=2))
Just for demonstration, this is the incorrect way to calculate abundance for a group of samples. Note that the bars do not add up to 1. The second plot stretches the data, but this is still a misrepresentation.
ggplot(phylum.melt[which(phylum.melt$Abundance >= 0.02),], aes(x=Host, y=Abundance, fill=Phylum)) + geom_bar(stat="identity") + scale_fill_manual(values=phylum.colors) + ylab("Relative Abundance Phyla >= 2%") + ggtitle("Incorrect Method") + guides(fill=guide_legend(ncol=2))
ggplot(phylum.melt[which(phylum.melt$Abundance >= 0.02),], aes(x=Host, y=Abundance, fill=Phylum)) + geom_bar(stat="identity", position="fill") + scale_fill_manual(values=phylum.colors) + ylab("Relative Abundance Phyla >= 2%") + ggtitle("Incorrect Method") + guides(fill=guide_legend(ncol=2))
Usually, these plots are not actually all that useful at the Phylum level.
Here’s one at the family level.
fam.phylo <- tax_glom(merge.phylo, taxrank="Family") %>% transform_sample_counts(function(x) x/sum(x))
fam.melt <- psmelt(fam.phylo)
fam.avgs <- fam.melt %>% group_by(Host, Family) %>% summarize(Abundance=mean(Abundance))
## `summarise()` regrouping output by 'Host' (override with `.groups` argument)
fam.avgs <- data.frame(fam.avgs)
across.98.percent <- fam.melt %>% group_by(Family) %>% summarise(meanAbd = mean(Abundance), sumAbd = sum(Abundance))
## `summarise()` ungrouping output (override with `.groups` argument)
across.98.percent <- data.frame(across.98.percent)
mean.98.percent = across.98.percent[which(across.98.percent$meanAbd >= 0.02), "Family"]
sum.98.percent = across.98.percent[which(across.98.percent$sumAbd >= 0.02), "Family"]
At the family level, the only filtering strategy that makes sense is taking only those taxa with a mean relative abundance > 2%. This biases the chart away from taxa that are unevenly abundant except at the extreme.
length(c(more.colors, m.g.colors))
## [1] 65
length(unique(sum.98.percent))
## [1] 121
length(unique(mean.98.percent))
## [1] 13
To include more taxa, you can filter earlier, like we did with taxa. Again, this does distort the data.
fam.sub <- fam.melt[which(fam.melt$Abundance >= 0.02),]
sample.98.percent <- unique(fam.sub$Family)
length(sample.98.percent)
## [1] 69
Or change your threshold percentage for mean/sum relative abundance.
mean.88.percent = across.98.percent[which(across.98.percent$meanAbd >= 0.12), "Family"]
sum.88.percent = across.98.percent[which(across.98.percent$sumAbd >= 0.12), "Family"]
mean.99.percent = across.98.percent[which(across.98.percent$meanAbd >= 0.01), "Family"]
sum.99.percent = across.98.percent[which(across.98.percent$sumAbd >= 0.01), "Family"]
length(mean.88.percent)
## [1] 1
length(sum.88.percent)
## [1] 63
length(mean.99.percent)
## [1] 16
length(sum.99.percent)
## [1] 157
family.colors <- c(m.g.colors, more.colors)[1:length(sum.88.percent)]
names(family.colors) <- sum.88.percent
ggplot(fam.avgs[which(fam.avgs$Family %in% mean.98.percent),], aes(x=Host, y=Abundance, fill=Family)) + geom_bar(stat="identity") + scale_fill_manual(values=family.colors) + ylab("Mean Relative Abundance") + ggtitle("Mean Abundance") + ggtitle("Mean Relative Abundance >= 2%") + scale_x_discrete(labels=c.names) + guides(fill=guide_legend(reverse=TRUE))+ coord_flip()
ggplot(fam.avgs[which(fam.avgs$Family %in% mean.99.percent),], aes(x=Host, y=Abundance, fill=Family)) + geom_bar(stat="identity") + scale_fill_manual(values=family.colors) + ylab("Mean Relative Abundance") + ggtitle("Mean Abundance") + ggtitle("Mean Relative Abundance >= 1%") + scale_x_discrete(labels=c.names) + guides(fill=guide_legend(reverse=TRUE, ncol=2)) + coord_flip()
ggplot(fam.avgs[which(fam.avgs$Family %in% sum.88.percent),], aes(x=Host, y=Abundance, fill=Family)) + geom_bar(stat="identity") + scale_fill_manual(values=family.colors) + ylab("Mean Relative Abundance") + ggtitle("Mean Abundance") + ggtitle("Sum Relative Abundance >= 10%") + scale_x_discrete(labels=c.names) + guides(fill=guide_legend(reverse=TRUE)) + coord_flip()
ggplot(fam.avgs[which(fam.avgs$Family %in% sum.88.percent),], aes(x=Host, y=Abundance, fill=Family)) + geom_bar(stat="identity") + scale_fill_manual(values=family.colors) + ylab("Mean Relative Abundance") + ggtitle("Mean Abundance") + ggtitle("Sum Relative Abundance >= 10%") + theme(legend.position="none") + scale_x_discrete(labels=c.names) + guides(fill=guide_legend(reverse=TRUE)) + coord_flip()
Add “Other Families” Category.
family.remaining <- fam.avgs[which(fam.avgs$Family %in% mean.99.percent),] %>% group_by(Host) %>% summarize(Abundance=1-sum(Abundance))
## `summarise()` ungrouping output (override with `.groups` argument)
family.remaining <- data.frame(family.remaining)
family.remaining$Family <- "Other Families"
fam.sub.melt <- rbind(family.remaining, subset(fam.avgs[which(fam.avgs$Family %in% mean.99.percent),], select=c(Host, Abundance, Family)))
Reorder the factor levels to put “Other” at the end.
fam.sub.melt$Factor.Family <- factor(fam.sub.melt$Family, levels=c("Other Families", levels(factor(fam.avgs[which(fam.avgs$Family %in% mean.99.percent),"Family"]))))
levels(fam.sub.melt$Factor.Family)
## [1] "Other Families" "Akkermansiaceae"
## [3] "Bacteria_unclassified" "Bacteroidaceae"
## [5] "Bacteroidales_unclassified" "Christensenellaceae"
## [7] "Clostridiaceae_1" "Erysipelotrichaceae"
## [9] "Lachnospiraceae" "Lactobacillaceae"
## [11] "Mollicutes_RF39_fa" "Muribaculaceae"
## [13] "Peptostreptococcaceae" "Prevotellaceae"
## [15] "Rikenellaceae" "Ruminococcaceae"
## [17] "Veillonellaceae"
levels(factor(fam.sub.melt$Family))
## [1] "Akkermansiaceae" "Bacteria_unclassified"
## [3] "Bacteroidaceae" "Bacteroidales_unclassified"
## [5] "Christensenellaceae" "Clostridiaceae_1"
## [7] "Erysipelotrichaceae" "Lachnospiraceae"
## [9] "Lactobacillaceae" "Mollicutes_RF39_fa"
## [11] "Muribaculaceae" "Other Families"
## [13] "Peptostreptococcaceae" "Prevotellaceae"
## [15] "Rikenellaceae" "Ruminococcaceae"
## [17] "Veillonellaceae"
family.colors.other <- c(family.colors, "Other Families"="#333333")
ggplot(fam.sub.melt, aes(x=Host, y=Abundance, fill=Factor.Family)) + geom_bar(stat="identity") + scale_fill_manual(values=family.colors.other, name="Family") + ylab("Mean Relative Abundance") + ggtitle("Mean Abundance") + ggtitle("Mean Relative Abundance >= 1%") + scale_x_discrete(labels=c.names)+ guides(fill=guide_legend(reverse=TRUE, ncol=2)) + coord_flip()
I personally prefer to use heat maps for several reasons. Distinguishing more than 12 colors at a time can be difficult; putting taxon names on the y axis frees you up to include many more taxa on one graph. Using one color map for abundance also allows you to choose different transformations of abundance that don’t necessarily sum to 1.
You can make a heatmap using relative abundance:
ggplot(fam.avgs[which(fam.avgs$Family %in% sum.88.percent),], aes(x=Host, y=Family, fill=Abundance)) + geom_tile() + scale_fill_viridis(option="magma") + ylab("Mean Relative Abundance") + ggtitle("Mean Abundance") + ggtitle("Sum Relative Abundance >= 10%")
But it will be easier to interpret if you use a centered log ratio transform of your dataframe.
You’ll see all the benefits of a clr transform if you use your unscaled data.
fam.phylo <- tax_glom(merge.phylo.unscaled, taxrank="Family")
fam.clr.table <- data.frame(transform.clr(data.frame(otu_table(fam.phylo))))
## No. corrected values: 18258
fam.clr.phylo <- merge_phyloseq(otu_table(fam.clr.table, taxa_are_rows=TRUE), tax_table(fam.phylo), sample_data(fam.phylo))
fam.clr.phylo
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 571 taxa and 209 samples ]
## sample_data() Sample Data: [ 209 samples by 5 sample variables ]
## tax_table() Taxonomy Table: [ 571 taxa by 6 taxonomic ranks ]
fam.melt <- psmelt(fam.clr.phylo)
First notice how you can list all samples more easily with this layout. Organize by facet panels.
ggplot(fam.melt, aes(x=Sample, y=Family, fill=Abundance)) + geom_tile() + scale_fill_gradient2(high=muted("red"), low=muted("blue")) + facet_wrap(~Host, scales="free_x", nrow=1)
It’s still clearly important to filter to families of interest.
Luckily, subcompositions of clr transformed values are stable, meaning that the relative proportional abundance of two taxa are the same no matter what subset you make.
We can similarly decide to keep only the taxa that were common across all taxa. Sums of CLR values really don’t make sense to use, but we can take the means instead.
clr.avgs <- fam.melt %>% group_by(Family) %>% summarise(Abundance=mean(Abundance))
## `summarise()` ungrouping output (override with `.groups` argument)
abundant.families <- data.frame(clr.avgs %>% filter(Abundance > 2))
abundant.families
ggplot(fam.melt[which(fam.melt$Family %in% abundant.families$Family),], aes(x=Sample, y=Family, fill=Abundance)) + geom_tile() + scale_fill_gradient2(high=muted("red"), low=muted("blue")) + facet_wrap(~Host, scales="free_x", nrow=1) + theme(axis.text.x=element_blank())
You can calculate average values for groups of samples in the same way. When we did this for relative abundance values, you might have notcied that doing it the incorrect way arrived at somewhat similar results if the bars were stretched… the same is not true for proportional abundance! That’s because adding together the log transformed values is not the same as averaging.
group.clr.avgs <- fam.melt %>% group_by(Host, Family) %>% summarise(Abundance=mean(Abundance))
## `summarise()` regrouping output by 'Host' (override with `.groups` argument)
group.clr.avgs <- data.frame(group.clr.avgs)
ggplot(group.clr.avgs[which(group.clr.avgs$Family %in% abundant.families$Family),], aes(x=Host, y=Family, fill=Abundance)) + geom_tile() + scale_fill_gradient2(high=muted("red"), low=muted("blue"))
Some of these familiesare rare and not likely to be recognized by the average reader. You can group them by phylum for easier interpretation.
fam.melt$Phy.Family <- paste(substr(fam.melt$Phylum, start=1, stop=3), fam.melt$Family, sep=": ")
group.clr.avgs <- fam.melt %>% group_by(Host, Phy.Family, Family) %>% summarise(Abundance=mean(Abundance))
## `summarise()` regrouping output by 'Host', 'Phy.Family' (override with `.groups` argument)
group.clr.avgs <- data.frame(group.clr.avgs)
ggplot(group.clr.avgs[which(group.clr.avgs$Family %in% abundant.families$Family),], aes(x=Host, y=Phy.Family, fill=Abundance)) + geom_tile() + scale_fill_gradient2(high=muted("red"), low=muted("blue"))
From a descriptive point of view, the simplest way to characterize the major taxa for your samples is just to look in detail at the same dataframe you use to make graphics.
group.clr.avgs
This dataframe tells you what the most common taxa are and what their proportional abundance is. You could also use the relative abundance dataframe.
fam.avgs %>% group_by(Host) %>% filter(Abundance > 0.02)
Stacked bar charts and heatmaps provide great jumping off points for discussion, but you cannot draw conclusions from describing these alone. Bar charts are visual representations of point estimates. They do not represent any variance in your data if you’re displaying group means. To represent variance between samples in the same category, you could show boxplots, but even at the phylum levels that can be visually overwhelming. What you really want are statistical tests.
If you want to make statistically valid comparisons between sample groups, I suggest using the ALDEx2 package. This package is designed to accomodate many of the unique qualities of enoromous compositional datasets. This package includes an implementation of the t-test based on the dirichlet distribution that CLR transformed abundance values tend to follow, as well as support for a GLM for more complex hypotheses.
First the simple case of two categories. You can also use this strategy to make pairwise comparisons between groups.
Let’s do an example with a more simple version of this dataset that includes only cats and dogs… to revisit an old rivalry.
catdog <- prune_samples(sample_data(merge.phylo.unscaled)$source_category %in% c("dog", "cat"), merge.phylo.unscaled) %>% prune_taxa(taxa_sums(.) > 0,.)
catdog
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1384 taxa and 31 samples ]
## sample_data() Sample Data: [ 31 samples by 5 sample variables ]
## tax_table() Taxonomy Table: [ 1384 taxa by 6 taxonomic ranks ]
otu.table <- data.frame(otu_table(catdog))
tax.table <- data.frame(tax_table(catdog))
mdf<- data.frame(sample_data(catdog))
mdf %>% group_by(Host) %>% tally()
You can compare the compositions of these groups at any taxonomic level that you want. However, t-test p-values must be corrected for multiple hypotheses and so become more stringent the more taxa you include. For this reason, differences will be more apparent at middle taxonomy levels, where there are not thousands of comparisons to correct for!
We’ll try an OTU version anyway.
First make a vector of the sample IDs to compare.
compare.samples <- as.vector(row.names(mdf))
This comparison again will work better with fewer hypothesis corrections, so let’s drop very uncommon records.
d.1 <- otu.table[which(rowSums(otu.table) > 10),]
dim(otu.table)
## [1] 1384 31
dim(d.1)
## [1] 527 31
aldex.in <- data.frame(d.1,
check.names=F)
conditions <- mdf[compare.samples,"source_category"]
ALDEx2 uses a vector of conditions that is one character long. If the names in the vector are more than one character long, then they’ll be truncated to the first level. That will work out fine in this case since ‘dog’ will become ‘d’ and ‘cat’ will become ‘c’, but sometimes you will need to fix it. Like if we were comparing ‘cheetah’ and ‘cat’.
Here’s where we run it. ALDEx2 makes a clr transform internally… for detail on how all this work, please see Dr. Greg Gloor’s documentation on his https://bioconductor.org/packages/release/bioc/vignettes/ALDEx2/inst/doc/ALDEx2_vignette.pdf and his papers.
x.d.c <- aldex.clr(aldex.in, conditions, mc.samples=128, verbose=TRUE)
## operating in serial mode
## removed rows with sums equal to zero
## computing center with all features
## data format is OK
## dirichlet samples complete
## transformation complete
Run t-tests on all OTUs.
x.tt.d.c <- aldex.ttest(x.d.c, conditions, paired.test=FALSE)
## Warning in if (hist.plot == TRUE) {: the condition has length > 1 and only the
## first element will be used
Calculate the difference in effect size for both groups. This is Cohen’s d and is a more stringent test than corrected p-values since it is based more on the magnitude of difference. Cohen’s \(\delta =\frac{\mu_1 - \mu_2}{s}\) Values range between $-,+$. There must be a large difference in abundance, and the OTUs must be consistently abundant in order to have a large effect size.
x.effect.d.c <- aldex.effect(x.d.c, conditions, include.sample.summary=FALSE, verbose=TRUE)
## Warning in if ("BiocParallel" %in% rownames(installed.packages()) & useMC == :
## the condition has length > 1 and only the first element will be used
## operating in serial mode
## sanity check complete
## rab.all complete
## rab.win complete
## rab of samples complete
## within sample difference calculated
## between group difference calculated
## group summaries calculated
## effect size calculated
## summarizing output
Combine datasets.
x.all.d.c <- data.frame(x.tt.d.c, x.effect.d.c, stringsAsFactors=FALSE)
ALDEx2 has some nice plotting functions.
aldex.plot(x.all.d.c, type="MW", test="welch")
This chart highlights OTUs with Benjamin-Hochberg corrected p-values < 0.05 in red. Gray dots are highly abundant, but not differentially. Black dots are rare taxa that have similar proportions between sample types. I use charts like these to decide whether or not to move further with the comparison. If there are no red dots, and most are in the interior of the dotted line cone, then it’s probably not worth doing the next steps.
sig.records <- rownames(x.all.d.c[which(x.all.d.c$we.eBH < 0.05 & abs(x.all.d.c$effect) >= 1),])
dim(tax.table[sig.records,])
## [1] 17 6
x.all.d.c[sig.records,]
tax.table[sig.records,]
29 OTUs had a corrected p-value < 0.05, adn 12 had effect sizes with absolute values > 1. P{ositive differences correspond to OTUs more abundant in dogs, and negative differences correspond to OTUs more abundant in cats.
As I mentioned, this is a very stringent test and it’s not too surprising that despite many significantly different test results, we found few with large effect sizes. Let’s try to reduce the number of competing hypotheses by agglomerating OTUs at the family level. You can think of this as putting better numbers on the stacked bar charts we made earlier.
Just for quick reference:
theme_set(theme_classic(base_size=12, base_family="Avenir"))
ggplot(fam.sub.melt[which(fam.sub.melt$Host %in% c("Felis silvestris", "Canis lupus")),], aes(x=Host, y=Abundance, fill=Factor.Family)) + geom_bar(stat="identity") + scale_fill_manual(values=family.colors.other, name="Family") + ylab("Mean Relative Abundance") + ggtitle("Mean Abundance") + ggtitle("Mean Relative Abundance >= 1%") + scale_x_discrete(labels=c.names)+ guides(fill=guide_legend(reverse=FALSE, ncol=2))
Appraising this visually, it seems like there are differences, but we don’t know how consistent they are. The mean relative abundance of Lachnospiraceae seems higher in cats than it is in dogs, but we don’t know the variance.
To find out, let’s take the phyloseq catdog
object we made earlier and agglomerate by family.
catdog.family <- tax_glom(catdog, taxrank="Family")
catdog.family
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 511 taxa and 31 samples ]
## sample_data() Sample Data: [ 31 samples by 5 sample variables ]
## tax_table() Taxonomy Table: [ 511 taxa by 6 taxonomic ranks ]
This includes very rare families that don’t show up on our bar chart, but most likely, unless the differences in abundance are incredibly consistent, they won’t really show up later in our ALDEx2 plots either.
otu.table <- data.frame(otu_table(catdog.family))
tax.table <- data.frame(tax_table(catdog.family))
mdf<- data.frame(sample_data(catdog.family))
Make a vector of the sample IDs to compare.
compare.samples <- as.vector(row.names(mdf))
This comparison again will work better with fewer hypothesis corrections, so let’s drop very uncommon records. At this high taxonomic level, you might consider dropping even more. We’ll make two versions.
d.1 <- otu.table[which(rowSums(otu.table) > 10),]
d.1000 <- otu.table[which(rowSums(otu.table) > 1000),]
dim(otu.table)
## [1] 511 31
dim(d.1)
## [1] 250 31
dim(d.1000)
## [1] 73 31
This first version will include very rare families (250 families).
aldex.in <- data.frame(d.1,
check.names=F)
conditions <- mdf[compare.samples,"source_category"]
x.d.c <- aldex.clr(aldex.in, conditions, mc.samples=128, verbose=TRUE)
## operating in serial mode
## removed rows with sums equal to zero
## computing center with all features
## data format is OK
## dirichlet samples complete
## transformation complete
Run t-tests
x.tt.d.c <- aldex.ttest(x.d.c, conditions, paired.test=FALSE)
## Warning in if (hist.plot == TRUE) {: the condition has length > 1 and only the
## first element will be used
Calculate the difference in effect size for both groups.
x.effect.d.c <- aldex.effect(x.d.c, conditions, include.sample.summary=FALSE, verbose=TRUE)
## Warning in if ("BiocParallel" %in% rownames(installed.packages()) & useMC == :
## the condition has length > 1 and only the first element will be used
## operating in serial mode
## sanity check complete
## rab.all complete
## rab.win complete
## rab of samples complete
## within sample difference calculated
## between group difference calculated
## group summaries calculated
## effect size calculated
## summarizing output
Combine datasets.
x.all.d.c <- data.frame(x.tt.d.c, x.effect.d.c, stringsAsFactors=FALSE)
Plot
aldex.plot(x.all.d.c, type="MW", test="welch")
Interestingly, at the family level, it seems like there are fewer differences. This means that although dogs and cats have similar family level composition, the genus and species of Bacteria within those families are different.
At least initially here, it seems like cats have relatively more specialized gut biota than dogs.
sig.records <- rownames(x.all.d.c[which(x.all.d.c$we.eBH < 0.05 & abs(x.all.d.c$effect) >= 1),])
dim(tax.table[sig.records,])
## [1] 7 6
x.all.d.c[sig.records,]
tax.table[sig.records,]
broad.sig.df <- x.all.d.c[sig.records,]
There are 8 families with both large effect sizes and that are significantly different. Examining the column diff.btw
shows us that all of these are more abundant in cats than they are in dogs. Note the rab.all
column. You can see that the relative abundance for some of these taxa within the full cat/dog dataset is very low.
Examining only the most common taxa (73 most abundant families which had >= 1000 reads across all samples):
aldex.in <- data.frame(d.1000,
check.names=F)
conditions <- mdf[compare.samples,"source_category"]
x.d.c <- aldex.clr(aldex.in, conditions, mc.samples=128, verbose=TRUE)
## operating in serial mode
## removed rows with sums equal to zero
## computing center with all features
## data format is OK
## dirichlet samples complete
## transformation complete
x.tt.d.c <- aldex.ttest(x.d.c, conditions, paired.test=FALSE)
## Warning in if (hist.plot == TRUE) {: the condition has length > 1 and only the
## first element will be used
x.effect.d.c <- aldex.effect(x.d.c, conditions, include.sample.summary=FALSE, verbose=TRUE)
## Warning in if ("BiocParallel" %in% rownames(installed.packages()) & useMC == :
## the condition has length > 1 and only the first element will be used
## operating in serial mode
## sanity check complete
## rab.all complete
## rab.win complete
## rab of samples complete
## within sample difference calculated
## between group difference calculated
## group summaries calculated
## effect size calculated
## summarizing output
x.all.d.c <- data.frame(x.tt.d.c, x.effect.d.c, stringsAsFactors=FALSE)
Plot
aldex.plot(x.all.d.c, type="MW", test="welch")
You can see that by reducing the number of taxa, we reduced the number of competing hypotheses, so the tests returned more significant results. Dropping extremely rare taxa also made the distribution of values within each samples less skewed towards zero, which generally improves the fit to the dirichlet distribution and narrowed the pooled standard deviation for our OTUs.
sig.records <- rownames(x.all.d.c[which(x.all.d.c$we.eBH < 0.05 & abs(x.all.d.c$effect) >= 1),])
dim(tax.table[sig.records,])
## [1] 9 6
x.all.d.c[sig.records,]
tax.table[sig.records,]
Removing rare families made some of those cat-specialist families undetectable. We do now however detect some families that are relatively more abundant in dogs than in cats. Note that for some of these families, while the differential abundance (diff.btw
) is large, the relative abundance within each category is still low (less than 2).
How you threshold your otu table makes a huge difference in your conclusions, as you can see. It’s important to think about whether or not you are looking for the overal general pattern differences (putting numbers to your stacked bar charts), or if you’re looking for differences in rare taxa that are not very abundant.
Let’s look at the family names associated with these OTUs and compare to our stacked bar chart.
sig.df <- x.all.d.c[sig.records,]
row.names(sig.df) <- tax.table[row.names(sig.df),"Family"]
sig.df
row.names(broad.sig.df) <- tax.table[row.names(broad.sig.df),"Family"]
broad.sig.df
ggplot(fam.sub.melt[which(fam.sub.melt$Host %in% c("Felis silvestris", "Canis lupus")),], aes(x=Host, y=Abundance, fill=Factor.Family)) + geom_bar(stat="identity") + scale_fill_manual(values=family.colors.other, name="Family") + ylab("Mean Relative Abundance") + ggtitle("Mean Abundance") + ggtitle("Mean Relative Abundance >= 1%") + scale_x_discrete(labels=c.names)+ guides(fill=guide_legend(reverse=FALSE, ncol=2))
None of the largest, most visible bars in this chart are actually significantly different. The only ones that appear in the significant ALDEx2 results are Muribaculaceae and Rikenellaceae, which are invisible on this chart. All the others have been folded into “Other Families” because they were not highly abundant across all of the other animal feces samples.
If you want to force ALDEx2 to only consider tests between the families you’ve identified as abundant or any other arbitrary set of OTUs, you can do that. I wouldn’t recommend this technique, and here’s why:
interesting.otus <- row.names(tax.table[which(tax.table$Family %in% unique(fam.sub.melt$Family)),])
d.i <- otu.table[interesting.otus,]
dim(d.i)
## [1] 16 31
aldex.in <- data.frame(d.i,
check.names=F)
conditions <- mdf[compare.samples,"source_category"]
x.d.c <- aldex.clr(aldex.in, conditions, mc.samples=128, verbose=TRUE)
## operating in serial mode
## removed rows with sums equal to zero
## computing center with all features
## data format is OK
## dirichlet samples complete
## transformation complete
x.tt.d.c <- aldex.ttest(x.d.c, conditions, paired.test=FALSE)
## Warning in if (hist.plot == TRUE) {: the condition has length > 1 and only the
## first element will be used
x.effect.d.c <- aldex.effect(x.d.c, conditions, include.sample.summary=FALSE, verbose=TRUE)
## Warning in if ("BiocParallel" %in% rownames(installed.packages()) & useMC == :
## the condition has length > 1 and only the first element will be used
## operating in serial mode
## sanity check complete
## rab.all complete
## rab.win complete
## rab of samples complete
## within sample difference calculated
## between group difference calculated
## group summaries calculated
## effect size calculated
## summarizing output
x.all.d.c <- data.frame(x.tt.d.c, x.effect.d.c, stringsAsFactors=FALSE)
aldex.plot(x.all.d.c, type="MW", test="welch")
It doesn’t really change much.
See that although these differences are significant after correction, the effect sizes are small. Without the context of other less abundant taxa, even taxa that we identified as relatively more abundant have smaller effect sizes.
row.names(x.all.d.c) <- tax.table[row.names(x.all.d.c),"Family"]
x.all.d.c
Reassure yourself that this is because of high variance that is not apparent in a bar chart. Make a boxplot using the not averaged values.
First look at like the stacked bar chart, in terms of relative abundance:
fam.rel.phylo <- tax_glom(merge.phylo, taxrank="Family") %>% transform_sample_counts(function(x) x/sum(x))
fam.rel.melt <- psmelt(fam.rel.phylo)
theme_set(theme_classic(base_size=8, base_family="Avenir"))
ggplot(fam.rel.melt[which(fam.rel.melt$source_category %in% c("cat", "dog") & fam.rel.melt$Family %in% unique(fam.sub.melt$Family)),], aes(x=Family, y=Abundance, fill=source_category)) + geom_boxplot() + facet_wrap(~Phylum, scales="free_x") + ylab("Relative Abundance") + theme(axis.text.x = element_text(angle=45, hjust=1))
Now look at it like ALDEx2 is looking at it, with proportional abundance:
theme_set(theme_classic(base_size=8, base_family="Avenir"))
ggplot(fam.melt[which(fam.melt$source_category %in% c("cat", "dog") & fam.melt$Family %in% unique(fam.sub.melt$Family)),], aes(x=Family, y=Abundance, fill=source_category)) + geom_boxplot() + facet_wrap(~Phylum, scales="free_x") + ylab("Proportional Abundance") + theme(axis.text.x = element_text(angle=45, hjust=1))
These values are not really that different, except for Muribaculaceae and maybe a few others which again, are relatively uncommon.
While their digestive systems use different enzymes in the stomach, the overall structure and physiology of the cat and dog GI system is actually not that different.
Cat | Dog |
---|---|
These and other pictures of GI tracts are modified slightly from (Stevens and Hume 2004)
Lest you be totally discourage by this whole process, know that ALDEx2 sometimes can confirm patterns you might see in your bar charts. Let’s look at two animals which are much more different from one another. Dogs and Turkeys.
First a cursory boxplot and stacked barchart examination.
theme_set(theme_classic(base_size=12, base_family="Avenir"))
ggplot(fam.sub.melt[which(fam.sub.melt$Host %in% c("Meleagris gallopavo", "Canis lupus")),], aes(x=Host, y=Abundance, fill=Factor.Family)) + geom_bar(stat="identity") + scale_fill_manual(values=family.colors.other, name="Family") + ylab("Mean Relative Abundance") + ggtitle("Mean Abundance") + ggtitle("Mean Relative Abundance >= 1%") + scale_x_discrete(labels=c.names)+ guides(fill=guide_legend(reverse=FALSE, ncol=2))
These definitely look more different. But what about the variance?
theme_set(theme_classic(base_size=8, base_family="Avenir"))
ggplot(fam.melt[which(fam.melt$source_category %in% c("turkey", "dog") & fam.melt$Family %in% unique(fam.sub.melt$Family)),], aes(x=Family, y=Abundance, fill=source_category)) + geom_boxplot() + facet_wrap(~Phylum, scales="free_x") + ylab("Proportional Abundance") + theme(axis.text.x = element_text(angle=45, hjust=1))
It definitely seems more likely that there will be real differences. Let’s look at abundant family differences only, for brevity’s sake.
turkeydog <- prune_samples(sample_data(merge.phylo.unscaled)$source_category %in% c("dog", "turkey"), merge.phylo.unscaled) %>% prune_taxa(taxa_sums(.) > 0,.)
turkeydog.family <- tax_glom(turkeydog, taxrank="Family")
otu.table <- data.frame(otu_table(turkeydog.family))
tax.table <- data.frame(tax_table(turkeydog.family))
mdf<- data.frame(sample_data(turkeydog.family))
mdf %>% group_by(Host) %>% tally()
d.1000 <- otu.table[which(rowSums(otu.table) > 10),]
aldex.in <- data.frame(d.1000,
check.names=F)
conditions <- mdf[,"source_category"]
x.d.c <- aldex.clr(aldex.in, conditions, mc.samples=128, verbose=TRUE)
## operating in serial mode
## removed rows with sums equal to zero
## computing center with all features
## data format is OK
## dirichlet samples complete
## transformation complete
x.tt.d.c <- aldex.ttest(x.d.c, conditions, paired.test=FALSE)
## Warning in if (hist.plot == TRUE) {: the condition has length > 1 and only the
## first element will be used
x.effect.d.c <- aldex.effect(x.d.c, conditions, include.sample.summary=FALSE, verbose=TRUE)
## Warning in if ("BiocParallel" %in% rownames(installed.packages()) & useMC == :
## the condition has length > 1 and only the first element will be used
## operating in serial mode
## sanity check complete
## rab.all complete
## rab.win complete
## rab of samples complete
## within sample difference calculated
## between group difference calculated
## group summaries calculated
## effect size calculated
## summarizing output
x.all.d.c <- data.frame(x.tt.d.c, x.effect.d.c, stringsAsFactors=FALSE)
aldex.plot(x.all.d.c, type="MW", test="welch")
sig.records <- rownames(x.all.d.c[which(x.all.d.c$we.eBH < 0.05 & abs(x.all.d.c$effect) >= 1),])
dim(tax.table[sig.records,])
## [1] 13 6
x.all.d.c[sig.records,]
tax.table[sig.records,]
sig.df <- x.all.d.c[sig.records,]
row.names(sig.df) <- tax.table[row.names(sig.df),"Family"]
sig.df
theme_set(theme_classic(base_size=8, base_family="Avenir"))
ggplot(fam.melt[which(fam.melt$source_category %in% c("turkey", "dog") & fam.melt$Family %in% unique(fam.sub.melt$Family)),], aes(x=Family, y=Abundance, fill=source_category)) + geom_boxplot() + facet_wrap(~Phylum, scales="free_x") + ylab("Proportional Abundance") + theme(axis.text.x = element_text(angle=45, hjust=1))
Indeed, Lachnospiraceae is differentially abundant between dogs and turkeys. However, despite the seemingly enormous difference in Lactobacillaceae and Peptotreptococcaceae, these families do not have large enough effect sizes. That is more obvious if you look at the box plots of proportional abundance.
theme_set(theme_classic(base_size=12, base_family="Avenir"))
ggplot(fam.sub.melt[which(fam.sub.melt$Host %in% c("Meleagris gallopavo", "Canis lupus")),], aes(x=Host, y=Abundance, fill=Factor.Family)) + geom_bar(stat="identity") + scale_fill_manual(values=family.colors.other, name="Family") + ylab("Mean Relative Abundance") + ggtitle("Mean Abundance") + ggtitle("Mean Relative Abundance >= 1%") + scale_x_discrete(labels=c.names)+ guides(fill=guide_legend(reverse=FALSE, ncol=2))
Let’s examine a case of three animals with very different digestive systems. Cows are foregut fermenters that have rumens, Geese are birds with very simplified gut systems lacking differentiated colons but containing two paired ceca, and pigs (like humans) are hindgut fermenters with highly developed colons.
Check it out:
Cow | Goose | Pig |
---|---|---|
We can use ALDEx2 to create a generalized linear model of how the composition of the gut depends on host species. As in the case of the t-test, the ALDEx2 implementation is preferable because the underlying probability distribution that the package uses is the Dirichlet.
We need to first make a phyloseq subset of just the samples we care about. Let’s start with family level differences.
glm.phylo <- prune_samples(sample_data(merge.phylo.unscaled)$source_category %in% c("canadagoose", "swine", "dairycow"), merge.phylo.unscaled) %>% prune_taxa(taxa_sums(.) > 0,.)
glm.phylo
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 1948 taxa and 64 samples ]
## sample_data() Sample Data: [ 64 samples by 5 sample variables ]
## tax_table() Taxonomy Table: [ 1948 taxa by 6 taxonomic ranks ]
Agglomerate by family.
glm.phylo.fam <- tax_glom(glm.phylo, taxrank="Family")
otu.table <- data.frame(otu_table(glm.phylo.fam))
tax.table <- data.frame(tax_table(glm.phylo.fam))
mdf <- data.frame(sample_data(glm.phylo.fam))
mdf %>% group_by(source_category) %>% tally()
This implementation of GLM requires that you supply a model matrix. There are many ways to construct this object. Here’s how I do it.
guts <- factor(mdf$source_category)
#Encode the levels of 'guts' as single characters to help ALDEx2 later.
levels(guts) <- c("G", "C", "P")
glm.matrix <- model.matrix(~ guts)
glm.matrix[1:4,]
## (Intercept) gutsC gutsP
## 1 1 0 0
## 2 1 0 0
## 3 1 0 0
## 4 1 0 0
The model matrix has an intercept column and two other binary columns representing membership in the factor levels of mdf$source_category
. There is no column for canadagoose, so rows where gutsdairycow
and gutsswine
are both 0 are canadagoose samples.
You should still filter your otu table, just as we did for the t-tests.
d.1 <- otu.table[which(rowSums(otu.table) > 100),]
dim(otu.table)
## [1] 646 64
dim(d.1)
## [1] 266 64
aldex.in <- data.frame(d.1, check.names=F)
x <- aldex.clr(aldex.in, glm.matrix, mc.samples=128, verbose=TRUE)
x.glm.test <- aldex.glm(x, guts)
x.glm.test
Currently, the aldex2 package doesn’t support effect size testing for more than 2 groups, but you can always run the pairwise version of this test. For now we’ll just look at the GLM results.
You can pull out taxa which had both a positive Coefficient and a BH corrected p-value < 0.05 or 0.01 for each category. The reference point in this analysis is the Canada Goose, so its coefficient column is the intercept column.
sig.dairycow <- x.glm.test[which(x.glm.test$`model.gutsC Pr(>|t|).BH` < 0.05 & x.glm.test$`model.gutsC Estimate` > 0),]
sig.swine <- x.glm.test[which(x.glm.test$`model.gutsP Pr(>|t|).BH` < 0.05 & x.glm.test$`model.gutsP Estimate` > 0),]
sig.goose <- x.glm.test[which(x.glm.test$`(Intercept) Pr(>|t|).BH` < 0.05 & x.glm.test$`(Intercept) Estimate` > 0),]
dim(sig.dairycow)
## [1] 29 15
dim(sig.swine)
## [1] 31 15
dim(sig.goose)
## [1] 46 15
You can see that relative to the t-tests, the GLM identified many more significant taxa.
Let’s check out what taxa were identified as positively associated with dairy cows.
sort(tax.table[row.names(sig.dairycow),"Family"])
## [1] "Acetobacteraceae" "Acidaminococcaceae"
## [3] "Actinobacteria_unclassified" "Anaeroplasmataceae"
## [5] "Atopobiaceae" "Bacteria_unclassified"
## [7] "Bacteroidaceae" "Bacteroidales_RF16_group"
## [9] "Bacteroidales_unclassified" "Bacteroidia_unclassified"
## [11] "Beijerinckiaceae" "Bifidobacteriaceae"
## [13] "Burkholderiaceae" "Christensenellaceae"
## [15] "Chthoniobacteraceae" "Clostridiales_unclassified"
## [17] "Desulfovibrionaceae" "Family_XIII"
## [19] "Gastranaerophilales_fa" "Lachnospiraceae"
## [21] "Lactobacillaceae" "Mycoplasmataceae"
## [23] "Paludibacteraceae" "Pirellulaceae"
## [25] "Prevotellaceae" "Rikenellaceae"
## [27] "Ruminococcaceae" "Saccharimonadaceae"
## [29] "Streptococcaceae"
ggplot(fam.sub.melt[which(fam.sub.melt$Host %in% c("Bos taurus", "Sus sp.", "Branta sp.")),], aes(x=Host, y=Abundance, fill=Factor.Family)) + geom_bar(stat="identity") + scale_fill_manual(values=family.colors.other, name="Family") + ylab("Mean Relative Abundance") + ggtitle("Mean Abundance") + ggtitle("Mean Relative Abundance >= 1%") + scale_x_discrete(labels=c.names)+ guides(fill=guide_legend(reverse=FALSE, ncol=2))
As you can see many of the taxa identified in the bar chart appear in the list. But now look at the Goose list.
sort(tax.table[row.names(sig.goose),"Family"])
## [1] "Aeromonadaceae" "Bacillaceae"
## [3] "Bacillales_unclassified" "Bacilli_unclassified"
## [5] "Bacteria_unclassified" "Bacteroidaceae"
## [7] "Bacteroidales_unclassified" "Beijerinckiaceae"
## [9] "Burkholderiaceae" "Carnobacteriaceae"
## [11] "Caulobacteraceae" "Chitinophagaceae"
## [13] "Clostridiaceae_1" "Clostridiales_unclassified"
## [15] "Enterobacteriaceae" "Enterococcaceae"
## [17] "Erysipelotrichaceae" "Firmicutes_unclassified"
## [19] "Flavobacteriaceae" "Gammaproteobacteria_unclassified"
## [21] "Intrasporangiaceae" "Lachnospiraceae"
## [23] "Lactobacillaceae" "Lactobacillales_unclassified"
## [25] "Microbacteriaceae" "Micrococcaceae"
## [27] "Moraxellaceae" "Nocardioidaceae"
## [29] "Paenibacillaceae" "Peptostreptococcaceae"
## [31] "Pirellulaceae" "Planococcaceae"
## [33] "Prevotellaceae" "Pseudomonadaceae"
## [35] "Rhizobiaceae" "Rhodobacteraceae"
## [37] "Rhodocyclaceae" "Rikenellaceae"
## [39] "Ruminococcaceae" "Sphingomonadaceae"
## [41] "Spirosomaceae" "Sporichthyaceae"
## [43] "Streptococcaceae" "Veillonellaceae"
## [45] "Weeksellaceae" "Xanthomonadaceae"
There’s a big overlap in the taxa identified. That’s because the GLM didn’t identify taxa that are particularly unique to each category, just those which are abundant within it. This is sort of useful, but does a different thing than the t-tests did. It tells us what the most abundant taxa are within each category, but it doesn’t tell us whether or not the abundance of a given taxa is significantly different between animal hosts.
Also note that we didn’t make any measure of effect size, so we’re not able to qualify these differences like we were before.
Essentially what we want is an ANOVA!
Warning this is sort of experimental
Pick an arbitrary taxon of interest. Let’s try Lactobacillaceae because it’s biologically meaningful and looks like it might be different between our groups.
taxon <- row.names(tax.table[which(tax.table$Family=="Lactobacillaceae"),])
Make a CLR transformed table.
clr.table <- transform.clr(otu.table)
## No. corrected values: 2148
clr.table[taxon,]
## SRR6150541 SRR6150544 SRR6150545 SRR6150548 SRR6150549 SRR6150574 SRR6150577
## 4.084978 1.809646 3.122260 4.048468 5.327935 4.068716 4.775400
## SRR6150644 SRR6150658 SRR6150659 SRR6150660 SRR6150667 SRR6150695 SRR6150732
## 1.968394 10.159773 8.661907 8.814762 3.686408 8.034258 2.102417
## SRR6150733 SRR6150734 SRR6150735 SRR6150736 SRR6150737 SRR6150739 SRR6150741
## 2.785979 3.091119 1.823360 1.854571 1.410649 1.933819 4.371574
## SRR6150540 SRR6150542 SRR6150543 SRR6150546 SRR6150547 SRR6150570 SRR6150575
## 5.890258 7.174571 4.919733 6.453046 6.775714 5.215031 4.816612
## SRR6150576 SRR6150632 SRR6150634 SRR6150635 SRR6150636 SRR6150637 SRR6150638
## 5.422076 7.586613 6.219386 7.371781 6.564814 6.918272 6.866754
## SRR6150645 SRR6150646 SRR6150655 SRR6150656 SRR6150657 SRR6150661 SRR6150662
## 7.338880 7.898083 6.870178 6.934514 7.031687 7.445228 7.568251
## SRR6150679 SRR6150680 SRR6150693 SRR6150694 SRR6150696 SRR6150697 SRR6150698
## 7.515694 6.466796 7.142267 7.879519 7.658930 7.233414 7.513896
## SRR6150699 SRR6150700 SRR6150701 SRR6150702 SRR6150729 SRR6150738 SRR6150740
## 7.198939 7.139563 7.730142 6.413944 7.078331 4.937797 4.349034
## SRR6150742 SRR6150744 SRR6150745 SRR6150746 SRR6150747 SRR6150748 SRR6150749
## 6.753945 6.444539 6.668511 6.530274 7.065614 6.554850 6.415816
## SRR6150750
## 7.039510
Really, your values should follow a normal distribution to use an ANOVA, and these probably do not.
hist(clr.table[taxon,])
But the ANOVA is in practice very robust to non-normal distributions.
anova.dat <- data.frame(row.names=colnames(clr.table), abundance=clr.table[taxon,], host=mdf$source_category)
anova.dat
taxon.anova <- aov(abundance~host, data=anova.dat)
summary(taxon.anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## host 2 146.5 73.23 39.74 8.93e-12 ***
## Residuals 61 112.4 1.84
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This test shows that indeed there is a difference in the amount of Lactobacillaceae. Always follow up with pairwise post-hoc tests.
TukeyHSD(taxon.anova)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = abundance ~ host, data = anova.dat)
##
## $host
## diff lwr upr p adj
## dairycow-canadagoose 2.350790 1.3854956 3.316085 0.0000006
## swine-canadagoose 3.612045 2.6039800 4.620110 0.0000000
## swine-dairycow 1.261255 0.2137718 2.308738 0.0144359
This shows us that the most significant difference is between canada geese and the other two animals.
Sanity check by making a boxplot.
ggplot(anova.dat, aes(x=host, y=abundance)) + geom_boxplot(fill="gray")
You can wrap all of this into a convenient function.
testTaxon <- function(taxrank, taxonname, clrtable, taxtable, metadata, predictor){
taxon <- row.names(tax.table[which(tax.table[,taxrank]==taxonname),])
if(length(taxon) == 0){
print("Taxon not found")
return(NULL)
}
hist(clrtable[taxon,])
anova.dat <- data.frame(row.names=colnames(clrtable), abundance=clrtable[taxon,], predictor=metadata[,predictor])
taxon.anova <- aov(abundance~predictor, data=anova.dat)
print(summary(taxon.anova))
print(TukeyHSD(taxon.anova))
print(ggplot(anova.dat, aes(x=predictor, y=abundance)) + geom_violin(fill="gray") + ggtitle(taxonname) + xlab(predictor))
ggplot(anova.dat, aes(x=predictor, y=abundance)) + geom_boxplot(fill="gray") + ggtitle(taxonname) + xlab(predictor)
}
testTaxon("Family", "Lactobacillaceae", clr.table, tax.table, mdf, "source_category")
## Df Sum Sq Mean Sq F value Pr(>F)
## predictor 2 146.5 73.23 39.74 8.93e-12 ***
## Residuals 61 112.4 1.84
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = abundance ~ predictor, data = anova.dat)
##
## $predictor
## diff lwr upr p adj
## dairycow-canadagoose 2.350790 1.3854956 3.316085 0.0000006
## swine-canadagoose 3.612045 2.6039800 4.620110 0.0000000
## swine-dairycow 1.261255 0.2137718 2.308738 0.0144359
Try it out with another that looks less promising
testTaxon("Family", "Ruminococcaceae", clr.table, tax.table, mdf, "source_category")
## Df Sum Sq Mean Sq F value Pr(>F)
## predictor 2 266.2 133.10 38.13 1.81e-11 ***
## Residuals 61 213.0 3.49
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = abundance ~ predictor, data = anova.dat)
##
## $predictor
## diff lwr upr p adj
## dairycow-canadagoose 4.3531745 3.024583 5.681766 0.000000
## swine-canadagoose 3.9516667 2.564208 5.339126 0.000000
## swine-dairycow -0.4015078 -1.843220 1.040204 0.782292
You can see several things from this chart. 1) The distributions here are very different, and more than that, they have extremely different variances. This makes an ANOVA an unreliable choice. Unfortunately, most non-parametric tests also operate under the assumption that your distributions are in the same family. 2) The ANOVA seems unreasonably sensitive to the differences here. 3) Once again, the stacked bar chart was misleading.0
Brown, Clairessa M, Christopher Staley, Ping Wang, Brent Dalzell, Chan Lan Chun, and Michael J Sadowsky. 2017. “A High-Throughput Dna-Sequencing Approach for Determining Sources of Fecal Bacteria in a Lake Superior Estuary.” Environmental Science & Technology 51 (15): 8263–71.
Schloss, Patrick D, Sarah L Westcott, Thomas Ryabin, Justine R Hall, Martin Hartmann, Emily B Hollister, Ryan A Lesniewski, et al. 2009. “Introducing Mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities.” Applied and Environmental Microbiology 75 (23): 7537–41.
Stevens, C Edward, and Ian D Hume. 2004. Comparative Physiology of the Vertebrate Digestive System. Cambridge University Press.