Increasingly there is interest in studying the microbiome from environments that contain few microbes (low microbial biomass). 16S rRNA gene sequencing allows researchers to study microbes from environments in a high throughput culture independent manner, but recent research has highlighted that this technique is prone to contaminants when applied to samples originating from low microbial biomass environments.
We recently performed a study to understand the impact of decreasing microbial biomass on 16S rRNA gene sequencing experiments and evaluate the current computational strategies to control for contaminants (preprint available here).
Here, we provide the compiled R Markdown document to reproduce the analysis presented in the manuscript, divided into 6 primary sections:
Here, we provide the compiled R Markdown document to reproduce the initial analysis of the mock microbial community without removing contaminants (referred to as no filter), as well as the ideal situation where all contaminants are removed (referred to as expected results). We also evaluate using the negative control as a filter, as well as a series of relative abundance filters as approached for removing contaminants from 16S rRNA gene sequencing experiments.
For more information about the experimental design and preprocessing of the data, see the Introduction and evaluating contaminants section. For more information on other methods to remove contaminants, see the relevant link above.
This analysis uses many open source packages that are available on CRAN and Bioconductor.
To run this analysis yourself, download the supplemental material here along with the R markdown file. Save these files in the same directory on your computer.
# load libraries
library(phyloseq)
library(ggplot2)
library(tidyverse)
library(RColorBrewer)
library(reshape2)
library(dplyr)
library(knitr)
options(digits=2)
library(kableExtra)
library(gridExtra)
library(ggpubr)
# save session info (packages and versions loaded)
session <- sessionInfo()
We created a function called expCompBarPlot to display the expected mock microbial ASVs in color and unexpected contaminants sequences in grayscale.
# Create function to plot bar plots with contaminants in grey scale and expected mock microbial sequences in color
# Create function to plot bar plots with contaminants in grey scale and expected mock microbial sequences in color
expCompBarPlot <- function(physeq, exp_taxa, title){
## physeq - phyloseq object that will be plotted
## exp_taxa - taxa that are expected to be in the mock community
## title - title for plot
#set up data_table
data_table <- as.data.frame(t(physeq@otu_table))
data_table$reference = FALSE
data_table$reference[rownames(data_table) %in% exp_taxa] = TRUE
sample_names <- sample_names(physeq)
data_table$id <- paste0('ASV_', 1:nrow(data_table))
dilution_labels <- sample_data(physeq)$Dilutions
set.seed(444)
# define the colors to use for reference and non-reference OTUs/ASVs
ref_colors <- brewer.pal(sum(data_table$reference), "Paired")
other_colors <- sample(grey.colors(5, start = 0.5, end = 0.9), sum(!data_table$reference), replace = TRUE)
# add a color variable to the data table
data_table$color <- rep(NA, nrow(data_table))
data_table$color[data_table$reference] <- ref_colors
data_table$color[!data_table$reference] <- other_colors
# reshape the data table into a ggplot-friendly format, by gathering samples into a single column called "count"
color_gg <- data_table %>% select(id, sample_names, color) %>% gather("sample", "count", sample_names)
legend_color <- c(bright = ref_colors[2], dull = other_colors[2])
data_gg <- data_table %>% gather("sample", "count", sample_names)
data_gg <- inner_join(data_gg,color_gg)
# create the composition bar plot
comp_bar <- ggplot(data_gg, aes(x = sample, y = count)) +
geom_col(aes(fill = color, group = reference, alpha = ifelse(reference, "bright", "dull")), width = 0.7, position = position_fill()) +
scale_fill_identity(guide = FALSE) +
scale_alpha_manual(name = "Sequence type",
labels = c("expected", "other"),
values = c(bright = 1, dull = 1),
guide = guide_legend(override.aes = list(fill = c(ref_colors[4], "#AEAEAE")),
keywidth = NULL, keyheight = NULL)) +
labs(title = title, x = "Sample", y = "Relative Abundance") +
theme(legend.position = "right", legend.title = element_text(size = 10),
axis.text = element_text(size = 10),
axis.title = element_text(size = 12),
plot.title = element_text(size = 12))
return(comp_bar)
}
# Define functions to evaluate filter performance
eval_filter <- function(physeq, mock_taxa, recovered_otus, removed_otus, filter_method){
# % of mock community ASVs correctly classified as mock community ASVs
true_neg <- rowSums(recovered_otus[,colnames(recovered_otus) %in% mock_taxa])
# % of mock community incorrectly classified as mock community ASVs
false_neg <- rowSums(recovered_otus[,!colnames(recovered_otus) %in% mock_taxa])
# identify non-mock community ASVs correctly classified as not belonging to mock community
true_pos <- rowSums(removed_otus[,!colnames(removed_otus) %in% mock_taxa])
# identify mock community ASVs incorrectly classified as not belonging to mock community
false_pos <- rowSums(removed_otus[,colnames(removed_otus) %in% mock_taxa])
profile <- rbind(false_neg, false_pos,true_neg,true_pos)
long_profile <- melt(data = profile,
id.vars = rownames(),
variable.name = colnames(),
value.name = "Abundance"
)
names(long_profile)[names(long_profile)=="Var1"] <- "SequenceClass"
customPalette <- c('#969696','#bdbdbd', '#1B9E77', '#D95F02')
# Figures
classificationPlot <- ggplot(long_profile, aes(x = Var2, y = Abundance)) +
geom_col(aes(fill = SequenceClass), width = 0.7, position = position_fill()) +
scale_fill_manual(values=customPalette) + theme(text = element_text(size=12)) +
labs(x = "Sample", y = 'Proportion of Reads') +
ggtitle(paste0('Sequence classification for \n ', filter_method)) +
theme(legend.position = "right", legend.title = element_text(size = 12),
axis.text = element_text(size = 12),
axis.title = element_text(size = 12),
plot.title = element_text(size = 12))
compositionPlot <- expCompBarPlot(physeq,mock_taxa,paste0('Recovered profile after \n ',filter_method))
# plot figures
grid.arrange(compositionPlot[[1]], compositionPlot[[2]],classificationPlot, ncol = 3)
}
filter_results <- function(physeq_original, physeq_filter, physeq_removed, mock_taxa,filter_method){
## physeq_original <- the original phyloseq object containing actual counts
## physeq_filter <- the filtered phyloseq object containing actual counts
## physeq_removed <- a phyloseq object of removed actual counts
## mock_taxa <- list of mock taxa ASV sequences
# extract out original normalized otu table
physeq_norm <- transform_sample_counts(physeq_original,function(x) 100* x/sum(x))
original_otus <- as.matrix(as.data.frame(physeq_original@otu_table))
# identify the orginal proportion of contaminants
contaminants_original <- rowSums(original_otus[,!colnames(original_otus) %in% mock_taxa])
# identify the orginal proportion of mock community ASVs
mock_original <- rowSums(original_otus[,colnames(original_otus) %in% mock_taxa])
# Normalize the filtered physloseq object to relative abundance (each sample sums to 100)
physeq_filter_norm <- transform_sample_counts(physeq_filter,function(x) 100* x/sum(x))
# subset out the otu table of recovered otus (otus that are kept)
recovered_otus <- as.matrix(as.data.frame(physeq_filter@otu_table))
# create a subset of removed otus for evaluation
removed_otus <- as.matrix(as.data.frame(physeq_removed@otu_table))
# % of mock community ASVs correctly classified as mock community ASVs
true_neg <- rowSums(recovered_otus[,colnames(recovered_otus) %in% mock_taxa])
# % of mock community incorrectly classified as non-mock community ASVs
false_neg <- rowSums(recovered_otus[,!colnames(recovered_otus) %in% mock_taxa])
# identify non-mock community ASVs correctly classified as not belonging to mock community
true_pos <- rowSums(removed_otus[,!colnames(removed_otus) %in% mock_taxa])
# identify mock community ASVs incorrectly classified as not belonging to mock community
false_pos <- rowSums(removed_otus[,colnames(removed_otus) %in% mock_taxa])
sensitivity <- true_pos/(true_pos + false_neg)
specificty <- true_neg/(true_neg + false_pos)
accuracy <- (true_pos + true_neg) / (false_pos + true_pos + false_neg + true_neg)
prevalence <- (true_pos + false_neg) / (false_pos + true_pos + false_neg + true_neg)
## proportion of contaminants removed (of all total contaminant ASVs)
contaminants_removed = (rowSums(removed_otus[,!colnames(removed_otus) %in% mock_taxa])/ contaminants_original) * 100
## proportion of mock removed (of all total mock ASVs)
mock_ASVs_removed = (rowSums(removed_otus[,colnames(removed_otus) %in% mock_taxa])/ mock_original) * 100
## total amount of conatminants remaining in ne
#contaminants_remaining = rowSums(recovered_otus[,!colnames(recovered_otus) %in% mock_taxa])
# calculate alpha diverity and summary of abundances
diversity <- estimate_richness(physeq_filter, measures = c('Observed','Shannon','InvSimpson'))
rel_abundance <- as.data.frame(physeq_filter_norm@otu_table)
mock_abundance <- rel_abundance[, colnames(rel_abundance) %in% mock_taxa]
total_mock_abundance <- rowSums(mock_abundance)
con_abundance <- rowSums(rel_abundance[,!colnames(rel_abundance) %in% mock_taxa])
# return results
results <- cbind(contaminants_removed, mock_ASVs_removed,con_abundance, total_mock_abundance, diversity, mock_abundance,sensitivity , specificty, accuracy, prevalence, true_pos, true_neg, false_pos, false_neg)
# add filter_method to results table
results <- results %>%
mutate(method = filter_method) %>%
mutate(sample_names = rownames(results))
return(results)
}
summary_table <- function(results, filter_method){
caption_text = paste0(filter_method,' summary')
drop_col <- c('true_neg','true_pos','false_neg','false_pos','method')
results <- results %>% select(-one_of(drop_col))
kable(t(results), digits = 2, caption = caption_text) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = F) %>%
group_rows("Percent Removed", 1,2) %>%
group_rows("Percent Remaining after Contaminant Removal", 3,4) %>%
group_rows("Diversity Estimates after Contaminant Removal", 5,7) %>%
group_rows("Mock Abundances after Contaminant Removal", 8,ncol(results) - 2)
}
# define function for evaluateing abundance filtering
abundance_filter_eval <- function(physeq, abundance_filter){
# normalize physeq to 100
physeq_norm <- transform_sample_counts(physeq,function(x) 100* x/sum(x))
# subset out the otu table, name rec_otu for recovered otus (otus that are kept)
# name relative abundance otu table rec_otu_rel
rec_otu_rel <- as.matrix(as.data.frame(physeq_norm@otu_table))
rec_otu <- as.matrix(as.data.frame(physeq@otu_table))
# remove low abundance sequences per sample
rec_otu_rel[rec_otu_rel <= abundance_filter] <- 0
rec_otu[rec_otu_rel == 0] <- 0
ps_filt <- physeq
otu_table(ps_filt) <- otu_table(rec_otu, taxa_are_rows = FALSE)
# create a subset of removed otus for evaluation
rem_otu_rel <- as.matrix(as.data.frame(physeq_norm@otu_table))
rem_otu <- as.matrix(as.data.frame(physeq@otu_table))
rem_otu[rem_otu_rel > abundance_filter] <- 0
ps_rem <- physeq
otu_table(ps_rem) <- otu_table(rem_otu, taxa_are_rows = FALSE)
# plot results
eval_filter(ps_filt, mock_taxa, rec_otu,rem_otu, paste0('Abundance less than ', abundance_filter, '% removed'))
# return results
results <- filter_results(physeq, ps_filt, ps_rem, mock_taxa,paste0('Abundance filter, ',abundance_filter) )
return(results)
}
This analysis assumes that you have the RData file and R Markdown file saved in the same directory.
## Load the dataset
load("mockDilutionsPrep.RData")
This work outlines each step required for evaluating the impact of contaminants on 16S rRNA gene experiments with samples that have varying starting material. The conclusions from this analysis is that contaminant ASVs increase with decreasing starting material and lead to distorted microbial community profiles and overestimated alpha diversity measurements.
While contaminants are prevalent in low microbial biomass samples, and potentially unavoidable, all hope is not lost. There are a few computational approaches to identify and remove contaminant sequences from 16S rRNA gene sequencing data. To see how we evaluated these different methods for identifying and removing contaminants from this data set, see: * Removing contaminant ASVs with decontam * Removing contaminant ASVs with SourceTracker * Evaluating SourceTracker results
This RMarkdown file provides the code for evaluating how unrecommended methods - removing all sequences that appear in a blank control and an abundance filter would perform for removing contaminants.
# phyloseq object with only expected sequences
mock_ps_exp <- prune_taxa(taxa_names(mock_ps_pure),mock_ps)
# phyloseq object with only contaminant sequences
mock_ps_not_exp <- prune_taxa(!taxa_names(mock_ps) %in% taxa_names(mock_ps_pure) ,mock_ps)
## create a summary of results with expected sequences only
results_expected <- filter_results(mock_ps, mock_ps_exp, mock_ps_not_exp ,mock_taxa,'Expected Results')
## Warning in estimate_richness(physeq_filter, measures = c("Observed", "Shannon", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
##
## We recommended that you find the un-trimmed data and retry.
summary_table(results_expected,'Expected Results')
Percent Removed | |||||||||
contaminants_removed | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
mock_ASVs_removed | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Percent Remaining after Contaminant Removal | |||||||||
con_abundance | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
total_mock_abundance | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
Diversity Estimates after Contaminant Removal | |||||||||
Observed | 9 | 9 | 9 | 9 | 9 | 9 | 9 | 8 | 8 |
Shannon | 1.9 | 1.9 | 1.9 | 1.9 | 1.8 | 1.9 | 1.8 | 1.8 | 1.9 |
InvSimpson | 5.4 | 5.4 | 5.6 | 5.8 | 4.9 | 5.8 | 4.9 | 5.4 | 6.3 |
Mock Abundances after Contaminant Removal | |||||||||
ASV_1 | 22 | 30 | 26 | 24 | 33 | 27 | 28 | 28 | 16 |
ASV_2 | 26 | 21 | 22 | 23 | 15 | 20 | 18 | 21 | 19 |
ASV_3 | 21.7 | 14.5 | 17.1 | 16.8 | 9.6 | 16.3 | 8.2 | 12.3 | 10.6 |
ASV_4 | 6.6 | 14.2 | 12.5 | 15.4 | 23.6 | 13.4 | 29.0 | 19.4 | 24.1 |
ASV_5 | 11.1 | 7.9 | 9.2 | 8.4 | 5.3 | 8.6 | 5.8 | 5.7 | 3.9 |
ASV_6 | 5.1 | 5.4 | 5.4 | 5.9 | 6.6 | 7.1 | 6.1 | 7.8 | 11.5 |
ASV_7 | 2.7 | 3.7 | 3.2 | 3.2 | 4.5 | 3.8 | 3.5 | 3.2 | 7.7 |
ASV_8 | 3.1 | 2.1 | 2.4 | 2.4 | 1.3 | 2.2 | 1.4 | 0.0 | 0.0 |
ASV_10 | 1.4 | 1.5 | 1.5 | 1.4 | 1.2 | 1.6 | 1.0 | 2.6 | 6.9 |
sensitivity | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
specificty | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
accuracy | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
prevalence | 0.0005 | 0.0014 | 0.0180 | 0.0445 | 0.1196 | 0.2786 | 0.6448 | 0.5577 | 0.8006 |
sample_names | D0 | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 |
# calculate alpha diverity and summary of abundances
diversity <- estimate_richness(mock_ps, measures = c('Observed','Shannon','InvSimpson'))
## Warning in estimate_richness(mock_ps, measures = c("Observed", "Shannon", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
##
## We recommended that you find the un-trimmed data and retry.
rel_abundance <- as.data.frame(mock_ps_norm@otu_table)
mock_abundance <- rel_abundance[, colnames(rel_abundance) %in% mock_taxa]
total_mock_abundance <- rowSums(mock_abundance)
con_abundance <- rowSums(rel_abundance[,!colnames(rel_abundance) %in% mock_taxa])
## Add true positive/true negative ect for 'uncorrected plot'
mock_otus <- as.matrix(as.data.frame(mock_ps@otu_table))
# % of mock community ASVs correctly classified as mock community ASVs
true_neg <- rowSums(mock_otus[,colnames(mock_otus) %in% mock_taxa])
# % of mock community incorrectly classified as mock community ASVs
false_neg <- rowSums(mock_otus[,!colnames(mock_otus) %in% mock_taxa])
# true_pos and false_pos do not exist
# return results
results_original <- cbind(con_abundance, total_mock_abundance, diversity, mock_abundance, true_neg, false_neg)
# add filter_method to results table
results_original <- results_original %>%
mutate(method = 'Original Data') %>%
mutate(sample_names = rownames(results_original))
summary_table(results_original,'Original results - no contaminant removal')
## Warning: Unknown columns: `true_pos`, `false_pos`
Percent Removed | |||||||||
con_abundance | 0.05 | 0.14 | 1.80 | 4.45 | 11.96 | 27.86 | 64.48 | 55.77 | 80.06 |
total_mock_abundance | 100 | 100 | 98 | 96 | 88 | 72 | 36 | 44 | 20 |
Percent Remaining after Contaminant Removal | |||||||||
Observed | 18 | 20 | 114 | 172 | 262 | 312 | 381 | 147 | 193 |
Shannon | 1.9 | 1.9 | 2.0 | 2.2 | 2.5 | 3.4 | 4.5 | 4.1 | 4.8 |
Diversity Estimates after Contaminant Removal | |||||||||
InvSimpson | 5.4 | 5.4 | 5.9 | 6.4 | 6.3 | 11.1 | 28.0 | 24.6 | 71.6 |
ASV_1 | 22.1 | 29.6 | 25.7 | 22.7 | 28.9 | 19.4 | 9.8 | 12.3 | 3.2 |
ASV_2 | 26.0 | 20.9 | 22.1 | 21.7 | 13.4 | 14.5 | 6.3 | 9.3 | 3.9 |
Mock Abundances after Contaminant Removal | |||||||||
ASV_3 | 21.7 | 14.5 | 16.8 | 16.0 | 8.4 | 11.7 | 2.9 | 5.5 | 2.1 |
ASV_4 | 6.6 | 14.2 | 12.3 | 14.8 | 20.8 | 9.7 | 10.3 | 8.6 | 4.8 |
ASV_5 | 11.06 | 7.92 | 9.06 | 8.00 | 4.64 | 6.18 | 2.05 | 2.52 | 0.78 |
ASV_6 | 5.1 | 5.3 | 5.3 | 5.6 | 5.8 | 5.1 | 2.1 | 3.5 | 2.3 |
ASV_7 | 2.7 | 3.7 | 3.2 | 3.1 | 4.0 | 2.8 | 1.2 | 1.4 | 1.5 |
ASV_8 | 3.13 | 2.06 | 2.32 | 2.32 | 1.10 | 1.61 | 0.49 | 0.00 | 0.00 |
ASV_10 | 1.44 | 1.54 | 1.49 | 1.35 | 1.04 | 1.17 | 0.36 | 1.16 | 1.38 |
sample_names | D0 | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 |
We can also evaluate the success of removing contaminants by identifying how many contaminants were removed and remain after filtering.
We will use a similar scheme that was used for evaluating SourceTracker and decontam in our manuscript:
First, we will remove all sequences that appear in the blank control from the mock community dilution series to see how this method would perform
# removing all ASVs present in a blank control sample
# create an otu table to perform filtering on
# rec_otu -> 'recovered otus', will be the otu table results after filtering
rec_otu <- as.matrix(as.data.frame(mock_ps@otu_table))
blank_taxa <- taxa_names(blank_ps)
# remove ASVs from blank
rec_otu <- rec_otu[,!colnames(rec_otu) %in% blank_taxa]
# create a phyloseq object with the filtered otu table
ps_blank_filter <- mock_ps
otu_table(ps_blank_filter) <- otu_table(rec_otu, taxa_are_rows = FALSE)
# create a subset of removed otus for evaluation
# rem_otu -> 'removed otus' the otus that were removed during filtering
rem_otu <- as.matrix(as.data.frame(mock_ps@otu_table))
rem_otu <- rem_otu[,colnames(rem_otu) %in% blank_taxa]
# create a phyloseq object with removed OTUs as the otu table
ps_rem <- mock_ps
otu_table(ps_rem) <- otu_table(rem_otu, taxa_are_rows = FALSE)
# run the evaluation function to summarize results
eval_filter(ps_blank_filter,mock_taxa, rec_otu, rem_otu, 'Blank control removal')
## Warning in data_table$color[data_table$reference] <- ref_colors: number of
## items to replace is not a multiple of replacement length
## Joining, by = c("id", "color", "sample", "count")
results_blank <- filter_results(mock_ps, ps_blank_filter, ps_rem,mock_taxa,'Blank control removal')
## Warning in estimate_richness(physeq_filter, measures = c("Observed", "Shannon", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
##
## We recommended that you find the un-trimmed data and retry.
summary_table(results_blank,'Blank control removal')
Percent Removed | |||||||||
contaminants_removed | 63 | 44 | 63 | 35 | 49 | 49 | 61 | 44 | 44 |
mock_ASVs_removed | 30 | 26 | 27 | 27 | 21 | 25 | 22 | 27 | 34 |
Percent Remaining after Contaminant Removal | |||||||||
con_abundance | 0.027 | 0.106 | 0.933 | 3.998 | 8.100 | 20.817 | 47.855 | 48.905 | 77.420 |
total_mock_abundance | 100 | 100 | 99 | 96 | 92 | 79 | 52 | 51 | 23 |
Diversity Estimates after Contaminant Removal | |||||||||
Observed | 11 | 10 | 54 | 110 | 177 | 199 | 247 | 97 | 128 |
Shannon | 1.6 | 1.5 | 1.6 | 1.9 | 2.0 | 2.8 | 3.8 | 3.5 | 4.3 |
InvSimpson | 4.2 | 3.9 | 4.3 | 4.8 | 4.1 | 6.8 | 12.4 | 13.8 | 42.6 |
Mock Abundances after Contaminant Removal | |||||||||
ASV_1 | 31.7 | 40.2 | 35.6 | 31.4 | 38.1 | 28.7 | 18.4 | 19.5 | 5.5 |
ASV_3 | 31.1 | 19.7 | 23.3 | 22.2 | 11.1 | 17.3 | 5.5 | 8.6 | 3.6 |
ASV_4 | 9.5 | 19.3 | 17.0 | 20.4 | 27.4 | 14.2 | 19.4 | 13.6 | 8.2 |
ASV_5 | 15.8 | 10.7 | 12.6 | 11.1 | 6.1 | 9.1 | 3.9 | 4.0 | 1.3 |
ASV_6 | 7.3 | 7.2 | 7.3 | 7.7 | 7.7 | 7.5 | 4.1 | 5.4 | 3.9 |
ASV_8 | 4.48 | 2.79 | 3.21 | 3.21 | 1.45 | 2.37 | 0.92 | 0.00 | 0.00 |
sensitivity | 0.63 | 0.44 | 0.63 | 0.35 | 0.49 | 0.49 | 0.61 | 0.44 | 0.44 |
specificty | 0.70 | 0.74 | 0.73 | 0.73 | 0.79 | 0.75 | 0.78 | 0.73 | 0.66 |
accuracy | 0.70 | 0.74 | 0.73 | 0.71 | 0.76 | 0.67 | 0.67 | 0.57 | 0.48 |
prevalence | 0.0005 | 0.0014 | 0.0180 | 0.0445 | 0.1196 | 0.2786 | 0.6448 | 0.5577 | 0.8006 |
sample_names | D0 | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 |
Removing all ASVs that were present in the blank control was too harsh - it ended up removing many of the expected mock community ASVs (6 out of 9 of the mock taxa - up to 69.1% of the expected sequences). Additionally, this approach still missed many contaminant taxa.
Another approach that has been used in the literature is applying an abundance filter to remove noise. There is not much guidance on how researchers choose (or even use) a filter. We evaluated filters ranging from 0.01% to 1%.
results_filter_0_01 <- abundance_filter_eval(mock_ps,0.01)
## Joining, by = c("id", "color", "sample", "count")
## Warning in estimate_richness(physeq_filter, measures = c("Observed", "Shannon", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
##
## We recommended that you find the un-trimmed data and retry.
summary_table(results_filter_0_01, 'Abundance filter, 0.01%')
Percent Removed | |||||||||
contaminants_removed | 34.1270 | 32.5000 | 11.5743 | 8.0515 | 1.0671 | 0.1846 | 0.0339 | 0.0000 | 0.0092 |
mock_ASVs_removed | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Percent Remaining after Contaminant Removal | |||||||||
con_abundance | 0.033 | 0.094 | 1.593 | 4.111 | 11.843 | 27.826 | 64.474 | 55.769 | 80.063 |
total_mock_abundance | 100 | 100 | 98 | 96 | 88 | 72 | 36 | 44 | 20 |
Diversity Estimates after Contaminant Removal | |||||||||
Observed | 11 | 13 | 68 | 123 | 244 | 305 | 377 | 147 | 192 |
Shannon | 1.9 | 1.9 | 2.0 | 2.2 | 2.5 | 3.4 | 4.5 | 4.1 | 4.8 |
InvSimpson | 5.4 | 5.4 | 5.8 | 6.3 | 6.3 | 11.1 | 28.0 | 24.6 | 71.6 |
Mock Abundances after Contaminant Removal | |||||||||
ASV_1 | 22.1 | 29.7 | 25.7 | 22.8 | 28.9 | 19.5 | 9.8 | 12.3 | 3.2 |
ASV_2 | 26.0 | 20.9 | 22.1 | 21.8 | 13.4 | 14.5 | 6.3 | 9.3 | 3.9 |
ASV_3 | 21.7 | 14.5 | 16.9 | 16.1 | 8.5 | 11.7 | 2.9 | 5.5 | 2.1 |
ASV_4 | 6.6 | 14.2 | 12.3 | 14.8 | 20.8 | 9.7 | 10.3 | 8.6 | 4.8 |
ASV_5 | 11.06 | 7.92 | 9.08 | 8.03 | 4.64 | 6.19 | 2.05 | 2.52 | 0.78 |
ASV_6 | 5.1 | 5.3 | 5.3 | 5.6 | 5.8 | 5.1 | 2.2 | 3.5 | 2.3 |
ASV_7 | 2.7 | 3.7 | 3.2 | 3.1 | 4.0 | 2.8 | 1.2 | 1.4 | 1.5 |
ASV_8 | 3.13 | 2.06 | 2.32 | 2.32 | 1.10 | 1.61 | 0.49 | 0.00 | 0.00 |
ASV_10 | 1.44 | 1.54 | 1.49 | 1.36 | 1.04 | 1.17 | 0.36 | 1.16 | 1.38 |
sensitivity | 3.4e-01 | 3.3e-01 | 1.2e-01 | 8.1e-02 | 1.1e-02 | 1.8e-03 | 3.4e-04 | 0.0e+00 | 9.2e-05 |
specificty | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
accuracy | 1.00 | 1.00 | 0.98 | 0.96 | 0.88 | 0.72 | 0.36 | 0.44 | 0.20 |
prevalence | 0.0005 | 0.0014 | 0.0180 | 0.0445 | 0.1196 | 0.2786 | 0.6448 | 0.5577 | 0.8006 |
sample_names | D0 | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 |
results_filter_0_1 <- abundance_filter_eval(mock_ps,0.1)
## Joining, by = c("id", "color", "sample", "count")
## Warning in estimate_richness(physeq_filter, measures = c("Observed", "Shannon", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
##
## We recommended that you find the un-trimmed data and retry.
summary_table(results_filter_0_1, 'Abundance filter, 0.1%')
Percent Removed | |||||||||
contaminants_removed | 100.0 | 100.0 | 85.4 | 68.5 | 54.8 | 42.1 | 20.1 | 1.3 | 1.0 |
mock_ASVs_removed | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Percent Remaining after Contaminant Removal | |||||||||
con_abundance | 0.00 | 0.00 | 0.27 | 1.45 | 5.79 | 18.28 | 59.18 | 55.46 | 79.90 |
total_mock_abundance | 100 | 100 | 100 | 99 | 94 | 82 | 41 | 45 | 20 |
Diversity Estimates after Contaminant Removal | |||||||||
Observed | 9 | 9 | 11 | 15 | 31 | 74 | 156 | 136 | 179 |
Shannon | 1.9 | 1.9 | 1.9 | 2.0 | 2.1 | 2.8 | 4.0 | 4.1 | 4.8 |
InvSimpson | 5.4 | 5.4 | 5.7 | 6.0 | 5.5 | 8.7 | 21.3 | 24.3 | 70.5 |
Mock Abundances after Contaminant Removal | |||||||||
ASV_1 | 22.1 | 29.7 | 26.1 | 23.4 | 30.9 | 22.0 | 11.2 | 12.4 | 3.2 |
ASV_2 | 26.1 | 21.0 | 22.4 | 22.4 | 14.3 | 16.4 | 7.2 | 9.3 | 3.9 |
ASV_3 | 21.7 | 14.5 | 17.1 | 16.5 | 9.0 | 13.3 | 3.3 | 5.5 | 2.1 |
ASV_4 | 6.6 | 14.2 | 12.5 | 15.2 | 22.3 | 10.9 | 11.8 | 8.6 | 4.8 |
ASV_5 | 11.06 | 7.93 | 9.20 | 8.25 | 4.96 | 7.01 | 2.36 | 2.54 | 0.79 |
ASV_6 | 5.1 | 5.4 | 5.4 | 5.8 | 6.2 | 5.8 | 2.5 | 3.5 | 2.3 |
ASV_7 | 2.7 | 3.7 | 3.2 | 3.2 | 4.2 | 3.1 | 1.4 | 1.4 | 1.5 |
ASV_8 | 3.13 | 2.06 | 2.35 | 2.39 | 1.18 | 1.82 | 0.56 | 0.00 | 0.00 |
ASV_10 | 1.44 | 1.54 | 1.51 | 1.40 | 1.11 | 1.33 | 0.41 | 1.17 | 1.39 |
sensitivity | 1.000 | 1.000 | 0.854 | 0.685 | 0.548 | 0.421 | 0.201 | 0.013 | 0.010 |
specificty | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
accuracy | 1.00 | 1.00 | 1.00 | 0.99 | 0.95 | 0.84 | 0.48 | 0.45 | 0.21 |
prevalence | 0.0005 | 0.0014 | 0.0180 | 0.0445 | 0.1196 | 0.2786 | 0.6448 | 0.5577 | 0.8006 |
sample_names | D0 | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 |
results_filter_1 <- abundance_filter_eval(mock_ps,1)
## Joining, by = c("id", "color", "sample", "count")
## Warning in estimate_richness(physeq_filter, measures = c("Observed", "Shannon", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
##
## We recommended that you find the un-trimmed data and retry.
summary_table(results_filter_1, 'Abundance filter, 1%')
Percent Removed | |||||||||
contaminants_removed | 100 | 100 | 100 | 100 | 100 | 96 | 77 | 74 | 75 |
mock_ASVs_removed | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.4 | 0.0 | 3.9 |
Percent Remaining after Contaminant Removal | |||||||||
con_abundance | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.6 | 30.2 | 24.7 | 50.6 |
total_mock_abundance | 100 | 100 | 100 | 100 | 100 | 98 | 70 | 75 | 49 |
Diversity Estimates after Contaminant Removal | |||||||||
Observed | 9 | 9 | 9 | 9 | 9 | 10 | 13 | 18 | 17 |
Shannon | 1.9 | 1.9 | 1.9 | 1.9 | 1.8 | 2.0 | 2.2 | 2.5 | 2.7 |
InvSimpson | 5.4 | 5.4 | 5.6 | 5.8 | 4.9 | 6.0 | 7.2 | 8.9 | 13.6 |
Mock Abundances after Contaminant Removal | |||||||||
ASV_1 | 22.1 | 29.7 | 26.1 | 23.7 | 32.8 | 26.5 | 19.7 | 21.0 | 8.2 |
ASV_2 | 26.1 | 21.0 | 22.5 | 22.8 | 15.2 | 19.7 | 12.6 | 15.8 | 9.9 |
ASV_3 | 21.7 | 14.5 | 17.1 | 16.8 | 9.6 | 16.0 | 5.9 | 9.3 | 5.4 |
ASV_4 | 6.6 | 14.2 | 12.5 | 15.4 | 23.6 | 13.2 | 20.7 | 14.6 | 12.4 |
ASV_5 | 11.1 | 7.9 | 9.2 | 8.4 | 5.3 | 8.4 | 4.1 | 4.3 | 0.0 |
ASV_6 | 5.1 | 5.4 | 5.4 | 5.9 | 6.6 | 7.0 | 4.3 | 5.9 | 5.9 |
ASV_7 | 2.7 | 3.7 | 3.2 | 3.2 | 4.5 | 3.8 | 2.5 | 2.4 | 3.9 |
ASV_8 | 3.1 | 2.1 | 2.4 | 2.4 | 1.3 | 2.2 | 0.0 | 0.0 | 0.0 |
ASV_10 | 1.4 | 1.5 | 1.5 | 1.4 | 1.2 | 1.6 | 0.0 | 2.0 | 3.6 |
sensitivity | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.96 | 0.77 | 0.74 | 0.75 |
specificty | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.98 | 1.00 | 0.96 |
accuracy | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.99 | 0.84 | 0.85 | 0.80 |
prevalence | 0.0005 | 0.0014 | 0.0180 | 0.0445 | 0.1196 | 0.2786 | 0.6448 | 0.5577 | 0.8006 |
sample_names | D0 | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 |
vars_to_keep <- ls(pattern = 'results')
vars_to_rm <- ls()
vars_to_rm <-vars_to_rm[!vars_to_rm %in% vars_to_keep]
rm(list = vars_to_rm)
rm(vars_to_rm, vars_to_keep, filter_results)
## Warning in rm(vars_to_rm, vars_to_keep, filter_results): object
## 'vars_to_rm' not found
## Warning in rm(vars_to_rm, vars_to_keep, filter_results): object
## 'vars_to_keep' not found
save.image('results_filtering.RData')
This work demonstrates using various filtering methods for identifying and removing contaminants from a 16S rRNA gene sequencing experiment. We were able to evaluate the performance of this method since we used a dilution series of a mock microbial community, where the expected composition of the samples are known. We also evaluated more sophisticated methods for contaminant identification and removal: the decontam frequency method, which is documented here and SourceTracker, which is documented here.