We recently performed a study to understand the impact of decreasing microbial biomass on 16S rRNA gene sequencing experiments and evaluate the current computational strategies to control for contaminants (preprint available here).
Here, we provide the compiled R Markdown document to reproduce the analysis presented in the manuscript, divided into 6 primary sections:
For more information about the experimental design and preprocessing of the data, see the Introduction and evaluating contaminants section.
Decontam is an R package developed specifically to identify and remove contaminants from 16S rRNA gene sequencing experiments using a simple statistical approach. See the preprint here and tutorials here.
Decontam implements two primary approaches for identifying contaminants. The first approach is a frequency approach which identifies significant inverse correlations between ASV abundance and DNA concentration measured prior to library prep. The second approach is prevalence approach which identifies ASVs that appear more often (are more prevalent) in blank control samples than in experimental samples. There are also many combinations available for using information from both of these approaches (see the decontam website for more details).
Since we only have one blank control sample, we cannot evaluate the prevalence approach and instead focus on the frequency approach. We measured DNA concentration with nanodrop prior to library preparation and these values are stored in the phyloseq object’s sample data as “DNA_conc”.
This document assumes that you have read Introduction and evaluating contaminants and have the data file from running that section.
This analysis uses freely available open source packages that are available on CRAN and Bioconductor.
To run this analysis yourself, download the supplemental material here along with the R markdown file.
# load libraries
library(ggplot2)
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble 2.1.1 ✔ purrr 0.3.2
## ✔ tidyr 0.8.3 ✔ dplyr 0.8.0.1
## ✔ readr 1.3.1 ✔ stringr 1.4.0
## ✔ tibble 2.1.1 ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(RColorBrewer)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(phyloseq)
library(decontam); packageVersion("decontam")
## [1] '1.1.2'
library(knitr)
options(digits=2)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
# save session info (packages versions loaded)
session <- sessionInfo()
# load data from running the 'Controlling contaminants 16S' R Markdown file
load('mockDilutionsPrep.RData')
Running decontam is straightforward. We will use the isContaminant function from decontam, which requires:
We will store the results of decontam in and object called contam.freq:
# Evaluate different thresholds of decontam
contam.freq <- isContaminant(mock_ps, method="frequency", threshold=0.1, conc="DNA_conc")
Our new object contam.freq contains the results from running decontam. This object has a few components to it that can be accessed using the $ operator:
All of these values can be reviewed by viewing the contam.freq data frame. Here we will create a temporary data frame with shortened rownames for viewing since our rownames are the ASV sequences and are quite long (making viewing difficult).
## Create a temporary copy of contam.freq with shorter rownames for easy display
temp_contam.freq <- contam.freq
# change the rownames to something managable
rownames(temp_contam.freq) <- paste0("SV",seq(nrow(temp_contam.freq)))
# display the first 10 rows of the data frame
print(head(temp_contam.freq))
## freq prev p.freq p.prev p contaminant
## SV1 0.193 9 0.98 NA 0.98 FALSE
## SV2 0.153 9 0.99 NA 0.99 FALSE
## SV3 0.111 9 0.97 NA 0.97 FALSE
## SV4 0.113 9 0.99 NA 0.99 FALSE
## SV5 0.058 9 0.97 NA 0.97 FALSE
## SV6 0.045 9 1.00 NA 1.00 FALSE
It is also possible to plot the abundance of individual ASVs by DNA concentration with the plot_frequency function in decontam:
set.seed(100)
plot_frequency(mock_ps, taxa_names(mock_ps)[sample(which(contam.freq$contaminant),1)], conc="DNA_conc")
## Warning: Transformation introduced infinite values in continuous y-axis
Next, we use the contam.freq$contaminant result along with the prune_taxa phyloseq function to create a phyloseq object without the suspected contaminants.
# create phyloseq object with contaminant ASVs removed
ps.noncontam.freq <- prune_taxa(!contam.freq$contaminant, mock_ps)
We can also use this approach to create a phyloseq object of just contaminant ASVs so that they can be inspected more closely/confirmed as contaminants.
# create a phyloseq object with only contaminant ASVs
ps.contam.freq <- prune_taxa(contam.freq$contaminant, mock_ps)
Since we have a mock microbial dilution series, we can evaluate how well the contaminant removal worked by identifying the percentage of contaminant ASVs that were successfully removed.
# Identify the percent of contaminants identified
ps_removed_prop <- prune_taxa(contam.freq$contaminant, mock_ps_norm)
# limit ps_norm_exp to dilution series samples (remove blank)
ps_norm_exp <- prune_samples(sample_names(ps_norm_exp) %in% sample_names(ps_removed_prop), ps_norm_exp)
# Print the percent of contaminant ASVs that were removed
print(sample_sums(ps_removed_prop)/(100-sample_sums(ps_norm_exp)) * 100)
## D0 D1 D2 D3 D4 D5 D6 D7 D8
## 25 60 25 37 20 19 11 17 14
We can also identify the percentage of contaminant still remaining in the data set
# Identify the percentage of contaminant still remaining in the dataset
# normalize the post-decontam phyloseq object to 100
ps.noncontam.freq.norm <- transform_sample_counts(ps.noncontam.freq,function(x) 100* x/sum(x))
# remove expected ASVs
ps_rem_contam <- prune_taxa(mock_taxa,ps.noncontam.freq.norm)
# print the percent of contaminants remaining after decontam
sample_sums(ps_rem_contam)
## D0 D1 D2 D3 D4 D5 D6 D7 D8
## 100 100 99 97 90 76 38 49 22
We can also check that none of the expected ASVs were removed.
# identify if expected ASVs are present in the contaminant ASVs
intersect(taxa_names(contam.freq), taxa_names(mock_ps))
## character(0)
We repeat this on various levels of the threshold parameter to identify the ideal one to use on our data set. To do this, we create a function called testIsContam
# Create a function to evaluate the performance of decontam at different thresholds
testIsContam <- function(physeq, thr, mock_taxa){
# create normalized phyloseq object
physeq_norm <- transform_sample_counts(physeq,function(x) 100* x/sum(x))
# extract out original normalized otu table
original_otus <- as.matrix(as.data.frame(physeq@otu_table))
# identify the orginal proportion of contaminants
contaminants_original <- rowSums(original_otus[,!colnames(original_otus) %in% mock_taxa])
# identify the orginal proportion of mock community ASVs
mock_original <- rowSums(original_otus[,colnames(original_otus) %in% mock_taxa])
# apply decontam to the orginal phyloseq object
contam.freq <- isContaminant(physeq, method="frequency", threshold=thr, conc="DNA_conc")
# create physloseq object without contaminants
physeq.noncontam.freq <- prune_taxa(!contam.freq$contaminant, physeq)
# normalize the post-decontam phyloseq object to 100
physeq.noncontam.freq.norm <- transform_sample_counts(physeq.noncontam.freq,function(x) 100* x/sum(x))
# subset out the otu table of recovered otus (otus that are kept)
recovered_otus <- as.matrix(as.data.frame(physeq.noncontam.freq@otu_table))
# create a subset of removed otus for evaluation
removed_otus <- as.matrix(as.data.frame(physeq@otu_table))
removed_otus <- removed_otus[,!colnames(removed_otus) %in% colnames(recovered_otus)]
# Summarize results
## proportion of contaminants removed (of all total contaminant ASVs)
contaminants_removed = (rowSums(removed_otus[,!colnames(removed_otus) %in% mock_taxa])/ contaminants_original) * 100
## proportion of mock removed (of all total mock ASVs)
mock_ASVs_removed = (rowSums(removed_otus[,colnames(removed_otus) %in% mock_taxa])/ mock_original) * 100
## total amount of contaminants remaining in ne
contaminants_remaining = rowSums(recovered_otus[,!colnames(recovered_otus) %in% mock_taxa])
# calculate alpha diverity (on non-normalized decontam data) and summary of relative abundances (on normalized decontam data)
diversity <- estimate_richness(physeq.noncontam.freq, measures = c('Observed','Shannon','InvSimpson'))
rel_abundance <- as.data.frame(physeq.noncontam.freq.norm@otu_table)
mock_abundance <- rel_abundance[, colnames(rel_abundance) %in% mock_taxa]
total_mock_abundance <- rowSums(mock_abundance)
con_abundance <- rowSums(rel_abundance[,!colnames(rel_abundance) %in% mock_taxa])
# plot results
compositionPlot <- expCompBarPlot(physeq.noncontam.freq, mock_taxa, paste0('Decontam frequency, thr =', thr))
# Calculate classifications
# % of mock community ASVs correctly classified as mock community ASVs
true_neg <- rowSums(recovered_otus[,colnames(recovered_otus) %in% mock_taxa])
# % of mock community incorrectly classified as mock community ASVs
false_neg <- rowSums(recovered_otus[,!colnames(recovered_otus) %in% mock_taxa])
# identify non-mock community ASVs correctly classified as not belonging to mock community
true_pos <- rowSums(removed_otus[,!colnames(removed_otus) %in% mock_taxa])
# identify mock community ASVs incorrectly classified as not belonging to mock community
false_pos <- rowSums(removed_otus[,colnames(removed_otus) %in% mock_taxa])
sensitivity <- true_pos/(true_pos + false_neg)
specificty <- true_neg/(true_neg + false_pos)
accuracy <- (true_pos + true_neg) / (false_pos + true_pos + false_neg + true_neg)
prevalence <- (true_pos + false_neg) / (false_pos + true_pos + false_neg + true_neg)
# Plot classifications per sample (Figure 5A)
asv_profile <- rbind(false_neg,false_pos,true_neg,true_pos)
colnames(asv_profile) <- c('D0', 'D1','D2','D3', 'D4', 'D5', 'D6', 'D7', 'D8')
long_asv_profile <- melt(data = asv_profile,
id.vars = rownames(),
variable.name = colnames(),
value.name = "Abundance"
)
# Create color palette
customPalette <- c('#969696','#bdbdbd','#1B9E77', '#D95F02')
classificationPlot <- ggplot(long_asv_profile, aes(x = Var2, y = Abundance)) +
geom_col(aes(fill = Var1), width = 0.7, position = position_fill()) +
scale_fill_manual(values=customPalette) + theme(text = element_text(size=12)) +
labs(x = "Sample", y = 'Proportion of Reads') +
ggtitle(paste0('Sequence classification for \n Decontam frequency, thr = ', thr)) +
theme(legend.position = "right", legend.title = element_text(size = 12),
axis.text = element_text(size = 12),
axis.title = element_text(size = 16),
plot.title = element_text(size = 16))
# plot figures
grid.arrange(compositionPlot, classificationPlot, ncol = 2)
# return results
results <- cbind(contaminants_removed, mock_ASVs_removed,con_abundance, total_mock_abundance, diversity, mock_abundance,sensitivity , specificty, accuracy, prevalence, true_pos, true_neg, false_pos, false_neg)
results <- results %>%
mutate(method = paste0('Decontam frequency, thr =', thr)) %>%
mutate(sample_names = rownames(results))
# return results
return(results)
}
# Create function to plot bar plots with contaminants in grey scale and expected mock microbial sequences in color
expCompBarPlot <- function(physeq, exp_taxa, title){
## physeq - phyloseq object that will be plotted
## exp_taxa - taxa that are expected to be in the mock community
## title - title for plot
#set up data_table
data_table <- as.data.frame(t(physeq@otu_table))
data_table$reference = FALSE
data_table$reference[rownames(data_table) %in% exp_taxa] = TRUE
sample_names <- sample_names(physeq)
data_table$id <- paste0('ASV_', 1:nrow(data_table))
dilution_labels <- sample_data(physeq)$Dilutions
set.seed(444)
# define the colors to use for reference and non-reference OTUs/ASVs
ref_colors <- brewer.pal(sum(data_table$reference), "Paired")
other_colors <- sample(grey.colors(5, start = 0.5, end = 0.9), sum(!data_table$reference), replace = TRUE)
# add a color variable to the data table
data_table$color <- rep(NA, nrow(data_table))
data_table$color[data_table$reference] <- ref_colors
data_table$color[!data_table$reference] <- other_colors
# reshape the data table into a ggplot-friendly format, by gathering samples into a single column called "count"
color_gg <- data_table %>% select(id, sample_names, color) %>% gather("sample", "count", sample_names)
legend_color <- c(bright = ref_colors[2], dull = other_colors[2])
data_gg <- data_table %>% gather("sample", "count", sample_names)
data_gg <- inner_join(data_gg,color_gg)
# create the composition bar plot
comp_bar <- ggplot(data_gg, aes(x = sample, y = count)) +
geom_col(aes(fill = color, group = reference, alpha = ifelse(reference, "bright", "dull")), width = 0.7, position = position_fill()) +
scale_fill_identity(guide = FALSE) +
scale_alpha_manual(name = "Sequence type",
labels = c("expected", "other"),
values = c(bright = 1, dull = 1),
guide = guide_legend(override.aes = list(fill = c(ref_colors[4], "#AEAEAE")),
keywidth = NULL, keyheight = NULL)) +
labs(title = title, x = "Sample", y = "Relative Abundance") +
theme(legend.position = "right", legend.title = element_text(size = 10),
axis.text = element_text(size = 10),
axis.title = element_text(size = 12),
plot.title = element_text(size = 12))
comp_bar
}
# create a function to write out summary table
summary_table <- function(results, thr){
caption_text = paste0('Decontam frequency method, threshold = ', thr, ' summary')
drop_col <- c('true_neg','true_pos','false_neg','false_pos','method')
results <- results %>% select(-one_of(drop_col))
kable(t(results), digits = 2, caption = caption_text) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = F) %>%
group_rows("Proportion Removed", 1,2) %>%
group_rows("Percent Remaining after Contaminant Removal", 3,4) %>%
group_rows("Diversity Estimates after Contaminant Removal", 5,7) %>%
group_rows("Mock Abundances after Contaminant Removal", 8,16) %>%
group_rows("Classification Performance", 17,20)
}
results_decontam_0_1 <- testIsContam(mock_ps,0.1, mock_taxa)
## Warning in estimate_richness(physeq.noncontam.freq, measures = c("Observed", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
##
## We recommended that you find the un-trimmed data and retry.
## Joining, by = c("id", "color", "sample", "count")
results_decontam_0_2 <- testIsContam(mock_ps,0.2, mock_taxa)
## Warning in estimate_richness(physeq.noncontam.freq, measures = c("Observed", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
##
## We recommended that you find the un-trimmed data and retry.
## Joining, by = c("id", "color", "sample", "count")
results_decontam_0_3 <- testIsContam(mock_ps,0.3, mock_taxa)
## Warning in estimate_richness(physeq.noncontam.freq, measures = c("Observed", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
##
## We recommended that you find the un-trimmed data and retry.
## Joining, by = c("id", "color", "sample", "count")
results_decontam_0_4 <- testIsContam(mock_ps,0.4, mock_taxa)
## Warning in estimate_richness(physeq.noncontam.freq, measures = c("Observed", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
##
## We recommended that you find the un-trimmed data and retry.
## Joining, by = c("id", "color", "sample", "count")
results_decontam_0_5 <- testIsContam(mock_ps,0.5, mock_taxa)
## Warning in estimate_richness(physeq.noncontam.freq, measures = c("Observed", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
##
## We recommended that you find the un-trimmed data and retry.
## Joining, by = c("id", "color", "sample", "count")
# Display tables
summary_table(results_decontam_0_1, 0.1)
Proportion Removed | |||||||||
contaminants_removed | 25 | 60 | 25 | 37 | 20 | 19 | 11 | 17 | 14 |
mock_ASVs_removed | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Percent Remaining after Contaminant Removal | |||||||||
con_abundance | 0.037 | 0.056 | 1.352 | 2.845 | 9.845 | 23.793 | 61.807 | 51.138 | 77.601 |
total_mock_abundance | 100 | 100 | 99 | 97 | 90 | 76 | 38 | 49 | 22 |
Diversity Estimates after Contaminant Removal | |||||||||
Observed | 17 | 16 | 91 | 152 | 241 | 288 | 354 | 129 | 176 |
Shannon | 1.9 | 1.9 | 2.0 | 2.1 | 2.4 | 3.2 | 4.4 | 3.9 | 4.7 |
InvSimpson | 5.4 | 5.4 | 5.8 | 6.1 | 6.0 | 10.0 | 24.5 | 20.6 | 63.1 |
Mock Abundances after Contaminant Removal | |||||||||
ASV_1 | 22.1 | 29.7 | 25.8 | 23.1 | 29.6 | 20.5 | 10.5 | 13.6 | 3.6 |
ASV_2 | 26.0 | 20.9 | 22.2 | 22.1 | 13.7 | 15.3 | 6.7 | 10.2 | 4.3 |
ASV_3 | 21.7 | 14.5 | 16.9 | 16.3 | 8.6 | 12.4 | 3.1 | 6.0 | 2.4 |
ASV_4 | 6.6 | 14.2 | 12.3 | 15.0 | 21.3 | 10.2 | 11.1 | 9.5 | 5.4 |
ASV_5 | 11.06 | 7.93 | 9.10 | 8.13 | 4.75 | 6.53 | 2.21 | 2.79 | 0.88 |
ASV_6 | 5.1 | 5.3 | 5.3 | 5.7 | 6.0 | 5.4 | 2.3 | 3.8 | 2.6 |
ASV_7 | 2.7 | 3.7 | 3.2 | 3.1 | 4.0 | 2.9 | 1.3 | 1.6 | 1.7 |
ASV_8 | 3.13 | 2.06 | 2.33 | 2.36 | 1.13 | 1.70 | 0.52 | 0.00 | 0.00 |
ASV_10 | 1.44 | 1.54 | 1.50 | 1.38 | 1.06 | 1.24 | 0.38 | 1.28 | 1.55 |
Classification Performance | |||||||||
sensitivity | 0.25 | 0.60 | 0.25 | 0.37 | 0.20 | 0.19 | 0.11 | 0.17 | 0.14 |
specificty | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
accuracy | 1.00 | 1.00 | 0.99 | 0.97 | 0.90 | 0.77 | 0.43 | 0.54 | 0.31 |
prevalence | 0.0005 | 0.0014 | 0.0180 | 0.0445 | 0.1196 | 0.2786 | 0.6448 | 0.5577 | 0.8006 |
sample_names | D0 | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 |
summary_table(results_decontam_0_2, 0.2)
Proportion Removed | |||||||||
contaminants_removed | 48 | 90 | 67 | 59 | 50 | 48 | 40 | 46 | 43 |
mock_ASVs_removed | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Percent Remaining after Contaminant Removal | |||||||||
con_abundance | 0.026 | 0.013 | 0.594 | 1.881 | 6.360 | 16.845 | 52.258 | 40.506 | 69.623 |
total_mock_abundance | 100 | 100 | 99 | 98 | 94 | 83 | 48 | 59 | 30 |
Diversity Estimates after Contaminant Removal | |||||||||
Observed | 14 | 10 | 58 | 120 | 199 | 244 | 312 | 103 | 148 |
Shannon | 1.9 | 1.9 | 1.9 | 2.0 | 2.2 | 2.9 | 4.2 | 3.5 | 4.5 |
InvSimpson | 5.4 | 5.4 | 5.7 | 6.0 | 5.6 | 8.4 | 20.1 | 14.6 | 51.9 |
Mock Abundances after Contaminant Removal | |||||||||
ASV_1 | 22.1 | 29.7 | 26.0 | 23.3 | 30.7 | 22.4 | 13.1 | 16.6 | 4.9 |
ASV_2 | 26.1 | 21.0 | 22.4 | 22.3 | 14.2 | 16.7 | 8.4 | 12.5 | 5.9 |
ASV_3 | 21.7 | 14.5 | 17.0 | 16.5 | 9.0 | 13.5 | 3.9 | 7.3 | 3.2 |
ASV_4 | 6.6 | 14.2 | 12.4 | 15.2 | 22.1 | 11.1 | 13.8 | 11.6 | 7.3 |
ASV_5 | 11.1 | 7.9 | 9.2 | 8.2 | 4.9 | 7.1 | 2.8 | 3.4 | 1.2 |
ASV_6 | 5.1 | 5.4 | 5.3 | 5.7 | 6.2 | 5.9 | 2.9 | 4.6 | 3.5 |
ASV_7 | 2.7 | 3.7 | 3.2 | 3.2 | 4.2 | 3.2 | 1.7 | 1.9 | 2.3 |
ASV_8 | 3.13 | 2.06 | 2.34 | 2.38 | 1.17 | 1.86 | 0.65 | 0.00 | 0.00 |
ASV_10 | 1.44 | 1.54 | 1.51 | 1.39 | 1.10 | 1.35 | 0.48 | 1.56 | 2.10 |
Classification Performance | |||||||||
sensitivity | 0.48 | 0.90 | 0.67 | 0.59 | 0.50 | 0.48 | 0.40 | 0.46 | 0.43 |
specificty | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
accuracy | 1.00 | 1.00 | 0.99 | 0.98 | 0.94 | 0.85 | 0.61 | 0.70 | 0.54 |
prevalence | 0.0005 | 0.0014 | 0.0180 | 0.0445 | 0.1196 | 0.2786 | 0.6448 | 0.5577 | 0.8006 |
sample_names | D0 | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 |
summary_table(results_decontam_0_3,0.3)
Proportion Removed | |||||||||
contaminants_removed | 48 | 90 | 80 | 69 | 66 | 59 | 52 | 62 | 53 |
mock_ASVs_removed | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Percent Remaining after Contaminant Removal | |||||||||
con_abundance | 0.026 | 0.013 | 0.358 | 1.416 | 4.415 | 13.680 | 46.814 | 32.473 | 65.222 |
total_mock_abundance | 100 | 100 | 100 | 99 | 96 | 86 | 53 | 68 | 35 |
Diversity Estimates after Contaminant Removal | |||||||||
Observed | 14 | 10 | 40 | 94 | 162 | 212 | 277 | 81 | 129 |
Shannon | 1.9 | 1.9 | 1.9 | 2.0 | 2.1 | 2.8 | 4.0 | 3.2 | 4.3 |
InvSimpson | 5.4 | 5.4 | 5.7 | 6.0 | 5.3 | 7.8 | 16.5 | 11.5 | 41.9 |
Mock Abundances after Contaminant Removal | |||||||||
ASV_1 | 22.1 | 29.7 | 26.1 | 23.4 | 31.4 | 23.3 | 14.6 | 18.8 | 5.6 |
ASV_2 | 26.1 | 21.0 | 22.4 | 22.4 | 14.5 | 17.3 | 9.4 | 14.2 | 6.7 |
ASV_3 | 21.7 | 14.5 | 17.1 | 16.5 | 9.2 | 14.0 | 4.4 | 8.3 | 3.7 |
ASV_4 | 6.6 | 14.2 | 12.5 | 15.2 | 22.6 | 11.6 | 15.4 | 13.1 | 8.4 |
ASV_5 | 11.1 | 7.9 | 9.2 | 8.3 | 5.0 | 7.4 | 3.1 | 3.9 | 1.4 |
ASV_6 | 5.1 | 5.4 | 5.4 | 5.8 | 6.3 | 6.1 | 3.2 | 5.3 | 4.0 |
ASV_7 | 2.7 | 3.7 | 3.2 | 3.2 | 4.3 | 3.3 | 1.9 | 2.2 | 2.7 |
ASV_8 | 3.13 | 2.06 | 2.35 | 2.39 | 1.20 | 1.93 | 0.73 | 0.00 | 0.00 |
ASV_10 | 1.44 | 1.54 | 1.51 | 1.40 | 1.12 | 1.40 | 0.54 | 1.77 | 2.40 |
Classification Performance | |||||||||
sensitivity | 0.48 | 0.90 | 0.80 | 0.69 | 0.66 | 0.59 | 0.52 | 0.62 | 0.53 |
specificty | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
accuracy | 1.00 | 1.00 | 1.00 | 0.99 | 0.96 | 0.89 | 0.69 | 0.79 | 0.63 |
prevalence | 0.0005 | 0.0014 | 0.0180 | 0.0445 | 0.1196 | 0.2786 | 0.6448 | 0.5577 | 0.8006 |
sample_names | D0 | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 |
summary_table(results_decontam_0_4, 0.4)
Proportion Removed | |||||||||
contaminants_removed | 88 | 90 | 89 | 75 | 73 | 68 | 64 | 69 | 68 |
mock_ASVs_removed | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Percent Remaining after Contaminant Removal | |||||||||
con_abundance | 0.006 | 0.013 | 0.202 | 1.143 | 3.583 | 10.860 | 39.579 | 28.061 | 56.266 |
total_mock_abundance | 100 | 100 | 100 | 99 | 96 | 89 | 60 | 72 | 44 |
Diversity Estimates after Contaminant Removal | |||||||||
Observed | 13 | 10 | 36 | 75 | 136 | 178 | 241 | 71 | 102 |
Shannon | 1.9 | 1.9 | 1.9 | 2.0 | 2.0 | 2.6 | 3.7 | 3.0 | 4.0 |
InvSimpson | 5.4 | 5.4 | 5.7 | 5.9 | 5.2 | 7.3 | 13.0 | 10.2 | 29.0 |
Mock Abundances after Contaminant Removal | |||||||||
ASV_1 | 22 | 30 | 26 | 23 | 32 | 24 | 17 | 20 | 7 |
ASV_2 | 26.1 | 21.0 | 22.4 | 22.5 | 14.6 | 17.9 | 10.6 | 15.1 | 8.5 |
ASV_3 | 21.7 | 14.5 | 17.1 | 16.6 | 9.2 | 14.5 | 5.0 | 8.9 | 4.6 |
ASV_4 | 6.6 | 14.2 | 12.5 | 15.3 | 22.8 | 11.9 | 17.5 | 14.0 | 10.5 |
ASV_5 | 11.1 | 7.9 | 9.2 | 8.3 | 5.1 | 7.6 | 3.5 | 4.1 | 1.7 |
ASV_6 | 5.1 | 5.4 | 5.4 | 5.8 | 6.4 | 6.3 | 3.7 | 5.6 | 5.0 |
ASV_7 | 2.7 | 3.7 | 3.2 | 3.2 | 4.3 | 3.4 | 2.1 | 2.3 | 3.4 |
ASV_8 | 3.13 | 2.06 | 2.35 | 2.40 | 1.21 | 1.99 | 0.83 | 0.00 | 0.00 |
ASV_10 | 1.44 | 1.54 | 1.52 | 1.40 | 1.13 | 1.45 | 0.61 | 1.89 | 3.02 |
Classification Performance | |||||||||
sensitivity | 0.88 | 0.90 | 0.89 | 0.75 | 0.73 | 0.68 | 0.64 | 0.69 | 0.68 |
specificty | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
accuracy | 1.00 | 1.00 | 1.00 | 0.99 | 0.97 | 0.91 | 0.77 | 0.83 | 0.74 |
prevalence | 0.0005 | 0.0014 | 0.0180 | 0.0445 | 0.1196 | 0.2786 | 0.6448 | 0.5577 | 0.8006 |
sample_names | D0 | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 |
summary_table(results_decontam_0_5,0.5)
Proportion Removed | |||||||||
contaminants_removed | 88 | 90 | 90 | 77 | 75 | 73 | 72 | 70 | 72 |
mock_ASVs_removed | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Percent Remaining after Contaminant Removal | |||||||||
con_abundance | 0.006 | 0.013 | 0.188 | 1.083 | 3.319 | 9.525 | 33.577 | 27.689 | 52.506 |
total_mock_abundance | 100 | 100 | 100 | 99 | 97 | 90 | 66 | 72 | 47 |
Diversity Estimates after Contaminant Removal | |||||||||
Observed | 13 | 10 | 35 | 72 | 126 | 160 | 215 | 69 | 94 |
Shannon | 1.9 | 1.9 | 1.9 | 2.0 | 2.0 | 2.5 | 3.5 | 3.0 | 3.9 |
InvSimpson | 5.4 | 5.4 | 5.7 | 5.9 | 5.2 | 7.1 | 10.8 | 10.1 | 25.2 |
Mock Abundances after Contaminant Removal | |||||||||
ASV_1 | 22.1 | 29.7 | 26.1 | 23.5 | 31.7 | 24.4 | 18.3 | 20.2 | 7.6 |
ASV_2 | 26.1 | 21.0 | 22.4 | 22.5 | 14.7 | 18.1 | 11.7 | 15.2 | 9.2 |
ASV_3 | 21.7 | 14.5 | 17.1 | 16.6 | 9.3 | 14.7 | 5.5 | 8.9 | 5.0 |
ASV_4 | 6.6 | 14.2 | 12.5 | 15.3 | 22.9 | 12.1 | 19.3 | 14.0 | 11.4 |
ASV_5 | 11.1 | 7.9 | 9.2 | 8.3 | 5.1 | 7.8 | 3.8 | 4.1 | 1.9 |
ASV_6 | 5.1 | 5.4 | 5.4 | 5.8 | 6.4 | 6.4 | 4.0 | 5.6 | 5.5 |
ASV_7 | 2.7 | 3.7 | 3.2 | 3.2 | 4.3 | 3.5 | 2.3 | 2.3 | 3.6 |
ASV_8 | 3.13 | 2.06 | 2.35 | 2.40 | 1.21 | 2.02 | 0.91 | 0.00 | 0.00 |
ASV_10 | 1.44 | 1.54 | 1.52 | 1.40 | 1.14 | 1.47 | 0.67 | 1.90 | 3.28 |
Classification Performance | |||||||||
sensitivity | 0.88 | 0.90 | 0.90 | 0.77 | 0.75 | 0.73 | 0.72 | 0.70 | 0.72 |
specificty | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
accuracy | 1.00 | 1.00 | 1.00 | 0.99 | 0.97 | 0.92 | 0.82 | 0.83 | 0.78 |
prevalence | 0.0005 | 0.0014 | 0.0180 | 0.0445 | 0.1196 | 0.2786 | 0.6448 | 0.5577 | 0.8006 |
sample_names | D0 | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 |
Using a threshold of 0.5 is reasonable since it removed many of the contaminant ASVs but did not remove any of the expected ASVs.
contam.freq <- isContaminant(mock_ps, method="frequency", threshold=0.5, conc="DNA_conc")
ps.noncontam.freq <- prune_taxa(!contam.freq$contaminant, mock_ps)
We plot the stacked bar plot of the contaminant corrected microbiome profiles for the mock microbial community dilution series. There are still some contaminants, but they are greatly reduced.
expCompBarPlot(ps.noncontam.freq, mock_taxa, 'Contaminant removed, Decontam (frequency, thr = 0.5)')
## Joining, by = c("id", "color", "sample", "count")
This work demonstrates using the frequency method implemented in decontam for identifying and removing contaminants from a 16S rRNA gene sequencing experiment. We were able to evaluate the performance of this method since we used a dilution series of a mock microbial community, where the expected composition of the samples are known. Decontam is one analytical method for identifying and removing contaminants from 16S rRNA gene sequencing data. We also evaluated using SourceTracker, which is documented here.
vars_to_keep <- ls(pattern = 'results')
vars_to_rm <- ls()
vars_to_rm <-vars_to_rm[!vars_to_rm %in% vars_to_keep]
rm(list = vars_to_rm)
rm(vars_to_rm, vars_to_keep)
## Warning in rm(vars_to_rm, vars_to_keep): object 'vars_to_rm' not found
## Warning in rm(vars_to_rm, vars_to_keep): object 'vars_to_keep' not found
save.image('results_decontam.RData')