Introduction

We recently performed a study to understand the impact of decreasing microbial biomass on 16S rRNA gene sequencing experiments and evaluate the current computational strategies to control for contaminants (preprint available here).

Here, we provide the compiled R Markdown document to reproduce the analysis presented in the manuscript, divided into 6 primary sections:

Introduction and evaluating contaminants
Removing contaminant ASVs with filtering methods
Removing contaminant ASVs with decontam (this document)
Removing contaminant ASVs with SourceTracker
Evaluating SourceTracker results
Final evaluation of all methods

For more information about the experimental design and preprocessing of the data, see the Introduction and evaluating contaminants section.

Remove contaminant ASVs with decontam

Decontam is an R package developed specifically to identify and remove contaminants from 16S rRNA gene sequencing experiments using a simple statistical approach. See the preprint here and tutorials here.

Decontam implements two primary approaches for identifying contaminants. The first approach is a frequency approach which identifies significant inverse correlations between ASV abundance and DNA concentration measured prior to library prep. The second approach is prevalence approach which identifies ASVs that appear more often (are more prevalent) in blank control samples than in experimental samples. There are also many combinations available for using information from both of these approaches (see the decontam website for more details).

Since we only have one blank control sample, we cannot evaluate the prevalence approach and instead focus on the frequency approach. We measured DNA concentration with nanodrop prior to library preparation and these values are stored in the phyloseq object’s sample data as “DNA_conc”.

This document assumes that you have read Introduction and evaluating contaminants and have the data file from running that section.

Set up the workspace

This analysis uses freely available open source packages that are available on CRAN and Bioconductor.

To run this analysis yourself, download the supplemental material here along with the R markdown file.

# load libraries
library(ggplot2)
library(tidyverse)

## ── Attaching packages ──────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ tibble  2.1.1       ✔ purrr   0.3.2  
## ✔ tidyr   0.8.3       ✔ dplyr   0.8.0.1
## ✔ readr   1.3.1       ✔ stringr 1.4.0  
## ✔ tibble  2.1.1       ✔ forcats 0.4.0

## ── Conflicts ─────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(RColorBrewer)
library(reshape2)

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

library(phyloseq)
library(decontam); packageVersion("decontam")

## [1] '1.1.2'

library(knitr)
options(digits=2)
library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

# save session info (packages versions loaded)
session <- sessionInfo()

# load data from running the 'Controlling contaminants 16S' R Markdown file
load('mockDilutionsPrep.RData')

Running decontam is straightforward. We will use the isContaminant function from decontam, which requires:

a phyloseq object or data table with the abundances of each ASV (or OTU) in each sample
the method to be used (in this case frequency)
the threshold for determining if an ASV is a contaminant or not
the name of the column from the sample data that contains the DNA concentration.

We will store the results of decontam in and object called contam.freq:

# Evaluate different thresholds of decontam
contam.freq <- isContaminant(mock_ps, method="frequency", threshold=0.1, conc="DNA_conc")

Our new object contam.freq contains the results from running decontam. This object has a few components to it that can be accessed using the $ operator:

contam.freq$freq -
contam.freq$p.freq - p-value for each ASV for the frequency method
contam.freq$prev - the number of times each ASV appears across the data set
contam.freq$p.prev - p-value for each ASV for the prevalence method (NA if the prevalence method is not used)
contam.freq$p - the p-value for each ASV used for classifying contaminants
contam.freq$contaminant - indicates if an ASV is considered a contaminant (TRUE) or not (FALSE) based on criteria provided in the function call.

All of these values can be reviewed by viewing the contam.freq data frame. Here we will create a temporary data frame with shortened rownames for viewing since our rownames are the ASV sequences and are quite long (making viewing difficult).

## Create a temporary copy of contam.freq with shorter rownames for easy display
temp_contam.freq <- contam.freq

# change the rownames to something managable
rownames(temp_contam.freq) <- paste0("SV",seq(nrow(temp_contam.freq)))

# display the first 10 rows of the data frame
print(head(temp_contam.freq))

##      freq prev p.freq p.prev    p contaminant
## SV1 0.193    9   0.98     NA 0.98       FALSE
## SV2 0.153    9   0.99     NA 0.99       FALSE
## SV3 0.111    9   0.97     NA 0.97       FALSE
## SV4 0.113    9   0.99     NA 0.99       FALSE
## SV5 0.058    9   0.97     NA 0.97       FALSE
## SV6 0.045    9   1.00     NA 1.00       FALSE

It is also possible to plot the abundance of individual ASVs by DNA concentration with the plot_frequency function in decontam:

set.seed(100)
plot_frequency(mock_ps, taxa_names(mock_ps)[sample(which(contam.freq$contaminant),1)], conc="DNA_conc")

## Warning: Transformation introduced infinite values in continuous y-axis

Next, we use the contam.freq$contaminant result along with the prune_taxa phyloseq function to create a phyloseq object without the suspected contaminants.

# create phyloseq object with contaminant ASVs removed  
ps.noncontam.freq <- prune_taxa(!contam.freq$contaminant, mock_ps)

We can also use this approach to create a phyloseq object of just contaminant ASVs so that they can be inspected more closely/confirmed as contaminants.

# create a phyloseq object with only contaminant ASVs
ps.contam.freq <- prune_taxa(contam.freq$contaminant, mock_ps)

Since we have a mock microbial dilution series, we can evaluate how well the contaminant removal worked by identifying the percentage of contaminant ASVs that were successfully removed.

# Identify the percent of contaminants identified
ps_removed_prop <- prune_taxa(contam.freq$contaminant, mock_ps_norm)

# limit ps_norm_exp to dilution series samples (remove blank)
ps_norm_exp <- prune_samples(sample_names(ps_norm_exp) %in% sample_names(ps_removed_prop), ps_norm_exp)

# Print the percent of contaminant ASVs that were removed
print(sample_sums(ps_removed_prop)/(100-sample_sums(ps_norm_exp)) * 100)

## D0 D1 D2 D3 D4 D5 D6 D7 D8 
## 25 60 25 37 20 19 11 17 14

We can also identify the percentage of contaminant still remaining in the data set

# Identify the percentage of contaminant still remaining in the dataset
# normalize the post-decontam phyloseq object to 100
ps.noncontam.freq.norm <- transform_sample_counts(ps.noncontam.freq,function(x) 100* x/sum(x))
# remove expected ASVs
ps_rem_contam <- prune_taxa(mock_taxa,ps.noncontam.freq.norm)
# print the percent of contaminants remaining after decontam 
sample_sums(ps_rem_contam)

##  D0  D1  D2  D3  D4  D5  D6  D7  D8 
## 100 100  99  97  90  76  38  49  22

We can also check that none of the expected ASVs were removed.

# identify if expected ASVs are present in the contaminant ASVs
intersect(taxa_names(contam.freq), taxa_names(mock_ps))

## character(0)

We repeat this on various levels of the threshold parameter to identify the ideal one to use on our data set. To do this, we create a function called testIsContam

# Create a function to evaluate the performance of decontam at different thresholds
testIsContam <-  function(physeq, thr, mock_taxa){
  # create normalized phyloseq object
    physeq_norm <- transform_sample_counts(physeq,function(x) 100* x/sum(x))
  # extract out original normalized otu table
    original_otus <- as.matrix(as.data.frame(physeq@otu_table))
  # identify the orginal proportion of contaminants
    contaminants_original <- rowSums(original_otus[,!colnames(original_otus) %in% mock_taxa])
  # identify the orginal proportion of mock community ASVs
    mock_original <- rowSums(original_otus[,colnames(original_otus) %in% mock_taxa])
          
  # apply decontam to the orginal phyloseq object
  contam.freq <- isContaminant(physeq, method="frequency", threshold=thr, conc="DNA_conc")
  # create physloseq object without contaminants
  physeq.noncontam.freq <- prune_taxa(!contam.freq$contaminant, physeq)
  # normalize the post-decontam phyloseq object to 100
  physeq.noncontam.freq.norm <- transform_sample_counts(physeq.noncontam.freq,function(x) 100* x/sum(x))

  # subset out the otu table of recovered otus (otus that are kept)
  recovered_otus <- as.matrix(as.data.frame(physeq.noncontam.freq@otu_table))
  
  # create a subset of removed otus for evaluation
  removed_otus <- as.matrix(as.data.frame(physeq@otu_table))
  removed_otus <- removed_otus[,!colnames(removed_otus) %in% colnames(recovered_otus)]
 
   # Summarize results  
    ## proportion of contaminants removed (of all total contaminant ASVs)
    contaminants_removed = (rowSums(removed_otus[,!colnames(removed_otus) %in% mock_taxa])/ contaminants_original) * 100
    ## proportion of mock removed (of all total mock ASVs)
    mock_ASVs_removed = (rowSums(removed_otus[,colnames(removed_otus) %in% mock_taxa])/ mock_original) * 100
  ## total amount of contaminants remaining in ne
  contaminants_remaining =  rowSums(recovered_otus[,!colnames(recovered_otus) %in% mock_taxa])
  # calculate alpha diverity (on non-normalized decontam data) and summary of relative abundances (on normalized decontam data)
  diversity <- estimate_richness(physeq.noncontam.freq, measures = c('Observed','Shannon','InvSimpson'))
  rel_abundance <- as.data.frame(physeq.noncontam.freq.norm@otu_table)
  mock_abundance <- rel_abundance[, colnames(rel_abundance) %in% mock_taxa]
  total_mock_abundance <- rowSums(mock_abundance)
  con_abundance <- rowSums(rel_abundance[,!colnames(rel_abundance) %in% mock_taxa])

  # plot results
  compositionPlot <- expCompBarPlot(physeq.noncontam.freq, mock_taxa,  paste0('Decontam frequency, thr =', thr))
  
  # Calculate classifications

  #  % of mock community ASVs correctly classified as mock community ASVs 
  true_neg <- rowSums(recovered_otus[,colnames(recovered_otus) %in% mock_taxa])
  # % of mock community incorrectly classified as mock community ASVs
  false_neg <- rowSums(recovered_otus[,!colnames(recovered_otus) %in% mock_taxa])
  #  identify non-mock community ASVs correctly classified as not belonging to mock   community
  true_pos <- rowSums(removed_otus[,!colnames(removed_otus) %in% mock_taxa])
  #  identify mock community ASVs incorrectly classified as not belonging to mock   community
  false_pos <- rowSums(removed_otus[,colnames(removed_otus) %in% mock_taxa])
  
  sensitivity <- true_pos/(true_pos + false_neg)
  specificty <- true_neg/(true_neg + false_pos)
  accuracy <- (true_pos + true_neg) / (false_pos + true_pos + false_neg + true_neg)
  prevalence <- (true_pos + false_neg) / (false_pos + true_pos + false_neg + true_neg)
  
# Plot classifications per sample (Figure 5A)
asv_profile <- rbind(false_neg,false_pos,true_neg,true_pos)
colnames(asv_profile) <- c('D0', 'D1','D2','D3', 'D4', 'D5', 'D6', 'D7', 'D8')

long_asv_profile <- melt(data = asv_profile, 
                 id.vars = rownames(), 
                 variable.name = colnames(), 
                 value.name = "Abundance"
                )

# Create color palette
customPalette <- c('#969696','#bdbdbd','#1B9E77', '#D95F02')

  classificationPlot <- ggplot(long_asv_profile, aes(x = Var2, y = Abundance)) + 
    geom_col(aes(fill = Var1), width = 0.7, position = position_fill())  +
    scale_fill_manual(values=customPalette) + theme(text = element_text(size=12)) +
    labs(x = "Sample", y = 'Proportion of Reads') +
      ggtitle(paste0('Sequence classification for \n  Decontam frequency, thr = ', thr)) + 
      theme(legend.position = "right", legend.title = element_text(size = 12),
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 16),
        plot.title = element_text(size = 16))
  # plot figures
  grid.arrange(compositionPlot, classificationPlot, ncol = 2)
  # return results
  
  results <- cbind(contaminants_removed, mock_ASVs_removed,con_abundance, total_mock_abundance, diversity, mock_abundance,sensitivity , specificty, accuracy, prevalence, true_pos, true_neg, false_pos, false_neg)
  results <- results %>%
    mutate(method = paste0('Decontam frequency, thr =', thr)) %>%
    mutate(sample_names = rownames(results))  
  # return results
  return(results)

}

# Create function to plot bar plots with contaminants in grey scale and expected mock microbial sequences in color

 expCompBarPlot <- function(physeq, exp_taxa, title){
   ## physeq - phyloseq object that will be plotted
   ## exp_taxa - taxa that are expected to be in the mock community 
   ## title - title for plot
#set up data_table
  data_table <- as.data.frame(t(physeq@otu_table))

  data_table$reference = FALSE
  data_table$reference[rownames(data_table) %in% exp_taxa] = TRUE
  sample_names <- sample_names(physeq)
  data_table$id <- paste0('ASV_', 1:nrow(data_table))
  dilution_labels <- sample_data(physeq)$Dilutions

  set.seed(444)

# define the colors to use for reference and non-reference OTUs/ASVs
  ref_colors <- brewer.pal(sum(data_table$reference), "Paired")
  other_colors <- sample(grey.colors(5, start = 0.5, end = 0.9), sum(!data_table$reference), replace = TRUE)

# add a color variable to the data table
  data_table$color <- rep(NA, nrow(data_table))
  data_table$color[data_table$reference] <- ref_colors
  data_table$color[!data_table$reference] <- other_colors

# reshape the data table into a ggplot-friendly format, by gathering samples into a single column called "count"

  color_gg <- data_table %>% select(id, sample_names, color) %>% gather("sample", "count", sample_names)
  legend_color <- c(bright = ref_colors[2], dull = other_colors[2])
  data_gg <- data_table %>% gather("sample", "count", sample_names)

  data_gg <- inner_join(data_gg,color_gg)

# create the composition bar plot
  comp_bar <- ggplot(data_gg, aes(x = sample, y = count)) +
    geom_col(aes(fill = color, group = reference, alpha = ifelse(reference, "bright", "dull")), width = 0.7, position = position_fill()) +
    scale_fill_identity(guide = FALSE) +
    scale_alpha_manual(name = "Sequence type",
                     labels = c("expected", "other"),
                     values = c(bright = 1, dull = 1),
                     guide = guide_legend(override.aes = list(fill = c(ref_colors[4], "#AEAEAE")),
                                          keywidth = NULL, keyheight = NULL)) +
    labs(title = title, x = "Sample", y = "Relative Abundance") +
    theme(legend.position = "right", legend.title = element_text(size = 10),
        axis.text = element_text(size = 10),
        axis.title = element_text(size = 12),
        plot.title = element_text(size = 12))
  comp_bar
 }

# create a function to write out summary table
summary_table <- function(results, thr){
    caption_text = paste0('Decontam frequency method, threshold = ', thr, ' summary')
    drop_col <- c('true_neg','true_pos','false_neg','false_pos','method')
    results <- results %>% select(-one_of(drop_col))
  kable(t(results), digits = 2, caption = caption_text) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F) %>%
    group_rows("Proportion Removed", 1,2) %>%
    group_rows("Percent Remaining after Contaminant Removal", 3,4) %>%
    group_rows("Diversity Estimates after Contaminant Removal", 5,7) %>% 
  group_rows("Mock Abundances after Contaminant Removal", 8,16) %>% 
  group_rows("Classification Performance", 17,20)
  
}

Print Figures

results_decontam_0_1 <- testIsContam(mock_ps,0.1, mock_taxa)

## Warning in estimate_richness(physeq.noncontam.freq, measures = c("Observed", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.

## Joining, by = c("id", "color", "sample", "count")

results_decontam_0_2 <- testIsContam(mock_ps,0.2, mock_taxa)

## Warning in estimate_richness(physeq.noncontam.freq, measures = c("Observed", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.

## Joining, by = c("id", "color", "sample", "count")

results_decontam_0_3 <- testIsContam(mock_ps,0.3, mock_taxa)

## Warning in estimate_richness(physeq.noncontam.freq, measures = c("Observed", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.

## Joining, by = c("id", "color", "sample", "count")

results_decontam_0_4 <- testIsContam(mock_ps,0.4, mock_taxa)

## Warning in estimate_richness(physeq.noncontam.freq, measures = c("Observed", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.

## Joining, by = c("id", "color", "sample", "count")

results_decontam_0_5 <- testIsContam(mock_ps,0.5, mock_taxa)

## Warning in estimate_richness(physeq.noncontam.freq, measures = c("Observed", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.

## Joining, by = c("id", "color", "sample", "count")

# Display tables

summary_table(results_decontam_0_1, 0.1)

Decontam frequency method, threshold = 0.1 summary
Proportion Removed
contaminants_removed	25	60	25	37	20	19	11	17	14
mock_ASVs_removed	0	0	0	0	0	0	0	0	0
Percent Remaining after Contaminant Removal
con_abundance	0.037	0.056	1.352	2.845	9.845	23.793	61.807	51.138	77.601
total_mock_abundance	100	100	99	97	90	76	38	49	22
Diversity Estimates after Contaminant Removal
Observed	17	16	91	152	241	288	354	129	176
Shannon	1.9	1.9	2.0	2.1	2.4	3.2	4.4	3.9	4.7
InvSimpson	5.4	5.4	5.8	6.1	6.0	10.0	24.5	20.6	63.1
Mock Abundances after Contaminant Removal
ASV_1	22.1	29.7	25.8	23.1	29.6	20.5	10.5	13.6	3.6
ASV_2	26.0	20.9	22.2	22.1	13.7	15.3	6.7	10.2	4.3
ASV_3	21.7	14.5	16.9	16.3	8.6	12.4	3.1	6.0	2.4
ASV_4	6.6	14.2	12.3	15.0	21.3	10.2	11.1	9.5	5.4
ASV_5	11.06	7.93	9.10	8.13	4.75	6.53	2.21	2.79	0.88
ASV_6	5.1	5.3	5.3	5.7	6.0	5.4	2.3	3.8	2.6
ASV_7	2.7	3.7	3.2	3.1	4.0	2.9	1.3	1.6	1.7
ASV_8	3.13	2.06	2.33	2.36	1.13	1.70	0.52	0.00	0.00
ASV_10	1.44	1.54	1.50	1.38	1.06	1.24	0.38	1.28	1.55
Classification Performance
sensitivity	0.25	0.60	0.25	0.37	0.20	0.19	0.11	0.17	0.14
specificty	1	1	1	1	1	1	1	1	1
accuracy	1.00	1.00	0.99	0.97	0.90	0.77	0.43	0.54	0.31
prevalence	0.0005	0.0014	0.0180	0.0445	0.1196	0.2786	0.6448	0.5577	0.8006
sample_names	D0	D1	D2	D3	D4	D5	D6	D7	D8

summary_table(results_decontam_0_2, 0.2)

Decontam frequency method, threshold = 0.2 summary
Proportion Removed
contaminants_removed	48	90	67	59	50	48	40	46	43
mock_ASVs_removed	0	0	0	0	0	0	0	0	0
Percent Remaining after Contaminant Removal
con_abundance	0.026	0.013	0.594	1.881	6.360	16.845	52.258	40.506	69.623
total_mock_abundance	100	100	99	98	94	83	48	59	30
Diversity Estimates after Contaminant Removal
Observed	14	10	58	120	199	244	312	103	148
Shannon	1.9	1.9	1.9	2.0	2.2	2.9	4.2	3.5	4.5
InvSimpson	5.4	5.4	5.7	6.0	5.6	8.4	20.1	14.6	51.9
Mock Abundances after Contaminant Removal
ASV_1	22.1	29.7	26.0	23.3	30.7	22.4	13.1	16.6	4.9
ASV_2	26.1	21.0	22.4	22.3	14.2	16.7	8.4	12.5	5.9
ASV_3	21.7	14.5	17.0	16.5	9.0	13.5	3.9	7.3	3.2
ASV_4	6.6	14.2	12.4	15.2	22.1	11.1	13.8	11.6	7.3
ASV_5	11.1	7.9	9.2	8.2	4.9	7.1	2.8	3.4	1.2
ASV_6	5.1	5.4	5.3	5.7	6.2	5.9	2.9	4.6	3.5
ASV_7	2.7	3.7	3.2	3.2	4.2	3.2	1.7	1.9	2.3
ASV_8	3.13	2.06	2.34	2.38	1.17	1.86	0.65	0.00	0.00
ASV_10	1.44	1.54	1.51	1.39	1.10	1.35	0.48	1.56	2.10
Classification Performance
sensitivity	0.48	0.90	0.67	0.59	0.50	0.48	0.40	0.46	0.43
specificty	1	1	1	1	1	1	1	1	1
accuracy	1.00	1.00	0.99	0.98	0.94	0.85	0.61	0.70	0.54
prevalence	0.0005	0.0014	0.0180	0.0445	0.1196	0.2786	0.6448	0.5577	0.8006
sample_names	D0	D1	D2	D3	D4	D5	D6	D7	D8

summary_table(results_decontam_0_3,0.3)

Decontam frequency method, threshold = 0.3 summary
Proportion Removed
contaminants_removed	48	90	80	69	66	59	52	62	53
mock_ASVs_removed	0	0	0	0	0	0	0	0	0
Percent Remaining after Contaminant Removal
con_abundance	0.026	0.013	0.358	1.416	4.415	13.680	46.814	32.473	65.222
total_mock_abundance	100	100	100	99	96	86	53	68	35
Diversity Estimates after Contaminant Removal
Observed	14	10	40	94	162	212	277	81	129
Shannon	1.9	1.9	1.9	2.0	2.1	2.8	4.0	3.2	4.3
InvSimpson	5.4	5.4	5.7	6.0	5.3	7.8	16.5	11.5	41.9
Mock Abundances after Contaminant Removal
ASV_1	22.1	29.7	26.1	23.4	31.4	23.3	14.6	18.8	5.6
ASV_2	26.1	21.0	22.4	22.4	14.5	17.3	9.4	14.2	6.7
ASV_3	21.7	14.5	17.1	16.5	9.2	14.0	4.4	8.3	3.7
ASV_4	6.6	14.2	12.5	15.2	22.6	11.6	15.4	13.1	8.4
ASV_5	11.1	7.9	9.2	8.3	5.0	7.4	3.1	3.9	1.4
ASV_6	5.1	5.4	5.4	5.8	6.3	6.1	3.2	5.3	4.0
ASV_7	2.7	3.7	3.2	3.2	4.3	3.3	1.9	2.2	2.7
ASV_8	3.13	2.06	2.35	2.39	1.20	1.93	0.73	0.00	0.00
ASV_10	1.44	1.54	1.51	1.40	1.12	1.40	0.54	1.77	2.40
Classification Performance
sensitivity	0.48	0.90	0.80	0.69	0.66	0.59	0.52	0.62	0.53
specificty	1	1	1	1	1	1	1	1	1
accuracy	1.00	1.00	1.00	0.99	0.96	0.89	0.69	0.79	0.63
prevalence	0.0005	0.0014	0.0180	0.0445	0.1196	0.2786	0.6448	0.5577	0.8006
sample_names	D0	D1	D2	D3	D4	D5	D6	D7	D8

summary_table(results_decontam_0_4, 0.4)

Decontam frequency method, threshold = 0.4 summary
Proportion Removed
contaminants_removed	88	90	89	75	73	68	64	69	68
mock_ASVs_removed	0	0	0	0	0	0	0	0	0
Percent Remaining after Contaminant Removal
con_abundance	0.006	0.013	0.202	1.143	3.583	10.860	39.579	28.061	56.266
total_mock_abundance	100	100	100	99	96	89	60	72	44
Diversity Estimates after Contaminant Removal
Observed	13	10	36	75	136	178	241	71	102
Shannon	1.9	1.9	1.9	2.0	2.0	2.6	3.7	3.0	4.0
InvSimpson	5.4	5.4	5.7	5.9	5.2	7.3	13.0	10.2	29.0
Mock Abundances after Contaminant Removal
ASV_1	22	30	26	23	32	24	17	20	7
ASV_2	26.1	21.0	22.4	22.5	14.6	17.9	10.6	15.1	8.5
ASV_3	21.7	14.5	17.1	16.6	9.2	14.5	5.0	8.9	4.6
ASV_4	6.6	14.2	12.5	15.3	22.8	11.9	17.5	14.0	10.5
ASV_5	11.1	7.9	9.2	8.3	5.1	7.6	3.5	4.1	1.7
ASV_6	5.1	5.4	5.4	5.8	6.4	6.3	3.7	5.6	5.0
ASV_7	2.7	3.7	3.2	3.2	4.3	3.4	2.1	2.3	3.4
ASV_8	3.13	2.06	2.35	2.40	1.21	1.99	0.83	0.00	0.00
ASV_10	1.44	1.54	1.52	1.40	1.13	1.45	0.61	1.89	3.02
Classification Performance
sensitivity	0.88	0.90	0.89	0.75	0.73	0.68	0.64	0.69	0.68
specificty	1	1	1	1	1	1	1	1	1
accuracy	1.00	1.00	1.00	0.99	0.97	0.91	0.77	0.83	0.74
prevalence	0.0005	0.0014	0.0180	0.0445	0.1196	0.2786	0.6448	0.5577	0.8006
sample_names	D0	D1	D2	D3	D4	D5	D6	D7	D8

summary_table(results_decontam_0_5,0.5)

Decontam frequency method, threshold = 0.5 summary
Proportion Removed
contaminants_removed	88	90	90	77	75	73	72	70	72
mock_ASVs_removed	0	0	0	0	0	0	0	0	0
Percent Remaining after Contaminant Removal
con_abundance	0.006	0.013	0.188	1.083	3.319	9.525	33.577	27.689	52.506
total_mock_abundance	100	100	100	99	97	90	66	72	47
Diversity Estimates after Contaminant Removal
Observed	13	10	35	72	126	160	215	69	94
Shannon	1.9	1.9	1.9	2.0	2.0	2.5	3.5	3.0	3.9
InvSimpson	5.4	5.4	5.7	5.9	5.2	7.1	10.8	10.1	25.2
Mock Abundances after Contaminant Removal
ASV_1	22.1	29.7	26.1	23.5	31.7	24.4	18.3	20.2	7.6
ASV_2	26.1	21.0	22.4	22.5	14.7	18.1	11.7	15.2	9.2
ASV_3	21.7	14.5	17.1	16.6	9.3	14.7	5.5	8.9	5.0
ASV_4	6.6	14.2	12.5	15.3	22.9	12.1	19.3	14.0	11.4
ASV_5	11.1	7.9	9.2	8.3	5.1	7.8	3.8	4.1	1.9
ASV_6	5.1	5.4	5.4	5.8	6.4	6.4	4.0	5.6	5.5
ASV_7	2.7	3.7	3.2	3.2	4.3	3.5	2.3	2.3	3.6
ASV_8	3.13	2.06	2.35	2.40	1.21	2.02	0.91	0.00	0.00
ASV_10	1.44	1.54	1.52	1.40	1.14	1.47	0.67	1.90	3.28
Classification Performance
sensitivity	0.88	0.90	0.90	0.77	0.75	0.73	0.72	0.70	0.72
specificty	1	1	1	1	1	1	1	1	1
accuracy	1.00	1.00	1.00	0.99	0.97	0.92	0.82	0.83	0.78
prevalence	0.0005	0.0014	0.0180	0.0445	0.1196	0.2786	0.6448	0.5577	0.8006
sample_names	D0	D1	D2	D3	D4	D5	D6	D7	D8

Using a threshold of 0.5 is reasonable since it removed many of the contaminant ASVs but did not remove any of the expected ASVs.

contam.freq <- isContaminant(mock_ps, method="frequency", threshold=0.5, conc="DNA_conc")
ps.noncontam.freq <- prune_taxa(!contam.freq$contaminant, mock_ps)

We plot the stacked bar plot of the contaminant corrected microbiome profiles for the mock microbial community dilution series. There are still some contaminants, but they are greatly reduced.

expCompBarPlot(ps.noncontam.freq, mock_taxa, 'Contaminant removed, Decontam (frequency, thr = 0.5)')

## Joining, by = c("id", "color", "sample", "count")

This work demonstrates using the frequency method implemented in decontam for identifying and removing contaminants from a 16S rRNA gene sequencing experiment. We were able to evaluate the performance of this method since we used a dilution series of a mock microbial community, where the expected composition of the samples are known. Decontam is one analytical method for identifying and removing contaminants from 16S rRNA gene sequencing data. We also evaluated using SourceTracker, which is documented here.

vars_to_keep <- ls(pattern = 'results')
vars_to_rm <- ls()
vars_to_rm <-vars_to_rm[!vars_to_rm %in% vars_to_keep]
rm(list = vars_to_rm)
rm(vars_to_rm, vars_to_keep)

## Warning in rm(vars_to_rm, vars_to_keep): object 'vars_to_rm' not found

## Warning in rm(vars_to_rm, vars_to_keep): object 'vars_to_keep' not found

save.image('results_decontam.RData')

Controlling contaminants in 16S rRNA gene sequencing experiments with decontam

Lisa Karstens

Introduction

Remove contaminant ASVs with decontam

Set up the workspace

Print Figures