Introduction

Increasingly there is interest in studying the microbiome from environments that contain few microbes (low microbial biomass). 16S rRNA gene sequencing allows researchers to study microbes from environments in a high throughput culture independent manner, but recent research has highlighted that this technique is prone to contaminants when applied to samples originating from low microbial biomass environments.

We recently performed a study to understand the impact of decreasing microbial biomass on 16S rRNA gene sequencing experiments and evaluate the current computational strategies to control for contaminants (preprint available here).

Here, we provide the compiled R Markdown document to reproduce the analysis presented in the manuscript, divided into 6 primary sections:

Introduction and evaluating contaminants
Removing contaminant ASVs with filtering methods (this document)
Removing contaminant ASVs with decontam
Removing contaminant ASVs with SourceTracker
Evaluating SourceTracker results
Final evaluation of all methods

Here, we provide the compiled R Markdown document to reproduce the initial analysis of the mock microbial community without removing contaminants (referred to as no filter), as well as the ideal situation where all contaminants are removed (referred to as expected results). We also evaluate using the negative control as a filter, as well as a series of relative abundance filters as approached for removing contaminants from 16S rRNA gene sequencing experiments.

For more information about the experimental design and preprocessing of the data, see the Introduction and evaluating contaminants section. For more information on other methods to remove contaminants, see the relevant link above.

Analysis of the mock microbial dilution series dataset

Set up the workspace

This analysis uses many open source packages that are available on CRAN and Bioconductor.

To run this analysis yourself, download the supplemental material here along with the R markdown file. Save these files in the same directory on your computer.

# load libraries
library(phyloseq)
library(ggplot2) 
library(tidyverse)
library(RColorBrewer)
library(reshape2)
library(dplyr)
library(knitr)
options(digits=2)
library(kableExtra)
library(gridExtra)
library(ggpubr)

# save session info (packages and versions loaded)
session <- sessionInfo()

Create functions to use in this analysis

We created a function called expCompBarPlot to display the expected mock microbial ASVs in color and unexpected contaminants sequences in grayscale.

# Create function to plot bar plots with contaminants in grey scale and expected mock microbial sequences in color

# Create function to plot bar plots with contaminants in grey scale and expected mock microbial sequences in color

 expCompBarPlot <- function(physeq, exp_taxa, title){
   ## physeq - phyloseq object that will be plotted
   ## exp_taxa - taxa that are expected to be in the mock community 
   ## title - title for plot
#set up data_table
  data_table <- as.data.frame(t(physeq@otu_table))

  data_table$reference = FALSE
  data_table$reference[rownames(data_table) %in% exp_taxa] = TRUE
  sample_names <- sample_names(physeq)
  data_table$id <- paste0('ASV_', 1:nrow(data_table))
  dilution_labels <- sample_data(physeq)$Dilutions

  set.seed(444)

# define the colors to use for reference and non-reference OTUs/ASVs
  ref_colors <- brewer.pal(sum(data_table$reference), "Paired")
  other_colors <- sample(grey.colors(5, start = 0.5, end = 0.9), sum(!data_table$reference), replace = TRUE)

# add a color variable to the data table
  data_table$color <- rep(NA, nrow(data_table))
  data_table$color[data_table$reference] <- ref_colors
  data_table$color[!data_table$reference] <- other_colors

# reshape the data table into a ggplot-friendly format, by gathering samples into a single column called "count"

  color_gg <- data_table %>% select(id, sample_names, color) %>% gather("sample", "count", sample_names)
  legend_color <- c(bright = ref_colors[2], dull = other_colors[2])
  data_gg <- data_table %>% gather("sample", "count", sample_names)

  data_gg <- inner_join(data_gg,color_gg)

# create the composition bar plot
  comp_bar <- ggplot(data_gg, aes(x = sample, y = count)) +
    geom_col(aes(fill = color, group = reference, alpha = ifelse(reference, "bright", "dull")), width = 0.7, position = position_fill()) +
    scale_fill_identity(guide = FALSE) +
    scale_alpha_manual(name = "Sequence type",
                     labels = c("expected", "other"),
                     values = c(bright = 1, dull = 1),
                     guide = guide_legend(override.aes = list(fill = c(ref_colors[4], "#AEAEAE")),
                                          keywidth = NULL, keyheight = NULL)) +
    labs(title = title, x = "Sample", y = "Relative Abundance") +
    theme(legend.position = "right", legend.title = element_text(size = 10),
        axis.text = element_text(size = 10),
        axis.title = element_text(size = 12),
        plot.title = element_text(size = 12))
  return(comp_bar)
 }

# Define functions to evaluate filter performance
eval_filter <- function(physeq, mock_taxa, recovered_otus, removed_otus, filter_method){
  #  % of mock community ASVs correctly classified as mock community ASVs 
  true_neg <- rowSums(recovered_otus[,colnames(recovered_otus) %in% mock_taxa])
  # % of mock community incorrectly classified as mock community ASVs
  false_neg <- rowSums(recovered_otus[,!colnames(recovered_otus) %in% mock_taxa])
  #  identify non-mock community ASVs correctly classified as not belonging to mock   community
  true_pos <- rowSums(removed_otus[,!colnames(removed_otus) %in% mock_taxa])
  #  identify mock community ASVs incorrectly classified as not belonging to mock   community
  false_pos <- rowSums(removed_otus[,colnames(removed_otus) %in% mock_taxa])

  profile <- rbind(false_neg, false_pos,true_neg,true_pos)

  long_profile <- melt(data = profile, 
                 id.vars = rownames(), 
                 variable.name = colnames(), 
                 value.name = "Abundance"
                )
  names(long_profile)[names(long_profile)=="Var1"] <- "SequenceClass"
  customPalette <- c('#969696','#bdbdbd', '#1B9E77', '#D95F02')

  # Figures
  classificationPlot <- ggplot(long_profile, aes(x = Var2, y = Abundance)) + 
    geom_col(aes(fill = SequenceClass), width = 0.7, position = position_fill())  +
    scale_fill_manual(values=customPalette) + theme(text = element_text(size=12)) +
    labs(x = "Sample", y = 'Proportion of Reads') +
      ggtitle(paste0('Sequence classification for \n  ', filter_method)) + 
      theme(legend.position = "right", legend.title = element_text(size = 12),
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 12),
        plot.title = element_text(size = 12))

  compositionPlot <- expCompBarPlot(physeq,mock_taxa,paste0('Recovered profile after \n ',filter_method))
  # plot figures
  grid.arrange(compositionPlot[[1]], compositionPlot[[2]],classificationPlot, ncol = 3)
}

filter_results <- function(physeq_original, physeq_filter, physeq_removed, mock_taxa,filter_method){
  ## physeq_original <- the original phyloseq object containing actual counts
  ## physeq_filter <- the filtered phyloseq object containing actual counts
  ## physeq_removed <- a phyloseq object of removed actual counts
  ## mock_taxa <- list of mock taxa ASV sequences
  
  # extract out original normalized otu table
  physeq_norm <- transform_sample_counts(physeq_original,function(x) 100* x/sum(x))
  
  original_otus <- as.matrix(as.data.frame(physeq_original@otu_table))
  # identify the orginal proportion of contaminants
  contaminants_original <- rowSums(original_otus[,!colnames(original_otus) %in% mock_taxa])
  # identify the orginal proportion of mock community ASVs
  mock_original <- rowSums(original_otus[,colnames(original_otus) %in% mock_taxa])
  
  # Normalize the filtered physloseq object to relative abundance (each sample sums to 100)
  physeq_filter_norm <- transform_sample_counts(physeq_filter,function(x) 100* x/sum(x))
  
  # subset out the otu table of recovered otus (otus that are kept)
  recovered_otus <- as.matrix(as.data.frame(physeq_filter@otu_table))
  
  # create a subset of removed otus for evaluation
  removed_otus <- as.matrix(as.data.frame(physeq_removed@otu_table))
  
  #  % of mock community ASVs correctly classified as mock community ASVs 
  true_neg <- rowSums(recovered_otus[,colnames(recovered_otus) %in% mock_taxa])
  # % of mock community incorrectly classified as non-mock community ASVs
  false_neg <- rowSums(recovered_otus[,!colnames(recovered_otus) %in% mock_taxa])
  #  identify non-mock community ASVs correctly classified as not belonging to mock   community
  true_pos <- rowSums(removed_otus[,!colnames(removed_otus) %in% mock_taxa])
  #  identify mock community ASVs incorrectly classified as not belonging to mock   community
  false_pos <- rowSums(removed_otus[,colnames(removed_otus) %in% mock_taxa])
  
  sensitivity <- true_pos/(true_pos + false_neg)
  specificty <- true_neg/(true_neg + false_pos)
  accuracy <- (true_pos + true_neg) / (false_pos + true_pos + false_neg + true_neg)
  prevalence <- (true_pos + false_neg) / (false_pos + true_pos + false_neg + true_neg)

  ## proportion of contaminants removed (of all total contaminant ASVs)
  contaminants_removed = (rowSums(removed_otus[,!colnames(removed_otus) %in% mock_taxa])/ contaminants_original) * 100
    ## proportion of mock removed (of all total mock ASVs)
  mock_ASVs_removed = (rowSums(removed_otus[,colnames(removed_otus) %in% mock_taxa])/ mock_original) * 100
  
  ## total amount of conatminants remaining in ne
  #contaminants_remaining =  rowSums(recovered_otus[,!colnames(recovered_otus) %in% mock_taxa])
  # calculate alpha diverity and summary of abundances
  diversity <- estimate_richness(physeq_filter, measures = c('Observed','Shannon','InvSimpson'))
  rel_abundance <- as.data.frame(physeq_filter_norm@otu_table)
  mock_abundance <- rel_abundance[, colnames(rel_abundance) %in% mock_taxa]
  total_mock_abundance <- rowSums(mock_abundance)
  con_abundance <- rowSums(rel_abundance[,!colnames(rel_abundance) %in% mock_taxa])
  # return results
  results <- cbind(contaminants_removed, mock_ASVs_removed,con_abundance, total_mock_abundance, diversity, mock_abundance,sensitivity , specificty, accuracy, prevalence, true_pos, true_neg, false_pos, false_neg)
  # add filter_method to results table
    results <- results %>%
    mutate(method =  filter_method) %>%
    mutate(sample_names = rownames(results))

  return(results)
  }

summary_table <- function(results, filter_method){
  
  caption_text = paste0(filter_method,' summary')
  drop_col <- c('true_neg','true_pos','false_neg','false_pos','method')
  results <- results %>% select(-one_of(drop_col))
  kable(t(results), digits = 2, caption = caption_text) %>%
    kable_styling(bootstrap_options = c("striped", "hover"), full_width = F) %>%
    group_rows("Percent Removed", 1,2) %>%
    group_rows("Percent Remaining after Contaminant Removal", 3,4) %>%
    group_rows("Diversity Estimates after Contaminant Removal", 5,7) %>% 
    group_rows("Mock Abundances after Contaminant Removal", 8,ncol(results) - 2)

}
# define function for evaluateing abundance filtering
abundance_filter_eval <- function(physeq, abundance_filter){
  # normalize physeq to 100
  physeq_norm <- transform_sample_counts(physeq,function(x) 100* x/sum(x))
  # subset out the otu table, name rec_otu for recovered otus (otus that are kept)
  # name relative abundance otu table rec_otu_rel
  rec_otu_rel <- as.matrix(as.data.frame(physeq_norm@otu_table))
  rec_otu <- as.matrix(as.data.frame(physeq@otu_table))
  
  # remove low abundance sequences per sample
  rec_otu_rel[rec_otu_rel <= abundance_filter] <- 0
  rec_otu[rec_otu_rel == 0] <- 0
  ps_filt <- physeq
  otu_table(ps_filt) <- otu_table(rec_otu, taxa_are_rows = FALSE)

  # create a subset of removed otus for evaluation
  rem_otu_rel <- as.matrix(as.data.frame(physeq_norm@otu_table))
  rem_otu <- as.matrix(as.data.frame(physeq@otu_table))
  rem_otu[rem_otu_rel > abundance_filter] <- 0
  ps_rem <- physeq
  otu_table(ps_rem) <- otu_table(rem_otu, taxa_are_rows = FALSE)

  # plot results
  eval_filter(ps_filt, mock_taxa, rec_otu,rem_otu, paste0('Abundance less than ', abundance_filter, '% removed'))
  
  # return results
  results <- filter_results(physeq, ps_filt, ps_rem, mock_taxa,paste0('Abundance filter, ',abundance_filter) )
  return(results)
  }

Load the dataset

This analysis assumes that you have the RData file and R Markdown file saved in the same directory.

## Load the dataset
load("mockDilutionsPrep.RData")

This work outlines each step required for evaluating the impact of contaminants on 16S rRNA gene experiments with samples that have varying starting material. The conclusions from this analysis is that contaminant ASVs increase with decreasing starting material and lead to distorted microbial community profiles and overestimated alpha diversity measurements.

While contaminants are prevalent in low microbial biomass samples, and potentially unavoidable, all hope is not lost. There are a few computational approaches to identify and remove contaminant sequences from 16S rRNA gene sequencing data. To see how we evaluated these different methods for identifying and removing contaminants from this data set, see: * Removing contaminant ASVs with decontam * Removing contaminant ASVs with SourceTracker * Evaluating SourceTracker results

This RMarkdown file provides the code for evaluating how unrecommended methods - removing all sequences that appear in a blank control and an abundance filter would perform for removing contaminants.

Create expected results based on the known sequences from the mock microbial communtiy

# phyloseq object with only expected sequences
mock_ps_exp <- prune_taxa(taxa_names(mock_ps_pure),mock_ps)
# phyloseq object with only contaminant sequences
mock_ps_not_exp <- prune_taxa(!taxa_names(mock_ps) %in% taxa_names(mock_ps_pure) ,mock_ps)

## create a summary of results with expected sequences only
results_expected <- filter_results(mock_ps, mock_ps_exp, mock_ps_not_exp ,mock_taxa,'Expected Results')

## Warning in estimate_richness(physeq_filter, measures = c("Observed", "Shannon", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.

summary_table(results_expected,'Expected Results')

Expected Results summary
Percent Removed
contaminants_removed	100	100	100	100	100	100	100	100	100
mock_ASVs_removed	0	0	0	0	0	0	0	0	0
Percent Remaining after Contaminant Removal
con_abundance	0	0	0	0	0	0	0	0	0
total_mock_abundance	100	100	100	100	100	100	100	100	100
Diversity Estimates after Contaminant Removal
Observed	9	9	9	9	9	9	9	8	8
Shannon	1.9	1.9	1.9	1.9	1.8	1.9	1.8	1.8	1.9
InvSimpson	5.4	5.4	5.6	5.8	4.9	5.8	4.9	5.4	6.3
Mock Abundances after Contaminant Removal
ASV_1	22	30	26	24	33	27	28	28	16
ASV_2	26	21	22	23	15	20	18	21	19
ASV_3	21.7	14.5	17.1	16.8	9.6	16.3	8.2	12.3	10.6
ASV_4	6.6	14.2	12.5	15.4	23.6	13.4	29.0	19.4	24.1
ASV_5	11.1	7.9	9.2	8.4	5.3	8.6	5.8	5.7	3.9
ASV_6	5.1	5.4	5.4	5.9	6.6	7.1	6.1	7.8	11.5
ASV_7	2.7	3.7	3.2	3.2	4.5	3.8	3.5	3.2	7.7
ASV_8	3.1	2.1	2.4	2.4	1.3	2.2	1.4	0.0	0.0
ASV_10	1.4	1.5	1.5	1.4	1.2	1.6	1.0	2.6	6.9
sensitivity	1	1	1	1	1	1	1	1	1
specificty	1	1	1	1	1	1	1	1	1
accuracy	1	1	1	1	1	1	1	1	1
prevalence	0.0005	0.0014	0.0180	0.0445	0.1196	0.2786	0.6448	0.5577	0.8006
sample_names	D0	D1	D2	D3	D4	D5	D6	D7	D8

Create a summary of results before contaminant removal

# calculate alpha diverity and summary of abundances
  diversity <- estimate_richness(mock_ps, measures = c('Observed','Shannon','InvSimpson'))

## Warning in estimate_richness(mock_ps, measures = c("Observed", "Shannon", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.

  rel_abundance <- as.data.frame(mock_ps_norm@otu_table)
  mock_abundance <- rel_abundance[, colnames(rel_abundance) %in% mock_taxa]
  total_mock_abundance <- rowSums(mock_abundance)
  con_abundance <- rowSums(rel_abundance[,!colnames(rel_abundance) %in% mock_taxa])
  
  ## Add true positive/true negative ect for 'uncorrected plot'
  mock_otus <- as.matrix(as.data.frame(mock_ps@otu_table))
  #  % of mock community ASVs correctly classified as mock community ASVs 
  true_neg <- rowSums(mock_otus[,colnames(mock_otus) %in% mock_taxa])
  # % of mock community incorrectly classified as mock community ASVs
  false_neg <- rowSums(mock_otus[,!colnames(mock_otus) %in% mock_taxa])
  
  # true_pos and false_pos do not exist  

  
  # return results
  results_original <- cbind(con_abundance, total_mock_abundance, diversity, mock_abundance, true_neg, false_neg)
  # add filter_method to results table
    results_original <- results_original %>%
    mutate(method =  'Original Data') %>%
    mutate(sample_names = rownames(results_original))
    
summary_table(results_original,'Original results - no contaminant removal')

## Warning: Unknown columns: `true_pos`, `false_pos`

Original results - no contaminant removal summary
Percent Removed
con_abundance	0.05	0.14	1.80	4.45	11.96	27.86	64.48	55.77	80.06
total_mock_abundance	100	100	98	96	88	72	36	44	20
Percent Remaining after Contaminant Removal
Observed	18	20	114	172	262	312	381	147	193
Shannon	1.9	1.9	2.0	2.2	2.5	3.4	4.5	4.1	4.8
Diversity Estimates after Contaminant Removal
InvSimpson	5.4	5.4	5.9	6.4	6.3	11.1	28.0	24.6	71.6
ASV_1	22.1	29.6	25.7	22.7	28.9	19.4	9.8	12.3	3.2
ASV_2	26.0	20.9	22.1	21.7	13.4	14.5	6.3	9.3	3.9
Mock Abundances after Contaminant Removal
ASV_3	21.7	14.5	16.8	16.0	8.4	11.7	2.9	5.5	2.1
ASV_4	6.6	14.2	12.3	14.8	20.8	9.7	10.3	8.6	4.8
ASV_5	11.06	7.92	9.06	8.00	4.64	6.18	2.05	2.52	0.78
ASV_6	5.1	5.3	5.3	5.6	5.8	5.1	2.1	3.5	2.3
ASV_7	2.7	3.7	3.2	3.1	4.0	2.8	1.2	1.4	1.5
ASV_8	3.13	2.06	2.32	2.32	1.10	1.61	0.49	0.00	0.00
ASV_10	1.44	1.54	1.49	1.35	1.04	1.17	0.36	1.16	1.38
sample_names	D0	D1	D2	D3	D4	D5	D6	D7	D8

Classify sequences as true mock, false mock, true contam, false contam

We can also evaluate the success of removing contaminants by identifying how many contaminants were removed and remain after filtering.

We will use a similar scheme that was used for evaluating SourceTracker and decontam in our manuscript:

true_neg - ASVs correctly identified as mock community ASVs (correctly remaining int the data set after filtering)
false_neg - ASVs incorrectly identified as mock community ASVs (unremoved contaminants)
true_pos - ASVs correctly identified as contaminant ASVs (correctly removing a contaminant ASV from the data set)
false_pos - ASVs incorrectly identified as contaminant ASVs (incorrectly removing a mock community read from the data set)

Removing sequences that appear in a blank control

First, we will remove all sequences that appear in the blank control from the mock community dilution series to see how this method would perform

# removing all ASVs present in a blank control sample
# create an otu table to perform filtering on
# rec_otu -> 'recovered otus', will be the otu table results after filtering
rec_otu <- as.matrix(as.data.frame(mock_ps@otu_table))
blank_taxa <- taxa_names(blank_ps)
# remove ASVs from blank
rec_otu <- rec_otu[,!colnames(rec_otu) %in% blank_taxa]
# create a phyloseq object with the filtered otu table
ps_blank_filter <- mock_ps
otu_table(ps_blank_filter) <- otu_table(rec_otu, taxa_are_rows = FALSE)

# create a subset of removed otus for evaluation
# rem_otu -> 'removed otus' the otus that were removed during filtering
rem_otu <- as.matrix(as.data.frame(mock_ps@otu_table))
rem_otu <- rem_otu[,colnames(rem_otu) %in% blank_taxa]
# create a phyloseq object with removed OTUs as the otu table
ps_rem <- mock_ps
otu_table(ps_rem) <- otu_table(rem_otu, taxa_are_rows = FALSE)

# run the evaluation function to summarize results
eval_filter(ps_blank_filter,mock_taxa, rec_otu, rem_otu, 'Blank control removal')

## Warning in data_table$color[data_table$reference] <- ref_colors: number of
## items to replace is not a multiple of replacement length

## Joining, by = c("id", "color", "sample", "count")

results_blank <- filter_results(mock_ps, ps_blank_filter, ps_rem,mock_taxa,'Blank control removal')

## Warning in estimate_richness(physeq_filter, measures = c("Observed", "Shannon", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.

summary_table(results_blank,'Blank control removal')

Blank control removal summary
Percent Removed
contaminants_removed	63	44	63	35	49	49	61	44	44
mock_ASVs_removed	30	26	27	27	21	25	22	27	34
Percent Remaining after Contaminant Removal
con_abundance	0.027	0.106	0.933	3.998	8.100	20.817	47.855	48.905	77.420
total_mock_abundance	100	100	99	96	92	79	52	51	23
Diversity Estimates after Contaminant Removal
Observed	11	10	54	110	177	199	247	97	128
Shannon	1.6	1.5	1.6	1.9	2.0	2.8	3.8	3.5	4.3
InvSimpson	4.2	3.9	4.3	4.8	4.1	6.8	12.4	13.8	42.6
Mock Abundances after Contaminant Removal
ASV_1	31.7	40.2	35.6	31.4	38.1	28.7	18.4	19.5	5.5
ASV_3	31.1	19.7	23.3	22.2	11.1	17.3	5.5	8.6	3.6
ASV_4	9.5	19.3	17.0	20.4	27.4	14.2	19.4	13.6	8.2
ASV_5	15.8	10.7	12.6	11.1	6.1	9.1	3.9	4.0	1.3
ASV_6	7.3	7.2	7.3	7.7	7.7	7.5	4.1	5.4	3.9
ASV_8	4.48	2.79	3.21	3.21	1.45	2.37	0.92	0.00	0.00
sensitivity	0.63	0.44	0.63	0.35	0.49	0.49	0.61	0.44	0.44
specificty	0.70	0.74	0.73	0.73	0.79	0.75	0.78	0.73	0.66
accuracy	0.70	0.74	0.73	0.71	0.76	0.67	0.67	0.57	0.48
prevalence	0.0005	0.0014	0.0180	0.0445	0.1196	0.2786	0.6448	0.5577	0.8006
sample_names	D0	D1	D2	D3	D4	D5	D6	D7	D8

Removing all ASVs that were present in the blank control was too harsh - it ended up removing many of the expected mock community ASVs (6 out of 9 of the mock taxa - up to 69.1% of the expected sequences). Additionally, this approach still missed many contaminant taxa.

Removing sequences based on an abundance filter

Another approach that has been used in the literature is applying an abundance filter to remove noise. There is not much guidance on how researchers choose (or even use) a filter. We evaluated filters ranging from 0.01% to 1%.

results_filter_0_01 <- abundance_filter_eval(mock_ps,0.01)

## Joining, by = c("id", "color", "sample", "count")

## Warning in estimate_richness(physeq_filter, measures = c("Observed", "Shannon", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.

summary_table(results_filter_0_01, 'Abundance filter, 0.01%')

Abundance filter, 0.01% summary
Percent Removed
contaminants_removed	34.1270	32.5000	11.5743	8.0515	1.0671	0.1846	0.0339	0.0000	0.0092
mock_ASVs_removed	0	0	0	0	0	0	0	0	0
Percent Remaining after Contaminant Removal
con_abundance	0.033	0.094	1.593	4.111	11.843	27.826	64.474	55.769	80.063
total_mock_abundance	100	100	98	96	88	72	36	44	20
Diversity Estimates after Contaminant Removal
Observed	11	13	68	123	244	305	377	147	192
Shannon	1.9	1.9	2.0	2.2	2.5	3.4	4.5	4.1	4.8
InvSimpson	5.4	5.4	5.8	6.3	6.3	11.1	28.0	24.6	71.6
Mock Abundances after Contaminant Removal
ASV_1	22.1	29.7	25.7	22.8	28.9	19.5	9.8	12.3	3.2
ASV_2	26.0	20.9	22.1	21.8	13.4	14.5	6.3	9.3	3.9
ASV_3	21.7	14.5	16.9	16.1	8.5	11.7	2.9	5.5	2.1
ASV_4	6.6	14.2	12.3	14.8	20.8	9.7	10.3	8.6	4.8
ASV_5	11.06	7.92	9.08	8.03	4.64	6.19	2.05	2.52	0.78
ASV_6	5.1	5.3	5.3	5.6	5.8	5.1	2.2	3.5	2.3
ASV_7	2.7	3.7	3.2	3.1	4.0	2.8	1.2	1.4	1.5
ASV_8	3.13	2.06	2.32	2.32	1.10	1.61	0.49	0.00	0.00
ASV_10	1.44	1.54	1.49	1.36	1.04	1.17	0.36	1.16	1.38
sensitivity	3.4e-01	3.3e-01	1.2e-01	8.1e-02	1.1e-02	1.8e-03	3.4e-04	0.0e+00	9.2e-05
specificty	1	1	1	1	1	1	1	1	1
accuracy	1.00	1.00	0.98	0.96	0.88	0.72	0.36	0.44	0.20
prevalence	0.0005	0.0014	0.0180	0.0445	0.1196	0.2786	0.6448	0.5577	0.8006
sample_names	D0	D1	D2	D3	D4	D5	D6	D7	D8

results_filter_0_1 <- abundance_filter_eval(mock_ps,0.1)

## Joining, by = c("id", "color", "sample", "count")

## Warning in estimate_richness(physeq_filter, measures = c("Observed", "Shannon", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.

summary_table(results_filter_0_1, 'Abundance filter, 0.1%')

Abundance filter, 0.1% summary
Percent Removed
contaminants_removed	100.0	100.0	85.4	68.5	54.8	42.1	20.1	1.3	1.0
mock_ASVs_removed	0	0	0	0	0	0	0	0	0
Percent Remaining after Contaminant Removal
con_abundance	0.00	0.00	0.27	1.45	5.79	18.28	59.18	55.46	79.90
total_mock_abundance	100	100	100	99	94	82	41	45	20
Diversity Estimates after Contaminant Removal
Observed	9	9	11	15	31	74	156	136	179
Shannon	1.9	1.9	1.9	2.0	2.1	2.8	4.0	4.1	4.8
InvSimpson	5.4	5.4	5.7	6.0	5.5	8.7	21.3	24.3	70.5
Mock Abundances after Contaminant Removal
ASV_1	22.1	29.7	26.1	23.4	30.9	22.0	11.2	12.4	3.2
ASV_2	26.1	21.0	22.4	22.4	14.3	16.4	7.2	9.3	3.9
ASV_3	21.7	14.5	17.1	16.5	9.0	13.3	3.3	5.5	2.1
ASV_4	6.6	14.2	12.5	15.2	22.3	10.9	11.8	8.6	4.8
ASV_5	11.06	7.93	9.20	8.25	4.96	7.01	2.36	2.54	0.79
ASV_6	5.1	5.4	5.4	5.8	6.2	5.8	2.5	3.5	2.3
ASV_7	2.7	3.7	3.2	3.2	4.2	3.1	1.4	1.4	1.5
ASV_8	3.13	2.06	2.35	2.39	1.18	1.82	0.56	0.00	0.00
ASV_10	1.44	1.54	1.51	1.40	1.11	1.33	0.41	1.17	1.39
sensitivity	1.000	1.000	0.854	0.685	0.548	0.421	0.201	0.013	0.010
specificty	1	1	1	1	1	1	1	1	1
accuracy	1.00	1.00	1.00	0.99	0.95	0.84	0.48	0.45	0.21
prevalence	0.0005	0.0014	0.0180	0.0445	0.1196	0.2786	0.6448	0.5577	0.8006
sample_names	D0	D1	D2	D3	D4	D5	D6	D7	D8

results_filter_1 <- abundance_filter_eval(mock_ps,1)

## Joining, by = c("id", "color", "sample", "count")

## Warning in estimate_richness(physeq_filter, measures = c("Observed", "Shannon", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.

summary_table(results_filter_1, 'Abundance filter, 1%')

Abundance filter, 1% summary
Percent Removed
contaminants_removed	100	100	100	100	100	96	77	74	75
mock_ASVs_removed	0.0	0.0	0.0	0.0	0.0	0.0	2.4	0.0	3.9
Percent Remaining after Contaminant Removal
con_abundance	0.0	0.0	0.0	0.0	0.0	1.6	30.2	24.7	50.6
total_mock_abundance	100	100	100	100	100	98	70	75	49
Diversity Estimates after Contaminant Removal
Observed	9	9	9	9	9	10	13	18	17
Shannon	1.9	1.9	1.9	1.9	1.8	2.0	2.2	2.5	2.7
InvSimpson	5.4	5.4	5.6	5.8	4.9	6.0	7.2	8.9	13.6
Mock Abundances after Contaminant Removal
ASV_1	22.1	29.7	26.1	23.7	32.8	26.5	19.7	21.0	8.2
ASV_2	26.1	21.0	22.5	22.8	15.2	19.7	12.6	15.8	9.9
ASV_3	21.7	14.5	17.1	16.8	9.6	16.0	5.9	9.3	5.4
ASV_4	6.6	14.2	12.5	15.4	23.6	13.2	20.7	14.6	12.4
ASV_5	11.1	7.9	9.2	8.4	5.3	8.4	4.1	4.3	0.0
ASV_6	5.1	5.4	5.4	5.9	6.6	7.0	4.3	5.9	5.9
ASV_7	2.7	3.7	3.2	3.2	4.5	3.8	2.5	2.4	3.9
ASV_8	3.1	2.1	2.4	2.4	1.3	2.2	0.0	0.0	0.0
ASV_10	1.4	1.5	1.5	1.4	1.2	1.6	0.0	2.0	3.6
sensitivity	1.00	1.00	1.00	1.00	1.00	0.96	0.77	0.74	0.75
specificty	1.00	1.00	1.00	1.00	1.00	1.00	0.98	1.00	0.96
accuracy	1.00	1.00	1.00	1.00	1.00	0.99	0.84	0.85	0.80
prevalence	0.0005	0.0014	0.0180	0.0445	0.1196	0.2786	0.6448	0.5577	0.8006
sample_names	D0	D1	D2	D3	D4	D5	D6	D7	D8

Save workspace

vars_to_keep <- ls(pattern = 'results')
vars_to_rm <- ls()
vars_to_rm <-vars_to_rm[!vars_to_rm %in% vars_to_keep]
rm(list = vars_to_rm)
rm(vars_to_rm, vars_to_keep, filter_results)

## Warning in rm(vars_to_rm, vars_to_keep, filter_results): object
## 'vars_to_rm' not found

## Warning in rm(vars_to_rm, vars_to_keep, filter_results): object
## 'vars_to_keep' not found

save.image('results_filtering.RData')

This work demonstrates using various filtering methods for identifying and removing contaminants from a 16S rRNA gene sequencing experiment. We were able to evaluate the performance of this method since we used a dilution series of a mock microbial community, where the expected composition of the samples are known. We also evaluated more sophisticated methods for contaminant identification and removal: the decontam frequency method, which is documented here and SourceTracker, which is documented here.

Controlling Contaminants in 16S rRNA Gene Sequencing Experiments

Lisa Karstens

5/6/2019