Introduction

Increasingly there is interest in studying the microbiome from environments that contain few microbes (low microbial biomass). 16S rRNA gene sequencing allows researchers to study microbes from environments in a high throughput culture independent manner, but recent research has highlighted that this technique is prone to contaminants when applied to samples originating from low microbial biomass environments.

We recently performed a study to understand the impact of decreasing microbial biomass on 16S rRNA gene sequencing experiments and evaluate the current computational strategies to control for contaminants (preprint available here).

Here, we provide the compiled R Markdown document to reproduce the analysis presented in the manuscript, divided into 6 primary sections:

Here, we provide the compiled R Markdown document to reproduce the initial analysis of the mock microbial community without removing contaminants (referred to as no filter), as well as the ideal situation where all contaminants are removed (referred to as expected results). We also evaluate using the negative control as a filter, as well as a series of relative abundance filters as approached for removing contaminants from 16S rRNA gene sequencing experiments.

For more information about the experimental design and preprocessing of the data, see the Introduction and evaluating contaminants section. For more information on other methods to remove contaminants, see the relevant link above.

Analysis of the mock microbial dilution series dataset

Set up the workspace

This analysis uses many open source packages that are available on CRAN and Bioconductor.

To run this analysis yourself, download the supplemental material here along with the R markdown file. Save these files in the same directory on your computer.

# load libraries
library(phyloseq)
library(ggplot2) 
library(tidyverse)
library(RColorBrewer)
library(reshape2)
library(dplyr)
library(knitr)
options(digits=2)
library(kableExtra)
library(gridExtra)
library(ggpubr)

# save session info (packages and versions loaded)
session <- sessionInfo()

Create functions to use in this analysis

We created a function called expCompBarPlot to display the expected mock microbial ASVs in color and unexpected contaminants sequences in grayscale.

# Create function to plot bar plots with contaminants in grey scale and expected mock microbial sequences in color

# Create function to plot bar plots with contaminants in grey scale and expected mock microbial sequences in color

 expCompBarPlot <- function(physeq, exp_taxa, title){
   ## physeq - phyloseq object that will be plotted
   ## exp_taxa - taxa that are expected to be in the mock community 
   ## title - title for plot
#set up data_table
  data_table <- as.data.frame(t(physeq@otu_table))

  data_table$reference = FALSE
  data_table$reference[rownames(data_table) %in% exp_taxa] = TRUE
  sample_names <- sample_names(physeq)
  data_table$id <- paste0('ASV_', 1:nrow(data_table))
  dilution_labels <- sample_data(physeq)$Dilutions

  set.seed(444)

# define the colors to use for reference and non-reference OTUs/ASVs
  ref_colors <- brewer.pal(sum(data_table$reference), "Paired")
  other_colors <- sample(grey.colors(5, start = 0.5, end = 0.9), sum(!data_table$reference), replace = TRUE)

# add a color variable to the data table
  data_table$color <- rep(NA, nrow(data_table))
  data_table$color[data_table$reference] <- ref_colors
  data_table$color[!data_table$reference] <- other_colors

# reshape the data table into a ggplot-friendly format, by gathering samples into a single column called "count"

  color_gg <- data_table %>% select(id, sample_names, color) %>% gather("sample", "count", sample_names)
  legend_color <- c(bright = ref_colors[2], dull = other_colors[2])
  data_gg <- data_table %>% gather("sample", "count", sample_names)

  data_gg <- inner_join(data_gg,color_gg)

# create the composition bar plot
  comp_bar <- ggplot(data_gg, aes(x = sample, y = count)) +
    geom_col(aes(fill = color, group = reference, alpha = ifelse(reference, "bright", "dull")), width = 0.7, position = position_fill()) +
    scale_fill_identity(guide = FALSE) +
    scale_alpha_manual(name = "Sequence type",
                     labels = c("expected", "other"),
                     values = c(bright = 1, dull = 1),
                     guide = guide_legend(override.aes = list(fill = c(ref_colors[4], "#AEAEAE")),
                                          keywidth = NULL, keyheight = NULL)) +
    labs(title = title, x = "Sample", y = "Relative Abundance") +
    theme(legend.position = "right", legend.title = element_text(size = 10),
        axis.text = element_text(size = 10),
        axis.title = element_text(size = 12),
        plot.title = element_text(size = 12))
  return(comp_bar)
 }

# Define functions to evaluate filter performance
eval_filter <- function(physeq, mock_taxa, recovered_otus, removed_otus, filter_method){
  #  % of mock community ASVs correctly classified as mock community ASVs 
  true_neg <- rowSums(recovered_otus[,colnames(recovered_otus) %in% mock_taxa])
  # % of mock community incorrectly classified as mock community ASVs
  false_neg <- rowSums(recovered_otus[,!colnames(recovered_otus) %in% mock_taxa])
  #  identify non-mock community ASVs correctly classified as not belonging to mock   community
  true_pos <- rowSums(removed_otus[,!colnames(removed_otus) %in% mock_taxa])
  #  identify mock community ASVs incorrectly classified as not belonging to mock   community
  false_pos <- rowSums(removed_otus[,colnames(removed_otus) %in% mock_taxa])

  profile <- rbind(false_neg, false_pos,true_neg,true_pos)

  long_profile <- melt(data = profile, 
                 id.vars = rownames(), 
                 variable.name = colnames(), 
                 value.name = "Abundance"
                )
  names(long_profile)[names(long_profile)=="Var1"] <- "SequenceClass"
  customPalette <- c('#969696','#bdbdbd', '#1B9E77', '#D95F02')

  # Figures
  classificationPlot <- ggplot(long_profile, aes(x = Var2, y = Abundance)) + 
    geom_col(aes(fill = SequenceClass), width = 0.7, position = position_fill())  +
    scale_fill_manual(values=customPalette) + theme(text = element_text(size=12)) +
    labs(x = "Sample", y = 'Proportion of Reads') +
      ggtitle(paste0('Sequence classification for \n  ', filter_method)) + 
      theme(legend.position = "right", legend.title = element_text(size = 12),
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 12),
        plot.title = element_text(size = 12))

  compositionPlot <- expCompBarPlot(physeq,mock_taxa,paste0('Recovered profile after \n ',filter_method))
  # plot figures
  grid.arrange(compositionPlot[[1]], compositionPlot[[2]],classificationPlot, ncol = 3)
}

filter_results <- function(physeq_original, physeq_filter, physeq_removed, mock_taxa,filter_method){
  ## physeq_original <- the original phyloseq object containing actual counts
  ## physeq_filter <- the filtered phyloseq object containing actual counts
  ## physeq_removed <- a phyloseq object of removed actual counts
  ## mock_taxa <- list of mock taxa ASV sequences
  
  # extract out original normalized otu table
  physeq_norm <- transform_sample_counts(physeq_original,function(x) 100* x/sum(x))
  
  original_otus <- as.matrix(as.data.frame(physeq_original@otu_table))
  # identify the orginal proportion of contaminants
  contaminants_original <- rowSums(original_otus[,!colnames(original_otus) %in% mock_taxa])
  # identify the orginal proportion of mock community ASVs
  mock_original <- rowSums(original_otus[,colnames(original_otus) %in% mock_taxa])
  
  # Normalize the filtered physloseq object to relative abundance (each sample sums to 100)
  physeq_filter_norm <- transform_sample_counts(physeq_filter,function(x) 100* x/sum(x))
  
  # subset out the otu table of recovered otus (otus that are kept)
  recovered_otus <- as.matrix(as.data.frame(physeq_filter@otu_table))
  
  # create a subset of removed otus for evaluation
  removed_otus <- as.matrix(as.data.frame(physeq_removed@otu_table))
  
  #  % of mock community ASVs correctly classified as mock community ASVs 
  true_neg <- rowSums(recovered_otus[,colnames(recovered_otus) %in% mock_taxa])
  # % of mock community incorrectly classified as non-mock community ASVs
  false_neg <- rowSums(recovered_otus[,!colnames(recovered_otus) %in% mock_taxa])
  #  identify non-mock community ASVs correctly classified as not belonging to mock   community
  true_pos <- rowSums(removed_otus[,!colnames(removed_otus) %in% mock_taxa])
  #  identify mock community ASVs incorrectly classified as not belonging to mock   community
  false_pos <- rowSums(removed_otus[,colnames(removed_otus) %in% mock_taxa])
  
  sensitivity <- true_pos/(true_pos + false_neg)
  specificty <- true_neg/(true_neg + false_pos)
  accuracy <- (true_pos + true_neg) / (false_pos + true_pos + false_neg + true_neg)
  prevalence <- (true_pos + false_neg) / (false_pos + true_pos + false_neg + true_neg)

  ## proportion of contaminants removed (of all total contaminant ASVs)
  contaminants_removed = (rowSums(removed_otus[,!colnames(removed_otus) %in% mock_taxa])/ contaminants_original) * 100
    ## proportion of mock removed (of all total mock ASVs)
  mock_ASVs_removed = (rowSums(removed_otus[,colnames(removed_otus) %in% mock_taxa])/ mock_original) * 100
  
  ## total amount of conatminants remaining in ne
  #contaminants_remaining =  rowSums(recovered_otus[,!colnames(recovered_otus) %in% mock_taxa])
  # calculate alpha diverity and summary of abundances
  diversity <- estimate_richness(physeq_filter, measures = c('Observed','Shannon','InvSimpson'))
  rel_abundance <- as.data.frame(physeq_filter_norm@otu_table)
  mock_abundance <- rel_abundance[, colnames(rel_abundance) %in% mock_taxa]
  total_mock_abundance <- rowSums(mock_abundance)
  con_abundance <- rowSums(rel_abundance[,!colnames(rel_abundance) %in% mock_taxa])
  # return results
  results <- cbind(contaminants_removed, mock_ASVs_removed,con_abundance, total_mock_abundance, diversity, mock_abundance,sensitivity , specificty, accuracy, prevalence, true_pos, true_neg, false_pos, false_neg)
  # add filter_method to results table
    results <- results %>%
    mutate(method =  filter_method) %>%
    mutate(sample_names = rownames(results))

  return(results)
  }

summary_table <- function(results, filter_method){
  
  caption_text = paste0(filter_method,' summary')
  drop_col <- c('true_neg','true_pos','false_neg','false_pos','method')
  results <- results %>% select(-one_of(drop_col))
  kable(t(results), digits = 2, caption = caption_text) %>%
    kable_styling(bootstrap_options = c("striped", "hover"), full_width = F) %>%
    group_rows("Percent Removed", 1,2) %>%
    group_rows("Percent Remaining after Contaminant Removal", 3,4) %>%
    group_rows("Diversity Estimates after Contaminant Removal", 5,7) %>% 
    group_rows("Mock Abundances after Contaminant Removal", 8,ncol(results) - 2)

}
# define function for evaluateing abundance filtering
abundance_filter_eval <- function(physeq, abundance_filter){
  # normalize physeq to 100
  physeq_norm <- transform_sample_counts(physeq,function(x) 100* x/sum(x))
  # subset out the otu table, name rec_otu for recovered otus (otus that are kept)
  # name relative abundance otu table rec_otu_rel
  rec_otu_rel <- as.matrix(as.data.frame(physeq_norm@otu_table))
  rec_otu <- as.matrix(as.data.frame(physeq@otu_table))
  
  # remove low abundance sequences per sample
  rec_otu_rel[rec_otu_rel <= abundance_filter] <- 0
  rec_otu[rec_otu_rel == 0] <- 0
  ps_filt <- physeq
  otu_table(ps_filt) <- otu_table(rec_otu, taxa_are_rows = FALSE)

  # create a subset of removed otus for evaluation
  rem_otu_rel <- as.matrix(as.data.frame(physeq_norm@otu_table))
  rem_otu <- as.matrix(as.data.frame(physeq@otu_table))
  rem_otu[rem_otu_rel > abundance_filter] <- 0
  ps_rem <- physeq
  otu_table(ps_rem) <- otu_table(rem_otu, taxa_are_rows = FALSE)

  # plot results
  eval_filter(ps_filt, mock_taxa, rec_otu,rem_otu, paste0('Abundance less than ', abundance_filter, '% removed'))
  
  # return results
  results <- filter_results(physeq, ps_filt, ps_rem, mock_taxa,paste0('Abundance filter, ',abundance_filter) )
  return(results)
  }

Load the dataset

This analysis assumes that you have the RData file and R Markdown file saved in the same directory.

## Load the dataset
load("mockDilutionsPrep.RData")

This work outlines each step required for evaluating the impact of contaminants on 16S rRNA gene experiments with samples that have varying starting material. The conclusions from this analysis is that contaminant ASVs increase with decreasing starting material and lead to distorted microbial community profiles and overestimated alpha diversity measurements.

While contaminants are prevalent in low microbial biomass samples, and potentially unavoidable, all hope is not lost. There are a few computational approaches to identify and remove contaminant sequences from 16S rRNA gene sequencing data. To see how we evaluated these different methods for identifying and removing contaminants from this data set, see: * Removing contaminant ASVs with decontam * Removing contaminant ASVs with SourceTracker * Evaluating SourceTracker results

This RMarkdown file provides the code for evaluating how unrecommended methods - removing all sequences that appear in a blank control and an abundance filter would perform for removing contaminants.

Create expected results based on the known sequences from the mock microbial communtiy

# phyloseq object with only expected sequences
mock_ps_exp <- prune_taxa(taxa_names(mock_ps_pure),mock_ps)
# phyloseq object with only contaminant sequences
mock_ps_not_exp <- prune_taxa(!taxa_names(mock_ps) %in% taxa_names(mock_ps_pure) ,mock_ps)

## create a summary of results with expected sequences only
results_expected <- filter_results(mock_ps, mock_ps_exp, mock_ps_not_exp ,mock_taxa,'Expected Results')
## Warning in estimate_richness(physeq_filter, measures = c("Observed", "Shannon", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.
summary_table(results_expected,'Expected Results')
Expected Results summary
Percent Removed
contaminants_removed 100 100 100 100 100 100 100 100 100
mock_ASVs_removed 0 0 0 0 0 0 0 0 0
Percent Remaining after Contaminant Removal
con_abundance 0 0 0 0 0 0 0 0 0
total_mock_abundance 100 100 100 100 100 100 100 100 100
Diversity Estimates after Contaminant Removal
Observed 9 9 9 9 9 9 9 8 8
Shannon 1.9 1.9 1.9 1.9 1.8 1.9 1.8 1.8 1.9
InvSimpson 5.4 5.4 5.6 5.8 4.9 5.8 4.9 5.4 6.3
Mock Abundances after Contaminant Removal
ASV_1 22 30 26 24 33 27 28 28 16
ASV_2 26 21 22 23 15 20 18 21 19
ASV_3 21.7 14.5 17.1 16.8 9.6 16.3 8.2 12.3 10.6
ASV_4 6.6 14.2 12.5 15.4 23.6 13.4 29.0 19.4 24.1
ASV_5 11.1 7.9 9.2 8.4 5.3 8.6 5.8 5.7 3.9
ASV_6 5.1 5.4 5.4 5.9 6.6 7.1 6.1 7.8 11.5
ASV_7 2.7 3.7 3.2 3.2 4.5 3.8 3.5 3.2 7.7
ASV_8 3.1 2.1 2.4 2.4 1.3 2.2 1.4 0.0 0.0
ASV_10 1.4 1.5 1.5 1.4 1.2 1.6 1.0 2.6 6.9
sensitivity 1 1 1 1 1 1 1 1 1
specificty 1 1 1 1 1 1 1 1 1
accuracy 1 1 1 1 1 1 1 1 1
prevalence 0.0005 0.0014 0.0180 0.0445 0.1196 0.2786 0.6448 0.5577 0.8006
sample_names D0 D1 D2 D3 D4 D5 D6 D7 D8

Create a summary of results before contaminant removal

# calculate alpha diverity and summary of abundances
  diversity <- estimate_richness(mock_ps, measures = c('Observed','Shannon','InvSimpson'))
## Warning in estimate_richness(mock_ps, measures = c("Observed", "Shannon", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.
  rel_abundance <- as.data.frame(mock_ps_norm@otu_table)
  mock_abundance <- rel_abundance[, colnames(rel_abundance) %in% mock_taxa]
  total_mock_abundance <- rowSums(mock_abundance)
  con_abundance <- rowSums(rel_abundance[,!colnames(rel_abundance) %in% mock_taxa])
  
  ## Add true positive/true negative ect for 'uncorrected plot'
  mock_otus <- as.matrix(as.data.frame(mock_ps@otu_table))
  #  % of mock community ASVs correctly classified as mock community ASVs 
  true_neg <- rowSums(mock_otus[,colnames(mock_otus) %in% mock_taxa])
  # % of mock community incorrectly classified as mock community ASVs
  false_neg <- rowSums(mock_otus[,!colnames(mock_otus) %in% mock_taxa])
  
  # true_pos and false_pos do not exist  

  
  # return results
  results_original <- cbind(con_abundance, total_mock_abundance, diversity, mock_abundance, true_neg, false_neg)
  # add filter_method to results table
    results_original <- results_original %>%
    mutate(method =  'Original Data') %>%
    mutate(sample_names = rownames(results_original))
    
summary_table(results_original,'Original results - no contaminant removal')
## Warning: Unknown columns: `true_pos`, `false_pos`
Original results - no contaminant removal summary
Percent Removed
con_abundance 0.05 0.14 1.80 4.45 11.96 27.86 64.48 55.77 80.06
total_mock_abundance 100 100 98 96 88 72 36 44 20
Percent Remaining after Contaminant Removal
Observed 18 20 114 172 262 312 381 147 193
Shannon 1.9 1.9 2.0 2.2 2.5 3.4 4.5 4.1 4.8
Diversity Estimates after Contaminant Removal
InvSimpson 5.4 5.4 5.9 6.4 6.3 11.1 28.0 24.6 71.6
ASV_1 22.1 29.6 25.7 22.7 28.9 19.4 9.8 12.3 3.2
ASV_2 26.0 20.9 22.1 21.7 13.4 14.5 6.3 9.3 3.9
Mock Abundances after Contaminant Removal
ASV_3 21.7 14.5 16.8 16.0 8.4 11.7 2.9 5.5 2.1
ASV_4 6.6 14.2 12.3 14.8 20.8 9.7 10.3 8.6 4.8
ASV_5 11.06 7.92 9.06 8.00 4.64 6.18 2.05 2.52 0.78
ASV_6 5.1 5.3 5.3 5.6 5.8 5.1 2.1 3.5 2.3
ASV_7 2.7 3.7 3.2 3.1 4.0 2.8 1.2 1.4 1.5
ASV_8 3.13 2.06 2.32 2.32 1.10 1.61 0.49 0.00 0.00
ASV_10 1.44 1.54 1.49 1.35 1.04 1.17 0.36 1.16 1.38
sample_names D0 D1 D2 D3 D4 D5 D6 D7 D8

Classify sequences as true mock, false mock, true contam, false contam

We can also evaluate the success of removing contaminants by identifying how many contaminants were removed and remain after filtering.

We will use a similar scheme that was used for evaluating SourceTracker and decontam in our manuscript:

Removing sequences that appear in a blank control

First, we will remove all sequences that appear in the blank control from the mock community dilution series to see how this method would perform

# removing all ASVs present in a blank control sample
# create an otu table to perform filtering on
# rec_otu -> 'recovered otus', will be the otu table results after filtering
rec_otu <- as.matrix(as.data.frame(mock_ps@otu_table))
blank_taxa <- taxa_names(blank_ps)
# remove ASVs from blank
rec_otu <- rec_otu[,!colnames(rec_otu) %in% blank_taxa]
# create a phyloseq object with the filtered otu table
ps_blank_filter <- mock_ps
otu_table(ps_blank_filter) <- otu_table(rec_otu, taxa_are_rows = FALSE)

# create a subset of removed otus for evaluation
# rem_otu -> 'removed otus' the otus that were removed during filtering
rem_otu <- as.matrix(as.data.frame(mock_ps@otu_table))
rem_otu <- rem_otu[,colnames(rem_otu) %in% blank_taxa]
# create a phyloseq object with removed OTUs as the otu table
ps_rem <- mock_ps
otu_table(ps_rem) <- otu_table(rem_otu, taxa_are_rows = FALSE)

# run the evaluation function to summarize results
eval_filter(ps_blank_filter,mock_taxa, rec_otu, rem_otu, 'Blank control removal')
## Warning in data_table$color[data_table$reference] <- ref_colors: number of
## items to replace is not a multiple of replacement length
## Joining, by = c("id", "color", "sample", "count")

results_blank <- filter_results(mock_ps, ps_blank_filter, ps_rem,mock_taxa,'Blank control removal')
## Warning in estimate_richness(physeq_filter, measures = c("Observed", "Shannon", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.
summary_table(results_blank,'Blank control removal')
Blank control removal summary
Percent Removed
contaminants_removed 63 44 63 35 49 49 61 44 44
mock_ASVs_removed 30 26 27 27 21 25 22 27 34
Percent Remaining after Contaminant Removal
con_abundance 0.027 0.106 0.933 3.998 8.100 20.817 47.855 48.905 77.420
total_mock_abundance 100 100 99 96 92 79 52 51 23
Diversity Estimates after Contaminant Removal
Observed 11 10 54 110 177 199 247 97 128
Shannon 1.6 1.5 1.6 1.9 2.0 2.8 3.8 3.5 4.3
InvSimpson 4.2 3.9 4.3 4.8 4.1 6.8 12.4 13.8 42.6
Mock Abundances after Contaminant Removal
ASV_1 31.7 40.2 35.6 31.4 38.1 28.7 18.4 19.5 5.5
ASV_3 31.1 19.7 23.3 22.2 11.1 17.3 5.5 8.6 3.6
ASV_4 9.5 19.3 17.0 20.4 27.4 14.2 19.4 13.6 8.2
ASV_5 15.8 10.7 12.6 11.1 6.1 9.1 3.9 4.0 1.3
ASV_6 7.3 7.2 7.3 7.7 7.7 7.5 4.1 5.4 3.9
ASV_8 4.48 2.79 3.21 3.21 1.45 2.37 0.92 0.00 0.00
sensitivity 0.63 0.44 0.63 0.35 0.49 0.49 0.61 0.44 0.44
specificty 0.70 0.74 0.73 0.73 0.79 0.75 0.78 0.73 0.66
accuracy 0.70 0.74 0.73 0.71 0.76 0.67 0.67 0.57 0.48
prevalence 0.0005 0.0014 0.0180 0.0445 0.1196 0.2786 0.6448 0.5577 0.8006
sample_names D0 D1 D2 D3 D4 D5 D6 D7 D8

Removing all ASVs that were present in the blank control was too harsh - it ended up removing many of the expected mock community ASVs (6 out of 9 of the mock taxa - up to 69.1% of the expected sequences). Additionally, this approach still missed many contaminant taxa.

Removing sequences based on an abundance filter

Another approach that has been used in the literature is applying an abundance filter to remove noise. There is not much guidance on how researchers choose (or even use) a filter. We evaluated filters ranging from 0.01% to 1%.

results_filter_0_01 <- abundance_filter_eval(mock_ps,0.01)
## Joining, by = c("id", "color", "sample", "count")
## Warning in estimate_richness(physeq_filter, measures = c("Observed", "Shannon", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.

summary_table(results_filter_0_01, 'Abundance filter, 0.01%')
Abundance filter, 0.01% summary
Percent Removed
contaminants_removed 34.1270 32.5000 11.5743 8.0515 1.0671 0.1846 0.0339 0.0000 0.0092
mock_ASVs_removed 0 0 0 0 0 0 0 0 0
Percent Remaining after Contaminant Removal
con_abundance 0.033 0.094 1.593 4.111 11.843 27.826 64.474 55.769 80.063
total_mock_abundance 100 100 98 96 88 72 36 44 20
Diversity Estimates after Contaminant Removal
Observed 11 13 68 123 244 305 377 147 192
Shannon 1.9 1.9 2.0 2.2 2.5 3.4 4.5 4.1 4.8
InvSimpson 5.4 5.4 5.8 6.3 6.3 11.1 28.0 24.6 71.6
Mock Abundances after Contaminant Removal
ASV_1 22.1 29.7 25.7 22.8 28.9 19.5 9.8 12.3 3.2
ASV_2 26.0 20.9 22.1 21.8 13.4 14.5 6.3 9.3 3.9
ASV_3 21.7 14.5 16.9 16.1 8.5 11.7 2.9 5.5 2.1
ASV_4 6.6 14.2 12.3 14.8 20.8 9.7 10.3 8.6 4.8
ASV_5 11.06 7.92 9.08 8.03 4.64 6.19 2.05 2.52 0.78
ASV_6 5.1 5.3 5.3 5.6 5.8 5.1 2.2 3.5 2.3
ASV_7 2.7 3.7 3.2 3.1 4.0 2.8 1.2 1.4 1.5
ASV_8 3.13 2.06 2.32 2.32 1.10 1.61 0.49 0.00 0.00
ASV_10 1.44 1.54 1.49 1.36 1.04 1.17 0.36 1.16 1.38
sensitivity 3.4e-01 3.3e-01 1.2e-01 8.1e-02 1.1e-02 1.8e-03 3.4e-04 0.0e+00 9.2e-05
specificty 1 1 1 1 1 1 1 1 1
accuracy 1.00 1.00 0.98 0.96 0.88 0.72 0.36 0.44 0.20
prevalence 0.0005 0.0014 0.0180 0.0445 0.1196 0.2786 0.6448 0.5577 0.8006
sample_names D0 D1 D2 D3 D4 D5 D6 D7 D8
results_filter_0_1 <- abundance_filter_eval(mock_ps,0.1)
## Joining, by = c("id", "color", "sample", "count")
## Warning in estimate_richness(physeq_filter, measures = c("Observed", "Shannon", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.

summary_table(results_filter_0_1, 'Abundance filter, 0.1%')
Abundance filter, 0.1% summary
Percent Removed
contaminants_removed 100.0 100.0 85.4 68.5 54.8 42.1 20.1 1.3 1.0
mock_ASVs_removed 0 0 0 0 0 0 0 0 0
Percent Remaining after Contaminant Removal
con_abundance 0.00 0.00 0.27 1.45 5.79 18.28 59.18 55.46 79.90
total_mock_abundance 100 100 100 99 94 82 41 45 20
Diversity Estimates after Contaminant Removal
Observed 9 9 11 15 31 74 156 136 179
Shannon 1.9 1.9 1.9 2.0 2.1 2.8 4.0 4.1 4.8
InvSimpson 5.4 5.4 5.7 6.0 5.5 8.7 21.3 24.3 70.5
Mock Abundances after Contaminant Removal
ASV_1 22.1 29.7 26.1 23.4 30.9 22.0 11.2 12.4 3.2
ASV_2 26.1 21.0 22.4 22.4 14.3 16.4 7.2 9.3 3.9
ASV_3 21.7 14.5 17.1 16.5 9.0 13.3 3.3 5.5 2.1
ASV_4 6.6 14.2 12.5 15.2 22.3 10.9 11.8 8.6 4.8
ASV_5 11.06 7.93 9.20 8.25 4.96 7.01 2.36 2.54 0.79
ASV_6 5.1 5.4 5.4 5.8 6.2 5.8 2.5 3.5 2.3
ASV_7 2.7 3.7 3.2 3.2 4.2 3.1 1.4 1.4 1.5
ASV_8 3.13 2.06 2.35 2.39 1.18 1.82 0.56 0.00 0.00
ASV_10 1.44 1.54 1.51 1.40 1.11 1.33 0.41 1.17 1.39
sensitivity 1.000 1.000 0.854 0.685 0.548 0.421 0.201 0.013 0.010
specificty 1 1 1 1 1 1 1 1 1
accuracy 1.00 1.00 1.00 0.99 0.95 0.84 0.48 0.45 0.21
prevalence 0.0005 0.0014 0.0180 0.0445 0.1196 0.2786 0.6448 0.5577 0.8006
sample_names D0 D1 D2 D3 D4 D5 D6 D7 D8
results_filter_1 <- abundance_filter_eval(mock_ps,1)
## Joining, by = c("id", "color", "sample", "count")
## Warning in estimate_richness(physeq_filter, measures = c("Observed", "Shannon", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.

summary_table(results_filter_1, 'Abundance filter, 1%')
Abundance filter, 1% summary
Percent Removed
contaminants_removed 100 100 100 100 100 96 77 74 75
mock_ASVs_removed 0.0 0.0 0.0 0.0 0.0 0.0 2.4 0.0 3.9
Percent Remaining after Contaminant Removal
con_abundance 0.0 0.0 0.0 0.0 0.0 1.6 30.2 24.7 50.6
total_mock_abundance 100 100 100 100 100 98 70 75 49
Diversity Estimates after Contaminant Removal
Observed 9 9 9 9 9 10 13 18 17
Shannon 1.9 1.9 1.9 1.9 1.8 2.0 2.2 2.5 2.7
InvSimpson 5.4 5.4 5.6 5.8 4.9 6.0 7.2 8.9 13.6
Mock Abundances after Contaminant Removal
ASV_1 22.1 29.7 26.1 23.7 32.8 26.5 19.7 21.0 8.2
ASV_2 26.1 21.0 22.5 22.8 15.2 19.7 12.6 15.8 9.9
ASV_3 21.7 14.5 17.1 16.8 9.6 16.0 5.9 9.3 5.4
ASV_4 6.6 14.2 12.5 15.4 23.6 13.2 20.7 14.6 12.4
ASV_5 11.1 7.9 9.2 8.4 5.3 8.4 4.1 4.3 0.0
ASV_6 5.1 5.4 5.4 5.9 6.6 7.0 4.3 5.9 5.9
ASV_7 2.7 3.7 3.2 3.2 4.5 3.8 2.5 2.4 3.9
ASV_8 3.1 2.1 2.4 2.4 1.3 2.2 0.0 0.0 0.0
ASV_10 1.4 1.5 1.5 1.4 1.2 1.6 0.0 2.0 3.6
sensitivity 1.00 1.00 1.00 1.00 1.00 0.96 0.77 0.74 0.75
specificty 1.00 1.00 1.00 1.00 1.00 1.00 0.98 1.00 0.96
accuracy 1.00 1.00 1.00 1.00 1.00 0.99 0.84 0.85 0.80
prevalence 0.0005 0.0014 0.0180 0.0445 0.1196 0.2786 0.6448 0.5577 0.8006
sample_names D0 D1 D2 D3 D4 D5 D6 D7 D8

Save workspace

vars_to_keep <- ls(pattern = 'results')
vars_to_rm <- ls()
vars_to_rm <-vars_to_rm[!vars_to_rm %in% vars_to_keep]
rm(list = vars_to_rm)
rm(vars_to_rm, vars_to_keep, filter_results)
## Warning in rm(vars_to_rm, vars_to_keep, filter_results): object
## 'vars_to_rm' not found
## Warning in rm(vars_to_rm, vars_to_keep, filter_results): object
## 'vars_to_keep' not found
save.image('results_filtering.RData')

This work demonstrates using various filtering methods for identifying and removing contaminants from a 16S rRNA gene sequencing experiment. We were able to evaluate the performance of this method since we used a dilution series of a mock microbial community, where the expected composition of the samples are known. We also evaluated more sophisticated methods for contaminant identification and removal: the decontam frequency method, which is documented here and SourceTracker, which is documented here.