Introduction

Increasingly there is interest in studying the microbiome from environments that contain few microbes (low microbial biomass). 16S rRNA gene sequencing allows researchers to study microbes from environments in a high throughput culture independent manner, but recent research has highlighted that this technique is prone to contaminants when applied to samples originating from low microbial biomass environments.

We recently performed a study to understand the impact of decreasing microbial biomass on 16S rRNA gene sequencing experiments and evaluate the current computational strategies to control for contaminants (preprint available here ).

Here, we provide the compiled R Markdown document to reproduce the analysis presented in the manuscript, divided into 6 primary sections:

Brief overview of the mock microbial dilution series data set

The data set generated for this manuscript is a dilution series of a mock microbial community. ZymoBIOMIC mock community standards was used, which consisted of eight bacterial species (Pseudomonas aeruginosa, Escherichia coli, Salmonella enterica, Lactobacillus fermentum, Enterococcus faecalis, Staphylococcus aureus, Listeria monocytogenes, Bacillus subtilis) and two fungal species (Saccharomyces cerevisiae and Cryptococcus neoformans). The mock microbial community was diluted with microbial free water (Qiagen) in eight rounds of a serial three-fold dilution prior to DNA extraction.

The V4 region of the 16S rRNA gene was amplified and sequenced with Illumina MiSeq. The raw sequencing data is available in the Sequence Read Archive (SRA) under accession number SRP155048. Reads were processed using DADA2 to generate amplicon sequence variants (ASVs), which were used to generate the phyloseq object used in this analysis.

Initial analysis of the mock community dilution series data set

Set up the workspace

This analysis uses many open source packages that are available on CRAN and Bioconductor.

To run this analysis yourself, download the supplemental material along with the RMarkdown file. Save these files in the same directory on your computer.

# load libraries
library(phyloseq)
library(ggplot2) 
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.4
## Warning: package 'tidyr' was built under R version 3.4.4
## Warning: package 'readr' was built under R version 3.4.4
## Warning: package 'purrr' was built under R version 3.4.4
## Warning: package 'dplyr' was built under R version 3.4.4
## Warning: package 'forcats' was built under R version 3.4.4
library(RColorBrewer)
library(reshape2)
library(knitr)
options(digits=2)
library(dplyr)
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 3.4.4
library(cowplot)
## Warning: package 'cowplot' was built under R version 3.4.4
# save session info (packages and versions loaded)
session <- sessionInfo()

Create functions to use in this analysis

We created a function called expCompBarPlot to display the expected mock community ASVs in color and unexpected contaminants sequences in grayscale.

# Create function to plot bar plots with contaminants in grey scale and expected mock community sequences in color

 expCompBarPlot <- function(physeq, exp_taxa, title){
   ## physeq - phyloseq object that will be plotted
   ## exp_taxa - taxa that are expected to be in the mock community 
   ## title - title for plot
#set up data_table
  data_table <- as.data.frame(t(physeq@otu_table))

  data_table$reference = FALSE
  data_table$reference[rownames(data_table) %in% exp_taxa] = TRUE
  sample_names <- sample_names(physeq)
  data_table$id <- rownames(data_table)
  dilution_labels <- sample_data(physeq)$Dilutions

  taxa <-taxa_names(physeq)
  other_taxa <- taxa[!taxa %in% exp_taxa]

# define the colors to use for reference and non-reference OTUs/ASVs
  ref_colors <- setNames(c("#1F78B4", "#A6CEE3", "#33A02C", "#B2DF8A", "#E31A1C", "#FB9A99", "#FF7F00", "#FDBF6F", "#6A3D9A"), levels(as.factor(exp_taxa)))
set.seed(325)
  other_colors <- setNames(c(sample(grey.colors(5, start = 0.5, end = 0.9), sum(!data_table$reference), replace = TRUE)), levels(as.factor(other_taxa)))
  
# add a color variable to the data table
  data_table$color <- rep(NA, nrow(data_table))
  data_table$color[data_table$reference] <- ref_colors
  data_table$color[!data_table$reference] <- other_colors

# reshape the data table into a ggplot-friendly format, by gathering samples into a single column called "count"

  color_gg <- data_table %>% select(id, sample_names, color) %>% gather("sample", "count", sample_names)
  legend_color <- c(ref_colors,other_colors)
  data_gg <- data_table %>% gather("sample", "count", sample_names)

  data_gg <- inner_join(data_gg,color_gg)

# create the composition bar plot
  comp_bar <- ggplot(data_gg, aes(x = sample, y = count)) +
    geom_col(aes(fill = id, group = reference), width = 0.7, position = position_fill()) +
    labs(x = "Dilution Series", y = "Relative Abundance") +
    scale_fill_manual(values=legend_color) + 
    theme(legend.position = "none",
        axis.text = element_text(size = 10),
        axis.title = element_text(size = 12),
        plot.title = element_text(size = 12))

  # plot just a subset to get a reasonable sized legend
  data_gg_subset <- data_gg %>% subset(sample == 'D6') 
  data_gg_subset <- data_gg_subset %>% subset(count >0)
# to do - generalize
  data_gg_subset <- data_gg_subset %>% subset(id %in% c('ASV_1','ASV_2','ASV_10','ASV_3','ASV_4','ASV_5','ASV_6','ASV_7','ASV_8','ASV_16','ASV_12','ASV_15','ASV_18'))
  # force order
  data_gg_subset$id <- factor(data_gg_subset$id, c('ASV_1','ASV_2','ASV_10','ASV_3','ASV_4','ASV_5','ASV_6','ASV_7','ASV_8','ASV_16','ASV_12','ASV_15','ASV_18'))
  legend_taxonomy <- data_gg_subset$id
  # modifications
    levels(legend_taxonomy) <- list(Lactobacillus = 'ASV_1',  Escherichia_Shigella = 'ASV_2' , Staphylococcus = 'ASV_10', Salmonella_ASV1 = 'ASV_3' , Bacillus = 'ASV_4',  Pseudomonas = 'ASV_5' , Listeria = 'ASV_6', Enterococcus = 'ASV_7' , Salmonella_ASV2 = 'ASV_8',  Contaminant_ASV = 'ASV_16', Contaminant_ASV = 'ASV_12' , Contaminant_ASV = 'ASV_15' , Contaminant_ASV = 'ASV_18')

  
  legend_color_subset <- legend_color[names(legend_color) %in%  data_gg_subset$id]
  comp_bar_subset <- ggplot(data_gg_subset, aes(x = sample, y = count)) + 
    geom_col(aes(fill = id, group = reference), width = 0.7, position = position_fill())  +
    scale_fill_manual(values=legend_color_subset, 
                      name = ' ',
                      breaks =  data_gg_subset$id,
                      labels =  legend_taxonomy)
legend <- get_legend(comp_bar_subset)
comp_legend <- as_ggplot(legend)




comp_bar_components <- list(comp_bar,comp_legend)
return(comp_bar_components)

}

Load the dataset

This analysis assumes that you have the RData file and R Markdown file saved in the same directory.

## Load the dataset
load("mockDilutions.RData")

This data set includes 3 phyloseq objects:

  • ps - phyloseq object with the mock community dilution series and blank control
  • mock_ps - phyloseq object with the mock community dilution series only
  • blank_ps - phyloseq object with the blank control only

and one other object: * asv_key - a matrix containing ASV name : ASV sequence pairs

mock_ps and blank_ps were created from the ps phyloseq object by subsetting the desired samples to make the summaries of these samples more straight forward. The asv_key object was created to make downstream plotting and summarizing easier while being able to retain the full asv sequences for each asv.

Evaluating the undiluted mock community sample

We expect the undiluted mock community sample to mostly contain the expected sequences from the mock community. To evaluate this, we subset out the undiluted mock community sample (D0) as a new separate phyloseq object and remove the ASVs that are not present in the sample. Since the mock community contains eight bacteria, we expect there to be eight ASVs.

# Create profile of only expected sequences from the undiluted mock communtiy sample

# subset the undiluted mock microbial sample (sample name 'D0')
mock_ps_pure <- subset_samples(mock_ps,sample_names(mock_ps)== 'D0')

# remove ASVs that are not present in the undiluted sample
mock_ps_pure <- prune_taxa(taxa_sums(mock_ps_pure)>0,mock_ps_pure)

# change the SampleType and sample_names of the pure mock microbial community sample 
sample_data(mock_ps_pure)$SampleType <-'MockCommunityProfile'
sample_names(mock_ps_pure) <-paste('mc',sample_names(mock_ps_pure),sep = '_')

# display a summary of the new phyloseq object
mock_ps_pure
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 18 taxa and 1 samples ]
## sample_data() Sample Data:       [ 1 samples by 6 sample variables ]
## tax_table()   Taxonomy Table:    [ 18 taxa by 6 taxonomic ranks ]

The phyloseq object for the undiluted mock community contains 18 ASVs instead of the expected eight ASVs. Further inspection of these ASVs indicates the eight expected sequences are present, and that there are 10 additional unexpected ASVs. One ASV is one nucleotide off from an expected sequence and is present in high abundance. This likely represents an actual mutation that occurred in our mock community. The other nine sequences are present in low abundance (< 60 reads out of a total of 222,159 reads) and are distinct from the expected sequences. These are low level contaminants which may be due to sample ‘cross talk’ - errors that occur from barcode switching during sequencing. We will remove these ASVs from the mock_ps_pure phyloseq object.

# Remove the unexpected ASVs from the undiluted mock microbial community dilution series

# make a list of the top 9 abundant ASV taxa names (this is plausible for filtering since the 9 sequences we want to remove are present in low abundance)
mock_taxa = names(sort(taxa_sums(mock_ps_pure), decreasing = TRUE)[1:9])

# subset the taxa in mock_ps_pure so only the expected sequences are present
mock_ps_pure<-prune_taxa(mock_taxa,mock_ps_pure)

Identifying the impact of contaminants with decreasing starting material

Now that we know what the expected sequences of our mock microbial community are in our data set, we evaluate the unexpected sequences across the dilution series.

# display a summary of the mock community dilution series phyloseq object
mock_ps
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 946 taxa and 9 samples ]
## sample_data() Sample Data:       [ 9 samples by 6 sample variables ]
## tax_table()   Taxonomy Table:    [ 946 taxa by 6 taxonomic ranks ]

A brief overview of the phyloseq object indicates that there are 755 ASVs in the mock microbial dilution series- that is a LOT more than the expected nine ASVs!

Now we identify how the decreasing starting material impacts the number of reads per sample and the proportion of reads per sample that are unexpected contaminant sequences.

# create a phyloseq object that is normalized to 100 (relative abundance)
ps_norm <- transform_sample_counts(ps,function(x) 100* x/sum(x))
mock_ps_norm <- transform_sample_counts(mock_ps,function(x) 100* x/sum(x))

# Identify the proportion of each sample that is the expected mock community ASVs
ps_norm_exp <- prune_taxa(mock_taxa,ps_norm)

# Create a table with the dilution, number of reads per sample, and proportion of contaminants per sample
dilutionSummary <- data.frame(DilutionSeries = sample_names(ps),NumberOfReads = sample_sums(ps), PercentContaminants = 100-sample_sums(ps_norm_exp))

# Create a variable to indicate the sample order of the plots
dilutions<-c('D0','D1','D2','D3','D4','D5','D6','D7','D8')

# Create plots to summarize these data
## Plot Figure 1A - number of reads per sample across dilution series
fig_1a <- dilutionSummary %>% subset(DilutionSeries %in% dilutions[1:9] ) %>%
  ggplot(., aes(x = DilutionSeries, y = NumberOfReads)) + geom_bar(stat="identity", fill="steelblue") +
  theme_minimal() + scale_x_discrete(limits = dilutions) + 
  labs(x = "Dilution Series", y = "Number of Reads") +
  theme(axis.text = element_text(size = 10),
        axis.title = element_text(size = 12),
        plot.title = element_text(size = 12))

## Plot Figure 1B - Percent of contaminants across dilution series
fig_1b <- dilutionSummary %>% subset(DilutionSeries %in% dilutions[1:9] ) %>%
  ggplot(., aes(x = DilutionSeries, y = PercentContaminants)) + geom_point(size = 3) +
  scale_x_discrete(limits = dilutions) +   
  labs(x = "Dilution Series", y = "Percent Contaminants") +
  theme(axis.text = element_text(size = 10),
        axis.title = element_text(size = 12),
        plot.title = element_text(size = 12))

## Plot Figure 1C - Stacked bar plot of Mock microbial dilution series
fig_1c_components <- expCompBarPlot(mock_ps_norm, mock_taxa, '') 
## Joining, by = c("id", "color", "sample", "count")
fig_1c <- fig_1c_components[[1]]
fig_1c_legend <- fig_1c_components[[2]]

# create plot for manuscript

figure_1 <- ggdraw() +
  draw_plot(fig_1a, x = 0, y = .5, width = .5, height = .48) +
  draw_plot(fig_1b, x = .5, y = .5, width = .5, height = .48) +
  draw_plot(fig_1c, x = 0, y = 0, width = .7, height = 0.5) +
  draw_plot(fig_1c_legend, x = .7, y = 0, width = 0.3, height = 0.5) +
  draw_plot_label(label = c("A", "B", "C"), size = 12,
                  x = c(0, 0.5, 0), y = c(1, 1, 0.5))

figure_1

ggsave('mSystems00290-19_Figure_1.tiff', figure_1, device = 'tiff',units = 'in',  width = 6.8, height= 6, dpi = 300)

## Table 1 components
# Number of sequences per sample
sample_sums(mock_ps)
##     D0     D1     D2     D3     D4     D5     D6     D7     D8 
## 251419 172915 250861 247581 216341 136081 128053  41071  40927
# number of ASVs per sample 
temp <- as.data.frame(ps@otu_table)
temp[temp >0] <- 1
rowSums(temp)
## Blank    D0    D1    D2    D3    D4    D5    D6    D7    D8 
##   665    18    20   114   172   262   312   381   147   193
# number of known genera per sample 
temp_genus <- tax_glom(ps, "Genus")
temp_genus <- as.data.frame(temp_genus@otu_table)
# change to binary to count unique genera per sample
temp_genus[temp_genus >0] <- 1
rowSums(temp_genus)
## Blank    D0    D1    D2    D3    D4    D5    D6    D7    D8 
##   136    15    16    64    61    80    86   107    58    74

Evalutate the blank control

# subset the normalized blank control sample 
blank_ps_norm <- subset_samples(ps_norm,sample_names(ps_norm) %in% c('Blank'))
blank_ps_norm <- prune_taxa(taxa_sums(blank_ps_norm) > 0, blank_ps_norm)
blank_ps_norm
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 665 taxa and 1 samples ]
## sample_data() Sample Data:       [ 1 samples by 6 sample variables ]
## tax_table()   Taxonomy Table:    [ 665 taxa by 6 taxonomic ranks ]
#identify ASVs from mock community present in blank
table(taxa_names(blank_ps_norm) %in% mock_taxa)
## 
## FALSE  TRUE 
##   662     3
# Get genera present with >5% abundance
blank_ps_norm %>% subset_taxa(taxa_names(blank_ps_norm) %in% taxa_names(mock_ps_pure)) %>% psmelt() %>% select(Abundance, Family, Genus)
##   Abundance             Family                Genus
## 2     1.109 Enterobacteriaceae Escherichia/Shigella
## 1     0.038  Staphylococcaceae       Staphylococcus
## 3     0.026    Enterococcaceae         Enterococcus
# Collapse at the genus level, keeping unassigned genera
blank_ps_norm <- tax_glom(blank_ps_norm, "Genus", NArm = FALSE)
blank_ps_norm
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 158 taxa and 1 samples ]
## sample_data() Sample Data:       [ 1 samples by 6 sample variables ]
## tax_table()   Taxonomy Table:    [ 158 taxa by 6 taxonomic ranks ]
blank_ps_norm_melt <- psmelt(blank_ps_norm)
# Get genera present with >5% abundance
blank_ps_norm_melt %>% filter(Abundance > 5) %>% select(Abundance, Family, Genus)
## Warning: package 'bindrcpp' was built under R version 3.4.4
##   Abundance          Family               Genus
## 1      15.1  Bacteroidaceae         Bacteroides
## 2       6.9 Lachnospiraceae                <NA>
## 3       6.3 Ruminococcaceae    Faecalibacterium
## 4       6.0 Ruminococcaceae Ruminiclostridium_6
# Get genera present with abundance between 1 and 5%
blank_ps_norm_melt %>% filter(Abundance <= 5) %>% filter(Abundance > 1) %>% select(Abundance, Family, Genus)
##    Abundance              Family                         Genus
## 1        3.2     Lachnospiraceae Lachnospiraceae_NK4A136_group
## 2        3.2     Lachnospiraceae                       Blautia
## 3        3.0     Lachnospiraceae                  Agathobacter
## 4        2.8      Prevotellaceae        Prevotellaceae_UCG-001
## 5        2.6    Lactobacillaceae                 Lactobacillus
## 6        2.5      Prevotellaceae   Prevotellaceae_NK3B31_group
## 7        2.3     Lachnospiraceae                     Roseburia
## 8        2.2     Ruminococcaceae                Ruminococcus_2
## 9        2.2      Muribaculaceae                          <NA>
## 10       2.1     Lachnospiraceae                  Anaerostipes
## 11       1.9     Ruminococcaceae               Subdoligranulum
## 12       1.7     Ruminococcaceae                Ruminococcus_1
## 13       1.7      Prevotellaceae    Prevotellaceae_Ga6A1_group
## 14       1.5     Lachnospiraceae                  Butyrivibrio
## 15       1.5     Lachnospiraceae                   Lachnospira
## 16       1.5 Erysipelotrichaceae                  Turicibacter
## 17       1.4     Lachnospiraceae              Fusicatenibacter
## 18       1.3     Ruminococcaceae       Ruminococcaceae_UCG-014
## 19       1.3     Ruminococcaceae                          <NA>
## 20       1.1  Enterobacteriaceae          Escherichia/Shigella
## 21       1.1                <NA>                          <NA>

Evaluating contaminant ASVs

To evaluate the contaminant ASVs, we create a phyloseq object only containing the contaminant ASVs.

# create a list of unexpected sequences (contaminants)

# create a list of all ASV taxa names
contaminant_taxa<-taxa_names(mock_ps)
# remove the expected mock community ASV taxa names
contaminant_taxa <- contaminant_taxa[!(contaminant_taxa %in% mock_taxa)]

# create a phyloseq object that only contains the contaminant sequences (for use with sourcetracker)
contaminants_ps<-prune_taxa(contaminant_taxa,mock_ps)
contaminants_ps<- prune_taxa(taxa_sums(contaminants_ps)>0,contaminants_ps)

# change the sample names to indicate that these samples only contain contmaminant ASVs
sample_names(contaminants_ps)<-paste('con',sample_names(contaminants_ps),sep = '_')
sample_data(contaminants_ps)$SampleType<-'ContaminantProfile'

# create phyloseq object from normalized data to summarize contamiant contribution
contaminants_ps_norm<-prune_taxa(contaminant_taxa,mock_ps_norm)
contaminants_ps_norm<- prune_taxa(taxa_sums(contaminants_ps_norm)>0,contaminants_ps_norm)

Now, we use the contaminants_ps object to evaluate the number contaminant ASVs that are present in the blank control sample.

# Number of contaminant ASVs across dilution series
print(paste('Total number of contaminant ASVs', length(taxa_names(contaminants_ps))))
## [1] "Total number of contaminant ASVs 937"
# abundance of contaminant ASVs across samples
sample_sums(contaminants_ps_norm)
##    D0    D1    D2    D3    D4    D5    D6    D7    D8 
##  0.05  0.14  1.80  4.45 11.96 27.86 64.48 55.77 80.06
# Abundance of contaminant genera diltuion series
contaminant_genera <- tax_glom(contaminants_ps_norm, 'Genus', NArm = FALSE) %>% 
  psmelt() 

# Maximum abundance of contaminant genera per diltuion sample
contaminant_genera %>%
  group_by(Sample) %>%
  filter(Abundance == max(Abundance)) %>% 
  select(Sample, Abundance)
## # A tibble: 9 x 2
## # Groups:   Sample [9]
##   Sample Abundance
##   <chr>      <dbl>
## 1 D6       17.4   
## 2 D8       12.3   
## 3 D7       10.6   
## 4 D5        3.69  
## 5 D4        1.53  
## 6 D3        1.45  
## 7 D2        0.263 
## 8 D1        0.0497
## 9 D0        0.0203
# Create list of contamiant ASVs with abundance > 1% (Supplemental Table 2)
contam_table <-contaminants_ps_norm %>% 
  psmelt() %>%
  group_by(OTU) %>%
  mutate(max_abundance = max(Abundance)) %>% 
  filter(max_abundance > 1) %>%
  mutate(in_n_samples = sum(Abundance > 0)) %>%
  mutate(in_blank = OTU %in% taxa_names(blank_ps_norm)) %>%
  mutate(asv_sequence = asv_key[asv_key$asv_name == OTU,]$asv_sequence) %>%
  select( Phylum, Class, Order, Family, Genus,max_abundance, in_n_samples, in_blank, asv_sequence) %>% 
  unique()
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Adding missing grouping variables: `OTU`
# number of ASVs present in blank sample
print(paste('Number of contaminant ASVs also present in blank', length(intersect(taxa_names(contaminants_ps),taxa_names(blank_ps)))))
## [1] "Number of contaminant ASVs also present in blank 194"
# create a list of contaminants taxa that are not present in the blank control
contaminant_taxa_no_blank<-taxa_names(contaminants_ps)
contaminant_taxa_no_blank <- contaminant_taxa_no_blank[!(contaminant_taxa_no_blank %in% taxa_names(blank_ps))]

# Create  a binary list of contaminant ASVs indicating if the ASV is present in the blank control (1) or not (0)
contaminants_in_blank <- data.frame(matrix(1, ncol = length(taxa_names(contaminants_ps)), nrow = 1))
colnames(contaminants_in_blank) <- taxa_names(contaminants_ps)
contaminants_in_blank[,contaminant_taxa_no_blank] <- 0 
contaminants_in_blank <- t(contaminants_in_blank)

We found that there are a lot of contaminant ASVs present in the dilution series that aren’t in the blank control sample.

We also determine the proportion of contaminant ASVs that are not in found in the blank control to help us identify the actual impact of these.

# Identify the contribution per sample of contaminants that are not present in blanks
# generate a phyloseq object with contaminants only normalized to 100 
contaminant_ps_norm <- transform_sample_counts(contaminants_ps,function(x) 100* x/sum(x))
contaminant_no_blanks<-prune_taxa(contaminant_taxa_no_blank,contaminant_ps_norm)

# Plot the proportion of contaminant ASVs per sample that were not present in the blank control
plot_bar(contaminant_no_blanks,fill='Genus',, title = ' Proportion of Contaminant ASVs Not in the Blank Control Sample') + theme(legend.position='none') + ylim(c(0,100))

sample_sums(contaminant_no_blanks)
## con_D0 con_D1 con_D2 con_D3 con_D4 con_D5 con_D6 con_D7 con_D8 
##     37     56     37     65     51     51     39     56     56
# sum and summarize the amount of contaminant signal arising from ASVs in blank control
100 - sample_sums(contaminant_no_blanks)
## con_D0 con_D1 con_D2 con_D3 con_D4 con_D5 con_D6 con_D7 con_D8 
##     63     44     63     35     49     49     61     44     44
summary(100 - sample_sums(contaminant_no_blanks))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      35      44      49      50      61      63

From this plot, it is evident that the contaminants not present in the blank control make up a significant proportion of the contaminants. This is a bit surprising, and may be due to the blank control not representing all contaminants. In this experiment all blank controls were pooled into a single sample- it likely would have been better to not pool the blank controls.

We also evaluated how many times each contaminant ASV is present across the dilution series.

# Count number of contaminants present in only one sample
contaminant_bin<-as.data.frame(contaminants_ps@otu_table)
contaminant_bin[contaminant_bin>0]<-1
contaminant_bin = t(contaminant_bin)
contaminant_nsamples <- rowSums(contaminant_bin)
table(contaminant_nsamples)
## contaminant_nsamples
##   1   2   3   4   5   6   7   8   9 
## 675 106  66  38  26  12  12   1   1
# summarize information about contaminant ASVs only in one sample
contaminant_one_sample <- rownames(contaminant_bin[rowSums(contaminant_bin)==1,])
temp <- contaminants_ps %>% prune_taxa(contaminant_one_sample,.) %>% taxa_sums(.)
min(temp)
## [1] 2
max(temp)
## [1] 696
mean(temp)
## [1] 78

Impact of contaminants on alpha diversity

Next, we evaluate the impact that the contaminant ASVs have on commonly used alpha diversity metrics.

mock_alpha <- estimate_richness(mock_ps, measures=c("Observed", "Shannon", "InvSimpson"))
mock_alpha_pure <- estimate_richness(mock_ps_pure,  measures=c("Observed", "Shannon", "InvSimpson"))
plot_richness(mock_ps, measure = c('Observed','Shannon', 'InvSimpson') )

# calculate the difference between the observed and actual alpha diversity measures 
max_diff <- ((mock_alpha[9,] - mock_alpha_pure) / mock_alpha_pure) 

Finally, we save the workspace image so that we can revisit the analysis if needed and so that we can easily load the processed data for the next steps of analysis.

# Save workspace
save.image("mockDilutionsPrep.RData")

This work outlines each step required for evaluating the impact of contaminants on 16S rRNA gene experiments with samples that have varying starting material. The conclusions from this analysis is that contaminant ASVs increase with decreasing starting material and lead to distorted microbial community profiles and overestimated alpha diversity measurements.

While contaminants are prevalent in low microbial biomass samples, and potentially unavoidable, all hope is not lost. There are a few computational approaches to identify and remove contaminant sequences from 16S rRNA gene sequencing data. To see how we evaluated these different methods for identifying and removing contaminants from this data set, see: * Removing contaminant ASVs with decontam * Removing contaminant ASVs with SourceTracker * Evaluating SourceTracker results