Increasingly there is interest in studying the microbiome from environments that contain few microbes (low microbial biomass). 16S rRNA gene sequencing allows researchers to study microbes from environments in a high throughput culture independent manner, but recent research has highlighted that this technique is prone to contaminants when applied to samples originating from low microbial biomass environments.
We recently performed a study to understand the impact of decreasing microbial biomass on 16S rRNA gene sequencing experiments and evaluate the current computational strategies to control for contaminants (preprint available here ).
Here, we provide the compiled R Markdown document to reproduce the analysis presented in the manuscript, divided into 6 primary sections:
The data set generated for this manuscript is a dilution series of a mock microbial community. ZymoBIOMIC mock community standards was used, which consisted of eight bacterial species (Pseudomonas aeruginosa, Escherichia coli, Salmonella enterica, Lactobacillus fermentum, Enterococcus faecalis, Staphylococcus aureus, Listeria monocytogenes, Bacillus subtilis) and two fungal species (Saccharomyces cerevisiae and Cryptococcus neoformans). The mock microbial community was diluted with microbial free water (Qiagen) in eight rounds of a serial three-fold dilution prior to DNA extraction.
The V4 region of the 16S rRNA gene was amplified and sequenced with Illumina MiSeq. The raw sequencing data is available in the Sequence Read Archive (SRA) under accession number SRP155048. Reads were processed using DADA2 to generate amplicon sequence variants (ASVs), which were used to generate the phyloseq object used in this analysis.
This analysis uses many open source packages that are available on CRAN and Bioconductor.
To run this analysis yourself, download the supplemental material along with the RMarkdown file. Save these files in the same directory on your computer.
# load libraries
library(phyloseq)
library(ggplot2)
library(tidyverse)
library(RColorBrewer)
library(reshape2)
library(knitr)
library(dplyr)
# save session info (packages and versions loaded)
session <- sessionInfo()
We created a function called expCompBarPlot to display the expected mock community ASVs in color and unexpected contaminants sequences in grayscale.
# Create function to plot bar plots with contaminants in grey scale and expected mock community sequences in color
expCompBarPlot <- function(physeq, exp_taxa, title){
## physeq - phyloseq object that will be plotted
## exp_taxa - taxa that are expected to be in the mock community
## title - title for plot
#set up data_table
data_table <- as.data.frame(t(physeq@otu_table))
data_table$reference = FALSE
data_table$reference[rownames(data_table) %in% exp_taxa] = TRUE
sample_names <- sample_names(physeq)
data_table$id <- paste0('ASV_', 1:nrow(data_table))
dilution_labels <- sample_data(physeq)$Dilutions
set.seed(444)
# define the colors to use for reference and non-reference OTUs/ASVs
ref_colors <- brewer.pal(sum(data_table$reference), "Paired")
other_colors <- sample(grey.colors(5, start = 0.5, end = 0.9), sum(!data_table$reference), replace = TRUE)
# add a color variable to the data table
data_table$color <- rep(NA, nrow(data_table))
data_table$color[data_table$reference] <- ref_colors
data_table$color[!data_table$reference] <- other_colors
# reshape the data table into a ggplot-friendly format, by gathering samples into a single column called "count"
color_gg <- data_table %>% select(id, sample_names, color) %>% gather("sample", "count", sample_names)
legend_color <- c(bright = ref_colors[2], dull = other_colors[2])
data_gg <- data_table %>% gather("sample", "count", sample_names)
data_gg <- inner_join(data_gg,color_gg)
# create the composition bar plot
comp_bar <- ggplot(data_gg, aes(x = sample, y = count)) +
geom_col(aes(fill = color, group = reference, alpha = ifelse(reference, "bright", "dull")), width = 0.7, position = position_fill()) +
scale_fill_identity(guide = FALSE) +
scale_alpha_manual(name = "Sequence type",
labels = c("expected sequences", "other"),
values = c(bright = 1, dull = 1),
guide = guide_legend(override.aes = list(fill = c(ref_colors[4], "#AEAEAE")),
keywidth = NULL, keyheight = NULL)) +
labs(title = title, x = "sample", y = "Relative Abundance") +
theme(legend.position = "right", legend.title = element_text(size = 12),
axis.text = element_text(size = 12),
axis.title = element_text(size = 16),
plot.title = element_text(size = 16))
comp_bar
}
This analysis assumes that you have the RData file and R Markdown file saved in the same directory.
## Load the dataset
load("mockDilutions.RData")
This data set includes 3 phyloseq objects:
and one other object: * asv_key - a matrix containing ASV name : ASV sequence pairs
mock_ps and blank_ps were created from the ps phyloseq object by subsetting the desired samples to make the summaries of these samples more straight forward. The asv_key object was created to make downstream plotting and summarizing easier while being able to retain the full asv sequences for each asv.
We expect the undiluted mock community sample to mostly contain the expected sequences from the mock community. To evaluate this, we subset out the undiluted mock community sample (D0) as a new separate phyloseq object and remove the ASVs that are not present in the sample. Since the mock community contains eight bacteria, we expect there to be eight ASVs.
# Create profile of only expected sequences from the undiluted mock communtiy sample
# subset the undiluted mock microbial sample (sample name 'D0')
mock_ps_pure <- subset_samples(mock_ps,sample_names(mock_ps)== 'D0')
# remove ASVs that are not present in the undiluted sample
mock_ps_pure <- prune_taxa(taxa_sums(mock_ps_pure)>0,mock_ps_pure)
# change the SampleType and sample_names of the pure mock microbial community sample
sample_data(mock_ps_pure)$SampleType <-'MockCommunityProfile'
sample_names(mock_ps_pure) <-paste('mc',sample_names(mock_ps_pure),sep = '_')
# display a summary of the new phyloseq object
mock_ps_pure
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 18 taxa and 1 samples ]
## sample_data() Sample Data: [ 1 samples by 6 sample variables ]
## tax_table() Taxonomy Table: [ 18 taxa by 6 taxonomic ranks ]
The phyloseq object for the undiluted mock community contains 18 ASVs instead of the expected eight ASVs. Further inspection of these ASVs indicates the eight expected sequences are present, and that there are 10 additional unexpected ASVs. One ASV is one nucleotide off from an expected sequence and is present in high abundance. This likely represents an actual mutation that occurred in our mock community. The other nine sequences are present in low abundance (< 60 reads out of a total of 222,159 reads) and are distinct from the expected sequences. These are low level contaminants which may be due to sample ‘cross talk’ - errors that occur from barcode switching during sequencing. We will remove these ASVs from the mock_ps_pure phyloseq object.
# Remove the unexpected ASVs from the undiluted mock microbial community dilution series
# make a list of the top 9 abundant ASV taxa names (this is plausible for filtering since the 9 sequences we want to remove are present in low abundance)
mock_taxa = names(sort(taxa_sums(mock_ps_pure), decreasing = TRUE)[1:9])
# subset the taxa in mock_ps_pure so only the expected sequences are present
mock_ps_pure<-prune_taxa(mock_taxa,mock_ps_pure)
Now that we know what the expected sequences of our mock microbial community are in our data set, we evaluate the unexpected sequences across the dilution series.
# display a summary of the mock community dilution series phyloseq object
mock_ps
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 946 taxa and 9 samples ]
## sample_data() Sample Data: [ 9 samples by 6 sample variables ]
## tax_table() Taxonomy Table: [ 946 taxa by 6 taxonomic ranks ]
A brief overview of the phyloseq object indicates that there are 755 ASVs in the mock microbial dilution series- that is a LOT more than the expected nine ASVs!
Now we identify how the decreasing starting material impacts the number of reads per sample and the proportion of reads per sample that are unexpected contaminant sequences.
# create a phyloseq object that is normalized to 100 (relative abundance)
ps_norm <- transform_sample_counts(ps,function(x) 100* x/sum(x))
mock_ps_norm <- transform_sample_counts(mock_ps,function(x) 100* x/sum(x))
# Identify the proportion of each sample that is the expected mock community ASVs
ps_norm_exp <- prune_taxa(mock_taxa,ps_norm)
# Create a table with the dilution, number of reads per sample, and proportion of contaminants per sample
dilutionSummary <- data.frame(DilutionSeries = sample_names(ps),NumberOfReads = sample_sums(ps), PercentContaminants = 100-sample_sums(ps_norm_exp))
# Create a variable to indicate the sample order of the plots
dilutions<-c('D0','D1','D2','D3','D4','D5','D6','D7','D8', 'Blank')
# Create plots to summarize these data
## Plot Figure 1A - number of reads per sample across dilution series
dilutionSummary %>% subset(DilutionSeries %in% dilutions[1:9] ) %>%
ggplot(., aes(x = DilutionSeries, y = NumberOfReads)) + geom_bar(stat="identity", fill="steelblue") +
theme_minimal() + scale_x_discrete(limits = dilutions) +
labs(x = "Dilution Series", y = "Number of Reads") +
theme(axis.text = element_text(size = 14),
axis.title = element_text(size = 16),
plot.title = element_text(size = 16))
## Plot Figure 1B - Percent of contaminants across dilution series
dilutionSummary %>% subset(DilutionSeries %in% dilutions[1:9] ) %>%
ggplot(., aes(x = DilutionSeries, y = PercentContaminants)) + geom_point(size = 3) +
scale_x_discrete(limits = dilutions) +
labs(x = "Dilution Series", y = "Number of Reads") +
theme(axis.text = element_text(size = 14),
axis.title = element_text(size = 16),
plot.title = element_text(size = 16))
## Plot Figure 1C - Stacked bar plot of Mock microbial dilution series
expCompBarPlot(ps_norm, mock_taxa, 'Initial Mock Microbial Community Dilution') + scale_x_discrete(limits = dilutions)
## Joining, by = c("id", "color", "sample", "count")
## Table 1
# Number of sequences per sample
sample_sums(mock_ps)
## D0 D1 D2 D3 D4 D5 D6 D7 D8
## 251419 172915 250861 247581 216341 136081 128053 41071 40927
# number of ASVs per sample
temp <- as.data.frame(ps@otu_table)
temp[temp >0] <- 1
rowSums(temp)
## Blank D0 D1 D2 D3 D4 D5 D6 D7 D8
## 665 18 20 114 172 262 312 381 147 193
# number of known genera per sample
temp_genus <- tax_glom(ps, "Genus")
temp_genus <- as.data.frame(temp_genus@otu_table)
# change to binary to count unique genera per sample
temp_genus[temp_genus >0] <- 1
rowSums(temp_genus)
## Blank D0 D1 D2 D3 D4 D5 D6 D7 D8
## 136 15 16 64 61 80 86 107 58 74
# subset the normalized blank control sample
blank_ps_norm <- subset_samples(ps_norm,sample_names(ps_norm) %in% c('Blank'))
blank_ps_norm <- prune_taxa(taxa_sums(blank_ps_norm) > 0, blank_ps_norm)
blank_ps_norm
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 665 taxa and 1 samples ]
## sample_data() Sample Data: [ 1 samples by 6 sample variables ]
## tax_table() Taxonomy Table: [ 665 taxa by 6 taxonomic ranks ]
# 655 unique ASVs
#identify ASVs from mock community present in blank
table(taxa_names(blank_ps_norm) %in% mock_taxa)
##
## FALSE TRUE
## 662 3
# Get genera present with >5% abundance
blank_ps_norm %>% subset_taxa(taxa_names(blank_ps_norm) %in% taxa_names(mock_ps_pure)) %>% psmelt() %>% select(Abundance, Family, Genus)
## Abundance Family Genus
## 2 1.10865797 Enterobacteriaceae Escherichia/Shigella
## 1 0.03793887 Staphylococcaceae Staphylococcus
## 3 0.02634643 Enterococcaceae Enterococcus
# Collapse at the genus level, keeping unassigned genera
blank_ps_norm <- tax_glom(blank_ps_norm, "Genus", NArm = FALSE)
blank_ps_norm
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 158 taxa and 1 samples ]
## sample_data() Sample Data: [ 1 samples by 6 sample variables ]
## tax_table() Taxonomy Table: [ 158 taxa by 6 taxonomic ranks ]
blank_ps_norm_melt <- psmelt(blank_ps_norm)
# Get genera present with >5% abundance
blank_ps_norm_melt %>% filter(Abundance > 5) %>% select(Abundance, Family, Genus)
## Abundance Family Genus
## 1 15.077011 Bacteroidaceae Bacteroides
## 2 6.913831 Lachnospiraceae <NA>
## 3 6.280990 Ruminococcaceae Faecalibacterium
## 4 6.029645 Ruminococcaceae Ruminiclostridium_6
# Get genera present with abundance between 1 and 5%
blank_ps_norm_melt %>% filter(Abundance <= 5) %>% filter(Abundance > 1) %>% select(Abundance, Family, Genus)
## Abundance Family Genus
## 1 3.244300 Lachnospiraceae Lachnospiraceae_NK4A136_group
## 2 3.176853 Lachnospiraceae Blautia
## 3 2.979782 Lachnospiraceae Agathobacter
## 4 2.759526 Prevotellaceae Prevotellaceae_UCG-001
## 5 2.607243 Lactobacillaceae Lactobacillus
## 6 2.459176 Prevotellaceae Prevotellaceae_NK3B31_group
## 7 2.312690 Lachnospiraceae Roseburia
## 8 2.245770 Ruminococcaceae Ruminococcus_2
## 9 2.210466 Muribaculaceae <NA>
## 10 2.104026 Lachnospiraceae Anaerostipes
## 11 1.914859 Ruminococcaceae Subdoligranulum
## 12 1.742026 Ruminococcaceae Ruminococcus_1
## 13 1.730961 Prevotellaceae Prevotellaceae_Ga6A1_group
## 14 1.520716 Lachnospiraceae Butyrivibrio
## 15 1.476981 Lachnospiraceae Lachnospira
## 16 1.463808 Erysipelotrichaceae Turicibacter
## 17 1.424815 Lachnospiraceae Fusicatenibacter
## 18 1.326806 Ruminococcaceae Ruminococcaceae_UCG-014
## 19 1.324172 Ruminococcaceae <NA>
## 20 1.108658 Enterobacteriaceae Escherichia/Shigella
## 21 1.073354 <NA> <NA>
To evaluate the contaminant ASVs, we create a phyloseq object only containing the contaminant ASVs.
# create a list of unexpected sequences (contaminants)
# create a list of all ASV taxa names
contaminant_taxa<-taxa_names(mock_ps)
# remove the expected mock community ASV taxa names
contaminant_taxa <- contaminant_taxa[!(contaminant_taxa %in% mock_taxa)]
# create a phyloseq object that only contains the contaminant sequences (for use with sourcetracker)
contaminants_ps<-prune_taxa(contaminant_taxa,mock_ps)
contaminants_ps<- prune_taxa(taxa_sums(contaminants_ps)>0,contaminants_ps)
# change the sample names to indicate that these samples only contain contmaminant ASVs
sample_names(contaminants_ps)<-paste('con',sample_names(contaminants_ps),sep = '_')
sample_data(contaminants_ps)$SampleType<-'ContaminantProfile'
# create phyloseq object from normalized data to summarize contamiant contribution
contaminants_ps_norm<-prune_taxa(contaminant_taxa,mock_ps_norm)
contaminants_ps_norm<- prune_taxa(taxa_sums(contaminants_ps_norm)>0,contaminants_ps_norm)
Now, we use the contaminants_ps object to evaluate the number contaminant ASVs that are present in the blank control sample.
# Number of contaminant ASVs across dilution series
print(paste('Total number of contaminant ASVs', length(taxa_names(contaminants_ps))))
## [1] "Total number of contaminant ASVs 937"
# abundance of contaminant ASVs across samples
sample_sums(contaminants_ps_norm)
## D0 D1 D2 D3 D4 D5
## 0.05011554 0.13879652 1.79780835 4.45470371 11.95566259 27.86355186
## D6 D7 D8
## 64.48189422 55.76927759 80.06450509
# Abundance of contaminant genera diltuion series
contaminant_genera <- tax_glom(contaminants_ps_norm, 'Genus', NArm = FALSE) %>%
psmelt()
# Maximum abundance of contaminant genera per diltuion sample
contaminant_genera %>%
group_by(Sample) %>%
filter(Abundance == max(Abundance)) %>%
select(Sample, Abundance)
## # A tibble: 9 x 2
## # Groups: Sample [9]
## Sample Abundance
## <chr> <dbl>
## 1 D6 17.4
## 2 D8 12.3
## 3 D7 10.6
## 4 D5 3.69
## 5 D4 1.53
## 6 D3 1.45
## 7 D2 0.263
## 8 D1 0.0497
## 9 D0 0.0203
# Create list of contamiant ASVs with abundance > 1% (Supplemental Table 2)
contam_table <-contaminants_ps_norm %>%
psmelt() %>%
group_by(OTU) %>%
mutate(max_abundance = max(Abundance)) %>%
filter(max_abundance > 1) %>%
mutate(in_n_samples = sum(Abundance > 0)) %>%
mutate(in_blank = OTU %in% taxa_names(blank_ps_norm)) %>%
mutate(asv_sequence = asv_key[asv_key$asv_name == OTU,]$asv_sequence) %>%
select( Phylum, Class, Order, Family, Genus,max_abundance, in_n_samples, in_blank, asv_sequence) %>%
unique()
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Warning in `==.default`(asv_key$asv_name, OTU): longer object length is not
## a multiple of shorter object length
## Warning in is.na(e1) | is.na(e2): longer object length is not a multiple of
## shorter object length
## Adding missing grouping variables: `OTU`
# number of ASVs present in blank sample
print(paste('Number of contaminant ASVs also present in blank', length(intersect(taxa_names(contaminants_ps),taxa_names(blank_ps)))))
## [1] "Number of contaminant ASVs also present in blank 194"
# create a list of contaminants taxa that are not present in the blank control
contaminant_taxa_no_blank<-taxa_names(contaminants_ps)
contaminant_taxa_no_blank <- contaminant_taxa_no_blank[!(contaminant_taxa_no_blank %in% taxa_names(blank_ps))]
# Create a binary list of contaminant ASVs indicating if the ASV is present in the blank control (1) or not (0)
contaminants_in_blank <- data.frame(matrix(1, ncol = length(taxa_names(contaminants_ps)), nrow = 1))
colnames(contaminants_in_blank) <- taxa_names(contaminants_ps)
contaminants_in_blank[,contaminant_taxa_no_blank] <- 0
contaminants_in_blank <- t(contaminants_in_blank)
We found that there are a lot of contaminant ASVs present in the dilution series that aren’t in the blank control sample.
We also determine the proportion of contaminant ASVs that are not in found in the blank control to help us identify the actual impact of these.
# Identify the contribution per sample of contaminants that are not present in blanks
# generate a phyloseq object with contaminants only normalized to 100
contaminant_ps_norm <- transform_sample_counts(contaminants_ps,function(x) 100* x/sum(x))
contaminant_no_blanks<-prune_taxa(contaminant_taxa_no_blank,contaminant_ps_norm)
# Plot the proportion of contaminant ASVs per sample that were not present in the blank control
plot_bar(contaminant_no_blanks,fill='Genus',, title = ' Proportion of Contaminant ASVs Not in the Blank Control Sample') + theme(legend.position='none') + ylim(c(0,100))
sample_sums(contaminant_no_blanks)
## con_D0 con_D1 con_D2 con_D3 con_D4 con_D5 con_D6 con_D7
## 37.30159 56.25000 37.40576 64.83815 51.38604 50.70813 39.38187 55.55556
## con_D8
## 56.40869
# sum and summarize the amount of contaminant signal arising from ASVs in blank control
100 - sample_sums(contaminant_no_blanks)
## con_D0 con_D1 con_D2 con_D3 con_D4 con_D5 con_D6 con_D7
## 62.69841 43.75000 62.59424 35.16185 48.61396 49.29187 60.61813 44.44444
## con_D8
## 43.59131
summary(100 - sample_sums(contaminant_no_blanks))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.16 43.75 48.61 50.08 60.62 62.70
From this plot, it is evident that the contaminants not present in the blank control make up a significant proportion of the contaminants. This is a bit surprising, and may be due to the blank control not representing all contaminants. In this experiment all blank controls were pooled into a single sample- it likely would have been better to not pool the blank controls.
We also evaluated how many times each contaminant ASV is present across the dilution series.
# Count number of contaminants present in only one sample
contaminant_bin<-as.data.frame(contaminants_ps@otu_table)
contaminant_bin[contaminant_bin>0]<-1
contaminant_bin = t(contaminant_bin)
contaminant_nsamples <- rowSums(contaminant_bin)
table(contaminant_nsamples)
## contaminant_nsamples
## 1 2 3 4 5 6 7 8 9
## 675 106 66 38 26 12 12 1 1
# summarize information about contaminant ASVs only in one sample
contaminant_one_sample <- rownames(contaminant_bin[rowSums(contaminant_bin)==1,])
temp <- contaminants_ps %>% prune_taxa(contaminant_one_sample,.) %>% taxa_sums(.)
min(temp)
## [1] 2
max(temp)
## [1] 696
mean(temp)
## [1] 77.76444
Next, we evaluate the impact that the contaminant ASVs have on commonly used alpha diversity metrics.
mock_alpha <- estimate_richness(mock_ps, measures=c("Observed", "Shannon", "InvSimpson"))
mock_alpha_pure <- estimate_richness(mock_ps_pure, measures=c("Observed", "Shannon", "InvSimpson"))
plot_richness(mock_ps, measure = c('Observed','Shannon', 'InvSimpson') )
# calculate the difference between the observed and actual alpha diversity measures
max_diff <- ((mock_alpha[9,] - mock_alpha_pure) / mock_alpha_pure)
Finally, we save the work space image so that we can revisit the analysis if needed and so that we can easily load the processed data for the next steps of analysis.
# Save workspace
save.image("mockDilutionsPrep.RData")
This work outlines each step required for evaluating the impact of contaminants on 16S rRNA gene experiments with samples that have varying starting material. The conclusions from this analysis is that contaminant ASVs increase with decreasing starting material and lead to distorted microbial community profiles and overestimated alpha diversity measurements.
While contaminants are prevalent in low microbial biomass samples, and potentially unavoidable, all hope is not lost. There are a few computational approaches to identify and remove contaminant sequences from 16S rRNA gene sequencing data. To see how we evaluated these different methods for identifying and removing contaminants from this data set, see: * Removing contaminant ASVs with decontam * Removing contaminant ASVs with SourceTracker * Evaluating SourceTracker results