We recently performed a study to understand the impact of decreasing microbial biomass on 16S rRNA gene sequencing experiments and evaluate the current computational strategies to control for contaminants (preprint available here).
We provide the compiled R Markdown documents to reproduce the analysis presented in the manuscript, divided into 6 primary sections:
Here, we provide the compiled R Markdown document to reproduce evaluating SourceTracker results for identifying and removing contaminants as presented in the manuscript.
For more information about the experimental design and preprocessing of the data, see the Introduction and evaluating contaminants section. For more information about setting up and using SourceTracker, see the Removing contaminant ASVs with SourceTracker section.
SourceTracker was used to identify contaminants in a 16S rRNA sequencing experiment. More information about SourceTracker can be found on the SourceTracker website. SourceTracker is a bit resource intensive and can take a while to run, so the code used to run SourceTracker for this analysis is available in a separate Markdown document here.
SourceTracker uses a Bayesian approach to predict the proportion of unknown samples (‘sink’ samples) arising from defined microbial sources (‘source’ samples). In theory, SourceTracker can be used to identify the proportion of experimental samples that arise from contaminants.
We tested two scenarios for recovering the mock microbial community profiles from the mock microbial Samples. In the first scenario, the expected mock community profile served as a source environment, mimicking the scenario when the experimental source is well defined. In the second scenario, the expected mock microbial community is unknown, and the proportion of sequences not predicted to be from the blank control or contaminant profile is the contamination-corrected profile. The second scenario is the more commonly encountered scenario, where the low microbial biomass environment that is being studied is poorly defined. For each scenario, we evaluated the use of the blank control and a combination of blank control and contaminant profiles as the source environments.
This R Markdown file loads in the result files from running SourceTracker and summarizes how well SourceTracker performed by:
The output from running SourceTracker is a series of text files that each contain an otu table for each source environment. The sum of all of the ASVs over all text files sums to 1 for each sample. Thus, SourceTracker predicts the proportion of each individual ASV arising from each source. We will step through the analysis of the SourceTracker results in detail for the first set of results, and then run through the analysis of the other sets, indicating key differences.
To run this analysis yourself, download the data along with the R markdown file here.
# load libraries
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.1 ✔ dplyr 0.8.0.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(RColorBrewer)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(phyloseq)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(knitr)
options(digits=2)
# save session info (packages versions loaded)
session <- sessionInfo()
We will use a function called compBarPlot to display the expected mock microbial ASVs in color and unexpected contaminants sequences in gray scale. This is similar to the function created in the decontam analysis, but uses a data frame rather than a phyloseq object.
We also create a function remainCont to calculate the percent of remaining contaminants after running SourceTracker and mockCom to calculate the percent of mock microbial community ASVs present in each sample.
# Create function to plot bar plots with contaminants in grey scale and expected mock microbial sequences in color
compBarPlot <- function(otus, plotTitle){
#set up data_table
data_table <- as.data.frame(t(otus))
sample_names <- colnames(data_table)
data_table$reference = FALSE
data_table$reference[rownames(data_table) %in% mock_taxa] = TRUE
data_table$id <- paste0('ASV_', 1:nrow(data_table))
dilution_labels <- sample_names
set.seed(444)
# define the colors to use for reference and non-reference OTUs/ASVs
ref_colors <- brewer.pal(sum(data_table$reference), "Paired")
other_colors <- sample(grey.colors(5, start = 0.5, end = 0.9), sum(!data_table$reference), replace = TRUE)
# add a color variable to the data table
data_table$color <- rep(NA, nrow(data_table))
data_table$color[data_table$reference] <- ref_colors
data_table$color[!data_table$reference] <- other_colors
# reshape the data table into a ggplot-friendly format, by gathering samples into a single column called "count"
color_gg <- data_table %>% gather("sample", "count", sample_names)
legend_color <- c(bright = ref_colors[2], dull = other_colors[2])
data_gg <- data_table %>% gather("sample", "count", sample_names)
data_gg <- inner_join(data_gg,color_gg)
# create the composition bar plot
comp_bar <- ggplot(data_gg, aes(x = sample, y = count)) +
geom_col(aes(fill = color, group = reference, alpha = ifelse(reference, "bright", "dull")), width = 0.7, position = position_fill()) +
scale_fill_identity(guide = FALSE) +
scale_alpha_manual(name = "Sequence type",
labels = c("expected sequences", "other"),
values = c(bright = 1, dull = 1),
guide = guide_legend(override.aes = list(fill = c(ref_colors[4], "#AEAEAE")),
keywidth = NULL, keyheight = NULL)) +
scale_x_discrete(labels = dilution_labels) +
labs(title = plotTitle, x = "sample", y = "Relative Abundance") +
theme(legend.position = "right", legend.title = element_text(size = 12),
axis.text = element_text(size = 12),
axis.title = element_text(size = 12),
plot.title = element_text(size = 12))
comp_bar
}
# Calculate % of contaminants remaining in each sample
remainCont <- function(otus){
#set up data_table
# renormalize the recovered profile to equal 1
data_table <- otus/rowSums(otus)
data_table <- data_table[,colnames(data_table) %in% mock_taxa]
print((1 - rowSums(data_table)) * 100)
}
# Calculate % of mock community ASVs in each sample
mockCom <- function(otus){
#set up data_table
data_table <- otus[,colnames(otus) %in% mock_taxa]
print((rowSums(data_table)) * 100)
}
summary_table <- function(results, scenario){
# add scenario to results table
caption_text = paste0('SourceTracker ', scenario, ' summary')
drop_col <- c('true_neg','true_pos','false_neg','false_pos','method')
results <- results %>% select(-one_of(drop_col))
kable(t(results), digits = 2, caption = caption_text) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = F) %>%
group_rows("Proportion Removed", 1,2) %>%
group_rows("Percent Remaining after Contaminant Removal", 3,4) %>%
group_rows("Diversity Estimates after Contaminant Removal", 5,7) %>%
group_rows("Mock Abundances after Contaminant Removal", 8,16) %>%
group_rows("Classification Performance", 17,20)
}
# Function to load sourcetracker results
load_sourcetracker <- function(file, otu_key) {
otus <- read.delim(file, row.names = 1)
# subset only the Dilution data
dilutions <- c('D0', 'D1','D2','D3', 'D4', 'D5', 'D6', 'D7', 'D8')
otus<-otus[rownames(otus) %in% dilutions,]
return(otus)
}
summarize_st <- function (recovered_otus,all_otus,mock_taxa, scenario){
# for simplicity, all other classifications are 'contaminants'
removed_otus <- all_otus - recovered_otus
con_orig <- rowSums(all_otus[,!colnames(all_otus) %in% mock_taxa])
mock_orig <- rowSums(all_otus[,colnames(all_otus) %in% mock_taxa])
# % of mock community ASVs correctly classified as mock community ASVs
true_neg <- rowSums(recovered_otus[,colnames(recovered_otus) %in% mock_taxa])
# % of mock community incorrectly classified as non-mock community ASVs
false_neg <- rowSums(recovered_otus[,!colnames(recovered_otus) %in% mock_taxa])
# identify non-mock community ASVs correctly classified as not belonging to mock community
true_pos <- rowSums(removed_otus[,!colnames(removed_otus) %in% mock_taxa])
# identify mock community ASVs incorrectly classified as not belonging to mock community
false_pos <- rowSums(removed_otus[,colnames(removed_otus) %in% mock_taxa])
sensitivity <- true_pos/(true_pos + false_neg)
specificty <- true_neg/(true_neg + false_pos)
accuracy <- (true_pos + true_neg) / (false_pos + true_pos + false_neg + true_neg)
prevalence <- (true_pos + false_neg) / (false_pos + true_pos + false_neg + true_neg)
# Summarize results
## proportion of contaminants removed (of all total contaminant ASVs)
contaminants_removed = (rowSums(removed_otus[,!colnames(removed_otus) %in% mock_taxa])/ con_orig) * 100
## proportion of mock removed (of all total mock ASVs)
mock_ASVs_removed = (rowSums(removed_otus[,colnames(removed_otus) %in% mock_taxa])/ mock_orig) * 100
## total amount of conatmiants remaining
contaminants_remaining = rowSums(recovered_otus[,!colnames(recovered_otus) %in% mock_taxa]) *100
# calculate alpha diverity and summary of relative abundances (on re-normalized recovered data)
# create phyloseq object on the recovered otus
ps_st <- mock_ps
ps_st@otu_table <- otu_table(recovered_otus, taxa_are_rows = FALSE)
# Renormalize and transform into count matrix - not ideal, but necessary to compare to other data (to calculate diversity estimates)
ps_st <- transform_sample_counts(ps_st,function(x) 100000* x/sum(x))
otu_table(ps_st) <- otu_table(round(as.matrix((otu_table(ps_st)))), taxa_are_rows= FALSE)
diversity <- estimate_richness(ps_st, measures = c('Observed','Shannon','InvSimpson'))
# Calculate relative abundance on recovered otus
rel_abundance <- (rec_otus/rowSums(rec_otus))*100
mock_abundance <- rel_abundance[, colnames(rel_abundance) %in% mock_taxa]
total_mock_abundance <- rowSums(mock_abundance)
con_abundance <- rowSums(rel_abundance[,!colnames(rel_abundance) %in% mock_taxa])
# return results
results <- cbind(contaminants_removed, mock_ASVs_removed,con_abundance, total_mock_abundance, diversity, mock_abundance,sensitivity , specificty, accuracy, prevalence, true_pos, true_neg, false_pos, false_neg)
results <- results %>%
mutate(method = paste0('SourceTracker, ', scenario)) %>%
mutate(sample_names = rownames(results))
return(results)
}
plot_st_classes <- function(mock_taxa, recovered_otus, all_otus, st_method){
# for simplicity, all other classifications are 'contaminants'
removed_otus <- all_otus - recovered_otus
con_orig <- rowSums(all_otus[,!colnames(all_otus) %in% mock_taxa])
mock_orig <- rowSums(all_otus[,colnames(all_otus) %in% mock_taxa])
# % of mock community ASVs correctly classified as mock community ASVs
true_neg <- rowSums(recovered_otus[,colnames(recovered_otus) %in% mock_taxa])
# % of mock community incorrectly classified as non-mock community ASVs
false_neg <- rowSums(recovered_otus[,!colnames(recovered_otus) %in% mock_taxa])
# identify non-mock community ASVs correctly classified as not belonging to mock community
true_pos <- rowSums(removed_otus[,!colnames(removed_otus) %in% mock_taxa])
# identify mock community ASVs incorrectly classified as not belonging to mock community
false_pos <- rowSums(removed_otus[,colnames(removed_otus) %in% mock_taxa])
profile <- rbind(false_neg, false_pos,true_neg,true_pos)
long_profile <- melt(data = profile,
id.vars = rownames(),
variable.name = colnames(),
value.name = "Abundance"
)
names(long_profile)[names(long_profile)=="Var1"] <- "SequenceClass"
customPalette <- c('#969696','#bdbdbd', '#1B9E77', '#D95F02')
# Figures
classificationPlot <- ggplot(long_profile, aes(x = Var2, y = Abundance)) +
geom_col(aes(fill = SequenceClass), width = 0.7, position = position_fill()) +
scale_fill_manual(values=customPalette) + theme(text = element_text(size=12)) +
labs(x = "Sample", y = 'Proportion of Reads') +
ggtitle(paste0('Sequence classification for \n ', st_method)) +
theme(legend.position = "right", legend.title = element_text(size = 12),
axis.text = element_text(size = 12),
axis.title = element_text(size = 12),
plot.title = element_text(size = 12))
compositionPlot <- compBarPlot(recovered_otus, paste0('SourceTracker ', st_method))
# plot figures
grid.arrange(compositionPlot, classificationPlot, ncol = 2)
}
We created an R work space that just contains the ASV sequences of the expected mock microbial community. This will help us identify the ASVs that were correctly and incorrectly assigned to the mock microbial community.
##
# Load workspace with Mock microbial community taxa ASV identifiers
load("mock_taxa.RData")
# load initial phyloseq object
load("mockDilutions.RData")
This data set includes 1 object: mock_taxa - A list of the expected mock microbial community ASV sequences
In this case, the defined sources were:
The results are defined by:
# load results from running SourceTracker with contaminant profiles and black controls
rec_otus <- load_sourcetracker("./Sourcetracker_mock_cp_b_30000/full_results/mock_cp_b_MockCommunityProfile_contributions.txt", asv_key)
blank_otus <- load_sourcetracker("./Sourcetracker_mock_cp_b_30000/full_results/mock_cp_b_Blank_contributions.txt", asv_key)
cp_otus <- load_sourcetracker("./Sourcetracker_mock_cp_b_30000/full_results/mock_cp_b_ContaminantProfile_contributions.txt", asv_key)
unk_otus <- load_sourcetracker("./Sourcetracker_mock_cp_b_30000/full_results/mock_cp_b_Unknown_contributions.txt", asv_key)
We will also create a new variables, mock_orig and con_orig that stores the proportion of each sample arising from the mock community and contaminants, respectively. Since the data was subsampled prior to running SourceTracker, the proportion of contaminant ASVs will vary. To identify the proportion of the original sample that arose from contaminants, we sum the proportion of non-mock community ASVs from each of the SourceTracker result files. We then sum the total amount of contaminant ASVs for each sample.
# determine total amounts of conatmiants across the dataset
rec_mock <- rec_otus[,colnames(rec_otus) %in% mock_taxa]
blank_mock<- blank_otus[,colnames(blank_otus) %in% mock_taxa]
cp_mock <- cp_otus[,colnames(cp_otus) %in% mock_taxa]
unk_mock <- unk_otus[,colnames(unk_otus) %in% mock_taxa]
# Sum the total amounts of contaminants across contaminant ASVs
mock_otus <- rec_mock + blank_mock + cp_mock + unk_mock
# Sum the total amount of contaminants per Samples sample
mock_orig <- rowSums(mock_otus)
# determine total amounts of contaminants across the dataset
rec_con <- rec_otus[,!colnames(rec_otus) %in% mock_taxa]
blank_con<- blank_otus[,!colnames(blank_otus) %in% mock_taxa]
cp_con <- cp_otus[,!colnames(cp_otus) %in% mock_taxa]
unk_con <- unk_otus[,!colnames(unk_otus) %in% mock_taxa]
# Sum the total amounts of contaminants across contaminant ASVs
con_otus <- rec_con + blank_con + cp_con + unk_con
# Sum the total amount of contaminants per Samples sample
con_orig <- rowSums(con_otus)
#
all_otus <- rec_otus + blank_otus + cp_otus + unk_otus
To evaluate SourceTracker, we will generate a stacked bar plot of the SourceTracker predicted mock microbial community profile as well as evaluate the percentage of the samples that were correctly (indicated by the community abbreviation followed by _c) and incorrectly classified (indicated by the community abbreviation followed by _i).
To summarize the SourceTracker classifications, we will create stacked bar plots that summarize the proportion of reads correctly and incorrectly classified as arising from the mock microbial community (mock_c for correct, mock_i for incorrect), the contaminant profile (cp_c, cp_i), the blank (blank_c, blank_i), unknown (unk_c, unk_i).
all_otus <- rec_otus + blank_otus + cp_otus + unk_otus
results_st_sc1_case1 <- summarize_st(rec_otus,all_otus,mock_taxa, 'scenario 1 case 1')
summary_table(results_st_sc1_case1, 'scenario 1 case 1')
Proportion Removed | |||||||||
contaminants_removed | 98 | 99 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
mock_ASVs_removed | 0.00100 | 0.00033 | 0.00340 | 0.00035 | 0.00114 | 0.00368 | 0.01507 | 0.00225 | 0.00502 |
Percent Remaining after Contaminant Removal | |||||||||
con_abundance | 0.00067 | 0.00167 | 0.00272 | 0.00209 | 0.00454 | 0.00138 | 0.00754 | 0.00150 | 0.00167 |
total_mock_abundance | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
Diversity Estimates after Contaminant Removal | |||||||||
Observed | 9 | 11 | 9 | 9 | 10 | 9 | 16 | 10 | 9 |
Shannon | 1.9 | 1.9 | 1.9 | 1.9 | 1.8 | 1.9 | 1.8 | 1.8 | 1.9 |
InvSimpson | 5.4 | 5.4 | 5.6 | 5.8 | 4.9 | 5.8 | 4.9 | 5.4 | 6.2 |
Mock Abundances after Contaminant Removal | |||||||||
ASV_2 | 26 | 21 | 22 | 23 | 15 | 20 | 17 | 20 | 20 |
ASV_7 | 2.8 | 3.9 | 3.4 | 3.3 | 4.4 | 3.8 | 3.7 | 3.1 | 7.2 |
ASV_10 | 1.5 | 1.5 | 1.4 | 1.4 | 1.1 | 1.6 | 1.0 | 2.7 | 7.1 |
ASV_1 | 22 | 30 | 27 | 23 | 33 | 27 | 28 | 28 | 16 |
ASV_3 | 21.8 | 14.4 | 17.0 | 16.6 | 9.6 | 16.4 | 7.7 | 12.7 | 10.0 |
ASV_4 | 6.5 | 13.8 | 12.5 | 15.2 | 23.8 | 13.6 | 28.9 | 19.7 | 24.8 |
ASV_5 | 10.9 | 7.8 | 9.2 | 8.6 | 5.2 | 8.5 | 6.1 | 5.4 | 3.8 |
ASV_6 | 5.4 | 5.4 | 5.5 | 6.1 | 6.6 | 6.9 | 6.2 | 7.9 | 10.8 |
ASV_8 | 3.0 | 2.0 | 2.3 | 2.4 | 1.4 | 2.3 | 1.5 | 0.0 | 0.0 |
Classification Performance | |||||||||
sensitivity | 0.98 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
specificty | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
accuracy | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
prevalence | 0.0004 | 0.0013 | 0.0188 | 0.0434 | 0.1193 | 0.2756 | 0.6461 | 0.5565 | 0.8007 |
sample_names | D0 | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 |
plot_st_classes(mock_taxa,rec_otus,all_otus,'Scenario 1, Case 1' )
## Joining, by = c("reference", "id", "color", "sample", "count")
# Scenario 1, Case 2 - Well defined experimental environment, blank control as contaminant environments
In this case, the defined sources were:
The results are defined by:
We repeat the analysis that was performed for Scenario 1, Case 1.
# load results from running SourceTracker with contaminant profiles and black controls (Case 2 well-defined experimental community, blank only as contaminant profile)
rec_otus <- load_sourcetracker("./Sourcetracker_mock_b_30000/full_results/mock_b_MockCommunityProfile_contributions.txt", asv_key)
blank_otus <- load_sourcetracker("./Sourcetracker_mock_b_30000/full_results/mock_b_Blank_contributions.txt", asv_key)
unk_otus <- load_sourcetracker("./Sourcetracker_mock_b_30000/full_results/mock_b_Unknown_contributions.txt", asv_key)
# determine total amounts of conatmiants across the dataset
rec_con <- rec_otus[,!colnames(rec_otus) %in% mock_taxa]
blank_con<- blank_otus[,!colnames(blank_otus) %in% mock_taxa]
unk_con <- unk_otus[,!colnames(unk_otus) %in% mock_taxa]
con_otus <- rec_con + blank_con + unk_con
con_orig <- rowSums(con_otus)
# identify mock community ASVs correctly and incorrectly classified
mock_c <- rowSums(rec_otus[,colnames(rec_otus) %in% mock_taxa])
mock_i <- rowSums(rec_otus[,!colnames(rec_otus) %in% mock_taxa])
# identify contaminants correctly identified as contaminants
blank_c <- rowSums(blank_otus[,!colnames(blank_otus) %in% mock_taxa])
unk_c <- rowSums(unk_otus[,!colnames(unk_otus) %in% mock_taxa])
# identify mock microbial community ASVs incorrectly identified as contaminants
blank_i <- rowSums(blank_otus[,colnames(blank_otus) %in% mock_taxa])
unk_i <- rowSums(unk_otus[,colnames(unk_otus) %in% mock_taxa])
# Plot classifications per sample
st_profile <- rbind(mock_i,blank_i,unk_i,mock_c, blank_c, unk_c)
long_st_profile <- melt(data = st_profile,
id.vars = rownames(),
variable.name = colnames(),
value.name = "Abundance"
)
customPalette <- c('#969696','#bdbdbd','#f0f0f0', '#1B9E77', '#E7298A', '#7570B3')
# Figure 5C
ggplot(long_st_profile, aes(x = Var2, y = Abundance)) + geom_col(aes(fill = Var1), width = 0.7, position = position_fill()) + scale_fill_manual(values=customPalette) + theme(text = element_text(size=12)) + labs(x = "Samples", y = 'Proportion of Reads')
all_otus <- rec_otus + blank_otus + unk_otus
results_st_sc1_case2 <- summarize_st(rec_otus,all_otus,mock_taxa,'scenario 1 case 2')
summary_table(results_st_sc1_case2, 'scenario 1 case 2')
Proportion Removed | |||||||||
contaminants_removed | 98 | 98 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
mock_ASVs_removed | 0.0013 | 0.0000 | 0.0150 | 0.0299 | 0.0972 | 0.3201 | 13.5365 | 2.0781 | 21.7031 |
Percent Remaining after Contaminant Removal | |||||||||
con_abundance | 0.0010 | 0.0020 | 0.0024 | 0.0038 | 0.0027 | 0.0014 | 0.0022 | 0.0015 | 0.0021 |
total_mock_abundance | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
Diversity Estimates after Contaminant Removal | |||||||||
Observed | 10 | 11 | 10 | 10 | 9 | 9 | 11 | 10 | 9 |
Shannon | 1.9 | 1.9 | 1.9 | 1.9 | 1.8 | 1.9 | 1.8 | 1.8 | 1.9 |
InvSimpson | 5.4 | 5.4 | 5.6 | 5.8 | 4.9 | 5.7 | 5.1 | 5.3 | 6.1 |
Mock Abundances after Contaminant Removal | |||||||||
ASV_2 | 26 | 21 | 23 | 23 | 15 | 20 | 20 | 21 | 24 |
ASV_7 | 2.6 | 3.8 | 3.3 | 3.3 | 4.6 | 3.7 | 3.9 | 3.0 | 6.4 |
ASV_10 | 1.3 | 1.5 | 1.4 | 1.4 | 1.2 | 1.5 | 1.0 | 2.3 | 3.0 |
ASV_1 | 23 | 30 | 26 | 24 | 33 | 28 | 31 | 29 | 21 |
ASV_3 | 21.4 | 14.4 | 17.2 | 16.8 | 9.9 | 15.9 | 9.4 | 12.6 | 13.5 |
ASV_4 | 6.5 | 14.5 | 12.5 | 15.4 | 23.5 | 13.5 | 20.0 | 19.2 | 16.1 |
ASV_5 | 11.0 | 7.7 | 9.1 | 8.5 | 5.1 | 8.5 | 6.5 | 6.0 | 4.9 |
ASV_6 | 5.2 | 5.4 | 5.5 | 5.8 | 6.5 | 6.9 | 6.6 | 7.8 | 11.3 |
ASV_8 | 3.0 | 2.0 | 2.2 | 2.5 | 1.2 | 2.1 | 1.7 | 0.0 | 0.0 |
Classification Performance | |||||||||
sensitivity | 0.98 | 0.98 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
specificty | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.86 | 0.98 | 0.78 |
accuracy | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.95 | 0.99 | 0.96 |
prevalence | 0.00057 | 0.00113 | 0.01907 | 0.04263 | 0.11877 | 0.28253 | 0.64457 | 0.55697 | 0.80017 |
sample_names | D0 | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 |
plot_st_classes(mock_taxa,rec_otus,all_otus,'Scenario 1, Case 2' )
## Joining, by = c("reference", "id", "color", "sample", "count")
## Scenario 2, Case 3 - undefined experimental environment, well defined contaminant environments In this case, the defined sources were:
The results are defined by:
# load results from running SourceTracker with only contaminant profiles and black controls defined (Case 3)
rec_otus <- load_sourcetracker("./Sourcetracker_cp_b_30000/full_results/cp_b_Unknown_contributions.txt", asv_key)
cp_otus <- load_sourcetracker("./Sourcetracker_cp_b_30000/full_results/cp_b_ContaminantProfile_contributions.txt", asv_key)
blank_otus <- load_sourcetracker("./Sourcetracker_cp_b_30000/full_results/cp_b_Blank_contributions.txt", asv_key)
# determine total amounts of contaminants across the dataset
rec_con <- rec_otus[,!colnames(rec_otus) %in% mock_taxa]
blank_con<- blank_otus[,!colnames(blank_otus) %in% mock_taxa]
cp_con <- cp_otus[,!colnames(cp_otus) %in% mock_taxa]
con_otus <- rec_con + blank_con + cp_con
con_orig <- rowSums(con_otus)
# Identify mock community correctly and incorrectly classified
mock_c <- rowSums(rec_otus[,colnames(rec_otus) %in% mock_taxa])
mock_i <- rowSums(rec_otus[,!colnames(rec_otus) %in% mock_taxa])
# Identify contaminants correctly and incorrectly classified
cp_c <- rowSums(cp_otus[,!colnames(cp_otus) %in% mock_taxa])
blank_c <- rowSums(blank_otus[,!colnames(blank_otus) %in% mock_taxa])
cp_i <- rowSums(cp_otus[,colnames(cp_otus) %in% mock_taxa])
blank_i <- rowSums(blank_otus[,colnames(blank_otus) %in% mock_taxa])
st_profile <- rbind(mock_i ,blank_i,cp_i,mock_c, blank_c, cp_c)
long_st_profile <- melt(data = st_profile,
id.vars = rownames(),
variable.name = colnames(),
value.name = "Abundance"
)
customPalette <- c('#969696','#bdbdbd','#d9d9d9', '#1B9E77', '#E7298A', '#D95F02')
# Figure 5D
ggplot(long_st_profile, aes(x = Var2, y = Abundance)) + geom_col(aes(fill = Var1), width = 0.7, position = position_fill()) + scale_fill_manual(values=customPalette) + theme(text = element_text(size=12)) + labs(x = "Samples", y = 'Proportion of Reads')
all_otus <- rec_otus + blank_otus + cp_otus
results_st_sc2_case1 <- summarize_st(rec_otus,all_otus,mock_taxa,'scenario 2 case 1')
## Warning in estimate_richness(ps_st, measures = c("Observed", "Shannon", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
##
## We recommended that you find the un-trimmed data and retry.
summary_table(results_st_sc2_case1, 'scenario 2 case 1')
Proportion Removed | |||||||||
contaminants_removed | 2.00 | 2.12 | 1.09 | 0.85 | 1.03 | 5.14 | 39.68 | 2.58 | 67.00 |
mock_ASVs_removed | 0.0010 | 0.0010 | 0.0010 | 0.0017 | 0.0026 | 0.0032 | 0.0160 | 0.0068 | 0.0202 |
Percent Remaining after Contaminant Removal | |||||||||
con_abundance | 0.049 | 0.108 | 1.728 | 4.606 | 11.729 | 27.096 | 52.346 | 55.090 | 57.170 |
total_mock_abundance | 100 | 100 | 98 | 95 | 88 | 73 | 48 | 45 | 43 |
Diversity Estimates after Contaminant Removal | |||||||||
Observed | 13 | 20 | 98 | 165 | 259 | 310 | 379 | 147 | 193 |
Shannon | 1.9 | 1.9 | 2.0 | 2.2 | 2.5 | 3.4 | 4.3 | 4.1 | 4.4 |
InvSimpson | 5.4 | 5.4 | 5.8 | 6.4 | 6.2 | 10.9 | 19.3 | 23.9 | 31.5 |
Mock Abundances after Contaminant Removal | |||||||||
ASV_2 | 26.2 | 20.9 | 22.7 | 22.0 | 13.7 | 14.5 | 8.0 | 9.4 | 8.2 |
ASV_7 | 2.5 | 3.5 | 3.2 | 3.0 | 4.0 | 2.8 | 1.7 | 1.4 | 3.4 |
ASV_10 | 1.49 | 1.65 | 1.41 | 1.38 | 1.02 | 1.31 | 0.46 | 1.20 | 2.64 |
ASV_1 | 22 | 30 | 25 | 22 | 29 | 20 | 14 | 13 | 7 |
ASV_3 | 21.6 | 14.4 | 16.7 | 15.8 | 8.2 | 12.0 | 3.9 | 5.6 | 4.8 |
ASV_4 | 6.7 | 14.2 | 12.2 | 15.1 | 21.0 | 9.8 | 13.8 | 8.8 | 10.4 |
ASV_5 | 11.0 | 8.0 | 9.0 | 7.9 | 4.6 | 6.1 | 2.8 | 2.5 | 1.4 |
ASV_6 | 5.2 | 5.2 | 5.4 | 5.7 | 5.9 | 5.0 | 2.9 | 3.4 | 5.0 |
ASV_8 | 3.20 | 2.06 | 2.30 | 2.33 | 1.04 | 1.65 | 0.64 | 0.00 | 0.00 |
Classification Performance | |||||||||
sensitivity | 0.0200 | 0.0212 | 0.0109 | 0.0085 | 0.0103 | 0.0514 | 0.3968 | 0.0258 | 0.6700 |
specificty | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
accuracy | 1.00 | 1.00 | 0.98 | 0.95 | 0.88 | 0.73 | 0.61 | 0.46 | 0.74 |
prevalence | 0.0005 | 0.0011 | 0.0175 | 0.0464 | 0.1184 | 0.2815 | 0.6455 | 0.5573 | 0.8018 |
sample_names | D0 | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 |
plot_st_classes(mock_taxa,rec_otus,all_otus,'Scenario 2, Case 1' )
## Joining, by = c("reference", "id", "color", "sample", "count")
# Scenario 2, Case 4 - undefined experimental environment, blank control as contaminant environment In this case, the defined sources were:
The results are defined by:
# load results from running SourceTracker with only a black control defined (Case 4)
rec_otus <- load_sourcetracker("./Sourcetracker_b_30000/full_results/b_Unknown_contributions.txt", asv_key)
blank_otus <- load_sourcetracker("./Sourcetracker_b_30000/full_results/b_Blank_contributions.txt", asv_key)
# determine total amounts of conatmiants across the dataset
rec_con <- rec_otus[,!colnames(rec_otus) %in% mock_taxa]
blank_con<- blank_otus[,!colnames(blank_otus) %in% mock_taxa]
con_otus <- rec_con + blank_con
con_orig <- rowSums(con_otus)
# Identify mock community correctly and incorrectly classified
mock_c <- rowSums(rec_otus[,colnames(rec_otus) %in% mock_taxa])
mock_i <- rowSums(rec_otus[,!colnames(rec_otus) %in% mock_taxa])
# Identify contaminants correctly and incorrectly classified
blank_c <- rowSums(blank_otus[,!colnames(blank_otus) %in% mock_taxa])
blank_i <- rowSums(blank_otus[,colnames(blank_otus) %in% mock_taxa])
st_profile <- rbind(mock_i ,blank_i,mock_c, blank_c)
long_st_profile <- melt(data = st_profile,
id.vars = rownames(),
variable.name = colnames(),
value.name = "Abundance"
)
customPalette <- c('#969696','#bdbdbd', '#1B9E77', '#D95F02')
# Figure 5E
ggplot(long_st_profile, aes(x = Var2, y = Abundance)) + geom_col(aes(fill = Var1), width = 0.7, position = position_fill()) + scale_fill_manual(values=customPalette) + theme(text = element_text(size=12)) + labs(x = "Samples", y = 'Proportion of Reads')
all_otus <- rec_otus + blank_otus
results_st_sc2_case2 <- summarize_st(rec_otus,all_otus,mock_taxa, 'scenario 2 case 2')
## Warning in estimate_richness(ps_st, measures = c("Observed", "Shannon", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
##
## We recommended that you find the un-trimmed data and retry.
summary_table(results_st_sc2_case2, 'scenario 2 case 2')
Proportion Removed | |||||||||
contaminants_removed | 0.000 | 0.278 | 0.455 | 0.149 | 0.134 | 0.264 | 0.256 | 0.053 | 0.105 |
mock_ASVs_removed | 0.00100 | 0.00067 | 0.00136 | 0.00105 | 0.00227 | 0.00326 | 0.01805 | 0.00000 | 0.01675 |
Percent Remaining after Contaminant Removal | |||||||||
con_abundance | 0.023 | 0.120 | 1.822 | 4.460 | 11.706 | 28.400 | 64.859 | 55.463 | 80.089 |
total_mock_abundance | 100 | 100 | 98 | 96 | 88 | 72 | 35 | 45 | 20 |
Diversity Estimates after Contaminant Removal | |||||||||
Observed | 13 | 19 | 94 | 161 | 258 | 310 | 381 | 147 | 193 |
Shannon | 1.9 | 1.9 | 2.0 | 2.2 | 2.5 | 3.4 | 4.5 | 4.1 | 4.8 |
InvSimpson | 5.4 | 5.4 | 5.8 | 6.4 | 6.2 | 11.2 | 28.2 | 24.5 | 70.7 |
Mock Abundances after Contaminant Removal | |||||||||
ASV_2 | 26.3 | 20.9 | 22.2 | 21.8 | 13.3 | 14.4 | 6.3 | 9.2 | 3.8 |
ASV_7 | 2.6 | 3.6 | 3.1 | 3.1 | 4.1 | 2.7 | 1.1 | 1.4 | 1.5 |
ASV_10 | 1.55 | 1.61 | 1.42 | 1.46 | 1.02 | 1.22 | 0.35 | 1.12 | 1.24 |
ASV_1 | 21.9 | 29.7 | 25.7 | 22.4 | 29.4 | 19.4 | 9.9 | 12.4 | 3.3 |
ASV_3 | 21.6 | 14.7 | 17.0 | 15.8 | 8.3 | 11.7 | 2.8 | 5.4 | 2.1 |
ASV_4 | 6.8 | 14.0 | 12.2 | 15.0 | 20.6 | 9.3 | 10.0 | 8.7 | 4.7 |
ASV_5 | 10.94 | 8.11 | 9.05 | 7.97 | 4.57 | 6.19 | 2.06 | 2.65 | 0.83 |
ASV_6 | 5.3 | 5.2 | 5.3 | 5.7 | 5.9 | 5.2 | 2.1 | 3.8 | 2.4 |
ASV_8 | 2.94 | 2.04 | 2.34 | 2.32 | 1.12 | 1.57 | 0.49 | 0.00 | 0.00 |
Classification Performance | |||||||||
sensitivity | 0.00000 | 0.00278 | 0.00455 | 0.00149 | 0.00134 | 0.00264 | 0.00256 | 0.00053 | 0.00105 |
specificty | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
accuracy | 1.00 | 1.00 | 0.98 | 0.96 | 0.88 | 0.72 | 0.35 | 0.45 | 0.20 |
prevalence | 0.00023 | 0.00120 | 0.01830 | 0.04467 | 0.11720 | 0.28453 | 0.64913 | 0.55477 | 0.80103 |
sample_names | D0 | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 |
plot_st_classes(mock_taxa,rec_otus,all_otus,'Scenario 2, Case 2' )
## Joining, by = c("reference", "id", "color", "sample", "count")
vars_to_keep <- ls(pattern = 'results')
vars_to_rm <- ls()
vars_to_rm <-vars_to_rm[!vars_to_rm %in% vars_to_keep]
rm(list = vars_to_rm)
rm(vars_to_rm, vars_to_keep)
## Warning in rm(vars_to_rm, vars_to_keep): object 'vars_to_rm' not found
## Warning in rm(vars_to_rm, vars_to_keep): object 'vars_to_keep' not found
save.image('results_sourcetracker.RData')