Introduction

We recently performed a study to understand the impact of decreasing microbial biomass on 16S rRNA gene sequencing experiments and evaluate the current computational strategies to control for contaminants (preprint available here).

We provide the compiled R Markdown documents to reproduce the analysis presented in the manuscript, divided into 6 primary sections:

Introduction and evaluating contaminants
Removing contaminant ASVs with filtering methods
Removing contaminant ASVs with decontam
Removing contaminant ASVs with SourceTracker
Evaluating SourceTracker results (this document)
Final evaluation of all methods

Here, we provide the compiled R Markdown document to reproduce evaluating SourceTracker results for identifying and removing contaminants as presented in the manuscript.

For more information about the experimental design and preprocessing of the data, see the Introduction and evaluating contaminants section. For more information about setting up and using SourceTracker, see the Removing contaminant ASVs with SourceTracker section.

Using SourceTracker for contaminant identification and removal

SourceTracker was used to identify contaminants in a 16S rRNA sequencing experiment. More information about SourceTracker can be found on the SourceTracker website. SourceTracker is a bit resource intensive and can take a while to run, so the code used to run SourceTracker for this analysis is available in a separate Markdown document here.

Theory behind SourceTracker and its application for contaminant removal

SourceTracker uses a Bayesian approach to predict the proportion of unknown samples (‘sink’ samples) arising from defined microbial sources (‘source’ samples). In theory, SourceTracker can be used to identify the proportion of experimental samples that arise from contaminants.

We tested two scenarios for recovering the mock microbial community profiles from the mock microbial Samples. In the first scenario, the expected mock community profile served as a source environment, mimicking the scenario when the experimental source is well defined. In the second scenario, the expected mock microbial community is unknown, and the proportion of sequences not predicted to be from the blank control or contaminant profile is the contamination-corrected profile. The second scenario is the more commonly encountered scenario, where the low microbial biomass environment that is being studied is poorly defined. For each scenario, we evaluated the use of the blank control and a combination of blank control and contaminant profiles as the source environments.

This R Markdown file loads in the result files from running SourceTracker and summarizes how well SourceTracker performed by:

Identifying how much of the contaminants were removed with SourceTracker
Identifying how much of the sample is contaminants after applying SourceTracker
Identifying if any expected sequences were removed

The output from running SourceTracker is a series of text files that each contain an otu table for each source environment. The sum of all of the ASVs over all text files sums to 1 for each sample. Thus, SourceTracker predicts the proportion of each individual ASV arising from each source. We will step through the analysis of the SourceTracker results in detail for the first set of results, and then run through the analysis of the other sets, indicating key differences.

Set up the workspace

To run this analysis yourself, download the data along with the R markdown file here.

# load libraries
library(tidyverse)

## ── Attaching packages ──────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.1.1       ✔ purrr   0.3.2  
## ✔ tibble  2.1.1       ✔ dplyr   0.8.0.1
## ✔ tidyr   0.8.3       ✔ stringr 1.4.0  
## ✔ readr   1.3.1       ✔ forcats 0.4.0

## ── Conflicts ─────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(RColorBrewer)
library(reshape2)

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

library(phyloseq)
library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(knitr)
options(digits=2)

# save session info (packages versions loaded)
session <- sessionInfo()

Create functions to use in this analysis

We will use a function called compBarPlot to display the expected mock microbial ASVs in color and unexpected contaminants sequences in gray scale. This is similar to the function created in the decontam analysis, but uses a data frame rather than a phyloseq object.

We also create a function remainCont to calculate the percent of remaining contaminants after running SourceTracker and mockCom to calculate the percent of mock microbial community ASVs present in each sample.

# Create function to plot bar plots with contaminants in grey scale and expected mock microbial sequences in color
compBarPlot <- function(otus, plotTitle){
#set up data_table
  data_table <- as.data.frame(t(otus))

  sample_names <- colnames(data_table)
  data_table$reference = FALSE
  data_table$reference[rownames(data_table) %in% mock_taxa] = TRUE
  data_table$id <- paste0('ASV_', 1:nrow(data_table))
  dilution_labels <- sample_names

  set.seed(444)

# define the colors to use for reference and non-reference OTUs/ASVs
  ref_colors <- brewer.pal(sum(data_table$reference), "Paired")
  other_colors <- sample(grey.colors(5, start = 0.5, end = 0.9), sum(!data_table$reference), replace = TRUE)

# add a color variable to the data table
  data_table$color <- rep(NA, nrow(data_table))
  data_table$color[data_table$reference] <- ref_colors
  data_table$color[!data_table$reference] <- other_colors

# reshape the data table into a ggplot-friendly format, by gathering samples into a single column called "count"

  color_gg <- data_table %>% gather("sample", "count", sample_names)
  legend_color <- c(bright = ref_colors[2], dull = other_colors[2])
  data_gg <- data_table %>% gather("sample", "count", sample_names)

  data_gg <- inner_join(data_gg,color_gg)

# create the composition bar plot
  comp_bar <- ggplot(data_gg, aes(x = sample, y = count)) +
    geom_col(aes(fill = color, group = reference, alpha = ifelse(reference, "bright", "dull")), width = 0.7, position = position_fill()) +
    scale_fill_identity(guide = FALSE) +
    scale_alpha_manual(name = "Sequence type",
                     labels = c("expected sequences", "other"),
                     values = c(bright = 1, dull = 1),
                     guide = guide_legend(override.aes = list(fill = c(ref_colors[4], "#AEAEAE")),
                                          keywidth = NULL, keyheight = NULL)) +
    scale_x_discrete(labels = dilution_labels) +
    labs(title = plotTitle, x = "sample", y = "Relative Abundance") +
    theme(legend.position = "right", legend.title = element_text(size = 12),
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 12),
        plot.title = element_text(size = 12))
  comp_bar
}

# Calculate % of contaminants remaining in each sample
remainCont <- function(otus){
  #set up data_table
  # renormalize the recovered profile to equal 1
  data_table <- otus/rowSums(otus)
  data_table <- data_table[,colnames(data_table) %in% mock_taxa]
  print((1 - rowSums(data_table)) * 100)
}

# Calculate % of mock community ASVs in each sample
mockCom <- function(otus){
  #set up data_table
  data_table <- otus[,colnames(otus) %in% mock_taxa]
  print((rowSums(data_table)) * 100)
}


summary_table <- function(results, scenario){
    # add scenario to results table

  caption_text = paste0('SourceTracker ', scenario, ' summary')
  drop_col <- c('true_neg','true_pos','false_neg','false_pos','method')
  results <- results %>% select(-one_of(drop_col))
  kable(t(results), digits = 2, caption = caption_text) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F) %>%
    group_rows("Proportion Removed", 1,2) %>%
    group_rows("Percent Remaining after Contaminant Removal", 3,4) %>%
    group_rows("Diversity Estimates after Contaminant Removal", 5,7) %>% 
  group_rows("Mock Abundances after Contaminant Removal", 8,16) %>% 
  group_rows("Classification Performance", 17,20)

}

# Function to load sourcetracker results
load_sourcetracker <- function(file, otu_key) {
  otus <- read.delim(file, row.names = 1)
    # subset only the Dilution data
  dilutions <- c('D0', 'D1','D2','D3', 'D4', 'D5', 'D6', 'D7', 'D8')
  otus<-otus[rownames(otus) %in% dilutions,]
  return(otus)
}

summarize_st <- function (recovered_otus,all_otus,mock_taxa, scenario){
  # for simplicity, all other classifications are 'contaminants'
 removed_otus <- all_otus - recovered_otus
 con_orig <- rowSums(all_otus[,!colnames(all_otus) %in% mock_taxa])
 mock_orig <- rowSums(all_otus[,colnames(all_otus) %in% mock_taxa])

#  % of mock community ASVs correctly classified as mock community ASVs 
true_neg <- rowSums(recovered_otus[,colnames(recovered_otus) %in% mock_taxa])
# % of mock community incorrectly classified as non-mock community ASVs
false_neg <- rowSums(recovered_otus[,!colnames(recovered_otus) %in% mock_taxa])
#  identify non-mock community ASVs correctly classified as not belonging to mock   community
true_pos <- rowSums(removed_otus[,!colnames(removed_otus) %in% mock_taxa])
#  identify mock community ASVs incorrectly classified as not belonging to mock   community
false_pos <- rowSums(removed_otus[,colnames(removed_otus) %in% mock_taxa])

sensitivity <- true_pos/(true_pos + false_neg)
specificty <- true_neg/(true_neg + false_pos)
accuracy <- (true_pos + true_neg) / (false_pos + true_pos + false_neg + true_neg)
prevalence <- (true_pos + false_neg) / (false_pos + true_pos + false_neg + true_neg)
 
   # Summarize results  
    ## proportion of contaminants removed (of all total contaminant ASVs)
    contaminants_removed = (rowSums(removed_otus[,!colnames(removed_otus) %in% mock_taxa])/ con_orig) * 100
    ## proportion of mock removed (of all total mock ASVs)
    mock_ASVs_removed = (rowSums(removed_otus[,colnames(removed_otus) %in% mock_taxa])/ mock_orig) * 100
  ## total amount of conatmiants remaining
  contaminants_remaining =  rowSums(recovered_otus[,!colnames(recovered_otus) %in% mock_taxa]) *100
  # calculate alpha diverity  and summary of relative abundances (on re-normalized recovered data)
  # create phyloseq object on the recovered otus
  ps_st <- mock_ps
  ps_st@otu_table <- otu_table(recovered_otus, taxa_are_rows = FALSE)
  # Renormalize and transform into count matrix - not ideal, but necessary to compare to other data (to calculate diversity estimates)
  ps_st <- transform_sample_counts(ps_st,function(x) 100000* x/sum(x))
  otu_table(ps_st) <- otu_table(round(as.matrix((otu_table(ps_st)))), taxa_are_rows= FALSE)
  diversity <- estimate_richness(ps_st, measures = c('Observed','Shannon','InvSimpson'))
  # Calculate relative abundance on recovered otus
  rel_abundance <- (rec_otus/rowSums(rec_otus))*100
  mock_abundance <- rel_abundance[, colnames(rel_abundance) %in% mock_taxa]
  total_mock_abundance <- rowSums(mock_abundance)
  con_abundance <- rowSums(rel_abundance[,!colnames(rel_abundance) %in% mock_taxa])
  # return results
  results <- cbind(contaminants_removed, mock_ASVs_removed,con_abundance, total_mock_abundance, diversity, mock_abundance,sensitivity , specificty, accuracy, prevalence, true_pos, true_neg, false_pos, false_neg)
    results <- results %>%
    mutate(method = paste0('SourceTracker, ', scenario)) %>%
    mutate(sample_names = rownames(results))
return(results)
}

plot_st_classes <- function(mock_taxa, recovered_otus, all_otus, st_method){
    # for simplicity, all other classifications are 'contaminants'
  removed_otus <- all_otus - recovered_otus
  con_orig <- rowSums(all_otus[,!colnames(all_otus) %in% mock_taxa])
  mock_orig <- rowSums(all_otus[,colnames(all_otus) %in% mock_taxa])

#  % of mock community ASVs correctly classified as mock community ASVs 
  true_neg <- rowSums(recovered_otus[,colnames(recovered_otus) %in% mock_taxa])
# % of mock community incorrectly classified as non-mock community ASVs
  false_neg <- rowSums(recovered_otus[,!colnames(recovered_otus) %in% mock_taxa])
  #  identify non-mock community ASVs correctly classified as not belonging to mock   community
  true_pos <- rowSums(removed_otus[,!colnames(removed_otus) %in% mock_taxa])
  #  identify mock community ASVs incorrectly classified as not belonging to mock   community
  false_pos <- rowSums(removed_otus[,colnames(removed_otus) %in% mock_taxa])

  profile <- rbind(false_neg, false_pos,true_neg,true_pos)

  long_profile <- melt(data = profile, 
                 id.vars = rownames(), 
                 variable.name = colnames(), 
                 value.name = "Abundance"
                )
  names(long_profile)[names(long_profile)=="Var1"] <- "SequenceClass"
  customPalette <- c('#969696','#bdbdbd', '#1B9E77', '#D95F02')

  # Figures
  classificationPlot <- ggplot(long_profile, aes(x = Var2, y = Abundance)) + 
    geom_col(aes(fill = SequenceClass), width = 0.7, position = position_fill())  +
    scale_fill_manual(values=customPalette) + theme(text = element_text(size=12)) +
    labs(x = "Sample", y = 'Proportion of Reads') +
      ggtitle(paste0('Sequence classification for \n  ', st_method)) + 
      theme(legend.position = "right", legend.title = element_text(size = 12),
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 12),
        plot.title = element_text(size = 12))

  compositionPlot <- compBarPlot(recovered_otus, paste0('SourceTracker ', st_method))

  # plot figures
  grid.arrange(compositionPlot, classificationPlot, ncol = 2)
}

Loading the mock microbial community information

We created an R work space that just contains the ASV sequences of the expected mock microbial community. This will help us identify the ASVs that were correctly and incorrectly assigned to the mock microbial community.

## 
# Load workspace with Mock microbial community taxa ASV identifiers
load("mock_taxa.RData")
# load initial phyloseq object
load("mockDilutions.RData")

This data set includes 1 object: mock_taxa - A list of the expected mock microbial community ASV sequences

Results for Case 1 - well defined experimental and contaminant environments

In this case, the defined sources were:

Mock microbial community - the undiluted mock microbial community sample
Contaminant profiles - Contaminant profiles built by removing the expected mock microbial community ASVs
Blank - The blank control

The results are defined by:

rec_otus -ASVs identified as arising from the mock microbial community source (recovered mock microbial community)
blank_otus - ASVs identified as arising from the blank source (contaminants)
cp_otus - ASVs identified as arising from the contaminant source (contaminants)
unk_otus - ASVs not classified as arising from any defined source (contaminants)

# load results from running SourceTracker with contaminant profiles and black controls 
rec_otus <- load_sourcetracker("./Sourcetracker_mock_cp_b_30000/full_results/mock_cp_b_MockCommunityProfile_contributions.txt", asv_key)

blank_otus <- load_sourcetracker("./Sourcetracker_mock_cp_b_30000/full_results/mock_cp_b_Blank_contributions.txt", asv_key)

cp_otus <- load_sourcetracker("./Sourcetracker_mock_cp_b_30000/full_results/mock_cp_b_ContaminantProfile_contributions.txt", asv_key)

unk_otus <- load_sourcetracker("./Sourcetracker_mock_cp_b_30000/full_results/mock_cp_b_Unknown_contributions.txt", asv_key)

We will also create a new variables, mock_orig and con_orig that stores the proportion of each sample arising from the mock community and contaminants, respectively. Since the data was subsampled prior to running SourceTracker, the proportion of contaminant ASVs will vary. To identify the proportion of the original sample that arose from contaminants, we sum the proportion of non-mock community ASVs from each of the SourceTracker result files. We then sum the total amount of contaminant ASVs for each sample.

# determine total amounts of conatmiants across the dataset
rec_mock <- rec_otus[,colnames(rec_otus) %in% mock_taxa] 
blank_mock<- blank_otus[,colnames(blank_otus) %in% mock_taxa] 
cp_mock <- cp_otus[,colnames(cp_otus) %in% mock_taxa] 
unk_mock <- unk_otus[,colnames(unk_otus) %in% mock_taxa] 
# Sum the total amounts of contaminants across contaminant ASVs
mock_otus <- rec_mock +  blank_mock + cp_mock + unk_mock
# Sum the total amount of contaminants per Samples sample
mock_orig <- rowSums(mock_otus)


# determine total amounts of contaminants across the dataset
rec_con <- rec_otus[,!colnames(rec_otus) %in% mock_taxa] 
blank_con<- blank_otus[,!colnames(blank_otus) %in% mock_taxa] 
cp_con <- cp_otus[,!colnames(cp_otus) %in% mock_taxa] 
unk_con <- unk_otus[,!colnames(unk_otus) %in% mock_taxa] 
# Sum the total amounts of contaminants across contaminant ASVs
con_otus <- rec_con +  blank_con + cp_con + unk_con
# Sum the total amount of contaminants per Samples sample
con_orig <- rowSums(con_otus)

# 
all_otus <- rec_otus + blank_otus + cp_otus + unk_otus

Evaluation of Scenario 1, Case 1 - well defined experimental and contaminant communites

To evaluate SourceTracker, we will generate a stacked bar plot of the SourceTracker predicted mock microbial community profile as well as evaluate the percentage of the samples that were correctly (indicated by the community abbreviation followed by _c) and incorrectly classified (indicated by the community abbreviation followed by _i).

Plot classifications per sample

To summarize the SourceTracker classifications, we will create stacked bar plots that summarize the proportion of reads correctly and incorrectly classified as arising from the mock microbial community (mock_c for correct, mock_i for incorrect), the contaminant profile (cp_c, cp_i), the blank (blank_c, blank_i), unknown (unk_c, unk_i).

Summarize Results as done for other methods

all_otus <- rec_otus + blank_otus + cp_otus + unk_otus
results_st_sc1_case1 <- summarize_st(rec_otus,all_otus,mock_taxa, 'scenario 1 case 1')
summary_table(results_st_sc1_case1, 'scenario 1 case 1')

SourceTracker scenario 1 case 1 summary
Proportion Removed
contaminants_removed	98	99	100	100	100	100	100	100	100
mock_ASVs_removed	0.00100	0.00033	0.00340	0.00035	0.00114	0.00368	0.01507	0.00225	0.00502
Percent Remaining after Contaminant Removal
con_abundance	0.00067	0.00167	0.00272	0.00209	0.00454	0.00138	0.00754	0.00150	0.00167
total_mock_abundance	100	100	100	100	100	100	100	100	100
Diversity Estimates after Contaminant Removal
Observed	9	11	9	9	10	9	16	10	9
Shannon	1.9	1.9	1.9	1.9	1.8	1.9	1.8	1.8	1.9
InvSimpson	5.4	5.4	5.6	5.8	4.9	5.8	4.9	5.4	6.2
Mock Abundances after Contaminant Removal
ASV_2	26	21	22	23	15	20	17	20	20
ASV_7	2.8	3.9	3.4	3.3	4.4	3.8	3.7	3.1	7.2
ASV_10	1.5	1.5	1.4	1.4	1.1	1.6	1.0	2.7	7.1
ASV_1	22	30	27	23	33	27	28	28	16
ASV_3	21.8	14.4	17.0	16.6	9.6	16.4	7.7	12.7	10.0
ASV_4	6.5	13.8	12.5	15.2	23.8	13.6	28.9	19.7	24.8
ASV_5	10.9	7.8	9.2	8.6	5.2	8.5	6.1	5.4	3.8
ASV_6	5.4	5.4	5.5	6.1	6.6	6.9	6.2	7.9	10.8
ASV_8	3.0	2.0	2.3	2.4	1.4	2.3	1.5	0.0	0.0
Classification Performance
sensitivity	0.98	0.99	1.00	1.00	1.00	1.00	1.00	1.00	1.00
specificty	1	1	1	1	1	1	1	1	1
accuracy	1	1	1	1	1	1	1	1	1
prevalence	0.0004	0.0013	0.0188	0.0434	0.1193	0.2756	0.6461	0.5565	0.8007
sample_names	D0	D1	D2	D3	D4	D5	D6	D7	D8

plot_st_classes(mock_taxa,rec_otus,all_otus,'Scenario 1, Case 1' )

## Joining, by = c("reference", "id", "color", "sample", "count")

# Scenario 1, Case 2 - Well defined experimental environment, blank control as contaminant environments

In this case, the defined sources were:

Mock microbial community - the undiluted mock microbial community
Contaminant profiles - Contaminant profiles built by removing the expected mock microbial community ASVs
Blank - The blank control

The results are defined by:

rec_otus -ASVs identified as arising from the mock microbial community source (recovered mock microbial community)
blank_otus - ASVs identified as arising from the blank source (contaminants)
unk_otus - ASVs not classified as arising from any defined source (contaminants)

We repeat the analysis that was performed for Scenario 1, Case 1.

# load results from running SourceTracker with contaminant profiles and black controls (Case 2 well-defined experimental community, blank only as contaminant profile)
rec_otus <- load_sourcetracker("./Sourcetracker_mock_b_30000/full_results/mock_b_MockCommunityProfile_contributions.txt", asv_key)

blank_otus <- load_sourcetracker("./Sourcetracker_mock_b_30000/full_results/mock_b_Blank_contributions.txt", asv_key)

unk_otus <- load_sourcetracker("./Sourcetracker_mock_b_30000/full_results/mock_b_Unknown_contributions.txt", asv_key)

# determine total amounts of conatmiants across the dataset
rec_con <- rec_otus[,!colnames(rec_otus) %in% mock_taxa] 
blank_con<- blank_otus[,!colnames(blank_otus) %in% mock_taxa] 
unk_con <- unk_otus[,!colnames(unk_otus) %in% mock_taxa] 
con_otus <- rec_con +  blank_con + unk_con
con_orig <- rowSums(con_otus)

# identify mock community ASVs correctly and incorrectly classified
mock_c <- rowSums(rec_otus[,colnames(rec_otus) %in% mock_taxa])
mock_i <- rowSums(rec_otus[,!colnames(rec_otus) %in% mock_taxa])

# identify contaminants correctly identified as contaminants
blank_c <- rowSums(blank_otus[,!colnames(blank_otus) %in% mock_taxa])
unk_c <- rowSums(unk_otus[,!colnames(unk_otus) %in% mock_taxa])

# identify mock microbial community ASVs incorrectly identified as contaminants
blank_i <- rowSums(blank_otus[,colnames(blank_otus) %in% mock_taxa])
unk_i <- rowSums(unk_otus[,colnames(unk_otus) %in% mock_taxa])

# Plot classifications per sample

st_profile <- rbind(mock_i,blank_i,unk_i,mock_c, blank_c, unk_c)

long_st_profile <- melt(data = st_profile, 
                 id.vars = rownames(), 
                 variable.name = colnames(), 
                 value.name = "Abundance"
                )

customPalette <- c('#969696','#bdbdbd','#f0f0f0', '#1B9E77', '#E7298A', '#7570B3')

# Figure 5C
ggplot(long_st_profile, aes(x = Var2, y = Abundance)) + geom_col(aes(fill = Var1), width = 0.7, position = position_fill())  + scale_fill_manual(values=customPalette) + theme(text = element_text(size=12)) + labs(x = "Samples", y = 'Proportion of Reads')

all_otus <- rec_otus + blank_otus + unk_otus
results_st_sc1_case2 <- summarize_st(rec_otus,all_otus,mock_taxa,'scenario 1 case 2')
summary_table(results_st_sc1_case2, 'scenario 1 case 2')

SourceTracker scenario 1 case 2 summary
Proportion Removed
contaminants_removed	98	98	100	100	100	100	100	100	100
mock_ASVs_removed	0.0013	0.0000	0.0150	0.0299	0.0972	0.3201	13.5365	2.0781	21.7031
Percent Remaining after Contaminant Removal
con_abundance	0.0010	0.0020	0.0024	0.0038	0.0027	0.0014	0.0022	0.0015	0.0021
total_mock_abundance	100	100	100	100	100	100	100	100	100
Diversity Estimates after Contaminant Removal
Observed	10	11	10	10	9	9	11	10	9
Shannon	1.9	1.9	1.9	1.9	1.8	1.9	1.8	1.8	1.9
InvSimpson	5.4	5.4	5.6	5.8	4.9	5.7	5.1	5.3	6.1
Mock Abundances after Contaminant Removal
ASV_2	26	21	23	23	15	20	20	21	24
ASV_7	2.6	3.8	3.3	3.3	4.6	3.7	3.9	3.0	6.4
ASV_10	1.3	1.5	1.4	1.4	1.2	1.5	1.0	2.3	3.0
ASV_1	23	30	26	24	33	28	31	29	21
ASV_3	21.4	14.4	17.2	16.8	9.9	15.9	9.4	12.6	13.5
ASV_4	6.5	14.5	12.5	15.4	23.5	13.5	20.0	19.2	16.1
ASV_5	11.0	7.7	9.1	8.5	5.1	8.5	6.5	6.0	4.9
ASV_6	5.2	5.4	5.5	5.8	6.5	6.9	6.6	7.8	11.3
ASV_8	3.0	2.0	2.2	2.5	1.2	2.1	1.7	0.0	0.0
Classification Performance
sensitivity	0.98	0.98	1.00	1.00	1.00	1.00	1.00	1.00	1.00
specificty	1.00	1.00	1.00	1.00	1.00	1.00	0.86	0.98	0.78
accuracy	1.00	1.00	1.00	1.00	1.00	1.00	0.95	0.99	0.96
prevalence	0.00057	0.00113	0.01907	0.04263	0.11877	0.28253	0.64457	0.55697	0.80017
sample_names	D0	D1	D2	D3	D4	D5	D6	D7	D8

plot_st_classes(mock_taxa,rec_otus,all_otus,'Scenario 1, Case 2' )

## Joining, by = c("reference", "id", "color", "sample", "count")

## Scenario 2, Case 3 - undefined experimental environment, well defined contaminant environments In this case, the defined sources were:

Contaminant profiles - Contaminant profiles built by removing the expected mock microbial community ASVs
Blank - The blank control

The results are defined by:

rec_otus - ASVs not classified as arising from any defined source (recovered mock microbial community)
blank_otus - ASVs identified as arising from the blank source (contaminants)
cp_otus - ASVs identified as arising from the contaminant source (contaminants)

# load results from running SourceTracker with only contaminant profiles and black controls defined (Case 3)
rec_otus <- load_sourcetracker("./Sourcetracker_cp_b_30000/full_results/cp_b_Unknown_contributions.txt", asv_key)

cp_otus <- load_sourcetracker("./Sourcetracker_cp_b_30000/full_results/cp_b_ContaminantProfile_contributions.txt",  asv_key)

blank_otus <- load_sourcetracker("./Sourcetracker_cp_b_30000/full_results/cp_b_Blank_contributions.txt", asv_key)

# determine total amounts of contaminants across the dataset
rec_con <- rec_otus[,!colnames(rec_otus) %in% mock_taxa] 
blank_con<- blank_otus[,!colnames(blank_otus) %in% mock_taxa] 
cp_con <- cp_otus[,!colnames(cp_otus) %in% mock_taxa] 
con_otus <- rec_con +  blank_con + cp_con 
con_orig <- rowSums(con_otus)

# Identify mock community correctly and incorrectly classified
mock_c <- rowSums(rec_otus[,colnames(rec_otus) %in% mock_taxa])
mock_i <- rowSums(rec_otus[,!colnames(rec_otus) %in% mock_taxa])

# Identify contaminants correctly and incorrectly classified
cp_c <- rowSums(cp_otus[,!colnames(cp_otus) %in% mock_taxa])
blank_c <- rowSums(blank_otus[,!colnames(blank_otus) %in% mock_taxa])

cp_i <- rowSums(cp_otus[,colnames(cp_otus) %in% mock_taxa])
blank_i <- rowSums(blank_otus[,colnames(blank_otus) %in% mock_taxa])

st_profile <- rbind(mock_i ,blank_i,cp_i,mock_c, blank_c, cp_c)

long_st_profile <- melt(data = st_profile, 
                 id.vars = rownames(), 
                 variable.name = colnames(), 
                 value.name = "Abundance"
                )

customPalette <- c('#969696','#bdbdbd','#d9d9d9', '#1B9E77', '#E7298A', '#D95F02')

# Figure 5D
ggplot(long_st_profile, aes(x = Var2, y = Abundance)) + geom_col(aes(fill = Var1), width = 0.7, position = position_fill())  + scale_fill_manual(values=customPalette) + theme(text = element_text(size=12)) + labs(x = "Samples", y = 'Proportion of Reads')

all_otus <- rec_otus + blank_otus + cp_otus
results_st_sc2_case1 <- summarize_st(rec_otus,all_otus,mock_taxa,'scenario 2 case 1')

## Warning in estimate_richness(ps_st, measures = c("Observed", "Shannon", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.

summary_table(results_st_sc2_case1, 'scenario 2 case 1')

SourceTracker scenario 2 case 1 summary
Proportion Removed
contaminants_removed	2.00	2.12	1.09	0.85	1.03	5.14	39.68	2.58	67.00
mock_ASVs_removed	0.0010	0.0010	0.0010	0.0017	0.0026	0.0032	0.0160	0.0068	0.0202
Percent Remaining after Contaminant Removal
con_abundance	0.049	0.108	1.728	4.606	11.729	27.096	52.346	55.090	57.170
total_mock_abundance	100	100	98	95	88	73	48	45	43
Diversity Estimates after Contaminant Removal
Observed	13	20	98	165	259	310	379	147	193
Shannon	1.9	1.9	2.0	2.2	2.5	3.4	4.3	4.1	4.4
InvSimpson	5.4	5.4	5.8	6.4	6.2	10.9	19.3	23.9	31.5
Mock Abundances after Contaminant Removal
ASV_2	26.2	20.9	22.7	22.0	13.7	14.5	8.0	9.4	8.2
ASV_7	2.5	3.5	3.2	3.0	4.0	2.8	1.7	1.4	3.4
ASV_10	1.49	1.65	1.41	1.38	1.02	1.31	0.46	1.20	2.64
ASV_1	22	30	25	22	29	20	14	13	7
ASV_3	21.6	14.4	16.7	15.8	8.2	12.0	3.9	5.6	4.8
ASV_4	6.7	14.2	12.2	15.1	21.0	9.8	13.8	8.8	10.4
ASV_5	11.0	8.0	9.0	7.9	4.6	6.1	2.8	2.5	1.4
ASV_6	5.2	5.2	5.4	5.7	5.9	5.0	2.9	3.4	5.0
ASV_8	3.20	2.06	2.30	2.33	1.04	1.65	0.64	0.00	0.00
Classification Performance
sensitivity	0.0200	0.0212	0.0109	0.0085	0.0103	0.0514	0.3968	0.0258	0.6700
specificty	1	1	1	1	1	1	1	1	1
accuracy	1.00	1.00	0.98	0.95	0.88	0.73	0.61	0.46	0.74
prevalence	0.0005	0.0011	0.0175	0.0464	0.1184	0.2815	0.6455	0.5573	0.8018
sample_names	D0	D1	D2	D3	D4	D5	D6	D7	D8

plot_st_classes(mock_taxa,rec_otus,all_otus,'Scenario 2, Case 1' )

## Joining, by = c("reference", "id", "color", "sample", "count")

# Scenario 2, Case 4 - undefined experimental environment, blank control as contaminant environment In this case, the defined sources were:

Contaminant profiles - Contaminant profiles built by removing the expected mock microbial community ASVs
Blank - The blank control

The results are defined by:

rec_otus - ASVs not classified as arising from any defined source (recovered mock microbial community)
blank_otus - ASVs identified as arising from the blank source (contaminants)

# load results from running SourceTracker with only a black control defined (Case 4)
rec_otus <- load_sourcetracker("./Sourcetracker_b_30000/full_results/b_Unknown_contributions.txt", asv_key)

blank_otus <- load_sourcetracker("./Sourcetracker_b_30000/full_results/b_Blank_contributions.txt", asv_key)

# determine total amounts of conatmiants across the dataset
rec_con <- rec_otus[,!colnames(rec_otus) %in% mock_taxa] 
blank_con<- blank_otus[,!colnames(blank_otus) %in% mock_taxa] 
con_otus <- rec_con +  blank_con 
con_orig <- rowSums(con_otus)

# Identify mock community correctly and incorrectly classified
mock_c <- rowSums(rec_otus[,colnames(rec_otus) %in% mock_taxa])
mock_i <- rowSums(rec_otus[,!colnames(rec_otus) %in% mock_taxa])

# Identify contaminants correctly and incorrectly classified
blank_c <- rowSums(blank_otus[,!colnames(blank_otus) %in% mock_taxa])
blank_i <- rowSums(blank_otus[,colnames(blank_otus) %in% mock_taxa])

st_profile <- rbind(mock_i ,blank_i,mock_c, blank_c)

long_st_profile <- melt(data = st_profile, 
                 id.vars = rownames(), 
                 variable.name = colnames(), 
                 value.name = "Abundance"
                )

customPalette <- c('#969696','#bdbdbd', '#1B9E77', '#D95F02')
# Figure 5E
ggplot(long_st_profile, aes(x = Var2, y = Abundance)) + geom_col(aes(fill = Var1), width = 0.7, position = position_fill())  + scale_fill_manual(values=customPalette) + theme(text = element_text(size=12)) + labs(x = "Samples", y = 'Proportion of Reads')

all_otus <- rec_otus + blank_otus 
results_st_sc2_case2 <- summarize_st(rec_otus,all_otus,mock_taxa, 'scenario 2 case 2')

## Warning in estimate_richness(ps_st, measures = c("Observed", "Shannon", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.

summary_table(results_st_sc2_case2, 'scenario 2 case 2')

SourceTracker scenario 2 case 2 summary
Proportion Removed
contaminants_removed	0.000	0.278	0.455	0.149	0.134	0.264	0.256	0.053	0.105
mock_ASVs_removed	0.00100	0.00067	0.00136	0.00105	0.00227	0.00326	0.01805	0.00000	0.01675
Percent Remaining after Contaminant Removal
con_abundance	0.023	0.120	1.822	4.460	11.706	28.400	64.859	55.463	80.089
total_mock_abundance	100	100	98	96	88	72	35	45	20
Diversity Estimates after Contaminant Removal
Observed	13	19	94	161	258	310	381	147	193
Shannon	1.9	1.9	2.0	2.2	2.5	3.4	4.5	4.1	4.8
InvSimpson	5.4	5.4	5.8	6.4	6.2	11.2	28.2	24.5	70.7
Mock Abundances after Contaminant Removal
ASV_2	26.3	20.9	22.2	21.8	13.3	14.4	6.3	9.2	3.8
ASV_7	2.6	3.6	3.1	3.1	4.1	2.7	1.1	1.4	1.5
ASV_10	1.55	1.61	1.42	1.46	1.02	1.22	0.35	1.12	1.24
ASV_1	21.9	29.7	25.7	22.4	29.4	19.4	9.9	12.4	3.3
ASV_3	21.6	14.7	17.0	15.8	8.3	11.7	2.8	5.4	2.1
ASV_4	6.8	14.0	12.2	15.0	20.6	9.3	10.0	8.7	4.7
ASV_5	10.94	8.11	9.05	7.97	4.57	6.19	2.06	2.65	0.83
ASV_6	5.3	5.2	5.3	5.7	5.9	5.2	2.1	3.8	2.4
ASV_8	2.94	2.04	2.34	2.32	1.12	1.57	0.49	0.00	0.00
Classification Performance
sensitivity	0.00000	0.00278	0.00455	0.00149	0.00134	0.00264	0.00256	0.00053	0.00105
specificty	1	1	1	1	1	1	1	1	1
accuracy	1.00	1.00	0.98	0.96	0.88	0.72	0.35	0.45	0.20
prevalence	0.00023	0.00120	0.01830	0.04467	0.11720	0.28453	0.64913	0.55477	0.80103
sample_names	D0	D1	D2	D3	D4	D5	D6	D7	D8

plot_st_classes(mock_taxa,rec_otus,all_otus,'Scenario 2, Case 2' )

## Joining, by = c("reference", "id", "color", "sample", "count")

Save results

vars_to_keep <- ls(pattern = 'results')
vars_to_rm <- ls()
vars_to_rm <-vars_to_rm[!vars_to_rm %in% vars_to_keep]
rm(list = vars_to_rm)
rm(vars_to_rm, vars_to_keep)

## Warning in rm(vars_to_rm, vars_to_keep): object 'vars_to_rm' not found

## Warning in rm(vars_to_rm, vars_to_keep): object 'vars_to_keep' not found

save.image('results_sourcetracker.RData')

Controlling Contaminants with SourceTracker

Lisa Karstens, PhD

Introduction

Using SourceTracker for contaminant identification and removal

Theory behind SourceTracker and its application for contaminant removal

Set up the workspace

Create functions to use in this analysis

Loading the mock microbial community information

Results for Case 1 - well defined experimental and contaminant environments

Evaluation of Scenario 1, Case 1 - well defined experimental and contaminant communites

Plot classifications per sample

Summarize Results as done for other methods

Save results