Introduction

We recently performed a study to understand the impact of decreasing microbial biomass on 16S rRNA gene sequencing experiments and evaluate the current computational strategies to control for contaminants (preprint available here).

We provide the compiled R Markdown documents to reproduce the analysis presented in the manuscript, divided into 6 primary sections:

Here, we provide the compiled R Markdown document to reproduce evaluating SourceTracker results for identifying and removing contaminants as presented in the manuscript.

For more information about the experimental design and preprocessing of the data, see the Introduction and evaluating contaminants section. For more information about setting up and using SourceTracker, see the Removing contaminant ASVs with SourceTracker section.

Using SourceTracker for contaminant identification and removal

SourceTracker was used to identify contaminants in a 16S rRNA sequencing experiment. More information about SourceTracker can be found on the SourceTracker website. SourceTracker is a bit resource intensive and can take a while to run, so the code used to run SourceTracker for this analysis is available in a separate Markdown document here.

Theory behind SourceTracker and its application for contaminant removal

SourceTracker uses a Bayesian approach to predict the proportion of unknown samples (‘sink’ samples) arising from defined microbial sources (‘source’ samples). In theory, SourceTracker can be used to identify the proportion of experimental samples that arise from contaminants.

We tested two scenarios for recovering the mock microbial community profiles from the mock microbial Samples. In the first scenario, the expected mock community profile served as a source environment, mimicking the scenario when the experimental source is well defined. In the second scenario, the expected mock microbial community is unknown, and the proportion of sequences not predicted to be from the blank control or contaminant profile is the contamination-corrected profile. The second scenario is the more commonly encountered scenario, where the low microbial biomass environment that is being studied is poorly defined. For each scenario, we evaluated the use of the blank control and a combination of blank control and contaminant profiles as the source environments.

This R Markdown file loads in the result files from running SourceTracker and summarizes how well SourceTracker performed by:

The output from running SourceTracker is a series of text files that each contain an otu table for each source environment. The sum of all of the ASVs over all text files sums to 1 for each sample. Thus, SourceTracker predicts the proportion of each individual ASV arising from each source. We will step through the analysis of the SourceTracker results in detail for the first set of results, and then run through the analysis of the other sets, indicating key differences.

Set up the workspace

To run this analysis yourself, download the data along with the R markdown file here.

# load libraries
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.1       ✔ purrr   0.3.2  
## ✔ tibble  2.1.1       ✔ dplyr   0.8.0.1
## ✔ tidyr   0.8.3       ✔ stringr 1.4.0  
## ✔ readr   1.3.1       ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(RColorBrewer)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(phyloseq)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(knitr)
options(digits=2)

# save session info (packages versions loaded)
session <- sessionInfo()

Create functions to use in this analysis

We will use a function called compBarPlot to display the expected mock microbial ASVs in color and unexpected contaminants sequences in gray scale. This is similar to the function created in the decontam analysis, but uses a data frame rather than a phyloseq object.

We also create a function remainCont to calculate the percent of remaining contaminants after running SourceTracker and mockCom to calculate the percent of mock microbial community ASVs present in each sample.

# Create function to plot bar plots with contaminants in grey scale and expected mock microbial sequences in color
compBarPlot <- function(otus, plotTitle){
#set up data_table
  data_table <- as.data.frame(t(otus))

  sample_names <- colnames(data_table)
  data_table$reference = FALSE
  data_table$reference[rownames(data_table) %in% mock_taxa] = TRUE
  data_table$id <- paste0('ASV_', 1:nrow(data_table))
  dilution_labels <- sample_names

  set.seed(444)

# define the colors to use for reference and non-reference OTUs/ASVs
  ref_colors <- brewer.pal(sum(data_table$reference), "Paired")
  other_colors <- sample(grey.colors(5, start = 0.5, end = 0.9), sum(!data_table$reference), replace = TRUE)

# add a color variable to the data table
  data_table$color <- rep(NA, nrow(data_table))
  data_table$color[data_table$reference] <- ref_colors
  data_table$color[!data_table$reference] <- other_colors

# reshape the data table into a ggplot-friendly format, by gathering samples into a single column called "count"

  color_gg <- data_table %>% gather("sample", "count", sample_names)
  legend_color <- c(bright = ref_colors[2], dull = other_colors[2])
  data_gg <- data_table %>% gather("sample", "count", sample_names)

  data_gg <- inner_join(data_gg,color_gg)

# create the composition bar plot
  comp_bar <- ggplot(data_gg, aes(x = sample, y = count)) +
    geom_col(aes(fill = color, group = reference, alpha = ifelse(reference, "bright", "dull")), width = 0.7, position = position_fill()) +
    scale_fill_identity(guide = FALSE) +
    scale_alpha_manual(name = "Sequence type",
                     labels = c("expected sequences", "other"),
                     values = c(bright = 1, dull = 1),
                     guide = guide_legend(override.aes = list(fill = c(ref_colors[4], "#AEAEAE")),
                                          keywidth = NULL, keyheight = NULL)) +
    scale_x_discrete(labels = dilution_labels) +
    labs(title = plotTitle, x = "sample", y = "Relative Abundance") +
    theme(legend.position = "right", legend.title = element_text(size = 12),
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 12),
        plot.title = element_text(size = 12))
  comp_bar
}

# Calculate % of contaminants remaining in each sample
remainCont <- function(otus){
  #set up data_table
  # renormalize the recovered profile to equal 1
  data_table <- otus/rowSums(otus)
  data_table <- data_table[,colnames(data_table) %in% mock_taxa]
  print((1 - rowSums(data_table)) * 100)
}

# Calculate % of mock community ASVs in each sample
mockCom <- function(otus){
  #set up data_table
  data_table <- otus[,colnames(otus) %in% mock_taxa]
  print((rowSums(data_table)) * 100)
}


summary_table <- function(results, scenario){
    # add scenario to results table

  caption_text = paste0('SourceTracker ', scenario, ' summary')
  drop_col <- c('true_neg','true_pos','false_neg','false_pos','method')
  results <- results %>% select(-one_of(drop_col))
  kable(t(results), digits = 2, caption = caption_text) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F) %>%
    group_rows("Proportion Removed", 1,2) %>%
    group_rows("Percent Remaining after Contaminant Removal", 3,4) %>%
    group_rows("Diversity Estimates after Contaminant Removal", 5,7) %>% 
  group_rows("Mock Abundances after Contaminant Removal", 8,16) %>% 
  group_rows("Classification Performance", 17,20)

}

# Function to load sourcetracker results
load_sourcetracker <- function(file, otu_key) {
  otus <- read.delim(file, row.names = 1)
    # subset only the Dilution data
  dilutions <- c('D0', 'D1','D2','D3', 'D4', 'D5', 'D6', 'D7', 'D8')
  otus<-otus[rownames(otus) %in% dilutions,]
  return(otus)
}

summarize_st <- function (recovered_otus,all_otus,mock_taxa, scenario){
  # for simplicity, all other classifications are 'contaminants'
 removed_otus <- all_otus - recovered_otus
 con_orig <- rowSums(all_otus[,!colnames(all_otus) %in% mock_taxa])
 mock_orig <- rowSums(all_otus[,colnames(all_otus) %in% mock_taxa])

#  % of mock community ASVs correctly classified as mock community ASVs 
true_neg <- rowSums(recovered_otus[,colnames(recovered_otus) %in% mock_taxa])
# % of mock community incorrectly classified as non-mock community ASVs
false_neg <- rowSums(recovered_otus[,!colnames(recovered_otus) %in% mock_taxa])
#  identify non-mock community ASVs correctly classified as not belonging to mock   community
true_pos <- rowSums(removed_otus[,!colnames(removed_otus) %in% mock_taxa])
#  identify mock community ASVs incorrectly classified as not belonging to mock   community
false_pos <- rowSums(removed_otus[,colnames(removed_otus) %in% mock_taxa])

sensitivity <- true_pos/(true_pos + false_neg)
specificty <- true_neg/(true_neg + false_pos)
accuracy <- (true_pos + true_neg) / (false_pos + true_pos + false_neg + true_neg)
prevalence <- (true_pos + false_neg) / (false_pos + true_pos + false_neg + true_neg)
 
   # Summarize results  
    ## proportion of contaminants removed (of all total contaminant ASVs)
    contaminants_removed = (rowSums(removed_otus[,!colnames(removed_otus) %in% mock_taxa])/ con_orig) * 100
    ## proportion of mock removed (of all total mock ASVs)
    mock_ASVs_removed = (rowSums(removed_otus[,colnames(removed_otus) %in% mock_taxa])/ mock_orig) * 100
  ## total amount of conatmiants remaining
  contaminants_remaining =  rowSums(recovered_otus[,!colnames(recovered_otus) %in% mock_taxa]) *100
  # calculate alpha diverity  and summary of relative abundances (on re-normalized recovered data)
  # create phyloseq object on the recovered otus
  ps_st <- mock_ps
  ps_st@otu_table <- otu_table(recovered_otus, taxa_are_rows = FALSE)
  # Renormalize and transform into count matrix - not ideal, but necessary to compare to other data (to calculate diversity estimates)
  ps_st <- transform_sample_counts(ps_st,function(x) 100000* x/sum(x))
  otu_table(ps_st) <- otu_table(round(as.matrix((otu_table(ps_st)))), taxa_are_rows= FALSE)
  diversity <- estimate_richness(ps_st, measures = c('Observed','Shannon','InvSimpson'))
  # Calculate relative abundance on recovered otus
  rel_abundance <- (rec_otus/rowSums(rec_otus))*100
  mock_abundance <- rel_abundance[, colnames(rel_abundance) %in% mock_taxa]
  total_mock_abundance <- rowSums(mock_abundance)
  con_abundance <- rowSums(rel_abundance[,!colnames(rel_abundance) %in% mock_taxa])
  # return results
  results <- cbind(contaminants_removed, mock_ASVs_removed,con_abundance, total_mock_abundance, diversity, mock_abundance,sensitivity , specificty, accuracy, prevalence, true_pos, true_neg, false_pos, false_neg)
    results <- results %>%
    mutate(method = paste0('SourceTracker, ', scenario)) %>%
    mutate(sample_names = rownames(results))
return(results)
}

plot_st_classes <- function(mock_taxa, recovered_otus, all_otus, st_method){
    # for simplicity, all other classifications are 'contaminants'
  removed_otus <- all_otus - recovered_otus
  con_orig <- rowSums(all_otus[,!colnames(all_otus) %in% mock_taxa])
  mock_orig <- rowSums(all_otus[,colnames(all_otus) %in% mock_taxa])

#  % of mock community ASVs correctly classified as mock community ASVs 
  true_neg <- rowSums(recovered_otus[,colnames(recovered_otus) %in% mock_taxa])
# % of mock community incorrectly classified as non-mock community ASVs
  false_neg <- rowSums(recovered_otus[,!colnames(recovered_otus) %in% mock_taxa])
  #  identify non-mock community ASVs correctly classified as not belonging to mock   community
  true_pos <- rowSums(removed_otus[,!colnames(removed_otus) %in% mock_taxa])
  #  identify mock community ASVs incorrectly classified as not belonging to mock   community
  false_pos <- rowSums(removed_otus[,colnames(removed_otus) %in% mock_taxa])

  profile <- rbind(false_neg, false_pos,true_neg,true_pos)

  long_profile <- melt(data = profile, 
                 id.vars = rownames(), 
                 variable.name = colnames(), 
                 value.name = "Abundance"
                )
  names(long_profile)[names(long_profile)=="Var1"] <- "SequenceClass"
  customPalette <- c('#969696','#bdbdbd', '#1B9E77', '#D95F02')

  # Figures
  classificationPlot <- ggplot(long_profile, aes(x = Var2, y = Abundance)) + 
    geom_col(aes(fill = SequenceClass), width = 0.7, position = position_fill())  +
    scale_fill_manual(values=customPalette) + theme(text = element_text(size=12)) +
    labs(x = "Sample", y = 'Proportion of Reads') +
      ggtitle(paste0('Sequence classification for \n  ', st_method)) + 
      theme(legend.position = "right", legend.title = element_text(size = 12),
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 12),
        plot.title = element_text(size = 12))

  compositionPlot <- compBarPlot(recovered_otus, paste0('SourceTracker ', st_method))

  # plot figures
  grid.arrange(compositionPlot, classificationPlot, ncol = 2)
}

Loading the mock microbial community information

We created an R work space that just contains the ASV sequences of the expected mock microbial community. This will help us identify the ASVs that were correctly and incorrectly assigned to the mock microbial community.

## 
# Load workspace with Mock microbial community taxa ASV identifiers
load("mock_taxa.RData")
# load initial phyloseq object
load("mockDilutions.RData")

This data set includes 1 object: mock_taxa - A list of the expected mock microbial community ASV sequences

Results for Case 1 - well defined experimental and contaminant environments

In this case, the defined sources were:

The results are defined by:

# load results from running SourceTracker with contaminant profiles and black controls 
rec_otus <- load_sourcetracker("./Sourcetracker_mock_cp_b_30000/full_results/mock_cp_b_MockCommunityProfile_contributions.txt", asv_key)

blank_otus <- load_sourcetracker("./Sourcetracker_mock_cp_b_30000/full_results/mock_cp_b_Blank_contributions.txt", asv_key)

cp_otus <- load_sourcetracker("./Sourcetracker_mock_cp_b_30000/full_results/mock_cp_b_ContaminantProfile_contributions.txt", asv_key)

unk_otus <- load_sourcetracker("./Sourcetracker_mock_cp_b_30000/full_results/mock_cp_b_Unknown_contributions.txt", asv_key)

We will also create a new variables, mock_orig and con_orig that stores the proportion of each sample arising from the mock community and contaminants, respectively. Since the data was subsampled prior to running SourceTracker, the proportion of contaminant ASVs will vary. To identify the proportion of the original sample that arose from contaminants, we sum the proportion of non-mock community ASVs from each of the SourceTracker result files. We then sum the total amount of contaminant ASVs for each sample.

# determine total amounts of conatmiants across the dataset
rec_mock <- rec_otus[,colnames(rec_otus) %in% mock_taxa] 
blank_mock<- blank_otus[,colnames(blank_otus) %in% mock_taxa] 
cp_mock <- cp_otus[,colnames(cp_otus) %in% mock_taxa] 
unk_mock <- unk_otus[,colnames(unk_otus) %in% mock_taxa] 
# Sum the total amounts of contaminants across contaminant ASVs
mock_otus <- rec_mock +  blank_mock + cp_mock + unk_mock
# Sum the total amount of contaminants per Samples sample
mock_orig <- rowSums(mock_otus)


# determine total amounts of contaminants across the dataset
rec_con <- rec_otus[,!colnames(rec_otus) %in% mock_taxa] 
blank_con<- blank_otus[,!colnames(blank_otus) %in% mock_taxa] 
cp_con <- cp_otus[,!colnames(cp_otus) %in% mock_taxa] 
unk_con <- unk_otus[,!colnames(unk_otus) %in% mock_taxa] 
# Sum the total amounts of contaminants across contaminant ASVs
con_otus <- rec_con +  blank_con + cp_con + unk_con
# Sum the total amount of contaminants per Samples sample
con_orig <- rowSums(con_otus)

# 
all_otus <- rec_otus + blank_otus + cp_otus + unk_otus

Evaluation of Scenario 1, Case 1 - well defined experimental and contaminant communites

To evaluate SourceTracker, we will generate a stacked bar plot of the SourceTracker predicted mock microbial community profile as well as evaluate the percentage of the samples that were correctly (indicated by the community abbreviation followed by _c) and incorrectly classified (indicated by the community abbreviation followed by _i).

Plot classifications per sample

To summarize the SourceTracker classifications, we will create stacked bar plots that summarize the proportion of reads correctly and incorrectly classified as arising from the mock microbial community (mock_c for correct, mock_i for incorrect), the contaminant profile (cp_c, cp_i), the blank (blank_c, blank_i), unknown (unk_c, unk_i).

Summarize Results as done for other methods

all_otus <- rec_otus + blank_otus + cp_otus + unk_otus
results_st_sc1_case1 <- summarize_st(rec_otus,all_otus,mock_taxa, 'scenario 1 case 1')
summary_table(results_st_sc1_case1, 'scenario 1 case 1')
SourceTracker scenario 1 case 1 summary
Proportion Removed
contaminants_removed 98 99 100 100 100 100 100 100 100
mock_ASVs_removed 0.00100 0.00033 0.00340 0.00035 0.00114 0.00368 0.01507 0.00225 0.00502
Percent Remaining after Contaminant Removal
con_abundance 0.00067 0.00167 0.00272 0.00209 0.00454 0.00138 0.00754 0.00150 0.00167
total_mock_abundance 100 100 100 100 100 100 100 100 100
Diversity Estimates after Contaminant Removal
Observed 9 11 9 9 10 9 16 10 9
Shannon 1.9 1.9 1.9 1.9 1.8 1.9 1.8 1.8 1.9
InvSimpson 5.4 5.4 5.6 5.8 4.9 5.8 4.9 5.4 6.2
Mock Abundances after Contaminant Removal
ASV_2 26 21 22 23 15 20 17 20 20
ASV_7 2.8 3.9 3.4 3.3 4.4 3.8 3.7 3.1 7.2
ASV_10 1.5 1.5 1.4 1.4 1.1 1.6 1.0 2.7 7.1
ASV_1 22 30 27 23 33 27 28 28 16
ASV_3 21.8 14.4 17.0 16.6 9.6 16.4 7.7 12.7 10.0
ASV_4 6.5 13.8 12.5 15.2 23.8 13.6 28.9 19.7 24.8
ASV_5 10.9 7.8 9.2 8.6 5.2 8.5 6.1 5.4 3.8
ASV_6 5.4 5.4 5.5 6.1 6.6 6.9 6.2 7.9 10.8
ASV_8 3.0 2.0 2.3 2.4 1.4 2.3 1.5 0.0 0.0
Classification Performance
sensitivity 0.98 0.99 1.00 1.00 1.00 1.00 1.00 1.00 1.00
specificty 1 1 1 1 1 1 1 1 1
accuracy 1 1 1 1 1 1 1 1 1
prevalence 0.0004 0.0013 0.0188 0.0434 0.1193 0.2756 0.6461 0.5565 0.8007
sample_names D0 D1 D2 D3 D4 D5 D6 D7 D8
plot_st_classes(mock_taxa,rec_otus,all_otus,'Scenario 1, Case 1' )
## Joining, by = c("reference", "id", "color", "sample", "count")

# Scenario 1, Case 2 - Well defined experimental environment, blank control as contaminant environments

In this case, the defined sources were:

The results are defined by:

We repeat the analysis that was performed for Scenario 1, Case 1.

# load results from running SourceTracker with contaminant profiles and black controls (Case 2 well-defined experimental community, blank only as contaminant profile)
rec_otus <- load_sourcetracker("./Sourcetracker_mock_b_30000/full_results/mock_b_MockCommunityProfile_contributions.txt", asv_key)

blank_otus <- load_sourcetracker("./Sourcetracker_mock_b_30000/full_results/mock_b_Blank_contributions.txt", asv_key)

unk_otus <- load_sourcetracker("./Sourcetracker_mock_b_30000/full_results/mock_b_Unknown_contributions.txt", asv_key)

# determine total amounts of conatmiants across the dataset
rec_con <- rec_otus[,!colnames(rec_otus) %in% mock_taxa] 
blank_con<- blank_otus[,!colnames(blank_otus) %in% mock_taxa] 
unk_con <- unk_otus[,!colnames(unk_otus) %in% mock_taxa] 
con_otus <- rec_con +  blank_con + unk_con
con_orig <- rowSums(con_otus)

# identify mock community ASVs correctly and incorrectly classified
mock_c <- rowSums(rec_otus[,colnames(rec_otus) %in% mock_taxa])
mock_i <- rowSums(rec_otus[,!colnames(rec_otus) %in% mock_taxa])

# identify contaminants correctly identified as contaminants
blank_c <- rowSums(blank_otus[,!colnames(blank_otus) %in% mock_taxa])
unk_c <- rowSums(unk_otus[,!colnames(unk_otus) %in% mock_taxa])

# identify mock microbial community ASVs incorrectly identified as contaminants
blank_i <- rowSums(blank_otus[,colnames(blank_otus) %in% mock_taxa])
unk_i <- rowSums(unk_otus[,colnames(unk_otus) %in% mock_taxa])

# Plot classifications per sample

st_profile <- rbind(mock_i,blank_i,unk_i,mock_c, blank_c, unk_c)

long_st_profile <- melt(data = st_profile, 
                 id.vars = rownames(), 
                 variable.name = colnames(), 
                 value.name = "Abundance"
                )

customPalette <- c('#969696','#bdbdbd','#f0f0f0', '#1B9E77', '#E7298A', '#7570B3')

# Figure 5C
ggplot(long_st_profile, aes(x = Var2, y = Abundance)) + geom_col(aes(fill = Var1), width = 0.7, position = position_fill())  + scale_fill_manual(values=customPalette) + theme(text = element_text(size=12)) + labs(x = "Samples", y = 'Proportion of Reads')

all_otus <- rec_otus + blank_otus + unk_otus
results_st_sc1_case2 <- summarize_st(rec_otus,all_otus,mock_taxa,'scenario 1 case 2')
summary_table(results_st_sc1_case2, 'scenario 1 case 2')
SourceTracker scenario 1 case 2 summary
Proportion Removed
contaminants_removed 98 98 100 100 100 100 100 100 100
mock_ASVs_removed 0.0013 0.0000 0.0150 0.0299 0.0972 0.3201 13.5365 2.0781 21.7031
Percent Remaining after Contaminant Removal
con_abundance 0.0010 0.0020 0.0024 0.0038 0.0027 0.0014 0.0022 0.0015 0.0021
total_mock_abundance 100 100 100 100 100 100 100 100 100
Diversity Estimates after Contaminant Removal
Observed 10 11 10 10 9 9 11 10 9
Shannon 1.9 1.9 1.9 1.9 1.8 1.9 1.8 1.8 1.9
InvSimpson 5.4 5.4 5.6 5.8 4.9 5.7 5.1 5.3 6.1
Mock Abundances after Contaminant Removal
ASV_2 26 21 23 23 15 20 20 21 24
ASV_7 2.6 3.8 3.3 3.3 4.6 3.7 3.9 3.0 6.4
ASV_10 1.3 1.5 1.4 1.4 1.2 1.5 1.0 2.3 3.0
ASV_1 23 30 26 24 33 28 31 29 21
ASV_3 21.4 14.4 17.2 16.8 9.9 15.9 9.4 12.6 13.5
ASV_4 6.5 14.5 12.5 15.4 23.5 13.5 20.0 19.2 16.1
ASV_5 11.0 7.7 9.1 8.5 5.1 8.5 6.5 6.0 4.9
ASV_6 5.2 5.4 5.5 5.8 6.5 6.9 6.6 7.8 11.3
ASV_8 3.0 2.0 2.2 2.5 1.2 2.1 1.7 0.0 0.0
Classification Performance
sensitivity 0.98 0.98 1.00 1.00 1.00 1.00 1.00 1.00 1.00
specificty 1.00 1.00 1.00 1.00 1.00 1.00 0.86 0.98 0.78
accuracy 1.00 1.00 1.00 1.00 1.00 1.00 0.95 0.99 0.96
prevalence 0.00057 0.00113 0.01907 0.04263 0.11877 0.28253 0.64457 0.55697 0.80017
sample_names D0 D1 D2 D3 D4 D5 D6 D7 D8
plot_st_classes(mock_taxa,rec_otus,all_otus,'Scenario 1, Case 2' )
## Joining, by = c("reference", "id", "color", "sample", "count")

## Scenario 2, Case 3 - undefined experimental environment, well defined contaminant environments In this case, the defined sources were:

The results are defined by:

# load results from running SourceTracker with only contaminant profiles and black controls defined (Case 3)
rec_otus <- load_sourcetracker("./Sourcetracker_cp_b_30000/full_results/cp_b_Unknown_contributions.txt", asv_key)

cp_otus <- load_sourcetracker("./Sourcetracker_cp_b_30000/full_results/cp_b_ContaminantProfile_contributions.txt",  asv_key)

blank_otus <- load_sourcetracker("./Sourcetracker_cp_b_30000/full_results/cp_b_Blank_contributions.txt", asv_key)

# determine total amounts of contaminants across the dataset
rec_con <- rec_otus[,!colnames(rec_otus) %in% mock_taxa] 
blank_con<- blank_otus[,!colnames(blank_otus) %in% mock_taxa] 
cp_con <- cp_otus[,!colnames(cp_otus) %in% mock_taxa] 
con_otus <- rec_con +  blank_con + cp_con 
con_orig <- rowSums(con_otus)

# Identify mock community correctly and incorrectly classified
mock_c <- rowSums(rec_otus[,colnames(rec_otus) %in% mock_taxa])
mock_i <- rowSums(rec_otus[,!colnames(rec_otus) %in% mock_taxa])

# Identify contaminants correctly and incorrectly classified
cp_c <- rowSums(cp_otus[,!colnames(cp_otus) %in% mock_taxa])
blank_c <- rowSums(blank_otus[,!colnames(blank_otus) %in% mock_taxa])

cp_i <- rowSums(cp_otus[,colnames(cp_otus) %in% mock_taxa])
blank_i <- rowSums(blank_otus[,colnames(blank_otus) %in% mock_taxa])

st_profile <- rbind(mock_i ,blank_i,cp_i,mock_c, blank_c, cp_c)

long_st_profile <- melt(data = st_profile, 
                 id.vars = rownames(), 
                 variable.name = colnames(), 
                 value.name = "Abundance"
                )

customPalette <- c('#969696','#bdbdbd','#d9d9d9', '#1B9E77', '#E7298A', '#D95F02')

# Figure 5D
ggplot(long_st_profile, aes(x = Var2, y = Abundance)) + geom_col(aes(fill = Var1), width = 0.7, position = position_fill())  + scale_fill_manual(values=customPalette) + theme(text = element_text(size=12)) + labs(x = "Samples", y = 'Proportion of Reads')

all_otus <- rec_otus + blank_otus + cp_otus
results_st_sc2_case1 <- summarize_st(rec_otus,all_otus,mock_taxa,'scenario 2 case 1')
## Warning in estimate_richness(ps_st, measures = c("Observed", "Shannon", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.
summary_table(results_st_sc2_case1, 'scenario 2 case 1')
SourceTracker scenario 2 case 1 summary
Proportion Removed
contaminants_removed 2.00 2.12 1.09 0.85 1.03 5.14 39.68 2.58 67.00
mock_ASVs_removed 0.0010 0.0010 0.0010 0.0017 0.0026 0.0032 0.0160 0.0068 0.0202
Percent Remaining after Contaminant Removal
con_abundance 0.049 0.108 1.728 4.606 11.729 27.096 52.346 55.090 57.170
total_mock_abundance 100 100 98 95 88 73 48 45 43
Diversity Estimates after Contaminant Removal
Observed 13 20 98 165 259 310 379 147 193
Shannon 1.9 1.9 2.0 2.2 2.5 3.4 4.3 4.1 4.4
InvSimpson 5.4 5.4 5.8 6.4 6.2 10.9 19.3 23.9 31.5
Mock Abundances after Contaminant Removal
ASV_2 26.2 20.9 22.7 22.0 13.7 14.5 8.0 9.4 8.2
ASV_7 2.5 3.5 3.2 3.0 4.0 2.8 1.7 1.4 3.4
ASV_10 1.49 1.65 1.41 1.38 1.02 1.31 0.46 1.20 2.64
ASV_1 22 30 25 22 29 20 14 13 7
ASV_3 21.6 14.4 16.7 15.8 8.2 12.0 3.9 5.6 4.8
ASV_4 6.7 14.2 12.2 15.1 21.0 9.8 13.8 8.8 10.4
ASV_5 11.0 8.0 9.0 7.9 4.6 6.1 2.8 2.5 1.4
ASV_6 5.2 5.2 5.4 5.7 5.9 5.0 2.9 3.4 5.0
ASV_8 3.20 2.06 2.30 2.33 1.04 1.65 0.64 0.00 0.00
Classification Performance
sensitivity 0.0200 0.0212 0.0109 0.0085 0.0103 0.0514 0.3968 0.0258 0.6700
specificty 1 1 1 1 1 1 1 1 1
accuracy 1.00 1.00 0.98 0.95 0.88 0.73 0.61 0.46 0.74
prevalence 0.0005 0.0011 0.0175 0.0464 0.1184 0.2815 0.6455 0.5573 0.8018
sample_names D0 D1 D2 D3 D4 D5 D6 D7 D8
plot_st_classes(mock_taxa,rec_otus,all_otus,'Scenario 2, Case 1' )
## Joining, by = c("reference", "id", "color", "sample", "count")

# Scenario 2, Case 4 - undefined experimental environment, blank control as contaminant environment In this case, the defined sources were:

The results are defined by:

# load results from running SourceTracker with only a black control defined (Case 4)
rec_otus <- load_sourcetracker("./Sourcetracker_b_30000/full_results/b_Unknown_contributions.txt", asv_key)

blank_otus <- load_sourcetracker("./Sourcetracker_b_30000/full_results/b_Blank_contributions.txt", asv_key)

# determine total amounts of conatmiants across the dataset
rec_con <- rec_otus[,!colnames(rec_otus) %in% mock_taxa] 
blank_con<- blank_otus[,!colnames(blank_otus) %in% mock_taxa] 
con_otus <- rec_con +  blank_con 
con_orig <- rowSums(con_otus)

# Identify mock community correctly and incorrectly classified
mock_c <- rowSums(rec_otus[,colnames(rec_otus) %in% mock_taxa])
mock_i <- rowSums(rec_otus[,!colnames(rec_otus) %in% mock_taxa])

# Identify contaminants correctly and incorrectly classified
blank_c <- rowSums(blank_otus[,!colnames(blank_otus) %in% mock_taxa])
blank_i <- rowSums(blank_otus[,colnames(blank_otus) %in% mock_taxa])

st_profile <- rbind(mock_i ,blank_i,mock_c, blank_c)

long_st_profile <- melt(data = st_profile, 
                 id.vars = rownames(), 
                 variable.name = colnames(), 
                 value.name = "Abundance"
                )

customPalette <- c('#969696','#bdbdbd', '#1B9E77', '#D95F02')
# Figure 5E
ggplot(long_st_profile, aes(x = Var2, y = Abundance)) + geom_col(aes(fill = Var1), width = 0.7, position = position_fill())  + scale_fill_manual(values=customPalette) + theme(text = element_text(size=12)) + labs(x = "Samples", y = 'Proportion of Reads')

all_otus <- rec_otus + blank_otus 
results_st_sc2_case2 <- summarize_st(rec_otus,all_otus,mock_taxa, 'scenario 2 case 2')
## Warning in estimate_richness(ps_st, measures = c("Observed", "Shannon", : The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.
summary_table(results_st_sc2_case2, 'scenario 2 case 2')
SourceTracker scenario 2 case 2 summary
Proportion Removed
contaminants_removed 0.000 0.278 0.455 0.149 0.134 0.264 0.256 0.053 0.105
mock_ASVs_removed 0.00100 0.00067 0.00136 0.00105 0.00227 0.00326 0.01805 0.00000 0.01675
Percent Remaining after Contaminant Removal
con_abundance 0.023 0.120 1.822 4.460 11.706 28.400 64.859 55.463 80.089
total_mock_abundance 100 100 98 96 88 72 35 45 20
Diversity Estimates after Contaminant Removal
Observed 13 19 94 161 258 310 381 147 193
Shannon 1.9 1.9 2.0 2.2 2.5 3.4 4.5 4.1 4.8
InvSimpson 5.4 5.4 5.8 6.4 6.2 11.2 28.2 24.5 70.7
Mock Abundances after Contaminant Removal
ASV_2 26.3 20.9 22.2 21.8 13.3 14.4 6.3 9.2 3.8
ASV_7 2.6 3.6 3.1 3.1 4.1 2.7 1.1 1.4 1.5
ASV_10 1.55 1.61 1.42 1.46 1.02 1.22 0.35 1.12 1.24
ASV_1 21.9 29.7 25.7 22.4 29.4 19.4 9.9 12.4 3.3
ASV_3 21.6 14.7 17.0 15.8 8.3 11.7 2.8 5.4 2.1
ASV_4 6.8 14.0 12.2 15.0 20.6 9.3 10.0 8.7 4.7
ASV_5 10.94 8.11 9.05 7.97 4.57 6.19 2.06 2.65 0.83
ASV_6 5.3 5.2 5.3 5.7 5.9 5.2 2.1 3.8 2.4
ASV_8 2.94 2.04 2.34 2.32 1.12 1.57 0.49 0.00 0.00
Classification Performance
sensitivity 0.00000 0.00278 0.00455 0.00149 0.00134 0.00264 0.00256 0.00053 0.00105
specificty 1 1 1 1 1 1 1 1 1
accuracy 1.00 1.00 0.98 0.96 0.88 0.72 0.35 0.45 0.20
prevalence 0.00023 0.00120 0.01830 0.04467 0.11720 0.28453 0.64913 0.55477 0.80103
sample_names D0 D1 D2 D3 D4 D5 D6 D7 D8
plot_st_classes(mock_taxa,rec_otus,all_otus,'Scenario 2, Case 2' )
## Joining, by = c("reference", "id", "color", "sample", "count")

Save results

vars_to_keep <- ls(pattern = 'results')
vars_to_rm <- ls()
vars_to_rm <-vars_to_rm[!vars_to_rm %in% vars_to_keep]
rm(list = vars_to_rm)
rm(vars_to_rm, vars_to_keep)
## Warning in rm(vars_to_rm, vars_to_keep): object 'vars_to_rm' not found
## Warning in rm(vars_to_rm, vars_to_keep): object 'vars_to_keep' not found
save.image('results_sourcetracker.RData')