ComparingResults

Overview

Introduction

Increasingly there is interest in studying the microbiome from environments that contain few microbes (low microbial biomass). 16S rRNA gene sequencing allows researchers to study microbes from environments in a high throughput culture independent manner, but recent research has highlighted that this technique is prone to contaminants when applied to samples originating from low microbial biomass environments.

We recently performed a study to understand the impact of decreasing microbial biomass on 16S rRNA gene sequencing experiments and evaluate the current computational strategies to control for contaminants (preprint available here).

Here, we provide the compiled R Markdown document to reproduce the analysis presented in the manuscript, divided into 6 primary sections:

Introduction and evaluating contaminants
Removing contaminant ASVs with filtering methods
Removing contaminant ASVs with decontam
Removing contaminant ASVs with SourceTracker
Evaluating SourceTracker results
Final evaluation of all methods (this document)

Here, we provide the compiled R Markdown document to reproduce the analyses of the mock microbial community presented in the manuscript. Readers interested in the details of how to use the different methods should click on the respective links above.

Setup workspace

Load libraries used in this analysis
Load the results workspace from each contaminant removal method

# load libraries
library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(kableExtra)

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(ggplot2)
library(ggpubr)

## Loading required package: magrittr

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:tidyr':
## 
##     extract

library(knitr)
options(digits=2)
library(cowplot)

## 
## Attaching package: 'cowplot'

## The following object is masked from 'package:ggpubr':
## 
##     get_legend

## The following object is masked from 'package:ggplot2':
## 
##     ggsave

load('results_filtering.RData')
load('results_decontam.RData')
load('results_sourcetracker.RData')

This loaded a LOT of data! Each method that was evaluated has its own results variable. The method used is indicated in the variable name:

Variable name	Method	Parameter changed
results_blank	negative control	none
results_filter_001	abundance filter	abundance < 0.01 % removed
results_filter_01	abundance filter	abundance < 0.10 % removed
results_filter_1	abundance filter	abundance < 1.00 % removed
results_decontam_0_1	decontam frequency	threshold = 0.1
results_decontam_0_2	decontam frequency	threshold = 0.2
results_decontam_0_3	decontam frequency	threshold = 0.3
results_decontam_0_4	decontam frequency	threshold = 0.4
results_decontam_0_5	decontam frequency	threshold = 0.5
results_st_sc1_case1	SourceTracker	Scenario 1, case 1
results_st_sc1_case2	SourceTracker	Scenario 1, case 2
results_st_sc2_case1	SourceTracker	Scenario 2, case 1
results_st_sc2_case2	SourceTracker	Scenario 2, case 2

All of these objects have the same overall structure, but a few modifications are needed to make everything work together (and look nice).

Modifications

We need to make a few adjustments to the result data frames to make them easier to manage, summarize, and use for analysis.

These changes are documented as comments.

# Change the method in results_original to better match terminology used with other methods. (current is "Original Data")
results_original$method <- 'Uncorrected Results'

# Add prevalence to the original data and expected data (for plotting)
results_original$prevalence <- results_filter_0_1$prevalence
results_expected$prevalence <- results_filter_0_1$prevalence

Combine data

Working with the results in individual variables is probably do-able, but likely a headache. We will save ourselves by combining all the results_* objects into one object called results_all. This was kind of done in a brute force way, if anyone has suggestions to do this more elegantly, please share!

We also have a few modifications to the results_all data frame that to make the data cleaner/easier to work with down the line.

# create massive data table with all results
results_all <- bind_rows(results_original,results_expected, results_blank, results_decontam_0_1,results_decontam_0_2,results_decontam_0_3, results_decontam_0_4, results_decontam_0_5, results_filter_0_01, results_filter_0_1, results_filter_1, results_st_sc1_case1, results_st_sc1_case2,results_st_sc2_case1, results_st_sc2_case2 )

# change 'Blank control removal' to 'Negative control filter' to reflect the manuscript
results_all <- results_all %>%
  mutate(method = recode_factor(method,`Blank control removal`= 'Negative control filter'))

# change NAs to 0 (confrimed that NAs are only due to ASVs removed from blank control removal)
results_all[is.na(results_all)] <- 0

# resubset results_blank so it has all expected ASVs (ASVs that are not represented in the data set are now 0)
results_blank <- results_all %>%
  filter(method == 'Negative control filter')

# reorder to be contistent with the rest of the resulst file 
## figure out how to do this better
results_blank <- results_blank[,c(17,18,1:14,19:22,15,16)]

# order the methods 
results_all$method <- factor(results_all$method, levels = c("Uncorrected Results", "Expected Results" ,"Negative control filter", "Abundance filter, 0.01","Abundance filter, 0.1", "Abundance filter, 1", "Decontam frequency, thr =0.1", "Decontam frequency, thr =0.2", "Decontam frequency, thr =0.3", "Decontam frequency, thr =0.4", "Decontam frequency, thr =0.5", "SourceTracker, scenario 1 case 1", "SourceTracker, scenario 1 case 2","SourceTracker, scenario 2 case 1", "SourceTracker, scenario 2 case 2"))

# create a column for contaminant removal method type
results_all <- results_all %>%  mutate(method_type = gsub(',.*', '', method))

# order the method types
results_all$method_type <- factor(results_all$method_type, levels = c("Uncorrected Results", "Expected Results" ,"Negative control filter", "Abundance filter", "Decontam frequency","SourceTracker"))

Create subsets of the results

The results_all variable contains all of our results in one nice organized data.frame. However, this is not the best format for everything.

Here, we subset the data into 1. results_asv_long - This variable has only the ASV abundance and associated data in long format to make it easy to work with ggplot.
2. results_limited - This variable has only a subset of samples spanning low, moderate, and high levels of contaminants (represented by D3, D6, and D8) 3. results_asv_limited_long - This variable has only the ASV abundance with associated data from a subset of dilution series samples and is in long format

# Create a pretty subset of ASV-only data for plotting
results_asv_long <- results_all %>% 
  select(contains("ASV_"),sample_names, method, method_type) %>%
  gather(key = 'ASV',value = 'Abundance', -sample_names, -method, - method_type)

# Create a small subset of the result data representing low, moderate, and high levels of contaminants for plotting 
results_limited <- results_all %>%
  filter(sample_names %in% c('D3','D6','D8'))

results_asv_limited_long <-  results_limited %>%
  select(contains("ASV_"),sample_names, method, method_type) %>%
gather(key = 'ASV',value = 'Abundance', -sample_names, -method, - method_type)

Summarize results - Alpha diversity measures

We evaluated the effect of contaminant ASVs on the summary metrics typically used in microbiome studies - alpha diversity metrics such as the number of Observed ASVs, Shannon Diversity, and Inverse Simpson indices.

# Alpha diversity dataframe
results_all_alpha <- results_all %>% filter (method_type %in% c('Expected Results', 'Uncorrected Results','Negative control filter','Abundance filter','Decontam frequency','SourceTracker')) %>%
  select(Observed, InvSimpson, Shannon,method, method_type, sample_names, prevalence) %>%
  gather(key = 'alpha_diversity_measure',value = 'Value', -sample_names, -method, - method_type, - prevalence) %>%
  mutate(alpha_diversity_measure = factor(alpha_diversity_measure, levels = c('Observed', 'InvSimpson', 'Shannon')))


# Figure 2 - Alpha diversity plots for expected and uncorrected results 
figure_2 <- results_all_alpha %>% filter (method_type %in% c('Expected Results', 'Uncorrected Results')) %>%
  ggplot(., aes(x = sample_names, y = Value, shape = method)) +
  geom_point() + facet_wrap(~alpha_diversity_measure, scales = "free_y") +
  theme_bw() +
  theme(legend.position = "right", legend.title = element_text(size = 10),
        axis.text = element_text(size = 8),
        axis.title = element_text(size = 10),
        plot.title = element_text(size = 10),
        strip.text.x = element_text(size = 10)) +
  labs(x = "Sample") +
  ggtitle('Figure 2')

figure_2

From the above plot, we can see that leaving contaminants unaccounted for can lead to drastically inflated values of alpha diversity.

Summarize results - ASV classification

To evaluate the overall performance of each contaminant removal method, ASVs were classified as being correctly or incorrectly identified as a contaminant ASV (an ASV that was not expected to be part of the mock community ) or a mock community ASV (an ASV that is expected to occur in the mock community.

library(gtable)  
library(grid)
customPalette <- c('#969696','#bdbdbd', '#d6604d', '#4393c3')

### Individual plots for publication
  # Filter methods
  classification_data <- results_all %>% 
    mutate(method = plyr::revalue(method, c("Uncorrected Results" = "A. None", "Expected Results" = "Expected Results",  "Negative control filter" = "B. Negative control ", "Abundance filter, 0.01"  = "C. Abundance, 0.01" , "Abundance filter, 0.1" = "D. Abundance, 0.1" ,"Abundance filter, 1"  = "E. Abundance, 1.0" , "Decontam frequency, thr =0.1" = "F. Threshold = 0.1", "Decontam frequency, thr =0.2" = "G. Threshold = 0.2", "Decontam frequency, thr =0.3" = "H. Threshold = 0.3",  "Decontam frequency, thr =0.4" = "I. Threshold = 0.4", "Decontam frequency, thr =0.5" = "J. Threshold = 0.5" ,  "SourceTracker, scenario 1 case 1" =  "K. S1, case 1", "SourceTracker, scenario 1 case 2" = "L. S1, case 2", "SourceTracker, scenario 2 case 1" =  "M. S2, case 1", "SourceTracker, scenario 2 case 2" =  "N. S2, case 2")))
  
  
  long_profile <- classification_data %>%
    filter(method_type %in% c('Uncorrected Results', 'Abundance filter', 'Negative control filter')) %>%
  select(true_pos, true_neg, false_pos, false_neg, sample_names, method, method_type) %>%
    gather(key = 'SequenceClass',value = 'Abundance', -sample_names, -method, - method_type)

  # Figures
  classificationPlot_filter <- ggplot(long_profile, aes(x = sample_names, y = Abundance)) + facet_wrap(~method, ncol = 5) + 
    geom_col(aes(fill = SequenceClass), width = 0.7, position = position_fill())  +
    scale_fill_manual(values=customPalette, ) + theme(text = element_text(size=9)) +
    labs(x = NULL, y = 'Proportion of Reads') +
       theme(legend.position = "none", 
        axis.text = element_text(size = 9),
        axis.text.x = element_text(angle = 90, hjust = 1),
        axis.title = element_text(size = 9),
        plot.title = element_text(size = 9))
 # Decontam methods
  long_profile <- classification_data %>%
  filter(method_type %in% c('Decontam frequency')) %>%
  select(true_pos, true_neg, false_pos, false_neg, sample_names, method, method_type) %>%
  gather(key = 'SequenceClass',value = 'Abundance', -sample_names, -method, - method_type)

  # Figures
  classificationPlot_decontam <- ggplot(long_profile, aes(x = sample_names, y = Abundance)) + facet_wrap(~method, ncol = 5) + 
    geom_col(aes(fill = SequenceClass), width = 0.7, position = position_fill())  +
    scale_fill_manual(values=customPalette) + theme(text = element_text(size=9)) +
    labs(x = NULL, y = 'Proportion of Reads') +
       theme(legend.position = "none",
        axis.text = element_text(size = 9),
        axis.text.x = element_text(angle = 90, hjust = 1),
        axis.title = element_text(size = 9),
        plot.title = element_text(size = 9))
  
  # SourceTracker
  long_profile <- classification_data %>%
  filter(method_type %in% c('SourceTracker')) %>%
  select(true_pos, true_neg, false_pos, false_neg, sample_names, method, method_type) %>%
  gather(key = 'SequenceClass',value = 'Abundance', -sample_names, -method, - method_type)

  # Figures
  classificationPlot_sourcetracker <- ggplot(long_profile, aes(x = sample_names, y = Abundance)) + facet_wrap(~method, ncol = 4) + 
    geom_col(aes(fill = SequenceClass), width = 0.7, position = position_fill())  +
    scale_fill_manual(values=customPalette, labels = c("Mock \n(incorrect)","Contaminants \n(incorrect)", "Mock \n(correct)", "Contaminant \n(correct)")) + 
    guides(color = guide_legend(override.aes = list(size=7))) +
    theme(text = element_text(size=9)) +
    labs(x = "Sample", y = 'Proportion of Reads') +
       theme(legend.position = "right", 
        axis.text = element_text(size = 9),
        axis.text.x = element_text(angle = 90, hjust = 1),
        axis.title = element_text(size = 9),
        plot.title = element_text(size = 9),
        legend.justification='left',
        legend.margin = margin(0, unit='cm'))

figure_3 <- ggdraw() +
  draw_plot(classificationPlot_filter, x = 0, y = .68, width = 1, height = .3) +
  draw_plot(classificationPlot_decontam, x = 0, y = .35, width = 1, height = .3) +
  draw_plot(classificationPlot_sourcetracker, x = 0, y = 0, width = 1, height = 0.33) +
  draw_plot_label(label = c('Figure 3','Filter', 'Decontam frequency', 'SourceTracker'), size = 12,
                  x = c(.75,0, 0, 0), y = c(1,1, .67, .35), hjust = 0)
figure_3

Summarize results - Classification accuracy

We wanted to succinctly be able to evaluate the performance of each method and compare across methods. There are many ways to do this, we chose to use a classification scheme and evaluate performance based on common classification metrics which are commonly used and easy to interpret such as accuracy.

Accuracy measures the proportion of classifications that were correct. In our case, this is the number of ASVs correctly classified as contaminant ASVs (True Positives) AND number of ASVs correctly classified as mock community ASVs (True Negatives).

We summarized how accurate each method was by plotting the accuracy by the prevalence of contaminant ASVs (in the original sample) for each method.

# Figure 4 - Accuracy faceted by method type (in manuscript)
color_palette <- c(`1` ="#3182bd", `2` ="#fd8d3c" , `3` = "#d95f02", `4` =  "#a63603", `5` =  "#bcbddc" ,`6` = "#9e9ac8", `7`= "#807dba", `8` ="#6a51a3" , `9` = "#4a1486", `10`=  "#99d8c6", `12` = "#66c2a4", `13` ="#2ca25f", `14` = "#006d2c")

figure_4 <- results_all %>% filter (method_type %in% c('Negative control filter','Abundance filter','Decontam frequency','SourceTracker')) %>%
 ggplot(., aes(x = prevalence, y = accuracy, color = method)) +
  geom_line() + geom_point() + facet_grid(~method_type) +
  theme_bw() +
  theme(legend.position = "right", legend.title = element_text(size = 10),
        legend.text=element_text(size=8),
        axis.text = element_text(size = 10),
        axis.title = element_text(size = 10),
        plot.title = element_text(size = 8),
        strip.text.x = element_text(size = 10),
        legend.key.size = unit(.01, 'lines'),
        panel.spacing = unit(.3, "lines"),
        legend.margin = margin(l = 0, unit = 'cm' )) + 
  scale_color_manual(values = unname(color_palette)) +
  labs(x = "Contaminant Prevalence", y = "Accuracy") +
  ggtitle('Figure 4')
      
figure_4

From this graph, we see that the negative control filter performed poorly regardless of the level of contaminants. All other methods worked well (with an accuracy near 1) for low levels of contaminants (less than 5% contaminant ASVs). All methods start declining once the amount of contaminants is greater than 20%, though the sourcetracker method with well-defined experimental environments, abundance filter of 1%, and decontam with a threshold of 0.5 work the best.

Sensitivity

We also evaluated the ability of each method to predict contaminants correctly by measuring the sensitivity of each method.

# Sensitivity (not in manuscript)
results_all %>% filter (method_type %in% c('Negative control filter','Abundance filter','Decontam frequency','SourceTracker')) %>%
ggplot(., aes(x = prevalence, y = sensitivity, color = method)) +
  geom_line() + geom_point() + facet_grid(~method_type) +
  theme(legend.position = "right", legend.title = element_text(size = 10),
        axis.text = element_text(size = 10),
        axis.title = element_text(size = 10),
        plot.title = element_text(size = 10),
        strip.text.x = element_text(size = 10)) +
        scale_color_manual(values = unname(color_palette))

Impact of removing contamiants from 16S rRNA gene sequencing data

To demonstrate the impact of contaminant removal on microbiome analyses, we present the relative abundance data after contaminant removal. The full results are in the supplemental materials (Supplemental Figure 1), and a subset is presented in the main manuscript.

# representative
fig_5a <- results_asv_limited_long %>% filter(ASV == 'ASV_1') %>% 
filter (method_type %in% c('Abundance filter','Decontam frequency','SourceTracker', 'Negative control filter')) %>%
 ggplot(., aes(x = method, y = Abundance, fill = method)) +   geom_bar(stat = 'identity') + facet_wrap(~sample_names , ncol = 9) + theme(axis.text.x = element_text(angle = 90, hjust = 1))  + ylim(0,35) +   scale_fill_manual(values = unname(color_palette)) +
  geom_hline(data = results_asv_limited_long %>% filter(method %in% c('Expected Results'), ASV == 'ASV_1'), aes(yintercept = Abundance, linetype ="Expected Results*"),  show.legend = TRUE, color = 'black') +
  theme(legend.position = "none",
        axis.text.y = element_text(size = 12),
        axis.title.y = element_text(size = 12),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(size = 12))

# representative
fig_5b <- results_asv_limited_long %>% filter(ASV == 'ASV_2') %>% 
filter (method_type %in% c('Abundance filter','Decontam frequency','SourceTracker', 'Negative control filter')) %>%
 ggplot(., aes(x = method, y = Abundance, fill = method)) +   
  geom_bar(stat = 'identity') + facet_wrap(~sample_names , ncol = 9) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  + ylim(0,30) + 
  scale_fill_manual(values = unname(color_palette)) +
  geom_hline(data = results_asv_limited_long %>% filter(method %in% c('Expected Results'), ASV == 'ASV_2'), aes(yintercept = Abundance, linetype ="Expected Results*"),  show.legend = TRUE, color = 'black') +
  theme(legend.position = "none",
        axis.text.y = element_text(size = 12),
        axis.title.y = element_text(size = 12),
        axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(size = 12))

# representative
fig_5c <- results_limited %>% 
filter (method_type %in% c('Abundance filter','Decontam frequency','SourceTracker', 'Negative control filter')) %>%
  ggplot(., aes(x = method, y = InvSimpson, fill = method) )+   geom_bar( stat = "identity") +  facet_wrap(~sample_names , ncol = 9) + theme(axis.text.x = element_text(angle = 90, hjust = 1))  + scale_fill_manual(values = unname(color_palette)) + geom_hline(data = results_limited %>% filter(method %in% c('Expected Results')), aes(yintercept = InvSimpson, linetype ="Expected Results*"),  show.legend = TRUE, color = 'black') +
  theme(legend.position = "none",
        axis.text.y = element_text(size = 12),
        axis.title.y = element_text(size = 12),
        axis.title.x = element_text(size = 12),
        axis.text.x =  element_text(size = 10),
        plot.title = element_text(size = 12))

figure_5 <- ggdraw() +
  draw_plot(fig_5a, x = 0, y = .78, width = 1, height = .19) +
  draw_plot(fig_5b, x = 0, y = .56, width = 1, height = .19) +
  draw_plot(fig_5c, x = 0, y = 0, width = 1, height = 0.53) +
  draw_plot_label(label = c('Figure 5','A. Relative abundance of Lactobacillus', 'B. Relative abundance of Escherichia/Shigella', 'C. Inverse Simpson Index'), size = 12,
                  x = c(.75,0, 0, 0), y = c(1,1, .77, .55), hjust = 0)

figure_5

Full composition plot (Supplemental Figure 1)

customPalette <- c("#1F78B4", "#A6CEE3", "#33A02C", "#B2DF8A", "#E31A1C", "#FB9A99", "#FF7F00", "#FDBF6F", "#6A3D9A", "#d3d3d3")

### Individual plots for publication
  
  long_profile <- classification_data %>%
    filter(method_type %in% c('Uncorrected Results', 'Abundance filter', 'Negative control filter')) %>%
  select(ASV_1, ASV_2, ASV_3, ASV_4, ASV_5, ASV_6, ASV_7, ASV_8, ASV_10, con_abundance,sample_names, method, method_type) %>%
    gather(key = 'SequenceClass',value = 'Abundance', -sample_names, -method, - method_type)

  # Figures
  composition_filter <- ggplot(long_profile, aes(x = sample_names, y = Abundance)) + facet_wrap(~method, ncol = 5) + 
    geom_col(aes(fill = SequenceClass), width = 0.7, position = position_fill())  +
    scale_fill_manual(values=customPalette, ) +
    labs(x = NULL, y = 'Proportion of Reads') +
       theme(legend.position = "none", 
        axis.text = element_text(size = 7),
        axis.title = element_text(size = 8),
        strip.text = element_text(size=8))

   # Decontam methods
  long_profile <- classification_data %>%
  filter(method_type %in% c('Decontam frequency')) %>%
    select(ASV_1, ASV_2, ASV_3, ASV_4, ASV_5, ASV_6, ASV_7, ASV_8, ASV_10, con_abundance,sample_names, method, method_type) %>%
  gather(key = 'SequenceClass',value = 'Abundance', -sample_names, -method, - method_type)

  # Figures
  composition_decontam <- ggplot(long_profile, aes(x = sample_names, y = Abundance)) + facet_wrap(~method, ncol = 5) + 
    geom_col(aes(fill = SequenceClass), width = 0.7, position = position_fill())  +
    scale_fill_manual(values=customPalette)  +
    labs(x = NULL, y = 'Proportion of Reads') +
       theme(legend.position = "none",
        axis.text = element_text(size = 7),
        axis.title = element_text(size = 8),
        strip.text = element_text(size=8))
  
  # SourceTracker
  long_profile <- classification_data %>%
  filter(method_type %in% c('SourceTracker')) %>%
    select(ASV_1, ASV_2, ASV_3, ASV_4, ASV_5, ASV_6, ASV_7, ASV_8, ASV_10, con_abundance,sample_names, method, method_type) %>%
  gather(key = 'SequenceClassification',value = 'Abundance', -sample_names, -method, - method_type)

  # Figures
  composition_sourcetracker <- ggplot(long_profile, aes(x = sample_names, y = Abundance)) + facet_wrap(~method, ncol = 4) + 
    geom_col(aes(fill = SequenceClassification), width = 0.7, position = position_fill())  +
    scale_fill_manual(values=customPalette,  labels = c('Lactobacillus','Staphylococcus','Escherichia/Shigella','Salmonella ASV1','Bacillus','Pseudomonas','Listeria','Enterococcus','Salmonella ASV2','Contaminants')) + theme(text = element_text(size=8)) +
    labs(x = "Sample", y = 'Proportion of Reads') +
       theme(legend.position = "right", 
             legend.title = element_text(size = 8),
             axis.text = element_text(size = 7),
             axis.title = element_text(size = 10),
            strip.text = element_text(size=8))

  figure_S1 <- ggdraw() +
  draw_plot(composition_filter, x = 0, y = .68, width = 1, height = .3) +
  draw_plot(composition_decontam, x = 0, y = .35, width = 1, height = .3) +
  draw_plot(composition_sourcetracker, x = 0, y = 0, width = 1, height = 0.33) +
  draw_plot_label(label = c('Figure S1','Filter method', 'Decontam frequency method', 'SourceTracker method'), size = 12,
                  x = c(.75,0, 0, 0), y = c(1,1, .67, .35), hjust = 0)
figure_S1

Impact on alpha diversity measures

We also evaluated the alpha diversity measures after removing contaminants (Supplemental Figure 2).

# Alpha diversity for all (supplemental figure 2)
fig_observed <- results_all_alpha %>% filter (method_type %in% c('Negative control filter','Abundance filter','Decontam frequency','SourceTracker')) %>%
  filter(alpha_diversity_measure == 'Observed') %>%
ggplot(., aes(x = sample_names, y = Value, color = method)) +
  geom_point() + facet_wrap(~method_type, ncol = 6) +
  theme(legend.position = "none",
        axis.text = element_text(size = 8),
        axis.title = element_text(size = 10),
        plot.title = element_text(size = 8),
        strip.text.x = element_text(size = 8)) +
  labs(x = "Sample", y = 'Observed ASVs') +
  scale_color_manual(values = unname(color_palette))

fig_invSimpson <- results_all_alpha %>% filter (method_type %in% c('Negative control filter','Abundance filter','Decontam frequency','SourceTracker')) %>%
  filter(alpha_diversity_measure == 'InvSimpson') %>%
ggplot(., aes(x = sample_names, y = Value, color = method)) +
  geom_point() + facet_wrap(~method_type, ncol = 4) +
  theme(legend.position = "right", 
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 8),
        axis.text = element_text(size = 6),
        axis.title = element_text(size = 10),
        plot.title = element_text(size = 8),
        strip.text.x = element_text(size = 8)) +
   labs(x = "Sample", y = 'Inverse Simpson Index') +
  scale_color_manual(values = unname(color_palette))

fig_shannon <-results_all_alpha %>% filter (method_type %in% c('Negative control filter','Abundance filter','Decontam frequency','SourceTracker')) %>%
  filter(alpha_diversity_measure == 'Shannon') %>%
ggplot(., aes(x = sample_names, y = Value, color = method)) +
  geom_point() + facet_wrap(~method_type, ncol = 4) +
  theme(legend.position = "none",
        axis.text = element_text(size = 6),
        axis.title = element_text(size = 10),
        plot.title = element_text(size = 8),
        strip.text.x = element_text(size = 8)) +
  labs(x = "Sample", y = 'Shannon Index') +
  scale_color_manual(values = unname(color_palette))

figure_S2 <- ggdraw() +
  draw_plot(fig_observed, x = 0, y = .68, width = .72, height = .3) +
  draw_plot(fig_invSimpson, x = 0, y = .35, width = 1, height = .3) +
  draw_plot(fig_shannon, x = 0, y = 0, width = .72, height = 0.33) +
  draw_plot_label(label = c('Figure S2'), size = 10,
                  x = c(.75), y = c(1), hjust = 0)
figure_S2