Intro

This document includes the ensemble model analysis performed on the 7500 models generated by the Gitsbe module when running the DrugLogics computational pipeline for finding synergistic drug combinations. All these models were trained towards a consensus steady state signaling pattern. The input for the simulations and the output data are in the consensus-2500 directory (the 2500 number denotes the number of simulations executed). The analysis will be presented step by step in the sections below and it’s main focus is to investigate the association between the models’ stable states and their fitness to the consensus steady state activity pattern they are trained to.

Prerequisites

Firstly, we load the required libraries (you need to install them if you don’t have them):

library(circlize)
library(ComplexHeatmap)
library(usefun)
library(emba)

Input

Two inputs are used in this analysis:

  • The consensus steady state file from ~1000 cell lines, which lists the nodes of the signaling network along with their (predicted) activity state. The consensus steady state has been calculated using PARADIGM with the gene expression and CNV data from all cell lines of the CCLE, while the majority rule was used to derive the activity state of each node - meaning that more than half 1’s for one node in all cell lines would result in an activity state of 1 for that particular node in the consensus steady state profile.
  • The models directory, which is the same as the models directory produced by Gitsbe and has one .gitsbe file per model that includes the stable state of the boolean model. Note that a model can have 1 stable state or none in our simulations - but the models used in this analysis have been selected through a genetic evolution algorithm in Gitsbe and so in the end, only those with 1 stable state have higher fitness values and remain in the later generations. Higher fitness here means a better match of a model’s stable state to the consensus steady state (a perfect match would result in a fitness of 1)
data.dir = paste0(getwd(), "/")
models.dir = paste0(data.dir, "models")

Now, we parse the data into proper R objects. First, we get the full stable state per model:

models = get_model_names(models.dir)
nodes  = get_node_names(models.dir)

models.stable.state.file = paste0(data.dir, "models_stable_state")
models.stable.state = as.matrix(
  read.table(file = models.stable.state.file, check.names = FALSE)
)

# print the stable state of the first model (first 12 nodes)
pretty_print_vector_names_and_values(models.stable.state[1,], n = 12)

MAP3K7: 0, MAP2K6: 0, MAP2K3: 0, NLK: 0, MAP3K4: 0, MAP2K4: 1, IKBKG: 0, IKBKB: 1, AKT1: 0, BRAF: 0, SMAD3: 1, DAB2IP: 1

number.of.nodes = length(nodes)
number.of.models = length(models)

pretty_print_string(paste("Number of models:", number.of.models), with.gt = TRUE)

Number of models: 7500

pretty_print_string(paste("Number of nodes:", number.of.nodes), with.gt = TRUE)

Number of nodes: 139

Next, we get the consensus steady state. Note that the data and tools used (e.g.  PARADIGM, CCLE) to derive the activity states of the network nodes produce further nodes that are present in the consensus steady state vector (but not in our network) and so they need to be discarded. Also, we reorder the columns of the consensus vector so as to match the order of nodes as they are in the models.stable.state object:

consensus.steady.state.file = paste0(data.dir, "consensus_steady_state")
ss.df = read.table(consensus.steady.state.file, sep = "\t", stringsAsFactors = FALSE)
consensus.steady.state = ss.df[,2]
names(consensus.steady.state) = ss.df[,1]

consensus.steady.state = prune_and_reorder_vector(consensus.steady.state, nodes)

# print the consensus steady state (first 12 nodes)
pretty_print_vector_names_and_values(consensus.steady.state, n = 12)

MAP3K7: 0, MAP2K6: 0, MAP2K3: 0, NLK: 0, MAP3K4: 0, MAP2K4: 1, IKBKG: 0, IKBKB: 1, AKT1: 1, BRAF: 1, SMAD3: 1, DAB2IP: 0

As we can see, even for the first 12 nodes, the consensus steady state is different compared to the example stable state of the first model.

Analysis

Continuing, we compare each of the models’ stable state to the consensus one and find the percentage of matches for each model (fitness value). Then, we calculate the percentage of models that had a fitness match larger than a threshold value (e.g. \(0.55\) or more than \(55\%\) match) and we plot the models’ percentage results while ranging the fitness threshold value from 0 to \(100\%\):

models.fitness = 
  apply(models.stable.state, 1, get_percentage_of_matches, consensus.steady.state)

min.fitness = min(models.fitness)
max.fitness = max(models.fitness)
string = paste0("Minimum fitness: ", 
                specify_decimal(min.fitness, digits.to.keep = 4), ", ", 
                "Maximum fitness: ", 
                specify_decimal(max.fitness, digits.to.keep = 4))
pretty_print_string(string)

Minimum fitness: 0.5396, Maximum fitness: 0.6547

fitness.range = seq(0, 1, 0.001)
models.percentage = sapply(fitness.range, function(fitness.threshold) {
  return(sum(models.fitness > fitness.threshold)/number.of.models)
})

y.axis.values = pretty(models.percentage)
plot(fitness.range, models.percentage, type = "l", yaxt = "n",
     xlab = "Fitness Threshold value",
     ylab = "Percentage of models",
     main = "Model Exceedance of fitness threshold value")
axis(2, at = y.axis.values, lab = paste(y.axis.values * 100, "%"), las = 1)

As we can see in the figure above, the models generated by the evolutionary algorithm of Gitsbe - which are exactly the best fitting ones to the consensus steady state pattern across all simulations - have a fitness value between 0.54 and 0.65, meaning that in all models, more than half of the network nodes stabilize in the same activity state as the one from the consensus steady state (100% of the models exhibit a match of more than 50% to the consensus steady state). Using the same logic and the figure above, we can see that 50% of the models exhibit a match of more than 63.4%. So, the models produced in this analysis, while trained to a specific steady state signaling pattern, they cover a broader set of potential states, allowing thus a more robust drug response analysis in the respective module of the DrugLogics pipeline (Drabme).

states = apply(models.stable.state, 1, paste, collapse = "")

model.count.per.fitness = table(models.fitness)
number.of.unique.fitness.values = length(model.count.per.fitness)
number.of.unique.stable.states = length(unique(states))

pretty_print_string(paste0("Unique fitness values: ", 
  number.of.unique.fitness.values, ", Unique stable states found by the models: ", 
  number.of.unique.stable.states))

Unique fitness values: 17, Unique stable states found by the models: 1330

y.axis.values = pretty(model.count.per.fitness)
plot(names(model.count.per.fitness), model.count.per.fitness, 
     type = "b",  yaxt = "n",
     xlab = "Fitness",
     ylab = "Number of models",
     main = "Distribution of fitness among models")
axis(2, at = y.axis.values, las = 1)

Note that there are 17 unique fitness values included in the [0.54, 0.65] range, while the distribution of these values among the models can be seen in the figure above. Almost 3000 models had a fitness of ~0.64 to the consensus steady state pattern, marking the area of the highest fitness models in the figure. Furthermore, the number of unique stable states found by the models is 1330. This means that for a specific fitness value there will be many models that have a stable state corresponding to it (having the same percentage of matches for the activity states of the network nodes compared to the consensus state), but the activity of the nodes between these models will be different, allowing a more diverse attractor landscape to match a specific fitness. To see the difference between the stable states of the produced models relating to fitness, we produce the next heatmap:

# coloring
activity.colors = c("red", "lightyellow")
activity.col.fun = colorRamp2(breaks = c(0, 1), colors = activity.colors)
fitness.col.fun = 
  colorRamp2(breaks = c(min.fitness, (min.fitness + max.fitness)/2, max.fitness), 
             colors = c("red", "black", "green"))

# define the fitness color bar
fitness.annot = rowAnnotation(
  fitness = anno_simple(x = models.fitness, col = fitness.col.fun), 
            show_annotation_name = FALSE)

heatmap.ss = 
  Heatmap(matrix = models.stable.state, column_title = "Models Stable States",
    column_title_gp = gpar(fontsize = 20), column_names_gp = gpar(fontsize = 4),
    row_dend_width = unit(1, "inches"), column_dend_height = unit(1, "inches"),
    col = activity.col.fun, show_row_names = FALSE,
    left_annotation = fitness.annot, show_heatmap_legend = FALSE,
    use_raster = TRUE, raster_device = "png", raster_quality = 20)

# define the 2 legends
activity.state.legend = Legend(title = "Activity State",
                               labels = c("Inhibited", "Active"),
                               legend_gp = gpar(fill = activity.colors))
fitness.legend = Legend(title = "Fitness", col = fitness.col.fun)
legend.list = packLegend(activity.state.legend, fitness.legend, 
                         direction = "vertical")

draw(heatmap.ss, annotation_legend_list = legend.list, 
     annotation_legend_side = "right")

In the heatmap above, an activity state equal to zero is represented with red color while an activity state equal to 1 with light yellow. We note the prevalence of the highest fitness models across the heatmap as well their differences regarding the activity states of the network nodes. Though this sample of 7500 models represents the best models generated through a genetic algorithm that results in higher fitness models getting selected because they are more close to the consensus steady state pattern, they still show diverse individual stable state pattern characteristics.

R session info

xfun::session_info()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS, RStudio 1.2.5001

Locale:
  LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
  LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
  LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
  LC_PAPER=en_US.UTF-8       LC_NAME=C                 
  LC_ADDRESS=C               LC_TELEPHONE=C            
  LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

Package version:
  assertthat_0.2.1     backports_1.1.5      base64enc_0.1-3     
  BH_1.69.0.1          bibtex_0.4.2         circlize_0.4.8      
  Ckmeans.1d.dp_4.3.0  cli_1.1.0            clue_0.3-57         
  cluster_2.1.0        colorspace_1.4-1     compiler_3.6.1      
  ComplexHeatmap_2.0.0 crayon_1.3.4         digest_0.6.21       
  dplyr_0.8.3          ellipsis_0.3.0       emba_0.1.1          
  evaluate_0.14        fansi_0.4.0          gbRd_0.4-11         
  GetoptLong_0.1.7     GlobalOptions_0.1.1  glue_1.3.1          
  graphics_3.6.1       grDevices_3.6.1      grid_3.6.1          
  highr_0.8            htmltools_0.4.0      htmlwidgets_1.5.1   
  igraph_1.2.4.1       jsonlite_1.6         knitr_1.25          
  lattice_0.20.38      magrittr_1.5         markdown_1.1        
  Matrix_1.2.17        methods_3.6.1        mime_0.7            
  packrat_0.5.0        parallel_3.6.1       pillar_1.4.2        
  pkgconfig_2.0.3      plogr_0.2.0          png_0.1-7           
  purrr_0.3.2          R6_2.4.0             RColorBrewer_1.1-2  
  Rcpp_1.0.2           Rdpack_0.11-0        rje_1.10.10         
  rjson_0.2.20         rlang_0.4.0          rmarkdown_1.16      
  shape_1.4.4          stats_3.6.1          stringi_1.4.3       
  stringr_1.4.0        tibble_2.1.3         tidyselect_0.2.5    
  tinytex_0.16         tools_3.6.1          usefun_0.4.1        
  utf8_1.1.4           utils_3.6.1          vctrs_0.2.0         
  visNetwork_2.0.8     xfun_0.10            yaml_2.2.0          
  zeallot_0.1.0       
---
title: "Gitsbe Consensus Model Analysis"
author: "[John Zobolas](https://github.com/bblodfon)"
date: "Last updated: `r format(Sys.time(), '%d %B, %Y')`"
output:
  html_document:
    css: style.css
    theme: united
    toc: true
    toc_float: 
      collapsed: false
      smooth_scroll: true
    number_sections: false
    code_folding: hide
    code_download: true
---

```{r Render command, eval=FALSE, include=FALSE}
#rmarkdown::render(input = "./consensus_model_analysis.Rmd", output_format = "html_document", output_dir = "../docs/consensus-2500/")
```

## Intro {-}

This document includes the ensemble model analysis performed on the 7500 
models generated by the `Gitsbe` module when running the DrugLogics 
computational pipeline for finding synergistic drug combinations. All these 
models were trained towards a consensus steady state signaling pattern. The 
input for the simulations and the output data are in the `consensus-2500` 
directory (the 2500 number denotes the number of simulations executed). The 
analysis will be presented step by step in the sections below and it's **main
focus** is to investigate the association between the models' stable states and 
their fitness to the consensus steady state activity pattern they are trained to.

## Prerequisites {-}

Firstly, we load the required libraries (you need to install them if you don't
have them):
```{r Load libraries, message = FALSE}
library(circlize)
library(ComplexHeatmap)
library(usefun)
library(emba)
```

## Input {-}

**Two inputs** are used in this analysis:

- The consensus **steady state** file from ~1000 cell lines, which lists the 
nodes of the signaling network along with their (predicted) activity state. The
consensus steady state has been calculated using 
[PARADIGM](https://doi.org/10.1093/bioinformatics/btq182) with the gene 
expression and CNV data from all cell lines of the 
[CCLE](https://www.sevenbridges.com/ccle/), while the majority rule was used to 
derive the activity state of each node - meaning that 
more than half 1's for one node in all cell lines would result in an activity 
state of 1 for that particular node in the consensus steady state profile.
- The **models** directory, which is the same as the models directory produced 
by `Gitsbe` and has one `.gitsbe` file per model that includes the **stable 
state** of the boolean model. Note that a model can have 1 stable state or none 
in our simulations - but the models used in this analysis have been selected 
through a genetic evolution algorithm in `Gitsbe` and so in the end, only those 
with 1 stable state have higher fitness values and remain in the later 
generations. Higher fitness here means a better match of a model's stable state
to the consensus steady state (a perfect match would result in a fitness of 1)
```{r Input: files}
data.dir = paste0(getwd(), "/")
models.dir = paste0(data.dir, "models")
```
Now, we parse the data into proper R objects. First, we get the full stable 
state per model:
```{r Input: models stable states, results = 'asis'}
models = get_model_names(models.dir)
nodes  = get_node_names(models.dir)

models.stable.state.file = paste0(data.dir, "models_stable_state")
models.stable.state = as.matrix(
  read.table(file = models.stable.state.file, check.names = FALSE)
)

# print the stable state of the first model (first 12 nodes)
pretty_print_vector_names_and_values(models.stable.state[1,], n = 12)
```

```{r Model Stats, results='asis'}
number.of.nodes = length(nodes)
number.of.models = length(models)

pretty_print_string(paste("Number of models:", number.of.models), with.gt = TRUE)
pretty_print_string(paste("Number of nodes:", number.of.nodes), with.gt = TRUE)
```

Next, we get the consensus steady state. Note that the data and tools used (e.g. 
PARADIGM, CCLE) to derive the activity states of the network nodes produce further 
nodes that are present in the consensus steady state vector (but not in our 
network) and so they need to be discarded. Also, we reorder the columns of the 
consensus vector so as to match the order of nodes as they are in the 
`models.stable.state` object:
```{r Input: consensus steady state, results = 'asis'}
consensus.steady.state.file = paste0(data.dir, "consensus_steady_state")
ss.df = read.table(consensus.steady.state.file, sep = "\t", stringsAsFactors = FALSE)
consensus.steady.state = ss.df[,2]
names(consensus.steady.state) = ss.df[,1]

consensus.steady.state = prune_and_reorder_vector(consensus.steady.state, nodes)

# print the consensus steady state (first 12 nodes)
pretty_print_vector_names_and_values(consensus.steady.state, n = 12)
```

As we can see, even for the first 12 nodes, the consensus steady state is 
different compared to the example stable state of the first model.

## Analysis {-}

Continuing, **we compare each of the models' stable state to the consensus one and 
find the percentage of matches for each model (fitness value)**. Then, we 
calculate the percentage of models that had a fitness match larger than
a threshold value (e.g. $0.55$ or more than $55\%$ match) and we plot the models' 
percentage results while ranging the fitness threshold value from 0 to $100\%$:
```{r Model Exceedance, results = 'asis'}
models.fitness = 
  apply(models.stable.state, 1, get_percentage_of_matches, consensus.steady.state)

min.fitness = min(models.fitness)
max.fitness = max(models.fitness)
string = paste0("Minimum fitness: ", 
                specify_decimal(min.fitness, digits.to.keep = 4), ", ", 
                "Maximum fitness: ", 
                specify_decimal(max.fitness, digits.to.keep = 4))
pretty_print_string(string)
```

```{r Model Exceedance figure, results = 'asis'}
fitness.range = seq(0, 1, 0.001)
models.percentage = sapply(fitness.range, function(fitness.threshold) {
  return(sum(models.fitness > fitness.threshold)/number.of.models)
})

y.axis.values = pretty(models.percentage)
plot(fitness.range, models.percentage, type = "l", yaxt = "n",
     xlab = "Fitness Threshold value",
     ylab = "Percentage of models",
     main = "Model Exceedance of fitness threshold value")
axis(2, at = y.axis.values, lab = paste(y.axis.values * 100, "%"), las = 1)
```

As we can see in the figure above, the models generated by the evolutionary 
algorithm of `Gitsbe` - which are exactly the best fitting ones to the consensus
steady state pattern across all simulations - have a **fitness value between 
0.54 and 0.65**, meaning that in all models, more than half of the network nodes
stabilize in the same activity state as the one from the consensus steady state 
(100% of the models exhibit a match of more than 50% to the consensus steady 
state). Using the same logic and the figure above, we can see **that 50% of the 
models exhibit a match of more than 63.4%**. So, the models produced in this 
analysis, while trained to a specific steady state signaling pattern, they cover
a broader set of potential states, allowing thus a more robust drug response 
analysis in the respective module of the DrugLogics pipeline (`Drabme`).

```{r Distribution of fitness, results = 'asis'}
states = apply(models.stable.state, 1, paste, collapse = "")

model.count.per.fitness = table(models.fitness)
number.of.unique.fitness.values = length(model.count.per.fitness)
number.of.unique.stable.states = length(unique(states))

pretty_print_string(paste0("Unique fitness values: ", 
  number.of.unique.fitness.values, ", Unique stable states found by the models: ", 
  number.of.unique.stable.states))

y.axis.values = pretty(model.count.per.fitness)
plot(names(model.count.per.fitness), model.count.per.fitness, 
     type = "b",  yaxt = "n",
     xlab = "Fitness",
     ylab = "Number of models",
     main = "Distribution of fitness among models")
axis(2, at = y.axis.values, las = 1)
```

Note that there are **`r number.of.unique.fitness.values` unique fitness values** 
included in the [0.54, 0.65] range, while the distribution of these values among 
the models can be seen in the figure above. Almost **3000 models had a fitness 
of ~0.64** to the consensus steady state pattern, marking the area of the highest 
fitness models in the figure. Furthermore, **the number of unique stable states 
found by the models is `r number.of.unique.stable.states`**. This means that for 
a specific fitness value there will be many models that have a stable state 
corresponding to it (having the same percentage of matches for the activity 
states of the network nodes compared to the consensus state), but the activity 
of the nodes between these models will be different, allowing a more diverse 
attractor landscape to match a specific fitness. To see the difference between 
the stable states of the produced models relating to fitness, we produce the 
next heatmap:

```{r Models Stable States Heatmap, fig.height = 7, fig.width = 9, dpi = 300, cache=TRUE}
# coloring
activity.colors = c("red", "lightyellow")
activity.col.fun = colorRamp2(breaks = c(0, 1), colors = activity.colors)
fitness.col.fun = 
  colorRamp2(breaks = c(min.fitness, (min.fitness + max.fitness)/2, max.fitness), 
             colors = c("red", "black", "green"))

# define the fitness color bar
fitness.annot = rowAnnotation(
  fitness = anno_simple(x = models.fitness, col = fitness.col.fun), 
            show_annotation_name = FALSE)

heatmap.ss = 
  Heatmap(matrix = models.stable.state, column_title = "Models Stable States",
    column_title_gp = gpar(fontsize = 20), column_names_gp = gpar(fontsize = 4),
    row_dend_width = unit(1, "inches"), column_dend_height = unit(1, "inches"),
    col = activity.col.fun, show_row_names = FALSE,
    left_annotation = fitness.annot, show_heatmap_legend = FALSE,
    use_raster = TRUE, raster_device = "png", raster_quality = 20)

# define the 2 legends
activity.state.legend = Legend(title = "Activity State",
                               labels = c("Inhibited", "Active"),
                               legend_gp = gpar(fill = activity.colors))
fitness.legend = Legend(title = "Fitness", col = fitness.col.fun)
legend.list = packLegend(activity.state.legend, fitness.legend, 
                         direction = "vertical")

draw(heatmap.ss, annotation_legend_list = legend.list, 
     annotation_legend_side = "right")
```

In the heatmap above, an activity state equal to zero is represented with red 
color while an activity state equal to 1 with light yellow. We note the prevalence of 
the highest fitness models across the heatmap as well their differences regarding
the activity states of the network nodes. Though this sample of 7500 models 
represents the *best* models generated through a genetic algorithm that results 
in higher fitness models getting selected because they are more *close* to the 
consensus steady state pattern, they still show **diverse individual stable state 
pattern characteristics**.

## R session info {-}

```{r session info, comment=""}
xfun::session_info()
```