Annotated Heatmaps
In this section we will use the models from the bootstrap analysis above and produce heatmaps of the models stable states and parameterization. Specifically, we will use the two CASCADE 2.0 model pools created, that have either only link-operator mutated or only topology-mutated models.
For both pools, a stable state heatmap will be produced (columns are nodes). For the first pool, the parameterization is presented with a link-operator heatmap (columns are nodes) and for the second pool as an edge heatmap (columns are edges).
Annotations
Pathways
Every node in CASCADE 2.0 belongs to a specific pathway, as can be seen in Fig. 1A of (Niederdorfer et al. 2020). The pathway categorization is a result of a computational analysis performed by the author of that paper and provided as file here.
We present the node and edge distribution across the pathways in CASCADE 2.0. For the edge pathway annotation, either both ends/nodes of an edge belong to a specific pathway and we use that label or the nodes belong to different pathways and the edge is labeled as Cross-talk:
::include_graphics(path = 'img/node_path_dist.png')
knitr::include_graphics(path = 'img/edge_path_dist.png') knitr
- \(\approx50\%\) of the edges are labeled Cross-talk
- Pathways with more nodes have also more edges between these nodes
Training Data
We annotate in the stable state heatmaps the states (activation or inhibition) of the nodes as they were in the AGS training data (annotation name is Training or Calibration).
Connectivity
We annotate each node’s in-degree or out-degree connectivity for the node-oriented heatmaps, i.e. the number of its regulators (in-degree) or the number of nodes it connects to (out-degree). For the edge-oriented heatmaps, we provide either the in-degree of each edge’s target or the out-degree of each edge’s source.
Drug Target
The CASCADE 2.0 model was used to test in-silico prediction performance of the drug combination dataset in (Flobak et al. 2019).
The drug panel used for the simulations involved \(18\) drugs (excluding the SF
drug), each one having one to several drug targets in the CASCADE 2.0 network.
We annotate this information on the combined stable states and parameterization heatmap, by specifying exactly which nodes are drug targets.
For the edge-oriented heatmaps, we specify either if an edge’s target, source, both or none of them is a drug target.
COSMIC Cancer Gene Census (CGC)
We annotate some of the CASCADE 2.0 nodes either as tumor suppressors (TSG), oncogenes or both, based on the COSMIC Cancer Gene Census dataset, an expert-curated effort to describe genes driving human cancer (Sondka et al. 2018).
We used the HGNC symbols of the CASCADE 2.0 nodes and via the HGNC REST API (Braschi et al. 2019), we got the respective Ensemble IDs that helped us match the genes in the COSMIC dataset. From the COSMIC genes that we got back, we did not restrict them to a particular cancer type, we discarded the Tier 2 (lower quality) genes, as well the ones that were annotated with only the term fusion in their respective cancer role. We kept cancer genes that were annotated as both TSGs and fusion or oncogenes and fusion genes and maintained only the TSG or oncogene annotation tag (discarding only the fusion tag). Lastly note that for the CASCADE 2.0 nodes that represent families, complexes, etc. and they include a few COSMIC genes in their membership, we use a majority rule to decide which cancer role to assign to these nodes.
See the script get_cosmic_data_annot.R for more details.
A total of \(52\) CASCADE 2.0 nodes were annotated and in the following figure we can see that most of them were oncogenes, followed by TSGs and a few annotated as both categories:
::include_graphics(path = 'img/cosmic_cascade2_dist.png') knitr
Agreement (Parameterization vs Activity)
For the link-operator CASCADE 2.0 nodes we calculate the percent agreement between link-operator parameterization (AND-NOT
=> \(0\), OR-NOT
=> \(1\)) and stable state activity (Inhibited state => \(0\), Active state => \(1\)).
The percent agreement is defined here as the number of matches (node has the link-operator AND-NOT
(resp. OR-NOT
) and it’s stable state activity value is \(0\) (resp. \(1\))) divided by the total amount of models (\(4500\)).
Note that the models of the link-operator Gitsbe pool from the previous bootstrap analysis that we will use for the heatmaps, had precicely \(1\) stable state each, making thus the aforementioned comparison/calculation much easier.
Link-Operator mutated models
See script lo_mutated_models_heatmaps.R for creating the heatmaps.
::include_graphics(path = 'img/lo_ss_heat.png') knitr
Calibrated models obey training data, i.e. stable states for nodes specified in training data, match the training data activity values
::include_graphics(path = 'img/cosmic_state_cmp.png') knitr
Nodes that are annotated as oncogenes have statistically higher average activity state than the ones annotated as TSGs (also evident partially from the stable states heatmap). The median values for the two groups in the boxplot are \(0.997\) (oncogene) and \(0.452\) (TSG) respectively.
The RTPK_g
node is an outlier in our data: it is annotated as an oncogene and in all models it had a stable state of \(0\).
::include_graphics(path = 'img/lo_heat.png') knitr
::include_graphics(path = 'img/lo_combined_heat.png') knitr
High degree of agreement between link-operator parameterization and stable state activity.
Nodes of the second cluster (the most heterogeneous one) where the parameterization and stable state activity seem to be assigned randomly (i.e. in \(\approx50\%\) of the total models a node is in an inhibited vs an active state, or has the AND-NOT
vs the OR-NOT
link-operator) we observe the high percent agreement scores.
In the third cluster, where the models nodes are mostly in an active state (obeying the training data), we observe that the parameterization does not affect the stable state activity and that these nodes have mostly low connectivity (\(<5\) regulators).
Topology-mutated models
See script topo_mutated_models_heatmaps.R for creating the heatmaps.
::include_graphics(path = 'img/topo_ss_heat.png') knitr
Calibrated models obey training data, i.e. stable states for nodes specified in training data, match the training data activity values.
::include_graphics(path = 'img/edge_heat.png') knitr
Looking at the above figure from left to right we have \(5\) clusters:
- First cluster has all the edges whose target has a single regulator and these are not removed by the topology mutations in Gitsbe, to preserve the network connectivity.
- Second cluster with edges that are mostly present in the topology-mutated models. These edges have two distinguished characteristics: they show high target connectivity (\(\ge 5\) regulators) - meaning that they target mostly hub-nodes - and their source and target nodes belong mostly to different pathways (i.e. they are Cross-talk edges).
- Third cluster with edges that have a ~50% percent chance to stay in the topology-mutated models. These edges belong to a variety of pathways and can have both low and high target in-degree connectivity.
- Fourth cluster with edges that will most likely be removed in the topology-mutated models. These edges belong to a variety of pathways and mostly have low target in-degree connectivity.
- Fifth cluster with edges that are mostly absent in the topology-mutated models. Some of them are high target-connectivity nodes and most of them belong to the TGF-b pathway.
Now, we present a subset of columns (edges) of the above heatmap, chosen based on some user-defined thresholds to include only the edges that are either mostly absent or present in the models (so the second and last cluster). We do not include the edges that are present in all models (cluster 1) since there were the ones whose target had only \(1\) regulator and as such they couldn’t be removed by the Gitsbe algorithm (we don’t lose connectivity when using topology mutations).
::include_graphics(path = 'img/edge_heat_stable.png') knitr
ERK performance investigation
We investigate the role of the activity of the ERK_f
node in the performance of boolean models with link operator mutations.
In the middle cluster of the stable states heatmap (see Figure 70) there were lot of nodes whose stable state activity was different for about half of the models.
For example, although ERK_f
is defined in an active state in the training data, we also get models after training that have it inhibited in their corresponding stable state.
Since these boolean models essentially represent AGS proliferating cancer cells, it’s remarkable that the literature curation results are also divided, i.e. around half of the publications support that activation of ERK_f
and the other its inhibition (see Table S2 from (Flobak et al. 2015)).
So, do we have a way to distinguish which of the two scenarios (ERK
active vs inhibited) better matches the observed biological reality?
Model-wise analyses provide evidence which suggests that the overexpression of ERK_f
is a characteristic biomarker of the higher performance AGS models in terms of synergy prediction, i.e. models that have ERK_f
active in their stable state can predict synergies that have been observed experimentally in the AGS cancer cells.
See methodology and how to reproduce the results here.
= readRDS(file = "data/res_erk.rds")
res
= RColorBrewer::brewer.pal(9, 'Set1')
set1_col
= res %>%
stat_test_roc ::wilcox_test(formula = roc_auc ~ erk_state) %>%
rstatix::add_significance("p")
rstatix
# ROC AUCs
ggboxplot(res, x = "erk_state", y = "roc_auc", fill = "erk_state",
palette = c(set1_col[3], set1_col[1]),
add = "jitter", xlab = "", ylab = "ROC AUC",
title = "ERK_f Pools Performance Comparison (ROC)") +
scale_x_discrete(breaks = c("active","inhibited"), labels = c("Active", "Inhibited")) +
::stat_pvalue_manual(stat_test_roc, label = "p = {p} ({p.signif})", y.position = c(1)) +
ggpubrylim(c(0,1)) +
theme(plot.title = element_text(hjust = 0.5), axis.text = element_text(size = 14)) +
geom_hline(yintercept = 0.5, linetype = 'dashed', color = "red") +
geom_text(aes(x = 2.1, y = 0.55, label = "Random Predictions (AUC = 0.5)")) +
theme(legend.position = "none")
= res %>%
stat_test_pr ::wilcox_test(formula = pr_auc ~ erk_state) %>%
rstatix::add_significance("p") %>%
rstatix::add_y_position()
rstatix
# PR AUCs
ggboxplot(res, x = "erk_state", y = "pr_auc", fill = "erk_state",
palette = c(set1_col[3], set1_col[1]),
add = "jitter", xlab = "", ylab = "PR AUC",
title = "ERK_f Pools Performance Comparison (Precision-Recall)") +
scale_x_discrete(breaks = c("active","inhibited"), labels = c("Active", "Inhibited")) +
::stat_pvalue_manual(stat_test_pr, label = "p = {p} ({p.signif})") +
ggpubrtheme(plot.title = element_text(hjust = 0.5), axis.text = element_text(size = 14)) +
# 6 out 153 drug combos were synergistic
geom_hline(yintercept = 6/153, linetype = 'dashed', color = "red") +
geom_text(aes(x = 1.05, y = 0.06, label = "Random Predictions (AUC = 0.04)")) +
theme(legend.position = "none")
Significantly better performance of the bootstrapped ensembles with ERK_f
active, suggesting that ERK should be considered active in AGS cells from a functional point of view.