SF295 Model Analysis

This chapter includes the ensemble model analysis performed on the models generated by the Gitsbe module when running the DrugLogics computational pipeline for finding synergistic drug combinations (drug pairs). All these models were trained towards a specific steady state signaling pattern that was derived based on input data (gene expression, CNV) for the SF295 cell line, the (Glioblastoma, a brain cancer), the use of the PARADIGM software (Vaske et al. 2010) and a topology that was build for simulating a cancer cell fate decision network. The input for the simulations and the output data are in the cell-lines-2500 directory (the 2500 number denotes the number of simulations executed). The analysis will be presented step by step in the sections below.

Input

We will define the name of the cell line which must match the name of the directory that has the input files inside the cell-lines-2500 directory. Our analysis in this chapter will be done on the data for the SF295 cell line:

Three inputs are used in this analysis:

  • The model_predictions file which has for each model the prediction for each drug combination tested (0 = no synergy predicted, 1 = synergy predicted, NA = couldn’t find stable states in either the drug combination inhibited model or in any of the two single-drug inhibited models)
  • The observed_synergies file which lists the drug combinations that were observed as synergistic for the particular cell line.
  • The models directory, which is the same as the models directory produced by Gitsbe and has one .gitsbe file per model that includes this info:
    • The stable state of the boolean model. Note that a model can have 1 stable state or none in our simulations - but the models used in this analysis have been selected through a genetic evolution algorithm in Gitsbe and so in the end, only those with 1 stable state have higher fitness values and remain in the later generations. Higher fitness here means a better match of a model’s stable state to the cell line derived steady state (a perfect match would result in a fitness of 1)
    • The boolean equations of the model

Now, we parse the data into proper R objects. First the synergy predictions per model:

5Z-AK: 0, 5Z-BI: NA, 5Z-CT: NA, 5Z-PD: 0, 5Z-PI: 0, 5Z-PK: 0, 5Z-JN: 0, 5Z-D1: 0, 5Z-60: 0, 5Z-SB: 0, 5Z-RU: 0, 5Z-D4: 0

So, the model.predictions object has the models as rows and each column is a different drug combination that was tested in our simulations.

Drug combinations tested: 153
Number of models: 7500
Number of nodes: 139

Next, we get the full stable state and the equations per model:

MAP3K7: 0, MAP2K6: 0, MAP2K3: 0, NLK: 0, MAP3K4: 0, MAP2K4: 1, IKBKG: 0, IKBKB: 1, AKT1: 0, BRAF: 0, SMAD3: 0, DAB2IP: 1

The rows of the models.stable.state object represent the models while its columns are the names of the nodes (proteins, genes, etc.) of the cancer cell network under study. So, each model has one stable state which means that in every model, the nodes in the network have reached a state of either 0 (inhibition) or 1 (activation).

MAP3K4: 1, MAP2K4: 1, IKBKB: 1, AKT1: 0, SMAD3: 0, GSK3B: 1, RAF1: 0, GAB2: 0, CTNNB1: 1, NR3C1: 0, CREB1: 0, RAC1: 1

For the models.equations, if we look at a specific row (a model so to speak), the columns (node names) correspond to the targets of regulation (and the network has been built so that every node is a target - i.e. it has other nodes activating and/or inhibiting it). The general form of a boolean equation is:

Target *= (Activator OR Activator OR…) AND NOT (Inhibitor OR Inhibitor OR…)

The difference between the models’ boolean equations is the link operator (OR NOT/AND NOT) which has been mutated (changed) through the evolutionary process of the genetic algorithm in Gitsbe. For example, if a model has for the column ERK_f (of the models.equations object) a value of 1, the correspoding equation is: ERK_f *= (MEK_f) OR NOT ((DUSP6) OR PPP1CA). A value of 0 would correspond to the same equation but having AND NOT as the link operator: ERK_f *= (MEK_f) AND NOT ((DUSP6) OR PPP1CA). Note that the equations that do not have link operators (meaning that they are the same for every model) are discarded (so less columns in this dataset) since in a later section we study only the equations whose link operators differentiate between the models.

Lastly, the synergies observed for this particular cell line are:

23 observed synergies: AK-60, PI-60, AK-BI, AK-D1, BI-D1, JN-D1, PI-D1, AK-G2, PI-G2, AK-JN, BI-JN, PI-JN, AK-P5, BI-P5, D1-P5, G2-P5, JN-P5, PI-P5, AK-PI, BI-PI, CT-PI, AK-ST, PK-ST

Performance Statistics

It will be interesting to know the percentage of the above observed synergies that were actually predicted by at least one of the models (there might be combinations that no model in our dataset could predict): combinations that no model in our dataset could predict):

5 predicted synergies: BI-PI, BI-D1, PI-D1, JN-D1, D1-P5

Percentage of True Positive predicted synergies: 21.74%

So, for this particular cell line, there were indeed observed synergies that no model could predict (e.g.  AK-BI, AK-PI). Next, we would like to know the maximum number of observed synergies predicted by one model alone - can one model by itself predict all the true positive synergies predicted by all the models together or do we need many models to capture this diverse synergy landscape? To do that, we go even further and count the number of models that predict a specific set of observed synergies for every possible combination subset of the predicted.synergies object:

From the above figure (where we excluded sets of synergies that were predicted by no model by setting the threshold.for.subset.removal value to 1) we observe that:

  • Most of the models predict none of the observed synergies
  • The PI-D1 synergy is predicted by almost all the rest of the models

Next we calculate the maximum number of correctly predicted synergies (\(TP\) - True Positives) per model:

To summarize:

  • There were only 17 models that predicted 3 synergies - the set BI-PI,BI-D1,PI-D1 - which is the maximum number of predicted synergies by an individual model
  • No model could predict all 5 of the total predicted synergies

The power of the ensemble model approach lies in the fact that (as we saw from the above figures) even though we may not have individual super models that can predict many observed drug combinations, there are many that predict at least one and which will be used by the drug response analysis module (Drabme) to better infer the synergistic drug combinations. It goes without saying though, that the existance of models that could predict more than a handful of synergies would be beneficial for any approach that performs drug response analysis on a multitude of models.

Biomarker analysis

Intro-Methods

Now, we want to investigate and find possible important nodes - biomarkers - whose activity state either distinguishes good performance models from less performant ones (in terms of a performance metric - e.g. the true positive synergies predicted) or makes some models predict a specific synergy compared to others that can’t (but could predict other synergies). So, we devised two strategies to split the models in our disposal to good and bad ones (but not necessarily all of them), the demarcation line being either a performance metric (the number of \(TP\) or the Matthews Correlation Coefficient score) or the prediction or not of a specific synergy. Then, for each group of models (labeled as either good or bad) we find the average activity state of every node in the network (value between 0 and 1) and then we compute the average state difference for each node between the two groups: \(\forall i\in nodes,mean(state_i)_{good} - mean(state_i)_{bad}\). Our hypothesis is that if the absolute value of these average differences are larger than a user-defined threshold (e.g. 0.7) for a small number of nodes (the less the better) while for the rest of the nodes they remain close to zero, then the former nodes are considered the most important since they define the difference between the average bad model and the average good one in that particular case study. We will also deploy a network visualization method to observe these average differences.

True Positives-based analysis

Using our first strategy, we will split the models based on the number of true positive predictions. For example, the bad models will be the ones that predicted 0 \(TP\) synergies whereas the good models will be the ones that predicted 2 \(TP\) (we will denote the grouping as \((0,2)\)). This particular classification strategy will be used for every possible combination of the number of \(TP\) as given by the models.synergies.tp.stats object and the density estimation of the nodes’ average state differences in each case will be ploted in a common graph:

What we are actually looking for is density plots that are largely skewed to the right (so the average absolute differences of these nodes are close to zero) while there are a few areas of non-zero densities which are as close to 1 as possible. So, from the above graph, the density plots that fit this description are the ones marked as \((0,3)\) and \((2,3)\). We will visualize the nodes’ average state differences in a network graph (Csardi and Nepusz 2006), where the color of each node will denote how much more inhibited or active that node is, in the average good model vs the average bad one. The color of the edges will denote activation (green) or inhibition (red). We first build the network from the node topology (edge list):

In the next colored graphs we can identify the important nodes whose activity state can influence the true positive prediction performance (from 0 true positive synergies to a total of 3):

We will now list the important nodes that affect the models’ prediction performance with regards to the number of true positive synergies that they predict. Comparing the graphs above, we observe that there exist common nodes that maintain the same significant influence in all of the graphs. We set the threshold for the absolute significance level in average state differences to \(0.7\). A node will be marked as a biomarker (active or inhibited) if its activity state difference surpassed the aforementioned threshold (positively or negatively) for any of the tested groups (e.g. 2 \(TP\) vs 3 \(TP\)). So, the nodes that have to be in a more active state are:

4 nodes: PSEN1, MAPK8IP1, MAPK8IP3, MAPK9

Also, the nodes that have to be in a more inhibited state are:

1 node: RXRA

We check if there are nodes found by our method as biomarkers in both active and inhibited states:

No common nodes

MCC classification-based analysis

The previous method to split the models based on the number of true positive predictions is a good metric for absolute performance but a very restricted one since it ignores the other values of the confusion matrix for each model (true negatives, false positives, false negatives). Also, since our dataset is imbalanced in the sense that out of the total drug combinations tested only a few of them are observed as synergistic (and in a hypothetical larger drug screening evaluation it will be even less true positives) we will now devise a method to split the models into different performance categories based on the value of the Matthews Correlation Coefficient (MCC) score which takes into account the balance ratios of all the four confusion matrix values:

From the above figure we observe that:

  • There are no relatively bad models (MCC values close to -1)
  • Most of the models (exluding the NaN category) perform a little better than random prediction (\(MCC>0\))
  • There are models that had NaN value for the MCC score

Given the MCC formula: \(MCC = (TP\cdot TN - FP\cdot FN)/\sqrt{(TP+FP) * (TP+FN) * (TN+FP) * (TN+FN)}\), we can see that the MCC value can be NaN because of zero devision. Two of the four values in the denominator represent the number of positive \((TP+FN)\) and negative \((TN+FP)\) observations which are non-zero for every model, since they correspond to the observed and non-obsered synergies in each case. The case where both \(TN\) and \(FN\) are zero is rare (if non-existent) because of the imbalanced dataset (large proportion of negatives) and the reason that logical models which report no negatives means that they should find fixpoint attractors for every possible drug combination perturbation which also is extremely unlikely. We can actually see that the NaN are produced by models that have both TP and FP equal to zero:

3400

Since these models could intentify no synergies (either correctly or wrongly), we decided to put them as the lowest performant category in our MCC-based analysis. To classify the models based on their MCC score (which takes values in the \([-1, 1]\) interval, NaN values excluded), we will perform a univariate k-means clustering to split the previously found MCC values to different classes (Wang and Song 2011). The MCC classification is presented with a histogram:

Note that in total we have 6 MCC classes, since the NaN MCC values constitute a class on its own.

Following our first strategy, we will split the models based on the MCC performance metric score. For example, the bad models will be the ones that had a NaN MCC score \((TP+FP = 0)\) whereas the good models will be the ones that had an MCC score belonging to the first MCC class as seen in the histogram above. This particular classification strategy will be used for every possible combination of the MCC classes and the density estimation of the nodes’ average state differences in each case will be ploted in a common graph:

Many of the density plots above are of interest to us, since they are right skewed. Next, we visualize the average state differences with our network coloring method for 2 of the above cases, in order to identify the important nodes whose activity state can influence the prediction performance based on the MCC classification:

We will now list the important nodes that affect the models’ prediction performance with regards to their MCC classification, using the same method as in the True Positives-based analysis section. First, the nodes that have to be in a more active state:

1 node: PSEN1

Secondly, the nodes that have to be in a more inhibited state:

1 node: PSEN1

Since we identified only one (common) biomarker in each category state, we will use a less strict threshold for identifying further biomarkers. First, the nodes that have to be in a more active state:

12 nodes: GSK3B, AKT2, GSK3B/Axin/APC, MYC, APC, LRP6, PSEN1, ROR2, PHLPP1, MAML1, SIRT2, CUL1

Secondly, the nodes that have to be in a more inhibited state:

12 nodes: GSK3B, AKT2, GSK3B/Axin/APC, MYC, APC, LRP6, PSEN1, ROR2, PHLPP1, MAML1, SIRT2, CUL1

We check if there are nodes found by our method as biomarkers in both active and inhibited states:

12 nodes: GSK3B, AKT2, GSK3B/Axin/APC, MYC, APC, LRP6, PSEN1, ROR2, PHLPP1, MAML1, SIRT2, CUL1

Since there are common nodes that were found to surpass the significance threshold level for the average state difference between two MCC classes both negatively and positively, we will keep these biomarkers only in the state which corresponds to the comparison of the highest classification categories. For example, if for the comparison of the MCC classes \((1,3)\) the node of interest had an average difference of \(-0.89\) while for the comparison of the \((3,4)\) MCC classes it had a value of \(0.91\), then we will keep that node only in the active biomarker list. The logic behind this is that the higher the MCC classes we compare, the more sure we are that the average state difference corresponds to a better indicator of the state of the biomarker found.

4 nodes: GSK3B, AKT2, APC, LRP6

8 nodes: GSK3B/Axin/APC, MYC, PSEN1, ROR2, PHLPP1, MAML1, SIRT2, CUL1

Equation-based analysis

It will be interesting to see the different patterns in the form of the boolean equations (regarding the mutation of the link operator as mentioned in the Input section) when comparing higher performance models vs the low performant ones. We could also check if any of the biomarkers found above relate to a different link operator on average between models with different performance characteristics (e.g. higher predictive models should have the OR NOT as the link operator in a boolean equation where a specific biomarker is the regulation target) or if they constitute targets of exclusively activating nodes or inhibiting ones (an equation with no link operator).

The performance metric we will first use to sort the models is the number of true positive predictions. We will now illustrate the heatmap of the models.equations object raw-order by the number of \(TP\) predictions:

# order based on number of true positives
models.synergies.tp.sorted = sort(models.synergies.tp)
models.sorted = names(models.synergies.tp.sorted)
models.equations.sorted = models.equations[models.sorted, ]

# coloring
eq.link.colors = c("red", "lightyellow")
eq.link.col.fun = colorRamp2(breaks = c(0, 1), colors = eq.link.colors)
tp.values = sort(unique(models.synergies.tp))
tp.col.fun = colorRamp2(breaks = c(min(tp.values), max(tp.values)), 
                         colors = c("red", "green"))

# color biomarker names in the heatmap
bottom.nodes.colors = rep("black", length(colnames(models.equations.sorted)))
names(bottom.nodes.colors) = colnames(models.equations.sorted)
bottom.nodes.colors[names(bottom.nodes.colors) %in%
                    biomarkers.perf.active] = "blue"
bottom.nodes.colors[names(bottom.nodes.colors) %in%
                    biomarkers.perf.inhibited] = "magenta"

# define the TP color bar
tp.annot = rowAnnotation(
  tp = anno_simple(x = models.synergies.tp.sorted, col = tp.col.fun), 
  show_annotation_name = FALSE)

heatmap.eq = 
  Heatmap(matrix = models.equations.sorted, column_title = "Models Link Operators",
    column_title_gp = gpar(fontsize = 20), row_title = "Models (TP sorted)",
    cluster_rows = FALSE, show_row_names = FALSE, show_heatmap_legend = FALSE,
    column_dend_height = unit(1, "inches"), col = eq.link.col.fun, 
    column_names_gp = gpar(fontsize = 9, col = bottom.nodes.colors),  
    left_annotation = tp.annot,
    use_raster = TRUE, raster_device = "png", raster_quality = 10)

# define the 3 legends
link.op.legend = Legend(title = "Link Operator", labels = c("AND", "OR"), 
                        legend_gp = gpar(fill = eq.link.colors))
tp.values = min(tp.values):max(tp.values) # maybe some integers are missing
tp.legend = Legend(at = tp.values, title = "TP", 
                   legend_gp = gpar(fill = tp.col.fun(tp.values)))
biomarkers.legend = Legend(title = "Biomarkers", labels = c("Inhibited", "Active"), 
                           legend_gp = gpar(fill = c("magenta", "blue")))
legend.list = packLegend(link.op.legend, tp.legend, biomarkers.legend,
                         direction = "vertical")

draw(heatmap.eq, annotation_legend_list = legend.list, 
     annotation_legend_side = "right")

In the heatmap above, an equation whose link operator is AND NOT is represented with red color while an OR NOT link operator is represented with light yellow color. The targets whose equations do not have a link operator are not represented. The rows are ordered by the number of true positive predictions (ascending). We have colored the names of the network nodes that were also found as biomarkers (green color is used for the active biomarkers and red color for the inhibited biomarkers). We observe that:

  • Most of the biomarkers are nodes that do not have both activators and inhibitors and so are absent from the above heatmap
  • There doesn’t seem to exist a pattern between the models’ link operators and their corresponding performance (at least not for all of the nodes) when using the true positive predictions as a classifier for the models
  • There exist a lot of target nodes that need to have the OR NOT link operator in their respective boolean equation in order for the corresponding logical model to show a higher number of true positive predictions. By assigning the OR NOT link operator to a target’s boolean regulation equation, we allow more flexibility to the target’s output active result state - meaning that the inhibitors play less role and the output state has a higher probability of being active - compared to assigning the AND NOT link operator to the equation

We will also illustrate the heatmap of the models.equations object raw-order by the MCC score which is a better performance classifier. Models who had a NaN MCC score will be again placed in the lower performant category:

# order based on the MCC value
models.mcc.sorted = sort(models.mcc, na.last = FALSE)
models.sorted = names(models.mcc.sorted)
models.equations.sorted = models.equations[models.sorted, ]

# coloring
mcc.col.fun = 
  colorRamp2(breaks = c(min(models.mcc.no.nan),
                        max(models.mcc.no.nan)), colors = c("red", "green"))

# define the MCC color bar
mcc.annot = rowAnnotation(
  mcc = anno_simple(x = models.mcc.sorted, col = mcc.col.fun, na_col = "black"), 
  show_annotation_name = FALSE)

heatmap.eq = 
  Heatmap(matrix = models.equations.sorted, column_title = "Models Link Operators",
    column_title_gp = gpar(fontsize = 20), row_title = "Models (MCC sorted)",
    cluster_rows = FALSE, show_row_names = FALSE, show_heatmap_legend = FALSE,
    column_dend_height = unit(1, "inches"), col = eq.link.col.fun, 
    column_names_gp = gpar(fontsize = 9, col = bottom.nodes.colors),  
    left_annotation = mcc.annot,
    use_raster = TRUE, raster_device = "png", raster_quality = 10)

# define the 4 legends
link.op.legend = Legend(title = "Link Operator", labels = c("AND", "OR"), 
                        legend_gp = gpar(fill = eq.link.colors))
mcc.legend = Legend(title = "MCC", col_fun = mcc.col.fun)
na.legend = Legend(labels = "NA", legend_gp = gpar(fill = "black"))
biomarkers.legend = Legend(title = "Biomarkers", labels = c("Inhibited", "Active"), 
                           legend_gp = gpar(fill = c("magenta", "blue")))
legend.list = packLegend(link.op.legend, mcc.legend, na.legend, biomarkers.legend,
                         direction = "vertical")

draw(heatmap.eq, annotation_legend_list = legend.list, 
     annotation_legend_side = "right")

In the MCC-ordered heatmap above we observe that there is better correlation between the boolean equations’ link operators and the performance of a model compared to the \(TP\) classification. For example, the biomarkers AKT2 and GSK3B show distinguished patterns for the higher performance models (use of AND NOT and OR NOT link operators respectively) when the models are sorted by the MCC score.

Synergy-prediction based analysis

We will now use the second strategy to split the models, based on whether they predict a specific observed synergy or not. This will allow us to find biomarkers that affect the prediction of a specific synergy. For example, the good models could be the ones that predicted the hypothetical synergy A-B while the bad models all the rest that identified the particular combination as non-synergistic. In another case scenario, the good models could be those that predicted a triple synergy set A-B,C-D,A-C, while the bad models could be the ones that predicted the double synergy subset A-B,C-D (excluding the common models that predicted both the triple synergy set and subsequently its given subset). In such a case scenario, we want to find out which nodes are responsible for making the good models predict the extra synergy - in this hypothetical case the synergy A-C - demonstrating thus better model performance. Note that the models selected in each case as good or bad, could have predicted other synergies as well (correctly as \(TP\) or wrongly as \(FP\)) which means that the biomarker selection method could be somewhat innacurate, since we can’t really know the prediction of which extra synergy or synergies the biomarkers’ state affected. To account for this, we label as good models those that predict large synergy sets (so fewer models) which capture almost all the true positive predictions and also minimize the possible extra different synergies predicted by models of the same classification category (e.g. the good models).

Starting with the first model classification method (prediction vs non-prediction of a particular synergy), we generate the density distribution of the nodes’ average state differences between the good and bad models for each predicted synergy:

Next, we will plot the biomarkers with the network visualization method (for all predicted synergies) and output the respective biomarkers in each case. The biomarker results for each predicted synergy will be stored for further comparison with the results from the other cell lines.

Biomarkers for BI-PI synergy prediction

Active biomarkers
4 nodes: NR3C1, MAPK14, RPS6KA5, DUSP1

Inhibited biomarkers
11 nodes: MAP3K7, MAP2K6, MAP2K3, NLK, IKBKG, FGFR1, TAB1, CASP3, PTPN7, MAX, ROCK1

Biomarkers for BI-D1 synergy prediction

Active biomarkers
3 nodes: MAPK14, RPS6KA5, DUSP1

Inhibited biomarkers
12 nodes: MAP3K7, MAP2K6, MAP2K3, NLK, IKBKG, FGFR1, TAB1, CASP3, EGFR, PTPN7, MAX, ROCK1

Biomarkers for PI-D1 synergy prediction

Active biomarkers
0 nodes:

Inhibited biomarkers
0 nodes:

Biomarkers for JN-D1 synergy prediction

Active biomarkers
22 nodes: MAP3K11, FGFR1, TAB1, CASP3, STAT3, EGFR, PTPN7, MAX, MAPK8, MAPK8IP1, MAPK8IP3, APP, SIRT1, SOCS3, TGFB1, HSPA1A, SALL4, ROCK1, TGFBR1, TRAF6, RHOA, PIK3R1

Inhibited biomarkers
7 nodes: DLX5, NR3C1, MAPK14, RPS6KA5, DUSP1, RXRA, LAT

Biomarkers for D1-P5 synergy prediction

Active biomarkers
3 nodes: JAK2, CSF2RA, APOA1

Inhibited biomarkers
0 nodes:

We notice that for some predicted synergies the above method identified zero biomarkers. We will now study cases where the main goal is the better identification and/or refinement of the biomarkers responsible for allowing the models to predict one extra synergy from a specific synergy set. If we find biomarkers for a predicted synergy using this strategy, we compare them to the ones already found with the previous method and if none of them is common we just add the new ones to the list of biomarkers for that specific synergy. On the other hand, if the synergy-set comparison method identifies a subset of the previously found biomarkers for a specific synergy (one common node at least), we will only keep the later method’s biomarkers since we believe that the synergy-set prediction based method is more accurate at identifying biomarkers for a specific synergy because of the fewer models involved in each contrasting category which also minimizes the total false positive synergy predictions taken into account. Note that if the second method finds even more biomarkers than the first, we have the option to prune the end result to only the common biomarkers between the two methods.

We will focus our analysis on the predicted synergies D1-P5, PI-D1, BI-D1 and BI-PI.

Synergy-set prediction based analysis

D1-P5 synergy

The first use case will contrast the models that predicted the synergy set PI-D1,D1-P5 vs the models that predicted the signle synergy PI-D1. This will allow us to find important nodes whose state can influence the prediction performance of a model and make it predict the extra D1-P5 synergy:

We now visualize the average state differences:

and report the active biomarkers found:

15 nodes: MAP3K7, MAP2K6, MAP2K3, NLK, IKBKG, AKT1, BRCA1, PRKACA, TTC3, JAK2, CSF2RA, APOA1, SYK, MAP4K1, VAV1

Note that the 3 active biomarkers found with the previous method are included in the list above. We also report the inhibited biomarkers:

8 nodes: DAB2IP, PPP1CA, AR, CHEK1, RARA, IRAK1, LCK, CASP9

Note that with the previous method we couldn’t identify any inhibited biomarker. Lastly, we update the biomarkers for this specific synergy in the respective file:

PI-D1 synergy

The second use case will contrast the models that predicted the synergy set PI-D1,D1-P5 vs the models that predicted the single synergy D1-P5 as well as the models that predicted the synergy set BI-D1,PI-D1 vs the models that predicted the single synergy BI-D1. This will allow us to find important nodes whose state can influence the prediction performance of a model and make it predict the extra PI-D1 synergy:

We now visualize the average state differences:

Observing the graphs above, we note that the second comparison (BI-D1,PI-D1 vs BI-D1) does not produce any biomarkers. Even the first comparison’s biomarkers are below the 0.7 threshold that we use. So, in order to capture the biomarkers of the first comparison, we will use a lower threshold level (0.6). The active biomarkers are thus:

4 nodes: GSK3B, AKT2, APC, LRP6

Note that with the previous method where we contrasted the models that predict the PI-D1 synergy vs those that don’t, we weren’t able to identify not even one biomarker. The inhibited biomarkers found are:

5 nodes: MAP3K4, GSK3B/Axin/APC, ROR2, PHLPP1, MAML1

Lastly, we update the biomarkers for this specific synergy in the respective file:

BI-D1 synergy

The third use case will contrast the models that predicted the synergy set BI-D1,PI-D1 vs the models that predicted the single synergy subset PI-D1. This will allow us to find important nodes whose state can influence the prediction performance of a model and make it predict the extra BI-D1 synergy:

We now visualize the average state differences:

and report the active biomarkers found:

3 nodes: MAPK8IP1, MAPK8IP3, MAPK9

We also report the inhibited biomarkers:

1 node: RXRA

Lastly, we update the biomarkers for this specific synergy in the respective file:

BI-PI synergy

The fourth use case will contrast the models that predicted the synergy set BI-PI,PI-D1 vs the models that predicted the single synergy PI-D1 as well as the models that predicted the synergy set BI-PI,BI-D1,PI-D1 vs the models that predicted the synergy subset "BI-D1,PI-D1. These two comparisons will allow us to find important nodes whose state can influence the prediction performance of a model and make it predict the extra BI-PI synergy:

We now visualize the average state differences:

As we see from the two graphs above, each comparison produces different biomarkers for the BI-PI synergy. We note that the second comparison’s (BI-PI,BI-D1,PI-D1 vs BI-D1,PI-D1) biomarkers are below the 0.7 threshold that we use. So, in order to capture the biomarkers of the second synergy set comparison, we will use a lower threshold level (0.6). The active biomarkers found from the two comparisons are thus:

4 nodes: PSEN1, MAPK8IP1, MAPK8IP3, MAPK9

2 nodes: GAB2, MAP2K1

No common nodes

So, as the active biomarkers for this synergy we merged the (non-common) results from the two synergy set comparisons.

The inhibited biomarkers found from the two comparisons are:

1 node: RXRA

0 nodes:

Since the second comparison did not identify any inhibited biomarkers at the \(0.6\) threshold level, we use the inhibited biomarkers found from the first comparison. Lastly, we update the biomarkers for this specific synergy in the respective file:

Biomarker results

In this section, we will compare the biomarkers found per predicted synergy for this particular cell line as well as the performance biomarkers (notated as PERF in the heatmap below) which were found using the results from the MCC classification-based model analysis:

So, in general we observe that:

  • A total of 68 nodes were found as biomarkers (for at least one synergy)
  • We found a lot of biomarkers for each predicted synergy. Usually we wouldn’t expect too many biomarkers that are directly related to the prediction of a specific synergy. The abudance of (false positive) biomarkers for some synergies (e.g. JN-D1) relates to the model classification method used, which does not incorporate in its internal logic that the prediction of other synergies than the ones used for the grouping itself can affect the biomarker results obtained from it
  • Not all the performance-related biomarkers (PERF) were also observed as biomarkers for the prediction of a specific synergy(ies)
  • The PSEN1 biomarker state is found contradictory between the performance-related biomarkers (PERF) and the BI-PI synergy ones. Note though that only 17 models predicted the BI-PI synergy and the performance-related state of that biomarker was produced through the more rigorous MCC classification. Thus, we place a larger confidence on the activity state of the performance-related biomarker result
  • There exist common biomarkers across different predicted synergies, e.g. RXRA is a common inhibited biomarker across 3 synergistic drug combinations
  • The results between the different synergies are sometimes contradictory, meaning that there are active biomarkers for a particular synergy that were found as inhibited in another and vise versa (e.g. the NLK biomarker)
  • The biomarkers of the PI-D1 synergy are not shared with any of the other predicted synergies’ biomarkers but they are the same as some of the performance-related biomarkers