# configure the logger to print to console
import logging
import random
import pandas as pd
import numpy as np
from yeastdnnexplorer.ml_models.lasso_modeling import (
    generate_modeling_data,
    get_significant_predictors,
    stratified_cv_r2,
    get_interactor_importance,
    OLSFeatureSelector,
)
logging.basicConfig(level=logging.ERROR)
random.seed(42)
np.random.seed(42)
Interactor Modeling Workflow 1
This tutorial describes a process for modeling perturbation response from binding data, with the goal of discovering a meaningful set of interactor terms. More specifically, we start with the following model:
$$ tf_{perturbed} \sim tf_{perturbed} + tf_{perturbed}:tf_{2} + tf_{perturbed}:tf_{3} + \dots + \max(\text{non-perturbed binding}) $$
Where the response variable is the $tf_{perturbed}$ perturbation response, and the predictor variables are binding data (e.g., calling card experiments). Predictor terms such as $tf_{perturbed}:tf_{2}$ represent the interaction between the $tf_{perturbed}$ binding and the binding of another transcription factor. The final term, $\max(\text{non-perturbed binding})$, is defined as the maximum binding score for each gene, excluding $tf_{perturbed}$. This term is included to mitigate the effect of outlier genes which may have high binding scores across multiple transcription factors, potentially distorting the model.
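To make the interaction notation concrete, here is a minimal sketch of how such a formula expands into a design matrix. It assumes patsy as the formula backend (the package's generate_modeling_data may differ in its details), and the toy column values are hypothetical:

```python
import pandas as pd
from patsy import dmatrix

# toy binding scores for three genes (hypothetical values)
toy = pd.DataFrame({
    "CBF1": [0.1, 0.8, 0.3],
    "MET28": [0.5, 0.2, 0.9],
    "max_lrb": [0.7, 0.6, 0.4],
})

# "CBF1:MET28" produces a column equal to the elementwise product of the two
# binding scores; main effects enter unchanged.
X = dmatrix("CBF1 + CBF1:MET28 + max_lrb", toy, return_type="dataframe")
print(X.columns.tolist())  # ['Intercept', 'CBF1', 'max_lrb', 'CBF1:MET28']
```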
We assume that the actual relationship between the perturbation response and the binding data is sparse and use the following steps to identify significant terms. These terms represent a set of TFs which, when considered as interactors with the perturbed TF, improve the inferred relationship between the binding and perturbation data.
Interactor sparse modeling
We currently have two methods to produce a set of what we deem to be significant initial parameters. In the first, we use the bootstrap on a 4-fold cross-validated Lasso model. In this case, we consider a coefficient significant if the bootstrap confidence interval does not include 0 (this rule is sketched in the code just after the list below). The second method is to perform 4-fold cross-validated Lasso once and then, starting from the non-zero coefficients, iteratively use OLS modeling with a p-value threshold to drop insignificant predictors. With either method, we produce two variations of this model:
1. (Step 1.1) A model trained using all available data. For the bootstrapped LassoCV, we accept a coefficient if its 99.8% confidence interval does not include 0.0. For the OLS reduction method, we accept a predictor when its p-value is <= 0.001.
2. (Step 1.2) A model trained using only the top 10% of the data, ranked by the binding score of the perturbed TF. For the bootstrapped LassoCV, we accept a coefficient if its 90.0% confidence interval does not include 0.0. For the OLS reduction method, we accept a predictor when its p-value is <= 0.01.

We then intersect the predictors that result from steps 1.1 and 1.2.
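To make the significance rule concrete, here is a minimal sketch of the bootstrap CI check, not the library's implementation; `ci_excludes_zero` and the demo array are hypothetical:

```python
import numpy as np

def ci_excludes_zero(boot_coefs: np.ndarray, ci_percentile: float) -> np.ndarray:
    """Boolean mask over features whose bootstrap CI does not include 0.

    `boot_coefs` is a hypothetical (n_bootstraps, n_features) array holding
    one row of Lasso coefficients per bootstrap resample.
    """
    alpha = (100.0 - ci_percentile) / 2.0
    lower = np.percentile(boot_coefs, alpha, axis=0)
    upper = np.percentile(boot_coefs, 100.0 - alpha, axis=0)
    # significant when the whole interval sits on one side of zero
    return (lower > 0.0) | (upper < 0.0)

# demo: one null coefficient, one real effect
rng = np.random.default_rng(42)
demo = rng.normal(loc=[0.0, 0.2], scale=0.05, size=(1000, 2))
print(ci_excludes_zero(demo, 99.8))  # expected: [False  True]
```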
With this set of predictors, we then test whether each interactor is a significantly better predictor than the corresponding main effect (see the sketch below). We do this by individually swapping out the interactor term for its main effect and comparing the average R-squared of OLS models across 4 folds. When this process is complete, we adjust our predictors by replacing interactors with main effects where the main effect was deemed better.
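To illustrate the swap-and-compare idea, here is a self-contained toy using plain scikit-learn: it uses plain KFold on simulated data, rather than the stratified folds and get_interactor_importance used by the package, and `avg_r2` is a hypothetical helper:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

def avg_r2(X: pd.DataFrame, y: pd.Series) -> float:
    """Average R-squared of an OLS fit across 4 folds."""
    cv = KFold(n_splits=4, shuffle=True, random_state=42)
    return cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2").mean()

# simulate a response driven by the interaction, not the main effect
rng = np.random.default_rng(0)
X = pd.DataFrame({"CBF1": rng.normal(size=200), "MET28": rng.normal(size=200)})
X["CBF1:MET28"] = X["CBF1"] * X["MET28"]
y = pd.Series(0.5 * X["CBF1:MET28"] + rng.normal(scale=0.5, size=200))

# keep the interactor only if it beats its main-effect replacement
with_interactor = avg_r2(X[["CBF1", "CBF1:MET28"]], y)
with_main_effect = avg_r2(X[["CBF1", "MET28"]], y)
print(with_interactor > with_main_effect)  # expected: True
```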
Finally, we report, as significant interactors, the interaction terms which survive the steps above. We compare the average R-squared achieved by this model to the average R-squared achieved by the univariate counterpart (the response TF predicted solely by its main effect). We would like to create a boxplot of this comparison across all TFs (sketched below) to see how this pipeline affects model performance in explaining variance relative to the simple univariate model.
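Once the workflow has been run over many TFs, that comparison could be drawn with a simple matplotlib boxplot; the results dictionary below is hypothetical and only shows the plot's shape:

```python
import matplotlib.pyplot as plt

# hypothetical per-TF average R-squared values for each model type
results_by_model = {
    "univariate": [0.005, 0.012, 0.030],
    "final interactor model": [0.011, 0.020, 0.055],
}

fig, ax = plt.subplots()
ax.boxplot(list(results_by_model.values()), labels=list(results_by_model.keys()))
ax.set_ylabel("average R-squared (4-fold CV)")
plt.show()
```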
NOTE: To generate the response_df and predictors_df below, see the first six cells in the LassoCV tutorial.
response_df = pd.read_csv("~/htcf_local/lasso_bootstrap/erics_tfs/response_dataframe_20241105.csv", index_col=0)
predictors_df = pd.read_csv("~/htcf_local/lasso_bootstrap/erics_tfs/predictors_dataframe_20241105.csv", index_col=0)
We will demonstrate both methods of identifying a significant set of interactor terms using the TF Cbf1.
model_tf = 'CBF1'
Step 1: Find significant predictors using all of the data
The function get_significant_predictors() can choose significant predictors using two methods: either the bootstrap with LassoCV, or a single LassoCV model followed by selecting only significant coefficients through iterative OLS modeling.
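As a hypothetical illustration of the result structure, inferred from how the results are used in the cells below (the key names come from this tutorial; other details of the return value may vary):

```python
# sketch only: key names are taken from the cells below in this tutorial
result = get_significant_predictors(
    "bootstrap_lassocv", model_tf, response_df, predictors_df, add_max_lrb=True
)
print(result['sig_coefs'])         # e.g. {'CBF1:MET28': (lower_ci, upper_ci), ...}
print(result['predictors'].shape)  # design-matrix columns for this TF
print(result['response'].shape)    # perturbation response for this TF
print(result['classes'])           # stratification labels used for the CV folds
```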
# step 1.1 with the bootstrapped lassoCV
bootstrap_lasso_all_data = get_significant_predictors(
    "bootstrap_lassocv",
    model_tf,
    response_df,
    predictors_df,
    ci_percentile=99.8,
    n_bootstraps=100,
    add_max_lrb=True,
)
# step 1.2 with the bootstrapped lassoCV
bootstrap_lasso_top10 = get_significant_predictors(
    "bootstrap_lassocv",
    model_tf,
    response_df,
    predictors_df,
    quantile_threshold=0.1,
    ci_percentile=90.0,
    n_bootstraps=100,
    add_max_lrb=True,
)
# step 1.1 with the lassoCV method (the OLS reduction comes later)
lassocv_ols_all_data = get_significant_predictors(
    "lassocv_ols",
    model_tf,
    response_df,
    predictors_df,
    add_max_lrb=True,
)
# step 1.2 with the lassoCV method (the OLS reduction comes later)
lassocv_ols_top10 = get_significant_predictors(
    "lassocv_ols",
    model_tf,
    response_df,
    predictors_df,
    add_max_lrb=True,
    quantile_threshold=0.1,
)
Significant coefficients for 99.8, where intervals are entirely above or below ±0.0:
CBF1:SWI6: (-0.14416455697760802, -0.009884405236493667)
CBF1:RGM1: (0.014576695345595624, 0.16062245178888665)
CBF1:ARG81: (-0.21487429483287668, -0.03345111234665463)
CBF1:MET28: (0.07821562304993974, 0.20784574601671219)
CBF1:AZF1: (-0.15416577155959094, -0.026421884027885402)
CBF1:GAL4: (0.09086284981352438, 0.3185129462792361)
CBF1:MSN2: (0.10065808506838035, 0.2776672633042757)
max_lrb: (0.0018092836372706933, 0.0983814591861913)
Significant coefficients for 90.0, where intervals are entirely above or below ±0.0:
CBF1:MET28: (0.06998122210410229, 0.2088974100580522)
Step 2
We next intersect the significant coefficients (see the definitions above) from both models. In this case, one interactor survives (note that we use only 100 bootstraps in this example in the interest of speed for the tutorial; we recommend no fewer than 1000 in practice).
bootstrap_lassocv_final_features = (
    set(bootstrap_lasso_all_data['sig_coefs'].keys())
    .intersection(set(bootstrap_lasso_top10['sig_coefs'].keys()))
)
print(f"Bootstrap Lasso Intersect Coefs: {bootstrap_lassocv_final_features}")
# these are not the final features from this method!
lassocv_ols_intersect_coefs = (
    set(lassocv_ols_all_data['sig_coefs'].keys())
    .intersection(set(lassocv_ols_top10['sig_coefs'].keys()))
)
print(f"LassoCV Intersect Coefs: {lassocv_ols_intersect_coefs}")
Bootstrap Lasso Intersect Coefs: {'CBF1:MET28'}
LassoCV Intersect Coefs: {'CBF1:MET28', 'CBF1:SKN7', 'CBF1:SWI6', 'CBF1:AZF1'}
OLS reduction only
For the LassoCV result, we go through another round of reduction. Starting from the intersection of the LassoCV results, we iteratively fit OLS models on this set of predictors. At each iteration, if a predictor's p-value exceeds a given threshold, it is removed from the model.
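To show the idea behind this reduction, here is a minimal sketch of backward elimination, assuming statsmodels; `backward_eliminate` is a hypothetical helper, not the implementation of OLSFeatureSelector:

```python
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(X: pd.DataFrame, y: pd.Series, threshold: float) -> list[str]:
    """Iteratively drop the least significant predictor until all pass."""
    features = list(X.columns)
    while features:
        model = sm.OLS(y, sm.add_constant(X[features])).fit()
        pvalues = model.pvalues.drop("const")
        worst = pvalues.idxmax()
        if pvalues[worst] <= threshold:
            break  # every remaining predictor is at or below the threshold
        features.remove(worst)
    return features
```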
# Initialize the selector
selector_all = OLSFeatureSelector(p_value_threshold=0.001)
# Transform the data to select only significant features
selector_all.refine_features(
    lassocv_ols_all_data['predictors'][list(lassocv_ols_intersect_coefs)],
    lassocv_ols_all_data['response'],
)
selector_top10 = OLSFeatureSelector(p_value_threshold=0.01)
_ = selector_top10.refine_features(
    lassocv_ols_top10['predictors'].loc[
        lassocv_ols_top10['response'].index, list(lassocv_ols_intersect_coefs)
    ],
    lassocv_ols_top10['response'],
)
# Get significant features and the model summary
print("Significant features in all data:", selector_all.get_significant_features(drop_intercept=True))
print("All data model summary:\n", selector_all.get_summary())
print("Significant features in top 10:", selector_top10.get_significant_features(drop_intercept=True))
print("Top 10 model summary:\n", selector_top10.get_summary())
lassocv_ols_final_features = set(
    selector_all.get_significant_features(drop_intercept=True)
).intersection(selector_top10.get_significant_features(drop_intercept=True))
print(f"LassoCV OLS Final Features: {lassocv_ols_final_features}")
Significant features in all data: ['CBF1:MET28', 'CBF1:SWI6']
All data model summary:
                 coef   std_err          t        pvalue
const        0.421557  0.006203  67.964194  0.000000e+00
CBF1:MET28   0.138165  0.012179  11.344703  1.547294e-29
CBF1:SWI6   -0.104014  0.015297  -6.799570  1.148398e-11
Significant features in top 10: ['CBF1:MET28']
Top 10 model summary:
                 coef   std_err          t        pvalue
const        0.441569  0.029296  15.072468  6.710796e-44
CBF1:MET28   0.083300  0.016322   5.103594  4.451965e-07
LassoCV OLS Final Features: {'CBF1:MET28'}
Step 3
We next implement the method which searches alternative models: for each surviving interactor term, we consider variations that substitute in the corresponding main effect. In this case, we have a single term; with more than one term, we would repeat the following for each surviving interactor. The goal of this process, remember, is to generate a set of high-confidence interactor terms for this TF. If the predictive power of the main effect is equivalent to or better than that of a model with the interactor, we consider that a low confidence interactor effect.
After identifying all interactor terms for which substituting in the main effect improves the average R-squared from CV, we replace those interactors in our set of features with the better-performing variant. We then log the final average R-squared achieved by this model. In this case, no terms are replaced by testing the substitution of main effects.
# in this case, the lassocv_ols_final_features and the
# bootstrap_lassocv_final_features are the same. I'm setting the
# lassocv_ols_final_features to final_features for the following steps.
final_features = lassocv_ols_final_features
# get the additional main effects which will be tested from the final_features
main_effects = []
for term in final_features:
    if ":" in term:
        main_effects.append(term.split(":")[1])
    else:
        main_effects.append(term)
# combine these main effects with the final_features
interactor_terms_and_main_effects = list(final_features) + main_effects
# generate a model matrix with the intersect terms and the main effects. This full
# model will not be used for modeling -- subsets of the columns will be, however.
_, full_X = generate_modeling_data(
    model_tf,
    response_df,
    predictors_df,
    formula=f"~ {' + '.join(interactor_terms_and_main_effects)}",
    drop_intercept=False,
)
# Add the max_lrb column, just in case it is present in the final_features. In
# this case, it is not.
max_lrb = predictors_df.drop(columns=model_tf).max(axis=1)
full_X['max_lrb'] = max_lrb
# Currently, this function tests each interactor term in final_features against
# two variants: one replacing the interaction term with the main effect only,
# and one with the main effect + interactor. If either variant has a higher avg
# R-squared than the intersect model, that variant is returned. In this case,
# the original final_features are the best model.
full_avg_rsquared, interactor_results = get_interactor_importance(
    lassocv_ols_all_data['response'],
    full_X,
    lassocv_ols_all_data['classes'],
    final_features,
)
# use the interactor_results to update the final_features
for interactor_variant in interactor_results:
    k = interactor_variant['interactor']
    v = interactor_variant['variant']
    final_features.remove(k)
    final_features.add(v)
Step 4: Comparing our final model to a univariate model
In our last step, we take our remaining set of features from the end of Step 3 and compare the model's performance to that of a univariate model in which the response TF is predicted solely by its main effect. We use the average R-squared achieved by 4-fold CV on both models.
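For intuition, a stratified 4-fold average R-squared can be sketched as below, assuming scikit-learn and pandas inputs; `stratified_avg_r2` is a hypothetical helper, and the package's stratified_cv_r2 may differ in details. The point is that the `classes` labels keep the binding-score strata balanced across folds:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import StratifiedKFold

def stratified_avg_r2(y: pd.Series, X: pd.DataFrame, classes) -> float:
    """Average test-fold R-squared of OLS over 4 folds stratified by `classes`."""
    skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
    scores = []
    for train_idx, test_idx in skf.split(X, classes):
        model = LinearRegression().fit(X.iloc[train_idx], y.iloc[train_idx])
        scores.append(model.score(X.iloc[test_idx], y.iloc[test_idx]))
    return float(np.mean(scores))
```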
avg_r2_univariate = stratified_cv_r2(
    lassocv_ols_all_data['response'],
    lassocv_ols_all_data['predictors'][[model_tf]],
    lassocv_ols_all_data['classes'],
)

final_model_avg_r_squared = stratified_cv_r2(
    lassocv_ols_all_data['response'],
    full_X[list(final_features)],
    lassocv_ols_all_data['classes'],
)
print(f"The univariate average R-squared is: {avg_r2_univariate}")
print(f"The final model average R-squared is {final_model_avg_r_squared}")
The univariate average R-squared is: 0.0049704126329321585
The final model average R-squared is 0.010517208487239943
As we can see, the final model we arrived at through Workflow 1 achieves a higher average R-squared under 4-fold CV than the univariate model.
Using the cmd line interface
There is a cmd line utility that performs this analysis. See

python -m yeastdnnexplorer find_interactors_workflow --help

for more information. An example command looks like this:
python -m yeastdnnexplorer find_interactors_workflow \
    --response_file ~/htcf_local/lasso_bootstrap/erics_tfs/response_dataframe_20241105.csv \
    --predictors_file ~/htcf_local/lasso_bootstrap/erics_tfs/predictors_dataframe_20241105.csv \
    --method bootstrap_lassocv \
    --response_tf CBF1
The output, in your $PWD, would look like this:
find_interactors_output
└── CBF1_output.json
Where CBF1_output.json looks like:
{
    "response_tf": "CBF1",
    "final_features": [
        "CBF1:MET28",
        "GAL4"
    ],
    "avg_r2_univariate": 0.0049704126329321585,
    "final_model_avg_r_squared": 0.022854644322078704
}