Output Structure and Files¶
This document describes the complete output structure and file formats generated by the linear_perturbation_binding_modeling command.
Directory Structure¶
When you run linear_perturbation_binding_modeling, results are saved in a subdirectory within your specified --output_dir:
{output_dir}/{perturbed_tf}{output_suffix}/
├── all_data_result_object/
│ ├── result_obj_ci.json
│ └── result_obj_coefs_alphas.pkl
├── topn_result_object/
│ ├── result_obj_ci.json
│ └── result_obj_coefs_alphas.pkl
├── all_data_significant_{ci_level}.json
├── topn_significant_{ci_level}.json
├── best_all_data_model.pkl
└── interactor_vs_main_result.json
Directory Naming¶
- Base directory: Specified by --output_dir (default: ./linear_perturbation_binding_modeling_results)
- Subdirectory: {perturbed_tf}{output_suffix}
    - perturbed_tf: The name of the transcription factor you're analyzing
    - output_suffix: Optional suffix specified with --output_suffix
- Example: CBF1_experiment1/ or just CBF1/
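As a quick illustration, the results path can be reconstructed from the CLI arguments. The values below are hypothetical placeholders:
from pathlib import Path

# Hypothetical CLI values; substitute your own --output_dir / --output_suffix
output_dir = "./linear_perturbation_binding_modeling_results"  # the default
perturbed_tf = "CBF1"
output_suffix = "_experiment1"  # empty string if --output_suffix is not given

results_dir = Path(output_dir) / f"{perturbed_tf}{output_suffix}"
print(results_dir)  # linear_perturbation_binding_modeling_results/CBF1_experiment1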
Stage 2: Bootstrap Modeling Results (All Data)¶
all_data_result_object/¶
Contains the complete bootstrap modeling results from Stage 2, where all features are used.
result_obj_ci.json¶
Confidence intervals for all model coefficients across bootstrap iterations.
Format:
{
"95.0": {
"TF1:TF2": [0.123, 0.456],
"TF1:TF3": [-0.012, 0.089],
"TF1:TF4": [0.234, 0.567]
},
"99.0": {
"TF1:TF2": [0.089, 0.501],
"TF1:TF3": [-0.045, 0.123],
"TF1:TF4": [0.189, 0.612]
}
}
Structure:
- Keys: Confidence interval levels (e.g., "95.0", "99.0")
- Values: Dictionaries mapping coefficient names to [lower_bound, upper_bound]
- Coefficient names: Follow the model formula (e.g., interaction terms like "TF1:TF2")
Usage:
- Identify which coefficients have confidence intervals that exclude zero (see the sketch below)
- Compare coefficient stability across different CI levels
- Used internally to extract significant coefficients
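For example, a minimal sketch (run from within the output subdirectory) that lists the coefficients whose 95% CI excludes zero:
import json

# Load the confidence intervals for all coefficients
with open('all_data_result_object/result_obj_ci.json') as f:
    ci_dict = json.load(f)

# A CI excludes zero when both bounds share the same sign
for name, (lower, upper) in ci_dict["95.0"].items():
    if lower > 0 or upper < 0:
        print(f"{name}: [{lower:.3f}, {upper:.3f}]")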
result_obj_coefs_alphas.pkl¶
Pickled file containing raw bootstrap results.
Contents:
(bootstrap_coefs_df, alpha_list)
- bootstrap_coefs_df: pandas DataFrame
    - Rows: Bootstrap iterations (e.g., 1000 rows for 1000 bootstraps)
    - Columns: Model coefficients (matching the model formula)
    - Values: Fitted coefficient values from each bootstrap iteration
- alpha_list: list of floats
    - Length: Number of bootstrap iterations
    - Values: Optimal alpha (regularization parameter) selected by LassoCV in each iteration
Example loading (recommended - using deserialize):
import numpy as np
from tfbpmodeling.bootstrap_model_results import BootstrapModelResults
# Load using the deserialize class method
results = BootstrapModelResults.deserialize(
ci_dict_json='all_data_result_object/result_obj_ci.json',
coefs_alphas_pkl='all_data_result_object/result_obj_coefs_alphas.pkl'
)
# Access all the data
print(results.bootstrap_coefs_df.shape) # (1000, n_coefficients)
print(f"Mean coefficient values:\n{results.bootstrap_coefs_df.mean()}")
print(f"Alpha statistics: {np.mean(results.alpha_list):.6f} ± {np.std(results.alpha_list):.6f}")
# Extract significant coefficients at a specific CI level
significant = results.extract_significant_coefficients(ci_level=95.0)
print(f"Significant coefficients: {significant}")
Example loading (alternative - direct pickle loading):
import pickle
import numpy as np
# Load just the coefficients and alphas directly
with open('all_data_result_object/result_obj_coefs_alphas.pkl', 'rb') as f:
bootstrap_coefs_df, alpha_list = pickle.load(f)
print(bootstrap_coefs_df.shape) # (1000, n_coefficients)
print(f"Mean alpha: {np.mean(alpha_list):.6f}")
all_data_significant_{ci_level}.json¶
Coefficients identified as significant at the specified confidence level.
Filename examples:
- all_data_significant_98-0.json (for --all_data_ci_level 98.0)
- all_data_significant_95-0.json (for --all_data_ci_level 95.0)
Format:
{
"TF1:TF2": [0.123, 0.456],
"TF1:TF4": [0.234, 0.567],
"I(TF1 ** 2)": [0.045, 0.123]
}
Structure:
- Keys: Coefficient names that passed the significance threshold
- Values: [lower_bound, upper_bound] of the confidence interval
- Only includes coefficients whose CI does not include zero
Interpretation:
- These are the coefficients that survived Stage 2 filtering
- They will be used as the formula for Stage 4 (top-N modeling), as shown below
- Represents the first level of feature selection
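A minimal sketch of how the surviving coefficient names map onto the Stage 4 formula (the filename is an example; the suffix depends on --all_data_ci_level):
import json

with open('all_data_significant_98-0.json') as f:
    significant = json.load(f)

# The surviving coefficient names form the Stage 4 (top-N) model formula
formula = " + ".join(significant.keys())
print(formula)  # e.g., TF1:TF2 + TF1:TF4 + I(TF1 ** 2)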
Stage 3: Best All Data Model¶
best_all_data_model.pkl¶
The single best-fit model trained on all data using only the significant coefficients from Stage 2.
Type: Dictionary containing the fitted model and metadata
Structure:
{
"model": <sklearn.linear_model.LassoCV>, # Fitted estimator
"feature_names": ["TF1:TF2", "TF1:TF4", ...], # Column names from design matrix
"formula": "TF1:TF2 + TF1:TF4 + ...", # Model formula used
"perturbed_tf": "TF1", # Name of perturbed TF
"scale_by_std": False, # Whether scaling was applied
"drop_intercept": True, # Whether intercept was dropped from design matrix
}
Example loading:
import joblib
# Load the model bundle
bundle = joblib.load('best_all_data_model.pkl')
# Access the fitted model
model = bundle["model"]
feature_names = bundle["feature_names"]
# Access model attributes
coefficients = model.coef_ # numpy array of fitted coefficients
intercept = model.intercept_ # fitted intercept value
optimal_alpha = model.alpha_ # optimal regularization parameter
n_iterations = model.n_iter_ # iterations to convergence
# Create coefficient dictionary with feature names
coef_dict = dict(zip(feature_names, coefficients))
print(coef_dict)
# Make predictions on new data
# Note: X_new must be built with the same formula and preprocessing; see the sketch below
predictions = model.predict(X_new)
Key differences from bootstrap results:
- Single model (not 1000 bootstrap models)
- Trained without sample weights
- Uses stratified cross-validation for alpha selection
- Represents the "final" model on all available data
Use cases:
- Make predictions on new data (see the sketch below)
- Extract final coefficient estimates
- Understand the optimal regularization strength
- Deploy for production predictions
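The bundle stores the formula string but not a fitted transformer, so new data must be pushed through the same design-matrix construction. A hedged sketch, assuming the formula is patsy-style (the I(TF1 ** 2) syntax above suggests this) and that scale_by_std was False; verify against the package's actual preprocessing:
import joblib
import pandas as pd
from patsy import dmatrix  # assumption: formulas are patsy-compatible

bundle = joblib.load('best_all_data_model.pkl')

# Hypothetical new data: one column per TF appearing in the formula
new_df = pd.DataFrame({
    "TF1": [0.5, 1.2],
    "TF2": [0.3, 0.8],
    "TF4": [0.1, 0.4],
})

# "0 +" drops the intercept, matching drop_intercept=True in the bundle
X_new = dmatrix("0 + " + bundle["formula"], data=new_df, return_type="dataframe")

# Reorder columns to match the training design matrix before predicting
predictions = bundle["model"].predict(X_new[bundle["feature_names"]])
print(predictions)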
Stage 4: Top-N Modeling Results¶
topn_result_object/¶
Similar structure to all_data_result_object/, but the models are trained only on the top-N genes (ranked by perturbed TF binding).
result_obj_ci.json¶
Confidence intervals for coefficients from top-N data modeling.
Format: Same as all_data_result_object/result_obj_ci.json
Differences:
- Trained only on genes in the top-N ranking (by perturbed TF binding)
- May have different coefficient values due to the subset of data
- Provides refined estimates on high-signal genes
result_obj_coefs_alphas.pkl¶
Bootstrap coefficients and alphas from top-N modeling.
Format: Same as all_data_result_object/result_obj_coefs_alphas.pkl
Example loading:
from tfbpmodeling.bootstrap_model_results import BootstrapModelResults
# Load top-N results
topn_results = BootstrapModelResults.deserialize(
ci_dict_json='topn_result_object/result_obj_ci.json',
coefs_alphas_pkl='topn_result_object/result_obj_coefs_alphas.pkl'
)
# Compare with all data results
all_data_results = BootstrapModelResults.deserialize(
ci_dict_json='all_data_result_object/result_obj_ci.json',
coefs_alphas_pkl='all_data_result_object/result_obj_coefs_alphas.pkl'
)
# Extract significant coefficients from both
all_data_sig = all_data_results.extract_significant_coefficients(ci_level=98.0)
topn_sig = topn_results.extract_significant_coefficients(ci_level=90.0)
# Find coefficients that survived both stages
survived = set(all_data_sig.keys()) & set(topn_sig.keys())
print(f"Coefficients surviving both stages: {len(survived)}")
topn_significant_{ci_level}.json¶
Significant coefficients from top-N modeling at the specified confidence level.
Filename examples:
- topn_significant_90-0.json (for --topn_ci_level 90.0)
- topn_significant_85-0.json (for --topn_ci_level 85.0)
Format:
{
"TF1:TF2": [0.145, 0.478],
"TF1:TF4": [0.256, 0.589]
}
Interpretation:
- These coefficients survived both Stage 2 AND Stage 4 filtering (see the sketch below)
- Represents high-confidence predictors on high-signal genes
- These are the features evaluated in Stage 5 (interactor significance)
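A lightweight, JSON-only way to intersect the two stages (filenames are examples; the suffixes depend on the CI levels you passed):
import json

with open('all_data_significant_98-0.json') as f:
    all_data_sig = json.load(f)
with open('topn_significant_90-0.json') as f:
    topn_sig = json.load(f)

# Coefficients present in both files survived Stage 2 and Stage 4
survivors = sorted(set(all_data_sig) & set(topn_sig))
print(f"Survived both stages: {survivors}")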
Stage 5: Interactor Significance Results¶
interactor_vs_main_result.json¶
Statistical comparison between interaction terms and their corresponding main effects.
Format:
[
{
"interactor": "TF1:TF2",
"variant": "TF2",
"avg_r2_interactor": 0.456,
"avg_r2_main_effect": 0.389,
"delta_r2": -0.067
},
{
"interactor": "TF1:TF4",
"variant": "TF4",
"avg_r2_interactor": 0.456,
"avg_r2_main_effect": 0.412,
"delta_r2": -0.044
}
]
Fields:
- interactor: The interaction term being tested (e.g., "TF1:TF2")
- variant: The corresponding main effect (e.g., "TF2")
- avg_r2_interactor: Cross-validated R² with the interaction term
- avg_r2_main_effect: Cross-validated R² when interaction is replaced by main effect
- delta_r2: avg_r2_main_effect - avg_r2_interactor
Interpretation:
- Negative delta_r2: Interaction term performs better than the main effect (desirable)
- Positive delta_r2: Main effect performs better than the interaction (the interaction may not be necessary)
- Magnitude: How much predictive power is gained or lost by using the interaction
Usage:
- Identify which interactions provide genuine predictive value (example below)
- Filter out interactions that could be replaced by simpler main effects
- Final validation that interaction terms are scientifically meaningful
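For instance, a short sketch that keeps only the interactions with negative delta_r2, i.e., those that beat their main effect:
import json

with open('interactor_vs_main_result.json') as f:
    evaluations = json.load(f)

# Negative delta_r2 means the interaction outperforms its main effect
supported = [e for e in evaluations if e["delta_r2"] < 0]
for e in sorted(supported, key=lambda e: e["delta_r2"]):
    print(f"{e['interactor']} (vs {e['variant']}): delta_r2 = {e['delta_r2']:.3f}")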
File Formats Summary¶
JSON Files¶
All JSON files are human-readable and can be loaded with:
import json
with open('file.json', 'r') as f:
data = json.load(f)
Files:
- result_obj_ci.json - Nested dictionaries with confidence intervals
- all_data_significant_{ci}.json - Dictionary of significant coefficients
- topn_significant_{ci}.json - Dictionary of significant coefficients
- interactor_vs_main_result.json - List of evaluation dictionaries
Pickle Files¶
Pickle files require Python to load and contain Python objects:
import pickle
with open('file.pkl', 'rb') as f:
data = pickle.load(f)
Files:
- result_obj_coefs_alphas.pkl - Tuple of (DataFrame, list)
- best_all_data_model.pkl - Dictionary bundle containing fitted sklearn LassoCV model and metadata (use joblib.load())