Output Structure and Files

This document describes the complete output structure and file formats generated by the linear_perturbation_binding_modeling command.

Directory Structure

When you run linear_perturbation_binding_modeling, results are saved in a subdirectory within your specified --output_dir:

{output_dir}/{perturbed_tf}{output_suffix}/
├── all_data_result_object/
│   ├── result_obj_ci.json
│   └── result_obj_coefs_alphas.pkl
├── topn_result_object/
│   ├── result_obj_ci.json
│   └── result_obj_coefs_alphas.pkl
├── all_data_significant_{ci_level}.json
├── topn_significant_{ci_level}.json
├── best_all_data_model.pkl
└── interactor_vs_main_result.json

Directory Naming

  • Base directory: Specified by --output_dir (default: ./linear_perturbation_binding_modeling_results)
  • Subdirectory: {perturbed_tf}{output_suffix}, where
    • perturbed_tf is the name of the transcription factor being analyzed
    • output_suffix is an optional suffix specified with --output_suffix
  • Example: CBF1_experiment1/ (with a suffix) or just CBF1/ (see the path sketch below)
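
For reference, a minimal sketch of how the results path is assembled; the values here are illustrative, not defaults:

from pathlib import Path

output_dir = Path("./linear_perturbation_binding_modeling_results")  # --output_dir default
perturbed_tf, output_suffix = "CBF1", "_experiment1"  # illustrative values
results_dir = output_dir / f"{perturbed_tf}{output_suffix}"
print(results_dir)  # linear_perturbation_binding_modeling_results/CBF1_experiment1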

Stage 2: Bootstrap Modeling Results (All Data)

all_data_result_object/

Contains the complete bootstrap modeling results from Stage 2, where all features are used.

result_obj_ci.json

Confidence intervals for all model coefficients across bootstrap iterations.

Format:

{
  "95.0": {
    "TF1:TF2": [0.123, 0.456],
    "TF1:TF3": [-0.012, 0.089],
    "TF1:TF4": [0.234, 0.567]
  },
  "99.0": {
    "TF1:TF2": [0.089, 0.501],
    "TF1:TF3": [-0.045, 0.123],
    "TF1:TF4": [0.189, 0.612]
  }
}

Structure:

  • Keys: Confidence interval levels (e.g., "95.0", "99.0")
  • Values: Dictionaries mapping coefficient names to [lower_bound, upper_bound]
  • Coefficient names: Follow the model formula (e.g., interaction terms like "TF1:TF2")

Usage:

  • Identify which coefficients have confidence intervals that exclude zero (see the sketch below)
  • Compare coefficient stability across different CI levels
  • Used internally to extract significant coefficients
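
A minimal sketch of the first usage above, relying only on the documented JSON layout: load the file and keep coefficients whose interval excludes zero.

import json

with open('all_data_result_object/result_obj_ci.json') as f:
    ci = json.load(f)

# A CI excludes zero when both bounds fall on the same side of it
excludes_zero = {
    name: bounds
    for name, bounds in ci["95.0"].items()
    if bounds[0] > 0 or bounds[1] < 0
}
print(excludes_zero)  # e.g. {"TF1:TF2": [0.123, 0.456], ...}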

result_obj_coefs_alphas.pkl

Pickled file containing raw bootstrap results.

Contents:

(bootstrap_coefs_df, alpha_list)

  • bootstrap_coefs_df: pandas DataFrame
    • Rows: Bootstrap iterations (e.g., 1000 rows for 1000 bootstraps)
    • Columns: Model coefficients (matching the model formula)
    • Values: Fitted coefficient values from each bootstrap iteration
  • alpha_list: list of floats
    • Length: Number of bootstrap iterations
    • Values: Optimal alpha (regularization parameter) selected by LassoCV in each iteration

Example loading (recommended - using deserialize):

import numpy as np

from tfbpmodeling.bootstrap_model_results import BootstrapModelResults

# Load using the deserialize class method
results = BootstrapModelResults.deserialize(
    ci_dict_json='all_data_result_object/result_obj_ci.json',
    coefs_alphas_pkl='all_data_result_object/result_obj_coefs_alphas.pkl'
)

# Access all the data
print(results.bootstrap_coefs_df.shape)  # (1000, n_coefficients)
print(f"Mean coefficient values:\n{results.bootstrap_coefs_df.mean()}")
print(f"Alpha statistics: {np.mean(results.alpha_list):.6f} ± {np.std(results.alpha_list):.6f}")

# Extract significant coefficients at a specific CI level
significant = results.extract_significant_coefficients(ci_level=95.0)
print(f"Significant coefficients: {significant}")

Example loading (alternative - direct pickle loading):

import pickle
import numpy as np

# Load just the coefficients and alphas directly
with open('all_data_result_object/result_obj_coefs_alphas.pkl', 'rb') as f:
    bootstrap_coefs_df, alpha_list = pickle.load(f)

print(bootstrap_coefs_df.shape)  # (1000, n_coefficients)
print(f"Mean alpha: {np.mean(alpha_list):.6f}")

all_data_significant_{ci_level}.json

Coefficients identified as significant at the specified confidence level.

Filename examples (see the sketch below):

  • all_data_significant_98-0.json (for --all_data_ci_level 98.0)
  • all_data_significant_95-0.json (for --all_data_ci_level 95.0)
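
In these examples the decimal point in the CI level becomes a hyphen in the filename. A small sketch of locating the file for a given level (the replacement rule is inferred from the names above):

ci_level = 98.0
filename = f"all_data_significant_{str(ci_level).replace('.', '-')}.json"
print(filename)  # all_data_significant_98-0.json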

Format:

{
  "TF1:TF2": [0.123, 0.456],
  "TF1:TF4": [0.234, 0.567],
  "I(TF1 ** 2)": [0.045, 0.123]
}

Structure:

  • Keys: Coefficient names that passed the significance threshold
  • Values: [lower_bound, upper_bound] of the confidence interval
  • Only includes coefficients whose CI does not include zero

Interpretation:

  • These are the coefficients that survived Stage 2 filtering
  • They will be used as the formula for Stage 4 (top-N modeling), as sketched below
  • They represent the first level of feature selection
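
As a sketch of how the surviving names feed Stage 4, the keys of this file can be joined into a formula string; the exact assembly inside the pipeline may differ:

import json

with open('all_data_significant_98-0.json') as f:
    significant = json.load(f)

# The surviving coefficient names become the predictors for Stage 4
stage4_formula = " + ".join(significant.keys())
print(stage4_formula)  # e.g. "TF1:TF2 + TF1:TF4 + I(TF1 ** 2)"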

Stage 3: Best All Data Model

best_all_data_model.pkl

The single best-fit model trained on all data using only the significant coefficients from Stage 2.

Type: Dictionary containing the fitted model and metadata

Structure:

{
    "model": <sklearn.linear_model.LassoCV>,  # Fitted estimator
    "feature_names": ["TF1:TF2", "TF1:TF4", ...],  # Column names from design matrix
    "formula": "TF1:TF2 + TF1:TF4 + ...",  # Model formula used
    "perturbed_tf": "TF1",  # Name of perturbed TF
    "scale_by_std": False,  # Whether scaling was applied
    "drop_intercept": True,  # Whether intercept was dropped from design matrix
}

Example loading:

import joblib

# Load the model bundle
bundle = joblib.load('best_all_data_model.pkl')

# Access the fitted model
model = bundle["model"]
feature_names = bundle["feature_names"]

# Access model attributes
coefficients = model.coef_          # numpy array of fitted coefficients
intercept = model.intercept_        # fitted intercept value
optimal_alpha = model.alpha_        # optimal regularization parameter
n_iterations = model.n_iter_        # iterations to convergence

# Create coefficient dictionary with feature names
coef_dict = dict(zip(feature_names, coefficients))
print(coef_dict)

# Make predictions on new data.
# Note: X_new is a placeholder; it must be prepared with the same
# formula and preprocessing used in training (see the sketch below).
# predictions = model.predict(X_new)

Key differences from bootstrap results:

  • Single model (not 1000 bootstrap models)
  • Trained without sample weights
  • Uses stratified cross-validation for alpha selection
  • Represents the "final" model on all available data

Use cases:

  • Make predictions on new data (see the design-matrix sketch below)
  • Extract final coefficient estimates
  • Understand the optimal regularization strength
  • Deploy for production predictions
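
The note about preparing X_new can be fleshed out with patsy, assuming the stored formula is patsy-style (names like "TF1:TF2" and "I(TF1 ** 2)" suggest it) and using a hypothetical new_binding_df; this is a sketch, not the package's own prediction path:

import joblib
import patsy

# new_binding_df: hypothetical DataFrame with one column of binding scores per TF
bundle = joblib.load('best_all_data_model.pkl')

# Appending "- 1" removes patsy's automatic intercept column,
# matching drop_intercept=True in the bundle
formula = bundle["formula"] + (" - 1" if bundle["drop_intercept"] else "")
X_new = patsy.dmatrix(formula, data=new_binding_df, return_type="dataframe")

# Align column order with the training design matrix before predicting;
# if scale_by_std were True, the same scaling would also be needed here
X_new = X_new[bundle["feature_names"]]
predictions = bundle["model"].predict(X_new)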

Stage 4: Top-N Modeling Results

topn_result_object/

Similar in structure to all_data_result_object/, but the models are trained only on the top-N genes (ranked by perturbed TF binding).

result_obj_ci.json

Confidence intervals for coefficients from top-N data modeling.

Format: Same as all_data_result_object/result_obj_ci.json

Differences:

  • Trained only on genes in the top-N ranking (by perturbed TF binding)
  • Coefficient values may differ because only a subset of the data is used
  • Provides refined estimates on high-signal genes

result_obj_coefs_alphas.pkl

Bootstrap coefficients and alphas from top-N modeling.

Format: Same as all_data_result_object/result_obj_coefs_alphas.pkl

Example loading:

from tfbpmodeling.bootstrap_model_results import BootstrapModelResults

# Load top-N results
topn_results = BootstrapModelResults.deserialize(
    ci_dict_json='topn_result_object/result_obj_ci.json',
    coefs_alphas_pkl='topn_result_object/result_obj_coefs_alphas.pkl'
)

# Compare with all data results
all_data_results = BootstrapModelResults.deserialize(
    ci_dict_json='all_data_result_object/result_obj_ci.json',
    coefs_alphas_pkl='all_data_result_object/result_obj_coefs_alphas.pkl'
)

# Extract significant coefficients from both
all_data_sig = all_data_results.extract_significant_coefficients(ci_level=98.0)
topn_sig = topn_results.extract_significant_coefficients(ci_level=90.0)

# Find coefficients that survived both stages
survived = set(all_data_sig.keys()) & set(topn_sig.keys())
print(f"Coefficients surviving both stages: {len(survived)}")

topn_significant_{ci_level}.json

Significant coefficients from top-N modeling at the specified confidence level.

Filename examples:

  • topn_significant_90-0.json (for --topn_ci_level 90.0)
  • topn_significant_85-0.json (for --topn_ci_level 85.0)

Format:

{
  "TF1:TF2": [0.145, 0.478],
  "TF1:TF4": [0.256, 0.589]
}

Interpretation:

  • These coefficients survived both Stage 2 AND Stage 4 filtering
  • They represent high-confidence predictors on high-signal genes
  • These are the features evaluated in Stage 5 (interactor significance)

Stage 5: Interactor Significance Results

interactor_vs_main_result.json

Statistical comparison between interaction terms and their corresponding main effects.

Format:

[
  {
    "interactor": "TF1:TF2",
    "variant": "TF2",
    "avg_r2_interactor": 0.456,
    "avg_r2_main_effect": 0.389,
    "delta_r2": -0.067
  },
  {
    "interactor": "TF1:TF4",
    "variant": "TF4",
    "avg_r2_interactor": 0.456,
    "avg_r2_main_effect": 0.412,
    "delta_r2": -0.044
  }
]

Fields:

  • interactor: The interaction term being tested (e.g., "TF1:TF2")
  • variant: The corresponding main effect (e.g., "TF2")
  • avg_r2_interactor: Cross-validated R² with the interaction term
  • avg_r2_main_effect: Cross-validated R² when the interaction is replaced by the main effect
  • delta_r2: avg_r2_main_effect - avg_r2_interactor

Interpretation:

  • Negative delta_r2: The interaction term performs better than the main effect (desirable)
  • Positive delta_r2: The main effect performs better than the interaction (the interaction may not be necessary)
  • Magnitude: How much predictive power is gained or lost by using the interaction

Usage:

  • Identify which interactions provide genuine predictive value (see the sketch below)
  • Filter out interactions that could be replaced by simpler main effects
  • Final validation that interaction terms are scientifically meaningful
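
A minimal sketch that applies this interpretation using only the documented fields: keep interactions with negative delta_r2 and rank them.

import json

with open('interactor_vs_main_result.json') as f:
    evaluations = json.load(f)

# Interactions that outperform their main-effect replacement (negative delta_r2)
genuine = sorted(
    (e for e in evaluations if e["delta_r2"] < 0),
    key=lambda e: e["delta_r2"],
)
for e in genuine:
    print(f"{e['interactor']}: delta_r2 = {e['delta_r2']:.3f}")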

File Formats Summary

JSON Files

All JSON files are human-readable and can be loaded with:

import json

with open('file.json', 'r') as f:
    data = json.load(f)

Files:

  • result_obj_ci.json: Nested dictionaries with confidence intervals
  • all_data_significant_{ci}.json: Dictionary of significant coefficients
  • topn_significant_{ci}.json: Dictionary of significant coefficients
  • interactor_vs_main_result.json: List of evaluation dictionaries

Pickle Files

Pickle files contain serialized Python objects and must be loaded from Python:

import pickle

with open('file.pkl', 'rb') as f:
    data = pickle.load(f)

Files:

  • result_obj_coefs_alphas.pkl: Tuple of (DataFrame, list)
  • best_all_data_model.pkl: Dictionary bundle containing a fitted sklearn LassoCV model and metadata (use joblib.load(); see the loader sketch below)
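
Putting it together, a convenience sketch (not part of the package) that gathers the main artifacts from one results directory; the glob pattern for the significance files is an assumption based on the names above:

import json
import pickle
from pathlib import Path

import joblib

def load_results(results_dir):
    """Gather the key outputs from a {perturbed_tf}{output_suffix} directory."""
    d = Path(results_dir)
    out = {"best_model": joblib.load(d / "best_all_data_model.pkl")}
    with open(d / "interactor_vs_main_result.json") as f:
        out["interactor_vs_main"] = json.load(f)
    for stage in ("all_data", "topn"):
        with open(d / f"{stage}_result_object" / "result_obj_coefs_alphas.pkl", "rb") as f:
            out[f"{stage}_coefs"], out[f"{stage}_alphas"] = pickle.load(f)
    # Significance files embed the CI level, e.g. all_data_significant_98-0.json
    for p in d.glob("*_significant_*.json"):
        with open(p) as f:
            out[p.stem] = json.load(f)
    return out

results = load_results("CBF1_experiment1")
print(sorted(results.keys()))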