interface¶

The main interface module provides the core workflow functions and command-line interface components for tfbpmodeling.

tfbpmodeling.interface ¶

CustomHelpFormatter ¶

Bases: HelpFormatter

This could be used to customize the help message formatting for the argparse parser.

Left as a placeholder.

common_modeling_input_arguments ¶

common_modeling_input_arguments(parser, top_n_default=600)

Add common input arguments for modeling commands.

Source code in tfbpmodeling/interface.py

def common_modeling_input_arguments(
    parser: argparse._ArgumentGroup, top_n_default: int | None = 600
) -> None:
    """Add common input arguments for modeling commands."""
    parser.add_argument(
        "--response_file",
        type=str,
        required=True,
        help=(
            "Path to the response CSV file. The first column must contain "
            "feature names or locus tags (e.g., gene symbols), matching the index "
            "format in both response and predictor files. The perturbed gene will "
            "be removed from the model data only if its column names match the "
            "index format."
        ),
    )
    parser.add_argument(
        "--predictors_file",
        type=str,
        required=True,
        help=(
            "Path to the predictors CSV file. The first column must contain "
            "feature names or locus tags (e.g., gene symbols), ensuring consistency "
            "between response and predictor files."
        ),
    )
    parser.add_argument(
        "--perturbed_tf",
        type=str,
        required=True,
        help=(
            "Name of the perturbed transcription factor (TF) used as the "
            "response variable. It must match a column in the response file."
        ),
    )
    parser.add_argument(
        "--blacklist_file",
        type=str,
        default="",
        help=(
            "Optional file containing a list of features (one per line) to be excluded "
            "from the analysis."
        ),
    )
    parser.add_argument(
        "--n_bootstraps",
        type=int,
        default=1000,
        help="Number of bootstrap samples to generate for resampling. Default is 1000",
    )
    parser.add_argument(
        "--random_state",
        type=int,
        default=None,
        help="Set this to an integer to make the bootstrap sampling reproducible. "
        "Default is None (no fixed seed) and each call will produce different "
        "bootstrap indices. Note that if this is set, the `top_n` random_state will "
        "be +10 in order to make the top_n indices different from the `all_data` step",
    )
    parser.add_argument(
        "--normalize_sample_weights",
        action="store_true",
        help=(
            "Set this to normalize the sample weights to sum to 1. " "Default is False."
        ),
    )
    parser.add_argument(
        "--scale_by_std",
        action="store_true",
        help=(
            "Set this to scale the model matrix by standard deviation"
            "(without centering). The data is scaled using"
            "StandardScaler(with_mean=False, with_std=True). The estimator will"
            "still fit an intercept (fit_intercept=True) since the "
            "data is not centered."
        ),
    )
    parser.add_argument(
        "--top_n",
        type=int,
        default=top_n_default,
        help=(
            "Number of features to retain in the second round of modeling. "
            f"Default is {top_n_default}"
        ),
    )

linear_perturbation_binding_modeling ¶

linear_perturbation_binding_modeling(args)

Parameters:	`args` – Command-line arguments containing input file paths and parameters.

Source code in tfbpmodeling/interface.py

def linear_perturbation_binding_modeling(args):
    """
    :param args: Command-line arguments containing input file paths and parameters.
    """
    if not isinstance(args.max_iter, int) or args.max_iter < 1:
        raise ValueError("The `max_iter` parameter must be a positive integer.")

    max_iter = int(args.max_iter)

    logger.info(f"estimator max_iter: {max_iter}.")

    logger.info("Step 1: Preprocessing")

    # validate input files/dirs
    if not os.path.exists(args.response_file):
        raise FileNotFoundError(f"File {args.response_file} does not exist.")
    if not os.path.exists(args.predictors_file):
        raise FileNotFoundError(f"File {args.predictors_file} does not exist.")
    if os.path.exists(args.output_dir):
        logger.warning(f"Output directory {args.output_dir} already exists.")
    else:
        os.makedirs(args.output_dir, exist_ok=True)
        logger.info(f"Output directory created at {args.output_dir}")

    # the output subdir is where the output of this modeling run will be saved
    output_subdir = os.path.join(
        args.output_dir, os.path.join(args.perturbed_tf + args.output_suffix)
    )
    if os.path.exists(output_subdir):
        raise FileExistsError(
            f"Directory {output_subdir} already exists. "
            "Please specify a different `output_dir`."
        )
    else:
        os.makedirs(output_subdir, exist_ok=True)
        logger.info(f"Output subdirectory created at {output_subdir}")

    # instantiate a estimator
    estimator = LassoCV(
        fit_intercept=True,
        selection="random",
        n_alphas=100,
        random_state=42,
        n_jobs=args.n_cpus,
        max_iter=max_iter,
    )

    input_data = ModelingInputData.from_files(
        response_path=args.response_file,
        predictors_path=args.predictors_file,
        perturbed_tf=args.perturbed_tf,
        feature_blacklist_path=args.blacklist_file,
        top_n=args.top_n,
    )

    logger.info("Step 2: Bootstrap LassoCV on all data, full interactor model")

    # Unset the top n masking -- we want to use all the data for the first round
    # modeling
    input_data.top_n_masked = False

    # extract a list of predictor variables, which are the columns of the predictors_df
    predictor_variables = input_data.predictors_df.columns.drop(input_data.perturbed_tf)

    # drop any variables which are in args.exclude_interactor_variables
    predictor_variables = exclude_predictor_variables(
        list(predictor_variables), args.exclude_interactor_variables
    )

    # create a list of interactor terms with the perturbed_tf as the first term
    interaction_terms = [
        f"{input_data.perturbed_tf}:{var}" for var in predictor_variables
    ]

    # Construct the full interaction formula, ie perturbed_tf + perturbed_tf:other_tf1 +
    # perturbed_tf:other_tf2 + ... . perturbed_tf main effect only added if
    # --ptf_main_effect is passed.
    if args.ptf_main_effect:
        logger.info("adding pTF main effect to `all_data_formula`")
        all_data_formula = (
            f"{input_data.perturbed_tf} + {' + '.join(interaction_terms)}"
        )
    else:
        all_data_formula = " + ".join(interaction_terms)

    if args.squared_pTF:
        # if --squared_pTF is passed, then add the squared perturbed TF to the formula
        squared_term = f"I({input_data.perturbed_tf} ** 2)"
        logger.info(f"Adding squared term to model formula: {squared_term}")
        all_data_formula += f" + {squared_term}"

    if args.cubic_pTF:
        # if --cubic_pTF is passed, then add the cubic perturbed TF to the formula
        cubic_term = f"I({input_data.perturbed_tf} ** 3)"
        logger.info(f"Add cubic term to model formula: {cubic_term}")
        all_data_formula += f" + {cubic_term}"

    # if --row_max is passed, then add "row_max" to the formula
    if args.row_max:
        logger.info("Adding `row_max` to the all data model formula")
        all_data_formula += " + row_max"

    # if --add_model_variables is passed, then add the variables to the formula
    if args.add_model_variables:
        logger.info(
            f"Adding model variables to the all data model "
            f"formula: {args.add_model_variables}"
        )
        all_data_formula += " + " + " + ".join(args.add_model_variables)

    logger.debug(f"All data formula: {all_data_formula}")

    # create the bootstrapped data.
    bootstrapped_data_all = BootstrappedModelingInputData(
        response_df=input_data.response_df,
        model_df=input_data.get_modeling_data(
            all_data_formula,
            add_row_max=args.row_max,
            drop_intercept=True,
            scale_by_std=args.scale_by_std,
        ),
        n_bootstraps=args.n_bootstraps,
        normalize_sample_weights=args.normalize_sample_weights,
        random_state=args.random_state,
    )

    logger.info(
        f"Running bootstrap LassoCV on all data with {args.n_bootstraps} bootstraps"
    )
    if args.iterative_dropout:
        logger.info("Using iterative dropout modeling for all data results.")
        all_data_results = bootstrap_stratified_cv_loop(
            bootstrapped_data=bootstrapped_data_all,
            perturbed_tf_series=input_data.predictors_df[input_data.perturbed_tf],
            estimator=estimator,
            ci_percentile=float(args.all_data_ci_level),
            stabilization_ci_start=args.stabilization_ci_start,
            bins=args.bins,
            output_dir=output_subdir,
        )
    else:
        logger.info("Using standard bootstrap modeling for all data results.")
        all_data_results = bootstrap_stratified_cv_modeling(
            bootstrapped_data=bootstrapped_data_all,
            perturbed_tf_series=input_data.predictors_df[input_data.perturbed_tf],
            estimator=estimator,
            ci_percentiles=[float(args.all_data_ci_level)],
            bins=args.bins,
        )
    # create the all data object output subdir
    all_data_output = os.path.join(output_subdir, "all_data_result_object")
    os.makedirs(all_data_output, exist_ok=True)

    logger.info(f"Serializing all data results to {all_data_output}")
    all_data_results.serialize("result_obj", all_data_output)

    # Extract the coefficients that are significant at the specified confidence level
    all_data_sig_coefs = all_data_results.extract_significant_coefficients(
        ci_level=args.all_data_ci_level,
    )

    logger.info(f"all_data_sig_coefs: {all_data_sig_coefs}")

    if not all_data_sig_coefs:
        logger.warning(
            f"No significant coefficients found at {args.all_data_ci_level}% "
            "confidence level. Exiting."
        )
        return

    # write all_data_sig_coefs to a json file
    all_data_ci_str = str(args.all_data_ci_level).replace(".", "-")
    all_data_output_file = os.path.join(
        output_subdir, f"all_data_significant_{all_data_ci_str}.json"
    )
    logger.info(f"Writing the all data significant results to {all_data_output_file}")
    with open(
        all_data_output_file,
        "w",
    ) as f:
        json.dump(all_data_sig_coefs, f, indent=4)

    # extract the significant coefficients and create a formula.
    all_data_sig_coefs_formula = f"{' + '.join(all_data_sig_coefs.keys())}"
    logger.debug(f"`all_data_sig_coefs_formula` formula: {all_data_sig_coefs_formula}")

    logger.info(
        "Step 3: Bootstrap LassoCV on the significant coefficients "
        "from the all data model. This produces the best model for all data"
    )

    skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
    classes = stratification_classification(
        input_data.predictors_df[input_data.perturbed_tf].squeeze(),
        bins=args.bins,
    )

    best_all_data_model_df = input_data.get_modeling_data(
        all_data_sig_coefs_formula,
        add_row_max=args.row_max,
        drop_intercept=True,
        scale_by_std=args.scale_by_std,
    )
    best_all_data_model = stratified_cv_modeling(
        input_data.response_df,
        best_all_data_model_df,
        classes=classes,
        estimator=estimator,
        skf=skf,
        sample_weight=None,
    )

    # save the best all data model to file with metadata
    best_model_file = os.path.join(output_subdir, "best_all_data_model.pkl")
    logger.info(f"Saving the best all data model to {best_model_file}")

    # Bundle model with metadata so feature names are preserved
    model_bundle = {
        "model": best_all_data_model,
        "feature_names": list(best_all_data_model_df.columns),
        "formula": all_data_sig_coefs_formula,
        "perturbed_tf": input_data.perturbed_tf,
        "scale_by_std": args.scale_by_std,
        "drop_intercept": True,
    }
    joblib.dump(model_bundle, best_model_file)

    logger.info(
        "Step 4: Running LassoCV on topn data with significant coefficients "
        "from the all data model"
    )

    # apply the top_n masking
    input_data.top_n_masked = True

    # Create the bootstrapped data for the topn modeling
    bootstrapped_data_top_n = BootstrappedModelingInputData(
        response_df=input_data.response_df,
        model_df=input_data.get_modeling_data(
            all_data_sig_coefs_formula,
            add_row_max=args.row_max,
            drop_intercept=True,
            scale_by_std=args.scale_by_std,
        ),
        n_bootstraps=args.n_bootstraps,
        normalize_sample_weights=args.normalize_sample_weights,
        random_state=(
            args.random_state + 10 if args.random_state else args.random_state
        ),
    )

    logger.debug(
        f"Running bootstrap LassoCV on topn data with {args.n_bootstraps} bootstraps"
    )
    topn_results = bootstrap_stratified_cv_modeling(
        bootstrapped_data_top_n,
        input_data.predictors_df[input_data.perturbed_tf],
        estimator=estimator,
        ci_percentiles=[float(args.topn_ci_level)],
    )

    # create the topn data object output subdir
    topn_output = os.path.join(output_subdir, "topn_result_object")
    os.makedirs(topn_output, exist_ok=True)

    logger.info(f"Serializing topn results to {topn_output}")
    topn_results.serialize("result_obj", topn_output)

    # extract the topn_results at the specified confidence level
    topn_output_res = topn_results.extract_significant_coefficients(
        ci_level=args.topn_ci_level
    )

    logger.info(f"topn_output_res: {topn_output_res}")

    if not topn_output_res:
        logger.warning(
            f"No significant coefficients found at {args.topn_ci_level}% "
            "confidence level. Exiting."
        )
        return

    # write topn_output_res to a json file
    topn_ci_str = str(args.topn_ci_level).replace(".", "-")
    topn_output_file = os.path.join(
        output_subdir, f"topn_significant_{topn_ci_str}.json"
    )
    logger.info(f"Writing the topn significant results to {topn_output_file}")
    with open(topn_output_file, "w") as f:
        json.dump(topn_output_res, f, indent=4)

    logger.info(
        "Step 5: Test the significance of the interactor terms that survive "
        "against the corresponding main effect"
    )

    if args.stage4_topn:
        logger.info("Stage 4 will use top-n masked input data.")
        input_data.top_n_masked = True
    else:
        logger.info("Stage 4 will use full input data.")

    # calculate the statification classes for the perturbed TF (all data)
    stage4_classes = stratification_classification(
        input_data.predictors_df[input_data.perturbed_tf].squeeze(),
        bins=args.bins,
    )

    # Test the significance of the interactor terms
    evaluate_interactor_significance = (
        evaluate_interactor_significance_lassocv
        if args.stage4_lasso
        else evaluate_interactor_significance_linear
    )

    results = evaluate_interactor_significance(
        input_data,
        stratification_classes=stage4_classes,
        model_variables=list(
            topn_results.extract_significant_coefficients(
                ci_level=args.topn_ci_level
            ).keys()
        ),
        estimator=estimator,
    )

    output_significance_file = os.path.join(
        output_subdir, "interactor_vs_main_result.json"
    )
    logger.info(
        "Writing the final interactor significance "
        f"results to {output_significance_file}"
    )
    results.serialize(output_significance_file)

Overview¶

The interface module serves as the primary entry point for the tfbpmodeling workflow. It contains:

Main workflow function: linear_perturbation_binding_modeling()
CLI helper functions: Argument parsing utilities for the command-line interface
Custom formatters: Enhanced help formatting for better user experience

Main Functions¶

linear_perturbation_binding_modeling¶

The core function that executes the complete 4-stage TFBP modeling workflow:

Data Preprocessing: Load and validate input files, handle missing data
Bootstrap Modeling: All-data analysis with bootstrap resampling and LassoCV
Top-N Modeling: Refined analysis on significant predictors from top-performing data
Interactor Significance: Statistical evaluation of interaction terms vs main effects

Parameters: Command-line arguments object containing all configuration options

Returns: None (results saved to output directory)

Key Features: - Comprehensive input validation - Automatic output directory creation with timestamps - Detailed logging of all processing steps - Error handling with informative messages

CLI Helper Functions¶

common_modeling_input_arguments¶

Adds standard input arguments to argument parsers: - File paths for response and predictor data - Perturbed TF specification - Bootstrap and sampling parameters

common_modeling_feature_options¶

Configures feature engineering options: - Polynomial terms (squared, cubic) - Row maximum inclusion - Custom variable additions and exclusions

common_modeling_binning_arguments¶

Sets up data stratification parameters: - Bin edge specifications - Stratification methods

add_general_arguments_to_subparsers¶

Propagates global arguments to subcommand parsers: - Logging configuration - System-wide options

Data Flow¶

graph TD A[CLI Arguments] --> B[Input Validation] B --> C[Data Loading] C --> D[ModelingInputData] D --> E[BootstrappedModelingInputData] E --> F[Bootstrap CV Loop] F --> G[Top-N Selection] G --> H[Interactor Significance] H --> I[Results Output]

Usage Examples¶

Programmatic Usage¶

import argparse
from tfbpmodeling.interface import linear_perturbation_binding_modeling

# Create arguments object
args = argparse.Namespace(
    response_file='data/expression.csv',
    predictors_file='data/binding.csv',
    perturbed_tf='YPD1',
    n_bootstraps=1000,
    top_n=600,
    all_data_ci_level=98.0,
    topn_ci_level=90.0,
    max_iter=10000,
    output_dir='./results',
    output_suffix='',
    n_cpus=4,
    # ... other parameters
)

# Run analysis
linear_perturbation_binding_modeling(args)

Custom Argument Parser¶

import argparse
from tfbpmodeling.interface import (
    common_modeling_input_arguments,
    common_modeling_feature_options,
    CustomHelpFormatter
)

# Create custom parser
parser = argparse.ArgumentParser(
    formatter_class=CustomHelpFormatter,
    description="Custom TFBP Analysis"
)

# Add standard arguments
input_group = parser.add_argument_group("Input")
common_modeling_input_arguments(input_group)

feature_group = parser.add_argument_group("Features")
common_modeling_feature_options(feature_group)

# Parse and use
args = parser.parse_args()
linear_perturbation_binding_modeling(args)

Error Handling¶

The interface module includes comprehensive error handling:

Input Validation Errors¶

# File existence checks
FileNotFoundError: "File data/missing.csv does not exist."

# Parameter validation
ValueError: "The `max_iter` parameter must be a positive integer."

# Data format validation
ValueError: "Perturbed TF 'INVALID' not found in response file columns"

Runtime Errors¶

# Convergence issues
RuntimeWarning: "LassoCV failed to converge for 15/1000 bootstrap samples"

# Insufficient data
ValueError: "Insufficient data after filtering. Found 5 samples, minimum required: 10"

Configuration Options¶

The interface supports extensive configuration through command-line arguments:

Core Parameters¶

Input files: Response data, predictor data, optional blacklist
TF specification: Name of perturbed transcription factor
Bootstrap settings: Sample count, random seed, weight normalization

Feature Engineering¶

Polynomial terms: Squared and cubic pTF terms
Additional predictors: Row max, custom variables
Interaction control: Variable exclusions, main effects

Model Configuration¶

Confidence intervals: Separate thresholds for each stage
Convergence: Maximum iterations, dropout options
Performance: CPU cores, memory management

Output Control¶

Directory structure: Base directory, custom suffixes
Logging: Verbosity levels, file vs console output

Performance Considerations¶

Memory Management¶

Bootstrap samples stored efficiently using sparse representations
Automatic garbage collection between stages
Memory usage monitoring and warnings

Parallel Processing¶

LassoCV uses specified CPU cores for cross-validation
Bootstrap samples processed in batches
I/O operations optimized for large datasets

Runtime Optimization¶

Early stopping for non-convergent models
Adaptive batch sizing based on available memory
Progress reporting for long-running analyses

modeling_input_data: Core data structures
bootstrapped_input_data: Bootstrap resampling
bootstrap_model_results: Result aggregation
evaluate_interactor_significance_lassocv: LassoCV-based significance testing
evaluate_interactor_significance_linear: Linear regression-based significance testing