interface

The main interface module provides the core workflow functions and command-line interface components for tfbpmodeling.

tfbpmodeling.interface

CustomHelpFormatter

Bases: HelpFormatter

This could be used to customize the help message formatting for the argparse parser.

Left as a placeholder.

common_modeling_input_arguments

common_modeling_input_arguments(parser, top_n_default=600)

Add common input arguments for modeling commands.

Source code in tfbpmodeling/interface.py
def common_modeling_input_arguments(
    parser: argparse._ArgumentGroup, top_n_default: int | None = 600
) -> None:
    """Add common input arguments for modeling commands."""
    parser.add_argument(
        "--response_file",
        type=str,
        required=True,
        help=(
            "Path to the response CSV file. The first column must contain "
            "feature names or locus tags (e.g., gene symbols), matching the index "
            "format in both response and predictor files. The perturbed gene will "
            "be removed from the model data only if its column names match the "
            "index format."
        ),
    )
    parser.add_argument(
        "--predictors_file",
        type=str,
        required=True,
        help=(
            "Path to the predictors CSV file. The first column must contain "
            "feature names or locus tags (e.g., gene symbols), ensuring consistency "
            "between response and predictor files."
        ),
    )
    parser.add_argument(
        "--perturbed_tf",
        type=str,
        required=True,
        help=(
            "Name of the perturbed transcription factor (TF) used as the "
            "response variable. It must match a column in the response file."
        ),
    )
    parser.add_argument(
        "--blacklist_file",
        type=str,
        default="",
        help=(
            "Optional file containing a list of features (one per line) to be excluded "
            "from the analysis."
        ),
    )
    parser.add_argument(
        "--n_bootstraps",
        type=int,
        default=1000,
        help="Number of bootstrap samples to generate for resampling. Default is 1000",
    )
    parser.add_argument(
        "--random_state",
        type=int,
        default=None,
        help="Set this to an integer to make the bootstrap sampling reproducible. "
        "Default is None (no fixed seed) and each call will produce different "
        "bootstrap indices. Note that if this is set, the `top_n` random_state will "
        "be +10 in order to make the top_n indices different from the `all_data` step",
    )
    parser.add_argument(
        "--normalize_sample_weights",
        action="store_true",
        help=(
            "Set this to normalize the sample weights to sum to 1. " "Default is False."
        ),
    )
    parser.add_argument(
        "--scale_by_std",
        action="store_true",
        help=(
            "Set this to scale the model matrix by standard deviation"
            "(without centering). The data is scaled using"
            "StandardScaler(with_mean=False, with_std=True). The estimator will"
            "still fit an intercept (fit_intercept=True) since the "
            "data is not centered."
        ),
    )
    parser.add_argument(
        "--top_n",
        type=int,
        default=top_n_default,
        help=(
            "Number of features to retain in the second round of modeling. "
            f"Default is {top_n_default}"
        ),
    )

linear_perturbation_binding_modeling

linear_perturbation_binding_modeling(args)

Parameters:

  • args: Command-line arguments containing input file paths and parameters.

Source code in tfbpmodeling/interface.py
def linear_perturbation_binding_modeling(args):
    """
    :param args: Command-line arguments containing input file paths and parameters.
    """
    if not isinstance(args.max_iter, int) or args.max_iter < 1:
        raise ValueError("The `max_iter` parameter must be a positive integer.")

    max_iter = int(args.max_iter)

    logger.info(f"estimator max_iter: {max_iter}.")

    logger.info("Step 1: Preprocessing")

    # validate input files/dirs
    if not os.path.exists(args.response_file):
        raise FileNotFoundError(f"File {args.response_file} does not exist.")
    if not os.path.exists(args.predictors_file):
        raise FileNotFoundError(f"File {args.predictors_file} does not exist.")
    if os.path.exists(args.output_dir):
        logger.warning(f"Output directory {args.output_dir} already exists.")
    else:
        os.makedirs(args.output_dir, exist_ok=True)
        logger.info(f"Output directory created at {args.output_dir}")

    # the output subdir is where the output of this modeling run will be saved
    output_subdir = os.path.join(
        args.output_dir, os.path.join(args.perturbed_tf + args.output_suffix)
    )
    if os.path.exists(output_subdir):
        raise FileExistsError(
            f"Directory {output_subdir} already exists. "
            "Please specify a different `output_dir`."
        )
    else:
        os.makedirs(output_subdir, exist_ok=True)
        logger.info(f"Output subdirectory created at {output_subdir}")

    # instantiate an estimator
    estimator = LassoCV(
        fit_intercept=True,
        selection="random",
        n_alphas=100,
        random_state=42,
        n_jobs=args.n_cpus,
        max_iter=max_iter,
    )

    input_data = ModelingInputData.from_files(
        response_path=args.response_file,
        predictors_path=args.predictors_file,
        perturbed_tf=args.perturbed_tf,
        feature_blacklist_path=args.blacklist_file,
        top_n=args.top_n,
    )

    logger.info("Step 2: Bootstrap LassoCV on all data, full interactor model")

    # Unset the top n masking -- we want to use all the data for the first round
    # modeling
    input_data.top_n_masked = False

    # extract a list of predictor variables, which are the columns of the predictors_df
    predictor_variables = input_data.predictors_df.columns.drop(input_data.perturbed_tf)

    # drop any variables which are in args.exclude_interactor_variables
    predictor_variables = exclude_predictor_variables(
        list(predictor_variables), args.exclude_interactor_variables
    )

    # create a list of interactor terms with the perturbed_tf as the first term
    interaction_terms = [
        f"{input_data.perturbed_tf}:{var}" for var in predictor_variables
    ]

    # Construct the full interaction formula, ie perturbed_tf + perturbed_tf:other_tf1 +
    # perturbed_tf:other_tf2 + ... . perturbed_tf main effect only added if
    # --ptf_main_effect is passed.
    if args.ptf_main_effect:
        logger.info("adding pTF main effect to `all_data_formula`")
        all_data_formula = (
            f"{input_data.perturbed_tf} + {' + '.join(interaction_terms)}"
        )
    else:
        all_data_formula = " + ".join(interaction_terms)

    if args.squared_pTF:
        # if --squared_pTF is passed, then add the squared perturbed TF to the formula
        squared_term = f"I({input_data.perturbed_tf} ** 2)"
        logger.info(f"Adding squared term to model formula: {squared_term}")
        all_data_formula += f" + {squared_term}"

    if args.cubic_pTF:
        # if --cubic_pTF is passed, then add the cubic perturbed TF to the formula
        cubic_term = f"I({input_data.perturbed_tf} ** 3)"
        logger.info(f"Add cubic term to model formula: {cubic_term}")
        all_data_formula += f" + {cubic_term}"

    # if --row_max is passed, then add "row_max" to the formula
    if args.row_max:
        logger.info("Adding `row_max` to the all data model formula")
        all_data_formula += " + row_max"

    # if --add_model_variables is passed, then add the variables to the formula
    if args.add_model_variables:
        logger.info(
            f"Adding model variables to the all data model "
            f"formula: {args.add_model_variables}"
        )
        all_data_formula += " + " + " + ".join(args.add_model_variables)

    logger.debug(f"All data formula: {all_data_formula}")

    # create the bootstrapped data.
    bootstrapped_data_all = BootstrappedModelingInputData(
        response_df=input_data.response_df,
        model_df=input_data.get_modeling_data(
            all_data_formula,
            add_row_max=args.row_max,
            drop_intercept=True,
            scale_by_std=args.scale_by_std,
        ),
        n_bootstraps=args.n_bootstraps,
        normalize_sample_weights=args.normalize_sample_weights,
        random_state=args.random_state,
    )

    logger.info(
        f"Running bootstrap LassoCV on all data with {args.n_bootstraps} bootstraps"
    )
    if args.iterative_dropout:
        logger.info("Using iterative dropout modeling for all data results.")
        all_data_results = bootstrap_stratified_cv_loop(
            bootstrapped_data=bootstrapped_data_all,
            perturbed_tf_series=input_data.predictors_df[input_data.perturbed_tf],
            estimator=estimator,
            ci_percentile=float(args.all_data_ci_level),
            stabilization_ci_start=args.stabilization_ci_start,
            bins=args.bins,
            output_dir=output_subdir,
        )
    else:
        logger.info("Using standard bootstrap modeling for all data results.")
        all_data_results = bootstrap_stratified_cv_modeling(
            bootstrapped_data=bootstrapped_data_all,
            perturbed_tf_series=input_data.predictors_df[input_data.perturbed_tf],
            estimator=estimator,
            ci_percentiles=[float(args.all_data_ci_level)],
            bins=args.bins,
        )
    # create the all data object output subdir
    all_data_output = os.path.join(output_subdir, "all_data_result_object")
    os.makedirs(all_data_output, exist_ok=True)

    logger.info(f"Serializing all data results to {all_data_output}")
    all_data_results.serialize("result_obj", all_data_output)

    # Extract the coefficients that are significant at the specified confidence level
    all_data_sig_coefs = all_data_results.extract_significant_coefficients(
        ci_level=args.all_data_ci_level,
    )

    logger.info(f"all_data_sig_coefs: {all_data_sig_coefs}")

    if not all_data_sig_coefs:
        logger.warning(
            f"No significant coefficients found at {args.all_data_ci_level}% "
            "confidence level. Exiting."
        )
        return

    # write all_data_sig_coefs to a json file
    all_data_ci_str = str(args.all_data_ci_level).replace(".", "-")
    all_data_output_file = os.path.join(
        output_subdir, f"all_data_significant_{all_data_ci_str}.json"
    )
    logger.info(f"Writing the all data significant results to {all_data_output_file}")
    with open(
        all_data_output_file,
        "w",
    ) as f:
        json.dump(all_data_sig_coefs, f, indent=4)

    # extract the significant coefficients and create a formula.
    all_data_sig_coefs_formula = f"{' + '.join(all_data_sig_coefs.keys())}"
    logger.debug(f"`all_data_sig_coefs_formula` formula: {all_data_sig_coefs_formula}")

    logger.info(
        "Step 3: Bootstrap LassoCV on the significant coefficients "
        "from the all data model. This produces the best model for all data"
    )

    skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
    classes = stratification_classification(
        input_data.predictors_df[input_data.perturbed_tf].squeeze(),
        bins=args.bins,
    )

    best_all_data_model_df = input_data.get_modeling_data(
        all_data_sig_coefs_formula,
        add_row_max=args.row_max,
        drop_intercept=True,
        scale_by_std=args.scale_by_std,
    )
    best_all_data_model = stratified_cv_modeling(
        input_data.response_df,
        best_all_data_model_df,
        classes=classes,
        estimator=estimator,
        skf=skf,
        sample_weight=None,
    )

    # save the best all data model to file with metadata
    best_model_file = os.path.join(output_subdir, "best_all_data_model.pkl")
    logger.info(f"Saving the best all data model to {best_model_file}")

    # Bundle model with metadata so feature names are preserved
    model_bundle = {
        "model": best_all_data_model,
        "feature_names": list(best_all_data_model_df.columns),
        "formula": all_data_sig_coefs_formula,
        "perturbed_tf": input_data.perturbed_tf,
        "scale_by_std": args.scale_by_std,
        "drop_intercept": True,
    }
    joblib.dump(model_bundle, best_model_file)

    logger.info(
        "Step 4: Running LassoCV on topn data with significant coefficients "
        "from the all data model"
    )

    # apply the top_n masking
    input_data.top_n_masked = True

    # Create the bootstrapped data for the topn modeling
    bootstrapped_data_top_n = BootstrappedModelingInputData(
        response_df=input_data.response_df,
        model_df=input_data.get_modeling_data(
            all_data_sig_coefs_formula,
            add_row_max=args.row_max,
            drop_intercept=True,
            scale_by_std=args.scale_by_std,
        ),
        n_bootstraps=args.n_bootstraps,
        normalize_sample_weights=args.normalize_sample_weights,
        random_state=(
            args.random_state + 10 if args.random_state else args.random_state
        ),
    )

    logger.debug(
        f"Running bootstrap LassoCV on topn data with {args.n_bootstraps} bootstraps"
    )
    topn_results = bootstrap_stratified_cv_modeling(
        bootstrapped_data_top_n,
        input_data.predictors_df[input_data.perturbed_tf],
        estimator=estimator,
        ci_percentiles=[float(args.topn_ci_level)],
    )

    # create the topn data object output subdir
    topn_output = os.path.join(output_subdir, "topn_result_object")
    os.makedirs(topn_output, exist_ok=True)

    logger.info(f"Serializing topn results to {topn_output}")
    topn_results.serialize("result_obj", topn_output)

    # extract the topn_results at the specified confidence level
    topn_output_res = topn_results.extract_significant_coefficients(
        ci_level=args.topn_ci_level
    )

    logger.info(f"topn_output_res: {topn_output_res}")

    if not topn_output_res:
        logger.warning(
            f"No significant coefficients found at {args.topn_ci_level}% "
            "confidence level. Exiting."
        )
        return

    # write topn_output_res to a json file
    topn_ci_str = str(args.topn_ci_level).replace(".", "-")
    topn_output_file = os.path.join(
        output_subdir, f"topn_significant_{topn_ci_str}.json"
    )
    logger.info(f"Writing the topn significant results to {topn_output_file}")
    with open(topn_output_file, "w") as f:
        json.dump(topn_output_res, f, indent=4)

    logger.info(
        "Step 5: Test the significance of the interactor terms that survive "
        "against the corresponding main effect"
    )

    if args.stage4_topn:
        logger.info("Stage 4 will use top-n masked input data.")
        input_data.top_n_masked = True
    else:
        logger.info("Stage 4 will use full input data.")

    # calculate the stratification classes for the perturbed TF (all data)
    stage4_classes = stratification_classification(
        input_data.predictors_df[input_data.perturbed_tf].squeeze(),
        bins=args.bins,
    )

    # Test the significance of the interactor terms
    evaluate_interactor_significance = (
        evaluate_interactor_significance_lassocv
        if args.stage4_lasso
        else evaluate_interactor_significance_linear
    )

    results = evaluate_interactor_significance(
        input_data,
        stratification_classes=stage4_classes,
        model_variables=list(
            topn_results.extract_significant_coefficients(
                ci_level=args.topn_ci_level
            ).keys()
        ),
        estimator=estimator,
    )

    output_significance_file = os.path.join(
        output_subdir, "interactor_vs_main_result.json"
    )
    logger.info(
        "Writing the final interactor significance "
        f"results to {output_significance_file}"
    )
    results.serialize(output_significance_file)

Overview

The interface module serves as the primary entry point for the tfbpmodeling workflow. It contains:

  • Main workflow function: linear_perturbation_binding_modeling()
  • CLI helper functions: Argument parsing utilities for the command-line interface
  • Custom formatters: Enhanced help formatting for better user experience

Main Functions

linear_perturbation_binding_modeling

The core function that executes the complete 4-stage TFBP modeling workflow:

  1. Data Preprocessing: Load and validate input files, handle missing data, and create the output directories
  2. Bootstrap Modeling: All-data analysis with bootstrap resampling and LassoCV, followed by a refit on the significant coefficients to produce the best all-data model
  3. Top-N Modeling: Bootstrap analysis of the surviving coefficients restricted to the top-N features
  4. Interactor Significance: Statistical evaluation of the surviving interaction terms against their corresponding main effects

Parameters: Command-line arguments object containing all configuration options

Returns: None (results saved to output directory)

Key Features:

  • Comprehensive input validation
  • Automatic output directory creation (a per-TF subdirectory named from the perturbed TF plus the optional output suffix)
  • Detailed logging of all processing steps
  • Error handling with informative messages

CLI Helper Functions

common_modeling_input_arguments

Adds standard input arguments to argument parsers:

  • File paths for response and predictor data
  • Perturbed TF specification
  • Bootstrap and sampling parameters
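
A minimal sketch of this helper in use, based on the source shown above: only the three file/TF arguments are required, and the remaining options fall back to their defaults (the file paths here are illustrative).

import argparse
from tfbpmodeling.interface import common_modeling_input_arguments

parser = argparse.ArgumentParser(description="Minimal example")
input_group = parser.add_argument_group("Input")
common_modeling_input_arguments(input_group)

# Only the required arguments are supplied; everything else uses its default
args = parser.parse_args([
    "--response_file", "data/expression.csv",
    "--predictors_file", "data/binding.csv",
    "--perturbed_tf", "YPD1",
])

print(args.top_n)         # 600 (default)
print(args.n_bootstraps)  # 1000 (default)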

common_modeling_feature_options

Configures feature engineering options:

  • Polynomial terms (squared, cubic)
  • Row maximum inclusion
  • Custom variable additions and exclusions
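
These options are consumed by linear_perturbation_binding_modeling() when it assembles the all-data model formula. The snippet below is a simplified re-creation of that assembly (compare with the source above), not a call into the library; the TF and variable names are hypothetical.

# Simplified sketch of the formula assembly in the main workflow
perturbed_tf = "YPD1"
predictor_variables = ["TF1", "TF2", "TF3"]  # hypothetical other TFs

# Interaction terms: the perturbed TF crossed with every remaining predictor
interaction_terms = [f"{perturbed_tf}:{var}" for var in predictor_variables]
formula = " + ".join(interaction_terms)

# --squared_pTF / --cubic_pTF add polynomial terms for the perturbed TF
formula += f" + I({perturbed_tf} ** 2)"
formula += f" + I({perturbed_tf} ** 3)"

# --row_max appends a row-maximum predictor; --add_model_variables appends
# any user-supplied column names verbatim
formula += " + row_max"
formula += " + " + " + ".join(["custom_var"])  # hypothetical extra variable

print(formula)
# YPD1:TF1 + YPD1:TF2 + YPD1:TF3 + I(YPD1 ** 2) + I(YPD1 ** 3) + row_max + custom_var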

common_modeling_binning_arguments

Sets up data stratification parameters:

  • Bin edge specifications
  • Stratification methods
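
The bin specification is passed (as args.bins) to stratification_classification(), which stratifies genes by the perturbed TF's binding signal for stratified cross-validation and bootstrap sampling. The snippet below is only a minimal illustration of that kind of binning with pandas, not the library's stratification_classification implementation; it assumes the bin values are edges on the same scale as the binding scores.

import numpy as np
import pandas as pd

# Hypothetical perturbed-TF binding scores indexed by gene
scores = pd.Series(np.random.default_rng(0).random(1000), name="YPD1")

# Example bin edges of the kind --bins might specify (assumed units)
bin_edges = [0.0, 0.25, 0.5, 0.75, 1.0]

# Assign each gene to a stratum; labels like these can then drive
# sklearn.model_selection.StratifiedKFold
strata = pd.cut(scores, bins=bin_edges, labels=False, include_lowest=True)
print(strata.value_counts().sort_index())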

add_general_arguments_to_subparsers

Propagates global arguments to subcommand parsers:

  • Logging configuration
  • System-wide options
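
A minimal sketch of the propagation pattern; the shared flag (--log-level) and the way it is attached here are assumptions, not the library's actual argument set.

import argparse

parser = argparse.ArgumentParser(prog="tfbpmodeling")
subparsers = parser.add_subparsers(dest="command")
subparsers.add_parser("linear_perturbation_binding_modeling")

# Hypothetical shared arguments copied onto every subcommand parser so they
# can be supplied after the subcommand name
shared_arguments = {
    "--log-level": dict(default="INFO", help="Logging verbosity (assumed flag)"),
}
for subparser in subparsers.choices.values():
    for flag, options in shared_arguments.items():
        subparser.add_argument(flag, **options)

args = parser.parse_args(
    ["linear_perturbation_binding_modeling", "--log-level", "DEBUG"]
)
print(args.log_level)  # DEBUG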

Data Flow

graph TD
    A[CLI Arguments] --> B[Input Validation]
    B --> C[Data Loading]
    C --> D[ModelingInputData]
    D --> E[BootstrappedModelingInputData]
    E --> F[Bootstrap CV Loop]
    F --> G[Top-N Selection]
    G --> H[Interactor Significance]
    H --> I[Results Output]

Usage Examples

Programmatic Usage

import argparse
from tfbpmodeling.interface import linear_perturbation_binding_modeling

# Create arguments object
args = argparse.Namespace(
    response_file='data/expression.csv',
    predictors_file='data/binding.csv',
    perturbed_tf='YPD1',
    n_bootstraps=1000,
    top_n=600,
    all_data_ci_level=98.0,
    topn_ci_level=90.0,
    max_iter=10000,
    output_dir='./results',
    output_suffix='',
    n_cpus=4,
    # ... other parameters
)

# Run analysis
linear_perturbation_binding_modeling(args)

Custom Argument Parser

import argparse
from tfbpmodeling.interface import (
    common_modeling_input_arguments,
    common_modeling_feature_options,
    CustomHelpFormatter
)

# Create custom parser
parser = argparse.ArgumentParser(
    formatter_class=CustomHelpFormatter,
    description="Custom TFBP Analysis"
)

# Add standard arguments
input_group = parser.add_argument_group("Input")
common_modeling_input_arguments(input_group)

feature_group = parser.add_argument_group("Features")
common_modeling_feature_options(feature_group)

# Parse and use
args = parser.parse_args()
linear_perturbation_binding_modeling(args)

Error Handling

The interface module includes comprehensive error handling:

Input Validation Errors

# File existence checks
FileNotFoundError: "File data/missing.csv does not exist."

# Parameter validation
ValueError: "The `max_iter` parameter must be a positive integer."

# Data format validation
ValueError: "Perturbed TF 'INVALID' not found in response file columns"

Runtime Errors

# Convergence issues
RuntimeWarning: "LassoCV failed to converge for 15/1000 bootstrap samples"

# Insufficient data
ValueError: "Insufficient data after filtering. Found 5 samples, minimum required: 10"

Configuration Options

The interface supports extensive configuration through command-line arguments:

Core Parameters

  • Input files: Response data, predictor data, optional blacklist
  • TF specification: Name of perturbed transcription factor
  • Bootstrap settings: Sample count, random seed, weight normalization

Feature Engineering

  • Polynomial terms: Squared and cubic pTF terms
  • Additional predictors: Row max, custom variables
  • Interaction control: Variable exclusions, main effects

Model Configuration

  • Confidence intervals: Separate thresholds for each stage
  • Convergence: Maximum iterations, dropout options
  • Performance: CPU cores, memory management
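
For reference, the estimator configuration below is taken from the source of linear_perturbation_binding_modeling() shown above; n_jobs and max_iter are filled in at runtime from --n_cpus and --max_iter (example values shown).

from sklearn.linear_model import LassoCV

estimator = LassoCV(
    fit_intercept=True,
    selection="random",   # randomized coordinate descent order
    n_alphas=100,         # length of the regularization path
    random_state=42,
    n_jobs=4,             # from --n_cpus
    max_iter=10000,       # from --max_iter
)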

Output Control

  • Directory structure: Base directory, custom suffixes
  • Logging: Verbosity levels, file vs console output
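
Based on the workflow source above, a run with --output_dir ./results, --perturbed_tf YPD1, an empty --output_suffix, and the confidence levels used in the programmatic example (98.0 and 90.0) writes approximately the following layout; the JSON file names encode the chosen CI levels.

results/
└── YPD1/
    ├── all_data_result_object/         # serialized all-data bootstrap results
    ├── all_data_significant_98-0.json  # significant all-data coefficients
    ├── best_all_data_model.pkl         # refit model bundled with metadata
    ├── topn_result_object/             # serialized top-N bootstrap results
    ├── topn_significant_90-0.json      # significant top-N coefficients
    └── interactor_vs_main_result.json  # interactor vs. main-effect results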

Performance Considerations

Memory Management

  • Bootstrap samples stored efficiently using sparse representations
  • Automatic garbage collection between stages
  • Memory usage monitoring and warnings

Parallel Processing

  • LassoCV uses specified CPU cores for cross-validation
  • Bootstrap samples processed in batches
  • I/O operations optimized for large datasets

Runtime Optimization

  • Early stopping for non-convergent models
  • Adaptive batch sizing based on available memory
  • Progress reporting for long-running analyses