modeling_input_data

Core data structures for handling input data preprocessing and validation in tfbpmodeling.

tfbpmodeling.modeling_input_data

ModelingInputData

ModelingInputData(
    response_df,
    predictors_df,
    perturbed_tf,
    feature_col="target_symbol",
    feature_blacklist=None,
    top_n=None,
)

Container for response and predictor data used in modeling transcription factor perturbation experiments.

This class handles:

  • Validation and synchronization of response and predictor DataFrames based on a shared feature identifier.
  • Optional blacklisting of features, including the perturbed transcription factor.
  • Optional feature selection based on the top N strongest binding signals (as ranked from a specific TF column in the predictor matrix).
  • Application of masking logic to restrict modeling to selected features.

Initialize ModelingInputData with response and predictor matrices. Note that the response and predictor dataframes will be subset down to the features in common between them, by index. The rows in both dataframes will also be ordered such that they match, again by index.

Parameters:
  • response_df (DataFrame) –

    A two-column DataFrame containing the feature_col and a numeric column representing the response variable.

  • predictors_df (DataFrame) –

    A DataFrame containing the feature_col and numeric columns representing the predictor variables.

  • perturbed_tf (str) –

    Name of the perturbed TF. Note: this must exist as a column in predictors_df.

  • feature_col (str, default: 'target_symbol' ) –

    Name of the column to use as the feature index. This column must exist in both the response and predictor DataFrames.

  • feature_blacklist (list[str] | None, default: None ) –

    List of feature names to exclude from analysis.

  • top_n (int | None, default: None ) –

    If specified, retain only the top N features with the strongest binding scores for the perturbed TF. If this is passed on initialization, top_n_masked is set to True by default. To extract unmasked data, set object.top_n_masked = False; the mask can be toggled on and off at will.

Source code in tfbpmodeling/modeling_input_data.py
def __init__(
    self,
    response_df: pd.DataFrame,
    predictors_df: pd.DataFrame,
    perturbed_tf: str,
    feature_col: str = "target_symbol",
    feature_blacklist: list[str] | None = None,
    top_n: int | None = None,
):
    """
    Initialize ModelingInputData with response and predictor matrices. Note that the
    response and predictor dataframes will be subset down to the features in common
    between them, by index. The rows in both dataframes will also be ordered such
    that they match, again by index.

    :param response_df: A two column DataFrame containing the `feature_col` and
        numeric column representing the response variable.
    :param predictors_df: A Dataframe containing the `feature_col` and predictor
        numeric columns that represent the predictor variables.
    :param perturbed_tf: Name of the perturbed TF. **Note**: this must exist as a
        column in predictors_df.
    :param feature_col: Name of the column to use as the feature index. This column
        must exist in both the response and predictor DataFrames.
        (default: "target_symbol").
    :param feature_blacklist: List of feature names to exclude from analysis.
    :param top_n: If specified, retain only the top N features with the strongest
        binding scores for the perturbed TF. If this is passed on initialization,
        then the top_n_masked is set to True by default. If you wish to extract
        unmasked data, you can set `object.top_n_masked = False`. The mask can be
        toggled on and off at will.

    """
    if not isinstance(response_df, pd.DataFrame):
        raise ValueError("response_df must be a DataFrame.")
    if not isinstance(predictors_df, pd.DataFrame):
        raise ValueError("predictors_df must be a DataFrame.")
    if not isinstance(perturbed_tf, str):
        raise ValueError("perturbed_tf must be a string representing the TF name.")
    if not isinstance(feature_col, str):
        raise ValueError(
            "feature_col must be a string representing the feature name."
        )
    if feature_blacklist is not None and not isinstance(feature_blacklist, list):
        raise ValueError("feature_blacklist must be a list or None.")
    if top_n is not None and not isinstance(top_n, int):
        raise ValueError("top_n must be an integer or None.")

    self.perturbed_tf = perturbed_tf
    self.feature_col = feature_col
    self._top_n_masked = False

    # Ensure feature_blacklist is a list
    if feature_blacklist is None:
        feature_blacklist = []

    # Ensure perturbed_tf is in the blacklist
    if perturbed_tf not in feature_blacklist:
        logger.warning(
            f"Perturbed TF '{perturbed_tf}' not in blacklist. "
            f"Adding to blacklist. Setting blacklist_masked to True. "
            f"If you do not wish to blacklist the perturbed TF, "
            f"set blacklist_masked to False."
        )
        feature_blacklist.append(perturbed_tf)

    self.feature_blacklist = set(feature_blacklist)
    self.blacklist_masked = bool(self.feature_blacklist)

    # Ensure the response and predictors only contain common features
    self.response_df = response_df
    self.predictors_df = predictors_df

    # Assign top_n value
    self.top_n = top_n

predictors_df property writable

predictors_df

Get the predictors DataFrame with feature masks applied.

The returned DataFrame reflects any active blacklist or top-N filtering.

response_df property writable

response_df

Get the response DataFrame with feature masks applied.

Returns a version of the response matrix filtered by:

  • Feature blacklist (if blacklist_masked is True)
  • Top-N feature selection (if top_n_masked is True)

The final DataFrame will be aligned in index order with the predictors matrix.

Returns:
  • DataFrame

    Filtered and ordered response DataFrame.

top_n property writable

top_n

Get the threshold for top-ranked feature selection.

If set to an integer, this defines how many of the highest-ranked features (based on predictors_df[self.perturbed_tf]) should be retained. Ranking is descending (higher values rank higher). If the cutoff falls on a tie, fewer than N features may be selected to preserve a consistent threshold. Ties matter most when the majority of lower-ranked features share the same value, e.g. an enrichment of 0 or a p-value of 1.0.

If set to None, top-N feature selection is disabled.

Note: Whether top-N filtering is actively applied depends on the top_n_masked attribute. You can set top_n_masked = False to access the unfiltered data, even if top_n is set.

Returns:
  • int | None

    The current top-N threshold or None.
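To illustrate the tie behavior, here is a toy sketch of the ranking rule as described above (not necessarily the library's internal implementation):

import pandas as pd

scores = pd.Series(
    [0.9, 0.7, 0.5, 0.0, 0.0, 0.0], index=["g1", "g2", "g3", "g4", "g5", "g6"]
)

top_n = 5
# Rank descending; method="max" gives tied scores their worst shared rank,
# so a tie straddling the cutoff is dropped entirely rather than broken
# arbitrarily.
ranks = scores.rank(method="max", ascending=False)
selected = scores[ranks <= top_n]
print(list(selected.index))  # ['g1', 'g2', 'g3'] -- only 3 of the requested 5
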

top_n_masked property writable

top_n_masked

Get the status of top-n feature masking.

If True, top-N feature selection is applied to both the predictors and the response.

from_files classmethod

from_files(
    response_path,
    predictors_path,
    perturbed_tf,
    feature_col="target_symbol",
    feature_blacklist_path=None,
    top_n=600,
)

Load response and predictor data from files. This would be considered an overloaded constructor in other languages. The input files must be readable into objects that satisfy the __init__ method -- see the __init__ docs.

Parameters:
  • response_path (str) –

    Path to the response file (CSV).

  • predictors_path (str) –

    Path to the predictors file (CSV).

  • perturbed_tf (str) –

    The perturbed TF.

  • feature_col (str, default: 'target_symbol' ) –

    The column name representing features.

  • feature_blacklist_path (str | None, default: None ) –

    Path to a file containing a list of features to exclude.

  • top_n (int, default: 600 ) –

    Maximum number of features for top-n selection.

Returns:
  • ModelingInputData –

    An instance of ModelingInputData.

Raises:
  • FileNotFoundError

    If the response or predictor files are missing.

Source code in tfbpmodeling/modeling_input_data.py
@classmethod
def from_files(
    cls,
    response_path: str,
    predictors_path: str,
    perturbed_tf: str,
    feature_col: str = "target_symbol",
    feature_blacklist_path: str | None = None,
    top_n: int = 600,
) -> "ModelingInputData":
    """
    Load response and predictor data from files. This would be considered an
    overloaded constructor in other languages. The input files must be able to be
    read into objects that satisfy the __init__ method -- see __init__ docs.

    :param response_path: Path to the response file (CSV).
    :param predictors_path: Path to the predictors file (CSV).
    :param perturbed_tf: The perturbed TF.
    :param feature_col: The column name representing features.
    :param feature_blacklist_path: Path to a file containing a list of features to
        exclude.
    :param top_n: Maximum number of features for top-n selection.
    :return: An instance of ModelingInputData.
    :raises FileNotFoundError: If the response or predictor files are missing.

    """
    if not os.path.exists(response_path):
        raise FileNotFoundError(f"Response file '{response_path}' does not exist.")
    if not os.path.exists(predictors_path):
        raise FileNotFoundError(
            f"Predictors file '{predictors_path}' does not exist."
        )

    response_df = pd.read_csv(response_path)
    predictors_df = pd.read_csv(predictors_path)

    # Load feature blacklist if provided
    feature_blacklist: list[str] = []
    if feature_blacklist_path:
        if not os.path.exists(feature_blacklist_path):
            raise FileNotFoundError(
                f"Feature blacklist file '{feature_blacklist_path}' does not exist."
            )
        with open(feature_blacklist_path) as f:
            feature_blacklist = [line.strip() for line in f if line.strip()]

    return cls(
        response_df,
        predictors_df,
        perturbed_tf,
        feature_col,
        feature_blacklist,
        top_n,
    )

get_modeling_data

get_modeling_data(
    formula,
    add_row_max=False,
    drop_intercept=False,
    scale_by_std=False,
)

Get the predictors for modeling, optionally adding a row-wise max feature.

Parameters:
  • formula (str) –

    The formula to use for modeling.

  • add_row_max (bool, default: False ) –

    Whether to add a row-wise max feature to the predictors.

  • drop_intercept (bool, default: False ) –

    If drop_intercept is True, "-1" will be appended to the formula string. This will drop the intercept (constant) term from the model matrix output by patsy.dmatrix. Default is False.

  • scale_by_std (bool, default: False ) –

    If True, scale the design matrix by standard deviation using StandardScaler(with_mean=False, with_std=True). The data is NOT centered, so the estimator should still fit an intercept (fit_intercept=True).

Returns:
  • DataFrame

    The design matrix for modeling. self.response_df can be used for the response variable.

Raises:
  • ValueError

    If the formula is not provided

  • PatsyError

    If there is an error in creating the model matrix

Source code in tfbpmodeling/modeling_input_data.py
def get_modeling_data(
    self,
    formula: str,
    add_row_max: bool = False,
    drop_intercept: bool = False,
    scale_by_std: bool = False,
) -> pd.DataFrame:
    """
    Get the predictors for modeling, optionally adding a row-wise max feature.

    :param formula: The formula to use for modeling.
    :param add_row_max: Whether to add a row-wise max feature to the predictors.
    :param drop_intercept: If `drop_intercept` is True, "-1" will be appended to
        the formula string. This will drop the intercept (constant) term from
        the model matrix output by patsy.dmatrix. Default is `False`.
    :param scale_by_std: If True, scale the design matrix by standard deviation
        using StandardScaler(with_mean=False, with_std=True). The data is NOT
        centered, so the estimator should still fit an intercept
        (fit_intercept=True).
    :return: The design matrix for modeling. self.response_df can be used for the
        response variable.

    :raises ValueError: If the formula is not provided
    :raises PatsyError: If there is an error in creating the model matrix

    """
    if not formula:
        raise ValueError("Formula must be provided for modeling.")

    if drop_intercept:
        logger.info("Dropping intercept from the patsy model matrix")
        formula += " - 1"

    predictors_df = self.predictors_df  # Apply top-n feature mask

    # Add row-wise max feature if requested
    if add_row_max:
        predictors_df["row_max"] = predictors_df.max(axis=1)

    # Create a design matrix using patsy
    try:
        design_matrix = dmatrix(
            formula,
            data=predictors_df,
            return_type="dataframe",
            NA_action="raise",
        )
    except PatsyError as exc:
        logger.error(
            f"Error in creating model matrix with formula '{formula}': {exc}"
        )
        raise

    if scale_by_std:
        logger.info("Center matrix = `False`. Scale matrix = `True`")
        scaler = StandardScaler(with_mean=False, with_std=True)
        scaled_values = scaler.fit_transform(design_matrix)
        design_matrix = pd.DataFrame(
            scaled_values, index=design_matrix.index, columns=design_matrix.columns
        )

    logger.info(f"Design matrix columns: {list(design_matrix.columns)}")

    return design_matrix

Overview

The modeling_input_data module provides the fundamental ModelingInputData class that handles:

  • Data loading: Reading CSV files for response and predictor data
  • Validation: Ensuring data consistency and format compliance
  • Preprocessing: Feature blacklisting, top-N selection, and optional scaling
  • Integration: Aligning response and predictor data for modeling

This class serves as the foundation for all downstream modeling operations.

Core Classes

ModelingInputData

The primary class for managing input data throughout the modeling workflow.

Key Features

  • Automatic data validation: Checks input types, required columns, and data consistency
  • Feature filtering: Removes blacklisted genes and applies optional top-N selection
  • Index alignment: Ensures consistent gene identifiers between response and predictor data
  • Design matrix construction: Builds patsy design matrices ready for modeling

Initialization

from tfbpmodeling.modeling_input_data import ModelingInputData

# Basic initialization from files
data = ModelingInputData.from_files(
    response_path='expression.csv',
    predictors_path='binding.csv',
    perturbed_tf='YPD1'
)

# With optional parameters
data = ModelingInputData.from_files(
    response_path='expression.csv',
    predictors_path='binding.csv',
    perturbed_tf='YPD1',
    feature_col='target_symbol',
    feature_blacklist_path='exclude_genes.txt',
    top_n=600
)

Key Methods and Properties

Construction
  • from_files(): Load response and predictor data from CSV files
Data Access
  • response_df: Masked, index-aligned response DataFrame (property)
  • predictors_df: Masked, index-aligned predictors DataFrame (property)
Feature Masking
  • top_n / top_n_masked: Configure and toggle top-N feature selection
  • blacklist_masked: Toggle blacklist filtering
Modeling
  • get_modeling_data(): Build a patsy design matrix from a formula

Data Format Requirements

Response File Format

The response file must be a CSV with two columns:

target_symbol,response
YPD1,0.23
YBR123W,1.34
YCR456X,-0.45

Requirements:
  • First column: Gene identifiers matching feature_col (default "target_symbol"); features must overlap with the predictor file
  • Second column: Numeric values representing the response variable

Predictor File Format

The predictor file structure:

target_symbol,TF1,TF2,TF3,TF4
YPD1,0.34,0.12,0.78,0.01
YBR123W,0.89,0.45,0.23,0.67
YCR456X,0.12,0.78,0.34,0.90

Requirements:
  • First column: Gene identifiers matching feature_col; features must overlap with the response file
  • Subsequent columns: Numeric binding values
  • Column names: Transcription factor identifiers; one column must match the perturbed_tf parameter
  • All values must be numeric (no missing values in binding data)

Blacklist File Format

Optional exclusion file:

YBR999W
YCR888X
control_gene
technical_artifact

Requirements:
  • Plain text file
  • One gene identifier per line
  • Gene IDs must match those in the data files
  • Comments are not supported
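A minimal sketch for generating such a file from Python (the gene names are placeholders):

# One identifier per line, no comments
exclude = ["YBR999W", "YCR888X", "control_gene", "technical_artifact"]
with open("exclude_genes.txt", "w") as f:
    f.write("\n".join(exclude) + "\n")
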

Usage Examples

Basic Data Loading

from tfbpmodeling.modeling_input_data import ModelingInputData

# Load data with minimal configuration
data = ModelingInputData.from_files(
    response_path='data/expression.csv',
    predictors_path='data/binding.csv',
    perturbed_tf='YPD1'
)

# Access processed data
response_data = data.response_df
predictor_data = data.predictors_df

print(f"Loaded {response_data.shape[0]} features")
print(f"Response data shape: {response_data.shape}")
print(f"Predictor data shape: {predictor_data.shape}")

Data Preprocessing Options

# Preprocessing via blacklisting and top-N selection
data = ModelingInputData.from_files(
    response_path='data/expression.csv',
    predictors_path='data/binding.csv',
    perturbed_tf='YPD1',
    feature_blacklist_path='data/exclude_genes.txt',
    top_n=600
)

# Check data quality by toggling the feature masks
print(f"Filtered features: {data.predictors_df.shape[0]}")
data.top_n_masked = False
data.blacklist_masked = False
print(f"All shared features: {data.predictors_df.shape[0]}")

Integration with Modeling Pipeline

# Prepare data for bootstrap modeling
from tfbpmodeling.bootstrapped_input_data import BootstrappedModelingInputData

# Base data
base_data = ModelingInputData.from_files(
    response_path='expression.csv',
    predictors_path='binding.csv',
    perturbed_tf='YPD1'
)

# Create bootstrap version
bootstrap_data = BootstrappedModelingInputData(
    base_data=base_data,
    n_bootstraps=1000,
    random_state=42
)

Data Validation

Automatic Checks

The class performs validation equivalent to the following checks (implemented with explicit exceptions such as ValueError and FileNotFoundError rather than assertions):

# File existence and readability (from_files)
assert os.path.exists(response_path), f"Response file not found: {response_path}"
assert os.path.exists(predictors_path), f"Predictor file not found: {predictors_path}"

# Input type validation
assert isinstance(response_df, pd.DataFrame)
assert isinstance(predictors_df, pd.DataFrame)

# Required columns
assert perturbed_tf in predictors_df.columns, f"Perturbed TF '{perturbed_tf}' not found"
assert feature_col in response_df.columns and feature_col in predictors_df.columns

Custom Validation

# Define custom validation rules and apply them before constructing the object
def validate_expression_range(df):
    """Ensure expression values are in a reasonable range."""
    values = df.select_dtypes('number')
    assert values.abs().max().max() < 10, "Expression values seem too large"

def validate_binding_range(df):
    """Ensure binding values are probabilities."""
    values = df.select_dtypes('number')
    assert (values >= 0).all().all(), "Binding values must be non-negative"
    assert (values <= 1).all().all(), "Binding values must be <= 1"

# Apply custom validation, then construct
response_df = pd.read_csv('expression.csv')
predictors_df = pd.read_csv('binding.csv')
validate_expression_range(response_df)
validate_binding_range(predictors_df)

data = ModelingInputData(response_df, predictors_df, perturbed_tf='YPD1')

Error Handling

Common Errors and Solutions

File Format Errors

# CSV parsing errors
try:
    data = ModelingInputData.from_files(response_path='malformed.csv', ...)
except pd.errors.ParserError as e:
    print(f"CSV format error: {e}")
    # Solution: Check file encoding, delimiters, quotes

Data Consistency Errors

# Mismatched gene identifiers do not raise; the data are silently subset
# to the features common to both files, which may leave nothing
data = ModelingInputData.from_files(...)
if data.response_df.empty:
    print("Response and predictor files share no features")
    # Solution: Verify that feature_col values match between files

Missing Data Errors

# Perturbed TF not found
try:
    data = ModelingInputData.from_files(perturbed_tf='MISSING_TF', ...)
except KeyError as e:
    print(f"Perturbed TF not found in predictor data: {e}")
    # Solution: Check TF name spelling, verify predictor column names

Performance Considerations

Memory Management

  • The perturbed TF and blacklisted features are excluded from consideration up front
  • Feature masks are applied through property accessors rather than by storing extra copies of the data

I/O

  • Response, predictor, and blacklist files are each read once at construction
  • The blacklist file is read only when a path is provided

Data Processing

  • Vectorized pandas operations for filtering and transformation
  • Index-based alignment between response and predictor data
  • Design matrices are built on demand from the masked predictors

Integration Points

The ModelingInputData class integrates with:

  1. CLI Interface: Receives parameters from command-line arguments
  2. Bootstrap Modeling: Provides base data for resampling
  3. Feature Engineering: Supplies data for polynomial and interaction terms
  4. Cross-Validation: Furnishes stratified sampling input
  5. Results Output: Delivers metadata for result interpretation