modeling_input_data

Core data structures for handling input data preprocessing and validation in tfbpmodeling.

tfbpmodeling.modeling_input_data

ModelingInputData

ModelingInputData(
    response_df,
    predictors_df,
    perturbed_tf,
    feature_col="target_symbol",
    feature_blacklist=None,
    top_n=None,
)

Container for response and predictor data used in modeling transcription factor perturbation experiments.

This class handles:

  • Validation and synchronization of response and predictor DataFrames based on a shared feature identifier.
  • Optional blacklisting of features, including the perturbed transcription factor.
  • Optional feature selection based on the top N strongest binding signals (as ranked from a specific TF column in the predictor matrix).
  • Application of masking logic to restrict modeling to selected features.

Initialize ModelingInputData with response and predictor matrices. Note that the response and predictor dataframes will be subset down to the features in common between them, by index. The rows in both dataframes will also be ordered such that they match, again by index.

Parameters:
  • response_df (DataFrame) –

    A two-column DataFrame containing the feature_col and a numeric column representing the response variable.

  • predictors_df (DataFrame) –

    A DataFrame containing the feature_col and numeric columns representing the predictor variables.

  • perturbed_tf (str) –

    Name of the perturbed TF. Note: this must exist as a column in predictors_df.

  • feature_col (str, default: 'target_symbol' ) –

    Name of the column to use as the feature index. This column must exist in both the response and predictor DataFrames.

  • feature_blacklist (list[str] | None, default: None ) –

    List of feature names to exclude from analysis.

  • top_n (int | None, default: None ) –

    If specified, retain only the top N features with the strongest binding scores for the perturbed TF. If this is passed on initialization, top_n_masked is set to True by default. To extract unmasked data, set object.top_n_masked = False; the mask can be toggled on and off at will.

Source code in tfbpmodeling/modeling_input_data.py
def __init__(
    self,
    response_df: pd.DataFrame,
    predictors_df: pd.DataFrame,
    perturbed_tf: str,
    feature_col: str = "target_symbol",
    feature_blacklist: list[str] | None = None,
    top_n: int | None = None,
):
    """
    Initialize ModelingInputData with response and predictor matrices. Note that the
    response and predictor dataframes will be subset down to the features in common
    between them, by index. The rows in both dataframes will also be ordered such
    that they match, again by index.

    :param response_df: A two column DataFrame containing the `feature_col` and
        numeric column representing the response variable.
    :param predictors_df: A Dataframe containing the `feature_col` and predictor
        numeric columns that represent the predictor variables.
    :param perturbed_tf: Name of the perturbed TF. **Note**: this must exist as a
        column in predictors_df.
    :param feature_col: Name of the column to use as the feature index. This column
        must exist in both the response and predictor DataFrames.
        (default: "target_symbol").
    :param feature_blacklist: List of feature names to exclude from analysis.
    :param top_n: If specified, retain only the top N features with the strongest
        binding scores for the perturbed TF. If this is passed on initialization,
        then the top_n_masked is set to True by default. If you wish to extract
        unmasked data, you can set `object.top_n_masked = False`. The mask can be
        toggled on and off at will.

    """
    if not isinstance(response_df, pd.DataFrame):
        raise ValueError("response_df must be a DataFrame.")
    if not isinstance(predictors_df, pd.DataFrame):
        raise ValueError("predictors_df must be a DataFrame.")
    if not isinstance(perturbed_tf, str):
        raise ValueError("perturbed_tf must be a string representing the TF name.")
    if not isinstance(feature_col, str):
        raise ValueError(
            "feature_col must be a string representing the feature name."
        )
    if feature_blacklist is not None and not isinstance(feature_blacklist, list):
        raise ValueError("feature_blacklist must be a list or None.")
    if top_n is not None and not isinstance(top_n, int):
        raise ValueError("top_n must be an integer or None.")

    self.perturbed_tf = perturbed_tf
    self.feature_col = feature_col
    self._top_n_masked = False

    # Ensure feature_blacklist is a list
    if feature_blacklist is None:
        feature_blacklist = []

    # Ensure perturbed_tf is in the blacklist
    if perturbed_tf not in feature_blacklist:
        logger.warning(
            f"Perturbed TF '{perturbed_tf}' not in blacklist. "
            f"Adding to blacklist. Setting blacklist_masked to True. "
            f"If you do not wish to blacklist the perturbed TF, "
            f"set blacklist_masked to False."
        )
        feature_blacklist.append(perturbed_tf)

    self.feature_blacklist = set(feature_blacklist)
    self.blacklist_masked = bool(self.feature_blacklist)

    # Ensure the response and predictors only contain common features
    self.response_df = response_df
    self.predictors_df = predictors_df

    # Assign top_n value
    self.top_n = top_n

predictors_df property writable

predictors_df

Get the predictors DataFrame with feature masks applied.

The returned DataFrame reflects any active blacklist or top-N filtering.

response_df property writable

response_df

Get the response DataFrame with feature masks applied.

Returns a version of the response matrix filtered by:

  • Feature blacklist (if blacklist_masked is True)
  • Top-N feature selection (if top_n_masked is True)

The final DataFrame will be aligned in index order with the predictors matrix.

Returns:
  • DataFrame

    Filtered and ordered response DataFrame.

top_n property writable

top_n

Get the threshold for top-ranked feature selection.

If set to an integer, this defines how many of the highest-ranked features (based on predictors_df[self.perturbed_tf]) should be retained. Ranking is descending (higher values rank higher). If the cutoff falls on a tie, fewer than N features may be selected to preserve a consistent threshold. Ties matter most when the majority of lower-ranked features share the same value, e.g. an enrichment of 0 or a p-value of 1.0.

If set to None, top-N feature selection is disabled.

Note: Whether top-N filtering is actively applied depends on the top_n_masked attribute. You can set top_n_masked = False to access the unfiltered data, even if top_n is set.

Returns:
  • int | None

    The current top-N threshold or None.
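To illustrate the tie behavior, here is a toy sketch of the ranking rule as described above (not necessarily the library's internal implementation):

import pandas as pd

scores = pd.Series(
    [0.9, 0.7, 0.5, 0.0, 0.0, 0.0], index=["g1", "g2", "g3", "g4", "g5", "g6"]
)

top_n = 5
# Rank descending; method="max" gives tied scores their worst shared rank,
# so a tie straddling the cutoff is dropped entirely rather than broken
# arbitrarily.
ranks = scores.rank(method="max", ascending=False)
selected = scores[ranks <= top_n]
print(list(selected.index))  # ['g1', 'g2', 'g3'] -- only 3 of the requested 5
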

top_n_masked property writable

top_n_masked

Get the status of top-n feature masking.

If True, top-N feature selection is applied to both the predictors and the response.

from_files classmethod

from_files(
    response_path,
    predictors_path,
    perturbed_tf,
    feature_col="target_symbol",
    feature_blacklist_path=None,
    top_n=600,
)

Load response and predictor data from files. This would be considered an overloaded constructor in other languages. The input files must be readable into objects that satisfy the __init__ method -- see the __init__ docs.

Parameters:
  • response_path (str) –

    Path to the response file (CSV).

  • predictors_path (str) –

    Path to the predictors file (CSV).

  • perturbed_tf (str) –

    The perturbed TF.

  • feature_col (str, default: 'target_symbol' ) –

    The column name representing features.

  • feature_blacklist_path (str | None, default: None ) –

    Path to a file containing a list of features to exclude.

  • top_n (int, default: 600 ) –

    Maximum number of features for top-n selection.

Returns:
  • ModelingInputData –

    An instance of ModelingInputData.

Raises:
  • FileNotFoundError

    If the response or predictor files are missing.

Source code in tfbpmodeling/modeling_input_data.py
@classmethod
def from_files(
    cls,
    response_path: str,
    predictors_path: str,
    perturbed_tf: str,
    feature_col: str = "target_symbol",
    feature_blacklist_path: str | None = None,
    top_n: int = 600,
) -> "ModelingInputData":
    """
    Load response and predictor data from files. This would be considered an
    overloaded constructor in other languages. The input files must be able to be
    read into objects that satisfy the __init__ method -- see __init__ docs.

    :param response_path: Path to the response file (CSV).
    :param predictors_path: Path to the predictors file (CSV).
    :param perturbed_tf: The perturbed TF.
    :param feature_col: The column name representing features.
    :param feature_blacklist_path: Path to a file containing a list of features to
        exclude.
    :param top_n: Maximum number of features for top-n selection.
    :return: An instance of ModelingInputData.
    :raises FileNotFoundError: If the response or predictor files are missing.

    """
    if not os.path.exists(response_path):
        raise FileNotFoundError(f"Response file '{response_path}' does not exist.")
    if not os.path.exists(predictors_path):
        raise FileNotFoundError(
            f"Predictors file '{predictors_path}' does not exist."
        )

    response_df = pd.read_csv(response_path)
    predictors_df = pd.read_csv(predictors_path)

    # Load feature blacklist if provided
    feature_blacklist: list[str] = []
    if feature_blacklist_path:
        if not os.path.exists(feature_blacklist_path):
            raise FileNotFoundError(
                f"Feature blacklist file '{feature_blacklist_path}' does not exist."
            )
        with open(feature_blacklist_path) as f:
            feature_blacklist = [line.strip() for line in f if line.strip()]

    return cls(
        response_df,
        predictors_df,
        perturbed_tf,
        feature_col,
        feature_blacklist,
        top_n,
    )

get_modeling_data

get_modeling_data(
    formula,
    add_row_max=False,
    drop_intercept=False,
    scale_by_std=False,
)

Get the predictors for modeling, optionally adding a row-wise max feature.

Parameters:
  • formula (str) –

    The formula to use for modeling.

  • add_row_max (bool, default: False ) –

    Whether to add a row-wise max feature to the predictors.

  • drop_intercept (bool, default: False ) –

    If drop_intercept is True, "-1" will be appended to the formula string. This will drop the intercept (constant) term from the model matrix output by patsy.dmatrix. Default is False.

  • scale_by_std (bool, default: False ) –

    If True, scale the design matrix by standard deviation using StandardScaler(with_mean=False, with_std=True). The data is NOT centered, so the estimator should still fit an intercept (fit_intercept=True).

Returns:
  • DataFrame

    The design matrix for modeling. self.response_df can be used for the response variable.

Raises:
  • ValueError

    If the formula is not provided

  • PatsyError

    If there is an error in creating the model matrix

Source code in tfbpmodeling/modeling_input_data.py
def get_modeling_data(
    self,
    formula: str,
    add_row_max: bool = False,
    drop_intercept: bool = False,
    scale_by_std: bool = False,
) -> pd.DataFrame:
    """
    Get the predictors for modeling, optionally adding a row-wise max feature.

    :param formula: The formula to use for modeling.
    :param add_row_max: Whether to add a row-wise max feature to the predictors.
    :param drop_intercept: If `drop_intercept` is True, "-1" will be appended to
        the formula string. This will drop the intercept (constant) term from
        the model matrix output by patsy.dmatrix. Default is `False`.
    :param scale_by_std: If True, scale the design matrix by standard deviation
        using StandardScaler(with_mean=False, with_std=True). The data is NOT
        centered, so the estimator should still fit an intercept
        (fit_intercept=True).
    :return: The design matrix for modeling. self.response_df can be used for the
        response variable.

    :raises ValueError: If the formula is not provided
    :raises PatsyError: If there is an error in creating the model matrix

    """
    if not formula:
        raise ValueError("Formula must be provided for modeling.")

    if drop_intercept:
        logger.info("Dropping intercept from the patsy model matrix")
        formula += " - 1"

    predictors_df = self.predictors_df  # Apply top-n feature mask

    # Add row-wise max feature if requested
    if add_row_max:
        predictors_df["row_max"] = predictors_df.max(axis=1)

    # Create a design matrix using patsy
    try:
        design_matrix = dmatrix(
            formula,
            data=predictors_df,
            return_type="dataframe",
            NA_action="raise",
        )
    except PatsyError as exc:
        logger.error(
            f"Error in creating model matrix with formula '{formula}': {exc}"
        )
        raise

    if scale_by_std:
        logger.info("Center matrix = `False`. Scale matrix = `True`")
        scaler = StandardScaler(with_mean=False, with_std=True)
        scaled_values = scaler.fit_transform(design_matrix)
        design_matrix = pd.DataFrame(
            scaled_values, index=design_matrix.index, columns=design_matrix.columns
        )

    logger.info(f"Design matrix columns: {list(design_matrix.columns)}")

    return design_matrix

Overview

The modeling_input_data module provides the fundamental ModelingInputData class that handles:

  • Data loading: Reading CSV files for response and predictor data
  • Validation: Ensuring data consistency and format compliance
  • Preprocessing: Feature blacklisting, top-N selection, and optional scaling
  • Integration: Aligning response and predictor data for modeling

This class serves as the foundation for all downstream modeling operations.

Core Classes

ModelingInputData

The primary class for managing input data throughout the modeling workflow.

Key Features

  • Automatic data validation: Checks input types, required columns, and data consistency
  • Feature filtering: Removes blacklisted genes and applies optional top-N selection
  • Index alignment: Ensures consistent gene identifiers between response and predictor data
  • Design matrix construction: Builds patsy design matrices ready for modeling

Initialization

from tfbpmodeling.modeling_input_data import ModelingInputData

# Basic initialization from files
data = ModelingInputData.from_files(
    response_path='expression.csv',
    predictors_path='binding.csv',
    perturbed_tf='YPD1'
)

# With optional parameters
data = ModelingInputData.from_files(
    response_path='expression.csv',
    predictors_path='binding.csv',
    perturbed_tf='YPD1',
    feature_col='target_symbol',
    feature_blacklist_path='exclude_genes.txt',
    top_n=600
)

Key Methods and Properties

Construction
  • from_files(): Load response and predictor data from CSV files
Data Access
  • response_df: Masked, index-aligned response DataFrame (property)
  • predictors_df: Masked, index-aligned predictors DataFrame (property)
Feature Masking
  • top_n / top_n_masked: Configure and toggle top-N feature selection
  • blacklist_masked: Toggle blacklist filtering
Modeling
  • get_modeling_data(): Build a patsy design matrix from a formula

Data Format Requirements

Response File Format

The response file must be a CSV with two columns:

target_symbol,response
YPD1,0.23
YBR123W,1.34
YCR456X,-0.45

Requirements:
  • First column: Gene identifiers matching feature_col (default "target_symbol"); features must overlap with the predictor file
  • Second column: Numeric values representing the response variable

Predictor File Format

The predictor file structure:

target_symbol,TF1,TF2,TF3,TF4
YPD1,0.34,0.12,0.78,0.01
YBR123W,0.89,0.45,0.23,0.67
YCR456X,0.12,0.78,0.34,0.90

Requirements:
  • First column: Gene identifiers matching feature_col; features must overlap with the response file
  • Subsequent columns: Numeric binding values
  • Column names: Transcription factor identifiers; one column must match the perturbed_tf parameter
  • All values must be numeric (no missing values in binding data)

Blacklist File Format

Optional exclusion file:

YBR999W
YCR888X
control_gene
technical_artifact

Requirements:
  • Plain text file
  • One gene identifier per line
  • Gene IDs must match those in the data files
  • Comments are not supported
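A minimal sketch for generating such a file from Python (the gene names are placeholders):

# One identifier per line, no comments
exclude = ["YBR999W", "YCR888X", "control_gene", "technical_artifact"]
with open("exclude_genes.txt", "w") as f:
    f.write("\n".join(exclude) + "\n")
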

Usage Examples

Basic Data Loading

from tfbpmodeling.modeling_input_data import ModelingInputData

# Load data with minimal configuration
data = ModelingInputData.from_files(
    response_path='data/expression.csv',
    predictors_path='data/binding.csv',
    perturbed_tf='YPD1'
)

# Access processed data
response_data = data.response_df
predictor_data = data.predictors_df

print(f"Loaded {response_data.shape[0]} features")
print(f"Response data shape: {response_data.shape}")
print(f"Predictor data shape: {predictor_data.shape}")

Data Preprocessing Options

# Preprocessing via blacklisting and top-N selection
data = ModelingInputData.from_files(
    response_path='data/expression.csv',
    predictors_path='data/binding.csv',
    perturbed_tf='YPD1',
    feature_blacklist_path='data/exclude_genes.txt',
    top_n=600
)

# Check data quality by toggling the feature masks
print(f"Filtered features: {data.predictors_df.shape[0]}")
data.top_n_masked = False
data.blacklist_masked = False
print(f"All shared features: {data.predictors_df.shape[0]}")

Integration with Modeling Pipeline

# Prepare data for bootstrap modeling
from tfbpmodeling.bootstrapped_input_data import BootstrappedModelingInputData

# Base data
base_data = ModelingInputData.from_files(
    response_path='expression.csv',
    predictors_path='binding.csv',
    perturbed_tf='YPD1'
)

# Create bootstrap version
bootstrap_data = BootstrappedModelingInputData(
    base_data=base_data,
    n_bootstraps=1000,
    random_state=42
)

Data Validation

Automatic Checks

The class performs validation equivalent to the following checks (implemented with explicit exceptions such as ValueError and FileNotFoundError rather than assertions):

# File existence and readability (from_files)
assert os.path.exists(response_path), f"Response file not found: {response_path}"
assert os.path.exists(predictors_path), f"Predictor file not found: {predictors_path}"

# Input type validation
assert isinstance(response_df, pd.DataFrame)
assert isinstance(predictors_df, pd.DataFrame)

# Required columns
assert perturbed_tf in predictors_df.columns, f"Perturbed TF '{perturbed_tf}' not found"
assert feature_col in response_df.columns and feature_col in predictors_df.columns

Custom Validation

# Define custom validation rules and apply them before constructing the object
def validate_expression_range(df):
    """Ensure expression values are in a reasonable range."""
    values = df.select_dtypes('number')
    assert values.abs().max().max() < 10, "Expression values seem too large"

def validate_binding_range(df):
    """Ensure binding values are probabilities."""
    values = df.select_dtypes('number')
    assert (values >= 0).all().all(), "Binding values must be non-negative"
    assert (values <= 1).all().all(), "Binding values must be <= 1"

# Apply custom validation, then construct
response_df = pd.read_csv('expression.csv')
predictors_df = pd.read_csv('binding.csv')
validate_expression_range(response_df)
validate_binding_range(predictors_df)

data = ModelingInputData(response_df, predictors_df, perturbed_tf='YPD1')

Error Handling

Common Errors and Solutions

File Format Errors

# CSV parsing errors
try:
    data = ModelingInputData.from_files(response_path='malformed.csv', ...)
except pd.errors.ParserError as e:
    print(f"CSV format error: {e}")
    # Solution: Check file encoding, delimiters, quotes

Data Consistency Errors

# Mismatched gene identifiers do not raise; the data are silently subset
# to the features common to both files, which may leave nothing
data = ModelingInputData.from_files(...)
if data.response_df.empty:
    print("Response and predictor files share no features")
    # Solution: Verify that feature_col values match between files

Missing Data Errors

# Perturbed TF not found
try:
    data = ModelingInputData.from_files(perturbed_tf='MISSING_TF', ...)
except KeyError as e:
    print(f"Perturbed TF not found in predictor data: {e}")
    # Solution: Check TF name spelling, verify predictor column names

Performance Considerations

Memory Management

  • The perturbed TF and blacklisted features are excluded from consideration up front
  • Feature masks are applied through property accessors rather than by storing extra copies of the data

I/O

  • Response, predictor, and blacklist files are each read once at construction
  • The blacklist file is read only when a path is provided

Data Processing

  • Vectorized pandas operations for filtering and transformation
  • Index-based alignment between response and predictor data
  • Design matrices are built on demand from the masked predictors

Integration Points

The ModelingInputData class integrates with:

  1. CLI Interface: Receives parameters from command-line arguments
  2. Bootstrap Modeling: Provides base data for resampling
  3. Feature Engineering: Supplies data for polynomial and interaction terms
  4. Cross-Validation: Furnishes stratified sampling input
  5. Results Output: Delivers metadata for result interpretation