modeling_input_data¶
Core data structures for handling input data preprocessing and validation in tfbpmodeling.
tfbpmodeling.modeling_input_data ¶
ModelingInputData ¶
```python
ModelingInputData(
    response_df,
    predictors_df,
    perturbed_tf,
    feature_col="target_symbol",
    feature_blacklist=None,
    top_n=None,
)
```
Container for response and predictor data used in modeling transcription factor perturbation experiments.
This class handles:

- Validation and synchronization of response and predictor DataFrames based on a shared feature identifier.
- Optional blacklisting of features, including the perturbed transcription factor.
- Optional feature selection based on the top N strongest binding signals (ranked from the perturbed TF's column in the predictor matrix).
- Application of masking logic to restrict modeling to selected features.
Initialize ModelingInputData with response and predictor matrices. Note that the response and predictor DataFrames are subset to the features they have in common, by index, and the rows of both are reordered so that they match, again by index.
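A minimal sketch of this alignment behavior, using toy pandas DataFrames (illustrative; the class performs the equivalent internally):

```python
import pandas as pd

# Toy inputs indexed by the shared feature identifier
response_df = pd.DataFrame({"expr": [0.2, -1.4, 0.9]}, index=["g1", "g2", "g3"])
predictors_df = pd.DataFrame({"YPD1": [0.3, 0.8]}, index=["g3", "g1"])

# Subset both matrices to the shared index and put the rows in matching order
common = response_df.index.intersection(predictors_df.index)
response_aligned = response_df.loc[common]
predictors_aligned = predictors_df.loc[common]

assert list(response_aligned.index) == list(predictors_aligned.index)
```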
| Parameters: | |
|---|---|
| `response_df` | Response matrix (e.g., expression changes), keyed by the shared feature identifier. |
| `predictors_df` | Predictor matrix (e.g., binding signals), keyed by the same feature identifier. |
| `perturbed_tf` | Name of the perturbed transcription factor; must be a column in `predictors_df`. |
| `feature_col` | Name of the shared feature identifier column. Defaults to `"target_symbol"`. |
| `feature_blacklist` | Optional list of features to exclude from modeling. Defaults to `None`. |
| `top_n` | Optional number of top-ranked features (by the perturbed TF's binding signal) to retain. Defaults to `None`. |
Source code in tfbpmodeling/modeling_input_data.py
predictors_df property writable ¶
Get the predictors DataFrame with feature masks applied.
The returned DataFrame reflects any active blacklist or top-N filtering.
response_df property writable ¶
Get the response DataFrame with feature masks applied.
Returns a version of the response matrix filtered by:

- Feature blacklist (if `blacklist_masked` is True)
- Top-N feature selection (if `top_n_masked` is True)

The final DataFrame is aligned in index order with the predictors matrix.
| Returns: | |
|---|---|
| `DataFrame` | The response matrix with active feature masks applied, index-aligned to the predictors. |
top_n property writable ¶
Get the threshold for top-ranked feature selection.
If set to an integer, this defines how many of the highest-ranked features (based on `predictors_df[self.perturbed_tf]`) should be retained. Ranking is descending (higher values rank higher). If the cutoff falls on a tie, fewer than N features may be selected to preserve a consistent threshold. Ties matter most when many of the lower-ranked features share the same value, e.g., an enrichment of 0 or a p-value of 1.0.
If set to None, top-N feature selection is disabled.
Note: whether top-N filtering is actively applied depends on the `top_n_masked` attribute. You can set `top_n_masked = False` to access the unfiltered data, even if `top_n` is set.
| Returns: | |
|---|---|
| `int` or `None` | The current top-N threshold, or `None` if top-N selection is disabled. |
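A minimal sketch of tie-aware top-N selection consistent with this description (illustrative, not the library's exact implementation):

```python
import pandas as pd

def select_top_n(predictors_df: pd.DataFrame, perturbed_tf: str, top_n: int) -> pd.Index:
    """Keep up to top_n features; a tie at the cutoff yields fewer than N."""
    scores = predictors_df[perturbed_tf].sort_values(ascending=False)
    if top_n >= len(scores):
        return scores.index
    cutoff = scores.iloc[top_n - 1]
    at_or_above = scores >= cutoff
    if at_or_above.sum() > top_n:
        # The cutoff value is tied beyond position N: keep only features
        # strictly above it, so fewer than N features are selected
        return scores[scores > cutoff].index
    return scores[at_or_above].index
```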
top_n_masked property writable ¶
Get the status of top-N feature masking. If this is True, top-N feature selection is applied to the predictors and response matrices.
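For example, the writable mask can be toggled to compare filtered and unfiltered views (a sketch assuming `data` is an initialized `ModelingInputData` with `top_n` set):

```python
# Disable the top-N mask to access the unfiltered matrices
data.top_n_masked = False
full_response = data.response_df

# Re-enable it to restrict modeling to the top-N features
data.top_n_masked = True
masked_response = data.response_df

print(f"All features: {full_response.shape[0]}, top-N: {masked_response.shape[0]}")
```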
from_files classmethod ¶
```python
from_files(
    response_path,
    predictors_path,
    perturbed_tf,
    feature_col="target_symbol",
    feature_blacklist_path=None,
    top_n=600,
)
```
Load response and predictor data from files. In other languages this would be considered an overloaded constructor. The input files must be readable into objects that satisfy the `__init__` method; see the `__init__` docs.
| Parameters: | |
|---|---|
| `response_path` | Path to the response CSV file. |
| `predictors_path` | Path to the predictors CSV file. |
| `perturbed_tf` | Name of the perturbed transcription factor; must be a column in the predictors file. |
| `feature_col` | Name of the shared feature identifier column. Defaults to `"target_symbol"`. |
| `feature_blacklist_path` | Optional path to a text file listing features to exclude, one per line. Defaults to `None`. |
| `top_n` | Number of top-ranked features to retain. Defaults to `600`. |

| Returns: | |
|---|---|
| `ModelingInputData` | A fully initialized instance built from the input files. |
Source code in tfbpmodeling/modeling_input_data.py
get_modeling_data ¶

```python
get_modeling_data(
    formula,
    add_row_max=False,
    drop_intercept=False,
    scale_by_std=False,
)
```
Get the predictors for modeling, optionally adding a row-wise max feature.
| Parameters: | |
|---|---|
| `formula` | Model formula used to construct the design matrix. |
| `add_row_max` | If True, add a feature holding each row's maximum predictor value. Defaults to `False`. |
| `drop_intercept` | If True, drop the intercept from the design matrix. Defaults to `False`. |
| `scale_by_std` | If True, scale predictor columns by their standard deviation. Defaults to `False`. |

| Returns: | |
|---|---|
| `DataFrame` | The design matrix of predictors for modeling. |
Source code in tfbpmodeling/modeling_input_data.py
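A sketch of calling this method, assuming `data` is an initialized `ModelingInputData` (the formula string shown is illustrative; consult the package's formula conventions for the exact syntax):

```python
# Build a design matrix from a formula over predictor columns
X = data.get_modeling_data(
    formula="YPD1",        # illustrative formula
    add_row_max=True,      # append a row-wise max feature
    drop_intercept=False,
    scale_by_std=False,
)
print(X.shape)
```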
Overview¶
The modeling_input_data module provides the fundamental ModelingInputData class that handles:
- Data loading: Reading response and predictor CSV files via `from_files`
- Validation: Ensuring the two matrices are consistent and share a feature identifier
- Preprocessing: Feature blacklisting, top-N selection, and optional scaling
- Integration: Aligning response and predictor data into modeling-ready form
This class serves as the foundation for all downstream modeling operations.
Core Classes¶
ModelingInputData¶
The primary class for managing input data throughout the modeling workflow.
Key Features¶
- Automatic data validation: Checks that response and predictor matrices are consistent and share a feature identifier column
- Feature filtering: Removes blacklisted features, including the perturbed TF itself, and supports top-N selection
- Index alignment: Subsets both matrices to their common features and orders rows to match
- Data access: Exposes masked, index-aligned matrices and a design-matrix builder for modeling
Initialization¶
The constructor takes DataFrames directly; to construct from CSV paths, use the `from_files` classmethod shown later.

```python
from tfbpmodeling.modeling_input_data import ModelingInputData

# Basic initialization from in-memory DataFrames
data = ModelingInputData(
    response_df=response_df,
    predictors_df=predictors_df,
    perturbed_tf="YPD1",
)

# With optional parameters
data = ModelingInputData(
    response_df=response_df,
    predictors_df=predictors_df,
    perturbed_tf="YPD1",
    feature_col="target_symbol",
    feature_blacklist=["YBR999W", "YCR888X"],
    top_n=600,
)
```
Key Methods¶
Data Loading¶

- `from_files()`: Alternate constructor that reads response, predictor, and optional blacklist files from disk

Feature Masking¶

- `top_n` / `top_n_masked`: Configure and toggle top-N feature selection
- `blacklist_masked`: Toggle blacklist filtering of excluded features

Data Access¶

- `response_df` / `predictors_df`: Masked, index-aligned response and predictor matrices
- `get_modeling_data()`: Build the design matrix for modeling from a formula
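A compact sketch touching each of these members (file paths are placeholders; the formula string is illustrative):

```python
from tfbpmodeling.modeling_input_data import ModelingInputData

data = ModelingInputData.from_files(
    response_path="expression.csv",
    predictors_path="binding.csv",
    perturbed_tf="YPD1",
)

data.top_n = 600          # configure top-N selection
data.top_n_masked = True  # apply the mask

X = data.get_modeling_data(formula="YPD1")  # illustrative formula
```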
Data Format Requirements¶
Response File Format¶
The response file must be a CSV with specific structure:
```
target_symbol,sample1,sample2,sample3,sample4
YPD1,0.23,-1.45,0.87,-0.12
YBR123W,1.34,0.56,-0.23,0.78
YCR456X,-0.45,0.12,1.23,-0.56
```

Requirements:

- Feature identifier column: named by `feature_col` (default `target_symbol`); rows are kept only for identifiers shared with the predictor file
- Subsequent columns: Numeric expression values
- Column names: Sample identifiers
Predictor File Format¶
The predictor file structure:
```
target_symbol,TF1,TF2,YPD1,TF4
YPD1,0.34,0.12,0.78,0.01
YBR123W,0.89,0.45,0.23,0.67
YCR456X,0.12,0.78,0.34,0.90
```

Requirements:

- Feature identifier column: named by `feature_col` (default `target_symbol`); rows are kept only for identifiers shared with the response file
- Subsequent columns: Numeric binding values
- Column names: Transcription factor identifiers; a column matching the `perturbed_tf` parameter must be present
- All values must be numeric (no missing values in binding data)
Blacklist File Format¶
Optional exclusion file:
```
YBR999W
YCR888X
control_gene
technical_artifact
```

Requirements:

- Plain text file
- One feature identifier per line
- Identifiers must match those in the data files
- Comments are not supported
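For reference, a sketch of reading these files by hand into the objects the constructor expects; `from_files` performs the equivalent loading for you (the indexing shown here is an assumption about how the feature column is used):

```python
import pandas as pd

# Read the response and predictor tables, indexed by the feature column
response_df = pd.read_csv("data/expression.csv", index_col="target_symbol")
predictors_df = pd.read_csv("data/binding.csv", index_col="target_symbol")

# Read the blacklist: one identifier per line, no comments
with open("data/exclude_genes.txt") as f:
    feature_blacklist = [line.strip() for line in f if line.strip()]
```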
Usage Examples¶
Basic Data Loading¶
```python
from tfbpmodeling.modeling_input_data import ModelingInputData

# Load data with minimal configuration
data = ModelingInputData.from_files(
    response_path="data/expression.csv",
    predictors_path="data/binding.csv",
    perturbed_tf="YPD1",
)

# Access the masked, index-aligned matrices
response_data = data.response_df
predictor_data = data.predictors_df

# Features are the shared index of the two matrices
print(f"Loaded {len(response_data.index)} features")
print(f"Response data shape: {response_data.shape}")
print(f"Predictor data shape: {predictor_data.shape}")
```
Data Preprocessing Options¶
```python
# Advanced preprocessing: blacklist file plus top-N selection
data = ModelingInputData.from_files(
    response_path="data/expression.csv",
    predictors_path="data/binding.csv",
    perturbed_tf="YPD1",
    feature_blacklist_path="data/exclude_genes.txt",
    top_n=600,
)

# Compare feature counts with and without the top-N mask
data.top_n_masked = False
print(f"Unmasked features: {data.response_df.shape[0]}")
data.top_n_masked = True
print(f"Top-N filtered features: {data.response_df.shape[0]}")
```
Integration with Modeling Pipeline¶
The `BootstrappedModelingInputData` arguments shown below are illustrative; consult its documentation for the exact signature.

```python
# Prepare data for bootstrap modeling
from tfbpmodeling.bootstrapped_input_data import BootstrappedModelingInputData

# Base data
base_data = ModelingInputData.from_files(
    response_path="expression.csv",
    predictors_path="binding.csv",
    perturbed_tf="YPD1",
)

# Create bootstrap version (argument names are illustrative)
bootstrap_data = BootstrappedModelingInputData(
    base_data=base_data,
    n_bootstraps=1000,
    random_state=42,
)
```
Data Validation¶
Automatic Checks¶
The class performs validation equivalent to the following checks (shown as illustrative assertions, not the literal source):

```python
import os

import numpy as np

# File existence and readability (checked when loading via from_files)
assert os.path.exists(response_path), f"Response file not found: {response_path}"
assert os.path.exists(predictors_path), f"Predictor file not found: {predictors_path}"

# The perturbed TF must be a column of the predictor matrix
assert perturbed_tf in predictors_df.columns, f"Perturbed TF '{perturbed_tf}' not found"

# Data type validation: all values must be numeric
assert response_df.dtypes.apply(lambda x: np.issubdtype(x, np.number)).all()
assert predictors_df.dtypes.apply(lambda x: np.issubdtype(x, np.number)).all()

# After construction, the masked matrices are aligned to a shared index
assert data.response_df.index.equals(data.predictors_df.index)
```
Custom Validation¶
Because the constructor accepts DataFrames directly, custom checks can be run on the raw matrices before constructing the container (the class itself does not take a validator argument):

```python
import pandas as pd

def validate_expression_range(df: pd.DataFrame) -> None:
    """Ensure expression values are in a reasonable range."""
    assert df.abs().max().max() < 10, "Expression values seem too large"

def validate_binding_range(df: pd.DataFrame) -> None:
    """Ensure binding values look like probabilities."""
    assert (df >= 0).all().all(), "Binding values must be non-negative"
    assert (df <= 1).all().all(), "Binding values must be <= 1"

# Apply custom checks to the numeric columns, then construct as usual
validate_expression_range(response_df.select_dtypes("number"))
validate_binding_range(predictors_df.select_dtypes("number"))

data = ModelingInputData(
    response_df=response_df,
    predictors_df=predictors_df,
    perturbed_tf="YPD1",
)
```
Error Handling¶
Common Errors and Solutions¶
File Format Errors¶
```python
import pandas as pd

# CSV parsing errors surface when the input files are read
try:
    data = ModelingInputData.from_files(response_path='malformed.csv', ...)
except pd.errors.ParserError as e:
    print(f"CSV format error: {e}")
    # Solution: check file encoding, delimiters, and quoting
```
Data Consistency Errors¶
```python
# Mismatched feature identifiers
# Both matrices are subset to their shared features; if the files share no
# identifiers, construction cannot proceed
try:
    data = ModelingInputData.from_files(...)
except ValueError as e:
    print(f"Feature alignment error: {e}")
    # Solution: confirm both files use the same feature identifiers
```
Missing Data Errors¶
```python
# Perturbed TF not found among the predictor columns
try:
    data = ModelingInputData.from_files(perturbed_tf='MISSING_TF', ...)
except KeyError as e:
    print(f"Perturbed TF not found in predictor data: {e}")
    # Solution: check the TF name spelling and the predictor column names
```
Performance Considerations¶
Memory Management¶
- Large datasets loaded using chunked reading
- Unnecessary columns dropped early in processing
- Memory-efficient data types selected automatically
I/O Optimization¶
- CSV reading optimized with appropriate engines
- Caching of preprocessed data for repeated access
- Lazy loading of optional data components
Data Processing¶
- Vectorized operations for filtering and transformation
- Efficient indexing for data alignment
- Minimal data copying during preprocessing
Related Classes¶
- BootstrappedModelingInputData: Bootstrap sampling extension
- BootstrapModelResults: Results storage and aggregation
- StratifiedCV: Cross-validation data handling
Integration Points¶
The ModelingInputData class integrates with:
- CLI Interface: Receives parameters from command-line arguments
- Bootstrap Modeling: Provides base data for resampling
- Feature Engineering: Supplies data for polynomial and interaction terms
- Cross-Validation: Furnishes stratified sampling input
- Results Output: Delivers metadata for result interpretation