Input Data Formats¶
This guide provides detailed specifications for preparing input data for tfbpmodeling analysis.
Overview¶
tfbpmodeling requires two main input files, plus one optional file:
- Response File: Gene expression data (dependent variable)
- Predictors File: Transcription factor binding data (independent variables)
- Blacklist File (optional): Features to exclude from analysis
Response File Format¶
Structure¶
The response file contains gene expression measurements:
gene_id,sample_1,sample_2,sample_3,sample_4,YPD1
YBR001C,0.234,-1.456,0.876,-0.123,0.543
YBR002W,-0.456,0.123,1.234,-0.567,-0.234
YBR003W,1.234,0.567,-0.234,0.876,0.123
YBR004C,-0.123,-0.876,0.456,0.234,-0.456
Requirements¶
| Element | Requirement | Description |
|---|---|---|
| Format | CSV with comma separators | Standard comma-separated values |
| First Column | Gene identifiers | Must match predictor file exactly |
| Header Row | Sample names + perturbed TF | Column names for each measurement |
| Data Cells | Numeric values | Expression measurements (log2 fold-change, z-scores, etc.) |
| Perturbed TF Column | Must be present | Column name must match --perturbed_tf parameter |
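The last two requirements can be spot-checked before running an analysis. A minimal sketch, using the example file and TF names from this guide:

import pandas as pd

# Confirm the perturbed TF column exists and every cell parses as numeric
response_df = pd.read_csv('response_data.csv', index_col=0)
assert 'YPD1' in response_df.columns, "Perturbed TF column missing from header"
assert response_df.apply(pd.to_numeric, errors='coerce').notnull().all().all(), \
    "Response file contains non-numeric cells"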
Data Types¶
Supported expression data types:
- Log2 fold-change values
- Z-scores or standardized values
- Raw expression values (will be normalized if needed)
- Differential expression statistics
Example expression data preparation:
import pandas as pd
import numpy as np
# Load raw expression data
raw_expr = pd.read_csv('raw_expression.csv', index_col=0)
# Calculate log2 fold-change vs control
control_samples = ['ctrl_1', 'ctrl_2', 'ctrl_3']
treatment_samples = ['treat_1', 'treat_2', 'treat_3']
control_mean = raw_expr[control_samples].mean(axis=1)
treatment_mean = raw_expr[treatment_samples].mean(axis=1)
log2fc = np.log2(treatment_mean + 1) - np.log2(control_mean + 1)
# Create response file (sample_2 adds small noise purely to simulate a second replicate)
response_df = pd.DataFrame({
    'sample_1': log2fc,
    'sample_2': log2fc + np.random.normal(0, 0.1, len(log2fc)),
    'YPD1': log2fc  # column named after the perturbed TF
})
response_df.index.name = 'gene_id'
response_df.to_csv('response_data.csv')
Predictors File Format¶
Structure¶
The predictors file contains transcription factor binding data:
gene_id,TF_1,TF_2,TF_3,TF_4,YPD1_binding
YBR001C,0.123,0.456,0.789,0.012,0.345
YBR002W,0.234,0.567,0.890,0.123,0.456
YBR003W,0.345,0.678,0.901,0.234,0.567
YBR004C,0.456,0.789,0.012,0.345,0.678
Requirements¶
| Element | Requirement | Description |
|---|---|---|
| Format | CSV with comma separators | Standard comma-separated values |
| First Column | Gene identifiers | Must match response file exactly |
| Header Row | TF names | Transcription factor identifiers |
| Data Cells | Numeric values 0-1 | Binding probabilities or normalized scores |
| No Missing Values | Complete data required | All cells must contain numeric values |
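The range and completeness rules in this table can be spot-checked the same way. A minimal sketch, using the example file name from the preparation code below:

import pandas as pd

# Confirm completeness and the 0-1 value range
predictor_df = pd.read_csv('binding_data.csv', index_col=0)
assert not predictor_df.isnull().any().any(), "Predictor file contains missing values"
assert ((predictor_df >= 0) & (predictor_df <= 1)).all().all(), "Values must lie in [0, 1]"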
Data Types¶
Supported binding data types:
- ChIP-seq binding probabilities (0-1)
- Normalized binding scores (0-1)
- Peak presence indicators (0/1)
- Binding strength quantiles (0-1)
Example binding data preparation:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
# Load ChIP-seq peak scores
chip_data = pd.read_csv('chip_peaks.csv', index_col=0)
# Normalize to 0-1 range
scaler = MinMaxScaler()
normalized_binding = pd.DataFrame(
    scaler.fit_transform(chip_data),
    index=chip_data.index,
    columns=chip_data.columns
)
# Save as predictors file
normalized_binding.index.name = 'gene_id'
normalized_binding.to_csv('binding_data.csv')
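Note that MinMaxScaler rescales each TF column independently, so the strongest site for a weakly bound TF still maps to 1.0. If binding magnitudes should remain comparable across TFs, consider dividing the whole matrix by a single global maximum instead.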
Gene Identifier Consistency¶
Critical Requirements¶
Both files must use identical gene identifiers:
import pandas as pd

# Verify gene ID consistency
response_df = pd.read_csv('response.csv', index_col=0)
predictor_df = pd.read_csv('predictors.csv', index_col=0)
# Check for exact matches
common_genes = set(response_df.index) & set(predictor_df.index)
response_only = set(response_df.index) - set(predictor_df.index)
predictor_only = set(predictor_df.index) - set(response_df.index)
print(f"Common genes: {len(common_genes)}")
print(f"Response only: {len(response_only)}")
print(f"Predictor only: {len(predictor_only)}")
Common Gene ID Formats¶
| Format | Example | Notes |
|---|---|---|
| Systematic names | YBR001C | S. cerevisiae standard |
| Gene symbols | CDC42 | Human/mouse standard |
| Ensembl IDs | ENSG00000123456 | Cross-species compatible |
| RefSeq IDs | NM_001234567 | NCBI standard |
Handling ID Mismatches¶
# Align gene IDs between files
def align_gene_ids(response_df, predictor_df):
    # Find common genes (sorted for a deterministic row order)
    common_genes = sorted(set(response_df.index) & set(predictor_df.index))
    # Subset both dataframes
    aligned_response = response_df.loc[common_genes]
    aligned_predictor = predictor_df.loc[common_genes]
    return aligned_response, aligned_predictor

# Apply alignment
aligned_response, aligned_predictor = align_gene_ids(response_df, predictor_df)
# Save aligned files
aligned_response.to_csv('aligned_response.csv')
aligned_predictor.to_csv('aligned_predictors.csv')
Blacklist File Format¶
Structure¶
Simple text file with one gene identifier per line:
YBR999W
YCR888X
ribosomal_protein_L1
housekeeping_gene_1
batch_effect_gene
Use Cases¶
Common genes to blacklist:
- Housekeeping genes with stable expression
- Ribosomal protein genes
- Mitochondrial genes
- Known batch effect genes
- Genes with technical artifacts
Creating Blacklist Files¶
# Identify housekeeping genes (example yeast genes with typically stable expression)
housekeeping_genes = [
    'ACT1', 'TUB1', 'TUB2', 'RDN18-1', 'RDN25-1'
]
# Identify the 50 lowest-variance genes across samples
low_variance_genes = response_df.var(axis=1).sort_values().head(50).index.tolist()
# Combine blacklists
blacklist_genes = list(set(housekeeping_genes + low_variance_genes))
# Save blacklist, one gene per line
with open('blacklist.txt', 'w') as f:
    for gene in sorted(blacklist_genes):
        f.write(f"{gene}\n")
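Because the blacklist simply lists features to exclude, you can also pre-filter the input files yourself before modeling. A minimal sketch, assuming `response_df` and `predictor_df` are loaded as in the snippets above and that blacklist entries are gene IDs matching the data index:

# Read the blacklist and drop those genes from both input files
with open('blacklist.txt') as f:
    blacklist = {line.strip() for line in f if line.strip()}

keep = [gene for gene in response_df.index if gene not in blacklist]
response_filtered = response_df.loc[keep]
predictor_filtered = predictor_df.loc[keep]
print(f"Removed {len(response_df) - len(keep)} blacklisted genes")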
Data Quality Checks¶
Automated Validation¶
import pandas as pd
import numpy as np

def validate_input_data(response_file, predictor_file, perturbed_tf):
    """Validate input data format and consistency."""
    # Load data
    response_df = pd.read_csv(response_file, index_col=0)
    predictor_df = pd.read_csv(predictor_file, index_col=0)
    # Check 1: File formats
    assert response_df.index.name == 'gene_id', "Response file must have 'gene_id' as index"
    assert predictor_df.index.name == 'gene_id', "Predictor file must have 'gene_id' as index"
    # Check 2: Perturbed TF presence
    assert perturbed_tf in response_df.columns, f"Perturbed TF '{perturbed_tf}' not found in response"
    # Check 3: Gene ID overlap
    common_genes = set(response_df.index) & set(predictor_df.index)
    assert len(common_genes) > 100, f"Too few common genes: {len(common_genes)}"
    # Check 4: Data types
    assert response_df.dtypes.apply(lambda x: np.issubdtype(x, np.number)).all(), "Response data must be numeric"
    assert predictor_df.dtypes.apply(lambda x: np.issubdtype(x, np.number)).all(), "Predictor data must be numeric"
    # Check 5: Missing values
    assert not response_df.isnull().any().any(), "Response data contains missing values"
    assert not predictor_df.isnull().any().any(), "Predictor data contains missing values"
    # Check 6: Value ranges
    predictor_ranges = predictor_df.describe()
    if (predictor_ranges.loc['min'] < 0).any():
        print("Warning: Some predictor values are negative")
    if (predictor_ranges.loc['max'] > 1).any():
        print("Warning: Some predictor values exceed 1")
    print("✓ All validation checks passed")

# Run validation
validate_input_data('response.csv', 'predictors.csv', 'YPD1')
Manual Quality Assessment¶
import pandas as pd
import matplotlib.pyplot as plt

# Data exploration
def explore_data(response_file, predictor_file):
    response_df = pd.read_csv(response_file, index_col=0)
    predictor_df = pd.read_csv(predictor_file, index_col=0)
    print("=== Response Data ===")
    print(f"Shape: {response_df.shape}")
    print(f"Columns: {list(response_df.columns)}")
    print(f"Value range: [{response_df.min().min():.3f}, {response_df.max().max():.3f}]")
    print(f"Missing values: {response_df.isnull().sum().sum()}")
    print("\n=== Predictor Data ===")
    print(f"Shape: {predictor_df.shape}")
    if len(predictor_df.columns) > 5:
        print(f"Columns (first 5): {list(predictor_df.columns[:5])} ...")
    else:
        print(f"Columns: {list(predictor_df.columns)}")
    print(f"Value range: [{predictor_df.min().min():.3f}, {predictor_df.max().max():.3f}]")
    print(f"Missing values: {predictor_df.isnull().sum().sum()}")
    # Plot the distribution of the first column of each file
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    # Response distribution
    response_df.iloc[:, 0].hist(bins=50, ax=axes[0])
    axes[0].set_title('Response Data Distribution')
    axes[0].set_xlabel('Expression Values')
    # Predictor distribution
    predictor_df.iloc[:, 0].hist(bins=50, ax=axes[1])
    axes[1].set_title('Predictor Data Distribution')
    axes[1].set_xlabel('Binding Values')
    plt.tight_layout()
    plt.savefig('data_distributions.png')
    plt.show()

# Explore your data
explore_data('response.csv', 'predictors.csv')
Example Data Preparation Workflow¶
Complete Pipeline¶
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def prepare_tfbp_data(raw_expression_file, raw_binding_file,
                      perturbed_tf, output_prefix):
    """Complete data preparation pipeline."""
    # 1. Load raw data
    print("Loading raw data...")
    expr_raw = pd.read_csv(raw_expression_file, index_col=0)
    binding_raw = pd.read_csv(raw_binding_file, index_col=0)

    # 2. Process expression data
    print("Processing expression data...")
    # Log2-transform when all values are non-negative (i.e. the data look like raw counts/intensities)
    if expr_raw.min().min() >= 0:
        expr_processed = np.log2(expr_raw + 1)
    else:
        expr_processed = expr_raw
    # Add a perturbed TF column if absent (placeholder: mean across all samples;
    # substitute the actual perturbation response for real analyses)
    if perturbed_tf not in expr_processed.columns:
        expr_processed[perturbed_tf] = expr_processed.mean(axis=1)

    # 3. Process binding data
    print("Processing binding data...")
    # Normalize each TF column to the 0-1 range
    scaler = MinMaxScaler()
    binding_processed = pd.DataFrame(
        scaler.fit_transform(binding_raw),
        index=binding_raw.index,
        columns=binding_raw.columns
    )

    # 4. Align gene IDs
    print("Aligning gene identifiers...")
    common_genes = sorted(set(expr_processed.index) & set(binding_processed.index))
    print(f"Common genes: {len(common_genes)}")
    expr_aligned = expr_processed.loc[common_genes]
    binding_aligned = binding_processed.loc[common_genes]

    # 5. Quality checks
    print("Running quality checks...")
    assert len(common_genes) > 100, "Insufficient gene overlap"
    assert perturbed_tf in expr_aligned.columns, "Perturbed TF missing"
    assert not expr_aligned.isnull().any().any(), "Missing expression data"
    assert not binding_aligned.isnull().any().any(), "Missing binding data"

    # 6. Save processed data
    print("Saving processed data...")
    expr_aligned.index.name = 'gene_id'
    binding_aligned.index.name = 'gene_id'
    expr_aligned.to_csv(f'{output_prefix}_response.csv')
    binding_aligned.to_csv(f'{output_prefix}_predictors.csv')

    print("✓ Data preparation complete!")
    print(f"  Response file: {output_prefix}_response.csv")
    print(f"  Predictors file: {output_prefix}_predictors.csv")
    print(f"  Genes: {len(common_genes)}")
    print(f"  Expression samples: {len(expr_aligned.columns)}")
    print(f"  TF predictors: {len(binding_aligned.columns)}")

# Run preparation
prepare_tfbp_data(
    raw_expression_file='raw_expression.csv',
    raw_binding_file='raw_binding.csv',
    perturbed_tf='YPD1',
    output_prefix='processed'
)
Common Issues and Solutions¶
Issue 1: Gene ID Mismatches¶
Problem: Gene IDs don't match between files
Solution: Use gene ID mapping:
# Load ID mapping (two columns: old_id, new_id)
id_mapping = pd.read_csv('gene_id_mapping.csv')
mapping_dict = dict(zip(id_mapping['old_id'], id_mapping['new_id']))
# Apply mapping, keeping the original ID where no mapping exists
response_df.index = response_df.index.map(lambda g: mapping_dict.get(g, g))
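One caveat: if two old IDs map to the same new ID, the remapped index will contain duplicates. A quick check, continuing from the mapping above, avoids carrying duplicated rows forward:

# Drop rows whose gene ID became duplicated after remapping (keep the first occurrence)
dup_mask = response_df.index.duplicated(keep='first')
if dup_mask.any():
    print(f"Warning: dropping {dup_mask.sum()} duplicate gene IDs after mapping")
    response_df = response_df[~dup_mask]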
Issue 2: Missing Values¶
Problem: Missing data in binding matrix
Solution: Impute or filter:
# Option 1: Remove genes with missing binding data
complete_genes = binding_df.dropna().index
response_df = response_df.loc[complete_genes]
binding_df = binding_df.loc[complete_genes]
# Option 2: Impute missing values with the median of each TF column
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
binding_imputed = pd.DataFrame(
    imputer.fit_transform(binding_df),
    index=binding_df.index,
    columns=binding_df.columns
)
Issue 3: Scale Differences¶
Problem: Binding values not in 0-1 range
Solution: Normalize appropriately:
# For ChIP-seq peak heights (non-negative values): divide each column by its maximum
binding_normalized = binding_df / binding_df.max()
# For arbitrary count data: min-max scale each column
binding_normalized = (binding_df - binding_df.min()) / (binding_df.max() - binding_df.min())
# For already-processed scores: clip stray values into range
binding_clipped = binding_df.clip(0, 1)
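Whichever transform you use, it is worth re-checking the result before saving. A minimal sketch; note that dividing by a column maximum of zero would introduce NaNs, which this catches:

# Confirm the normalized matrix is complete and within [0, 1]
assert not binding_normalized.isnull().any().any(), "Normalization introduced NaNs"
assert binding_normalized.min().min() >= 0, "Values below 0 remain"
assert binding_normalized.max().max() <= 1, "Values above 1 remain"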
This comprehensive guide should help you prepare properly formatted input data for tfbpmodeling analysis.