Basic Workflow Tutorial¶
This tutorial walks through a complete tfbpmodeling analysis from data preparation to result interpretation.
Overview¶
We'll analyze the relationship between transcription factor binding and gene expression perturbation using a sample dataset. The workflow demonstrates:
- Data preparation: Formatting input files
- Basic analysis: Running with default parameters
- Result interpretation: Understanding output files
- Parameter tuning: Optimizing for your data
Prerequisites¶
- tfbpmodeling installed and configured
- Basic familiarity with CSV files and command-line interfaces
- Understanding of transcription factor biology (helpful but not required)
Sample Data¶
For this tutorial, we'll use example data representing:
- Response data: Gene expression changes after YPD1 knockout
- Predictor data: Transcription factor binding probabilities from ChIP-seq
Creating Sample Data¶
import pandas as pd
import numpy as np
# Set random seed for reproducibility
np.random.seed(42)
# Create sample gene list
genes = [f"YBR{str(i).zfill(3)}W" for i in range(1, 1001)]
samples = [f"sample_{i}" for i in range(1, 101)]
tfs = [f"TF_{i}" for i in range(1, 51)]
# Generate response data (expression changes)
response_data = pd.DataFrame(
np.random.normal(0, 1, (1000, 100)),
index=genes,
columns=samples
)
response_data.index.name = 'gene_id'
# Add YPD1 column (our perturbed TF)
response_data['YPD1'] = np.random.normal(-0.5, 0.8, 1000)
# Generate predictor data (binding probabilities)
predictor_data = pd.DataFrame(
np.random.beta(0.5, 2, (1000, 50)),
index=genes,
columns=tfs
)
predictor_data.index.name = 'gene_id'
# Save to CSV
response_data.to_csv('tutorial_expression.csv')
predictor_data.to_csv('tutorial_binding.csv')
print("Sample data created:")
print(f"Response data: {response_data.shape}")
print(f"Predictor data: {predictor_data.shape}")
Step 1: Basic Analysis¶
Run Default Analysis¶
Start with the simplest possible command:
python -m tfbpmodeling linear_perturbation_binding_modeling \
--response_file tutorial_expression.csv \
--predictors_file tutorial_binding.csv \
--perturbed_tf YPD1
This command will:
- Use 1000 bootstrap samples
- Apply 98% confidence interval for feature selection
- Process top 600 features in second round
- Save results to ./linear_perturbation_binding_modeling_results/YPD1_{timestamp}/
Monitor Progress¶
The command provides real-time progress information:
2024-01-15 14:30:22 - INFO - Starting linear perturbation binding modeling
2024-01-15 14:30:22 - INFO - Loading response data from: tutorial_expression.csv
2024-01-15 14:30:23 - INFO - Loading predictor data from: tutorial_binding.csv
2024-01-15 14:30:23 - INFO - Perturbed TF: YPD1
2024-01-15 14:30:23 - INFO - Data preprocessing complete
2024-01-15 14:30:23 - INFO - Features: 1000, Samples: 100
2024-01-15 14:30:24 - INFO - Starting Stage 1: Bootstrap modeling on all data
2024-01-15 14:30:24 - INFO - Bootstrap parameters: n_bootstraps=1000, random_state=None
2024-01-15 14:32:15 - INFO - Stage 1 complete. Significant features: 156
2024-01-15 14:32:15 - INFO - Starting Stage 2: Top-N modeling
2024-01-15 14:33:45 - INFO - Stage 2 complete. Refined features: 78
2024-01-15 14:33:45 - INFO - Starting Stage 3: Interactor significance testing
2024-01-15 14:34:20 - INFO - Analysis complete. Results saved to: ./linear_perturbation_binding_modeling_results/YPD1_20240115_143022/
Step 2: Understanding Results¶
Output Directory Structure¶
After completion, examine the results directory:
ls -la linear_perturbation_binding_modeling_results/YPD1_*/
YPD1_20240115_143022/
├── all_data_results/
│ ├── bootstrap_coefficients.csv
│ ├── confidence_intervals.csv
│ ├── model_statistics.csv
│ └── diagnostic_plots/
├── topn_results/
│ ├── bootstrap_coefficients.csv
│ ├── confidence_intervals.csv
│ ├── model_statistics.csv
│ └── diagnostic_plots/
├── interactor_significance/
│ ├── significance_results.csv
│ ├── comparison_statistics.csv
│ └── final_selection.csv
├── input_data/
│ ├── processed_response.csv
│ ├── processed_predictors.csv
│ └── data_summary.json
└── tfbpmodeling_20240115_143022.log
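Since each run writes to its own timestamped directory, it can be convenient to resolve the most recent one programmatically for downstream scripts. Below is a minimal sketch using only the Python standard library; it assumes the default output location shown above.
from pathlib import Path
# Default output root used by the command above
results_root = Path("linear_perturbation_binding_modeling_results")
# For the default run name, the timestamp sorts lexicographically,
# so the last entry is the newest run
run_dirs = sorted(results_root.glob("YPD1_*"))
latest_run = run_dirs[-1] if run_dirs else None
print(f"Most recent results directory: {latest_run}")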
Key Result Files¶
1. Confidence Intervals (most important)¶
head -10 YPD1_*/all_data_results/confidence_intervals.csv
feature,mean_coef,std_coef,ci_lower,ci_upper,significant,abs_mean_coef
TF_1,0.023,0.008,0.007,0.039,True,0.023
TF_2,-0.045,0.012,-0.069,-0.021,True,0.045
TF_3,0.001,0.006,-0.011,0.013,False,0.001
Key columns:
- feature: Transcription factor name
- mean_coef: Average effect size across bootstrap samples
- ci_lower/ci_upper: Confidence interval bounds
- significant: Whether the confidence interval excludes zero (i.e., the effect is statistically significant)
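To pull the strongest hits out of this file programmatically, it can be loaded with pandas and filtered on the significant column. This is a minimal sketch, assuming the column names shown above; replace the example path with your own timestamped run directory.
import pandas as pd
# Example path; adjust to your own run directory
ci = pd.read_csv("YPD1_20240115_143022/all_data_results/confidence_intervals.csv")
# Keep features whose confidence interval excludes zero, ranked by absolute effect size
top_hits = ci[ci["significant"]].sort_values("abs_mean_coef", ascending=False)
print(top_hits[["feature", "mean_coef", "ci_lower", "ci_upper"]].head(10))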
2. Model Statistics¶
cat YPD1_*/all_data_results/model_statistics.csv
metric,value
mean_r2,0.234
std_r2,0.023
mean_cv_score,0.198
n_significant_features,156
total_features,1000
Key metrics:
- mean_r2: Average R² across bootstrap models (explanatory power)
- mean_cv_score: Average cross-validation score (generalization performance)
- n_significant_features: Count of statistically significant predictors
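These metrics are also easy to read into a dictionary if you want to track them across runs. A small sketch, assuming the two-column metric/value layout shown above and an example path:
import pandas as pd
# Example path; adjust to your own run directory
stats = pd.read_csv("YPD1_20240115_143022/all_data_results/model_statistics.csv")
metrics = dict(zip(stats["metric"], stats["value"]))
print(f"Mean R²: {metrics['mean_r2']}")
print(f"Significant features: {metrics['n_significant_features']}")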
3. Final Significant Interactions¶
head -10 YPD1_*/interactor_significance/final_selection.csv
feature,interaction_coef,main_effect_coef,p_value,significant,effect_size
TF_1:binding_strength,0.034,0.012,0.003,True,0.022
TF_2:binding_strength,-0.028,-0.008,0.012,True,0.020
This shows transcription factors with significant interaction effects beyond their main effects.
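The same file can be summarized with pandas to list the interactions that passed, alongside their interaction and main-effect coefficients. A minimal sketch, assuming the column names shown above and an example path:
import pandas as pd
# Example path; adjust to your own run directory
final = pd.read_csv("YPD1_20240115_143022/interactor_significance/final_selection.csv")
# Report each significant interaction alongside its main effect
for _, row in final[final["significant"]].iterrows():
    print(f"{row['feature']}: interaction={row['interaction_coef']:.3f}, "
          f"main effect={row['main_effect_coef']:.3f}, p={row['p_value']:.3g}")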
Step 3: Interpreting Results¶
Biological Interpretation¶
- Significant Features: TFs whose confidence intervals exclude zero affect YPD1 expression
- Effect Direction: Positive coefficients indicate binding increases expression
- Effect Size: Larger absolute coefficients indicate stronger effects
- Interactions: Features in final selection have context-dependent effects
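One simple way to summarize effect direction is to split the significant features by coefficient sign. A minimal sketch, assuming the confidence-interval columns shown earlier; the activating/repressing labels are an interpretive shorthand, not an output of the tool.
import pandas as pd
import numpy as np
# Example path; adjust to your own run directory
ci = pd.read_csv("YPD1_20240115_143022/all_data_results/confidence_intervals.csv")
sig = ci[ci["significant"]].copy()
# Positive coefficient: binding associated with increased expression (putative activator)
# Negative coefficient: binding associated with decreased expression (putative repressor)
sig["direction"] = np.where(sig["mean_coef"] > 0, "activating", "repressing")
print(sig["direction"].value_counts())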
Statistical Interpretation¶
- Confidence Intervals: 98% CIs that exclude zero are statistically significant
- Bootstrap Stability: Lower standard deviations indicate more stable effects
- Cross-Validation: CV scores show generalization performance
- Multiple Testing: Built-in correction through bootstrap resampling
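To make the confidence-interval logic concrete, here is a self-contained illustration of how a percentile bootstrap interval is formed from resampled coefficients. It shows the general idea only, not tfbpmodeling's internal implementation.
import numpy as np
rng = np.random.default_rng(42)
# Pretend these are one feature's coefficients across 1000 bootstrap fits
boot_coefs = rng.normal(loc=0.023, scale=0.008, size=1000)
# A 98% percentile interval spans the 1st to 99th percentiles of the bootstrap distribution
ci_lower, ci_upper = np.percentile(boot_coefs, [1, 99])
# The feature is called significant when the interval excludes zero
significant = not (ci_lower <= 0 <= ci_upper)
print(f"98% CI: [{ci_lower:.3f}, {ci_upper:.3f}]  significant: {significant}")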
Example Interpretation¶
From our results:
- TF_1 (coef: 0.023): Binding increases YPD1 expression
- TF_2 (coef: -0.045): Binding decreases YPD1 expression
- TF_3 (coef: 0.001, not significant): No detectable effect
Step 4: Parameter Optimization¶
Increasing Statistical Power¶
For more robust results, increase bootstrap samples:
python -m tfbpmodeling linear_perturbation_binding_modeling \
--response_file tutorial_expression.csv \
--predictors_file tutorial_binding.csv \
--perturbed_tf YPD1 \
--n_bootstraps 2000 \
--output_suffix _high_power
Adjusting Sensitivity¶
For more sensitive detection, lower the confidence interval levels used for feature selection:
python -m tfbpmodeling linear_perturbation_binding_modeling \
--response_file tutorial_expression.csv \
--predictors_file tutorial_binding.csv \
--perturbed_tf YPD1 \
--all_data_ci_level 95.0 \
--topn_ci_level 85.0 \
--output_suffix _sensitive
Adding Feature Engineering¶
Include polynomial terms for non-linear relationships:
python -m tfbpmodeling linear_perturbation_binding_modeling \
--response_file tutorial_expression.csv \
--predictors_file tutorial_binding.csv \
--perturbed_tf YPD1 \
--squared_pTF \
--ptf_main_effect \
--row_max \
--output_suffix _engineered
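Conceptually, these flags expand the design matrix with transformed versions of the binding signal. The sketch below illustrates, on the tutorial predictor file, what a squared binding term and a row-wise maximum look like; it is only a picture of the idea, not tfbpmodeling's internal feature construction, and the TF_1 column is an arbitrary stand-in for the perturbed TF's binding.
import pandas as pd
predictors = pd.read_csv("tutorial_binding.csv", index_col="gene_id")
# Squared binding term: lets the model capture non-linear (e.g., saturating) relationships
squared_term = predictors["TF_1"] ** 2
# Row-wise maximum: the strongest binding signal observed for each gene across all TFs
row_max = predictors.max(axis=1)
print(squared_term.head())
print(row_max.head())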
Reproducible Analysis¶
For reproducible results, set random seed:
python -m tfbpmodeling linear_perturbation_binding_modeling \
--response_file tutorial_expression.csv \
--predictors_file tutorial_binding.csv \
--perturbed_tf YPD1 \
--random_state 42 \
--output_suffix _reproducible
Step 5: Comparing Results¶
Compare Different Analyses¶
# List all result directories
ls -d YPD1_*/
# Compare significant feature counts
echo "Analysis,Significant_Features"
for dir in YPD1_*/; do
count=$(tail -n +2 "$dir/all_data_results/confidence_intervals.csv" | awk -F',' '$6=="True"' | wc -l)
echo "$dir,$count"
done
Visualize Results¶
import pandas as pd
import matplotlib.pyplot as plt
# Load confidence intervals from different analyses
# (adjust these directory names to match your own timestamped result directories)
default_ci = pd.read_csv('YPD1_20240115_143022/all_data_results/confidence_intervals.csv')
sensitive_ci = pd.read_csv('YPD1_sensitive_20240115_143522/all_data_results/confidence_intervals.csv')
# Compare significant feature counts
print(f"Default analysis: {default_ci['significant'].sum()} significant features")
print(f"Sensitive analysis: {sensitive_ci['significant'].sum()} significant features")
# Plot effect size distributions
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.hist(default_ci['abs_mean_coef'], bins=30, alpha=0.7, label='Default')
ax1.set_xlabel('Absolute Effect Size')
ax1.set_ylabel('Frequency')
ax1.set_title('Default Analysis')
ax2.hist(sensitive_ci['abs_mean_coef'], bins=30, alpha=0.7, label='Sensitive', color='orange')
ax2.set_xlabel('Absolute Effect Size')
ax2.set_ylabel('Frequency')
ax2.set_title('Sensitive Analysis')
plt.tight_layout()
plt.savefig('effect_size_comparison.png')
plt.show()
Next Steps¶
For Your Own Data¶
- Prepare your files: Follow the CSV format requirements
- Start simple: Use default parameters first
- Validate results: Check that results make biological sense
- Optimize parameters: Adjust based on data characteristics
- Document analysis: Save parameter choices and interpretations
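Before a full run on your own data, a quick sanity check that the two files share gene identifiers and contain clean numeric values can save a failed analysis. A minimal sketch, assuming the gene_id index layout used in this tutorial; the file names are placeholders for your own inputs.
import pandas as pd
# Placeholder file names; substitute your own response and predictor files
response = pd.read_csv("my_expression.csv", index_col="gene_id")
predictors = pd.read_csv("my_binding.csv", index_col="gene_id")
# The two files should share gene identifiers
shared = response.index.intersection(predictors.index)
print(f"Shared genes: {len(shared)} (response: {len(response)}, predictors: {len(predictors)})")
# Values should be numeric with no missing entries
print("Missing values in response:", response.isna().sum().sum())
print("Missing values in predictors:", predictors.isna().sum().sum())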
Advanced Techniques¶
- Advanced Features Tutorial: Feature engineering and model tuning
- Input Formats Guide: Detailed data preparation instructions
- CLI Reference: Complete parameter documentation
Common Issues¶
Low R² Scores¶
- Cause: Weak signal, noisy data, or model misspecification
- Solutions: Increase sample size, add feature engineering, check data quality
Few Significant Features¶
- Cause: Stringent thresholds or weak effects
- Solutions: Lower confidence levels, increase bootstrap samples, check effect sizes
Long Runtime¶
- Cause: Large datasets or high bootstrap counts
- Solutions: Reduce the bootstrap count, increase --n_cpus, or run on a smaller data subset for testing
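For the last point, you can cut the tutorial files down to a few hundred genes to get a fast end-to-end test before a full run. A minimal sketch; the subset file names are just examples.
import pandas as pd
response = pd.read_csv("tutorial_expression.csv", index_col="gene_id")
predictors = pd.read_csv("tutorial_binding.csv", index_col="gene_id")
# Draw a random subset of genes and keep both files aligned on the same rows
subset_genes = response.sample(n=200, random_state=42).index
response.loc[subset_genes].to_csv("tutorial_expression_subset.csv")
predictors.loc[subset_genes].to_csv("tutorial_binding_subset.csv")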
Summary¶
This tutorial demonstrated:
- Basic analysis with default parameters
- Result interpretation using key output files
- Parameter optimization for different analysis goals
- Comparison methods for evaluating different approaches
The tfbpmodeling workflow provides a robust framework for analyzing transcription factor binding and perturbation relationships while controlling for multiple testing and providing interpretable results.