Quick Start Guide¶
This guide will walk you through your first analysis with tfbpmodeling using example data.
Overview¶
tfbpmodeling analyzes the relationship between transcription factor binding data and gene expression perturbation data through a 4-stage workflow:
- Bootstrap modeling on all data to identify significant binding-expression relationships
- Top-N modeling on the most significant predictors from high-performing data
- Interaction analysis to evaluate interaction terms vs main effects
- Results generation with comprehensive statistics and confidence intervals
Prepare Your Data¶
tfbpmodeling requires two main input files:
Response File (Gene Expression Data)¶
CSV format with genes as rows and samples as columns:
gene_id,sample1,sample2,sample3,sample4
YPD1,0.23,-1.45,0.87,-0.12
YBR123W,1.34,0.56,-0.23,0.78
YCR456X,-0.45,0.12,1.23,-0.56
- First column: Gene identifiers
- Subsequent columns: Expression values for each sample
- Must contain a column matching your --perturbed_tf parameter
Predictors File (Binding Data)¶
CSV format with genes as rows and transcription factors as columns:
gene_id,TF1,TF2,TF3,TF4
YPD1,0.34,0.12,0.78,0.01
YBR123W,0.89,0.45,0.23,0.67
YCR456X,0.12,0.78,0.34,0.90
- First column: Gene identifiers (must match response file)
- Subsequent columns: Binding measurements for different TFs
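Before launching a run, it can help to confirm that the two files agree. The sketch below is not part of tfbpmodeling; `read_table` and `check_inputs` are hypothetical helper names, and it reads "must match" as "the same set of gene identifiers", which is an assumption. It uses only the Python standard library and checks the two requirements listed above: a response column named after the perturbed TF, and matching gene identifiers.

```python
import csv
import io


def read_table(handle):
    """Return (column_names, gene_ids) from a CSV whose first column holds gene IDs."""
    reader = csv.reader(handle)
    header = next(reader)
    gene_ids = [row[0] for row in reader if row]
    return header[1:], gene_ids


def check_inputs(response_handle, predictors_handle, perturbed_tf):
    """Collect human-readable problems; an empty list means both checks passed."""
    response_cols, response_genes = read_table(response_handle)
    _, predictor_genes = read_table(predictors_handle)
    problems = []
    if perturbed_tf not in response_cols:
        problems.append(f"response file has no column named {perturbed_tf!r}")
    # Assumption: "must match" means both files list the same set of gene IDs.
    mismatched = sorted(set(response_genes) ^ set(predictor_genes))
    if mismatched:
        problems.append(f"gene IDs present in only one file: {mismatched}")
    return problems


# Tiny in-memory stand-ins for the two CSV files (illustrative values only).
response = io.StringIO("gene_id,TF1,TF2\nYBR123W,1.34,0.56\nYCR456X,-0.45,0.12\n")
predictors = io.StringIO("gene_id,TF1,TF2\nYBR123W,0.89,0.45\nYCR456X,0.12,0.78\n")
print(check_inputs(response, predictors, "TF1"))  # prints []
```

With real files, pass handles opened with `open(path, newline="")` instead of the `io.StringIO` objects.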
Basic Analysis¶
Minimal Command¶
Run a basic analysis with default parameters:
python -m tfbpmodeling linear_perturbation_binding_modeling \
--response_file data/expression.csv \
--predictors_file data/binding.csv \
--perturbed_tf YPD1
This command will:
- Use 1000 bootstrap samples
- Apply a 98% confidence interval for initial feature selection
- Use a 90% confidence interval for second-round modeling
- Process the top 600 features in the second round
- Save results to ./linear_perturbation_binding_modeling_results/YPD1/
With Custom Parameters¶
python -m tfbpmodeling linear_perturbation_binding_modeling \
--response_file data/expression.csv \
--predictors_file data/binding.csv \
--perturbed_tf YPD1 \
--n_bootstraps 2000 \
--top_n 500 \
--all_data_ci_level 95.0 \
--topn_ci_level 85.0 \
--output_dir ./my_results \
--output_suffix _custom_analysis \
--random_state 42
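For scripted or repeated runs, the same command can be assembled in Python. This is a minimal sketch, not an official wrapper: `build_command` is a hypothetical helper, and it only uses the flags shown in this guide. It prints the command so you can inspect it before running.

```python
import shlex
import sys


def build_command(response_file, predictors_file, perturbed_tf, **options):
    """Assemble the argv list for one run; keyword options become --name value pairs."""
    cmd = [
        sys.executable, "-m", "tfbpmodeling",
        "linear_perturbation_binding_modeling",
        "--response_file", response_file,
        "--predictors_file", predictors_file,
        "--perturbed_tf", perturbed_tf,
    ]
    for name, value in options.items():
        cmd += [f"--{name}", str(value)]
    return cmd


cmd = build_command(
    "data/expression.csv", "data/binding.csv", "YPD1",
    n_bootstraps=2000, top_n=500, random_state=42,
)
print(shlex.join(cmd))
# To execute the command: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file paths contain spaces.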
Understanding the Output¶
Results are saved in a timestamped subdirectory within your specified output directory:
my_results/YPD1_custom_analysis_20240115_143022/
├── all_data_results/
│ ├── bootstrap_coefficients.csv
│ ├── confidence_intervals.csv
│ ├── model_statistics.csv
│ └── diagnostic_plots/
├── topn_results/
│ ├── bootstrap_coefficients.csv
│ ├── confidence_intervals.csv
│ ├── model_statistics.csv
│ └── diagnostic_plots/
├── interactor_significance/
│ ├── significance_results.csv
│ ├── comparison_statistics.csv
│ └── final_selection.csv
└── tfbpmodeling_20240115_143022.log
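Because each run gets its own timestamped directory, scripts usually need to locate the most recent one. The sketch below assumes the naming pattern shown in the tree above (a prefix followed by a `YYYYMMDD_HHMMSS` stamp), so that sorting names alphabetically also sorts them by time. `latest_run` is a hypothetical helper, and the second timestamp in the demo is invented for illustration.

```python
import tempfile
from pathlib import Path


def latest_run(output_dir, prefix):
    """Return the newest run directory whose name starts with `prefix`, or None."""
    runs = sorted(p for p in Path(output_dir).glob(f"{prefix}_*") if p.is_dir())
    return runs[-1] if runs else None


# Demo in a throwaway directory so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for stamp in ("20240115_143022", "20240116_090000"):
        (Path(tmp) / f"YPD1_custom_analysis_{stamp}").mkdir()
    newest = latest_run(tmp, "YPD1_custom_analysis")
    print(newest.name)  # prints YPD1_custom_analysis_20240116_090000
```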
Key Output Files¶
Bootstrap Coefficients¶
Contains coefficient estimates from each bootstrap sample:
- Rows: Bootstrap samples
- Columns: Model features
- Values: Coefficient estimates
Confidence Intervals¶
Statistical significance of each feature:
- feature: Feature name
- mean_coef: Mean coefficient across bootstrap samples
- ci_lower: Lower confidence interval bound
- ci_upper: Upper confidence interval bound
- significant: Boolean indicating statistical significance
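To make these columns concrete, the sketch below derives them from one feature's bootstrap coefficients using a plain percentile interval. This guide does not state how tfbpmodeling computes its intervals, so both the percentile method and the rule "significant when the interval excludes zero" are assumptions here. The coefficient values are made up.

```python
import statistics


def percentile_ci(values, ci_level):
    """Return (lower, upper) bounds of a two-sided percentile interval."""
    ordered = sorted(values)
    tail = (100.0 - ci_level) / 200.0  # fraction trimmed from each tail
    lo_idx = round(tail * (len(ordered) - 1))
    hi_idx = len(ordered) - 1 - lo_idx
    return ordered[lo_idx], ordered[hi_idx]


def summarize(feature, coefs, ci_level):
    """Build one row with the columns described above."""
    lower, upper = percentile_ci(coefs, ci_level)
    return {
        "feature": feature,
        "mean_coef": statistics.fmean(coefs),
        "ci_lower": lower,
        "ci_upper": upper,
        # Assumed rule: significant when the interval does not contain zero.
        "significant": lower > 0 or upper < 0,
    }


# Made-up bootstrap coefficients for two hypothetical features.
print(summarize("TF1", [0.5, 0.6, 0.7, 0.8], ci_level=90.0))
print(summarize("TF2", [-0.2, 0.1, 0.3, -0.1], ci_level=90.0))
```

Under this rule, a lower confidence level gives a narrower interval, so more features are flagged as significant.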
Model Statistics¶
Overall model performance metrics:
- R² scores across bootstrap samples
- Cross-validation performance
- Feature selection statistics
Advanced Features¶
Feature Engineering¶
Add polynomial terms and custom variables:
python -m tfbpmodeling linear_perturbation_binding_modeling \
--response_file data/expression.csv \
--predictors_file data/binding.csv \
--perturbed_tf YPD1 \
--squared_pTF \
--cubic_pTF \
--row_max \
--ptf_main_effect \
--add_model_variables "red_median,green_median"
Data Processing Options¶
Control data preprocessing:
python -m tfbpmodeling linear_perturbation_binding_modeling \
--response_file data/expression.csv \
--predictors_file data/binding.csv \
--perturbed_tf YPD1 \
--normalize_sample_weights \
--scale_by_std \
--bins "0,5,10,15,np.inf"
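The `--bins` value reads as a comma-separated list of bin edges ending in infinity. The sketch below shows one way to interpret such a string: it parses the edges and assigns a value to the bin that contains it. This guide does not state how tfbpmodeling uses the bins or whether edges are left- or right-inclusive. The left-inclusive choice and the helper names `parse_bins` and `bin_index` are assumptions.

```python
import bisect
import math


def parse_bins(spec):
    """Turn '0,5,10,15,np.inf' into [0.0, 5.0, 10.0, 15.0, inf]."""
    return [
        math.inf if token.strip() in ("np.inf", "inf") else float(token)
        for token in spec.split(",")
    ]


def bin_index(value, edges):
    """Index i of the half-open bin [edges[i], edges[i+1]) containing value."""
    return bisect.bisect_right(edges, value) - 1


edges = parse_bins("0,5,10,15,np.inf")
for value in (0, 5, 7, 100):
    print(value, "-> bin", bin_index(value, edges))
```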
Excluding Features¶
Exclude specific genes or features:
# Create blacklist file (one identifier per line; printf is more portable than echo -e)
printf '%s\n' YBR999W YCR888X control_gene > blacklist.txt
# Run analysis with exclusions
python -m tfbpmodeling linear_perturbation_binding_modeling \
--response_file data/expression.csv \
--predictors_file data/binding.csv \
--perturbed_tf YPD1 \
--blacklist_file blacklist.txt \
--exclude_interactor_variables "batch_effect,technical_replicate"
Example Workflow¶
Here's a complete example workflow:
1. Prepare Data¶
# Create example data directory
mkdir -p example_data
# Your data preparation steps here
# (load and format your actual expression and binding data)
2. Run Basic Analysis¶
python -m tfbpmodeling linear_perturbation_binding_modeling \
--response_file example_data/expression.csv \
--predictors_file example_data/binding.csv \
--perturbed_tf YPD1 \
--random_state 12345 \
--output_dir ./results \
--output_suffix _basic_analysis
3. Run Advanced Analysis¶
python -m tfbpmodeling linear_perturbation_binding_modeling \
--response_file example_data/expression.csv \
--predictors_file example_data/binding.csv \
--perturbed_tf YPD1 \
--n_bootstraps 2000 \
--squared_pTF \
--ptf_main_effect \
--iterative_dropout \
--stage4_lasso \
--random_state 12345 \
--output_dir ./results \
--output_suffix _advanced_analysis
4. Compare Results¶
# Compare the two analyses
ls -la results/YPD1_*_analysis_*/
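Listing the directories shows that both runs finished, but not how their results differ. The sketch below compares which features each run flagged as significant. It assumes each run's confidence_intervals.csv has the `feature` and `significant` columns described under Key Output Files, and that booleans are written as the text True/False. `significant_features` is a hypothetical helper, and the demo rows are made up.

```python
import csv
import io


def significant_features(handle):
    """Return the set of features flagged significant in a confidence_intervals.csv."""
    return {
        row["feature"]
        for row in csv.DictReader(handle)
        # Assumption: the boolean column is serialized as "True"/"False".
        if row["significant"].strip().lower() == "true"
    }


# In-memory stand-ins for the two runs' files (illustrative rows only).
basic = io.StringIO("feature,significant\nTF1,True\nTF2,False\nTF3,True\n")
advanced = io.StringIO("feature,significant\nTF1,True\nTF2,True\nTF3,False\n")

basic_set = significant_features(basic)
advanced_set = significant_features(advanced)
print("shared:", sorted(basic_set & advanced_set))
print("only basic:", sorted(basic_set - advanced_set))
print("only advanced:", sorted(advanced_set - basic_set))
```

For real runs, open each run's `topn_results/confidence_intervals.csv` (or the `all_data_results` copy) with `open(path, newline="")` and pass the handle in.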
Next Steps¶
- CLI Reference: Complete documentation of all command-line options
- Tutorials: Detailed tutorials with real examples
- API Reference: Documentation for programmatic usage
- Input Formats: Detailed specifications for input data
Troubleshooting¶
Common Issues¶
File Not Found¶
# Verify your files exist and paths are correct
ls -la data/expression.csv data/binding.csv
Memory Issues¶
# Reduce bootstrap samples or top_n for large datasets
--n_bootstraps 500 --top_n 300
Convergence Issues¶
# Increase iteration limit
--max_iter 20000
No Significant Features¶
# Lower the confidence levels or check data quality
--all_data_ci_level 90.0 --topn_ci_level 80.0
For more help, see the troubleshooting section or open an issue on GitHub.