BrentLab Yeast Resources Collection¶
This document describes the BrentLab yeast resources collection on HuggingFace as an example implementation of the datacard specifications. This collection demonstrates best practices for organizing transcription factor binding and perturbation datasets for Saccharomyces cerevisiae.
Collection Overview¶
The BrentLab yeast resources collection contains 11 datasets related to yeast transcription factor binding and gene expression regulation:
- barkai_compendium - ChEC-seq binding data across multiple GEO series
- callingcards - Calling Cards transposon-based binding data
- hackett_2020 - TF overexpression with nutrient limitation
- harbison_2004 - ChIP-chip binding across 14 environmental conditions
- hu_2007_reimand_2010 - TF knockout expression data
- hughes_2006 - TF perturbation screen (overexpression and knockout)
- kemmeren_2014 - TF deletion expression profiling
- mahendrawada_2025 - ChEC-seq and nascent RNA-seq data
- rossi_2021 - ChIP-exo binding data
- yeast_comparative_analysis - Cross-dataset comparative analyses
- yeast_genome_resources - Reference genomic features
Standardized Media Names¶
The collection uses standardized media names to facilitate cross-dataset queries. When specifying media in datacards, use these canonical names:
Rich Media¶
- YPD (Yeast extract Peptone Dextrose)
  - Carbon source: 2% D-glucose
  - Nitrogen sources: 1% yeast extract, 2% peptone
  - Standard rich medium for yeast growth
- yeast_extract_peptone
  - Base medium without specified carbon source
  - Used with galactose (YPGal) or raffinose (YPRaff)
Minimal/Defined Media¶
- minimal or minimal_glucose
  - Minimal defined medium with glucose as carbon source
  - Nitrogen source varies by experiment
- synthetic_complete or synthetic_complete_dextrose
  - Defined medium with complete amino acid supplementation
  - Carbon source: typically 2% D-glucose
  - Nitrogen source: yeast nitrogen base + amino acid dropout mix
- synthetic_complete_minus_X
  - Synthetic complete medium lacking specific nutrient(s)
  - Examples: synthetic_complete_minus_thiamine, synthetic_complete_minus_phosphate
  - Used for nutrient deprivation experiments
- selective_medium
  - Defined medium for plasmid selection
  - Specific composition varies by selection markers
Standardized Strain Backgrounds¶
The collection primarily uses these strain backgrounds:
- BY4741 - MATa his3Δ1 leu2Δ0 met15Δ0 ura3Δ0
  - Used in: hu_2007_reimand_2010, kemmeren_2014
- W303 - Common alternative strain background
  - Used in: harbison_2004 (derivative Z1256)
- S288C - Reference genome strain
  - Used in: Various datasets
Strain background can be specified as a string or detailed object:
# Simple string
experimental_conditions:
  strain_background: BY4741

# Detailed specification
experimental_conditions:
  strain_background:
    genotype: BY4741
    mating_type: MATa
    markers:
      - his3Δ1
      - leu2Δ0
      - met15Δ0
      - ura3Δ0
    source: Open_Biosystems
    description: Knockout strains for nonessential transcription factors
Standard Experimental Conditions¶
Growth Temperature¶
Standard growth temperature across the collection is 30°C unless otherwise noted.
Exceptions:
- rossi_2021: 25°C baseline with 37°C heat shock for some samples
- hu_2007_reimand_2010: Heat shock at 39°C for heat shock response TFs
- callingcards: the experiments are performed at room temperature (~22-25°C)
Growth Phase¶
Common growth phase specifications are listed below. These labels are taken from the original publications; in some cases the OD600 is also noted.
- early_log_phase
- mid_log_phase
- late_log_phase
- stationary_phase - e.g., barkai_compendium, where cultures are grown overnight and the cells are harvested at a very high density (OD600 4.0)
Example:
experimental_conditions:
  growth_phase_at_harvest:
    stage: mid_log_phase
    od600: 0.6
    od600_tolerance: 0.1
Cultivation Methods¶
Standard cultivation methods used:
- liquid_culture - Standard batch culture in flasks
- batch - Batch culture
- plate - Growth on agar plates
- chemostat - Continuous culture (hackett_2020)
Concentration Specifications¶
Always use concentration_percent for all concentration specifications.
Convert other units to percentage:
- mg/ml to percent: divide by 10 (e.g., 5 mg/ml = 0.5%)
- g/L to percent: divide by 10 (e.g., 6.71 g/L = 0.671%)
- Molar to percent: multiply by the molecular weight (g/mol) to get g/L, then divide by 10
  - Example: 100 nM rapamycin (914.17 g/mol) = 9.142e-5 g/L = 9.142e-6%
Examples from the Collection¶
# Yeast nitrogen base: 6.71 g/L = 0.671%
- compound: yeast_nitrogen_base
  concentration_percent: 0.671

# Alpha factor: 5 mg/ml = 0.5%
- compound: alpha_factor_pheromone
  concentration_percent: 0.5

# Rapamycin: 100 nM = 9.142e-6%
chemical_treatment:
  compound: rapamycin
  concentration_percent: 9.142e-6
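The unit conversions above can be sketched as a few small helpers. This is a minimal illustration, not part of the collection's tooling: the function names are hypothetical, and the rapamycin molecular weight (914.17 g/mol) is supplied here as an assumption.

```python
def mg_per_ml_to_percent(mg_per_ml: float) -> float:
    """Percent here means g per 100 mL, so 10 mg/mL equals 1%."""
    return mg_per_ml / 10.0


def g_per_l_to_percent(g_per_l: float) -> float:
    """10 g/L equals 1 g per 100 mL, i.e. 1%."""
    return g_per_l / 10.0


def molar_to_percent(molar: float, molecular_weight: float) -> float:
    """mol/L times g/mol gives g/L, which is then converted to percent."""
    return g_per_l_to_percent(molar * molecular_weight)


RAPAMYCIN_MW = 914.17  # g/mol (assumed value, not stated in the datacards)

rapamycin_percent = molar_to_percent(100e-9, RAPAMYCIN_MW)
print(f"{rapamycin_percent:.3e}")  # 9.142e-06
```

Keeping the original value and unit in a comment next to the converted number, as the collection examples do, makes it easy to re-check a conversion like this later.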
Field Naming Conventions¶
The collection follows these field naming conventions:
Gene/Feature Identifiers¶
- regulator_locus_tag: Systematic ID of regulatory factor (e.g., “YJR060W”)
- regulator_symbol: Common name of regulatory factor (e.g., “CBF1”)
- target_locus_tag: Systematic ID of target gene
- target_symbol: Common name of target gene
All locus tags and symbols join to the yeast_genome_resources dataset.
Quantitative Measurements Examples¶
Common measurement field names:
- effect, log2fc, log2_ratio - Log fold change measurements
- pvalue, pval, p_value - Statistical significance
- padj, adj_p_value - FDR-adjusted p-values
- binding_score, peak_score - Binding strength metrics
- enrichment - Enrichment ratios
Experimental Metadata Examples¶
- sample_id - Unique sample identifier (integer)
- db_id - Legacy database identifier (deprecated, do not use)
- batch - Experimental batch identifier
- replicate - Biological replicate number
- time - Timepoint in timecourse experiments
Dataset Type Usage Examples¶
genomic_features¶
yeast_genome_resources provides reference annotations:
- Gene coordinates and strand information
- Systematic IDs (locus_tag) and common names (symbol)
- Feature types (gene, ncRNA_gene, tRNA_gene, etc.)
Used for joining regulator/target identifiers across all other datasets.
annotated_features¶
Most common dataset type in the collection. Examples:
- hackett_2020: TF overexpression with timecourse measurements
- harbison_2004: ChIP-chip binding with condition field definitions
- kemmeren_2014: TF deletion expression data
- mahendrawada_2025: ChEC-seq binding scores
Typical structure: regulator × target × measurements, with optional condition fields.
genome_map¶
Position-level data, typically partitioned by sample or accession:
- barkai_compendium: ChEC-seq pileup data partitioned by Series/Accession
- rossi_2021: ChIP-exo 5’ tag coverage partitioned by sample
- callingcards: Transposon insertion density partitioned by batch
metadata¶
Metadata is provided either as a separate metadata config or embedded in the data via metadata_fields:
Separate config example (barkai_compendium):
Embedded metadata example (harbison_2004):
- config_name: harbison_2004
  dataset_type: annotated_features
  metadata_fields: ["regulator_locus_tag", "regulator_symbol", "condition"]
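To make the idea of embedded metadata concrete, the sketch below derives one metadata record per sample from the data rows themselves, keeping only the columns named in metadata_fields. The helper name, the use of sample_id as the grouping key, and the toy rows (including the target tags and effect values) are assumptions for illustration; this is not how any particular library implements it.

```python
def extract_metadata(rows, metadata_fields, key="sample_id"):
    """Return {sample key: metadata record}, keeping only metadata_fields."""
    metadata = {}
    for row in rows:
        record = {field: row[field] for field in metadata_fields}
        # setdefault stores the first record seen for a sample and returns it
        previous = metadata.setdefault(row[key], record)
        if previous != record:
            raise ValueError(f"inconsistent metadata for {key}={row[key]}")
    return metadata


METADATA_FIELDS = ["regulator_locus_tag", "regulator_symbol", "condition"]

# Toy rows: measurement values are illustrative only.
rows = [
    {"sample_id": 1, "regulator_locus_tag": "YJR060W", "regulator_symbol": "CBF1",
     "condition": "YPD", "target_locus_tag": "YAL001C", "effect": 1.2},
    {"sample_id": 1, "regulator_locus_tag": "YJR060W", "regulator_symbol": "CBF1",
     "condition": "YPD", "target_locus_tag": "YAL002W", "effect": 0.3},
    {"sample_id": 2, "regulator_locus_tag": "YJR060W", "regulator_symbol": "CBF1",
     "condition": "SM", "target_locus_tag": "YAL001C", "effect": 0.9},
]

metadata = extract_metadata(rows, METADATA_FIELDS)
```

The consistency check reflects the expectation that metadata fields are constant within a sample, while measurement columns vary row by row.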
comparative¶
yeast_comparative_analysis provides cross-dataset analysis results:
- dto config: Direct Target Overlap analysis comparing binding and perturbation experiments
- Uses the source_sample role for composite identifiers
- Format: "repo_id;config_name;sample_id" (semicolon-separated)
- Contains 8 quantitative measures: rank thresholds, set sizes, FDR, p-values
- Partitioned by binding_repo_dataset and perturbation_repo_dataset
Composite Sample Identifiers:
Comparative datasets use composite identifiers to reference samples from other datasets:
- binding_id: Points to a binding experiment (e.g., BrentLab/callingcards;annotated_features;1)
- perturbation_id: Points to a perturbation experiment (e.g., BrentLab/hackett_2020;hackett_2020;200)
Typical structure: source_sample_1 × source_sample_2 × … × measurements
Use case: Answer questions like “Which binding experiments show significant overlap with perturbation effects?”
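A composite identifier can be split back into its three parts with a few lines of code. The sketch below is hypothetical (the class and function names are not part of the collection) and assumes that sample_id is an integer, as described under Experimental Metadata Examples.

```python
from typing import NamedTuple


class CompositeSampleId(NamedTuple):
    repo_id: str
    config_name: str
    sample_id: int


def parse_composite_id(value: str) -> CompositeSampleId:
    """Split 'repo_id;config_name;sample_id' into typed parts."""
    parts = value.split(";")
    if len(parts) != 3:
        raise ValueError(f"expected 3 semicolon-separated parts, got {len(parts)}: {value!r}")
    repo_id, config_name, sample_id = parts
    return CompositeSampleId(repo_id, config_name, int(sample_id))


binding = parse_composite_id("BrentLab/callingcards;annotated_features;1")
perturbation = parse_composite_id("BrentLab/hackett_2020;hackett_2020;200")
```

With the parts separated, binding.repo_id and binding.config_name identify which dataset and config to load, and binding.sample_id selects the sample within it.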
Categorical Condition Definitions¶
Many datasets define categorical experimental conditions using the definitions field.
harbison_2004 Environmental Conditions¶
14 conditions with detailed specifications:
- YPD (rich media baseline)
- SM (amino acid starvation)
- RAPA (rapamycin treatment)
- H2O2Hi, H2O2Lo (oxidative stress)
- HEAT (heat shock)
- GAL, RAFF (alternative carbon sources)
- And 6 more…
Each condition definition includes media composition, temperature, growth phase, and treatments.
hackett_2020 Nutrient Limitations¶
restriction:
  definitions:
    P:  # Phosphate limitation
      media:
        phosphate_source:
          - compound: potassium_phosphate_monobasic
            concentration_percent: 0.002
    N:  # Nitrogen limitation
      media:
        nitrogen_source:
          - compound: ammonium_sulfate
            concentration_percent: 0.004
    M:  # Undefined limitation
      description: "Not defined in the paper"
hu_2007_reimand_2010 Treatment Conditions¶
heat_shock:
  definitions:
    true:
      temperature_celsius: 39
      duration_minutes: 15
    false:
      description: Standard growth conditions at 30°C
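A categorical value in a data column (for example P or N in the hackett_2020 restriction field) is interpreted by looking it up in that field's definitions. The sketch below mirrors the hackett_2020 definitions as a plain Python dict, as a YAML parser might produce, and flattens the media entries for one level. The helper name and the returned tuple layout are assumptions for illustration.

```python
# Plain-dict mirror of the hackett_2020 `restriction` definitions.
RESTRICTION_DEFINITIONS = {
    "P": {"media": {"phosphate_source": [
        {"compound": "potassium_phosphate_monobasic", "concentration_percent": 0.002},
    ]}},
    "N": {"media": {"nitrogen_source": [
        {"compound": "ammonium_sulfate", "concentration_percent": 0.004},
    ]}},
    "M": {"description": "Not defined in the paper"},
}


def media_components(definitions, level):
    """Return (role, compound, concentration_percent) tuples for one level."""
    definition = definitions[level]  # raises KeyError for an undefined level
    components = []
    for role, entries in definition.get("media", {}).items():
        for entry in entries:
            components.append((role, entry["compound"], entry["concentration_percent"]))
    return components


phosphate_limited = media_components(RESTRICTION_DEFINITIONS, "P")
undefined_limited = media_components(RESTRICTION_DEFINITIONS, "M")  # no media listed
```

Levels that carry only a description, such as M, yield an empty list rather than an error, which keeps them distinguishable from levels that are missing from the definitions entirely.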
Partitioning Strategies¶
Large genome_map datasets use partitioning:
barkai_compendium - Two-level partitioning:
callingcards - Batch partitioning:
partitioning:
  enabled: true
  partition_by: ["batch"]
  path_template: "genome_map/batch={batch}/*.parquet"
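To show how a path_template relates to concrete files, the sketch below fills the template for one partition value and, in the other direction, recovers the partition value from a file path. Both helpers and the batch name run_01 are hypothetical; only the template string comes from the callingcards example.

```python
import re

PATH_TEMPLATE = "genome_map/batch={batch}/*.parquet"


def partition_glob(path_template: str, **partition_values: str) -> str:
    """Fill the {placeholders}; the result is still a glob pattern."""
    return path_template.format(**partition_values)


def partition_values_from_path(path_template: str, path: str) -> dict:
    """Recover placeholder values from a concrete path matching the template."""
    regex = ""
    # Split into literal text, {name} placeholders, and * wildcards.
    for piece in re.split(r"(\{\w+\}|\*)", path_template):
        if piece == "*":
            regex += r"[^/]*"
        elif piece.startswith("{") and piece.endswith("}"):
            regex += f"(?P<{piece[1:-1]}>[^/]+)"
        else:
            regex += re.escape(piece)
    match = re.fullmatch(regex, path)
    if match is None:
        raise ValueError(f"{path!r} does not match {path_template!r}")
    return match.groupdict()


pattern = partition_glob(PATH_TEMPLATE, batch="run_01")
values = partition_values_from_path(PATH_TEMPLATE, "genome_map/batch=run_01/part-0.parquet")
```

Here pattern is "genome_map/batch=run_01/*.parquet" and values is {"batch": "run_01"}, so a reader that only needs one batch can restrict itself to the matching directory.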
Collection-Wide Best Practices¶
1. Omit unspecified fields with a comment¶
tfbpapi adds “unspecified” for fields that are not common across datasets, so do not write it into the datacard yourself.
# CORRECT
experimental_conditions:
  temperature_celsius: 30
  # cultivation_method is not noted in the paper and is omitted

# INCORRECT
experimental_conditions:
  temperature_celsius: unspecified
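The reason omission is safe can be illustrated with a small sketch of the fill-in step. This is not tfbpapi's actual implementation or API; it is a hypothetical function, with made-up dataset names, showing one way fields missing from a datacard could be filled with "unspecified" when datasets are compared.

```python
def harmonize_conditions(conditions_by_dataset, fill_value="unspecified"):
    """Give every dataset the union of all fields, filling the gaps."""
    all_fields = sorted(
        {field for cond in conditions_by_dataset.values() for field in cond}
    )
    return {
        name: {field: cond.get(field, fill_value) for field in all_fields}
        for name, cond in conditions_by_dataset.items()
    }


# Hypothetical datacards: dataset_a omits cultivation_method entirely.
datacards = {
    "dataset_a": {"temperature_celsius": 30},
    "dataset_b": {"temperature_celsius": 30, "cultivation_method": "chemostat"},
}

harmonized = harmonize_conditions(datacards)
```

Because the gap is filled downstream, writing "unspecified" by hand only risks inconsistent spellings and type mismatches, such as a string in a numeric field.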
2. Document Source Publications¶
If the original paper reports a concentration in other units (e.g., g/L), convert it to concentration_percent and add a comment with the original value and units.
3. Use Standard Field Roles¶
Apply semantic roles consistently:
- regulator_identifier - for regulator fields
- target_identifier - for target fields
- quantitative_measure - for measurements
- experimental_condition - for condition fields
- genomic_coordinate - for positional data
4. Provide sample_id¶
All annotated_features datasets should include sample_id to uniquely identify experimental samples. This enables cross-dataset joining and metadata management.
5. Specify metadata_fields or applies_to¶
For datasets with metadata, either:
- Use metadata_fields to extract from the data itself, OR
- Create a separate metadata config with an applies_to field
6. Use Consistent Gene Identifiers¶
All regulator/target identifiers must be joinable to yeast_genome_resources:
- Use current systematic IDs (ORF names)
- Include both locus_tag and symbol fields
- Mark with appropriate roles
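A joinability check along these lines can be sketched as follows. The reference table here is a three-gene toy stand-in for yeast_genome_resources, the function name is hypothetical, and YXX000W is a deliberately invalid tag used to trigger the check.

```python
# Toy stand-in for the locus_tag -> symbol mapping in yeast_genome_resources.
REFERENCE = {"YJR060W": "CBF1", "YPL248C": "GAL4", "YEL009C": "GCN4"}


def unjoinable_identifiers(rows, reference, tag_field, symbol_field):
    """Return (locus_tag, symbol, reason) for rows that would not join cleanly."""
    problems = []
    for row in rows:
        tag, symbol = row[tag_field], row[symbol_field]
        if tag not in reference:
            problems.append((tag, symbol, "unknown locus_tag"))
        elif reference[tag] != symbol:
            problems.append((tag, symbol, "symbol mismatch"))
    return problems


rows = [
    {"regulator_locus_tag": "YJR060W", "regulator_symbol": "CBF1"},  # joins
    {"regulator_locus_tag": "YXX000W", "regulator_symbol": "FOO1"},  # invalid tag
    {"regulator_locus_tag": "YPL248C", "regulator_symbol": "GCN4"},  # wrong symbol
]

problems = unjoinable_identifiers(
    rows, REFERENCE, "regulator_locus_tag", "regulator_symbol"
)
```

The same function applies to target fields by passing target_locus_tag and target_symbol instead.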