DataCard Tutorial: Exploring HuggingFace Dataset Metadata
The DataCard class provides an interface for exploring HuggingFace dataset metadata without loading the actual genomic data. This is particularly useful for:
- Understanding dataset structure and available configurations
- Exploring experimental conditions at all hierarchy levels
- Discovering metadata relationships
- Planning data analysis workflows and metadata table creation
In this tutorial, we'll explore the BrentLab/harbison_2004 dataset, which contains ChIP-chip data for transcription factor binding across 14 environmental conditions in yeast.
1. Instantiating a DataCard Object
from tfbpapi.datacard import DataCard
card = DataCard('BrentLab/harbison_2004')
print(f"Repository: {card.repo_id}")
Repository: BrentLab/harbison_2004
2. Repository Overview
Let's start by getting a high-level overview of the dataset.
# Get repository information
repo_info = card.get_repository_info()
print("Repository Information:")
print("=" * 40)
for key, value in repo_info.items():
    print(f"{key:20}: {value}")
Repository Information:
========================================
repo_id             : BrentLab/harbison_2004
pretty_name         : Harbison, 2004 ChIP-chip
license             : mit
tags                : ['genomics', 'yeast', 'transcription', 'binding']
language            : ['en']
size_categories     : ['1M<n<10M']
num_configs         : 1
dataset_types       : ['annotated_features']
total_files         : 7
last_modified       : 2025-12-16T20:28:09+00:00
has_default_config  : True
# Get a human-readable summary
print("Dataset Summary:")
print("=" * 50)
print(card.summary())
Dataset Summary:
==================================================
Dataset: Harbison, 2004 ChIP-chip
Repository: BrentLab/harbison_2004
License: mit
Configurations: 1
Dataset Types: annotated_features
Tags: genomics, yeast, transcription, binding
Configurations:
- harbison_2004: annotated_features (default)
ChIP-chip transcription factor binding data with environmental conditions
3. Exploring Configurations
A dataset repository can have multiple configurations, each representing a different type of data. This repository has a single configuration, harbison_2004.
# List all configurations
print(f"Number of configurations: {len(card.configs)}")
print("\nConfiguration details:")
for config in card.configs:
    print(f"\n• {config.config_name}:")
    print(f" Type: {config.dataset_type.value}")
    print(f" Default: {config.default}")
    print(f" Description: {config.description}")
    print(f" Features: {len(config.dataset_info.features)}")
Number of configurations: 1

Configuration details:

• harbison_2004:
 Type: annotated_features
 Default: True
 Description: ChIP-chip transcription factor binding data with environmental conditions
 Features: 7
4. Understanding Experimental Conditions: The Three-Level Hierarchy
The tfbpapi system supports experimental conditions at three hierarchy levels:
- Top-level (repo-wide): Conditions common to all datasets/samples
- Config-level: Conditions specific to a dataset configuration
- Field-level: Conditions that vary per sample, defined in field definitions
Let's explore each level for the Harbison 2004 dataset.
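The way the three levels could combine into one effective set of conditions for a sample can be sketched with plain dictionaries. This is a minimal illustration, not tfbpapi's implementation: the helper `merge_conditions`, the sample values, and the rule that more specific levels override less specific ones are all assumptions made for the example.

```python
from typing import Any, Optional


def merge_conditions(
    top: Optional[dict[str, Any]],
    config: Optional[dict[str, Any]],
    field: Optional[dict[str, Any]],
) -> dict[str, Any]:
    """Combine the three levels; later (more specific) levels override earlier ones.

    Hypothetical helper for illustration. The merge is shallow: a nested
    value such as a media dict is replaced as a whole, not merged key by key.
    """
    merged: dict[str, Any] = {}
    for level in (top, config, field):  # least specific first
        merged.update(level or {})  # a level may be None when nothing is defined
    return merged


# Made-up values for illustration only (not taken from any repository)
top_level = {"strain_background": "example_strain", "temperature_celsius": 30}
config_level = {"temperature_celsius": 25}
field_level = {"temperature_celsius": 37, "description": "example treatment"}

effective = merge_conditions(top_level, config_level, field_level)
print(effective)
```

In this sketch the field-level temperature wins because it is applied last, while the top-level key that no other level defines is carried through unchanged.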
Level 1: Top-Level Conditions
Top-level conditions apply to all experiments in the repository.
# Get top-level experimental conditions
top_conditions = card.get_experimental_conditions()
print("Top-Level Experimental Conditions:")
print("=" * 40)
if top_conditions:
    for key, value in top_conditions.items():
        print(f"{key}: {value}")
else:
    print("No top-level conditions defined for this repository")
    print("(All conditions are defined at config or field level)")
Top-Level Experimental Conditions:
========================================
No top-level conditions defined for this repository
(All conditions are defined at config or field level)
Level 2: Config-Level Conditions
Config-level conditions apply to all samples in a specific configuration.
# Get config-level conditions (merged with top-level)
config_conditions = card.get_experimental_conditions('harbison_2004')
print("Config-Level Experimental Conditions:")
print("=" * 40)
if config_conditions:
    for key, value in config_conditions.items():
        print(f"{key}: {value}")
else:
    print("No config-level conditions defined")
    print("(Conditions vary per sample at field level)")
Config-Level Experimental Conditions:
========================================
No config-level conditions defined
(Conditions vary per sample at field level)
Level 3: Field-Level Conditions
Field-level conditions vary per sample and are defined in field definitions.
# Get definitions for the 'condition' field
# This maps each condition value to its detailed specification
condition_defs = card.get_field_definitions('harbison_2004', 'condition')
print("Condition Field Definitions:")
print("=" * 40)
print(f"Found {len(condition_defs)} defined conditions:\n")
# Show all condition names
for cond_name in sorted(condition_defs.keys()):
    print(f" • {cond_name}")
Condition Field Definitions:
========================================
Found 14 defined conditions:

 • Acid
 • Alpha
 • BUT14
 • BUT90
 • GAL
 • H2O2Hi
 • H2O2Lo
 • HEAT
 • Pi-
 • RAFF
 • RAPA
 • SM
 • Thi-
 • YPD
# Explore a specific condition in detail
import json
# Let's look at the YPD baseline condition
ypd_def = condition_defs.get('YPD', {})
print("YPD Condition Definition:")
print("=" * 40)
print(json.dumps(ypd_def, indent=2))
YPD Condition Definition:
========================================
{
"description": "Rich media baseline condition",
"temperature_celsius": 30,
"growth_phase_at_harvest": {
"od600": 0.8
},
"media": {
"name": "YPD",
"carbon_source": [
{
"compound": "D-glucose",
"concentration_percent": 2
}
],
"nitrogen_source": [
{
"compound": "yeast_extract",
"concentration_percent": 1
},
{
"compound": "peptone",
"concentration_percent": 2
}
]
}
}
# Let's look at a treatment condition (HEAT shock)
heat_def = condition_defs.get('HEAT', {})
print("HEAT Condition Definition:")
print("=" * 40)
print(json.dumps(heat_def, indent=2))
HEAT Condition Definition:
========================================
{
"description": "Heat shock stress condition",
"initial_temperature_celsius": 30,
"temperature_shift_celsius": 37,
"temperature_shift_duration_minutes": 45,
"growth_phase_at_harvest": {
"od600": 0.5
},
"media": {
"name": "YPD",
"carbon_source": [
{
"compound": "D-glucose",
"concentration_percent": 2
}
],
"nitrogen_source": [
{
"compound": "yeast_extract",
"concentration_percent": 1
},
{
"compound": "peptone",
"concentration_percent": 2
}
]
}
}
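To see exactly where the two definitions above differ, one option is to flatten each nested definition into dotted paths and compare them. The sketch below does that for the YPD and HEAT definitions, with values copied from the output above. The helpers `flatten` and `diff_definitions` are hypothetical and not part of tfbpapi.

```python
from typing import Any


def flatten(obj: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts and lists into {dotted.path: leaf_value}."""
    flat: dict[str, Any] = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}.{key}" if prefix else key))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}[{index}]"))
    else:
        flat[prefix] = obj
    return flat


def diff_definitions(a: dict, b: dict) -> dict[str, tuple]:
    """Return {path: (value_in_a, value_in_b)} for paths whose values differ.

    None stands for a path that is absent on that side.
    """
    flat_a, flat_b = flatten(a), flatten(b)
    return {
        path: (flat_a.get(path), flat_b.get(path))
        for path in sorted(flat_a.keys() | flat_b.keys())
        if flat_a.get(path) != flat_b.get(path)
    }


# Both conditions use the same YPD media block in the output above
ypd_media = {
    "name": "YPD",
    "carbon_source": [{"compound": "D-glucose", "concentration_percent": 2}],
    "nitrogen_source": [
        {"compound": "yeast_extract", "concentration_percent": 1},
        {"compound": "peptone", "concentration_percent": 2},
    ],
}
ypd_def = {
    "description": "Rich media baseline condition",
    "temperature_celsius": 30,
    "growth_phase_at_harvest": {"od600": 0.8},
    "media": ypd_media,
}
heat_def = {
    "description": "Heat shock stress condition",
    "initial_temperature_celsius": 30,
    "temperature_shift_celsius": 37,
    "temperature_shift_duration_minutes": 45,
    "growth_phase_at_harvest": {"od600": 0.5},
    "media": ypd_media,
}

differences = diff_definitions(ypd_def, heat_def)
for path, (ypd_value, heat_value) in differences.items():
    print(f"{path}: YPD={ypd_value!r}  HEAT={heat_value!r}")
```

The media paths do not appear in the result because both conditions share the same media block. The differences are the description, the OD600 at harvest, and the temperature keys.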
5. Working with Condition Definitions
Now let's see how to extract specific information from condition definitions.
# Extract growth media names for all conditions
print("Growth Media Across Conditions:")
print("=" * 40)
for cond_name, cond_def in sorted(condition_defs.items()):
    # Navigate the nested structure
    media = cond_def.get('media', {})
    media_name = media.get('name', 'unspecified')
    print(f" {cond_name:10}: {media_name}")
Growth Media Across Conditions:
========================================
 Acid      : YPD
 Alpha     : YPD
 BUT14     : YPD
 BUT90     : YPD
 GAL       : yeast_extract_peptone
 H2O2Hi    : YPD
 H2O2Lo    : YPD
 HEAT      : YPD
 Pi-       : synthetic_complete_minus_phosphate
 RAFF      : yeast_extract_peptone
 RAPA      : YPD
 SM        : synthetic_complete
 Thi-      : synthetic_complete_minus_thiamine
 YPD       : YPD
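When planning an analysis it can help to invert this listing and group conditions by the media they share. A minimal sketch using only the standard library follows. The mapping is copied from the output above rather than fetched from the repository, and the variable names are illustrative.

```python
from collections import defaultdict

# Condition -> media name, copied from the output above
media_by_condition = {
    "Acid": "YPD",
    "Alpha": "YPD",
    "BUT14": "YPD",
    "BUT90": "YPD",
    "GAL": "yeast_extract_peptone",
    "H2O2Hi": "YPD",
    "H2O2Lo": "YPD",
    "HEAT": "YPD",
    "Pi-": "synthetic_complete_minus_phosphate",
    "RAFF": "yeast_extract_peptone",
    "RAPA": "YPD",
    "SM": "synthetic_complete",
    "Thi-": "synthetic_complete_minus_thiamine",
    "YPD": "YPD",
}

# Invert the mapping: media name -> list of conditions grown in it
conditions_by_media = defaultdict(list)
for condition, media_name in sorted(media_by_condition.items()):
    conditions_by_media[media_name].append(condition)

for media_name, conditions in sorted(conditions_by_media.items()):
    print(f"{media_name} ({len(conditions)}): {', '.join(conditions)}")
```

With a live DataCard, the same grouping could be built directly inside the loop over `condition_defs.items()` shown above.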
condition_defs.get("YPD")
{'description': 'Rich media baseline condition',
'temperature_celsius': 30,
'growth_phase_at_harvest': {'od600': 0.8},
'media': {'name': 'YPD',
'carbon_source': [{'compound': 'D-glucose', 'concentration_percent': 2}],
'nitrogen_source': [{'compound': 'yeast_extract',
'concentration_percent': 1},
{'compound': 'peptone', 'concentration_percent': 2}]}}
# Extract temperature conditions
print("Temperature Across Conditions:")
print("=" * 40)
for cond_name, cond_def in sorted(condition_defs.items()):
    # NOTE: this looks under an 'environmental_conditions' key. The YPD and
    # HEAT definitions shown earlier have no such key (their temperature
    # keys sit at the top level), so the lookups below fall back to defaults.
    env_conds = cond_def.get('environmental_conditions', {})
    temp = env_conds.get('temperature_celsius', 'not specified')
    # Also check for temperature shifts
    temp_shift = env_conds.get('temperature_shift')
    if temp_shift:
        from_temp = temp_shift.get('from_celsius', '?')
        to_temp = temp_shift.get('to_celsius', '?')
        print(f" {cond_name:10}: {from_temp}°C → {to_temp}°C")
    else:
        print(f" {cond_name:10}: {temp}°C")
Temperature Across Conditions:
========================================
 Acid      : not specified°C
 Alpha     : not specified°C
 BUT14     : not specified°C
 BUT90     : not specified°C
 GAL       : not specified°C
 H2O2Hi    : not specified°C
 H2O2Lo    : not specified°C
 HEAT      : not specified°C
 Pi-       : not specified°C
 RAFF      : not specified°C
 RAPA      : not specified°C
 SM        : not specified°C
 Thi-      : not specified°C
 YPD       : not specified°C
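Every row reports "not specified" because the loop reads from an `environmental_conditions` key, while the YPD and HEAT definitions printed in section 4 keep their temperature keys at the top level: `temperature_celsius` for YPD, and `initial_temperature_celsius` with `temperature_shift_celsius` for HEAT. The sketch below reads those top-level keys instead. It assumes the remaining conditions follow one of these two layouts, which has not been checked here, and `describe_temperature` is a hypothetical helper.

```python
def describe_temperature(cond_def: dict) -> str:
    """Summarize temperature from the top-level keys of a condition definition."""
    shift_to = cond_def.get("temperature_shift_celsius")
    if shift_to is not None:
        # Shifted conditions record a starting temperature and a target
        start = cond_def.get("initial_temperature_celsius", "?")
        return f"{start}°C → {shift_to}°C"
    temp = cond_def.get("temperature_celsius")
    return f"{temp}°C" if temp is not None else "not specified"


# Only the temperature-related keys of the two definitions shown in section 4
sample_defs = {
    "YPD": {"temperature_celsius": 30},
    "HEAT": {
        "initial_temperature_celsius": 30,
        "temperature_shift_celsius": 37,
        "temperature_shift_duration_minutes": 45,
    },
}

for cond_name, cond_def in sorted(sample_defs.items()):
    print(f" {cond_name:10}: {describe_temperature(cond_def)}")
```

To run this against the full dataset, iterate over `condition_defs.items()` in place of `sample_defs.items()`.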
6. Using extract_metadata_schema for Metadata Table Planning
The extract_metadata_schema method provides all condition information in one call, which is useful for planning metadata table creation.
# Extract complete metadata schema
schema = card.extract_metadata_schema('harbison_2004')
print("Metadata Schema Summary:")
print("=" * 40)
print(f"Regulator fields: {schema['regulator_fields']}")
print(f"Target fields: {schema['target_fields']}")
print(f"Condition fields: {schema['condition_fields']}")
print(f"\nTop-level conditions: {schema['top_level_conditions']}")
print(f"Config-level conditions: {schema['config_level_conditions']}")
print(f"Field definitions available for: {list(schema['condition_definitions'].keys())}")
Metadata Schema Summary:
========================================
Regulator fields: ['regulator_locus_tag', 'regulator_symbol']
Target fields: ['target_locus_tag', 'target_symbol']
Condition fields: ['condition']

Top-level conditions: None
Config-level conditions: None
Field definitions available for: ['condition']
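One way to use this schema when planning a metadata table is to derive a column list from it. The sketch below is an illustration under stated assumptions, not tfbpapi behaviour: it assumes one row per sample, that a sample is identified by its regulator and condition fields, and that target fields are left out because they vary within a sample. The helper `plan_metadata_columns` is hypothetical, and the schema literal copies only what the output above shows.

```python
# Schema values copied from the output above. The contents of the
# 'condition' definitions are not shown there, so an empty dict stands in.
schema = {
    "regulator_fields": ["regulator_locus_tag", "regulator_symbol"],
    "target_fields": ["target_locus_tag", "target_symbol"],
    "condition_fields": ["condition"],
    "top_level_conditions": None,
    "config_level_conditions": None,
    "condition_definitions": {"condition": {}},
}


def plan_metadata_columns(schema: dict) -> list[str]:
    """Hypothetical helper: propose sample-level columns for a metadata table.

    Regulator and condition fields come first, followed by any keys
    defined at the top level or config level (None is treated as empty).
    """
    columns = list(schema["regulator_fields"]) + list(schema["condition_fields"])
    for level in ("top_level_conditions", "config_level_conditions"):
        for key in schema.get(level) or {}:
            if key not in columns:
                columns.append(key)
    return columns


print(plan_metadata_columns(schema))
```

For this dataset the top-level and config-level conditions are both None, so the plan contains only the regulator and condition fields.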