DataCard Tutorial: Exploring HuggingFace Genomics Datasets¶
The DataCard class provides an easy-to-use interface for exploring HuggingFace dataset metadata without loading the actual genomic data. This is particularly useful for:
- Understanding dataset structure and available configurations
- Exploring experimental conditions and regulators
- Discovering metadata relationships
- Planning data analysis workflows
In this tutorial, we'll explore the BrentLab/rossi_2021 dataset, which contains ChIP-exo data for transcription factor binding in yeast.
1. Getting Started¶
First, let's import the DataCard class and initialize it with our target dataset.
from tfbpapi.datainfo import DataCard
# Initialize DataCard with a HuggingFace repository ID.
# BrentLab/mahendrawada_2025 is used here because it has more
# configurations than BrentLab/rossi_2021.
card = DataCard('BrentLab/mahendrawada_2025')
print(f"Repository: {card.repo_id}")
Repository: BrentLab/mahendrawada_2025
2. Repository Overview¶
Let's start by getting a high-level overview of the dataset.
# Get repository information
repo_info = card.get_repository_info()
print("Repository Information:")
print("=" * 40)
for key, value in repo_info.items():
    print(f"{key:20}: {value}")
Repository Information:
========================================
repo_id             : BrentLab/mahendrawada_2025
pretty_name         : Mahendrawada 2025 ChEC-seq and Nascent RNA-seq data
license             : mit
tags                : ['biology', 'genomics', 'yeast', 'transcription-factors', 'gene-expression', 'binding', 'chec', 'perturbation', 'rnaseq', 'nascent rnaseq']
language            : ['en']
size_categories     : ['100K<n<1M']
num_configs         : 6
dataset_types       : ['genomic_features', 'metadata', 'genome_map', 'annotated_features', 'annotated_features', 'annotated_features']
total_files         : 7
last_modified       : 2025-09-17T19:07:50+00:00
has_default_config  : True
# Get a human-readable summary
print("Dataset Summary:")
print("=" * 50)
print(card.summary())
Dataset Summary:
==================================================
Dataset: Mahendrawada 2025 ChEC-seq and Nascent RNA-seq data
Repository: BrentLab/mahendrawada_2025
License: mit
Configurations: 6
Dataset Types: genomic_features, metadata, genome_map, annotated_features, annotated_features, annotated_features
Tags: biology, genomics, yeast, transcription-factors, gene-expression, binding, chec, perturbation, rnaseq, nascent rnaseq
Configurations:
- genomic_features: genomic_features
Comprehensive genomic features and regulatory characteristics for yeast genes
- mahendrawada_2025_metadata: metadata
Metadata for ChEC-seq experiments describing transcription factors and experimental conditions
- chec_seq_genome_map: genome_map
Raw ChEC-seq signal data partitioned by experiment accession showing genome-wide binding profiles
- mahendrawada_chec_seq: annotated_features (default)
ChEC-seq transcription factor binding data with peak scores (original authors' processed data)
- reprocessed_chec_seq: annotated_features
ChEC-seq transcription factor binding data reprocessed with updated peak calling methodology
- rna_seq: annotated_features
Nascent RNA-seq differential expression data following transcription factor depletion using 4TU metabolic labeling
3. Exploring Configurations¶
Datasets can have multiple configurations representing different types of data. Let's explore what's available in this dataset.
# List all configurations
print(f"Number of configurations: {len(card.configs)}")
print("\nConfiguration details:")
for config in card.configs:
    print(f"\n• {config.config_name}:")
    print(f" Type: {config.dataset_type.value}")
    print(f" Default: {config.default}")
    print(f" Description: {config.description}")
    print(f" Features: {len(config.dataset_info.features)}")
Number of configurations: 6

Configuration details:

• genomic_features:
 Type: genomic_features
 Default: False
 Description: Comprehensive genomic features and regulatory characteristics for yeast genes
 Features: 24

• mahendrawada_2025_metadata:
 Type: metadata
 Default: False
 Description: Metadata for ChEC-seq experiments describing transcription factors and experimental conditions
 Features: 7

• chec_seq_genome_map:
 Type: genome_map
 Default: False
 Description: Raw ChEC-seq signal data partitioned by experiment accession showing genome-wide binding profiles
 Features: 3

• mahendrawada_chec_seq:
 Type: annotated_features
 Default: True
 Description: ChEC-seq transcription factor binding data with peak scores (original authors' processed data)
 Features: 6

• reprocessed_chec_seq:
 Type: annotated_features
 Default: False
 Description: ChEC-seq transcription factor binding data reprocessed with updated peak calling methodology
 Features: 6

• rna_seq:
 Type: annotated_features
 Default: False
 Description: Nascent RNA-seq differential expression data following transcription factor depletion using 4TU metabolic labeling
 Features: 5
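When a repository has several configurations of the same type, it can help to group the config names by dataset type. The sketch below is only an illustration: the pairs are copied by hand from the output above, and `group_by_type` is a hypothetical helper, not part of tfbpapi. With a live card, the same pairs could presumably be built from `config.config_name` and `config.dataset_type.value`, the attributes used in the loop above.

```python
from collections import defaultdict

# (config_name, dataset_type) pairs copied from the output above.
configs = [
    ("genomic_features", "genomic_features"),
    ("mahendrawada_2025_metadata", "metadata"),
    ("chec_seq_genome_map", "genome_map"),
    ("mahendrawada_chec_seq", "annotated_features"),
    ("reprocessed_chec_seq", "annotated_features"),
    ("rna_seq", "annotated_features"),
]


def group_by_type(pairs):
    """Map each dataset type to the list of config names of that type."""
    grouped = defaultdict(list)
    for name, dataset_type in pairs:
        grouped[dataset_type].append(name)
    return dict(grouped)


by_type = group_by_type(configs)
for dataset_type, names in by_type.items():
    print(f"{dataset_type}: {names}")
```

Grouping this way makes it clear that three separate configs share the annotated_features type, so a config name, not just a type, is needed to pick one.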
Understanding Dataset Types¶
The Rossi 2021 dataset contains two types of configurations:
- metadata: Experimental metadata describing each ChIP-exo sample
- genome_map: Position-level ChIP-exo tag coverage data
Let's explore each configuration in detail.
# Explore the metadata configuration
metadata_info = card.explore_config('metadata')
print("Metadata Configuration Details:")
print("=" * 40)
print(f"Config name: {metadata_info['config_name']}")
print(f"Dataset type: {metadata_info['dataset_type']}")
print(f"Number of features: {metadata_info['num_features']}")
print(f"Default config: {metadata_info['is_default']}")
print("\nFeatures in metadata config:")
for feature in metadata_info['features']:
    print(f" • {feature['name']:20} ({feature['dtype']:10}): {feature['description']}")
---------------------------------------------------------------------------
DataCardError                             Traceback (most recent call last)
Cell In[5], line 2
      1 # Explore the metadata configuration
----> 2 metadata_info = card.explore_config('metadata')
      4 print("Metadata Configuration Details:")
      5 print("=" * 40)
File ~/code/tfbp/tfbpapi/tfbpapi/datainfo/datacard.py:291, in DataCard.explore_config(self, config_name)
    289 config = self.get_config(config_name)
    290 if not config:
--> 291     raise DataCardError(f"Configuration '{config_name}' not found")
    293 info: dict[str, Any] = {
    294     "config_name": config.config_name,
    295     "description": config.description,
   (...)
    305     ],
    306 }
    308 # Add partitioning info if present
DataCardError: Configuration 'metadata' not found
This error is raised because the card above was initialized with BrentLab/mahendrawada_2025, whose metadata configuration is named mahendrawada_2025_metadata. The configurations named metadata and genome_map belong to BrentLab/rossi_2021, which produced the remaining outputs in this tutorial.
# Explore the genome_map configuration
genome_map_info = card.explore_config('genome_map')
print("Genome Map Configuration Details:")
print("=" * 40)
print(f"Config name: {genome_map_info['config_name']}")
print(f"Dataset type: {genome_map_info['dataset_type']}")
print(f"Number of features: {genome_map_info['num_features']}")
print("\nFeatures in genome_map config:")
for feature in genome_map_info['features']:
    print(f" • {feature['name']:15} ({feature['dtype']:10}): {feature['description']}")
# Check if this config has partitioning
if 'partitioning' in genome_map_info:
    print("\nPartitioning Information:")
    partitioning = genome_map_info['partitioning']
    print(f" Enabled: {partitioning['enabled']}")
    print(f" Partition by: {partitioning['partition_by']}")
    print(f" Path template: {partitioning['path_template']}")
Genome Map Configuration Details:
========================================
Config name: genome_map
Dataset type: genome_map
Number of features: 3
Features in genome_map config:
• chr (string ): Chromosome name (e.g., chrI, chrII, etc.)
• pos (int32 ): Genomic position of the 5' tag
• pileup (int32 ): Depth of coverage (number of 5' tags) at this genomic position
Partitioning Information:
Enabled: True
Partition by: ['run_accession']
Path template: genome_map/accession={run_accession}/*.parquet
4. Understanding Data Relationships¶
The DataCard can help you understand how different configurations relate to each other, particularly metadata relationships.
# Explore metadata relationships
relationships = card.get_metadata_relationships()
print(f"Found {len(relationships)} metadata relationships:")
print("\nRelationship details:")
for rel in relationships:
    print(f" • {rel.data_config} -> {rel.metadata_config}")
    print(f" Type: {rel.relationship_type}")
    if rel.relationship_type == "explicit":
        print(" (Metadata config explicitly specifies which data configs it applies to)")
    elif rel.relationship_type == "embedded":
        print(" (Metadata is embedded within the data config itself)")
Found 1 metadata relationships:
Relationship details:
• genome_map -> metadata
Type: explicit
(Metadata config explicitly specifies which data configs it applies to)
5. Exploring Dataset Contents¶
Now let's explore what experimental data is available in this dataset.
# Get different config types
from tfbpapi.datainfo.models import DatasetType
# Find metadata configs
metadata_configs = card.get_configs_by_type(DatasetType.METADATA)
print(f"Metadata configurations: {[c.config_name for c in metadata_configs]}")
# Find data configs
data_configs = card.get_configs_by_type(DatasetType.GENOME_MAP)
print(f"Data configurations: {[c.config_name for c in data_configs]}")
# Get the default config
default_config = card.dataset_card.get_default_config()
if default_config:
    print(f"\nDefault configuration: {default_config.config_name}")
Metadata configurations: ['metadata']
Data configurations: ['genome_map']

Default configuration: metadata
Extracting Field Values¶
For exploration purposes, we can extract unique values from specific fields. This is particularly useful for understanding what experimental conditions or regulators are available.
# Try to extract run accession information
try:
    accessions = card.get_field_values('metadata', 'run_accession')
    print(f"Found {len(accessions)} unique run accessions:")
    if accessions:
        sample_accessions = sorted(accessions)[:5]
        print(f"Sample accessions: {sample_accessions}...")
    else:
        print("No accessions found (might require partition-based extraction)")
except Exception as e:
    print(f"Could not extract accession values: {e}")
Found 0 unique run accessions:
No accessions found (might require partition-based extraction)
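The message above hints that values of a partition column may have to be recovered from the directory layout rather than from the metadata. As a rough sketch of that idea (not necessarily how DataCard does it), the code below pulls the distinct `accession=...` values out of a list of file paths. The paths and the helper `partition_values` are hypothetical; the `accession=` directory key follows the path template reported for the genome_map config.

```python
import re

# Hypothetical file listing; real paths would come from the repository.
paths = [
    "genome_map/accession=SRR0000001/part-0.parquet",
    "genome_map/accession=SRR0000002/part-0.parquet",
    "genome_map/accession=SRR0000002/part-1.parquet",
    "rossi_2021_metadata.parquet",
]


def partition_values(paths, key):
    """Collect the distinct values of a `key=value` directory segment."""
    pattern = re.compile(rf"(?:^|/){re.escape(key)}=([^/]+)/")
    values = set()
    for path in paths:
        match = pattern.search(path)
        if match:
            values.add(match.group(1))
    return values


accessions = partition_values(paths, "accession")
print(sorted(accessions))
```

Paths without the requested key, such as the metadata file, are simply skipped, so the function returns an empty set when nothing is partitioned by that key.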
6. Working with Partitioned Data¶
Many genomics datasets are partitioned for efficient storage and querying. Let's explore how partitioning works in this dataset.
# Check partitioning details for the genome_map config
genome_map_config = card.get_config('genome_map')
if genome_map_config and genome_map_config.dataset_info.partitioning:
    part_info = genome_map_config.dataset_info.partitioning
    print("Partitioning Details:")
    print("=" * 30)
    print(f"Enabled: {part_info.enabled}")
    print(f"Partition columns: {part_info.partition_by}")
    print(f"Path template: {part_info.path_template}")
    print("\nThis means:")
    print("• The genome map data is split into separate files for each run_accession")
    print("• Files are organized as: genome_map/accession={run_accession}/*.parquet")
    print("• This allows efficient querying of specific experimental runs")
else:
    print("No partitioning information found.")
Partitioning Details:
==============================
Enabled: True
Partition columns: ['run_accession']
Path template: genome_map/accession={run_accession}/*.parquet
This means:
• The genome map data is split into separate files for each run_accession
• Files are organized as: genome_map/accession={run_accession}/*.parquet
• This allows efficient querying of specific experimental runs
7. Understanding Data Files¶
Let's examine how the data files are organized within each configuration.
# Examine data files for each configuration
for config in card.configs:
    print(f"\nData files for '{config.config_name}' config:")
    print("-" * 40)
    for i, data_file in enumerate(config.data_files):
        print(f" File {i+1}:")
        print(f" Split: {data_file.split}")
        print(f" Path: {data_file.path}")
        # Explain path patterns
        if '*' in data_file.path:
            print(" - This is a glob pattern that matches multiple files")
        if '=' in data_file.path:
            print(" - This uses partitioned directory structure")
Data files for 'metadata' config:
----------------------------------------
File 1:
Split: train
Path: rossi_2021_metadata.parquet
Data files for 'genome_map' config:
----------------------------------------
File 1:
Split: train
Path: genome_map/*/*.parquet
- This is a glob pattern that matches multiple files
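To get a feel for which files a pattern such as `genome_map/*/*.parquet` selects, the sketch below tests a few candidate paths against it. It uses `PurePosixPath.match` from the standard library as an approximation; the glob implementation actually used when the data is loaded may differ in edge cases. The candidate paths and the helper `matching_files` are hypothetical.

```python
from pathlib import PurePosixPath

pattern = "genome_map/*/*.parquet"  # from the genome_map config above

# Hypothetical repository paths used only for illustration.
candidates = [
    "genome_map/accession=SRR0000001/part-0.parquet",
    "genome_map/accession=SRR0000002/part-0.parquet",
    "rossi_2021_metadata.parquet",
    "genome_map/README.md",
]


def matching_files(paths, pattern):
    """Return the paths whose trailing components match the glob pattern."""
    return [p for p in paths if PurePosixPath(p).match(pattern)]


matched = matching_files(candidates, pattern)
print(matched)
```

Here each `*` matches exactly one path component, so only Parquet files one directory below `genome_map/` are selected.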
8. Practical Use Cases¶
Here are some common scenarios where DataCard is useful:
Use Case 1: Finding Datasets with Specific Data Types¶
# Check what types of data are available
available_types = [config.dataset_type.value for config in card.configs]
print(f"Available dataset types: {available_types}")
# Check if this dataset has genome-wide binding data
has_genome_map = any(config.dataset_type == DatasetType.GENOME_MAP for config in card.configs)
print(f"\nHas genome-wide binding data: {has_genome_map}")
Available dataset types: ['metadata', 'genome_map']

Has genome-wide binding data: True
Use Case 2: Understanding Data Schema Before Loading¶
# Before loading large genome map data, understand its structure
genome_config = card.get_config('genome_map')
if genome_config:
    print("Genome Map Data Schema:")
    print("=" * 30)
    for feature in genome_config.dataset_info.features:
        print(f"Column: {feature.name}")
        print(f" Type: {feature.dtype}")
        print(f" Description: {feature.description}")
        print()
    print("This tells us:")
    print("• 'chr' column contains chromosome names (string)")
    print("• 'pos' column contains genomic positions (int32)")
    print("• 'pileup' column contains tag counts (int32)")
    print("• Data represents 5' tag coverage from ChIP-exo experiments")
Genome Map Data Schema:
==============================
Column: chr
 Type: string
 Description: Chromosome name (e.g., chrI, chrII, etc.)

Column: pos
 Type: int32
 Description: Genomic position of the 5' tag

Column: pileup
 Type: int32
 Description: Depth of coverage (number of 5' tags) at this genomic position

This tells us:
• 'chr' column contains chromosome names (string)
• 'pos' column contains genomic positions (int32)
• 'pileup' column contains tag counts (int32)
• Data represents 5' tag coverage from ChIP-exo experiments
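Knowing the schema ahead of time also makes it possible to sanity-check rows before or after loading. The following is a minimal sketch under stated assumptions: the schema dictionary is copied by hand from the output above, `check_row` is a hypothetical helper, and only the two dtype strings that appear in this schema are handled.

```python
# Schema copied from the genome_map output above: column name -> dtype string.
schema = {"chr": "string", "pos": "int32", "pileup": "int32"}

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def check_row(row, schema):
    """Return a list of problems found in one row, empty if the row conforms."""
    problems = []
    for column, dtype in schema.items():
        if column not in row:
            problems.append(f"missing column '{column}'")
            continue
        value = row[column]
        if dtype == "string" and not isinstance(value, str):
            problems.append(f"'{column}' should be a string")
        elif dtype == "int32":
            # bool is a subclass of int in Python, so exclude it explicitly.
            if not isinstance(value, int) or isinstance(value, bool):
                problems.append(f"'{column}' should be an integer")
            elif not INT32_MIN <= value <= INT32_MAX:
                problems.append(f"'{column}' does not fit in int32")
    return problems


good = {"chr": "chrI", "pos": 1500, "pileup": 12}
bad = {"chr": "chrI", "pos": 2**40}
print(check_row(good, schema))
print(check_row(bad, schema))
```

The first row passes with no problems, while the second is flagged for an out-of-range position and a missing pileup column.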
Use Case 3: Planning Efficient Data Access¶
# Understanding partitioning helps plan efficient queries
if genome_config and genome_config.dataset_info.partitioning:
    print("Data Access Strategy:")
    print("=" * 25)
    print("• Data is partitioned by 'run_accession'")
    print("• To load data for a specific experiment, filter by run_accession")
    print("• This avoids loading data from all experiments")
    print("• Path pattern: genome_map/accession={run_accession}/*.parquet")
    print("\nExample workflow:")
    print("1. Use metadata config to find interesting run_accessions")
    print("2. Load only genome_map data for those specific accessions")
    print("3. Analyze position-level binding data for selected experiments")
Data Access Strategy:
=========================
• Data is partitioned by 'run_accession'
• To load data for a specific experiment, filter by run_accession
• This avoids loading data from all experiments
• Path pattern: genome_map/accession={run_accession}/*.parquet
Example workflow:
1. Use metadata config to find interesting run_accessions
2. Load only genome_map data for those specific accessions
3. Analyze position-level binding data for selected experiments
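The first two steps of the workflow above can be sketched in plain Python. Everything in this example is illustrative: the metadata rows, the `regulator` column and its values, and the helpers `select_accessions` and `partition_globs` are assumptions, and only the path template comes from the partitioning output.

```python
# Hypothetical metadata rows; the column names other than run_accession
# are assumptions made for this sketch.
metadata_rows = [
    {"run_accession": "SRR0000001", "regulator": "TF_A"},
    {"run_accession": "SRR0000002", "regulator": "TF_B"},
    {"run_accession": "SRR0000003", "regulator": "TF_A"},
]

# Path template reported by the genome_map partitioning info.
PATH_TEMPLATE = "genome_map/accession={run_accession}/*.parquet"


def select_accessions(rows, regulator):
    """Step 1: pick the run accessions recorded for one regulator."""
    return [r["run_accession"] for r in rows if r["regulator"] == regulator]


def partition_globs(accessions, template=PATH_TEMPLATE):
    """Step 2: build one glob per accession so only those partitions are read."""
    return [template.format(run_accession=a) for a in accessions]


selected = select_accessions(metadata_rows, "TF_A")
globs = partition_globs(selected)
for g in globs:
    print(g)
```

The resulting globs name only the partitions of interest, which is what lets a Parquet reader skip the files of every other experiment.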
9. Error Handling and Troubleshooting¶
The DataCard class includes validation and error handling. Here are some common scenarios:
# Handling missing configurations
missing_config = card.get_config('nonexistent_config')
print(f"Non-existent config result: {missing_config}")
# Handling missing fields
try:
    invalid_field = card.get_field_values('metadata', 'nonexistent_field')
except Exception as e:
    print(f"Error accessing non-existent field: {e}")
# Checking if config exists before using
config_name = 'some_config'
if card.get_config(config_name):
    print(f"Config '{config_name}' exists")
else:
    print(f"Config '{config_name}' not found in this dataset")
Non-existent config result: None
Error accessing non-existent field: Field 'nonexistent_field' not found in config 'metadata'
Config 'some_config' not found in this dataset
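Because `get_config` returns None for an unknown name, a small wrapper that fails early with a list of valid names can make typos easier to spot. The sketch below is a hypothetical helper, not part of tfbpapi. To keep it runnable offline it uses stand-in classes (`StubCard`, `StubConfig`) that model only `configs`, `config_name`, and `get_config` as they appear in this tutorial.

```python
from dataclasses import dataclass
from difflib import get_close_matches


@dataclass
class StubConfig:
    """Stand-in for a config object; only config_name is modeled."""
    config_name: str


class StubCard:
    """Stand-in for DataCard exposing the two members used in this tutorial."""

    def __init__(self, names):
        self.configs = [StubConfig(n) for n in names]

    def get_config(self, name):
        # Mirrors the behavior shown above: None when the config is absent.
        return next((c for c in self.configs if c.config_name == name), None)


def require_config(card, name):
    """Return the config, or raise KeyError listing available and similar names."""
    config = card.get_config(name)
    if config is not None:
        return config
    available = [c.config_name for c in card.configs]
    hints = get_close_matches(name, available, n=3, cutoff=0.4)
    raise KeyError(
        f"Config '{name}' not found. Available: {available}. Similar: {hints}"
    )


card_stub = StubCard(["mahendrawada_2025_metadata", "chec_seq_genome_map", "rna_seq"])
print(require_config(card_stub, "rna_seq").config_name)
try:
    require_config(card_stub, "metadata")
except KeyError as err:
    print(err)
```

Raising instead of returning None moves the failure to the point where the wrong name was typed, rather than to a later attribute access on None.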
10. Summary and Next Steps¶
The DataCard class provides a powerful way to explore HuggingFace genomics datasets before committing to loading large amounts of data.
Key Takeaways:¶
- Dataset Structure: The Rossi 2021 dataset contains both experimental metadata and genome-wide ChIP-exo binding data
- Partitioning: Data is efficiently partitioned by experimental run for fast access
- Metadata Relationships: DataCard reports which metadata configs apply to which data configs
- Schema Discovery: You can understand data types and structure before loading
Next Steps:¶
- Use HfQueryAPI to load specific subsets of the data based on your exploration
- Apply filters based on experimental conditions discovered through DataCard
- Combine multiple datasets that have compatible schemas
Example Integration with HfQueryAPI:¶
from tfbpapi import HfQueryAPI
# After exploring with DataCard, load specific data
query_api = HfQueryAPI('BrentLab/rossi_2021')
# Load metadata for planning
metadata_df = query_api.get_pandas('metadata')
# Load genome map data for specific experiments
# (using partition filters based on DataCard exploration)
genome_data = query_api.get_pandas(
    'genome_map',
    filters={'run_accession': 'SRR123456'},
)