HfQueryAPI Tutorial: Metadata-Driven Data Exploration¶
This tutorial demonstrates how to use the HfQueryAPI for efficient metadata-driven exploration and querying of Hugging Face datasets, using the Hackett 2020 transcription factor overexpression dataset as an example.
Overview¶
The HfQueryAPI provides a streamlined workflow for:
- Exploring metadata to understand dataset structure
- Setting filters based on metadata values
- Querying data with automatic filter application
- Efficient SQL-based filtering on large datasets
Dataset: Hackett 2020¶
The BrentLab/hackett_2020 dataset contains transcription factor overexpression data with embedded metadata fields including:
- regulator_locus_tag, regulator_symbol: Transcription factor identifiers
- time: Time point in minutes (e.g., 0, 15, 30, 90)
- mechanism: Experimental mechanism (e.g., ZEV)
- restriction: Restriction enzyme treatment (P, NP)
- date, strain: Experimental metadata
Let's start exploring!
Initializing HfQueryAPI and Getting Metadata¶
First, we'll initialize the API client for the Hackett 2020 dataset:
In [23]:
from tfbpapi.HfQueryAPI import HfQueryAPI
import pandas as pd
# Initialize the API client
api = HfQueryAPI(repo_id="BrentLab/hackett_2020", repo_type="dataset")
print(f"Initialized HfQueryAPI for {api.repo_id}")
print(f"Repository type: {api.repo_type}")
print(f"Available configurations: {[c.config_name for c in api.configs]}")
Initialized HfQueryAPI for BrentLab/hackett_2020
Repository type: dataset
Available configurations: ['hackett_2020']
In [24]:
metadata = api.get_metadata("hackett_2020")
In [25]:
# Create summary tables for experimental conditions
print("Experimental Conditions Summary")
print("=" * 50)
# Time points summary
time_summary = pd.DataFrame({
'Time Point (min)': metadata['time'].value_counts().sort_index().index,
'Frequency': metadata['time'].value_counts().sort_index().values,
})
time_summary['% of Records'] = (time_summary['Frequency'] / time_summary['Frequency'].sum() * 100).round(1)
print("\n⏱️ Time Points:")
display(time_summary)
# Experimental conditions table
conditions_summary = pd.DataFrame({
'Condition Type': ['Mechanisms', 'Restriction Treatments', 'Strains', 'Experimental Dates'],
'Unique Values': [
metadata['mechanism'].nunique(),
metadata['restriction'].nunique(),
metadata['strain'].nunique(),
metadata['date'].nunique()
],
'Most Common': [
metadata['mechanism'].mode().iloc[0],
metadata['restriction'].mode().iloc[0],
metadata['strain'].mode().iloc[0],
f"{metadata['date'].min()} to {metadata['date'].max()}"
]
})
print("\nExperimental Design Overview:")
display(conditions_summary)
Experimental Conditions Summary
==================================================

⏱️ Time Points:
| | Time Point (min) | Frequency | % of Records |
|---|---|---|---|
| 0 | 0.0 | 217 | 12.8 |
| 1 | 2.5 | 8 | 0.5 |
| 2 | 5.0 | 212 | 12.5 |
| 3 | 7.5 | 2 | 0.1 |
| 4 | 8.0 | 2 | 0.1 |
| 5 | 10.0 | 187 | 11.0 |
| 6 | 12.5 | 2 | 0.1 |
| 7 | 15.0 | 212 | 12.5 |
| 8 | 18.0 | 1 | 0.1 |
| 9 | 20.0 | 184 | 10.9 |
| 10 | 30.0 | 216 | 12.8 |
| 11 | 45.0 | 213 | 12.6 |
| 12 | 60.0 | 20 | 1.2 |
| 13 | 90.0 | 212 | 12.5 |
| 14 | 100.0 | 2 | 0.1 |
| 15 | 120.0 | 1 | 0.1 |
| 16 | 180.0 | 1 | 0.1 |
| 17 | 290.0 | 1 | 0.1 |
Experimental Design Overview:
| | Condition Type | Unique Values | Most Common |
|---|---|---|---|
| 0 | Mechanisms | 2 | ZEV |
| 1 | Restriction Treatments | 3 | P |
| 2 | Strains | 215 | SMY2207 |
| 3 | Experimental Dates | 23 | 20150101 to 20161117 |
In [26]:
# Create comprehensive transcription factor analysis tables
print("Transcription Factor Analysis")
print("=" * 50)
print(f"\nDataset Overview:")
tf_overview = pd.DataFrame({
'Metric': [
'Total Unique Transcription Factors',
],
'Value': [
f"{metadata['regulator_locus_tag'].nunique():,}",
]
})
display(tf_overview)
Transcription Factor Analysis
==================================================

Dataset Overview:
| | Metric | Value |
|---|---|---|
| 0 | Total Unique Transcription Factors | 203 |
Setting Simple Filters¶
In [27]:
# Set filters for 15-minute timepoint with ZEV mechanism and P restriction
api.set_filter("hackett_2020", time=15, mechanism="ZEV", restriction="P")
# Create comprehensive filter analysis
print("Filter Analysis")
print("=" * 40)
# Show current filter
current_filter = api.get_current_filter("hackett_2020")
filter_info = pd.DataFrame({
'Filter Component': ['Time Point', 'Mechanism', 'Restriction', 'SQL Filter'],
'Value': ['15 minutes', 'ZEV', 'P (Restriction)', current_filter]
})
print("\nApplied Filters:")
display(filter_info)
# Analyze filter impact
filtered_metadata = metadata[
(metadata['time'] == 15) &
(metadata['mechanism'] == 'ZEV') &
(metadata['restriction'] == 'P')
]
# Show top TFs in filtered data
filtered_tf_summary = filtered_metadata.groupby(['regulator_locus_tag', 'regulator_symbol'])['count'].sum().sort_values(ascending=False).head(8)
tf_filtered_df = pd.DataFrame({
'Locus Tag': [idx[0] for idx in filtered_tf_summary.index],
'Symbol': [idx[1] for idx in filtered_tf_summary.index],
'Records in Subset': filtered_tf_summary.values,
})
print("\nTop Transcription Factors in Filtered Dataset:")
display(tf_filtered_df)
Filter Analysis
========================================

Applied Filters:
| | Filter Component | Value |
|---|---|---|
| 0 | Time Point | 15 minutes |
| 1 | Mechanism | ZEV |
| 2 | Restriction | P (Restriction) |
| 3 | SQL Filter | time = 15 AND mechanism = 'ZEV' AND restrictio... |
Top Transcription Factors in Filtered Dataset:
| | Locus Tag | Symbol | Records in Subset |
|---|---|---|---|
| 0 | YPL016W | SWI1 | 18525 |
| 1 | YEL009C | GCN4 | 12350 |
| 2 | YPL133C | RDS2 | 12350 |
| 3 | Z3EV | Z3EV | 12350 |
| 4 | YBL008W | HIR1 | 6175 |
| 5 | YBL021C | HAP3 | 6175 |
| 6 | YBL025W | RRN10 | 6175 |
| 7 | YBR239C | ERT1 | 6175 |
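Note how the keyword arguments passed to set_filter() are combined into a single SQL WHERE clause joined with AND, one equality test per keyword, as the truncated SQL Filter row above suggests. A minimal sketch for inspecting it:

# Sketch: inspect the WHERE clause generated from the keyword filters above
print(api.get_current_filter("hackett_2020"))
# Assumed output form: time = 15 AND mechanism = 'ZEV' AND restriction = 'P'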
Query Data with Automatic Filter Application¶
Now when we query the data, our filters will be automatically applied:
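For example, a plain SELECT with no WHERE clause now returns only rows matching the active filter (a minimal sketch; output omitted):

# Sketch: api.query() appends the active filter (time=15, ZEV, P) automatically
preview = api.query("SELECT regulator_locus_tag, regulator_symbol, time FROM hackett_2020 LIMIT 5", "hackett_2020")
display(preview)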
Setting Complex SQL Filters¶
For more sophisticated filtering, we can use the set_sql_filter() method with full SQL expressions:
In [28]:
# Set a complex filter for multiple time points and specific transcription factors
api.set_sql_filter("hackett_2020", """
time IN (15, 30) AND
mechanism = 'ZEV' AND
restriction = 'P' AND
regulator_locus_tag IN ('YER040W', 'YER028C', 'YPL016W')
""")
print("Complex SQL Filter Analysis")
print("=" * 50)
# Show filter details
filter_details = pd.DataFrame({
'Filter Component': [
'Time Points',
'Mechanism',
'Restriction',
'Selected TFs',
'Complete SQL Filter'
],
'Value': [
'15, 30 minutes',
'ZEV (overexpression)',
'P (restriction enzyme)',
'YER040W (GLN3), YER028C (MIG3), YPL016W (SWI1)',
api.get_current_filter('hackett_2020')
]
})
print("\nApplied Complex Filter:")
display(filter_details)
Complex SQL Filter Analysis
==================================================

Applied Complex Filter:
| | Filter Component | Value |
|---|---|---|
| 0 | Time Points | 15, 30 minutes |
| 1 | Mechanism | ZEV (overexpression) |
| 2 | Restriction | P (restriction enzyme) |
| 3 | Selected TFs | YER040W (GLN3), YER028C (MIG3), YPL016W (SWI1) |
| 4 | Complete SQL Filter | time IN (15, 30) AND \n mechanism = 'ZEV' A... |
In [29]:
## Query with the complex filter
In [30]:
time_comparison = api.query("""
SELECT
regulator_locus_tag,
regulator_symbol,
time,
COUNT(*) as target_count,
ROUND(AVG(log2_shrunken_timecourses), 3) as avg_response,
COUNT(CASE WHEN ABS(log2_shrunken_timecourses) > 0.5 THEN 1 END) as strong_responders,
ROUND(MAX(ABS(log2_shrunken_timecourses)), 3) as max_abs_response
FROM hackett_2020
GROUP BY regulator_locus_tag, regulator_symbol, time
ORDER BY regulator_locus_tag, time
""", "hackett_2020")
# Format display
time_comparison_display = time_comparison.copy()
time_comparison_display.columns = ['Locus Tag', 'Symbol', 'Time (min)', 'Target Count', 'Avg Response', 'Strong Responders', 'Max |Response|']
time_comparison_display['% Strong'] = (time_comparison_display['Strong Responders'] / time_comparison_display['Target Count'] * 100).round(1)
print("\nTime Course Comparison for Selected TFs:")
display(time_comparison_display)
# Summary analysis
tf_summary = time_comparison.groupby(['regulator_locus_tag', 'regulator_symbol']).agg({
'target_count': 'sum',
'strong_responders': 'sum',
'avg_response': 'mean'
}).reset_index()
tf_summary['total_%_strong'] = (tf_summary['strong_responders'] / tf_summary['target_count'] * 100).round(1)
tf_summary_display = tf_summary.copy()
tf_summary_display.columns = ['Locus Tag', 'Symbol', 'Total Targets', 'Total Strong', 'Avg Response', '% Strong Overall']
print("\nOverall TF Performance Summary:")
display(tf_summary_display)
Time Course Comparison for Selected TFs:
| | Locus Tag | Symbol | Time (min) | Target Count | Avg Response | Strong Responders | Max \|Response\| | % Strong |
|---|---|---|---|---|---|---|---|---|
| 0 | YER028C | MIG3 | 15.0 | 6175 | -0.010 | 99 | 5.894 | 1.6 |
| 1 | YER028C | MIG3 | 30.0 | 6175 | -0.028 | 246 | 5.516 | 4.0 |
| 2 | YER040W | GLN3 | 15.0 | 6175 | 0.018 | 81 | 7.923 | 1.3 |
| 3 | YER040W | GLN3 | 30.0 | 6175 | 0.042 | 631 | 10.459 | 10.2 |
| 4 | YPL016W | SWI1 | 15.0 | 18525 | 0.001 | 431 | 6.216 | 2.3 |
| 5 | YPL016W | SWI1 | 30.0 | 18525 | 0.033 | 762 | 6.753 | 4.1 |
Overall TF Performance Summary:
| | Locus Tag | Symbol | Total Targets | Total Strong | Avg Response | % Strong Overall |
|---|---|---|---|---|---|---|
| 0 | YER028C | MIG3 | 12350 | 345 | -0.019 | 2.8 |
| 1 | YER040W | GLN3 | 12350 | 712 | 0.030 | 5.8 |
| 2 | YPL016W | SWI1 | 37050 | 1193 | 0.017 | 3.2 |
Filter Management¶
The filtering system provides full control over active filters:
In [31]:
# Demonstrate filter management capabilities
print("🔧 Filter Management Demonstration")
print("=" * 50)
# Show current filter
current_filter = api.get_current_filter('hackett_2020')
print(f"\nCurrent filter:")
current_filter_df = pd.DataFrame({
'Status': ['Active Filter'],
'SQL WHERE clause': [current_filter if current_filter else 'None']
})
display(current_filter_df)
# Clear all filters and show impact
api.clear_filter("hackett_2020")
print(f"\nAfter clearing filters:")
cleared_filter_df = pd.DataFrame({
'Status': ['After Clearing'],
'SQL WHERE clause': [api.get_current_filter('hackett_2020') or 'None']
})
display(cleared_filter_df)
# Query unfiltered vs filtered data comparison
total_records = api.query("SELECT COUNT(*) as total FROM hackett_2020", "hackett_2020")
# Set filters again for comparison
api.set_filter("hackett_2020", time=15, mechanism="ZEV", restriction="P")
filtered_records = api.query("SELECT COUNT(*) as total FROM hackett_2020", "hackett_2020")
# Create comprehensive comparison table
comparison_results = pd.DataFrame({
'Dataset State': [
'Unfiltered (Full Dataset)',
'Filtered (time=15, ZEV, P)',
'Data Reduction'
],
'Total Records': [
f"{total_records.iloc[0]['total']:,}",
f"{filtered_records.iloc[0]['total']:,}",
f"{total_records.iloc[0]['total'] - filtered_records.iloc[0]['total']:,}"
],
'Percentage': [
'100.0%',
f"{(filtered_records.iloc[0]['total'] / total_records.iloc[0]['total'] * 100):.1f}%",
f"{((total_records.iloc[0]['total'] - filtered_records.iloc[0]['total']) / total_records.iloc[0]['total'] * 100):.1f}%"
]
})
print("\n📊 Dataset Size Comparison:")
display(comparison_results)
# Show filter workflow summary
workflow_summary = pd.DataFrame({
'Step': [
'1. Explore Metadata',
'2. Set Simple Filters',
'3. Set Complex SQL Filters',
'4. Query with Auto-Apply',
'5. Clear/Manage Filters'
],
'Method': [
'api.get_metadata()',
'api.set_filter(config, **kwargs)',
'api.set_sql_filter(config, sql)',
'api.query(sql, config)',
'api.clear_filter() / get_current_filter()'
],
'Purpose': [
'Understand dataset structure',
'Filter by metadata values',
'Complex multi-condition filtering',
'Analyze with automatic filtering',
'Reset or inspect current state'
]
})
print("\nComplete HfQueryAPI Workflow:")
display(workflow_summary)
🔧 Filter Management Demonstration
==================================================

Current filter:
| | Status | SQL WHERE clause |
|---|---|---|
| 0 | Active Filter | time IN (15, 30) AND \n mechanism = 'ZEV' A... |
After clearing filters:
| | Status | SQL WHERE clause |
|---|---|---|
| 0 | After Clearing | None |
📊 Dataset Size Comparison:
| | Dataset State | Total Records | Percentage |
|---|---|---|---|
| 0 | Unfiltered (Full Dataset) | 10,454,275 | 100.0% |
| 1 | Filtered (time=15, ZEV, P) | 1,012,700 | 9.7% |
| 2 | Data Reduction | 9,441,575 | 90.3% |
Complete HfQueryAPI Workflow:
| | Step | Method | Purpose |
|---|---|---|---|
| 0 | 1. Explore Metadata | api.get_metadata() | Understand dataset structure |
| 1 | 2. Set Simple Filters | api.set_filter(config, **kwargs) | Filter by metadata values |
| 2 | 3. Set Complex SQL Filters | api.set_sql_filter(config, sql) | Complex multi-condition filtering |
| 3 | 4. Query with Auto-Apply | api.query(sql, config) | Analyze with automatic filtering |
| 4 | 5. Clear/Manage Filters | api.clear_filter() / get_current_filter() | Reset or inspect current state |
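Putting it all together, the complete workflow from the table above compresses to a few calls (a recap sketch using only the methods demonstrated in this tutorial):

api = HfQueryAPI(repo_id="BrentLab/hackett_2020", repo_type="dataset")
metadata = api.get_metadata("hackett_2020")                # 1. explore metadata
api.set_filter("hackett_2020", time=15, mechanism="ZEV")   # 2. simple keyword filters
api.set_sql_filter("hackett_2020", "time IN (15, 30)")     # 3. complex SQL filter (replaces the simple one)
result = api.query("SELECT COUNT(*) AS n FROM hackett_2020", "hackett_2020")  # 4. filter applied automatically
api.clear_filter("hackett_2020")                           # 5. reset filter state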