HfQueryAPI Tutorial: Metadata-Driven Data Exploration¶
This tutorial demonstrates how to use the HfQueryAPI for efficient metadata-driven exploration and querying of Hugging Face datasets, using the Hackett 2020 transcription factor overexpression dataset as an example.
Overview¶
The HfQueryAPI provides a streamlined workflow for:
- Exploring metadata to understand dataset structure
- Setting filters based on metadata values
- Querying data with automatic filter application
- Efficient SQL-based filtering on large datasets
Dataset: Hackett 2020¶
The BrentLab/hackett_2020 dataset contains transcription factor overexpression data with embedded metadata fields including:
- regulator_locus_tag, regulator_symbol: Transcription factor identifiers
- time: Time point in minutes (e.g., 0, 15, 30, 90)
- mechanism: Experimental mechanism (e.g., ZEV)
- restriction: Restriction enzyme treatment (P, NP)
- date, strain: Experimental metadata
Let's start exploring!
Initializing HfQueryAPI and Getting Metadata¶
First, we'll initialize the API client for the Hackett 2020 dataset:
In [23]:
from tfbpapi.HfQueryAPI import HfQueryAPI
import pandas as pd
# Initialize the API client
api = HfQueryAPI(repo_id="BrentLab/hackett_2020", repo_type="dataset")
print(f"Initialized HfQueryAPI for {api.repo_id}")
print(f"Repository type: {api.repo_type}")
print(f"Available configurations: {[c.config_name for c in api.configs]}")
Initialized HfQueryAPI for BrentLab/hackett_2020
Repository type: dataset
Available configurations: ['hackett_2020']
In [24]:
metadata = api.get_metadata("hackett_2020")
In [25]:
# Create summary tables for experimental conditions
print("Experimental Conditions Summary")
print("=" * 50)
# Time points summary
time_summary = pd.DataFrame({
'Time Point (min)': metadata['time'].value_counts().sort_index().index,
'Frequency': metadata['time'].value_counts().sort_index().values,
})
time_summary['% of Records'] = (time_summary['Frequency'] / time_summary['Frequency'].sum() * 100).round(1)
print("\n⏱️ Time Points:")
display(time_summary)
# Experimental conditions table
conditions_summary = pd.DataFrame({
'Condition Type': ['Mechanisms', 'Restriction Treatments', 'Strains', 'Experimental Dates'],
'Unique Values': [
metadata['mechanism'].nunique(),
metadata['restriction'].nunique(),
metadata['strain'].nunique(),
metadata['date'].nunique()
],
'Most Common': [
metadata['mechanism'].mode().iloc[0],
metadata['restriction'].mode().iloc[0],
metadata['strain'].mode().iloc[0],
f"{metadata['date'].min()} to {metadata['date'].max()}"
]
})
print("\nExperimental Design Overview:")
display(conditions_summary)
Experimental Conditions Summary
==================================================

⏱️ Time Points:
| | Time Point (min) | Frequency | % of Records |
|---|---|---|---|
| 0 | 0.0 | 217 | 12.8 |
| 1 | 2.5 | 8 | 0.5 |
| 2 | 5.0 | 212 | 12.5 |
| 3 | 7.5 | 2 | 0.1 |
| 4 | 8.0 | 2 | 0.1 |
| 5 | 10.0 | 187 | 11.0 |
| 6 | 12.5 | 2 | 0.1 |
| 7 | 15.0 | 212 | 12.5 |
| 8 | 18.0 | 1 | 0.1 |
| 9 | 20.0 | 184 | 10.9 |
| 10 | 30.0 | 216 | 12.8 |
| 11 | 45.0 | 213 | 12.6 |
| 12 | 60.0 | 20 | 1.2 |
| 13 | 90.0 | 212 | 12.5 |
| 14 | 100.0 | 2 | 0.1 |
| 15 | 120.0 | 1 | 0.1 |
| 16 | 180.0 | 1 | 0.1 |
| 17 | 290.0 | 1 | 0.1 |
Experimental Design Overview:
| | Condition Type | Unique Values | Most Common |
|---|---|---|---|
| 0 | Mechanisms | 2 | ZEV |
| 1 | Restriction Treatments | 3 | P |
| 2 | Strains | 215 | SMY2207 |
| 3 | Experimental Dates | 23 | 20150101 to 20161117 |
In [26]:
# Create comprehensive transcription factor analysis tables
print("Transcription Factor Analysis")
print("=" * 50)
print(f"\nDataset Overview:")
tf_overview = pd.DataFrame({
'Metric': [
'Total Unique Transcription Factors',
],
'Value': [
f"{metadata['regulator_locus_tag'].nunique():,}",
]
})
display(tf_overview)
Transcription Factor Analysis
==================================================

Dataset Overview:
| | Metric | Value |
|---|---|---|
| 0 | Total Unique Transcription Factors | 203 |
Setting Simple Filters¶
In [27]:
# Set filters for 15-minute timepoint with ZEV mechanism and P restriction
api.set_filter("hackett_2020", time=15, mechanism="ZEV", restriction="P")
# Create comprehensive filter analysis
print("Filter Analysis")
print("=" * 40)
# Show current filter
current_filter = api.get_current_filter("hackett_2020")
filter_info = pd.DataFrame({
'Filter Component': ['Time Point', 'Mechanism', 'Restriction', 'SQL Filter'],
'Value': ['15 minutes', 'ZEV', 'P (Restriction)', current_filter]
})
print("\nApplied Filters:")
display(filter_info)
# Analyze filter impact
filtered_metadata = metadata[
(metadata['time'] == 15) &
(metadata['mechanism'] == 'ZEV') &
(metadata['restriction'] == 'P')
]
# Show top TFs in filtered data
filtered_tf_summary = filtered_metadata.groupby(['regulator_locus_tag', 'regulator_symbol'])['count'].sum().sort_values(ascending=False).head(8)
tf_filtered_df = pd.DataFrame({
'Locus Tag': [idx[0] for idx in filtered_tf_summary.index],
'Symbol': [idx[1] for idx in filtered_tf_summary.index],
'Records in Subset': filtered_tf_summary.values,
})
print("\nTop Transcription Factors in Filtered Dataset:")
display(tf_filtered_df)
Filter Analysis
========================================

Applied Filters:
| | Filter Component | Value |
|---|---|---|
| 0 | Time Point | 15 minutes |
| 1 | Mechanism | ZEV |
| 2 | Restriction | P (Restriction) |
| 3 | SQL Filter | time = 15 AND mechanism = 'ZEV' AND restrictio... |
Top Transcription Factors in Filtered Dataset:
| | Locus Tag | Symbol | Records in Subset |
|---|---|---|---|
| 0 | YPL016W | SWI1 | 18525 |
| 1 | YEL009C | GCN4 | 12350 |
| 2 | YPL133C | RDS2 | 12350 |
| 3 | Z3EV | Z3EV | 12350 |
| 4 | YBL008W | HIR1 | 6175 |
| 5 | YBL021C | HAP3 | 6175 |
| 6 | YBL025W | RRN10 | 6175 |
| 7 | YBR239C | ERT1 | 6175 |
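Note how the keyword arguments passed to set_filter() are combined into a single SQL WHERE clause joined with AND, one equality test per keyword, as the truncated SQL Filter row above suggests. A minimal sketch for inspecting it:

# Sketch: inspect the WHERE clause generated from the keyword filters above
print(api.get_current_filter("hackett_2020"))
# Assumed output form: time = 15 AND mechanism = 'ZEV' AND restriction = 'P'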
Query Data with Automatic Filter Application¶
Now when we query the data, our filters will be automatically applied:
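For example, a plain SELECT with no WHERE clause now returns only rows matching the active filter (a minimal sketch; output omitted):

# Sketch: api.query() appends the active filter (time=15, ZEV, P) automatically
preview = api.query("SELECT regulator_locus_tag, regulator_symbol, time FROM hackett_2020 LIMIT 5", "hackett_2020")
display(preview)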
Setting Complex SQL Filters¶
For more sophisticated filtering, we can use the set_sql_filter() method with full SQL expressions:
In [28]:
# Set a complex filter for multiple time points and specific transcription factors
api.set_sql_filter("hackett_2020", """
time IN (15, 30) AND
mechanism = 'ZEV' AND
restriction = 'P' AND
regulator_locus_tag IN ('YER040W', 'YER028C', 'YPL016W')
""")
print("Complex SQL Filter Analysis")
print("=" * 50)
# Show filter details
filter_details = pd.DataFrame({
'Filter Component': [
'Time Points',
'Mechanism',
'Restriction',
'Selected TFs',
'Complete SQL Filter'
],
'Value': [
'15, 30 minutes',
'ZEV (overexpression)',
'P (restriction enzyme)',
'YER040W (GLN3), YER028C (MIG3), YPL016W (SWI1)',
api.get_current_filter('hackett_2020')
]
})
print("\nApplied Complex Filter:")
display(filter_details)
Complex SQL Filter Analysis
==================================================

Applied Complex Filter:
| | Filter Component | Value |
|---|---|---|
| 0 | Time Points | 15, 30 minutes |
| 1 | Mechanism | ZEV (overexpression) |
| 2 | Restriction | P (restriction enzyme) |
| 3 | Selected TFs | YER040W (GLN3), YER028C (MIG3), YPL016W (SWI1) |
| 4 | Complete SQL Filter | time IN (15, 30) AND \n mechanism = 'ZEV' A... |
In [29]:
## Query with the complex filter
In [30]:
time_comparison = api.query("""
SELECT
regulator_locus_tag,
regulator_symbol,
time,
COUNT(*) as target_count,
ROUND(AVG(log2_shrunken_timecourses), 3) as avg_response,
COUNT(CASE WHEN ABS(log2_shrunken_timecourses) > 0.5 THEN 1 END) as strong_responders,
ROUND(MAX(ABS(log2_shrunken_timecourses)), 3) as max_abs_response
FROM hackett_2020
GROUP BY regulator_locus_tag, regulator_symbol, time
ORDER BY regulator_locus_tag, time
""", "hackett_2020")
# Format display
time_comparison_display = time_comparison.copy()
time_comparison_display.columns = ['Locus Tag', 'Symbol', 'Time (min)', 'Target Count', 'Avg Response', 'Strong Responders', 'Max |Response|']
time_comparison_display['% Strong'] = (time_comparison_display['Strong Responders'] / time_comparison_display['Target Count'] * 100).round(1)
print("\nTime Course Comparison for Selected TFs:")
display(time_comparison_display)
# Summary analysis
tf_summary = time_comparison.groupby(['regulator_locus_tag', 'regulator_symbol']).agg({
'target_count': 'sum',
'strong_responders': 'sum',
'avg_response': 'mean'
}).reset_index()
tf_summary['total_%_strong'] = (tf_summary['strong_responders'] / tf_summary['target_count'] * 100).round(1)
tf_summary_display = tf_summary.copy()
tf_summary_display.columns = ['Locus Tag', 'Symbol', 'Total Targets', 'Total Strong', 'Avg Response', '% Strong Overall']
print("\nOverall TF Performance Summary:")
display(tf_summary_display)
Time Course Comparison for Selected TFs:
| | Locus Tag | Symbol | Time (min) | Target Count | Avg Response | Strong Responders | Max \|Response\| | % Strong |
|---|---|---|---|---|---|---|---|---|
| 0 | YER028C | MIG3 | 15.0 | 6175 | -0.010 | 99 | 5.894 | 1.6 |
| 1 | YER028C | MIG3 | 30.0 | 6175 | -0.028 | 246 | 5.516 | 4.0 |
| 2 | YER040W | GLN3 | 15.0 | 6175 | 0.018 | 81 | 7.923 | 1.3 |
| 3 | YER040W | GLN3 | 30.0 | 6175 | 0.042 | 631 | 10.459 | 10.2 |
| 4 | YPL016W | SWI1 | 15.0 | 18525 | 0.001 | 431 | 6.216 | 2.3 |
| 5 | YPL016W | SWI1 | 30.0 | 18525 | 0.033 | 762 | 6.753 | 4.1 |
Overall TF Performance Summary:
| | Locus Tag | Symbol | Total Targets | Total Strong | Avg Response | % Strong Overall |
|---|---|---|---|---|---|---|
| 0 | YER028C | MIG3 | 12350 | 345 | -0.019 | 2.8 |
| 1 | YER040W | GLN3 | 12350 | 712 | 0.030 | 5.8 |
| 2 | YPL016W | SWI1 | 37050 | 1193 | 0.017 | 3.2 |
Filter Management¶
The filtering system provides full control over active filters:
In [31]:
# Demonstrate filter management capabilities
print("🔧 Filter Management Demonstration")
print("=" * 50)
# Show current filter
current_filter = api.get_current_filter('hackett_2020')
print(f"\nCurrent filter:")
current_filter_df = pd.DataFrame({
'Status': ['Active Filter'],
'SQL WHERE clause': [current_filter if current_filter else 'None']
})
display(current_filter_df)
# Clear all filters and show impact
api.clear_filter("hackett_2020")
print(f"\nAfter clearing filters:")
cleared_filter_df = pd.DataFrame({
'Status': ['After Clearing'],
'SQL WHERE clause': [api.get_current_filter('hackett_2020') or 'None']
})
display(cleared_filter_df)
# Query unfiltered vs filtered data comparison
total_records = api.query("SELECT COUNT(*) as total FROM hackett_2020", "hackett_2020")
# Set filters again for comparison
api.set_filter("hackett_2020", time=15, mechanism="ZEV", restriction="P")
filtered_records = api.query("SELECT COUNT(*) as total FROM hackett_2020", "hackett_2020")
# Create comprehensive comparison table
comparison_results = pd.DataFrame({
'Dataset State': [
'Unfiltered (Full Dataset)',
'Filtered (time=15, ZEV, P)',
'Data Reduction'
],
'Total Records': [
f"{total_records.iloc[0]['total']:,}",
f"{filtered_records.iloc[0]['total']:,}",
f"{total_records.iloc[0]['total'] - filtered_records.iloc[0]['total']:,}"
],
'Percentage': [
'100.0%',
f"{(filtered_records.iloc[0]['total'] / total_records.iloc[0]['total'] * 100):.1f}%",
f"{((total_records.iloc[0]['total'] - filtered_records.iloc[0]['total']) / total_records.iloc[0]['total'] * 100):.1f}%"
]
})
print("\n📊 Dataset Size Comparison:")
display(comparison_results)
# Show filter workflow summary
workflow_summary = pd.DataFrame({
'Step': [
'1. Explore Metadata',
'2. Set Simple Filters',
'3. Set Complex SQL Filters',
'4. Query with Auto-Apply',
'5. Clear/Manage Filters'
],
'Method': [
'api.get_metadata()',
'api.set_filter(config, **kwargs)',
'api.set_sql_filter(config, sql)',
'api.query(sql, config)',
'api.clear_filter() / get_current_filter()'
],
'Purpose': [
'Understand dataset structure',
'Filter by metadata values',
'Complex multi-condition filtering',
'Analyze with automatic filtering',
'Reset or inspect current state'
]
})
print("\nComplete HfQueryAPI Workflow:")
display(workflow_summary)
🔧 Filter Management Demonstration
==================================================

Current filter:
| | Status | SQL WHERE clause |
|---|---|---|
| 0 | Active Filter | time IN (15, 30) AND \n mechanism = 'ZEV' A... |
After clearing filters:
| | Status | SQL WHERE clause |
|---|---|---|
| 0 | After Clearing | None |
📊 Dataset Size Comparison:
| | Dataset State | Total Records | Percentage |
|---|---|---|---|
| 0 | Unfiltered (Full Dataset) | 10,454,275 | 100.0% |
| 1 | Filtered (time=15, ZEV, P) | 1,012,700 | 9.7% |
| 2 | Data Reduction | 9,441,575 | 90.3% |
Complete HfQueryAPI Workflow:
| | Step | Method | Purpose |
|---|---|---|---|
| 0 | 1. Explore Metadata | api.get_metadata() | Understand dataset structure |
| 1 | 2. Set Simple Filters | api.set_filter(config, **kwargs) | Filter by metadata values |
| 2 | 3. Set Complex SQL Filters | api.set_sql_filter(config, sql) | Complex multi-condition filtering |
| 3 | 4. Query with Auto-Apply | api.query(sql, config) | Analyze with automatic filtering |
| 4 | 5. Clear/Manage Filters | api.clear_filter() / get_current_filter() | Reset or inspect current state |
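Putting it all together, the complete workflow from the table above compresses to a few calls (a recap sketch using only the methods demonstrated in this tutorial):

api = HfQueryAPI(repo_id="BrentLab/hackett_2020", repo_type="dataset")
metadata = api.get_metadata("hackett_2020")                # 1. explore metadata
api.set_filter("hackett_2020", time=15, mechanism="ZEV")   # 2. simple keyword filters
api.set_sql_filter("hackett_2020", "time IN (15, 30)")     # 3. complex SQL filter (replaces the simple one)
result = api.query("SELECT COUNT(*) AS n FROM hackett_2020", "hackett_2020")  # 4. filter applied automatically
api.clear_filter("hackett_2020")                           # 5. reset filter state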