VirtualDB Tutorial: Unified Cross-Dataset Queries¶
The VirtualDB class provides a unified query interface across heterogeneous datasets with different experimental condition structures and terminologies. Each dataset defines conditions in its own way, with properties at different hierarchy levels and using different naming conventions. VirtualDB uses external YAML configuration to:
- Map varying structures to a common schema
- Normalize factor level names (e.g., "D-glucose", "dextrose", "glu" all become "glucose")
- Enable cross-dataset queries with standardized field names and values
In this tutorial, we'll explore how to use VirtualDB to query and compare data across multiple datasets.
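The normalization step described above can be pictured as a reverse lookup from each raw spelling to its canonical name. The following is a minimal sketch using only the standard library, not VirtualDB's own implementation: the `FACTOR_ALIASES` table, the `build_reverse_lookup` and `normalize` helpers, and the case-insensitive matching are all assumptions made for illustration.

```python
# Hypothetical alias table: canonical name -> raw spellings seen in datasets.
FACTOR_ALIASES = {
    "carbon_source": {
        "glucose": ["D-glucose", "dextrose", "glu"],
        "galactose": ["D-galactose", "gal"],
    }
}


def build_reverse_lookup(aliases):
    """Map each raw spelling (lowercased) to its canonical name."""
    lookup = {}
    for canonical, raw_names in aliases.items():
        lookup[canonical.lower()] = canonical
        for raw in raw_names:
            lookup[raw.lower()] = canonical
    return lookup


def normalize(field, value, alias_table=FACTOR_ALIASES):
    """Return the canonical name for value, or value unchanged if no alias is known."""
    lookup = build_reverse_lookup(alias_table.get(field, {}))
    # Assumption: matching ignores case; the real library may be stricter.
    return lookup.get(value.lower(), value)


print(normalize("carbon_source", "D-glucose"))  # glucose
print(normalize("carbon_source", "dextrose"))   # glucose
print(normalize("carbon_source", "raffinose"))  # raffinose (no alias, unchanged)
```

Values without a registered alias pass through unchanged, so adding a new spelling only requires a new entry in the alias table.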
Creating a VirtualDB Specification¶
VirtualDB requires a YAML configuration file that defines:
- Which datasets to include
- How to map their fields to common names
- How to normalize factor levels
# For this tutorial, we'll create a sample configuration
# In practice, you'd load this from a YAML file
config_yaml = """
repositories:
  BrentLab/harbison_2004:
    dataset:
      harbison_2004:
        sample_id:
          field: sample_id
        carbon_source:
          field: condition
          path: media.carbon_source.compound
        temperature_celsius:
          field: condition
          path: temperature_celsius
          dtype: numeric
        environmental_condition:
          field: condition
        comparative_analyses:
          - repo: BrentLab/yeast_comparative_analysis
            dataset: dto
            via_field: binding_id
  BrentLab/kemmeren_2014:
    dataset:
      kemmeren_2014:
        sample_id:
          field: sample_id
        carbon_source:
          path: media.carbon_source.compound
        temperature_celsius:
          path: temperature_celsius
          dtype: numeric
        comparative_analyses:
          - repo: BrentLab/yeast_comparative_analysis
            dataset: dto
            via_field: perturbation_id
factor_aliases:
  carbon_source:
    glucose: [D-glucose, dextrose, glu]
    galactose: [D-galactose, gal]
    raffinose: [D-raffinose]
missing_value_labels:
  carbon_source: "unspecified"
description:
  carbon_source: The carbon source provided during growth
  temperature_celsius: Growth temperature in degrees Celsius
  environmental_condition: Named environmental condition
"""
# Save config to temporary file
import tempfile
from pathlib import Path
temp_config = Path(tempfile.mkdtemp()) / "vdb_config.yaml"
temp_config.write_text(config_yaml)
print(f"Configuration saved to: {temp_config}")
Configuration saved to: /tmp/tmpnetd9hv1/vdb_config.yaml
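The `path` entries in the configuration, such as `media.carbon_source.compound`, read like dot-separated paths into a nested condition definition. As a rough sketch of that idea, and assuming the path is simply split on dots and followed key by key, a resolver could look like the following. The `resolve_path` helper and the `condition_definition` dictionary are hypothetical and are not part of tfbpapi.

```python
def resolve_path(record, dotted_path, default=None):
    """Follow a dot-separated path through nested dicts; return default if a key is missing."""
    current = record
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical condition definition, invented for illustration.
condition_definition = {
    "media": {"carbon_source": {"compound": "D-glucose"}},
    "temperature_celsius": 30,
}

print(resolve_path(condition_definition, "media.carbon_source.compound"))  # D-glucose
print(resolve_path(condition_definition, "temperature_celsius"))  # 30
# A missing branch falls back to the default, similar in spirit to a missing-value label.
print(resolve_path(condition_definition, "media.nitrogen_source.compound", "unspecified"))  # unspecified
```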
from tfbpapi.virtual_db import VirtualDB
# Initialize VirtualDB with the configuration
vdb = VirtualDB(str(temp_config))
print("VirtualDB initialized successfully!")
print(f"Configured repositories: {len(vdb.config.repositories)}")
VirtualDB initialized successfully!
Configured repositories: 2
Schema Discovery¶
The VirtualDB class provides methods to inspect the unified schema after loading the configuration.
# Get all fields defined in any dataset
all_fields = vdb.get_fields()
print("All available fields:")
for field in sorted(all_fields):
    print(f" - {field}")
All available fields:
- carbon_source
- environmental_condition
- sample_id
- temperature_celsius
# Get fields present in ALL datasets (common fields)
common_fields = vdb.get_common_fields()
print("Common fields (present in all datasets):")
for field in sorted(common_fields):
    print(f" - {field}")
Common fields (present in all datasets):
- carbon_source
- temperature_celsius
# Get fields that may be used to filter two or more datasets at a time
comp_info = vdb.get_comparative_analyses()
print("Datasets with comparative data\n")
for primary_dataset, comparatives in sorted(comp_info["primary_to_comparative"].items()):
    print(f"\n{primary_dataset}:")
    for comp in comparatives:
        comp_key = f"{comp['comparative_repo']}/{comp['comparative_dataset']}"
        print(f" - {comp_key}")
        print(f" via field: {comp['via_field']}")
        num_fields = len(comp_info["comparative_fields"].get(comp_key, []))
        print(f" fields available: {num_fields}")
# Show fields available from comparative datasets
print("Comparative data fields")
for comp_dataset, fields in sorted(comp_info["comparative_fields"].items()):
    print(f"\n{comp_dataset}:")
    if fields:
        # Print in columns for better readability
        fields_sorted = sorted(fields)
        for i in range(0, len(fields_sorted), 3):
            row_fields = fields_sorted[i:i + 3]
            print(" " + " ".join(f"{f:<28}" for f in row_fields))
    else:
        print(" (no fields found)")
Datasets with comparative data
BrentLab/harbison_2004/harbison_2004:
- BrentLab/yeast_comparative_analysis/dto
via field: binding_id
fields available: 8
BrentLab/kemmeren_2014/kemmeren_2014:
- BrentLab/yeast_comparative_analysis/dto
via field: perturbation_id
fields available: 8
Comparative data fields
BrentLab/yeast_comparative_analysis/dto:
binding_id binding_rank_threshold binding_set_size
dto_empirical_pvalue dto_fdr perturbation_id
perturbation_rank_threshold perturbation_set_size
Discovering Valid Values¶
VirtualDB can tell you what values exist for each field.
# Get all unique values for a field (normalized)
carbon_source_factor_levels = vdb.get_unique_values("carbon_source")
print("Unique carbon sources (normalized):")
for source in sorted(carbon_source_factor_levels):
    print(f" - {source}")
Unique carbon sources (normalized):
- galactose
- glucose
- raffinose
- unspecified
# Get values broken down by dataset
carbon_by_dataset = vdb.get_unique_values("carbon_source", by_dataset=True)
print("Carbon sources by dataset:")
for dataset, sources in carbon_by_dataset.items():
    print(f"\n{dataset}:")
    for source in sorted(sources):
        print(f" - {source}")
Carbon sources by dataset:
BrentLab/harbison_2004/harbison_2004:
- galactose
- glucose
- raffinose
- unspecified
BrentLab/kemmeren_2014/kemmeren_2014:
- glucose
Simple Queries¶
Now let's start querying data. The query() method is the primary interface.
Basic Query: All Samples with Glucose¶
By default, queries return sample-level data (one row per sample) with all configured fields.
# Query all datasets for samples grown on glucose
glucose_samples = vdb.query(filters={"carbon_source": "glucose"})
print(f"Found {len(glucose_samples)} samples with glucose")
print(f"\nColumns: {list(glucose_samples.columns)}")
print(f"\nFirst few rows:")
glucose_samples.head()
Found 1797 samples with glucose
Columns: ['sample_id', 'regulator_locus_tag', 'regulator_symbol', 'condition', 'carbon_source', 'temperature_celsius', 'dataset_id']
First few rows:
| | sample_id | regulator_locus_tag | regulator_symbol | condition | carbon_source | temperature_celsius | dataset_id |
|---|---|---|---|---|---|---|---|
| 0 | 1 | YSC0017 | MATA1 | YPD | glucose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
| 1 | 2 | YAL051W | OAF1 | YPD | glucose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
| 2 | 3 | YBL005W | PDR3 | YPD | glucose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
| 3 | 4 | YBL008W | HIR1 | YPD | glucose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
| 4 | 5 | YBL021C | HAP3 | YPD | glucose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
Query Specific Datasets¶
Limit your query to specific datasets using the datasets parameter.
# Query only harbison_2004
harbison_glucose = vdb.query(
    filters={"carbon_source": "glucose"},
    datasets=[("BrentLab/harbison_2004", "harbison_2004")]
)
print(f"Found {len(harbison_glucose)} samples from harbison_2004")
harbison_glucose.head()
Found 310 samples from harbison_2004
| | sample_id | regulator_locus_tag | regulator_symbol | condition | carbon_source | temperature_celsius | dataset_id |
|---|---|---|---|---|---|---|---|
| 0 | 1 | YSC0017 | MATA1 | YPD | glucose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
| 1 | 2 | YAL051W | OAF1 | YPD | glucose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
| 2 | 3 | YBL005W | PDR3 | YPD | glucose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
| 3 | 4 | YBL008W | HIR1 | YPD | glucose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
| 4 | 5 | YBL021C | HAP3 | YPD | glucose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
Select Specific Fields¶
Return only the fields you need with the fields parameter.
# Get just sample_id, carbon_source, and temperature
minimal_data = vdb.query(
    filters={"carbon_source": "glucose"},
    fields=["sample_id", "carbon_source", "temperature_celsius"]
)
print(f"Columns: {list(minimal_data.columns)}")
minimal_data.head()
Columns: ['sample_id', 'carbon_source', 'temperature_celsius', 'dataset_id']
| | sample_id | carbon_source | temperature_celsius | dataset_id |
|---|---|---|---|---|
| 0 | 1 | glucose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
| 1 | 2 | glucose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
| 2 | 3 | glucose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
| 3 | 4 | glucose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
| 4 | 5 | glucose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
Advanced Queries¶
VirtualDB supports more sophisticated query patterns.
Multiple Filter Conditions¶
# Samples with glucose at 30C
glucose_30c = vdb.query(
    filters={
        "carbon_source": "glucose",
        "temperature_celsius": 30
    }
)
print(f"Found {len(glucose_30c)} samples with glucose at 30C")
glucose_30c.head()
Found 1791 samples with glucose at 30C
| | sample_id | regulator_locus_tag | regulator_symbol | condition | carbon_source | temperature_celsius | dataset_id |
|---|---|---|---|---|---|---|---|
| 0 | 1 | YSC0017 | MATA1 | YPD | glucose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
| 1 | 2 | YAL051W | OAF1 | YPD | glucose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
| 2 | 3 | YBL005W | PDR3 | YPD | glucose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
| 3 | 4 | YBL008W | HIR1 | YPD | glucose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
| 4 | 5 | YBL021C | HAP3 | YPD | glucose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
Numeric Range Queries¶
# Samples at temperature >= 30C
warm_samples = vdb.query(
    filters={"temperature_celsius": (">=", 30)}
)
print(f"Found {len(warm_samples)} samples at >= 30C")
# Samples between 28C and 32C
moderate_temp = vdb.query(
    filters={"temperature_celsius": ("between", 28, 32)}
)
print(f"Found {len(moderate_temp)} samples between 28-32C")
Found 1833 samples at >= 30C
Found 1833 samples between 28-32C
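To make the tuple-style filter specifications above more concrete, here is a small standard-library sketch of how such a spec could be evaluated against a single value. It is not VirtualDB's implementation: the `matches` helper and the operator table are hypothetical, the operator set is broader than what the tutorial shows, and treating `"between"` as inclusive on both ends is an assumption.

```python
import operator

# Comparison functions keyed by the symbols used in the filter tuples.
_OPERATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
}


def matches(value, spec):
    """Return True if value satisfies spec.

    spec is a bare value (equality), (op, threshold), or ("between", low, high).
    """
    if value is None:
        return False  # missing values never match in this sketch
    if not isinstance(spec, tuple):
        return value == spec
    if spec[0] == "between":
        _, low, high = spec
        return low <= value <= high  # assumption: both bounds are inclusive
    op, threshold = spec
    return _OPERATORS[op](value, threshold)


temperatures = [25.0, 30.0, 30.0, 37.0, None]
print([t for t in temperatures if matches(t, (">=", 30))])           # [30.0, 30.0, 37.0]
print([t for t in temperatures if matches(t, ("between", 28, 32))])  # [30.0, 30.0]
print([t for t in temperatures if matches(t, 30)])                   # [30.0, 30.0]
```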
Factor Alias Expansion¶
When you query for a normalized value, VirtualDB automatically expands it to all of its original aliases, so rows recorded under any alias are matched.
# Query for "galactose" matches "D-galactose", "gal", and "galactose"
galactose_samples = vdb.query(filters={"carbon_source": "galactose"})
print(galactose_samples)
| | sample_id | regulator_locus_tag | regulator_symbol | condition | carbon_source | temperature_celsius | dataset_id |
|---|---|---|---|---|---|---|---|
| 0 | 68 | YDR277C | MTH1 | GAL | galactose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
| 1 | 112 | YGL035C | MIG1 | GAL | galactose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
| 2 | 197 | YKL038W | RGT1 | GAL | galactose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
| 3 | 335 | YPL248C | GAL4 | GAL | galactose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
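Alias expansion is the inverse of normalization: a canonical value in a filter is widened to the set of raw spellings that should match. The sketch below illustrates this on invented rows. The `ALIASES` table, the `expand` helper, and `raw_rows` are hypothetical and only mirror the idea, not the library's code.

```python
# Hypothetical alias table: canonical name -> raw spellings.
ALIASES = {
    "galactose": ["D-galactose", "gal"],
    "glucose": ["D-glucose", "dextrose", "glu"],
}


def expand(value, aliases):
    """Return every spelling that should match a canonical value, including itself."""
    return {value, *aliases.get(value, [])}


# Invented raw rows, as they might appear before normalization.
raw_rows = [
    {"sample_id": 1, "carbon_source": "D-galactose"},
    {"sample_id": 2, "carbon_source": "gal"},
    {"sample_id": 3, "carbon_source": "D-glucose"},
    {"sample_id": 4, "carbon_source": "galactose"},
]

wanted = expand("galactose", ALIASES)
hits = [row["sample_id"] for row in raw_rows if row["carbon_source"] in wanted]
print(hits)  # [1, 2, 4]
```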
Complete Data Retrieval¶
By default, query() returns sample-level metadata (one row per sample).
Set complete=True to get all measurements (many rows per sample).
# Get complete data with measurements
complete_data = vdb.query(
    filters={"carbon_source": "glucose"},
    datasets=[("BrentLab/harbison_2004", "harbison_2004")],
    complete=True
)
print(f"Complete data: {len(complete_data)} rows")
print(f"Columns: {list(complete_data.columns)}")
print("\nFirst few measurements:")
complete_data.head()
Complete data: 1930060 rows
Columns: ['sample_id', 'db_id', 'target_locus_tag', 'target_symbol', 'effect', 'pvalue', 'regulator_locus_tag', 'regulator_symbol', 'condition', 'carbon_source', 'temperature_celsius', 'dataset_id']
First few measurements:
| | sample_id | db_id | target_locus_tag | target_symbol | effect | pvalue | regulator_locus_tag | regulator_symbol | condition | carbon_source | temperature_celsius | dataset_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | YAL001C | TFC3 | 1.697754 | 0.068705 | YSC0017 | MATA1 | YPD | glucose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
| 1 | 1 | 0.0 | YAL002W | VPS8 | NaN | NaN | YSC0017 | MATA1 | YPD | glucose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
| 2 | 1 | 0.0 | YAL003W | EFB1 | NaN | NaN | YSC0017 | MATA1 | YPD | glucose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
| 3 | 1 | 0.0 | YAL004W | YAL004W | 0.745342 | 0.835929 | YSC0017 | MATA1 | YPD | glucose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
| 4 | 1 | 0.0 | YAL005C | SSA1 | NaN | NaN | YSC0017 | MATA1 | YPD | glucose | 30.0 | BrentLab/harbison_2004/harbison_2004 |
# You can combine complete=True with field selection
# Get just the binding data columns
binding_data = vdb.query(
    filters={"carbon_source": "glucose"},
    datasets=[("BrentLab/harbison_2004", "harbison_2004")],
    fields=["sample_id", "regulator_symbol", "target_symbol", "effect", "pvalue"],
    complete=True
)
print(f"Binding data: {len(binding_data)} measurements")
binding_data.head(10)
Binding data: 1930060 measurements
| | sample_id | regulator_symbol | target_symbol | effect | pvalue | dataset_id |
|---|---|---|---|---|---|---|
| 0 | 2 | OAF1 | TFC3 | 1.589564 | 0.088986 | BrentLab/harbison_2004/harbison_2004 |
| 1 | 2 | OAF1 | VPS8 | 1.141321 | 0.324805 | BrentLab/harbison_2004/harbison_2004 |
| 2 | 2 | OAF1 | EFB1 | 0.729120 | 0.878824 | BrentLab/harbison_2004/harbison_2004 |
| 3 | 2 | OAF1 | YAL004W | 1.167904 | 0.282253 | BrentLab/harbison_2004/harbison_2004 |
| 4 | 2 | OAF1 | SSA1 | 0.729120 | 0.878824 | BrentLab/harbison_2004/harbison_2004 |
| 5 | 2 | OAF1 | ERP2 | 1.050827 | 0.430707 | BrentLab/harbison_2004/harbison_2004 |
| 6 | 2 | OAF1 | FUN14 | 1.347876 | 0.155511 | BrentLab/harbison_2004/harbison_2004 |
| 7 | 2 | OAF1 | SPO7 | 0.939673 | 0.578234 | BrentLab/harbison_2004/harbison_2004 |
| 8 | 2 | OAF1 | MDM10 | 0.939673 | 0.578234 | BrentLab/harbison_2004/harbison_2004 |
| 9 | 2 | OAF1 | SWC3 | 0.865667 | 0.671119 | BrentLab/harbison_2004/harbison_2004 |
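The difference between sample-level and complete results can be thought of as a join between one metadata row per sample and many measurement rows per sample. The following standard-library sketch shows that relationship on invented data. The `complete_rows` helper and all values are hypothetical, and this is only one plausible way to picture the behaviour, not how the library computes it.

```python
# Invented sample-level metadata: one row per sample.
samples = [
    {"sample_id": 1, "regulator_symbol": "MATA1", "carbon_source": "glucose"},
    {"sample_id": 2, "regulator_symbol": "OAF1", "carbon_source": "glucose"},
]

# Invented measurements: many rows per sample.
measurements = [
    {"sample_id": 1, "target_symbol": "TFC3", "effect": 1.7},
    {"sample_id": 1, "target_symbol": "VPS8", "effect": None},
    {"sample_id": 2, "target_symbol": "TFC3", "effect": 1.6},
]


def complete_rows(samples, measurements):
    """Attach each sample's metadata to every one of its measurement rows."""
    by_id = {sample["sample_id"]: sample for sample in samples}
    return [
        {**by_id[m["sample_id"]], **m}
        for m in measurements
        if m["sample_id"] in by_id
    ]


rows = complete_rows(samples, measurements)
print(len(samples), len(rows))  # 2 3
print(rows[0])
```

Because every measurement row repeats its sample's metadata, the complete result grows with the number of targets per sample, which is why selecting only the needed fields helps keep it manageable.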
Example analysis¶
The following is an example of using VirtualDB to extract and summarize data across datasets.
# Compare number of samples by carbon source across datasets
# Get all samples
all_samples = vdb.query()
# Count by dataset and carbon source
summary = all_samples.groupby(['dataset_id', 'carbon_source']).size()
summary = summary.reset_index(name='num_samples')
print("Sample counts by dataset and carbon source:")
print(summary.to_string(index=False))
Sample counts by dataset and carbon source:
dataset_id carbon_source num_samples
BrentLab/harbison_2004/harbison_2004 galactose 4
BrentLab/harbison_2004/harbison_2004 glucose 310
BrentLab/harbison_2004/harbison_2004 raffinose 1
BrentLab/harbison_2004/harbison_2004 unspecified 37
BrentLab/kemmeren_2014/kemmeren_2014 glucose 1487
# Compare glucose experiments at different temperatures
glucose_by_temp = vdb.query(
    filters={"carbon_source": "glucose"},
    fields=["sample_id", "temperature_celsius", "environmental_condition"]
)
# Count samples by temperature
temp_counts = glucose_by_temp['temperature_celsius'].value_counts().sort_index()
print("Glucose samples by temperature:")
for temp, count in temp_counts.items():
    print(f" {temp}C: {count} samples")
Glucose samples by temperature:
30.0C: 1791 samples
# Get binding data for a specific regulator across datasets
# Query for FHL1 binding in glucose conditions
fhl1_binding = vdb.query(
    filters={
        "carbon_source": "glucose",
        "regulator_symbol": "FHL1"
    },
    fields=["sample_id", "regulator_symbol", "target_symbol", "effect", "pvalue"],
    complete=True
)
print(f"Found {len(fhl1_binding)} FHL1 binding measurements in glucose")
# Find significant targets (p < 0.001)
significant = fhl1_binding[fhl1_binding['pvalue'] < 0.001]
print(f"Significant targets: {len(significant)}")
# Top 10 by effect size
top_targets = significant.nlargest(10, 'effect')[['target_symbol', 'effect', 'pvalue']]
print("\nTop 10 targets by effect size:")
print(top_targets.to_string(index=False))
Found 18678 FHL1 binding measurements in glucose
Significant targets: 379
Top 10 targets by effect size:
target_symbol effect pvalue
RPS5 24.145013 9.739702e-09
RPL11A 20.585725 1.232356e-08
PRE2 20.585725 1.232356e-08
SRF1 20.342898 1.226799e-08
SLX8 20.057080 1.513076e-08
RPL23B 20.057080 1.513076e-08
RPL40A 19.262139 1.761808e-08
MLP2 19.262139 1.761808e-08
RPS6A 18.704379 1.544172e-08
RPL22A 17.926705 1.560357e-08
Querying Comparative Datasets¶
Comparative datasets such as DTO (Dual Threshold Optimization) contain analysis results that relate samples across multiple datasets. Their fields can be requested, or used as filters, while querying a primary dataset to find significant cross-dataset relationships.
# Query harbison_2004 binding data enriched with DTO metrics
# This demonstrates field-based joins: requesting dto_fdr field
# while querying the primary binding dataset
binding_with_dto = vdb.query(
    datasets=[("BrentLab/harbison_2004", "harbison_2004")],
    filters={"regulator_symbol": "FHL1"},
    fields=["sample_id", "regulator_symbol", "condition", "dto_fdr", "binding_id", "perturbation_id"],
)
print(f"Found {len(binding_with_dto)} FHL1 binding measurements")
print(f"\nColumns: {list(binding_with_dto.columns)}")
print(f"\nRows with DTO data: {binding_with_dto['dto_fdr'].notna().sum()}")
print(f"\nFirst few results:")
binding_with_dto
Found 32 FHL1 binding measurements
Columns: ['sample_id', 'regulator_symbol', 'condition', 'dto_fdr', 'perturbation_id', 'dataset_id']
Rows with DTO data: 4
First few results:
| | sample_id | regulator_symbol | condition | dto_fdr | perturbation_id | dataset_id |
|---|---|---|---|---|---|---|
| 0 | 345 | FHL1 | H2O2Hi | 0.454909 | BrentLab/Hackett_2020;hackett_2020;1666 | BrentLab/harbison_2004/harbison_2004 |
| 1 | 345 | FHL1 | H2O2Hi | NaN | BrentLab/Hackett_2020;hackett_2020;1665 | BrentLab/harbison_2004/harbison_2004 |
| 2 | 345 | FHL1 | H2O2Hi | NaN | BrentLab/Hackett_2020;hackett_2020;1667 | BrentLab/harbison_2004/harbison_2004 |
| 3 | 345 | FHL1 | H2O2Hi | NaN | BrentLab/Hackett_2020;hackett_2020;1669 | BrentLab/harbison_2004/harbison_2004 |
| 4 | 345 | FHL1 | H2O2Hi | NaN | BrentLab/Hackett_2020;hackett_2020;1663 | BrentLab/harbison_2004/harbison_2004 |
| 5 | 345 | FHL1 | H2O2Hi | NaN | BrentLab/Hackett_2020;hackett_2020;1664 | BrentLab/harbison_2004/harbison_2004 |
| 6 | 345 | FHL1 | H2O2Hi | NaN | BrentLab/Hackett_2020;hackett_2020;1670 | BrentLab/harbison_2004/harbison_2004 |
| 7 | 345 | FHL1 | H2O2Hi | NaN | BrentLab/Hackett_2020;hackett_2020;1668 | BrentLab/harbison_2004/harbison_2004 |
| 8 | 346 | FHL1 | RAPA | NaN | BrentLab/Hackett_2020;hackett_2020;1667 | BrentLab/harbison_2004/harbison_2004 |
| 9 | 346 | FHL1 | RAPA | NaN | BrentLab/Hackett_2020;hackett_2020;1663 | BrentLab/harbison_2004/harbison_2004 |
| 10 | 346 | FHL1 | RAPA | NaN | BrentLab/Hackett_2020;hackett_2020;1670 | BrentLab/harbison_2004/harbison_2004 |
| 11 | 346 | FHL1 | RAPA | NaN | BrentLab/Hackett_2020;hackett_2020;1668 | BrentLab/harbison_2004/harbison_2004 |
| 12 | 346 | FHL1 | RAPA | 0.000000 | BrentLab/Hackett_2020;hackett_2020;1666 | BrentLab/harbison_2004/harbison_2004 |
| 13 | 346 | FHL1 | RAPA | NaN | BrentLab/Hackett_2020;hackett_2020;1669 | BrentLab/harbison_2004/harbison_2004 |
| 14 | 346 | FHL1 | RAPA | NaN | BrentLab/Hackett_2020;hackett_2020;1664 | BrentLab/harbison_2004/harbison_2004 |
| 15 | 346 | FHL1 | RAPA | NaN | BrentLab/Hackett_2020;hackett_2020;1665 | BrentLab/harbison_2004/harbison_2004 |
| 16 | 347 | FHL1 | SM | NaN | BrentLab/Hackett_2020;hackett_2020;1667 | BrentLab/harbison_2004/harbison_2004 |
| 17 | 347 | FHL1 | SM | 0.022196 | BrentLab/Hackett_2020;hackett_2020;1666 | BrentLab/harbison_2004/harbison_2004 |
| 18 | 347 | FHL1 | SM | NaN | BrentLab/Hackett_2020;hackett_2020;1669 | BrentLab/harbison_2004/harbison_2004 |
| 19 | 347 | FHL1 | SM | NaN | BrentLab/Hackett_2020;hackett_2020;1664 | BrentLab/harbison_2004/harbison_2004 |
| 20 | 347 | FHL1 | SM | NaN | BrentLab/Hackett_2020;hackett_2020;1663 | BrentLab/harbison_2004/harbison_2004 |
| 21 | 347 | FHL1 | SM | NaN | BrentLab/Hackett_2020;hackett_2020;1670 | BrentLab/harbison_2004/harbison_2004 |
| 22 | 347 | FHL1 | SM | NaN | BrentLab/Hackett_2020;hackett_2020;1668 | BrentLab/harbison_2004/harbison_2004 |
| 23 | 347 | FHL1 | SM | NaN | BrentLab/Hackett_2020;hackett_2020;1665 | BrentLab/harbison_2004/harbison_2004 |
| 24 | 348 | FHL1 | YPD | NaN | BrentLab/Hackett_2020;hackett_2020;1664 | BrentLab/harbison_2004/harbison_2004 |
| 25 | 348 | FHL1 | YPD | 0.089578 | BrentLab/Hackett_2020;hackett_2020;1666 | BrentLab/harbison_2004/harbison_2004 |
| 26 | 348 | FHL1 | YPD | NaN | BrentLab/Hackett_2020;hackett_2020;1663 | BrentLab/harbison_2004/harbison_2004 |
| 27 | 348 | FHL1 | YPD | NaN | BrentLab/Hackett_2020;hackett_2020;1667 | BrentLab/harbison_2004/harbison_2004 |
| 28 | 348 | FHL1 | YPD | NaN | BrentLab/Hackett_2020;hackett_2020;1669 | BrentLab/harbison_2004/harbison_2004 |
| 29 | 348 | FHL1 | YPD | NaN | BrentLab/Hackett_2020;hackett_2020;1665 | BrentLab/harbison_2004/harbison_2004 |
| 30 | 348 | FHL1 | YPD | NaN | BrentLab/Hackett_2020;hackett_2020;1670 | BrentLab/harbison_2004/harbison_2004 |
| 31 | 348 | FHL1 | YPD | NaN | BrentLab/Hackett_2020;hackett_2020;1668 | BrentLab/harbison_2004/harbison_2004 |
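The `perturbation_id` values in the table above appear to be composite identifiers with three semicolon-separated parts. Assuming those parts are repository, dataset, and sample ID (an interpretation based only on the visible values), they can be split apart as follows. The `CompositeId` type and `parse_composite_id` helper are hypothetical and are not part of tfbpapi.

```python
from typing import NamedTuple


class CompositeId(NamedTuple):
    """Assumed layout of a composite identifier: repo;dataset;sample_id."""

    repo: str
    dataset: str
    sample_id: str


def parse_composite_id(value):
    """Split a 'repo;dataset;sample_id' string into its three parts."""
    parts = value.split(";")
    if len(parts) != 3:
        raise ValueError(f"expected 3 ';'-separated parts, got {len(parts)}: {value!r}")
    return CompositeId(*parts)


parsed = parse_composite_id("BrentLab/Hackett_2020;hackett_2020;1666")
print(parsed.repo)       # BrentLab/Hackett_2020
print(parsed.dataset)    # hackett_2020
print(parsed.sample_id)  # 1666
```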
# You can also filter on comparative dataset fields
# This returns only binding measurements with significant DTO results
significant_dtos = vdb.query(
    datasets=[("BrentLab/harbison_2004", "harbison_2004")],
    filters={
        "regulator_symbol": "FHL1",
        # The threshold is deliberately loose because FHL1 has no significant results in harbison_2004
        "dto_empirical_pvalue": ("<", 0.5)
    },
    fields=["sample_id", "regulator_symbol", "target_symbol", "perturbation_id", "dto_empirical_pvalue"],
)
significant_dtos
| | sample_id | regulator_symbol | perturbation_id | dto_empirical_pvalue | dataset_id |
|---|---|---|---|---|---|
| 0 | 347 | FHL1 | BrentLab/Hackett_2020;hackett_2020;1666 | 0.297 | BrentLab/harbison_2004/harbison_2004 |
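One way to picture the two comparative queries above is as a left join followed by an optional filter: rows of the primary dataset without a comparative match carry a missing value, and filtering on a comparative field drops them. The sketch below uses invented rows and a hypothetical `left_join` helper. It is an illustration of the general technique, not a description of how VirtualDB performs the join.

```python
def left_join(primary, comparative, key, fill_fields):
    """Left-join comparative rows onto primary rows; unmatched rows get None."""
    index = {}
    for row in comparative:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in primary:
        matched = index.get(row[key], [])
        if matched:
            joined.extend({**row, **match} for match in matched)
        else:
            joined.append({**row, **{name: None for name in fill_fields}})
    return joined


# Invented rows; the key values and p-values are made up for illustration.
primary = [
    {"binding_id": "b1", "regulator_symbol": "FHL1", "condition": "SM"},
    {"binding_id": "b2", "regulator_symbol": "FHL1", "condition": "YPD"},
    {"binding_id": "b3", "regulator_symbol": "FHL1", "condition": "RAPA"},
]
comparative = [
    {"binding_id": "b1", "dto_empirical_pvalue": 0.297},
    {"binding_id": "b2", "dto_empirical_pvalue": 0.81},
]

joined = left_join(primary, comparative, "binding_id", ["dto_empirical_pvalue"])
# Filtering on the comparative field removes unmatched rows and rows above the threshold.
kept = [
    row for row in joined
    if row["dto_empirical_pvalue"] is not None and row["dto_empirical_pvalue"] < 0.5
]
print(len(joined))                          # 3
print([row["condition"] for row in kept])  # ['SM']
```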