DataInfo Package
The datainfo package provides dataset information management for HuggingFace datasets. It enables exploration of dataset metadata, structure, and relationships without loading actual genomic data.
Overview
The datainfo package consists of three main components:
- DataCard: High-level interface for exploring dataset metadata
- Fetchers: Low-level components for retrieving data from HuggingFace Hub
- Models: Pydantic models for validation and type safety
Main Interface
DataCard
tfbpapi.datainfo.datacard.DataCard
Easy-to-use interface for exploring HuggingFace dataset metadata.
Provides methods to discover and explore dataset contents, configurations, and metadata without loading the actual genomic data.
Source code in tfbpapi/datainfo/datacard.py
configs (property)
Get all dataset configurations.
dataset_card (property)
Get the validated dataset card.
__init__(repo_id, token=None)
Initialize DataCard for a repository.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| repo_id | str | HuggingFace repository identifier (e.g., “user/dataset”) | required |
| token | str \| None | Optional HuggingFace token for authentication | None |
Source code in tfbpapi/datainfo/datacard.py
explore_config(config_name)
Get detailed information about a specific configuration.
Source code in tfbpapi/datainfo/datacard.py
get_config(config_name)
get_configs_by_type(dataset_type)
Get configurations by dataset type.
Source code in tfbpapi/datainfo/datacard.py
get_experimental_conditions(config_name=None)
Get all experimental conditions mentioned in the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config_name | str \| None | Optional specific config to search; if omitted, all configs are searched | None |
Returns:
| Type | Description |
|---|---|
| set[str] | Set of experimental conditions found |
Source code in tfbpapi/datainfo/datacard.py
get_field_values(config_name, field_name)
Get all unique values for a specific field in a configuration.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config_name | str | Configuration name | required |
| field_name | str | Field name to extract values from | required |
Returns:
| Type | Description |
|---|---|
| set[str] | Set of unique values |
Raises:
| Type | Description |
|---|---|
| DataCardError | If config or field not found |
Source code in tfbpapi/datainfo/datacard.py
get_metadata_relationships(refresh_cache=False)
Get relationships between data configs and their metadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| refresh_cache | bool | If True, force refresh dataset card from remote | False |
Source code in tfbpapi/datainfo/datacard.py
get_regulators(config_name=None)
Get all regulators mentioned in the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config_name | str \| None | Optional specific config to search; if omitted, all configs are searched | None |
Returns:
| Type | Description |
|---|---|
| set[str] | Set of regulator identifiers found |
Source code in tfbpapi/datainfo/datacard.py
get_repository_info()
Get general repository information.
Source code in tfbpapi/datainfo/datacard.py
summary()
Get a human-readable summary of the dataset.
Source code in tfbpapi/datainfo/datacard.py
The DataCard class is the primary interface for exploring HuggingFace datasets. It provides methods to:
- Discover dataset configurations and types
- Explore feature schemas and data types
- Understand metadata relationships
- Extract field values and experimental conditions
- Navigate partitioned dataset structures
Data Models
Core Models
tfbpapi.datainfo.models.DatasetCard
Bases: BaseModel
Complete dataset card model.
Source code in tfbpapi/datainfo/models.py
at_most_one_default(v) (classmethod)
Ensure at most one config is marked as default.
Source code in tfbpapi/datainfo/models.py
configs_not_empty(v) (classmethod)
Ensure at least one config is present.
get_config_by_name(name)
get_configs_by_type(dataset_type)
Get all configurations of a specific type.
get_data_configs()
get_default_config()
Get the default configuration if one exists.
get_metadata_configs()
unique_config_names(v) (classmethod)
Ensure config names are unique.
Source code in tfbpapi/datainfo/models.py
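The three card-level checks and the lookup helpers described above can be illustrated without Pydantic. What follows is a minimal stdlib-only sketch, not the package's implementation: `ConfigSketch`, its `default` flag, and the dataset type strings are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for DatasetConfig; field names are assumptions."""

    config_name: str
    dataset_type: str
    default: bool = False


def validate_configs(configs):
    """Apply the three checks: non-empty, unique names, at most one default."""
    if not configs:
        raise ValueError("At least one config is required")
    names = [c.config_name for c in configs]
    if len(names) != len(set(names)):
        raise ValueError("Config names must be unique")
    if sum(c.default for c in configs) > 1:
        raise ValueError("At most one config may be marked as default")
    return configs


def get_configs_by_type(configs, dataset_type):
    """Return every config whose type matches dataset_type."""
    return [c for c in configs if c.dataset_type == dataset_type]


def get_default_config(configs):
    """Return the default config, or None if no config is marked default."""
    return next((c for c in configs if c.default), None)


# Type labels below are made up for illustration.
configs = validate_configs([
    ConfigSketch("genome_map", "genome_map", default=True),
    ConfigSketch("metadata", "metadata"),
])
```

In the real models these checks run as Pydantic validators when a dataset card is parsed, so an invalid card fails at construction time rather than at lookup time.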
tfbpapi.datainfo.models.DatasetConfig
Bases: BaseModel
Configuration for a dataset within a repository.
Source code in tfbpapi/datainfo/models.py
applies_to_only_for_metadata(v, info) (classmethod)
Validate that applies_to is only used for metadata configs.
Source code in tfbpapi/datainfo/models.py
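The rule that `applies_to` belongs only on metadata configs might look roughly like the check below. This is an assumption-laden sketch: the function name `check_applies_to`, the `"metadata"` type label, and the error message are hypothetical, and the real validator receives its inputs through Pydantic rather than as plain arguments.

```python
from __future__ import annotations


def check_applies_to(dataset_type: str, applies_to: list[str] | None) -> list[str] | None:
    """Allow applies_to only when the config is a metadata config."""
    # "metadata" as the type label is an assumption made for this sketch.
    if applies_to is not None and dataset_type != "metadata":
        raise ValueError("applies_to is only valid for metadata configs")
    return applies_to


# A metadata config may name the data configs it describes.
allowed = check_applies_to("metadata", ["genome_map"])

# Any other config type carrying applies_to is rejected.
try:
    check_applies_to("genome_map", ["metadata"])
    rejected = False
except ValueError:
    rejected = True
```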
metadata_fields_validation(v) (classmethod)
Validate metadata_fields usage.
Source code in tfbpapi/datainfo/models.py
tfbpapi.datainfo.models.FeatureInfo
Bases: BaseModel
Information about a dataset feature/column.
Source code in tfbpapi/datainfo/models.py
get_dtype_summary()
Get a human-readable summary of the data type.
Source code in tfbpapi/datainfo/models.py
validate_dtype(v) (classmethod)
Validate and normalize dtype field.
Source code in tfbpapi/datainfo/models.py
Dataset Types
tfbpapi.datainfo.models.DatasetType
Relationship Models
tfbpapi.datainfo.models.MetadataRelationship
Bases: BaseModel
Relationship between a data config and its metadata.
Source code in tfbpapi/datainfo/models.py
tfbpapi.datainfo.models.ExtractedMetadata
Bases: BaseModel
Metadata extracted from datasets.
Source code in tfbpapi/datainfo/models.py
Data Fetchers
HuggingFace Integration
tfbpapi.datainfo.fetchers.HfDataCardFetcher
Handles fetching dataset cards from HuggingFace Hub.
Source code in tfbpapi/datainfo/fetchers.py
__init__(token=None)
Initialize the fetcher.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| token | str \| None | HuggingFace token for authentication | None |
fetch(repo_id, repo_type='dataset')
Fetch and return dataset card data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| repo_id | str | Repository identifier (e.g., “user/dataset”) | required |
| repo_type | str | Type of repository (“dataset”, “model”, “space”) | 'dataset' |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Dataset card data as dictionary |
Raises:
| Type | Description |
|---|---|
| HfDataFetchError | If fetching fails |
Source code in tfbpapi/datainfo/fetchers.py
tfbpapi.datainfo.fetchers.HfRepoStructureFetcher
Handles fetching repository structure from HuggingFace Hub.
Source code in tfbpapi/datainfo/fetchers.py
__init__(token=None)
Initialize the fetcher.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| token | str \| None | HuggingFace token for authentication | None |
Source code in tfbpapi/datainfo/fetchers.py
fetch(repo_id, force_refresh=False)
Fetch repository structure information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| repo_id | str | Repository identifier (e.g., “user/dataset”) | required |
| force_refresh | bool | If True, bypass cache and fetch fresh data | False |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Repository structure information |
Raises:
| Type | Description |
|---|---|
| HfDataFetchError | If fetching fails |
Source code in tfbpapi/datainfo/fetchers.py
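The `force_refresh` flag implies a per-repository cache that callers can bypass. A minimal sketch of that pattern follows; `CachedFetcher` and `fake_load` are hypothetical names, and the actual fetcher's cache layout and invalidation rules may differ.

```python
from typing import Any, Callable


class CachedFetcher:
    """Hypothetical illustration of a per-repository cache with an opt-out."""

    def __init__(self, load: Callable[[str], dict[str, Any]]) -> None:
        self._load = load
        self._cache: dict[str, dict[str, Any]] = {}

    def fetch(self, repo_id: str, force_refresh: bool = False) -> dict[str, Any]:
        # Reload when asked to, or when nothing is cached for this repo yet.
        if force_refresh or repo_id not in self._cache:
            self._cache[repo_id] = self._load(repo_id)
        return self._cache[repo_id]


calls: list[str] = []


def fake_load(repo_id: str) -> dict[str, Any]:
    """Stand-in for a network call; records each invocation."""
    calls.append(repo_id)
    return {"repo_id": repo_id, "files": []}


fetcher = CachedFetcher(fake_load)
fetcher.fetch("user/dataset")                      # loads
fetcher.fetch("user/dataset")                      # served from cache
fetcher.fetch("user/dataset", force_refresh=True)  # loads again
```

Leaving `force_refresh` at its default keeps repeated exploration calls cheap; passing `True` is useful after the repository has changed on the Hub.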
get_dataset_files(repo_id, path_pattern=None, force_refresh=False)
Get dataset files, optionally filtered by path pattern.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| repo_id | str | Repository identifier | required |
| path_pattern | str \| None | Optional regex pattern to filter files | None |
| force_refresh | bool | If True, bypass cache and fetch fresh data | False |
Returns:
| Type | Description |
|---|---|
| list[dict[str, Any]] | List of matching files |
Raises:
| Type | Description |
|---|---|
| HfDataFetchError | If fetching fails |
Source code in tfbpapi/datainfo/fetchers.py
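Filtering by a regex `path_pattern` can be pictured as below. The sketch assumes each file is a dictionary with a `"path"` key and that the pattern is applied with `re.search` (a match anywhere in the path); neither detail is confirmed by this page, and `filter_files` and the sample paths are made up for illustration.

```python
from __future__ import annotations

import re
from typing import Any


def filter_files(
    files: list[dict[str, Any]], path_pattern: str | None = None
) -> list[dict[str, Any]]:
    """Return files whose path matches the pattern, or all files if None."""
    if path_pattern is None:
        return list(files)
    pattern = re.compile(path_pattern)
    return [f for f in files if pattern.search(f["path"])]


# Sample listing invented for the example.
files = [
    {"path": "genome_map/accession=SRR1/part-0.parquet", "size": 1024},
    {"path": "metadata.parquet", "size": 256},
    {"path": "README.md", "size": 64},
]
parquet_files = filter_files(files, r"\.parquet$")
```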
get_partition_values(repo_id, partition_column, force_refresh=False)
Get all values for a specific partition column.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| repo_id | str | Repository identifier | required |
| partition_column | str | Name of the partition column | required |
| force_refresh | bool | If True, bypass cache and fetch fresh data | False |
Returns:
| Type | Description |
|---|---|
| list[str] | List of unique partition values |
Raises:
| Type | Description |
|---|---|
| HfDataFetchError | If fetching fails |
Source code in tfbpapi/datainfo/fetchers.py
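One plausible way to recover partition values is to parse them out of file paths. The sketch below assumes Hive-style partition directories of the form `column=value/`, which this page does not confirm; `partition_values` and the sample paths are hypothetical.

```python
import re


def partition_values(paths: list[str], partition_column: str) -> list[str]:
    """Collect the unique values of one partition column from file paths."""
    # Match "<column>=<value>/" at the start of the path or after a slash.
    pattern = re.compile(rf"(?:^|/){re.escape(partition_column)}=([^/]+)/")
    values = {m.group(1) for p in paths if (m := pattern.search(p))}
    return sorted(values)


# Sample paths invented for the example.
paths = [
    "genome_map/run_accession=SRR2/part-0.parquet",
    "genome_map/run_accession=SRR1/part-0.parquet",
    "genome_map/run_accession=SRR1/part-1.parquet",
    "metadata.parquet",
]
values = partition_values(paths, "run_accession")
```

Sorting the result is a choice made here for deterministic output; the documented return type only promises a list of unique values.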
tfbpapi.datainfo.fetchers.HfSizeInfoFetcher
Handles fetching size information from HuggingFace Dataset Server API.
Source code in tfbpapi/datainfo/fetchers.py
__init__(token=None)
Initialize the fetcher.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| token | str \| None | HuggingFace token for authentication | None |
Source code in tfbpapi/datainfo/fetchers.py
fetch(repo_id)
Fetch dataset size information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| repo_id | str | Repository identifier (e.g., “user/dataset”) | required |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Size information as dictionary |
Raises:
| Type | Description |
|---|---|
| HfDataFetchError | If fetching fails |
Source code in tfbpapi/datainfo/fetchers.py
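For orientation, the Hugging Face dataset viewer API exposes a `/size` endpoint that takes the repository as a `dataset` query parameter. The sketch below only builds such a request and sends nothing; whether `HfSizeInfoFetcher` uses exactly this endpoint and header is an assumption, and `build_size_request` is a hypothetical helper.

```python
from __future__ import annotations

from urllib.parse import urlencode

DATASETS_SERVER = "https://datasets-server.huggingface.co"


def build_size_request(
    repo_id: str, token: str | None = None
) -> tuple[str, dict[str, str]]:
    """Return the URL and headers for a size query (no network call made)."""
    query = urlencode({"dataset": repo_id})
    url = f"{DATASETS_SERVER}/size?{query}"
    # Send a bearer token only when one was supplied.
    headers = {"Authorization": f"Bearer {token}"} if token else {}
    return url, headers


url, headers = build_size_request("BrentLab/rossi_2021")
```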
Usage Examples
Basic Dataset Exploration
from tfbpapi.datainfo import DataCard
# Initialize DataCard for a repository
card = DataCard('BrentLab/rossi_2021')
# Get repository overview
repo_info = card.get_repository_info()
print(f"Dataset: {repo_info['pretty_name']}")
print(f"Configurations: {repo_info['num_configs']}")
# Explore configurations
for config in card.configs:
    print(f"{config.config_name}: {config.dataset_type.value}")
Understanding Dataset Structure
# Get detailed config information
config_info = card.explore_config('metadata')
print(f"Features: {config_info['num_features']}")
# Check for partitioned data
if 'partitioning' in config_info:
    partition_info = config_info['partitioning']
    print(f"Partitioned by: {partition_info['partition_by']}")
Metadata Relationships
# Discover metadata relationships
relationships = card.get_metadata_relationships()
for rel in relationships:
    print(f"{rel.data_config} -> {rel.metadata_config} ({rel.relationship_type})")
Integration with HfQueryAPI
The datainfo package complements HfQueryAPI: explore a dataset's structure with DataCard first, then use what you learn to load only the data you need:
from tfbpapi import HfQueryAPI
from tfbpapi.datainfo import DataCard
# Explore dataset structure first
card = DataCard('BrentLab/rossi_2021')
config_info = card.explore_config('genome_map')
# Use insights to load data efficiently
query_api = HfQueryAPI('BrentLab/rossi_2021')
data = query_api.get_pandas('genome_map',
                            filters={'run_accession': 'SRR123456'})
For a complete tutorial, see the DataCard Tutorial.