tfbpapi Documentation

Development Commands

Testing

  • Run tests: poetry run pytest
  • Run specific test: poetry run pytest tfbpapi/tests/test_[module_name].py
  • Run tests with coverage: poetry run pytest --cov=tfbpapi

Linting and Formatting

  • Run all pre-commit checks: poetry run pre-commit run --all-files
  • Format code with Black: poetry run black tfbpapi/
  • Sort imports with isort: poetry run isort tfbpapi/
  • Type check with mypy: poetry run mypy tfbpapi/
  • Lint with flake8: poetry run flake8 tfbpapi/

Installation

  • Install dependencies: poetry install
  • Install pre-commit hooks: poetry run pre-commit install

Architecture

tfbpapi is a Python package for interfacing with a collection of genomic and transcriptomic datasets hosted on Hugging Face. The architecture provides efficient querying, caching, and metadata management across these datasets.

Core Components

  • VirtualDB (tfbpapi/virtual_db.py): Primary API for unified cross-dataset queries. Provides standardized query interface across heterogeneous datasets with varying experimental condition structures through external YAML configuration.

  • DataCard (tfbpapi/datacard.py): Interface for exploring HuggingFace dataset metadata without loading actual data. Enables dataset structure discovery, experimental condition exploration, and query planning.

  • HfCacheManager (tfbpapi/hf_cache_manager.py): Manages HuggingFace cache with intelligent downloading, DuckDB-based SQL querying, and automatic cleanup based on age/size thresholds.
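A minimal sketch of how these components might be combined is shown below; only the class names and module paths above come from the package, while the constructor arguments and the repo/config paths in the snippet are illustrative assumptions, not the confirmed API:

    # Hypothetical usage sketch -- constructor arguments and paths are assumptions.
    from tfbpapi.datacard import DataCard
    from tfbpapi.hf_cache_manager import HfCacheManager
    from tfbpapi.virtual_db import VirtualDB

    card = DataCard("org/example-dataset")   # inspect structure/conditions without downloading data
    vdb = VirtualDB("virtual_db.yaml")       # unified queries driven by an external YAML config
    cache = HfCacheManager()                 # downloads, DuckDB SQL queries, age/size-based cleanup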

Supporting Components

  • Models (tfbpapi/models.py): Pydantic models for dataset cards, configurations, features, and VirtualDB configuration (MetadataConfig, PropertyMapping, RepositoryConfig).

  • Fetchers (tfbpapi/fetchers.py): Low-level components for retrieving data from HuggingFace Hub (HfDataCardFetcher, HfRepoStructureFetcher, HfSizeInfoFetcher).
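For orientation, the general Pydantic pattern these models follow is sketched below; the class and field names in this snippet are invented for illustration and do not mirror the actual definitions in tfbpapi/models.py:

    # Illustrative Pydantic pattern only -- not the real models.py definitions.
    from pydantic import BaseModel

    class ExamplePropertyMapping(BaseModel):
        source_column: str
        standard_name: str

    mapping = ExamplePropertyMapping(source_column="cond", standard_name="condition")
    print(mapping)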

Data Types

The datasets in this collection store the following types of genomic data:

  • genomic_features: Labels and information about genomic features (e.g., parsed GTF/GFF files)
  • annotated_features: Data quantified to features, typically genes
  • genome_map: Data mapped to genome coordinates
  • metadata: Additional sample information (cell types, experimental conditions, etc.)

Data is stored in Apache Parquet format, either as single Parquet files or as Parquet datasets (directories of Parquet files).
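Because everything is plain Parquet, the files can also be inspected directly once cached locally. A minimal sketch using DuckDB (the same engine HfCacheManager uses for SQL querying); the path is a placeholder:

    import duckdb

    con = duckdb.connect()
    # A single Parquet file or a directory of Parquet files can both be queried;
    # the glob below is a placeholder for a locally cached dataset.
    df = con.execute(
        "SELECT * FROM read_parquet('annotated_features/*.parquet') LIMIT 5"
    ).df()
    print(df)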

Error Handling

  • Errors (tfbpapi/errors.py): Custom exception classes for dataset management, including HfDataFetchError, DataCardError, and DataCardValidationError.
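A sketch of how downstream code might handle these exceptions; only the class names and module path come from the package, and where each exception is actually raised is an assumption here:

    from tfbpapi.errors import DataCardError, HfDataFetchError

    try:
        # e.g. fetching a data card or repository structure from the Hub
        ...
    except HfDataFetchError as exc:
        print(f"Fetching from HuggingFace failed: {exc}")
    except DataCardError as exc:
        print(f"Problem with the dataset card: {exc}")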

Configuration

  • Uses Poetry for dependency management
  • Python 3.11+ required
  • Black formatter with 88-character line length
  • Pre-commit hooks include Black, isort, flake8, mypy, and various file checks
  • pytest with comprehensive testing support
  • Environment variables: HF_TOKEN, HF_CACHE_DIR
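For reference, a minimal way to check these environment variables from Python; the fallback cache path shown is only an assumed default, not one defined by the package:

    import os

    hf_token = os.environ.get("HF_TOKEN")  # HuggingFace access token, if required
    cache_dir = os.environ.get("HF_CACHE_DIR", os.path.expanduser("~/.cache/huggingface"))
    print(f"token set: {hf_token is not None}, cache dir: {cache_dir}")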

Testing Patterns

  • Tests use pytest with modern testing practices
  • Integration tests for HuggingFace dataset functionality
  • Test fixtures for dataset operations
  • Comprehensive error handling testing
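A generic illustration of the fixture and error-handling pattern described above; the fixture, test, and field names are invented rather than taken from the existing suite, and it assumes DataCardValidationError accepts a message string:

    import pytest

    from tfbpapi.errors import DataCardValidationError

    @pytest.fixture
    def incomplete_card() -> dict:
        # Deliberately missing required metadata fields
        return {}

    def test_incomplete_card_raises(incomplete_card):
        with pytest.raises(DataCardValidationError):
            if "dataset_name" not in incomplete_card:  # stand-in validation check
                raise DataCardValidationError("missing required field: dataset_name")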

mkdocs

Commands

After building the environment with Poetry, you can use poetry run or a Poetry shell to execute the following commands:

  • mkdocs new [dir-name] - Create a new project.
  • mkdocs serve - Start the live-reloading docs server.
  • mkdocs build - Build the documentation site.
  • mkdocs -h - Print help message and exit.

Project layout

mkdocs.yml    # The configuration file.
docs/
    index.md  # The documentation homepage.
    ...       # Other markdown pages, images and other files.

To update the gh-pages documentation, use poetry run mkdocs gh-deploy.