stratified_cv_r2

R² calculation with stratified cross-validation for tfbpmodeling.

tfbpmodeling.stratified_cv_r2

stratified_cv_r2

stratified_cv_r2(
    y,
    X,
    classes,
    estimator=LinearRegression(fit_intercept=True),
    skf=StratifiedKFold(
        n_splits=4, shuffle=True, random_state=42
    ),
    **kwargs
)

Calculate the average R² across stratified cross-validation folds for a given estimator and data. By default, the estimator is LinearRegression(fit_intercept=True) and the splitter is a 4-fold StratifiedKFold with shuffle=True and random_state=42. Because fit_intercept is explicitly set to True, the data IS NOT expected to be centered and there should not be a constant column in X. Additional keyword arguments (**kwargs) are accepted, but the source shown on this page does not use them.
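
To illustrate the intercept note, here is a minimal sketch using only scikit-learn and synthetic data. The column names and values are made up for illustration and are not part of tfbpmodeling. It fits the default estimator to uncentered data that has no constant column and recovers the offset through the fitted intercept.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic, uncentered data: y = 5 + 2*x exactly (illustrative only)
X = pd.DataFrame({"x": np.arange(10, dtype=float)})
y = pd.DataFrame({"response": 5.0 + 2.0 * X["x"]})

# fit_intercept=True estimates the offset, so X needs no column of ones
model = LinearRegression(fit_intercept=True).fit(X, y)

# With a one-column DataFrame as y, intercept_ has shape (1,) and coef_ (1, 1)
intercept = float(model.intercept_[0])
slope = float(model.coef_[0, 0])
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```

Adding a column of ones to X in this setting would duplicate the intercept term, which is why the description asks for X without a constant column.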

Parameters:
  • y (DataFrame) –

    The response variable. See generate_modeling_data()

  • X (DataFrame) –

    The predictor variables. See generate_modeling_data()

  • classes (ndarray) –

    The stratification class label for each row of the data. The folds are built from these labels, not from y.

  • estimator (BaseEstimator, default: LinearRegression(fit_intercept=True) ) –

    The estimator to be used in the modeling. By default, this is a LinearRegression() model. The estimator is cloned before fitting, so the object passed in is not modified.

  • skf (StratifiedKFold, default: StratifiedKFold(n_splits=4, shuffle=True, random_state=42) ) –

    the StratifiedKFold object to be used in the modeling. By default, this is a 4-fold stratified CV with shuffle=True and random_state=42.

Returns:
  • float

    the average r-squared value for the stratified CV
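
As a sketch of what the classes and skf parameters do together, the following uses scikit-learn's StratifiedKFold directly. The class labels and sizes are hypothetical. It counts how many rows of each class land in every test fold under the default splitter settings.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical stratum labels: 40, 40 and 20 rows in classes 0, 1 and 2
classes = np.repeat([0, 1, 2], [40, 40, 20])
X = np.zeros((classes.size, 1))  # placeholder; only its row count matters here

# Same settings as the documented default for skf
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)

# Count the rows of each class in every test fold
test_counts = [
    np.bincount(classes[test_idx], minlength=3).tolist()
    for _, test_idx in skf.split(X, classes)
]
print(test_counts)
```

Each test fold should hold about a quarter of every class, which is the property that distinguishes a stratified split from a plain K-fold split.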

Source code in tfbpmodeling/stratified_cv_r2.py
def stratified_cv_r2(
    y: pd.DataFrame,
    X: pd.DataFrame,
    classes: np.ndarray,
    estimator: BaseEstimator = LinearRegression(fit_intercept=True),
    skf: StratifiedKFold = StratifiedKFold(n_splits=4, shuffle=True, random_state=42),
    **kwargs,
) -> float:
    """
    Calculate the average stratified CV r-squared for a given estimator and data. By
    default, this is a 4-fold stratified CV with a LinearRegression estimator. Note that
    by default, the estimator is set to LinearRegression() and the StratifiedKFold
    object is set to a 4-fold stratified CV with shuffle=True and random_state=42.
    LinearRegression has fit_intercept explicitly set to True, meaning the data IS NOT
    expected to be centered and there should not be a constant column in X.

    :param y: The response variable. See generate_modeling_data()
    :param X: The predictor variables. See generate_modeling_data()
    :param classes: the stratification classes for the data
    :param estimator: the estimator to be used in the modeling. By default, this is a
        LinearRegression() model.
    :param skf: the StratifiedKFold object to be used in the modeling. By default, this
        is a 4-fold stratified CV with shuffle=True and random_state=42.
    :return: the average r-squared value for the stratified CV

    """
    estimator_local = clone(estimator)
    r2_scores = []
    with warnings.catch_warnings(record=True) as w:
        warnings.simplefilter("always")
        folds = list(skf.split(X, classes))
        for warning in w:
            logger.debug(
                f"Warning encountered during stratified k-fold split: {warning.message}"
            )

    for train_idx, test_idx in folds:
        # Use train and test indices to split X and y
        X_train, X_test = (
            X.iloc[train_idx],
            X.iloc[test_idx],
        )
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

        # Fit the model
        model = estimator_local.fit(
            X_train,
            y_train,
        )

        # Calculate R-squared and append to r2_scores
        r2_scores.append(r2_score(y_test, model.predict(X_test)))

    return np.mean(r2_scores)
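
For readers who want to reproduce the computation without tfbpmodeling installed, the fold loop above can be approximated with scikit-learn's cross_val_score by passing in splits that were stratified on the class labels. This is a sketch on synthetic data, not the package's own code. The data, column names and class labels are assumptions, and the fold scores are expected to match only for estimators that do not carry state between fits, such as LinearRegression.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): 100 rows, 3 predictors, small noise
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
signal = 1.0 + X.to_numpy() @ np.array([2.0, -1.0, 0.5])
y = pd.DataFrame({"response": signal + rng.normal(scale=0.1, size=100)})

# Hypothetical stratum labels, one per row
classes = np.repeat([0, 1, 2, 3], 25)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)

# Folds are stratified on `classes`, while the model is fit on (X, y)
fold_r2 = cross_val_score(
    LinearRegression(fit_intercept=True),
    X,
    y,
    cv=list(skf.split(X, classes)),
    scoring="r2",
)
mean_r2 = float(np.mean(fold_r2))
print(f"per-fold R²: {np.round(fold_r2, 3)}, mean: {mean_r2:.3f}")
```

Passing precomputed splits is what lets the stratification variable differ from the response: cross_val_score never sees classes, only the row indices derived from it.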

Overview

The stratified_cv_r2 module provides stratified_cv_r2, which scores an estimator by R² under stratified cross-validation. Folds are stratified on caller-supplied class labels, so each fold holds a similar mix of strata, and the returned value is the mean of the per-fold R² scores computed on held-out rows.

Key Features

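  • Stratified R² Calculation: Mean R² over folds that preserve the proportions of the supplied classes
  • Cross-Validation Integration: Accepts a configured StratifiedKFold splitter through the skf parameter
  • Bootstrap Compatibility: Can be called on resampled data, for example inside a bootstrap loop
  • Robust Performance Metrics: Scores come from held-out rows only, with fold composition kept as close to the full data as the class sizes allow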

Usage Examples

Basic R² Calculation

from sklearn.linear_model import LassoCV

from tfbpmodeling.stratified_cv_r2 import stratified_cv_r2

# predictor_data and response_data are DataFrames with matching rows;
# classes is an ndarray holding one stratum label per row
mean_r2 = stratified_cv_r2(
    y=response_data,
    X=predictor_data,
    classes=classes,
    estimator=LassoCV(),
)

# The function returns a single float: the mean R² across folds
print(f"Mean R²: {mean_r2:.3f}")

Bootstrap Integration

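import numpy as np
from sklearn.linear_model import LassoCV

from tfbpmodeling.stratified_cv_r2 import stratified_cv_r2

# Bootstrap R² with stratification (a sketch): resample rows with replacement,
# then score each resample with stratified_cv_r2.
# predictor_data, response_data and classes are assumed to exist, as above.
rng = np.random.default_rng(42)
n_rows = len(response_data)
bootstrap_r2 = []
for _ in range(1000):
    idx = rng.integers(0, n_rows, size=n_rows)
    bootstrap_r2.append(
        stratified_cv_r2(
            y=response_data.iloc[idx],
            X=predictor_data.iloc[idx],
            classes=classes[idx],
            estimator=LassoCV(),
        )
    )

# Get confidence interval for R²
r2_ci = np.percentile(bootstrap_r2, [2.5, 97.5])
print(f"R² 95% CI: [{r2_ci[0]:.3f}, {r2_ci[1]:.3f}]")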

Performance Metrics

Stratified R²

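stratified_cv_r2 scores each held-out fold as a whole and does not report R² per stratum. To inspect individual strata, score the rows of each stratum directly with scikit-learn's r2_score. Here model, X_test, y_test and test_strata are assumed to exist, with test_strata an ndarray aligned to the test rows:

import numpy as np
from sklearn.metrics import r2_score

# Per-stratum R² calculation
stratum_r2 = {}
for stratum in np.unique(test_strata):
    mask = test_strata == stratum
    stratum_r2[stratum] = r2_score(y_test[mask], model.predict(X_test[mask]))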

Weighted Aggregation

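stratified_cv_r2 averages the fold scores with equal weight (np.mean). Per-stratum scores can instead be combined with weights proportional to stratum size:

import numpy as np

# Weighted average R²; stratum_scores and stratum_sizes are equal-length sequences
weighted_r2 = np.average(stratum_scores, weights=stratum_sizes)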
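
As a small worked example of size-weighted aggregation, the sketch below uses only NumPy and made-up scores and sizes. It contrasts the equal-weight mean with the size-weighted mean.

```python
import numpy as np

# Hypothetical per-stratum R² values and stratum sizes (illustrative only)
stratum_scores = np.array([0.8, 0.6, 0.7])
stratum_sizes = np.array([50, 30, 20])

# Equal weight per stratum
unweighted_r2 = float(np.mean(stratum_scores))

# Weight each stratum by its number of rows:
# (0.8*50 + 0.6*30 + 0.7*20) / 100
weighted_r2 = float(np.average(stratum_scores, weights=stratum_sizes))

print(f"unweighted={unweighted_r2:.3f}, weighted={weighted_r2:.3f}")
```

The two values differ whenever stratum sizes are unequal, so it is worth stating which aggregation a reported R² uses.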