API Reference

This section documents the public API of the bbstat package.


bbstat Package

bbstat: Bayesian Bootstrap Utilities

This package provides tools for performing and evaluating the Bayesian bootstrap, a resampling method based on the Bayesian interpretation of uncertainty.

Main Features
  • bootstrap: Run the Bayesian bootstrap on compatible data structures.
  • BootstrapDistribution: A frozen data class representing the resulting distribution of a bootstrap resampling procedure.
  • BootstrapSummary: A frozen data class that holds the summary (mean, credible interval, and level) of a Bayesian bootstrap procedure's result.
  • resample: Generate weighted samples using the Dirichlet distribution.
  • statistics: Collection of built-in weighted statistics.
  • BootstrapResult: A data class that holds bootstrap estimates, computes the mean, and automatically evaluates the credible interval.
Supported Statistic Functions

Custom statistic functions must accept the signature:

(data: ..., weights: numpy.typing.NDArray[numpy.floating], **kwargs) -> float

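A minimal sketch of a custom statistic that satisfies this signature (the function name and the root-mean-square statistic are illustrative, not part of bbstat; the weights are assumed to be normalized so they sum to 1):

```python
import numpy as np
from numpy.typing import NDArray


def compute_weighted_rms(
    data: NDArray[np.floating],
    weights: NDArray[np.floating],
    **kwargs,
) -> float:
    """Hypothetical weighted root-mean-square of 1D data."""
    return float(np.sqrt(np.dot(weights, data**2)))
```

Such a function can be passed directly as the statistic_fn argument of bootstrap, and extra options can be routed to it through fn_kwargs.
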
Compatible examples in bbstat.statistics include:

  • compute_weighted_entropy: Weighted entropy
  • compute_weighted_eta_square_dependency: Weighted eta-squared for categorical group differences
  • compute_weighted_log_odds: Weighted log-odds of a selected state
  • compute_weighted_mean: Weighted mean estimate
  • compute_weighted_median: Weighted median estimate
  • compute_weighted_mutual_information: Weighted mutual information
  • compute_weighted_pearson_dependency: Weighted Pearson correlation
  • compute_weighted_percentile: Weighted percentile estimate
  • compute_weighted_probability: Weighted probability of a selected state
  • compute_weighted_quantile: Weighted quantile estimate
  • compute_weighted_self_information: Weighted self-information of a selected state
  • compute_weighted_spearman_dependency: Weighted Spearman correlation
  • compute_weighted_std: Weighted standard deviation estimate
  • compute_weighted_sum: Weighted sum estimate
  • compute_weighted_variance: Weighted variance estimate

Modules:

| Name | Description |
| --- | --- |
| `bootstrap` | Core logic for Bayesian bootstrap |
| `evaluate` | Tools for summarizing bootstrap results |
| `plot` | Tool for visualizing bootstrap results |
| `registry` | Registry for built-in statistic functions |
| `resample` | Weighted resampling function |
| `statistics` | Built-in statistic functions |
| `utils` | Utility functions |


bootstrap Module

Bayesian bootstrap resampling for statistical estimation and uncertainty quantification.

This module provides the bootstrap function, which applies the Bayesian bootstrap resampling method to estimate a statistic (such as the mean or median) along with its credible interval. It supports flexible input data formats, user-defined or registered statistic functions, and additional customization via keyword arguments.

The function is designed for use in probabilistic data analysis workflows, where quantifying uncertainty through resampling is critical. It is particularly well-suited for small to moderate datasets and non-parametric inference.

Main Features
  • Resampling via the Bayesian bootstrap method.
  • Support for scalar or multivariate data inputs.
  • Use of string-based or function-based statistic definitions.
  • Configurable number of resamples and credible interval level.
  • Optional blockwise resampling for structured data.
  • Random seed control for reproducibility.
Example
import numpy as np
from bbstat.bootstrap import bootstrap
data = np.random.randn(100)
distribution = bootstrap(data, statistic_fn="mean")
print(distribution)
print(distribution.summarize())

See the function-level docstring of bootstrap for full details.

bootstrap(data, statistic_fn, n_boot=1000, seed=None, blocksize=None, fn_kwargs=None)

Performs Bayesian bootstrap resampling to estimate a statistic.

This function performs Bayesian bootstrap resampling by generating n_boot resamples from the provided data and applying the specified statistic function (statistic_fn).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data` | `Any` | The data to be resampled. It can be a 1D array, a tuple, or a list of arrays where each element represents a different group of data to resample. | required |
| `statistic_fn` | `Union[str, StatisticFunction]` | The statistic function to be applied on each bootstrap resample. It can either be the name of a registered statistic function or the function itself. | required |
| `n_boot` | `int` | The number of bootstrap resamples to generate. | `1000` |
| `seed` | `int` | A seed for the random number generator to ensure reproducibility. `None` means no fixed seed. | `None` |
| `blocksize` | `int` | The block size for resampling. If provided, resampling weights are generated in blocks of this size; `None` means all resampling weights are generated at once. | `None` |
| `fn_kwargs` | `Dict[str, Any]` | Additional keyword arguments passed to the `statistic_fn` for each resample. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `BootstrapDistribution` | An object containing the array with the resampled statistics. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If any data array is not 1D or if the dimensions of the input arrays do not match. |

Example
data = np.random.randn(100)
statistic_fn = "mean"
result = bootstrap(data, statistic_fn)
print(result)
print(result.summarize())
Notes
  • The data argument can be a single 1D array, or a tuple or list of 1D arrays where each array represents a feature of the data (see the sketch after these notes).
  • The statistic_fn can either be the name of a registered function (as a string) or the function itself. If a string is provided, it must match the name of a function in the statistics.registry.
  • The function uses the resample function to generate bootstrap resamples and apply the statistic function to each resample.
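
The multivariate case follows the same pattern; a brief sketch, assuming two equally long 1D arrays and the registered "pearson_dependency" statistic, with its ddof argument forwarded via fn_kwargs (the data values are illustrative):

```python
import numpy as np
from bbstat.bootstrap import bootstrap

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

# data is passed as a tuple of 1D arrays; each resample's Dirichlet weights
# are handed to the statistic function together with the unchanged data.
distribution = bootstrap(
    (x, y),
    statistic_fn="pearson_dependency",
    n_boot=2000,
    seed=42,
    fn_kwargs={"ddof": 1},
)
print(distribution.summarize(level=0.95))
```
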
Source code in bbstat/bootstrap.py
def bootstrap(
    data: Any,
    statistic_fn: Union[str, Callable],
    n_boot: int = 1000,
    seed: Optional[int] = None,
    blocksize: Optional[int] = None,
    fn_kwargs: Optional[Dict[str, Any]] = None,
) -> BootstrapDistribution:
    """
    Performs Bayesian bootstrap resampling to estimate a statistic.

    This function performs Bayesian bootstrap resampling by generating `n_boot` resamples from
    the provided `data` and applying the specified statistic function (`statistic_fn`).

    Args:
        data (Any): The data to be resampled. It can be a 1D array, a tuple,
            or a list of arrays where each element represents a different group of data to resample.
        statistic_fn (Union[str, StatisticFunction]): The statistic function to be applied on each
            bootstrap resample. It can either be the name of a registered statistic function or the
            function itself.
        n_boot (int, optional): The number of bootstrap resamples to generate. Default is 1000.
        seed (int, optional): A seed for the random number generator to ensure reproducibility.
            Default is `None`, which means no fixed seed.
        blocksize (int, optional): The block size for resampling. If provided, resampling weights
            are generated in blocks of this size. Defaults to `None`, meaning all resampling weights
            are generated at once.
        fn_kwargs (Dict[str, Any], optional): Additional keyword arguments to be passed to
            the `statistic_fn` for each resample. Default is `None`.

    Returns:
        BootstrapDistribution: An object containing the array with the resampled statistics.

    Raises:
        ValueError: If any data array is not 1D or if the dimensions of the input arrays do not match.

    Example:
        ```python
        data = np.random.randn(100)
        statistic_fn = "mean"
        result = bootstrap(data, statistic_fn)
        print(result)
        print(result.summarize())
        ```

    Notes:
        - The `data` argument can be a single 1D array, or a tuple or list of 1D arrays where each array
          represents a feature of the data.
        - The `statistic_fn` can either be the name of a registered function (as a string) or the function
          itself. If a string is provided, it must match the name of a function in the `statistics.registry`.
        - The function uses the `resample` function to generate bootstrap resamples and apply the statistic
          function to each resample.
    """
    if isinstance(data, np.ndarray):
        if data.ndim != 1:
            raise ValueError(f"Invalid parameter {data.ndim=:}: must be 1.")
        n_data: int = len(data)
    elif isinstance(data, (tuple, list)):
        n_data = len(data[0])
        for i, array in enumerate(data):
            if array.ndim != 1:
                raise ValueError(f"Invalid parameter {data[i].ndim=:}: must be 1.")
            if n_data != len(array):
                raise ValueError(
                    f"Invalid parameter {data[i].shape[0]=:}: must be {n_data=:}."
                )

    if isinstance(statistic_fn, str):
        statistic_fn = get_statistic_fn(statistic_fn)
    estimates = np.array(
        [
            statistic_fn(data=data, weights=weights, **(fn_kwargs or {}))
            for weights in resample(
                n_boot=n_boot,
                n_data=n_data,
                seed=seed,
                blocksize=blocksize,
            )
        ]
    )
    return BootstrapDistribution(estimates=estimates)

evaluate Module

Evaluation utilities for summarizing bootstrap resampling results.

This module provides a data structure for interpreting and summarizing the output of Bayesian bootstrap resampling procedures.

Main Features
  • BootstrapDistribution: A frozen data class representing the resulting distribution of a bootstrap resampling procedure.
  • BootstrapSummary: A frozen data class that holds the summary (mean, credible interval, and level) of a Bayesian bootstrap procedure's result.
Example
import numpy as np
from bbstat.evaluate import BootstrapDistribution
distribution = BootstrapDistribution(estimates=np.array([5.0, 2.3, 2.9]))
print(distribution)  # => BootstrapDistribution(mean=3.4, size=3)
summary = distribution.summarize(level=0.95)
print(summary)  # => BootstrapSummary(mean=3.4, ci_low=2.33, ci_high=4.895, level=0.95)
Notes
  • This module is designed to be used alongside the bootstrap and resample modules to provide complete statistical summaries of resampled data.

BootstrapDistribution dataclass

A class representing the resulting distribution of a bootstrap resampling procedure.

This class stores the distribution resulting from a Bayesian bootstrap analysis, and provides a method to summarize the result.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `estimates` | `FArray` | The array of bootstrap resample estimates. |

Methods:

| Name | Description |
| --- | --- |
| `__post_init__` | Validates and locks the `estimates` attribute. |
| `__len__` | Returns the length of the `estimates` array. |
| `__str__` | Returns a string representation of the object. |
| `summarize` | Returns a `BootstrapSummary` object. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `estimates` is not a 1D array or contains NaN values. |

Source code in bbstat/evaluate.py
@dataclass(frozen=True)
class BootstrapDistribution:
    """
    A class representing the resulting distribution of a bootstrap resampling procedure.

    This class stores the distribution resulting from a Bayesian bootstrap analysis,
    and provides a method to summarize the result.

    Attributes:
        estimates (FArray): The array of bootstrap resample estimates.

    Methods:
        __post_init__: Validates and locks the `estimates` attribute.
        __len__: Returns the length of the `estimates` array.
        __str__: Returns a string representation of the object.
        summarize: Returns a `BootstrapSummary` object.

    Raises:
        ValueError: If `estimates` is not a 1D array or contains NaN values.
    """

    estimates: FArray

    def __post_init__(self):
        """
        Post-initialization method to validate and lock the estimates array.

        Raises:
            ValueError: If `estimates` is not a 1D array or contains NaN values.
        """
        if self.estimates.ndim != 1:
            raise ValueError(f"Invalid parameter {self.estimates.ndim=}: must be 1.")
        if len(self.estimates) < 1:
            raise ValueError("Invalid parameter estimates: must not be empty.")
        if np.isnan(self.estimates).any():
            raise ValueError(
                "Invalid parameter estimates: must not contain NaN values."
            )
        estimates_copy = np.array(self.estimates, copy=True)
        estimates_copy.setflags(write=False)
        object.__setattr__(self, "estimates", estimates_copy)

    def __len__(self) -> int:
        """Returns the length of the estimates array."""
        return len(self.estimates)

    def __str__(self) -> str:
        """
        Returns a human-readable string representation of the bootstrap distribution.

        This method formats the mean and size of the bootstrap distribution for display.

        Returns:
            str: A formatted string representing the bootstrap distribution.
        """
        mean = self.summarize().mean
        size = len(self)
        return f"BootstrapDistribution({mean=:}, {size=:})"

    def summarize(self, level: float = 0.87) -> BootstrapSummary:
        """
        Returns a `BootstrapSummary` object.

        This method is a wrapper for `BootstrapSummary.from_estimates`.

        Args:
            level (float): The desired level for the credible interval
                (must be between 0 and 1).

        Returns:
            BootstrapSummary: the summary object.

        Raises:
            ValueError: If the `level` is not between 0 and 1.
        """
        return BootstrapSummary.from_estimates(self.estimates, level=level)

__len__()

Returns the length of the estimates array.

Source code in bbstat/evaluate.py
def __len__(self) -> int:
    """Returns the length of the estimates array."""
    return len(self.estimates)

__post_init__()

Post-initialization method to validate and lock the estimates array.

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `estimates` is not a 1D array or contains NaN values. |

Source code in bbstat/evaluate.py
def __post_init__(self):
    """
    Post-initialization method to validate and lock the estimates array.

    Raises:
        ValueError: If `estimates` is not a 1D array or contains NaN values.
    """
    if self.estimates.ndim != 1:
        raise ValueError(f"Invalid parameter {self.estimates.ndim=}: must be 1.")
    if len(self.estimates) < 1:
        raise ValueError("Invalid parameter estimates: must not be empty.")
    if np.isnan(self.estimates).any():
        raise ValueError(
            "Invalid parameter estimates: must not contain NaN values."
        )
    estimates_copy = np.array(self.estimates, copy=True)
    estimates_copy.setflags(write=False)
    object.__setattr__(self, "estimates", estimates_copy)

__str__()

Returns a human-readable string representation of the bootstrap distribution.

This method formats the mean and size of the bootstrap distribution for display.

Returns:

| Type | Description |
| --- | --- |
| `str` | A formatted string representing the bootstrap distribution. |

Source code in bbstat/evaluate.py
def __str__(self) -> str:
    """
    Returns a human-readable string representation of the bootstrap distribution.

    This method formats the mean and size of the bootstrap distribution for display.

    Returns:
        str: A formatted string representing the bootstrap distribution.
    """
    mean = self.summarize().mean
    size = len(self)
    return f"BootstrapDistribution({mean=:}, {size=:})"

summarize(level=0.87)

Returns a BootstrapSummary object.

This method is a wrapper for BootstrapSummary.from_estimates.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `level` | `float` | The desired level for the credible interval (must be between 0 and 1). | `0.87` |

Returns:

| Type | Description |
| --- | --- |
| `BootstrapSummary` | The summary object. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If the `level` is not between 0 and 1. |

Source code in bbstat/evaluate.py
def summarize(self, level: float = 0.87) -> BootstrapSummary:
    """
    Returns a `BootstrapSummary` object.

    This method is a wrapper for `BootstrapSummary.from_estimates`.

    Args:
        level (float): The desired level for the credible interval
            (must be between 0 and 1).

    Returns:
        BootstrapSummary: the summary object.

    Raises:
        ValueError: If the `level` is not between 0 and 1.
    """
    return BootstrapSummary.from_estimates(self.estimates, level=level)

BootstrapSummary dataclass

A class representing the summary of a Bayesian bootstrap resampling procedure.

This class stores the mean, the credible interval, and level.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `mean` | `float` | The mean of the bootstrap estimates. |
| `ci_low` | `float` | The lower bound of the credible interval. |
| `ci_high` | `float` | The upper bound of the credible interval. |
| `level` | `float` | The desired level for the credible interval (between 0 and 1). |
| `ci_width` | `float` | The width of the credible interval (property). |

Methods:

| Name | Description |
| --- | --- |
| `__post_init__` | Validates the `mean`, `ci_low`, `ci_high`, and `level` attributes. |
| `round` | Returns a new version of the summary with rounded values. |
| `from_estimates` | Creates a summary object from estimates. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `mean`, `ci_low`, `ci_high`, or `level` are NaN. |
| `ValueError` | If the bounds are swapped, `ci_low > ci_high`. |
| `ValueError` | If `level` is not between 0 and 1 (exclusive). |

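A brief usage sketch of building a summary directly from estimates and rounding it (the estimate values are illustrative; the printed numbers are not asserted here):

```python
import numpy as np
from bbstat.evaluate import BootstrapSummary

estimates = np.array([2.3, 2.9, 3.1, 3.6, 5.0])
summary = BootstrapSummary.from_estimates(estimates, level=0.9)

print(summary.mean, summary.ci_low, summary.ci_high, summary.level)
print(summary.ci_width)   # ci_high - ci_low
print(summary.round())    # precision derived from the credible interval width
```
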
Source code in bbstat/evaluate.py
@dataclass(frozen=True)
class BootstrapSummary:
    """
    A class representing the summary of a Bayesian bootstrap resampling procedure.

    This class stores the mean, the credible interval, and level.

    Attributes:
        mean (float): The mean of the bootstrap estimates.
        ci_low (float): The lower bound of the credible interval.
        ci_high (float): The upper bound of the credible interval.
        level (float): The desired level for the credible interval (between 0 and 1).
        ci_width (float): The width of the credible interval (property).

    Methods:
        __post_init__: Validates the `mean`, `ci_low`, `ci_high`, and `level` attributes.
        round: Returns a new version of the summary with rounded values.
        from_estimates: Creates a summary object from estimates.

    Raises:
        ValueError: If `mean`, `ci_low`, `ci_high`, or `level` are NaN.
        ValueError: If the bounds are swapped, `ci_low > ci_high`.
        ValueError: If `level` is not between 0 and 1 (exclusive).
    """

    mean: float
    ci_low: float
    ci_high: float
    level: float

    def __post_init__(self) -> None:
        """
        Post-initialization method to validate the `mean`, `ci_low`,
        `ci_high`, and `level` attributes.

        Raises:
            ValueError: If `mean`, `ci_low`, `ci_high`, or `level` are NaN.
            ValueError: If the bounds are swapped, `ci_low > ci_high`.
            ValueError: If `level` is not between 0 and 1 (exclusive).
        """
        if np.isnan(self.mean):
            raise ValueError("Invalid parameter mean: must not be NaN.")
        if np.isnan(self.ci_low):
            raise ValueError("Invalid parameter ci_low: must not be NaN.")
        if np.isnan(self.ci_high):
            raise ValueError("Invalid parameter ci_high: must not be NaN.")
        if np.isnan(self.level):
            raise ValueError("Invalid parameter level: must not be NaN.")
        if self.ci_low > self.ci_high:
            raise ValueError(
                f"Invalid parameters {self.ci_low=:} and {self.ci_high=:}: "
                "higher end is smaller than lower end."
            )
        if self.level <= 0 or self.level >= 1:
            raise ValueError(
                f"Invalid parameter {self.level=:}: must be within (0, 1)."
            )
        if self.mean < self.ci_low or self.mean > self.ci_high:
            raise ValueError(
                f"Invalid parameter {self.mean=:}: is outside the credible interval "
                f"{self.ci_low=:}, {self.ci_high}."
            )

    @property
    def ci_width(self) -> float:
        """Returns the width of the credible interval."""
        return self.ci_high - self.ci_low

    def round(self, precision: Optional[int] = None) -> "BootstrapSummary":
        """
        Returns a new version of the summary with rounded values.

        When `precision` is given, the mean and credible interval bounds are rounded
        to this number of digits. If `precision=None` (default), the precision is
    computed from the width of the credible interval.

        Args:
            precision (int, optional): The desired precision for rounding.

        Returns:
            BootstrapSummary: The summary of a Bayesian bootstrap procedure's result.
        """
        if precision is None:
            precision = get_precision_for_rounding(self.ci_width)
        return self.__class__(
            mean=round(self.mean, precision),
            ci_low=round(self.ci_low, precision),
            ci_high=round(self.ci_high, precision),
            level=self.level,
        )

    @classmethod
    def from_estimates(
        cls,
        estimates: FArray,
        *,
        level: float = 0.87,
    ) -> "BootstrapSummary":
        """
        Creates a summary object from estimates.

        This method computes the `mean` and credible interval bounds `ci_low` and
        `ci_high`, and creates a `BootstrapSummary` object.

        Args:
            estimates (FArray): The estimated values from a Bayesian bootstrap procedure.
            level (float): The desired level for the credible interval (between 0 and 1),
                default is 0.87.

        Returns:
            BootstrapSummary: The summary of a Bayesian bootstrap procedure's result.

        Raises:
            ValueError: If estimates is empty, not a 1D array, or contains NaN values.
            ValueError: If level is not between 0 and 1 (exclusive).
        """
        if estimates.ndim != 1:
            raise ValueError(f"Invalid parameter {estimates.ndim=}: must be 1.")
        if len(estimates) < 1:
            raise ValueError("Invalid parameter estimates: must not be empty.")
        if np.isnan(estimates).any():
            raise ValueError(
                "Invalid parameter estimates: must not contain NaN values."
            )
        mean = np.mean(estimates).item()
        ci_low, ci_high = compute_credible_interval(estimates=estimates, level=level)
        return cls(mean=mean, ci_low=ci_low, ci_high=ci_high, level=level)

ci_width property

Returns the width of the credible interval.

__post_init__()

Post-initialization method to validate the mean, ci_low, ci_high, and level attributes.

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `mean`, `ci_low`, `ci_high`, or `level` are NaN. |
| `ValueError` | If the bounds are swapped, `ci_low > ci_high`. |
| `ValueError` | If `level` is not between 0 and 1 (exclusive). |

Source code in bbstat/evaluate.py
def __post_init__(self) -> None:
    """
    Post-initialization method to validate the `mean`, `ci_low`,
    `ci_high`, and `level` attributes.

    Raises:
        ValueError: If `mean`, `ci_low`, `ci_high`, or `level` are NaN.
        ValueError: If the bounds are swapped, `ci_low > ci_high`.
        ValueError: If `level` is not between 0 and 1 (exclusive).
    """
    if np.isnan(self.mean):
        raise ValueError("Invalid parameter mean: must not be NaN.")
    if np.isnan(self.ci_low):
        raise ValueError("Invalid parameter ci_low: must not be NaN.")
    if np.isnan(self.ci_high):
        raise ValueError("Invalid parameter ci_high: must not be NaN.")
    if np.isnan(self.level):
        raise ValueError("Invalid parameter level: must not be NaN.")
    if self.ci_low > self.ci_high:
        raise ValueError(
            f"Invalid parameters {self.ci_low=:} and {self.ci_high=:}: "
            "higher end is smaller than lower end."
        )
    if self.level <= 0 or self.level >= 1:
        raise ValueError(
            f"Invalid parameter {self.level=:}: must be within (0, 1)."
        )
    if self.mean < self.ci_low or self.mean > self.ci_high:
        raise ValueError(
            f"Invalid parameter {self.mean=:}: is outside the credible interval "
            f"{self.ci_low=:}, {self.ci_high}."
        )

from_estimates(estimates, *, level=0.87) classmethod

Creates a summary object from estimates.

This method computes the mean and credible interval bounds ci_low and ci_high, and creates a BootstrapSummary object.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `estimates` | `FArray` | The estimated values from a Bayesian bootstrap procedure. | required |
| `level` | `float` | The desired level for the credible interval (between 0 and 1). | `0.87` |

Returns:

| Type | Description |
| --- | --- |
| `BootstrapSummary` | The summary of a Bayesian bootstrap procedure's result. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `estimates` is empty, not a 1D array, or contains NaN values. |
| `ValueError` | If `level` is not between 0 and 1 (exclusive). |

Source code in bbstat/evaluate.py
@classmethod
def from_estimates(
    cls,
    estimates: FArray,
    *,
    level: float = 0.87,
) -> "BootstrapSummary":
    """
    Creates a summary object from estimates.

    This method computes the `mean` and credible interval bounds `ci_low` and
    `ci_high`, and creates a `BootstrapSummary` object.

    Args:
        estimates (FArray): The estimated values from a Bayesian bootstrap procedure.
        level (float): The desired level for the credible interval (between 0 and 1),
            default is 0.87.

    Returns:
        BootstrapSummary: The summary of a Bayesian bootstrap procedure's result.

    Raises:
        ValueError: If estimates is empty, not a 1D array, or contains NaN values.
        ValueError: If level is not between 0 and 1 (exclusive).
    """
    if estimates.ndim != 1:
        raise ValueError(f"Invalid parameter {estimates.ndim=}: must be 1.")
    if len(estimates) < 1:
        raise ValueError("Invalid parameter estimates: must not be empty.")
    if np.isnan(estimates).any():
        raise ValueError(
            "Invalid parameter estimates: must not contain NaN values."
        )
    mean = np.mean(estimates).item()
    ci_low, ci_high = compute_credible_interval(estimates=estimates, level=level)
    return cls(mean=mean, ci_low=ci_low, ci_high=ci_high, level=level)

round(precision=None)

Returns a new version of the summary with rounded values.

When precision is given, the mean and credible interval bounds are rounded to this number of digits. If precision=None (default), the precision is computed from the width of the credible interval.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `precision` | `int` | The desired precision for rounding. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `BootstrapSummary` | The summary of a Bayesian bootstrap procedure's result. |

Source code in bbstat/evaluate.py
def round(self, precision: Optional[int] = None) -> "BootstrapSummary":
    """
    Returns a new version of the summary with rounded values.

    When `precision` is given, the mean and credible interval bounds are rounded
    to this number of digits. If `precision=None` (default), the precision is
    computed from the width of the credible interval.

    Args:
        precision (int, optional): The desired precision for rounding.

    Returns:
        BootstrapSummary: The summary of a Bayesian bootstrap procedure's result.
    """
    if precision is None:
        precision = get_precision_for_rounding(self.ci_width)
    return self.__class__(
        mean=round(self.mean, precision),
        ci_low=round(self.ci_low, precision),
        ci_high=round(self.ci_high, precision),
        level=self.level,
    )

plot Module

Plotting utility for bootstrap resampling results.

This module provides a function for visually interpreting and summarizing the output of Bayesian bootstrap resampling procedures.

Main Features
  • plot: Visualizes the result of a bootstrap resampling procedure.
Notes
  • The credible interval is calculated using quantiles of the empirical distribution of bootstrap estimates.
  • This module is designed to be used alongside the evaluate module to provide complete statistical summaries of resampled data.

plot(bootstrap_distribution, level, *, ax=None, n_grid=200, label=None, precision=None)

Plot the kernel density estimate (KDE) of bootstrap estimates with credible interval shading and a vertical line at the mean.

If an axis is provided, the plot is drawn on it; otherwise, a new figure and axis are created. The plot shows a shaded credible interval and labels the line with a formatted mean and credible interval. If no axis is provided, the figure is additionally given a title and y-label, the lower y-limit is set to zero, the legend is added, and a tight layout is applied.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `bootstrap_distribution` | `BootstrapDistribution` | The result of a bootstrap resampling procedure. | required |
| `level` | `float` | Credible interval level (e.g., 0.95 for 95% CI). | required |
| `ax` | `plt.Axes` | Matplotlib axis to draw the plot on. If `None`, a new axis is created. | `None` |
| `n_grid` | `int` | Number of grid points to use for evaluating the KDE. | `200` |
| `label` | `str` | Optional label for the line. If provided, the label is extended to include the mean and credible interval. | `None` |
| `precision` | `int`, `"auto"`, or `None` | Precision for rounding the summary values (mean and credible interval). If `None` (default), no rounding is done; if `"auto"`, the precision is computed from the width of the credible interval; if an integer, values are rounded to that many digits. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `plt.Axes` | The axis object containing the plot. |

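A usage sketch, assuming a working matplotlib backend (the data and statistic are illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np

from bbstat.bootstrap import bootstrap
from bbstat.plot import plot

data = np.random.randn(500)
distribution = bootstrap(data, statistic_fn="mean", seed=7)

# New figure with title, legend, and auto-rounded mean/CI in the label.
plot(distribution, level=0.95, label="mean", precision="auto")
plt.show()
```
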
Source code in bbstat/plot.py
def plot(
    bootstrap_distribution: BootstrapDistribution,
    level: float,
    *,
    ax: Optional[plt.Axes] = None,
    n_grid: int = 200,
    label: Optional[str] = None,
    precision: Optional[Union[int, Literal["auto"]]] = None,
) -> plt.Axes:
    """
    Plot the kernel density estimate (KDE) of bootstrap estimates with
    credible interval shading and a vertical line at the mean.

    If an axis is provided, the plot is drawn on it; otherwise, a new figure and axis are created.
    Displays a shaded credible interval and labels the plot with a formatted mean
    and credible interval. If no axis is provided, the figure is additionally given a title and ylabel,
    the lower y-limit is set to zero, the legend is added, and a tight layout is applied.

    Args:
        bootstrap_distribution (BootstrapDistribution): The result of a bootstrap resampling procedure.
        level (float): Credible interval level (e.g., 0.95 for 95% CI).
        ax (plt.Axes, optional): Matplotlib axis to draw the plot on. If None, a new axis is created.
        n_grid (int): Number of grid points to use for evaluating the KDE, default is 200.
        label (str, optional): Optional label for the line. If provided, the label is
            extended to include the mean and credible interval.
        precision (int or "auto" or None, optional): Optional precision for rounding the summary
            values (mean and credible interval). If None (default), no rounding is done; if "auto",
            the precision is computed from the width of the credible interval; if integer, we round to
            this many digits.

    Returns:
        plt.Axes: The axis object containing the plot.
    """
    if ax is None:
        fig, ax = plt.subplots(figsize=(8, 4.5))
    else:
        fig = None

    summary = bootstrap_distribution.summarize(level)

    if precision is not None:
        if precision == "auto":
            summary = summary.round()
        else:
            summary = summary.round(precision)

    param_str = f"{summary.mean} ({summary.ci_low}, {summary.ci_high})"

    if label is not None:
        param_str = f"{label}={param_str}"

    p = gaussian_kde(bootstrap_distribution.estimates)

    x_grid = np.linspace(
        bootstrap_distribution.estimates.min(), bootstrap_distribution.estimates.max(), n_grid
    )
    within_ci = np.logical_and(x_grid >= summary.ci_low, x_grid <= summary.ci_high)
    y_grid = p(x_grid)
    y_mean = p([summary.mean]).item()

    (line,) = ax.plot(x_grid, y_grid, label=param_str)
    color = line.get_color()

    ax.fill_between(
        x_grid[within_ci],
        0,
        y_grid[within_ci],
        facecolor=color,
        alpha=0.5,
    )
    ax.plot([summary.mean] * 2, [0, y_mean], "--", color=color)
    ax.plot([summary.mean], [y_mean], "o", color=color)

    if fig is not None:
        ax.set_title(
            f"Bayesian bootstrap  •  {len(bootstrap_distribution)} resamples, {level * 100:.0f}% CI"
        )
        ax.set_ylim(0, ax.get_ylim()[1])
        ax.set_ylabel("Distribution of estimates")
        ax.legend()
        fig.tight_layout()
    return ax

registry Module

Statistic function registry and protocol definition.

This module defines a strict Protocol (StatisticFunction) for all supported statistical aggregation functions used in the system. It also provides a typed mapping of statistic function names to their concrete implementations and a lookup function (get_statistic_fn) for retrieving them by name.

All registered functions are callable with specific combinations of arguments (e.g. data, weights, and optional parameters like ddof, factor, or sorter) depending on the computation type. Static typing ensures correct usage of each registered function.

StatisticFunction

Bases: Protocol

A protocol defining the interface for all statistical computation functions.

Each implementing function must take data and weights arrays and may accept additional keyword-only arguments depending on the computation type.

Overloads:

  • aggregate: accepts data: FArray, weights: FArray, and optional factor: float
  • mean, sum: accept only data: FArray, weights: FArray
  • variance, std: accept optional weighted_mean: float and ddof: int
  • quantile: requires quantile: float and optional sorter: IArray
  • percentile: requires percentile: float and optional sorter: IArray
  • median: accepts optional sorter
  • mutual_information: accepts data: IIArray and weights: FArray, and normalize: bool = True
  • pearson_dependency, spearman_dependency: take tuple of two float arrays (FFArray) and ddof
  • eta_square_dependency: takes a tuple of an integer array and a float array (IFArray)
  • entropy: accepts data: IArray and weights: FArray
  • probability, self_information, log_odds: accept data: IArray, weights: FArray, and state: int
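
As a concrete illustration of one of these overloads, a sketch calling the probability statistic directly (the values are illustrative and the weights are assumed to be normalized):

```python
import numpy as np
from bbstat.statistics import compute_weighted_probability

data = np.array([0, 1, 1, 0, 1])   # integer-coded states
weights = np.full(5, 0.2)          # normalized weights

# Matches the probability overload: data, weights, and the selected state.
print(compute_weighted_probability(data=data, weights=weights, state=1))  # => 0.6
```
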
Source code in bbstat/registry.py
class StatisticFunction(Protocol):
    """
    A protocol defining the interface for all statistical computation functions.

    Each implementing function must take `data` and `weights` arrays and may
    accept additional keyword-only arguments depending on the computation type.

    Overloads:

    - `aggregate`: accepts `data: FArray`, `weights: FArray`, and optional `factor: float`
    - `mean`, `sum`: accept only `data: FArray`, `weights: FArray`
    - `variance`, `std`: accept optional `weighted_mean: float` and `ddof: int`
    - `quantile`: requires `quantile: float` and optional `sorter: IArray`
    - `percentile`: requires `percentile: float` and optional `sorter: IArray`
    - `median`: accepts optional `sorter`
    - `mutual_information`: accepts `data: IIArray` and `weights: FArray`, and `normalize: bool = True`
    - `pearson_dependency`, `spearman_dependency`: take tuple of two
      float arrays (`FFArray`) and `ddof`
    - `eta_square_dependency`: takes a tuple of an integer array and a float
      array (`IFArray`)
    - `entropy`: accepts `data: IArray` and `weights: FArray`
    - `probability`, `self_information`, `log_odds`: accept `data: IArray`,
      `weights: FArray`, and `state: int`
    """

    # aggregate
    @overload
    def __call__(
        self,
        data: FArray,
        weights: FArray,
        *,
        factor: Optional[float],
    ) -> float: ...

    # mean, sum
    @overload
    def __call__(
        self,
        data: FArray,
        weights: FArray,
    ) -> float: ...

    # entropy
    @overload
    def __call__(
        self,
        data: IArray,
        weights: FArray,
    ) -> float: ...

    # mutual_information
    @overload
    def __call__(
        self,
        data: IIArray,
        weights: FArray,
        *,
        normalize: bool = True,
    ) -> float: ...

    # probability, log_odds, self_information
    @overload
    def __call__(
        self,
        data: IArray,
        weights: FArray,
        state: int,
    ) -> float: ...

    # variance, std
    @overload
    def __call__(
        self,
        data: FArray,
        weights: FArray,
        *,
        weighted_mean: Optional[float],
        ddof: int,
    ) -> float: ...

    # quantile
    @overload
    def __call__(
        self,
        data: FArray,
        weights: FArray,
        *,
        quantile: float,
        sorter: Optional[IArray],
    ) -> float: ...

    # percentile
    @overload
    def __call__(
        self,
        data: FArray,
        weights: FArray,
        *,
        percentile: float,
        sorter: Optional[IArray],
    ) -> float: ...

    # median
    @overload
    def __call__(
        self,
        data: FArray,
        weights: FArray,
        *,
        sorter: Optional[IArray],
    ) -> float: ...

    # pearson_dependency, spearman_dependency
    @overload
    def __call__(
        self,
        data: FFArray,
        weights: FArray,
        *,
        ddof: int,
    ) -> float: ...

    # eta_squared_dependency
    @overload
    def __call__(
        self,
        data: IFArray,
        weights: FArray,
    ) -> float: ...

get_statistic_fn(name)

Retrieve a registered statistic function by name.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `name` | `str` | The lowercase name of the statistic function to retrieve. Must be one of: `"aggregate"`, `"entropy"`, `"eta_square_dependency"`, `"log_odds"`, `"mean"`, `"median"`, `"mutual_information"`, `"pearson_dependency"`, `"percentile"`, `"probability"`, `"quantile"`, `"self_information"`, `"spearman_dependency"`, `"std"`, `"sum"`, `"variance"`. | required |

Returns:

| Type | Description |
| --- | --- |
| `StatisticFunction` | The corresponding function implementation. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If the name does not correspond to a registered function. |

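A short sketch of looking up a registered function and handling an unknown name (the name "mode" is just an example of an unregistered statistic):

```python
import numpy as np
from bbstat.registry import get_statistic_fn, get_statistic_fn_names

print(get_statistic_fn_names())      # all registered names

mean_fn = get_statistic_fn("mean")
data = np.array([1.0, 2.0, 3.0, 4.0])
weights = np.full(4, 0.25)           # normalized weights
print(mean_fn(data=data, weights=weights))  # => 2.5

try:
    get_statistic_fn("mode")         # not registered
except ValueError as err:
    print(err)
```
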
Source code in bbstat/registry.py
def get_statistic_fn(name: str) -> StatisticFunction:
    """
    Retrieve a registered statistic function by name.

    Parameters:
        name (str): The lowercase name of the statistic function to retrieve.
            Must be one of:
            - "aggregate"
            - "entropy"
            - "eta_square_dependency"
            - "log_odds"
            - "mean"
            - "median"
            - "mutual_information"
            - "pearson_dependency"
            - "percentile"
            - "probability"
            - "quantile"
            - "self_information"
            - "spearman_dependency"
            - "std"
            - "sum"
            - "variance"

    Returns:
        StatisticFunction: The corresponding function implementation.

    Raises:
        ValueError: If the name does not correspond to a registered function.
    """
    try:
        return STATISTIC_FUNCTIONS[name.lower()]
    except KeyError:
        raise ValueError(
            f"Invalid {name=:}: choose from {list(STATISTIC_FUNCTIONS.keys())}"
        )

get_statistic_fn_names()

Retrieve the names of registered statistic functions.

Returns:

| Type | Description |
| --- | --- |
| `Tuple[str, ...]` | The names of the available statistic functions. |

Source code in bbstat/registry.py
def get_statistic_fn_names() -> Tuple[str, ...]:
    """
    Retrieve the names of registered statistic functions.

    Returns:
        Tuple[str, ...]: The names of the available statistic functions.
    """
    return tuple(STATISTIC_FUNCTIONS.keys())

resample Module

Bootstrap resampling utilities using Dirichlet-distributed weights.

This module provides functionality for generating bootstrap resamples via the Bayesian bootstrap method, where resamples are weighted using samples from a Dirichlet distribution. It is intended for internal use within higher-level resampling and estimation workflows.

The function resample yields weighted resamples suitable for estimating statistics under uncertainty without making parametric assumptions.

Main Features
  • Dirichlet-based resampling for Bayesian bootstrap.
  • Support for blockwise resample generation to control memory usage.
  • Optional random seed for reproducibility.
  • Generator interface for efficient streaming of resample weights.
Example
from bbstat.resample import resample
for weights in resample(n_boot=1000, n_data=50):
    # Apply weights to compute statistic
    ...
Notes
  • The function is designed to scale to large numbers of resamples.
  • It is most useful as a low-level utility within a bootstrap framework.

See the resample function docstring for complete usage details.

resample(n_boot, n_data, seed=None, blocksize=None)

Generates bootstrap resamples with Dirichlet-distributed weights.

This function performs resampling by generating weights from a Dirichlet distribution. The number of resamples is controlled by the n_boot argument, while the size of each block of resamples can be adjusted using the blocksize argument. The seed argument allows for reproducible results.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `n_boot` | `int` | The total number of bootstrap resamples to generate. | required |
| `n_data` | `int` | The number of data points to resample (used for the dimension of the Dirichlet distribution). | required |
| `seed` | `int` | A random seed for reproducibility (`None` for random seeding). | `None` |
| `blocksize` | `int` | The number of resamples to generate in each block. If `None`, the entire number of resamples is generated in one block. | `None` |

Yields:

| Type | Description |
| --- | --- |
| `FArray` | Each resample is yielded as it is generated: a 1D array of Dirichlet-distributed weights of length `n_data`. |

Example
for r in resample(n_boot=10, n_data=5):
    print(r)
Notes
  • If blocksize is specified, the resampling will be performed in smaller blocks, which can be useful for parallelizing or limiting memory usage.
  • The function uses NumPy's default_rng to generate random numbers, which provides a more flexible and efficient interface compared to np.random.seed.

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `n_boot` is less than 1 or `n_data` is less than 1. |

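A small sketch showing that each yielded weight vector is a probability vector over the data points and that a fixed seed reproduces the stream:

```python
import numpy as np
from bbstat.resample import resample

first = list(resample(n_boot=3, n_data=4, seed=123, blocksize=2))
second = list(resample(n_boot=3, n_data=4, seed=123, blocksize=2))

for w in first:
    print(w.shape, np.isclose(w.sum(), 1.0))                  # (4,) True

print(all(np.allclose(a, b) for a, b in zip(first, second)))  # True
```
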
Source code in bbstat/resample.py
def resample(
    n_boot: int,
    n_data: int,
    seed: Optional[int] = None,
    blocksize: Optional[int] = None,
) -> Generator[FArray, None, None]:
    """
    Generates bootstrap resamples with Dirichlet-distributed weights.

    This function performs resampling by generating weights from a Dirichlet distribution.
    The number of resamples is controlled by the `n_boot` argument, while the size of
    each block of resamples can be adjusted using the `blocksize` argument. The `seed`
    argument allows for reproducible results.

    Args:
        n_boot (int): The total number of bootstrap resamples to generate.
        n_data (int): The number of data points to resample (used for the dimension of the
            Dirichlet distribution).
        seed (int, optional): A random seed for reproducibility (default is `None` for
            random seeding).
        blocksize (int, optional): The number of resamples to generate in each block.
            If `None`, the entire number of resamples is generated in one block.
            Defaults to `None`.

    Yields:
        Generator[FArray, None, None]: A generator that yields each resample
            (a 1D array of floats) as it is generated. Each resample contains Dirichlet-
            distributed weights for the given `n_data`.

    Example:
        ```python
        for r in resample(n_boot=10, n_data=5):
            print(r)
        ```

    Notes:
        - If `blocksize` is specified, the resampling will be performed in smaller blocks,
          which can be useful for parallelizing or limiting memory usage.
        - The function uses NumPy's `default_rng` to generate random numbers, which provides
          a more flexible and efficient interface compared to `np.random.seed`.

    Raises:
        ValueError: If `n_boot` is less than 1 or `n_data` is less than 1.
    """
    if n_boot < 1:
        raise ValueError(f"Invalid parameter {n_boot=:}: must be positive.")
    if n_data < 1:
        raise ValueError(f"Invalid parameter {n_data=:}: must be positive.")
    rng = np.random.default_rng(seed)
    alpha = np.ones(n_data)
    if blocksize is None:
        blocksize = n_boot
    else:
        blocksize = min(max(1, blocksize), n_boot)
    remainder = n_boot
    while remainder > 0:
        size = min(blocksize, remainder)
        weights = rng.dirichlet(alpha=alpha, size=size)
        for w in weights:
            yield w
        remainder -= size

statistics Module

Type definitions and weighted statistical functions for use in bootstrap resampling and analysis.

This module defines types for statistical functions that operate on weighted data, particularly in the context of Bayesian bootstrap procedures. It provides a collection of pre-defined weighted statistics (e.g., mean, variance, quantile).

Main Features
  • Type aliases for data and weights.
  • A library of built-in weighted statistical functions (e.g., mean, std, quantile, etc.)

Type Aliases:

| Name | Description |
| --- | --- |
| `FArray` | Alias for `numpy.typing.NDArray[numpy.floating]`, used for floating-point data and weights. |
| `IArray` | Alias for `numpy.typing.NDArray[numpy.integer]`, used for index arrays. |
| `FFArray`, `IFArray`, `IIArray` | Tuples of data arrays used in bivariate computations. |

Built-in Functions
  • compute_weighted_aggregate: Weighted dot product, optionally scaled by a factor (internal use only).
  • compute_weighted_entropy: Weighted entropy.
  • compute_weighted_eta_square_dependency: Effect size for categorical-continuous variable relationships.
  • compute_weighted_log_odds: Weighted log-odds of a selected state.
  • compute_weighted_mean: Weighted arithmetic mean.
  • compute_weighted_median: Weighted median.
  • compute_weighted_mutual_information: Weighted mutual information.
  • compute_weighted_pearson_dependency: Weighted Pearson correlation for two variables.
  • compute_weighted_probability: Weighted probability of a state.
  • compute_weighted_quantile / compute_weighted_percentile: Weighted quantile estimation.
  • compute_weighted_self_information: Weighted self-information of a selected state.
  • compute_weighted_spearman_dependency: Weighted Spearman correlation.
  • compute_weighted_std: Weighted standard deviation.
  • compute_weighted_sum: Weighted sum.
  • compute_weighted_variance: Weighted variance with optional degrees of freedom correction.
Notes
  • All functions assume normalized weights (i.e., sum to 1).
  • Functions raise ValueError for invalid shapes, mismatched dimensions, or inappropriate input types.
  • This module is intended for use with bootstrap, which applies these functions across bootstrap resamples.
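
A minimal sketch with a hand-built, normalized weight vector (per the notes above, all functions assume the weights sum to 1):

```python
import numpy as np
from bbstat.statistics import compute_weighted_mean, compute_weighted_sum

data = np.array([1.0, 2.0, 3.0, 4.0])
weights = np.array([0.1, 0.2, 0.3, 0.4])   # sums to 1

print(compute_weighted_mean(data=data, weights=weights))  # => 3.0
print(compute_weighted_sum(data=data, weights=weights))
```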

FArray = NDArray[np.floating] module-attribute

FFArray = Tuple[FArray, FArray] module-attribute

IArray = NDArray[np.integer] module-attribute

IFArray = Tuple[IArray, FArray] module-attribute

compute_weighted_aggregate(data, weights, *, factor=None)

Computes a weighted aggregate of the input data.

This function calculates the dot product of the input data and weights. The function assumes that both data and weights are 1D arrays of the same length and that the weights sum to 1. If a factor is provided, the dot product is multiplied with it.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data` | `FArray` | A 1D array of numeric values representing the data to be aggregated. | required |
| `weights` | `FArray` | A 1D array of numeric values representing the weights for the data. | required |
| `factor` | `float` | A scalar factor to multiply with the computed aggregate. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `float` | The computed weighted aggregate, potentially scaled by the `factor`. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `data` or `weights` are not 1D arrays. |
| `ValueError` | If the shapes of `data` and `weights` do not match. |

Example
data = np.array([1.0, 2.0, 3.0])
weights = np.array([0.2, 0.5, 0.3])
print(compute_weighted_aggregate(data, weights))  # => 2.1
print(compute_weighted_aggregate(data, weights, factor=1.5))  # => 3.15
Notes

The weighted aggregate is computed using the dot product between data and weights. The optional factor scales the result of this dot product. If no factor is given, the aggregation computes the weighted arithmetic mean of the data; if instead the factor equals the length of the data array, the aggregation computes the weighted sum.

Source code in bbstat/statistics.py
def compute_weighted_aggregate(
    data: FArray,
    weights: FArray,
    *,
    factor: Optional[float] = None,
) -> float:
    """
    Computes a weighted aggregate of the input data.

    This function calculates the dot product of the input `data` and `weights`.
    The function assumes that both `data` and `weights` are 1D arrays of the same
    length and that the weights sum to 1. If a `factor` is provided, the dot product
    is multiplied with it.

    Args:
        data (FArray): A 1D array of numeric values representing the
            data to be aggregated.
        weights (FArray): A 1D array of numeric values representing
            the weights for the data.
        factor (float, optional): A scalar factor to multiply with the computed
            aggregate (default is None).

    Returns:
        float: The computed weighted aggregate, potentially scaled by the `factor`.

    Raises:
        ValueError: If `data` or `weights` are not 1D arrays.
        ValueError: If the shapes of `data` and `weights` do not match.

    Example:
        ```python
        data = np.array([1.0, 2.0, 3.0])
        weights = np.array([0.2, 0.5, 0.3])
        print(compute_weighted_aggregate(data, weights))  # => 2.1
        print(compute_weighted_aggregate(data, weights, factor=1.5))  # => 3.15
        ```

    Notes:
        The weighted aggregate is computed using the dot product between `data` and `weights`.
        The optional `factor` scales the result of this dot product. If no factor is given,
        the aggregation computes the weighted arithmetic mean of the data; if instead the factor
        equals the length of the data array, the aggregation computes the weighted sum.

    """
    validate_array(data=data, weights=weights)
    aggregate = np.dot(weights, data)
    if factor is not None:
        aggregate *= factor
    return aggregate.item()

compute_weighted_entropy(data, weights)

Computes a weighted entropy of 1D code data.

This function calculates the weighted entropy by first computing the weighted distribution, dropping the zero elements (which contribute zero to the sum), and then computing the negative dot product between the distribution and its log-distribution.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `data` | `IArray` | A 1D array of numeric values representing the sample data in code format. | required |
| `weights` | `FArray` | A 1D array of numeric weights corresponding to the data. | required |

Returns:

| Type | Description |
| --- | --- |
| `float` | The weighted entropy value. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `data` and `weights` have different shapes or are not 1D. |

Example
data = np.array([1, 0, 0])
weights = np.array([0.4, 0.2, 0.4])
print(compute_weighted_entropy(data, weights))  # => 0.673...
Source code in bbstat/statistics.py
def compute_weighted_entropy(
    data: IArray,
    weights: FArray,
) -> float:
    """
    Computes a weighted entropy of 1D code data.

    This function calculates the weighted entropy by first computing
    the weighted distribution, dropping the zero elements (contribute
    zero to the following sum), and computing the negative dot product
    between the distribution and log-distribution.

    Args:
        data (IArray): A 1D array of numeric values representing
            the sample data in code format.
        weights (FArray): A 1D array of numeric weights corresponding
            to the data.

    Returns:
        float: The weighted entropy value.

    Raises:
        ValueError: If `data` and `weights` have different shapes or are not 1D.

    Example:
        ```python
        data = np.array([1, 0, 0])
        weights = np.array([0.4, 0.2, 0.4])
        print(compute_weighted_entropy(data, weights))  # => 0.673...
        ```
    """
    validate_array(data=data, weights=weights)
    active_set = get_active_set(data=data)
    distribution = weighted_discrete_distribution(data=active_set, weights=weights)
    distribution = distribution[distribution > 0.0]
    return -np.dot(distribution, np.log(distribution)).item()

compute_weighted_eta_square_dependency(data, weights)

Computes the weighted eta-squared (η²) statistic to assess dependency between a categorical and a numerical variable.

Eta-squared measures the proportion of total variance in the numerical variable that is explained by the categorical grouping. It is commonly used in ANOVA-like analyses and effect size estimation. The value is bounded between 0 and 1.

Parameters:

- `data` (`IFArray`, required): A tuple `(data_cat, data_num)` where `data_cat` is a 1D array of integer-encoded categorical values and `data_num` is a 1D array of corresponding numeric values.
- `weights` (`FArray`, required): A 1D array of non-negative weights, same length as `data_cat`.

Returns:

- `float`: Weighted eta-squared value in the range [0, 1], where higher values indicate stronger association between the categorical and numeric variable.

Raises:

- `ValueError`: If input arrays are not 1D or do not have matching shapes.

Example
data_cat = np.array([0, 0, 1, 1])
data_num = np.array([1.0, 2.0, 3.0, 4.0])
weights = np.array([0.25, 0.25, 0.25, 0.25])
print(compute_weighted_eta_square_dependency((data_cat, data_num), weights))  # => 0.8
Notes
  • Internally, η² is computed as the ratio of weighted between-group variance to the total weighted variance.
  • The statistic is sensitive to group sizes and imbalance in weights.
  • When all group means equal the global mean, η² is 0.
  • When groups are perfectly separated by the numeric variable, η² is 1.
Source code in bbstat/statistics.py
def compute_weighted_eta_square_dependency(
    data: IFArray,
    weights: FArray,
) -> float:
    """
    Computes the weighted eta-squared (η²) statistic to assess dependency between
    a categorical and a numerical variable.

    Eta-squared measures the proportion of total variance in the numerical variable
    that is explained by the categorical grouping. It is commonly used in ANOVA-like
    analyses and effect size estimation. The value is bounded between 0 and 1.

    Args:
        data (IFArray): A tuple `(data_cat, data_num)` where:
            - `data_cat` is a 1D array of integer-encoded categorical values.
            - `data_num` is a 1D array of corresponding numeric values.
        weights (FArray): A 1D array of non-negative weights,
            same length as `data_cat`.

    Returns:
        float: Weighted eta-squared value in the range [0, 1], where higher values
        indicate stronger association between the categorical and numeric variable.

    Raises:
        ValueError: If input arrays are not 1D or do not have matching shapes.

    Example:
        ```python
        data_cat = np.array([0, 0, 1, 1])
        data_num = np.array([1.0, 2.0, 3.0, 4.0])
        weights = np.array([0.25, 0.25, 0.25, 0.25])
        print(compute_weighted_eta_square_dependency((data_cat, data_num), weights))  # => 0.8
        ```

    Notes:
        - Internally, η² is computed as the ratio of weighted between-group variance
          to the total weighted variance.
        - The statistic is sensitive to group sizes and imbalance in weights.
        - When all group means equal the global mean, η² is 0.
        - When groups are perfectly separated by the numeric variable, η² is 1.
    """
    data_cat, data_num = data
    mean_sample = compute_weighted_mean(data=data_num, weights=weights)
    group_variance_sum = 0.0
    for value in np.unique(data_cat):
        subset = data_cat == value
        group_weight = weights[subset].sum().item()
        mean_group = compute_weighted_aggregate(
            data=data_num[subset],
            weights=weights[subset],
            factor=1.0 / group_weight,
        )
        group_variance_sum += group_weight * (mean_group - mean_sample) ** 2
    return group_variance_sum / compute_weighted_variance(
        data=data_num,
        weights=weights,
        ddof=0,
    )
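Conversely to the example above, a hypothetical check of the first boundary case from the notes (equal group means give η² = 0), assuming the function is importable from `bbstat.statistics`:

```python
import numpy as np

from bbstat.statistics import compute_weighted_eta_square_dependency

# Both groups have the same weighted mean, so no variance is explained
data_cat = np.array([0, 0, 1, 1])
data_num = np.array([1.0, 2.0, 2.0, 1.0])
weights = np.array([0.25, 0.25, 0.25, 0.25])
print(compute_weighted_eta_square_dependency((data_cat, data_num), weights))  # => 0.0
```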

compute_weighted_log_odds(data, weights, state)

Computes a weighted log-odds of a state within 1D code data.

This function calculates the weighted probability of a state, p(state), by summing the weights which coincide with data == state. The log-odds is then computed as the logarithm of the odds p(state) / (1 - p(state)).

Parameters:

- `data` (`IArray`, required): A 1D array of numeric values representing the sample data in code format.
- `weights` (`FArray`, required): A 1D array of numeric weights corresponding to the data.
- `state` (`int`, required): The state for which we estimate the log-odds.

Returns:

- `float`: The weighted log-odds value.

Raises:

- `ValueError`: If `data` and `weights` have different shapes or are not 1D, or if `state` is not in `data`.

Example
data = np.array([1, 0, 0])
weights = np.array([0.4, 0.2, 0.4])
print(compute_weighted_log_odds(data, weights, state=0))  # => 0.405...
Source code in bbstat/statistics.py
def compute_weighted_log_odds(
    data: IArray,
    weights: FArray,
    state: int,
) -> float:
    """
    Computes a weighted log-odds of a state within 1D code data.

    This function calculates the weighted probability of a `state`, `p(state)`, by summing
    the `weights` which coincide with `data == state`. The log-odds is then computed
    as the logarithm of the odds `p(state) / (1 - p(state))`.

    Args:
        data (IArray): A 1D array of numeric values representing
            the sample data in code format.
        weights (FArray): A 1D array of numeric weights corresponding
            to the data.
        state (int): The state for which we estimate the log-odds.

    Returns:
        float: The weighted log-odds value.

    Raises:
        ValueError: If `data` and `weights` have different shapes or are not 1D,
            or if `state` is not in `data`.

    Example:
        ```python
        data = np.array([1, 0, 0])
        weights = np.array([0.4, 0.2, 0.4])
        print(compute_weighted_log_odds(data, weights, state=0))  # => 0.405...
        ```
    """
    probability = compute_weighted_probability(data=data, weights=weights, state=state)
    return math.log(probability / (1.0 - probability))
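For binary code data, the log-odds of a state equals the difference between the self-informations of the other state and the selected state; a small, hypothetical check (see `compute_weighted_self_information` below; assuming both functions are importable from `bbstat.statistics`):

```python
import numpy as np

from bbstat.statistics import (
    compute_weighted_log_odds,
    compute_weighted_self_information,
)

data = np.array([1, 0, 0])
weights = np.array([0.4, 0.2, 0.4])

# log(p0 / (1 - p0)) == -log(p1) - (-log(p0)) whenever p0 + p1 == 1
log_odds = compute_weighted_log_odds(data, weights, state=0)
info_diff = compute_weighted_self_information(
    data, weights, state=1
) - compute_weighted_self_information(data, weights, state=0)
print(round(log_odds, 6) == round(info_diff, 6))  # => True
```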

compute_weighted_mean(data, weights)

Computes a weighted mean of the input data.

This function calculates the weighted arithmetic mean of the input data and weights via the compute_weighted_aggregate function.

Parameters:

- `data` (`FArray`, required): A 1D array of numeric values representing the data to be averaged.
- `weights` (`FArray`, required): A 1D array of numeric values representing the weights for the data.

Returns:

- `float`: The computed weighted mean.

Raises:

- `ValueError`: If `data` or `weights` are not 1D arrays.
- `ValueError`: If the shapes of `data` and `weights` do not match.

Example
data = np.array([1.0, 2.0, 3.0])
weights = np.array([0.2, 0.5, 0.3])
print(compute_weighted_mean(data, weights))  # => 2.1
Source code in bbstat/statistics.py
def compute_weighted_mean(
    data: FArray,
    weights: FArray,
) -> float:
    """
    Computes a weighted mean of the input data.

    This function calculates the weighted arithmetic mean of the input `data`
    and `weights` via the `compute_weighted_aggregate` function.

    Args:
        data (FArray): A 1D array of numeric values representing
            the data to be averaged.
        weights (FArray): A 1D array of numeric values representing
            the weights for the data.

    Returns:
        float: The computed weighted mean.

    Raises:
        ValueError: If `data` or `weights` are not 1D arrays.
        ValueError: If the shapes of `data` and `weights` do not match.

    Example:
        ```python
        data = np.array([1.0, 2.0, 3.0])
        weights = np.array([0.2, 0.5, 0.3])
        print(compute_weighted_mean(data, weights))  # => 2.1
        ```
    """
    return compute_weighted_aggregate(data=data, weights=weights, factor=None)

compute_weighted_median(data, weights, *, sorter=None)

Computes a weighted median of 1D data using linear interpolation.

This function calculates the weighted median of the given data array based on the provided weights via compute_weighted_quantile with parameter quantile=0.5.

Parameters:

- `data` (`FArray`, required): A 1D array of numeric values representing the sample data.
- `weights` (`FArray`, required): A 1D array of numeric weights corresponding to the data.
- `sorter` (`Optional[NDArray[integer]]`, default `None`): Optional array of indices that sorts `data`.

Returns:

- `float`: The interpolated weighted median value.

Raises:

- `ValueError`: If `data` and `weights` have different shapes or are not 1D.

Example
data = np.array([1.0, 2.0, 3.0])
weights = np.array([0.4, 0.2, 0.4])
print(compute_weighted_median(data, weights))  # => 2.25
Source code in bbstat/statistics.py
def compute_weighted_median(
    data: FArray,
    weights: FArray,
    *,
    sorter: Optional[NDArray[np.integer]] = None,
) -> float:
    """
    Computes a weighted median of 1D data using linear interpolation.

    This function calculates the weighted median of the given `data` array
    based on the provided `weights` via `compute_weighted_quantile` with parameter
    `quantile=0.5`.

    Args:
        data (FArray): A 1D array of numeric values representing
            the sample data.
        weights (FArray): A 1D array of numeric weights corresponding
            to the data.
        sorter (Optional[NDArray[np.integer]]): Optional array of indices that
            sorts `data`.

    Returns:
        float: The interpolated weighted median value.

    Raises:
        ValueError: If `data` and `weights` have different shapes or are not 1D.

    Example:
        ```python
        data = np.array([1.0, 2.0, 3.0])
        weights = np.array([0.4, 0.2, 0.4])
        print(compute_weighted_median(data, weights))  # => 2.25
        ```
    """
    return compute_weighted_quantile(
        data=data,
        weights=weights,
        quantile=0.5,
        sorter=sorter,
    )

compute_weighted_mutual_information(data, weights, *, normalize=True)

Computes the weighted mutual information (dependency) between two 1D arrays.

This function calculates the mutual information dependency between two integer variables using the direct mutual information formula on weighted joint distribution estimates. The inputs data_1 and data_2 are expected to be 1D arrays of the same length, provided as a tuple data. Each data point is assigned a weight from the weights array.

Parameters:

- `data` (`IIArray`, required): A tuple of two 1D integer arrays `(data_1, data_2)` of equal length.
- `weights` (`FArray`, required): A 1D float array of weights, same length as each array in `data`.
- `normalize` (`bool`, default `True`): Whether or not to normalize the mutual information to the range [0, 1].

Returns:

- `float`: The weighted mutual information.

Raises:

- `ValueError`: If the input arrays are not 1D or have mismatched lengths.

Example
data_1 = np.array([0, 0, 1])
data_2 = np.array([0, 1, 1])
weights = np.array([0.2, 0.5, 0.3])
print(compute_weighted_mutual_information((data_1, data_2), weights))  # => 0.146...
Notes
  • The normalized result is bounded by [0, 1], where 0 indicates perfect independence and 1 indicates perfect dependence.
  • The unnormalized result is bounded between 0 (perfect independence) and min(H(data_1), H(data_2)), where H(data_1) and H(data_2) are the entropies of the two variables. If we reach the upper bound and H(data_1) == H(data_2), we have perfect dependence.
Source code in bbstat/statistics.py
def compute_weighted_mutual_information(
    data: IIArray,
    weights: FArray,
    *,
    normalize: bool = True,
) -> float:
    """
    Computes the weighted mutual information (dependency) between two 1D arrays.

    This function calculates the mutual information dependency between two integer variables
    using the direct mutual information formula on weighted joint distribution estimates.
    The inputs `data_1` and `data_2` are expected to be 1D arrays of the same length, provided
    as a tuple `data`. Each data point is assigned a weight from the `weights` array.

    Args:
        data (IIArray): A tuple of two 1D integer arrays `(data_1, data_2)`
            of equal length.
        weights (FArray): A 1D float array of weights, same length as
            each array in `data`.
        normalize (bool, optional): Whether or not to normalize the mutual information
            to range [0, 1]. Default is True

    Returns:
        float: The weighted mutual information.

    Raises:
        ValueError: If the input arrays are not 1D or have mismatched lengths.

    Example:
        ```python
        data_1 = np.array([0, 0, 1])
        data_2 = np.array([0, 1, 1])
        weights = np.array([0.2, 0.5, 0.3])
        print(compute_weighted_mutual_information((data_1, data_2), weights))  # => 0.146...
        ```

    Notes:
        - The normalized result is bounded by [0, 1], where 0 indicates perfect independence and
          1 indicates perfect dependence.
        - The unnormalized result is bounded between 0 (perfect independence) and
          `min(H(data_1), H(data_2))`, where `H(data_1)` and `H(data_2)` are the entropies of
          the two variables. If we reach the upper bound and `H(data_1) == H(data_2)`,
          we have perfect dependence.
    """
    validate_arrays(data=data, weights=weights)
    active_sets = (
        get_active_set(data[0]),
        get_active_set(data[1]),
    )
    joint_distribution = weighted_discrete_joint_distribution(
        data=active_sets,
        weights=weights,
    )
    distribution_0 = np.sum(joint_distribution, axis=1)
    distribution_1 = np.sum(joint_distribution, axis=0)
    divisor = np.outer(distribution_0, distribution_1).reshape(-1)
    joint_distribution = joint_distribution.reshape(-1)
    non_zero = joint_distribution > 0.0
    mutual_information = np.sum(
        joint_distribution[non_zero]
        * np.log(joint_distribution[non_zero] / divisor[non_zero])
    ).item()
    if mutual_information > 0 and normalize:
        mask = distribution_0 > 0.0
        entropy_0 = -np.dot(distribution_0[mask], np.log(distribution_0[mask])).item()
        mask = distribution_1 > 0.0
        entropy_1 = -np.dot(distribution_1[mask], np.log(distribution_1[mask])).item()
        mutual_information = 2.0 * mutual_information / (entropy_0 + entropy_1)
    return mutual_information
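Continuing the example above, a hypothetical check of the `normalize` flag (assuming the function is importable from `bbstat.statistics`): with `normalize=False` the raw mutual information in nats is returned.

```python
import numpy as np

from bbstat.statistics import compute_weighted_mutual_information

data_1 = np.array([0, 0, 1])
data_2 = np.array([0, 1, 1])
weights = np.array([0.2, 0.5, 0.3])

# Raw (unnormalized) mutual information in nats
print(compute_weighted_mutual_information((data_1, data_2), weights, normalize=False))  # => 0.0816...
```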

compute_weighted_pearson_dependency(data, weights, *, ddof=0)

Computes the weighted Pearson correlation coefficient (dependency) between two 1D arrays.

This function calculates the linear dependency between two variables using a weighted version of Pearson's correlation coefficient. The inputs data_1 and data_2 are expected to be 1D arrays of the same length, provided as a tuple data. Each data point is assigned a weight from the weights array.

The function normalizes both variables by subtracting their weighted means and dividing by their weighted standard deviations, then computes the weighted mean of the element-wise product of these normalized arrays.

Parameters:

- `data` (`FFArray`, required): A tuple of two 1D float arrays `(data_1, data_2)` of equal length.
- `weights` (`FArray`, required): A 1D float array of weights, same length as each array in `data`.
- `ddof` (`int`, default `0`): Delta degrees of freedom for the standard deviation. Defaults to 0 (population formula). Use 1 for sample-based correction.

Returns:

- `float`: The weighted Pearson correlation coefficient in the range [-1, 1].

Raises:

- `ValueError`: If the input arrays are not 1D or have mismatched lengths.

Example
data_1 = np.array([1.0, 2.0, 3.0])
data_2 = np.array([1.0, 2.0, 2.9])
weights = np.array([0.2, 0.5, 0.3])
print(compute_weighted_pearson_dependency((data_1, data_2), weights))  # => 0.998...
Notes
  • The function relies on compute_weighted_mean and compute_weighted_std.
  • The correlation is computed using the formula: corr = weighted_mean(z1 * z2) where z1 and z2 are the standardized variables.
  • The result is bounded between -1 (perfect negative linear relationship) and 1 (perfect positive linear relationship), with 0 indicating no linear dependency.
Source code in bbstat/statistics.py
def compute_weighted_pearson_dependency(
    data: FFArray,
    weights: FArray,
    *,
    ddof: int = 0,
) -> float:
    """
    Computes the weighted Pearson correlation coefficient (dependency) between two 1D arrays.

    This function calculates the linear dependency between two variables using a weighted
    version of Pearson's correlation coefficient. The inputs `data_1` and `data_2` are
    expected to be 1D arrays of the same length, provided as a tuple `data`. Each data point
    is assigned a weight from the `weights` array.

    The function normalizes both variables by subtracting their weighted means and dividing
    by their weighted standard deviations, then computes the weighted mean of the element-wise
    product of these normalized arrays.

    Args:
        data (FFArray): A tuple of two 1D float arrays `(data_1, data_2)`
            of equal length.
        weights (FArray): A 1D float array of weights, same length as
            each array in `data`.
        ddof (int, optional): Delta degrees of freedom for standard deviation.
            Defaults to 0 (population formula). Use 1 for sample-based correction.

    Returns:
        float: The weighted Pearson correlation coefficient in the range [-1, 1].

    Raises:
        ValueError: If the input arrays are not 1D or have mismatched lengths.

    Example:
        ```python
        data_1 = np.array([1.0, 2.0, 3.0])
        data_2 = np.array([1.0, 2.0, 2.9])
        weights = np.array([0.2, 0.5, 0.3])
        print(compute_weighted_pearson_dependency((data_1, data_2), weights))  # => 0.998...
        ```

    Notes:
        - The function relies on `compute_weighted_mean` and `compute_weighted_std`.
        - The correlation is computed using the formula:
            corr = weighted_mean(z1 * z2)
          where z1 and z2 are the standardized variables.
        - The result is bounded between -1 (perfect negative linear relationship)
          and 1 (perfect positive linear relationship), with 0 indicating no linear dependency.
    """
    data_1, data_2 = data
    weighted_mean_1 = compute_weighted_mean(data=data_1, weights=weights)
    weighted_mean_2 = compute_weighted_mean(data=data_2, weights=weights)
    weighted_std_1 = compute_weighted_std(
        data=data_1,
        weights=weights,
        weighted_mean=weighted_mean_1,
        ddof=ddof,
    )
    weighted_std_2 = compute_weighted_std(
        data=data_2,
        weights=weights,
        weighted_mean=weighted_mean_2,
        ddof=ddof,
    )
    array_1 = (data_1 - weighted_mean_1) / weighted_std_1
    array_2 = (data_2 - weighted_mean_2) / weighted_std_2
    return compute_weighted_mean(data=array_1 * array_2, weights=weights)
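As a hypothetical sanity check (assuming the function is importable from `bbstat.statistics`): a perfectly linear relationship yields a coefficient of 1, up to floating-point error.

```python
import numpy as np

from bbstat.statistics import compute_weighted_pearson_dependency

data_1 = np.array([1.0, 2.0, 3.0])
data_2 = 2.0 * data_1  # exact linear relationship
weights = np.array([0.2, 0.5, 0.3])
print(compute_weighted_pearson_dependency((data_1, data_2), weights))  # => 1.0 (approximately)
```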

compute_weighted_percentile(data, weights, *, percentile, sorter=None)

Computes a weighted percentile of 1D data using linear interpolation.

This function calculates the weighted percentile of the given data array based on the provided weights via compute_weighted_quantile with parameter quantile=0.01 * percentile.

Parameters:

- `data` (`FArray`, required): A 1D array of numeric values representing the sample data.
- `weights` (`FArray`, required): A 1D array of numeric weights corresponding to the data.
- `percentile` (`float`, required): The desired percentile in the interval [0, 100].
- `sorter` (`Optional[NDArray[integer]]`, default `None`): Optional array of indices that sorts `data`.

Returns:

- `float`: The interpolated weighted percentile value.

Raises:

- `ValueError`: If `data` and `weights` have different shapes or are not 1D.

Example
data = np.array([1.0, 2.0, 3.0])
weights = np.array([0.2, 0.5, 0.3])
print(compute_weighted_percentile(data, weights, percentile=70))  # => 2.2
Source code in bbstat/statistics.py
def compute_weighted_percentile(
    data: FArray,
    weights: FArray,
    *,
    percentile: float,
    sorter: Optional[NDArray[np.integer]] = None,
) -> float:
    """
    Computes a weighted percentile of 1D data using linear interpolation.

    This function calculates the weighted percentile of the given `data` array
    based on the provided `weights` via `compute_weighted_quantile` with parameter
    `quantile=0.01 * percentile`.

    Args:
        data (FArray): A 1D array of numeric values representing
            the sample data.
        weights (FArray): A 1D array of numeric weights corresponding
            to the data.
        percentile (float): The desired percentile in the interval [0, 100].
        sorter (Optional[NDArray[np.integer]]): Optional array of indices that
            sorts `data`.

    Returns:
        float: The interpolated weighted percentile value.

    Raises:
        ValueError: If `data` and `weights` have different shapes or are not 1D.

    Example:
        ```python
        data = np.array([1.0, 2.0, 3.0])
        weights = np.array([0.2, 0.5, 0.3])
        print(compute_weighted_percentile(data, weights, percentile=70))  # => 2.2
        ```
    """
    return compute_weighted_quantile(
        data=data,
        weights=weights,
        quantile=0.01 * percentile,
        sorter=sorter,
    )

compute_weighted_probability(data, weights, state)

Computes a weighted probability of a state within 1D code data.

This function calculates the weighted probability of a state by summing the weights which coincide with data == state.

Parameters:

- `data` (`IArray`, required): A 1D array of numeric values representing the sample data in code format.
- `weights` (`FArray`, required): A 1D array of numeric weights corresponding to the data.
- `state` (`int`, required): The state for which we estimate the probability.

Returns:

- `float`: The weighted probability value.

Raises:

- `ValueError`: If `data` and `weights` have different shapes or are not 1D, or if `state` is not in `data`.

Example
data = np.array([1, 0, 0])
weights = np.array([0.4, 0.2, 0.4])
print(compute_weighted_probability(data, weights, state=0))  # => 0.6
Source code in bbstat/statistics.py
def compute_weighted_probability(
    data: IArray,
    weights: FArray,
    state: int,
) -> float:
    """
    Computes a weighted probability of a state within 1D code data.

    This function calculates the weighted probability of a `state` by summing
    the `weights` which coincide with `data == state`.

    Args:
        data (IArray): A 1D array of numeric values representing
            the sample data in code format.
        weights (FArray): A 1D array of numeric weights corresponding
            to the data.
        state (int): The state for which we estimate the probability.

    Returns:
        float: The weighted probability value.

    Raises:
        ValueError: If `data` and `weights` have different shapes or are not 1D,
            or if `state` is not in `data`.

    Example:
        ```python
        data = np.array([1, 0, 0])
        weights = np.array([0.4, 0.2, 0.4])
        print(compute_weighted_probability(data, weights, state=0))  # => 0.6
        ```
    """
    validate_array(data=data, weights=weights)
    if state not in data:
        raise ValueError(f"Incompatible parameter {state=:}: not included in data.")
    return np.sum(weights[data == state]).item()

compute_weighted_quantile(data, weights, *, quantile, sorter=None)

Computes a weighted quantile of 1D data using linear interpolation.

This function calculates the weighted quantile of the given data array based on the provided weights. It uses a normalized cumulative weight distribution to determine the interpolated quantile value. The computation assumes both data and weights are 1D arrays of equal length.

A precomputed sorter (array of indices that would sort data) can be optionally provided to avoid recomputing it internally.

Parameters:

- `data` (`FArray`, required): A 1D array of numeric values representing the sample data.
- `weights` (`FArray`, required): A 1D array of numeric weights corresponding to the data.
- `quantile` (`float`, required): The desired quantile in the interval [0, 1].
- `sorter` (`Optional[NDArray[integer]]`, default `None`): Optional array of indices that sorts `data`.

Returns:

- `float`: The interpolated weighted quantile value.

Raises:

- `ValueError`: If `data` and `weights` have different shapes or are not 1D.

Example
data = np.array([1.0, 2.0, 3.0])
weights = np.array([0.2, 0.5, 0.3])
print(compute_weighted_quantile(data, weights, quantile=0.7))  # => 2.2
Notes
  • If quantile is less than or equal to the minimum cumulative weight, the smallest data point is returned.
  • If quantile is greater than or equal to the maximum cumulative weight, the largest data point is returned.
  • Linear interpolation is used between the two closest surrounding data points.
  • Providing a precomputed sorter can optimize performance in repeated calls.
Source code in bbstat/statistics.py
def compute_weighted_quantile(
    data: FArray,
    weights: FArray,
    *,
    quantile: float,
    sorter: Optional[IArray] = None,
) -> float:
    """
    Computes a weighted quantile of 1D data using linear interpolation.

    This function calculates the weighted quantile of the given `data` array
    based on the provided `weights`. It uses a normalized cumulative weight
    distribution to determine the interpolated quantile value. The computation
    assumes both `data` and `weights` are 1D arrays of equal length.

    A precomputed `sorter` (array of indices that would sort `data`) can be
    optionally provided to avoid recomputing it internally.

    Args:
        data (FArray): A 1D array of numeric values representing
            the sample data.
        weights (FArray): A 1D array of numeric weights corresponding
            to the data.
        quantile (float): The desired quantile in the interval [0, 1].
        sorter (Optional[NDArray[np.integer]]): Optional array of indices that
            sorts `data`.

    Returns:
        float: The interpolated weighted quantile value.

    Raises:
        ValueError: If `data` and `weights` have different shapes or are not 1D.

    Example:
        ```python
        data = np.array([1.0, 2.0, 3.0])
        weights = np.array([0.2, 0.5, 0.3])
        print(compute_weighted_quantile(data, weights, quantile=0.7))  # => 2.2
        ```

    Notes:
        - If `quantile` is less than or equal to the minimum cumulative weight,
          the smallest data point is returned.
        - If `quantile` is greater than or equal to the maximum cumulative weight,
          the largest data point is returned.
        - Linear interpolation is used between the two closest surrounding data points.
        - Providing a precomputed `sorter` can optimize performance in repeated calls.
    """
    validate_array(data=data, weights=weights)
    if sorter is None:
        sorter = np.argsort(data)
    elif sorter.shape != data.shape:
        raise ValueError(f"Invalid parameter {sorter.shape}: must match {data.shape}.")
    cumulative_weights = np.cumsum(weights[sorter])
    cw_lo, cw_hi = cumulative_weights[[0, -1]]
    cumulative_weights -= cw_lo
    cumulative_weights /= cw_hi - cw_lo
    data = data[sorter]
    if quantile <= cumulative_weights[0]:
        return data[0]
    if quantile >= cumulative_weights[-1]:
        return data[-1]
    idx = np.searchsorted(cumulative_weights, quantile)
    w_lo, w_hi = cumulative_weights[[idx - 1, idx]]
    s_lo, s_hi = data[[idx - 1, idx]]
    w = (quantile - w_lo) / (w_hi - w_lo)
    return (s_lo + w * (s_hi - s_lo)).item()
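As the notes mention, a precomputed `sorter` can be reused across repeated calls on the same data; a minimal sketch (assuming the function is importable from `bbstat.statistics`):

```python
import numpy as np

from bbstat.statistics import compute_weighted_quantile

data = np.array([1.0, 2.0, 3.0])
weights = np.array([0.2, 0.5, 0.3])

# Sort once, then reuse the sorter for several quantiles of the same data
sorter = np.argsort(data)
for q in (0.25, 0.5, 0.75):
    print(compute_weighted_quantile(data, weights, quantile=q, sorter=sorter))
```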

compute_weighted_self_information(data, weights, state)

Computes a weighted self-information of a state within 1D code data.

This function calculates the weighted probability of a state by summing the weights which coincide with data == state. The self-information is then computed as the negative logarithm of the weighted probability.

Parameters:

- `data` (`IArray`, required): A 1D array of numeric values representing the sample data in code format.
- `weights` (`FArray`, required): A 1D array of numeric weights corresponding to the data.
- `state` (`int`, required): The state for which we estimate the self-information.

Returns:

- `float`: The weighted self-information value.

Raises:

- `ValueError`: If `data` and `weights` have different shapes or are not 1D, or if `state` is not in `data`.

Example
data = np.array([1, 0, 0])
weights = np.array([0.4, 0.2, 0.4])
print(compute_weighted_self_information(data, weights, state=0))  # => 0.510...
Source code in bbstat/statistics.py
def compute_weighted_self_information(
    data: IArray,
    weights: FArray,
    state: int,
) -> float:
    """
    Computes a weighted self-information of a state within 1D code data.

    This function calculates the weighted probability of a `state` by summing
    the `weights` which coincide with `data == state`. The self-information is then
    computed as the negative logarithm of the weighted probability.

    Args:
        data (IArray): A 1D array of numeric values representing
            the sample data in code format.
        weights (FArray): A 1D array of numeric weights corresponding
            to the data.
        state (int): The state for which we estimate the self-information.

    Returns:
        float: The weighted self-information value.

    Raises:
        ValueError: If `data` and `weights` have different shapes or are not 1D,
            or if `state` is not in `data`.

    Example:
        ```python
        data = np.array([1, 0, 0])
        weights = np.array([0.4, 0.2, 0.4])
        print(compute_weighted_self_information(data, weights, state=0))  # => 0.510...
        ```
    """
    probability = compute_weighted_probability(data=data, weights=weights, state=state)
    return -math.log(probability)

compute_weighted_spearman_dependency(data, weights, *, ddof=0)

Computes the weighted Spearman rank correlation coefficient between two 1D arrays.

This function measures the monotonic relationship between two variables by computing the weighted Pearson correlation between their ranked values (i.e., their order statistics). It is particularly useful for assessing non-linear relationships.

Parameters:

- `data` (`FFArray`, required): A tuple of two 1D float arrays `(data_1, data_2)` of equal length.
- `weights` (`FArray`, required): A 1D float array of weights, same length as each array in `data`.
- `ddof` (`int`, default `0`): Delta degrees of freedom for the standard deviation. Defaults to 0 (population formula). Use 1 for sample-based correction.

Returns:

- `float`: The weighted Spearman rank correlation coefficient in the range [-1, 1].

Raises:

- `ValueError`: If the input arrays are not 1D or have mismatched lengths.

Example
data_1 = np.array([1.0, 2.0, 3.0])
data_2 = np.array([0.3, 0.2, 0.1])
weights = np.array([0.2, 0.5, 0.3])
print(compute_weighted_spearman_dependency((data_1, data_2), weights))  # => -0.9999...
Notes
  • Internally, ranks are computed using scipy.stats.rankdata, which handles ties by assigning average ranks.
  • The Spearman coefficient is equivalent to the Pearson correlation between rank-transformed data.
  • Output is bounded between -1 (perfect inverse monotonic relationship) and 1 (perfect direct monotonic relationship), with 0 indicating no monotonic correlation.
  • Weights are applied after ranking.
Source code in bbstat/statistics.py
def compute_weighted_spearman_dependency(
    data: FFArray,
    weights: FArray,
    *,
    ddof: int = 0,
) -> float:
    """
    Computes the weighted Spearman rank correlation coefficient between two 1D arrays.

    This function measures the monotonic relationship between two variables by computing
    the weighted Pearson correlation between their ranked values (i.e., their order statistics).
    It is particularly useful for assessing non-linear relationships.

    Args:
        data (FFArray): A tuple of two 1D float arrays `(data_1, data_2)`
            of equal length.
        weights (FArray): A 1D float array of weights, same length as
            each array in `data`.
        ddof (int, optional): Delta degrees of freedom for standard deviation.
            Defaults to 0 (population formula). Use 1 for sample-based correction.

    Returns:
        float: The weighted Spearman rank correlation coefficient in the range [-1, 1].

    Raises:
        ValueError: If the input arrays are not 1D or have mismatched lengths.

    Example:
        ```python
        data_1 = np.array([1.0, 2.0, 3.0])
        data_2 = np.array([0.3, 0.2, 0.1])
        weights = np.array([0.2, 0.5, 0.3])
        print(compute_weighted_spearman_dependency((data_1, data_2), weights))  # => -0.9999...
        ```

    Notes:
        - Internally, ranks are computed using `scipy.stats.rankdata`, which handles ties
          by assigning average ranks.
        - The Spearman coefficient is equivalent to the Pearson correlation between
          rank-transformed data.
        - Output is bounded between -1 (perfect inverse monotonic relationship)
          and 1 (perfect direct monotonic relationship), with 0 indicating no
          monotonic correlation.
        - Weights are applied after ranking.
    """
    data_1, data_2 = data
    ranks_1 = cast(FArray, rankdata(data_1))
    ranks_2 = cast(FArray, rankdata(data_2))
    return compute_weighted_pearson_dependency(
        data=(ranks_1, ranks_2),
        weights=weights,
        ddof=ddof,
    )
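To illustrate the contrast with the Pearson coefficient, a hypothetical example (assuming both functions are importable from `bbstat.statistics`): a monotonic but non-linear relationship gives a Spearman coefficient of 1, while the Pearson coefficient stays below 1.

```python
import numpy as np

from bbstat.statistics import (
    compute_weighted_pearson_dependency,
    compute_weighted_spearman_dependency,
)

data_1 = np.array([1.0, 2.0, 3.0])
data_2 = data_1 ** 3  # monotonic, but not linear
weights = np.array([0.2, 0.5, 0.3])

print(compute_weighted_spearman_dependency((data_1, data_2), weights))  # => 1.0 (approximately)
print(compute_weighted_pearson_dependency((data_1, data_2), weights))   # => 0.95...
```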

compute_weighted_std(data, weights, *, weighted_mean=None, ddof=0)

Computes a weighted standard deviation of the input data.

This function calculates the weighted standard deviation of the input data and weights via the square root of the compute_weighted_variance function.

Parameters:

- `data` (`FArray`, required): A 1D array of numeric values representing the data for which we want the standard deviation.
- `weights` (`FArray`, required): A 1D array of numeric values representing the weights for the data.
- `weighted_mean` (`float`, default `None`): The weighted mean of the data. If missing, it is computed via `compute_weighted_mean` inside `compute_weighted_variance`.
- `ddof` (`int`, default `0`): Delta degrees of freedom. Defaults to 0 (population formula). Use 1 for sample-based correction.

Returns:

- `float`: The computed weighted standard deviation.

Raises:

- `ValueError`: If `data` or `weights` are not 1D arrays.
- `ValueError`: If the shapes of `data` and `weights` do not match.

Example
data = np.array([1.0, 2.0, 3.0])
weights = np.array([0.2, 0.5, 0.3])
print(compute_weighted_std(data, weights))  # => 0.7
Source code in bbstat/statistics.py
def compute_weighted_std(
    data: FArray,
    weights: FArray,
    *,
    weighted_mean: Optional[float] = None,
    ddof: int = 0,
) -> float:
    """
    Computes a weighted standard deviation of the input data.

    This function calculates the weighted standard deviation of the
    input `data` and `weights` via the square root of the
    `compute_weighted_variance` function.

    Args:
        data (FArray): A 1D array of numeric values representing
            the data for which we want the standard deviation.
        weights (FArray): A 1D array of numeric values representing
            the weights for the data.
        weighted_mean (float, optional): The weighted mean of the data (default is
            None). If missing, this value is computed via `compute_weighted_mean`
            inside `compute_weighted_variance`.
        ddof (int, optional): Delta degrees of freedom.
            Defaults to 0 (population formula). Use 1 for sample-based correction.

    Returns:
        float: The computed weighted standard deviation.

    Raises:
        ValueError: If `data` or `weights` are not 1D arrays.
        ValueError: If the shapes of `data` and `weights` do not match.

    Example:
        ```python
        data = np.array([1.0, 2.0, 3.0])
        weights = np.array([0.2, 0.5, 0.3])
        print(compute_weighted_std(data, weights))  # => 0.7
        ```
    """
    weighted_variance = compute_weighted_variance(
        data=data,
        weights=weights,
        weighted_mean=weighted_mean,
        ddof=ddof,
    )
    return math.sqrt(weighted_variance)

compute_weighted_sum(data, weights)

Computes a weighted sum of the input data.

This function calculates the weighted sum of the input data and weights via the compute_weighted_aggregate function with factor=len(data).

Parameters:

- `data` (`FArray`, required): A 1D array of numeric values representing the data to be summed.
- `weights` (`FArray`, required): A 1D array of numeric values representing the weights for the data.

Returns:

- `float`: The computed weighted sum.

Raises:

- `ValueError`: If `data` or `weights` are not 1D arrays.
- `ValueError`: If the shapes of `data` and `weights` do not match.

Example
data = np.array([1.0, 2.0, 3.0])
weights = np.array([0.2, 0.5, 0.3])
print(compute_weighted_sum(data, weights))  # => 6.3
Source code in bbstat/statistics.py
def compute_weighted_sum(
    data: FArray,
    weights: FArray,
) -> float:
    """
    Computes a weighted sum of the input data.

    This function calculates the weighted sum of the input `data`
    and `weights` via the `compute_weighted_aggregate` function with
    `factor=len(data)`.

    Args:
        data (FArray): A 1D array of numeric values representing
            the data to be summed.
        weights (FArray): A 1D array of numeric values representing
            the weights for the data.

    Returns:
        float: The computed weighted sum.

    Raises:
        ValueError: If `data` or `weights` are not 1D arrays.
        ValueError: If the shapes of `data` and `weights` do not match.

    Example:
        ```python
        data = np.array([1.0, 2.0, 3.0])
        weights = np.array([0.2, 0.5, 0.3])
        print(compute_weighted_sum(data, weights))  # => 6.3
        ```
    """
    return compute_weighted_aggregate(data=data, weights=weights, factor=len(data))

compute_weighted_variance(data, weights, *, weighted_mean=None, ddof=0)

Computes a weighted variance of the input data.

This function calculates the weighted variance by applying the compute_weighted_aggregate function to the squared deviations of the data from the weighted mean, with factor=len(data) / (len(data) - ddof), where ddof specifies the delta degrees of freedom.

Parameters:

- `data` (`FArray`, required): A 1D array of numeric values representing the data for which we want the variance.
- `weights` (`FArray`, required): A 1D array of numeric values representing the weights for the data.
- `weighted_mean` (`float`, default `None`): The weighted mean of the data. If missing, it is computed via `compute_weighted_mean`.
- `ddof` (`int`, default `0`): Delta degrees of freedom. Defaults to 0 (population formula). Use 1 for sample-based correction.

Returns:

- `float`: The computed weighted variance.

Raises:

- `ValueError`: If `data` or `weights` are not 1D arrays.
- `ValueError`: If the shapes of `data` and `weights` do not match.

Example
data = np.array([1.0, 2.0, 3.0])
weights = np.array([0.2, 0.5, 0.3])
print(compute_weighted_variance(data, weights))  # => 0.49
print(compute_weighted_variance(data, weights, ddof=1))  # => 0.735
Source code in bbstat/statistics.py
def compute_weighted_variance(
    data: FArray,
    weights: FArray,
    *,
    weighted_mean: Optional[float] = None,
    ddof: int = 0,
) -> float:
    """
    Computes a weighted variance of the input data.

    This function calculates the weighted variance by applying the
    `compute_weighted_aggregate` function to the squared deviations of `data`
    from the weighted mean, with `factor=len(data) / (len(data) - ddof)`,
    where `ddof` specifies the delta degrees of freedom.

    Args:
        data (FArray): A 1D array of numeric values representing
            the data for which we want the variance.
        weights (FArray): A 1D array of numeric values representing
            the weights for the data.
        weighted_mean (float, optional): The weighted mean of the data (default is
            None). If missing, this value is computed via `compute_weighted_mean`.
        ddof (int, optional): Delta degrees of freedom.
            Defaults to 0 (population formula). Use 1 for sample-based correction.

    Returns:
        float: The computed weighted variance.

    Raises:
        ValueError: If `data` or `weights` are not 1D arrays.
        ValueError: If the shapes of `data` and `weights` do not match.

    Example:
        ```python
        data = np.array([1.0, 2.0, 3.0])
        weights = np.array([0.2, 0.5, 0.3])
        print(compute_weighted_variance(data, weights))  # => 0.49
        print(compute_weighted_variance(data, weights, ddof=1))  # => 0.735
        ```
    """
    if weighted_mean is None:
        weighted_mean = compute_weighted_mean(data=data, weights=weights)
    return compute_weighted_aggregate(
        data=np.power(data - weighted_mean, 2.0),
        weights=weights,
        factor=len(data) / (len(data) - ddof),
    )

utils Module

Utilities for bootstrap-related tasks.

This module provides functions that aid in interpreting and summarizing the output of Bayesian bootstrap resampling procedures. It includes tools to compute credible intervals for statistical estimates and to gauge the appropriate precision for rounding mean and credible-interval values from the width of the latter; a short usage sketch follows the notes below.

Main Features
  • compute_credible_interval: Computes a credible interval from a set of estimates.
  • get_precision_for_rounding: Gauges the precision for rounding from the width of the credible interval.
Notes
  • The credible interval is calculated using quantiles of the empirical distribution of bootstrap estimates.
  • This module is designed to be used alongside the evaluate module to provide complete statistical summaries of resampled data.
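A minimal usage sketch combining the two utilities on hypothetical bootstrap estimates (assuming both functions are importable from `bbstat.utils`):

```python
import numpy as np

from bbstat.utils import compute_credible_interval, get_precision_for_rounding

# Hypothetical bootstrap estimates
estimates = np.random.default_rng(seed=0).normal(loc=2.0, scale=0.3, size=1000)

low, high = compute_credible_interval(estimates, level=0.87)
precision = get_precision_for_rounding(high - low)
print(round(estimates.mean().item(), precision), (round(low, precision), round(high, precision)))
```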

compute_credible_interval(estimates, level=0.87)

Compute the credible interval for a set of estimates.

This function calculates the credible interval of the given estimates array, which is a range of values that contains a specified proportion of the data, determined by the level parameter.

The credible interval is calculated by determining the quantiles at (1 - level) / 2 and 1 - (1 - level) / 2 of the sorted estimates data.

Parameters:

- `estimates` (`FArray`, required): A 1D array of floating-point numbers representing the estimates from which the credible interval will be calculated.
- `level` (`float`, default `0.87`): The proportion of data to be included in the credible interval. Must be between 0 and 1 (exclusive).

Returns:

- `Tuple[float, float]`: A tuple containing the lower and upper bounds of the credible interval, with the lower bound corresponding to the `(1 - level) / 2` quantile, and the upper bound corresponding to the `1 - (1 - level) / 2` quantile.

Raises:

- `ValueError`: If `estimates` is not a 1D array or if `level` is not between 0 and 1 (exclusive).

Example
import numpy as np
estimates = np.array([1.1, 2.3, 3.5, 2.9, 4.0])
compute_credible_interval(estimates, 0.6)  # => (2.06, 3.6)
Source code in bbstat/utils.py
def compute_credible_interval(
    estimates: FArray,
    level: float = 0.87,
) -> Tuple[float, float]:
    """
    Compute the credible interval for a set of estimates.

    This function calculates the credible interval of the given `estimates` array,
    which is a range of values that contains a specified proportion of the data,
    determined by the `level` parameter.

    The credible interval is calculated by determining the quantiles at
    `(1 - level) / 2` and `1 - (1 - level) / 2` of the sorted `estimates` data.

    Args:
        estimates (FArray): A 1D array of floating-point numbers representing
            the estimates from which the credible interval will be calculated.
        level (float, optional): The proportion of data to be included in the credible
            interval. Must be between 0 and 1 (exclusive). Default is 0.87.

    Returns:
        Tuple[float, float]: A tuple containing the lower and upper bounds of the credible
            interval, with the lower bound corresponding to the `(1 - level) / 2` quantile,
            and the upper bound corresponding to the `1 - (1 - level) / 2` quantile.

    Raises:
        ValueError: If `estimates` is not a 1D array or if `level` is not between 0 and 1
            (exclusive).

    Example:
        ```python
        import numpy as np
        estimates = np.array([1.1, 2.3, 3.5, 2.9, 4.0])
        compute_credible_interval(estimates, 0.6)  # => (2.06, 3.6)
        ```
    """
    if estimates.ndim != 1:
        raise ValueError(f"Invalid parameter {estimates.ndim=:}: must be 1D array.")
    if level <= 0 or level >= 1:
        raise ValueError(f"Invalid parameter {level=:}: must be within (0, 1).")
    edge = (1.0 - level) / 2.0
    return tuple(np.quantile(estimates, [edge, 1.0 - edge]).tolist())

get_precision_for_rounding(ci_width)

Returns number of digits for rounding.

This function computes the precision (number of digits) for rounding mean and credible interval values for better readability. If the credible interval has width zero, we round to zero digits. Otherwise, we take one minus the floored order of magnitude of the width.

Parameters:

- `ci_width` (`float`, required): The width of the credible interval.

Returns:

- `int`: The number of digits for rounding.

Raises:

- `ValueError`: If `ci_width` is negative or NaN.

Source code in bbstat/utils.py
def get_precision_for_rounding(ci_width: float) -> int:
    """
    Returns number of digits for rounding.

    This function computes the precision (number of digits) for rounding mean and
    credible interval values for better readability. If the credible interval
    has width zero, we round to zero digits. Otherwise, we take one minus the floored
    order of magnitude of the width.

    Args:
        ci_width (float): The width of the credible interval.

    Returns:
        int: The number of digits for rounding.

    Raises:
        ValueError: If `ci_width` is negative or NaN.
    """
    if np.isnan(ci_width) or ci_width < 0.0:
        raise ValueError(f"Invalid parameter {ci_width=:}: must be non-negative.")
    if ci_width == 0:
        return 0
    return int(1 - math.floor(math.log10(ci_width)))
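Since no example is given above, a small, hypothetical illustration of the rounding rule (assuming the function is importable from `bbstat.utils`):

```python
from bbstat.utils import get_precision_for_rounding

print(get_precision_for_rounding(0.123))  # => 2 (width of order 1e-1: keep two decimal digits)
print(get_precision_for_rounding(1.0))    # => 1
print(get_precision_for_rounding(12.3))   # => 0
print(get_precision_for_rounding(0.0))    # => 0 (zero width: round to whole numbers)
```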