binlearn.methods.Chi2Binning

class binlearn.methods.Chi2Binning(max_bins: int | None = None, min_bins: int | None = None, alpha: float | None = None, initial_bins: int | None = None, clip: bool | None = None, preserve_dataframe: bool | None = None, guidance_columns: Any | None = None, *, bin_edges: dict[Any, list[float]] | None = None, bin_representatives: dict[Any, list[float]] | None = None, class_: str | None = None, module_: str | None = None)

Chi-square binning implementation for supervised discretization.

This class implements chi-square binning (χ² binning), a supervised discretization method that uses the chi-square statistic to find optimal bin boundaries for classification tasks. The method creates bins that maximize the association between numeric features and categorical target variables, making it particularly effective for improving classification performance.

Chi-square binning is particularly effective for:

  • Binary and multi-class classification preprocessing

  • Creating bins that preserve class-discriminative information

  • Reducing feature dimensionality while maintaining predictive power

  • Handling continuous features with complex relationships to target classes

Key Features:

  • Uses the chi-square test of independence to guide bin boundary selection

  • Iterative merging process starting from a fine initial discretization

  • Configurable stopping criteria (significance level, bin count limits)

  • Handles both binary and multi-class classification targets

  • Automatic handling of insufficient data and edge cases

Algorithm:

  1. Create an initial fine-grained discretization (equal frequency or equal width)

  2. For each pair of adjacent bins, calculate the chi-square statistic

  3. Merge the pair with the smallest (least significant) chi-square value

  4. Repeat merging until a stopping criterion is met:

     • Minimum number of bins reached, OR

     • All remaining adjacent pairs differ significantly at level alpha (their chi-square p-values fall below alpha)

  5. Create final bin boundaries and representatives
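The merging loop described above can be sketched in plain Python. This is an illustration of the general ChiMerge idea, not binlearn's actual implementation: the helper names (chi2_adjacent, p_value_df1, chi_merge) are hypothetical, the initial discretization is assumed to be given as per-bin class counts, only a binary target is handled (one degree of freedom), and the max_bins limit is left out.

```python
import math


def chi2_adjacent(a, b):
    """Pearson chi-square statistic for the 2 x k table of two adjacent bins.

    a and b hold per-class counts, e.g. [n_class0, n_class1].
    """
    total = sum(a) + sum(b)
    stat = 0.0
    for row in (a, b):
        row_total = sum(row)
        for j, observed in enumerate(row):
            expected = row_total * (a[j] + b[j]) / total
            if expected > 0:  # skip classes absent from both bins
                stat += (observed - expected) ** 2 / expected
    return stat


def p_value_df1(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def chi_merge(edges, counts, min_bins=2, alpha=0.05):
    """Merge adjacent bins until a stopping rule fires (binary target only)."""
    edges = list(edges)
    counts = [list(c) for c in counts]
    while len(counts) > min_bins:
        stats = [chi2_adjacent(counts[i], counts[i + 1])
                 for i in range(len(counts) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        if p_value_df1(stats[i]) <= alpha:
            break  # even the most similar pair differs significantly
        counts[i] = [x + y for x, y in zip(counts[i], counts[i + 1])]
        del counts[i + 1]
        del edges[i + 1]  # drop the boundary between the merged bins
    return edges, counts


# Four initial bins over [0, 4); the class balance flips between bins 1 and 2.
edges, counts = chi_merge([0, 1, 2, 3, 4],
                          [[10, 0], [9, 1], [1, 9], [0, 10]])
print(edges, counts)
```

In this toy input the two left bins and the two right bins have similar class distributions, so they are merged, while the boundary at 2 survives because the class balance changes sharply there.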

Parameters:
  • max_bins – Maximum number of bins to create. The algorithm will not exceed this limit regardless of statistical significance. Useful for controlling model complexity and computational costs.

  • min_bins – Minimum number of bins to maintain. The algorithm will not merge below this threshold even if chi-square values are not significant. Ensures some level of discretization is preserved.

  • alpha – Significance level for the chi-square test. Adjacent bins are merged if their chi-square p-value exceeds this threshold (indicating lack of significant association). Lower values make the test stricter, so more adjacent pairs are merged and fewer bins remain.

  • initial_bins – Number of bins to create in the initial discretization step before beginning the merging process. Higher values provide finer granularity for the merging algorithm to work with.
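To make the role of alpha concrete, the following sketch applies the merge rule stated above (merge when the p-value exceeds alpha) to a single chi-square statistic. It assumes a binary target, so the statistic has one degree of freedom and its p-value has a closed form; the function names are hypothetical and are not part of binlearn.

```python
import math


def p_value_df1(chi2_stat):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2_stat / 2.0))


def should_merge(chi2_stat, alpha):
    """Merge two adjacent bins when their difference is not significant."""
    return p_value_df1(chi2_stat) > alpha


stat = 5.0  # chi-square statistic for some pair of adjacent bins
for alpha in (0.05, 0.01):
    print(alpha, round(p_value_df1(stat), 4), should_merge(stat, alpha))
```

The p-value for this statistic lies between 0.01 and 0.05, so the pair is kept separate at alpha=0.05 but merged at alpha=0.01: a lower alpha leads to more merging and therefore fewer bins.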

bin_edges_

Dictionary mapping column identifiers to lists of optimized bin edges after fitting. These edges maximize class separation.

bin_representatives_

Dictionary mapping column identifiers to lists of bin representatives (typically bin centers).
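As a rough illustration of how a list of edges and a list of representatives relate, the sketch below derives bin centers from edges and looks up the bin for a value. It is an assumption-laden example, not binlearn's code: the helper names are hypothetical, bins are assumed to be half-open with the last bin closed on the right, and returning None for out-of-range values when clipping is disabled is only a placeholder, since the source does not state what the library does in that case.

```python
from bisect import bisect_right


def bin_centers(edges):
    """Midpoint of each bin, one common choice of representative."""
    return [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]


def assign_bin(value, edges, clip=True):
    """Index of the bin containing value, for bins [edges[i], edges[i+1])."""
    n_bins = len(edges) - 1
    if value == edges[-1]:
        return n_bins - 1  # treat the last bin as closed on the right
    idx = bisect_right(edges, value) - 1
    if 0 <= idx < n_bins:
        return idx
    if clip:
        return 0 if idx < 0 else n_bins - 1  # snap to the nearest outer bin
    return None  # placeholder for "out of range"


edges = [-3.0, -0.5, 0.5, 3.0]   # hypothetical fitted edges for one column
centers = bin_centers(edges)     # [-1.75, 0.0, 1.75]
print(assign_bin(0.2, edges), centers[assign_bin(0.2, edges)])
```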

Example

>>> import numpy as np
>>> from binlearn.methods import Chi2Binning
>>>
>>> # Binary classification example
>>> X = np.random.normal(0, 1, (1000, 2))
>>> # Create target correlated with first feature
>>> y = (X[:, 0] > 0).astype(int)
>>>
>>> binner = Chi2Binning(max_bins=5, alpha=0.05)
>>> binner.fit(X, guidance_data=y.reshape(-1, 1))
>>> X_binned = binner.transform(X)
>>>
>>> # Multi-class example with custom parameters
>>> y_multi = np.random.choice([0, 1, 2], size=1000)
>>> binner_multi = Chi2Binning(
...     max_bins=10,
...     min_bins=3,
...     alpha=0.01,
...     initial_bins=20
... )
>>> binner_multi.fit(X, guidance_data=y_multi.reshape(-1, 1))

Note

  • Requires target data (guidance_data) during fitting for supervised learning

  • Works only with numeric input features and categorical targets

  • Performance depends on the relationship between features and target

  • May create fewer bins than max_bins if early stopping criteria are met

  • Inherits clipping behavior and format preservation from SupervisedBinningBase

__init__(max_bins: int | None = None, min_bins: int | None = None, alpha: float | None = None, initial_bins: int | None = None, clip: bool | None = None, preserve_dataframe: bool | None = None, guidance_columns: Any | None = None, *, bin_edges: dict[Any, list[float]] | None = None, bin_representatives: dict[Any, list[float]] | None = None, class_: str | None = None, module_: str | None = None)

Initialize Chi-square binning.