Chi2Binning

Bases: SupervisedBinningBase

Chi-square binning implementation for supervised discretization.

This class implements chi-square binning (χ² binning), a supervised discretization method that uses the chi-square statistic to find optimal bin boundaries for classification tasks. The method creates bins that maximize the association between numeric features and categorical target variables, making it particularly effective for improving classification performance.

Chi-square binning is particularly effective for:

- Binary and multi-class classification preprocessing
- Creating bins that preserve class-discriminative information
- Reducing feature dimensionality while maintaining predictive power
- Handling continuous features with complex relationships to target classes

Key Features:

- Uses chi-square test of independence to guide bin boundary selection
- Iterative merging process starting from fine initial discretization
- Configurable stopping criteria (significance level, bin count limits)
- Handles both binary and multi-class classification targets
- Automatic handling of insufficient data and edge cases

Algorithm:

1. Create an initial fine-grained discretization (equal frequency or equal width)
2. For each pair of adjacent bins, calculate the chi-square statistic
3. Merge the pair with the smallest (least significant) chi-square value
4. Repeat merging until a stopping criterion is met:
   - Minimum number of bins reached, OR
   - All remaining chi-square values exceed the significance threshold (alpha)
5. Create final bin boundaries and representatives
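The merging steps above can be sketched in plain Python. This is an illustrative sketch only, not binlearn's implementation: the helper names `chi2_pair` and `merge_bins` are hypothetical, bins are represented by per-class counts, and the way `max_bins` overrides the significance check is an assumption based on the parameter descriptions.

```python
def chi2_pair(a, b):
    """Chi-square statistic for two adjacent bins given their per-class counts."""
    total = sum(a) + sum(b)
    stat = 0.0
    for row in (a, b):
        row_total = sum(row)
        for j, observed in enumerate(row):
            expected = row_total * (a[j] + b[j]) / total
            if expected > 0:  # skip classes absent from both bins
                stat += (observed - expected) ** 2 / expected
    return stat


def merge_bins(counts, min_bins, max_bins, threshold):
    """Merge the least significant adjacent pair until a stopping criterion is met."""
    counts = [list(c) for c in counts]
    while len(counts) > min_bins:
        stats = [chi2_pair(counts[i], counts[i + 1]) for i in range(len(counts) - 1)]
        i = min(range(len(stats)), key=stats.__getitem__)
        # Assumption: above max_bins, merging continues regardless of significance.
        if len(counts) <= max_bins and stats[i] > threshold:
            break
        counts[i] = [x + y for x, y in zip(counts[i], counts[i + 1])]
        del counts[i + 1]
    return counts


# Made-up counts [class 0, class 1] for four initial bins.
initial = [[10, 0], [9, 1], [1, 9], [0, 10]]
# 3.841 is the chi-square critical value for 1 degree of freedom at alpha = 0.05.
merged = merge_bins(initial, min_bins=2, max_bins=4, threshold=3.841)
print(merged)  # [[19, 1], [1, 19]]
```

The two outer pairs have nearly identical class distributions, so they are merged first; the middle boundary, where the class mix flips, survives.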

Parameters:

max_bins – Maximum number of bins to create. The algorithm will not exceed this limit regardless of statistical significance. Useful for controlling model complexity and computational costs.
min_bins – Minimum number of bins to maintain. The algorithm will not merge below this threshold even if chi-square values are not significant. Ensures some level of discretization is preserved.
alpha – Significance level for the chi-square test. Adjacent bins are merged if their chi-square p-value exceeds this threshold (indicating lack of significant association). Lower values make merging more likely and therefore result in fewer bins.
initial_bins – Number of bins to create in the initial discretization step before beginning the merging process. Higher values provide finer granularity for the merging algorithm to work with.

bin_edges_: Dictionary mapping column identifiers to lists of optimized bin edges after fitting. These edges maximize class separation.

bin_representatives_: Dictionary mapping column identifiers to lists of bin representatives (typically bin centers).

Example

>>> import numpy as np
>>> from binlearn.methods import Chi2Binning
>>>
>>> # Binary classification example
>>> X = np.random.normal(0, 1, (1000, 2))
>>> # Create target correlated with first feature
>>> y = (X[:, 0] > 0).astype(int)
>>>
>>> binner = Chi2Binning(max_bins=5, alpha=0.05)
>>> binner.fit(X, guidance_data=y.reshape(-1, 1))
>>> X_binned = binner.transform(X)
>>>
>>> # Multi-class example with custom parameters
>>> y_multi = np.random.choice([0, 1, 2], size=1000)
>>> binner_multi = Chi2Binning(
...     max_bins=10,
...     min_bins=3,
...     alpha=0.01,
...     initial_bins=20
... )
>>> binner_multi.fit(X, guidance_data=y_multi.reshape(-1, 1))

Note

Requires target data (guidance_data) during fitting for supervised learning
Works only with numeric input features and categorical targets
Performance depends on the relationship between features and target
May create fewer bins than max_bins if early stopping criteria are met
Inherits clipping behavior and format preservation from SupervisedBinningBase

__init__(max_bins: int | None = None, min_bins: int | None = None, alpha: float | None = None, initial_bins: int | None = None, clip: bool | None = None, preserve_dataframe: bool | None = None, guidance_columns: Any | None = None, *, bin_edges: dict[Any, list[float]] | None = None, bin_representatives: dict[Any, list[float]] | None = None, class_: str | None = None, module_: str | None = None)[source]: Initialize Chi-square binning.

classmethod __init_subclass__(**kwargs)

Set the set_{method}_request methods.

This uses PEP-487 [1] to set the set_{method}_request methods. It looks for the information available in the set default values which are set using __metadata_request__* class attributes, or inferred from method signatures.

The __metadata_request__* class attributes are used when a method does not explicitly accept a metadata through its arguments or if the developer would like to specify a request value for those metadata which are different from the default None.

static check_data_quality(data: ndarray[Any, Any], name: str = 'data') → None: Check data quality and issue warnings if needed.

property feature_names_in_: list[str] | None: Get feature names.

fit(X: Any, y: Any | None = None, **fit_params: Any) → GeneralBinningBase

Fit the binning transformer with comprehensive orchestration.

This method orchestrates the complete fitting process, handling parameter validation, input preprocessing, column separation, and routing to the appropriate fitting strategy (joint vs independent).

Parameters:

X – Input data to fit the binning transformer on. Can be:
    - pandas.DataFrame: Column names are preserved
    - polars.DataFrame: Column names are preserved
    - numpy.ndarray: Numeric column indices are used
    - array-like: Converted to numpy array
y – Target values for supervised binning methods. Ignored by unsupervised methods. Can be array-like or None.
**fit_params – Additional fitting parameters passed to the specific binning algorithm implementation. Common parameters include:
    - guidance_data: Alternative guidance data (conflicts with fit_jointly=True)

Returns:

The fitted binning transformer instance.

Return type:

self

Raises:

ValueError – If parameter validation fails, inputs are invalid, or conflicting parameters are provided (e.g., fit_jointly=True with guidance_data).
BinningError – If the binning algorithm fails to fit the data.
RuntimeError – If an unexpected error occurs during fitting.

Example

>>> from binlearn import EqualWidthBinning
>>> import pandas as pd
>>> X = pd.DataFrame({'feature1': [1, 2, 3, 4, 5], 'feature2': [10, 20, 30, 40, 50]})
>>> binner = EqualWidthBinning(n_bins=3)
>>> binner.fit(X)
EqualWidthBinning(...)

Note

The method automatically handles column separation when guidance_columns is specified, routing guidance columns separately from binning columns. The fitting strategy (joint vs independent) is determined by the fit_jointly parameter.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

get_input_columns() → list[Any] | None

Get input columns for data preparation.

This method should be overridden by derived classes to provide appropriate column information without exposing binning-specific concepts.

Returns:

Column information or None if not available

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep: bool = True) → dict[str, Any]

Get parameters for this estimator, including fitted parameters.

This method extends sklearn’s standard get_params to include fitted parameters when the estimator is fitted, enabling complete object reconstruction through the get_params/set_params interface. This is essential for pipeline persistence and model serialization.

Parameters:

deep – If True, returns parameters for sub-estimators (not applicable here but maintained for sklearn compatibility).

Returns:

Dictionary of parameter names mapped to their values, including:

- Constructor parameters extracted from __init__ signature
- Fitted parameters (if estimator is fitted) mapped from attributes
- Class metadata (class_, module_) for automatic reconstruction

Example

>>> binner = EqualWidthBinning(n_bins=5)
>>> params = binner.get_params()
>>> print(params)
{'n_bins': 5, 'clip': None, ..., 'class_': 'EqualWidthBinning', 'module_': '...'}
>>>
>>> binner.fit(X)
>>> fitted_params = binner.get_params()
>>> # Now includes: {'bin_edges': {...}, 'bin_representatives': {...}, ...}

Note

Automatically extracts constructor parameters from __init__ signature
Includes fitted parameters only when estimator is fitted
Adds class metadata for reconstruction workflows
Excludes internal sklearn attributes like n_features_in_
class_ and module_ parameters are handled specially during set_params

inverse_transform(X: Any) → Any

Inverse transform from bin indices back to representative values.

Converts discrete bin indices back to their representative values, effectively reversing the binning transformation. This is useful for interpreting results or reconstructing approximate original values.

Parameters:

X – Input data containing bin indices to inverse transform. Should contain only binning columns (no guidance columns). Can be:
    - pandas.DataFrame: Column names should match binning columns
    - polars.DataFrame: Column names should match binning columns
    - numpy.ndarray: Must have same number of binning columns
    - array-like: Converted to numpy array

Returns:

Inverse transformed data where bin indices are replaced with their representative values (typically bin centers). Output format matches the preserve_dataframe setting.

Raises:

RuntimeError – If the transformer has not been fitted yet.
ValueError – If input data has wrong number of columns or invalid format.
BinningError – If inverse transformation fails.

Example

>>> # After fitting and transforming
>>> X_binned = [[0, 1], [1, 0], [2, 2]]  # Bin indices
>>> X_reconstructed = binner.inverse_transform(X_binned)
>>> print(X_reconstructed)
[[0.5, 1.5], [1.5, 0.5], [2.5, 2.5]]  # Representative values

Note

For guided binning (when guidance_columns is specified), the input should only contain the binning columns, not the guidance columns. The number of input columns must match the number of binning columns.
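The lookup performed by an inverse transform can be sketched as follows. This is a minimal illustration under assumptions, not the library's code: `inverse_transform_rows` is a hypothetical helper, and the representatives dictionary stands in for a fitted `bin_representatives_` keyed by column index.

```python
def inverse_transform_rows(rows, representatives):
    """Replace each bin index with that bin's representative, column by column."""
    return [
        [representatives[col][idx] for col, idx in enumerate(row)]
        for row in rows
    ]


# Hypothetical fitted representatives for two binning columns.
representatives = {0: [0.5, 1.5, 2.5], 1: [0.5, 1.5, 2.5]}
X_binned = [[0, 1], [1, 0], [2, 2]]
X_reconstructed = inverse_transform_rows(X_binned, representatives)
print(X_reconstructed)  # [[0.5, 1.5], [1.5, 0.5], [2.5, 2.5]]
```

Because every value in a bin maps to the same representative, the reconstruction is approximate: the original values cannot be recovered exactly.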

property n_features_in_: int: Get number of features.

set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None) –

Configure output of transform and fit_transform.

"default": Default output format of a transformer
"pandas": DataFrame output
"polars": Polars output
None: Transform configuration is unchanged

Added in version 1.4: “polars” option was added.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_params(**params: Any) → SklearnIntegrationBase

Set the parameters of this estimator.

This method supports reconstruction workflows by handling fitted parameters that come from get_params() output (without underscores) and setting them as fitted attributes (with underscores).

Parameters:

**params – Parameters to set. Can include:
    - Regular constructor parameters (n_bins, clip, etc.)
    - Fitted parameters from get_params (bin_edges, bin_representatives)
    - Class metadata (ignored during reconstruction)

Returns:

Returns the instance itself.

Return type:

self

transform(X: Any) → Any

Transform input data using fitted binning parameters.

Applies the fitted binning transformation to new data, converting continuous values to discrete bin indices or representatives. Handles column separation when guidance columns are present.

Parameters:

X – Input data to transform. Must have the same structure as the data used during fitting (same number of columns). Can be:
    - pandas.DataFrame: Column names should match training data
    - polars.DataFrame: Column names should match training data
    - numpy.ndarray: Must have same number of columns as training
    - array-like: Converted to numpy array

Returns:

Transformed data where continuous values are replaced with bin indices or representative values. The output format depends on:

- preserve_dataframe setting: DataFrame vs array format
- binning method: indices vs representatives
- guidance_columns: only binning columns are transformed

Raises:

RuntimeError – If the transformer has not been fitted yet.
ValueError – If the input data has incompatible structure or format.
BinningError – If transformation fails due to data issues.

Example

>>> # After fitting
>>> X_new = pd.DataFrame({'feature1': [1.5, 2.5], 'feature2': [15, 25]})
>>> X_binned = binner.transform(X_new)
>>> print(X_binned)
[[0, 0], [1, 1]]  # Bin indices

Note

When guidance_columns is specified, only the binning columns are transformed. Guidance columns are filtered out from the output. The method preserves the original data format when preserve_dataframe=True.
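The core of the transform step, mapping a value to a bin index given fitted edges, can be sketched with the standard library. The helper `to_bin_index` and the edge values are hypothetical, and the sketch assumes left-closed bins with out-of-range values clipped to the first or last bin; the library's actual edge handling may differ.

```python
from bisect import bisect_right


def to_bin_index(value, edges):
    """Index of the bin containing value, clipped to the valid range."""
    last_bin = len(edges) - 2
    return min(max(bisect_right(edges, value) - 1, 0), last_bin)


# Hypothetical fitted edges per column (three bins each).
edges = {"feature1": [1.0, 2.0, 3.0, 5.0], "feature2": [10.0, 20.0, 30.0, 50.0]}
rows = [{"feature1": 1.5, "feature2": 15}, {"feature1": 2.5, "feature2": 25}]

binned = [[to_bin_index(row[col], edges[col]) for col in edges] for row in rows]
print(binned)  # [[0, 0], [1, 1]]
```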

static validate_array_like(data: Any, name: str = 'data', allow_none: bool = False) → ndarray[Any, Any] | None

Validate and convert array-like input to numpy array.

This method provides robust validation and conversion of various input formats to numpy arrays, with comprehensive error handling and helpful suggestions for common issues.

Parameters:

data – Input data to validate and convert. Can be:
    - numpy.ndarray: Used directly
    - pandas.DataFrame/Series: Converted to numpy array
    - polars.DataFrame: Converted to numpy array
    - list, tuple: Converted to numpy array
    - None: Allowed only if allow_none=True
name – Name of the data parameter for error messages. Used to provide context in error messages (e.g., “X”, “y”, “guidance_data”).
allow_none – Whether to allow None as a valid input. If True, None is returned unchanged; if False, None raises InvalidDataError.

Returns:

Validated numpy array, or None if data is None and allow_none=True. The returned array maintains the same data content but is guaranteed to be a numpy array.

Raises:

InvalidDataError – If validation fails:
    - data is None when allow_none=False
    - data cannot be converted to numpy array
    - Conversion process encounters errors

Example

>>> # Valid inputs
>>> arr = ValidationMixin.validate_array_like([1, 2, 3], "X")
>>> print(type(arr))
<class 'numpy.ndarray'>
>>>
>>> # Allow None
>>> result = ValidationMixin.validate_array_like(None, "y", allow_none=True)
>>> print(result)
None
>>>
>>> # Invalid input
>>> ValidationMixin.validate_array_like(None, "X", allow_none=False)
InvalidDataError: X cannot be None

Note

This method focuses on format validation and conversion. Content validation (like checking for NaN values) should be done separately using other validation methods.

static validate_column_specification(columns: Any, data_shape: tuple[int, ...]) → list[Any]: Validate column specifications.

static validate_guidance_columns(guidance_cols: Any, binning_cols: list[Any], data_shape: tuple[int, ...]) → list[Any]: Validate guidance column specifications.

validate_guidance_data(guidance_data: Any, name: str = 'guidance_data') → ndarray[Any, Any]

Validate and preprocess guidance data for supervised binning.

Ensures that the guidance data is appropriate for supervised binning by validating its shape and checking for data quality issues.

Parameters:

guidance_data – Raw guidance/target data to validate. Should be a 2D array with shape (n_samples, 1) or 1D array with shape (n_samples,).
name – Name used in error messages for better debugging context.

Returns:

Validated and preprocessed guidance data with shape (n_samples, 1).

Raises:

ValidationError – If guidance data has invalid shape or format.

Overview

Chi2Binning is a supervised discretization method that uses the chi-square statistic to find optimal split points. The method iteratively merges the pair of adjacent intervals with the lowest chi-square statistic, that is, the pair whose class distributions are most alike, so that the remaining bins stay strongly associated with the target variable.

This approach is particularly effective for:

Classification tasks where bins need to separate different classes effectively
Categorical target variables with clear class boundaries
Feature engineering for improving downstream classification performance
Data preparation where maintaining class relationships is crucial

Key Features

Supervised Learning: Uses target variable information for optimal binning
Statistical Foundation: Based on chi-square test of independence
Iterative Optimization: Repeatedly merges the adjacent intervals with the lowest chi-square statistic
Classification Focus: Optimized for categorical target variables
Automatic Stopping: Uses significance levels to determine optimal number of bins
Sklearn Compatibility: Full transformer interface with fit/transform methods
DataFrame Support: Preserves pandas/polars column names and structure

Basic Usage

import numpy as np
import pandas as pd
from binlearn.methods import Chi2Binning
from sklearn.datasets import make_classification

# Create sample classification data
X, y = make_classification(
    n_samples=1000,
    n_features=3,
    n_classes=3,
    n_redundant=0,
    random_state=42
)

# Apply chi-square binning
binner = Chi2Binning(
    max_bins=10,
    min_bins=3,
    alpha=0.05
)

# Method 1: Using fit with X and y (sklearn style)
binner.fit(X, y)
X_binned = binner.transform(X)

print(f"Original shape: {X.shape}")
print(f"Binned shape: {X_binned.shape}")
print(f"Bins for feature 0: {len(binner.bin_edges_[0]) - 1}")

DataFrame Example with Target Column

# Create DataFrame with target column
df = pd.DataFrame(X, columns=['feature1', 'feature2', 'feature3'])
df['target'] = y

# Method 2: Using guidance_columns (binlearn style)
binner = Chi2Binning(
    guidance_columns=['target'],  # Use target column for guidance
    max_bins=8,
    min_bins=2,
    preserve_dataframe=True
)

# Fit and transform the entire DataFrame
df_binned = binner.fit_transform(df)

print(f"Bin edges for feature1: {binner.bin_edges_['feature1']}")
print(f"Bin edges for feature2: {binner.bin_edges_['feature2']}")
# Guidance columns are filtered out of the transformed output
print(f"Target column in output: {'target' in df_binned.columns}")

Regression Example

from sklearn.datasets import make_regression

# Create regression data and discretize target
X_reg, y_reg = make_regression(n_samples=1000, n_features=2, random_state=42)

# Discretize continuous target for chi-square binning
y_discrete = pd.cut(y_reg, bins=5, labels=['very_low', 'low', 'medium', 'high', 'very_high'])

binner = Chi2Binning(
    max_bins=6,
    min_bins=3,
    alpha=0.01  # More stringent significance level
)

binner.fit(X_reg, y_discrete)
X_reg_binned = binner.transform(X_reg)

print(f"Regression bins created: {[len(edges)-1 for edges in binner.bin_edges_.values()]}")

Advanced Configuration

# Fine-tuned chi-square binning for specific requirements

# Conservative binning (fewer, more significant bins)
conservative_binner = Chi2Binning(
    max_bins=15,
    min_bins=2,
    alpha=0.001,        # Very stringent significance level
    initial_bins=20     # Start with more initial bins
)

# Liberal binning (more bins, less stringent)
liberal_binner = Chi2Binning(
    max_bins=25,
    min_bins=5,
    alpha=0.1,          # More permissive significance level
    initial_bins=30
)

Comparison with Other Methods

from binlearn.methods import EqualWidthBinning, TreeBinning
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Compare different binning methods
binners = {
    'chi2': Chi2Binning(max_bins=8, alpha=0.05),
    'equal_width': EqualWidthBinning(n_bins=8),
    'tree': TreeBinning(max_depth=3, min_samples_leaf=50)
}

results = {}
classifier = RandomForestClassifier(random_state=42, n_estimators=100)

for name, binner in binners.items():
    # Fit binner and transform data
    binner.fit(X_train, y_train)
    X_train_binned = binner.transform(X_train)
    X_test_binned = binner.transform(X_test)

    # Train classifier on binned data
    classifier.fit(X_train_binned, y_train)
    y_pred = classifier.predict(X_test_binned)

    results[name] = accuracy_score(y_test, y_pred)
    print(f"{name}: {results[name]:.3f} accuracy")

Parameter Tuning

# Grid search for optimal parameters
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Create pipeline with chi-square binning
pipeline = Pipeline([
    ('binning', Chi2Binning()),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Parameter grid for binning
param_grid = {
    'binning__max_bins': [5, 8, 12, 15],
    'binning__alpha': [0.001, 0.01, 0.05, 0.1],
    'binning__initial_bins': [10, 15, 20]
}

# Grid search
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.3f}")

Parameter Guide

max_bins (int, default=10)

Maximum number of bins to create. The algorithm will never exceed this number:

Higher values: Allow more granular binning
Lower values: Force more aggressive merging
Consider your downstream model’s capacity

min_bins (int, default=2)

Minimum number of bins to maintain. Prevents over-merging:

Higher values: Ensure sufficient granularity
Lower values: Allow aggressive simplification
Should be at least 2 for meaningful binning

alpha (float, default=0.05)

Significance level for chi-square test. Lower values are more stringent:

Lower values (0.001): More conservative, fewer bins
Higher values (0.1): More liberal, more bins
Common values: 0.05 (standard), 0.01 (conservative)
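The effect of alpha on merging can be checked numerically. The sketch below assumes the simplest case of two adjacent bins and two classes, where the test has 1 degree of freedom and the p-value has a closed form; `p_value_df1` and `should_merge` are hypothetical helpers, not part of binlearn.

```python
from math import erfc, sqrt


def p_value_df1(chi2_stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return erfc(sqrt(chi2_stat / 2.0))


def should_merge(chi2_stat, alpha):
    """Merge when the difference between the two bins is not significant."""
    return p_value_df1(chi2_stat) > alpha


stat = 5.0  # made-up chi-square value for one pair of adjacent bins
for alpha in (0.1, 0.05, 0.01, 0.001):
    print(alpha, should_merge(stat, alpha))
```

With this statistic the pair is kept separate at alpha = 0.1 and 0.05 but merged at 0.01 and 0.001, which is why lower alpha values lead to fewer bins.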

initial_bins (int, default=10)

Number of initial equal-width bins before merging:

Higher values: More potential split points to consider
Lower values: Faster computation but less flexibility
Should be >= max_bins
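For intuition about the starting point of the merge, the following sketch builds equal-width initial edges for one feature. The function `equal_width_edges` is a hypothetical helper written for this illustration; it assumes the feature has at least two distinct values.

```python
def equal_width_edges(values, n_bins):
    """n_bins + 1 evenly spaced edges spanning the observed range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Append hi explicitly so the last edge is exact despite float rounding.
    return [lo + i * width for i in range(n_bins)] + [hi]


initial_edges = equal_width_edges([0.0, 2.5, 10.0], n_bins=4)
print(initial_edges)  # [0.0, 2.5, 5.0, 7.5, 10.0]
```

Each interior edge is a candidate split point, so a larger initial_bins gives the merging step more boundaries to choose from at the cost of more pairwise tests.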

Statistical Background

The chi-square statistic measures the independence between a feature’s bins and the target classes:

\[\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]

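Where:

- \(O_{ij}\) is the observed frequency in bin i, class j
- \(E_{ij}\) is the expected frequency under independence
- \(r\) is the number of bins being compared and \(c\) is the number of classes
- Lower chi-square values indicate that the bins have similar class distributions, which makes them good candidates for merging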
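As a worked example with made-up counts, take two adjacent bins and two classes, with observed counts (9, 1) in the first bin and (1, 9) in the second. Each row and column total is 10 and the grand total is N = 20, so every expected frequency is the product of its row total \(R_i\) and column total \(C_j\) divided by N:

\[E_{ij} = \frac{R_i C_j}{N} = \frac{10 \cdot 10}{20} = 5\]

\[\chi^2 = \frac{(9-5)^2}{5} + \frac{(1-5)^2}{5} + \frac{(1-5)^2}{5} + \frac{(9-5)^2}{5} = 4 \cdot \frac{16}{5} = 12.8\]

With 1 degree of freedom, the critical value at alpha = 0.05 is about 3.841, so this pair differs significantly and would not be merged. By contrast, bins with counts (10, 0) and (9, 1) have expected frequencies 9.5 and 0.5 in each row:

\[\chi^2 = 2 \cdot \frac{(0.5)^2}{9.5} + 2 \cdot \frac{(0.5)^2}{0.5} \approx 0.053 + 1 = 1.053\]

This value is below 3.841, so the pair would be a candidate for merging.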

Handling Edge Cases

# Handling insufficient data
small_X = X[:50]  # Very small dataset
small_y = y[:50]

# Use conservative parameters for small datasets
small_binner = Chi2Binning(
    max_bins=5,      # Fewer bins for small data
    min_bins=2,      # Conservative minimum
    alpha=0.1,       # More permissive for small samples
    initial_bins=8   # Fewer initial bins
)

small_binned = small_binner.fit_transform(small_X, small_y)

Tips for Best Results

Choose initial_bins wisely: Start with 2-3x your desired max_bins
Adjust alpha based on sample size: Use smaller alpha for larger datasets
Consider target distribution: Imbalanced classes may need different alpha values
Validate on holdout data: Chi-square optimization can overfit to training data
