binlearn.methods.KMeansBinning
- class binlearn.methods.KMeansBinning(n_bins: int | str | None = None, allow_fallback: bool | None = None, clip: bool | None = None, preserve_dataframe: bool | None = None, fit_jointly: bool | None = None, *, bin_edges: dict[Any, list[float]] | None = None, bin_representatives: dict[Any, list[float]] | None = None, class_: str | None = None, module_: str | None = None)
K-means clustering-based binning implementation for natural data groupings.
This class implements K-means binning, which uses K-means clustering to identify natural groupings in the data and creates bin boundaries at the midpoints between adjacent cluster centroids. This approach is data-adaptive and creates bins that reflect the underlying distribution of values, making it particularly effective for non-uniformly distributed data.
K-means binning is particularly effective for:
- Non-uniformly distributed data with natural clusters
- Creating bins that preserve data density patterns
- Multimodal distributions where clusters represent different modes
- Cases where traditional equal-width or equal-frequency binning is inadequate

Key Features:
- Data-driven bin boundary selection based on clustering
- Automatically adapts to the underlying data distribution
- Creates bins with meaningful separation based on value similarity
- Handles irregular data distributions better than fixed-interval methods
- Support for flexible bin count specification (integer or string rules)

Algorithm:
1. Apply K-means clustering to each column independently to find n_bins centroids
2. Sort the centroids in ascending order
3. Create bin edges at the midpoints between consecutive centroids
4. Add data range boundaries (min, max) as outer edges
5. Use centroids as bin representatives
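The five algorithm steps can be sketched for a single column in plain Python. This is an illustration only, not the library's implementation: the notes on this page say binlearn relies on the kmeans1d package, whereas the sketch swaps in a simple Lloyd iteration with a deterministic start. The helper names `kmeans_1d` and `kmeans_bin_edges` are hypothetical.

```python
def kmeans_1d(values, k, n_iter=100):
    """Plain Lloyd iteration on 1D data; returns k centroids in ascending order.

    Hypothetical helper for illustration (the library itself uses kmeans1d).
    """
    data = sorted(values)
    n = len(data)
    # Deterministic start: spread the initial centroids over the sorted data.
    centroids = [data[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        # Assignment step: group each value with its nearest centroid.
        groups = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centroids[j]))
            groups[nearest].append(x)
        # Update step: move each centroid to the mean of its group.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)  # step 2: ascending order


def kmeans_bin_edges(values, n_bins):
    """Steps 1-5 for one column: centroids, midpoints, outer edges."""
    centroids = kmeans_1d(values, n_bins)                                # steps 1-2
    midpoints = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]  # step 3
    edges = [min(values)] + midpoints + [max(values)]                    # step 4
    return edges, centroids                                              # step 5


column = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
edges, representatives = kmeans_bin_edges(column, n_bins=3)
print(edges)            # n_bins + 1 edges: min, two midpoints, max
print(representatives)  # n_bins centroids
```

With three well-separated groups, the interior edges land roughly halfway between the group means, and there is always one more edge than there are bins.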
- Parameters:
n_bins – Number of bins to create, or string specification for automatic calculation. Can be:
- Integer: exact number of bins (and clusters) to create
- 'sqrt': number of bins = sqrt(n_samples)
- 'log2': number of bins = log2(n_samples)
- 'sturges': Sturges' rule for histogram bins
Default value can be configured globally via binlearn.config.
allow_fallback – Whether to fall back to equal-width binning when K-means clustering fails or when data has insufficient variation. If True (default), uses equal-width binning as fallback with a warning. If False, raises an error when clustering fails. Default can be configured globally.
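As a rough illustration of the two parameters above, the sketch below resolves an `n_bins` rule name into a bin count and computes the equal-width edges that a fallback would produce. Both helper names (`resolve_n_bins`, `equal_width_edges`) are hypothetical, and rounding up is an assumption: this page does not say how binlearn rounds the rule results, only that `'sqrt'` on 1000 samples gives about 32 bins.

```python
import math


def resolve_n_bins(n_bins, n_samples):
    """Turn an integer or a rule name into a concrete bin count (hypothetical helper)."""
    if isinstance(n_bins, int):
        return n_bins
    rules = {
        "sqrt": lambda n: math.sqrt(n),
        "log2": lambda n: math.log2(n),
        "sturges": lambda n: math.log2(n) + 1,  # Sturges' rule: ceil(log2(n)) + 1
    }
    if n_bins not in rules:
        raise ValueError(f"unknown n_bins rule: {n_bins!r}")
    # Assumption: round up, and never return fewer than one bin.
    return max(1, math.ceil(rules[n_bins](n_samples)))


def equal_width_edges(values, n_bins):
    """Equal-width edges over [min, max]: the kind of binning a fallback would use."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins)] + [hi]


print(resolve_n_bins("sqrt", 1000))        # 32
print(resolve_n_bins("sturges", 1000))     # 11
print(equal_width_edges([0.0, 10.0], 4))   # [0.0, 2.5, 5.0, 7.5, 10.0]
```

Note that equal-width edges depend only on the column's minimum and maximum, which is why they remain available when clustering cannot separate the values.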
- n_bins
Number of clusters/bins to create
- allow_fallback
Whether to fall back to equal-width binning when needed
- bin_edges_
Dictionary mapping column identifiers to lists of bin edges after fitting. Edges are positioned at midpoints between cluster centroids.
- bin_representatives_
Dictionary mapping column identifiers to lists of bin representatives (the cluster centroids).
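To make the relationship between the two fitted dictionaries concrete, the following sketch looks up a bin index from a list of edges and then maps it to the matching representative. It is an assumption-laden illustration, not the library's transform: the helper `assign_bins` is hypothetical, the edge values are made up, and the choices shown here (a value on an interior edge goes to the upper bin, out-of-range values are clamped to the outermost bins) may differ from how binlearn and its `clip` parameter behave.

```python
from bisect import bisect_right


def assign_bins(values, edges):
    """Map each value to a bin index in [0, len(edges) - 2] (hypothetical helper)."""
    last_bin = len(edges) - 2
    # bisect_right counts the edges <= x; subtracting 1 gives the bin whose
    # left edge is at or below x. Clamping handles values outside the range
    # and the maximum itself (assumed policy, see the lead-in).
    return [min(max(bisect_right(edges, x) - 1, 0), last_bin) for x in values]


# Made-up fitted state for a single column keyed by 0.
bin_edges = {0: [0.8, 3.0, 7.0, 9.2]}
bin_representatives = {0: [1.0, 5.0, 9.0]}

indices = assign_bins([1.1, 3.0, 6.9, 9.2, 12.0], bin_edges[0])
centers = [bin_representatives[0][i] for i in indices]
print(indices)  # [0, 1, 1, 2, 2]
print(centers)  # [1.0, 5.0, 5.0, 9.0, 9.0]
```

Each column has one more edge than it has representatives, so a bin index is always valid for both lists.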
Example
>>> import numpy as np
>>> from binlearn.methods import KMeansBinning
>>>
>>> # Multimodal data - mixture of two normal distributions
>>> X1 = np.random.normal(2, 0.5, 500)  # First mode
>>> X2 = np.random.normal(8, 0.5, 500)  # Second mode
>>> X = np.concatenate([X1, X2]).reshape(-1, 1)
>>>
>>> binner = KMeansBinning(n_bins=4)
>>> binner.fit(X)
>>> X_binned = binner.transform(X)
>>> # Bins naturally separate the two modes
>>>
>>> # Automatic bin count based on data size
>>> binner_auto = KMeansBinning(n_bins='sqrt')
>>> binner_auto.fit(X)  # Uses sqrt(1000) ≈ 32 bins
>>>
>>> # Irregular distribution
>>> X_irregular = np.concatenate([
...     np.random.uniform(0, 2, 100),   # Uniform region
...     np.random.normal(5, 0.2, 800),  # Tight cluster
...     np.random.uniform(8, 10, 100)   # Another uniform region
... ]).reshape(-1, 1)
>>> binner_adaptive = KMeansBinning(n_bins=6)
>>> binner_adaptive.fit(X_irregular)  # Adapts to density variations
Note
Only works with numeric data; non-numeric columns raise errors
Performance depends on the clustering quality and data separability
May create fewer effective bins if clusters are very close together
Requires the kmeans1d package for efficient 1D K-means clustering
Inherits clipping behavior and format preservation from IntervalBinningBase
- __init__(n_bins: int | str | None = None, allow_fallback: bool | None = None, clip: bool | None = None, preserve_dataframe: bool | None = None, fit_jointly: bool | None = None, *, bin_edges: dict[Any, list[float]] | None = None, bin_representatives: dict[Any, list[float]] | None = None, class_: str | None = None, module_: str | None = None)
Initialize K-means binning.