GaussianMixtureBinning
======================

.. currentmodule:: binlearn.methods

.. autoclass:: GaussianMixtureBinning
   :members:
   :inherited-members:
   :show-inheritance:

Overview
--------

``GaussianMixtureBinning`` creates bins by fitting a Gaussian Mixture Model (GMM) to each feature.
Bin edges are placed at the decision boundaries between adjacent mixture components, so values
that most likely belong to the same component fall into the same bin.

This approach is particularly useful when:

* Your data has natural Gaussian-like clusters
* You want bins that adapt to the probabilistic structure of the data distribution
* You need clustering that captures overlapping distributions
* You want to model uncertainty in cluster assignments

Key Features
------------

* **Probabilistic Clustering**: Fits a Gaussian mixture model to each feature
* **Decision Boundaries**: Places bin edges at the decision boundaries between adjacent components
* **Overlap Handling**: Naturally handles overlapping distributions
* **Reproducible Results**: Supports random state for consistent clustering
* **Sklearn Compatibility**: Full transformer interface with ``fit``/``transform`` methods
* **DataFrame Support**: Preserves pandas/polars column names and structure

Basic Usage
-----------

.. code-block:: python

    import numpy as np
    import pandas as pd
    from binlearn.methods import GaussianMixtureBinning

    # Create sample data with natural clusters
    np.random.seed(42)
    cluster1 = np.random.normal(10, 2, 300)
    cluster2 = np.random.normal(25, 3, 200)
    cluster3 = np.random.normal(40, 1.5, 100)
    data = np.concatenate([cluster1, cluster2, cluster3])

    # Apply Gaussian mixture binning (the binner expects a 2D array)
    binner = GaussianMixtureBinning(n_components=3, random_state=42)
    data_binned = binner.fit_transform(data.reshape(-1, 1))

    print(f"Bin edges: {binner.bin_edges_[0]}")
    print(f"Original data shape: {data.shape}")
    print(f"Binned data shape: {data_binned.shape}")

DataFrame Example
-----------------

.. code-block:: python

    import numpy as np
    import pandas as pd
    from binlearn.methods import GaussianMixtureBinning

    # DataFrame usage with multiple features, each with two clusters
    df = pd.DataFrame({
        'feature1': np.concatenate([
            np.random.normal(10, 2, 200),
            np.random.normal(30, 3, 200)
        ]),
        'feature2': np.concatenate([
            np.random.normal(5, 1, 200),
            np.random.normal(15, 2, 200)
        ])
    })

    binner = GaussianMixtureBinning(
        n_components=2,
        random_state=42,
        preserve_dataframe=True
    )
    df_binned = binner.fit_transform(df)

    # Bin edges are keyed by column name
    print(f"Bin edges for feature1: {binner.bin_edges_['feature1']}")
    print(f"Bin edges for feature2: {binner.bin_edges_['feature2']}")

Advanced Configuration
----------------------

.. code-block:: python

    # Advanced configuration with GMM parameters
    binner = GaussianMixtureBinning(
        n_components=4,
        covariance_type='full',  # Full covariance matrices
        max_iter=200,            # Maximum EM iterations
        random_state=42,
        tol=1e-3,                # Convergence tolerance
        clip=True                # Clip out-of-range values
    )

    # ``data`` is the 1D array from Basic Usage; reshape it to a single column
    data_binned = binner.fit_transform(data.reshape(-1, 1))

Scikit-learn Pipeline Integration
---------------------------------

.. code-block:: python

    from sklearn.datasets import make_classification
    from sklearn.pipeline import Pipeline
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from binlearn.methods import GaussianMixtureBinning

    # Synthetic classification data, used here only for illustration
    X, y = make_classification(n_samples=500, n_features=4, random_state=42)

    # Create a pipeline with Gaussian mixture binning
    pipeline = Pipeline([
        ('binning', GaussianMixtureBinning(n_components=3, random_state=42)),
        ('classifier', RandomForestClassifier(random_state=42))
    ])

    # Use in ML workflow
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    pipeline.fit(X_train, y_train)
    accuracy = pipeline.score(X_test, y_test)

Parameter Guide
---------------

**n_components** (int, default=5)
    Number of mixture components (bins) to create. Higher values create more granular bins.

**covariance_type** (str, default='full')
    Type of covariance parameters:

    * 'full': Each component has its own general covariance matrix
    * 'tied': All components share the same general covariance matrix
    * 'diag': Each component has its own diagonal covariance matrix
    * 'spherical': Each component has its own single variance

**random_state** (int, optional)
    Random seed for reproducible clustering results.

**max_iter** (int, default=100)
    Maximum number of EM algorithm iterations.

**tol** (float, default=1e-3)
    Tolerance for convergence of the EM algorithm.

See Also
--------

* :class:`KMeansBinning` - K-means clustering-based binning
* :class:`DBSCANBinning` - Density-based clustering binning
* :class:`EqualFrequencyBinning` - Quantile-based binning
* :class:`TreeBinning` - Decision tree-based supervised binning
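
Decision Boundary Sketch
------------------------

The Overview describes bin edges as the decision boundaries between adjacent mixture
components. The sketch below shows one way such boundaries could be located for a single
feature, using scikit-learn's ``GaussianMixture`` directly. It is an illustration only and is
not necessarily how ``GaussianMixtureBinning`` computes its edges: the helper name
``gmm_bin_edges``, the ``grid_size`` parameter, and the grid-scan approach are all assumptions
made for this example.

.. code-block:: python

    import numpy as np
    from sklearn.mixture import GaussianMixture


    def gmm_bin_edges(values, n_components=3, random_state=0, grid_size=2000):
        """Hypothetical helper: approximate GMM decision boundaries for one feature."""
        x = np.asarray(values, dtype=float).reshape(-1, 1)
        gmm = GaussianMixture(n_components=n_components, random_state=random_state)
        gmm.fit(x)

        # Scan a fine grid over the data range and record the most likely component
        grid = np.linspace(x.min(), x.max(), grid_size).reshape(-1, 1)
        labels = gmm.predict(grid)

        # A boundary lies wherever the most likely component changes between
        # two neighbouring grid points; take the midpoint as the edge
        change = np.nonzero(np.diff(labels))[0]
        boundaries = (grid[change, 0] + grid[change + 1, 0]) / 2.0

        # Close the outer bins with the observed data range
        return np.concatenate([[x.min()], boundaries, [x.max()]])


    # Same three-cluster shape as the Basic Usage example
    rng = np.random.default_rng(42)
    values = np.concatenate([
        rng.normal(10, 2, 300),
        rng.normal(25, 3, 200),
        rng.normal(40, 1.5, 100),
    ])

    edges = gmm_bin_edges(values, n_components=3, random_state=42)

    # Assign each value to a bin using the interior edges only
    bins = np.digitize(values, edges[1:-1])
    print(f"Approximate edges: {edges}")

Because components can have different variances, the most likely component may change more
than ``n_components - 1`` times across the range, so the number of edges found this way is not
guaranteed to match the number of components.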