IsotonicBinning =============== .. currentmodule:: binlearn.methods .. autoclass:: IsotonicBinning :members: :inherited-members: :show-inheritance: Overview -------- ``IsotonicBinning`` creates bins using isotonic regression to find optimal cut points that preserve monotonic relationships between features and targets. The transformer fits an isotonic (non-decreasing) function to the data and identifies significant changes in this function to determine bin boundaries. This method is particularly effective when: * There's a **known monotonic relationship** between feature and target * You want bins that **respect monotonic ordering** * Traditional tree-based methods might create non-monotonic splits * You need **interpretable bins** that maintain logical ordering Key Features ------------ * **Monotonicity Preservation**: Ensures bins respect monotonic relationships * **Isotonic Regression**: Uses sklearn's IsotonicRegression for optimal fitting * **Automatic Cut Points**: Identifies significant changes in isotonic function * **Flexible Direction**: Supports both increasing and decreasing monotonicity * **Sample Size Control**: Ensures minimum samples per bin for statistical validity * **Supervised Learning**: Uses target variable information for optimal binning * **Sklearn Compatibility**: Full transformer interface with fit/transform methods * **DataFrame Support**: Preserves pandas/polars column names and structure Basic Usage ----------- .. code-block:: python import numpy as np import pandas as pd from binlearn.methods import IsotonicBinning # Create sample data with monotonic relationship np.random.seed(42) X = np.random.uniform(0, 10, 500).reshape(-1, 1) y = 2 * X.flatten() + np.random.normal(0, 1, 500) # Linear + noise # Apply isotonic binning binner = IsotonicBinning( max_bins=6, min_samples_per_bin=20, increasing=True ) # Fit using X and y (sklearn style) binner.fit(X, y) X_binned = binner.transform(X) print(f"Original shape: {X.shape}") print(f"Binned shape: {X_binned.shape}") print(f"Bin edges: {binner.bin_edges_[0]}") Classification Example ---------------------- .. code-block:: python from sklearn.datasets import make_classification # Create classification data with monotonic relationship X, y = make_classification( n_samples=1000, n_features=1, n_redundant=0, n_clusters_per_class=1, random_state=42 ) # Sort by feature to create monotonic relationship sort_idx = np.argsort(X.flatten()) X_sorted = X[sort_idx] y_sorted = y[sort_idx] binner = IsotonicBinning( max_bins=8, min_samples_per_bin=30, increasing=True, min_change_threshold=0.05 ) binner.fit(X_sorted, y_sorted) X_binned = binner.transform(X_sorted) print(f"Created {len(binner.bin_edges_[0]) - 1} bins") print(f"Bin edges: {binner.bin_edges_[0]}") DataFrame Example with Guidance Columns ---------------------------------------- .. code-block:: python # Create DataFrame with target column df = pd.DataFrame({ 'age': np.random.uniform(18, 80, 1000), 'income': np.random.uniform(20000, 150000, 1000), 'credit_score': np.random.uniform(300, 850, 1000) }) # Create monotonic target: risk increases with age, decreases with income/credit df['default_risk'] = ( 0.3 * (df['age'] - 18) / 62 + # Age increases risk -0.4 * (df['income'] - 20000) / 130000 + # Income decreases risk -0.3 * (df['credit_score'] - 300) / 550 + # Credit decreases risk np.random.normal(0, 0.1, 1000) ) # Bin each feature with appropriate monotonicity age_binner = IsotonicBinning( guidance_columns=['default_risk'], max_bins=5, increasing=True, # Risk increases with age preserve_dataframe=True ) income_binner = IsotonicBinning( guidance_columns=['default_risk'], max_bins=6, increasing=False, # Risk decreases with income preserve_dataframe=True ) # Apply binning df_age_binned = age_binner.fit_transform(df[['age', 'default_risk']]) df_income_binned = income_binner.fit_transform(df[['income', 'default_risk']]) Advanced Configuration ---------------------- .. code-block:: python # Fine-tuned isotonic binning for specific requirements # Conservative binning (fewer bins, stricter requirements) conservative_binner = IsotonicBinning( max_bins=5, min_samples_per_bin=50, # Larger bins for stability min_change_threshold=0.1, # Require larger changes increasing=True, y_min=0.0, # Bound target values y_max=1.0 ) # Granular binning (more bins, sensitive to changes) granular_binner = IsotonicBinning( max_bins=15, min_samples_per_bin=10, # Smaller bins allowed min_change_threshold=0.01, # Sensitive to small changes increasing=True ) Parameter Guide --------------- **max_bins** (int, default=10) Maximum number of bins to create. Actual number may be smaller: * Higher values: Allow more granular binning * Lower values: Force simpler, broader bins * Consider your model's complexity needs **min_samples_per_bin** (int, default=5) Minimum samples required per bin for statistical validity: * Higher values: More stable bins, fewer total bins * Lower values: More granular binning, potentially less stable * Rule of thumb: At least 30 for regression, 10+ per class for classification **increasing** (bool, default=True) Direction of monotonic relationship: * True: Higher feature values → higher target values * False: Higher feature values → lower target values * Must match your domain knowledge **min_change_threshold** (float, default=0.01) Minimum relative change in isotonic function to create new bin: * Smaller values: More sensitive, create more bins * Larger values: Less sensitive, create fewer bins * Typical range: 0.005 to 0.1 **y_min, y_max** (float, optional) Bounds for target values in isotonic regression: * Helps constrain the isotonic function * Useful for normalized targets or known ranges * If None, uses data min/max Scikit-learn Pipeline Integration --------------------------------- .. code-block:: python from sklearn.pipeline import Pipeline from sklearn.ensemble import RandomForestRegressor from sklearn.model_selection import train_test_split from sklearn.metrics import mean_squared_error # Create pipeline with isotonic binning pipeline = Pipeline([ ('binning', IsotonicBinning(max_bins=6, increasing=True)), ('regressor', RandomForestRegressor(random_state=42)) ]) # Use in ML workflow X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) pipeline.fit(X_train, y_train) y_pred = pipeline.predict(X_test) mse = mean_squared_error(y_test, y_pred) print(f"Pipeline MSE: {mse:.4f}") Tips for Best Results --------------------- 1. **Validate monotonicity first**: Use correlation analysis to confirm monotonic relationships 2. **Choose appropriate direction**: Set `increasing` parameter based on domain knowledge 3. **Balance bin size and count**: Larger `min_samples_per_bin` gives more stable bins 4. **Consider target distribution**: Normalize targets if they have extreme ranges 5. **Validate on holdout data**: Check that monotonic relationship holds on test data See Also -------- * :class:`TreeBinning` - Decision tree-based supervised binning * :class:`Chi2Binning` - Chi-square statistic-based supervised binning * :class:`EqualFrequencyBinning` - Quantile-based unsupervised binning * :class:`KMeansBinning` - K-means clustering-based binning Overview -------- ``IsotonicBinning`` creates bins using isotonic regression to find optimal cut points that preserve monotonic relationships between features and targets. The transformer fits an isotonic (non-decreasing) function to the data and identifies significant changes in this function to determine bin boundaries. This approach is particularly effective when: * **Monotonic relationships exist** between features and targets * **Ordinal consistency** is important in your binning * **Regulatory requirements** mandate monotonic scoring models * **Risk scoring** applications where higher values should consistently indicate higher risk Key Features ------------ * **Monotonicity Preservation**: Ensures bins respect monotonic ordering relationships * **Regression-Based**: Uses isotonic regression for optimal cut point identification * **Automatic Boundary Detection**: Identifies significant changes in fitted isotonic function * **Flexible Direction**: Supports both increasing and decreasing monotonicity * **Sample Control**: Ensures minimum samples per bin for statistical reliability * **Sklearn Compatibility**: Full transformer interface with fit/transform methods * **DataFrame Support**: Preserves pandas/polars column names and structure Basic Usage ----------- .. code-block:: python import numpy as np import pandas as pd from binlearn.methods import IsotonicBinning # Create data with monotonic relationship np.random.seed(42) X = np.random.rand(1000, 2) # Create target with monotonic relationship to first feature y = 2 * X[:, 0] + 0.5 * np.random.randn(1000) # Apply isotonic binning binner = IsotonicBinning( max_bins=8, min_samples_per_bin=50, increasing=True ) # Method 1: Using fit with X and y (sklearn style) binner.fit(X, y) X_binned = binner.transform(X) print(f"Original shape: {X.shape}") print(f"Binned shape: {X_binned.shape}") print(f"Bins for feature 0: {len(binner.bin_edges_[0]) - 1}") DataFrame Example with Target Column ------------------------------------ .. code-block:: python # Create DataFrame with monotonic relationships df = pd.DataFrame({ 'age': np.random.uniform(18, 80, 1000), 'income': np.random.uniform(20000, 150000, 1000), 'experience': np.random.uniform(0, 40, 1000) }) # Create target with monotonic relationships df['default_risk'] = ( 0.01 * df['age'] + -0.00001 * df['income'] + -0.005 * df['experience'] + 0.2 * np.random.randn(1000) ) # Method 2: Using guidance_columns (binlearn style) binner = IsotonicBinning( guidance_columns=['default_risk'], max_bins=6, min_samples_per_bin=100, preserve_dataframe=True ) df_binned = binner.fit_transform(df) print(f"Bin edges for age: {binner.bin_edges_['age']}") print(f"Bin edges for income: {binner.bin_edges_['income']}") Decreasing Monotonicity Example ------------------------------- .. code-block:: python # Example where higher feature values should lead to lower target values X_credit = np.random.uniform(0, 100, 500).reshape(-1, 1) # Credit score y_default = 1 / (1 + np.exp(0.1 * (X_credit.flatten() - 50))) # Lower default prob for higher scores # Use decreasing monotonicity binner = IsotonicBinning( max_bins=5, min_samples_per_bin=50, increasing=False, # Higher credit score = lower default probability min_change_threshold=0.05 ) binner.fit(X_credit, y_default) X_credit_binned = binner.transform(X_credit) # Verify monotonicity: bin representatives should decrease print("Bin representatives:", binner.bin_representatives_[0]) print("Monotonically decreasing:", all(binner.bin_representatives_[0][i] >= binner.bin_representatives_[0][i+1] for i in range(len(binner.bin_representatives_[0])-1))) Advanced Configuration ---------------------- .. code-block:: python # Fine-tuned isotonic binning for different scenarios # High-precision binning (more sensitive to changes) precise_binner = IsotonicBinning( max_bins=12, min_samples_per_bin=30, min_change_threshold=0.005, # More sensitive to changes increasing=True, y_min=0.0, # Explicit bounds y_max=1.0 ) # Robust binning (less sensitive, larger bins) robust_binner = IsotonicBinning( max_bins=6, min_samples_per_bin=100, # Larger bins for stability min_change_threshold=0.1, # Less sensitive to changes increasing=True ) Classification Example ---------------------- .. code-block:: python from sklearn.datasets import make_classification from sklearn.preprocessing import LabelEncoder # Create classification data X_class, y_class = make_classification( n_samples=1000, n_features=3, n_classes=3, n_redundant=0, random_state=42 ) # Isotonic binning works with classification by treating classes ordinally binner = IsotonicBinning( max_bins=7, min_samples_per_bin=50, increasing=True ) binner.fit(X_class, y_class) X_class_binned = binner.transform(X_class) print(f"Classification bins: {[len(edges)-1 for edges in binner.bin_edges_.values()]}") Risk Scoring Pipeline --------------------- .. code-block:: python from sklearn.pipeline import Pipeline from sklearn.linear_model import LogisticRegression from sklearn.model_selection import train_test_split from sklearn.metrics import roc_auc_score # Create a risk scoring pipeline with monotonic binning risk_pipeline = Pipeline([ ('isotonic_binning', IsotonicBinning( max_bins=8, min_samples_per_bin=50, increasing=True )), ('logistic_regression', LogisticRegression(random_state=42)) ]) # Use in credit risk modeling X_train, X_test, y_train, y_test = train_test_split( X, y > np.median(y), test_size=0.2, random_state=42 ) risk_pipeline.fit(X_train, y_train) y_proba = risk_pipeline.predict_proba(X_test)[:, 1] auc_score = roc_auc_score(y_test, y_proba) print(f"AUC with isotonic binning: {auc_score:.3f}") Parameter Guide --------------- **max_bins** (int, default=10) Maximum number of bins to create per feature: * Higher values: More granular risk segments * Lower values: Broader, more stable risk categories * Consider regulatory requirements and model interpretability **min_samples_per_bin** (int, default=5) Minimum samples required per bin for statistical reliability: * Higher values: More stable bins, better statistical power * Lower values: More granular binning, potential instability * Rule of thumb: At least 30-50 for reliable estimates **increasing** (bool, default=True) Direction of monotonicity to enforce: * True: Higher feature values → higher target values * False: Higher feature values → lower target values * Choose based on domain knowledge and expected relationships **min_change_threshold** (float, default=0.01) Minimum relative change in fitted values to create new bin: * Lower values: More sensitive, more bins * Higher values: Less sensitive, fewer bins * Typical range: 0.005 (sensitive) to 0.1 (robust) **y_min, y_max** (float, optional) Bounds for target values in isotonic regression: * Explicit bounds can improve numerical stability * Useful for probability targets: y_min=0.0, y_max=1.0 * Auto-detected from data if not specified Monotonicity Validation ----------------------- .. code-block:: python # Function to validate monotonicity in binned results def validate_monotonicity(binner, feature_idx=0, increasing=True): """Validate that bin representatives follow monotonic order.""" reps = binner.bin_representatives_[feature_idx] if increasing: is_monotonic = all(reps[i] <= reps[i+1] for i in range(len(reps)-1)) direction = "increasing" else: is_monotonic = all(reps[i] >= reps[i+1] for i in range(len(reps)-1)) direction = "decreasing" print(f"Monotonicity ({direction}): {is_monotonic}") print(f"Representatives: {reps}") return is_monotonic # Validate our binning results validate_monotonicity(binner, feature_idx=0, increasing=True) Tips for Best Results --------------------- 1. **Verify monotonic relationships exist** in your data before applying 2. **Choose appropriate min_samples_per_bin** based on your sample size 3. **Adjust min_change_threshold** based on noise level in your data 4. **Consider regulatory constraints** for financial/medical applications 5. **Validate results** on holdout data to avoid overfitting Common Use Cases ---------------- * **Credit Risk Scoring**: Age, income, debt-to-income ratio * **Medical Risk Assessment**: Biomarkers, age, symptom severity * **Marketing Response**: Customer value, engagement metrics * **Predictive Maintenance**: Usage hours, temperature readings * **Quality Control**: Process parameters, environmental conditions See Also -------- * :class:`TreeBinning` - Decision tree-based supervised binning * :class:`Chi2Binning` - Chi-square statistic-based supervised binning * :class:`EqualFrequencyBinning` - Quantile-based unsupervised binning * :class:`KMeansBinning` - K-means clustering-based binning