ManualIntervalBinning ==================== .. currentmodule:: binlearn.methods .. autoclass:: ManualIntervalBinning :members: :inherited-members: :show-inheritance: Overview -------- ``ManualIntervalBinning`` creates bins using explicitly provided bin edges, giving users complete control over binning boundaries. Unlike automatic binning methods, this transformer never infers bin edges from data - they must always be provided by the user. This approach is ideal for: * **Standardized binning** across multiple datasets * **Domain-specific binning** requirements with business rules * **Reproducible binning** with known boundaries * **Integration** with external binning specifications * **Regulatory compliance** where specific bins are mandated Key Features ------------ * **Complete Control**: User defines all bin boundaries explicitly * **Consistency**: Same bins across different datasets and time periods * **Validation**: Comprehensive validation of user-provided bin edges * **Auto-Representatives**: Automatic generation of bin center representatives * **Flexible Keys**: Supports both column names and indices as keys * **Out-of-Range Handling**: Configurable clipping for values outside bin ranges * **Sklearn Compatibility**: Full transformer interface with fit/transform methods * **DataFrame Support**: Preserves pandas/polars column names and structure Basic Usage ----------- .. code-block:: python import numpy as np import pandas as pd from binlearn.methods import ManualIntervalBinning # Create sample data np.random.seed(42) X = np.random.uniform(0, 100, 200).reshape(-1, 2) # Define custom bin edges for each feature custom_edges = { 0: [0, 20, 40, 60, 80, 100], # Feature 0: quintiles 1: [0, 25, 50, 75, 100] # Feature 1: quartiles } # Apply manual binning binner = ManualIntervalBinning(bin_edges=custom_edges) X_binned = binner.fit_transform(X) print(f"Original shape: {X.shape}") print(f"Binned shape: {X_binned.shape}") print(f"Bin edges for feature 0: {binner.bin_edges_[0]}") print(f"Bin edges for feature 1: {binner.bin_edges_[1]}") print(f"Representatives for feature 0: {binner.bin_representatives_[0]}") DataFrame Example with Named Columns ------------------------------------ .. code-block:: python # Create DataFrame with named columns df = pd.DataFrame({ 'age': np.random.uniform(18, 80, 1000), 'income': np.random.uniform(20000, 200000, 1000), 'credit_score': np.random.uniform(300, 850, 1000) }) # Define business-relevant bin edges business_edges = { 'age': [18, 25, 35, 50, 65, 80], # Life stages 'income': [0, 30000, 60000, 100000, 200000], # Income brackets 'credit_score': [300, 580, 670, 740, 850] # Credit categories } # Optional: Define custom representatives representatives = { 'age': [21, 30, 42, 57, 72], # Midpoint ages 'income': [15000, 45000, 80000, 150000], # Representative incomes 'credit_score': [440, 625, 705, 795] # Representative scores } binner = ManualIntervalBinning( bin_edges=business_edges, bin_representatives=representatives, preserve_dataframe=True, clip=True # Clip outliers to bin boundaries ) df_binned = binner.fit_transform(df) print("Age bins:") for i, (start, end) in enumerate(zip(business_edges['age'][:-1], business_edges['age'][1:])): count = ((df['age'] >= start) & (df['age'] < end)).sum() print(f" Bin {i}: [{start}, {end}) - {count} samples") Financial Risk Example ---------------------- .. code-block:: python # Financial data with regulatory-defined risk categories financial_df = pd.DataFrame({ 'debt_to_income': np.random.uniform(0, 1.5, 5000), 'loan_to_value': np.random.uniform(0.3, 1.2, 5000), 'fico_score': np.random.uniform(300, 850, 5000) }) # Regulatory risk categories (example) risk_edges = { 'debt_to_income': [0, 0.28, 0.36, 0.43, 1.5], # DTI risk categories 'loan_to_value': [0, 0.8, 0.9, 0.95, 1.2], # LTV risk categories 'fico_score': [300, 580, 620, 680, 740, 850] # Credit score tiers } # Risk level names as representatives risk_representatives = { 'debt_to_income': ['Low', 'Moderate', 'High', 'Very High'], 'loan_to_value': ['Conservative', 'Standard', 'Aggressive', 'High Risk'], 'fico_score': ['Poor', 'Fair', 'Good', 'Very Good', 'Excellent'] } risk_binner = ManualIntervalBinning( bin_edges=risk_edges, bin_representatives=risk_representatives, preserve_dataframe=True, clip=True ) financial_binned = risk_binner.fit_transform(financial_df) # Show distribution across risk categories for col in ['debt_to_income', 'loan_to_value', 'fico_score']: print(f"\\n{col.replace('_', ' ').title()} Distribution:") for i, rep in enumerate(risk_representatives[col]): mask = financial_binned[col] == i count = mask.sum() percentage = count / len(financial_df) * 100 print(f" {rep}: {count} ({percentage:.1f}%)") Medical/Clinical Example ------------------------ .. code-block:: python # Medical data with clinical thresholds medical_df = pd.DataFrame({ 'bmi': np.random.normal(25, 5, 2000), 'blood_pressure_systolic': np.random.normal(120, 20, 2000), 'cholesterol': np.random.normal(200, 40, 2000), 'age': np.random.uniform(18, 90, 2000) }) # Clinical classification thresholds clinical_edges = { 'bmi': [0, 18.5, 25, 30, 40], # BMI categories 'blood_pressure_systolic': [0, 120, 130, 140, 180, 300], # BP stages 'cholesterol': [0, 200, 240, 300], # Cholesterol levels 'age': [0, 18, 40, 65, 100] # Age groups } clinical_labels = { 'bmi': ['Underweight', 'Normal', 'Overweight', 'Obese'], 'blood_pressure_systolic': ['Normal', 'Elevated', 'Stage 1', 'Stage 2', 'Crisis'], 'cholesterol': ['Desirable', 'Borderline', 'High'], 'age': ['Child', 'Adult', 'Middle Age', 'Senior'] } clinical_binner = ManualIntervalBinning( bin_edges=clinical_edges, bin_representatives=clinical_labels, preserve_dataframe=True, clip=True ) medical_binned = clinical_binner.fit_transform(medical_df) # Clinical summary print("Clinical Distribution Summary:") for condition in ['bmi', 'blood_pressure_systolic', 'cholesterol']: print(f"\\n{condition.replace('_', ' ').title()}:") for i, label in enumerate(clinical_labels[condition]): count = (medical_binned[condition] == i).sum() print(f" {label}: {count} patients ({count/len(medical_df)*100:.1f}%)") Cross-Dataset Consistency ------------------------- .. code-block:: python # Ensure consistent binning across training and test sets # Training data train_data = pd.DataFrame({ 'feature1': np.random.normal(50, 15, 1000), 'feature2': np.random.exponential(2, 1000) }) # Test data (different distribution) test_data = pd.DataFrame({ 'feature1': np.random.normal(45, 20, 500), # Different mean/std 'feature2': np.random.exponential(3, 500) # Different scale }) # Fixed bin edges ensure consistency standard_edges = { 'feature1': [0, 25, 40, 55, 70, 100], 'feature2': [0, 1, 3, 6, 10, 20] } binner = ManualIntervalBinning( bin_edges=standard_edges, preserve_dataframe=True, clip=True ) # Same binning applied to both datasets train_binned = binner.fit_transform(train_data) test_binned = binner.transform(test_data) # No fitting needed print("Training data distribution:") print(train_binned['feature1'].value_counts().sort_index()) print("\\nTest data distribution:") print(test_binned['feature1'].value_counts().sort_index()) Advanced Bin Edge Validation ----------------------------- .. code-block:: python # Example of comprehensive bin edge validation def validate_custom_edges(edges_dict, data_ranges): \"\"\"Validate that bin edges cover expected data ranges.\"\"\" for col, edges in edges_dict.items(): if col in data_ranges: data_min, data_max = data_ranges[col] edge_min, edge_max = min(edges), max(edges) if edge_min > data_min: print(f"Warning: {col} edges start at {edge_min}, data starts at {data_min}") if edge_max < data_max: print(f"Warning: {col} edges end at {edge_max}, data ends at {data_max}") # Check for reasonable bin sizes bin_widths = np.diff(edges) if max(bin_widths) / min(bin_widths) > 10: print(f"Warning: {col} has very uneven bin sizes") # Usage example data_ranges = { 'age': (df['age'].min(), df['age'].max()), 'income': (df['income'].min(), df['income'].max()) } validate_custom_edges(business_edges, data_ranges) Scikit-learn Pipeline Integration --------------------------------- .. code-block:: python from sklearn.pipeline import Pipeline from sklearn.ensemble import RandomForestClassifier from sklearn.model_selection import train_test_split from sklearn.datasets import make_classification # Create classification data X, y = make_classification(n_samples=2000, n_features=3, n_classes=2, random_state=42) # Define standardized bin edges for each feature pipeline_edges = { 0: [-3, -1, 0, 1, 3], 1: [-3, -1.5, 0, 1.5, 3], 2: [-3, -1, 1, 3] } # Create pipeline with manual binning pipeline = Pipeline([ ('binning', ManualIntervalBinning( bin_edges=pipeline_edges, clip=True )), ('classifier', RandomForestClassifier(random_state=42)) ]) # Train and evaluate X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) pipeline.fit(X_train, y_train) accuracy = pipeline.score(X_test, y_test) print(f"Pipeline accuracy with manual binning: {accuracy:.3f}") Parameter Guide --------------- **bin_edges** (dict, required) Dictionary mapping column identifiers to bin edge lists: * Keys: Column names (str) or indices (int) * Values: Sorted lists/arrays of bin boundaries * Must have at least 2 elements per list * Will create len(edges)-1 bins **bin_representatives** (dict, optional) Dictionary mapping columns to bin representative values: * Keys: Must match bin_edges keys * Values: Lists with len(edges)-1 elements * If None, uses bin centers as representatives * Can be numeric values or category names **clip** (bool, optional) Whether to clip out-of-range values: * True: Clip to nearest bin edge * False: Assign special out-of-range indicators * None: Use global configuration default Edge Case Handling ------------------- .. code-block:: python # Handling data outside bin ranges # Data with outliers outlier_data = pd.DataFrame({ 'normal_feature': np.concatenate([ np.random.normal(50, 10, 900), # Normal data [5, 95, 120, -10] # Outliers ]) }) normal_edges = {'normal_feature': [20, 40, 60, 80]} # With clipping clipper = ManualIntervalBinning(bin_edges=normal_edges, clip=True) clipped_result = clipper.fit_transform(outlier_data) # Without clipping (outliers get special values) no_clipper = ManualIntervalBinning(bin_edges=normal_edges, clip=False) unclipped_result = no_clipper.fit_transform(outlier_data) print("With clipping - unique values:", np.unique(clipped_result)) print("Without clipping - unique values:", np.unique(unclipped_result)) Tips for Best Results --------------------- 1. **Validate edge coverage**: Ensure edges cover your expected data range 2. **Consider domain knowledge**: Use meaningful boundaries from your field 3. **Check bin balance**: Avoid bins that are too small or too large 4. **Plan for outliers**: Decide on clipping strategy early 5. **Document edge rationale**: Keep records of why specific edges were chosen 6. **Test across datasets**: Validate that edges work across different data samples Common Use Cases ---------------- * **Age Groups**: [18, 25, 35, 50, 65, 80] for life stage analysis * **Income Brackets**: [0, 25000, 50000, 100000, 200000] for economic segments * **Test Scores**: [0, 60, 70, 80, 90, 100] for grade boundaries * **Medical Thresholds**: Disease-specific clinical cutoffs * **Risk Categories**: Regulatory or business-defined risk levels See Also -------- * :class:`ManualFlexibleBinning` - Manual binning with mixed interval and singleton bins * :class:`EqualWidthBinning` - Automatic equal-width interval binning * :class:`EqualFrequencyBinning` - Automatic quantile-based binning * :class:`TreeBinning` - Automatic supervised binning with decision trees