ManualFlexibleBinning ===================== .. currentmodule:: binlearn.methods .. autoclass:: ManualFlexibleBinning :members: :inherited-members: :show-inheritance: Overview -------- ``ManualFlexibleBinning`` creates bins using explicitly provided bin specifications that can include both singleton bins (exact numeric value matches) and interval bins (numeric range matches). This transformer offers maximum flexibility for complex binning scenarios that combine exact value matching with traditional interval binning. This approach is ideal for: * **Numeric data requiring both exact and range matching** * **Complex domain-specific numeric binning rules** * **Outlier handling** with specific value bins * **Standardized binning** with both singleton and continuous elements * **Integration** with external flexible binning specifications Key Features ------------ * **Mixed Bin Types**: Combines singleton (exact value) and interval (range) bins * **Complete Control**: User defines all bin specifications explicitly * **Numeric Focus**: Designed specifically for numeric data and values * **Flexible Matching**: Supports exact matches and range-based matching * **Auto-Representatives**: Automatic generation of appropriate representatives * **Comprehensive Validation**: Thorough validation of bin specifications * **Sklearn Compatibility**: Full transformer interface with fit/transform methods * **DataFrame Support**: Preserves pandas/polars column names and structure Basic Usage ----------- .. code-block:: python import numpy as np import pandas as pd from binlearn.methods import ManualFlexibleBinning # Create sample numeric data np.random.seed(42) data = pd.DataFrame({ 'score': [95, 85, 75, 65, 45, 25, 85, 95, 12, 88], 'age': [22, 35, 45, 67, 28, 19, 65, 72, 16, 41] }) # Define flexible bin specifications flexible_specs = { 'score': [ 95, # Singleton bin for perfect scores 85, # Singleton bin for high achievers (60, 80), # Interval bin for passing grades (0, 60) # Interval bin for failing grades ], 'age': [ (0, 18), # Minors (18, 35), # Young adults (35, 65), # Middle-aged 65 # Seniors (singleton for retirement age) ] } # Apply flexible binning binner = ManualFlexibleBinning( bin_spec=flexible_specs, preserve_dataframe=True ) data_binned = binner.fit_transform(data) print("Original data:") print(data.head()) print("\\nBinned data:") print(data_binned.head()) print("\\nBin specifications used:") for col, specs in flexible_specs.items(): print(f" {col}: {specs}") Grade Analysis Example ---------------------- .. code-block:: python # Academic grading with special handling for specific scores grades_df = pd.DataFrame({ 'midterm_score': np.random.choice([100, 95, 88, 76, 65, 42, 0], 500, p=[0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05]), 'participation': np.random.uniform(0, 100, 500) }) # Academic bin specifications academic_specs = { 'midterm_score': [ 100, # Perfect score (singleton) 0, # Zero score (singleton) (90, 100), # A grade range (80, 90), # B grade range (70, 80), # C grade range (60, 70), # D grade range (0, 60) # F grade range ], 'participation': [ 100, # Perfect participation (singleton) (80, 100), # High participation (60, 80), # Moderate participation (0, 60) # Low participation ] } # Custom representatives for interpretability academic_reps = { 'midterm_score': ['Perfect', 'Zero', 'A', 'B', 'C', 'D', 'F'], 'participation': ['Perfect', 'High', 'Moderate', 'Low'] } academic_binner = ManualFlexibleBinning( bin_spec=academic_specs, bin_representatives=academic_reps, preserve_dataframe=True ) grades_binned = academic_binner.fit_transform(grades_df) # Analyze grade distribution print("Midterm Score Distribution:") for i, rep in enumerate(academic_reps['midterm_score']): count = (grades_binned['midterm_score'] == i).sum() percentage = count / len(grades_df) * 100 print(f" {rep}: {count} students ({percentage:.1f}%)") Financial Risk Assessment ------------------------- .. code-block:: python # Financial data with special handling for extreme values financial_df = pd.DataFrame({ 'credit_score': np.random.choice([850, 300] + list(range(400, 800, 20)), 1000), 'debt_ratio': np.random.exponential(0.3, 1000), 'years_employment': np.random.choice([0] + list(range(1, 31)), 1000) }) # Financial risk bin specifications risk_specs = { 'credit_score': [ 850, # Perfect credit (singleton) 300, # Minimum credit (singleton) (740, 850), # Excellent credit (670, 740), # Good credit (580, 670), # Fair credit (300, 580) # Poor credit ], 'debt_ratio': [ 0.0, # No debt (singleton) (0, 0.28), # Low debt (0.28, 0.36), # Moderate debt (0.36, 0.5), # High debt (0.5, 2.0) # Very high debt ], 'years_employment': [ 0, # Unemployed (singleton) (1, 2), # New employee (2, 5), # Junior employee (5, 10), # Experienced (10, 30) # Senior employee ] } risk_labels = { 'credit_score': ['Perfect', 'Minimum', 'Excellent', 'Good', 'Fair', 'Poor'], 'debt_ratio': ['No Debt', 'Low', 'Moderate', 'High', 'Very High'], 'years_employment': ['Unemployed', 'New', 'Junior', 'Experienced', 'Senior'] } risk_binner = ManualFlexibleBinning( bin_spec=risk_specs, bin_representatives=risk_labels, preserve_dataframe=True ) financial_binned = risk_binner.fit_transform(financial_df) # Risk profile analysis print("Financial Risk Profile Distribution:") for feature in ['credit_score', 'debt_ratio', 'years_employment']: print(f"\\n{feature.replace('_', ' ').title()}:") for i, label in enumerate(risk_labels[feature]): count = (financial_binned[feature] == i).sum() print(f" {label}: {count} ({count/len(financial_df)*100:.1f}%)") Medical Diagnostic Example -------------------------- .. code-block:: python # Medical data with critical values as singletons medical_df = pd.DataFrame({ 'temperature': np.random.normal(98.6, 2, 800), 'heart_rate': np.random.normal(70, 15, 800), 'blood_sugar': np.random.lognormal(4.5, 0.3, 800) }) # Add some extreme values medical_df.loc[:10, 'temperature'] = [105, 95, 106, 94] # Critical temperatures medical_df.loc[:10, 'heart_rate'] = [200, 40, 180, 35] # Critical heart rates # Medical bin specifications with critical values medical_specs = { 'temperature': [ 105, # High fever (singleton) 95, # Hypothermia (singleton) (100.4, 105), # Fever (98, 100.4), # Normal (95, 98), # Low normal (90, 95) # Hypothermic range ], 'heart_rate': [ 200, # Tachycardia crisis (singleton) 40, # Bradycardia crisis (singleton) (100, 200), # Tachycardia (60, 100), # Normal (40, 60), # Bradycardia (20, 40) # Severe bradycardia ], 'blood_sugar': [ (70, 100), # Normal (100, 126), # Pre-diabetic (126, 300), # Diabetic (0, 70), # Hypoglycemic (300, 500) # Severe hyperglycemic ] } medical_labels = { 'temperature': ['High Fever', 'Hypothermia', 'Fever', 'Normal', 'Low Normal', 'Hypothermic'], 'heart_rate': ['Tachy Crisis', 'Brady Crisis', 'Tachycardia', 'Normal', 'Bradycardia', 'Severe Brady'], 'blood_sugar': ['Normal', 'Pre-diabetic', 'Diabetic', 'Hypoglycemic', 'Severe Hyperglycemic'] } medical_binner = ManualFlexibleBinning( bin_spec=medical_specs, bin_representatives=medical_labels, preserve_dataframe=True ) medical_binned = medical_binner.fit_transform(medical_df) Quality Control Example ----------------------- .. code-block:: python # Manufacturing quality control with specification limits qc_df = pd.DataFrame({ 'diameter': np.random.normal(10.0, 0.5, 1000), # Target: 10.0mm 'hardness': np.random.normal(50, 5, 1000), # Target: 50 HRC 'weight': np.random.normal(100, 3, 1000) # Target: 100g }) # Add some out-of-spec values qc_df.loc[:5, 'diameter'] = [12.5, 7.5, 11.0, 9.0] # Out of tolerance # Quality control specifications qc_specs = { 'diameter': [ 12.5, # Upper specification limit (singleton) 7.5, # Lower specification limit (singleton) (9.8, 10.2), # Within tolerance (9.5, 9.8), # Low acceptable (10.2, 10.5), # High acceptable (7.5, 9.5), # Low reject (10.5, 12.5) # High reject ], 'hardness': [ (45, 55), # Target range (40, 45), # Low acceptable (55, 60), # High acceptable (0, 40), # Low reject (60, 100) # High reject ], 'weight': [ (98, 102), # Target range (95, 98), # Light (102, 105), # Heavy (0, 95), # Too light (105, 200) # Too heavy ] } qc_labels = { 'diameter': ['Upper Limit', 'Lower Limit', 'Target', 'Low OK', 'High OK', 'Low Reject', 'High Reject'], 'hardness': ['Target', 'Low OK', 'High OK', 'Low Reject', 'High Reject'], 'weight': ['Target', 'Light', 'Heavy', 'Too Light', 'Too Heavy'] } qc_binner = ManualFlexibleBinning( bin_spec=qc_specs, bin_representatives=qc_labels, preserve_dataframe=True ) qc_binned = qc_binner.fit_transform(qc_df) # Quality analysis print("Quality Control Analysis:") for feature in ['diameter', 'hardness', 'weight']: print(f"\\n{feature.title()}:") for i, label in enumerate(qc_labels[feature]): count = (qc_binned[feature] == i).sum() print(f" {label}: {count} units ({count/len(qc_df)*100:.1f}%)") Bin Specification Guide ----------------------- .. code-block:: python # Examples of different bin specification formats specification_examples = { # Example 1: Mixed singleton and interval bins 'feature1': [ 42, # Singleton: exact match for value 42 (0, 25), # Interval: values in range [0, 25) (25, 50), # Interval: values in range [25, 50) 100 # Singleton: exact match for value 100 ], # Example 2: Mostly intervals with key singletons 'feature2': [ 0, # Singleton: zero values (0, 10), # Interval: low values (10, 90), # Interval: normal range (90, 100), # Interval: high values 100 # Singleton: maximum values ], # Example 3: Mostly singletons (discrete-like) 'feature3': [ 1, 2, 3, 4, 5, # Individual values (6, 10), # Range for higher values (10, float('inf')) # Open upper range ] } # Demonstration of matching behavior test_data = pd.DataFrame({ 'feature1': [42, 15, 35, 100, 75], 'feature2': [0, 5, 45, 95, 100], 'feature3': [1, 3, 7, 15, 25] }) demo_binner = ManualFlexibleBinning( bin_spec=specification_examples, preserve_dataframe=True ) result = demo_binner.fit_transform(test_data) print("Bin matching demonstration:") for col in test_data.columns: print(f"\\n{col}:") print(f" Original: {test_data[col].tolist()}") print(f" Binned: {result[col].tolist()}") print(f" Specs: {specification_examples[col]}") Scikit-learn Pipeline Integration --------------------------------- .. code-block:: python from sklearn.pipeline import Pipeline from sklearn.ensemble import RandomForestClassifier from sklearn.model_selection import train_test_split from sklearn.datasets import make_classification # Create sample data X, y = make_classification(n_samples=1000, n_features=3, n_classes=2, random_state=42) # Define flexible binning for each feature pipeline_specs = { 0: [ -2.5, # Extreme low (singleton) (-2, -1), # Low range (-1, 1), # Medium range (1, 2), # High range 2.5 # Extreme high (singleton) ], 1: [ (-3, -1), # Low (-1, 1), # Medium (1, 3), # High 3.5 # Extreme (singleton) ], 2: [ -2.0, # Extreme low (singleton) (-1.5, 0), # Low-medium (0, 1.5), # Medium-high 2.0 # Extreme high (singleton) ] } # Create pipeline pipeline = Pipeline([ ('binning', ManualFlexibleBinning(bin_spec=pipeline_specs)), ('classifier', RandomForestClassifier(random_state=42)) ]) # Train and evaluate X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) pipeline.fit(X_train, y_train) accuracy = pipeline.score(X_test, y_test) print(f"Pipeline accuracy with flexible binning: {accuracy:.3f}") Parameter Guide --------------- **bin_spec** (dict, required) Dictionary mapping column identifiers to flexible bin specification lists: * Keys: Column names (str) or indices (int) * Values: Lists containing: * **Singleton bins**: Numeric values for exact matches * **Interval bins**: Tuples (min, max) for range matches * Order matters: earlier specifications take precedence * No overlap validation (user responsibility) **bin_representatives** (dict, optional) Dictionary mapping columns to bin representative values: * Keys: Must match bin_spec keys * Values: Lists with same length as corresponding bin_spec * Can be numeric values or category names/labels * If None, auto-generates appropriate representatives Tips for Best Results --------------------- 1. **Order specifications carefully**: Earlier bins take precedence in matching 2. **Avoid overlapping intervals**: Can lead to ambiguous matches 3. **Use singletons for critical values**: Exact matches for important thresholds 4. **Consider floating point precision**: Use appropriate precision for your data 5. **Test with representative data**: Validate that all expected values match correctly 6. **Document specification logic**: Keep records of binning rationale Common Patterns --------------- * **Outlier Isolation**: Use singletons for extreme values, intervals for normal ranges * **Threshold Systems**: Combine critical value singletons with range intervals * **Quality Control**: Specification limits as singletons, tolerance ranges as intervals * **Grade Systems**: Perfect scores as singletons, grade ranges as intervals * **Medical Diagnostics**: Critical values as singletons, normal ranges as intervals See Also -------- * :class:`ManualIntervalBinning` - Manual binning with only interval bins * :class:`SingletonBinning` - Automatic binning with only singleton bins * :class:`EqualWidthBinning` - Automatic equal-width interval binning * :class:`TreeBinning` - Automatic supervised binning with decision trees