ManualFlexibleBinning

class binlearn.methods.ManualFlexibleBinning(bin_spec: dict[Any, list[Any]], bin_representatives: dict[Any, list[float]] | None = None, preserve_dataframe: bool | None = None, *, class_: str | None = None, module_: str | None = None)

Bases: FlexibleBinningBase

Manual flexible binning implementation for user-defined mixed bin types.

This class provides complete control over flexible binning by allowing users to specify bin definitions that can include both singleton bins (exact value matching) and interval bins (range matching) within the same feature. This flexibility makes it ideal for domain-specific binning requirements, handling special values, and creating custom discretization schemes.

Manual flexible binning is particularly useful for:

  • Mixed data types requiring both exact and range-based binning

  • Handling special values (outliers, missing indicators) as singleton bins

  • Domain-specific requirements with irregular bin boundaries

  • Creating bins that combine categorical-like values with continuous ranges

Key Features:

  • Support for mixed bin types within the same feature

  • Singleton bins for exact value matching

  • Interval bins for range-based matching

  • No data-dependent bin calculation: the provided specifications are used exactly

  • Automatic generation of representatives if not provided

  • Integration with binlearn’s format preservation features

Algorithm:

  1. Validate and store the user-provided flexible bin specifications.

  2. Generate default representatives if none are provided:

    • For singleton bins: use the singleton value itself

    • For interval bins: use the interval midpoint

  3. During transformation, match values against the bin definitions:

    • Check singleton bins for exact matches

    • Check interval bins for range membership

    • Return the index of the first matching bin
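The steps above can be sketched in plain Python. This is an illustrative sketch, not binlearn's implementation: the helper names is_interval, default_representatives and match_bin are hypothetical, intervals are assumed to be closed on both ends (as the parameter description states), and -1 stands in for the MISSING_VALUE index mentioned in the notes.

```python
from numbers import Real

MISSING_VALUE = -1  # index reported for unmatched values (assumed from the notes)


def is_interval(bin_def):
    """A 2-tuple is treated as an interval bin; anything else is a singleton."""
    return isinstance(bin_def, tuple) and len(bin_def) == 2


def default_representatives(bins):
    """Singleton -> the value itself; interval -> midpoint (start + end) / 2."""
    return [(b[0] + b[1]) / 2 if is_interval(b) else b for b in bins]


def match_bin(value, bins):
    """Return the index of the first bin that matches value, else MISSING_VALUE."""
    for index, bin_def in enumerate(bins):
        if is_interval(bin_def):
            start, end = bin_def
            # Assumption: closed interval [start, end]; only real numbers can match.
            numeric = isinstance(value, Real) and not isinstance(value, bool)
            if numeric and start <= value <= end:
                return index
        elif value == bin_def:
            return index
    return MISSING_VALUE


bins = [0, (1, 10), (10, 100), 999]
print(default_representatives(bins))                  # [0, 5.5, 55.0, 999]
print([match_bin(v, bins) for v in [0, 5, 999, -3]])  # [0, 1, 3, -1]
```

Because the loop returns on the first hit, the order of the list decides which bin wins when several definitions could match the same value.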

Parameters:
  • bin_spec – Required dictionary mapping column identifiers to lists of flexible bin definitions. Each bin definition can be either:

    • Scalar value: singleton bin matching exactly that value

    • Tuple (start, end): interval bin matching values in [start, end]

    For example: {0: [42, (10, 20), 'special'], 'age': [(0, 18), (18, 65), (65, 100)]}

  • bin_representatives – Optional dictionary mapping column identifiers to lists of representative values for each bin. If not provided, representatives are automatically generated.

bin_spec_

Dictionary containing the provided flexible bin specifications

bin_representatives_

Dictionary containing bin representatives (provided or auto-generated)

Example

>>> import numpy as np
>>> from binlearn.methods import ManualFlexibleBinning
>>>
>>> # Define mixed bin types for different features
>>> bin_spec = {
...     'numeric_feature': [
...         0,              # Singleton: exactly zero
...         (1, 10),        # Interval: 1 to 10
...         (10, 100),      # Interval: 10 to 100
...         999             # Singleton: exactly 999 (outlier)
...     ],
...     'mixed_feature': [
...         'special',      # Singleton: exactly 'special'
...         (0, 50),        # Interval: 0 to 50
...         (50, 100)       # Interval: 50 to 100
...     ]
... }
>>>
>>> # Create binner with flexible specifications
>>> binner = ManualFlexibleBinning(bin_spec=bin_spec)
>>>
>>> # Sample data with mixed types
>>> X = np.array([[0, 25], [5, 75], [999, 'special']], dtype=object)
>>> X_binned = binner.fit_transform(X)
>>> # Results: [[0, 1], [1, 2], [3, 0]]
>>>
>>> # With custom representatives
>>> bin_reps = {
...     'numeric_feature': [0, 5.5, 55, 999],    # Custom representatives
...     'mixed_feature': ['special', 25, 75]      # Mixed type representatives
... }
>>> binner_custom = ManualFlexibleBinning(
...     bin_spec=bin_spec,
...     bin_representatives=bin_reps
... )

Note

  • bin_spec is required and cannot be None

  • fit() method is essentially a no-op since specifications are predefined

  • Values are matched against bins in order - first match wins

  • Singleton bins support any hashable type (numeric, string, etc.)

  • Interval bins only work with numeric values

  • Unmatched values receive MISSING_VALUE (-1) bin index
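To make the "first match wins" rule concrete, the sketch below uses a hypothetical first_match helper (not part of binlearn) on numeric values only and assumes that both interval ends are inclusive. Under that assumption a boundary value such as 18 belongs to both (0, 18) and (18, 65), and the bin listed first receives it. Whether the installed library treats the upper bound as inclusive is worth verifying before relying on this.

```python
def first_match(value, bins, missing=-1):
    """Index of the first matching bin; `missing` when nothing matches."""
    for index, bin_def in enumerate(bins):
        if isinstance(bin_def, tuple):
            start, end = bin_def
            if start <= value <= end:  # assumption: both ends inclusive
                return index
        elif value == bin_def:
            return index
    return missing


age_bins = [(0, 18), (18, 65), (65, 100)]
print([first_match(a, age_bins) for a in (17, 18, 65, 101)])  # [0, 0, 1, -1]

# Listing a singleton before an interval that contains it gives it priority.
print(first_match(65, [65, (0, 18), (18, 65), (65, 100)]))    # 0
```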

__init__(bin_spec: dict[Any, list[Any]], bin_representatives: dict[Any, list[float]] | None = None, preserve_dataframe: bool | None = None, *, class_: str | None = None, module_: str | None = None)

Initialize manual flexible binning with user-defined bin specifications.

Sets up manual flexible binning with explicitly provided bin definitions that can include both singleton and interval bins. This method requires complete bin specification upfront and integrates with binlearn’s configuration system for other parameters.

Parameters:
  • bin_spec – Required dictionary mapping column identifiers to lists of flexible bin definitions. Mixed types are allowed within the same feature. Each bin definition can be either:

    • Scalar value (any type): singleton bin matching exactly that value

    • Tuple (start, end): interval bin matching numeric values in [start, end]

    For example: {0: [42, (10, 20), 'special'], 'col': [(0, 50), (50, 100)]}

  • bin_representatives – Optional dictionary mapping column identifiers to lists of representative values for each bin. If provided, it must have the same column keys as bin_spec and one representative per bin. If None, representatives are generated automatically:

    • For singleton bins: the singleton value itself

    • For interval bins: the interval midpoint (start + end) / 2

  • preserve_dataframe – Whether to preserve DataFrame format in outputs when input is a DataFrame. If None, uses global configuration default.

  • class_ – Class name for reconstruction compatibility (ignored during normal initialization).

  • module_ – Module name for reconstruction compatibility (ignored during normal initialization).

Raises:

ConfigurationError – If bin_spec is None or not provided, with helpful suggestions for proper usage including example formats.
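The consistency rule for bin_representatives (same column keys as bin_spec, one representative per bin) can be checked before constructing the binner. The helper below is a hypothetical pre-check written for illustration. It raises a plain ValueError, whereas the library reports problems through its own exception types.

```python
def check_representatives(bin_spec, bin_representatives):
    """Raise ValueError if keys differ or a column's counts do not line up."""
    if set(bin_spec) != set(bin_representatives):
        raise ValueError("bin_representatives must use the same column keys as bin_spec")
    for column, bins in bin_spec.items():
        reps = bin_representatives[column]
        if len(reps) != len(bins):
            raise ValueError(
                f"column {column!r}: {len(bins)} bins but {len(reps)} representatives"
            )


bin_spec = {"feature1": [0, (1, 10), (10, 100), 999]}
check_representatives(bin_spec, {"feature1": [0, 5.5, 55, 999]})  # passes silently

try:
    check_representatives(bin_spec, {"feature1": [0, 5.5]})
except ValueError as exc:
    print(exc)  # column 'feature1': 4 bins but 2 representatives
```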

Example

>>> # Basic flexible binning with auto-generated representatives
>>> bin_spec = {
...     'feature1': [0, (1, 10), (10, 100), 999],     # Mixed types
...     'feature2': [(0, 25), 'special', (50, 100)]   # Mixed types
... }
>>> binner = ManualFlexibleBinning(bin_spec=bin_spec)
>>>
>>> # With custom representatives
>>> bin_reps = {
...     'feature1': [0, 5.5, 55, 999],      # Custom values
...     'feature2': [12.5, 'special', 75]   # Mixed representatives
... }
>>> binner_custom = ManualFlexibleBinning(
...     bin_spec=bin_spec,
...     bin_representatives=bin_reps
... )
>>>
>>> # Single feature with intervals only
>>> simple_spec = {'price': [(0, 100), (100, 500), (500, float('inf'))]}
>>> binner_simple = ManualFlexibleBinning(bin_spec=simple_spec)

Note

  • bin_spec is the only required parameter and cannot be None

  • Validation of bin_spec format occurs during initialization

  • The fit() method will be essentially a no-op since specs are predefined

  • Each column can have different numbers and types of bins

  • Singleton bins can be any hashable type (numbers, strings, etc.)

  • Interval bins must have numeric start and end values
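As a rough illustration of the format rules in these notes, the following hypothetical check_bin_spec helper accepts any non-tuple value as a singleton and requires interval bins to be (start, end) pairs of real numbers. The start <= end requirement and the use of ValueError are assumptions of this sketch, not documented library behavior.

```python
from numbers import Real


def check_bin_spec(bin_spec):
    """Raise ValueError on malformed interval definitions; singletons pass through."""
    for column, bins in bin_spec.items():
        for position, bin_def in enumerate(bins):
            if not isinstance(bin_def, tuple):
                continue  # singleton: any scalar is accepted in this sketch
            where = f"column {column!r}, bin {position}"
            if len(bin_def) != 2:
                raise ValueError(f"{where}: interval must be a (start, end) pair")
            start, end = bin_def
            if not all(isinstance(x, Real) for x in (start, end)):
                raise ValueError(f"{where}: interval bounds must be numeric")
            if start > end:
                raise ValueError(f"{where}: start must not exceed end")


# An open upper range expressed with float('inf') passes the check.
check_bin_spec({"price": [(0, 100), (100, 500), (500, float("inf"))]})
```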

fit(X: Any, y: Any | None = None, **fit_params: Any) ManualFlexibleBinning

Fit the Manual Flexible binning (no-op since bin specs are pre-defined).

For manual binning, the object is already fitted during initialization. This method only performs validation.

Parameters:
  • X – Input data (used only for validation)

  • y – Target values (ignored for manual binning)

  • **fit_params – Additional fit parameters (ignored)

Returns:

Self (fitted transformer)

classmethod __init_subclass__(**kwargs)

Set the set_{method}_request methods.

This uses PEP-487 to set the set_{method}_request methods. It looks for the information available in the set default values which are set using __metadata_request__* class attributes, or inferred from method signatures.

The __metadata_request__* class attributes are used when a method does not explicitly accept a metadata through its arguments or if the developer would like to specify a request value for those metadata which are different from the default None.

static check_data_quality(data: ndarray[Any, Any], name: str = 'data') None

Check data quality and issue warnings if needed.

property feature_names_in_: list[str] | None

Get feature names.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

get_input_columns() list[Any] | None

Get input columns for data preparation.

This method should be overridden by derived classes to provide appropriate column information without exposing binning-specific concepts.

Returns:

Column information or None if not available

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep: bool = True) dict[str, Any]

Get parameters for this estimator, including fitted parameters.

This method extends sklearn’s standard get_params to include fitted parameters when the estimator is fitted, enabling complete object reconstruction through the get_params/set_params interface. This is essential for pipeline persistence and model serialization.

Parameters:

deep – If True, returns parameters for sub-estimators (not applicable here but maintained for sklearn compatibility).

Returns:

Dictionary of parameter names mapped to their values, including:

  • Constructor parameters extracted from __init__ signature

  • Fitted parameters (if estimator is fitted) mapped from attributes

  • Class metadata (class_, module_) for automatic reconstruction

Example

>>> binner = EqualWidthBinning(n_bins=5)
>>> params = binner.get_params()
>>> print(params)
{'n_bins': 5, 'clip': None, ..., 'class_': 'EqualWidthBinning', 'module_': '...'}
>>>
>>> binner.fit(X)
>>> fitted_params = binner.get_params()
>>> # Now includes: {'bin_edges': {...}, 'bin_representatives': {...}, ...}

Note

  • Automatically extracts constructor parameters from __init__ signature

  • Includes fitted parameters only when estimator is fitted

  • Adds class metadata for reconstruction workflows

  • Excludes internal sklearn attributes like n_features_in_

  • class_ and module_ parameters are handled specially during set_params

inverse_transform(X: Any) Any

Inverse transform from bin indices back to representative values.

Converts discrete bin indices back to their representative values, effectively reversing the binning transformation. This is useful for interpreting results or reconstructing approximate original values.

Parameters:

X – Input data containing bin indices to inverse transform. Should contain only binning columns (no guidance columns). Can be:

  • pandas.DataFrame: Column names should match binning columns

  • polars.DataFrame: Column names should match binning columns

  • numpy.ndarray: Must have same number of binning columns

  • array-like: Converted to numpy array

Returns:

Inverse transformed data where bin indices are replaced with their representative values (typically bin centers). Output format matches the preserve_dataframe setting.

Raises:
  • RuntimeError – If the transformer has not been fitted yet.

  • ValueError – If input data has wrong number of columns or invalid format.

  • BinningError – If inverse transformation fails.

Example

>>> # After fitting and transforming
>>> X_binned = [[0, 1], [1, 0], [2, 2]]  # Bin indices
>>> X_reconstructed = binner.inverse_transform(X_binned)
>>> print(X_reconstructed)
[[0.5, 1.5], [1.5, 0.5], [2.5, 2.5]]  # Representative values

Note

For guided binning (when guidance_columns is specified), the input should only contain the binning columns, not the guidance columns. The number of input columns must match the number of binning columns.
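Conceptually, the inverse transform is a per-column lookup from bin index to representative. The sketch below shows that lookup on plain lists with a hypothetical indices_to_representatives helper. Returning None for the missing index -1 is an assumption made here, since this page does not state what the library returns for unmatched values.

```python
def indices_to_representatives(rows, representatives, missing=-1):
    """Replace each bin index with its column's representative value."""
    result = []
    for row in rows:
        result.append([
            None if index == missing else reps[index]
            for index, reps in zip(row, representatives)
        ])
    return result


# One list of representatives per column, in column order.
representatives = [[0, 5.5, 55.0, 999], ["special", 25.0, 75.0]]
binned = [[0, 1], [1, 2], [3, 0], [-1, 1]]
print(indices_to_representatives(binned, representatives))
# [[0, 25.0], [5.5, 75.0], [999, 'special'], [None, 25.0]]
```

The reconstruction is approximate by design: every value that fell into an interval bin comes back as that bin's single representative.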

property n_features_in_: int

Get number of features.

set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None) –

Configure output of transform and fit_transform.

  • "default": Default output format of a transformer

  • "pandas": DataFrame output

  • "polars": Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: "polars" option was added.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_params(**params: Any) SklearnIntegrationBase

Set the parameters of this estimator.

This method supports reconstruction workflows by handling fitted parameters that come from get_params() output (without underscores) and setting them as fitted attributes (with underscores).

Parameters:

**params – Parameters to set. Can include:

  • Regular constructor parameters (n_bins, clip, etc.)

  • Fitted parameters from get_params (bin_edges, bin_representatives)

  • Class metadata (ignored during reconstruction)

Returns:

Returns the instance itself.

Return type:

self
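The underscore convention described for get_params and set_params can be illustrated with a toy class. ToyBinner below is entirely hypothetical and only mimics the naming rule: fitted attributes carry a trailing underscore, are exported without it, and are restored with it. It says nothing about how binlearn implements the round trip internally.

```python
class ToyBinner:
    """Hypothetical stand-in illustrating the underscore convention only."""

    _fitted_names = ("bin_spec", "bin_representatives")

    def __init__(self, preserve_dataframe=None):
        self.preserve_dataframe = preserve_dataframe

    def get_params(self):
        params = {"preserve_dataframe": self.preserve_dataframe}
        for name in self._fitted_names:
            if hasattr(self, name + "_"):
                params[name] = getattr(self, name + "_")  # exported without underscore
        return params

    def set_params(self, **params):
        for name, value in params.items():
            # Fitted parameters are stored back under the trailing-underscore name.
            target = name + "_" if name in self._fitted_names else name
            setattr(self, target, value)
        return self


fitted = ToyBinner()
fitted.bin_spec_ = {"x": [0, (1, 10)]}
fitted.bin_representatives_ = {"x": [0, 5.5]}

clone = ToyBinner().set_params(**fitted.get_params())
print(clone.bin_spec_)             # {'x': [0, (1, 10)]}
print(clone.bin_representatives_)  # {'x': [0, 5.5]}
```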

transform(X: Any) Any

Transform input data using fitted binning parameters.

Applies the fitted binning transformation to new data, converting continuous values to discrete bin indices or representatives. Handles column separation when guidance columns are present.

Parameters:

X – Input data to transform. Must have the same structure as the data used during fitting (same number of columns). Can be:

  • pandas.DataFrame: Column names should match training data

  • polars.DataFrame: Column names should match training data

  • numpy.ndarray: Must have same number of columns as training

  • array-like: Converted to numpy array

Returns:

Transformed data where continuous values are replaced with bin indices or representative values. The output format depends on:

  • preserve_dataframe setting: DataFrame vs array format

  • binning method: indices vs representatives

  • guidance_columns: only binning columns are transformed

Raises:
  • RuntimeError – If the transformer has not been fitted yet.

  • ValueError – If the input data has incompatible structure or format.

  • BinningError – If transformation fails due to data issues.

Example

>>> # After fitting
>>> X_new = pd.DataFrame({'feature1': [1.5, 2.5], 'feature2': [15, 25]})
>>> X_binned = binner.transform(X_new)
>>> print(X_binned)
[[0, 0], [1, 1]]  # Bin indices

Note

When guidance_columns is specified, only the binning columns are transformed. Guidance columns are filtered out from the output. The method preserves the original data format when preserve_dataframe=True.

static validate_array_like(data: Any, name: str = 'data', allow_none: bool = False) ndarray[Any, Any] | None

Validate and convert array-like input to numpy array.

This method provides robust validation and conversion of various input formats to numpy arrays, with comprehensive error handling and helpful suggestions for common issues.

Parameters:
  • data – Input data to validate and convert. Can be:

    • numpy.ndarray: Used directly

    • pandas.DataFrame/Series: Converted to numpy array

    • polars.DataFrame: Converted to numpy array

    • list, tuple: Converted to numpy array

    • None: Allowed only if allow_none=True

  • name – Name of the data parameter for error messages. Used to provide context in error messages (e.g., "X", "y", "guidance_data").

  • allow_none – Whether to allow None as a valid input. If True, None is returned unchanged; if False, None raises InvalidDataError.

Returns:

Validated numpy array, or None if data is None and allow_none=True. The returned array maintains the same data content but is guaranteed to be a numpy array.

Raises:

InvalidDataError – If validation fails:

  • data is None when allow_none=False

  • data cannot be converted to numpy array

  • Conversion process encounters errors

Example

>>> # Valid inputs
>>> arr = ValidationMixin.validate_array_like([1, 2, 3], "X")
>>> print(type(arr))
<class 'numpy.ndarray'>
>>>
>>> # Allow None
>>> result = ValidationMixin.validate_array_like(None, "y", allow_none=True)
>>> print(result)
None
>>>
>>> # Invalid input
>>> ValidationMixin.validate_array_like(None, "X", allow_none=False)
InvalidDataError: X cannot be None

Note

This method focuses on format validation and conversion. Content validation (like checking for NaN values) should be done separately using other validation methods.

static validate_column_specification(columns: Any, data_shape: tuple[int, ...]) list[Any]

Validate column specifications.

static validate_guidance_columns(guidance_cols: Any, binning_cols: list[Any], data_shape: tuple[int, ...]) list[Any]

Validate guidance column specifications.

Overview

ManualFlexibleBinning creates bins using explicitly provided bin specifications that can include both singleton bins (exact numeric value matches) and interval bins (numeric range matches). This transformer offers maximum flexibility for complex binning scenarios that combine exact value matching with traditional interval binning.

This approach is ideal for:

  • Numeric data requiring both exact and range matching

  • Complex domain-specific numeric binning rules

  • Outlier handling with specific value bins

  • Standardized binning with both singleton and continuous elements

  • Integration with external flexible binning specifications

Key Features

  • Mixed Bin Types: Combines singleton (exact value) and interval (range) bins

  • Complete Control: User defines all bin specifications explicitly

  • Numeric Focus: Designed specifically for numeric data and values

  • Flexible Matching: Supports exact matches and range-based matching

  • Auto-Representatives: Automatic generation of appropriate representatives

  • Comprehensive Validation: Thorough validation of bin specifications

  • Sklearn Compatibility: Full transformer interface with fit/transform methods

  • DataFrame Support: Preserves pandas/polars column names and structure

Basic Usage

import numpy as np
import pandas as pd
from binlearn.methods import ManualFlexibleBinning

# Create sample numeric data
np.random.seed(42)
data = pd.DataFrame({
    'score': [95, 85, 75, 65, 45, 25, 85, 95, 12, 88],
    'age': [22, 35, 45, 67, 28, 19, 65, 72, 16, 41]
})

# Define flexible bin specifications
flexible_specs = {
    'score': [
        95,           # Singleton bin for perfect scores
        85,           # Singleton bin for high achievers
        (60, 80),     # Interval bin for passing grades
        (0, 60)       # Interval bin for failing grades
    ],
    'age': [
        65,           # Retirement age (singleton, listed first so it takes precedence)
        (0, 18),      # Minors
        (18, 35),     # Young adults
        (35, 65),     # Middle-aged
        (65, 120)     # Seniors
    ]
}

# Apply flexible binning
binner = ManualFlexibleBinning(
    bin_spec=flexible_specs,
    preserve_dataframe=True
)

data_binned = binner.fit_transform(data)

print("Original data:")
print(data.head())
print("\nBinned data:")
print(data_binned.head())
print("\nBin specifications used:")
for col, specs in flexible_specs.items():
    print(f"  {col}: {specs}")

Grade Analysis Example

# Academic grading with special handling for specific scores
grades_df = pd.DataFrame({
    'midterm_score': np.random.choice([100, 95, 88, 76, 65, 42, 0], 500,
                                     p=[0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05]),
    'participation': np.random.uniform(0, 100, 500)
})

# Academic bin specifications
academic_specs = {
    'midterm_score': [
        100,          # Perfect score (singleton)
        0,            # Zero score (singleton)
        (90, 100),    # A grade range
        (80, 90),     # B grade range
        (70, 80),     # C grade range
        (60, 70),     # D grade range
        (0, 60)       # F grade range
    ],
    'participation': [
        100,          # Perfect participation (singleton)
        (80, 100),    # High participation
        (60, 80),     # Moderate participation
        (0, 60)       # Low participation
    ]
}

# Custom representatives for interpretability
academic_reps = {
    'midterm_score': ['Perfect', 'Zero', 'A', 'B', 'C', 'D', 'F'],
    'participation': ['Perfect', 'High', 'Moderate', 'Low']
}

academic_binner = ManualFlexibleBinning(
    bin_spec=academic_specs,
    bin_representatives=academic_reps,
    preserve_dataframe=True
)

grades_binned = academic_binner.fit_transform(grades_df)

# Analyze grade distribution
print("Midterm Score Distribution:")
for i, rep in enumerate(academic_reps['midterm_score']):
    count = (grades_binned['midterm_score'] == i).sum()
    percentage = count / len(grades_df) * 100
    print(f"  {rep}: {count} students ({percentage:.1f}%)")

Financial Risk Assessment

# Financial data with special handling for extreme values
financial_df = pd.DataFrame({
    'credit_score': np.random.choice([850, 300] + list(range(400, 800, 20)), 1000),
    'debt_ratio': np.random.exponential(0.3, 1000),
    'years_employment': np.random.choice([0] + list(range(1, 31)), 1000)
})

# Financial risk bin specifications
risk_specs = {
    'credit_score': [
        850,          # Perfect credit (singleton)
        300,          # Minimum credit (singleton)
        (740, 850),   # Excellent credit
        (670, 740),   # Good credit
        (580, 670),   # Fair credit
        (300, 580)    # Poor credit
    ],
    'debt_ratio': [
        0.0,          # No debt (singleton)
        (0, 0.28),    # Low debt
        (0.28, 0.36), # Moderate debt
        (0.36, 0.5),  # High debt
        (0.5, 2.0)    # Very high debt
    ],
    'years_employment': [
        0,            # Unemployed (singleton)
        (1, 2),       # New employee
        (2, 5),       # Junior employee
        (5, 10),      # Experienced
        (10, 30)      # Senior employee
    ]
}

risk_labels = {
    'credit_score': ['Perfect', 'Minimum', 'Excellent', 'Good', 'Fair', 'Poor'],
    'debt_ratio': ['No Debt', 'Low', 'Moderate', 'High', 'Very High'],
    'years_employment': ['Unemployed', 'New', 'Junior', 'Experienced', 'Senior']
}

risk_binner = ManualFlexibleBinning(
    bin_spec=risk_specs,
    bin_representatives=risk_labels,
    preserve_dataframe=True
)

financial_binned = risk_binner.fit_transform(financial_df)

# Risk profile analysis
print("Financial Risk Profile Distribution:")
for feature in ['credit_score', 'debt_ratio', 'years_employment']:
    print(f"\n{feature.replace('_', ' ').title()}:")
    for i, label in enumerate(risk_labels[feature]):
        count = (financial_binned[feature] == i).sum()
        print(f"  {label}: {count} ({count/len(financial_df)*100:.1f}%)")

Medical Diagnostic Example

# Medical data with critical values as singletons
medical_df = pd.DataFrame({
    'temperature': np.random.normal(98.6, 2, 800),
    'heart_rate': np.random.normal(70, 15, 800),
    'blood_sugar': np.random.lognormal(4.5, 0.3, 800)
})

# Add some extreme values
# .loc slices are label-based and inclusive, so :3 selects exactly four rows
medical_df.loc[:3, 'temperature'] = [105, 95, 106, 94]  # Critical temperatures
medical_df.loc[:3, 'heart_rate'] = [200, 40, 180, 35]   # Critical heart rates

# Medical bin specifications with critical values
medical_specs = {
    'temperature': [
        105,          # High fever (singleton)
        95,           # Hypothermia (singleton)
        (100.4, 105), # Fever
        (98, 100.4),  # Normal
        (95, 98),     # Low normal
        (90, 95)      # Hypothermic range
    ],
    'heart_rate': [
        200,          # Tachycardia crisis (singleton)
        40,           # Bradycardia crisis (singleton)
        (100, 200),   # Tachycardia
        (60, 100),    # Normal
        (40, 60),     # Bradycardia
        (20, 40)      # Severe bradycardia
    ],
    'blood_sugar': [
        (70, 100),    # Normal
        (100, 126),   # Pre-diabetic
        (126, 300),   # Diabetic
        (0, 70),      # Hypoglycemic
        (300, 500)    # Severe hyperglycemic
    ]
}

medical_labels = {
    'temperature': ['High Fever', 'Hypothermia', 'Fever', 'Normal', 'Low Normal', 'Hypothermic'],
    'heart_rate': ['Tachy Crisis', 'Brady Crisis', 'Tachycardia', 'Normal', 'Bradycardia', 'Severe Brady'],
    'blood_sugar': ['Normal', 'Pre-diabetic', 'Diabetic', 'Hypoglycemic', 'Severe Hyperglycemic']
}

medical_binner = ManualFlexibleBinning(
    bin_spec=medical_specs,
    bin_representatives=medical_labels,
    preserve_dataframe=True
)

medical_binned = medical_binner.fit_transform(medical_df)

Quality Control Example

# Manufacturing quality control with specification limits
qc_df = pd.DataFrame({
    'diameter': np.random.normal(10.0, 0.5, 1000),      # Target: 10.0mm
    'hardness': np.random.normal(50, 5, 1000),          # Target: 50 HRC
    'weight': np.random.normal(100, 3, 1000)            # Target: 100g
})

# Add some out-of-spec values
qc_df.loc[:3, 'diameter'] = [12.5, 7.5, 11.0, 9.0]   # Out of tolerance (four rows, labels 0-3)

# Quality control specifications
qc_specs = {
    'diameter': [
        12.5,         # Upper specification limit (singleton)
        7.5,          # Lower specification limit (singleton)
        (9.8, 10.2),  # Within tolerance
        (9.5, 9.8),   # Low acceptable
        (10.2, 10.5), # High acceptable
        (7.5, 9.5),   # Low reject
        (10.5, 12.5)  # High reject
    ],
    'hardness': [
        (45, 55),     # Target range
        (40, 45),     # Low acceptable
        (55, 60),     # High acceptable
        (0, 40),      # Low reject
        (60, 100)     # High reject
    ],
    'weight': [
        (98, 102),    # Target range
        (95, 98),     # Light
        (102, 105),   # Heavy
        (0, 95),      # Too light
        (105, 200)    # Too heavy
    ]
}

qc_labels = {
    'diameter': ['Upper Limit', 'Lower Limit', 'Target', 'Low OK', 'High OK', 'Low Reject', 'High Reject'],
    'hardness': ['Target', 'Low OK', 'High OK', 'Low Reject', 'High Reject'],
    'weight': ['Target', 'Light', 'Heavy', 'Too Light', 'Too Heavy']
}

qc_binner = ManualFlexibleBinning(
    bin_spec=qc_specs,
    bin_representatives=qc_labels,
    preserve_dataframe=True
)

qc_binned = qc_binner.fit_transform(qc_df)

# Quality analysis
print("Quality Control Analysis:")
for feature in ['diameter', 'hardness', 'weight']:
    print(f"\n{feature.title()}:")
    for i, label in enumerate(qc_labels[feature]):
        count = (qc_binned[feature] == i).sum()
        print(f"  {label}: {count} units ({count/len(qc_df)*100:.1f}%)")

Bin Specification Guide

# Examples of different bin specification formats

specification_examples = {
    # Example 1: Mixed singleton and interval bins
    'feature1': [
        42,           # Singleton: exact match for value 42
        (0, 25),      # Interval: values in range [0, 25)
        (25, 50),     # Interval: values in range [25, 50)
        100           # Singleton: exact match for value 100
    ],

    # Example 2: Mostly intervals with key singletons
    'feature2': [
        0,            # Singleton: zero values
        (0, 10),      # Interval: low values
        (10, 90),     # Interval: normal range
        (90, 100),    # Interval: high values
        100           # Singleton: maximum values
    ],

    # Example 3: Mostly singletons (discrete-like)
    'feature3': [
        1, 2, 3, 4, 5,     # Individual values
        (6, 10),           # Range for higher values
        (10, float('inf')) # Open upper range
    ]
}

# Demonstration of matching behavior
test_data = pd.DataFrame({
    'feature1': [42, 15, 35, 100, 75],
    'feature2': [0, 5, 45, 95, 100],
    'feature3': [1, 3, 7, 15, 25]
})

demo_binner = ManualFlexibleBinning(
    bin_spec=specification_examples,
    preserve_dataframe=True
)

result = demo_binner.fit_transform(test_data)

print("Bin matching demonstration:")
for col in test_data.columns:
    print(f"\n{col}:")
    print(f"  Original: {test_data[col].tolist()}")
    print(f"  Binned:   {result[col].tolist()}")
    print(f"  Specs:    {specification_examples[col]}")

Scikit-learn Pipeline Integration

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Create sample data
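# n_informative + n_redundant must not exceed n_features (the defaults sum to 4)
X, y = make_classification(n_samples=1000, n_features=3, n_informative=2,
                           n_redundant=1, n_classes=2, random_state=42)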

# Define flexible binning for each feature
pipeline_specs = {
    0: [
        -2.5,         # Extreme low (singleton)
        (-2, -1),     # Low range
        (-1, 1),      # Medium range
        (1, 2),       # High range
        2.5           # Extreme high (singleton)
    ],
    1: [
        (-3, -1),     # Low
        (-1, 1),      # Medium
        (1, 3),       # High
        3.5           # Extreme (singleton)
    ],
    2: [
        -2.0,         # Extreme low (singleton)
        (-1.5, 0),    # Low-medium
        (0, 1.5),     # Medium-high
        2.0           # Extreme high (singleton)
    ]
}

# Create pipeline
pipeline = Pipeline([
    ('binning', ManualFlexibleBinning(bin_spec=pipeline_specs)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Train and evaluate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)

print(f"Pipeline accuracy with flexible binning: {accuracy:.3f}")

Parameter Guide

bin_spec (dict, required)

Dictionary mapping column identifiers to flexible bin specification lists:

  • Keys: Column names (str) or indices (int)

  • Values: Lists containing:

    • Singleton bins: Numeric values for exact matches

    • Interval bins: Tuples (min, max) for range matches

  • Order matters: earlier specifications take precedence

  • No overlap validation (user responsibility)
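Since overlap checking is left to the user, a small pre-check can flag interval bins that overlap before the specification is handed to the binner. The find_overlaps helper below is a hypothetical sketch. It ignores singletons and deliberately does not report intervals that only touch at an endpoint, such as (0, 10) and (10, 90), because that pattern is used throughout the examples on this page.

```python
from itertools import combinations


def find_overlaps(bins):
    """Return index pairs of interval bins that share more than a boundary point."""
    intervals = [(i, b) for i, b in enumerate(bins) if isinstance(b, tuple)]
    overlaps = []
    for (i, (s1, e1)), (j, (s2, e2)) in combinations(intervals, 2):
        if max(s1, s2) < min(e1, e2):  # strict: touching endpoints are not reported
            overlaps.append((i, j))
    return overlaps


print(find_overlaps([0, (0, 10), (10, 90), (90, 100), 100]))  # []
print(find_overlaps([(0, 60), (50, 80), (80, 100)]))          # [(0, 1)]
```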

bin_representatives (dict, optional)

Dictionary mapping columns to bin representative values:

  • Keys: Must match bin_spec keys

  • Values: Lists with same length as corresponding bin_spec

  • Can be numeric values or category names/labels

  • If None, auto-generates appropriate representatives

Tips for Best Results

  1. Order specifications carefully: Earlier bins take precedence in matching

  2. Avoid overlapping intervals: Can lead to ambiguous matches

  3. Use singletons for critical values: Exact matches for important thresholds

  4. Consider floating point precision: Use appropriate precision for your data

  5. Test with representative data: Validate that all expected values match correctly

  6. Document specification logic: Keep records of binning rationale
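The floating point tip matters most for singleton bins. Assuming singletons are compared by exact equality (this page does not mention any tolerance), a value produced by arithmetic can miss a singleton that looks identical when printed with few decimals. The snippet uses only the Python standard library.

```python
import math

computed = 0.1 + 0.2          # 0.30000000000000004 in binary floating point
singleton = 0.3

print(computed == singleton)              # False: an exact-match bin would miss it
print(math.isclose(computed, singleton))  # True

# One option: round the data and the singleton values to the same number of
# decimals before binning, so exact comparison behaves as intended.
print(round(computed, 6) == round(singleton, 6))  # True
```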

Common Patterns

  • Outlier Isolation: Use singletons for extreme values, intervals for normal ranges

  • Threshold Systems: Combine critical value singletons with range intervals

  • Quality Control: Specification limits as singletons, tolerance ranges as intervals

  • Grade Systems: Perfect scores as singletons, grade ranges as intervals

  • Medical Diagnostics: Critical values as singletons, normal ranges as intervals

See Also