Overview

binlearn is a Python library for data binning and discretization, designed to integrate with common data science workflows. This section gives an overview of the library’s architecture, design principles, and key concepts.

Design Principles

Modern Python Standards

binlearn is built with modern Python best practices:

  • Type Safety: 100% mypy compliance with comprehensive type annotations

  • Code Quality: 100% ruff compliance following modern Python standards

  • Error Handling: Comprehensive validation with helpful error messages

  • Documentation: Extensive docstrings and examples for all components

Framework Integration

The library is designed to work seamlessly with popular data science frameworks:

  • Scikit-learn: Full compatibility with sklearn pipelines and transformers

  • Pandas: Native DataFrame support with column name preservation

  • Polars: High-performance columnar data support (optional)

  • NumPy: Efficient numerical array processing

Flexibility and Extensibility

binlearn provides flexible binning approaches:

  • Multiple Methods: Eight different binning algorithms

  • Configurable Behavior: Global and per-instance configuration

  • Custom Binning: Manual specification of bin boundaries and definitions

  • Guided Binning: Use auxiliary data to inform binning decisions

Architecture Overview

Base Classes

The library is built on a hierarchy of base classes:

GeneralBinningBase
├── IntervalBinningBase
│   ├── EqualWidthBinning
│   ├── EqualFrequencyBinning
│   ├── KMeansBinning
│   ├── EqualWidthMinimumWeightBinning
│   └── ManualIntervalBinning
├── FlexibleBinningBase
│   └── ManualFlexibleBinning
├── SupervisedBinningBase
│   └── SupervisedBinning
└── SingletonBinning

GeneralBinningBase

The root base class providing core functionality like sklearn compatibility, configuration management, and data validation.

IntervalBinningBase

For methods that create interval-based bins (e.g., [0, 5), [5, 10)).

FlexibleBinningBase

For methods that support mixed interval and singleton bins.

SupervisedBinningBase

For methods that use guidance data (target variables) to inform binning.
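As a rough illustration of how guidance data can inform bin boundaries, the sketch below picks a single cut point on one feature by minimizing the weighted Gini impurity of the target on either side of the cut. This is only one possible approach, written in plain Python. The helpers `gini` and `best_cut` are hypothetical and not part of binlearn, and this section does not state which algorithm SupervisedBinning actually uses.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_cut(values, targets):
    """Return the cut point that minimizes weighted Gini impurity.

    Hypothetical helper for illustration; not a binlearn function.
    """
    pairs = sorted(zip(values, targets))
    best_score, best_point = float("inf"), None
    for i in range(1, len(pairs)):
        left_value, right_value = pairs[i - 1][0], pairs[i][0]
        if left_value == right_value:
            continue  # no boundary can separate identical values
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_score = score
            best_point = (left_value + right_value) / 2
    return best_point


ages = [22, 25, 31, 38, 45, 52, 60, 67]
churned = [1, 1, 1, 1, 0, 0, 0, 0]
cut = best_cut(ages, churned)
print(cut)
```

With this toy data the target flips between ages 38 and 45, so the cut lands at the midpoint, 41.5. An unsupervised method would have no reason to prefer that boundary.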

Key Concepts

Binning vs. Discretization

Although the two terms are often used interchangeably, binlearn distinguishes between them:

  • Binning: Converting continuous data into discrete intervals or categories

  • Discretization: The broader process of making continuous data discrete (includes binning)

Types of Bins

binlearn supports different types of bins:

Interval Bins

Continuous ranges like [0, 5), [5, 10), [10, 15]
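The interval notation above can be made concrete with a short sketch. It assumes every bin is half-open on the right except the last, which is closed so that the maximum edge still falls into a bin. `assign_interval` is a hypothetical helper written for illustration, not a binlearn function.

```python
from bisect import bisect_right


def assign_interval(value, edges):
    """Return the index of the bin containing value, or None if out of range.

    Bins are half-open [edges[i], edges[i + 1]) except the last, which is
    closed on the right so the maximum edge still belongs to a bin.
    """
    if value < edges[0] or value > edges[-1]:
        return None
    if value == edges[-1]:
        return len(edges) - 2
    return bisect_right(edges, value) - 1


edges = [0, 5, 10, 15]
indices = [assign_interval(v, edges) for v in [0, 4.9, 5, 10, 15, 20]]
print(indices)
```

Here 5 belongs to the second bin rather than the first, 15 stays in the last bin, and 20 falls outside every bin.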

Singleton Bins

Individual values, typically used for categorical data
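A minimal sketch of the idea, assuming each distinct value seen during fitting gets its own bin index and that unseen values map to a sentinel. The function names and the `-1` sentinel are assumptions made for this example, not binlearn's API.

```python
def fit_singletons(values):
    """Map each distinct value to its own bin index, in sorted order."""
    return {value: index for index, value in enumerate(sorted(set(values)))}


def transform_singletons(values, mapping, unseen=-1):
    """Replace each value by its bin index; unseen values get a sentinel."""
    return [mapping.get(value, unseen) for value in values]


region_codes = [30, 10, 20, 10, 30]
mapping = fit_singletons(region_codes)
binned = transform_singletons([10, 20, 30, 40], mapping)
print(mapping, binned)
```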

Flexible Bins

Mixed interval and singleton bins in the same feature
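One way to picture a flexible bin specification is as a list in which a scalar denotes a singleton bin and a `(low, high)` tuple denotes an interval bin. That layout and the helper `assign_flexible` are assumptions made for this sketch; the format binlearn expects may differ.

```python
def assign_flexible(value, bin_specs):
    """Return the index of the first bin spec matching value, else None.

    A spec is either a scalar (singleton bin) or a (low, high) tuple
    (interval bin, half-open on the right). This layout is an assumption.
    """
    for index, spec in enumerate(bin_specs):
        if isinstance(spec, tuple):
            low, high = spec
            if low <= value < high:
                return index
        elif value == spec:
            return index
    return None


# -1 and 0 act as special codes; everything else falls into a range.
specs = [-1, 0, (1, 50), (50, 100)]
result = [assign_flexible(v, specs) for v in [-1, 0, 25, 50, 100]]
print(result)
```

This kind of mix is useful when a numeric column carries a few special codes (such as -1 for "unknown") alongside ordinary measurements.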

Data Flow

The typical binlearn workflow:

  1. Configuration: Set global or instance-specific parameters

  2. Initialization: Create a binner with desired parameters

  3. Fitting: Learn bin boundaries from training data

  4. Transformation: Apply learned binning to new data

  5. Analysis: Examine bin edges, representatives, and distributions

# Example workflow
import numpy as np
from binlearn import EqualWidthBinning, set_config

# 1. Configuration (optional)
set_config(preserve_dataframe=True)

# 2. Initialization
binner = EqualWidthBinning(n_bins=5)

# 3. Fitting
X = np.random.rand(1000, 3)
binner.fit(X)

# 4. Transformation
X_binned = binner.transform(X)

# 5. Analysis
print(f"Bin edges: {binner.bin_edges_}")
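To show what the fitting step learns in the equal-width case, the sketch below computes evenly spaced edges between the minimum and maximum of one column. It is written in plain Python for illustration; `equal_width_edges` is a hypothetical helper, and binlearn's own implementation may handle ranges and edge cases differently.

```python
def equal_width_edges(values, n_bins):
    """Return n_bins + 1 evenly spaced edges from min(values) to max(values)."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    edges = [low + i * width for i in range(n_bins)]
    edges.append(high)  # use the exact maximum to avoid float drift
    return edges


edges = equal_width_edges([2.0, 4.0, 7.5, 12.0], n_bins=5)
print(edges)
```

Only the minimum and maximum of the data influence the result, which is why equal-width bins can end up nearly empty when the data are skewed.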

Core Features

Sklearn Compatibility

All binlearn transformers implement the sklearn transformer interface:

  • fit(X, y=None): Learn binning parameters from data

  • transform(X): Apply learned binning to data

  • fit_transform(X, y=None): Fit and transform in one step

  • get_params()/set_params(): Parameter management
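The method contract listed above can be sketched with a toy class in plain Python. `ToyEqualWidthBinner` is a hypothetical stand-in, not a binlearn class, and it assumes non-constant numeric columns. In real sklearn-compatible code, `get_params`/`set_params` and `fit_transform` are normally inherited from `sklearn.base.BaseEstimator` and `sklearn.base.TransformerMixin` rather than written by hand.

```python
class ToyEqualWidthBinner:
    """Hypothetical stand-in showing the transformer method contract."""

    def __init__(self, n_bins=5):
        self.n_bins = n_bins

    def fit(self, X, y=None):
        # Learn one (low, width) pair per column; y is accepted but unused.
        columns = list(zip(*X))
        self.lows_ = [min(col) for col in columns]
        self.widths_ = [(max(col) - min(col)) / self.n_bins for col in columns]
        return self  # returning self allows method chaining

    def transform(self, X):
        # Map each value to a bin index, clamping the maximum into the last bin.
        return [
            [
                min(int((value - low) / width), self.n_bins - 1)
                for value, low, width in zip(row, self.lows_, self.widths_)
            ]
            for row in X
        ]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

    def get_params(self, deep=True):
        return {"n_bins": self.n_bins}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self


X = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
binned = ToyEqualWidthBinner(n_bins=2).fit_transform(X)
print(binned)
```

The trailing underscore on `lows_` and `widths_` follows the sklearn convention for attributes learned during `fit`.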

Configuration System

Global configuration allows consistent behavior across all binners:

from binlearn import get_config, set_config

# View current configuration
config = get_config()

# Set global defaults
set_config(
    preserve_dataframe=True,
    clip=True,
    fit_jointly=False
)
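How global and per-instance settings might interact can be sketched as follows. This is a generic pattern, not binlearn's implementation: the helper names `get_defaults`, `set_defaults`, and `resolve_option` are hypothetical, and the default values shown are assumptions.

```python
# Assumed defaults for illustration; binlearn's real defaults may differ.
_GLOBAL_DEFAULTS = {"preserve_dataframe": False, "clip": True, "fit_jointly": False}


def get_defaults():
    """Return a copy so callers cannot mutate the global state by accident."""
    return dict(_GLOBAL_DEFAULTS)


def set_defaults(**options):
    """Update global defaults, rejecting unknown option names."""
    unknown = set(options) - set(_GLOBAL_DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown configuration option(s): {sorted(unknown)}")
    _GLOBAL_DEFAULTS.update(options)


def resolve_option(name, instance_value=None):
    """Prefer an explicit per-instance value, else fall back to the global one."""
    return get_defaults()[name] if instance_value is None else instance_value


set_defaults(preserve_dataframe=True)
from_global = resolve_option("preserve_dataframe")
from_instance = resolve_option("clip", instance_value=False)
print(from_global, from_instance)
```

Using `None` to mean "not set on this instance" lets an explicit `False` on a single binner override a global `True`.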

Error Handling

Comprehensive error handling with helpful messages:

  • ConfigurationError: Invalid parameters or settings

  • ValidationError: Data validation failures

  • FittingError: Issues during the fitting process

  • TransformationError: Problems during transformation
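The sketch below shows how distinct exception types with actionable messages can be raised and handled. The classes defined here are local stand-ins that only borrow the names listed above. The shared base class `BinningError`, the helper functions, and the message wording are all assumptions, and this section does not show where the real exceptions are imported from.

```python
class BinningError(Exception):
    """Hypothetical common base class (an assumption, not confirmed)."""


class ConfigurationError(BinningError):
    """Stand-in: invalid parameters or settings."""


class ValidationError(BinningError):
    """Stand-in: data validation failures."""


def check_n_bins(n_bins):
    """Validate a bin count and explain what a valid value looks like."""
    if isinstance(n_bins, bool) or not isinstance(n_bins, int) or n_bins < 1:
        raise ConfigurationError(
            f"n_bins must be a positive integer, got {n_bins!r}."
        )
    return n_bins


def check_not_empty(rows):
    """Reject empty input before any fitting work is attempted."""
    if len(rows) == 0:
        raise ValidationError("Input has no rows; pass at least one sample.")
    return rows


messages = []
for check, argument in [(check_n_bins, 0), (check_not_empty, [])]:
    try:
        check(argument)
    except BinningError as exc:  # one handler covers both stand-in types
        messages.append(f"{type(exc).__name__}: {exc}")
print(messages)
```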

Type System

Rich type annotations for a better development experience:

from binlearn.utils.types import (
    ArrayLike,           # Input data types
    BinEdgesDict,        # Bin edges per column
    ColumnList,          # List of column identifiers
    FlexibleBinSpec      # Flexible bin specifications
)

Data Handling

Input Formats

binlearn accepts various input formats:

  • NumPy arrays: np.ndarray of any numeric dtype

  • Pandas DataFrames: With automatic column name preservation

  • Polars DataFrames: High-performance columnar data (optional)

  • Scipy sparse matrices: For memory-efficient sparse data

Output Formats

Output format depends on configuration and input:

  • preserve_dataframe=True: Returns same format as input when possible

  • preserve_dataframe=False: Always returns NumPy array

  • Column names and indices are preserved for DataFrame inputs

Missing Values

binlearn handles missing values gracefully:

  • NaN values: Preserved in output (assigned a special bin value)

  • Detection: Automatic detection of various missing value representations

  • Validation: Warns about excessive missing values
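A minimal sketch of the "special bin value" idea, in plain Python. The sentinel `-1` and the helper `bin_with_missing` are assumptions made for this example; the value binlearn actually assigns is not stated in this section.

```python
import math

MISSING_BIN = -1  # assumed sentinel; binlearn's actual value may differ


def bin_with_missing(values, edges):
    """Bin values by interior edges, mapping NaN and None to MISSING_BIN.

    Values outside the edge range are clamped into the first or last bin.
    """
    result = []
    for value in values:
        if value is None or math.isnan(value):
            result.append(MISSING_BIN)
            continue
        # Count how many interior edges the value has reached or passed.
        result.append(sum(value >= edge for edge in edges[1:-1]))
    return result


binned = bin_with_missing([1.0, float("nan"), 7.5, None, 12.0], [0, 5, 10, 15])
print(binned)
```

Keeping missing entries in a dedicated bin preserves the row count, so the binned output still lines up with the original data.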

Performance Considerations

Memory Efficiency

  • In-place operations: Used where possible to reduce memory usage

  • Sparse matrix support: For high-dimensional sparse data

  • Chunked processing: For datasets larger than memory (planned)

Computational Efficiency

  • Vectorized operations: Using NumPy for fast computation

  • Optimized algorithms: Efficient implementations of binning methods

  • Caching: Intermediate results cached when beneficial
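To illustrate what vectorized bin assignment looks like, the sketch below maps a whole array to bin indices in one NumPy call instead of looping in Python. It uses `np.digitize` and `np.searchsorted`, which are standard NumPy functions; whether binlearn uses these particular calls internally is not stated here.

```python
import numpy as np

edges = np.array([0.0, 5.0, 10.0, 15.0])
values = np.array([0.0, 4.9, 5.0, 9.99, 10.0, 15.0])

# Pass interior edges only, so values at the top edge stay in the last bin.
interior = edges[1:-1]
indices = np.digitize(values, interior)

# Equivalent formulation using a binary search over the interior edges.
indices_alt = np.searchsorted(interior, values, side="right")

print(indices.tolist())
```

Both calls run a binary search per element in compiled code, which is typically much faster than a Python-level loop on large arrays.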

Integration Patterns

Machine Learning Pipelines

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from binlearn import EqualWidthBinning

pipeline = Pipeline([
    ('binning', EqualWidthBinning(n_bins=5)),
    ('classifier', RandomForestClassifier())
])

Feature Engineering

from sklearn.compose import ColumnTransformer
from binlearn import EqualWidthBinning, SingletonBinning

preprocessor = ColumnTransformer([
    ('numeric', EqualWidthBinning(n_bins=5), ['age', 'income']),
    ('discrete', SingletonBinning(), ['category_id', 'region_code'])
])

Cross-Validation

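Because binlearn transformers follow the sklearn interface, they can be placed inside a pipeline and cross-validated. The pipeline is refit on each training fold, so bin edges are never learned from the held-out data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from binlearn import EqualWidthBinning

# Synthetic data for illustration only
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

pipeline = Pipeline([
    ('binning', EqualWidthBinning(n_bins=5)),
    ('classifier', RandomForestClassifier())
])

# Returns one score per fold
scores = cross_val_score(pipeline, X, y, cv=5)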

Next Steps

Now that you understand the overall architecture and design, explore:

  • binning_methods: Detailed guide to all available binning methods

  • data_types: Working with different data formats

  • configuration: Advanced configuration options

  • best_practices: Tips for effective binning strategies