Overview
binlearn is a comprehensive Python library for data binning and discretization, designed to integrate seamlessly with modern data science workflows. This section provides an overview of the library’s architecture, design principles, and key concepts.
Design Principles
Modern Python Standards
binlearn is built with modern Python best practices:
Type Safety: 100% mypy compliance with comprehensive type annotations
Code Quality: 100% ruff compliance following modern Python standards
Error Handling: Comprehensive validation with helpful error messages
Documentation: Extensive docstrings and examples for all components
Framework Integration
The library is designed to work seamlessly with popular data science frameworks:
Scikit-learn: Full compatibility with sklearn pipelines and transformers
Pandas: Native DataFrame support with column name preservation
Polars: High-performance columnar data support (optional)
NumPy: Efficient numerical array processing
Flexibility and Extensibility
binlearn provides flexible binning approaches:
Multiple Methods: Eight different binning algorithms
Configurable Behavior: Global and per-instance configuration
Custom Binning: Manual specification of bin boundaries and definitions
Guided Binning: Use auxiliary data to inform binning decisions
Architecture Overview
Base Classes
The library is built on a hierarchy of base classes:
GeneralBinningBase
├── IntervalBinningBase
│ ├── EqualWidthBinning
│ ├── EqualFrequencyBinning
│ ├── KMeansBinning
│ ├── EqualWidthMinimumWeightBinning
│ └── ManualIntervalBinning
├── FlexibleBinningBase
│ └── ManualFlexibleBinning
├── SupervisedBinningBase
│ └── SupervisedBinning
└── SingletonBinning
- GeneralBinningBase
The root base class providing core functionality like sklearn compatibility, configuration management, and data validation.
- IntervalBinningBase
For methods that create interval-based bins (e.g., [0, 5), [5, 10)).
- FlexibleBinningBase
For methods that support mixed interval and singleton bins.
- SupervisedBinningBase
For methods that use guidance data (target variables) to inform binning.
Key Concepts
Binning vs. Discretization
While often used interchangeably, binlearn distinguishes between:
Binning: Converting continuous data into discrete intervals or categories
Discretization: The broader process of making continuous data discrete (includes binning)
Types of Bins
binlearn supports different types of bins:
- Interval Bins
Continuous ranges like [0, 5), [5, 10), [10, 15]
- Singleton Bins
Individual values, typically used for categorical data
- Flexible Bins
Mixed interval and singleton bins in the same feature
Data Flow
The typical binlearn workflow:
Configuration: Set global or instance-specific parameters
Initialization: Create a binner with desired parameters
Fitting: Learn bin boundaries from training data
Transformation: Apply learned binning to new data
Analysis: Examine bin edges, representatives, and distributions
# Example workflow
from binlearn import EqualWidthBinning
import numpy as np
# 1. Configuration (optional)
from binlearn import set_config
set_config(preserve_dataframe=True)
# 2. Initialization
binner = EqualWidthBinning(n_bins=5)
# 3. Fitting
X = np.random.rand(1000, 3)
binner.fit(X)
# 4. Transformation
X_binned = binner.transform(X)
# 5. Analysis
print(f"Bin edges: {binner.bin_edges_}")
Core Features
Sklearn Compatibility
All binlearn transformers implement the sklearn transformer interface:
fit(X, y=None): Learn binning parameters from datatransform(X): Apply learned binning to datafit_transform(X, y=None): Fit and transform in one stepget_params()/set_params(): Parameter management
Configuration System
Global configuration allows consistent behavior across all binners:
from binlearn import get_config, set_config
# View current configuration
config = get_config()
# Set global defaults
set_config(
preserve_dataframe=True,
clip=True,
fit_jointly=False
)
Error Handling
Comprehensive error handling with helpful messages:
ConfigurationError: Invalid parameters or settings
ValidationError: Data validation failures
FittingError: Issues during the fitting process
TransformationError: Problems during transformation
Type System
Rich type annotations for better development experience:
from binlearn.utils.types import (
ArrayLike, # Input data types
BinEdgesDict, # Bin edges per column
ColumnList, # List of column identifiers
FlexibleBinSpec # Flexible bin specifications
)
Data Handling
Input Formats
binlearn accepts various input formats:
NumPy arrays:
np.ndarrayof any numeric dtypePandas DataFrames: With automatic column name preservation
Polars DataFrames: High-performance columnar data (optional)
Scipy sparse matrices: For memory-efficient sparse data
Output Formats
Output format depends on configuration and input:
preserve_dataframe=True: Returns same format as input when possible
preserve_dataframe=False: Always returns NumPy array
Column names and indices are preserved for DataFrame inputs
Missing Values
binlearn handles missing values gracefully:
NaN values: Preserved in output (assigned special bin value)
Detection: Automatic detection of various missing value representations
Validation: Warns about excessive missing values
Performance Considerations
Memory Efficiency
In-place operations: Where possible to reduce memory usage
Sparse matrix support: For high-dimensional sparse data
Chunked processing: For datasets larger than memory (planned)
Computational Efficiency
Vectorized operations: Using NumPy for fast computation
Optimized algorithms: Efficient implementations of binning methods
Caching: Intermediate results cached when beneficial
Integration Patterns
Machine Learning Pipelines
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from binlearn import EqualWidthBinning
pipeline = Pipeline([
('binning', EqualWidthBinning(n_bins=5)),
('classifier', RandomForestClassifier())
])
Feature Engineering
from sklearn.compose import ColumnTransformer
from binlearn import EqualWidthBinning, SingletonBinning
preprocessor = ColumnTransformer([
('numeric', EqualWidthBinning(n_bins=5), ['age', 'income']),
('discrete', SingletonBinning(), ['category_id', 'region_code'])
])
Cross-Validation
binlearn transformers work correctly with cross-validation:
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
('binning', EqualWidthBinning(n_bins=5)),
('classifier', RandomForestClassifier())
])
scores = cross_val_score(pipeline, X, y, cv=5)
Next Steps
Now that you understand the overall architecture and design, explore:
binning_methods: Detailed guide to all available binning methods
data_types: Working with different data formats
configuration: Advanced configuration options
best_practices: Tips for effective binning strategies