Overview
========

binlearn is a comprehensive Python library for data binning and discretization, designed to integrate seamlessly with modern data science workflows. This section provides an overview of the library's architecture, design principles, and key concepts.

Design Principles
-----------------

Modern Python Standards
~~~~~~~~~~~~~~~~~~~~~~~~

binlearn is built with modern Python best practices:

- **Type Safety**: 100% mypy compliance with comprehensive type annotations
- **Code Quality**: 100% ruff compliance following modern Python standards  
- **Error Handling**: Comprehensive validation with helpful error messages
- **Documentation**: Extensive docstrings and examples for all components

Framework Integration
~~~~~~~~~~~~~~~~~~~~~

The library is designed to work seamlessly with popular data science frameworks:

- **Scikit-learn**: Full compatibility with sklearn pipelines and transformers
- **Pandas**: Native DataFrame support with column name preservation
- **Polars**: High-performance columnar data support (optional)
- **NumPy**: Efficient numerical array processing

Flexibility and Extensibility
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

binlearn provides flexible binning approaches:

- **Multiple Methods**: Eight different binning algorithms
- **Configurable Behavior**: Global and per-instance configuration
- **Custom Binning**: Manual specification of bin boundaries and definitions
- **Guided Binning**: Use auxiliary data to inform binning decisions

Architecture Overview
---------------------

Base Classes
~~~~~~~~~~~~

The library is built on a hierarchy of base classes:

.. code-block:: text

   GeneralBinningBase
   ├── IntervalBinningBase
   │   ├── EqualWidthBinning
   │   ├── EqualFrequencyBinning
   │   ├── KMeansBinning
   │   ├── EqualWidthMinimumWeightBinning
   │   └── ManualIntervalBinning
   ├── FlexibleBinningBase
   │   └── ManualFlexibleBinning
   ├── SupervisedBinningBase
   │   └── SupervisedBinning
   └── SingletonBinning

**GeneralBinningBase**
   The root base class providing core functionality like sklearn compatibility, 
   configuration management, and data validation.

**IntervalBinningBase**
   For methods that create interval-based bins (e.g., [0, 5), [5, 10)).

**FlexibleBinningBase**  
   For methods that support mixed interval and singleton bins.

**SupervisedBinningBase**
   For methods that use guidance data (target variables) to inform binning.

Key Concepts
------------

Binning vs. Discretization
~~~~~~~~~~~~~~~~~~~~~~~~~~

While often used interchangeably, binlearn distinguishes between:

- **Binning**: Converting continuous data into discrete intervals or categories
- **Discretization**: The broader process of making continuous data discrete (includes binning)

Types of Bins
~~~~~~~~~~~~~

binlearn supports different types of bins:

**Interval Bins**
   Continuous ranges like [0, 5), [5, 10), [10, 15]

**Singleton Bins**  
   Individual values, typically used for categorical data

**Flexible Bins**
   Mixed interval and singleton bins in the same feature

Data Flow
~~~~~~~~~

The typical binlearn workflow:

1. **Configuration**: Set global or instance-specific parameters
2. **Initialization**: Create a binner with desired parameters  
3. **Fitting**: Learn bin boundaries from training data
4. **Transformation**: Apply learned binning to new data
5. **Analysis**: Examine bin edges, representatives, and distributions

.. code-block:: python

   # Example workflow
   from binlearn import EqualWidthBinning
   import numpy as np
   
   # 1. Configuration (optional)
   from binlearn import set_config
   set_config(preserve_dataframe=True)
   
   # 2. Initialization  
   binner = EqualWidthBinning(n_bins=5)
   
   # 3. Fitting
   X = np.random.rand(1000, 3)
   binner.fit(X)
   
   # 4. Transformation
   X_binned = binner.transform(X)
   
   # 5. Analysis
   print(f"Bin edges: {binner.bin_edges_}")

Core Features
-------------

Sklearn Compatibility
~~~~~~~~~~~~~~~~~~~~~~

All binlearn transformers implement the sklearn transformer interface:

- ``fit(X, y=None)``: Learn binning parameters from data
- ``transform(X)``: Apply learned binning to data  
- ``fit_transform(X, y=None)``: Fit and transform in one step
- ``get_params()``/``set_params()``: Parameter management

Configuration System
~~~~~~~~~~~~~~~~~~~~~

Global configuration allows consistent behavior across all binners:

.. code-block:: python

   from binlearn import get_config, set_config
   
   # View current configuration
   config = get_config()
   
   # Set global defaults
   set_config(
       preserve_dataframe=True,
       clip=True,
       fit_jointly=False
   )

Error Handling
~~~~~~~~~~~~~~

Comprehensive error handling with helpful messages:

- **ConfigurationError**: Invalid parameters or settings
- **ValidationError**: Data validation failures  
- **FittingError**: Issues during the fitting process
- **TransformationError**: Problems during transformation

Type System
~~~~~~~~~~~

Rich type annotations for better development experience:

.. code-block:: python

   from binlearn.utils.types import (
       ArrayLike,           # Input data types
       BinEdgesDict,        # Bin edges per column
       ColumnList,          # List of column identifiers
       FlexibleBinSpec      # Flexible bin specifications
   )

Data Handling
-------------

Input Formats
~~~~~~~~~~~~~

binlearn accepts various input formats:

- **NumPy arrays**: ``np.ndarray`` of any numeric dtype
- **Pandas DataFrames**: With automatic column name preservation  
- **Polars DataFrames**: High-performance columnar data (optional)
- **Scipy sparse matrices**: For memory-efficient sparse data

Output Formats
~~~~~~~~~~~~~~

Output format depends on configuration and input:

- **preserve_dataframe=True**: Returns same format as input when possible
- **preserve_dataframe=False**: Always returns NumPy array
- Column names and indices are preserved for DataFrame inputs

Missing Values
~~~~~~~~~~~~~~

binlearn handles missing values gracefully:

- **NaN values**: Preserved in output (assigned special bin value)
- **Detection**: Automatic detection of various missing value representations
- **Validation**: Warns about excessive missing values

Performance Considerations
--------------------------

Memory Efficiency
~~~~~~~~~~~~~~~~~

- **In-place operations**: Where possible to reduce memory usage
- **Sparse matrix support**: For high-dimensional sparse data
- **Chunked processing**: For datasets larger than memory (planned)

Computational Efficiency  
~~~~~~~~~~~~~~~~~~~~~~~~

- **Vectorized operations**: Using NumPy for fast computation
- **Optimized algorithms**: Efficient implementations of binning methods
- **Caching**: Intermediate results cached when beneficial

Integration Patterns
--------------------

Machine Learning Pipelines
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from sklearn.pipeline import Pipeline
   from sklearn.ensemble import RandomForestClassifier
   from binlearn import EqualWidthBinning
   
   pipeline = Pipeline([
       ('binning', EqualWidthBinning(n_bins=5)),
       ('classifier', RandomForestClassifier())
   ])

Feature Engineering
~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from sklearn.compose import ColumnTransformer
   from binlearn import EqualWidthBinning, SingletonBinning
   
   preprocessor = ColumnTransformer([
       ('numeric', EqualWidthBinning(n_bins=5), ['age', 'income']),
       ('discrete', SingletonBinning(), ['category_id', 'region_code'])
   ])

Cross-Validation
~~~~~~~~~~~~~~~~

binlearn transformers work correctly with cross-validation:

.. code-block:: python

   from sklearn.model_selection import cross_val_score
   from sklearn.pipeline import Pipeline
   
   pipeline = Pipeline([
       ('binning', EqualWidthBinning(n_bins=5)),
       ('classifier', RandomForestClassifier())
   ])
   
   scores = cross_val_score(pipeline, X, y, cv=5)

Next Steps
----------

Now that you understand the overall architecture and design, explore:

- :doc:`binning_methods`: Detailed guide to all available binning methods
- :doc:`data_types`: Working with different data formats  
- :doc:`configuration`: Advanced configuration options
- :doc:`best_practices`: Tips for effective binning strategies