binlearn.methods.TreeBinning

Tree-based supervised binning implementation using clean architecture.

Creates bins using decision tree splits guided by a target column. This method fits a decision tree to predict the guidance column from the features to be binned, then uses the tree’s split thresholds to define bin boundaries that optimize the tree’s ability to separate different target values.

The decision tree learning algorithm automatically identifies the most informative split points for distinguishing between different target values, making this approach particularly effective for supervised learning tasks. The resulting bin edges are the split thresholds stored at the decision tree’s internal nodes, so the bins are intervals chosen to maximize the separation of target classes (classification) or minimize target variance (regression).
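The mapping from tree splits to bin edges can be illustrated with scikit-learn alone. This is a minimal sketch of the idea, not binlearn's internal code: the synthetic data, the variable names, and the choice to use the data minimum and maximum as outer edges are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic single feature: class 1 occupies only the middle band (3, 7).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=500)
y = ((x > 3.0) & (x < 7.0)).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(x.reshape(-1, 1), y)

# Leaf nodes are marked with a negative feature index; only internal
# nodes carry a real split threshold.
is_split = tree.tree_.feature >= 0
thresholds = np.unique(tree.tree_.threshold[is_split])

# Assumption: the outer edges are the observed data bounds.
edges = np.concatenate(([x.min()], thresholds, [x.max()]))
print(edges)
```

The tree recovers two thresholds close to 3 and 7, which yields three bins: below the band, inside it, and above it.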

This approach is especially valuable when:

- The relationship between features and targets is complex and non-linear
- Domain knowledge about optimal split points is limited
- Automatic feature discretization is needed for downstream models
- Interpretable binning rules are desired (tree splits are easy to understand)

The method supports both classification and regression tasks, automatically selecting the appropriate decision tree variant based on the task type. The fitted trees are stored and can be accessed for analysis of feature importance and split decisions.
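The selection between the two tree variants might look roughly like the following. The helper `make_tree` is hypothetical and exists only for this sketch; binlearn's own implementation and the exception type it raises may differ.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor


def make_tree(task_type, tree_params=None):
    """Hypothetical helper: pick the sklearn tree matching the task type."""
    estimators = {
        "classification": DecisionTreeClassifier,
        "regression": DecisionTreeRegressor,
    }
    if task_type not in estimators:
        raise ValueError(
            f"task_type must be 'classification' or 'regression', got {task_type!r}"
        )
    # Forward user-supplied parameters unchanged to the sklearn constructor.
    return estimators[task_type](**(tree_params or {}))


clf = make_tree("classification", {"max_depth": 4})
reg = make_tree("regression", {"max_depth": 3, "min_samples_split": 10})
print(type(clf).__name__, type(reg).__name__)
```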

This implementation follows the clean binlearn architecture with straight inheritance, dynamic column resolution, and parameter reconstruction capabilities.

Parameters:

task_type – Type of supervised task - either ‘classification’ or ‘regression’. Determines whether to use DecisionTreeClassifier or DecisionTreeRegressor. If None, uses configuration default.
tree_params – Dictionary of parameters to pass to the sklearn DecisionTree. Common parameters include max_depth, min_samples_split, min_samples_leaf, random_state. If None, uses configuration default or sensible defaults.
clip – Whether to clip values outside the fitted range to the nearest bin edge. If None, uses configuration default.
preserve_dataframe – Whether to preserve pandas DataFrame structure in transform operations. If None, uses configuration default.
guidance_columns – Column specification for target/guidance data used in supervised binning. Can be column names, indices, or callable selector.
bin_edges – Pre-computed bin edges for reconstruction. Should not be provided during normal usage.
bin_representatives – Pre-computed bin representatives for reconstruction. Should not be provided during normal usage.
class – Class name for reconstruction compatibility. Internal use only.
module – Module name for reconstruction compatibility. Internal use only.
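As an illustration of what clipping to the nearest bin edge means, the NumPy sketch below assigns out-of-range values to the first or last bin. The edge values and the use of `np.digitize` are assumptions for this example; how binlearn encodes out-of-range values when `clip` is disabled is not covered here.

```python
import numpy as np

# Assumed fitted edges: three bins covering [0, 3), [3, 7) and [7, 10].
edges = np.array([0.0, 3.0, 7.0, 10.0])
values = np.array([-2.0, 1.0, 5.0, 9.0, 12.0])

# Pull out-of-range values back to the outermost edges.
clipped = np.clip(values, edges[0], edges[-1])

# Digitize against the interior edges only, so indices run from 0 to 2.
bin_idx = np.digitize(clipped, edges[1:-1])
print(bin_idx)  # [0 0 1 2 2]
```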

task_type: Type of supervised task (‘classification’ or ‘regression’)

tree_params: Parameters passed to the decision tree

_fitted_trees: Dictionary storing fitted tree models per column

_tree_importance: Dictionary storing feature importance per column

_tree_template: Template tree used for cloning during fitting

Example

>>> import numpy as np
>>> from binlearn.methods import TreeBinning
>>> from sklearn.datasets import make_classification
>>>
>>> # Create sample classification data
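>>> # A single feature requires n_informative=1 and n_clusters_per_class=1
>>> X, y = make_classification(
...     n_samples=1000, n_features=1, n_informative=1, n_redundant=0,
...     n_clusters_per_class=1, random_state=42
... )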
>>>
>>> # Initialize tree binning for classification
>>> binner = TreeBinning(
...     task_type='classification',
...     tree_params={'max_depth': 4, 'min_samples_leaf': 50, 'random_state': 42}
... )
>>>
>>> # Fit with target data
>>> binner.fit(X, y)
>>> X_binned = binner.transform(X)
>>>
>>> # Analyze tree splits
>>> print(f"Number of bins: {len(binner.bin_edges_[0]) - 1}")
>>> print(f"Split points: {binner.bin_edges_[0][1:-1]}")  # Exclude data bounds
>>>
>>> # Access fitted tree for analysis
>>> tree = binner._fitted_trees[0]
>>> print(f"Tree depth: {tree.tree_.max_depth}")

Note

Requires target/guidance data for supervised learning of optimal split points
Automatically selects DecisionTreeClassifier or DecisionTreeRegressor based on task_type
Split thresholds from the tree become the bin boundaries
Supports all sklearn DecisionTree parameters through tree_params
Fitted trees are stored and accessible for further analysis
Each column is processed independently with its corresponding target data
Handles both classification and regression tasks seamlessly

See also

Chi2Binning: Statistical significance-based supervised binning
IsotonicBinning: Monotonic relationship preserving supervised binning
SupervisedBinningBase: Base class for supervised binning methods

References

Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. A. (1984). Classification and regression trees.

Initialize TreeBinning with decision tree parameters and task configuration.

Sets up decision tree-based binning with specified tree parameters and task type. Creates a tree template that will be cloned for each column during fitting. Applies configuration defaults for any unspecified parameters.
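The template-and-clone pattern described above can be sketched with `sklearn.base.clone`. The dictionary name `fitted_trees` and the synthetic data are illustrative assumptions; the point is only that each column gets its own unfitted copy of the template, so fitting one column never affects another.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(200, 2))
y = np.where(X[:, 0] > 0.5, 1.0, 0.0) + 0.1 * X[:, 1]

# The template holds the configuration and is never fitted itself.
template = DecisionTreeRegressor(max_depth=2, random_state=0)

fitted_trees = {}
for col in range(X.shape[1]):
    tree = clone(template)        # fresh, unfitted copy with the same parameters
    tree.fit(X[:, [col]], y)      # each column is fitted independently against y
    fitted_trees[col] = tree

print({col: t.tree_.max_depth for col, t in fitted_trees.items()})
```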

Parameters:

task_type – Type of supervised learning task. Must be either:
  - ‘classification’: Uses DecisionTreeClassifier for discrete targets
  - ‘regression’: Uses DecisionTreeRegressor for continuous targets
  If None, uses configuration default (typically ‘classification’).
tree_params – Dictionary of parameters to pass to the sklearn DecisionTree constructor. Common parameters include:
  - max_depth: Maximum depth of the tree (int or None)
  - min_samples_split: Minimum samples required to split a node (int)
  - min_samples_leaf: Minimum samples required at each leaf (int)
  - random_state: Random seed for reproducible results (int or None)
  If None, uses sensible defaults.
clip – Whether to clip transformed values outside the fitted range to the nearest bin edge. If None, uses configuration default.
preserve_dataframe – Whether to preserve pandas DataFrame structure in transform operations. If None, uses configuration default.
guidance_columns – Column specification for target/guidance data. Can be column names, indices, or callable selector. Required for supervised binning during fit operations.
bin_edges – Pre-computed bin edges dictionary for reconstruction. Internal use only - should not be provided during normal initialization.
bin_representatives – Pre-computed representatives dictionary for reconstruction. Internal use only.
class – Class name string for reconstruction compatibility. Internal use only.
module – Module name string for reconstruction compatibility. Internal use only.

Raises:

ConfigurationError – If task_type is not ‘classification’ or ‘regression’, or if tree_params contains invalid parameters.
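One plausible way to validate `tree_params` is to compare its keys with the parameter names that the sklearn estimator reports through `get_params()`. The sketch below is an assumption about how such a check could work, not binlearn's actual validation; `ConfigurationError` is redefined locally as a stand-in so the example runs on its own, and `check_tree_params` is a hypothetical helper.

```python
from sklearn.tree import DecisionTreeClassifier


class ConfigurationError(ValueError):
    """Local stand-in for binlearn's exception, defined so the sketch is self-contained."""


def check_tree_params(tree_params):
    """Hypothetical check: reject keys the sklearn tree constructor does not accept."""
    valid = set(DecisionTreeClassifier().get_params())
    unknown = sorted(set(tree_params) - valid)
    if unknown:
        raise ConfigurationError(f"Unknown tree parameters: {unknown}")
    return dict(tree_params)


print(check_tree_params({"max_depth": 5, "min_samples_leaf": 20}))

try:
    check_tree_params({"max_dpth": 5})  # misspelled key
except ConfigurationError as exc:
    print(exc)
```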

Example

>>> # Classification with custom tree parameters
>>> binner = TreeBinning(
...     task_type='classification',
...     tree_params={
...         'max_depth': 5,
...         'min_samples_leaf': 20,
...         'random_state': 42
...     },
...     guidance_columns='target_class'
... )
>>>
>>> # Regression with minimal tree constraints
>>> binner = TreeBinning(
...     task_type='regression',
...     tree_params={'max_depth': 3, 'min_samples_split': 10},
...     guidance_columns=['continuous_target']
... )
>>>
>>> # Use configuration defaults
>>> binner = TreeBinning(guidance_columns='target')

Note

Parameter validation occurs during initialization
Tree template is created during initialization and cloned for each column
Configuration defaults are applied for None parameters
The tree_params dictionary is validated against sklearn DecisionTree parameters
Guidance columns must be specified for supervised binning to work properly
Reconstruction parameters should not be provided during normal usage