TreeBinning

class binlearn.methods.TreeBinning(task_type: str | None = None, tree_params: dict[str, Any] | None = None, clip: bool | None = None, preserve_dataframe: bool | None = None, guidance_columns: Any | None = None, *, bin_edges: dict[Any, list[float]] | None = None, bin_representatives: dict[Any, list[float]] | None = None, class_: str | None = None, module_: str | None = None)

Bases: SupervisedBinningBase

Tree-based supervised binning implementation using clean architecture.

Creates bins using decision tree splits guided by a target column. This method fits a decision tree to predict the guidance column from the features to be binned, then uses the tree’s split thresholds to define bin boundaries that optimize the tree’s ability to separate different target values.

The decision tree learning algorithm automatically identifies the most informative split points for distinguishing between different target values, making this approach particularly effective for supervised learning tasks. The split thresholds at the tree’s internal nodes become the bin boundaries, so the resulting intervals maximize the separation of target classes (classification) or minimize target variance (regression).
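The idea of turning split thresholds into bin boundaries can be sketched without any library. The following is a minimal pure-Python illustration for one feature and a classification target. It is not binlearn's implementation (which delegates to sklearn's decision trees), and the helper names gini, best_split, tree_thresholds and bin_edges_from_tree are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(xs, ys, min_samples_leaf):
    """Return the threshold with the lowest weighted Gini, or None."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best_thr, best_score = None, gini([y for _, y in pairs])
    for i in range(min_samples_leaf, n - min_samples_leaf + 1):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # cannot split between equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score - 1e-12:  # require a strict improvement
            best_thr, best_score = (lo + hi) / 2.0, score
    return best_thr


def tree_thresholds(xs, ys, max_depth, min_samples_leaf=1):
    """Collect the split thresholds of a depth-limited tree on one feature."""
    if max_depth == 0 or len(xs) < 2 * min_samples_leaf:
        return []
    thr = best_split(xs, ys, min_samples_leaf)
    if thr is None:
        return []
    out = [thr]
    left = [(x, y) for x, y in zip(xs, ys) if x <= thr]
    right = [(x, y) for x, y in zip(xs, ys) if x > thr]
    for part in (left, right):
        px = [x for x, _ in part]
        py = [y for _, y in part]
        out += tree_thresholds(px, py, max_depth - 1, min_samples_leaf)
    return out


def bin_edges_from_tree(xs, ys, max_depth=3, min_samples_leaf=1):
    """Bin edges: data minimum, sorted split thresholds, data maximum."""
    splits = sorted(set(tree_thresholds(xs, ys, max_depth, min_samples_leaf)))
    return [min(xs)] + splits + [max(xs)]


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [0, 0, 0, 1, 1, 1, 0, 0, 0]
edges = bin_edges_from_tree(xs, ys, max_depth=2)
print(edges)  # [1.0, 3.5, 6.5, 9.0]
```

The two interior edges (3.5 and 6.5) are exactly where the class label changes, which is the behaviour the paragraph above describes. Placing the data minimum and maximum at the ends is an assumption of this sketch.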

This approach is especially valuable when:

  • The relationship between features and targets is complex and non-linear

  • Domain knowledge about optimal split points is limited

  • Automatic feature discretization is needed for downstream models

  • Interpretable binning rules are desired (tree splits are easy to understand)

The method supports both classification and regression tasks, automatically selecting the appropriate decision tree variant based on the task type. The fitted trees are stored and can be accessed for analysis of feature importance and split decisions.
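As a rough illustration of how the task type changes what a split optimizes, the sketch below dispatches on task_type to pick an impurity measure: Gini impurity for classification and variance for regression. This is a stand-in written for this page, not the library's code; the real class selects DecisionTreeClassifier or DecisionTreeRegressor, and the names IMPURITY and best_threshold are hypothetical.

```python
from collections import Counter
from statistics import pvariance


def gini(values):
    """Gini impurity for discrete targets."""
    n = len(values)
    return 1.0 - sum((c / n) ** 2 for c in Counter(values).values())


def variance(values):
    """Population variance for continuous targets."""
    return pvariance(values) if len(values) > 1 else 0.0


# Hypothetical dispatch table: task type -> impurity function.
IMPURITY = {"classification": gini, "regression": variance}


def best_threshold(xs, ys, task_type):
    """Return the single split threshold that minimizes weighted impurity."""
    if task_type not in IMPURITY:
        raise ValueError(f"unknown task_type: {task_type!r}")
    impurity = IMPURITY[task_type]
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best_score, best_thr = impurity([y for _, y in pairs]), None
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold fits between equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / n
        if score < best_score:
            best_score, best_thr = score, (lo + hi) / 2.0
    return best_thr


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cls_thr = best_threshold(xs, [0, 0, 1, 1, 1, 1], "classification")
reg_thr = best_threshold(xs, [1.0, 1.1, 0.9, 5.0, 5.2, 4.8], "regression")
print(cls_thr, reg_thr)  # 2.5 3.5
```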

This implementation follows the clean binlearn architecture with straight inheritance, dynamic column resolution, and parameter reconstruction capabilities.

Parameters:
  • task_type – Type of supervised task - either ‘classification’ or ‘regression’. Determines whether to use DecisionTreeClassifier or DecisionTreeRegressor. If None, uses configuration default.

  • tree_params – Dictionary of parameters to pass to the sklearn DecisionTree. Common parameters include max_depth, min_samples_split, min_samples_leaf, random_state. If None, uses configuration default or sensible defaults.

  • clip – Whether to clip values outside the fitted range to the nearest bin edge. If None, uses configuration default.

  • preserve_dataframe – Whether to preserve pandas DataFrame structure in transform operations. If None, uses configuration default.

  • guidance_columns – Column specification for target/guidance data used in supervised binning. Can be column names, indices, or callable selector.

  • bin_edges – Pre-computed bin edges for reconstruction. Should not be provided during normal usage.

  • bin_representatives – Pre-computed bin representatives for reconstruction. Should not be provided during normal usage.

  • class_ – Class name for reconstruction compatibility. Internal use only.

  • module_ – Module name for reconstruction compatibility. Internal use only.

task_type

Type of supervised task (‘classification’ or ‘regression’)

tree_params

Parameters passed to the decision tree

_fitted_trees

Dictionary storing fitted tree models per column

_tree_importance

Dictionary storing feature importance per column

_tree_template

Template tree used for cloning during fitting

Example

>>> import numpy as np
>>> from binlearn.methods import TreeBinning
>>> from sklearn.datasets import make_classification
>>>
>>> # Create sample classification data
>>> X, y = make_classification(n_samples=1000, n_features=1, n_redundant=0, random_state=42)
>>>
>>> # Initialize tree binning for classification
>>> binner = TreeBinning(
...     task_type='classification',
...     tree_params={'max_depth': 4, 'min_samples_leaf': 50, 'random_state': 42}
... )
>>>
>>> # Fit with target data
>>> binner.fit(X, y)
>>> X_binned = binner.transform(X)
>>>
>>> # Analyze tree splits
>>> print(f"Number of bins: {len(binner.bin_edges_[0]) - 1}")
>>> print(f"Split points: {binner.bin_edges_[0][1:-1]}")  # Exclude data bounds
>>>
>>> # Access fitted tree for analysis
>>> tree = binner._fitted_trees[0]
>>> print(f"Tree depth: {tree.tree_.max_depth}")

Note

  • Requires target/guidance data for supervised learning of optimal split points

  • Automatically selects DecisionTreeClassifier or DecisionTreeRegressor based on task_type

  • Split thresholds from the tree become the bin boundaries

  • Supports all sklearn DecisionTree parameters through tree_params

  • Fitted trees are stored and accessible for further analysis

  • Each column is processed independently with its corresponding target data

  • Handles both classification and regression tasks seamlessly

See also

  • Chi2Binning: Statistical significance-based supervised binning

  • IsotonicBinning: Monotonic relationship preserving supervised binning

  • SupervisedBinningBase: Base class for supervised binning methods

References

Breiman, L., Friedman, J., Stone, C. J., & Olshen, R. A. (1984). Classification and regression trees.

__init__(task_type: str | None = None, tree_params: dict[str, Any] | None = None, clip: bool | None = None, preserve_dataframe: bool | None = None, guidance_columns: Any | None = None, *, bin_edges: dict[Any, list[float]] | None = None, bin_representatives: dict[Any, list[float]] | None = None, class_: str | None = None, module_: str | None = None)

Initialize Tree binning with decision tree parameters and task configuration.

Sets up decision tree-based binning with specified tree parameters and task type. Creates a tree template that will be cloned for each column during fitting. Applies configuration defaults for any unspecified parameters.

Parameters:
  • task_type – Type of supervised learning task. Must be either:

      - ‘classification’: Uses DecisionTreeClassifier for discrete targets
      - ‘regression’: Uses DecisionTreeRegressor for continuous targets

    If None, uses configuration default (typically ‘classification’).

  • tree_params – Dictionary of parameters to pass to the sklearn DecisionTree constructor. Common parameters include:

      - max_depth: Maximum depth of the tree (int or None)
      - min_samples_split: Minimum samples required to split a node (int)
      - min_samples_leaf: Minimum samples required at each leaf (int)
      - random_state: Random seed for reproducible results (int or None)

    If None, uses sensible defaults.

  • clip – Whether to clip transformed values outside the fitted range to the nearest bin edge. If None, uses configuration default.

  • preserve_dataframe – Whether to preserve pandas DataFrame structure in transform operations. If None, uses configuration default.

  • guidance_columns – Column specification for target/guidance data. Can be column names, indices, or callable selector. Required for supervised binning during fit operations.

  • bin_edges – Pre-computed bin edges dictionary for reconstruction. Internal use only - should not be provided during normal initialization.

  • bin_representatives – Pre-computed representatives dictionary for reconstruction. Internal use only.

  • class_ – Class name string for reconstruction compatibility. Internal use only.

  • module_ – Module name string for reconstruction compatibility. Internal use only.

Raises:

ConfigurationError – If task_type is not ‘classification’ or ‘regression’, or if tree_params contains invalid parameters.
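The kind of check described here might look like the sketch below. It is an assumption-laden illustration: ConfigurationError is redefined locally as a stand-in, validate_config is a hypothetical helper, and KNOWN_TREE_PARAMS lists only the options named on this page rather than everything sklearn accepts.

```python
class ConfigurationError(ValueError):
    """Local stand-in for the library's ConfigurationError (hypothetical)."""


VALID_TASK_TYPES = ("classification", "regression")

# Illustrative subset only: the tree options mentioned in this documentation.
KNOWN_TREE_PARAMS = {
    "max_depth",
    "min_samples_split",
    "min_samples_leaf",
    "random_state",
    "class_weight",
}


def validate_config(task_type, tree_params=None):
    """Return (task_type, params) or raise ConfigurationError."""
    if task_type not in VALID_TASK_TYPES:
        raise ConfigurationError(
            f"task_type must be one of {VALID_TASK_TYPES}, got {task_type!r}"
        )
    params = dict(tree_params or {})
    unknown = sorted(set(params) - KNOWN_TREE_PARAMS)
    if unknown:
        raise ConfigurationError(f"unknown tree_params keys: {unknown}")
    return task_type, params


ok = validate_config("regression", {"max_depth": 3})
print(ok)  # ('regression', {'max_depth': 3})
```

Failing at construction time, rather than at fit time, surfaces a misspelled option before any data is touched.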

Example

>>> # Classification with custom tree parameters
>>> binner = TreeBinning(
...     task_type='classification',
...     tree_params={
...         'max_depth': 5,
...         'min_samples_leaf': 20,
...         'random_state': 42
...     },
...     guidance_columns='target_class'
... )
>>>
>>> # Regression with minimal tree constraints
>>> binner = TreeBinning(
...     task_type='regression',
...     tree_params={'max_depth': 3, 'min_samples_split': 10},
...     guidance_columns=['continuous_target']
... )
>>>
>>> # Use configuration defaults
>>> binner = TreeBinning(guidance_columns='target')

Note

  • Parameter validation occurs during initialization

  • Tree template is created during initialization and cloned for each column

  • Configuration defaults are applied for None parameters

  • The tree_params dictionary is validated against sklearn DecisionTree parameters

  • Guidance columns must be specified for supervised binning to work properly

  • Reconstruction parameters should not be provided during normal usage

classmethod __init_subclass__(**kwargs)

Set the set_{method}_request methods.

This uses PEP-487 [1] to set the set_{method}_request methods. It looks for the information available in the set default values which are set using __metadata_request__* class attributes, or inferred from method signatures.

The __metadata_request__* class attributes are used when a method does not explicitly accept a metadata through its arguments or if the developer would like to specify a request value for those metadata which are different from the default None.

References

[1] PEP 487: Simpler customisation of class creation.

static check_data_quality(data: ndarray[Any, Any], name: str = 'data') → None

Check data quality and issue warnings if needed.

property feature_names_in_: list[str] | None

Get feature names.

fit(X: Any, y: Any | None = None, **fit_params: Any) → GeneralBinningBase

Fit the binning transformer with comprehensive orchestration.

This method orchestrates the complete fitting process, handling parameter validation, input preprocessing, column separation, and routing to the appropriate fitting strategy (joint vs independent).

Parameters:
  • X – Input data to fit the binning transformer on. Can be:

      - pandas.DataFrame: Column names are preserved
      - polars.DataFrame: Column names are preserved
      - numpy.ndarray: Numeric column indices are used
      - array-like: Converted to numpy array

  • y – Target values for supervised binning methods. Ignored by unsupervised methods. Can be array-like or None.

  • **fit_params – Additional fitting parameters passed to the specific binning algorithm implementation. Common parameters include:

      - guidance_data: Alternative guidance data (conflicts with fit_jointly=True)

Returns:

The fitted binning transformer instance.

Return type:

self

Raises:
  • ValueError – If parameter validation fails, inputs are invalid, or conflicting parameters are provided (e.g., fit_jointly=True with guidance_data).

  • BinningError – If the binning algorithm fails to fit the data.

  • RuntimeError – If an unexpected error occurs during fitting.

Example

>>> from binlearn import EqualWidthBinning
>>> import pandas as pd
>>> X = pd.DataFrame({'feature1': [1, 2, 3, 4, 5], 'feature2': [10, 20, 30, 40, 50]})
>>> binner = EqualWidthBinning(n_bins=3)
>>> binner.fit(X)
EqualWidthBinning(...)

Note

The method automatically handles column separation when guidance_columns is specified, routing guidance columns separately from binning columns. The fitting strategy (joint vs independent) is determined by the fit_jointly parameter.
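The column separation mentioned in this note can be pictured with plain lists. The sketch below is a simplified assumption of what such routing might do; split_guidance is a hypothetical helper and the data layout (a list of rows plus a list of column labels) is chosen only for illustration.

```python
def split_guidance(rows, columns, guidance_columns):
    """Split rows into binning columns and guidance columns.

    rows: list of equal-length row lists
    columns: column labels (names or integer indices), one per position
    guidance_columns: a single label or a list of labels
    """
    if isinstance(guidance_columns, (list, tuple)):
        guidance = list(guidance_columns)
    else:
        guidance = [guidance_columns]
    missing = [c for c in guidance if c not in columns]
    if missing:
        raise ValueError(f"guidance columns not found: {missing}")
    g_idx = [columns.index(c) for c in guidance]
    b_idx = [i for i in range(len(columns)) if i not in g_idx]
    binning_cols = [columns[i] for i in b_idx]
    binning_rows = [[row[i] for i in b_idx] for row in rows]
    guidance_rows = [[row[i] for i in g_idx] for row in rows]
    return binning_cols, binning_rows, guidance_rows


columns = ["feature1", "feature2", "target"]
rows = [[1.0, 10.0, 0], [2.0, 20.0, 1], [3.0, 30.0, 1]]
bin_cols, X_part, y_part = split_guidance(rows, columns, "target")
print(bin_cols)  # ['feature1', 'feature2']
print(y_part)    # [[0], [1], [1]]
```

Only the binning part is discretized; the guidance part plays the role that y plays in the sklearn-style call.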

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

get_input_columns() → list[Any] | None

Get input columns for data preparation.

This method should be overridden by derived classes to provide appropriate column information without exposing binning-specific concepts.

Returns:

Column information or None if not available

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep: bool = True) → dict[str, Any]

Get parameters for this estimator, including fitted parameters.

This method extends sklearn’s standard get_params to include fitted parameters when the estimator is fitted, enabling complete object reconstruction through the get_params/set_params interface. This is essential for pipeline persistence and model serialization.

Parameters:

deep – If True, returns parameters for sub-estimators (not applicable here but maintained for sklearn compatibility).

Returns:

Dictionary of parameter names mapped to their values, including:

  • Constructor parameters extracted from __init__ signature

  • Fitted parameters (if estimator is fitted) mapped from attributes

  • Class metadata (class_, module_) for automatic reconstruction

Return type:

dict[str, Any]

Example

>>> binner = EqualWidthBinning(n_bins=5)
>>> params = binner.get_params()
>>> print(params)
{'n_bins': 5, 'clip': None, ..., 'class_': 'EqualWidthBinning', 'module_': '...'}
>>>
>>> binner.fit(X)
>>> fitted_params = binner.get_params()
>>> # Now includes: {'bin_edges': {...}, 'bin_representatives': {...}, ...}

Note

  • Automatically extracts constructor parameters from __init__ signature

  • Includes fitted parameters only when estimator is fitted

  • Adds class metadata for reconstruction workflows

  • Excludes internal sklearn attributes like n_features_in_

  • class_ and module_ parameters are handled specially during set_params
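The reconstruction convention described above (fitted attributes carry a trailing underscore, but appear in get_params without it) can be shown with a toy class. ToyBinner below is entirely hypothetical and far simpler than any binlearn estimator; it exists only to make the round trip concrete.

```python
class ToyBinner:
    """Toy equal-width binner that mimics the get_params convention."""

    def __init__(self, n_bins=3, clip=None, bin_edges=None):
        self.n_bins = n_bins
        self.clip = clip
        if bin_edges is not None:
            # Reconstruction path: restore fitted state from a parameter.
            self.bin_edges_ = dict(bin_edges)

    def fit(self, columns):
        """columns: dict mapping column name to a list of numbers."""
        self.bin_edges_ = {}
        for name, values in columns.items():
            lo, hi = min(values), max(values)
            width = (hi - lo) / self.n_bins
            self.bin_edges_[name] = [lo + k * width for k in range(self.n_bins + 1)]
        return self

    def get_params(self):
        params = {"n_bins": self.n_bins, "clip": self.clip}
        if hasattr(self, "bin_edges_"):
            params["bin_edges"] = self.bin_edges_  # exposed without underscore
        params["class_"] = type(self).__name__
        params["module_"] = type(self).__module__
        return params


fitted = ToyBinner(n_bins=2).fit({"x": [0.0, 4.0]})
params = fitted.get_params()

# Drop the class metadata, then rebuild an equivalent fitted object.
ctor_args = {k: v for k, v in params.items() if k not in ("class_", "module_")}
clone = ToyBinner(**ctor_args)
print(clone.bin_edges_)  # {'x': [0.0, 2.0, 4.0]}
```

The clone never sees the training data, yet it holds the same bin edges, which is the point of including fitted parameters in the output.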

inverse_transform(X: Any) → Any

Inverse transform from bin indices back to representative values.

Converts discrete bin indices back to their representative values, effectively reversing the binning transformation. This is useful for interpreting results or reconstructing approximate original values.

Parameters:

X – Input data containing bin indices to inverse transform. Should contain only binning columns (no guidance columns). Can be:

  • pandas.DataFrame: Column names should match binning columns

  • polars.DataFrame: Column names should match binning columns

  • numpy.ndarray: Must have same number of binning columns

  • array-like: Converted to numpy array

Returns:

Inverse transformed data where bin indices are replaced with their representative values (typically bin centers). Output format matches the preserve_dataframe setting.

Raises:
  • RuntimeError – If the transformer has not been fitted yet.

  • ValueError – If input data has wrong number of columns or invalid format.

  • BinningError – If inverse transformation fails.

Example

>>> # After fitting and transforming
>>> X_binned = [[0, 1], [1, 0], [2, 2]]  # Bin indices
>>> X_reconstructed = binner.inverse_transform(X_binned)
>>> print(X_reconstructed)
[[0.5, 1.5], [1.5, 0.5], [2.5, 2.5]]  # Representative values

Note

For guided binning (when guidance_columns is specified), the input should only contain the binning columns, not the guidance columns. The number of input columns must match the number of binning columns.
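A minimal sketch of the index-to-representative mapping follows, assuming representatives are bin centers (the text above says this is typical, not guaranteed). The function names are hypothetical and operate on a single column.

```python
def bin_centers(edges):
    """Midpoint of each consecutive pair of edges."""
    return [(lo + hi) / 2.0 for lo, hi in zip(edges[:-1], edges[1:])]


def inverse_transform_column(indices, representatives):
    """Replace each bin index with that bin's representative value."""
    n_bins = len(representatives)
    out = []
    for i in indices:
        if not 0 <= i < n_bins:
            raise ValueError(f"bin index {i} outside 0..{n_bins - 1}")
        out.append(representatives[i])
    return out


edges = [0.0, 1.0, 2.0, 3.0]
reps = bin_centers(edges)
restored = inverse_transform_column([0, 1, 2, 1], reps)
print(reps)      # [0.5, 1.5, 2.5]
print(restored)  # [0.5, 1.5, 2.5, 1.5]
```

The reconstruction is lossy: every value that fell into a bin comes back as the same representative.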

property n_features_in_: int

Get number of features.

set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None) –

Configure output of transform and fit_transform.

  • “default”: Default output format of a transformer

  • “pandas”: DataFrame output

  • “polars”: Polars output

  • None: Transform configuration is unchanged

Added in version 1.4: “polars” option was added.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_params(**params: Any) → SklearnIntegrationBase

Set the parameters of this estimator.

This method supports reconstruction workflows by handling fitted parameters that come from get_params() output (without underscores) and setting them as fitted attributes (with underscores).

Parameters:

**params – Parameters to set. Can include:

  • Regular constructor parameters (n_bins, clip, etc.)

  • Fitted parameters from get_params (bin_edges, bin_representatives)

  • Class metadata (ignored during reconstruction)

Returns:

Returns the instance itself.

Return type:

self

transform(X: Any) → Any

Transform input data using fitted binning parameters.

Applies the fitted binning transformation to new data, converting continuous values to discrete bin indices or representatives. Handles column separation when guidance columns are present.

Parameters:

X – Input data to transform. Must have the same structure as the data used during fitting (same number of columns). Can be:

  • pandas.DataFrame: Column names should match training data

  • polars.DataFrame: Column names should match training data

  • numpy.ndarray: Must have same number of columns as training

  • array-like: Converted to numpy array

Returns:

Transformed data where continuous values are replaced with bin indices or representative values. The output format depends on:

  • preserve_dataframe setting: DataFrame vs array format

  • binning method: indices vs representatives

  • guidance_columns: only binning columns are transformed

Raises:
  • RuntimeError – If the transformer has not been fitted yet.

  • ValueError – If the input data has incompatible structure or format.

  • BinningError – If transformation fails due to data issues.

Example

>>> # After fitting
>>> X_new = pd.DataFrame({'feature1': [1.5, 2.5], 'feature2': [15, 25]})
>>> X_binned = binner.transform(X_new)
>>> print(X_binned)
[[0, 0], [1, 1]]  # Bin indices

Note

When guidance_columns is specified, only the binning columns are transformed. Guidance columns are filtered out from the output. The method preserves the original data format when preserve_dataframe=True.
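For a single column, mapping a value to its bin index can be sketched with the standard library's bisect module. Two details here are assumptions made for the sketch, not documented behaviour: bins are treated as left-closed with the last bin also including the maximum edge, and out-of-range values return None when clip is False. The function name to_bin_index is hypothetical.

```python
from bisect import bisect_right


def to_bin_index(value, edges, clip=True):
    """Map one value to a bin index given sorted bin edges."""
    n_bins = len(edges) - 1
    if value < edges[0]:
        # Below the fitted range: nearest bin if clipping, else None.
        return 0 if clip else None
    if value > edges[-1]:
        # Above the fitted range: nearest bin if clipping, else None.
        return n_bins - 1 if clip else None
    # bisect_right counts the edges <= value; the maximum edge is folded
    # into the last bin by the min() below.
    return min(bisect_right(edges, value) - 1, n_bins - 1)


edges = [1.0, 3.5, 6.5, 9.0]
values = [0.2, 1.0, 3.5, 5.0, 9.0, 12.0]
clipped = [to_bin_index(v, edges) for v in values]
unclipped = [to_bin_index(v, edges, clip=False) for v in values]
print(clipped)    # [0, 0, 1, 1, 2, 2]
print(unclipped)  # [None, 0, 1, 1, 2, None]
```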

static validate_array_like(data: Any, name: str = 'data', allow_none: bool = False) → ndarray[Any, Any] | None

Validate and convert array-like input to numpy array.

This method provides robust validation and conversion of various input formats to numpy arrays, with comprehensive error handling and helpful suggestions for common issues.

Parameters:
  • data – Input data to validate and convert. Can be:

      - numpy.ndarray: Used directly
      - pandas.DataFrame/Series: Converted to numpy array
      - polars.DataFrame: Converted to numpy array
      - list, tuple: Converted to numpy array
      - None: Allowed only if allow_none=True

  • name – Name of the data parameter for error messages. Used to provide context in error messages (e.g., “X”, “y”, “guidance_data”).

  • allow_none – Whether to allow None as a valid input. If True, None is returned unchanged; if False, None raises InvalidDataError.

Returns:

Validated numpy array, or None if data is None and allow_none=True. The returned array maintains the same data content but is guaranteed to be a numpy array.

Raises:

InvalidDataError – If validation fails:

  • data is None when allow_none=False

  • data cannot be converted to numpy array

  • Conversion process encounters errors

Example

>>> # Valid inputs
>>> arr = ValidationMixin.validate_array_like([1, 2, 3], "X")
>>> print(type(arr))
<class 'numpy.ndarray'>
>>>
>>> # Allow None
>>> result = ValidationMixin.validate_array_like(None, "y", allow_none=True)
>>> print(result)
None
>>>
>>> # Invalid input
>>> ValidationMixin.validate_array_like(None, "X", allow_none=False)
InvalidDataError: X cannot be None

Note

This method focuses on format validation and conversion. Content validation (like checking for NaN values) should be done separately using other validation methods.

static validate_column_specification(columns: Any, data_shape: tuple[int, ...]) → list[Any]

Validate column specifications.

static validate_guidance_columns(guidance_cols: Any, binning_cols: list[Any], data_shape: tuple[int, ...]) → list[Any]

Validate guidance column specifications.

validate_guidance_data(guidance_data: Any, name: str = 'guidance_data') → ndarray[Any, Any]

Validate and preprocess guidance data for supervised binning.

Ensures that the guidance data is appropriate for supervised binning by validating its shape and checking for data quality issues.

Parameters:
  • guidance_data – Raw guidance/target data to validate. Should be a 2D array with shape (n_samples, 1) or 1D array with shape (n_samples,).

  • name – Name used in error messages for better debugging context.

Returns:

Validated and preprocessed guidance data with shape (n_samples, 1).

Raises:

ValidationError – If guidance data has invalid shape or format.
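The shape rule described here (accept (n_samples,) or (n_samples, 1), return (n_samples, 1)) can be sketched with nested lists. This is an illustration only: the real method returns a numpy array, ValidationError is redefined locally as a stand-in, and as_guidance_column is a hypothetical name.

```python
class ValidationError(ValueError):
    """Local stand-in for the library's ValidationError (hypothetical)."""


def as_guidance_column(guidance_data, name="guidance_data"):
    """Normalize 1D or single-column 2D data to a list of one-element rows."""
    out = []
    for row in guidance_data:
        if isinstance(row, (list, tuple)):
            if len(row) != 1:
                raise ValidationError(
                    f"{name} must have exactly one column, got {len(row)}"
                )
            out.append([row[0]])
        else:
            out.append([row])  # 1D input: wrap each scalar in its own row
    return out


flat = as_guidance_column([0, 1, 1])
column = as_guidance_column([[0], [1], [1]])
print(flat)  # [[0], [1], [1]]
```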

Overview

TreeBinning creates bins using decision tree-based supervised discretization. The method uses decision trees to find optimal split points that maximize predictive performance for classification or regression tasks. This approach creates bins that are optimized for a specific target variable.

Key Features

  • Decision Tree Foundation: Uses sklearn’s DecisionTreeClassifier/Regressor for optimal splits

  • Dual Interface: Supports both sklearn-style (X, y) and binlearn-style (guidance_columns) usage

  • Task Flexibility: Handles both classification and regression targets

  • Configurable Trees: Full control over decision tree parameters

  • Automatic Extraction: Extracts bin edges from tree structure automatically

  • Sklearn Compatibility: Full transformer interface with fit/transform methods

  • DataFrame Support: Preserves pandas/polars column names and structure

Examples

Basic Classification (sklearn-style)

import numpy as np
from sklearn.datasets import make_classification
from binlearn.methods import TreeBinning

# Create classification dataset
X, y = make_classification(n_samples=1000, n_features=4, n_classes=2, random_state=42)

# Method 1: sklearn-style interface
sup_binner = TreeBinning(
    task_type='classification',
    tree_params={'max_depth': 3, 'min_samples_leaf': 20}
)

# Fit with X and y separately
sup_binner.fit(X, y)
X_binned = sup_binner.transform(X)

print(f"Original shape: {X.shape}")
print(f"Binned shape: {X_binned.shape}")
print(f"Bin edges for feature 0: {sup_binner.bin_edges_[0]}")

Basic Classification (binlearn-style)

# Method 2: binlearn-style with guidance_columns
# Combine features and target into single dataset
X_with_target = np.column_stack([X, y])

sup_binner2 = TreeBinning(
    guidance_columns=[4],  # Use column 4 (target) as guidance
    task_type='classification',
    tree_params={'max_depth': 3, 'min_samples_leaf': 20}
)

X_binned2 = sup_binner2.fit_transform(X_with_target)

# Both methods produce identical results
print(f"Methods produce same results: {np.array_equal(X_binned, X_binned2)}")

DataFrame Example

import pandas as pd

# Create DataFrame with features and target
df = pd.DataFrame(X, columns=['feature1', 'feature2', 'feature3', 'feature4'])
df['target'] = y

# Sklearn-style: fit with separate X and y
binner1 = TreeBinning(
    task_type='classification',
    preserve_dataframe=True
)
binner1.fit(df[['feature1', 'feature2', 'feature3', 'feature4']], df['target'])
df_binned1 = binner1.transform(df[['feature1', 'feature2', 'feature3', 'feature4']])

# Binlearn-style: use guidance_columns
binner2 = TreeBinning(
    guidance_columns=['target'],
    task_type='classification',
    preserve_dataframe=True
)
df_binned2 = binner2.fit_transform(df)

print("Feature binning with DataFrame:")
print(df_binned2.head())

Regression Task

from sklearn.datasets import make_regression

# Create regression dataset
X_reg, y_reg = make_regression(n_samples=1000, n_features=3, noise=0.1, random_state=42)

# Supervised binning for regression
reg_binner = TreeBinning(
    task_type='regression',
    tree_params={
        'max_depth': 4,
        'min_samples_leaf': 50,
        'random_state': 42
    }
)

# Fit and transform
reg_binner.fit(X_reg, y_reg)
X_reg_binned = reg_binner.transform(X_reg)

print("Regression binning:")
print(f"  Original features: {X_reg.shape[1]}")
print(f"  Bins per feature: {[len(edges)-1 for edges in reg_binner.bin_edges_.values()]}")

Multi-class Classification

# Multi-class classification example
X_multi, y_multi = make_classification(
    n_samples=1500,
    n_features=5,
    n_classes=3,
    n_informative=4,
    random_state=42
)

multi_binner = TreeBinning(
    task_type='classification',
    tree_params={
        'max_depth': 5,
        'min_samples_split': 100,
        'min_samples_leaf': 30,
        'random_state': 42
    }
)

multi_binner.fit(X_multi, y_multi)
X_multi_binned = multi_binner.transform(X_multi)

# Analyze bins created for each class
print("Multi-class binning results:")
for i, edges in multi_binner.bin_edges_.items():
    print(f"  Feature {i}: {len(edges)-1} bins, edges: {edges}")

Advanced Tree Configuration

# Fine-tuned decision tree parameters
advanced_binner = TreeBinning(
    task_type='classification',
    tree_params={
        'max_depth': 6,              # Deeper trees for more bins
        'min_samples_split': 200,    # Require more samples for splits
        'min_samples_leaf': 100,     # Larger leaf nodes
        'max_features': 'sqrt',      # Feature sampling
        'random_state': 42,          # Reproducibility
        'class_weight': 'balanced'   # Handle imbalanced classes
    }
)

advanced_binner.fit(X, y)
X_advanced_binned = advanced_binner.transform(X)

print("Advanced configuration results:")
for feature_idx, edges in advanced_binner.bin_edges_.items():
    print(f"  Feature {feature_idx}: {len(edges)-1} bins")

Comparison with Unsupervised Methods

from binlearn.methods import EqualWidthBinning, EqualFrequencyBinning
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Split data for comparison
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Compare different binning methods
binners = {
    'supervised': TreeBinning(task_type='classification'),
    'equal_width': EqualWidthBinning(n_bins=5),
    'equal_frequency': EqualFrequencyBinning(n_bins=5)
}

results = {}
classifier = RandomForestClassifier(random_state=42, n_estimators=100)

for name, binner in binners.items():
    # Fit binner
    if name == 'supervised':
        binner.fit(X_train, y_train)
    else:
        binner.fit(X_train)

    # Transform data
    X_train_binned = binner.transform(X_train)
    X_test_binned = binner.transform(X_test)

    # Train classifier
    classifier.fit(X_train_binned, y_train)
    y_pred = classifier.predict(X_test_binned)

    results[name] = accuracy_score(y_test, y_pred)
    print(f"{name}: {results[name]:.3f} accuracy")

Scikit-learn Pipeline Integration

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Create pipeline with supervised binning
pipeline = Pipeline([
    ('binning', TreeBinning(task_type='classification')),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Parameter grid for both binning and classification
param_grid = {
    'binning__tree_params': [
        {'max_depth': 3, 'min_samples_leaf': 20},
        {'max_depth': 4, 'min_samples_leaf': 30},
        {'max_depth': 5, 'min_samples_leaf': 50}
    ],
    'classifier__n_estimators': [50, 100, 200]
}

# Grid search
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.3f}")

Parameter Guide

task_type (str, optional)

Type of supervised learning task:

  • ‘classification’: For discrete target variables

  • ‘regression’: For continuous target variables

tree_params (dict, optional)

Parameters passed to sklearn’s DecisionTree:

  • max_depth: Maximum depth of decision tree

  • min_samples_split: Minimum samples required to split

  • min_samples_leaf: Minimum samples in leaf nodes

  • random_state: Random seed for reproducibility

  • class_weight: Class weighting for imbalanced data (classification only)

guidance_columns (list, optional)

Columns to use as targets (binlearn-style interface):

  • Alternative to passing y separately

  • Useful when target is part of input DataFrame

  • Can specify multiple guidance columns

Usage Patterns

When to Use Supervised Binning:

  1. Target-Optimized Features: When bins should maximize predictive performance

  2. Feature Engineering: Creating informative bins for downstream models

  3. Interpretable Models: When bin boundaries need to be explainable

  4. Imbalanced Data: Tree parameters can handle class imbalance

Best Practices:

  1. Validation: Always validate on holdout data to avoid overfitting

  2. Tree Tuning: Adjust tree parameters based on your data size and complexity

  3. Task Type: Ensure correct task_type for your target variable

  4. Feature Selection: Consider feature importance when interpreting results

See Also