DBSCANBinning

Bases: IntervalBinningBase

DBSCAN clustering-based binning implementation using clean architecture.

Creates bins based on DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering of each feature. The bin edges are determined by the cluster boundaries identified by DBSCAN, which groups densely connected values together while treating isolated points as noise.

The DBSCAN algorithm finds dense regions in the data and creates natural groupings that respect the underlying data distribution. Unlike k-means or equal-width binning, DBSCAN does not assume any particular shape for clusters and can identify clusters of varying densities. The resulting bins correspond to naturally occurring dense regions in the data.

When DBSCAN produces fewer clusters than the minimum required bins, the algorithm falls back to equal-width binning to ensure the minimum bin count is satisfied.
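
The cluster-to-edge step and the fallback rule can be sketched in plain Python. This is a minimal illustration under stated assumptions, not binlearn's implementation: the helper names `dbscan_1d` and `fit_edges` are hypothetical, cluster extents are taken from core points only, cut points are placed midway between neighbouring clusters, and a `ValueError` stands in for whatever error the library raises.

```python
from bisect import bisect_left, bisect_right


def dbscan_1d(values, eps, min_samples):
    """Return (low, high) extents of the DBSCAN clusters found in 1D values."""
    xs = sorted(values)
    # Core point: at least min_samples values (itself included) within eps.
    cores = [
        x for x in xs
        if bisect_right(xs, x + eps) - bisect_left(xs, x - eps) >= min_samples
    ]
    clusters = []
    for x in cores:
        if clusters and x - clusters[-1][1] <= eps:
            clusters[-1][1] = x          # extend the current cluster
        else:
            clusters.append([x, x])      # start a new cluster
    return [tuple(c) for c in clusters]


def fit_edges(values, eps, min_samples, min_bins=2, allow_fallback=True):
    """Build bin edges from clusters, or fall back to equal-width bins."""
    lo, hi = min(values), max(values)
    clusters = dbscan_1d(values, eps, min_samples)
    if len(clusters) >= min_bins:
        # Assumption of this sketch: cut halfway between neighbouring clusters.
        cuts = [(a[1] + b[0]) / 2 for a, b in zip(clusters, clusters[1:])]
        return [lo, *cuts, hi]
    if not allow_fallback:
        raise ValueError(f"found {len(clusters)} clusters, need {min_bins}")
    step = (hi - lo) / min_bins
    return [lo + i * step for i in range(min_bins)] + [hi]


values = [0, 1, 2, 3, 50, 51, 52, 53, 90]
clustered_edges = fit_edges(values, eps=1.5, min_samples=3, min_bins=2)
fallback_edges = fit_edges(values, eps=1.5, min_samples=3, min_bins=3)
print(clustered_edges)  # [0, 26.5, 90]
print(fallback_edges)   # [0.0, 30.0, 60.0, 90]
```

With two clusters found, `min_bins=2` is satisfied and the single cut lands between them; asking for `min_bins=3` triggers the equal-width fallback over the same range.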

This implementation follows the clean binlearn architecture with straight inheritance, dynamic column resolution, and parameter reconstruction capabilities.

Parameters:

eps – The maximum distance between two samples for them to be considered part of the same neighborhood. This is the key parameter controlling cluster density: smaller values create more, smaller clusters, while larger values merge clusters together. If None, uses configuration default.
min_samples – The minimum number of samples in a neighborhood for a point to be considered as a core point (including the point itself). Controls the minimum cluster size. If None, uses configuration default.
min_bins – Minimum number of bins to create. If DBSCAN produces fewer clusters, falls back to equal-width binning. Must be at least 1. If None, uses configuration default.
allow_fallback – Whether to fall back to equal-width binning when DBSCAN produces fewer clusters than min_bins. If True (default), uses equal-width binning as fallback with a warning. If False, raises an error when insufficient clusters are found. If None, uses configuration default.
clip – Whether to clip values outside the fitted range to the nearest bin edge. If None, uses configuration default.
preserve_dataframe – Whether to preserve pandas DataFrame structure in transform operations. If None, uses configuration default.
fit_jointly – Whether to fit all columns together (False for DBSCAN - always fits columns independently). If None, uses configuration default.
bin_edges – Pre-computed bin edges for reconstruction. Should not be provided during normal usage.
bin_representatives – Pre-computed bin representatives for reconstruction. Should not be provided during normal usage.
class_ – Class name for reconstruction compatibility. Internal use only.
module_ – Module name for reconstruction compatibility. Internal use only.

eps: Maximum distance for neighborhood definition

min_samples: Minimum samples for core point definition

min_bins: Minimum number of bins to ensure

allow_fallback: Whether to fall back to equal-width binning when needed

Example

>>> import numpy as np
>>> from binlearn.methods import DBSCANBinning
>>>
>>> # Create sample data with natural clusters
>>> data = np.concatenate([
...     np.random.normal(0, 0.5, 100),    # First cluster
...     np.random.normal(5, 0.8, 150),    # Second cluster
...     np.random.normal(10, 0.3, 80)     # Third cluster
... ])
>>>
>>> # Initialize DBSCAN binning
>>> binner = DBSCANBinning(eps=0.8, min_samples=10, min_bins=3)
>>>
>>> # Fit and transform
>>> X = data.reshape(-1, 1)
>>> binner.fit(X)
>>> X_binned = binner.transform(X)
>>>
>>> # Check identified bins
>>> print(f"Number of bins: {len(binner.bin_edges_[0]) - 1}")
>>> print(f"Bin edges: {binner.bin_edges_[0]}")

Note

DBSCAN is particularly effective for data with natural density-based clusters
The eps parameter requires careful tuning based on data scale and density
Noise points (outliers) identified by DBSCAN are included in boundary bins
Falls back to equal-width binning if insufficient clusters are found
Each column is processed independently (unsupervised approach)
Requires at least min_samples finite values per column for clustering

See also

KMeansBinning: Alternative clustering-based binning with fixed cluster count
EqualWidthBinning: Simple equal-width interval binning
GaussianMixtureBinning: Probabilistic clustering-based binning

Initialize DBSCAN binning with clustering parameters.

Sets up DBSCAN clustering-based binning with specified parameters. Applies configuration defaults for any unspecified parameters and validates the resulting configuration.

Parameters:

eps – Maximum distance between two samples for neighborhood definition. Controls cluster density - smaller values create tighter, more numerous clusters. Must be positive. If None, uses configuration default.
min_samples – Minimum number of samples in a neighborhood for a core point. Controls minimum cluster size and noise tolerance. Must be positive integer. If None, uses configuration default.
min_bins – Minimum number of bins to ensure. If DBSCAN produces fewer clusters, falls back to equal-width binning. Must be at least 1. If None, uses configuration default.
allow_fallback – Whether to fall back to equal-width binning when DBSCAN produces fewer clusters than min_bins. If True (default), uses equal-width binning as fallback with a warning. If False, raises an error when insufficient clusters are found. If None, uses configuration default.
clip – Whether to clip transformed values outside the fitted range to the nearest bin edge. If None, uses configuration default.
preserve_dataframe – Whether to preserve pandas DataFrame structure in transform operations. If None, uses configuration default.
fit_jointly – Whether to fit all columns together. Always False for DBSCAN as it processes columns independently. If None, uses configuration default.
bin_edges – Pre-computed bin edges dictionary for reconstruction. Internal use only - should not be provided during normal initialization.
bin_representatives – Pre-computed representatives dictionary for reconstruction. Internal use only.
class_ – Class name string for reconstruction compatibility. Internal use only.
module_ – Module name string for reconstruction compatibility. Internal use only.

Example

>>> # Standard initialization with custom parameters
>>> binner = DBSCANBinning(eps=0.5, min_samples=8, min_bins=3)
>>>
>>> # Use configuration defaults
>>> binner = DBSCANBinning()
>>>
>>> # Custom clustering with clipping enabled
>>> binner = DBSCANBinning(
...     eps=1.2,
...     min_samples=15,
...     min_bins=4,
...     clip=True,
...     preserve_dataframe=True
... )

Note

Parameter validation occurs during initialization
Configuration defaults are applied for None parameters
Reconstruction parameters (bin_edges, bin_representatives, class_, module_) are used internally for object reconstruction and should not be provided during normal usage
The eps parameter is critical for DBSCAN performance and may require experimentation based on data characteristics
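
The "None means configuration default" rule and the constraints stated for each parameter can be sketched as follows. This is illustrative only: `CONFIG_DEFAULTS` and `resolve_params` are hypothetical names, the default values are placeholders, and `ValueError` is an assumed stand-in for the library's own error type.

```python
# Placeholder defaults for this sketch; binlearn reads its own from its configuration.
CONFIG_DEFAULTS = {"eps": 0.1, "min_samples": 5, "min_bins": 2}


def resolve_params(eps=None, min_samples=None, min_bins=None):
    """Fill None values from the defaults, then validate the result."""
    given = {"eps": eps, "min_samples": min_samples, "min_bins": min_bins}
    params = {
        name: CONFIG_DEFAULTS[name] if value is None else value
        for name, value in given.items()
    }
    if not params["eps"] > 0:
        raise ValueError("eps must be positive")
    if not (isinstance(params["min_samples"], int) and params["min_samples"] > 0):
        raise ValueError("min_samples must be a positive integer")
    if not (isinstance(params["min_bins"], int) and params["min_bins"] >= 1):
        raise ValueError("min_bins must be at least 1")
    return params


resolved = resolve_params(eps=0.5)
print(resolved)  # {'eps': 0.5, 'min_samples': 5, 'min_bins': 2}
```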

classmethod __init_subclass__(**kwargs)

Set the set_{method}_request methods.

This uses PEP-487 [1] to set the set_{method}_request methods. It reads the default request values, which are either set through __metadata_request__* class attributes or inferred from method signatures.

The __metadata_request__* class attributes are used when a method does not explicitly accept metadata through its arguments, or when the developer would like to specify a request value for that metadata which differs from the default None.

References

[1] PEP 487, "Simpler customisation of class creation"

static check_data_quality(data: ndarray[Any, Any], name: str = 'data') → None: Check data quality and issue warnings if needed.

property feature_names_in_: list[str] | None: Get feature names.

fit(X: Any, y: Any | None = None, **fit_params: Any) → GeneralBinningBase

Fit the binning transformer with comprehensive orchestration.

This method orchestrates the complete fitting process, handling parameter validation, input preprocessing, column separation, and routing to the appropriate fitting strategy (joint vs independent).

Parameters:

X – Input data to fit the binning transformer on. Can be:
- pandas.DataFrame: Column names are preserved
- polars.DataFrame: Column names are preserved
- numpy.ndarray: Numeric column indices are used
- array-like: Converted to numpy array
y – Target values for supervised binning methods. Ignored by unsupervised methods. Can be array-like or None.
**fit_params – Additional fitting parameters passed to the specific binning algorithm implementation. Common parameters include:
- guidance_data: Alternative guidance data (conflicts with fit_jointly=True)

Returns:

The fitted binning transformer instance.

Return type:

self

Raises:

ValueError – If parameter validation fails, inputs are invalid, or conflicting parameters are provided (e.g., fit_jointly=True with guidance_data).
BinningError – If the binning algorithm fails to fit the data.
RuntimeError – If an unexpected error occurs during fitting.

Example

>>> from binlearn import EqualWidthBinning
>>> import pandas as pd
>>> X = pd.DataFrame({'feature1': [1, 2, 3, 4, 5], 'feature2': [10, 20, 30, 40, 50]})
>>> binner = EqualWidthBinning(n_bins=3)
>>> binner.fit(X)
EqualWidthBinning(...)

Note

The method automatically handles column separation when guidance_columns is specified, routing guidance columns separately from binning columns. The fitting strategy (joint vs independent) is determined by the fit_jointly parameter.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray of shape (n_samples, n_features_new)

get_input_columns() → list[Any] | None

Get input columns for data preparation.

This method should be overridden by derived classes to provide appropriate column information without exposing binning-specific concepts.

Returns:

Column information or None if not available

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:

routing – A MetadataRequest encapsulating routing information.

Return type:

MetadataRequest

get_params(deep: bool = True) → dict[str, Any]

Get parameters for this estimator, including fitted parameters.

This method extends sklearn’s standard get_params to include fitted parameters when the estimator is fitted, enabling complete object reconstruction through the get_params/set_params interface. This is essential for pipeline persistence and model serialization.

Parameters:

deep – If True, returns parameters for sub-estimators (not applicable here but maintained for sklearn compatibility).

Returns:

Dictionary of parameter names mapped to their values, including:

Constructor parameters extracted from __init__ signature
Fitted parameters (if estimator is fitted) mapped from attributes
Class metadata (class_, module_) for automatic reconstruction

Return type:

dict[str, Any]

Example

>>> binner = EqualWidthBinning(n_bins=5)
>>> params = binner.get_params()
>>> print(params)
{'n_bins': 5, 'clip': None, ..., 'class_': 'EqualWidthBinning', 'module_': '...'}
>>>
>>> binner.fit(X)
>>> fitted_params = binner.get_params()
>>> # Now includes: {'bin_edges': {...}, 'bin_representatives': {...}, ...}

Note

Automatically extracts constructor parameters from __init__ signature
Includes fitted parameters only when estimator is fitted
Adds class metadata for reconstruction workflows
Excludes internal sklearn attributes like n_features_in_
class_ and module_ parameters are handled specially during set_params
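
The convention described here (fitted state exposed without the trailing underscore, class metadata ignored on the way back in) can be illustrated with a toy class. `ToyBinner` is a hypothetical stand-in written for this sketch; it only mimics the naming convention and is not how binlearn implements it.

```python
class ToyBinner:
    """Toy stand-in that mimics the get_params/set_params convention."""

    def __init__(self, n_bins=3, bin_edges=None):
        self.n_bins = n_bins
        if bin_edges is not None:
            self.bin_edges_ = bin_edges  # reconstruction path: already fitted

    def fit(self, values):
        lo, hi = min(values), max(values)
        step = (hi - lo) / self.n_bins
        self.bin_edges_ = [lo + i * step for i in range(self.n_bins)] + [hi]
        return self

    def get_params(self):
        params = {"n_bins": self.n_bins}
        if hasattr(self, "bin_edges_"):
            params["bin_edges"] = self.bin_edges_  # fitted state, no underscore
        params["class_"] = type(self).__name__     # metadata for reconstruction
        return params

    def set_params(self, **params):
        params.pop("class_", None)                 # metadata is ignored here
        if "bin_edges" in params:
            self.bin_edges_ = params.pop("bin_edges")
        for name, value in params.items():
            setattr(self, name, value)
        return self


fitted = ToyBinner(n_bins=2).fit([0.0, 4.0, 10.0])
clone = ToyBinner().set_params(**fitted.get_params())
print(clone.n_bins, clone.bin_edges_)  # 2 [0.0, 5.0, 10.0]
```

The clone carries the fitted edges without ever seeing the training data, which is the property that makes the round trip useful for persistence.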

inverse_transform(X: Any) → Any

Inverse transform from bin indices back to representative values.

Converts discrete bin indices back to their representative values, effectively reversing the binning transformation. This is useful for interpreting results or reconstructing approximate original values.

Parameters:

X – Input data containing bin indices to inverse transform. Should contain only binning columns (no guidance columns). Can be:
- pandas.DataFrame: Column names should match binning columns
- polars.DataFrame: Column names should match binning columns
- numpy.ndarray: Must have same number of binning columns
- array-like: Converted to numpy array

Returns:

Inverse transformed data where bin indices are replaced with their representative values (typically bin centers). Output format matches the preserve_dataframe setting.

Raises:

RuntimeError – If the transformer has not been fitted yet.
ValueError – If input data has wrong number of columns or invalid format.
BinningError – If inverse transformation fails.

Example

>>> # After fitting and transforming
>>> X_binned = [[0, 1], [1, 0], [2, 2]]  # Bin indices
>>> X_reconstructed = binner.inverse_transform(X_binned)
>>> print(X_reconstructed)
[[0.5, 1.5], [1.5, 0.5], [2.5, 2.5]]  # Representative values

Note

For guided binning (when guidance_columns is specified), the input should only contain the binning columns, not the guidance columns. The number of input columns must match the number of binning columns.
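
Conceptually, the inverse transform is a lookup from bin index to representative value. The sketch below assumes representatives are bin centers (the "typical" case mentioned above) and uses hypothetical helper names; it handles a single column and is not binlearn's code.

```python
def bin_centers(edges):
    """Midpoint of each interval defined by consecutive edges."""
    return [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]


def inverse_transform_column(indices, representatives):
    """Replace each bin index with that bin's representative value."""
    return [representatives[i] for i in indices]


edges = [0.0, 1.0, 2.0, 3.0]
representatives = bin_centers(edges)
restored = inverse_transform_column([0, 1, 2, 1], representatives)
print(restored)  # [0.5, 1.5, 2.5, 1.5]
```

The restored values are approximations: every value that fell into a bin comes back as the same representative.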

property n_features_in_: int: Get number of features.

set_output(*, transform=None)

Set output container.

See Introducing the set_output API for an example on how to use the API.

Parameters:

transform ({"default", "pandas", "polars"}, default=None) –

Configure output of transform and fit_transform.

"default": Default output format of a transformer
"pandas": DataFrame output
"polars": Polars output
None: Transform configuration is unchanged

Added in version 1.4: “polars” option was added.

Returns:

self – Estimator instance.

Return type:

estimator instance

set_params(**params: Any) → SklearnIntegrationBase

Set the parameters of this estimator.

This method supports reconstruction workflows by handling fitted parameters that come from get_params() output (without underscores) and setting them as fitted attributes (with underscores).

Parameters:

**params – Parameters to set. Can include:
- Regular constructor parameters (n_bins, clip, etc.)
- Fitted parameters from get_params (bin_edges, bin_representatives)
- Class metadata (ignored during reconstruction)

Returns:

The instance itself.

Return type:

self

transform(X: Any) → Any

Transform input data using fitted binning parameters.

Applies the fitted binning transformation to new data, converting continuous values to discrete bin indices or representatives. Handles column separation when guidance columns are present.

Parameters:

X – Input data to transform. Must have the same structure as the data used during fitting (same number of columns). Can be:
- pandas.DataFrame: Column names should match training data
- polars.DataFrame: Column names should match training data
- numpy.ndarray: Must have same number of columns as training
- array-like: Converted to numpy array

Returns:

Transformed data where continuous values are replaced with bin indices or representative values. The output format depends on:
- preserve_dataframe setting: DataFrame vs array format
- binning method: indices vs representatives
- guidance_columns: only binning columns are transformed

Raises:

RuntimeError – If the transformer has not been fitted yet.
ValueError – If the input data has incompatible structure or format.
BinningError – If transformation fails due to data issues.

Example

>>> # After fitting
>>> X_new = pd.DataFrame({'feature1': [1.5, 2.5], 'feature2': [15, 25]})
>>> X_binned = binner.transform(X_new)
>>> print(X_binned)
[[0, 0], [1, 1]]  # Bin indices

Note

When guidance_columns is specified, only the binning columns are transformed. Guidance columns are filtered out from the output. The method preserves the original data format when preserve_dataframe=True.
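
For a single column, mapping a value to a bin index given fitted edges can be sketched with a binary search. Everything specific here is an assumption of the sketch rather than documented binlearn behaviour: the function name, which side of each interval is closed, and the use of `None` to mark out-of-range values when clipping is off.

```python
from bisect import bisect_right


def to_bin_index(value, edges, clip=True):
    """Return the index of the bin containing value, given sorted edges."""
    n_bins = len(edges) - 1
    if value < edges[0] or value > edges[-1]:
        if not clip:
            return None  # sketch-only marker for "outside the fitted range"
        value = min(max(value, edges[0]), edges[-1])  # clip to nearest edge
    # A value equal to an inner edge goes to the bin on its right;
    # the final edge is folded into the last bin.
    return min(bisect_right(edges, value) - 1, n_bins - 1)


edges = [0.0, 26.5, 90.0]
indices = [to_bin_index(v, edges) for v in (-5.0, 0.0, 26.5, 40.0, 90.0, 120.0)]
print(indices)  # [0, 0, 1, 1, 1, 1]
```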

static validate_array_like(data: Any, name: str = 'data', allow_none: bool = False) → ndarray[Any, Any] | None

Validate and convert array-like input to numpy array.

This method provides robust validation and conversion of various input formats to numpy arrays, with comprehensive error handling and helpful suggestions for common issues.

Parameters:

data – Input data to validate and convert. Can be:
- numpy.ndarray: Used directly
- pandas.DataFrame/Series: Converted to numpy array
- polars.DataFrame: Converted to numpy array
- list, tuple: Converted to numpy array
- None: Allowed only if allow_none=True
name – Name of the data parameter for error messages. Used to provide context in error messages (e.g., “X”, “y”, “guidance_data”).
allow_none – Whether to allow None as a valid input. If True, None is returned unchanged; if False, None raises InvalidDataError.

Returns:

Validated numpy array, or None if data is None and allow_none=True. The returned array maintains the same data content but is guaranteed to be a numpy array.

Raises:

InvalidDataError – If validation fails:
- data is None when allow_none=False
- data cannot be converted to numpy array
- Conversion process encounters errors

Example

>>> # Valid inputs
>>> arr = ValidationMixin.validate_array_like([1, 2, 3], "X")
>>> print(type(arr))
<class 'numpy.ndarray'>
>>>
>>> # Allow None
>>> result = ValidationMixin.validate_array_like(None, "y", allow_none=True)
>>> print(result)
None
>>>
>>> # Invalid input
>>> ValidationMixin.validate_array_like(None, "X", allow_none=False)
InvalidDataError: X cannot be None

Note

This method focuses on format validation and conversion. Content validation (like checking for NaN values) should be done separately using other validation methods.

static validate_column_specification(columns: Any, data_shape: tuple[int, ...]) → list[Any]: Validate column specifications.

static validate_guidance_columns(guidance_cols: Any, binning_cols: list[Any], data_shape: tuple[int, ...]) → list[Any]: Validate guidance column specifications.

Overview

DBSCANBinning creates bins based on DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering of each feature. The bin edges are determined by the natural cluster boundaries identified by DBSCAN, which groups densely connected values together while handling outliers as noise.

This approach is particularly useful when:

Your data has natural density-based clusters
You want to identify and handle outliers automatically
You need clustering that adapts to arbitrary cluster shapes
You want bins that reflect the local density structure of your data

Key Features

Density-Based Clustering: Uses DBSCAN for robust density-based clustering
Outlier Detection: Automatically identifies and handles outliers as noise
Arbitrary Shapes: Can find clusters of any shape (not just spherical)
Parameter Control: Fine-tune clustering with eps and min_samples parameters
Fallback Strategy: Uses equal-width binning when insufficient clusters are found
Sklearn Compatibility: Full transformer interface with fit/transform methods
DataFrame Support: Preserves pandas/polars column names and structure

Basic Usage

import numpy as np
import pandas as pd
from binlearn.methods import DBSCANBinning

# Create sample data with clusters and outliers
np.random.seed(42)
cluster1 = np.random.normal(10, 1, 100)
cluster2 = np.random.normal(25, 1.5, 80)
outliers = np.random.uniform(0, 40, 20)  # Scattered outliers
data = np.concatenate([cluster1, cluster2, outliers])

# Apply DBSCAN binning
binner = DBSCANBinning(eps=2.0, min_samples=5)
data_binned = binner.fit_transform(data.reshape(-1, 1))

print(f"Bin edges: {binner.bin_edges_[0]}")
print(f"Original data shape: {data.shape}")
print(f"Binned data shape: {data_binned.shape}")

DataFrame Example

# DataFrame usage with multiple features
df = pd.DataFrame({
    'feature1': np.concatenate([
        np.random.normal(10, 2, 150),
        np.random.normal(30, 2, 150),
        np.random.uniform(0, 40, 30)  # outliers
    ]),
    'feature2': np.concatenate([
        np.random.normal(5, 1, 150),
        np.random.normal(15, 1, 150),
        np.random.uniform(0, 20, 30)  # outliers
    ])
})

binner = DBSCANBinning(
    eps=3.0,
    min_samples=10,
    min_bins=2,
    preserve_dataframe=True
)
df_binned = binner.fit_transform(df)

print(f"Bin edges for feature1: {binner.bin_edges_['feature1']}")
print(f"Bin edges for feature2: {binner.bin_edges_['feature2']}")

Advanced Configuration

# Fine-tuned DBSCAN parameters for different data characteristics

# For dense, well-separated clusters
dense_binner = DBSCANBinning(
    eps=0.5,           # Small neighborhood
    min_samples=10,    # Require more points for core samples
    min_bins=3         # Minimum number of bins
)

# For sparse data with loose clusters
sparse_binner = DBSCANBinning(
    eps=5.0,           # Larger neighborhood
    min_samples=3,     # Fewer points needed for core samples
    min_bins=2,        # Accept fewer bins
    clip=True          # Clip outliers to bin edges
)

Parameter Tuning Example

# Visualize different parameter effects
import matplotlib.pyplot as plt

# Test different eps values
eps_values = [0.5, 1.0, 2.0, 4.0]

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes = axes.flatten()

for i, eps in enumerate(eps_values):
    binner = DBSCANBinning(eps=eps, min_samples=5)
    data_binned = binner.fit_transform(data.reshape(-1, 1))

    axes[i].hist(data_binned.flatten(), bins=20, alpha=0.7)
    axes[i].set_title(f'eps={eps}, bins={len(binner.bin_edges_[0])-1}')

plt.tight_layout()
plt.show()

Scikit-learn Pipeline Integration

from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Create a pipeline with DBSCAN binning
pipeline = Pipeline([
    ('binning', DBSCANBinning(eps=2.0, min_samples=5)),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Synthetic data stands in for your own feature matrix X and labels y
X, y = make_classification(n_samples=500, n_features=4, random_state=42)

# Use in ML workflow
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)

Parameter Guide

eps (float, default=0.1)

The maximum distance between two samples for one to be considered in the neighborhood of the other. This is the most important DBSCAN parameter:

Small values: More restrictive clustering, more bins
Large values: More permissive clustering, fewer bins
Rule of thumb: Start with the standard deviation of your data
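
The effect of eps on the number of clusters can be seen on a toy sample. The sketch below is a plain-Python 1D cluster count written for illustration (the name `count_clusters` is hypothetical and this is not binlearn's code). It also shows a caveat to the "small values, more bins" guideline: when eps is too small, no point has enough neighbours and no cluster forms at all.

```python
from bisect import bisect_left, bisect_right


def count_clusters(values, eps, min_samples):
    """Count DBSCAN clusters in 1D data (core points chained within eps)."""
    xs = sorted(values)
    cores = [
        x for x in xs
        if bisect_right(xs, x + eps) - bisect_left(xs, x - eps) >= min_samples
    ]
    # A new cluster starts wherever the gap to the previous core exceeds eps.
    return sum(1 for i, x in enumerate(cores) if i == 0 or x - cores[i - 1] > eps)


values = [0, 1, 2, 3, 6, 7, 8, 9, 20, 21, 22, 23]
counts = {
    eps: count_clusters(values, eps, min_samples=3)
    for eps in (0.5, 1.5, 3.0, 12.0)
}
print(counts)  # {0.5: 0, 1.5: 3, 3.0: 2, 12.0: 1}
```

In the zero-cluster case the fallback to equal-width binning would apply, so a very small eps does not by itself guarantee many density-based bins.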

min_samples (int, default=5)

The number of samples in a neighborhood for a point to be considered as a core point:

Higher values: More restrictive clustering, fewer but denser clusters
Lower values: More permissive clustering, more clusters but potentially noisier
Rule of thumb: A common heuristic is 2 * dimensions; because each column is clustered on its own (one dimension), use at least 3

min_bins (int, default=2)

Minimum number of bins to create. If DBSCAN produces fewer clusters than this, equal-width binning is used as a fallback strategy.

Handling Edge Cases

# When DBSCAN finds insufficient clusters
sparse_data = np.random.uniform(0, 100, 50).reshape(-1, 1)

binner = DBSCANBinning(
    eps=1.0,
    min_samples=5,
    min_bins=3  # Fallback to equal-width if < 3 clusters found
)

# Will use equal-width binning as fallback for sparse data
data_binned = binner.fit_transform(sparse_data)
print(f"Used fallback strategy: {len(binner.bin_edges_[0]) - 1} bins created")

Tips for Parameter Selection

Start with data exploration:

# Analyze data distribution first
print(f"Data std: {np.std(data)}")
print(f"Data range: {np.max(data) - np.min(data)}")

# Start with eps ≈ std(data)
suggested_eps = np.std(data)

Use elbow method for eps:

from sklearn.neighbors import NearestNeighbors

# Find optimal eps using k-distance plot
neighbors = NearestNeighbors(n_neighbors=5)
neighbors_fit = neighbors.fit(data.reshape(-1, 1))
distances, indices = neighbors_fit.kneighbors(data.reshape(-1, 1))

# Plot sorted distances to find "elbow"
# (column 0 is each point's zero distance to itself, so column 4 is the 4th neighbor)
distances = np.sort(distances[:, 4], axis=0)
plt.plot(distances)
plt.ylabel("4th Nearest Neighbor Distance")
plt.xlabel("Data Points sorted by distance")
plt.show()
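
To make clear what the k-distance curve contains, here is a dependency-free sketch that computes it for a small 1D sample. The helper name `k_distances` is hypothetical, and the quadratic loop is only suitable for tiny inputs. Since min_samples counts the point itself, k = min_samples - 1 is used (k=2 for min_samples=3).

```python
def k_distances(values, k):
    """Sorted distance from each value to its k-th nearest other value."""
    result = []
    for i, x in enumerate(values):
        others = sorted(abs(x - y) for j, y in enumerate(values) if j != i)
        result.append(others[k - 1])
    return sorted(result)


values = [0, 1, 2, 3, 50, 51, 52, 53, 90]
curve = k_distances(values, k=2)
print(curve)  # [1, 1, 1, 1, 2, 2, 2, 2, 38]
```

The curve stays flat at 1 to 2 and then jumps to 38 for the isolated value; an eps just above the flat part (for example 2.5) would keep the two dense groups as clusters and leave the isolated point as noise.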
