DBSCANBinning ============= .. currentmodule:: binlearn.methods .. autoclass:: DBSCANBinning :members: :inherited-members: :show-inheritance: Overview -------- ``DBSCANBinning`` creates bins based on DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering of each feature. The bin edges are determined by the natural cluster boundaries identified by DBSCAN, which groups densely connected values together while handling outliers as noise. This approach is particularly useful when: * Your data has natural density-based clusters * You want to identify and handle outliers automatically * You need clustering that adapts to arbitrary cluster shapes * You want bins that reflect the local density structure of your data Key Features ------------ * **Density-Based Clustering**: Uses DBSCAN for robust density-based clustering * **Outlier Detection**: Automatically identifies and handles outliers as noise * **Arbitrary Shapes**: Can find clusters of any shape (not just spherical) * **Parameter Control**: Fine-tune clustering with eps and min_samples parameters * **Fallback Strategy**: Uses equal-width binning when insufficient clusters are found * **Sklearn Compatibility**: Full transformer interface with fit/transform methods * **DataFrame Support**: Preserves pandas/polars column names and structure Basic Usage ----------- .. code-block:: python import numpy as np import pandas as pd from binlearn.methods import DBSCANBinning # Create sample data with clusters and outliers np.random.seed(42) cluster1 = np.random.normal(10, 1, 100) cluster2 = np.random.normal(25, 1.5, 80) outliers = np.random.uniform(0, 40, 20) # Scattered outliers data = np.concatenate([cluster1, cluster2, outliers]) # Apply DBSCAN binning binner = DBSCANBinning(eps=2.0, min_samples=5) data_binned = binner.fit_transform(data.reshape(-1, 1)) print(f"Bin edges: {binner.bin_edges_[0]}") print(f"Original data shape: {data.shape}") print(f"Binned data shape: {data_binned.shape}") DataFrame Example ----------------- .. code-block:: python # DataFrame usage with multiple features df = pd.DataFrame({ 'feature1': np.concatenate([ np.random.normal(10, 2, 150), np.random.normal(30, 2, 150), np.random.uniform(0, 40, 30) # outliers ]), 'feature2': np.concatenate([ np.random.normal(5, 1, 150), np.random.normal(15, 1, 150), np.random.uniform(0, 20, 30) # outliers ]) }) binner = DBSCANBinning( eps=3.0, min_samples=10, min_bins=2, preserve_dataframe=True ) df_binned = binner.fit_transform(df) print(f"Bin edges for feature1: {binner.bin_edges_['feature1']}") print(f"Bin edges for feature2: {binner.bin_edges_['feature2']}") Advanced Configuration ---------------------- .. code-block:: python # Fine-tuned DBSCAN parameters for different data characteristics # For dense, well-separated clusters dense_binner = DBSCANBinning( eps=0.5, # Small neighborhood min_samples=10, # Require more points for core samples min_bins=3 # Minimum number of bins ) # For sparse data with loose clusters sparse_binner = DBSCANBinning( eps=5.0, # Larger neighborhood min_samples=3, # Fewer points needed for core samples min_bins=2, # Accept fewer bins clip=True # Clip outliers to bin edges ) Parameter Tuning Example ------------------------ .. code-block:: python # Visualize different parameter effects import matplotlib.pyplot as plt # Test different eps values eps_values = [0.5, 1.0, 2.0, 4.0] fig, axes = plt.subplots(2, 2, figsize=(12, 8)) axes = axes.flatten() for i, eps in enumerate(eps_values): binner = DBSCANBinning(eps=eps, min_samples=5) data_binned = binner.fit_transform(data.reshape(-1, 1)) axes[i].hist(data_binned.flatten(), bins=20, alpha=0.7) axes[i].set_title(f'eps={eps}, bins={len(binner.bin_edges_[0])-1}') plt.tight_layout() plt.show() Scikit-learn Pipeline Integration --------------------------------- .. code-block:: python from sklearn.pipeline import Pipeline from sklearn.ensemble import RandomForestClassifier from sklearn.model_selection import train_test_split # Create a pipeline with DBSCAN binning pipeline = Pipeline([ ('binning', DBSCANBinning(eps=2.0, min_samples=5)), ('classifier', RandomForestClassifier(random_state=42)) ]) # Use in ML workflow X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) pipeline.fit(X_train, y_train) accuracy = pipeline.score(X_test, y_test) Parameter Guide --------------- **eps** (float, default=0.1) The maximum distance between two samples for one to be considered in the neighborhood of the other. This is the most important DBSCAN parameter: * Small values: More restrictive clustering, more bins * Large values: More permissive clustering, fewer bins * Rule of thumb: Start with the standard deviation of your data **min_samples** (int, default=5) The number of samples in a neighborhood for a point to be considered as a core point: * Higher values: More restrictive clustering, fewer but denser clusters * Lower values: More permissive clustering, more clusters but potentially noisier * Rule of thumb: Use 2 * dimensions for 2D data, or at least 3 **min_bins** (int, default=2) Minimum number of bins to create. If DBSCAN produces fewer clusters than this, equal-width binning is used as a fallback strategy. Handling Edge Cases ------------------- .. code-block:: python # When DBSCAN finds insufficient clusters sparse_data = np.random.uniform(0, 100, 50).reshape(-1, 1) binner = DBSCANBinning( eps=1.0, min_samples=5, min_bins=3 # Fallback to equal-width if < 3 clusters found ) # Will use equal-width binning as fallback for sparse data data_binned = binner.fit_transform(sparse_data) print(f"Used fallback strategy: {len(binner.bin_edges_[0]) - 1} bins created") Tips for Parameter Selection ---------------------------- 1. **Start with data exploration**: .. code-block:: python # Analyze data distribution first print(f"Data std: {np.std(data)}") print(f"Data range: {np.max(data) - np.min(data)}") # Start with eps ≈ std(data) suggested_eps = np.std(data) 2. **Use elbow method for eps**: .. code-block:: python from sklearn.neighbors import NearestNeighbors # Find optimal eps using k-distance plot neighbors = NearestNeighbors(n_neighbors=5) neighbors_fit = neighbors.fit(data.reshape(-1, 1)) distances, indices = neighbors_fit.kneighbors(data.reshape(-1, 1)) # Plot sorted distances to find "elbow" distances = np.sort(distances[:, 4], axis=0) plt.plot(distances) plt.ylabel("4th Nearest Neighbor Distance") plt.xlabel("Data Points sorted by distance") See Also -------- * :class:`KMeansBinning` - K-means clustering-based binning * :class:`GaussianMixtureBinning` - Gaussian mixture model binning * :class:`EqualFrequencyBinning` - Quantile-based binning * :class:`TreeBinning` - Decision tree-based supervised binning