EqualWidthMinimumWeightBinning
- class binlearn.methods.EqualWidthMinimumWeightBinning(n_bins: int | str | None = None, minimum_weight: float | None = None, bin_range: tuple[float, float] | None = None, clip: bool | None = None, preserve_dataframe: bool | None = None, guidance_columns: Any | None = None, *, bin_edges: dict[Any, list[float]] | None = None, bin_representatives: dict[Any, list[float]] | None = None, class_: str | None = None, module_: str | None = None)[source]
Bases: SupervisedBinningBase
Equal-width binning with minimum weight constraint implementation using clean architecture.
Creates bins of equal width across the range of each feature, but adjusts the number of bins to ensure each bin contains at least the specified minimum total weight from the guidance column. This method combines the interpretability of equal-width binning with weight-based constraints for more balanced bins.
This approach is particularly valuable when working with weighted data where statistical significance or minimum sample requirements must be maintained within each bin. The algorithm starts with equal-width bins and then merges adjacent underweight bins until all remaining bins meet the minimum weight requirement.
The weight constraint helps ensure that:
- Each bin has sufficient statistical power for analysis
- Bins are meaningful for weighted modeling or evaluation
- Sparse regions in the data don’t create unreliable bins
- The resulting binning respects both spatial (equal-width) and statistical (weight) considerations
When no bins can meet the minimum weight requirement individually, the algorithm creates a single bin containing all data to maintain functionality.
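The merging behaviour described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not binlearn's actual implementation: the helper names (`equal_width_edges`, `bin_weights`, `merge_underweight`) are hypothetical, and the greedy left-to-right merge order is an assumption, since the text only says that adjacent underweight bins are merged.

```python
from bisect import bisect_right
from typing import List, Sequence


def equal_width_edges(lo: float, hi: float, n_bins: int) -> List[float]:
    """Return n_bins + 1 equally spaced edges covering [lo, hi]."""
    step = (hi - lo) / n_bins
    return [lo + i * step for i in range(n_bins)] + [hi]


def bin_weights(values: Sequence[float], weights: Sequence[float],
                edges: Sequence[float]) -> List[float]:
    """Sum the weights per bin; the last bin includes its right edge."""
    totals = [0.0] * (len(edges) - 1)
    for v, w in zip(values, weights):
        if v < edges[0] or v > edges[-1]:
            continue  # value lies outside the binning range
        idx = min(bisect_right(edges, v) - 1, len(totals) - 1)
        totals[idx] += w
    return totals


def merge_underweight(edges: Sequence[float], totals: Sequence[float],
                      minimum_weight: float) -> List[float]:
    """Greedy left-to-right merge (assumed strategy) of underweight bins."""
    merged = [edges[0]]
    acc = 0.0
    for right_edge, w in zip(edges[1:], totals):
        acc += w
        if acc >= minimum_weight:
            merged.append(right_edge)  # close the current merged bin
            acc = 0.0
    if merged[-1] != edges[-1]:
        if len(merged) > 1:
            merged[-1] = edges[-1]   # fold the underweight tail into the previous bin
        else:
            merged.append(edges[-1])  # nothing qualified: single bin over the range
    return merged


values = [5.0, 15.0, 25.0, 35.0, 45.0, 55.0, 65.0, 75.0, 85.0, 95.0]
weights = [3.0, 3.0, 1.0, 1.0, 1.0, 3.0, 3.0, 1.0, 1.0, 0.5]
edges = equal_width_edges(0.0, 100.0, 10)
totals = bin_weights(values, weights, edges)
print(merge_underweight(edges, totals, 3.0))
# prints [0.0, 10.0, 20.0, 50.0, 60.0, 100.0]
```

Every surviving edge is one of the original equal-width edges, so merged bins span whole multiples of the initial width. With a threshold that no bin can reach, the sketch falls back to the single bin `[lo, hi]`, mirroring the fallback described above.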
This implementation follows the clean binlearn architecture with straight inheritance, dynamic column resolution, and parameter reconstruction capabilities.
- Parameters:
n_bins – Initial number of equal-width bins to create before weight-based merging. Controls the granularity of the initial binning. Can be an integer or a string expression like ‘sqrt’, ‘log2’, etc. for dynamic calculation. Final number of bins may be smaller due to merging. If None, uses configuration default.
minimum_weight – Minimum total weight required per bin. Bins with lower total weight will be merged with adjacent bins until this requirement is met. Must be positive. If None, uses configuration default.
bin_range – Optional tuple specifying (min, max) range for binning. If provided, bins are created within this range rather than the data’s natural range. Useful for ensuring consistent binning across datasets. If None, uses data’s min/max values.
clip – Whether to clip values outside the fitted range to the nearest bin edge. If None, uses configuration default.
preserve_dataframe – Whether to preserve pandas DataFrame structure in transform operations. If None, uses configuration default.
guidance_columns – Column specification for weight/guidance data used in supervised binning. Should point to weight values for each sample.
bin_edges – Pre-computed bin edges for reconstruction. Should not be provided during normal usage.
bin_representatives – Pre-computed bin representatives for reconstruction. Should not be provided during normal usage.
class_ – Class name for reconstruction compatibility. Internal use only.
module_ – Module name for reconstruction compatibility. Internal use only.
- n_bins
Initial number of bins before merging
- minimum_weight
Minimum weight requirement per bin
- bin_range
Optional fixed range for binning
Example
>>> import numpy as np
>>> from binlearn.methods import EqualWidthMinimumWeightBinning
>>>
>>> # Create sample data with weights
>>> np.random.seed(42)
>>> X = np.random.uniform(0, 100, 1000).reshape(-1, 1)
>>> weights = np.random.exponential(2.0, 1000)  # Exponentially distributed weights
>>>
>>> # Initialize with minimum weight constraint
>>> binner = EqualWidthMinimumWeightBinning(
...     n_bins=10,
...     minimum_weight=50.0,
...     guidance_columns='weight'
... )
>>>
>>> # Fit with weight data
>>> binner.fit(X, weights.reshape(-1, 1))
>>> X_binned = binner.transform(X)
>>>
>>> # Check bin weights (the last bin includes its right edge)
>>> edges = binner.bin_edges_[0]
>>> for i, (left, right) in enumerate(zip(edges[:-1], edges[1:])):
...     if i < len(edges) - 2:
...         mask = (X >= left) & (X < right)
...     else:
...         mask = (X >= left) & (X <= right)
...     bin_weight = np.sum(weights[mask.flatten()])
...     print(f"Bin {i}: [{left:.1f}, {right:.1f}] weight: {bin_weight:.1f}")
Note
Requires guidance data containing weight values for each sample
Final number of bins may be less than n_bins due to merging underweight bins
All weights must be non-negative (negative weights raise ValueError)
Bins are merged by combining adjacent underweight bins
Creates a single bin if no individual bins can meet the weight requirement
Each column is processed independently with its corresponding weight data
Weight-based merging preserves the equal-width property where possible
See also
EqualWidthBinning: Standard equal-width binning without weight constraints
EqualFrequencyBinning: Equal-frequency binning for balanced sample counts
SupervisedBinningBase: Base class for supervised binning methods
References
This method extends standard equal-width binning with statistical adequacy constraints commonly used in risk modeling and weighted analysis scenarios.
- __init__(n_bins: int | str | None = None, minimum_weight: float | None = None, bin_range: tuple[float, float] | None = None, clip: bool | None = None, preserve_dataframe: bool | None = None, guidance_columns: Any | None = None, *, bin_edges: dict[Any, list[float]] | None = None, bin_representatives: dict[Any, list[float]] | None = None, class_: str | None = None, module_: str | None = None)[source]
Initialize Equal Width Minimum Weight binning with weight constraints.
Sets up equal-width binning with minimum weight constraints, combining spatial and statistical adequacy requirements. Applies configuration defaults for any unspecified parameters and validates the resulting configuration.
- Parameters:
n_bins – Initial number of equal-width bins to create before weight-based merging. Controls the granularity of the initial binning. Can be:
- Integer: Exact initial number of bins
- String: Dynamic calculation expression (‘sqrt’, ‘log2’, etc.)
Final number of bins may be smaller due to merging. Must be positive. If None, uses configuration default.
minimum_weight – Minimum total weight required per bin. Bins with total weight below this threshold will be merged with adjacent bins until the requirement is met. Must be positive. If None, uses configuration default.
bin_range – Optional tuple specifying (min_value, max_value) range for binning. If provided, equal-width bins are created within this range regardless of the actual data range. Useful for:
- Consistent binning across multiple datasets
- Excluding outliers from bin range calculation
- Domain-specific range constraints
Must be (min, max) where min < max. If None, uses data’s actual range.
clip – Whether to clip transformed values outside the fitted range to the nearest bin edge. If None, uses configuration default.
preserve_dataframe – Whether to preserve pandas DataFrame structure in transform operations. If None, uses configuration default.
guidance_columns – Column specification for weight/guidance data. Should point to columns containing weight values for each sample. Required for supervised binning during fit operations.
bin_edges – Pre-computed bin edges dictionary for reconstruction. Internal use only - should not be provided during normal initialization.
bin_representatives – Pre-computed representatives dictionary for reconstruction. Internal use only.
class_ – Class name string for reconstruction compatibility. Internal use only.
module_ – Module name string for reconstruction compatibility. Internal use only.
Example
>>> # Standard initialization with weight constraints
>>> binner = EqualWidthMinimumWeightBinning(
...     n_bins=8,
...     minimum_weight=100.0,
...     guidance_columns='sample_weight'
... )
>>>
>>> # Custom range with tighter weight requirements
>>> binner = EqualWidthMinimumWeightBinning(
...     n_bins=12,
...     minimum_weight=50.0,
...     bin_range=(0, 1000),
...     guidance_columns=['weight_column']
... )
>>>
>>> # Use configuration defaults
>>> binner = EqualWidthMinimumWeightBinning(
...     guidance_columns='weights'
... )
Note
Parameter validation occurs during initialization
Configuration defaults are applied for None parameters
The minimum_weight parameter is crucial for determining bin merging behavior
bin_range allows for consistent binning across datasets with different ranges
Guidance columns must point to weight data for the minimum weight constraint to work
Reconstruction parameters should not be provided during normal usage
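The constraints stated for the constructor parameters (positive `n_bins`, positive `minimum_weight`, `bin_range` with min < max) can be expressed as a small check. The function below is a hypothetical sketch of such validation, not the library's own routine; its name and error messages are assumptions.

```python
from typing import Optional, Tuple, Union


def validate_config(n_bins: Union[int, str],
                    minimum_weight: float,
                    bin_range: Optional[Tuple[float, float]] = None) -> None:
    """Raise ValueError when a documented constraint is violated (illustrative only)."""
    # Integer bin counts must be positive; string expressions are resolved elsewhere.
    if isinstance(n_bins, int) and n_bins < 1:
        raise ValueError("n_bins must be positive")
    if minimum_weight <= 0:
        raise ValueError("minimum_weight must be positive")
    if bin_range is not None:
        lo, hi = bin_range
        if not lo < hi:
            raise ValueError("bin_range must be (min, max) with min < max")


validate_config(8, 100.0)                 # passes silently
validate_config("sqrt", 10.0, (0, 1000))  # passes silently
```

Running the check once at construction time surfaces configuration mistakes before any data is seen, which matches the note that parameter validation occurs during initialization.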
- classmethod __init_subclass__(**kwargs)
Set the set_{method}_request methods.
This uses PEP-487 to set the set_{method}_request methods. It looks for the information available in the set default values which are set using __metadata_request__* class attributes, or inferred from method signatures.
The __metadata_request__* class attributes are used when a method does not explicitly accept a metadata through its arguments or if the developer would like to specify a request value for those metadata which are different from the default None.
- static check_data_quality(data: ndarray[Any, Any], name: str = 'data') None
Check data quality and issue warnings if needed.
- fit(X: Any, y: Any | None = None, **fit_params: Any) GeneralBinningBase
Fit the binning transformer with comprehensive orchestration.
This method orchestrates the complete fitting process, handling parameter validation, input preprocessing, column separation, and routing to the appropriate fitting strategy (joint vs independent).
- Parameters:
X – Input data to fit the binning transformer on. Can be:
- pandas.DataFrame: Column names are preserved
- polars.DataFrame: Column names are preserved
- numpy.ndarray: Numeric column indices are used
- array-like: Converted to numpy array
y – Target values for supervised binning methods. Ignored by unsupervised methods. Can be array-like or None.
**fit_params – Additional fitting parameters passed to the specific binning algorithm implementation. Common parameters include:
- guidance_data: Alternative guidance data (conflicts with fit_jointly=True)
- Returns:
The fitted binning transformer instance.
- Return type:
self
- Raises:
ValueError – If parameter validation fails, inputs are invalid, or conflicting parameters are provided (e.g., fit_jointly=True with guidance_data).
BinningError – If the binning algorithm fails to fit the data.
RuntimeError – If an unexpected error occurs during fitting.
Example
>>> from binlearn import EqualWidthBinning
>>> import pandas as pd
>>> X = pd.DataFrame({'feature1': [1, 2, 3, 4, 5], 'feature2': [10, 20, 30, 40, 50]})
>>> binner = EqualWidthBinning(n_bins=3)
>>> binner.fit(X)
EqualWidthBinning(...)
Note
The method automatically handles column separation when guidance_columns is specified, routing guidance columns separately from binning columns. The fitting strategy (joint vs independent) is determined by the fit_jointly parameter.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- get_input_columns() list[Any] | None
Get input columns for data preparation.
This method should be overridden by derived classes to provide appropriate column information without exposing binning-specific concepts.
- Returns:
Column information or None if not available
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
routing – A MetadataRequest encapsulating routing information.
- Return type:
MetadataRequest
- get_params(deep: bool = True) dict[str, Any]
Get parameters for this estimator, including fitted parameters.
This method extends sklearn’s standard get_params to include fitted parameters when the estimator is fitted, enabling complete object reconstruction through the get_params/set_params interface. This is essential for pipeline persistence and model serialization.
- Parameters:
deep – If True, returns parameters for sub-estimators (not applicable here but maintained for sklearn compatibility).
- Returns:
Dictionary of parameter names mapped to their values, including fitted parameters when the estimator is fitted.
- Return type:
dict[str, Any]
Example
>>> binner = EqualWidthBinning(n_bins=5)
>>> params = binner.get_params()
>>> print(params)
{'n_bins': 5, 'clip': None, ..., 'class_': 'EqualWidthBinning', 'module_': '...'}
>>>
>>> binner.fit(X)
>>> fitted_params = binner.get_params()
>>> # Now includes: {'bin_edges': {...}, 'bin_representatives': {...}, ...}
Note
Automatically extracts constructor parameters from __init__ signature
Includes fitted parameters only when estimator is fitted
Adds class metadata for reconstruction workflows
Excludes internal sklearn attributes like n_features_in_
class_ and module_ parameters are handled specially during set_params
- inverse_transform(X: Any) Any
Inverse transform from bin indices back to representative values.
Converts discrete bin indices back to their representative values, effectively reversing the binning transformation. This is useful for interpreting results or reconstructing approximate original values.
- Parameters:
X – Input data containing bin indices to inverse transform. Should contain only binning columns (no guidance columns). Can be:
- pandas.DataFrame: Column names should match binning columns
- polars.DataFrame: Column names should match binning columns
- numpy.ndarray: Must have same number of binning columns
- array-like: Converted to numpy array
- Returns:
Inverse transformed data where bin indices are replaced with their representative values (typically bin centers). Output format matches the preserve_dataframe setting.
- Raises:
RuntimeError – If the transformer has not been fitted yet.
ValueError – If input data has wrong number of columns or invalid format.
BinningError – If inverse transformation fails.
Example
>>> # After fitting and transforming
>>> X_binned = [[0, 1], [1, 0], [2, 2]]  # Bin indices
>>> X_reconstructed = binner.inverse_transform(X_binned)
>>> print(X_reconstructed)
[[0.5, 1.5], [1.5, 0.5], [2.5, 2.5]]  # Representative values
Note
For guided binning (when guidance_columns is specified), the input should only contain the binning columns, not the guidance columns. The number of input columns must match the number of binning columns.
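The index-to-representative mapping can be illustrated without the library. The sketch below assumes representatives are bin centers (the text says "typically bin centers") and uses hypothetical helper names; it is not binlearn's implementation.

```python
from typing import Dict, List, Sequence


def bin_centers(edges: Sequence[float]) -> List[float]:
    """Midpoint of each [left, right] interval, used here as the representative."""
    return [(left + right) / 2 for left, right in zip(edges[:-1], edges[1:])]


def inverse_rows(rows: Sequence[Sequence[int]],
                 representatives: Dict[int, List[float]]) -> List[List[float]]:
    """Replace each bin index with the representative of its column."""
    return [
        [representatives[col][idx] for col, idx in enumerate(row)]
        for row in rows
    ]


centers = bin_centers([0.0, 1.0, 2.0, 3.0])
reps = {0: centers, 1: centers}  # one list of representatives per binning column
print(inverse_rows([[0, 1], [1, 0], [2, 2]], reps))
# prints [[0.5, 1.5], [1.5, 0.5], [2.5, 2.5]]
```

The reconstruction is lossy by design: every value that fell into a bin maps back to the same representative.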
- set_output(*, transform=None)
Set output container.
See Introducing the set_output API for an example on how to use the API.
- Parameters:
transform ({"default", "pandas", "polars"}, default=None) – Configure output of transform and fit_transform.
- "default": Default output format of a transformer
- "pandas": DataFrame output
- "polars": Polars output
- None: Transform configuration is unchanged
Added in version 1.4: "polars" option was added.
- Returns:
self – Estimator instance.
- Return type:
estimator instance
- set_params(**params: Any) SklearnIntegrationBase
Set the parameters of this estimator.
This method supports reconstruction workflows by handling fitted parameters that come from get_params() output (without underscores) and setting them as fitted attributes (with underscores).
- Parameters:
**params – Parameters to set. Can include:
- Regular constructor parameters (n_bins, clip, etc.)
- Fitted parameters from get_params (bin_edges, bin_representatives)
- Class metadata (ignored during reconstruction)
- Returns:
Returns the instance itself.
- Return type:
self
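The reconstruction round trip through get_params and set_params can be pictured with a toy class. `TinyBinner` below is a hypothetical stand-in written for this illustration, not a binlearn class; it only mimics the documented convention that fitted values are exported without a trailing underscore and restored as underscore attributes, with class metadata ignored.

```python
from typing import Any, Dict


class TinyBinner:
    """Hypothetical stand-in that mimics the documented naming convention."""

    _FITTED = ("bin_edges", "bin_representatives")

    def __init__(self, n_bins: int = 10) -> None:
        self.n_bins = n_bins

    def get_params(self) -> Dict[str, Any]:
        params: Dict[str, Any] = {"n_bins": self.n_bins}
        for name in self._FITTED:
            if hasattr(self, name + "_"):      # only present once "fitted"
                params[name] = getattr(self, name + "_")
        params["class_"] = type(self).__name__  # metadata for reconstruction
        return params

    def set_params(self, **params: Any) -> "TinyBinner":
        for name, value in params.items():
            if name in ("class_", "module_"):
                continue                         # class metadata is ignored
            if name in self._FITTED:
                setattr(self, name + "_", value)  # restore as fitted attribute
            else:
                setattr(self, name, value)
        return self


fitted = TinyBinner(n_bins=5)
fitted.bin_edges_ = {0: [0.0, 50.0, 100.0]}  # pretend fit() produced this
clone = TinyBinner().set_params(**fitted.get_params())
print(clone.n_bins, clone.bin_edges_)
# prints 5 {0: [0.0, 50.0, 100.0]}
```

The point of the convention is that a plain dictionary is enough to rebuild a usable transformer, which is what makes serialization through parameter dictionaries possible.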
- transform(X: Any) Any
Transform input data using fitted binning parameters.
Applies the fitted binning transformation to new data, converting continuous values to discrete bin indices or representatives. Handles column separation when guidance columns are present.
- Parameters:
X – Input data to transform. Must have the same structure as the data used during fitting (same number of columns). Can be:
- pandas.DataFrame: Column names should match training data
- polars.DataFrame: Column names should match training data
- numpy.ndarray: Must have same number of columns as training
- array-like: Converted to numpy array
- Returns:
Transformed data where continuous values are replaced with bin indices or representative values. The output format depends on:
- preserve_dataframe setting: DataFrame vs array format
- binning method: indices vs representatives
- guidance_columns: only binning columns are transformed
- Raises:
RuntimeError – If the transformer has not been fitted yet.
ValueError – If the input data has incompatible structure or format.
BinningError – If transformation fails due to data issues.
Example
>>> # After fitting
>>> X_new = pd.DataFrame({'feature1': [1.5, 2.5], 'feature2': [15, 25]})
>>> X_binned = binner.transform(X_new)
>>> print(X_binned)
[[0, 0], [1, 1]]  # Bin indices
Note
When guidance_columns is specified, only the binning columns are transformed. Guidance columns are filtered out from the output. The method preserves the original data format when preserve_dataframe=True.
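How a single value is mapped to a bin index, and what clipping does to out-of-range values, can be sketched with the standard library. This is an illustration under assumptions: the helper name is hypothetical, and returning `None` for out-of-range values when `clip=False` is a placeholder choice, since the text does not say how the library marks such values.

```python
from bisect import bisect_right
from typing import Optional, Sequence


def to_bin_index(value: float, edges: Sequence[float],
                 clip: bool = True) -> Optional[int]:
    """Map a value to the index of the bin [edges[i], edges[i+1]) containing it."""
    n_bins = len(edges) - 1
    if value < edges[0]:
        return 0 if clip else None            # below range: clip to the first bin
    if value > edges[-1]:
        return n_bins - 1 if clip else None   # above range: clip to the last bin
    # The right-most edge belongs to the last bin, hence the min().
    return min(bisect_right(edges, value) - 1, n_bins - 1)


edges = [0.0, 10.0, 20.0, 30.0]
print([to_bin_index(v, edges) for v in (-5.0, 5.0, 10.0, 30.0, 35.0)])
# prints [0, 0, 1, 2, 2]
```

Binary search keeps the lookup logarithmic in the number of bins, and interior edges are assigned to the bin on their right so that every in-range value lands in exactly one bin.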
- static validate_array_like(data: Any, name: str = 'data', allow_none: bool = False) ndarray[Any, Any] | None
Validate and convert array-like input to numpy array.
This method provides robust validation and conversion of various input formats to numpy arrays, with comprehensive error handling and helpful suggestions for common issues.
- Parameters:
data – Input data to validate and convert. Can be:
- numpy.ndarray: Used directly
- pandas.DataFrame/Series: Converted to numpy array
- polars.DataFrame: Converted to numpy array
- list, tuple: Converted to numpy array
- None: Allowed only if allow_none=True
name – Name of the data parameter for error messages. Used to provide context in error messages (e.g., “X”, “y”, “guidance_data”).
allow_none – Whether to allow None as a valid input. If True, None is returned unchanged; if False, None raises InvalidDataError.
- Returns:
Validated numpy array, or None if data is None and allow_none=True. The returned array maintains the same data content but is guaranteed to be a numpy array.
- Raises:
InvalidDataError – If validation fails:
- data is None when allow_none=False
- data cannot be converted to numpy array
- Conversion process encounters errors
Example
>>> # Valid inputs
>>> arr = ValidationMixin.validate_array_like([1, 2, 3], "X")
>>> print(type(arr))
<class 'numpy.ndarray'>
>>>
>>> # Allow None
>>> result = ValidationMixin.validate_array_like(None, "y", allow_none=True)
>>> print(result)
None
>>>
>>> # Invalid input
>>> ValidationMixin.validate_array_like(None, "X", allow_none=False)
InvalidDataError: X cannot be None
Note
This method focuses on format validation and conversion. Content validation (like checking for NaN values) should be done separately using other validation methods.
- static validate_column_specification(columns: Any, data_shape: tuple[int, ...]) list[Any]
Validate column specifications.
- static validate_guidance_columns(guidance_cols: Any, binning_cols: list[Any], data_shape: tuple[int, ...]) list[Any]
Validate guidance column specifications.
- validate_guidance_data(guidance_data: Any, name: str = 'guidance_data') ndarray[Any, Any]
Validate and preprocess guidance data for supervised binning.
Ensures that the guidance data is appropriate for supervised binning by validating its shape and checking for data quality issues.
- Parameters:
guidance_data – Raw guidance/target data to validate. Should be a 2D array with shape (n_samples, 1) or 1D array with shape (n_samples,).
name – Name used in error messages for better debugging context.
- Returns:
Validated and preprocessed guidance data with shape (n_samples, 1).
- Raises:
ValidationError – If guidance data has invalid shape or format.
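The shape normalization described for guidance data, accepting `(n_samples,)` or `(n_samples, 1)` and returning `(n_samples, 1)`, can be sketched with nested lists. The function name is hypothetical and `ValueError` stands in for the library's ValidationError; the real method operates on numpy arrays and also runs data-quality checks not shown here.

```python
from typing import List, Sequence, Union

Number = Union[int, float]


def as_column(guidance: Sequence) -> List[List[float]]:
    """Return guidance data as a list of single-element rows, i.e. shape (n_samples, 1)."""
    rows = list(guidance)
    if all(isinstance(r, (int, float)) for r in rows):
        return [[float(r)] for r in rows]     # 1D input: wrap each value in a row
    if all(isinstance(r, (list, tuple)) and len(r) == 1 for r in rows):
        return [[float(r[0])] for r in rows]  # already (n_samples, 1)
    raise ValueError("guidance data must have shape (n_samples,) or (n_samples, 1)")


print(as_column([0.5, 1.0, 2.0]))
# prints [[0.5], [1.0], [2.0]]
print(as_column([[0.5], [1.0]]))
# prints [[0.5], [1.0]]
```

Normalizing to a single column up front lets the fitting code treat the weights uniformly, whatever form the caller passed them in.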
Overview
EqualWidthMinimumWeightBinning creates bins of equal width across the range of each feature,
but adjusts the number of bins to ensure each bin contains at least the specified minimum total
weight from the guidance column. This method combines the interpretability of equal-width binning
with weight-based constraints for more balanced bins.
This approach is particularly useful when:
You want interpretable equal-width bins but need weight balance
You have sample weights or importance scores that should be considered
You need to ensure statistical significance in each bin
You want to prevent empty or sparse bins in weighted scenarios
Key Features
Equal-Width Foundation: Starts with equal-width intervals for interpretability
Weight-Based Adjustment: Ensures minimum weight per bin from guidance column
Automatic Merging: Intelligently merges bins that don’t meet weight requirements
Flexible Weight Sources: Supports any numeric column as weight guidance
Range Control: Optional custom range specification for binning
Robust Validation: Comprehensive error handling and data validation
Sklearn Compatibility: Full transformer interface with fit/transform methods
DataFrame Support: Preserves pandas/polars column names and structure
Basic Usage
import numpy as np
import pandas as pd
from binlearn.methods import EqualWidthMinimumWeightBinning

# Create sample data with weights
np.random.seed(42)
X = np.random.uniform(0, 100, 1000).reshape(-1, 1)
weights = np.random.exponential(2, 1000)  # Some samples more important

# Apply equal-width binning with minimum weight constraint
binner = EqualWidthMinimumWeightBinning(
    n_bins=10,
    minimum_weight=50.0  # Each bin must have at least 50 total weight
)
X_binned = binner.fit_transform(X, guidance_data=weights)

print(f"Original shape: {X.shape}")
print(f"Binned shape: {X_binned.shape}")
print(f"Final number of bins: {len(binner.bin_edges_[0]) - 1}")
print(f"Bin edges: {binner.bin_edges_[0]}")
DataFrame Example with Weight Column
# Create DataFrame with features and weight column
df = pd.DataFrame({
    'income': np.random.lognormal(10, 1, 2000),
    'age': np.random.uniform(18, 80, 2000),
    'transaction_amount': np.random.exponential(100, 2000),
    'sample_weight': np.random.gamma(2, 2, 2000)  # Weights for each sample
})

# Bin the feature columns, using sample_weight as the weight column
income_binner = EqualWidthMinimumWeightBinning(
    guidance_columns=['sample_weight'],
    n_bins=8,
    minimum_weight=20.0,
    preserve_dataframe=True
)
df_binned = income_binner.fit_transform(df)

print(f"Income bins created: {len(income_binner.bin_edges_['income']) - 1}")
print(f"Income bin edges: {income_binner.bin_edges_['income']}")
Survey Data Example
# Example with survey data where response weights matter
survey_df = pd.DataFrame({
    'satisfaction_score': np.random.uniform(1, 10, 1500),
    'response_time': np.random.lognormal(3, 1, 1500),
    'respondent_weight': np.random.choice([0.5, 1.0, 1.5, 2.0], 1500)  # Survey weights
})

# Bin satisfaction ensuring each bin has sufficient weighted responses
satisfaction_binner = EqualWidthMinimumWeightBinning(
    guidance_columns=['respondent_weight'],
    n_bins=5,
    minimum_weight=50.0,  # At least 50 weighted responses per bin
    bin_range=(1, 10),    # Fixed range for satisfaction scores
    preserve_dataframe=True
)
survey_binned = satisfaction_binner.fit_transform(survey_df)

# Verify weights per bin
for i, (start, end) in enumerate(zip(satisfaction_binner.bin_edges_['satisfaction_score'][:-1],
                                     satisfaction_binner.bin_edges_['satisfaction_score'][1:])):
    mask = (survey_df['satisfaction_score'] >= start) & (survey_df['satisfaction_score'] < end)
    total_weight = survey_df.loc[mask, 'respondent_weight'].sum()
    print(f"Bin {i} [{start:.1f}, {end:.1f}): {total_weight:.1f} total weight")
Advanced Configuration
# Fine-tuned binning for different scenarios

# Conservative binning (higher weight requirements)
conservative_binner = EqualWidthMinimumWeightBinning(
    n_bins=6,
    minimum_weight=100.0,  # High weight requirement
    bin_range=(0, 1000),   # Fixed range
    clip=True              # Clip outliers
)

# Adaptive binning (lower weight requirements, more bins)
adaptive_binner = EqualWidthMinimumWeightBinning(
    n_bins="sqrt",        # Dynamic based on sample size
    minimum_weight=10.0,  # Lower weight requirement
    clip=False            # Preserve outliers
)
Financial Risk Example
# Financial data where exposure amounts act as weights
financial_df = pd.DataFrame({
    'credit_score': np.random.normal(700, 100, 10000),
    'loan_amount': np.random.lognormal(10, 1, 10000),
    'exposure': np.random.exponential(50000, 10000)  # Dollar exposure per loan
})

# Bin credit scores ensuring each bin has sufficient exposure
credit_binner = EqualWidthMinimumWeightBinning(
    guidance_columns=['exposure'],
    n_bins=10,
    minimum_weight=500000,  # At least $500K exposure per bin
    bin_range=(300, 850),   # Standard credit score range
    preserve_dataframe=True
)
financial_binned = credit_binner.fit_transform(financial_df)

# Analyze exposure distribution across bins
bin_stats = []
for i, (start, end) in enumerate(zip(credit_binner.bin_edges_['credit_score'][:-1],
                                     credit_binner.bin_edges_['credit_score'][1:])):
    mask = (financial_df['credit_score'] >= start) & (financial_df['credit_score'] < end)
    total_exposure = financial_df.loc[mask, 'exposure'].sum()
    avg_score = financial_df.loc[mask, 'credit_score'].mean()
    count = mask.sum()
    bin_stats.append({
        'bin': f"[{start:.0f}, {end:.0f})",
        'avg_score': avg_score,
        'count': count,
        'total_exposure': total_exposure,
        'avg_exposure': total_exposure / count if count > 0 else 0
    })

for stats in bin_stats:
    print(f"Bin {stats['bin']}: {stats['count']} loans, "
          f"${stats['total_exposure']:,.0f} exposure, "
          f"avg score {stats['avg_score']:.0f}")
Comparison with Standard Equal-Width
from binlearn.methods import EqualWidthBinning

# Compare standard equal-width vs. minimum weight equal-width
X_sample = np.random.exponential(2, 500).reshape(-1, 1)  # Skewed data
weights = np.random.exponential(1, 500)

# Standard equal-width binning
standard_binner = EqualWidthBinning(n_bins=8)
X_standard = standard_binner.fit_transform(X_sample)

# Equal-width with minimum weight
weighted_binner = EqualWidthMinimumWeightBinning(
    n_bins=8,
    minimum_weight=15.0
)
X_weighted = weighted_binner.fit_transform(X_sample, guidance_data=weights)

print("Standard equal-width:")
print(f" Bins created: {len(standard_binner.bin_edges_[0]) - 1}")
print(f" Bin edges: {standard_binner.bin_edges_[0]}")
print("\nWeight-constrained equal-width:")
print(f" Bins created: {len(weighted_binner.bin_edges_[0]) - 1}")
print(f" Bin edges: {weighted_binner.bin_edges_[0]}")

# Analyze weight distribution in standard bins
print("\nWeight distribution in standard bins:")
for i, (start, end) in enumerate(zip(standard_binner.bin_edges_[0][:-1],
                                     standard_binner.bin_edges_[0][1:])):
    mask = (X_sample.flatten() >= start) & (X_sample.flatten() < end)
    total_weight = weights[mask].sum()
    print(f" Bin {i}: {total_weight:.1f} total weight")
Scikit-learn Pipeline Integration
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight

# Create classification data with sample weights
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=2000, n_features=4, n_classes=2, random_state=42)

# Compute sample weights (e.g., for imbalanced data)
sample_weights = compute_sample_weight('balanced', y)

# Create pipeline with weight-aware binning
pipeline = Pipeline([
    ('binning', EqualWidthMinimumWeightBinning(
        n_bins=6,
        minimum_weight=20.0
    )),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Split data
X_train, X_test, y_train, y_test, weights_train, weights_test = train_test_split(
    X, y, sample_weights, test_size=0.2, random_state=42
)

# Fit pipeline with weights
pipeline.fit(X_train, y_train,
             binning__guidance_data=weights_train)  # Pass weights to binning step
accuracy = pipeline.score(X_test, y_test)
print(f"Pipeline accuracy: {accuracy:.3f}")
Parameter Guide
- n_bins (int or str, default=10)
Initial number of equal-width bins to create:
int: Direct specification (e.g., 10)
“sqrt”: Square root of number of samples
“log”: Natural logarithm of number of samples
Actual bins may be fewer due to weight constraints
- minimum_weight (float, default=1.0)
Minimum total weight required per bin:
Higher values: Fewer, more stable bins
Lower values: More bins, potentially less stable
Should reflect your statistical significance requirements
- bin_range (tuple, optional)
Custom range for binning as (min, max):
None: Uses data min/max
Fixed range: Ensures consistent bins across datasets
Useful for scores with known ranges (e.g., 0-100)
- guidance_columns (list, optional)
Columns providing weights for bin constraints:
Should contain non-negative numeric values (negative weights raise ValueError)
Can be sample weights, importance scores, etc.
Used only for weight calculation, not bin placement
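The string options for n_bins listed above can be read as simple functions of the sample count. The sketch below is an assumption-laden illustration: the helper name is hypothetical, and rounding up with a floor of one bin is a guess, since the guide does not state how the library rounds.

```python
import math
from typing import Union


def resolve_n_bins(n_bins: Union[int, str], n_samples: int) -> int:
    """Turn an n_bins specification into a concrete initial bin count (illustrative)."""
    if isinstance(n_bins, int):
        if n_bins < 1:
            raise ValueError("n_bins must be positive")
        return n_bins
    if n_bins == "sqrt":
        return max(1, math.ceil(math.sqrt(n_samples)))  # rounding up is an assumption
    if n_bins == "log":
        return max(1, math.ceil(math.log(n_samples)))   # natural logarithm
    raise ValueError(f"unsupported n_bins expression: {n_bins!r}")


print(resolve_n_bins(10, 1000), resolve_n_bins("sqrt", 1000), resolve_n_bins("log", 1000))
# prints 10 32 7
```

Whatever count results is only the starting point; the minimum weight constraint may still reduce it.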
Handling Edge Cases
# Insufficient total weight scenario
sparse_X = np.random.uniform(0, 100, 50).reshape(-1, 1)
sparse_weights = np.ones(50) * 0.1  # Very low weights

# Algorithm will create fewer bins to meet weight requirements
sparse_binner = EqualWidthMinimumWeightBinning(
    n_bins=10,
    minimum_weight=2.0  # May be too high for this data
)
try:
    sparse_binned = sparse_binner.fit_transform(sparse_X, guidance_data=sparse_weights)
    print(f"Created {len(sparse_binner.bin_edges_[0]) - 1} bins (requested 10)")
except Exception as e:
    print(f"Error: {e}")
    # Reduce minimum_weight or increase data
Tips for Best Results
Set appropriate minimum_weight: Balance statistical significance with granularity
Consider your weight distribution: Check weight statistics before setting constraints
Use fixed ranges when appropriate: Ensures consistent binning across datasets
Validate weight requirements: Ensure total weight can support desired number of bins
Monitor bin merging: Check if many bins are being merged due to weight constraints
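The tip about validating weight requirements comes down to simple arithmetic: if every bin needs at least `minimum_weight`, the total weight caps the number of bins at `floor(total / minimum_weight)`. The helper below is a hypothetical pre-check written for this guide, not part of binlearn.

```python
import math
from typing import Sequence


def max_supported_bins(weights: Sequence[float], minimum_weight: float) -> int:
    """Upper bound on the number of bins that can each reach minimum_weight."""
    total = math.fsum(weights)  # fsum avoids accumulated rounding error
    return int(total // minimum_weight)


# 50 samples with weight 0.1 each give a total weight of 5.0
sparse_weights = [0.1] * 50
requested_bins = 10
limit = max_supported_bins(sparse_weights, minimum_weight=2.0)
print(limit)
# prints 2
if requested_bins > limit:
    print(f"At most {limit} of the {requested_bins} requested bins can meet the constraint")
```

The bound is necessary but not sufficient: because merging works on adjacent equal-width intervals, unevenly distributed weight can leave fewer bins than the bound allows.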
See Also
EqualWidthBinning - Standard equal-width binning without weight constraints
EqualFrequencyBinning - Quantile-based binning for balanced sample counts
TreeBinning - Decision tree-based supervised binning
KMeansBinning - K-means clustering-based binning