EqualWidthMinimumWeightBinning
================================

.. currentmodule:: binlearn.methods

.. autoclass:: EqualWidthMinimumWeightBinning
   :members:
   :inherited-members:
   :show-inheritance:

Overview
--------

``EqualWidthMinimumWeightBinning`` creates bins of equal width across the range of each feature,
then merges bins where needed so that each bin contains at least the specified minimum total
weight from the guidance column. This method combines the interpretability of equal-width binning
with a weight-based constraint that keeps sparsely weighted bins from forming.

This approach is particularly useful when:

* You want **interpretable equal-width bins** but need weight balance
* You have **sample weights** or **importance scores** that should be considered
* You need **enough weighted observations** in each bin for stable per-bin statistics
* You want to **prevent empty or sparse bins** in weighted scenarios

Key Features
------------

* **Equal-Width Foundation**: Starts with equal-width intervals for interpretability
* **Weight-Based Adjustment**: Ensures minimum weight per bin from guidance column
* **Automatic Merging**: Merges adjacent bins that fall below the minimum weight
* **Flexible Weight Sources**: Supports any numeric column as weight guidance
* **Range Control**: Optional custom range specification for binning
* **Robust Validation**: Comprehensive error handling and data validation
* **Scikit-learn Compatibility**: Full transformer interface with fit/transform methods
* **DataFrame Support**: Preserves pandas/polars column names and structure
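
The merging idea behind these features can be illustrated with a small NumPy sketch. This is not
binlearn's implementation: the helper name ``merge_underweight_bins`` and the greedy left-to-right
merge order are assumptions made purely for illustration.

.. code-block:: python

   import numpy as np


   def merge_underweight_bins(x, weights, n_bins, minimum_weight):
       """Illustrative greedy merge; not binlearn's actual implementation."""
       x = np.asarray(x, dtype=float)
       weights = np.asarray(weights, dtype=float)

       # Start from n_bins equal-width intervals over the data range.
       edges = np.linspace(x.min(), x.max(), n_bins + 1)
       # Total weight per initial bin (the last bin includes its right edge).
       bin_weights, _ = np.histogram(x, bins=edges, weights=weights)

       merged = [edges[0]]
       accumulated = 0.0
       for right_edge, w in zip(edges[1:], bin_weights):
           accumulated += w
           if accumulated >= minimum_weight:
               merged.append(right_edge)  # enough weight: close this bin
               accumulated = 0.0

       if merged[-1] != edges[-1]:
           if len(merged) > 1:
               merged[-1] = edges[-1]    # fold the underweight tail into the last closed bin
           else:
               merged.append(edges[-1])  # nothing reached the minimum: one bin overall
       return np.array(merged)


   x = [0.0, 0.1, 2.5, 5.5, 9.0, 10.0]
   w = [3, 3, 1, 1, 3, 3]
   print(merge_underweight_bins(x, w, n_bins=5, minimum_weight=2.0).tolist())
   # [0.0, 2.0, 6.0, 10.0]

In this toy run the five initial bins carry weights 6, 1, 1, 0 and 6, so the three middle bins are
combined until the accumulated weight reaches 2.0. Merged bins are wider than the original ones,
but every edge still lies on the original equal-width grid.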

Basic Usage
-----------

.. code-block:: python

   import numpy as np
   import pandas as pd
   from binlearn.methods import EqualWidthMinimumWeightBinning
   
   # Create sample data with weights
   np.random.seed(42)
   X = np.random.uniform(0, 100, 1000).reshape(-1, 1)
   weights = np.random.exponential(2, 1000)  # Some samples more important
   
   # Apply equal-width binning with minimum weight constraint
   binner = EqualWidthMinimumWeightBinning(
       n_bins=10,
       minimum_weight=50.0  # Each bin must have at least 50 total weight
   )
   
   X_binned = binner.fit_transform(X, guidance_data=weights)
   
   print(f"Original shape: {X.shape}")
   print(f"Binned shape: {X_binned.shape}")
   print(f"Final number of bins: {len(binner.bin_edges_[0]) - 1}")
   print(f"Bin edges: {binner.bin_edges_[0]}")

DataFrame Example with Weight Column
------------------------------------

.. code-block:: python

   # Create DataFrame with features and weight column
   df = pd.DataFrame({
       'income': np.random.lognormal(10, 1, 2000),
       'age': np.random.uniform(18, 80, 2000),
       'transaction_amount': np.random.exponential(100, 2000),
       'sample_weight': np.random.gamma(2, 2, 2000)  # Weights for each sample
   })
   
   # Bin income using sample_weight as weights
   income_binner = EqualWidthMinimumWeightBinning(
       guidance_columns=['sample_weight'],
       n_bins=8,
       minimum_weight=20.0,
       preserve_dataframe=True
   )
   
   df_binned = income_binner.fit_transform(df)
   
   print(f"Income bins created: {len(income_binner.bin_edges_['income']) - 1}")
   print(f"Income bin edges: {income_binner.bin_edges_['income']}")

Survey Data Example
-------------------

.. code-block:: python

   # Example with survey data where response weights matter
   survey_df = pd.DataFrame({
       'satisfaction_score': np.random.uniform(1, 10, 1500),
       'response_time': np.random.lognormal(3, 1, 1500),
       'respondent_weight': np.random.choice([0.5, 1.0, 1.5, 2.0], 1500)  # Survey weights
   })
   
   # Bin satisfaction ensuring each bin has sufficient weighted responses
   satisfaction_binner = EqualWidthMinimumWeightBinning(
       guidance_columns=['respondent_weight'],
       n_bins=5,
       minimum_weight=50.0,  # At least 50 weighted responses per bin
       bin_range=(1, 10),    # Fixed range for satisfaction scores
       preserve_dataframe=True
   )
   
   survey_binned = satisfaction_binner.fit_transform(survey_df)
   
   # Verify weights per bin
   for i, (start, end) in enumerate(zip(satisfaction_binner.bin_edges_['satisfaction_score'][:-1],
                                       satisfaction_binner.bin_edges_['satisfaction_score'][1:])):
       mask = (survey_df['satisfaction_score'] >= start) & (survey_df['satisfaction_score'] < end)
       total_weight = survey_df.loc[mask, 'respondent_weight'].sum()
       print(f"Bin {i} [{start:.1f}, {end:.1f}): {total_weight:.1f} total weight")
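
The per-bin totals in the loop above can also be computed in a single call with
``numpy.histogram`` and its ``weights`` argument. The sketch below uses made-up scores, weights,
and edges rather than the output of a fitted binner. Unlike the half-open mask in the loop,
``numpy.histogram`` treats the last bin as closed on the right, so a value equal to the final edge
is counted.

.. code-block:: python

   import numpy as np

   # Hypothetical data standing in for a feature column and its weight column.
   scores = np.array([1.0, 2.5, 4.0, 5.5, 7.0, 10.0])
   weights = np.array([0.5, 1.0, 1.5, 2.0, 1.0, 0.5])
   edges = np.array([1.0, 4.0, 7.0, 10.0])  # hypothetical bin edges

   # Sum of weights and number of samples falling into each bin.
   weight_per_bin, _ = np.histogram(scores, bins=edges, weights=weights)
   count_per_bin, _ = np.histogram(scores, bins=edges)

   for i, (total, count) in enumerate(zip(weight_per_bin, count_per_bin)):
       print(f"Bin {i}: {count} samples, {total:.1f} total weight")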

Advanced Configuration
----------------------

.. code-block:: python

   # Fine-tuned binning for different scenarios
   
   # Conservative binning (higher weight requirements)
   conservative_binner = EqualWidthMinimumWeightBinning(
       n_bins=6,
       minimum_weight=100.0,    # High weight requirement
       bin_range=(0, 1000),     # Fixed range
       clip=True                # Clip outliers
   )
   
   # Adaptive binning (lower weight requirements, more bins)
   adaptive_binner = EqualWidthMinimumWeightBinning(
       n_bins="sqrt",           # Dynamic based on sample size
       minimum_weight=10.0,     # Lower weight requirement
       clip=False               # Preserve outliers
   )

Financial Risk Example
----------------------

.. code-block:: python

   # Financial data where exposure amounts act as weights
   financial_df = pd.DataFrame({
       'credit_score': np.random.normal(700, 100, 10000),
       'loan_amount': np.random.lognormal(10, 1, 10000),
       'exposure': np.random.exponential(50000, 10000)  # Dollar exposure per loan
   })
   
   # Bin credit scores ensuring each bin has sufficient exposure
   credit_binner = EqualWidthMinimumWeightBinning(
       guidance_columns=['exposure'],
       n_bins=10,
       minimum_weight=500000,  # At least $500K exposure per bin
       bin_range=(300, 850),   # Standard credit score range
       preserve_dataframe=True
   )
   
   financial_binned = credit_binner.fit_transform(financial_df)
   
   # Analyze exposure distribution across bins
   bin_stats = []
   for i, (start, end) in enumerate(zip(credit_binner.bin_edges_['credit_score'][:-1],
                                       credit_binner.bin_edges_['credit_score'][1:])):
       mask = (financial_df['credit_score'] >= start) & (financial_df['credit_score'] < end)
       total_exposure = financial_df.loc[mask, 'exposure'].sum()
       avg_score = financial_df.loc[mask, 'credit_score'].mean()
       count = mask.sum()
       
       bin_stats.append({
           'bin': f"[{start:.0f}, {end:.0f})",
           'avg_score': avg_score,
           'count': count,
           'total_exposure': total_exposure,
           'avg_exposure': total_exposure / count if count > 0 else 0
       })
   
   for stats in bin_stats:
       print(f"Bin {stats['bin']}: {stats['count']} loans, "
             f"${stats['total_exposure']:,.0f} exposure, "
             f"avg score {stats['avg_score']:.0f}")

Comparison with Standard Equal-Width
------------------------------------

.. code-block:: python

   from binlearn.methods import EqualWidthBinning
   
   # Compare standard equal-width vs. minimum weight equal-width
   X_sample = np.random.exponential(2, 500).reshape(-1, 1)  # Skewed data
   weights = np.random.exponential(1, 500)
   
   # Standard equal-width binning
   standard_binner = EqualWidthBinning(n_bins=8)
   X_standard = standard_binner.fit_transform(X_sample)
   
   # Equal-width with minimum weight
   weighted_binner = EqualWidthMinimumWeightBinning(
       n_bins=8,
       minimum_weight=15.0
   )
   X_weighted = weighted_binner.fit_transform(X_sample, guidance_data=weights)
   
   print("Standard equal-width:")
   print(f"  Bins created: {len(standard_binner.bin_edges_[0]) - 1}")
   print(f"  Bin edges: {standard_binner.bin_edges_[0]}")
   
   print("\nWeight-constrained equal-width:")
   print(f"  Bins created: {len(weighted_binner.bin_edges_[0]) - 1}")
   print(f"  Bin edges: {weighted_binner.bin_edges_[0]}")
   
   # Analyze weight distribution in standard bins
   print("\nWeight distribution in standard bins:")
   for i, (start, end) in enumerate(zip(standard_binner.bin_edges_[0][:-1],
                                       standard_binner.bin_edges_[0][1:])):
       mask = (X_sample.flatten() >= start) & (X_sample.flatten() < end)
       total_weight = weights[mask].sum()
       print(f"  Bin {i}: {total_weight:.1f} total weight")

Scikit-learn Pipeline Integration
---------------------------------

.. code-block:: python

   from sklearn.datasets import make_classification
   from sklearn.ensemble import RandomForestClassifier
   from sklearn.model_selection import train_test_split
   from sklearn.pipeline import Pipeline
   from sklearn.utils.class_weight import compute_sample_weight

   # Create classification data
   X, y = make_classification(n_samples=2000, n_features=4, n_classes=2, random_state=42)
   
   # Compute sample weights (e.g., for imbalanced data)
   sample_weights = compute_sample_weight('balanced', y)
   
   # Create pipeline with weight-aware binning
   pipeline = Pipeline([
       ('binning', EqualWidthMinimumWeightBinning(
           n_bins=6,
           minimum_weight=20.0
       )),
       ('classifier', RandomForestClassifier(random_state=42))
   ])
   
   # Split data
   X_train, X_test, y_train, y_test, weights_train, weights_test = train_test_split(
       X, y, sample_weights, test_size=0.2, random_state=42
   )
   
   # Fit pipeline with weights
   pipeline.fit(X_train, y_train, 
                binning__guidance_data=weights_train)  # Pass weights to binning step
   
   accuracy = pipeline.score(X_test, y_test)
   print(f"Pipeline accuracy: {accuracy:.3f}")

Parameter Guide
---------------

**n_bins** (int or str, default=10)
    Initial number of equal-width bins to create:
    
    * int: Direct specification (e.g., 10)
    * "sqrt": Square root of number of samples
    * "log": Natural logarithm of number of samples
    * Actual bins may be fewer due to weight constraints

**minimum_weight** (float, default=1.0)
    Minimum total weight required per bin:
    
    * Higher values: Fewer, more stable bins
    * Lower values: More bins, potentially less stable
    * Should reflect your statistical significance requirements

**bin_range** (tuple, optional)
    Custom range for binning as (min, max):
    
    * None: Uses data min/max
    * Fixed range: Ensures consistent bins across datasets
    * Useful for scores with known ranges (e.g., 0-100)

**guidance_columns** (list, optional)
    Columns providing weights for bin constraints:
    
    * Should contain positive numeric values
    * Can be sample weights, importance scores, etc.
    * Used only for weight calculation, not bin placement
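
How ``n_bins`` and ``bin_range`` might combine into the initial equal-width edges can be sketched
as follows. The helper ``initial_edges`` is hypothetical, and rounding the ``"sqrt"`` and ``"log"``
specifications up to the next integer is an assumption; binlearn's own resolution rules may
differ.

.. code-block:: python

   import math

   import numpy as np


   def initial_edges(x, n_bins=10, bin_range=None):
       """Hypothetical helper: equal-width edges before any weight-based merging."""
       x = np.asarray(x, dtype=float)
       if n_bins == "sqrt":
           k = math.ceil(math.sqrt(x.size))  # assumed rounding rule
       elif n_bins == "log":
           k = math.ceil(math.log(x.size))   # natural logarithm, assumed rounding rule
       else:
           k = int(n_bins)
       k = max(k, 1)

       # Without bin_range the edges depend on the data; with it they are fixed.
       lo, hi = (x.min(), x.max()) if bin_range is None else bin_range
       return np.linspace(lo, hi, k + 1)


   a = np.array([12.0, 40.0, 55.0])
   b = np.array([5.0, 70.0, 98.0])

   print(initial_edges(a, 4).tolist())            # edges follow a's min/max
   print(initial_edges(a, 4, (0, 100)).tolist())  # [0.0, 25.0, 50.0, 75.0, 100.0]
   print(initial_edges(b, 4, (0, 100)).tolist())  # [0.0, 25.0, 50.0, 75.0, 100.0]

With a fixed ``bin_range`` both datasets share the same edges, which is what makes bins comparable
across datasets.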

Handling Edge Cases
-------------------

.. code-block:: python

   # Insufficient total weight scenario
   sparse_X = np.random.uniform(0, 100, 50).reshape(-1, 1)
   sparse_weights = np.ones(50) * 0.1  # Total weight is only 50 * 0.1 = 5.0
   
   # With 5.0 total weight, at most 2 bins can each reach minimum_weight=2.0,
   # so far fewer than the 10 requested bins can survive
   sparse_binner = EqualWidthMinimumWeightBinning(
       n_bins=10,
       minimum_weight=2.0  # Too high for 10 bins on this data
   )
   
   try:
       sparse_binned = sparse_binner.fit_transform(sparse_X, guidance_data=sparse_weights)
       print(f"Created {len(sparse_binner.bin_edges_[0]) - 1} bins (requested 10)")
   except Exception as e:
       print(f"Error: {e}")
       # Reduce minimum_weight or increase data
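
Because every bin needs at least ``minimum_weight``, the total weight puts an upper bound on the
number of bins, no matter how bins are merged. A quick pre-check along these lines, using a
hypothetical helper ``max_feasible_bins`` that is not part of binlearn:

.. code-block:: python

   import math

   import numpy as np


   def max_feasible_bins(weights, minimum_weight):
       """Upper bound on the bin count: floor(total weight / minimum weight)."""
       if minimum_weight <= 0:
           raise ValueError("minimum_weight must be positive")
       total = float(np.sum(weights))
       return math.floor(total / minimum_weight)


   sparse_weights = np.ones(50) * 0.1
   print(max_feasible_bins(sparse_weights, minimum_weight=2.0))  # 2

This is only an upper bound. How the weight is spread across the feature range determines how many
bins actually remain, and a result of 0 means that no bin can satisfy the constraint.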

Tips for Best Results
---------------------

1. **Set appropriate minimum_weight**: Balance statistical significance with granularity
2. **Consider your weight distribution**: Check weight statistics before setting constraints
3. **Use fixed ranges when appropriate**: Ensures consistent binning across datasets
4. **Validate weight requirements**: Ensure total weight can support desired number of bins
5. **Monitor bin merging**: Check if many bins are being merged due to weight constraints
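
Tip 2 can be put into practice by inspecting the weight column before fitting. The following is a
minimal pandas sketch; the helper name ``summarize_weights`` and the keys it returns are
illustrative choices, not binlearn features.

.. code-block:: python

   import numpy as np
   import pandas as pd


   def summarize_weights(weights):
       """Validate a weight column and report basic statistics."""
       w = pd.Series(weights, dtype=float)
       if not np.isfinite(w).all():
           raise ValueError("weights must be finite (no NaN or inf)")
       if (w < 0).any():
           raise ValueError("weights must be non-negative")
       return {
           "total": float(w.sum()),
           "min": float(w.min()),
           "median": float(w.median()),
           "max": float(w.max()),
           "zero_share": float((w == 0).mean()),  # fraction of samples with no weight
       }


   summary = summarize_weights([0.5, 1.0, 1.5, 2.0, 0.0])
   for name, value in summary.items():
       print(f"{name}: {value:.2f}")

A ``max`` far above the ``median`` suggests that a few samples dominate the total weight, in which
case a single heavy sample may be enough to satisfy ``minimum_weight`` for its bin.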

See Also
--------

* :class:`EqualWidthBinning` - Standard equal-width binning without weight constraints
* :class:`EqualFrequencyBinning` - Quantile-based binning for balanced sample counts
* :class:`TreeBinning` - Decision tree-based supervised binning
* :class:`KMeansBinning` - K-means clustering-based binning