Frequently Asked Questions

Common questions and answers about using binlearn.

General Questions

What is binlearn?

binlearn is a comprehensive Python library for data binning and discretization. It provides multiple binning methods with sklearn compatibility, DataFrame support, and modern Python features like type safety and comprehensive error handling.

Why use binning/discretization?

Binning is useful for:

  • Noise reduction: Smoothing out small variations in data

  • Model compatibility: Some algorithms work better with categorical data

  • Interpretability: Creating meaningful categories from continuous data

  • Feature engineering: Creating new categorical features for ML models

  • Data exploration: Understanding data distribution patterns

How does binlearn compare to other binning libraries?

binlearn offers several advantages:

  • Modern Python: Type safety, comprehensive error handling, 100% test coverage

  • Framework integration: Native pandas/polars support, full sklearn compatibility

  • Multiple methods: 8 different binning algorithms including supervised binning

  • Flexibility: Custom bin specifications, guidance columns, configurable behavior

  • Performance: Optimized implementations with efficient memory usage

Installation and Setup

What Python versions are supported?

binlearn supports Python 3.10, 3.11, 3.12, and 3.13.

What are the required dependencies?

Core dependencies (automatically installed): - NumPy >= 1.21.0 - SciPy >= 1.7.0 - Scikit-learn >= 1.0.0 - kmeans1d >= 0.3.0

Optional dependencies: - Pandas >= 1.3.0 (for DataFrame support) - Polars >= 0.15.0 (for Polars DataFrame support)

How do I install optional dependencies?

# Install with pandas support
pip install binlearn[pandas]

# Install with polars support
pip install binlearn[polars]

# Install with all optional dependencies
pip install binlearn[pandas,polars]

I get import errors with pandas/polars. What should I do?

Install the optional dependencies:

pip install pandas polars

The library will work without them, but DataFrame-specific features won’t be available.

Usage Questions

Which binning method should I choose?

Choose based on your data and goals:

  • EqualWidthBinning: Simple, interpretable, good for uniformly distributed data

  • EqualFrequencyBinning: Balanced bin sizes, good for skewed data

  • KMeansBinning: Natural clusters in data, good for multimodal distributions

  • SupervisedBinning: When you have target variables, optimizes for prediction

  • SingletonBinning: For numeric discrete values (one bin per unique value)

  • Manual methods: When you need specific bin boundaries

How do I preserve DataFrame column names?

Use the preserve_dataframe=True parameter:

binner = EqualWidthBinning(n_bins=5, preserve_dataframe=True)
df_binned = binner.fit_transform(df)
# df_binned will be a DataFrame with original column names

Can I bin only specific columns?

Yes, select the columns you want to bin:

# Method 1: Select columns before binning
selected_data = df[['column1', 'column2']]
binner.fit_transform(selected_data)

# Method 2: Use sklearn's ColumnTransformer
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer([
    ('binning', EqualWidthBinning(n_bins=5), ['column1', 'column2']),
    ('passthrough', 'passthrough', ['column3', 'column4'])
])

How do I handle missing values?

binlearn handles missing values automatically:

  • NaN values are preserved and assigned a special bin value

  • The binner will warn if there are excessive missing values

  • Missing values don’t affect bin edge calculations

# Data with missing values
import numpy as np
data_with_nan = np.array([1, 2, np.nan, 4, 5])

binner = EqualWidthBinning(n_bins=3)
result = binner.fit_transform(data_with_nan.reshape(-1, 1))
# NaN values are preserved in the result

What happens with outliers?

By default, outliers are included in the outermost bins. You can control this with the clip parameter:

# Include outliers in outermost bins (default)
binner = EqualWidthBinning(n_bins=5, clip=False)

# Clip outliers to bin edges
binner = EqualWidthBinning(n_bins=5, clip=True)

How do I get bin boundaries and representatives?

Access the bin_edges_ and bin_representatives_ attributes after fitting:

binner = EqualWidthBinning(n_bins=5)
binner.fit(X)

print(f"Bin edges: {binner.bin_edges_}")
print(f"Bin representatives: {binner.bin_representatives_}")

Can I save and load trained binners?

Yes, use pickle or joblib:

import pickle

# Save trained binner
with open('binner.pkl', 'wb') as f:
    pickle.dump(binner, f)

# Load binner
with open('binner.pkl', 'rb') as f:
    loaded_binner = pickle.load(f)

Advanced Usage

How do I use binning in sklearn pipelines?

All binlearn transformers are sklearn-compatible:

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('binning', EqualWidthBinning(n_bins=5)),
    ('classifier', RandomForestClassifier())
])

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)

What is supervised binning and when should I use it?

Supervised binning uses target variable information to create optimal bins for prediction tasks:

from binlearn import SupervisedBinning

# For classification
sup_binner = SupervisedBinning(
    n_bins=4,
    task_type='classification'
)
X_binned = sup_binner.fit_transform(X, guidance_data=y)

Use it when: - You have labeled data (classification/regression) - You want bins optimized for prediction performance - Traditional binning doesn’t capture important patterns

How do I create custom bin boundaries?

Use ManualIntervalBinning:

from binlearn import ManualIntervalBinning

# Define custom bin edges
custom_edges = {
    'feature1': [0, 25, 50, 75, 100],
    'feature2': [-2, -1, 0, 1, 2]
}

manual_binner = ManualIntervalBinning(
    bin_edges=custom_edges,
    preserve_dataframe=True
)

Can I mix different binning methods for different columns?

Yes, use sklearn’s ColumnTransformer:

from sklearn.compose import ColumnTransformer
from binlearn import EqualWidthBinning, SingletonBinning

preprocessor = ColumnTransformer([
    ('numeric', EqualWidthBinning(n_bins=5), ['age', 'income']),
    ('discrete', SingletonBinning(), ['category_id', 'region_code'])
])

How do I optimize binning performance for large datasets?

Several strategies:

  1. Use appropriate data types: Float32 instead of float64 if precision allows

  2. Sample for fitting: Fit on a representative sample, transform the full dataset

  3. Choose efficient methods: EqualWidthBinning is faster than KMeansBinning

  4. Use chunked processing: For datasets larger than memory (future feature)

# Sample-based fitting for large datasets
sample_size = 10000
sample_indices = np.random.choice(len(X), sample_size, replace=False)
X_sample = X[sample_indices]

binner = EqualWidthBinning(n_bins=5)
binner.fit(X_sample)
X_binned = binner.transform(X)  # Transform full dataset

Troubleshooting

I get a ConfigurationError. What does this mean?

ConfigurationError indicates invalid parameters. Common causes:

  • n_bins <= 0: Must be positive

  • Invalid bin_range: Must be tuple with min < max

  • Conflicting parameters: Can’t use guidance_columns with fit_jointly=True

Check the error message for specific guidance.

My binned data has unexpected values. What’s wrong?

Common issues:

  1. Out-of-range values: Check if clip=True is needed

  2. Missing values: NaN inputs produce special bin values

  3. Insufficient data: Very small datasets may not bin as expected

  4. Wrong method: Consider if your chosen method suits your data distribution

The binner seems slow. How can I speed it up?

Performance tips:

  1. Use EqualWidthBinning for fastest performance

  2. Reduce data size if possible (fewer samples or features)

  3. Use appropriate dtypes (float32 vs float64)

  4. Avoid KMeansBinning for very large datasets

  5. Consider sampling for fitting on large datasets

I get different results each time. Why?

Some methods have randomness:

  • KMeansBinning: Uses random initialization, set random_state for reproducibility

  • SupervisedBinning: Decision trees have randomness, set random_state in tree_params

# Reproducible results
binner = KMeansBinning(n_bins=5, random_state=42)

Can I contribute new binning methods?

Yes! We welcome contributions. See the Contributing guide for details on:

  • Development setup

  • Coding standards

  • Testing requirements

  • Pull request process

Integration Questions

Does binlearn work with Dask?

Not directly, but you can use binlearn with Dask by:

  1. Fitting on a representative sample

  2. Applying the trained binner to Dask chunks

  3. Using map_partitions for transformation

Does binlearn support sparse matrices?

Yes, binlearn supports scipy sparse matrices for memory-efficient processing of high-dimensional sparse data.

Can I use binlearn with Apache Spark?

Not directly, but you can:

  1. Convert Spark DataFrames to pandas for binning

  2. Use fitted binners in Spark UDFs

  3. Apply binning in preprocessing steps before Spark ML

Still Have Questions?

If you don’t find your answer here:

  1. Check the documentation: Browse the user guide and API reference

  2. Search GitHub issues: Someone may have asked the same question

  3. Create an issue: For bugs or feature requests

  4. Start a discussion: For general questions or usage help

Visit our GitHub repository for more information.