Beginner Tutorial: Your First Binning Project
Welcome to binlearn! This tutorial will walk you through your first data binning project, covering the fundamentals you need to get started.
What is Data Binning?
Data binning (also called discretization) is the process of converting continuous numerical data into discrete intervals or categories. This is useful for:
Reducing noise in your data
Simplifying complex relationships
Making data compatible with algorithms that require categorical input
Creating interpretable features for analysis
Improving model performance in some cases
Let’s see this in action!
Setting Up Your Environment
First, let’s import the libraries we’ll need:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from binlearn import EqualWidthBinning, EqualFrequencyBinning
# Set random seed for reproducibility
np.random.seed(42)
Creating Sample Data
Let’s create some sample data that represents customer information:
# Generate sample customer data
n_customers = 1000
customer_data = pd.DataFrame({
'age': np.random.normal(40, 15, n_customers), # Age in years
'income': np.random.lognormal(10.5, 0.8, n_customers), # Annual income
'spending_score': np.random.beta(2, 5, n_customers) * 100, # Spending score 0-100
'account_balance': np.random.exponential(5000, n_customers) # Account balance
})
# Clean up negative ages
customer_data['age'] = np.maximum(customer_data['age'], 18)
print("Sample data:")
print(customer_data.head())
print(f"\nData shape: {customer_data.shape}")
print(f"\nData types:\n{customer_data.dtypes}")
Let’s visualize our data to understand its distribution:
# Create histograms for each feature
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()
for i, column in enumerate(customer_data.columns):
axes[i].hist(customer_data[column], bins=30, alpha=0.7, edgecolor='black')
axes[i].set_title(f'Distribution of {column}')
axes[i].set_xlabel(column)
axes[i].set_ylabel('Frequency')
plt.tight_layout()
plt.show()
Your First Binning: Equal-Width Binning
Let’s start with the simplest binning method - equal-width binning. This divides the range of each feature into bins of equal width.
# Create an equal-width binner
ew_binner = EqualWidthBinning(
n_bins=5, # Create 5 bins for each feature
preserve_dataframe=True # Keep the DataFrame format
)
# Fit the binner to our data and transform it
customer_data_binned = ew_binner.fit_transform(customer_data)
print("Binned data:")
print(customer_data_binned.head())
print(f"\nBinned data shape: {customer_data_binned.shape}")
Understanding the Results
Let’s examine what the binner learned:
# Check the bin edges for each feature
print("Bin edges for each feature:")
for feature, edges in ew_binner.bin_edges_.items():
print(f"{feature}: {edges}")
# Look at the range of binned values
print("\nRange of binned values:")
for column in customer_data_binned.columns:
unique_values = sorted(customer_data_binned[column].unique())
print(f"{column}: {unique_values}")
Visualizing the Binning Results
Let’s create a comparison plot to see the effect of binning:
# Create comparison plots
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
for i, column in enumerate(customer_data.columns):
# Original data
axes[0, i].hist(customer_data[column], bins=30, alpha=0.7,
edgecolor='black', color='blue')
axes[0, i].set_title(f'Original {column}')
axes[0, i].set_xlabel(column)
axes[0, i].set_ylabel('Frequency')
# Binned data
axes[1, i].hist(customer_data_binned[column], bins=20, alpha=0.7,
edgecolor='black', color='red')
axes[1, i].set_title(f'Binned {column}')
axes[1, i].set_xlabel(f'{column} (binned)')
axes[1, i].set_ylabel('Frequency')
plt.tight_layout()
plt.show()
Trying Equal-Frequency Binning
Equal-width binning can sometimes create bins with very different numbers of samples. Let’s try equal-frequency binning, which creates bins with approximately equal numbers of samples:
# Create an equal-frequency binner
ef_binner = EqualFrequencyBinning(
n_bins=5,
preserve_dataframe=True
)
# Fit and transform
customer_data_eq_freq = ef_binner.fit_transform(customer_data)
print("Equal-frequency binned data:")
print(customer_data_eq_freq.head())
Comparing the Methods
Let’s compare how the two methods distribute the samples:
# Compare bin distributions for the 'income' feature
print("Sample distribution comparison for 'income' feature:")
print("\nEqual-Width Binning:")
ew_counts = customer_data_binned['income'].value_counts().sort_index()
for bin_val, count in ew_counts.items():
print(f" Bin {bin_val}: {count} samples")
print("\nEqual-Frequency Binning:")
ef_counts = customer_data_eq_freq['income'].value_counts().sort_index()
for bin_val, count in ef_counts.items():
print(f" Bin {bin_val}: {count} samples")
Working with Individual Features
Sometimes you might want to bin only specific features:
# Bin only age and income
selected_data = customer_data[['age', 'income']]
binner = EqualWidthBinning(n_bins=3, preserve_dataframe=True)
selected_binned = binner.fit_transform(selected_data)
print("Binning only selected features:")
print(selected_binned.head())
Custom Bin Ranges
You can also specify custom ranges for binning:
# Create a binner with custom range for age (18-80 years)
custom_binner = EqualWidthBinning(
n_bins=4,
bin_range=(18, 80), # Custom range for all features
preserve_dataframe=True
)
# Apply only to age column
age_data = customer_data[['age']]
age_binned = custom_binner.fit_transform(age_data)
print("Custom range binning for age:")
print(f"Bin edges: {custom_binner.bin_edges_['age']}")
print(f"Unique binned values: {sorted(age_binned['age'].unique())}")
Handling Missing Values and Outliers
binlearn provides robust handling of missing values and outliers:
# Create data with some outliers and missing values
noisy_data = customer_data.copy()
# Add some outliers
noisy_data.loc[0, 'income'] = 1000000 # Very high income
noisy_data.loc[1, 'age'] = 150 # Impossible age
# Add missing values
noisy_data.loc[2:4, 'spending_score'] = np.nan
print("Data with noise:")
print(noisy_data.head())
# Bin the noisy data
robust_binner = EqualWidthBinning(
n_bins=5,
clip=True, # Clip outliers to bin edges
preserve_dataframe=True
)
try:
noisy_binned = robust_binner.fit_transform(noisy_data)
print("\nSuccessfully binned noisy data:")
print(noisy_binned.head())
except Exception as e:
print(f"Error handling noisy data: {e}")
Saving and Loading Binners
You can save trained binners for later use:
import pickle
# Train a binner
production_binner = EqualWidthBinning(n_bins=5, preserve_dataframe=True)
production_binner.fit(customer_data)
# Save the trained binner
with open('customer_binner.pkl', 'wb') as f:
pickle.dump(production_binner, f)
# Load and use the binner
with open('customer_binner.pkl', 'rb') as f:
loaded_binner = pickle.load(f)
# Use the loaded binner on new data
new_customer = pd.DataFrame({
'age': [35],
'income': [50000],
'spending_score': [65],
'account_balance': [3000]
})
new_customer_binned = loaded_binner.transform(new_customer)
print("Binned new customer data:")
print(new_customer_binned)
Next Steps
Congratulations! You’ve completed your first binning project. You’ve learned how to:
Create and apply equal-width and equal-frequency binning
Visualize binning results
Handle noisy data
Save and load trained binners
Work with DataFrames and individual features
What to explore next:
Intermediate Tutorial: Learn about K-means binning and supervised binning
Advanced Tutorial: Explore manual binning and flexible binning strategies
sklearn Integration: Use binning in machine learning pipelines
Performance Tips: Optimize binning for large datasets
Key Takeaways:
Equal-width binning is simple but may create uneven sample distributions
Equal-frequency binning creates more balanced distributions
Always visualize your results to understand the impact of binning
binlearn handles edge cases like missing values and outliers gracefully
Trained binners can be saved and reused on new data
Ready for more? Check out the intermediate_tutorial to learn about more advanced binning methods!