Beginner Tutorial: Your First Binning Project
==============================================

Welcome to binlearn! This tutorial will walk you through your first data binning project, covering the fundamentals you need to get started.

What is Data Binning?
---------------------

Data binning (also called discretization) is the process of converting continuous numerical data into discrete intervals or categories. This is useful for:

- **Reducing noise** in your data
- **Simplifying complex relationships** 
- **Making data compatible** with algorithms that require categorical input
- **Creating interpretable features** for analysis
- **Improving model performance** in some cases

Let's see this in action!

Setting Up Your Environment
----------------------------

First, let's import the libraries we'll need:

.. code-block:: python

   import numpy as np
   import pandas as pd
   import matplotlib.pyplot as plt
   from binlearn import EqualWidthBinning, EqualFrequencyBinning
   
   # Set random seed for reproducibility
   np.random.seed(42)

Creating Sample Data
--------------------

Let's create some sample data that represents customer information:

.. code-block:: python

   # Generate sample customer data
   n_customers = 1000
   
   customer_data = pd.DataFrame({
       'age': np.random.normal(40, 15, n_customers),           # Age in years
       'income': np.random.lognormal(10.5, 0.8, n_customers), # Annual income
       'spending_score': np.random.beta(2, 5, n_customers) * 100, # Spending score 0-100
       'account_balance': np.random.exponential(5000, n_customers)  # Account balance
   })
   
   # Clean up negative ages
   customer_data['age'] = np.maximum(customer_data['age'], 18)
   
   print("Sample data:")
   print(customer_data.head())
   print(f"\nData shape: {customer_data.shape}")
   print(f"\nData types:\n{customer_data.dtypes}")

Let's visualize our data to understand its distribution:

.. code-block:: python

   # Create histograms for each feature
   fig, axes = plt.subplots(2, 2, figsize=(12, 10))
   axes = axes.ravel()
   
   for i, column in enumerate(customer_data.columns):
       axes[i].hist(customer_data[column], bins=30, alpha=0.7, edgecolor='black')
       axes[i].set_title(f'Distribution of {column}')
       axes[i].set_xlabel(column)
       axes[i].set_ylabel('Frequency')
   
   plt.tight_layout()
   plt.show()

Your First Binning: Equal-Width Binning
----------------------------------------

Let's start with the simplest binning method - equal-width binning. This divides the range of each feature into bins of equal width.

.. code-block:: python

   # Create an equal-width binner
   ew_binner = EqualWidthBinning(
       n_bins=5,                    # Create 5 bins for each feature
       preserve_dataframe=True      # Keep the DataFrame format
   )
   
   # Fit the binner to our data and transform it
   customer_data_binned = ew_binner.fit_transform(customer_data)
   
   print("Binned data:")
   print(customer_data_binned.head())
   print(f"\nBinned data shape: {customer_data_binned.shape}")

Understanding the Results
~~~~~~~~~~~~~~~~~~~~~~~~~

Let's examine what the binner learned:

.. code-block:: python

   # Check the bin edges for each feature
   print("Bin edges for each feature:")
   for feature, edges in ew_binner.bin_edges_.items():
       print(f"{feature}: {edges}")
   
   # Look at the range of binned values
   print("\nRange of binned values:")
   for column in customer_data_binned.columns:
       unique_values = sorted(customer_data_binned[column].unique())
       print(f"{column}: {unique_values}")

Visualizing the Binning Results
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let's create a comparison plot to see the effect of binning:

.. code-block:: python

   # Create comparison plots
   fig, axes = plt.subplots(2, 4, figsize=(16, 8))
   
   for i, column in enumerate(customer_data.columns):
       # Original data
       axes[0, i].hist(customer_data[column], bins=30, alpha=0.7, 
                      edgecolor='black', color='blue')
       axes[0, i].set_title(f'Original {column}')
       axes[0, i].set_xlabel(column)
       axes[0, i].set_ylabel('Frequency')
       
       # Binned data
       axes[1, i].hist(customer_data_binned[column], bins=20, alpha=0.7, 
                      edgecolor='black', color='red')
       axes[1, i].set_title(f'Binned {column}')
       axes[1, i].set_xlabel(f'{column} (binned)')
       axes[1, i].set_ylabel('Frequency')
   
   plt.tight_layout()
   plt.show()

Trying Equal-Frequency Binning
-------------------------------

Equal-width binning can sometimes create bins with very different numbers of samples. Let's try equal-frequency binning, which creates bins with approximately equal numbers of samples:

.. code-block:: python

   # Create an equal-frequency binner
   ef_binner = EqualFrequencyBinning(
       n_bins=5,
       preserve_dataframe=True
   )
   
   # Fit and transform
   customer_data_eq_freq = ef_binner.fit_transform(customer_data)
   
   print("Equal-frequency binned data:")
   print(customer_data_eq_freq.head())

Comparing the Methods
~~~~~~~~~~~~~~~~~~~~~

Let's compare how the two methods distribute the samples:

.. code-block:: python

   # Compare bin distributions for the 'income' feature
   print("Sample distribution comparison for 'income' feature:")
   print("\nEqual-Width Binning:")
   ew_counts = customer_data_binned['income'].value_counts().sort_index()
   for bin_val, count in ew_counts.items():
       print(f"  Bin {bin_val}: {count} samples")
   
   print("\nEqual-Frequency Binning:")
   ef_counts = customer_data_eq_freq['income'].value_counts().sort_index()
   for bin_val, count in ef_counts.items():
       print(f"  Bin {bin_val}: {count} samples")

Working with Individual Features
--------------------------------

Sometimes you might want to bin only specific features:

.. code-block:: python

   # Bin only age and income
   selected_data = customer_data[['age', 'income']]
   
   binner = EqualWidthBinning(n_bins=3, preserve_dataframe=True)
   selected_binned = binner.fit_transform(selected_data)
   
   print("Binning only selected features:")
   print(selected_binned.head())

Custom Bin Ranges
------------------

You can also specify custom ranges for binning:

.. code-block:: python

   # Create a binner with custom range for age (18-80 years)
   custom_binner = EqualWidthBinning(
       n_bins=4,
       bin_range=(18, 80),  # Custom range for all features
       preserve_dataframe=True
   )
   
   # Apply only to age column
   age_data = customer_data[['age']]
   age_binned = custom_binner.fit_transform(age_data)
   
   print("Custom range binning for age:")
   print(f"Bin edges: {custom_binner.bin_edges_['age']}")
   print(f"Unique binned values: {sorted(age_binned['age'].unique())}")

Handling Missing Values and Outliers
------------------------------------

binlearn provides robust handling of missing values and outliers:

.. code-block:: python

   # Create data with some outliers and missing values
   noisy_data = customer_data.copy()
   
   # Add some outliers
   noisy_data.loc[0, 'income'] = 1000000  # Very high income
   noisy_data.loc[1, 'age'] = 150         # Impossible age
   
   # Add missing values
   noisy_data.loc[2:4, 'spending_score'] = np.nan
   
   print("Data with noise:")
   print(noisy_data.head())
   
   # Bin the noisy data
   robust_binner = EqualWidthBinning(
       n_bins=5,
       clip=True,  # Clip outliers to bin edges
       preserve_dataframe=True
   )
   
   try:
       noisy_binned = robust_binner.fit_transform(noisy_data)
       print("\nSuccessfully binned noisy data:")
       print(noisy_binned.head())
   except Exception as e:
       print(f"Error handling noisy data: {e}")

Saving and Loading Binners
---------------------------

You can save trained binners for later use:

.. code-block:: python

   import pickle
   
   # Train a binner
   production_binner = EqualWidthBinning(n_bins=5, preserve_dataframe=True)
   production_binner.fit(customer_data)
   
   # Save the trained binner
   with open('customer_binner.pkl', 'wb') as f:
       pickle.dump(production_binner, f)
   
   # Load and use the binner
   with open('customer_binner.pkl', 'rb') as f:
       loaded_binner = pickle.load(f)
   
   # Use the loaded binner on new data
   new_customer = pd.DataFrame({
       'age': [35],
       'income': [50000],
       'spending_score': [65],
       'account_balance': [3000]
   })
   
   new_customer_binned = loaded_binner.transform(new_customer)
   print("Binned new customer data:")
   print(new_customer_binned)

Next Steps
----------

Congratulations! You've completed your first binning project. You've learned how to:

- Create and apply equal-width and equal-frequency binning
- Visualize binning results
- Handle noisy data
- Save and load trained binners
- Work with DataFrames and individual features

**What to explore next:**

1. **Intermediate Tutorial**: Learn about K-means binning and supervised binning
2. **Advanced Tutorial**: Explore manual binning and flexible binning strategies
3. **sklearn Integration**: Use binning in machine learning pipelines
4. **Performance Tips**: Optimize binning for large datasets

**Key Takeaways:**

- Equal-width binning is simple but may create uneven sample distributions
- Equal-frequency binning creates more balanced distributions
- Always visualize your results to understand the impact of binning
- binlearn handles edge cases like missing values and outliers gracefully
- Trained binners can be saved and reused on new data

Ready for more? Check out the :doc:`intermediate_tutorial` to learn about more advanced binning methods!