Quick Start Guide
=================

This guide will get you up and running with binlearn in just a few minutes.

Basic Binning Example
----------------------

Let's start with a simple example using equal-width binning:

.. code-block:: python

    import numpy as np
    import pandas as pd
    from binlearn import EqualWidthBinning

    # Create sample data
    np.random.seed(42)
    data = pd.DataFrame({
        'age': np.random.normal(35, 10, 1000),
        'income': np.random.lognormal(10, 0.5, 1000),
        'score': np.random.uniform(0, 100, 1000)
    })

    # Create and fit the binner
    binner = EqualWidthBinning(n_bins=5, preserve_dataframe=True)
    data_binned = binner.fit_transform(data)

    print(f"Original shape: {data.shape}")
    print(f"Binned shape: {data_binned.shape}")
    print(f"Bin edges for age: {binner.bin_edges_['age']}")

This will output something like:

.. code-block:: text

    Original shape: (1000, 3)
    Binned shape: (1000, 3)
    Bin edges for age: [ 6.74 17.88 29.02 40.16 51.30 62.44]

Working with Different Data Types
----------------------------------

binlearn accepts both NumPy arrays and pandas DataFrames:

NumPy Arrays
~~~~~~~~~~~~

.. code-block:: python

    import numpy as np
    from binlearn import EqualWidthBinning

    # NumPy array
    X = np.random.rand(100, 3)
    binner = EqualWidthBinning(n_bins=4)
    X_binned = binner.fit_transform(X)

    print(f"Original: {X.shape}, Binned: {X_binned.shape}")

Pandas DataFrames
~~~~~~~~~~~~~~~~~

.. code-block:: python

    import numpy as np
    import pandas as pd
    from binlearn import EqualFrequencyBinning

    # Pandas DataFrame with preserved column names
    df = pd.DataFrame({
        'feature1': np.random.normal(0, 1, 100),
        'feature2': np.random.exponential(2, 100)
    })

    binner = EqualFrequencyBinning(n_bins=3, preserve_dataframe=True)
    df_binned = binner.fit_transform(df)

    print(df_binned.columns.tolist())  # ['feature1', 'feature2']

Exploring Different Binning Methods
------------------------------------

Equal-Frequency Binning
~~~~~~~~~~~~~~~~~~~~~~~~

Creates bins that each contain approximately the same number of samples:

.. code-block:: python

    import numpy as np
    from binlearn import EqualFrequencyBinning

    # Create skewed data
    X = np.random.exponential(2, (1000, 2))

    binner = EqualFrequencyBinning(n_bins=4)
    X_binned = binner.fit_transform(X)

    # Check bin counts
    unique, counts = np.unique(X_binned[:, 0], return_counts=True)
    print(f"Bin counts: {counts}")  # Should be approximately equal

K-Means Binning
~~~~~~~~~~~~~~~

Uses K-means clustering to determine bin boundaries:

.. code-block:: python

    import numpy as np
    from binlearn import KMeansBinning

    # Data with natural clusters
    X = np.concatenate([
        np.random.normal(0, 1, (200, 1)),
        np.random.normal(5, 1, (200, 1)),
        np.random.normal(10, 1, (200, 1))
    ])

    binner = KMeansBinning(n_bins=3, random_state=42)
    X_binned = binner.fit_transform(X)

    print(f"Bin edges: {binner.bin_edges_[0]}")

Supervised Binning
~~~~~~~~~~~~~~~~~~

Uses target information to place bin boundaries:

.. code-block:: python

    from binlearn import SupervisedBinning
    from sklearn.datasets import make_classification

    # Create classification dataset
    X, y = make_classification(
        n_samples=1000, n_features=4, n_classes=2, random_state=42
    )

    # Supervised binning considers the target variable
    sup_binner = SupervisedBinning(
        n_bins=4,
        task_type='classification',
        tree_params={'max_depth': 3}
    )
    X_binned = sup_binner.fit_transform(X, guidance_data=y)

    print(f"Supervised binning shape: {X_binned.shape}")

Numeric Discrete Data with SingletonBinning
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use ``SingletonBinning`` for columns that hold discrete numeric values, such as IDs or codes:

.. code-block:: python

    import pandas as pd
    from binlearn import SingletonBinning

    # Numeric discrete data
    numeric_discrete_data = pd.DataFrame({
        'category_id': [1, 2, 1, 3, 2, 1, 4],
        'rating_code': [1, 0, 1, 2, 0, 1, 3]
    })

    singleton_binner = SingletonBinning(preserve_dataframe=True)
    numeric_binned = singleton_binner.fit_transform(numeric_discrete_data)

    print(f"Original category IDs: {numeric_discrete_data['category_id'].unique()}")
    print(f"Binned shape: {numeric_binned.shape}")

Scikit-learn Integration
------------------------

binlearn transformers are fully compatible with scikit-learn pipelines:

.. code-block:: python

    from sklearn.datasets import make_classification
    from sklearn.pipeline import Pipeline
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from binlearn import EqualWidthBinning

    # Sample data
    X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Create pipeline with binning
    pipeline = Pipeline([
        ('binning', EqualWidthBinning(n_bins=5)),
        ('classifier', RandomForestClassifier(random_state=42))
    ])

    # Train and evaluate
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    print(f"Pipeline accuracy: {accuracy:.3f}")

Configuration Management
-------------------------

binlearn provides a global configuration so that the same defaults apply to every binner:

.. code-block:: python

    from binlearn import EqualWidthBinning, get_config, set_config

    # Check current configuration
    current_config = get_config()
    print(f"Current config: {current_config}")

    # Set global defaults
    set_config(
        preserve_dataframe=True,
        clip=True,
        fit_jointly=False
    )

    # Now all binners will use these defaults
    binner = EqualWidthBinning(n_bins=3)  # Will preserve DataFrames by default

Error Handling and Validation
------------------------------

binlearn validates both binner configuration and input data, and raises descriptive errors:

.. code-block:: python

    from binlearn import EqualWidthBinning
    from binlearn.utils.errors import ConfigurationError

    try:
        # This will raise a ConfigurationError
        binner = EqualWidthBinning(n_bins=0)  # Invalid: n_bins must be positive
    except ConfigurationError as e:
        print(f"Configuration error: {e}")

    try:
        # This will raise a ValidationError during fit
        binner = EqualWidthBinning(n_bins=5)
        binner.fit([[1, 2], [3]])  # Invalid: inconsistent array dimensions
    except Exception as e:  # Caught broadly here for brevity
        print(f"Validation error: {e}")

Next Steps
----------

Now that you have the basics down, explore:

1. **User Guide**: Detailed explanations of all binning methods
2. **Examples**: Real-world use cases and advanced techniques
3. **API Reference**: Complete documentation of all classes and functions
4. **Performance Tips**: Optimization strategies for large datasets

For more advanced usage, check out the :doc:`user_guide/index` or browse the :doc:`examples/index`.
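
Appendix: Equal-Width vs. Equal-Frequency Edges
-----------------------------------------------

To make the difference between the equal-width and equal-frequency strategies used in this guide more concrete, the sketch below computes both kinds of bin edges with plain NumPy. It is an illustration of the general techniques only, not binlearn's implementation: the helper names ``equal_width_edges``, ``equal_frequency_edges``, and ``assign_bins`` are hypothetical, and binlearn may handle edge cases (clipping, ties, missing values) differently.

```python
import numpy as np


def equal_width_edges(values, n_bins):
    """Return n_bins + 1 evenly spaced edges spanning the data range."""
    values = np.asarray(values, dtype=float)
    return np.linspace(values.min(), values.max(), n_bins + 1)


def equal_frequency_edges(values, n_bins):
    """Return n_bins + 1 edges placed at evenly spaced quantiles."""
    values = np.asarray(values, dtype=float)
    return np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))


def assign_bins(values, edges):
    """Map each value to a bin index in [0, n_bins - 1].

    Only the interior edges are passed to np.digitize, so the minimum
    falls in the first bin and the maximum in the last bin.
    """
    values = np.asarray(values, dtype=float)
    return np.digitize(values, edges[1:-1])


# Skewed sample data (illustrative only, not taken from binlearn)
rng = np.random.default_rng(42)
skewed = rng.exponential(2.0, size=1000)

width_edges = equal_width_edges(skewed, 4)
freq_edges = equal_frequency_edges(skewed, 4)

width_counts = np.bincount(assign_bins(skewed, width_edges), minlength=4)
freq_counts = np.bincount(assign_bins(skewed, freq_edges), minlength=4)

print("Equal-width counts:    ", width_counts)
print("Equal-frequency counts:", freq_counts)
```

On skewed data like this, the equal-width counts concentrate in the first bin, while the equal-frequency counts come out nearly identical. That trade-off is usually the deciding factor when choosing between the two strategies.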