Performance Tips
================

Tips for getting better performance from binlearn on large datasets and in complex workflows.

General Performance Guidelines
------------------------------

Choose the Right Binning Method
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Different binning methods have different performance characteristics:

**Fastest to Slowest:**

1. **EqualWidthBinning** - O(n) complexity, minimal computation
2. **EqualFrequencyBinning** - O(n log n) due to sorting
3. **ManualIntervalBinning** - O(n), but with validation overhead
4. **EqualWidthMinimumWeightBinning** - O(n), with weight constraints
5. **SingletonBinning** - O(n), with unique value processing overhead
6. **KMeansBinning** - O(n × k × iterations), iterative clustering
7. **SupervisedBinning** - O(n log n), decision tree overhead

.. code-block:: python

    import numpy as np
    import time
    from binlearn import EqualWidthBinning, KMeansBinning, SupervisedBinning

    # Large dataset for benchmarking
    n_samples = 100000
    n_features = 20
    X = np.random.rand(n_samples, n_features)
    y = np.random.randint(0, 2, n_samples)

    # Benchmark different methods
    methods = [
        ("EqualWidthBinning", EqualWidthBinning(n_bins=5)),
        ("KMeansBinning", KMeansBinning(n_bins=5, random_state=42)),
        ("SupervisedBinning", SupervisedBinning(n_bins=5, task_type='classification'))
    ]

    for name, binner in methods:
        start_time = time.time()
        if "Supervised" in name:
            # Supervised binning needs the target as guidance data
            binner.fit_transform(X, guidance_data=y)
        else:
            binner.fit_transform(X)
        elapsed = time.time() - start_time
        print(f"{name}: {elapsed:.2f}s")

Memory Optimization
-------------------

Use Appropriate Data Types
~~~~~~~~~~~~~~~~~~~~~~~~~~

Choose data types based on your precision needs:

.. code-block:: python

    import numpy as np
    from binlearn import EqualWidthBinning

    # Original data (float64 - 8 bytes per value)
    X_float64 = np.random.rand(1000000, 10)
    print(f"Float64 memory: {X_float64.nbytes / 1024**2:.1f} MB")

    # Reduced precision (float32 - 4 bytes per value)
    X_float32 = X_float64.astype(np.float32)
    print(f"Float32 memory: {X_float32.nbytes / 1024**2:.1f} MB")

    # binlearn works with both
    binner = EqualWidthBinning(n_bins=5)

    # Both will produce similar results
    result_64 = binner.fit_transform(X_float64)
    result_32 = binner.fit_transform(X_float32)

    print(f"Max difference: {np.max(np.abs(result_64 - result_32))}")

Process Large Datasets Efficiently
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For datasets that are too large to process in a single pass, fit on a sample and transform in chunks:

.. code-block:: python

    import numpy as np
    from binlearn import EqualWidthBinning

    def fit_on_sample_transform_in_chunks(X, binner, sample_ratio=0.1, chunk_size=10000):
        """Efficient processing for large datasets."""
        n_samples = X.shape[0]

        # Fit on a representative sample
        sample_size = int(n_samples * sample_ratio)
        sample_indices = np.random.choice(n_samples, sample_size, replace=False)
        X_sample = X[sample_indices]

        print(f"Fitting on sample of {sample_size} rows...")
        binner.fit(X_sample)

        # Transform in chunks
        print(f"Transforming {n_samples} rows in chunks of {chunk_size}...")
        results = []

        for i in range(0, n_samples, chunk_size):
            end_idx = min(i + chunk_size, n_samples)
            chunk = X[i:end_idx]
            chunk_result = binner.transform(chunk)
            results.append(chunk_result)

            # Report progress every 10 chunks
            if (i // chunk_size + 1) % 10 == 0:
                print(f"Processed {end_idx}/{n_samples} rows")

        return np.vstack(results)

    # Example usage
    X_large = np.random.rand(500000, 50)
    binner = EqualWidthBinning(n_bins=5)

    X_binned = fit_on_sample_transform_in_chunks(
        X_large, binner, sample_ratio=0.05, chunk_size=50000
    )

DataFrame Performance
---------------------

Optimize Pandas Operations
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    import pandas as pd
    import numpy as np
    from binlearn import EqualWidthBinning

    # Create large DataFrame
    n_rows = 1000000
    df = pd.DataFrame({
        f'feature_{i}': np.random.rand(n_rows) for i in range(20)
    })

    # Performance tip 1: Select only columns you need
    columns_to_bin = ['feature_0', 'feature_1', 'feature_2']
    df_subset = df[columns_to_bin]

    # Performance tip 2: Use preserve_dataframe=False for large datasets
    # if you don't need DataFrame output
    binner_fast = EqualWidthBinning(n_bins=5, preserve_dataframe=False)
    result_array = binner_fast.fit_transform(df_subset)  # Returns numpy array

    # Performance tip 3: Use preserve_dataframe=True only when needed
    binner_df = EqualWidthBinning(n_bins=5, preserve_dataframe=True)
    result_df = binner_df.fit_transform(df_subset)  # Returns DataFrame

Consider Polars for Large DataFrames
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For very large datasets, consider using Polars:

.. code-block:: python

    try:
        import polars as pl
        from binlearn import EqualWidthBinning

        # `df` is the pandas DataFrame created in the previous example.
        # Convert pandas DataFrame to Polars (more memory efficient)
        df_polars = pl.from_pandas(df)

        # binlearn supports Polars DataFrames
        binner = EqualWidthBinning(n_bins=5, preserve_dataframe=True)

        # Convert to pandas for binning (if needed), then back to Polars
        df_pandas = df_polars.to_pandas()
        result_pandas = binner.fit_transform(df_pandas)
        result_polars = pl.from_pandas(result_pandas)

        print(f"Polars result shape: {result_polars.shape}")

    except ImportError:
        print("Polars not available")

Pipeline Performance
--------------------

Optimize Sklearn Pipelines
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from binlearn import EqualWidthBinning, SingletonBinning
    import numpy as np
    import pandas as pd

    # Create a mixed dataset: continuous and discrete numeric columns.
    # The column names must match those passed to ColumnTransformer below.
    n_samples = 50000
    df = pd.DataFrame({
        'numeric1': np.random.normal(0, 1, n_samples),
        'numeric2': np.random.exponential(2, n_samples),
        'discrete1': np.random.randint(0, 5, n_samples),
        'discrete2': np.random.randint(0, 3, n_samples),
        'target': np.random.randint(0, 2, n_samples)
    })

    X = df.drop('target', axis=1)
    y = df['target']

    # Performance tip 1: Use ColumnTransformer for different column types
    preprocessor = ColumnTransformer([
        ('numeric', EqualWidthBinning(n_bins=5), ['numeric1', 'numeric2']),
        ('discrete', SingletonBinning(), ['discrete1', 'discrete2'])
    ], remainder='drop')

    # Performance tip 2: Choose efficient estimators
    pipeline = Pipeline([
        ('preprocessing', preprocessor),
        ('classifier', RandomForestClassifier(
            n_estimators=100,  # Reasonable number
            max_depth=10,      # Limit depth
            n_jobs=-1,         # Use all cores
            random_state=42
        ))
    ])

    # Performance tip 3: Use appropriate train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    pipeline.fit(X_train, y_train)
    score = pipeline.score(X_test, y_test)
    print(f"Pipeline accuracy: {score:.3f}")

Configuration Optimization
--------------------------

Global Configuration for Performance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    from binlearn import EqualWidthBinning, set_config, get_config

    # Check current configuration
    current_config = get_config()
    print("Current config:", current_config)

    # Optimize for performance
    set_config(
        preserve_dataframe=False,  # Faster for large datasets
        fit_jointly=False,         # More memory efficient
        clip=False                 # Skip outlier clipping if not needed
    )

    # All new binners will use these optimized defaults.
    # This binner will use the optimized configuration.
    fast_binner = EqualWidthBinning(n_bins=5)

Parallel Processing
-------------------

Leverage Multiple Cores
~~~~~~~~~~~~~~~~~~~~~~~

While binlearn doesn't directly support parallelization, you can parallelize at different levels. Measure before adopting this approach: for inexpensive methods such as ``EqualWidthBinning``, process start-up and data-copy overhead can outweigh the gain.

.. code-block:: python

    import numpy as np
    from joblib import Parallel, delayed
    from binlearn import EqualWidthBinning
    import time

    def bin_feature_subset(X_subset, n_bins=5):
        """Bin a subset of features."""
        binner = EqualWidthBinning(n_bins=n_bins)
        return binner.fit_transform(X_subset)

    # Large dataset with many features
    X = np.random.rand(10000, 100)

    # Method 1: Sequential processing
    start_time = time.time()
    binner = EqualWidthBinning(n_bins=5)
    X_binned_sequential = binner.fit_transform(X)
    sequential_time = time.time() - start_time
    print(f"Sequential time: {sequential_time:.2f}s")

    # Method 2: Parallel processing by feature groups
    start_time = time.time()
    n_jobs = 4
    features_per_job = X.shape[1] // n_jobs

    feature_groups = [
        X[:, i:i+features_per_job]
        for i in range(0, X.shape[1], features_per_job)
    ]

    # Process feature groups in parallel
    results = Parallel(n_jobs=n_jobs)(
        delayed(bin_feature_subset)(group)
        for group in feature_groups
    )

    X_binned_parallel = np.hstack(results)
    parallel_time = time.time() - start_time
    print(f"Parallel time: {parallel_time:.2f}s")
    print(f"Speedup: {sequential_time/parallel_time:.2f}x")

Caching and Preprocessing
-------------------------

Cache Expensive Operations
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    import pickle
    import os
    from binlearn import EqualWidthBinning
    import numpy as np

    def cached_binning(X, cache_file, n_bins=5, force_recompute=False):
        """Cache fitted binner to avoid recomputation.

        The cache is keyed by file name only: pass force_recompute=True
        (or delete the file) when X or n_bins changes.
        """
        if os.path.exists(cache_file) and not force_recompute:
            print("Loading cached binner...")
            with open(cache_file, 'rb') as f:
                binner = pickle.load(f)
        else:
            print("Computing and caching binner...")
            binner = EqualWidthBinning(n_bins=n_bins)
            binner.fit(X)

            with open(cache_file, 'wb') as f:
                pickle.dump(binner, f)

        return binner

    # Example usage
    X_train = np.random.rand(100000, 20)

    # First run: computes and caches
    binner = cached_binning(X_train, 'binner_cache.pkl')
    X_binned = binner.transform(X_train)

    # Subsequent runs: loads from cache
    binner = cached_binning(X_train, 'binner_cache.pkl')

Benchmarking and Profiling
--------------------------

Profile Your Code
~~~~~~~~~~~~~~~~~

.. code-block:: python

    import cProfile
    import pstats
    from binlearn import EqualWidthBinning
    import numpy as np

    def benchmark_binning():
        """Function to profile."""
        X = np.random.rand(100000, 50)
        binner = EqualWidthBinning(n_bins=10)
        return binner.fit_transform(X)

    # Profile the function
    profiler = cProfile.Profile()
    profiler.enable()

    result = benchmark_binning()

    profiler.disable()

    # Analyze results
    stats = pstats.Stats(profiler)
    stats.sort_stats('cumulative')
    stats.print_stats(10)  # Top 10 functions by cumulative time

Memory Profiling
~~~~~~~~~~~~~~~~

.. code-block:: python

    import tracemalloc
    import numpy as np
    from binlearn import EqualWidthBinning

    def memory_benchmark():
        """Monitor memory usage during binning."""
        tracemalloc.start()

        # Create large dataset
        X = np.random.rand(500000, 20)
        snapshot1 = tracemalloc.take_snapshot()

        # Perform binning
        binner = EqualWidthBinning(n_bins=5)
        X_binned = binner.fit_transform(X)
        snapshot2 = tracemalloc.take_snapshot()

        # Calculate memory difference
        top_stats = snapshot2.compare_to(snapshot1, 'lineno')

        print("Top memory allocations:")
        for stat in top_stats[:5]:
            print(stat)

        tracemalloc.stop()
        return X_binned

    result = memory_benchmark()

Best Practices Summary
----------------------

**For Large Datasets:**

1. Use ``EqualWidthBinning`` for fastest performance
2. Set ``preserve_dataframe=False`` if DataFrame output is not needed
3. Consider sample-based fitting for very large datasets
4. Use appropriate data types (float32 vs float64)

**For Complex Pipelines:**

1. Use ``ColumnTransformer`` for different column types
2. Cache fitted binners when possible
3. Choose efficient downstream estimators
4. Leverage parallel processing where applicable

**For Memory Efficiency:**

1. Process data in chunks if necessary
2. Use sparse matrices for high-dimensional sparse data
3. Consider Polars for very large DataFrames
4. Monitor memory usage with profiling tools

**Configuration:**

1. Set global configuration for consistent performance
2. Use ``fit_jointly=False`` for better memory efficiency
3. Disable ``clip`` if outlier handling is not needed

Following these guidelines will help you get optimal performance from binlearn in your specific use case.
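
When comparing methods, a single ``time.time()`` measurement like the ones in the benchmarks above can be noisy. The following is a minimal sketch, not part of binlearn, of a repeat-and-summarize timing helper built on the standard library's ``time.perf_counter``. The names ``time_call`` and ``workload`` are hypothetical; substitute a call such as ``binner.fit_transform(X)`` for the stand-in workload.

.. code-block:: python

    import time
    from statistics import median

    def time_call(func, *args, repeats=5, **kwargs):
        """Run func(*args, **kwargs) `repeats` times and summarize the timings."""
        if repeats < 1:
            raise ValueError("repeats must be at least 1")

        timings = []
        for _ in range(repeats):
            start = time.perf_counter()  # monotonic, high-resolution clock
            func(*args, **kwargs)
            timings.append(time.perf_counter() - start)

        # The best run is least affected by background noise;
        # the median shows what a typical run looks like.
        return {"best": min(timings), "median": median(timings), "runs": repeats}

    # Hypothetical stand-in workload; replace it with the binning call to measure
    def workload(n):
        return sorted(range(n, 0, -1))

    stats = time_call(workload, 100_000, repeats=5)
    print(f"best: {stats['best']:.4f}s, median: {stats['median']:.4f}s")

Reporting both the best and the median run makes it easier to tell a real difference between binning methods from measurement noise.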