DataFrame Support and Column Handling

binlearn accepts NumPy arrays, pandas DataFrames, and polars DataFrames, and behaves consistently across all three while keeping column identity intact. This page explains how each format is handled, how preserve_dataframe controls the output type, and how columns are represented internally.

Overview of Data Format Support

binlearn supports three primary data formats:

NumPy Arrays

The foundation format: all internal operations run on NumPy arrays. Column identifiers are numeric indices (0, 1, 2, …).

Pandas DataFrames

Full support with column name preservation, index handling, and dtype consistency. Requires pandas installation.

Polars DataFrames

High-performance columnar support with column name preservation. Requires polars installation (optional dependency).

The Core Design Principle

binlearn follows a “format-agnostic processing, format-aware output” design:

  1. Input: Accept any supported format

  2. Processing: Convert to numpy arrays internally for consistent computation

  3. Output: Return data in the format specified by the preserve_dataframe setting

This approach ensures computational consistency while providing flexibility in data handling.
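
The three steps above can be sketched with plain pandas and NumPy. The helper names to_array_and_columns, restore_format, and process are hypothetical, and the floor-based step is only a placeholder for real binning. This is an illustration of the pattern under those assumptions, not binlearn's internal code, and it ignores the stored training columns described later on this page.

```python
import numpy as np
import pandas as pd


def to_array_and_columns(X):
    """Steps 1-2: accept a DataFrame or array, return (ndarray, column ids)."""
    if isinstance(X, pd.DataFrame):
        return X.to_numpy(), list(X.columns)
    arr = np.asarray(X)
    return arr, list(range(arr.shape[1]))


def restore_format(result, original, columns, preserve_dataframe):
    """Step 3: rebuild a DataFrame only when requested and the input was one."""
    if preserve_dataframe and isinstance(original, pd.DataFrame):
        return pd.DataFrame(result, columns=columns, index=original.index)
    return result


def process(X, preserve_dataframe):
    arr, columns = to_array_and_columns(X)
    # Placeholder computation standing in for an actual binning algorithm.
    binned = np.floor(arr / 10.0).astype(int)
    return restore_format(binned, X, columns, preserve_dataframe)


df = pd.DataFrame({"feature1": [1.0, 12.0, 25.0], "feature2": [30.0, 41.0, 59.0]})
as_frame = process(df, preserve_dataframe=True)
as_array = process(df, preserve_dataframe=False)
print(type(as_frame).__name__, list(as_frame.columns))
print(type(as_array).__name__, as_array.shape)
```

The computation in the middle never sees a DataFrame, so it behaves the same for every input format; only the last step depends on the requested output type.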

The preserve_dataframe Parameter

The preserve_dataframe parameter controls output format behavior:

from binlearn import EqualWidthBinning
import pandas as pd
import numpy as np

# Create sample data
df = pd.DataFrame({
    'feature1': [1.0, 2.0, 3.0, 4.0, 5.0],
    'feature2': [10.0, 20.0, 30.0, 40.0, 50.0]
})

# preserve_dataframe=True: Output matches input format
binner_preserve = EqualWidthBinning(n_bins=3, preserve_dataframe=True)
result_df = binner_preserve.fit_transform(df)
print(type(result_df))  # <class 'pandas.core.frame.DataFrame'>
print(result_df.columns.tolist())  # ['feature1', 'feature2']

# preserve_dataframe=False: Always output numpy arrays
binner_array = EqualWidthBinning(n_bins=3, preserve_dataframe=False)
result_array = binner_array.fit_transform(df)
print(type(result_array))  # <class 'numpy.ndarray'>

Global Configuration

The preserve_dataframe setting can be controlled globally:

import binlearn
from binlearn import EqualWidthBinning

# Check current setting
config = binlearn.get_config()
print(config.preserve_dataframe)  # Default: False

# Set globally
binlearn.set_config(preserve_dataframe=True)

# Binners created from now on preserve the DataFrame format unless told otherwise
binner = EqualWidthBinning(n_bins=5)  # picks up preserve_dataframe=True from the global config

Column Representation and Handling

binlearn tracks column identifiers explicitly, so the same columns are binned the same way regardless of input format or usage pattern.

Column Identification Priority

When determining column identifiers, binlearn follows this priority order:

  1. Column names from input data (DataFrames): ['feature1', 'feature2']

  2. Stored original columns (for fitted estimators): Maintains training consistency

  3. Numeric indices (arrays): [0, 1, 2]

  4. Generated indices (fallback): Based on data shape

# DataFrame input - column names extracted
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
binner = EqualWidthBinning(n_bins=2)
binner.fit(df)
# Internal representation uses: ['A', 'B']

# NumPy input (2 rows, 3 columns) - numeric indices generated
arr = np.array([[1, 2, 3], [4, 5, 6]])
binner = EqualWidthBinning(n_bins=2)
binner.fit(arr)
# Internal representation uses: [0, 1, 2]
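
The priority order can be written down as a small lookup function. The sketch below is an assumption about how such a rule might look, using a hypothetical helper named resolve_columns; it merges priorities 3 and 4 into a single shape-based fallback and is not taken from binlearn's source.

```python
import numpy as np
import pandas as pd


def resolve_columns(X, stored_columns=None):
    """Return column identifiers following the priority order described above."""
    # 1. Column names carried by the input itself (DataFrame-like objects).
    names = getattr(X, "columns", None)
    if names is not None:
        return list(names)
    # 2. Columns remembered from fitting, if the estimator stored any.
    if stored_columns is not None:
        return list(stored_columns)
    # 3./4. Positional indices derived from the data shape.
    n_columns = np.asarray(X).shape[1]
    return list(range(n_columns))


df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
arr = np.array([[1, 2, 3], [4, 5, 6]])

print(resolve_columns(df))                                    # ['A', 'B']
print(resolve_columns(arr, stored_columns=["x", "y", "z"]))   # ['x', 'y', 'z']
print(resolve_columns(arr))                                   # [0, 1, 2]
```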

Column Consistency Across Operations

binlearn maintains column consistency between fitting and transformation:

# Training with DataFrame
train_df = pd.DataFrame({
    'income': [30000, 45000, 60000, 80000, 120000],
    'age': [25, 35, 45, 55, 65]
})

binner = EqualWidthBinning(n_bins=3, preserve_dataframe=True)
binner.fit(train_df)

# Transform maintains column structure even with different input
test_data = np.array([[50000, 40], [90000, 50]])  # NumPy format
result = binner.transform(test_data)
# Result preserves training column structure as DataFrame

print(type(result))  # pandas.DataFrame
print(result.columns.tolist())  # ['income', 'age']

Internal Column Resolution

The column resolution system also handles the reverse mismatch, where an estimator fitted on an array later receives a DataFrame:

# Train with numeric columns (NumPy)
X_train = np.array([[1, 10], [2, 20], [3, 30]])
binner = EqualWidthBinning(n_bins=2)
binner.fit(X_train)  # Uses columns: [0, 1]

# Transform with named columns (DataFrame)
X_test = pd.DataFrame({'feature_0': [1.5], 'feature_1': [15]})
result = binner.transform(X_test)
# Automatic mapping: 'feature_0' -> 0, 'feature_1' -> 1

Advanced Column Handling

Guidance Column Separation

For supervised binning methods, binlearn automatically separates binning columns from guidance columns:

from binlearn import EqualWidthMinimumWeightBinning

# Data with features and weights
data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5],
    'feature2': [10, 20, 30, 40, 50],
    'sample_weight': [1.0, 2.0, 1.5, 3.0, 2.5]
})

# Specify which columns are for guidance
binner = EqualWidthMinimumWeightBinning(
    n_bins=3,
    minimum_weight=2.0,
    guidance_columns='sample_weight',  # This column provides weights
    preserve_dataframe=True
)

# Fit processes all columns but only bins feature columns
binner.fit(data)

# Transform only processes and outputs feature columns
result = binner.transform(data)
print(result.columns.tolist())  # ['feature1', 'feature2'] - no weight column
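
Conceptually, the separation amounts to splitting the input by column name before any bins are computed. The following pandas-only sketch shows that idea with a hypothetical split_guidance helper; how binlearn performs the split internally is not documented here, so treat this as an illustration only.

```python
import pandas as pd


def split_guidance(df, guidance_columns):
    """Split a DataFrame into (features, guidance) by column name."""
    if isinstance(guidance_columns, str):
        guidance_columns = [guidance_columns]
    missing = [c for c in guidance_columns if c not in df.columns]
    if missing:
        raise KeyError(f"guidance columns not found: {missing}")
    features = df.drop(columns=guidance_columns)  # columns that get binned
    guidance = df[guidance_columns]               # columns that steer the binning
    return features, guidance


data = pd.DataFrame({
    "feature1": [1, 2, 3, 4, 5],
    "feature2": [10, 20, 30, 40, 50],
    "sample_weight": [1.0, 2.0, 1.5, 3.0, 2.5],
})
features, guidance = split_guidance(data, "sample_weight")
print(features.columns.tolist())
print(guidance.columns.tolist())
```

Accepting either a single name or a list mirrors the guidance_columns='sample_weight' usage shown above.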

Column Key Resolution

binlearn handles different column identifier formats:

# Bin specifications can use different key formats
edges_dict = {
    'feature_0': [0, 1, 2, 3],    # String format
    1: [10, 20, 30, 40]           # Integer format
}

# Works with both NumPy arrays and DataFrames
binner = EqualWidthBinning(bin_edges=edges_dict, preserve_dataframe=True)

# Automatic key resolution during transformation
df_result = binner.transform(pd.DataFrame({
    'feature_0': [0.5, 1.5],
    'feature_1': [15, 25]
}))
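
One way such key resolution could work is to normalize every key to a common form before comparing. The sketch below assumes the 'feature_N' naming shown above maps to integer index N; normalize_key and lookup_edges are hypothetical names, and binlearn's actual resolution rules may differ.

```python
import re

_FEATURE_KEY = re.compile(r"feature_(\d+)")


def normalize_key(key):
    """Map 'feature_N' strings to the integer N; leave other keys unchanged."""
    if isinstance(key, str):
        match = _FEATURE_KEY.fullmatch(key)
        if match:
            return int(match.group(1))
    return key


def lookup_edges(edges, column):
    """Find bin edges for a column: exact key first, then the normalized form."""
    if column in edges:
        return edges[column]
    target = normalize_key(column)
    for key, value in edges.items():
        if normalize_key(key) == target:
            return value
    raise KeyError(column)


edges_dict = {"feature_0": [0, 1, 2, 3], 1: [10, 20, 30, 40]}
print(lookup_edges(edges_dict, "feature_0"))  # exact string match
print(lookup_edges(edges_dict, "feature_1"))  # resolved through integer key 1
print(lookup_edges(edges_dict, 0))            # resolved through 'feature_0'
```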

Implementation Details

Data Flow Architecture

The complete data flow follows this pattern:

Input Data (Any Format)
       ↓
prepare_input_with_columns()
       ↓
[numpy_array, column_list]
       ↓
Binning Operations (NumPy)
       ↓
return_like_input()
       ↓
Output (Format based on preserve_dataframe)

Key Functions

The core data handling functions are:

prepare_input_with_columns(X, fitted, original_columns)
  • Converts any input format to numpy array

  • Extracts or generates column identifiers

  • Maintains column consistency for fitted estimators

return_like_input(result, original_input, columns, preserve_dataframe)
  • Formats output to match desired format

  • Preserves column names and structure when requested

  • Handles pandas and polars DataFrame construction

convert_to_python_types(value)
  • Recursively converts numpy types to Python types

  • Essential for JSON serialization of fitted parameters

  • Handles nested structures (dicts, lists, arrays)
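
To show why the type conversion matters for serialization, here is a minimal stand-in written for this page. The function to_python_types and the sample params dictionary are assumptions for illustration; this is not the library's convert_to_python_types implementation.

```python
import json

import numpy as np


def to_python_types(value):
    """Recursively replace NumPy arrays and scalars with plain Python objects."""
    if isinstance(value, np.ndarray):
        return to_python_types(value.tolist())
    if isinstance(value, np.generic):
        return value.item()  # NumPy scalar -> Python int, float, bool, ...
    if isinstance(value, dict):
        return {to_python_types(k): to_python_types(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_python_types(v) for v in value]
    return value


# Hypothetical fitted parameters mixing NumPy and Python types.
params = {
    "bin_edges": {0: np.array([0.0, 1.5, 3.0]), "age": [np.float64(25.0), np.float64(65.0)]},
    "n_bins": np.int64(2),
}

clean = to_python_types(params)
print(json.dumps(clean))
```

Calling json.dumps on the unconverted params would raise a TypeError, because the standard encoder does not know how to serialize np.ndarray or np.int64 values.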

Memory and Performance Considerations

Format Conversion Overhead

  • DataFrame → NumPy: Moderate overhead for format conversion

  • NumPy Operations: Minimal overhead - native computation

  • NumPy → DataFrame: Moderate overhead for format reconstruction

  • Column Tracking: Minimal overhead for metadata management
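
How large the conversion overhead is depends on frame size and dtypes, so it is worth measuring on your own data. The sketch below times only the pandas/NumPy round trip on synthetic data; the frame shape, the timed helper, and the variable names are arbitrary choices for illustration, and no binlearn code is involved.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
large_df = pd.DataFrame(
    rng.normal(size=(200_000, 10)),
    columns=[f"f{i}" for i in range(10)],
)


def timed(fn, repeats=5):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


arr = large_df.to_numpy()
to_numpy_s = timed(lambda: large_df.to_numpy())
to_frame_s = timed(lambda: pd.DataFrame(arr, columns=large_df.columns))

print(f"DataFrame -> NumPy: {to_numpy_s * 1e3:.3f} ms")
print(f"NumPy -> DataFrame: {to_frame_s * 1e3:.3f} ms")
```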

Optimization Tips

import binlearn
from binlearn import EqualWidthBinning

# For performance-critical applications with large DataFrames
# (large_df stands for any large pandas DataFrame you already have)

# Option 1: Use preserve_dataframe=False to skip DataFrame reconstruction
binner = EqualWidthBinning(n_bins=5, preserve_dataframe=False)
result = binner.fit_transform(large_df)  # Returns NumPy array

# Option 2: Work with NumPy arrays directly
arr = large_df.to_numpy()
result = binner.fit_transform(arr)  # Avoids DataFrame conversion inside binlearn

# Option 3: Use global setting to avoid repeated parameter specification
binlearn.set_config(preserve_dataframe=False)

Best Practices

  1. Consistent Input Formats: Use the same format for fitting and transforming when possible

  2. Column Names: Use meaningful column names in DataFrames for better interpretability

  3. Global Configuration: Set preserve_dataframe globally for consistent behavior within a project

  4. Performance: Consider using NumPy arrays for performance-critical applications

  5. Mixed Formats: binlearn handles format mismatches, but consistency improves performance

# Good: Consistent formats
df_train = pd.DataFrame({'income': [...], 'age': [...]})
df_test = pd.DataFrame({'income': [...], 'age': [...]})

# Also good: Consistent NumPy usage for performance
X_train = np.array([[...], [...]])
X_test = np.array([[...], [...]])

# Works but less optimal: Mixed formats
df_train = pd.DataFrame({'income': [...], 'age': [...]})
X_test = np.array([[...], [...]])  # Format conversion required