Recently, I was working on a data processing pipeline that needed to replicate configurations across multiple entities. The script worked fine during development, but when we scaled to production data volumes, execution time became a bottleneck. What should have taken seconds was taking minutes.
Here’s how I identified the problem and fixed it with a simple but powerful optimization.
The Problem
The pipeline processes a CSV file containing thousands of rows and needs to find matching records between different subsets of data. The original code used pandas’ query() method to find matches:
import pandas as pd

MATCHING_COLUMNS = [
    'entity_id', 'external_id', 'name',
    'category', 'value', 'label', 'status', 'phase'
]

def find_matching_row(source_row, target_df, logger):
    """Original implementation - O(n) per lookup."""
    query_parts = []
    for col in MATCHING_COLUMNS:
        val = source_row[col]
        if pd.isna(val):
            query_parts.append(f"`{col}`.isnull()")
        elif isinstance(val, str):
            escaped_val = val.replace("'", "\\'")
            query_parts.append(f"`{col}` == '{escaped_val}'")
        else:
            query_parts.append(f"`{col}` == {val}")

    query = " and ".join(query_parts)

    try:
        matching_rows = target_df.query(query)
    except Exception as e:
        logger.error(f"Query failed: {e}")
        return None

    if matching_rows.empty:
        return None

    return matching_rows.iloc[0]
This looks reasonable, right? It dynamically builds a query string and uses pandas’ built-in query() method. But there’s a hidden performance trap here.
Why This Is Slow
Every call to df.query():
- Parses the query string: pandas needs to interpret the string expression
- Evaluates against every row: it scans the entire DataFrame (O(n) complexity)
- Handles edge cases: escaping, null checks, type coercion
When you call this function once, it’s barely noticeable. But in my case, I was calling it thousands of times in nested loops:
for entity in entities:          # ~30 entities
    for config in configs:       # ~50 configs per entity
        for item in items:       # ~10 items per config
            find_matching_row(item, target_df, logger)  # Called 15,000 times!
With a DataFrame of 5,000+ rows, each query scans all rows. That’s 75 million row comparisons just for the lookups.
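If you want to see the gap for yourself, a micro-benchmark along these lines makes the per-call cost visible. This is only a sketch: it uses synthetic data rather than the production pipeline and indexes on a single column to keep the example short.

import numpy as np
import pandas as pd
from timeit import timeit

# Synthetic stand-in for the production DataFrame (~5,000 rows)
df = pd.DataFrame({
    'entity_id': np.arange(5000),
    'name': [f'name_{i}' for i in range(5000)],
})

# O(n) per call: parse the expression, then scan every row
query_time = timeit(lambda: df.query("entity_id == 4999"), number=1000)

# Roughly O(1) per call: hash lookup against a prebuilt index
indexed = df.set_index('entity_id')
loc_time = timeit(lambda: indexed.loc[4999], number=1000)

print(f"query(): {query_time:.3f}s for 1,000 lookups")
print(f".loc[]:  {loc_time:.3f}s for 1,000 lookups")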
The Solution: Multi-Index Lookups
The fix was to pre-build an index and use O(1) hash-based lookups instead of O(n) scans.
Step 1: Create an Indexed DataFrame
def prepare_lookup_index(df):
    """
    Create a multi-index DataFrame for fast O(1) lookups.
    Returns tuple of (indexed_df, sentinel_value).
    """
    df_indexed = df.copy()

    # Handle NaN values with a sentinel (can't use NaN in index)
    sentinel = '__NAN_SENTINEL__'
    for col in MATCHING_COLUMNS:
        if col in df_indexed.columns:
            df_indexed[col] = df_indexed[col].fillna(sentinel)

    # Set multi-index for O(1) lookups
    # See pandas MultiIndex docs: https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.html
    df_indexed = df_indexed.set_index(MATCHING_COLUMNS, drop=False)
    df_indexed.index = pd.MultiIndex.from_tuples(
        df_indexed.index,
        names=MATCHING_COLUMNS
    )

    return df_indexed, sentinel
Step 2: Build Lookup Keys
def build_lookup_key(source_row, sentinel):
    """Build a tuple key for multi-index lookup."""
    key = []
    for col in MATCHING_COLUMNS:
        val = source_row[col]
        if pd.isna(val):
            key.append(sentinel)
        else:
            key.append(val)
    return tuple(key)
Step 3: Use Direct Index Access
def find_matching_row_fast(source_row, target_indexed_df, sentinel, logger):
    """
    Optimized implementation - O(1) per lookup.
    Uses hash-based index lookup via pandas `.loc[]` instead of query scanning.
    See: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
    """
    lookup_key = build_lookup_key(source_row, sentinel)

    try:
        matching_rows = target_indexed_df.loc[lookup_key]
    except KeyError:
        return None
    except Exception as e:
        logger.error(f"Lookup failed: {e}")
        return None

    # Handle single match (Series) vs multiple matches (DataFrame)
    if isinstance(matching_rows, pd.Series):
        return matching_rows
    if matching_rows.empty:
        return None
    return matching_rows.iloc[0]
Step 4: Prepare Index Once, Use Many Times
The key insight is to prepare the index once before the loop, not inside it:
# Prepare index ONCE before processing
target_indexed, sentinel = prepare_lookup_index(target_df)

for entity in entities:
    for config in configs:
        for item in items:
            # Now each lookup is O(1) instead of O(n)
            find_matching_row_fast(item, target_indexed, sentinel, logger)
Bonus: Dictionary Lookups for Simple Keys
For lookups based on just a few columns (like ID fields), a plain Python dictionary is even faster:
def prepare_dict_index(df):
    """
    Create a dictionary for O(1) lookups by composite key.
    """
    lookup_dict = {}
    for idx, row in df.iterrows():
        key = (
            str(row['entity_id']) if pd.notna(row['entity_id']) else '',
            str(row['item_id']) if pd.notna(row['item_id']) else '',
            str(row['config_id']) if pd.notna(row['config_id']) else ''
        )
        if key not in lookup_dict:
            lookup_dict[key] = row
    return lookup_dict

# Usage
source_dict = prepare_dict_index(source_df)

# O(1) lookup
row = source_dict.get(('123', '456', '789'))
Performance Results
I added timing instrumentation to measure the real-world impact:
import time

class ExecutionTimer:
    def __init__(self, logger):
        self.logger = logger
        self.timings = {}
        # Record the overall start here so print_summary() can compute the total
        self.start_time = time.perf_counter()

    def start(self, name: str):
        self.timings[name] = {'start': time.perf_counter()}

    def stop(self, name: str):
        if name in self.timings:
            self.timings[name]['elapsed'] = (
                time.perf_counter() - self.timings[name]['start']
            )

    def print_summary(self):
        total = time.perf_counter() - self.start_time
        print("=" * 60)
        print("EXECUTION TIME SUMMARY")
        print("=" * 60)
        for name, data in self.timings.items():
            elapsed = data['elapsed']
            pct = (elapsed / total * 100) if total > 0 else 0
            print(f" {name}: {elapsed:.3f}s ({pct:.1f}%)")
        print(f" TOTAL: {total:.3f}s")
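For context, this is roughly how the timer gets wired around the pipeline phases. The phase names match the summaries below; the input file name is a placeholder, and logger is assumed to be configured as in the earlier snippets.

timer = ExecutionTimer(logger)

timer.start('Load data')
target_df = pd.read_csv('input.csv')   # placeholder file name
timer.stop('Load data')

timer.start('Replicate to entities')
# ... nested loops calling find_matching_row_fast(...) ...
timer.stop('Replicate to entities')

timer.print_summary()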
Before Optimization
============================================================
EXECUTION TIME SUMMARY (Original)
============================================================
Load data: 0.045s (0.2%)
Generate combinations: 12.340s (52.1%)
Replicate to entities: 11.250s (47.5%)
Write output: 0.052s (0.2%)
------------------------------------------------------------
TOTAL: 23.687s
============================================================
After Optimization
============================================================
EXECUTION TIME SUMMARY (Optimized)
============================================================
Load data: 0.045s (0.4%)
Generate combinations: 12.340s (94.8%)
Index preparation: 0.089s (0.7%)
Replicate to entities: 0.480s (3.7%)
Write output: 0.052s (0.4%)
------------------------------------------------------------
TOTAL: 13.006s
============================================================
The replication step went from 11.25s to 0.48s — a 23x speedup!
The overall script execution improved by 45%, and the bottleneck shifted from DataFrame lookups to the actual data generation logic.
Key Takeaways
1. Profile Before Optimizing
I added very basic timing instrumentation to identify where time was actually being spent. Without measurement, I might have optimized the wrong thing. For more sophisticated profiling, tools like Python’s cProfile or line_profiler can provide detailed insights.
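As a rough sketch of what that looks like with the standard library (you can also just run python -m cProfile -s cumtime process_data.py from the shell); main() here is a placeholder for your script's driver function:

import cProfile
import pstats

# Profile the whole run and dump the stats to a file
cProfile.run('main()', 'profile.out')

# Show the ten entries with the largest cumulative time
stats = pstats.Stats('profile.out')
stats.sort_stats('cumulative').print_stats(10)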
2. Understand Your Data Access Patterns
- Single lookup: df.query() is fine
- Many lookups on the same DataFrame: pre-build an index
- Lookups by simple keys: use a dictionary
3. Trade Memory for Speed
The indexed DataFrame and dictionary use more memory, but the time savings are massive. In my case:
- Extra memory: ~2x the DataFrame size (see the measurement sketch after this list)
- Time savings: 23x faster
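If you want to sanity-check the memory overhead on your own data, pandas' memory_usage(deep=True) gives a reasonable estimate. This sketch reuses the variable names from the earlier examples:

import sys

original_mb = target_df.memory_usage(deep=True).sum() / 1e6
indexed_mb = target_indexed.memory_usage(deep=True).sum() / 1e6
print(f"Original DataFrame: {original_mb:.1f} MB")
print(f"Indexed DataFrame:  {indexed_mb:.1f} MB ({indexed_mb / original_mb:.1f}x)")

# The dict index mostly holds references to existing rows; sys.getsizeof
# reports only the dict container itself, not the rows it points to.
print(f"Dict container: {sys.getsizeof(source_dict) / 1e3:.1f} KB")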
4. Index Preparation Is an Investment
Building the index takes time (0.089s in my case), but it pays off after just a few lookups. The break-even point is around 20 lookups for my data size.
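A back-of-the-envelope way to estimate that break-even point for your own data is to time one build and one lookup of each kind. This is only a sketch; sample_row is a placeholder for any representative row.

import time

# One-off cost of building the index
t0 = time.perf_counter()
target_indexed, sentinel = prepare_lookup_index(target_df)
build_cost = time.perf_counter() - t0

# Per-lookup cost of each approach on a representative row
t0 = time.perf_counter()
find_matching_row(sample_row, target_df, logger)
slow_lookup = time.perf_counter() - t0

t0 = time.perf_counter()
find_matching_row_fast(sample_row, target_indexed, sentinel, logger)
fast_lookup = time.perf_counter() - t0

# Indexing pays off once the per-lookup savings cover the build cost
break_even = build_cost / (slow_lookup - fast_lookup)
print(f"Indexing pays off after roughly {break_even:.0f} lookups")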
5. Test Your Optimization
I kept the original implementation to validate that the optimized version produced identical outputs:
# Run both versions
python process_data_old.py --output old_output.json
python process_data.py --output new_output.json
# Compare outputs
diff <(cat old_output.json | python -m json.tool --sort-keys) \
<(cat new_output.json | python -m json.tool --sort-keys)
When NOT to Use This
This optimization isn’t always the right choice:
- Single or few lookups: The index preparation overhead isn’t worth it
- Frequently changing data: Rebuilding the index negates the benefits
- Memory-constrained environments: The extra memory usage might be prohibitive
- Simple column filtering: df[df['col'] == value] is already optimized
Conclusion
Sometimes the biggest performance wins come from understanding the algorithmic complexity of your code. The original df.query() approach was O(n) per lookup. With 15,000 lookups on a 5,000-row DataFrame, that’s 75 million operations.
By switching to indexed lookups, each operation became O(1), reducing those 75 million operations to just 15,000 hash lookups plus the one-time cost of building the index. This is similar to how database indexes work — they’re an upfront investment that pays dividends on repeated queries.
The lesson? When you’re doing repeated lookups on the same data, build an index first.
Code Reference
For convenience, here’s the full implementation as a reusable module.
# df_indexer.py - Reusable DataFrame indexing utilities
import pandas as pd
from typing import Dict, List, Tuple

def prepare_lookup_index(
    df: pd.DataFrame,
    index_columns: List[str],
    sentinel: str = '__NAN_SENTINEL__'
) -> Tuple[pd.DataFrame, str]:
    """
    Create a multi-index DataFrame for fast O(1) lookups.

    Args:
        df: Source DataFrame
        index_columns: Columns to use for the multi-index
        sentinel: Value to replace NaN (can't use NaN in index)

    Returns:
        Tuple of (indexed_df, sentinel_value)
    """
    df_indexed = df.copy()
    for col in index_columns:
        if col in df_indexed.columns:
            df_indexed[col] = df_indexed[col].fillna(sentinel)

    df_indexed = df_indexed.set_index(index_columns, drop=False)
    df_indexed.index = pd.MultiIndex.from_tuples(
        df_indexed.index,
        names=index_columns
    )
    return df_indexed, sentinel

def prepare_dict_index(
    df: pd.DataFrame,
    key_columns: List[str]
) -> Dict[Tuple, pd.Series]:
    """
    Create a dictionary for O(1) lookups by composite key.

    Args:
        df: Source DataFrame
        key_columns: Columns to use for the lookup key

    Returns:
        Dictionary mapping tuple keys to DataFrame rows
    """
    lookup_dict = {}
    for idx, row in df.iterrows():
        key = tuple(
            str(row[col]) if pd.notna(row[col]) else ''
            for col in key_columns
        )
        if key not in lookup_dict:
            lookup_dict[key] = row
    return lookup_dict

def build_lookup_key(
    row: pd.Series,
    key_columns: List[str],
    sentinel: str = '__NAN_SENTINEL__'
) -> Tuple:
    """Build a tuple key for multi-index lookup."""
    return tuple(
        sentinel if pd.isna(row[col]) else row[col]
        for col in key_columns
    )
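And a minimal usage sketch tying the module together. target_df, items, and MATCHING_COLUMNS follow the earlier examples; the lookup logic mirrors find_matching_row_fast above.

import pandas as pd

from df_indexer import prepare_lookup_index, build_lookup_key

# target_df, items, and MATCHING_COLUMNS come from the earlier examples
target_indexed, sentinel = prepare_lookup_index(target_df, MATCHING_COLUMNS)

for item in items:
    key = build_lookup_key(item, MATCHING_COLUMNS, sentinel)
    try:
        match = target_indexed.loc[key]
    except KeyError:
        match = None

    # Multiple rows can share a key: .loc returns a DataFrame, so take the first
    if isinstance(match, pd.DataFrame):
        match = None if match.empty else match.iloc[0]
    # ... use `match` here ...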
✏️ Personal Notes
This optimization came from a real-world scenario where a script that “worked fine” during development became painfully slow at production scale. The fix was straightforward once I knew where the issue was — but only after I took the time to actually measure where the time was going.
If there’s one thing to keep in mind, it’s this: add timing instrumentation before optimizing. Without data, you’re just guessing. And guessing usually means optimizing the wrong thing. Python’s cProfile or line_profiler can help identify these bottlenecks without manual instrumentation.
The pandas query() method is convenient and readable, and for single lookups it’s perfectly fine. The problem isn’t the tool; it’s using the wrong tool for the job. Context matters.
I kept the old implementation around to validate that the new one produced identical outputs for real-world data. Testing optimizations is just as important as making them. A faster wrong answer isn’t a win.
This pattern isn’t unique to pandas. The same principle applies to database indexes, hash maps, and any repeated lookup scenario. When you’re searching the same data multiple times, build an index first.
This optimization isn’t always worth it — if you’re only doing a handful of lookups, the index preparation overhead isn’t justified. Measure first, then decide.
At the end of the day, performance optimization is about trade-offs: memory vs. speed, complexity vs. maintainability, development time vs. execution time. There’s no universal answer — just the right answer for your situation. This aligns with the value quadrant framework I’ve written about before.
Because sometimes, the biggest wins are hiding in the code you wrote six months ago and never looked at again.