Recently, I was working on a data processing pipeline that needed to replicate configurations across multiple entities. The script worked fine during development, but when we scaled to production data volumes, execution time became a bottleneck. What should have taken seconds was taking minutes.
Here’s how I identified the problem and fixed it with a simple but powerful optimization.
The Problem
The pipeline processes a CSV file containing thousands of rows and needs to find matching records between different subsets of data. The original code used pandas’ query() method to find matches:
import pandas as pd

MATCHING_COLUMNS = [
    'entity_id', 'external_id', 'name',
    'category', 'value', 'label', 'status', 'phase'
]

def find_matching_row(source_row, target_df, logger):
    """Original implementation - O(n) per lookup."""
    query_parts = []
    for col in MATCHING_COLUMNS:
        val = source_row[col]
        if pd.isna(val):
            query_parts.append(f"`{col}`.isnull()")
        elif isinstance(val, str):
            escaped_val = val.replace("'", "\\'")
            query_parts.append(f"`{col}` == '{escaped_val}'")
        else:
            query_parts.append(f"`{col}` == {val}")

    query = " and ".join(query_parts)

    try:
        matching_rows = target_df.query(query)
    except Exception as e:
        logger.error(f"Query failed: {e}")
        return None

    if matching_rows.empty:
        return None

    return matching_rows.iloc[0]
This looks reasonable, right? It dynamically builds a query string and uses pandas’ built-in query() method. But there’s a hidden performance trap here.
Why This Is Slow
Every call to df.query():
- Parses the query string: pandas needs to interpret the string expression
- Evaluates against every row: it scans the entire DataFrame (O(n) complexity)
- Handles edge cases: escaping, null checks, type coercion
When you call this function once, it’s barely noticeable. But in my case, I was calling it thousands of times in nested loops:
for entity in entities:          # ~30 entities
    for config in configs:       # ~50 configs per entity
        for item in items:       # ~10 items per config
            find_matching_row(item, target_df, logger)  # Called 15,000 times!
With a DataFrame of 5,000+ rows, each query scans all rows. That’s 75 million row comparisons just for the lookups.
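If you want to see the gap for yourself, a micro-benchmark along these lines makes the per-call cost visible. This is only a sketch: it uses synthetic data rather than the production pipeline and indexes on a single column to keep the example short.

import numpy as np
import pandas as pd
from timeit import timeit

# Synthetic stand-in for the production DataFrame (~5,000 rows)
df = pd.DataFrame({
    'entity_id': np.arange(5000),
    'name': [f'name_{i}' for i in range(5000)],
})

# O(n) per call: parse the expression, then scan every row
query_time = timeit(lambda: df.query("entity_id == 4999"), number=1000)

# Roughly O(1) per call: hash lookup against a prebuilt index
indexed = df.set_index('entity_id')
loc_time = timeit(lambda: indexed.loc[4999], number=1000)

print(f"query(): {query_time:.3f}s for 1,000 lookups")
print(f".loc[]:  {loc_time:.3f}s for 1,000 lookups")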
The Solution: Multi-Index Lookups
The fix was to pre-build an index and use O(1) hash-based lookups instead of O(n) scans.
Step 1: Create an Indexed DataFrame
def prepare_lookup_index(df):
    """
    Create a multi-index DataFrame for fast O(1) lookups.
    Returns tuple of (indexed_df, sentinel_value).
    """
    df_indexed = df.copy()

    # Handle NaN values with a sentinel (can't use NaN in index)
    sentinel = '__NAN_SENTINEL__'
    for col in MATCHING_COLUMNS:
        if col in df_indexed.columns:
            df_indexed[col] = df_indexed[col].fillna(sentinel)

    # Set multi-index for O(1) lookups
    # See pandas MultiIndex docs: https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.html
    df_indexed = df_indexed.set_index(MATCHING_COLUMNS, drop=False)
    df_indexed.index = pd.MultiIndex.from_tuples(
        df_indexed.index,
        names=MATCHING_COLUMNS
    )

    return df_indexed, sentinel
Step 2: Build Lookup Keys
def build_lookup_key(source_row, sentinel):
    """Build a tuple key for multi-index lookup."""
    key = []
    for col in MATCHING_COLUMNS:
        val = source_row[col]
        if pd.isna(val):
            key.append(sentinel)
        else:
            key.append(val)
    return tuple(key)
Step 3: Use Direct Index Access
def find_matching_row_fast(source_row, target_indexed_df, sentinel, logger):
    """
    Optimized implementation - O(1) per lookup.
    Uses hash-based index lookup via pandas `.loc[]` instead of query scanning.
    See: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html
    """
    lookup_key = build_lookup_key(source_row, sentinel)

    try:
        matching_rows = target_indexed_df.loc[lookup_key]
    except KeyError:
        return None
    except Exception as e:
        logger.error(f"Lookup failed: {e}")
        return None

    # Handle single match (Series) vs multiple matches (DataFrame)
    if isinstance(matching_rows, pd.Series):
        return matching_rows
    if matching_rows.empty:
        return None
    return matching_rows.iloc[0]
Step 4: Prepare Index Once, Use Many Times
The key insight is to prepare the index once before the loop, not inside it:
# Prepare index ONCE before processing
target_indexed, sentinel = prepare_lookup_index(target_df)

for entity in entities:
    for config in configs:
        for item in items:
            # Now each lookup is O(1) instead of O(n)
            find_matching_row_fast(item, target_indexed, sentinel, logger)
Bonus: Dictionary Lookups for Simple Keys
For lookups based on just a few columns (like ID fields), a plain Python dictionary is even faster:
def prepare_dict_index(df):
    """
    Create a dictionary for O(1) lookups by composite key.
    """
    lookup_dict = {}
    for idx, row in df.iterrows():
        key = (
            str(row['entity_id']) if pd.notna(row['entity_id']) else '',
            str(row['item_id']) if pd.notna(row['item_id']) else '',
            str(row['config_id']) if pd.notna(row['config_id']) else ''
        )
        if key not in lookup_dict:
            lookup_dict[key] = row
    return lookup_dict

# Usage
source_dict = prepare_dict_index(source_df)

# O(1) lookup
row = source_dict.get(('123', '456', '789'))
Performance Results
I added timing instrumentation to measure the real-world impact:
import time

class ExecutionTimer:
    def __init__(self, logger):
        self.logger = logger
        self.timings = {}
        # Record the overall start here so print_summary() can compute the total
        self.start_time = time.perf_counter()

    def start(self, name: str):
        self.timings[name] = {'start': time.perf_counter()}

    def stop(self, name: str):
        if name in self.timings:
            self.timings[name]['elapsed'] = (
                time.perf_counter() - self.timings[name]['start']
            )

    def print_summary(self):
        total = time.perf_counter() - self.start_time
        print("=" * 60)
        print("EXECUTION TIME SUMMARY")
        print("=" * 60)
        for name, data in self.timings.items():
            elapsed = data['elapsed']
            pct = (elapsed / total * 100) if total > 0 else 0
            print(f" {name}: {elapsed:.3f}s ({pct:.1f}%)")
        print(f" TOTAL: {total:.3f}s")
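For context, this is roughly how the timer gets wired around the pipeline phases. The phase names match the summaries below; the input file name is a placeholder, and logger is assumed to be configured as in the earlier snippets.

timer = ExecutionTimer(logger)

timer.start('Load data')
target_df = pd.read_csv('input.csv')   # placeholder file name
timer.stop('Load data')

timer.start('Replicate to entities')
# ... nested loops calling find_matching_row_fast(...) ...
timer.stop('Replicate to entities')

timer.print_summary()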
Before Optimization
============================================================
EXECUTION TIME SUMMARY (Original)
============================================================
Load data: 0.045s (0.2%)
Generate combinations: 12.340s (52.1%)
Replicate to entities: 11.250s (47.5%)
Write output: 0.052s (0.2%)
------------------------------------------------------------
TOTAL: 23.687s
============================================================
After Optimization
============================================================
EXECUTION TIME SUMMARY (Optimized)
============================================================
Load data: 0.045s (0.4%)
Generate combinations: 12.340s (94.8%)
Index preparation: 0.089s (0.7%)
Replicate to entities: 0.480s (3.7%)
Write output: 0.052s (0.4%)
------------------------------------------------------------
TOTAL: 13.006s
============================================================
The replication step went from 11.25s to 0.48s — a 23x speedup!
The overall script execution improved by 45%, and the bottleneck shifted from DataFrame lookups to the actual data generation logic.
Key Takeaways
1. Profile Before Optimizing
I added very basic timing instrumentation to identify where time was actually being spent. Without measurement, I might have optimized the wrong thing. For more sophisticated profiling, tools like Python’s cProfile or line_profiler can provide detailed insights.
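As a rough sketch of what that looks like with the standard library (you can also just run python -m cProfile -s cumtime process_data.py from the shell); main() here is a placeholder for your script's driver function:

import cProfile
import pstats

# Profile the whole run and dump the stats to a file
cProfile.run('main()', 'profile.out')

# Show the ten entries with the largest cumulative time
stats = pstats.Stats('profile.out')
stats.sort_stats('cumulative').print_stats(10)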
2. Understand Your Data Access Patterns
- Single lookup: df.query() is fine
- Many lookups on the same DataFrame: pre-build an index
- Lookups by simple keys: use a dictionary
3. Trade Memory for Speed
The indexed DataFrame and dictionary use more memory, but the time savings are massive. In my case:
- Extra memory: ~2x the DataFrame size (see the measurement sketch after this list)
- Time savings: 23x faster
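If you want to sanity-check the memory overhead on your own data, pandas' memory_usage(deep=True) gives a reasonable estimate. This sketch reuses the variable names from the earlier examples:

import sys

original_mb = target_df.memory_usage(deep=True).sum() / 1e6
indexed_mb = target_indexed.memory_usage(deep=True).sum() / 1e6
print(f"Original DataFrame: {original_mb:.1f} MB")
print(f"Indexed DataFrame:  {indexed_mb:.1f} MB ({indexed_mb / original_mb:.1f}x)")

# The dict index mostly holds references to existing rows; sys.getsizeof
# reports only the dict container itself, not the rows it points to.
print(f"Dict container: {sys.getsizeof(source_dict) / 1e3:.1f} KB")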
4. Index Preparation Is an Investment
Building the index takes time (0.089s in my case), but it pays off after just a few lookups. The break-even point is around 20 lookups for my data size.
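A back-of-the-envelope way to estimate that break-even point for your own data is to time one build and one lookup of each kind. This is only a sketch; sample_row is a placeholder for any representative row.

import time

# One-off cost of building the index
t0 = time.perf_counter()
target_indexed, sentinel = prepare_lookup_index(target_df)
build_cost = time.perf_counter() - t0

# Per-lookup cost of each approach on a representative row
t0 = time.perf_counter()
find_matching_row(sample_row, target_df, logger)
slow_lookup = time.perf_counter() - t0

t0 = time.perf_counter()
find_matching_row_fast(sample_row, target_indexed, sentinel, logger)
fast_lookup = time.perf_counter() - t0

# Indexing pays off once the per-lookup savings cover the build cost
break_even = build_cost / (slow_lookup - fast_lookup)
print(f"Indexing pays off after roughly {break_even:.0f} lookups")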
5. Test Your Optimization
I kept the original implementation to validate that the optimized version produced identical outputs:
# Run both versions
python process_data_old.py --output old_output.json
python process_data.py --output new_output.json
# Compare outputs
diff <(cat old_output.json | python -m json.tool --sort-keys) \
<(cat new_output.json | python -m json.tool --sort-keys)
When NOT to Use This
This optimization isn’t always the right choice:
- Single or few lookups: The index preparation overhead isn’t worth it
- Frequently changing data: Rebuilding the index negates the benefits
- Memory-constrained environments: The extra memory usage might be prohibitive
- Simple column filtering: df[df['col'] == value] is already optimized
Conclusion
Sometimes the biggest performance wins come from understanding the algorithmic complexity of your code. The original df.query() approach was O(n) per lookup. With 15,000 lookups on a 5,000-row DataFrame, that’s 75 million operations.
By switching to indexed lookups, each operation became O(1), reducing those 75 million operations to just 15,000 hash lookups plus the one-time cost of building the index. This is similar to how database indexes work — they’re an upfront investment that pays dividends on repeated queries.
The lesson? When you’re doing repeated lookups on the same data, build an index first.
Code Reference
For convenience, here’s the full implementation as a reusable module.
# df_indexer.py - Reusable DataFrame indexing utilities
import pandas as pd
from typing import Dict, List, Tuple

def prepare_lookup_index(
    df: pd.DataFrame,
    index_columns: List[str],
    sentinel: str = '__NAN_SENTINEL__'
) -> Tuple[pd.DataFrame, str]:
    """
    Create a multi-index DataFrame for fast O(1) lookups.

    Args:
        df: Source DataFrame
        index_columns: Columns to use for the multi-index
        sentinel: Value to replace NaN (can't use NaN in index)

    Returns:
        Tuple of (indexed_df, sentinel_value)
    """
    df_indexed = df.copy()
    for col in index_columns:
        if col in df_indexed.columns:
            df_indexed[col] = df_indexed[col].fillna(sentinel)

    df_indexed = df_indexed.set_index(index_columns, drop=False)
    df_indexed.index = pd.MultiIndex.from_tuples(
        df_indexed.index,
        names=index_columns
    )
    return df_indexed, sentinel

def prepare_dict_index(
    df: pd.DataFrame,
    key_columns: List[str]
) -> Dict[Tuple, pd.Series]:
    """
    Create a dictionary for O(1) lookups by composite key.

    Args:
        df: Source DataFrame
        key_columns: Columns to use for the lookup key

    Returns:
        Dictionary mapping tuple keys to DataFrame rows
    """
    lookup_dict = {}
    for idx, row in df.iterrows():
        key = tuple(
            str(row[col]) if pd.notna(row[col]) else ''
            for col in key_columns
        )
        if key not in lookup_dict:
            lookup_dict[key] = row
    return lookup_dict

def build_lookup_key(
    row: pd.Series,
    key_columns: List[str],
    sentinel: str = '__NAN_SENTINEL__'
) -> Tuple:
    """Build a tuple key for multi-index lookup."""
    return tuple(
        sentinel if pd.isna(row[col]) else row[col]
        for col in key_columns
    )
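And a minimal usage sketch tying the module together. target_df, items, and MATCHING_COLUMNS follow the earlier examples; the lookup logic mirrors find_matching_row_fast above.

import pandas as pd

from df_indexer import prepare_lookup_index, build_lookup_key

# target_df, items, and MATCHING_COLUMNS come from the earlier examples
target_indexed, sentinel = prepare_lookup_index(target_df, MATCHING_COLUMNS)

for item in items:
    key = build_lookup_key(item, MATCHING_COLUMNS, sentinel)
    try:
        match = target_indexed.loc[key]
    except KeyError:
        match = None

    # Multiple rows can share a key: .loc returns a DataFrame, so take the first
    if isinstance(match, pd.DataFrame):
        match = None if match.empty else match.iloc[0]
    # ... use `match` here ...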
✏️ Personal Notes
This optimization came from a real-world scenario where a script that “worked fine” during development became painfully slow at production scale. The fix was straightforward once I knew where the issue was — but only after I took the time to actually measure where the time was going.
If there’s one thing to keep in mind, it’s this: add timing instrumentation before optimizing. Without data, you’re just guessing. And guessing usually means optimizing the wrong thing. Python’s cProfile or line_profiler can help identify these bottlenecks without manual instrumentation.
The pandas query() method is convenient and readable, and for single lookups it’s perfectly fine. The problem isn’t the tool; it’s using the wrong tool for the job. Context matters.
I kept the old implementation around to validate that the new one produced identical outputs for real-world data. Testing optimizations is just as important as making them. A faster wrong answer isn’t a win.
This pattern isn’t unique to pandas. The same principle applies to database indexes, hash maps, and any repeated lookup scenario. When you’re searching the same data multiple times, build an index first.
This optimization isn’t always worth it — if you’re only doing a handful of lookups, the index preparation overhead isn’t justified. Measure first, then decide.
At the end of the day, performance optimization is about trade-offs: memory vs. speed, complexity vs. maintainability, development time vs. execution time. There’s no universal answer — just the right answer for your situation. This aligns with the value quadrant framework I’ve written about before.
Because sometimes, the biggest wins are hiding in the code you wrote six months ago and never looked at again.