Real Pandas for Industrial Applications: Finding Extra Records Between Two DataFrames
7/3/20252 min read
Working with real-world data almost always involves comparing datasets. You might be:
Checking for updates between two versions of a dataset
Verifying if two systems are synchronized
Finding mismatched or missing records
With Python and Pandas, comparing two DataFrames is both powerful and flexible. In this guide, you'll learn how to compare two DataFrames and identify:
Rows present in one DataFrame but not the other
Row-by-row differences
Differences by selected columns
๐ ๏ธ Why Compare DataFrames?
Letโs say you received a new export from a system and want to check what changed since the last export. This could mean:
New rows were added
Some rows were removed
Some values were updated
Pandas makes these tasks straightforward using functions like .merge(), .isin(), .equals(), and logical indexing.
๐ข Sample Data
import pandas as pd
# Original dataset
df1 = pd.DataFrame({
'id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'Eve'],
'age': [25, 30, 35, 40]
})
# Updated dataset
df2 = pd.DataFrame({
'id': [2, 3, 4, 5],
'name': ['Bob', 'Charlie', 'Eve', 'Frank'],
'age': [30, 36, 40, 45]
})
โ 1. Find Records in One DataFrame but Not the Other
We can use merge() with an indicator flag to find differences.
1.Records in df1 but not in df2:
diff1 = df1.merge(df2, how='outer', indicator=True).query('_merge == "left_only"').drop(columns=['_merge'])
print(diff1)
2.Records in df2 but not in df1:
diff2 = df2.merge(df1, how='outer', indicator=True).query('_merge == "left_only"').drop(columns=['_merge'])
print(diff2)
3.Compare Specific Columns for Differences
def find_differences(df1, df2, subset_cols):
# Compare only the selected columns
df1_subset = df1[subset_cols].drop_duplicates()
df2_subset = df2[subset_cols].drop_duplicates()
# Find rows in df1 not in df2
df1_only = df1_subset.merge(df2_subset, how='outer', indicator=True).query('_merge == "left_only"')
extra_in_df1 = df1.merge(df1_only[subset_cols], on=subset_cols, how='inner')
# Find rows in df2 not in df1
df2_only = df2_subset.merge(df1_subset, how='outer', indicator=True).query('_merge == "left_only"')
extra_in_df2 = df2.merge(df2_only[subset_cols], on=subset_cols, how='inner')
return extra_in_df1, extra_in_df2
#function calling
subset_cols = ['id', 'name']
extra1, extra2 = find_differences(df1, df2, subset_cols)
print("Extra in df1:\n", extra1)
print("Extra in df2:\n", extra2)
4. Compare Values Between Matching Rows
Sometimes rows exist in both DataFrames but with different values (e.g., age changed).
First, align on the common key (like 'id') and compare columns:
merged = df1.merge(df2, on='id', suffixes=('_old', '_new'))
# Check for changes in any field
changed = merged[(merged['name_old'] != merged['name_new']) | (merged['age_old'] != merged['age_new'])]
print("Changed rows:\n", changed)
๐ Final Thoughts
Comparing DataFrames is a key skill for anyone working with data โ whether you're syncing systems, testing pipelines, or validating inputs. With Pandas, it's both fast and flexible.
If you deal with large data or want to automate this process, you can extend these methods into a data validation script or notebook.