Real Pandas for Industrial Applications: Finding Extra Records Between Two DataFrames

7/3/20252 min read

white concrete building during daytime
white concrete building during daytime

Working with real-world data almost always involves comparing datasets. You might be:

  • Checking for updates between two versions of a dataset

  • Verifying if two systems are synchronized

  • Finding mismatched or missing records

With Python and Pandas, comparing two DataFrames is both powerful and flexible. In this guide, you'll learn how to compare two DataFrames and identify:

  • Rows present in one DataFrame but not the other

  • Row-by-row differences

  • Differences by selected columns

๐Ÿ› ๏ธ Why Compare DataFrames?

Letโ€™s say you received a new export from a system and want to check what changed since the last export. This could mean:

  • New rows were added

  • Some rows were removed

  • Some values were updated

Pandas makes these tasks straightforward using functions like .merge(), .isin(), .equals(), and logical indexing.

๐Ÿ”ข Sample Data

import pandas as pd

# Original dataset

df1 = pd.DataFrame({

'id': [1, 2, 3, 4],

'name': ['Alice', 'Bob', 'Charlie', 'Eve'],

'age': [25, 30, 35, 40]

})

# Updated dataset

df2 = pd.DataFrame({

'id': [2, 3, 4, 5],

'name': ['Bob', 'Charlie', 'Eve', 'Frank'],

'age': [30, 36, 40, 45]

})

โœ… 1. Find Records in One DataFrame but Not the Other

We can use merge() with an indicator flag to find differences.

1.Records in df1 but not in df2:

diff1 = df1.merge(df2, how='outer', indicator=True).query('_merge == "left_only"').drop(columns=['_merge'])

print(diff1)

2.Records in df2 but not in df1:

diff2 = df2.merge(df1, how='outer', indicator=True).query('_merge == "left_only"').drop(columns=['_merge'])

print(diff2)

3.Compare Specific Columns for Differences

def find_differences(df1, df2, subset_cols):

# Compare only the selected columns

df1_subset = df1[subset_cols].drop_duplicates()

df2_subset = df2[subset_cols].drop_duplicates()

# Find rows in df1 not in df2

df1_only = df1_subset.merge(df2_subset, how='outer', indicator=True).query('_merge == "left_only"')

extra_in_df1 = df1.merge(df1_only[subset_cols], on=subset_cols, how='inner')

# Find rows in df2 not in df1

df2_only = df2_subset.merge(df1_subset, how='outer', indicator=True).query('_merge == "left_only"')

extra_in_df2 = df2.merge(df2_only[subset_cols], on=subset_cols, how='inner')

return extra_in_df1, extra_in_df2

#function calling

subset_cols = ['id', 'name']

extra1, extra2 = find_differences(df1, df2, subset_cols)

print("Extra in df1:\n", extra1)

print("Extra in df2:\n", extra2)

4. Compare Values Between Matching Rows

Sometimes rows exist in both DataFrames but with different values (e.g., age changed).

First, align on the common key (like 'id') and compare columns:

merged = df1.merge(df2, on='id', suffixes=('_old', '_new'))

# Check for changes in any field

changed = merged[(merged['name_old'] != merged['name_new']) | (merged['age_old'] != merged['age_new'])]

print("Changed rows:\n", changed)

๐Ÿ“Ž Final Thoughts

Comparing DataFrames is a key skill for anyone working with data โ€” whether you're syncing systems, testing pipelines, or validating inputs. With Pandas, it's both fast and flexible.

If you deal with large data or want to automate this process, you can extend these methods into a data validation script or notebook.