Real Pandas for Industrial Applications: Finding Extra Records Between Two DataFrames

7/3/20252 min read

Working with real-world data almost always involves comparing datasets. You might be:

Checking for updates between two versions of a dataset
Verifying if two systems are synchronized
Finding mismatched or missing records

With Python and Pandas, comparing two DataFrames is both powerful and flexible. In this guide, you'll learn how to compare two DataFrames and identify:

Rows present in one DataFrame but not the other
Row-by-row differences
Differences by selected columns

🛠️ Why Compare DataFrames?

Let’s say you received a new export from a system and want to check what changed since the last export. This could mean:

New rows were added
Some rows were removed
Some values were updated

Pandas makes these tasks straightforward using functions like .merge(), .isin(), .equals(), and logical indexing.

🔢 Sample Data

import pandas as pd

# Original dataset

df1 = pd.DataFrame({

'id': [1, 2, 3, 4],

'name': ['Alice', 'Bob', 'Charlie', 'Eve'],

'age': [25, 30, 35, 40]

})

# Updated dataset

df2 = pd.DataFrame({

'id': [2, 3, 4, 5],

'name': ['Bob', 'Charlie', 'Eve', 'Frank'],

'age': [30, 36, 40, 45]

})

✅ 1. Find Records in One DataFrame but Not the Other

We can use merge() with an indicator flag to find differences.

1.Records in df1 but not in df2:

diff1 = df1.merge(df2, how='outer', indicator=True).query('_merge == "left_only"').drop(columns=['_merge'])

print(diff1)

2.Records in df2 but not in df1:

diff2 = df2.merge(df1, how='outer', indicator=True).query('_merge == "left_only"').drop(columns=['_merge'])

print(diff2)

3.Compare Specific Columns for Differences

def find_differences(df1, df2, subset_cols):

# Compare only the selected columns

df1_subset = df1[subset_cols].drop_duplicates()

df2_subset = df2[subset_cols].drop_duplicates()

# Find rows in df1 not in df2

df1_only = df1_subset.merge(df2_subset, how='outer', indicator=True).query('_merge == "left_only"')

extra_in_df1 = df1.merge(df1_only[subset_cols], on=subset_cols, how='inner')

# Find rows in df2 not in df1

df2_only = df2_subset.merge(df1_subset, how='outer', indicator=True).query('_merge == "left_only"')

extra_in_df2 = df2.merge(df2_only[subset_cols], on=subset_cols, how='inner')

return extra_in_df1, extra_in_df2

#function calling

subset_cols = ['id', 'name']

extra1, extra2 = find_differences(df1, df2, subset_cols)

print("Extra in df1:\n", extra1)

print("Extra in df2:\n", extra2)

4. Compare Values Between Matching Rows

Sometimes rows exist in both DataFrames but with different values (e.g., age changed).

First, align on the common key (like 'id') and compare columns:

merged = df1.merge(df2, on='id', suffixes=('_old', '_new'))

# Check for changes in any field

changed = merged[(merged['name_old'] != merged['name_new']) | (merged['age_old'] != merged['age_new'])]

print("Changed rows:\n", changed)

📎 Final Thoughts

Comparing DataFrames is a key skill for anyone working with data — whether you're syncing systems, testing pipelines, or validating inputs. With Pandas, it's both fast and flexible.

If you deal with large data or want to automate this process, you can extend these methods into a data validation script or notebook.

Let's Get in Touch

marketing@deepaiautomation.com

+91 6309397994

Industries

Manufacturing

Financial Services

Retail

Solutions

DocuMind AI - Intelligent Document Migration

InsightEdge AI - Intelligent Power BI Reporting & Analytics

DCT AI - Digital Control Tower for Intelligent Enterprise Visibility

Inventra AI - Intelligent Inventory Optimization Platform

Maintenix AI - Predictive Maintenance Intelligence Platform

PayPredict AI - Intelligent Customer Payment Prediction Platform

SegMind AI - Intelligent Customer Segmentation & RFM Analytics Platform

DataForge AI - Intelligent ETL & Analytics Modernization Platform

DataSense AI - Intelligent Data Quality & Outlier Detection Agent

Cookies Policy

Terms & Conditions