Data Pipelines | To merge, or not to merge

In recent years, data has become increasingly streaming-centric. Online transactions, website clicks, TikTok likes, and even your car's real-time energy consumption now contribute to an ecosystem where each data point represents a new, immutable event. This evolution has led to the rise of incremental data pipelines, which process data as it arrives, in contrast to traditional batch processing of large historical datasets.

Most online transaction processing (OLTP) systems have adapted by generating streams of events through techniques like Change Data Capture (CDC), where every row-level change in a database is tracked and delivered in real time. SQL operations such as MERGE INTO allow these events to be integrated into an existing table without rewriting its entire content, making updates both more efficient and more targeted.
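For illustration, here is roughly what merging a batch of CDC events into a target table can look like. This is a minimal sketch in Delta Lake / Spark SQL style syntax; the table and column names (customers, cdc_events, customer_id, op, and so on) are made up for the example and not taken from any specific system.

```sql
-- Upsert a batch of CDC events into the target table.
-- 'op' marks the type of change captured upstream (e.g. UPDATE, DELETE).
MERGE INTO customers AS target
USING cdc_events AS source
  ON target.customer_id = source.customer_id
WHEN MATCHED AND source.op = 'DELETE' THEN
  -- Row was deleted upstream: remove it from the target.
  DELETE
WHEN MATCHED THEN
  -- Row exists in both: apply the latest values.
  UPDATE SET
    target.name  = source.name,
    target.email = source.email
WHEN NOT MATCHED AND source.op <> 'DELETE' THEN
  -- New row: insert it.
  INSERT (customer_id, name, email)
  VALUES (source.customer_id, source.name, source.email);
```

Only the rows touched by the incoming events are rewritten, which is what makes this approach attractive compared with reloading the whole table on every run.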

However, not all systems can adopt this approach. Many legacy or simpler systems lack the ability to stream changes, leaving data engineers with no choice but to take periodic "snapshots" or "dumps" of their data.

So, if you're building a data pipeline that relies on snapshots, what are your options for ingestion and processing? And more importantly, how does this affect performance?
