Lakehouse as Code | 04. Delta Live Tables Data Pipelines
Olivier Soucy

Welcome to the Lakehouse as Code mini-series! In this series, we'll walk you through deploying a complete Databricks lakehouse using infrastructure as code with Laktory. In this fourth part, we focus on building and deploying Delta Live Tables data pipelines.

Read More
Lakehouse as Code | 03. Data Pipeline Jobs
Olivier Soucy

Welcome to the Lakehouse as Code mini-series! In this series, we'll walk you through deploying a complete Databricks lakehouse using infrastructure as code with Laktory. In this third part, we focus on deploying and orchestrating data pipeline jobs.

Read More
Lakehouse as Code | 02. Workspace
Olivier Soucy

Welcome to the Lakehouse as Code mini-series! In this series, we'll walk you through deploying a complete Databricks lakehouse using infrastructure as code with Laktory. In this second part, we focus on configuring a Databricks workspace.

Read More
Lakehouse as Code | 01. Unity Catalog
Olivier Soucy

Welcome to the Lakehouse as Code mini-series! In this series, we'll walk you through deploying a complete Databricks lakehouse using infrastructure as code with Laktory. In this first part, we focus on laying the foundation for Unity Catalog.

Read More
Data Pipelines | To merge, or not to merge
Olivier Soucy

In recent years, data has shifted towards a more streaming-centric nature. Online transactions, website clicks, TikTok likes, and even your car's real-time energy consumption now contribute to an ecosystem where each data point represents a new, immutable event. This evolution has led to the rise of incremental data pipelines, which process data as it arrives, in contrast to traditional batch processing of large historical datasets.

Most online transaction processing (OLTP) systems have adapted by generating streams of events through techniques like Change Data Capture (CDC), where every row-level change in a database is tracked and delivered in real time. SQL operations, such as MERGE INTO, enable these events to be seamlessly integrated into an existing data table without the need to overwrite its entire content, making it a more efficient and targeted way to update data.
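
As a taste of what that looks like in practice, here is a minimal sketch of merging a batch of CDC events into a Delta table with PySpark. The customers table, the updates dataframe, and their columns are hypothetical:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Local session configured for Delta Lake.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaSparkSessionCatalog")
    .getOrCreate()
)

# Hypothetical batch of CDC events keyed on id.
updates = spark.createDataFrame(
    [(1, "alice@example.com"), (3, "carol@example.com")],
    ["id", "email"],
)

# Hypothetical existing Delta table.
target = DeltaTable.forName(spark, "customers")

(
    target.alias("t")
    .merge(updates.alias("s"), "t.id = s.id")  # match incoming events to rows
    .whenMatchedUpdateAll()                    # update rows that already exist
    .whenNotMatchedInsertAll()                 # insert rows seen for the first time
    .execute()
)
```

Only the files containing affected rows get rewritten, which is what makes a merge so much cheaper than overwriting the whole table.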

However, not all systems can adopt this approach. Many legacy or simpler systems lack the ability to stream changes, leaving data engineers with no choice but to take periodic "snapshots" or "dumps" of their data.

So, if you're building a data pipeline that relies on snapshots, what are your options for ingestion and processing? And more importantly, how does this affect performance?

Read More
Unity Catalog | 3 levels to rule them all
Olivier Soucy

In May 2021, Databricks introduced Unity Catalog (UC), promising a unified governance layer designed to streamline the organization and security of data across cloud platforms.

Plenty has been written about the mechanics of its three-level namespace: catalog, schema, and table. However, what’s harder to find are the best practices businesses are adopting to leverage these three levels efficiently, particularly within the context of a medallion architecture. Questions like: What should a catalog represent? Should I have separate schemas for the bronze, silver, and gold layers? How do I grant access to logical groups of data? These are the questions we’ll aim to provide guidance on today.
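
As a hypothetical taste of where that guidance can land, one common pattern maps an environment to a catalog and the medallion layers to schemas, then grants analysts access to the curated layer only. The names and group below are illustrative, not a recommendation lifted from the article:

```python
# Runs on a Unity Catalog-enabled Databricks workspace, where `spark` is predefined.
for stmt in [
    "CREATE CATALOG IF NOT EXISTS prod",
    "CREATE SCHEMA IF NOT EXISTS prod.bronze",
    "CREATE SCHEMA IF NOT EXISTS prod.silver",
    "CREATE SCHEMA IF NOT EXISTS prod.gold",
    # Analysts can browse the catalog, but read only the gold layer.
    "GRANT USE CATALOG ON CATALOG prod TO `analysts`",
    "GRANT USE SCHEMA, SELECT ON SCHEMA prod.gold TO `analysts`",
]:
    spark.sql(stmt)
```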

Read More
Databricks AI Playground | How to bring your own model
Olivier Soucy

After a few months in public preview, Databricks AI Playground has garnered great feedback from the community. But if you’ve been living under a rock (or, shall we say, a brick) and have no idea what it’s all about, check out this short video by Holly Smith: https://www.youtube.com/shorts/pNA-YYLBJH4

In essence, this playground is a chat-like environment where you can test, prompt, and compare LLMs. And because a picture is worth a thousand words, here’s a little snapshot:

Read More
Building a Data Pipeline with Polars and Laktory
Olivier Soucy

When discussing data pipelines, distributed engines like Spark and big data platforms such as Databricks and Snowflake immediately come to mind. However, not every problem requires these superpowers. Many businesses default to these large-scale solutions, but they can be overkill for the data sizes at hand. Additionally, those still learning the basics of data engineering and data modeling need access to simple and cost-effective setups to master their craft. That's why today we'll explore how to leverage Polars dataframes and the Laktory ETL framework to build an end-to-end data pipeline that can be executed on your local machine.
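
For a flavor of what's ahead, here is a minimal local pipeline written in plain Polars; in the article, steps like these are declared through Laktory's pipeline model rather than coded inline, and the file paths and columns here are made up:

```python
import polars as pl

# Bronze: lazily scan raw events from a local file.
raw = pl.scan_csv("data/stock_prices.csv")

# Silver: clean and normalize.
silver = (
    raw
    .filter(pl.col("price") > 0)
    .with_columns(pl.col("symbol").str.to_uppercase())
)

# Gold: aggregate for analytics and persist locally.
gold = silver.group_by("symbol").agg(pl.col("price").mean().alias("avg_price"))
gold.collect().write_parquet("data/gold_stock_metrics.parquet")
```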

Read More
Laktory Introduction
Olivier Soucy

Watch a quick introduction to Laktory, the open-source ETL framework, and learn how you can leverage its pipeline model to efficiently build and deploy dataframe-centric pipelines to Databricks or other data platforms.

Read More
Laktory Overview
Olivier Soucy

Watch a demo on what Laktory, the dataframe-centric open-source ETL framework, is all about!

Read More
DataFrames Battle Royale | Pandas vs Polars vs Spark
Olivier Soucy

Pandas operates with an in-memory, single-threaded architecture ideal for small to medium datasets, providing simplicity and immediate feedback. Polars, built with Rust, offers multi-threaded, in-memory processing and supports both eager and lazy execution, optimizing performance for larger datasets. Apache Spark uses a distributed computing architecture with lazy execution, designed for processing massive datasets across clusters, ensuring scalability and fault tolerance.
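
The contrast shows up even in a trivial aggregation. Here is the same group-by in all three libraries, against a hypothetical trades.csv:

```python
import pandas as pd
import polars as pl
from pyspark.sql import SparkSession, functions as F

# Pandas: eager and single-threaded; the file is loaded into memory immediately.
pdf = pd.read_csv("trades.csv")
pandas_out = pdf.groupby("symbol")["price"].mean()

# Polars: multi-threaded; the lazy scan defers all work until collect().
polars_out = (
    pl.scan_csv("trades.csv")
    .group_by("symbol")
    .agg(pl.col("price").mean())
    .collect()
)

# Spark: distributed and lazy; nothing executes until an action like show().
spark = SparkSession.builder.getOrCreate()
(
    spark.read.csv("trades.csv", header=True, inferSchema=True)
    .groupBy("symbol")
    .agg(F.mean("price"))
    .show()
)
```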

Read More
Analytics for Everyone | Data driven decisions using ChatGPT
Olivier Soucy

Last week, my friend Véronique Desjardins from Fondation Jeunesses Musicales Canada asked for help analyzing donation data to calculate metrics like average donation, retention rates, and trends from an Excel file with about 1,000 rows. Initially, I was eager to dive into the task using Pandas, a tool I’ve used extensively for similar analyses, but then I reconsidered. Instead of providing a one-time solution, I thought about empowering her to handle such analyses independently in the future. With the rise of self-service analytics and tools like ChatGPT, I wondered if this versatile AI could enable non-technical users to extract insights without needing a data analyst or a big budget.

Read More
Mastering Streaming Data Pipelines with Kappa Architecture
Olivier Soucy

These days, experience with streaming data is a common requirement in most data engineering job postings. It seems that every business has a need, or at least an appetite, for streaming data. So, what’s all the fuss about? How do we build pipelines that support this type of data flow?

To illustrate the various concepts, we will build a pipeline that processes stock prices in near-real time and share some latency metrics. The code to deploy the pipeline and review the data is available on GitHub.
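
As a preview, a near-real-time aggregation with Spark Structured Streaming looks something like the sketch below, which substitutes the built-in rate source for a real stock-price feed:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-in for a live price feed: the rate source emits rows at a fixed frequency.
prices = (
    spark.readStream.format("rate").option("rowsPerSecond", 10).load()
    .withColumn("symbol", F.expr("CASE WHEN value % 2 = 0 THEN 'AAPL' ELSE 'MSFT' END"))
    .withColumn("price", F.rand() * 100)
)

# Average price per symbol over 1-minute event-time windows.
avg_prices = (
    prices
    .withWatermark("timestamp", "30 seconds")
    .groupBy(F.window("timestamp", "1 minute"), "symbol")
    .agg(F.avg("price").alias("avg_price"))
)

# Stream the results to the console; runs until interrupted.
avg_prices.writeStream.outputMode("update").format("console").start().awaitTermination()
```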

Read More
Laktory SparkChain - Serializable Spark-based data transformations
Olivier Soucy

In our previous article, we explored the pros and cons of using Spark versus SQL for data transformations within data pipelines. We concluded that while Spark excels in creating modular and scalable transformations, it falls short in the portability and declarative simplicity offered by SQL queries. Today, we will delve deeper into Laktory's SparkChain model, which aims to integrate the strengths of both technologies.
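
To make the idea concrete before you read on, here is a toy interpreter for a chain of transformations expressed as plain, serializable data. This illustrates the concept only; it is not Laktory's actual API:

```python
from pyspark.sql import SparkSession, functions as F

# Each node is plain data (e.g. loadable from YAML), yet maps to a Spark operation.
chain = [
    {"func": "filter", "sql_expr": "price > 0"},
    {"func": "with_column", "name": "symbol", "sql_expr": "upper(symbol)"},
]

def execute(df, chain):
    """Interpret each serialized node as a Spark dataframe operation."""
    for node in chain:
        if node["func"] == "filter":
            df = df.filter(F.expr(node["sql_expr"]))
        elif node["func"] == "with_column":
            df = df.withColumn(node["name"], F.expr(node["sql_expr"]))
    return df

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("aapl", 182.3), ("msft", -1.0)], ["symbol", "price"])
execute(df, chain).show()  # keeps only aapl, with its symbol upper-cased
```

Because the chain is just data, it can be stored in a config file, versioned, and deployed, while execution still benefits from Spark's scalability.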

Read More
Sparkling Queries | An In-Depth Look at Spark vs SQL for data pipelines
Olivier Soucy

As more big data platforms begin to support both Spark and SQL, you might wonder which one to choose. This article aims to offer some guidance from a data engineering perspective. We'll focus on how each language supports the development of scalable data pipelines for data transformation, setting aside performance considerations for now; that topic deserves its own separate discussion.
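
To ground the comparison, here is the same trivial transformation expressed both ways; the data is made up:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("AAPL", 182.3), ("MSFT", 411.2)], ["symbol", "price"])

# Spark dataframe API: composable and easy to unit test.
out_spark = df.filter(F.col("price") > 200).select("symbol")

# SQL: declarative and portable across engines.
df.createOrReplaceTempView("prices")
out_sql = spark.sql("SELECT symbol FROM prices WHERE price > 200")
```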

Read More
Data Dimensional Modeling: A shooting star?
Olivier Soucy

Recently, I've reviewed numerous data engineering job postings and found that data modeling, specifically the ability to design and build dimensional data models, is a highly sought-after skill. This made me question the relevance of these models in the era of cloud computing and infinite scalability. Historically, star, snowflake, and galaxy schemas were all the rage. And rightfully so. However, I believe they are now overused, overshadowing approaches better suited to modern data stacks.

Read More
Databricks Volumes
Olivier Soucy

Watch a demo on Databricks Volumes and learn how you can access your cloud data storage from your workspace.

Read More
Introducing Kubic
Olivier Soucy

Join us to welcome a new team member. It will definitely become a familiar face on this channel.

Read More