Lakehouse as Code | 04. Delta Live Tables Data Pipelines
Welcome to the Lakehouse as Code mini-series! In this series, we'll walk you through deploying a complete Databricks lakehouse using infrastructure as code with Laktory. In this fourth part, we focus on building data pipelines with Delta Live Tables.
Lakehouse as Code | 03. Data Pipeline Jobs
Welcome to the Lakehouse as Code mini-series! In this series, we'll walk you through deploying a complete Databricks lakehouse using infrastructure as code with Laktory. In this third part, we focus on deploying data pipeline jobs.
Lakehouse as Code | 02. Workspace
Welcome to the Lakehouse as Code mini-series! In this series, we'll walk you through deploying a complete Databricks lakehouse using infrastructure as code with Laktory. In this second part, we focus on configuring a Databricks workspace.
Lakehouse as Code | 01. Unity Catalog
Welcome to the Lakehouse as Code mini-series! In this series, we'll walk you through deploying a complete Databricks lakehouse using infrastructure as code with Laktory. In this first part, we focus on laying the foundation for Unity Catalog.
Data Pipelines | To merge, or not to merge
In recent years, data has shifted towards a more streaming-centric nature. Online transactions, website clicks, TikTok likes, and even your car's real-time energy consumption now contribute to an ecosystem where each data point represents a new, immutable event. This evolution has led to the rise of incremental data pipelines, which process data as it arrives, in contrast to traditional batch processing of large historical datasets.
Most online transaction processing (OLTP) systems have adapted by generating streams of events through techniques like Change Data Capture (CDC), where every row-level change in a database is tracked and delivered in real time. SQL operations, such as MERGE INTO, enable these events to be seamlessly integrated into an existing data table without the need to overwrite its entire content, making it a more efficient and targeted way to update data.
However, not all systems can adopt this approach. Many legacy or simpler systems lack the ability to stream changes, leaving data engineers with no choice but to take periodic "snapshots" or "dumps" of their data.
So, if you're building a data pipeline that relies on snapshots, what are your options for ingestion and processing? And more importantly, how does this affect performance?
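For context, here is a rough sketch, not taken from the article, of what merging a batch of change events into an existing Delta table can look like with the Delta Lake Python API. The table and column names are hypothetical, and a `spark` session plus a `changes` DataFrame of CDC events are assumed to already exist.

```python
# Hypothetical example: upsert CDC events into a Delta table without rewriting it.
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "silver.customers")  # existing Delta table (assumed)

(
    target.alias("t")
    .merge(changes.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedDelete(condition="s.op = 'DELETE'")     # row was deleted upstream
    .whenMatchedUpdateAll(condition="s.op = 'UPDATE'")  # row was modified upstream
    .whenNotMatchedInsertAll()                          # brand new row
    .execute()
)
```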
Unity Catalog | 3 levels to rule them all
In May 2021, Databricks introduced Unity Catalog (UC), promising a unified governance layer designed to streamline the organization and security of data across cloud platforms. At its core is a three-level namespace, catalog.schema.table, for organizing and securing data assets.
However, what’s harder to find are the best practices that businesses are adopting to leverage these three levels efficiently, particularly within the context of a medallion architecture. Questions like: What should a catalog represent? Should I have separate schemas for bronze, silver, and gold layers? How do I grant access to logical groups of data? These are the questions we’ll aim to provide guidance on today.
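As a minimal illustration of the three-level namespace and of granting access to a logical group, here is a sketch where the catalog, schema, table, and group names are all made up for the example and a `spark` session is assumed:

```python
# Hypothetical catalog/schema layout and a group-level grant in Unity Catalog.
spark.sql("CREATE CATALOG IF NOT EXISTS lakehouse_dev")
spark.sql("CREATE SCHEMA IF NOT EXISTS lakehouse_dev.silver")
spark.sql("""
    CREATE TABLE IF NOT EXISTS lakehouse_dev.silver.stock_prices (
        symbol    STRING,
        open      DOUBLE,
        close     DOUBLE,
        traded_on DATE
    )
""")

# Grant read access on the whole silver schema to a logical group of users
spark.sql("GRANT USE CATALOG ON CATALOG lakehouse_dev TO `data-analysts`")
spark.sql("GRANT USE SCHEMA, SELECT ON SCHEMA lakehouse_dev.silver TO `data-analysts`")
```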
Databricks AI Playground | How to bring your own model
After a few months in public preview, Databricks AI Playground has garnered great feedback from the community. But if you’ve been living under a rock (or, shall we say, a brick) and have no idea what it’s all about, check out this short video by Holly Smith: https://www.youtube.com/shorts/pNA-YYLBJH4
In essence, this playground is a chat-like environment where you can test, prompt, and compare LLMs. And because a picture is worth a thousand words, here’s a little snapshot:
Building a Data Pipeline with Polars and Laktory
When discussing data pipelines, distributed engines like Spark and big data platforms such as Databricks and Snowflake immediately come to mind. However, not every problem requires these superpowers. Many businesses default to these large-scale solutions, but they can be overkill for the data sizes at hand. Additionally, those still learning the basics of data engineering and data modeling need access to simple and cost-effective setups to master their craft. That's why today we'll explore how to leverage Polars dataframes and the Laktory ETL framework to build an end-to-end data pipeline that can be executed on your local machine.
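As a taste of what running locally can look like, here is a small bronze/silver/gold-style sketch in plain Polars. The file paths and columns are invented for the example, and the article itself wires these steps through Laktory's pipeline model rather than raw scripts.

```python
import polars as pl

# Bronze: lazily scan raw CSV files from a local folder (hypothetical path/columns)
raw = pl.scan_csv("./data/stock_prices.csv")

# Silver: parse timestamps and drop duplicate events
silver = (
    raw.with_columns(pl.col("timestamp").str.to_datetime())
    .unique(subset=["symbol", "timestamp"])
)

# Gold: daily average close price per symbol
gold = (
    silver.group_by("symbol", pl.col("timestamp").dt.date().alias("date"))
    .agg(pl.col("close").mean().alias("avg_close"))
    .collect()  # the lazy plan only executes here
)
gold.write_parquet("./data/gold_stock_metrics.parquet")
```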
Laktory Introduction
Watch a quick introduction to Laktory, the open-source ETL framework, and learn how you can leverage its pipeline model to efficiently build and deploy dataframe-centric pipelines to Databricks or other data platforms.
Laktory Overview
Watch a demo on what Laktory, the dataframe-centric open-source ETL framework, is all about!
DataFrames Battle Royale | Pandas vs Polars vs Spark
Pandas operates with an in-memory, single-threaded architecture ideal for small to medium datasets, providing simplicity and immediate feedback. Polars, built with Rust, offers multi-threaded, in-memory processing and supports both eager and lazy execution, optimizing performance for larger datasets. Apache Spark uses a distributed computing architecture with lazy execution, designed for processing massive datasets across clusters, ensuring scalability and fault tolerance.
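To make the contrast concrete, here is the same aggregation sketched in each engine. The `trips.parquet` file and its `city`/`fare` columns are made up for the example.

```python
import pandas as pd
import polars as pl
from pyspark.sql import SparkSession, functions as F

# Pandas: eager, single-threaded, everything in memory
pandas_out = (
    pd.read_parquet("trips.parquet")
    .groupby("city", as_index=False)["fare"]
    .mean()
)

# Polars: builds a lazy plan, executes multi-threaded on .collect()
polars_out = (
    pl.scan_parquet("trips.parquet")
    .group_by("city")
    .agg(pl.col("fare").mean())
    .collect()
)

# Spark: lazy and distributed; nothing runs until an action such as .show()
spark = SparkSession.builder.getOrCreate()
spark_out = (
    spark.read.parquet("trips.parquet")
    .groupBy("city")
    .agg(F.mean("fare").alias("fare"))
)
spark_out.show()
```

Same result in all three cases, but very different execution models and scaling characteristics.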
Analytics for Everyone | Data driven decisions using ChatGPT
Last week, my friend Véronique Desjardins from Fondation Jeunesses Musicales Canada asked for help analyzing donation data to calculate metrics like average donation, retention rates, and trends from an Excel file with about 1,000 rows. Initially, I was eager to dive into the task using Pandas, a tool I’ve used extensively for similar analyses, but then I reconsidered. Instead of providing a one-time solution, I thought about empowering her to handle such analyses independently in the future. With the rise of self-service analytics and tools like ChatGPT, I wondered if this versatile AI could enable non-technical users to extract insights without needing a data analyst or a big budget.
Mastering Streaming Data Pipelines with Kappa Architecture
These days, experience with streaming data is a common requirement in most data engineering job postings. It seems that every business has a need, or at least an appetite, for streaming data. So, what’s all the fuss about? How do we build pipelines that support this type of data flow?
To illustrate the various concepts, we will build a pipeline that processes stock prices in near real time and share some latency metrics. The code to deploy the pipeline and review the data is available on GitHub.
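For a flavour of the kind of code involved (not the repository's actual implementation; the paths and schema below are invented), a Spark Structured Streaming version of such a pipeline could look like this:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Stream of raw price events landing as JSON files (hypothetical location/schema)
prices = (
    spark.readStream.format("json")
    .schema("symbol STRING, price DOUBLE, timestamp TIMESTAMP")
    .load("/mnt/landing/stock_prices/")
)

# 1-minute average price per symbol, tolerating 5 minutes of late data
avg_prices = (
    prices.withWatermark("timestamp", "5 minutes")
    .groupBy(F.window("timestamp", "1 minute"), "symbol")
    .agg(F.avg("price").alias("avg_price"))
)

# Continuously append results to a Delta table
query = (
    avg_prices.writeStream
    .outputMode("append")
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/stock_prices_agg")
    .start("/mnt/silver/stock_prices_agg")
)
```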
Laktory SparkChain | A serializable Spark-based data transformation model
In our previous article, we explored the pros and cons of using Spark versus SQL for data transformations within data pipelines. We concluded that while Spark excels in creating modular and scalable transformations, it falls short in the portability and declarative simplicity offered by SQL queries. Today, we will delve deeper into Laktory's SparkChain model, which aims to integrate the strengths of both technologies.
Sparkling Queries | An In-Depth Look at Spark vs SQL for Data Pipelines
As more big data platforms begin to support both Spark and SQL, you might wonder which one to choose. This article aims to offer some guidance from a data engineering perspective. We'll focus on how each language supports the development of scalable data pipelines for data transformation, setting aside performance considerations, which deserve their own separate discussion.
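For a quick sense of what the comparison looks like in practice, here is the same transformation expressed both ways. The `orders` table and its columns are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# DataFrame API: composable, easy to parameterize and unit test
df_out = (
    spark.table("orders")
    .filter(F.col("status") == "shipped")
    .groupBy("customer_id")
    .agg(F.sum("amount").alias("total_spent"))
)

# SQL: declarative and portable, but harder to break into reusable pieces
sql_out = spark.sql("""
    SELECT customer_id, SUM(amount) AS total_spent
    FROM orders
    WHERE status = 'shipped'
    GROUP BY customer_id
""")
```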
Data Dimensional Modeling: A shooting star?
Recently, I've reviewed numerous data engineering job postings and found that data modeling, specifically the ability to design and build dimensional data models, is a highly sought-after skill. This made me question the relevance of these models in the era of cloud computing and infinite scalability. Historically, star, snowflake, and galaxy schemas were all the rage. And rightfully so. However, I believe they are now overused, overshadowing approaches better suited to modern data stacks.
Databricks Delta Live Tables
Watch a demo on Databricks Delta Live Tables (DLT) and learn how you can simplify building data pipelines.
Databricks Volumes
Watch a demo on Databricks Volumes and learn how you can access your cloud data storage from your workspace.
Introducing Kubic
Join us to welcome a new team member. It will definitely become a familiar face on this channel.
Dashboard Cemetery
Dashboards are Dead. This was the title of an article written by Taylor Brownlow almost 4 years ago. She came back to the topic 3 years later with Dashboards Are Dead: 3 Years Later. I don't exactly recall if I stumbled upon these great reads looking for guidance or if I simply found them because of confirmation bias. Regardless of the path, the title (and the ideas) resonated with my own experience.