Data Engineering Blog

Genuine News about the Data Ecosystem.
Topics: #dataengineering #bigdata #python #opensource #etl

Rust for Data Engineering

Will Rust kill Python for Data Engineers? If you only came here to know this, my answer is no. Betteridge’s Law strikes again! But then again, you have to ask: was Python made for Data Engineering in the first place? Rust may not replace Python outright, but it has taken over more and more of the JavaScript tooling, and a growing number of projects are trying to do the same for Python and Data Engineering. Let’s explore why Rust has potential for data engineers, what it does well, and why it has been voted the most loved programming language in Stack Overflow’s developer survey for 7 years running.

Why Vim Is More than Just an Editor – Vim Language, Motions, and Modes Explained

Throughout my time as a developer, I’ve used VS Code, Sublime, Notepad++, TextMate, and others. But at some point, shortcuts like cmd(+shift)+end and jumping from word to word with option+arrow keys were no longer fast enough. I was hitting my limits. Everything I did, I did decently fast, but I wasn’t getting any faster. I’ve since learned that Vim is the only editor that makes you faster the longer you use it. Vim is based solely on shortcuts.

Data Lake / Lakehouse Guide: Powered by Data Lake Table Formats (Delta Lake, Iceberg, Hudi)

Image by Rachel Claire on Pexels.

Ever wanted or been asked to build an open-source Data Lake offloading data for analytics? Asked yourself what components and features that would include? Didn’t know the difference between a Data Lakehouse and a Data Warehouse? Or did you just want to govern your hundreds to thousands of files and have more database-like features, but didn’t know how? This article explains the power of the data lake and which technologies you can use to build one, so you avoid creating a Data Swamp with no structure and orphaned files.

The Rise of the Semantic Layer

A semantic layer is something we use every day. We build dashboards with yearly and monthly aggregations. We design dimensions for drilling down reports by region, product, or whatever metrics we are interested in. What has changed is that we no longer use a single business intelligence tool; different teams use different visualizations (BI, notebooks, and embedded analytics). Instead of re-creating siloed metrics in each app, we want to define them once, openly and in a version-controlled way, and sync them into each visualization tool.

Data Orchestration Trends: The Shift From Data Pipelines to Data Products

Data consumers, such as data analysts and business users, care mostly about the production of data assets. On the other hand, data engineers have historically focused on modeling the dependencies between tasks (instead of data assets) with an orchestrator tool. How can we reconcile both worlds? This article reviews open-source data orchestration tools (Airflow, Prefect, Dagster) and discusses how they introduce data assets as first-class objects. We also cover why a declarative approach with higher-level abstractions helps with faster developer cycles, stability, and a better understanding of what’s going on pre-runtime.

Building an Analytics API with GraphQL: The Next Level of Data Engineering?

Image by Mohammad Bagher Adib Behrooz on Unsplash.

Why GraphQL for data engineers, you might ask? GraphQL solved the problem of providing a distinct interface for each client by unifying them into a single API for all clients, such as web and mobile apps. We’re now facing the same challenge in the data world, where we integrate multiple clients with numerous backend systems. So what is GraphQL? In the world of microservices and web apps, GraphQL is a popular query language and serves as a data layer.