Today, there are 6,500 people on LinkedIn who call themselves data engineers according to stitchdata.com. In San Francisco alone, there are 6,600 job listings for this same title. The number of data engineers has doubled in the past year, but engineering leaders still find themselves faced with a significant shortage of data engineering talent. So is it really the future of data warehousing? What is data engineering? These questions and much more I want to answer in this blog post.
In unicorn companies like Facebook, Google, Apple where data is the fuel for the company, mostly in America, is where data engineers are largely used. In Europe, the job title does not completely exist besides the startup mecca Berlin, Munich, etc. They are called or included in jobs like software engineer, big data engineer, business analyst, data analyst, data scientist and also the business intelligence engineer. Myself, I started as a business intelligence engineer and using more and more time on the engineering rather the business part, that’s why I am starting this blog from the data warehousing angle.
What is data warehousing or what is a business intelligence engineer doing, and why are they using a data warehouse?
To use the analogy to a physical retail-type warehouse, you want to sell very structured products in the most efficient way to your customers. In a data warehouse (DWH) you have typically structured data and optimised them for business users to query. If you dig a little deeper, you offload data from the trucks in the back of the physical shop, before it gets sorted and structured into the warehouse for the customers to buy. In a DWH you basically do the same, just with data. As you see in the DWH architecture below, the offloading area in the back of the store is your stage area where you store the source data from your operational systems or external data.
A traditional Data Warehouse architecture by Wikipedia:
The physical warehouse where the customers buying the articles is in a DWH normally the so-called data mart. The data processed between each layer seen in the architecture above is called ETL (Extract Transform Load). This is not to confuse with ELT (Extract Load Transform) which is the common mythology data lakes (more in my recent post). In a DWH you always transform to get data as clean and structured as possible.
Besides the obvious reasons of a shop explained above, a data warehouse gives you big advantages:
Data engineering is the less famous sibling of data science. Data science is growing like no tomorrow and so does data engineer, but much less heard. Compared to existing roles it would be a software engineering plus business intelligence engineer including big data abilities as the Hadoop ecosystem, streaming and computation at scale. Business creates more reporting artefacts themselves but with more data that needs to be collected, cleaned and updated near real-time and complexity is expanding every day. With that said more programmatic skills are needed similar to software engineering. The emerging language at the moment is Python (more in the chapter below) while used in engineering with tools alike Apache Airflow as well as data science with powerful libraries. Where today as a BI-engineer you use SQL for almost everything except when using external data from an FTP-server for example. You would use bash and PowerShell in the nightly batch jobs. But this is no longer sufficient and because it gets a full-time job to develop and maintain all these requirement and rules (called pipelines), the data engineering is needed.
In order to get high quality and frequently updated data sets, it is important to distinguish between data pipelines that are done and cleaned by data engineers and all the others that are mostly exploratory. We at Airbus use a folder that is called “cleaned” and all data sets produced there are constantly updated, documented and of the highest quality. Based on these data sets you create your own. We use the data lake solution Palantir Foundry (brand name of Airbus: Skywise) which provides you with a map where you see the data lineage easily. Documentation and metadata to each data set are crucial as otherwise, you lose the overview of your data, which is also one main task of a data engineer.
Another important task or service which a data engineer provides is automation that data scientists or data analysts do manually. A good overview what task this includes are provided by Maxime Beauchemin, the founder of Apache Airflow, a tool that helps a data engineer to lift the majority of tasked mentioned:
While the nature of the workflows that can be automated differs depending on the environment, the need to automate them is common across the board. By Maxime Beauchemin
I believe, that not every company is in need of data engineers. His skills are mostly required if the company either:
If English is the language of business, SQL is the language of data and Python the language of engineering. While technology disappears often, SQL is still here. This means you need a reliable understanding of:
Stitchdata.com anticipated that as company size increased, so would the focus on scaling-related skill. However, that’s not the story the data told. Instead, data engineers at larger companies tend to be more focused on “enterprise” skills like ETL, BI, and data warehousing, whereas data engineers at smaller companies focus more on core technologies:
Programming languages have always come and gone, but in the last couple of years, Python rises on top of the popularity. The question is why. One valid reason for sure is because of the rise of the data engineers but also the use of libraries for data science and data analytics.
According to the Codeacademy and their source data from Stack Overflow, they say it’s connected to the rise of data science. This and machine learning were the biggest trends in tech 2017. Additionally, Python has become a go-to language for data analysis. With data-focused libraries like pandas, NumPy, and matplotlib, anyone familiar with Python’s syntax and rules can deploy it as a powerful tool to process, manipulate, and visualise data.
Related to the rise of data science and data engineering, it’s clear to me that Python is here to stay and it’s becoming the Swiss Army Knife of programming languages.
But for what can you use Python in data engineering. For example, you use it for data wrangling (reshaping, aggregating, joining disparate sources, etc.) which mostly done with the library Pandas, small-scale ETL, API interaction (our presentation usually happens in Tableau which has Python APIs) and automation with Apache Airflow, which is also natively in Python.
“Apache Airflow has several building blocks that allow Data Engineers to easily piece together pipelines to and from different sources. Because it is written in Python, Data Engineers find it easy to create ETL pipelines by just extending classes of Airflow’s DAG and Operator objects. And this allows us to write our own Python code to create any ETL we wish, with the structure given by Airflow. Airflow uses several packages mentioned all ready to do the job: boto for S3 handling, pandas for obvious advantages with data frames, psycopg2 for popular integrations with Postgres and Redshift, and several more.” said by David Dalisay.
According to a job description as a data engineer at Facebook in Menlo Park in Seattle, he needs to have the following qualification and responsibilities.
A picture of such a persona could look like (quora.com)
The salary of a data engineer is hard to say as it is very new, especially in Switzerland where the following salary-report is from. But on the following table you see all the jobs that are related or close by (unfortunately only in German, sorry for that) and their salary for a full year ins Swiss Francs (CHF) created by Robert Half.ch:
As already mentioned in further up, the data engineer has the skillset of a business intelligence engineer plus also solid programming and big data skills. I believe BI-engineers will transit over to a data engineer anyway, depending on the size of a company. But why? What is changing/has changed?
As the power of computers and especially the speed of the internet is growing, more data can be collected and needs to be analysed. Therefore many parameters around the data warehouse environment have or will change. Below the points that I see with most influence:
Furthermore the way we do ETL is changing as well, as Maxime Beauchemin, data engineer at Airbnb quotes: “Product know-how on platforms like Informatica, IBM Datastage, Cognos, AbInitio or Microsoft SSIS isn’t common amongst modern data engineers, and being replaced by more generic software engineering skills along with understanding of programmatic or configuration driven platforms like Airflow, Oozie, Azkabhan or Luigi. It’s also fairly common for engineers to develop and manage their own job orchestrator/scheduler.” He is also saying that code is the best abstraction there is for software rather than using drag and drop tools (ETL-tools). Most important what I see as well, that the transformation logic is of a higher need and shouldn’t be locked away exclusively for BI developers.
As you can’t change ETL without modelling differently, also this is changing:
Facebook, Airbnb and other companies taking it a step into so-called “Data Camps or Data University” to educate internal employees in respect of data to get more data savvy.
So after all, is the data engineer the new business intelligence engineer? I would say in the long run yes. I imagine that data warehouses – in any way – will always be a need for the business, where the data is fully structured and easily accessible. But how we build DWHs or a similar type, will change and therefore more engineering and data engineers, are needed.
6,474 total views, 20 views today