These days everyone talks about open-source, however still not common in the Data Warehouse (DWH) field. Why is this? In my recent blog, I researched OLAP technologies and what’s coming next, in this blog I choose one of the open-source technologies and build it together to have a full data architecture for DWH use-cases based on modern open-source technology. As the title says it all, I went for Apache Druid, querying with Apache Superset and using the Apache Airflow as an Orchestrator.
Druid is an open-source, column-oriented, distributed data store written in Java. It’s designed to quickly ingest massive quantities of event data, and provide low-latency queries on top of the data.
Druid has many Key Features as Sub-second OLAP Queries, Real-time Streaming Ingestion, Scalable, Highly Available, Power Analytic Applications and Cost Effective.
With the Comparison of modern OLAP Technologies in mind, I choose Druid over ClickHouse, Pinot and Apache Kylin. As ClickHouse and Pinot compare as very similar and I didn’t find any much information about Apache Kylin, I decided to go for the most promising technologies also reading many different blog posts. Also just recently Microsoft announced to add Druid to their Azure HDInsight 4.0 (preview) seen on this article*.
“Druid’s strong points are very compelling but there are many important problems that Druid does not attempt to solve. Most important of these is joins. Druid has some basic capabilities to join its internal data against small dimension tables loaded from external systems (this is called query-time lookup in Druid terms.) This ability aside, the Druid authors make it clear that large-scale join support is not a priority.
Second, Druid’s SQL native implementation is extremely new. Though progress is being made, SQL is a huge and complex language and it is unlikely that we will see comprehensive SQL analytical capabilities such as set operators and windowing functions any time soon. In addition, we can expect that Druid’s SQL implementation will not support heavy-duty joins.” written by Carter Shanklin on Horthonwork.com.
Remarkable about Druid is that it is scalable due to hist cluster architecture. You have three different nodes type which is the Middle-Manager-Node, the Historical Node and the Broker. The great thing is that you can add as many nodes as you want in the specific area that fits best for you. If you have many queries done you would add more Brokers or if a lot of data needs to be batch-ingested you would add middle managers and so on. A simple architecture you can see below (more here).
The easiest way to query against Druid is a light way open-source BI-Dashboard tool called Apache Superset that integrated Druid well. It is easy to use and has all common chart types like Bubble Chart, Word Count, Heatmaps, Boxplot and many more. Druid provides a Rest-API and in the newest version also a SQL-Query API which makes it easy to consume it with any tool no matter if standard SQL, any BI-Tool or a custom application.
*Unfortunately Microsoft is not there yet. I installed the HDIsight Cluster on Azure and wanted to use a Druid Cluster. However you see the service in their Apache Ambari web portal, you are not able to “add” the service and install it. The link is dead, and nothing happens. Let’s hope Microsoft will it fix until the official release.
As mentioned in Orchestrators – Scheduling and monitor workflows, this is one of the most critical decisions. As in traditional ETL, you had your ETL-Tools like Microsoft SQL Server Integration Services (SSIS) and others where your transformation, cleaning and normalisation took place. In more modern architecture this kind of tools isn’t enough anymore. Moreover, code and transformation-logics are much more valuable also to other data-savvy people in the company.
Read also the highly recommended blog post from Maxime Beauchemin about Functional Data Engineering — a modern paradigm for batch data processing (same as YouTube) which goes much deeper into how modern pipelines should be. Also, consider the read of The Downfall of the Data Engineer where Max explains about the breaking “data silo” and much more.
Therefore it is wise to use a more open tool or even programming language. Very popular and a perfect tool for that purpose, in my opinion, is Apache Airflow. Airflow is written in Python, and also all the DAGs will be written in Python. Instead of encapsulating your critical transformation-logic somewhere in a tool, you place it where it belongs to inside the Orchestrator. Another advantage of it that is using plain python, there is no need to encapsulate other dependencies or requirements like fetching from an FTP, copy data from A to B, writing a batch-file, you do that and everything else at the same place.
Moreover, you get a full functionalities overview of all current task in one very place as this interactive image shows:
More features of Airflow:
If you want to start with Apache Airflow as your new ETL-tool, please start with this ETL best practices with Airflow shared with you. It has examples simple ETL-examples, with plain SQL, with HIVE, with Data Vault, Data Vault 2, Data Vault with Big Data processes. It gives you an excellent overview of what’s possible and also how you would approach it.
At the same time, there is a Docker container that you can use, meaning you don’t even have to set-up any infrastructure, pull the container from here.
For the GitHub-repo follow the link on etl-with-airflow.
If your searching for open-source data architecture, you cannot ignore above-mentioned data store druid, for speedy OLAP responses, Apache Airflow as an orchestrator that keep you data lineage and schedules in line plus an easy to use dashboard tool as in Apache Superset.
My experience so far is that Druid is bloody fast and a perfect fit for OLAP cube replacements in a traditional way, but still needs a more relaxed startup to install clusters, ingesting data, viewing logs etc. If you need that, have a look at Impy which are the founders of Druid and creating all the services around for Druid that you need. Unfortunately not open-source.
Apache Airflow and its features as an orchestrator is something which I see not happen yet in traditional Business Intelligence environments. I believe this change comes very naturally when you start using open-source and more new technologies.
And Apache Superset as an easy and fast way to be up and running and showing data from Druid. There for sure more better tools like Tableau, etc., but not for free. That’s why superset fits well in the ecosystem if you’re already using above open-source technologies. But as an enterprise company, you might want to spend some money in that category because that is what the users can see at the end of the day.
940 total views, 53 views today