These days, everyone talks about open source. However, open-source software is still not common in the Data Warehouse (DWH) field. Why is this?
In a recent blog post, I researched OLAP technologies. For this post, I chose some open-source technologies and used them together to build a full data architecture for a Data Warehouse system.
I went with Apache Druid for data storage, Apache Superset for querying and Apache Airflow as a task orchestrator.
Druid – the data store
Druid is an open-source, column-oriented, distributed data store written in Java. It’s designed to quickly ingest massive quantities of event data, and provide low-latency queries on top of the data.
What druid.io is
Druid has many key features, including sub-second OLAP queries, real-time streaming ingestion, scalability, and cost-effectiveness.
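To make the "sub-second OLAP queries" point a bit more concrete, here is a minimal sketch of how a query could be sent to Druid's SQL endpoint (`/druid/v2/sql`) over HTTP from Python. The host and port (`localhost:8888`), the `wikipedia` datasource, and the `channel` column are assumptions for illustration only and would need to match your own cluster and data.

```python
import json
import urllib.request

# Assumed address of a local Druid router; adjust for your own cluster.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

# Hypothetical query: the `wikipedia` datasource and `channel` column
# are placeholders, not something this post sets up.
EXAMPLE_SQL = """
SELECT channel, COUNT(*) AS edits
FROM wikipedia
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY channel
ORDER BY edits DESC
LIMIT 5
"""


def build_druid_sql_request(sql, url=DRUID_SQL_URL):
    """Build (but do not send) a POST request for Druid's SQL API."""
    payload = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_druid_sql(sql, url=DRUID_SQL_URL):
    """Send the query and return the parsed JSON response."""
    request = build_druid_sql_request(sql, url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a running cluster, `run_druid_sql(EXAMPLE_SQL)` would return the result rows as parsed JSON. Building the request separately from sending it keeps the payload easy to inspect without a live Druid instance.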
With the comparison of modern OLAP Technologies in mind, I chose Druid over ClickHouse, Pinot and Apache Kylin. Recently, Microsoft announced they will add Druid to their Azure HDInsight 4.0.
Why not Druid?
Carter Shanklin wrote a detailed post about Druid’s…