I have been an avid user of open-source software. Today, I am highlighting a use case for three popular open-source projects, Apache Spark, Delta Lake, and PrestoDB, for building reliable, scalable, fast, and versatile data storage for large-scale enterprise data in machine learning use cases.
Understanding the use case
A growing enterprise ML team needs a storage system for the cleaned feature data generated by streaming or batch jobs over raw fact data (a minimal sketch of such a job is shown below).
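To make this concrete, here is a minimal sketch of what such a streaming feature job could look like, using Spark Structured Streaming to write into a Delta table. The paths, column names, and aggregation are hypothetical placeholders, and it assumes a Spark session already configured with the delta-spark package (see the quickstart sketch at the end of this section).

```python
from pyspark.sql import SparkSession, functions as F

# Assumes Delta Lake is already configured on this session.
spark = SparkSession.builder.appName("feature-stream").getOrCreate()

# Read raw fact data as a stream (hypothetical path and schema).
raw_facts = spark.readStream.format("delta").load("/data/raw/clickstream")

# Derive a cleaned, aggregated feature: hourly click counts per user.
features = (
    raw_facts
    .withWatermark("event_time", "10 minutes")
    .groupBy("user_id", F.window("event_time", "1 hour"))
    .agg(F.count("*").alias("hourly_click_count"))
)

# Append the feature rows to a Delta table that downstream ML jobs can read.
(
    features.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/data/checkpoints/user_features")
    .start("/data/features/user_features")
)
```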
Traditionally, we relied on cloud data warehouses such as BigQuery on Google Cloud and Redshift on AWS.
However, data warehouses come with major challenges, such as:
- data staleness,
- reliability,
- total cost of ownership,
- data lock-in,
- and limited use-case support.
Below is an excerpt from the official site, https://delta.io/:
"Delta Lake is an open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs for Scala, Java, Rust, Ruby, and Python."
You can read about Delta Lake in more detail in the research paper or on the project website.
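For a quick local experiment, the quickstart pattern below configures Spark with the delta-spark pip package, writes a tiny feature DataFrame as a Delta table, and reads it back. The table path and columns are hypothetical placeholders.

```python
import pyspark
from delta import configure_spark_with_delta_pip

# Configure a Spark session with the Delta Lake extensions
# (pip install delta-spark), following the delta.io quickstart.
builder = (
    pyspark.sql.SparkSession.builder.appName("delta-quickstart")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write a small feature table in the Delta format (hypothetical schema/path).
df = spark.createDataFrame(
    [(1, 0.42), (2, 0.87)], ["user_id", "feature_value"]
)
df.write.format("delta").mode("overwrite").save("/tmp/delta/user_features")

# Read the table back; engines such as PrestoDB, Trino, or Hive can query
# the same files through their Delta connectors.
spark.read.format("delta").load("/tmp/delta/user_features").show()
```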