I have been an avid user of open-source software. Today, I am highlighting a use case for three popular projects, Apache Spark, Delta Lake and PrestoDB: building reliable, scalable, fast and versatile storage for large-scale enterprise data for machine learning use cases.
A growing enterprise ML team needs a storage system for the cleaned feature data that streaming or batch jobs generate from raw fact data.
Traditionally, we used cloud data warehouses such as BigQuery on Google Cloud and Redshift on AWS.
However, there are major challenges with data warehouses, such as
Below is an excerpt from the official site https://delta.io/
Delta Lake is an open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs for Scala, Java, Rust, Ruby, and Python.
One can read about Delta Lake in more detail in the research paper or on the website.
The idea is to configure a multi-region, highly available bucket on top of any popular, inexpensive cloud storage like
Set up an Apache Spark cluster in standalone or cluster mode.
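As a rough sketch of this step, a Spark session with Delta Lake enabled might be created as shown here. It assumes the `pyspark` and `delta-spark` packages are installed. The function name `build_session`, the master URL and the app name are illustrative placeholders, not values from this setup.

```python
# Sketch only: build_session, the master URL and the app name are
# hypothetical placeholders, not part of the original setup.

# The two settings the Delta Lake documentation lists for enabling
# Delta in a Spark session.
DELTA_CONF = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def build_session(master="local[*]", app_name="feature-store"):
    """Create a SparkSession with Delta Lake enabled.

    Pass "local[*]" to run on one machine, or "spark://<host>:7077"
    to connect to a standalone cluster.
    """
    # Imported lazily so this module loads even without Spark installed.
    from delta import configure_spark_with_delta_pip
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master(master).appName(app_name)
    for key, value in DELTA_CONF.items():
        builder = builder.config(key, value)
    # Adds the Delta Lake JARs that match the installed delta-spark package.
    return configure_spark_with_delta_pip(builder).getOrCreate()
```

Calling `build_session()` with no arguments starts a local session, which is usually enough to try things out before pointing at a real cluster.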
I have created a Python client package that supports Delta Lake CRUD operations.
https://pypi.org/project/python-prakashravip1/
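The package's own API is not reproduced here. To illustrate what CRUD on a Delta table involves, the sketch below uses the public `delta-spark` API directly. The function names and the `t`/`u` aliases are illustrative choices, and each function expects an existing Delta-enabled `SparkSession`.

```python
# Sketch only: these helpers are hypothetical and are NOT the API of the
# package linked above. They wrap calls from the public delta-spark library.

def merge_condition(keys, target="t", source="u"):
    """Build the join condition for an upsert, e.g. 't.id = u.id'."""
    return " AND ".join(f"{target}.{k} = {source}.{k}" for k in keys)


def create(df, path):
    """Create (or replace) the Delta table at `path` from a DataFrame."""
    df.write.format("delta").mode("overwrite").save(path)


def read(spark, path):
    """Load the current snapshot of the Delta table as a DataFrame."""
    return spark.read.format("delta").load(path)


def upsert(spark, path, updates_df, keys):
    """Update rows whose keys match and insert the rest."""
    from delta.tables import DeltaTable  # lazy import: needs delta-spark

    (
        DeltaTable.forPath(spark, path).alias("t")
        .merge(updates_df.alias("u"), merge_condition(keys))
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )


def delete(spark, path, condition):
    """Delete rows matching a SQL predicate such as 'user_id = 42'."""
    from delta.tables import DeltaTable  # lazy import: needs delta-spark

    DeltaTable.forPath(spark, path).delete(condition)
```

The merge-based `upsert` is the piece that matters most for a feature store, since streaming and batch jobs typically rewrite features for keys that already exist.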
PrestoDB comes into the picture as a SQL engine on top of our feature store built on Delta Lake.
Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.
One can take a look at this article, which explains in detail how to set up PrestoDB as a single-node or multi-node cluster.
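Once a cluster is running and a feature table is registered, querying it from Python could look roughly like the sketch below. It assumes the `presto-python-client` package (imported as `prestodb`). The host, user, catalog, schema, table and column names are all placeholders.

```python
import re

# Sketch only: every host, user, catalog, schema, table and column name
# passed to these helpers is a hypothetical placeholder.

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def feature_query(table, columns, limit=10):
    """Build a simple SELECT; identifiers are validated, not quoted."""
    for name in [table, *columns]:
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return f"SELECT {', '.join(columns)} FROM {table} LIMIT {int(limit)}"


def fetch_features(host, table, columns, port=8080, user="ml",
                   catalog="hive", schema="default"):
    """Run the query against a Presto coordinator and return all rows."""
    import prestodb  # lazy import: provided by presto-python-client

    conn = prestodb.dbapi.connect(
        host=host, port=port, user=user, catalog=catalog, schema=schema
    )
    cur = conn.cursor()
    cur.execute(feature_query(table, columns))
    return cur.fetchall()
```

Validating identifiers before formatting them into SQL keeps the helper safe to call with table or column names that come from configuration.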
Below are the steps for setting up PrestoDB SQL on top of our feature lake.
at <path-to-delta-table>/_symlink_format_manifest/
. A manifest is a snapshot of the Delta table: it lists the Apache Parquet data files that make up the table at that point, and PrestoDB reads it to serve queries. For better understanding, please go through this article.
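The manifest step can be sketched as follows, assuming `delta-spark` is installed and the table is registered through a Hive metastore that Presto reads. `generate_manifest` wraps the library's `DeltaTable.generate` call. `hive_ddl` is a hypothetical helper that only builds the Hive DDL string for an external table over the manifest directory, and the table and column names used with it are placeholders.

```python
# Sketch only: helper names, table names and columns are hypothetical.

def generate_manifest(spark, table_path):
    """Write manifest files under <table_path>/_symlink_format_manifest/."""
    from delta.tables import DeltaTable  # lazy import: needs delta-spark

    DeltaTable.forPath(spark, table_path).generate("symlink_format_manifest")


def hive_ddl(table, columns, table_path):
    """Return Hive DDL for an external table that reads the manifest.

    `columns` is a list of (name, hive_type) pairs that should match
    the schema of the Delta table.
    """
    cols = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    location = table_path.rstrip("/") + "/_symlink_format_manifest/"
    return (
        f"CREATE EXTERNAL TABLE {table} ({cols})\n"
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'\n"
        "STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'\n"
        "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'\n"
        f"LOCATION '{location}'"
    )
```

Because the manifest is a snapshot, it goes stale after every write to the table, so it has to be regenerated before Presto can see new data.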