I have been an avid user of open-source software. Today, I am highlighting a use case for three popular open-source projects, Apache Spark, Delta Lake, and PrestoDB, for building reliable, scalable, fast, and versatile data storage for large-scale enterprise data in machine learning use cases.
Understanding the use case
A growing enterprise ML team needs a storage system for the cleaned feature data generated by streaming or batch jobs over raw fact data (a minimal sketch of such a job is shown below).
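To make this concrete, here is a minimal sketch of what such a streaming feature job could look like, using Spark Structured Streaming to write into a Delta table. The paths, column names, and aggregation are hypothetical placeholders, and it assumes a Spark session already configured with the delta-spark package (see the quickstart sketch at the end of this section).

```python
from pyspark.sql import SparkSession, functions as F

# Assumes Delta Lake is already configured on this session.
spark = SparkSession.builder.appName("feature-stream").getOrCreate()

# Read raw fact data as a stream (hypothetical path and schema).
raw_facts = spark.readStream.format("delta").load("/data/raw/clickstream")

# Derive a cleaned, aggregated feature: hourly click counts per user.
features = (
    raw_facts
    .withWatermark("event_time", "10 minutes")
    .groupBy("user_id", F.window("event_time", "1 hour"))
    .agg(F.count("*").alias("hourly_click_count"))
)

# Append the feature rows to a Delta table that downstream ML jobs can read.
(
    features.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/data/checkpoints/user_features")
    .start("/data/features/user_features")
)
```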
Traditionally, we relied on cloud data warehouses such as BigQuery on Google Cloud and Redshift on AWS.
However, data warehouses come with major challenges, such as:
- data staleness,
- reliability,
- total cost of ownership,
- data lock-in,
- and limited use-case support.
Below is an excerpt from the official site, https://delta.io/:
"Delta Lake is an open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs for Scala, Java, Rust, Ruby, and Python."
You can read about Delta Lake in more detail in the research paper or on the project website.
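For a quick local experiment, the quickstart pattern below configures Spark with the delta-spark pip package, writes a tiny feature DataFrame as a Delta table, and reads it back. The table path and columns are hypothetical placeholders.

```python
import pyspark
from delta import configure_spark_with_delta_pip

# Configure a Spark session with the Delta Lake extensions
# (pip install delta-spark), following the delta.io quickstart.
builder = (
    pyspark.sql.SparkSession.builder.appName("delta-quickstart")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write a small feature table in the Delta format (hypothetical schema/path).
df = spark.createDataFrame(
    [(1, 0.42), (2, 0.87)], ["user_id", "feature_value"]
)
df.write.format("delta").mode("overwrite").save("/tmp/delta/user_features")

# Read the table back; engines such as PrestoDB, Trino, or Hive can query
# the same files through their Delta connectors.
spark.read.format("delta").load("/tmp/delta/user_features").show()
```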