Feature stores – how to avoid feeling that every day is Groundhog Day
Feature stores stop the duplication of each task in the ML lifecycle. You can reuse features and pipelines for different models, monitor models consistently, and sidestep data leakage with this MLOps technology that everyone is talking about.
By Monte Zweben, CEO, Splice Machine.
Work as a data scientist follows a cycle: log in, clean data, define features, test and build a model, and make sure the model is running smoothly. That sounds straightforward enough, except that not all parts of the cycle are created equal: data preparation is commonly estimated to take as much as 80% of a data scientist’s time. No matter what project you’re working on, most days you’re cleaning data and converting raw data into features that machine learning models can understand. The monotony of data prep blends the hours together and makes each day of work feel identical to the one before it.
Why can't you do this tedious process more effectively?
You can, with a feature store. A feature store is a shareable repository of features, built to automate the ingestion, tracking, and governance of the data that feeds ML models. Feature stores compute and store features so that they can be registered, discovered, used, and shared across a company. A feature store keeps features up to date for predictions and maintains the history of each feature’s values in a consistent manner, so that models can be trained and re-trained without rebuilding the data behind them.
So, how will this save time for you as a data scientist? Glad you asked.
Reuse features and ETL pipelines
First and foremost, a feature store allows you to reuse features and data pipelines for each model you create. No more waiting around for bespoke ETL pipelines from data engineers, and no more having to copy, paste, and tweak feature definitions from previous models—just search, find, and use the feature you want.
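The "search, find, and use" workflow can be sketched in a few lines of Python. This is only a toy illustration, not the API of any particular feature store: the `Feature` and `FeatureRegistry` classes, the feature names, and the sample record are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Feature:
    """A named, documented feature definition (hypothetical structure)."""
    name: str
    description: str
    compute: Callable[[dict], float]  # maps one raw record to a feature value


class FeatureRegistry:
    """Toy in-memory registry; a real feature store would persist this metadata."""

    def __init__(self) -> None:
        self._features: Dict[str, Feature] = {}

    def register(self, feature: Feature) -> None:
        if feature.name in self._features:
            raise ValueError(f"feature {feature.name!r} is already registered")
        self._features[feature.name] = feature

    def search(self, keyword: str) -> List[str]:
        """Return sorted names of features whose name or description mentions keyword."""
        kw = keyword.lower()
        return sorted(
            f.name for f in self._features.values()
            if kw in f.name.lower() or kw in f.description.lower()
        )

    def get(self, name: str) -> Feature:
        return self._features[name]


# Features are defined once, by whoever builds the first model that needs them.
registry = FeatureRegistry()
registry.register(Feature(
    name="avg_order_value",
    description="Mean order amount per customer",
    compute=lambda rec: sum(rec["orders"]) / len(rec["orders"]),
))
registry.register(Feature(
    name="order_count",
    description="Number of orders per customer",
    compute=lambda rec: float(len(rec["orders"])),
))

# A later model searches for existing definitions instead of re-implementing them.
customer = {"orders": [20.0, 40.0, 60.0]}
found = registry.search("order")
values = {name: registry.get(name).compute(customer) for name in found}
```

The point of the design is that the second model never copies or tweaks a definition; it looks the feature up by name and gets exactly the logic the first model used.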
Consistent model monitoring
In addition, a feature store ensures that the code used to generate features for training is the same code that produces the features your deployed models see. Because feature stores use the same pipelines to serve features and to build training sets, the two are automatically consistent. This makes it easier to train and evaluate models, especially at a scale where you start to lose track of where the different training sets are stored and when they were last updated.
For example, weekly aggregations would begin on the same day in both training and deployment, rather than on Sunday in one and on Monday in the other. This consistency makes it much easier to identify model or feature drift if it does occur. It is also far easier to retrain your model if something goes wrong, since there is a single source of truth in a single location.
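As a rough sketch of the weekly aggregation example, the snippet below defines the week boundary in exactly one place and uses it for both the training set and the value served to a deployed model. The function names and the sample events are hypothetical, and the choice of Monday as the start of the week is an assumption made only for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Dict, Iterable, Tuple


def week_start(day: date) -> date:
    """Return the Monday that starts the week containing `day` (assumed convention)."""
    return day - timedelta(days=day.weekday())


def weekly_totals(events: Iterable[Tuple[date, float]]) -> Dict[date, float]:
    """Sum event amounts per week, keyed by the week's starting Monday."""
    totals: Dict[date, float] = defaultdict(float)
    for day, amount in events:
        totals[week_start(day)] += amount
    return dict(totals)


events = [
    (date(2021, 3, 7), 10.0),   # Sunday: belongs to the week starting March 1
    (date(2021, 3, 8), 5.0),    # Monday: starts a new week
    (date(2021, 3, 10), 2.5),   # Wednesday: same week as March 8
]

# Training path: aggregate the full history.
training_set = weekly_totals(events)

# Serving path: aggregate only the recent events, with the SAME function.
current_week = week_start(date(2021, 3, 10))
online_value = weekly_totals(events[1:])[current_week]
```

Because both paths call `weekly_totals`, the value the deployed model sees for the current week matches the value it was trained on; there is no second implementation that could quietly start the week on a different day.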
Evade data leakage
Creating a training set from historical values is complicated on its own, but when you’re building a model that needs to be regularly re-trained, retrieving different time-series feature sets is a real headache. Writing complex temporal joins in SQL is a lot of work and often outside a data scientist’s skill set. It’s all too easy to make mistakes that could compromise your model.
Feature stores log the time a feature was observed, as well as when it was added to the feature store, so you can identify the value of features at any point in time. By logging these timestamps, feature stores automatically enable point-in-time correctness when building training sets. This makes it incredibly easy to train a model on accurate data and automatically retrain it without worrying about data leakage.
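To make point-in-time correctness concrete, here is a minimal sketch of the lookup behind it: for each labeled event, take the latest feature value observed at or before the event time, and never a later one. It is an illustration under simplifying assumptions, not how any specific feature store implements it. The `value_as_of` helper, the `balance_history` data, and the label events are hypothetical, and the sketch tracks only the observation timestamp, not the separate ingestion timestamp mentioned above.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional, Tuple

# (observed_at, value) history for one feature of one entity, sorted by time.
FeatureHistory = List[Tuple[datetime, float]]


def value_as_of(history: FeatureHistory, when: datetime) -> Optional[float]:
    """Return the latest value observed at or before `when`, or None if there is none."""
    times = [ts for ts, _ in history]
    idx = bisect_right(times, when)  # number of observations with ts <= when
    return history[idx - 1][1] if idx else None


balance_history: FeatureHistory = [
    (datetime(2021, 1, 1), 100.0),
    (datetime(2021, 2, 1), 250.0),
    (datetime(2021, 3, 1), 50.0),
]

# Labeled events for training: (event_time, label).
label_events = [
    (datetime(2020, 12, 15), 0),  # before any observation exists
    (datetime(2021, 1, 20), 0),
    (datetime(2021, 2, 1), 1),    # exactly at an observation time
    (datetime(2021, 2, 10), 1),
]

# Each training row only sees what was known at its own event time.
training_rows = [
    (event_time, value_as_of(balance_history, event_time), label)
    for event_time, label in label_events
]
```

No row in `training_rows` picks up the March value of 50.0, because every event happened before it was observed. A naive join on "the latest value" would leak that future value into all four rows, which is exactly the mistake a hand-written temporal join tends to make.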
Many people assume that data scientists spend most of their time testing and building models, but so much of getting a model up and running is wrangling the data that makes the model possible. Feature stores keep track of every data pipeline, feature, and all associated metadata so that the pieces you use to build a model are all reusable, easily shareable, and completely transparent to review. By making sure you only have to do each task once, you free up time and energy to focus on other projects. Feature stores are an efficiency no-brainer; if you want to learn more about their impact on the scale of an entire business, you can read my blog post, “Do you need a feature store?”
Bio: Monte Zweben is the co-founder and CEO of Splice Machine, the real-time AI company. A technology industry veteran, Monte’s early career was spent with the NASA Ames Research Center as the deputy chief of the artificial intelligence branch, where he won the prestigious Space Act Award for his work on the Space Shuttle program. Monte then transitioned to the entrepreneurial world, founding the industry-leading Blue Martini and Red Pepper Software startups. Monte has published articles in the Harvard Business Review, various computer science journals, and conference proceedings. He was Chairman of Rocket Fuel Inc. and serves on the Dean’s Advisory Board for Carnegie Mellon University’s School of Computer Science.