From Oracle to Databases for AI: The Evolution of Data Storage
From Oracle to NoSQL databases and beyond, read about data management solutions from the early days of the RDBMS to those supporting AI applications.
Technology vector created by fullvector - www.freepik.com
Even though machine learning has become commoditized, it’s still the Wild West. ML teams across various industries are developing their own techniques for processing data, training models, and using them in production. This is clearly not a sustainable approach to machine learning. Over time, these diverse approaches will become standardized. To accelerate that process, the industry needs developer tools designed specifically for AI. In this article, you’ll see the difference between traditional data storage solutions and databases built to address AI use cases.
Anyone who has worked at an enterprise has seen an interface with rows and columns where employees diligently enter information related to their work.
The Oracle database was introduced more than 40 years ago as a target for this kind of data entry. Even though many enterprises used it as their default database, it was quite a complicated product for the average user. It was also proprietary and thus not accessible to developers. Oracle even pioneered the DeWitt Clause, which prohibits the publication of database benchmarks that haven’t been authorized by the database’s vendor.
Fast forward 15 years, and we saw a new generation of databases like MySQL and PostgreSQL. The key differentiator from Oracle’s offering was accessibility – the new breed of databases was open source and could be run on a local machine, which made it much easier for developers to set up and use. It’s not that surprising that they gained wide adoption during the dot-com boom.
Web 2.0 and Open Source Databases Take Off
Moving forward in the timeline, we saw the rise of “Web 2.0” in the early 2000s. Jeff Dean and Sanjay Ghemawat of Google proposed MapReduce, a radically new approach to data processing. The logic behind this method was alternately performing mapping and reduction operations on datasets. Mapping operations include filtering and sorting of data in rows and columns. Reduction operations involve summaries and aggregations of tabular data. The infrastructure was designed to run these operations in parallel across many machines. This MapReduce approach sits at the core of Hadoop, one of the most successful open source projects of the 2000s.
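To make the pattern concrete, here is a minimal, single-machine sketch of a MapReduce-style word count in Python (the documents and counts are purely illustrative); in a real Hadoop cluster, the map and reduce tasks would be distributed across many machines.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit one (word, 1) pair per word in the document.
    return [(word, 1) for word in document.split()]

def shuffle_phase(pairs):
    # Shuffle: group intermediate values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate the values collected for each key.
    return {key: sum(values) for key, values in grouped.items()}

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]
pairs = chain.from_iterable(map_phase(doc) for doc in documents)
word_counts = reduce_phase(shuffle_phase(pairs))
print(word_counts)  # {'the': 3, 'quick': 2, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 2}
```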
The years 2007–2012 brought an array of NoSQL databases to the world. To name the most popular: Apache Cassandra, Google’s BigTable, MongoDB, Redis, and Amazon’s DynamoDB.
The key differentiator for these databases is that they scale well horizontally. Many machines can spin up in parallel and collaboratively serve a database capable of housing truly large datasets. NoSQL databases don’t use a relational approach to model their data as rows and columns, and they readily accept unstructured data without requiring complex data models.
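As a rough illustration of that flexibility, the sketch below assumes a MongoDB instance running locally and the pymongo client installed: two documents with different shapes land in the same collection, with no schema declared up front.

```python
from pymongo import MongoClient

# Assumes a MongoDB server is reachable on localhost and pymongo is installed.
client = MongoClient("mongodb://localhost:27017")
events = client["demo_db"]["events"]

# Two documents with different shapes in the same collection: no schema required.
events.insert_one({"user_id": 42, "action": "login", "ip": "203.0.113.7"})
events.insert_one({"user_id": 42, "action": "purchase",
                   "items": [{"sku": "A-100", "qty": 2}], "total_usd": 59.90})

# Query by any field, including ones that only some documents contain.
print(events.count_documents({"action": "purchase"}))
```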
It’s also important to mention that there are two types of databases: transactional and analytical. Transactional databases are operational and store critical data, like user logs, activity, IP addresses, purchase orders, etc. They are designed to be readily available, accept new data quickly, and serve requests for data with minimal delay. Analytical databases work much more slowly, can be hosted on cheaper storage, and allow for more in-depth analysis of data than would be feasible to perform in a transactional database.
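To make the distinction concrete, here is a minimal sketch using SQLite as a stand-in for both kinds of system (the table and values are illustrative): the first statement is a typical transactional write, while the final query is the kind of aggregation an analytical database would run over far more history.

```python
import sqlite3

# SQLite as a stand-in for both workloads (illustrative table and values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id INTEGER, amount REAL, created_at TEXT)")

# Transactional workload: small, latency-sensitive writes and point lookups.
conn.execute("INSERT INTO orders VALUES (?, ?, ?)", (42, 19.99, "2021-06-01"))
conn.commit()

# Analytical workload: scans and aggregations over long spans of history,
# typically run against a warehouse rather than the live operational database.
rows = conn.execute(
    "SELECT user_id, SUM(amount) FROM orders "
    "WHERE created_at >= '2021-01-01' GROUP BY user_id"
).fetchall()
print(rows)  # [(42, 19.99)]
```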
Big Data: Analytics and the Beginning of the AI Hype Cycle
The term ‘Big Data’ was initially introduced in 2005 but caught on much later, as applications became more complex. At that time, proper analytics involved pulling large amounts of data from many different places. New products entered the market to address the rising challenge – Amazon Redshift, Snowflake, Google BigQuery, ClickHouse. These data warehouses have become mainstays for modern enterprises.
The year 2020 introduced a new solution for data orchestration – the ‘lakehouse’. Lakehouses help their users query historical data at low cost with little to no degradation in analytical and ML workloads. This has turned into something of an arms race between Snowflake and Databricks. While they carry on their battle in the world of tabular data, something big is rising from the deep.
In 2021, deep learning (DL) became mainstream. This is the era of unstructured data. Unlike tabular data, which is represented in rows and columns, unstructured data consists of images, video streams, audio, and text. This is data that cannot be represented in a spreadsheet.
Because of last year's lockdowns, the consumption of Netflix, TikTok, and other social media has skyrocketed to unprecedented levels. These products all deal with and generate large quantities of unstructured data. How should the companies manage that data, especially if they want to use it to drive deep insights through AI? Can AI teams working with this data use database solutions for their work?
You might be surprised, but none of the technologies we’ve discussed fully address the needs of deep learning practitioners. Companies like Tesla, which work with large amounts of unstructured data, admit that they had to build their own infrastructure.
Very few enterprises have enough internal bandwidth to build data infrastructure from scratch. The AI industry needs a solid data foundation and a database designed specifically for AI.
A Database for AI is a Pillar of Deep Learning Innovation
For a long time now, the industry has needed a simple API for creating, storing, and collaborating on AI datasets of any size. More importantly, this tool should be built specifically for DL-related tasks.
Traditional databases store data in memory to run fast queries, but in-memory storage is extremely expensive for ML datasets, which are very large. Even attached disks like AWS EBS are costly at that scale. The most cost-efficient storage layer is object storage such as S3, Google Cloud Storage, and MinIO, but these systems are slow for rapid querying or computation.
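For a sense of why raw object storage is slow for this, consider the sketch below, which uses boto3 against a hypothetical bucket and key: every read is a full HTTP request, so randomly accessing millions of small samples during training pays a per-request latency cost that in-memory databases never see.

```python
import boto3

# Hypothetical bucket and key, shown only to illustrate the access pattern.
s3 = boto3.client("s3")
response = s3.get_object(
    Bucket="my-ml-datasets",
    Key="images/train/sample_000001.jpg",
    Range="bytes=0-65535",  # fetch only the first 64 KiB of the object
)
chunk = response["Body"].read()

# Every read like this is a full HTTP round trip, so fetching millions of
# tiny objects one by one adds up quickly.
print(len(chunk))
```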
A database built specifically for deep learning should support hosting data on those cloud storage systems and should store it in a format native to deep learning models. This allows much faster computation and is more cost-effective, because accessing data through such a framework should not incur any additional pre-processing cost before it reaches ML models. Such a framework could help modern, AI-enabled enterprises save up to 30% on their infrastructure costs.
The framework should be optimized to transfer data from cloud storage to machine learning models with minimal latency. This would allow AI teams to instantly explore and visualize massive datasets that could not be housed anywhere else.
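As an illustrative sketch only (the bucket name, key layout, and chunk format are assumptions, not any particular product’s API), a dataset stored as fixed-size NumPy chunks in object storage could be streamed straight into a PyTorch DataLoader, with workers prefetching chunks while the model trains:

```python
import io

import boto3
import numpy as np
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class ChunkedS3Dataset(IterableDataset):
    """Streams samples from .npy chunks stored in an object-store bucket."""

    def __init__(self, bucket, keys):
        self.bucket = bucket
        self.keys = keys

    def __iter__(self):
        # Shard the chunk list across DataLoader workers so each chunk is read once.
        info = get_worker_info()
        keys = self.keys if info is None else self.keys[info.id::info.num_workers]
        s3 = boto3.client("s3")
        for key in keys:
            body = s3.get_object(Bucket=self.bucket, Key=key)["Body"].read()
            chunk = np.load(io.BytesIO(body))  # one object holds many samples
            for sample in chunk:
                yield torch.from_numpy(sample)

# Hypothetical key layout: 128 chunks of training data.
keys = [f"train/chunk_{i:05d}.npy" for i in range(128)]
loader = DataLoader(ChunkedS3Dataset("my-ml-datasets", keys),
                    batch_size=64, num_workers=4)  # workers prefetch chunks in parallel
```

Packing many samples into each object amortizes the per-request latency of cloud storage; that, along with caching and prefetching, is the kind of optimization a purpose-built framework should handle for the user.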
More importantly, modern databases for AI should be very easy to use with just a few lines of code. According to a recent survey from Kaggle, over 20% of data scientists have less than 3 years of industry experience, and another 14% have less than a year. Machine learning is still a young field and simplicity is crucial.
We have begun to see the emergence of databases built specifically for AI use cases. In the next five years, we will witness a standardization of the ML infrastructure stack. Organizations that adopt AI databases more quickly will be at the forefront of the ML revolution.
Davit Buniatyan (@DBuniatyan) started his Ph.D. at Princeton University at 20. His research involved reconstructing the connectome of the mouse brain under the supervision of Sebastian Seung. While trying to solve the hurdles he faced analyzing large datasets in the neuroscience lab, Davit became the founding CEO of Activeloop, a Y Combinator alum startup. He is also a recipient of the Gordon Wu Fellowship and the AWS Machine Learning Research Award. Davit is the creator of Activeloop, the Database for AI.