Data management has evolved significantly over the last several decades. Years ago, the architecture was much simpler – think Oracle databases running on IBM mainframes, in the form of an on-premises data stack. When the cloud emerged, it addressed some of the big disadvantages of on-premises data stacks, including high cost, limited user access and constrained processing power. But eventually, significant problems began to appear – the volume and diversity of data sources exploded, and so did the effort required to manage and analyze that data.
This led to the next stage of evolution – the modern data stack. Now companies had more than just a relational database in their arsenal. By adopting a large set of tools, companies could analyze much broader datasets and benefit from better performance. From a management and cost perspective, a modern data stack in the cloud seemed to make everything much easier. Scalability, speed to high-quality insights and low CapEx encouraged adoption.
However, challenges slowly started to arise around building and managing this modern data infrastructure. Ultimately, companies ended up cobbling together what they thought were the best versions of these products. And now, we are quickly reaching a tipping point where the weaknesses of the modern data stack are starting to outweigh its benefits. For example:
- Complexity of multiple tools and platforms – There are different databases for each data type and multiple tools for ETL and ingestion, with each tool reinventing cataloging, governance and access control processes.
- Extremely high total cost of ownership (TCO) – There are overlapping license costs for the numerous databases and tools, and hiring data engineering specialists for each of these best-in-class solutions is becoming cost-prohibitive.
- Data silos, which hinder collaboration – These stacks have over-rotated to extreme decoupling to make everything modular but disjointed. This impacts collaboration between data analysts, data scientists, and product owners. Since they don’t work on the same platform and don’t see each other’s processes, handoffs are weak, communication is poor, and silos form. Decision tools and data applications are fed with inaccurate or stale data, and time to insight slows.
- Governance and security – With so many tools and data transfers across teams and different data silos, it becomes impossible to centrally manage access policies and security. The context of the data and the degree of abstraction of the data products themselves are also an issue; there is still a lot of effort involved in delineating who gets access to raw data versus derived data products.
- Performance and scaling – The weakest link or orchestration step in one part of the data stack negates performance gains in another. Sure, your team may have amazing BI tools, but an integration with a poorly selected database system may result in dashboards that fail to load in a timely fashion.
Over the past year, things have gotten much worse. Recently proposed mitigations of some of the above issues (such as silos and governance) added more complexity than they alleviated. For example, the data mesh framework opted to retain all the software used in the modern data stack, but layered another system for cataloging and governance on top of it. That often means buying another vendor license and doing additional engineering work to tame the explosion of tools in the organization. LLMs and other user-facing AI/ML solutions increase these challenges even further with the custom data structures that support their statistical models – structures that traditional data architectures weren’t designed to handle either. This drives the need for multimodal data management beyond tables, which means that even verticals that “traditionally” used tabular databases as the focal point of their infrastructure are now seeking specialized software (such as vector databases) for non-tabular data and workloads.
The problem, at its core, is one of tables, special-purpose solutions, and files. Tables are too rigid to structure arbitrary data (e.g., images, ML embeddings), which forces organizations to build bespoke solutions on top of tables, almost always compromising performance and reusability. To address this, special-purpose solutions pop up, creating the very silos that exacerbate the issue they’re trying to solve. And files capture everything else, leading to the proliferation of numerous obscure formats, most often very specific to an industry or use case.
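To make the rigidity point concrete, here is a minimal Python sketch (the sizes and schema are illustrative, not drawn from any specific product): packing ML embeddings into a relational table typically means serializing each vector into an opaque BLOB column, which loses the ability to slice or compute on the vectors in place, whereas a multi-dimensional array captures them natively.

```python
import sqlite3
import numpy as np

# Illustrative only: 1,000 embeddings of dimension 768 (hypothetical sizes).
embeddings = np.random.rand(1000, 768).astype(np.float32)

# Relational route: each vector is serialized into an opaque BLOB, so slicing
# a sub-range of dimensions requires deserializing every row by hand.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE embeddings (id INTEGER PRIMARY KEY, vec BLOB)")
conn.executemany(
    "INSERT INTO embeddings (id, vec) VALUES (?, ?)",
    [(i, row.tobytes()) for i, row in enumerate(embeddings)],
)
blob = conn.execute("SELECT vec FROM embeddings WHERE id = 42").fetchone()[0]
vec = np.frombuffer(blob, dtype=np.float32)  # manual decode on every read

# Array route: the same data is a 2D array; any slice is a direct operation.
first_dims_of_all_vectors = embeddings[:, :16]  # no serialization round-trip
single_vector = embeddings[42]
```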
The problem is amplified further by the way users manage their code and spin up compute workloads (be it pipelines or web apps). When you separate these components from the data infrastructure, you need yet another third-party tool to enforce governance, compliance and sane overall management. Integrating tools for coding and spinning up computations ends up dramatically increasing TCO, as each new integration comes with operational overhead that must be maintained indefinitely (e.g., updating packages and dealing with conflicting versions).
We argue that the solution to the broken modern data stack is two-fold: (i) a more flexible, unified data model that can adapt to the challenges of modern architectures, and (ii) unifying the compute and code platform with the database itself. The unified data model allows organizations to handle all their data modalities with a single data system, which implements governance, cataloging, resource provisioning, etc. once, regardless of the use case. Such systems already exist in the market today, and they have chosen the multi-dimensional array as the unified format.
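As a conceptual sketch of what an array-based unified data model looks like (using plain NumPy for illustration, not any particular vendor’s API): a table, an image and a set of embeddings can all be expressed as multi-dimensional arrays, so a single engine that understands arrays can catalog, govern and process all three with the same machinery.

```python
import numpy as np

# A "table" as a 1D array of records (rows indexed by a single dimension).
table = np.zeros(
    5, dtype=[("user_id", "i8"), ("country", "U2"), ("spend", "f8")]
)
table[0] = (101, "US", 42.5)

# An RGB image as a dense 3D array (height x width x channel).
image = np.zeros((1080, 1920, 3), dtype=np.uint8)

# ML embeddings as a dense 2D array (vector id x dimension).
embeddings = np.random.rand(10_000, 768).astype(np.float32)

# One abstraction, one access pattern: slicing works identically everywhere,
# which is what lets a single system implement governance, cataloging and
# provisioning once for every modality.
print(table[:3])                 # first three rows
print(image[100:200, 300:400])   # an image patch
print(embeddings[42, :16])       # part of one embedding vector
```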
Treating code and compute as part of the same system in which you store and analyze your data likewise reuses the same governance and compliance model, obviating the need to rebuild separate infrastructure for them. It also brings the cost and performance benefits that come with not having to replicate and reprocess the same data across multiple systems. Again, there are examples in the market that support more than just the storage and analysis of structured data, offering coding capabilities (such as user-defined functions) and the ability to spin up arbitrary computations (task graphs for pipelines or web apps).
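To illustrate the idea of user-defined functions composed into a task graph, here is a minimal, self-contained Python sketch; `TaskGraph` and `register_udf` are hypothetical names used for explanation, not a specific vendor’s API. The point is that the UDFs and their dependencies live inside the same platform that holds the data, so they inherit its governance instead of requiring a separate orchestration stack.

```python
from typing import Any, Callable, Dict, List


class TaskGraph:
    """Toy task graph: nodes are UDFs, edges declare which outputs feed which inputs."""

    def __init__(self) -> None:
        self._tasks: Dict[str, Callable[..., Any]] = {}
        self._deps: Dict[str, List[str]] = {}

    def register_udf(self, name: str, fn: Callable[..., Any], depends_on: tuple = ()) -> None:
        # Register a user-defined function and the tasks whose results it consumes.
        self._tasks[name] = fn
        self._deps[name] = list(depends_on)

    def run(self) -> Dict[str, Any]:
        # Execute tasks in dependency order, passing upstream results downstream.
        results: Dict[str, Any] = {}
        remaining = dict(self._deps)
        while remaining:
            ready = [t for t, deps in remaining.items() if all(d in results for d in deps)]
            if not ready:
                raise RuntimeError("cycle detected in task graph")
            for t in ready:
                results[t] = self._tasks[t](*[results[d] for d in remaining[t]])
                del remaining[t]
        return results


# Hypothetical pipeline: ingest -> clean -> aggregate -> report.
graph = TaskGraph()
graph.register_udf("ingest", lambda: list(range(10)))
graph.register_udf("clean", lambda xs: [x for x in xs if x % 2 == 0], depends_on=("ingest",))
graph.register_udf("aggregate", lambda xs: sum(xs), depends_on=("clean",))
graph.register_udf("report", lambda total: f"total = {total}", depends_on=("aggregate",))
print(graph.run()["report"])  # total = 20
```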
To sum up, we argue that the way to fix the problems of the modern data stack seems to boil down to consolidating the functionality of the disparate tools into a single system – one that looks more like a database management system than a mesh of different tools. Therefore, it seems that the responsibility for fixing the modern data stack should shift from organizations to the data software vendors, who are able to attract the talent required to build such solutions. And the gears have already been set in motion – it’s only a matter of time before organizations realize that such software exists in the market today, ready to take on their data infrastructure challenges.
About the Author
Stavros Papadopoulos, Founder and CEO, TileDB. Prior to founding TileDB, Inc. in February 2017, Stavros was a Senior Research Scientist at the Intel Parallel Computing Lab, and a member of the Intel Science and Technology Center for Big Data at MIT CSAIL for three years. He also spent about two years as a Visiting Assistant Professor at the Department of Computer Science and Engineering of the Hong Kong University of Science and Technology (HKUST). Stavros received his PhD in Computer Science at HKUST under the supervision of Prof. Dimitris Papadias, and held a postdoctoral fellowship at the Chinese University of Hong Kong with Prof. Yufei Tao.