Databricks & the Future of Data
A summary of Databricks, the data industry, and whether I'll invest in the Databricks IPO.
2/27/2024 Update: This was one of my early articles, and I don’t think the quality is as high as my recent articles (hopefully they’ve been getting better). However, Ali Ghodsi (CEO of Databricks) shared the article on LinkedIn with some positive feedback. That led to this being one of my most popular articles, so I’ll keep the article up with the caveat that it’s not as detailed as I’d like it to be. When Databricks decides to IPO, I’ll write another Databricks primer that provides more context on where Databricks fits into the data ecosystem, with some more quantitative research. Thanks for reading!
Introduction
Databricks is the most advanced unified analytics platform on the market. It has the potential to empower AI more than any other data company on the planet. For years, it’s been the bridge between data and actionable AI. As the hype fades over the next year, the question becomes, “Who can actually make AI happen?” Databricks actually can.
Nvidia is soaring high right now, and their growth is coming almost exclusively from cloud providers expanding their capacity to account for the increased computing needs of AI. What does that mean for Databricks? The money will soon be flowing into “do it yourself” AI on those cloud companies. Databricks does that better than anyone on the planet.
What about competition? Microsoft just dropped a bombshell on the data warehousing industry. Yet, it didn’t seem to explode. Fabric did not get the immediate hype I thought it would.
Microsoft is going to brand Fabric as the most comprehensive data platform on the market. There’s going to be an aggressive campaign from Microsoft to take market share while the industry is still in its infancy. Most enterprise companies have not adopted a data platform of the future yet. So, who wins? Databricks, Microsoft, Snowflake?
What will those data platforms of the future look like?
Here’s the blueprint: a consolidated, open-source SaaS platform.
Companies are looking to be able to complete all their data operations in one tool. They’re going to want complete control over their data, and you can only fully support that with open-source data formats. They’re going to want it in a SaaS offering, because they don’t want to manage their own updates, data centers, infrastructure, etc.
Whoever can execute best on this vision will win the data industry.
So, let’s dive into whether or not Databricks can be that winner.
Big Data Trends
The data industry will be driven by two primary trends over the next 5 years.
Open-Source: More than anything else, companies want complete control over their data. Right now, all the leading data warehousing providers store data in their own proprietary formats. What does that mean? If you want to move your organized data to another tool, you can’t do it directly. You would have to keep backups of your data in their original format and then reorganize them in the new tool. While the leading data providers offer some open-source options, the future is open-source storage by default. It’s a gigantic hassle to migrate data out of proprietary formats, which means two things:
It is a massive business moat for data-storage providers. Moving your data is a massive hassle with huge downside risk.
From a CIO’s perspective, adopting a proprietary format is a terrible decision. You become completely locked into one tool. You don’t have bargaining power, and you’re reliant on that tool to work as it’s supposed to.
Consolidation: The amount of data in the world is growing exponentially, and there are hundreds of places for that data to reside. The push to consolidate it has been ongoing for some time, and it’s continuing to accelerate for a variety of reasons.
AI: Your data has to be properly stored and organized to be made valuable.
Security: Hackers are getting better and better. They will only be empowered by AI. Data is constantly being shared, and it’s vulnerable. Having a simple data storage solution is absolutely critical. If you don’t know where your data is, you can’t protect it.
Governance: Over the next decade, regulations will become stricter and stricter on data. In Europe, data is much more regulated. The US is getting there. If you can’t explain what data you have, where it’s stored, and how you share it…that is a huge liability to your company.
Simplicity: Managing data is very hard. Having it all in one place is extremely valuable. What’s even more valuable? Being able to do all of your data processes from one tool as well. Data streaming, analysis, training, and deployment all from one place. IMMENSELY valuable.
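To make the open-format point from the first trend concrete, here is a toy Python sketch. It is only an illustration under loud assumptions: `tool_a_write` and `tool_b_total` are hypothetical stand-ins for two unrelated data tools, and JSON Lines stands in for an open table format. None of this is Databricks or Snowflake code. The idea it shows is that when the stored format is publicly documented, a second tool can read the files directly, with no export step through the first tool.

```python
import json
import tempfile
from pathlib import Path


def tool_a_write(path, rows):
    """Hypothetical 'Tool A': stores its table as JSON Lines, a documented open format."""
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def tool_b_total(path, column):
    """Hypothetical 'Tool B': knows nothing about Tool A, only the file format."""
    total = 0.0
    with path.open(encoding="utf-8") as f:
        for line in f:
            total += json.loads(line)[column]
    return total


def demo():
    """Write with one tool, read with the other, and return the summed revenue."""
    rows = [
        {"customer": "acme", "revenue": 120.0},
        {"customer": "globex", "revenue": 80.0},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "sales.jsonl"
        tool_a_write(path, rows)
        # No export or conversion step: Tool B reads Tool A's files as they are.
        return tool_b_total(path, "revenue")


if __name__ == "__main__":
    print(demo())
```

With a proprietary format, `tool_b_total` would have no specification to read against, which is the lock-in described above.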
This leads us to two questions about Databricks:
How are they positioned to take advantage of these two trends?
Can their business model adapt to these trends?
There’s one other vitally important question with any tech company that isn’t a cloud provider:
How well does Databricks play with the three big cloud providers? Do they have a durable competitive advantage against them? Could any of the big three develop a competing solution? Does Databricks provide value to the big three?
Before we answer these, let’s dive into what Databricks actually does:
Databricks: The Unified Analytics Platform
Databricks is all built around the developer experience. It’s essentially a managed SaaS offering consolidating all the open-source tools you need into one place. It’s a platform built by the creators of a few of the world’s most popular open-source projects. It’s a platform built by the experts for the experts.
It started out as a managed Apache Spark solution, centered around the Notebook. Developers could write their code for ML workflows, then deploy it using Databricks without having to worry about managing compute and storage size. Since then, it’s developed into a legitimate data platform. Their SVP of field engineering does an excellent demo of the Databricks platform.
Link to the presentation: Discover the Data Lakehouse - YouTube
It starts with the “lakehouse” architecture, which combines the best of both worlds from the data lake and the data warehouse.
In 2021, Databricks announced Unity Catalog, its governance solution. This was a massive step in becoming a full-scale platform.
One of Snowflake’s biggest selling points is its data sharing capabilities. Investors point to it as another protective moat around their business. Databricks supports an open-source option, Delta Sharing, to do the same thing.
The most important piece of Databricks is its ability to enable ML & AI. The image below shows perfectly how it happens. It is the easiest platform in the world to deploy machine learning workflows.
Finally, Databricks supports SQL workloads. SQL is by far the world’s most used language for working with structured data, and most of the world’s structured data is stored in a SQL-based format. When Databricks released support for SQL, they transitioned from being an ML platform to being a full-scale data platform. When you look at a competitor like Snowflake, the SQL-based interface is the majority of their platform. It’s only a small part of Databricks.
As I mentioned previously, Databricks is all centered around the developer experience. So, the role-based experience is a really nice benefit. Whichever role you select, Databricks pre-populates the tools you are most likely to use.
Tying everything together, here is a screenshot of the Databricks interface itself.
Quick summary of how you actually use the features mentioned above:
Workspaces - where you work in Databricks.
Notebooks - where you write your code.
Tables - where you deal with file uploading.
Jobs - productionizing your notebooks.
Clusters - how much compute you need.
How does Databricks play into the future of data?
Going back to our original questions, how does this technology fit into the data trends of the future?
Open Source: Databricks has proven their commitment to open source, and I think that will be one of the biggest drivers to their success.
Point 1: Databricks makes its Delta Lake cloud data platform fully open source (techmonitor.ai)
Just last year, Databricks transitioned Delta Lake 2.0 to being fully open-source. This business model of managed open-source will be THE winning play in data over the next decade, primarily because it gives C-suite executives peace of mind that they have control over their data. They don’t feel locked in or pressured by their software provider.
Point 2: Databricks can transition much more easily to a fully open-source framework if they need to. If you look at a company like Snowflake, their business model revolves around a proprietary storage format. Open-sourcing their tech would significantly lessen their moat.
Databricks, on the other hand, has a full-scale data ecosystem. You can store your data, train your models, and deploy your models all in Databricks. It’s much easier to add Business Intelligence features to their platform. It’s much harder for Snowflake to add a full-scale machine learning service to their platform.
Point 3: Databricks connects to the most popular open-source tools: TensorFlow, Terraform, PyTorch. The future of Databricks really is as a managed open-source machine learning platform. That future is bright.
Consolidation: This is a more interesting trend. Snowflake is a better data warehousing tool, and it makes it easy to store your data in one place and then handle security, governance, and business intelligence. But Snowflake will never be a consolidated data platform for all your data needs. The question is: can Databricks match Snowflake’s SQL-based platform and its ease of use for data management?
Point 1: Databricks SQL is the most important piece for Databricks to become a full-scale consolidation play. Most of the data stored in enterprise companies is in SQL databases. SQL is still taught in schools, and it’s still arguably the most important language for a data engineer to know. Databricks historically didn’t play in the SQL space, but they changed that in 2020 by releasing Databricks SQL.
Where does Snowflake absolutely kill it? Providing a multi-cloud SQL warehouse. If Databricks can prove that they provide an equal level of ease of use for their SQL platform, I think Snowflake will have a very difficult time competing.
Point 2: Unifying everything else. Databricks can legitimately provide a unified source for all your data tools.
Databricks really is an awesome, extensive platform. Studying Databricks helps you understand how simple Snowflake is. So, the only issue here is that Databricks could be too big. If a company just wants data warehousing for governance and management, they’ll choose Snowflake. If they want the full data platform, it seems pretty obvious to me that Databricks is the best option. The open-data advantage is gigantic. Snowflake will not power AI/ML, but Databricks has the potential to. The question really is what Big Tech’s analytics platforms can do to Databricks.
Competing with Microsoft, Google, and Amazon
And the ever-important question: how does Databricks play with the big guys?
How well does Databricks play with the three big cloud providers?
There’s an interesting dynamic between the cloud providers and other software vendors. Big tech has a product that competes with most software vendors’ offerings. Each of the cloud providers has taken a different strategy with independent software vendors (ISVs), but it all comes down to being “frenemies” with them.
AWS: Amazon is probably the cloud provider that is friendliest to software companies. They’ve made digital natives a priority since the beginning. Because of this, they make a naturally good partner for Databricks. Plus, it seems that Amazon is focusing on core compute and storage while letting ISVs invest in managed services. They’d rather let other companies fight that battle while they collect the margins from providing the infrastructure.
Azure: Microsoft traditionally tries to create the entire tech stack for companies. The idea is that you can do everything you need with Microsoft, everything integrates together, and you get good pricing with the bundling strategy. It’s an outstanding business model, but it alienates software vendors. Databricks has been one of the exceptions thus far. With Azure Databricks, Microsoft has been a preferred Databricks partner since that product launched.
GCP: Interestingly, GCP and Databricks share a similar focus on open source. As announced in their press release, “Google Cloud and Databricks share a common vision of open source, open data platforms, and an open cloud.”
How do I see the battle playing out long-term? I see AWS being the best partner for Databricks long-term. I believe both MSFT and GCP will release unified analytics platforms that put a lot of competitive pressure on Databricks. They are in an intense fight to be the AI company of the future, and the data fight is the biggest slice of the pie.
As always, thanks for reading!
Amazing article. A few facets you might have missed in your analysis:
A.
- Comparing R&D and the pace of new features: if we analyse the past 2 years, which company has brought radical data features that changed industry trends: Google, Amazon, MS, Snowflake, or Databricks?
I believe it’s Databricks.
- With this being the key indicator of past, present, and future growth, will it drive my decision to invest in an IPO or shares?
100%, it will.
The future is consistent R&D in the data platform space. Companies that can bring innovations and product features that sense industry needs will be the leaders in this market.
B.
Another key aspect to consider: if you invest in Google, Microsoft, or AWS, and assuming they grow, then their key ecosystem players will grow too (“key” is important here).
So either you invest in all three major players to hedge the risk, or you invest in a data company that is already hedged across them.
Paresh
I prefer the upside of Dataiku, which has a far superior SaaS AI/ML solution.