Derek Zanutto: The Rise of Data Lakes
Image credits: Donald Iain Smith / Getty Images (via TechCrunch)
While data warehouses have evolved over the years as a way to store large amounts of structured data, more and more enterprises have needed a way to store large amounts of unstructured data to fuel machine learning applications.
Enter data lakes.
Now, cloud vendors like Amazon and newer startups like CapitalG portfolio company Databricks have taken the data lake concept to a new level, with new opportunities for startups and investors filling the quickly maturing data lake ecosystem.
TechCrunch’s Ron Miller spoke with CapitalG general partner Derek Zanutto about the impact of cloud vendors in enterprise data lakes, how investors are approaching the space, and where the new opportunities and challenges lie.
Read the original article (paywalled) on TechCrunch
TechCrunch: Where are the opportunities for startups in the data lakes space with players like Snowflake and the cloud infrastructure vendors so firmly established?
Derek: The rise of the data lake model is creating a large, rapidly growing market opportunity adjacent to the more established data warehouse model typified by Snowflake and the big cloud vendors. The data lake model enables enterprises to unlock insights from a broader array of data (structured and unstructured) for a broader array of use cases (historical financial reporting and predictive AI analytics).
While the data lake offers many benefits for data-driven organizations, there are certain emerging challenges that will need to be addressed for the data lake model to further accelerate in the enterprise (for example, data reliability and highly performant querying). Companies that can build solutions to address these emerging pain points will have the opportunity to capture a large share of the significant profit pool being created around the data lake model.
TechCrunch: What are the biggest challenges for startups entering the data lake market right now and how do they overcome them?
Derek: Startups must help enterprises navigate what is essentially a market creation story by convincing buyers, who have spent the last 40 years procuring data warehouse technologies, to rethink their approach to data storage and analytics. Furthermore, the data lake category is intensely competitive: startups will frequently compete against well-entrenched data platforms that often own the underlying object storage infrastructure.
To succeed in the data lake market, startups will need to focus on product depth, not breadth. Companies that focus on building the “best of breed” technology for a core use case will have a better shot at scaling. From a go-to-market perspective, successful startups often sell solutions to specifically address tangible, real-world pain points (as opposed to broad technology solutions). Successful companies aim to secure small, paid pilots with real business impact and leverage those successful implementations as opportunities for future expansion.
Finally, open sourcing the underlying technology can be a highly successful way to gain both broad, bottom-up distribution and mind share. Open sourcing also gives startups another opportunity for differentiation: It enables them to sell a multi-cloud and hybrid-cloud story. This appeals to chief data officers who are increasingly looking to standardize on open formats to give them the flexibility to leave the “walled gardens” of one cloud data platform for another.
TechCrunch: What impact do the big cloud vendors have on the data lake market with their offerings?
Derek: The big cloud vendors are developing and selling end-to-end ecosystems around their data lake offerings. Today they primarily offer the underlying object storage infrastructure on top of which data lakes are built, as well as data integration tools to move data into the data lake. They also offer a variety of data services, such as data science notebooks and federated SQL query engines, in order to make data stored in the data lake accessible for data consumers. Through their comprehensive platform approach, cloud players offer “one-stop shops” for all the data services enterprises need, often at compelling price points, since they’re able to bundle data services with core infrastructure spend.
Additionally, their services integrate seamlessly with other technologies in their cloud portfolios, providing what can be a highly compelling value proposition to many enterprises. As a result, the cloud vendors have had a tremendous impact on the data lake market. In order to compete with the cloud players, data lake companies must drive real product innovation and differentiation. In some categories, this could take the form of a verticalized solution that better addresses the pain points of particular industries and buyers.
TechCrunch: Beyond data lakes, there are lots of adjacent services for data governance, preparation, management, and getting data in and out of the data lake. What kind of startup opportunity do you see in these adjacent markets?
Derek: There is tremendous opportunity for startups solving data governance and management challenges, whether or not they deal with data lakes. We’ve found through extensive conversations with chief data officers that their largest pain points center around data quality, data governance and time-to-insight.
In the area of data quality, one in three business users and consumers of data surveyed didn’t trust the data their teams were using. This is due to widespread inconsistency in data usage and terminology across teams, as well as a lack of a single source of truth. The lack of standardization creates inconsistent and inaccurate data models. These in turn lead to inconsistent and inaccurate model outputs that lead to imperfect business decisions that ultimately hurt the bottom line. Startups that help enterprises solve data quality challenges will be uniquely positioned to capture both interest and budget from investors and potential customers alike.
With regard to data governance, regulatory pressures from increasingly stringent legislation (e.g., GDPR and CCPA) have made the protection and privacy of consumer data a top priority within most global enterprises. Despite the prioritization, 80% of the chief data officers we surveyed do not consider themselves to be sufficiently GDPR compliant today. In fact, many enterprises lack the systems necessary to answer even the most basic questions of their data: what data they have, where it came from, where it has been, what it is being used for and how it may be impacted in the event of a data breach. Startups that help enterprises better understand their data history and potential impact are becoming increasingly mission-critical in most corporate boardrooms.
Finally, with regard to time-to-insight, most enterprises are experiencing a massive proliferation of data coupled with significant fragmentation of that data across silos. These challenges make it difficult for consumers of data such as data analysts and data scientists to fully utilize the data they have and to drive it toward actionable insights. In fact, the business analysts we have surveyed spend on average 50% of their time simply looking for the right data for their analyses. We believe there’s a tremendous market opportunity for startups that will significantly reduce time-to-insight for enterprises; the winning startups will achieve this by democratizing self-serve access to data for all users, augmenting existing business intelligence workflows using no-code and NLP, and automatically surfacing insights to business users.
Special thanks to Mo Jomaa