Infra

Your Data Agents Need Context

Jason Cui and Jennifer Li Posted March 10, 2026

Your Data Agents Need Context Table of Contents

The context on context

Recently in the world of data and AI agents, context layers and graphs have emerged as an interesting topic of discussion. In fact, it’s difficult today to have a conversation with an organization working with data and AI and not have the topic of context come up.

And for good reason. Over the past year, the market has realized that data and analytics agents are essentially useless without the right context – they aren’t able to tease apart vague questions, decipher business definitions, and reason across disparate data effectively.

It’s not their fault, of course. The modern data stack has undergone a decade+ transition from disparate data sources to consolidated data and cleaned definitions (which is good), but even then the consolidation is never perfect and a lot of messiness is introduced. The general market evolution has been as follows:

  1. The rise of the modern data stack – We’ve covered the transformational rise of the modern data stack with Tristan Handy from dbt in the past and in our own reference architectures. The general evolution over the past decade has transformed data architectures across ingestion, transformation, warehousing, and storage to centralize data and make it quickly and easily accessible. The idea is then with cleanly organized data, teams could simply write SQL to derive data from their data warehouses, power charts/dashboards, and enable business intelligence across an entire organization.
  2. The agent frenzy – In 2024 moving into 2025 as LLM capabilities increased, essentially every single organization wanted to build and deploy agents on top of their existing data stacks. We’ve previously discussed how we define agents, but from an organizational point of view the natural allure of more work being done with greater efficiency and less time naturally led to the gravitation towards agentic workflows. Companies attempted to build “chat with your data” chatbots, support agents, etc. The frenzy was both bottoms-up and tops-down – developers wanted to make use of the newest, shiniest LLM capability and leadership applied AI adoption pressure to increase automation and reduce costs.
  3. Hitting the wall – Despite the initial optimism, it quickly became clear that most of these efforts failed. Organizations tried to deploy their agents but ran into a wall. MIT famously published their “State of AI in Business 2025” report which stated that with AI deployments, “most fail due to brittle workflows, lack of contextual learning, and misalignment with day-to-day operations.”

A crucial reason agents didn’t work well was a lack of proper data context. Enterprise data today is still incredibly disparate and messy – and because of that, data agents struggled to answer basic questions like “what was revenue growth last quarter?” across various data architectures amassing structured and unstructured data.

Just like how the vision of fully self-serve analytics of years ago fell short, the vision for data agents seemed to fall short as well.

The context problem – beyond just text to SQL

So why did that initial era of agent deployments struggle? At first, many believed the issue to be a fundamental data reasoning and SQL/Python codegen gap on the model side. The general idea was that a model should be able to take in a natural language query as an initial input, reason over existing data systems, and generate corresponding SQL code in traditional business intelligence (BI) fashion to pull the right data and answer the initial question accordingly.

If the model failed or wasn’t accurate, the gap was chalked up to the model not being good at SQL with the expectation that model performance would improve.

This is not entirely false. While model capabilities have improved dramatically for use cases like codegen and mathematical reasoning, they still lag behind on the data side (as evidenced through SQL benchmarks like Spider 2.0 and Bird Bench). There have undoubtedly been big leaps in models’ abilities, but we’ve quickly learned that the problem extends beyond just text to SQL.

To crystallize things and break down the revenue growth example a bit more:

  1. Let’s say a data agent is constructed internally within an organization. It’s built to leverage modern foundational models, connected to all the right data sources, and is hooked up to a nice UI to field data questions from internal users.
  2. Our query comes in. “What was revenue growth last quarter?” A deceptively simple question that is typically easily answered by a quick glance at a Looker or Tableau dashboard – shouldn’t be a challenge for an advanced, intelligent agent!
  3. Challenge #1 – how does the agent know how revenue or quarters are actually defined? Revenue is actually a business definition that isn’t hard-coded into a warehouse or pipeline. Is the user looking for run rate revenue or ARR? Fiscal quarter reporting may not be normalized for all organizations and could be a completely different three month period depending on the company you ask. What’s the right time window to look at?
  4. Luckily the head of the data platform steps in and says – “we’ve built semantic layers to solve this exact problem. We capture our revenue definition there.” And – the agent should be able to ingest all semantic layers as context. A promising potential solution, but the team takes a look at a couple of YAML files and realizes that they were updated by a data team member that left last year, no longer used by BI tools, and also doesn’t include the two new product lines that have launched since then. The agent has no idea how revenue is actually defined today.
  5. To overcome this blocker, a team member hard codes the exact revenue and timeframe definitions. The data agent continues chugging along but quickly runs into challenge #2 – where are the right data sources? Which ones are the right sources of truth? Raw data is split across multiple tables and warehouses. The finance team uses the fct_revenue table which could be correct, but the data team has materialized views created like mv_revenue_monthly and mv_customer_mrr.

It’s clear that data agents need access to a repository of up-to-date business definitions and data sources to overcome these problems.

Enter the context layer

The crux of the problem at hand is that the agent isn’t given the proper business context to answer even the most basic questions. This is representative of a larger gap that’s present in building automated AI systems within organizations – there needs to be up-to-date and maintained context that not only understands how an enterprise works and how the data systems are structured, but also maintains the tribal knowledge to tie everything together.

This has led to the evolution of the context layer. Many names have emerged in discussion today – context OS, context engine, contextual data layer, ontology, and more – but the underlying concept remains the same. Tie together all of an enterprise’s messy data, add a contextual layer on top that helps agents understand business logic, and package it such that the context can be supplied to agents.

Context management déjà vu

But wait. Doesn’t what we’re describing sound eerily similar to a semantic layer…? There are indeed some similarities, but at the end of the day if agent workflows are to become truly autonomous they need a bit more than how semantic layers currently manifest.

A traditional semantic layer in the context of BI is great for specific metric definitions (like revenue, churn, ARPU). However, they are usually hand constructed by data teams using very specific syntax through a dedicated layer like LookML and are connected directly to a BI tool like Looker.

A modern data context layer should essentially become a superset of what a semantic layer would traditionally cover. Sure, specific metric definitions can be hard-coded, but a modern context layer should include more to ensure agent autonomy – canonical entities, identity resolution, specific instructions to dissect tribal knowledge, proper governance guidance, and more.

This piece will primarily focus on data context that ties together traditional systems of record. An equally important and overlapping opportunity is also capturing an organization’s decisions and workflow logic so truly multipurpose agents can be built that are properly grounded in all of an organization’s data and decisioning context.

Putting it all together

Based on our recent conversations with customers and understanding their needs, here’s what we believe a modern context layer paired with an agentic data system could look like, broken down step-by-step:

1) Accessing the right data – the first order of business is ensuring all the right data is accessible. This is table stakes. Ideally an organization would be implementing some form of the modern data stack with some degree of unification through a lakehouse architecture. Even then, we’d want to ensure the agent has access to all the data it needs, which may extend beyond just what’s available in warehouses and operational apps. This includes tribal knowledge captured in internal systems, GDrive/Slack, etc.

2) Automated context construction – once all the right data is accessible, the next step is starting the construction of the context layer. The benefit of using LLMs is that a lot of the initial context gathering can be done in an automated way. An emphasis of focus should be on high signal context – for example, looking through past query history can be high signal in determining the most referenced tables and most common joins, and data modeling solutions like dbt or LookML can provide clear definitions for business metrics.

3) Human refinement – Automated context construction may be able to form a large portion of the context corpus, but it can’t create the full picture. It’s tempting to set agents loose and have them collect all internal knowledge, but some of the most important context is implicit, conditional, and historically contingent, and only exists as tribal knowledge inside teams.

Human input provides the final crucial links that enable true agent automation. For example – “for CRM data, look at Affinity for all new USCAN deals from 2025 onwards but Salesforce for all global leads before that.”

In this way the context layer can become a multi-dimensional corpus where code lives alongside natural language, capturing any context an agent might need. Just like developers can set up .cursorrules files to guide agents and control output behavior, data practitioners can maintain rules and guidelines.

4) Agent connection – once the context layer is properly constructed, it just needs to be exposed to agents and accessible real-time. This can typically be done through API or MCP.

5) Self-updating context flows – While the initial system has been set up correctly, data systems are never static and as a result the context layer shouldn’t be either. Data sources and formats can change upstream and individuals may have custom instructions they’ll want to add and modify based on changing business requirements. In the case a data agent provides incorrect data and requires accuracy refinement, that should of course be incorporated back into the context layer. In this way the context layer becomes a living and constantly evolving corpus.

One takeaway from this whole exercise is that building a proper data agent is no small feat. It’s a blend of technical challenges related to data infrastructure and engineering with human operational challenges related to tribal knowledge collection.

The OpenAI team recently published a fantastic piece detailing the creation of their own internal data agent. It’s a transparent detail of a very detailed and elegant implementation – but points to the long journey required to get there. Similarly, Palantir has a long history of constructing ontologies for organizations that provide clear context from messy data, and have built a big business doing so.

Market direction

Naturally, that opens up a window for an external solution. Realistically not every enterprise can (or should) build this in house, and we’re starting to see various solutions coming to market already.

While we believe we’re still early in seeing solutions come to fruition, the below is a high level market map of solutions being built:

To break the categories down:

Data gravity platforms – platforms like Databricks and Snowflake have already gone through the motions of data ingestion, transformation, and storage and data gravity is powerful. They are already working on AI data analyst products like Databricks Genie and Snowflake Cortex Analyst which are built on top of data warehouses and utilize foundational models for text to SQL to allow users to ask questions about their data in natural language.

While these platforms don’t have super sophisticated context layer functionalities per se, they do allow for lightweight semantic modeling and there’s a feasible path forward through either company acquisition or in-house development for context layers to be brought in on platform.

Existing “AI data analyst companies” – there’s a wave of companies that have already emerged utilizing AI to allow customers to chat with your data. Many have realized through time in market that the key to effective data agents is actually building the relevant context layer. As a result, some have evolved to encompass data context construction as a key part of their products.

New, dedicated context layer companies – a new category of company has emerged that is building context layers from the ground up. They will have to go through our journey above of ingesting data, collecting tribal knowledge, and more – and will have to do so for each individual customer they work with.

Looking forward

We are at an interesting point in time of market development, where the problem of a lack of context has become apparent, but we are still in the early innings of building solutions.

The future is exciting – perhaps the vision of truly self-serve analytics can be fully realized, and BI, data analytics, and data science can be transformed through AI.

Naturally, many open questions still remain. Where will this context layer live? Can it live in multiple places? Will it be its own standalone product?

We remain excited about the opportunities for innovation here. If you’re building in the space, we’d love to chat! Please reach out at jcui@a16z.com or @jasonscui on X.

Want More a16z Infra?

Analysis and news covering the latest trends reshaping AI and infrastructure.

Learn More

Want More Infra?

Analysis and news covering the latest trends reshaping AI and infrastructure.

Sign Up On Substack

Views expressed in “posts” (including podcasts, videos, and social media) are those of the individual a16z personnel quoted therein and are not the views of a16z Capital Management, L.L.C. (“a16z”) or its respective affiliates. a16z Capital Management is an investment adviser registered with the Securities and Exchange Commission. Registration as an investment adviser does not imply any special skill or training. The posts are not directed to any investors or potential investors, and do not constitute an offer to sell — or a solicitation of an offer to buy — any securities, and may not be used or relied upon in evaluating the merits of any investment.

The contents in here — and available on any associated distribution platforms and any public a16z online social media accounts, platforms, and sites (collectively, “content distribution outlets”) — should not be construed as or relied upon in any manner as investment, legal, tax, or other advice. You should consult your own advisers as to legal, business, tax, and other related matters concerning any investment. Any projections, estimates, forecasts, targets, prospects and/or opinions expressed in these materials are subject to change without notice and may differ or be contrary to opinions expressed by others. Any charts provided here or on a16z content distribution outlets are for informational purposes only, and should not be relied upon when making any investment decision. Certain information contained in here has been obtained from third-party sources, including from portfolio companies of funds managed by a16z. While taken from sources believed to be reliable, a16z has not independently verified such information and makes no representations about the enduring accuracy of the information or its appropriateness for a given situation. In addition, posts may include third-party advertisements; a16z has not reviewed such advertisements and does not endorse any advertising content contained therein. All content speaks only as of the date indicated.

Under no circumstances should any posts or other information provided on this website — or on associated content distribution outlets — be construed as an offer soliciting the purchase or sale of any security or interest in any pooled investment vehicle sponsored, discussed, or mentioned by a16z personnel. Nor should it be construed as an offer to provide investment advisory services; an offer to invest in an a16z-managed pooled investment vehicle will be made separately and only by means of the confidential offering documents of the specific pooled investment vehicles — which should be read in their entirety, and only to those who, among other requirements, meet certain qualifications under federal securities laws. Such investors, defined as accredited investors and qualified purchasers, are generally deemed capable of evaluating the merits and risks of prospective investments and financial matters.

There can be no assurances that a16z’s investment objectives will be achieved or investment strategies will be successful. Any investment in a vehicle managed by a16z involves a high degree of risk including the risk that the entire amount invested is lost. Any investments or portfolio companies mentioned, referred to, or described are not representative of all investments in vehicles managed by a16z and there can be no assurance that the investments will be profitable or that other investments made in the future will have similar characteristics or results. A list of investments made by funds managed by a16z is available here: https://a16z.com/investments/. Past results of a16z’s investments, pooled investment vehicles, or investment strategies are not necessarily indicative of future results. Excluded from this list are investments (and certain publicly traded cryptocurrencies/ digital assets) for which the issuer has not provided permission for a16z to disclose publicly. As for its investments in any cryptocurrency or token project, a16z is acting in its own financial interest, not necessarily in the interests of other token holders. a16z has no special role in any of these projects or power over their management. a16z does not undertake to continue to have any involvement in these projects other than as an investor and token holder, and other token holders should not expect that it will or rely on it to have any particular involvement.

With respect to funds managed by a16z that are registered in Japan, a16z will provide to any member of the Japanese public a copy of such documents as are required to be made publicly available pursuant to Article 63 of the Financial Instruments and Exchange Act of Japan. Please contact compliance@a16z.com to request such documents.

For other site terms of use, please go here. Additional important information about a16z, including our Form ADV Part 2A Brochure, is available at the SEC’s website: http://www.adviserinfo.sec.gov.