Introduction
Most AI systems that work brilliantly in demos fail in production. For text-to-SQL, the culprit is almost never the model; it is context. Tables get renamed. Business definitions drift. A “revenue” column can mean something different to Finance than it does to Sales. Without the right context, the model has no way to know.
This is the first post in our engineering blog series, where we share the mechanics behind building accurate, enterprise-grade data analysis agents—the kind companies like Cisco, ConocoPhillips, ARM, and others use in production. We’re starting with the piece we’re most proud of: the Adaptive Context Engine (ACE).
In this post, we cover:
The mechanics of automatically building and managing context
How we evaluate the learning loop
Why BIRD-SQL is not a great benchmark for enterprise analytics
Results and examples from our evaluation on LiveSQLBench
Mechanics of Automated Learning
Text-to-SQL is a “solved problem” on benchmarks. When a benchmark score falls short of 99% accuracy, the reason is often missing context or errors in the benchmark itself. In practice, however, most deployments get stuck at “good enough for demos” because they treat context as a one-time setup problem. It isn’t. Context drifts.
What “revenue” means to your Finance team can differ from what it means to Sales. A metric that is straightforward in one business unit may be calculated differently in another. Data also has its own nuances: it may be incomplete for certain cuts and inflated for others. New use cases emerge with their own schemas, terminology, and domain rules—and no two deployments look exactly alike. Context is not static; it varies across teams, use cases, and time. Anthropic found the same while building their internal data analysis stack. They say: “We watched our offline accuracy drift from ~95% at launch to ~65% over a month before we treated this as an engineering problem”.
We built the Adaptive Context Engine to solve this context challenge. Rather than locking in a fixed snapshot, ACE treats context as a first-class engineering system—one that adapts and evolves alongside your data, users, and business logic. It bootstraps from what you already have, learns continuously from every interaction, and stays coherent as complexity grows. Production systems are so dynamic and complex that managing this process manually is impossible, which is a leading reason many AI analytics solutions fail.
To treat context as critical infrastructure that is always up to date, there are three key problems to solve: bootstrapping from existing fragmented sources, continuously learning from user interactions, and keeping context conflict-free as the context and the AI analytics system evolve. We discuss each below.
Bootstrapping
On day one, WisdomAI generates context automatically from three sources: your database schema, crawled data samples, and warehouse query logs. Each context fragment carries a confidence score. Admins can auto-accept high-confidence context or manually review edge cases. Users can also seed the system directly with sources such as:
Code and data pipelines, such as dbt, LookML, and related tools
Wikis or knowledge base documents
MCP sources, such as SaaS apps and web pages
The key idea behind the bootstrapping agent is to measure the impact of each extracted piece of context on the domain evaluation scores. With each additional piece of context, the scores go up or remain unchanged. The domain evaluation scores are based on ground-truth responses and LLM-based evaluations against user chat sessions.
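One way to picture the bootstrapping loop is as a greedy filter over candidate fragments. The sketch below is illustrative only and is not WisdomAI's implementation: `ContextFragment`, `bootstrap_context`, `toy_evaluate`, and the 0.8 auto-accept threshold are all hypothetical names and values, and the evaluation function is a stand-in for running the real domain evals.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class ContextFragment:
    text: str          # e.g. a column description or a metric definition
    confidence: float  # extraction confidence in [0, 1] (hypothetical scale)


def bootstrap_context(
    candidates: Sequence[ContextFragment],
    evaluate: Callable[[List[ContextFragment]], float],
    auto_accept_threshold: float = 0.8,  # assumed value, tune per deployment
) -> Tuple[List[ContextFragment], List[ContextFragment]]:
    """Keep fragments that do not lower the evaluation score.

    Returns (accepted, needs_review). Fragments that pass the score check but
    fall below the confidence threshold are routed to manual review.
    """
    accepted: List[ContextFragment] = []
    needs_review: List[ContextFragment] = []
    best_score = evaluate(accepted)
    # Try the most confident fragments first.
    for fragment in sorted(candidates, key=lambda f: f.confidence, reverse=True):
        score = evaluate(accepted + [fragment])
        if score < best_score:
            continue  # the fragment hurt the evals, so drop it
        if fragment.confidence >= auto_accept_threshold:
            accepted.append(fragment)
            best_score = score
        else:
            needs_review.append(fragment)
    return accepted, needs_review


def toy_evaluate(context: List[ContextFragment]) -> float:
    """Hypothetical stand-in for the domain evaluation suite."""
    helpful = {"revenue = net of refunds", "region maps to sales territory"}
    harmful = {"revenue = gross bookings"}
    texts = {f.text for f in context}
    return len(texts & helpful) - len(texts & harmful)


candidates = [
    ContextFragment("revenue = gross bookings", 0.90),
    ContextFragment("revenue = net of refunds", 0.95),
    ContextFragment("region maps to sales territory", 0.50),
]
accepted, needs_review = bootstrap_context(candidates, toy_evaluate)
```

In this toy run the conflicting "gross bookings" definition is dropped because it lowers the score, while the low-confidence region mapping passes the score check but is held for admin review.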
Continuous Learning and Context Health
The Adaptive Context Engine gets smarter with every query. When a user provides feedback—a thumbs up, a thumbs down, or a free-text correction—WisdomAI decides what to do based on the signal type.
Well-formed feedback, such as a thumbs up or explicit knowledge like “the region field maps to our sales territories,” is remembered as-is.
Ambiguous feedback, such as “the answer should be in the range of 1M–2M,” triggers state-space exploration: the system generates multiple candidate SQL queries offline, evaluates each against the feedback signal, and stores the one it is most confident in.
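The exploration step for ambiguous feedback can be sketched as running each candidate query and keeping those whose result is consistent with the stated range. This is a minimal sketch under stated assumptions: the `orders` table, the candidate queries, and the helper names are hypothetical, SQLite stands in for the warehouse, and the confidence ranking among surviving candidates is not modeled.

```python
import sqlite3
from typing import Dict, List, Optional, Tuple


def scalar_result(conn: sqlite3.Connection, sql: str) -> Optional[float]:
    """Run a query expected to return one numeric value; None if it fails."""
    try:
        row = conn.execute(sql).fetchone()
        if row is None or row[0] is None:
            return None
        return float(row[0])
    except (sqlite3.Error, TypeError, ValueError):
        return None


def candidates_matching_range(
    conn: sqlite3.Connection,
    candidates: Dict[str, str],
    low: float,
    high: float,
) -> List[Tuple[str, float]]:
    """Return (name, value) for every candidate whose result lies in [low, high]."""
    matches = []
    for name, sql in candidates.items():
        value = scalar_result(conn, sql)
        if value is not None and low <= value <= high:
            matches.append((name, value))
    return matches


# Toy data: three orders, one of them refunded.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER, amount REAL, status TEXT);
INSERT INTO orders VALUES
    (1, 900000, 'paid'),
    (2, 600000, 'paid'),
    (3, 1000000, 'refunded');
""")

candidate_sql = {
    "gross": "SELECT SUM(amount) FROM orders",
    "net_of_refunds": "SELECT SUM(amount) FROM orders WHERE status = 'paid'",
    "bad_column": "SELECT SUM(total) FROM orders",  # fails: no such column
}

# Feedback: "the answer should be in the range of 1M-2M".
matches = candidates_matching_range(conn, candidate_sql, 1_000_000, 2_000_000)
```

Only the net-of-refunds query survives the range check here. When several candidates survive, some further ranking is needed before one is stored.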
Critically, every new context fragment is cross-checked against existing fragments before it is stored. That leads to the next problem we had to solve.
Conflict Resolution
“Revenue” might be defined two different ways across the system. For the same natural-language filter, different columns might be used across different queries. In many cases, the issue is not simply what is right versus what is wrong; it is about choosing one consistent definition and applying it reliably. Without that consistency, you can get a different number for the same question every time you ask it, which is a cardinal sin in data analytics.
In our experience, these conflicts are the biggest source of unexplained trust erosion in production AI analytics. WisdomAI automatically detects these conflicts and surfaces suggested fixes for admin review. We’ll publish a separate deep dive on conflict detection because it is a problem worth its own post.
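As a rough illustration of the simplest kind of conflict, the sketch below groups stored definitions by term and flags any term that has more than one distinct definition. It is a hypothetical toy, not the detection logic used in production: `find_conflicts` and `normalize` are made-up names, and real detection would need semantic comparison rather than string normalization.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (crude; unsafe for SQL string literals)."""
    return " ".join(text.lower().split())


def find_conflicts(fragments: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Given (term, definition) pairs, return terms with 2+ distinct definitions."""
    definitions: Dict[str, List[str]] = defaultdict(list)
    for term, definition in fragments:
        key = normalize(term)
        norm_def = normalize(definition)
        if norm_def not in definitions[key]:
            definitions[key].append(norm_def)
    return {term: defs for term, defs in definitions.items() if len(defs) > 1}


fragments = [
    ("Revenue", "SUM(amount)"),
    ("revenue", "SUM(amount) - SUM(refunds)"),
    ("Region", "sales_territory"),
    ("region", "Sales_Territory"),  # same definition, different casing
]
conflicts = find_conflicts(fragments)
```

Here only "revenue" is flagged, with both competing definitions attached so a reviewer can pick one.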
Benchmarking the Context Engine
Limitations of BIRD-SQL
Most text-to-SQL papers use BIRD-SQL as the benchmark. We respect what BIRD has accomplished, but it does not capture what matters for enterprise analytics. There are three specific reasons:
Ground-truth SQL errors. The dataset contains incorrect reference SQL, as documented in arxiv.org/pdf/2402.12243, which means a correct answer can be scored as wrong.
Oversimplified queries. The SQL is relatively straightforward. Real enterprise BI workloads involve domain-specific metrics, nested aggregations, and business logic that is not present in the schema.
Context is handed to the model. BIRD questions include explicit knowledge context in the question itself. This sidesteps the exact problem we are solving: the knowledge needed to answer a user’s question has to be learned and is not always readily available.
Our evaluations report 80–92% accuracy on the BIRD dev set with the context learning loop active, but BIRD is not the right benchmark for this problem. Most of our evaluation is done through internal evals built in-house using a combination of LLM- and human-validated ground truth, a more rigorous bar that better reflects real enterprise workloads. We’ll cover those evaluations in a separate post.
For this blog post, we use the open benchmark livesqlbench-base-full-v1 (a newer benchmark from the BIRD team), filtered to query-only cases for a few datasets. For context on the difficulty: the current top performer on the leaderboard sits at 48% accuracy (note that this includes all types of cases, not just query-only cases). That gap is what WisdomAI’s Adaptive Context Engine is built to close.
Evaluating Learning Accuracy
We evaluate accuracy in three phases to isolate the impact of each part of the context-building process:
Phase 1 — Baseline accuracy. The schema is connected with no additional context. Even here, WisdomAI’s agent extracts nuances from data sampling that lift accuracy above a pure schema-only baseline.
Phase 2 — After Bootstrap accuracy. Accuracy is measured after uploading the knowledge base and column-description files. This captures the lift from adding existing context to the domain.
Phase 3 — After Learning accuracy. SQL is learned from evaluation queries, with the ground-truth SQL hidden from the learner. For each query, one-shot feedback is provided. Feedback is generated from execution output alone by an LLM, and SQL is kept completely hidden. In production, this type of feedback comes from:
Users describing the response they expect (e.g., expected values or a range of values)
Exported data from existing dashboards and reports
These steps mirror real-world onboarding: there is rarely a clean source of ground-truth SQL for every question at the start, just tribal knowledge, usage patterns, and a pile of dashboards. The main downside of this approach is that LLM-generated feedback might not fully reflect real-world user feedback, but it gives us a scalable way to evaluate the learning loop.
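To make the one-shot feedback concrete, the sketch below derives a hint from the reference execution output alone, without ever seeing the SQL. It is a rule-based stand-in for the LLM described above, so the function name `feedback_from_output`, the 25% spread, and the exact wording are assumptions for illustration.

```python
from typing import Sequence


def feedback_from_output(rows: Sequence[Sequence[object]], spread: float = 0.25) -> str:
    """Describe the expected output without revealing the query behind it.

    A single numeric cell becomes a range hint; anything else becomes a
    row-count hint. The spread is an assumed value.
    """
    is_single_number = (
        len(rows) == 1
        and len(rows[0]) == 1
        and isinstance(rows[0][0], (int, float))
        and not isinstance(rows[0][0], bool)
    )
    if is_single_number:
        value = float(rows[0][0])
        low, high = sorted((value * (1 - spread), value * (1 + spread)))
        return f"Answer is in the range of {low:,.0f} to {high:,.0f}."
    return f"Expect to see {len(rows)} rows in the output."


range_hint = feedback_from_output([(2000,)])
count_hint = feedback_from_output([("plant", 1.0)] * 336)
```

The learner receives only the hint string, which mirrors how a user might say "that number looks too high" without being able to write the query.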
Note that this evaluation covers SQL accuracy only. It does not assess visualization, clarification quality, or conversational experience, which are topics for separate blog posts. For SQL accuracy specifically, we use a combination of heuristics over data matches, accounting for factors such as extra columns and row ordering, and LLM-as-judge.
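A data-match heuristic of this kind can be sketched as follows: two result sets match if some selection of the actual columns reproduces the expected rows, optionally ignoring row order. This is a simplified assumption of how such a check might look, not the scoring code used for the results here. `results_match` is a hypothetical name, float tolerance is not handled, and the LLM-as-judge step is out of scope.

```python
from collections import Counter
from itertools import permutations
from typing import Sequence


def results_match(
    expected: Sequence[Sequence[object]],
    actual: Sequence[Sequence[object]],
    ordered: bool = False,
) -> bool:
    """True if the actual rows contain the expected data.

    Extra columns in `actual` are tolerated; row order is ignored unless
    `ordered` is True. Brute force over column choices, so only suitable
    for results with a handful of columns.
    """
    if len(expected) != len(actual):
        return False
    if not expected:
        return True
    expected_rows = [tuple(row) for row in expected]
    n_expected, n_actual = len(expected_rows[0]), len(actual[0])
    if n_actual < n_expected:
        return False
    # Try every way of picking and ordering n_expected of the actual columns.
    for cols in permutations(range(n_actual), n_expected):
        projected = [tuple(row[c] for c in cols) for row in actual]
        if ordered:
            if projected == expected_rows:
                return True
        elif Counter(projected) == Counter(expected_rows):
            return True
    return False


expected = [("A", 10), ("B", 20)]
actual = [(2, "B", 20), (1, "A", 10)]  # extra id column, different row order

unordered_ok = results_match(expected, actual)
ordered_ok = results_match(expected, actual, ordered=True)
```

Comparing rows as multisets (via `Counter`) rather than sets keeps duplicate rows significant, which matters for queries without DISTINCT.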
Results
We present the accuracy at the three phases for 5 datasets from livesqlbench-base-full-v1 here, along with the aggregate accuracy across all 5. We also report our estimate of the ground-truth error rate across those datasets. The learning loop has a large impact: accuracy improves from 20% to 50% after adding the knowledge base files, and from 50% to 85% through context learning alone. No model changes. No SQL tuning. No manual context building. On top of this, human review of suggested context and reviewed queries typically pushes this number above 95% in production.

*Generated by WisdomAI
How Learning Produces a Major Accuracy Boost
Example 1 — solar_panel_10: Wrong Row Selection Logic
“What’s the actual power each plant is sending to the grid right now? List plant names and their latest effective power output, most powerful first.”
Baseline returned 352 rows; the correct answer is 336 rows.
The original query selected the latest record among rows with non-null power values. That logic was reasonable, but wrong. The correct approach selects the absolute latest snapshot per plant using the composite key (snapts, snapkey), regardless of whether the power value is null.
Feedback generated: “Expect to see 336 rows in the output.”
What was learned: “Latest” in this schema requires composite-key ordering, not null filtering. The context now encodes this rule explicitly, so any future query involving “latest” per plant will apply it.
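The difference between the two selection rules can be reproduced on toy data. In the sketch below only `snapts` and `snapkey` come from the example; the table name `plant_snapshot`, the other columns, and the rows are hypothetical, and it illustrates the rule rather than the benchmark's actual SQL or row counts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE plant_snapshot (
    plant_name TEXT, snapts TEXT, snapkey INTEGER, power_kw REAL
);
INSERT INTO plant_snapshot VALUES
    ('Alpha', '2024-01-01T10:00', 1, 480.0),
    ('Alpha', '2024-01-01T11:00', 1, 450.0),
    ('Alpha', '2024-01-01T11:00', 2, NULL),   -- true latest row, power missing
    ('Beta',  '2024-01-01T11:00', 1, 300.0);
""")

# Baseline logic: latest row among those with a non-null power value.
BASELINE_SQL = """
SELECT s.plant_name, s.power_kw
FROM plant_snapshot s
WHERE s.power_kw IS NOT NULL
  AND NOT EXISTS (
    SELECT 1 FROM plant_snapshot n
    WHERE n.plant_name = s.plant_name
      AND n.power_kw IS NOT NULL
      AND (n.snapts > s.snapts
           OR (n.snapts = s.snapts AND n.snapkey > s.snapkey))
  )
ORDER BY s.power_kw DESC
"""

# Learned rule: absolute latest row per plant by (snapts, snapkey),
# whether or not the power value is null.
LATEST_SQL = """
SELECT s.plant_name, s.power_kw
FROM plant_snapshot s
WHERE NOT EXISTS (
    SELECT 1 FROM plant_snapshot n
    WHERE n.plant_name = s.plant_name
      AND (n.snapts > s.snapts
           OR (n.snapts = s.snapts AND n.snapkey > s.snapkey))
  )
ORDER BY s.power_kw DESC
"""

baseline_rows = conn.execute(BASELINE_SQL).fetchall()
latest_rows = conn.execute(LATEST_SQL).fetchall()
```

For plant Alpha, the baseline reports a stale 450.0 reading, while the composite-key rule returns the true latest snapshot, whose power value is null. SQLite sorts nulls last under `ORDER BY ... DESC`, so Alpha moves to the bottom of the corrected result.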
Example 2 — solar_panel_8: Business Logic Hidden from the Schema
“When a plant with HJT panels breaks, what’s the average cost to fix it? Assume two years of operation and a valid, positive MTBF record.”
Baseline output: $4,850. Correct answer: $1,541.
The model averaged raw maintcost values. The correct formula is total maintenance cost divided by the expected number of failures.
Feedback generated: “Answer is in the range of $1,500–$2,000.”
What was learned: The system explored multiple candidate SQL queries, converged on the correct formula, and stored this business rule as context. Future maintenance-cost queries will apply it automatically.
Note: Stored context includes the full SQL, which is omitted here to avoid leaking data into LLM training pipelines.
Examples of Dataset Quality Issues
Two queries in the solar_panel dataset likely had bad ground-truth SQL, which is a reminder that benchmark quality matters when setting expectations for target accuracy:
solar_panel_3: The temperature-corrected power calculation returns −161.87 watts, which is physically impossible.
solar_panel_17: The repair-cost query for machines with more than one day of downtime returns an empty result, which is almost certainly wrong.
What’s Next
This is the first post in our engineering series. Coming soon are more blog posts on learnings from our work on proactive agents, context extraction agents, latency optimization, fine-grained security and privacy controls, and more.
If you’re building enterprise data products and wrestling with context quality in production, we’d love to talk. If you’re interested in working on challenging agentic AI and context extraction problems, we are actively hiring.




