Context Engineering for Analytics Engineers
I mined Redshift query logs into a join registry and anti-pattern library. AI agents stopped hallucinating column names and join paths.
AI in data analytics is a genuinely hard problem, and it shows: large segments of the analytics development lifecycle still have no spiffy new AI tools to automate them off our plates.
The allure of AI for enterprise data work is powerful; indeed, more CEOs every day ask their teams “Can’t we just give AI our data?” But in practice, it’s hard to realize. Coding SQL itself is straightforward; AI training data covers basically every language. For frontend code, every user gets handled by the same functions — the output is determined by what’s right there in the files. AI models, however, have never seen your database. They’ve never seen your company’s data models, naming conventions, or idiosyncrasies. There are a thousand little things to learn about a company’s data, and for AI to be useful here, we need to build tools that somehow learn those things — so they can actually take work off the analyst’s plate.
In the AI jargon, it’s a context engineering problem.
One Article Changed It All
While researching semantic layers, I found a MotherDuck blog post that helped me crack part of this challenge. “What If We Don’t Need the Semantic Layer?” asks: “Instead of a top-down, prescriptive model, what if we embraced a bottom-up, descriptive one?”
The author’s key insight: the semantic layer isn’t something we have to build from scratch. It already exists, implicitly, in our query logs. Query logs are the empirical record of how people actually use data, and that collective behavior represents the real “source of truth” in an organization.
A straightforward observation: “every query ever run against a database contains latent semantic information, just sitting there waiting to be used.”
Lightbulb moment. Like fumbling around in a dark garage at night, trying to find where my damn socket wrench is, and someone flicked the overhead light on.
Data Do’s and Don’ts
I was already familiar with Redshift’s system tables. I’d previously built a platform ops dashboard to track compute metrics — total rows joined, rows scanned, seconds of runtime, aggregates charted over time. A table sorted the most compute-hungry queries to the top so we could target the worst ones for optimization.
I shared the MotherDuck article with Claude Code and said: “Hey, read this. And check out this table from my platform ops report. Could we do something similar?”
Claude built a Python parser using sqlglot, ran it against our logs, and came back with several hundred canonical terms and a few dozen naming conflicts — all the ways fields get aliased in business-friendly terms, and cases where the same field had multiple aliases. “Amount” vs “transaction_amount” vs “txn_amt.” Which alias should be The One?
Good stuff, Claude! But my mind was still racing. I challenged Claude to come up with more ways to extract valuable information — not just for semantic layer development, but for building the dbt pipelines that must come first.
I asked Claude to search the web for every possible way to mine production query logs into helpful artifacts, and share the best ideas with me. Claude’s final proposal included two game-changing artifacts: a Join Registry and an Anti-Pattern Library.
The do’s and the don’ts.
The Join Registry (Do’s)
The parser runs four analysis passes over the query logs:
1. Parse — sqlglot turns raw SQL into traversable syntax trees
2. Extract Joins — every ON a.id = b.id becomes a cataloged relationship
3. Mine Aliases — every SELECT x AS friendly_name is a vote for a naming convention
4. Analyze Impact — correlate SQL patterns with execution time
The join registry is the artifact I’m most excited about: a complete directory of every table join that’s actually been executed against the warehouse.
You know that familiar analyst question — “How do I link these tables?” Now it has an instant answer. No need to rack your brain or post in the company Slack channel and hope someone responds.
The registry shows the path between any two tables, ranked by how often it appears in the logs, so you can easily distinguish a go-to pattern from more obscure ones. Each join includes frequency, average execution time, and an orphan risk flag for LEFT JOINs.
Here’s what it looks like:
# join-registry.yml
joins:
  - left_table: fct_orders
    right_table: dim_customers
    join_keys:
      - "fct_orders.customer_id = dim_customers.customer_id"
    join_type: LEFT
    frequency: 342
    avg_execution_time_sec: 4.2
    additional_conditions:
      - "dim_customers.is_current = 1"
    orphan_risk: false
    notes: "Standard customer lookup — high frequency, fast execution"

# Bidirectional index — look up any table, see all its connections
table_index:
  fct_orders:
    joins_to:
      - table: dim_customers
        frequency: 342
      - table: dim_products
        frequency: 187
  dim_customers:
    joined_from:
      - table: fct_orders
        frequency: 342

Why does this matter? In my earlier AI testing days, Copilot agents would write SQL that looked great at first glance — then quickly turned out to contain fabricated field names. The join registry largely eliminates that failure mode — at least for tables and paths that exist in the logs. If you have existing workflows with dbt skills or Claude projects, you can require a check against the registry before any join gets written. The agent consults verified paths instead of guessing.
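The "check before you write" step can be as simple as a lookup over the parsed registry. A hypothetical helper, assuming the YAML above has already been loaded into a dict (e.g. with PyYAML's `safe_load`):

```python
def lookup_join(registry: dict, left: str, right: str):
    """Return the mined registry entry for a table pair, or None if unverified.

    `registry` is the parsed join-registry.yml. The pair is matched in
    either direction, since the lookup is symmetric.
    """
    for entry in registry.get("joins", []):
        if {entry["left_table"], entry["right_table"]} == {left, right}:
            return entry
    return None

# An agent workflow can refuse to emit a join when lookup_join(...) returns
# None, falling back to asking the user instead of guessing a path.
```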
The Anti-Pattern Library (Don’ts)
By correlating SQL patterns with execution times across hundreds of thousands of queries, we got clear performance data.
Anti-Pattern Performance Impact
Baseline for clean queries: 16.89 seconds average.
NOT IN (subquery) — 4.18x slower
OR in JOIN condition — 4.07x slower
Deep nesting (3+ levels) — 3.06x slower
SELECT * — 2.08x slower
NOT IN queries averaged 70.67 seconds. Not a style preference but a measurable 4x cost.
Each pattern now has an automatic substitution rule. The agent doesn’t just detect NOT IN; it rewrites to NOT EXISTS at generation time. Both artifacts can be consulted when writing new models and when auditing existing queries at any point after initial creation.
How It Changed Our Agents
Before this, agents regularly produced syntactically correct but semantically wrong queries — wrong column names, wrong join paths, slow patterns. In our first couple months of AI-assisted development, I’d estimate roughly 4 in 10 queries needed correction.
After integrating the artifacts into our workflows, queries need far fewer corrections. I can’t even remember the last time I had to resolve a column “does not exist” error.
The agents now check the join registry before writing joins and scan for anti-patterns before finalizing any query. The files themselves are lightweight — a few megabytes at most. They’re easily shared in a dbt repo’s docs folder.
An unexpected bonus: GitHub’s Copilot code review picked up the anti-pattern library on its own. It flagged a PR containing a NOT IN clause, suggested a NOT EXISTS replacement, and cited the anti-pattern library in the repo — without us ever configuring it to do so.
Every data warehouse has query logs. Your company’s logs already contain a massive trove of collective data context — capturing things that years of documentation efforts never did. Not every query is an example to follow, necessarily; but the frequency ranking separates tribal knowledge from one-off experiments, and the anti-pattern library filters out the worst practices. The tools are straightforward: AST parsing plus frequency analysis. The hard part isn’t the technology. It’s realizing the gold mine is already sitting there.
This is the first in a series on context engineering for enterprise data — practical techniques I’ve found while building dbt pipelines with AI agents. Follow along and connect if you’re solving similar problems!


