fluffet 24 minutes ago [-]
It's kind of bespoke for me tbh.
For a co-pilot inside an app that could answer product questions, I looked at ~2000 support emails. I asked one LLM to dig out "How would you formulate the user's question into a chatbot-like question from this email thread?" and "What is the actual answer that should be in the response from this email thread?", then asked our bot that question, and had another LLM rate the answer as SUPERIOR | ACCEPTABLE | UNKNOWN etc. These labels proved to be a good "finger in the wind" indicator for altering the chunks, prompt changes, or model updates.
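A minimal sketch of that judge loop. The `ask_bot` and `judge` callables are placeholders for the real LLM calls, injected here so the harness runs offline:

```python
# Sketch of an LLM-as-judge eval loop over mined (question, answer) pairs.
# `ask_bot` and `judge` stand in for real LLM clients.
from collections import Counter

LABELS = ("SUPERIOR", "ACCEPTABLE", "UNKNOWN", "INFERIOR")

def run_eval(cases, ask_bot, judge):
    """cases: list of (question, reference_answer) mined from email threads."""
    counts = Counter()
    for question, reference in cases:
        answer = ask_bot(question)
        label = judge(question, reference, answer)
        if label not in LABELS:
            label = "UNKNOWN"  # guard against a chatty judge
        counts[label] += 1
    return counts

# Offline demo with stubbed LLMs:
cases = [("How do I reset my password?", "Use the 'Forgot password' link.")]
counts = run_eval(cases,
                  ask_bot=lambda q: "Click 'Forgot password'.",
                  judge=lambda q, ref, ans: "ACCEPTABLE")
print(dict(counts))  # {'ACCEPTABLE': 1}
```

The label distribution over a few hundred cases is the "finger in the wind": re-run it after a chunking or prompt change and watch whether SUPERIOR/ACCEPTABLE shifts.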
For an invoice processing app handling about 14M invoices/year, it was mostly fuzzy accuracy metrics against a pretty OK annotated dataset and iterating the prompt based on diffs for a long time. Once you had that dataset you could alter things and see what broke.
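One way a fuzzy accuracy metric like that can look, assuming a hypothetical per-field extraction format; the similarity threshold is a knob you'd tune per field:

```python
# Sketch of fuzzy field-level accuracy against an annotated dataset.
# Exact match is too strict for OCR-ish fields, so compare with a
# similarity ratio and a threshold.
from difflib import SequenceMatcher

def fuzzy_match(predicted, expected, threshold=0.9):
    return SequenceMatcher(None, str(predicted), str(expected)).ratio() >= threshold

def field_accuracy(predictions, annotations, field):
    hits = sum(fuzzy_match(p[field], a[field])
               for p, a in zip(predictions, annotations))
    return hits / len(annotations)

preds = [{"vendor": "ACME Corp.", "total": "120.50"},
         {"vendor": "Globex",     "total": "99.00"}]
gold  = [{"vendor": "ACME Corp",  "total": "120.50"},
         {"vendor": "Globex Inc", "total": "99.00"}]
print(field_accuracy(preds, gold, "vendor"))  # 0.5
print(field_accuracy(preds, gold, "total"))   # 1.0
```

Diffing these per-field numbers before and after a prompt change is the "see what broke" step.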
Currently, I work on an app with a pretty sophisticated prompt chain flow. Depending on bugs etc. we mostly test against _behaviour_, like intent recognition or the correct SQL filters. As long as the baseline behaves correctly, whatever model is powering it is not so important. For the final output, it's humans. But we know immediately if some model or prompt change broke some particular intent.
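A behaviour-level test like that can be sketched as assertions on the pipeline's structured intermediates rather than its prose. `run_pipeline` here is a stub standing in for the real prompt chain:

```python
# Sketch of behaviour tests: score the chosen intent and SQL filters,
# not the final wording. `run_pipeline` stubs the real LLM chain.
def run_pipeline(utterance):
    # The real thing would call the prompt chain and return its
    # structured intermediate outputs.
    return {"intent": "revenue_report",
            "filters": {"region": "EU", "year": 2024}}

BEHAVIOUR_CASES = [
    ("show me EU revenue for 2024",
     {"intent": "revenue_report", "filters": {"region": "EU", "year": 2024}}),
]

def check_behaviour(cases, pipeline):
    failures = []
    for utterance, expected in cases:
        got = pipeline(utterance)
        if (got["intent"], got["filters"]) != (expected["intent"], expected["filters"]):
            failures.append((utterance, expected, got))
    return failures

failures = check_behaviour(BEHAVIOUR_CASES, run_pipeline)
print(f"{len(failures)} behaviour regressions")
```

Because only the behaviour is asserted, the underlying model can be swapped freely as long as this suite stays green.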
alexhans 1 day ago [-]
Very, very heterogeneous and fast-moving space.
Depending on how they're made up, different teams do vastly different things.
Some have no evals at all, some run integration tests with no tooling, some use observability tools like Langfuse in their CI/CD, and others use tools like Arize Phoenix, DeepEval, Braintrust, promptfoo, or Pydantic AI throughout their development.
It's definitely an afterthought for most teams although we are starting to see increased interest.
My hope is that we can start thinking about evals as a common language for "product" across role families, so I'm doing some advocacy [1], trying to keep it very simple, including wrapping coding agents like Claude. Sandboxing and observability "for the masses" is still quite a hard concept, but the UX is getting better with time.
What are you doing for yourself/your teams? If not much yet, I'd recommend just starting and figuring out where the friction/value is for you.
[1] https://ai-evals.io/ (practical examples https://github.com/Alexhans/eval-ception)
I use LLMs to determine what a caller’s “intent” is. I do my best with my initial prompt and then I have the “business” test it and I log phrases that they use.
I then make those phrases my scripted test suite. Any changes in prompts or models get put through the same test suite. In my case, I give my customers a website they can use to test new prompts, and it takes care of versioning.
I also log phrases that didn’t trigger an intent and modify the prompt and put it back through the suite.
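That phrase-harvesting loop can be sketched as a plain regression suite. `classify_intent` is a stub for the real LLM call, and the phrase/intent pairs are hypothetical examples:

```python
# Sketch of turning logged caller phrases into an intent regression suite.
# `classify_intent` stands in for the real LLM call.
def classify_intent(phrase):
    # Stub: a keyword router in place of the LLM.
    text = phrase.lower()
    if "bill" in text or "invoice" in text:
        return "billing"
    if "cancel" in text:
        return "cancellation"
    return None  # logged later as "no intent triggered"

SUITE = [  # phrases harvested from the business's testing
    ("I have a question about my bill", "billing"),
    ("please cancel my subscription", "cancellation"),
]

def run_suite(suite, classifier):
    # Return every (phrase, expected, got) triple that regressed.
    return [(p, want, classifier(p)) for p, want in suite
            if classifier(p) != want]

print(run_suite(SUITE, classify_intent))  # [] when nothing regressed
```

Phrases that return `None` in production get logged, the prompt gets modified, and the phrase joins `SUITE` so the fix can't silently regress.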
mierz00 4 hours ago [-]
I highly rate Braintrust.
It wouldn’t be too difficult to build something like that for your own usage, but I found it pretty easy to get datasets set up.
Essentially a game changer in understanding if your prompts are working. Especially if you’re doing something which requires high levels of consistency.
In our case we use an LLM for classification, which fits in perfectly with evals.
kelseyfrog 14 hours ago [-]
Automated benchmarking.
We were lucky enough to have PMs create a set of questions, we did a round of generation and labeled pass/fail annotations on each response.
From there we bootstrapped an AI-as-a-judge and approximately replicated the results. Then we can plug in new models and change prompts or pipelines while approximating the original feedback signal. It's not an exact match, but it's wildly better than one-off testing and the regressions it brings.
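The "approximately replicated the results" step boils down to measuring how often the bootstrapped judge agrees with the PMs' original pass/fail labels. A minimal sketch:

```python
# Sketch of validating a bootstrapped judge: agreement rate between
# the model judge's labels and the original human pass/fail labels.
def judge_agreement(human_labels, judge_labels):
    agree = sum(h == j for h, j in zip(human_labels, judge_labels))
    return agree / len(human_labels)

# Hypothetical labels over five PM-written questions:
human = ["pass", "pass", "fail", "pass", "fail"]
judge = ["pass", "fail", "fail", "pass", "fail"]
print(judge_agreement(human, judge))  # 0.8
```

Once the agreement rate is acceptably high, the judge can stand in for the humans on every subsequent model or prompt change.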
We're able to confidently make changes without accidentally breaking something else. Overall win, but it can get costly if the iteration count is high.
kbdiaz 9 hours ago [-]
The vast majority of AI companies I talk to seem to evaluate models mostly based on vibes.
At my company, we use a mix of offline and online evals. I’m primarily interested in search agents, so I’m fortunate that information retrieval is a well-developed research field with clear metrics, methodology, and benchmarks. For most teams, I recommend shipping early/dogfooding internally, collecting real traces, and then hand-curating a golden dataset from those traces.
Many people run simple ablation experiments where they swap out the model and see which one performs best. That approach is reasonable, but I prefer a more rigorous setup.
If you only swap the model, some models may appear to perform better simply because they happen to work well with your prompt or harness. To avoid that bias, I use GEPA to optimize the prompt for each model/tool/harness combination I’m evaluating.
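The fairer ablation grid can be sketched as follows; `optimize_prompt` and `score` are hypothetical hooks (in practice an optimizer like GEPA would fill the `optimize_prompt` role):

```python
# Sketch of a per-model ablation: tune the prompt for each model first,
# then compare scores. `optimize_prompt` and `score` are hypothetical.
def ablate(models, optimize_prompt, score):
    results = {}
    for model in models:
        prompt = optimize_prompt(model)   # per-model prompt tuning
        results[model] = score(model, prompt)
    best = max(results, key=results.get)
    return best, results

# Demo with stubbed tuning and scoring:
best, results = ablate(
    ["model-a", "model-b"],
    optimize_prompt=lambda m: f"tuned prompt for {m}",
    score=lambda m, p: {"model-a": 0.71, "model-b": 0.78}[m],
)
print(best, results)
```

The point is that each model is scored with its own optimized prompt, so no model loses merely because the shared prompt happened to suit a competitor.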
minikomi 10 hours ago [-]
The more you can afford to build up your understanding of the problem space and define what inputs & outputs look like, the more flexible you can be with evals. Unfortunately, this is a lot of work and requires thinking and discussion with your team and those involved.
I wrote about the general ideas I take towards simple single-prompt features, but most of it is applicable to more involved agentic approaches too: https://poyo.co/note/20260217T130137/
bisonbear 18 hours ago [-]
I assume you're referencing coding agents. I don't think people are doing much there. If they are, it's likely using:
- AI to evaluate itself (eg ask claude to test out its own skill)
- custom built platform (I see interest in this space)
I've actually been thinking about this problem a lot and am working on a custom eval runner for your codebase. What would your use case be for this?
maxalbarello 14 hours ago [-]
Also wondering how to eval agentic pipelines. For instance, I generated memories from my ChatGPT conversation history; how do I know whether they are accurate or not?
I would like a single number that I would use to optimize the pipeline with but I find it hard to figure out what that number should be measuring.
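One defensible "single number" for this, assuming you hand-label a sample of conversations with the memories they should have produced, is an F1 score over extracted memories:

```python
# Sketch: score extracted memories against a hand-labeled gold set,
# combining precision and recall into one number (F1).
def memory_f1(extracted, gold):
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)                       # correct memories
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold labels for one conversation:
gold = {"prefers dark mode", "lives in Berlin", "vegetarian"}
extracted = {"prefers dark mode", "lives in Berlin", "owns a dog"}
print(round(memory_f1(extracted, gold), 3))  # 0.667
```

Exact set matching is the big simplification here; in practice you'd likely need fuzzy or LLM-judged equivalence between a generated memory and its gold counterpart.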
HPSimulator 12 hours ago [-]
One thing I’ve been noticing while building AI tooling is that most “agents” focus on doing work for the user — writing code, sending emails, managing tasks, etc.
But there’s another category that might become just as important: agents that simulate other humans instead of automating tasks.
For example, before shipping a landing page change or pricing update, it’s surprisingly useful to simulate how different types of visitors might react psychologically — where they hesitate, what signals reduce trust, what makes them bounce.
Traditional analytics only shows what happened after users interact. A lot of decisions happen earlier, in the first few seconds, before anything measurable occurs.
I wouldn’t be surprised if we start seeing “human simulation agents” alongside task agents, especially for product, marketing, and UX decisions.
kristianp 8 hours ago [-]
What you've said is nothing to do with AI Evals.
celestialcheese 14 hours ago [-]
A mix of promptfoo and ad-hoc Python scripts, with Langfuse for observability.
Definitely not happy with it, but everything is moving too fast to feel like it's worth investing in.
moltar 6 hours ago [-]
I use Promptfoo
rurban 8 hours ago [-]
Doing tickets and commenting cost and quality in the PR.
Still, the best are outstanding, and the medium ones barely usable. I rank them by IQ, from 140 down to utterly stupid. opencode/gpt-oss-120b local got a 90. opencode/opus-4.6 gets 140. codex/gpt-5.4 gets 115. All for C/C++ tasks.
There was one expensive Chinese SWE benchmark posted recently to arXiv. It confirmed my evaluation.
dkoy 14 hours ago [-]
Curious who’s used OpenAI Evals
mock-possum 8 hours ago [-]
We feed a handful of preset questions through the new AI, we collect the results, we ask another AI to score the answers based on example "good" answers we've written, then we have a guy sit down and use the fallout as a starting point to rank the performance of that AI compared to all the previous ones.
Seems like it works pretty well. Our prompts and params get tweaked towards better and better results, and we get a sense of what’s worth paying more for.
satisfice 10 hours ago [-]
It’s called testing. And from the reports and comments, there doesn’t seem to be much of it happening. The reason is: it’s quite expensive to do well.
I find that for every hypothesis I might have to run a thousand prompts to collect enough data for a conclusion. For instance, discovering how reliably different models can extract noun phrases from a text took hours of grinding. Even so, that was for a small text; I haven't yet run the process on a large text.
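The need for a thousand runs falls out of basic statistics: an observed success rate over n trials carries a confidence interval that only narrows with more trials. A sketch using the 95% Wilson score interval:

```python
# Sketch: the same observed 90% success rate gives very different
# certainty at 10 trials vs 1000 trials (95% Wilson score interval).
import math

def wilson_interval(successes, n, z=1.96):
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - margin, centre + margin

print(wilson_interval(9, 10))      # wide interval: ~10 runs tell you little
print(wilson_interval(900, 1000))  # narrow interval: ~1000 runs pin it down
```

With 10 trials the true reliability could plausibly be anywhere from roughly 60% to 98%; with 1000 trials it is pinned near 88-92%, which is why cheap testing gives misleading conclusions.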
aszen 6 hours ago [-]
Seems like you are testing LLMs' generic abilities rather than your actual agent logic.
LLMs are like vendor code: you don't need to test them yourself, because people have already created benchmarks for that.
In some cases I've seen teams rely on a mix of automated metrics and human review, especially for production systems where reliability matters a lot.
But evaluation pipelines for AI still seem much less standardized compared to traditional software monitoring.