AI Evals for AI — Sounds Sci-Fi, Leads to Improvement

Alice Li

UX meets AI meets LLM. Running a modern business is a bowl of alphabet soup these days. But bear with me, because all of these acronyms are related, and how well any one of them works depends on the others. For instance, the products in your tech stack likely use some amount of AI to improve the user experience (UX). That could be as simple as taking notes during a meeting or as complicated as operating a self-driving car. While all in-product AI should be evaluated, I want to talk specifically about large language models (LLMs) and how they can be evaluated at scale with AI to keep your UX running smoothly.

What Exactly Is an Eval?

Evaluations, or evals, are sets of data points and criteria for judging the output of your AI algorithm. They're a way to measure the quality of products built on LLMs and other forms of AI, and as your product scope expands over time, your team will need them to gauge how effective the AI is.

Output requires input. When a customer enters a question or comment into your product, that's an input. The answer your LLM responds with is the output. You judge the output's effectiveness against criteria. For a question-answering LLM, you might judge its responses on criteria like:

- Tone
- Grammar
- Accuracy
- Comprehensiveness
- Clarity
- Relevance

Start With Manual Evals

The best way to start is to judge the output manually. Read through what your LLM produces and determine what is good, what isn't, and why. Start small and simple. Over time, you can iterate on your evals by adding new data points or changing the evaluation criteria.

But manual judging only scales so far. Once the output becomes too much for a team to review, it's time to invite AI into the eval process.

Evolve Your Evals With Automation

Automation in the eval process helps your product team iterate and improve faster. With an AI as the judge of your LLM's output, you can review many more responses and determine whether those answers are acceptable.

Some advice from the Gladly Product Team: set each eval to judge only one thing, since an LLM works better when it can focus on one task at a time. While a person can check for tone, grammar, and clarity all in one go, your LLM shouldn't. Instead, create and run a different judge for each criterion you care about, as in the sketch below.

Just as important, keep human eyes on your LLM evals. Maintain some level of manual review even as automated evals replace most of it, because the places where your team and your LLM judge disagree are a good indication that your eval needs improvement. For example, when a human on your team rates an output as tonally friendly and the algorithm agrees, that's a positive eval. When the human says an output isn't friendly and the algorithm agrees, that's also a positive eval. But when the human and the algorithm disagree, you need to fix your algorithm. You can then tune the algorithm with this human-labeled data, using classification metrics like accuracy, false positives, and false negatives to guide it toward the behavior you want.
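To make the one-judge-per-criterion advice concrete, here is a minimal sketch of a single-criterion tone judge. The `call_llm` helper and the `JUDGE_PROMPT` wording are hypothetical stand-ins rather than any specific vendor's API; the point is the shape: one judge, one criterion, run over a set of input/output data points.

```python
# A minimal sketch of a single-criterion LLM judge, assuming a generic
# call_llm(prompt) -> str client; swap in whatever SDK your stack uses.

JUDGE_PROMPT = """You are judging ONE criterion only: tone.
Customer message: {question}
Product response: {answer}
Is the response's tone friendly and professional? Reply PASS or FAIL."""

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM API call; returns a canned verdict here."""
    return "PASS"  # placeholder so the sketch runs end to end

def judge_tone(question: str, answer: str) -> bool:
    """One judge, one criterion: tone. Accuracy, clarity, etc. get their own judges."""
    verdict = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return verdict.strip().upper().startswith("PASS")

# An eval set is just data points: inputs paired with the outputs to judge.
eval_set = [
    {"question": "Where is my order?", "answer": "Hi there! It shipped yesterday."},
    {"question": "I want a refund.", "answer": "Refunds take 5-7 business days."},
]

for point in eval_set:
    print(point["question"], "->", "PASS" if judge_tone(**point) else "FAIL")
```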
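And here is one way to score the judge against your team's manual review, using the classification metrics mentioned above. The label lists are made-up illustration data; in practice they would come from humans and the judge rating the same outputs.

```python
# Comparing the automated judge to human review on the same outputs.
# True = "tone is friendly" in both lists; the data here is illustrative.

human_labels = [True, True, False, True, False]   # your team's manual calls
judge_labels = [True, False, False, True, True]   # the LLM judge's verdicts

pairs = list(zip(human_labels, judge_labels))
tp = sum(h and j for h, j in pairs)           # both say friendly
tn = sum(not h and not j for h, j in pairs)   # both say not friendly
fp = sum(not h and j for h, j in pairs)       # judge passed what a human failed
fn = sum(h and not j for h, j in pairs)       # judge failed what a human passed

accuracy = (tp + tn) / len(pairs)
print(f"accuracy={accuracy:.0%}  false_positives={fp}  false_negatives={fn}")
# The disagreements (fp + fn) are the outputs worth re-reading: they show
# where the judge, the criterion, or the underlying algorithm needs work.
```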
Remember: Evals Affect Your Product

Evals are product decisions, plain and simple. Even with AI automation, running evals is a painstaking process. The hardest part is curating the data points (some might argue it's even harder than tuning the algorithm), so your team will have to decide which evals are worth investing time and resources in.

Some eval criteria may not matter much, like whether a response is fully comprehensive when it already solves part of a customer's question. Other criteria may not seem important at first, but reveal their importance later through customer feedback and actions. You'll have to determine whether you're doing the right thing or doing the thing right, and tune your data set to get the behavior you want from the algorithm.

Finally, you'll need to decide on your evaluation targets and balance them against the operational realities of using LLMs, like added cost and latency. You and your team will have to ask yourselves questions such as:

- Do we want to save on cost here? What if it doubles the hallucinations we let through?
- Do we want to improve accuracy? What if that leads to longer inference time, and thus a longer wait for customers?
- Do we want to respond to more customers, even if that reduces the quality of responses? Or do we only want to respond when we have very high confidence in our answer? (A minimal sketch of this kind of confidence gating appears at the end of this post.)

Many of your eval targets will come down to tradeoffs like these, and those are decisions your product and engineering teams should make together, because at the end of the day they will shape your customers' experience with your product and brand.

AI Evaluations Show That You Care

If you care about the quality of your product's outputs, you need evals. You can review manually in the short term while prototyping a new algorithm, or automate your evals to continuously improve the product over the long term. The eval process is about making sure your LLM-based products are scalable and effective. So determine what's the best fit for your product, and evaluate the solutions against your team's and customers' needs.
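As promised above, here is a minimal sketch of confidence gating, the last tradeoff in the list. `score_confidence` is a hypothetical scorer (it could be a judge model or a classifier trained on your eval data); the threshold itself is the product decision.

```python
# Confidence-gated responding: answer only when confidence clears a threshold.
# score_confidence is a hypothetical stand-in; its fixed return value is a
# placeholder so the sketch runs.

CONFIDENCE_THRESHOLD = 0.85  # raise it: fewer answers, higher quality; lower it: the reverse

def score_confidence(question: str, draft_answer: str) -> float:
    """Return an estimated 0-1 confidence that the draft answer is correct."""
    return 0.90  # placeholder; in practice, a judge model or trained classifier

def respond_or_escalate(question: str, draft_answer: str) -> str:
    if score_confidence(question, draft_answer) >= CONFIDENCE_THRESHOLD:
        return draft_answer          # respond to more customers...
    return "ESCALATE_TO_HUMAN"       # ...or hold out for high confidence

print(respond_or_escalate("Where is my order?", "It shipped yesterday."))
```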