
In the rapidly evolving field of language models, ensuring the accuracy and relevance of responses is crucial. This blog post will guide you through setting up custom grading criteria to evaluate responses from large language models (LLMs), using a simple conditional evaluation system.

What is it?

A custom grading criterion is a method used to evaluate the responses of language models based on predefined conditions.

It operates on a simple principle: if the response meets a certain condition X, it fails; otherwise, it passes.

This evaluation is integrated into a Chain of Thought (CoT) prompt, ensuring that the output is a structured JSON containing the pass/fail status along with a reason.
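To make this concrete, here is a minimal sketch of what such a grader could look like. The prompt wording, function names, and JSON shape are assumptions for illustration, not taken from any specific library; the key ideas are the conditional pass/fail rule and the structured JSON verdict described above.

```python
import json

# Hypothetical Chain-of-Thought grading prompt. The exact wording is an
# assumption; what matters is the conditional rule and the JSON output format.
GRADER_PROMPT = """You are grading a language model response.

Condition: {condition}

Think step by step about whether the response meets the condition.
If the response meets the condition, it FAILS; otherwise it PASSES.

Response to grade:
{response}

Answer with only a JSON object: {{"pass": true or false, "reason": "..."}}"""


def build_grader_prompt(condition: str, response: str) -> str:
    """Fill the template with the user-defined condition and the LLM response."""
    return GRADER_PROMPT.format(condition=condition, response=response)


def parse_verdict(raw: str) -> dict:
    """Parse the grader's raw output into a {"pass": bool, "reason": str} dict."""
    verdict = json.loads(raw)
    if not isinstance(verdict.get("pass"), bool) or "reason" not in verdict:
        raise ValueError(f"Malformed verdict: {raw!r}")
    return verdict
```

You would send the built prompt to your grader model of choice and feed its raw reply to `parse_verdict`, giving you a structured pass/fail result plus the model's reasoning.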

Why do you need it?

For developers working with LLMs, ensuring that the model's responses meet specific standards or criteria is essential.

This tool is particularly useful for applications where responses need to adhere strictly to certain guidelines or quality standards. It simplifies the process of assessing whether the responses from an LLM are adequate, based on the conditions you define.

Some examples:

“If the response contains a financial figure, then fail. Otherwise pass”

“If the response contains a phone number, then fail. Otherwise pass”

“If the response says something like ‘I don’t know’, then fail. Otherwise pass”

“If the response claims to have taken an action, then fail. Otherwise pass”

“If the response mentions a refund, then fail. Otherwise pass”
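Wiring one of these conditions end to end could look like the following sketch. The `ask_model` callable and the stubbed `fake_model` are placeholders I've introduced so the example is self-contained; in practice you would pass in your own LLM client.

```python
import json


def grade(response: str, condition: str, ask_model) -> dict:
    """Grade `response` against `condition` using a caller-supplied LLM function.

    `ask_model` is any callable that takes a prompt string and returns the
    grader model's raw text reply.
    """
    prompt = (
        f"Condition: {condition}\n"
        f"Response to grade: {response}\n"
        "If the response meets the condition, it fails; otherwise it passes.\n"
        'Reply with only JSON: {"pass": true or false, "reason": "..."}'
    )
    return json.loads(ask_model(prompt))


# Stub standing in for a real grader LLM, so the example runs offline.
def fake_model(prompt: str) -> str:
    if "555-0134" in prompt:
        return '{"pass": false, "reason": "The response contains a phone number."}'
    return '{"pass": true, "reason": "No phone number found."}'


verdict = grade(
    "Call us at 555-0134.",
    "If the response contains a phone number, then fail. Otherwise pass",
    fake_model,
)
# verdict["pass"] is False, with the reason explaining why
```

Because the grader returns structured JSON, you can plug the verdict straight into a test suite or monitoring pipeline and act on failures programmatically.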