Benchmarking and Evaluation

Metrics for Evaluating The LLM-Powered Chatbot

Evaluating the performance of LLM-powered chatbots is crucial in understanding how effectively they interact with users and achieve their intended functions. Given the complexity of natural language and the nuances of human communication, several metrics are employed to measure a chatbot’s proficiency from various angles.

1. Accuracy

The concept of accuracy within the context of LLM-powered chatbots encompasses several layers and is pivotal in assessing the overall effectiveness of these systems in understanding and responding to user inputs. At its core, accuracy measures the chatbot’s ability to produce responses that are both correct and relevant to the user’s query or statement.

Alternative text for the image

What is Confusion Matrix and why you need it?

Well, it is a performance measurement for machine learning classification problem where output can be two or more classes. It is a table with 4 different combinations of predicted and actual values.

Alternative text for the image
Alternative text for the image

2. Intent Recognition Accuracy

Specifically gauges how accurately the chatbot identifies the users’ intentions. This is critical for routing the conversation correctly and providing the appropriate responses or actions.

3. Fallback

A fallback is an alternative plan that may be used in an emergency

Crucially, fallbacks can be applied not only on the LLM level but on the whole runnable level. This is important because often times different models require different prompts. So if your call to OpenAI fails, you don’t just want to send the same prompt to Anthropic - you probably want to use a different prompt template and send a different version there. There are different types of fallbacks:

+Fallback for LLM API Errors:

This is maybe the most common use case for fallbacks. A request to an LLM API can fail for a variety of reasons - the API could be down, you could have hit rate limits, any number of things. Therefore, using fallbacks can help protect against these types of things.

+Fallback for Sequences

We can also create fallbacks for sequences, that are sequences themselves. Here we do that with two different models: ChatOpenAI and then normal OpenAI (which does not use a chat model). Because OpenAI is NOT a chat model, you likely want a different prompt.

+Fallback for Long Inputs

One of the big limiting factors of LLMs is their context window. Usually, you can count and track the length of prompts before sending them to an LLM, but in situations where that is hard/complicated, you can fallback to a model with a longer context length..

4. Task Completion Rate

The percentage of conversations where the chatbot successfully completes the intended task or resolves the user’s issue without escalation. High rates indicate effectiveness in autonomous problem solving.

5. Engagement Rate

Reflects how well the chatbot maintains users’ interest or participation over time. Metrics could include the number of conversation turns, session duration, or repeat interactions.

6. Response Time

The average time taken by the chatbot to respond to user inputs. Faster response times are typically associated with better user experiences but must also balance the need for accurate and thoughtful responses.

Hands on for Evaluation Metrics

Benchmarking LLM Performance

Do’s and dont’s section

The following table outlines essential do’s and dont’s for effective benchmarking in AI:

Do’s

Don’ts

Utilize diverse and representative datasets for evaluation

Rely solely on synthetic or biased datasets

Ensure transparency and fairness in the benchmarking process

Overlook the influence of biases and confounding factors in evaluations

Implement standardized evaluation methodologies and metrics

Neglect the continuous assessment and re-evaluation of benchmarking practices

Consider the impact of contextual factors in performance assessments

Overemphasize single-point performance metrics without holistic consideration

1. Comparison with Baselines

When introducing a new LLM, it’s essential to compare its performance against established baselines. These baselines are typically previous models or well-known standards in the industry that represent the minimum expected performance. To make this comparison:

  • Identify standard tasks that the LLM should perform (e.g., text classification, question answering).

  • Use the same datasets and metrics that were used to evaluate the baseline models to ensure comparability.

  • Run the LLM on these tasks and compare the outcomes with the results from the baseline models.

  • Report improvements or regressions in performance, providing a clear picture of where the new LLM stands.

2. Task-Specific Benchmarks

Task-specific benchmarks involve evaluating the LLM’s performance on a variety of tasks that are representative of its expected usage. This might include:

  • NLP tasks such as sentiment analysis, named entity recognition, or language translation.

  • Specialized tasks that are relevant to the domain where the LLM will be applied, such as industrial applications.

  • Standardized benchmarks such as GLUE or SuperGLUE that aggregate multiple NLP tasks to provide a comprehensive assessment.

  • Performance is then quantified using appropriate metrics like accuracy, F1 score, BLEU score for translation, or ROUGE for summarization.

3. LLM’s efficiency in resource utilization

Efficiency in resource utilization assesses how well the LLM uses computational resources relative to the performance it achieves. This aspect is increasingly important due to the large environmental and economic costs of training and running LLMs(Look at The computational requirements for Training LLMs section for more details).

Using Similarity Metrics to Validate Fine-Tuning

Compare the scores from the pretrained and fine-tuned models. Significant improvements in these metrics indicate that the fine-tuning process has enhanced the model’s ability to generate semantically similar and high-quality text. ( In the cosine method we have also added a comparaison with the reference text from the training dataset )

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

dataset_output = "To calibrate a pressure gauge, first ensure the gauge is disconnected from any pressure source. Connect the gauge to a known, accurate pressure source or a calibrator. Apply pressure in increments and compare the gauge reading with the known pressure. Adjust the gauge calibration screw to correct any discrepancies. Repeat the process to ensure accuracy. Document the calibration in the maintenance log."


input_sentences = tokenizer("How to calibrate a pressure gauge?", return_tensors="pt").to('cuda')

# Generate output using the original pretrained model
foundational_outputs_sentence = get_outputs(loaded_model, input_sentences, max_new_tokens=50)
pretrained_output = tokenizer.batch_decode(foundational_outputs_sentence, skip_special_tokens=True)[0]

# Generate output using the finetuned model
foundational_outputs_sentence_finetuned = get_outputs(foundation_model, input_sentences, max_new_tokens=50)
finetuned_output = tokenizer.batch_decode(foundational_outputs_sentence_finetuned, skip_special_tokens=True)[0]

# Prepare the documents for vectorization
documents = [pretrained_output, finetuned_output,dataset_output]

# Initialize the CountVectorizer
count_vectorizer = CountVectorizer(stop_words="english")
sparse_matrix = count_vectorizer.fit_transform(documents)

# Convert the sparse matrix to a dense matrix
doc_term_matrix = sparse_matrix.todense()

# Create a DataFrame from the dense matrix
df = pd.DataFrame(
   doc_term_matrix,
   columns=count_vectorizer.get_feature_names_out(),
   index=["pretrained_output", "finetuned_output","dataset_output"]
)

# Print the DataFrame
print("pretrained_output"+ pretrained_output +"\n"  , "finetuned_output" + finetuned_output +"\n" ,"dataset_output"+ dataset_output)

# Calculate and print the cosine similarity
cosine_sim = cosine_similarity(df, df)
print(cosine_sim)

The output is as follows:

Alternative text for the image

The cosine similarity matrix indicates the following:

  • pretrained_output vs. finetuned_output: Cosine similarity is 0.5876, indicating moderate similarity between outputs from the pretrained model and the finetuned model.

  • pretrained_output vs.dataset_output: Cosine similarity is 0.4463, suggesting a lower similarity between outputs from the pretrained model and the dataset answers.

  • finetuned_output vs. dataset_output: Cosine similarity is 0.5455, showing moderate similarity between outputs from the finetuned model and the dataset answers.

These results suggest that the fine-tuned LLM has diverged significantly from both the pretrained LLM and the dataset answer, indicating that the fine-tuning process has not aligned the model closely with the expected dataset answers. Further refinement may be needed to improve alignment.

In practice, human evaluation plays a crucial role in assessing the quality of the model’s outputs. Humans can judge nuances, context, and appropriateness of responses in ways that metrics cannot fully capture.

This highlights the necessity of combining human evaluation with quantitative metrics to ensure a comprehensive assessment of the fine-tuning process. Human evaluators can provide insights into the correctness, relevance, and coherence of the generated text, ensuring that the fine-tuned model truly meets the desired standards.