In this session, we will explore performance evaluation techniques for text generated by Large Language Models (LLMs) and Generative AI. Evaluation is difficult because the quality of generated text is subjective, and it is pressing because businesses now face a multitude of options in the Generative AI and LLM market. We will cover popular metrics such as BLEU, ROUGE, and BERTScore, which are widely used to evaluate the quality of text generated by LLMs. We will also discuss the LLM-as-a-judge technique, where one LLM is used to evaluate another LLM's generated text. This technique has gained popularity due to its ability to capture more nuanced aspects of text quality, such as coherence and fluency. Finally, we will go over the current practice of using leaderboards to compare LLMs: the Hugging Face Open LLM Leaderboard, which reports performance on various academic benchmarks, and the LMSYS Chatbot Arena Leaderboard, which uses a variant of the Elo rating system to rank the mainstream LLMs available today relative to one another. By the end of this session, attendees should have a foundational understanding of the evaluation techniques used to assess the text generation capabilities of LLMs and be able to apply these techniques to their own work.
Agenda for the session
- Challenges in LLM evaluation: lack of standards and subjective outputs
- Metrics to assess quality: BLEU, ROUGE, and BERTScore
- LLM-as-a-Judge: using one LLM to evaluate another's output, and the subjectivity this introduces
- LLM Leaderboards: Hugging Face Open LLM Leaderboard and LMSYS Chatbot Arena Leaderboard
About the Speaker

Dr. Davood Wadi
AI Research Scientist - intelChain
Dr. Davood Wadi is an AI Research Scientist at intelChain. Before joining intelChain, he excelled as an AI researcher and pursued his Ph.D. at HEC Montreal, renowned globally for its academic excellence. His interest in applying modern technologies to data began during his tenure as a financial analyst, where he started incorporating mathematical methods into mass psychology to understand investment patterns. Davood’s expertise and interests include developing new algorithms for AI and ML applications, computer vision, NLP, meta-learning, and consumer neuroscience.