Publication
Benchmarking Generative AI Performance Requires a Holistic Approach
Ajay Dholakia; David Ellison; Miro Hodak; Debojyoti Dutta; Carsten Binnig
In: Raghunath Nambiar; Meikel Poess (Eds.). Performance Evaluation and Benchmarking: 15th TPC Technology Conference, TPCTC 2023, Vancouver, BC, Canada, August 28 - September 1, 2023, Revised Selected Papers. Technology Conference on Performance Evaluation and Benchmarking (TPCTC), Pages 34-43, Lecture Notes in Computer Science, Vol. 14247, Springer, 2023.
Abstract
The recent focus in AI on Large Language Models (LLMs) has brought the topic of trustworthy AI to the forefront. Along with the excitement of human-level performance, the Generative AI systems enabled by LLMs have raised many concerns about factual accuracy, bias along various dimensions, and the authenticity and quality of generated output. Ultimately, these concerns directly affect users' trust in the AI systems they interact with. The AI research community has developed a variety of metrics for perplexity, similarity, bias, and accuracy that attempt to provide an objective comparison between different AI systems. However, these are difficult concepts to encapsulate in metrics that are easy to compute. Furthermore, AI systems are advancing toward multimodal foundation models, which makes creating simple metrics an even more challenging task. This paper describes recent trends in measuring the performance of foundation models such as LLMs and multimodal models. The need for metrics, and ultimately benchmarks, that enable meaningful comparisons between different Generative AI system designs and implementations is growing stronger. The paper concludes with a discussion of future trends aimed at increasing trust in Generative AI systems.
