Why AI benchmark scores can be misleading, and what that means for users.

According to experts, the most widely used AI benchmarks have not been adapted or revised to reflect how models are actually used today.


AI benchmarks reveal little

🤖🎯

AI companies are in a constant race to claim that their models outperform the competition. Anthropic and Inflection AI are the latest contenders, boasting superior performance and quality compared to industry giants like OpenAI’s GPT models. But what do these claims actually mean, and do they translate to tangible improvements for users? Let’s dive into the world of AI benchmark metrics to uncover the truth.

Esoteric Measures: The Problem with Benchmarks

📊🧪

Most AI models, particularly chatbot-powered ones, rely on benchmarks to assess their capabilities. However, these benchmarks often fail to capture how the average person interacts with these models in real-life scenarios. For example, a benchmark like GPQA focuses on graduate-level questions in varying scientific fields, while most users rely on chatbots for everyday tasks like writing emails or expressing their feelings.

Jesse Dodge from the Allen Institute for AI describes this situation as an “evaluation crisis.” Many benchmarks used today are outdated and not aligned with the diverse ways people use generative AI models. As a result, these benchmarks don’t truly reflect the models’ real-world utility or user experience.

The Wrong Metrics: Irrelevant Skills and Tests

❌🧪

Commonly used benchmarks often assess skills and knowledge that are irrelevant to the majority of users. Assessing a model’s ability to solve grade school-level math problems or identify anachronisms doesn’t accurately measure its usefulness in everyday scenarios.

David Widder, a postdoctoral researcher at Cornell, explains that older AI systems focused on solving problems within specific contexts, making it easier to evaluate their performance. However, as models become more “general purpose,” it becomes challenging to rely on context-specific evaluation. Consequently, current benchmarks aim to test models across a range of fields, but they still miss the mark in terms of real-world usability and relevance.

Moreover, there are concerns about the accuracy and validity of certain benchmarks. The HellaSwag test, designed to evaluate commonsense reasoning in models, contains questions with typos and nonsensical writing. Another benchmark, MMLU, tests models on logic problems that can be solved through rote memorization, rather than true comprehension and reasoning ability.
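
To make the concern concrete, here is a minimal Python sketch of how multiple-choice benchmark scoring of this kind typically works; the Item fields, prompt format, and accuracy function are simplified assumptions for illustration, not the official MMLU or HellaSwag harness. Because the score only checks that the emitted letter matches the answer key, a model that has memorized answer patterns can score as well as one that actually reasons.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Item:
    question: str
    choices: list[str]   # e.g. ["A. 3", "B. 4", "C. 5", "D. 22"]
    answer: str          # gold label, e.g. "B"

def accuracy(model: Callable[[str], str], items: list[Item]) -> float:
    """Fraction of items where the model's predicted letter matches the answer key."""
    correct = 0
    for item in items:
        prompt = item.question + "\n" + "\n".join(item.choices) + "\nAnswer:"
        prediction = model(prompt).strip()[:1].upper()  # keep only the first letter
        correct += prediction == item.answer
    return correct / len(items)

# A lookup-table "model" that memorized the answer key scores perfectly without any reasoning.
items = [Item("2 + 2 = ?", ["A. 3", "B. 4", "C. 5", "D. 22"], "B")]
print(accuracy(lambda prompt: "B", items))  # -> 1.0
```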

Fixing What’s Broken: Human Involvement and Contextual Evaluation

🔨🤝

To overcome the limitations of existing benchmarks, experts propose incorporating more human involvement and evaluating models in real user scenarios.

Jesse Dodge suggests combining evaluation benchmarks with human evaluation. Models should be prompted with real user queries, and humans can then rate the quality of the responses. This approach would provide a more accurate assessment of a model’s performance from a user’s perspective.
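
A minimal sketch of what that combination could look like in code, assuming a 1-to-5 rating scale, a set of logged real user queries, and simple averaging; these are illustrative choices, not a published protocol.

```python
from statistics import mean

def human_eval(model, user_queries, annotators):
    """Average human rating (1-5) of a model's answers to real user queries."""
    scores = []
    for query in user_queries:
        response = model(query)
        # each annotator rates how well the response actually helped with the task
        scores.extend(rate(query, response) for rate in annotators)
    return mean(scores)

# Reported alongside, not instead of, the usual benchmark numbers, e.g.:
# report = {"benchmark_accuracy": 0.71, "human_rating": human_eval(model, queries, raters)}
```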

David Widder, however, believes that current benchmarks, even with fixes for errors like typos, cannot sufficiently inform the vast majority of generative AI model users. Instead, he suggests evaluating models based on their downstream impacts on users and the desirability of those impacts. This approach would involve examining the contextual goals and assessing whether AI models successfully meet those goals.

Looking Ahead: The Impact and Future of AI Benchmarking

🔮🚀

The fragmented state of AI benchmark metrics suggests a need for a more comprehensive approach. AI companies must prioritize developing benchmarks that align with real-world use cases and measure the practical impact of their models. As AI becomes increasingly integrated into various aspects of our lives, it’s crucial to address the limitations of benchmarks to ensure the technology is effectively meeting user needs.

In the future, we may witness a shift towards more holistic evaluation strategies that consider the multidimensional aspects of AI model performance. By focusing on contextual goals and evaluating downstream impacts, we can better understand the value that these models bring to different domains and user requirements.

🤔 Reader Questions:

Q: Are there any alternative benchmarks being developed that address the limitations mentioned?

A: Yes, efforts are underway to address the shortcomings of existing benchmarks. Some researchers are developing benchmarks that better reflect real-world usage scenarios, focusing on areas like business communications, language understanding, and customer service interactions. These benchmarks aim to provide a more accurate assessment of AI models’ performance in practical applications.


Q: How can users assess the performance of AI models without relying solely on standardized benchmark scores?

A: Evaluating AI models goes beyond benchmark-based metrics. Users can consider factors such as a model’s responsiveness, accuracy, language coherence, and contextual understanding. In addition, gathering feedback from real users and running user surveys can provide valuable insight into a model’s effectiveness and user satisfaction. Ultimately, users should favor the models that best fit their specific needs and requirements.
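
For readers who want to compare models on their own prompts, here is a minimal sketch of a weighted rubric built around the factors above; the criterion names, weights, and 0-to-1 scores are illustrative assumptions rather than a standard methodology.

```python
# Weights and 0-1 scores are illustrative; adjust them to your own priorities.
WEIGHTS = {"responsiveness": 0.2, "accuracy": 0.4, "coherence": 0.2, "context": 0.2}

def rubric_score(ratings: dict[str, float]) -> float:
    """Weighted sum of per-criterion ratings, each on a 0-1 scale."""
    return sum(WEIGHTS[criterion] * ratings[criterion] for criterion in WEIGHTS)

# Example: ratings collected by running two models on the same everyday tasks.
model_a = {"responsiveness": 0.9, "accuracy": 0.7, "coherence": 0.8, "context": 0.6}
model_b = {"responsiveness": 0.6, "accuracy": 0.9, "coherence": 0.8, "context": 0.8}
print(rubric_score(model_a), rubric_score(model_b))  # pick whichever better fits your needs
```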
