Measures for a more user-centered evaluation of classification quality

This paper, presented at the 14th International Conference on Recent Advances in Natural Language Processing (RANLP 2023), introduces measures that assess how well active learning strategies meet practical, user-oriented requirements for AI-based annotation tools.

Summary

A solution to limited annotation budgets is active learning (AL), a collaborative process between humans and machines for the strategic selection of a small but informative set of examples. While current approaches optimize AL from a machine learning perspective, we argue that successful real-world deployment requires additional criteria that target the second pillar of AL: the human annotators and their needs. For example, the usefulness of AL methods in text classification is typically assessed using common performance measures such as accuracy or F1. However, such measures fall short when applied to real-world datasets, which often contain many classes with highly imbalanced distributions. In these scenarios, additional criteria, such as quickly identifying all classes (e.g., topics) or detecting rare cases, become important. We therefore introduce four measures that reflect the class-specific requirements users have for data collection and content analysis.
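The exact definitions of the four measures are given in the publication below. Purely as a simplified illustration of what class-specific criteria of this kind can look like (the function names and definitions here are ours, not the paper's), class coverage and discovery speed might be tracked over AL rounds as follows:

```python
def class_coverage(labeled_classes, all_classes):
    """Fraction of all classes that appear at least once among the labeled examples."""
    return len(set(labeled_classes) & set(all_classes)) / len(set(all_classes))


def rounds_until_full_coverage(labels_per_round, all_classes):
    """Number of AL rounds needed until every class has been labeled at least once.

    Returns None if full coverage is never reached within the given rounds.
    """
    seen = set()
    for round_idx, round_labels in enumerate(labels_per_round, start=1):
        seen.update(round_labels)
        if set(all_classes) <= seen:
            return round_idx
    return None


# Toy example: three AL rounds over a four-class topic detection task
classes = ["budget", "traffic", "parks", "housing"]
rounds = [["budget", "traffic"], ["traffic", "parks"], ["housing"]]
all_labels = [label for r in rounds for label in r]
print(class_coverage(all_labels, classes))          # 1.0
print(rounds_until_full_coverage(rounds, classes))  # 3
```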

In a comprehensive comparison of uncertainty-, diversity-, and hybrid-based data selection strategies across six different datasets, we find, for example, that strong F1 performance does not necessarily correspond to complete class coverage (i.e., some topics are never identified), and that different data selection strategies exhibit varying strengths and weaknesses with respect to class-specific requirements. Our empirical findings highlight that a holistic perspective is essential when evaluating AL approaches to ensure their practical usefulness. To this end, standard measures for evaluating machine-based text classification must be complemented by those that better reflect user needs.
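The concrete selection strategies and models used in this comparison are described in the paper. As a generic illustration of the uncertainty-based family only (assuming a classifier that exposes a probability matrix in the style of predict_proba; this is not the paper's setup), least-confidence sampling selects the examples the current model is least sure about:

```python
import numpy as np


def least_confidence_selection(probabilities, batch_size):
    """Return indices of the unlabeled examples whose most likely class has the
    lowest predicted probability (i.e., the most uncertain examples).

    probabilities: array of shape (n_unlabeled, n_classes), e.g. from a
    classifier's predict_proba; purely illustrative, not the paper's implementation.
    """
    uncertainty = 1.0 - probabilities.max(axis=1)
    return np.argsort(-uncertainty)[:batch_size]


# Toy example: pick the 2 most uncertain of 4 unlabeled documents
probs = np.array([[0.90, 0.05, 0.05],
                  [0.40, 0.35, 0.25],
                  [0.60, 0.30, 0.10],
                  [0.34, 0.33, 0.33]])
print(least_confidence_selection(probs, 2))  # [3 1]
```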

Selected results

  • This publication proposes four new class-specific evaluation measures for AL approaches that take into account how quickly and how reliably rare classes, and the full set of classes, are detected. These criteria are not captured in detail by standard metrics such as F1.
  • The new measures enable practice-relevant insights into performance, particularly on datasets with strongly varying class frequencies and a large number of different classes – characteristics that are common in real-world applications (e.g., topic detection in public participation processes).
  • It becomes clear that the choice of an appropriate AL strategy should not be based solely on standard performance measures. For example, the top-performing approaches according to the F1 score may not ensure that all classes are identified, even though this is a crucial requirement in the automated analysis of participation contributions: no topic should be overlooked, no voice should go unheard. The measures we developed can provide additional guidance in strategy selection and thus support the choice of practice-oriented solutions (see the sketch after this list).
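One way in which such measures could complement F1 when choosing a strategy, shown here purely as a sketch and not as a procedure from the paper, is to require a minimum level of class coverage before comparing F1 scores (the input format, keys, and threshold are illustrative assumptions):

```python
def pick_strategy(results, min_class_coverage=1.0):
    """Among strategies that meet a class-coverage requirement, pick the one with
    the best F1; fall back to the best-covering strategy otherwise.

    results: dict mapping strategy name -> {"f1": float, "class_coverage": float},
    both in [0, 1]. Names, keys, and the default threshold are illustrative.
    """
    eligible = {name: r for name, r in results.items()
                if r["class_coverage"] >= min_class_coverage}
    if not eligible:
        return max(results, key=lambda name: results[name]["class_coverage"])
    return max(eligible, key=lambda name: eligible[name]["f1"])
```

Treating coverage as a hard requirement rather than folding it into a single averaged score keeps the user-facing goal explicit: no topic should be overlooked.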

Publication

Romberg, J. (2023). Mind the User! Measures to More Accurately Evaluate the Practical Value of Active Learning Strategies. Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, 996–1006. https://aclanthology.org/2023.ranlp-1.107/