Validation of Intelligent Systems

Publication highlights

Metrics reloaded: Recommendations for image analysis validation

Main contributions: First comprehensive metric recommendation framework guiding researchers in the problem-aware selection of validation metrics for machine learning algorithms.

Increasing evidence shows that flaws in machine learning (ML) algorithm validation are an underestimated global problem. Particularly in automatic biomedical image analysis, chosen performance metrics often do not reflect the domain interest, thus failing to adequately measure scientific progress and hindering translation of ML techniques into practice. To overcome this, our large international expert consortium created Metrics Reloaded, a comprehensive framework guiding researchers in the problem-aware selection of metrics. Following the convergence of ML methodology across application domains, Metrics Reloaded fosters the convergence of validation methodology. The framework was developed in a multi-stage Delphi process and is based on the novel concept of a problem fingerprint - a structured representation of the given problem that captures all aspects that are relevant for metric selection, from the domain interest to the properties of the target structure(s), data set and algorithm output. Based on the problem fingerprint, users are guided through the process of choosing and applying appropriate validation metrics while being made aware of potential pitfalls. Metrics Reloaded targets image analysis problems that can be interpreted as a classification task at image, object or pixel level, namely image-level classification, object detection, semantic segmentation, and instance segmentation tasks. To improve the user experience, we implemented the framework in the Metrics Reloaded online tool, which also provides a point of access to explore weaknesses, strengths and specific recommendations for the most common validation metrics. The broad applicability of our framework across domains is demonstrated by an instantiation for various biological and medical image analysis use cases.
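
To give a flavor of how a problem fingerprint can drive metric selection, the sketch below encodes a handful of fingerprint properties and maps them to candidate metrics. It is a deliberately simplified illustration: the field names and decision rules are assumptions made for this example and do not reproduce the actual Metrics Reloaded decision logic, which is available through the online tool.

```python
# Minimal sketch of a "problem fingerprint" driving metric selection.
# Field names and decision rules are illustrative assumptions, not the
# official Metrics Reloaded logic (see the online tool for the real rules).
from dataclasses import dataclass


@dataclass
class ProblemFingerprint:
    task: str                # e.g. "image_classification", "semantic_segmentation"
    class_imbalance: bool    # are the target classes strongly imbalanced?
    small_structures: bool   # are target structures tiny relative to the image?
    boundary_critical: bool  # does the domain interest focus on contours?


def recommend_metrics(fp: ProblemFingerprint) -> list[str]:
    """Return candidate validation metrics for a given fingerprint."""
    if fp.task == "image_classification":
        # Accuracy is misleading under class imbalance; prefer prevalence-robust metrics.
        return ["Balanced Accuracy", "MCC"] if fp.class_imbalance else ["Accuracy", "AUROC"]
    if fp.task == "semantic_segmentation":
        metrics = ["Dice Similarity Coefficient"]
        if fp.boundary_critical:
            metrics.append("Normalized Surface Distance")  # complements overlap metrics
        if fp.small_structures:
            metrics.append("per-structure analysis")  # small targets dominate the error budget
        return metrics
    raise ValueError(f"Unsupported task: {fp.task}")


print(recommend_metrics(ProblemFingerprint(
    task="semantic_segmentation", class_imbalance=True,
    small_structures=True, boundary_critical=True)))
```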

Maier-Hein, L., Reinke, A., Godau, P., Tizabi, M. D., Büttner, F., Christodoulou, E., Glocker, B., Isensee, F., Kleesiek, J., Kozubek, M., Reyes, M., Riegler, M. A., Wiesenfarth, M., Kavur, A. E., Sudre, C. H., Baumgartner, M., Eisenmann, M., Heckmann-Nötzel, D., Rädsch, T., ... Jäger, P. F. (2024). Metrics reloaded: Recommendations for image analysis validation. Nature Methods, 21. [pdf]

Understanding metric-related pitfalls in image analysis validation

Main contributions: First reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis, based on a domain-agnostic taxonomy for categorizing these pitfalls.

Validation metrics are key for the reliable tracking of scientific progress and for bridging the current chasm between artificial intelligence (AI) research and its translation into practice. However, increasing evidence shows that particularly in image analysis, metrics are often chosen inadequately in relation to the underlying research problem. This could be attributed to a lack of accessibility of metric-related knowledge: While taking into account the individual strengths, weaknesses, and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multi-stage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides the first reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Focusing on biomedical image analysis but with the potential of transfer to other fields, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. To facilitate comprehension, illustrations and specific examples accompany each pitfall. As a structured body of information accessible to researchers of all levels of expertise, this work enhances global comprehension of a key topic in image analysis validation.
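
One category of pitfalls covered by the taxonomy concerns the influence of target structure size on overlap-based metrics: an identical one-pixel boundary error costs far more Dice similarity coefficient for a small structure than for a large one. The toy NumPy example below reproduces this effect with made-up circular masks; it illustrates the pitfall and is not code from the paper.

```python
# Toy reproduction of a size-related metric pitfall: the same 1-pixel shift
# of the predicted mask hurts the Dice score of a small structure far more
# than that of a large one. All masks below are synthetic examples.
import numpy as np


def dice(pred: np.ndarray, ref: np.ndarray) -> float:
    intersection = np.logical_and(pred, ref).sum()
    return 2.0 * intersection / (pred.sum() + ref.sum())


def disk(shape, center, radius):
    yy, xx = np.ogrid[: shape[0], : shape[1]]
    return (yy - center[0]) ** 2 + (xx - center[1]) ** 2 <= radius ** 2


shape = (100, 100)
for radius in (3, 30):                          # small vs. large target structure
    reference = disk(shape, (50, 50), radius)
    prediction = disk(shape, (50, 51), radius)  # identical 1-pixel shift in both cases
    print(f"radius={radius:2d}  DSC={dice(prediction, reference):.3f}")
# Approximate output: the large structure keeps a near-perfect DSC (~0.98),
# while the small one drops to ~0.76 despite the identical boundary error.
```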

Reinke, A., Tizabi, M. D., Baumgartner, M., Eisenmann, M., Heckmann-Nötzel, D., Kavur, A. E., ... Maier-Hein, L. (2024). Understanding metric-related pitfalls in image analysis validation. Nature Methods. [pdf]

Reinke, A., Tizabi, M. D., Sudre, C. H., Eisenmann, M., Rädsch, T., Baumgartner, M., ... Maier-Hein, L. (2021). Common limitations of image processing metrics: A picture story. arXiv preprint. [pdf]

Confidence Intervals Uncovered: Are We Ready for Real-World Medical Imaging AI?

Main contributions: First large-scale analysis of medical image processing papers, demonstrating that current publications typically do not provide sufficient evidence to determine which models could potentially be translated into clinical practice.

Medical imaging is spearheading the AI transformation of healthcare. Performance reporting is key to determining which methods should be translated into clinical practice. Frequently, broad conclusions are simply derived from mean performance values. In this paper, we argue that this common practice is often a misleading simplification as it ignores performance variability. Our contribution is threefold. (1) Analyzing all MICCAI segmentation papers (n = 221) published in 2023, we first observe that more than 50% of papers do not assess performance variability at all. Moreover, only one (0.5%) paper reported confidence intervals (CIs) for model performance. (2) To address the reporting bottleneck, we show that the unreported standard deviation (SD) in segmentation papers can be approximated by a second-order polynomial function of the mean Dice similarity coefficient (DSC). Based on external validation data from 56 previous MICCAI challenges, we demonstrate that this approximation can accurately reconstruct the CI of a method using information provided in publications. (3) Finally, we reconstructed 95% CIs around the mean DSC of MICCAI 2023 segmentation papers. The median CI width was 0.03, which is three times larger than the median performance gap between the first- and second-ranked methods. For more than 60% of papers, the mean performance of the second-ranked method was within the CI of the first-ranked method. We conclude that current publications typically do not provide sufficient evidence to determine which models could potentially be translated into clinical practice.
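
To make the reconstruction recipe from the abstract concrete, the sketch below approximates the unreported SD as a second-order polynomial of the mean DSC and then forms a standard 95% confidence interval for the mean. The polynomial coefficients and the test-set size are placeholders chosen for illustration, not the values fitted or reported in the paper.

```python
# Sketch of the CI-reconstruction idea described in the abstract: approximate
# the unreported SD from the mean DSC with a second-order polynomial, then
# build a 95% CI for the mean. The coefficients below are PLACEHOLDERS, not
# the values actually fitted on the 56 MICCAI challenges in the paper.
import math


def approx_sd(mean_dsc: float, coeffs=(-0.4, 0.3, 0.1)) -> float:
    """SD ~ a * mean^2 + b * mean + c (hypothetical coefficients a, b, c)."""
    a, b, c = coeffs
    return max(a * mean_dsc ** 2 + b * mean_dsc + c, 0.0)


def reconstruct_ci(mean_dsc: float, n_cases: int, z: float = 1.96):
    """95% CI of the mean DSC given the approximated SD and the test-set size."""
    half_width = z * approx_sd(mean_dsc) / math.sqrt(n_cases)
    return mean_dsc - half_width, mean_dsc + half_width


low, high = reconstruct_ci(mean_dsc=0.85, n_cases=30)  # hypothetical example numbers
print(f"reconstructed 95% CI: [{low:.3f}, {high:.3f}] (width {high - low:.3f})")
```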

Christodoulou, E., Reinke, A., Houhou, R., Kalinowski, P., Erkan, S., Sudre, C. H., ... & Maier-Hein, L. (2024). Confidence intervals uncovered: Are we ready for real-world medical imaging AI?. In International Conference on Medical Image Computing and Computer-Assisted Intervention - MICCAI 2024 (pp. 124-132). [pdf]

Quality Assured: Rethinking Annotation Strategies in Imaging AI

Main contributions: Comprehensive analysis of image annotation strategies revealing that (1) professional annotation companies outperform crowdsourcing platforms such as Amazon Mechanical Turk, (2) investment in high-quality instructions should be preferred over investment in internal quality assurance, and (3) internal quality assurance should focus on images with specific characteristics.

This paper does not describe a novel method. Instead, it studies an essential foundation for reliable benchmarking and ultimately real-world application of AI-based image analysis: generating high-quality reference annotations. Previous research has focused on crowdsourcing as a means of outsourcing annotations. However, little attention has so far been given to annotation companies, specifically regarding their internal quality assurance (QA) processes. Therefore, our aim is to evaluate the influence of QA employed by annotation companies on annotation quality and devise methodologies for maximizing data annotation efficacy. Based on a total of 57,648 instance segmented images obtained from 924 annotators and 34 QA workers from four annotation companies and Amazon Mechanical Turk (MTurk), we derived the following insights: (1) Annotation companies perform better in terms of both quantity and quality compared to the widely used platform MTurk. (2) Annotation companies’ internal QA only provides marginal improvements, if any. However, improving labeling instructions instead of investing in QA can substantially boost annotation performance. (3) The benefit of internal QA depends on specific image characteristics. Our work could enable researchers to derive substantially more value from a fixed annotation budget and change the way annotation companies conduct internal QA.

Rädsch, T., Reinke, A., Weru, V., Tizabi, M. D., Heller, N., Isensee, F., ... & Maier-Hein, L. (2025). Quality Assured: Rethinking Annotation Strategies in Imaging AI. In European Conference on Computer Vision - ECCV 2024 (pp. 52-69). [pdf]

Why is the winner the best?

Main contributions: First comprehensive multi-center study of 80 international benchmarking competitions, investigating winning-solution characteristics and common participation strategies.

International benchmarking competitions have become fundamental for the comparative performance assessment of image analysis methods. However, little attention has been given to investigating what can be learnt from these competitions. Do they really generate scientific progress? What are common and successful participation strategies? What makes a solution superior to a competing method? To address this gap in the literature, we performed a multi-center study with all 80 competitions that were conducted in the scope of IEEE ISBI 2021 and MICCAI 2021. Statistical analyses performed based on comprehensive descriptions of the submitted algorithms linked to their rank as well as the underlying participation strategies revealed common characteristics of winning solutions. These typically include the use of multi-task learning (63%) and/or multi-stage pipelines (61%), and a focus on augmentation (100%), image preprocessing (97%), data curation (79%), and postprocessing (66%). The "typical" lead of a winning team is a computer scientist with a doctoral degree, five years of experience in biomedical image analysis, and four years of experience in deep learning. Two core general development strategies stood out for highly-ranked teams: the reflection of the metrics in the method design and the focus on analyzing and handling failure cases. According to the organizers, 43% of the winning algorithms exceeded the state of the art but only 11% completely solved the respective domain problem. The insights of our study could help researchers (1) improve algorithm development strategies when approaching new problems, and (2) focus on open research questions revealed by this work.

Eisenmann, M., Reinke, A., Weru, V., Tizabi, M. D., Isensee, F., Adler, T. J., ... & Maier-Hein, L. (2023). Why is the winner the best?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition - CVPR 2023 (pp. 19955-19966). [pdf]

Labelling instructions matter in biomedical image analysis

Main contributions: First systematic study on labelling instructions and their impact on annotation quality in biomedical image analysis.

Biomedical image analysis algorithm validation depends on high-quality annotation of reference datasets, for which labelling instructions are key. Despite their importance, their optimization remains largely unexplored. Here we present a systematic study of labelling instructions and their impact on annotation quality in the field. Through comprehensive examination of professional practice and international competitions registered at the Medical Image Computing and Computer Assisted Intervention Society, the largest international society in the biomedical imaging field, we uncovered a discrepancy between annotators' needs for labelling instructions and their current quality and availability. On the basis of an analysis of 14,040 images annotated by 156 annotators from four professional annotation companies and 708 Amazon Mechanical Turk crowdworkers using instructions with different information density levels, we further found that including exemplary images substantially boosts annotation performance compared with text-only descriptions, while solely extending text descriptions does not. Finally, professional annotators consistently outperform Amazon Mechanical Turk crowdworkers. Our study raises awareness of the need for quality standards in biomedical image analysis labelling instructions.

Rädsch, T., Reinke, A., Weru, V., Tizabi, M. D., Schreck, N., Kavur, A. E., ... & Maier-Hein, L. (2023). Labelling instructions matter in biomedical image analysis. Nature Machine Intelligence, 5(3), 273-283. [pdf]

Beyond rankings: Learning (more) from algorithm validation

Main contributions: Introduction of a statistical framework for learning from challenges, instantiated for the task of instrument instance segmentation in laparoscopic videos. The framework uses semantic metadata annotations as the foundation for a strength-weakness analysis based on General Linear Mixed Models (GLMM).

Challenges have become the state-of-the-art approach to benchmark image analysis algorithms in a comparative manner. While the validation on identical data sets was a great step forward, results analysis is often restricted to pure ranking tables, leaving relevant questions unanswered. Specifically, little effort has been put into systematically investigating what characterizes images in which state-of-the-art algorithms fail. To address this gap in the literature, we (1) present a statistical framework for learning from challenges and (2) instantiate it for the specific task of instrument instance segmentation in laparoscopic videos. Our framework relies on the semantic metadata annotation of images, which serves as foundation for a General Linear Mixed Models (GLMM) analysis. Based on 51,542 metadata annotations performed on 2,728 images, we applied our approach to the results of the Robust Medical Instrument Segmentation (ROBUST-MIS) challenge 2019 and revealed underexposure, motion and occlusion of instruments as well as the presence of smoke or other objects in the background as major sources of algorithm failure. Our subsequent method development, tailored to the specific remaining issues, yielded a deep learning model with state-of-the-art overall performance and specific strengths in the processing of images in which previous methods tended to fail. Due to the objectivity and generic applicability of our approach, it could become a valuable tool for validation in the field of medical image analysis and beyond.
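
For readers who want a concrete picture of such an analysis, the sketch below fits a mixed model that relates per-image segmentation quality to semantic metadata, with a random intercept per algorithm. It is a simplified stand-in for the paper's GLMM-based analysis: the data file, the column names, and the use of a plain linear mixed model via statsmodels are assumptions made for illustration.

```python
# Simplified stand-in for a GLMM-style strength-weakness analysis: regress
# per-image segmentation quality on semantic image metadata, with a random
# intercept per algorithm. File name and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

# Expected (hypothetical) columns: dsc, underexposure, motion, occlusion,
# smoke, algorithm -- one row per image and participating algorithm.
df = pd.read_csv("robust_mis_metadata.csv")

model = smf.mixedlm(
    "dsc ~ underexposure + motion + occlusion + smoke",  # fixed effects: image conditions
    data=df,
    groups=df["algorithm"],  # random intercept per participating algorithm
)
result = model.fit()
print(result.summary())  # strongly negative coefficients flag conditions that hurt performance
```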

Roß, T., Bruno, P., Reinke, A., Wiesenfarth, M., Koeppel, L., Full, P. M., ... & Maier-Hein, L. (2023). Beyond rankings: Learning (more) from algorithm validation. Medical image analysis, 86, 102765. [pdf]