Validation of Intelligent Systems

Publication highlights

Metrics reloaded: Recommendations for image analysis validation

Infographic illustrating pitfalls in validation metrics for intelligent systems, such as inappropriate metric choices. It introduces a problem-driven metric selection framework and a user-guidance tool for applying metrics effectively in various scenarios.

Main contributions: First comprehensive recommendation framework guiding researchers in the problem-aware selection of validation metrics for machine learning algorithms.

Increasing evidence shows that flaws in machine learning (ML) algorithm validation are an underestimated global problem. Particularly in automatic biomedical image analysis, chosen performance metrics often do not reflect the domain interest, thus failing to adequately measure scientific progress and hindering translation of ML techniques into practice. To overcome this, our large international expert consortium created Metrics Reloaded, a comprehensive framework guiding researchers in the problem-aware selection of metrics. Following the convergence of ML methodology across application domains, Metrics Reloaded fosters the convergence of validation methodology. The framework was developed in a multi-stage Delphi process and is based on the novel concept of a problem fingerprint - a structured representation of the given problem that captures all aspects that are relevant for metric selection, from the domain interest to the properties of the target structure(s), data set and algorithm output. Based on the problem fingerprint, users are guided through the process of choosing and applying appropriate validation metrics while being made aware of potential pitfalls. Metrics Reloaded targets image analysis problems that can be interpreted as a classification task at image, object or pixel level, namely image-level classification, object detection, semantic segmentation, and instance segmentation tasks. To improve the user experience, we implemented the framework in the Metrics Reloaded online tool, which also provides a point of access to explore weaknesses, strengths and specific recommendations for the most common validation metrics. The broad applicability of our framework across domains is demonstrated by an instantiation for various biological and medical image analysis use cases.
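The problem fingerprint is essentially a structured set of problem properties that drives the metric recommendation. Below is a minimal, hypothetical sketch of this idea in Python; the field names and the toy selection rules are illustrative simplifications, not the fingerprint items or decision rules defined by the framework or its online tool.

```python
from dataclasses import dataclass

# Minimal, illustrative sketch of a "problem fingerprint" as a data structure.
# Field names and the toy selection rules below are hypothetical simplifications.

@dataclass
class ProblemFingerprint:
    task_category: str            # "image-level classification", "object detection",
                                  # "semantic segmentation", or "instance segmentation"
    high_class_imbalance: bool    # data-set property relevant for metric choice
    small_target_structures: bool # property of the target structure(s)
    reference_uncertain: bool     # e.g., high inter-rater variability in annotations

def candidate_metrics(fp: ProblemFingerprint) -> list[str]:
    """Return a (toy) list of candidate validation metrics for a fingerprint."""
    if fp.task_category == "image-level classification":
        # Plain accuracy is misleading under class imbalance.
        return ["Balanced Accuracy", "AUROC"] if fp.high_class_imbalance else ["Accuracy", "AUROC"]
    if fp.task_category in ("semantic segmentation", "instance segmentation"):
        metrics = ["Dice Similarity Coefficient (DSC)"]
        if fp.small_target_structures or fp.reference_uncertain:
            # Boundary-based metrics complement overlap-based ones for small or
            # ambiguously delineated structures.
            metrics.append("Normalized Surface Distance (NSD)")
        return metrics
    return ["Average Precision (AP)"]  # object detection

print(candidate_metrics(ProblemFingerprint("semantic segmentation", True, True, False)))
```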

Maier-Hein, L., Reinke, A., Godau, P., Tizabi, M. D., Büttner, F., Christodoulou, E., Glocker, B., Isensee, F., Kleesiek, J., Kozubek, M., Reyes, M., Riegler, M. A., Wiesenfarth, M., Kavur, A. E., Sudre, C. H., Baumgartner, M., Eisenmann, M., Heckmann-Nötzel, D., Rädsch, T., ... Jäger, P. F. (2024). Metrics reloaded: Recommendations for image analysis validation. Nature Methods, 21. [pdf]

Understanding metric-related pitfalls in image analysis validation

The image compares two sets of medical imaging results: one shows actual images and the other displays model predictions. Visual markers highlight differences between real data and the predictions, illustrating the effectiveness of the model in detecting specific features within the images.

Main contributions: First reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis based on a domain-agnostic taxonomy to categorize pitfalls.

Validation metrics are key for the reliable tracking of scientific progress and for bridging the current chasm between artificial intelligence (AI) research and its translation into practice. However, increasing evidence shows that particularly in image analysis, metrics are often chosen inadequately in relation to the underlying research problem. This could be attributed to a lack of accessibility of metric-related knowledge: While taking into account the individual strengths, weaknesses, and limitations of validation metrics is a critical prerequisite to making educated choices, the relevant knowledge is currently scattered and poorly accessible to individual researchers. Based on a multi-stage Delphi process conducted by a multidisciplinary expert consortium as well as extensive community feedback, the present work provides the first reliable and comprehensive common point of access to information on pitfalls related to validation metrics in image analysis. Focusing on biomedical image analysis but with the potential of transfer to other fields, the addressed pitfalls generalize across application domains and are categorized according to a newly created, domain-agnostic taxonomy. To facilitate comprehension, illustrations and specific examples accompany each pitfall. As a structured body of information accessible to researchers of all levels of expertise, this work enhances global comprehension of a key topic in image analysis validation.
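As a toy illustration of the kind of pitfall the paper catalogues (the example below is not taken from the paper's figures), the following sketch shows how an overlap-based metric such as the Dice similarity coefficient penalizes the same one-pixel boundary error far more strongly for a small structure than for a large one.

```python
import numpy as np

# Toy illustration of one well-known pitfall: overlap-based metrics such as the DSC
# penalize the same absolute boundary error much more severely for small structures.

def dice(gt: np.ndarray, pred: np.ndarray) -> float:
    intersection = np.logical_and(gt, pred).sum()
    return 2 * intersection / (gt.sum() + pred.sum())

def disk(radius: int, size: int = 101) -> np.ndarray:
    y, x = np.ogrid[:size, :size]
    return (x - size // 2) ** 2 + (y - size // 2) ** 2 <= radius ** 2

for r in (40, 4):                    # large vs. small target structure
    gt, pred = disk(r), disk(r - 1)  # prediction off by one pixel at the boundary
    print(f"radius {r:2d}: DSC = {dice(gt, pred):.3f}")
```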

Reinke, A., Tizabi, M. D., Baumgartner, M., Eisenmann, M., Heckmann-Nötzel, D., Kavur, A. E., ... Maier-Hein, L. (2024). Understanding metric-related pitfalls in image analysis validation. Nature Methods. [pdf]

Reinke, A., Tizabi, M. D., Sudre, C. H., Eisenmann, M., Rädsch, T., Baumgartner, M., ... Maier-Hein, L. (2021). Common limitations of image processing metrics: A picture story. arXiv [pdf]

Confidence Intervals Uncovered: Are We Ready for Real-World Medical Imaging AI?

The image presents a comparison of methods in a results table featuring DSC and HD95 values. It highlights a proposed method with higher scores and visualizes two scenarios: one with a narrow confidence interval (desired) and another with a wide confidence interval, indicating uncertainty in the results.

Main contributions: First large-scale analysis of medical image processing papers demonstrating that current publications typically do not provide sufficient evidence to support which models could potentially be translated into clinical practice.

Medical imaging is spearheading the AI transformation of healthcare. Performance reporting is key to determine which methods should be translated into clinical practice. Frequently, broad conclusions are simply derived from mean performance values. In this paper, we argue that this common practice is often a misleading simplification as it ignores performance variability. Our contribution is threefold. (1) Analyzing all MICCAI segmentation papers (n = 221) published in 2023, we first observe that more than 50% of papers do not assess performance variability at all. Moreover, only one (0.5%) paper reported confidence intervals (CIs) for model performance. (2) To address the reporting bottleneck, we show that the unreported standard deviation (SD) in segmentation papers can be approximated by a second-order polynomial function of the mean Dice similarity coefficient (DSC). Based on external validation data from 56 previous MICCAI challenges, we demonstrate that this approximation can accurately reconstruct the CI of a method using information provided in publications. (3) Finally, we reconstructed 95% CIs around the mean DSC of MICCAI 2023 segmentation papers. The median CI width was 0.03 which is three times larger than the median performance gap between the first and second ranked method. For more than 60% of papers, the mean performance of the second-ranked method was within the CI of the first-ranked method. We conclude that current publications typically do not provide sufficient evidence to support which models could potentially be translated into clinical practice.
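The reconstruction idea in step (2) can be summarized in a few lines: approximate the unreported SD from the reported mean DSC via a second-order polynomial and plug it into the standard normal-approximation CI for the mean. The sketch below uses placeholder polynomial coefficients; the actual coefficients are estimated from MICCAI challenge data in the paper and are not reproduced here.

```python
import math

# Illustrative sketch of the CI reconstruction idea (placeholder coefficients!).

def approximate_sd(mean_dsc: float, a: float, b: float, c: float) -> float:
    """Second-order polynomial approximation of the unreported SD."""
    return a * mean_dsc**2 + b * mean_dsc + c

def reconstruct_ci(mean_dsc: float, n_test_cases: int, a: float, b: float, c: float):
    """95% CI for the mean DSC (normal approximation) using the approximated SD."""
    sd = approximate_sd(mean_dsc, a, b, c)
    half_width = 1.96 * sd / math.sqrt(n_test_cases)
    return mean_dsc - half_width, mean_dsc + half_width

# Example with placeholder coefficients (a, b, c) and a test set of 50 cases.
print(reconstruct_ci(0.85, 50, a=-0.3, b=0.2, c=0.15))
```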

Christodoulou, E., Reinke, A., Houhou, R., Kalinowski, P., Erkan, S., Sudre, C. H., ... & Maier-Hein, L. (2024). Confidence intervals uncovered: Are we ready for real-world medical imaging AI?. In International Conference on Medical Image Computing and Computer-Assisted Intervention - MICCAI 2024 (pp. 124-132). [pdf]

Quality Assured: Rethinking Annotation Strategies in Imaging AI

A flowchart outlines the process of image annotation and quality assurance. It highlights key statistics: 924 annotators, 57,648 instance segmentation masks, 57,636 metadata annotations, and 34 QA workers. Three research questions focus on the impact of annotation provider choices and quality assurance on annotation outcomes.

Main contributions: Comprehensive analysis of image annotation strategies revealing that (1) professional annotation companies outperform crowdsourcing platforms such as Amazon Mechanical Turk, (2) investment in high-quality labelling instructions should be preferred over investment in internal quality assurance, and (3) internal quality assurance should focus on images with specific characteristics.

This paper does not describe a novel method. Instead, it studies an essential foundation for reliable benchmarking and ultimately real-world application of AI-based image analysis: generating high-quality reference annotations. Previous research has focused on crowdsourcing as a means of outsourcing annotations. However, little attention has so far been given to annotation companies, specifically regarding their internal quality assurance (QA) processes. Therefore, our aim is to evaluate the influence of QA employed by annotation companies on annotation quality and devise methodologies for maximizing data annotation efficacy. Based on a total of 57,648 instance segmented images obtained from a total of 924 annotators and 34 QA workers from four annotation companies and Amazon Mechanical Turk (MTurk), we derived the following insights: (1) Annotation companies perform better both in terms of quantity and quality compared to the widely used platform MTurk. (2) Annotation companies’ internal QA only provides marginal improvements, if any. However, improving labeling instructions instead of investing in QA can substantially boost annotation performance. (3) The benefit of internal QA depends on specific image characteristics. Our work could enable researchers to derive substantially more value from a fixed annotation budget and change the way annotation companies conduct internal QA.

Rädsch, T., Reinke, A., Weru, V., Tizabi, M. D., Heller, N., Isensee, F., ... & Maier-Hein, L. (2025). Quality Assured: Rethinking Annotation Strategies in Imaging AI. In European Conference on Computer Vision - ECCV 2024 (pp. 52-69). [pdf]

Why is the winner the best?

A grid of various images representing different projects and technologies in the field of medical imaging research. Each project is identified by a label, showcasing innovations and methodologies. A question mark and icons suggest engaging with a community or platform for further exploration and discussion.

Main contributions: First comprehensive multi-center study of 80 international benchmarking competitions, investigating the characteristics of winning solutions and common participation strategies.

International benchmarking competitions have become fundamental for the comparative performance assessment of image analysis methods. However, little attention has been given to investigating what can be learnt from these competitions. Do they really generate scientific progress? What are common and successful participation strategies? What makes a solution superior to a competing method? To address this gap in the literature, we performed a multi-center study with all 80 competitions that were conducted in the scope of IEEE ISBI 2021 and MICCAI 2021. Statistical analyses performed based on comprehensive descriptions of the submitted algorithms linked to their rank as well as the underlying participation strategies revealed common characteristics of winning solutions. These typically include the use of multi-task learning (63%) and/or multi-stage pipelines (61%), and a focus on augmentation (100%), image preprocessing (97%), data curation (79%), and postprocessing (66%). The "typical" lead of a winning team is a computer scientist with a doctoral degree, five years of experience in biomedical image analysis, and four years of experience in deep learning. Two core general development strategies stood out for highly-ranked teams: the reflection of the metrics in the method design and the focus on analyzing and handling failure cases. According to the organizers, 43% of the winning algorithms exceeded the state of the art but only 11% completely solved the respective domain problem. The insights of our study could help researchers (1) improve algorithm development strategies when approaching new problems, and (2) focus on open research questions revealed by this work.

Eisenmann, M., Reinke, A., Weru, V., Tizabi, M. D., Isensee, F., Adler, T. J., ... & Maier-Hein, L. (2023). Why is the winner the best?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition - CVPR 2023 (pp. 19955-19966). [pdf]

Labelling instructions matter in biomedical image analysis

This infographic illustrates research on the discrepancy between professional annotators' needs for labeling instructions and their availability. It examines the impact of varying instruction types on annotation quality through surveys and experiments involving professional annotators and crowdworkers, detailing data collection and analysis methods.

Main contributions: First systematic study on labelling instructions and their impact on annotation quality in biomedical image analysis.

Biomedical image analysis algorithm validation depends on high-quality annotation of reference datasets, for which labelling instructions are key. Despite their importance, their optimization remains largely unexplored. Here we present a systematic study of labelling instructions and their impact on annotation quality in the field. Through comprehensive examination of professional practice and international competitions registered at the Medical Image Computing and Computer Assisted Intervention Society, the largest international society in the biomedical imaging field, we uncovered a discrepancy between annotators' needs for labelling instructions and their current quality and availability. On the basis of an analysis of 14,040 images annotated by 156 annotators from four professional annotation companies and 708 Amazon Mechanical Turk crowdworkers using instructions with different information density levels, we further found that including exemplary images substantially boosts annotation performance compared with text-only descriptions, while solely extending text descriptions does not. Finally, professional annotators consistently outperform Amazon Mechanical Turk crowdworkers. Our study raises awareness of the need for quality standards in biomedical image analysis labelling instructions.

Rädsch, T., Reinke, A., Weru, V., Tizabi, M. D., Schreck, N., Kavur, A. E., ... & Maier-Hein, L. (2023). Labelling instructions matter in biomedical image analysis. Nature Machine Intelligence, 5(3), 273-283. [pdf]

Beyond rankings: Learning (more) from algorithm validation

Four-panel infographic illustrating a data analysis workflow. Panels include: 1) Challenge Organization with a podium graphic; 2) Semantic Meta Data Annotation showing checked and unchecked characteristics; 3) Generalized Linear Mixed Model analysis with a graph displaying log odds; 4) Tailored Algorithm Improvement featuring a box plot and arrow indicating progress.

Main contributions: Introduction of a statistical framework for learning from challenges, instantiated for the task of instrument instance segmentation in laparoscopic videos. The framework uses semantic metadata annotations to conduct a generalized linear mixed model (GLMM)-based strength-weakness analysis.

Challenges have become the state-of-the-art approach to benchmark image analysis algorithms in a comparative manner. While the validation on identical data sets was a great step forward, results analysis is often restricted to pure ranking tables, leaving relevant questions unanswered. Specifically, little effort has been put into the systematic investigation of what characterizes images in which state-of-the-art algorithms fail. To address this gap in the literature, we (1) present a statistical framework for learning from challenges and (2) instantiate it for the specific task of instrument instance segmentation in laparoscopic videos. Our framework relies on the semantic metadata annotation of images, which serves as the foundation for a generalized linear mixed model (GLMM) analysis. Based on 51,542 metadata annotations performed on 2,728 images, we applied our approach to the results of the Robust Medical Instrument Segmentation (ROBUST-MIS) challenge 2019 and revealed underexposure, motion and occlusion of instruments as well as the presence of smoke or other objects in the background as major sources of algorithm failure. Our subsequent method development, tailored to the specific remaining issues, yielded a deep learning model with state-of-the-art overall performance and specific strengths in the processing of images in which previous methods tended to fail. Due to the objectivity and generic applicability of our approach, it could become a valuable tool for validation in the field of medical image analysis and beyond.
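A strength-weakness analysis of this kind could, in principle, be set up as follows; the column names and model specification below are assumptions for illustration and do not reproduce the exact GLMM used in the paper.

```python
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Illustrative sketch only: column names ("failure", "underexposure", ...) and the
# model specification are assumptions, not the exact GLMM from the paper.
# "failure" is a binary per-image, per-algorithm outcome; image characteristics from
# the semantic metadata enter as fixed effects, and algorithm identity is modelled
# as a random effect (one variance component).

df = pd.read_csv("robust_mis_metadata.csv")  # hypothetical table of metadata annotations

model = BinomialBayesMixedGLM.from_formula(
    "failure ~ underexposure + motion + occlusion + smoke + background_objects",
    vc_formulas={"algorithm": "0 + C(algorithm)"},  # random intercept per algorithm
    data=df,
)
result = model.fit_vb()  # variational Bayes fit of the Bayesian mixed GLM
print(result.summary())  # fixed-effect coefficients ~ log-odds of algorithm failure
```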

Roß, T., Bruno, P., Reinke, A., Wiesenfarth, M., Koeppel, L., Full, P. M., ... & Maier-Hein, L. (2023). Beyond rankings: Learning (more) from algorithm validation. Medical Image Analysis, 86, 102765. [pdf]
