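Comparative benchmarking of failure detection methods in medical image segmentation: Unveiling the role of confidence aggregation | Literature Digest: Large Vision Models in Medical Imaging

Oldlee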



Comparative benchmarking of failure detection methods in medical image segmentation: Unveiling the role of confidence aggregation











Semantic segmentation is an essential component of medical image analysis research, with recent deep learning algorithms offering out-of-the-box applicability across diverse datasets. Despite these advancements, segmentation failures remain a significant concern for real-world clinical applications, necessitating reliable detection mechanisms. This paper introduces a comprehensive benchmarking framework aimed at evaluating failure detection methodologies within medical image segmentation. Through our analysis, we identify the strengths and limitations of current failure detection metrics, advocating for the risk-coverage analysis as a holistic evaluation approach. Utilizing a collective dataset comprising five public 3D medical image collections, we assess the efficacy of various failure detection strategies under realistic test-time distribution shifts. Our findings highlight the importance of pixel confidence aggregation and we observe superior performance of the pairwise Dice score (Roy et al., 2019) between ensemble predictions, positioning it as a simple and robust baseline for failure detection in medical image segmentation. To promote ongoing research, we make the benchmarking framework available to the community.




To benchmark failure detection methods, we need concise failure detection metrics that fulfill the requirements R1–R3 from Section 2. We compare common metric candidates in Table 1 and choose to perform a risk-coverage analysis as the main evaluation, with the area under the risk-coverage curve (AURC) as a scalar failure detection performance metric, as it fulfills all requirements. The risk-coverage analysis was originally proposed by El-Yaniv and Wiener (2010) and AURC was suggested as a comprehensive failure detection metric for image classification by Jaeger et al. (2023).
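As a rough illustration of the risk-coverage analysis described above, the following sketch computes a risk-coverage curve and its AURC from per-image confidence scores and risks. It assumes that risk is defined as 1 - DSC and that AURC is taken as the mean selective risk over all coverage levels; the function names and toy values are hypothetical and not taken from the paper's framework.

```python
import numpy as np


def risk_coverage_curve(confidence, risk):
    """Return (coverage, selective_risk) arrays for image-level scores."""
    confidence = np.asarray(confidence, dtype=float)
    risk = np.asarray(risk, dtype=float)
    # Rank cases from most to least confident (stable sort keeps tie order).
    order = np.argsort(-confidence, kind="stable")
    sorted_risk = risk[order]
    kept = np.arange(1, len(sorted_risk) + 1)
    coverage = kept / len(sorted_risk)
    # Selective risk: mean risk over the `kept` most confident cases.
    selective_risk = np.cumsum(sorted_risk) / kept
    return coverage, selective_risk


def aurc(confidence, risk):
    """Area under the risk-coverage curve, taken as the mean selective risk."""
    _, selective_risk = risk_coverage_curve(confidence, risk)
    return float(selective_risk.mean())


# Toy example: risk is assumed to be 1 - DSC (illustrative values only).
dsc = np.array([0.95, 0.90, 0.40, 0.85])
confidence = np.array([0.9, 0.8, 0.1, 0.7])
print(aurc(confidence, 1.0 - dsc))
```

A lower AURC means that low-confidence cases coincide with high-risk cases, so rejecting the least confident predictions first removes the worst segmentations.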




In conclusion, our study addresses the pitfalls in existing evaluation protocols for segmentation failure detection by proposing a flexible evaluation pipeline based on a risk-coverage analysis. Using this pipeline, we introduced a benchmark comprising multiple radiological 3D datasets to assess the generalization of many failure detection methods, and found that the pairwise Dice score between ensemble predictions consistently outperforms other methods, serving as a strong baseline for future studies.




In the following sections, we first report the segmentation performances without failure detection in Section 5.1. Then, we describe the main benchmark results, starting with a comparison of pixel confidence aggregation methods (Section 5.2) and extending the scope towards pixel- and image-level methods (Section 5.3). In Section 5.4, we study the effect of alternative failure risk definitions. Finally, we perform a qualitative analysis of the pairwise DSC method, to understand its strengths and weaknesses (Section 5.5).




Fig. 1. Overview of the research questions and contributions of this paper. Based on a formal definition of the image-level failure detection task, we formulate requirements for the evaluation protocol. Existing failure detection metrics are compared and the risk-coverage analysis is identified as a suitable evaluation protocol. We then propose a benchmarking framework for failure detection in medical image segmentation, which includes a diverse pool of 3D medical image datasets. A wide range of relevant methods are compared, including lines of research for image-level confidence and aggregated pixel confidence, which have so far mostly been studied separately.


Fig. 2. Segmentation performance of a single U-Net on the test sets. Boxes show the median and IQR, while whiskers extend to the 5th and 95th percentiles, respectively. Each dataset contains samples drawn from the same distribution as the training set (in-distribution, ID) and samples drawn from a different data distribution (dataset shift) with the same structures to be segmented. Usually, the performance on the in-distribution samples is higher than on the samples with distribution shift, but especially for the Kidney tumor (which lacks dataset shifts) and Covid datasets, there are also several in-distribution failure cases.
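The box statistics named in the Fig. 2 caption (median, IQR, and whiskers at the 5th and 95th percentiles) can be reproduced from per-case scores roughly as follows. This is only a sketch; the function name and the DSC values are made up for illustration.

```python
import numpy as np


def box_summary(scores):
    """Median, IQR box edges, and 5th/95th percentile whiskers of per-case scores."""
    scores = np.asarray(scores, dtype=float)
    p5, q1, median, q3, p95 = np.percentile(scores, [5, 25, 50, 75, 95])
    return {
        "whisker_low": float(p5),
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "whisker_high": float(p95),
        "iqr": float(q3 - q1),
    }


# Hypothetical per-case DSC values for an in-distribution and a shifted subset.
dsc_id = np.array([0.93, 0.91, 0.95, 0.88, 0.35, 0.92])
dsc_shift = np.array([0.80, 0.62, 0.85, 0.10, 0.77, 0.71])
print(box_summary(dsc_id)["median"], box_summary(dsc_shift)["median"])
```

When plotting with matplotlib, passing `whis=(5, 95)` to `boxplot` draws whiskers at the same percentiles.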


Fig. 3. Comparison of aggregation methods from Section 4.4.2 in terms of AURC scores for all datasets (lower is better). The experiments are named as "prediction model + confidence method" and each of them was repeated using 5 folds. Colored markers denote AURC values achieved by the methods, while gray marks above/below them are AURC values for random/optimal confidence rankings (which differ between the models trained on different folds; see Section 4.1). Pairwise DSC consistently scores best, but does not apply to single network outputs. Aggregation methods based on regression forests (RF) also show performance gains compared to the mean PE baseline, but fail catastrophically on the prostate dataset, possibly due to the small training set size. PE: predictive entropy. RF: regression forest.
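For orientation, the mean PE baseline mentioned in the caption can be sketched as follows: compute the predictive entropy of the class probabilities at every pixel and average it over the image. This assumes softmax probabilities with the class axis first and uses the negative mean entropy as the confidence score; the paper's exact implementation may differ, and all names here are hypothetical.

```python
import numpy as np


def predictive_entropy(probs, eps=1e-12):
    """Pixel-wise entropy of class probabilities with shape (C, ...)."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    return -(probs * np.log(probs)).sum(axis=0)


def mean_pe_confidence(probs):
    """Image-level confidence: negative mean of the pixel-wise entropy."""
    return -float(predictive_entropy(probs).mean())


# Two toy 2-class "images" of 4x4 pixels: one confident, one maximally uncertain.
confident = np.stack([np.full((4, 4), 0.99), np.full((4, 4), 0.01)])
uncertain = np.stack([np.full((4, 4), 0.5), np.full((4, 4), 0.5)])
print(mean_pe_confidence(confident), mean_pe_confidence(uncertain))
```

Plain averaging weights every pixel equally, so a small but badly segmented structure can be drowned out by a large, confidently predicted background. This is one reason the choice of aggregation matters.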


Fig. 4. Rankings by average AURC (top, lower ranks are better) and the underlying AURC scores (bottom; lower is better) for methods from Section 4.4.3 and all datasets. The experiments are named as "prediction model + confidence method" and each of them was repeated using 5 folds. In the lower diagram, colored dots denote AURC values achieved by the methods, while gray marks above/below them are AURC values for random/optimal confidence rankings (which differ between the models trained on different folds; see Section 4.1). Ensemble + pairwise DSC is the best method overall, often achieving close to optimal AURC scores. The ranking on the prostate dataset is an outlier, which could be due to the small training set size. PE: predictive entropy.
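The gray reference marks for random and optimal confidence rankings depend only on the per-case risks of a given model, which is why they differ between folds. A minimal sketch, assuming AURC is the mean selective risk over all coverage levels: the optimal value comes from ranking cases by their true risk, and the expected value for a uniformly random ranking equals the mean risk. The function names are hypothetical.

```python
import numpy as np


def aurc(confidence, risk):
    """Mean selective risk when cases are kept from most to least confident."""
    order = np.argsort(-np.asarray(confidence, dtype=float), kind="stable")
    sorted_risk = np.asarray(risk, dtype=float)[order]
    return float((np.cumsum(sorted_risk) / np.arange(1, len(sorted_risk) + 1)).mean())


def aurc_references(risk):
    """Return (optimal, expected_random) AURC for a fixed set of per-case risks."""
    risk = np.asarray(risk, dtype=float)
    # Optimal ranking: confidence decreases exactly as risk increases.
    optimal = aurc(-risk, risk)
    # Random ranking: every selective risk equals the mean risk in expectation.
    expected_random = float(risk.mean())
    return optimal, expected_random


# Toy risks (assumed to be 1 - DSC) for four test cases of one model.
risk = np.array([0.05, 0.10, 0.60, 0.15])
print(aurc_references(risk))
```

A method's AURC can then be read relative to these two bounds: values near the optimal mark indicate an almost perfect confidence ranking for that model.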


Fig. 5. Impact of the choice of segmentation metric as a risk function on the ranking stability, comparing mean DSC (left) and NSD (normalized surface distance, right). Bootstrapping (𝑁 = 500) was used to obtain a distribution of ranks for the results of each fold and the ranking distributions of all folds were accumulated. All ranks across datasets are combined in this figure, where the circle area is proportional to the rank count and the black x-markers indicate median ranks, which were also used to sort the methods. Overall, the ranking distributions are similar for mean DSC and NSD. The variance in the ranking distributions largely originates from combining the rankings across datasets, so for each dataset individually the ranking is more stable (see for example the Covid dataset in Fig. B.12).
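A bootstrapped rank distribution of the kind shown in Fig. 5 could be obtained roughly as sketched below. The sketch assumes that test cases are resampled with replacement and that methods are re-ranked by AURC on every resample; whether this matches the paper's exact procedure is not stated in the caption. Function names and the synthetic data are hypothetical.

```python
import numpy as np


def aurc(confidence, risk):
    """Mean selective risk when cases are kept from most to least confident."""
    order = np.argsort(-np.asarray(confidence, dtype=float), kind="stable")
    sorted_risk = np.asarray(risk, dtype=float)[order]
    return float((np.cumsum(sorted_risk) / np.arange(1, len(sorted_risk) + 1)).mean())


def bootstrap_ranks(confidences, risk, n_boot=500, seed=0):
    """Rank methods by AURC on resampled test sets.

    confidences: dict mapping method name -> per-case confidence array.
    Returns a dict mapping method name -> array of ranks (1 = best).
    """
    risk = np.asarray(risk, dtype=float)
    names = list(confidences)
    rng = np.random.default_rng(seed)
    ranks = np.empty((n_boot, len(names)), dtype=int)
    for b in range(n_boot):
        # Resample test cases with replacement.
        idx = rng.integers(0, len(risk), size=len(risk))
        scores = np.array(
            [aurc(np.asarray(confidences[m])[idx], risk[idx]) for m in names]
        )
        # Lower AURC is better; ties are broken by method order.
        ranks[b] = np.argsort(np.argsort(scores, kind="stable"), kind="stable") + 1
    return {m: ranks[:, j] for j, m in enumerate(names)}


# Synthetic example: an oracle confidence versus pure noise.
rng = np.random.default_rng(1)
risk = rng.random(50)
confidences = {"oracle": -risk, "noise": rng.random(50)}
ranks = bootstrap_ranks(confidences, risk)
print({m: float(np.median(r)) for m, r in ranks.items()})
```

Swapping the risk array (for example, risks derived from NSD instead of mean DSC) and comparing the resulting rank distributions mirrors the left/right comparison in the figure.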


Fig. 6. Qualitative analysis of ensemble predictions on all datasets. For each dataset (rows), an interesting failure case is shown, consisting of (columns from left to right): the reference segmentation, the ensemble prediction and individual predictions of ensemble members (Ensemble #1–5) trained with different random seeds. True mean DSC is reported alongside the pairwise DSC scores. The ensemble predictions often disagree about test cases for which segmentation errors occur, which leads to low pairwise Dice and can be considered a detected failure (rows 1–4). However, there are also cases where the ensemble is confident about a faulty segment, which could result in a silent failure (last two rows).
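The pairwise DSC confidence score discussed here can be sketched as the average Dice overlap between all pairs of ensemble member predictions. The version below assumes integer label maps with class 0 as background, averages over foreground classes, and treats two empty masks as perfect agreement; these conventions and the function names are assumptions rather than the paper's exact definition.

```python
from itertools import combinations

import numpy as np


def dice(a, b):
    """Dice score between two boolean masks; 1.0 if both are empty (assumed convention)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0
    return float(2.0 * np.logical_and(a, b).sum() / denom)


def pairwise_dsc(label_maps, num_classes):
    """Mean Dice over all member pairs and foreground classes (class 0 = background)."""
    pair_scores = []
    for seg_i, seg_j in combinations(label_maps, 2):
        class_scores = [dice(seg_i == c, seg_j == c) for c in range(1, num_classes)]
        pair_scores.append(np.mean(class_scores))
    return float(np.mean(pair_scores))


# Three toy ensemble members; the second one disagrees on one pixel.
members = [
    np.array([[1, 1], [0, 0]]),
    np.array([[1, 0], [0, 0]]),
    np.array([[1, 1], [0, 0]]),
]
print(pairwise_dsc(members, num_classes=2))
```

High agreement among members yields a score near 1. As the last two rows of the figure illustrate, members can also agree on a wrong segmentation, in which case this score stays high and the failure goes undetected.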



Table 1. Comparison of metric candidates for segmentation failure detection. Among those, AURC is the only metric that captures segmentation performance and confidence ranking, which we find necessary for the comprehensive evaluation of a failure detection system. A detailed discussion of the requirements (R1–R3) associated with each column is in Section 2. f-AUROC uses binary failure labels. MAE: mean absolute error. PC: Pearson correlation. SC: Spearman correlation.


Table 2. Summary of datasets used in this study. The #Testing column contains case numbers for each subset of the test set separated by a comma, starting with the in-distribution test split and followed by the shifted "domains". The number of classes includes one count for background.
