Title
Comparative benchmarking of failure detection methods in medical image segmentation: Unveiling the role of confidence aggregation
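Introduction
Medical image segmentation is one of the most widely studied tasks in medical image analysis, and several deep-learning algorithms perform well across a variety of datasets (Isensee et al., 2021). In real-world settings, however, and especially on data from unseen scanners or institutions, the performance of deep-learning models often degrades. This has been observed in segmentation as well as in other image analysis tasks (AlBadawy et al., 2018; Zech et al., 2018; Badgeley et al., 2019; Beede et al., 2020; Campello et al., 2021). Model predictions may therefore be wrong and cannot be trusted blindly.
Faulty segmentations can be identified by manual inspection, but this becomes very time-consuming as image resolution and the complexity of the segmented structures increase, particularly for (radiological) three-dimensional (3D) medical images. The problem is most acute when segmentation is only one step in the automated analysis of a large-scale dataset: manual inspection is then infeasible, yet reliable segmentations are essential. Improving the performance and robustness of segmentation models is one possible remedy. This study focuses on a complementary approach, namely adding failure detection to the segmentation model.
In this study, the goal of failure detection is to automatically identify segmentations that need to be excluded or manually corrected, so that downstream tasks (e.g., volume measurement, radiotherapy planning, large-scale data analysis) can proceed reliably. This requires a scalar, image-level confidence score for each segmentation that indicates how likely the segmentation is to be a failure. Failure detection can also be performed at the class level or the pixel level, but this study concentrates on image-level failure detection because it matters most in practice: if a segmentation is faulty, one has to decide whether to keep the whole prediction for further analysis or to discard it. In some cases parts of a prediction remain useful (for example, some pixels or classes are still segmented correctly), and pixel- or class-level failure detection may then be more valuable. Methodologically, pixel- and class-level methods can still be applied to image-level failure detection, but they require an additional aggregation function, which adds computational complexity.
Failure detection for medical image segmentation has given rise to several lines of research, each approaching the problem differently:
Uncertainty estimation: These methods (Mehrtash et al., 2020) typically aim to provide a calibrated probability of correctness for each pixel prediction. Several studies (Roy et al., 2019; Jungo et al., 2020; Ng et al., 2023) propose aggregating these scores to the class or image level for failure detection.
Out-of-distribution (OOD) detection: These methods (González et al., 2022; Graham et al., 2022) identify samples that deviate from the training distribution, which are assumed to be prone to segmentation failure.
Segmentation quality regression: These methods (Valindria et al., 2017; Robinson et al., 2018; Li et al., 2022) try to predict segmentation metric values directly, without access to the ground truth.
Despite the practical importance of segmentation failure detection and the variety of available methods, progress in the field is currently held back by the following shortcomings in evaluation practice:
Inconsistent task definitions and evaluation metrics: Different methods share the same goal of failure detection, but they use different task definitions and metrics, which makes results hard to compare across studies. Many studies evaluate only proxy tasks such as OOD detection, uncertainty calibration, or segmentation quality regression rather than failure detection itself (Mehrtash et al., 2020; Graham et al., 2022; Zhao et al., 2022; Ouyang et al., 2022). In addition, failure detection metrics are not standardized (Valindria et al., 2017; Wang et al., 2019; Jungo et al., 2020; Kushibar et al., 2022; Ng et al., 2023), and the properties and weaknesses of individual metrics are rarely discussed.
Evaluation restricted to a subset of relevant methods: Failure detection methods can be roughly divided into pixel-level and image-level methods, but existing studies usually consider only one of the two and do not fully explore aggregating pixel-level uncertainty into image-level uncertainty. Some studies (González et al., 2022; Lennartz and Schultz, 2023) compare both families but aggregate only with the simple mean uncertainty, which is prone to bias from the size of the target structure (Jungo et al., 2020; Kahl et al., 2024).
Insufficient dataset diversity and no consideration of dataset shifts: Many studies use datasets covering a single anatomical region (Jungo et al., 2020; Ng et al., 2023) and do not consider changes in the data distribution. Application-specific studies may legitimately focus on a single task or dataset, but this leaves open how well a method generalizes to other datasets and to real-world use. Given that segmentation methods such as nnU-Net (Isensee et al., 2021) can be applied to different datasets with little effort, failure detection methods that work across datasets are particularly worth investigating.
Lack of publicly available code: A few studies release their code, but usually without implementations of the baseline methods (see Appendix A), which harms reproducibility and makes reported baseline performance unreliable.
Contributions of this study: To address these issues, we revisit the task definition and evaluation protocol of failure detection so that they match the needs of practical applications. This allows all relevant methods to be compared systematically and yields a comprehensive benchmark for failure detection in medical image segmentation.
The contributions of this study are as follows (see Fig. 1):
Consolidating existing evaluation protocols: We analyze the shortcomings of existing practice and propose a general and robust evaluation pipeline for failure detection. The pipeline builds on the risk-coverage analysis from the selective classification literature to mitigate the identified evaluation pitfalls.
Introducing a benchmark framework: The framework comprises multiple publicly available 3D radiological datasets to assess how failure detection methods generalize beyond a single dataset. The test sets contain realistic distribution shifts that emulate potential causes of segmentation failure, which allows a more comprehensive evaluation.
Comparing different categories of failure detection methods: Within the benchmark, we compare a range of failure detection methods, including image-level methods and pixel-level methods (subsequently converted to the image level by aggregation). The pairwise Dice score between ensemble predictions (Roy et al., 2019) performs best among all compared methods, so we recommend it as a strong baseline for future studies.
In addition, we release the source code of all experiments, including dataset preparation, segmentation, failure detection method implementations, and evaluation scripts, to support reproducibility and further research in this area.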
Abstract
Semantic segmentation is an essential component of medical image analysis research, with recent deep learning algorithms offering out-of-the-box applicability across diverse datasets. Despite these advancements, segmentation failures remain a significant concern for real-world clinical applications, necessitating reliable detection mechanisms. This paper introduces a comprehensive benchmarking framework aimed at evaluating failure detection methodologies within medical image segmentation. Through our analysis, we identify the strengths and limitations of current failure detection metrics, advocating for the risk-coverage analysis as a holistic evaluation approach. Utilizing a collective dataset comprising five public 3D medical image collections, we assess the efficacy of various failure detection strategies under realistic test-time distribution shifts. Our findings highlight the importance of pixel confidence aggregation and we observe superior performance of the pairwise Dice score (Roy et al., 2019) between ensemble predictions, positioning it as a simple and robust baseline for failure detection in medical image segmentation. To promote ongoing research, we make the benchmarking framework available to the community.
Method
4.1. Evaluation
To benchmark failure detection methods, we need concise failure detection metrics that fulfill the requirements R1–R3 from Section 2. We compare common metric candidates in Table 1 and choose to perform a risk-coverage analysis as the main evaluation, with the area under the risk-coverage curve (AURC) as a scalar failure detection performance metric, as it fulfills all requirements. The risk-coverage analysis was originally proposed by El-Yaniv and Wiener (2010), and AURC was suggested as a comprehensive failure detection metric for image classification by Jaeger et al. (2023).
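The risk-coverage analysis described above can be illustrated with a minimal sketch. It assumes that each test case has a scalar confidence score and a risk value in [0, 1] (for instance 1 − DSC, which is an assumption of this example), and it approximates AURC as the mean selective risk over all coverage levels. The function names `risk_coverage_curve` and `aurc` are hypothetical, and the exact discretization used in the benchmark may differ.

```python
import numpy as np

def risk_coverage_curve(confidence, risk):
    """Return (coverage, selective_risk) arrays for a set of test cases."""
    confidence = np.asarray(confidence, dtype=float)
    risk = np.asarray(risk, dtype=float)
    # Keep the most confident cases first.
    order = np.argsort(-confidence, kind="stable")
    sorted_risk = risk[order]
    n = len(sorted_risk)
    kept = np.arange(1, n + 1)
    coverage = kept / n                            # fraction of cases retained
    selective_risk = np.cumsum(sorted_risk) / kept  # mean risk of retained cases
    return coverage, selective_risk

def aurc(confidence, risk):
    """Area under the risk-coverage curve (lower is better)."""
    _, selective_risk = risk_coverage_curve(confidence, risk)
    return float(selective_risk.mean())

# Toy example: three cases with risk = 1 - DSC (assumed risk definition).
risk = [0.1, 0.5, 0.9]
good_confidence = [0.9, 0.5, 0.1]  # ranks low-risk cases as most confident
bad_confidence = [0.1, 0.5, 0.9]   # ranks high-risk cases as most confident
print(aurc(good_confidence, risk), aurc(bad_confidence, risk))
```

Because AURC averages the risk of the retained cases, it depends on both the segmentation performance and the quality of the confidence ranking. Passing the negated risk as confidence gives the AURC of an optimal ranking for the same predictions, which is one way to obtain a reference value of the kind shown as gray marks in Figs. 3 and 4.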
Conclusion
In conclusion, our study addresses the pitfalls in existing evaluation protocols for segmentation failure detection by proposing a flexible evaluation pipeline based on a risk-coverage analysis. Using this pipeline, we introduced a benchmark comprising multiple radiological 3D datasets to assess the generalization of many failure detection methods, and found that the pairwise Dice score between ensemble predictions consistently outperforms other methods, serving as a strong baseline for future studies.
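As an illustration of the recommended baseline, the following sketch computes an image-level confidence score as the mean pairwise Dice score between the label maps predicted by ensemble members. It is a simplified reading of the idea, not the benchmark's implementation: the function names are hypothetical, Dice is averaged over foreground classes only, and a class that is absent from both predictions is counted as perfect agreement (both are assumptions of this example).

```python
import itertools
import numpy as np

def dice(a, b, label):
    """Dice score between two label maps for a single class label."""
    a_mask = a == label
    b_mask = b == label
    denom = a_mask.sum() + b_mask.sum()
    if denom == 0:
        # Assumption: class absent in both predictions counts as agreement.
        return 1.0
    return 2.0 * np.logical_and(a_mask, b_mask).sum() / denom

def pairwise_dsc_confidence(predictions, num_classes):
    """Mean Dice over all pairs of ensemble members (higher = more confident).

    predictions: list of integer label maps with identical shape.
    num_classes: number of classes including background (label 0).
    """
    scores = []
    for a, b in itertools.combinations(predictions, 2):
        per_class = [dice(a, b, c) for c in range(1, num_classes)]
        scores.append(np.mean(per_class))
    return float(np.mean(scores))

# Toy example with three "ensemble members" on a 2x2 image.
p1 = np.array([[0, 1], [1, 1]])
p2 = np.array([[0, 1], [1, 1]])
p3 = np.array([[0, 1], [0, 0]])
print(pairwise_dsc_confidence([p1, p2, p3], num_classes=2))
```

Low agreement between members yields a low score and flags the case as a likely failure. If all members agree on a wrong segmentation, the score stays high, which corresponds to the silent failures discussed for Fig. 6.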
Results
In the following sections, we first report the segmentation performances without failure detection in Section 5.1. Then, we describe the main benchmark results, starting with a comparison of pixel confidence aggregation methods (Section 5.2) and extending the scope towards pixel- and image-level methods (Section 5.3). In Section 5.4, we study the effect of alternative failure risk definitions. Finally, we perform a qualitative analysis of the pairwise DSC method to understand its strengths and weaknesses (Section 5.5).
Figure
Fig. 1. Overview of the research questions and contributions of this paper. Based on a formal definition of the image-level failure detection task, we formulate requirements for the evaluation protocol. Existing failure detection metrics are compared and the risk-coverage analysis is identified as a suitable evaluation protocol. We then propose a benchmarking framework for failure detection in medical image segmentation, which includes a diverse pool of 3D medical image datasets. A wide range of relevant methods are compared, including lines of research for image-level confidence and aggregated pixel confidence, which have been mostly studied in separation so far.
Fig. 2. Segmentation performance of a single U-Net on the test sets. Boxes show the median and IQR, while whiskers extend to the 5th and 95th percentiles, respectively. Each dataset contains samples drawn from the same distribution as the training set (in-distribution, ID) and samples drawn from a different data distribution (dataset shift) with the same structures to be segmented. Usually, the performance on the in-distribution samples is higher than on the samples with distribution shift, but especially for the Kidney tumor (which lacks dataset shifts) and Covid datasets, there are also several in-distribution failure cases.
Fig. 3. Comparison of aggregation methods from Section 4.4.2 in terms of AURC scores for all datasets (lower is better). The experiments are named as "prediction model + confidence method" and each of them was repeated using 5 folds. Colored markers denote AURC values achieved by the methods, while gray marks above/below them are AURC values for random/optimal confidence rankings (which differ between the models trained on different folds; see Section 4.1). Pairwise DSC consistently scores best, but does not apply to single network outputs. Aggregation methods based on regression forests (RF) also show performance gains compared to the mean PE baseline, but fail catastrophically on the prostate dataset, possibly due to the small training set size. PE: predictive entropy. RF: regression forest.
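The mean PE baseline mentioned in this caption can be sketched as follows, assuming the segmentation model outputs softmax probabilities of shape (classes, *spatial). The per-pixel predictive entropy is averaged over the whole image and negated so that higher values mean higher confidence. The function names and the `eps` constant are choices of this example, not taken from the benchmark code.

```python
import numpy as np

def predictive_entropy(probs, eps=1e-12):
    """Per-pixel predictive entropy for softmax output of shape (C, *spatial)."""
    probs = np.asarray(probs, dtype=float)
    # eps avoids log(0) for one-hot predictions.
    return -np.sum(probs * np.log(probs + eps), axis=0)

def mean_pe_confidence(probs):
    """Image-level confidence: negative mean predictive entropy."""
    return float(-predictive_entropy(probs).mean())

# Toy example: 2 classes on a 2x2 image.
certain = np.stack([np.ones((2, 2)), np.zeros((2, 2))])  # one-hot everywhere
uncertain = np.full((2, 2, 2), 0.5)                      # uniform everywhere
print(mean_pe_confidence(certain), mean_pe_confidence(uncertain))
```

Because the mean runs over all pixels, a small uncertain structure inside a large confident background changes the score only slightly. This is the target-size bias that the source attributes to simple mean aggregation.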
Fig. 4. Rankings by average AURC (top, lower ranks are better) and the underlying AURC scores (bottom; lower is better) for methods from Section 4.4.3 and all datasets. The experiments are named as "prediction model + confidence method" and each of them was repeated using 5 folds. In the lower diagram, colored dots denote AURC values achieved by the methods, while gray marks above/below them are AURC values for random/optimal confidence rankings (which differ between the models trained on different folds; see Section 4.1). Ensemble + pairwise DSC is the best method overall, often achieving close to optimal AURC scores. The ranking on the prostate dataset is an outlier, which could be due to the small training set size. PE: predictive entropy.
Fig. 5. Impact of the choice of segmentation metric as a risk function on the ranking stability, comparing mean DSC (left) and NSD (right). Bootstrapping (N = 500) was used to obtain a distribution of ranks for the results of each fold and the ranking distributions of all folds were accumulated. All ranks across datasets are combined in this figure, where the circle area is proportional to the rank count and the black x-markers indicate median ranks, which were also used to sort the methods. Overall, the ranking distributions are similar for mean DSC and NSD. The variance in the ranking distributions largely originates from combining the rankings across datasets, so for each dataset individually the ranking is more stable (see for example the Covid dataset in Fig. B.12).
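A bootstrap of method ranks of the kind described in this caption could look like the sketch below. It is an illustration under stated assumptions rather than the benchmark's procedure: test cases are resampled with replacement, each method's AURC is recomputed on the resample, and methods are ranked by AURC (rank 1 is best). The helper names, the tie-breaking by method order, and the AURC discretization are all hypothetical.

```python
import numpy as np

def aurc(confidence, risk):
    """Mean selective risk over all coverage levels (lower is better)."""
    order = np.argsort(-np.asarray(confidence, dtype=float), kind="stable")
    sorted_risk = np.asarray(risk, dtype=float)[order]
    kept = np.arange(1, len(sorted_risk) + 1)
    return float((np.cumsum(sorted_risk) / kept).mean())

def bootstrap_ranks(confidences, risk, n_boot=500, seed=0):
    """Return an (n_boot, n_methods) array of ranks, one row per resample.

    confidences: array of shape (n_methods, n_cases).
    risk: array of shape (n_cases,), shared by all methods.
    """
    rng = np.random.default_rng(seed)
    confidences = np.asarray(confidences, dtype=float)
    risk = np.asarray(risk, dtype=float)
    n_cases = risk.shape[0]
    ranks = np.empty((n_boot, confidences.shape[0]), dtype=int)
    for b in range(n_boot):
        # Resample test cases with replacement.
        idx = rng.integers(0, n_cases, size=n_cases)
        scores = np.array([aurc(c[idx], risk[idx]) for c in confidences])
        # Double argsort converts scores to ranks; ties go to the earlier method.
        ranks[b] = np.argsort(np.argsort(scores, kind="stable"), kind="stable") + 1
    return ranks

# Toy example: an oracle ranking versus its exact reverse.
risk = np.linspace(0.05, 0.95, 20)
ranks = bootstrap_ranks([-risk, risk], risk)
print(np.median(ranks, axis=0))
```

The median of each column gives a summary comparable to the black x-markers in the figure, while the spread of each column indicates how stable a method's rank is under resampling.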
Fig. 6. Qualitative analysis of ensemble predictions on all datasets. For each dataset (rows), an interesting failure case is shown, consisting of (columns from left to right): the reference segmentation, the ensemble prediction and individual predictions of ensemble members (Ensemble #1–5) trained with different random seeds. True mean DSC is reported alongside the pairwise DSC scores. The ensemble predictions often disagree about test cases for which segmentation errors occur, which leads to low pairwise Dice and can be considered a detected failure (rows 1–4). However, there are also cases where the ensemble is confident about a faulty segment, which could result in a silent failure (last two rows).
Table
Table 1. Comparison of metric candidates for segmentation failure detection. Among those, AURC is the only metric that captures segmentation performance and confidence ranking, which we find necessary for the comprehensive evaluation of a failure detection system. A detailed discussion of the requirements (R1–R3) associated with each column is in Section 2. f-AUROC uses binary failure labels. MAE: mean absolute error. PC: Pearson correlation. SC: Spearman correlation.
Table 2. Summary of datasets used in this study. The #Testing column contains case numbers for each subset of the test set separated by a comma, starting with the in-distribution test split and followed by the shifted "domains". The number of classes includes one count for background.