Title
Abn-BLIP: Abnormality-aligned Bootstrapping Language-Image Pre-training for pulmonary embolism diagnosis and report generation from CTPA
Introduction
Background on pulmonary embolism and introduction to the Abn-BLIP model

Pulmonary embolism (PE) is a life-threatening condition caused by thromboembolic occlusion of the pulmonary arteries, often leading to severe complications, long-term morbidity, and a high risk of death (Belohlávek et al., 2013). Timely and accurate diagnosis is essential for effective treatment and improved outcomes (Alonso-Martínez et al., 2010; Hendriksen et al., 2017; Cahan et al., 2023). Computed tomography pulmonary angiography (CTPA) remains the gold standard for diagnosing the disease owing to its high sensitivity and specificity (Stein et al., 2006). In high-volume clinical settings, however, CTPA interpretation is time-consuming and labor-intensive, subject to reader variability, and prone to diagnostic delays (Singh et al., 2011).

Recent advances in artificial intelligence for medical imaging have shown considerable promise for improving CTPA-based PE diagnosis (Soffer et al., 2021). Deep learning-based multimodal approaches have been developed to automate embolus detection, quantify clot burden, and stratify patient risk (Huang et al., 2020a; Liu et al., 2020; Zhong et al., 2025b), thereby improving diagnostic efficiency and reducing inter-reader variability. Nevertheless, most existing systems produce only probabilistic predictions with limited interpretability, which constrains their clinical reliability (Huang et al., 2020a,b; Lindenmeyer et al., 2024). Prior work has incorporated vascular spatial structure (Tajbakhsh et al., 2019) or combined arterial segmentation with threshold-based analysis (Pu et al., 2023); although these efforts improved embolus characterization, current solutions remain largely confined to embolism detection and cannot provide a comprehensive CTPA assessment covering, for example, cardiac function, thrombus distribution, and other concomitant thoracic abnormalities.

Vision-language models (VLMs), which integrate imaging with textual descriptions, offer a viable direction for comprehensive CTPA-based PE assessment, improving interpretability and decision support (Wu et al., 2025; Zhong et al., 2025a). Medical VLMs can bridge AI-generated outputs with radiologists' workflows, enabling automated structured reporting and reducing inter-observer variability (Nazi and Peng, 2024; Hartsock and Rasool, 2024; Tanno et al., 2024; Jin et al., 2024). By integrating multimodal information such as clinical scores and patient history, VLMs can support holistic assessment for better patient management and risk stratification (Zhong et al., 2024). Unlike conventional models limited to classification or segmentation, VLMs can generate comprehensive, human-readable reports directly from imaging data, improving transparency and clinical acceptance (Wu et al., 2023; Bai et al., 2024; Huang et al., 2023).

Despite these advantages, general-purpose medical VLMs still underperform in CTPA-based PE assessment (Hager et al., 2024; Zhong et al., 2025a). Trained on heterogeneous datasets spanning multiple imaging modalities, they often lack PE-specific domain expertise and are therefore less sensitive to the subtle radiological findings that are critical for PE diagnosis. They are also limited in handling complex reports and multi-abnormality queries, falling short of radiologist-level integration of visual, textual, and clinical information (Hartsock and Rasool, 2024). The central challenge is to develop a PE-dedicated VLM that combines high diagnostic accuracy with interpretability, adheres to radiologists' reporting conventions, and provides comprehensive decision support through effective multimodal integration.

To address this gap, we propose Abnormality-aligned Bootstrapping Language-Image Pretraining (Abn-BLIP), a PE-dedicated vision-language model that integrates abnormality identification with structured descriptions for CTPA report generation (Fig. 1). Abn-BLIP translates abnormality-specific visual queries into organized diagnostic findings through a multi-stage workflow, improving interpretability, systematizing the assessment, and enhancing the clinical utility of AI-generated radiology reports. The main contributions of this paper are as follows:
- We propose a multi-label abnormality identification module to improve the diagnostic accuracy of CTPA report generation, with particular focus on hierarchical analysis of the pulmonary arterial regions.
- We introduce Abn-QFormer (abnormality-guided Querying Transformer), which uses abnormality-driven queries to aggregate image-text features at the abnormality level, dynamically refines cross-modal retrieval, and supports clinician-style, finding-by-finding examination (a minimal sketch of this querying mechanism follows this list).
- We develop Abnormality-aligned Contrastive Learning (ACL) for fine-grained alignment between radiological features and textual descriptions, strengthening abnormality-level correspondence.
- Following medical diagnostic principles, the framework explicitly models the hierarchical relationship between anatomical regions and abnormal findings, ensuring that the generated CTPA reports are comprehensive, well structured, and clinically meaningful.
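The following is a minimal, illustrative PyTorch sketch of the abnormality-driven querying idea behind Abn-QFormer: a set of learnable per-abnormality query embeddings cross-attends to CTPA image tokens to produce one visual representation per abnormality. This is not the authors' implementation; the class name, dimensions, and layer choices are assumptions for illustration only.

```python
# Conceptual sketch (not the authors' code): 32 learnable abnormality queries
# cross-attend to CTPA image tokens to yield one abnormality-level visual
# embedding per finding, in the spirit of Abn-QFormer.
import torch
import torch.nn as nn

class AbnormalityQuerying(nn.Module):
    def __init__(self, num_abnormalities=32, dim=768, num_heads=8):
        super().__init__()
        # One learnable query embedding per CTPA abnormality.
        self.abn_queries = nn.Parameter(torch.randn(num_abnormalities, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

    def forward(self, image_tokens):
        # image_tokens: (B, N, dim) features from the abnormality-identified image encoder.
        B = image_tokens.size(0)
        q = self.abn_queries.unsqueeze(0).expand(B, -1, -1)   # (B, 32, dim)
        attended, _ = self.cross_attn(q, image_tokens, image_tokens)
        return self.ffn(attended)                              # (B, 32, dim) abnormality-level features

# Usage: feats = AbnormalityQuerying()(torch.randn(2, 196, 768))  # -> (2, 32, 768)
```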
Abstract
Medical imaging plays a pivotal role in modern healthcare, with computed tomography pulmonary angiography (CTPA) being a critical tool for diagnosing pulmonary embolism and other thoracic conditions. However, the complexity of interpreting CTPA scans and generating accurate radiology reports remains a significant challenge. This paper introduces Abn-BLIP (Abnormality-aligned Bootstrapping Language-Image Pretraining), an advanced diagnosis model designed to align abnormal findings to improve the accuracy and comprehensiveness of radiology reports. By leveraging learnable queries and cross-modal attention mechanisms, our model demonstrates superior performance in detecting abnormalities, reducing missed findings, and generating structured reports compared to existing methods. Our experiments show that Abn-BLIP outperforms state-of-the-art medical vision-language models and 3D report generation methods in both accuracy and clinical relevance. These results highlight the potential of integrating multimodal learning strategies for improving radiology reporting.
Method
Based on clinical diagnostic guidelines for CTPA (Tan et al., 2022; Bukhari et al., 2024), we identified the necessity of a systematic framework to enhance abnormality detection and structured report generation for PE diagnosis. Accordingly, we developed a hierarchical diagnostic framework informed by the clinical expertise of radiologists from Brown University, Johns Hopkins University, and the University of Michigan, in collaboration with emergency physicians and pulmonologists. Their combined clinical insights ensured the framework's clinical relevance, consistency, and generalizability across diverse healthcare settings.

As illustrated in Fig. 2, the framework systematically structures the diagnostic process through a hierarchical evaluation of seven anatomical regions and 32 critical CTPA abnormalities. Within this structured approach, abnormalities are identified at a regional level and synthesized into a comprehensive diagnostic summary, facilitating precise abnormality localization and diagnostic standardization.
For diagnostic model training and report generation, CTPA radiology reports were processed with a large language model (LLM) (Dubey et al., 2024) to extract training targets. The LLM identified 32 abnormality labels (𝑌) and retrieved their corresponding text-based findings (𝑇), which served as training references for both binary and textual predictions.
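As a concrete illustration of the training targets described above, the sketch below shows one plausible way the LLM output for a single report could be converted into the binary label vector 𝑌 and the per-abnormality findings 𝑇. The JSON schema, abnormality names, and helper function are hypothetical; the paper does not specify the exact format.

```python
# Illustrative sketch (assumed data format, not the authors' pipeline): turning the
# LLM output for one CTPA report into binary labels Y and per-abnormality findings T.
import json
import numpy as np

ABNORMALITIES = ["acute pulmonary embolism", "right heart strain", "pleural effusion"]  # ... 32 in total

def build_targets(llm_json: str):
    # Hypothetical schema: {"acute pulmonary embolism": "Filling defect in ...", ...}
    parsed = json.loads(llm_json)
    y = np.zeros(len(ABNORMALITIES), dtype=np.int64)   # binary labels Y
    t = [""] * len(ABNORMALITIES)                      # text-based findings T
    for i, name in enumerate(ABNORMALITIES):
        finding = parsed.get(name)
        if finding:                                    # abnormality reported as present
            y[i] = 1
            t[i] = finding
    return y, t
```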
Conclusion
In conclusion, Abn-BLIP represents a significant advancement in automated medical imaging interpretation, introducing a clinically aligned vision–language framework tailored for CTPA analysis. By integrating learnable abnormality-guided queries with a hierarchical multimodal transformer (Abn-QFormer) and employing fine-grained cross-modal alignment, Abn-BLIP effectively captures abnormality-specific findings across pulmonary and cardiovascular structures.

The model demonstrates robust performance in both multi-label abnormality classification and structured radiology report generation, achieving consistent improvements in NLG and CE metrics across internal and external datasets. Expert evaluations further confirm the clinical accuracy, clarity, and relevance of the generated reports, with strong correlations observed between expert ratings and automated metrics. Qualitative visualizations and case studies highlight Abn-BLIP's ability to localize and describe both primary and incidental findings, a critical feature for comprehensive patient management.

In addition, Abn-BLIP exhibits favorable inference efficiency with moderate computational requirements (∼280M parameters), supporting its feasibility for real-world deployment. Its modular and extensible design enables adaptation across institutions and customization to local diagnostic protocols.

Overall, Abn-BLIP establishes a structured, interpretable, and clinically oriented pipeline for CTPA interpretation, marking a promising step toward trustworthy AI-assisted diagnosis and radiology workflow optimization in diverse healthcare environments.
Results
4.1. Datasets
To assess the effectiveness of the proposed method across multiple clinical tasks, we conducted experiments on two CTPA datasets paired with radiology reports: (1) INSPECT (Huang et al., 2023) from Stanford University and (2) a retrospective CTPA dataset from Brown University Health (BUH).

The INSPECT dataset, collected at Stanford Medicine between 2000 and 2021, comprises 23,248 CTPA scans from 19,402 patients at risk for PE. It includes the impression sections of radiology reports, providing radiologist-authored diagnostic descriptions and interpretations.

The BUH dataset includes patients who underwent CTPA imaging between 2015 and 2019, with some patients having multiple follow-up scans. In total, it consists of 59,754 image–report pairs from 19,565 patients. The two datasets were combined and randomly partitioned into training, validation, and testing sets at a 7:1:2 ratio. This study was approved by the Lifespan Institutional Review Board 3 (Ref. [1791856-20]; Project Code 214421), with informed consent waived due to the retrospective use of de-identified imaging and clinical data. All participants were over 18 years of age.
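For clarity, a minimal sketch of the 7:1:2 random partition described above is shown below; the paper states only the ratio, so the seed, pair-level granularity, and function name are assumptions.

```python
# Minimal sketch (assumption: a simple random split over image-report pairs at 7:1:2).
import random

def split_712(pairs, seed=42):
    rng = random.Random(seed)
    pairs = pairs[:]                          # copy before shuffling
    rng.shuffle(pairs)
    n = len(pairs)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    return (pairs[:n_train],                  # 70% training
            pairs[n_train:n_train + n_val],   # 10% validation
            pairs[n_train + n_val:])          # 20% testing
```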
Figure

Fig. 1. Abn-BLIP inference pipeline for CTPA abnormality identification and structured report generation. The Abn-IDed image encoder detects 32 CTPA abnormalities and extracts abnormality-identified features. The learned visual queries interrogate CTPA scans through Abn-QFormer to extract the corresponding abnormal findings. These queries help generate a structured CTPA report, categorizing abnormalities under relevant organ-specific sections, such as pulmonary arteries and the heart.

Fig. 2. The figure illustrates the population distribution of 32 CTPA abnormalities across two datasets (BUH and INSPECT), categorized into 7 anatomical regions: Pulmonary Arteries, Lungs and Airways, Pleura, Heart, Mediastinum and Hila, Chest Wall and Lower Neck, and Bones. This hierarchical framework facilitates comprehensive abnormality detection and enhances the generation of clinically meaningful CTPA reports. The abnormality labels were extracted from radiology reports using a large language model (LLM), enabling a multi-dimensional assessment of inter-regional variations across the datasets.

Fig. 3. Overview of the proposed Abn-BLIP model for CTPA abnormality diagnosis and report generation. (a) Anatomy-guided multi-abnormality identification in Stage 1: multi-scale abnormality-identified image feature extraction for transformer encoders. (b) Abnormality-driven visual Querying Transformers (Abn-QFormer): joint optimization of two objectives, enforcing abnormal queries (a set of learnable embeddings) to extract visual abnormal representations most relevant to their corresponding abnormal text descriptions. (c) Abnormality-aligned Contrastive Learning (ACL): achieving more fine-grained visual queried representations by aligning abnormalities.
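To make panel (c) more concrete, the following is a hedged sketch of an abnormality-aligned contrastive objective in the spirit of ACL: for each of the 32 abnormalities, the queried image feature and the text feature of the same case are pulled together and contrasted against other cases in the batch via a symmetric InfoNCE loss. The exact loss used by Abn-BLIP may differ; function names and the temperature value are illustrative.

```python
# Hedged sketch of a per-abnormality contrastive loss (assumes L2-normalized features).
import torch
import torch.nn.functional as F

def abnormality_aligned_contrastive(img_feats, txt_feats, temperature=0.07):
    # img_feats, txt_feats: (B, 32, D) abnormality-level features for a batch of cases.
    losses = []
    for k in range(img_feats.size(1)):                      # loop over the 32 abnormalities
        v, t = img_feats[:, k], txt_feats[:, k]             # (B, D) each
        logits = v @ t.T / temperature                       # (B, B) similarity matrix
        targets = torch.arange(v.size(0), device=v.device)   # matched pairs on the diagonal
        losses.append(0.5 * (F.cross_entropy(logits, targets) +
                             F.cross_entropy(logits.T, targets)))
    return torch.stack(losses).mean()
```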

Fig. 4. Visualization of the cross-modal cosine similarity heatmap between textual and visual features of 32 distinct CTPA abnormalities. The textual features are derived from the text descriptions of each abnormality, while the visual features are the queried representations on the corresponding images. Each cell in the heatmap indicates the similarity score between a specific abnormality's textual and visual representation, providing insights into the alignment between the two modalities.
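A heatmap like the one in Fig. 4 can be computed as below; this is an illustrative sketch that assumes 32 x D arrays of per-abnormality text and visual features are already available, with variable names chosen here for clarity.

```python
# Sketch: 32 x 32 cosine similarity matrix between text and visual abnormality features.
import numpy as np

def cosine_similarity_matrix(text_feats, visual_feats):
    # text_feats, visual_feats: (32, D) arrays, one row per abnormality.
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    v = visual_feats / np.linalg.norm(visual_feats, axis=1, keepdims=True)
    return t @ v.T   # entry [i, j] = cosine similarity between text_i and visual_j
```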

Fig. 5. t-SNE visualization of normalized image and text features for abnormalities. Each colored point represents one of 32 detected abnormalities, from 20,000 randomly sampled features. (a) The abnormal image features were extracted using visual querying, guided by learned abnormality-wise queries from the visual querying transformer encoder. (b) The abnormal text features were encoded by a text transformer encoder based on descriptive sentences of the abnormalities.
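A projection like Fig. 5 can be produced with a standard t-SNE call; the sketch below assumes scikit-learn and uses small placeholder arrays in place of the 20,000 sampled abnormality features and their labels.

```python
# Sketch of the t-SNE projection behind Fig. 5 (placeholder data; the paper samples 20,000 features).
import numpy as np
from sklearn.manifold import TSNE

features = np.random.randn(2000, 256)          # placeholder: normalized abnormality features
labels = np.random.randint(0, 32, size=2000)   # placeholder: abnormality index per feature

embedding = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(features)
# embedding: (2000, 2); scatter-plot it colored by `labels` to reproduce the figure.
```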

Fig. 6. Examples of the generated reports. Our results are compared with the ground truth, 3D report generation methods, and medical VLM methods. Blue italic text marks correct predictions that correspond to the actual reports, and red areas indicate untrue information in the predictions.

Fig. 7. Scatter plot of expert ratings, response times, and confidence levels in evaluating generated radiology reports. Each dot represents a single expert assessment. The 𝑥-axis indicates the expert rating, the 𝑦-axis denotes the response time (in seconds), and color encodes the reviewer's confidence level.

Fig. 8. Correlation between expert ratings and automated evaluation metrics for generated radiology reports. Each subplot shows the relationship between expert scores and a specific evaluation metric (NLG or CE). Pearson correlation coefficients are reported, and red lines indicate linear regression fits with 95% confidence intervals.
Table

Table 1. Comparison of current 3D medical VLMs on a combined testing set using multi-label classification metrics. The highest performances are highlighted in bold.

Table 2. Diagnosis performance for the 7 anatomical regions.

Table 3. Comparison of PE diagnosis performance.

Table 4. Natural Language Generation (NLG) metrics comparison on captioning- and learning-based report generation.

Table 5. Clinical Efficacy (CE) metrics comparison between baseline models and the proposed Abn-BLIP model.

Table 6. Ablation studies for multi-abnormality identification.

Table 7. Ablation studies for report generation.

Table 8. Expert assessment of LLMs on abnormality extraction.