Recruiting

NCT07500428

Construction of a Benchmark for Breast Ultrasound AI Interpretation and Performance Evaluation of Multimodal AI Models

Construction of a Standardized Benchmark Evaluation System for Intelligent Breast Ultrasound Image Interpretation and Systematic Performance Assessment of Multimodal Artificial Intelligence Models Based on ACR BI-RADS v2025 Criteria

Sponsor

Peking Union Medical College Hospital

Enrollment

1,380 participants

Start Date

Mar 12, 2026

Study Type

OBSERVATIONAL

Conditions

Breast Neoplasms, Breast Diseases, Ultrasonography

Summary

This single-center, retrospective, observational study aims to construct a standardized benchmark evaluation system for intelligent breast ultrasound image interpretation and to systematically assess the diagnostic performance of current mainstream multimodal artificial intelligence (AI) models. De-identified B-mode breast ultrasound images with confirmed pathological diagnoses will be retrospectively collected from the institutional archive (2018-2025) and supplemented with images from published open-access datasets. Expert radiologists with varying experience levels will independently annotate all images according to the American College of Radiology (ACR) Breast Imaging Reporting and Data System (BI-RADS) v2025 criteria, including glandular tissue composition, lesion characterization (mass vs. non-mass lesion), morphological descriptors, and final BI-RADS classification. Baseline deep learning models (CNN-based ResNet-50 and Transformer-based USFM) will be trained to establish performance baselines and to stratify cases by diagnostic difficulty through cross-architecture consensus. Multiple multimodal large language models (MLLMs), including both general-purpose and medical-domain models, will then be evaluated via standardized API calls using BI-RADS-guided chain-of-thought prompts at temperature 0 for reproducibility. Primary endpoints include BI-RADS classification accuracy and diagnostic AUC for benign-malignant differentiation. Model robustness and safety will be assessed through out-of-distribution rejection testing, temperature-stability experiments, and thinking-mode ablation studies. This study adheres to the FLAIR and TRIPOD-LLM reporting guidelines.
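The two primary endpoints named in the Summary can be illustrated with a minimal sketch. It assumes BI-RADS categories are compared as exact string labels and that each model emits a malignancy score per image; the registry entry specifies neither, and the function names below are illustrative only. AUC is computed here with the rank-based (Mann-Whitney) formulation, with tied scores counted as half.

```python
def accuracy(y_true, y_pred):
    """Fraction of cases whose predicted BI-RADS category equals the reference."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_auc(labels, scores):
    """AUC for benign (0) vs. malignant (1) via pairwise comparison of scores.

    Equals the probability that a randomly chosen malignant case receives a
    higher score than a randomly chosen benign case; ties count as 0.5.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one malignant and one benign case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, not study results.
reference = ["3", "4A", "2"]
predicted = ["3", "4B", "2"]
print(accuracy(reference, predicted))

pathology = [0, 0, 1, 1]            # 0 = benign, 1 = malignant
model_scores = [0.1, 0.4, 0.35, 0.8]
print(binary_auc(pathology, model_scores))
```

The pairwise form is quadratic in the number of cases, which is acceptable at the scale of this study (1,380 participants); a sort-based implementation would be the usual choice for much larger sets.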

Eligibility

Sex: FEMALE
Min Age: 18 Years
Max Age: 75 Years

Inclusion Criteria (4)

B-mode breast ultrasound grayscale images from the institutional PACS database or from published open-access breast ultrasound datasets with documented original institutional ethics approval
Image quality adequate for clinical diagnosis with clear visualization of the region of interest
Pathological diagnosis confirmed (for benign and malignant lesion groups), or normal breast status confirmed by a senior radiologist with >15 years of breast ultrasound experience (for the normal group)
Complete de-identification with removal of all personally identifiable information

Exclusion Criteria (5)

Severely degraded image quality precluding meaningful BI-RADS assessment
Duplicate images from the same patient (only the most representative image retained per lesion)
Images with residual personally identifiable information after de-identification processing
Cases with ambiguous, disputed, or unavailable pathological results
Non-B-mode ultrasound images, including elastography, contrast-enhanced ultrasound, and Doppler imaging
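The eligibility rules above could be applied as a simple screening pass over image metadata. The sketch below is only an illustration: the record schema and all field names are hypothetical, since the registry entry defines no data format. The "most representative image per lesion" rule is a human judgment, so the code substitutes a placeholder that keeps the first eligible image seen for each lesion.

```python
from dataclasses import dataclass


@dataclass
class ImageRecord:
    # Hypothetical schema; none of these field names come from the study.
    lesion_id: str
    modality: str              # e.g. "B-mode", "Doppler", "elastography"
    sex: str
    age_years: int
    quality_ok: bool           # adequate quality, region of interest visible
    deidentified: bool         # no residual personally identifiable information
    group: str                 # "benign", "malignant", or "normal"
    diagnosis_confirmed: bool  # pathology (lesions) or senior radiologist (normal)


def is_eligible(rec: ImageRecord) -> bool:
    """Apply the inclusion and exclusion criteria that can be checked per image."""
    return (
        rec.modality == "B-mode"
        and rec.sex == "FEMALE"
        and 18 <= rec.age_years <= 75
        and rec.quality_ok
        and rec.deidentified
        and rec.group in {"benign", "malignant", "normal"}
        and rec.diagnosis_confirmed
    )


def select_images(records):
    """Keep one eligible image per lesion (placeholder: the first one seen)."""
    kept = {}
    for rec in records:
        if is_eligible(rec) and rec.lesion_id not in kept:
            kept[rec.lesion_id] = rec
    return list(kept.values())


# Toy records, not study data.
records = [
    ImageRecord("L1", "B-mode", "FEMALE", 45, True, True, "malignant", True),
    ImageRecord("L1", "B-mode", "FEMALE", 45, True, True, "malignant", True),
    ImageRecord("L2", "Doppler", "FEMALE", 50, True, True, "benign", True),
    ImageRecord("L3", "B-mode", "FEMALE", 80, True, True, "benign", True),
]
print([r.lesion_id for r in select_images(records)])
```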

Interventions

DIAGNOSTIC_TEST: Multimodal AI Model Diagnostic Evaluation

Retrospective evaluation of de-identified breast ultrasound images by multiple AI systems, including baseline deep learning models (ResNet-50, USFM) and multimodal large language models, using standardized BI-RADS-guided chain-of-thought prompts via API. No patient contact or clinical decision-making is involved.
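According to the Summary, the two baseline models (ResNet-50 and USFM) are also used to stratify cases by diagnostic difficulty through cross-architecture consensus. The registry entry does not define the rule, so the sketch below shows one plausible reading as an assumption: a case is "easy" if both baselines match the reference label, "medium" if exactly one does, and "hard" if neither does. The tier names and the function name are illustrative.

```python
def stratify_by_consensus(reference, resnet_pred, usfm_pred):
    """Assign a difficulty tier per case from agreement of two baseline models.

    All three arguments map case_id -> label. The three-tier rule is an
    assumption made for illustration, not the study's documented method.
    """
    tier_for_correct_count = {2: "easy", 1: "medium", 0: "hard"}
    tiers = {}
    for case_id, label in reference.items():
        n_correct = int(resnet_pred[case_id] == label) + int(usfm_pred[case_id] == label)
        tiers[case_id] = tier_for_correct_count[n_correct]
    return tiers


# Toy labels, not study data.
reference = {"c1": "malignant", "c2": "benign", "c3": "malignant"}
resnet = {"c1": "malignant", "c2": "malignant", "c3": "benign"}
usfm = {"c1": "malignant", "c2": "benign", "c3": "benign"}
print(stratify_by_consensus(reference, resnet, usfm))
```

Tiers built this way let the multimodal model results be reported per difficulty level rather than only in aggregate.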