The primary goal of this study was to apply deep learning models to the classification of knee osteoarthritis and the prediction of its progression, with particular emphasis on performance in real-world clinical settings. We used the public Osteoarthritis Initiative (OAI) database as the basis for training the deep learning models. For the classification model, a total of 8,964 knee radiographs were used, of which 926 were reserved as test data; an additional 246 radiographs from the Far Eastern Memorial Hospital (FEMH) were used for external validation. In addition, the test data from the public database and the FEMH data were read by three senior clinicians, including two orthopedic surgeons and one radiologist. To quantify the readings of the model and the clinicians, we applied visualization maps, accuracy, inter-observer agreement, F1 score, precision, recall, specificity, and the ability to identify surgical candidates. The disease prediction model was based on a Vision Transformer (ViT) and used knee radiographs and clinical information from 5,565 participants, of which 578 served as test data; radiographs and clinical information from 274 FEMH patients were used for external validation. To evaluate the prediction results, we applied visualization maps, accuracy, sensitivity, specificity, and odds ratios, and compared the model with traditional disease risk factors. After training, the classification model achieved 78% accuracy on the OAI database, and its readings of the external validation data were highly consistent with those of the clinicians. Notably, for the public-database images that the model misclassified, the clinicians' readings also showed poor agreement with one another, possibly because some of these cases were inherently ambiguous. On the metric of identifying surgical candidates, the model even outperformed the clinicians. For the prediction model, we used both imaging and clinical inputs, with different input combinations depending on the amount of available data, and ultimately chose a single image plus essential clinical factors (age, sex, body mass index) as the input for external validation. The prediction model achieved 74.1% accuracy, with corresponding sensitivity and specificity of 91.8% and 71%, on the OAI database, and 71.2% accuracy, with sensitivity and specificity of 72.2% and 70.3%, on the external validation data. Compared with traditional risk factors, the odds ratio of a positive model prediction was markedly higher: 23.87 for the OAI database and 5.92 for the external validation data. Overall, in classifying knee osteoarthritis, our deep learning model performed on par with the clinicians and could be successfully applied in real-world clinical settings; the ViT-based prediction model predicted disease progression more accurately than traditional risk factors, and a subgroup analysis showed that it performed better in patients whose disease actually progressed, which may help clinicians intervene early and prevent deterioration to the point of requiring surgery.
The present study aimed to develop deep-learning-based models to classify knee osteoarthritis (OA) and predict its progression on knee radiographs, based on the Kellgren-Lawrence (KL) grading system. A deep convolutional neural network (CNN) model was developed to classify radiographs of knee OA. It was trained on the Osteoarthritis Initiative (OAI) dataset (4,796 participants in total), with 962 images reserved for testing. To validate the model's performance, an additional set of 246 knee radiographs from the Far Eastern Memorial Hospital (FEMH) was used for external validation. The evaluation also involved expert assessment by experienced specialists, one musculoskeletal radiologist and two orthopedic surgeons, who read images from both the OAI and FEMH. To quantify the model's performance, multiple metrics were used, including inter-observer agreement, F1 score, precision, recall, accuracy, specificity, and the ability to identify surgical candidates. Attention maps were also applied to demonstrate the interpretability of the OA classification model.

For the prediction model, a Vision-Transformer (ViT)-based approach was employed. It was trained on a baseline dataset of 5,565 knee radiographs from the OAI, with 578 images reserved for testing; each radiograph was labeled with the KL stage determined at the 48-month follow-up. Additionally, 274 cases from our institution (FEMH) were used for external validation. The model inputs combined single or paired images with relevant clinical factors, in either a comprehensive (full) or an essential set. To quantify the performance of the prediction model, several metrics were used, including the area under the receiver operating characteristic curve (AUROC), accuracy, odds ratio, sensitivity, specificity, and the ability to identify cases with an advanced KL stage.

The classification model achieved an accuracy of 78% and exhibited consistent inter-observer agreement on both the OAI dataset (κ between 0.80 and 0.86) and the externally validated images (κ between 0.81 and 0.83). However, for the images that the model misclassified, we observed lower inter-observer agreement (κ between 0.47 and 0.65). Notably, the model outperformed the surgeons and the radiologist in identifying surgical candidates (KL 3 and KL 4), achieving an F1 score of 0.923. In cases with OA progression, the AUROC for identifying surgical candidates was 0.844, 0.804, 0.766, and 0.718 for a single image with essential factors, a single image with full factors, paired images with essential factors, and paired images with full factors, respectively. On the OAI testing set with the simplest input, the AUROC for identifying OA progression was 0.808, with 74.1% accuracy, 91.8% sensitivity, and 71% specificity. In external validation, the AUROC for identifying OA progression was 0.709, with 71.2% accuracy, 72.2% sensitivity, and 70.3% specificity. A positive model prediction had an odds ratio of 23.87 (CI: 11.24–50.67) in the OAI and 5.92 (CI: 3.50–10.03) in external validation. The classification model performed comparably to the specialists in identifying surgical candidates and demonstrated consistent results across the open database and real-life radiographs.
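For context, the following is a minimal sketch of how the reported evaluation metrics (accuracy, precision, recall, F1 score, Cohen's kappa for inter-observer agreement, AUROC, and the odds ratio with its 95% CI) could be computed. The arrays, threshold, and unweighted kappa are illustrative assumptions, not the study's data or code.

```python
# Illustrative sketch (not the authors' code): computing the reported metrics
# with scikit-learn / NumPy. Array names (kl_true, kl_pred, ...) are placeholders.
import numpy as np
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

kl_true = np.array([0, 2, 3, 4, 1, 2])      # reference KL grades (0-4)
kl_pred = np.array([0, 2, 3, 3, 1, 2])      # model-predicted KL grades
reader_a = np.array([0, 2, 3, 4, 2, 2])     # one specialist's readings

# Multi-class classification metrics, macro-averaged over KL grades.
print("accuracy :", accuracy_score(kl_true, kl_pred))
print("precision:", precision_score(kl_true, kl_pred, average="macro", zero_division=0))
print("recall   :", recall_score(kl_true, kl_pred, average="macro", zero_division=0))
print("F1 score :", f1_score(kl_true, kl_pred, average="macro", zero_division=0))

# Inter-observer agreement between the model and a human reader (Cohen's kappa).
print("kappa    :", cohen_kappa_score(kl_pred, reader_a))

# Identification of surgical candidates: binarize KL grades (KL 3-4 vs. KL 0-2).
surg_true = (kl_true >= 3).astype(int)
surg_pred = (kl_pred >= 3).astype(int)
print("surgical-candidate F1:", f1_score(surg_true, surg_pred))

# Progression prediction: AUROC from predicted probabilities, plus an odds ratio
# with a 95% CI (Woolf's method) from the 2x2 table of predictions vs. outcomes.
prog_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])              # progressed by 48 months?
prog_prob = np.array([0.9, 0.2, 0.7, 0.3, 0.6, 0.1, 0.8, 0.4])
print("AUROC    :", roc_auc_score(prog_true, prog_prob))

prog_pred = (prog_prob >= 0.5).astype(int)
a = np.sum((prog_pred == 1) & (prog_true == 1))   # predicted positive, progressed
b = np.sum((prog_pred == 1) & (prog_true == 0))   # predicted positive, not progressed
c = np.sum((prog_pred == 0) & (prog_true == 1))   # predicted negative, progressed
d = np.sum((prog_pred == 0) & (prog_true == 0))   # predicted negative, not progressed
odds_ratio = (a * d) / (b * c)                    # assumes no zero cells
se_log_or = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ci_low, ci_high = np.exp(np.log(odds_ratio) + np.array([-1.96, 1.96]) * se_log_or)
print(f"odds ratio: {odds_ratio:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```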
However, the images misclassified by the model showed a notable discrepancy, largely attributable to the considerably lower inter-observer agreement among the specialists on those images. The prediction model provided reliable predictions for knee OA cases, with the advantages of simplicity and flexibility. Its performance was particularly strong in progression cases, potentially making early intervention in OA patients more efficient.
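The abstract describes the prediction model only at a high level: a ViT backbone fed a single radiograph together with essential clinical factors (age, sex, BMI). As a rough illustration of such an image-plus-tabular fusion architecture, the sketch below uses torchvision's ViT-B/16 as a stand-in image encoder and a small MLP for the clinical inputs; the layer sizes, preprocessing, and fusion details are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a ViT-based progression model that fuses one knee
# radiograph with essential clinical factors (age, sex, BMI).
import torch
import torch.nn as nn
from torchvision.models import vit_b_16


class KneeOAProgressionNet(nn.Module):
    def __init__(self, num_clinical: int = 3):
        super().__init__()
        backbone = vit_b_16(weights=None)      # ViT-B/16 image encoder (768-dim features)
        backbone.heads = nn.Identity()         # drop the built-in classification head
        self.backbone = backbone
        self.clinical = nn.Sequential(         # embed the tabular clinical factors
            nn.Linear(num_clinical, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU())
        self.classifier = nn.Sequential(       # fuse image + clinical features
            nn.Linear(768 + 32, 128), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(128, 1))

    def forward(self, image, clinical):
        img_feat = self.backbone(image)        # (B, 768) class-token features
        clin_feat = self.clinical(clinical)    # (B, 32)
        logit = self.classifier(torch.cat([img_feat, clin_feat], dim=1))
        return logit.squeeze(1)                # one logit per case


model = KneeOAProgressionNet()
dummy_img = torch.randn(2, 3, 224, 224)        # batch of 2 radiographs, ViT input size
dummy_clin = torch.tensor([[63.0, 1.0, 27.5],  # age, sex (1 = female), BMI
                           [58.0, 0.0, 31.2]])
prob = torch.sigmoid(model(dummy_img, dummy_clin))   # predicted progression probability
print(prob.shape)                              # torch.Size([2])
```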