Machine learning applied to chemical property prediction has attracted considerable attention in recent years. However, model performance is easily affected by the size and quality of the training set. Previous studies in chemistry have successfully used uncertainty quantification to define the confidence of molecular property predictions and have demonstrated that a model can improve its own predictive performance by being further trained on data it judges to be unfamiliar. Nevertheless, understanding the uncertainty a model predicts remains an open problem. This work develops an atom-based model architecture that, in molecular property prediction tasks, distributes both the predicted property value and its uncertainty onto individual atoms. The model quantifies two kinds of uncertainty separately: aleatoric uncertainty, which describes the noise inherent in the training data itself, and epistemic uncertainty, which describes the uncertainty arising from the limits of the model's predictive ability. This architecture endows the model with explainability: the model's behavior can be understood by inspecting the atomic contributions to the predicted property; the two atomic uncertainties allow poor predictions to be attributed to particular atoms or functional groups in a molecule, and further indicate whether the difficulty in predicting a given structure stems from noise in the training data or from a lack of related structures in the training set. In addition, this work proposes a post-hoc method for recalibrating the aleatoric uncertainty predicted under the deep ensemble method, so that the aleatoric uncertainty is better behaved. In summary, this work presents a recalibration method that improves the quality of uncertainty estimates and a model architecture that quantifies atomic properties and atomic uncertainties, thereby enabling the model's predictive behavior, and the reasons behind its poor predictions, to be understood.
Recent advances in machine learning have opened the door to rapid prediction of molecular properties of interest. However, the performance of data-driven methods on molecular property prediction is not always satisfactory because of the limited size and quality of chemical datasets. Uncertainty quantification for data-driven approaches has therefore gained attention in the chemistry community. Previous studies have successfully quantified the uncertainty of molecular property predictions and have shown that model performance can be improved by providing new training data for uncertain samples. However, rationalizing the uncertainty value predicted by a model remains a challenging task. In this work, we design an atom-based framework that attributes the uncertainty to the chemical structures present in a molecule by quantifying both atomic uncertainties and atomic contributions to the molecular property. Moreover, this method separates aleatoric and epistemic uncertainties, which capture the noise inherent in the data and the uncertainty in the model predictions, respectively. In the atom-based framework, both aleatoric and epistemic uncertainties become more explainable on the basis of the substructures in a molecule. Furthermore, we propose a post-hoc method for recalibrating the aleatoric uncertainty of the deep ensemble method for better confidence interval quantification. Overall, this approach not only improves uncertainty calibration but also provides a framework for better assessing whether and why a prediction should be considered unreliable.
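For context, the aleatoric/epistemic split mentioned above is commonly obtained from a deep ensemble whose members each predict a mean and a variance: the aleatoric part is the average of the member variances, and the epistemic part is the spread of the member means. The snippet below is a minimal sketch of that standard decomposition, not the thesis implementation; the function name and array shapes are illustrative assumptions.

```python
import numpy as np

def decompose_uncertainty(member_means, member_variances):
    """Standard deep-ensemble split of predictive uncertainty.

    Both inputs are assumed to have shape (n_members, n_samples).
    """
    member_means = np.asarray(member_means)
    member_variances = np.asarray(member_variances)

    # Ensemble prediction: average of the member means.
    prediction = member_means.mean(axis=0)

    # Aleatoric uncertainty: average of the variances predicted by the
    # members, i.e. the noise the models attribute to the data itself.
    aleatoric = member_variances.mean(axis=0)

    # Epistemic uncertainty: disagreement between the member means,
    # reflecting what the ensemble has not learned from the data.
    epistemic = member_means.var(axis=0)

    return prediction, aleatoric, epistemic


if __name__ == "__main__":
    # Toy example: 5 ensemble members, 3 molecules.
    rng = np.random.default_rng(0)
    means = rng.normal(loc=1.0, scale=0.1, size=(5, 3))
    variances = rng.uniform(0.01, 0.05, size=(5, 3))
    pred, ale, epi = decompose_uncertainty(means, variances)
    print("prediction:", pred)
    print("aleatoric :", ale)
    print("epistemic :", epi)
    print("total var :", ale + epi)
```

The total predictive variance is simply the sum of the two parts, which is what a post-hoc recalibration of the aleatoric term would adjust.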