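In this thesis, we focus on analyzing research related to data citation for data repositories. We believe that consistent data citation practice will help promote data sharing and improve data reusability, because it can be regarded as analogous to the citation patterns of journals and other publications and is recognized by users in the relevant fields.

The Protein Data Bank (PDB) is a database dedicated to storing three-dimensional structural data of proteins and nucleic acids, most of which play key roles in biological mechanisms. These data are determined mainly by structural biologists around the world through X-ray crystallography or NMR spectroscopy experiments. Major scientific journals require scientists to submit their results to the PDB, where they are deposited under unique identifiers (PDB IDs) and made freely available to the public, which makes the PDB an important resource for structural biology research. The PDB is therefore a good subject for research on data citation. Our study considers the interplay between the patterns in which PDB IDs are mentioned in the body text of a paper and the patterns in which they are cited through the reference list, and expresses the relative importance of these two citation mechanisms by investigating the resulting data citation patterns.

By exploring these rich protein structure data and their related citations, we can study the relationships between protein structures from the viewpoint of the citation network. In addition, analysis of the literature and data citation networks can reveal potential pathways of scientific development, that is, how knowledge and data have been used to advance structural biology. Based on the results of these analyses, we can propose appropriate data citation practices for linking citations and data, as well as ways of measuring data usage. This will benefit data reuse, help make experiments reproducible, and even provide machine-readable tracking of data usage.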
In this thesis, we focus on analyzing the various forms of data citation to data repositories. We argue that consistent data citation practice facilitates and incentivizes data sharing and reuse, because it can count as professional recognition for data providers, much as citations to journal articles and other types of publications do. The Protein Data Bank (PDB) is the worldwide repository of 3D structures of proteins, nucleic acids, and complex assemblies, most of which play essential biological roles. The PDB consists mainly of experimentally determined protein structures, each provided with a unique identifier (PDB ID) and a corresponding primary citation, which makes the structures easier to use as referenced data. The PDB is therefore a good practice model for data citation research. Our studies focus on the interplay between PDB ID mentions recognized in the text and the references cited in the literature, and we express the relative importance of these two mechanisms by investigating the data citation patterns. By exploring the rich structures and related citations of the PDB, we can investigate the relationships between protein structures from the viewpoint of the citation network. Moreover, analysis of the literature and data citation networks may reveal potential pathways of scientific discovery, that is, how knowledge and data were used to advance a particular field in structural biology. Based on the results of these analyses, we can recommend data citation and provenance practices, approaches to discovering data citations, ways of linking citations and data, and data access metrics. We hope our work will benefit data reuse and the reproducibility of experiments, and even provide machine readability for tracing data usage.