資料探勘與篩選後分析方法於多方面生化應用化合物之研究

A Study in Mining and Post Screening Methods for Compounds Used in Various Biochemical Applications

Advisors: 羅濟群, 楊進木

Abstract


Advances in computer-aided methods for medicine and biotechnology have contributed substantially to improvements in the quality of human life. These methods rely on the growing availability of biochemical data and structural information. Solved 3D crystal structures of compounds stored in databases contribute greatly to bioinformatics, because they are used in the study and development of numerous lead compounds for drug design and other industrial applications. However, screening, retrieving, and analyzing prospective targets and compounds for these applications remains a challenge. Methods and tools for the classification, query, retrieval, and analysis of available compound data therefore need constant improvement. Advances in computer technology, information management, and data mining have made accurate, rapid, and efficient algorithms possible, and these have significantly improved studies in biotechnology. Even so, mining appropriate candidates by virtually screening thousands of docked protein-compound complexes remains one of the biggest challenges. One of the main issues in virtual screening is the insufficient description of ligand-binding mechanisms, which leads to imprecise scoring functions. To address this issue, we studied various docking algorithms and post-screening methods used to mine and investigate specific compounds. Comparing different virtual screening and post-screening analyses, we observed that interaction profiles (e.g., van der Waals forces and hydrogen bonding) are highly relevant to the overall performance of compound mining.
Moreover, this study concluded that a method combining two stages of cluster analysis can be more efficient than one-stage clustering methods at selecting appropriate candidates for drug design and other biotechnological applications. Our study of interaction profiles also provided evidence that similar virtual screening and post-screening analysis can be used to mine novel compounds for potential use in cosmetics, industry, and agriculture, in addition to pharmaceutics. These findings and observations led to the development of our method, Two Stage Combinative Clustering (TSCC), which combines virtual screening with two stages of cluster analysis (interaction and physico-chemical). The TSCC methodology has contributed to combinatorial computational approaches used to identify tetracycline derivatives for inhibiting Dengue virus neuraminidases and inhibitors for flaviviruses. Like other post-screening analysis methods, TSCC starts with the virtual screening, using GEMDOCK, of compounds obtained from databases such as the Available Chemicals Directory (ACD) or Comprehensive Medicinal Chemistry (CMC). Top-ranking compounds are then clustered by their protein-ligand binding interactions, so that each cluster has a distinct binding interaction pattern. The compounds are also clustered by physico-chemical features, using atom composition, into clusters of similar structure. The compounds with the lowest energy in each interaction cluster are selected as representatives, while active compounds and compounds similar to them are chosen as representatives of each structure cluster. Lastly, final representatives are chosen from the interaction clusters based on energy and from the structure clusters based on structural similarity; they can then be verified through bioassays for proper function and application. TSCC's novel feature is the use of two clustering stages to better filter and more accurately retrieve the final representative compounds.
Another key feature is that interactions are represented at the atomic level, including measures of interaction strength, which enables better descriptions of protein-ligand interactions and a more specific analysis of virtual screening results. The proposed two-stage clustering method enhanced our post-screening analysis: it performed more accurately than one-stage clustering in visualizing and mining compound candidates, improved virtual screening enrichment, and was used successfully to identify novel inhibitors and the functions of some proteins.
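
The selection flow described in the abstract (screen, cluster by interaction, cluster by structure, merge representatives) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The `Compound` fields, the cutoff parameters, the Euclidean distance, and the greedy leader clustering are all assumptions made for illustration, since the abstract does not name the clustering algorithm or the similarity measure. The docking energies are assumed to come from a prior docking run (GEMDOCK in the thesis) and are supplied here as plain numbers.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Compound:
    name: str
    energy: float          # docking energy; lower is better
    interactions: tuple    # hypothetical atom-level interaction energies
    composition: tuple     # hypothetical atom-type counts, e.g. (C, N, O)
    active: bool = False   # True for a known active compound


def leader_cluster(items, key, cutoff):
    """Greedy leader clustering (a stand-in, not the thesis's algorithm):
    an item joins the first cluster whose leader lies within `cutoff`
    (Euclidean distance on `key`); otherwise it starts a new cluster."""
    clusters = []
    for item in items:
        for cluster in clusters:
            if dist(key(item), key(cluster[0])) <= cutoff:
                cluster.append(item)
                break
        else:
            clusters.append([item])
    return clusters


def tscc_sketch(compounds, top_n, interaction_cutoff, structure_cutoff):
    # Screening step: keep the top-ranked compounds by docking energy.
    top = sorted(compounds, key=lambda c: c.energy)[:top_n]

    # Stage 1: cluster by interaction profile; the lowest-energy member
    # of each cluster becomes its representative.
    picked = {}
    for cluster in leader_cluster(top, lambda c: c.interactions, interaction_cutoff):
        best = min(cluster, key=lambda c: c.energy)
        picked[best.name] = best

    # Stage 2: cluster by atom composition; keep each known active and the
    # non-active member closest to an active. Assumption: clusters without
    # a known active contribute no representative.
    for cluster in leader_cluster(top, lambda c: c.composition, structure_cutoff):
        actives = [c for c in cluster if c.active]
        others = [c for c in cluster if not c.active]
        for c in actives:
            picked[c.name] = c
        if actives and others:
            nearest = min(
                others,
                key=lambda c: min(dist(c.composition, a.composition) for a in actives),
            )
            picked[nearest.name] = nearest

    # Final representatives: union of both stages, best energy first.
    return sorted(picked.values(), key=lambda c: c.energy)


# Made-up toy library; none of these values come from the thesis.
library = [
    Compound("A", -100.0, (-10.0, -5.0), (10, 2, 3)),
    Compound("B", -95.0, (-9.5, -5.2), (20, 0, 1)),
    Compound("C", -90.0, (-1.0, -12.0), (10, 3, 4), active=True),
    Compound("D", -85.0, (-1.2, -11.5), (20, 1, 1)),
    Compound("E", -80.0, (-9.8, -5.1), (10, 3, 3)),
    Compound("F", -40.0, (-3.0, -3.0), (5, 5, 5)),
]
final = tscc_sketch(library, top_n=5, interaction_cutoff=1.0, structure_cutoff=1.5)
print([c.name for c in final])  # ['A', 'C', 'E']
```

In this toy run, stage 1 keeps A and C as the lowest-energy members of the two interaction clusters, and stage 2 adds E because it is the non-active compound closest in composition to the known active C. A one-stage method that clustered on interactions alone would have dropped E.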

