DSpace at EWHA: A Novel Drug Screening Method based on Large-scale Bioactivity Dataset

Title: A Novel Drug Screening Method based on Large-scale Bioactivity Dataset

Authors: 권예지

Issue Date: 2018

Department/Major: Graduate School, Department of Life Science

Publisher: The Graduate School of Ewha Womans University

Degree: Doctor

Advisors: 김완규, 이상혁

Abstract (translated from Korean): Data-driven drug discovery (D4) uses comprehensive large-scale data to suggest an efficient path for new drug development. As of 2018, PubChem, run by the US NCBI, has released more than 1.2 million bioassay records; the collection is still growing rapidly and provides extensive information on the bioactivity profiles of millions of compounds. This study presents BEAR (Bioactivity Enrichment by Assay Repositioning), a computational virtual screening method that predicts and proposes active compounds for a target protein. BEAR makes statistical use of the bioassay profiles of millions of compounds and uses no structural information about the target protein or its ligands. The basic idea of BEAR is ‘assay repositioning’: reusing bioassays that were run for other purposes to find active compounds for a target protein. When its ligand-prediction performance was tested on more than a thousand target proteins, BEAR recovered known ligands with high accuracy (median AUC≈0.87). This accuracy held even when bioassays closely related to the target protein were not reused (median AUC=0.78~0.85). That is, assay repositioning can link bioassays and target proteins that would be hard to associate on the basis of existing knowledge. When BEAR's predictions were compared with seven previously published kinase-inhibitor experiments, BEAR also tended to identify correctly the kinase inhibitors found in each experiment. Excluding two experiments, the average AUC exceeded 0.7 for most of the remaining results; in other words, the kinase inhibitors received high scores in the BEAR predictions for the corresponding kinases. BEAR's results were also validated experimentally. The GPCRs CHRM1 and HRH1 and the kinases CLK1, CLK4, MAPK1, and AURKC were chosen as target proteins, and 15~40 candidate drugs were selected for each. All experiments were in-vitro assays looking for an inhibitory signal, and the six experiments gave an average hit ratio of 0.2. That is, out of every 10 candidate drugs, on average 2 were effective as inhibitors of the target protein. Moreover, the drugs with inhibitory effects were structurally different from previously known ligands. For the seven candidate drugs that inhibited CHRM1, neurite outgrowth in mouse hippocampal neurons was additionally confirmed. Because BEAR is a ligand-based virtual screening method, it needs information on previously known ligands. At least three ligands are needed before BEAR can propose new ones, and BEAR's predictive accuracy is reasonably assured when about 30 ligands are known. This can be regarded as a limitation of BEAR. This study therefore also proposes extended versions of BEAR for target proteins with few or no known ligands. iBEAR (iterative BEAR) is applied when too few ligands are known. BEAR is first run with the few available ligands, and the top-ranked compounds from that result are combined with the known ligands and used again as input to BEAR. The resulting predictions are somewhat more accurate than those of the original BEAR, but the ability to predict ligands with different structures decreased. fBEAR (family BEAR) is used when no ligands are known for the target protein; it substitutes the known ligands of the same protein family. For some proteins, ligand prediction was even more accurate with the family's ligands. BEAR and fBEAR can therefore be used in a complementary way for ligand prediction.
Ligand virtual screening methods based on large-scale data mining, similar to BEAR, have been reported. Two were selected for comparison with BEAR: SEA (similarity ensemble approach), a representative method, and BACoN (BioActivity Based Compound Network), which builds a network from the same bioactivity profiles used in this study. BEAR's predictive power was clearly higher than SEA's, while BACoN performed on a par with BEAR. In conclusion, BEAR differs from existing virtual screening methods in that 1) it is independent of structural information and uses bioactivity data, 2) it proposes ligands with new structures, 3) it is applicable to about a thousand target proteins, and 4) its performance improves as more data become available.
Abstract (English original): Data-driven drug discovery (D4) exploits a comprehensive set of big data to provide an efficient path to new drug development. As of 2018, more than 1.2 million bioassays are publicly available at PubChem; the collection is still growing rapidly and provides extensive information on the bioactivity profiles of millions of compounds. I developed a novel in silico method, named BEAR (Bioactivity Enrichment by Assay Repositioning), to virtually screen active compounds for a target protein. BEAR uses a large-scale bioassay dataset for compound screening and does not depend on any structural information about either the target or the ligand. The underlying idea of BEAR is to reuse bioassay data to predict active compounds for targets other than those originally intended, i.e. ‘assay repositioning.’ When tested on more than a thousand targets, BEAR predicted known ligands with high accuracy (median AUC≈0.87). Its accuracy remained high even after the relevant bioassays were excluded (median AUC=0.78~0.85), suggesting effective repositioning of seemingly unrelated bioassay data. Against 7 independent sets of kinase-inhibitor experimental results, BEAR also ranked the inhibitors of each kinase highly. The average AUC was over 0.7 overall, except for two datasets that had been generated with highly selective inhibitors. To validate BEAR’s predictive power experimentally, 15~40 new ligand candidates were selected for each of 2 GPCRs (CHRM1 and HRH1) and 4 kinases (CLK1, CLK4, MAPK1, AURKC).
The in-vitro assays on the 6 proteins gave an average hit ratio of 0.2. Moreover, the hit candidates were structurally different from previously known ligands. For the 7 candidates that were antagonistic to CHRM1, neurite outgrowth was observed in mouse hippocampal neurons. Since BEAR is a ligand-based virtual screening (LBVS) method, a set of known ligands is required. At least three ligands are required to predict new ligands for a protein target, and if more than 30 ligands are known, the predictive power of BEAR is significantly improved. This is a limitation of BEAR. I therefore developed extended versions of BEAR for cases where known ligands are scarce or absent. Iterative BEAR (iBEAR) is applied when there are very few known ligands. After BEAR has been applied with the small number of known ligands, some of the highest-scoring compounds in the result are reused as inputs to BEAR. The predictive power of iBEAR is somewhat better than that of BEAR, but its ability to predict structures different from previously known ligands is reduced. Family BEAR (fBEAR) instead uses ligands of the same protein family when there is no known ligand. Some proteins are predicted more accurately by fBEAR than by BEAR with target-specific ligands. Thus, BEAR and fBEAR are complementary, and the ligands predicted by both algorithms can be used in combination. Several other LBVS algorithms based on large-scale data mining have been developed, and I compared BEAR with two of them. Compared with the similarity ensemble approach (SEA), the predictive power of BEAR was remarkably higher. BACoN (BioActivity Based Compound Network) is an algorithm that builds a network by calculating bioactivity similarity between compounds, using the same bioactivity data as BEAR; its performance was similar to that of BEAR.
In conclusion, BEAR differs from conventional virtual screening methods in that 1) it depends on no structural information, only on bioactivity data, 2) it allows scaffold hopping, 3) it scales easily to thousands of targets, and 4) its performance is expected to improve as the data grow.
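
Illustrative sketch (not part of the thesis record): the abstract describes assay repositioning, AUC-based evaluation, and the iBEAR re-seeding step, but does not state which statistic BEAR uses. The Python below is a minimal toy under stated assumptions: a hypergeometric enrichment test weights each assay, a compound's score is the sum of the weights of the assays in which it is active, and all data, compound IDs (c1 to c8), assay names, and function names are invented for illustration. It may differ substantially from the actual BEAR implementation.

```python
from math import comb, log

# Hypothetical toy data: assay -> (compounds tested, compounds found active).
ALL = {f"c{i}" for i in range(1, 9)}
ASSAYS = {
    "assay_A": (ALL, {"c1", "c2", "c3"}),
    "assay_B": (ALL, {"c1", "c3", "c4"}),
    "assay_C": (ALL, {"c5", "c6"}),
}

def enrichment_p(k, N, K, n):
    """Hypergeometric tail P(X >= k): chance that at least k of the n actives
    are known ligands, when K of the N tested compounds are known ligands."""
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)

def assay_weights(ligands):
    """Assumed weighting: -log(p) of ligand enrichment among each assay's actives."""
    weights = {}
    for name, (tested, active) in ASSAYS.items():
        p = enrichment_p(len(ligands & active), len(tested),
                         len(ligands & tested), len(active))
        weights[name] = -log(max(p, 1e-300))  # guard against log(0)
    return weights

def score_compounds(ligands):
    """Score each non-ligand compound by summing weights of assays where it is active."""
    weights = assay_weights(ligands)
    return {c: sum(w for a, w in weights.items() if c in ASSAYS[a][1])
            for c in sorted(ALL - ligands)}

def rank(scores):
    """Compounds ordered by descending score; ties broken by name."""
    return sorted(scores, key=lambda c: (-scores[c], c))

def roc_auc(scores, positives):
    """Rank-based AUC: fraction of (positive, negative) pairs ordered correctly,
    counting ties as half."""
    pos = [s for c, s in scores.items() if c in positives]
    neg = [s for c, s in scores.items() if c not in positives]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def iterative_scoring(ligands, rounds=1, top_n=1):
    """iBEAR-style loop: add the top-scoring compounds to the seed ligands, then rescore."""
    seeds = set(ligands)
    for _ in range(rounds):
        seeds |= set(rank(score_compounds(seeds))[:top_n])
    return seeds, score_compounds(seeds)

# Usage: c1 and c2 play the role of known ligands; c3 is a held-out ligand.
scores = score_compounds({"c1", "c2"})
print(rank(scores)[:3])
print(roc_auc(scores, {"c3"}))
print(iterative_scoring({"c1"}, rounds=1, top_n=1)[0])
```

The sketch never looks at chemical structure: a candidate ranks highly only because it is active in the same assays as the seed ligands, which is the property the abstract credits for scaffold hopping. The re-seeding loop also hints at why iBEAR might lose structural diversity, since each round reinforces the assays that the first round already favored.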