DSpace at EWHA: Optimal estimation of FST for detecting positive selection from SNPs correcting for sample size difference, missing values, and low frequency variants

Title: Optimal estimation of FST for detecting positive selection from SNPs correcting for sample size difference, missing values, and low frequency variants

Authors: 이송은

Issue Date: 2018

Department/Major: Graduate School, Interdisciplinary Program of EcoCreative

Publisher: The Graduate School, Ewha Womans University

Degree: Master

Advisors: 김유섭

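Abstract: F_ST, introduced to describe genetic differentiation between populations, has attracted attention as an effective means of detecting positive selection. When natural selection driven by a particular environmental factor acts on a subpopulation of the total population, the frequency of the positively selected allele rises markedly in that subpopulation; this is called a selective sweep. When a selective sweep acts on a locus and the beneficial allele increases in frequency, alleles at nearby positions on the same chromosome increase in frequency with it. This is called genetic hitchhiking. Hitchhiking reduces genetic diversity around the locus under positive selection. By measuring this genetic diversity in the subpopulations and in the total population and comparing the two, one can determine in which subpopulation, and with what strength, positive selection occurred. This is the principle by which F_ST detects positive selection. The basic formula is F_ST = 1 - H_W/H_T, where H_W (mean within-population diversity) is the average of the genetic diversity computed in the subpopulations and H_T (total population diversity) is the genetic diversity of the total population. Since the concept of F_ST was introduced, various methods have been proposed for calculating it from the many SNP (single nucleotide polymorphism) loci present in DNA sequences. This study examines the problems that arise when F_ST is calculated with these methods and proposes remedies. The problems are as follows. First, with K_ST, one of the methods for calculating F_ST, the value of H_T is affected by the difference in size between the DNA sequence samples drawn from the two subpopulations, so the F_ST value fluctuates. Second, little is known about how F_ST values calculated by the various methods change when sequences obtained by NGS contain missing values. Third, the upper bound of F_ST at a locus is a monotonically increasing function of the minor allele frequency (MAF), so the MAF at each SNP locus affects the F_ST value. To address these problems, this study devised three methods of calculating F_ST: T1, T2, and T3. T1 equalizes the sample sizes by sub-sampling when the DNA sequence samples from the two subpopulations differ in size, and then calculates K_ST. T2 calculates F_ST at each SNP locus and takes the average, whereas T3 averages the population diversity, that is, the heterozygosity, obtained across many SNP loci and uses those averages to calculate F_ST. Scenarios in which a selective sweep occurs in one subpopulation were simulated, and the methods above were used to detect positive selection. The F_ST values from T1, T2, and T3 remained constant regardless of the difference in sample size. In tests of statistical power, the power of K_ST varied with the difference in sample size, whereas T1, T2, and T3 were relatively stable, and T1 and T3 showed higher power than T2. In addition, when the DNA sequence samples contained missing values, statistical power tended to fall for K_ST, T1, T2, and T3 alike, and the drop was most pronounced for K_ST.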
Finally, when SNP loci with low MAF were excluded, statistical power tended to increase for K_ST, T1, T2, and T3 alike.

Wright's F_ST is widely used as a robust signature for detecting positive selection, for example a complete or incomplete selective sweep in a local population. Given the basic formula F_ST = 1 - H_W/H_T, where H_W is the mean within-population diversity and H_T is the total population diversity, various methods have been proposed for estimating F_ST from DNA sequences. Calculating F_ST with these methods raises several issues. First, K_ST, one way of measuring F_ST, is sensitive to sample size difference because its H_T is strongly affected by the difference in sample sizes between populations. Second, it is not clear how robust F_ST is to missing base calls (missing values) in NGS-based data. Third, the effect of minor allele frequency (MAF) on the statistic needs to be corrected for, because the upper bound of F_ST is an increasing function of MAF. To address these problems, we devised three methods of calculating F_ST: T1, T2, and T3. For T1, H_W and H_T are calculated from mean pairwise sequence differences, but only after subsampling to make the sample sizes equal. For T2, we average F_ST values calculated at individual sites, where the allele frequencies used in H_T are obtained by giving equal weights to the populations. For T3, this method is modified so that mean heterozygosity is obtained first for each population and then for the total population. Applied to selective sweep scenarios, F_ST values calculated with T1, T2, and T3 are relatively constant compared with K_ST. The statistical power of these three methods is also relatively constant, whereas the power of K_ST varies with the sample size difference. In most cases, T1 and T3 show higher statistical power than T2. T1, T2, and T3 also perform better than K_ST when there are missing values.
Finally, in addressing the third problem of F_ST estimation, we found that the statistical power of F_ST increases when SNPs with low MAF are excluded from the calculation.
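
The abstract distinguishes averaging per-site F_ST values (the idea behind T2) from averaging heterozygosity across sites before forming the ratio (the idea behind T3). The sketch below illustrates only that distinction and is not the thesis's implementation. It assumes biallelic SNPs coded 0/1 in a haplotype-by-site matrix, NaN for missing calls, heterozygosity taken as 2p(1-p) without any small-sample correction, and two subpopulations weighted equally; the function names and the `min_maf` parameter are hypothetical.

```python
import numpy as np

def allele_frequencies(haps):
    """Per-site frequency of allele 1, computed over non-missing calls only."""
    return np.nanmean(np.asarray(haps, dtype=float), axis=0)

def heterozygosity(p):
    """Expected heterozygosity 2p(1-p) at a biallelic site (assumed form)."""
    return 2.0 * p * (1.0 - p)

def fst_site_mean_and_ratio(pop1, pop2, min_maf=0.0):
    """Return (mean of per-site F_ST, F_ST from site-averaged H_W and H_T).

    Sites that are monomorphic in the total population, or whose
    equal-weight minor allele frequency is below min_maf, are dropped.
    """
    p1 = allele_frequencies(pop1)
    p2 = allele_frequencies(pop2)
    p_total = 0.5 * (p1 + p2)              # equal weight per population
    maf = np.minimum(p_total, 1.0 - p_total)

    h_w = 0.5 * (heterozygosity(p1) + heterozygosity(p2))
    h_t = heterozygosity(p_total)

    keep = (h_t > 0) & (maf >= min_maf)
    if not keep.any():
        raise ValueError("no polymorphic sites pass the MAF filter")
    h_w, h_t = h_w[keep], h_t[keep]

    per_site_mean = float(np.mean(1.0 - h_w / h_t))        # T2-like
    ratio_of_means = float(1.0 - h_w.mean() / h_t.mean())  # T3-like
    return per_site_mean, ratio_of_means

# Toy data: 4 haplotypes vs 2 haplotypes, 2 sites, one missing call.
pop1 = np.array([[1, 0], [1, 0], [1, 1], [1, np.nan]])
pop2 = np.array([[0, 0], [0, 1]], dtype=float)
t2_like, t3_like = fst_site_mean_and_ratio(pop1, pop2)
print(t2_like, t3_like)
```

With these toy inputs the two estimates differ slightly (about 0.514 for the per-site mean and about 0.521 for the ratio of averages). Passing `min_maf=0.45` drops the second site, leaving only the fixed difference, so both estimates become 1.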