Thesis

Methods for phasing and imputation of very low coverage sequencing data

Abstract:

The introduction of massively parallel short-read sequencing has rapidly driven down the cost of DNA sequencing. This has led to substantial growth in the size of human sequencing projects, with low coverage sequencing consortia now containing tens of thousands of samples. However, current statistical methods for genotype calling from these data scale poorly with sample size and are infeasible to use on the largest current projects. This thesis explores the problem of genotype calling and phasing for large samples of low coverage sequencing data.

Current methods are applied to call and phase genotypes for the CONVERGE consortium, a data set consisting of very low coverage next-generation sequencing data collected from around 12,000 Chinese women. A genotyping accuracy of 92%, measured as squared Pearson correlation (R²) against a SNP genotyping chip, is achieved for minor allele frequencies >5%, demonstrating that very low coverage sequencing can be used instead of SNP genotyping chips to genotype a study of this size.
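As an illustration of the accuracy measure named above, the following is a minimal sketch of squared Pearson correlation between chip genotypes and sequencing-derived genotype dosages at a single site. The function name `squared_pearson_r` and the example values are hypothetical and are not taken from the thesis; the thesis may aggregate over sites or frequency bins differently.

```python
def squared_pearson_r(truth, estimate):
    """Squared Pearson correlation between two equal-length numeric sequences.

    truth:    e.g. chip genotypes coded as 0, 1 or 2 copies of the alternate allele
    estimate: e.g. genotype dosages (0.0 to 2.0) called from sequencing data
    """
    n = len(truth)
    if n != len(estimate) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")

    mean_t = sum(truth) / n
    mean_e = sum(estimate) / n

    # Sums of squares and cross-products around the means
    cov = sum((t - mean_t) * (e - mean_e) for t, e in zip(truth, estimate))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_e = sum((e - mean_e) ** 2 for e in estimate)

    if var_t == 0 or var_e == 0:
        # Correlation is undefined when either sequence has no variation
        raise ValueError("correlation undefined for a constant sequence")

    return (cov * cov) / (var_t * var_e)


# Hypothetical values for six samples at one site (illustration only)
chip_genotypes = [0, 1, 2, 0, 1, 0]
sequencing_dosages = [0.1, 0.9, 1.8, 0.0, 1.2, 0.2]
r2 = squared_pearson_r(chip_genotypes, sequencing_dosages)
```

Because the measure is a squared correlation, it lies between 0 and 1 and is undefined at sites where either set of genotypes shows no variation.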

A new statistical model is described that allows genotype calling and phasing of low coverage sequencing data in O(N log N) time, where N is the sample size, which greatly improves run time compared to current methods. Other adaptations of the model, including a GPU implementation, are also presented.

The new statistical model is used to call and phase genotypes from the Haplotype Reference Consortium (HRC), the largest collection of low coverage sequencing data in the world (about 32,000 Europeans). At a non-reference allele frequency of 0.1%, the HRC haplotypes provide a downstream imputation accuracy (R²) of up to 64%, compared to an R² of 36% when using 1000 Genomes Phase 3 haplotypes, the largest publicly available collection of haplotypes derived from low coverage sequencing.

Finally, a web server has been written to allow small numbers of high coverage whole genome sequenced samples to be phased using the HRC panel. The HRC panel is only available to HRC consortium members, but this web server allows the academic community to gain access to the HRC panel for phasing their own samples.

Authors


Division:
MSD
Department:
Doctoral Training Centre - MSD
Role:
Author

Contributors

Role:
Supervisor


Funding agency for:
Kretzschmar, WW
Grant:
WT097307


DOI:
Type of award:
DPhil
Level of award:
Doctoral
Awarding institution:
University of Oxford


Language:
English
Keywords:
Subjects:
UUID:
uuid:19ce0c8e-1d65-44ff-b56c-77ee849b2167
Deposit date:
2017-01-09
