Bioinformatics

SoloDel: A probabilistic model for detecting low-frequent somatic deletions from unmatched sequencing data

Kim, J., Kim, S., Nam, H., Kim, S., Lee, D.

Motivation: Finding somatic mutations from massively parallel sequencing data is becoming a standard process in genome-based biomedical studies. A number of robust methods have been developed for detecting somatic single nucleotide variations (SNVs). However, detection of somatic copy number alterations (SCNAs) has been substantially less explored and remains vulnerable to two frequently raised sampling issues: low frequency in the cell population and the absence of matched control samples.

Results: We developed a novel computational method, SoloDel, that accurately distinguishes low-frequency somatic deletions from germline ones with or without matched control samples. We first constructed a probabilistic somatic mutation progression model that describes the occurrence and propagation of the event in the cellular lineage of the sample. We then built a Gaussian mixture model to represent the mixed population of somatic and germline deletions. Parameters of the mixture model are estimated with the expectation-maximization (EM) algorithm from the observed distribution of read-depth ratios at the positions of initial deletion calls based on discordant reads. Combined with a conventional structural variation caller, SoloDel greatly increased the accuracy of classifying somatic mutations. Even without a control, SoloDel maintained comparable performance across a wide range of mutated subpopulation sizes (10% to 70%). SoloDel also successfully recalled experimentally validated somatic deletions from previously reported neuropsychiatric whole-genome sequencing data.
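
The mixture-model step can be illustrated with a minimal sketch. The Python code below fits a two-component, one-dimensional Gaussian mixture to read-depth ratios with EM and returns, for each deletion call, the posterior probability of belonging to each component. It is not the SoloDel implementation and omits the mutation progression model; the function name `fit_two_component_gmm`, the initial means, and the synthetic ratios in the demo are assumptions chosen for illustration only.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_component_gmm(ratios, init_means=(0.5, 0.9), init_sd=0.1, n_iter=200):
    """Fit a 2-component 1-D Gaussian mixture to read-depth ratios with EM.

    Component 0 is initialised near 0.5 and component 1 near 0.9; these
    starting values are illustrative assumptions, not values from the paper.
    Returns (weights, means, sds, responsibilities).
    """
    means = list(init_means)
    sds = [init_sd, init_sd]
    weights = [0.5, 0.5]
    n = len(ratios)
    resp = [[0.5, 0.5] for _ in ratios]

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each ratio.
        resp = []
        for x in ratios:
            p = [weights[k] * normal_pdf(x, means[k], sds[k]) for k in range(2)]
            total = p[0] + p[1]
            if total == 0.0:
                resp.append([0.5, 0.5])  # both densities underflowed
            else:
                resp.append([p[0] / total, p[1] / total])

        # M-step: re-estimate weight, mean and sd of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # empty component: keep previous parameters
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, ratios)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, ratios)) / nk
            sds[k] = max(math.sqrt(var), 1e-3)  # floor avoids a collapsed component

    return weights, means, sds, resp


if __name__ == "__main__":
    # Synthetic ratios (assumed values): one cluster near 0.5, one near 0.85.
    rng = random.Random(0)
    data = [rng.gauss(0.5, 0.05) for _ in range(300)]
    data += [rng.gauss(0.85, 0.05) for _ in range(100)]
    weights, means, sds, resp = fit_two_component_gmm(data)
    print("weights:", weights)
    print("means:", means)
    # Calls whose posterior for component 1 exceeds 0.5 fall in the higher-ratio cluster.
    print("calls in component 1:", sum(1 for r in resp if r[1] > 0.5))
```

In this sketch, the component with the higher mean read-depth ratio stands in for low-frequency somatic deletions, on the assumption that a deletion carried by only a fraction of cells removes less read depth than a heterozygous germline deletion present in every cell.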

Availability and implementation: A Java-based implementation of the method is available at http://sourceforge.net/projects/solodel/

Contact: kimjh@biosoft.kaist.ac.kr