Phasing of Unphased Genotype Data

Published: December 5, 2017

Assignee: not available in the USPTO data we have

Inventors: Chuong Do, Eric Durand, John Michael Macpherson

Technical Abstract

Patent Claims

21 claims

Legal claims defining the scope of protection. Each claim is shown in both the original legal language and a plain English translation.

Claim 1

Original Legal Text

1. A computer-implemented method for performing out-of-sample phasing of unphased genotype data of a chromosome pair of a first individual, comprising: under control of one or more computer systems configured with executable instructions, (a) providing a predetermined reference haplotype graph generated from phased genotype data for a set of L different polymorphic genetic markers of a chromosome pair of a plurality of reference individuals, wherein each different polymorphic genetic marker of the set of L different polymorphic genetic markers is located at an associated polymorphic locus on each chromosome of the chromosome pair, wherein L is an integer, the chromosome pair is one pair of human autosomal chromosomes or one pair of human X chromosomes, the plurality of reference individuals comprises at least 100,000 individuals, and the first individual is not included in the plurality of reference individuals, and wherein the predetermined reference haplotype graph comprises: a plurality of nodes organized into L+1 levels, the plurality of nodes comprising a first node, a plurality of intermediate nodes, and a terminal node, and a plurality of edges, each edge of the plurality of edges connecting two nodes of the plurality of nodes, wherein all edges that emanate from a node at a first level lead to one or more nodes at a second, next successive, level and represent one polymorphic locus at a first location on each chromosome of the chromosome pair of the plurality of reference individuals, and all edges that emanate from the one or more nodes at the second, next successive, level represent one polymorphic locus at a second location on each chromosome of the chromosome pair of the plurality of reference individuals, the second location being different from the first location and following successively the first location on each chromosome of the chromosome pair, wherein each edge has an associated probability of a particular allele being present at the one 
polymorphic locus of the chromosome pair of the plurality of reference individuals represented by each such edge; (b) receiving unphased genotype data of the first individual for the chromosome pair, the unphased genotype data comprising unphased genotype data for the L different polymorphic genetic markers of the set of L different polymorphic genetic markers; and (c) performing out-of-sample phasing on the unphased genotype data of the chromosome pair of the first individual received in (b) using the predetermined reference haplotype graph, wherein performing out-of-sample phasing comprises performing dynamic programming which comprises: (1) searching the predetermined reference haplotype graph for a plurality of possible paths through the predetermined reference haplotype graph, each possible path representing a possible haplotype for a chromosome of the chromosome pair of the first individual given the unphased genotype data of the first individual received in (b), wherein each possible path begins on the first node, ends on the terminal node, traverses intermediate nodes and edges between the first node and terminal node, and does not traverse any node more than once, and wherein a probability of each possible path is based on the associated probabilities of all edges in that possible path; and (2) identifying two possible paths of the plurality of possible paths of (c)(1) for which (i) a combination of alleles present at each of the polymorphic loci represented by the identified two paths is consistent with alleles present at each of the corresponding polymorphic loci of the unphased genotype data of the chromosome pair of the first individual and (ii) a product of the probability of each of the identified two possible paths having a combination of alleles as recited in (i) is greater than a product of the probability of each of any other two possible paths having the combination of alleles as recited in (i), wherein the identified two possible paths 
represent a most likely pair of haplotypes for the chromosome pair of the first individual, whereby the unphased genotype data of the chromosome pair of the first individual is phased.

Plain English Translation

A computer system uses a reference haplotype graph to determine the most likely pair of haplotypes for an individual from their unphased genotype data (genetic data in which it is unknown which chromosome of a pair carries each allele). The system first provides a pre-built graph. This graph, constructed from the phased genotypes of at least 100,000 reference individuals who do not include the individual being phased, represents possible allele combinations across a set of L genetic markers on one pair of autosomes or X chromosomes. The graph has nodes organized into L+1 levels, with edges connecting nodes at successive levels. Each edge represents an allele at a specific genetic marker and carries the probability of that allele. The system receives the individual's unphased genotype data for the same genetic markers. It then uses dynamic programming to search the graph for paths that run from the first node to the terminal node without visiting any node twice, and identifies the two paths that (1) are together consistent with the individual's unphased genotype data and (2) have a higher product of path probabilities than any other consistent pair of paths. These two paths represent the most likely haplotypes for the individual.
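The objective in claim 1 can be illustrated with a small brute-force sketch. It is not the dynamic programming the claim recites; it simply enumerates every path to make the selection rule concrete. The toy graph, the 0/1 allele coding, the genotype encoding (count of allele 1 per marker), and the names GRAPH, enumerate_paths, and best_pair are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical toy graph with L = 3 markers (4 levels).
# GRAPH[k] lists the edges leaving level k as
# (source node, destination node, allele, probability).
GRAPH = [
    [(0, 0, 0, 0.6), (0, 1, 1, 0.4)],
    [(0, 0, 0, 0.7), (0, 1, 1, 0.3), (1, 0, 0, 0.1), (1, 1, 1, 0.9)],
    [(0, 0, 0, 1.0), (1, 0, 1, 1.0)],
]


def enumerate_paths(graph):
    """Yield (haplotype, probability) for every first-node-to-terminal path."""
    def walk(level, node, alleles, prob):
        if level == len(graph):
            yield tuple(alleles), prob
            return
        for src, dst, allele, p in graph[level]:
            if src == node:
                yield from walk(level + 1, dst, alleles + [allele], prob * p)

    yield from walk(0, 0, [], 1.0)


def best_pair(graph, genotype):
    """Return (hap1, hap2, probability) for the most probable consistent pair.

    genotype[k] is assumed to be the number of copies of allele 1 at marker k.
    Returns None when no pair of paths is consistent with the genotype.
    """
    paths = list(enumerate_paths(graph))
    best = None
    for (h1, p1), (h2, p2) in product(paths, repeat=2):
        consistent = all(a + b == g for a, b, g in zip(h1, h2, genotype))
        if consistent and (best is None or p1 * p2 > best[2]):
            best = (h1, h2, p1 * p2)
    return best
```

For a genotype that is heterozygous at all three markers, best_pair(GRAPH, [1, 1, 1]) selects the haplotypes 000 and 111, whose path probabilities (about 0.42 and 0.36) give the largest product among the consistent pairs. Enumeration grows exponentially with the number of markers, which is why the claim relies on dynamic programming instead.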

Claim 2

Original Legal Text

2. The method of claim 1, wherein the predetermined reference haplotype graph comprises a directed acyclic graph.

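Plain English Translation

The reference haplotype graph is a directed acyclic graph: every edge has a direction, and following the edges never leads back to a node that has already been visited.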

Claim 3

Original Legal Text

3. The method of claim 1, wherein the probability of a possible path of (c)(1) is a product of the associated probabilities of all edges in the possible path and each edge is a directed edge.

Plain English Translation

When determining the most likely haplotypes for an individual from their unphased genotype data, the probability of each path through the reference haplotype graph is calculated as the product of the probabilities of the edges that make up that path. Every edge in the graph is a directed edge.
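The product rule in claim 3 is straightforward to compute, but a path over many markers multiplies many numbers below 1. The sketch below shows the plain product next to a log-space variant; using logarithms is a common numerical practice and an assumption here, not something the claim states. The function names are hypothetical.

```python
import math


def path_probability(edge_probs):
    """Probability of a path as the product of its edge probabilities."""
    return math.prod(edge_probs)


def path_log_probability(edge_probs):
    """Log of the same product, computed as a sum of logs.

    Ranking paths by this value gives the same order as ranking by the
    product, because log is monotonic. Edge probabilities must be > 0.
    """
    return sum(math.log(p) for p in edge_probs)
```

A path of 2,000 edges that each have probability 0.5 has a product smaller than the smallest positive double-precision number, so path_probability returns 0.0, while path_log_probability returns a finite value near -1386.3 that can still be compared with other paths.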

Claim 4

Original Legal Text

4. The method of claim 1, wherein the predetermined reference haplotype graph permits recombination.

Plain English Translation

The reference haplotype graph permits recombination, the exchange of genetic material between paired chromosomes during meiosis. This lets the graph represent haplotypes that combine segments of different reference haplotypes.

Claim 5

Original Legal Text

5. The method of claim 4, wherein the predetermined reference haplotype graph includes a recombination edge at a polymorphic locus, the recombination edge corresponding to a recombination event that is not represented in the phased genotype data of the plurality of reference individuals, thereby providing an additional path that includes the recombination edge in the predetermined reference haplotype graph.

Plain English Translation

To account for recombination, the reference haplotype graph includes a "recombination edge" at a genetic marker. This edge corresponds to a recombination event that is not represented in the phased data of the reference individuals used to build the graph, and it adds a path through the graph that the reference data alone would not provide.
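One way to picture a recombination edge is as an extra edge that connects a node to a next-level node it never reached in the reference data. The sketch below is an assumption-laden illustration: the claims do not say how recombination edges are placed or weighted, so the uniform split of a small rate across the missing targets, and the function name add_recombination_edges, are hypothetical.

```python
def add_recombination_edges(level_edges, rate):
    """Return a new edge list for one level with recombination edges added.

    level_edges: list of (source, destination, allele, probability).
    rate: total probability mass moved from a source node's observed edges
          to its newly added recombination edges (an assumed parameter).
    """
    targets = sorted({(dst, allele) for _, dst, allele, _ in level_edges})
    sources = sorted({src for src, _, _, _ in level_edges})
    result = []
    for src in sources:
        observed = {(dst, allele): p
                    for s, dst, allele, p in level_edges if s == src}
        missing = [t for t in targets if t not in observed]
        # Shrink observed edges only if this source gains new edges,
        # so outgoing probabilities still sum to the original total.
        scale = 1.0 - rate if missing else 1.0
        for (dst, allele), p in observed.items():
            result.append((src, dst, allele, p * scale))
        for dst, allele in missing:
            result.append((src, dst, allele, rate / len(missing)))
    return result
```

For a level in which node 0 only leads to node 0 and node 1 only leads to node 1, calling the function with rate 0.01 adds the two crossing edges with probability 0.01 each and scales the observed edges to about 0.99.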

Claim 6

Original Legal Text

6. The method of claim 5, wherein the predetermined reference haplotype graph accounts for genotyping error in the unphased genotype data of the chromosome pair of the first individual.

Plain English Translation

The reference haplotype graph, which includes recombination edges for recombination events not represented in the phased genotype data of the reference individuals, also accounts for potential genotyping errors in the individual's unphased genotype data.

Claim 7

Original Legal Text

7. The method of claim 6, wherein the predetermined reference haplotype graph comprises at least one extra edge at a polymorphic locus, wherein the at least one extra edge represents an allele that corresponds to genotyping error in the unphased genotype data of the chromosome pair of the first individual, and an associated probability of the at least one extra edge is determined based on a rate of genotyping error of a genotyping technology used to obtain the unphased genotype data of the chromosome pair of the first individual.

Plain English Translation

To account for genotyping errors, the reference haplotype graph, which also includes recombination edges, contains at least one extra edge at a genetic marker. Each extra edge represents an allele that would correspond to an error in the individual's genotype data. The probability assigned to such an error edge is based on the error rate of the genotyping technology used to generate the individual's genotype data.
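The error edges described above can be sketched as parallel edges that carry the opposite allele. The claims tie the probability of an error edge to the genotyping error rate but give no formula, so the split used here (p * (1 - e) for the original edge, p * e for the error edge), the restriction to biallelic markers coded 0/1, and the name add_error_edges are assumptions.

```python
def add_error_edges(level_edges, error_rate):
    """Return edges for one level with an error edge beside each original.

    level_edges: list of (source, destination, allele, probability),
                 with alleles coded 0 or 1 (assumed biallelic markers).
    error_rate:  assumed per-marker error rate of the genotyping technology.
    Each output edge is (source, destination, allele, probability, is_error).
    """
    result = []
    for src, dst, allele, p in level_edges:
        # Original edge keeps most of its probability.
        result.append((src, dst, allele, p * (1.0 - error_rate), False))
        # Error edge: same endpoints, opposite allele, small probability.
        result.append((src, dst, 1 - allele, p * error_rate, True))
    return result
```

Because each original edge and its error edge together keep the original probability, the outgoing probabilities of every node are unchanged in total. A search over the augmented graph can then explain an observed genotype that no error-free pair of paths matches, at the cost of a small probability penalty.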

Claim 8

Original Legal Text

8. The method of claim 1, wherein the predetermined reference haplotype graph accounts for genotyping error in the unphased genotype data of the chromosome pair of the first individual.

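Plain English Translation

The reference haplotype graph accounts for potential genotyping errors in the individual's unphased genotype data.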

Claim 9

Original Legal Text

9. The method of claim 8, wherein the predetermined reference haplotype graph comprises at least one extra edge at a polymorphic locus, wherein the at least one extra edge represents an allele that corresponds to genotyping error in the unphased genotype data of the chromosome pair of the first individual, and an associated probability of the at least one extra edge is determined based on a rate of genotyping error of a genotyping technology used to obtain the unphased genotype data of the chromosome pair of the first individual.

Plain English Translation

To account for genotyping errors, the reference haplotype graph contains at least one extra edge at a genetic marker. Each extra edge represents an allele that would correspond to an error in the individual's genotype data. The probability assigned to such an error edge is based on the error rate of the genotyping technology used to generate the individual's genotype data.

Claim 10

Original Legal Text

10. The method of claim 9, wherein the probability of a possible path of (c)(1) is a product of the associated probabilities of all edges in the possible path and each edge is a directed edge.

Plain English Translation

When the haplotype graph includes extra edges representing potential genotyping errors, the probability of each path through the graph is calculated as the product of the probabilities of the edges that make up that path. Every edge in the graph is a directed edge.

Claim 11

Original Legal Text

11. The method of claim 1, wherein the predetermined reference haplotype graph has been pruned of at least one unlikely path through the predetermined reference haplotype graph, the at least one unlikely path having a probability that is less than a threshold value.

Plain English Translation

To improve efficiency, the reference haplotype graph has been pruned before use: at least one unlikely path, whose probability is below a threshold value, has been removed from the graph, which reduces the search space.
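A minimal sketch of pruning by threshold follows. It removes every edge whose best possible path (the most probable first-node-to-terminal path passing through that edge) falls below the threshold, so each removed path has a probability below the threshold. This edge-based strategy, the toy graph, and the name prune_graph are assumptions; the claim only states that at least one path below a threshold has been removed.

```python
# Hypothetical toy graph: GRAPH[k] lists edges leaving level k as
# (source node, destination node, allele, probability).
GRAPH = [
    [(0, 0, 0, 0.6), (0, 1, 1, 0.4)],
    [(0, 0, 0, 0.7), (0, 1, 1, 0.3), (1, 0, 0, 0.1), (1, 1, 1, 0.9)],
    [(0, 0, 0, 1.0), (1, 0, 1, 1.0)],
]


def prune_graph(graph, threshold):
    """Drop edges that lie only on paths with probability below threshold."""
    n = len(graph)
    # forward[k][node]: best probability of reaching node at level k.
    forward = [dict() for _ in range(n + 1)]
    forward[0][0] = 1.0
    for k, level in enumerate(graph):
        for src, dst, _, p in level:
            if src in forward[k]:
                cand = forward[k][src] * p
                if cand > forward[k + 1].get(dst, 0.0):
                    forward[k + 1][dst] = cand
    # backward[k][node]: best probability of reaching the terminal from node.
    backward = [dict() for _ in range(n + 1)]
    for node in forward[n]:
        backward[n][node] = 1.0
    for k in range(n - 1, -1, -1):
        for src, dst, _, p in graph[k]:
            if dst in backward[k + 1]:
                cand = p * backward[k + 1][dst]
                if cand > backward[k].get(src, 0.0):
                    backward[k][src] = cand
    # Keep an edge only if its best complete path meets the threshold.
    # Remaining probabilities are not renormalized in this sketch.
    return [
        [(s, d, a, p) for s, d, a, p in level
         if forward[k].get(s, 0.0) * p * backward[k + 1].get(d, 0.0) >= threshold]
        for k, level in enumerate(graph)
    ]
```

With a threshold of 0.05, the only edge removed from the toy graph is the level-1 edge from node 1 to node 0, because the single path through it has probability of about 0.04.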

Claim 12

Original Legal Text

12. The method of claim 11, wherein the predetermined reference haplotype graph (a) permits recombination and includes a recombination edge at a polymorphic locus, the recombination edge corresponding to a recombination event that is not represented in the phased genotype data of the plurality of reference individuals, thereby providing an additional path that includes the recombination edge in the predetermined reference haplotype graph, and (b) accounts for genotyping error in the unphased genotype data of the chromosome pair of the first individual by including at least one extra edge at a polymorphic locus, wherein the at least one extra edge represents an allele that corresponds to genotyping error in the unphased genotype data of the chromosome pair of the first individual, and an associated probability of the at least one extra edge is determined based on a rate of genotyping error of a genotyping technology used to obtain the unphased genotype data of the chromosome pair of the first individual.

Plain English Translation

In addition to having been pruned of unlikely paths below a probability threshold, the reference haplotype graph models two real-world genetic phenomena. First, it permits recombination by including a "recombination edge" for a recombination event not seen in the phased data of the reference individuals used to build the graph. Second, it accounts for potential genotyping errors by including at least one "error edge" whose probability is based on the error rate of the genotyping technology.

Claim 13

Original Legal Text

13. The method of claim 1, wherein the predetermined reference haplotype graph is represented in a compressed form comprising a plurality of segments, wherein each segment corresponds to a contiguous set of edges in the predetermined reference haplotype graph, an end of each segment has 0 or 1 branch, and no segment points to a middle position of another segment.

Plain English Translation

The reference haplotype graph is stored in a compressed form made up of segments, each corresponding to a contiguous set of edges in the graph. The end of each segment has 0 or 1 branch, and no segment points into the middle of another segment. A compact representation of this kind can reduce the storage the graph requires.

Claim 14

Original Legal Text

14. The method of claim 1, wherein performing dynamic programming comprises performing a Viterbi algorithm.

Plain English Translation

The dynamic programming used to find the most likely pair of paths through the haplotype graph is a Viterbi algorithm.
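A Viterbi-style search for the best pair of paths can be sketched by treating a pair of nodes, one per haplotype, as the state at each level. This is an illustrative sketch under stated assumptions, not the patented implementation: the toy graph, the 0/1 allele coding, the genotype encoding (count of allele 1 per marker), and the name viterbi_phase are hypothetical, and the partial haplotypes are stored in each state instead of using back-pointers to keep the code short.

```python
# Hypothetical toy graph: GRAPH[k] lists edges leaving level k as
# (source node, destination node, allele, probability).
GRAPH = [
    [(0, 0, 0, 0.6), (0, 1, 1, 0.4)],
    [(0, 0, 0, 0.7), (0, 1, 1, 0.3), (1, 0, 0, 0.1), (1, 1, 1, 0.9)],
    [(0, 0, 0, 1.0), (1, 0, 1, 1.0)],
]


def viterbi_phase(graph, genotype):
    """Return (probability, hap1, hap2) for the best consistent pair of paths.

    best[(n1, n2)] holds the most probable pair of partial paths that end at
    nodes n1 and n2 of the current level and match the genotype so far.
    Returns None when no pair of paths is consistent with the genotype.
    """
    best = {(0, 0): (1.0, (), ())}
    for level, g in zip(graph, genotype):
        nxt = {}
        for (n1, n2), (prob, h1, h2) in best.items():
            for s1, d1, a1, p1 in level:
                if s1 != n1:
                    continue
                for s2, d2, a2, p2 in level:
                    # Both alleles together must match the unphased genotype.
                    if s2 != n2 or a1 + a2 != g:
                        continue
                    cand = prob * p1 * p2
                    if cand > nxt.get((d1, d2), (0.0,))[0]:
                        nxt[(d1, d2)] = (cand, h1 + (a1,), h2 + (a2,))
        best = nxt
    if not best:
        return None
    return max(best.values(), key=lambda entry: entry[0])
```

On the toy graph, viterbi_phase(GRAPH, [1, 1, 1]) recovers the haplotypes 000 and 111 with a pair probability of about 0.1512. Only the best partial pair per state is kept at each level, so the work grows with the number of node pairs per level and the number of markers, not with the number of complete paths.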

Claim 15

Original Legal Text

15. The method of claim 1, wherein each different polymorphic genetic marker of the set of L polymorphic genetic markers is a polymorphic single nucleotide polymorphism (SNP) marker.

Plain English Translation

Each of the L genetic markers used in the haplotype graph is a single nucleotide polymorphism (SNP).

Claim 16

Original Legal Text

16. A computer-implemented method for performing out-of-sample phasing of unphased genotype data of a chromosome pair of a first individual, comprising: under control of one or more computer systems configured with executable instructions, (a) providing a predetermined reference haplotype graph generated from phased genotype data for a set of L different polymorphic genetic markers of a chromosome pair of a plurality of reference individuals, wherein each different polymorphic genetic marker of the set of L different polymorphic genetic markers is located at an associated polymorphic locus on each chromosome of the chromosome pair, wherein L is an integer, the chromosome pair is one pair of human autosomal chromosomes or one pair of human X chromosomes, the plurality of reference individuals comprises at least 100,000 individuals, and the first individual is not included in the plurality of reference individuals, and wherein the predetermined reference haplotype graph comprises: a plurality of nodes organized into L+1 levels, the plurality of nodes comprising a first node, a plurality of intermediate nodes, and a terminal node, and a plurality of edges, each edge of the plurality of edges connecting two nodes of the plurality of nodes, wherein all edges that emanate from a node at a first level lead to one or more nodes at a second, next successive, level and represent one polymorphic locus at a first location on each chromosome of the chromosome pair of the plurality of reference individuals, and all edges that emanate from the one or more nodes at the second, next successive, level represent one polymorphic locus at a second location on each chromosome of the chromosome pair of the plurality of reference individuals, the second location being different from the first location and following successively the first location on each chromosome of the chromosome pair, wherein each edge has an associated probability of a particular allele being present at the one 
polymorphic locus of the chromosome pair of the plurality of reference individuals represented by each such edge; (b) receiving unphased genotype data of the first individual for the chromosome pair, the unphased genotype data comprising unphased genotype data for the L different polymorphic genetic markers of the set of L different polymorphic genetic markers; and (c) performing out-of-sample phasing on the unphased genotype data of the chromosome pair of the first individual received in (b) using the predetermined reference haplotype graph, wherein performing out-of-sample phasing comprises performing dynamic programming which comprises searching the predetermined reference haplotype graph to identify two paths for which: (i) a combination of alleles present at each of the polymorphic loci represented by the identified two paths is consistent with alleles present at each of the corresponding polymorphic loci of the unphased genotype data of the chromosome pair of the first individual, and (ii) a product of the probability of each of the identified two paths is greater than a product of the probability of each of any other two paths having the combination of alleles as recited in (i), wherein each identified path begins on the first node, ends on the terminal node, traverses intermediate nodes and edges between the first node and terminal node, and does not traverse any node more than once, and wherein a probability of each identified path is based on the associated probabilities of all edges in that identified path, and wherein the identified two paths represent a most likely pair of haplotypes for the chromosome pair of the first individual; whereby the unphased genotype data of the chromosome pair of the first individual is phased.

Plain English Translation

A computer system uses a reference haplotype graph to determine the most likely pair of haplotypes for an individual from their unphased genotype data (genetic data in which it is unknown which chromosome of a pair carries each allele). The system first provides a pre-built graph based on phased genotypes of at least 100,000 reference individuals who do not include the individual being phased. This graph represents possible allele combinations across a set of L genetic markers on one pair of autosomes or X chromosomes and has nodes organized into L+1 levels, with edges connecting nodes at successive levels. Each edge represents an allele at a specific genetic marker and carries the probability of that allele. The system receives the individual's unphased genotype data for the same markers. It then uses dynamic programming to search the graph and identify two paths, each running from the first node to the terminal node without visiting any node twice, that (1) are together consistent with the individual's unphased genotype data and (2) have a higher product of path probabilities than any other consistent pair of paths. These two paths represent the most likely haplotypes for the individual.

Claim 17

Original Legal Text

17. The method of claim 16, wherein the predetermined reference haplotype graph permits recombination and includes a recombination edge at a polymorphic locus, the recombination edge corresponding to a recombination event that is not represented in the phased genotype data of the plurality of reference individuals, thereby providing an additional path that includes the recombination edge in the predetermined reference haplotype graph.

Plain English Translation

The reference haplotype graph, which is searched for the two most probable paths consistent with the individual's data, permits recombination. It includes a "recombination edge" at a genetic marker that corresponds to a recombination event not represented in the phased data of the reference individuals used to build the graph, and this edge provides an additional path through the graph.

Claim 18

Original Legal Text

18. The method of claim 17, wherein the predetermined reference haplotype graph accounts for genotyping error in the unphased genotype data of the chromosome pair of the first individual by including at least one extra edge at a polymorphic locus, wherein the at least one extra edge represents an allele that corresponds to genotyping error in the unphased genotype data of the chromosome pair of the first individual, and an associated probability of the at least one extra edge is determined based on a rate of genotyping error of a genotyping technology used to obtain the unphased genotype data of the chromosome pair of the first individual.

Plain English Translation

In addition to modeling recombination with a "recombination edge", the reference haplotype graph accounts for potential genotyping errors in the individual's unphased genotype data. It does so by including at least one "error edge" whose probability is based on the error rate of the genotyping technology used to obtain that data.

Claim 19

Original Legal Text

19. The method of claim 18, wherein the predetermined reference haplotype graph has been pruned of at least one unlikely path through the predetermined reference haplotype graph, the at least one unlikely path having a probability that is less than a threshold value.

Plain English Translation

The reference haplotype graph, which already accounts for recombination with a "recombination edge" and for genotyping errors with "error edges", has also been pruned: at least one unlikely path, whose probability is below a threshold value, has been removed from the graph, which reduces the search space.

Claim 20

Original Legal Text

20. The method of claim 16, wherein the dynamic programming includes pruning the predetermined reference haplotype graph of one or more unlikely paths that each have a probability that is less than a threshold value.

Plain English Translation

When dynamic programming is used to search the haplotype graph, the search itself prunes unlikely paths from the graph. Any path with a probability below a threshold value is removed, which reduces the work needed to find the most probable pair of paths.
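Pruning inside the search can be sketched by discarding, after each level, every partial state whose probability has already dropped below the threshold. Because extending a path can only multiply its probability by values of at most 1, every completion of a discarded state would also fall below the threshold. The sketch applies the threshold to the probability of a pair of partial paths; that choice, the toy graph, and the name phase_with_pruning are assumptions for illustration.

```python
# Hypothetical toy graph: GRAPH[k] lists edges leaving level k as
# (source node, destination node, allele, probability).
GRAPH = [
    [(0, 0, 0, 0.6), (0, 1, 1, 0.4)],
    [(0, 0, 0, 0.7), (0, 1, 1, 0.3), (1, 0, 0, 0.1), (1, 1, 1, 0.9)],
    [(0, 0, 0, 1.0), (1, 0, 1, 1.0)],
]


def phase_with_pruning(graph, genotype, threshold):
    """Pairwise Viterbi search that drops states below threshold per level.

    Returns (probability, hap1, hap2), or None if every consistent pair of
    paths was pruned or no consistent pair exists.
    """
    states = {(0, 0): (1.0, (), ())}
    for level, g in zip(graph, genotype):
        nxt = {}
        for (n1, n2), (prob, h1, h2) in states.items():
            for s1, d1, a1, p1 in level:
                for s2, d2, a2, p2 in level:
                    if s1 != n1 or s2 != n2 or a1 + a2 != g:
                        continue
                    cand = prob * p1 * p2
                    if cand > nxt.get((d1, d2), (0.0,))[0]:
                        nxt[(d1, d2)] = (cand, h1 + (a1,), h2 + (a2,))
        # Pruning step: keep only states at or above the threshold.
        states = {s: v for s, v in nxt.items() if v[0] >= threshold}
    return max(states.values(), key=lambda v: v[0]) if states else None
```

The choice of threshold is a trade-off. For the genotype [1, 0, 0] the only consistent pair on the toy graph has a probability of about 0.0168, so a threshold of 0.01 still finds it, while a threshold of 0.02 prunes it and the search returns None.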

Claim 21

Original Legal Text

21. A system for performing out-of-sample phasing of unphased genotype data of a chromosome pair of a first individual, comprising: one or more processors configured to: (a) provide a predetermined reference haplotype graph generated from phased genotype data for a set of L different polymorphic genetic markers of a chromosome pair of a plurality of reference individuals, wherein each different polymorphic genetic marker of the set of L different polymorphic genetic markers is located at an associated polymorphic locus on each chromosome of the chromosome pair, wherein L is an integer, the chromosome pair is one pair of human autosomal chromosomes or one pair of human X chromosomes, the plurality of reference individuals comprises at least 100,000 individuals, and the first individual is not included in the plurality of reference individuals, and wherein the predetermined reference haplotype graph comprises: a plurality of nodes organized into L+1 levels, the plurality of nodes comprising a first node, a plurality of intermediate nodes, and a terminal node, and a plurality of edges, each edge of the plurality of edges connecting two nodes of the plurality of nodes, wherein all edges that emanate from a node at a first level lead to one or more nodes at a second, next successive, level and represent one polymorphic locus at a first location on each chromosome of the chromosome pair of the plurality of reference individuals, and all edges that emanate from the one or more nodes at the second, next successive, level represent one polymorphic locus at a second location on each chromosome of the chromosome pair of the plurality of reference individuals, the second location being different from the first location and following successively the first location on each chromosome of the chromosome pair, wherein each edge has an associated probability of a particular allele being present at the one polymorphic locus of the chromosome pair of the plurality of reference 
individuals represented by each such edge; (b) receive unphased genotype data of the first individual for the chromosome pair, the unphased genotype data comprising unphased genotype data for the L different polymorphic genetic markers of the set of L different polymorphic genetic markers; and (c) perform out-of-sample phasing on the unphased genotype data of the chromosome pair of the first individual received in (b) using the predetermined reference haplotype graph, wherein to perform out-of-sample phasing comprises performing dynamic programming which comprises searching the predetermined reference haplotype graph to identify two paths for which: (i) a combination of alleles present at each of the polymorphic loci represented by the identified two paths is consistent with alleles present at each of the corresponding polymorphic loci of the unphased genotype data of the chromosome pair of the first individual, and (ii) a product of the probability of each of the identified two paths is greater than a product of the probability of each of any other two paths having the combination of alleles as recited in (i), wherein each identified path begins on the first node, ends on the terminal node, traverses intermediate nodes and edges between the first node and terminal node, and does not traverse any node more than once, and wherein a probability of each identified path is based on the associated probabilities of all edges in that identified path, and wherein the identified two paths represent a most likely pair of haplotypes for the chromosome pair of the first individual; whereby the unphased genotype data of the chromosome pair of the first individual is phased; and one or more memories coupled with the one or more processors, the one or more memories being configured to provide the one or more processors with instructions and wherein the one or more memories store the predetermined reference haplotype graph.

Plain English Translation

A computer system determines the most likely pair of haplotypes for an individual from their unphased genotype data using a reference haplotype graph. One or more processors are configured to provide a pre-built graph derived from phased genotypes of at least 100,000 reference individuals who do not include the individual being phased. The graph represents potential allele combinations across L genetic markers and consists of nodes in L+1 levels connected by edges. Each edge represents an allele at a marker and carries the probability of that allele. The processors receive the individual's unphased genotype data for the same markers. Dynamic programming is then used to identify the two paths that are together consistent with the individual's data and have a higher product of path probabilities than any other consistent pair of paths. One or more memories store the predetermined haplotype graph and provide instructions to the one or more processors.

Patent Metadata

Filing Date

Unknown

Publication Date

December 5, 2017

Inventors

Chuong Do

Eric Durand

John Michael Macpherson
