The localized haplotype-cluster model uses variable-order Markov chains to build an empirical model of haplotype probabilities that adapts to the changing structure of linkage disequilibrium (LD) across the genome. By clustering haplotypes according to the Markov property, the model exploits conditional independencies to improve estimates of haplotype frequencies while still respecting the dependencies induced by LD. We introduce a method for training such models with regularized likelihood functions to prevent overfitting, together with a cross-validation procedure for selecting the regularization parameter; this procedure accounts for the high probability of encountering out-of-sample haplotypes that the model does not accommodate. When applied to dense single-nucleotide polymorphism (SNP) markers from population data, our method obtains a better-fitting and more parsimonious model than the leading method.
In addition, we note that these models represent a variable-order Markov chain defined in a single direction along the genome, and so ignore LD structure that could be represented by conditional dependencies in the opposite direction. As a result, fitting the model to the same data in the reverse direction along the genome usually yields different haplotype frequency estimates, an undesirable property for a genomic model. We investigate a method for reconciling two localized haplotype-cluster models fit in opposite directions along the genome, which uses the differing LD structure represented in the two models to derive a new bidirectional model. Preliminary results indicate that these bidirectional models often collapse when fit to a large number of markers, so we discuss alternatives that incorporate bidirectionality into the initial model training. We also consider modeling LD with discrete graphical models that do not rely on directional conditional probabilities.
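
The direction dependence described above can be illustrated with a minimal sketch. It does not implement the localized haplotype-cluster model itself: as a simplifying assumption, it substitutes a count-threshold variable-order Markov chain, in which each marker is conditioned on the longest preceding context observed at least `min_count` times. The function names `fit` and `prob`, the parameters `max_order` and `min_count`, and the toy haplotypes are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def fit(haps, max_order=2):
    """Count, per marker, every context of length 0..max_order and each
    (context, allele) pair. Haplotypes are equal-length strings of alleles."""
    ctx_counts, pair_counts = Counter(), Counter()
    for h in haps:
        for m, allele in enumerate(h):
            for k in range(min(m, max_order) + 1):
                ctx = (m, h[m - k:m])  # marker index plus the k preceding alleles
                ctx_counts[ctx] += 1
                pair_counts[ctx, allele] += 1
    return ctx_counts, pair_counts

def prob(h, model, max_order=2, min_count=3):
    """Probability of haplotype h under the simplified variable-order chain."""
    ctx_counts, pair_counts = model
    p = 1.0
    for m, allele in enumerate(h):
        # Back off to the longest context seen at least min_count times;
        # the empty context (k = 0) is always accepted.
        for k in range(min(m, max_order), -1, -1):
            ctx = (m, h[m - k:m])
            if k == 0 or ctx_counts[ctx] >= min_count:
                break
        p *= pair_counts[ctx, allele] / ctx_counts[ctx]
    return p

# Toy sample of ten three-marker haplotypes (illustrative data, not real genotypes).
haps = ["000"] * 4 + ["001"] + ["011"] * 3 + ["101"] + ["110"]

forward_model = fit(haps)
reverse_model = fit([h[::-1] for h in haps])  # same data, read in the opposite direction

target = "001"
p_forward = prob(target, forward_model)
p_reverse = prob(target[::-1], reverse_model)
print(f"forward: {p_forward:.4f}  reverse: {p_reverse:.4f}")  # roughly 0.1000 vs 0.1667
```

In this toy sample the forward chain conditions the third marker on both preceding alleles, while the reverse chain backs off to a single allele because the longer reverse context is too rare. The two fits therefore assign different frequencies to the same haplotype, even though each defines a valid probability distribution.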