In each region, 4, haplotypes across kb were generated using the coalescent-based program ms [19]. The effective population size was set at 10,, the recombination rate per base pair (bp) per generation was set at 10⁻⁹, and SNPs were simulated in each region. For the human genome, recombination occurs at an average rate of about 10⁻⁸ per bp per generation [20].

Our recombination rate, 10⁻⁹ per bp per generation, is at the low end of the recombination rates in the human genome [18] and therefore represents stronger linkage disequilibrium (LD). We chose this rate because multi-marker approaches are primarily designed for strong-LD regions. In each chromosomal region, 2, diplotypes were generated by randomly pairing the 4, haplotypes. Then the 2, diplotypes of the first region were randomly paired with the 2, diplotypes of the second region to form 2, subjects.
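
The pairing procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the toy haplotype strings, the fixed seed, and the assumption that haplotypes are paired without replacement are all ours.

```python
import random

def pair_haplotypes(haplotypes, rng):
    """Randomly pair haplotypes into diplotypes (assumed: without replacement)."""
    if len(haplotypes) % 2 != 0:
        raise ValueError("need an even number of haplotypes")
    shuffled = list(haplotypes)
    rng.shuffle(shuffled)
    # Consecutive entries of the shuffled list form one diplotype.
    return [(shuffled[i], shuffled[i + 1]) for i in range(0, len(shuffled), 2)]

def build_subjects(region1_haps, region2_haps, rng):
    """Pair region-1 diplotypes with region-2 diplotypes to form subjects."""
    d1 = pair_haplotypes(region1_haps, rng)
    d2 = pair_haplotypes(region2_haps, rng)
    if len(d1) != len(d2):
        raise ValueError("both regions must yield the same number of diplotypes")
    rng.shuffle(d2)  # random pairing across the two regions
    return list(zip(d1, d2))

# Toy data (hypothetical): four haplotypes per region give two subjects.
rng = random.Random(0)
region1 = ["0101", "0011", "1100", "1001"]
region2 = ["111", "000", "101", "010"]
subjects = build_subjects(region1, region2, rng)
```

Each subject is a pair of diplotypes, one per region, so the two regions are independent of each other by construction.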

In this way, we generated datasets. We then considered nine disease models, listed in Additional file 1 together with the causal allele frequencies, the penetrance values of two-locus genotypes, and the marginal penetrance values of one-locus genotypes. Model 0 was used to evaluate type-I error rates, while the other eight models were used to evaluate power. Models exhibit interactions in the absence of main effects when genotypes conform to Hardy-Weinberg equilibrium.

We used these six disease models because they further challenge the ability of our method to discover joint associations, or 'interactions', between gene pairs in this situation. Models 7 and 8 exhibit both interactions and main effects. Model 7 is the jointly dominant-dominant model, which requires at least one copy of the disease allele at both loci for a subject to be affected [21, 22].
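
As an illustration of the jointly dominant-dominant structure of Model 7, the sketch below assigns disease status from two-locus genotypes. The actual penetrance values are given in Additional file 1 and are not reproduced here; the `risk` and `baseline` parameters and both function names are hypothetical.

```python
import random

def dominant_dominant_penetrance(g1, g2, risk, baseline=0.0):
    """Penetrance for a jointly dominant-dominant model.

    g1, g2: number of disease-allele copies (0, 1, or 2) at locus 1 and locus 2.
    risk and baseline are placeholder values, not those of the paper.
    """
    carries_both = g1 >= 1 and g2 >= 1
    return risk if carries_both else baseline

def simulate_status(g1, g2, risk, rng, baseline=0.0):
    """Draw an affected (True) or unaffected (False) status from the penetrance."""
    return rng.random() < dominant_dominant_penetrance(g1, g2, risk, baseline)
```

With `baseline=0.0`, a subject lacking the disease allele at either locus can never be affected, which matches the description of the model above.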

Model 8 has the same penetrance table as Model 3, but different causal allele frequencies. We deliberately let the causal allele frequency of one locus be smaller than that of the other.

We then used the H-clust method [23, 24] to choose tag SNPs with a subset formed by subjects randomly drawn from the pool of 2, subjects. After generating the phenotypes, the genotypes of the causal SNPs were removed from our datasets. We compared these with the HapForest approach [13]. HapForest is based on a tree structure, and is naturally suitable for analyzing interactions. Then HapForest was used to identify haplotypes and haplotype-haplotype interactions in association with the disease.

This method suggests potential epistasis among significant haplotypes. For HapForest, a rejection of the null hypothesis was defined as the identification of at least one significant haplotype from either of the two chromosomal regions. To calculate haplotype similarities from unphased multi-marker genotypes, we first inferred haplotype phases by the EM algorithm, using the function of 'haplo.

The obtained posteriors were then treated as weights, and all possible haplotype pairs were considered with their probabilities (see equation 2). All the haplotypes with frequencies less than 0. To avoid possible genotyping errors, we follow Sha et al. For example, Haplotype A is considered to be a rare haplotype because its frequency is less than 0. Haplotypes C and F are the most similar haplotypes to Haplotype A, both with a similarity of 0.

We merge Haplotype A with Haplotype C, the most similar haplotype with the highest frequency. We then update the haplotype data by replacing Haplotype A with Haplotype C. We recorded the minimum P value (P_min) among all 64 P values, and then adjusted P_min using the Šidák correction [28] with an effective number of tests, M_eff.
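
The rare-haplotype merging step can be sketched as below. Several details are assumptions made for illustration only: the similarity measure (proportion of matching alleles), the frequency threshold of 0.05, the toy frequencies, and the function names. The rule itself, merging a rare haplotype into its most similar haplotype and breaking ties by the highest frequency, follows the description above.

```python
def similarity(h1, h2):
    """Assumed similarity: proportion of marker positions with matching alleles."""
    matches = sum(a == b for a, b in zip(h1, h2))
    return matches / len(h1)

def merge_rare(freqs, threshold):
    """Merge each rare haplotype into its most similar haplotype.

    Ties in similarity are broken by choosing the haplotype with the
    highest frequency. Returns a new frequency dictionary.
    """
    freqs = dict(freqs)
    # Visit haplotypes from rarest to most common (list is built up front).
    for hap in sorted(freqs, key=freqs.get):
        if hap not in freqs or freqs[hap] >= threshold or len(freqs) == 1:
            continue
        others = [h for h in freqs if h != hap]
        target = max(others, key=lambda h: (similarity(hap, h), freqs[h]))
        freqs[target] += freqs.pop(hap)
    return freqs

# Toy example (hypothetical frequencies and threshold).
freqs = {"0000": 0.02, "0001": 0.40, "1000": 0.30, "1111": 0.28}
merged = merge_rare(freqs, threshold=0.05)
```

Here "0000" is equally similar to "0001" and "1000", so it is merged into "0001", the more frequent of the two.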

We then evaluated the validity and power of the eight tests with the datasets. For each dataset, we recorded the P values of 50 repetitions (so there were 15, P values in total); in each repetition, P values were obtained with 1, permutations. Given a significance level, the type-I error rate (under Model 0) or the power (under the other eight models) was the proportion of P values smaller than the significance level.
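
The empirical type-I error rate or power defined above is simply a rejection proportion. A minimal sketch, with a hypothetical function name:

```python
def rejection_rate(p_values, alpha):
    """Proportion of P values strictly smaller than the significance level alpha.

    Under the null model this estimates the type-I error rate;
    under an alternative model it estimates the power.
    """
    if not p_values:
        raise ValueError("p_values must not be empty")
    return sum(p < alpha for p in p_values) / len(p_values)
```

For example, `rejection_rate([0.01, 0.2, 0.04, 0.5], 0.05)` counts two of four P values below 0.05.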

The effective number of tests, M_eff, was estimated by the eigenvalue-based approach [29, 30]. Then the eigenvalues of this correlation matrix were calculated to estimate the effective number of tests (see [29, 30], or [31] for a review). In Additional file 1, Model 0 (disease status independent of the composite genotypes) was used to evaluate the type-I error rates. This model represents our null hypothesis: no main effects and no interactions. In this model, the penetrance of each composite genotype was set to be 0.
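
A sketch of how a minimum P value might be adjusted with an eigenvalue-based effective number of tests. The Šidák formula, 1 - (1 - P_min)^M_eff, is standard. The particular estimator below follows the form proposed by Li and Ji, which is one of several eigenvalue-based estimators; whether it is the one used in [29, 30] is an assumption. The eigenvalues are taken as input (in practice they would come from an eigendecomposition of the correlation matrix of the tests).

```python
import math

def effective_tests(eigenvalues):
    """Li-and-Ji-style estimate of the effective number of tests (assumed form).

    Each eigenvalue contributes 1 if its absolute value is at least 1,
    plus the fractional part of its absolute value.
    """
    total = 0.0
    for lam in eigenvalues:
        x = abs(lam)
        total += (1.0 if x >= 1.0 else 0.0) + (x - math.floor(x))
    return total

def sidak_adjust(p_min, m_eff):
    """Sidak-adjusted P value for the minimum of m_eff effective tests."""
    return 1.0 - (1.0 - p_min) ** m_eff
```

Two sanity checks on the estimator: independent tests (all eigenvalues equal to 1) give M_eff equal to the number of tests, and perfectly correlated tests (one eigenvalue equal to the number of tests, the rest 0) give M_eff equal to 1.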

The sample size was set at subjects, of which half were cases and half were controls. HapForest reported P values as 1.

Figure: Type-I error rates under different nominal significance levels. The x-axis is the nominal significance level, and the y-axis is the type-I error rate.

For all models except Models 2 and 7, the total sample size was set at 1, subjects, of which half were cases and half were controls.

## Discovering joint associations between disease and gene pairs with a novel similarity test

For Models 2 and 7, the total sample size was set at and 50, respectively. If the sample size were also set at 1, for Models 2 and 7, the power of these tests would all be close to 1. Therefore, we chose two smaller sample sizes to explore the power differences between these tests effectively. We first define two scores to distinguish the two situations. We estimated haplotype frequencies based on all 2, subjects in a dataset when calculating the scores of SC 1 or SC 2.

While the score of SC 2 is designed for Models and 7, SC 1 is designed for Models 5, 6, and 8 because of their relatively low causal allele frequencies.

Figure: Power under different significance levels. The x-axis is the significance level, and the y-axis is power. The top row is for disease mutants introduced at rare haplotypes; the bottom row, at common haplotypes.

HapForest suggests potential epistasis among significant haplotypes. At each step, it builds a classifier that optimally distinguishes cases from controls based on haplotype data.

Let S be the judgement of similarity of X between words W1 and W2. The questions we are interested in are:

This model takes texts as input and returns word similarities. It should be cognitively plausible for both inputs and outputs: first, the amount of input text should be consistent with the quantity of written material people are exposed to; second, the measure of similarity between words should correspond to human judgments of semantic similarity.

For all these reasons, we relied on the Latent Semantic Analysis (LSA) model of word meaning acquisition and representation. LSA takes as input a large corpus of texts and, after determining the statistical context in which each word occurs, represents each word meaning as a high-dimensional vector, usually composed of several hundred dimensions. As opposed to complex symbolic structures, a vector representation is well suited to comparing objects, since it is straightforward to define a similarity measure.

The cosine is a usual measure for this: the higher the cosine, the greater the similarity. This is what allows children to progressively understand the meaning of many words by coming across them in various contexts while reading.
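
The cosine between two word vectors can be computed as in the sketch below; the function name and the choice to raise an error on zero vectors are ours.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

Vectors pointing in the same direction have a cosine of 1 regardless of their lengths, and orthogonal vectors have a cosine of 0.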

In LSA, the unit of context is the paragraph. Therefore, LSA first counts the number of occurrences of each word in each paragraph. Words are then represented as vectors.

For instance, if the corpus contains , paragraphs, the word tree may be given the following representation, composed of , numbers: This means that tree occurs once in the third paragraph, twice in the fifth, etc. However, this representation is very noisy and dependent on the writers' idiosyncrasies. LSA reduces this large amount of information so as to keep only the most salient part.
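
The raw occurrence representation described above can be sketched as follows. The toy paragraphs, the function name, and the naive tokenization (lowercasing and splitting on whitespace) are assumptions for illustration.

```python
from collections import Counter

def occurrence_vector(word, paragraphs):
    """Return how often `word` occurs in each paragraph (one count per paragraph)."""
    vector = []
    for paragraph in paragraphs:
        # Naive tokenization: lowercase and split on whitespace (an assumption).
        counts = Counter(paragraph.lower().split())
        vector.append(counts[word])  # a Counter returns 0 for absent words
    return vector

# Toy corpus (hypothetical): three one-sentence paragraphs.
paragraphs = [
    "the tree is tall",
    "a dog barked",
    "tree after tree lined the road",
]
tree_vector = occurrence_vector("tree", paragraphs)
```

In this toy corpus the vector for tree has one entry per paragraph: it occurs once in the first paragraph, not at all in the second, and twice in the third.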

The previous vectors are then represented in an occurrence matrix, from which singular values are extracted. Basically, singular values represent the strength of the previous dimensions.
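
To make the notion of a singular value concrete, the sketch below extracts only the largest singular value of a small occurrence matrix by power iteration. This is a simplified stand-in, not how LSA is implemented in practice: a real system would use a library routine for a truncated singular value decomposition and keep several hundred dimensions. The function name, the iteration count, and the all-equal starting vector are assumptions.

```python
import math

def leading_singular_triplet(matrix, iterations=200):
    """Approximate the largest singular value of `matrix` and its vectors.

    matrix: list of rows (e.g. words) over columns (e.g. paragraphs).
    Uses power iteration on A^T A, starting from a vector of equal entries.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(iterations):
        # av = A v, then w = A^T (A v)
        av = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
        w = [sum(matrix[i][j] * av[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("matrix has no non-zero singular value")
        v = [x / norm for x in w]
    av = [sum(row[j] * v[j] for j in range(n_cols)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in av))  # singular value
    u = [x / sigma for x in av]                # left singular vector
    return sigma, u, v
```

The returned `sigma` measures how much of the matrix is captured by the single strongest dimension; the associated vectors `u` and `v` give the coordinates of the rows and columns along that dimension.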