INPUT FILES
===========

Mach 1.0 needs Merlin format data and pedigree files as input.

The data file should look like this:

   M marker1
   M marker2
   ...

The pedigree file should list one individual per row. Each row should
start with a family id and an individual id, followed by a father id and
a mother id (both should be 0, 'zero', since mach1 assumes individuals
are unrelated), and sex. These initial columns are followed by a series
of marker genotypes, each with two alleles. Alleles can be coded as
1, 2, 3, 4 or A, C, T, G.

For example:

   FAM1001 ID1234 0 0 M 1 1 1 2 2 2
   FAM1002 ID1234 0 0 F 1 2 2 2 3 3

Or:

   FAM1001 ID1234 0 0 M A A A C C C
   FAM1002 ID1234 0 0 F A C C C G G

USING MACH 1.0 for HAPLOTYPING
==============================

To use Mach 1.0 to haplotype a sample of unrelated individuals, you'll
need a Merlin format pedigree file and data file (the names of these
files are specified with the --pedfile and --datfile command line
options, abbreviated as -p and -d, respectively). You should make sure
that markers are ordered according to their physical position and use
the --phase command line option to request the output of phased
chromosomes.

The key parameters for managing the quality of inferred haplotypes and
the amount of computational effort expended in generating them are the
--rounds and --states parameters. If missing data is not distributed
evenly among the available individuals, you should also consider the
--weighted parameter (which favors using individuals with more genotype
data as templates for haplotyping other individuals).

The parameter --rounds K specifies how many iterations of the Markov
sampler should be run. If there isn't much missing data, a value of 50
should give a reasonable solution. Larger values will provide even
better solutions, at the cost of longer run times.

The parameter --states K specifies how many haplotypes should be
considered when updating each individual.
Larger values will generate more accurate solutions, but will slow
things down and require more memory. A value of 200 or larger typically
provides quite good solutions. The default is to use all available
haplotypes for each update (but this can require a lot of memory and
time!).

Other important parameters are --compact (reduces memory use) and
--poll K (to request intermediate solutions every K iterations).

Example Usage:

   mach -d sample.dat -p sample.ped --rounds 50 --states 200 --phase

USING MACH 1.0 to INFER UNTYPED MARKERS
=======================================

To use Mach 1.0 to infer genotypes at untyped markers, you should use
the --geno command line option. Genotypes at untyped markers for each
individual are inferred by comparing the available genotypes to those in
other individuals that have been typed at higher density. Individuals
typed at high density will often come from public resources, such as the
HapMap.

There are two main strategies for imputation:

INCLUDE REFERENCE (e.g. HAPMAP) GENOTYPES IN YOUR DATASET:

A simple way to infer missing genotypes is to create one large pooled
dataset. Some individuals will have missing data and others will have
much more complete genotyping information.

In addition to estimating the most likely genotype for each individual,
you can use the --dosage and --quality command line options to request
additional information about each inferred genotype.

USE REFERENCE (e.g. HAPMAP) HAPLOTYPES AS INPUT:

If you select this option, you should generate a file that includes a
set of reference haplotypes. These can be typed at more markers than are
available in your sample. You will also need a small file that lists all
the markers that appear in the phased haplotypes. Then, to estimate
missing genotypes, you'll need to provide the Merlin format data and
pedigree files, the reference haplotypes and the list of SNPs in the
reference haplotypes.
The reference haplotype and SNP list files are named on the command line
through the --haps and --snps options (these can be abbreviated as -h
and -s, respectively).

It is very important to ensure that alleles are labelled consistently in
your sample and in the reference panel. Mach 1.0 will automatically warn
you about alleles whose frequencies differ greatly between your sample
and the reference panel or that have different allele names in the two
subsets of data. However, these checks will not catch all inconsistently
labelled alleles. If you use the --autoFlip option, Mach 1.0 will try to
automatically resolve problems with alleles that are inconsistently
labelled in your sample and the reference panel (by flipping strands and
dropping markers where this trivial solution does not help).

Most of the time, you'll get good estimates of genotypes at untyped
markers using the --rounds N and --greedy options. If you don't use the
--greedy option, you can control computational effort with the
--weighted and --states options. However, this alternative strategy
generally requires quite a few more iterations before converging to a
good solution.

Examples:

   mach -d sample.dat -p sample.ped -h hapmap.haplos -s hapmap.snps --rounds 50 --greedy --geno

   mach -d sample.dat -p sample.ped -h hapmap.haplos -s hapmap.snps --rounds 500 --states 200 --geno

   mach -d sample.dat -p sample.ped -h hapmap.haplos -s hapmap.snps --rounds 500 --states 200 --weighted --geno

SPEEDING UP IMPUTATION

The standard genotype imputation approach, described in the preceding
section, works best when you execute a large number of iterations of the
Markov Chain (50-100). These iterations are used to simultaneously
update the crossover map (which determines the likely locations for
haplotype transitions), to update the error rate map (which flags
unusual markers), and to estimate the missing genotypes.
An alternative approach is to use a single set of estimates for the
crossover and error rate maps and, conditional on these, to find the
most likely genotypes. This approach seems to work quite well. To use
it, specify estimates of crossover and error rates from a previous mach
run with the --crossovermap and --errormap options, and request the
--mle option instead of --geno. If you don't have an available set of
map estimates, you can request that Mach estimate them using a small
number of iterations of the Markov Chain with the --rounds option.

Examples:

   mach -d sample.dat -p sample.ped -h hapmap.haplos -s hapmap.snps --crossovermap mach.rec --errormap mach.erate --greedy --mle

   mach -d sample.dat -p sample.ped -h hapmap.haplos -s hapmap.snps --greedy --mle --rounds 5

MACH1 OUTPUT KEY
================

Mach 1.0 generates a table that provides useful information about each
marker. The filename for the table has the extension .info or .mlinfo,
depending on whether the --mle option is used.

This table includes the marker name, allele labels, and minor allele
frequency for each marker. In addition, the estimated probability that
an average imputed genotype will match an experimental genotype is
output (this should be 1.0 for genotyped markers, and will often be less
for untyped markers). You will also get an estimate of the r-squared
correlation between estimated genotype scores and true genotypes.

ASSESSING QUALITY OF SOLUTIONS
==============================

One simple way to empirically assess the quality of the solutions
generated by Mach 1.0 is to use the --mask option. This option hides a
small proportion of genotypes from the haplotyper and then compares the
imputed genotypes at these locations with the actual genotypes.

Examples:

   mach -d sample.dat -p sample.ped --rounds 50 --states 200 --mask 0.02

   mach -d sample.dat -p sample.ped -h hapmap.haplos -s hapmap.snps --rounds 50 --greedy --mask 0.02
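
The idea behind masking can be illustrated outside of Mach with a short
Python sketch. This is not Mach's implementation: the function names
(mask_genotypes, impute_most_common, concordance) are hypothetical, and
the imputer is a deliberately naive stand-in that fills each hidden
genotype with the most common genotype at that marker. The sketch only
shows the hide-then-compare procedure, assuming genotypes are stored as
sorted allele pairs and missing genotypes as None.

```python
import random
from collections import Counter

MISSING = None  # assumed marker for a missing genotype


def mask_genotypes(genotypes, proportion, seed=0):
    """Hide a proportion of the observed genotypes.

    Returns (masked, hidden): a masked copy of the matrix and a dict
    mapping (individual, marker) to the genotype that was hidden there.
    """
    rng = random.Random(seed)
    observed = [(i, j)
                for i, row in enumerate(genotypes)
                for j, g in enumerate(row)
                if g is not MISSING]
    n_hide = round(proportion * len(observed))
    masked = [list(row) for row in genotypes]
    hidden = {}
    for i, j in rng.sample(observed, n_hide):
        hidden[(i, j)] = masked[i][j]
        masked[i][j] = MISSING
    return masked, hidden


def impute_most_common(masked):
    """Naive stand-in imputer (NOT Mach's algorithm): fill each missing
    cell with the most common observed genotype at that marker."""
    filled = [list(row) for row in masked]
    for j in range(len(masked[0])):
        counts = Counter(row[j] for row in masked if row[j] is not MISSING)
        if not counts:
            continue  # nothing observed at this marker; leave it missing
        best = counts.most_common(1)[0][0]
        for row in filled:
            if row[j] is MISSING:
                row[j] = best
    return filled


def concordance(imputed, hidden):
    """Fraction of hidden genotypes that the imputed matrix recovers."""
    if not hidden:
        return float("nan")
    matches = sum(1 for (i, j), truth in hidden.items()
                  if imputed[i][j] == truth)
    return matches / len(hidden)


if __name__ == "__main__":
    # Four individuals typed at three markers (made-up data).
    genotypes = [
        [("1", "1"), ("1", "2"), ("2", "2")],
        [("1", "2"), ("1", "2"), ("2", "2")],
        [("1", "1"), ("2", "2"), ("2", "2")],
        [("1", "1"), ("1", "2"), ("1", "2")],
    ]
    masked, hidden = mask_genotypes(genotypes, 0.25, seed=1)
    imputed = impute_most_common(masked)
    print("hidden genotypes:", len(hidden))
    print("concordance:", concordance(imputed, hidden))
```

Because the hidden genotypes are known, the concordance is an empirical
estimate of accuracy on this dataset. A small masked proportion (such as
the 0.02 used in the examples above) leaves most of the data available
to the haplotyper.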