This README file contains:

   A. To install the ibd_haplo program within your MORGAN-3
   B. Running the program on the test examples
   C. Details of input files for examples
   D. Details of output files for examples

-----------------------------------------------------
A. To install the ibd_haplo program within your MORGAN-3:
-----------------------------------------------------

0) Download and install your MORGAN-3.
   In the main MORGAN-3 directory you will say
       make morgan.gcc.dbg
   (Of course, you may use any of the morgan make options,
    but I always use this one)

1) Untar the ibdhap_prog.tar.gz file within your main MORGAN-3 directory.
   This will create a subdirectory IBDHAP_PROG, which contains
   a Makefile and four *.c source code files.

2) cd into the IBDHAP_PROG subdirectory; then
       make ibd_haplo.gcc.dbg
   Note 1: it is probably advisable to use the same make option as in step 0
   Note 2: you will likely get a lot of warnings of the form
       ../Makefile.progs:188: warning: ignoring old commands for target `.cc.dml'
     You can ignore these; they come about because we have only one
     main program in this subdirectory

3) To remove the executable either simply
       rm ibd_haplo
   or
       make myclean
   Note: You MUST do this if for any reason you remake the rest of
     your MORGAN-3. The general "make morgan" commands will not clean
     and remake the ibd_haplo program, so library links will be incorrect.
     I.e. if you redo step 0, you MUST remove the ibd_haplo executable
     and redo step 2.

--------------------------------------------------------------
B. Running the program on the test examples:

1) Untar the file ibdhap_examples.tar.gz in any directory where you
   want to run things. It will make a subdirectory IBD_HAPDAT_RUN
   with a lot of files, including a copy of this README_ file.

2) cd into the IBD_HAPDAT_RUN subdirectory.

3) Make a (soft) link to the ibd_haplo executable: for example
       ln -s /castor/thompson/MORGAN_3_Feb09_CVS/IBDHAP_PROG/ibd_haplo ibd_haplo
   Note 1: Of course, you will use the full pathname of where your
     ibd_haplo program is.
   Note 2: If
       ls -l ibd_haplo
     shows there is already an ibd_haplo link, you should probably
     unlink it first: "unlink ibd_haplo".
   Note 3: If preferred, the link may go in your bin directory, or anywhere
     your system will look for commands, as may be preferred by you
     or your system administrator.

4) As with other MORGAN programs, the general format is
       progname parfile > outfile
   So here, for example, one might say
       ibd_haplo ibd_zheng.par > my_out.txt
   (If the current directory is not on your command search path,
    say ./ibd_haplo instead.)

5) You will find you have generated two new output files
       zheng_qibd, my_out.txt
   These are described in detail below in section D.

6) Before rerunning, remove or move (if you want to keep them) these two
   output files!!
   In the case of "my_out.txt", the program probably (depending on your
   setup) will not run if that file already exists.
   In the case of the "zheng_qibd" file, the next output will be appended
   to the file, and this (large) file gets ever larger and more confusing!!

   Note: These examples were originally set up for Zheng Cai, a visiting
     student from the University of Utah, in Winter/Spring 2009.
     Her very useful testing of these programs has made them much LESS
     user-UNfriendly than they would have been otherwise: be grateful!!

----------------------------------------------------------------------
C. Details of input files for examples

1. First look at the MORGAN parameter file
       ibd_zheng.par
   It just gives the names of two files: an input file zheng_filenames
   and an output file zheng_qibd -- which we met above.
   Note: clearly you could change these names for your examples.

2. Now look at the MORGAN extra input file
       zheng_filenames
   It consists only of three more filenames:
       compu_zheng_4haps.dat  zheng_haplos.dat  chr07_markers
   which we now describe.
   Note: again, you could change these names for your examples.

3. chr07_markers
   This file contains the marker information, in a very similar way to
   other MORGAN files, but without the explicit MORGAN parameter statements.
   The data are for 2132 SNP markers on "chr07".

   First the 2132 chromosomal positions of the SNPs are listed.
   For convenience they are here in 213 lines of 10, with two extra,
   but that is not required.
   These are sex-averaged cM positions -- only differences (cM distances)
   are used. If your position info is in bp, a rough translation to cM
   is given by dividing by 10^6 (i.e. roughly 1 cM per 10^6 bp).

   Then the allele frequencies of each SNP are given, for markers
   1 to 2132, in order (the integer count is for convenience; it is
   ignored by the program).
   Again these are put one per line for convenience.
   This is not required, but having them in the correct order is!!

4. zheng_haplos.dat
   This file contains the haplotypic data for the haplotypes or genotypes
   to be analyzed.
   This particular data set consists of 16 haplotypes, each with an
   integer "name". The name is followed by 2132 alleles making up
   the haplotype. 1 and 2 are the SNP alleles, and 0 denotes missing.

   For genotypic data the format would be the same, but there would be
   2132*2 = 4264 alleles following each "name". The two alleles making up
   a genotype can be entered in either order ("1 2" or "2 1").
   The program makes no assumptions about phase when analyzing
   genotypic data.

   Here the file has all the data for one haplotype/genotype on one line,
   because this is how the script produced it. Again, lines may be split
   for convenience if desired.

5. compu_zheng_4haps.dat
   This is the complicated file that tells the program what to do!!
   The comment in the file reads

   # First line no.states, no phenos, indicator for genodata
   #    in this case we have pairs of SNP haplotypes
   #    so there are 4 possible combinations of SNP alleles for the pair
   # Second line no.sets of haplos for analysis, and number in each set,
   #    and prs requs
   #    in this case we will do a set of pairs of chromosomes
   # inserted line (if nprs>0) set of pairs
   # Third line: total markers in input (0 if we will read all SNPs),
   #    total haplotypes in input
   #    total chromosome length in 10^6 bp,
   #    min allele freq -- should not matter
   # fourth line; no.SNPs constructed/read, skip count for analysis, fkin, ffrate
   # fifth line ; no. in block (nblk), gap btw blocks (ngap), start point (nbg)
   #    these lines used to subselect SNPs -- here we will use all.

   (a) This particular file "compu_zheng_4haps.dat" reads:
          15 16 0
          4 4 0
          0 16 192.30 0.06
          2132 1 0.15 0.1
          1 0 1

   (b) A similar example for genotypic data would read:
          9 9 1
          4 4 0
          0 0 192.3 0.06
          2132 1 0.15 0.1
          1 0 1

   (c) One that only required comparison of pairs of haplotypes might read:
          2 4 0
          8 2 0
          0 16 192.30 0.06
          2132 1 0.15 0.1
          1 0 1

   (d) And yet another version might be:
          15 16 0
          1 4 6
          0 1 0 2 0 3 1 2 1 3 2 3
          40435 120 100.00 0.06
          10000 1 0.15 0.1
          1 0 1

   # First line no.states, no phenos, indicator for genodata
     Examples (a), (d): For 4 haplotypes there are 15 ibd states, and
        16 phenotypic data configurations at each SNP (not counting
        missing data): i.e. each SNP can be allele 1 or 2 on each of
        the 4 ordered haplotypes.
     Example (b): For a pair of genotypes there are 9 ibd states, and
        9 data configurations -- each individual can be 1 1, 1 2 or 2 2.
     Example (c): For just 2 haplotypes there are 2 states (ibd/not)
        and 4 data configurations 1 1, 1 2, 2 1, and 2 2.
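
The state and configuration counts quoted for the first line can be checked
by direct enumeration. The Python sketch below is not part of MORGAN or
ibd_haplo; its function names are made up for illustration. It labels each
ibd state in the same style as the state names listed in section D (e.g.
1123), and it ASSUMES that for genotypic data haplotypes 1,2 belong to one
individual and haplotypes 3,4 to the other.

```python
from itertools import product


def ibd_states(n):
    """List every ibd state among n haplotypes as a label string.

    Haplotypes sharing a label are ibd; labels are numbered in order of
    first use, so '1123' means haplotypes 1 and 2 are ibd, 3 and 4 are not.
    """
    states = []

    def extend(prefix):
        if len(prefix) == n:
            states.append("".join(str(x) for x in prefix))
            return
        # The next haplotype joins an existing group or starts a new one.
        for label in range(1, max(prefix, default=0) + 2):
            extend(prefix + [label])

    extend([])
    return states


def canonical(labels):
    """Renumber labels in order of first use, e.g. '2113' -> '1223'."""
    seen = {}
    return "".join(str(seen.setdefault(ch, len(seen) + 1)) for ch in labels)


def genotype_classes(states):
    """Group 4-haplotype states that differ only by swapping the two
    haplotypes within an individual (assumed: positions 1,2 and 3,4)."""
    classes = set()
    for a, b, c, d in states:
        swaps = (a + b + c + d, b + a + c + d, a + b + d + c, b + a + d + c)
        classes.add(frozenset(canonical(s) for s in swaps))
    return sorted(sorted(members) for members in classes)


four = ibd_states(4)
print(len(four))                                   # 15 states, examples (a), (d)
print(len(ibd_states(2)))                          # 2 states, example (c)
print(len(list(product("12", repeat=4))))          # 16 data configurations
print(len(genotype_classes(four)))                 # 9 states, example (b)
print(len(list(product(["1 1", "1 2", "2 2"], repeat=2))))  # 9 configurations
```

The groups returned by genotype_classes(four) show which of the 15 states
are combined in a genotypic analysis, for example 1112 with 1121.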

   # Second line no.sets of haplos for analysis, and number in each set,
   #    and prs requs
     Example (a): 4 sets of 4 haplotypes to be analyzed
     (b): same, but this time it will take each successive pair of
        haplotypes and interpret them as unordered genotypes
        (this needs checking ?? it is not consistent over examples)
     (c): same data again, but now to be analyzed as 8 sets of
        2 haplotypes each.
     (d): one set of 4 haplotypes to be analyzed jointly and ADDITIONALLY
        6 pairwise analyses to be done.
        In this case the next line gives the required 6 pairs.

   # Third line: total markers in input (0 if we will read all SNPs),
   #    total haplotypes in input
   #    total chromosome length in cM (or 10^6 bp),
   #    min allele freq -- should not matter
     (a),(b),(c): we will read all SNPs; total haps is 16
        (but this value is actually disregarded in (b), so the "0"
         there is ok)
     (d): There are a total of 40435 SNPs, and 120 haplotypes
        (these were HapMap data on 60 individuals),
        but we are going to subselect them.

   # fourth line; no.SNPs constructed/read, skip count for analysis, fkin, ffrate
     (a),(b),(c): 2132 SNPs and all will be used in analysis
     (d): 10000 SNPs will be subselected, and these will all be used
     IMPORTANT:
        fkin is the prior prob of IBD -- 0.15 is VERY high, unless you
           know you have a lot of IBD
        ffrate is the rate change parameter for IBD -- 0.1.
           This also needs checking -- roughly we are looking for
           segments down to 0.1 cM.

   # fifth line ; no. in block (nblk), gap btw blocks (ngap), start point (nbg)
   #    these lines used to subselect SNPs -- here we will use all.
     Not used in these examples. Here (after subselecting SNPs in the
     case of (d)) we use all SNPs in the analysis.

----------------------------------------------------------------------
D. Details of output files for examples

As we have seen in B.5 there are two output files, one specified in the
parameter file (zheng_qibd) and the other as standard output in your
command line (my_out.txt).
"zheng_qibd": this is the core output, which can then be processed in R (e.g). Each line is for each marker: the marker number, 1,2,3,... the marker position, in cM ..as originally input and in the current example 15 additional probabilities summing (hopefully) to 1. IMPORTANT: these are the probabilities, under the given model and conditional on the data, of each of the 15 stated of ibd among the four haplotypes.a The ordering of the 15 states is states 1111,1122,1112,1121,1123, 1211,1222,1233, 1212,1221, 1213, 1231, 1223, 1232, 1234 Note: for genotypic analyses there will be 9 state probs (11 columns) we have the same 15 latent states, but genotypically equivalent ones are combined. for pairwise haplotype analyses there will be 2 state probs (4 columns); of the two state-probs, first is ibd, and second is non-ibd. "my_out.txt": standard output file. As with mamy MORGAN programs this is simply a summary of what has been read in, and what is processes, mainly so the user can check that all is as expected: first all the various paremeters of the run are printed. then the genetic data, with the first 10 alleles only (for checking) Next the equilibrium state probabilities under the provided fkin value, and the latent-ibd- process transition matrix under the given parameters is printed for a 1cM distance. This is repeated (unnecessarily!!) for each set of haplotypes processed. --------------------------------------------------------------------------- ----------------------------------------------------------------------