

Questions and Answers

Quick instructions on formatting

??? Start a question with three question-marks like this

!!! Start an answer with three exclamation marks like this

If it is a long answer, then end with "???" on a line by itself
???

Questions/comments from Friday

The first question is whether you only have the self rating and the rating by the other twin, or whether you asked both twins to do this. If you only asked one member of the twin pair you will end up with rater-bias problems, but in principle a bivariate model would do. If you have the self rating and the co-twin rating for both twins of the pair, a multiple-rater design would be best. If you need help with a multiple-rater design you can contact Meike Bartels (m.bartels@vu.nl). Just for fun, look at an old paper by Eaves and Last (1980) in Pers. Individ. Diff. (1), “Assessing empathy in twins through their mutual perception of social attitudes”. It used this type of design but a different approach to analysis.

You can include twin pairs with missing data in OpenMx. You do not need to drop pairs with missing data before reading the data into OpenMx, because the raw-data (full-information maximum likelihood) fit uses all available observations. Similarly, it is not necessary to replace missing values with the mean. As in classic Mx raw-data analysis, pairs where data from one twin are not available still contribute to the means and variances, but the twin covariance cannot be estimated from those pairs.
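A minimal sketch of this with raw data (the data here are simulated purely for illustration): a saturated two-variable model runs on data containing NAs without any pre-processing.

  library(OpenMx)

  # Simulated twin data; some pairs are missing twin 2's phenotype (coded NA)
  set.seed(1)
  twinData <- data.frame(p1 = rnorm(100), p2 = rnorm(100))
  twinData$p2[1:10] <- NA

  # Saturated model: the raw-data (FIML) fit uses whatever each pair provides
  sat <- mxModel("sat", type = "RAM",
                 manifestVars = c("p1", "p2"),
                 mxData(observed = twinData, type = "raw"),
                 mxPath(from = c("p1", "p2"), arrows = 2, values = 1),
                 mxPath(from = "p1", to = "p2", arrows = 2, values = 0.2),
                 mxPath(from = "one", to = c("p1", "p2"), arrows = 1, values = 0))
  summary(mxRun(sat))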

Of note, R (and therefore OpenMx) only understands “NA” as missing. If your missing values are coded as anything other than NA, you will need to tell R how missingness is coded in your original dataset. When reading data into R (typically through the read.table() function), you can specify which codes should be treated as missing (using the na.strings argument).
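For example (the file name and missing-value codes are hypothetical):

  # Treat -999 and "." as missing when reading the raw file into R
  twinData <- read.table("mytwins.dat", header = TRUE,
                         na.strings = c("-999", "."))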

Rob K. says: Also, be aware that OpenMx does not tolerate NAs on definition variables; it will throw an error at runtime.

Mike N says: Also note that the behavior with missing definition variables is the same as it used to be in classic Mx. If you have missing phenotype data AND missing definition variables on one member of a twin pair, it is important to put in a dummy value for the cotwin, say -999. However, if you are modeling Twin 1 as a function of Twin 2's definition variables (necessary in GxE models where both own and cotwin's moderators are regressed out) then the best approach is probably to delete the pair at this time. In future we'll add features to deal with missing definition variables, subject to certain assumptions.
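A minimal sketch of the dummy-value approach (the variable names and the -999 code are only for illustration):

  # If twin 2's phenotype AND twin 2's definition variable (e.g., age) are both
  # missing, substitute a dummy value so OpenMx does not error on the NA
  # definition variable; the dummy should not matter, because that twin's
  # missing phenotype contributes nothing to the likelihood
  bothMissing <- is.na(twinData$p2) & is.na(twinData$age2)
  twinData$age2[bothMissing] <- -999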

It depends. If there is assortative mating, the nuclear twin family (NTF) model can give you higher heritability than the classical twin design. If both C & D are affecting the trait at the same time, the nuclear twin family design (NTFD) will tend to give lower values of A and higher values of D & C, although the broad-sense heritability might not be too different. Empirically, this is what is seen when we compare estimates from extended twin family models with those from classical twin designs (e.g., see Coventry & Keller, 2005).

So when I mention the “heritability” of a gene, I mean the heritability of the quantity of the gene observed in the white blood cells of these 2700 twin pairs. In fact every individual has every gene, but the sequence of nucleotides in a gene might vary between individuals. Note that a protein-coding gene can be between a few hundred and a million nucleotides long, and only a limited number of those will vary between people. While every individual has 2 copies of each gene, the degree to which this gene is expressed might vary widely between subjects. Expression requires the cellular machinery in a cell to unpack the DNA strand, copy the DNA strand, and have various other parts of the cellular machinery translate the DNA into a protein. To summarize: everyone has 2 copies of each gene; the content of the gene (nucleotide sequence) might vary between people (though it is identical for MZs and correlates approximately .5 for DZs); and the expression of a gene also varies between subjects and generally does NOT correlate 1 for MZs and .5 for DZs. Therefore, we can compute the heritability of the expression levels of a gene.

Imagine that on average MZ twins correlate .4 across their entire epigenome and DZ twins on average correlate .2 across their entire epigenome. Now imagine further that most MZ or DZ pairs do not differ much from these .4 and .2 correlations, respectively; in that case the epigenetic effects will behave additively, as their correlation pattern best matches the additive variance component. One scenario under which this might arise is when methylated regions related to your trait of interest are strongly correlated with a set of SNPs (i.e., the gene sequence): MZs would share all of these SNPs, while DZs would on average share half of them. This in turn would lead MZs to be more alike for these methylated regions than DZ twins, and that resemblance would be absorbed by the additive component. Given that methylation is influenced by genetic, common environmental, and unique environmental effects, much like any complex trait, we need to study the bivariate relationship between methylation and a complex trait to determine whether the relationship is in C (for example the womb), E, or A. Also note that MZ twins in the womb more often share their chorion and amniotic sac, so even the environment in the womb might be shared more strongly between MZ twins than between DZ twins, and thus could behave in a pattern that matches additive genetic effects.
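As a quick worked illustration of why a .4 / .2 pattern is absorbed by A, using the standard expectations rMZ = a2 + c2 and rDZ = 0.5*a2 + c2:

  rMZ <- 0.4
  rDZ <- 0.2
  a2  <- 2 * (rMZ - rDZ)   # additive genetic variance: 0.4
  c2  <- 2 * rDZ - rMZ     # shared environmental variance: 0.0
  # the whole 2:1 MZ/DZ pattern is attributed to the additive (A) component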

MCK: It is an issue of degree. The rarer the causal variants are, the less well tagged they tend to be by common (MAF > .01) SNPs that exist on modern platforms. So causal variants (CVs) with .005 < MAF < .01 will be only partially captured by GREML, CVs with .001 < MAF < .005 even less so, and so forth. Generally, once you get to CV MAF < .001, the LD between them and common SNPs will be so low that they are almost totally missed by GREML. One area of active research is using rarer SNPs (e.g., imputed or sequenced variants) or IBD haplotypes to get at the heritability due to these very rare CVs.

TCB: Worth reading Yang et al. (2015): “simulations based on whole-genome sequencing data [show] that ~97% and ~68% of variation at common and rare variants, respectively, [can be captured by imputation]”.

Yang, J., Bakshi, A., Zhu, Z., Hemani, G., Vinkhuyzen, A. A., Lee, S. H., . . . Visscher, P. M. (2015). Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nature Genetics, 47(10), 1114-1120. doi:10.1038/ng.3390

MCK: Right, but note that the Yang 2015 paper used *imputed* SNPs, which were thereby much rarer on average than the SNPs usually included on arrays. They picked up additional variation due to rare causal variants that were better tagged by these rare imputed SNPs, variation that would NOT usually be picked up by running GCTA just on array SNPs. So this isn't equivalent to the usual way of running GCTA. In essence, if we had all the sequence variation (instead of just imputed SNPs), we'd be able to pick up 100% of variation due to both rare and common variants.

MCK: First off, you'd want to have a sample that is relatively ethnically homogeneous; e.g., analyzing a sample of mixed ethnicity can lead to problems even if you correct for stratification because the CV-SNP LD will be different between the groups. So, now that we have an ethnically homogeneous sample, the most common way people control for any additional stratification (e.g., subtle differences between Caucasians on a north-south Europe gradient) is to add 5-20 ancestry principal components into the fixed part of the model. This should correct for any effect broad-level stratification has on your estimates.
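A minimal sketch of the principal-component correction (the data frame, column names, and number of PCs are hypothetical; in GCTA the PCs would typically be supplied as quantitative covariates via --qcovar instead):

  # 'pheno' is assumed to have the phenotype y plus ancestry PCs PC1..PC10
  # computed beforehand from the genotype data
  pcs <- paste0("PC", 1:10)
  f   <- reformulate(pcs, response = "y")
  pheno$yAdj <- residuals(lm(f, data = pheno, na.action = na.exclude))
  # yAdj (phenotype with ancestry PCs regressed out) can then go into GREML,
  # or the PCs can be passed directly to the software as fixed covariates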

For background, see these papers.

Rob K. says: I would tentatively answer your second question with a “yes,” as long as you have a sample that is representative of the population of patients who meet diagnostic criteria for (whatever disorder). Obviously, the generalizability of your results to the general population would be highly questionable.

Rob K. says: Status RED means the optimizer is not certain it has found a minimum of the fitfunction. Status RED with code 6 (first-order conditions not met) is worse than with code 5 (second-order conditions not met). You should always try to do something about status RED, for instance:

  • If you're analyzing ordinal data, sometimes a status RED is unavoidable without changing some mxOptions.
  • Use different start values.
  • Try a different optimizer.
  • Reparameterize your MxModel.
  • Use mxTryHard() or one of its wrapper functions. Note that, by default, mxTryHard() prints to console the start values it used to find the best solution it found. The idea is that you copy-paste those start values into your script and assign them to your pre-mxRun() model using omxSetParameters() (see the sketch below).
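A minimal sketch of that workflow (the model object myModel is hypothetical; omxGetParameters() is used here in place of copy-pasting the printed start values):

  # Retry the fit repeatedly from perturbed start values
  fitTry <- mxTryHard(myModel, extraTries = 30)

  # Carry the best parameter values back into the un-run model as start values
  best    <- omxGetParameters(fitTry)
  myModel <- omxSetParameters(myModel, labels = names(best), values = best)
  fit     <- mxRun(myModel)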

MCK: I'll let Ben weigh in as well. But in essence, they should be two different ways of estimating the *same* parameter: SNP-heritability. I think that the differences in the literature are about what we'd expect given the SE's on the estimates. Ben - are there any systematic differences between the two?

Ben here - Here's the original LD Score MS: http://www.nature.com/ng/journal/v47/n3/full/ng
