

Questions and Answers

Quick instructions on formatting

??? Start a question with three question-marks like this

!!! Start an answer with three exclamation marks like this

If it is a long answer, then end with "???" on a line by itself
???

Questions/comments from Friday

The first question is whether you only have the self rating and the rating by the other twin, or whether you asked both twins to do this. If you only asked one member of the twin pair you will end up with rater-bias problems, but in principle a bivariate model would do. If you have the self rating and the co-twin rating for both twins of the pair, a multiple-rater design would be best. If you need help with a multiple-rater design you can contact Meike Bartels (m.bartels@vu.nl). Just for fun, look at an old paper by Eaves and Last (1980) in Pers. Individ. Diff. (1), “Assessing empathy in twins through their mutual perception of social attitudes”. It used this type of design but a different approach to analysis.

You can include twin pairs with missing data in OpenMx. You do not need to drop pairs with missing data before reading the data into OpenMx, because the raw-data (full-information maximum likelihood) fit uses all available observations. Similarly, it is not necessary to replace missing values with the mean. As in classic Mx raw-data analysis, pairs where data from one twin are not available still contribute to the means and variances, but the twin covariance cannot be estimated from those pairs.
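A minimal sketch of this with raw data (the data here are simulated purely for illustration): a saturated two-variable model runs on data containing NAs without any pre-processing.

  library(OpenMx)

  # Simulated twin data; some pairs are missing twin 2's phenotype (coded NA)
  set.seed(1)
  twinData <- data.frame(p1 = rnorm(100), p2 = rnorm(100))
  twinData$p2[1:10] <- NA

  # Saturated model: the raw-data (FIML) fit uses whatever each pair provides
  sat <- mxModel("sat", type = "RAM",
                 manifestVars = c("p1", "p2"),
                 mxData(observed = twinData, type = "raw"),
                 mxPath(from = c("p1", "p2"), arrows = 2, values = 1),
                 mxPath(from = "p1", to = "p2", arrows = 2, values = 0.2),
                 mxPath(from = "one", to = c("p1", "p2"), arrows = 1, values = 0))
  summary(mxRun(sat))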

Of note, R (and therefore OpenMx) only understands “NA” as missing. If your missing values are coded as anything other than NA, you will need to tell R how missingness is coded in your original dataset. When reading data into R (typically through the read.table() function), you can specify which codes should be treated as missing (using the na.strings argument).
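For example (the file name and missing-value codes are hypothetical):

  # Treat -999 and "." as missing when reading the raw file into R
  twinData <- read.table("mytwins.dat", header = TRUE,
                         na.strings = c("-999", "."))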

Rob K. says: Also, be aware that OpenMx does not tolerate NAs on definition variables; it will throw an error at runtime.

Mike N says: Also note that the behavior with missing definition variables is the same as it used to be in classic Mx. If you have missing phenotype data AND missing definition variables on one member of a twin pair, it is important to put in a dummy value for the cotwin, say -999. However, if you are modeling Twin 1 as a function of Twin 2's definition variables (necessary in GxE models where both own and cotwin's moderators are regressed out) then the best approach is probably to delete the pair at this time. In future we'll add features to deal with missing definition variables, subject to certain assumptions.
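A minimal sketch of the dummy-value approach (the variable names and the -999 code are only for illustration):

  # If twin 2's phenotype AND twin 2's definition variable (e.g., age) are both
  # missing, substitute a dummy value so OpenMx does not error on the NA
  # definition variable; the dummy should not matter, because that twin's
  # missing phenotype contributes nothing to the likelihood
  bothMissing <- is.na(twinData$p2) & is.na(twinData$age2)
  twinData$age2[bothMissing] <- -999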

It depends. If there is assortative mating, the nuclear twin family (NTF) model can give you higher heritability than the classical twin design. If both C & D are affecting the trait at the same time, the nuclear twin family design (NTFD) will tend to give lower values of A and higher values of D & C, although the broad-sense heritability might not be too different. Empirically, this is what is seen when we compare estimates from extended twin family models with those from classical twin designs (e.g., see Coventry & Keller, 2005).

So when I mention the “heritability” of a gene, I mean the heritability of the quantity of the gene observed in the white blood cells of these 2700 twin pairs. In fact every individual has every gene, but the sequence of nucleotides in a gene might vary between individuals. Note that a protein-coding gene can be between a few hundred and a million nucleotides long, and only a limited number of those will vary between people. While every individual has 2 copies of each gene, the degree to which this gene is expressed might vary widely between subjects. Expression requires the cellular machinery in a cell to unpack the DNA strand, copy the DNA strand, and have various other parts of the cellular machinery translate the DNA into a protein. To summarize: everyone has 2 copies of each gene; the content of the gene (nucleotide sequence) might vary between people (though it is identical for MZs and correlates approximately .5 for DZs); and the expression of a gene also varies between subjects and generally does NOT correlate 1 for MZs and .5 for DZs. Therefore, we can compute the heritability of the expression levels of a gene.

Imagine that on average MZ twins correlate .4 across their entire epigenome and DZ twins on average correlate .2 across their entire epigenome. Now imagine further that most MZ or DZ pairs do not differ much from these .4 and .2 correlations, respectively; in that case the epigenetic effects will behave additively, as their correlation pattern best matches the additive variance component. One scenario under which this might arise is when methylated regions related to your trait of interest are strongly correlated with a set of SNPs (i.e., the gene sequence): MZs would share all of these SNPs, while DZs would on average share half of them. This in turn would lead MZs to be more alike for these methylated regions than DZ twins, and that resemblance would be absorbed by the additive component. Given that methylation is influenced by genetic, common environmental, and unique environmental effects, much like any complex trait, we need to study the bivariate relationship between methylation and a complex trait to determine whether the relationship is in C (for example the womb), E, or A. Also note that MZ twins in the womb more often share their chorion and amniotic sac, so even the environment in the womb might be shared more strongly between MZ twins than between DZ twins, and thus could behave in a pattern that matches additive genetic effects.
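As a quick worked illustration of why a .4 / .2 pattern is absorbed by A, using the standard expectations rMZ = a2 + c2 and rDZ = 0.5*a2 + c2:

  rMZ <- 0.4
  rDZ <- 0.2
  a2  <- 2 * (rMZ - rDZ)   # additive genetic variance: 0.4
  c2  <- 2 * rDZ - rMZ     # shared environmental variance: 0.0
  # the whole 2:1 MZ/DZ pattern is attributed to the additive (A) component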

MCK: It is an issue of degree. The rarer the causal variants are, the less well tagged they tend to be by common (MAF > .01) SNPs that exist on modern platforms. So causal variants (CVs) with .005 < MAF < .01 will be only partially captured by GREML, CVs with .001 < MAF < .005 even less so, and so forth. Generally, once you get to CV MAF < .001, the LD between them and common SNPs will be so low that they are almost totally missed by GREML. One area of active research is using rarer SNPs (e.g., imputed or sequenced variants) or IBD haplotypes to get at the heritability due to these very rare CVs.

TCB: Worth reading Yang et al. (2015): “simulations based on whole-genome sequencing data [show] that ~97% and ~68% of variation at common and rare variants, respectively, [can be captured by imputation]”.

Yang, J., Bakshi, A., Zhu, Z., Hemani, G., Vinkhuyzen, A. A., Lee, S. H., . . . Visscher, P. M. (2015). Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nature Genetics, 47(10), 1114-1120. doi:10.1038/ng.3390

MCK: Right, but note that the Yang 2015 paper used *imputed* SNPs, which were thereby much rarer on average than the SNPs usually included on arrays. They picked up additional variation due to rare causal variants that were better tagged by these rare imputed SNPs, variation that would NOT usually be picked up by running GCTA just on array SNPs. So this isn't equivalent to the usual way of running GCTA. In essence, if we had all the sequence variation (instead of just imputed SNPs), we'd be able to pick up 100% of variation due to both rare and common variants.

MCK: First off, you'd want to have a sample that is relatively ethnically homogeneous; e.g., analyzing a sample of mixed ethnicity can lead to problems even if you correct for stratification because the CV-SNP LD will be different between the groups. So, now that we have an ethnically homogeneous sample, the most common way people control for any additional stratification (e.g., subtle differences between Caucasians on a north-south Europe gradient) is to add 5-20 ancestry principal components into the fixed part of the model. This should correct for any effect broad-level stratification has on your estimates.
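A minimal sketch of the principal-component correction (the data frame, column names, and number of PCs are hypothetical; in GCTA the PCs would typically be supplied as quantitative covariates via --qcovar instead):

  # 'pheno' is assumed to have the phenotype y plus ancestry PCs PC1..PC10
  # computed beforehand from the genotype data
  pcs <- paste0("PC", 1:10)
  f   <- reformulate(pcs, response = "y")
  pheno$yAdj <- residuals(lm(f, data = pheno, na.action = na.exclude))
  # yAdj (phenotype with ancestry PCs regressed out) can then go into GREML,
  # or the PCs can be passed directly to the software as fixed covariates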

For background, see these papers.

Rob K. says: I would tentatively answer your second question with a “yes,” as long as you have a sample that is representative of the population of patients who meet diagnostic criteria for (whatever disorder). Obviously, the generalizability of your results to the general population would be highly questionable.

Rob K. says: Status RED means the optimizer is not certain it has found a minimum of the fitfunction. Status RED with code 6 (first-order conditions not met) is worse than with code 5 (second-order conditions not met). You should always try to do something about status RED, for instance:

  • If you're analyzing ordinal data, sometimes a status RED is unavoidable without changing some mxOptions.
  • Use different start values.
  • Try a different optimizer.
  • Reparameterize your MxModel.
  • Use mxTryHard() or one of its wrapper functions. Note that, by default, mxTryHard() prints to console the start values it used to find the best solution it found. The idea is that you copy-paste those start values into your script and assign them to your pre-mxRun() model using omxSetParameters() (see the sketch below).
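A minimal sketch of that workflow (the model object myModel is hypothetical; omxGetParameters() is used here in place of copy-pasting the printed start values):

  # Retry the fit repeatedly from perturbed start values
  fitTry <- mxTryHard(myModel, extraTries = 30)

  # Carry the best parameter values back into the un-run model as start values
  best    <- omxGetParameters(fitTry)
  myModel <- omxSetParameters(myModel, labels = names(best), values = best)
  fit     <- mxRun(myModel)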

MCK: I'll let Ben weigh in as well. But in essence, they should be two different ways of estimating the *same* parameter: SNP-heritability. I think that the differences in the literature are about what we'd expect given the SE's on the estimates. Ben - are there any systematic differences between the two?

Ben here - Here's the original LD Score MS: http://www.nature.com/ng/journal/v47/n3/full/ng
