Questions and Answers

Ask questions at the OpenMx forums and the dedicated 2016 Workshop forum there.

Quick instructions on formatting

??? Start a question with three question-marks like this

!!! Start an answer with 3 exclaimation marks like this

If it is a long answer, then end with "???" on a line by themselves
???

Questions/comments from Friday

1. I have a twin design in which the twin completes a questionnaire about him (her) self and then completes the questionnaire about his (her) cotwin. So I end up with bivariate data. I think that a bivariate model should be appropriate. Or is there something else I should be doing given the nature of the data?

First question is if you only have the self rating and the rating of the other twin or if you did ask both twins to do this. If you only asked one member of the twin pair you will end up with rater bias problems, but in principle a bivariate model would do. If you have the self and the co-twin rating for both twins of the pair a multiple rater design would be best. If you need help with a multiple rater design you can contact Meike Bartels (m.bartels@vu.nl) Just for fun look at an old paper by Eaves and Last (1980) in J. Pets. Ind. Diff (1) “Assessing empathy in twins through their mutual perception of social attitudes”. It used this type of design but a different approach to analysis.

2. In her folder from Monday, Hermine Maes provided many different scripts, which is absolutely great! Is is possible to give some explanation to these scripts and in what they differ (except of ACE, ADE & SAT which is quite clear)?

MN: Hermine's computer died today so she can't help directly. There is a table describing which script is which is here: http://ibg.colorado.edu/cdrom2016/hmaes/UnivariateAnalysis/ It links to the files from the table so you can navigate easily.

Also, Hermine's webpage table has links for scripts for univariate/monophenotype [one] and bivariate/diphenotype [two] twin analyses. The file names reflect a naming convention and model details. The naming convention reflects the following features:

1a- Analysis for 2 zygosity groups without any covariates and with an age covariate [a] 1b- Analysis for 5 groups without any covariates and with age covariate [a]

2- Each set includes a saturated model (SAT with sub models testing assumptions), an ACE model (with sub models) and an ADE model (with sub models).

3- Analyses for specific types of variables: continuous ©, binary (b), ordinal (o), mean/variance ordinal (m) and joint (j). The joint analyses refer to bivariate analyses where one measure is continuous and the other is ordinal.

Two examples of the naming convention: 1- The file called “twoACEj” is a script for a bivariate ACE model for joint data (continuous and ordinal measures) and has no covariates. This script also only uses data from two zygosity groups. 2- The file called “oneSAT5ca” is a script for a univariate saturated model using continuous data with age as a covariate. This script uses data from 5 zygosity groups.

3. In classic Mx, we were able to include twin pairs with missing data. Can you clarify that we need to either drop twin pairs with any missing data or replace missing values with the mean? Are we not able to define a missing value and include that individual in the model?

You can include twin pairs with missing data in OpenMx. You do not need to drop pairs with missing data in advance of reading data into OpenMx because complete data is used. Similarly, it is not necessary to replace missing values with the mean. Like classic Mx, OpenMx drops incomplete pairs and covariances will not be able to be estimated in pairs where data from one twin is not available.

Of note, R (and therefore OpenMx) only understands that missing is “NA”. If your missing values are coded to anything but “NA”, you will need to tell R what your missing value is in your original dataset. When reading data into R (typically through the read.table() function), you can specify what your missing value will be (using the “na.strings = ” option).

Rob K. says: Also, be aware that OpenMx does not tolerate NAs on definition variables–it will throw an error at runtime.

Mike N says: Also note that the behavior with missing definition variables is the same as it used to be in classic Mx. If you have missing phenotype data AND missing definition variables on one member of a twin pair, it is important to put in a dummy value for the cotwin, say -999. However, if you are modeling Twin 1 as a function of Twin 2's definition variables (necessary in GxE models where both own and cotwin's moderators are regressed out) then the best approach is probably to delete the pair at this time. In future we'll add features to deal with missing definition variables, subject to certain assumptions.

4. Would we usually expect heritability to decrease when we fit Nuclear Twin-Family Models?

It depends. If there is assortative mating, the NTF model can give you higher heritability than the classical twin design. If both C & D are affecting the trait at the same time, the NTFD will tend to give lower values of A and higher values of D & C, although the broad sense heritability might not be too different. Empirically, this is what's seen when we compare estimates from extended twin family models vs. classical twin designs (e.g., see Coventry & Keller, 2005).

5. Question for Michel - I don't understand how you calculate the "heritability of gene expression". Is it a combination of the heritability of the gene + (if you have the gene), its expression? I really don't understand how you arrive at the number.

So when I mention the “ heritability” of a gene, I mean the heritability of the quantity of the gene observed in the white blood cells of these 2700 twin pairs. In fact every individual has every gene, but the sequence of nucleotides in a gene might vary between individuals. Note that a protein coding gene can be between a few hundred and a million nucleotides log, and only a limited number of those will vary between people. Now while every individual has 2 copies of each gene, the degree to which this gene is expressed might widely vary between subjects. Expression requires the cellular machinery in a cell to unpack the DNA strand, copy the DNA strand and have various other parts of the cellular machinery translate the DNA into a protein. To summarize everyone has 2 copies of each gene, the content of the gene(nucleotide sequence) might vary between people (though is identical for MZs and aprox correlated .5 for DZs), the expression of a gene further also varies between subjects and generally does NOT correlate 1 for MZ's and .5 for DZs. Therefore, we can compute the heritability of the expression levels of a gene.

6. Question for Michel, but could prob be answered by other faculty - I had thought that epigenetic effects would be located in C for those effects shared (say, from the womb) between twins and making them more similar in phenotype and in E for those epigenetic effects from environmental stimuli after birth making twins less similar. However, you also seemed to be saying that epigenetic effects could be located in the A variance as well. Is that true and why?

Imagine that on average MZ twins correlate .4 across their entire epi-genome and DZ twins on average correlate .2 across their entire epi-genome. Now imagine further that most MZ or DZ twins do not differ much from these .4 and .2 correlations respectively, in that case the epigenetic effects will behave additively, as their correlation pattern best matched the additive variance component. Now a scenario under which this might arise is when methylated regions related to your trait of interest are strongly correlated to a set of SNPS (i.e. the gene sequence), than MZ's would share all these SNPs, while DZ's would on average share halve of these SNPs. This in turn would lead MZ's to be more alike for these methylated regions than DZ twins, this would be absorbed by the additive component. Given methylation is influenced by genetic effects, common environmental effects and unique environmental effects, much like any complex trait, we need to study the bivariate relationship between methylation and a complex trait to determine whether the relationship is in C (for example the womb), E or A. also note that MZ twin in the womb more often share their chorions and amniotic sacs, thus even the environment in the womb might be shared more strongly between MZ twins than between DZ twins, and thus could behave in a pattern that matches additive genetic effects.

7. In David's presentation on GCTA, slide 12 (Expected Covariance Matrix - Unrelateds), I'm not sure of the derivation of the A matrix. Why is the top left a11 and the bottom right a^2nm. Should they not both be squared? Thanks!

Thank you for spotting this! Actually the bottom right corner should be a_nn. Sorry for the typo. I've now fixed this in the slide presentation.

8. David Evans said to not call the heritability estimate from GCTA "heritability of common SNPs" because (relatively) rare SNPs are also measured by the chips, but Matt Keller describes GCTA heritability as the "heritability of common SNPs" because rare variants and de novo mutations are not measured. Both are right, so how should we most accurately describe the GCTA heritability estimate?

MCK: It is an issue of degree. The rarer the causal variants are, the less well tagged they tend to be with common (MAF > .01) SNPs that exist on modern platforms. So causal variants (CVs) with .005 < MAF < .01 will be only partially captured by GREML, CVs with .001 < MAF < .005 even less so, and so forth. Generally, once you get to CV MAF < .001, the LD between them and common SNPs will be so low as to be ~ totally missed in GREML. One of the issues of active research is in using rarer SNPs (e.g., imputed or sequenced variants) or IBD haplotypes to get at the heritability due to these very rare CVs.

TCB: Worth reading Yang2015 “simulations based on whole-genome sequencing data that ~97% and ~68% of variation at common and rare variants [is captured]”

Yang, J., Bakshi, A., Zhu, Z., Hemani, G., Vinkhuyzen, A. A., Lee, S. H., . . . Visscher, P. M. (2015). Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nature Genetics, 47(10), 1114-1120. doi:10.1038/ng.3390

MCK: Right, but note that the Yang 2015 paper used *imputed* SNPs that were thereby much rarer on average than the SNPs usually included on arrays. They picked up additional variation due to rare causal variants that were better tagged by these rare imputed SNPs. So they are picking up additional rare causal variant variation that would NOT usually have been picked up using GCTA just on SNPs on arrays. So this isn't equivalent to the usual way of running GCTA. In essence, if we had all the sequence variation (instead of just imputed SNPs), we'd be able to pick up 100% of variation due to both rare and common variants.

9. Could Matt or someone please expand on how to address ethnic heterogeneity in GCTA? On the slide on assumptions in estimating heritability, the options include PCA or analyzing cases and controls separately. Could you please provide more detail on these two strategies?

MCK: First off, you'd want to have a sample that is relatively ethnically homogeneous; e.g., analyzing a sample of mixed ethnicity can lead to problems even if you correct for stratification because the CV-SNP LD will be different between the groups. So, now that we have an ethnically homogeneous sample, the most common way people control for any additional stratification (e.g., subtle differences between Caucasians on a north-south Europe gradient) is to add 5-20 ancestry principal components into the fixed part of the model. This should correct for any effect broad-level stratification has on your estimates.

10. What exactly is the issue in using GREML with ascertained samples? Would it be appropriate for a continuous trait within a clinical group?

For background, see these papers.

Rob K. says: I would tentatively answer your second question with a “yes,” as long as you have a sample that is representative of the population of patients who meet diagnostic criteria for (whatever disorder). I'm not altogether sure, though. Obviously, the generalizability of your results to the general population would be highly questionable.

11. If we get a code Mx status RED, what do we need to do/consider (in general)? E.g., are our model estimates still reliable?

Rob K. says: Status RED means the optimizer is not certain it has found a minimum of the fitfunction. So, no, your parameter estimates are probably not reliable, and the standard errors are even more suspect. Status RED with code 6 (first-order conditions not met) is worse than with code 5 (second-order conditions not met). You should always try to do something about status RED, for instance:

If you're analyzing ordinal data, sometimes a status RED is unavoidable without changing some mxOptions. In fact, it's possible to get status RED with ordinal data even when the optimizer has found a minimum.
Use different start values.
Try a different optimizer.
Reparameterize your MxModel.
Use mxTryHard() or one of its wrapper functions. Note that, by default, mxTryHard() prints to console the start values it used to find the best solution it found. The idea is that you copy-paste those start values into your script, and assign them to your pre-mxRun() model, using omxSetParameters().

Trying different start values is the most important thing.

12. For Ben's presentation, where (papers, websites?) are the graphs on slides 17, 22, & 28 located? They are from published studies, yeah?

Ben here - Here's the original LD Score MS: http://www.nature.com/ng/journal/v47/n3/full/ng

13. What is the difference in interpretation between GCTA h2 estimates and LD score regression h2 estimates?

MCK: I'll let Ben weigh in as well. But in essence, they should be two different ways of estimating the *same* parameter: SNP-heritability. I think that the differences in the literature are about what we'd expect given the SE's on the estimates. Ben - are there any systematic differences between the two?

14. @Sarah where can we find the full syntax of this morning's assumption testing (so the syntax with the right answers)? thanks!

15. Where can researchers find publicly available twin data?

One place a researcher can begin searching for publicly available data is the repository developed and managed by the Inter-university Consortium for Political and Social Research (ICPSR). You can search for “twins” and datasets that have twin data collected and available for dissemination will be listed. http://www.icpsr.umich.edu/

As an aside, another publicly available resource towards developing harmonized measures of biomedical phenotypes is the PhenX toolbox (https://www.phenxtoolkit.org/). No data for analysis is available for download at this website, but it is good if you are thinking about study design of a project.

MCK: Great question! Nick, Dorret, John?? Want to weight in? Sadly (I think), twin research has not kept up with whole-genome research, where sharing of data is not only the norm, but also mandatory (for NIH funding at least). Would be scientifically useful if the same were true of twin research!

Questions/comments from Thursday

1. Coming from using more of a path analysis software, I am having a hard time understanding where the R/OpenMx script indicates the relationships between latent ACE and the observed indicators (family, happy etc). Specifically this line of script here: modelACE <- mxModel( "mulACEc", pars, modelMZ, modelDZ, multi, estVC ) We have built the matrices and populated them with data, how does the mxModel process them?

There are several ways that expected means, thresholds and covariance matrices can be generated. For this workshop, the algebraic approach is taken by setting up matrix objects and matrix formulae to generate these values. Internally, OpenMx inspects the algebra elements to make sure that there's nothing illogical (like A = B and B = A). During optimization, different values of the parameters are supplied to the free parameter elements of the matrices, and the algebras are recomputed, so that the expected covariance matrix and means/thresholds change. The software evaluates the amount of change in the -2lnL as a function of changing each of the parameters. This information is used to figure out a new set of parameter estimates which are hopefully nearer the minimum.

Note that it is also possible to specify models using mxPath() function calls, which may be more familiar if you are accustomed to specifying models that way in other software. There's a simple example on the OpenMx homepage.

TCB: Good question! This is the code snippet you posted:

	modelACE <- mxModel( "mulACEc", pars, modelMZ, modelDZ, multi, estVC )

There is a lot of stuff bundled up in there.

Let's unpack and simplify… We'll make a “top” model that contains all the algebra, then “MZ” and “DZ” models that refer to these, and contain their own data and optimization functions.

Then we put “top”, “MZ”, and “DZ” inside a supermodel “ACE”.

ACE.r

ntv = 2
nv = ntv/2
myACE = mxModel("ACE",
	mxModel("top",
		mxMatrix(name = "meanG", "Full" , nrow = 1 , ncol = ntv, free = T),
		mxMatrix(name = "a"    , "Lower", nrow = nv, ncol = nv , free = T),
		mxMatrix(name = "c"    , "Lower", nrow = nv, ncol = nv , free = T),
		mxMatrix(name = "e"    , "Lower", nrow = nv, ncol = nv , free = T),
		mxAlgebra(name = "A", a %*% t(a)),
		mxAlgebra(name = "C", c %*% t(c)),
		mxAlgebra(name = "E", e %*% t(e)),
		mxAlgebra(name = "ACE", A + C + E),
		mxAlgebra(name = "AC" , A + C),
		mxAlgebra(name = "hAC", 0.5 %x% A + C),
		mxAlgebra(rbind(cbind(ACE, AC ),
		                cbind(AC , ACE)), name = "expCovMZ"),
		mxAlgebra(rbind(cbind(ACE, hAC),
		                cbind(hAC, ACE)), name = "expCovDZ")
	),
	mxModel("MZ", 
		mxExpectationNormal("top.expCovMZ", means = "top.meanG"),
		mxFitFunctionML(),
		mxData(mzData, type = "raw")
	),
	mxModel("DZ",
		mxExpectationNormal("top.expCovDZ", means = "top.meanG"),
		mxFitFunctionML(),
		mxData(dzData, type = "raw")
	),
	mxFitFunctionMultigroup(c("MZ","DZ"))
)

Now we can answer your question: “where does the script indicate the relationships between latent ACE and the observed indicators (family, happy etc)?”

The answer is: “in mxExpectationNormal and mxFitFunctionML”. mxExpectationNormal takes the expected covariance “top.expCovDZ” and the expected means (“top.meanG”), and compares these (according to maximum likelihod) to the mxData in the model containing the expectation.

The supermodel looks at how well each of the expectations (MZ and DZ) fit, and ensures the sum of both is as good (small) as possible.

Does that help?

2. What to do with strongly skewed (non-normal) data? normalize or not? are log and square root transformations appropriate?

The same answers hold here as elsewhere in stats: If your data are highly skewed, then it's likely that they are better expressed in a transform.

Reaction time for instance is skewed and long-tailed. The reciprocal or “reactions per second” (1/RT) is normally distributed, and is the advisable transform.

You should plot them, however. If the most common value is 0, then transforming won't normalize, and you might want an ordinal or tobit analysis instead.

- Another point of view on this (from Sarah)-

If you are going to transform, as a reviewer, I like to see the results of the untransformed data in an appendix/sup table.

If after transformation your data does not have a single mode and the distribution is very asymmetric (ie symptom count type data in an unselected cohort) you should convert the raw data to an ordinal categorical variable. Remember that you do not really gain much power by adding extra categories if the prevalence of the extra categories is very low and you end up empty cells in your twin1/twin2 cross tab.

Additional remarks (Conor): 1) The distributional assumption in ML (as used here) is that the data are multivariate normally distributed. So if you are concerned about normality given the ML normality assumption, then you'll have to consider multivariate normality in addition to marginal (i.e., univariate) normality. 2) Remember that the normality assumption concerns the distribution of the data conditional on fixed covariates (definition variables with a main effect on the phenotypes). So consider height in 500 males and 500 females (all 30 years old). The distribution of height in the complete sample (N=1000) may deviate from normality, while the distributions in the subsamples (males and females) may be perfectly normal.

Rob K. remarks: If there is a generally accepted transformation for your kind of variable in your discipline, use that. If not, I advocate adjusting one's methods to suit one's data, not adjusting one's data to suit one's methods! Non-normality should not bias your normal-theory MLEs, but it can indeed bias your inferences about parameters–test statistics, standard errors, and confidence intervals. So, consider bootstrapping, permutation testing, and the like, which are rather general nonparametric procedures for inference. Alternately, consider using a different likelihood function that's more reasonable for your variables. I'm planning new features in OpenMx that will facilitate its use with multivariate non-Gaussian data. Also, I have a recent paper on twin modeling with phenotypes that are count variables.

Recode your data into ordinal categories, and use table(data$variable_twin1,data$variable_twin2) to make sure there aren't any empty cells.

Mike N: Try to understand why your data are non-normal. If they are a sum of binary items that not very many people endorse, it may be better to work around a model that explicitly includes the item level data and replaces the sum score with a latent factor. Then twin correlations for the latent factor, or its correlation with other factors or measurements, may be of interest. In all cases, it is worth examining measurement invariance, since incorrect inference can emerge from using sum scores or factor modeling if this assumption is violated.

3.How do epigenetic effects affect estimates of V(C), V(D)and V(A)?

Conor: there are a number of interpretations. I think that such effects will render a (A path) potential random over individual. That is whereas we now think of a as a fixed parameter it may be a parameter that is ditributed, say, normally with mean mu(a) and standard deviation sd(a). Treating it as fixed, we get from the model mu(a). This is really just speculation. But for an elaborate presentation of this please see https://www.dovepress.com/methylation-as-an-epigenetic-source-of-random-genetic-effects-in-the-c-peer-reviewed-fulltext-article-AGG.

4. Could you please explain step by step how to solve for A and C given CVmz = A + C and CVdz = 1/2 A + C?

Suppose that the MZ correlation is .8 and the DZ correlation is .5

    A + C = .8
1/2 A + C = .5

Then subtracting the DZ expression from the MZ, we have:

1/2A = .3
   A = .6

substitute this back into either the MZ or the DZ equation so

 .6 + C = .8
      C = .2

Given that these are correlations, we can then deduce the value of E, because A+C+E=1 for standardized variables.

 MCK: You can figure the general formula just by multiplying by constants and subtraction. E.g., 
CVMZ = A + C
CVDZ = 1/2 A + C

So CVMZ-CVDZ = A + C - 1/2A - C = 1/2A. SO to get A:
A_hat = 2*(CVMZ-CVDZ).

Similarly to solve C, find a way to eliminate A and isolate C. E.g.:
2CVDZ - CVMZ = A + 2C - A - C = C_hat

5. Could you please explain why it is better to include age and sex as a covariate rather than using the residual in a model?

One answer is that if the dependent variable is ordinal, you have no choice.

Another is that by incorporating the covariates in the model, you have the estimates in one place.

Rob K. says: Stated briefly, if you incorporate the covariates into the MxModel, you end up simultaneously maximizing the joint likelihood of ALL unknown parameters–both the biometric variance components and the regression coefficients for the covariates. If you instead use OLS regression to residualize the phenotype(s), and then analyze the residuals in OpenMx, the parameter estimates for the variance components are conditional MLEs, given ordinary least-squares estimates of the regression coefficients. Such conditional estimators have weaker theoretical properties than proper “joint” MLEs. See these posts for further detail.

6. Will we discuss methods that can be employed to identify/isolate components of the nonshared environment? (e.g., DF modelling and MZ twin discordance)

Yes - Brad will be discussing this Friday.

7. Question about Power: How do you get from the individual family likelihood contribution to the power percent. We know it has to do with the chi-squared distribution but could you walk us through it?

Not answered yet.

8. Is there a way of monitoring the progress of optimization while openMx is running?

TCB: Kind of: if you set checkpointing, then you can watch the checkpoint file grow.

You can do this with calls to options, or just use umx_set_checkpoint()

	mxOption(NULL, "Always Checkpoint"   , always)
	mxOption(NULL, "Checkpoint Count"    , interval)
	mxOption(NULL, "Checkpoint Units"    , units)	
	mxOption(NULL, "Checkpoint Prefix"   , prefix)

MN adds: Also, on a linux or OS X system you can use the command line function less to inspect the output. You can step through the file with the space bar or up and down arrow keys. Go to the end of the file with F, and stay there as the file grows with G (ctrl-c to exit) and q to quit.

Rob K. adds: if you happen to be using mxTryHard(), you can provide it with a value greater than zero for its argument 'verbose'. You'll see additional information from the optimizer being printed to the console.

9. In path-based (RAM) OpenMx, how do you include an algebra on a path? e.g. to allow a variable Z to moderate the impact of X on Y?

Not answered yet.

10. If you do a study with low power (e.g. 30%) and you get a nominally significant p value (p < .05), can you trust your results?

MCK: Two answer to this. A p-value of .04 with a large sample size (high power) and one of .04 with a small sample size (low power) both provide equal evidence for rejecting the null that the estimate is different from 0 (assuming that's your null). So one answer is, “Yes”.

But another answer to this is “no”. An estimate with a p-value of .04 in a small sample (vs. one of .04 in a large sample) has a lot more error variance (it's SE is much larger) and so you will be less sure about where exactly the true estimate is likely to lie.

JKH: Low power increases the likelihood that `significant' findings will be false positives. (see Button, Ioannidis et al, 2014)

11. In the Latent Dual Change Score model, the betas were all the same inferring that the distance between time points is equal. Can you specify uneven time points in this model? For example, ages 1, 2, 3, 10, 14 for a longitudinal study?

Yes, in two main ways. An easy way is simply to treat the issue as a missing data problem. So if you set the model up for ages 1, 2, … 14 – as if there were up to 14 measures per person (or 28 per twin pair). Then just mark as NA all the occasions on which a person was not measured. The second way works well if there are very many ages (continuous age, for example). Here you would use one definition variable for each observed variable. So if you had 4 occasions measured, then you would have the 4 measures and the 4 ages at which the person was measured. In the model the factor loading for linear growth would be replaced by the age of measurement using a label such as data.ageATtime1 assuming ageATtime1 is a variable in the dataset that states the age at measurement. Here again missing data may be key; the model should be set up for the maximum number of occasions of any individual. Any missing measures would have NA, but their corresponding definition variable can be set to an arbitrary value like -999 (and should NOT be marked as NA). The rationale here is that OpenMx deletes cases that have at least one missing value for a definition variable.

Agreed. Treat as missing data problem. That advantage of full information maximum likelihood raw data analysis is that it's robust to missing data.

12. Can you correct for assortative mating without using the extended twin family design? (say, with data on how much people assortatively mate in the general pop)

If you have an estimate of assortment, you could put it into the model by altering the value of DZr away from .5 gene sharing.

MN: A possibility would be to use a sensitivity analysis, depending on whether you think the marital similarity comes from assortment (selection of spouses) or from social environment similarity, or from marital interaction. Thus one could adjust the amount of the genetic correlation change. A purely phenotypic assortment model might substitute .5(1 + a² μ) where a² is the heritability and μ is the marital correlation.

Questions/comments from Wednesday

1. What is the origin/meaning of the terms "Biometric" and "Psychometric" model (for CP and IP models respectively)

McArdle, J. J., & Goldsmith, H. H. (1990). Alternative common factor models for multivariate biometric analyses. Behavior Genetics, 20(5), 569-608. doi:10.1007/BF01065873

2. Why can't we think of "general health" causing smoking? (i.e., that the reason smoking goes with drinking is a causal latent "health-seeking behavior" trait (for Conor Dolan's factor model tutorial)

One answer is, “you could, of course - that's just a different theory”. It could be tested. In reality, it's likely that both things are happening: some health/resilience factors causing several behaviors, and those behaviors piling up to cause “general health”

I (Conor) consider General Health (GH) to be an index variable, that is a summary variable like the APGAR score (“general health” of the neonate). Suppose I am studying GH in adolescents (say 14 to 16 y olds). I do not see how a kid's GH is a cause the kid's smoking, just like I do not see how a low APGAR score is causing respiratory problems. However, there are definitely situations in which GH may cause smoking. Consider an ex smoker, whose GH has declined due to, say, old age. The decline in GH may be a source of dispair and stress. In reaction the person starts smoking (to relieve stress). Here GH is causing smoking, but in a rather exceptional circumstance. So the causal status of GH may depend on the context?

3. Will the factor analysis of an 84-item autism scale be satisfactory (real) if the threshold is .3% (autistic spectrum disorder) and 99.7% of the subjects in the FA sample are normal?

I would argue that this threshold is not arbitrary and most children are not slightly autistic, they either are or not. So can I use 9,970 normal subjects scoring very low on an autism scale and 30 autistic children score very high on the scale in a factor analysis. Will the results of the FA validly reflect the underlying factor structure of the 84-item autism scale used in the study?

Conor answers: 84 items is alot! Suppose that the response format is dichotomous (yes, no), and the probability of endorsement is low (say on average 2% of testees respond “yes”). Then you're going to have a hard time to fit a factor model using full information maximum likelihood purely for computational reasons. However if your sample size is large (more like 10000 than 1000) then there are least squares estimation methods that may work. E.g., (Diagonally Weighted Least Squares (WLS). Such methods are available in OpenMx.

4. What are the different optimizers for OpenMX, and why would you use one versus another?

Legal values (from mxAvailableOptimizers() ) are:'CSOLNP', 'SLSQP', and 'NPSOL'

NPSOL is proprietary, Fortran-based, and pretty rock solid. See here for its documentation.

CSOLNP is our own open-source implementation of solnp (which has an R implementation in package Rsolnp). It is coded in C++, perhaps less robust than the other two, but improving and already quicker than others at certain problems. learn more here

SLSQP is one of a suite of optimizers implemented in nlopt.

There are also Newton-Raphson and EM optimizers (for advanced users).

FYI: If you are using OpenMx you can set the optimizer easily using the following:

mxOption(NULL,"Default optimizer","NPSOL")

You can also see the currently set optimizer with

mxOption(NULL,"Default optimizer")

If you are using umx you can see and set the optimizer easily with

umx_set_optimizer()

Concerning why you would use one versus another: NPSOL and SLSQP are similar in many respects. We're using NPSOL in this workshop largely for historical reasons–NPSOL was the ONLY available optimizer in OpenMx version 1, and in classic Mx before that. It's therefore what the workshop faculty are most familiar with. Savvy users interested in technical details may wish to consult the help pages for mxOption() and mxComputeGradientDescent() .

5. Can you please provide some guidance about which type of matrix to select (i.e., when to select "Full" versus "Stand" type of matrix in OpenMx)?

This is dependent on what you want the matrix to contain/do. A full matrix allows all cells to take any value. A symmetric matrix makes the upper triangle and lower triangle equal (as you might see in a correlation matrix), a lower matrix only stores the lower triangle (as you might want in a Cholesky where you are going to %*% the matrix with its transform). In general, there is no general answer to this question .

You might also check out the OpenMx wiki, esp matrices. Also see here and here.

6. Can OpenMx accommodate sampling weights for the observations?

Yes. What you do is run your model with a row-likelihood (likelihood for each subject), then embed that in a model which weights each row and tries to minimise the sum of those weighted likelihoods.

7. If you drop the C path in the AE model why doesn't the fitAE$rC change?

When 'C' is dropped from the model, evaluation of fitAE$rC would involve inverting a matrix of all zeroes, which in turn would throw an error. For that reason, OpenMx leaves fitAE$rC untouched (which may not be the ideal behavior). Compare to fitAE$C, which DOES appropriately reflect the fact that the 'C' paths are fixed to zero.

8. This question will likely bring some face-palms, nonetheless…..I’m having trouble understanding the “starting value” component of the modeling. It seems like such an arbitrary method of beginning the analyses.

One way to see that it is not arbitrary is to try putting in impossible values: the model will usually code red with non-positive definite matrices, or other problems. So you need to start things in the legal space. The further away from the bet vlue that you start things, the longer the model will take to run, as it finds its way to the best solution.

Another way to think about this is that if you knew where to start things exactly, you wouldn't need to run the model: you'd just start the model at its optimal values, and it would stay there. So the values are unknown, but important, rather than arbitrary.

9. I notice that when we are testing submodels we opt to drop parameters as opposed to constraining or un-constraining them. What is the benefit to these different approaches? Which is the better approach?

In general mxConstraint is a much harder problem for the optimizer to solve - essentially it will try shift a parameter, optimize, then check that the constraint was not violated. If it was, it just tries again. When you know what you want a parameter to do, e.g. “stay at zero”, it is much better to just do that. Use constraints for situations that can't be done with simple “free = F, value=X” code.

10. How can you tell if something is behaving as a moderator vs. a mediator?

By building and running the moderator model and the mediator model and mxCompare them. It is useful to draw the path diagram on paper to think through exactly what each is claiming.

11. Is there a reference (say to give to reviewers) to say why univariate ACE models are/are not of value (in addition to one's multivariate results?

They are probably right, in that the univariates can be a guide to how to model the multivariate model, and a good opportunity to test assumptions like sex-limitation and variance equation in a simpler context.

12. Is undertaking sex-limitation and other assumption tests at a univariate level acceptable (if one, say rejects sex-lmitation, then moves on to multivariate). ie. do we need to do multivariate assumption tests?

The multivariate model is (somewhat) more powerful. It would likely allow you higher power to detect sex-limitation.

13. We mentioned running a saturated model for bivariate model, however the code provided does not test the twin order, means, ext. assumptions, will going over that information?

(Scripts for bivariate saturated are in Hermine’s folder). Indeed, assumptions have to be tested in the multivariate case as well.

14. What to do with strongly skewed (non-normal) data? normalize or not? are log and square root transformations appropriate?

Questions/comments from Tuesday

1. In Sarah's presentation, I got an error from the **mxCompare** function (line 124): "object "fit" was not defined". Reading through the script it seems it was not created. Do you know where I could find it? Additionally the fitGofs and fitEsts functions didn't work, but I think that was a working directory problem.

(An) Answer: Sarah's presentation script (oneACEca.R) was following up from the saturated script that was run from yesterday (oneSATc.R). She wanted to compare the fit of the ACE model with the saturated model. The object “fit” identified in the script was referring to an object developed in the saturated script (oneSATc.R).

Regarding fitGofs and fitEsts. It is possible that you did not run the file that contains some specialized functions used for summarizing results (line 3 in script, written as– source(“miFunctions.R”) ). If this script wasn't run successfully, R doesn't know anything about these functions and will produce an error. To be clear, miFunctions.R is a script that contains functions that are helpful for seeing results but it isn't necessary for the succesful production of any OpenMx models.

2. Where can I get classic reference papers?

The QIMR website for classic papers is here

Lindon mentioned Urbach 1974.

3. When not using the projector screen on the left for code, please put it in the mode where it mirrors the presenter's slides. People on the left side of the room can't see the projector screen on the right very well.

Great idea! It's not perfect because it's mirroring the Powerpoint notes view.

4. How do I access the faculty folder when I can't access the workshop wifi?

The faculty folder can only be accessed while on the Workshop network. It cannot be accessed from the hotel's wifi, or other outside networks. If you're in the hotel, you can come down to the lobby or someplace nearby that is reached by the Workshop wifi. The Workshop wifi is left running all the time.

When the Workshop is over, all of the faculty files will be collected and placed on the web in a location that is accessible from everywhere. Once this is done an email will be sent out notifying you that it is available.

5. Could we be provided with code for a categorical moderator (Sex? or levels of SES) to compare the syntax?

For 2 or 3 levels, you can “just” make a multi-group model (as you do already for twin models, with a group for each level of the moderator. Then test if you can equate ACE across the groups.

BUT

If you want to run a Purcell type moderation script there are no changes required. The script is the same regardless of whether the moderator is continuous or categorical. Just remember to make your lowest category 0 so that you can interpret the output easily.

6. Why do we talk about opposite sex DZ twins having any common environment?

Because the model is asking a question: If I posit something shared, how much impact does it appear to have? The answer might be (as your question implies) “not at all - they're opposite sex” Or it might be “lots”.

7. Where can I learn more about "tetrachoric" correlations?

Wikipedia is helpful. So too is John Fox's package polycor

(Another) Another nice reference on tetrachoric correlations can be found at http://www.john-uebersax.com/stat/tetra.htm

8. Why would you use **mxRun** over **mxTryHard**? What are the pros and cons of using each?

mxRun just runs your model. You can manually rerun it by just entering the model you get back from mxRun into mxRun again. mxTryHard tries to be intelligent in moving parameter start values, and estimation to solve intransigent models. Basically use `mxRun` unless things don't work, then use mxTryHard. nb: mxTryHard is not a magic bullet - you might need to think about your start values and might need to fix your model.

9. Does V(D) in the twin model account for all non-additive genetic effects (e.g. [[https://en.wikipedia.org/wiki/Epistasis|epistasis]]) or just dominance effects at the allele level? Are epigenetic effects located in V(C)?

The dominance component will be able to account for epistatic inheritance. However, the issue of where variance components go when not modeled correctly is covered in more depth in Mat Keller's talk

10. Is there such a thing as a univariate threshold model, or does a threshold model only apply when there are 2+ traits of interest?

You can very much have a threshold model for one phenotype: In fact it is quite common. e.g. ACE model of thresholded depression status.

It sounds like you are confusing the number of traits with the number of thresholds/categories informing on the traits. A univariate twin model deals with the analysis of one variable measured in twin pairs, e.g. Depression. This variable can have two categories (one threshold) or more categories (where the number of thresholds is n cat - 1).

11. I see the workshop sessions are being recorded - is there somewhere we can see them online?

There will be but not yet.

12. Question regarding Sanja's practical: When plotting the moderating effects of SES on the heritability of IQ, we used Va <- (a + am*ses)^2 , to scale the x axis of the plot to the five categories of SES (0-4). What if the moderating variable is truly continuous (e.g., age) - how do you modify this piece of the script?

In this particular plot, Va ← (a + am*ses)^2 was the expected variance conditional on the level of the moderator, and was represented by the blue line. Here the x-axis was SES (given by ses ← sort(unique(c(mzData$ses))), which returns a vector of integers from 0 to 4), and the y-axis gave the magnitude of variance component(s), with the limits of the y-axis being determined by the matplot() function (and dependent on the range of the plotted Va, Vc, Ve, and Vt values). If your variable was truly continuous, Va ← (a + am*moderator)^2 would still give you expectations for the variance conditional on the moderator, so you could keep this line of code as is. For the x-axis, using sort(unique(c(mzData$moderator))) will also still be correct, as this returns the unique values of the moderator variable in a sorted order (so in the case of SES 0, 1, 2, 3, and 4 – there may be more of these with a truly continuous moderator, and they might not be equidistant, but that is ok – as long as you're using moderator ← sort(unique(c(mzData$moderator))) first and Va ← (a + am*moderator)^2 later, as in the practical code).

13. Is it true that the effects of age can mimic the effects of C in twins?

Twins share their age, which, if behavior is linked to age (which it often is) makes pairs more similar just for that reason. Controlling for age may, or may not, leave some C due to other factors.

This is why we either residualize for age (and sex) (McGue1984), or model sex in the means (Cardon & Neale) or in the covariance matrix itself (Neale1989; Schwabe2015).

McGue, M., & Bouchard, T. J., Jr. (1984). Adjustment of twin data for the effects of age and sex. Behavior Genetics, 14(4), 325-343.

Neale, M. C., & Martin, N. G. (1989). The effects of age, sex, and genotype on self-report drunkenness following a challenge dose of alcohol. Behavior Genetics, 19(1), 63-78. doi:10.1007/BF01065884

Schwabe, I., Boomsma, D. I., Zeeuw, E. L., & Berg, S. M. (2015). A New Approach to Handle Missing Covariate Data in Twin Research : With an Application to Educational Achievement Data. Behavior Genetics. doi:10.1007/s10519-015-9771-1

14. Is it true that the effects of assortative mating can mimic the effects of C in twins?

Assortative mating will drive up genetic similarity in DZ twins. As such it would be modelled by setting rDZ > .5

Assortative mating increases genetic variance, which increases the phenotypic variance, the MZ twin covariance AND the DZ covariance all by the same amount. Therefore, it looks exactly like C in the ACE model.

When not modelled explicitly, then yes - the increased DZ similarity will look like C (shared environment).

15. I am studying risk-taking in twins. I find the following: rMZF = 0.60, rMZM = 0.60, rDZF = 0.30, rDZM = 0.30, rDOS = 0.0. What could possibly account for these correlations?

The trait is heritable, but behavioral expression of these genes is completely sex-specific, and so does not generate shared phenotypic resemblance in opposite-sex pairs. It's worth noting that the same correlations across sex (the .6 and .3 for example) might obscure quite different amounts of genetic variance. Examination of the phenotypic variances and of the unstandardized variance components can be quite helpful.

16. Is it possible to do a walk-through of one or two OpenMx functions, it's arguments, the objects created?

Questions/comments from Monday

1. I'm still having trouble getting files out of Faculty folder (says Read Only + can't save to separate location)

This should now be fixed. If you still have the problem, follow the directions below, and the problem should not return.

The “Home” folder must be opened before the “Faculty” folder is opened. It is a bug in the file browser, in that it gets confused when it opens the read-only faculty folder first. If your system is already confused, then close all of your file browsing windows, logout, login again, and open “Home” first.

2. Are there plug ins for drawing path diagrams in either R or OpenMx?

(An) Answer: The free Onyx application can do this

3. What is meant by Expectation Objects in the oneSATc script?

(An) Answer: The expectation objects organize the predictions (i.e., expectations) of the model. Usually, with continuous data and full information maximum likelihood estimation, they consist of expected covariances and the expected means. For ordinal data they will contain the expected thresholds, as well as the expected means and covariances.

4. When testing if means are equal across twin order/zygosity, what level of significance of p do you use? p < .05 or p<.01?

(An) Answer: Up to you. As there are few causes for means for T1 and T2, they are typically not significantly different.

(Another) Answer: It is also important to consider the effect size. Because sample sizes for twin studies are often quite large, in the 100's or 1,000's or more, very small differences can be significant. In addition, it is worth considering the number of tests (the number of variables). The usual approach is to use a multivariate test first, if that is significant then examine the variables individually.

5. When reporting MZ and DZ correlations (e.g. in a paper), should we give the values of the saturated model, or those we get once we check the equality of means, variances / thresholds?

(An) Answer: The model produces expected, not obtained covariances. So If you are reporting correlations, you should report the actual phenotypic correlations.

(Another) Answer: It can be better to report the model predicted covariances than those calculated from the sample. The two will be almost the same in the event that there are no missing data (in my experience this only happens with simulation studies or when an investigator has thrown out incomplete data cases, which is known as list wise deletion and should be avoided if at all possible). However, in the case that whether data on twin 2 are missing completely at random or partly predictable from the score of twin 1 (known confusingly as missing at random) then the maximum likelihood estimates would on average be closer to the population value than would the calculated-by-hand Pearson product moment correlation. This is a truly great feature of maximum likelihood.

6. Re inclusion of covariates, should we check their effect before or after exploring the equality of means/variances, etc? Which correlations would we report in this case?

(An) Answer: If the intention is to regress out the covariates and to analyze (ACE model) the residuals, then it would make sense to report correlations based on the regressed-out variables. Note that the choice of whether to regress out prior to analysis (obtain residuals from lm() for example) or to include the covariates directly in the model applies to continuous data. With ordinal data, the covariates should be included in the model.

7. Please provide a brief review of why dominance makes DZ twins less phenotypically similar.

(An) Answer: Dominant genes don't make DZ less similar, like anything shared, it makes them more similar. Just much less so than do additive genes.

The example of you and your DZ twins sharing two alleles of an additive gene for height, and two alleles of a dominant-pattern gene for eye color.

Even if you share only one allele for the height genes, the copy you share will have the same effect in both of you, and make your heights similar.

However if you differ in your blue/brown allele status, the phenotype can be the same in both of you even though you differ genetically. That masked-case lowers the phenotypic correlation between DZ twins below where it would be for an additive model (.25 rather than .5)

8. is it possible to get example code to allow different means to be estimate in univariate models

(An) Answer: Yes

9. What is the benefit of using umx?

(An) Answer: Reliability and mental and workflow efficiency. Because the functions operate at a high level, they encapsulate a lot of sub-tasks, and help you think at that level about what you are doing. Many of the helper functions make tasks that you do infrequently easier to learn or remember, Because the scripts are used by many people, they are less likely to contain errors. They also contain a lot of error-checking code, so can guide you to solve problems.

Reference:umx

10. What to do if your assumptions of twin order / zygosity are not met (e.g. means of twin 1 are significantly higher than means of twin 2)

(An) Answer: Good question - should be discussed. In practice people often assign such events to multiple testing inflation and proceed. However you can simply leave (e.g.) the variances between sexes not equated, they will consume a degree of freedom but, if they were significantly different, will improve model fit.

11. How do you decide on the start value for lower bounds?

(An) Answer: Start values and lower bounds are distinct ideas.

Start values are initial guesses, which hopefully allow the optimizer to find a global optimum for the model parameters. Lower (and upper) bounds are hard barriers which the parameter will never be allowed to cross.

So if you know that negative values are illegal, for instance, you might set lbound = 0

If you want to divide variance among A, C, and E, you might start those parameters at around 1/3 of the variance each.

IBG Wiki

Table of Contents