Questions and Answers

Ask questions at the OpenMx forums and the dedicated 2016 Workshop forum there.

Quick instructions on formatting

??? Start a question with three question-marks like this

!!! Start an answer with 3 exclaimation marks like this

If it is a long answer, then end with "???" on a line by themselves

Questions/comments from Friday

First question is if you only have the self rating and the rating of the other twin or if you did ask both twins to do this. If you only asked one member of the twin pair you will end up with rater bias problems, but in principle a bivariate model would do. If you have the self and the co-twin rating for both twins of the pair a multiple rater design would be best. If you need help with a multiple rater design you can contact Meike Bartels ( Just for fun look at an old paper by Eaves and Last (1980) in J. Pets. Ind. Diff (1) “Assessing empathy in twins through their mutual perception of social attitudes”. It used this type of design but a different approach to analysis.

MN: Hermine's computer died today so she can't help directly. There is a table describing which script is which is here: It links to the files from the table so you can navigate easily.

Also, Hermine's webpage table has links for scripts for univariate/monophenotype [one] and bivariate/diphenotype [two] twin analyses. The file names reflect a naming convention and model details. The naming convention reflects the following features:

1a- Analysis for 2 zygosity groups without any covariates and with an age covariate [a] 1b- Analysis for 5 groups without any covariates and with age covariate [a]

2- Each set includes a saturated model (SAT with sub models testing assumptions), an ACE model (with sub models) and an ADE model (with sub models).

3- Analyses for specific types of variables: continuous ©, binary (b), ordinal (o), mean/variance ordinal (m) and joint (j). The joint analyses refer to bivariate analyses where one measure is continuous and the other is ordinal.

Two examples of the naming convention: 1- The file called “twoACEj” is a script for a bivariate ACE model for joint data (continuous and ordinal measures) and has no covariates. This script also only uses data from two zygosity groups. 2- The file called “oneSAT5ca” is a script for a univariate saturated model using continuous data with age as a covariate. This script uses data from 5 zygosity groups.

You can include twin pairs with missing data in OpenMx. You do not need to drop pairs with missing data in advance of reading data into OpenMx because complete data is used. Similarly, it is not necessary to replace missing values with the mean. Like classic Mx, OpenMx drops incomplete pairs and covariances will not be able to be estimated in pairs where data from one twin is not available.

Of note, R (and therefore OpenMx) only understands that missing is “NA”. If your missing values are coded to anything but “NA”, you will need to tell R what your missing value is in your original dataset. When reading data into R (typically through the read.table() function), you can specify what your missing value will be (using the “na.strings = ” option).

Rob K. says: Also, be aware that OpenMx does not tolerate NAs on definition variables–it will throw an error at runtime.

Mike N says: Also note that the behavior with missing definition variables is the same as it used to be in classic Mx. If you have missing phenotype data AND missing definition variables on one member of a twin pair, it is important to put in a dummy value for the cotwin, say -999. However, if you are modeling Twin 1 as a function of Twin 2's definition variables (necessary in GxE models where both own and cotwin's moderators are regressed out) then the best approach is probably to delete the pair at this time. In future we'll add features to deal with missing definition variables, subject to certain assumptions.

It depends. If there is assortative mating, the NTF model can give you higher heritability than the classical twin design. If both C & D are affecting the trait at the same time, the NTFD will tend to give lower values of A and higher values of D & C, although the broad sense heritability might not be too different. Empirically, this is what's seen when we compare estimates from extended twin family models vs. classical twin designs (e.g., see Coventry & Keller, 2005).

 So when I mention the ” heritability” of a gene, I mean the heritability of the quantity of the gene observed in the white blood cells of these 2700 twin pairs. In fact every individual has every gene, but the sequence of nucleotides in a gene might vary between individuals. Note that a protein coding gene can be between a few hundred and a million nucleotides log, and only a limited number of those will vary between people. Now while every individual has 2 copies of each gene, the degree to which this gene is expressed might widely vary between subjects. Expression requires the cellular machinery in a cell to unpack the DNA strand, copy the DNA strand and have various other parts of the cellular machinery translate the DNA into a protein. To summarize everyone has 2 copies of each gene, the content of the gene(nucleotide sequence) might vary between people (though is identical for MZs and aprox correlated .5 for DZs), the expression of a gene further also varies between subjects and generally does NOT correlate 1 for MZ's and .5 for DZs. Therefore, we can compute the heritability of the expression levels of a gene.

Imagine that on average MZ twins correlate .4 across their entire epi-genome and DZ twins on average correlate .2 across their entire epi-genome. Now imagine further that most MZ or DZ twins do not differ much from these .4 and .2 correlations respectively, in that case the epigenetic effects will behave additively, as their correlation pattern best matched the additive variance component. Now a scenario under which this might arise is when methylated regions related to your trait of interest are strongly correlated to a set of SNPS (i.e. the gene sequence), than MZ's would share all these SNPs, while DZ's would on average share halve of these SNPs. This in turn would lead MZ's to be more alike for these methylated regions than DZ twins, this would be absorbed by the additive component. Given methylation is influenced by genetic effects, common environmental effects and unique environmental effects, much like any complex trait, we need to study the bivariate relationship between methylation and a complex trait to determine whether the relationship is in C (for example the womb), E or A. also note that MZ twin in the womb more often share their chorions and amniotic sacs, thus even the environment in the womb might be shared more strongly between MZ twins than between DZ twins, and thus could behave in a pattern that matches additive genetic effects.

MCK: It is an issue of degree. The rarer the causal variants are, the less well tagged they tend to be with common (MAF > .01) SNPs that exist on modern platforms. So causal variants (CVs) with .005 < MAF < .01 will be only partially captured by GREML, CVs with .001 < MAF < .005 even less so, and so forth. Generally, once you get to CV MAF < .001, the LD between them and common SNPs will be so low as to be ~ totally missed in GREML. One of the issues of active research is in using rarer SNPs (e.g., imputed or sequenced variants) or IBD haplotypes to get at the heritability due to these very rare CVs.

TCB: Worth reading Yang2015 “simulations based on whole-genome sequencing data that ~97% and ~68% of variation at common and rare variants [is captured]”

Yang, J., Bakshi, A., Zhu, Z., Hemani, G., Vinkhuyzen, A. A., Lee, S. H., . . . Visscher, P. M. (2015). Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nature Genetics, 47(10), 1114-1120. doi:10.1038/ng.3390

MCK: Right, but note that the Yang 2015 paper used *imputed* SNPs that were thereby much rarer on average than the SNPs usually included on arrays. They picked up additional variation due to rare causal variants that were better tagged by these rare imputed SNPs. So they are picking up additional rare causal variant variation that would NOT usually have been picked up using GCTA just on SNPs on arrays. So this isn't equivalent to the usual way of running GCTA. In essence, if we had all the sequence variation (instead of just imputed SNPs), we'd be able to pick up 100% of variation due to both rare and common variants.

MCK: First off, you'd want to have a sample that is relatively ethnically homogeneous; e.g., analyzing a sample of mixed ethnicity can lead to problems even if you correct for stratification because the CV-SNP LD will be different between the groups. So, now that we have an ethnically homogeneous sample, the most common way people control for any additional stratification (e.g., subtle differences between Caucasians on a north-south Europe gradient) is to add 5-20 ancestry principal components into the fixed part of the model. This should correct for any effect broad-level stratification has on your estimates.

For background, see these papers.

Rob K. says: I would tentatively answer your second question with a “yes,” as long as you have a sample that is representative of the population of patients who meet diagnostic criteria for (whatever disorder). I'm not altogether sure, though. Obviously, the generalizability of your results to the general population would be highly questionable.

Rob K. says: Status RED means the optimizer is not certain it has found a minimum of the fitfunction. So, no, your parameter estimates are probably not reliable, and the standard errors are even more suspect. Status RED with code 6 (first-order conditions not met) is worse than with code 5 (second-order conditions not met). You should always try to do something about status RED, for instance:

  • If you're analyzing ordinal data, sometimes a status RED is unavoidable without changing some mxOptions. In fact, it's possible to get status RED with ordinal data even when the optimizer has found a minimum.
  • Use different start values.
  • Try a different optimizer.
  • Reparameterize your MxModel.
  • Use mxTryHard() or one of its wrapper functions. Note that, by default, mxTryHard() prints to console the start values it used to find the best solution it found. The idea is that you copy-paste those start values into your script, and assign them to your pre-mxRun() model, using omxSetParameters().

Trying different start values is the most important thing.

MCK: I'll let Ben weigh in as well. But in essence, they should be two different ways of estimating the *same* parameter: SNP-heritability. I think that the differences in the literature are about what we'd expect given the SE's on the estimates. Ben - are there any systematic differences between the two?

One place a researcher can begin searching for publicly available data is the repository developed and managed by the Inter-university Consortium for Political and Social Research (ICPSR). You can search for “twins” and datasets that have twin data collected and available for dissemination will be listed.

As an aside, another publicly available resource towards developing harmonized measures of biomedical phenotypes is the PhenX toolbox ( No data for analysis is available for download at this website, but it is good if you are thinking about study design of a project.

MCK: Great question! Nick, Dorret, John?? Want to weight in? Sadly (I think), twin research has not kept up with whole-genome research, where sharing of data is not only the norm, but also mandatory (for NIH funding at least). Would be scientifically useful if the same were true of twin research!

Questions/comments from Thursday

There are several ways that expected means, thresholds and covariance matrices can be generated. For this workshop, the algebraic approach is taken by setting up matrix objects and matrix formulae to generate these values. Internally, OpenMx inspects the algebra elements to make sure that there's nothing illogical (like A = B and B = A). During optimization, different values of the parameters are supplied to the free parameter elements of the matrices, and the algebras are recomputed, so that the expected covariance matrix and means/thresholds change. The software evaluates the amount of change in the -2lnL as a function of changing each of the parameters. This information is used to figure out a new set of parameter estimates which are hopefully nearer the minimum.

Note that it is also possible to specify models using mxPath() function calls, which may be more familiar if you are accustomed to specifying models that way in other software. There's a simple example on the OpenMx homepage.

TCB: Good question! This is the code snippet you posted:

	modelACE <- mxModel( "mulACEc", pars, modelMZ, modelDZ, multi, estVC ) 

There is a lot of stuff bundled up in there.

Let's unpack and simplify… We'll make a “top” model that contains all the algebra, then “MZ” and “DZ” models that refer to these, and contain their own data and optimization functions.

Then we put “top”, “MZ”, and “DZ” inside a supermodel “ACE”.

ntv = 2
nv = ntv/2
myACE = mxModel("ACE",
		mxMatrix(name = "meanG", "Full" , nrow = 1 , ncol = ntv, free = T),
		mxMatrix(name = "a"    , "Lower", nrow = nv, ncol = nv , free = T),
		mxMatrix(name = "c"    , "Lower", nrow = nv, ncol = nv , free = T),
		mxMatrix(name = "e"    , "Lower", nrow = nv, ncol = nv , free = T),
		mxAlgebra(name = "A", a %*% t(a)),
		mxAlgebra(name = "C", c %*% t(c)),
		mxAlgebra(name = "E", e %*% t(e)),
		mxAlgebra(name = "ACE", A + C + E),
		mxAlgebra(name = "AC" , A + C),
		mxAlgebra(name = "hAC", 0.5 %x% A + C),
		mxAlgebra(rbind(cbind(ACE, AC ),
		                cbind(AC , ACE)), name = "expCovMZ"),
		mxAlgebra(rbind(cbind(ACE, hAC),
		                cbind(hAC, ACE)), name = "expCovDZ")
		mxExpectationNormal("top.expCovMZ", means = "top.meanG"),
		mxData(mzData, type = "raw")
		mxExpectationNormal("top.expCovDZ", means = "top.meanG"),
		mxData(dzData, type = "raw")

Now we can answer your question: “where does the script indicate the relationships between latent ACE and the observed indicators (family, happy etc)?”

The answer is: “in mxExpectationNormal and mxFitFunctionML”. mxExpectationNormal takes the expected covariance “top.expCovDZ” and the expected means (“top.meanG”), and compares these (according to maximum likelihod) to the mxData in the model containing the expectation.

The supermodel looks at how well each of the expectations (MZ and DZ) fit, and ensures the sum of both is as good (small) as possible.

Does that help?

The same answers hold here as elsewhere in stats: If your data are highly skewed, then it's likely that they are better expressed in a transform.

Reaction time for instance is skewed and long-tailed. The reciprocal or “reactions per second” (1/RT) is normally distributed, and is the advisable transform.

You should plot them, however. If the most common value is 0, then transforming won't normalize, and you might want an ordinal or tobit analysis instead.

- Another point of view on this (from Sarah)-

If you are going to transform, as a reviewer, I like to see the results of the untransformed data in an appendix/sup table.

If after transformation your data does not have a single mode and the distribution is very asymmetric (ie symptom count type data in an unselected cohort) you should convert the raw data to an ordinal categorical variable. Remember that you do not really gain much power by adding extra categories if the prevalence of the extra categories is very low and you end up empty cells in your twin1/twin2 cross tab.

Additional remarks (Conor): 1) The distributional assumption in ML (as used here) is that the data are multivariate normally distributed. So if you are concerned about normality given the ML normality assumption, then you'll have to consider multivariate normality in addition to marginal (i.e., univariate) normality. 2) Remember that the normality assumption concerns the distribution of the data conditional on fixed covariates (definition variables with a main effect on the phenotypes). So consider height in 500 males and 500 females (all 30 years old). The distribution of height in the complete sample (N=1000) may deviate from normality, while the distributions in the subsamples (males and females) may be perfectly normal.

Rob K. remarks: If there is a generally accepted transformation for your kind of variable in your discipline, use that. If not, I advocate adjusting one's methods to suit one's data, not adjusting one's data to suit one's methods! Non-normality should not bias your normal-theory MLEs, but it can indeed bias your inferences about parameters–test statistics, standard errors, and confidence intervals. So, consider bootstrapping, permutation testing, and the like, which are rather general nonparametric procedures for inference. Alternately, consider using a different likelihood function that's more reasonable for your variables. I'm planning new features in OpenMx that will facilitate its use with multivariate non-Gaussian data. Also, I have a recent paper on twin modeling with phenotypes that are count variables.

Recode your data into ordinal categories, and use table(data$variable_twin1,data$variable_twin2) to make sure there aren't any empty cells.

Mike N: Try to understand why your data are non-normal. If they are a sum of binary items that not very many people endorse, it may be better to work around a model that explicitly includes the item level data and replaces the sum score with a latent factor. Then twin correlations for the latent factor, or its correlation with other factors or measurements, may be of interest. In all cases, it is worth examining measurement invariance, since incorrect inference can emerge from using sum scores or factor modeling if this assumption is violated.

Conor: there are a number of interpretations. I think that such effects will render a (A path) potential random over individual. That is whereas we now think of a as a fixed parameter it may be a parameter that is ditributed, say, normally with mean mu(a) and standard deviation sd(a). Treating it as fixed, we get from the model mu(a). This is really just speculation. But for an elaborate presentation of this please see

Suppose that the MZ correlation is .8 and the DZ correlation is .5

    A + C = .8
1/2 A + C = .5

Then subtracting the DZ expression from the MZ, we have:

1/2A = .3
   A = .6

substitute this back into either the MZ or the DZ equation so

 .6 + C = .8
      C = .2

Given that these are correlations, we can then deduce the value of E, because A+C+E=1 for standardized variables.

 MCK: You can figure the general formula just by multiplying by constants and subtraction. E.g., 
CVMZ = A + C
CVDZ = 1/2 A + C

So CVMZ-CVDZ = A + C - 1/2A - C = 1/2A. SO to get A:
A_hat = 2*(CVMZ-CVDZ).

Similarly to solve C, find a way to eliminate A and isolate C. E.g.:
2CVDZ - CVMZ = A + 2C - A - C = C_hat

One answer is that if the dependent variable is ordinal, you have no choice.

Another is that by incorporating the covariates in the model, you have the estimates in one place.

Rob K. says: Stated briefly, if you incorporate the covariates into the MxModel, you end up simultaneously maximizing the joint likelihood of ALL unknown parameters–both the biometric variance components and the regression coefficients for the covariates. If you instead use OLS regression to residualize the phenotype(s), and then analyze the residuals in OpenMx, the parameter estimates for the variance components are conditional MLEs, given ordinary least-squares estimates of the regression coefficients. Such conditional estimators have weaker theoretical properties than proper “joint” MLEs. See these posts for further detail.

TCB: Kind of: if you set checkpointing, then you can watch the checkpoint file grow.

You can do this with calls to options, or just use umx_set_checkpoint()

	mxOption(NULL, "Always Checkpoint"   , always)
	mxOption(NULL, "Checkpoint Count"    , interval)
	mxOption(NULL, "Checkpoint Units"    , units)	
	mxOption(NULL, "Checkpoint Prefix"   , prefix)

MN adds: Also, on a linux or OS X system you can use the command line function less to inspect the output. You can step through the file with the space bar or up and down arrow keys. Go to the end of the file with F, and stay there as the file grows with G (ctrl-c to exit) and q to quit.

Rob K. adds: if you happen to be using mxTryHard(), you can provide it with a value greater than zero for its argument 'verbose'. You'll see additional information from the optimizer being printed to the console.

MCK: Two answer to this. A p-value of .04 with a large sample size (high power) and one of .04 with a small sample size (low power) both provide equal evidence for rejecting the null that the estimate is different from 0 (assuming that's your null). So one answer is, “Yes”.

But another answer to this is “no”. An estimate with a p-value of .04 in a small sample (vs. one of .04 in a large sample) has a lot more error variance (it's SE is much larger) and so you will be less sure about where exactly the true estimate is likely to lie.

JKH: Low power increases the likelihood that `significant' findings will be false positives. (see Button, Ioannidis et al, 2014)

Yes, in two main ways. An easy way is simply to treat the issue as a missing data problem. So if you set the model up for ages 1, 2, … 14 – as if there were up to 14 measures per person (or 28 per twin pair). Then just mark as NA all the occasions on which a person was not measured. The second way works well if there are very many ages (continuous age, for example). Here you would use one definition variable for each observed variable. So if you had 4 occasions measured, then you would have the 4 measures and the 4 ages at which the person was measured. In the model the factor loading for linear growth would be replaced by the age of measurement using a label such as data.ageATtime1 assuming ageATtime1 is a variable in the dataset that states the age at measurement. Here again missing data may be key; the model should be set up for the maximum number of occasions of any individual. Any missing measures would have NA, but their corresponding definition variable can be set to an arbitrary value like -999 (and should NOT be marked as NA). The rationale here is that OpenMx deletes cases that have at least one missing value for a definition variable.

Agreed. Treat as missing data problem. That advantage of full information maximum likelihood raw data analysis is that it's robust to missing data.

If you have an estimate of assortment, you could put it into the model by altering the value of DZr away from .5 gene sharing.

MN: A possibility would be to use a sensitivity analysis, depending on whether you think the marital similarity comes from assortment (selection of spouses) or from social environment similarity, or from marital interaction. Thus one could adjust the amount of the genetic correlation change. A purely phenotypic assortment model might substitute .5(1 + a2 μ) where a2 is the heritability and μ is the marital correlation.

Questions/comments from Wednesday

One answer is, “you could, of course - that's just a different theory”. It could be tested. In reality, it's likely that both things are happening: some health/resilience factors causing several behaviors, and those behaviors piling up to cause “general health”

I (Conor) consider General Health (GH) to be an index variable, that is a summary variable like the APGAR score (“general health” of the neonate). Suppose I am studying GH in adolescents (say 14 to 16 y olds). I do not see how a kid's GH is a cause the kid's smoking, just like I do not see how a low APGAR score is causing respiratory problems. However, there are definitely situations in which GH may cause smoking. Consider an ex smoker, whose GH has declined due to, say, old age. The decline in GH may be a source of dispair and stress. In reaction the person starts smoking (to relieve stress). Here GH is causing smoking, but in a rather exceptional circumstance. So the causal status of GH may depend on the context?

I would argue that this threshold is not arbitrary and most children are not slightly autistic, they either are or not. So can I use 9,970 normal subjects scoring very low on an autism scale and 30 autistic children score very high on the scale in a factor analysis. Will the results of the FA validly reflect the underlying factor structure of the 84-item autism scale used in the study?

Conor answers: 84 items is alot! Suppose that the response format is dichotomous (yes, no), and the probability of endorsement is low (say on average 2% of testees respond “yes”). Then you're going to have a hard time to fit a factor model using full information maximum likelihood purely for computational reasons. However if your sample size is large (more like 10000 than 1000) then there are least squares estimation methods that may work. E.g., (Diagonally Weighted Least Squares (WLS). Such methods are available in OpenMx.

Legal values (from mxAvailableOptimizers() ) are:'CSOLNP', 'SLSQP', and 'NPSOL'

NPSOL is proprietary, Fortran-based, and pretty rock solid. See here for its documentation.

CSOLNP is our own open-source implementation of solnp (which has an R implementation in package Rsolnp). It is coded in C++, perhaps less robust than the other two, but improving and already quicker than others at certain problems. learn more here

SLSQP is one of a suite of optimizers implemented in nlopt.

There are also Newton-Raphson and EM optimizers (for advanced users).

FYI: If you are using OpenMx you can set the optimizer easily using the following:

mxOption(NULL,"Default optimizer","NPSOL")

You can also see the currently set optimizer with

mxOption(NULL,"Default optimizer")

If you are using umx you can see and set the optimizer easily with


Concerning why you would use one versus another: NPSOL and SLSQP are similar in many respects. We're using NPSOL in this workshop largely for historical reasons–NPSOL was the ONLY available optimizer in OpenMx version 1, and in classic Mx before that. It's therefore what the workshop faculty are most familiar with. Savvy users interested in technical details may wish to consult the help pages for mxOption() and mxComputeGradientDescent() .

This is dependent on what you want the matrix to contain/do. A full matrix allows all cells to take any value. A symmetric matrix makes the upper triangle and lower triangle equal (as you might see in a correlation matrix), a lower matrix only stores the lower triangle (as you might want in a Cholesky where you are going to %*% the matrix with its transform). In general, there is no general answer to this question :-).

You might also check out the OpenMx wiki, esp matrices. Also see here and here.

Yes. What you do is run your model with a row-likelihood (likelihood for each subject), then embed that in a model which weights each row and tries to minimise the sum of those weighted likelihoods.

When 'C' is dropped from the model, evaluation of fitAE$rC would involve inverting a matrix of all zeroes, which in turn would throw an error. For that reason, OpenMx leaves fitAE$rC untouched (which may not be the ideal behavior). Compare to fitAE$C, which DOES appropriately reflect the fact that the 'C' paths are fixed to zero.

One way to see that it is not arbitrary is to try putting in impossible values: the model will usually code red with non-positive definite matrices, or other problems. So you need to start things in the legal space. The further away from the bet vlue that you start things, the longer the model will take to run, as it finds its way to the best solution.

Another way to think about this is that if you knew where to start things exactly, you wouldn't need to run the model: you'd just start the model at its optimal values, and it would stay there. So the values are unknown, but important, rather than arbitrary.

In general mxConstraint is a much harder problem for the optimizer to solve - essentially it will try shift a parameter, optimize, then check that the constraint was not violated. If it was, it just tries again. When you know what you want a parameter to do, e.g. “stay at zero”, it is much better to just do that. Use constraints for situations that can't be done with simple “free = F, value=X” code.

By building and running the moderator model and the mediator model and mxCompare them. It is useful to draw the path diagram on paper to think through exactly what each is claiming.

They are probably right, in that the univariates can be a guide to how to model the multivariate model, and a good opportunity to test assumptions like sex-limitation and variance equation in a simpler context.

(Scripts for bivariate saturated are in Hermine’s folder). Indeed, assumptions have to be tested in the multivariate case as well.

Questions/comments from Tuesday

(An) Answer: Sarah's presentation script (oneACEca.R) was following up from the saturated script that was run from yesterday (oneSATc.R). She wanted to compare the fit of the ACE model with the saturated model. The object “fit” identified in the script was referring to an object developed in the saturated script (oneSATc.R).

Regarding fitGofs and fitEsts. It is possible that you did not run the file that contains some specialized functions used for summarizing results (line 3 in script, written as– source(“miFunctions.R”) ). If this script wasn't run successfully, R doesn't know anything about these functions and will produce an error. To be clear, miFunctions.R is a script that contains functions that are helpful for seeing results but it isn't necessary for the succesful production of any OpenMx models.

The QIMR website for classic papers is here

Lindon mentioned Urbach 1974.

The faculty folder can only be accessed while on the Workshop network. It cannot be accessed from the hotel's wifi, or other outside networks. If you're in the hotel, you can come down to the lobby or someplace nearby that is reached by the Workshop wifi. The Workshop wifi is left running all the time.

When the Workshop is over, all of the faculty files will be collected and placed on the web in a location that is accessible from everywhere. Once this is done an email will be sent out notifying you that it is available.

For 2 or 3 levels, you can “just” make a multi-group model (as you do already for twin models, with a group for each level of the moderator. Then test if you can equate ACE across the groups.


If you want to run a Purcell type moderation script there are no changes required. The script is the same regardless of whether the moderator is continuous or categorical. Just remember to make your lowest category 0 so that you can interpret the output easily.

Because the model is asking a question: If I posit something shared, how much impact does it appear to have? The answer might be (as your question implies) “not at all - they're opposite sex” Or it might be “lots”.

Wikipedia is helpful. So too is John Fox's package polycor

(Another) Another nice reference on tetrachoric correlations can be found at

mxRun just runs your model. You can manually rerun it by just entering the model you get back from mxRun into mxRun again. mxTryHard tries to be intelligent in moving parameter start values, and estimation to solve intransigent models. Basically use `mxRun` unless things don't work, then use mxTryHard. nb: mxTryHard is not a magic bullet - you might need to think about your start values and might need to fix your model.

The dominance component will be able to account for epistatic inheritance. However, the issue of where variance components go when not modeled correctly is covered in more depth in Mat Keller's talk

You can very much have a threshold model for one phenotype: In fact it is quite common. e.g. ACE model of thresholded depression status.

It sounds like you are confusing the number of traits with the number of thresholds/categories informing on the traits. A univariate twin model deals with the analysis of one variable measured in twin pairs, e.g. Depression. This variable can have two categories (one threshold) or more categories (where the number of thresholds is n cat - 1).

In this particular plot, Va ← (a + am*ses)^2 was the expected variance conditional on the level of the moderator, and was represented by the blue line. Here the x-axis was SES (given by ses ← sort(unique(c(mzData$ses))), which returns a vector of integers from 0 to 4), and the y-axis gave the magnitude of variance component(s), with the limits of the y-axis being determined by the matplot() function (and dependent on the range of the plotted Va, Vc, Ve, and Vt values). If your variable was truly continuous, Va ← (a + am*moderator)^2 would still give you expectations for the variance conditional on the moderator, so you could keep this line of code as is. For the x-axis, using sort(unique(c(mzData$moderator))) will also still be correct, as this returns the unique values of the moderator variable in a sorted order (so in the case of SES 0, 1, 2, 3, and 4 – there may be more of these with a truly continuous moderator, and they might not be equidistant, but that is ok – as long as you're using moderator ← sort(unique(c(mzData$moderator))) first and Va ← (a + am*moderator)^2 later, as in the practical code).

Twins share their age, which, if behavior is linked to age (which it often is) makes pairs more similar just for that reason. Controlling for age may, or may not, leave some C due to other factors.

This is why we either residualize for age (and sex) (McGue1984), or model sex in the means (Cardon & Neale) or in the covariance matrix itself (Neale1989; Schwabe2015).

McGue, M., & Bouchard, T. J., Jr. (1984). Adjustment of twin data for the effects of age and sex. Behavior Genetics, 14(4), 325-343.

Neale, M. C., & Martin, N. G. (1989). The effects of age, sex, and genotype on self-report drunkenness following a challenge dose of alcohol. Behavior Genetics, 19(1), 63-78. doi:10.1007/BF01065884

Schwabe, I., Boomsma, D. I., Zeeuw, E. L., & Berg, S. M. (2015). A New Approach to Handle Missing Covariate Data in Twin Research : With an Application to Educational Achievement Data. Behavior Genetics. doi:10.1007/s10519-015-9771-1

Assortative mating will drive up genetic similarity in DZ twins. As such it would be modelled by setting rDZ > .5

Assortative mating increases genetic variance, which increases the phenotypic variance, the MZ twin covariance AND the DZ covariance all by the same amount. Therefore, it looks exactly like C in the ACE model.

When not modelled explicitly, then yes - the increased DZ similarity will look like C (shared environment).

The trait is heritable, but behavioral expression of these genes is completely sex-specific, and so does not generate shared phenotypic resemblance in opposite-sex pairs. It's worth noting that the same correlations across sex (the .6 and .3 for example) might obscure quite different amounts of genetic variance. Examination of the phenotypic variances and of the unstandardized variance components can be quite helpful.

Questions/comments from Monday

This should now be fixed. If you still have the problem, follow the directions below, and the problem should not return.

The “Home” folder must be opened before the “Faculty” folder is opened. It is a bug in the file browser, in that it gets confused when it opens the read-only faculty folder first. If your system is already confused, then close all of your file browsing windows, logout, login again, and open “Home” first.

(An) Answer: The expectation objects organize the predictions (i.e., expectations) of the model. Usually, with continuous data and full information maximum likelihood estimation, they consist of expected covariances and the expected means. For ordinal data they will contain the expected thresholds, as well as the expected means and covariances.

(An) Answer: Up to you. As there are few causes for means for T1 and T2, they are typically not significantly different.

(Another) Answer: It is also important to consider the effect size. Because sample sizes for twin studies are often quite large, in the 100's or 1,000's or more, very small differences can be significant. In addition, it is worth considering the number of tests (the number of variables). The usual approach is to use a multivariate test first, if that is significant then examine the variables individually.

(An) Answer: The model produces expected, not obtained covariances. So If you are reporting correlations, you should report the actual phenotypic correlations.

(Another) Answer: It can be better to report the model predicted covariances than those calculated from the sample. The two will be almost the same in the event that there are no missing data (in my experience this only happens with simulation studies or when an investigator has thrown out incomplete data cases, which is known as list wise deletion and should be avoided if at all possible). However, in the case that whether data on twin 2 are missing completely at random or partly predictable from the score of twin 1 (known confusingly as missing at random) then the maximum likelihood estimates would on average be closer to the population value than would the calculated-by-hand Pearson product moment correlation. This is a truly great feature of maximum likelihood.

(An) Answer: If the intention is to regress out the covariates and to analyze (ACE model) the residuals, then it would make sense to report correlations based on the regressed-out variables. Note that the choice of whether to regress out prior to analysis (obtain residuals from lm() for example) or to include the covariates directly in the model applies to continuous data. With ordinal data, the covariates should be included in the model.

(An) Answer: Dominant genes don't make DZ less similar, like anything shared, it makes them more similar. Just much less so than do additive genes.

The example of you and your DZ twins sharing two alleles of an additive gene for height, and two alleles of a dominant-pattern gene for eye color.

Even if you share only one allele for the height genes, the copy you share will have the same effect in both of you, and make your heights similar.

However if you differ in your blue/brown allele status, the phenotype can be the same in both of you even though you differ genetically. That masked-case lowers the phenotypic correlation between DZ twins below where it would be for an additive model (.25 rather than .5)

(An) Answer: Reliability and mental and workflow efficiency. Because the functions operate at a high level, they encapsulate a lot of sub-tasks, and help you think at that level about what you are doing. Many of the helper functions make tasks that you do infrequently easier to learn or remember, Because the scripts are used by many people, they are less likely to contain errors. They also contain a lot of error-checking code, so can guide you to solve problems.


(An) Answer: Good question - should be discussed. In practice people often assign such events to multiple testing inflation and proceed. However you can simply leave (e.g.) the variances between sexes not equated, they will consume a degree of freedom but, if they were significantly different, will improve model fit.

(An) Answer: Start values and lower bounds are distinct ideas.

Start values are initial guesses, which hopefully allow the optimizer to find a global optimum for the model parameters. Lower (and upper) bounds are hard barriers which the parameter will never be allowed to cross.

So if you know that negative values are illegal, for instance, you might set lbound = 0

If you want to divide variance among A, C, and E, you might start those parameters at around 1/3 of the variance each.

workshop/2016/questions.txt · Last modified: 2016/03/11 16:31 by     Back to top