Monday, February 9, 2015

Simulation: surprisingly common!

The other day we did a post suggesting that, given the commitment to a mega-genomic endeavor based on the belief that genomes are generally predictive of our individual lives, we should explore the likely causal landscape that underlies that assumption, with tools including research-based computer simulation.  Simulation can show how genomic causation can most clearly be assessed from sequence data.  Compared to the empirical project itself, simulation is very inexpensive and trivially fast.  Research simulation of various sorts has a long history, and is a tool of choice in many serious sciences.  Research simulations are not just 'toy model' curiosities, but are real science, because their conclusions and predictions can be tested empirically.

Still, many feel free to dismiss simulations, as we discussed. It seems in a sense too easy--too easy to cook up a simulation to prove what you want to prove, too easy to dismiss the results as unrelated to reality.  But what alternatives do we have?

First, roughly speaking, in the physical sciences there exists very powerful theory based on well-established principles, or 'laws'.  Laws of gravity or motion are classic examples.  A key component of this theory is replicability.  In biology, our theory of evolution and hence of genomic causation is far less developed in terms of, yes, its 'precision'.  Or perhaps it's more accurate to say that, as currently understood, genetic theory is fine but we are inaptly applying statistical methods designed for replicable phenomena to things that are not replicable in the way the methods assume (e.g., every organism is different).  In the physical sciences, to a great extent, the theory comes to a new situation from the outside: it is already established and is simply applied to the new case.  Newton's laws, the very same laws, help us predict rates of fall here on earth, land a robot on a comet far away in the solar system, and even study the relative motion of remote galaxies.

By contrast, evolution is a process of differentiation, not similarity. It has its laws that may be universal on earth, and in their own way comparably powerful, but that are very vaguely general: evolution is a process of ad hoc change based on local genomic variation and local circumstances--and luck.  They do not give us specific, much less precise 'equations' for describing what is happening.

We usually must approach evolution and genomic causation from an internal-comparison point of view.  We evaluate data by comparing our empirical samples to what we'd get if the experiment were repeatable and nothing systematic were going on (the 'null' hypothesis), or we ask whether something we specify is more likely to be going on (the alternative hypothesis).  That is, we use things like significance tests, and make a subjective judgment about the results.  Or, we construct some formulation of what we think is going on, like a regression equation Y = a + bX, where Y is our outcome, X is some measured variable, a is an assumed constant, and b is an assumed constant effect per unit of X.  This is just one illustration; there is a huge array of versions of the same idea.
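To make the two ideas in the paragraph above concrete, here is a minimal Python sketch, not anybody's actual analysis pipeline.  It fits Y = a + bX by ordinary least squares and then builds a 'null' distribution for b by shuffling Y on the computer.  The data are made up, and the function names (fit_line, permutation_p_value) and all the numbers are my own assumptions for illustration only.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares estimates (a, b) for Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def permutation_p_value(xs, ys, n_shuffles=2000, seed=0):
    """Share of shuffled data sets whose |b| is at least the observed |b|.

    Shuffling Y breaks any link to X, so each shuffle is one draw from
    the 'nothing systematic is going on' (null) scenario.
    """
    rng = random.Random(seed)
    _, b_obs = fit_line(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        _, b_null = fit_line(xs, shuffled)
        if abs(b_null) >= abs(b_obs):
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)

# Made-up data: a true slope of 0.5 plus noise (assumed values, not real measurements).
data_rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [2.0 + 0.5 * x + data_rng.gauss(0, 1) for x in xs]

a_hat, b_hat = fit_line(xs, ys)
p = permutation_p_value(xs, ys)
print(f"a = {a_hat:.2f}, b = {b_hat:.2f}, permutation p = {p:.4f}")
```

Note that the judgment about whether p is 'small enough' is still left to the analyst, and that the null distribution here is itself generated by computer resampling.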

But these 'models' are almost always very generic (in part because we have good ready-made algorithms or even programs to estimate a and b). We make some subjective judgment about our estimate of the parameters of this 'model', such as that our a and b estimates are correct to a known extent and are in fact causal factors. One might estimate the value of a or b to be zero, and thus declare, on the basis of some such subjective judgment, that the factor is not involved in the phenomenon.  Such applied statistical approaches are routine in our areas of science, and there are many versions of the basic idea.  The resulting estimates are these days often, if informally and especially in policy decisions, taken to be true parameters of nature.  For example, this simplistic equation assumes that the effect of X on Y is linear.  There is usually no substantial justification for that assumption.

In a fundamental sense, statistical analysis itself is computer simulation!  It is not the testing of an a priori theory against data, either to estimate the values of things involved or to test the reality of the theory. The models tested are rarely intended to specify the actual process going on, and they purposely ignore other factors that may be involved (or, of course, ignore them because they aren't known).

Proper research simulation goes through the same sort of epistemological process.  It makes some very specific assumptions about how things are and generates some consequent results, which are then compared to real data.  If the fit is good then, as with regular statistical analysis, the simulated model is taken to be informative.  If the fit is not good then, as with regular statistical analysis, the simulation is adjusted (analogous to adding or removing terms, or estimating parameters differently, in a statistical 'model').  Simulation models are not constrained in advance--you can simulate whatever you specify.  But neither are statistical models--you can data-fit whatever 'model' you specify.
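The assume, generate, compare, adjust cycle described above can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: I use neutral genetic drift (binomial resampling of 2N allele copies each generation) as the assumed process, the 'observed' loss fraction is a hypothetical number rather than real data, and the function names and candidate population sizes are my own.

```python
import random

def simulate_final_freq(pop_size, start_freq, generations, rng):
    """One forward run of neutral drift in a diploid population of size N."""
    copies = 2 * pop_size
    freq = start_freq
    for _ in range(generations):
        # Each of the 2N copies in the next generation is drawn at the current frequency.
        count = sum(1 for _ in range(copies) if rng.random() < freq)
        freq = count / copies
        if freq in (0.0, 1.0):  # allele lost or fixed; nothing more can change
            break
    return freq

def loss_fraction(pop_size, start_freq, generations, n_runs, seed=0):
    """Fraction of simulated runs in which the allele is lost."""
    rng = random.Random(seed)
    lost = 0
    for _ in range(n_runs):
        if simulate_final_freq(pop_size, start_freq, generations, rng) == 0.0:
            lost += 1
    return lost / n_runs

# Hypothetical 'observed' summary to compare against (not real data).
observed_loss = 0.30

# Adjust the assumption (population size) and see which version fits best.
candidates = [10, 50, 100]
fits = {}
for n in candidates:
    simulated = loss_fraction(n, start_freq=0.2, generations=50, n_runs=200)
    fits[n] = abs(simulated - observed_loss)
    print(f"N = {n}: simulated loss = {simulated:.2f}, misfit = {fits[n]:.2f}")

best = min(fits, key=fits.get)
print("best-fitting assumed N:", best)
```

As with adding or dropping terms in a regression, picking the best-fitting N here shows only that the assumption is compatible with the summary, not that it is the true process.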

There are lots of simulation efforts going on all the time.  Many are using the approach to try to understand the complexity of gene-interaction systems in experimental settings.  There is much of this also in trying to understand details such as DNA biochemistry, protein folding, and the like. There is much simulation in population ecology, and in microbiology (including infectious disease dynamics).  I personally think (because it's what I'm particularly interested in) that not nearly enough is being done to try to understand genetic epidemiology and the evolution of genomically complex traits.  To a great extent, the reason, in my oft-expressed opinion, is the drive to keep up support for very large-scale and/or long-term studies, given the promises of medical miracles from genetic screening that have been made.  Those reasons are political, not scientific.

Analytic vs Descriptive theory
I think that the best way to get closer to the truth in any science is to have a more specific analytic theory, that describes the actual causal factors and processes relative to the nature of what you're trying to understand.  Such explanations begin with first principles and make specific predictions, as contrasted to the kind of generic descriptive theory of statistical models.  Both approaches are speculative until tested, but generic statistical descriptions generally do not explain the data, and hence have to make major assumptions when extrapolating retro-fitted statistical estimates to prospective prediction.  The latter is the goal of analytic theory, and certainly what is being grandly promised.  Here, 'analytic' can include probabilistic factors including, of course, the pragmatic ones of measurement and sampling errors, etc.  It is an open question whether the nature of life inherently will thwart both of these, leaving us with the challenge to at least optimize the way we use genomic data.

It is not clear if we have current theory that is roughly as good as it gets in regard to explaining evolutionary histories or genomic predictive power.  Life certainly did evolve, and genomes do affect traits in individuals, but how well those phenomena can be predicted by enumerative approaches to variation in DNA sequences (not to mention other factors we currently lump as 'environment') based on classic assumptions of replicability, is far from clear.  At least, at present it seems that as a general rule, with exceptions that can't easily be predicted but can be understood (such as major single-gene diseases or traits with strong effects of the sort Mendel studied in peas), we do not in fact have precise predictive power.  Whether or when we can achieve that remains to be seen.

There is one difference, however:  statistical models cannot be applied until you have gone to the extent and expense of collecting adequate amounts of data.  In truth, you cannot know in advance how much, or often even what type of, data you would need.  That is why, in essence, our collect-everything mentality is so prevalent.  By contrast, prospective research computer simulation can do its work quickly and flexibly: it only has to sample electrons!
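As a rough illustration of how simulation can be used before any data are collected, here is a Python sketch that asks how often a study of a given size would detect an effect, if the effect were as large as we assume.  Everything in it is an assumption of mine for the example: two groups (say, carriers and non-carriers of a variant), a trait with standard deviation 1, an effect of 0.2 standard deviations, and a simple z-test with known variance.

```python
import math
import random

def estimated_power(n_per_group, effect, n_runs=500, seed=0):
    """Fraction of simulated studies in which the group difference is 'detected'.

    Each simulated study draws n_per_group trait values for carriers
    (mean = effect) and for non-carriers (mean = 0), both with SD 1,
    and applies a two-sided z-test at the conventional 1.96 cutoff.
    """
    rng = random.Random(seed)
    se = math.sqrt(2.0 / n_per_group)  # standard error of the mean difference
    detected = 0
    for _ in range(n_runs):
        carriers = [rng.gauss(effect, 1.0) for _ in range(n_per_group)]
        others = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        diff = sum(carriers) / n_per_group - sum(others) / n_per_group
        if abs(diff) / se > 1.96:
            detected += 1
    return detected / n_runs

# Try several hypothetical study sizes under the assumed effect of 0.2 SD.
for n in (25, 100, 400):
    print(f"n per group = {n}: estimated power = {estimated_power(n, effect=0.2):.2f}")
```

The answer is only as good as the assumed effect size and trait distribution, but changing those assumptions costs seconds rather than years of sample collection.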

Nobody who is actually doing computer simulation, while calling the work by another name such as statistical modeling, should denigrate computer simulation.  There is plenty of room for serious epistemological discussion about whether or how we are or aren't yet able to understand evolutionary and genomic process with sufficient accuracy.

2 comments:

Anonymous said...

Simulation should be a prior condition before wasting millions of dollars, but tell that to the NIH clowns!

In 1997, there was a big debate about whether to sequence the human genome using the clone-by-clone technique or whole-genome shotgun (WGS) sequencing. The NIH community was in favor of BAC assembly, whereas Gene Myers, a very talented computer science professor, proposed WGS. Myers was strongly criticized by the NIH human sequencing group, who said that his method would not work. So, he built a simulation tool to show that it would.

The real advantage of simulation is that instead of indulging in wishful thinking, it forces the 'experimentalist' to establish relevant rules, parameters and limits of validity to support his claim. Also, the time is not wasted. In the case of Myers, he joined Venter's company and his simulation tool became the core of the Celera assembler, which ended up assembling the human genome long before the public genome project.

Manoj

Ken Weiss said...

I guess a main point is to assert (allege?) that much of what we do in statistical analysis and modeling is a form of simulation. Under many conditions it does what is asked of it, but if the conditions don't hold, or we have no idea if they do, then we are asking for trouble.

'Forward' (theory or process based) simulation at least allows a particular idea about the causal process involved to be checked. If genomic and evolutionary phenomena are of astronomical complexity in the sense of the number of interacting or contributing factors involved and their variation, then simulation is a tool of choice, before investing heavily in a particular way of going about our business.