This is just a short post about the criteria one sets for a model to fulfil when building it. In our paper,
we decided to strictly separate criteria for model calibration (based on data used to develop the model) and validation (based on data not used in model creation, which assess the model after development has finished). Drawing
a parallel to the world of machine learning, one could say that calibration
criteria correspond to training data, while validation criteria correspond to
testing [1] data. In machine learning, the use of a test set to assess model quality is so firmly established that it is hard to imagine performance on the training set being reported. Returning from machine learning
to the literature on cardiac computer modelling, it becomes rather apparent
that the waters are much murkier here. Most modelling papers do not seem to specify whether the model was directly calibrated to produce the reported behaviours, or whether they constitute independent validation in our paper’s sense. It does not help that some papers use the term “validation” to mean true independent validation, while others use it more in the sense of “evaluation”, covering a mix of calibration, independent validation, and
results where it is impossible to tell. Is this important at all though, or is
it just a linguistic exercise without a deeper point? I would argue it is actually quite important; as a reader, I would very much prefer to know precisely what is calibration and what is not. This is linked to how much one can hope the model will work for phenomena it was not directly fitted to (which is arguably the main interesting application).
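To make the distinction concrete, here is a minimal sketch (my own illustration, not the paper’s protocol) of keeping calibration and validation criteria strictly separate in code; all criterion names and numeric ranges are illustrative assumptions, not reference values.

```python
def check_criteria(criteria, outputs):
    """Return, for each criterion, whether the model output falls within its range."""
    return {name: lo <= outputs[name] <= hi for name, (lo, hi) in criteria.items()}

# Criteria the model is actively fitted to (analogous to ML training data).
calibration_criteria = {
    "apd90_ms": (180, 440),              # action potential duration at 90% repolarisation (illustrative)
    "resting_potential_mV": (-90, -85),  # resting membrane potential (illustrative)
}

# Criteria only checked once development is frozen (analogous to ML test data).
validation_criteria = {
    "s1s2_restitution_slope": (0.5, 1.5),  # illustrative range
}

# Hypothetical outputs of a candidate model.
model_outputs = {"apd90_ms": 270, "resting_potential_mV": -87, "s1s2_restitution_slope": 1.1}

print(check_criteria(calibration_criteria, model_outputs))  # consulted while tuning the model
print(check_criteria(validation_criteria, model_outputs))   # reported only at the very end
```

Only the calibration set would be consulted while tuning the model; the validation set is checked once, after development has finished.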
I believe
that cardiac models are generally somewhat less prone to overfitting than
machine learning models. They are constrained by data on their structure, by
which currents and fluxes are included, and by the data underlying these –
consequently, it’s harder to produce an utterly nonsensical model that nevertheless shows good behaviours [2]. At the same time, harder does not mean hard. The problem with a purely calibration-driven approach is that by adjusting a model to fulfil a
calibration criterion, one may violate other desirable properties of the model
or even other calibration criteria. This can in turn be “patched” by further changes to the model, but those may spawn yet other problems. If one fulfils
all the calibration criteria, is it because the model is great, or because all
the problems were simply moved outside the “observed range” of calibration
criteria? If the latter, it’s what I’d call overfitting in the context of
cardiac models. And the validation set of criteria is precisely what guards
against this. If the calibration criteria were achieved at the cost of the
model being nonsensical, the validation criteria are quite likely to point that
out and that’s where their importance lies.
One extra point at the end: it’s good to mind differences in species, protocols, or conditions when designing the calibration and validation criteria. Also, how heterogeneous is a particular feature across papers [3]?
Traditionally, models are created so that a single model replicates multiple behaviours across different studies, and when a model fails to replicate many papers at once, this is considered problematic. However, can we be sure that any living cell from one experimental study would tick all the boxes of the other studies? Not really. An approach that
nicely appreciates and tackles this issue is Populations of Models, where a
baseline cardiac model is used to generate a population of its clones with
changes in conductances of ionic currents or other properties. The population is then typically calibrated against experimental data (e.g., on action potential duration [4]), so that grossly unrealistic models are discarded. Thus, some form of
heterogeneity is achieved and one model does not have to fulfil the potentially
unrealistic requirement of replicating every single possible behaviour.
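As a rough illustration of the Populations of Models idea described above, here is a minimal sketch (not the exact method of any particular paper): scale the ionic conductances of a baseline model, simulate each variant, and keep only those within an experimentally plausible range. The surrogate simulate_apd90 function, the conductance names, and the acceptance range are all illustrative assumptions.

```python
import random

random.seed(1)

BASELINE = {"g_Na": 1.0, "g_CaL": 1.0, "g_Kr": 1.0, "g_Ks": 1.0}  # illustrative conductance scalings

def simulate_apd90(conductances):
    """Placeholder for running the cardiac cell model; returns APD90 in ms.
    A toy surrogate stands in for the real simulation here."""
    return 280 * conductances["g_CaL"] / (0.5 * conductances["g_Kr"] + 0.5 * conductances["g_Ks"])

def make_population(n_models=1000, spread=0.5):
    """Generate candidate models by randomly scaling each conductance of the baseline."""
    return [{g: v * random.uniform(1 - spread, 1 + spread) for g, v in BASELINE.items()}
            for _ in range(n_models)]

# Calibration step: keep only variants whose APD90 falls within an (illustrative)
# experimentally plausible range, discarding grossly unrealistic models.
APD90_RANGE = (180, 440)  # ms, illustrative
candidates = make_population()
accepted = [m for m in candidates
            if APD90_RANGE[0] <= simulate_apd90(m) <= APD90_RANGE[1]]

print(f"{len(accepted)} of {len(candidates)} candidate models accepted")
```

The accepted variants then form a heterogeneous population, so no single model has to reproduce every reported behaviour on its own.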
[1] It is
slightly unfortunate that validation
criteria don’t really correspond to validation
data in machine learning, which may cause confusion. However, validation of a
computer model is an already established term and it makes sense to stick to
this tradition.
[2] The model
may also be overconstrained to the degree that it’s even hard to achieve the
good behaviours in the first place. Here, I have to mention the statement “you can fit these models to
anything” that I’ve heard from multiple experimental researchers I talked to
about computer models. It was particularly interesting in one stage of ToR-ORd
development, where I was seriously stuck, health was so-so, and simply nothing
worked to achieve the calibration criteria (Part 4 describes what was happening
and how it was solved). And during a meeting with an experimental physiologist about a different project, at this miserable stage of my work, I heard him say “you can fit these models to anything” in a fairly dismissive way.
I’m still proud to this day that I managed to contain the agony-filled
scream of mad desperate laughter inside my head.
[3] For example, the calcium transient amplitude in human was reported in different studies to be ca. 350 nM (Coppini et al. 2013) or over 800 nM (Piacentino III et al. 2012), so a model strongly fitted to one of these datasets would fail miserably on the other. Another example is the
slope of S1-S2 restitution, where some articles report steep curves, with other
articles reporting flat ones. It’s thus good to be aware of criteria on which the literature does not reach a quantitative consensus, and not to focus on these too heavily.
[4] That said, we don’t have enough data on the true biological
variability of ionic channel conductances etc. at the moment. Therefore, there is no guarantee
that the simulated heterogeneity corresponds to heterogeneity which might arise
from biological diversity and/or noise in experimental measurements.