Models and diagrams in infectious disease epidemiology
In 1897, Ronald Ross established that malaria is spread by the Anopheles mosquito, and subsequently received the second Nobel prize for medicine. He then defined a mathematical model describing the time-dependent dynamics of infection and recovery in human and mosquito populations. The major terms in the differential equations describing this human-mosquito-parasite ecology were (unless otherwise stated, these terms are numbers per unit time): the number of newly infected humans arising due to bites from infected mosquitoes, the number of new mosquito infections due to biting infected humans, and the rate of recovery of both humans and mosquitoes from infection [24]. The explicit expression of these differential equations as an a priori model - i.e. a model in which the sole causative agent of disease was assumed from the outset to be the protozoan parasite, which was acquired by mosquito biting - led to the startling conclusion that there existed a critical value for the number of mosquitoes per person that needed to be present in order to allow the parasite to persist locally. Ross estimated this critical number of mosquitoes per person to be 40 - implying that Anopheles did not need to be eradicated for the disease to die out [1].
Ross reached this conclusion by modelling the whole system: the human population within its environment. The model was built on evidence at the individual level, but with some of the (implied) interventions at group or environmental level. His method was not expressed as a diagram, but it represents a sequential causal relationship, the key outcome being whether the number of infected people in one period is higher or lower than that in the previous one. The method was feasible because he focused on a single cause, malaria transmission by mosquito, which had already been established, and omitted other relevant factors, e.g. that nutritional status might affect susceptibility.
This pioneering work initiated methodological developments in infectious disease epidemiology, again modelling a system consisting of a human population within its environment [21]. These include compartmental models such as the SIR (Susceptible-Infected-Recovered) model (Figure 3), where the population is sub-divided into states corresponding to observed (or assumed) steps in the disease process. The transitions from one state to the next, represented by differential equations, reflect the causal effects - although causality is not made explicit - with transition rates being determined by quantities such as the contact rate, the infection transmission probability and the recovery rate.
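As a minimal sketch of how such a compartmental model can be written down, the following Python code integrates the standard textbook SIR equations with a simple forward Euler scheme. The function name `simulate_sir`, the parameter names `beta` (transmission rate) and `gamma` (recovery rate), and all numerical values are illustrative assumptions and are not taken from Figure 3 or from the cited work.

```python
def simulate_sir(beta, gamma, s_init, i_init, r_init, days, dt=0.1):
    """Integrate the textbook SIR equations with a forward Euler scheme.

    dS/dt = -beta * S * I / N
    dI/dt =  beta * S * I / N - gamma * I
    dR/dt =  gamma * I
    """
    n = s_init + i_init + r_init  # total population, constant over time
    s, i, r = float(s_init), float(i_init), float(r_init)
    trajectory = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for step in range(1, steps + 1):
        new_infections = beta * s * i / n * dt  # flow S -> I in this step
        new_recoveries = gamma * i * dt         # flow I -> R in this step
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        trajectory.append((step * dt, s, i, r))
    return trajectory


if __name__ == "__main__":
    # Illustrative (assumed) parameter values, chosen only for demonstration.
    run = simulate_sir(beta=0.3, gamma=0.1, s_init=990, i_init=10, r_init=0, days=160)
    peak_time, _, peak_infected, _ = max(run, key=lambda row: row[2])
    print(f"peak infected: {peak_infected:.0f} at day {peak_time:.0f}")
    print(f"susceptible at end: {run[-1][1]:.0f}")
```

Each transition is an explicit flow between two compartments, so the total population is conserved at every step. For serious use, a tested ODE solver would be preferable to the hand-written Euler loop shown here.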
Models of this type can be more complex, for example if vector transmission is involved, but the principle remains the same. The equivalent of Ross's critical mosquito density is the basic reproduction number R0: if R0 is greater than unity, this indicates that the number of new cases in one period is higher than that in the previous one, and therefore that the outbreak can propagate itself; if it is less than unity then the epidemic will fade out. Most such models are deterministic in that they do not consider stochastic causation, but probabilistic elements are increasingly being incorporated [25].
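The threshold behaviour can be illustrated with a short sketch. It assumes the simple SIR model in a fully susceptible population, for which the standard expression is R0 = beta / gamma (transmission rate multiplied by mean infectious period). This formula, the function names and the parameter values are assumptions made for illustration; models with vector transmission would need a different expression.

```python
def basic_reproduction_number(beta, gamma):
    """R0 for the simple SIR model: transmission rate times mean infectious period."""
    return beta / gamma


def initial_growth_rate(beta, gamma, susceptible_fraction=1.0):
    """Per-capita growth rate of the infected compartment early in an outbreak.

    Follows from dI/dt = (beta * S/N - gamma) * I; a positive value means
    that the number of infected people is increasing.
    """
    return beta * susceptible_fraction - gamma


# Two assumed scenarios, one below and one above the threshold.
for beta in (0.05, 0.3):
    gamma = 0.1
    r0 = basic_reproduction_number(beta, gamma)
    growth = initial_growth_rate(beta, gamma)
    verdict = "can propagate" if growth > 0 else "fades out"
    print(f"R0 = {r0:.1f}: outbreak {verdict}")
```

In this simple setting the sign of the initial growth rate and the condition R0 > 1 are equivalent, which is why R0 can play the same role as Ross's critical mosquito density.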
Compartmental models rely on the existence of a single characteristic that can be used to partition the whole population. In the SIR case, the partitioning characteristic is the status of each person with respect to susceptibility and infectiousness. The model is thus mono-causal, neglecting other factors such as nutritional status and the existence of other infections that may influence the recovery rate; models can be modified to take these into account, e.g. stratifying the population into high and low risk groups [26].
Single-chain models outside infectious disease epidemiology
This approach is no longer used only for modelling infectious diseases. For example, it has been applied to cervical cancer, involving carcinogenic HPV transmission dynamics and the natural history of the disease. That study compared scenarios of vaccination against HPV-16, either of 12-year-old girls alone or of both sexes, with a no-vaccination scenario [27]. Thus, the distinction between infectious and non-infectious disease is somewhat artificial, given that the same modelling methodology can be used in situations where the infectious agent is but one factor contributing to the development of the disease.
More generally, compartmental models can be viewed as a sub-type of diagrammatic models: flow diagrams in which the population is subdivided into ordered states. They are also of interest in chronic disease epidemiology, where they can be used to represent the evolution of health status among known steps of disease progression. These stages can either be observed or hidden (e.g. if the prevalence of the asymptomatic condition cannot be measured) [28, 29]. In addition to quantifying the impact of risk factors/exposures on the disease risk, these approaches also give an insight into the dynamics of disease progression at the individual level and, at the population level, into the dynamics of the epidemic.
Compartmental models aim at reconstructing the individual or population natural history of the disease progression amongst disease states, based on - potentially longitudinal - exposure or complex mixtures of exposures. Hence, by nature, they incorporate a temporal component in their causal inference, and in accordance with the recently formalised exposome concept [30, 31], they allow the disease risk to be driven not only by exposure level itself but also by its evolution in time and by potential temporal patterns in the exposure history.
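The role of exposure timing can be sketched with a small discrete-time multi-state model. Everything in the code below is a hypothetical illustration: the three states (healthy, asymptomatic, symptomatic), the function names, and the way the onset probability depends linearly on exposure are assumptions and do not come from the cited studies.

```python
def transition_matrix(exposure, base_onset=0.02, exposure_effect=0.03, progression=0.10):
    """Per-period transition probabilities between three ordered states.

    Rows and columns are ordered healthy, asymptomatic, symptomatic.
    The onset probability rises linearly with the current exposure level
    (an assumed, purely illustrative dose-response form).
    """
    onset = min(1.0, base_onset + exposure_effect * exposure)
    return [
        [1.0 - onset, onset, 0.0],
        [0.0, 1.0 - progression, progression],
        [0.0, 0.0, 1.0],  # symptomatic is treated as absorbing
    ]


def project(initial, exposure_history):
    """Propagate the state distribution through a sequence of exposure levels."""
    distribution = list(initial)
    history = [distribution]
    for exposure in exposure_history:
        matrix = transition_matrix(exposure)
        distribution = [
            sum(distribution[i] * matrix[i][j] for i in range(3))
            for j in range(3)
        ]
        history.append(distribution)
    return history


# Same cumulative exposure (5 exposed periods out of 20), different timing.
early = project([1.0, 0.0, 0.0], [1] * 5 + [0] * 15)
late = project([1.0, 0.0, 0.0], [0] * 15 + [1] * 5)
print(f"symptomatic after early exposure: {early[-1][2]:.3f}")
print(f"symptomatic after late exposure:  {late[-1][2]:.3f}")
```

Under these assumptions the two histories leave the same fraction of the population healthy, but early exposure yields more symptomatic disease by the end of follow-up because those affected have had longer to progress. This is one way in which the temporal pattern of exposure, and not only its level, can drive the outcome.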
A similar use of diagrams has long been standard practice in another branch of biology: biochemical pathways. These are flow diagrams in which, at each stage, the molecule is modified by an enzyme belonging to that step in the pathway. In practice they are often drawn as cartoons that also include a spatial element, indicating the location of the different chemical processes within the cell.
An example is the metabolism of ethanol (alcohol) via acetaldehyde to acetic acid, which is then metabolised further, yielding carbon dioxide, water and energy (Figure 4). A fundamental concept in biochemical pathways is the rate-limiting step: if conversion of ethanol to acetaldehyde proceeds faster than that of acetaldehyde to acetic acid, then acetaldehyde accumulates, whereas in the reverse situation it does not. This depends on the relative speed of the two enzymes, alcohol dehydrogenase IB (class I), beta polypeptide (ADH1B) and aldehyde dehydrogenase 2 (ALDH2). It so happens that the second of these can be present in different forms, resulting in either faster or slower activity than ADH1B, and that this varies with ethnic group. Since acetaldehyde gives rise to unpleasant symptoms (as well as toxicity), this polymorphism explains why some ethnic groups tend to indulge in drinking large quantities of ethanol, whereas others do not.
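The rate-limiting step can be illustrated with a two-step pathway sketch. For simplicity the code assumes first-order kinetics for both enzymes, which is a deliberate simplification and not a realistic description of ethanol metabolism. The rate constants `k_adh` and `k_aldh` and the function name are hypothetical, and concentrations are in arbitrary units.

```python
def simulate_pathway(k_adh, k_aldh, ethanol_init=1.0, hours=10.0, dt=0.001):
    """Two-step pathway: ethanol -> acetaldehyde -> acetic acid.

    Both steps are modelled as first-order reactions (an assumption made
    for illustration only) and integrated with a forward Euler scheme.
    Returns the peak acetaldehyde level and the final concentrations.
    """
    ethanol, acetaldehyde, acetate = ethanol_init, 0.0, 0.0
    peak = 0.0
    for _ in range(int(round(hours / dt))):
        step1 = k_adh * ethanol * dt        # ethanol converted in this step
        step2 = k_aldh * acetaldehyde * dt  # acetaldehyde converted in this step
        ethanol -= step1
        acetaldehyde += step1 - step2
        acetate += step2
        peak = max(peak, acetaldehyde)
    return peak, (ethanol, acetaldehyde, acetate)


# Assumed rate constants: a slow and a fast second step relative to the first.
slow_peak, _ = simulate_pathway(k_adh=1.0, k_aldh=0.2)
fast_peak, _ = simulate_pathway(k_adh=1.0, k_aldh=5.0)
print(f"peak acetaldehyde with slow second step: {slow_peak:.2f}")
print(f"peak acetaldehyde with fast second step: {fast_peak:.2f}")
```

When the second step is slower than the first, the intermediate builds up to a much higher peak. When it is faster, acetaldehyde is removed almost as quickly as it is produced.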
The situation here is directly analogous to the SIR model, where the tendency of an outbreak to increase or decrease depends on the balance between inflow and outflow. In that situation this balance depends on the force of infection as measured by R0: if R0 is greater than unity, the outflow is the rate-limiting step and infected individuals will tend to accumulate in the population, like acetaldehyde, and vice versa for values lower than unity. Although these two diagrams have been constructed in radically different contexts, their structure as well as the type of results they provide are comparable, thus highlighting the potential general use of these models. While their formulation is general, the way transitions from one compartment to another are defined is highly specific to the modelled phenomenon. This type of approach relies on the modelling of the whole system rather than focusing on a single link within the system of interest.
A somewhat similar approach can be used in non-infectious disease epidemiology, for example in environmental and occupational epidemiology, which has increasingly moved towards a study of the whole chain from the existence of a pollutant in the environment, through human exposure, to health outcome (Figure 5) [32]. Here we are concerned with a diagram that is constructed from concepts such as "emissions", "concentration" and "exposures" that correspond to substantive knowledge about how the world works, and which are organised in a form suitable for statistical analysis. Building this type of model requires multidisciplinary collaborative work, e.g. involving hygienists and epidemiologists. Typically the upstream causal processes involve a particular location, so that exposure is ecological, i.e. at group level, whereas for epidemiological analysis the individual level is best, to avoid ecological bias that could result when inference is made from one level to another. This combination of levels is routinely employed in infectious disease epidemiology modelling, and this also integrates disparate types of information, e.g. biological, psychosocial and socioeconomic, as well as medical interventions (e.g. immunisation). More generally, the perspective of modelling the whole system fits with the perception that more attention should be paid to "causes of causes" [33], not only to proximal causes.
Multiple causation: diagrams with multiple and branching chains
The models considered so far have been concerned with only one causal pathway. However, epidemiology of non-infectious diseases usually deals with a situation of multiple causation, in which all (or most) links are analysed as stochastic - there are no necessary or sufficient causes, and Koch's postulates do not apply. Under such conditions, diagrammatic models are no longer confined to a single chain.
It is simple to draw a diagram that contains branches, but this introduces new issues that go beyond the scope of the present paper. In principle, causal diagrams and DAGs can readily cope with multiple causation, but further methodological work is needed on effect modification [34–36].
In social epidemiology, a classic question is, how much of the observed social gradient is mediated by known risk factors? It is possible to investigate this question on the simple assumption that no effect modification or other complicating factor is present, in which case a diagram is probably not necessary. However, such an assumption may not be justified. For example, an econometric analysis of the Whitehall II Study has shown that if allowance is made for selection effects, the findings change. Whilst childhood socioeconomic circumstances are still found to impact on adult health, it emerges that the association of current civil service grade with health status reflects the tendency for healthier people to be promoted. And employment grade is also predicted by childhood socioeconomic position, which thus influences adult health both directly and via job success - for example, promotion is more likely for taller people, and height is an indicator of childhood wellbeing [37].
Moreover, a diagram with multiple and branching chains can readily be expanded to encompass a larger system, so enabling integrated analysis of the inter-related factors. In this case the upstream causes can include the wider determinants of ill-health as well as more concrete mediating factors - the "web of causation" for a particular health issue, a concept that has a long history [38, 39] (see Figure 6 for an example).
By making the pathways explicit in a web of causation, a diagram deepens understanding and provides a framework for statistical analysis. In addition, it serves as a valuable practical guide: it not only provides multiple entry points for intervention, but also has the capacity to demonstrate and quantify the inter-relationship of different factors - including unpredicted and possibly undesirable side-effects. Strangely, although influence diagrams have been used informally to clarify hypotheses on the particular pathways that may be operating, it is rare to find causal diagrams being used as the basis for the statistical analysis of a system [40], as has been proposed in the context of setting out the evidence base for Health Impact Assessment [40] or Strategic Health Assessment [41, 42].
However, work along these lines is beginning to appear. Sacerdote and colleagues have used a causal diagram to organise the multitude of factors that are thought to influence the incidence of type II diabetes (Figure 7) [43]. And Rehfuess and colleagues have taken a similar approach to tease out the relative contributions of environmental and social factors that influence childhood death from acute lower respiratory infections in sub-Saharan Africa [44].
Modelling multiple and branching chains is more complicated than in the example of a whole-chain approach to exposure assessment as in Figure 5, because it involves the assumption that the chains are independent; in addition, intervention may involve multiple actions affecting more than one pathway, e.g. combining the use of "carrot" and "stick". Such diagrams are best organised by economic or policy sector; but the criterion for including variables and pathways in the diagram is that they are relevant to health - the content of the diagram is "driven by the bottom line" [40]. An additional layer can also be included below that for health outcomes, if so desired, on the economic costs of each of the adverse health outcomes. The analysis of a diagram of this type, and indeed confirmation of its structure, requires bringing together information from a number of different sources; and some aspects (such as "community severance" in Figure 6) may not be readily quantifiable. Multi-disciplinary research projects to integrate the relevant areas are currently underway [45].