Sampling in health geography: reconciling geographical objectives and probabilistic methods. An example of a health survey in Vientiane (Lao PDR)
© Vallée et al; licensee BioMed Central Ltd. 2007
Received: 10 October 2006
Accepted: 01 June 2007
Published: 01 June 2007
Geographical objectives and probabilistic methods are difficult to reconcile in a single health survey. Probabilistic methods focus on individuals to provide estimates of a variable's prevalence with a certain precision, while geographical approaches emphasise the selection of specific areas to study interactions between spatial characteristics and health outcomes. A sample selected from a small number of specific areas creates statistical challenges: the observations are not independent at the local level, and this results in poor statistical validity at the global level. Therefore, it is difficult to construct a sample that is appropriate for both geographical and probabilistic methods.
We used a two-stage selection procedure with a first non-random stage of selection of clusters. Instead of randomly selecting clusters, we deliberately chose a group of clusters, which as a whole would contain all the variation in health measures in the population. As there was no health information available before the survey, we selected a priori determinants that can influence the spatial homogeneity of the health characteristics. This method yields a distribution of variables in the sample that closely resembles that in the overall population, something that cannot be guaranteed with randomly-selected clusters, especially if the number of selected clusters is small. In this way, we were able to survey specific areas while minimising design effects and maximising statistical precision.
We applied this strategy in a health survey carried out in Vientiane, Lao People's Democratic Republic. We selected well-known health determinants with unequal spatial distribution within the city: nationality and literacy. We deliberately selected a combination of clusters whose distribution of nationality and literacy is similar to the distribution in the general population.
This paper describes the conceptual reasoning behind the construction of the survey sample and shows that it can be advantageous to choose clusters using reasoned hypotheses, based on both probability and geographical approaches, in contrast to a conventional, random cluster selection strategy.
Geography is an independent scientific approach, while statistics simply constitutes a body of methods and tools that can be employed by scientists from various fields of research. Probabilistic statistics can be used to estimate a variable's prevalence in a given population with a certain precision. However, analysing survey data in this way can become complicated when the sample is selected using a geographical approach. The geographical approach often favours the study of specific areas, deliberately chosen to enable analysis of processes and interactions that are key to the understanding of particular health behaviours and spatial disparities. Cluster samples chosen from a small number of specific areas raise an important statistical challenge, in that the non-independence of observations within clusters has an impact on the statistical validity of the sample at the global level. Waldo Tobler stated in his first law of geography that "everything is related to everything else, but near things are more related than distant things". Therefore, statistical methods for analysing spatial data have to take into consideration spatial arrangements, and the resulting correlations between observations, in order to provide accurate and meaningful conclusions. This article suggests a method for reconciling probabilistic statistical methods and geographical objectives in a single health survey, applied in Vientiane, the capital of Lao People's Democratic Republic (Lao PDR). In this method, areas are selected so that respondents are as representative of the general population as possible, whilst still enabling the study of spatial interactions in health at the local level.
A need for meaningful data for public health and for health geography
The geographical approach calls for selecting specific places where health information can be acquired about large segments of the population, in order to study interactions between people living in the same place. Selecting specific areas allows precise descriptions of the environment (e.g., the ecological landscape, medical equipment availability, markets, relevant policies, and so on) and of its relationship with health, which would be very difficult to assess for a whole city. It is therefore useful to select relevant territories where it is possible to study spatial disparities in health, to explore interactions between people and places, and to gain a better understanding of spatial organisation in a given society. Additionally, to analyse spatial disparities in health, geography researchers often distinguish between the effects of "context" (e.g., area or group properties) and "composition" (characteristics of individuals living in different areas) in contextual and multilevel analyses; these analyses therefore require datasets in which individuals are nested within areas or neighbourhoods. In conclusion, to conduct geographical analyses (spatial interaction, spatial correlation, contextual and multilevel analysis), researchers need to carry out a health survey in specific places where people and their neighbours can be interviewed.
At the same time, it would be useful if the study also produced an estimate of prevalence for each important health variable, and identified health-seeking behaviours and individual risk factors. Such findings help inform public health policy decisions. Indeed, as the Asian Development Bank noted regarding Lao PDR, "there is an urgent need for a nationwide survey of household sanitation in urban areas. This can be a sample survey as long as the sample size is representative". Therefore, the results of a survey, even one in which data are collected in specific areas, must also be representative of the whole area.
Stratification can improve the representativeness of a sample by reducing sampling errors, and can make variance estimates more precise. If surveyed individuals were chosen in every stratum by simple random sampling, we would obtain a sample that is representative of the stratum population; however, geographical approaches that examine how individuals interact with their wider environment would then not be possible. In a health survey recently carried out in Lao PDR, we needed to define and select relevant territories where it was possible to study spatial disparities in health, to explore interactions between people and places, and to gain a better understanding of spatial organisation in a given society.
Conventional random cluster selection
The most commonly used spatial sampling method is cluster sampling: the studied area is divided into units, and a selection of these units is then chosen at random. Within each unit, individuals are ideally chosen by simple random sampling. Cluster sampling economises on time, budgets and energy; it is often done primarily for these practical reasons, as in the Expanded Programme on Immunisation (EPI), because it is less expensive than simple random sampling when the population is dispersed. Cluster designs can also be useful in geographical approaches, because they allow for the study of specific places and territories. However, such designs are often less precise than simple random sampling, due to the homogeneity of individuals within clusters. There may be good reasons why individuals' behaviour within a small area is similar: "Why should we expect independence in spatial observations (...)? All our efforts to understand spatial patterns, structure and process have indicated the lack of independence (...) of things in time and space". With cluster sampling, every member of the population has an equal chance of selection, but individuals with similar characteristics are more likely to be surveyed together. Cluster sampling necessarily has a design effect, making it less statistically robust than simple random sampling. This effect varies between areas and even within the same area, and it can vary depending on the question. The size of the design effect can be calculated after the study as the variance obtained from the cluster sample divided by the variance that would have been obtained with a simple random sample of equal size.
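The ratio described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' calculation: it assumes equal-sized clusters, a binary (0/1) outcome, and no finite population correction, and the function name `design_effect` is hypothetical.

```python
from statistics import mean, variance


def design_effect(cluster_outcomes):
    """Estimate the design effect of a prevalence from equal-sized clusters.

    cluster_outcomes: list of lists of 0/1 outcomes, one inner list per cluster.
    Assumptions (not from the source): equal cluster sizes, no finite
    population correction.
    """
    sizes = {len(c) for c in cluster_outcomes}
    if len(sizes) != 1:
        raise ValueError("this sketch assumes equal-sized clusters")
    k = len(cluster_outcomes)          # number of clusters
    n = k * sizes.pop()                # total number of respondents
    cluster_means = [mean(c) for c in cluster_outcomes]
    p = mean(cluster_means)            # overall prevalence (equal sizes)
    if p in (0, 1):
        raise ValueError("prevalence of 0 or 1: the ratio is undefined")
    # Variance of the estimate under the cluster design:
    # between-cluster variance of the cluster means, divided by k.
    var_cluster = variance(cluster_means) / k
    # Variance a simple random sample of the same size would give.
    var_srs = p * (1 - p) / n
    return var_cluster / var_srs


# Four internally homogeneous clusters: a large design effect.
homogeneous = [[1] * 10, [0] * 10, [1] * 10, [0] * 10]
# Four clusters that each mirror the whole population: no loss of precision.
microcosms = [[1] * 5 + [0] * 5 for _ in range(4)]
print(design_effect(homogeneous), design_effect(microcosms))
```

The two toy inputs show the extremes discussed in the text: clusters whose members all resemble each other inflate the variance, whereas clusters that each reproduce the population's variability do not.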
Number of selected clusters
The number of clusters has to satisfy two competing constraints. It must be sufficiently large that statistical precision at the population level is adequate and spatial comparisons remain possible, and it must be sufficiently small for the survey to remain logistically feasible and for geographical analyses to be performed properly at the local scale.
The argument against random selection of clusters
Random cluster selection does not necessarily mean that the sample of clusters is representative of the whole population, especially when the number of clusters sampled is small; in fact, the design effect is highly dependent on this number. The variance obtained by cluster sampling can be reduced by selecting areas that are as internally heterogeneous (i.e. have a full range of variability within them) as possible and as externally homogeneous (i.e. are as similar to one another) as possible. The ideal situation, in terms of statistical precision, would be that each area was a microcosm of the entire population and, therefore, was perfectly representative, in terms of variability, of the overall population. However, this solution is in opposition to geographical objectives, in that the geographic approach (as in the spatial statistics approach) views population homogeneity as interesting in itself. Herein lies the difficulty: the design effect could be reduced by choosing similar areas with high internal heterogeneity, but, apart from the fact that it is impossible in practice to identify such areas, this would contradict the objectives of health geography.
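The dependence on the number of clusters can be made explicit with a standard textbook approximation, which is not given in the source. It assumes k clusters of equal size m = n/k and an intracluster correlation coefficient ρ for the variable of interest; the symbols n, k, m and ρ are introduced here for illustration only.

```latex
\[
\mathrm{deff}
  = \frac{\operatorname{Var}_{\text{cluster}}(\hat{p})}{\operatorname{Var}_{\text{SRS}}(\hat{p})}
  \approx 1 + (m - 1)\,\rho
  = 1 + \left(\frac{n}{k} - 1\right)\rho
\]
```

For a fixed total sample size n, fewer clusters mean more respondents per cluster and therefore a larger design effect, unless ρ is close to zero, which is the "microcosm" situation described above.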
To reconcile probabilistic methods and geographical objectives, we propose purposeful, rather than random, selection of a group of clusters that together could contain all the variability of the overall population.
How do we select clusters to best reproduce the population distribution of variables?
To improve the reliability of our sample, which consists of a small number of clusters, we have to choose the best combination of n clusters, such that respondents are representative of population heterogeneity (with regard to, in this case, health variables). Since no health information is available before the survey, a priori variables that can influence the spatial homogeneity of health characteristics in the studied area have to be determined. The notion of resemblance is subjective, but it aims to ensure that any two given populations resemble each other in terms of the phenomenon being researched. The choice of n clusters is thus based on reasoned hypotheses and on the specific research objectives. Cluster selection is therefore conceptually derived from a set of definite hypotheses, which will necessarily differ from the hypotheses another group of researchers might have. Among all available variables, we select well-known health determinants (e.g. age, nationality, ethnic origin, education, occupation, etc.) and keep those variables (v1; v2...vn) that have an unequal spatial distribution within the studied area. After the survey has been carried out, it is possible to check the results for the a posteriori relevance of the chosen variables as health determinants.
The number of candidate combinations is

$$C_b = \frac{N!}{n!\,(N-n)!}$$

where $C_b$ is the number of possible combinations of n clusters without repetition, N is the total number of clusters existing in the studied area, and n is the number of clusters we want to select.
This mean squared difference is compared with the variance of v1 calculated at the studied area level. Cluster combinations are retained where the mean squared difference corresponds to the variance calculated for the studied area. The same steps are followed with every selected variable (v2...vn). Among the large number of possible combinations of n clusters, only a few are obtained whose variability in the selected variables is very similar to the variability calculated at the studied area level. This procedure enables the selection of clusters whose composition in terms of a priori health determinants is similar to that of the overall studied area. With this procedure, we hope to reduce design effects and gain statistical precision while surveying only a few clusters.
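The screening step can be sketched as follows. Because the exact definition of the mean squared difference is not recoverable from the text, this sketch adopts one plausible reading as an explicit assumption: for each combination, the squared differences between each selected cluster's value of v1 and the area-level mean are averaged, and the combination is kept when that average is within a tolerance of the area-level variance. The function name, the unweighted area mean and the `tolerance` parameter are all hypothetical.

```python
from itertools import combinations
from math import comb
from statistics import mean, pvariance


def screen_combinations(values, n, tolerance):
    """Keep combinations of n clusters whose spread in v1 matches the area's.

    values: dict mapping cluster id -> value of v1 in that cluster
            (for example, the proportion of literate residents).
    Assumed reading (not confirmed by the source): the "mean squared
    difference" is taken around the unweighted area-level mean.
    """
    all_values = list(values.values())
    area_mean = mean(all_values)
    area_var = pvariance(all_values, area_mean)
    kept = []
    for combo in combinations(sorted(values), n):
        msd = mean([(values[c] - area_mean) ** 2 for c in combo])
        if abs(msd - area_var) <= tolerance:
            kept.append(combo)
    # comb(N, n) is the number of combinations without repetition.
    return comb(len(values), n), kept


# Toy data: four clusters, choose two.
v1 = {"A": 0.1, "B": 0.2, "C": 0.3, "D": 0.4}
total, kept = screen_combinations(v1, n=2, tolerance=1e-9)
print(total, kept)
```

In practice the same filter would be applied in turn to each selected variable (v2...vn), keeping only the combinations that pass for all of them. Exhaustive enumeration grows very quickly with N and n, which is one reason for restricting the candidate set before screening.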
Applications in Vientiane health survey
We applied this form of clustered sampling to a health survey conducted in Vientiane, Lao PDR.
Health survey in Vientiane
The main objective of the research programme, entitled "Urbanization, Governance and Spatial Disparities of Health in Vientiane", is to describe and analyse the organisation of urban areas (including geographical, social, cultural, political, environmental, and behavioural variables) as sources of intra-urban health inequalities. The urbanised area of Vientiane is spread over 148 villages ('ban' in Lao) comprising approximately 277,000 inhabitants in 2005. In Lao PDR, the "village" (ban) is the smallest administrative, religious and political unit in both rural and urban areas. The spatial division into villages reflects political, administrative and social reality and, as census data are available at the village level, we decided to keep the village as the reference unit for the survey: a cluster thus corresponds to a village. In 2005, an average of approximately 1870 people lived in a Vientiane City village (interquartile range: 1080 – 2311). It is likely that the increased urbanisation of the capital has led to wide disparities in health, but as little health information exists, there is no way to know what kind of health problems the population encounters and how people seek healthcare. To provide health data and to analyse health spatial disparities in Vientiane, the French Research Institute for Development (Institut de Recherche pour le Développement – IRD) carried out a health survey within the city in collaboration with the Lao Ministry of Health, the National Institute of Public Health, the Faculty of Medical Sciences, the Francophone Institute of Tropical Medicine (Institut Francophone de Médecine Tropicale – IFMT) and the Microbiology Laboratory in Mahosot Hospital. Ethical approval for this survey was obtained from the Lao National Ethics Committee for Health Research in Lao PDR.
Two age groups were selected: children (aged from six months to less than six years) and adults (aged 35 years and above). Data were collected in February and March 2006 through household and individual questionnaires. Household questionnaires collected data on house location and description, living conditions, incomes, community bonds and demographic data on every member of the household. Individual questionnaires collected demographic data and socio-economic information, urban lifestyle variables, behavioural risk factors, health status data, and healthcare-seeking behaviours. Health status was measured through medical examination and investigations (weight, height, temperature, blood pressure, dental examination and blood samples from a fingerprick to study diabetes, anaemia and communicable diseases). Healthcare-seeking behaviour was examined through questions about type and gravity of health problem, local health structure, price, quality, and satisfaction with health care services.
With these data, we aimed to: (i) compare levels of morbidity in different urban areas; (ii) identify appropriate urban scales for recognising health disparities; (iii) detect hotspots of morbidity using exploratory spatial data analysis; and (iv) measure the impacts of both social and urban contexts using multilevel analyses.
Use of urban stratification
Number of villages to survey
Budgetary limitations constrained the overall sample size to 2000 adults and 2000 children for the whole city (or 666 adults and 666 children in each of the three urban strata). This allowed for a 95% CI of +/- 2.3% around a prevalence of 10% at the stratum scale, and +/- 1.3% around a prevalence of 10% at the city scale. This calculation does not take the design effect into account. The value of the design effect (which differs between variables within the same survey) is difficult to estimate during survey preparation, and is very dependent on the selection of clusters. We planned to survey the same number of individuals in every village so that comparisons could be made with the same precision. Given the size of the village populations, we decided to survey 27 villages (nine per urban stratum), with 74 adults and 74 children to be sampled in each village. This corresponds to a mean sampling rate of 1/5.6 for adults and 1/2.4 for children, based on the list of households created in December 2005 in every one of the 27 selected villages.
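The precision figures quoted above can be checked with the usual normal approximation for a proportion. The sketch below assumes a z-value of 1.96 for a 95% interval and, like the text, ignores the design effect and any finite population correction; the function name `ci_half_width` is hypothetical.

```python
from math import sqrt

Z_95 = 1.96  # assumed normal quantile for a 95% confidence interval


def ci_half_width(prevalence, n, z=Z_95):
    """Half-width of the confidence interval for a proportion.

    Uses the normal approximation under simple random sampling,
    so the design effect is deliberately not included.
    """
    return z * sqrt(prevalence * (1 - prevalence) / n)


stratum = ci_half_width(0.10, 666)    # one urban stratum
city = ci_half_width(0.10, 2000)      # whole city
per_group_total = 27 * 74             # respondents per age group actually planned

print(f"stratum: +/- {stratum:.1%}")
print(f"city:    +/- {city:.1%}")
print(f"27 villages x 74 respondents = {per_group_total}")
```

Multiplying the variance by an assumed design effect (that is, multiplying the half-width by its square root) would show how much wider these intervals become once clustering is taken into account.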
Selection of villages
Table: Number of combinations in every urban stratum. Columns (as far as recoverable): First urbanised belt; Second urbanised belt. Rows: number of villages; number of possible groups of 3 adjacent villages; number of possible combinations of 3 groups of 3 adjacent villages; number of possible combinations of 3 groups of 3 adjacent villages which respect the proportions of urbanisation types.
Table: Characteristics (in 1995 census) of selected combinations in comparison with the corresponding urban stratum. Columns (as far as recoverable) pair the 9 selected villages with their stratum: First urbanised belt; Second urbanised belt.
Random selection of respondents in every selected village
To make the sample in every selected village as reliable as possible (not only statistically but also spatially), respondents needed to be selected at random. With the help of village authorities, we created a sampling frame in each village and then selected the households to survey at random. Within each selected household, a maximum of one adult and one child were allowed to participate in the survey. The probability of household selection was proportional to the number of eligible people in the household.
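Selection with probability proportional to the number of eligible people can be implemented in several ways, and the text does not say which one was used. The sketch below shows one common option, systematic sampling with probability proportional to size, as an illustration only; the function name, the household identifiers and the toy frame are hypothetical.

```python
import random


def pps_systematic(eligible_counts, k, seed=None):
    """Select k households with probability proportional to eligible people.

    eligible_counts: dict mapping household id -> number of eligible people.
    Illustrative technique (systematic PPS); not confirmed by the source.
    """
    # Households without eligible people cannot be selected.
    frame = [(h, w) for h, w in eligible_counts.items() if w > 0]
    total = sum(w for _, w in frame)
    interval = total / k
    if any(w > interval for _, w in frame):
        raise ValueError("a household exceeds the sampling interval")
    rng = random.Random(seed)
    rng.shuffle(frame)                      # random order of the frame
    start = rng.random() * interval         # random start in [0, interval)
    points = [start + i * interval for i in range(k)]
    chosen, cumulative, idx = [], 0.0, 0
    for household, weight in frame:
        cumulative += weight
        # A household is selected when a sampling point falls in its range.
        while idx < k and points[idx] < cumulative:
            chosen.append(household)
            idx += 1
    return chosen


# Toy frame: 30 households with 0, 1 or 2 eligible people each.
counts = {f"h{i}": i % 3 for i in range(30)}
sample = pps_systematic(counts, k=5, seed=1)
print(sample)
```

Because no household is larger than the sampling interval, each household can be picked at most once, and its chance of selection is its number of eligible people divided by the interval.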
In developing this sampling frame, the main difficulty we encountered was the lack of data available on the population. We needed accurate information about some health determinants and about their spatial distribution. For the Vientiane survey, only urban data from 1999 with aerial photographs, demographic data from the 1995 census, and the number of inhabitants per village from the 2005 census were available. Precise demographic data from the 2005 census (such as literacy, nationality, and access to electricity, water and latrines) were not yet available during the preparation of this survey.
For the health survey in Vientiane, we adopted a two-stage selection procedure with a first non-random stage of selection of clusters: we chose clusters that would be representative of the urban and health variability of the global population. Conventional random cluster sampling is certainly statistically appropriate when the number of clusters is large. However, when only a small number of clusters are sampled in order to meet geographic objectives and/or logistical needs, it becomes statistically more appropriate to choose clusters purposefully than to select them at random. Where this method is used, the choice of clusters should be based on reasoned hypotheses and on the specific research objectives. A modified clustered design with a first non-random stage of cluster selection can provide appropriate information both to study health spatial interactions and to estimate other health variables, such as prevalence, at the city level.
This survey was supported by the Institut de Recherche pour le Développement (IRD) and the French Embassy in Laos. We thank those who helped us carry out this survey: the Lao Ministry of Health, the National Institute of Public Health, the Faculty of Medical Sciences, the Francophone Institute of Tropical Medicine and the Wellcome Trust project, Microbiology Laboratory in Mahosot Hospital. We thank Daniel Benoît (IRD Representative in Lao PDR) and Jean-Paul Gonzalez (IRD-UR 178 Director) for their scientific and logistical support. We greatly appreciate the assistance of Paul Newton, John Cotton and Ulf Winan, who corrected the manuscript. We also thank Jean Gary and Stéphane Rican for their help in preparing the sample design.
- Tobler WR: A Computer Model Simulating Urban Growth in the Detroit Region. Economic Geography. 1970, 46: 234-240. 10.2307/143141.
- Waller L, Gotway C: Applied Spatial Statistics for Public Health Data. New York: Wiley; 2004.
- Diez Roux A: Investigating Neighborhood and Area Effects on Health. American Journal of Public Health. 2001, 91 (11): 1783-1789.
- Asian Development Bank (ADB) & MCTPC, Dept of Housing and Urban Planning: Lao Urban Data Book, Development Indicators for the Urban Areas of Lao PDR, Draft Final report. 2003.
- Cochran WG: Sampling Techniques. 3rd edition. New York: Wiley; 1977.
- Curtis S: Health and Inequality: Geographical perspectives. London: Sage Publications Ltd; 2004.
- Henderson RH, Sundaresan T: Cluster sampling to assess immunization coverage: a review of experience with a simplified method. Bull WHO. 1982, 60: 253-260.
- Gould P: Is Statistix Inferens the geographical name for a wild goose?. Economic Geography. 1970, 46: 439-448. 10.2307/143157.
- Zelin A, Stubbs R: Clustered Sampling: A False Economy?. International Journal of Market Research. 2005, 47 (5): 501-522.
- Vallée J: Espace urbanisé et périmètres urbains, une délimitation complexe. Vientiane, développement urbain et patrimoine, Keophilavan Aphaylat, Pierre Clément, Charles Goldblum et Christian Taillard (édit.), les Cahiers de l'Ipraus, Editions Recherches, Paris, forthcoming in 2007.
- Rossi G, Tissandier P, Inthiphone B: Atlas infographique de Vientiane, Vientiane, Programme de formation-recherche en coopération inter-universitaire, Université Nationale du Laos, Université de Bordeaux III. 2003.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.