Optimisation of the T-Square sampling method to estimate population sizes
 Kristof Bostoen^{1},
 Zaid Chalabi^{2} and
 Rebecca F Grais^{3}
DOI: 10.1186/1742-7622-4-7
© Bostoen et al; licensee BioMed Central Ltd. 2007
Received: 29 September 2006
Accepted: 01 June 2007
Published: 01 June 2007
Abstract
Population size and density estimates are needed to plan resource requirements and health-related interventions. Sampling frames are not always available, necessitating surveys that use non-standard household sampling methods. These surveys are time-consuming and difficult to validate, and their implementation could be optimised. Here, we discuss an example of an optimisation procedure for rapid population estimation using T-Square sampling, which has recently been used to estimate population sizes in emergencies. A two-stage process is proposed to optimise the T-Square method: the first stage optimises the sample size and the second stage optimises the pathway connecting the sampling points. The proposed procedure yields an optimal solution if the distribution of households is described by a spatially homogeneous Poisson process and can be suboptimal otherwise. This research provides a first step in exploring how optimisation techniques could be applied to survey designs, thereby providing more timely and accurate information for planning interventions.
Background
There is a constant need to estimate population size and density for the purposes of planning resource requirements or assessing health needs. For reasons relating to timeliness, cost or practicality, data are often obtained through surveys that aim to collect representative samples. Public health specialists rely traditionally on detailed sample frames to survey populations. There are however many situations (such as those relating to displaced populations in emergencies) in which detailed sample frames are either unavailable or unfeasible. Only a small number of sampling methods are suitable for such situations.
Ecological methods, which often do not require a detailed sample frame, can offer practical solutions to household sampling problems and are currently being explored. These methods include sequential sampling techniques to estimate prevalence or program coverage [1, 2], capture-recapture techniques [3, 4], adaptive sampling [5], T-Square sampling [6] and Catana's wandering quarter method [7] to estimate population size and density.
One of the problems in validating and verifying sampling methods used in situations devoid of sampling frames is the difficulty in analysing the properties of the sampling methods [8]. Traditional optimisation of sampling methods is done using computationally intensive resampling techniques such as Monte Carlo (MC) or Latin Hypercube Sampling (LHS) simulations, while experimenting with different permutations of the parameters of the sampling method on simulated or real population data. Further, from a theoretical perspective, there are infinitely many scenarios (covering a wide distribution of household and individual data) for which the sampling method requires validation and verification.
T-Square sampling is a distance-based sampling method whose statistical properties have been thoroughly investigated [9–14]. It has been used in ecology to estimate sizes, densities and deviations from random spatial distributions of mainly plant populations [15], and more recently it has been used to estimate the size of displaced human populations in emergency situations [6, 16, 17].
Estimating human populations in emergencies by using distance-based methods, such as the T-Square, relies on collecting data on distances between households (shelters) rather than on households per se. Advantages of distance sampling methods include:

Human population density can be estimated even when not every household per unit area is detected;

The same population density estimate can be calculated from data independently collected by multiple observers;

A relatively small number of distances need to be measured;

It may be less resource intensive and potentially more accurate than traditional sampling methods such as the quadrat method [6, 16].
Two of the substantive issues to be addressed in this paper are whether:

The assumptions on which the T-Square method is originally based for estimating plant population sizes are equally valid for estimating human population sizes;

The T-Square method can be optimised.
Analysis
T-Square sampling and other distance-based methods
Choosing the appropriate distance-based method for use in human populations requires careful practical and theoretical considerations. Distances within which a surveyor can determine accurately the closest household from a random point or the closest household from a previously selected household are limited. In practice, it could be difficult to identify precisely the location of a household that occupies a large area. Furthermore, some sampling methods are more sensitive than others to errors in the measurement of angles and distances. In the T-Square method the sample observations are predetermined, unlike in the wandering quarter method. The wandering quarter method could therefore be more difficult to plan in advance than the T-Square method if health data are to be collected from each household.
In addition to T-Square sampling and Catana's wandering quarter method, there are other distance-based methods such as the line-transect and point-transect distance methods [18, 19]. It could be argued that although these methods are well established for estimating abundance of biological populations (plants or animals), extrapolating their use to household surveys would require evaluation. We note however that distance-based methods do not replace classical sampling methods where sample frames are available.
Optimisation of the T-Square sampling method
The elements of optimising any household sampling method are the objective function (performance measure) to be optimised (maximised or minimised), the parameters of the method which can be tuned to optimise the objective function, and the constraints that are imposed on the values of these parameters [8]. In the context of optimising the T-Square method this is translated as follows.
The choice of the objective function to be optimised is not arbitrary and should be carefully considered. In real-life applications, a set of empirically derived objective functions would be proposed and tailored to particular situations. Appendix II derives a simple objective function based on practical considerations. We present several examples of objective functions in the following paragraphs.
The simplest objective functions to be optimised (minimised in this case) are the standard error of the estimate of the average area per household (E) or the "cost" of the sampling (C), defined in a generic sense, as a measure of the "quantity of resources" required for sampling (for example, human resources). We can define an objective function which combines both those functions: T = E + αC where α is a trade-off scalar, or parameter, which has a dual purpose: to scale E and C numerically to the same unit and to weight the relative significance of each of them in terms of the overall performance measure.
An obvious parameter to tune is the number of sampling points (m). Both terms (E and C) in the above combined objective function depend on m. We would expect E(m) to decrease monotonically with respect to m and C(m) to increase monotonically with m, thus providing a trade-off in the choice of m to be optimised.
A key assumption in the optimisation analysis is that the distribution of the households can be described adequately by a two-dimensional spatially homogeneous Poisson process (Appendix I). In using the T-Square method, there is a potential bias in the estimate of the household density (mean number of households per unit area) if the Poisson assumption does not hold. The standard error term E(m) is proportional to $\sqrt{{m}^{-1}}$, that is to $1/\sqrt{m}$, provided the sampling points are well spaced. The constant of proportionality, however, depends on the underlying distribution and therefore influences the optimal solution. Unlike the expression for E(m), the expression for C(m) is derived from practical considerations. The constraints on m are usually in the form of simple bounds on the sample size, i.e. greater than zero, but less than 60.
The minimisation was carried out in Mathematica using a standard nonlinear programming optimisation algorithm [20]. The optimal sample size (to the nearest integer) is m* = 58.
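As a rough illustration of the first optimisation stage, the sketch below minimises T(m) = E(m) + αC(m) over integer sample sizes. It is not the Mathematica computation reported above: it assumes E(m) = k/√m and, purely for illustration, a linear cost C(m) = c·m. The constants k and c, the value of α and all function names are hypothetical; only the bounds 0 < m < 60 are taken from the text.

```python
import math


def standard_error(m, k=1.0):
    """E(m) = k / sqrt(m); k is a hypothetical proportionality constant."""
    return k / math.sqrt(m)


def cost(m, c=1.0):
    """C(m) = c * m; the linear form is an assumption made for illustration."""
    return c * m


def objective(m, alpha, k=1.0, c=1.0):
    """Combined objective T(m) = E(m) + alpha * C(m)."""
    return standard_error(m, k) + alpha * cost(m, c)


def optimal_sample_size(alpha, k=1.0, c=1.0, m_max=59):
    """Integer m with 0 < m <= m_max that minimises T(m), by exhaustive search."""
    return min(range(1, m_max + 1), key=lambda m: objective(m, alpha, k, c))


def stationary_point(alpha, k=1.0, c=1.0):
    """Unconstrained minimiser, from dT/dm = -k / (2 m^(3/2)) + alpha * c = 0."""
    return (k / (2.0 * alpha * c)) ** (2.0 / 3.0)


if __name__ == "__main__":
    alpha = 0.004  # hypothetical trade-off weight
    print(optimal_sample_size(alpha), stationary_point(alpha))
```

Under these assumed forms the unconstrained stationary point is m = (k / (2αc))^{2/3}. When that value exceeds the upper bound, the exhaustive search simply returns the bound, which is one way an optimum close to the constraint can arise.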
The two previous simulations were concerned with optimising sample size. Once the optimal sample size is determined, one can envisage a second optimisation stage whose aim is to select the optimal pathway for data collection. This could be required in practice for operational reasons and is not necessarily reflected in the cost function of the first stage optimisation problem. The optimal pathway is defined as the shortest pathway connecting all the sampling points. It is assumed here that one observer would be carrying out the survey.
The optimisation is concerned with computing the shortest pathway that connects all the sampling points. This is a classical problem in combinatorial optimization known as the "Travelling Salesperson Problem" [21]. The problem is to determine the least-distance route taken by a salesperson to visit a fixed number of cities in which each city is visited once only and in which the trip starts and ends at the same point. The Travelling Salesperson Problem (TSP) is not easy to solve (computational difficulty increases rapidly with the number of cities) and there is extensive literature on fast and efficient numerical algorithms used to solve both the classical version and more complex variations of the TSP [22, 23].
Here, we solved the TSP in Mathematica [20, 24]. The optimisation method used is called simulated annealing. Simulated annealing is a stochastic approach to finding the global solution of an optimization problem where there could be multiple local solutions [25]. In this approach, an optimal solution is found iteratively by selecting randomly at each step a point in the neighbourhood of the current solution and then directing the search in the subsequent steps to improve the value of the objective function whilst not getting trapped in a local solution. It has been found that simulated annealing has several advantages over other optimization methods for solving the TSP [26]. Additional information and an illustration of simulated annealing are available online [27].
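The second stage can be sketched with a generic simulated-annealing heuristic. The code below is an illustrative Python sketch and not the Mathematica package used by the authors. The segment-reversal (2-opt style) neighbourhood, the geometric cooling schedule and all parameter values (`steps`, `t_start`, `t_end`, `seed`) are assumptions chosen for simplicity.

```python
import math
import random


def tour_length(points, order):
    """Length of the closed tour that visits the points in the given order."""
    n = len(order)
    return sum(
        math.dist(points[order[i]], points[order[(i + 1) % n]])
        for i in range(n)
    )


def anneal_tour(points, steps=20000, t_start=1.0, t_end=1e-3, seed=0):
    """Search for a short closed tour through all points by simulated annealing."""
    rng = random.Random(seed)
    n = len(points)
    order = list(range(n))
    current = tour_length(points, order)
    if n < 4:
        # With three or fewer points every closed tour has the same length.
        return order, current
    best_order, best = order[:], current
    for step in range(steps):
        # Geometric cooling from t_start down towards t_end.
        temp = t_start * (t_end / t_start) ** (step / steps)
        # Neighbouring solution: reverse a randomly chosen segment of the tour.
        i, j = sorted(rng.sample(range(n), 2))
        candidate = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
        cand_len = tour_length(points, candidate)
        delta = cand_len - current
        # Always accept improvements; accept worse tours with a probability
        # that shrinks as the temperature falls, to escape local solutions.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            order, current = candidate, cand_len
            if current < best:
                best_order, best = order[:], current
    return best_order, best


if __name__ == "__main__":
    rng = random.Random(42)
    sampling_points = [(rng.random(), rng.random()) for _ in range(20)]
    route, length = anneal_tour(sampling_points)
    print(route, round(length, 3))
```

Because the method is stochastic, the returned route is a good tour rather than a guaranteed shortest one. In practice one might rerun it with several seeds and keep the shortest result.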
Because of the strict condition of complete randomness demanded by the T-Square sampling method, it is unlikely that this method would always be applicable. Catana's method could prove a valid alternative in the sense that it does not require complete spatial randomness; however, no results have been published for its use in human populations. As in the case of the T-Square method, Catana's method also has some restrictions in practice, as discussed previously.
Conclusion
The purpose of this paper was to illustrate the principle of optimising a household sampling method in situations where sampling frames are unavailable. We chose the T-Square method as the exemplar because it holds promise for estimating population sizes in such situations. The optimisation of the T-Square method was demonstrated using a simple illustrative example depicting scenarios that are faithful to the basic assumption of the method, namely that the distribution of the households can be described by a two-dimensional homogeneous Poisson process. If this assumption does not hold, then the proposed optimisation procedure would likely be suboptimal. Further work should investigate optimising the T-Square method in scenarios that are more realistic, including situations in which the distribution of the households is not described by a spatially homogeneous Poisson process.
The rigorous optimisation approach, which was demonstrated here on the T-Square method, can be applied to any other sampling method. Traditionally, sampling methods were validated using computer simulations and were not formally optimised. The scope of the traditional computing-intensive approaches is somewhat limited, and a mathematical approach to validation and optimisation is warranted [8].
Optimisation of sampling methods provides important information for surveys in contexts where sampling frames are not available. These techniques may be embedded in computer software used by field survey teams without requiring technical knowledge of the algorithm. That is, a user interface that allows survey teams to enter their objective function and generate an optimal survey strategy can hide the formulae, making the techniques easier for non-technical survey teams to use. Instead of being asked to define the objective function, survey teams could be led through a set of heuristics that provide the number of points to be sampled. For example, in the case of the T-Square method: if the distribution of dwellings is uniform (e.g. as in a street-structured refugee camp) then sample m_{1} points; if the distribution of dwellings is clumped (e.g. as in a village-structured refugee camp) then sample m_{2} points. Another way to envision this step would be to pose a similar set of heuristic questions whose answers are then translated into an objective function behind the user interface. The second stage of optimisation, the travelling salesperson problem, could likewise be contained within computer software and adapted for use in the field. These heuristics could be tailored to the key issues at hand in other sampling methods.
Appendix I. Statistical properties of the T-Square sampling method
The T-Square sampling method is illustrated in figure 3. We assume that individuals live in households that are not enumerated (i.e. there is no sampling frame). In emergencies, impromptu shelters grouped haphazardly represent households. Points H_{1}, H_{2} and H_{3} represent the locations of three of the households. The region of interest (Ω) could contain n households (H_{1}...H_{n}). Point S_{1} represents an arbitrarily chosen point in Ω. It is one of a sample of m points (S_{1}...S_{m}), which are generated randomly and used as anchors for the estimation method.
Recall the description of figure 3. C is the straight line joining S_{1} to the nearest household (H_{1}). Q is the line perpendicular to C at household H_{1}. Q partitions the Ω plane into two semiplanes R and L indicated by the arrows. Household H_{2} is the nearest to H_{1} on the R semiplane. The distance between S_{1} and H_{1}, and the distance between H_{1} and H_{2} are denoted by x and y, respectively.
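The construction described above can be sketched in code. This is a minimal illustration and not the authors' implementation: the function name `t_square_distances` and the representation of locations as coordinate pairs are hypothetical. Because figure 3 is not reproduced here, the sketch assumes, as in the standard T-Square construction, that R is the semi-plane on the far side of Q from S_{1}.

```python
import math


def t_square_distances(sample_point, households):
    """Return (x, y) for one sampling point.

    x is the distance from the sampling point S1 to its nearest household H1.
    y is the distance from H1 to the nearest household H2 lying strictly
    beyond the line Q (perpendicular to S1-H1 at H1), on the side away
    from S1. y is None when no household lies in that semi-plane.
    """
    h1 = min(households, key=lambda h: math.dist(sample_point, h))
    x = math.dist(sample_point, h1)

    # Direction of the line C, pointing from S1 towards H1.
    dx, dy = h1[0] - sample_point[0], h1[1] - sample_point[1]

    # A household is beyond Q when its offset from H1 has a positive
    # component along that direction (H1 itself gives zero and is excluded).
    beyond = [
        h for h in households
        if (h[0] - h1[0]) * dx + (h[1] - h1[1]) * dy > 0
    ]
    if not beyond:
        return x, None
    h2 = min(beyond, key=lambda h: math.dist(h1, h))
    return x, math.dist(h1, h2)


if __name__ == "__main__":
    shelters = [(1.0, 0.0), (1.5, 1.0), (0.5, 0.9), (3.0, 0.0)]
    print(t_square_distances((0.0, 0.0), shelters))
```

In the usage example, the shelter at (0.5, 0.9) is the closest neighbour of H_{1} overall but lies on the same side of Q as the sampling point, so it is skipped when measuring y.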
In Equation (I.1), N_{A} and N_{B} are respectively the number of households in regions A and B, and λ is the density (number of households per unit area) of the underpinning Poisson process and the parameter to be estimated.
Of course, the principal assumption of the T-Square method is very restrictive in the context of human population estimates. There are several statistical tests available to test for complete randomness of spatial point patterns [9, 12–14, 28–31]. The relaxation of this assumption has implications for the robustness of the method (see below) used to estimate λ [12].
It follows from Equation (I.2) that the random variable ω defined by ω = 2π λ x^{2} is chi-square (χ^{2}) distributed with 2 degrees of freedom [12].
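A short derivation may make this step easier to follow. It is a sketch that assumes the distribution in Equation (I.2) is the usual one for the distance X from a random point to the nearest event of a homogeneous Poisson process with density λ. The capital X for the random variable and the dummy variable w are notation introduced here.

```latex
\begin{align}
\Pr(X > x) &= \Pr(\text{no household within distance } x \text{ of } S_{1})
            = e^{-\lambda \pi x^{2}}, \\
\Pr(\omega > w) &= \Pr\!\left(2\pi\lambda X^{2} > w\right)
            = \Pr\!\left(X > \sqrt{\frac{w}{2\pi\lambda}}\right)
            = e^{-w/2}, \qquad w \ge 0 .
\end{align}
```

The last expression is the survival function of a χ^{2} distribution with 2 degrees of freedom (equivalently, an exponential distribution with mean 2), so under this assumption the expected value of x^{2} is 1/(πλ).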
If we selected the households arbitrarily, instead of the sampling points, and measured the distance between each selected household and its nearest neighbour, this distance would have the same pdf as x. However, households cannot be selected arbitrarily without enumeration of these households.
where η is the average area per household.
where κ is the average household population and Γ is the total area of region Ω.
Appendix II. Objective function
This section describes a simple objective function which has been used in practice to determine sample size requirements in cluster surveys on provision of water, sanitation and hygiene. The cluster surveys used a two-stage sampling approach. In the first stage the primary sampling units (PSUs) were selected with a probability proportional to their size. In the second stage a simple random sample of size b was taken from each PSU, where b is the number of basic sampling units (BSUs) within each PSU. b is also known as the 'take'.
References
1. Myatt M, Feleke T, Sadler K, Collins S: A field trial of a survey method for estimating the coverage of selective feeding programmes. Bull World Health Organ. 2005, 83 (1): 20-26.
2. Brooker S, Kabatereine NB, Myatt M, Russell Sothard J, Fenwick A: Rapid assessment of Schistosoma mansoni: the validity, applicability and cost-effectiveness of the Lot Quality Assurance Sampling Method in Uganda. Trop Med Int Health. 2005, 10 (7): 647-658. 10.1111/j.1365-3156.2005.01446.x.
3. Luan R, Zeng G, Zhang D, Lou L, Yuan P, Liang P, Li Y: A study on methods of estimating the population size of men who have sex with men in Southwest China. European Journal of Epidemiology. 2005, 20: 581-585. 10.1007/s10654-005-4305-4.
4. Chao A, Tsay PK, Lin SH, Shau WY, Chao DY: The applications of capture-recapture models to epidemiological data. Statist Med. 2001, 20: 3123-3157. 10.1002/sim.996.
5. Martsolf DS, Courey TJ, Chapman TR, Draucker CB, Mims BL: Adaptive sampling: recruiting a diverse community sample of survivors of sexual violence. J Community Health Nurs. 2006, 23 (3): 169-182. 10.1207/s15327655jchn2303_4.
6. Grais RF, Coulombier D, Ampuero J, Lucas MES, Barretto AT, Jacquier G, Diaz F, Balandine S, Mahoudeau C, Brown V: Are rapid population estimates accurate? A field trial of two different assessment methods. Disasters. 2006, 30 (3): 364-376. 10.1111/j.0361-3666.2005.00326.x.
7. Catana AJ: The wandering quarter method of estimating population density. Ecology. 1963, 44: 349-360. 10.2307/1932182.
8. Bostoen K, Chalabi Z: Optimising household survey sampling without sample frames. International Journal of Epidemiology. 2006, 35 (3): 751-755. 10.1093/ije/dyl019.
9. Besag J, Gleaves JT: On the detection of spatial pattern in plant communities. Bulletin of the International Statistical Institute. 1973, 45 (1): 153-158.
10. Diggle PJ: Robust density estimation using distance methods. Biometrika. 1975, 62 (1): 39-48. 10.1093/biomet/62.1.39.
11. Diggle PJ: The detection of random heterogeneity in plant populations. Biometrics. 1977, 33: 390-394. 10.2307/2529790.
12. Diggle PJ: Statistical methods for spatial point patterns in ecology. In Spatial and temporal analysis in ecology. Edited by: Cormack RM, Ord JK. Fairland, Maryland, International Cooperative Publishing House; 1979.
13. Diggle PJ: Statistical analysis of spatial point processes. Second edition. London, Arnold; 2003.
14. Diggle PJ, Besag J, Gleaves JT: Statistical analysis of spatial point patterns by means of distance methods. Biometrics. 1976, 32: 659-667. 10.2307/2529754.
15. Young LJ, Young H: Statistical ecology: a population perspective. Boston, Kluwer Academic Publishers; 1998.
16. Brown V, Jacquier G, Coulombier D, Balandine S, Belanger F, Legros D: Rapid assessment of population size by area sampling in disaster situations. Disasters. 2001, 25 (2): 164-171. 10.1111/1467-7717.00168.
17. Noji EK: Estimating population size in emergencies. Bulletin of the World Health Organization. 2005, 83 (3): 164.
18. Buckland ST, Anderson DR, Burnham KP, Laake JL: Distance sampling: estimating abundance of biological populations. London, Chapman and Hall; 1993.
19. Buckland ST, Anderson DR, Burnham KP, Laake JL, Borchers DL, Thomas L: Advanced distance sampling. Estimating abundance of biological populations. Oxford, Oxford University Press; 2004.
20. Wolfram S: Mathematica, Fifth Edition. Champaign IL, Cambridge University Press; 2003.
21. Lawler EL, Lenstra JK, Rinnooy Kan AHG, Shmoys DB: The traveling salesman problem. A guided tour of combinatorial optimization. Chichester, John Wiley & Sons; 1985.
22. Moon C, Kim J, Choi G, Seo Y: An efficient genetic algorithm for the traveling salesman problem with precedence constraints. European Journal of Operational Research. 2002, 140: 606-617. 10.1016/S0377-2217(01)00227-2.
23. Snyder LV, Daskin MS: A random-key genetic algorithm for the generalized traveling salesman problem. European Journal of Operational Research. 2006, 174: 38-53. 10.1016/j.ejor.2004.09.057.
24. Kripfganz J, Perlt H: Operations Research 3.1. A Mathematica application package. Leipzig, SoftAS GmbH; 2005.
25. Pham DT, Karaboga D: Intelligent optimization techniques. Genetic algorithms, Tabu search, simulated annealing and neural networks. London, Springer-Verlag; 2000.
26. Nemhauser GL, Wolsey LA: Integer and combinatorial optimization. New York, John Wiley & Sons; 1999.
27. Simulated Annealing. http://www.cs.sandia.gov/opt/survey/sa.html
28. Byth K, Ripley BD: On sampling spatial patterns by distance methods. Biometrics. 1980, 36: 279-284. 10.2307/2529979.
29. Cormack RM: The invariance of Cox and Lewis's statistic for the analysis of spatial patterns. Biometrika. 1977, 64 (1): 143-144. 10.2307/2335785.
30. Hines WGS, O'Hara Hines RJ: The Eberhardt statistic and the detection of nonrandomness of spatial point distributions. Biometrika. 1979, 66 (1): 73-79. 10.1093/biomet/66.1.73.
31. Holgate P: Tests of randomness based on distance methods. Biometrika. 1965, 52 (3-4): 345-353. 10.1093/biomet/52.3-4.345.
32. Bennett S, Radalowicz A, Vella A, Tomkins A: A computer simulation of household sampling schemes for health surveys in developing countries. International Journal of Epidemiology. 1994, 23 (6): 1282-1291. 10.1093/ije/23.6.1282.
33. Kish L: Survey sampling. New York, John Wiley & Sons; 1965.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.