Event-based internet biosurveillance: relation to epidemiological observation

Background
The World Health Organization (WHO) collects and publishes surveillance data and statistics for select diseases, but traditional methods of gathering such data are time- and labor-intensive. Event-based biosurveillance, which utilizes a variety of Internet sources, complements traditional surveillance. In this study we assess the reliability of Internet biosurveillance and evaluate disease-specific alert criteria against epidemiological data.
Methods
We reviewed and compared WHO epidemiological data and Argus biosurveillance system data for pandemic (H1N1) 2009 (April 2009 – January 2010) from 8 regions and 122 countries to: identify reliable alert criteria among 15 Argus-defined categories; determine the degree of data correlation for disease progression; and assess timeliness of Internet information.
Results
Argus generated a total of 1,580 unique alerts; 5 alert categories generated statistically significant (p < 0.05) correlations with WHO case count data; the sum of these 5 categories was highly correlated with WHO case data (r = 0.81, p < 0.0001), with expected differences observed among the 8 regions. Argus reported first confirmed cases on the same day as WHO for 21 of the first 64 countries reporting cases, and 1 to 16 days (average 1.5 days) ahead of WHO for 42 of those countries.
Conclusion
Confirmed pandemic (H1N1) 2009 cases collected by Argus and WHO methods returned consistent results and confirmed the reliability and timeliness of Internet information. Disease-specific alert criteria provide situational awareness and may serve as proxy indicators to event progression and escalation in lieu of traditional surveillance data; alerts may identify early-warning indicators to another pandemic, preparing the public health community for disease events.


Introduction
The World Health Organization (WHO) collects and publishes databases of statistics on confirmed and suspected disease outbreaks for select infectious diseases. The 2005 International Health Regulations (IHR), designed to ensure timely recognition of outbreaks of infectious disease with the potential to spread widely, require WHO member nations to report outbreaks of international concern to the WHO within 24 hours of discovery [1][2][3]. Consistent with the IHR, during the initial months of the pandemic (H1N1) 2009, WHO requested that countries report the initial cases and thereafter the number of confirmed cases, and deaths in confirmed cases, for as long as feasible [4]. The WHO published weekly updates of pandemic (H1N1) 2009 case and fatality counts based on this reporting [5]. The resulting database represents one of the most comprehensive and timely outbreak reporting databases available to the public on the Internet.
Event-based biosurveillance, relying primarily on Internet sources, is a recognized approach to infectious disease outbreak detection. It complements traditional approaches to public health surveillance and can provide early warning of emerging events relative to such methods, where data may lag behind due to delays in sample collection, laboratory confirmation, and country reporting. There are several active event-based biosurveillance systems: Project Argus (Argus), Biocaster, Global Public Health Intelligence Network (GPHIN), HealthMap, MedISys, ProMED-mail (ProMED) and others [6,7]. Event reports are generated by automated machine-based processes for Biocaster, HealthMap and MedISys and written by human analysts or subject matter experts for Argus, GPHIN and ProMED. Manual report examination for relevancy typically occurs post-dissemination for the automated systems (e.g., do articles with the word "virus" in the title refer to a biological infection or an attack on computers?). With the exception of ProMED, which utilizes local observers on the ground for some of its outbreak reporting, event-based biosurveillance systems often disseminate reports that are not observer or laboratory verified (e.g., a cluster of unconfirmed human avian influenza cases in Vietnam). Thus the reports provide near real-time cueing and alerting to users, but they may lack specificity.
The specificity and timeliness of outbreak detection using event-based biosurveillance can be assessed by comparison with epidemiological data from official sources, such as WHO, when available. In general, detecting a new epidemic or outbreak ("signal") amidst a varying background of disease ("noise", e.g., normal seasonal influenza or influenza-like-illness) from the vast amount of information available on the Internet is difficult. Moreover, event-based biosurveillance systems can generate a sizable amount of information on any given outbreak topic, sometimes overwhelming users with specific interests. For example, Argus alone generated approximately 22,000 reports on pandemic (H1N1) 2009 from April 2009 to March 2010.
Establishing alert criteria can aid users in identifying relevant and anomalous events from such a large amount of information. Argus and other systems have established semi-automated (pushed via email) and customized (user created) email alerts as a method to improve signal detection, to notify users of emerging events of interest, and to allow for easier tracking of outbreaks or the aftermath of natural disasters.
However, establishing criteria for sending email alerts is complex. The WHO pandemic (H1N1) 2009 data provided a means to assess the timeliness of event-based biosurveillance in real-time and retrospectively, as well as to develop and evaluate alert criteria. In this study, a comparison of WHO epidemiological data and Argus reporting data was made in order to: 1) determine to what extent Argus alerting correlated with the epidemiological disease progression by country and region based on WHO data; 2) identify which alert criteria correlate the best with epidemiological data and provide the most reliable situational awareness; and 3) explore the timeliness of biosurveillance reporting.

Project Argus methodology
Project Argus, hosted at the Georgetown University Medical Center, is designed to report and monitor the evolution of biological events threatening human, plant and animal health globally, excluding the United States (US) [6][7][8][9]. Argus collects, in an automated process, several thousand local, native-language Internet media articles daily [10]. Bayesian software tools and Boolean search strings, based on a taxonomy of infectious disease, identify candidate relevant articles. Regional experts, collectively fluent in roughly 40 languages, review these articles manually. Relevant media articles are identified based on direct indicators (reports of disease) and indirect indicators (socially disruptive events or precursors to disease, such as preventative measures or adverse enviro-climatic conditions). Regional experts write Argus reports based on these media articles; reports are posted to a password-protected Internet portal for users to view [11]. Argus reported on pandemic (H1N1) 2009 from its identification in April 2009 to the post-pandemic period [12].

Comparing Argus alert data to WHO case counts
Email alerts (see Table 1) were meant to capture increasing severity in a locale or region as portrayed in the media and comprised direct and indirect indicators [9] of disease. Senior staff reviewed Argus reports, and reports meeting alert criteria were extracted. The report metadata and a link to the report on the Argus Internet portal were then emailed to users as a means of notification.
The reporting alert criteria were reviewed and revised (weekly to monthly) during the course of the pandemic, based on reported feedback from system users. By the end of January 2010, 15 alert categories were in use. Some alert criteria were revised during the study period. For example, before November 2009 the number of healthcare workers, military personnel, and officials reported to be infected was recorded, but afterwards only clusters (≥ 3) of cases for these categories were recorded. This change reflected the increased frequency of media reporting of individual cases over time and the decreased value of monitoring individual case counts. Similarly, before October 2009 reports of overwhelmed ICUs and ventilator shortages were recorded, but afterwards only overall hospital/clinic infrastructure strain or collapse was recorded. The analysis for this study was performed based on the definition of the alert as specified during the time period of the analysis (Table 1).
The WHO 2009-2010 H1N1 country case count data were retrieved from the WHO website [5] on March 31, 2010. Argus timeliness could also have been assessed using the date of official confirmed case reporting from public Ministry of Health websites or the date of confirmed case reporting by countries to WHO where these sources of information were available. However, for this study we confined our analysis to WHO data only, because it provided comprehensive official information in one location. Weekly Argus alert counts for each alert category and weekly WHO case counts were recorded and plotted over time. The WHO case counts and Argus alert counts were normalized to a maximum of 1 prior to plotting by dividing each weekly count by the maximum weekly count in its series: count_j / max{count_j}. Data were used only from countries covered by both WHO and Argus to provide an unbiased comparison (e.g., data from the US were not used since Argus does not cover the US).
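The normalization step can be sketched as follows. This is a minimal illustration in Python rather than the authors' R code; the function name `normalize_by_max` and the weekly counts are hypothetical and are not study data.

```python
def normalize_by_max(counts):
    """Scale a weekly count series by its maximum: count_j / max{count_j}."""
    peak = max(counts)
    if peak == 0:
        # An all-zero series has no peak to scale by; return zeros unchanged.
        return [0.0 for _ in counts]
    return [c / peak for c in counts]


# Hypothetical weekly counts, for illustration only (not study data).
who_cases = [120, 300, 450, 600, 300]
argus_alerts = [4, 9, 12, 10, 6]

who_scaled = normalize_by_max(who_cases)
argus_scaled = normalize_by_max(argus_alerts)
```

Dividing by the maximum puts both series on a common scale with a peak of 1, so their shapes can be compared on one plot. The smallest value maps to min/max rather than to 0, so the scaled series need not span the full 0 to 1 range.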
A Pearson correlation matrix was generated for all Argus alerts and the WHO data to assess the correlation of each alert with the WHO data and the degree to which the alerts are related to one another. Pearson correlation coefficients and corresponding p-values were generated for the combined alert data as well as for each alert individually in comparison to overall WHO case count data. A time series of alerts with WHO case data was plotted for all significant (p < 0.05) correlations.
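For readers unfamiliar with the statistic, the sketch below computes a Pearson coefficient between two weekly series, together with the t statistic on which the usual significance test is based. It is an illustration in Python, not the authors' R code; the function names and the example series are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two series of equal length with at least 3 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing zero correlation, with n - 2 degrees of freedom."""
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Hypothetical weekly series, for illustration only (not study data).
alerts = [4, 9, 12, 10, 6, 3]
cases = [120, 300, 450, 600, 300, 150]

r = pearson_r(alerts, cases)
t = t_statistic(r, len(alerts))
```

The p-value is the two-sided tail probability of `t` under a Student t distribution with n - 2 degrees of freedom. The Python standard library has no t distribution, so the sketch stops at the statistic; in R, `cor.test(x, y)` reports both the coefficient and the p-value.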
Eight geographical regions were defined for purposes of our analysis: Africa; East Asia; Europe; Latin America and Caribbean; Middle East; Russia and Central Asia; South Asia; Southeast Asia, Oceania, and Canada. Nations in the WHO dataset were also assigned to these regions. Pearson correlation coefficients and corresponding p-values were generated for the total alert counts and WHO case counts by region.
All statistics were computed using R Version 2.11.0 [13].   [14]. The Argus Internet portal was also monitored daily for media reports of confirmed cases in new countries. Argus timeliness was assessed using the date of first official confirmed case as reported by the WHO relative to the date of first case detected by Argus for a given country from Internet sources during the initial phases of the pandemic.
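The timeliness measure amounts to a date difference per country. A minimal sketch, assuming lead time is defined as the WHO date minus the Argus date in days (the function name `lead_days` is hypothetical); the Egypt dates are those reported in the Results, and the second pair of dates is invented to show the sign convention.

```python
from datetime import date


def lead_days(argus_first, who_first):
    """Days by which Argus preceded WHO; negative if Argus was later."""
    return (who_first - argus_first).days


# Egypt: first identified by Argus on May 18, 2009 and by WHO on June 3, 2009.
egypt = lead_days(date(2009, 5, 18), date(2009, 6, 3))

# Invented dates (not study data) showing Argus lagging WHO by two days.
lagging = lead_days(date(2009, 6, 3), date(2009, 6, 1))

print(egypt, lagging)
```

Under this convention a positive value means Argus reported first, zero means same-day reporting, and a negative value means WHO reported first.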

Study period
The overall study period was from April 2009 to January 2010. August 2009 to January 2010 is the time period used for the overall comparison of WHO case counts and Argus alert counts by week (Figure 1). October 2009 to January 2010 is the time period used for the comparison of Argus alerts with WHO case data after the alert criteria definitions were nearly finalized (Figures 2 to 7, Tables 2 and 3). April 24, 2009 to June 1, 2009 is the study period used for the comparison of first confirmed cases for Argus and WHO (Table 4).

Results
Alert data compared to WHO case counts
The alert criteria used are listed in Table 1. Data were present in both WHO and Argus for 49 countries in the 8 regions. Data from Germany, Portugal, Canada, and Brazil were not available on the WHO website as of March 31, 2010. Alert time series with significant correlation to WHO case data were 1, 3, 5 (i.e. severe manifestation, co-infection or re-infection), 6, 11 (i.e. vaccine failure, severe reaction, or black market sales) and 13 (Table 3).
First confirmed case dates for Argus and WHO are compared in Table 4. Argus reported the first confirmed case on the same day as WHO for 21 of the 64 countries that reported cases by June 1, 2009. Argus reported from 1 to 16 days ahead of WHO for 42 countries: 1 day ahead for 22 countries; 2 days ahead for 8 countries; 3 days ahead for 8 countries; 4 days ahead for 1 country (Costa Rica); and 5 days ahead for 1 country (India). The remaining two of these 42 countries were identified during the study period by Argus only: WHO identified Egypt on June 3 and the United Arab Emirates on June 8, 16 days and 14 days after Argus, respectively. One country, Bahamas, was identified during the study period by WHO only and was reported by Argus 2 days later. Note that the first case in Egypt was identified by Argus on May 18, but did not appear again in the sources monitored until June 2, after the study period. Both dates are recorded in Table 4.
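The counts above can be checked against the reported totals with a short tally. This is a sketch in Python, not part of the original analysis; counting Bahamas as a lead of -2 days is an assumption, since the text does not state how the average lead time was computed.

```python
# Lead time in days (WHO date minus Argus date) -> number of countries,
# transcribed from the counts stated in the text above.
lead_day_counts = {
    0: 21,   # same day as WHO
    1: 22,
    2: 8,
    3: 8,
    4: 1,    # Costa Rica
    5: 1,    # India
    14: 1,   # United Arab Emirates
    16: 1,   # Egypt
    -2: 1,   # Bahamas: Argus 2 days after WHO (assumed to count as -2)
}

countries = sum(lead_day_counts.values())
ahead = sum(n for days, n in lead_day_counts.items() if days > 0)
total_days = sum(days * n for days, n in lead_day_counts.items())
mean_lead = total_days / countries

print(countries, ahead, round(mean_lead, 2))
```

Under that assumption the tally gives 64 countries, 42 of them with Argus ahead, and a mean lead of about 1.55 days across all 64. This matches the reported average of 1.5 days when it is taken over all countries rather than only the 42 in which Argus was ahead.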

Discussion
As media coverage intensifies during the course of a high-profile event, such as pandemic (H1N1) 2009, establishing alert criteria can help guide users of Internet-based biosurveillance systems. In this study, alert categories 1 (i.e. increase in case counts), 3 (i.e. cases or fatalities of health care workers, military personnel and/or national officials), 6 (i.e. healthcare facility strain or collapse), 13 (i.e. health policy change) and 15 (i.e. border closure) were significantly correlated with the WHO confirmed case count in the four-month study period (October 2009 to January 2010; week 41, 2009 to week 5, 2010). Thus, alerts targeting direct indicators (Alerts 1 and 3) and indirect indicators (Alerts 6, 13, and 15) provided situational awareness during the pandemic.
Increase in case counts (Alert 1) is the alert category most similar to WHO case count data. The significant correlation suggests that reports of confirmed cases in the media are consistent with confirmed cases identified through public health surveillance and testing. A rising number of cases or fatalities of health care workers, military personnel or national officials (Alert 3), who are often more aware of prevention measures than the general public, is an indication of an emerging or escalating infectious disease outbreak, consistent with a rise in case counts. Health care facility strain or collapse (Alert 6) is [...] Though only six alerts were generated for border closure (Alert 15), it is not surprising that the alert is correlated with WHO case data considering the severity of an event that would warrant such an action. Similarly, massive release of anti-virals or vaccine stockpiles (Alert 12) indicates a severe escalation or perceived escalation in cases or deaths. This alert did not reach significance, however, likely because only two alerts were generated for this category. Alerts 1, 6 and 13 are also correlated with each other and, along with Alerts 3 and 15, maintain a highly significant correlation with WHO case counts when compared individually. An increase in case counts would lead to healthcare infrastructure strain and health policy change, likely accounting for the inter-alert correlation. Comparison of alerts in pandemic versus non-pandemic years is required for verification; however, this study suggests that Alerts 1, 3, 6, 13 and 15 may all serve as proxy indicators in the media of an emerging or escalating event on the ground and could serve as surveillance measures in conjunction with public health surveillance for a future pandemic.
The other alerts may not have been significantly correlated with WHO case counts due to the relatively mild manifestation of the pandemic (H1N1) 2009 without a virulent secondary wave or changes in transmission patterns [15]. Though reports of atypical clinical manifestations, transmission to other species, anti-viral resistance [16] and failure, and viral mutations were prevalent in the media, such mechanisms appear not to have contributed to a significant escalation in case count [15]. These alerts, however, could serve as potential indicators for a future pandemic. A large increase in fatalities (Alert 2) was borderline significant with only 9 alerts generated. Again this is likely due to the mild nature of the pandemic, with an estimated 12,000 deaths, compared to the previous pandemics of 1918, 1957 and 1968, with estimated attributable mortality of 50 to 100 million, 1-2 million, and 1 million, respectively [15,17,18].
Correlation analysis by region showed some variation in the significant alerts, as was expected based on the differences in severity of the pandemic, capacity for disease detection and capability for response for each region. Alert 5 (i.e. severe manifestation, co-infection or re-infection) and Alert 11 (i.e. vaccine failure, severe reaction, or black market sales) emerged as significantly correlated to WHO case counts in Europe and in Europe and South Asia-Canada-Oceania, respectively, though they were not significant when global WHO case counts were considered. These results suggest that regional differences in the evolution of the pandemic are important to consider when developing alert criteria. Alerts 1, 3, 6 and 13 were each significant in one or more regions, which further supports their appropriateness for global surveillance.
Utilizing Internet media sources, Argus identified the first cases of confirmed pandemic (H1N1) 2009 published on the Internet an average of 1.5 days ahead of WHO official reporting (range 1 to 16 days) for all 64 non-US countries reporting by June 1, 2009. This was expected since information from Internet media reports is often timelier than the official reporting of cases to the public after laboratory confirmation. Though in this case the lead-time may be only a few days, this study provides evidence of the validity of using event-based biosurveillance for detecting emerging biological events.
This study had limitations. The alert criteria evolved from initiation in August 2009 through November 2009. However, the study period chosen for the majority of the analysis was after October in order to mitigate any bias from changing alert criteria. In addition, the alert criteria changes were small, were geared toward making the alert criteria more specific, and did not significantly impact the results (data not shown). In event-based biosurveillance studies there is often a lack of robust gold standard official comparison data. WHO data can be limited by delays in country reporting and by under-reporting; however, for the 2009 pandemic WHO was considered a timely and accurate source of global data [19]. Finally, the study had a restricted time window. Fears of a virulent resurgence of the virus in a second wave were unfounded, and when WHO case counts and Argus alerts decreased to low levels in January 2010, the study was ended. Nonetheless, sufficient data were collected to identify significant indicators of the evolving pandemic.
The pandemic (H1N1) 2009 was of global significance and a main focus of local, national and international public health organizations, particularly during the initial phase. However, there are numerous human, animal and plant diseases that are economically important but are not normally tracked by public health organizations, suggesting that Internet surveillance of such diseases could provide lead time on an outbreak compared to traditional methods [20]. When surveillance for indirect indicators (suspected cases or prevention measures) is performed in addition to direct reports of disease, the lead time often increases further [8,21]. Surveillance of pandemic (H1N1) 2009 serves as an example of the real-time capability of identifying emerging disease events in general, particularly events that may be evident in local media in the regional vernacular.
Other event-based biosurveillance systems have demonstrated the effectiveness of extracting relevant information from Internet media sources as a means for detecting and monitoring disease events [21]. Internet media reporting provides an emerging resource for early detection of new events and for providing situational awareness of evolving events, particularly when official sources may not be available. Alerts based on media reports can provide event situational awareness and cue users to shifts in infectious disease trends. As the number of online news media sources, including social media sources with user-generated content, continues to expand, event-based biosurveillance will play an increasingly important role in disease surveillance. Ongoing validation and verification of event-based biosurveillance methods with epidemiological and clinical data by users and surveillance system developers will increase the robustness of this approach for detecting and tracking emerging events.