Data ingestion is usually the first stage, drawing on a variety of textual sources such as emails, web pages, Really Simple Syndication (RSS) feeds, Microsoft Office files and Portable Document Format (PDF) documents.
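As a minimal sketch of this stage, the snippet below pulls item titles and descriptions out of an RSS 2.0 feed using only the standard library. The feed content is a hypothetical inline example; a real system would fetch many live feeds alongside emails, web pages, Office and PDF files.

```python
# Hypothetical ingestion sketch: parse items out of an RSS 2.0 feed.
import xml.etree.ElementTree as ET

RSS_SAMPLE = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Health News</title>
  <item>
    <title>Outbreak reported in region X</title>
    <description>Officials confirm 12 cases of an unknown illness.</description>
  </item>
</channel></rss>"""

def ingest_rss(xml_text):
    """Return a list of {'title', 'description'} dicts, one per feed item."""
    root = ET.fromstring(xml_text)
    docs = []
    for item in root.iter("item"):
        docs.append({
            "title": item.findtext("title", default=""),
            "description": item.findtext("description", default=""),
        })
    return docs

docs = ingest_rss(RSS_SAMPLE)
```

Each downstream stage then operates on these plain-text documents rather than on the original heterogeneous formats.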
Data cleaning is vital in practice to remove unwanted noise from the text (such as advertisements or links to unrelated news stories) and to join together broken sentences. At this stage systems often try to break down large documents that cover multiple topics into separate sections, a process called zoning, in order to remove noise or reclassify the document (Chanlekha et al. 2010).
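A minimal sketch of cleaning and zoning might look like the following: strip lines matching noise patterns, re-join sentences broken across line breaks, then split the document into zones at topic-marker headings. The noise patterns and `== Heading ==` zone markers here are illustrative assumptions, not part of any particular system.

```python
# Illustrative cleaning and zoning; patterns are hypothetical.
import re

RAW = ("BREAKING NEWS\nOfficials report an out-\nbreak of fever.\n"
       "ADVERTISEMENT Buy now!\n== Sports ==\nThe local team won.")

def clean(text):
    # Drop lines matching known noise patterns (ads, banners).
    lines = [ln for ln in text.splitlines()
             if not re.match(r"^(ADVERTISEMENT|BREAKING NEWS)", ln)]
    text = "\n".join(lines)
    # Re-join words hyphenated across a line break.
    text = re.sub(r"-\n", "", text)
    # Collapse remaining line breaks inside sentences.
    return re.sub(r"\n(?=[a-z])", " ", text)

def zone(text):
    # Split on '== Heading ==' style markers into separate sections.
    return [z.strip()
            for z in re.split(r"^==.*==$", text, flags=re.M) if z.strip()]

zones = zone(clean(RAW))
```

After this step the health-related zone can be processed on its own, while the unrelated sports zone can be reclassified or discarded.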
Data triage assigns each document a topic category, either for discarding, in the case of non-relevant documents, or for subsequent processing using detailed fact extraction. At this stage redundant information (multiple reports of the same event) is detected through document clustering. This stage is also intended to remove the most obvious true negatives, but systems may struggle to handle the subtler cases on the borderline of their task definitions, leading to high numbers of false positives.
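The two triage tasks above can be sketched very simply: a keyword filter as a stand-in for topic categorisation, and near-duplicate grouping by Jaccard similarity over word sets as a stand-in for document clustering. The term list and similarity threshold are assumptions for illustration; production systems use trained classifiers.

```python
# Hypothetical triage sketch: relevance filter plus near-duplicate removal.
HEALTH_TERMS = {"outbreak", "cases", "virus", "disease", "fever"}

def relevant(doc):
    """Keep a document if it mentions any health-related term."""
    return bool(HEALTH_TERMS & set(doc.lower().split()))

def jaccard(a, b):
    """Word-set overlap between two documents, in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def dedupe(docs, threshold=0.8):
    """Drop documents too similar to one already kept."""
    kept = []
    for d in docs:
        if all(jaccard(d, k) < threshold for k in kept):
            kept.append(d)
    return kept

reports = ["Fever outbreak hits town", "Fever outbreak hits town today",
           "Stock market rises"]
triaged = dedupe([d for d in reports if relevant(d)])
```

Here the finance story is filtered out as non-relevant and the two near-identical outbreak reports collapse to one.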
Fact extraction obtains structured information about an event such as the name of the disease, the type of agent, the number of victims, and the time and location where the event happened. With this information the computer can then begin to answer questions such as what happened, to whom, where and when.
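As a minimal sketch, the pattern below pulls a case count, disease name and location out of a report with a single regular expression. Real systems use trained named-entity recognisers and event extractors; this pattern, and the report text, are illustrative only.

```python
# Illustrative fact extraction via a hand-written pattern (hypothetical).
import re

REPORT = "Officials confirmed 12 cases of cholera in Harare on Monday."

def extract(text):
    """Return a dict of event facts found in the text, if any."""
    facts = {}
    m = re.search(r"(\d+)\s+cases of (\w+)\s+in\s+(\w+)", text)
    if m:
        facts["victims"] = int(m.group(1))
        facts["disease"] = m.group(2)
        facts["location"] = m.group(3)
    return facts

facts = extract(REPORT)
```

The resulting record (victims, disease, location) is what later stages aggregate and rank.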
Ranking is done by applying rules to the results of earlier stages of processing. High-end systems use sophisticated statistical analysis to assign an alerting level based on a comparison of aggregated present and past data. In practice, this is often the most difficult stage for systems to perform automatically with high levels of accuracy.
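One simple form of such a present-versus-past comparison can be sketched as follows: compare the current case count against the mean of a historical baseline and raise the alert level when it exceeds the mean by a chosen number of standard deviations. The two-standard-deviation threshold and the weekly counts are illustrative assumptions.

```python
# Hypothetical alerting sketch: z-score of current count vs. history.
from statistics import mean, stdev

def alert_level(history, current, z_threshold=2.0):
    """Return 'high' when the current count is anomalously large."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return "high" if current > mu else "normal"
    z = (current - mu) / sigma
    return "high" if z > z_threshold else "normal"

history = [3, 4, 2, 5, 3, 4]  # hypothetical weekly case counts
level = alert_level(history, current=15)
```

Deciding what counts as anomalous, and setting thresholds that balance false alarms against missed events, is precisely where automated ranking tends to fall short of expert judgement.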
Human judgement is a key stage in the process. It is almost always needed to understand what is abnormal, to discover rare events that the system may have missed, to make the final decision about vague reports and to link together disparate events. The limitations of the system are most visible to the user at this stage, and users must apply their own judgement to correct for nuances of meaning that are clear to people but opaque to the software. Human analytical skills can also discover regularities in the data that lead to new paths of investigation not available to current automated approaches.