Marc E Colosimo, Matthew W Peterson, Scott Mardis, Lynette Hirschman.
Abstract
BACKGROUND: Current sequencing technology makes it practical to sequence many samples of a given organism, raising new challenges for the processing and interpretation of large genomics data sets with associated metadata. Traditional computational phylogenetic methods are ideal for studying the evolution of gene/protein families and using those to infer the evolution of an organism, but are less than ideal for the study of the whole organism, mainly due to the presence of insertions/deletions/rearrangements. These methods provide the researcher with the ability to group a set of samples into distinct genotypic groups based on sequence similarity, which can then be associated with metadata, such as host information, pathogenicity, and time or location of occurrence. Genotyping is critical to understanding, at a genomic level, the origin and spread of infectious diseases. Increasingly, genotyping is coming into use for disease surveillance activities, as well as for microbial forensics. The classic genotyping approach has been based on phylogenetic analysis, starting with a multiple sequence alignment. Genotypes are then established by expert examination of phylogenetic trees. However, these traditional single-processor methods are suboptimal for the rapidly growing sequence datasets generated by next-generation DNA sequencing machines, because their computational cost grows quickly with the number of sequences.
Year: 2011 PMID: 21851626 PMCID: PMC3182884 DOI: 10.1186/1751-0473-6-13
Source DB: PubMed Journal: Source Code Biol Med ISSN: 1751-0473
Figure 1. Comparison of CCV and Maximum Likelihood Trees. Patristic distance plots of trees produced by Maximum Likelihood (Y axes) vs. those produced by CCV and cosine/Euclidean distance measures (X axes).
Figure 2. Comparison of Distance Metrics Used to Create Trees. Phylogenetic trees of the HA gene from the New York State dataset constructed with (a) Euclidean and (b) cosine distances. Note the long leaf-leaf distances on tree (a).
Figure 3. Clustering of Influenza Dataset. Phylogenetic tree of the WHO Dataset, colored by cluster membership. The shorter HA1 sequences are boxed in red.
Figure 4. Integration of Metadata With Genotyping Results. Clustering of HA genes from the 2007 United Kingdom H5N1 outbreak with other samples isolated around the same time. This shows how researchers can combine genotyping with the metadata. Colors on the map represent the cluster membership. The red box indicates the cluster containing the UK Turkey Sample, along with the two Hungarian samples. Note the dates for the UK and Hungary samples (Light Blue): these dates were only provided at the year level in GenBank, even though more accurate dates can be inferred from other sources.
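The Figure 2 caption contrasts trees built from Euclidean and cosine distances. The following is a minimal sketch of how the two measures can differ on alignment-free composition vectors. It is not the CCV method itself: the plain k-mer count vector, the function names, and the toy sequences are all assumptions made for illustration. It shows one plausible effect, assumed here rather than taken from the paper: Euclidean distance is sensitive to sequence length, while cosine distance depends only on the direction of the vector.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_vector(seq, k=3):
    """Count overlapping k-mers over the ACGT alphabet, in a fixed order.

    Hypothetical helper: a plain k-mer count vector, not the CCV of the paper.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product("ACGT", repeat=k)]


def euclidean_distance(u, v):
    """Straight-line distance; grows with differences in vector magnitude."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """One minus the cosine of the angle between u and v; ignores magnitude."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy data (made up): the same repeated motif at two different lengths.
full_length = "ATGGCATTGACC" * 10
truncated = "ATGGCATTGACC" * 5

u = kmer_vector(full_length)
v = kmer_vector(truncated)

print("Euclidean:", euclidean_distance(u, v))
print("Cosine:   ", cosine_distance(u, v))
```

In this toy case the two sequences have nearly the same k-mer composition, so the cosine distance is close to zero, while the Euclidean distance is large because one vector has roughly twice the counts of the other.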