Jesper Nielsen, Thomas Mailund.
Abstract
BACKGROUND: High-throughput genotyping technology has enabled cost-effective typing of thousands of individuals in hundreds of thousands of markers for use in genome-wide studies. This vast improvement in data acquisition technology makes it an informatics challenge to efficiently store and manipulate the data. While spreadsheets and flat text files were adequate solutions earlier, the increased data size mandates more efficient solutions.
Year: 2008 PMID: 19063732 PMCID: PMC2633306 DOI: 10.1186/1471-2105-9-526
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Memory layout of SNPFile matrix. If your program accesses only a few columns at a time, they will cluster nicely in virtual memory, and the operating system can easily keep only the needed pages in physical memory. This means you can handle very large SNPFiles without using much physical memory. Furthermore, if your program accesses the columns in order, left to right and each from top to bottom, the file is simply read from beginning to end. This sequential access pattern is what the entire computer, both hardware and software, is optimized for, so it should be very fast. If you read a row from the matrix, however, you touch many pages in the file but use only a very small part of each; because the operating system operates on entire pages, it wastes a lot of time reading data that will not be used.
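The page-access argument in the Figure 1 caption can be illustrated with a small sketch. It is not SNPFile's actual on-disk format: it assumes a plain column-major matrix with one byte per cell, no file header, and a 4096-byte page size, and the helper names (`cell_offset`, `pages_for_column`, `pages_for_row`) are hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size in bytes; real systems may differ


def cell_offset(row, col, n_rows, cell_size=1):
    """Byte offset of cell (row, col) in a column-major matrix (assumed layout)."""
    return (col * n_rows + row) * cell_size


def pages_for_column(col, n_rows, cell_size=1, page_size=PAGE_SIZE):
    """Set of page indices touched when reading one whole column."""
    return {cell_offset(r, col, n_rows, cell_size) // page_size
            for r in range(n_rows)}


def pages_for_row(row, n_rows, n_cols, cell_size=1, page_size=PAGE_SIZE):
    """Set of page indices touched when reading one whole row."""
    return {cell_offset(row, c, n_rows, cell_size) // page_size
            for c in range(n_cols)}


# Hypothetical matrix: 4096 rows by 1000 columns, one byte per cell.
n_rows, n_cols = 4096, 1000
column_pages = len(pages_for_column(0, n_rows))
row_pages = len(pages_for_row(0, n_rows, n_cols))
print(f"pages touched reading one column: {column_pages}")
print(f"pages touched reading one row:    {row_pages}")
```

Under these assumptions a column occupies one contiguous run of bytes, while a row has one cell in every column and therefore touches a separate page for each of them.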
Figure 2. Time for accessing matrices. Times for reading or writing an entire matrix as a function of matrix size. Three tests were performed for each matrix size. The tests were done on an Intel Pentium 4 with 1 GB of RAM running Linux.
Runtime comparison using the Blossoc tool with its text file format vs. using SNPFile.
| No. Individuals | Text IO | SNPFile |
| 500 | 1200 | 867 |
| 1000 | 1893 | 1671 |
Running time in seconds for Blossoc using text IO and SNPFile, as a function of the number of individuals.
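To make the comparison easier to read, the relative gain can be derived from the two table rows. The sketch below only does arithmetic on the reported running times. The names `timings` and `summarize` are illustrative and not part of Blossoc or SNPFile.

```python
# Running times in seconds, copied from the table above:
# number of individuals -> (text IO, SNPFile)
timings = {
    500: (1200, 867),
    1000: (1893, 1671),
}


def summarize(text_io, snpfile):
    """Return (speedup factor, percent of running time saved) for one row."""
    speedup = text_io / snpfile
    saved_percent = 100.0 * (text_io - snpfile) / text_io
    return speedup, saved_percent


for individuals, (text_io, snpfile) in timings.items():
    speedup, saved = summarize(text_io, snpfile)
    print(f"{individuals} individuals: speedup {speedup:.2f}x, "
          f"{saved:.1f}% less running time")
```

These figures cover the total Blossoc running time, not IO alone, so they should be read as end-to-end gains for these two data points only.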