A DNA-based semantic fusion model for remote sensing data.

Heng Sun¹, Jian Weng, Guangchuang Yu, Richard H Massawe.

Abstract

Semantic technology plays a key role in various domains, from conversation understanding to algorithm analysis. As the most efficient semantic tool, ontology can represent, process and manage widespread knowledge. Many researchers now use ontology to collect and organize the semantic information of data in order to maximize research productivity. In this paper, we first describe our work on the development of a remote sensing data ontology, with a primary focus on semantic fusion-driven research for big data. Our ontology is made up of 1,264 concepts and 2,030 semantic relationships. However, the growth of big data is straining the capacities of current semantic fusion and reasoning practices. Considering the massive parallelism of DNA strands, we propose a novel DNA-based semantic fusion model. In this model, a parallel strategy is developed to encode the semantic information in DNA for a large volume of remote sensing data. The semantic information is read in a parallel and bit-wise manner, and each individual bit is converted to a base. By doing so, a considerable amount of conversion time can be saved: for 4.34 GB of source data files, the cluster-based multi-process program reduces the conversion time from 81,536 seconds to 4,937 seconds. Moreover, the result file recording the DNA sequences is 54.51 GB for the parallel C program, compared with 57.89 GB for the sequential Perl program. This shows that our parallel method can also reduce the DNA synthesis cost. In addition, data types are encoded in our model, which is a basis for building a type system in our future DNA computer. Finally, we describe theoretically an algorithm for DNA-based semantic fusion. This algorithm enables the integration of knowledge from disparate remote sensing data sources into a consistent, accurate, and complete representation. This process depends solely on the ligation reaction and screening operations instead of the ontology.

Substances: DNA

Year: 2013 PMID: 24116207 PMCID: PMC3792926 DOI: 10.1371/journal.pone.0077090

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

As the hereditary basis of every living organism, DNA is able to store and process information. This information is determined by the sequence of four distinct bases (A, C, G, T). An oligonucleotide is a short, single-stranded DNA molecule, and complementary base pairing enables hybridization into a double-stranded polymer. These features of DNA have inspired the idea of DNA computing [1]–[3]. DNA computing, also known as molecular computing, offers great advantages both in vivo and in vitro, such as massive parallelism, extraordinary information density and exceptional energy efficiency. In contrast to traditional silicon-based technology, DNA computing has a natural potential for semantic fusion and reasoning over big data. Ontology has gained increasing acceptance as one of the semantic technologies for solving the problem of heterogeneous knowledge sharing [4]. Many research efforts have been devoted to ontology modeling over the past decade [5]–[9], and quite a few running systems based on manual ontologies have been developed [10]–[12]. However, data are accumulating at an astounding rate as computing power increases. Many activities, for instance encoding an organism's DNA [13], collecting satellite data [14], and conducting scientific experiments at the Large Hadron Collider [15], can create a staggering amount of data. The growth of these big data outstrips the capacities of current ontology engineering practices and tools. In bioinformatics, the semantic integration of big data has been identified as a new frontier [16]. The same trend can also be observed in other scientific domains. For example, with a vast amount of geographical data becoming available from satellites, especially since the recent opening of the Landsat archive [17], there is an increasing demand for automatic semantic processing of remote sensing images (RSIs) in a reasonable amount of time.
Reasoning over big data remains challenging. As the winner of the Semantic Web Challenge, Williams reported experimental results showing that reasoning over the Billion Triple Dataset required 3712 processors from IBM LS21 blade servers, with a computation time of 1314 seconds per processor [18]. Although this dataset contains 898,966,813 triples and the combined dataset is around 17 GB in size, the amount of data obtained per day from satellite devices and open sources on the Internet is much higher and is beyond the capabilities of analysts to process with the help of ontology [19]. Novel tools and approaches are needed to address this problem, which has arisen during the current period of rapid data and knowledge growth. DNA computing has now become an active research area [20]–[24]. DNA-based parallel computing takes advantage of many different DNA molecules to solve NP-complete problems in polynomial or even linear time, whereas a silicon-based computer requires exponentially increasing time. In this paper, a DNA model is introduced for semantic fusion of the RSIs. It utilizes DNA computing and ontology technologies to enable the complete representation of the RSIs' knowledge in linear time regardless of the amount of data obtained. Few published works in the literature address the application of a DNA-based approach to semantic fusion. Tsuboi proposed a pattern matching algorithm based on the stickiness of DNA molecules [25], in which semantic network technology is used to solve the information recognition problem. However, the fusion of semantic relationships is not involved, which restricts the analysis and reasoning capacity of the processing system. Moreover, the encoding scheme in this algorithm is not suitable for arbitrary digital information: different data objects have to be encoded by different oligonucleotides, and such an exhaustive representation is considered unrealistic.
Church proposed a novel strategy to store digital information in DNA [20]. In Church's work, all data blocks can be programmed into a bitstream and then encoded onto thousands of oligonucleotides. However, the sequential conversion code (Perl) is challenged by big data. Xu provided a new DNA computing model for the graph vertex coloring problem [26], which can effectively reduce the solution space by seminested polymerase chain reaction. All of the approaches described above lack support for semantic reasoning and give little attention to big data, both of which have become key problems of knowledge sharing and semantic representation in the web environment. In an attempt to overcome these difficulties, we propose here a novel DNA-based semantic fusion model as an extension of our previous research on distributed data applications in the remote sensing field [27]. In that previous work, we implemented a semantic fusion and reasoning system for the retrieval of RSIs. The use of DNA computing in semantic fusion now presents numerous opportunities for our future DNA reasoner. The inherent massive parallelism of DNA strands allows for big data storage and reasoning. The main efforts in this paper are to 1) develop a remote sensing data ontology with 1,264 concepts and 2,030 semantic relationships to annotate the RSIs; 2) encode arbitrary semantic properties, property values, semantic relationships and data types in DNA, and organize the semantic information into a directed acyclic graph; 3) evaluate the performance of our parallel conversion method against the sequential approach with the Rest dataset [28]; and 4) create an algorithm that takes advantage of biochemical reactions to fuse the semantic information.

Results and Discussion

Remote sensing data ontology

Ontology, as a formal representation of both implicit and explicit domain knowledge, can help to deal with heterogeneous representations of data and their interrelationships. Several forms of ontology exist, with different degrees of semantic richness. As a specification developed by the World Wide Web Consortium, the Resource Description Framework (RDF) [29] can present semantic information about web resources. RDF Schema [30] provides a type system for RDF and defines classes and properties that may be used to describe classes, properties and other data resources. It can also be used to build a lightweight ontology by describing RDF vocabularies. Figure 1 illustrates the remote sensing data ontology expressed in the RDF Schema language. The computer code of the ontology is provided in File S1. All terms in the ontology vocabulary are divided into five groups (namely, Identification Information, Data Quality Information, Spatial Data Organization Information, Instrument Information, and Location Information) to represent the content, quality, condition, and other characteristics of data. To enable the extensibility of the ontology, we evaluated the suitability of several existing geospatial metadata standards, including the Content Standard for Digital Geospatial Metadata: Extension for Remote Sensing Metadata [31], ISO 19115 [32] and ISO/TS 19139 [33]. The Extension defines the metadata elements published by the U.S. Federal Geographic Data Committee and documents digital remote sensing datasets in the US. ISO 19115, by contrast, only provides a structure for describing digital geographic data, and many of its elements come from the Extension standard. ISO/TS 19139 defines an XML schema implementation derived from ISO 19115. These two ISO standards are very simple but not suitable for ontology modeling. Because the conceptual model in the Extension does not provide enough semantic description of geographic data, we construct a hierarchical structure for the ontology.
The relationships among specific classes are encoded into the ontology structure. The RDF Schema properties rdfs:range and rdfs:domain describe the relationships between specific properties and classes, and many image data relationships have been described using the domain properties from the Extension standard.

Figure 1

RDF graph of the remote sensing data ontology.

This figure contains 1,264 nodes and 2,030 edges. Nodes are a set of classes and concepts in the remote sensing domain, such as Worldwide_Reference_System, Multiple_Image_Alignment, and Spatial_Domain. Edges are a set of specific properties that characterize these classes. Classes, properties, and domains are all considered ontology elements. All the elements are partitioned according to their namespaces. The namespaces in the ontology vocabulary give the Uniform Resource Identifier References (URIrefs) as the URLs of web resources that provide further information about this vocabulary. The namespaces xmlns:ersm (http://cs.jnu.edu.cn/sun/ontology/ersm), xmlns:rdfs (http://www.w3.org/2000/01/rdf-schema), and xmlns:rdf (http://www.w3.org/1999/02/22-rdf-syntax-ns) are the ones mainly used in our remote sensing data ontology. (For interpretation of the references to color in this figure, the reader is referred to the web version of this paper.)

The real RSIs must first be preprocessed with a semantic annotation technique, in which semantic tags defined in the ontology are assigned to phrases in the descriptive metadata of the RSIs. This facilitates fusion and reasoning based on image semantics. An RDF instance of an RSI is shown in Figure 2, where the metadata of RSI 103001001E1EB700 are annotated with properties such as imagequal (image quality), Cloud_Cover and spatresv (spatial resolution value). Many property values are “intermediate” anonymous resources that represent constant values (called literals) such as Excellent, 0 and 1.85, or aggregate concepts such as the RSI's structured Nominal_Spatial_Resolution values. Anonymous resources cannot be referred to from outside their defining RDF instance, and hence do not require meaningful names.
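As a minimal sketch of the anonymous-resource idea described above, the annotation of RSI 103001001E1EB700 can be written as plain subject-predicate-object triples. The property names come from the text, but the exact layout (ersm:nomspres pointing to an anonymous node that carries ersm:spatresv) and the label _:b0 are assumptions made for illustration; they are not copied from File S2.

```python
# Triples as (subject, predicate, object); "_:b0" is a hypothetical blank-node label.
BASE = "http://cs.jnu.edu.cn/sun/ontology/103001001E1EB700"

triples = [
    (BASE, "ersm:imagequal", '"Excellent"'),
    (BASE, "ersm:Cloud_Cover", '"0"'),
    (BASE, "ersm:nomspres", "_:b0"),      # aggregate concept held by an anonymous node
    ("_:b0", "ersm:spatresv", '"1.85"'),  # literal value attached to that node
]

def objects(subject, predicate):
    """Return all objects of triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def is_blank(node):
    """Anonymous nodes carry no URI; here they are marked by the '_:' prefix."""
    return node.startswith("_:")

# Follow the two-step path RSI -> anonymous node -> literal.
node = objects(BASE, "ersm:nomspres")[0]
resolution = objects(node, "ersm:spatresv")[0]
print(is_blank(node), resolution)
```

The anonymous node is only reachable through the RSI that defines it, which is why it needs no meaningful name.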

Figure 2

RDF instance description and visualization of an RSI.

This figure includes three interactive parts: an RSI in A, an RDF annotation of the RSI in B, and a data instance visualization in C. (A) The ID of the example RSI is 103001001E1EB700 and its resolution is 1.85 meters. (B) The RDF identifies the data instance using the URIref, and the image data can be described by making statements. A statement such as “An RSI 103001001E1EB700 has a nomspres (Nominal Spatial Resolution) whose value is 1.85 meters” is represented by these two RDF/XML statement blocks. File S2 provides the complete RDF code of the catalog ID 103001001E1EB700 imagery. (C) The 193 classes and concepts are partitioned into six colors according to their namespaces. Most of them (120 green nodes) represent blank nodes, which make it possible to state data more accurately because constant values and most aggregate concepts may not have URIs. The other namespaces include xml:base (http://cs.jnu.edu.cn/sun/ontology/103001001E1EB700), xmlns:rdfs (http://www.w3.org/2000/01/rdf-schema), xmlns:ersm (http://cs.jnu.edu.cn/sun/ontology/ersm), xmlns:rdf (http://www.w3.org/1999/02/22-rdf-syntax-ns), and xmlns:owl (http://www.w3.org/2002/07/owl). (For interpretation of the references to color in this figure, the reader is referred to the web version of this paper.)

Semantic property and data type

In order to convert the classes and properties representing data semantics into sequences of nucleotides, we propose a property representation and type design suited to DNA implementation. For example, this paper annotates three RSIs, E1EB7, D87C9 and B8EF1, with three properties: city (cty), imagequal (qa) and Cloud_Cover (cc). The first image's property values are Guang Zhou (GZ), Excellent (E), and 0, respectively. The values for the other two images are Hong Kong (HK), Good (G), 0, and HK, G, 16. Considering the linear structure of DNA strands, we arrange these properties and their values in sequence as shown in Figure 3. The label of a vertex is denoted as a two-tuple (property name, property value). An edge denotes the connection between two vertices in the directed graph. To simplify the graph structure, two new vertices labeled “Start” and “End” are added to the directed graph, and vertices are merged into one if they have the same property and property value. As shown in Figure 4, the directed paths between the initial and terminal vertices of the property network represent the annotation results of the RSIs.
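The construction of the property network can be sketched in a few lines. The sketch below is only an illustration of the merging rule described above: the three annotations are taken from the text, while the function names and the tuple representation of the Start and End vertices are assumptions.

```python
from collections import defaultdict

# Annotations of the three example RSIs, in the fixed property order cty, qa, cc.
annotations = {
    "E1EB7": [("cty", "GZ"), ("qa", "E"), ("cc", "0")],
    "D87C9": [("cty", "HK"), ("qa", "G"), ("cc", "0")],
    "B8EF1": [("cty", "HK"), ("qa", "G"), ("cc", "16")],
}

START, END = ("Start", None), ("End", None)

def build_network(annotations):
    """Merge identical (name, value) vertices and collect the directed edges."""
    edges = defaultdict(set)
    for props in annotations.values():
        path = [START, *props, END]
        for u, v in zip(path, path[1:]):
            edges[u].add(v)
    return edges

def paths(edges, node=START):
    """Enumerate every directed path from node to END (the graph is acyclic)."""
    if node == END:
        return [[node]]
    return [[node, *rest] for nxt in sorted(edges[node]) for rest in paths(edges, nxt)]

network = build_network(annotations)
vertices = set(network) | {v for targets in network.values() for v in targets}
all_paths = paths(network)
print(len(vertices), "vertices,", len(all_paths), "Start-to-End paths")
```

Using a set of successors per vertex is what performs the merge: the vertex (cc, 0), for instance, is stored once even though two RSIs share it.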

Figure 3

The linear model of semantic properties in three RSIs.

Figure 4

Network diagram of semantic property set.

Everything would be simple if every property value to be recorded were a simple character string literal (plain literal), as illustrated so far. However, most RSI data involve structures that are more complex than that. Many constant values that serve as property values in the RSIs are numbers (e.g. the value of a Nominal_Spatial_Resolution property) or other kinds of more specialized values. For example, Figure 4 illustrates a network diagram recording information about three RSIs, where the values of the RSIs' Cloud_Cover property are the literals “0%” and “16%”. However, there is no explicit indication that “0%” or “16%” should be interpreted as a number. The common practice in computer programming and database systems is to provide additional information about how to interpret a literal by associating a data type, such as integer, boolean, or string, with it. In our new DNA model, 4-nt oligonucleotides are used to provide this kind of information. Since a DNA strand has no built-in data type system of its own, our model simply provides a way to indicate explicitly which oligonucleotide is associated with a given data type. Table 1 shows the common data types. The data types in this model refer to the XML Schema Datatypes defined in [34]. An advantage of this approach is that it gives our model the flexibility to directly represent information obtained from various RSIs or web sources. It is worth noting that type conversions may still be required when moving data between systems that have different sets of data types.

Table 1

Mapping from the data types to the oligonucleotides.

Data types	Oligonucleotides
string	TCGA
boolean	CTGA
float	GTCA
dateTime	AGTC
duration	TAGC
URI	ACGT
RName	GCTA
integer	CATG
undefined	TGCA

Moreover, a property value may appear simple but actually be more complex. For example, the unit of spatial resolution for satellite imagery is the meter, but in some cases this information is not given explicitly; it is omitted in contexts where anyone accessing the property value can be assumed to understand the unit being used. This assumption is generally unsafe in the wider context of the imagery: one party might give a resolution value in kilometers or degrees, whilst others might assume that it is in meters. In general, careful consideration should be given to the explicit representation of unit information.

Encoding the semantic information

Before the semantic information is converted into DNA, an encoding model is required. Although diverse coding strategies for DNA sequences have been developed and some have been demonstrated [20], [35], [36], no standard model exists. Church [20] first proposed a simple, universal strategy. In Church's work, arbitrary digital information can be converted into bitstreams by utilizing the ASCII code. These bits are then encoded onto the oligonucleotide library. Unlike conventional approaches, Church encodes one bit per base in order to meet the appropriate GC-content and introduces a 19-nt oligonucleotide to represent the data's address space. However, a common type system is not considered in Church's encoding method. We therefore propose a novel data encoding approach for semantic information. First, the vertices and edges in Figure 4 are converted into DNA sequences in order to represent the semantic properties efficiently. Every vertex is associated with a 48-nt oligonucleotide, denoted V. The full description of the mapping from the vertex property to the DNA sequence is provided in the Materials and Methods section. Each V, except the start and end vertices, is decomposed into four oligonucleotides of lengths 24, 4, 4, and 16: V = NTUA, where N, T, U, and A represent the property name, data type, unit (or comment), and property value, respectively. The unit value U depends on N and T. For example, the property name cc and property value 0 in the vertex (cc,0) are represented by the first and last parts of V respectively, where N = aaCgaagagCTaagCCgCCgaaTC and A = gaCTgagaggTTggag. The oligonucleotide GCAT in V represents the unit %, as shown in Table 2.

Table 2

The oligonucleotides representing the vertex properties.

Vertex	Oligonucleotides	Denotation
start	5′-ggTaagagaTTCgaCCaCTCaCgagCCaaggTgTCTaaCagTCTgCag-3′	V_start
(cty,GZ)	5′-aCCggaTTgTCCgCaggCCTTggCTCGATGCAaTagaCCTaCgTTaCa-3′	V_(cty,GZ)
(qa,E)	5′-gaTaagaaaTTCaagTgTTggagTTCGATGCAaaCggagagTgagTaT-3′	V_(qa,E)
(qa,null)	5′-gaTaagaaaTTCaagTgTTggagTTGCATGCAaaCggagagaCagaag-3′	V_(qa,null)
(cc,0)	5′-aaCgaagagCTaagCCgCCgaaTCCATGGCATgaCTgagaggTTggag-3′	V_(cc,0)
(cc,null)	5′-aaCgaagagCTaagCCgCCgaaTCTGCATGCAgaCagagaggTaggag-3′	V_(cc,null)
end	5′-ggTaaggaggTaggagagTaaggagCCggTgCgCCaCCTggTTggTaa-3′	V_end
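The decomposition V = NTUA can be checked against Table 2 with a short sketch. The type codes below are those of Table 1; the unit lookup contains only the two codes that the text mentions (GCAT for % and TGCA for undefined), and the function name split_vertex is an assumption made for illustration.

```python
# Table 1: data type -> 4-nt code, inverted here for lookup.
TYPE_CODES = {
    "string": "TCGA", "boolean": "CTGA", "float": "GTCA", "dateTime": "AGTC",
    "duration": "TAGC", "URI": "ACGT", "RName": "GCTA", "integer": "CATG",
    "undefined": "TGCA",
}
CODE_TO_TYPE = {code: name for name, code in TYPE_CODES.items()}

# Unit codes: only the two codes mentioned in the text are known here.
CODE_TO_UNIT = {"GCAT": "%", "TGCA": "undefined"}

def split_vertex(v):
    """Split a 48-nt vertex V = NTUA into its 24/4/4/16-nt fields."""
    if len(v) != 48:
        raise ValueError("vertex oligonucleotide must be 48 nt long")
    n, t, u, a = v[:24], v[24:28], v[28:32], v[32:]
    return {
        "N": n,
        "type": CODE_TO_TYPE.get(t.upper(), "unknown"),
        "unit": CODE_TO_UNIT.get(u.upper(), "unknown"),
        "A": a,
    }

v_cc_0 = "aaCgaagagCTaagCCgCCgaaTCCATGGCATgaCTgagaggTTggag"
v_cty_gz = "aCCggaTTgTCCgCaggCCTTggCTCGATGCAaTagaCCTaCgTTaCa"
fields = split_vertex(v_cc_0)
print(fields)
print(split_vertex(v_cty_gz))
```

For the vertex (cc,0) the sketch recovers the N and A fields quoted in the text, with the type read as integer and the unit as %.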

Since the volume of electronic data expands rapidly, it is important to choose the optimal computer architecture for converting big data sets. Conversion solutions range from cluster-based computing [37] to cloud-based computing [38]. As a cost-effective way to achieve supercomputer performance, we use cluster computing. All the conversion experiments in this paper were carried out on the HPC-JNU cluster system, which is described in the Materials and Methods section. The sequential and parallel codes in C language are provided in File S3 and File S4 respectively. To evaluate the performance of these conversion programs, part of our semantic data is taken from the Rest dataset in the BTC2012 dataset (http://km.aifb.kit.edu/projects/btc-2012/rest/). This dataset is encoded in NQuads format [39] and includes three data files that range in size from 409.99 MB to 2.69 GB. Figure 5 shows the conversion results for the 4.34 GB source dataset on the HPC-JNU cluster system. As an interpreted scripting language, Perl has poor disk I/O performance. The parallel method shows the best performance, even though each user of the cluster system is limited to a maximum of 80 cores.
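The idea behind the parallel conversion can be illustrated with a small sketch. It is not the MPI program of File S4: a Python thread pool stands in for the cluster processes, the 16-byte chunk size is chosen only because it equals one 128-bit data block, and all function names are assumptions. Threads in CPython will not reproduce the reported speedup; the sketch only shows that independent chunks can be converted separately and reassembled in order.

```python
import random
from concurrent.futures import ThreadPoolExecutor

ZERO_BASES, ONE_BASES = "ag", "TC"  # 0 -> a or g, 1 -> T or C

def convert_chunk(args):
    """Convert one chunk of bytes to bases, one base per bit."""
    index, chunk = args
    rng = random.Random(index)  # per-chunk generator, so workers share no state
    bases = []
    for byte in chunk:
        for shift in range(7, -1, -1):
            bit = (byte >> shift) & 1
            bases.append(rng.choice(ONE_BASES if bit else ZERO_BASES))
    return index, "".join(bases)

def parallel_convert(data, chunk_size=16, workers=4):
    """Split data into chunks, convert them concurrently, reassemble in order."""
    chunks = [(i, data[off:off + chunk_size])
              for i, off in enumerate(range(0, len(data), chunk_size))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = dict(pool.map(convert_chunk, chunks))
    return "".join(results[i] for i in range(len(chunks)))

def to_bits(dna):
    """Recover the bit string: a/g read as 0, T/C read as 1."""
    return "".join("1" if base in ONE_BASES else "0" for base in dna)

data = b"remote sensing image metadata"
dna = parallel_convert(data)
expected_bits = "".join(f"{byte:08b}" for byte in data)
print(len(dna), to_bits(dna) == expected_bits)
```

Because every chunk carries its own index, the workers need no coordination beyond the final ordered join, which is the property a manager/worker scheme relies on.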

Figure 5

Conversion performance on the test dataset.

The result dataset contains DNA sequence information corresponding to the test data. (A) The conversion time is about 4,937 seconds for the parallel C program, 31,426 seconds for the sequential C program, and 81,536 seconds for the sequential Perl program. Error bars depict the standard error of the mean. (B) The size of the result dataset is 54.51 GB for both the sequential C and the parallel C programs. The size is 57.89 GB for the Perl program because its code uses a different data block size.
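The reported timings translate into speedup factors with simple arithmetic. The sketch below assumes that the middle value, 31,426 seconds, belongs to the sequential C program; the variable names are illustrative.

```python
# Conversion times in seconds, as reported for Figure 5A.
t_parallel_c = 4_937
t_sequential_c = 31_426      # assumed to be the sequential C program
t_sequential_perl = 81_536

speedup_vs_c = t_sequential_c / t_parallel_c
speedup_vs_perl = t_sequential_perl / t_parallel_c

# Relative reduction of the DNA sequence file size (GB), from Figure 5B.
size_saving = (57.89 - 54.51) / 57.89

print(f"{speedup_vs_c:.2f}x faster than sequential C")
print(f"{speedup_vs_perl:.2f}x faster than sequential Perl")
print(f"{size_saving:.1%} smaller output than the Perl program")
```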

DNA's storage density

At present, remote sensing data are increasing dramatically in volume. For example, the U.S. National Climatic Data Center holds the world's largest archive of weather data and has archived 3 PB (petabytes) of satellite imagery [40]. DNA, by comparison, is extremely compact. Because the mean molecular weight of a nucleotide is 330 g/mol [41] and a 200-bp molecule encodes 128 bits in our encoding method, one gram of DNA can store 5.84×10²⁰ bits. Approximating the density of DNA by that of water (10⁻³ g/mm³), the volume of all DNA sequences encoding 3 PB of data is 4.63×10⁻² mm³. Table 3 compares this volume with those of contemporaneous storage technologies [42]–[50]. DNA storage has the potential to store data roughly 100 times more compactly than even the densest of the other technologies listed.
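The density estimate can be reproduced numerically. The sketch below rests on two assumptions that are not stated explicitly in the text but that reproduce the reported figures: each base pair of the double-stranded molecule weighs twice the 330 g/mol of a single nucleotide, and 1 PB is taken as 2^50 bytes.

```python
AVOGADRO = 6.022e23           # molecules per mole
NT_WEIGHT = 330.0             # g/mol, mean weight of one nucleotide
BP_WEIGHT = 2 * NT_WEIGHT     # g/mol per base pair (assumption: double-stranded DNA)
BITS_PER_BP = 128 / 200       # a 200-bp molecule carries a 128-bit block
DNA_DENSITY = 1e-3            # g/mm^3, approximated by the density of water

bits_per_gram = AVOGADRO / BP_WEIGHT * BITS_PER_BP

archive_bits = 3 * 2**50 * 8  # 3 PB, with 1 PB taken as 2**50 bytes (assumption)
mass_g = archive_bits / bits_per_gram
volume_mm3 = mass_g / DNA_DENSITY

print(f"{bits_per_gram:.2e} bits per gram")
print(f"{volume_mm3:.2e} mm^3 for 3 PB")
```

Under these assumptions the whole 3 PB archive corresponds to a few tens of micrograms of DNA.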

Table 3

Storage volume calculations for 3 PB of data.

Medium type	Year	Volume (mm³)	Notes
CD-ROM [42]	1982	6.24×10¹⁰	1.2 mm thickness, 120 mm diameter, 700 MB
DVD-R (single layer) [43]	1996	9.08×10⁹	1.2 mm thickness, 120 mm diameter, 4.7 GB
Blu-ray (single layer) [44]	2002	1.71×10⁹	1.2 mm thickness, 120 mm diameter, 25 GB
Flash memory [45]	2013	1.25×10⁸	72 mm×26.94 mm×21 mm, 1 TB
Magnetic tape (LTO-6) [46]	2012	8.02×10⁷	6.1 µm thickness, 846 m length, 12.65 mm width, 2.5 TB
Hard disk [47]	2013	1.98×10⁵	10 TB/inch², platter 1 mm thickness
Quantum storage [48]–[50]	2012	5.16	5×7 bit/10×10 nm² on the Cu(111) surface, the average height of Cu(111) terrace 65 nm, bilayer cobalt nano-islands 0.8 nm, two additional capping layer 1 nm
This paper	2013	4.63×10⁻²
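Several rows of Table 3 can be recomputed from the dimensions given in the Notes column. The sketch below assumes binary prefixes (1 PB = 2^50 bytes, 1 GB = 2^30 bytes, and so on) and allows fractional media counts; with these assumptions the results agree with the tabulated volumes to within rounding. The helper name is illustrative.

```python
import math

ARCHIVE_BYTES = 3 * 2**50  # 3 PB, binary prefixes assumed
DNA_VOLUME_MM3 = 4.63e-2   # value reported in the last row of Table 3

def disc_volume_mm3(capacity_bytes, diameter_mm=120.0, thickness_mm=1.2):
    """Total volume of the optical discs needed to hold the archive."""
    discs = ARCHIVE_BYTES / capacity_bytes
    per_disc = math.pi * (diameter_mm / 2) ** 2 * thickness_mm
    return discs * per_disc

cd = disc_volume_mm3(700 * 2**20)
dvd = disc_volume_mm3(4.7 * 2**30)
bluray = disc_volume_mm3(25 * 2**30)
flash = ARCHIVE_BYTES / 2**40 * (72 * 26.94 * 21)  # 1 TB per 72 x 26.94 x 21 mm device

for name, volume in [("CD-ROM", cd), ("DVD-R", dvd), ("Blu-ray", bluray), ("Flash", flash)]:
    print(f"{name}: {volume:.3e} mm^3, {volume / DNA_VOLUME_MM3:.1e} times the DNA volume")
```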

Semantic fusion based on DNA

Semantic fusion is the key operation that ontology technology supports. It automatically forms the union of properties and semantic relationships. A resource, such as an RSI, and its replicas may be widely distributed over several image replica databases. The owners of the resource may select different kinds of feature properties to annotate this RSI. We must merge these properties and relationships in order to improve the efficiency and accuracy of the knowledge. As shown in Figure 6, semantic fusion enables an image's semantic information from disparate data sources to be merged. The initial properties dissolve into the new properties and do not preserve their duplicate internal structures. However, the performance of ontology fusion and reasoning degrades rapidly as the data grow. Therefore, we build a semantic fusion model based on DNA.

Figure 6

Semantic fusion pattern of an RSI.

(A) Two owners of the RSI E1EB7 select different properties to annotate it. One of them selects the properties cty and qa. The other selects the properties cty and cc. The property value null marks an unannotated property, whose data type and unit are both undefined. (B) The result property string after semantic fusion represents the complete semantic information of this RSI.

Table 2 shows a set of oligonucleotides representing the possible properties labeling the vertices in Figure 6A. All of the oligonucleotides are written 5′ to 3′. Each V in Figure 6A is now divided into two oligonucleotides, each of length 24: V = V’V’’, where V’ and V’’ are the first and second halves of V. An edge from vertex i to vertex j is encoded as a 48-nt oligonucleotide, obtained as the Watson-Crick complement of the second half of the oligonucleotide encoding vertex i followed by the first half of the oligonucleotide encoding vertex j. For example, the encoding of the edge from the vertex (cty,GZ) to the vertex (qa,E) is e(cty,GZ)→(qa,E) = AGCTACGTTaTCTggaTgCaaTgTCTaTTCTTTaagTTCaCaaCCTCa. For every vertex and every edge in Figure 6A, large quantities of V and e_ij are mixed together in the hybridization and ligation reaction, as shown in Figure 7. The oligonucleotides V serve as splints that bring oligonucleotides associated with compatible edges together for ligation. Consequently, many DNA molecules encoding the property string are created. The remaining steps, together with the final readout, are filtering and screening procedures. We use an Adleman-style [1], [51] algorithm, listed after Figure 7, to obtain the result property string.
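The edge construction can be verified against the example given in the text. The sketch below is a minimal check, with illustrative function names: it complements the second half of V_(cty,GZ) and the first half of V_(qa,E) base by base. The example edge in the text matches this complement written in the same left-to-right order as the vertex strands, so the sketch does not reverse the sequence; letter case is ignored.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(seq):
    """Base-wise Watson-Crick complement, kept in the same left-to-right order."""
    return seq.upper().translate(COMPLEMENT)

def edge(v_i, v_j):
    """Edge i -> j: complement of the second half of V_i and the first half of V_j."""
    if len(v_i) != 48 or len(v_j) != 48:
        raise ValueError("vertex oligonucleotides must be 48 nt long")
    return complement(v_i[24:] + v_j[:24])

# Vertex oligonucleotides from Table 2.
V_CTY_GZ = "aCCggaTTgTCCgCaggCCTTggCTCGATGCAaTagaCCTaCgTTaCa"
V_QA_E = "gaTaagaaaTTCaagTgTTggagTTCGATGCAaaCggagagTgagTaT"

# Edge sequence quoted in the text for (cty,GZ) -> (qa,E).
E_FROM_TEXT = "AGCTACGTTaTCTggaTgCaaTgTCTaTTCTTTaagTTCaCaaCCTCa"

e = edge(V_CTY_GZ, V_QA_E)
print(e)
print(e == E_FROM_TEXT.upper())
```

Because each half of an edge pairs with half of a different vertex, a vertex strand can hold two consecutive edge strands side by side, which is what allows ligation to extend a path.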

Figure 7

The oligonucleotides in the hybridization and ligation reaction.

For each property i, including the labels start and end, a 48-nt oligonucleotide V_i is generated. For each edge from i to j, an oligonucleotide e_ij is derived from the 3′ 24-nt of V_i and the 5′ 24-nt of V_j.

Input: DNA molecules generated randomly in large quantities.
Step 1: Reject all DNA molecules that do not begin with V_start and end with V_end.
Step 2: Reject all DNA molecules encoding property strings that do not involve exactly 5 vertices.
Step 3: Reject all DNA molecules that contain the oligonucleotide TGCATGCA encoding the null value.
Output: Read out the property strings (if any).

As shown in Figure 8, we can obtain the result property string by using the DNA-based semantic fusion method. It is consistent with the semantic properties in Figure 6B.

Figure 8

DNA sequence representing the complete semantic information.

Abstract representation of semantic fusion

The above algorithm can be formally described by an abstract model. This abstract model is based on the data structure of tubes. A tube is a multi-set of finite strings over the alphabet {A, C, G, T}, namely the DNA alphabet. Given a tube, one can perform the following operations:

pre-separate(T, s) / post-separate(T, s) / sub-separate(T, s). Given a tube T and a string s over the alphabet {A, C, G, T}, this operation creates a tube containing all strands in T that have the string s as a prefix/postfix/substring.
length-separate(T, n). Given a tube T and an integer n, this operation creates a tube containing all strands in T with length less than or equal to n.
detect(T). Given a tube T, this operation outputs true if T contains at least one DNA molecule, and false otherwise.

In our model, each of the oligonucleotides in T is of length 48. Thus, SemanticFusion(T) is:

input(T)
T ← pre-separate(T, V_start)
T ← post-separate(T, V_end)
T ← length-separate(T, 240)
T ← T − sub-separate(T, TGCATGCA)
detect(T)

Because Step 3 of the algorithm rejects the strands that carry the null value, the fourth assignment keeps the strands of T that sub-separate does not select. This model starts with the input tube T, which contains the result of the ligation reaction. All separate operations select oligonucleotides and thus require amplification of the resulting tubes by PCR (polymerase chain reaction). Indeed, the semantic fusion problem has been shown to be NP-complete [52], [53], which means that an algorithm working in polynomial time is unlikely to be found. Semantic fusion on image properties of even modest size requires an altogether impractical amount of time on a conventional electronic computer [54], [55]. However, we use the finite sequence of ligation reaction and screening operations described above to solve the semantic fusion problem. A fusion starts with an initial tube and ends with one final tube. The fusion time depends solely on the total time of the ligation reaction and the five screening steps, rather than on the number of semantic properties or the complexity of the ontology. Thus, the massive parallelism of DNA reduces the exponential time complexity of semantic fusion to linear time.
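The tube operations can be simulated in software to see how the screening narrows the tube. The sketch below is an in-silico illustration only, not a model of the chemistry: a tube is a Python list of strings, the vertex sequences are those of Table 2 (upper-cased), and the input tube is a hypothetical set of ligation products chosen by hand. Since Step 3 rejects strands containing the null marker, the sketch keeps the strands that sub_separate does not select. It does not guard against the marker occurring by chance inside a data field.

```python
VERTEX = {
    "start":   "ggTaagagaTTCgaCCaCTCaCgagCCaaggTgTCTaaCagTCTgCag",
    "cty,GZ":  "aCCggaTTgTCCgCaggCCTTggCTCGATGCAaTagaCCTaCgTTaCa",
    "qa,E":    "gaTaagaaaTTCaagTgTTggagTTCGATGCAaaCggagagTgagTaT",
    "qa,null": "gaTaagaaaTTCaagTgTTggagTTGCATGCAaaCggagagaCagaag",
    "cc,0":    "aaCgaagagCTaagCCgCCgaaTCCATGGCATgaCTgagaggTTggag",
    "cc,null": "aaCgaagagCTaagCCgCCgaaTCTGCATGCAgaCagagaggTaggag",
    "end":     "ggTaaggaggTaggagagTaaggagCCggTgCgCCaCCTggTTggTaa",
}
V = {name: seq.upper() for name, seq in VERTEX.items()}
NULL_MARK = "TGCATGCA"  # type and unit both undefined

def pre_separate(tube, s):
    return [x for x in tube if x.startswith(s)]

def post_separate(tube, s):
    return [x for x in tube if x.endswith(s)]

def sub_separate(tube, s):
    return [x for x in tube if s in x]

def length_separate(tube, n):
    return [x for x in tube if len(x) <= n]

def detect(tube):
    return len(tube) > 0

def strand(*names):
    """Concatenate vertex oligonucleotides into one ligation product."""
    return "".join(V[n] for n in names)

def semantic_fusion(tube):
    t = pre_separate(tube, V["start"])
    t = post_separate(t, V["end"])
    t = length_separate(t, 240)
    with_null = sub_separate(t, NULL_MARK)
    t = [x for x in t if x not in with_null]  # Step 3: drop strands carrying null
    return detect(t), t

# Hypothetical ligation products, including malformed strands.
tube = [
    strand("start", "cty,GZ", "qa,E", "cc,0", "end"),
    strand("start", "cty,GZ", "qa,E", "cc,null", "end"),
    strand("start", "cty,GZ", "qa,null", "cc,0", "end"),
    strand("start", "cty,GZ", "qa,null", "cc,null", "end"),
    strand("cty,GZ", "qa,E", "cc,0", "end"),                   # lacks the start vertex
    strand("start", "cty,GZ", "qa,E", "cc,0"),                 # lacks the end vertex
    strand("start", "cty,GZ", "qa,E", "qa,E", "cc,0", "end"),  # six vertices, too long
]

found, result = semantic_fusion(tube)
print(found, len(result), len(result[0]))
```

Each operation touches every strand once, whereas in the DNA setting the corresponding laboratory step acts on all molecules in the tube simultaneously.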

Conclusions

Semantic fusion is a process that is ubiquitous in nature. In this paper, a novel DNA-based semantic fusion model is proposed. The model combines a parallel strategy with DNA encoding, which makes semantic conversion more efficient and storage density higher. Furthermore, we describe the abstract representation of semantic fusion and thus show that the fusion time of semantic properties in remote sensing images depends solely on the biochemical reactions and operations rather than on the ontology. However, many issues remain to be considered. The foremost issue is error. DNA molecules are fragile and break easily, and errors in the separate operations on DNA strands can make a dramatic difference. Steps towards coping with errors should therefore be taken. In future work, we will also implement the ligation reaction and screening procedures using biochemical techniques and clarify the details in another paper.

Materials and Methods

Mapping from semantic information to an oligonucleotide

All properties and property values are converted to binary strings based on ASCII encoding. Each character corresponds to an 8-bit binary code. For example, the property cty has the binary code 011000110111010001111001. The conversion code in File S4 then converts each 0 bit to a or g and each 1 bit to T or C. Bases are chosen randomly according to the result of the function rand(). Considering the big dataset, we add a 32-bit address starting from 00000000000000000000000000000000. For example, the properties and property values of the RSI E1EB7 in Figure 3 are represented by the string ␣startctyGZ␣qa␣E␣cc00␣␣␣end, where the symbol ␣ represents a whitespace character, and start and end are the labels of the new vertices added in Figure 4. This property string has the ASCII code 001000000111001101110100011000010111001001110100011000110111010001111001010001110101101000100000011100010110000100100000010001010010000001100011011000110011000000110000001000000010000000100000011001010110111001100100. It is then encoded into two 200-nt oligonucleotides by the conversion code given in File S4. Each encodes a 128-bit data block (128 nt). Before synthesis, the sequence is augmented to include the bases representing the data type and data unit. For example, the oligonucleotide aCCggaTTgTCCgCaggCCTTggCaTagaCCTaCgTTaCa is the result of encoding the property ctyGZ in the vertex (cty,GZ). Since the data type is string and the data unit is undefined, we add TCGA and TGCA to the original oligonucleotide according to Table 1. Thus, the final oligonucleotide of the vertex (cty,GZ) is aCCggaTTgTCCgCaggCCTTggCTCGATGCAaTagaCCTaCgTTaCa, as shown in Table 2.
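The bit-to-base mapping described above can be sketched as follows. This is an illustration under stated assumptions rather than the code of File S4: Python's random module replaces rand(), the seed is fixed only to make the run repeatable, and the function names are invented for the example. Because the choice of base is random, an encoded string will generally differ from the sequences in Table 2, but decoding is deterministic and can be checked against them.

```python
import random

def text_to_bits(text):
    """ASCII text -> bit string, 8 bits per character."""
    return "".join(f"{byte:08b}" for byte in text.encode("ascii"))

def bits_to_dna(bits, rng):
    """One base per bit: 0 -> a or g, 1 -> T or C, chosen at random."""
    return "".join(rng.choice("TC" if bit == "1" else "ag") for bit in bits)

def dna_to_text(dna):
    """Invert the mapping: T/C read as 1, anything else as 0, then regroup into bytes."""
    bits = "".join("1" if base in "TC" else "0" for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8)).decode("ascii")

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
dna = bits_to_dna(text_to_bits("cty"), rng)

print(text_to_bits("cty"))
print(dna, "->", dna_to_text(dna))
print(dna_to_text("aCCggaTTgTCCgCaggCCTTggC"))  # name field of V_(cty,GZ) in Table 2
print(dna_to_text("aTagaCCTaCgTTaCa"))          # value field of V_(cty,GZ) in Table 2
```

Having two candidate bases per bit is what lets an encoder steer the GC-content of the strand without changing the information it carries.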

Specification of the cluster system

The HPC-JNU cluster system (http://hpc.jnu.edu.cn/) has 20 computational nodes. Each node is connected via the InfiniBand network. Table 4 shows the specifications of the HPC-JNU cluster system. Figure S1 and Figure S2 show the photographs of the computational nodes and the storage node.

Table 4

Specifications of the HPC-JNU cluster system.

Hardware			Software
	Computational node	Storage node
CPU	AMD Opteron 2.4 GHz	Intel Xeon 2.13 GHz	OS	CentOS 6.2
Number of nodes	20	1	MPI	Open MPI 1.6
Number of CPU cores/node	24	4	File System	NFS 4.1
Number of CPU cores	480	4	Queue Scheduler	Torque 3.3
Memory/node	48 GB	8 GB
Disk	300 GB	26TB RAID5 Array
Interconnection network	40 G QDR InfiniBand	40 G QDR InfiniBand

Supporting Information

Figure S1. Photograph of the computational nodes. (JPG)

Figure S2. Photograph of the storage node. (JPG)

File S1. Code for remote sensing data ontology (see also http://cs.jnu.edu.cn/sun/ontology). Computer code in the RDF Schema language is used to generate the remote sensing data ontology in Figure 1. The RDF/OWL API is required. (RDFS)

File S2. Code for ID 103001001E1EB700 instance (see also http://cs.jnu.edu.cn/sun/ontology). Computer code in the RDF language is the ontology annotation file of the remote sensing data (catalog ID 103001001E1EB700) instance in Figure 2. The RDF/OWL API is required. (RDF)

File S3. The sequential conversion code in C language. The code accesses and converts the data stored contiguously on disk. Despite the cache provided by the operating system, an application that performs a large number of reads, conversions and writes usually faces a performance challenge. The GCC compiler is required. (C)

File S4. The parallel conversion code in C language. To support the run-time allocation of conversion tasks, a manager/worker-style parallel C program has been built. The multiple processes of this parallel program can simultaneously access and convert big data by utilizing MPI-IO. The MPI API is required. (C)
