Clustering of miRNA sequences is an important problem in molecular genetics associated cellular biology. Thousands of such sequences are known today through advancement in sophisticated molecular tools, sequencing techniques, computational resources and rule based mathematical models. Analysis of such large-scale miRNA sequences for inferring patterns towards deducing cellular function is a great challenge in modern molecular biology. Therefore, it is of interest to develop mathematical models specific for miRNA sequences. The process is to group (cluster) such miRNA sequences using well-defined known features. We describe a method for clustering of miRNA sequences using fragmented programming. Subsequently, we illustrated the utility of the model using a dendrogram (a tree diagram) for publically known A.thaliana miRNA nucleotide sequences towards the inference of observed conserved patterns.
Clustering of miRNA sequences is an important problem in molecular genetics associated cellular biology. Thousands of such sequences are known today through advancement in sophisticated molecular tools, sequencing techniques, computational resources and rule based mathematical models. Analysis of such large-scale miRNA sequences for inferring patterns towards deducing cellular function is a great challenge in modern molecular biology. Therefore, it is of interest to develop mathematical models specific for miRNA sequences. The process is to group (cluster) such miRNA sequences using well-defined known features. We describe a method for clustering of miRNA sequences using fragmented programming. Subsequently, we illustrated the utility of the model using a dendrogram (a tree diagram) for publically known A.thalianamiRNA nucleotide sequences towards the inference of observed conserved patterns.
The human genome is known to contain thousands of miRNAs.
More than 3000 new miRNAs with sequences have been
recently identified [1,2,
3]. Increasing numbers of such new
miRNAs will be identified leading to a problem for affiliating
these with known families for finding new families. The
division of miRNAs into families does not adequately reflect
the degree of nucleotide sequence similarity, and the
categorization of miRNAs into families requires quantitative
criteria defining the differences between families. The genomes
of different organisms have orthologous miRNAs that should
be distributed into families. Hence, it is necessary to establish
the degree of similarity for orthologous miRNAs and their
belonging to different families. Several authors propose
different functional clustering methods for this purpose [4,
5,6].
Therefore, it is of interest to describe a method for clustering miRNAs sequences using fragmented programming.
Methodology
Model for clustering miRNA nucleotide sequences
Clustering nucleotide sequences is a process of sequence
comparison with the definition of maximum number of
nucleotide coincidences. This is useful for constructing a
graphical structure like a tree defining relationship between
sequences. The formulated model for a sequence based
clustering problem is illustrated in Figure 1.
Figure 1
Illustration of a mathematical model for miRNA clustering.
Algorithm:
Clustering of miRNA sequences for phylogenetic tree
The main issue with large range nucleotide sequences is lack of
sufficient computing power. We describe a fragmented algorithm (Figure 2) for
clustering miRNA nucleotide sequences in 5 steps using a flowchart (Figure 3).
Figure 2
Illustration of an algorithm for miRNA clustering.
Figure 3
Flowchart for miRNA clustering using fragmented algorithm. This figure shows data processing by each block of the
fragmented algorithm. There is filling matrixes matrOfCoincidences and numOfCoincidences where each element of the first
matrix is a maximum number of nucleotide coincidences in the corresponding positions between two sequences. Each couple of
elements of the second matrix keeps numbers for two compared sequences. In distBetweenSequences matrix the measures of
coincidences between each couple of sequences are saved in percentage ratio.
Dataset for model testing
A dataset of known miRNA nucleotide sequences from A.
thaliana (Table 1) was tested using this method.
Table 1
Splitting miRNA sequences (the number of the processed sequences represents 3700 miRNA) into families for defining the degree of their relationship (in %).
miRNA name
miRNA sequence
leU-7f
UGAGGUAGUAGAUUGUAUAGUU
leU-7f-1*
CUAUACAAUCUAUUGCCUUCCC
leU-7f-2*
CUAUACAGUCUACUGUCUUUCC
leU-7g
UGAGGUAGUAGUUUGUACAGUU
leU-7g*
CUGUACAGGCCACUGCCUUGC
miR-101
UACAGUACUGUGAUAACUGAA
miR-101*
CAGUUAUCACAGUGCUGAUGCU
miR-103
AGCAGCAUUGUACAGGGCUAUGA
miR-103-2*
AGCUUCUUUACAGUGCUGCCUUG
miR-103-as
UCAUAGCCCUGUACAAUGCUGCU
miR-105
UCAAAUGCUCAGACUCCUGUGGU
miR-105*
ACGGAUGUUUGAGCAUGUGCUA
miR-106b
UAAAGUGCUGACAGUGCAGAU
miR-106b*
CCGCACUGUGGGUACUUGCUGC
miR-107
AGCAGCAUUGUACAGGGCUAUCA
miR-1180
UUUCCGGCUCGCGUGGGUGUGU
… … …
… … …
Results & Discussion
A fragmented algorithm for miRNA nucleotide sequence
clustering and a program application were developed to define
the degree of relationship between sequences according to their
clustering. This helps to create phylogenetic trees based on
Neighbourhood-Joining (NJ) and UPGMA algorithms in this
approach after clustering known miRNA sequences (Figure 4).
Figure 4
Dendrogram of miRNA sequences from A. thaliana is shown as a use of the model.
This identified a conserved polynucleotide, which serves as the Ath-miR156a-j binding site in paralogous SPL mRNAs. These nucleotides encode the
conserved ALSLLS motif and the miR156 and miR157 subfamilies belonging to the same family [7]. The HAM1, HAM2, HAM3
paralogous mRNA binding sites for ath-miR171a-c and ath-miR170 are located in the coding DNA sequence and are conserved in
the mRNAs of 39 orthologous genes within 13 species. The human miR-1273 family includes miRNAs with different nucleotide
sequences; therefore in different miRNAs families [8].
Many programs are available for searching related sequences in
databases. This is useful for creating multiple alignments for
generating phylogenetic trees. Tools used in such analysis
include BLAST, ClustalW, ClustalX, UGENE and many others. The main issue here is lack of sufficient computing resources
for large-scale analysis.The method described here using fragmented programming
optimised the time required for data processing during
clustering. This achieved better clustering results by dividing
the set of sequences into M independent groups (fragments)
processed by M blocks, each of which will undergo fragment
clustering irrespective of other fragments. The overall
clustering is performed for all sequences in each group.A clustering process occurs simultaneously in all groups where
independent processing is possible for all processed data in
fragmented programming. Merging all related sequences in a
fragment forms a cluster as clustering is completed in each
block for every group. The main block in the algorithm finishes
a clustering by merging all received M clusters on the basis of
their clustered relationship. Matrixes of related sequences were
broken into clusters of related nucleotide sequences and
processed by M independent blocks by the fragmented
programming algorithm.The advantages of fragmented programming are the feasibility
for automatic (1) parallel computing, (2) dynamic properties, (3)
calculation of multiple architectures, and (4) subsequent
analysis of parallel computing. The fragmented algorithm
requires a minimum management determined by data
dependency and is not dependent on the distribution of
resources. Thus, it assumes a set of ways for process execution
that provides portability. A problem of executive system is to
execute display of objects in an algorithm (variables,
operations) on resources to a concrete computing system. This
automatically provides all necessary dynamic properties for
parallel computing. Fragmentation is a processing method for
reducing the number of objects in the algorithm. This simplifies
a problem of creation for effective distribution of resources and
management.The resultant sub trees were united into one phylogenetic tree
(Figure 4) by the main block in the algorithm. Its computing
complexity makes O(nlk) operations where k is the number of
clusters, n is the size of a dataset and l is the quantity of cycles
in the algorithm (Figure 5).
Figure 5
Comparative analysis of the linear and fragmented algorithms
for clustering miRNA nucleotide sequences while constructing a dendrogram. It should be noted that time is specified in seconds.
Conclusion
We describe a method for clustering of miRNA sequences using
fragmented programming. The method creates sequence
clusters as input to NJ and UPGMA for generating phylogeny
related tree diagrams. We used known A. thaliana miRNA
nucleotide sequences and developed clusters using this method
for generating a sample dendrogram to illustrate the utility of
the model.
Authors: S Nerini-Molteni; M Mennecozzi; M Fabbri; M G Sacco; K Vojnits; A Compagnoni; L Gribaldo; S Bremer-Hoffmann Journal: Curr Med Chem Date: 2012 Impact factor: 4.530
Authors: Abdou ElSharawy; Andreas Keller; Friederike Flachsbart; Anke Wendschlag; Gunnar Jacobs; Nathalie Kefer; Thomas Brefort; Petra Leidinger; Christina Backes; Eckart Meese; Stefan Schreiber; Philip Rosenstiel; Andre Franke; Almut Nebel Journal: Aging Cell Date: 2012-05-30 Impact factor: 9.304
Authors: Eric Londin; Phillipe Loher; Aristeidis G Telonis; Kevin Quann; Peter Clark; Yi Jing; Eleftheria Hatzimichael; Yohei Kirino; Shozo Honda; Michelle Lally; Bharat Ramratnam; Clay E S Comstock; Karen E Knudsen; Leonard Gomella; George L Spaeth; Lisa Hark; L Jay Katz; Agnieszka Witkiewicz; Abdolmohamad Rostami; Sergio A Jimenez; Michael A Hollingsworth; Jen Jen Yeh; Chad A Shaw; Steven E McKenzie; Paul Bray; Peter T Nelson; Simona Zupo; Katrien Van Roosbroeck; Michael J Keating; George A Calin; Charles Yeo; Masaya Jimbo; Joseph Cozzitorto; Jonathan R Brody; Kathleen Delgrosso; John S Mattick; Paolo Fortina; Isidore Rigoutsos Journal: Proc Natl Acad Sci U S A Date: 2015-02-23 Impact factor: 11.205