Murat İnce1. 1. Vocational School of Technical Sciences, Isparta University of Applied Sciences, Cunur West Campus, 32200 Isparta, Turkey.
Abstract
Increasing demand in distance education, e-learning, web-based learning, and other digital sectors (e.g., entertainment) has led to excessive amounts of e-content. Learning objects (LOs) are among the most important components of electronic content (e-content) and are preserved in learning object repositories (LORs). LORs supply different types of e-content. In producing e-content, several visualization techniques are employed to attract users and ensure a better understanding of the provided information. Many of these visualization systems match images with corresponding text using methods such as the semantic web, ontologies, natural language processing, statistical techniques, neural networks, and deep neural networks. Unlike these methods, in this study, an automatic and intelligent content visualization system is developed using deep learning and popular artificial intelligence techniques. The proposed system includes subsystems that segment images into panoptic image instances and use these instances to generate new images with a genetic algorithm, an evolution-based technique that is one of the best-known artificial intelligence methods. The proposed large-scale system was tested with varying numbers of LOs from various science fields. The results show that the developed system can be used efficiently to create visually enhanced content for digital use.
Distance education provides learners with the opportunity to train themselves individually in a flexible environment [1]. When there are obstacles to traditional education (such as the COVID-19 pandemic), the use of distance education increases, and educational data grow exponentially [2, 3]. Moreover, several studies have demonstrated the importance and popularity of distance education [4]. E-learning, web-based education, and web-based training are some example applications of distance education [5]. These systems contain enormous amounts of unstructured content [6]. Given the increasing demand for these types of educational activities [7], the need for qualified e-content has emerged as a problem [8]. These educational systems involve huge amounts of e-content containing plaintext, images, and video. More specifically, e-content is produced with learning objects (LOs) [9]: content, educational, and information objects and reusable learning resources [10]. LOs can contain images, text, video, sound, animations, simulations, graphs, and tables that have educational significance [11]. LOs are supported by the IEEE LOM, IMS, and DCMI standards, which require certain properties, such as accessibility, interoperability, compatibility, and reusability [12]. In order to meet these criteria, LOs are combined with descriptive information called metadata [13]. LOs and their identifying metadata are stored in learning object repositories (LORs) [14]. Thanks to LORs, LOs are reusable, shareable, and interoperable with other systems and are also readable by computers [10]. Producing high-quality e-content generally requires a great deal of time and financial resources, but such content can be produced easily and quickly using LOs from LORs. LOs can be segmented into smaller objects, which can in turn be combined to produce different large-scale LOs [15]. Text-based content can be visualized during this process [16].
Visualization of learning content is important because it simplifies topics and increases learners’ comprehension of the information [17]. Visualization systems are also needed because they aid in clarifying meaning for users with learning disabilities [18]. Images, video, animations, and simulations are some visualization techniques used to attract users, improve concentration, and make sense of the given information within the content [19, 20]. Moreover, visualization helps ensure that information is stored in the learner’s long-term memory [21, 22]. Studies show that people remember information at a higher rate when they are exposed to visualized text than when they are not [23, 24]. Another important effect of visualization is that it improves user retention with regard to color, size, shape, orientation, position, organization, and relations in content design [25]. Moreover, visualization is usable not only in education but also in daily life, such as in business, entertainment, and other digital contexts [26, 27]. Nowadays, digital documents are used extensively for educational purposes and in every aspect of our daily lives. Digitalization has thus created a need for interactive and effective text visualization techniques. Several studies have automatically converted natural language texts to images that represent the meaning of the corresponding texts [28]. Van Wierst et al. [29] developed BolVis to help researchers in philosophy, which largely depends on a plethora of reading texts; researchers can filter, compare, and explore the meanings of the most relevant parts of large-scale reading texts. Another study developed the Text Variation Explorer software to effectively visualize changes in sociolinguistic studies and provide a generic structure for use in other linguistics studies [30]. Singh et al. 
[31] combined discriminative feature selection and latent topic analysis to visualize representative data for a large-scale corpus obtained from conference proceedings, movie summaries, and newsgroup postings. They were inspired by the wheel of emotion to visualize data in the form of a flower. In Sui’s study [32], a timely topic scoring technique was used to show topic trends in Twitter text over the studied time period. Another use of text visualization is in clinical studies [33]. Patients’ documents are in unstructured text format, and given the number of these documents, clinicians often cannot deal with all of them in a short time. Thus, MedStory was developed to handle long texts using text visualization. Sprint is a system developed by Yamada et al. [34] to generate 3D models from text descriptions of a scene. Similarly, Joshi et al. [35] developed a story picture engine to represent text with pictures. Mihalcea et al. [36] developed a similar system to automatically generate pictures representing simple sentences; in this system, a WordNet structure is used as a lexical source. Utkus is another system, based on an ontology for the Russian language, that was used to create representative pictures for object behavior [37]. Bui et al. [38] developed another medical text-to-picture system to visualize patient instructions. In a similar study, Ruan et al. [39] summarized patients’ medical data. Jiang et al. [40] developed instant messaging software to obtain images for queried keywords. Vishit is software developed for Hindi speakers to ease communication between different cultures using semantic-based text-to-image visualization [41]. Another use of a text visualization system is broadcasting news with enhanced representation of emotions [42]. Although there are many text-to-image systems, Hassani and Lee [43] argued that they are not yet at the desired level. 
They wanted to create scenes, rather than showing a representative image for a given text. NALIG [44], WordsEye [45], VizStory [46], and a mobile Arabic story scene visualizer [47] are some examples of text-to-scene systems. Manufacturing [48], mapping [49], and education [50] are some of the other application areas of text visualization systems. These studies can also be used for educational e-content. Moreover, there are studies more specific to educational purposes. For example, Gunarathne et al. [51] developed a web-based LO visualization system that works on the well-known MERLOT II LOR. In this system, when users search a keyword, results are visualized using data extraction, transformation, and clustering methods. In another study, the DLNotes2 tool, which is based on semantic networks, was developed to visualize learning activity results in digital libraries [52]. In these studies, the semantic web, ontologies, natural language processing, statistical methods, neural networks, and deep neural networks have been used to represent text with images. The objective of this study is to provide an alternative, hybrid method for an automatic and intelligent e-content visualization system that differs from the aforementioned studies. By employing both panoptic image segmentation and genetic algorithms, the proposed study is intended to contribute to the associated literature. In this system, stored and newly added LOs that have plaintext content are visualized with images using an intelligent and novel method that combines deep learning and genetic algorithms. Using panoptic image segmentation (PIS), natural language processing (NLP), convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and genetic algorithms (GAs), e-content can be efficiently visualized and enhanced. E-content generation is costly and time-consuming. Using panoptic-segmented image instances to create new images with a GA provides reusability. 
It is important to note that developing hybrid systems that combine deep learning and GAs provides an effective and interesting perspective on the image visualization problem. Thus, the educational benefits of the e-content are increased. The proposed system contains different modules that use different artificial intelligence methods and differs from other studies in its GA-based image-creating module. This proposed system, based on multiple subsystems (modules), is easy to design and offers an alternative approach to the content visualization problem. Considering the general topic of the study, some noteworthy studies from the literature should be explained briefly as follows. Panoptic segmentation is the combination of instance segmentation and semantic segmentation. Its aim is to detect countable foreground objects (“things”; e.g., car, animal, person) and uncountable amorphous background regions (“stuff”; e.g., sky, grass) [53]. Many studies have explored the task of panoptic segmentation. Kirillov et al. [54] combined Mask R-CNN (an instance segmentation method) with a feature pyramid network (FPN) (a semantic segmentation method). They started with an FPN backbone used to extract rich multi-scale features. They used a region-based Mask R-CNN branch on top of the FPN for instance segmentation. Then, a lightweight dense prediction branch was added on top of the same FPN for semantic segmentation. Thus, Mask R-CNN extended with an FPN provides a fast and accurate baseline for both instance and semantic segmentation in a single network. In another study, Cai et al. [55] developed a panoptic segmentation system based on attention over the segmentation regions. They used image features based on the shape of the segmentation regions and generated captions based on the attention-weighted features to process thing and stuff classes independently. 
A similar study that segmented both stuff and things created a semi-supervised panoptic segmentation system that uses a mixture of weak and fully labeled annotations [56]. Attention-guided unified network (AUNET) is another approach to panoptic segmentation that uses a region proposal network (RPN) and pixel-level attention [57]. In order to achieve more correct segmentation, object-level and pixel-level attention-based modules were used in AUNET. A proposal attention module (PAM) was used to detect foreground objects while a mask attention module (MAM) was used for background objects. Both thing and stuff segmentation accuracies were considerably improved by using different attention modules. AUNET was tested on the Mscoco and Cityscapes benchmark datasets, and the results showed that it was effective. In a study by De Geus et al. [58], a single-network method was used with the ResNet-50 feature extraction model for panoptic segmentation. They used Mask R-CNN and a pyramid pooling module to generate regions for potential objects to predict correct pixel classes. Occlusion-aware network (OANET) was introduced by Liu et al. [59] as a novel method that uses a heuristic approach to combine instance and semantic segmentation models to overcome the problem of overlapping between objects. They also used a spatial ranking module for the occlusion problem between the predicted instances. De Geus et al. [60] developed the end-to-end fast panoptic segmentation network (FPSNET). FPSNET assigns an instance ID to each pixel, rather than mask predictions and rule-based classification. The speed of this method was effectively tested on the Cityscapes and Pascal Voc datasets. Ocfusion is another system for solving overlapping object region problems by modeling instance masks with binary relations and testing them with the ground truth relations derived from the existing benchmark datasets [61]. 
Mohan and Valada [62] developed an efficient panoptic segmentation (EfficientPS) system for autonomous robot operations. EfficientPS used a two-way FPN for the panoptic fusion of task-specific instance and semantic segmentation modules without any additional parameters. NLP, one of the popular applications of artificial intelligence, is the analysis, understanding, and reconstruction of language according to certain rules [63]. NLP operations mainly include text preprocessing, morphological analysis, syntactic analysis, and semantic analysis [64]. Morphological analysis includes the process of finding roots by separating the suffixes of words; in this way, a word’s part of speech can be determined [65]. In syntactic analysis, the roles of the words in a sentence, such as subject, object, and adverb, are determined. The use of these words in different places and numbers in sentences may convey different meanings. Semantic analysis, on the other hand, examines the association of discrete words with appropriate objects [65]. After these operations, text normalization is performed. To do so, NLP tools such as NLTK [66] can be used to convert capital letters to lowercase, convert numbers to text, remove punctuation and markup, remove spaces, expand abbreviations, and remove unnecessary words [67]. There are many artificial intelligence methods in computer science for evolutionary optimization problems [68]. Evolution strategies, evolutionary programming, and GAs are well-known artificial intelligence methods based on evolution. GAs seek to solve difficult and complex problems in many subject areas using defined mathematical models and functions. They use encoded parameters to search for and optimize solutions for a related problem [69]. Using a GA enables the possible solutions to a problem in a solution space to be explored without being restricted to a local maximum or minimum. 
If a problem is difficult and complex and the solution space is very large, a classic search approach increases search time at the cost of performance. Fitness evaluation, selection, reproduction, and mutation are the main processes of the GA. The GA searches for the best solutions in a defined population for the problem area with the following basic steps [70]: defining the initial population in the solution space, evaluating each chromosome as a solution candidate with the fitness function, selecting the best chromosomes according to the fitness function, crossing over and mutating chromosomes, and stopping the algorithm if the best solution is found or the termination condition is met. Deep learning algorithms are used frequently in many areas, including image processing, classification, and NLP. These methods, so-called deep learning networks, differ from classical artificial neural networks in several ways, including the number of layers [71]. CNNs are among the most well-known deep learning algorithms. A CNN is a feedforward neural network that contains convolutional, pooling, and fully connected layers. CNNs are mostly used in image processing to reduce image size and to extract and classify image attributes [72]. Another frequently used deep learning algorithm is the recurrent neural network (RNN). RNNs relate the value calculated in the previous output to the current input values [73]. RNNs can process input sequences of arbitrary length and time series problems. In the training phase of long-sequence RNN problems, the gradient vector components may grow or decay [74]. This can cause gradient vanishing problems as well as difficulty learning the correct relations in the sequences of the RNN model. To overcome this issue, LSTM networks, a specific version of the classical RNN, were developed. The rest of the paper is structured as follows. Section 2 describes the methods used in the proposed system. 
Section 3 details the application and evaluation of the proposed approach for content visualization. The final section presents the conclusion of the paper.
Material and methods
This study’s methods were selected based on the objective, motivations, background, and research effort. In particular, deep learning was used for PIS and a GA was used to generate images from segmented images. Figure 1 represents the stages within the flow of the introduced approach. When there are many plaintext LOs in LORs or other content repositories, content visualization is required to ensure the best comprehension of the given information. The proposed method consists of many subsystems (modules). In our repository, there are many kinds of LOs, some of which are based on plaintext and some of which contain images. As described in the literature, LOs can be split into smaller LOs or vice versa. Segmented image instances are used to compose images that represent a corresponding plaintext in the LO. Thus, the LO can be visualized by an intelligent and automatic method.
Fig. 1
Schema of proposed automatic and intelligent content visualization system
Mscoco dataset
Mscoco is a large-scale dataset used for object detection, image segmentation, and image-captioning tasks. This dataset contains 330,000 images, 1.5 million object instances, 80 object categories, 91 stuff categories, and five captions per image [75]. Given the novelty of panoptic segmentation, there are few datasets with detailed panoptic annotations and public evaluation metrics. Mscoco [76] proved the most suitable and challenging for the new panoptic segmentation task, given its detailed annotations and high data complexity. It consists of 115,000 training images, 5,000 validation images, and 20,000 testing images [75] for the panoptic segmentation task. Mscoco’s panoptic annotations include 80 thing categories and 53 stuff categories. In this study, 100,000 training images, 5,000 validation images, and 10,000 testing images from the Mscoco dataset were used. In addition, 10 extra images per thing category (a total of 800 images and annotations) were added to the dataset.
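As a small illustration, panoptic annotation files follow a simple JSON layout; the sketch below indexes them by image ID. The field names (`annotations`, `segments_info`, `category_id`) follow the Mscoco panoptic format, while the concrete values are invented for the example.

```python
import json

# Minimal example in the Mscoco panoptic annotation layout.
PANOPTIC_JSON = """{
  "categories": [{"id": 53, "name": "apple", "isthing": 1}],
  "annotations": [{"image_id": 42, "file_name": "000000000042.png",
                   "segments_info": [{"id": 1, "category_id": 53,
                                      "area": 1500, "bbox": [10, 20, 30, 40],
                                      "iscrowd": 0}]}]
}"""

def index_panoptic_annotations(data):
    """Index the segments_info lists by image_id for fast per-image lookup."""
    return (data["categories"],
            {ann["image_id"]: ann["segments_info"] for ann in data["annotations"]})

data = json.loads(PANOPTIC_JSON)
cats, segs = index_panoptic_annotations(data)
```

In a real pipeline the JSON would be read from the annotation file with `json.load` instead of an inline string.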
Evaluation metrics
The Panoptic Quality (PQ) metric is used to evaluate the performance of the panoptic model [57]. PQ is calculated for each class independently and averaged over classes. Matching the predicted and ground truth segments splits them into three sets: matched pairs of segments are true positives (TP), unmatched predicted segments are false positives (FP), and unmatched ground truth segments are false negatives (FN) [53]. Using these three sets, PQ is defined as [53]:

$$PQ = \frac{\sum_{(p,g) \in TP} IoU(p,g)}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}$$

PQ is intuitive after inspection: $\frac{1}{|TP|}\sum_{(p,g) \in TP} IoU(p,g)$ is simply the average IoU of matched segments, while $\frac{1}{2}|FP| + \frac{1}{2}|FN|$ is added to the denominator to penalize segments without matches. Note that all segments receive equal importance regardless of their area. IoU(p, g) represents the intersection over union between predicted segment p and ground truth segment g; true positives (TP) are the matched pairs of segments (IoU(p, g) > 0.5), false positives (FP) are unmatched predicted segments, and false negatives (FN) are unmatched ground truth segments [57]. Segmentation Quality (SQ) evaluates how closely matched segments agree with their ground truths. When this value approaches 1, the TP predicted segments match their ground truths more closely. However, it does not take any of the bad predictions into account, which is the point at which the Recognition Quality (RQ) becomes relevant. This metric is a combination of precision and recall and attempts to identify how effectively the model can make a correct prediction. PQ measures the performance of all classes with the same simple formula. Texts created or translated using NLP methods should also be evaluated. One frequently used method for doing so is BLEU-n [77]. In this method, the similarity between the desired reference text and the produced or translated text to be tested is examined. If the similarity between the texts is close, the value converges to 1, and if the similarity is distant, it approaches 0. 
It is calculated according to the frequency of the n-grams shared between the reference text and the tested text. Another metric is METEOR [78], defined as the harmonic mean of the precision and recall of unigram matches between sentences; it also handles synonym and paraphrase matching. METEOR outperforms BLEU on recall evaluation and when there is a lack of explicit word matching. n-gram-based measures work reasonably well when there is significant overlap between the reference and candidate sentences [78]. CIDEr-D is another metric designed specifically for image-captioning evaluation, measuring the similarity between a candidate image description and the reference sentences [79]. This method first applies stemming to each sentence, which is then represented as a set of n-grams of length 1–4. Then, the co-occurrences of n-grams in the reference sentences and the candidate sentence are calculated.
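Given the IoUs of the matched pairs and the counts of unmatched segments, the PQ formula above reduces to a few lines of code. A minimal per-class sketch (matching threshold IoU > 0.5, as in [53]):

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ for a single class.

    matched_ious -- IoU values of matched (TP) segment pairs, each > 0.5
    num_fp       -- number of unmatched predicted segments (FP)
    num_fn       -- number of unmatched ground truth segments (FN)
    """
    tp = len(matched_ious)
    if tp + num_fp + num_fn == 0:
        return 0.0
    sq = sum(matched_ious) / tp if tp else 0.0     # Segmentation Quality
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)   # Recognition Quality
    return sq * rq                                 # PQ = SQ * RQ
```

For example, two matches with IoUs 0.8 and 0.6 plus one FP and one FN give SQ = 0.7 and RQ = 2/3, so PQ ≈ 0.467.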
Panoptic image segmentation module
The first module of our proposed approach is the PIS module, which detects instances in all images in the LOs as well as in images that belong to newly added LOs. The basic panoptic segmentation formulation is as follows [53]. Given a set of semantic classes encoded by $\mathcal{L} := \{0, \ldots, L-1\}$, the panoptic segmentation method attempts to map each pixel $i$ of an image to a pair $(l_i, z_i) \in \mathcal{L} \times \mathbb{N}$, where $l_i$ represents the semantic class of pixel $i$ and $z_i$ represents its instance ID [54]. The semantic label set consists of subsets $\mathcal{L}^{St}$ for stuff and $\mathcal{L}^{Th}$ for things, where $\mathcal{L} = \mathcal{L}^{St} \cup \mathcal{L}^{Th}$ and $\mathcal{L}^{St} \cap \mathcal{L}^{Th} = \emptyset$. When a pixel is labeled with $l_i \in \mathcal{L}^{St}$, its corresponding instance ID $z_i$ is irrelevant; all such pixels belong to the same instance, called stuff. On the contrary, all pixels with the same assignment $(l_i, z_i)$, where $l_i \in \mathcal{L}^{Th}$, belong to the same instance, called a thing. In our study, the AUNET of Li et al. [57] was modified and used to perform the panoptic segmentation of the LO images. AUNET uses an RPN and pixel-level attention. In order to achieve more accurate segmentation, object-level and pixel-level attention-based modules were used in AUNET: PAM was used to detect foreground objects, while MAM was used for background objects. Consequently, both thing and stuff segmentation accuracies are considerably improved by using different attention modules. When an image is given to the PIS module, thing and stuff instances are detected and segmented. Moreover, the system generates Mscoco data-formatted information (Fig. 2) [75].
Fig. 2
Mscoco data format for panoptic segmentation [75]
For example, when images that include apples are given to the panoptic segmentation module, detected and segmented apple instances are masked with different colors (Fig. 3). The descriptive texts (annotations) generated for these apple instances by the panoptic segmentation module are as follows: “a group of apples on a tree in an orchard” for Fig. 3a, “a green apple is surrounded by a group of bananas” for Fig. 3b, and “a bowl of bread next to an orange and apple and a glass of orange juice” for Fig. 3c.
Fig. 3
Detected and segmented apple instances with colorful masks [75]
Detected and segmented panoptic thing and stuff image instances and their Mscoco data-formatted information are stored in the LOR with their LO_ID, image_id, detected instance pixels (segment_info), and detected instance descriptive texts (annotation).
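The per-pixel class/instance pairs produced by the module can be stored compactly by packing both values into a single integer per pixel. This is a common convention in panoptic segmentation tooling rather than a detail of the paper; the offset of 1000 (assumed maximum number of instances per class) is our assumption.

```python
OFFSET = 1000  # assumed maximum number of instances per class

def encode_panoptic(class_label, instance_id):
    """Pack a (semantic class, instance ID) pair into one panoptic ID."""
    return class_label * OFFSET + instance_id

def decode_panoptic(panoptic_id):
    """Recover the (semantic class, instance ID) pair from a panoptic ID."""
    return panoptic_id // OFFSET, panoptic_id % OFFSET
```

For stuff classes the instance ID can simply be kept at 0, since it is irrelevant for them.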
Genetic algorithm-based image generation module
The second module, which is the core of the proposed approach, is the GA-based image generation module. To produce an image that best suits the searched text (the plaintext of the LO), stored segmented image instances whose descriptive metadata contain the same named entities are gathered; NLTK was used for named entity recognition (NER) on the given LO text that needs to be visualized with images. These image instances are combined to produce an image with the GA. For this purpose, the segmented image pixels are cropped from the related image in the LO. Each cropped image instance is then exposed to the operations of resizing, rotating, relocating, flipping (about the y-axis), and z-indexing. The values of these operations on the image are encoded into the GA solution chromosomes (Fig. 4). Each produced image is fed to a CNN-LSTM-based encoder–decoder module to produce a text description. The produced text is analyzed with NLP techniques to extract NERs. Then, two texts consisting of NERs (the first being the searched reference LO text and the second the generated image-descriptive text) are compared using text similarity methods (BLEU-n) during the GA fitness evaluation.
Fig. 4
Encoded image processing property chromosomes
In the GA module, SC1, SC2, …, SCn are candidate solution chromosomes in the search space of the image generation problem. Each gene corresponds to an image property. SC1G1 to SCnG5 are the genes of the chromosomes, encoded directly as integer values, where i is the index of the chromosome and n is the chromosome count in the population. SCiG1 represents the resize value of the cropped segmented image instance. The lower boundary of resizing is 32 × 24, and the upper boundary is 320 × 240 (half of the system output size, 640 × 480). SCiG2 represents the rotation value of the cropped segmented image instance; rotation values range from 0° to 360° in 20° increments. SCiG3 represents the relocation value; image instances are relocated within the 640 × 480 image boundaries. SCiG4 represents whether the cropped segmented image instance is flipped about the y-axis: if the instance is flipped, the value of this gene is 1, otherwise 0. The last gene, SCiG5, represents the z-index of the cropped segmented image instance: the value is −1 if the instance is at the back, 0 if it is at the same level, and 1 if it is in front. After these processes, the produced image is provided to the CNN-LSTM (encoder–decoder)-based image-captioning module to generate a caption $y$, which is encoded as a sequence of 1-of-$K$ encoded words:

$$y = \{y_1, y_2, \ldots, y_C\}, \quad y_i \in \mathbb{R}^K$$

where $K$ is the size of the vocabulary and $C$ is the length of the caption. A CNN is used to create image feature vectors (Fig. 5). The encoder CNN produces $L$ vectors, each of which is a $D$-dimensional representation of part of the input image.
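The gene boundaries above can be captured with a direct integer encoding. The sketch below (the gene names and the dictionary layout are illustrative, not from the paper) draws random chromosomes within those boundaries and applies the random-resetting mutation used in the GA:

```python
import random

# Gene ranges following the encoding described in the text.
GENES = {
    "width":  (32, 320),   # SCiG1: resize width, 32..320 pixels
    "height": (24, 240),   # SCiG1: resize height, 24..240 pixels
    "angle":  (0, 17),     # SCiG2: rotation in 20-degree steps (0..340 degrees)
    "x":      (0, 639),    # SCiG3: x position on the 640x480 canvas
    "y":      (0, 479),    # SCiG3: y position on the 640x480 canvas
    "flip":   (0, 1),      # SCiG4: 1 if flipped about the y-axis, else 0
    "z":      (-1, 1),     # SCiG5: z-index (-1 back, 0 same level, 1 front)
}

def random_chromosome():
    """Draw a random candidate solution within the gene boundaries."""
    return {g: random.randint(lo, hi) for g, (lo, hi) in GENES.items()}

def mutate(chromosome, rate=0.25):
    """Random-resetting mutation: each gene is re-drawn with probability rate."""
    return {g: (random.randint(*GENES[g]) if random.random() < rate else v)
            for g, v in chromosome.items()}
```

Decoding a chromosome then amounts to resizing, rotating, flipping, and pasting the cropped instance onto the 640 × 480 canvas according to these gene values.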
Fig. 5
CNN-based feature extraction structure used in the proposed study
Encoded vectors are decoded with an LSTM network to generate captions word by word (Fig. 6). The LSTM network’s success on long sequences is based on its memory cell [80]. The memory cell (c) can preserve the state over long periods of time and consists of an input gate (i) that decides whether input data should be blocked or conveyed to the memory cell, an output gate (o) that produces or suppresses the output through recurrent connections in two consecutive time steps (t), and a forget gate (f) that decides whether to recall or omit the previous cell state (Fig. 7).
Fig. 6
LSTM-based text (caption) generation used in the proposed system
Fig. 7
LSTM memory cell [80]
Sigmoid gates are used to control the read and write processes of the memory cell. At a given time step $t$, the LSTM network receives inputs from various sources: the current input $x_t$, the previous hidden state of all LSTM units $h_{t-1}$, and the previous memory cell state $c_{t-1}$. The gates are updated at time step $t$ for the given inputs $x_t$, $h_{t-1}$, and $c_{t-1}$ as follows [80]:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$g_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$
$$h_t = o_t \odot \tanh(c_t)$$

where $W$ and $U$ are the weight matrices learned by the network and $b$ are the bias vectors. The sigmoid activation function is represented by $\sigma$, $\tanh$ is the hyperbolic tangent, and $\odot$ denotes the element-wise product of the vectors. At t = 0, the input data are sigmoid-modulated at the input gate i(t), whose values lie within [0, 1]. In this step, the values of the forget gates f(t) of the different LSTM units are 0. As the time step increases, the forget gate begins to decide which unimportant information should be forgotten while retaining information that is deemed useful. The memory cell states c(t) and output gate o(t) then gradually absorb the valuable context information over time and form a rich representation h(t) of the output data [80]. The hidden output $h_t$, with $K$ possible outcomes, is used to predict the next word using the softmax function with parameters $W_s$ and $b_s$ [80]:

$$p(y_t \mid h_t) = \mathrm{softmax}(W_s h_t + b_s)$$

where $p(y_t \mid h_t)$ is the estimated word probability [80]. In this way, at the end of the time steps, the image caption is generated as text. The generated image caption (text) is then analyzed with NLTK to obtain the NERs, which are compared for text similarity with the NERs of the plaintext LO (that is to be visualized). For this purpose, the fitness function of the chromosomes in the population is calculated according to the selected similarity metrics: BLEU-n, METEOR, and CIDEr-D (Eq. 10). In this study, each similarity metric was tested with the same parameters. For this purpose, mscocoapi [81], which can be downloaded from GitHub, was used. 
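The gate updates above can be written out for a single LSTM unit with scalar states. The following sketch uses illustrative (not learned) weights and follows the equations term by term:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step for a single scalar unit.

    W, U, b are dicts keyed by 'i', 'f', 'o', 'c' (input, forget, and
    output gates plus the candidate cell), matching the equations above."""
    i = sigmoid(W['i'] * x + U['i'] * h_prev + b['i'])    # input gate
    f = sigmoid(W['f'] * x + U['f'] * h_prev + b['f'])    # forget gate
    o = sigmoid(W['o'] * x + U['o'] * h_prev + b['o'])    # output gate
    g = math.tanh(W['c'] * x + U['c'] * h_prev + b['c'])  # candidate cell state
    c = f * c_prev + i * g                                # new memory cell state
    h = o * math.tanh(c)                                  # new hidden state
    return h, c
```

Iterating `lstm_step` over the encoded feature sequence produces the hidden states from which the caption words are predicted.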
The best chromosomes (i.e., those that maximize similarity) are then selected for the new population using roulette wheel selection. Meanwhile, new chromosomes are regenerated and mutated for the new population. The fitness of the ith chromosome (Eq. 10) is

$$fitness_i = sm(RT, GT_i)$$

where sm is the similarity metric (BLEU-n, METEOR, or CIDEr-D) used to evaluate similarity, RT is the reference text (the NERs of the searched LO plaintext), and GT_i is the generated text of the corresponding ith generated image in the population. When the GA stops, the images that pass a threshold similarity value (0.5 for BLEU-1) are listed as the best solutions, sorted by fitness (similarity) values from largest to smallest. The image with the highest value is first in the list and is used to visualize the plaintext. This image is then stored with the corresponding text-based LO in the LOR.
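The BLEU-1 part of the fitness evaluation can be approximated with clipped unigram precision and a brevity penalty. A simplified sketch (the full system relies on the mscocoapi implementations of BLEU-n, METEOR, and CIDEr-D; the example token lists are invented):

```python
import math
from collections import Counter

def bleu1(reference, candidate):
    """Clipped unigram precision with brevity penalty (simplified BLEU-1)."""
    if not candidate:
        return 0.0
    ref_counts, cand_counts = Counter(reference), Counter(candidate)
    # Clip each candidate unigram count by its count in the reference.
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    precision = clipped / len(candidate)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(candidate) >= len(reference) else \
         math.exp(1 - len(reference) / len(candidate))
    return bp * precision

ref = ["apple", "tree", "orchard"]   # NERs of the plaintext LO
cand = ["apple", "orchard"]          # NERs of the generated caption
score = bleu1(ref, cand)
```

Here the candidate covers two of the three reference NERs with no spurious tokens, so the score exp(-0.5) ≈ 0.61 passes the 0.5 threshold.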
Application and evaluation
This section provides information about the application of the proposed method. Applying the method to a real use case and running the necessary evaluations is important for understanding its performance.
Application on panoptic segmentation and image generation
On the application side of this study, Python was used for all modules in the proposed system. A computer system with an Intel Xeon E5-2620 (6 cores/12 threads, 2.00 GHz) CPU and 24 GB RAM was employed for the test application. In the PIS module, AUNET was used with pre-trained ResNet-50, ResNet-101, and ResNet-152 models, and Mask R-CNN was used for foreground object detection. The ResNet-50-, ResNet-101-, and ResNet-152-based networks were optimized using stochastic gradient descent (SGD) with a weight decay of 4e-5, a momentum of 0.9, and a batch size of 100. The learning rate was initialized at 0.002, and training ran for 100 epochs on the Mscoco dataset with extra data added. In the GA module, the population size was 1,000, the crossover rate was 0.7, the mutation rate was 0.25, the elitism rate (the fraction of elite chromosomes kept in the next generation) was 0.1, and the maximum iteration count was 300; roulette wheel selection was used to select potentially useful solutions for recombination, along with single-point crossover and random resetting gene mutation. Elitism preserves the best string seen up to the current generation in a location either inside or outside the population, and it can be introduced in various ways [82]. For example, the best solution of the initial population can be kept in a separate location, compared with the best result of each new generation, and replaced if the new result is better. In other studies, some of the top-fitted chromosomes are transferred directly into the next generation, or the best of the ith generation is compared with the worst of the (i + 1)th generation and replaces it if it is better [82]. The elitist strategy compensates for the easy loss of good genes and can rapidly increase the performance of the GA by not losing the best-found solutions [83].
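The three GA operators named above (elitism, single-point crossover, and random resetting gene mutation) can be sketched as follows. The chromosome contents, fitness values, and gene pool are illustrative placeholders.

```python
import random

def apply_elitism(population, fitnesses, elitism_rate):
    """Copy the top fraction of chromosomes (by fitness) unchanged into the
    next generation before crossover/mutation fills the remaining slots."""
    n_elite = max(1, int(len(population) * elitism_rate))
    ranked = sorted(zip(population, fitnesses), key=lambda pf: pf[1], reverse=True)
    return [chrom for chrom, _ in ranked[:n_elite]]

def single_point_crossover(parent_a, parent_b, point):
    """Swap the gene tails of two parents after a single cut point."""
    return parent_a[:point] + parent_b[point:], parent_b[:point] + parent_a[point:]

def random_reset_mutation(chromosome, mutation_rate, gene_pool, rng):
    """Replace each gene with a random value from its pool with the given rate."""
    return [rng.choice(gene_pool) if rng.random() < mutation_rate else g
            for g in chromosome]

pop = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10"]
fit = [0.41, 0.73, 0.12, 0.55, 0.68, 0.30, 0.91, 0.22, 0.48, 0.37]
elite = apply_elitism(pop, fit, elitism_rate=0.1)       # 10% of 10 -> 1 elite

child_a, child_b = single_point_crossover([1, 2, 3, 4], [5, 6, 7, 8], point=2)

rng = random.Random(1)
mutated = random_reset_mutation([1, 2, 3, 4], 0.25, gene_pool=list(range(10)), rng=rng)
```

With an elitism rate of 0.1 and a population of 1,000, the top 100 chromosomes would be carried over unchanged each generation.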
Similarly, in this study, the elitism rate equals the fraction of the best individuals in the current population that are copied directly into the new generation before the rest of the population is generated. In the CNN-LSTM-based image-captioning module, an image-captioning library was used [84]. In this module, a CNN encodes the image into a feature vector (encoder), while an LSTM network generates text word by word from the encoded input vector (decoder). The CNN consists of an image input layer, a convolutional layer, a max pooling layer, and a fully connected layer. In the proposed system, the produced input images are 640 × 480 × 3 (width, height, and RGB channels, respectively). The convolutional layer uses a 5 × 5 filter for feature mapping, and the max pooling layer uses a 3 × 3 filter to reduce the number of parameters and help avoid overfitting. The last layer is the fully connected layer, which gathers all image features for classification. The CNN is optimized using SGD with an initial learning rate of 0.001 for 100 epochs. Pre-trained ResNet-50, ResNet-101, and ResNet-152 models were used as the CNN to extract image features; the image feature vector was the mean output of the last convolutional layer and thus had a dimension of 2048. The decoder LSTM network uses 512 units in each layer. The Adam optimizer [85] was used with a weight decay of 4e-5, a momentum of 0.9, a batch size of 100, and a learning rate of 0.001. After each LSTM layer, a dropout layer with probability 0.2 was added; a probability of 0.2 means that one in five inputs is randomly excluded from each cycle of the updating process. Dropout is a regularization technique used to prevent overfitting: it randomly selects neurons to be ignored during training, and a dropped neuron contributes nothing to the forward pass and receives no updates during backpropagation.
The next layer was the dense layer, whose dimension was the total number of words in the vocabulary. The last layer used softmax as its activation function; the output of softmax is modeled as a probability distribution over K possible outcomes. Specifically, each element of the output vector gives the probability that the corresponding vocabulary word is the most suitable next word.
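The dropout behavior described above (probability 0.2, i.e., roughly one in five inputs excluded per update) can be demonstrated with a small numpy sketch. This uses the common "inverted dropout" formulation, which rescales the surviving activations; the sizes here are illustrative.

```python
import numpy as np

def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout: zero each unit with probability p_drop and rescale
    the survivors by 1/(1 - p_drop) so the expected activation is unchanged.
    At inference time (training=False) the layer is a no-op."""
    if not training:
        return activations
    mask = rng.random(activations.shape) >= p_drop   # True = keep the unit
    return activations * mask / (1.0 - p_drop)

rng = np.random.default_rng(3)
h = np.ones(10_000)                                  # a toy activation vector
out = dropout(h, p_drop=0.2, rng=rng)
dropped_fraction = (out == 0).mean()                 # should be close to 0.2
```

Because the kept units are scaled up, the expected value of each activation is preserved between training and inference.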
Testing and evaluation on generated images
In the PIS module of the proposed system, AUNET is used with pre-trained ResNet-50-FPN, ResNet-101-FPN, and ResNet-152-FPN backbones trained on the Mscoco dataset with extra data added. In this study, 100,000 training images, 5,000 validation images, and 10,000 testing images from the Mscoco dataset were used. In addition, 800 extra images and annotations were added to the dataset: 10 for each of the 80 object categories, used to test whether the system could distinguish new data added to the existing dataset during the testing phase. The panoptic segmentation quality percentages are shown in Table 1; the best (i.e., highest) values were obtained with the ResNet-152-FPN backbone.
Table 1
Panoptic quality results of backbones used
Backbone        | PQ (%) | SQ (%) | RQ (%)
ResNet-50-FPN   | 25.3   | 69.7   | 30.4
ResNet-101-FPN  | 40.1   | 77.2   | 49.2
ResNet-152-FPN  | 43.7   | 79.4   | 53.7
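The PQ, SQ, and RQ columns in Table 1 follow the standard panoptic quality decomposition from Kirillov et al.: segmentation quality (mean IoU over matched segments) times recognition quality (an F1-style detection term). A minimal sketch of that definition, with illustrative IoU values rather than the paper's actual matches:

```python
def panoptic_quality(tp_ious, n_fp, n_fn):
    """PQ = SQ * RQ, following the standard definition:
    SQ is the mean IoU over true-positive segment matches, and
    RQ = |TP| / (|TP| + 0.5*|FP| + 0.5*|FN|)."""
    n_tp = len(tp_ious)
    if n_tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(tp_ious) / n_tp
    rq = n_tp / (n_tp + 0.5 * n_fp + 0.5 * n_fn)
    return sq * rq, sq, rq

# Illustrative matches: a predicted segment counts as a true positive
# when its IoU with a ground-truth segment exceeds 0.5.
pq, sq, rq = panoptic_quality(tp_ious=[0.9, 0.8, 0.7], n_fp=1, n_fn=1)
```

In practice the metric is averaged over categories, which is why the tabulated PQ values are not the simple product of the tabulated SQ and RQ columns.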
Different amounts of text-based LOs and science topics were used to test the efficiency and validity of the proposed content visualization system (Table 2).
Table 2
Text-based LO count according to science topic
Topic             | Text-based LO count
Foreign language  | 20
History–arts      | 10
Biology           | 26
Math              | 20
Physics–chemistry | 14
Geography         | 10
In the GA-based image caption module, LO texts are analyzed with NLP to obtain the NERs, and the corresponding segmented image instances for the detected NERs are obtained from the LOR. The GA then attempts to combine these instances to create an image, and each candidate image is passed to the CNN-LSTM-based image-captioning module, which uses ResNet-50, ResNet-101, and ResNet-152 backbones with extra data added to the Mscoco dataset. The generated image captions are analyzed with NLP, and the resulting text is compared using the BLEU-n, METEOR, and CIDEr-D similarity metrics to evaluate the fitness of the GA chromosomes on a scale of 0 to 100 (Table 3). Images that pass the threshold value of 0.5 (percentage equivalent = 50) for the BLEU-1 metric are listed from highest to lowest, and the first image, which has the highest (i.e., best) value, is selected as the visualization image for the text-based LO.
Table 3
Evaluation metrics of captions for generated images in GA module
Backbone    | Topic              | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | CIDEr-D
ResNet-50   | Foreign language   | 61.2   | 38.7   | 26.5   | 17.4   | 16.9   | 53.8
ResNet-50   | History–arts       | 53.4   | 33.8   | 23.0   | 15.2   | 14.7   | 47.0
ResNet-50   | Biology            | 63.0   | 39.8   | 27.2   | 17.8   | 17.2   | 55.1
ResNet-50   | Math               | 58.7   | 37.1   | 25.4   | 16.7   | 16.2   | 51.6
ResNet-50   | Physics–chemistry  | 56.3   | 35.6   | 24.3   | 16.0   | 15.5   | 49.5
ResNet-50   | Geography          | 50.3   | 31.8   | 21.6   | 14.3   | 13.8   | 44.3
ResNet-101  | Foreign language   | 66.4   | 40.2   | 28.1   | 19.2   | 18.1   | 58.7
ResNet-101  | History–arts       | 55.7   | 33.6   | 23.5   | 16.1   | 15.2   | 49.3
ResNet-101  | Biology            | 68.4   | 41.4   | 28.9   | 19.7   | 18.5   | 60.2
ResNet-101  | Math               | 63.1   | 38.1   | 26.7   | 18.2   | 17.1   | 55.8
ResNet-101  | Physics–chemistry  | 59.9   | 36.2   | 25.4   | 17.3   | 16.3   | 52.9
ResNet-101  | Geography          | 51.1   | 30.9   | 21.6   | 14.7   | 13.9   | 45.2
ResNet-152  | Foreign language   | 69.7   | 41.7   | 29.3   | 22.0   | 18.4   | 62.1
ResNet-152  | History–arts       | 59.1   | 35.3   | 24.8   | 18.6   | 15.6   | 52.6
ResNet-152  | Biology            | 71.9   | 43.0   | 30.2   | 22.6   | 18.9   | 63.7
ResNet-152  | Math               | 66.0   | 39.5   | 27.7   | 20.8   | 17.4   | 58.8
ResNet-152  | Physics–chemistry  | 63.1   | 37.8   | 26.5   | 19.9   | 16.6   | 56.2
ResNet-152  | Geography          | 53.8   | 32.2   | 22.6   | 16.9   | 14.2   | 47.9
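The BLEU-1 fitness threshold of 0.5 used in the GA can be illustrated with a simple unigram BLEU computation: clipped unigram precision times a brevity penalty. The candidate caption and LO text below are illustrative; the paper computes these scores with the mscocoapi implementation.

```python
from collections import Counter
import math

def bleu1(candidate, reference):
    """Unigram BLEU: clipped unigram precision times a brevity penalty."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    cand_counts, ref_counts = Counter(cand), Counter(ref)
    # Clip each candidate unigram count by its count in the reference.
    clipped = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    precision = clipped / len(cand)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * precision

generated_caption = "a bear walking in the forest"   # hypothetical CNN-LSTM caption
lo_text = "bears live in the forest"                 # hypothetical plaintext LO
score = bleu1(generated_caption, lo_text)
accepted = score >= 0.5                              # GA acceptance threshold
```

Here three of the six candidate unigrams ("in", "the", "forest") match the reference, giving a precision of 0.5, which sits exactly at the acceptance threshold.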
The captions (texts) produced by the CNN-LSTM were best for the biology and foreign language topics, as the Mscoco dataset's thing-category instances were most suitable for these topics. The worst results were obtained for geography, which is more closely related to the stuff category, for which there are few instances in the dataset. If the Mscoco dataset were enhanced with different types of thing- and stuff-category instances for the panoptic segmentation task, better results could be obtained with our proposed system. In addition, the success of the system depends on the LO texts, which should contain suitable NERs for the corresponding segmented images. Using the proposed system, the same segmented image instances can be reused for different text-based LOs and for different fields of science. For example, a segmented dog image can be used to visualize texts in foreign language for the phrase "The dog is playing ball," in biology for "Dogs have four legs," and in math for "How many dogs are in the garden?" In this way, the reusability of the LOs is ensured. Furthermore, to examine the efficiency and validity of the proposed approach with regard to the human factor, 250 text-based LOs were selected randomly according to the proportion of each science field in Table 2. The distribution of the 250 generated images for these text-based LOs is shown in Table 4.
Table 4
Distribution of text-based LO count according to science topic for human evaluation
Topic                   | Text-based LO count
Foreign language (FLA)  | 50
History–arts (HIS)      | 25
Biology (BIO)           | 65
Math (MAT)              | 50
Physics–chemistry (PHY) | 35
Geography (GEO)         | 25
The 250 generated images were shown to seven instructors (I1 to I7) and 20 students (S1 to S20) at the secondary school level, who labeled each image as usable or not usable for the defined text-based LO. We then calculated how many images were labeled as usable and how many as unusable according to the instructors' (Table 5) and students' (Table 6) labeling.
Table 5
Comparison of usable images for different topics according to instructors
Instructors    | FLA   | HIS   | BIO   | MAT   | PHY   | GEO
I1             | 43    | 19    | 58    | 41    | 26    | 18
I2             | 48    | 23    | 64    | 43    | 30    | 18
I3             | 46    | 24    | 62    | 44    | 29    | 22
I4             | 50    | 21    | 61    | 45    | 32    | 18
I5             | 46    | 20    | 63    | 47    | 32    | 20
I6             | 47    | 23    | 62    | 42    | 31    | 20
I7             | 46    | 22    | 63    | 43    | 33    | 17
Mean           | 46.57 | 21.71 | 61.85 | 43.57 | 30.43 | 19
Percentage (%) | 93.14 | 86.85 | 95.16 | 87.14 | 86.94 | 76
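The Mean and Percentage rows of Table 5 follow directly from the per-instructor counts and the per-topic LO totals in Table 4, as this small sanity-check sketch shows:

```python
# Usable-image counts per topic from the seven instructors (Table 5).
instructor_counts = {
    "FLA": [43, 48, 46, 50, 46, 47, 46],
    "HIS": [19, 23, 24, 21, 20, 23, 22],
    "BIO": [58, 64, 62, 61, 63, 62, 63],
    "MAT": [41, 43, 44, 45, 47, 42, 43],
    "PHY": [26, 30, 29, 32, 32, 31, 33],
    "GEO": [18, 18, 22, 18, 20, 20, 17],
}
# Total generated images per topic for the human evaluation (Table 4).
lo_totals = {"FLA": 50, "HIS": 25, "BIO": 65, "MAT": 50, "PHY": 35, "GEO": 25}

means = {t: sum(c) / len(c) for t, c in instructor_counts.items()}
percentages = {t: 100 * means[t] / lo_totals[t] for t in means}
```

For instance, the biology mean of 61.86 usable images out of 65 generated images yields the 95.16% reported in the table.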
Table 6
Comparison of usable images for different topics according to students
Students       | FLA  | HIS   | BIO   | MAT  | PHY   | GEO
S1             | 39   | 19    | 49    | 36   | 27    | 15
S2             | 47   | 23    | 62    | 42   | 29    | 16
S3             | 41   | 24    | 59    | 40   | 31    | 19
S4             | 45   | 23    | 56    | 44   | 32    | 18
S5             | 46   | 22    | 60    | 42   | 31    | 17
S6             | 41   | 23    | 60    | 43   | 32    | 17
S7             | 44   | 22    | 62    | 41   | 33    | 16
S8             | 45   | 23    | 61    | 39   | 33    | 19
S9             | 46   | 22    | 61    | 47   | 31    | 16
S10            | 45   | 24    | 59    | 45   | 32    | 16
S11            | 41   | 21    | 63    | 45   | 30    | 19
S12            | 45   | 22    | 60    | 40   | 33    | 17
S13            | 47   | 23    | 60    | 39   | 30    | 15
S14            | 43   | 24    | 58    | 43   | 34    | 19
S15            | 48   | 24    | 59    | 43   | 32    | 20
S16            | 42   | 21    | 59    | 41   | 30    | 19
S17            | 46   | 25    | 62    | 45   | 31    | 15
S18            | 40   | 22    | 61    | 42   | 32    | 19
S19            | 42   | 23    | 57    | 44   | 33    | 16
S20            | 37   | 19    | 55    | 35   | 25    | 15
Mean           | 43.5 | 22.45 | 59.15 | 41.8 | 31.05 | 17.15
Percentage (%) | 87   | 89.8  | 91    | 83.6 | 88.71 | 68.6
The results are similar to the proposed system's results with regard to the proportion and ranking of values. Ranked by the percentage of generated images labeled as usable, from highest to lowest, the instructors ordered the categories BIO, FLA, MAT, PHY, HIS, and GEO, while the students ordered them BIO, HIS, PHY, FLA, MAT, and GEO. The usable and not-usable image counts as labeled by instructors and students are given in Online Resource 1 for each topic. For an example in the field of biology, if the LO text is "Bears live in the forest," NLP detects the "bear" and "forest" NERs (Fig. 8). Next, the input segmented image instances for "bear" in the thing category and "forest" in the stuff category are taken from the LOR by searching the LO metadata. In the first chromosome SC1, the resize is 300 × 200, the rotation is 60°, the location coordinates are (50, 40), the flipping value is 0, and the z-index is 1. After the iterations, when the GA stops, the output image is obtained. In the corresponding last chromosome SCn of the output image, the resize is 135 × 90, the rotation is 0°, the location coordinates are (65, 90), the flipping value is 0, and the z-index is 1 (Fig. 8). The related images are then combined according to the fitness value produced by the GA module to create an image (Fig. 9).
Fig. 8
Segmented image instance and generated image with GA module for “Bears live in the forest” text in biology topic
Fig. 9
GA fitness value for “Bears live in the forest” text in biology topic
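The chromosome structure used in the example (resize, rotation, location, flipping, z-index per segmented instance) can be sketched as a simple data structure. The value ranges and the `random_gene` helper below are illustrative assumptions, not the paper's exact encoding.

```python
from dataclasses import dataclass
import random

@dataclass
class InstanceGene:
    """Placement gene for one segmented image instance, mirroring the
    parameters in the example: size, rotation, coordinates, flip, z-order."""
    width: int
    height: int
    rotation_deg: int
    x: int
    y: int
    flip: int       # 0 = no flip, 1 = horizontal flip
    z_index: int    # drawing order on the 640 x 480 canvas

def random_gene(rng, canvas_w=640, canvas_h=480):
    """Hypothetical initializer for one gene; bounds are assumptions."""
    return InstanceGene(
        width=rng.randrange(50, canvas_w),
        height=rng.randrange(50, canvas_h),
        rotation_deg=rng.randrange(0, 360),
        x=rng.randrange(0, canvas_w),
        y=rng.randrange(0, canvas_h),
        flip=rng.randrange(0, 2),
        z_index=rng.randrange(0, 5),
    )

rng = random.Random(7)
# One chromosome = one placement gene per detected NER ("bear", "forest").
chromosome = [random_gene(rng) for _ in ("bear", "forest")]
```

Crossover and mutation then operate on these gene fields, and the rendered composition is scored by the caption-similarity fitness.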
Another example can be provided for the foreign language topic. If the LO text is "There is an apple on the table," NLP detects the "apple" and "table" NERs (Fig. 10). Next, segmented image instances for "apple" and "table" in the thing category are taken from the LOR by searching the LO metadata. The related images are combined according to the fitness value produced by the GA module to create an image (Fig. 11).
Fig. 10
Segmented image instance and generated image with GA module for “There is an apple on the table” text in foreign language topic
Fig. 11
GA fitness value for “There is an apple on the table” text in foreign language topic
In our proposed system, a service works in the background of the content visualization system. This service is responsible for detecting stored plaintext-based LOs in the system. When a plaintext-based LO is detected, the GA-based image generation module attempts to visualize the corresponding text-based LO using the panoptically segmented and stored image instances. This service runs automatically every day, as well as whenever a new text-based LO is added to the system; moreover, the user can start it manually whenever required. Thus, there is no time or space (memory) limitation for image generation. Nevertheless, the performance and time complexity of the GA were analyzed further. Metaheuristic algorithms are generally compared empirically rather than using Big-O notation because such algorithms do not guarantee that the global solution will be found within a given time period. For this reason, in addition to the GA, different metaheuristic algorithms were tested: artificial bee colony (ABC), differential evolution (DE), and mutation-based particle swarm optimization (MPSO). For comparison, the same population size and maximum iteration count were defined for all algorithms (200 and 100, respectively), and the other tuning parameters were also optimized for each algorithm. For the GA, the crossover rate was 0.7, the mutation rate was 0.25, the elitism rate was 0.1, roulette wheel selection was used for selecting potentially useful solutions for recombination, and single-point crossover and random resetting gene mutation were used [86]. For the ABC, the onlooker bee ratio was 0.5, the employed bee ratio was 0.5, and the scout number was 1 [87].
For the DE, the crossover factor was 0.8 and the scaling factor was 0.5 [88]. For the MPSO, the inertia weight was 0.1, the lower bound was 1, the upper bound was 2, the mutation rate (added to avoid stagnation after finding a local minimum) was 0.1, and the attraction terms were equal (c1 = 2 and c2 = 2) [89]. In addition, the same computer system configuration, the same data (an image generated from panoptically segmented instances), the same similarity method (BLEU-1), and the same backbone model (ResNet-152) were used. For the biology example given in Fig. 8 above, the best cost (BC) (Fig. 12) and computation time (CT) (Fig. 13) were calculated for all algorithms.
Fig. 12
Comparison of cost values of algorithms
Fig. 13
Comparison of computation time per iteration for algorithms
As shown in Fig. 12, the best BC over 100 iterations was obtained by the GA (0.280), followed by MPSO (0.315), DE (0.330), and ABC (0.342). The GA reached its BC at iteration 87, and the value did not change through the end of the run. As shown in Fig. 13, DE required the least CT per iteration as well as the least total time over 100 iterations; the next lowest CT was obtained with MPSO, and the GA and ABC had the third-lowest and the highest CT, respectively. CT can be reduced with more powerful computer configurations. In our proposed system, there is no time limitation because a background service tracks plaintext LOs and attempts to visualize them daily. If the user wants to produce an instant visualization by running the service manually, the GA produces an image within 5–8 s per iteration, and the total generation time for the tested iteration count (100) is nearly 671.304 s. Moreover, to evaluate the effectiveness of the proposed study over several runs, further calculations were performed to compare the GA, ABC, DE, and MPSO algorithms in terms of CT and BC. In this process, each algorithm was run 100 times for 10 image generations (attempts to visualize text-based LOs) per topic (FLA, HIS, BIO, MAT, PHY, and GEO), and CT values were calculated for each topic. For example, for the FLA-1 image generation, each algorithm was run 100 times, and the minimum, maximum, and mean values of these 100 runs were calculated from the CT values (Online Resource 1). This process was performed for each topic and its corresponding image generations. The best CT values were obtained by the DE, MPSO, GA, and ABC algorithms, in that order, with no significant practical differences among the algorithms in terms of CT (Table 7). The same processes were performed for the BC values (Online Resource 1), for which the best values were obtained by the GA, MPSO, DE, and ABC algorithms, in that order.
In addition, the Wilcoxon rank-sum test [90] was used to compare the GA with the other three algorithms (ABC, DE, and MPSO) and to show significance (Table 7). p-values were calculated comparing ABC, DE, and MPSO against the GA for both CT and BC. In all calculations, the p-values were well below the significance threshold (α = 0.05), so the distributions were significantly different. A "+" sign indicates that the GA yielded significantly better performance than the compared algorithm (ABC, DE, or MPSO) with respect to CT or BC, and a "−" sign indicates that the GA yielded significantly worse performance.
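The rank-sum comparison above can be sketched with a small self-contained implementation using the normal approximation (the paper presumably used a statistics package; the CT samples below are illustrative values in the spirit of Table 7, not the actual runs):

```python
import math

def wilcoxon_rank_sum(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation:
    pool and rank both samples (average ranks for ties), sum the ranks
    of x, then standardize against the null mean and standard deviation."""
    pooled = sorted([(v, "x") for v in x] + [(v, "y") for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1      # average rank, 1-based
        i = j + 1
    w = sum(r for r, (_, label) in zip(ranks, pooled) if label == "x")
    n1, n2 = len(x), len(y)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))    # two-sided p-value
    return z, p

# Illustrative per-run computation times (seconds): GA vs. DE.
ga_ct = [686.1, 681.4, 680.6, 686.1, 682.4, 685.1]
de_ct = [643.5, 643.6, 643.5, 645.2, 644.0, 644.8]
z, p = wilcoxon_rank_sum(ga_ct, de_ct)
significant = p < 0.05
```

With non-overlapping samples like these, the test rejects the null hypothesis, matching the "−" entries for CT in the DE-vs-GA columns (DE is faster).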
Table 7
Comparison of algorithms in terms of computation time (CT) and best cost (BC)
ImageNo | Topic | GA MeanCT | GA MeanBC | ABC MeanCT (CT, ABC vs GA p-value) | ABC MeanBC (BC, ABC vs GA p-value) | DE MeanCT (CT, DE vs GA p-value) | DE MeanBC (BC, DE vs GA p-value) | MPSO MeanCT (CT, MPSO vs GA p-value) | MPSO MeanBC (BC, MPSO vs GA p-value)
1 | FLA | 686.074 | 0.389 | 704.695 +(5.979e−08) | 0.446 +(2.524e−34) | 643.500 −(6.195e−24) | 0.429 +(1.093e−33) | 664.218 −(9.33e−11) | 0.408 +(6.467e−17)
2 | FLA | 681.433 | 0.389 | 699.075 +(1.085e−07) | 0.448 +(2.524e−34) | 643.559 −(4.452e−23) | 0.428 +(1.757e−33) | 661.222 −(2.895e−09) | 0.407 +(2.213e−15)
3 | FLA | 680.568 | 0.390 | 705.573 +(1.118e−12) | 0.446 +(2.524e−34) | 643.500 −(1.313e−21) | 0.429 +(4.738e−34) | 666.260 −(1.044e−05) | 0.407 +(1.525e−19)
4 | FLA | 686.136 | 0.390 | 706.024 +(1.399e−08) | 0.446 +(2.524e−34) | 645.179 −(8.558e−24) | 0.429 +(1.126e−33) | 663.216 −(5.76e−12) | 0.409 +(5.898e−19)
5 | FLA | 682.441 | 0.389 | 708.272 +(3.619e−13) | 0.444 +(2.524e−34) | 643.952 −(2.729e−23) | 0.431 +(7.417e−34) | 665.644 −(4.407e−07) | 0.408 +(1.215e−18)
6 | FLA | 685.112 | 0.388 | 703.649 +(1.707e−08) | 0.446 +(2.524e−34) | 644.772 −(1.017e−25) | 0.429 +(2.228e−33) | 666.716 −(9.358e−09) | 0.409 +(2.278e−19)
7 | FLA | 681.744 | 0.389 | 701.471 +(4.719e−09) | 0.443 +(2.524e−34) | 648.643 −(5.28e−18) | 0.428 +(1.656e−33) | 664.649 −(9.753e−07) | 0.411 +(7.989e−22)
8 | FLA | 681.744 | 0.390 | 704.469 +(1.114e−10) | 0.448 +(2.524e−34) | 646.470 −(1.63e−19) | 0.427 +(1.656e−33) | 665.857 −(4.729e−06) | 0.410 +(3.944e−20)
9 | FLA | 685.113 | 0.389 | 703.979 +(8.746e−08) | 0.444 +(2.524e−34) | 648.159 −(1.042e−19) | 0.427 +(2.583e−33) | 663.029 −(1.562e−10) | 0.407 +(2.519e−14)
10 | FLA | 685.440 | 0.391 | 701.678 +(4.256e−06) | 0.446 +(2.524e−34) | 645.144 −(3.01e−23) | 0.428 +(1.345e−33) | 662.184 −(1.326e−11) | 0.408 +(3.69e−17)
11 | HIS | 687.066 | 0.420 | 703.315 +(2.135e−06) | 0.475 +(2.524e−34) | 646.668 −(1.588e−23) | 0.455 +(1.672e−31) | 665.070 −(3.924e−11) | 0.436 +(9.361e−13)
12 | HIS | 680.210 | 0.417 | 700.553 +(3.618e−09) | 0.475 +(2.524e−34) | 646.218 −(6.3e−19) | 0.456 +(4.802e−33) | 665.578 −(6.468e−06) | 0.438 +(1.918e−18)
13 | HIS | 685.897 | 0.418 | 700.991 +(7.317e−07) | 0.475 +(2.524e−34) | 647.592 −(1.685e−24) | 0.455 +(5.75e−32) | 666.034 −(7.173e−10) | 0.436 +(1.451e−14)
14 | HIS | 684.482 | 0.419 | 700.592 +(4.729e−06) | 0.477 +(2.524e−34) | 647.335 −(2.254e−21) | 0.459 +(9.721e−33) | 662.807 −(1.587e−10) | 0.439 +(8.625e−18)
15 | HIS | 680.361 | 0.418 | 702.912 +(3.924e−11) | 0.478 +(2.524e−34) | 646.492 −(7.927e−21) | 0.453 +(1.448e−31) | 664.952 −(8.719e−07) | 0.436 +(8.052e−16)
16 | HIS | 684.111 | 0.418 | 702.433 +(1.207e−07) | 0.475 +(2.524e−34) | 645.826 −(1.121e−22) | 0.456 +(3.218e−32) | 667.543 −(7.791e−07) | 0.437 +(1.966e−15)
17 | HIS | 686.857 | 0.418 | 702.351 +(3.967e−06) | 0.476 +(2.524e−34) | 643.000 −(5.005e−27) | 0.457 +(1.85e−32) | 670.223 −(1.308e−07) | 0.438 +(8.052e−16)
18 | HIS | 685.513 | 0.419 | 703.453 +(6.144e−08) | 0.474 +(2.524e−34) | 646.990 −(6.417e−23) | 0.455 +(2.103e−31) | 666.260 −(1.065e−08) | 0.437 +(3.988e−16)
19 | HIS | 682.945 | 0.419 | 703.589 +(1.145e−08) | 0.474 +(2.524e−34) | 648.034 −(2.74e−20) | 0.456 +(1.721e−31) | 664.710 −(5.898e−08) | 0.438 +(3.533e−16)
20 | HIS | 683.277 | 0.419 | 701.174 +(1.257e−07) | 0.478 +(2.524e−34) | 644.705 −(3.486e−23) | 0.457 +(1.055e−31) | 668.829 −(3.562e−05) | 0.437 +(5.65e−15)
21 | BIO | 683.771 | 0.379 | 699.891 +(2.497e−06) | 0.438 +(2.524e−34) | 647.064 −(4.203e−22) | 0.420 +(9.699e−34) | 665.548 −(6.852e−08) | 0.399 +(2.328e−17)
22 | BIO | 683.628 | 0.382 | 700.383 +(3.034e−07) | 0.439 +(2.524e−34) | 645.718 −(3.516e−21) | 0.419 +(1.979e−33) | 664.603 −(1.659e−08) | 0.401 +(1.186e−17)
23 | BIO | 680.929 | 0.382 | 703.520 +(1.719e−10) | 0.436 +(2.524e−34) | 644.208 −(1.548e−21) | 0.418 +(7.641e−34) | 664.569 −(1.577e−06) | 0.399 +(6.347e−15)
24 | BIO | 683.590 | 0.379 | 700.923 +(2.806e−07) | 0.437 +(2.524e−34) | 647.212 −(2.231e−20) | 0.418 +(6.388e−34) | 665.867 −(4.735e−08) | 0.401 +(2.052e−21)
25 | BIO | 681.981 | 0.378 | 699.622 +(3.642e−08) | 0.438 +(2.524e−34) | 646.411 −(4.037e−23) | 0.419 +(4.079e−34) | 669.167 −(0.0001023) | 0.399 +(1.022e−20)
26 | BIO | 684.619 | 0.380 | 704.723 +(3.259e−08) | 0.434 +(2.524e−34) | 647.097 −(9.759e−21) | 0.421 +(4.203e−34) | 668.886 −(1.915e−06) | 0.400 +(9.768e−19)
27 | BIO | 684.676 | 0.379 | 703.118 +(1.596e−07) | 0.437 +(2.524e−34) | 644.826 −(8.783e−22) | 0.418 +(6.582e−34) | 665.475 −(4.582e−09) | 0.399 +(3.222e−18)
28 | BIO | 681.952 | 0.381 | 702.756 +(1.006e−09) | 0.437 +(2.524e−34) | 650.809 −(2.138e−18) | 0.420 +(4.203e−34) | 664.359 −(5.433e−08) | 0.398 +(4.181e−17)
29 | BIO | 683.375 | 0.379 | 699.667 +(7.989e−07) | 0.437 +(2.524e−34) | 642.656 −(4.054e−24) | 0.417 +(1.608e−33) | 666.859 −(4.819e−07) | 0.398 +(1.089e−17)
30 | BIO | 683.794 | 0.379 | 704.895 +(1.006e−08) | 0.438 +(2.524e−34) | 646.296 −(3.73e−22) | 0.419 +(1.06e−33) | 661.517 −(1.467e−11) | 0.400 +(5.541e−20)
31 | MAT | 685.528 | 0.399 | 703.885 +(1.613e−08) | 0.456 +(2.524e−34) | 646.067 −(3.762e−24) | 0.437 +(1.268e−33) | 666.055 −(7.867e−09) | 0.422 +(2.032e−23)
32 | MAT | 682.600 | 0.399 | 703.263 +(8.114e−10) | 0.457 +(2.524e−34) | 644.565 −(1.061e−21) | 0.435 +(1.646e−32) | 666.014 −(6.785e−07) | 0.415 +(2.93e−14)
33 | MAT | 679.983 | 0.399 | 704.086 +(9.614e−12) | 0.455 +(2.524e−34) | 647.632 −(4.171e−18) | 0.437 +(6.258e−33) | 661.426 −(1.885e−08) | 0.419 +(8.443e−18)
34 | MAT | 683.626 | 0.398 | 703.807 +(3.119e−09) | 0.454 +(2.524e−34) | 646.698 −(6.607e−22) | 0.437 +(8.902e−33) | 665.889 −(1.683e−07) | 0.419 +(1.09e−18)
35 | MAT | 685.898 | 0.397 | 702.399 +(1.345e−06) | 0.457 +(2.524e−34) | 643.520 −(1.128e−25) | 0.437 +(3.682e−33) | 666.141 −(3.512e−09) | 0.417 +(3.684e−20)
36 | MAT | 683.960 | 0.399 | 704.435 +(9.092e−09) | 0.456 +(2.524e−34) | 644.422 −(2.15e−22) | 0.438 +(6.445e−33) | 664.510 −(2.769e−09) | 0.418 +(6.333e−16)
37 | MAT | 685.023 | 0.398 | 700.976 +(9.866e−06) | 0.458 +(2.524e−34) | 647.040 −(1.347e−20) | 0.439 +(3.575e−33) | 661.091 −(1.419e−11) | 0.419 +(1.022e−20)
38 | MAT | 683.251 | 0.400 | 701.489 +(3.125e−08) | 0.456 +(2.524e−34) | 646.905 −(2.005e−21) | 0.436 +(6.092e−32) | 664.445 −(7.532e−09) | 0.419 +(1.152e−16)
39 | MAT | 681.551 | 0.400 | 704.478 +(9.033e−11) | 0.456 +(2.524e−34) | 642.252 −(3.472e−22) | 0.438 +(5.402e−33) | 667.264 −(3.96e−05) | 0.419 +(2.047e−18)
40 | MAT | 681.034 | 0.399 | 703.697 +(2.856e−10) | 0.457 +(2.524e−34) | 646.510 −(9.737e−20) | 0.439 +(1.029e−33) | 665.756 −(6.93e−06) | 0.417 +(5.845e−16)
41 | PHY | 685.284 | 0.409 | 705.144 +(1.399e−08) | 0.467 +(2.524e−34) | 646.712 −(1.773e−22) | 0.448 +(1.029e−33) | 663.793 −(1.045e−10) | 0.425 +(3.988e−16)
42 | PHY | 681.623 | 0.409 | 703.599 +(8.604e−11) | 0.467 +(2.524e−34) | 647.851 −(1.317e−20) | 0.446 +(2.269e−32) | 667.578 −(3.377e−05) | 0.426 +(7.071e−14)
43 | PHY | 684.904 | 0.409 | 702.120 +(1.946e−07) | 0.463 +(2.524e−34) | 650.318 −(1.783e−21) | 0.445 +(3.036e−32) | 667.440 −(6.061e−08) | 0.427 +(1.028e−14)
44 | PHY | 681.207 | 0.410 | 701.324 +(2.939e−09) | 0.464 +(2.524e−34) | 645.245 −(3.202e−21) | 0.448 +(4.832e−32) | 658.578 −(7.574e−12) | 0.427 +(3.367e−13)
45 | PHY | 683.088 | 0.408 | 703.278 +(6.608e−09) | 0.464 +(2.524e−34) | 645.559 −(1.572e−22) | 0.447 +(8.645e−33) | 664.899 −(6.143e−09) | 0.430 +(3.288e−20)
46 | PHY | 680.253 | 0.407 | 702.731 +(7.434e−11) | 0.467 +(2.524e−34) | 646.706 −(2.131e−20) | 0.446 +(3.471e−33) | 666.581 −(1.782e−05) | 0.428 +(5.93e−20)
47 | PHY | 683.139 | 0.409 | 703.032 +(3.954e−09) | 0.467 +(2.524e−34) | 644.808 −(2.052e−21) | 0.445 +(1.422e−32) | 664.268 −(3.796e−08) | 0.425 +(2.291e−14)
48 | PHY | 685.585 | 0.407 | 700.179 +(6.209e−05) | 0.466 +(2.524e−34) | 644.689 −(2.774e−25) | 0.446 +(2.079e−32) | 665.828 −(3.396e−10) | 0.426 +(6.106e−15)
49 | PHY | 685.678 | 0.408 | 699.419 +(6.339e−05) | 0.467 +(2.524e−34) | 648.648 −(1.376e−21) | 0.446 +(3.084e−33) | 666.424 −(4.194e−09) | 0.427 +(2.328e−17)
50 | PHY | 683.486 | 0.409 | 701.972 +(3.397e−08) | 0.467 +(2.524e−34) | 650.309 −(1.046e−20) | 0.447 +(3.721e−32) | 664.201 −(6.608e−09) | 0.427 +(3.754e−16)
51 | GEO | 687.473 | 0.428 | 702.772 +(7.01e−06) | 0.487 +(2.524e−34) | 646.323 −(9.463e−23) | 0.466 +(5.918e−32) | 663.746 −(1.8e−12) | 0.448 +(4.389e−15)
52 | GEO | 685.262 | 0.430 | 702.776 +(2.88e−07) | 0.485 +(2.524e−34) | 642.928 −(4.625e−25) | 0.465 +(2.293e−30) | 665.416 −(7.423e−09) | 0.446 +(5.446e−14)
53 | GEO | 680.008 | 0.427 | 704.974 +(1.288e−12) | 0.485 +(2.524e−34) | 646.567 −(1.333e−19) | 0.463 +(3.624e−31) | 666.612 −(4.22e−05) | 0.447 +(6.881e−17)
54 | GEO | 684.702 | 0.430 | 701.981 +(6.212e−07) | 0.486 +(2.524e−34) | 649.047 −(1.695e−20) | 0.465 +(6.786e−31) | 665.315 −(4.241e−08) | 0.450 +(1.491e−15)
55 | GEO | 682.994 | 0.426 | 706.373 +(7.705e−12) | 0.487 +(2.524e−34) | 647.893 −(3.202e−21) | 0.464 +(2.782e−32) | 665.382 −(2.052e−08) | 0.447 +(2.228e−19)
56 | GEO | 683.214 | 0.428 | 701.944 +(1.455e−07) | 0.484 +(2.524e−34) | 646.176 −(5.281e−23) | 0.464 +(4.694e−32) | 665.490 −(2.718e−08) | 0.447 +(3.393e−16)
57 | GEO | 685.309 | 0.429 | 701.021 +(4.841e−06) | 0.484 +(2.524e−34) | 649.150 −(4.543e−21) | 0.465 +(1.025e−31) | 664.905 −(4.32e−09) | 0.446 +(2.164e−14)
58 | GEO | 687.199 | 0.428 | 704.053 +(5.831e−07) | 0.484 +(2.524e−34) | 647.099 −(3.319e−24) | 0.466 +(1.367e−31) | 664.856 −(2.681e−11) | 0.446 +(1.11e−14)
59 | GEO | 681.816 | 0.431 | 703.053 +(7.08e−11) | 0.483 +(2.524e−34) | 644.165 −(1.863e−24) | 0.466 +(3.729e−31) | 666.110 −(1.132e−06) | 0.446 +(1.288e−10)
60 | GEO | 684.228 | 0.430 | 704.413 +(1.659e−08) | 0.485 +(2.524e−34) | 647.113 −(5.334e−22) | 0.465 +(1.266e−29) | 663.103 −(2.217e−10) | 0.446 +(2.06e−13)
The mean values and p-values for each comparison with the GA are shown in Table 7, which indicates some differences between these algorithms in terms of CT and BC values. In this study, time is not a constraint, so any stochastic algorithm could be used for this process. However, BC is important because the fitness values of the solutions in these algorithms are related to BC, and the lowest cost value means the best fitness. Over several runs, the best BC values were obtained with the GA, which is why we chose the GA for the GA-based image generation module of our proposed system to generate images from segmented image instances.

The results obtained in our study lead to the following contributions:
- E-content generation has high costs and can waste considerable time. Using panoptically segmented image instances to create new images with a GA provides reusability.
- Deep learning and GAs provide an effective and interesting perspective on the image visualization problem. Developing such hybrid systems is a trending topic that will likely remain popular, and deep learning and other artificial intelligence methods are currently the strongest components of such systems.
- Several studies have generated images, especially using generative adversarial networks, but no study has yet generated images from panoptically segmented image instances obtained from LOs. Thus, our proposed system provides reusability of LOs.

There are also some gaps (limitations) that can be expressed as contributive suggestions for further studies by interested researchers:
- The method has been designed for an easy-to-use approach, and the application parameters of the proposed methods may be optimized in further detail to improve the system.
- This study focused only on the Mscoco dataset because it supports panoptic segmentation; for further comparison and evaluation, additional data (thing and stuff instances) could be added to this dataset.

In sum, using the proposed content visualization system, text-based LOs in the LOR can be visualized automatically and effectively for e-content production and for other educational purposes, such as web-based learning systems, education portals, and learning management systems. Given the importance of text visualization in the literature, the proposed system, enhanced with artificial intelligence, increases the quality of education and the level of learning. Moreover, the generated images can be used not only for educational purposes but also for other text-based digital repositories, advertisements, entertainment, and so on. For example, if we want to visualize text while preparing a PowerPoint presentation, we can search the web for suitable images, which may take time; with the proposed automatic and intelligent content visualization system, slides can be visualized in a novel and efficient way. For these reasons, this software benefits education by saving content generation time and decreasing costs.
Conclusion
The increasing demand for web-based and distance education has triggered new solutions for e-content generation, which is difficult and time-consuming. LOs in the LOR can be used for this purpose, but most LOs are text-based, and for educational purposes, text-only content may not be sufficiently effective. Visualization techniques such as images attract users, improve their concentration, and clarify the information given in the content. For this reason, in this study, a novel, automatic, and intelligent content visualization system was developed. This large-scale system consists of three subsystems (modules): an AUNET-based PIS module, a GA-based image generation module, and a CNN-LSTM-based image-captioning module. In the first module, LO images undergo panoptic segmentation, and the detected instances are stored in the LOR with LOs and metadata. The second module generates images for the text in LOs. The produced images are sent to the CNN-LSTM module, which generates text for them. The generated text and the corresponding LO text are compared using similarity metrics, and the image that achieves the best value is selected as the image for the text-based LO. Segmentation and similarity results were tested for different LOs and topics. The test results showed that the proposed system can be used to visualize text in LOs for educational purposes as well as in other digital visualization systems. In the future, using these segmented images and the proposed GA-based image generation system, long texts could be converted into dynamic animations. In addition, using our fitness evaluation function, text could be generated automatically for paraphrasing tools (the system can generate text similar to a reference text).