A comparative study of the origin, structure, and indexing language of the Persian and English keywords of articles indexed in the IranMedex database and their compliance with the Persian medical thesaurus and Medical Subject Headings.

Parastoo Parsaei-Mohammadi¹, Ali Hossein Ghasemi¹, Raziyeh Hassanzadeh-Beheshtabad¹.

Abstract

INTRODUCTION: Thesauri are indexing tools that play an effective role in integrating retrieval, preventing the fragmentation and multiplication of terminology, and representing the information content of documents. GOALS: This study aimed to investigate the keywords of articles indexed in IranMedex in terms of origin, structure, and indexing language, and their compliance with the Persian Medical Thesaurus and Medical Subject Headings (MeSH).
MATERIALS AND METHODS: This is an applied study conducted as a survey. The statistical population comprised 32,850 Persian articles indexed in IranMedex during the years 1385-1391 (Iranian calendar). A sample of 379 articles was selected. Data were collected using a checklist and analyzed with SPSS software.
FINDINGS: Although the proportions of the different origins of Persian and English keywords indexed in IranMedex did not differ significantly, the compliance rates of Persian and English keywords with the Persian medical thesaurus and MeSH differed across years. Keywords leaned mainly toward phrasal and single-word structures, and the majority were selected from titles and abstracts.
CONCLUSION: Authors' familiarity with thesauri and controlled vocabulary tools promotes homogeneity in keyword assignment and enables more precise, faster, and easier retrieval. It is suggested that a mixture of natural and controlled languages be used in this database in order to reach more comprehensive results.

Keywords: Indexing terms; IranMedex database; medical subject headings; thesauri

Year: 2017 PMID: 28546967 PMCID: PMC5433633 DOI: 10.4103/jehp.jehp_137_14

Source DB: PubMed Journal: J Educ Health Promot ISSN: 2277-9531

INTRODUCTION

The sudden growth of information, together with time limits on its processing and publication, has created a trend toward nonbook sources such as articles, documents, and records, and these sources are now the basis of most scientific studies. At the same time, the inefficiency of traditional methods of organizing these sources, such as cataloging, has drawn attention to newer methods such as abstracting and indexing, making indexing the most important tool for describing content and for searching and retrieving articles. Indexing languages were subsequently created as systems for describing and representing the subjects and concepts present in articles; they include free language, natural language, and controlled language. In free-language indexing, any word or phrase that adequately describes the article can be used. Since there are no specific rules for formulating queries, most opt for this approach, which in turn places the entire burden of the search on the information seeker, who has to account for every possible expression of the subject.[1,2] In natural-language indexing, the indexer extracts all keywords from the text of the article exactly as the author used them.[3] This method can lead to inconsistencies and false drops but dramatically increases the speed of indexing.[4] Controlled-language indexing uses a list of terms that accounts for synonyms and distinguishes reference from nonreference terms.
One of the best examples of a controlled vocabulary for indexing is the thesaurus, which is unfortunately rarely used because most information seekers are unfamiliar with it, leading to the predominance of natural language in most searches.[5] Given the advantages and disadvantages of these three types of indexing, it is of utmost importance to use the indexing method that best facilitates information retrieval and meets the needs of end users. One such method is the use of thesauri. As a language tool, a thesaurus establishes compositional and semantic relations between words and phrases and plays an important role in the user's selection and editing of search terms, thus increasing the relevance of the information retrieved from databases and search engines. A thesaurus also helps expand queries, increases the interaction between user and system, improves retrieval, and creates useful access points for retrieving information. Another role of the thesaurus is to help in conceptual analysis and better translation of the phrases used in an article during the indexing process.[6] Databases covering different subject ranges, in turn, play an important role in keeping the knowledge of experts in various fields up to date. Given the variety of these databases and of the information sources indexed in them, end users should preferably be familiar with a database's characteristics, such as its content, search options, output options, search user interface, indexing methods, and storage and retrieval tools, in order to conduct a successful and easy search.
Before that, however, the designers and creators of databases need to understand the role of thesauri and descriptors in increasing the recall and precision of searches. They should try to increase the consistency, precision, and recall of their databases by selecting a suitable indexing language and by attending to the structure and consistent selection of descriptors in line with established standards, thus guiding users to relevant information. Although most database creators and designers believe in the necessity of selecting a suitable indexing language, following the rules of descriptor selection, and choosing words consistently, there is still some chaos and inconsistency in the keyword structure and indexing methods of databases. This may be due to not following indexing standards, rules, and guidelines, to the lack of an informed indexer, or to neglect of indexing tools such as thesauri. Whatever the reason, such inconsistency can disrupt the services of the database and turn searching and retrieval into a time-consuming and overwhelming endeavor. It is therefore important to study the methods used to index and to create keywords and descriptors in databases, their compliance with the structural rules of descriptors and with scientific thesauri, and, by investigating the origin of keywords, the indexing language and approach used by database creators, in order to facilitate effective communication between the indexed sources and end users.

Some of the most important studies on these topics are as follows. Iranshahi and Davarpanah investigated title keywords and indexers' descriptors in the Iran theses abstracts database. The study population was all records of this database (65,536 records) from its beginning to the end of the year 1381. The results showed that 67% of the indexers' descriptors were title keywords.
In terms of form, 48% of the descriptors were single nouns, 52% were compound words, 87% were singular, and 13% were plural.[7] Baniegbal et al. compared the words of thesis titles and abstracts with the descriptors assigned in the records of the National Library and Archives of Iran. Their statistical population was 18,951 dissertations from the years 1380 to 1387. Their results showed that the compliance between title words and descriptors was 47%, and the compliance between abstract words and descriptors was 53.5%. About 75% of the descriptors were words used in the title or abstract, or their synonyms. Furthermore, the compliance of title and abstract words with the descriptors had not changed during the investigated years.[8] Nowkarizi and Dehghani investigated how well keywords extracted from abstracts matched indexers' descriptors in the dissertation abstracts database of Iran. Their statistical population was all records present in this database between the years 1368 and 1385 (74,500 records). The findings showed a meaningful relation between the number of keywords in the dissertation abstracts and the compliance of those keywords with descriptors, and between the number of keywords extracted from the abstract and the number of assigned descriptors. In terms of structure, 50.26% of the descriptors were single nouns, 49.74% were compound words, 93.2% were singular, and 6.8% were plural.[9] Fattahi and Nikzaman studied the subject searches of students in the faculties of agriculture and of educational science, psychology, and librarianship of Ferdowsi University of Mashhad and investigated their compliance with Persian subject headings. Their findings showed that 98.4% of the subject searches conducted in these faculties complied with Persian subject headings, which are more likely to be used in natural-language indexing.
Furthermore, there was no meaningful difference in compliance between subject search phrases and Persian subject headings, and compliance with Persian subject headings did not differ meaningfully between faculties.[10] Naghneh-Esfehani et al. conducted a comparative study of the Persian and English keywords of theses from the Isfahan University of Medical Sciences and their compliance with the Persian medical thesaurus and Medical Subject Headings (MeSH). Their results showed a meaningful relation between the compliance of the Persian and English keywords of dissertations with the Persian medical thesaurus and MeSH. Most of the keywords had compound structures, and the ratio of keywords extracted from the title to keywords extracted from the abstract differed between the two languages. The results also showed that the percentages of keywords that fully overlap with the medical thesaurus and of those with relative or bad matching differ between Persian and English. In addition, more than 50% of the keywords in both languages used natural indexing, and different syntactic structures had different usage ratios in each language.[11] Bartol investigated the usefulness of nonagricultural databases and controlled vocabularies (thesauri) in the retrieval and organization of subjects related to agriculture. The aim was to identify the thesaurus-linked tree structures, the controlled subject headings and terms (heading words and descriptors), and the characteristics of the main databases, and to evaluate how the use of controlled words can improve search results compared with uncontrolled words. The results showed that although all investigated thesauri have headings related to agricultural subjects, the differences between databases in nonhierarchical relations, hierarchical relations, and synonyms are of great importance in the organization and retrieval of agriculture-related information.
In some cases, using subject headings in the title can improve search and retrieval of information by 60%.[12] Shultz mapped medical acronyms and initialisms to MeSH across selected systems. The author searched MeSH for 415 medical acronyms and initialisms and 46 common acronyms. The results showed that, apart from the 46 common acronyms, medical acronyms and initialisms have no suitable equivalent, which highlights an important shortcoming in MeSH. Since these acronyms and initialisms are used often in the medical literature, the lack of suitable equivalents in MeSH, and their omission from indexing, can make it impossible to retrieve most related texts.[13] Kabirzadeh et al. surveyed the match between the keywords of articles published in the Journal of Mazandaran University of Medical Sciences between 2009 and 2010 and MeSH. The results showed that 80 keywords (30%) fully overlapped with MeSH, only 1 article (1.4%) had keywords that all overlapped with MeSH, and 8 articles (11.4%) had no keywords that overlapped with MeSH. In addition, 17 articles with keywords from MeSH were retrieved.[14]

A closer look at the previous studies shows that each of them emphasizes the importance of compliance with thesauri and subject headings and holds that the origin of keywords, their syntactic structures, and the indexing policies of databases, including the indexing language, greatly affect the description and content analysis of the subjects covered. The necessity of studying how databases are indexed was noted above. Proper indexing is even more important in medical databases because of the great number of related subjects, the speed at which knowledge is produced in these fields, and the need of the medical community for quick access to the latest findings.
The Iranian database of medical science articles (IranMedex) is one of the specialized databases in the field of medical science. It indexes the articles published in Iranian medical journals along with their full text and, according to its managers, covers any article published after the year 1361 (1982), with the help of the private sector. No previous study of the indexing of this database was found. Processes in information management such as indexing are of financial and scientific importance because, by minimizing inconsistencies in the keywords of articles, they enable users to access information with the least time and effort and the highest possible efficiency. This study therefore aimed to investigate the keywords of articles indexed in the IranMedex database in terms of origin, structure, and indexing language, and their compliance with the Persian Medical Thesaurus and MeSH. One of the factors investigated is the origin of the keywords, because this factor affects subject relevance and is a good measure of the relation between the database and its users. As for syntactic structure, since this database focuses on a specialized subject area, knowing and using this structure can make communication more effective. In other words, if the user phrases a query in compliance with this structure, the results will be more relevant to the query, which increases the effectiveness of information retrieval. Knowing the indexing languages of the database and of the user can also improve their communication: if, for example, the database uses controlled language while the user uses natural language, they use different words for the same concept, which decreases the number and relevance of retrieved articles.
The results of this study can be used to identify the possible strengths and weaknesses of the IranMedex database and to help its managers improve their services.
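The introduction refers repeatedly to the precision and recall of searches without defining them. The short Python sketch below uses the standard information-retrieval definitions (precision is the share of retrieved documents that are relevant; recall is the share of relevant documents that are retrieved). The function name and the document identifiers are hypothetical and serve only as an illustration; they do not come from the study.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single search.

    retrieved: ids of the documents returned by the search
    relevant:  ids of the documents that actually answer the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were found
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search: 4 documents retrieved, 5 relevant documents exist,
# and 3 of the retrieved documents are relevant.
p, r = precision_recall({"a1", "a2", "a3", "a9"},
                        {"a1", "a2", "a3", "a4", "a5"})
print(p, r)
```

In this made-up case, a query phrased with terms that the database does not use would shrink the set of hits and lower recall, which is the effect the introduction attributes to a mismatch between the user's language and the indexing language.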

Research questions

1. What is the relative frequency distribution of the origin of Persian and English words indexed in IranMedex database between years 2006 and 2013?
2. What is the relative frequency distribution of the syntactic structure of Persian and English words indexed in IranMedex database between years 2006 and 2013?
3. What is the relative frequency distribution of compliance of Persian and English words indexed in IranMedex database between years 2006 and 2013 with MeSH?

Research hypotheses

1. The ratio of words selected from the title to words selected from abstract (the origin of the indexing) indexed in IranMedex database between years 2006 and 2013 is different in Persian and English words.
2. More than 50% of the Persian and English keywords indexed in IranMedex database between years 2006 and 2013 use natural language.
3. There is a difference between the compliance of Persian and English words indexed in IranMedex database between years 2006 and 2013 with Persian medical thesaurus and MeSH in different years.

MATERIALS AND METHODS

This is an applied study using a survey method. Data were gathered by document review, examining the use of keywords in the titles and abstracts of articles, the origin, structure, and indexing language of the keywords, and their compliance with descriptors. The statistical population was the 32,850 Persian articles indexed in the IranMedex database between 2006 and 2013. Because the English articles indexed in this database lack Persian keywords, they were excluded in order to obtain a uniform population, and the study covered only Persian articles. The data were gathered using a checklist whose rows and columns recorded the journal title, publication year, volume, number, and Persian keywords; the compliance of the keywords with Persian MeSH (full overlap, relative compliance, bad compliance); the syntactic structure of the keywords (single noun, phrasal including adjectival or additional, single, plural, anagram and nonanagram, descriptor with qualifier, abbreviation, and punctuation mark); and the origin of the keywords (title, abstract, title and abstract, neither title nor abstract). The same information was entered in a separate checklist for the English keywords. The gathered data were then compared with the second edition of Persian MeSH[15] and with MeSH.[16] The total sample size was calculated as 379 articles using the Cochran formula, and the number of samples from each year was determined by proportional allocation. The data were extracted from the checklists and analyzed using IBM SPSS Statistics 21 (IBM Corp., Armonk, NY) at a confidence level of 95% (α = 0.05).
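The sample size reported above can be cross-checked with a short calculation. The Python sketch below applies the Cochran formula with a finite-population correction. The article does not state the parameters it used, so the values z = 1.96 (95% confidence), p = 0.5 (maximum variability), and e = 0.05 (margin of error) are assumptions, as is the function name.

```python
def cochran_sample_size(population, z=1.96, p=0.5, e=0.05):
    """Cochran sample size with finite-population correction.

    z, p, and e are assumed defaults; the article does not report them.
    """
    # Sample size for an effectively infinite population
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)
    # Correction for a finite population of the given size
    return n0 / (1 + (n0 - 1) / population)


n = cochran_sample_size(32850)
print(round(n, 2))
```

Under these assumed parameters the formula gives a value between 379 and 380, which is consistent with the reported sample of 379 articles if the result is truncated rather than rounded up.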

RESULTS

Relative frequency distribution of the origin of Persian and English words indexed in IranMedex database between years 2006 and 2013

To determine the relative frequency distribution of the origin of Persian and English words indexed in the IranMedex database between 2006 and 2013, the origin of the indexed keywords was investigated. According to Table 1, among all 1428 Persian keywords indexed between 2006 and 2013, the most common origin was “Title and abstract” with 56.5% of the keywords and the least common origin was “Title” with 1.8% of the total. Among the 1449 English keywords indexed during the investigated years, the most common origin was “Title and abstract” with 53.9%, while the least common origin was “Title” with 2.9%. The most common origin of both Persian and English keywords is therefore “Title and abstract,” with very similar percentages in the two languages.
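The four origin categories used above (title, abstract, title and abstract, neither title nor abstract) can be illustrated with a small Python sketch. The authors recorded origin manually with a checklist; the case-insensitive substring match below is an assumed simplification, and the function name and sample texts are hypothetical.

```python
def keyword_origin(keyword, title, abstract):
    """Classify where a keyword appears, using case-insensitive substring matching.

    Substring matching is an assumption made for illustration only; it is not
    the procedure described in the article.
    """
    kw = keyword.strip().lower()
    in_title = kw in title.lower()
    in_abstract = kw in abstract.lower()
    if in_title and in_abstract:
        return "title and abstract"
    if in_title:
        return "title"
    if in_abstract:
        return "abstract"
    return "neither title nor abstract"


# Hypothetical record, not taken from IranMedex
title = "Indexing terms in medical databases"
abstract = "We study indexing terms and thesauri used by authors."

print(keyword_origin("Indexing terms", title, abstract))
print(keyword_origin("thesauri", title, abstract))
print(keyword_origin("MeSH", title, abstract))
```

A real implementation for Persian text would also need to normalize spelling variants and inflected forms, which simple substring matching does not handle.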

Table 1

Relative frequency distribution of the origin of Persian and English words indexed in IranMedex database between years 2006 and 2013

Relative frequency distribution of the syntactic structure of Persian and English words indexed in IranMedex database between years 2006 and 2013

To determine the relative frequency distribution of the syntactic structure of Persian and English words indexed in the IranMedex database between 2006 and 2013, the syntactic structure of the indexed keywords was investigated. The results presented in Table 2 show that among the 1428 Persian keywords, the most common structure was “additional” with 35.2%, and the least common structure was “abbreviation” with 0%. Among the 1449 English keywords, the most common structure was “single noun” with 33.4%, while the least common structure was “descriptor with qualifier” with a frequency of 0%.

Table 2

Relative frequency distribution of the syntactic structure of Persian and English words indexed in IranMedex database between years 2006 and 2013

In general, the most common syntactic structure among the Persian and English words indexed in IranMedex in the investigated period is the “phrasal” structure. More than 97% of both English and Persian keywords were either single nouns or phrasal, while other structures such as anagrams and nonanagrams, descriptors with qualifier, abbreviations, and punctuation marks were rare. For brevity, and since the goal is to compare the Persian and English languages, the two structures “anagram” and “punctuation mark,” which were not used in either language, are not presented in the table.

Relative frequency distribution of compliance of Persian and English words indexed in IranMedex database between years 2006 and 2013 with Medical Subject Headings

Table 3 shows that among the 1428 Persian keywords indexed in IranMedex, 65% had bad compliance, 25.9% had full compliance, and only 9.1% had relative compliance with the headings. Of the 1449 English keywords, 61.6% had bad compliance, 32.6% had full compliance, and only 5.8% had relative compliance with MeSH. Most keywords thus had bad compliance with the headings in both languages, with similar percentages, while the share of English keywords with full compliance was higher than that of Persian keywords.

Table 3

Relative frequency distribution of compliance of Persian and English words between years 2006 and 2013 with MeSH

The ratio of words selected from the title to words selected from abstract (the origin of the indexing) indexed in IranMedex database between years 2006 and 2013 is different in Persian and English words

A nonparametric (χ2) test on the indexing origin of the Persian and English keywords indexed in the IranMedex database between 2006 and 2013 gave χ2 = 5.408 with 3 degrees of freedom at α = 0.05. The P value is 0.144, which is higher than 0.05, so this hypothesis is rejected. It can therefore be concluded that there is no meaningful difference between the origin of Persian and English keywords.

More than 50% of the Persian and English keywords indexed in IranMedex database between years 2006 and 2013 use natural language

The Chi-square test gave a value of 207.803 with 1 degree of freedom at α = 0.05. The P value is below 0.001 (reported as 0.000), which is < 0.05 and thus supports the above hypothesis. This means that more than 50% of the Persian and English keywords indexed during the investigated period use natural indexing language.
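The P values reported in this section can be cross-checked from the test statistics and their degrees of freedom. The Python sketch below is an independent check using only the standard library, not the authors' SPSS procedure. It relies on the closed forms of the chi-square upper-tail probability for 1 and 3 degrees of freedom, the only cases that occur here; the function name is hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable.

    Closed forms exist for odd degrees of freedom; only df = 1 and df = 3
    are implemented because those are the cases reported in the article.
    """
    t = x / 2.0
    if df == 1:
        return math.erfc(math.sqrt(t))
    if df == 3:
        return math.erfc(math.sqrt(t)) + 2.0 * math.sqrt(t / math.pi) * math.exp(-t)
    raise ValueError("closed form implemented only for df = 1 and df = 3")


p_origin = chi2_sf(5.408, 3)     # origin of Persian vs English keywords
p_natural = chi2_sf(207.803, 1)  # share of natural-language keywords
p_years = chi2_sf(22.979, 3)     # compliance across years

print(f"{p_origin:.3f}")
print(p_natural < 0.001, p_years < 0.001)
```

The first value agrees with the reported P = 0.144. The other two are far below 0.001, which is why statistical software displays them as 0.000.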

There is a difference between the compliance of Persian and English words indexed in IranMedex database between years 2006 and 2013 with Persian medical thesaurus and Medical Subject Headings in different years

A nonparametric (χ2) test gave χ2 = 22.979 with 3 degrees of freedom at α = 0.05. The P value is below 0.001 (reported as 0.000), which is lower than 0.05 and shows a meaningful difference between the compliance of Persian and English keywords with MeSH in different years. This means that Persian and English keywords have different percentages of full compliance, relative compliance, and bad compliance.
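As an illustration of how a chi-square test of independence works on the compliance data, the Python sketch below rebuilds approximate keyword counts from the rounded percentages reported for Table 3 and computes the statistic for a 2 × 3 table (language by compliance level). This is not the authors' analysis: the counts are reconstructed, the function names are hypothetical, and the table here has 2 degrees of freedom whereas the article reports 3, so the authors' contingency table presumably had a different layout.

```python
def counts_from_percentages(n, percentages):
    """Approximate category counts from a total and rounded percentages."""
    return [round(n * pct / 100) for pct in percentages]


def chi2_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Column order: bad, full, relative compliance (percentages as reported for Table 3)
persian = counts_from_percentages(1428, [65.0, 25.9, 9.1])
english = counts_from_percentages(1449, [61.6, 32.6, 5.8])

stat, df = chi2_independence([persian, english])
print(persian, english, round(stat, 2), df)
```

With these reconstructed counts the statistic comes out at roughly 22.8, close to the reported 22.979. Because of the difference in degrees of freedom, this should be read as a plausibility check only and not as a reproduction of the published test.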

DISCUSSION

According to the findings, the relative frequency distributions of the origin of Persian and English keywords indexed in the IranMedex database between 2006 and 2013 follow a similar trend, and the most common origin for both is “Title and abstract.” The study by Esfahani et al.[11] showed that more than half of the keywords were selected from the abstract. Baniegbal et al.[8] reported that more than 75% of the keywords were words selected from the title and abstract or their synonyms, and Iranshahi and Davarpanah[7] reported that 67% of the keywords were selected from the title. Similarly, Bartol[12] suggests that using subject headings in the title and abstract can improve search and retrieval of information by 60%. A comparison with the previous studies shows that, because the selection, detection, and identification of keywords affect the effectiveness of communication, and because keywords selected from the title and abstract are more relevant to the subject at hand, results retrieved with these keywords are closer to those sought by the user. Using different keywords, on the other hand, can reduce the number and thus the comprehensiveness of the search results. In terms of syntactic structure, no keywords with the structure “anagram” or “punctuation mark” were used in either language.
Furthermore, no Persian keyword had the structure “abbreviation,” while no English keyword was a “descriptor with qualifier.” The “phrasal” (including descriptive or additive) and “single noun” (including singular or plural) structures are the most common in both English and Persian, which agrees with the results reported by Esfahani et al.,[11] Nowkarizi and Dehghani,[9] and Iranshahi and Davarpanah.[7] The study by Shultz[13] showed that there is no suitable synonym for most abbreviations and initialisms in MeSH, which highlights an important weakness in MeSH. A comparison of the syntactic structures of the keywords suggests that some structures are more common among authors, probably because they are compatible with the language. Another reason for the popularity of some structures is that most specialized phrases use them, which increases their use. In terms of compliance with MeSH, there is a meaningful difference between Persian and English keywords over different years, and the keywords with bad compliance with Persian MeSH and MeSH are about twice as numerous as the keywords that fully comply. The percentage of English keywords with full compliance with MeSH is also higher than the percentage of Persian keywords with full compliance, which agrees with the results reported by Esfahani et al.[11] These results also confirm the study by Nowkarizi and Dehghani,[9] which reported a meaningful difference between the compliance of keywords from different subject areas and time periods with different descriptors. They are also similar to the results of Baniegbal et al.,[8] who reported no increasing or decreasing trend in the compliance of title and abstract keywords with subject headings between the years 1380 and 1387.
In the study by Kabirzadeh et al.,[14] 30% of the keywords used over a 1-year period fully complied with MeSH, while 11.4% of the articles had no keywords that complied with MeSH. The study by Fattahi and Nikzaman,[10] however, showed no meaningful difference between the compliance levels of the different subject search phrases used by students with different subject headings. In general, the fact that fewer authors used a thesaurus for Persian keywords than for English keywords may be due to the novelty of Persian MeSH compared with its English counterpart, its low circulation, and its inaccessibility to most authors. MeSH, by contrast, is older, is available in a free electronic format, is associated with the PubMed database, which is one of the best-known databases among medical science experts, is easier to use, and is regularly updated, all of which increase its use and thus the compliance of English keywords with it. The hypothesis that more than 50% of the Persian and English keywords between 2006 and 2013 use natural indexing language was confirmed, which agrees with the results reported by Esfahani et al.[11] The reason for this predominance of natural language is that most authors are unfamiliar with controlled language and its tools and lack the skills needed to use it. In addition, the search engines of most databases work mostly according to criteria more in line with natural language, evaluating documents by the number of repetitions of the search terms in each document. As a result, most descriptors are selected from natural language, which decreases the number and relevance of the retrieved articles.

CONCLUSION

In general, it can be concluded that the origin of most Persian and English keywords was “Title and abstract” and that most authors used phrasal and single-noun keywords. Furthermore, the compliance of English keywords with MeSH is higher than the compliance of Persian keywords with Persian MeSH, and more Persian keywords have bad compliance with Persian MeSH than English keywords have with MeSH. Because of the importance of databases in medical science and of improving users' utilization of medical information, this study investigated the IranMedex database, one of the most important and largest Iranian medical databases. The findings showed that most authors were unfamiliar with the thesaurus as a vital tool for the standardization and retrieval of information. This results in the lack of a uniform policy for selecting and using keywords and indexing languages, and in a lack of homogeneity in the keywords. In other words, the thesaurus, an indexing and information retrieval tool used by most databases, is not used in the IranMedex database, which can reduce the relevance and accuracy of search results. Since authors' familiarity with thesauri and controlled language tools can increase the homogeneity of the keywords entered in databases, leading to a more homogeneous database and to faster, more accurate, and more relevant retrieval of information, it is suggested that a mixture of natural and controlled languages be used in the IranMedex database.

Financial support and sponsorship

Ahvaz Jundishapur University of Medical Sciences, Ahvaz, Iran.

Conflicts of interest

There are no conflicts of interest.

1. Mapping of medical acronyms and initialisms to Medical Subject Headings (MeSH) across selected systems.

Authors: Mary Shultz
Journal: J Med Libr Assoc Date: 2006-10

2. Survey of keyword adjustment of published articles medical subject headings in journal of mazandaran university of medical sciences (2009-2010).

Authors: Azar Kabirzadeh; Hasan Siamian; Ebrahim Bagherian Farah Abadi; Benyamin Mohseni Saravi
Journal: Acta Inform Med Date: 2013
