Ark Verma, Vivek Sikarwar, Himanshu Yadav, Ranjith Jaganathan, Pawan Kumar.
Abstract
We present Shabd, a psycholinguistic database in Hindi. It is based on a corpus of 1.4 billion words from electronic newspapers and news websites. Word frequencies and part of speech information have been derived and are made available in a cleaned list of 34 thousand hand-selected words, and a list of 96 thousand words observed with a frequency of more than 100 times in the corpus. Next to the Shabd database, we also make a list with all 2.3 million word types available and a list with the 2.5 million most frequent word pairs (word bigrams). The quality of the word frequency measure was tested in two lexical decision tasks. We observed that the Shabd word frequencies outperform existing frequencies based on smaller corpora of newspapers but not the Worldlex word frequencies based on an analysis of blogs. We also observed that word frequency accounts for as much variance as contextual diversity (operationalized as the number of documents in which the words were observed). The Shabd database is freely available for research.
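The three measures named in the abstract (word frequency, contextual diversity as a document count, and word bigram counts) can be illustrated with a minimal sketch. This is not the authors' pipeline: the whitespace tokenization, the toy documents, and the function names `corpus_counts`, `frequent_words`, and `per_million` are all assumptions made for illustration only.

```python
from collections import Counter


def corpus_counts(documents):
    """Count word tokens, documents per word, and adjacent word pairs.

    Assumption: tokens are separated by whitespace. A real Hindi corpus
    would need proper tokenization and cleaning.
    """
    word_freq = Counter()    # total occurrences of each word
    doc_freq = Counter()     # number of documents containing each word
    bigram_freq = Counter()  # occurrences of each adjacent word pair
    for text in documents:
        tokens = text.split()
        word_freq.update(tokens)
        doc_freq.update(set(tokens))  # each word counted once per document
        bigram_freq.update(zip(tokens, tokens[1:]))
    return word_freq, doc_freq, bigram_freq


def frequent_words(word_freq, min_count):
    """Return words observed more than min_count times."""
    return [word for word, count in word_freq.items() if count > min_count]


def per_million(count, total_tokens):
    """Normalize a raw count to occurrences per million tokens."""
    return count * 1_000_000 / total_tokens


# Hypothetical toy corpus of three "documents".
toy_documents = [
    "यह घर है",
    "वह घर नया है",
    "घर घर",
]

word_freq, doc_freq, bigram_freq = corpus_counts(toy_documents)
total_tokens = sum(word_freq.values())

print(word_freq["घर"], doc_freq["घर"])  # raw frequency vs. document count
print(frequent_words(word_freq, 1))      # toy stand-in for the >100 cutoff
```

In this toy corpus the word घर occurs four times but in only three documents, which shows how raw frequency and contextual diversity can diverge for the same word.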
Keywords: Akshara; Contextual diversity; Corpus; Devanagari; Hindi; Lexical decision; Visual word recognition; Word frequency
Year: 2021 PMID: 34357542 DOI: 10.3758/s13428-021-01625-2
Source DB: PubMed Journal: Behav Res Methods ISSN: 1554-351X