Gustavo L Estivalet, Fanny Meunier.
Abstract
In this article, we present the Brazilian Portuguese Lexicon, a new word-based corpus for psycholinguistic and computational linguistic research in Brazilian Portuguese. We describe the development of the corpus and the specific characteristics of the internet site and database provided for user access. We also perform distributional analyses of the corpus and compare it with other current databases. Our main objective was to provide a large, reliable, and useful word-based corpus with a dynamic, easy-to-use, and intuitive interface and free internet access for word and word-criteria searches. We used the Núcleo Interinstitucional de Linguística Computacional's corpus as the basic data source and developed the Brazilian Portuguese Lexicon by deriving and adding metalinguistic and psycholinguistic information about Brazilian Portuguese words. We obtained a final corpus with more than 30 million word tokens, 215 thousand word types, and 25 categories of information about each word. This corpus was made available on the internet via a free-access site with two search engines: a simple search and a complex search. The simple engine searches for a list of words, while the complex engine accepts all types of criteria over the corpus categories. The output presents all corpus entries that match the criteria specified in the search and can be downloaded as a .csv file. We created a module in the results that delivers basic statistics about each search. The Brazilian Portuguese Lexicon also provides a pseudoword engine and specific tools for linguistic and statistical analysis. The Brazilian Portuguese Lexicon is therefore a convenient instrument for stimulus search, selection, control, and manipulation in psycholinguistic experiments, and it is also a powerful database for computational linguistics research and language modeling related to lexicon distribution, functioning, and behavior.
Year: 2015 PMID: 26630138 PMCID: PMC4668042 DOI: 10.1371/journal.pone.0144016
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Numbers of word tokens, word types, and lemmas by grammatical category before and after data processing for the Brazilian Portuguese Lexicon.
| Gram. Cat. | Tokens NILC | Types NILC | Lemmas NILC | Tokens LexPorBR | Types LexPorBR | Lemmas LexPorBR |
|---|---|---|---|---|---|---|
| Adjectives | 1,842,597 | 46,249 | 24,478 | 1,829,473 | 40,537 | 24,058 |
| Adverbs | 1,455,573 | 3,611 | 2,857 | 1,455,573 | 2,938 | 2,723 |
| Grammatical | 15,717,557 | 1,809 | 480 | 15,702,419 | 1,144 | 455 |
| Nouns | 7,113,649 | 100,328 | 66,189 | 7,079,524 | 82,097 | 64,421 |
| Numerals | 949,766 | 58,672 | 61,341 | 340,428 | 136 | 54,942 |
| Verbs | 4,298,528 | 105,432 | 14,620 | 4,298,528 | 88,323 | 14,154 |
| Proper names | - | - | 301,860 | - | - | 293,198 |
Numbers, columns, and descriptions of the Brazilian Portuguese Lexicon.
| Nb | Column | Description |
|---|---|---|
| 1 | orthography | Orthographic representation |
| 2 | gram_cat | Grammatical category |
| 3 | gram_inf | Grammatical information |
| 4 | ortho_freq | Orthographic frequency |
| 5 | ortho_freq/M | Orthographic frequency per million |
| 6 | log10_ortho_freq | Log10 from ortho_freq |
| 7 | zipf_scale | Standardized frequency scale |
| 8 | zipf_rank | Zipf’s rank-frequency distribution |
| 9 | nb_letters | Number of letters |
| 10 | nb_homogr | Number of homographs |
| 11 | homographs | Homograph grammatical categories |
| 12 | pu_ortho | Orthographic uniqueness point |
| 13 | ortho_neigh | Orthographic neighborhood |
| 14 | old20 | Orthographic Levenshtein Distance 20 words |
| 15 | cvcv_ortho | Consonant/vowel CVCV structure |
| 16 | bigrams | Bigrams representation |
| 17 | bigram_freq | Bigram frequency |
| 18 | trigrams | Trigrams representation |
| 19 | trigram_freq | Trigram frequency |
| 20 | rev_ortho | Reverse orthography |
| 21 | rev_cvcv_ortho | Reverse CVCV structure |
| 22 | rev_bigrams | Reverse bigrams |
| 23 | rev_trigrams | Reverse trigrams |
| 24 | random | Random number between 0–1 |
| 25 | id | Identity number (position) |
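Several of the columns above can be derived from the orthographic form and its raw frequency alone. The following is a minimal sketch of how such values might be computed, not the site's actual implementation. It assumes the Zipf scale is defined in the commonly used way, as log10 of the frequency per million plus 3, and that bigrams and trigrams are plain overlapping letter sequences without boundary markers. The function names and the vowel inventory are hypothetical.

```python
import math

# Assumed vowel inventory for Brazilian Portuguese orthography (illustrative).
VOWELS = set("aeiouáàâãéêíóôõúü")


def per_million(freq, corpus_tokens):
    """Frequency per million tokens (cf. the ortho_freq/M column)."""
    return freq * 1_000_000 / corpus_tokens


def zipf_scale(freq, corpus_tokens):
    """Zipf value, assuming the usual definition log10(freq per million) + 3."""
    return math.log10(per_million(freq, corpus_tokens)) + 3


def cvcv(word):
    """Map each letter to C or V; non-letters (e.g. hyphens) are kept as-is."""
    out = []
    for ch in word.lower():
        if not ch.isalpha():
            out.append(ch)
        elif ch in VOWELS:
            out.append("V")
        else:
            out.append("C")
    return "".join(out)


def ngrams(word, n):
    """Overlapping letter n-grams (n=2 for bigrams, n=3 for trigrams)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def describe(word, freq, corpus_tokens):
    """Build a partial record using the column names from the table above."""
    return {
        "orthography": word,
        "ortho_freq": freq,
        "ortho_freq/M": per_million(freq, corpus_tokens),
        "log10_ortho_freq": math.log10(freq),
        "zipf_scale": zipf_scale(freq, corpus_tokens),
        "nb_letters": len(word),
        "cvcv_ortho": cvcv(word),
        "bigrams": ngrams(word, 2),
        "trigrams": ngrams(word, 3),
        "rev_ortho": word[::-1],
        "rev_cvcv_ortho": cvcv(word)[::-1],
    }


if __name__ == "__main__":
    # Made-up frequency and corpus size, for illustration only.
    print(describe("amor", 3000, 30_000_000))
```

The frequency and corpus size in the usage line are invented for illustration and are not taken from the corpus.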
Conventions used in the grammatical category and grammatical information columns in the search engines and results at the Brazilian Portuguese Lexicon.
| Convention | Meaning | Example |
|---|---|---|
| adj | adjective | caro, jovem, linda, velho |
| adv | adverb | quase, sempre, seguido, também |
| gram | grammatical | com, depois, para, que |
| nom | noun | cachorro, dedo, roda, tábua |
| num | numeral | 3, 9, 1°, 8ª |
| ver | verb | comer, deitou, ralado, viajará |
| prop | proper name | América, Inglaterra, João, São Paulo |
| conj | conjunction | e, mas, mesmo, logo |
| det | determiner | a, os, um, umas |
| prep | preposition | com, de, sem, sobre |
| pro | pronoun | eu, estas, mim, isso |
| m | masculine | armário, ele, gato, touro |
| f | feminine | ela, gata, mesa, zebra |
| s | singular | barril, casa, este, flor |
| p | plural | barris, casas, estas, flores |
| 1 | first person | cantei, durmo, jogamos, veremos |
| 2 | second person | cantaste, dormes, jogais, vereis |
| 3 | third person | cantou, dorme, jogaram, verão |
| c1 | 1st class conjugation | cantar, jogar, pescar, raspar |
| c2 | 2nd class conjugation | comer, depor, por, viver |
| c3 | 3rd class conjugation | dormir, sorrir, vestir, zumbir |
| ind | indicative | como, diria, deste, viajará |
| sub | subjunctive | coma, diga, desse, viajarmos |
| imp | imperative | coma, diga, demos, viajemos |
| pre | present | pego, tocas, olham, sabem |
| perf | preterit perfect | peguei, tocaste, olharam, souberam |
| imp | preterit imperfect | pegava, tocavas, olhavam, sabiam |
| fut | future | pegarei, tocarás, olharão, saberia |
| inf | infinitive | amar, beber, dormir, compor |
| ger | gerund | amando, bebendo, dormindo, compondo |
| pp | past participle | amado, bebido, dormido, composto |
Symbols used as wildcards in the search engines in the Brazilian Portuguese Lexicon.
| Symbol | Function | Example | Result |
|---|---|---|---|
| _ | substitute exactly one character per underscore | a_o_ | amor, anos, aloe, após |
| % | substitute a chain of characters | am% | amor, ama, amei, amava |
| < | less than | nb_letters | words with fewer than 5 letters |
| > | greater than | nb_homogr | words with more than 2 homographs |
| < > | less than and greater than | ortho_freq | words with orthographic frequency less than 10 and greater than 6 |
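The wildcard behavior in the table can be approximated offline, for example on a downloaded word list. The sketch below is an illustration under stated assumptions, not the site's search code. It assumes `_` stands for exactly one character and `%` for any chain of characters, including an empty one, as in SQL `LIKE`. Because the example result for `a_o_` includes "após", it also assumes that matching ignores diacritics and case. The names `wildcard_to_regex`, `search`, and `filter_range` are hypothetical.

```python
import re
import unicodedata


def strip_accents(text):
    """Remove diacritics so that 'após' compares equal to 'apos' (assumed behavior)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def wildcard_to_regex(pattern):
    """Translate '_' (one character) and '%' (any chain) into a regular expression."""
    parts = []
    for ch in strip_accents(pattern):
        if ch == "_":
            parts.append(".")
        elif ch == "%":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def search(pattern, words):
    """Return the words that match the whole wildcard pattern."""
    rx = wildcard_to_regex(pattern)
    return [w for w in words if rx.fullmatch(strip_accents(w))]


def filter_range(rows, column, greater_than=None, less_than=None):
    """Keep rows whose numeric column is strictly inside the given bounds."""
    kept = []
    for row in rows:
        value = row[column]
        if greater_than is not None and not value > greater_than:
            continue
        if less_than is not None and not value < less_than:
            continue
        kept.append(row)
    return kept


if __name__ == "__main__":
    words = ["amor", "anos", "aloe", "após", "ama", "amei", "amava", "casa"]
    print(search("a_o_", words))
    print(search("am%", words))
    rows = [{"orthography": w, "nb_letters": len(w)} for w in words]
    print(filter_range(rows, "nb_letters", less_than=5))
```

Stripping diacritics before matching is a design choice made only to reproduce the example in the table. A stricter search would compare the accented forms directly.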
Means, with standard deviations in parentheses, by grammatical category.
| Gram. Cat. | Letters | Homographs | Ortho. PU | Ortho. N | OLD20 |
|---|---|---|---|---|---|
| Adjectives | 9.97(2.89) | 1.24(0.47) | 8.15(3.08) | 1.57(3.19) | 2.89(1.18) |
| Adverbs | 12.82(3.39) | 1.09(0.42) | 7.22(2.88) | 1.07(3.93) | 3.47(1.12) |
| Grammatical | 5.83(2.59) | 1.55(0.86) | 4.36(2.03) | 8.37(10.47) | 1.98(1.17) |
| Nouns | 9.17(3.29) | 1.12(0.35) | 6.75(2.91) | 1.81(4.44) | 2.94(1.38) |
| Numerals | 6.98(2.87) | 1.61(0.89) | 5.36(2.36) | 6.43(9.59) | 2.44(1.37) |
| Verbs | 9.41(2.49) | 1.08(0.29) | 7.94(2.41) | 1.86(3.29) | 2.39(0.77) |
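The Ortho. N and OLD20 measures in the table above can be illustrated with a short sketch. It assumes the conventional definitions: an orthographic neighbor is a word of the same length that differs by exactly one substituted letter, and OLD20 is the mean Levenshtein distance from a word to its 20 closest words in the lexicon. Whether the Brazilian Portuguese Lexicon computes them in exactly this way is an assumption, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def ortho_neighbors(word, lexicon):
    """Same-length words that differ from `word` by exactly one letter."""
    return [w for w in lexicon
            if len(w) == len(word)
            and sum(x != y for x, y in zip(w, word)) == 1]


def old_n(word, lexicon, n=20):
    """Mean Levenshtein distance to the n closest words (OLD20 when n=20)."""
    distances = sorted(levenshtein(word, w) for w in lexicon if w != word)
    closest = distances[:n]
    if not closest:
        raise ValueError("lexicon contains no other words")
    return sum(closest) / len(closest)


if __name__ == "__main__":
    toy_lexicon = ["casa", "cama", "caso", "casas", "mesa"]  # toy data
    print(ortho_neighbors("casa", toy_lexicon))
    print(old_n("casa", toy_lexicon, n=3))
```

This brute-force version compares the target with every word in the lexicon, which is fine for a toy list but slow for a lexicon of about 215 thousand word types.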
Relative percentage (%) of word types contained in the LexPorBR, SubtlexBR [15], and Worldlex (Portuguese Brazil) [16] corpora.
Each cell gives the percentage of the word types of the row (left) corpus that are contained in the column (head) corpus.
| | LexPorBR | SubtlexBR | WlBlog | WlTwitter | WlNews |
|---|---|---|---|---|---|
| LexPorBR | 100 | 46.39 | 34.14 | 32.13 | 23.17 |
| SubtlexBR | 63.89 | 100 | 52.26 | 42.57 | 23.13 |
| WlBlog | 53.94 | 50.43 | 100 | 26.93 | 26.04 |
| WlTwitter | 66.52 | 57.93 | 48.46 | 100 | 40.57 |
| WlNews | 60.09 | 60.09 | 45.06 | 37.41 | 100 |
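A relative-percentage table of this kind can be computed with simple set operations. The sketch below is a hedged illustration that assumes each corpus is available as a collection of word types. The function names and the toy data are hypothetical and do not reproduce the reported figures.

```python
def pct_contained(left_types, head_types):
    """Percentage of the left corpus's word types also found in the head corpus."""
    left, head = set(left_types), set(head_types)
    if not left:
        raise ValueError("left corpus has no word types")
    return 100 * len(left & head) / len(left)


def overlap_matrix(corpora):
    """Nested dict: matrix[left][head] = % of left's types contained in head."""
    return {
        left_name: {
            head_name: round(pct_contained(left_types, head_types), 2)
            for head_name, head_types in corpora.items()
        }
        for left_name, left_types in corpora.items()
    }


if __name__ == "__main__":
    # Toy word-type sets, for illustration only.
    corpora = {
        "A": {"casa", "amor", "gato", "mesa"},
        "B": {"casa", "amor"},
    }
    for left_name, row in overlap_matrix(corpora).items():
        print(left_name, row)
```

The measure is asymmetric: a small corpus can be fully contained in a large one while covering only a fraction of it, which is why the table is not symmetric around its diagonal.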
Words overestimated and underestimated by the Brazilian Portuguese Lexicon compared with SUBTLEX-PT-BR [15] and Worldlex (Portuguese Brazil) [16].
The number in parentheses is how many of the most frequent words were checked to obtain the 10 words shown in each list; the Zipf scale range of the words found follows in each column heading.
| Overestimated SubtlexBR (116) 5.77–4.59 | Underestimated SubtlexBR (1343) 4.16–3.62 | Overestimated Worldlex (125) 5.77–4.54 | Underestimated Worldlex (264) 4.44–3.82 |
|---|---|---|---|
| tão | matarei | tão | esmaltes |
| te | meritíssimo | te | presencial |
| se | danar | se | medite |
| cola | consegues | cola | disponibilizados |
| tudo | estrague | tudo | empreendendorismo |
| teve | abaixem | teve | tadinho |
| cambial | larguem | porte | solzinho |
| verdadeiro | percebes | forço | viciei |
| porte | esperaremos | vôo | lindinho |
| petista | odeie | colher | quitosana |