
subs2vec: Word embeddings from subtitles in 55 languages.

Jeroen van Paridon, Bill Thompson.

Abstract

This paper introduces a novel collection of word embeddings, numerical representations of lexical semantics, in 55 languages, trained on a large corpus of pseudo-conversational speech transcriptions from television shows and movies. The embeddings were trained on the OpenSubtitles corpus using the fastText implementation of the skipgram algorithm. Performance comparable with (and in some cases exceeding) embeddings trained on non-conversational (Wikipedia) text is reported on standard benchmark evaluation datasets. A novel evaluation method of particular relevance to psycholinguists is also introduced: prediction of experimental lexical norms in multiple languages. The models, as well as code for reproducing the models and all analyses reported in this paper (implemented as a user-friendly Python package), are freely available at: https://github.com/jvparidon/subs2vec .
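The lexical norm evaluation mentioned in the abstract can be illustrated with a minimal sketch: fit a regression from word vectors to human ratings on some words, then check how well it predicts ratings for held-out words. Everything below is an assumption made for illustration. The abstract does not name the regression model, so ridge regression is used here as one plausible choice; the vectors and ratings are synthetic stand-ins rather than subs2vec embeddings or published norms; and the function names `fit_ridge` and `predict` are hypothetical, not part of the subs2vec package.

```python
import numpy as np


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean  # center features so the intercept is not penalized
    yc = y - y_mean
    n_features = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc for the weight vector w.
    weights = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    intercept = y_mean - x_mean @ weights
    return weights, intercept


def predict(X, weights, intercept):
    """Predict ratings from word vectors with the fitted linear model."""
    return X @ weights + intercept


# Synthetic stand-ins (assumption): 200 "words" with 20-dimensional vectors
# and a rating that is a noisy linear function of the vector.
rng = np.random.default_rng(0)
n_words, n_dims = 200, 20
vectors = rng.normal(size=(n_words, n_dims))
true_weights = rng.normal(size=n_dims)
ratings = vectors @ true_weights + rng.normal(scale=0.1, size=n_words)

# Fit on the first 150 words and evaluate on the remaining 50.
train, held_out = slice(0, 150), slice(150, None)
weights, intercept = fit_ridge(vectors[train], ratings[train], alpha=1.0)
predicted = predict(vectors[held_out], weights, intercept)

# Pearson correlation between predicted and observed held-out ratings.
correlation = np.corrcoef(predicted, ratings[held_out])[0, 1]
print(f"held-out correlation: {correlation:.3f}")
```

With real data, `vectors` would hold the embeddings for the words that appear in a norms dataset and `ratings` the corresponding human judgments; the held-out correlation then indicates how much of the norm is recoverable from the embedding space.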

Keywords:  Distributional semantics; Lexical norms; Multilingual; Word embeddings

Year:  2021        PMID: 32789660     DOI: 10.3758/s13428-020-01406-3

Source DB:  PubMed          Journal:  Behav Res Methods        ISSN: 1554-351X


