The impact of differences in text segmentation on the automated quantitative evaluation of song-lyrics.

Abstract

The text-evaluation application Coh-Metrix and natural language processing rely on the sentence for text segmentation and analysis and frequently detect sentence limits by means of punctuation. Problems arise when target texts such as pop song lyrics do not follow formal standards of written text composition and lack punctuation in the original. In such cases it is common for human transcribers to prepare texts for analysis, often following unspecified or at least unreported rules of text normalization and relying potentially on an assumed shared understanding of the sentence as a text-structural unit. This study investigated whether the use of different transcribers to insert typographical symbols into song lyrics during the pre-processing of textual data can result in significant differences in sentence delineation. Results indicate that different transcribers (following commonly agreed-upon rules of punctuation based on their extensive experience with language and writing as language professionals) can produce differences in sentence segmentation. This has implications for the analysis results for at least some Coh-Metrix measures and highlights the problem of transcription, with potential consequences for quantification at and above sentence level. It is argued that when analyzing non-traditional written texts or transcripts of spoken language it is not possible to assume uniform text interpretation and segmentation during pre-processing. It is advisable to provide clear rules for text normalization at the pre-processing stage, and to make these explicit in documentation and publication.

Year: 2020 PMID: 33166329 PMCID: PMC7652311 DOI: 10.1371/journal.pone.0241979

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

Automated evaluation of language and text is becoming increasingly sophisticated, with a range of programmes now available to examine text at the word level, sentence level, and even discourse level. Coh-Metrix is an application “at the forefront of these technologies” [1], offering a broad range of measures of local and global, lexical and textual discourse features. Its usefulness is demonstrated by its popularity with applications ranging from research on schizophrenia [2] to higher education [3]. While Coh-Metrix has most often been used for the analysis of printed text, it can, according to its creators, also be used to analyse spoken discourse and non-traditional written texts with “untidy language and discourse” [4] such as typed chat exchanges, instant messaging, emails, and poetry. While certain features frequently found in spoken and non-traditional written texts, including disfluencies and ungrammatical utterances, might impact some of the indices available in Coh-Metrix, the authors assert that “many of the measures are minimally disturbed”. The present study is part of a wider investigation of English pop song lyrics and their suitability for teaching and learning English as a second language (ESL). Song lyrics usually deviate from formal formatting and editing standards of written texts, for example in their text layout, punctuation, and spelling, and they often contain song-specific disfluencies and interjections. In particular, explicit indications of sentence boundaries in lyrics are often omitted. A further complication lies in the presentation of lyrics in verse format where line breaks frequently occur in the middle of sentences and typically do not indicate the end of sentences or paragraphs but rather signal musical groups and melodic structure (although these can correspond to linguistic constituents). 
Given these deviations from widely accepted standards of text formatting, song lyrics might require principled text normalisation, that is, the conversion “to a more convenient standard form” [5] prior to their automated textual assessment. The goal of this study is to understand the potential impact of text normalisation at the pre-processing stage, specifically when it comes to the identification of sentence boundaries, on the results of a Coh-Metrix analysis of English pop song lyrics.

1 The relevance of the sentence in Coh-Metrix 3.0

Natural language processing (NLP) and automated text evaluation frequently rely on the sentence as an underlying unit, and text normalisation often “includes sentence segmentation: breaking up a text into individual sentences using cues like periods or exclamation points” [5]. It appears that Coh-Metrix 3.0—the most recent version available—also relies on the sentence as a basic unit of text segmentation for a large number of its measures. Several descriptive indices specifically focus on sentences in the target texts or use sentences to measure larger units of text, including sentence count (DESSC), sentence length/mean number of words (DESSL), sentence length standard deviation (DESSLd), paragraph length/mean number of sentences (DESPL), and paragraph length standard deviation (DESPLd). Examples are listed in [4] and on www.cohmetrix.com. Coh-Metrix also assesses levels of lexical information of target texts, their syntactic complexity and density, and dimensions of text cohesion. Text cohesion in this context refers to text-inherent linguistic, semantic, and discourse features that affect the “connectedness of concepts presented in a text” [4] and have an effect on text difficulty and reading (or listening) comprehension. It can be speculated that lexical information such as diversity measures, word frequency, word meaningfulness and concreteness, would be largely unaffected by sentence segmentation, whereas certain syntactic indices such as sentence syntax similarity of adjacent sentences (SYNSTRUTa) clearly rely on the sentence as a unit. Similarly, cohesion measures such as noun-, argument-, stem-, and content-word overlap in adjacent sentences or across all sentences in a text require sentence boundaries to be indicated, as do measures of Latent Semantic Analysis (LSA), that is, measures of relative semantic overlap between sentences and paragraphs. 
The same holds for the traditional readability scores included in Coh-Metrix 3.0, the Flesch Reading Ease (RDFRE) and the Flesch-Kincaid Grade Level (READFKGL), which both rely on sentence length (mean number of words per sentence) to compute text difficulty. The Coh-Metrix L2 Readability score (L2), intended to assess the suitability of texts for second language learners, includes the content-word overlap and syntactic similarity of sentences as variables in its formula. Finally, Coh-Metrix 3.0 offers so-called Text Easability Principal Component Scores, based on a principal components analysis (PCA) of 54 indices in a corpus of written texts found across school-grade levels (K-12) and academic subjects (see [4, 6] for further details). Five of these principal components (narrativity, syntactic simplicity, word concreteness, referential cohesion, and deep cohesion) account for 54% of the variance in the test corpus and are provided to Coh-Metrix users as comprehensive scores that are more easily accessible and of more immediate practical use for teaching practitioners. [6] show that sentence-based measures contribute to three of these easability scores (narrativity, syntactic simplicity, referential cohesion).
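To illustrate how strongly the two Flesch scores depend on where sentence boundaries are placed, the following minimal sketch applies the standard published Flesch formulas to one and the same text segmented into either 12 or 5 sentences. The word and syllable counts are hypothetical values chosen for illustration, and the exact implementation inside Coh-Metrix may differ.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula (higher = easier)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level formula (higher = harder)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical text: 100 words and 130 syllables (illustrative values only).
WORDS, SYLLABLES = 100, 130

# Same words, two different segmentations: only the sentence count changes.
for n_sentences in (12, 5):
    fre = flesch_reading_ease(WORDS, n_sentences, SYLLABLES)
    fkgl = flesch_kincaid_grade(WORDS, n_sentences, SYLLABLES)
    print(f"{n_sentences:>2} sentences: FRE = {fre:.2f}, FKGL = {fkgl:.2f}")
```

Because the words and syllables are held constant, any change in the two scores here is attributable to segmentation alone.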

2 The role of punctuation in text segmentation

It appears that in Coh-Metrix 3.0 (and NLP more widely) sentences are not detected based on semantic or syntactic features but rather by means of punctuation. In other words, the end of a sentence is indicated by a full stop, question mark, or exclamation mark; and the sentence unit is detected through the presence of these typographical symbols, which are inserted by the text authors themselves or by members of the research team during the pre-processing phase, based more or less on commonly agreed-upon rules of their use. One issue, however, is that there is no simple definition of what a sentence is [7, 8]. A sentence is generally understood in structural terms as the largest complete grammatical unit which can be analysed into structural constituents or phrases and which is not connected to other units by means of grammatical relationships [9-11]. However, various linguistic phenomena such as sentence fragments challenge this definition. Particularly in spoken language, sentences are difficult to identify. The more common unit applied to oral discourse is that of utterance, a “behavioural unit” that can refer to single words as well as to longer stretches of speech and which is characterised by features such as prosody and intonation, which may or may not correspond to sentence segmentation and punctuation in written language. As [11] emphasise: “The sentence is an indeterminate unit in the sense that it is often difficult to decide, particularly in spoken language, where one sentence ends and another begins”. [12] use the term sentence only for written language, highlighting the reliance on punctuation when arguing that a sentence “is the highest unit of punctuation on the graphological rank scale and has evolved in the writing system (…) We will use the term sentence to refer only to this highest-ranking graphological unit of punctuation” (p. 436). 
NLP generally, and text assessment involving Coh-Metrix specifically encounter three potential problems: First, sentences are typically identified based on conventions for written texts and their typographical manifestation, that is, punctuation. As [5] explain, “the most useful cues for segmenting a text into sentences are punctuation, like periods, question marks, exclamation points” (p. 29). Second, an utterance in spoken language is frequently incorrectly understood as “the spoken correlate of a sentence” (p. 19). And third, text normalisation in NLP seems to frequently involve inserting punctuation—if missing—without clear documentation of principles followed during the process. This might not be a problem when the target data consists of written texts following formal standards of text composition. However, we hypothesise that the same process applied to non-standard written texts, including song lyrics, characterised by a range of disfluencies and non-standard text formatting might result in a significant but often unacknowledged influence of the transcriber or “punctuator” on the data. [12] touch on the frequent lack of clear guidelines when observing that “many writers punctuate phonologically rather than grammatically, or in some mixture of the two” (p. 8). Consequently, depending on the genre, style, and mode of production of the target texts, inserting punctuation might require a clear, principled approach followed by all involved in text normalisation in the pre-processing stage.

3 How do other studies approach this problem?

How is the problem of punctuation dealt with in other studies that have used Coh-Metrix for data analysis? This question is of particular relevance for research investigating texts that lack sentential segmentation in the original and thus require punctuation by the researchers, including spoken discourse and non-standard written texts. [13] used Coh-Metrix to explore differences in the lyrics written by suicidal and by non-suicidal songwriters. They investigated 560 songs written by 16 songwriters (35 songs each), focusing on 14 indices available through Coh-Metrix 2.0 and the LIWC program (Linguistic Inquiry and Word Count), including LSA and argument overlap in adjacent sentences. Lyrics were “obtained from websites devoted to that artist or band” (p. 1219). It remains unclear whether the websites were curated by the music producers or by fans who do not necessarily have access to the original text. Furthermore, the authors do not specify whether the lyrics presented on these websites included punctuation. They do, however, concede that the websites were not entirely reliable in that they did not necessarily include all repetitions of choruses and phrases as they occurred in the audio-versions; accordingly, the lyrics were checked against recordings. Further relevant information is left unspecified regarding who and how many transcribers undertook the text normalization and what guidelines they followed, including how they dealt with fade-outs and non-semantic vocables and how sentences and paragraphs were delineated. [14] used Coh-Metrix to analyse how linguistic features differ in written and in spoken online communication in multi-party collaborative learning environments. 
They investigated the impact of modality (written chat, spoken chat) on five principal components (narrativity, referential cohesion, situation model cohesion, syntactic simplicity) and three superficial text characteristics (total number of words, total number of sentences, and number of words per sentence). The description of the transcription process in the extended version of the proceedings is limited to one sentence: “Human transcribers converted conversations in the spoken condition to text for analysis”. However, the impact of text normalisation might be of particular interest in this study, as it depends on two different parties for the two modalities in question, that is, on the interlocutors themselves in the written chat mode and on an unspecified number of transcribers in the spoken mode. It also seems to be of relevance how the transcribers were trained, since written chat texts from the participating interlocutors were available and might have served as the basis for the transcribers’ decision-making processes while preparing the spoken texts for analysis. [1] investigated potential linguistic indicators in another non-traditional written genre, truthful and deceptive text messages, using both Coh-Metrix and LIWC with a number of measures directly or indirectly relying on the sentence unit. The corpus consisted of 242 transcripts of instant messaging exchanges, derived from a study conducted by [15]. [15] provide a brief description of the pre-processing of the transcripts: “First, given that CMC [computer-mediated communication] participants often omit punctuation, periods were placed at the end of all turns. Similarly, if a question mark was omitted after a question, a question mark was inserted. (…) Last, any misspellings were corrected …” (p. 11). The example transcripts in the appendix of [15] indicate that a turn never exceeds the length of one sentence. However, this is conjecture for the transcripts not included in the appendix. 
[16] focused on CMC and compared language and cohesion of written “monologues” (textbook and textoids) and written interactional dialogues between students and either human tutors or an automated tutoring system (see [17]). The interactional dialogues were taken from a previously published experiment ([18]). It can be assumed that the students and human tutors followed written standards of punctuation. Neither [16] nor [18] detail the pre-processing of data; however, one example transcript provided in [18] (p. 16) includes a student response that indicates that data preparation might have been necessary: “That will depend on the time that the ball is in the air/distance ball traveled.” The need for text normalisation, sentence segmentation, and the insertion of punctuation during pre-processing becomes more apparent when research is focused on spoken texts. [19] explored differences between written and spoken registers in partial replication of [20]’s investigation. However, while [20] focused on word-level features, [19] included global text features in their analysis, which rely on sentential segmentation, including coreference, LSA, and readability scores. To compare results, they used the same corpora as [20]; that is, in both studies the spoken section consisted of six registers from the London-Lund Corpus of Spoken English, including among others face-to-face conversations and telephone conversations. While [19] mention that “all textual coding other than alphanumeric characters and punctuation was removed” (p. 845), they do not specify if and how punctuation was inserted into the transcripts of the spoken texts in preparation for the Coh-Metrix analysis. Documentation detailing the design of the London-Lund corpus of spoken English (see [21]) shows that the transcription of spoken texts was based on tonal units and tone unit boundaries and does not originally contain typographic punctuation to delineate sentences. 
[22] investigated text-inherent factors influencing human assessment of speaking proficiency in the TOEFL-iBT© (the computer-delivered Test of English as a Foreign Language) and the variance among different scorers. They used Coh-Metrix as well as two further programs, CPIDR (the Computerized Propositional Idea Density Rater; see [23]) and LIWC, to analyse 244 speech samples from a public TOEFL-iBT© data set. Documentation of the transcription process is brief: Samples were transcribed by one trained transcriber and later reviewed by a second transcriber for accuracy. Only lexical words, including disfluencies such as false starts, were included in the transcriptions, whereas metalinguistic data and filler words were not included. [22] state that “periods were added to the samples at the end of idea units” (p. 177). This provides some insight into how the transcriptions were produced. However, several potentially relevant questions remain unanswered. For example, the concept of “idea unit” is undefined, and it is not clear whether the transcribers were trained and followed explicit rules of text segmentation and punctuation. Furthermore, no example transcript is provided. Consequently, an exact replication of the study seems difficult to achieve. [24] used speech samples from the same TOEFL-iBT© public use dataset to investigate the test-takers’ integration of lexical and cohesion features from a listening text into their spoken responses and the effect of such lexical integration on the ratings of speaking proficiency given by human experts. The description of the transcription process was identical to [22]. Finally, [25] also used a data set of TOEFL-iBT© speech samples to investigate an automated speech scoring system and its prediction accuracy of human ratings for speaking proficiency and cohesion in speaking. 
Again, no example transcript is provided, and the description of the transcription process is limited to a brief statement: “The spoken responses were all transcribed by humans with punctuation and capitalization” (p. 815).

4 Methods

As song lyrics are frequently published in a format that requires text normalisation at the pre-processing stage, we addressed the following questions before proceeding with the wider investigation: Can the use of different transcribers to insert punctuation into pop song lyrics result in significant differences in sentence segmentation? Could this potential variance in sentence segmentation significantly affect Coh-Metrix measures? To answer these questions, we compiled a small corpus of 30 songs. In an attempt to imitate realistic conditions within a linguistics research group and to reflect the practices described in previous studies involving Coh-Metrix, three different language professionals inserted punctuation into the target text files based on their knowledge of writing conventions and their understanding of the syntax and semantics of the lyrics in question. We then conducted a Coh-Metrix analysis, focusing on any potential differences between transcribers, or rather punctuators, in relevant descriptive measures that underlie many of the more sophisticated measures. We note that the sample of 30 songs and three annotators is small. The work presented here is a first step in a larger project assessing the readability of pop songs for ESL learners. It is intended to address research-methodological problems that seem to be highly relevant not only to the wider investigation at hand but also to the use of Coh-Metrix and automated text evaluation more widely.

4.1 The corpus

Thirty songs, with a total of 14,287 running words (tokens), were randomly selected from a larger corpus, the Wellington Corpus of Popular Songs (WOP; [26]). The WOP comprises 408 English-language pop songs from the top 100 end-of-year US Billboard charts from the years 2014, 2012, 2010, and 2008. The ten pop songs, comprising a total of 4,282 tokens (median: 403), were: Whataya want from me (Adam Lambert; P1), Meet me halfway (The Black Eyed Peas; P2), Boom clap (Charli XCX; P3), Crush (David Archuleta; P4), I’m yours (Jason Mraz; P5), I kissed a girl (Katy Perry; P6), Moves like Jagger (Maroon 5; P7), Cooler than me (Mike Posner; P8), What makes you beautiful (One Direction; P9), Blow me (Pink; P10). The ten rap songs, comprising 6,682 tokens (median: 662), were: Dance ass (Big Sean; R1), Hot boy (Bobby Shmurda; R2), So good (B.O.B.; R3), Get like me (David Banner; R4), All I do is win (DJ Khaled; R5), Mercy (Kanye West; R6), Mrs Officer (Lil Wayne; R7), Superstar (Lupe Fiasco; R8), Whatever you like (T.I.; R9), My hitta (Yg; R10). The ten country songs, comprising 3,314 tokens (median: 332), were: Good girl (Carrie Underwood; C1), Springsteen (Eric Church; C2), Wanted (Hunter Hayes; C3), Lover lover (Jerrod Niemann; C4), American kids (Kenny Chesney; C5), Somethin bout a truck (Kip Moore; C6), I don’t dance (Lee Brice; C7), Pontoon (Little Big Town; C8), The house that built me (Miranda Lambert; C9), Mine (Taylor Swift; C10). Note that P, R, and C stand for pop, rap, and country and are used to identify the individual songs in the plot of the correspondence analysis below.

4.2 Data collection

The lyrics were obtained from lyrics websites and were checked manually for errors. Lyrics uploaded to these websites are typically provided by music consumers rather than the music producers themselves. Consequently, sentence- and paragraph-segmentation are often omitted or are reliant on the consumer’s intuition. In the age of streaming (rather than the publication of recordings accompanied by booklets) lyrics provided by the songwriters themselves are increasingly difficult to find. If available, they still frequently lack punctuation. The lyrics contained several genre-specific features that needed to be addressed in a principled manner. Firstly, so-called fade-outs, that is, the repetition of (parts of) the chorus or verses at the end of the song at decreasing volume, were removed. Secondly, the corpus contains repetitions of individual words, phrases, choruses, and verses as used in the selected version of the song, including fully audible echoes of phrases and words vocalized by background singers. An example of word repetition can be found in B.O.B.’s So good: “… pack your bags real good, baby, ‘cause you’ll be gone for a while, while, while, while.” Finally, so-called marginal words (e.g. wobbledy, boing, yo, huh, ha) as well as non-lexical vocables, that is, words that do not have lexical content but are pronounced for the sake of vocalization itself without expressing semantic meaning [27], remained part of the corpus. An example of non-lexical vocables can be found in Lupe Fiasco’s Superstar: “And they wanna know, oh oh oh oh, if you are what you say you are, a superstar.” Before the lyrics were given to the punctuators, all punctuation and sentence-initial capitalization were removed by one researcher. This was done so that punctuators would rely on their own knowledge of writing standards rather than on the text segmentation provided by the song consumer who uploaded the text to a lyrics website.
The following excerpt from Eric Church’s Springsteen, with punctuation and capitalization removed, provides a brief example of what the song lyrics provided to the punctuators looked like:

to this day when I hear that song I see you standin’ there on that lawn discount shades store bought tan flip flops and cut-off jeans somewhere between that setting sun I’m on fire and born to run you looked at me and I was done

What constitutes a paragraph in song lyrics is also contentious. We approached this problem by inserting paragraph breaks based on distinct changes in melody (based on one researcher’s intuition during listening). The punctuators were instructed to change paragraphing based on their own (semantic or structural) intuition.
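In this study, punctuation and sentence-initial capitalization were removed by hand by one researcher. For readers who want to approximate that step programmatically, the sketch below shows one hypothetical way to do it. The function name, the set of stripped symbols, and the rule for keeping the pronoun “I” capitalized are all assumptions made for illustration and are not the procedure used by the authors. Note that a simple rule like this will also wrongly lowercase proper nouns that happen to open a sentence.

```python
SENTENCE_FINAL = ".?!"   # symbols treated as ending a sentence (assumption)
STRIPPED = ".,?!:;"      # symbols removed from the text (assumption)


def strip_punctuation(text: str) -> str:
    """Remove punctuation and lowercase sentence-initial words.

    Apostrophes and hyphens inside words (don't, cut-off) are kept, and the
    pronoun "I" (including forms such as I'm) keeps its capital letter.
    """
    out = []
    sentence_start = True  # the very first word opens a sentence
    for token in text.split():
        ends_sentence = token[-1] in SENTENCE_FINAL
        word = token.strip(STRIPPED)
        if not word:
            # Token consisted of punctuation only; remember a sentence end.
            sentence_start = sentence_start or ends_sentence
            continue
        is_pronoun_i = word == "I" or word.startswith(("I'", "I’"))
        if sentence_start and not is_pronoun_i:
            word = word[0].lower() + word[1:]
        out.append(word)
        sentence_start = ends_sentence
    return " ".join(out)


example = "Hey, good girl! So good for him. You don’t know where he’s been. I’m on fire."
print(strip_punctuation(example))
```

Whatever procedure is used, documenting it explicitly (which symbols are removed, how capitalization is handled) makes this pre-processing step reproducible.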

4.2.1 Instructions for the punctuators

The following instructions were given to the punctuators: Below, you will find the lyrics of 10 songs in verse form. Please indicate for all song lyrics where you would put punctuation, in particular, full stops (or equivalent “sentence-final” symbols like ? and !). For example, is it: “Hey, good girl with your head in the clouds, I bet you I can tell you what you’re thinking about.” Or is it: “Hey, good girl with your head in the clouds! I bet you I can tell you what you’re thinking about.” Or … You can use full stops, question marks, exclamation marks, and colons. You can also use commas, but don’t spend too much time on those. The paragraphs indicate breaks or melody changes in the singing but they do not necessarily indicate the end of sentences. If you think that a paragraph break actually broke up a continuing sentence, please indicate that as well. Simply delete the paragraph break and “connect” the sentence or sentences. Please do this for the entirety of the lyrics provided here. Please go with your gut feeling. There is no right or wrong answer. This is NOT testing you. Symbols you may use: . ? ! : ,

4.2.2 About the punctuators

Three research assistants were tasked with sentence segmentation and paragraphing of the target lyrics. All three punctuators (here named DET, LEM, WAT) are native speakers of New Zealand English and language professionals with many years of work experience, albeit in different capacities. LEM has worked as a journalist and proof-reader for 30 years, holds a BA in linguistics and is currently pursuing an MA in applied linguistics. WAT has worked in the adult education sector for over eight years with a focus on literacy and numeracy. WAT holds a certificate in adult education (literacy and numeracy) and is currently pursuing a BA in linguistics and an additional subject in the humanities. DET has five years of work experience as a language teacher (ESL, English for Academic Purposes, Spanish) and two years of experience as a tertiary tutor and marker in language, linguistics, and communication courses. DET holds an MSc in applied linguistics and second language acquisition and is currently pursuing a PhD in applied linguistics. This project has been reviewed and approved by the Massey University Human Ethics Committee: Southern B, Application 19/49.

5 Results

At first glance, there were noticeable differences in the sentence segmentation between the punctuators. An illustration of these differences is provided in Table 1.

Table 1

Side-by-side comparison of punctuation in an excerpt from Good Girl (Carrie Underwood).

DET(12 sentences)	LEM(5 sentences)	WAT(9 sentences)
But he’s really good at lying, yeah, he’ll leave you in the dust ‘cause when he says forever, well, it don’t mean much. Hey, good girl! So good for him. Better back away honey. You don’t know where he’s been. Why? Why you gotta be so blind? Won’t you open up your eyes? It’s just a matter of time ‘til you find he’s no good, girl, no good for you. You better get to getting on your goodbye shoes and go, go, go, yeah, yeah, yeah. He’s low. Yeah, yeah, yeah!	But he’s really good at lying; yeah, he’ll leave you in the dust, ‘cause when he says forever, well, it don’t mean much. Hey good girl, so good for him, better back away honey, you don’t know where he’s been. Why, why you gotta be so blind? Won’t you open up your eyes, it’s just a matter of time, ‘til you find he’s no good, girl, no good for you, you better get to getting on your goodbye shoes, and go! Go, go, yeah yeah yeah, he’s low, yeah yeah yeah.	But, he’s really good at lying. Yeah, he’ll leave you in the dust ‘cause when he says forever, well it don’t mean much! Hey good girl so good for him, better back away honey. You don’t know where he’s been! Why, why you gotta be so blind. Won’t you open up your eyes, it’s just a matter of time ‘til you find he’s no good girl. No good for you, you better get to getting on your goodbye shoes and go, go, go. Yeah, yeah, yeah he’s low. Yeah, yeah, yeah.
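The sentence counts in the header of Table 1 can be reproduced with a purely punctuation-based splitter of the kind described in section 2. The sketch below is a minimal illustration and not the segmenter used inside Coh-Metrix: it assumes that a sentence ends at a full stop, question mark, or exclamation mark followed by whitespace, and the function and variable names are hypothetical.

```python
import re

DET = (
    "But he’s really good at lying, yeah, he’ll leave you in the dust ‘cause "
    "when he says forever, well, it don’t mean much. Hey, good girl! So good "
    "for him. Better back away honey. You don’t know where he’s been. Why? "
    "Why you gotta be so blind? Won’t you open up your eyes? It’s just a "
    "matter of time ‘til you find he’s no good, girl, no good for you. You "
    "better get to getting on your goodbye shoes and go, go, go, yeah, yeah, "
    "yeah. He’s low. Yeah, yeah, yeah!"
)
LEM = (
    "But he’s really good at lying; yeah, he’ll leave you in the dust, ‘cause "
    "when he says forever, well, it don’t mean much. Hey good girl, so good "
    "for him, better back away honey, you don’t know where he’s been. Why, "
    "why you gotta be so blind? Won’t you open up your eyes, it’s just a "
    "matter of time, ‘til you find he’s no good, girl, no good for you, you "
    "better get to getting on your goodbye shoes, and go! Go, go, yeah yeah "
    "yeah, he’s low, yeah yeah yeah."
)
WAT = (
    "But, he’s really good at lying. Yeah, he’ll leave you in the dust ‘cause "
    "when he says forever, well it don’t mean much! Hey good girl so good for "
    "him, better back away honey. You don’t know where he’s been! Why, why "
    "you gotta be so blind. Won’t you open up your eyes, it’s just a matter "
    "of time ‘til you find he’s no good girl. No good for you, you better get "
    "to getting on your goodbye shoes and go, go, go. Yeah, yeah, yeah he’s "
    "low. Yeah, yeah, yeah."
)


def split_sentences(text: str) -> list[str]:
    """Split after '.', '?' or '!' when followed by whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]


excerpts = {"DET": DET, "LEM": LEM, "WAT": WAT}
counts = {name: len(split_sentences(text)) for name, text in excerpts.items()}

for name, text in excerpts.items():
    # Mean words per sentence, analogous in spirit to a sentence-length index.
    mean_len = len(text.split()) / counts[name]
    print(f"{name}: {counts[name]} sentences, {mean_len:.1f} words per sentence")
```

The three versions contain the same words, so the differences in sentence count and mean sentence length come entirely from the punctuators’ choices.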

The question is, how different is different? We address this question with some statistical analyses with a focus on the number of sentences as a descriptive measure underlying many other, more sophisticated Coh-Metrix measures. All statistical analyses were performed using the statistical programme R ([28]).

5.1 Poisson generalized linear model

As the numbers of sentences, our characteristic of interest here, are counts, we assume they are Poisson distributed. Thus, we fit a Poisson generalized linear model in which the total number of sentences (denoted as NoSen) is the response and punctuator and song genre are the two predictor variables. Both predictor variables are categorical, with three levels each. That is, Punctuator has levels DET, LEM and WAT and Genre has levels COU, POP and RAP for country, pop, and rap:

log(E[NoSen]) = β0 + β1 Punctuator + β2 Genre

This model allows us to evaluate the level of difference in sentence counts due to pre-processing (punctuator) and what variation in the number of sentences was observed due to the song style (genre). The null hypotheses being tested are that the parameters β1 and β2 are equal to 0. In the model being fitted here, an important underlying assumption is that the mean of the counts is equal to their variance. However, the mean is considerably smaller than the variance for these data. A variance higher than expected under the model is called overdispersion. We conducted a test for overdispersion, where the null hypothesis is that the ratio of the observed variance to the variance expected under the model is 1 (as it would be if the two were the same). The test statistic is 410.5, with a p-value of less than 0.001. The test provides strong evidence that the counts are indeed overdispersed. We address the issue of overdispersion by fitting a quasi-Poisson generalised linear model, which estimates an additional dispersion parameter to account for the extra variance. This parameter was estimated to be 4.83, which tells us that the variance is nearly five times greater than the mean. An Analysis of Variance (ANOVA) table for a two-way factorial model (see Table 2) allows us to assess the importance of each of the two factors separately.
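The reported dispersion estimate can be checked with simple arithmetic. The sketch below assumes that the statistic of 410.5 is the sum of squared Pearson residuals, that the dispersion is estimated as this sum divided by the residual degrees of freedom (the usual quasi-Poisson estimator), and that each of the 30 songs was punctuated by all three punctuators, giving 90 observations. The last assumption is inferred from the 85 residual degrees of freedom in Table 2 and is not stated explicitly in the text.

```python
# Assumed design: every song punctuated by every punctuator.
n_songs = 30
n_punctuators = 3
n_obs = n_songs * n_punctuators          # 90 sentence counts in total

# Intercept + 2 punctuator contrasts + 2 genre contrasts.
n_params = 1 + 2 + 2
resid_df = n_obs - n_params              # residual degrees of freedom

# Reported overdispersion statistic (assumed: Pearson chi-square).
pearson_chi_square = 410.5

# Quasi-Poisson dispersion estimate: chi-square / residual df.
dispersion = pearson_chi_square / resid_df

print(f"residual df = {resid_df}")
print(f"dispersion  = {dispersion:.2f}")
```

Under an adequate Poisson model this ratio would be close to 1, so a value near 5 indicates substantial extra variation in the sentence counts.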

Table 2

ANOVA table for a two-way factorial model.

	Sum Sq	Df	F value	P-value
Punctuator	241.75	2	25.0	<2.8e-09
Genre	600.79	2	62.2	<2.2e-16
Residuals	410.50	85

As we have fitted a generalised linear model, the output in the ANOVA table is calculated using Type-II Wald tests, where the differences of Wald statistics are used.

In the first row of Table 2 we can see the sum of squares, degrees of freedom, and F test statistic for the test of how useful the factor punctuator is without considering the other factor. That is, the two models being compared are a model with only Punctuator as a predictor and the null model with no predictor variables. The F test statistic is a standardised measure of the improvement in fit (based on the residual sums of squares) divided by the extra number of parameters needed for the more complex model. The larger the value of the test statistic, the more evidence we have that the more complex model is better. In the second row, we see the same details for the test of the second factor genre when adjusting for the factor punctuator. That is, here the two models being compared are the model with only the factor punctuator as a predictor variable and the model with both factors. The p-values in Table 2 show that both factors are useful in terms of predicting the number of sentences. This implies that there are significant differences in the number of sentences due to both pre-processing and song type. The latter is to be expected, considering that the three song genres are very different in their sentence composition by design. We can look at the coefficients, shown in Table 3, to quantify the differences.
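The F values in Table 2 can be recovered from the other columns of the table. The following sketch assumes that each F value is the factor’s sum of squares divided by its degrees of freedom, scaled by the residual mean square (which here coincides with the estimated dispersion). The function name is hypothetical.

```python
# Values copied from Table 2.
resid_sum_sq, resid_df = 410.50, 85
residual_mean_square = resid_sum_sq / resid_df   # equals the dispersion estimate


def f_value(sum_sq: float, df: int) -> float:
    """Mean square of a factor divided by the residual mean square."""
    return (sum_sq / df) / residual_mean_square


f_punctuator = f_value(241.75, 2)
f_genre = f_value(600.79, 2)

print(f"F (Punctuator) = {f_punctuator:.1f}")
print(f"F (Genre)      = {f_genre:.1f}")
```

Both values agree with the F column of Table 2 to one decimal place, which suggests the table entries are internally consistent.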

Table 3

(Quasi)Poisson generalized linear model.

Coefficients	Estimate	Std Error	t value	P-value
Intercept	3.54	0.088	40.04	<2e-16
PunctuatorLEM	-0.56	0.086	-6.51	4.93e-09
PunctuatorWAT	-0.43	0.082	-5.16	1.60e-06
GenrePop	0.47	0.10	4.63	1.33e-05
GenreRap	0.97	0.09	10.41	<2e-16

Country songs punctuated by DET (the baseline group) are expected to have an average of about exp(3.54) = 34.47 sentences. On average, we can expect to see exp(0.47) = 1.6 and exp(0.97) = 2.64 times as many sentences in pop and rap songs in comparison to country songs, respectively. More importantly, the expected number of sentences for a song processed by LEM and WAT is exp(-0.56) = 0.57 and exp(-0.43) = 0.65 times the expected number of sentences in a song processed by DET, respectively. That is, song scripts processed by DET result in noticeably more sentences, on average, than both LEM and WAT. Overall, the most important finding here is that we see significant differences in the number of sentences between the three punctuators, which in turn has implications for cohesion measures that rely on the sentence as a unit.
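Because the model works on the log scale, the multiplicative effects quoted above follow from exponentiating the coefficients in Table 3. The sketch below shows that step. The coefficient values are copied from Table 3, while the dictionary layout and the helper `expected_sentences` are hypothetical names introduced only for illustration.

```python
import math

# Coefficient estimates copied from Table 3 (log scale).
COEF = {
    "Intercept": 3.54,
    "PunctuatorLEM": -0.56,
    "PunctuatorWAT": -0.43,
    "GenrePop": 0.47,
    "GenreRap": 0.97,
}


def rate_ratio(name):
    """Multiplicative change in the expected count relative to the baseline level."""
    return math.exp(COEF[name])


def expected_sentences(punctuator="DET", genre="Country"):
    """Expected sentence count; DET and Country are the baseline levels (coefficient 0)."""
    eta = COEF["Intercept"]
    eta += COEF.get("Punctuator" + punctuator, 0.0)
    eta += COEF.get("Genre" + genre, 0.0)
    return math.exp(eta)


if __name__ == "__main__":
    print(f"baseline (DET, country): {expected_sentences():.2f}")
    for name in ("PunctuatorLEM", "PunctuatorWAT", "GenrePop", "GenreRap"):
        print(f"{name}: x{rate_ratio(name):.2f}")
    print(f"LEM, rap: {expected_sentences('LEM', 'Rap'):.1f}")
```

The same helper combines effects: for a rap song punctuated by LEM, the two coefficients are added to the intercept before exponentiating, which is equivalent to multiplying the baseline by both rate ratios.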

5.2 Correspondence analysis

Correspondence analysis was used to visualize the relationship between the three punctuators and the three song genres in terms of their effect on sentence counts. A Procrustes analysis was used to determine whether differences in the number of sentences between punctuators were due to differences in how they processed the data and not, for example, to intervening variables such as exhaustion. Fig 1 shows a reduced space plot from the correspondence analysis. Such a plot is a graphical display of the rows and columns of a contingency table, where the row variable is the punctuator and the column variable the song genre. In the plot, punctuators are labelled DET, LEM, and WAT, while songs are indicated by genre (R = rap, C = country, P = pop) and number as described in section 4.1.

Fig 1

Reduced space plot from the correspondence analysis.

Of interest is how close the points lie to each other, that is, whether two points fall in the same quadrant, as well as their relative distance from the origin. Two points close together would be considered similar in terms of the number of sentences they have. In Fig 1, we see along Dimension 1, on the horizontal axis, a gradient that captures the genre-typical level of “verbosity” in the songs. Rap songs, which tend to have more words and therefore more sentences, are positioned more to the left, while country songs lean towards the right-hand side of the origin. This dimension also highlights that the punctuators WAT and LEM have in common that they tend to punctuate in a manner that leads to fewer sentences. Genre explains 60.4% of the variation in the number of sentences between songs. Along Dimension 2, on the vertical axis, the positioning of the points is driven by the “direction of disagreement” between punctuators, in particular LEM and WAT. The songs that differentiate the three punctuators along this second gradient are two rap songs (R8, R9), country song C5 and pop song P9. In these cases, songs where WAT assigned more sentences are above the average and songs where LEM assigned more sentences are below average. This contrast between the punctuators explains 24.6% of the variation in the number of sentences. Overall, the plot shows the ways in which the three punctuators differ from each other in relation to differences between the songs. That is, while most of the variation in the number of sentences is caused by underlying differences in the type of songs being analysed, nearly a quarter of the variation is caused by differences in how the lyrics were processed.
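The percentages attached to each dimension are shares of the table's total inertia (the chi-square statistic divided by the grand total). The sketch below shows how such shares arise for a small table. It uses made-up counts, not the study's data, and it is a pure-Python illustration rather than the authors' R analysis. With only three rows there are at most two dimensions, so the two shares here sum to 100%, unlike the 60.4% and 24.6% reported for Fig 1. All names are hypothetical.

```python
import math

# HYPOTHETICAL sentence counts (not the study's data):
# rows are punctuators, columns are four made-up songs.
COUNTS = {
    "DET": [40, 55, 90, 30],
    "LEM": [22, 35, 50, 20],
    "WAT": [25, 30, 62, 18],
}


def principal_inertias(table):
    """Return the two principal inertias of a three-row contingency table."""
    rows = list(table.values())
    assert len(rows) == 3, "the closed form below is only valid for three rows"
    n = sum(sum(r) for r in rows)
    n_cols = len(rows[0])
    row_mass = [sum(r) / n for r in rows]
    col_mass = [sum(r[j] for r in rows) / n for j in range(n_cols)]
    # Standardised residuals from the independence model.
    s = [
        [
            (rows[i][j] / n - row_mass[i] * col_mass[j])
            / math.sqrt(row_mass[i] * col_mass[j])
            for j in range(n_cols)
        ]
        for i in range(3)
    ]
    # A = S S^T is 3x3 and symmetric with one zero eigenvalue, so the other
    # two eigenvalues follow from its trace and the sum of its 2x2 principal minors.
    a = [[sum(x * y for x, y in zip(s[i], s[k])) for k in range(3)] for i in range(3)]
    trace = sum(a[i][i] for i in range(3))
    trace_of_square = sum(a[i][k] ** 2 for i in range(3) for k in range(3))
    minors = (trace ** 2 - trace_of_square) / 2
    disc = math.sqrt(max(trace ** 2 - 4 * minors, 0.0))
    return (trace + disc) / 2, (trace - disc) / 2


def inertia_shares(table):
    """Share of total inertia carried by each of the two dimensions."""
    first, second = principal_inertias(table)
    total = first + second
    return first / total, second / total


if __name__ == "__main__":
    dim1, dim2 = inertia_shares(COUNTS)
    print(f"Dimension 1: {dim1:.1%}, Dimension 2: {dim2:.1%}")
```

A full analysis would also compute row and column coordinates for plotting. This sketch stops at the inertia shares because that is the quantity quoted in the text.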

5.3 Procrustes analysis

Fig 2 shows the results of a Procrustes analysis. A Procrustes analysis plot allows us to focus on the punctuators in two respects. First, it shows the degree of similarity between any two punctuators. In this regard, it is apparent that the three punctuators deviate quite clearly from each other, as they lie in different quadrants of the plot.

Fig 2

General Procrustes analysis map.

Second, this plot also visualizes how similarly each individual processed the three genres of songs. To see more detail, we need to zoom in (see Fig 3).

Fig 3

Close-up of deviations for each genre from LEM’s centroid.

The close-up in Fig 3 reveals arrows corresponding to the deviations in the average number of sentences for each type of song from the centroid of the punctuator labelled LEM. The centroid here is the vector of the mean number of sentences for that punctuator. The fact that these deviations are so small indicates that LEM was internally consistent in the way they punctuated across genres. This observation holds for the other two punctuators as well. This implies that we can assume there is very little noise caused by, for example, the level of exhaustion a punctuator might experience. That is, the differences in the number of sentences appear to be inherently due to how each punctuator processed the lyrics.

Discussion

The present study analysed the potential impact of text normalisation on results of the text-analysis program Coh-Metrix 3.0. The focus was on the sentence as a text-structural unit, which bears relevance for a large number of measures provided by the program. In particular, we investigated the impact of punctuation, given Coh-Metrix’s apparent reliance on typographic symbols (full stop, exclamation mark, question mark) to detect sentence boundaries. Our first research question asked whether the use of different transcribers to insert punctuation into pop song lyrics can result in significant variance in sentence segmentation. Further, we investigated whether this potential variance in sentence segmentation significantly affects Coh-Metrix measures. To answer these questions, we quantified the impact of text normalisation at the pre-processing stage on the resulting number of sentences. The impact is found to be significant given the differences found between punctuators in the Poisson regression model (with the total number of sentences as the response variable) and their visual separation in space in the reduced space plot from the correspondence analysis. We have, thus, demonstrated on a small scale that differences in how a text is pre-processed can lead to a significant difference in the number of sentences. This has implications for at least some Coh-Metrix measures such as sentence length/mean number of words (DESSL) or paragraph length/mean number of sentences (DESPL). As these descriptive measures significantly contribute to a range of other, more sophisticated measures, it can be assumed that these might be affected as well. Consequently, it seems that normalisation of non-traditional written texts that do not clearly indicate sentence boundaries should receive greater attention during planning, analysis and documentation. 
The potentially significant influence of transcriber(s) on language data has received greater attention in interview-based qualitative research, where the “nature of transcription as an interpretive activity” has increasingly been acknowledged, along with the fact “that different people checking transcription quality may generate different versions of the interview transcript” ([29], p. 305). [29] highlights the issue of sentence segmentation in originally spoken language and the potential impact on the meaning expressed: “Because people often talk in run-on sentences, judgment calls must be made in the course of transcription about where to begin and end sentences. The insertion of a period or comma can sometimes alter the interpretation of the text” (p.297). It seems that this argument is equally valid for quantitative research approaches relying on the quantification of sentences rather than or in addition to their meaning in the target data. The need for judgment calls to be made by the transcribers (or punctuators) remains the same. Such judgment calls are not solely based on semantic considerations but are also influenced by grammar and style and each transcriber’s attitude towards these matters. As [11] highlight: “The term ‘grammar’ is indeterminate in the sense that ‘What counts as a grammatical English sentence?’ is not always a question which permits a decisive answer; (…) questions of grammatical acceptability inevitably become involved with questions of meaning, with questions of good or bad style, with questions of lexical acceptability, with questions of acceptability in context, etc.” (p. 47). As the present study has shown, transcribers can answer these questions of meaning, style, and acceptability in different ways, and this can result in significantly different approaches to sentence segmentation. 
Such differences can be observed despite a similar language background (here: native speakers of New Zealand English) and many years of educational and professional experience with the English language and writing, including at tertiary level. It can be argued that the particularities of a non-standard written genre like song lyrics contribute to these differences. Song lyrics display characteristics of both written and spoken text, sitting “somewhat uneasily on the boundary between writing and speech” ([30], p. 37). Features that might render uniform text interpretation and segmentation difficult include a higher number of disfluencies such as false starts, marginal words, and non-lexical vocables. While the results of the present study point to the possibility of subjective bias being introduced in automated language evaluation due to punctuation, we note that the sample of 30 songs and only three punctuators is small and presents a limitation. Our work so far represents a preliminary study, and a follow-up study on a larger set of songs and punctuators will be necessary to confirm its findings.

Conclusion

Coh-Metrix is a sophisticated application that allows both researchers and language practitioners to assess text not only at the word-level but also regarding more global textual features that contribute to the reading ease or difficulty of different texts for various readers, including L2 learners. As Coh-Metrix relies on punctuation for sentence segmentation, caution should be applied, however, when assessing non-traditional written texts including song lyrics that require the insertion of punctuation during pre-processing. Given the potentially significant impact of a transcriber or punctuator on the language data and consequently on the analysis results, their work should not rely on undefined linguistic intuitions and an assumed consensus regarding punctuation, sentence structure and acceptability, and sentence meaning. Instead, it seems prudent to lay out clear rules and guidelines to be applied during data processing and to make these explicit in sufficient detail in research documentation and publication. Any further research in this area, including replication studies, would benefit from a detailed description of text preparation and normalisation.

30 songs were punctuated by each of the three punctuators.

The pre-processed texts were then entered into Coh-Metrix to produce a report which includes the measure “number of sentences”. Our data set is a collation of the number of sentences for each song for each punctuator. (CSV)

R code used to fit the Poisson regression model, and do the correspondence and Procrustes analyses.

(R)

7 Sep 2020

PONE-D-20-05019

The impact of text segmentation on automated evaluation of song-lyrics

PLOS ONE

Dear Dr. Parry,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please submit your revised manuscript by Oct 16 2020 11:59PM.

We look forward to receiving your revised manuscript.

Kind regards,

Diego Raphael Amancio

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note you have included a table to which you do not refer in the text of your manuscript. Please ensure that you refer to Table 3 in your text; if accepted, production will need this reference to link the reader to the Table.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: Yes

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

3. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

4. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

5. Review Comments to the Author

Reviewer #1: This study addresses an interesting problem that has been overlooked in previous studies on song-lyrics analysis, namely the one of punctuating (in particular of identifying sentence boundaries in) song-lyrics. Previous studies have relied on manual annotations often made by the authors themselves without questioning the objectivity of the task and the fact that different annotators may come up with significantly different annotations. The present study shows that the task is indeed partly subjective and that annotations do vary from one annotator from another, even between annotators with a strong background or experience in language studies and linguistics. This result is definitely interesting and suggests that researchers should devote more time to punctuating song-lyrics. The study is well conducted, with a clear and precise analysis. The only slight drawback is the relatively small size of the sample considered (30 songs in total) and the relatively small number of annotators involved (3). If the conclusions drawn are valid, it would have been nice to obtain them on a larger set of songs and annotators.

Reviewer #2: I read the paper “The impact of text segmentation on automated evaluation of song-lyrics” with great interest. In this paper, the authors use three transcribers to punctuate pop, rap and country song lyrics and highlight the resulting differences with respect to some statistics of the number of sentences. I would like to raise some general concerns:

i) It is not clear for me the role of Coh-Metrix in the paper. Is it used only for counting sentences? There is a whole corpus of introduction mentioning the advantages and the importance of pre-processing before the use of Coh-Metrix, but at line 317 the authors say that their analysis focus only on the number of sentences, instead of “more sophisticated Coh-Metrix measures” (line 318).

ii) In line 329, it is said “Note that we have overdispersion in the counts”. The reader has some trouble to follow this line of thought because we do not have access to the distributions of the number of counts. I would suggest presenting this distribution or, for instance, the inclusion of a figure with the song id in the x-axis (from 1 to 30) and the number of sentences in the y-axis: this figure would have 3 curves, each one indicating the number of sentences obtained from the punctuation of each transcriber. In this example, the dispersion of the data could be seen in the range of y.

iii) The authors should be more descriptive about the tests realized. I believe section 5 could be expanded for the sake of better comprehension. For instance, although ANOVA tables are pretty common, it would be nice to guide the reader to how it is obtained (in a few words) and to the meaning of each column presented in table 2, as well as the relevance of the results to the purpose of the paper.

iv) The authors could explain why they decided to use the Poisson generalized linear model. I mean, is it expected the number of sentences (NoSen) to follow a Poisson distribution?

In conclusion, it is expected that changes in text segmentation would result in differences on text metrics such as the number of sentences. In my opinion, the most interesting aspect of the paper is the quantification of the different interpretations of the same songs: as pointed out in subsection 4.2.1, the transcribers were instructed to follow their “gut feeling” (line 293). In this sense, the results measure how the same set of song can be perceived, felt, seen, viewed, by the transcribers. The importance of this point is already mentioned by the authors when they talk about the ESL (lines 16, 450), but I would suggest an emphasis in the aspect of “quantification of the differences in perception”, including the title and abstract, to make the paper stronger.

6. Do you want your identity to be public for this peer review?

Reviewer #1: No

Reviewer #2: No

13 Sep 2020

Please see the Response to Reviewers letter.

Response to Reviewers, by Dr. Katharina Parry and Dr. Friederike Tegge
\\headerlineone{} % Top header line, leave blank if you only want the bottom line \\headerlinetwo{MASSEY UNIVERSITY} % Bottom header line %---------------------------------------------------------------------------------------- \\begin{document} %---------------------------------------------------------------------------------------- % TO ADDRESS %---------------------------------------------------------------------------------------- \\begin{letter}{ } %---------------------------------------------------------------------------------------- % LETTER CONTENT %---------------------------------------------------------------------------------------- \\opening{Dear Reviewers,} as a general comment we want to express our sincere gratitude to you both for taking the time to provide feedback on our work. These are difficult times so we are very appreciative of your efforts, thank-you. \\textbf{Response to reviewer 1:} \\textbf{Original comment:} \\textit{The study is well conducted, with a clear and precise analysis. The only slight drawback is the relatively small size of the sample considered (30 songs in total) and the relatively small number of annotators involved (3). If the conclusions drawn are valid, it would have been nice to obtain them on a larger set of songs and annotators.} Thank you for pointing this out. We have added the following two paragraphs: \\begin{enumerate} \\item In the Methods section: "We note that the sample of 30 songs and three annotators is small. The work presented here is a first step in a larger project assessing the readability of pop songs for ESL learners. It is intended to address research-methodological problems that seem to be highly relevant not only to the wider investigation at hand but also to the use of Coh-Metrix and automated text evaluation more widely." 
\\item At the end of the Discussion section: "While the results of the present study point to the possibility of subjective bias being introduced in automated language evaluation due to punctuation, we note that the sample of 30 songs and only three punctuators is small and presents a limitation. Our work so far represents a preliminary study, and a follow-up study on a larger set of songs and punctuators will be necessary to confirm its findings." \\end{enumerate} \\textbf{Response to reviewer 2:} \\textbf{Original comment:} \\textit{i) It is not clear for me the role of Coh-Metrix in the paper. Is it used only for counting sentences? There is a whole corpus of introduction mentioning the advantages and the importance of pre-processing before the use of Coh-Metrix, but at line 317 the authors say that their analysis focus only on the number of sentences, instead of “more sophisticated Coh-Metrix measures” (line 318).} The question as to the relevance of the sentence is a central one. Sentence count is itself a measure, but more importantly, it is an underlying contributor to a large range of more sophisticated measures. As such, we have investigated the differences in sentence count between three different punctuators and draw the conclusion that more sophisticated measures relying on sentence count as a contributing measure would be affected by significant differences in sentence count (such as word overlap between sentences, semantic overlap between sentences, latent semantic analysis, to name a few). This is explained in the section “The relevance of the sentence in Coh-Metrix 3.0”. 
\\newpage The section includes the following explanation: Quote from lines 46-68: "It can be speculated that lexical information such as diversity measures, word frequency, word meaningfulness and concreteness, would be largely unaffected by sentence segmentation, whereas certain syntactic indices such as sentence syntax similarity of adjacent sentences (SYNSTRUTa) clearly rely on the sentence as a unit. Similarly, cohesion measures such as noun-, argument-, stem-, and content-word overlap in adjacent sentences or across all sentences in a text require sentence boundaries to be indicated, as do measures of Latent Semantic Analysis (LSA), that is, measures of relative semantic overlap between sentences and paragraphs. The same holds for traditional readability scores included in Coh-Metrix 3.0, the Flesch Reading Ease (RDFRE) and the Flesch-Kincaid Grade Level (READFKGL), which both rely on sentence length (mean number of words per sentence) to compute text difficulty. The Coh-Metrix L2 Readability score (L2), intended to assess the suitability of texts for second language learners, includes content-word overlap in and syntactic similarity of sentences as variables in its formula. Finally, Coh-Metrix 3.0 offers so-called Text Easability Principal Component Scores, based on a principal components analysis (PCA) of 54 indices in a corpus of written texts found across school-grade levels (K-12) and academic subjects (see Graesser (2011), McNamara(2014) for further details). Five of these principal components – narrativity, syntactic simplicity, word concreteness, referential cohesion, and deep cohesion, account for 54\\% of the variance in the test corpus and are provided to Coh-Metrix users as comprehensive scores that are more easily accessible and of more immediate practical use for teaching practitioners. Graesser (2011) show that sentence-based measures contribute to three of these easability scores (narrativity, syntactic simplicity, referential cohesion)." 
Furthermore, the argument of the sentence as an underlying measure contributing to a range of other measures is emphasised in our Discussion so that it connects with the following quote:

Quote from lines 406-415: "We have, thus, demonstrated on a small scale that differences in how a text is pre-processed can lead to a significant difference in the number of sentences. This has implications for at least some Coh-Metrix measures such as sentence length/mean number of words (DESSL) or paragraph length/mean number of sentences (DESPL). As these descriptive measures significantly contribute to a range of other, more sophisticated measures, it can be assumed that these might be affected as well. Consequently, it seems that normalisation of non-traditional written texts that do not clearly indicate sentence boundaries should receive greater attention during planning, analysis and documentation."

\\textbf{Original comment:} \\textit{ii) In line 329, it is said “Note that we have overdispersion in the counts”. The reader has some trouble to follow this line of thought because we do not have access to the distributions of the number of counts. I would suggest presenting this distribution or, for instance, the inclusion of a figure with the song id in the x-axis (from 1 to 30) and the number of sentences in the y-axis: this figure would have 3 curves, each one indicating the number of sentences obtained from the punctuation of each transcriber. In this example, the dispersion of the data could be seen in the range of y.}

Well spotted, this was certainly something we needed to explain better in order for the reader to not just take our word for it, but be able to see it for themselves. Thank you for the helpful suggestion for a plot which would show the overdispersion. However, a plot takes up a lot of space, and we weren't sure if the reader would know what to look for (as in, we would need to give some explanation as well), so we instead decided to perform a test for overdispersion as another way of getting the reader to understand how we detect overdispersion. That is, in section 5, as highlighted, we explain the idea of the test, e.g. that the test statistic is the ratio of the expected variance (under the model) and the observed variance. We then use the highly significant p-value as proof that we should consider a quasi-Poisson model instead to account for the difference in the true variance from the variance assumed under the model.

\\textbf{Original comment:} \\textit{iii) The authors should be more descriptive about the tests realized. I believe section 5 could be expanded for the sake of better comprehension. For instance, although ANOVA tables are pretty common, it would be nice to guide the reader to how it is obtained (in a few words) and to the meaning of each column presented in table 2, as well as the relevance of the results to the purpose of the paper.}

Once again, following this advice will improve the article greatly, and as such we have added clarification in section 5 as highlighted.

\\textbf{Original comment:} \\textit{iv) The authors could explain why they decided to use the Poisson generalized linear model. I mean, is it expected the number of sentences (NoSen) to follow a Poisson distribution?}

This is correct: we do indeed treat the number of sentences as counts, and thus assume they are Poisson distributed. Clearly, this was not explained well enough in the current form of the article, and thus the explanation has now been given more attention when addressing the previous comment.

\\textbf{Original comment:} \\textit{In conclusion, it is expected that changes in text segmentation would result in differences on text metrics such as the number of sentences. In my opinion, the most interesting aspect of the paper is the quantification of the different interpretations of the same songs: as pointed out in subsection 4.2.1, the transcribers were instructed to follow their “gut feeling” (line 293). In this sense, the results measure how the same set of song can be perceived, felt, seen, viewed, by the transcribers. The importance of this point is already mentioned by the authors when they talk about the ESL (lines 16, 450), but I would suggest an emphasis in the aspect of “quantification of the differences in perception”, including the title and abstract, to make the paper stronger. }

Our research shows that different transcribers delineate the structural unit of the sentence differently. However, our research does not provide insight into whether these decisions are made based on semantic interpretation (felt, seen, viewed) or solely on different views on the definition of the sentence unit, on sentence grammar, acceptability and style. It can be assumed that it is a combination of these factors, but that assumption cannot be inferred from the present results. What our study shows is that there can be differences in text segmentation between educated transcribers and these differences can have an impact on the results of automated quantification of textual features. Based on a review of quantitative studies involving Coh-Metrix and also literature describing the role of the sentence in NLP, we can also state that this problem does not seem to have received any attention. Based on your thoughtful feedback we have updated the title to "The impact \\textbf{of differences in} text segmentation on \\textbf{the} automated \\textbf{quantitative} evaluation of song-lyrics" as well as the abstract as highlighted in the revised manuscript.

\\bigskip Again, thanks so much for your input!
Sincerely, \\par \\medskip Friederike and Katharina \\end{letter} \\end{document}

Submitted filename: Response_to_Reviewers.pdf

26 Oct 2020

The impact of differences in text segmentation on the automated quantitative evaluation of song-lyrics

PONE-D-20-05019R1

Dear Dr. Parry,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Diego Raphael Amancio

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: (No Response)

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

**********

6. Review Comments to the Author. Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

28 Oct 2020

PONE-D-20-05019R1

The impact of differences in text segmentation on the automated quantitative evaluation of song-lyrics

Dear Dr. Parry:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Diego Raphael Amancio

Academic Editor

PLOS ONE
