A quantitative corpus-based approach to profiling heritage speakers

Gert Foget Hansen, University of Copenhagen (Poster session Thursday 16:00)

Heritage speakers are notorious for their variation in competence. This presents a problem for heritage language research if a comparable and reasonable level of fluency among the speakers is assumed a priori. In addition, speakers' competence is rated, and speakers are grouped, on the basis of a rather intuitive assessment of fluency. Robust speaker profiles, including differences in lemma production, amount and type of code-switching, speech rate, hesitations, repetitions and self-interruptions, are thus a desideratum.

It has previously been shown that structural attrition, i.e. attrition of grammar and lexicon, goes hand in hand with attrition at the performance level, as expressed by the occurrence of cognitive disfluency markers such as empty pauses, hesitations, repetitions and self-interruptions (Montrul 2008, Johannessen 2015), and that cognitive disfluency markers and semantic disfluency markers (empty pauses) are affected differently by attrition (Schmid & Fägersten 2010).

The aim of the talk is to qualify the discussion of the connection between the various disfluency markers and their connection to linguistic attrition. Based on the CoNAmDa (Corpus of North American Danish) and CoSAmDa (Corpus of South American Danish), we will use factor analysis to gain insight into how phenomena of disfluency and attrition co-occur across a large dataset (see below) and to look for possible trends across different speaker categories (based on speaker metadata). The structural (attrition) phenomena which we include in the analysis are, among others:

  • Gender simplification
  • Occurrence of certain discourse markers (well, you know)
  • Ratio of hypotactic to paratactic constructions
  • Lexical diversity (type-token ratio)
  • Language switching at three different levels: word-internal, clause-internal, and clause to clause
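As an illustration of one of the measures listed above, the type-token ratio per speaker could be computed roughly as follows. This is a minimal sketch, not the procedure used for the corpora: the function name, the (speaker, word) input format and the toy data are assumptions, and the ratio is computed over lowercased word forms rather than lemmas.

```python
from collections import defaultdict


def type_token_ratios(tokens):
    """Return {speaker: type-token ratio} from (speaker, word) pairs.

    Hypothetical helper: counts distinct lowercased word forms (types)
    divided by the total number of word forms (tokens) per speaker.
    """
    words_by_speaker = defaultdict(list)
    for speaker, word in tokens:
        words_by_speaker[speaker].append(word.lower())
    return {
        speaker: len(set(words)) / len(words)
        for speaker, words in words_by_speaker.items()
    }


# Invented toy data, not taken from either corpus.
sample = [
    ("S1", "jeg"), ("S1", "var"), ("S1", "jeg"), ("S1", "der"),
    ("S2", "well"), ("S2", "well"),
]
ratios = type_token_ratios(sample)
```

Because the type-token ratio tends to fall as a sample grows longer, comparing speakers who contribute very different amounts of speech would probably call for equal-sized samples or a length-corrected measure.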

The CoSAmDa (Argentine Danish) corpus amounts to 858,000 tokens produced by 90 speakers (born between 1911 and 1971). The CoNAmDa (North American Danish) corpus consists of 614,000 tokens produced by 230 speakers (born between 1876 and 1965). Basic speaker metadata like gender, birth year, time of emigration, home town and residence at the time of the recording are registered for each speaker.

Both corpora have the same (rich) set of annotations. The speech is orthographically transcribed and aligned with sound. The transcription includes all verbal events: hesitations, filled pauses, self-interruptions and other events such as laughter.

At the word level, each token is annotated for language: primarily Danish, English and Spanish. Words with word-internal switching (either between stems or between stem and suffixes) are coded as 'hybrid'. Words which cannot be assigned unambiguously to one language due to similarity in form and pronunciation between two languages are coded as 'ambiguous'. Given names, filled pauses, self-interruptions and other discourse phenomena are coded as such. All words have been PoS-tagged automatically. Further, the corpora are currently being annotated with a basic syntactic coding identifying main and subordinate clauses, including a distinction between Danish and non-Danish clauses.[1]

Combining the syntactic annotation and the word-level annotation of language enables us to distinguish between non-Danish words inserted into otherwise Danish clauses and non-Danish words forming non-Danish clauses. Thus, we are able to distinguish between language switching at three different levels: word-internal, clause-internal, and clause to clause switching across the whole corpus. Based on the syntactic coding, empty intervals occurring clause-internally can be classified as pauses, whereas empty intervals outside of clauses are disregarded.
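The combination of word-level and clause-level annotation described above can be sketched as follows. This is only an illustration under stated assumptions: the language codes ("da", "en", "es", "hybrid"), the clause labels, the Token structure, the function names and the toy data are all hypothetical and do not reflect the corpora's actual annotation format.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

NON_DANISH = {"en", "es"}  # hypothetical codes for English and Spanish


@dataclass
class Token:
    word: str
    lang: str                 # hypothetical word-level language code
    clause_id: Optional[int]  # None when the token is not part of a clause


def switch_level(token, clause_langs):
    """Return the switching level a token counts toward, or None."""
    if token.lang == "hybrid":
        return "word-internal"
    if token.lang not in NON_DANISH:
        return None  # Danish, ambiguous, names etc. are not switches
    clause_lang = clause_langs.get(token.clause_id)
    if clause_lang == "danish":
        return "clause-internal"   # non-Danish word in a Danish clause
    if clause_lang == "non-danish":
        return "clause-to-clause"  # word forming part of a non-Danish clause
    return None  # non-clauses carry no clause-level language


def is_pause(gap, clause_spans):
    """True if an empty interval (start, end) lies inside some clause span."""
    start, end = gap
    return any(c_start <= start and end <= c_end
               for c_start, c_end in clause_spans)


# Invented toy data, not taken from either corpus.
clause_langs = {1: "danish", 2: "non-danish"}
tokens = [
    Token("vi", "da", 1), Token("havde", "da", 1), Token("farm", "en", 1),
    Token("farmen", "hybrid", 1),
    Token("you", "en", 2), Token("know", "en", 2),
    Token("yes", "en", None),
]
counts = Counter(
    level for t in tokens if (level := switch_level(t, clause_langs))
)
```

In this sketch the counts are per word; counting clause-to-clause switches per clause instead would only require collecting the distinct clause identifiers.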

It should be emphasized that the coding of the disfluency markers is automatic and/or derived from the syntactic coding and is thus strictly quantitative. In a more qualitative approach one might, for instance, distinguish between pauses with different functions, or between different reasons for repetitions.


  • Johannessen, J.B. 2015. Attrition in an American Norwegian heritage language speaker. J.B. Johannessen & J. Salmons (eds.), Germanic Heritage Languages in North America. Acquisition, attrition and change. Amsterdam: John Benjamins. 46-71.
  • Montrul, S.A. 2008. Incomplete acquisition in bilingualism. Re-examining the age factor. Amsterdam: John Benjamins.
  • Schmid, M. & K.B. Fägersten. 2010. Disfluency markers in L1 attrition. Language Learning. A Journal of Research in Language Studies 60.4. 753-791.

[1] In our syntactic coding, a clause minimally requires a finite verb. As a consequence, some utterances are classified as non-clauses. These also include well-formed utterances, in particular simple replies such as "yes" and "no" or matter-of-fact information such as "in New Brunswick in 1953". As the classification of a clause as Danish or non-Danish hinges on which language the subject and the verb can be attributed to, the non-clauses are not specified for language (at clause level).
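The classification described in this footnote might be sketched as below. The footnote does not state how mixed cases are resolved, so the rule used here (a clause counts as Danish only if the subject and the finite verb are both Danish) is an assumption made purely for illustration, as are the function name, the role labels and the language codes.

```python
def classify_utterance(words):
    """Classify an utterance as 'danish', 'non-danish' or 'non-clause'.

    words: list of (form, lang, role) tuples, where role is 'subject',
    'finite_verb' or None. All codes and labels are hypothetical.
    """
    if not any(role == "finite_verb" for _, _, role in words):
        return "non-clause"  # no finite verb: not specified for language
    core_langs = {
        lang for _, lang, role in words
        if role in ("subject", "finite_verb")
    }
    # Assumed rule: Danish only if subject and finite verb are both Danish.
    return "danish" if core_langs == {"da"} else "non-danish"


# Invented toy utterances, not taken from either corpus.
examples = {
    "reply": [("yes", "en", None)],
    "danish": [("jeg", "da", "subject"), ("bor", "da", "finite_verb"),
               ("here", "en", None)],
    "english": [("I", "en", "subject"), ("live", "en", "finite_verb"),
                ("here", "en", None)],
}
labels = {name: classify_utterance(ws) for name, ws in examples.items()}
```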