1 of 16

Linguistic Data: Quantitative Analysis and Visualisation

Lecture 1. Introduction

Olga Lyashevskaya

olesar@yandex.ru

HSE University, Moscow, MA program in Computational Linguistics

2 of 16

Objectives

Examining linguistic data

  • Descriptive stats
  • Data visualization
  • Hypothesis testing
  • Data modeling

Doing linguistic research

  • Data collection
  • Research design
  • Data interpretation

Practice in

3 of 16

Focus on

  • Real linguistic data
  • Current trends in linguistic research
    • phonetic studies
    • grammar & lexicon studies
    • typological research
    • sociolinguistic research
    • psycholinguistics & language acquisition

  • Corpus data & Experimental data

  • Lots of practice in

4 of 16

Some case studies

5 of 16

probabilistic models of language

  • Zipf’s law related research

The correlation of frequency and range of words in the spoken BNC (Gries to appear)
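A minimal sketch of how frequency and range can be counted and correlated. The four "corpus parts" are invented toy sentences standing in for BNC files, range is taken to mean the number of parts in which a word occurs, and correlating the logged values with Pearson's r is an assumption made for illustration, not necessarily the measure used in the cited study.

```python
from collections import Counter
from math import log, sqrt

# Invented toy corpus parts (assumption: they stand in for corpus files).
parts = [
    "the cat sat on the mat and the dog sat too",
    "the dog and the cat saw the bird",
    "a bird sat on the mat",
    "the cat and a dog ran",
]

freq = Counter()        # total occurrences of each word
word_range = Counter()  # number of parts in which each word occurs
for part in parts:
    tokens = part.split()
    freq.update(tokens)
    word_range.update(set(tokens))  # count each word once per part


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


words = sorted(freq)
r = pearson(
    [log(freq[w]) for w in words],
    [log(word_range[w]) for w in words],
)

# Rank-frequency list, the starting point for a Zipf-style plot.
ranked = freq.most_common()
print(ranked[:3])
print(f"r(log frequency, log range) = {r:.3f}")
```

On real corpus data the same counts would be plotted on log-log axes; the toy text is far too small to say anything about Zipf's law itself.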

6 of 16

probabilistic models of language

  • Kolmogorov, Prokhorov, Gasparov on rhyming in language

7 of 16

descriptive statistics, confirmatory

The distribution of different types of NPs across subject/non-subject slots (Aarts 1971: table 4.5, cited by Gries 2015:61)

Subject slots prefer structurally lighter NPs: subjects are pronouns/names 86.2 percent of the time (5821/6749 = 0.862), whereas non-subjects are pronouns/names only 46 percent of the time (2193/4770 = 0.4597); this is extremely unlikely if there is no correlation between subjecthood and NP lightness.
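The "extremely unlikely" claim can be checked with a sketch like the one below. It collapses the NP types into two classes (pronoun/name vs. everything else), derives the "other" cells by subtraction from the totals quoted above, and uses a Pearson chi-squared test without continuity correction. The choice of test is an assumption; the slide does not say which test the source uses.

```python
from math import erfc, sqrt

# Counts quoted on the slide.
subj_light, subj_total = 5821, 6749
nonsubj_light, nonsubj_total = 2193, 4770

# 2x2 table; the second column is derived by subtraction (assumption:
# every NP that is not a pronoun/name counts as "other").
table = [
    [subj_light, subj_total - subj_light],
    [nonsubj_light, nonsubj_total - nonsubj_light],
]

p_subj = subj_light / subj_total
p_nonsubj = nonsubj_light / nonsubj_total

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected.
chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (table[i][j] - expected) ** 2 / expected

# For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
p_value = erfc(sqrt(chi2 / 2))

print(f"subjects: {p_subj:.3f}, non-subjects: {p_nonsubj:.3f}")
print(f"chi-squared = {chi2:.1f}, p = {p_value:.3g}")
```

With a statistic this large the p-value may underflow to zero in floating point, which is consistent with the slide's wording.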

8 of 16

multifactorial, confirmatory

  • Chuang, Baayen et al. 2019: Geographical variation of the merging between dental and retroflex sibilants in Taiwan Mandarin

Mandarin          Min language
/s/    /ș/        /s/
/ts/   /tș/       /ts/
/tsh/  /tșh/      /tsh/

[Maps] geodata: Min fluency; geodata: Sibilant merging

9 of 16

Types of variables

  • numeric
    • F0 (fundamental frequency in phonetics), response time
  • binary or categorical
  • ordinal/scale (e.g. the etymological age of a verb, as predicted in Baayen 2008)
  • frequency counts (e.g. how often particular disfluencies occur in particular syntactic environments)
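The variable types above call for different summaries. The sketch below illustrates this with invented toy values; all variable names, values, and the three age labels are hypothetical and are not taken from any of the cited datasets.

```python
from collections import Counter
from statistics import mean

# Numeric: invented F0 measurements in Hz, summarised by the mean.
f0_hz = [182.4, 210.7, 195.1, 176.3]
mean_f0 = mean(f0_hz)

# Binary: invented subject/non-subject flags, summarised by a proportion.
is_subject = [True, False, True, True]
share_subject = sum(is_subject) / len(is_subject)

# Categorical: invented NP types, summarised by the most frequent category.
np_type = ["pronoun", "name", "pronoun", "full NP"]
modal_np_type = Counter(np_type).most_common(1)[0][0]

# Ordinal: hypothetical age classes with a fixed order; the median is
# meaningful, the mean is not. An odd number of observations keeps the
# median a single class.
AGE_ORDER = ["recent", "intermediate", "old"]
etym_age = ["old", "recent", "old", "intermediate", "old"]
ranks = sorted(AGE_ORDER.index(a) for a in etym_age)
median_age = AGE_ORDER[ranks[len(ranks) // 2]]

# Frequency counts: invented disfluency counts per syntactic environment,
# non-negative integers usually summarised as totals or rates.
disfluencies = {"clause-initial": 14, "clause-medial": 5, "clause-final": 2}
total_disfluencies = sum(disfluencies.values())

print(mean_f0, share_subject, modal_np_type, median_age, total_disfluencies)
```

The type of the response variable also constrains the choice of model later on, which is why it is worth fixing before any analysis.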

10 of 16

exploratory (hypothesis-generating)

  • clustering

Janda & Solovyev 2009

11 of 16

exploratory (hypothesis-generating)

  • clustering

languages

Hartmann, Haspelmath & Cysouw (2014) Identifying semantic role clusters and alignment types via microrole coexpression tendencies
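As a rough illustration of what clustering does, the sketch below groups five hypothetical items by the similarity of their usage profiles. The profiles are invented, and average-linkage agglomerative clustering with Euclidean distance is chosen only for simplicity; the studies cited on these slides may use different data, distance measures, and algorithms.

```python
from itertools import combinations
from math import dist

# Invented profiles: each item's share of uses across three contexts.
profiles = {
    "A": (0.70, 0.20, 0.10),
    "B": (0.65, 0.25, 0.10),
    "C": (0.10, 0.30, 0.60),
    "D": (0.15, 0.25, 0.60),
    "E": (0.30, 0.40, 0.30),
}


def average_linkage(c1, c2):
    """Mean Euclidean distance over all pairs drawn from two clusters."""
    total = sum(dist(profiles[a], profiles[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def agglomerate(items, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [(item,) for item in items]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


result = agglomerate(sorted(profiles), 3)
print(result)
```

Because no grouping is specified in advance, the output is a hypothesis about structure in the data, which is what makes the method exploratory.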

12 of 16

exploratory (hypothesis-generating)

Dimensionality reduction

13 of 16

exploratory (hypothesis-generating)

  • Dimensions of variation in Biber (1988)

14 of 16

individual variation

mixed-effects models

15 of 16

Course Grades

  • 40% -- course project

data collection (data), research design, descriptive statistics and data modeling, interpretation (written project paper), oral defence of the project (exam)

  • 60% -- homework assignments
    • Rmd files submitted via GitHub (Classroom), DataCamp report
    • deadline: before the next seminar
    • late submission
      • -10% if less than a week late
      • -70% if later, any time before the exam

16 of 16

Core Literature

  • Gries, Stefan (2013). Statistics for Linguistics with R: A Practical Introduction (2nd revised edition). Berlin: De Gruyter Mouton.
  • Levshina, Natalia (2015). How to Do Linguistics with R: Data Exploration and Statistical Analysis. Amsterdam: John Benjamins Publishing Company.
  • Baayen, Harald (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics. Cambridge: Cambridge University Press.

  • Gries, Stefan (2017). Quantitative Corpus Linguistics with R: A Practical Introduction (2nd edition). Milton Park, Abingdon, Oxon: Routledge.
  • Wickham, Hadley (2016). ggplot2: Elegant Graphics for Data Analysis. Springer.
  • Harney, H. L. (2016). Bayesian Inference: Data Evaluation and Decisions (2nd edition). Springer.
  • McElreath, R. (2016). Statistical Rethinking: A Bayesian Course with Examples in R and Stan.