1 of 16

Linguistic Data: Quantitative Analysis and Visualisation

Lecture 1. Introduction

Olga Lyashevskaya

olesar@yandex.ru

HSE University, Moscow, MA program in Computational Linguistics

2 of 16

Objectives

Examining linguistic data

Descriptive stats
Data visualization
Hypothesis testing
Data modeling

Doing linguistic research

Data collection
Research design
Data interpretation

Practice in

3 of 16

Focus on

Real linguistic data
Current trends in linguistic research

phonetic studies
grammar & lexicon studies
typological research
sociolinguistic research
psycholinguistics & language acquisition

Corpus data & Experimental data

Lots of practice in

4 of 16

Some case studies

5 of 16

probabilistic models of language

Zipf’s law related research

The correlation of frequency and range of words in the spoken BNC (Gries to appear)
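A minimal sketch of how frequency and range could be computed and correlated. It assumes that "range" means the number of texts (corpus parts) in which a word occurs; the three toy texts and the helper names `frequency_and_range` and `pearson` are hypothetical and are not taken from the BNC study.

```python
from collections import Counter

# Hypothetical toy corpus: three tiny "texts" (not BNC data).
texts = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog",
]

def frequency_and_range(docs):
    """Return two Counters: token frequency and range (number of docs containing the word)."""
    freq = Counter()
    rng = Counter()
    for doc in docs:
        tokens = doc.split()
        freq.update(tokens)        # every token counts towards frequency
        rng.update(set(tokens))    # each word counts once per text towards range
    return freq, rng

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

freq, rng = frequency_and_range(texts)
words = sorted(freq)
r = pearson([freq[w] for w in words], [rng[w] for w in words])
print(f"r = {r:.3f}")
```

On real corpus data both variables are heavily skewed, so a rank-based correlation or a log transformation may be a more sensible choice than plain Pearson.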

6 of 16

probabilistic models of language

Kolmogorov, Prokhorov, Gasparov on rhyming in language

7 of 16

descriptive statistics, confirmatory

The distribution of different types of NPs across subject/non-subject slots (Aarts 1971: table 4.5, cited by Gries 2015:61)

Subject slots prefer structurally lighter NPs: subjects are pronouns/names 86.2 percent of the time (5821/6749 = 0.862), whereas non-subjects are pronouns/names only 46 percent of the time (2193/4770 = 0.4597); this is extremely unlikely if there is no correlation between subjecthood and NP lightness.
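The "extremely unlikely" claim can be checked with a chi-squared test of independence. The sketch below is an illustration under assumptions, not the analysis from the cited sources: it collapses all NP types other than pronouns/names into one "other" column (total minus pronouns/names), uses no continuity correction, and the function name `chi_square_2x2` is hypothetical.

```python
import math

# Counts from the slide (Aarts 1971, cited by Gries 2015: 61).
subj_light, subj_total = 5821, 6749
nonsubj_light, nonsubj_total = 2193, 4770

# Assumption: every NP that is not a pronoun/name goes into one "other" column.
observed = [
    [subj_light, subj_total - subj_light],
    [nonsubj_light, nonsubj_total - nonsubj_light],
]

def chi_square_2x2(table):
    """Pearson chi-squared statistic and p-value (df = 1) for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # With one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

chi2, p_value = chi_square_2x2(observed)
print(f"subjects:     {subj_light / subj_total:.3f}")
print(f"non-subjects: {nonsubj_light / nonsubj_total:.3f}")
print(f"chi-squared = {chi2:.1f}, p = {p_value:.3g}")
```

The statistic is so large here that the p-value underflows to zero in floating point, which matches the slide's wording.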

8 of 16

multifactorial, confirmatory

Chuang, Baayen et al. 2019: Geographical variation of the merging between dental and retroflex sibilants in Taiwan Mandarin

Mandarin            Min language
/s/     /ș/         /s/
/ts/    /tș/        /ts/
/ts^h/  /tș^h/      /ts^h/

Maps (geodata): Min fluency; Sibilant merging

9 of 16

Types of variables

numeric

F0 (fundamental frequency in phonetics), response time

binary or categorical
ordinal/scale (e.g. in the task to predict the etymological age of a verb (Baayen 2008))
frequency counts (task to predict how frequently particular disfluencies happen in particular syntactic environments)
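A small sketch of why the variable type matters: each type calls for a different summary. All values below are hypothetical toy data invented for illustration, and the variable names and the numeric coding of the ordinal classes are assumptions.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical toy values, invented for illustration only.
response_time_ms = [512.0, 480.5, 630.2, 555.1]  # numeric
voiced = [True, False, True, True]               # binary
age_class = [1, 3, 2, 3]                         # ordinal (ranked classes; coding assumed)
disfluency_count = [0, 2, 1, 0]                  # frequency counts per environment

summary = {
    "mean_rt": mean(response_time_ms),            # a mean is meaningful for numeric data
    "voiced_counts": Counter(voiced),             # categories are tabulated, not averaged
    "median_age_class": median(age_class),        # ranks support a median, not a mean
    "total_disfluencies": sum(disfluency_count),  # counts are non-negative integers
}

for name, value in summary.items():
    print(name, value)
```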

10 of 16

exploratory (hypothesis-generating)

clustering

Janda & Solovyev 2009

11 of 16

exploratory (hypothesis-generating)

clustering

languages

Hartmann, Haspelmath & Cysouw (2014) Identifying semantic role clusters and alignment types via microrole coexpression tendencies

12 of 16

exploratory (hypothesis-generating)

Dimensionality reduction

13 of 16

exploratory (hypothesis-generating)

Dimensions of variation in Biber (1988)

14 of 16

individual variation

mixed-effects models

15 of 16

Course Grades

40% -- course project

data collection (data), research design, descriptive statistics and data modeling, interpretation (written project paper), oral defence of the project (exam)

60% -- homework assignments

Rmd files submitted via GitHub (Classroom), DataCamp report
deadline: before the next seminar
late submission

-10% less than a week later
-70% any time before the exam

16 of 16

Core Literature

Gries, Stefan (2013). Statistics for Linguistics with R: A Practical Introduction (2nd revised edition). Berlin: De Gruyter Mouton. eBook
Levshina, Natalia (2015). How to Do Linguistics with R: Data Exploration and Statistical Analysis. Amsterdam: John Benjamins Publishing Company. eBook
Baayen, Harald (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics. Cambridge UP. pdf

Gries, Stefan (2017). Quantitative Corpus Linguistics with R: A Practical Introduction (2nd edition). Milton Park, Abingdon, Oxon: Routledge. eBook
Wickham, Hadley (2016). ggplot2: Elegant Graphics for Data Analysis. Springer. eBook
Harney, H. L. (2016). Bayesian Inference: Data Evaluation and Decisions (2nd edition). Springer. eBook
McElreath, R. (2016). Statistical Rethinking: A Bayesian Course with Examples in R and Stan. eBook