Linguistic Data: Quantitative Analysis and Visualisation
Lecture 1. Introduction
Olga Lyashevskaya
HSE University, Moscow, MA program in Computational Linguistics
Objectives
Examining linguistic data
Doing linguistic research
Practice in
Focus on
Some case studies
probabilistic models of language
The correlation of frequency and range of words in the spoken BNC (Gries to appear)
probabilistic models of language
descriptive statistics, confirmatory
The distribution of different types of NPs across subject/non-subject slots (Aarts 1971: table 4.5, cited by Gries 2015:61)
Subject slots prefer structurally lighter NPs: subjects are pronouns/names 86.2 percent of the time (5821/6749 = 0.862), whereas non-subjects are pronouns/names only 46 percent of the time (2193/4770 = 0.4597); this is extremely unlikely if there is no correlation between subjecthood and NP lightness.
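The proportions quoted above can be re-derived from the four counts. The slide does not name a test, so the sketch below assumes a Pearson chi-squared test of independence on the 2x2 table (light vs. other NPs, subject vs. non-subject). The function name and the "other" cell counts (totals minus pronouns/names) are illustrative, not taken from the source.

```python
import math

# Counts from the slide (Aarts 1971, cited by Gries 2015: 61)
subj_light, subj_total = 5821, 6749          # pronouns/names in subject slots
nonsubj_light, nonsubj_total = 2193, 4770    # pronouns/names in non-subject slots

# Share of light NPs (pronouns/names) in each slot
p_subj = subj_light / subj_total
p_nonsubj = nonsubj_light / nonsubj_total

# 2x2 contingency table: rows = slot, columns = (light NP, other NP).
# The "other" cells are derived here as total minus light (an assumption).
table = [
    [subj_light, subj_total - subj_light],
    [nonsubj_light, nonsubj_total - nonsubj_light],
]


def chi_squared_independence(observed):
    """Pearson chi-squared statistic for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected
    return chi2


chi2 = chi_squared_independence(table)

# For 1 degree of freedom only: P(X > chi2) = erfc(sqrt(chi2 / 2))
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"subjects:     {p_subj:.3f}")
print(f"non-subjects: {p_nonsubj:.3f}")
print(f"chi-squared = {chi2:.1f}, df = 1, p = {p_value:.3g}")
```

The statistic is very large for one degree of freedom, so the p-value falls below floating-point precision; this matches the slide's wording that such a difference is extremely unlikely under independence.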
multifactorial, confirmatory
Geographical variation of the merging between dental and retroflex sibilants in Taiwan Mandarin
/s/ /ʂ/ /s/
/ts/ /tʂ/ /ts/
/tsʰ/ /tʂʰ/ /tsʰ/
Mandarin and Min language geodata: Min fluency, sibilant merging
Types of variables
exploratory (hypothesis-generating)
Janda & Solovyev 2009
exploratory (hypothesis-generating)
languages
Hartmann, Haspelmath & Cysouw (2014) Identifying semantic role clusters and alignment types via microrole coexpression tendencies
exploratory (hypothesis-generating)
Dimensionality reduction
exploratory (hypothesis-generating)
individual variation
mixed-effects models
Course Grades
data collection (data), research design, descriptive statistics and data modeling, interpretation (written project paper), oral defence of the project (exam)
Core Literature