Data Analytics for Students of Social Studies and Humanities (NPFL134)
Glossary of common terms
Under construction, last update: July 27, 2022

term
description

A
attribute
variable, feature
annotation
adding information (e.g. labels or comments) to data, manually or automatically

C
classification
identification of group membership, e.g. e-mail spam detection
Corpus WorkBench Query Language (CQL)
a query language for linear (non-structured) linguistic data such as tokenized and lemmatized texts
corpus
an electronic collection of written and/or spoken texts

D
data
information in digital form for computer processing (e.g. text, audio, video, image, software)
data set
a set of existing data that could be used to answer research questions
digitization
the process of converting information into a digital format
directive

F
FAIR data principles
Findability, Accessibility, Interoperability, Reusability
feature
a property of real-world objects that can be observed or measured, mostly used in Computer Science, cf. variables in Mathematics and predictors in Statistics

I
intercoder reliability
a measure indicating how well two or more coders reached the same judgments in coding the data (also known as Inter-rater reliability or Intercoder agreement) https://en.wikipedia.org/wiki/Inter-rater_reliability

K
KonText
a concordance system (concordancer), https://lindat.cz/kontext

L
lemma
the basic form of a word (token), e.g. "mother" is the lemma of "mothers", "be" is the lemma of "were"
licence
an agreement by which an intellectual property holder (licensor) grants the right of use to another party (licensee)

M
Machine Learning
the research field that gives computers the capability to learn from examples
MALACH Centre for Visual History
the centre provides access to large archives of audiovisual oral history testimonies of genocide survivors (mostly Holocaust), https://ufal.mff.cuni.cz/malach/en
metadata
data about data

N
NameTag
a system for named-entity recognition, http://lindat.cz/services/nametag
Natural Language Processing (NLP)
the research field concerned with human/machine interaction through natural (human) language
n-gram
a contiguous sequence of n items (e.g. words, characters) in a text or speech
n-gram, unigram
n=1, e.g. the sentence "From time to time." contains three different word unigrams: from, time, to
n-gram, bigram
n=2, e.g. the sentence "From time to time." contains three word bigrams: from time; time to; to time

O
Optical Character Recognition (OCR)
the automatic conversion of images of typed, handwritten or printed text into machine-readable text

P
PERO
an application for automatic transcription of several types of printed and handwritten documents with support for Czech, https://pero-ocr.fit.vutbr.cz/index
PML-TQ
Prague Markup Language - Tree Query, a powerful query system for collections of structured linguistically annotated data, https://ufal.mff.cuni.cz/pmltq

R
R system
a system for statistical computing (incl. machine learning), https://www.r-project.org
regression
prediction of a continuous response, e.g. predicting the temperature for the coming days
repository, data
a digital infrastructure to share data, i.e. to preserve data and help others to find them
repository, LINDAT
repository of both LINguistic DATa and non-linguistic content, https://lindat.cz/

S
stop words
a set of commonly used words in a language; in text mining, stop words are usually excluded from analysis because they carry very little useful information

T
Tableau
a visual data analytics platform, https://www.tableau.com/
tag, morphological
a linguistic markup that captures morphological properties of individual tokens in a text, such as part of speech, case, number, etc.
TEITOK
a web-based platform for viewing, creating, and editing text corpora with both rich textual mark-up and linguistic annotation, http://www.teitok.org
topic modeling
identifying clusters or recurring patterns of co-occurring words (called 'topics')
tree, dependency
a structure capturing syntactic relations in a text, e.g. subject, object, predicate

U
UDPipe
a system for linguistic processing of texts, https://lindat.cz/services/udpipe

V
Voyant
a web-based system of textual-analysis tools, https://voyant-tools.org
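
Worked example for the n-gram entries: the unigram and bigram counts given for the sentence "From time to time." can be checked with a few lines of Python. This is only a minimal sketch and not part of the course materials. The helper names tokenize and ngrams are hypothetical, and the tokenizer is deliberately naive (it lowercases, splits on whitespace and strips surrounding punctuation), which is an assumption that a real tokenizer would handle more carefully.

```python
import string
from collections import Counter


def tokenize(sentence):
    """Naive tokenizer (assumption): lowercase, split on whitespace,
    strip punctuation from both ends of each word."""
    tokens = []
    for raw in sentence.split():
        word = raw.strip(string.punctuation).lower()
        if word:
            tokens.append(word)
    return tokens


def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens, in text order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("From time to time.")
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

# Distinct unigrams: from, time, to ("time" occurs twice in the sentence)
print(sorted(set(unigrams)))   # [('from',), ('time',), ('to',)]
print(Counter(unigrams)[("time",)])   # 2

# Bigrams: from time; time to; to time
print(bigrams)   # [('from', 'time'), ('time', 'to'), ('to', 'time')]
```

With n=3 the same function would return the two trigrams of the sentence, which shows that a text of k tokens contains k-n+1 n-grams when repetitions are counted.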