1 of 84

Welcome to the Language Bank of Finland!

This document is licensed under the Creative Commons Attribution 4.0 International license. Contents were produced by members of the Language Bank team in FIN-CLARIN (Mietta Lennes, Krister Lindén, Tero Aalto, Sam Hardwick).

Mietta Lennes, FIN-CLARIN, mietta.lennes@helsinki.fi

2 of 84

Today’s topic:

  • Data Management: Things you should know when collecting, studying and sharing (language) data

  • Discussions in small groups
  • Flinga:  https://edu.flinga.fi/s/EK6HE36

www.kielipankki.fi

9.6.2023

3 of 84

Program

  • Mietta Lennes, Kielipankki – The Language Bank of Finland: Corpora, tools, and support for sharing language data

  • Jouni Tuominen, Helsinki Institute for Social Sciences and Humanities (HSSH)
    • Small group discussion in breakout rooms

  • Maija Paavolainen, Data Support at the University of Helsinki
    • General discussion

4 of 84

5 of 84

Introducing CLARIN

6 of 84

https://www.kielipankki.fi

7 of 84

CLARIN ERIC

[ … ]

International cooperation and sharing of resources for Humanities and Social Sciences

European Research Infrastructure Consortium

founded on February 29, 2012

Member countries (21): Austria, Bulgaria, Croatia, Cyprus, Czech Republic, Denmark, Estonia, Finland, Germany, Greece, Hungary, Iceland, Italy, Latvia, Lithuania, The Netherlands, Norway, Poland, Portugal, Slovenia, Sweden

Observers (3): France, South Africa, United Kingdom

Third party (1): CMU (USA)

Updated: 24.3.2021

8 of 84

Member countries (22): Austria, Belgium, Bulgaria, Croatia, Cyprus, Czech Republic, Denmark, Estonia, Finland, Germany, Greece, Hungary, Iceland, Italy, Latvia, Lithuania, The Netherlands, Norway, Poland, Portugal, Slovenia, Sweden

Observers (2): South Africa, United Kingdom

Third party (1): CMU (USA)

Updated: 14.4.2022

CLARIN centres

9 of 84

CLARIN centres

Kielipankki – The Language Bank of Finland

B-centre FIN-CLARIN

10 of 84

www.kielipankki.fi

11 of 84

Users of the Language Bank

  • Researchers from all fields welcome!
  • Many corpora are available even without signing in.
  • FIN-CLARIN can help you store and distribute your own corpus.

12 of 84

Researchers of the Month

13 of 84

FAIR data

Findable

Accessible

Interoperable

Re-usable

14 of 84

https://vlo.clarin.eu

15 of 84

https://www.clarin.eu/resource-families

16 of 84

How to locate and access resources

17 of 84

https://www.kielipankki.fi/corpora

18 of 84

https://www.kielipankki.fi/corpora

link to metadata

link to license

link to location

link to resource group page

citation instructions

19 of 84

Citation instructions

20 of 84

Remember to check the version of the resource!

Details about, e.g., Suomi24 versions can be found via the page of the resource group.

21 of 84

Resource group page (Wanca)

Multiple versions of the same resource

published in different means of publication

22 of 84

Resource-specific metadata

Access location (e.g., Korp, download service, or similar)

Persistent identifier (use this in citations!)

23 of 84

korp.csc.fi: Suomi24 (2001-2017)

24 of 84

Download service, http://www.kielipankki.fi/download

25 of 84

Downloadable VRT format, extracted from Korp

structural elements

structural attributes

positional attributes
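A minimal sketch of how the three parts of a VRT file relate to each other, assuming the usual layout of one token per line with tab-separated positional attributes, and structural elements written as XML-like tags whose attributes are the structural attributes. The sample text, the column order (word, lemma, pos) and the element names are illustrative assumptions; check the documentation of the specific corpus for the actual attribute list. XML entity unescaping is not handled here.

```python
import re

# Hypothetical sample: real corpora define their own columns and elements.
SAMPLE_VRT = '''<text title="Example" year="2017">
<sentence id="1">
Minä\tminä\tPron
rakastan\trakastaa\tV
kahvia\tkahvi\tN
.\t.\tPunct
</sentence>
</text>
'''

ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
OPEN_RE = re.compile(r'<(\w+)((?:\s+\w+="[^"]*")*)\s*>')


def parse_vrt(text, columns=("word", "lemma", "pos")):
    """Yield (token, structures) pairs for every token line.

    token      -- dict of positional attributes for one token
    structures -- dict mapping each currently open structural element
                  to its structural attributes
    """
    open_structs = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("</"):
            # Closing tag: the structural element ends here.
            open_structs.pop(stripped[2:-1].strip(), None)
        elif (match := OPEN_RE.fullmatch(stripped)):
            # Opening tag: remember the element and its attributes.
            open_structs[match.group(1)] = dict(ATTR_RE.findall(match.group(2)))
        else:
            # Token line: positional attributes separated by tabs.
            token = dict(zip(columns, line.split("\t")))
            yield token, {name: dict(attrs) for name, attrs in open_structs.items()}


for token, structures in parse_vrt(SAMPLE_VRT):
    print(token["word"], token["lemma"], structures["text"]["year"])
```

Each token carries its own positional attributes, while the structural attributes (here the year of the text) are inherited from the enclosing elements.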

26 of 84

Downloadable speech corpora: Local use in Praat

27 of 84

Downloadable speech corpora: Local use in ELAN

28 of 84

Virtual Language Observatory (VLO), vlo.clarin.eu

29 of 84

CLARIN licence categories

Publicly available

Available for academic, logged-in users

Personal permission is required for access

30 of 84

More detailed licence conditions

+BY author must be cited

+NC non-commercial use only

+ID login is required

+PLAN research plan is required

+PRIV contains personal data, users must follow the resource-specific data protection terms and conditions

+NORED redistribution is not allowed

+DEP modified versions can be deposited for reuse via CLARIN services
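The categories and conditions above combine into a single licence label. The sketch below expands such a label into readable terms. The category codes PUB, ACA and RES are the abbreviations commonly used for the three CLARIN categories (they are not spelled out on the slide), and the label layout "CLARIN ACA +NC +BY" is an assumption made for illustration; the descriptions are taken from the slides.

```python
CATEGORIES = {
    "PUB": "Publicly available",
    "ACA": "Available for academic, logged in users",
    "RES": "Personal permission is required for access",
}

CONDITIONS = {
    "BY": "author must be cited",
    "NC": "non-commercial use only",
    "ID": "login is required",
    "PLAN": "research plan is required",
    "PRIV": "contains personal data",
    "NORED": "redistribution is not allowed",
    "DEP": "modified versions can be deposited for reuse via CLARIN services",
}


def explain_licence(label: str) -> list[str]:
    """Expand a label such as 'CLARIN ACA +NC +BY' into readable lines."""
    tokens = label.split()
    if tokens and tokens[0].upper() == "CLARIN":
        tokens = tokens[1:]
    if not tokens or tokens[0] not in CATEGORIES:
        raise ValueError(f"unknown licence category in {label!r}")
    lines = [CATEGORIES[tokens[0]]]
    for token in tokens[1:]:
        code = token.lstrip("+")
        if code not in CONDITIONS:
            raise ValueError(f"unknown licence condition {token!r}")
        lines.append(f"+{code}: {CONDITIONS[code]}")
    return lines


for line in explain_licence("CLARIN ACA +NC +BY"):
    print(line)
```

The actual licence text of a resource always takes precedence over a summary like this one.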

31 of 84

The Suomi 24 Sentences Corpus 2001-2020, Korp version: Simple search: rakastaa (’to love’, verb)

32 of 84

The Suomi 24 Sentences Corpus 2001-2020, Korp version: Simple search: rakastaa (’to love’, verb)

33 of 84

Word picture

34 of 84

Extended search:

rakastaa, followed by a direct object
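Korp's extended search is built on CQP-style queries, where each token is written as a bracketed set of attribute conditions. The sketch below assembles such a query string for the example above. The attribute names `lemma` and `deprel` and the value `dobj` are assumptions for illustration; the attributes and dependency labels actually available depend on the corpus and its annotation scheme.

```python
def token(**attrs: str) -> str:
    """Build one CQP token expression, e.g. [lemma="rakastaa"].

    With no conditions the result is [], which matches any token.
    Values are inserted as-is, so they must not contain double quotes.
    """
    conditions = " & ".join(f'{name}="{value}"' for name, value in attrs.items())
    return f"[{conditions}]"


def sequence(*tokens: str) -> str:
    """Join token expressions into a query for adjacent tokens."""
    return " ".join(tokens)


# Hypothetical query: the verb rakastaa immediately followed by a token
# whose dependency relation is labelled as a direct object.
query = sequence(token(lemma="rakastaa"), token(deprel="dobj"))
print(query)
```

A query like this only finds objects directly after the verb; objects placed elsewhere in the sentence would need a different query.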

35 of 84

36 of 84

Statistics: rakastaa + direct object

37 of 84

The Suomi 24 Sentences Corpus 2001-2020, Korp version:

Trend diagram: sairaus (’illness’)

38 of 84

The Suomi 24 Sentences Corpus 2001-2020, Korp version:

sairaus (’illness’) vs. korona (’corona’)

(showing approx. the years 2017-2020)     

sairaus

korona

39 of 84

Downloading a concordance from Korp

40 of 84

41 of 84

42 of 84

43 of 84

44 of 84

Plenary Sessions of the Parliament of Finland

  • Plenary Sessions of the Parliament of Finland, Kielipankki Korp Version 1.1 contains the transcripts of the plenary sessions held between 10.9.2008 and 1.7.2016, accessible via the Korp search and analysis tool.
    The Parliament of Finland (2017). Plenary Sessions of the Parliament of Finland, Helsinki Korp Version 1.1 [text corpus]. Kielipankki. Retrieved from http://urn.fi/urn:nbn:fi:lb-2017020202
  • Some search results also contain a link to the video recording

45 of 84

Plenary Sessions of the Parliament of Finland: maahanmuuttaja ’immigrant’ (all forms)

46 of 84

Plenary Sessions of the Parliament of Finland: maahanmuuttaja ’immigrant’

Click on link to show video!

47 of 84

Plenary Sessions of the Parliament of Finland, Korp version 1.5: Link to video

48 of 84

Compile statistics from search results in Korp

49 of 84

Statistics: any form of the word maahanmuuttaja (’immigrant’) vs. speaker’s role and parliamentary group

50 of 84

Researcher of the Month 3/2017

Tommi Jantunen

Professor of Finnish Sign Language, University of Jyväskylä

Research interests:

  • Structure of sign language from various perspectives
  • Applications of computer vision and motion capture technologies in the study of video corpora

51 of 84

Corpus of Finnish Sign Language (CFinSL)

  • Finnish Sign Language material collected in the CFINSL project
  • Over 15 hours of video files from six camera angles
  • Elicited and spontaneous signing
  • Annotations in ELAN format

University of Jyväskylä, Sign Language Centre (2019). Corpus of Finnish Sign Language [sign language corpus]. Kielipankki. Retrieved from http://urn.fi/urn:nbn:fi:lb-2019012321
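The citations on these slides follow one pattern: author, year, title, resource type, publisher and persistent identifier. A small sketch of assembling such a citation from metadata fields is shown below. The `Resource` class and its field names are hypothetical; the values are copied from the citation above. Always prefer the citation given on the resource's own metadata page.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """Hypothetical container for the metadata needed in a citation."""
    author: str
    year: int
    title: str
    resource_type: str
    pid: str  # persistent identifier (URN), not the browser address

    def citation(self) -> str:
        return (
            f"{self.author} ({self.year}). {self.title} "
            f"[{self.resource_type}]. Kielipankki. Retrieved from {self.pid}"
        )


cfinsl = Resource(
    author="University of Jyväskylä, Sign Language Centre",
    year=2019,
    title="Corpus of Finnish Sign Language",
    resource_type="sign language corpus",
    pid="http://urn.fi/urn:nbn:fi:lb-2019012321",
)
print(cfinsl.citation())
```

Because different versions of a resource have different identifiers, citing the persistent identifier also records which version was used.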

52 of 84

The CFINSL corpus: elicited narratives

53 of 84

54 of 84

What kind of data can be deposited?

  • Text or speech in any natural language
    • text documents, audio and video recordings
    • annotations and transcripts may be included
    • must be in useful and well-described formats

  • Make sure you have the rights to distribute the data, at least for research purposes
    • Copyrights
    • Personal data

Contact FIN-CLARIN for details

fin-clarin@helsinki.fi

55 of 84

CLARIN license categories

Publicly available

Available for academic, logged-in users

Personal permission is required for access

Language Bank Rights, https://lbr.csc.fi

56 of 84

More detailed license conditions

+BY author must be cited

+NC non-commercial use only

+ID login is required

+PLAN research plan is required

+PRIV contains personal data

+NORED redistribution is not allowed

+DEP modified versions can be redistributed via CLARIN

and other resource-specific conditions, if required

57 of 84

Start collecting metadata

  • Document and describe your data:
    • Title
    • Brief description
    • Size
    • Citation: who are the authors/compilers that should be cited?
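One lightweight way to start is to keep the description in a structured record from the first day of the project and to check it for gaps. The sketch below assumes a plain dictionary with the four items listed above; the field names and the example record are hypothetical and are not the fields of the Language Bank's own description form.

```python
REQUIRED_FIELDS = ("title", "description", "size", "citation")


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in the record."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


# Hypothetical draft record for a corpus that is still being collected.
draft = {
    "title": "Interview corpus (working title)",
    "description": "",
    "size": "12 hours of audio",
}
print(missing_fields(draft))
```

Running the check regularly makes it easier to fill in details, such as who should be cited, while the people involved still remember them.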

58 of 84

Storage and backups

  • Who should have access?
  • Do you need to protect the material?
    • e.g., personal data
  • Encrypted packages for transfer & safekeeping?

59 of 84

Intellectual Property Rights, e.g., copyright and related rights

  • Is your primary data copyrighted? (e.g., original text, translations)
    • Material may be copyrighted even without a “copyright notice”!
    • Who has rights? Authors, translators, publishers, …
  • Distinguish from related/neighboring rights, e.g.:
    • Performers’ rights
    • Databases
    • Photographs

60 of 84

Ask for permission

  • Make sure you are allowed to use the (copyrighted) material for your research and to share it (for research & education, via Kielipankki).
    • Many social media platforms are “difficult” data sources, since they often involve very specific license agreements that may prohibit further distribution (even for scientific research).
  • Ask FIN-CLARIN for help, if needed!

61 of 84

Personal data

  • Personal data is any information that can be used to identify a living natural person.
  • Processing covers almost any conceivable operation on personal data, including mere storage.

62 of 84

Identify the Data Controller

  • Processing personal data involves legal responsibilities.
  • A controller is an individual or organisation that determines the purposes and means of the processing of personal data.
  • The controller may be you (the researcher/Principal Investigator) personally, or in some cases the university or research organization where you work. Find out which applies in your case.
  • Follow the data controller’s instructions and decisions in processing personal data!

63 of 84

Legal basis in scientific research

  • In many cases, the legal basis for processing is “processing is necessary in the public interest” (Article 6(1)(e))
    • The research subject’s consent (tutkittavan suostumus) is not required, but the data subject must be informed
    • Prerequisites: appropriate research plan, responsible person, processing for research purposes only
    • Note: consent to participate is necessary in clinical research and may be necessary for research-ethical reasons

  • In some cases, the legal basis for processing is “consent to the processing of personal data” (Article 6(1)(a))
    • E.g. publication in an open database or research that is not in the public interest

64 of 84

Personal data in special categories

  • Special categories:
    • Information revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership
    • Genetic information
    • Biometric data for uniquely identifying a person
    • Health information
    • Information on sexual behavior and orientation
    • Information related to criminal convictions and violations or security measures

  • Processing is only allowed for scientific research, to fulfill statutory obligations, or with the consent of the data subject:
    • Process only if absolutely necessary!
    • Usually requires more stringent protective measures.

65 of 84

Minimize the amount of personal data!

  • Data minimization principle: process no more than what you need for your purpose
  • Pseudonymize or anonymize, if necessary
  • Read more in the Finnish Social Science Data Archive (FSD) Data Management Guidelines: https://www.fsd.tuni.fi/en/services/data-management-guidelines/
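As one possible illustration of pseudonymization, the sketch below replaces speaker identifiers in metadata with keyed codes (HMAC-SHA256 from the Python standard library). This is an assumed approach for illustration, not a procedure prescribed by the Language Bank or FSD. The function name, the `SPK` prefix and the 12-character code length are arbitrary choices.

```python
import hashlib
import hmac
import secrets


def make_pseudonymizer(secret_key: bytes, prefix: str = "SPK"):
    """Return a function that maps an identifier to a stable pseudonym.

    The same identifier always yields the same code under the same key,
    so records of one person stay linked without revealing the name.
    """
    def pseudonym(identifier: str) -> str:
        digest = hmac.new(
            secret_key, identifier.encode("utf-8"), hashlib.sha256
        ).hexdigest()
        return f"{prefix}_{digest[:12]}"

    return pseudonym


# Generate the key once and store it separately from the data,
# with access limited to those who may need to re-identify speakers.
key = secrets.token_bytes(32)
pseudonym = make_pseudonymizer(key)
print(pseudonym("speaker-001"))
```

Pseudonymized data still counts as personal data as long as the key or any other means of re-identification exists. Replacing identifiers in metadata also does not remove names or other identifying details mentioned inside the texts or recordings themselves.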

66 of 84

Further processing (e.g., for secondary research purposes)

  • GDPR does not automatically require the destruction of all personal data at the end of the research project:
    • “subsequent processing […] for scientific or historical research or statistical purposes is not considered incompatible with the original purposes”
  • Further processing must take into consideration the connection between the original purpose and the subsequent processing, the collection situation, the reasonable expectations of data subjects, etc.

67 of 84

The e-form for describing a new resource for the Language Bank of Finland:

http://urn.fi/urn:nbn:fi:lb-2021121422

68 of 84

Two types of agreements are usually needed:

  1. Deposition (distribution) license agreement (DELA)
    • The Language Bank (University of Helsinki) obtains the right to distribute the resource to end-users under specific terms and conditions.
    • A DELA is usually required unless the University of Helsinki is the rightholder or the resource is already available under a public license.

  2. End-User License Agreement (EULA)
    • The End-User agrees to use the data under specific conditions which the rightholder has approved.

69 of 84

Obtain restricted access via Language Bank Rights: https://lbr.csc.fi

70 of 84

Log in at https://lbr.csc.fi

71 of 84

Select resources

72 of 84

Add to basket

73 of 84

Fill in the application

In case the corpus contains personal data, the license may include specific personal data protection terms and conditions.

74 of 84

License

75 of 84

Processing and approval

76 of 84

Guidelines for processing personal data

77 of 84

Publish a link to the Privacy Notice regarding your processing of the personal data and inform the Language Bank: https://urn.fi/urn:nbn:fi:lb-2022052522

78 of 84

List of tools available via the Language Bank: www.kielipankki.fi/tools

79 of 84

Tools: Aalto-ASR

Automatic speech recognition and alignment

A demo tool is also available!

80 of 84

Demo tools: https://www.kielipankki.fi/tools/demo/

81 of 84

FAIR data

Findable

Accessible

Interoperable

Re-usable

82 of 84

FAIR data

Findable

Accessible

Interoperable

Re-usable

HRT / VRT

common formats

DOWNLOAD

Virtual Language Observatory

www.kielipankki.fi

Instructions for resource creators

Support for versions and variants

Long-term archiving (?)

Common processing tools

Deposition agreements

Language Bank Rights

Consistent metadata

+ Access location

PID

83 of 84

Online courses

Corpus Linguistics and Statistical Methods (5 cr)

Introduction to Speech Analysis (5 cr)

Data Clinic (5 cr)

The courses are open to all students and researchers within and outside the University of Helsinki, even abroad.

84 of 84

Kiitos! Tack! Thank you!

www.kielipankki.fi

General support

fin-clarin@helsinki.fi

Technical support

kielipankki@csc.fi