Introduction to Data Science Project Information 2024

Formal requirements

Further requirements

  • Every team must present the project in the poster session (Dec 13 at 2pm)
  • Every team must provide the grading instructor with access to the project code repository, hosted on either GitHub or Bitbucket
  • The project will be graded by the instructors and can receive at most 30 points
  • If the project gets X points then each team member gets X points
  • Getting at least 15 points for the project is a prerequisite for passing the course

Evaluation of projects

  • Projects will be evaluated after the project session
  • The grade consists of the following two components
  • Technical quality (15 points)
    • Your project can get 15 points for technical quality if you have:
      • stated clear objectives
      • applied relevant data mining methods to relevant data, and
      • achieved results that are sufficient, considering the required working hours
  • Presentational quality (15 points)
    • Your project can get 15 points for presentational quality if the poster:
      • explains well the motivation and objectives of the project
      • describes the data science methods used, so that others could in principle replicate your work
      • presents the main results of your work in a visually appealing and understandable way
    • and if your project code repository (GitHub or BitBucket) contains:
      • all the code as well as readme-files describing what the code does and how to run it

Instructions

  • There are the following options for choosing the topic of your project:
    • Choose a topic on the data offered by us
    • Choose a topic from Kaggle
    • Choose your own topic (on some open data or on any data you might have)
  • For each of those options the formal requirements differ slightly, so please look at the dedicated slide below
  • There can be multiple projects on the same topic

Choosing a topic from Kaggle

  • Kaggle (https://www.kaggle.com/) hosts many interesting machine learning competitions and datasets
  • You can choose to compete in one of the challenges (https://www.kaggle.com/competitions) or work on one of the datasets (https://www.kaggle.com/datasets)
  • For each Kaggle competition and dataset, ‘kernels’ are available in which people have explained a way to analyze the data and provided their code
  • You must declare all kernels that you use in your project

Choosing your own topic

  • You can freely choose your own topic, as long as it requires you to demonstrate mastery of some topics from the Introduction to Data Science course
  • In the readme-files of your code repository you must specify the origin of the data and provide a short description of the data
  • The data do not need to be public, but the grading instructor must be able to see the data, at least from your computer
  • Here are some more ideas to help you:
  • Finally, have a look at the project presentations from 2021 (different format than we will have):

Topics proposed to the IDS course:

Factors of success in abstract theoretical courses [Reimo Palm, University of Tartu, Institute of Computer Science]

Courses containing a lot of mathematics are traditionally hard for students. However, there are still some students who excel in them. The objective is to find out what these students do differently and whether there are activities or behavioral patterns that could be recommended to the other students to make such courses easier for them.

The dataset consists of coded activities of 119 students in the Theoretical Computer Science course each day throughout the whole semester. The activities are, for example, viewing study materials, submitting homework, participating in a lecture online, etc. The goals are: 1) predict the final grades of students based on their activity pattern; 2) discover activity patterns that separate high-performing students from low-performing students.
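As a minimal illustration of the two goals, a linear baseline on hypothetical activity counts might look like this (all data below is synthetic; the real dataset has daily coded activities for 119 students):

```python
import numpy as np

# Synthetic stand-in for the real dataset: 119 students, three activity
# types (e.g. viewing materials, submitting homework, attending lectures),
# aggregated to per-semester counts.
rng = np.random.default_rng(0)
activities = rng.poisson(lam=[3.0, 1.0, 2.0], size=(119, 3)).astype(float)

# Invented grades, loosely driven by homework submissions (column 1).
grades = np.clip(20 + 15 * activities[:, 1] + rng.normal(0, 5, 119), 0, 100)

# Goal 1 baseline: linear least-squares prediction of grades from activities.
X = np.column_stack([np.ones(len(activities)), activities])
coef, *_ = np.linalg.lstsq(X, grades, rcond=None)
predicted = X @ coef

# Goal 2 baseline: mean activity of high vs low performers (median split).
high = activities[grades >= np.median(grades)].mean(axis=0)
low = activities[grades < np.median(grades)].mean(axis=0)
print("high-performer mean activities:", high.round(2))
print("low-performer mean activities:", low.round(2))
```

A real attempt would replace the least-squares baseline with classifiers or sequence models over the daily activity patterns, but a baseline like this gives a reference point to beat.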

Supervision of construction works in municipalities [Jaanus Tamm, Tartu City Government]

One of the challenges of monitoring construction activities (buildings, facilities, roads) in local municipalities today is that detection of violations is largely random (mostly complaint-based). The main reason is the limited human resources available for supervision. Therefore, information about violations is often received after the fact, which does not allow a timely response.

We see a clear need to detect/identify violations arising from construction activities at the earliest possible stage.

In order for the timely detection of construction activity violations to be feasible with the given human resources, technical solutions and automated processes are needed to support and simplify supervision.

The city of Tartu receives an orthophoto of the city twice a year (once the photo is taken by the Land Agency, and the second time the city orders a survey using drone flights).

We can see that this dataset could be used to detect discrepancies between the orthophotos and the data in the Building Register (https://livekluster.ehr.ee/ui/ehr/v1) and city databases using machine learning / AI. One example is buildings erected without permission, as there is no information about them in the register.
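A minimal sketch of the discrepancy idea, assuming both the register footprints and the orthophoto-derived building detections have been rasterized onto a common grid (the masks below are hand-made placeholders, not real detections):

```python
import numpy as np

# Hand-made placeholder masks on a common raster grid.
# register_mask: building footprints according to the Building Register.
# photo_mask: buildings detected in the orthophoto (in practice this would
# come from an image-segmentation model).
register_mask = np.zeros((100, 100), dtype=bool)
register_mask[10:30, 10:30] = True      # a registered building

photo_mask = register_mask.copy()
photo_mask[60:75, 40:60] = True         # an unregistered structure

# Candidate violations: built-up area visible in the photo but missing
# from the register. (In practice, small regions would be filtered out
# to suppress detection noise.)
unregistered = photo_mask & ~register_mask
ys, xs = np.nonzero(unregistered)
print("unregistered pixels:", int(unregistered.sum()))
print("extent (rows, cols):", (ys.min(), ys.max()), (xs.min(), xs.max()))
```

The hard part of the project is producing a reliable `photo_mask` from the orthophotos; the register comparison itself is then a simple raster operation like the one above.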

Regional Film Production Impact Tracker [Signe Somelar-Erikson, Tartu Centre for Creative Industries (Tartu Film Fund)]

The Tartu Film Fund lacks a systematic, data-driven approach to understanding and measuring the economic impact of film productions on the region (Tartu city and county, Jõgeva, Põlva, Võru and Valga counties). Currently, there is no structured method to gather, analyze, and assess how various productions contribute to the local economy, specifically through spending in production-related service industries (hotels, catering, logistics, etc.) and the return on regional film funding investments. This makes it challenging to justify and optimize funding strategies and to communicate the broader value of film productions and the cash rebate system to policymakers.

GOAL

We aim to build a comprehensive data collection and analysis framework that can capture key metrics related to film productions in the region. This framework will include gathering data on:

- types and sizes of productions coming to the region

- production spending on local services

- geographic distribution of this spending across the region

- the amount of regional funding support received

- economic return on investment for the region (e.g., local employment, service industry growth)

The final goal is to create a scalable and usable system that allows the Tartu Film Fund to independently monitor and analyze these data points in real time, both during and after the collaboration.

POSSIBLE APPROACH

1) collaborating with Tartu Film Fund to identify key data sources and develop strategies for collecting and organizing the necessary data. This may involve:

- collecting financial data from the budgets and reports of films the Tartu Film Fund has supported (approx. 35 films)

- surveys for production teams

- other....

2) analyzing trends in the types of productions coming to the region (e.g., size, genre, origin). Quantify spending patterns across different sectors (hotels, catering, etc.) and locations. Measure the correlation between regional film funding and the economic impact on local industries.


3) developing visualizations and dashboards that show insights on spending patterns, funding efficiency, and regional economic benefits.
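The spending and correlation analysis in step 2 could start as simply as this (all figures below are invented placeholders, not real fund data):

```python
import numpy as np

# Invented per-production figures for illustration: regional funding (EUR)
# and reported local spending (EUR). Real numbers would come from the
# budgets and reports of the ~35 supported films.
funding = np.array([20_000, 50_000, 35_000, 80_000, 15_000, 60_000], dtype=float)
local_spend = np.array([55_000, 140_000, 90_000, 230_000, 40_000, 150_000], dtype=float)

# Simple impact multiplier: local spending generated per euro of funding.
multiplier = local_spend.sum() / funding.sum()

# Strength of the funding/spending relationship across productions.
r = np.corrcoef(funding, local_spend)[0, 1]
print(f"spend multiplier: {multiplier:.2f}, correlation: {r:.2f}")
```

Broken down by sector and location, the same two quantities (multiplier and correlation) would drive the dashboard visualizations described in step 3.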

EXPECTED OUTCOME

The expected outcome is a working data platform or dashboard that allows Tartu Film Fund to independently collect, store, and analyze data on film productions and their economic impact on the region. This tool could include:

- an easy-to-use data entry interface for new production data

- automated analysis and reporting tools


Segmentation Smackdown: Battle of the Models [Marilin Moor, University of Tartu, Institute of Computer Science]

The segmentation of fibrous structures from scanning electron microscopy images is often deemed difficult due to the varying nature of the fiber morphology. The main goal of the project is to improve the segmentation approach implemented thus far in FiBar (https://fibar.elixir.ut.ee/). The idea is to test state-of-the-art segmentation approaches (Segment Anything Model by Meta, U-Net models and potentially other transformer models) and fine-tune these models to improve segmentation.

Very hard!
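Before fine-tuning SAM or U-Net models, a classical thresholding baseline is useful for judging how hard the images really are. Here is a minimal Otsu-threshold sketch on a synthetic fiber-like image (a baseline for comparison, not one of the models named above):

```python
import numpy as np

def otsu_threshold(image, bins=256):
    """Otsu's method: pick the threshold that maximizes between-class variance."""
    hist, edges = np.histogram(image.ravel(), bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2
    total = hist.sum()
    total_sum = (hist * centers).sum()
    w1 = s1 = 0.0
    best, thr = -1.0, centers[0]
    for h, c in zip(hist, centers):
        w1 += h
        s1 += h * c
        w2 = total - w1
        if w1 == 0 or w2 == 0:
            continue
        m1, m2 = s1 / w1, (total_sum - s1) / w2
        between = w1 * w2 * (m1 - m2) ** 2
        if between > best:
            best, thr = between, c
    return thr

# Synthetic SEM-like image: dark noisy background with one bright "fiber".
rng = np.random.default_rng(1)
image = rng.normal(0.2, 0.05, (64, 64))
image[28:36, :] += 0.6
mask = image > otsu_threshold(image)
print("segmented fiber fraction:", mask.mean())
```

On real SEM images with varying fiber morphology this baseline will break down; the gap between it and the fine-tuned models is the interesting result to report.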

Coupled seq2seq training [Mark Fishel, University of Tartu, Institute of Computer Science]

We have a new method of training sequence-to-sequence models that lets an existing pre-trained model (like Whisper / NLLB / mT5) serve as a guide in training a new model. That way the new model can serve as an extension module to the guide model: their vector spaces will be compatible, and their encoders/decoders can be recombined and used. We have tested the method only on machine translation; the project goal is to expand the testing to new modalities: speech, images, other NLP tasks besides translation, music generation, etc. You will have to (1) find more seq2seq models on HuggingFace and update the implementation of the method to support these models, and (2) find datasets on HuggingFace / elsewhere that can be used to train such extension modules in order to evaluate the approach.

Very hard!
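The core idea of a frozen guide shaping a new model's vector space can be illustrated with a toy alignment problem. The sketch below uses plain linear maps and an MSE alignment loss, which is only a loose analogy for the actual (transformer-based) method:

```python
import numpy as np

# Toy "guide" setup: a frozen guide encoder defines the target vector space;
# a new student encoder (a single linear map here) is trained so its outputs
# land in that same space. With compatible spaces, the guide's decoder could
# then be reused on top of the student encoder.
rng = np.random.default_rng(0)
d_in, d_model, n = 8, 4, 256

W_guide = rng.normal(size=(d_in, d_model))    # frozen, "pre-trained"
W_student = rng.normal(size=(d_in, d_model))  # to be trained

X = rng.normal(size=(n, d_in))                # stand-in training inputs
target = X @ W_guide                          # guide representations

# Gradient descent on the alignment loss  L = mean((X W_s - X W_g)^2).
lr = 0.01
for _ in range(1000):
    diff = X @ W_student - target
    W_student -= lr * (2 * X.T @ diff / n)

final_loss = np.mean((X @ W_student - target) ** 2)
print("alignment loss after training:", final_loss)
```

In the real project the "encoders" are full seq2seq models from HuggingFace, but the training signal (make the new encoder's outputs live in the guide's space) has the same shape.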

Topics using large language models [Meelis Kull, University of Tartu]

  • We can provide access to the API of OpenAI large language models such as GPT-4o, as well as the Embeddings API (costs covered by the university, quota-based)

SOME IDEAS:

  • LLM-based researcher search engine at the University of Tartu
    “Are there researchers in environmental studies who have published on climate change impacts?”
    “Find me all researchers in the University of Tartu who have applied machine learning in their papers”
  • University research trends analysis
    Description: Build a dashboard that provides insights into recent research trends across departments or faculties by analyzing recent publications and keywords.
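The researcher search idea boils down to embedding documents and queries and ranking by cosine similarity. A minimal sketch with made-up vectors (a real system would obtain them from the Embeddings API, e.g. via `client.embeddings.create(...)`):

```python
import numpy as np

# Made-up 4-dimensional "embeddings" so the ranking logic is testable;
# real vectors would come from the OpenAI Embeddings API.
doc_names = ["climate researcher", "ML researcher", "historian"]
doc_vecs = np.array([
    [0.9, 0.1, 0.0, 0.1],
    [0.1, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.9, 0.1],
])
query_vec = np.array([0.8, 0.2, 0.0, 0.1])  # stand-in for the query embedding

def cosine_rank(query, docs):
    """Rank documents by cosine similarity to the query."""
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    sims = d @ q
    return np.argsort(-sims), sims

order, sims = cosine_rank(query_vec, doc_vecs)
print("best match:", doc_names[order[0]])
```

On top of this retrieval step, an LLM can rewrite the user's question into a search query and summarize the retrieved researcher profiles.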

Questions or comments?

Please ask in Campuswire.