Introduction to Data Science Project Information 2024
Formal requirements
Further requirements
Evaluation of projects
Instructions
Choosing a topic from Kaggle
Choosing your own topic
Topics proposed to the IDS course:
Factors of success in abstract theoretical courses [Reimo Palm, University of Tartu, Institute of Computer Science]
Courses containing a lot of mathematics are traditionally hard for students. However, there are still some students who excel in them. The objective is to find out what these students do differently and whether there are activities or behavioral patterns that could be recommended to the other students to make such courses easier for them.
The dataset consists of coded activities of 119 students in the Theoretical Computer Science course each day throughout the whole semester. The activities are, for example, viewing study materials, submitting homework, participating in a lecture online, etc. The goals are: 1) predict the final grades of students based on their activity pattern; 2) discover activity patterns that separate high-performing students from low-performing students.
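A possible first step for both goals is to turn the daily activity log into per-student counts and compare groups. The sketch below is only an illustration: the log format (student, day, activity code), the activity names and the grades are all made up, since the actual coding scheme of the dataset is not described here.

```python
from collections import Counter, defaultdict


def activity_features(log):
    """Count how often each student performed each activity.

    `log` is assumed to be an iterable of (student_id, day, activity) tuples.
    """
    features = defaultdict(Counter)
    for student_id, _day, activity in log:
        features[student_id][activity] += 1
    return features


def mean_count_by_group(features, grades, activity, threshold):
    """Mean count of one activity for students at/above vs. below a grade threshold."""
    high = [features[s][activity] for s, g in grades.items() if g >= threshold]
    low = [features[s][activity] for s, g in grades.items() if g < threshold]

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return mean(high), mean(low)


# Made-up example data, not taken from the real dataset.
log = [
    ("s1", 1, "view_material"),
    ("s1", 2, "submit_homework"),
    ("s1", 3, "view_material"),
    ("s2", 1, "view_material"),
    ("s3", 5, "attend_lecture"),
]
grades = {"s1": 5, "s2": 2, "s3": 3}

features = activity_features(log)
high_mean, low_mean = mean_count_by_group(
    features, grades, "view_material", threshold=4
)
print(high_mean, low_mean)
```

The resulting count vectors could then be fed to any standard classifier or regressor for goal 1, while group comparisons like the one above are a starting point for goal 2.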
Supervision of construction works in municipalities [Jaanus Tamm, Tartu City Government]
One of the challenges of monitoring construction activities (buildings, facilities, roads) in local municipalities today is that the detection of violations is random (mostly complaint-based). The main reason is the limited human resources available for supervision. Therefore, information about violations is often received after the fact, which does not allow a timely response.
We see a clear need to detect/identify violations arising from construction activities at the earliest possible stage.
In order for the timely detection of construction activity violations to be feasible with the given human resources, technical solutions and automated processes are needed to support and simplify supervision.
The city of Tartu receives an orthophoto of the city twice a year (one is taken by the Land Agency, and the other comes from a drone survey ordered by the city).
We see that this dataset could be used to detect discrepancies between the orthophotos and the data in the Building Register (https://livekluster.ehr.ee/ui/ehr/v1) and city databases using machine learning / AI. One example is arbitrarily erected buildings, about which the register holds no information.
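Once building footprints have been detected in an orthophoto (by whatever model the project ends up using), the comparison with the register could be sketched roughly as follows. This is a simplified illustration: footprints are assumed to be axis-aligned boxes (xmin, ymin, xmax, ymax) in a shared coordinate system, the coordinates are invented, and the 0.3 overlap threshold is an arbitrary placeholder. Real footprints would be polygons.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    overlap_x = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_y = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = overlap_x * overlap_y
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def unregistered(detected, registered, min_iou=0.3):
    """Return detected footprints that overlap no registered footprint well enough."""
    return [
        d for d in detected
        if all(iou(d, r) < min_iou for r in registered)
    ]


# Invented coordinates for illustration only.
registered = [(0, 0, 10, 10), (20, 0, 30, 10)]
detected = [(1, 0, 10, 10), (40, 40, 50, 50)]

candidates = unregistered(detected, registered)
print(candidates)  # footprints worth a closer look by a supervisor
```

Flagged footprints are only candidates for inspection, not confirmed violations.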
Regional Film Production Impact Tracker [Signe Somelar-Erikson, Tartu Centre for Creative Industries (Tartu Film Fund)]
The Tartu Film Fund lacks a systematic, data-driven approach to understanding and measuring the economic impact of film productions on the region (Tartu city and county, Jõgeva, Põlva, Võru and Valga counties). Currently, there is no structured method to gather, analyze, and assess how various productions contribute to the local economy, specifically through spending in production-related service industries (hotels, catering, logistics, etc.) and the return on regional film funding investments. This makes it challenging to justify and optimize funding strategies and to communicate the broader value of film productions and the cash rebate system to policymakers.
GOAL
We aim to build a comprehensive data collection and analysis framework that can capture key metrics related to film productions in the region. This framework will include gathering data on:
- types and sizes of productions coming to the region
- production spending on local services
- geographic distribution of this spending across the region
- the amount of regional funding support received
- economic return on investment for the region (e.g., local employment, service industry growth)
The final goal is to create a scalable and usable system that allows the Tartu Film Fund to independently monitor and analyze these data points in real time, both during and after the collaboration.
POSSIBLE APPROACH
1) collaborating with Tartu Film Fund to identify key data sources and develop strategies for collecting and organizing the necessary data. This may involve:
- collecting financial data from the budgets and reports of films that Tartu Film Fund has supported (approx. 35 films)
- surveys for production teams
- other....
2) analyzing trends in the types of productions coming to the region (e.g., size, genre, origin), quantifying spending patterns across different sectors (hotels, catering, etc.) and locations, and measuring the correlation between regional film funding and the economic impact on local industries.
3) developing visualizations and dashboards that show insights on spending patterns, funding efficiency, and regional economic benefits.
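As a rough illustration of the analysis in step 2, spending can be totalled per sector and a Pearson correlation computed between funding and local spending. All names and numbers below are invented placeholders; the actual data structure will depend on what is agreed with Tartu Film Fund in step 1.

```python
from collections import defaultdict
from math import sqrt


def spend_by_sector(records):
    """Total spending per sector from (film, sector, amount) records."""
    totals = defaultdict(float)
    for _film, sector, amount in records:
        totals[sector] += amount
    return dict(totals)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return covariance / (spread_x * spread_y)


# Invented records: (film, sector, amount in euros).
records = [
    ("film_a", "hotels", 12000.0),
    ("film_a", "catering", 4000.0),
    ("film_b", "hotels", 8000.0),
    ("film_b", "logistics", 3000.0),
]
totals = spend_by_sector(records)
print(totals)

# Invented per-film funding and local spending, in the same film order.
funding = [10000.0, 20000.0, 15000.0, 30000.0]
local_spend = [26000.0, 41000.0, 39000.0, 70000.0]
print(round(pearson(funding, local_spend), 3))
```

With only about 35 films, any correlation should be read cautiously; it describes association, not the causal effect of funding.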
EXPECTED OUTCOME
The expected outcome is a working data platform or dashboard that allows Tartu Film Fund to independently collect, store, and analyze data on film productions and their economic impact on the region. This tool could include:
- an easy-to-use data entry interface for new production data
- automated analysis and reporting tools
Segmentation Smackdown: Battle of the Models [Marilin Moor, University of Tartu, Institute of Computer Science]
The segmentation of fibrous structures from scanning electron microscopy images is often deemed difficult due to the varying nature of the fiber morphology. The main goal of the project is to improve the segmentation approach implemented thus far in FiBar (https://fibar.elixir.ut.ee/). The idea is to test state-of-the-art segmentation approaches (Segment Anything Model by Meta, U-Net models and potentially other transformer models) and fine-tune these models to improve segmentation.
Very hard!
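To compare segmentation models on equal terms, each predicted mask can be scored against a manually annotated mask. The sketch below computes two common overlap scores, IoU and Dice, on tiny made-up binary masks. It is not the evaluation used in FiBar, which is not described here, and real masks would normally be image arrays rather than nested lists.

```python
def mask_scores(pred, truth):
    """Return (IoU, Dice) for two binary masks given as equally sized nested lists."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1  # fiber pixel correctly predicted
            elif p:
                fp += 1  # predicted fiber, actually background
            elif t:
                fn += 1  # missed fiber pixel
    if tp + fp + fn == 0:
        return 1.0, 1.0  # both masks empty: treat as perfect agreement
    iou = tp / (tp + fp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    return iou, dice


# Made-up 2x4 masks: 1 = fiber, 0 = background.
truth = [[0, 1, 1, 0],
         [0, 1, 1, 0]]
pred_a = [[0, 1, 1, 0],
          [0, 1, 0, 0]]  # misses one fiber pixel
pred_b = [[1, 1, 1, 1],
          [0, 1, 1, 0]]  # adds two false fiber pixels

for name, pred in [("model A", pred_a), ("model B", pred_b)]:
    iou, dice = mask_scores(pred, truth)
    print(name, round(iou, 3), round(dice, 3))
```

Averaging such scores over a held-out set of annotated images gives one number per model for the comparison.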
Coupled seq2seq training [Mark Fishel, University of Tartu, Institute of Computer Science]
We have a new method of training sequence-to-sequence models that lets an existing pre-trained model (like Whisper / NLLB / mT5 / etc.) serve as a guide in training a new model. That way the new model can serve as an extension module to the guide model: their vector spaces will be compatible and their encoders/decoders can be recombined and used. We have tested the method only on machine translation; the project goal is to expand the testing to new modalities: speech, images, other NLP tasks besides translation, music generation, etc. You will have to (1) find more seq2seq models on HuggingFace and update the implementation of the method to support these models and (2) find datasets on HuggingFace / elsewhere that can be used to train such extension modules in order to evaluate the approach.
Very hard!
Topics using large language models [Meelis Kull, University of Tartu]
SOME IDEAS:
Questions or comments?
Please ask in Campuswire.