Experimental projects - A.Y. 2023/2024
Students can submit their work by the following fixed deadlines:
Approximately two to three weeks after each deadline, there will be an oral examination in which the project will be thoroughly discussed with the TAs. We stress that group projects are not allowed: students must complete their projects individually.
Read the following instructions carefully. Besides complying with the project’s specifications, it is extremely important that students follow a sound methodology both in the data preprocessing phase and when running the experiments. In particular, no data manipulation should depend on test set information. Moreover, hyperparameter tuning should focus on regions of values where performance trade-offs are explicit. Any implementation must use Python 3 (any other choice must be agreed upon in advance with the teaching assistants).
Download this dataset. The goal is to learn how to classify the labels based on the numerical features according to the 0-1 loss, which is the metric you should adopt when evaluating the trained models. Explore the dataset and perform the appropriate preprocessing steps. Please be mindful of data leakage between the training and test sets.
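As an illustration of leakage-free preprocessing, the following is a minimal sketch that splits the data first and then computes the standardization statistics on the training part only. The helper names (train_test_split, fit_standardizer, standardize), the 80/20 split, and the synthetic data are assumptions made for this example and are not part of the project specification.

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Randomly split (X, y) into a training part and a test part."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = perm[:n_test], perm[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

def fit_standardizer(X_train):
    """Compute per-feature mean and standard deviation on training data only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return mean, std

def standardize(X, mean, std):
    """Apply statistics that were estimated elsewhere (on the training set)."""
    return (X - mean) / std

# Synthetic stand-in for the real dataset (assumption for this sketch).
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
y = np.where(X[:, 0] > 5.0, 1, -1)

X_train, X_test, y_train, y_test = train_test_split(X, y)
mean, std = fit_standardizer(X_train)        # the test set is never inspected here
X_train_s = standardize(X_train, mean, std)
X_test_s = standardize(X_test, mean, std)    # reuses the training statistics
```

The test features are transformed with the training mean and standard deviation, so they will generally not have exactly zero mean and unit variance; that is expected.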
Implement from scratch (without using libraries such as Scikit-learn) the following machine learning algorithms:
Test the performance of these models. Next, attempt to improve the performance of the previous models by using polynomial feature expansion of degree 2. Include and compare the linear weights corresponding to the various numerical features you found after the training phase.
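A degree-2 polynomial feature expansion can be sketched as follows. The function name polynomial_expansion_degree2 is hypothetical, and the sketch assumes that any bias term is handled separately by the learning algorithm, so no constant column is added.

```python
import numpy as np
from itertools import combinations_with_replacement

def polynomial_expansion_degree2(X):
    """Map each row (x_1, ..., x_d) to the original features followed by
    all products x_i * x_j with i <= j (squares and pairwise interactions)."""
    n_features = X.shape[1]
    columns = [X[:, j] for j in range(n_features)]
    for i, j in combinations_with_replacement(range(n_features), 2):
        columns.append(X[:, i] * X[:, j])
    return np.column_stack(columns)

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
Z = polynomial_expansion_degree2(X)
# Column order for d = 2: x1, x2, x1*x1, x1*x2, x2*x2
```

With d input features the expanded representation has d + d(d+1)/2 columns, which is worth keeping in mind when discussing computational costs.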
Then, try using kernel methods. Specifically, implement from scratch (again, without using libraries such as Scikit-learn):
Evaluate the performance of these models as well.
Remember that relevant hyperparameter tuning is a crucial part of the project and must be performed using a sound procedure.
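One possible sound tuning procedure is K-fold cross-validation run on the training data only. The sketch below is written under several assumptions: the helper names are hypothetical, the 0-1 loss is used as the validation metric, and a regularized least-squares classifier acts purely as a placeholder learner standing in for whichever algorithm is being tuned.

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """Fraction of misclassified examples."""
    return float(np.mean(y_true != y_pred))

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices and split them into k disjoint folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def cross_validate(fit, predict, X, y, k=5, seed=0):
    """Average validation loss over k folds of the training data."""
    folds = k_fold_indices(len(X), k, seed)
    losses = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        losses.append(zero_one_loss(y[val_idx], predict(model, X[val_idx])))
    return float(np.mean(losses))

# Placeholder learner (assumption): regularized least squares used as a classifier.
def make_fit(lam):
    def fit(X, y):
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return fit

def predict(w, X):
    return np.where(X @ w >= 0, 1, -1)

# Synthetic training data (assumption); the test set is not touched during tuning.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = np.where(X_train @ np.array([1.0, -2.0, 0.5]) >= 0, 1, -1)

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_errors = {lam: cross_validate(make_fit(lam), predict, X_train, y_train)
             for lam in grid}
best_lam = min(cv_errors, key=cv_errors.get)
```

After selecting best_lam, the model would typically be retrained on the whole training set and evaluated once on the held-out test set.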
Ensure that the code you provide is polished, working, and, importantly, well-documented.
Write a report discussing your findings, with particular attention to the adopted methodology, and provide a thorough discussion of the models’ performance and their theoretical interpretation. Include comments on the presence of overfitting or underfitting and discuss the computational costs.
Download the Mushroom dataset. The main task of this project is the implementation from scratch of tree predictors for binary classification to determine whether mushrooms are poisonous. The tree predictors must use single-feature binary tests as the decision criteria at any internal node (as seen in the lectures). More precisely, consider thresholds on a single feature in the case of a numerical/ordinal feature or membership tests in the case of a categorical feature. We suggest the following guidelines to aid your work on this project.
First, implement a basic class/structure for the nodes of the tree predictors. It should possess the following attributes/procedures:
Second, implement a class/structure for the (binary) tree predictor. It should contain the following attributes/procedures:
Feel free to add extra attributes/procedures if deemed necessary for the task.
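To make the idea of single-feature binary tests concrete, here is a minimal sketch of a node structure with a threshold test for numerical/ordinal features and a membership test for categorical features. The attribute names, the dictionary-based representation of an example, and the feature names "odor" and "cap_diameter" are all assumptions for illustration; they are not the required interface.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

def threshold_test(feature, threshold):
    """Binary test for a numerical/ordinal feature: x[feature] <= threshold."""
    return lambda x: x[feature] <= threshold

def membership_test(feature, values):
    """Binary test for a categorical feature: x[feature] in a set of values."""
    allowed = frozenset(values)
    return lambda x: x[feature] in allowed

@dataclass
class Node:
    test: Optional[Callable[[dict], bool]] = None  # None for a leaf
    left: Optional["Node"] = None    # child followed when the test is True
    right: Optional["Node"] = None   # child followed when the test is False
    label: Any = None                # prediction stored at a leaf

    def is_leaf(self):
        return self.test is None

def predict_one(root, x):
    """Route a single example from the root down to a leaf."""
    node = root
    while not node.is_leaf():
        node = node.left if node.test(x) else node.right
    return node.label

# Hand-built toy tree ("p" = poisonous, "e" = edible); the rules are made up.
root = Node(
    test=membership_test("odor", {"foul"}),
    left=Node(label="p"),
    right=Node(
        test=threshold_test("cap_diameter", 5.0),
        left=Node(label="e"),
        right=Node(label="p"),
    ),
)
```

A usage example: predict_one(root, {"odor": "none", "cap_diameter": 3.2}) follows the False branch of the first test and the True branch of the second, ending at the leaf labeled "e".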
Train the tree predictors adopting at least 3 reasonable criteria for the expansion of the leaves, and at least 2 reasonable stopping criteria. Compute the training error of each tree predictor according to the 0-1 loss.
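As an example of what splitting criteria might look like, the sketch below defines three impurity functions of the fraction p of positive labels in a leaf (Gini, entropy, and misclassification rate), the gain of a candidate split, and the 0-1 training error. These are common choices used here as an assumption; the function names are hypothetical and other criteria may be equally reasonable.

```python
import numpy as np

def gini(p):
    return 2.0 * p * (1.0 - p)

def entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a pure leaf has zero entropy
    return float(-p * np.log2(p) - (1.0 - p) * np.log2(1.0 - p))

def misclassification(p):
    return min(p, 1.0 - p)

def split_gain(y, mask, impurity):
    """Reduction in weighted impurity when the 0/1 labels y are split into
    y[mask] (test True) and y[~mask] (test False)."""
    n = len(y)
    n_left = int(mask.sum())
    if n_left == 0 or n_left == n:
        return 0.0  # degenerate split: one child is empty
    children = (n_left / n) * impurity(y[mask].mean()) \
        + ((n - n_left) / n) * impurity(y[~mask].mean())
    return float(impurity(y.mean()) - children)

def zero_one_loss(y_true, y_pred):
    """Training error according to the 0-1 loss."""
    return float(np.mean(y_true != y_pred))

y = np.array([1, 1, 1, 0, 0, 0])
perfect = y == 1                                              # separates the classes
useless = np.array([True, False, False, True, False, False])  # children stay balanced

gains = {f.__name__: split_gain(y, perfect, f)
         for f in (gini, entropy, misclassification)}
```

On this toy example the perfect split removes all impurity under each criterion, while the useless split yields a gain of zero because both children keep the same label proportions as the parent.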
Perform hyperparameter tuning according to the splitting criteria and the stopping criteria adopted (e.g., tune the threshold on the maximum size of the tree) for at least one of the tree predictors. Keep in mind that the relevant hyperparameter tuning is an important part of the project and must be performed using a sound procedure.
Write a report in which you discuss your findings, with particular attention to the adopted methodology, and provide a thorough discussion of the models’ performance (with comments on the possible presence of overfitting or underfitting). In the case of overfitting, possible remedies include pruning the tree predictors (see the references below) or adopting an appropriate stopping criterion.
We suggest the following resources, in addition to the lecture notes of the course, as further reading for the interested student:
Optional: Implement random forest by reusing the already implemented tree predictor class/structure.
Description: Ultrasound imaging is a critical tool in diagnosing musculoskeletal disorders. In this project, we want to evaluate the relation between the segmented area of the knee recess and the presence of liquid (distension).
Dataset: We have a dataset of ~700 binary masks of the knee recess, extracted from ultrasound images manually annotated by expert physicians. The masks represent the knee joint recess, and each is associated with a status of “distended” (i.e., enlarged) or “not-distended”. For a brief introduction to the medical problem from a computer science perspective, please refer to [1].
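A natural first feature for relating segmented area to distension is the number of foreground pixels in each mask. The sketch below assumes that a mask has already been loaded as a 2D NumPy array in which nonzero pixels belong to the recess; the function name mask_area and the toy mask are hypothetical, and converting pixel counts to physical units would require the image scale, which is not given here.

```python
import numpy as np

def mask_area(mask):
    """Area of the segmented region, measured in pixels."""
    return int(np.count_nonzero(mask))

def mask_area_fraction(mask):
    """Area of the segmented region as a fraction of the whole image."""
    return mask_area(mask) / mask.size

# Toy 6x8 mask with a 2x5 rectangular foreground region (made-up data).
mask = np.zeros((6, 8), dtype=np.uint8)
mask[2:4, 1:6] = 1

area = mask_area(mask)
fraction = mask_area_fraction(mask)
```

Using count_nonzero rather than a plain sum keeps the pixel count correct even if the masks are stored with values 0 and 255 instead of 0 and 1.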
Objective: Classify each mask as “distended” or “not-distended”. Classification performance should be evaluated in the following two cases:
Extra task: Use explainable AI to better understand which features the models consider relevant for the classification.
Notice: Before starting to work on project 3 (or if you are interested in these or other ML challenges in the field of medicine), make sure to get in touch with Marco Colussi or Prof. Sergio Mascetti for further instructions or clarifications. They will be responsible for the evaluation of your project.
References:
[1]: Colussi, Marco, et al. "Ultrasound detection of subquadricipital recess distension." Intelligent Systems with Applications 17 (2023): 200183.
Description: Accurate segmentation of musculoskeletal structures in ultrasound images is essential for diagnosis and treatment planning. Advanced deep learning techniques, particularly convolutional neural networks (CNNs), diffusion models, and foundation models, have shown great promise in medical image segmentation. We are addressing the problem of automatically segmenting the joint recess.
Dataset: We will provide the student with a set of 687 ultrasound images of both distended and non-distended recesses, each with the annotated area of the recess.
Objective: Develop and compare state-of-the-art deep learning models for the segmentation of medical images.
Extra task: Introduce a constrained loss [2].
Notice: Before starting to work on project 4 (or if you are interested in these or other ML challenges in the field of medicine), make sure to get in touch with Marco Colussi or Prof. Sergio Mascetti for further instructions or clarifications. They will be responsible for the evaluation of your project.
References:
[2]: Wang, Ping, et al. "CAT: Constrained adversarial training for anatomically-plausible semi-supervised segmentation." IEEE Transactions on Medical Imaging (2023).
Students who want to work on a theory project must write an email to the instructor indicating a topic (typically chosen among those covered in class) they would like to focus their project on. The instructor will then suggest one or two papers in that area.
Keep in mind that theory projects are specifically addressed to students who have a good disposition toward mathematics. Do not choose a theory project only because you are not good at coding.
Here is an example of a good report for a theory project.