Defense 1 | Defense 2

GROUP PUBLIC ID | Introduction, existing solutions, initial ideas | Points | Defense 1 (3), baseline (3), future directions (4) | Points | PEER REVIEW DONE (YES = 0, NO = -10) | Review 1 | Score 1 | Review 2 | Score 2 | Final remark | Score 3 | FINAL SCORE | Final score difference with student peer-review | FINAL REMARK
3
116 | - not clear, related work missing, methodology is only generally described, no dataset analysis was performed

- TOPIC CHANGED
- the meeting with UL FF students and the professor will happen April 7, 5:15 PM via Zoom for lab sessions
3- human evaluation part added
- should vocabulary be extended as there are unk tokens in the examples?

- FF part:
- they helped with finding a general corpus
- preparation of a corpus for domain-specific translation
10 | 0 | Everything worked

---------------

Discussion misses good improvement ideas ( tokenization, transformers...)
28 | The README.md, although short, includes all the information needed. The notebook on Google Colab makes it easy to install all the prerequisites and we did not have many problems running the code. As suggested, the pre-trained models are also included, which makes things a lot easier. The only minor problem is that we wish the downloading part (wget ...) were a little better automated, since you have to go to the Downloads manually and get the download link.
CONCLUSION: The repository follows every requirement stated, with little to no issues.

---------------

The structure of the report follows the suggested form. The method part is described in detail, including the files used and specific settings, so it really feels like you can follow the whole machine translation process while reading it. It looks like the group tried different models in terms of training time, vocabulary and data used, but not much is discussed about other model settings and the thought process behind choosing certain training parameters, for example. The datasets are also briefly discussed, although there could be a bit more focus on the contents of the specialized corpora.

The results are presented nicely in a readable form and are for the most part well discussed. We really liked the part where the experiment with more model steps is conducted; it gives the reader a nice look into the whole process of the model actually learning to translate. We do feel there could be more examples of translated sentences, just so the reader would get a feel for what kind of sentences the model is able to translate, since the scores, although the most accurate measure, don't give you the same information as raw text would. In the conclusion some possible solutions are also briefly stated. The topic of errors is generally discussed pretty well.

In the conclusion the group does indeed state the shortcomings of the work (issues with data, larger vocabulary, ...). More information could be given about the parameters of the model, since training seems like the most important part of the whole process.

abstract, introduction and related work: 5
clear description of selected datasets and analysis: 4 (datasets could be described a bit more)
clear description of methods/algorithms and adequate discussion of results.: 7 (more parameters could be discussed, more translation examples could be shown)

26
- en-slo, slo-en
- training time very low ??
- asistent corpus results: BLEU = 0.05,
- Google Colab

- vocabulary -> what is its influence?
- how did you obtain the validation set -> how did you know when to stop
25 | 38 | -7
4
122- how will you use LSTM, CNNs and SVMs for the task ???
- the idea was to review other works, not describe some of the methods
- the lexicon is described for classification, not exploratory analysis
- no ideas for further work are given

- TOPIC CHANGED
- the meeting with UL FF students and the professor will happen April 7, 5:15 PM via Zoom for lab sessions
3- no source code/scripts?
- explain parameters in training the model

- FF part:
- they collected the data: 4k sentences
10 | 0 | Evaluation was a bit trickier to run from Colab, the WER score didn't work, otherwise no problems

---------------

Abstract and introduction are short, related work is missing. The general dataset is not described well. Only one approach is described well, the others are only mentioned.
23The repository includes a Google Colab notebook that is clear and runnable. There are no instructions for running the evaluation script, although it is available in the repository.

---------------

The report follows the suggested structure, 8 pages long. It appropriately describes data processing and approaches (general model, domain model, Google translate). Results and discussion are nicely written. The abstract is too short and does not follow the structure of an abstract. The introduction is short but informative and the conclusion is well written. Related work is missing, and the only described framework is the one they used. The military datasets are described well, but the general datasets are only mentioned.

Abstract, introduction and related work: 2
Clear description of selected datasets and analysis: 4
Clear description of methods/algorithms used (general model, domain model, Google translate) and adequate discussion of results: 10
25
- Google Colab
- comparison to Google Translate
- asistent corpus results: BLEU = 0.07
25 | 38 | -2
5
126- insufficient

- TOPIC CHANGED
- the meeting with UL FF students and the professor will happen April 7, 5:15 PM via Zoom for lab sessions
3- clearly define datasets used + processing done on data
- describe evaluation settings
- report using agreed metrics
- seems a basic translator has been trained
- # GPUs/ training time

- FF part:
- they prepared some data, which they will then review
10 | 0 | The README does not say into which folder the user should save the model and the evaluation text after downloading them. The models had to be saved into a "models" folder, which had to be created manually, but this was not stated in the instructions. The translate.sh script did not run successfully, partly because the "temp" folder did not exist, so the generated files were not being saved. After manually creating a new "temp" folder, a new error appeared which we were not able to resolve and which was not mentioned in the instructions.

---------------

Abstract, introduction and related work: 3/5. It contains all the required sections, but they could be expanded/explained further.
Clear description of selected datasets and analysis: 5/5. All corpora used are listed, each of them is described, and their sizes are given.
Clear description of methods/algorithms used (at least 3) and adequate discussion of results: 7/10. The "Results" section lists four approaches and their findings, and they also describe their interpretation of the evaluation results. The discussion lists ideas for further improvements. The translation tables, which should be under the "Results" section, are scattered among the references (-1 point). In addition, they do not have the Slovene-English model required by the assignment instructions (-2 points).
18The instructions are not clear, it is not specified where to put specific files, so you have to figure it out from errors you get when running the scripts. Downloading the data from multiple sources is tedious and it is not specified which files to download, so they should have provided a script or at least specific URLs. There are also no provided instructions for Windows, even though the translate script looks like it can easily be adapted for Windows. We had to slightly change the script to make it work. After we managed to make it work, the results that are reported are the same as in the report.

---------------

The report follows the suggested structure, 9 pages long. They compared many different models and parameters, the results of which are appropriately reported. The discussion is well written, though they could’ve commented on the provided examples more. They have a nice abstract and introduction and they described the related work and frameworks in detail. Their models are well described and the datasets are well described as well.

Abstract, introduction and related work: 5
Clear description of selected datasets and analysis: 5
Clear description of methods/algorithms used (transformer model, general model, multiple fine-tuned models) and adequate discussion of results: 9
26
- asistent corpus: max BLEU = 0.11
- no other evaluations, only domain dataset with low results
- what might be the problem?
- trying on their own hardware

- literary texts (spook)
- 30,000 steps - what does that mean?
25 | 38 | 1
6
106 | - compared to the other groups, and given that the topic was not part of the labs, you did good work
- please think of how much data would be needed to pretrain a model and then maybe fine-tune it (depending on selected framework)
- the meeting with UL FF students and the professor will happen April 7, 5:15 PM via Zoom for lab sessions
10- nice datasets description
- good overall structure

- FF part:
- exemplary collaboration
10 | 0 | The README is written in detail, with clear instructions for different operating systems and for quickly running the models and evaluating their results. Nevertheless, the instructions lack access to the test set of the domain/specialized corpus; only access to the asistent test set, which is intended for general evaluation of the model, is given.
Unfortunately, when installing the software requirements we ran into a problem similar to the one at the beginning of the project, when we wanted to set up the fairseq framework for our own task. Because of that, we could not verify the operation of the models and their evaluation. We tried the procedure on three different Windows computers.
For this part we therefore deducted 5 points, since the Readme.md did not state an easy way to install the software, and consequently we could not reproduce the results.

---------------

Abstract, introduction and related work: 5/5. All criteria are met. The abstract nicely summarizes the goal of the assignment and the obtained results, the introduction describes the structure of the report, and Related work gives a sufficiently detailed overview of the field.

Clear description of selected datasets and analysis: 5/5. The corpora used are clearly listed and described. For each corpus they describe its contents and how it could influence the model's results.

Clear description of methods/algorithms used (at least 3) and adequate discussion of results: 8/10. The chosen methods and approaches are described in detail. In the discussion they clearly describe the results, give relevant examples and state the reasons for bad or good translations. The assignment instructions state that both an English-Slovene and a Slovene-English model are required, but the report states that they only built the English-Slovene model (-2 points).
23The readme file is easy to follow and the instructions inside of it are all very clear. It worked as it was written in the readme.

---------------

The report in general follows the suggested structure, even though some sections are named differently. It consists of 11 pages including references and appendix, which is acceptable for this assignment. The report also describes data preprocessing and, as we can see from the results table, 3 configurations were used (maybe not 3 approaches/algorithms?). Results are presented in a readable way and further explained with comments. The authors also pointed out the pros and cons of their work. It is clear that they know what they are doing.
29
- clear datasets description
- SLING usage
- useful evaluation table
- corpus asistent: 8 epoch, max=0.37
- clear model configurations presented
30 | 50 | 7
7
110- a part of related work is not clear as classification is not part of this task
- how you will use RNN or BERT for the task?
- methodology does not seem appropriate
7- datasets should be better described
- no need to describe RNNs, BERT, ... - what is added value?
- can you map hate-speech datasets to ordinal scale to use "sentiment classification"?
- you will train a translation model?
- Accuracy, Precision, F1 -> Precision, Recall, F1?
- what is the added value of Fig 6 - b,c,d?
- further work missing
7 | 0 | The repository is a little messy and not perfectly organized. Instructions about which scripts to run were not clear, and the Conda command was not working straight away. Other than that it's OK; the results obtained with the scripts are similar to those in the report.

---------------

The report style matches the one defined in the project introduction slides. The page length is adequate. The abstract could be a little longer. The introduction and related work cover everything discussed later. Datasets are presented very thoroughly with corresponding distribution plots. Deep learning methods could be explained in more detail. The results section is covered with plots and tables but is a little too short text-wise. The discussion points out only what was missing and not how well the models worked overall. Some ideas for improvements are there, but overall it is too short; the discussion should be longer and more thoroughly concluded.
24+ has README.md with instructions
+ has requirements.txt
+ has datasets already in repo, or links to datasets
+ models are provided on google drive, a lot of models (12GB+)
+ all code is in notebook format, which is easy to use on Google Colab
+ code runs, and also notebooks have already results from past run, which are the same as in report
+ all code from methods in report seems to be present

- doesn't have clear instructions for running the notebooks and which notebooks do what, only that there are notebooks.
- datasets should only be linked, legal reasons?
- should include dataset descriptions in the root README.md (just my opinion)

9/10

---------------

- abstract, introduction and related work 4/5
maybe abstract should give some overall results of the work and more specific which methods were used. Introduction and related work are good.

- clear description of selected datasets and analysis 2/5
datasets are described only vaguely. Only binary distributions of all datasets are given, but the datasets are multiclass. The combining is not explained.

- clear description of methods/algorithms used (at least 3) and adequate discussion of results 5/10
methods are explained only vaguely and no training parameters are given. There are a lot of results for a lot of models, but no clear comparison between them, no reasons why some don't work, and no reasons why they were even used.
20
- figures not updated from previous defense
- no clear explanations, analysis, ...
- binary data only?
23 | 37 | 0
8
101 | - you are not describing exploratory analysis but classification -> your group topic is changed accordingly
- the initial idea is too vague - you should have at least proposed a type of algorithm and features, and described the planned methodology
8- datasets are not described
- Hate-Sonar - you can use such things to compare your results but they do not count as your effort in classification, more like improvements of discussion
- for traditional methods.- how many features are generated?
- TF-IDF vs. POS?
- need to use datasets with more classes, not only two/three
7 | 0 | HateSonar dependency requirements are not well specified (not working out of the box, a lower version of Python is required). Some datasets are duplicated across the repository, and a clear structure is missing.

Instructions for the BERT model are misleading, since they instruct us to install the package BERT (which is actually a Python serialization library) and not the transformers package they actually use. BERT is just a local class that was created. There is no code for testing the BERT models, and no test set or trained models are provided. It seems the reported results are based on the training/validation accuracy, which we could only check by retraining the models (they also shouldn't report training/validation accuracy; this is penalized in the report part of the peer review).

With the TF-IDF methods we had problems running the code, since a dependency was missing that isn't specified in the requirements. It turns out the dependency wasn't even used in the code, just left in the imports. The reported accuracies are also not very clear, and it's hard to find and compare the calculated outputs and the reported values.

---------------

Abstract, introduction, related work: (3/5)
- abstract doesn't tell what is achieved in the work, only provides motivation
- good introduction
- related work only has 3 references, other work is mentioned but access not provided

Datasets: (3/5)
- datasets are described in two parts, once under related work and once in methods. There are many datasets listed, but only few are described and used. For Twitter dataset (mentioned twice, with different number of entries stated) we don't even know which reference belongs to it (4 possible twitter sets).
- SLO dataset is not described

Methods: (3/10)
- TF-IDF: what are tasks ABC? Probably tasks mentioned with Offenseval dataset, but there is no description which is which
- HateSonar: only description of Bert, not about the group's work
- POS: good description of group's work
- No methods to transfer ENG training to SLO, only training on SLO data.
Results:
- results are not in separate section, but are included in methods. They are not comparable for different methods, since there are different metrics used
- TF-IDF: presented accuracy, but we don't know how balanced those tasks are, so accuracy could be very misleading
- HateSonar: measured performance as the percentage of messages classified as hate speech. We would say this is not a useful metric, since even if the percentage is the same, we do not know whether it is on the right instances.
- POS tagging: measuring with recall and saying (analysis of table 3) that models are best at predicting some class based on the highest recall values. Again, the combination of recall and accuracy is not so valuable if we don't know the dataset balance (I can't see which dataset they are using here)
- Results on Slovene data are reported for validation or even training set, it is not clear.

Additionally, it is clear that there are many leftovers from previous submissions in the report, which should not be there in the final report. The organization should be better (datasets, methods, preprocessing and results are all in the Methods section), and the first few tables are tagged as figures and never referenced - we are not sure which table they are talking about.
14 | Prerequisites were not easily installed; some libraries were missing from the README. Once the prerequisites were installed, however, we were able to reproduce some results. Some scripts require manually changing the datasets.

---------------

Related work is missing work on cross-lingual identification and on Slovene. Results are mixed with methods. The metrics used for evaluating performance are, in our opinion, not suitable; for example, the results for HateSonar did not convey much information, and in some cases recall was given without precision or F1-score. Some figures (1, 2) are not referenced at all, while others are not referenced directly in the text. The Slovene BERT results could have been explained better.
21
- tasks A, B, C exactly?
- analysis says which model is the best- is that useful?
- POS vs. TF-IDF comparison - to be rephrased ...
- other features as you did not do much on deep NLP for transfer?
20 | 35 | 6
9
115 | - do you think FB data retrieval is feasible?
- there is only a description of some datasets, but no clear selection is given. Also, no plan to develop a baseline or a methodology to tackle the problem is described.
8 | - do not merge into binary only (you can do that additionally, not by default)
- describe datasets side by side using tables/figures, ...
- describe manual features in a list/table
- CA measure is not appropriate
- first explain results, train XLM-R separately and then implement meta-classification
8 | 0 | 1) Requirements and setup are not clear. There is no explanation of how to install all the needed requirements, e.g. how to set up the virtual environment and how to properly run the code. 2) Preprocessing does not differentiate between binary and multiclass data. All the paths in the script are hard-coded. Preprocessed datasets are not automatically saved in the right directory structure to be used later for features and classifiers. 3) Instructions do not explain how to set up the structure that is required for the classifier code runner. There are multiple errors. 4) Classifier.py and multiclass_classifier.py use the same dataset, which results in the error ValueError: Target is multiclass but average='binary'.
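For context, this ValueError comes from scikit-learn metrics defaulting to average='binary' on multiclass targets; a minimal sketch (our illustration, not the group's code) of the failing call and a multiclass-safe variant:

    # sklearn's f1_score defaults to average='binary' and raises the error above on multiclass targets
    from sklearn.metrics import f1_score
    y_true = [0, 2, 1, 2, 0]
    y_pred = [0, 1, 1, 2, 0]
    # f1_score(y_true, y_pred)  # ValueError: Target is multiclass but average='binary'
    print(f1_score(y_true, y_pred, average="macro"))  # explicit multiclass averaging works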

---------------

The abstract gives a brief overview and the introduction provides a good problem formulation. Related work does not contain any explanation of the shortcomings and benefits of the selected works; it mainly lists them. The style is fairly consistent, with some grammatical errors (2 points). The datasets are solidly described, but the plan is missing, as is a description of whether they concatenated the datasets or how they constructed the multiclass dataset. It looks like there is only one multiclass dataset and that all the methods are tried and evaluated on the English Twitter dataset and the Slovene Twitter dataset, while the other collected datasets are not used (3 points). Traditional methods are not well explained; no parameters used for the models are listed. Why is macro F1 used? Feature extraction is explained well. They provided a conclusion and an adequate discussion of the results. I would like to see a better explanation of how and on what data they conducted the experiments (6 points).
14 | The repository has a well-structured README; everything is clear. The code is runnable and the outputs are the same as in the report. We really liked how they formatted the README, it looks really clean and professional.

---------------

Abstract, introduction and related works are adequately described. Methods were a bit confusing, since we expected it to have a better structure, otherwise everything within methods is described as it should be. Results are very good, probably the highlight of this report, we also really liked the use of charts in the report.
28
- what is the contribution of figs 1 - 5, how to compare?
- binary only?
- analysis should be better (which classes best, subsets, why, ...)
25 | 41 | 6
10
117 | Group 117:
- the idea is not to work only on binary classification (but also to try some multi-class classifications)
- some initial analysis and datasets review has already been done by you
- related work was not described - especially the methods, as they are implemented along with the selected datasets
Group 118:
- datasets briefly described, two methods pointed out but too general
- no clear plan for work continuation given

THESE GROUPS ARE MERGED
8- datasets nicely described
- for the models, it is important to say something about the parameters if you set them (e.g. GridSearchCV)
- MANUAL FEATURE ENGINEERING FOR Traditional models (ALL GROUPS)!!!
- "length 512", needed to modify datasets, why?
- for the final report results need to be discussed, not just presented.
10 | 0 | The README is adequately written; however, it is a bit long for our taste. We recommend splitting the README into parts (for example, by directories) for future reference. Otherwise, everything in the README is clear and the code is runnable.

---------------

The report correctly follows the desired structure. The abstract, introduction and related works are well described. The methods are a bit confusing, it starts off with a good structure regarding the dataset descriptions, however we felt that it lost that structure for the methods a little bit. Results are really good, we really liked how they used the tables and charts in their report. The neon coloring of links is very distracting, it isn't present on Ubuntu however, so we don't know if that is just a Windows feature or not.
26The repository is well-structured and has a CLI application for running the experiments. A problem occurs when I tried to setup the environment. First, there are problems with the instructions if you are using Windows, therefore I continued in WSL. Next problem is, that dependency versions are not specified in the requirements.txt and a dependency version error occurs when trying to download en_core_web_sm (click library). Also the gensim package is missing from the dependencies. After we fix this, there are missing nltk download commands, and you need to manually download a few things. In the end, the main.py outputs an error due to a resource file missing, which I cannot solve.

I think the repository has a nice structure and well-written instructions and a practical CLI program for testing, but since it doesn't work we can't give them a lot of points.

---------------

Abstract, introduction, related work (5/5)
Datasets (5/5)
Methods: no transfer methods for ENG -> SLO for traditional methods
Results: No discussion on why CroSloEng BERT works poorly for training only on ENG.
Great reporting of result, including CV (but too bad that there is no uncertainty reported, since they already did CV), balancing the data and f1-score, all the process is very well described, grid search for parameters...
(9/10)
24
- not much work for Slovene data-> where are traditional features then
- report lacks analysis
25 | 43 | 1
11
109- nicely presented methodology
- also, you have found some Git repositories that you will use
10 | - remark: for the final report, the bulleted text should be turned into readable prose
- in the methods description, describe decisions related to your analysis
- 3,5GB datasets for Slovene - needs to be described, what is the class distribution?
10 | 0 | The repository was nicely organized; there were some small errors while running the notebook, but nothing serious. Separating the main analysis notebook into multiple smaller ones would be more practical. There could be more instructions on how to preprocess the data. The data preprocessing is runnable and the results are similar to the ones reported.

---------------

The report was very well written. The abstract summarizes everything done in the project. The introduction is short and brief, and related work covers everything discussed in the project very thoroughly. The data is presented in a thorough, structured way with a single chart, which gives all the information needed. The methods section is nicely divided into traditional and deep methods, further divided by the methods used, which were all explained in detail. Results are also presented very thoroughly. They used appropriate evaluation measures, put the relevant scores in bold and divided the results section based on the implemented methods. The conclusion points out all the main parts of the project and the achieved results and failures. The conclusion also contains their ideas for various improvements.
19Repository includes clear and understandable README.md. The parts we tested run with no problems.

---------------

Abstract and Introduction are short and clear. Related work contains work related to the methodology they used. Datasets are explained in detail and there is also explanation how to obtain them from the Internet. Figure with class distributions is at first glance tricky to understand but it is nicely packed in the structure of the report. Used methods are well explained with all parameters used. Results, Discussion and Conclusion are understandable and clear, there is also further work highlighted.
30
- reviewer 1 should give more points

- Fig 1 is nice
- many configurations, what is the final conclusion?

- analysis should be better (which classes best, subsets, why, ...)
28 | 48 | 7
12
108- fill out the title as given by the assignment selection
- you will skip traditional models?
10- for the multi-class decide which F1 you will use
- you plan to do cross-lingual embeddings mapping - how will you do that?
10 | 0 | The repository is clear and its README.md file is clear and leads the reader to the point. The small portion of code we tested works fine.

---------------

The Abstract, Introduction and Related work look a little bit shallow, with a large number of typos, which makes them challenging to read. Datasets are explained in detail, with sources from which they can be downloaded. There are also figures that show the class distributions for each dataset. The methods explanation lacks detail and could state the parameters used. Throughout the report there are typos and minor grammar errors.
28The repository is nicely structured, the Readme is short but includes information how to run the code, where to get the twitter dataset and where to get their trained models. Instructions on how to run the classification are very simple and straightforward. The results produced from the executed code are exactly the same as the ones in the report. There are no instructions or any mention at all how to train the models on the datasets, only how to run the trained classifiers (-). The repository seems to contain the training code, which most likely produced the final models they provided in the README. But it is not mentioned in the instructions (-).

---------------

5 points: The abstract and introduction include the background of the topic, the reasons for the approach, the proposed methods are also mentioned as well as the conclusion. In the related work multiple relevant papers are mentioned. Related work is also mentioned in the section about used datasets.
4 points: clear description of selected datasets and analysis: The selection of the datasets is explained a bit confusingly across two sections with related data and methods being mixed (-). Bonus for obtaining/annotating their own small Slovenian dataset. Dataset preprocessing is also described.
7 points: clear description of methods/algorithms used (at least 3) and adequate discussion of results: They described multiple approaches (tf-idf with different classification, their custom features with different classification, ELMo, BERT and XLM-R). Also the proposed methods are very interesting and broad but the report lacks description which parameters were used during training which makes it hard to reproduce what they have done without seeing the code (-). There is barely any mention of splitting the dataset into train/val/test, only into train and test for tf-idf and custom features and 80% for train on elmo, bert, xlmr but the whole thing is not really clear what has been used where (-). There are comments for all the obtained results explaining them. However they are mixed with some additional explanation of the method which makes it a bit hard to read (-). The best results are not bolded or highlighted (-). The conclusion is a clear summarization of the results that mBERT and XLM-R are better and that translating is not a good solution. There is also a plan on how to improve results by expanding the dataset.
32
- you should bold results in tables
- analysis should be better (which classes best, subsets, why, ...)
28 | 48 | -2
13
105- you also have an idea about Slovene (novaTV :))
- you have already checked the Slovene data
- as you propose to use only BERT-like methods, you can skip traditional ones and focus more on transfer to Slovene data instead
10 | - remark: you will turn the bullet points into text and better describe the datasets, and add figures/tables for the final submission
- you will try: BERT, translation+BERT, embeddings alignment
- improve the discussion with a comparative analysis between the models - where they are better/worse, ...
- "Here are the results ..." -> "In Table X we show ..."
10 | 0 | We deducted one point because the paths to the models and datasets are wrong. With the paths fixed, the code is easily runnable and produces similar/exact results. Models were also provided.

---------------

Every required chapter is included. Used a sufficient amount of datasets and approaches. Results, methods and data processing were accordingly commented. There were no mentions of any related work that dealt with Slovene language, although we would not deduct a whole point.
29We successfully obtained the base datasets and the Twitter one after contacting the authors. From the Readme it is also clear how to download their already trained models. We were able to run both the vector space alignment and BERT methods though we needed to modify the directories in quite some places - for BERT the base path could be just extracted into a constant. The code for BERT is clear - it includes both training and predicting as well as predicting by loading a pretrained model. Training our own models on two datasets and using them for prediction produced similar results to those in the report.

---------------

5 points: The abstract, introduction and related work are short but include the needed information regarding the relevancy of this topic and what has been already done. There is no mention about the used methods in the abstract or introduction, but just a mention that there are multiple.
5 points: clear description of selected datasets and analysis: They explain datasets very well, how they were structured originally and how they processed it (what they took from the original dataset). They use 4 English and 1 Slovenian dataset which is enough, there are both binary and multiclass datasets. They also modified the multiclass datasets to match the Slovenian one to be able to do the language transfer.
8.5 points: clear description of methods/algorithms used (at least 3) and adequate discussion of results: The methods described (BERT translated, BERT multilingual, vector space alignment) are interesting and explained quite in detail. All the results have accompanying description and explanation why the results are as they are except for table 6 which is not really explained (-). There is little to none comparison between BERT and vector space alignment except for the fact that BERT generally performs better (-), but they couldn’t comment on multilingual, which makes sense considering the results.. The best results are not bolded or highlighted in any way (-). They explained well the cross-lingual performance, which method works better, but that none of the methods work that well and they also explain why. In the conclusion they explain again why the language transfer performs poorly and mention ways of improving it by using multilingual datasets in the same domain
28.5
- interesting analysis why predictions are bad
- binary only - toxic corpus has multiple labels
30 | 50 | 2
14
107 | - clear review of related work
- you should then focus more clearly on what the first algorithm will look like/what you will try. As you have already done the review, you should have written down the exact procedure - feature functions, algorithms, data transformations, ...
- related work should also cover speech-act analysis
- there was no overview of the dataset!
7- too much descriptions ..
- no results explanations, just raw numbers
- no future directions
5 | 0 | The repository is not public (opened at 10:50)

---------------

The repository is not public (opened at 10:50)
0 | Missing information about the Python version.
When all the libraries from requirements.txt are installed and you run the program with Python 3.8.9, it gives you an error: "Resource stopwords not found". You need to manually run "nltk.download('punkt')". This step should be added to the description (see the sketch below).
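A minimal setup sketch (our illustration, not the group's code) that fetches the NLTK resources mentioned above before running main.py:

    # download the NLTK resources named in the error messages (add more if needed)
    import nltk
    for resource in ("stopwords", "punkt"):
        nltk.download(resource)  # no-op if the resource is already present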
The Zenodo website was temporarily unavailable (503 error); it would be a good idea to have an additional upload ready on Google Drive (or equivalent).
It would be easier to pass parameters as script arguments instead of changing them within the code.
When running main.py, a lot of warnings (zero division parameter, least populated class, etc.) appear in the terminal, which makes the results difficult to read. The results table is also difficult to read in the regular Windows Command Prompt (for each metric the values are stretched across 3 lines).
We weren't able to replicate the neural network results due to insufficient hardware. It would be easier to have a notebook that already scripts the download of the word2vec embeddings, so that those of us without a GPU would be able to run it too.
Why did you put an unstable version in main? Main should be reserved for the stable version.

---------------

Abstract section is empty.
You mentioned some of the custom features but ended with "etc.", which makes it unclear whether additional features were added or not; it might be clearer if you used an itemize environment and just listed all the custom features you were using.
It would be easier to compare metrics if you reported the Naive Bayes and Random Classifier metrics together in one table (without separating classes), and similarly for BERT vs. traditional methods.
The bar charts have such small text that even at maximum magnification it is hardly visible.
Bar plots should have titles and the axes should be labeled.
For readers who are familiar with NLP, long descriptions of concepts such as BoW, word2vec, naive Bayes, random forest and BERT are not needed. You should mention them, but there is no need to describe each one thoroughly.
22- abstract empty?
- figures not appropriate
- why is stemming used (comment from previous defense?)
- figure 9 the same as in previous defense
- fig 10 - what does it mean?

- what can we learn from the analysis, multiple aspects missing?
- traditional features used (how many), repository changed to public at later time
10 | 22 | 4
Mandatory oral exam; written exam min. 56%.
15
124 | You have already started to define manual features. Beforehand, you should also check what types of features were used by others in tasks like these.
Take a look into speech-act analysis - for related work.
Describe dataset in more detail (sequences, distribution within groups, book text lengths, lengths of messages, ...)
7- dataset description ok, Fig3 nice
- feature engineering is only plan ??, other further directions not given
9 | 0 | Pros:
+ Good readme, well structured, clear instructions
+ Results when running the code roughly the same as in report
Cons:
- Metric not provided for results in readme
- Class names not provided in confusion matrix - only some class numbers (id-s)
- ./src/IMapBook_dataset_cleaned.txt -> should be .xlsx

---------------

Pros:
+ Interesting manually engineered features
+ Good analysis of features and feature selection
+ In-depth analysis of the dataset
+ Good and clear description of experiments and the parameters used
Cons:
- Data set preprocessing not clearly described
- Very small labels in graphs
- Only micro F1 used for most results
- No really deep methods used like BERT or GloVe
Comment:
- The feature "similar to previous message" is used in combination with a random train-test split. Some messages from the test dataset could be used during training to calculate similarity with messages from the train dataset.
26 | The instructions are very clear; however, it seems that not everything needed to run the code properly is covered, because I had quite a few problems getting it running. Not all libraries are in the requirements.txt, and the dataset file was difficult to find (nothing runs without it). From what I did manage to run, the results were generally reproducible.

---------------

The structure is not explicitly formulated as stated, but it can be seen that it follows the guidelines, and the main sections are broken down into smaller sections. The one complaint here is that the abstract is too detailed and should be more general. The description of data, results, approaches etc. is good; the only problem is that, because of the poorer English, some sentences are not very clear and I don't fully understand the meaning. All required parts of the report are present, along with a suitable discussion. The only thing I would add is more references, as a lot of terms are left unexplained.
24 | - feature selection - which features are the best, examples (p. 6)?
- examples of hapaxes?
- cross-validation: which results were chosen as the final result?
- what is the overall difference between 5 and 11 classes?
- "Our models were missing from semantic meaning repre- sentation. If we would include that we could possibly improve our models." - hm??


- using simple NNs is not adequate
10 | 26 | -16 | Mandatory oral exam
16
114 | The dataset was analyzed - in the future create a deeper analysis (semantics, book texts, ...)
Which message types do you think are important for IMapBook?
Related work on text/speech-act classification is missing? How does this task relate to those?
8- dataset should be explained, not just some images added
- basic baseline with tfidf features
- no future directions
6 | 0 | The README is written OK but it doesn't include the student ID. It specifies how to download pretrained models for BERT. Requirements are not in a requirements.txt but are listed in the README, which makes them tedious to install. The code is not runnable on Linux; it needs fixes to the file paths. We couldn't make the BERT model run on GPU; it runs OK on CPU but takes a long time to get results. We also tried to run the other options specified in the main help, but after inspection we saw that setting the parameters makes no difference, as they are not connected to the code. We tried to run the baseline models by hand, but there is no simple way to run them because there is no file for running them.

---------------

The report is missing a "Discussion" section. Data, data preprocessing and feature extraction are described in the introduction, which we do not think is the right place for that. Algorithms are only listed but not discussed. There are also 6 tables included, which we do not think is necessary. Figure 4 could be presented differently, since reading text of that size is not easy (especially on a dark background). There is no discussion section, but the results are described in the "Conclusion" section. We think the author did a lot of experiments but failed to present them in the report.
16 | The repository includes a README.md file with instructions, but no file like requirements.txt to easily install the dependencies; they are just written in a table inside the README.md. If we just run the code with the command provided inside the README.md, we get the following error: 404 Client Error: Not Found for url: https://huggingface.co/resources%5Cmodels%5Cpretrained%5Cenglish/resolve/main/config.json. Since this group made it easy to run just selected models, we were able to run the ones that worked (everything except BERT). The results we got are pretty much the same as the ones in this group's report. 8 points because there is no requirements.txt and because BERT is not runnable.
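The %5C in the failing URL is a percent-encoded backslash, which suggests a Windows-style path that does not exist locally was passed to transformers and was therefore treated as a Hub model id. A hedged sketch of a portable way to load such a local model (the directory name is taken from the error URL; the loading code is our assumption, not the group's):

    # build the path portably and make sure it exists before calling from_pretrained,
    # otherwise transformers treats the string as a model id on the Hugging Face Hub
    from pathlib import Path
    from transformers import AutoConfig
    model_dir = Path("resources") / "models" / "pretrained" / "english"
    assert model_dir.is_dir(), f"expected a local model directory at {model_dir}"
    config = AutoConfig.from_pretrained(str(model_dir))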

-----------------------------------------
They should provide a requirements.txt instead of listing packages in the README; the package name is scikit-learn, not sklearn. The Windows path does not work on Linux.
After the initial effort and editing I was able to run a few models (I did not try to run BERT). The results I checked were OK.
Points: 6

-----------------------------------------
Points: 6



---------------

The report follows the suggested structure and is less than 8 pages long. It includes a description of data preprocessing and feature extraction, and lists the methods used. The results of the algorithms used are provided in the results section, and their performance is commented on in the conclusion section. The conclusion also includes information about which algorithm worked best on which combination of dataset and classification output. The shortcomings of the deep learning methods are also provided, while there are no ideas for future improvements. Also, their Figure 3 is strange: nowhere in the report does it state what exactly the metric for the most active user is, and we got users "edf-15" and "edf-16" to have the most words and the most messages, while they got user "pim-30" as the most active. The other group we were assigned to review also got "edf-15" as the most active user.

Abstract, introduction and related work: 3/5 points
Abstract contains some text that would be more fitting in the introduction and it also includes the results - not sure if that is ok or not.
Introduction is mostly just dataset analysis.
Related work is there.

Clear description of selected datasets and analysis: 3/5 points
Dataset is well described, although it is in the introduction section of the paper. Figure 3 is wrong or just uses some weird metric, but nowhere in the paper it states what it is - Figure 3 is never referenced and it has a bad label.

Clear description of methods/algorithms used (at least 3) and adequate discussion of results: 8/10 points
Methods are more enumerated than described; ideas for future improvements are missing.

-------------------------------------------------

Abstract, introduction and related work: 3.5 points
The abstract is too long; part of it could be written in the introduction, and the Data section could be separate. The introduction is very short.

Clear description of selected datasets and analysis: 3 points
Presentation of data isn’t the best, graphs, top 15 words could be described better.

Clear description of methods/algorithms used (at least 3) and adequate discussion of results: 7 points

-------------------------------------------------

4 points: abstract, introduction and related work
5 points: clear description of selected datasets and analysis
6 points: clear description of methods/algorithms used (at least 3) and adequate discussion of results.
You could take many of the custom features into account by just including punctuation as tokens in the CountVec/TF-IDF (see the sketch after these comments).
I guess you tried CountVec vs. TF-IDF vs. handcrafted features... why not join the handcrafted features with one of the first two methods?
The code structure and formatting are hard to understand.
A lot of very similar methods/algorithms were tried.
Missing some explanations why something worked, potential future improvements.
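A minimal sketch (our illustration, not the group's code) of keeping punctuation as separate TF-IDF tokens by overriding scikit-learn's default token pattern:

    # a token pattern that keeps words and individual punctuation marks as features
    from sklearn.feature_extraction.text import TfidfVectorizer
    vectorizer = TfidfVectorizer(token_pattern=r"\w+|[^\w\s]")
    X = vectorizer.fit_transform(["Is this a question???", "No, it is not."])
    print(vectorizer.get_feature_names_out())  # words plus '?', ',', '.' tokens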
21- parameters not connected to the code?
- BERT with 92%, what are the splits?

- where are multiple topics of analysis, discussion, explanation of results?
15 | 29 | -2
17
123 | - related work is insufficient
- the dataset is not described, only a distribution is presented
- the initial ideas are just general approaches without a clear vision
4- dataset should be better described
- more advanced models used, no custom features for traditional models
- further steps not adequate
8 | 0 | The repository includes a README.md and it is written nicely and clearly. The instructions are easy to follow.
I actually couldn't run the code, because I had issues with importing collections from utils, which I couldn't resolve by the end of the review. The problem was probably on my end :/

---------------

The abstract could be a little more informative.
The style of the project report is in the right format and follows the suggested structure. It is also short and concise. Data and preprocessing steps are clear.
They used 4 approaches, which are nicely presented in the methods. Maybe there is a little confusion, since algorithms and features are mixed in the methods presentation.
Results presentation is nicely organized and well presented.
They are discussing where models performed well and interpretation of results is well written.
Shortcomings and future work are present.
29Pros:
+ Instructions are provided in readme
+ Evaluation code is runnable
Cons:
- No instructions to unzip the downloaded GloVe file
- Trained model for BERT could be provided, since training + evaluation took almost 2 hours
- It's not clear which directory one should be in when downloading the models - /code or not -> the order of the instructions could be different
Comment:
- When running results for BERT didn't match with the ones in report - maybe we didn't see that something else has to be changed (uncommented) in code before running
- requirements.txt is very long - some libraries probably not needed (google-auth?)

---------------

Pros:
+ Interesting related work about IMapBook
+ Nice dataset analysis
+ Very in-depth and thoughtful grouping of classes
Cons:
- Very short abstract
- Nothing about used methods and findings in abstract or introduction
- Not clear whether the CREW data and/or the Discussion-only data was used
- Small labels in graphs
- Short and not in-depth discussion about results. Almost nothing about results for GloVe and BERT. Only general results, no details.
- Experiments not well described - only theory about methods
Comments:
- Long methods descriptions for such short report
- Only 5 pages (should be ~6-8)
23- how do correlated features influence a model?
- no analysis done/discussion
20 | 32 | -11
18
112- Rocchio was used merely for IR?? There is no need to theoretically explain how specific methods work - this should be referenced.
- related work is ok
- dataset should be more clearly elaborated (points 1 and 2)
8- no data description
- basic baselines only!
- no future directions
5 | 0 | Instructions for running the classifiers and setting up the project are OK. There are no instructions for running BERT locally, and no clear instructions for running it on the web - also, when I ran it all on the web, it failed.

---------------

Figure 2 has the wrong description, and it is impossible to read what the columns are; x and y should be inverted for easier viewing. It also uses only 1 target for classification; others should have been considered, such as Book ID, Topic, etc. They merged the two datasets, and this is not explained in the dataset description - it is only mentioned later in the methodology. Merging is a good extra idea, but they should have checked the datasets separately as well, since they were provided as such - that way they would cover different data situations, real-world and ideal.
23 | The README is well written; it specifies all the requirements, which are in requirements.txt. The standard models are all runnable with the command written in the README. The results are comparable to the ones written in the report.
BERT, which is on Google Colab, works OK. The README also says to upload the dataset and to set the notebook to run on GPU. When running it in sequence, it fails because a variable is not defined (it is defined in later cells); this also causes the plot to fail. We needed to go back and run a previous cell to make it work. We also observed that the validation data is the same as the test data.

---------------

Report structure is well organized, except before Related work there is a section named Traditional Classification methods, which could be described later.
Also there is no section Result, but results are described in multiple separate sections. Selected dataset is well described. Also there is a clear description of all the methods and algorithms. Data and preprocessing part are described, but they could be a little more. At the end section "Discussion" is missing and in section "Conclusion" there is no further ideas for improvements. Overall the report is well written and we get a feeling that the authors know what they are doing.
26- figures captions are not okay
- maybe tail of the distribution is more interesting?
- hard to compare results of traditional methods

- sheet1 and sheet2 merged - what about separately, why?
- uncased bert used (do you remove casing in your preprocessing?)

- F1 (for grouped classes) = 70%

- no analysis per class, only one idea approached, where is discussion?

- 20 points only due to idea of grouping
- story as ContentDiscussion??
25 | 38 | -3
19
120 | - related work should be elaborated more in the field of text classification/speech-act recognition
- a dataset review was done; in the future perform it more deeply. Also, do not use pie charts.
9- data could be described better
- basic features models only (BoW, TFIDF)
- no future directions
6 | 0 | The repository is nicely described. There is a typo: requirements.py should be .txt. Everything else was good. The code is runnable and the results are comparable.

---------------

Report follows the proposed structure. Data set, methods and feature extraction are nicely described. They included more than 3 methods including deep neural networks. Results are in nice format. Discussion critically evaluated their results. At the end they include future work and point out what they could improve
30 | The repository includes a README.md file with instructions, but they didn't work for us. After installing the dependencies from requirements.txt (in their README they have a typo - requirements.py), we were still missing the dependencies bcolors, keras and textblob. We still had a hard time running the project, since it requires a specific version of tensorflow, which is not given in the requirements file. Only one person in our group could run the project.
The given results match the ones in the report, except Deep CNN with TF-IDF, where they got an F1-score of 0.37 instead of 0.307.
Points: 6

--------------------------------------------------------------
Unrunnable. Missing many modules from requirements.txt. Even after quite some effort and after installing all the obvious packages I was getting errors such as:
keras.utils.to_categorical(Y_train)
AttributeError: module 'keras.utils' has no attribute 'to_categorical'
probably because the default package versions were not ok.
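This AttributeError typically points to a Keras/TensorFlow version mismatch; a hedged sketch (assuming TensorFlow 2.x is installed, not based on the group's code) of importing the helper through tensorflow.keras instead of the standalone keras package:

    # to_categorical imported via tensorflow.keras works across recent TF 2.x versions
    import numpy as np
    from tensorflow.keras.utils import to_categorical
    Y_train = np.array([0, 2, 1, 2])
    print(to_categorical(Y_train, num_classes=3))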
Couldn’t check the results as I wasn’t able to run the code.
Points: 3 -- A participation award for the effort. I would actually give 0 and give them a chance to fix it for full points.
--------------------------------------------------------------

Points: 7

---------------

Report follows the suggested structure and is less than 8 pages long. The report includes description of data preprocessing, feature extraction and methods used. The results of the algorithms used are provided in the results section, and their performance is commented on in the results and discussion sections. No shortcomings of their approaches are provided, but they do provide some ideas for the improvements in the future.

Abstract, introduction and related work: 5/5 points
Abstract nicely summarizes their paper, introduction attracts the reader to the paper and presents other sections. Related work is there.

Clear description of selected datasets and analysis: 4/5 points
Dataset is well described and analysed, but I miss some graphs to make the paper more interesting.

Clear description of methods/algorithms used (at least 3) and adequate discussion of results: 9/10 points
No shortcomings of their methods are provided, but otherwise everything.

-----------------------------------------------------------


5/5 points: abstract, introduction and related work.
4/5 points: clear description of selected datasets and analysis.
A lot of very correlated features, why? (e.g. does a message contain ?; number of ‘?’ that message contains). This can actually hurt the models. Did you try to do some feature selection, or try using a smaller number of less correlated features? It could actually improve the performance.
6/10 points: clear description of methods/algorithms used (at least 3) and adequate discussion of results.
The discussion and results basically just list what they tried, what worked well in their specific case and what didn’t. They don’t answer why they chose them or why they think certain methods worked. Future improvement suggestions (use other algorithms) are too vague.

They explicitly mention they tried a basic neural network, a deep dense neural network and a multilayer perceptron … aren’t these three basically the same thing? They mention basic NN misses some layers..? Guessing by the provided image, deep NN was the only one that used dropout layers?
22- if you say all others are under-represented, what could you do - merge them in one class, would it be useful?
- what information can we get from table 2, how it is relevant for the analysis?

- multiple evaluation settings tried (CREW, Crew and discussions, Joined) with F1 0.6, 0.63, 0.67
- our results were quite good - based on what?

- analysis and discussion lacks explanations
25 | 40 | -3
20
125 | - very clear description
- the description of the corpus will need to be expanded further and treated in a separate chapter, since it is not yet described anywhere. It also needs to be described from different viewpoints
10 | - more detailed description of the data = +
- the basic models still use only TF-IDF
- a plan is given, but it is incomplete
9 | 0 | I would recommend adding the version of Python you used; I tried with Python 3.6.0 and 3.9.1 first but wasn't able to install the dependencies from requirements.txt.

I also had issues with a missing dependency. I managed to install the listed dependencies with Python 3.7, but when running the script manual_features.py I got an error that the xlrd dependency is missing. pip install xlrd did not work, as the newest version does not support the xlsx extension, so openpyxl had to be installed.
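A small sketch of the usual workaround (our illustration; the file name is a placeholder, not the group's): newer xlrd releases only read legacy .xls files, so .xlsx files are opened with the openpyxl engine, e.g. via pandas:

    # xlrd >= 2.0 no longer reads .xlsx, so ask pandas to use openpyxl instead
    import pandas as pd
    df = pd.read_excel("some_dataset.xlsx", engine="openpyxl")
    print(df.head())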

Despite that, the code was still runnable using the prepared notebook on Colab. The README has clear instructions; personally I would even shorten it a little by excluding the first section, where you describe the whole task, since that is already included in the report.

---------------

The report is 8 pages long and contains the expected sections (abstract, introduction, related work), dataset is well described and also presented visually. The preprocessing and feature extraction steps are also described. Results are presented in tables and also discussed: what affected the obtained results, which features improved/reduced performance.

The only minor complaint regarding the report is that a mixture of Slovenian/English terms is used. In tables you use English terms for precision, macro, recall but in text you used Slovenian terms (točnost, priklic, preciznost). You could also use Slovenian latex package to get translated names in your captions (table, figure etc.)
27The instructions in the README were more than clear. They show that the authors have put a lot of work into their report, as they have also prepared a separate Google Colab notebook with even clearer instructions.

---------------

In general the report does follow the suggested structure, but there are some small differences, as it does not divide all the paragraphs correctly. Otherwise their work was really great to read, and every decision they made was discussed and described. The work was done with motivation, as they ran different experiments and tried multiple strategies to ensure the best results. Excellent work!
28- CREW 0.54, Discussion 0.75 (macro) + many additional results!
- how were the manual features selected?
- many analyses and explanations

- EXCELLENT!
30504
21
119- more datasets should be used for the task
-> he focused on annotation schemas and procedures from the datasets
- sensible methodology
-> nice use of clustering
10- the repo should contain only one report that you keep updating (not tied to a specific submission)
- data should be better described - e.g. guidelines/goal for annotation (Table 1 is useful)
- BPEmb embeddings - what exactly are they? (see the note after this list)
- Fig2 vs. Fig3 - more explanation is needed.
- future directions: before training a classifier, you should exhaust exploratory analysis/visualizations/filtering/keywords grouping, ...
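For context (our understanding, not taken from the report): BPEmb are pre-trained subword embeddings based on Byte-Pair Encoding, released for many languages and vocabulary sizes. A minimal usage sketch with the bpemb package:

    from bpemb import BPEmb

    # English subword embeddings with a 50k BPE vocabulary and 100-dimensional vectors;
    # the pre-trained model is downloaded automatically on first use.
    bpemb_en = BPEmb(lang="en", vs=50000, dim=100)

    print(bpemb_en.encode("offensive post"))        # list of BPE subword tokens
    print(bpemb_en.embed("offensive post").shape)   # one 100-dim vector per subword token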
Trying to set up the environment locally is a nightmare, as there are 419 dependencies listed (not all of them required for this one notebook), and installing via the described command gives many errors because of conflicting dependencies and missing sources. If you list the command as part of the setup, you should at least test that it works.
Luckily, running in Colab with the listed backup instructions works perfectly.
In Colab, code is runnable, results are the same as in the report.
The different embeddings for texts are saved in a file so we don't have to extract them, which is great as it saves time.
Code is very well documented and readable.

---------------

B.1) ( 4/5 ) abstract, introduction and related work

Report partially follows the suggested structure (abstract, introduction, related work). While there is no dedicated Related work section, some related papers are mentioned in the Data section, but the scientific goals and methods are described for only some of those papers. The abstract is short and efficient. The introduction describes the problem sufficiently.

B.2) (4/5) clear description of selected datasets and analysis

The 8 selected datasets are mostly well described in the Data section. The data preprocessing is not explicitly stated in the report, but the details become clear after you look at the code. Some results of the analysis of the datasets are visualized in Figure 1. More details regarding the construction of the datasets used in the notebook could be given.

B.3) (9/10) clear description of methods/algorithms used (at least 3) and adequate discussion of results.

- The report also does not include a Methods section, although the methods used are described. Some details (parameters) regarding the methods are missing, such as the linkage used for the hierarchical clustering (see the sketch after this list).
- Different embeddings were generated and experimented with (tf-idf, fastText, BERT, ELMo).
- The data was analysed using multiple different methods, of which hierarchical clustering seems to have returned the most useful results with regard to the final proposed grouping. Although the report contains a discussion, it is not given its own section. As part of the discussion of the results, a seemingly justified grouping is proposed. The discussion also points out where the methods used did not return valuable results.
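As an illustration of the kind of parameter meant above, a minimal SciPy sketch where the linkage method is stated explicitly (the embeddings matrix is a hypothetical placeholder, not the project's data):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    embeddings = np.random.rand(20, 50)               # hypothetical document embeddings

    # The linkage criterion (single, complete, average, ward, ...) strongly affects
    # the dendrogram, so it is worth stating explicitly in the report.
    Z = linkage(embeddings, method="ward", metric="euclidean")
    labels = fcluster(Z, t=4, criterion="maxclust")   # cut the tree into 4 clusters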

Other things considered: The length and style of the report follow the instructions. The project group consisted of only one author, who consequently did all the project work.
26Repository is well organized and everything is well described. We had problems installing requirements, as there were some conflicting dependencies between libraries (-3). We also think there are too many libraries; do we really need to install, for example, the "music21" library or a library for converting the Korean lunar calendar to the Gregorian one?

But once we moved to running in Google Colab, everything went smoothly and the results match the ones in the report.

---------------

Structure, style and length are all appropriate. The data is precisely described, but there is not a word said about its preprocessing (-1). At least 3 approaches are described and critically evaluated. When reading, one gets the feeling the author knows what he is doing, and the report flows nicely. The only thing we are missing is a clear and concise conclusion (that is, the "final schema") (-1). Also missing is a more precise description of the models used (especially in the BERT and ELMo case: were they pretrained or trained on your data?) (-2).
23- Fig1-3 should be nicer/more readable
- what is silhouette? (see the note after this list)
- Figs 4-7 are far from the text
- final decision/schema is not clear
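(For context: the silhouette score measures how much closer each point is to its own cluster than to the nearest other cluster; it ranges from -1 to 1, with higher being better. A minimal scikit-learn sketch, using a hypothetical feature matrix:)

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    X = np.random.rand(100, 20)                            # hypothetical embedding matrix
    labels = KMeans(n_clusters=4, random_state=0).fit_predict(X)

    # Mean silhouette over all points; values near 1 mean well-separated clusters,
    # values near 0 mean overlapping clusters.
    print(silhouette_score(X, labels))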
25432
22
121- datasets are described, but for the related work you should have found analyses similar to the one you will do (for other domains, keywords, ...)
- initial ideas for the baseline are okay - delete the last bullet; the idea of this task is not classification!
7- as discussed during the lab session
- be careful with the t-SNE discussion; use subsets of keywords if they make more sense
- be careful when applying BERT, ELMo
- the final schema should somehow uncover "relationships between keywords"
100README is written really nicely and the results are reproducible (for what we've tried).

---------------

The abstract and introduction are way too short. Some parameters are missing in the descriptions of some algorithms, and some figures are nonsensical.
27README is present with instructions on how to install the prerequisites and run the analysis.
The majority of the code is runnable and the output is similar to the results from the report.
One line in a cell, "import pyLDAvis.gensim", gives an import error; we tried fixing it but were not successful, so the topic modelling part could not be checked (possibly the module was renamed to pyLDAvis.gensim_models in newer pyLDAvis releases).
Getting the pretrained BERT embeddings is slow and takes a lot of time (originally 7 hours); the obtained embeddings could have been saved and included in the repo, but they are not (see the sketch below).
Code could be a bit better organized/documented; some cells also have old error outputs.
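A minimal sketch of the kind of caching we mean, assuming the embeddings end up in a NumPy array (file name and shape are hypothetical):

    import os
    import numpy as np

    CACHE = "bert_embeddings.npy"

    if not os.path.exists(CACHE):
        # stand-in for the slow BERT extraction step (shape is hypothetical)
        embeddings = np.random.rand(10, 768)
        np.save(CACHE, embeddings)        # save once, commit the file to the repo

    embeddings = np.load(CACHE)           # later runs only pay the cost of loading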

---------------

3/5 points: abstract, introduction and related work
- The abstract is too short; it should contain more information about your work. Moreover, you should say something about your results, maybe what your biggest finding was.
- The introduction is a little too short; you could write a sentence or two about the hate speech topic or gently introduce the reader to the problems and challenges in the field.
- Related work is written very descriptively and covers many different approaches to the hate speech topic.

4/5 points: clear description of selected datasets and analysis
- Datasets are described very thoroughly. At the end, the extracted categories are nicely summarized in a table. One thing missing from the dataset description is how the labelling process was conducted (i.e. what the instructions to the annotators were, how the problem of subjectivity was addressed, etc.). Another remark is that you could drop the precise number of texts gathered for each category from the dataset descriptions, as it makes them a little hard to read. You already show the numbers in the final table, where the reader gets a nice overview of the data.

8/10 points: clear description of methods/algorithms used (at least 3) and adequate discussion of results.
- Data preprocessing: nice overview together with examples in the table.
- Figures: a slightly bigger font would be nice, since some words are really hard to read. The majority of figures have great captions, but some of them (7 and 8, for example) lack a more detailed description. Moreover, there are a lot of figures in the report, especially for the t-SNE and PCA visualizations, where you could include only the most informative ones.
- Approaches: you've tried many different approaches - TF-IDF, different neural embeddings (Word2Vec, fastText, GloVe), contextual embeddings (BERT, ELMo) - and described them well.
- Analysis of approaches and results: some approaches lack a more detailed analysis of the results, as you provide many figures but do not explain them as thoroughly. Moreover, you could try to reason about the results in connection with the method you used.
- Future improvement: you've included a suggestion on how to improve your results, but it is focused more on the dataset than on the methods (i.e. try another method, try a different approach to embeddings, etc.).
22- datasets and literature overview should be paragraph-like
- nice upset plot
- abstract
- Figures 2-7 should be made clearer - what is their contribution?
- Fig 9: words by topic, not in 2-D space?
- during the lab session we said to improve fig 8??
- Fig11: where is the neighbourhood?
25421
23
113- related work should be elaborated more, connected to the actual task rather than being general classification-oriented
- do not focus on classification but rather on the description of datasets and categories
- re-do related work review from the aspects of topic modelling, ... (works that tried to explain categories for other domains)
8- corpora idea/annotation should be better explained
- no need to describe tf-idf, BoW, BERT as you should use more methods, not adapt existing ones
- further work missing
70The requirements are provided, scripts are runnable and the results are identical to the ones in the report (10/10).

---------------

Abstract, introduction and related work - each part is present, related work might be a little irrelevant (4/5).

Dataset descriptions - present, clear, information about data annotation was provided (5/5).

Methods - no intuition was provided behind adding "this is <offensive class>" to each text; using tf-idf on joined documents decreases the number of documents to the number of classes, which means tf-idf will be dominated by term frequency (see the note below); results are adequately commented on and discussed (8/10).
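(To spell out the reasoning: with the usual formula idf(t) = log(N / df(t)), joining all texts of a class into one document makes N equal to the number of classes. df(t) can then only range from 1 to N, so idf collapses to a handful of values, and any term that occurs in every class-document gets idf = log(1) = 0; the remaining variation in the tf-idf scores comes almost entirely from raw term frequency.)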
27readme.md has clear instructions, repository is well structured.
no trouble installing dependencies

w2v_term_analysis.py runs smoothly, generates similar outputs
w2v_document_embeddings.py runs smoothly
bert.py produces same dendrograms
kmeans.py doesn't show all the labels as it does in the report.
9/10 points

---------------

abstract, introduction and related work:
well written abstract, clearly presented problem; related work only contains 2 examples, one of which is about recognising hate speech and not exploratory analysis. 4/5 points

clear description of selected datasets and analysis:
datasets are thoroughly described, good description of data preprocessing. 5/5 points

clear description of methods/algorithms used (at least 3) and adequate discussion of results:
the methods and how they were used on the data are very well described; the results of the methods they used are presented in a concise and informative way,
the result analysis should be more concise. I don't know why the Missing speech classes section is in the discussion and conclusion section; it doesn't belong there. I think it should be in the data acquisition section. Also, because the missing labels were not analysed, they should not be included in the final schema. Otherwise the final schema looks nice and is very informative.
7/10 points

report all together: 16/20 points
25- kMeans+PCA results
- Glove + PCA
- how can Fig3 be explained?
- very useful explanation - Fig. 6 -> great!

- some labels not analyzed are in final schema - which?
30452
24
103- all expected methods indicated
- how will models 4-8 be used?
-> they will use only preprocessing
- some data was already processed - what about the additional data?
- visualization techniques are also mentioned
- nice overview of data
10- only two datasets described, wordcloud still there?
- what if you train your own word2vec, fastText, GloVe? (see the sketch after this list)
- in the BERT section there are some examples - how are they retrieved?
- good example of further work
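For example, a minimal gensim (4.x) sketch of training word2vec on your own corpus (the toy sentences here are a hypothetical placeholder for the tokenised posts):

    from gensim.models import Word2Vec

    # Hypothetical toy corpus; in practice this would be the tokenised posts from the datasets.
    sentences = [
        ["this", "post", "is", "offensive"],
        ["another", "toxic", "comment"],
    ]

    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
    print(model.wv.most_similar("offensive", topn=3))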
80Everything looks OK

---------------

No related work (-1 point). Some methods (e.g., GloVe) might be described too much (no deduction here though). Data sets are described well. Nice report :)
29The readme and the notebooks included all the needed information to download the models and run the code.
Some libraries were missing from the .yml file.
word2vec_own_models.ipynb needed some changes to work.
We could not run training_glove_model.ipynb.

---------------

2.5/5 points: [abstract, introduction and related work]
abstract: Very vague, does not point out specific methods. Seems like the paper also focuses on detection. [-0.5]
intro: Very short. Primarily focuses on data instead of introducing the topic and goal of the research. [-0.5]
related work: Missing [-1.5]
4.5/5 points: [clear description of selected datasets and analysis]
data: Clear histogram showing the distribution of classes in the data. Descriptions of the annotation process for each data-set and explanation of the original purpose. Short description of data pre-processing, lacking reasoning [-0.5].

7.5/10 points: [clear description of methods/algorithms used (at least 3) and adequate discussion of results.]
word2vec and GloVe were trained on their own data. They point out the classes where this worked and where it did not work at all.
Pre-trained fastText on Urban Dictionary. Sensible reasoning for selecting the model. Multiple models analyzed.
Clear explanation of the figures relating to the analysis with non-contextual embedding methods. Interesting results with term analogies; more examples could have been included (see the sketch after this block).
BERT: sensible approach and hierarchical clustering results. Examples of most similar posts to average embeddings.
Lacking results for traditional methods. [-2]
Discussion: Pointed out factors regarding data that impacted the results, discussed the shortcomings of certain approaches.
Schema: Vague grouping and relations of classes [-0.5]. Interesting division based on explicitness of language and challenges of detection.
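As an illustration of the kind of additional analogy example we have in mind, a minimal gensim sketch (the pre-trained model name is our assumption, not the one used in the report):

    import gensim.downloader as api

    # Any pre-trained keyed-vectors model works; glove-wiki-gigaword-100 is just an example
    # and is downloaded on first use.
    wv = api.load("glove-wiki-gigaword-100")

    # Classic analogy query: king - man + woman ~ queen
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))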
24- A lot of configurations tried
- best results with ConceptNet Numberbatch embeddings
- the final schema could be visualized and represented for all categories; it is still useful that different aspects are mentioned (e.g. explicitness, aggressiveness)

- lacking: clear final schema
28462
25
102- statistical analysis
- analysis of specific words that may contribute to the class
- Word2Vec, BERT
- classification is not part of this task!
- datasets should be analyzed
9- you describe corpus merging -> could you still do analysis in each corpus separately?
- corpora should be described better
- abusive, fearful, disrespectful have almost the same keywords?
- what are the parameters for Figure 1, and why were only those labels selected?
- Figure 6 finds some interesting clusters - what are the parameters? (in the source code you can freeze the random seed to enable reproducibility; see the sketch after this list)
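A minimal sketch of what we mean by freezing the random value (the t-SNE projection and the data are hypothetical stand-ins for whatever produces the figure):

    import numpy as np
    from sklearn.manifold import TSNE

    np.random.seed(42)                          # freeze NumPy's global random state

    X = np.random.rand(200, 50)                 # hypothetical embedding matrix
    # Fixing random_state makes the projection (and hence the figure) reproducible.
    proj = TSNE(n_components=2, random_state=42).fit_transform(X)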
README.md is ok, containing all the needed information. Most of the notebooks are runnable; we only had problems with one notebook. The line demoji.download_codes() was missing, but we figured this out quickly. Also, nb_conda is required in the base conda environment, because without it we were not able to set the correct kernel.

---------------

Abstract ok, introduction ok; in the related work many of the cited papers are the ones that produced the datasets, and each used dataset is described. In the related work they did not explore similar work on hate speech analysis. They also described their data preprocessing and how they combined multiple datasets into one. They used more than 3 methods/algorithms, which are well described. Results are discussed in the experiments section. At the end they conclude with final thoughts and the shortcomings of the data. Problems with specific models are described in the experiments section.
29Looks OK

---------------

Related work only mentions data set authors and their work regarding this data set - there is no other related work mentioned (-0.5 point deducted).
29.5- they introduce specializations of harassment - why?
- figure 1 is nice
- fig 3, 4 and 5 could be more readable
- used plain BERT and fine-tuned BERT - what exactly are the results?
- final decision for the schema is not exactly explained
2846-3
26
100- short and concrete
- for the datasets and keywords you will probably need to expand
10- Figure 3 interesting
- continue in a way to extend description you provide in the section "Discussion & future directions"
README is written really nicely and the results are reproducible. We did not run the merging, but at first glance it looks ok.

---------------

The introduction is a bit too short and doesn't fully capture the problem we are trying to solve. Parameters are missing for some algorithms, and there is a weirdly high correlation for some results that is not discussed.
Everything is explained in the README.md, the notebooks are runnable, and the results are similar to those in the report.

---------------

Very nice report, following all of the requirements, containing all required sections, everything well explained, nice visualizations, nice final schema.
30- they take into account 19 categories (15 are obligatory)
- figure 4 seems weird
- very nice explanation
30503
27
104Very nice!
Still, you should focus more on the different techniques you will try and on additional data acquisition.
10- data description: example +; try to find goal/annotation instructions
- no need to present all the methods -> rather focus on how you used them
- useful visualizations: the approach used to obtain each of them should be described -> for overlapping keywords, continue with a separate analysis
- interesting idea to use analogies
- How was figure 9 obtained - the same way as BERT?
- further work assumes the analysis is already finished
100Necessary instructions are provided, the code is runnable and produces the same results (10/10).

---------------

Abstract, introduction, related work - Everything is present and nicely written (5/5).
Datasets - Clear and well described (5/5).
Methodology - Word2vec is known to capture some analogies, but inferring hypernymy/hyponymy in a similar way doesn't seem too convincing; everything else was nicely done and presented (9/10).
29The readme and the notebooks included all the needed information to download the models and run the code.
We ran all the notebooks with almost no errors.
05_noncontextual_embeddings_labels, line 405 throws an error: ValueError: Array must be symmetric.

---------------

4.5/5 points: [abstract, introduction and related work]
abstract: Announces the problem matter and the methods used
intro: Contextually introduces the problem matter
related work: concise, but no separation between intro and related work [-0.5]
4/5 points: [clear description of selected datasets and analysis]
data: They cite the source of the data for each data-set and the number of samples for each class, and describe the annotation method for each data-set, but don't further describe the original purpose of the data [-0.5]. Very brief overview of the data pre-processing, without clear reasoning [-0.5].

8/10 points: [clear description of methods/algorithms used (at least 3) and adequate discussion of results.]
TF-IDF - useful comments about the data-set origin and its impact on the results
Non-contextual word embeddings - approach taken directly from the class example. Visualizations too small and crowded to be readable. [-0.5] Interesting idea with word analogy. The method uses the most similar words in the pre-trained model as the source of truth.
Contextual:
BERT: Visualizations too small and crowded. Differences between colors within a class are hard to distinguish. The color scale should be normalized to the range of the results (it's all blue here).
The results are defined, but no reasoning is given. [-0.5]
KeyBERT: Reasoning for the results based on the source data. Useful results.
Certain method parameters not clearly defined, which would make it difficult to reproduce without the source code. [-0.5]
Schema:
Clear visual schema of relations and groups

Discussion:
Short, mostly just reiterates the abstract. Lack of language context for the results [-0.5].
Slightly too long.
26- from the contextual embeddings they use KeyBERT and USE
- in fig5 there are some long words?
- explain silhouette scores?

- what exactly is new from last time?
- nice schema (explain)?
30504