AUEB, Data Science MSc, Thesis Topics

	A	B	C	D	E	F	G	H	I	J
1	Serial Number	Firm's name	Person of Contact - POC	Topics	Description	Skills	Deliverables	Supervisor(s)	Suggested Bibliography	Remarks / Notes

2	1	Ernst & Young Business Advisory Solutions S.A.	Alexandra Korda Alexandra.Korda@gr.ey.com	Feature extraction and classification: Sentiment Analysis on Twitter posts for Greek Banks	Twitter conveys the opinions and interests of people on various topics and domains. At the same time, Twitter captures the real-time, continuous reaction of its users to news and events. These are not limited to politics and sports; they also include information about financial institutions, companies and international organizations. This gives organizations’ decision makers an opportunity: they are now better able than ever to understand how their current and potential customers tweet (and “feel”) about their business, and to identify ways to respond to people’s perception of their company. In this study we will focus on Twitter posts about Greek banks, with the purpose of automatically classifying opinion polarities and topic categories. Sentiment Analysis, or Opinion Mining, aims to determine the attitude of a writer with regard to anything he or she has said or written. It indicates what a particular person is trying to communicate, together with their emotional state and judgment regarding a topic. In this process, a given text is taken as input, and the words and sentences found in the document are categorized into different levels of sentiment. The aim of this study is to develop a real-time algorithm that classifies the tweets. Feature selection techniques, natural language processing techniques, hybrid methods and machine learning processes will be developed during the thesis.	Data cleansing, data merging, feature extraction, preferably R (alternatively SPSS)	Thesis – including existing methodologies, similar studies, R Code, Results and Recommendations, Executive Summary Deck (4-5 Slides)	Alexandra Korda Alexandra.Korda@gr.ey.com
3	2	Ernst & Young Business Advisory Solutions S.A.	Alexandra Korda Alexandra.Korda@gr.ey.com	Feature extraction and classification: CV Analytics	Talent assessment is one of the most important, pedantic and painstaking procedures every company undergoes in order to strengthen its staff and skillset. Whenever a job opening is posted, a company receives hundreds of CVs, and the screening process can be long and time consuming. Is this individual a good candidate? Which department and which people are most appropriate to interview them? There are many cases where it is important to perform a first screening of the candidates and see how well they match a position, a fund, a scientific expert etc. In this context, this work will review how to organize and understand the textual data and present an unstructured text analytics approach for the qualitative evaluation of CV/resume documents. Text analytics (text mining) refers to deriving high-quality information from unstructured or semi-structured text documents. The work will examine effective approaches for extracting resume information from various document types and subsequently analyzing it, with the goal of making it easier for company staff to find suitable resumes. Resumes are a rich source of unstructured data that companies can analyze to shortlist the right candidate. Methods include resume parsing, dictionary-based segmentation, general keyword search and machine learning.	Segmentation and Clustering Techniques, Logistic Regression techniques, preferably R (alternatively SPSS)	Thesis – including existing methodologies, similar studies, R Code, Results and Recommendations, Executive Summary Deck (4-5 Slides)	Alexandra Korda Alexandra.Korda@gr.ey.com
4	3	Ernst & Young Business Advisory Solutions S.A.	Alexandra Korda Alexandra.Korda@gr.ey.com	Factor Analysis and Cluster Analysis: Segmentation of Physicians based on Surveys	Factor analysis is a method for explaining the structure of data by understanding the correlations between variables. It summarizes data into a few dimensions by condensing a large number of variables into a smaller set of latent variables, or factors. In particular, it allows one to determine how much of the variance in each observable variable is accounted for by the factors identified. It also shows how much of the variance across all the variables is accounted for by each factor. In this study, a pharmaceutical company wanted to improve the way it engages with its customers, who in this case are doctors. In addition to its existing client dataset (CRM), the company ran a survey and asked 400 doctors a number of questions in order to understand their behavior, their preferred channel of contact, their preferred content etc. The student will analyze the questionnaires by means of factor analysis and clustering techniques. The aim is to cluster the interviewees, identify the top questions and the ideal interviewer based on the customer needs.	Data cleansing & merging, Factor Analysis, Clustering Techniques, preferably R (alternatively SPSS)	Thesis – including existing methodologies, similar studies, R Code, Segmentation Results and Engagement model, Executive Summary Deck (4-5 Slides)	Petros Vamvakaris
5	4	Ernst & Young Business Advisory Solutions S.A.	Alexandra Korda Alexandra.Korda@gr.ey.com	Dynamical Systems: Social Network Analysis on Retail Bank Customers	Social Network Analysis is the process of mapping and measuring the links and relationships that exist between organizations and individuals engaged in networking or collaborative activities. This visual and mathematical analysis seeks to explore specific expertise or influences that exist in groups. It aids in the computation of indices that measure characteristics of the network, such as measures of centrality. It also seeks to determine the structural importance of a “node” (for example, a person) in a graph. In this context, the aim of this study is to analyze the network of companies and individuals who are customers of one retail bank. We will also see how the interactions between the customers, together with the characteristics of each company or individual, can yield valuable insights for the bank. Questions of interest include: which communities have the most intense interactions between their members, who the top influencers in the network are, who is ‘central’ in the network, which underlying patterns may help predict and proactively assist customers who are about to default, which key members provide cash flow, and how the members that are more dependent on cash flow perform over time. This information can help managers reallocate rights for effective group performance.	Dynamical Systems, Graph theory, preferably R (alternatively SPSS)	Thesis - including existing methodologies, similar studies, R Code, Results (Network KPIs), Executive Summary Deck (4-5 Slides)	Panos Papadopoulos
6	5	Ernst & Young Business Advisory Solutions S.A.	Alexandra Korda Alexandra.Korda@gr.ey.com	Linear vs. Non-linear modeling techniques: Application and Challenges in the Consumer Goods and Retail sectors	Commercial Analytics covers a range of econometric methods and solutions that are predominantly applied in the consumer goods and retail industry. The goal of commercial analytics is to understand what drives consumer behavior, identify the underlying patterns in shopping behavior and, above all, quantify the impact of marketing and pricing strategies on the revenue and volume sales of manufacturers and retailers. With calibrated models that describe the above, the business can simulate different strategies in order to predict and identify opportunities for further improvement. A vast range of methodologies is used to understand how consumer behavior affects product sales, the most common of which is regression analysis. The goal of this thesis is to investigate which functional form of the regression model (additive vs. multiplicative) is better positioned to predict product sales, what the pros and cons of each methodological approach are and what the modeler should keep in mind, and whether the type of the independent variable should define the functional form instead of using a predefined setup. During this thesis, the model selection criteria will not be only statistical; we will also evaluate which functional forms allow for more intuitive, digestible and actionable results and insights for the people who are running the business.	Data cleansing, data merging, Regression Analysis, preferably R (alternatively SPSS)	Thesis - including existing methodologies, similar studies, R Code, Results (Network KPIs), Executive Summary Deck (4-5 Slides)	Panagiotis Ypsilantis
7	6	Ernst & Young Business Advisory Solutions S.A.	Alexandra Korda Alexandra.Korda@gr.ey.com	Solution of a system of regressions with interaction between variables: Application and Challenges in the Consumer Goods and Retail sectors	The key advantage of predictive analytics is that it allows one to identify the true causal relationship between a marketing activity and its direct impact. It can identify the boundaries between correlation and causality and decompose the impact of various probable causes on the dependent variable. In the business world, this is very powerful because it can associate an investment with its direct impact and hence give marketers and business owners tools to evaluate the outcome of their expenditure. To be able to do so, the analyst needs to take into account all potential drivers in order not to underestimate or overestimate the return on investment of the activities. In the retail sector, accurate model selection needs to take into account all probable product interactions, e.g. how a price increase in one product has affected another. This involves the simultaneous calibration of many models with many common independent variables, e.g. in-store activity. During this thesis, we will analyze how to better calibrate such models with interacting variables in order to reach a statistically valid result that at the same time is in line with business expectations.	Data cleansing, data merging, Regression Analysis, Optimization techniques, preferably R (alternatively SPSS)	Thesis - including existing methodologies, similar studies, R Code, Results comparing different set ups, Executive Summary Deck (4-5 Slides)	Kostas Petsounis
8	7	Ernst & Young Business Advisory Solutions S.A.	Alexandra Korda Alexandra.Korda@gr.ey.com	Non-linear Optimization and saturation modeling analytics in evaluating multichannel marketing communication	Media Advertising is one of the most significant channels of product communication from manufacturers to end consumers. Every year companies spend millions to promote their products and increase their market share and profits; this year alone, $493 billion were spent on advertising. At the same time, the increased popularity of the internet and social media gives companies new opportunities, but also challenges, in using a broader range of channels. To evaluate media efficiency, planners consider a range of factors, including the required coverage and number of exposures in a target audience, the relative cost of the media advertising and the media environment. Analytics has given business analysts a vast range of econometric tools to estimate the impact of media advertising and develop more efficient media plans. However, despite the sophistication of some of the approaches, there is still great opportunity to enhance the existing models with capabilities that capture more realistically the impact of Media Advertising on consumers and thus on the sales of each product. In this thesis we will analyze the effect of consumer saturation at overexposure, short-term and long-term media return on investment, optimization techniques to improve media investment, and digital advertising.	Data cleansing, data merging, Regression Analysis, Optimization techniques, preferably R (alternatively SPSS)	Thesis - including existing methodologies, similar studies, R Code, Results comparing different set ups, Executive Summary Deck (4-5 Slides)	Kostas Petsounis
9	8	Ernst & Young Business Advisory Solutions S.A.	Alexandra Korda Alexandra.Korda@gr.ey.com	Path Analysis: Booking Funnels and Customer Journeys based on customers’ cookies on website	How can a company evaluate and improve the experience that potential customers have on its website? Navigation within the website can make the difference or discourage potential customers. This is especially true for large websites with many pages, which clients find hard to navigate. The journey of the customer through all brand touch points is vital for understanding client needs and matching customer expectations, especially as an individual can access the website in multiple ways, e.g. desktop, mobile, tablet. In this project we will work with data from an online travel agency. The data contain information about how potential customers navigate the company’s website, at cookie level and device level. The goal of this analysis is to understand which booking funnels are the most efficient, what the probability of booking is given certain navigation features, which paths are the least likely to end with a booking, what the seasonality of booking is etc. It will offer the opportunity to work with a large amount of data, understand how cleansing of real data can be done effectively, and challenge the student to identify smart and intuitive ways to visualize the results.	Data cleansing, data merging, Probability Theory, preferably R (alternatively SPSS)	Thesis including existing methodologies, similar studies, R Code, Results comparing different set ups, Executive Summary Deck (4-5 Slides)	Petros Vamvakaris
10	9	Ernst & Young Business Advisory Solutions S.A.	Alexandra Korda Alexandra.Korda@gr.ey.com	Data Visualization: storytelling from data to insights	For every data scientist, the goal of data analysis is to discover hidden patterns and underlying trends that lead to useful insights and actionable conclusions. These provide the relevant tools and support to the people who are responsible for making decisions. Statistical models and predictive analytics aside, communicating the results and insights to stakeholders and decision makers can be the most crucial part of the whole analysis. This is why business analysts need to be armed with business acumen that helps them convey the “message” appropriately. A crucial part of this process is the use of smart graphs and intuitive visualizations that help during client presentations and workshops. Creating and preparing such charts is a complicated process that requires a lot of creativity and mathematical intuition. It requires finding the right balance between the appropriate amount of information and charts on the one hand and short messages and dashboards on the other. In this project we will provide the student with various datasets of different kinds, along with predictive modeling outputs, and identify smart ways of communicating the results to a third party. Colleagues who are not familiar with the data and projects will serve as a test audience during short presentations.	Data cleansing, data merging, R, Spotfire, R Shiny, Excel, PowerPoint	R Shiny Code, PowerPoint Presentation on the different datasets	Panos Papadopoulos
11	10	Ernst & Young Business Advisory Solutions S.A.	Alexandra Korda Alexandra.Korda@gr.ey.com	Regression and Cluster Analysis on Shelf store optimization and Assortment analysis	When consumers pick a product to cover a specific need, they look for the most attractive one on the shelf. But which characteristics make a product attractive? Which key elements drive consumers to select a specific product and make them loyal to a specific brand? How will the volume be distributed in a category after a change in the category’s assortment? Assortment analysis decomposes a product into its key elements and identifies how much each contributes to the purchase decision. In this project the student will work on creating statistically valid models that capture the behavior of consumers standing in front of a shelf that offers a variety of options. We will use principal component analysis, feature selection techniques and clustering techniques to perform a first feature extraction, and then evaluate alternative regression functional forms to fit the choices of consumers. We will take into consideration novel attributes that have not previously been accounted for in this type of study. The analysis will be based on retail data.	Data cleansing, data merging, Regression Analysis, PCA and Dimensionality reduction, preferably R (alternatively SPSS)	Thesis including existing methodologies, similar studies, R Code, Results comparing different set ups, Executive Summary Deck (4-5 Slides)	Konstantinos Bloutsos
12	11	Predicta	Ioanna Koutrouvis ikoutrouvis@predicta.gr	Algorithm Development as a part of a Big Data Platform: development of an algorithm for the automated selection of the optimal time series model, with the ultimate goal of integrating the algorithm into an Enterprise Analytics Platform	Time series concern characteristics that change over time (e.g. sales, prices, stocks). The goal is to identify the time series and forecast future values based on historical data. Forecasts are now essential for the planning and development of every organization. The most common application areas for time series models are: Demand forecasting, Energy forecasting (Load/Consumption forecasting, Price of electricity forecasting), Revenue forecasting (Sales forecasting, Expenses forecasting), Loss forecasting, Financial Asset forecasting. Data from each area will be provided to the student for building time series models. It is considered necessary to automate the process of finding the most suitable model in one or more of the above application areas, which will reduce the cost and time of the analysis while increasing the validity and reliability of the results. A plethora of models for time series data have been proposed in the international literature. Among these, the most widespread are: Autoregressive Integrated Moving Average (ARIMA) with its variants (Fourier terms, Box-Cox transformations etc.), Exponential smoothing (ETS) with variants (Bagging with STL etc.), Time Series Regression, Neural Networks, Support Vector Regression, Kalman Filter, Vector Autoregressive (VAR), Generalized Autoregressive Conditional Heteroskedasticity (GARCH), etc. Algorithms implementing the above methods are available and easily accessible. However, there is no theory that determines the best forecasting method for each case. The implications of making the right choice are of the utmost importance, both from a theoretical standpoint and for practical reasons. In many cases, even small changes in forecast accuracy can bring multiple benefits (or losses) to the operation of large organizations.	Very good knowledge of open-source programming languages (preferably R & Python), knowledge of basic statistical concepts and time series forecasting models	This thesis will cover: 1) a review of the methods proposed in the literature for time series models, 2) application of the main methods to real data in one or more application areas, 3) development of an algorithm (construction of an iterative procedure) that examines each model and automatically selects the best one based on evaluation criteria (e.g. stationarity and minimization of forecast error)	???	1) De Gooijer, J. G., & Hyndman, R. J. (2006). 25 years of time series forecasting. International Journal of Forecasting, 22, 443–473, 2) Swanson, N. R., & White, H. (1995). A model selection approach to assessing the information in the term structure using linear models and artificial neural networks. Journal of Business and Economic Statistics, 13, 265–275, 3) Hyndman, R. J., Koehler, A. B., Snyder, R. D., & Grose, S. (2002). A state space framework for automatic forecasting using exponential smoothing methods. International Journal of Forecasting, 18(3), 439–454, 4) Qi, M., & Zhang, G. P. (2001). An investigation of model selection criteria for neural network time series forecasting. European Journal of Operational Research, 132, 666–680	An Enterprise Analytics Platform for data analysis and insights. Qubida is an end-to-end single platform that empowers the user to prepare, analyse and visualise data without recourse to other tools. Qubida manages Big Data, transforming it into actionable insights, which makes it a turnkey integrated solution. The platform offers interactive accessibility to big data, enabling its democratisation and offering agility and speed, so that businesses can gain insight and take appropriate action immediately. As a result, decision makers are given direct access to data and can deep dive with self-service analytics to discover new insights. It is positioned to work on streaming data from various sources: sensors, machine data, and other formats. Qubida is designed to work on any device and is built using responsive HTML5 frameworks to provide a seamless user experience across multiple devices. This also offers true mobility, allowing the user to make decisions while on the road. Being a web-based solution, Qubida is uniquely positioned to connect to APIs and web services, making it easy to analyze and report on data sources sitting in the cloud. The platform embraces Big Data technologies with its native Hadoop integration, making it easy to compile visualizations and dashboards from Big Data warehouses. Qubida can ingest any type of structured or unstructured data and provide instantaneous analytics by leveraging distributed computing and cache systems.
13	12	Predicta	Ioanna Koutrouvis ikoutrouvis@predicta.gr	Algorithm Development as a part of a Big Data Platform: Price & Promotion Optimization	Aim of the thesis: The objective of this master's thesis is to investigate scientific methodologies and techniques that make it possible to answer the following clearly, for specific sample data from a company operating in the Food Market: 1) Analysis, understanding and modeling of the effect of promotional activities, defined subjectively by the company, on the performance of company sales and gross profit, by isolating/separating the ineffective activities from the effective ones, and by evaluating the sales uplift and studying the effect on gross profit. 2) Mathematical optimization of the company's promotional activities through the understanding and analysis of various promotional scenarios that lead to the maximization of market share and gross profit. Business approach: The initial stage of the methodology is a full understanding of the business purpose and of the company data available for the thesis. Specifically, the product categories, and in particular the SKUs, on which the analysis will focus will be decided jointly with the company, taking into account the sales data for the selected SKUs. In addition, further data, presented below, will be taken into account so that complete information is available, thereby supporting a scientifically sounder solution of the business problems posed: 1) Advertising data: Advertisements for the selected SKUs in newspapers, magazines, television etc. are recorded as quantitative/qualitative data in the company's database and are available for analysis and study of their possible effect on final company sales. 2) Competition data: It is highly likely that SKU sales are affected by the various promotional activities of competing companies. These data will be taken into account and, specifically, will be studied in light of the following technical phenomena: 1) Cannibalization: this term refers to the decrease in sales of a product as a result of the introduction of a new product to the market, either by the same or by another company. 2) Halo Effect: the halo effect refers to the positive, incremental effect that the promotional activities of other products have on sales. 3) Pull forward: this term refers to the bulk, infrequent purchase of a product or products due to a particular promotional activity. Outcome and benefits: The analysis and corresponding modeling of the above data, based on the business and scientific methodology, will lead to a better understanding of how sales are realized and to the optimization of the corresponding revenue, thus increasing profits and market share for the companies that adopt this solution. In addition, it will fill a large gap in this business area, making this thesis innovative and original. Note: It is worth mentioning that the data used in this thesis will be real and therefore fully realistic from a business standpoint. The data will be provided by one of the largest companies in the country in the Food Market. This will lead to a more complete and sounder treatment of the research questions and will make the conclusions drawn highly realistic as well as objectively usable in business.	??	??
14	13	Tripsta	Costas Koukoumtzis ckoukoumtzis@gmail.com	Fraud Detection	Tripsta, an Online Travel Agency, sells on average 6000 bookings a day. Being an e-commerce merchant entails the risk of selling to fraudsters who take advantage of the Card Not Present business setting to buy using stolen cards or false identities. In order to fight these fraudulent activities and bookings, Tripsta has created an in-house fraud detection system that is used to calculate and assess the risk of each booking made through the website. This in-house application is a rule-based system that uses the booking attributes to determine a final fraud score for the booking. The rules are logically grouped by function and carry a specific score according to their importance for deciding the risk attached to a booking. Elements that have been identified as values used in fraudulent transactions are flagged so that they are identified immediately if they reappear in any new booking. In order to increase the validity of the risk attached to specific attributes, external parties are consulted that hold information covering the entire e-commerce network. The third parties provide the following: 1) EA (EmailAge): provides a fraud score attached to an email address (and an IP address), 2) TMX (Threatmetrix): provides a deviceID and the true IP, 3) PRS (Perseuss): provides information from airlines and OTAs about blacklistings of the passenger. Expected Outcome: Tripsta wants to increase its accuracy in detecting fraudulent bookings before they occur. A system able to train itself on past data in order to determine whether a booking is risky would be the next step towards classifying a booking as positive or negative and achieving the desired accuracy. The goal of the thesis is to provide such a system. Dataset: The dataset that will be provided is a 2-year dataset of 2,713,053 bookings. The attribute "category" indicates whether the booking was eventually found to be fraud. The possible values are: 1) Fraud: the booking was fraudulent and should not have gone through, 2) Friendly Fraud: the booking was OK but the customer was not happy with the service; the booking went through correctly, 3) NoFraud: the booking was OK; the booking went through correctly. The attribute "dispute" is complementary to "category": it indicates whether we received an acknowledgment from the bank that the legitimate card/PayPal user reported this booking as fraudulent. If it exists in the dataset, its value will be true. Available dataset description.	??	Tripsta wants to increase its accuracy in detecting fraudulent bookings before they occur. A system able to train itself on past data in order to determine whether a booking is risky would be the next step towards classifying a booking as positive or negative and achieving the desired accuracy. The goal of the thesis is to provide such a system. ???	Costas Koukoumtzis ckoukoumtzis@gmail.com
15	14	Yuboto	Apostolis Ioakeim aioakeim@yuboto.com	Chatbot Development	The assignment is divided into three parts: A. Examine and discuss: a) the ways in which the rapid development of Artificial Intelligence and Natural Language Processing has influenced/improved chatbot systems, b) the latest trends in the use of chatbot systems, c) the ways in which chatbots transform the user experience (UX) and the benefits of this process, d) successful case studies of companies that have developed popular, easy-to-use chatbot systems, e) the app aggregation capability that chatbots offer and/or the possibility of fully replacing mobile apps. Taking the above into account, we would like you to analyze ways, applications and prospects for using chatbot systems as an innovative mobile marketing tool in the emerging m-commerce market, which is now defined as Conversational Commerce. B. Examine and present the ways of creating and developing chatbot systems. Indicatively: a) Rule-based approach, b) Machine learning approach. List and analyze the existing categories of systems: a) Utility chatbots, b) Content-driven bots. Discuss the capabilities, advantages and disadvantages of open-source systems compared with commercial offerings. Indicatively: a) Artificial Intelligence as a Service (AIaaS), b) Chatbots as a Service (CaaS), c) Chatbot Platform Vendors. C. Using the information and knowledge gained from the two previous parts, develop your own chatbot system. The purpose of the system is to serve customers on technical support matters for the Yuboto-Telephony services. You can find all the necessary information, in terms of content, here: http://www.yuboto-telephony.gr/voip-support/telephony-wiki	??	??	Apostolis Ioakeim aioakeim@yuboto.com
16	15	Convert Group	Elena Chailazopoulou, echailazopoulou@convertgroup.com	Identification and matching of online store products	This thesis aims to develop and improve the existing algorithm of the eRetail Audit service, which tries to identify and classify the sales of online stores. The way a new product is matched against the existing database depends on multiple attributes, such as name, SKU and price. The purpose of further developing the algorithm is to classify better, and automatically, the products whose match we are confident is correct (based on rules to be derived, or on a score with a defined margin of error) and, for products where there is uncertainty, to offer suggestions that assist manual matching. Enriching the process with public data from the internet (e.g. via crawling), or with other data already in the database that relates to the products, is also desirable. The work will be evaluated both on the increase of the daily 'matching rate' (currently 75%) and on the improvement in assisting manual matching (for the remaining 25% of the data that is not matched automatically). Matching rate: the percentage of new products that are added automatically to the eRetail Audit database because they have gone through the matching process and an identical product has been found (confidence level > 0.9).	??	The expected outcome of this thesis is the delivery of an algorithm that improves on the stated objective, together with the reasoning that led the student to it. Any analysis and comparison of different techniques and methods for solving the problem is recommended to be part of the deliverable.	The supervisors of this thesis will be Πεχλιβάνης Κωνσταντίνος, Software Engineer, and Έλενα Χαϊλαζοπούλου, Head of eRetail Audit.		The thesis does not require physical presence, except once a week, when an alignment meeting with the student will take place. For the rest of the time, the student is welcome to be at our company's premises and work from there.
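The routing described in the Convert Group row (automatic matching above a confidence threshold, suggestions for manual matching otherwise) can be sketched as follows. This is only an illustration under stated assumptions: the function name `route_matches`, the input layout and the `top_k` parameter are hypothetical, and the confidence scores are assumed to come from some upstream scorer that is not shown. Only the 0.9 threshold is taken from the row itself.

```python
def route_matches(candidates, threshold=0.9, top_k=3):
    """Split new products into automatic matches and manual-review suggestions.

    `candidates` maps each new product id to a list of
    (catalog_id, confidence) pairs produced by some upstream scorer
    (hypothetical input layout, not the eRetail Audit format).
    """
    auto, review = {}, {}
    for product_id, scored in candidates.items():
        # Best candidate first.
        ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
        if ranked and ranked[0][1] > threshold:
            # Confident enough: match automatically to the best candidate.
            auto[product_id] = ranked[0][0]
        else:
            # Uncertain: keep the top suggestions to assist manual matching.
            review[product_id] = ranked[:top_k]
    # Matching rate: share of new products that were matched automatically.
    rate = len(auto) / len(candidates) if candidates else 0.0
    return auto, review, rate


if __name__ == "__main__":
    scored = {
        "p1": [("c7", 0.95), ("c2", 0.40)],
        "p2": [("c3", 0.80), ("c9", 0.85)],
    }
    print(route_matches(scored))
```

Keeping the ranked suggestions for the uncertain products is one simple way to support the manual step; the thesis may well arrive at a different rule or score.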
17	16	National Bank of Greece	biks@nbg.gr	Web & mobile user classification	Concerns collecting information during the user's interaction with the company's web sites and mobile applications, in order to classify users according to their skills and usage habits with electronic channels. The classification can be used in the future to help select users who will participate in beta programs, or in early adoption programs for applications or services. The thesis includes using the existing internet and mobile banking logs to try to estimate how familiar our customers are with the applications and with technology in general. A detailed record of user interactions with the site is currently available for the last two years. The thesis also includes: a) Identifying possible gaps in the data (information we do not already record) and documenting, in the form of a study, the interaction points that would give a complete picture, so that users can be classified, b) Describing an infrastructure suitable for collecting and processing the data. The infrastructure will support both the data we collect today and any gaps the study identifies, c) Partially implementing the collection process in at least two usage scenarios: mobile and web applications. The implementation does not necessarily involve changes to the existing applications; rather, it concerns producing reference implementations that will serve as a guide for integrating the solution into the company's applications. d) A subsequent phase (an independent or parallel thesis) will cover classifying the users based on the collected data and validating the classification using A/B testing or any other method deemed appropriate.	
Completing the thesis requires good knowledge of both web and mobile technologies and an in-depth understanding of the protocols used. Part of the work may involve the use of commercial or open source implementations, libraries and products that are proven to cover parts of the problem. Knowledge of, and research into, the technologies and solutions that exist and have been developed in this area is definitely required. Here too the company can help the researcher by putting them in contact with, and organizing detailed presentations of, commercial solutions and products related to the problem under study. Beyond the above, knowledge in the broader area of big data is also required: a) infrastructures and technologies, b) data usage and management tools, c) programming languages	For the parts of the thesis that require implementation, and depending on the extent of the implementation required, a partial-implementation approach can be agreed, so that deliverables are produced on a schedule that fits the time frame of the project.	??
18	17	National Bank of Greece	biks@nbg.gr	Estimation and prediction of technical debt in web & mobile applications	Concerns establishing a framework for the systematic estimation, monitoring and prediction of the technical debt introduced by new or existing web and mobile applications as they are created or evolve. Technical debt can be estimated from characteristics of the implementation, such as static code analysis, code smells, grime, modularity violations, etc. Technical debt is usually valued as a monetary amount, and it is reflected in dynamic data such as runtime logs and error logs, in software life-cycle data (effort expended, number of changes and modifications), in alerts from monitoring tools, and in data from incident reporting and management systems. Within the thesis: a) the characteristics that allow more accurate systematic monitoring and estimation of the technical debt of web and mobile applications should be identified, b) the dynamic data that most accurately capture the formation and evolution of technical debt should be determined, and c) the conditions, indications and practices observed during the early stages of the application life cycle (requirements analysis, design, implementation) that can lead to predicting the technical debt an implementation will introduce should be recognized. The data analysis should, or can, focus on examining: a) Implementation artifacts. This mainly concerns the source code of the implementation. In scenarios where other artifacts are involved, e.g. templates, configurations, etc., these could be included as well. Source code repositories that can be used for this purpose are already available today. c) Dynamic data collected at run time. 
These may include: runtime exceptions, iteration counters, object instantiation counters, and so on. No such logging is currently implemented for the company's applications, so no data is immediately available. However, within the thesis the data of interest for such an analysis can be described, so that the company can implement the appropriate infrastructure to allow the data to be collected and used in a subsequent phase of the project.	Completing the thesis requires good knowledge of software engineering (OO application design, code analysis, life-cycle process models, etc.). Part of the work may involve the use of commercial or open source implementations, libraries and products that are proven to cover parts of the problem. Beyond the above, knowledge in the broader area of big data is also required: a) infrastructures and technologies, b) data usage and management tools, c) programming languages	The thesis must conclude with the development of a methodology and a prototype that enables the early identification and mitigation of risk factors for the technical debt of web and mobile applications, starting from the first stages of the application life cycle. Existing code repositories of the company will be made available for the purposes of the thesis. In all phases of the work, the company can provide guidance as well as help with preparation and implementation. For the parts of the thesis that require implementation, and depending on the extent of the implementation required, a partial-implementation approach can be agreed, so that deliverables are produced on a schedule that fits the time frame of the project.
19	18	Eurobank	Konstantinos Tsiptsis ktsiptsis@gmail.com	Text Mining for effective Complaints classification	Textual data from customer complaints will be provided as the initial input for analysis. NLP on Greek text should be applied in order to extract Concepts as well as Conceptual Associations. The scope is to structure textual data into specific Banking Concepts & Conceptual Associations that will feed Unsupervised Learning in order to extract efficient Complaints’ Categories. The final deliverable will have to include the structuring and classification algorithm as well as a Greek library of the most frequent Banking Terms related to the extracted Concepts.	Any programming language or software can be used, such as Python, R, RapidMiner or KNIME, with Python preferred.		Γιώργος Γουδέλης, Head of Analytics Team & Βαγγέλης Κοντογεωργάκος, Senior Data Scientist.		In general, physical presence will be required at least 2 times per week, 8 hours per day.
20	19	Eurobank	Konstantinos Tsiptsis ktsiptsis@gmail.com	Customer Segmentation based on Purchasing behavior using Credit & Debit Cards	Transactional data for card purchases will be provided as the initial input. Data will have to be aggregated into an appropriate form to support grouping of purchases within specific time frames, as well as purchase preferences at MCC (Merchant Category Code) level. The objective is to identify different Clusters (Segments) of Customers with different purchasing behaviors and to clearly describe the clusters based on demographic and financial data that will also be provided.	Any programming language or software can be used, such as Python, R, RapidMiner or KNIME, with Python preferred.	The final deliverable will have to include the SQL used for data management as well as the clustering algorithm used for scoring the segments.	Γιώργος Γουδέλης, Head of Analytics Team & Βαγγέλης Κοντογεωργάκος, Senior Data Scientist.		In general, physical presence will be required at least 2 times per week, 8 hours per day.
21	20	IRI	Platia, Sofia Sofia.Platia@iriworldwide.com	Thesis 1: Time series analysis of product sales looking at long-term forecasts which could be used in strategic planning. Such decisions must take account of market opportunities, environmental factors and internal resources.	In general terms, the proposal aims to develop a forecasting system that involves several approaches to predicting product sales. Such forecasting systems require the development of R code, applying a range of forecasting methods (we could use exponential smoothing methods, Box-Jenkins ARIMA and seasonal ARIMA models, and a variety of other approaches, including dynamic regression models), selecting appropriate methods for each problem, and evaluating and refining forecasting methods over time. Steps: a) Problem definition, b) Gathering information, c) Preliminary (exploratory) analysis, d) Choosing and fitting models, e) Using and evaluating a forecasting model.	R, library(fpp)	??	??		Students would work at our premises 1-3 days per week, using their own laptops.
22	21	IRI	Platia, Sofia Sofia.Platia@iriworldwide.com	Advanced model selection processes	Our aim is to develop (using R) a method designed on the rationale of providing a fast and lean way of selecting the most adequate predictor variables. To do that, we could focus on branch-and-bound algorithms in order to create a subset of good regression models.	R	??	??
23	22	IRI	Platia, Sofia Sofia.Platia@iriworldwide.com	Sales pattern anomaly detection in retail market	A common need of a retailer or a manufacturer in the retail sector is to identify anomalies in the sales pattern of a product across time and across stores. The idea is to build a model that generates predictions of the expected sales, and then to define a proper criterion for inferring whether the deviation between actual and expected sales indicates an anomaly incident on a particular day. Anomaly detection can be quite useful in cases such as identifying that a product was out of stock in a retailer's store on particular days, since it would make it possible to identify such out-of-stock incidents and would help in acting proactively by taking a series of corrective actions. Existing methodologies, like the standard time series models, do not seem able to cope efficiently with such big-data problems. Thus, the goal is to formulate a methodology that takes as input a series of sales values across time, along with the available explanatory variables (such as price, promotion status, etc.), and returns as primary output a series of values of a binary indicator (1 = abnormal incident flag, 0 otherwise). Along with the anomaly detection, it would be of value to answer the following related questions: a) What levels of on-shelf availability exist in the key store-products under consideration? b) How are out-of-stock incidents trending across time? Are they getting better or worse? c) What is the economic business impact of product out-of-stocks? d) Where should a client focus corrective actions in order to realize the optimal incremental lift?	??	??	??
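The "deviation between actual and expected sales" criterion in the anomaly detection row can be illustrated with a small sketch. It assumes the expected sales have already been produced by some forecasting model (not shown), and it uses a robust z-score on the residuals as the decision rule. Both the function name `flag_anomalies` and the 3.5 threshold are illustrative assumptions rather than anything specified by IRI; only the 0/1 output format comes from the row.

```python
from statistics import median


def flag_anomalies(actual, expected, threshold=3.5):
    """Return one 0/1 flag per observation (1 = abnormal incident).

    `actual` and `expected` are equally long sequences of sales values;
    `expected` is assumed to come from any upstream forecasting model.
    """
    if len(actual) != len(expected):
        raise ValueError("actual and expected must have the same length")
    residuals = [a - e for a, e in zip(actual, expected)]
    center = median(residuals)
    # Median absolute deviation: a spread estimate that outliers barely move.
    mad = median(abs(r - center) for r in residuals)
    if mad == 0:
        # Degenerate case: more than half of the residuals are identical,
        # so any residual that differs from them is flagged.
        return [int(r != center) for r in residuals]
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the residuals are roughly normal.
    scores = [0.6745 * (r - center) / mad for r in residuals]
    return [int(abs(s) > threshold) for s in scores]


if __name__ == "__main__":
    sales = [100, 98, 103, 5, 101, 99, 102]
    forecast = [100] * 7
    print(flag_anomalies(sales, forecast))
```

A median-based score is used here because a single out-of-stock day would inflate an ordinary standard deviation and could mask itself; other criteria (prediction intervals, for example) would fit the same interface.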
24	23	IRI	Platia, Sofia Sofia.Platia@iriworldwide.com	Performance measures of clustering algorithms in retail industry	A typical challenge in the retail industry is the optimal product category management of a retailer’s shelf. Using the IRI proprietary algorithm that places each product in a multidimensional map, the scope of the thesis is to compare several clustering algorithms in order to produce product clusters that also have a business-wise interpretation. Since there is a plethora of performance measures, another task is to compare these measures and identify their pros and cons in this specific area of interest. The estimation of the optimal number of clusters also needs to be incorporated into this procedure.	??	??	??
25	24	Madinad	Konstantinos Livieratos kostas@intelligems.eu	Audience Clustering and Segmentation for Mobile Advertising Network	Madinad (www.madinad.com) is the leading mobile advertising network & platform in Greece. Our business aim is to offer high-quality services to our cooperating media shops, user-engaging ad formats in the best interest of our advertisers, and an overall smooth marketing offering to the end users. Our client SDKs are present on the vast majority of websites and serve millions of advertising impressions for several customers through the network, with a mobile apps extension coming soon as well. WHAT WE DO: As an advertising house serving many millions of requests, we collect a lot of data. Until now we have used all this data with minimal processing, so that we can decide what would be a good step forward for us. For example, from our data and the behaviour of the users towards our ad formats, we can suggest new ad formats that do not bother the end user, considering the peculiar environment of mobile devices. PROJECT TOPIC: As our traffic grows and the demand for more engaging, lower-cost and highly efficient ad formats becomes huge, we had to take a closer look at how we process our analytics data. In fact, we need to do a total makeover of our analytics processing pipeline. Until now it was based on a multi-server architecture, which has now become hard to scale. A good idea would be to separate the ad serving application service from the analytics collection one and migrate the latter to a high-throughput service like Kafka or AWS Kinesis. Although this might sound a bit too technical, there is a great opportunity to look deeper into our data, with the help of a data scientist. 
The concept in this case is to examine • the data we have, • the data we can collect, • how all this data relates to each other, • what kind of analytics we can get out of the metrics we keep track of, and ultimately to design a machine-learning prototype, using one of the available algorithms, in order to help with profiling 1. the gender of each user-device, 2. the age of each user-device, 3. the interests of each user-device, for better advertising targeting.	• experience writing Python code; Django knowledge is a plus • basic understanding of how advertising technology (adtech) works; we will help as much as needed, of course! • machine learning in Python • Pandas, NumPy, scikit-learn, TensorFlow or other similar libraries • knowledge of AWS tools would be a plus. The tasks of this role are briefly explained above in this document.	??	??
26	25	Hypertech Information Technology	Thomas Papapolyzos	Applying Machine Learning models & algorithms for prediction of Stock & Foreign exchange market movements & optimization of market investment strategies.	Our company has been contacted by a large, wealthy organization that wants to explore opportunities to trade some of its cash reserves in 4 asset classes, namely the FX pairs USD/EUR & USD/JPY, the S&P500 & WTI (West Texas oil). We want to develop a short- to mid-term (1-4 weeks) trading strategy based on the following assumptions: a) We have only the above 4 asset classes (FX pairs USD/EUR & USD/JPY, S&P500 & WTI (West Texas oil)), in which we can go long (buy) or short (sell), & the risk-free rate is zero (interest rates are zero, so there is no cost of money). b) We can base our investment strategies on backtesting daily market data from 2010 up to now. c) The trading performance metric is the return on capital invested & the following constraints must hold: 1) maximum monthly portfolio loss < 12%, 2) maximum weekly portfolio loss < 8%, 3) maximum daily portfolio loss < 3%. d) The proposed strategies* must have two elements: 1) Initial portfolio allocation of 1 million euros of capital. 2) Repositioning strategy based on trading signals generated by our model. In addition: 3) Transaction costs are fixed & given. 4) The strategies are self-financing (no other cash inflows/outflows during the implementation of the strategies). Data to be used: available on given web sites.	Experience in finance & trading, although desirable, is not a prerequisite. All the basic finance & trading knowledge needed to understand & handle the problem will be taught in 1-2 weeks. Programming in Python & manipulation of large datasets with NumPy & Pandas are prerequisites.	PowerPoint presentation (12 slides maximum) with strategies & results, ML models & algorithms used, implementation of the models in Python (we don’t care much about the presentation layer), preferably cloud-based.	??
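The loss constraints in item c) of the Hypertech row can be checked on a backtest with a few lines of code. The sketch below is one possible reading, not the firm's definition: it assumes a series of strictly positive daily portfolio values, and it treats "daily", "weekly" and "monthly" as rolling spans of 1, 5 and 21 trading days. The names `worst_loss`, `LIMITS` and `satisfies_loss_limits` are illustrative.

```python
def worst_loss(values, window):
    """Largest fractional drop over any span of `window` trading days.

    `values` is a sequence of strictly positive daily portfolio values.
    Returns 0.0 when the series is too short or never falls.
    """
    worst = 0.0
    for start in range(len(values) - window):
        change = values[start + window] / values[start] - 1.0
        worst = max(worst, -change)
    return worst


# Span in trading days -> maximum tolerated loss (assumed window lengths;
# the percentages are the ones listed in the topic description).
LIMITS = {1: 0.03, 5: 0.08, 21: 0.12}


def satisfies_loss_limits(values, limits=LIMITS):
    """True if every rolling loss stays strictly below its limit."""
    return all(worst_loss(values, span) < cap for span, cap in limits.items())


if __name__ == "__main__":
    equity_curve = [1_000_000, 990_000, 985_000, 991_000]
    print(satisfies_loss_limits(equity_curve))
```

Calendar-based windows (calendar weeks and months instead of rolling spans) would be an equally defensible interpretation and would only change how the spans are cut.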
27	26	ICAP	Kyriakidou Myrto MKyriakidou@icap.gr	Extracting company data from websites	The subject of the thesis is the creation of a focused crawler for Greek companies, combined with the extraction of data about them from their websites. Distributed crawling, wrapper construction and text understanding/NLP/text engineering techniques will be used to produce a structured database of company data (company details, branches, management personnel, customers, page traffic, last update, etc.). The goal is to extract data for a large share of Greek SMEs and for all Greek listed companies, for the automatic production of ICAP's business directory.	??	??	Γιάννης Γκαντάρας
28	27	DBLABS ΙΚΕ	Dinos Arkoumanis arkoumanis.dinos@gmail.com	Implementation and training of Deep Learning algorithms using TensorFlow and Apache Spark.	The thesis concerns the development of a cloud-based Big Data platform for video processing applications. A multi-node system will be implemented with Hadoop as storage and Spark as the data processing framework. H2O Deep Water and Google TensorFlow will be installed on top of Spark so that deep learning neural networks can be trained. Subsequently, Long Short-Term Memory recurrent neural networks will be trained on video metadata, and convolutional neural networks on video frames. The networks developed will be trained to recognize theft actions and for automatic stock replenishment applications in retail stores.	??	??	??
29	28	WINGS ICT Solutions \| Incelligent	Yiouli Kritikou kritikou@wings-ict-solutions.eu	Big data predictive analytics algorithms for the telecommunication and banking sectors		??	??	??
30	29	SAS Institute Greece	Stavros Stavrinoudakis Stavros.Stavrinoudakis@sas.com	Sentiment analysis lexicon building using SAS data analytics	Enterprise Guide: data profiling, query and reporting, statistics/advanced analytics. Text Mining: data mining and analytics on unstructured data (text, documents etc.). Optimization of part of the Greek SAS Sentiment Analysis lexicon. Application of the updated lexicon and of the SAS Enterprise Guide and SAS Text Mining tools for sentiment analytics on pilot data.
31	30	Taxibeat	Dimosthenis Kaponis, Lead Engineer dimosthenis@taxibeat.com	Recommendation algorithms at Taxibeat	At Taxibeat we serve hundreds of thousands of people daily, helping them move between their home, work, school, social activities and everything in between, safely and quickly. More often than not, using Taxibeat becomes a habit, and habits have a tendency to form patterns. It is those patterns that form the basis of our personalized recommendations, a feature that aims to save our customers' time, provide a better booking experience and stay ahead of the competition. Our existing recommendation system is heavily based on detecting cyclical patterns in both the temporal and the spatial domains, and it has served us very well in the short time it has been in production. But we are not standing still! This summer internship project involves researching, designing and implementing novel approaches aimed at both increasing the number of our customers who receive personalized recommendations and enhancing the recommendation algorithms to achieve better accuracy. The intern is expected to work with our Data Science and Big Data teams to build upon our existing recommendation system, initially by experimenting with, and combining, multiple state-of-the-art algorithms and approaches to recommender systems, and secondly by implementing the improved recommendation system using sound engineering methodology and testable, tested code, in Scala or Python, running on our Spark cluster. Data Science at Taxibeat strives for excellence in theoretical statistics, applied machine learning and software engineering alike. We offer young professionals the opportunity to work on actual problems and contribute to production systems. As such, competence in all three areas is a prerequisite.	Candidates for this project should have a strong engineering background, in addition to a good foundation in Statistics, Machine Learning and Algorithmic Complexity. 
They should be familiar with UNIX/Linux environments and be comfortable developing software in Python and/or Scala. No prior knowledge of Spark, Hadoop or other platforms is required, but is considered a bonus.
32	31	Luxembourg Institute of Science & Technology - LIST	Dimitrios Kampas dimitrios.kampas@list.lu	STUDENT INTERNSHIP FOR IMMERSION PROGRAM - INNOVATIVE SYSTEMS - REGULATORY TECHNOLOGIES	The “Business Analytics and Regulatory Compliance” Unit at the Luxembourg Institute of Science and Technology (LIST) is offering internships for Computer Science and Information Science students to gain hands-on experience in leading technologies such as open source scalable databases, ontologies and semantic technology. Related work could also include advanced text processing methodologies such as unstructured information management, text analytics and Natural Language Processing (NLP). This call is part of the Student Immersion Program, in which students from some of the top universities in the world come to Luxembourg for internship opportunities with LIST. You will have a chance to work on concrete, value-creating projects led by different specialists from our institution.	Education: Current student in Computer Science, Information Science (or related fields); formally enrolled in a University and about to begin the last year of their Bachelor's or Master's degree. Students engaged in related PhD programs with the aforementioned skill sets are also welcome to apply. 
Advanced analytics, statistical approaches with respect to text analytics and web/application development, strong interest in infrastructure and architecture for data analytics, interest in the application of Semantic Technologies in the legal domain, ranging from XML schemas to the representation of rules, and in building "documents-to-information" solutions by learning the deep and complex XML schemas that are used in European Union regulatory activities. Competencies: Specialization in Master's or PhD programs related to the subjects of the internship, having technical expertise or work/project experience in one or more of the following: a) Application development with NodeJS, Java, Python, HTML 5, XML, interactive databases and text analytics, b) Experience deploying and working with Big Data platforms like Hadoop, Spark, HBase, c) Demonstrable software programming capabilities to develop Proofs-of-Concept. Language: Fluent in English. Good level of French (optional)
33	32	Palo Services	Panagiotis Tsantilas pt@paloservices.com	Named entity recognition and sentiment extraction with Spark and Kafka	As part of the redesign of the palo.gr system, we are building a big data pipeline framed by reactive microservices based on the AKKA implementation of the actor model. The main development language will be Java, while the client applications will also use other technologies. The applications we develop concern the retrieval, processing and presentation of data from the web, social media and other sources. The services we offer concern: a) news (via mobile and web applications) (clustering etc.), b) brand monitoring (sentiment analysis, NER etc.). The needs to be covered relate both to applying machine learning and AI algorithms and to designing data architectures, as well as to services responsible for moving data through the data pipeline. Spark will be used as the computing engine, into which you will be asked to port existing algorithms. Kafka will be of major importance, as it will be used by almost all services as the messaging system. The main data store will be Elasticsearch. The core tasks are therefore of 3 kinds: a) configuration / programming on the Spark platform, b) programming services for communication to and from Kafka, c) designing the data architecture in Elasticsearch. The thesis consists of: a) NER & sentiment analysis on Spark, b) clustering on Spark (note: the algorithms already exist and are implemented in Java, with a small part in Python. Changes may be needed; the implementations ported to Spark will be in Java), c) implementation of AKKA (actor model) microservices for moving data from Kafka to Spark. Research and implementation of backpressure techniques, so that the producers of data/messages (Kafka/Spark) do not overwhelm the consumers/data store (Spark/Elasticsearch).
34	33	Palo Services	Panagiotis Tsantilas pt@paloservices.com	Big information retrieval	As part of the redesign of the palo.gr system, we are building a big data pipeline framed by reactive microservices based on the AKKA implementation of the actor model. The main development language will be Java, while the client applications will also use other technologies. The applications we develop concern the retrieval, processing and presentation of data from the web, social media and other sources. The services we offer concern: a) news (via mobile and web applications) (clustering etc.), b) brand monitoring (sentiment analysis, NER etc.). The needs to be covered relate both to applying machine learning and AI algorithms and to designing data architectures, as well as to services responsible for moving data through the data pipeline. Spark will be used as the computing engine, into which you will be asked to port existing algorithms. Kafka will be of major importance, as it will be used by almost all services as the messaging system. The main data store will be Elasticsearch. The core tasks are therefore of 3 kinds: a) configuration / programming on the Spark platform, b) programming services for communication to and from Kafka, c) designing the data architecture in Elasticsearch. The thesis consists of the following: a) research and implementation of an Elasticsearch schema for palo.gr (news and social media aggregator), b) implementation of AKKA (actor model) microservices for moving data from Spark to Elasticsearch. Research and implementation of backpressure techniques, so that the producers of data/messages (Kafka/Spark) do not overwhelm the consumers/data store (Spark/Elasticsearch).
35	34	giaola.gr	Thanos Papadimitriou	Growth engineering theory and practice	A growth engineer is a critical component of giaola’s business and is tasked with overseeing the entire customer lifecycle, including acquisition, retention, loyalty and training activities. To support the quick growth of our customer base and ensure its continuous evolution and success, this is what you are going to study and do: analyze data to identify trends and trigger opportunities to drive loyalty and give value to customers. The data analysis should result in actionable experiments to be performed in order to tackle the blind spots and drive more revenue. By data we actually mean various data sources: ● Google Analytics for pageviews and users ● Backend app analytics to analyse the transactions performed ● Behavioural data as collected from appropriate tools ● Competition data as collected from various sources ● Data collected as customer feedback. By actionable experiments we mean a list of: ● New scenarios to test ● Changes in user flows ● More analytics options to be configured
36	35	Cognity	Yannis Stavroulas	Distinguishing Physical Persons from Legal Entities in a Large Database of Clients	Client databases often contain a mixture of entries pertaining to businesses and to actual people. For example, the account details of a particular business entry may make reference to a legal representative of the business, but the account is essentially a business account. Sometimes the distinction between the two is not made explicitly in the schema, and companies need to revisit entries later in order to make this necessary distinction. When the database is very large, manual identification becomes too tedious, so a programmatic solution is needed. Thus, the goal of this study is to build a system that can tell whether a particular database entity refers to a physical person or a legal entity. To achieve this, the study will have to look at the tools and approaches relevant to handling a database of several million entities. Big data technologies that employ a parallel processing approach would be a good fit for this use case, so this study will make use of the Hadoop ecosystem. At the same time, the study will look at building an appropriate classification model that takes into account, among other potential features, text-based features and patterns for the data in question.
37	36	Workable	Thanos Papaoikonomou	Face recognition	The thesis will investigate using open source tools and public web services to create a very high accuracy, low latency service for performing face recognition on portrait photos. Proposed systems/libraries to compare are openfaces and Amazon Rekognition. The whole lifecycle of images, from retrieval from operational systems, to preprocessing, to performing the task and delivering results to the appropriate consumers, needs to be implemented and validated experimentally.			Dr Thanos Papaoikonomou
38	37	Workable	Thanos Papaoikonomou	TensorFlow implementation of recommender	Workable is building recommender systems to help recruiters and firms find the most suitable candidates for each position with less human effort. So far, systems have been implemented using scikit-learn and PyTorch. The topic of this thesis is to reimplement the existing algorithms in TensorFlow, optimize performance, verify experimentally that the system performs at least as well as the existing implementation, and extend the technique with new features.			Dr Thanos Papaoikonomou
39	38	Workable	Ioannis Klasinas	Section identification in structured text	The goal of the thesis is to use techniques from natural language processing and image recognition to identify sections in CVs. The final deliverable of the thesis will be a report outlining the approach chosen and the experimental results, as well as a microservice that takes a CV in PDF as input and outputs (a) the start and end line numbers of each section and (b) an annotated CV marking these start and end points.			Ioannis Klasinas
40	39	Workable	Ioannis Klasinas	ML-based Data fusion	The thesis will explore machine learning techniques for fusing data coming from a variety of paid and free data sources. The goal is to create a single record of information from multiple records all known to describe the same entity (with the same or different key-value pairs). Different sources may contribute conflicting information about certain attributes of the record, so fusion needs to decide which values are true. The thesis includes a significant component of bibliographic research, a proposal for an algorithm suited to fusing structured CV-like data coming from online services, a prototype implementation and a basic evaluation.			Ioannis Klasinas
41	40	Cognitiv+	Αχιλλέας Μίχος	Data Augmentation for Legal Text Analytics	Machine learning projects often face the problem of gathering labelled training data. For many areas of ML application, such as image recognition, there exist techniques that augment a data set by applying transformations to existing data. Such techniques can significantly improve the performance of the trained models. In NLP, data augmentation is not well developed, even though we believe its potential is similar to that in image recognition. This project will research ways to augment annotated training data using transformations of both the context and the content of the annotation. The most promising of the envisioned techniques will be tested against already trained neural networks. They will also be tested on new methods and models that were not possible before due to lack of sufficient data.