Hellenic Mediterranean University
Department of Informatics Engineering
Bachelor Thesis
Analyzing Large-Scale Data with Neo4j in Healthcare
Emmanouil Nitis
Supervisor: Demosthenes Akoumianakis
Heraklion, October 2023
Acknowledgements
I would like to express my sincere appreciation to my supervisor, Dr. Demosthenes Akoumianakis, for his invaluable guidance and support throughout this thesis. Dr. Akoumianakis' insightful feedback, thoughtful discussions, and unwavering encouragement have been instrumental in shaping the direction and outcome of this research. His expertise, dedication, and mentorship have provided me with a solid foundation on which to build my academic and research pursuits. I am deeply grateful for the time and effort he has generously invested in my growth as a scholar. Thank you, Dr. Akoumianakis, for your indispensable contribution to this thesis.
Summary
This thesis explores the intersection of Medical Informatics and emerging data management technologies based on the graph model, focusing on the use of Neo4j for the analysis of healthcare data. With the aim of improving the analysis and exploration of patient data, the thesis undertakes a comprehensive investigation that includes data collection and preprocessing, implementation of the database with Neo4j, and exploration of advanced querying techniques. The introductory part establishes the purpose of the research, outlines key objectives, and defines the scope and limitations of the study. The theoretical background delves into Medical Informatics, emphasizing healthcare analytics and the importance of NoSQL databases, particularly graph databases such as Neo4j. The thesis workflow guides readers through the research trajectory. Next, the data collection and preprocessing phase explains in detail the data source, the collection methodologies, and the preprocessing tools used to ensure data accuracy and consistency. The graph data model and database implementation sections describe in detail the conceptual and practical construction of a Neo4j graph database, including creation, configuration, data loading, and organization of the data as a graph, as well as the handling of path queries using the Cypher language. The added value of the thesis lies in the data querying, analysis, and exploration section, which comprises fundamental data retrieval techniques that support an in-depth patient similarity analysis. For this purpose, Jaccard similarity metrics based on diseases, medicines, and measurements are used, providing valuable insights into patient relationships and medical patterns.
In conclusion, the thesis concisely summarizes the research results, examines their implications, and presents possible avenues for future exploration in the field of Medical Informatics. By harnessing the power of graph database technology, this study introduces innovative perspectives on the analysis of medical data, underscoring the potential for revolutionary strides in healthcare analytics.
Abstract
This thesis explores the intersection of Medical Informatics and graph database technology, focusing on the application of Neo4j in healthcare analytics. With the objective of enhancing patient data analysis and exploration, the thesis undertakes a comprehensive investigation encompassing data collection, preprocessing, database implementation, and advanced querying techniques. The introductory segment establishes the research's purpose, outlines key objectives, and defines the scope and limitations of the study. The theoretical background delves into Medical Informatics, highlighting healthcare analytics and the significance of NoSQL databases, particularly graph databases like Neo4j. The thesis workflow guides readers through the research trajectory. Subsequently, the data collection and preprocessing phase meticulously expounds on the data source, collection methodologies, and preprocessing tools to ensure data accuracy and consistency. The graph data model and database implementation sections detail the conceptual and practical construction of a Neo4j graph database, including database creation, configuration, data loading, and graph construction using Cypher queries. The crux of the thesis resides in the data querying, analysis, and exploration section, encompassing fundamental data retrieval techniques, an in-depth patient similarity analysis, and patient influence analysis. The patient similarity analysis employs Jaccard similarity metrics based on diseases, medicines, and measurements, providing valuable insights into patient relationships and medical patterns. In conclusion, the thesis succinctly summarizes research outcomes, discusses their implications, and presents potential avenues for future exploration within the realm of Medical Informatics. By harnessing the power of graph database technology, this study introduces innovative perspectives on medical data analysis, underscoring the potential for revolutionary strides in healthcare analytics.
Contents
1.2. Scope and Limitations of Thesis
2.1. Medical Informatics, healthcare analytics and open data sets
2.2. NoSQL Databases and the Property Graph data model
2.3. Neo4j, Cypher and Graph Algorithms
2.3.2. Cypher Principles and Capabilities
2.3.5. Neo4j Graph Algorithms
2.3.6. The Concepts of Similarity and Centrality
3. Data Collection and Preprocessing
3.2. Data Collection Process
3.4. Data Preprocessing Tools and Techniques
4. Graph Data Model and Database Implementation
4.1. Database Creation and Configuration
4.2. Data Loading and Graph Creation with Cypher
5. Querying, Analyzing and Exploring the Dataset
5.2. Patient Similarity Analysis
5.2.1. Creating Custom Jaccard Similarity Algorithms
5.2.2. Treatment Recommendation for a New Hospitalized Patients
5.3. Patient and Disease Influence Analysis
5.3.1. Creating PageRank Algorithms
5.3.2. Recommending Influential Caregivers for New Patients
List of Tables
Table 1. Part of the patients CSV file.
Table 2. Part of the admissions CSV file.
Table 3. Part of icustays CSV file.
Table 4. Part of d_items CSV file.
Table 5. Part of inputevents_mv CSV file.
Table 6. Part of inputevents_cv CSV file.
Table 7. Part of outputs CSV file.
Table 8. Part of d_icd_diagnoses CSV file.
Table 9. Part of diagnoses_icd CSV file.
List of Figures
Figure 1. An example of a property graph database.
Figure 2. Workflow of the thesis.
Figure 3. The Graph Data Model for MIMIC III.
Figure 4. Data Loading and Graph Creation.
Figure 5. Query 1 of the section 5.1.
Figure 6. Output for query 1 of the section 5.1.
Figure 7. Query 2 of the section 5.1.
Figure 8. Part of output for query 2 of the section 5.1.
Figure 9. Query 3 of the section 5.1.
Figure 10. Output for query 3 of the section 5.1.
Figure 11. Query 4 of the section 5.1.
Figure 12. Output for query 4 of the section 5.1.
Figure 13. Query 5 of the section 5.1.
Figure 14. Output for query 5 of the section 5.1.
Figure 15. Jaccard similarity algorithm for diseases.
Figure 16. Part of the result for the Jaccard similarity algorithm for diseases.
Figure 17. Jaccard similarity algorithm for medicines.
Figure 18. Part of the result for the Jaccard similarity algorithm for medicines.
Figure 19. Jaccard similarity algorithm for measurements.
Figure 20. Part of the result for the Jaccard similarity algorithm for measurements.
Figure 21. Jaccard similarity algorithm for diseases and medicines.
Figure 22. Part of the result for the Jaccard similarity algorithm for diseases and medicines.
Figure 23. Jaccard similarity algorithm for diseases and measurements.
Figure 24. Part of the result for the Jaccard similarity algorithm for diseases and measurements.
Figure 25. Jaccard similarity algorithm for diseases, medicines, and measurements.
Figure 28. John's recommended medicines and measurements.
Figure 29. PageRank patient influence algorithm.
Figure 30. Part of output for PageRank patient influence algorithm.
Figure 31. PageRank caregiver influence algorithm.
Figure 32. Part of output for PageRank caregiver influence algorithm.
Acronyms
SQL Structured Query Language.
NoSQL Not only SQL.
RDB Relational Database.
API Application Programming Interface.
Healthcare data plays a crucial role in improving patient outcomes, enhancing medical research, and optimizing healthcare processes. With the exponential growth in healthcare data, it has become essential to leverage advanced data management and analytics techniques to extract valuable insights from this wealth of information. This thesis delves into the intersection of medical informatics and NoSQL databases, specifically native graph data stores, by retooling an open healthcare database using Neo4j – a leading graph database technology. The primary objectives of this thesis are to explore the application of graph databases in the context of healthcare analytics and to develop advanced data querying and analysis techniques for exploring open healthcare data. By utilizing Neo4j's capabilities, we aim to construct an efficient and robust healthcare graph database that can provide powerful insights into patient records, disease patterns, medication histories, and medical measurements. Moreover, we seek to evaluate the performance and effectiveness of custom algorithms designed for patient similarity analysis, enabling the identification of patients with similar medical conditions and treatment histories. This thesis is structured to systematically address critical aspects of healthcare data management and analysis using graph databases, to contribute to advancements in healthcare data analysis and patient care.
The objectives of this thesis are twofold. Firstly, we aim to explore the application of Neo4j, a leading graph database management system, in the domain of medical informatics and healthcare analytics. By leveraging the power of graph databases, we seek to uncover valuable insights from complex healthcare data and discover previously hidden relationships between medical entities, such as patients, diseases, medications, and measurements. Secondly, we aim to demonstrate the practical application and added value of various data querying, analysis, and exploration techniques using Neo4j's Cypher query language and graph algorithms. Through a series of use cases and experiments, we intend to showcase the efficiency and effectiveness of Neo4j in handling diverse medical datasets and extracting meaningful patterns, ultimately contributing to more informed decision-making and improved patient care.
The scope of the thesis can be briefly summarized as follows:
In terms of thesis limitations, it is fair to admit the following:
Overall, the thesis set out to present a thorough exploration of Neo4j's capabilities in medical informatics and healthcare analytics, specifically focusing on patient similarity analysis, thereby demonstrating the practicality of graph databases in healthcare data management and analysis. Future work could follow several paths to address some of the acknowledged limitations in terms of scope and dataset used, to provide a more comprehensive understanding of the thesis's applicability and potential implications.
In the ever-evolving landscape of medical informatics, the fusion of technological innovation and medical practice has paved the way for transformative advancements. This section sets the stage for a comprehensive exploration of the theoretical underpinnings that shape the realm of modern medical informatics. Encompassing the domains of medical informatics, the paradigm shift introduced by NoSQL databases, the intricate world of graph databases exemplified by Neo4j, and the potent capabilities of graph algorithms, this section serves as a foundational framework upon which our journey of patient similarity analysis and graph-based insights will unfold.
Medical informatics is an area of study and application focused on enhancing the management of patient data, clinical knowledge, population data, and other relevant information about patient care and community health. It is a relatively young scientific discipline that emerged following the advent of digital computers in the 1940s. The utilization of mechanical computing in medicine dates back even further, with Herman Hollerith's development of the "punched-card data processing system" during the 19th century, initially employed for the US census and later adapted to support public health surveys and epidemiology. This historical example underscores the interdisciplinary nature of medical informatics, which intersects with various fields such as clinical sciences, public health sciences (including epidemiology and health services research), as well as cognitive, computing, and information sciences [1] [2] [9].
A particular stream of research in medical informatics concentrates on healthcare analytics. This is a developing field that focuses on the application of computer-based data analysis techniques to facilitate decision-making in both clinical and non-clinical settings. By leveraging electronic healthcare applications, valuable insights can be derived from data, enhancing service quality while minimizing costs. The adoption of systematic solutions has been widespread, driven by the abundance of internal and external data, diverse sources of medical information, and the need for comprehensive reporting. Robust analytic systems are increasingly employed to manage electronic health records, provide clinical decision support, and effectively handle personal or hospital data. These systems play a vital role in supporting managerial decision-making for clinical care and optimizing hospital operations, all while generating evidence-based insights within specific healthcare contexts [3] [4].
Open healthcare data encompasses various healthcare-related datasets and information that are intentionally made available to the public, researchers, and other stakeholders without extensive restrictions. This category of data promotes transparency, collaboration, and innovation within the healthcare sector. It includes de-identified patient records, public health statistics, medical research datasets, and other healthcare-related information that is shared with open-access licenses. However, it's essential to ensure that privacy and security measures are in place to protect individuals' sensitive health information when releasing and using open healthcare data [16].
This thesis concentrates on the MIMIC-III dataset, a large, freely available dataset comprising de-identified health-related data associated with over 40,000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. The MIMIC-III Clinical Database is available on PhysioNet (doi: 10.13026/C2XW26). Though de-identified, MIMIC-III contains detailed information regarding the care of real patients, and as such requires credentialing before access. To allow researchers to ascertain whether the database is suitable for their work, there is a demo subset, which contains information for 100 patients also present in the MIMIC-III Clinical Database. The demo dataset contains all intensive care unit (ICU) stays for these 100 patients. The patients were selected randomly from the subset of patients in the dataset who eventually die. Consequently, all patients have a date of death (DOD). However, patients do not necessarily die during an individual hospital admission or ICU stay.
Healthcare datasets, particularly open healthcare datasets, are incredibly valuable for further analysis and research. Firstly, they provide a wealth of information on patient demographics, medical conditions, treatments, and outcomes, making them indispensable for clinical research, epidemiological studies, and healthcare policy analysis. Secondly, these datasets enable researchers to identify trends, correlations, and patterns that can lead to new insights into disease management, preventive care, and treatment effectiveness. Additionally, open healthcare datasets support the development and testing of innovative healthcare technologies, such as predictive analytics, machine learning models, and telemedicine applications. By fostering collaboration and transparency, these datasets empower researchers, healthcare professionals, and data scientists to collectively address healthcare challenges, enhance patient care, and drive improvements in the healthcare industry [16].
NoSQL databases are typically characterized as non-relational, distributed, open-source, and horizontally scalable, which are their main advantages over traditional SQL databases. The original goal of NoSQL development was to create modern web-scale databases. The movement began in early 2009 and has been growing quickly. NoSQL databases frequently offer additional features such as schema-free design, simple APIs, easy replication support, eventual consistency under the BASE model (basically available, soft state, eventual consistency), support for very large data volumes, and more. Furthermore, the somewhat misleading term "NoSQL" is better read as "Not Only SQL": if an RDB is suitable for the task, use it; if it is not, use an alternative [5] [6] [8]. According to the official NoSQL databases website (available from link), there are 15 kinds of NoSQL databases based on diverse data models, such as column-oriented databases, document databases, key-value stores, graph databases, and so on. This section focuses on property graph databases, one of the four main categories of NoSQL databases. It covers the fundamental concepts of property graph databases, their data model and characteristics, and the features that make them suitable for processing and analyzing specific types of data.
The Property graph model is rooted in the principles of graph theory and advocates a fundamental shift in the way data are organized and stored for processing. In general, a graph is a mathematical abstraction used to represent a collection of objects, referred to as vertices or nodes, along with the links that connect them, called edges or relationships. The Property graph is a particular type of graph in which both vertices and edges can be qualified by properties. It forms a data modelling pattern that diverges from the models employed by key-value, column-family, and document stores, offering an approach that enables efficient storage of relationships between diverse data nodes. In property graph databases, nodes and relationships possess individual properties, typically structured as key-value pairs. These databases specialize in managing intricately interconnected data and, as a result, are notably efficient at traversing relationships among distinct entities. They are suited to numerous applications, including social networking platforms, pattern recognition, dependency analysis, recommendation systems, and the pathfinding problems encountered in navigation systems [5] [7]. An example of a property graph database is shown in Figure 1.
Figure 1. An example of a property graph database.
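To make the property graph model more concrete, the following minimal sketch represents nodes and relationships as plain Python objects, each carrying its own key-value properties. It is an illustration only and not how Neo4j stores data internally; the class names, the labels (Patient, Disease), the relationship type HAS_DISEASE, and the property values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A vertex with a label and key-value properties."""
    node_id: int
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed, typed edge that can carry its own properties."""
    start: int      # node_id of the source node
    end: int        # node_id of the target node
    rel_type: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    def __init__(self):
        self.nodes = {}
        self.relationships = []

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_relationship(self, rel):
        # Both endpoints must already exist in the graph.
        if rel.start not in self.nodes or rel.end not in self.nodes:
            raise KeyError("both endpoints must be added before the relationship")
        self.relationships.append(rel)

    def outgoing(self, node_id):
        return [r for r in self.relationships if r.start == node_id]

    def incoming(self, node_id):
        return [r for r in self.relationships if r.end == node_id]


# Hypothetical example data, not taken from the thesis dataset.
graph = PropertyGraph()
graph.add_node(Node(1, "Patient", {"gender": "F"}))
graph.add_node(Node(2, "Disease", {"name": "Sepsis"}))
graph.add_relationship(Relationship(1, 2, "HAS_DISEASE", {"seq_num": 1}))

for rel in graph.outgoing(1):
    print(rel.rel_type, graph.nodes[rel.end].properties["name"], rel.properties)
```

The point of the sketch is that a relationship is a first-class record with its own properties, so following a connection from a node is a direct lookup rather than a join across tables.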
Neo4j is a prominent open-source property graph database that offers commercial support. It adheres to the ACID (Atomicity, Consistency, Isolation, Durability) properties and employs a property graph model to store data. As a NoSQL graph database, it utilizes graph structures to enable semantic queries, incorporating nodes, edges, and properties. The graph consists of interconnected nodes and relationships, with both entities possessing named values known as properties. Relationships establish connections between nodes, allowing for the retrieval of related data. These relationships can be incoming or outgoing for a node, facilitating the exploration of connections between two nodes. Properties are represented as key-value pairs, with the key being a string. Property values can be of primitive types or an array of a single primitive type. A path represents one or more nodes connected by relationships, often obtained as the result of a query or traversal operation [10].
Neo4j is particularly valuable where queries depend on the relationships between data items. Because the database stores relationships natively and gives immediate access to them, it yields precise and direct responses to such queries. Moreover, Neo4j is flexible and scalable, allowing new nodes and relationships to be added as the system grows and requirements evolve. Additionally, Neo4j provides Cypher (available from link), a user-friendly path-querying language that handles complex queries involving nodes and their relationships [12].
Lastly, Neo4j offers a comprehensive suite of graph algorithms, encompassing functionalities such as pathfinding, similarity analysis, and more. These algorithms serve to analyze graph data and extract valuable insights from it, facilitating in-depth exploration and understanding of the complex interconnected relationships within the dataset.
Path querying is a foundational concept within graph theory and databases, serving as a fundamental mechanism for information retrieval and traversal operations in graph-based data structures. Specifically relevant in the context of graph databases, path querying entails the precise specification of a path or sequence of steps that traverse through a graph, defined by nodes and edges, with the primary objective of extracting targeted information or ascertaining relationships between nodes [15].
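As a rough illustration of what specifying a path as a sequence of steps means, the sketch below starts from one node and follows a given sequence of relationship types over a small edge list. It is a simplified stand-in for a real path query engine; the function name `match_paths`, the node names, and the relationship types are hypothetical.

```python
def match_paths(edges, start, rel_types):
    """Return every path from `start` that follows `rel_types` in order.

    `edges` is a list of (source, rel_type, target) triples.
    Each path is returned as a list of node names.
    """
    paths = [[start]]
    for rel_type in rel_types:
        extended = []
        for path in paths:
            for source, edge_type, target in edges:
                # Extend the path only along edges of the requested type.
                if source == path[-1] and edge_type == rel_type:
                    extended.append(path + [target])
        paths = extended
    return paths


# Hypothetical toy graph: names and relationship types are illustrative.
edges = [
    ("patient_1", "ADMITTED", "admission_10"),
    ("patient_1", "ADMITTED", "admission_11"),
    ("admission_10", "DIAGNOSED_WITH", "sepsis"),
    ("admission_11", "DIAGNOSED_WITH", "pneumonia"),
    ("patient_2", "ADMITTED", "admission_20"),
]

for path in match_paths(edges, "patient_1", ["ADMITTED", "DIAGNOSED_WITH"]):
    print(" -> ".join(path))
```

A path that cannot complete every step is discarded, which is why `patient_2`, whose admission has no diagnosis edge, produces no result for the same two-step query.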
The key constituents of a path query encompass the following [15]:
Cypher, a path query language for graph databases, offers several features that facilitate working with graph patterns. It employs pattern matching to express graph patterns, allowing users to specify the relationships and connections they are interested in. Moreover, Cypher incorporates labels, enabling the categorization of nodes and relationships based on specific characteristics or attributes.
In addition to pattern matching and labeling, Cypher provides a wide range of operations to manipulate and analyze graph data. Users can apply filtering, sorting, and aggregations to refine their queries and obtain the desired insights. Furthermore, Cypher supports graph algorithms, allowing for the execution of specialized computations on graph structures.
To ensure efficient query processing, Cypher takes advantage of indexing, caching, and other performance optimizations. These techniques contribute to the scalability and effectiveness of the queries, enabling efficient data retrieval and processing. Overall, Cypher offers a comprehensive set of tools and optimizations that facilitate working with graph databases in a scalable and efficient manner [11].
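The filtering, aggregation, and sorting operations described above can be mimicked on plain Python data to show what each step contributes. The rows below stand in for the result of a pattern match; the Cypher query shown in the comment uses assumed labels and property names and is not taken from the thesis.

```python
from collections import Counter

# Hypothetical (patient, gender, disease) rows standing in for matched patterns.
matches = [
    ("p1", "F", "sepsis"),
    ("p2", "M", "sepsis"),
    ("p3", "F", "pneumonia"),
    ("p4", "F", "sepsis"),
]

# Roughly what this Cypher query would express (labels are assumptions):
#   MATCH (p:Patient)-[:HAS_DISEASE]->(d:Disease)
#   WHERE p.gender = 'F'
#   RETURN d.name AS disease, count(p) AS patients
#   ORDER BY patients DESC
filtered = [row for row in matches if row[1] == "F"]                  # WHERE
counts = Counter(disease for _, _, disease in filtered)               # count(p)
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)   # ORDER BY

print(ranked)
```

In Cypher the three steps are declared in a single statement and the database decides how to execute them, whereas here each step is spelled out by hand.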
Interacting with a graph database necessitates the utilization of the following commands for querying purposes [14]:
When it comes to modifying the graph, the following commands are involved in the process:
Some of the most common and prominent use cases of Neo4j include [11]:
The Graph Algorithms (available from link) of the Neo4j open-source developer community, provided by the Graph Data Science Library (available from link), are used to compute metrics for graphs, nodes, or relationships. With a core set of validated and supported algorithms, they reveal hidden patterns and structures in connected data around community detection, centrality, and pathfinding. Many graph algorithms are iterative procedures in which the graph is traversed repeatedly using random walks, breadth-first or depth-first searches, or pattern matching. The algorithms provide useful insights into relevant elements in the graph (centralities, ranking) or inherent structures such as communities (community identification, graph partitioning, clustering) [12].
The Neo4j graph algorithms are categorized into eight categories, which are defined as follows:
The Neo4j Graph Data Science Library offers a comprehensive collection of algorithms categorized into these different problem classes, providing users with a wide range of tools to analyze and extract valuable insights from graph data.
Our thesis emphasizes two categories of algorithms from this library: centrality and similarity. Centrality algorithms enable us to pinpoint the most influential nodes or entities within the healthcare graph, aiding in tasks such as identifying critical patients or diseases. Similarity algorithms, in turn, allow us to uncover hidden patient patterns and connections, providing personalized healthcare recommendations based on shared characteristics and medical histories.
Similarity is used to determine how closely a node's characteristics and neighborhood match those of another node. Various algorithms are available [18] for calculating similarity scores, such as Cosine (COS), Pearson's Correlation (COR), and the Jaccard similarity index. In this thesis we are particularly interested in the Jaccard similarity index, which is briefly described below. The Jaccard Similarity Index, also known as the Jaccard Coefficient, is a statistical measure used to assess the similarity or dissimilarity between two sets by comparing the size of their intersection with the size of their union. The index produces values between 0 and 1 and quantifies the degree of overlap between two sets, with higher values indicating greater similarity. It is widely used in various fields, including data mining, information retrieval, and bioinformatics, to analyze and compare datasets. The index dates back to the early 1900s, when the Swiss botanist Paul Jaccard introduced it as a method to measure the floristic similarity between two geographic regions based on the presence or absence of plant species. Jaccard's work laid the foundation for this similarity measure, and over time it has found applications in diverse areas beyond botany. In the modern era, especially with the growth of data analytics and machine learning, the Jaccard Similarity Index has become an essential tool for data scientists and researchers to gauge the similarity between datasets, identify patterns, and make informed decisions.
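In set notation, the index for two sets A and B is the size of their intersection divided by the size of their union. The short sketch below computes it for two patients described by their sets of diseases; the function name, the patient data, and the convention of returning 0 for two empty sets are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard index of two collections: intersection size over union size."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention chosen here: two empty sets get similarity 0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical disease sets for two patients.
patient_a = {"sepsis", "pneumonia", "hypertension"}
patient_b = {"sepsis", "hypertension", "diabetes"}

print(jaccard(patient_a, patient_b))
```

The two patients share two diseases out of four distinct ones, so the index is 2 / 4. The same function applies unchanged to sets of medicines or measurements.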
The concept of centrality is rooted in the study of human communication within small groups and was introduced to assess the potential relationship between structural centrality and influence within group dynamics. This early pioneering research paved the way for subsequent studies, although their findings were often confusing or contradictory. Since then, numerous algorithms and methodologies have been developed to identify nodes of significance within graphs [17], the most prominent being Degree Centrality, Closeness Centrality, Betweenness Centrality, Eigenvector Centrality, and PageRank. PageRank is a derivative of Eigenvector Centrality that evaluates node influence by considering not only a node's immediate neighbors but also the neighbors of those neighbors. The algorithm was conceived in 1996 [19] as a means of establishing web content rankings by prioritizing link popularity. The foundational description of PageRank, a pivotal component of the Google search engine, was presented in 1998. Initially, all nodes held an identical PageRank score, set at 1. In later versions, however, each node starts with a value between 0 and 1. In each iteration, a node distributes its influence score equally among the nodes it links to, so the score transferred via one outbound link is the node's score divided by its total number of outbound links. Notably, PageRank omits self-links and treats multiple connections between the same pair of nodes as one. Nodes with zero out-degree are assumed to connect with all other nodes in the graph. To account for real-world behavior, PageRank introduces a damping factor representing the probability that a user continues by clicking on another link. Typically, this factor is set to around 0.85. While PageRank shares commonalities with Eigenvector centrality, its emphasis on in-degrees makes it best suited to directed graphs rather than undirected ones [17].
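The iteration described above can be sketched in a few lines of Python. This is a minimal power-iteration version written for illustration, not the implementation used by the Neo4j Graph Data Science Library. It assumes normalized scores that start at 1/N, a damping factor of 0.85, and a fixed number of iterations; the score of nodes without outbound links is spread uniformly over all nodes, a common simplification of the rule stated in the text.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {node: [targets]}.

    Scores start at 1/N and sum to 1 after every iteration.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    # Drop self-links and collapse repeated links between the same pair.
    out = {v: {t for t in links.get(v, []) if t != v} for v in nodes}

    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Score held by nodes without outbound links is spread over all nodes.
        dangling = sum(rank[v] for v in nodes if not out[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for t in out[v]:
                # Each node splits its score equally among its outbound links.
                new_rank[t] += damping * rank[v] / len(out[v])
        rank = new_rank
    return rank


# Hypothetical four-node graph: "d" links to "a" but nothing links to "d".
scores = pagerank({"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]})
for node, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(node, round(score, 4))
```

In this toy graph node "a" receives links from both "c" and "d" and therefore ranks highest, while "d", which has no inbound links, keeps only the baseline share that the damping factor leaves to every node.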
Neo4j's ability to handle complex interconnected data makes it a valuable tool for managing and analyzing healthcare data and for uncovering hidden insights. According to Neo4j's website (available from link), many large healthcare companies use Neo4j to analyze their data.
In the healthcare domain, Neo4j offers several advantages that contribute to enhanced data management and decision-making processes. Firstly, its Cypher querying language facilitates efficient and real-time analysis by providing a powerful tool for querying and navigating the graph. This enables healthcare professionals to explore complex relationships within the data and derive meaningful insights.
Moreover, Neo4j's graph data model proves valuable in healthcare as it allows for the representation and exploration of intricate connections between various healthcare entities such as patients, medications, and diseases. This capability enables a comprehensive understanding of the relationships and dependencies within the data, ultimately leading to actionable insights and improved patient care.
Furthermore, Neo4j can be leveraged to develop clinical decision support systems. By utilizing the graph database, personalized treatment recommendations can be generated based on individual patient profiles. The system can also assist in detecting drug interactions, predicting disease progression, and facilitating more informed clinical decision-making.
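One plausible way to derive such recommendations from patient profiles is to look at what the most similar known patients received. The sketch below is an assumption-laden illustration of that idea and not a clinical method: the function names, the `top_k` parameter, and all patient data are hypothetical, and similarity is measured with the Jaccard index over disease sets.

```python
def jaccard(a, b):
    """Jaccard index of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_medicines(new_diseases, known_patients, top_k=2):
    """Suggest medicines used by the `top_k` patients most similar in diseases.

    `known_patients` maps a patient id to {"diseases": set, "medicines": set}.
    Patients with zero similarity are ignored. Returns sorted medicine names.
    """
    ranked = sorted(
        known_patients.items(),
        key=lambda item: jaccard(new_diseases, item[1]["diseases"]),
        reverse=True,
    )
    suggestions = set()
    for _, record in ranked[:top_k]:
        if jaccard(new_diseases, record["diseases"]) > 0:
            suggestions |= record["medicines"]
    return sorted(suggestions)


# Hypothetical patient profiles, for illustration only.
known = {
    "p1": {"diseases": {"sepsis", "pneumonia"}, "medicines": {"vancomycin"}},
    "p2": {"diseases": {"sepsis"}, "medicines": {"norepinephrine", "vancomycin"}},
    "p3": {"diseases": {"fracture"}, "medicines": {"morphine"}},
}

print(recommend_medicines({"sepsis", "pneumonia"}, known))
```

Any output of this kind would only be a candidate list for a clinician to review, since similarity in recorded diagnoses says nothing about contraindications or dosage.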
Overall, Neo4j's efficient querying capabilities, graph modeling approach, and potential for clinical decision support make it a valuable tool in the healthcare industry, enabling advanced analysis and the provision of personalized and effective care.
The thesis workflow is depicted in Figure 2. As shown, it begins with the initial step of downloading the open-source MIMIC-III dataset. This dataset serves as the foundation for the subsequent data modeling process. Once the dataset is obtained, the focus shifts to data modeling, where the structure and relationships within the dataset are defined. This step involves identifying the relevant entities and attributes and establishing the connections between them. After completing the data modeling phase, the next step is to load the prepared data into the Neo4j graph database. Neo4j provides a robust and scalable platform for storing and managing graph data efficiently. With the data successfully loaded into Neo4j, the creation of nodes and relationships takes place. Nodes represent the entities from the dataset, while relationships define the connections and associations between the nodes. This step ensures that the data is organized in a graph structure that accurately reflects the relationships within the dataset.
Figure 2. Workflow of the thesis.
Once the nodes and relationships are established, Cypher queries are executed within Neo4j. Cypher is a powerful query language specifically designed for querying and manipulating graph data. These queries can perform various operations, such as retrieving specific nodes or relationships, filtering data based on specific criteria, aggregating information, and performing advanced graph algorithms.
The output of the Cypher queries can be displayed in multiple formats. One option is graph visualization, where the results are presented in a visual representation of the graph structure, providing a clear view of the relationships and patterns within the data. Alternatively, the output can be displayed in a tabular format, presenting the queried data in a structured and organized manner.
Overall, the thesis workflow consists of downloading and modeling the dataset, loading it into Neo4j, creating nodes and relationships, running Cypher queries, and presenting the results as a graph visualization or in tabular format. This systematic process allows for efficient analysis and exploration of the MIMIC-III dataset using Neo4j as the graph database.
In this section, we describe the data collection and preprocessing steps for the demo version of the MIMIC-III dataset, which includes data from 100 patients. The demo dataset can be obtained from PhysioNet, a well-known platform for sharing open-source physiological data and related resources.
The MIMIC-III demo dataset provides a condensed version of the comprehensive patient data available in the full dataset. It encompasses essential information related to patient demographics, clinical measurements, laboratory results, medication records, procedures, and more. Accessible through the PhysioNet platform, this dataset serves as a valuable resource for preliminary analysis and exploration of critical care medicine data.
To access the MIMIC-III demo dataset, we downloaded it from the PhysioNet platform. PhysioNet hosts a variety of publicly available datasets, including the demo version of MIMIC-III, which can be freely obtained for thesis and educational purposes. From PhysioNet, we obtained the nine CSV files used for the implementation of this thesis, excerpts of which are shown in Appendix A: patients.csv, admissions.csv, icustays.csv, d_items.csv, inputevents_cv.csv, inputevents_mv.csv, outputs.csv, d_icd_diagnoses.csv, and diagnoses_icd.csv. Together, these files contain the patient data, admission records, ICU stays, medicines, measurements, and diagnosis details that form the foundation of our exploratory analysis. Each file is described in more detail below.
In the data preprocessing stage, the focus is on preparing the dataset obtained from PhysioNet for analysis using Neo4j. The process involves two key steps: data transformation and loading, followed by data filtering and quality assurance.
To begin, the dataset is downloaded from the PhysioNet website and subsequently transformed and loaded into the Neo4j graph database. This transformation entails converting the dataset, which is initially in CSV format, into a format that is compatible with Neo4j's graph database structure. This step ensures efficient storage and facilitates seamless querying of the data within the graph database environment.
Once the data is loaded, filters are applied to ensure the quality and integrity of the dataset. Incomplete or erroneous records are addressed, and missing values are handled appropriately to maintain data consistency and reliability within the graph database. This data filtering and quality assurance step plays a crucial role in ensuring that only valid and reliable data is utilized in subsequent analyses.
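The filtering idea can be illustrated outside the database with a minimal Python sketch. It is not the Cypher used in this thesis: the sample rows, the column names (`subject_id`, `gender`, `dob`), and the choice of required fields are assumptions made only for illustration.

```python
import csv
import io

# Hypothetical excerpt shaped like a patients CSV file; the column names
# and values are illustrative assumptions, not the real data.
SAMPLE_CSV = """subject_id,gender,dob
10006,F,2094-03-05
,M,2090-06-05
10011,,2126-08-14
10013,F,2038-09-03
"""

# Assumption: a record is usable only if these fields are present.
REQUIRED_FIELDS = ("subject_id", "gender")


def keep_complete_rows(text, required=REQUIRED_FIELDS):
    """Return only the rows in which every required field is non-empty."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        row for row in reader
        if all((row.get(field) or "").strip() for field in required)
    ]


clean_rows = keep_complete_rows(SAMPLE_CSV)
print([row["subject_id"] for row in clean_rows])
```

Rows with a missing identifier or a missing gender are dropped, so only complete records would go on to become nodes.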
By undergoing these data preprocessing steps, including data transformation, loading, filtering, and quality assurance, the dataset becomes well-prepared for further analysis and exploration within the Neo4j graph database.
The data preprocessing approach for the MIMIC-III demo dataset uses Neo4j as the primary tool for data organization, filtering, and manipulation. Because the dataset is available from PhysioNet in CSV format, the whole preprocessing workflow can be carried out within the graph database environment, which ensures the dataset's suitability for subsequent analysis and exploration.
In summary, the data collection process involves obtaining the MIMIC-III demo dataset from PhysioNet, a reputable platform for sharing open-source physiological data. The dataset is then preprocessed using Neo4j for data organization and manipulation. This combination of resources and tools allows for efficient data preprocessing and sets the stage for further analysis within the graph database environment.
In this section, we present the development and implementation of the graph data model for the MIMIC-III demo dataset. The graph data model, as depicted in Figure 3, serves as the foundational structure for representing and organizing critical care medical data.
Figure 3. The Graph Data Model for MIMIC III.
Leveraging Neo4j's graph database and Arrows.app, a graph data modeling tool developed by Neo4j Labs, we designed an intuitive and comprehensive model that captures the intricate relationships between various data entities. The implementation phase then converts this graph data model into a fully functioning database ready for exploration and analysis. Our first step is to configure the database environment in Neo4j Desktop. We then install the Graph Data Science Library plugin (see section 2.3.5), which provides advanced graph algorithms and analytics capabilities. With the necessary infrastructure in place, we load the CSV files containing the MIMIC-III demo dataset into our Neo4j graph database, creating nodes and relationships that accurately reflect the complex network of critical care medicine data entities within our model. This implementation process allows us to navigate the data and uncover valuable insights during the subsequent stages of the thesis.
The graph data model comprises several node types, each representing specific data entities, and relationships that indicate associations between nodes. Let's delve into the details of each component:
In conclusion, this section has provided a detailed exposition of the graph data model developed for the MIMIC-III demo dataset. By leveraging Neo4j's powerful graph database and the intuitive Arrows.app graph data modeling tool, we have successfully designed a comprehensive model that captures the intricate relationships between critical care medicine data entities.
To establish and configure a Neo4j graph database, we followed the steps briefly outlined below. First, install the Neo4j Desktop application, a user-friendly tool that facilitates database management. Next, create a new database instance, assigning it a name and a secure password. Then, install two plugins: the Graph Data Science Library and the APOC Library. These plugins offer advanced graph analytics and a comprehensive set of procedures and functions that significantly expand the capabilities of the Cypher query language. Finally, start the database so that data can be loaded and manipulated.
Once the database is created and configured the next step entails the process of loading data and creating graph nodes and relationships with Cypher queries in Neo4j. To this effect, we have utilized the MIMIC-III demo dataset from CSV files to elucidate the establishment of nodes for entities such as patients, admissions, diseases, medicines, measurements, and ICU stays, as well as their corresponding relationships. By following these Cypher queries step-by-step, as depicted in Figure 4, we can proficiently load and create all necessary nodes and relationships in our Neo4j database. This will culminate in the creation of a comprehensive graph data model that facilitates efficient healthcare data analysis and exploration.
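The way CSV rows turn into nodes and relationships can be sketched in a few lines of Python. This is an in-memory illustration, not the Cypher of Figure 4: the sample rows, the `Patient` label, and the `admissionId` property are assumptions, while `HAS_ADMISSION` is the relationship type used in this thesis.

```python
# Hypothetical rows shaped like records read from an admissions CSV file.
admission_rows = [
    {"patientId": "P1", "admissionId": "A1"},
    {"patientId": "P1", "admissionId": "A2"},
    {"patientId": "P2", "admissionId": "A3"},
]


def build_graph(rows):
    """Create one node per distinct entity and one relationship per row."""
    nodes = {}             # (label, id) -> properties
    relationships = set()  # (start node key, type, end node key)
    for row in rows:
        patient_key = ("Patient", row["patientId"])
        admission_key = ("Admission", row["admissionId"])
        # setdefault keeps the first copy, so repeated patients are not
        # duplicated (comparable in spirit to merging on a key).
        nodes.setdefault(patient_key, {"patientId": row["patientId"]})
        nodes.setdefault(admission_key, {"admissionId": row["admissionId"]})
        relationships.add((patient_key, "HAS_ADMISSION", admission_key))
    return nodes, relationships


nodes, relationships = build_graph(admission_rows)
print(len(nodes), "nodes,", len(relationships), "relationships")
```

Keying nodes on their identifier is what prevents a patient with several admissions from appearing as several patient nodes.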
Figure 4. Data Loading and Graph Creation.
In this section, we explore the healthcare data using Neo4j's Cypher query language (see section 2.3.2) and the graph algorithms (see section 2.3.5). By combining these tools, we can perform complex graph-based queries, gain insights into patterns, and explore relationships between different entities. This exploration helps us discover hidden patterns and derive knowledge that supports informed decisions. We first review illustrative queries to gain an overview of the mechanics of Neo4j’s Cypher query language, and then delve into the more complex issues of patient similarity analysis and patient influence analysis. For the patient similarity analysis, we developed custom algorithms, while for the patient influence analysis we used graph algorithms provided by Neo4j’s Graph Data Science Library (see section 2.3.5).
We begin with fundamental Cypher queries that fetch key data from our Neo4j graph database. These simple queries retrieve patient details, admission records, measurements, medicines, and disease details.
Query 1: What are the admissions, diagnosed diseases, measurements, and medications associated with the patient whose patientId is 40601?
Figure 5. Query 1 of the section 5.1.
The query, as depicted in Figure 5, consists of three consecutive MATCH clauses, each creating a distinct path: p1 represents the patient’s connections to admissions and diagnosed diseases through HAS_ADMISSION and DIAGNOSED_WITH relationships, p2 represents the patient’s measurements through the HAS_DONE relationship, and p3 represents the patient’s medicines through the HAS_TAKEN relationship. The RETURN clause returns the three paths p1, p2, and p3, providing a holistic view of the patient’s interconnected healthcare data. The output of the query is shown in Figure 6.
Figure 6. Output for query 1 of the section 5.1.
Query 2: Which patients share common diagnosed diseases?
Figure 7. Query 2 of the section 5.1.
The query, as depicted in Figure 7, retrieves pairs of patients (patient1 and patient2) who share common diseases. It starts by matching patient1 with the corresponding admissions and diagnosed diseases. Then, it finds another patient (patient2) whose admissions are linked to the same diseases. The query collects and returns the unique disease titles shared between patient1 and patient2. Part of the output of the query is shown in Figure 8.
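The logic of this query can be mirrored in plain Python as a rough sketch. The patient identifiers and disease titles below are hypothetical placeholders, and each patient's diagnoses are assumed to have been gathered into a set already.

```python
from itertools import combinations

# Hypothetical diagnoses per patient (placeholder identifiers and titles).
diagnoses_by_patient = {
    "P1": {"disease A", "disease B"},
    "P2": {"disease A", "disease C"},
    "P3": {"disease D"},
}


def shared_diseases(diagnoses):
    """Map each unordered patient pair to the sorted diseases they share."""
    shared = {}
    for patient1, patient2 in combinations(sorted(diagnoses), 2):
        common = diagnoses[patient1] & diagnoses[patient2]
        if common:  # keep only pairs with at least one common disease
            shared[(patient1, patient2)] = sorted(common)
    return shared


pairs = shared_diseases(diagnoses_by_patient)
print(pairs)
```

Iterating over unordered combinations reports each pair once, whereas a pattern match without an ordering condition on the two patients would return every pair in both directions.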
Figure 8. Part of output for query 2 of the section 5.1.
Query 3: What are the common diagnosed diseases and medicines taken by two specific patients with patientId 40601 and 44212?
Figure 9. Query 3 of the section 5.1.
The query, as depicted in Figure 9, focuses on two specific patients with patientId 40601 and 44212. It provides a comprehensive view of their healthcare data by retrieving their diagnosed diseases (g1) and the medicines they have taken (g2 for patient1 and g3 for patient2). The query returns three paths: g1 represents the common diseases shared between the two patients, while g2 and g3 represent the medicines each patient has taken. This detailed data exploration allows for a deeper understanding of the medical histories and treatments of these specific patients. The output of the query is shown in Figure 11.
Query 4: What are the common diagnosed diseases and measurements taken by two specific patients with patientId 40601 and 44212?
Figure 10. Query 4 of the section 5.1.
The query, as depicted in Figure 10, focuses on two specific patients with patientId 40601 and 44212. It provides a comprehensive view of their healthcare data by retrieving their diagnosed diseases (g1) and the measurements they have done (g2 for patient1 and g3 for patient2). The query returns three paths: g1 represents the common diseases shared between the two patients, while g2 and g3 represent the measurements each patient has done. This detailed data exploration allows for a deeper understanding of the medical histories and measurements of these specific patients. The output of the query is shown in Figure 12.
Figure 11. Output for query 3 of the section 5.1.
Figure 12. Output for query 4 of the section 5.1.
Query 5: What are the common diagnosed diseases in patients with patientId 40601 and patientId 44212 during their hospital admissions, and what medications have been taken, as well as what measurements have been performed on these patients during their hospital stays?
Figure 13. Query 5 of the section 5.1.
The query, as depicted in Figure 13, performs a complex graph traversal to find patterns between two patients with patientId 40601 and 44212 in terms of diagnosed diseases, medicines taken, and measurements done during their hospital admissions. The first MATCH clause (p1) identifies a path between the two patients, traversing through the common diseases (Disease nodes) diagnosed during their respective admissions (Admission nodes). The WHERE clause restricts the results to the patient pair with patientId 40601 and 44212. The second MATCH clause (p2) identifies a path between the first patient and the medicines (Medicine nodes) taken during the hospital stay. The third MATCH clause (p3) does the same for the second patient. These paths include relationships representing the medications taken and the measurements performed (Measurement nodes). The RETURN statement outputs the graph paths (p1, p2, and p3), which together show the patients' diagnosed diseases, medicines taken, and measurements performed during their admissions. The output of the query is shown in Figure 14.
In this section, we delve into the creation of custom Jaccard similarity (see section 2.3.6) algorithms tailored for healthcare data analysis. These algorithms, enumerated as Algorithm 1 through Algorithm 6, are designed to quantify the similarity between patients based on various aspects of their health records, including diagnosed diseases, medications administered, and measurements recorded. The resulting Jaccard similarity scores are crucial for understanding the closeness of patients’ medical profiles, enabling personalized treatment recommendations. Each algorithm is accompanied by a detailed explanation and its corresponding Cypher query, as illustrated in Figures 15 to 26. Following this, we present a practical application of these algorithms in the context of personalized treatment recommendations for a newly admitted patient, demonstrating the tangible impact of patient similarity analysis within a healthcare setting.
Figure 14. Output for query 5 of the section 5.1.
Algorithm 1: The algorithm finds the Jaccard Similarity between patients based on their diagnosed diseases.
Figure 15. Jaccard similarity algorithm for diseases.
The query, as depicted in Figure 15, analyzes patients’ disease diagnoses within the Metavision data source in order to calculate Jaccard similarity scores between pairs of patients based on their shared diseases. It begins by matching patients with admissions linked to disease diagnoses and collecting the disease IDs into lists. Patients and their corresponding disease lists are then gathered into collections for further comparison. Pairs of patients are generated, and their disease lists are intersected and united to obtain the shared and total diseases. The Jaccard similarity score is computed as the shared disease count divided by the total disease count, with a filter that keeps only non-zero similarities. The query ultimately returns patient pairs along with their Jaccard similarity scores, shared disease counts, and total disease counts, enabling the identification of patients with similar disease profiles within the Metavision data source. Part of the result of the algorithm is shown in Figure 16.
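The computation described above can be sketched in Python as follows. It is an illustration of the Jaccard calculation, not the Cypher query of Figure 15, and the patient and disease identifiers are hypothetical placeholders.

```python
from itertools import combinations

# Hypothetical disease identifiers per patient.
diseases_by_patient = {
    "P1": {"D1", "D2", "D3"},
    "P2": {"D2", "D3", "D4"},
    "P3": {"D5"},
}


def pairwise_jaccard(items_by_patient):
    """Return (patient1, patient2, score, shared, total) for every pair
    with a non-zero Jaccard similarity, highest score first."""
    results = []
    for p1, p2 in combinations(sorted(items_by_patient), 2):
        a, b = items_by_patient[p1], items_by_patient[p2]
        shared = len(a & b)  # size of the intersection
        total = len(a | b)   # size of the union
        if shared > 0:       # drop pairs with zero similarity
            results.append((p1, p2, shared / total, shared, total))
    return sorted(results, key=lambda r: r[2], reverse=True)


for row in pairwise_jaccard(diseases_by_patient):
    print(row)
```

The same function applies unchanged to medicine labels or measurement names, since only the contents of the sets differ.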
Figure 16. Part of the result for the Jaccard similarity algorithm for diseases.
Algorithm 2: The algorithm finds the Jaccard Similarity between patients based on the medicines they have taken.
Figure 17. Jaccard similarity algorithm for medicines.
The query, as depicted in Figure 17, examines patients’ medication history within the Metavision data source in order to calculate Jaccard similarity scores between pairs of patients based on their shared medications. It begins by matching patients with medicine administrations and collecting the medication labels into lists. Patients and their respective medication lists are then aggregated for comparison. Pairs of patients are generated, and their medication lists are intersected and united to derive the shared and total medications. Jaccard similarity scores are computed from the counts of shared and total medications, with a filter that keeps only non-zero similarities. The query ultimately retrieves patient pairs alongside their Jaccard similarity scores, shared medication counts, and total medication counts. Part of the result of the algorithm is shown in Figure 18.
Figure 18. Part of the result for the Jaccard similarity algorithm for medicines.
Algorithm 3: The algorithm finds the Jaccard Similarity between patients based on the measurements they have done.
Figure 19. Jaccard similarity algorithm for measurements.
Figure 20. Part of the result for the Jaccard similarity algorithm for measurements.
The query, as depicted in Figure 19, analyzes patients’ measurement records within the Metavision data source in order to compute Jaccard similarity scores between pairs of patients based on their shared measurements. The query starts by matching patients with recorded measurements and collecting the measurement names into lists. The measurements and patients are then grouped to facilitate subsequent comparisons. Patient pairs are generated, and their measurement lists are intersected and united to determine the shared and total measurements. Jaccard similarity scores are calculated from the counts of shared and total measurements, keeping only non-zero similarities. The query finally retrieves pairs of patients alongside their Jaccard similarity scores, shared measurement counts, and total measurement counts. Part of the result of the algorithm is shown in Figure 20.
Algorithm 4: The algorithm finds the Jaccard Similarity between patients based on their diagnosed diseases and the medicines they have taken.
Figure 21. Jaccard similarity algorithm for diseases and medicines.
The query, as depicted in Figure 21, analyzes patients’ health records within the Metavision data source in order to calculate Jaccard similarity scores between pairs of patients based on shared diseases and medicines. The query starts by matching patients with diagnosed diseases and collecting the disease identifiers into lists. Patient-disease pairs are grouped to facilitate comparisons. Similarly, patients are matched with the medicines they have taken, and their medicine labels are collected into lists. Patient pairs are then generated, and their disease lists are intersected and united to determine the shared and total diseases. The same process is repeated for medicines. The query then calculates Jaccard similarity scores based on the shared and total diseases and medicines, and only non-zero similarities are kept. Finally, the query retrieves patient pairs alongside their Jaccard similarity scores, shared counts of diseases and medicines, and total counts. Part of the result of the algorithm is shown in Figure 22.
Figure 22. Part of the result for the Jaccard similarity algorithm for diseases and medicines.
Algorithm 5: The algorithm finds the Jaccard Similarity between patients based on their diagnosed diseases and the measurements they have done.
Figure 23. Jaccard similarity algorithm for diseases and measurements.
The query, as depicted in Figure 23, analyzes patients’ health records within the Metavision data source in order to calculate Jaccard similarity scores between pairs of patients based on shared diseases and measurements. The query begins by matching patients with diagnosed diseases and collecting the disease identifiers into lists. Patient-disease pairs are grouped to facilitate comparisons. Similarly, patients are matched with the measurements they have undergone, and their measurement names are collected into lists. Patient pairs are generated, and their disease lists are intersected and united to determine the shared and total diseases. The same process is repeated for measurements. The query then calculates Jaccard similarity scores based on the shared and total diseases and measurements, keeping only non-zero similarities. Finally, the query retrieves patient pairs alongside their Jaccard similarity scores, shared counts of diseases and measurements, and total counts. Part of the result of the algorithm is shown in Figure 24.
Figure 24. Part of the result for the Jaccard similarity algorithm for diseases and measurements.
Algorithm 6: The algorithm finds the Jaccard Similarity between patients based on their diagnosed diseases, the measurements they have done, and the medicines they have taken.
The query, as depicted in Figure 25, analyzes patient health data in the context of the Metavision data source. The goal is to compute the Jaccard similarity scores between pairs of patients based on shared diseases, medications, and measurements. The query begins by matching patients with diagnosed diseases and collecting disease identifiers into lists. Patient-disease pairs are generated to enable comparisons. Next, patients are matched with the medications they have taken, and the medication labels are collected into lists. Similarly, patients are associated with the measurements they have undergone, and the measurement names are collected into lists. Pairs of patients are generated, and the shared and total lists of diseases, medications, and measurements are computed for each pair. Shared and total counts are determined for each category. Jaccard similarity scores are calculated from the shared and total counts, and only non-zero similarities are retained. Finally, the query retrieves patient pairs along with their Jaccard similarity scores, shared counts of diseases, medications, and measurements, and total counts. Part of the result of the algorithm is shown in Figure 26.
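The description above does not spell out how the three categories are merged into a final score, so the following Python sketch shows one possible reading and should not be taken as the query of Figure 25. It reports a Jaccard score per category plus a pooled score, defined here (as an assumption) as the sum of shared counts divided by the sum of total counts. All identifiers are hypothetical.

```python
def combined_similarity(profile_a, profile_b):
    """Return ({category: jaccard}, pooled) for two patient profiles.

    Each profile maps a category name to a set of item identifiers.
    The pooled score is an assumed way to combine the categories.
    """
    per_category = {}
    shared_sum = 0
    total_sum = 0
    for category in sorted(set(profile_a) | set(profile_b)):
        a = profile_a.get(category, set())
        b = profile_b.get(category, set())
        shared, total = len(a & b), len(a | b)
        per_category[category] = shared / total if total else 0.0
        shared_sum += shared
        total_sum += total
    pooled = shared_sum / total_sum if total_sum else 0.0
    return per_category, pooled


# Hypothetical profiles for two patients.
patient_a = {"diseases": {"D1", "D2"}, "medicines": {"M1", "M2"},
             "measurements": {"X1"}}
patient_b = {"diseases": {"D2"}, "medicines": {"M2", "M3", "M4"},
             "measurements": {"X1"}}

scores, pooled = combined_similarity(patient_a, patient_b)
print(scores, round(pooled, 3))
```

Reporting the per-category scores alongside the pooled one makes it visible when a high overall similarity is driven by a single category.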
Figure 25. Jaccard similarity algorithm for diseases, medicines, and measurements.
Figure 26. Part of the result for the Jaccard similarity algorithm for diseases, medicines, and measurements.
In a modern hospital setting, a new patient, John (patientId: 40504), is admitted for treatment. John's medical condition is intricate and unique, making it challenging for the medical team to devise an effective treatment plan. To address this, the hospital has implemented a sophisticated healthcare data system powered by Neo4j and Cypher, which leverages the patient similarity analysis algorithm that was provided in the previous section. Below are the steps the system follows:
Figure 27. Personalized treatment recommendation for a new hospitalized patient using patient similarity.
The Cypher code, depicted in Figure 27, describes the steps required to calculate patient similarity and recommend measurements and medicines for the new patient, John, based on the most similar patient in the database. It demonstrates the practical application of Neo4j and Cypher in personalized healthcare. Figure 28 displays a screenshot of the query result, which shows the measurements and medicines recommended for the newly admitted patient based on the similarity to another patient's medical history and diagnoses. As shown in Figure 16, the patient with patientId 40503 has a similarity score of 1 with John (patientId: 40504).
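The recommendation idea can be sketched in Python under simplifying assumptions: similarity is measured on diagnosed diseases only, a single most similar patient is used, and items the new patient already has are excluded. The identifiers are hypothetical and the function is an illustration, not the code of Figure 27.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(new_id, diseases, medicines, measurements):
    """Suggest the medicines and measurements of the most similar patient."""
    target = diseases[new_id]
    candidates = [pid for pid in diseases if pid != new_id]
    best = max(candidates, key=lambda pid: jaccard(target, diseases[pid]))
    return {
        "similar_patient": best,
        "score": jaccard(target, diseases[best]),
        "medicines": sorted(medicines.get(best, set())
                            - medicines.get(new_id, set())),
        "measurements": sorted(measurements.get(best, set())
                               - measurements.get(new_id, set())),
    }


# Hypothetical records; "NEW" stands for the newly admitted patient.
diseases = {"NEW": {"D1", "D2"}, "P1": {"D1", "D2"}, "P2": {"D2", "D3"}}
medicines = {"P1": {"M1", "M2"}, "P2": {"M3"}}
measurements = {"P1": {"X1"}, "P2": {"X2"}}

recommendation = recommend("NEW", diseases, medicines, measurements)
print(recommendation)
```

Here P1 has exactly the same diagnoses as the new patient, so its score is 1.0 and its medicines and measurements are the ones suggested.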
Figure 28. John's recommended medicines and measurements.
In this section, we delve into the dynamic world of healthcare optimization, exploring the capabilities of the Centrality algorithms (see section 2.3.6) provided by Neo4j's Graph Data Science Library (see section 2.3.5) and specifically the PageRank to revitalize care delivery and resource allocation. In the first subsection, we create two tailored PageRank algorithms, which allow us to uncover influential roles within healthcare networks. These algorithms are designed to gauge the impact of patients and caregivers, providing essential insights into the complex web of interactions that underpin effective healthcare systems. The subsequent subsection presents a real-world scenario where we demonstrate the practical application of these algorithms. We showcase how the PageRank algorithm can significantly enhance patient care coordination upon admission, ultimately improving healthcare service quality and patient outcomes. By harnessing the power of PageRank, we unlock the potential for more patient-centric healthcare systems and streamlined care delivery.
Algorithm 1: The algorithm finds the influence of each patient within the healthcare data network.
Figure 29. PageRank patient influence algorithm.
The Cypher code, depicted in Figure 29, is designed to assess the influence of patients within a healthcare network. It begins by gathering patients admitted to a hospital and the diseases they were diagnosed with during these admissions. The code then projects a subgraph named ‘patientInfluence’ from the overall graph, emphasizing relationships between patients and diseases through their admissions. After projecting this graph, the code computes PageRank scores using the ‘gds.pageRank.stream’ procedure. The ‘scaler’ parameter is set to ‘MEAN’, so the resulting PageRank scores are normalized around their mean. This algorithm calculates the influence of patients in the context of their connections to specific diseases. The results provide a list of patients and their respective PageRank scores, with higher scores indicating greater influence. Part of the result of the algorithm is shown in Figure 30.
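To make the two ingredients concrete, the following Python sketch implements a textbook PageRank by power iteration, followed by a mean normalization of the scores. It is not the Graph Data Science implementation: the edge list is hypothetical, the damping factor of 0.85 is the commonly used default, and the scaling formula (score minus mean, divided by the range) is an assumption about what mean scaling does.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Textbook PageRank on a directed edge list; scores sum to 1."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly.
        dangling = sum(scores[v] for v in nodes if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_scores = {node: base for node in nodes}
        for source in nodes:
            if out_links[source]:
                share = damping * scores[source] / len(out_links[source])
                for target in out_links[source]:
                    new_scores[target] += share
        scores = new_scores
    return scores


def mean_scale(scores):
    """Assumed mean scaling: (score - mean) / (max - min)."""
    values = list(scores.values())
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    if spread == 0:
        return {node: 0.0 for node in scores}
    return {node: (value - mean) / spread for node, value in scores.items()}


# Hypothetical directed graph: A and B point to C, and C points back to A.
raw = pagerank([("A", "C"), ("B", "C"), ("C", "A")])
scaled = mean_scale(raw)
print(sorted(scaled.items(), key=lambda item: item[1], reverse=True))
```

After mean scaling, scores above the average are positive and scores below it are negative, which is why scaled influence values can be negative.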
Figure 30. Part of output for PageRank patient influence algorithm.
Algorithm 2: The algorithm finds the influence of each caregiver within the healthcare data network.
Figure 31. PageRank caregiver influence algorithm.
The Cypher code, depicted in Figure 31, is developed to evaluate the influence of caregivers within the healthcare network. It begins by identifying caregivers (referred to as ‘source’ nodes) responsible for caring for patients (referred to as ‘target’ nodes). Subsequently, a subgraph named ‘caregiverInfluence’ is projected from the broader network, specifically highlighting caregiver-patient interactions during patient care. The essential graph metrics, including node count and relationship count, are presented. Employing the PageRank algorithm with the ‘MEAN’ scaler parameter, the query calculates influence scores for caregivers. The algorithm assigns these scores based on the caregivers' connections with patients. The query keeps only nodes with non-null caregiver IDs, so that the results contain caregivers rather than the other nodes of the projection. The final output is a list of caregivers ranked by influence score, which healthcare administrators can use to match patients with influential caregivers. Part of the result of the algorithm is shown in Figure 32.
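The last two steps, filtering on the caregiver ID and ranking by score, can be sketched in Python as follows. The streamed rows are hypothetical, and selecting the single top-ranked caregiver is an assumption made only for illustration.

```python
def rank_caregivers(rows):
    """Drop rows without a caregiver ID and sort by score, highest first.

    rows is a list of (caregiver_id, score) pairs.
    """
    valid = [(cid, score) for cid, score in rows if cid is not None]
    return sorted(valid, key=lambda pair: pair[1], reverse=True)


# Hypothetical streamed results; None marks a node that is not a caregiver.
streamed = [("C1", 0.12), (None, 0.80), ("C2", 0.95), ("C3", 0.40)]

ranking = rank_caregivers(streamed)
recommended = ranking[0][0] if ranking else None
print(ranking, recommended)
```

Without the filter, the row with the missing ID would be ranked second even though it does not correspond to a caregiver.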
Figure 32. Part of output for PageRank caregiver influence algorithm.
In modern healthcare systems, the efficient allocation of caregivers to patients plays a pivotal role in ensuring high-quality healthcare delivery. To address this, a data-driven approach is proposed that utilizes the PageRank algorithm within a healthcare graph database. This algorithm helps identify the most influential caregivers for patients upon admission to a healthcare facility. The following scenario outlines the systematic application of this approach to enhance patient care coordination.
The Cypher code, depicted in Figure 33, describes the steps required to implement the scenario described above. It demonstrates the practical application of Neo4j and Cypher in personalized healthcare.
Figure 33. Recommending influential caregivers for new patient using patient and caregiver influence.
Figure 34 displays a screenshot of the query result, which shows the caregiver recommended for a new patient admitted to the hospital based on influence score. As shown in Figure 32, the caregiver with caregiverId 21507 has an influence score of 0.95 and is therefore recommended for the care of the new patient (patientId: 40506).
Figure 34. Caregiver with caregiverId 21570 is recommended to care for the patient with patientId 40506.
In this thesis, we have explored the combination of Medical Informatics and graph database technology through the use of Neo4j in healthcare analytics. Our goal was to improve the comprehension and utilization of patient data within the healthcare domain. Following the workflow presented earlier, we proceeded from data collection to advanced querying techniques. The initial stages comprised the data collection process, which ensured the acquisition of accurate and reliable medical data. Subsequent data preprocessing procedures ensured data consistency and prepared the foundation for efficient database implementation. The construction of a graph data model and the creation of a Neo4j database provided the structural backbone for our analysis. Cypher queries were used to load the data and form the relationships, giving a nuanced perspective of the interconnections in the medical data. A pivotal aspect of this study was data querying, analysis, and exploration. Fundamental techniques for data retrieval were examined, while the focal point remained the patient similarity analysis and the patient and caregiver influence analysis. By employing Jaccard similarity metrics across diseases, medicines, and measurements, as well as centrality algorithms such as PageRank to measure patient and caregiver influence, we uncovered patterns and correlations within patient data, enabling a deeper understanding of patient relationships and medical dynamics.
In conclusion, this research contributes to the evolving landscape of Medical Informatics by accentuating the potential of graph database technology to revolutionize healthcare analytics. Our study underscores the significance of Neo4j in unveiling hidden insights within patient data, exemplifying its efficacy in driving informed medical decision-making. The implications of our findings extend to improved patient care, enriched clinical research, and the potential for novel medical discoveries. As the healthcare sector grapples with escalating data complexities, our thesis advocates for the continued integration of innovative data management solutions. This exploration serves as a catalyst for further inquiry and innovation at the nexus of healthcare and technology, opening avenues for future research and advancements in the realm of Medical Informatics.
References
Appendices
Appendix A
Appendix A illustrates the CSV files used for this thesis.
Table 1. Part of the patients CSV file.
Table 2. Part of the admissions CSV file.
Table 3. Part of icustays CSV file.
Table 4. Part of d_items CSV file.
Table 5. Part of inputevents_mv CSV file.
Table 6. Part of inputevents_cv CSV file.
Table 7. Part of outputs CSV file.
Table 8. Part of d_icd_diagnoses CSV file.
Table 9. Part of diagnoses_icd CSV file.