

Hellenic Mediterranean University
Department of Informatics Engineering

Bachelor Thesis

Analyzing Large-Scale Data with Neo4j in Healthcare

Emmanouil Nitis

Supervisor: Demosthenes Akoumianakis

Heraklion, October 2023

Acknowledgements

I would like to express my sincere appreciation to my supervisor, Dr. Demosthenes Akoumianakis, for his invaluable guidance and support throughout this thesis. Dr. Akoumianakis' insightful feedback, thoughtful discussions, and unwavering encouragement have been instrumental in shaping the direction and outcome of this research. His expertise, dedication, and mentorship have provided me with a solid foundation on which to build my academic and research pursuits. I am deeply grateful for the time and effort he has generously invested in my growth as a scholar. Thank you, Dr. Akoumianakis, for your indispensable contribution to this thesis.

Περίληψη (Summary)

This thesis investigates the intersection of Medical Informatics and emerging data management technologies based on the graph model, focusing on the use of Neo4j for the analysis of healthcare data. Aiming to improve the analysis and exploration of patient data, the thesis undertakes a comprehensive investigation that includes data collection and preprocessing, implementation of the database with Neo4j, and exploration of advanced querying techniques. The introductory section establishes the purpose of the research, outlines key objectives, and defines the scope and limitations of the study. The theoretical background delves into Medical Informatics, emphasizing healthcare analytics and the significance of NoSQL databases, particularly graph databases such as Neo4j. The thesis workflow guides readers through the research trajectory. Next, the data collection and preprocessing phase explains in detail the data source, the collection methodologies, and the preprocessing tools used to ensure data accuracy and consistency. The graph data model and database implementation sections describe in detail the conceptual and practical construction of a Neo4j graph database, including creation, configuration, data loading, and the organization of the data as a graph, as well as the handling of path queries using the Cypher language. The added value of the thesis lies in the data querying, analysis, and exploration section, which includes fundamental data retrieval techniques that support an in-depth patient similarity analysis. For this purpose, Jaccard similarity metrics based on diseases, medicines, and measurements are used, providing valuable insights into patient relationships and medical patterns.
In conclusion, the thesis concisely summarizes the research results, examines their implications, and presents possible avenues for future exploration in the field of Medical Informatics. By harnessing the power of graph database technology, this study introduces innovative perspectives on the analysis of medical data, underscoring the potential for revolutionary strides in healthcare analytics.

Abstract

This thesis explores the intersection of Medical Informatics and graph database technology, focusing on the application of Neo4j in healthcare analytics. With the objective of enhancing patient data analysis and exploration, the thesis undertakes a comprehensive investigation encompassing data collection, preprocessing, database implementation, and advanced querying techniques. The introductory segment establishes the research's purpose, outlines key objectives, and defines the scope and limitations of the study. The theoretical background delves into Medical Informatics, highlighting healthcare analytics and the significance of NoSQL databases, particularly graph databases like Neo4j. The thesis workflow guides readers through the research trajectory. Subsequently, the data collection and preprocessing phase meticulously expounds on the data source, collection methodologies, and preprocessing tools to ensure data accuracy and consistency. The graph data model and database implementation sections detail the conceptual and practical construction of a Neo4j graph database, including database creation, configuration, data loading, and graph construction using Cypher queries. The crux of the thesis resides in the data querying, analysis, and exploration section, encompassing fundamental data retrieval techniques, an in-depth patient similarity analysis, and patient influence analysis. The patient similarity analysis employs Jaccard similarity metrics based on diseases, medicines, and measurements, providing valuable insights into patient relationships and medical patterns. In conclusion, the thesis succinctly summarizes research outcomes, discusses their implications, and presents potential avenues for future exploration within the realm of Medical Informatics. By harnessing the power of graph database technology, this study introduces innovative perspectives on medical data analysis, underscoring the potential for revolutionary strides in healthcare analytics.

Contents

Acknowledgements

Περίληψη (Summary)

Abstract

Contents

List of Tables

List of Figures

1.        Introduction

1.1.        Thesis Objectives

1.2.        Scope and Limitations of Thesis

2.        Theoretical Background

2.1.        Medical Informatics, healthcare analytics and open data sets

2.2.        NoSQL Databases and the Property Graph data model

2.3.        Neo4j, Cypher and Graph Algorithms

2.3.1.        Path Querying

2.3.2.        Cypher Principles and Capabilities

2.3.3.        Cypher Commands

2.3.4.        Use Cases of Neo4j

2.3.5.        Neo4j Graph Algorithms

2.3.6.        The Concepts of Similarity and Centrality

2.3.7.        Neo4j in Healthcare

2.4.        Thesis Workflow

3.        Data Collection and Preprocessing

3.1.        Data Source

3.2.        Data Collection Process

3.3.        Data Preprocessing

3.4.        Data Preprocessing Tools and Techniques

4.        Graph Data Model and Database Implementation

4.1.        Database Creation and Configuration

4.2.        Data Loading and Graph Creation with Cypher

5.        Querying, Analyzing and Exploring the Dataset

5.1.        Basic Data Retrieval

5.2.        Patient Similarity Analysis

5.2.1.        Creating Custom Jaccard Similarity Algorithms

5.2.2.        Treatment Recommendation for New Hospitalized Patients

5.3.        Patient and Disease Influence Analysis

5.3.1.        Creating PageRank Algorithms

5.3.2.        Recommending Influential Caregivers for New Patients

6.        Conclusions

References

Appendices

Appendix A


List of Tables

Table 1. Part of the patients CSV file.

Table 2. Part of the admissions CSV file.

Table 3. Part of icustays CSV file.

Table 4. Part of d_items CSV file.

Table 5. Part of inputevents_mv CSV file.

Table 6. Part of inputevents_cv CSV file.

Table 7. Part of outputs CSV file.

Table 8. Part of d_icd_diagnoses CSV file.

Table 9. Part of diagnoses_icd CSV file.

List of Figures

Figure 1. An example of a property graph database.

Figure 2. Workflow of the thesis.

Figure 3. The Graph Data Model for MIMIC III.

Figure 4. Data Loading and Graph Creation.

Figure 5. Query 1 of the section 5.1.

Figure 6. Output for query 1 of the section 5.1.

Figure 7. Query 2 of the section 5.1.

Figure 8. Part of output for query 2 of the section 5.1.

Figure 9. Query 3 of the section 5.1.

Figure 10. Output for query 3 of the section 5.1.

Figure 11. Query 4 of the section 5.1.

Figure 12. Output for query 4 of the section 5.1.

Figure 13. Query 5 of the section 5.1.

Figure 14. Output for query 5 of the section 5.1.

Figure 15. Jaccard similarity algorithm for diseases.

Figure 16. Part of the result for the Jaccard similarity algorithm for diseases.

Figure 17. Jaccard similarity algorithm for medicines.

Figure 18. Part of the result for the Jaccard similarity algorithm for medicines.

Figure 19. Jaccard similarity algorithm for measurements.

Figure 20. Part of the result for the Jaccard similarity algorithm for measurements.

Figure 21. Jaccard similarity algorithm for diseases and medicines.

Figure 22. Part of the result for the Jaccard similarity algorithm for diseases and medicines.

Figure 23. Jaccard similarity algorithm for diseases and measurements.

Figure 24. Part of the result for the Jaccard similarity algorithm for diseases and measurements.

Figure 25. Jaccard similarity algorithm for diseases, medicines, and measurements.

Figure 26. Part of the result for the Jaccard similarity algorithm for diseases, medicines, and measurements.

Figure 27. Personalized treatment recommendation for a new hospitalized patient using patient similarity.

Figure 28. John's recommended medicines and measurements.

Figure 29. PageRank patient influence algorithm.

Figure 30. Part of output for PageRank patient influence algorithm.

Figure 31. PageRank caregiver influence algorithm.

Figure 32. Part of output for PageRank caregiver influence algorithm.

Figure 33. Recommending influential caregivers for new patient using patient and caregiver influence.

Figure 34. Caregiver with caregiverId 21570 is recommended to care for the patient with patientId 40506.


Acronyms

SQL        Structured Query Language

NoSQL        Not only SQL

RDB        Relational Database

API        Application Programming Interface

1. Introduction

Healthcare data plays a crucial role in improving patient outcomes, enhancing medical research, and optimizing healthcare processes. With the exponential growth in healthcare data, it has become essential to leverage advanced data management and analytics techniques to extract valuable insights from this wealth of information. This thesis delves into the intersection of medical informatics and NoSQL databases, specifically native graph data stores, by retooling an open healthcare database using Neo4j – a leading graph database technology. The primary objectives of this thesis are to explore the application of graph databases in the context of healthcare analytics and to develop advanced data querying and analysis techniques for exploring open healthcare data. By utilizing Neo4j's capabilities, we aim to construct an efficient and robust healthcare graph database that can provide powerful insights into patient records, disease patterns, medication histories, and medical measurements. Moreover, we seek to evaluate the performance and effectiveness of custom algorithms designed for patient similarity analysis, enabling the identification of patients with similar medical conditions and treatment histories. This thesis is structured to systematically address critical aspects of healthcare data management and analysis using graph databases, to contribute to advancements in healthcare data analysis and patient care.

1.1. Thesis Objectives

The objectives of this thesis are twofold. Firstly, we aim to explore the application of Neo4j, a leading graph database management system, in the domain of medical informatics and healthcare analytics. By leveraging the power of graph databases, we seek to uncover valuable insights from complex healthcare data and discover previously hidden relationships between medical entities, such as patients, diseases, medications, and measurements. Secondly, we aim to demonstrate the practical application and added value of various data querying, analysis, and exploration techniques using Neo4j's Cypher query language and graph algorithms. Through a series of use cases and experiments, we intend to showcase the efficiency and effectiveness of Neo4j in handling diverse medical datasets and extracting meaningful patterns, ultimately contributing to more informed decision-making and improved patient care.

1.2. Scope and Limitations of Thesis

The scope of the thesis can be briefly summarized as follows:

In terms of thesis limitations, it is fair to admit the following:

Overall, the thesis set out to present a thorough exploration of Neo4j's capabilities in medical informatics and healthcare analytics, specifically focusing on patient similarity analysis, thereby demonstrating the practicality of graph databases in healthcare data management and analysis. Future work could follow several paths to address some of the acknowledged limitations in terms of scope and dataset used, to provide a more comprehensive understanding of the thesis's applicability and potential implications.

2. Theoretical Background

In the ever-evolving landscape of medical informatics, the fusion of technological innovation and medical practice has paved the way for transformative advancements. This section sets the stage for a comprehensive exploration of the theoretical underpinnings that shape the realm of modern medical informatics. Encompassing the domains of medical informatics, the paradigm shift introduced by NoSQL databases, the intricate world of graph databases exemplified by Neo4j, and the potent capabilities of graph algorithms, this section serves as a foundational framework upon which our journey of patient similarity analysis and graph-based insights will unfold.

2.1. Medical Informatics, healthcare analytics and open data sets

Medical informatics is an area of study and application focused on enhancing the management of patient data, clinical knowledge, population data, and other relevant information about patient care and community health. It is a relatively young scientific discipline that emerged following the advent of digital computers in the 1940s. The utilization of mechanical computing in medicine dates back even further, with Herman Hollerith's development of the "punched-card data processing system" during the 19th century, initially employed for the US census and later adapted to support public health surveys and epidemiology. This historical example underscores the interdisciplinary nature of medical informatics, which intersects with various fields such as clinical sciences, public health sciences (including epidemiology and health services research), as well as cognitive, computing, and information sciences [1] [2] [9].

A particular stream of research in medical informatics concentrates on healthcare analytics. This is a developing field that focuses on the application of computer-based data analysis techniques to facilitate decision-making in both clinical and non-clinical settings. By leveraging electronic healthcare applications, valuable insights can be derived from data, enhancing service quality while minimizing costs. The adoption of systematic solutions has been widespread, driven by the abundance of internal and external data, diverse sources of medical information, and the need for comprehensive reporting. Robust analytic systems are increasingly employed to manage electronic health records, provide clinical decision support, and effectively handle personal or hospital data. These systems play a vital role in supporting managerial decision-making for clinical care and optimizing hospital operations, all while generating evidence-based insights within specific healthcare contexts [3] [4].

Open healthcare data encompasses various healthcare-related datasets and information that are intentionally made available to the public, researchers, and other stakeholders without extensive restrictions. This category of data promotes transparency, collaboration, and innovation within the healthcare sector. It includes de-identified patient records, public health statistics, medical research datasets, and other healthcare-related information that is shared with open-access licenses. However, it's essential to ensure that privacy and security measures are in place to protect individuals' sensitive health information when releasing and using open healthcare data [16].

This thesis concentrates on the MIMIC-III dataset, a large, freely available dataset comprising deidentified health-related data associated with over 40,000 patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. The MIMIC-III Clinical Database is available on PhysioNet (doi: 10.13026/C2XW26). Though deidentified, MIMIC-III contains detailed information regarding the care of real patients, and as such requires credentialing before access. To allow researchers to ascertain whether the database is suitable for their work, there is a demo subset, which contains information for 100 patients also present in the MIMIC-III Clinical Database. The demo dataset contains all intensive care unit (ICU) stays for these 100 patients. The patients were selected randomly from the subset of patients in the dataset who eventually die. Consequently, all of them have a date of death (DOD). However, patients do not necessarily die during an individual hospital admission or ICU stay.

Healthcare datasets, particularly open healthcare datasets, are incredibly valuable for further analysis and research. Firstly, they provide a wealth of information on patient demographics, medical conditions, treatments, and outcomes, making them indispensable for clinical research, epidemiological studies, and healthcare policy analysis. Secondly, these datasets enable researchers to identify trends, correlations, and patterns that can lead to new insights into disease management, preventive care, and treatment effectiveness. Additionally, open healthcare datasets support the development and testing of innovative healthcare technologies, such as predictive analytics, machine learning models, and telemedicine applications. By fostering collaboration and transparency, these datasets empower researchers, healthcare professionals, and data scientists to collectively address healthcare challenges, enhance patient care, and drive improvements in the healthcare industry [16].

2.2. NoSQL Databases and the Property Graph data model

Compared with traditional SQL databases, NoSQL databases are non-relational, distributed, open-source, and horizontally scalable. The original goal of NoSQL development was to create modern web-scale databases. The movement began in early 2009 and has grown quickly. NoSQL databases frequently offer additional features such as schema-free design, simple APIs, easy replication support, eventual consistency under the BASE model (basically available, soft state, eventual consistency), and support for very large data volumes. Furthermore, the somewhat misleading term "NoSQL" is better read as "Not Only SQL": if an RDB suits the task, use it; if it does not, use an alternative [5] [6] [8]. According to the official NoSQL databases website, there are 15 kinds of NoSQL databases based on diverse data models, such as column-oriented databases, document databases, key-value stores, and graph databases. This section focuses on property graph databases, which belong to one of the four main categories of NoSQL databases. It covers the fundamental concepts of property graph databases, including their data model and characteristics, and discusses the features that make them suitable for processing and analyzing specific types of data.

The property graph model is rooted in the principles of graph theory and advocates a fundamental shift in the way data are organized and stored for processing. In general, a graph is a mathematical abstraction used to represent a collection of objects, referred to as vertices or nodes, along with the links that connect them, called edges or relationships. A property graph is a particular type of graph in which both vertices and edges can be qualified by properties. It forms a data modelling pattern that diverges from the models employed by key-value, column-family, and document stores, and it enables efficient storage of relationships between diverse data nodes. In property graph databases, nodes and relationships possess individual properties, typically structured as key-value pairs. These databases specialize in managing intricately interconnected data and, as a result, are notably efficient at traversing relationships among distinct entities. They are suitable for numerous applications, including social networking platforms, pattern recognition, dependency analysis, recommendation systems, and the pathfinding problems encountered in navigation systems [5] [7]. An example of a property graph database is shown in Figure 1.


Figure 1. An example of a property graph database.
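
To make the property graph model more concrete, the following is a minimal in-memory sketch in Python. It is not how Neo4j stores data; the class names, the `FOLLOWS` relationship type, and the example people are hypothetical and chosen only to illustrate that both nodes and relationships carry key-value properties.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set


@dataclass
class Node:
    """A vertex with labels and key-value properties."""
    node_id: int
    labels: Set[str]
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed, typed edge that can carry its own properties."""
    start: int
    end: int
    rel_type: str
    properties: Dict[str, Any] = field(default_factory=dict)


# Hypothetical example data, not taken from the thesis dataset.
alice = Node(1, {"Person"}, {"name": "Alice", "age": 34})
bob = Node(2, {"Person"}, {"name": "Bob"})
follows = Relationship(alice.node_id, bob.node_id, "FOLLOWS", {"since": 2021})

nodes = {n.node_id: n for n in (alice, bob)}
relationships = [follows]


def neighbours(node_id: int, rel_type: str) -> List[Node]:
    """Return the nodes reached from node_id over outgoing edges of rel_type."""
    return [nodes[r.end] for r in relationships
            if r.start == node_id and r.rel_type == rel_type]


print([n.properties["name"] for n in neighbours(1, "FOLLOWS")])  # ['Bob']
```

Following a relationship here means scanning the edge list; a native graph store avoids that scan by keeping direct references between connected records.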

2.3. Neo4j, Cypher and Graph Algorithms

Neo4j is a prominent open-source property graph database that offers commercial support. It adheres to the ACID (Atomicity, Consistency, Isolation, Durability) properties and employs a property graph model to store data. As a NoSQL graph database, it utilizes graph structures to enable semantic queries, incorporating nodes, edges, and properties. The graph consists of interconnected nodes and relationships, with both entities possessing named values known as properties. Relationships establish connections between nodes, allowing for the retrieval of related data. These relationships can be incoming or outgoing for a node, facilitating the exploration of connections between two nodes. Properties are represented as key-value pairs, with the key being a string. Property values can be of primitive types or an array of a single primitive type. A path represents one or more nodes connected by relationships, often obtained as the result of a query or traversal operation [10].

Neo4j is particularly well suited to cases where queries depend on the relationships between data items. Because the database stores relationships directly and provides immediate access to them, it can answer such queries precisely and directly. Moreover, Neo4j is flexible and scalable, allowing new nodes and relationships to be added as requirements evolve. Additionally, Neo4j uses Cypher, a user-friendly path-querying language that handles complex queries involving nodes and their relationships [12].

Lastly, Neo4j offers a comprehensive suite of graph algorithms, covering functionality such as pathfinding and similarity analysis. These algorithms help analyze graph data and extract valuable insights from it, supporting in-depth exploration and understanding of the complex, interconnected relationships within a dataset.

2.3.1. Path Querying

Path querying is a foundational concept within graph theory and databases, serving as a fundamental mechanism for information retrieval and traversal operations in graph-based data structures. Specifically relevant in the context of graph databases, path querying entails the precise specification of a path or sequence of steps that traverse through a graph, defined by nodes and edges, with the primary objective of extracting targeted information or ascertaining relationships between nodes [15].
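
As a rough illustration of what a path query does, the sketch below runs a breadth-first search over a toy graph and returns one shortest sequence of nodes between a start and a goal. It is an assumption-laden stand-in written in Python, not Neo4j's traversal engine; the node names and the relationship types `HAS_DISEASE` and `TAKES` are hypothetical.

```python
from collections import deque

# Hypothetical toy graph: (source, relationship type, target) triples.
EDGES = [
    ("patient:1", "HAS_DISEASE", "disease:sepsis"),
    ("patient:2", "HAS_DISEASE", "disease:sepsis"),
    ("patient:2", "TAKES", "medicine:heparin"),
]


def build_adjacency(edges):
    """Index edges in both directions so paths can be walked either way."""
    adjacency = {}
    for source, rel_type, target in edges:
        adjacency.setdefault(source, []).append((rel_type, target))
        adjacency.setdefault(target, []).append((rel_type, source))
    return adjacency


def shortest_path(edges, start, goal):
    """Breadth-first search returning one shortest node sequence, or None."""
    adjacency = build_adjacency(edges)
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for _rel_type, neighbour in adjacency.get(path[-1], []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


print(shortest_path(EDGES, "patient:1", "medicine:heparin"))
```

In this toy graph the path runs from the first patient through the shared disease to the second patient and on to that patient's medicine, which is the kind of indirect connection a path query is meant to surface.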

The key constituents of a path query encompass the following [15]:

2.3.2. Cypher Principles and Capabilities

Cypher, a path query language for graph databases, offers several features that facilitate working with graph patterns. It employs pattern matching to express graph patterns, allowing users to specify the relationships and connections they are interested in. Moreover, Cypher incorporates labels, enabling the categorization of nodes and relationships based on specific characteristics or attributes.

In addition to pattern matching and labeling, Cypher provides a wide range of operations to manipulate and analyze graph data. Users can apply filtering, sorting, and aggregations to refine their queries and obtain the desired insights. Furthermore, Cypher supports graph algorithms, allowing for the execution of specialized computations on graph structures.

To ensure efficient query processing, Cypher takes advantage of indexing, caching, and other performance optimizations. These techniques contribute to the scalability and effectiveness of the queries, enabling efficient data retrieval and processing. Overall, Cypher offers a comprehensive set of tools and optimizations that facilitate working with graph databases in a scalable and efficient manner [11].
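
The sketch below shows how a Cypher pattern-matching query might be sent to Neo4j from Python using the official `neo4j` driver. The label names `Patient` and `Disease`, the relationship type `HAS_DISEASE`, and the properties `patientId` and `name` are assumptions made for illustration and may not match the graph model used later in this thesis; the connection details are likewise placeholders supplied by the caller.

```python
# Hypothetical labels, relationship type, and property names; the thesis's
# actual graph model may differ.
DISEASES_OF_PATIENT = """
MATCH (p:Patient {patientId: $patientId})-[:HAS_DISEASE]->(d:Disease)
RETURN d.name AS disease
ORDER BY disease
"""


def diseases_of_patient(uri, user, password, patient_id):
    """Run the pattern-matching query and return the disease names."""
    # Imported lazily so the query text can be inspected without the driver.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            result = session.run(DISEASES_OF_PATIENT, patientId=patient_id)
            return [record["disease"] for record in result]
    finally:
        driver.close()
```

Passing the patient identifier as the `$patientId` parameter, rather than concatenating it into the query text, lets the database reuse the query plan and avoids injection problems.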

2.3.3. Cypher Commands

Interacting with a graph database necessitates the utilization of the following commands for querying purposes [14]:

When it comes to modifying the graph, the following commands are involved in the process:

2.3.4. Use Cases of Neo4j

Some of the most common and prominent use cases of Neo4j include [11]:        

2.3.5. Neo4j Graph Algorithms

The graph algorithms developed by the Neo4j open-source developer community and provided through the Graph Data Science Library are used to compute metrics for graphs, nodes, or relationships. Built on a core set of validated and supported algorithms, they reveal hidden patterns and structures in connected data, covering community detection, centrality, and pathfinding. Many graph algorithms are iterative procedures that traverse the graph repeatedly, using random walks, breadth-first or depth-first searches, or pattern matching. The algorithms provide insight into relevant elements in the graph (centralities, ranking) or inherent structures such as communities (community detection, graph partitioning, clustering) [12].

The Neo4j graph algorithms are categorized into eight categories, which are defined as follows:

The Neo4j Graph Data Science Library offers a comprehensive collection of algorithms categorized into these different problem classes, providing users with a wide range of tools to analyze and extract valuable insights from graph data.

Our thesis emphasizes two categories of algorithms from this library: centrality and similarity. Centrality algorithms enable us to pinpoint the most influential nodes or entities within the healthcare graph, aiding in tasks such as identifying critical patients or diseases. Similarity algorithms, in turn, allow us to uncover hidden patient patterns and connections, providing personalized healthcare recommendations based on shared characteristics and medical histories.

2.3.6. The Concepts of Similarity and Centrality

Similarity is used to determine how closely a node's characteristics and neighborhood match those of another node. Various algorithms are available [18] for calculating similarity scores, such as Cosine (COS), Pearson's Correlation (COR), and the Jaccard similarity index. In this thesis we are particularly interested in the Jaccard similarity index, which is briefly described below. The Jaccard Similarity Index, also known as the Jaccard Coefficient, is a statistical measure that assesses the similarity between two sets by comparing the size of their intersection with the size of their union. The index takes values between 0 and 1 and quantifies the degree of overlap between the two sets, with higher values indicating greater similarity. It is widely used in fields such as data mining, information retrieval, and bioinformatics to analyze and compare datasets. The index can be traced back to the early 20th century, when the Swiss botanist Paul Jaccard introduced it as a method to measure the floristic similarity between two geographic regions based on the presence or absence of plant species. Jaccard's work laid the foundation for this similarity measure, and over time it has found applications in diverse areas beyond botany. With the growth of data analytics and machine learning, the Jaccard Similarity Index has become a common tool for data scientists and researchers to gauge the similarity between datasets, identify patterns, and make informed decisions.
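
For two sets A and B, the Jaccard index is J(A, B) = |A ∩ B| / |A ∪ B|. The short Python sketch below computes it for two hypothetical patients described by their sets of diagnoses. The diagnosis names are invented, and returning 0.0 when both sets are empty is a convention assumed here, since the ratio is undefined in that case.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; defined here as 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical diagnosis sets for two patients.
patient_a = {"sepsis", "pneumonia", "hypertension"}
patient_b = {"sepsis", "hypertension", "diabetes"}

print(jaccard(patient_a, patient_b))  # 2 shared / 4 distinct = 0.5
```

The same function applies unchanged to sets of medicines or measurements, which is what makes the index convenient for comparing patients along several dimensions.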

The concept of centrality is rooted in the study of human communication within small groups and was introduced to assess the potential relationship between structural centrality and influence within group dynamics. This early pioneering research paved the way for subsequent studies, although their findings were often confusing or contradictory. Since then, numerous algorithms and methodologies have been developed to identify nodes of significance within graphs [17], the most prominent being Degree Centrality, Closeness Centrality, Betweenness Centrality, Eigenvector Centrality, and PageRank. PageRank is a derivative of Eigenvector Centrality that evaluates node influence by considering not only a node's immediate neighbors but also the neighbors of those neighbors. The algorithm was conceived in 1996 [19] as a means of establishing web content rankings by prioritizing link popularity. The foundational description of PageRank, a pivotal component of the Google search engine, was presented in 1998. Initially, all nodes held an identical PageRank score, set at 1; in later versions, each node started with a value between 0 and 1. During each iteration, a node distributes its influence score equally among the nodes it links to, so the score transferred via one outbound link equals the node's score divided by its total number of outbound links. Notably, PageRank omits self-links and treats multiple connections between the same pair of nodes as one. Nodes with zero out-degree are assumed to connect to all other nodes in the graph. To account for real-world behavior, PageRank introduces a damping factor representing the probability that a user continues by clicking on another link. Typically, this factor is set around 0.85. While PageRank shares commonalities with Eigenvector Centrality, its emphasis on in-degrees makes it better suited to directed graphs, although it can also be applied to undirected graphs by treating each edge as bidirectional [17].
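
The iteration described above can be sketched in a few lines of Python. This is a simplified power-iteration version under stated assumptions: scores are normalized to sum to 1, the damping factor defaults to 0.85, self-links and duplicate links are dropped, and nodes without outgoing links spread their score over all nodes. It is not the implementation used by the Neo4j Graph Data Science Library, and the three-node graph is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {node: [nodes it links to]}."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    # Drop self-links and duplicate links, as described in the text.
    out = {v: {t for t in links.get(v, []) if t != v} for v in nodes}
    score = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Nodes without outgoing links share their score with every node.
        dangling = sum(score[v] for v in nodes if not out[v])
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for t in out[v]:
                new[t] += damping * score[v] / len(out[v])
        score = new
    return score


# Hypothetical toy graph: A and B both point to C, and C points back to A.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
print(sorted(ranks, key=ranks.get, reverse=True))  # ['C', 'A', 'B']
```

C ranks highest because it receives links from both other nodes, A follows because it is linked from the highly ranked C, and B, which nothing links to, keeps only the baseline share.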

2.3.7. Neo4j in Healthcare

Neo4j's ability to handle complex, interconnected data makes it a valuable tool for managing and analyzing healthcare data and for uncovering hidden insights. According to Neo4j's website, many large healthcare companies use Neo4j to analyze their data.

In the healthcare domain, Neo4j offers several advantages that contribute to enhanced data management and decision-making processes. Firstly, its Cypher querying language facilitates efficient and real-time analysis by providing a powerful tool for querying and navigating the graph. This enables healthcare professionals to explore complex relationships within the data and derive meaningful insights.

Moreover, Neo4j's graph data model proves valuable in healthcare as it allows for the representation and exploration of intricate connections between various healthcare entities such as patients, medications, and diseases. This capability enables a comprehensive understanding of the relationships and dependencies within the data, ultimately leading to actionable insights and improved patient care.

Furthermore, Neo4j can be leveraged to develop clinical decision support systems. By utilizing the graph database, personalized treatment recommendations can be generated based on individual patient profiles. The system can also assist in detecting drug interactions, predicting disease progression, and facilitating more informed clinical decision-making.

Overall, Neo4j's efficient querying capabilities, graph modeling approach, and potential for clinical decision support make it a valuable tool in the healthcare industry, enabling advanced analysis and the provision of personalized and effective care.

  1. Thesis Workflow

The thesis workflow is depicted in Figure 2. It begins with downloading the open-source MIMIC-III dataset, which serves as the foundation for the subsequent data modeling process. Once the dataset is obtained, the focus shifts to data modeling, where the structure and relationships within the dataset are defined. This step involves identifying the relevant entities and attributes and establishing the connections between them. After the data modeling phase, the prepared data is loaded into the Neo4j graph database, which provides a robust and scalable platform for storing and managing graph data. With the data loaded, the nodes and relationships are created. Nodes represent the entities from the dataset, while relationships define the connections and associations between the nodes. This step ensures that the data is organized in a graph structure that accurately reflects the relationships within the dataset.

Figure 2. Workflow of the thesis.

Once the nodes and relationships are established, Cypher queries are executed within Neo4j. Cypher is a powerful query language specifically designed for querying and manipulating graph data. These queries can perform various operations, such as retrieving specific nodes or relationships, filtering data based on specific criteria, aggregating information, and performing advanced graph algorithms.

The output of the Cypher queries can be displayed in multiple formats. One option is graph visualization, where the results are presented in a visual representation of the graph structure, providing a clear view of the relationships and patterns within the data. Alternatively, the output can be displayed in a tabular format, presenting the queried data in a structured and organized manner.

Overall, the thesis workflow encompasses downloading and modeling the dataset, loading it into Neo4j, creating nodes and relationships, running Cypher queries, and presenting the results either as a graph visualization or in tabular format. This systematic process allows for efficient analysis and exploration of the MIMIC-III dataset using the capabilities of Neo4j as a graph database.


  1. Data Collection and Preprocessing

In this section, we describe the data collection and preprocessing steps for the demo version of the MIMIC-III dataset, which includes data from 100 patients. The demo dataset can be obtained from PhysioNet, a well-known platform for sharing open-source physiological data and related resources.

  1. Data Source

The demo MIMIC-III dataset (available from link) provides a condensed version of the comprehensive patient data available in the full dataset. It encompasses essential information related to patient demographics, clinical measurements, laboratory results, medication records, procedures, and more. Accessible through the PhysioNet platform (available from link), this dataset serves as a valuable resource for preliminary analysis and exploration of critical care medicine data.

  1. Data Collection Process

To access the demo MIMIC-III dataset, we downloaded it from the PhysioNet platform. PhysioNet hosts a variety of publicly available datasets, including the demo version of MIMIC-III, which can be freely obtained for thesis and educational purposes. From PhysioNet, we obtained nine CSV files, which are illustrated in the Appendices. These CSV files were used for the implementation of this thesis: patients.csv, admissions.csv, icu_stays.csv, d_item.csv, inputevents_cv.csv, inputevents_mv.csv, outputs.csv, d_icu_diagnoses.csv, and diagnoses_icd.csv. Together they contain the patient data, admission records, ICU stays, medicines, measurements, and diagnosis details that form the foundation of our exploratory analysis. Each file is described in more detail below.

  1. Data Preprocessing

In the data preprocessing stage, the focus is on preparing the dataset obtained from PhysioNet for analysis using Neo4j. The process involves two key steps: data transformation and loading, followed by data filtering and quality assurance.

To begin, the dataset is downloaded from the PhysioNet website and subsequently transformed and loaded into the Neo4j graph database. This transformation entails converting the dataset, which is initially in CSV format, into a format that is compatible with Neo4j's graph database structure. This step ensures efficient storage and facilitates seamless querying of the data within the graph database environment.

Once the data is loaded, filters are applied to ensure the quality and integrity of the dataset. Incomplete or erroneous records are addressed, and missing values are handled appropriately to maintain data consistency and reliability within the graph database. This data filtering and quality assurance step plays a crucial role in ensuring that only valid and reliable data is utilized in subsequent analyses.
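The kind of filtering described above can be sketched in a few lines of Python. In the thesis the filters are applied inside Neo4j, so this is only an illustration of the idea; the function name `clean_rows`, the column names, and the sample rows are hypothetical and are not taken from the dataset.

```python
import csv
import io

# Hypothetical excerpt of a CSV file; the rows are made up for illustration.
RAW_CSV = """subject_id,hadm_id,diagnosis
1,100,SEPSIS
2,,PNEUMONIA
,102,FEVER
3,103,
"""


def clean_rows(text, required):
    """Drop rows that lack a required field and map empty optional fields to None."""
    kept = []
    for row in csv.DictReader(io.StringIO(text)):
        values = {col: (val or "").strip() for col, val in row.items()}
        if any(not values[col] for col in required):
            continue  # incomplete record: a required identifier is missing
        kept.append({col: val or None for col, val in values.items()})
    return kept


rows = clean_rows(RAW_CSV, required=("subject_id", "hadm_id"))
```

Here the two rows without a patient or admission identifier are discarded, while the row with a missing diagnosis is kept and its empty value is stored as `None`.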

By undergoing these data preprocessing steps, including data transformation, loading, filtering, and quality assurance, the dataset becomes well-prepared for further analysis and exploration within the Neo4j graph database.

  1. Data Preprocessing Tools and Techniques

The data preprocessing approach for the demo MIMIC-III dataset uses Neo4j as the primary tool for data organization, filtering, and manipulation. Neo4j's capabilities, coupled with the availability of the dataset in CSV format from PhysioNet, streamline the preprocessing workflow within the graph database environment. By using PhysioNet as the data source and Neo4j's filtering and manipulation functionalities, the preprocessing phase ensures the dataset's suitability for subsequent analysis and exploration.

In summary, the data collection process involves obtaining the demo MIMIC-III dataset from PhysioNet, a reputable platform for sharing open-source physiological data. The dataset is then preprocessed using Neo4j for data organization and manipulation. This combination of resources and tools allows for efficient data preprocessing and sets the stage for further analysis within the graph database environment.

  1. Graph Data Model and Database Implementation

In this section, we present the development and implementation of the graph data model for the MIMIC-III demo dataset. The graph data model, as depicted in Figure 3, serves as the foundational structure for representing and organizing critical care medical data.

Figure 3. The Graph Data Model for MIMIC III.

Leveraging Neo4j's graph database and the Arrows.app graph data modeling tool (available from link), developed by Neo4j Labs, we designed an intuitive and comprehensive model that captures the intricate relationships between the various data entities. The implementation phase is critical because it converts the designed graph data model into a fully functioning database ready for exploration and analysis. Our first step is to configure the database environment in Neo4j Desktop (available from link). We then install the Graph Data Science Library plugin (see section 2.3.5), which provides advanced graph algorithms and analytics capabilities. With the necessary infrastructure in place, we load the CSV files containing the MIMIC-III demo dataset into our Neo4j graph database and create the nodes and relationships that reflect the network of critical care data entities in our model. This implementation allows us to navigate the data and uncover insights during the subsequent stages of the thesis.

The graph data model comprises several node types, each representing specific data entities, and relationships that indicate associations between nodes. Let's delve into the details of each component:

In conclusion, this section has provided a detailed exposition of the graph data model developed for the MIMIC-III demo dataset. By leveraging Neo4j's powerful graph database and the intuitive Arrows.app graph data modeling tool, we have successfully designed a comprehensive model that captures the intricate relationships between critical care medicine data entities.

  1. Database Creation and Configuration

Establishing and configuring a Neo4j graph database involves the steps briefly outlined below. First, the Neo4j Desktop application is installed, a user-friendly tool that facilitates database management. Next, a new database instance is created by assigning it a name and choosing a secure password. Two plugins are then installed: the Graph Data Science Library, which offers advanced graph analytics, and the APOC Library, which provides a comprehensive set of procedures and functions that expand the capabilities of the Cypher query language. Finally, the database is started so that data can be loaded and manipulated.

  1. Data Loading and Graph Creation with Cypher

Once the database is created and configured, the next step is to load the data and create the graph nodes and relationships with Cypher queries in Neo4j. We used the CSV files of the MIMIC-III demo dataset to create nodes for entities such as patients, admissions, diseases, medicines, measurements, and ICU stays, together with their corresponding relationships. By running the Cypher queries depicted in Figure 4 step by step, we load and create all the necessary nodes and relationships in our Neo4j database. The result is a comprehensive graph that supports efficient healthcare data analysis and exploration.

Figure 4. Data Loading and Graph Creation.
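To make the idea of turning CSV rows into nodes and relationships concrete, the following Python sketch builds a small in-memory graph. It is not the Cypher used in the thesis; the helper `merge_node`, the column names, and the sample rows are assumptions, and only the `HAS_ADMISSION` relationship type is taken from the graph model discussed in this thesis.

```python
import csv
import io

# Hypothetical file contents; real files have more columns and rows.
PATIENTS_CSV = "subject_id,gender\n1,F\n2,M\n"
ADMISSIONS_CSV = "subject_id,hadm_id\n1,100\n1,101\n2,200\n"

nodes = {}          # (label, id) -> property dict
relationships = []  # (start node key, relationship type, end node key)


def merge_node(label, node_id, **props):
    """Create the node if it is absent, otherwise update its properties."""
    nodes.setdefault((label, node_id), {}).update(props)
    return (label, node_id)


# One Patient node per row of the patients file.
for row in csv.DictReader(io.StringIO(PATIENTS_CSV)):
    merge_node("Patient", row["subject_id"], gender=row["gender"])

# One Admission node per row, linked to its patient.
for row in csv.DictReader(io.StringIO(ADMISSIONS_CSV)):
    patient = merge_node("Patient", row["subject_id"])
    admission = merge_node("Admission", row["hadm_id"])
    relationships.append((patient, "HAS_ADMISSION", admission))
```

Creating a node only when it does not already exist keeps the load repeatable: processing the same file twice does not duplicate patients.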


  1. Querying, Analyzing and Exploring the Dataset

In this section, we explore the healthcare data using Neo4j's Cypher query language (see section 2.3.2) and graph algorithms (see section 2.3.5). By combining these tools, we can perform complex graph-based queries, gain insights into patterns, and explore relationships between different entities. This exploration helps us make informed decisions, discover hidden patterns, and derive valuable knowledge from our healthcare database. We start with illustrative queries that give an overview of the mechanics of Neo4j’s Cypher query language, and then delve into the more complex topics of patient similarity analysis and patient influence analysis. For the patient similarity analysis we developed custom algorithms, while for the patient influence analysis we used graph algorithms provided by Neo4j’s Graph Data Science Library (see section 2.3.5).

  1. Basic Data Retrieval

We begin with fundamental Cypher queries that fetch key data from our Neo4j graph database. The queries show how to retrieve patient details, admission records, measurements, medicines, and disease details.

Query 1: What are the admissions, diagnosed diseases, measurements, and medications associated with the patient whose patientId is 40601?

Figure 5. Query 1 of the section 5.1.

The query, as depicted in Figure 5, consists of three consecutive MATCH clauses, each creating a distinct path: p1 represents the patient’s connections to admissions and diagnosed diseases through HAS_ADMISSION and DIAGNOSED_WITH relationships, p2 represents the patient’s measurements through the HAS_DONE relationship, and p3 represents the patient’s medicines through the HAS_TAKEN relationship. The RETURN clause returns the three paths p1, p2, and p3, providing a holistic view of the patient’s interconnected healthcare data. The output of the query is shown in Figure 6.

Figure 6. Output for query 1 of the section 5.1.

Query 2: Which patients share common diagnosed diseases?

Figure 7. Query 2 of the section 5.1.

The query, as depicted in Figure 7, retrieves pairs of patients (patient1 and patient2) who share common diseases. It starts by matching patients with their corresponding admissions and diagnosed diseases. Then, it finds another patient (patient2) who also has admissions with the same diseases. The query collects and returns the unique disease titles shared between patient1 and patient2. Part of the output of the query is shown in Figure 8.

Figure 8. Part of output for query 2 of the section 5.1.
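The logic of Query 2 can be mirrored outside the database with a short Python sketch. It assumes the diagnoses have already been collected into one set of disease titles per patient; the patient identifiers, disease titles, and the function name `shared_diseases` are hypothetical.

```python
from itertools import combinations

# Hypothetical patient -> diagnosed disease titles (not taken from the dataset).
diagnoses = {
    "p1": {"Sepsis", "Pneumonia"},
    "p2": {"Pneumonia", "Hypertension"},
    "p3": {"Asthma"},
}


def shared_diseases(diagnoses):
    """Return {(patient1, patient2): sorted shared diseases} for pairs that overlap."""
    result = {}
    for first, second in combinations(sorted(diagnoses), 2):
        common = diagnoses[first] & diagnoses[second]
        if common:
            result[(first, second)] = sorted(common)
    return result


pairs = shared_diseases(diagnoses)
```

Using `combinations` visits each unordered pair once, so a pair is not reported twice in swapped order.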

Query 3: What are the common diagnosed diseases and medicines taken by two specific patients with patientId 40601 and 44212?

Figure 9. Query 3 of the section 5.1.

The query, as depicted in Figure 9, focuses on two specific patients with patientId 40601 and 44212. It provides a comprehensive view of their healthcare data by retrieving their diagnosed diseases (g1) and the medicines they have taken (g2 for patient1 and g3 for patient2). The query returns three paths: g1 represents the common diseases shared between the two patients, while g2 and g3 represent the medicines each patient has taken. This detailed data exploration allows for a deeper understanding of the medical histories and treatments of these specific patients. The output of the query is shown in Figure 11.

Query 4: What are the common diagnosed diseases and measurements taken by two specific patients with patientId 40601 and 44212?

Figure 10. Query 4 of the section 5.1.

The query, as depicted in Figure 10, focuses on two specific patients with patientId 40601 and 44212. It provides a comprehensive view of their healthcare data by retrieving their diagnosed diseases (g1) and the measurements they have done (g2 for patient1 and g3 for patient2). The query returns three paths: g1 represents the common diseases shared between the two patients, while g2 and g3 represent the measurements each patient has done. This detailed data exploration allows for a deeper understanding of the medical histories and treatments of these specific patients. The output of the query is shown in Figure 12.

Figure 11. Output for query 3 of the section 5.1.

Figure 12. Output for query 4 of the section 5.1.

Query 5: What are the common diagnosed diseases in patients with patientId 40601 and patientId 44212 during their hospital admissions, and what medications have been taken, as well as what measurements have been performed on these patients during their hospital stays?

Figure 13. Query 5 of the section 5.1.

The query, as depicted in Figure 13, performs a complex graph traversal to find patterns between two patients with patientId 40601 and 44212 in terms of diagnosed diseases, medicines taken, and measurements done during their hospital admissions. The first MATCH clause (p1) identifies a path between the two patients, traversing through the common diseases (Disease nodes) diagnosed during their respective admissions (Admission nodes). The WHERE clause restricts the results to the patient pair with patientId 40601 and 44212. The second MATCH clause (p2) identifies a path between the first patient and the medicines (Medicine nodes) taken during the hospital stay. The third MATCH clause (p3) does the same for the second patient. These paths include relationships representing the medications taken and the measurements performed (Measurement nodes). The RETURN statement outputs the graph paths (p1, p2, and p3), which contain the patients' diagnosed diseases, medicines taken, and measurements performed during their admissions. The output of the query is shown in Figure 14.

  1. Patient Similarity Analysis

In this section, we delve into the creation of custom Jaccard similarity (see section 2.3.6) algorithms tailored for healthcare data analysis. These algorithms, enumerated as Algorithm 1 through Algorithm 6, are designed to quantify the similarity between patients based on various aspects of their health records, including diagnosed diseases, medications administered, and measurements recorded. The resulting Jaccard similarity scores are crucial for understanding the closeness of patients’ medical profiles, enabling personalized treatment recommendations. Each algorithm is accompanied by a detailed explanation and its corresponding Cypher query, as illustrated in Figures 15 to 26. Following this, we present a practical application of these algorithms in the context of personalized treatment recommendations for a newly admitted patient, demonstrating the tangible impact of patient similarity analysis within a healthcare setting.

Figure 14. Output for query 5 of the section 5.1.

  1. Creating Custom Jaccard Similarity Algorithms

Algorithm 1: The algorithm finds the Jaccard Similarity between patients based on their diagnosed diseases.

Figure 15. Jaccard similarity algorithm for diseases.

The query, as depicted in Figure 15, analyzes patients’ disease diagnoses within the Metavision data source in order to calculate Jaccard similarity scores between pairs of patients based on their shared diseases. It begins by matching patients with admissions linked to disease diagnoses and collecting the disease IDs into lists. Patients and their corresponding disease lists are then gathered into collections for comparison. Pairs of patients are generated, and their disease lists are intersected and united to obtain the shared and total diseases. Jaccard similarity scores are computed as the shared disease count divided by the total disease count, with a filter that keeps only non-zero similarities. The query returns patient pairs along with their Jaccard similarity scores, shared disease counts, and total disease counts, enabling the identification of patients with similar disease profiles within the Metavision data source. Part of the result of the algorithm is shown in Figure 16.

Figure 16. Part of the result for the Jaccard similarity algorithm for diseases.
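The computation behind Algorithm 1 follows the Jaccard definition J(A, B) = |A ∩ B| / |A ∪ B|, where A and B are the disease sets of two patients. A minimal Python sketch of the same pairwise procedure is given below; it is not the Cypher query of Figure 15, and the function name `pairwise_jaccard`, the patient identifiers, and the disease identifiers are hypothetical.

```python
from itertools import combinations


def pairwise_jaccard(items_by_patient):
    """Return (p1, p2, score, shared, total) for every pair with non-zero similarity."""
    rows = []
    for p1, p2 in combinations(sorted(items_by_patient), 2):
        a, b = items_by_patient[p1], items_by_patient[p2]
        shared = len(a & b)  # size of the intersection
        total = len(a | b)   # size of the union
        if shared:           # keep only non-zero similarities
            rows.append((p1, p2, shared / total, shared, total))
    # Most similar pairs first.
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical patient -> set of disease identifiers.
diseases = {
    "p1": {"d1", "d2", "d3"},
    "p2": {"d2", "d3", "d4"},
    "p3": {"d3"},
    "p4": {"d9"},
}
result = pairwise_jaccard(diseases)
```

For p1 and p2 the shared diseases are d2 and d3 and the union holds four diseases, so the score is 2 / 4 = 0.5. Patient p4 shares nothing with the others and therefore does not appear in the result. The same function applies unchanged to medicine or measurement sets.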

Algorithm 2: The algorithm finds the Jaccard Similarity between patients based on the medicines they have taken.

Figure 17. Jaccard similarity algorithm for medicines.

The query, as depicted in Figure 17, explores patients’ medication histories within the Metavision data source in order to calculate Jaccard similarity scores between pairs of patients based on their shared medications. It begins by matching patients with medicine administrations and collecting the medication labels into lists. Patients and their respective medication lists are then aggregated for comparison. Pairs of patients are generated, and their medication lists are intersected and combined to derive shared and total medications. Jaccard similarity scores are computed using the counts of shared and total medications, with a filter that keeps only non-zero similarities. The query returns patient pairs alongside their Jaccard similarity scores, shared medication counts, and total medication counts. Part of the result of the algorithm is shown in Figure 18.

Figure 18. Part of the result for the Jaccard similarity algorithm for medicines.

Algorithm 3: The algorithm finds the Jaccard Similarity between patients based on the measurements they have done.

Figure 19. Jaccard similarity algorithm for measurements.

Figure 20. Part of the result for the Jaccard similarity algorithm for measurements.

The query, as depicted in Figure 19, analyzes patients’ measurement records within the Metavision data source in order to compute Jaccard similarity scores between pairs of patients based on their shared measurements. The query starts by matching patients with recorded measurements and collecting the measurement names into lists. The measurements and patients are then grouped to facilitate subsequent comparisons. Patient pairs are generated, and their measurement lists are intersected and combined to determine shared and total measurements. Jaccard similarity scores are calculated using the counts of shared and total measurements, keeping only non-zero similarities. The query returns pairs of patients alongside their Jaccard similarity scores, shared measurement counts, and total measurement counts. Part of the result of the algorithm is shown in Figure 20.

Algorithm 4: The algorithm finds the Jaccard Similarity between patients based on their diagnosed diseases and the medicines they have taken.

Figure 21. Jaccard similarity algorithm for diseases and medicines.

The query, as depicted in Figure 21, analyzes patients’ health records within the Metavision data source in order to calculate Jaccard similarity scores between pairs of patients based on shared diseases and medicines. The query starts by matching patients with diagnosed diseases and collects the disease identifiers into lists. Patient-disease pairs are grouped to facilitate comparisons. Similarly, patients are matched with the medicines they have taken, and their medicine labels are collected into lists. Patient pairs are then generated, and their disease lists are intersected and combined to determine shared and total diseases. The same process is repeated for medicines. The query then calculates Jaccard similarity scores based on shared and total diseases and medicines, and only non-zero similarities are considered. Finally, the query returns patient pairs alongside their Jaccard similarity scores, shared counts of diseases and medicines, and total counts. Part of the result of the algorithm is shown in Figure 22.

Figure 22. Part of the result for the Jaccard similarity algorithm for diseases and medicines.

Algorithm 5: The algorithm finds the Jaccard Similarity between patients based on their diagnosed diseases and the measurements they have done.

Figure 23. Jaccard similarity algorithm for diseases and measurements.

The query, as depicted in Figure 23, analyzes patients’ health records within the Metavision data source, focusing on calculating Jaccard similarity scores between pairs of patients based on shared diseases and measurements. The query begins by matching patients with diagnosed diseases and collecting the disease identifiers into lists. Patient-disease pairs are grouped to facilitate comparisons. Similarly, patients are matched with the measurements they have undergone, and their measurement names are collected into lists. Patient pairs are generated, and their disease lists are intersected and combined to determine shared and total diseases. The same process is repeated for measurements. The query then calculates Jaccard similarity scores based on shared and total diseases and measurements, considering only non-zero similarities. Finally, the query returns patient pairs alongside their Jaccard similarity scores, shared counts of diseases and measurements, and total counts. Part of the result of the algorithm is shown in Figure 24.

Figure 24. Part of the result for the Jaccard similarity algorithm for diseases and measurements.

Algorithm 6: The algorithm finds the Jaccard Similarity between patients based on their diagnosed diseases, the measurements they have done, and the medicines they have taken.

The query, as depicted in Figure 25, analyzes patient health data in the context of the Metavision data source. The goal is to compute the Jaccard similarity scores between pairs of patients based on shared diseases, medications, and measurements. The query begins by matching patients with diagnosed diseases and collecting disease identifiers into lists. Patient-disease pairs are generated to enable comparisons. Next, patients are matched with the medications they have taken, and the medication labels are collected into lists. Similarly, patients are associated with the measurements they have undergone, and the measurement names are collected into lists. Pairs of patients are generated, and their shared and total disease, medication, and measurement lists are computed. Shared and total counts are determined for each category. Jaccard similarity scores are calculated based on shared and total counts, and only non-zero similarities are retained. Finally, the query returns patient pairs along with their Jaccard similarity scores, shared counts of diseases, medications, and measurements, and total counts. Part of the result of the algorithm is shown in Figure 26.

Figure 25. Jaccard similarity algorithm for diseases, medicines, and measurements.

Figure 26. Part of the result for the Jaccard similarity algorithm for diseases, medicines, and measurements.
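The text above does not spell out how the per-category counts are merged into a single score. One possible reading, shown here purely as an assumption, is to pool the categories: each item is tagged with its category, and the Jaccard similarity is taken over the pooled sets, which equals the sum of the shared counts divided by the sum of the total counts. The function names and patient profiles below are hypothetical, and the query in Figure 25 may combine the counts differently.

```python
def tagged(profile):
    """Flatten {category: set(items)} into a set of (category, item) pairs."""
    return {(category, item) for category, items in profile.items() for item in items}


def combined_jaccard(profile_a, profile_b):
    """Jaccard similarity over the pooled, category-tagged items of two patients."""
    a, b = tagged(profile_a), tagged(profile_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical patient profiles.
patient_a = {"diseases": {"d1", "d2"}, "medicines": {"m1"}, "measurements": {"x1", "x2"}}
patient_b = {"diseases": {"d1"}, "medicines": {"m1", "m2"}, "measurements": {"x2", "x3"}}

score = combined_jaccard(patient_a, patient_b)
```

Here the two patients share one disease, one medicine, and one measurement out of seven distinct items overall, so the pooled score is 3 / 7. Tagging items with their category prevents a disease identifier and a medicine label that happen to look alike from being counted as a match.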

  1. Treatment Recommendation for a Newly Hospitalized Patient

In a modern hospital setting, a new patient, John (patientId: 40504), is admitted for treatment. John's medical condition is intricate and unique, making it challenging for the medical team to devise an effective treatment plan. To address this, the hospital has implemented a sophisticated healthcare data system powered by Neo4j and Cypher, which leverages the patient similarity analysis algorithm that was provided in the previous section. Below are the steps the system follows:

  1. Patient Admission and Disease Profiling: When John is admitted, his patient record is created within the hospital's database. The system records his age, gender, and medical history, including a list of diagnosed diseases. These diseases are associated with unique disease IDs in the database.
  2. Calculating Patient Similarity: The system utilizes the Jaccard similarity algorithm to compare John's disease profile with those of other patients in the hospital's database. This algorithm considers the overlap of diagnosed diseases between John and every other patient. It quantifies the similarity score, revealing how closely related John's medical condition is to other patients.
  3. Identifying the Most Similar Patient: The system identifies a patient with the highest similarity score to John. This patient shares a substantial portion of diagnosed diseases with John, indicating a strong similarity in their medical conditions.
  4. Personalized Treatment Recommendation: Based on the patient similarity analysis, the system recommends a set of measurements and medicines to John. These recommendations are derived from the treatment history and outcomes of the patient identified as most similar to John.
  5. Measuring the Impact: John's healthcare team follows the system's recommendations and, over time, observes improvements in John's condition, showcasing the effectiveness of personalized treatment based on patient similarity.
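Steps 2 to 4 above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the Cypher code of the thesis: the function names `jaccard` and `recommend`, the patient identifiers, and all record values are made up, and ties between equally similar patients are resolved by taking the first candidate.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(new_id, diseases, medicines, measurements):
    """Return (most similar patient, score, medicines, measurements) for new_id."""
    candidates = [p for p in diseases if p != new_id]
    # Step 2 and 3: score every other patient and keep the most similar one.
    best = max(candidates, key=lambda p: jaccard(diseases[new_id], diseases[p]))
    score = jaccard(diseases[new_id], diseases[best])
    # Step 4: recommend what the most similar patient received.
    return (
        best,
        score,
        sorted(medicines.get(best, ())),
        sorted(measurements.get(best, ())),
    )


# Hypothetical records; identifiers and values are invented for illustration.
diseases = {"new": {"d1", "d2"}, "a": {"d1", "d2"}, "b": {"d2", "d5"}}
medicines = {"a": {"med_x", "med_y"}, "b": {"med_z"}}
measurements = {"a": {"heart_rate"}, "b": {"temperature"}}

best, score, meds, measures = recommend("new", diseases, medicines, measurements)
```

In this example patient "a" has exactly the same diagnosed diseases as the new patient, so the similarity score is 1.0 and the medicines and measurements of "a" are returned as the recommendation.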

Figure 27. Personalized treatment recommendation for a new hospitalized patient using patient similarity.

The Cypher code, depicted in Figure 27, describes the steps required to calculate patient similarity and recommend measurements and medications for the new patient, John, based on the most similar patient in the database. It demonstrates the practical application of Neo4j and Cypher in personalized healthcare. Figure 28 displays a screenshot of the query result, which shows the recommended measurements and medicines for the newly admitted patient, based on his similarity to another patient's medical history and diagnoses. As shown in Figure 16, the patient with patientId 40503 has a similarity score of 1 with John (patientId: 40504).

Figure 28. John's recommended medicines and measurements.

  1. Patient and Disease Influence Analysis

In this section, we explore the Centrality algorithms (see section 2.3.6) provided by Neo4j's Graph Data Science Library (see section 2.3.5), and specifically PageRank, as a means of improving care delivery and resource allocation. In the first subsection, we create two tailored PageRank algorithms that uncover influential roles within healthcare networks. These algorithms are designed to gauge the impact of patients and caregivers, providing insights into the web of interactions that underpins effective healthcare systems. The subsequent subsection presents a real-world scenario that demonstrates the practical application of these algorithms. We show how the PageRank algorithm can enhance patient care coordination upon admission, with the aim of improving healthcare service quality and patient outcomes. In this way, PageRank supports more patient-centric healthcare systems and streamlined care delivery.

  1. Creating PageRank Algorithms

Algorithm 1: The algorithm finds the influence of each patient within the healthcare data network.

Figure 29. PageRank patient influence algorithm.

The Cypher code in Figure 29 assesses the influence of patients within the healthcare network. It begins by gathering the patients admitted to the hospital and the diseases they were diagnosed with during these admissions. It then projects a subgraph named ‘patientInfluence’ from the overall graph, which captures the relationships between patients and diseases through their admissions. After projecting this graph, the code computes PageRank scores using the ‘gds.pageRank.stream’ procedure. The ‘scaler’ parameter is set to ‘MEAN’, so the final PageRank scores are normalized using mean scaling. The algorithm thus measures the influence of patients in the context of their connections to specific diseases. The result is a list of patients and their respective PageRank scores, with higher scores indicating greater influence. Part of the output is shown in Figure 30.

Figure 30. Part of output for PageRank patient influence algorithm.
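
To make the computation more concrete, the following is a minimal Python sketch of PageRank followed by mean scaling. It is a textbook power iteration, not the Graph Data Science implementation or the query from Figure 29. The toy patient-disease edges are hypothetical, and the scaling formula (score minus mean, divided by the range) is an assumption about how the ‘MEAN’ scaler behaves.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration; the returned scores sum to 1."""
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
    scores = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly over all nodes.
        dangling = sum(scores[node] for node in nodes if not out_links[node])
        new_scores = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for source, targets in out_links.items():
            for target in targets:
                new_scores[target] += damping * scores[source] / len(targets)
        scores = new_scores
    return scores


def mean_scale(scores):
    """Assumed mean scaling: (score - mean) / (max - min), giving values in [-1, 1]."""
    values = list(scores.values())
    spread = max(values) - min(values)
    mean = sum(values) / len(values)
    if spread == 0:
        return {key: 0.0 for key in scores}
    return {key: (value - mean) / spread for key, value in scores.items()}


# Hypothetical patient-disease links, stored in both directions (undirected graph).
links = [("p1", "d1"), ("p2", "d1"), ("p2", "d2"), ("p3", "d2")]
edges = links + [(target, source) for source, target in links]

scores = pagerank(edges)
scaled = mean_scale(scores)
for node in sorted(scaled, key=scaled.get, reverse=True):
    print(node, round(scaled[node], 3))
```

In this toy graph, patient p2 is linked to two diseases and therefore receives a higher score than p1 or p3, which are each linked to one.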

Algorithm 2: The algorithm finds the influence of each caregiver within the healthcare data network.

Figure 31. PageRank caregiver influence algorithm.

The Cypher code in Figure 31 evaluates the influence of caregivers within the healthcare network. It begins by identifying caregivers (the ‘source’ nodes) who care for patients (the ‘target’ nodes). A subgraph named ‘caregiverInfluence’ is then projected from the broader graph, capturing caregiver-patient interactions during patient care, and the node count and relationship count of the projected graph are reported. Using the PageRank algorithm with the ‘scaler’ parameter set to ‘MEAN’, the query calculates influence scores for caregivers based on their connections with patients. The query keeps only caregivers with non-null caregiver IDs, so that the focus remains on actively involved caregivers. The final output is a list of caregivers ranked by influence score, which healthcare administrators can use when matching patients with influential caregivers. Part of the output is shown in Figure 32.

Figure 32. Part of output for PageRank caregiver influence algorithm.

  1. Recommending Influential Caregivers for New Patients

In modern healthcare systems, the efficient allocation of caregivers to patients plays a pivotal role in ensuring high-quality healthcare delivery. To address this, a data-driven approach is proposed that utilizes the PageRank algorithm within a healthcare graph database. This algorithm helps identify the most influential caregivers for patients upon admission to a healthcare facility. The following scenario outlines the systematic application of this approach to enhance patient care coordination.

  1. Patient Admission: The scenario commences with a new patient's admission to a healthcare facility. During the admission process, comprehensive patient data is recorded within the healthcare graph database, including patient identification, medical history, and relevant healthcare information.
  2. Calculating Patient Influence: Upon admission, the PageRank algorithm is run to calculate the patient's influence score within the healthcare network, based on the patient's disease diagnoses. Patients whose influence scores exceed a predefined threshold, such as 0.5, are considered to have a substantial impact on the network.
  3. Finding Influential Caregivers: Subsequently, another Cypher query utilizes the PageRank algorithm. However, this time, it is applied to identify the most influential caregivers within the network. The query takes into account the relationships between patients and caregivers, focusing on the interactions that contribute to a caregiver's influence score.
  4. Recommendation for Patient Care: The query in the previous step returns the caregivers whose influence scores are greater than 0.5. These caregivers are then recommended for the care of the newly admitted patient. The assumption here is that caregivers with higher influence scores are more likely to positively impact the patient's health and recovery.
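
The threshold logic in the steps above can be sketched in a few lines of Python. This is an illustration rather than the Cypher query from Figure 33: the identifiers and scores are hypothetical, and the influence scores are assumed to have been computed already. The text does not specify how the patient result and the caregiver result are combined, so the sketch keeps them separate.

```python
THRESHOLD = 0.5  # example threshold named in the scenario


def above_threshold(scores, threshold=THRESHOLD):
    """Return (id, score) pairs with score strictly above the threshold, highest first."""
    kept = [(key, score) for key, score in scores.items() if score > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical, already-computed influence scores (placeholder identifiers).
patient_scores = {"patient_new": 0.62, "patient_a": 0.10, "patient_b": -0.31}
caregiver_scores = {"caregiver_a": 0.95, "caregiver_b": 0.48, "caregiver_c": 0.71}

influential_patients = above_threshold(patient_scores)      # step 2
recommended_caregivers = above_threshold(caregiver_scores)  # steps 3 and 4

print(influential_patients)
print(recommended_caregivers)
```

With these placeholder scores, the caregivers above the threshold are returned in descending order of influence, so the first entry is the strongest candidate for the new patient.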

The Cypher code, depicted in Figure 33, describes the steps required to implement the scenario described above. It demonstrates the practical application of Neo4j and Cypher in personalized healthcare.

Figure 33. Recommending influential caregivers for new patient using patient and caregiver influence.

Figure 34 displays a screenshot of the query result: the caregivers recommended for a newly admitted patient based on their influence scores. As shown in Figure 32, the caregiver with caregiverId 21507 has an influence score of 0.95 and is therefore recommended for the care of the new patient (patientId: 40506).

Figure 34. Caregiver with caregiverId 21570 is recommended to care for the patient with patientId 40506.

  1. Conclusions

In this thesis, we explored the combination of Medical Informatics and graph database technology through the use of Neo4j in healthcare analytics. Our goal was to improve the understanding and use of patient data within the healthcare domain. Following a structured workflow, we covered the full path from data collection to advanced querying techniques. The initial stages comprised a careful data collection process to acquire accurate and reliable medical data. Subsequent data preprocessing ensured data consistency and prepared the ground for an efficient database implementation. The graph data model and the Neo4j database provided the structural backbone for the analysis. Cypher queries were used to load the data and form complex relationships, giving a detailed view of how medical data are interconnected. The core of this study lay in data querying, analysis, and exploration. Fundamental techniques for data retrieval were examined, while the focus was on the patient similarity analysis and the patient and caregiver influence analysis. By employing Jaccard similarity metrics across diseases, medicines, and measurements, as well as Centrality algorithms such as PageRank to measure patient and caregiver influence, we uncovered patterns and correlations within patient data, enabling a deeper understanding of patient relationships and medical dynamics.

In conclusion, this research contributes to the evolving landscape of Medical Informatics by accentuating the potential of graph database technology to revolutionize healthcare analytics. Our study underscores the significance of Neo4j in unveiling hidden insights within patient data, exemplifying its efficacy in driving informed medical decision-making. The implications of our findings extend to improved patient care, enriched clinical research, and the potential for novel medical discoveries. As the healthcare sector grapples with escalating data complexities, our thesis advocates for the continued integration of innovative data management solutions. This exploration serves as a catalyst for further inquiry and innovation at the nexus of healthcare and technology, opening avenues for future research and advancements in the realm of Medical Informatics.

References

  1. J C Wyatt, J L Y Liu. 2022. Basic concepts in medical informatics, pp. 808.
  2. Reinhold Haux. 2010. Medical informatics: Past, present, future, pp. 600-606.
  3. Shah J Miah, John Gammack, Najmul Hasan. 2019. Methodologies for designing healthcare analytics solutions: A literature analysis, pp. 1.
  4. Md Saiful Islam, Md Mahmudul Hasan, Xiaoyi Wang, Hayley D. Germack, Md Noor-E-Alam. 2018. A Systematic Review on Healthcare Analytics: Application and Theoretical Perspective of Data Mining, pp. 1-3.
  5. Jeang-Kuo Chen, Wei-Zhe Lee. 2019. An Introduction of NoSQL Databases Based on Their Categories and Application Industries, Algorithms 2019, 12(5), pp. 106.
  6. C Strauch, ULS Sites, W Kriha. 2011. NoSQL Databases, pp. 2-29.
  7. S Jouili, V Vansteenberghe. 2013. An empirical comparison of graph databases, pp. 708-715.
  8. Katarina Grolinger, Wilson A Higashino, Abhinav Tiwari, Miriam AM Capretz. 2013. Data management in cloud environments: NoSQL and NewSQL data stores, pp. 6-7.
  9. Raghupathi, W., Raghupathi V. 2013. An overview of health analytics. Journal of Health Medical Information, 4, pp. 132.
  10. Safikureshi Mondal, Nandini Mukherjee. 2016. Mobile-assisted remote healthcare delivery, pp. 632-633.
  11. Baset, Aiman G. 2015. Graphical Database Architecture for Clinical Trials, pp. 30-37.
  12. Amit Kumar. 2019. Implementing Real Time Recommendation Systems using Graph Algorithms & Exploring Graph Analytics in a Graph Database Platform (Neo4j), 2, pp. 29-34.
  13. Mark Needham, Amy E. Hodler. 2018. A Comprehensive Guide to Graph Algorithms in Neo4j, pp. 34.
  14. Sujoy Bag, Sri Krishna Kumar, Manoj Kumar Tiwari. 2019. An efficient recommendation generation using relevant Jaccard similarity, pp. 57-58.
  15. Diego Figueira. 2021. Foundations of Graph Path Query Languages, pp. 1-3.
  16. Roel Heijlen, Joep Crompvoets. 2021. Open health data: Mapping the ecosystem, pp. 1-8.
  17. Natalie Oskian. 2020. Evaluation of Centrality Algorithms for Information Spread in Social Networks, 29-35. https://dione.lib.unipi.gr/xmlui/bitstream/handle/unipi/12829/NatalieOskian_thesis.pdf?sequence=3&isAllowed=y
  18. Sujoy Bag, Sri Krishna Kumar, Manoj Kumar Tiwari. 2019. An efficient recommendation generation using relevant Jaccard similarity, In Proceedings of the Journal of Information Sciences, 483, May 2019, 53-64. https://doi.org/10.1016/j.ins.2019.01.023
  19. Ritika Wason. 2012. Comparative Analysis Of Pagerank And HITS Algorithms. In Proceedings of the International Journal of Engineering Research & Technology (IJERT) Journal, 1, 8, October 2012. https://doi.org/10.17577/IJERTV3IS11182

Appendices

Appendix A

Appendix A illustrates the CSV files used for this thesis.

Table 1. Part of the patients CSV file.

Table 2. Part of the admissions CSV file.

Table 3. Part of icustays CSV file.

Table 4. Part of d_items CSV file.

Table 5. Part of inputevents_mv CSV file.

Table 6. Part of inputevents_cv CSV file.

Table 7. Part of outputs CSV file.

Table 8. Part of d_icd_diagnoses CSV file.

Table 9. Part of diagnoses_icd CSV file.