Data Protection in Personalized Health (DPPH) - WP1
|Focused Questionnaire - Introduction and Glossary|
The purpose of this questionnaire is to collect perceptions of hospital ICT-research representatives and CISOs on the privacy and security aspects related to the IT processes that involve medical data for research.
The questionnaire is organized in five consecutive sheets that we kindly ask you to carefully read in order.
To set the stage, we define the terminology that we will use throughout this questionnaire. As the terminology in this space isn't always clear, and the same terms are often used to mean different and sometimes conflicting things, we provide below an organized glossary of the main privacy and security-related terms. This glossary is compiled from legal and technical definitions drawn from multiple sources, including the GDPR and HIPAA regulations and the ARTICLE 29* DATA PROTECTION WORKING PARTY.
*Art29 Opinion 05/2014: https://www.dataprotection.ro/servlet/ViewDocument?id=1085
|Types of research data|
|Structured clinical data|
DEFINITION: Structured data refers to information with a high degree of organization, such that it can be included seamlessly in a relational database and is readily searchable by simple search algorithms or other search operations. Examples are numerical or coded data (based on some medical terminology) from EHR or clinical trials.
|Unstructured clinical data|
DEFINITION: Unstructured data doesn't follow a specific format and includes narrative text such as nursing notes, scanned documents, images, videos, etc. Unstructured data cannot be easily organized and analyzed using standard, predefined structures.
|High-dimensional data|
DEFINITION: High-dimensional data is data characterized by a large number (often thousands) of dimensions, such as *omics data and data from health IoT devices.
|Types of administrative data|
|Demographics data|
DEFINITION: Demographics data includes information about a patient such as age, name, address, ethnicity, gender, etc.
|Unique identifying codes|
DEFINITION: Unique identifying codes are artificially generated codes attached to an individual record in order to uniquely identify it within a hospital information system. These codes can include, among others, pseudonyms, cryptographic keys, hospital unique identifiers, etc.
|Aggregated data|
DEFINITION: Aggregated data consists of data that is collected and aggregated from multiple patients/individuals.
|Individual data|
DEFINITION: Individual data consists of data pertaining to a single individual.
|Entities - Roles|
|Data controller|
DEFINITION: (from Art. 4 GDPR) The data controller is the natural or legal person, public authority, agency or other body which, alone or jointly with others, determines the purposes and means of the processing of personal data.
|Data processor|
DEFINITION: (from Art. 4 GDPR) The data processor is a natural or legal person, public authority, agency or other body which processes personal data on behalf of the controller.
|Principal Investigator (PI)|
DEFINITION: A Principal Investigator is the primary individual responsible for the preparation, conduct, and administration of a research grant, cooperative agreement, training or public service project, contract, or other sponsored project in compliance with applicable laws and regulations and institutional policy governing the conduct of sponsored research. The principal investigator also analyzes the data and reports the results of the trial or grant research. Also called PI.
|Researcher|
DEFINITION: We define as a researcher any individual other than the PI who is part of the project team and can analyze the data on behalf of, and under the authorization of, the PI.
|Hospital IT infrastructure|
DEFINITION: A hospital computing infrastructure refers to the hospital's entire collection of hardware, software, networks, data centers and related equipment used to develop, test, operate, monitor, manage and/or support information technology services.
|Hospital IT staff|
DEFINITION: IT personnel of the hospital responsible for the setup and maintenance of the hospital computing infrastructure.
|Third-party IT infrastructure|
DEFINITION: A third-party computing infrastructure is a computing infrastructure managed by an external company or provider rather than internally. Such an external company can be, for example, an academic data center or a public cloud provider.
|Third-party IT staff|
DEFINITION: IT personnel of the external company responsible for the setup and maintenance of the computing infrastructure.
|Privacy-related Terms: Risks|
|Re-identification|
DEFINITION: Re-identification is the process of attributing data back to the specific individual it describes. There are three main risks of re-identification for an individual, namely singling out, linkability and inference. A solution that addresses these three risks would be robust against re-identification performed by the most likely and reasonable means.
|Singling out|
DEFINITION: possibility to isolate some or all records which identify an individual in the dataset.
|Linkability|
DEFINITION: ability to link at least two records concerning the same data subject or a group of data subjects (either in the same database or in two different databases).
|Inference|
DEFINITION: possibility to deduce, with significant probability, the value of an attribute from the values of a set of other attributes.
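As an illustration of the linkability risk, the following minimal Python sketch joins a pseudonymized research extract with a publicly available list on shared attributes. The field names (`birth_year`, `postcode`, `gender`), the function `link_records` and all records are hypothetical and invented for this example; real linkage attacks may also use approximate matching.

```python
from collections import defaultdict

# Hypothetical attributes shared by both datasets (assumption for this sketch).
LINK_KEYS = ("birth_year", "postcode", "gender")

def link_records(dataset_a, dataset_b, keys=LINK_KEYS):
    """Return all pairs (a, b) whose values on `keys` match exactly."""
    index = defaultdict(list)
    for rec in dataset_b:
        index[tuple(rec[k] for k in keys)].append(rec)
    return [
        (a, b)
        for a in dataset_a
        for b in index.get(tuple(a[k] for k in keys), [])
    ]

# Fabricated example records, not real data.
research = [
    {"pseudonym": "P-01", "birth_year": 1980, "postcode": "1015", "gender": "F", "diagnosis": "asthma"},
    {"pseudonym": "P-02", "birth_year": 1975, "postcode": "1203", "gender": "M", "diagnosis": "diabetes"},
]
public = [
    {"name": "Alice Example", "birth_year": 1980, "postcode": "1015", "gender": "F"},
]

matches = link_records(research, public)
for a, b in matches:
    print(b["name"], "->", a["pseudonym"], a["diagnosis"])
```

In this sketch a single matching combination is enough to attach a name to a pseudonymized record, even though no name or hospital identifier appears in the research extract.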
|Privacy-related Terms: Sensitive Data Fields|
|Direct identifiers|
DEFINITION: information that can be used to uniquely identify an individual, either by itself or in combination with other readily available information (e.g., name, telephone number, SSN, government-issued ID).
|Indirect Identifiers (or Quasi-identifiers)|
DEFINITION: information that identifies an individual indirectly (e.g., DOB, gender, ethnicity, location, cookies, IP address, license plate number) by being combined with other indirect identifiers in order to create a unique identifier. Data that includes direct and indirect identifiers is called "identifiable" data.
|Privacy-related Terms: Data Sanitization Levels|
|Pseudonymized data|
DEFINITION: a data record is pseudonymized when direct identifiers are replaced with temporary pseudonyms (e.g., through hashing, encrypting and/or tokenizing). Pseudonymized individual records are still trackable, and it may be possible to re-identify the individual in the presence of unique combinations of indirect identifiers or by accessing the mapping between the pseudonyms and the real identifiers.
|De-identified data|
DEFINITION: a data record is de-identified when direct identifiers are removed. It may be possible to re-identify the individual in the presence of unique combinations of indirect identifiers.
|Anonymized data|
DEFINITION: Data anonymization is the result of processing personal data with the aim of irreversibly preventing re-identification of the data subject. Technical safeguards (aggregation, generalization, suppression and randomization) are applied to both direct and indirect identifiers to reduce the risk that the data can be re-identified. There can be several degrees of anonymization, each corresponding to a different level of re-identification risk. In particular, data considered fully anonymized can never be re-identified.
|Privacy-related Terms: Data Protection Techniques|
|Tokenization or replacement|
DEFINITION: direct identifiers are replaced with temporary pseudonyms (e.g., through hashing, encrypting and/or tokenizing). The link to the original identifier might be kept if needed for referential integrity.
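The two replacement styles mentioned above can be sketched in Python as follows. This is only an illustration under stated assumptions: `keyed_pseudonym` and `TokenVault` are hypothetical names, the 16-character truncation is an arbitrary choice, and key management and protection of the mapping table are out of scope.

```python
import hashlib
import hmac
import secrets

def keyed_pseudonym(identifier: str, key: bytes) -> str:
    """Hashing variant: deterministic pseudonym via HMAC-SHA256 under a secret key.

    The same identifier and key always give the same pseudonym, so records
    remain linkable; without the key the pseudonym cannot simply be recomputed.
    """
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability (assumption of this sketch)

class TokenVault:
    """Tokenization variant: random tokens plus a mapping kept for referential integrity."""

    def __init__(self):
        self._token_by_id = {}
        self._id_by_token = {}

    def tokenize(self, identifier: str) -> str:
        # Reuse the existing token so one identifier always maps to one token.
        if identifier not in self._token_by_id:
            token = secrets.token_hex(8)
            self._token_by_id[identifier] = token
            self._id_by_token[token] = identifier
        return self._token_by_id[identifier]

    def detokenize(self, token: str) -> str:
        # Only whoever holds the vault can reverse the replacement.
        return self._id_by_token[token]

key = secrets.token_bytes(32)
vault = TokenVault()
print(keyed_pseudonym("patient-0001", key))
print(vault.tokenize("patient-0001"))
```

Whoever holds the key or the vault can recompute or reverse the pseudonyms, so both would need to be protected at least as carefully as the original identifiers.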
|Suppression|
DEFINITION: removal of specific attributes (e.g., direct and indirect identifiers).
|Randomization|
DEFINITION: family of techniques that alter the veracity of the data in order to remove the strong link between the data and the individual. If the data are sufficiently uncertain, they can no longer be attributed to a specific individual. Randomization techniques comprise noise addition and permutation.
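A minimal Python sketch of noise addition and permutation is given below. The helper names, the Gaussian noise and its scale, and the sample records are assumptions chosen for illustration only; choosing an appropriate amount of noise for real data is a separate question.

```python
import random

def add_noise(values, scale, rng):
    """Noise addition: perturb each numeric value with zero-mean Gaussian noise."""
    return [v + rng.gauss(0.0, scale) for v in values]

def permute_column(records, column, rng):
    """Permutation: shuffle one attribute across records.

    The overall distribution of the attribute is preserved, but a value is no
    longer reliably attached to the record it came from.
    """
    shuffled = [r[column] for r in records]
    rng.shuffle(shuffled)
    return [{**r, column: v} for r, v in zip(records, shuffled)]

rng = random.Random(42)  # fixed seed only to make the sketch repeatable
records = [
    {"id": "A", "weight_kg": 70.0},
    {"id": "B", "weight_kg": 82.5},
    {"id": "C", "weight_kg": 64.0},
]
noisy_weights = add_noise([r["weight_kg"] for r in records], scale=2.0, rng=rng)
permuted = permute_column(records, "weight_kg", rng)
print(noisy_weights)
print(permuted)
```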
|Generalization|
DEFINITION: This approach consists of generalizing, or diluting, the attributes of data subjects by modifying the respective scale or order of magnitude (e.g., a region rather than a city, a month rather than a week).
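Generalization can be sketched in Python as coarsening each attribute to a wider bucket. The function names, the ten-year age bands and the small city-to-region table are hypothetical choices made for this example.

```python
from datetime import date

# Hypothetical lookup table for this sketch (city -> larger region).
CITY_TO_REGION = {"Lausanne": "Vaud", "Montreux": "Vaud", "Geneva": "Geneva"}

def generalize_age(age, width=10):
    """Replace an exact age with a band such as '40-49'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def generalize_date(d):
    """Keep only year and month of a date."""
    return d.strftime("%Y-%m")

def generalize_city(city, mapping=CITY_TO_REGION):
    """Replace a city with its region; unknown cities collapse to 'other'."""
    return mapping.get(city, "other")

record = {"age": 47, "admission": date(2019, 3, 14), "city": "Lausanne"}
generalized = {
    "age": generalize_age(record["age"]),
    "admission": generalize_date(record["admission"]),
    "city": generalize_city(record["city"]),
}
print(generalized)
```

Wider bands make records less distinguishable from one another but also less precise for analysis, so the bucket sizes are a trade-off to be decided per dataset.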
DEFINITION: Technical measures to enable processing of data while limiting the leakage of information to the processor. These include, among others, homomorphic encryption and secure multiparty computation.
|Homomorphic Encryption|
DEFINITION: Malleable encryption that enables some computations on ciphertexts without decrypting them first. Somewhat homomorphic cryptosystems enable bounded-degree polynomial operations on encrypted data efficiently, while fully homomorphic cryptosystems enable any kind of computation on encrypted data with a significant complexity overhead.
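To make the idea of computing on ciphertexts concrete, the sketch below implements a toy version of the Paillier cryptosystem in Python. Paillier is only additively homomorphic, so it is simpler than the somewhat and fully homomorphic schemes described above and is used here purely as an illustration. The primes are tiny and fixed, which makes this sketch completely insecure; the function names are chosen for this example. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import secrets

def keygen(p=293, q=433):
    """Toy key generation from two small fixed primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (n, lam, mu)  # (public key, private key)

def encrypt(n, m):
    """Encrypt an integer 0 <= m < n under public key n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random value in 1..n-1
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(private_key, c):
    n, lam, mu = private_key
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts corresponds to adding the underlying plaintexts."""
    return (c1 * c2) % (n * n)

public_key, private_key = keygen()
c_sum = add_encrypted(public_key, encrypt(public_key, 20), encrypt(public_key, 22))
print(decrypt(private_key, c_sum))
```

The party that multiplies the two ciphertexts never sees 20, 22 or their sum; only the holder of the private key can decrypt the result.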
|Secure Multiparty Computation|
DEFINITION: Cryptographic mechanism to enable two or more parties to jointly compute a function over their inputs while keeping the inputs private. It is usually accomplished by means of interactive protocols (interchanging either ciphertexts or random shares) run between the involved parties.
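One simple instance of the "random shares" approach is additive secret sharing, sketched below in Python for a joint sum. The protocol is simulated inside a single process, assumes that all parties follow the protocol honestly, and uses hypothetical names (`share`, `secure_sum`) and an arbitrary modulus; it is not a complete or hardened protocol.

```python
import secrets

MODULUS = 2**61 - 1  # arbitrary large modulus chosen for this sketch

def share(value, n_parties, modulus=MODULUS):
    """Split `value` into n random shares that sum to it modulo `modulus`."""
    shares = [secrets.randbelow(modulus) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % modulus)
    return shares

def secure_sum(private_inputs, modulus=MODULUS):
    """Simulate a joint sum in which no party reveals its own input."""
    n = len(private_inputs)
    # Step 1: party i splits its input and sends share j to party j.
    distributed = [share(x, n, modulus) for x in private_inputs]
    # Step 2: party j adds up the shares it received and publishes only that total.
    partial_sums = [
        sum(distributed[i][j] for i in range(n)) % modulus for j in range(n)
    ]
    # Step 3: the published partial sums add up to the sum of all inputs.
    return sum(partial_sums) % modulus

# Example: three hospitals each hold a private patient count.
print(secure_sum([12, 30, 7]))
```

Each individual share is a uniformly random number, so a party that sees fewer than all shares of an input learns nothing about that input from them.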
|Privacy-related Terms: Anonymization Metrics|
|k-anonymity|
DEFINITION: A dataset fulfills k-anonymity whenever every combination of quasi-identifier values that appears in it is shared by at least k individuals. Therefore, each individual in the dataset is "hidden" within an anonymity set of at least k people. A given level of k-anonymity can be achieved through technical measures such as aggregation, generalization, suppression and randomization.
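The k-anonymity level of a table can be measured with a short Python sketch: group the records by their quasi-identifier values and take the size of the smallest group. The function name, field names and records are hypothetical and serve only as an illustration.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the largest k for which `records` is k-anonymous.

    This is the size of the smallest group of records that share the same
    values on all quasi-identifiers (0 for an empty dataset).
    """
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0

# Fabricated, already generalized records.
records = [
    {"age_band": "40-49", "region": "Vaud", "diagnosis": "asthma"},
    {"age_band": "40-49", "region": "Vaud", "diagnosis": "diabetes"},
    {"age_band": "50-59", "region": "Geneva", "diagnosis": "asthma"},
    {"age_band": "50-59", "region": "Geneva", "diagnosis": "flu"},
    {"age_band": "50-59", "region": "Geneva", "diagnosis": "asthma"},
]

print(k_anonymity(records, ["age_band", "region"]))
```

With these records the smallest group has two members, so the table is 2-anonymous with respect to `age_band` and `region`; adding a single record with a unique combination would reduce the level to 1.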
|Differential privacy|
DEFINITION: A function applied over a dataset is called differentially private whenever the (distribution of the) result of such function does not substantially change with the presence or absence of one individual in the dataset. Therefore, the release of a differentially private result reveals little information about any single individual in the dataset. Differential privacy is usually achieved through randomization techniques, i.e., adding a controlled amount of random noise to the output of the to-be-computed deterministic "exact" function, such that the scale of the noise is calibrated to the maximum change in the function's output that the contribution of any single individual in the input dataset can produce.
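A common way to realize this for a counting query is the Laplace mechanism, sketched below in Python. The function names, the records and the privacy parameter `epsilon` are assumptions of this example. A count changes by at most 1 when one individual is added or removed, so the noise scale is set to 1 / epsilon. The sketch uses Python's general-purpose random generator, which would not be suitable for a production deployment.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponential samples."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def dp_count(records, predicate, epsilon, rng):
    """Noisy count of the records matching `predicate`.

    The sensitivity of a count is 1: adding or removing one individual changes
    the exact result by at most 1, so the Laplace scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random()
patients = [{"age": a} for a in (34, 51, 67, 72, 45, 80)]  # fabricated records
print(dp_count(patients, lambda r: r["age"] >= 65, epsilon=0.5, rng=rng))
```

Smaller values of `epsilon` add more noise and give stronger protection; larger values give more accurate but less protected results.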