A Semiotic Information Quality Framework:
Applications and Experiments
(Research Paper)
Topic Category: Information Quality Models and Frameworks
Gregory Hill, Rosanne Price and Graeme Shanks
Faculty of Information Technology, Monash University
research@greg-hill.id.au
Abstract
Recent theoretical developments in information quality research have focused on defining, understanding and assessing data quality. The resulting frameworks are intended to further theoretical knowledge as well as inform practice. Prior experimental work on the impact of data quality upon management decision-making is not based explicitly on theoretically-grounded data quality frameworks. As a result, research on data quality tagging (providing information to individual decision-makers about the quality of data elements) has not addressed the form of the tags themselves, their rationale or semantics, or how such tags are to be derived. Additionally, there is a paucity of experimental work based on such frameworks that examines how data quality impacts upon value-creation within organisations. To address these concerns, this paper presents one such framework (the Semiotic Information Quality Framework) and describes two distinct proposed empirical studies addressing the impact on decisions. The first examines how the form of data quality tags affects individual decision-makers, while the second explores the value of improving data quality in large-scale automated decision processes.
INTRODUCTION
Information quality (IQ) problems are becoming increasingly prevalent in practice (Wand and Wang 1996), particularly in data warehousing, customer relationship management systems and the implementation of enterprise resource planning systems. In enterprise systems where data are obtained from multiple sources, users may be far removed from the original data collection and may have little understanding of the nuances regarding the meanings of data items (Tayi and Ballou 1998). Poor quality data in organisational information systems can have significant social and economic impacts (Strong et al. 1997). It is critical that organisations understand and manage IQ, and that procedures are in place to assure the quality of data. Understanding the impact that poor information quality has on the organisation’s ability to make decisions is crucial for these tasks.
There are a number of different definitions of IQ and the specific criteria against which such quality should be assessed (Naumann and Rolker, 2000). Competing views of quality from product and service based perspectives focus on objective and subjective views of quality respectively. Objective measures of IQ can be based on evaluating data’s conformance to initial requirements specifications and specified integrity rules or their correspondence to external phenomena. (For simplicity, the term external is used synonymously with the term real-world; although the former is actually a more general term than the latter.) However, such a view of quality struggles to incorporate aspects, critical to an organization’s success, related to data delivery, actual data use, and data consumer (i.e. internal or external users of organizational data) perceptions.
To address these concerns, subjective measures of information quality can be used based on consumer feedback, acknowledging that consumers do not (and cannot) judge the quality of data in isolation but rather in combination with the delivery and use of that data. Thus data delivery and use-based factors are integral to a service-based view and consumer perceptions of quality. However, such subjective assessments may not be adequate for making investments in IQ improvements, where IQ projects must compete for scarce resources against other organisational goals. This problem is compounded in large and complex environments, where no one person has both the intimacy with the source systems and the “bird’s eye” view across the organisation required to assign priority to initiatives.
Managers often rely upon the data in enterprise systems to assist in their decision-making tasks. Such decisions must be made regardless of the quality of the data in databases, by accepting the data at face value, compensating for suspected deficiencies or by avoiding using data that is suspected to be of dubious quality. This paper examines the proposition that knowledge about the quality of the data may help decision processes and improve decision outcomes (Chengular-Smith et al. 1999). We examine this proposition under two distinct yet common scenarios, termed individual and automated. The first involves an individual decision-maker exercising discretion in forming a judgement. The second involves large-scale, automated decision-making using pre-determined business rules (sometimes known as Enterprise Decision Management). In both cases, the information systems present tabular data, with attributes (dimensions) in columns and instances (records) as rows, while the decision-maker selects from a set of alternatives.
Under the individual scenario, we propose to examine the impact of IQ on decision-making through the use of data quality tagging. Data quality tagging involves storing information about data quality within an organisation’s databases. These tags are then made available to decision-makers when they use the data. In contrast, understanding the organisational impact of data quality on automated decision-making requires a different tack: computer simulations are employed to validate metrics intended to guide investment in information quality improvement initiatives (data quality treatment evaluation). The Semiotic Information Quality Framework, described in Price and Shanks (2004), provides the theoretical grounding for the data quality tags and the metrics respectively.
Specifically, this paper addresses some of the limitations of earlier studies by including a comprehensive examination of different types (or forms) of quality tagging and by using experienced practitioners in the experimental phase. This same framework also forms the basis for experimental work to develop and refine a comprehensive set of metrics relating to data quality, for investment purposes.
This work is significant for practitioners, as determining and storing data quality tags is an expensive process. Similarly, undertaking programs to identify and correct data quality problems in information systems can require a large amount of resources and carry significant risk. Thus, the impact of data quality on decision outcomes must be clearly understood before any investment can be justified. An important element of this is understanding how best to present information about data quality to ensure it is effectively incorporated into the decision-making process.
The paper is structured as follows. The next section introduces key concepts in the Semiotic Information Quality Framework, including information quality measurement, metadata (i.e. tags) and information quality improvement initiatives. This theoretical framework allows for defining and selecting a range of different types of data quality tags for subjective assessment, as well as objective metrics for scoring data quality. Based on these new types of data tags and metrics respectively, the following section proposes two initial experimental designs to be used in an empirical investigation into the effect of such data tags and metrics on decision outcomes. The final section is the Conclusion.
A SEMIOTIC INFORMATION QUALITY FRAMEWORK
Understanding Data Quality
Data quality is commonly defined in terms of a set of quality criteria grouped into quality categories. Competing views of quality from product and service based perspectives focus on objective and subjective quality criteria (and measurements) respectively. The former view is based on conformance to initial specifications (including specified integrity rules) or correspondence to represented real-world phenomena. The latter view is based on consumer judgements of perceived data quality in the context of data use, where perceptions are influenced by data delivery (e.g. interface quality) and consumer expectations. Whereas purely theoretical research approaches used to define quality (Wand and Wang 1996) are necessarily limited to the product view and thus limited in scope, empirical (Wang and Strong 1996; Kahn et al. 2002) or ad-hoc/intuitive (e.g. English 1999) approaches suffer from problems with consistency, particularly with respect to the definition of quality categories and classification of quality criteria into categories.
Price and Shanks (2004) recently proposed a semiotic information quality framework to address problems in previous data quality work with respect to scope or consistency. Classical philosophical semiotics, the theory of signs, describes communication using signs as consisting of (1) three components describing the form, meaning (i.e. intended meaning), and use (i.e. interpretation) of a sign respectively and (2) three levels (syntactic, semantic, and pragmatic) describing respectively relations between “signs” (i.e. their forms); between a “sign” and its meaning; and between a “sign” and its use.
These components and levels can equivalently be used to describe Information Systems (IS), since IS data can be regarded as “signs” that represent external phenomena (i.e. intended meaning) on retrieval and use. An example would be an employee salary field representing an employee’s salary and used for payroll purposes. Similarly, IS metadata, e.g. the integrity rule emp.sal>0, can be regarded as signs for real-world constraints, e.g. employee salary must be positive. In the IS context, the three semiotic levels then describe relations between IS data and metadata (i.e. both “signs”); between IS data and represented external phenomena (i.e. a “sign” and its intended meaning); and between data and use (i.e. a “sign” and its use).
Quality categories can then be defined based on the desirable characteristics at each of these levels, i.e. conformance (of data to metadata), correspondence (of data to real-world phenomena), and suitability (of data for use). In the context of employee salary data, syntactic, semantic, and pragmatic quality aspects relate to whether such salary data conforms to relevant integrity rules (e.g. emp.sal>0), whether it matches actual employee salaries, and whether it is useful for a given purpose (e.g. payroll).
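The salary example can be made concrete with a small sketch. It is only an illustration under stated assumptions: the records, the field names and the two helper functions are hypothetical, and the actual salaries (the external phenomena) would not normally be available alongside the stored values. Pragmatic quality is omitted because it depends on consumer judgement rather than on the data alone.

```python
# Hypothetical employee records: the stored salary (IS data) alongside the
# actual salary (external phenomenon), which is normally not directly visible.
records = [
    {"emp": "A", "sal": 52000, "actual_sal": 52000},
    {"emp": "B", "sal": -100, "actual_sal": 48000},   # violates emp.sal > 0
    {"emp": "C", "sal": 61000, "actual_sal": 63000},  # valid form, wrong value
    {"emp": "D", "sal": 45000, "actual_sal": 45000},
]


def syntactic_conformance(rows):
    """Share of rows whose salary satisfies the integrity rule emp.sal > 0."""
    return sum(r["sal"] > 0 for r in rows) / len(rows)


def semantic_correspondence(rows):
    """Share of rows whose stored salary equals the actual salary."""
    return sum(r["sal"] == r["actual_sal"] for r in rows) / len(rows)


syntactic = syntactic_conformance(records)
semantic = semantic_correspondence(records)
print(f"syntactic: {syntactic:.2f}, semantic: {semantic:.2f}")
```

Record C shows why the two levels are kept apart: it passes the integrity rule yet fails to match the real-world salary.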
Syntactic and semantic quality categories relate to the objective, product-based quality view; therefore, their quality criteria can be derived using a theoretical approach based on integrity theory and on Wand and Wang’s (1996) theoretical work on IS/real-world transformations respectively. The pragmatic quality category relates to the subjective, service-based quality view; therefore, consumer feedback must be considered, using an empirical research approach, when deriving quality criteria. (Note that this does not imply that quantitative assessment is not possible: modern Utility Theory is predicated on subjective, numerical assessments of value.) Therefore, we can see that semiotics provides a theoretical basis for defining quality categories, for determining the appropriate criteria derivation method, for classifying quality criteria, and for integrating objective and subjective views of quality. Significantly, the fact that the last two steps follow implicitly and automatically from the first two ensures both consistent classification of criteria and coherent integration of quality views.
Data Quality Metadata
Data quality metadata provides information about the quality of data and is stored with that data in an organisation’s databases. Data quality tagging is the process of measuring a dimension of data quality and storing it as metadata. There has been very little research into the effectiveness of data quality tags. Furthermore, research to date considers only single-dimensional quality tags (i.e. based on a single quality criterion such as reliability or accuracy) used as dichotomous variables (i.e. quality information present or absent) or normalised scores (i.e. expressed as a percentage), without full explanation of the semantics of and rationale for the tag itself. Chengular-Smith et al. (1999) and Fisher et al. (2003) found that under some circumstances (for particular decision-making strategies, task complexity levels, or decision-maker experience levels) data quality tagging impacts decision outcomes. Work by Shanks and Tansley (2002) reported a preliminary empirical investigation into the impact of data quality tagging on decision outcomes using different decision-making strategies for both simple and complex decision tasks. Acknowledged limitations of these initial experiments include the restricted scope of the quality tagging considered, as with other data tagging research work to date, and the reliance on student participants.
Different types of data quality tags can be defined based on the semiotic information quality framework described above, either based on individual criteria or summarized by category. However, one fundamental issue that must be considered is the treatment of objective versus subjective quality aspects in tagging. Whereas objective quality measures can be provided for a given data set, since they are inherently based on that data set, subjective quality measures are context-dependent, i.e. they vary with the individual stakeholder or their organizational role, with the task considered, and with the administrative or geographic context. Previous work in data tagging either does not address the question of subjective data quality (Chengular-Smith et al. 1999, Fisher et al. 2003) or concludes that only objective aspects can be considered (Shanks 2002).
We contend that subjective data quality judgements represent a potentially valuable information resource for decision-makers. We therefore introduce the novel concept of context-based tags, drawn from the pragmatic level, to address the issue of providing feedback to decision-makers on subjective quality aspects, where the context is incorporated in the tagging process itself. Essentially, a separate quality tag (based on either individual or summarized subjective criteria) would be stored for each different context and would include contextual information (i.e. identifying the tag’s context). The types of context information considered (e.g. respondent organizational role, task, administrative/geographic context) in tagging could then be adapted as suitable depending on the specific IS requirements. This approach ensures that the decision-maker is aware of a quality judgement’s context and thus can select tags with the appropriate context. It further ensures that possible context-based polarities in quality judgements, for example due to differential respondent role-based or task-based perspectives, are accessible to the decision-maker rather than obscured.
Data Quality Treatments
Data Quality Treatment (DQT) refers to the process of modifying the signs that comprise an information system (that is, changing the present state) in order to meet organisational goals. This can be undertaken at the syntactic level (relations between signs) through data transformation and integration. It can also be at the semantic level (relations between signs and the external world), through substitution of values via direct observation or via a (supposed) higher-quality proxy. An example of the former would be re-processing a customer birth date field that has a mixture of American and European date formats. An example of the latter would be checking the field against Government birth records.
It is rarely possible to predict in advance the economic impact of a DQT upon an organisation’s decision-making. For example, suppose a series of edits to an organisation’s records of customer dates of birth are made. Predicting analytically how this will cascade through to different decisions about credit-scoring, marketing campaigns and fraud identification is simply not possible, due to the confounding effects of other variables and the non-linear decision-making process.
Instead, the question can only be tackled empirically, through “parallel processing” of data and decisions. In the testing (forwards-looking) phase, a sample of original data is treated, the decision logic applied and the original and revised decisions are compared. If the impact is significant and desirable, then the treatment is undertaken across the entire data source. During the tracking (backwards-looking) phase, a sample of the original data is retained. After processing with the newly-cleansed data source, the original data are re-processed and the resulting “a priori decision outcomes” compared with “a posteriori decision outcomes”. In this way, Data Quality Treatments can be aligned with organisational goals (e.g. maximising value) and processes (e.g. managerial accountability).
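The testing phase can be pictured with a minimal sketch. Everything in it is hypothetical and chosen only for illustration: the business rule in `decide`, the treatment in `treat`, the `proxy_age` field and the four sample records are assumptions, not part of the proposed study.

```python
def decide(record):
    """Hypothetical deterministic business rule: offer credit to adults."""
    return "offer" if record["age"] >= 18 else "decline"


def treat(record):
    """Hypothetical treatment: replace an implausible age with a proxy value."""
    fixed = dict(record)
    if not 0 <= fixed["age"] <= 120:
        fixed["age"] = fixed["proxy_age"]
    return fixed


# A sample of original records drawn from the data source.
sample = [
    {"age": 34, "proxy_age": 34},
    {"age": -1, "proxy_age": 40},   # bad value, proxy indicates an adult
    {"age": 999, "proxy_age": 16},  # bad value, proxy indicates a minor
    {"age": 17, "proxy_age": 17},
]

# Parallel processing: the same decision logic runs on both versions.
original_decisions = [decide(r) for r in sample]
revised_decisions = [decide(treat(r)) for r in sample]

# Indices of the cases where the treatment altered the decision.
changed = [
    i
    for i, (before, after) in enumerate(zip(original_decisions, revised_decisions))
    if before != after
]
change_rate = len(changed) / len(sample)
print(changed, change_rate)
```

Whether a given change rate counts as "significant and desirable" still depends on the value the organisation places on each decision outcome.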
For this method to work efficiently, a number of assumptions must be satisfied:
The organisation must be able to put a (subjective) value on all decision outcomes.
The decision-making process must be memoryless, time-invariant and deterministic (i.e. it gives the same decision each time it is given the same input).
The costs of sampling, retaining and re-processing data must be negligible compared with the value of the treatment.
We contend that these assumptions are met in large-scale, automated, data-driven decision-making found in modern processes, such as those associated with Customer Relationship Management and Enterprise Resource Planning. The problem remains, however, of identifying, rating and selecting candidate Data Quality Treatments. This is a resource allocation problem: Which processes should be focused on? Which data attributes should be targeted? Which treatments yield the best results? It is not practicable to try all candidate treatments, across all processes and data elements. Instead, subject-matter experts with intimate knowledge of the data and decisions must use judgement to identify “hot spots” or “points of pain” in the systems. As such systems grow ever-more complex, it becomes increasingly unlikely that any individual could maintain such knowledge of the organisation’s systems. Further, the requirement for a rational justification (assessment of the costs and benefits to the organisation of alternatives) and an “investment view” – managing data as an asset (Moody and Walsh, 1999) – mean that there is a growing need for value-focused measures of DQTs. Such measures can be used to help guide investments in improving the quality of data.
Data Quality Measurement
For large-scale automated decision-making, such as credit scoring and marketing campaign management, decision-making is the objective application of pre-defined business rules to individual cases. In these circumstances, the lack of individual case-by-case discretion shifts management focus to investments in improving data quality to achieve organisational goals. This requires the definition and calculation of a set of metrics to guide investments in data quality improvement. Previous efforts in this regard have produced approaches with significant computational complexity (Ballou and Tayi, 1989) and a lack of empirical validation (Ballou and Pazer, 1985).
We argue that the information-theoretic model of information quality proposed earlier (Hill, 2004) can be used to derive suitable metrics. This model is based on an information-theoretic analysis of the ontological model (Wand and Wang, 1996) used in the semantic level of the Semiotic Information Quality Framework, extended to include decision-theoretic considerations.
In brief, the Wand and Wang ontological model conceives of the external world - and the information system that represents it – as a “state machine”. That is, the external world (and IS) is in precisely one state at any moment in time, and its transition to another state is defined by a set of rules. This IS state is the input into a decision-making function that maps IS states to decision outcomes. These decision outcomes are then mapped back onto the external world, where they have implications for the organisation in the form of “pay-offs” (or costs). A decision outcome with a zero cost is regarded as the “right” one, while alternatives introduce costs and are regarded as undesirable mistakes.
Semantic information quality, then, is the degree to which the IS state “mirrors” or corresponds to the external state. (When the correspondence is perfect, an IS user can infer the external state by inspection of the IS.) The problem of how to define a theoretically-sound and “well-behaved” measure of the degree of correspondence between two statespaces was tackled and solved in an entirely different context, communications engineering (Shannon and Weaver, 1949). This seminal work introduced information theory and in particular the entropy measure, which has found widespread use in engineering, economics, psychology, biology and mathematical sciences (Cover and Thomas, 1991).
We propose to use the normalised mutual information – here dubbed fidelity (meaning degree of faithfulness to the external world) – as the appropriate objective measure of semantic information quality. This metric has a similar form to the relative information score proposed by Kononenko and Bratko (1991) and is widely used in machine learning research. It has the interpretation of “the average amount of uncertainty about the external world removed by inspecting the information system”. The maximum value of one implies that all uncertainty is removed, i.e. the IS perfectly agrees with the external world. The minimum value of zero implies that the IS states are statistically independent of those of the external world (i.e. the IS tells us nothing about the external world).
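As a rough illustration of fidelity, the sketch below estimates a normalised mutual information score from paired observations of the external world and the IS. The choice to normalise by the entropy of the external world, the function names and the toy data are assumptions made here; the text does not fix the exact normalisation.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def fidelity(external, recorded):
    """Assumed form: I(external; recorded) / H(external), from paired samples."""
    h_external = entropy(external)
    if h_external == 0:
        return 0.0  # assumed convention: no uncertainty exists to be removed
    joint = entropy(list(zip(external, recorded)))
    mutual_information = h_external + entropy(recorded) - joint
    return mutual_information / h_external


external_world = ["a", "a", "b", "b"]
perfect_is = ["a", "a", "b", "b"]      # IS mirrors the external world
unrelated_is = ["x", "y", "x", "y"]    # IS is independent of the external world

fidelity_perfect = fidelity(external_world, perfect_is)
fidelity_unrelated = fidelity(external_world, unrelated_is)
print(fidelity_perfect, fidelity_unrelated)
```

The two toy cases correspond to the maximum and minimum interpretations given above. In practice the external-world values would have to come from direct observation or a trusted proxy.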
The second objective measure relates to the pragmatic level, and is concerned with the use of information for a particular decision-making process. Here, instead of comparing the statespace of the external world with that of the IS, we compare the statespace of the IS to the decision outcome statespace, again using the normalised mutual information score. We call this metric influence, as it captures the relevance, or importance, that a data element (i.e. attribute) has for the final decision. It must be stressed that influence is a property of the decision-making process, not of the attribute value’s correspondence to the external world. Influence has the interpretation of “the average amount of uncertainty about the decision outcome removed when that attribute value is revealed”. The maximum value of one implies that all uncertainty is removed, i.e. the attribute in question always solely determines the decision (for example, it may be a flag in the database). The minimum value of zero implies that no value that the attribute could take will ever impact on the decision outcome (for example, it may be an extraneous field).
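Influence can be sketched in the same way, this time scoring each attribute of a table against the decision outcome. The normalisation by the entropy of the decision, the attribute names (`vip_flag`, `region`) and the four rows are hypothetical choices made only for illustration.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def influence(attribute_values, decisions):
    """Assumed form: I(attribute; decision) / H(decision), from paired samples."""
    h_decision = entropy(decisions)
    if h_decision == 0:
        return 0.0  # assumed convention: the decision never varies
    joint = entropy(list(zip(attribute_values, decisions)))
    return (h_decision + entropy(attribute_values) - joint) / h_decision


# Hypothetical tabular data: attributes in columns, instances as rows.
rows = [
    {"vip_flag": "Y", "region": "north", "decision": "mail"},
    {"vip_flag": "Y", "region": "south", "decision": "mail"},
    {"vip_flag": "N", "region": "north", "decision": "skip"},
    {"vip_flag": "N", "region": "south", "decision": "skip"},
]

decisions = [r["decision"] for r in rows]
scores = {
    name: influence([r[name] for r in rows], decisions)
    for name in ("vip_flag", "region")
}
print(scores)
```

Here `vip_flag` plays the role of the database flag that solely determines the decision, while `region` is an extraneous field.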
The third metric, termed stake, is a measure of the value-at-risk for the decision task. It is defined as the difference between the cost of (imperfect) decision-making with imperfect information and the cost of (imperfect) decision-making with perfect information. As such, it is a property of the decision process and organisational context. It can be calculated by adding up the cost of mistakes (i.e. poor decision outcomes) when using existing data and subtracting the cost of mistakes when using idealised perfect data. If the decision-making process is itself perfect (so that there is no mistake if perfect data are used), the stake reverts to the cost of mistakes with existing data, or, in information-economic terms, the Value of Perfect Information (Lawrence, 1999). Stake requires the organisation to assign subjective costs to decision outcomes, in accordance with its operating environment and procedures. Stake should be non-negative; a negative stake implies that improving data quality results in more (or worse) mistakes.
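The stake calculation can be illustrated with a small worked example. The pay-off matrix, the income threshold in `decide` and the four cases are all hypothetical; they are chosen so that the decision rule itself is imperfect, which shows why the cost with perfect data is subtracted.

```python
# Hypothetical pay-off matrix: cost[(decision_made, right_decision)].
cost = {
    ("approve", "approve"): 0,
    ("decline", "decline"): 0,
    ("approve", "decline"): 500,  # a bad loan is written
    ("decline", "approve"): 100,  # a good customer is turned away
}


def decide(income):
    """Hypothetical (imperfect) business rule applied to an income value."""
    return "approve" if income >= 40000 else "decline"


# Each case: (income held in the IS, true income, right decision known post hoc).
cases = [
    (45000, 45000, "approve"),
    (60000, 30000, "decline"),  # IS overstates income
    (20000, 50000, "approve"),  # IS understates income
    (41000, 41000, "decline"),  # the rule errs even with perfect data
]

# Cost of mistakes when deciding on existing (imperfect) data.
cost_existing = sum(cost[(decide(is_value), right)] for is_value, _, right in cases)
# Cost of mistakes when deciding on idealised perfect data.
cost_perfect = sum(cost[(decide(true_value), right)] for _, true_value, right in cases)

stake = cost_existing - cost_perfect
print(cost_existing, cost_perfect, stake)
```

The fourth case contributes the same cost to both totals, so it does not count towards the stake: no data quality improvement could recover it.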
The fourth and final metric is tweak, which measures the magnitude of the treatment, i.e. the extent of change in an IS representation of the external world. It compares the original data source with a revised data source using the normalised mutual information score, which captures the degree of agreement between the two; tweak rises as that agreement falls. The minimum value of tweak is zero, which occurs when the two sources are identical (for example, no edits were made to the original). The maximum value of tweak is one, which implies that the second source is so radically different from the original that the two are entirely (statistically) independent of one another.
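One way to reconcile the stated endpoints with the normalised mutual information score is to take tweak as its complement. The form below is an assumption made for illustration, not a definition given in the text; X denotes the original source, X' the revised source, H the entropy and I the mutual information.

```latex
\mathrm{tweak}(X, X') \;=\; 1 - \frac{I(X; X')}{H(X)},
\qquad
I(X; X') \;=\; H(X) + H(X') - H(X, X')
```

Under this reading, identical sources give I(X; X') = H(X) and hence a tweak of zero, while statistically independent sources give I(X; X') = 0 and hence a tweak of one.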
The four proposed objective “SIFT” measures, summarised below, have been derived from an application of Information Theory to the Ontological IQ model, as used in the Semiotic Information Quality Framework.
| Metric Name | Semiotic Level | Measures | Minimum Value Interpretation | Maximum Value Interpretation |
| --- | --- | --- | --- | --- |
| Stake | Pragmatic | Cost of mistakes due to poor Data Quality. | $0 - Fixing Data Quality will not change decision outcomes. | Amount of value “recoverable” due to Data Quality improvement. |
| Influence | Pragmatic | Contribution of each attribute to decision-making process. | 0% - Attribute is not used / has no bearing on decision process. | 100% - Attribute always solely determines decision outcome. |
| Fidelity | Semantic | Correspondence between IS and external world values. | 0% - IS and external world are entirely disconnected. | 100% - IS state perfectly informs about external world state. |
| Tweak | Semantic | Correspondence between two IS values. | 0% - Second (i.e. derived / alternative) IS identical to original. | 100% - Second (i.e. derived / alternative) IS entirely disconnected from original. |
Table 1. Objective Data Quality Metrics.
PROPOSED RESEARCH DESIGN – Data Quality Tagging
Experiments can be used to examine the impact on decision outcomes of varying the type of data quality tagging (including context-based subjective data quality tags), the decision-making strategy used, or the task complexity. This research focuses on decision-making as the process of choosing among multiple alternatives described by the same set of attributes. The decision-making strategies adopted are built into the spreadsheet-like interface of the database query system used in the decision-making task. A proposed research model is shown below in Figure 1, illustrating the causal relationships between theoretical constructs (represented as ovals) using solid lines and the measurement relationships between theoretical constructs and their empirical indicators (represented using rectangles) using dotted lines.
The Decision Task
The decision task selected should ideally allow comparison with previous studies. Thus, in common with previous data tagging studies (Chengular-Smith et al., 1999, Payne 1976, Shanks and Tansley 2002), we propose the use of a task involving selecting an apartment from a number of alternatives, where each apartment alternative is described by the same set of attributes (e.g. rent, commuting time, number of rooms).
The proposed study involves three independent variables: decision strategy, task complexity, and data quality tagging. We propose that at least two representative decision strategies be used, additive and elimination by attribute (EBA). The additive strategy is an alternative-based and compensatory strategy whereas the EBA is attribute-based and non-compensatory (Shanks and Tansley, 2002). These two decision-making strategies therefore have contrasting properties and thus provide a useful comparison. These strategies will be supported by a spreadsheet-like interface that can (1) display a scrolling window of alternatives with the values of all attributes displayed for different task complexities, (2) allow selection and/or ordering of attributes, (3) sum individual attribute values, and (4) sort alternatives based either on the attributes selected or on the calculated attribute value sums as appropriate for the particular decision-making strategy used. In the context of the proposed experiment and apartment selection task, task complexity is defined as the number of attributes included in the database. Several different levels of task complexity will be used in the study, based on different numbers of attributes. Finally, three different types of data quality tags will be compared, across three levels of aggregation (attribute value, attribute and entire dataset):
No Tags: The data are presented “as is” without any data quality information provided.
Objective Score: The fidelity measure is computed and provided (as a percentage), along with an explanation for its interpretation.
Subjective Rating: Likert scale assessment of semiotic framework criteria (individual and summarised), incorporating context information.
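The contrast between the two decision strategies named above can be sketched in a few lines. The apartments, the 1 to 5 attribute ratings (higher is better), the attribute ordering and the cut-off values are hypothetical, and the rule for what happens when a cut-off would eliminate every alternative is an assumption.

```python
# Hypothetical apartments with attribute ratings from 1 to 5 (higher is better).
apartments = {
    "A": {"rent": 3, "commute": 5, "rooms": 2},
    "B": {"rent": 5, "commute": 2, "rooms": 4},
    "C": {"rent": 4, "commute": 4, "rooms": 4},
}


def additive(alternatives):
    """Alternative-based and compensatory: choose the highest attribute sum."""
    return max(alternatives, key=lambda name: sum(alternatives[name].values()))


def elimination_by_attribute(alternatives, cutoffs):
    """Attribute-based and non-compensatory: filter one attribute at a time."""
    remaining = dict(alternatives)
    for attribute, minimum in cutoffs:
        survivors = {
            name: attrs
            for name, attrs in remaining.items()
            if attrs[attribute] >= minimum
        }
        if survivors:  # assumption: skip a cut-off that would eliminate everything
            remaining = survivors
        if len(remaining) == 1:
            break
    return sorted(remaining)


additive_choice = additive(apartments)
eba_choice = elimination_by_attribute(apartments, [("rent", 5), ("commute", 3)])
print(additive_choice, eba_choice)
```

With these values the additive strategy selects C, whose ratings are balanced, whereas EBA keeps only B because its strong rent rating passes the first cut-off and its weak commute rating is never weighed against it.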
There are four dependent variables in the study: decision complacency, decision consensus, decision efficiency, and decision confidence. The alternatives in the decision-making task will be designed so that one apartment is the preferred solution in the absence of data quality information, and therefore an objective decision evaluation measure would be available. The inclusion of data quality information makes the selection of an apartment a subjective assessment based on the individual decision-maker; therefore, no single alternative can be assumed to be the universally preferred solution. Decision complacency refers to the degree to which participants ignore data quality information (adapted from Chengular-Smith et al. 1999). The complacency level is the number of times the preferred apartment is ranked highest within a group of participants. Decision consensus refers to the degree to which participants converge on a decision when presented with data quality information (adapted from Chengular-Smith et al. 1999). If the inclusion of data quality information reduced the ability of participants to agree on an apartment then it would be considered harmful. The consensus level is the number of times an apartment is ranked highest within a group of participants, even if it is not the solution that the experiment and database specify as preferred in the absence of quality information. Decision efficiency is measured by recording the time taken by participants to reach a decision. The inclusion of different types of data quality tagging information may affect the time taken by each participant to reach a decision. Finally, the inclusion of different types of data quality tagging information may affect participants’ confidence in their decisions. A five-point Likert scale will be used to record the participants’ subjective assessment of confidence in their decision outcomes.
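The complacency and consensus levels reduce to simple counts over a group's top-ranked choices. In the sketch below the apartment identifiers and the group's choices are hypothetical, and reading the consensus level as the count for the most frequently chosen apartment is an assumption about how the measure would be reported.

```python
from collections import Counter

# Hypothetical top-ranked apartment for each participant in one group.
choices = ["apt_17", "apt_17", "apt_42", "apt_17", "apt_08", "apt_42"]
# The apartment designed to be preferred when no quality information is shown.
preferred = "apt_08"

counts = Counter(choices)

# Complacency level: how often the designed preferred apartment was chosen.
complacency_level = counts[preferred]

# Consensus level (assumed reading): count for the most frequently chosen
# apartment, whether or not it is the designed preferred solution.
top_apartment, consensus_level = counts.most_common(1)[0]
print(complacency_level, top_apartment, consensus_level)
```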
Figure 1 – Proposed Research Model for Data Tagging
Experimental Procedure
A computer database with a query interface will be used to simulate the decision-making activity, with participants to be assigned randomly to different groups. In this investigation, the different groups would be used to test not only a range of decision-making strategies and task complexities, but also the three different types of data tagging outlined above (including no tagging). Furthermore, the participants targeted for the experiments would be practitioners with actual business decision-making experience.
A database system with a spreadsheet-like interface will be used for each of the groups, loaded with at least 100 alternative apartments where one apartment alternative is designed to be the preferred solution in the absence of any data quality information. The assigned task involves selecting the preferred apartment, recording the start and finish times, nominating a confidence level for the decision, and providing a brief explanation of the reason for the decision. Participants will receive a brief explanation of the particular decision-making strategy to be used and a demonstration of how to use the computer system, plus instruction and answer sheets.
PROPOSED RESEARCH DESIGN – Data Quality Treatment Evaluation
These experiments seek to explore the suitability of the information-theoretic data quality metrics for guiding investments in improvement initiatives, or Data Quality Treatments (DQTs). These metrics are to be used to help identify which processes, data elements and treatments are likely to have the highest yield for the organisation. This section outlines the use of computer simulations of real-world business problems to evaluate the metrics. Specifically, we are exploring the organisational value of DQTs. The model below shows the relationship between the dependent (value) and independent (treatment) variables in the study.
The Scenarios
Firstly, to ensure external validity, the scenarios used in the simulations should reflect the real-world problems that this framework is intended to tackle, and should incorporate real data. Secondly, the study must be verifiable, so the scenarios must be available to other researchers. Thirdly, the scenarios must not be so complex that important effects cannot be discerned, yet must be rich enough to allow meaningful conclusions to be drawn. To meet these criteria, we have selected a number of scenarios from a public data mining competition – the KDD Cup (KDD, 2006) – in which data mining researchers apply their algorithms to discover patterns in the provided datasets. These datasets are drawn from real-world systems, anonymised, cleansed and made publicly available. We have selected three datasets that pertain to Customer Relationship Management processes (credit-scoring, fraud and marketing), since these most closely relate to the intended use. In each scenario, the dataset’s columns contain demographic attributes such as age, income and location, and each row is a particular instance (customer). There is also a small number of possible decision outcomes (such as the credit rating, or whether or not to mail an offer). Further, the “pay-off matrices” (the organisational cost of different types of mistakes) are provided. The true (post-hoc) decision is also provided, for training and valuation purposes.
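As an illustration of how a pay-off matrix turns decision outcomes into an organisational cost, the following minimal sketch sums the cost of each (decision made, true outcome) pair. The outcome labels and cost figures are invented for the example and are not taken from the KDD Cup datasets.

```python
# Hypothetical pay-off matrix for a two-outcome mailing decision.
# Keys are (decision made, true outcome); values are costs to the organisation.
PAYOFF = {
    ("mail", "responds"): 0.0,
    ("mail", "ignores"): 1.0,    # cost of a wasted mailing
    ("skip", "responds"): 10.0,  # cost of a lost sale
    ("skip", "ignores"): 0.0,
}

def total_cost(decisions, truths, payoff):
    """Sum the pay-off matrix cost over all customers."""
    return sum(payoff[(made, true)] for made, true in zip(decisions, truths))

decisions = ["mail", "skip", "mail", "skip"]
truths = ["responds", "responds", "ignores", "ignores"]

print(total_cost(decisions, truths, PAYOFF))  # 11.0
```

The asymmetry between the two kinds of mistake is what makes some data errors more costly than others, even when they change the same number of decisions.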
In each scenario, a decision-making function is required to process the data and provide a decision for each customer. Since we are not concerned with the decision-making process per se, we propose to use a standard decision-tree approach, such as C4.5 (Quinlan 1996). This is because decision-trees are relatively easy to compute, are widely adopted in practice and satisfy the criteria of being memoryless, time-invariant and deterministic. The decision-trees’ parameters will be “tuned” to ensure that their performance is comparable to (although obviously lower than) that of the “state of the art” algorithms, by comparison with the results of the public competition.
Figure 2 – Proposed Research Model for Data Treatment Evaluation
Experimental Procedure
For each scenario, an appropriate decision-process (tree) is constructed using the “clean” data. Next, the pragmatic data quality metrics (Stake and Influence) are computed. Then “noise” is introduced to the data to recreate the effect of data quality problems. This is achieved by swapping data values between customers, adding offsets and replacing values with averages (i.e. replicating the use of imputed data for missing values). The semantic data quality metrics (Fidelity and Tweak) are computed for the treatment data. Finally, the “dirty” and “treated” data are processed by the decision logic, and the decision outcomes are compared. The difference in costs is computed using the pay-off matrices, giving the value of the treatments. The resulting data will undergo statistical analysis to determine the capacity of the four data quality metrics to rank candidate treatments by value.
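The noise-injection and valuation steps of this procedure can be sketched as follows. This is an illustration under stated assumptions rather than the experimental code: a single income threshold stands in for the tuned decision tree, the treatment is assumed to restore the clean data perfectly, and all names, records and costs are hypothetical. The four data quality metrics are not computed here, since their definitions are given elsewhere (Hill, 2004).

```python
import random
from statistics import mean

def swap_values(rows, column, n_swaps, rng):
    """Swap one column's values between randomly chosen pairs of customers."""
    rows = [dict(r) for r in rows]  # copy so the clean data is untouched
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(rows)), 2)
        rows[i][column], rows[j][column] = rows[j][column], rows[i][column]
    return rows

def add_offset(rows, column, offset):
    """Add a fixed offset to one column for every customer."""
    return [{**r, column: r[column] + offset} for r in rows]

def impute_mean(rows, column, indices):
    """Replace the selected customers' values with the column average."""
    avg = mean(r[column] for r in rows)
    return [{**r, column: avg} if i in indices else dict(r)
            for i, r in enumerate(rows)]

def decide(row):
    """Hypothetical stand-in for the decision tree: a single income split."""
    return "approve" if row["income"] >= 40_000 else "decline"

# Hypothetical pay-off matrix: (decision made, true outcome) -> cost.
PAYOFF = {("approve", "good"): 0, ("approve", "bad"): 5,
          ("decline", "good"): 1, ("decline", "bad"): 0}

def total_cost(rows, truths):
    """Cost of the decisions made on these rows, given the true outcomes."""
    return sum(PAYOFF[(decide(r), t)] for r, t in zip(rows, truths))

clean = [{"income": 25_000}, {"income": 38_000},
         {"income": 52_000}, {"income": 90_000}]
truths = ["bad", "bad", "good", "good"]

# "Dirty" data: customer 0's income is replaced by the column average (51,250).
dirty = impute_mean(clean, "income", indices={0})
treated = clean  # assumption: the treatment fully restores the clean values

value = total_cost(dirty, truths) - total_cost(treated, truths)
print(value)  # 5
```

Here the imputed value pushes one customer across the decision threshold, so a single wrong approval accounts for the whole value of the treatment. Noise that leaves every decision unchanged would give a treatment value of zero.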
CONCLUSION
This paper describes two proposed empirical studies (with tentative experimental designs) into how data quality impacts upon organisational decision-making, based upon the Semiotic Information Quality Framework (Price and Shanks, 2004). The first looks at the individual level, introducing a novel approach to data tagging research focused on the semantics of the data tags themselves. In contrast to previous studies of data tagging, we contend that both the definition and the selection of quality tags should have a firm theoretical foundation. A further distinction from previous data tagging studies is the proposal that subjective data quality information be included using a new type of data quality tag, i.e. context-based data tags, alongside a rigorously defined objective measure. The second study – focusing on automated decision-making – tackles the problem of metrics for investment in Data Quality Treatments. Here, the Semiotic Framework (incorporating the Ontological Model) is used in conjunction with Information Theory to describe novel, value-focused measures of data quality. Unlike other studies, the proposed experiments use both real-world scenarios and theoretically-grounded measures to appraise the impact of DQTs.
The full implications and practical feasibility of using such tags and metrics require further consideration and clarification. After consideration of the hypotheses and results from earlier data studies, hypotheses must be formulated for the proposed experiments. Such work is currently in progress.
REFERENCES
1 Note that the quoted term “sign” is specifically used to refer to the form of a sign (i.e. its actual representation).
2 Example calculations for these metrics can be found in the paper where these metrics were originally proposed (Hill, 2004).