Internal DQC Problem Patterns
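Column groups: Problem (ID, Description, Evidence, Negative Impact, Notes, Discovery Scenario Affected); Action; Severity; Solutions for Checking (Effort, Method, Technology, Notes)
ID | Description | Evidence | Negative Impact | Notes | Discovery Scenario Affected | Action | Severity | Effort | Method | Technology | Notes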
1. Duplicate / Redundant information (within a field, across fields and across a collection)
P1 | Systematic use of the same title | Many CARARE objects use the term "Rijksmonument" as title. | Lack of differentiation in search; title uninformative | Negative for SEO | Basic Retrieval | Report | Informative | Low | Field statistics across a collection | Field faceting |
P2 | Equal title and description fields | Example record | Distorts search weightings | One of the dc:title(s) is equal to dc:description | Basic Retrieval | Report, Correct | Warning | Low | Strict field comparison | SHACL |
P3 | Near-identical description and title fields | Example record | Distorts search weightings; distorts completeness measurement |  | Basic Retrieval | Report, Correct | Warning | Moderate | Measure the word overlap between the two fields | SHACL (only with an extension) | Option: measure the longest common substring against a threshold
P4 | Duplicated strings | edm:provider "Europeana Food and DrinkEuropeana Food and Drink" | Illegibility; search accuracy |  | Basic Retrieval | Report, Correct | Warning | Low | Check for duplicate words within a field | Regexp, SHACL with customized SPARQL query | Spell checkers check for duplicate words. Regexp (.{5,})\1 (finds a sequence of at least 5 characters immediately followed by the same sequence)
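
The checks for P2 to P4 could be sketched in Python as follows. This is a minimal illustration, not the DQC implementation: the function names are hypothetical, and using the Jaccard ratio of word sets as the "word overlap" measure for P3 is an assumption, since the table does not name a specific measure.

```python
import re

# P4: a run of at least 5 characters immediately followed by the same run.
REPEATED_RUN = re.compile(r"(.{5,})\1", re.DOTALL)


def fields_equal(title, description):
    """P2: strict comparison, ignoring only surrounding whitespace."""
    return title.strip() == description.strip()


def word_overlap(a, b):
    """P3: Jaccard ratio of the two word sets (assumed measure), 0.0 to 1.0."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def has_duplicated_string(value):
    """P4: True if some run of 5+ characters is directly repeated."""
    return REPEATED_RUN.search(value) is not None
```

A threshold on `word_overlap` (for example 0.8) would still have to be chosen by inspecting real records; short common runs such as "hahahaha" would not trigger `has_duplicated_string`, but legitimately repeated phrases could.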
2. Irrelevant information
P5 | Unrecognizable titles | Filenames and shelfmarks as titles | Bypasses completeness metrics; exposes internal messages to external users | Instead of showing such strings, the field should be left blank. | Basic Retrieval | Report | Informative | Medium | Measure the number of recognizable words | Regex + SHACL | Titles which contain identifiers and no other words are easy to detect. However, there is some risk of false positives (e.g. a painting being called something like abc100)
P6 | Non-meaningful title | such as "no suitable title found", "unknown title", etc. ...in many languages | Bypasses completeness metrics ("empty" being a non-empty value); exposes internal messages to external users | Negative for CTRs in Europeana and in search engines | Basic Retrieval | Report, Correct when possible | Informative | Hard | Check against a catalogue of unwanted keywords and/or expressions. A complementary method is to apply field statistics to find more variants for the catalogue | Dictionary + Regexp, SHACL; Field faceting | The values must be language specific. In addition, field statistics can be used to help widen coverage. Getting all the variants into the catalogue will not be easy.
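
The catalogue check for P6, together with the field statistics suggested as a complementary method, might look like the sketch below. The catalogue entries, language keys and function names are all hypothetical placeholders; a real catalogue would be far larger and curated per language.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical, deliberately tiny catalogue keyed by language tag.
PLACEHOLDER_TITLES = {
    "en": {"unknown title", "no suitable title found", "untitled"},
    "de": {"ohne titel", "unbekannter titel"},
    "fr": {"sans titre", "titre inconnu"},
}


def normalize(value):
    """Case-fold, collapse whitespace and drop surrounding punctuation."""
    value = unicodedata.normalize("NFKC", value).casefold()
    value = re.sub(r"\s+", " ", value).strip()
    return value.strip(" .!?[]()\"'")


def is_placeholder_title(title, lang=None):
    """Check one language if it is in the catalogue, otherwise all of them."""
    key = normalize(title)
    langs = [lang] if lang in PLACEHOLDER_TITLES else list(PLACEHOLDER_TITLES)
    return any(key in PLACEHOLDER_TITLES[code] for code in langs)


def frequent_titles(titles, min_count=3):
    """Field statistics: normalized titles occurring at least min_count times."""
    counts = Counter(normalize(t) for t in titles)
    return [(t, n) for t, n in counts.most_common() if n >= min_count]
```

The output of `frequent_titles` is only a list of candidates for manual review: a frequent title such as "Rijksmonument" (P1) is uninformative but is not a placeholder in the P6 sense.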
3. Missing or Incomplete information
P7 | Missing description fields | Example record | Record unlikely to be retrieved; record uninformative |  | Basic Retrieval | Report | Warning | Low | EDM checking rules | SHACL |
P8 | Missing lang tag |  | See Column E | For specific fields, since not all values are language specific. | Basic Retrieval; Cross-language recall; Improved language facets | Report, Correct | Warning | Low | EDM checking rules | SHACL |
P9 | Very short description field | Example record | Record unlikely to be retrieved; record uninformative |  | Basic Retrieval | Report | Informative | Low | Check field length | SHACL | Implementation should measure confidence (a description of one letter is certainly not enough; one word is not enough 99% of the time; three words 95% of the time; etc). Also note that description is often used as the 'placeholder' for any info that does not fit into any other field.
P10 | Empty literals | dc:subject "" | Diminished chance of retrieval; record uninformative; potentially breaks completeness measure |  | Basic Retrieval | Report, Correct | Warning | Low | Check field for empty value | Regex, SHACL |
P11 | Missing sequence relations and reference integrity issues for hierarchical objects |  | Breaks display | Incorrect chaining of isNextInSequence relations; reference integrity | Browsing of hierarchical items | Report | Warning | Medium | Checking rules within a dataset | SHACL | Requires access to the full dataset
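
For P11, a dataset-level check of isNextInSequence chains could be sketched as below. It assumes the relations have already been extracted into a plain dictionary mapping each record identifier to the identifier it follows, and that the full set of identifiers in the dataset is available; the function name and the problem labels are hypothetical.

```python
def check_sequence(next_in_sequence, known_ids):
    """Return a list of (problem, record, target) tuples.

    next_in_sequence maps a record id to the id it claims to follow.
    known_ids is the set of all record ids present in the dataset.
    """
    problems = []

    # Reference integrity: the target must exist in the dataset.
    for record, target in next_in_sequence.items():
        if target not in known_ids:
            problems.append(("dangling", record, target))

    # Chaining: two records should not follow the same predecessor.
    first_follower = {}
    for record, target in next_in_sequence.items():
        if target in first_follower:
            problems.append(("branch", record, target))
        else:
            first_follower[target] = record

    # Chaining: walking back from a record must never revisit a record.
    for start in next_in_sequence:
        visited = {start}
        current = next_in_sequence.get(start)
        while current is not None:
            if current in visited:
                problems.append(("cycle", start, current))
                break
            visited.add(current)
            current = next_in_sequence.get(current)

    return problems
```

Because every record is walked individually, a cycle is reported once for each record that leads into it; deduplicating those reports is left out of the sketch.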
4. "Misuse of fields"
P12 | Extremely long values, e.g. using a long text as a title |  | Distorts search weightings; limits legibility to user |  | Basic Retrieval | Report | Informative | Low | Rule constraint | SHACL |
P13 | edm:type with same content as dc:type | dc:type - "Text", or "texte", or "texto" and edm:type "TEXT" | Slight distortion of search weightings |  | Basic Retrieval | Report | Informative | Low | Compare dc:type with edm:type | SHACL | See discussion in the other 'backlog' tab
P14 | Swapped thumbnail and full-size image | Basecamp thread | Slow loading? Poor image display? |  | N/A | Report, Correct | Warning | Low | Compare image technical metadata | SHACL | Would need to happen after the media checker is run
5. Wrong data
P15 | Wrong language tags | Back in 2014-02, 32 lang tags appeared in the database, of which 15 need to be corrected | See Column E | See #1247 | Basic Retrieval; Cross-language recall; Improved language facets | Report, Partially Correct | Warning | Low | Check compliance against a controlled list of terms | XSD, SHACL, Jena RIOT | "Partially Correct" means that for certain situations (e.g. the famous Spanish case, esp used instead of spa) some mappings can be defined, as they represent typical behaviour
P16 | Duplicate lang tags when only one is required | dc:title "Fisherman"@nl , "Pêcheur"@nl , "Visser"@nl ; dc:type "Patacon mould"@nl , "Pataconvorm"@nl , "Moule à patacon"@nl ; | See Column E | Another related problem is identified and described in the other tab and pending further discussion | Basic Retrieval; Cross-language recall; Improved language facets | Partially Report | Warning | Low | Check for cardinality constraints (one label per language) | SHACL | It may suffice to report a cardinality violation and expect that a Data Provider will recognize the reason for the issue
P17 | Term not matching a controlled list of terms | edm:provider "Europeana Food and DrinkEuropeana Food and Drink"; dc:language "esp" (instead of spa) | Depending on the nature of the problem, either filtering or retrieval may prove impossible |  | Basic Retrieval; Entity-Based Facets | Report | Warning | Medium | Check compliance against a controlled list of terms | Dictionary | Dictionary of key strings (Aggregators, languages, countries...)
P18 | Impossible dates | e.g. spring 20011 | Prevents creation of proper date filters; in this example from Europeana Fashion, the date is interpreted as the year 11 |  | Basic Retrieval | Report | Warning | High | Constraints on dates | SHACL |
P19 | Identifiers of controlled vocabularies used as values, when these vocabularies are not correctly provided | <iconclass=49G35(+52)> tools, instruments; laboratory equipment - scientific research | See Column E | Coming from the TF report on multilingual and semantic enrichment strategy | Basic Retrieval | Partially Report, Partially Correct | Warning | Low | Pattern matching | Regex, SHACL | For reporting it may be sufficient to look for a predefined set of code delimiters to flag as a warning. "Partially Correct" means to support the correction of patterns that have been "manually" recognized.
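
The controlled-list check with known corrections (P15, P17) and a simple date constraint (P18) could be sketched as follows. The code list, the correction mapping and the `max_year` default are illustrative assumptions; only the esp to spa mapping comes from the table itself.

```python
import re

# Assumed, illustrative subset of a controlled list of language codes.
CONTROLLED_LANGS = {"eng", "nld", "fra", "deu", "spa", "por"}

# Known, manually recognized corrections ("Partially Correct").
KNOWN_CORRECTIONS = {"esp": "spa"}


def check_language(value):
    """Return ("ok", code), ("correct", replacement) or ("report", None)."""
    code = value.strip().lower()
    if code in CONTROLLED_LANGS:
        return ("ok", code)
    if code in KNOWN_CORRECTIONS:
        return ("correct", KNOWN_CORRECTIONS[code])
    return ("report", None)


def implausible_years(value, max_year=2100):
    """Return digit runs that cannot be a plausible year.

    A run is flagged if it has more than 4 digits, or has exactly
    4 digits and exceeds max_year (an assumed upper bound).
    """
    flagged = []
    for run in re.findall(r"\d+", value):
        if len(run) > 4 or (len(run) == 4 and int(run) > max_year):
            flagged.append(run)
    return flagged
```

The date check is intentionally crude: it would also flag long identifiers that happen to appear in a date field, so its output is better treated as a warning than as an error.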
6. Normalization
P20 | Time period in specific formatting: 3200[ac]-2250[ac] | Example record | See Column E | The date is represented in Spanish or Portuguese, so it is not an error. However, normalizing dates to a consistent representation would help search and visualization | Basic Retrieval | Normalize |  | High | Normalization | SHACL, Stanford Core NLP | Normalization could be applied to make dates and date ranges uniform. A long list of date patterns and combinations, possibly language specific, is to be expected.
P21 | Agent role mentioned in different ways | dc:creator "John Doe (engraver)", "Jane Smith (printer)". And in German: "Jane Smith [drucker]" | See Column # |  | Entity-based Facets; Browse by Agent | Normalize (when possible) |  | Low | Pattern matching using grouping characters | Match+Replace using Regex | Correction may happen to standardize the way roles are mentioned in the text (using only parentheses).
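
The match-and-replace approach for P21 could be sketched as below. It assumes the role is the last bracketed group in the value and that square brackets and parentheses are the only grouping characters in use; the function name is hypothetical.

```python
import re

# A name followed by a single trailing role in () or [].
ROLE_IN_BRACKETS = re.compile(
    r"^\s*(?P<name>.+?)\s*[\[(](?P<role>[^\])]+)[\])]\s*$"
)


def normalize_agent(value):
    """Rewrite 'Name [role]' or 'Name (role)' as 'Name (role)'."""
    match = ROLE_IN_BRACKETS.match(value)
    if match is None:
        return value.strip()
    return f"{match.group('name')} ({match.group('role').strip()})"
```

For example, `normalize_agent('Jane Smith [drucker]')` yields `Jane Smith (drucker)`. Translating or mapping the role term itself (drucker, printer) is a separate step that this sketch does not attempt.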
7. Dependency to external resources
P22 | isShownBy used instead of isShownAt | Example Record | See Column D | The link appears to be for an image but redirects to a portal page | N/A | Report, Correct | Warning | Low | Media Checker | Media Extractor used for Technical Metadata |
P23 | HTTPS link with expired certificate | Assembla ticket | Same as broken link (says Europeana Ops, though the image could still be obtained) |  | N/A | Report | Warning | Low | Link Checker |  |
P24 | HTTPS link where HTTP suffices | Assembla ticket |  | When HTTPS doesn't work but HTTP of the same URL works, replace the link | N/A | Report, Correct | Warning | Low | Link Checker |  |
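
The correction rule for P24 could be sketched as follows. The actual reachability test belongs to the Link Checker and is not specified in the table, so it is passed in here as a hypothetical `is_reachable` callable; only the URL rewriting uses the standard library.

```python
from urllib.parse import urlsplit, urlunsplit


def http_variant(url):
    """Return the same URL with scheme http, or None if it is not https."""
    parts = urlsplit(url)
    if parts.scheme.lower() != "https":
        return None
    return urlunsplit(parts._replace(scheme="http"))


def suggest_link(url, is_reachable):
    """Keep a working link, fall back to its HTTP variant, else None.

    is_reachable is an assumed callable taking a URL and returning a bool.
    """
    if is_reachable(url):
        return url
    fallback = http_variant(url)
    if fallback is not None and is_reachable(fallback):
        return fallback
    return None
```

A `None` result corresponds to a link that should only be reported, since no working replacement was found.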
8. Serialization / format / encoding
P25 | Field should be literal but is URL | edm:dataProvider <Royal Museum for Central Africa> | Results in field treated as local-file URL |  | Basic Retrieval | Report, Correct | Error | Low | IRI Validation | SHACL, Jena IRI Validator |
P26 | Invalid URL | http://cdn.collectionsbase.org.uk/WAGMU/WAMS\bi511.jpg | In general terms: link may not be resolvable or may break tooling | Last slash should be forward, not backward. The server returns it, but it's not a valid URL. See more issues in the IRI investigation | N/A | Report | Error | Low | IRI Validation | SHACL, Jena IRI Validator |
P27 | Many URLs in one field | edm:isShownBy <http://foodanddrink.image.ntua.gr/image/DianthaOs/00010362_001.tif; http://foodanddrink.image.ntua.gr/image/DianthaOs/00010362_002.tif;.. | Field cannot be parsed; links cannot be resolved |  | N/A | Report, Correct | Error | Low | IRI Validation + Check IRIs for protocol ("http://") patterns within the relative part | SHACL, Jena IRI Validator |
P28 | Terribly long CHO/Aggregation URLs | Example Record | See Column D | While not a technical problem, this URL is ugly and IMHO has a low chance of being permanent | N/A | Report | Informative | Low | Measure URL length | SHACL |
P29 | Labels / attributes within field content | name=ambito culturale;value=bottega italiana | Distorts search metrics; results in many irrelevant words among real information | Do we know how frequent this is? | N/A | Report | Informative | High | Pattern matching | Regex, SHACL |
P30 | HTML in fields | http://www.europeana.eu/portal/record/2022611/H_DF_DF_4538 | Can create false positives in search; display unreadable | On the other hand, <br/> is needed for poetry or lyrics but the portal doesn't handle line breaks | Basic Retrieval | Report, Correct | Warning | Low | Pattern matching | Regex, SHACL |
P31 | Schematic data within fields | {"danish"=>["Vester Sakskærsgård"]} | Distorts search metrics; relevant values are cluttered with machine-readable formatting |  | Basic Retrieval | Report, Partially Correct | Warning | Medium | Measure the number of control characters (used in schemas, e.g. <>=/{}[]) present within a field against some threshold | Regex, SHACL | "Partially Correct" means to support patterns that have already been "manually" recognized.
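
The pattern-matching checks for P27, P30 and P31 could be sketched as below. The character set is the one listed for P31; the 0.15 threshold, the restriction to http and https schemes, and the function names are assumptions made for illustration.

```python
import re

SCHEME = re.compile(r"https?://", re.IGNORECASE)
HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")
SCHEMA_CHARS = set("<>=/{}[]")


def has_multiple_urls(value):
    """P27: more than one protocol pattern inside a single field value."""
    return len(SCHEME.findall(value)) > 1


def contains_html(value):
    """P30: something that looks like an opening or closing HTML tag."""
    return HTML_TAG.search(value) is not None


def schema_char_ratio(value):
    """P31: share of characters that are typical schema control characters."""
    if not value:
        return 0.0
    return sum(ch in SCHEMA_CHARS for ch in value) / len(value)


def looks_schematic(value, threshold=0.15):
    """P31: flag values whose control-character share reaches the threshold."""
    return schema_char_ratio(value) >= threshold
```

`looks_schematic` is meant for literal fields only: a plain URL contains enough slashes to come close to such a threshold, so link fields would need to be excluded or given a different limit.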
9. Content Quality (NB: DQC has decided that content quality is a lower-priority work item for the time being)
P32 | Visual quality criteria. Leaving aside subjective measures of aesthetic quality around the item and its digitisation, at the very least we need flags that say if images have watermarks, borders with extra non-image content such as colour bars, etc. |  |  |  | N/A | Report, Correct | Warning | High | Borders, sharpness, contrast, brightness, presence of noise etc. could be checked by integrating existing image processing; watermarks may be more difficult to detect reliably in still images (at least without risking false positives) | Image Processing Software |