Modern Data Formats - Introduction
Jakub Klímek
This work is licensed under a Creative Commons Attribution 4.0 International License.
Some developers think mainly in terms of apps...
[Diagram: "My computer" containing "My app", "My database" and "My data"]
Data as part of an app
In public administration...
[Diagram: a computer containing an app, its database and its data]
Data as part of an app
"Give me the data, I want to build another app."
"No, you only paid for an app."
Possible vendor lock-in
Suddenly, the app needs to work with another app
[Diagram: "His computer" (his app, his database, his data) and "Her computer" (her app, her database, her data) exchanging data directly: "I will send you that" / "I will send you this"]
Data independent of applications
[Diagram: two computers, each with its own app and database, and a "Data" element drawn outside of both apps]
OK, data is independent. More problems?
Use of improper formats for a given use case
Not following specifications
Point of this course
Conceptual view of data
Conceptual domain model
Independent of any particular technology or representation. Answers:
What real world entities are described?
What are their properties?
How are they connected?
Conceptual model can be discussed with non-IT personnel.
Example: National Open Data Catalog
Example: National Open Data Catalog - Dataset
Conceptual domain model - UML Class diagrams
Class: Catalog
This is saying: “There are things in the real world of type Catalog.”
Class: Catalog
Attributes: title, description, homepage
This is saying: “Each instance of a catalog has a title, a description and a homepage.”
Class: Contact point
Attributes: name, e-mail
This is saying: “There are contact points; each has a name and an e-mail.”
Association: contact point
This is saying:
Class: Dataset
Attributes: title, description
Association: dataset
This is saying:
Attributes of Dataset:
Association: part of
Association: contact point
Associations:
Association: distribution
Association: media type
Association: package format
Association: compression format
Representations in:
RDF, JSON, CSV
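As a rough illustration of one conceptual entity having several representations, the sketch below writes a single Catalog instance as JSON and as CSV. The attribute names come from the class diagram; the values and the choice of a Python dataclass are assumptions made only for this example.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass


@dataclass
class Catalog:
    """Conceptual class Catalog with the attributes from the diagram."""
    title: str
    description: str
    homepage: str


# Hypothetical instance; the values are made up for illustration.
catalog = Catalog(
    title="Example catalog",
    description="A catalog used only in this sketch",
    homepage="https://example.org/",
)

# Representation 1: JSON (hierarchical)
as_json = json.dumps(asdict(catalog), indent=2)

# Representation 2: CSV (tabular): a header row followed by one data row
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "description", "homepage"])
writer.writeheader()
writer.writerow(asdict(catalog))
as_csv = buffer.getvalue()

print(as_json)
print(as_csv)
```

Both outputs describe the same real-world thing; only the representation differs.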
Data models vs. data formats vs. data schemas
Data models - logical view of data - graphs
Resource Description Framework (RDF) model
Labeled Property Graph (LPG) model
[Figure (RDF): a graph whose nodes are IRIs (https://...) and literals such as "Coffee shop" and "Ann"]
[Figure (LPG): nodes labeled :Person (name: Ann, age: 16; name: Dan, age: 87; name: Steve, age: 24; name: John, age: 42), :CoffeeShop (Place) and :BrewingMethod (name: V60, duration: 3 minutes), connected by KNOWS, EMPLOYS, VISITS and SERVES relationships, some with a "since" property (2021-03-01, 2010-12-31, 2020-03-16)]
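The difference between the two graph models can be sketched with plain Python data structures. This is only an illustration: the example.org IRIs are hypothetical placeholders, and the assignment of the "since" date to one EMPLOYS relationship is an assumption, since the figure labels do not say which edge carries which date.

```python
# RDF: a graph is a set of (subject, predicate, object) triples.
# Subjects and predicates are IRIs; objects are IRIs or literals.
EX = "https://example.org/"  # hypothetical namespace
rdf_triples = {
    (EX + "ann", EX + "name", "Ann"),              # literal object
    (EX + "place", EX + "name", "Coffee shop"),    # literal object
    (EX + "place", EX + "employs", EX + "ann"),    # IRI object linking two nodes
}

# LPG: nodes and relationships carry labels/types and key-value properties.
lpg_nodes = {
    1: {"labels": ["Person"], "properties": {"name": "Ann", "age": 16}},
    2: {"labels": ["CoffeeShop"], "properties": {"name": "Place"}},
}
lpg_relationships = [
    # The date on this edge is an assumption made for the example.
    {"type": "EMPLOYS", "start": 2, "end": 1, "properties": {"since": "2021-03-01"}},
]


def rdf_objects(subject, predicate):
    """All objects of triples with the given subject and predicate."""
    return {o for s, p, o in rdf_triples if s == subject and p == predicate}


def lpg_employees(shop_id):
    """Names of the nodes reached by EMPLOYS relationships from a shop node."""
    return [
        lpg_nodes[rel["end"]]["properties"]["name"]
        for rel in lpg_relationships
        if rel["type"] == "EMPLOYS" and rel["start"] == shop_id
    ]


print(rdf_objects(EX + "place", EX + "employs"))
print(lpg_employees(2))
```

In the RDF sketch, a fact about an edge (such as a start date) would need extra triples, while the LPG relationship can hold it directly as a property.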
Data models - logical view of data - hierarchies/trees
Document Object Model (DOM)
JSON (both format and model)
[Figure (DOM tree): document > root <coffeeShops> with attribute "number" (text 2) > element <coffeeShop> > elements <name> (text "Place") and <employees> > element <employee> > elements <name> and <age>]
[Figure (JSON tree): array > object with keys "name" (value "Place") and "employees" > array > objects with key "name" (value "Ann")]
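The two tree models can be explored with the Python standard library. The documents below are a guess at what the figures depict, reconstructed from their labels, so the exact nesting and the values "Ann" and "16" are assumptions.

```python
import json
from xml.dom import minidom

# Assumed XML document matching the DOM figure labels.
xml_text = (
    '<coffeeShops number="2">'
    "<coffeeShop><name>Place</name>"
    "<employees><employee><name>Ann</name><age>16</age></employee></employees>"
    "</coffeeShop>"
    "</coffeeShops>"
)
document = minidom.parseString(xml_text)
root = document.documentElement                   # element <coffeeShops>
number = root.getAttribute("number")              # attribute node value
first_name = root.getElementsByTagName("name")[0]
shop_name = first_name.firstChild.data            # text node under <name>

# Assumed JSON document matching the JSON figure labels.
json_text = '[{"name": "Place", "employees": [{"name": "Ann"}]}]'
shops = json.loads(json_text)                     # array -> list
employee_name = shops[0]["employees"][0]["name"]  # object -> dict

print(root.tagName, number, shop_name)
print(employee_name)
```

In DOM, text and attributes are nodes of their own; in JSON, values sit directly under object keys or array positions.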
Data models - logical view of data - tables
Relational model
name | age | knows |
Ann | 16 | |
John | 42 | Steve |
Steve | 24 | |
Dan | 87 | Steve |

name | employee |
Place | Ann |
Place | John |
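A minimal sketch of the same two tables in a relational database, using the sqlite3 module from the Python standard library. The table names and the column name "shop" (for the coffee shop's name) are assumptions introduced for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person ("
    " name TEXT PRIMARY KEY,"
    " age INTEGER,"
    " knows TEXT REFERENCES person(name))"
)
conn.execute(
    "CREATE TABLE employment ("
    " shop TEXT,"
    " employee TEXT REFERENCES person(name))"
)

# Rows copied from the two tables above; an empty cell becomes NULL (None).
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [("Ann", 16, None), ("John", 42, "Steve"), ("Steve", 24, None), ("Dan", 87, "Steve")],
)
conn.executemany(
    "INSERT INTO employment VALUES (?, ?)",
    [("Place", "Ann"), ("Place", "John")],
)

# Connections between tables are expressed by matching values (a join).
employees = conn.execute(
    "SELECT p.name, p.age"
    " FROM employment e JOIN person p ON p.name = e.employee"
    " WHERE e.shop = ?"
    " ORDER BY p.name",
    ("Place",),
).fetchall()

print(employees)
```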
Data formats - physical view of data
How data using a certain data model is serialized into files or sent over a network
Graph
Hierarchical
Relational
Data schemas
Annotations and constraints applicable to instances of data formats, allowing the data to be better described and validated
CSV
RDF
JSON
XML
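To give a feel for what a schema does, here is a toy validator. It is not JSON Schema, XML Schema or any other real schema language, which are far richer; the "schema" is just a hypothetical mapping from required field names to expected Python types.

```python
# A toy "schema": required field name -> expected Python type.
CATALOG_SCHEMA = {"title": str, "description": str, "homepage": str}


def validate(record, schema):
    """Return a list of problems; an empty list means the record is valid."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"wrong type for {field}: expected {expected_type.__name__}"
            )
    return problems


good = {"title": "Catalog", "description": "Demo", "homepage": "https://example.org/"}
bad = {"title": 1, "description": "Demo"}

print(validate(good, CATALOG_SCHEMA))
print(validate(bad, CATALOG_SCHEMA))
```

The schema both documents the expected structure and lets a machine check data against it.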
Specific data formats using meta-formats
CSV, JSON, XML, ... are sometimes called meta-formats
They serve as “host” formats for use-case-specific formats
Data schemas are used to define these specific formats
JSON
CSV
XML
RDF
Generic data format properties
open vs. closed
machine-readability
binary vs. text-based
Open format?
Open vs. closed formats
Open
Specification available on the Web, freely accessible to anyone, with no limitation on its usage.
Closed
Examples
Machine-readable format
“All files are machine-readable, because all files are, in the end, read by machines”
❌ The second part is true, but that is not what machine readability is about.
Machine-readable format
“Open formats like CSV, XML, JSON or RDF Turtle and Excel .xlsx files are machine readable”
❓
Machine-readable format - CSV, JSON
,,,,,,,,,,,,
Back to TOC,,,,,,,,,,,,
r2 : R2. Do you have permanent residence in Brno?,,,,,,,,,,,,
,%,count,,,,,,,,,,
Yes,89.1%,1385,,,,,,,,,,
No,10.9%,169,,,,,,,,,,
TOTAL,100.0%,1554,,,,,,,,,,
"Total sample, Weight: Weight, base n = 1554",,,,,,,,,,,,
,,,,,,,,,,,,
Back to TOC,,,,,,,,,,,,
r3 : R3. For how long have you lived in Brno? ,,,,,,,,,,,,
,%,count,,,,,,,,,,
"ico","nazev","udaje","vymazDatum","zapisDatum"
"3571092","Nadace RK CARE","[{hlavicka=Spisová značka;zapisDatum=2014-11-20;hodnotaText=N 521/KSBR;udajTyp={kod=SPIS_ZN;nazev=spisová značka};spisZn={soud={kod=KSBR;nazev=Krajský soud v Brně};oddil=N;vlozka=521}}, {hlavicka=Název;zapisDatum=2014-11-20;hodnotaText=Nadace RK CARE;udajTyp={kod=NAZEV;nazev=název}}, {hlavicka=Sídlo;zapisDatum=2014-11-20;udajTyp={kod=SIDLO;nazev=sídlo};adresa={statNazev=Česká republika;obec=Lipůvka;castObce=Lipůvka;cisloPo=385;psc=67922;okres=Blansko}}, {hlavicka=Identifikační číslo;zapisDatum=2014-11-20;hodnotaText=3571092;udajTyp={kod=ICO;nazev=identifikační číslo}}, {hlavicka=Právní forma;zapisDatum=2014-11-20;hodnotaText=nad;udajTyp={kod=PRAVNI_FORMA;nazev=právní forma};pravniForma={kod=nad;nazev=Nadace;zkratka=nad}}, {hlavicka=Účel nadace;zapisDatum=2014-11-20;udajTyp={kod=UCEL_SUBJEKTU_SEKC
❓
Machine-readable format - XML
<?xml version="1.0" encoding="UTF-8"?>
<PvsRejstrikData rejstrik="" operace="1" xmlns="http://portal.gov.cz/portal/xsd/PvsRejstrikData">
<TYPE>datová sada</TYPE>
<NAZEV>Smlouvy SŽDC 2017</NAZEV>
<POPIS>Uzavřené smlouvy organizace Správa železniční a dopravní cesty (resort dopravy) v roce 2017</POPIS>
<HOMEPAGE></HOMEPAGE>
<PERIODICITY></PERIODICITY>
<SPATIAL_TYPE></SPATIAL_TYPE>
<SPATIAL_TYPE_TXT></SPATIAL_TYPE_TXT>
<SPATIAL_CODE></SPATIAL_CODE>
<SPATIAL_CODE_TXT>Česká republika</SPATIAL_CODE_TXT>
<THEME></THEME>
<THEME_TXT>-</THEME_TXT>
<KEYWORDS>smlouva</KEYWORDS>
<STAV>zpracováno 2017-03-29 15:54:05</STAV>
<PROBLEMY></PROBLEMY>
<x-priloha MimeTyp="application/xml" Jmeno="data.xml">PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiIHN0YW5kYWxvbmU9Im5vIj8+CjxkYXRhc2V0IHhtbG5zPSJodHRwOi8vcG9ydGFsLmdvdi5jei9wb3J0YWwveHNkL1B2c1JlanN0cmlrRGF0YSIgSUQ9IiIgb3BlcmFjZT0iMSI+CiAgPHRpdGxlPlNtbG91dnkgU8W9REMgMjAxNzwvdGl0bGU+CiAgPGRlc2NyaXB0aW9uPlV6YXbFmWVuw6kgc21sb3V2eSBvcmdhbml6YWNlIFNwcsOhdmEgxb5lbGV6bmnEjW7DrSBhIGRvcHJhdm7DrSBjZXN0eSAocmVzb3J0IGRvcHJhdnkpIHYgcm9jZSAyMDE3PC9kZXNjcmlwdGlvbj4KICA8YWNjcnVhbFBlcmlvZGljaXR5PlIvUDFNPC9hY2NydWFsUGVyaW9kaWNpdHk+CiAgPHNwYXRpYWw+CiAgICA8dHlwZT5TVDwvdHlwZT4KICAgIDxub3RhdGlvbj4xPC9ub3RhdGlvbj4KICA8L3NwYXRpYWw+CiAgPHRlbXBvcmFsPgogICAgPHN0YXJ0RGF0ZT4yMDE3LTAxLTAxPC9zdGFydERhdGU+CiAgICA8ZW5kRGF0ZT4yMDE3LTEyLTMxPC9lbmREYXRlPgogIDwvdGVtcG9yYWw+CiAgPGtleXdvcmQ+c21sb3V2YTwva2V5d29yZD4KICA8ZGlzdHJpYnV0aW9uPgogICAgPGFjY2Vzc1VSTD5odHRwOi8vd3d3Lm1kY3IuY3ovTURDUi9tZWRpYS9vdGV2cmVuYWRhdGEvc21sb3V2eS8yMDE3L3NtbG91dnlfc3pkY18yMDE3LmNzdjwvYWNjZXNzVVJMPgogICAgPGRvd25sb2FkVVJMPmh0dHA6Ly93d3cubWRjci5jei9NRENSL21lZGlhL290ZXZyZW5hZGF0YS9zbWxvdXZ5LzIwMTcvc21sb3V2eV9zemRjXzIwMTcuY3N2PC9kb3dubG9hZFVSTD4KICAgIDxmb3JtYXQ+dGV4dC9jc3Y8L2Zvcm1hdD4KICAgIDxsaWNlbnNlPmh0dHBzOi8vcG9ydGFsLmdvdi5jei9wb3J0YWwvb3N0YXRuaS92b2xueS1wcmlzdHVwLWstZHMuaHRtbDwvbGljZW5zZT4KICA8L2Rpc3RyaWJ1dGlvbj4KPC9kYXRhc2V0Pgo=</x-priloha>
</PvsRejstrikData>
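The x-priloha element above hides a second document inside Base64 text, so a consumer has to decode it before it can read the actual dataset metadata. A small sketch decoding only the first 48 characters of that element's content (copied from the example above):

```python
import base64

# First 48 characters of the x-priloha element content shown above.
attachment_prefix = "PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgi"

# Base64 maps every 4 characters to 3 bytes, so 48 characters give 36 bytes.
decoded = base64.b64decode(attachment_prefix).decode("utf-8")

print(decoded)
```

The decoded text is the start of another XML declaration: an XML document wrapped inside an XML document.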
❓
Machine-readable format - XLSX
❓
Machine-readable format
Machine readability is not a property of a format
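One way to see this: two files in the same format (CSV) can differ greatly in how easily a program can use them. The first string below reuses lines from the survey export shown earlier; the second is a hypothetical tidy version of the same numbers, with column names invented for this sketch.

```python
import csv
import io

# Presentation-style CSV: a title row, an unnamed first column, a TOTAL row.
presentation_csv = (
    "r2 : R2. Do you have permanent residence in Brno?,,,\n"
    ",%,count,\n"
    "Yes,89.1%,1385,\n"
    "No,10.9%,169,\n"
    "TOTAL,100.0%,1554,\n"
)

# Hypothetical tidy CSV: one header row, one observation per row.
tidy_csv = (
    "question,answer,count\n"
    "r2,Yes,1385\n"
    "r2,No,169\n"
)

# The tidy file can be consumed generically, with no knowledge of its layout.
tidy_rows = list(csv.DictReader(io.StringIO(tidy_csv)))
total = sum(int(row["count"]) for row in tidy_rows)

# The presentation file also parses as CSV, but its rows mix a title,
# a header, data and a total, so any consumer needs file-specific logic.
presentation_rows = list(csv.reader(io.StringIO(presentation_csv)))

print(total)
print(presentation_rows[0])
```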
Binary vs. text based formats
“Binary format means that the file is stored as 1s and 0s”
-- a student at a recent state exam
❌ This is, of course, also true for text-based file formats.
Binary vs. text based formats
Binary files
Text-based files
Text-based formats - character encoding - US-ASCII
Character encoding - representation of characters as binary sequences (numbers)
US-ASCII uses 7 bits to represent one character
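A short sketch of what this means in practice: every US-ASCII character corresponds to a number below 128, which fits in 7 bits. The sample word is arbitrary.

```python
text = "Data"

# Each character maps to a number (its code) ...
codes = [ord(ch) for ch in text]

# ... and each number fits in 7 binary digits.
bits = [format(code, "07b") for code in codes]

# The encoded file content is just these numbers stored as bytes.
encoded = text.encode("ascii")


def fits_in_ascii(value):
    """True if every character has a code below 128 (7 bits)."""
    return all(ord(ch) < 128 for ch in value)


print(codes)
print(bits)
print(fits_in_ascii("Data"), fits_in_ascii("Klímek"))
```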
Original by: User:Vanessaezekowitz, CC BY-SA 3.0, via Wikimedia Commons
Text-based formats - newline representations
CR - carriage return - \r
LF - line feed - \n - Unix/Linux, macOS
CR LF - both of them - \r\n - Windows
See all variants (Wikipedia)
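The sketch below shows the same two lines of text stored with each newline convention, and why splitting on "\n" alone is fragile. The sample content is made up.

```python
samples = {
    "LF": b"name,age\nAnn,16\n",
    "CR LF": b"name,age\r\nAnn,16\r\n",
    "CR": b"name,age\rAnn,16\r",
}

# str.splitlines() recognizes all three conventions.
lines_per_style = {
    style: raw.decode("ascii").splitlines() for style, raw in samples.items()
}

# A naive split on LF leaves a stray CR at the end of each Windows line.
naive = samples["CR LF"].split(b"\n")

print(lines_per_style)
print(naive)
```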
Text-based formats - character encoding - UTF-8
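1 to 4 bytes represent one character (code point)
the single-byte range U+0000 to U+007F is identical to US-ASCII
the most frequently used non-ASCII characters in European languages (e.g. Czech á, č, ř) use 2 bytes
most emojis use 4 bytes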
Number of bytes | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
1 | U+0000 | U+007F | 0xxxxxxx | | | |
2 | U+0080 | U+07FF | 110xxxxx | 10xxxxxx | | |
3 | U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
4 | U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
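The table can be checked against a few sample characters, one per row. The characters are arbitrary picks for this sketch.

```python
# One sample character per row of the table: 1, 2, 3 and 4 bytes.
samples = ["A", "ř", "€", "😀"]

byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}


def bit_pattern(ch):
    """UTF-8 bytes of one character, written as groups of 8 bits."""
    return " ".join(format(byte, "08b") for byte in ch.encode("utf-8"))


for ch in samples:
    print(f"U+{ord(ch):04X}", byte_lengths[ch], bit_pattern(ch))
```

The leading bits of each printed byte match the table: 0 for a single byte, 110, 1110 or 11110 for the first byte of a longer sequence, and 10 for every continuation byte.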
Character encoding - BOM - Byte order mark
Magic number (U+FEFF) at the beginning of a text file
Indicates the byte order of UTF-16 and UTF-32 text; in UTF-8 it only signals the encoding
Most data formats use UTF-8 without BOM
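A brief sketch of what a UTF-8 BOM looks like and what happens when a reader does not expect it. Python's "utf-8-sig" codec writes the BOM and strips it when reading; the sample text is made up.

```python
import codecs

text = "name,age\n"

without_bom = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")  # same bytes, prefixed with EF BB BF

# A plain UTF-8 reader keeps the BOM as an invisible first character,
# which can corrupt, for example, the first CSV column name.
read_naively = with_bom.decode("utf-8")

# A BOM-aware reader strips it.
read_bom_aware = with_bom.decode("utf-8-sig")

print(with_bom[:3] == codecs.BOM_UTF8)
print(repr(read_naively))
print(repr(read_bom_aware))
```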
Character encoding - other encodings
In Czechia, from legacy systems mainly
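The slide does not name the encodings, so the sketch below assumes two that are commonly met in Czech legacy data, Windows-1250 and ISO-8859-2. It shows the same word producing different bytes and what goes wrong when a reader guesses the wrong encoding.

```python
word = "šest"

# Assumed legacy encodings (not named in the slides).
cp1250_bytes = word.encode("cp1250")
latin2_bytes = word.encode("iso-8859-2")
utf8_bytes = word.encode("utf-8")

# Reading Windows-1250 bytes as ISO-8859-2 silently yields the wrong text.
misread = cp1250_bytes.decode("iso-8859-2")

# Reading them as UTF-8 fails outright.
try:
    cp1250_bytes.decode("utf-8")
    utf8_guess_ok = True
except UnicodeDecodeError:
    utf8_guess_ok = False

print(cp1250_bytes, latin2_bytes, utf8_bytes)
print(misread == word, utf8_guess_ok)
```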
Standardization
Standards - for data formats and other things
Why do we need standards?
Interoperability is costly. For each dataset:
[Figure: a group of engineers working with one shared spec]
Low interoperability is even costlier! For each dataset:
[Figure: a larger group of engineers working with eight separate specs]
Internet Engineering Task Force - IETF
Open standards organization
IETF Working Groups
Internet Engineering Steering Group (IESG)
Internet Society - ISOC
American non-profit
“to promote the open development, evolution, and use of the Internet for the benefit of all people throughout the world”
World Wide Web Consortium - W3C
International standards organization for the WWW
Specification maturation process
Internet Corporation for Assigned Names and Numbers - ICANN
Standards organization
MIME-Type, Media-type
Multipurpose Internet Mail Extensions (MIME) type
The list: Media Types
Managed by the Internet Assigned Numbers Authority (IANA)
Examples:
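As a hedged illustration (the file names are hypothetical), Python's standard mimetypes module maps file extensions to media types, and a media type itself splits into a top-level type and a subtype:

```python
import mimetypes

# Hypothetical file names; only the extension matters for the guess.
guesses = {
    name: mimetypes.guess_type(name)[0]
    for name in ["catalog.json", "datasets.csv"]
}


def split_media_type(media_type):
    """Split 'type/subtype' into its two parts."""
    top_level, subtype = media_type.split("/", 1)
    return top_level, subtype


print(guesses)
print(split_media_type("text/csv"))
```

The mapping comes from tables shipped with Python and the operating system, so results for less common extensions can vary between machines.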
Ecma International
Standards organization
Examples:
RFC 2119 - Key words for use in RFCs to Indicate Requirement Levels
MUST, REQUIRED, SHALL
MUST NOT, SHALL NOT
SHOULD, RECOMMENDED
SHOULD NOT, NOT RECOMMENDED
RFC 5234 - Augmented Backus-Naur Form (ABNF)
Example
fragment = *( pchar / "/" / "?" )
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
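One way to read these rules is to translate them, rule by rule, into a regular expression. This is a sketch rather than a complete URI validator; the constant and function names are made up for the example. Lowercase hex digits are accepted because ABNF string literals are case-insensitive.

```python
import re

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = r"[A-Za-z0-9\-._~]"
# pct-encoded = "%" HEXDIG HEXDIG
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"
# sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
SUB_DELIMS = r"[!$&'()*+,;=]"
# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR = rf"(?:{UNRESERVED}|{PCT_ENCODED}|{SUB_DELIMS}|[:@])"
# fragment = *( pchar / "/" / "?" )   ("*" means zero or more repetitions)
FRAGMENT = re.compile(rf"(?:{PCHAR}|[/?])*")


def is_fragment(text):
    """True if the whole text matches the 'fragment' rule."""
    return FRAGMENT.fullmatch(text) is not None


for candidate in ["nose", "", "a%20b", "a b", "100%", "a#b"]:
    print(repr(candidate), is_fragment(candidate))
```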
Identifiers
URI, URL, IRI, URN
URI - Uniform Resource Identifier - RFC 3986
URN - Uniform Resource Name - RFC 8141, IANA URN namespace registry
URL - Uniform Resource Locator - RFC 3986
IRI - Internationalized Resource Identifier - RFC 3987

         foo://example.com:8042/over/there?name=ferret#nose
         \_/   \______________/\_________/ \_________/ \__/
          |           |            |            |        |
       scheme     authority       path        query   fragment
          |   _____________________|__
         / \ /                        \
         urn:example:animal:ferret:nose
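The components in the diagram can be extracted with urlsplit from Python's standard urllib.parse module. A minimal sketch using the two example identifiers above:

```python
from urllib.parse import urlsplit

parts = urlsplit("foo://example.com:8042/over/there?name=ferret#nose")

print(parts.scheme)     # scheme
print(parts.netloc)     # authority
print(parts.path)       # path
print(parts.query)      # query
print(parts.fragment)   # fragment

# The authority can be split further into host and port.
print(parts.hostname, parts.port)

# A URN has a scheme and a path, but no authority.
urn = urlsplit("urn:example:animal:ferret:nose")
print(urn.scheme, urn.netloc == "", urn.path)
```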
RFC 3986 - Uniform Resource Identifier - examples
RFC 3987 - IRI - Internationalized Resource Identifier
Examples
Percent-encoding
The same examples of IRIs percent-encoded into URIs
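A small sketch of percent-encoding with Python's standard urllib.parse module. The IRI path is a hypothetical example: each non-ASCII character is first encoded as UTF-8, and each resulting byte is written as "%" followed by two hex digits.

```python
from urllib.parse import quote, unquote

# Hypothetical IRI path containing non-ASCII characters.
iri_path = "/kůň"

# "ů" is C5 AF and "ň" is C5 88 in UTF-8; "/" is kept as it is.
uri_path = quote(iri_path, safe="/")

# Decoding reverses the process.
round_trip = unquote(uri_path)

print(uri_path)
print(round_trip)
```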
RFC 3492 - Punycode
IRIs are not to be confused with IDNs (internationalized domain names):
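A sketch of the difference, using Python's built-in "punycode" and "idna" codecs and a made-up domain name. Note that the "idna" codec implements the older IDNA 2003 rules, so it serves here only as an illustration.

```python
# Punycode turns a Unicode label into an ASCII-only string.
label = "bücher"
puny = label.encode("punycode")

# In a domain name, each non-ASCII label is Punycode-encoded and
# prefixed with "xn--"; ASCII-only labels stay unchanged.
domain = "bücher.example"  # hypothetical domain
ascii_domain = domain.encode("idna")

print(puny)
print(ascii_domain)
print(ascii_domain.decode("idna"))
```

Percent-encoding, by contrast, applies to the other parts of an IRI (path, query, fragment), not to the domain name.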
Data types
Common data types in text-based structured data formats
The same data types are used in all common formats: RDF syntaxes, XML, JSON, CSV
Based on the XML Schema data type system
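In a text-based format every value arrives as a string (its lexical form) and the data type says how to interpret it. The sketch below maps a handful of XML Schema data types to Python parsers. It is simplified: for instance, xsd:date also allows a time zone suffix, which this parser would reject, and the function names are made up for the example.

```python
from datetime import date
from decimal import Decimal


def parse_boolean(lexical):
    """xsd:boolean allows exactly the lexical forms true, false, 1 and 0."""
    if lexical in ("true", "1"):
        return True
    if lexical in ("false", "0"):
        return False
    raise ValueError(f"not an xsd:boolean: {lexical!r}")


# Data type name -> function turning the lexical form into a value.
PARSERS = {
    "xsd:string": str,
    "xsd:integer": int,
    "xsd:decimal": Decimal,
    "xsd:boolean": parse_boolean,
    "xsd:date": date.fromisoformat,
}


def parse(lexical, datatype):
    """Interpret a string according to the given data type."""
    return PARSERS[datatype](lexical)


print(parse("42", "xsd:integer"))
print(parse("2021-03-01", "xsd:date"))
print(parse("true", "xsd:boolean"))
```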