Representing Data: CSV, XML, JSON & YAML

Just about any interesting setting where a data scientist may wish to operate consists of many systems operating independently. Each subsystem is typically designed and optimized for performing a specific task or set of tasks, and operates largely without regard to how other subsystems work or even what they’re working on. This division of labor amongst computer systems is known as abstraction. For instance, in the social e-commerce setting described earlier, the way web pages are rendered, how the databases storing transactions work, and how the internal search engine finds relevant results are all likely abstracted away from the data scientist trying to build a recommender system. The data scientist just needs to know that these things do work, and maybe what the input and output look like.

In this setting where individual components are insulated from the operations of their peers, communication becomes crucial. The system rendering the web pages must understand what the recommender system is trying to say, and the component that builds the recommender system must, for instance, understand how user information is represented. Fortunately, there exists a wide range of data representation^[1] and communication formats that have been developed and optimized over many years of system development. The existence of standards means that the data scientist need only know a few of the most common data formats in order to process data output from almost any system, and usually little more than a call to the appropriate library is needed when outputting whatever good things your data science is doing.

This document covers a few of the more common data representation formats: CSV, XML, JSON and YAML. The focus is on how common data structures are represented in each. Each of these formats has its pros and cons, but all are very common. Familiarity with these formats will enable the data scientist to consume a wide variety of data, and construct output that is readily acceptable in almost any operational setting.

CSV: Comma-Separated Values

This simple format is no doubt familiar to almost anyone who has done anything with a computer. Often used as a simplified input or output to spreadsheet applications, CSV consists of columns of plain text^[2] separated by a comma or some other delimiter, such as a tab character. Each row corresponds to an individual record--in our view of data from earlier, this may correspond to a particular object. Each column holds the values of a particular field, that is, the value of one attribute of the object in each row. Often the first row in CSV data is a “header row”, used to describe the contents of each column. Simple CSV is easily parsable and producible by any program--a library is not required to deal with this format; for example, in Python a simple .split(',') or ','.join( … ) is all that is required.

There are only two difficulties when dealing with CSV files. First, CSV files created by Microsoft Office software use a different character sequence to denote new lines than is typically used in Unix environments. Second, quotes must be used in order to deal with strings of text that contain the delimiting character. Python’s standard library includes a csv module, which will help deal with both cases (meaning you don’t have to install anything extra).
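To illustrate the quoting difficulty, here is a minimal sketch comparing a naive split with the standard csv module. The sample row is made up for illustration and is not part of the listing data.

```python
import csv
import io

# A made-up row whose description contains the delimiter, so it is quoted.
line = '4,scarf,"soft, warm scarf",$15.00'

# Naive splitting cuts the quoted field into two pieces.
naive = line.split(',')

# csv.reader understands the quoting rules and keeps the field intact.
parsed = next(csv.reader(io.StringIO(line)))

# csv.writer adds quotes only where they are needed when writing.
buffer = io.StringIO()
csv.writer(buffer, lineterminator='\n').writerow(parsed)
written = buffer.getvalue()

print(len(naive), naive)
print(len(parsed), parsed)
print(written, end='')
```

When reading from a real file rather than a string, the csv documentation suggests opening it with newline='' so that the module can handle the different line endings itself.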

While CSV’s simplicity makes it attractive in many situations, its capabilities are limited. It assumes that each record has a fixed number of attributes. This makes representing a less structured data collection, such as a set, list or map, difficult or impossible. More complex and varied data structures require a richer representation.

Below are CSV-formatted examples of Listing objects from our social e-commerce example, including a header row:

id,title,description,price

1,shoes,red shoes,$70.00

2,hat,a black hat,$20.00

3,sweater,a wool sweater,$50.00
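As a rough sketch of how such data might be consumed, the snippet below reads the listing rows above with csv.DictReader. Holding the data in a string and converting id and price to numbers are assumptions made for the example.

```python
import csv
import io

# The listing data from the example above, held in a string for simplicity.
raw = """id,title,description,price
1,shoes,red shoes,$70.00
2,hat,a black hat,$20.00
3,sweater,a wool sweater,$50.00
"""

# DictReader uses the header row as the keys for each record.
listings = list(csv.DictReader(io.StringIO(raw)))

# Every value arrives as text, so type conversions are up to the caller.
for row in listings:
    row['id'] = int(row['id'])
    row['price'] = float(row['price'].lstrip('$'))

print(listings[0])
```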

XML: Extensible Markup Language

XML is an incredibly rich and flexible data representation format. XML’s design focuses on markup--providing context to fields in plain text in much the same way as HTML. However, because XML is so flexible, it is also commonly used for serializing objects and data. XML allows the specification of “schemas”, a concise definition of the grammar used for some specific task. So there might be a shared schema for representing common financial data, and then different analysts and programmers need only know the schema, and not negotiate a data format for every interaction.

The advantage here is that when objects are written within the constraints of a schema, the objects are guaranteed to be valid--possessing the correct syntax. Similarly, a schema can be used when reading XML data into a computer system in order to filter out garbage and poorly formatted input. The complexity of the XML language actually makes this kind of validation useful--it is quite easy to compose invalid XML data, especially when typing objects by hand. However, while schemas are a useful component of XML and worth knowing about, the topic is way too broad to be covered here.

Now we’re going to get a little technical, but a little technicality is necessary to understand the basics of XML. In this class we will not concentrate on XML very much, so don’t worry if you don’t get all this on the first reading.

Like HTML, a primary component of XML is the tag. Tags begin with “<” and end with “>”, and are used to encapsulate certain types of information in a dictionary format. The first term in between the “< >” is the name of the tag, and can be thought of as an identifier for a particular data object. Additionally, there are other fields that may appear in the declaration of a particular tag; these will be discussed shortly.

There are two primary methods for using XML tags for data encapsulation: using matching start and end tags, or using tag attributes, often in an empty-element tag. For instance, <title>sweater</title> uses a start and end tag “title” to encapsulate the value of a title. Note that an end tag has a “/” immediately after its opening “<”. The information encapsulated between a start and end tag may be other nested tags representing deeper objects. An alternative way to encapsulate data is with tag attributes. For instance, we might write <listing title="sweater" />. Note that this tag doesn’t have an end tag partner--it is said to be an empty-element tag, and is distinguished by ending with “/>”, indicating that there is nothing following.

A start tag, its matching end tag, and everything between them together form an element (an empty-element tag is an element by itself); the name-value pairs inside a tag are its attributes. Both are used to serialize objects, often in a dictionary format--associating a key with a value. Typically when working with XML data, the entire set of information given is wrapped with an outer set of tags, and there is often a special XML tag preceding the actual content known as an XML declaration. For instance: <?xml version="1.0" encoding="UTF-8" ?>. This informs the parsing program of the specifics of the data to follow.
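The two encapsulation methods can be compared with a small sketch. It uses xml.etree.ElementTree from Python’s standard library (an assumption made so the example runs without extra installs); the two one-line documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Two hypothetical encodings of the same listing title.
nested = ET.fromstring('<listing><title>sweater</title></listing>')
with_attribute = ET.fromstring('<listing title="sweater" />')

# Text between a start and end tag is reached through the child element.
title_from_child = nested.find('title').text

# A tag attribute is read from the element's attribute mapping.
title_from_attribute = with_attribute.get('title')

print(title_from_child, title_from_attribute)
```

Either way the program ends up with the same key and value; the choice between nested tags and attributes is a matter of how the document was designed.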

Thus far, we have only discussed how to represent dictionary data structures in XML. In order to represent lists of objects, instances of a tag are provided in succession, often separated by newlines. For instance, one possible way to represent the listing objects presented above as an array of dictionaries in XML format would be:

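<listings>

<listing id="1" title="shoes" price="$70.00">

<description>red shoes</description>

</listing>

<listing id="2" title="hat" price="$20.00">

<description>black hat</description>

</listing>

<listing id="3" title="sweater" price="$50.00">

<description>a wool sweater</description>

</listing>

</listings>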

Clearly this is quite verbose and somewhat messy. Because of this verbosity, I would avoid using XML when possible. However, because XML has evolved into a standard format for data representation, it is important to know how to deal with it. There are a wide variety of Python libraries for both writing data into XML format and extracting data from XML formats. Writing out XML is a somewhat complex undertaking, and I won’t deal with it here. For reading XML, I recommend the lxml library included in the set of requirements for this class. This library also greatly simplifies the task of gathering data from HTML, and is generally a useful tool.
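As a hedged sketch of reading such data, the snippet below turns a small XML document into a list of Python dictionaries. It uses the standard library’s xml.etree.ElementTree instead of lxml so that it runs without extra installs; the document layout (a listings wrapper, attributes for id, title and price, and a nested description tag) is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

# A hypothetical document: tag attributes plus one nested tag per listing.
raw = """<listings>
  <listing id="1" title="shoes" price="$70.00">
    <description>red shoes</description>
  </listing>
  <listing id="2" title="hat" price="$20.00">
    <description>a black hat</description>
  </listing>
</listings>"""

root = ET.fromstring(raw)

listings = []
for node in root.findall('listing'):
    record = dict(node.attrib)  # copy the tag attributes into a dict
    record['description'] = node.findtext('description')
    listings.append(record)

print(listings)
```

As with CSV, every value comes back as text, so numeric fields still need converting by the caller.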

JSON: JavaScript Object Notation

JSON is another plain text object serialization format, based on the way objects are represented in the JavaScript programming language. Remember, serialization means representing complex data in a way that can be transferred between users and programs. JSON is extremely flexible, able to represent many common data structures. Unlike XML, JSON is easy for a person to read due to its simple syntax. Writing JSON by hand, however, is a bit more difficult. JSON is often used when sending data from a web application to JavaScript code running in a web browser. Because this is such a common task, JSON has caught on, becoming widely used and frequently encountered in practice--far beyond JavaScript tasks.

JSON has two basic data structures: dictionaries (maps) and lists (arrays). JSON treats an object as a dictionary where attribute names are used as keys into the map. Dictionaries are defined in a way that may be familiar to anyone who has initialized a Python dict with some values (or has printed out the contents of a dict): pairs of keys and values, separated by a “:”, with each key-value pair delimited by a “,”, and each entire object/record surrounded by “{ }”. Lists are also represented using Python-like syntax: a sequence of values separated by “,”, surrounded by “[ ]”. These two data structures can be arbitrarily nested, e.g., a dictionary that contains a list of dictionaries, etc. Individual values can be text strings surrounded by double quotes (" "), numbers, true/false, or null. Note that there is no native support for a “set” data structure. Typically a set is transformed into a list when an object is written to JSON, and that list is turned back into a set when the data is consumed. For instance, in Python: some_set = set(["a", "list", "here"]). Double quotes inside text fields are escaped with a backslash, like \". Note that when inserted into a file, by convention JSON objects are typically written one per line.
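The set-to-list round trip described above might look like the following sketch. The tags, open and owner fields are hypothetical additions used only to show a set, a boolean and a null value.

```python
import json

# A hypothetical record; "tags" is a set, which JSON cannot store directly.
shop = {"id": 1, "name": "josh-shop", "tags": {"shoes", "hats"},
        "open": True, "owner": None}

# Convert the set to a (sorted) list before serializing.
serializable = dict(shop, tags=sorted(shop["tags"]))
text = json.dumps(serializable)

# Reading it back yields a list, which is turned into a set again.
loaded = json.loads(text)
loaded["tags"] = set(loaded["tags"])

print(text)
```

Note how Python’s True and None are written as JSON’s true and null, and mapped back again on loading.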

Python’s standard library includes a json module. This is very useful for reading a raw JSON string into a dictionary; however, transforming that dictionary into an actual object, and writing an arbitrary object out into JSON, may require some additional programming. See here for an example.
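One possible shape for that additional programming is sketched below. The Shop class and the helper names shop_from_json and shop_to_json are hypothetical; a dataclass is just one convenient way to do it, not the only one.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class Shop:
    """Hypothetical object mirroring the fields of a shop record."""
    id: int
    name: str
    listings: List[int]


def shop_from_json(text: str) -> Shop:
    # json.loads gives a plain dict; its keys become constructor arguments.
    return Shop(**json.loads(text))


def shop_to_json(shop: Shop) -> str:
    # asdict turns the object back into a dict that json.dumps can handle.
    return json.dumps(asdict(shop))


shop = shop_from_json('{"id": 1, "name": "josh-shop", "listings": [1, 2, 3]}')
round_trip = shop_to_json(shop)
print(shop)
print(round_trip)
```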

Below are two example shop objects written in JSON format:

{"id":1, "name":"josh-shop", "listings":[1, 2, 3]}

{"id":2, "name":"provost", "listings":[4, 5, 6]}
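Because such objects are conventionally written one per line, a file of them can be read line by line. The sketch below uses an in-memory string as a stand-in for a real file, which is an assumption made to keep the example self-contained.

```python
import io
import json

# Stand-in for a file containing one JSON object per line.
stream = io.StringIO(
    '{"id":1, "name":"josh-shop", "listings":[1, 2, 3]}\n'
    '{"id":2, "name":"provost", "listings":[4, 5, 6]}\n'
)

# Parse each non-empty line independently of the others.
shops = [json.loads(line) for line in stream if line.strip()]

names = [shop["name"] for shop in shops]
total_listings = sum(len(shop["listings"]) for shop in shops)
print(names, total_listings)
```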

YAML: Yet Another Markup Language

YAML is yet another data serialization format, also called “YAML Ain’t Markup Language”, a name that stresses its focus on storing data, not marking up text. YAML emphasizes human readability and writability, maintaining a very simple yet flexible format. YAML actually has two valid formatting styles: a multi-line block format and a compact inline format. Like Python, the block format is sensitive to the indentation of individual elements--those elements on the same indentation level (or same left indentation) belong to a common data structure. Like JSON, YAML represents objects as a dictionary data structure and is capable of representing dictionaries and lists. Also like JSON, YAML supports compact “inline” representations of data structures. Note that it is usually not necessary to put quotes around strings of text.

YAML’s more common multi-line block format represents lists as a set of objects having a common indentation, each starting with a hyphen followed by a space. For instance:

- shoes

- hat

- sweater

That was a simple YAML list of three listing titles.

YAML’s more compact inline format for representing arrays looks very much like JSON’s array representation--items are separated by commas and surrounded by square brackets, “[ ]”. In this inline format, the above list would be: [shoes, hat, sweater].

YAML’s block format represents dictionaries as a set of key-value pairs with common indentation. Keys are separated from values by a colon, “:”. For instance, a dictionary could be used to represent a Listing object:

id: 1

title: shoes

description: red shoes

price: $70.00

Alternatively, YAML’s inline format looks almost identical to how JSON (and Python) represent dictionary data structures, with each key separated from its value by a “:”, each key-value pair being separated by a “,”, and the whole dictionary being surrounded by curly braces, “{ }”. For instance: {id: 1, title: shoes, description: red shoes, price: $70.00}. The difference between YAML’s inline dictionaries and JSON dictionaries is that YAML doesn’t require strings to be quoted. When you want to represent a “special” character in a string in a YAML representation, you must quote that string; inside double quotes, the special character can be “escaped” with a backslash (“\”).

When transmitting multiple objects together in YAML format, an outer array is typically used, with one object per entry in this array.

Below is an example of Shop objects represented in YAML format:

- id: 1

  name: joshshop

  listings:

    - 1

    - 2

    - 3

- id: 2

  name: provost

  listings:

    - 4

    - 5

    - 6
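To make the role of indentation concrete, here is a small hand-written sketch that renders a list of shop dictionaries in block format. The function to_yaml_block is hypothetical and deliberately limited (flat dictionaries, no quoting or escaping); in practice a YAML library such as PyYAML, which is not part of Python’s standard library, would normally be used instead.

```python
def to_yaml_block(records):
    """Render a list of flat dicts as block-style YAML.

    Illustration only: handles scalar values and lists of scalars,
    and performs no quoting or escaping.
    """
    lines = []
    for record in records:
        prefix = "- "  # a hyphen and a space start each list entry
        for key, value in record.items():
            if isinstance(value, list):
                lines.append(f"{prefix}{key}:")
                # nested list items are indented further than their key
                lines.extend(f"    - {item}" for item in value)
            else:
                lines.append(f"{prefix}{key}: {value}")
            prefix = "  "  # later keys line up under the first key
    return "\n".join(lines) + "\n"


shops = [
    {"id": 1, "name": "joshshop", "listings": [1, 2, 3]},
    {"id": 2, "name": "provost", "listings": [4, 5, 6]},
]
text = to_yaml_block(shops)
print(text, end="")
```

The keys of one shop share a left indentation, which is what tells a YAML parser that they belong to the same dictionary.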

[1] Often called data serialization

[2] Such as ASCII or Unicode