Attribution, Open IP and Highly Factual, Highly Granular Data

Michael Collinson, OpenStreetMap Foundation, February 2011

2011-02-01 This is a draft of a white paper. It still contains some specific references to the ODbL rather than treating the subject purely generally, but may be of interest as it encapsulates the main issues. It has been made available to Creative Commons as a result of discussions between the OpenStreetMap License Working Group and their executives.

Abstract:

The requirement for attribution on highly factual, highly granular data appears a basic, innocent courtesy but, from practical experience, can be highly problematic to the encouragement of communality in data sharing in an open environment. This paper explores the issues from the perspective of the OpenStreetMap project and proposes that attribution must be thought of in distinct levels. The proposal is made that forcible attribution of data can play an important role in Open IP but that it should be unambiguously restricted to indirect attribution chains using Level 1 and 2 attribution only. Interested parties can then follow attribution back to more distant sources using an "attribution chain".

An Introduction to Open IP and Data

Sharing data is the new kid on the block in terms of Open IP, joining software and creative works. Like both areas before it, there is and will be debate on what form licensing should take and new license types are emerging and will go through a maturation process.

Data is characterised by three things:

More Granular
Highly factual - creativity may be absent or difficult to establish
Potentially high mutability

In terms of attribution, the focus of this paper, it is granularity that makes data a very different animal.

Software has a clear edge around it. With a GNU-licensed word-processor, there is an obvious distinction between using it to make and sell your award-winning novel (a GNU license makes no claims on that) versus improving the software and selling it (a GNU license definitely has something to say about that). With data, the line between using and improving/adding to it is blurred because you will often be doing both together.

Coarse granularity issues do occur in software when considering the incorporation of software libraries, addressed by the GNU Library General Public License (now GNU Lesser General Public License www.gnu.org/copyleft/lesser.html). However, in data this must be thought of as the general rule and not an exception.

The next pioneering wave in Open IP was open licensing of creative works such as photos and pieces of text. Creative works have less of an edge round them than software, but nevertheless it is trivial to take a quote from Wikipedia, put quotes around it, mention the source and so maintain both the integrity of your own project and that of Wikipedia.

But data are not creative works per se and need to be thought of as a completely independent area. Consider OpenStreetMap's map data. How do you flag that a single latitude and longitude in a book's text or diagram comes from OpenStreetMap? Is there any value in that? What is the project information that should properly be made available to OpenStreetMap and what is none of OpenStreetMap's business?

Why Open IP Data Attribution is Good

Usable open data is only available for folks to use if someone, as an individual or as an organisation, CONTRIBUTES data, continues to contribute, continues to improve it, and continues to encourage others to contribute. That only happens if folks are happy contributing. Attribution is acknowledgement and kudos for hard work. It is a key substitute in the open IP world for monetary compensation.

It advertises a project or resource to a wider audience.

Academic/scientific citation, tracing data to source, assessment of reliability, quality control.

Why Open IP Data Attribution is Bad

In a nutshell, it can go out of control with the sheer number of potential attributees.

Imagine you are an end user. You make a small extraction of (geo)data to make a graphic from ABC. You know from ABC's website that ABC's dataset contains data from or derived directly from third parties who are requesting some form of attribution. There is a long, long list of them. In the current open licensing environment:

You don't know whether your extraction intersects with a contribution from any given third party.

You don't know whether the third party is expecting you to attribute them regardless or whether there is intersection with their original data.

You probably don't have the resources to go through each third-party's attribution clause and figure out exactly how they want you to attribute them relative to the particular end medium you want to use.

If you are a school, club or just an ordinary individual wanting to do the right thing, you are probably now completely confused.

Third-Party Attribution Levels

At OpenStreetMap, we have developed the idea of attribution levels. We have found Level 1 to be practical and desirable; Level 2 to be practical with certain qualifications; but Levels 3 and 4 to be impractical and worth explicitly moving away from.

To make some headway in answering the poor end user's dilemmas raised above, major progress can be made by thinking about attribution in terms of distinct defined levels. Let us maintain the example of ABC which makes available a database of highly factual data under Open IP.

Level 1 (Primary Attribution):

ABC acknowledges its third party sources on its website, or by whatever equivalent means future technology or social trends provide. There is no attempt to get end users of ABC's data to do the same.

OpenStreetMap finds this very practical to implement. We offer this in our contributor terms, as distinct from our end user license, since it will survive any future end user/distribution license change ... including to a public domain-type with no attribution clause at all.

Level 2:

When any ABC data is "published", i.e. extracted from an ABC website electronically or distributed on some sort of hard media such as a book or recording disk, there is something physically present in the material transferred that acknowledges third parties.

That something can be one of three things:

Explicit attribution list. A complete list of third party sources incorporated in the ABC database plus their preferred attribution language.
Link to attribution list. A link back to ABC's level 1 attribution statement, most likely a web URL.
Data tagging. Attribution or source references for third parties are tagged onto individual data elements in the actual data.

An extraction could be the entire database or a single element ... the amount of attribution meta-data added therefore needs to be small. At the current level of network bandwidth, explicit attribution lists are therefore impractical for extraction calls for only single data elements due to their relative size.
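The size argument can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration only: the URL, the wrapper field names, and the invented list of 300 sources do not describe any actual OpenStreetMap or ABC interface.

```python
import json

# Hypothetical URL standing in for ABC's Level 1 attribution page.
ATTRIBUTION_URL = "https://abc.example/attribution"


def with_link_attribution(elements):
    """Wrap extracted elements with a single link back to the Level 1 page."""
    return {"attribution": ATTRIBUTION_URL, "elements": elements}


def with_explicit_list(elements, sources):
    """Wrap extracted elements with the full third-party attribution list."""
    return {"attribution": sources, "elements": elements}


def overhead_bytes(wrapped, elements):
    """Bytes of attribution metadata added on top of the bare elements."""
    bare = len(json.dumps({"elements": elements}).encode("utf-8"))
    return len(json.dumps(wrapped).encode("utf-8")) - bare


# One extracted element and an invented list of 300 third-party sources.
element = [{"id": 1, "lat": 59.33, "lon": 18.06}]
sources = [f"Hypothetical Source {i}, used with permission" for i in range(300)]

link_cost = overhead_bytes(with_link_attribution(element), element)
list_cost = overhead_bytes(with_explicit_list(element, sources), element)
print(f"link: {link_cost} bytes, explicit list: {list_cost} bytes")
```

The link adds a fixed, small cost regardless of how many third parties exist, whereas the explicit list grows with every new source and quickly dwarfs a single-element extract.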

It is OpenStreetMap policy to implement at least link attribution, to encourage our individual contributors to tag data elements with source references, and to respect such tags as already exist, i.e. not delete them. However, data tagging can only be best effort, as not all contributors will perform it and source tags may get changed or deleted over time, as our dataset is by policy 100% mutable.

Level 3:

End-users re-distributing a copy of the ABC database or a derivative database are required to maintain any third-party attribution information intact.

This is messy in the case of very small extracts and in source tags, but not impossible.

However, this forces future generations of OSMers to require this in perpetuity (in reality about 135 years given the current age of many contributors). Consistent source tagging, and appending to source tags rather than changing them, helps this on a best-effort but not guaranteed basis.
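A minimal sketch of "appending rather than changing" follows. The function name, the `source` key and the semicolon separator are assumptions chosen for illustration, and the example source names are invented.

```python
def append_source(tags, new_source, key="source", sep=";"):
    """Return a copy of tags with new_source appended to the source tag.

    Existing source values are never replaced or removed, only extended.
    """
    updated = dict(tags)
    existing = [s.strip() for s in updated.get(key, "").split(sep) if s.strip()]
    if new_source not in existing:
        existing.append(new_source)
    updated[key] = sep.join(existing)
    return updated


# An element already carrying a third-party source reference.
road = {"highway": "residential", "source": "Hypothetical Agency X"}
edited = append_source(road, "survey")
print(edited["source"])
```

Because the earlier value is preserved, a later editor who refines the element does not silently erase the third party's attribution. Nothing in such a helper prevents another tool or contributor from deleting the tag, which is why this remains best effort.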

Level 4:

End-users have to acknowledge third-parties in maps they make. Here, we are making a distinct jump from a set of numbers and descriptors to what would generally be recognised as a creative work. This is potentially specific to the OpenStreetMap experience and not generalisable, i.e. Level 4 is really just a special case of Level 3.

Work in progress, this is currently too specific to OpenStreetMap's Open Database License:

If it was intended for the extraction of the original data, then it is a database and not a Produced Work. Otherwise it is a Produced Work.

I am vehemently opposed to this for any form of highly granular data. Even if individual contributors are excluded, requiring a list of several hundred sources is not practical and will become worse when OSM data itself is just one of several sources used to make a map. Regrettably, imported CC-BY's "at least as prominent" for each source exacerbates this situation.

Attribution Chains

Work in progress

The idea of an Attribution Chain is that a derived work may be the result of many sources going back in a chain. The derived work uses data from source A, which uses data from source B, which uses data from multiple sources ... and so on. Rather than forcing the work to attribute all sources, it just attributes A using Level 1 attribution. Since the modern Internet provides an easy mechanism, any interested party can simply follow the attribution chain back across websites or their future replacements.

There is, of course, a danger that the chain can be broken by one source failing to properly attribute or completely ceasing publication. Requiring or encouraging Level 2 attribution as well provides a certain amount of insurance.
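Following a chain, and noticing where it breaks, can be sketched as a simple graph walk. This is an illustrative sketch only: the `follow_chain` function and the dictionary of Level 1 statements are hypothetical, and in practice each step would mean reading a publisher's attribution page rather than looking up a key.

```python
from collections import deque


def follow_chain(start, attributions):
    """Walk attribution statements breadth-first from start.

    attributions maps each publisher to the sources named in its own
    Level 1 statement. Returns (sources reached, broken links), where a
    broken link is a publisher with no statement available to read.
    """
    reached, broken = [], []
    visited = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current not in attributions:
            # No statement found: the chain cannot be followed past here.
            broken.append(current)
            continue
        for source in attributions[current]:
            if source not in visited:
                visited.add(source)
                reached.append(source)
                queue.append(source)
    return reached, broken


# Invented example: the derived work attributes only A; D has ceased publication.
statements = {
    "derived work": ["A"],
    "A": ["B"],
    "B": ["C", "D"],
    "C": [],  # an original source that attributes nobody
}
reached, broken = follow_chain("derived work", statements)
print(reached, broken)
```

The derived work itself carries a single attribution, yet every upstream source remains discoverable as long as each link publishes its statement. A source that disappears, such as D here, shows up as a break beyond which nothing can be traced.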