Concerns about opening up data, and responses which have proved effective
(or “How to make friends and get them to give you their data”)
Christopher Gutteridge, University of Southampton
Alexander Dutton, University of Oxford
This document is inspired by the open data excuses bingo card.
Someone asked which responses have proved effective. This document is a work in progress based on our experience. Carly Strasser has also written at the Data Pub blog about these issues from an Open Science and research data perspective, and this has inspired a video from Research Data Netherlands about addressing researchers’ data sharing concerns. You may also be interested in How to make a business case for open data, published by the ODI.
See also German translation of version from May 2013: http://is.gd/nIN6CY
We’ll get spam
- We don’t have to include email addresses in the data
- You already have those email addresses on the website
- We have a spam filter!
- We can make including your email opt in or opt out.
Terrorists might use the data
- This excuse is usually someone just trying to find a reason not to do it and generally won’t stand up
- In some cases there is a valid argument (nuclear, chemical, biological, weapons), or information that would make people a target, like the names of people working in a controversial field such as animal testing. These are legitimate concerns.
- The data can already be obtained by other, slightly harder means[a][b]. An example is the rally point for a building in the event of a fire. This is already posted on the walls of the building, and anyone could walk in and read it. In other words, a motivated person can already get this information, so there’s little or no added risk in making the data useful as well.
- Is there a subset of this data which would reduce/eliminate this risk?
- Can we redact the few records which are an issue?
People will contact us to ask about stuff
- This is usually an objection of people who feel overworked and that this isn’t part of their job. Escalating to a higher level of management can identify if this is a useful or time-wasting thing for them to be contacted about.
People will misinterpret the data
- Document how it should be interpreted
- Be prepared to help and correct such people; those that misinterpret it by accident will be grateful for the help
- Publishing may actually be useful to counter willful misrepresentation (e.g. of data acquired through Freedom of Information legislation), as one can quickly point to the real data on the web to refute the wrong interpretation.
It’s too big
- It probably isn’t as big as they think (unless you are at CERN).
- Data owners are probably proud of how big their dataset is, so don’t insult them by making out it’s not really very big.
- Ask if they mind you running some experiments (and gently show them it is practical to do)
It’s not very interesting
- By virtue of asking them, you probably already think their data is interesting; tell them why.
- Come up with some use-cases for how integrating it with other datasets might lead to increased value (e.g. “Which Oxbridge colleges offer which courses” can lead to plotting them on a map alongside the buildings in which those courses are taught, helping prospective students to choose a college).
- Let others judge how interesting or useful it is — even niche datasets have people that care about them. That said, don’t see “publish everything openly” as a goal in itself; take the low-hanging fruit and leave the genuinely less interesting stuff behind for now.[c][d]
We might want to use it in a research paper
- I’ve heard this about datasets produced in crystallography
- One option is to have an automatic or optional embargo; require people to archive their data at the time of creation but it becomes public after X months. You could even give the option to renew the embargo so only things that are no longer cared about become published, but nothing is lost and eventually everything can become open.
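The embargo idea above could be sketched roughly as follows. This is only an illustration, not a description of any existing repository system: the function name `is_public`, the `renewals` parameter and the 30-day approximation of a month are all assumptions made for the example.

```python
from datetime import date, timedelta


def is_public(archived_on: date, embargo_months: int, today: date, renewals: int = 0) -> bool:
    """Return True once the embargo (plus any renewals) has expired.

    Assumption: a month is approximated as 30 days to keep the sketch short.
    Each renewal extends the embargo by one more full embargo period.
    """
    embargo_days = 30 * embargo_months * (1 + renewals)
    return today >= archived_on + timedelta(days=embargo_days)
```

The point of the design is that data is archived at creation time either way; the only thing the owner controls is when the flag flips, so nothing is lost if they stop renewing.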
There’s no API to that system
- It doesn’t necessarily need a public API. What about behind-the-scenes database (SQL) interfaces? It may also be possible to export data as flat files and drop them into some well-known place for later transformation.
- As a last resort, you could scrape data from a public website
- Talk to the provider of that system. Open Data is a hot new buzzword and they may be interested in getting on that gravy train.
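As a rough sketch of the "database interface to flat file" route mentioned above, the following uses Python's built-in `sqlite3` and `csv` modules. SQLite stands in for whatever database the real system uses, and the function name `export_table_to_csv` is made up for this example.

```python
import csv
import sqlite3


def export_table_to_csv(conn: sqlite3.Connection, query: str, out_path: str) -> int:
    """Run a read-only query and write the result to a CSV file.

    Returns the number of data rows written (the header row is not counted).
    """
    cursor = conn.execute(query)
    # cursor.description holds one tuple per column; the first item is the column name.
    headers = [col[0] for col in cursor.description]
    rows = cursor.fetchall()
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(rows)
    return len(rows)
```

Run on a schedule, something like this drops a fresh flat file into the agreed location without the source system needing to know that open data exists.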
We’re worried about the Data Protection Act (UK Law)
- The DPA only covers data on people. If the data doesn’t contain anything to do with people, the DPA does not apply.
- Mirror what’s published in a non-machine-readable way.
- Strip out, aggregate or anonymise the bits that contain personal data
- Seek permission from data subjects to publish data about them (opt-in)
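The "strip out, aggregate or opt-in" options above might look something like this in practice. It is a minimal sketch: the field names in `PERSONAL_FIELDS` and both function names are hypothetical, and removing obvious identifiers is not by itself a guarantee of anonymity (small counts or unusual combinations of values can still identify someone).

```python
# Hypothetical field names; adjust to whatever the real dataset uses.
PERSONAL_FIELDS = {"name", "email", "phone"}


def strip_personal_fields(record: dict, consented: bool = False) -> dict:
    """Return a copy of the record without personal fields, unless the subject opted in."""
    if consented:
        return dict(record)
    return {key: value for key, value in record.items() if key not in PERSONAL_FIELDS}


def count_by(records: list, field: str) -> dict:
    """Aggregate: publish counts per category instead of individual rows."""
    counts = {}
    for record in records:
        counts[record[field]] = counts.get(record[field], 0) + 1
    return counts
```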
We’re not sure that we own it
- This is an issue for library catalogues: metadata often comes from various paid sources which don’t allow open republishing. Sometimes it’s not clear exactly which data came from where, so the entire dataset is tainted.
- What bits are we sure about? We could publish those.
- If we don’t own it, do we know who does?
- Do you have anything else which we definitely do own?
I don’t mind making it open, but I worry someone else might object
- This is a common deflection, rather than a genuine issue
- This implies that the person is nervous about being blamed for making an error
- Go up the management chain to find someone who can reassure them that they won’t get in trouble
- Ask for a less controversial subset
It’s too complicated
- This is a similar issue to “it’s too big”
- Don’t be too smug if it turns out it’s not that complicated; it could harm your professional relationship with the data provider.
Our data is embarrassingly[e] bad
- Many eyes will help you improve your data (e.g. spot inaccuracies)
- People will accept your data for what it is
- Offer to help the data owner to tidy up, or better maintain, their data. By providing a system in which owners are more easily able to curate their data, you could be doing them a favour.
- If you can minimise the risk, publishing open data is a very good way to motivate data providers to clean up their data, but the data needs to be visible so you need a web page to expose the data on so non-techies can see it (and see it’s wrong). This page is an ideal place to restate the procedure for making corrections. This does mean publishing initially flawed data, but is very effective. You can email people for years to ask them to improve their data and they won’t. The second they discover it’s visible to the public they beat down your door demanding their right to correct it.
It’s not a priority and we’re busy
- What are you busy with? Often we can find something we can help with, e.g. if you just helped us get the data you already have, we could make that map/tool/report you need.
Our lawyers want to make a custom license[f]
- This usually comes from not explaining to the lawyers what you’re trying to do
- Often there are Chinese whispers through the levels of management between techies and lawyers. See if you can talk with the lawyer directly; they may not be familiar with open licenses.
- Precedents can help. If you can show several peer organisations that have published under this open license you wish to use, people will be more confident.
It changes too quickly
- You may be able to set up real time data flows
- Publish the bits that don’t change that often
- Publish the metadata, with links to a machine-readable representation of the data (or an API) that isn’t part of your main data publishing platform
There’s already a project in progress which sounds similar
- Often these projects take years; we can do something cheap and cheerful now
- That project will only produce a tool (e.g. a mobile app), and that’s not quite the same thing
- Actually, working out how to extract the data for open publication can support that project and save them work later.
Some of what you asked for is confidential
- Which bits? Can they be excluded, leaving something that’s still useful?
- Is it actually confidential? (Is it already published on the web in some disaggregated way?)
I don’t own the data, so can’t give you permission
- Sometimes it’s as easy as just finding out who does own the data
- Sometimes nobody knows who owns the data. This often seems to occur when someone has moved into a post and isn’t aware that they are now the data owner.
- Going up the management chain can help. If you can find someone who clearly has management over the area the dataset belongs to they can either assign an owner or give permission.
- Get someone very senior to appoint someone who can make decisions about apparently “orphaned” data.
We don’t have that data
- Sometimes they do but don’t realise-- oh, that’s not data, it’s just a spreadsheet...
- Well, what data do you have for us instead?
- We have an obligation/requirement to keep that data, we can help you get started...
- But it’s on your website...
- they may discover they do have a database powering that, yay
- or maybe it’s hand maintained, in which case you could screen scrape, and maybe even reimplement their website as data-driven, making their job easier.
That data is already published via (external organisation X)
- Often external organisations won’t release the data under an open license, limiting its utility to the organisation that it’s about. Publishing it ourselves will mean that we can make full use of the data.
- Self-publishing means we can more accurately model the aspects that are of most interest to ourselves.
- Inaccuracies in the published data can be fixed in-house once noticed, leading to more accurate data (“many eyes”)
- Can we get that data from the external source, and link it with our other data? (if not, there’s a problem...)
- Can we get the same data that you send to the external organisation?
We can’t provide that dataset because one part is not possible
- We’ve seen entire requests be initially denied because of one small part which wasn’t available or was controversial.
- As well as the formal process for requesting the data, follow up with the humans at each stage, ideally in person, as email can be ineffective for removing confusion. Reassure them that if a bit is hard then it can be skipped (this may be quite an unusual situation)
What if something breaks and the open version becomes out of date?
- Plan how you might prevent this from happening (e.g. monitoring that the timestamps don’t get too stale)
- If appropriate, add disclaimers to the data to say that it’s provided without warranty, etc.
- Integrating the open data into internal data flows means that people will notice if the data becomes stale, reducing the chance that it remains so.
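The timestamp monitoring suggested above can be very simple. The sketch below assumes each published dataset records a timezone-aware "last updated" time; the function name `is_stale` and the idea of passing a maximum age per dataset are our own choices for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(last_updated: datetime, max_age: timedelta, now: Optional[datetime] = None) -> bool:
    """Return True if the dataset is older than the maximum age we are willing to tolerate.

    Assumption: last_updated is timezone-aware (UTC here), so it can be
    compared safely with the current time.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_updated > max_age
```

A scheduled job could call this for each dataset and email the owner (or show a warning next to the download link) when it returns True.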
We can’t see the benefit
What if we want to sell access to this data?[g]
- In some cases this is a valid concern, if selling this data is part of your business model
- To use an analogy[h][i][j], a Chinese take-away would go bust very quickly if it decided to make its food free, but it would also go out of business if it didn’t make its menu available for free. A middle ground is its recipes, which might be a valuable secret but are also of great interest to people with food intolerances, who might prefer to use that take-away if they knew everything in every dish.
- Help identify if this is a realistic prospect, or just being risk-averse
- Help identify practical benefits of opening the data. If there are none, why should they bother?
- For datasets that change, publishing now doesn’t preclude selling access later; consumers would then have to choose between paying and stopping using the out-of-date information.
If we publish this data, people might sue us
- If the concern is about information about people, make a policy and stick to it, e.g. we’ll correct/remove things in 7 days. (This is already embodied in one of the Data Protection Principles in the UK)
- If the concern is about errors which impact business, case law may help, but there’s not much yet. (let us know when there is some!)
- Ensure you can state what data you were providing on any given day, e.g. if someone claims they couldn’t find the nearest hospital in your data and someone died. It is very helpful to be able to show what data you were providing at that day and time.
- If the data is already available on your website or via an app, then are you really increasing risk by opening the data?
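One way to be able to say what you were publishing on a given day is to keep dated snapshots and look up which one was in effect. This is a minimal sketch under the assumption that every published version is archived with its publication date; the function name `snapshot_in_effect` and the version labels are invented for the example.

```python
from bisect import bisect_right
from datetime import date


def snapshot_in_effect(snapshots: list, on: date):
    """Return the label of the snapshot that was live on the given day.

    snapshots: list of (published_date, label) tuples, sorted by date.
    Returns None if nothing had been published yet on that day.
    """
    published_dates = [published for published, _ in snapshots]
    # bisect_right gives the number of snapshots published on or before `on`.
    index = bisect_right(published_dates, on)
    return snapshots[index - 1][1] if index else None
```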
We want people to come direct to us so we know why they want the data
- This is similar to “People might misinterpret the data” if the intention is to help people satisfy their information need.
- Well, OK, can the *metadata* or some minimal subset at least be made open?
- If there’s the option to open up the data, chances are that the information would already be available under Freedom of Information rules, wherein requestors aren’t required to justify or explain themselves. This still allows the owner to find out on a “best efforts” basis, but it’s important to remember that they won’t be able to find out on every occasion.
- An example case is a comms department “find an expert” service for arranging media interviews with your staff. In this case, rather than provide direct contact information, you can offer to make contact via a logged web form or a specific email address. This will be subverted if the people in the dataset are linked with your open phonebook dataset.
- This can be a tricky one as once you open your data you can’t track every instance of use. This can be very threatening to people who produce this data and need to be able to provide documented proof that they/their team are doing something of value, especially when staffing levels are being reduced. One option is to find someone in power to reassure them that their work is valued and that their management understand & accept that when the data is opened they are trading control for utility.
Setting a dangerous precedent
If we publish data once, then it will set a precedent of allocating staff time to continually publishing it, when we don’t have the resources internally to commit to doing so
- Generally, if you are using the data internally it’s not very hard to publish using the same workflow, though you should minimise the human effort in the process. Tag open records as you go.
- By publishing the data you make it available in the most convenient form to your own staff/users. If it’s not a secret why are you forcing them to log in?
- By making open data replace a current business process, rather than adding to the load of busy staff, it can make producing it neutral or even a saving.
Fraudsters use data against us
e.g. Knowledge of suppliers, payment dates etc can be used to sound convincing on the phone and get a payment redirected to a fraudster’s account.
- Information Commissioner says it is unlikely: http://ico.org.uk/~/media/documents/decisionnotices/2014/fs_50516271.pdf [link broken, do we have a new version?]
- We’ve had a couple of these objections, usually based on the idea that you can’t send in fraudulent requests to divert funds if you don’t know the names of the suppliers to which funds are being paid. That would be fine in theory, except that Tenders Electronic Daily requires public bodies to list contract awards, including supplier names, and Google exists (suppliers have a habit of making announcements about having won contracts). Countering this is about having proper anti-fraud measures in place in the organisation; it is not about the release of data.
[a]I think this could probably be expanded on - we've seen a few cases of "we can't do this" or "we won't ever make this public" when the information itself is already available, just buried in a PDF rather than as structured data.
[b]Yes, but what is a good strategy to get the potential data provider to see this logic?
[c]I've expanded this in response to http://datapub.cdlib.org/closed-data-excuses-excuses/#comment-13754. I'm not sure it's perfect, so suggestions and modifications welcome.
[d]One reason to share data is that it may be useful to others. Another completely different reason is that it shows how you conducted your (presumably published) research so that your results may be reproducible.
[e]It can help to overcome the embarrassment if they publish a short note on why the data is bad. For example: company identifiers weren’t checked at the time of submission, so we can’t be sure of their accuracy. In most cases there’s a good reason why the data has become compromised, and as long as the failures are clearly documented this challenge can be overcome.
[f]The underlying issue here is that success for researchers is advancing their field, whereas success for lawyers is not getting blamed for anything. That's not a criticism of lawyers, just a recognition of their role. They are essentially defensive, whereas researchers are creative. So it shouldn't surprise us that their and our instincts are often in conflict. From a lawyer's perspective, the ideal outcome would be that no-one ever does anything at all -- then no bad result can ever ensue. The question for researchers is how to either help lawyers see the researcher perspective, or what higher level of authority can be brought in to override lawyers' concerns.
[g]This objection boils down to: there is an opportunity cost in giving the data away (i.e. we lose the chance of income from selling it). The response is to be aware of the opportunity cost of NOT releasing the data -- all the things that won't happen if it's kept locked up. These may include useful collaborations, co-authorships on publications, publicity, etc.
[h]This is nearly an excellent analogy. The one place it breaks down is that food, being physical, has a cost for each copy -- unlike data. And that is the reason for a take-away's proper reluctance to make it freely available. Can a better analogy be found?
[i]I have tried talking about charging people for a copy of your business card, or some others, but this one seems to get the most a-ha! moments.
[j]What about an analogy of a film and the cinema timetable? It's a bit last century because no-one goes to the cinema anymore but it does solve the physical problem. - I'm Danny Kingsley @dannykay68 BTW
[k]About Official statistics which are open by nature (see Principles for official statistics agreed upon by nearly all countries) how do you effectively deal with a CC-By license where you have to systematically attribute the work; also statistical data are of little use without a minimum of meta data as otherwise it is just a number with a name.