1 of 37

Free and Open Data: Where to find it and can you use it

Business Reference and Services Section (BRASS)

of the Reference and User Services Association (RUSA)

in the American Library Association (ALA)

March 18, 2020: Data in Libraries Webinar Series

Presenter: Jennifer C. Boettcher, Georgetown University

2 of 37

Jennifer C. Boettcher

Jennifer C. Boettcher and Leonard M. Gaines. Industry Research Using the Economic Census. Greenwood Press: Phoenix, AZ. 2004

M.B.A., Georgetown University, Washington, D.C., 2005

M.L.S., State University of New York, Albany, N.Y., 1992

B.A., University of New Hampshire, Durham, N.H., 1987

ALA RUSA BRASS Member since 1991

Georgetown University 1997-present

Founder of Business Information Finders (BIF) and Capital Area Business Academic Librarians (CABAL) in DC

2013 Emerald Research Grant: Zombie List (reanimated business sources)

Seeking contributors: https://boettcher.georgetown.domains/HisBusColl

2010 Gale Cengage Learning Award for Excellence in Business Librarianship

3 of 37

Librarian & Information Scientist

  • As a Librarian, I
    • Understand the source
    • Know how to find the source
    • Know the related subjects
    • Know how it’s connected to other sources
    • Know how to read it
    • Make connections between publisher and researcher
  • As a Librarian, I don’t
    • Publish the primary source
    • Have your context or expertise
    • Do statistical analysis
    • Interpret the data
    • Do data entry
    • Have legal expertise

These are my views and do not reflect those of Georgetown or RUSA.

Boettcher, J. C., & Dames, K. M. (2018). Government data as intellectual property:

Is public domain the same as open access? Online Searcher, 42(4), 42-48. 

4 of 37

Data Vocabulary

  • Datasets: Raw or statistical numbers; can be a flat file such as Comma-Separated Values (CSV) or a proprietary format like Excel (see Bobray Bordelon’s RUSA Presentation, March 4)
  • Metadata: Variables or fields in the record (example: Author)
  • Big data: Transactional (example: each checkout)
  • Visualization and Reports: Making sense of what the data means, usually as aggregated statistics (example: how many checkouts) (see Ryan Womack, March 25)
  • Application program interface (API): Software that sits between the data source and your computer, getting the data out and into your system (see Jeremy D. Darrington, April 15)

  • Open Data: Freely accessible data, created for a specific purpose
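To make the vocabulary above concrete, here is a minimal Python sketch of how transactional records (one row per checkout) turn into an aggregated report. The CSV content, the column names, and the function name `checkouts_per_branch` are all made up for illustration; they do not come from any real library system.

```python
import csv
import io
from collections import Counter

# Hypothetical transactional data: one row per checkout (illustrative values only).
RAW_CSV = """checkout_id,title,branch
1,Industry Research Using the Economic Census,Main
2,Public Knowledge,Main
3,Industry Research Using the Economic Census,East
4,Public Knowledge,Main
"""

def checkouts_per_branch(csv_text):
    """Return the field names and a count of checkouts per branch."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # The header row holds the variable (field) names, i.e. basic metadata.
    field_names = reader.fieldnames
    # Aggregating row-level transactions gives the statistic used in a report.
    counts = Counter(row["branch"] for row in reader)
    return field_names, counts

fields, report = checkouts_per_branch(RAW_CSV)
print(fields)
print(dict(report))
```

The header row plays the role of metadata, each data row is one transaction, and the counts are the kind of aggregated statistic a report would show.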

5 of 37

Adaptations of DIKW pyramid by US Army Knowledge Managers,

from https://en.wikipedia.org/wiki/DIKW_pyramid

Data are not:

Information

Technology

Digital

Analytics

Evidence

Research

Visualizations

Ideas

Data are

collected facts

“raw material”

6 of 37

Copyright provides the owner of copyright with the exclusive right to

  • Reproduce the work in copies
  • Prepare derivative works based upon the work
  • Distribute copies of the work to the public by sale or other transfer of ownership or by rental, lease, or lending
  • Perform the work publicly live or by means of a digital transmission
  • Display the work
  • Distribution of Public Domain collections called “collective works” or compilations

Copyright also provides the owner of copyright the right to authorize others to exercise these exclusive rights, subject to certain statutory limitations.

7 of 37

Copyright and Numeric Data

Facts are not copyrighted: “In no case does copyright protection for an original work of authorship extend to any idea, procedure, process, system, method of operation, concept, principle, or discovery, regardless of the form in which it is described, explained, illustrated, or embodied in such work.” 17 U.S.C. § 102(b)

In the US, collections of facts or data that fail to meet the minimum threshold of creativity are also ineligible for copyright protection, even if assembling such a collection takes significant time, effort, or resources (“sweat of the brow”).

Creative expression of data in a compilation is protected (Feist, 1991)

8 of 37

Public Domain: No Copyright Restrictions

Public Domain is not protected by intellectual property laws, like copyright. Anyone can use a public domain work without obtaining permission, but no one can ever own it.

Example: no longer protected due to age of creative work.

Works produced for the U.S. Government by its officers and employees should not be subject to copyright. The provision applies the principle equally to unpublished and published works. 17 U.S.C. § 105

REMEMBER: Public domain data must be attributed.

9 of 37

Data policy in the Federal Government

  • Federal government policy
    • Passed by Congress
    • Implemented by Executive Branch
    • Refined by Courts
  • Policy on data collection (priorities)
    • Mandated by law (in CFR) Public Law No: 115-435 (signed 1/14/19)
    • Implemented by regulations (Federal Register) 82 FR 52213
    • Directed by memorandum (Presidential) M-13-13
    • Standard of practice: Data plans (Agencies)

10 of 37

Caveats of Open Government Data

Why not?

  • Classified
  • Not widely distributed/web
  • Contracted or grant work
  • Lack of supposed interest, DoJ
  • Lack of funding, Census
  • Not kept, Record Retention

  • Free
  • Transparent
  • Accountable
  • Accessible to citizens
  • Engage all citizens
  • Machine Readable, API v Human
  • When in doubt openness prevails

11 of 37

Why Open Data exists

  • Funded research created for a specific purpose
  • Open Access is not intellectual property law; it is a license agreement from the copyright owner and a set of principles: CC0

12 of 37

Public Domain Vs. Open Access

  • Public Domain
    • US Law
    • Federal Government products
    • Data at any stage can be retrieved by FOIA
    • Some sub-nationals
    • Some countries
    • Some NGOs
  • Open Access
    • Decision of publisher/owner
    • Because of ownership of copyright
    • License (CC, GNU, etc)
    • Principles
      • Reuse and redistribution of the data
      • Allows derivative works as Open only
      • No restrictions on who can access and use
      • Electronically transferable
      • Machine-readable

13 of 37

Data as input and output

Input

  • FREE data
    • Public Domain
    • Open Data

  • Questionable status
    • Internet
    • Repositories

  • Commercial (licensed) sources
    • Grants, co-authors, permission

Output

  • Working with scholars
  • FAIR
    • Findable: The first step in (re)using data is to find them. Metadata and data should be easy to find for both humans and computers.

    • Accessible: Once the user finds the required data, she/he needs to know how they can be accessed, possibly including authentication and authorisation.

    • Interoperable: The data usually need to be integrated with other data.

    • Reusable: The ultimate goal of FAIR is to optimise the reuse of data.

14 of 37

Questions?

CC0, https://pixabay.com/en/hedgehog-child-young-hedgehog-1759027

15 of 37

Major Sources of Social Science Data in the US Government

16 of 37

Major Sources of Natural Science Data from the US Government

https://www.flickr.com/photos/notbrucelee/6897137283/in/photostream

17 of 37

P.E.S.T. Analysis for Industry

  • Political
    • Legislative
      • Congress.gov
    • Executive
      • Regulations.gov
    • Judicial
      • United States Courts

  • Economic
    • Sector Inflation
      • BLS’s Producer Price Index
    • Microeconomic trends
  • Socio-cultural
    • Norms & Ratios
      • IRS’s Statistics of Income
    • Peers and partners

  • Technology
    • Patents
      • Citation Analysis
    • Tech Transfer

18 of 37

Problems that come with government data

  • Beggars can’t be choosers
    • Too old
    • Not to the geographic level needed
    • Too detailed
    • Have to file a FOIA request
  • Compatibility
    • Standardization
    • Combining two datasets even from same source might not be possible
    • Combining two different sources must look at methodology
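The compatibility problem above can be checked before any analysis starts. The following Python sketch joins two small tables on a shared key and lists the records that fail to match. The county-style codes, the values, and the function name `join_report` are hypothetical and serve only to illustrate the check.

```python
# Hypothetical tables keyed by a geographic code; all values are made up.
employment = {"01001": 120, "01003": 340, "01005": 75}
establishments = {"01001": 12, "01003": 30, "01007": 9}

def join_report(left, right):
    """Join two keyed tables and report which keys do not line up."""
    # Keys present in both sources can be combined safely.
    matched = {key: (left[key], right[key]) for key in left.keys() & right.keys()}
    # Keys found in only one source signal a coverage or coding mismatch.
    only_left = sorted(left.keys() - right.keys())
    only_right = sorted(right.keys() - left.keys())
    return matched, only_left, only_right

matched, only_left, only_right = join_report(employment, establishments)
print(len(matched), "matched;", "unmatched:", only_left, only_right)
```

A clean key match is only the first test. Matching codes do not guarantee that the two sources used the same definitions, time period, or sampling method, so the methodology still has to be read.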

19 of 37

Major International Data Sources

By topic

By Country

National Statistical Offices

More data available in national language

Some charge for access

Citizens of that country might have free access

National Repositories/Archives

Historical

Datasets

20 of 37

Where to start

  • New Jersey https://data.nj.gov

  • United States https://www.data.gov

  • International http://data.un.org

  • Inter-university Consortium for Political and Social Research repository https://www.icpsr.umich.edu/icpsrweb/ICPSR/

21 of 37

Where to learn MORE

For Librarians

For Federal data

22 of 37

Let’s discuss

boettcher@georgetown.edu

202 687-7495

Twitter: @jenny.wombat

These slides are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

© Bill Waterhouse, with permission

AMSTAT images from

http://magazine.amstat.org/blog/2018/05/01/fy18fedbudget

23 of 37

Numeric Data: who is responsible

  • Analytics
    • Analysts
      • Knowledge
      • Context of numbers
  • Tacit Knowledge
    • Expert
      • Wisdom
      • “I Just know it”
  • Curation
    • Repository
      • Preserving access
  • Datum (singular of data)
    • Researchers
      • Data
      • Creation of numbers

  • Statistics
    • Publishers
      • Information
      • Relations among numbers

25 of 37

Vocabulary: Tools, Process, and Products

Datasets or compilation: Raw or statistical numbers; can be a flat file such as Comma-Separated Values (CSV) or a proprietary format like Excel

Metadata: Includes field descriptions for the dataset, found in codebooks

Schema: How data is organized or structured using standards, like classification

Application Program Interface (API): Read-only machine to machine querying, generally from JSON or XML files

Big data: Raw, unstructured data; normally transactional (example: each check out)

Natural Language Processing (NLP): Use for text analysis, not numeric data

Artificial Intelligence (AI): Includes predictive analytics and machine learning

Reports: Usually aggregated statistics based on big data (example: how many checkouts)

Data Visualization: Using software to visually communicate relationships and context of data

Open Data: Freely accessible data, created for a specific purpose; by-product of decision making or research
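As a rough illustration of the API entry above, this Python sketch turns a JSON response into records keyed by field name. The payload is a hypothetical example in a "header row plus data rows" layout with made-up values; it is not the output of any particular agency's API, and `rows_to_records` is an illustrative name.

```python
import json

# Hypothetical API response body: first row is the header, the rest are data rows.
PAYLOAD = '[["NAME", "POP", "state"], ["Alabama", "100", "01"], ["Alaska", "50", "02"]]'

def rows_to_records(payload):
    """Convert a header-plus-rows JSON array into a list of dictionaries."""
    header, *rows = json.loads(payload)
    # Pair each value with its field name so the metadata travels with the data.
    return [dict(zip(header, row)) for row in rows]

records = rows_to_records(PAYLOAD)
for record in records:
    print(record["NAME"], record["POP"])
```

In practice the payload would come from a documented endpoint, and the codebook would explain what each field means.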

26 of 37

F.A.I.R data

  • Findable: The first step in (re)using data is to find them. Metadata and data should be easy to find for both humans and computers.

  • Accessible: Once the user finds the required data, she/he needs to know how they can be accessed, possibly including authentication and authorisation.

  • Interoperable: The data usually need to be integrated with other data.

  • Reusable: The ultimate goal of FAIR is to optimise the reuse of data.

Mainly for scientific literature and in Europe

GoFAIR

27 of 37

Works produced for the U.S. Government: Lifecycle of Data

Policy Makers who ask the questions about what has to be found or measured

Researchers who design methods or experiments to collect the data and where the data and codebooks are created.

Statisticians who manipulate datasets using models and algorithms to see trends in longitudinal data and to interpret data at a moment of time in cross-sectional studies.

Analysts who see patterns using predictive analytics, seek the emerging relationships between the numbers, transforming data into information by giving it context.

Other Data Scientists will link graphics, statistical downloads, and application programming interfaces (APIs) to the researcher's raw data.

Writers and Data Visualization Designers, who use their imagination to apply their knowledge to make data understandable in reports, press releases, and other resources.

The federal agency will act as Publisher, putting the synthesized resources on its website for all to read, primarily for decision makers but also for citizens.

28 of 37

Funding for Federal Data Collection

NIH- National Institutes of Health (HHS)

NSF- National Science Foundation

AHRQ- Agency for Healthcare Research & Quality (HHS)

FDA- Food & Drug Administration (HHS)

BEA- Bureau of Economic Analysis (DoC)

BJS- Bureau of Justice Statistics (DoJ)

BTS- Bureau of Transportation Stat. (DoT)

Census- DoC

EIA- Energy Information Admin. (DoE)

ERS- Economic Research Service (DoA)

NASS- Nat. Agricultural Stat. Service (DoA)

NCES- Nat. Center for Education Stat. (ED)

NCHS- Nat. Center for Health Stat. (HHS)

NCSES- Nat. Center for Science and Engineering Stat. (NSF)

ORES- Off. of Research, Evaluation, and Statistics (SSA)

SOI- Statistics of Income (IRS)

Image from AmStat (permission pending)

29 of 37

One Statistical Office in US: Why Not?

1. Privacy: The Privacy Act of 1974, Confidential Information Protection and Statistical Efficiency Act of 2002 (CIPSEA), and Statistical Policy Directive No. 1 (2014) require agencies to ensure that the collection and maintenance of citizens' data is accurate, confidential, and within legal restrictions. With the records divided among different offices, there is less chance of everything being leaked at once.

2. Security: Along the same lines, fewer offices have access to any one set of data records. The more servers the data is spread across, the safer it is. When an exchange of information is necessary, laws and regulations among departments protect access to the data.

3. Integrity: The income you report to IRS might be different from what you report to the Census Bureau.

4. Methodology: Sometimes data must have a higher number of people questioned so the accuracy will be better; different methods of collection or sampling may be required.

5. Popularity: Anything being done by the government has a political dimension, especially funding for employees and for modernizing and updating technology, attractiveness of the research, and repetition of statistical programs by agencies.

30 of 37

Future of the Bureau of Labor Statistics

In danger: Nat. Longitudinal Sur., JOLTS, Am. Time Use Sur., Employee Benefits Sur., Census of Fatal Occupational Injuries, Evaluation $27M>$2M

Protected

Principal Federal Economic Indicators (PFEI) and programs written into or referenced by law for allocation or other purpose. 85% of budget

31 of 37

Administrative Data and the Freedom of Information Act (FOIA), 5 U.S.C. § 552, 1966

  • What to ask for
    • Anything unpublished by US government
    • Controlled Unclassified Information (CUI)

Read this from Archives

File at FOIA Online

Oversight: Office of Government Information Services

32 of 37

OMB’s Statistical Policy Directive No. 1

Executive agencies must:

  • Produce and disseminate relevant and timely information
  • Conduct credible and accurate statistical activities
  • Conduct objective statistical activities
  • Protect the trust of information providers

33 of 37

strategy.data.gov: Guidance from OMB

  • Govern and Manage Data as a Strategic Asset
  • Protect and Secure Data
  • Promote Efficient Use of Data Assets
  • Build a Culture that Values Data as an Asset
  • Honor Stakeholder Input and Leverage Partners

34 of 37

Open Government

US Federal

International

35 of 37

States and Cities

https://data.sonomacounty.ca.gov/dataset/SoCo-Data-PNG/3m9t-bc35

36 of 37

Legal issues

Data and IP

Licensing Data

37 of 37

Learning more

Government Sources

FDLP Academy

Accidental Government Librarian

DigitalGov from Digital Government Division of GSA

Standards for Born Digital images

Numerical Data

Public Knowledge: Access and Benefits (Information Today, 2016)

Innovations in Federal Statistics (National Academies, 2017)