1 of 19

BagIt

for data managers

2 of 19

problem space

use cases:

  • moving digital content (network and physical transfer)
  • storing digital content

while being able to:

  • detect changes in the content
  • associate basic metadata with the content

also:

  • not require changes to the content
  • super simple (easy to understand; no special tools)

3 of 19

motivation

interoperability.

(the kind that does not require standards committees or endless email exchanges to achieve, does not work only on some operating systems, and does not take more than 30 minutes to understand.)

4 of 19

problem space

that's it.

(if you think you're missing something profound, you're not.)

5 of 19

"bag" as metaphor

think of a package sent through the mail:

1. it is wrapped (the bag structure)

2. it contains the item being delivered (the payload)

3. it has labels on the outside about the sender and receiver (the tags)

4. it includes a list of the items being delivered (the manifests)

6 of 19

structure of a bag

<base directory>/
|   bagit.txt
|   manifest-<algorithm>.txt
|   [optional additional tag files]
\--- data/
      |   [payload files]
\--- [optional tag directories]/
      |   [optional tag files]

7 of 19

structure: base directory

  • contains tags and data directory
  • can have any name
    • common practice: use bag name/id for name

8 of 19

structure: payload

  • the contents of the bag
  • must be in data/ directory
  • spec dictates nothing about the payload (although there are some recommended practices)

9 of 19

structure: tags

  • contain metadata about the bag
  • located in base directory
  • some required or prescribed by spec:
    • bagit.txt
    • manifests
  • arbitrary tags allowed

10 of 19

structure of a bag: bagit.txt

  • example:

BagIt-Version: 0.97
Tag-File-Character-Encoding: UTF-8

  • required by spec
  • easy way to identify bags
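
A minimal sketch of how a script might use bagit.txt to identify a bag. The function name `read_bag_declaration` is hypothetical and not part of any BagIt tool; the sketch assumes the file is UTF-8 with one "Label: value" pair per line, as in the example above.

```python
from pathlib import Path


def read_bag_declaration(base_dir):
    """Return the bagit.txt fields as a dict, or None if base_dir is not a bag."""
    declaration = Path(base_dir) / "bagit.txt"
    if not declaration.is_file():
        return None
    fields = {}
    # utf-8-sig tolerates an optional byte-order mark at the start of the file.
    for line in declaration.read_text(encoding="utf-8-sig").splitlines():
        if ":" in line:
            label, value = line.split(":", 1)
            fields[label.strip()] = value.strip()
    return fields
```

A directory without bagit.txt returns None, so the same call doubles as an "is this a bag?" check.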

11 of 19

structure: payload manifest

  • example:

49afbd86a1ca9f34b677a3f09655eae9 data/27613-h/images/q172.png
408ad21d50cef31da4df6d9ed81b01a7 data/27613-h/images/q172.txt

  • file name of manifest indicates algorithm
    • for example: manifest-md5.txt
  • every payload file must be in at least one manifest
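
A payload manifest can be produced with a few lines of standard-library Python. This is a sketch, not the BIL implementation: `write_payload_manifest` is a hypothetical name, the two-space separator imitates md5sum output, and each file is read fully into memory, which would need chunking for large payloads.

```python
import hashlib
from pathlib import Path


def write_payload_manifest(base_dir, algorithm="md5"):
    """Write manifest-<algorithm>.txt listing every file under data/."""
    base = Path(base_dir)
    lines = []
    for path in sorted((base / "data").rglob("*")):
        if path.is_file():
            digest = hashlib.new(algorithm, path.read_bytes()).hexdigest()
            # Manifest paths are relative to the base directory and use "/".
            lines.append(f"{digest}  {path.relative_to(base).as_posix()}")
    manifest = base / f"manifest-{algorithm}.txt"
    manifest.write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return manifest
```

Passing `algorithm="sha256"` would write manifest-sha256.txt instead, since the file name is derived from the algorithm.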

12 of 19

structure: tag manifest

<base directory>/
|   bagit.txt
|   manifest-<algorithm>.txt
|   [tagmanifest-<algorithm>.txt]
\--- data/
      |   [payload files]
\--- [optional tag directories]/
      |   [optional tag files]

  • just like payload manifest, except for tag files
  • optional (but recommended)
  • every tag file may be in one or more tag manifests
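
The only real difference from a payload manifest is which files get listed. The sketch below assumes that a tag file is any file outside data/ and that tag manifests do not list themselves; `tag_manifest_lines` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def tag_manifest_lines(base_dir, algorithm="md5"):
    """Return "<checksum>  <path>" lines for the tag files in a bag."""
    base = Path(base_dir)
    lines = []
    for path in sorted(base.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(base)
        # Skip the payload: anything under data/ belongs in a payload manifest.
        if rel.parts[0] == "data":
            continue
        # Skip the tag manifests themselves (assumed not to list themselves).
        if len(rel.parts) == 1 and rel.name.startswith("tagmanifest-"):
            continue
        digest = hashlib.new(algorithm, path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {rel.as_posix()}")
    return lines
```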

13 of 19

structure: bag-info.txt

<base directory>/
|   bagit.txt
|   manifest-<algorithm>.txt
|   [bag-info.txt]
\--- data/
      |   [payload files]
\--- [optional tag directories]/
      |   [optional tag files]

  • contains bag metadata
  • optional (but recommended)
  • spec describes some fields; may add others
  • example:

Source-Organization: Spengler University
Organization-Address: 1400 Elm St., Cupertino, Ca, 95014
Contact-Name: Edna Janssen
Bagging-Date: 2008-01-15
External-Identifier: spengler_yoshimuri_001
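
Reading bag-info.txt is equally simple. The sketch below uses a hypothetical `parse_bag_info` function and assumes "Label: value" lines, with an indented line treated as a continuation of the previous value. It returns a list of pairs rather than a dict so that a repeated label is not lost.

```python
def parse_bag_info(text):
    """Parse bag-info.txt text into a list of (label, value) pairs."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and pairs:
            # Indented line: continuation of the previous value.
            label, value = pairs[-1]
            pairs[-1] = (label, value + " " + line.strip())
        elif ":" in line:
            label, value = line.split(":", 1)
            pairs.append((label.strip(), value.strip()))
    return pairs
```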

14 of 19

conformance: complete

  • has all required BagIt elements
  • every file in a payload manifest must be present
  • every payload file must be listed in a payload manifest
    • may be listed in more than one payload manifest
  • every file in a tag manifest must be present
    • although not every tag file has to be in a tag manifest
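
The payload half of the completeness check amounts to comparing two sets. This sketch covers only the payload rules, not the required elements or the tag manifests; `check_payload_complete` is a hypothetical name, and manifest lines are assumed to have the form "<checksum> <path>".

```python
from pathlib import Path


def check_payload_complete(base_dir):
    """Return (missing, unlisted) payload paths; both are empty when complete."""
    base = Path(base_dir)
    listed = set()
    for manifest in base.glob("manifest-*.txt"):
        for line in manifest.read_text(encoding="utf-8").splitlines():
            # Split on the first run of whitespace: checksum, then path.
            parts = line.split(None, 1)
            if len(parts) == 2:
                listed.add(parts[1])
    present = {
        p.relative_to(base).as_posix()
        for p in (base / "data").rglob("*")
        if p.is_file()
    }
    # missing: listed in a manifest but not on disk
    # unlisted: on disk under data/ but in no manifest
    return sorted(listed - present), sorted(present - listed)
```

The "unlisted" result is what catches files accidentally added to a bag after it was created.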

15 of 19

conformance: valid

  • complete
  • every checksum in a manifest can be verified against the corresponding file
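
Checksum verification can be sketched the same way. `failed_checksums` is a hypothetical name; the sketch assumes the algorithm named in the manifest file name is one that Python's hashlib recognizes (md5, sha1, sha256, ...) and reads each file fully into memory.

```python
import hashlib
from pathlib import Path


def failed_checksums(base_dir):
    """Return manifest paths whose file is missing or has a different checksum."""
    base = Path(base_dir)
    failures = []
    for manifest in sorted(base.glob("manifest-*.txt")):
        # The algorithm comes from the file name: manifest-md5.txt -> md5.
        algorithm = manifest.stem.split("-", 1)[1]
        for line in manifest.read_text(encoding="utf-8").splitlines():
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue
            expected, rel_path = parts
            target = base / rel_path
            if not target.is_file():
                failures.append(rel_path)
                continue
            actual = hashlib.new(algorithm, target.read_bytes()).hexdigest()
            if actual != expected.lower():
                failures.append(rel_path)
    return failures
```

An empty list means every payload checksum matched; it does not by itself show that the bag is complete.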

16 of 19

best practices

  • be aware of differences between operating systems (*nix vs. Windows)
  • be aware of differences between file systems
  • be aware of non-Latin characters in filenames
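
Two of these differences are easy to illustrate: Windows uses "\" as a path separator, and file systems may store the same accented name in different Unicode forms. The helpers below are hypothetical, and NFC is one possible normalization choice, not something this deck prescribes.

```python
import unicodedata
from pathlib import PureWindowsPath


def to_manifest_path(windows_path):
    """Convert a relative Windows-style path to the "/"-separated manifest form."""
    return PureWindowsPath(windows_path).as_posix()


def same_name(name_a, name_b):
    """Compare two file names after Unicode NFC normalization."""
    return unicodedata.normalize("NFC", name_a) == unicodedata.normalize("NFC", name_b)
```

Comparing normalized names avoids reporting a file as missing just because one system wrote "é" as a single character and another as "e" plus a combining accent.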

17 of 19

tools

  • bags can be created with standard system tools (for example, md5sum)
  • LC developed and supported:
    • BagIt Library (BIL): Java library and command-line tool
    • Bagger: GUI built on top of BIL
  • also available: Python, Ruby, PHP (and probably others)
    • See http://en.wikipedia.org/wiki/BagIt#Tools

18 of 19

experience at Library of Congress

  • our partners/vendors can produce bags with minimal overhead
    • some use available tools; others write their own
  • bags do catch errors
    • files accidentally added
    • drive errors
    • transmission errors
  • still support non-BagIt bags

19 of 19

more info

  • spec: http://tools.ietf.org/html/draft-kunze-bagit-08
  • info: http://en.wikipedia.org/wiki/BagIt
  • BIL/Bagger: http://sourceforge.net/projects/loc-xferutils/
  • discussion: http://bit.ly/Rxyg2A
  • this: http://bit.ly/SRYYDU
  • self: Justin Littman (jlit@loc.gov)