Digital Archiving Made Simple
September 10 and 12, 2019
John Sarnowski
Director, The ResCarta Foundation
Milwaukee, Wisconsin
+1-608-514-1958
john.sarnowski@ResCarta.org
Marcia Sarnowski
Former Consultant (Retired), Winding Rivers Library System
+1-608-385-4171
Marcia@johnsarnowski.com
ABSTRACT
This is an introductory session on the use of open/free software to create, validate, index, search, display, and maintain a digital archive of various materials including photographs, postcards, newspapers, books, audio recordings and videos.
Software Used in this presentation
Open Source and Free Software
Audacity – (Windows-OSX-Linux)
http://audacity.sourceforge.net/
Digital Library for Earth System Education jOAI - (Windows-OSX-Linux)
https://github.com/NCAR/joai-project
FFMPEG – (Windows-OSX-Linux)
http://ffmpeg.org/
Ghostscript – (Windows-OSX-Linux)
http://www.ghostscript.com/
Gimp – (Windows-OSX-Linux)
http://www.gimp.org/downloads/
Java – (Windows-OSX-Linux)
https://adoptopenjdk.net/
ResCarta Toolkit – (Windows-OSX-Linux)
https://rescarta.org/
CMU Sphinx – (Windows-OSX-Linux)
https://cmusphinx.github.io/wiki/
Tomcat8 - (Windows-OSX-Linux)
https://tomcat.apache.org/download-80.cgi
VLC media player - (Windows-OSX-Linux)
http://www.videolan.org/
Tech Days East 2 www.ResCarta.org
Introductions
The ResCarta Foundation
Creating a Digital Archive
Installing Software
Metadata types
    Descriptive
    Structural
    Administrative
    Textural
Adding metadata to photographs
Adding metadata to paged materials (Newspapers, Books, Yearbooks)
Convert Data
Editing Textural Metadata
    Optical Character Recognition (OCR)
    Automatic Audio Transcription (AAT)
Collect Digital Objects
Indexing Data for high speed retrieval
Host/Share Collections
The Web Site
WEB SERVERS/PORTS/ADDRESSES
Share Metadata
    Explore OAI/PMH
Maintain the Archive
    Checksum validation
    Lots of Copies Keeps Stuff Safe
Appendix
Introductions
Good morning! Who are we? Why are we here?
The ResCarta Foundation
The foundation is a not for profit corporation which produces, distributes and supports open source software to create and help maintain local digital archives of culturally important materials.
The ResCarta-Toolkit and ResCarta-Web applications have assisted in the creation of hundreds of digital archives worldwide.
The software is free and open for use by anyone. See https://www.rescarta.org
Creating a Digital Archive
A digital archive is more than scanned images, sound files or videos stored in a computer system. A digital archive is an organized collection of files containing digital objects. Digital objects are made up of media files, metadata and checksums. Media files can be in formats such as TIFF, JPEG, PDF, PNG, WAV, MP4, OGG, JPEG2000 or DXF.
Metadata can include descriptive, structural, administrative, or textural information about the media files and should be expressed in a known open format readable by humans and machines (e.g. XML).
Checksum files contain a short series of digits computed from the contents of a media file. Recomputing the checksum later and comparing it with the stored value detects errors in the data.
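To make the checksum idea concrete, here is a minimal Python sketch. It assumes SHA-256 as the algorithm purely for illustration; the algorithm the ResCarta tools actually use is not stated here, and the sample bytes are made up.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()

# Stand-in for the contents of a media file (illustrative bytes only).
original = b"scanned page image bytes"
stored = checksum(original)

# Flip a single bit in the first byte to simulate storage damage.
damaged = bytes([original[0] ^ 0x01]) + original[1:]

print(checksum(original) == stored)  # True: file unchanged
print(checksum(damaged) == stored)   # False: damage detected
```

Even a one-bit change produces a different digest, which is what makes a later comparison meaningful.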
We will be using previously created digital files from a Samples directory during this session.
Contents of the SAMPLES directory structure:
JPG - Output from a digital single-lens reflex (SLR) camera (PerFile) and scans of postcards (PerDir)
TIF - Historic steamboat photographs (PerFile) and books and newspapers (PerDir)
PDF - Documents from the internet (PerFile) and images from microfilm or paper originals (PerDir)
WAV - Recordings from records (Music) and Wisconsin Public Radio news broadcasts (NEWS)
MP4 - Video from VHS tape, with and without SRT (SubRip Text) basic subtitle files
Installing Software
We will install new software for this workshop. It will take a few moments, and then we can locate the list of installed tools. Installing well-developed software should be a simple process.
We have installed the following software:
ResCarta Toolkit - an open source suite of programs for creating digital archives
Ghostscript - a support program if you want to convert or view PDF or PostScript files
Apache™ Tomcat - a full featured web server
ResCarta-Web - a web application which will serve your archive data. You will find this application at C:\Program Files\RcTools-7.0.3\apache-tomcat-8.5.31\webapps\ResCarta-Web
Documentation – See the PDF files at C:\Program Files\RcTools-7.0.3\docs
Sample archive - a (very) small example RCDATA01 directory will be installed at C:\Program Files\RcTools-7.0.3\apache-tomcat-8.5.31\webapps\ResCarta-Web\RCDATA01
Metadata types
Descriptive
Descriptive metadata describes a resource for purposes such as discovery and identification. It can include elements such as title, abstract, author, and subjects.
Structural
Structural metadata indicates how compound objects are put together, for example, how pages are named or gathered to form chapters.
Administrative
Administrative metadata provides information to help manage a resource, such as when and how it was created, file type and other technical information.
Textural
Textural metadata is the text produced by Optical Character Recognition (OCR) or Automatic Audio Transcription (AAT).
Metadata standards of MODS/METS/MIX/AudioMD/reVTMD from the Library of Congress and the National Archives will be discussed and created for text, photographs and multimedia sources.
Adding metadata to photographs
Let’s open 1_ ResCarta Metadata Creation Tool, then choose /File/Open Data Directory (Alt FO). Select “OBJECT PER FILE” from the dialog, then browse to \Samples\JPG\PerFile-NikonCamera
When the application opens the directory, it shows a listing of image file names and one item called “Chicago Tribune Building Plaque”.
Click on the “CREATE METADATA” button to start the process of creating metadata.
A dialog will open asking for the “Source Institution”. Press the ADD button and enter the following information.
Name = ResCarta
ID = wionrfi0 (MARC Institution Code)
Then click the “OK” button to continue. An “Aggregator” and a “Root id” are requested. These are used to create the directory structure of the digital archive:
\RCDATA01\{Institution}\{Aggregator}\{ObjectID}
Enter “2010911” as the “Aggregator” and allow the software to set the Root identifier.
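The directory layout described above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and the 8-digit zero-padded object ID is an assumption based on the object paths that appear later in this handout.

```python
from pathlib import PureWindowsPath

def object_directory(volume: str, institution: str, aggregator: str,
                     object_number: int) -> PureWindowsPath:
    """Build \\{volume}\\{Institution}\\{Aggregator}\\{ObjectID} for one object."""
    # Assumption: object IDs are 8-digit, zero-padded numbers.
    object_id = f"{object_number:08d}"
    return PureWindowsPath(volume) / institution / aggregator / object_id

path = object_directory(r"\RCDATA01", "wionrfi0", "2010911", 1)
print(path)  # \RCDATA01\wionrfi0\2010911\00000001
```

Knowing this pattern makes it easier to find a specific object on disk when a later step asks you to drill down into RCDATA01.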
The next dialog asks for the TYPE of material you will be creating. Select Photo from the Pulldown.
Now we are presented with a form where we can begin to enter our information. We will continue to work with the Metadata Creation Tool to add metadata to the sample data. See pages 20-22 of the ResCartaToolkitGuide-7.0.pdf for Template and Carry Forward features.
Adding metadata to paged materials (Newspapers, Books, Yearbooks)
Open 1_ ResCarta Metadata Creation Tool, then choose /File/Open Data Directory (Alt FO).
Select “OBJECT PER DIRECTORY” from the dialog, then browse to \Samples\TIF\PerDir\
When the application opens the directory, it shows a listing of publications and one item called “Temple Daily Telegram”.
Note that all objects already have metadata in this case.
Click on the title “Temple Daily Telegram” to make it the active object.
In the page information pane you will see that this object has been broken into sections recreating the STRUCTURE of the newspaper.
You can collapse a section by clicking the key icon on the left side of the STRUCTURE display.
We’ll continue to work with the Metadata Creation Tool to add/modify STRUCTURAL metadata to the sample data.
Open the Samples\PDF\PerDir\ location and see the “Native Times” newspaper. Add Volume and Issue information to this newspaper type. Correct the pagination of the “Wisconsin Jubilee”.
Open the Samples\WAV\News directory for an example of Audio metadata.
Do the same for video by opening Samples\MP4\ in Object Per Directory mode.
Convert Data
The processes of creating and converting data are complex. These issues will be discussed in detail, and a reasoned, simplified approach will be provided. File types, metadata locations, directory structures and file naming are covered here.
Let’s open the 2_Data Conversion Tool.
The Data Conversion Tool opens in “Object Per File” mode, which has many options. Let’s change the “Source Data Type” pull-down to “Object Per Directory”. There, that’s better! (Less complex?)
We set the “Source Data directory” to \Samples\JPG\PerDir-Postcards\ and the “Destination Directory” to the Desktop. {When you do this, make certain you know where these files are going to wind up.} Set the “Source metadata type” to ResCarta METS using the pull down selection.
Let’s check the “Enable OCR” box, and then press the “Begin Conversion” button.
When complete we’ll set the “Source Data directory” to \Samples\PDF\PerDir-NewsYearbooks\ and press the “Begin Conversion” button. This will take a bit longer to complete, so let’s continue.
Open another instance of the 2_Data Conversion Tool.
Now change the “Source Data Type” pulldown back to “Object Per File” and the Source Directory to \Samples\JPG\PerFile-NikonCamera. Set “Source Metadata type” to ResCarta METS, uncheck the “Enable OCR” box and press the “Begin Conversion” button.
Now for audio. Set the “Source Data directory” to \Samples\WAV\Music and the “Source Data Type” to “Object Per File”, then press “Begin Conversion”.
Next, set the “Source Data directory” to \Samples\WAV\News, check the “Enable audio transcription” checkbox, and press “Begin Conversion”.
Last, set the “Source Data directory” to \Samples\MP4\ and the “Source Data Type” pulldown to “Object Per Directory”, check the “Enable audio transcription” checkbox, and press “Begin Conversion”.
Editing Textural Metadata
Optical Character Recognition (OCR)
Keyword searching has become commonplace and expected with text-based materials. Let’s open the 3-Textural Metadata Editor (TME). Then /File/Open ResCarta Object Directory will open a browse dialog. We are going to open an object we just created above. Locate our RCDATA01 directory and drill down to \RCDATA01\wionrfi0\20140529\00000001.
Note that the words “NATIVE TIMES” in the title are highlighted. Now note the web address in the banner.
Drag a box around www.NATIVETIMES.COM. This will zoom to that area. From the sidebar select the arrow icon (second one down) and click on the area “NATIVETIMES”. Note that the OCR engine separated the address into “WWW.”, “NATIVETI” and “ES.COM”. This is a minor error due to the inter-character spacing used by the design team. Click on “NATIVETI” and use the dialog box to correct this item.
The TME can also be used to “Tag” text into image only files like photographs.
Let’s close the Textural Metadata Editor.
Automatic Audio Transcription (AAT)
There are many Optical Character Recognition programs that produce text from images containing text, but there are few sources of software that produce transcriptions of audio files containing spoken text. The Data Conversion Tool used above is one of the few automatic audio transcription (AAT) programs. As with early OCR programs, the quality of the automatically produced transcription depends on the quality of the original recording, so a tool is needed to correct the transcriptions.
Let’s open the 4_Audio Transcription Editor (ATE). Then /File/Open will open a browse dialog. We are going to open an object we just created above. Locate your RCDATA01 directory on your C:\ or Desktop and drill down to \RCDATA01\wpr00000\20121124\00000003\
The ATE opens showing each word recognized and its location in the audio file. We press the button to hear audio and check the transcription.
We can click on a word balloon to edit or delete the word. To add a word, we right-click in an open space on the waveform graphic. An “insert” will appear; click it to get a new empty balloon.
Correcting an hour long oral history from a poor quality recording may take some patience. At this time this release only works well with English as the source language.
The ATE can be used as a manual transcription tool as well. Using the Data Conversion tool with AAT turned on will output word locations and word balloons for what the AAT recognizes. So English words recognized can be replaced with the proper language term. This tool supports UTF8 at the core so most world languages are supported.
Collect Digital Objects
So far we have created digital objects from Photographs, Postcards, Books, Music, Newscasts and Videos. The objects have been arranged in a known directory structure and have associated metadata embedded into the files along with an external standard XML metadata file.
The next step is to gather these objects into digital collections. We open 5_Collections Manager then /FILE/Open ResCarta Data Volume. Use the browse dialog to open the RCDATA01 top level directory of our archive.
A window will open showing the number of ResCarta objects recognized under the top level RCDATA01 directory. The banner will list the location of the Data Volume that has been opened. But not much has changed on the surface of the Collection Manager interface.
We add a collection by pressing the icon at the bottom of the left hand Collection pane. Give the collection a name (short is best) like “News” and add an abstract to the collection. “Newscast and Newspapers…”, then press “Finish” button. We see the News collection listed in the Collections pane.
Let’s add objects to the collection by pressing the icon at the bottom of the middle “Collection Content” pane. A dialog will open listing objects, with the first object in the list selected. We use the mouse or down arrow key to select “The Native Times”, then hold the Control key and use the mouse to select all the newspapers and newscasts. Then press the “OK” button. These objects will be placed into our “News” collection. We can create a few other collections and add appropriate objects to them, using all the objects that we have created.
Indexing Data for high speed retrieval
Now that we have organized our digital objects and defined our collections, let’s create a computer index for quick retrieval. Open 6_Indexer; a dialog will appear with two pulldowns. Set the ResCarta data volume directory to the location of your RCDATA01 top level directory. The second pulldown will default to that location with the addition of \index.ir7. This is the default location of the index used by the ResCarta-Web application. Press the “Begin Indexing” button. This program will build a full Lucene™ index of every word in your archive, covering both the metadata elements and the textural metadata in your OCR/AAT processed files.
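For readers curious why an index makes retrieval fast, here is a toy inverted index in Python. It is a conceptual sketch only and is not how Lucene or the Indexer is implemented; the object IDs and text are hypothetical.

```python
import re
from collections import defaultdict

def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercased word to the set of object IDs containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for object_id, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(object_id)
    return index

# Hypothetical object IDs and text, for illustration only.
documents = {
    "00000001": "Native Times newspaper",
    "00000003": "Wisconsin's New Rainy Day Fund",
}
index = build_index(documents)
print(sorted(index["wisconsin"]))  # ['00000003']
```

A search becomes a single lookup of the word in the index instead of a scan through every file in the archive.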
Host/Share Collections
Create a Web server
When we installed the software for this tutorial, the Apache Tomcat™ web server was also installed. This turns your machine into a fully functional, full featured web server capable of delivering large scale archives. To start the web server we click “Start ResCarta-Web server” in the ResCarta Tools menu. A command window will open on Windows based systems (Linux and OSX will be silent). A web server and a web application will be created and configured to search and display the digital objects.
NOTE: This Tomcat cmd window must remain running for this sample web server to function.
Also installed along with the Tomcat web server was the ResCarta-Web application. Let’s take a look at this website by clicking the “Open ResCarta-Web in Browser” link in the ResCarta Tools menu. Your default browser will open to the localhost:8302 location of your sample Tomcat web server. But these are NOT the objects we are looking for…
Let’s click the “Log In” link in the upper right hand under the ResCarta-Web logo. We will enter “admin” for the user and “password” for the password then press the “Log in” button. In the Server Administration/General tab we’ll change the data volume directory to the location of our RCDATA01 and press the SAVE button, then click the “Thumbnails” tab and press the “Start” button.
The Web Site
Now we should be able to see our collections, search the site and examine our objects. Let’s try a few “Simple Searches”.
For our first search let’s see what we can find by searching for the term “Wisconsin”. Let’s press the Simple Search tab and enter the term Wisconsin into the “Search for” box.
We should get a return of a few disparate items: audio files, books and newspapers. Let’s pick the audio file titled “Wisconsin's New Rainy Day Fund”. When the file opens it will begin playing, and a list of the times the word “Wisconsin” appears in the sound file will be shown. You can jump to a location by clicking on the highlighted term.
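The jump-to-word behaviour relies on each transcribed word carrying its position in the audio. A small Python sketch of that idea follows; the transcript data and the function name are hypothetical and do not reflect the format ResCarta stores.

```python
# Hypothetical transcript: (word, start time in seconds) pairs.
transcript = [
    ("Wisconsin", 1.2),
    ("has", 1.9),
    ("a", 2.1),
    ("new", 2.2),
    ("fund", 2.6),
    ("for", 3.0),
    ("Wisconsin", 3.3),
]

def occurrences(transcript, term):
    """Return the start times at which the term is spoken, ignoring case."""
    wanted = term.casefold()
    return [start for word, start in transcript if word.casefold() == wanted]

print(occurrences(transcript, "wisconsin"))  # [1.2, 3.3]
```

A player can then seek to any of the returned times, which is roughly what clicking a highlighted term does.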
We can click the “Search Results” tab to return to the listing of found objects and select a text based object. When opened, you can find the highlighted term in the object. Other features include zooming, text extraction, printing and listing the metadata for the item.
Open the IIIF standard image viewer at http://localhost:8302/ResCarta-Web/mirador
WEB SERVERS/PORTS/ADDRESSES
We should put an easy to find link to our local archive web site on our library’s main home page.
The web server we are using is the same server used by aerospace companies, libraries and archives around the world. However, running it from a command window or batch file would not be appropriate. Also note that the address “http://localhost:8302/ResCarta-Web/” would not work outside of your system. This workshop computer has an IP address, and for another machine to see our website, the address of our system would be substituted for localhost, as in “http://192.168.0.124:8302/ResCarta-Web/”. Our workshop computer also has a name, so the address could become “http://ResCarta03:8302/ResCarta-Web/”.
Downloading a current version of the Apache Tomcat Service installer and adjusting the port number to 80 instead of the default 8080 would allow the server to start whenever the system started, and the address would become more recognizable as “http://ResCarta03/ResCarta-Web/”.
Renaming the ResCarta-Web directory to ROOT will make it the default site, and the resulting URL would become “http://ResCarta03”.
Opening your machine’s firewall to allow HTTP traffic to and from your machine will allow others on the same network to see your website. Opening the external firewall to the INTERNET will allow the entire world to view your website.
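The address variations discussed in this section follow one simple pattern, sketched below in Python. The function name is hypothetical; the host names, IP address and port are the ones used in this workshop.

```python
def archive_url(host: str, port: int = 8302, context: str = "ResCarta-Web") -> str:
    """Build the address of the archive web site for a given host and port."""
    # Port 80 is the HTTP default, so it can be left out of the address.
    netloc = host if port == 80 else f"{host}:{port}"
    # An empty context stands for a web application renamed to ROOT.
    path = f"/{context}/" if context else "/"
    return f"http://{netloc}{path}"

print(archive_url("localhost"))            # http://localhost:8302/ResCarta-Web/
print(archive_url("192.168.0.124"))        # http://192.168.0.124:8302/ResCarta-Web/
print(archive_url("ResCarta03", port=80))  # http://ResCarta03/ResCarta-Web/
print(archive_url("ResCarta03", 80, ""))   # http://ResCarta03/
```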
Share Metadata
OAI/PMH: Finally, the archive we created can be exposed to metadata harvesting by the creation and configuration of an OAI/PMH server. This sounds complex but can be quite simple. Let’s create the necessary Dublin Core metadata files, add an OAI/PMH server and configure it.
Let’s make a directory to hold our OAI formatted Dublin Core metadata for harvesting. Using a file manager, let’s make a directory called C:\OAIDATA.
Reopening the 5_Collections Manager, let’s choose /File/Dublin Core Preferences from the menu.
We will fill in the dialog as follows:
Check the “Write DC metadata” box.
Set the output directory to C:\OAIDATA.
Select OAI_DC as the output format.
Set the ResCarta-Web URL to http://localhost:8302/ResCarta-Web/
Press OK and YES to the dialogs.
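For orientation, the general shape of an oai_dc record can be sketched with Python's standard XML library. The two namespace URIs are the standard oai_dc and Dublin Core ones; the field values are illustrative only, and the files the Collections Manager actually writes may contain more or different elements.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)

def dublin_core_record(fields: dict[str, str]) -> str:
    """Serialize simple Dublin Core fields as an oai_dc XML string."""
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for name, value in fields.items():
        ET.SubElement(root, f"{{{DC}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")

# Illustrative values only; the identifier is a placeholder URL.
record = dublin_core_record({
    "title": "The Native Times",
    "type": "Text",
    "identifier": "http://localhost:8302/ResCarta-Web/",
})
print(record)
```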
NOW for the OAI/PMH Server…
I have one on my thumbdrive at \ResCarta\Installers\jOAIv3\oai.war
I will copy this file to the following location: C:\Program Files\RcTools-7.0.3\apache-tomcat-8.5.31\webapps
After a few moments you will see a directory called oai being created in the webapps directory. This is the OAI/PMH server application. Now let’s configure it for use on your archive data.
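If you prefer to script the copy step, a minimal Python sketch might look like the following. The function name is hypothetical, and the drive letter in the commented example is an assumption about where the thumb drive is mounted.

```python
import shutil
from pathlib import Path

def deploy_war(war_file: Path, webapps_dir: Path) -> Path:
    """Copy a .war file into a Tomcat webapps directory and return the new path."""
    if war_file.suffix != ".war":
        raise ValueError(f"not a .war file: {war_file}")
    if not webapps_dir.is_dir():
        raise FileNotFoundError(f"webapps directory not found: {webapps_dir}")
    return Path(shutil.copy2(war_file, webapps_dir))

# Example call; the drive letter E: is an assumption about the thumb drive.
# deploy_war(
#     Path(r"E:\ResCarta\Installers\jOAIv3\oai.war"),
#     Path(r"C:\Program Files\RcTools-7.0.3\apache-tomcat-8.5.31\webapps"),
# )
```

Copying into Program Files may require administrator rights on Windows.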
Open a browser to http://localhost:8302/oai and the jOAI Overview will open. We click the “Data Provider” tab, then select “Setup and Status” from the pulldown. Then click the “Edit repository information” link. Give your repository a name and an email contact, then press SAVE.
Clicking the “Data Provider” tab we select “Setup and Status” from the pulldown again. This time we click the Add metadata directory link. Let’s enter a nickname for these files “OurArchive”, set format of files to “oai_dc” and Path to C:\OAIDATA, our oai data directory.
You can try some requests against your repository with the URLs listed below or by using the Search tab.
Explore OAI/PMH
Identify the repository
http://localhost:8302/oai/provider?verb=Identify&rt=text
List Metadata Formats
http://localhost:8302/oai/provider?verb=ListMetadataFormats&rt=text
List all records in your repository
http://localhost:8302/oai/provider?verb=ListRecords&metadataPrefix=oai_dc&rt=text
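The three requests above differ only in their query parameters, so they can be generated by a small helper. This Python sketch uses a hypothetical function name; the base URL and the rt=text parameter are copied from the URLs listed above.

```python
from urllib.parse import urlencode

# Base URL of the jOAI data provider as configured in this workshop.
PROVIDER = "http://localhost:8302/oai/provider"

def oai_request(verb: str, **arguments: str) -> str:
    """Build an OAI/PMH request URL, appending rt=text as in the examples above."""
    query = {"verb": verb, **arguments, "rt": "text"}
    return f"{PROVIDER}?{urlencode(query)}"

print(oai_request("Identify"))
print(oai_request("ListMetadataFormats"))
print(oai_request("ListRecords", metadataPrefix="oai_dc"))
```

While the sample server is running, the resulting addresses can be pasted into a browser or fetched with urllib.request.urlopen.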
Maintain the Archive
Checksum validation
To validate our archive from time to time we open the 7_Checksum Verification Tool, set the location of our ResCarta data volume directory, and press the “Begin Verification” button. When you do this (and you should do it often), you will be presented with a report showing the status of each object in your archive. The checksum verification can determine if any file in your archive has even one bit changed.
If an error is reported, you will need to run the verification tool against a copy of your archive. If the object passes verification in the copy, it is safe to copy that object back to your local archive, thereby replacing the damaged object.
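As a complement to the verification tool, the Python sketch below lists which files in an archive differ from a second copy. It is not the Checksum Verification Tool: it compares two directory trees directly rather than reading stored checksum files, and it assumes SHA-256 for the comparison. The function names are hypothetical.

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def differing_files(archive: Path, copy: Path) -> list[Path]:
    """List relative paths of files that are missing from the copy or differ from it."""
    differences = []
    for source in sorted(p for p in archive.rglob("*") if p.is_file()):
        relative = source.relative_to(archive)
        twin = copy / relative
        if not twin.is_file() or file_digest(source) != file_digest(twin):
            differences.append(relative)
    return differences
```

A difference alone does not say which side is damaged; the verification report is still needed to decide which copy to trust.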
Lots of Copies Keeps Stuff Safe
The concept here is an easy one. If you want anything to survive for a long period of time, make more than one copy and place the copies in locations away from each other. Digital archives are simple to copy, the copies are exact duplicates of the original, and storage is relatively cheap compared to rebuilding an archive from scratch.
That’s it! You have made digital objects, hosted them on a website and created an OAI harvestable repository. All created with open source and free software.
Thank you.
Appendix:
Some useful links (we think)
Selection:
Consortium of Academic and Research Libraries in Illinois
http://www.carli.illinois.edu/sites/files/digital_collections/documentation/Selection-of materials_20120221.pdf
Metadata:
OAI/PMH http://www.openarchives.org/Register/BrowseSites
Photographic Indexing http://www.loc.gov/rr/print/tgm1/iib.html
Subject Headings http://www.loc.gov/rr/print/tgm1/iii.html
The Getty Vocabularies http://vocab.getty.edu/
Art & Architecture Thesaurus http://www.getty.edu/research/tools/vocabularies/aat/
Geographical Names http://www.getty.edu/research/tools/vocabularies/tgn/
MODS http://www.loc.gov/standards/mods/
METS http://www.loc.gov/standards/mets/
MIX http://www.loc.gov/standards/mix/
audioMD http://www.loc.gov/standards/amdvmd/
reVTMD https://www.archives.gov/preservation/products/reVTMD.xsd
Audio:
HTML5 support http://en.wikipedia.org/w/index.php?title=HTML5_Audio
International Image Interoperability Framework:
IIIF site https://iiif.io/about/
RC-mirador https://demos.rescarta.org/ResCarta-Web/mirador/