TechDays

  Digital Archiving Made Simple

September 10 and 12, 2019

John Sarnowski

Director, The ResCarta Foundation Milwaukee, Wisconsin

+1-608-514-1958

john.sarnowski@ResCarta.org

Marcia Sarnowski

Former Consultant (Retired), Winding Rivers Library System

+1-608-385-4171

Marcia@johnsarnowski.com

ABSTRACT

This is an introductory session on the use of open/free software to create, validate, index, search,  display, and maintain a digital archive of various materials including photographs, postcards,  newspapers, books, audio recordings and videos.

Software Used in this presentation

Open Source and Free Software

Audacity – (Windows-OSX-Linux)

http://audacity.sourceforge.net/ 

Digital Library for Earth System Education jOAI – (Windows-OSX-Linux)

https://github.com/NCAR/joai-project

FFMPEG – (Windows-OSX-Linux)

http://ffmpeg.org/ 

Ghostscript – (Windows-OSX-Linux)

http://www.ghostscript.com/ 

Gimp – (Windows-OSX-Linux)

http://www.gimp.org/downloads/ 

Java – (Windows-OSX-Linux)

https://adoptopenjdk.net/ 

ResCarta Toolkit – (Windows-OSX-Linux)

https://rescarta.org/ 

CMU Sphinx – (Windows-OSX-Linux)

 https://cmusphinx.github.io/wiki/ 

Tomcat8 - (Windows-OSX-Linux)

 https://tomcat.apache.org/download-80.cgi 

VLC media player - (Windows-OSX-Linux)

http://www.videolan.org/ 

Tech Days East 2 www.ResCarta.org

Introductions
The ResCarta Foundation
Creating a Digital Archive
Installing Software
  Metadata types
  Descriptive
  Structural
  Administrative
  Textural
Adding metadata to photographs
Adding metadata to paged materials (Newspapers, Books, Yearbooks)
Convert Data
Editing Textural Metadata
  Optical Character Recognition (OCR)
  Automatic Audio Transcription (AAT)
Collect Digital Objects
Indexing Data for high speed retrieval
Host/Share Collections
The Web Site
WEB SERVERS/PORTS/ADDRESSES
Share Metadata
  Explore OAI/PMH
Maintain the Archive
  Checksum validation
  Lots of Copies Keeps Stuff Safe
Appendix


Introductions  

Good morning! Who are we? Why are we here?  

The ResCarta Foundation

The foundation is a not for profit corporation which produces, distributes and supports open source  software to create and help maintain local digital archives of culturally important materials.

The ResCarta-Toolkit and ResCarta-Web applications have assisted in the creation of hundreds of digital  archives world wide.

The software is free and open for use by anyone. See https://www.rescarta.org 

Creating a Digital Archive

A digital archive is more than scanned images, sound files or videos stored in a computer system. A digital archive is an organized collection of files containing digital objects. Digital objects are made up of media files, metadata information and checksums. Digital media files can be TIFF, JPEG, PDF, PNG, WAV, MP4, OGG, JPEG2000, DXF, etc.

Metadata can include descriptive, structural, administrative, or textural information about the media  files and should be expressed in a known open format readable by humans and machines (e.g. XML).

Checksum files contain a short series of digits computed from the contents of the media file while it is known to be correct. The checksum can be recomputed later and compared against the stored value to detect errors in the data.
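The checksum idea can be illustrated with a short Python sketch. This is not the ResCarta implementation: the choice of SHA-256 and the function names file_checksum and matches are assumptions made only for this example.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256"):
    """Return the hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algorithm)  # SHA-256 is an assumed choice
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected, algorithm="sha256"):
    """True when the file still hashes to the stored checksum."""
    return file_checksum(path, algorithm) == expected.lower()
```

Storing the value returned by file_checksum when a media file is created, and calling matches on it later, is enough to notice that even a single byte has changed.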

We will be using previously created digital files from a Samples directory during this session.

Contents of the SAMPLES directory structure

JPG  Output from a digital single-lens reflex (SLR) camera (PerFile) and scans of postcards (PerDir)
TIF  Historic steamboat photographs (PerFile) and books and newspapers (PerDir)
PDF  Documents from the internet (PerFile) and images from microfilm or paper originals (PerDir)
WAV  Recordings from records (Music) and Wisconsin Public Radio news broadcasts (NEWS)
MP4  Video from VHS tape with and without SRT (SubRip Text) basic subtitle files


Installing Software  

We will install new software for this workshop. It will take a few moments, and then we can locate the list of installed tools. Installing well-developed software should be a simple process.

We have installed the following software:

ResCarta Toolkit – an open source suite of programs for creating digital archives

Ghostscript – a support program if you want to convert or view PDF or PostScript files

Apache™ Tomcat – a full featured web server

ResCarta-Web - a web application which will serve your archive data. You will find this  application at C:\Program Files\RcTools-7.0.3\apache-tomcat-8.5.31\webapps\ResCarta-Web  

Documentation – See the PDF files at C:\Program Files\RcTools-7.0.3\docs

Sample archive - a (very) small example RCDATA01 directory will be installed at C:\Program Files\RcTools-7.0.3\apache-tomcat-8.5.31\webapps\ResCarta-Web\RCDATA01

Metadata types

Descriptive

Descriptive metadata describes a resource for purposes such as discovery and identification. It can  include elements such as title, abstract, author, and subjects.

Structural

Structural metadata indicates how compound objects are put together, for example, how pages are  named or gathered to form chapters.

Administrative

Administrative metadata provides information to help manage a resource, such as when and how it was  created, file type and other technical information.  

Textural

Textural metadata is the text produced by Optical Character Recognition (OCR) or Automatic Audio Transcription (AAT).

Metadata standards of MODS/METS/MIX/AudioMD/reVTMD from the Library of Congress and the  National Archives will be discussed and created for text, photographs and multimedia sources.


Adding metadata to photographs

Let’s open 1_ ResCarta Metadata Creation Tool. Then /File/Open Data Directory (Alt FO).

SELECT “OBJECT PER FILE” from the dialog then browse to \Samples\JPG\PerFile-NikonCamera

When the application opens the directory you should find that it shows a listing of image file names and one item called “Chicago Tribune Building Plaque”.

Click on the “CREATE METADATA” button to start the process of creating metadata.

A dialog will open asking for the “Source Institution”. Press the ADD button and enter the following information.

Name = ResCarta

ID = wionrfi0 (MARC Institution Code)

Then click the “OK” button to continue. An “Aggregator” and a “Root id” are requested. These are used to create the directory structure of the digital archive:

\RCDATA01\{Institution}\{Aggregator}\{ObjectID}
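As a rough illustration of that layout, the Python sketch below assembles an object directory from its parts. The function name object_directory is hypothetical, and the eight-digit zero-padded ObjectID is an assumption based on the sample paths used later in this session (for example \RCDATA01\wionrfi0\20140529\00000001); the toolkit itself assigns these values.

```python
from pathlib import PureWindowsPath


def object_directory(volume, institution, aggregator, object_number):
    """Build {volume}\\{Institution}\\{Aggregator}\\{ObjectID} as a Windows path."""
    # Assumption: ObjectIDs are eight digits, zero padded (e.g. 00000001).
    object_id = f"{object_number:08d}"
    return PureWindowsPath(volume) / institution / aggregator / object_id


print(object_directory(r"C:\RCDATA01", "wionrfi0", "20140529", 1))
```

PureWindowsPath is used so the sketch produces Windows-style paths even when run on Linux or OSX.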

Enter “2010911” as the “Aggregator” and allow the software to set the Root identifier.

The next dialog asks for the TYPE of material you will be creating. Select Photo from the Pulldown.

Now we are presented with a form where we can begin to enter our information. We will continue to  work with the Metadata Creation Tool to add metadata to the sample data. See pages 20-22 of the  ResCartaToolkitGuide-7.0.pdf for Template and Carry Forward features.


Adding metadata to paged materials (Newspapers, Books, Yearbooks)

Open 1_ ResCarta Metadata Creation Tool. Then /File/Open Data Directory (Alt FO)

SELECT “OBJECT PER DIRECTORY” from the dialog then browse to \Samples\TIF\PerDir\ .

When the application opens the directory you should find that it shows a listing of publications and one item called “Temple Daily Telegram”.

Note that all objects already have metadata in this case.

Click on the title “Temple Daily Telegram” to make it the active object.

In the page information pane you will see that this object has been broken into sections recreating the STRUCTURE of the newspaper.

You can collapse a section by clicking the key icon on the left side of the STRUCTURE display.

We’ll continue to work with the Metadata Creation Tool to add/modify STRUCTURAL metadata to the  sample data.  

Open the Samples\PDF\PerDir\ location and see the “Native Times” newspaper. Add Volume and  Issue information to this newspaper type. Correct the pagination of the “Wisconsin Jubilee”.

Open the Samples\WAV\News directory for an example of Audio metadata.

Do the same for video by opening the Samples\MP4\ in PER DIRECTORY MODE.

Convert Data

The processes of creating and converting data are complex. These issues will be discussed in detail, and a reasoned, simplified approach will be provided. File types, metadata locations, directory structures and file-naming are covered here.

Let’s open the 2_Data Conversion Tool.

The Data Conversion Tool opens in “Object Per File” mode which has many options. Let’s change the “Source Data Type” pull-down to “Object Per Directory”. There, that’s better! (Less complex?)

We set the “Source Data directory” to \Samples\JPG\PerDir-Postcards\ and the “Destination  Directory” to the Desktop. {When you do this, make certain you know where these files are  going to wind up.} Set the “Source metadata type” to ResCarta METS using the pull down  selection.  

Let’s check the “Enable OCR” box, and then press the “Begin Conversion” button.

When complete we’ll set the “Source Data directory” to \Samples\PDF\PerDir-NewsYearbooks\ and press the “Begin Conversion” button. This will take a bit longer to complete so let’s continue.

Open another instance of the 2_Data Conversion Tool.

Now change the “Source Data Type” pulldown back to “Object Per File” and the Source Directory to  \Samples\JPG\PerFile-NikonCamera. Set “Source Metadata type” to ResCarta METS, uncheck the “Enable OCR” box and press the “Begin Conversion” button.  

Now for audio. Set the “Source Data directory” to \Samples\WAV\Music, the “Source Data Type” to “Object Per File” and press “Begin Conversion”.

Next, set the “Source Data directory” to \Samples\WAV\News, check the “Enable audio transcription” checkbox, and press “Begin Conversion”.

Last, set the “Source Data directory” to \Samples\MP4\ and the “Source Data Type” pulldown to “Object Per Directory”, check the “Enable audio transcription” checkbox, and press “Begin Conversion”.


Editing Textural Metadata  

Optical Character Recognition (OCR)  

Keyword searching has become commonplace and expected with text-based materials.

Let’s open the 3-Textural Metadata Editor (TME). Then /File/Open ResCarta Object Directory will open a browse dialog. We are going to open an object we just created above. Locate our RCDATA01 directory and drill down to \RCDATA01\wionrfi0\20140529\00000001.

Note that the words “NATIVE TIMES” in the title are highlighted. Now note the web address in the  banner.

Drag a box around www.NATIVETIMES.COM. This will zoom to that area. From the sidebar select the arrow icon (second one down) and click on the area “NATIVETIMES”. Note that the OCR engine separated the address into “WWW.” “NATIVETI” and “ES.COM”. This is a minor error due to the inter-character spacing used by the design team. Click on “NATIVETI” and use the dialog box to correct this item.

The TME can also be used to “Tag” text into image only files like photographs.

Let’s close the Textural Metadata Editor.


Automatic Audio Transcription (AAT)  

There are many Optical Character Recognition programs to produce text from images containing text, but there are few sources of software to produce transcriptions of audio files containing spoken text. The Data Conversion Tool used above is one of the few automatic audio transcription (AAT) programs. As with early OCR programs, the quality of the automatically produced transcription depends on the quality of the original recording, so a tool is needed to correct the transcriptions.

Let’s open the 4_Audio Transcription Editor (ATE). Then /File/Open will open a browse dialog. We are  going to open an object we just created above. Locate your RCDATA01 directory on your C:\ or Desktop  and drill down to \RCDATA01\wpr00000\20121124\00000003\

The ATE opens showing each word recognized and its location in the audio file. We press the play button to hear the audio and check the transcription.

We can click on a word balloon to edit/delete the word. We click the secondary (right) mouse button in an open space on the waveform graphic to add a word. An “insert” will appear; click it to get a new empty balloon.

Correcting an hour long oral history from a poor quality recording may take some patience. At this time  this release only works well with English as the source language.  

The ATE can be used as a manual transcription tool as well. Using the Data Conversion tool with AAT turned on will output word locations and word balloons for what the AAT recognizes. So English words  recognized can be replaced with the proper language term. This tool supports UTF8 at the core so most  world languages are supported.


Collect Digital Objects

So far we have created digital objects from Photographs, Postcards, Books, Music, Newscasts and  Videos. The objects have been arranged in a known directory structure and have associated metadata  embedded into the files along with an external standard XML metadata file.

The next step is to gather these objects into digital collections. We open 5_Collections Manager then  /FILE/Open ResCarta Data Volume. Use the browse dialog to open the RCDATA01 top level directory of  our archive.  

A window will open showing the number of ResCarta objects recognized under the top level RCDATA01  directory. The banner will list the location of the Data Volume that has been opened. But not much has  changed on the surface of the Collection Manager interface.

We add a collection by pressing the icon at the bottom of the left hand Collection pane. Give the collection a name (short is best) like “News” and add an abstract to the collection, such as “Newscast and Newspapers…”, then press the “Finish” button. We see the News collection listed in the Collections pane.

Let’s add objects to the collection by pressing the icon at the bottom of the middle “Collection Content” pane. A dialog will open listing objects with the first object in the list selected. We use the mouse or down arrow key to select “The Native Times” and then hold the control key and use the mouse to select all the newspapers and newscasts. Then press the “OK” button. These objects will be placed into our “News” collection. We can create a few other collections and add appropriate objects to them, using all the objects that we have created.

Indexing Data for high speed retrieval

Now that we have organized our digital objects and defined our collections let’s create a computer index for quick retrieval. Open 6_Indexer; a dialog will appear with two pulldowns. Set the ResCarta data volume directory to the location of your RCDATA01 top level directory. The second pull down will default to that location with the addition of \index.ir7. This is the default location of the index used by the ResCarta-Web application. Press the “Begin Indexing” button. This program will build a full Lucene™ index of every word in your archive. This indexes the metadata elements and the textural metadata in your OCR/AAT processed files.
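To give a feel for why an index makes retrieval fast, here is a toy inverted index in Python. It is only a sketch of the general idea and not how Lucene or the ResCarta Indexer is implemented; the function names and the object identifiers in the usage lines are made up for illustration.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercased word to the set of object ids that contain it."""
    index = defaultdict(set)
    for object_id, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(object_id)
    return index


def search(index, term):
    """Look a single term up in the index instead of rescanning every text."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical objects: ids and texts are placeholders, not real archive data.
sample = {
    "00000001": "The Native Times newspaper",
    "00000003": "Wisconsin's New Rainy Day Fund",
    "00000004": "Wisconsin Jubilee",
}
print(search(build_index(sample), "Wisconsin"))
```

The work of reading every word is done once, at indexing time; each search afterwards is a single dictionary lookup.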


Host/Share Collections  

Create a Web server

When we installed the software for this tutorial, an Apache Tomcat™ web application was also installed. This will make your machine a fully functional, full-featured web server capable of delivering large scale archives. To start the web server we click on “Start ResCarta-Web server” in the ResCarta Tools menu. A command window will open on Windows-based systems (Linux and OSX will be silent). A web server and a web application will be created and configured to search and display the digital objects.

NOTE: This Tomcat cmd window must remain running for this sample web server to function.

Also installed along with the Tomcat web server was the ResCarta-Web application. Let’s take a look at this website. Let’s click on the “Open ResCarta-Web in Browser” link from the ResCarta Tools menu. Your default browser will open to the localhost:8302 location of your sample Tomcat web server. But these are NOT the objects we are looking for…

Let’s click the “Log In” link in the upper right hand under the ResCarta-Web logo. We will enter “admin”  for the user and “password” for the password then press the “Log in” button. In the Server  Administration/General tab we’ll change the data volume directory to the location of our RCDATA01 and press the SAVE button, then click the “Thumbnails” tab and press the “Start” button.


The Web Site

Now we should be able to see our collections, search the site and examine our objects. Let’s try a few  “Simple Searches”.

For our first search let’s see what we can find by searching for the term “Wisconsin”. Let’s press the Simple Search tab and enter the term Wisconsin into the “Search for” box.

We should get a return of a few disparate items: audio files, books and newspapers. Let’s pick the audio file titled “Wisconsin's New Rainy Day Fund”. When the file opens it will begin playing, and a list of the times the word “Wisconsin” appears in the sound file will be shown. You can jump to a location by clicking on the highlighted term.

We can click the “Search Results” tab to return to the listing of found objects and select a text based  object. When opened, you can find the highlighted term in the object. Other features include zooming,  text extraction, printing and listing the metadata for the item.  

Open the IIIF standard image viewer at http://localhost:8302/ResCarta-Web/mirador

WEB SERVERS/PORTS/ADDRESSES

We should put an easy-to-find link to our local archive web site on our library’s main home page.

The web server we are using is the same server used by aerospace companies, libraries and archives around the world. But running it from a command window or batch file would not be appropriate. Also note that the address “http://localhost:8302/ResCarta-Web/” would not work outside of your system. This workshop computer has an IP address, and for another machine to see our website, the address of our system would be substituted for localhost above, as in “http://192.168.0.124:8302/ResCarta-Web/”. Our workshop computer also has a name, so the address could become “http://ResCarta03:8302/ResCarta-Web/”.

Downloading a current version of the Apache Tomcat Service installer would allow the server to start whenever the system started, and adjusting the port number to 80 instead of the default 8080 would make the address more recognizable as “http://ResCarta03/ResCarta-Web/”.

Renaming the ResCarta-Web directory to ROOT will make it the default site and the resulting URL would become “http://ResCarta03”.
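The way the address changes with host, port and web application name can be summarized in a few lines of Python. The helper archive_url is hypothetical and exists only to show the pattern; it assumes plain HTTP, where port 80 is the default and can be left out of the address.

```python
def archive_url(host, port=8302, context="ResCarta-Web"):
    """Compose the site address from host, port and web application name."""
    # Port 80 is the HTTP default, so it does not need to appear in the URL.
    netloc = host if port == 80 else f"{host}:{port}"
    # An empty context stands for an application deployed as ROOT.
    path = f"/{context}/" if context else "/"
    return f"http://{netloc}{path}"


print(archive_url("localhost"))
print(archive_url("192.168.0.124"))
print(archive_url("ResCarta03", port=80))
print(archive_url("ResCarta03", port=80, context=""))
```

Each call corresponds to one of the addresses discussed above: the workshop default, the IP address form, the port 80 form, and the ROOT deployment.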

Opening your machine’s firewall to allow HTTP traffic to and from your machine will allow others on the  same network to see your website. Opening the external firewall to the INTERNET will allow the entire  world to view your website.  


Share Metadata

OAI/PMH: Finally, the archive we created can be exposed to metadata harvesting by the creation and configuration of an OAI/PMH server. This sounds complex but can be quite simple. Let’s create the  necessary Dublin Core metadata files, add an OAI/PMH server and configure it.

Let’s make a directory to hold our OAI formatted Dublin Core metadata for harvesting. Using a file manager, let’s make a directory called C:\OAIDATA.

Reopening the 5_Collections Manager, let’s choose /File/Dublin Core Preferences from the menu.

We will fill in the dialog as follows: check the “Write DC metadata” box, set the output directory to C:\OAIDATA, select OAI_DC as the output format, and set the ResCarta-Web URL to

http://localhost:8302/ResCarta-Web/

Press OK and YES to the dialogs.

NOW for the OAI/PMH Server…

I have one on my thumbdrive at \ResCarta\Installers\jOAIv3\oai.war

I will copy this file to the following location:

C:\Program Files\RcTools-7.0.3\apache-tomcat-8.5.31\webapps

After a few moments you will see a directory called oai being created in the webapps directory. This is the OAI/PMH server application. Now let’s configure it for use on your archive data.

Let’s open our browser to http://localhost:8302/oai. The jOAI Overview will open. We click the “Data Provider” tab then select “Setup and Status” from the pulldown. Then click the Edit repository information link. Give your repository a name and an email contact, then press SAVE.

Clicking the “Data Provider” tab we select “Setup and Status” from the pulldown again. This time we  click the Add metadata directory link. Let’s enter a nickname for these files “OurArchive”, set format of  files to “oai_dc” and Path to C:\OAIDATA, our oai data directory.

You can try some requests against your repository with the urls listed below or using the Search TAB.

Explore OAI/PMH

Identify the repository

http://localhost:8302/oai/provider?verb=Identify&rt=text 

List Metadata Formats

http://localhost:8302/oai/provider?verb=ListMetadataFormats&rt=text 

List all records in your repository

http://localhost:8302/oai/provider?verb=ListRecords&metadataPrefix=oai_dc&rt=text 
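The three requests above all follow the same pattern: a base address plus a verb and a few arguments. A minimal Python sketch of that pattern is shown below. The helper name oai_request_url is an assumption for this example, and the rt=text argument is simply carried over from the URLs listed above.

```python
from urllib.parse import urlencode

# Base address of the workshop's jOAI data provider, as used above.
BASE_URL = "http://localhost:8302/oai/provider"


def oai_request_url(verb, base_url=BASE_URL, **arguments):
    """Build an OAI/PMH request URL from a verb and its arguments."""
    query = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(query)}"


print(oai_request_url("Identify", rt="text"))
print(oai_request_url("ListMetadataFormats", rt="text"))
print(oai_request_url("ListRecords", metadataPrefix="oai_dc", rt="text"))
```

With the sample web server running, any of these URLs can be pasted into a browser or fetched from a script.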

Maintain the Archive

Checksum validation

To validate our archive from time to time we open the 7_Checksum Verification Tool, set the location of our ResCarta data volume directory, and press the “Begin Verification” button. When you do this (and you should do it often), you will be presented with a report showing the status of each object in your archive. The checksum verification can determine if any file in your archive has even one bit changed.

If an error is reported you will need to run the verification tool against a copy of your archive. If the object passes verification in the copy, it is safe to copy that object back to your local archive, thereby replacing the damaged object.
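The verify-then-restore routine can be sketched in Python as follows. This is an illustration under stated assumptions, not the behavior of the 7_Checksum Verification Tool: the manifest dictionary, the use of SHA-256 and all function names are hypothetical, and how ResCarta stores its own checksums is not covered here.

```python
import hashlib
import shutil
from pathlib import Path


def checksum(path):
    """SHA-256 of a whole file (assumed algorithm; reads the file at once)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def build_manifest(root):
    """Record a checksum for every file under root, keyed by relative path."""
    root = Path(root)
    return {
        item.relative_to(root).as_posix(): checksum(item)
        for item in sorted(root.rglob("*"))
        if item.is_file()
    }


def damaged_files(root, manifest):
    """List files that are missing or no longer match their stored checksum."""
    root = Path(root)
    return [
        relative
        for relative, expected in manifest.items()
        if not (root / relative).is_file() or checksum(root / relative) != expected
    ]


def restore_from_copy(local_root, copy_root, manifest):
    """Replace damaged local files, but only from copies that verify."""
    restored = []
    for relative in damaged_files(local_root, manifest):
        source = Path(copy_root) / relative
        if source.is_file() and checksum(source) == manifest[relative]:
            target = Path(local_root) / relative
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source, target)
            restored.append(relative)
    return restored
```

The important design point is the second check: a file is copied back only after the copy itself has passed verification, so damage in one location is never spread to another.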

Lots of Copies Keeps Stuff Safe

The concept here is an easy one. If you want anything to survive for a long period of time, make more than one copy and place the copies in locations away from each other. Digital archives are simple to copy, the copies are exact duplicates of each other, and storage is relatively cheap compared to rebuilding an archive from scratch.

That’s it! You have made digital objects, hosted them on a website and created an OAI harvestable repository. All created with open source and free software.

Thank you.  


Appendix:

Some useful links (we think)

Selection:

Consortium of Academic and Research Libraries in Illinois

http://www.carli.illinois.edu/sites/files/digital_collections/documentation/Selection-of materials_20120221.pdf 

Metadata:

OAI/PMH http://www.openarchives.org/Register/BrowseSites

Photographic Indexing http://www.loc.gov/rr/print/tgm1/iib.html

Subject Headings http://www.loc.gov/rr/print/tgm1/iii.html

The Getty Vocabularies http://vocab.getty.edu/

Art & Architecture Thesaurus http://www.getty.edu/research/tools/vocabularies/aat/

Geographical Names http://www.getty.edu/research/tools/vocabularies/tgn/

MODS http://www.loc.gov/standards/mods/

METS http://www.loc.gov/standards/mets/

MIX http://www.loc.gov/standards/mix/

audioMD http://www.loc.gov/standards/amdvmd/

reVTMD https://www.archives.gov/preservation/products/reVTMD.xsd

Audio:

HTML5 support http://en.wikipedia.org/w/index.php?title=HTML5_Audio

International Image Interoperability Framework:

IIIF site https://iiif.io/about/

RC-mirador https://demos.rescarta.org/ResCarta-Web/mirador/