Exercises: Free/Open Source Tools for Digital Preservation

May 14, 2016

Presenters:

Table of Contents:

Pt. 1: Characterizing Content

WinDirStat

DROID (Digital Record Object IDentification)

bulk_extractor

Pt. 2: Migrating Files to Preservation Formats

Still Images

IrfanView

ImageMagick (Demo)

Text(ual) Content

Adobe Acrobat Pro

Ghostscript

Audio and Video

FFmpeg

HandBrake

Pt. 3: Metadata for Digital Preservation

Descriptive Metadata

Microsoft Word

Adobe Acrobat

Technical Metadata

ExifTool

Preservation Metadata

MD5summer


Pt. 1: Characterizing Content

WinDirStat

WinDirStat is a disk usage statistics viewer and cleanup tool for various versions of Microsoft Windows. (https://windirstat.info/)

  1. Open WinDirStat and navigate to the ES-Test-Files folder you saved to your machine.

    windir1.PNG
  2. Review the Folder Hierarchy pane.
    1. Click on various folders and files; what happens in the Treemap and File Extension panes?
    2. Navigate down to “ES-Test-Files\dirk_hageman\documents\SAPResearch\Projects\CID\Meetings\2008”. How many files are in this folder?
  3. Review the Treemap pane and note your responses to the following:
    1. Based upon the sizes of the rectangles, which files appear to be the largest?
    2. Based upon the prevalence of colors, what appears to be the most common or numerous file format?
  4. Review the File Extension pane and note your responses to the following:
    1. How many .TXT files are in the test data?
    2. What is the total size (in MB) of the .JPG files?
    3. What file formats are you unfamiliar with?
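
For comparison, the kind of tally shown in the File Extension pane can be approximated with a few lines of Python. This is only a rough sketch using the standard library; the function names extension_stats and print_report are made up for this handout, and WinDirStat's own counts may differ slightly (for example, in how it treats files without an extension).

```python
import os
from collections import defaultdict


def extension_stats(root):
    """Return {extension: [file_count, total_bytes]} for every file under root."""
    stats = defaultdict(lambda: [0, 0])
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            # Lower-case the extension so .JPG and .jpg are counted together.
            ext = os.path.splitext(name)[1].lower() or "(none)"
            entry = stats[ext]
            entry[0] += 1
            entry[1] += os.path.getsize(os.path.join(folder, name))
    return dict(stats)


def print_report(root):
    """Print one line per extension, largest total size first."""
    stats = extension_stats(root)
    by_size = sorted(stats.items(), key=lambda item: item[1][1], reverse=True)
    for ext, (count, size) in by_size:
        print(f"{ext:10} {count:6} files {size / (1024 * 1024):10.2f} MB")
```

Calling print_report on the path to your ES-Test-Files folder should give numbers you can check against your answers above.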



DROID (Digital Record Object IDentification)


DROID stands for Digital Record Object Identification. It’s a free software tool developed by The National Archives that will help you to automatically profile a wide range of file formats.

http://www.nationalarchives.gov.uk/information-management/manage-information/preserving-digital-records/droid/ 

  1. Open DROID and click the ‘Add’ button to add the test data to the current DROID profile.
    droid1.png
  2. Browse to where the ‘ES-Test-Files’ folder is stored; select the folder and make sure the ‘Include subfolders’ box is checked. Then click ‘OK’.
    droid2.png
  3. The test data is now loaded; click ‘Start’ to run DROID.
    droid3.png
  4. Once completed, results will display as a table in the DROID window. You can browse through the directory structure to see basic information on files and their identification.
    droid4.png

    Click on a PUID link to see the summary format profile in the PRONOM registry.

  5. Click the ‘Export’ tab to export the DROID profile. Check the box next to the profile name and when saving the file, be sure to choose ‘comma separated values’ (.CSV) as the file type.
    droid6.PNG
  6. Open the exported DROID .CSV profile in Excel and respond to the following questions:
    1. How many different versions of the Adobe PDF format are included? (Sort spreadsheet by ‘FORMAT_NAME’)
    2. What kinds of files (based on extension) were not identified by DROID? (Sort spreadsheet by ‘FORMAT_NAME’)
    3. Under the ‘EXTENSION_MISMATCH’ column, DROID indicates potentially erroneous file extensions with a value of ‘TRUE.’ What files appear to have an extension mismatch?
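
If you would rather not sort the spreadsheet by hand, the same three questions can be approached with a short Python sketch over the exported CSV. FORMAT_NAME and EXTENSION_MISMATCH are the column names used in this exercise; the other column names (FORMAT_VERSION, EXT, NAME, TYPE) are assumptions about the export layout and may need adjusting to match your file. The function name summarize_droid_csv is made up for illustration.

```python
import csv
from collections import Counter


def summarize_droid_csv(csv_path):
    """Return (formats, unidentified_exts, mismatches) for a DROID CSV export."""
    formats = Counter()            # (format name, version) -> number of files
    unidentified_exts = Counter()  # extension -> number of unidentified files
    mismatches = []                # names of files flagged as extension mismatches
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Assumed column: TYPE is "Folder" for directory rows.
            if (row.get("TYPE") or "") == "Folder":
                continue
            name = (row.get("FORMAT_NAME") or "").strip()
            if name:
                version = (row.get("FORMAT_VERSION") or "").strip()
                formats[(name, version)] += 1
            else:
                unidentified_exts[(row.get("EXT") or "").lower()] += 1
            flag = (row.get("EXTENSION_MISMATCH") or "").strip().lower()
            if flag == "true":
                mismatches.append(row.get("NAME") or "")
    return formats, unidentified_exts, mismatches
```

The first returned value answers the PDF-versions question (look for keys whose name mentions PDF), the second lists extensions DROID could not identify, and the third lists the flagged files.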


bulk_extractor

“A computer forensics tool that scans a disk image, a file, or a directory of files and extracts useful information without parsing the file system or file system structures. The results can be easily inspected, parsed, or processed with automated tools.” http://www.forensicswiki.org/wiki/Bulk_extractor 

  1. Open BEViewerLauncher.exe.
  2. Under the Tools menu, select Run bulk_extractor…, or hit Ctrl + R.
    be.png
  3. Choose the option to scan a “Directory of Files” and then navigate to the ES-Test-Files directory. Also, be sure to select an “Output Feature Directory” in a convenient location.
    be2.PNG
  4. Click “Submit Run” at the bottom of the window. bulk_extractor includes many options; for the purpose of this exercise, we will just use the default settings.
    be3.PNG
  5. A progress window will open with information on the scanning; close it when the scan is finished. The reports directory should automatically load in Bulk Extractor Viewer; click on the folder name to see a full list of all scanner reports.
  6. Complete the following steps:
    1. Open ‘ccn.txt’ and:
      1. Browse through the results in the ‘Image’ pane.
      2. Check the actual files in the folder “ES-Test-Files\Identity_Finder_Test_Data” (source filename identified as “Image File”).
    be6.png
    2. Open ‘pii.txt’; how many SSNs were identified in the scan?
    3. The ‘email.txt’ report contains a list of all email addresses found in the targeted content, and ‘email_histogram.txt’ details the number of times each email address occurs. Review each file and discuss with a neighbor how this information could help you organize or manage files.
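
To get a quick ranking out of ‘email_histogram.txt’ without the viewer, a sketch like the following may help. It assumes each data line has the form n=<count>, a tab, then the feature (with comment lines starting with #); check a few lines of your own report first, since that layout is an assumption here. The function name read_histogram is made up for this handout.

```python
def read_histogram(path, top=10):
    """Return the `top` most frequent (feature, count) pairs from a histogram report."""
    rows = []
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            # Assumed layout: data lines start with "n="; anything else is skipped.
            if not line.startswith("n="):
                continue
            fields = line.split("\t")
            if len(fields) < 2:
                continue
            try:
                count = int(fields[0][2:])
            except ValueError:
                continue
            rows.append((fields[1], count))
    # Most frequent features first.
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return rows[:top]
```

A list of the most frequent correspondents is one possible starting point for deciding how to group or describe a collection.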


Pt. 2: Migrating Files to Preservation Formats

Still Images

Links:

Learning outcomes:

IrfanView

IrfanView is an image viewer, editor, organizer and converter program for Microsoft Windows. IrfanView is free for non-commercial use; commercial use requires paid registration. It is noted for its small size, speed, ease of use and ability to handle a wide variety of graphic file formats.

Instructions:

  1. Download and install IrfanView and IrfanView Plugins.
  2. Download and unzip the Sample Images.
  3. Viewing images
    1. Use Windows Explorer to navigate to the Sample Images folder. Note that Windows cannot preview the .TGA files, and doesn’t know how to open them.

    2. Launch IrfanView.
    3. Click File → Open.
    4. Set Look in: to the Sample Images folder, and double-click CTC24.TGA (or any .TGA file). Note that IrfanView can render the file.

  4. Converting a single file
    1. Click File → Save as…
    2. Set Save in: to the Sample Images folder.
    3. Set Save as type: to TIF - Tagged Image File Format.
    4. Click Save. Note the TIF save options.
    5. Navigate to that file in Windows Explorer. Note that Windows Explorer can now preview the file and double-clicking it will open Windows Photo Viewer, which can now render it. You also now have a file format recognized by experts as sustainable!

  5. Batch conversion
    1. Back in IrfanView, click File → Batch Conversion/Rename…
    2. Select Batch conversion. Note that you may also opt to rename result files.
    3. Set Look in: to the Sample Images folder.
    4. Click Add all.
    5. Set Output format: to TIF - Tagged Image File Format.
    6. Click Options.
    7. Click Cancel.
    8. Select Use advanced options (for bulk resize…).
    9. Click Advanced.
    10. Click Cancel.
    11. Deselect Use advanced options (for bulk resize…). (Those were just to let you know that you had more options.)
    12. Click Use current (“look in”) directory. Note that you may also opt to browse to a different folder.
    13. Click Start Batch.
    14. Click Exit Batch. Note that you might also have opted to Copy to clipboard if you were interested in saving a log of these operations elsewhere.
    15. Navigate to those files in Windows Explorer. Note that Windows Explorer can now preview any of the migrated files and double-clicking them will open Windows Photo Viewer, which can now render them.
  6. Command-line options
    1. Click Help → IrfanView Help.
    2. Click Command Line Options.
    3. Escape out. (We’ll take a look at command-line magic during the next exercise.)
  7. Exit IrfanView.

ImageMagick (Demo)

ImageMagick is a free and open-source software suite for displaying, converting and editing raster image and vector image files. It can read and write over 200 image file formats. The functionality of ImageMagick is typically utilized from the command-line or you can use the features from programs written in your favorite language.

Instructions:

  1. Download and install ImageMagick.
  2. Converting a single file
    1. Open Command Prompt.
    2. Navigate to your Sample Images folder using the cd command.
    3. Type: convert CTC24.tif -quality 100 CTC24.jpg (On ImageMagick 7 or later, the equivalent command starts with magick instead of convert.)
    4. Navigate to that file in Windows Explorer. Note that in addition to a preservation format for this file, you have an “access” format as well!
  3. Batch conversion
    1. In the Command Prompt, type: mogrify -format jpg -quality 100 *.tif
    2. Navigate to those files in Windows Explorer. Note that in addition to preservation formats for these files, you have “access” formats as well!
  4. Exploring further options
    1. Explore further options for the convert and mogrify programs, including: resize an image, blur, crop, despeckle, dither, draw on, flip, join, re-sample, and much more. Play around some!


Text(ual) Content

Links:

Learning outcomes:

Adobe Acrobat Pro

Adobe Acrobat is a family of application software and Web services developed by Adobe Systems to view, create, manipulate, print and manage files in Portable Document Format (PDF).

Instructions:

  1. Purchase (sorry!), download and install Adobe Acrobat Pro.
  2. Download and unzip the Sample Textual Content.
  3. Converting a single file
    1. Launch Adobe Acrobat Pro.
    2. Click Create PDF from File.
    3. Navigate to the Sample Textual Content folder and select MSWord_test_document.doc.
    4. Click File → Save As…
    5. Navigate to the Sample Textual Content folder and click Save.
  4. Comparing the significant characteristics
    1. Navigate to the Sample Textual Content folder in Windows Explorer and double-click MSWord_test_document.doc. It should open in Microsoft Word.
    2. Compare the textual content, formatting such as bolded text, font type and size, layout, bulleting, color and embedded graphics. Compare anything else you can think of!
    3. What about file metadata? Which of these fields would you consider essential, and which are non-essential?
      1. Click File in Microsoft Word.

      2. Click File → Properties… in Adobe Acrobat Pro.

  5. Close Microsoft Word.
  6. Close Adobe Acrobat Pro.

Ghostscript

Ghostscript is “a suite of software based on an interpreter for Adobe Systems' PostScript and Portable Document Format (PDF) page description languages. Its main purposes are the rasterization or rendering of such page description language files, for the display or printing of document pages, and the conversion between PostScript and PDF files.”

Instructions:

  1. Download and install Ghostscript.
  2. Convert a single file:
    1. Open Command Prompt.
    2. Navigate to your Sample Textual Content folder using the cd command.
    3. Since Ghostscript is not automatically added to your system’s PATH (like ImageMagick or ffmpeg), type (or, better yet, copy and paste): "path\to\gs\gs9.16\bin\gswin64.exe" -dPDFA -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=FRPEnForm_PDFA.pdf FRPEnForm.pdf
      1. -dPDFA: Requests PDF/A output from the pdfwrite device.
      2. -dBATCH: Causes Ghostscript to exit after processing all files named on the command line.
      3. -dNOPAUSE: Disables the prompt and pause at the end of each page.
      4. -sDEVICE: Selects an output device (e.g., pdfwrite device).
      5. -sOutputFile: Names the output file. Note that I’ve appended a _PDFA suffix to distinguish between the original and migrated versions.
  3. Comparing the significant characteristics
    1. Again, navigate to the Sample Textual Content folder in Windows Explorer and double-click FRPEnForm.pdf and FRPEnForm_PDFA.pdf. They should open in Adobe Acrobat Pro.
    2. Again, compare the textual content, formatting such as bolded text, font type and size, layout, bulleting, color and embedded graphics. Note that the version normalized to PDF/A is no longer usable as a fillable form.

  4. Exploring further options
    1. Explore further options on how to use Ghostscript.
  5. Batch conversion using Python:
    1. This PDF to PDF/A Python script loops through a directory, checks to make sure a file has not already been converted, and, if not, runs the above Ghostscript command on the file. You now have a file format recognized by experts as sustainable!
    2. For the very adventurous, there’s also a Ghostscript library for Python.
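
As a rough illustration of the loop described under batch conversion (not the script referred to in this handout, just a minimal sketch of the same idea), the following Python walks one folder, skips files that already carry the _PDFA suffix or already have a converted counterpart, and runs the Ghostscript command from this exercise on the rest. The function names are made up for illustration, and GS_EXE keeps the placeholder path from the command above; replace it with the real location of gswin64.exe on your machine.

```python
import os
import subprocess

# Placeholder path copied from the exercise; point this at your own install.
GS_EXE = r"path\to\gs\gs9.16\bin\gswin64.exe"
SUFFIX = "_PDFA"


def build_gs_command(source, target, gs_exe=GS_EXE):
    """Build the same Ghostscript command line used in the single-file step."""
    return [gs_exe, "-dPDFA", "-dBATCH", "-dNOPAUSE",
            "-sDEVICE=pdfwrite", f"-sOutputFile={target}", source]


def plan_conversions(folder):
    """Return (source, target) pairs for PDFs that still need converting."""
    pairs = []
    for name in sorted(os.listdir(folder)):
        stem, ext = os.path.splitext(name)
        # Skip non-PDFs and files that are themselves earlier conversions.
        if ext.lower() != ".pdf" or stem.endswith(SUFFIX):
            continue
        target = os.path.join(folder, stem + SUFFIX + ".pdf")
        if os.path.exists(target):
            continue  # already converted on a previous run
        pairs.append((os.path.join(folder, name), target))
    return pairs


def convert_folder(folder, gs_exe=GS_EXE):
    """Run Ghostscript once per pending file; raises if Ghostscript fails."""
    for source, target in plan_conversions(folder):
        subprocess.run(build_gs_command(source, target, gs_exe), check=True)
```

Running plan_conversions on the Sample Textual Content folder first, before convert_folder, is a safe way to see what would be converted.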


Audio and Video

Links:

Learning outcomes:

FFmpeg

FFmpeg is a type of Swiss Army knife for digital audio and video, “a complete, cross-platform solution to record, convert and stream audio and video.”

Instructions:

  1. Download and install FFmpeg.
  2. Download and unzip the Sample Audio and Video.
  3. Convert a single video file (Note that audio works the same way.)
    1. Open Command Prompt.
    2. Navigate to your Sample Audio and Video folder using the cd command.
    3. Type: dir
    4. Type: ffmpeg -i bird.avi bird_FFmpeged.mp4 (Note the suffix to ensure we can distinguish between the original and migrated versions.)
    5. Type: dir (Note that there's a new file there! Feel free to watch them both and compare.)
  4. Batch conversion (Note again that audio works the same way.)
    1. This FFmpeg Python script loops through a directory, checks to make sure a file has not already been converted, and, if not, runs the above FFmpeg command on the file.
  5. Exploring further options
    1. Explore further options using the Documentation.
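
A minimal sketch of such a batch loop (again, not the script referred to in this handout) might look like the following. It assumes ffmpeg can be started by name from the Command Prompt, as in the single-file step, and that the sources are .avi files; the function names and the defaults are assumptions chosen for illustration.

```python
import subprocess
from pathlib import Path

SUFFIX = "_FFmpeged"


def pending_videos(folder, source_ext=".avi", target_ext=".mp4"):
    """Return (source, target) Path pairs that have not been converted yet."""
    pending = []
    for source in sorted(Path(folder).iterdir()):
        if not source.is_file() or source.suffix.lower() != source_ext:
            continue
        # bird.avi -> bird_FFmpeged.mp4, matching the single-file example.
        target = source.with_name(source.stem + SUFFIX + target_ext)
        if not target.exists():
            pending.append((source, target))
    return pending


def convert_videos(folder, ffmpeg="ffmpeg"):
    """Run `ffmpeg -i source target` for each pending file."""
    for source, target in pending_videos(folder):
        subprocess.run([ffmpeg, "-i", str(source), str(target)], check=True)
```

Because already-converted files are skipped, the loop can be re-run safely after adding new files to the folder.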

HandBrake

HandBrake is a tool (that uses ffmpeg) for converting video from nearly any format to a selection of modern, widely supported codecs.

Instructions:

  1. Download and install HandBrake.
  2. Convert a single file
    1. Launch HandBrake.
    2. Click Tools → Options.
    3. Click Output Files.
    4. Ensure that Automatically name output files is checked, set Default Path to the Sample Audio and Video folder, and set Format to {source}_HandBraked. This suffix ensures that we can distinguish between the original and migrated versions.
    5. Ensure that Change case to Title Case and Replace underscores with a space are unchecked.
    6. Click Close.
    7. Click Source → File.
    8. Navigate to the Sample Audio and Video folder and select bird.avi.
    9. Click Start.
    10. Navigate to the Sample Audio and Video folder in Windows Explorer.
  3. Batch conversion
    1. Launch HandBrake.
    2. Click Source → Folder.
    3. Navigate to the Sample Audio and Video folder and select it.
    4. Click Add to Queue → Add Selection.
    5. Ensure that only cbw3.avi, drop.avi and flame.avi are checked.
    6. Click Start.
    7. Navigate to the Sample Audio and Video folder in Windows Explorer.
  4. Exploring further options
    1. Explore further options using the HandBrake User Guide.
  5. Close HandBrake.


Pt. 3: Metadata for Digital Preservation

Descriptive Metadata

Microsoft Word


  1. Navigate to “ES-Test-Files/dirk_hageman”
  2. Open the file “Knowledge Worker.doc”
  3. Select File > Info and examine the properties. What kind of information can you learn from the existing descriptive metadata?
    1. Who is the author of this document?

ms_word_file_info.PNG

Adobe Acrobat

  1. Navigate to “ES-Test-Files/email tests”
  2. Open the file “document1.pdf”
  3. Select File > Properties. What kind of information can you learn from the existing descriptive metadata?
    1. Who is the author of this document?
    2. When was this PDF created?
    3. What application was used to create this PDF?

pdf_metadata.PNG

  4. Click Additional Metadata…
  5. Add additional descriptive metadata
    1. Add a title, keywords, a brief description, etc.
    2. In what ways might this additional descriptive metadata be helpful to you or others in the future?

Batch extraction

[Note: This exercise will likely be easier to complete after completing the ExifTool exercise below]

  1. Open cmd.exe and navigate to the ExifTool directory

cmd_exiftool.PNG

  2. Shift + right-click on the “email tests” directory and select “Copy as path”
  3. Enter the following command in cmd.exe (use straight quotation marks, not curly ones):
    1. exiftool.exe "path\to\email tests\*.pdf" -csv > pdf_metadata.csv

exiftool_pdfs.PNG

pdf_metadata_csv.PNG

Additional tools: Xpdf


Technical Metadata

ExifTool

  1. Navigate to the ExifTool directory in one Windows Explorer window
    1. Ensure that the ExifTool executable is named exiftool(-k).exe

exiftool_exe.PNG

  2. Navigate to ES-Test-Files/martin_williams/offline-flickrsinc in another Windows Explorer window

offline-flickrsinc.PNG

  3. Drag and drop an image file from the offline-flickrsinc directory onto exiftool(-k).exe

exif_metadata.PNG

  4. What kind of information can you learn about the image file from the ExifTool output?
    1. What is the X Resolution of the image? The Y Resolution?
    2. What is the file size of the image?
    3. What is the image size?
  5. Rename the ExifTool executable to exiftool.exe

exiftool_exe_no-k.PNG

  6. Open cmd.exe and navigate to the ExifTool directory

cmd_exiftool.PNG

  7. Shift + right-click the offline-flickrsinc directory and select “Copy as path”

  8. Enter the following commands in cmd.exe, right-clicking in the cmd.exe window to paste the full path to offline-flickrsinc (use straight quotation marks, not curly ones):
    1. exiftool.exe "path/to/offline-flickrsinc"
    2. exiftool.exe -XResolution -YResolution "path/to/offline-flickrsinc"
    3. exiftool.exe -csv "path/to/offline-flickrsinc" > exif_metadata.csv
  9. Select a single image from the offline-flickrsinc directory, Shift + right-click the file and select “Copy as path”
  10. Run the following command in cmd.exe to read a metadata element from the file:
    1. exiftool.exe -ImageDescription "path/to/file.jpg"
  11. Now run the following command to write a value for that metadata element to the file:
    1. exiftool.exe -ImageDescription="Testing" "path/to/file.jpg"
  12. Run exiftool.exe -ImageDescription "path/to/file.jpg" again
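
Once exif_metadata.csv exists, it can be read back with a few lines of Python instead of a spreadsheet. The sketch below relies on the first CSV column being named SourceFile and the remaining columns being tag names such as XResolution; the function names are made up for illustration, and the tags you actually see depend on your files.

```python
import csv


def read_tag_column(csv_path, tag):
    """Map each SourceFile in an ExifTool CSV to its value for `tag` ('' if absent)."""
    values = {}
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            values[row["SourceFile"]] = (row.get(tag) or "").strip()
    return values


def files_missing_tag(csv_path, tag):
    """List the files that have no value recorded for `tag`."""
    column = read_tag_column(csv_path, tag)
    return sorted(name for name, value in column.items() if not value)
```

For example, files_missing_tag("exif_metadata.csv", "ImageDescription") would list the images that still lack a description.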

Bonus exercise

Try this step on some of your personal photographs on your computer, then upload a file to various social media sites (Facebook, Twitter, etc.) and download the file again. What metadata has changed? Disappeared? Remained the same?


Preservation Metadata

MD5summer

  1. Navigate to the MD5summer directory and open md5summer.exe

md5summer_exe.PNG

md5summer.PNG

  2. Select ES-Test-Files as the root folder and click “Create sums”

md5summer_root.PNG

  3. Navigate to the claudia_stern directory
  4. Select “README.TXT”
  5. Click “Add”

md5summer_add_single.PNG

  6. Click “OK”
  7. Save the resulting checksum as “README.md5”

readme_md5.PNG

  8. In MD5summer, select ES-Test-Files as the root directory and click “Verify sums”
  9. Select the “README.md5” file to verify the checksum

verified_sum.PNG

  10. Open the README.TXT file, add a character, and save the file
  11. Run the “Verify sums” step again

md5_error.PNG

  12. Now delete the character, save the file, and run “Verify sums” again. What happens?
  13. Make a copy of “README.TXT”
  14. Run “Create sums” on both files and compare the checksums

md5summer_copied_files.PNG

copied_files_checksums.PNG

  15. Select ES-Test-Files as the root folder yet again and click “Create sums”
  16. Select the “dirk_hageman” directory and click “Add recursively”

add_recursively.PNG

  17. Click “OK” and save the results as “dirk_hageman.md5”

dirk_hageman_md5.PNG

  18. Try adding, deleting, or otherwise changing a file in the dirk_hageman directory and running the “Verify sums” step with the “dirk_hageman.md5” file. What happens?
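
The create-then-verify cycle above can also be sketched in Python with the standard hashlib module. This is an illustration of the idea rather than a replacement for MD5summer: it keeps the checksums in a dictionary instead of writing an .md5 file, and the function names are made up for this handout.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=65536):
    """Return the MD5 checksum of a file as a hex string, reading in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def create_sums(root):
    """Return {relative_path: md5} for every file under root (like "Add recursively")."""
    sums = {}
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(folder, name)
            sums[os.path.relpath(full, root)] = md5_of_file(full)
    return sums


def verify_sums(root, sums):
    """Compare stored checksums to the files on disk: OK, CHANGED, or MISSING."""
    results = {}
    for rel, expected in sums.items():
        full = os.path.join(root, rel)
        if not os.path.exists(full):
            results[rel] = "MISSING"
        elif md5_of_file(full) == expected:
            results[rel] = "OK"
        else:
            results[rel] = "CHANGED"
    return results
```

Note that verify_sums only reports on files that were present when the sums were created; a file added afterwards has no stored checksum to compare against.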

PDA 2016: Free and Open Source Tools