1 of 41

Photo is © Jennifer Peebles, used under a Creative Commons Attribution-NonCommercial license. https://goo.gl/tUDx1w

Pickling your Project: Workshop

2 of 41

Preservation is another phase in the lifecycle of your research:

  • Conception
  • Formulation
  • Research
  • Product
  • Citation
  • Preservation
  • Access

3 of 41

Outstanding stats: The lifespan of the average website

  • 1997: Scientific American and Brewster Kahle both claim 44 days [1, 2]
  • 2001: IEEE claims 75 days [1]
  • 2003: Washington Post article claims 100 days [1]
  • 2011: 30% of links were dead [6]
  • 2014: Average lifespan of a web page is 1,132.1 days [4]
  • 2014: One year from now, 85% of the pages available on the web today will have disappeared or been modified. [5]

1 Mike Ashenfelder, “The Average Lifespan of a Webpage | The Signal: Digital Preservation,” webpage, (November 8, 2011), http://blogs.loc.gov/digitalpreservation/2011/11/the-average-lifespan-of-a-webpage/.

2 “A Look at Website Lifespans,” Bismarck Tribune, accessed November 17, 2015, http://bismarcktribune.com/news/columnists/keith-darnay/a-look-at-website-lifespans/article_1d879ae6-851a-11e3-8bd1-0019bb2963f4.html.

3 Joy Thomas, “Web Site Demise and Graduate Research: Persistence of Web Pages Cited in Social Work Theses,” Behavioral & Social Sciences Librarian 22, no. 2 (January 2004): 67–77, doi:10.1300/J103v22n02_04.

4 T. Agata et al., “Life Span of Web Pages: A Survey of 10 Million Pages Collected in 2001,” in 2014 IEEE/ACM Joint Conference on Digital Libraries (JCDL), 2014, 463–64, doi:10.1109/JCDL.2014.6970226.

5 Daniel Gomes and Miguel Costa, “The Importance of Web Archives for Humanities,” Journal of Humanities & Arts Computing: A Journal of Digital Humanities 8, no. 1 (March 2014): 106–23, doi:10.3366/ijhac.2014.0122.

6 Adrienne LaFrance, “Raiders of the Lost Web,” The Atlantic, October 14, 2015, http://www.theatlantic.com/technology/archive/2015/10/raiders-of-the-lost-web/409210/.

4 of 41

5 of 41

Depending upon your project

If your dissertation/thesis is a web-based project:

  • The library and I will work with you to help you follow best development practices.
  • The library will crawl your site with Archive-It; this is an iterative process and creates a far-from-perfect archived copy. https://www.archive-it.org/organizations/713
  • We will require a WARC file created with Webrecorder. It is a simple tool, but feel free to make an appointment to learn how to use it.

Let us know as early as possible, so we can review best practices, walk through Webrecorder, and start performing test crawls.

6 of 41

Some Best Practices

  • Make sure the site is built with proper architecture: each page on the site should have a unique URL.
  • Whenever possible, host content locally rather than pointing to third-party sites. Content includes video, audio, and code (scripts, CSS, etc.).
  • Do not use Flash.
  • Delete or modify your robots.txt file to allow crawling (see the sketch after this list). You can develop and test it with Google's robots.txt Tester (https://support.google.com/webmasters/answer/6062598?hl=en).
  • Websites with nested JavaScript generally do not archive well.
  • Audio and video streamed over the Real-time Transport Protocol (RTP) do not archive well.
  • If you embed video, embed only YouTube videos, and each video should appear only once on the site or the crawler will not capture it. Vimeo embeds are not crawled, so avoid Vimeo.
  • Search is not captured, so we recommend that you not make it a primary focus of your website. Note: if search is important, you can collect the URLs of what you expect to be the most popular search-result pages and add them to a page on your site; the crawler may then be able to capture those searches. Test the search-result URLs in a different browser.
  • Interactivity is not captured, so you might not want it to be the primary focus of your website. Note: if interactivity is important, you might build a static rather than dynamic site, screen-capture video of the interactive aspects, and post the video somewhere on the site. However, do not let the Archive-It crawler's limited functionality constrain you: Webrecorder might be able to capture the interactivity that Archive-It cannot.
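
For example, a minimal robots.txt that lets every crawler (including Archive-It's Heritrix) reach the entire site looks like the sketch below; an empty Disallow rule blocks nothing:

    # robots.txt at the root of the site
    User-agent: *
    Disallow: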

7 of 41

For Example:

8 of 41

The timeline does not appear in the archived version

http://wayback.archive-it.org/4739/20140722000114/http://mydigitalfootprint.org/

9 of 41

And

10 of 41

Embedded media does not appear in the archived version

11 of 41

Another example

http://dbpod.graciass.net/browse

12 of 41

Searches

Returns:

13 of 41

In the Archived Version

http://wayback.archive-it.org/5163/20150917122237/http://dbpod.graciass.net/browse

14 of 41

The same search returns “Not in archive” in the Archive-It version

15 of 41

Other Interactive Elements | Maps: http://nycfashionindex.com/

16 of 41

No interactivity in the archived version

http://wayback.archive-it.org/5978/20150921204622/http://nycfashionindex.com/

17 of 41

http://dropoutsdropin.org/

18 of 41

http://wayback.archive-it.org/4739/20160511121130/http://dropoutsdropin.org/

19 of 41

http://wayback.archive-it.org/5484/20150403202249/http://inq13.gc.cuny.edu/videos/

20 of 41

Crawlers cannot fully simulate a user interacting with a site through a browser because crawlers can read, but generally cannot execute, many of the scripts embedded in a website.
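
You can check this yourself: fetch a page the way a crawler does and see whether text that only appears after scripts run is present in the raw HTML. A minimal sketch, assuming Python with the requests library; the URL and marker text are placeholders, not from an actual project:

    import requests

    # Fetch the raw HTML, as a crawler would -- no scripts are executed.
    html = requests.get('http://example.org/timeline', timeout=30).text  # placeholder URL

    # 'Timeline of events' stands in for text the browser shows only after JavaScript runs.
    if 'Timeline of events' not in html:
        print('This content is rendered by script; a crawler will likely miss it.')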

21 of 41

  • Dynamic content/interactivity: maps, searches, etc.
  • Some nested/embedded JavaScript (crawlers are unable to execute the scripts)
  • Timelines
  • Vimeo, SoundCloud, Flash
  • Crawler traps: calendars, some social media
  • Copyright/IP

22 of 41

The web is a…

Archive-It support:

“In general, the steps you took towards expanding/limiting the scope and using the Developer Tools to pinpoint what was missing/had changed were spot-on. This is exactly what scoping is all about. The phrase we commonly invoke, ‘the web is a mess,’ is quite real.”

23 of 41

Again, Some Best Practices

  • Make sure the site is built with proper architecture: each page on the site should have a unique URL.
  • Whenever possible, host content locally rather than pointing to third-party sites. Content includes video, audio, and code (scripts, CSS, etc.).
  • Do not use Flash.
  • Delete or modify your robots.txt file to allow crawling. You can develop and test it with Google's robots.txt Tester (https://support.google.com/webmasters/answer/6062598?hl=en).
  • Websites with nested JavaScript generally do not archive well.
  • Audio and video streamed over the Real-time Transport Protocol (RTP) do not archive well.
  • If you embed video, embed only YouTube videos, and each video should appear only once on the site or the crawler will not capture it. Vimeo embeds are not crawled, so avoid Vimeo.
  • Search is not captured, so we recommend that you not make it a primary focus of your website. Note: if search is important, you can collect the URLs of what you expect to be the most popular search-result pages and add them to a page on your site; the crawler may then be able to capture those searches. Test the search-result URLs in a different browser.
  • Interactivity is not captured, so you might not want it to be the primary focus of your website. Note: if interactivity is important, you might build a static rather than dynamic site, screen-capture video of the interactive aspects, and post the video somewhere on the site. However, do not let the Archive-It crawler's limited functionality constrain you: Webrecorder might be able to capture the interactivity that Archive-It cannot.

24 of 41

Depending upon your project

If your dissertation/thesis is a web-based project:

  • The library and I will work with you to help you follow best development practices.
  • The library will crawl your site with Archive-It; this is an iterative process and creates a far-from-perfect archived copy. https://www.archive-it.org/organizations/713
  • We will require a WARC file created with Webrecorder. It is a simple tool, but feel free to make an appointment to learn how to use it.

Let us know as early as possible, so we can review best practices, walk through Webrecorder, and start performing test crawls.

25 of 41

26 of 41

We have you do it (DIY)

webrecorder.io

27 of 41

Webrecorder Resolves

  • The inability of Archive-It to simulate the browser experience.
  • The struggle of precisely scoping (e.g., searches).

28 of 41

Recommendations & Processes Documented in a LibGuide

We:

  • Inform depositing students that they should submit an online form so the library can start working with Archive-It to crawl their website.
  • Suggest web development practices that allow for better preservation.
  • Share simple Webrecorder instructions.
  • Explain that we will add the Archive-It URL and upload their user-generated WARC file, made with Webrecorder, to our institutional repository’s (Academic Works) ETD series.

29 of 41

Demo

30 of 41

31 of 41

32 of 41

33 of 41

34 of 41

35 of 41

36 of 41

Ability to test your WARCs using a player
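
Besides playing a capture back, you can inspect a WARC programmatically to confirm what was recorded. A minimal sketch, assuming Python with the warcio library (pip install warcio); the filename is a placeholder:

    from warcio.archiveiterator import ArchiveIterator

    # List every captured URL with its HTTP status code.
    with open('my-site.warc.gz', 'rb') as stream:  # placeholder filename
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                url = record.rec_headers.get_header('WARC-Target-URI')
                status = record.http_headers.get_statuscode() if record.http_headers else '-'
                print(status, url)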

37 of 41

Why We Embraced Webrecorder

  • Better captures interactivity on a website than Archive-It/Heritrix.
  • More representative of how a website should be used, because the content creator knows best how they would like their users to understand their site.
  • Easier to scope than Archive-It.

38 of 41

Uploaded to CUNY’s Institutional Repo (Academic Works)

In addition to providing links to Archive-It in our catalog and institutional repository (Academic Works), we decided to upload the user-generated WARC from Webrecorder to our institutional repository for additional preservation. Note: according to Corey Davis, “capturing websites in WARC format for playback and full-text search is only a part of what is needed for true digital preservation. WARC files backed up by the Internet Archive are susceptible to corruption.” [1]

1 Corey Davis, “Archiving the Web: A Case Study from the University of Victoria,” The Code4Lib Journal, no. 26 (2014), accessed November 13, 2015.
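
Because corruption is a real risk, one extra safeguard (our suggestion, not a step from the Davis article) is to record a fixity checksum for the WARC at deposit time so the copy can be verified later. A minimal sketch in Python; the filename is a placeholder:

    import hashlib

    # Compute a SHA-256 fixity value for the WARC before depositing it.
    sha256 = hashlib.sha256()
    with open('my-site.warc.gz', 'rb') as f:  # placeholder filename
        for chunk in iter(lambda: f.read(1 << 20), b''):
            sha256.update(chunk)
    print(sha256.hexdigest())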

39 of 41

If an application...

If your project is an application (desktop or mobile) and not a website, you are welcome to provide a link to your GitHub repository, but please also include:

  • A zip of all your source code.
  • A zip of the backend database (if a database-driven application).
  • A rudimentary readme file explaining software requirements (if relevant, e.g., OS, Apache, MySQL, PHP, Python version, etc.), so the project can be reproduced (see the sketch below).
  • A screencast showing how the application works. Note: this is more difficult on phones. See the guides for:
  • Windows
  • iPhone
  • Android
Please upload all of these files when you deposit to Academic Works.
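
As an illustration, a minimal readme covering the points above might look like the sketch below; every name and version number is a placeholder:

    Project:      <your project title>
    Requirements: Ubuntu 20.04, Apache 2.4, MySQL 8.0, PHP 8.1
    Setup:
      1. Create the database and import the dump: mysql -u root -p mydb < backup.sql
      2. Copy the source code into the Apache web root.
      3. Browse to http://localhost/
    Demo:         screencast.mp4 (included in this deposit)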

40 of 41

More info

Visit our Born Digital Deposits LibGuide:

http://goo.gl/enXGJJ

41 of 41

Thank you

Stephen Klein
Digital Systems Librarian
sklein@gc.cuny.edu