DROID Workshop - Core Document

Table of Contents

Overview

Shortened Document Link, Handout Link & Remote Connectivity

Twitter Hashtag

Lightning Round (Notes)

Training session: Workflow Core Concepts (Notes)

Presentation: Workflow Elements and Concepts - Common Practices (Notes)

Presentation: Social Issues in Collaborative Digitization (Notes)

Day 1 Breakout Group Documentation

Breakout Group A - Specimens on flat sheets/in packets

~ 20 minutes: Current Workflow Constraints

~ 10 minutes: Workflow Processes Across Institutions

~ 30 minutes: Metrics & How to Measure Success

~ 45 minutes: Workflow evaluation matrix

Breakout Group B - Specimens pinned in trays

~ 20 minutes: Current Workflow Constraints

~ 10 minutes: Workflow Processes Across Institutions

~ 30 minutes: Metrics & How to Measure Success

~ 45 minutes: Workflow evaluation matrix

Breakout Group C - Three-dimensional specimens in boxes/drawers & jars

~ 20 minutes: Current Workflow Constraints

~ 10 minutes: Workflow Processes Across Institutions

~ 30 minutes: Metrics & How to Measure Success

~ 45 minutes: Workflow evaluation matrix

Breakout Group reports to the re-assembled Plenary Group (Notes)

Pre-Workshop Survey results and discussion (Notes)

Training session: Business Process Modeling (Notes)

Day 2 Breakout Group Documentation - Workflows

Breakout Group A - Specimens on flat sheets/in packets

Breakout Group B - Specimens pinned in trays

Breakout Group C - Three-dimensional specimens in boxes/drawers & specimens in spirits in jars

Plenary: reports back from the Breakout Groups and discussion (Notes)

Vision for the Future / Minority Reports / Out-of-the-box ideas (Notes)

Plenary wrap-up discussion. DROID Working Group strategy for polishing and dissemination of workshop products. (Notes)

Word Bank for Common Terms

Workshop Agenda

Software Tools

Overview

Thank you for participating in the development and documentation of improved biodiversity digitization workflows in the DROID (Developing Robust Object-to-Image-to-Data) Workshop. This workshop is expected to generate a number of paper as well as digital artifacts. To facilitate the consolidation and publication of data, and to encourage community contribution, we ask that all digital data be stored directly within this single core workshop document. Modifications and edits from all participants are encouraged throughout the Workshop. Please ensure that any paper artifacts are turned in to a Workshop staff member before the close of the Workshop (be sure to include your name, Working Group session, and other pertinent information on the paper artifact to help identify its content).

Shortened Document Link, Handout Link & Remote Connectivity

URL to this online Google Document - http://tinyurl.com/d2hxs8z

URL to DROID Handouts in Google Documents - http://tinyurl.com/7ejqvvm

URL to the DROID Adobe Connect site - https://idigbio.adobeconnect.com/droid 

Twitter Hashtag

#idigbio

Lightning Round (Notes)

Presentation Order:

Several institutions are using open-source software; however, most are also using (or augmenting other software with) ad-hoc, internally written software. This software is not shared with other institutions, or even with other collections within the same institution, due to concerns about software quality (authors are not comfortable that it is good enough for sharing), lack of documentation, and concerns about having to support the software as other collections/institutions adopt it and run into issues or questions.

There is no current consensus on a workflow documentation protocol/software. The workshop will focus on the processes first, and then take a closer look at workflow documentation protocol/software selection.

Leveraging already-databased collecting events helps to pre-fill data for newly databased specimens from the same event. However, this is primarily taken advantage of only within individual collections databases. Combining and searching data from all collections/institutions would prove even more helpful: a role for Scatter/Gather/Reconcile (SGR), although SGR is currently only populated with herbarium data.
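The collecting-event reuse described above amounts to a lookup: given a record already databased from the same collecting event, pre-fill the shared fields of a new specimen record. A minimal sketch follows; the field names (`collector`, `event_date`, `locality`) and the event-id key are hypothetical, not SGR's actual schema.

```python
# Minimal sketch of pre-filling a new specimen record from an
# already-databased record that shares the same collecting event.
# Field names here are hypothetical, not an actual SGR schema.

# Fields shared by every specimen from one collecting event.
EVENT_FIELDS = ("collector", "event_date", "locality", "lat", "lon")

def prefill_from_event(event_index, event_id, new_record):
    """Copy event-level fields into new_record, leaving any
    values already present (e.g. keyed by hand) untouched."""
    match = event_index.get(event_id)
    if match is None:
        return new_record  # nothing databased for this event yet
    for field in EVENT_FIELDS:
        new_record.setdefault(field, match[field])
    return new_record

# One previously databased record, indexed by a collecting-event id.
index = {
    "JB-1987-042": {
        "collector": "J. Beach",
        "event_date": "1987-06-12",
        "locality": "Douglas Co., Kansas",
        "lat": 38.95,
        "lon": -95.25,
    }
}

record = prefill_from_event(index, "JB-1987-042", {"barcode": "KU0001234"})
print(record["collector"], record["locality"])
```

The same idea scales from one collection database to a cross-institution index, which is the gap SGR aims to fill.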

Training session: Workflow Core Concepts (Notes)

“If you want to succeed, double your failure rate!  Fail in new, useful, and educational ways.”

Presentation: Workflow Elements and Concepts - Common Practices (Notes)

Presentation: Social Issues in Collaborative Digitization (Notes)

Day 1 Breakout Group Documentation

Breakout groups are preconceived groupings based on preservation type. These groupings are for consideration only and may be modified based on feedback from participants.

Breakout Group A - Specimens on flat sheets/in packets

http://idigbio.adobeconnect.com/droid1/

Members of Group A: Dorothy Allard, Les Landrum, Melissa Tulig, Michael Bevans, Ed Gilbert, Jason Best, Rusty Russell, Austin Mast (Moderators: Larry Page, Chris Norris, Jason Grabon)

~ 20 minutes: Current Workflow Constraints

Current workflow constraints and proposed solutions (process, technology, staffing, funding, institutional culture/psychology, more...)

~ 10 minutes: Workflow Processes Across Institutions

Elements that would cause a workflow to diverge from one institution to the next (volunteerism, level of professional expertise within the digitization process, funds, more...)

~ 30 minutes: Metrics & How to Measure Success

What defines a “successfully digitized object” (the outcome of an optimal workflow, including databasing, geo-referencing, etc) and measurements of success (cost per specimen, throughput per hour, minimization of level of knowledge required to fully digitize an object via process and tools, queue times, more...)

~ 45 minutes: Workflow evaluation matrix

As the biodiversity collections community moves forward with digitization efforts, we need strategies not only for documenting workflows, but also systematic methods for evaluating workflows to look for ways to increase efficiency. Some synonyms for efficiency include: effectiveness, efficaciousness, productiveness. While speeding up and automating processes certainly improves efficiency, there are other related factors to consider that, if optimized, can minimize damage to specimens, influence data quality, and increase worker satisfaction.

With this in mind, please consider the matrices below as a starting point to develop a methodical way to try and find various points in our workflows where productiveness might be increased. We look forward to your input on these forms and tweaking them to add value.

Look for opportunities to increase workflow efficiency in a systematic manner. How might one increase efficiency?

 

It is our plan to utilize the data captured in these forms to compile lists of community needs in each area (e.g., software development, sharing existing physical tools, a list of steps that can be done with citizen scientists, ...).

At the end of this section there is a sample set with comments to show how these documents may help the community coalesce around these ideas.

Pre-Digitization Curation Tasks

Evaluation columns (mark each task against these): must be done before digitization; could be done at or after digitization; could be done by local volunteers, students, or non-PI staff; could be done remotely (i.e., crowdsourced); represents a step that could be automated; would benefit from QA/QC; could be done with current existing machinery (e.g., Kirtas); could benefit from authority file creation or sharing (if one exists); a physical tool exists to speed up or otherwise make the task more efficient; time/costs for this task are easily computed; formulas exist.

Tasks (no evaluation columns were marked):

- identify specimens to be digitized
- identify location of specimen
- remove specimen from collection
- document/flag location to enable return of the specimen
- apply barcode
- hiring and training staff
- conservation and collection
- complete a project management plan
- specimen repair
- select/purchase hardware
- select/install/configure software
- identify authority files
- configure imaging station with a set scale and color chart

Imaging Specimen Tasks (label may be with specimen)

Evaluation columns as in the Pre-Digitization matrix above, except the first two refer to the imaging step (must be done before imaging; could be done at or after imaging).

Tasks (no evaluation columns were marked):

- place scale and color bar in the imaging frame
- calibrate camera to balance exposure and white balance based upon the color chart
- photograph the herbarium sheet
- select specimens with key features for close-up images, and image those specimens
- optional: rename file
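The camera-calibration task above (balancing exposure and white balance against the color chart) comes down to measuring the chart's neutral-gray patch and computing per-channel gains that make it read as neutral. A minimal sketch, with illustrative patch values rather than readings from any real chart:

```python
# Sketch of white-balance calibration against a color chart:
# measure the chart's neutral-gray patch, then compute per-channel
# gains that make the patch read as neutral gray. Patch values here
# are illustrative, not taken from a real chart.

def white_balance_gains(gray_patch_rgb, target=200.0):
    """Per-channel multipliers that map the measured gray patch
    to a neutral target value."""
    return tuple(target / c for c in gray_patch_rgb)

def apply_gains(pixel, gains):
    # Clamp to the 8-bit range after scaling.
    return tuple(min(255, round(c * g)) for c, g in zip(pixel, gains))

# Camera sees the gray patch with a warm (reddish) cast:
measured_patch = (220.0, 200.0, 180.0)
gains = white_balance_gains(measured_patch)

# After correction the patch itself reads neutral:
corrected = apply_gains(measured_patch, gains)
print(corrected)  # (200, 200, 200)
```

In practice the camera or imaging software performs this correction internally from a custom white-balance reading of the chart; the sketch just shows what that reading computes.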

Post Image Capture Image Processing Tasks

Evaluation columns as in the Imaging matrix above.

Tasks (no evaluation columns were marked):

- save archival copy
- optional: rename file
- create a web-presentation file
- add metadata (TBD, including copyright, photographer, type of photo, etc.)
- apply color adjustment (controversial)
- optional: redact locality information for sensitive specimens
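The "save archival copy" and "rename file" steps above are often combined: keep the camera's original untouched in an archive folder, and give the working copy a barcode-based name so downstream steps can match image to specimen. A sketch of that convention; the naming scheme (barcode plus original extension) is an example, not a community standard:

```python
# Sketch of the "save archival copy" and "rename file" steps:
# copy the camera's original into an archive folder untouched,
# then give the working copy a barcode-based name. The naming
# scheme (BARCODE.jpg) is an example, not a standard.
import shutil
import tempfile
from pathlib import Path

def archive_and_rename(camera_file: Path, barcode: str,
                       archive_dir: Path, work_dir: Path) -> Path:
    archive_dir.mkdir(parents=True, exist_ok=True)
    work_dir.mkdir(parents=True, exist_ok=True)
    # Archival copy keeps the original camera filename.
    shutil.copy2(camera_file, archive_dir / camera_file.name)
    # Working copy gets the specimen's barcode as its name.
    renamed = work_dir / f"{barcode}{camera_file.suffix}"
    shutil.copy2(camera_file, renamed)
    return renamed

root = Path(tempfile.mkdtemp())
original = root / "IMG_0001.jpg"
original.write_bytes(b"fake image bytes")

out = archive_and_rename(original, "KU0001234",
                         root / "archive", root / "working")
print(out.name)                                      # KU0001234.jpg
print((root / "archive" / "IMG_0001.jpg").exists())  # True
```

Keeping the untouched archival copy is what makes the "controversial" color-adjustment step reversible.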

Capture Specimen Data from Image (Or Specimen Label) Tasks

Evaluation columns as in the Imaging matrix above.

Tasks (no evaluation columns were marked):

- access queued images requiring data capture
- database utilizing voice recognition
- OCR
- NLP
- validate OCR results
- correct OCR errors
- execute NLP
- keystroking (internal project team)
- crowdsourced keystroking
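The OCR and NLP tasks above boil down to pulling structured fields out of transcribed label text. A minimal regex-based sketch of that parse step; a real NLP pass handles far messier labels, and the label text and patterns here are invented for illustration:

```python
# Minimal sketch of the OCR -> parse step: pull a few structured
# fields out of transcribed label text with regular expressions.
# (A real NLP pass handles far messier labels; the label text and
# field patterns here are invented for illustration.)
import re

LABEL = """FLORA OF KANSAS
Solidago rigida L.
Douglas Co.: 3 mi N of Lawrence
Coll. J. Beach  No. 1842   12 June 1987"""

def parse_label(text):
    fields = {}
    m = re.search(r"Coll\.\s+(.+?)\s+No\.\s+(\d+)", text)
    if m:
        fields["collector"] = m.group(1)
        fields["collector_number"] = m.group(2)
    m = re.search(r"(\d{1,2}\s+\w+\s+\d{4})", text)
    if m:
        fields["date"] = m.group(1)
    m = re.search(r"([A-Z][a-z]+)\s+Co\.", text)
    if m:
        fields["county"] = m.group(1)
    return fields

print(parse_label(LABEL))
```

Whether the transcription comes from OCR, voice recognition, or keystroking, the parsed fields feed the same validation steps in the next matrix.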

Post Specimen Data Capture Quality Analysis / Quality Control Tasks

Evaluation columns as in the Imaging matrix above.

Tasks (no evaluation columns were marked):

- validate country, state, and county against authority files
- programmatically validate lat/long coordinates
- validate taxonomy against authority file

Note: a common QA/QC tool would be extremely helpful for the community.
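The first two validation tasks above can be sketched in a few lines: check the record's country/state pair against an authority list and range-check its coordinates. The authority entries below are an illustrative subset, not a real authority file:

```python
# Sketch of the automated QA/QC checks named above: validate a
# record's country/state against a small authority list and
# range-check its lat/long. The authority entries are an
# illustrative subset, not a real authority file.
AUTHORITY = {
    ("United States", "Kansas"),
    ("United States", "Florida"),
    ("Mexico", "Sonora"),
}

def qc_record(record):
    """Return a list of human-readable QC problems (empty = passed)."""
    problems = []
    if (record.get("country"), record.get("stateProvince")) not in AUTHORITY:
        problems.append("country/state not in authority file")
    lat, lon = record.get("lat"), record.get("lon")
    if lat is None or not -90 <= lat <= 90:
        problems.append("latitude out of range")
    if lon is None or not -180 <= lon <= 180:
        problems.append("longitude out of range")
    return problems

good = {"country": "United States", "stateProvince": "Kansas",
        "lat": 38.9, "lon": -95.2}
bad = {"country": "United States", "stateProvince": "Kansass",
       "lat": 138.9, "lon": -95.2}
print(qc_record(good))  # []
print(qc_record(bad))   # two problems
```

A shared tool of exactly this shape, run against community authority files, is what the note above is asking for.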

Georeferencing Tasks

Evaluation columns as in the Imaging matrix above. No tasks were entered in this matrix.

Comments:

Sample Pre Digitization Curation Tasks:

Specimen Accession, Specimen Cataloging, Interview Staff, Hire Staff, Train Staff, Decide What to Digitize, Pull Specimens, Sort Specimens (e.g., by Taxon, Sex, Geographic Region, Collecting Event, Collector, Color, Size, Shape), Add Taxon Names to Database, Update Taxonomic Identification on Specimens (e.g., vet type specimens)

Sample Imaging Tasks:

Affix Barcode, Turn on Camera, Check Camera Settings, Check Lighting, Order Specimens, Take Photos, Stamp Specimen as “Imaged”, Return Specimen to Collection

 

Sample Post Image Capture Image Processing Tasks:

Name images, Rename Images, Store Original, Crop, Make Derivatives, Color Correction

Sample Capture Specimen Data from Image (Or Specimen Label) Tasks:

Turn on Computer, Log In (Remote or on Site), Open Image, Enter Taxon Data, Enter Locality Data, Enter Specimen Record (All Data), Enter Only Minimal Fields, Built in Quality Control Steps In Situ

Sample Post Specimen Data Capture Quality Analysis / Quality Control Tasks:

Turn on Computer, Log In (Remote or on Site), Automated QA/QC – Taxon Names; Collector Names; Place Names; County-State Validation

Sample Georeferencing Tasks:

Turn on Computer, Log In (Remote or on Site), One Record At A Time, Batch Georef Processing
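Batch georeference processing usually includes a sanity check that assigned coordinates fall within a plausible radius of the named locality; the great-circle (haversine) distance is the standard tool for that. A sketch, with illustrative coordinates and an assumed 25 km tolerance:

```python
# Great-circle (haversine) distance, commonly used in batch
# georeferencing QC to check that assigned coordinates fall within
# a plausible radius of the named locality. Coordinates and the
# 25 km tolerance below are illustrative.
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Is the record's point within 25 km of the locality reference?
reference = (38.97, -95.24)   # Lawrence, KS (approx.)
record = (38.80, -95.20)
distance = haversine_km(*reference, *record)
print(distance < 25)  # True
```

One-record-at-a-time georeferencing applies the same check interactively; batch processing just loops it over a queue.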

Breakout Group B - Specimens pinned in trays

http://idigbio.adobeconnect.com/droid2/

REVISION HISTORY (for this Group B section only). Discussion 5/30/12, 4:00-5:30 PM: Deb Paul made the initial notes below as group scribe; Deb updated the notes late P.M. 5/30 and early A.M. 5/31. Deb created an associated Word document for easier editing; for ease of use, the table in the fourth item below was edited in the Word document, not here. Jim Beach edited this document 5/31/12 at 5:30 AM.

Members of Group B: Jennifer Thomas, Paul Heinrich, Paul Morris, Petra Sierwald, Dmitry Dmitriev and Moderators: Jim Beach & Deb Paul, Stan Blum and 1-2 others online

~ 20 minutes: Current Workflow Constraints

Current workflow constraints and proposed solutions (process, technology, staffing, funding, institutional culture/psychology, more...)

~ 10 minutes: Workflow Processes Across Institutions

Elements that would cause a workflow to diverge from one institution to the next (volunteerism, level of professional expertise within the digitization process, funds, more...)

~ 30 minutes: Metrics & How to Measure Success

What defines a “successfully digitized object” (the outcome of an optimal workflow, including databasing, geo-referencing, etc) and measurements of success (cost per specimen, throughput per hour, minimization of level of knowledge required to fully digitize an object via process and tools, queue times, more...)

From Stan Blum (CAS) online:  “Success” can be understood as a set of capabilities:  

Cost issues:

Stan Blum: metrics differ from project to project. Can we break them down?

Regarding “bad data records”?

~ 45 minutes: Workflow evaluation matrix

As the biodiversity collections community moves forward with digitization efforts, we need strategies not only for documenting workflows, but also systematic methods for evaluating workflows to look for ways to increase efficiency. Some synonyms for efficiency include: effectiveness, efficaciousness, productiveness. While speeding up and automating processes certainly improves efficiency, there are other related factors to consider that, if optimized, can minimize damage to specimens, influence data quality, and increase worker satisfaction.

With this in mind, please consider the matrices below as a starting point to develop a methodical way to try and find various points in our workflows where productiveness might be increased. We look forward to your input on these forms and tweaking them to add value.

Look for opportunities to increase workflow efficiency in a systematic manner. How might one increase efficiency?

 

It is our plan to utilize the data captured in these forms to compile lists of community needs in each area (e.g., software development, sharing existing physical tools, a list of steps that can be done with citizen scientists, ...).

At the end of this section there is a sample set with comments to show how these documents may help the community coalesce around these ideas.

Pre-Digitization Curation Tasks

Evaluation columns as in the Group A matrices above.

Tasks (no evaluation columns were marked):

- define what is in scope in the proposal
- undergraduates (UGs) create a provisional taxon authority file, family by family, by going into the collection (~2 days): open a cabinet, remove 5 drawers, type in the taxa, put the trays back
- that list of names goes to an in-house or external taxon expert for validation and is returned marked up with taxon placement changes
- specimens are relocated based on the taxon changes and unit trays are all relabeled (UGs do this); all affected specimens are relocated, not just those being digitized
- during the relocation process, if a unit tray needs expansion, a new box is put in the drawer for those specimens; later, as specimens are barcoded, UGs move densely packed specimens into new, empty unit trays; this is repeated for the entire section, and drawers are labeled and initialed by students to track who did what
- series are sorted in unit trays by collecting event and then by host plant (by UGs); all specimens that look identical are put together in a “duplicate” series, all barcoded, then put back into the unit tray (or an expansion tray if needed); within a single unit tray all barcode numbers are sequential, which makes data entry easier because sequential numbers can be added to the spreadsheet with Excel’s auto-increment drag-and-drop function
- drawer numbers are used for tracking as folder names: images go into the filespace folder named for that drawer number, and when all images have been attached to the collection objects in the database the temporary folder is deleted
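The sequential-barcode convention described above (consecutive barcode numbers within a unit tray, so a spreadsheet column can be filled by auto-increment) can be sketched as generating the run of numbers for a tray. The prefix and zero-padding width below are hypothetical, not an institutional standard:

```python
# Sketch of the sequential-barcode convention above: specimens in a
# unit tray get consecutive barcode numbers, so a spreadsheet column
# can be filled by auto-increment. The prefix and zero-padding width
# are hypothetical, not an institutional standard.
def tray_barcodes(prefix, start, count, width=7):
    """Consecutive, zero-padded barcodes for one unit tray."""
    return [f"{prefix}{n:0{width}d}" for n in range(start, start + count)]

tray = tray_barcodes("ABC", 4210, 4)
print(tray)  # ['ABC0004210', 'ABC0004211', 'ABC0004212', 'ABC0004213']
```

Excel's drag-and-drop auto-increment produces the same run from the first value, which is why keeping each tray's numbers consecutive pays off at data-entry time.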

Imaging Specimen Tasks

Evaluation columns as in the Group A matrices above. No tasks were entered in this matrix.

Post Image Capture Image Processing Tasks

Evaluation columns as in the Group A matrices above. No tasks were entered in this matrix.

Capture Specimen Data from Image (Or Specimen Label) Tasks

Evaluation columns as in the Group A matrices above. No tasks were entered in this matrix.

Post Specimen Data Capture Quality Analysis / Quality Control Tasks

Evaluation columns as in the Group A matrices above. No tasks were entered in this matrix.

Georeferencing Tasks

Evaluation columns as in the Group A matrices above. No tasks were entered in this matrix.

Comments:

Sample Pre Digitization Curation Tasks:

Specimen Accession, Specimen Cataloging, Interview Staff, Hire Staff, Train Staff, Decide What to Digitize, Pull Specimens, Sort Specimens (e.g., by Taxon, Sex, Geographic Region, Collecting Event, Collector, Color, Size, Shape), Add Taxon Names to Database, Update Taxonomic Identification on Specimens (e.g., vet type specimens)

Sample Imaging Tasks:

Affix Barcode, Turn on Camera, Check Camera Settings, Check Lighting, Order Specimens, Take Photos, Stamp Specimen as “Imaged”, Return Specimen to Collection

 

Sample Post Image Capture Image Processing Tasks:

Name images, Rename Images, Store Original, Crop, Make Derivatives, Color Correction

Sample Capture Specimen Data from Image (Or Specimen Label) Tasks:

Turn on Computer, Log In (Remote or on Site), Open Image, Enter Taxon Data, Enter Locality Data, Enter Specimen Record (All Data), Enter Only Minimal Fields, Built in Quality Control Steps In Situ

Sample Post Specimen Data Capture Quality Analysis / Quality Control Tasks:

Turn on Computer, Log In (Remote or on Site), Automated QA/QC – Taxon Names; Collector Names; Place Names; County-State Validation

Sample Georeferencing Tasks:

Turn on Computer, Log In (Remote or on Site), One Record At A Time, Batch Georef Processing

Breakout Group C - Three-dimensional specimens in boxes/drawers & specimens in spirits in jars

http://idigbio.adobeconnect.com/droid3/

Members of Group C: Linda Ford, Dean Pentcheff, Talia Karim, Louis Zachos, Andy Bentley, Laurie Taylor  (Moderators: Gil Nelson, Amanda Neill, Laurie Taylor)

~ 20 minutes: Current Workflow Constraints

Current workflow constraints and proposed solutions (process, technology, staffing, funding, institutional culture/psychology, more...)

~ 10 minutes: Workflow Processes Across Institutions

Elements that would cause a workflow to diverge from one institution to the next (volunteerism, level of professional expertise within the digitization process, funds, more...)

~ 30 minutes: Metrics & How to Measure Success

What defines a “successfully digitized object” (the outcome of an optimal workflow, including databasing, geo-referencing, etc) and measurements of success (cost per specimen, throughput per hour, minimization of level of knowledge required to fully digitize an object via process and tools, queue times, more...)

~ 45 minutes: Workflow evaluation matrix

As the biodiversity collections community moves forward with digitization efforts, we need strategies not only for documenting workflows, but also systematic methods for evaluating workflows to look for ways to increase efficiency. Some synonyms for efficiency include: effectiveness, efficaciousness, productiveness. While speeding up and automating processes certainly improves efficiency, there are other related factors to consider that, if optimized, can minimize damage to specimens, influence data quality, and increase worker satisfaction.

With this in mind, please consider the matrices below as a starting point to develop a methodical way to try and find various points in our workflows where productiveness might be increased. We look forward to your input on these forms and tweaking them to add value.

Look for opportunities to increase workflow efficiency in a systematic manner. How might one increase efficiency?

 

It is our plan to utilize the data captured in these forms to compile lists of community needs in each area (e.g., software development, sharing existing physical tools, a list of steps that can be done with citizen scientists, ...).

At the end of this section there is a sample set with comments to show how these documents may help the community coalesce around these ideas.

Pre-Digitization Curation Tasks

Evaluation columns as in the Group A matrices above.

Tasks (an [x] notes a marked evaluation column):

- access to label data from container / removing specimens from containers
- investigate & document hazardous materials issues associated with retrieval [x: must be done before digitization]
- place specimens in wet box [x: must be done before digitization]
- add color and scale bars [x: must be done before digitization]

Imaging Specimen Tasks

Evaluation columns as in the Group A matrices above.

Tasks (an [x] notes a marked evaluation column):

- specimen cleaning & prep
- mounting for photo orientation [x: must be done before imaging]
- image stacking

Capture Specimen Data from Image (Or Specimen Label) Tasks

Evaluation columns as in the Group A matrices above. No tasks were entered in this matrix.

Post Specimen Data Capture Quality Analysis / Quality Control Tasks

Evaluation columns as in the Group A matrices above. No tasks were entered in this matrix.

Georeferencing Tasks

(Evaluation matrix with the same criteria columns as above; no task rows were filled in.)

Comments:

Sample Pre Digitization Curation Tasks:

Specimen Accession, Specimen Cataloging, Interview Staff, Hire Staff, Train Staff, Decide What to Digitize, Pull Specimens, Sort Specimens (e.g., by Taxon, Sex, Geographic Region, Collecting Event, Collector, Color, Size, Shape), Add Taxon Names to Database, Update Taxonomic Identification on Specimens (e.g., vet type specimens)
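Several of these curation tasks, such as sorting specimens by taxon or collecting event, are easy to script once records are in a database. A minimal sketch, assuming records are plain dictionaries (the field names and values here are hypothetical):

```python
from operator import itemgetter

# Hypothetical specimen records pulled from a collection database.
specimens = [
    {"taxon": "Quercus alba", "collector": "Smith", "event": "2011-07-04"},
    {"taxon": "Acer rubrum", "collector": "Jones", "event": "2010-05-12"},
    {"taxon": "Acer rubrum", "collector": "Smith", "event": "2009-03-30"},
]

# Sort by taxon first, then by collecting event, mirroring a physical
# pre-digitization sort (e.g., by Taxon, then Collecting Event).
ordered = sorted(specimens, key=itemgetter("taxon", "event"))
```

Swapping the key tuple (e.g., `itemgetter("collector", "taxon")`) reproduces any of the other sort orders listed above.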

Sample Imaging Tasks:

Affix Barcode, Turn on Camera, Check Camera Settings, Check Lighting, Order Specimens, Take Photos, Stamp Specimen as “Imaged”, Return Specimen to Collection

 

Sample Post Image Capture Image Processing Tasks:

Name images, Rename Images, Store Original, Crop, Make Derivatives, Color Correction
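Renaming images and naming derivatives usually follows a fixed convention; the barcode-based convention below is only an assumed example, not a community standard:

```python
from pathlib import Path

def derivative_names(original: str, barcode: str):
    """Given a camera filename and a specimen barcode, return the archival
    name plus web and thumbnail derivative names (hypothetical convention)."""
    ext = Path(original).suffix.lower()  # e.g. ".jpg"
    return {
        "archival": f"{barcode}{ext}",        # full-resolution original
        "web": f"{barcode}_web{ext}",         # resized copy for web display
        "thumb": f"{barcode}_thumb{ext}",     # small catalog thumbnail
    }

names = derivative_names("IMG_0001.JPG", "BRIT0012345")  # BRIT... is a made-up barcode
```

Keeping the mapping in one function means the "Rename Images" and "Make Derivatives" steps always agree on file names.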

Sample Capture Specimen Data from Image (Or Specimen Label) Tasks:

Turn on Computer, Log In (Remote or on Site), Open Image, Enter Taxon Data, Enter Locality Data, Enter Specimen Record (All Data), Enter Only Minimal Fields, Built in Quality Control Steps In Situ

Sample Post Specimen Data Capture Quality Analysis / Quality Control Tasks:

Turn on Computer, Log In (Remote or on Site), Automated QA/QC – Taxon Names; Collector Names; Place Names; County-State Validation
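The county-state validation listed above can be automated against an authority file. A minimal sketch, assuming the authority file has been loaded into a dict (the entries shown are illustrative only):

```python
# Hypothetical authority file: state -> set of valid county names.
COUNTIES = {
    "Florida": {"Alachua", "Marion", "Leon"},
    "Texas": {"Tarrant", "Travis"},
}

def check_county(state: str, county: str) -> bool:
    """True if the county belongs to the stated state per the authority file."""
    return county in COUNTIES.get(state, set())

# Records failing the check are queued for human review rather than
# auto-corrected, since the error may be in either field.
flagged = [r for r in [("Florida", "Tarrant"), ("Texas", "Travis")]
           if not check_county(*r)]
```

The same pattern (lookup against an authority file, flag for review) applies to taxon names, collector names, and place names.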

Sample Georeferencing Tasks:

Turn on Computer, Log In (Remote or on Site), One Record At A Time, Batch Georef Processing
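Batch georeferencing can be sketched as a lookup of verbatim locality strings against a gazetteer, attaching an error radius to record precision; records that miss fall back to one-at-a-time manual work. The gazetteer entries and field layout below are illustrative assumptions:

```python
# Hypothetical gazetteer: normalized locality -> (lat, long, error radius in m).
GAZETTEER = {
    "gainesville, florida": (29.6516, -82.3248, 5000),
    "fort worth, texas": (32.7555, -97.3308, 8000),
}

def georeference(verbatim: str):
    """Return (lat, long, error_m) for a locality string, or None so the
    record can fall back to one-record-at-a-time manual georeferencing."""
    return GAZETTEER.get(verbatim.strip().lower())

hits = {loc: georeference(loc)
        for loc in ["Gainesville, Florida", "Unknown Creek, Texas"]}
```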

Breakout Group reports to the re-assembled Plenary Group (Notes)

Pre-Workshop Survey results and discussion (Notes)

Self-assessment

16 respondents

Often protocols for imaging, databasing, workflow in place

Rarely protocols for hardware, software, training staff, or georeferencing

Most respondents reported doing some kind of image manipulation, most are saving images as JPEGs.

Most work being done by students or paid staff

Only two crowdsourcing projects (one more like citizen science than true crowdsourcing)

Damage to specimens does occur, stats are not kept, damage is usually repaired immediately

79% reported benefits to digitization

Training session: Business Process Modeling (Notes) -- Jason Grabon

Workflows should not be static -- they become less efficient over time

Workflows should be continually improved and should have redundancy built in

Work breakdown structure (task list) + dependencies for each task are most important for us today

With extra time, add human/physical resources and time lags

The only way to definitively tell which workflow is most efficient is to actually use them and time them.

Suggestion of log sheets attached to cabinet or drawer with steps listed, having staff write date/time for each step.

Suggestion of development of modular processes that each institution can pick and choose from and use easily.

A training document should have a checklist first, then the written details.
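The work breakdown structure plus per-task dependencies described above is already enough to compute a valid execution order automatically. A minimal sketch using a topological sort (the task names are examples, not a prescribed workflow):

```python
from graphlib import TopologicalSorter

# Work breakdown structure: task -> set of tasks it depends on.
wbs = {
    "pull specimen": set(),
    "set up camera": set(),
    "image specimen": {"pull specimen", "set up camera"},
    "rename file": {"image specimen"},
    "return specimen": {"image specimen"},
}

# Dependencies always appear before the tasks that need them.
order = list(TopologicalSorter(wbs).static_order())
```

Adding per-task durations (e.g., from the log sheets suggested above) to this structure is the basis for identifying bottlenecks and comparing workflow variants.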

Day 2 Breakout Group Documentation - Workflows

www.idigbio.org/sites/default/files/sites/default/files/Business%20Process%20Management.pptx 

www.idigbio.org/sites/default/files/videos/slides/Nelson_DROID.pptx

Breakout Group A - Specimens on flat sheets/in packets

IDENTIFY TOOLS THAT CAN HELP WITH THESE TASKS

Module 1: Project Management

Task list (the Dependency(ies) and Resource(s) columns were left blank):

T1: Define scope of project and goals
T2: Evaluate, select, purchase equipment and software
T3: Coordinate grant-funded projects
T4: Hire staff
T5: Define practical scope
T6: Identify IT requirements
T7: Purchase/obtain IT services
T8: Set up project meetings
T9: Feedback
T10: Train staff
T11: Define schedules/timeline
T12: Create documentation
T13: Create/identify authority files
T14: Budget management/accounting reporting
T15: Reporting
T16: Integration with other activities
T17: Sustainability plan
T18: Install equipment

Module 2: Pre-Digitization Curation

Task list (the Dependency(ies) and Resource(s) columns were left blank):

T1: Identify specimens to be digitized
T2: Identify location of specimen
T3: Remove specimen from collection and bring to imaging station
T4: Document/flag location to enable return of the specimen
T5: Apply barcode
T6: Specimen conservation
T7: Select specimens with key features for close-up images
T8: Publication
T9: Quality Control/QA
T10: Archiving
T11: Create skeletal record (** this may need to be its own module; there are multiple places where it can be executed)
T12: Optional: validate taxonomy

Module 3: Imaging

Task list (the Dependency(ies) and Resource(s) columns were left blank):

T1: Start stable light source and allow it to reach running temperature (or check flash operation)
T2: Calibrate camera to balance exposure and white balance against a color chart
T3: Add metadata (copyright, photographer, type of photo, …)
T4: Apply color adjustment (controversial)
T5: Place scale and color bar in the imaging frame
T6: Redact locality information for sensitive specimens
T7: Frame the specimen
T8: Image the complete specimen (herbarium sheet)
T9: Image the label
T10: Image the ancillary/archival material (ledgers, field notes)
T11: Optional: close-up imaging (image the barcode)
T12: Light the specimen
T13: Scan barcode (in order to rename the file)
T14: Rename file
T15: Publish image to a public or private location
T16: Archive and create derivatives (web presentation file, OCR file)
T17: Quality Control/Quality Assurance
T18: Stamp to indicate the specimen has been imaged
T19: Return specimen to the collection

Module 4: Data Enrichment

Task list (the Dependency(ies) and Resource(s) columns were left blank):

T1: Georeferencing
  T1a: Ingest locality data set into the georeferencing tool
  T1b: Attempt automated georeferencing
  T1c: Validate georeferencing results by reviewing map results
  T1d: Adjust points (manual keying or crowdsourcing)
  T1e: Add error radius/shapefile to define precision
T2: Optical Character Recognition (OCR)
  T2a: Ingest label images into the OCR tool
  T2b: Delineate regions of interest with text (Apiary) and identify text classification
  T2c: Attempt OCR on the label
  T2d: Archive raw text
  T2e: Validate OCR results
  T2f: Correct OCR errors (manual keystroking or crowdsourcing)
T3: Natural Language Processing (NLP)
  T3a: Ingest data into the NLP tool (typically OCR’d, but possibly typed into a document)
  T3b: Train/set up/configure grammars and parsing (predefined formats and cases, e.g. dates, duplicates)
  T3c: Attempt automated NLP
  T3d: Validate parsed NLP results
  T3e: Correct parsed NLP results (manual keystroking or crowdsourcing)
T4: Publication of enriched data
T5: Archiving the enriched data
T6: Quality Control/Quality Assurance
T7: Transcription
T8: Access queued images requiring data capture
T9: Database utilizing speech recognition
  T9a: Train the software
  T9b: View the label
  T9c: Read the label
  T9d: Record data into the database
  T9e: Validate results
  T9f: Manually correct errors
T10: Manual Data Entry (Keystroking) - Internal Project Team
T11: Manual Data Entry (Keystroking) - Crowdsourcing
T12: Validate country, state, and county against authority files
T13: Programmatically validate lat/long coordinates
T14: Validate taxonomy against authority files

** A common QC tool would be extremely helpful for the community
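The programmatic lat/long validation listed above can be as simple as range and sign checks, optionally against a per-country bounding box; the bounding box below is an illustrative assumption, not authoritative data:

```python
# Hypothetical per-country bounding boxes: (min_lat, max_lat, min_lon, max_lon).
BBOX = {"United States": (24.5, 49.5, -125.0, -66.9)}

def valid_coords(lat, lon, country=None):
    """Basic sanity checks: on-globe ranges, then an optional country box.
    A common sign error (positive longitude in the western hemisphere)
    is caught by the bounding-box test."""
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return False
    if country in BBOX:
        lo_lat, hi_lat, lo_lon, hi_lon = BBOX[country]
        return lo_lat <= lat <= hi_lat and lo_lon <= lon <= hi_lon
    return True
```

For example, `valid_coords(29.65, 82.32, "United States")` fails because the longitude sign is flipped.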

Breakout Group B - Specimens pinned in trays

http://tinyurl.com/cha6kto

Breakout Group C - Three-dimensional specimens in boxes/drawers & specimens in spirits in jars

Ledgers/card catalogs (materials not directly associated with specimens)

Tasks (Resources in parentheses; the Dependencies column was left blank):

T1: Select and retrieve object (Human)
T2: Transport to staging area (Human, cart, vehicle)
T3: Locate page(s) (Human)
T4: Image page (Human, camera/scanner)
T5: Name file (Human)
T6: Store file (Hardware, software)
T7: Populate core metadata (process/admin/technical) (Human)
T8: QC images (Human)
T9: Re-store object (Human, cart, vehicle)
T10: Create verbatim data from image file (OCR, etc.) (Human, technology)
T11: Clean/verify data (Human)
T12: Create interpreted data (Human)
T13: Clean and verify data (Human)
T14: QC data and correct if necessary (Human)
T15: Archive (Human, hardware)
T16: Augment data if necessary/desired (taxonomy, georeferencing) (Human, technology)
T17: Archive (Human, hardware)

Labels associated with specimens

Tasks (the Dependencies and Resources columns were left blank):

T1: Select and retrieve specimens/lot/container
T2: Find specimens in lot/container
T3: Transport to staging area
T4: If needed, extract label(s) (out of vials or jars, etc.)
T5: Record/mark label(s) and associated specimen(s) (so the association is not lost; can associate a color placed near the label with a color placed near the jar)
T6: If necessary, transport to imaging station (may be multiple or different - camera/scanner)
T7: Prepare label(s) for imaging (flatten, dry)
T8: Image label(s)
T9: Populate core metadata (process/admin/technical)
T10: QC image(s)
T11: Name file(s) and associate them
T12: Store file(s)
T13: Reassociate label(s) and specimen(s)
T14: Re-store specimen(s)
T15: Create verbatim data from file (OCR, etc.)
T16: Clean/verify data
T17: Create interpreted data
T18: Clean and verify data
T19: QC data and correct if necessary
T20: Archive
T21: Augment data if necessary/desired
T22: Archive

Specimens

Tasks (the Dependencies and Resources columns were left blank):

T1: Select and retrieve specimens/lot/container
T2: Find specimens in lot/container
T3: Transport to staging area
T4: Order specimens for optimal imaging efficiency (i.e., to prevent frequent lens changes)
T5: Record/mark label(s) and associated specimen(s) (so the association is not lost)
T6: If necessary, transport to imaging station
T7: Select appropriate imaging equipment/materials
T8: Follow imaging policy
T9: Set up camera/imaging station (may need to be set up each time and disassembled, e.g. for security reasons)
T10: Set up image naming convention
T11: Extract and position specimen
T12: Pre-imaging specimen prep (blackening/place under liquid/shot of air)
T13: Adjust hardware and software (focus, etc.)
T14: Image specimen(s)
T15: Potentially take multiple images (stacking or multiple views)
T16: QC images while being shot (focus, unwanted items in frame, color and saturation balance)
T17: Retake images if necessary
T18: Stack images if necessary
T19: Archive (temporary or permanent)
T20: Batch image processing (batch editing - crop, resize, saturation, color balance, white balance, scale bar)
T21: Archive (temporary or permanent)
T22: Human image processing
T23: Create derivatives (jpgs for web; attach to db record; thumbnail catalog)
T24: Populate core metadata (process/admin/technical)
T25: Name files and associate them
T26: Store file(s)
T27: Reassociate label(s) and specimen(s)
T28: Clean specimen if necessary (after any treatments above - blackening, etc.)
T29: Re-store specimens
T30: Create verbatim data from file (OCR, etc.)
T31: Clean/verify data
T32: Create interpreted data
T33: Clean and verify data
T34: QC data and correct if necessary
T35: Archive
T36: Augment data if necessary/desired
T37: Archive

Plenary: reports back from the Breakout Groups and discussion (Notes)

Group C: Consider workflow augmentations for stratigraphic specimens that may need to include research steps.

Non-Destructive imaging is a requirement for scientific publication. Should be considered in workflow design/explanation.

Vision for the Future / Minority Reports / Out-of-the-box ideas (Notes)

Paul - robotics and engineering - not included in the workshop

Amanda -- show these to robotics now (so that they have workflows to look at)

Christopher -- be realistic -- base expectations on feedback from robotics about real capabilities (robots can’t handle curled sheets or “fuzzy” issues; if all specimens were exactly the same, with no variability, it works, but each process has its own quirks)

Andrea -- Data Management missing from the discussion

Andy -- barcodes not needed (just catalog number)

Paul M. -- why are we imaging?

        Talia -- don’t need to; one fast typist can enter from the ledger (don’t need the image)

        Andy -- but an image means more people can database at one time.

        Les Landrum - image b/c you may have a fire / explosion

        Laurie - traditional materials (ledgers) are dark / gray lit.

        Ed -- instead of pulling the specimen to check label data, look at a verbatim image of the label as a way to check its veracity

Austin -- do you print out a copy of the database (OCR Font), in case of disaster (sun solar flares)

        Andy - yes, ledger on legal paper copy

        Austin -- 10 reams of paper -- to do 10 specimens per sheet

        Linda -- any change in time -- not captured, space constraints

                decided electronic redundancies are needed / better

        Dean - often people don’t back up, or only have 3 week type back up

        Jason B. -- we’ll soon outpace our ability to back up all the data we are creating?

                        -- what about an appliance to do this for the community?

        Andy -- something NSF could invest in: infrastructure for the community across the country

        Amanda -- 100s of servers distributed across the country?

        Andy -- people have space problems already?

if NSF funded nodes -- for reciprocal, distributed data backup -- a tool needed for the NIBA Community Implementation Plan

Jim - What happens when ADBC ends in 8 years? What is the sustainability plan? How do we keep momentum?

        Louis Zachos: demand

        Ed Gilbert: enable people, tools to be able to digitize on their own

Andy: data that is digitized -- is being used -- metrics to show that data is being utilized. show that it’s useful

Andy: make sure people cite every source, every time

Jennifer: image copyright

Les Landrum: model for sustainability

Amanda: (national foundation for collections?)

Model where Users pay for data (some small amount)

Laurie: Library institutional support? What use is data to community?

Ed: Opportunity for Education / Outreach applications to show / demo usefulness

        user can create a species list on map

Plenary wrap-up discussion. DROID Working Group strategy for polishing and dissemination of workshop products. (Notes)

iDigBio Working Groups

        Gil: Working Groups by domain

                Herbarium Working Group

See list of working groups on the idigbio website: https://www.idigbio.org/wiki/index.php/IDigBio_Working_Groups

Word Bank for Common Terms

(Some suggested primary task clusters are given below)

Primary Task

Sub-Task (May be Blank)

Community Term

Specimen Imaging

Rename the Specimen Image File

Rename Specimen Image File

Label Imaging

Capture Label Image

Pre-Digitization Curation

Stage

Pre-Digitization Curation

decide what to digitize

prioritize

Pre-Digitization Curation

vet taxon names applied to specimens

check taxonomy

Pre-Digitization Curation

count specimens

Pre-Digitization Curation

sort specimens (by some trait: size, color, sex, collecting event, ...)

Sort

Pre-Digitization Curation

label specimens (with pen or paint)

Pre-Digitization Curation

barcode specimen

apply specimen GUID

Pre-Digitization Curation

Image Processing

Process Image

Image/Data Storage

GeoReferencing

GeoReference

Proofreading

Quality Control

Quality Assurance

Parking Lot - Future Action Items and Notes That Do Not Fit Elsewhere

Workshop Agenda

DROID:  Developing Robust Object-to-Image-to-Data Workflows

A Workshop on the Digitization of Biological Collections

30th - 31st May 2012

The DROID workshop is organized by Integrated Digitized Biocollections (iDigBio), a National Resource Center at the University of Florida and Florida State University, in collaboration with the Botanical Research Institute of Texas, Yale University, and the University of Kansas. The workshop is supported by the U.S. National Science Foundation’s Office of Cyberinfrastructure and Directorate for Biological Sciences, through the Scientific Software Innovation Institutes (S2I2) and Advancing Digitization of Biological Collections (ADBC) Programs.

Overview:

Biological specimens document the historical and modern occurrence of plant and animal species--and most of what we know about the diversity and distribution of life on earth. This research workshop addresses the design, documentation, and optimization of Object-to-Image-to-Data workflows for digitizing biological specimens which are curated in thousands of museum and herbarium collections worldwide.

Documenting digitization workflows begins with the recognition of differences that exist between specimen preparation types due to their physical properties and discipline-specific handling, collecting and preservation methods, curatorial and conservation practice, storage environments, data conceptualizations, and data label techniques. Digitizing data recorded on tags tied to vertebrate skins, on labels encircling snakes submerged in solutions of alcohol, on the lilliputian labels of pinned insects, and on the large, verbose labels glued on flat sheets of plant specimens, presents specific constraints and opportunities in each case for efficient digitization workflow design.

Goals of the Workshop:

  1. To illustrate and analyze a diversity of existing biological specimen digitization workflows with the aim of gaining a deeper and broader understanding of the practical logistics and efficiencies involved in the handling of biological specimens for the purpose of creating digital database records for publication and for new research applications of the biological, geospatial, and temporal information associated with specimens.
  2. To discuss and dissect the dimensions of: digitization project goal definition, the choice of project outcomes and metrics for their assessment, curatorial practice and technology application, human resource and training issues, social and professional values, and the promised deliverables which impact digitization project definition, processes, and success.  
  3. To engage in the application of lightweight business process modeling (BPM) to create and document reference workflow models for representative disciplines or specimen preservation types with the aim of enabling biological collection curators to implement efficient data capture workflows through comparative analysis and quantitative evaluation of individual workflow steps and tasks.

Workshop Objectives:

  1. To review and examine a diverse set of existing participant collections workflows as case studies, observing constraints, local optimizations, and creative solutions.
  2. To gain exposure to workflow design and implementation techniques from libraries and business.
  3. To consider how existing or proposed workflows could be enhanced or extended to gain cost efficiency, scalability, and generality (for implementing across additional collections).
  4. To identify critical constraints to digitization by discipline or preservation type which represent serious throughput bottlenecks and which may require out-of-the-box solutions and/or redefining digitization project goals or outputs.
  5. To identify opportunities for existing or new technology to address costly labor-intensive steps or processing gaps.
  6. To examine workflow goals, scope, and procedures for efficiencies of cost, staff utilization, technology, and outputs, and to propose general guidelines for evaluating workflow designs and workflow project success.
  7. To identify the synergies of collaborative digitization within TCN workgroups or across innumerable collections within a discipline.
  8. To propose near-term project design research and technology development priorities for accelerating the rate of specimen digitization and data publishing.

Desired Outcomes:

  1. Formation of a working group to collate work done at this workshop and advance the desired outcomes listed here.
  2. To identify best ways, based on existing human resources and technologies, for implementing scalable, efficient solutions for image capture and the integration of label images into data authoring workflows.
  3. To document methods for evaluating and quantifying the efficiency of workflow components and tasks, and their suitability/relevance/necessity to the core digitization project goals.
  4. To contribute to an annotated web resource illustrating common and divergent digitization tasks, issues, and constraints across disciplines/preparation types.
  5. To issue a call to action to identify resources, social and technical approaches, and hardware and software tools to bridge gaps in existing workflow end-to-end integrity.
  6. To produce a publication of Workshop findings in Collection Forum, PLoS ONE, and/or appropriate society/discipline journals.

Schedule:

Day 1, Wednesday, 30 May 2012

Time

Activity

Owner(s)

9:30AM

Welcome, overview, and brief participant introductions

Jason Grabon

Amanda Neill

9:45 AM

Workshop goals and agenda run-through

Chris Norris

Jim Beach

Deb Paul

10:00 AM

Lightning Round of workflow summaries

5 minutes and 1 slide per presenter (~18 presenters)

Participants

10:30 AM

Coffee break

Pascal’s

11:00 AM

Continuation of Lightning Round

Group discussion

Breakout group definition and self-assignment

Participants

12:30 PM

Box lunch

1:15 PM

Training session: Workflow Core Concepts (level-set practices, processes, and developing a common terminology)

Q&A Session

Laurie Taylor

Mark Sullivan

2:00 PM

Presentation: Workflow Elements and Concepts - Common Practices

Gil Nelson

3:00 PM

Coffee break

Pascal’s

3:30 PM

Presentation: Social Issues in Collaborative Digitization

Deb Paul

4:00 PM

Breakout Groups: small groups self-assigned by disciplinary interest to identify and record commonalities and divergences in:

  • Current workflow constraints
  • Workflow processes across institutions
  • Metrics - how to measure success
  • Workflow evaluation matrix

Breakout Groups & Moderators

5:45 PM

Review of evening activity and Day 2 agenda

Amanda Neill

6:00 PM

Group photo, dinner, and team building activities

7:00 PM

Dinner at Leonardo’s 706

Day 2, Thursday, 31 May 2012

Time

Activity

Owner(s)

9:30 AM

Review of Day 1, Day 2 agenda summary

Amanda Neill

9:45 AM

Breakout Group reports to the re-assembled Plenary Group

Breakout Groups

10:30 AM

Coffee break

Pascal’s

11:00 AM

Pre-Workshop Survey results and discussion

Shari Ellis

11:30 AM

Training session: Business Process Modeling

Brian Anthony

12:30 PM

Breakout Groups reconvene for box lunch and generate one or more redesigned workflows by addressing:

  • What would you change now?
  • Is a consensus workflow possible for your group?
  • Is a consensus workflow possible for a single preparation type? For a taxon?
  • What would you do to optimize these now?

Breakout Groups & Moderators

3:00 PM

Coffee break

Pascal’s

3:30 PM

Plenary: reports back from the Breakout Groups and discussion

Participants

4:30 PM

Vision for the Future. Minority Reports. Out-of-the-box ideas.

Jim Beach

5:00 PM

Plenary wrap-up discussion. DROID Working Group strategy for polishing and dissemination of workshop products.

Amanda Neill

Deb Paul

Gil Nelson

5:30 PM

Adjourn

Software Tools

Software Name

Functionality Delivered

Who is Currently Using

ZBar - http://zbar.sourceforge.net/

1D and 2D barcode reading

BRIT

OCRopus - ocropus.org

OCR, image segmentation

BRIT

GOCR - http://jocr.sourceforge.net/

OCR and 1D barcode reading

BRIT

OpenLayers - http://openlayers.org/

Large Image navigation and zooming. Image segmentation interface.

BRIT

djatoka - http://sourceforge.net/projects/djatoka/

Image server, dynamic tiling of large JPEG2000 images

BRIT

http://jesserosten.com/2010/wireless-tethering-to-ipad

overview of wifi camera tethering

PLH