1 of 12

Information handling in Free UK Genealogy

Where do we get our information from, what do we do with it, how do we publish it?

Richard Light

FUG Strategy Weekend, 24-25 June 2023

2 of 12

Sources of information

    • Need to match ‘supply’ and ‘demand’: both the quantity of material to the number/capacity of transcribers, and the types of source to their preferred type of challenge
    • Currently, mostly scans of hard-copy sources or fiche
    • Format: manuscript, typed, printed, computer printout
    • Issue of image quality for transcribers: alternative sources?
    • Scope for OCR on most sources – maybe not manuscript yet
    • Machine-readable sources would transform both our capabilities and our workflow

3 of 12

Tesseract OCR:

FreeBMD 1994 deaths

4 of 12

Capture2Text OCR:

Probate records

5 of 12

Data capture

  • TWYS (Type What You See) is the overriding policy
  • No structured way to record ‘best guess’ variations on TWYS, or corrected versions of obvious errors in the source (e.g. incorrect Reg District names in FreeBMD)
  • Increasing tendency for data capture software to provide lookups/cross-checks, rather than just have the transcriber typing ‘blind’
  • No support for cross-project lookups – transcribers have to resort to the mailing list
  • FreeComETT offers possibility of a consistent data capture framework for all projects
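
One way to picture a structured record of ‘best guess’ and corrected values alongside TWYS is sketched below. This is an illustration only, not an existing FUG data model: the class name, field names and the misspelt district are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TranscribedField:
    """Hypothetical field that keeps the TWYS value separate from any interpretation."""
    as_seen: str                      # exactly what the source shows (TWYS)
    best_guess: Optional[str] = None  # transcriber's reading where the source is unclear
    corrected: Optional[str] = None   # correction of an obvious error in the source

    def search_values(self) -> List[str]:
        """Return every distinct value a search could match, as-seen value first."""
        values = [self.as_seen]
        for extra in (self.best_guess, self.corrected):
            if extra and extra not in values:
                values.append(extra)
        return values


# Invented example: a misspelt Registration District name in the source.
district = TranscribedField(as_seen="Wolverhamton", corrected="Wolverhampton")
```

The as-seen value is never overwritten, so the TWYS policy is preserved while searches can still use the correction.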

6 of 12

Information management

  • Different systems in place for each project, both for data capture and for data checking/quality
  • Some information is common to all projects (e.g. forenames; surnames) or could have a shared resource (gazetteer)

7 of 12

Information retrieval

  • Current strategies for coping with variable (TWYS) data could (IMO) be improved
  • Lookups of common variations (e.g. on names) would offer better precision and recall than Soundex/UCF, and could also be used as a data capture aid
  • All three sites should be brought together in a single search interface, so the researcher can search by telling us what they know, and we tell them what matches from our databases
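
The precision and recall point about variant lookups can be illustrated with a small sketch. It compares standard American Soundex with a lookup table of known variants. The variant groups and function names are invented for illustration and do not come from any FUG database.

```python
# Standard American Soundex digit for each coded consonant.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the four-character Soundex code for a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code is kept
    return (result + "000")[:4]


# Hypothetical variant groups; a real table would be curated from the data.
VARIANT_GROUPS = [
    {"elizabeth", "eliza", "betsy", "lizzie"},
    {"smith", "smyth", "smythe"},
]


def variant_match(a, b):
    """True if the two names are equal or listed in the same variant group."""
    a, b = a.lower(), b.lower()
    return a == b or any(a in group and b in group for group in VARIANT_GROUPS)
```

Here Soundex gives Smith and Sneed the same code (a precision problem) and gives Elizabeth and Betsy different codes (a recall problem), whereas the lookup table handles both cases as a researcher would expect.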

8 of 12

Autocomplete for FreeBMD2 searches
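
The slide does not say how the autocomplete is implemented. As a rough sketch only, prefix completion over a sorted list of names could look like the following; the function name, the sample names and the result limit are assumptions.

```python
import bisect


def autocomplete(sorted_names, prefix, limit=5):
    """Return up to `limit` names from a sorted, lower-case list that start with prefix."""
    prefix = prefix.lower()
    start = bisect.bisect_left(sorted_names, prefix)  # first name >= prefix
    matches = []
    for name in sorted_names[start:]:
        if not name.startswith(prefix) or len(matches) == limit:
            break
        matches.append(name)
    return matches


# Invented sample data; a real list would come from the search index.
NAMES = sorted(n.lower() for n in ["Smith", "Smyth", "Smithson", "Smart", "Jones"])
suggestions = autocomplete(NAMES, "Smi")
```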

9 of 12

Publishing/exchanging information

    • Importance of providing persistent URLs for retrieved records, to act as a reference source. Must resolve to the correct web page; ideally should also deliver machine-processible data (Linked Data)
    • Our support for end-users’ own research is pretty much non-existent. As a minimum, we should allow them to log in and save both searches and search results. Exporting search results as GEDCOM [in the absence of anything better] would facilitate transfer of their results to other genealogy web sites
    • Ideally, we should offer some means of recording ‘curated’ results on our own site
    • Reach out to free-access genealogical groups elsewhere
    • Develop data interchange standard and/or API to our data; encourage their wider adoption
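
To make the GEDCOM export idea concrete, here is a minimal sketch that turns saved search results into GEDCOM-style text. The input field names (`forenames`, `surname`, `event`, `year`, `district`) and the sample record are hypothetical, and the header omits records that a strictly valid GEDCOM 5.5.1 file requires (such as the source system and submitter), so treat it as an outline rather than a finished exporter.

```python
# Only individual-level events are handled; marriages belong on family
# records in GEDCOM and are left out of this sketch.
EVENT_TAGS = {"birth": "BIRT", "death": "DEAT"}


def to_gedcom(results):
    """Build GEDCOM-style text with one INDI record per saved search result."""
    lines = ["0 HEAD", "1 GEDC", "2 VERS 5.5.1", "2 FORM LINEAGE-LINKED",
             "1 CHAR UTF-8"]
    for number, rec in enumerate(results, start=1):
        lines.append(f"0 @I{number}@ INDI")
        lines.append(f"1 NAME {rec['forenames']} /{rec['surname']}/")
        lines.append(f"1 {EVENT_TAGS[rec['event']]}")
        lines.append(f"2 DATE {rec['year']}")
        lines.append(f"2 PLAC {rec['district']}")
    lines.append("0 TRLR")
    return "\n".join(lines) + "\n"


# Invented example record, not real data.
saved_results = [{"forenames": "John", "surname": "Smith", "event": "birth",
                  "year": 1920, "district": "Leeds"}]
gedcom_text = to_gedcom(saved_results)
```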

10 of 12

Software development environment

  • (at least!) 4 distinct platforms in use:
    • Perl + SQL (FreeBMD)
    • Ruby on Rails + MongoDB (REG, CEN, BMD2 web site)
    • Refinery (CMS)
    • PHP + CodeIgniter (FreeComETT)
  • Transcription support is a miscellany of platform-specific programs (mostly Windows?), many of them dated and/or without support
  • The platforms and products which are available reflect what volunteers have offered, or been interested in, in the past
  • Typically we have far too few software developers (compared, say, with the number of transcription volunteers)
  • Progress towards system-wide code for the new CEN, REG and BMD2 web sites (Ruby on Rails)
  • Use of GitHub supports the possibility of shared/distributed software development
  • FreeComETT offers the potential to do the same (i.e. system-wide code) for data capture (and for the subsequent checking/validation work, which currently follows different pathways)
  • FreeComETT also offers the prospect of device independence, and of moving towards updating procedures which are less batch-oriented and CSV-file-based

11 of 12

Rights and Licences

  • We recognised the need to establish clear authority for making our data Open Data; hence the transcriber agreement
  • Recently we have realised that similar arguments could be applied to our software

12 of 12

Questions

  • Is there any aspect of our information handling strategy which is clearly wrong?
  • Are there any opportunities which the above analysis misses?
  • Which is the higher priority: improving the web interface for our end-users, or improving the framework(s) for capturing and checking data?
  • Do we have a good enough handle on the supply-and-demand issues for material to feed to our transcribers?