1 of 12

Information handling in Free UK Genealogy

Where do we get our information from, what do we do with it, how do we publish it?

Richard Light

FUG Strategy Weekend, 24-25 June 2023

2 of 12

Sources of information

Need to match ‘supply’ and ‘demand’: matching both quantity of material to number/capacity of transcribers and types of source to their preferred type of challenge
Currently, mostly scans of hard-copy sources or fiche
Format: manuscript, typed, printed, computer printout
Issue of image quality for transcribers: alternative sources?
Scope for OCR on most sources – maybe not manuscript yet
Machine-readable sources would transform both our capabilities and our workflow

3 of 12

Tesseract OCR:

FreeBMD 1994 deaths

4 of 12

Capture2Text OCR:

Probate records

5 of 12

Data capture

TWYS is the overriding policy
No structured way to record ‘best guess’ variations on TWYS, or corrected versions of obvious errors in the source (e.g. incorrect Reg District names in FreeBMD)
Increasing tendency for data capture software to provide lookups/cross-checks, rather than just have the transcriber typing ‘blind’
No support for cross-project lookups – transcribers have to resort to the mailing list
FreeComETT offers the possibility of a consistent data capture framework for all projects
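
As a minimal sketch of how a ‘best guess’ or correction might be recorded alongside the TWYS reading without replacing it, consider the Python fragment below. The class and field names (`TranscribedField`, `as_seen`, `suggested`, `reason`) are illustrative assumptions, not part of any existing FUG schema, and the misspelt district name is invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TranscribedField:
    """One transcribed value, with an optional editorial suggestion."""
    as_seen: str                      # exactly what the source shows (TWYS)
    suggested: Optional[str] = None   # best guess or corrected form, if any
    reason: Optional[str] = None      # why the suggestion was made

    def search_values(self) -> list[str]:
        """Values a search index could match: TWYS first, then the suggestion."""
        values = [self.as_seen]
        if self.suggested and self.suggested != self.as_seen:
            values.append(self.suggested)
        return values


# Hypothetical example: a district name misspelt in the source index.
district = TranscribedField(
    as_seen="Wolverhamton",
    suggested="Wolverhampton",
    reason="obvious misspelling in source index",
)
print(district.search_values())
```

The TWYS value is never overwritten, so the faithful copy of the source survives, while a search could still match on the suggested form.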

Ian Brooke: I personally believe we tend to overemphasize, or perhaps mis-focus, TWYS. We concentrate on things like "what case it was in the scan" and "how many dots there are" but I feel these have no interest whatsoever to a researcher and we really need to concentrate on what they need or want. "John" means exactly the same to them as "jOhn.."

Steve Biggs:

This is a contentious issue. There are good reasons for the TWYS rule: 1. Transcribers should not spend time considering what the clerk may have meant or what errors may be in the source, as this could well lead to new errors in the transcription; 2. Names that we now think of as belonging to one gender were in some cases used for the other gender back in the 16th/17th centuries; 3. There were legitimate spelling variations of names, so there is no error in the source to correct even though it may look wrong. The bottom line is that we are creating a faithful copy of the original source to point the researcher at, so they can find the source and draw their own conclusions on potential errors.

Geoff Jarvis:

FreeCEN deals with 19C documents, from before National schools were even set up. So a lot of our spellings can be phonetic. This is difficult to match with today's standardisation. In addition, spellings can change within families even now: hyphenated names abbreviated or names hyphenated, variations in local versions of the same name, accents and dialects changing what is heard and therefore how a name is spelt, etc.

Anne Vandervord: FreeCEN does have a facility to add an alternative Birth County and/or Birth Place. The FreeCEN Gazetteer of Place Names also records alternative names for Places.

[cross-project lookups] Steve Biggs: This is possible by manually searching on FreeBMD for post-1837 events being transcribed in FreeREG but better integration would be a good thing.

6 of 12

Information management

Different systems in place for each project, both for data capture and for data checking/quality
Some information is common to all projects (e.g. forenames; surnames) or could have a shared resource (gazetteer)

7 of 12

Information retrieval

Current strategies for coping with variable (TWYS) data could (IMO) be improved
Lookups of common variations on e.g. names would offer better precision and recall than soundex/UCF (could also be used as a data capture aid)
All three sites should be brought together in a single search interface, so the researcher can search by telling us what they know, and we tell them what matches from our databases
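
The difference between a variant lookup and Soundex can be sketched in Python as follows. The Soundex function is the classic four-character algorithm; the variant table (`VARIANT_GROUPS`), the helper names and the sample surnames are assumptions made for illustration only, not FUG data or code.

```python
def soundex(name: str) -> str:
    """Classic four-character Soundex code for a name."""
    groups = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
    codes = {ch: digit for letters, digit in groups.items() for ch in letters}
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = codes.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":              # H and W do not separate repeated codes
            continue
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code                 # a vowel resets prev to ""
    return (result + "000")[:4]


# Hypothetical variant table: each set lists spellings treated as equivalent.
VARIANT_GROUPS = [{"SMITH", "SMYTH", "SMYTHE"}]


def variants_of(name: str) -> set[str]:
    """Return the variant group containing the name, or just the name itself."""
    key = name.upper()
    for group in VARIANT_GROUPS:
        if key in group:
            return group
    return {key}


def search_by_lookup(query: str, surnames: list[str]) -> list[str]:
    wanted = variants_of(query)
    return [s for s in surnames if s.upper() in wanted]


def search_by_soundex(query: str, surnames: list[str]) -> list[str]:
    code = soundex(query)
    return [s for s in surnames if soundex(s) == code]


surnames = ["Smith", "Smyth", "Sand", "Smythe"]
print(search_by_soundex("Smith", surnames))  # all four share the code S530
print(search_by_lookup("Smith", surnames))   # only the listed variants
```

Here Soundex also returns "Sand", because it shares the code S530 with "Smith", whereas the lookup returns only spellings that someone has explicitly grouped. The same table could drive suggestions at data capture time.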

Geoff Jarvis: This [using lookups rather than soundex/UCF] needs to be decided before FreeCEN starts to develop POB searching. i.e. very soon.

Geoff Jarvis: The development of POB searching in FreeCEN is a good point to develop the model for a common search methodology between the three projects. In FreeCEN this will be the bread and butter search that other specialist searches can be developed from. FreeCEN expects to commence developing POB searching in about six months. We need to establish some sort of collaboration at business analyst and programmer levels to make "one search finds all FUG records" possible.

Ian Brooke: An oft-described need is also to search on an address, as people do research the usage of houses and even streets. So I think we should not restrict ourselves to "person" here. I'm sure there are other areas of research also, maybe occupations, age at death, deaths in an area, people per house, size of family and so on.

8 of 12

Autocomplete for FreeBMD2 searches

9 of 12

Publishing/exchanging information

Importance of providing persistent URLs for retrieved records, to act as a reference source. Must resolve to the correct web page; ideally should also deliver machine-processible data (Linked Data)
Our support for end-users’ own research is pretty much non-existent. As a minimum, we should allow them to log in and save both searches and search results. Exporting search results as GEDCOM [in the absence of anything better] would facilitate transfer of their results to other genealogy web sites
Ideally, we should offer some means of recording ‘curated’ results on our own site
Reach out to free-access genealogical groups elsewhere
Develop data interchange standard and/or API to our data; encourage their wider adoption
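
To make the GEDCOM export idea concrete, here is a minimal Python sketch that turns saved search results into a GEDCOM 5.5.1-style file. The input dictionary keys, the `FUG-EXAMPLE` source identifier, the submitter name and the example.org URL are all hypothetical, and the output has not been validated against the full GEDCOM specification.

```python
def to_gedcom(people: list[dict]) -> str:
    """Render simple person records as a minimal GEDCOM 5.5.1-style file."""
    event_tags = {"birth": "BIRT", "death": "DEAT", "burial": "BURI"}
    lines = [
        "0 HEAD",
        "1 SOUR FUG-EXAMPLE",          # hypothetical source-system identifier
        "1 SUBM @U1@",
        "1 GEDC",
        "2 VERS 5.5.1",
        "2 FORM LINEAGE-LINKED",
        "1 CHAR UTF-8",
        "0 @U1@ SUBM",
        "1 NAME Example Researcher",   # placeholder submitter
    ]
    for number, person in enumerate(people, start=1):
        lines.append(f"0 @I{number}@ INDI")
        # GEDCOM marks the surname by enclosing it in slashes.
        lines.append(f"1 NAME {person['forename']} /{person['surname']}/")
        tag = event_tags.get(person.get("event", ""))
        if tag:
            lines.append(f"1 {tag}")
            if person.get("date"):
                lines.append(f"2 DATE {person['date']}")
            if person.get("place"):
                lines.append(f"2 PLAC {person['place']}")
        if person.get("url"):
            # Carry the persistent URL along so the record stays citable.
            lines.append(f"1 NOTE Source record: {person['url']}")
    lines.append("0 TRLR")
    return "\n".join(lines) + "\n"


# Invented search result, for illustration only.
results = [{
    "forename": "John", "surname": "Smith",
    "event": "death", "date": "1994", "place": "Leeds",
    "url": "https://example.org/record/123",
}]
print(to_gedcom(results))
```

Carrying the persistent URL in a note keeps the link back to the source record when the file is imported elsewhere.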

10 of 12

Software development environment

(at least!) 4 distinct platforms in use:

Perl + SQL (FreeBMD)
Ruby on Rails + MongoDB (REG, CEN, BMD2 web site)
Refinery (CMS)
PHP + CodeIgniter (FreeComETT)

Transcription support is a miscellany of platform-specific programs (mostly Windows?), many of them dated and/or without support
The platforms and products which are available reflect what volunteers have offered, or been interested in, in the past
Typically we have far too few software developers (compared, say, with the number of transcription volunteers)
Progress towards system-wide code for the new CEN, REG and BMD2 web sites (Ruby on Rails)
Use of GitHub supports the possibility of shared/distributed software development
FreeComETT offers the potential to do the same (i.e. system-wide code) for data capture (and for the subsequent checking/validation work, which currently follows different pathways)
FreeComETT also offers the prospect of device independence, and of moving towards updating procedures which are less batch-oriented and CSV-file-based

11 of 12

Rights and Licences

We recognised the need to establish clear authority for making our data Open Data; hence the transcriber agreement
Recently we have realised that similar arguments could be applied to our software

Ian Brooke: It is difficult (at least for me!) to know exactly what is being described as "our data". I'm not sure what our agreement is with the GRO who ultimately own all the data or what we are allowed to do with that data. It seems unlikely to me that transcribing the data makes it ours.

Geoff Jarvis: No. The current copyright impositions on Census images are a case in point. Until we get to a point where we have an indisputable approval by GRO in writing to use the images for transcription, we will never be able to match supply and demand as far as FreeCEN is concerned.

My memory may be failing, but I seem to remember this question being raised in the past: long ago, in the early days of FreeBMD, the project asked lawyers about the legal position on ownership of data. I don't remember the answer, but I'm sure Graham will.

12 of 12

Questions

Is there any aspect of our information handling strategy which is clearly wrong?
Are there any opportunities which the above analysis misses?
Which is the higher priority: improving the web interface for our end-users, or improving the framework(s) for capturing and checking data?
Do we have a good enough handle on the supply-and-demand issues for material to feed to our transcribers?