01-03-2019, 19:12   #1
VirginiaB
Registered User
 
Join Date: Aug 2015
Posts: 199
How are they doing it?

I have been following the new records on FindMyPast and am mystified as to how they are adding so many records so fast. Does anyone know anything about this? Is it computer-aided? I am just very curious as to how it's done.

I finally subscribed because of the NY Archdiocese baptisms and marriages and have solved many mysteries. Now today Liverpool Catholic records. It's hard to keep up with, a good problem to have.
01-03-2019, 20:34   #2
pinkypinky
Moderator
 
Join Date: Aug 2007
Posts: 4,677
I presume they outsource the scanning to a specialist company and then the transcription to somewhere in Asia and then get in a team of area-specific experts at the end.
01-03-2019, 20:39   #3
VirginiaB
Registered User
 
Join Date: Aug 2015
Posts: 199
Hmmm. I'm not so sure about the transcriptions in Asia. The transcriptions aren't perfect but they are pretty good considering how difficult the old handwriting is, as we all know. And whoever is doing it seems familiar with Irish names. Of course, I could be completely wrong. Not for the first time.
01-03-2019, 23:50   #4
pinkypinky
Moderator
 
Join Date: Aug 2007
Posts: 4,677
The censuses were transcribed first by Canadians and then by Indians. Research conducted by Sean Murphy (ex-UCD lecturer) showed the Indian error rate was lower than the Canadians'.
07-03-2019, 12:54   #5
pedroeibar1
Registered User
 
Join Date: Jan 2010
Posts: 4,775
Quote:
Originally Posted by VirginiaB
I have been following the new records on FindMyPast and am mystified as to how they are adding so many records so fast. Does anyone know anything about this? Is it computer-aided? I am just very curious as to how it's done.
That’s a very interesting topic. I don’t know if any of the genealogy companies are using machine reading for their data, but it has been around for decades and is IMO the way ahead for genealogy records. Due to cost, though, it remains some time off for most records.

Typed text is easier to machine-read, but there is still a high error rate (c. 20%) on older print due to typeface and smudging (look, for example, at the machine transcriptions of Australian newspapers on the TROVE site), so manual input and corrections are still required.
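
To put a number on an "error rate" like that, one common approach is to compare the machine output against a hand-checked transcription and count the single-character edits needed to turn one into the other. The sketch below is only an illustration of that idea; the function names and the sample strings are my own, and I don't know how TROVE or any of the genealogy companies actually measure their figures.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def char_error_rate(truth: str, ocr: str) -> float:
    """Edits needed per character of the hand-checked text."""
    if not truth:
        raise ValueError("the reference transcription must not be empty")
    return edit_distance(truth, ocr) / len(truth)


# Hypothetical example: one wrong letter in an eight-letter surname.
print(char_error_rate("Hennessy", "Fennessy"))  # 0.125
```

On that measure a 20% rate means roughly one character in five needs fixing, which is why a human pass is still needed.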

If handwritten ledgers are to be computerised and made searchable, they first have to be scanned, and then an optical character recognition (OCR) system has to be built to recognise images of every individual character. That is a costly process, so the feasibility of machine transcription generally requires a cost/benefit analysis and depends on the format, quantity, quality and value (potential income) of the data involved. Sometimes it is easier or cheaper to have the transcription done manually. At worst the data can be made available merely as scanned images, but these cannot be searched, as is the case with several Boston parish records. It’s the same as scrolling through microfilm in the National Library, only from home, like browsing the Irish parish records without access to the FMP search function.

Among the leaders in providing raw data for building an optical recognition system is the US National Institute of Standards and Technology (NIST), which has collated handwriting from thousands of writers and has roughly 800,000 character images on file (the source of the ‘EMNIST dataset’) on which to build a program. I’m familiar only with the work done on the Roman alphabet, but other alphabets are now available.

Machine learning can correct ‘typos’ using intelligent character recognition (ICR) algorithms, similar to the way predictive text works on a mobile phone. Usually OCR and ICR are combined, but errors will still occur, e.g. there would be confusion between Hennessy & Fennessy, Looney & Lunney, Dunn & Gunn, O’Leary & O’Cleary, etc. Adding Soundex to the mix would help with some of these.
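
For anyone curious what Soundex does with the pairs above, here is a minimal sketch of the standard American Soundex rules (keep the first letter, code the following consonants as digits, pad to four characters). The function name is my own, and this is not a claim about how FMP or anyone else indexes names.

```python
# Digit codes for consonants; vowels, Y, H and W have no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Four-character American Soundex code, e.g. 'Robert' -> 'R163'."""
    letters = [c for c in name.upper() if c.isalpha()]  # drops apostrophes etc.
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


for a, b in [("Hennessy", "Fennessy"), ("Looney", "Lunney"),
             ("Dunn", "Gunn"), ("O'Leary", "O'Cleary")]:
    print(a, soundex(a), b, soundex(b))
# Hennessy H520 Fennessy F520
# Looney L500 Lunney L500
# Dunn D500 Gunn G500
# O'Leary O460 O'Cleary O246
```

Because Soundex always keeps the first letter, it only matches Looney with Lunney here. A misread initial (Hennessy/Fennessy, Dunn/Gunn) or an extra consonant (O’Cleary) still produces a different code, so those cases would need something else, such as comparing the codes with the first letter ignored.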

Typical examples of very basic ICR/OCR are already in use: bank lodgement ATMs and vehicle registration recognition at car parks and on motorways. (But note that numbers are far easier to read than some old parish priest’s handwriting!) The system would also have to cope with Latin words interspersed with English, which is not easy when a high percentage of English words have Latin roots.

The easiest is ‘migration’, the simple transfer of data from machine to machine, but even that can be complex. In genealogy, for example, Ancestry, FMP and FamilySearch most likely have different system architectures, which would complicate machine transfer, and writing a program to facilitate it could be impracticable (time and cost) rather than impossible. That of course might change by the time the existing store of machine data becomes available under GDPR.
07-03-2019, 14:38   #6
VirginiaB
Registered User
 
Join Date: Aug 2015
Posts: 199
Wow--thanks for that very interesting and informative reply, Pedroeibar1. Pretty amazing stuff. I'll be saving your info. Thanks again.