Originally Posted by VirginiaB
I have been following the new records on FindMyPast and am mystified as to how they are adding so many records so fast. Does anyone know anything about this? Is it computer-aided? I am just very curious as to how it's done.
That’s a very interesting topic. I don’t know if any of the geno companies are using machine reading for data, but the technology has been around for decades and is IMO the way ahead for genealogy records; due to cost, though, it remains some time off for most records.
Typed material is easier to machine-read, but there is still a high error rate (c. 20%) on older print due to typeface and smudging – look, for example, at the machine transcriptions of Australian newspapers on the TROVE site – so manual input / correction is still required.
If handwritten ledgers are to be computerised and made searchable, they first have to be scanned and then an optical character recognition (OCR) system built to recognise images of every individual character. That is a costly process, so the feasibility of a machine transcription generally requires a cost/benefit analysis and depends on the format, quantity, quality and value (potential income) of the data involved. Sometimes it is easier/cheaper to have the transcription done manually. At worst the data can be made available merely as scanned images, but those cannot be searched – as is the case with several Boston parish records. It’s the same as scrolling through microfilm in the National Library but doing it from home, like the parish records in Ireland without access to the FMP search function.
Among the leaders in providing raw data for building an optical recognition system is the US National Institute of Standards and Technology (NIST), which has collated handwriting from thousands of writers and has some 800,000 character images on file (the ‘EMNIST dataset’) on which to build a program. I’m familiar only with the work done on the Roman alphabet, but other alphabets are now available.
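For anyone curious about the nuts and bolts: EMNIST is distributed in the same binary ‘idx’ image format as the older MNIST dataset, which any program can read with nothing but the standard library. A minimal sketch of parsing that format (header layout as documented for idx3 image files – this is just the file-reading step, not the recognition itself):

```python
import struct

def parse_idx3_images(data: bytes):
    """Split an idx3-format image file into raw per-image byte strings."""
    # Header: magic number 0x00000803, image count, rows, cols (big-endian)
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != 0x00000803:
        raise ValueError("not an idx3 image file")
    size = rows * cols  # pixels per image, stored as one byte each
    return [data[16 + i * size : 16 + (i + 1) * size] for i in range(count)]
```

Each returned byte string is one character image, ready to feed into whatever recognition model is being trained.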
Machine learning can correct ‘typos’ using intelligent character recognition (ICR) algorithms, similar to the way predictive text works on a mobile phone. Usually OCR and ICR are combined, but errors will still occur – e.g. there would be confusion between Hennessy & Fennessy, Looney & Lunney, Dunn & Gunn, O’Leary & O’Cleary, etc. Adding Soundex to the mix would help with some of these, though since Soundex keeps the first letter as-is, pairs like Hennessy/Fennessy would still come out with different codes.
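For what it’s worth, Soundex is simple enough to sketch in a few lines of Python (this is the standard American Soundex rules, not any company’s actual implementation). It shows why it catches some of the pairs above but not others: Looney and Lunney both key to L500, while Hennessy and Fennessy come out as H520 and F520 because the first letter is kept verbatim.

```python
# A sketch of standard American Soundex (pure Python, no libraries).
def soundex(name: str) -> str:
    codes = {}
    for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in group:
            codes[ch] = digit
    letters = [c for c in name.lower() if c.isalpha()]  # drops apostrophes
    if not letters:
        return ""
    result = letters[0].upper()          # first letter is kept verbatim
    prev = codes.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":                    # h and w are 'transparent'
            continue
        if c in "aeiouy":                # vowels separate duplicate codes
            prev = ""
            continue
        code = codes[c]
        if code != prev:                 # collapse adjacent identical codes
            result += code
        prev = code
    return (result + "000")[:4]          # pad/truncate to letter + 3 digits
```

Trying it on the pairs above: Looney and Lunney both give L500, and Dunn/Gunn give D500/G500 – so Soundex matches the first pair but not the second, and a table of known spelling variants would still be needed alongside it.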
Typical examples of very basic ICR/OCR are already in use – bank lodgement ATMs and vehicle registration identification at car parks and on motorways. (But note that numbers are far easier to read than some old parish priest’s handwriting!) The system would also have to cope with Latin words interspersed with English – not easy when such a high percentage of English words have Latin roots.
The easiest route is ‘migration’ – a simple transfer of data from machine to machine – but even that can be complex. In genealogy, for example, Ancestry / FMP / FamilySearch most likely have different system architectures, which would complicate machine transfer, and writing a program to facilitate it could be impracticable (time & cost) rather than impossible. That of course might change by the time the existing store of machine data becomes available under GDPR.
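The mechanics of such a migration are mostly schema mapping – renaming fields and converting formats record by record. A hypothetical sketch (every field name and date format here is invented for illustration; none of these are the companies’ real schemas):

```python
from datetime import datetime

def migrate_record(src: dict) -> dict:
    """Map a baptism record from an invented 'System A' schema to 'System B'."""
    field_map = {"forename": "given_name",   # all names hypothetical
                 "surname": "family_name",
                 "parish": "place"}
    out = {field_map[k]: v for k, v in src.items() if k in field_map}
    # Pretend System A stores dates as DD/MM/YYYY while System B wants ISO 8601
    out["event_date"] = datetime.strptime(src["bapt_date"], "%d/%m/%Y").date().isoformat()
    return out

record_a = {"forename": "Mary", "surname": "Hennessy",
            "parish": "Kilmallock", "bapt_date": "03/11/1851"}
record_b = migrate_record(record_a)
```

Trivial for one record – the real cost is in cataloguing every field, format and quirk across millions of records, which is where the time & cost objection bites.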