Wednesday, November 20, 2013

Mocavo Has Been Working on Handwriting Text Recognition

Cliff Shaw of Mocavo posted A Little Something We've Been Working On ... on the Mocavo Genealogy Blog this morning. He started his post with:

"A little over a year ago, Mocavo acquired ReadyMicro and the incredible mind known as Matt Garner. One of Matt’s lifelong passions and curiosities is to enable computers to read historical handwritten documents to bring genealogy search to the next level. It’s well known in the genealogy industry that historical handwriting recognition is the Holy Grail – the single largest technological advancement that would enable more content to become accessible online (except for maybe the invention of the Web). For the past year, we’ve joined with Matt to tackle this very hard problem, and have finally made enough progress that we can begin to report on it."

Please read the entire blog post and look carefully at the document images presented, and at how Mocavo is attacking the handwriting recognition problem.  The post also discusses the complexities of cursive handwriting and the progress they've made in transcribing it.  

James Tanner discussed this today in Is Handwriting Recognition the Holy Grail of Genealogy? on the Genealogy's Star blog.

Ah, the "Holy Grail of Genealogy" - is it really being able to transcribe historical handwritten text so that it can be indexed and/or published?  I think that is certainly one of the Holy Grails in genealogy - others might be digitizing and indexing all historical documents, having a Mother of All Genealogy Databases (MOAGD - an interconnected, source-centric, image and story heavy, family tree), etc.  

What does it mean for genealogy and family history researchers?  I think that - if they can make this work even at 90% accuracy on handwritten text - it has a chance of being a breakthrough technology.  

What are the current genealogy record sets that are not digitized or indexed at this time?  For me, they are probate records, land records, town, county and state records, court records, military pension records, manuscript collections, personal paper collections, etc.  They all reside in an archive of some sort - national, state, county, town, historical society, desk drawers, attics, etc.  Few of those archives have the fiscal, time or technology capability to digitize or index their holdings.  They would like to, but they have limited capabilities to do it.  There are exceptions, like the Family History Library, Ancestry.com, the National Archives, the New England Historic Genealogical Society, etc.  However, they are nowhere near a complete digitization of their holdings, and probably never will be with current technology and funding levels.  

If this historical handwriting text recognition and transcription can be made to work by Mocavo (or other entities), then at least the indexes of volumes of probate, land, court, vital, etc. records could be quickly digitized and indexed, and would therefore be searchable.  Having every name in a document digitized would be fantastic.  If records are searchable, at least by name, then they are findable.

As an example, I give you the will of Philip Jacob King (1764-1829), which I transcribed recently in Amanuensis Monday - Post 194: Will of Philip Jacob King (1764-1829) of York, Pennsylvania.  Here is one of the three images of this document found in the Pennsylvania Probate Records, 1683-1994 collection on FamilySearch.org:



This six-page document names all of the heirs of Philip Jacob King, including his daughter Elizabeth Spangler, the wife of Daniel Spangler (they are my third great-grandparents).  I had to plow line by line through this beautifully handwritten document in order to find the evidence that Elizabeth (King) Spangler was the daughter of Philip Jacob King.  

There are over 3.2 million images in the Pennsylvania Probate Records collection on FamilySearch - already image digitized and cataloged by County, Volume and Page, just waiting to be indexed and transcribed by a handwriting recognition and transcription process.  There are hundreds of millions more probate records like this on FamilySearch microfilm, some of which are already image digitized.  Then there are hundreds of millions of land records, court records, state, county and town records, etc.  It's a big job, and only technology is going to provide these records in an online environment.

I've always thought that once records like those are image digitized, organized, indexed and made searchable, they will resolve many of the "brick wall" problems that we all have in our genealogy research.  We all have Elizabeths and don't know their parents or siblings.  With the digitizing, text recognition, name indexing, and search technology, there is hope that the parents of many of our Elizabeths will be found in records.  

There will be many problems with the handwriting text recognition and transcription portion of this technology - handwriting styles have changed over the past 500 years, and many countries and groups have a different character set - and those present challenges.  But, over time, it can probably be made to work.

Maybe even in my lifetime!

I'm glad that Mocavo is working on this technology and applying it to genealogy and family history records.  Other genealogy companies have OCR technology, and are using it routinely for typeset documents like newspapers, books and periodicals.  I wonder if any others are attempting handwriting recognition and transcription?

If Mocavo is first to the genealogy market with this text recognition technology, will there be a market to license the technology for use by other entities?  Or some sort of agreement to share the technology in return for record access?

These are great times we genealogists live in!

The URL for the post is:  http://www.geneamusings.com/2013/11/mocavo-has-been-working-on-handwriting.html

copyright (c) 2013, Randall J. Seaver

