Thursday, October 31, 2019

Beware of Automated Computer Indexing of Newspaper Articles

Early in our genealogy and family history careers, we are told, or learn by experience, that indexes can have errors.  In the last 20 years or so, many newspaper pages have been digitized, and Optical Character Recognition (OCR) has been used to "read" those digitized pages.

This has been a tremendous boon to researchers who used to laboriously scan the printed or microfilmed newspaper pages in years past.  It often took an hour or more to find an article in a newspaper of a known date for a death notice or an obituary.

Today, for my "Seavers in the News" article, I had a great example of the problems of OCR indexing of an obituary by a computer.

Here is the record summary for an obituary found today from 1886 in Ancestry.com's collection for "Newspapers.com Obituary Index, 1800s-current":


Here is the newspaper page image with the obituary and the indexed information next to it:


There is a note at the bottom of the indexed information saying:

 "These facts were pulled from this record by a computer and may not be accurate."  

No kidding, Sherlock!  The computer indexed:

*  Name as "Oliitiarr Seaver" rather than "Anna Seaver"
*  Birth date as "1 Oct" rather than "13 Oct 1809"
*  Birth place as "North I arolina" rather than "North Carolina"
*  Residence place as "Kgypt Hottom" rather than "Egypt Bottom"
*  Death date as "Abt 1886" rather than "14 June 1886"
*  Parents as "Susanna Tanue" rather than "George and Susanna Tague"
*  Spouse as "Heury Seaver" rather than "Henry Seaver"
*  Child as "Georee W. Seaver" rather than "George W. Seaver"
*  Siblings as "Jonathan Tanue" rather than "Jonathan Tague."

I understand how this happens with OCR and computer "pulling" but I could easily see the correct spellings of the indexed entries in the obituary itself.

Fortunately, Ancestry.com provides a way to correct errors like this, so I clicked on the "Add alternate info" link and was able to edit the indexed information so that it reads:


I clicked "Done" and I was!  The corrected information should be added to the index of names in the near future for some other researcher to find.

We all need to be aware that these types of errors will occur as we search and research for names in newspaper collections.

Even with the indexing errors, searching in digitized collections is much easier these days than it was searching newsprint and/or microfilm of newspaper pages 20 years ago.

I greatly appreciate the efforts by companies like Ancestry.com to bring us newspaper pages and indexes that provide links to the pages.

I'm not complaining here - just making the point that we need to expect that errors like this will be made, and we need to be flexible in our searches if we don't get results when we use an exact name or date or place.
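
For readers who keep their own lists of indexed names, here is a minimal sketch in Python of what a "flexible" search can look like.  It uses the standard library's difflib module to score how similar two names are.  The function names and the 0.8 threshold are my own assumptions for illustration; this is not how Ancestry.com or Newspapers.com search works.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1 for two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_search(query, indexed_names, threshold=0.8):
    """Return (name, score) pairs scoring at or above the threshold, best match first.

    The threshold of 0.8 is an assumed starting point, not a recommended value.
    """
    scored = [(name, similarity(query, name)) for name in indexed_names]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Names as the computer indexed them in the obituary discussed above.
indexed = ["Oliitiarr Seaver", "Heury Seaver", "Georee W. Seaver", "Jonathan Tanue"]

for wanted in ["Henry Seaver", "Jonathan Tague", "Anna Seaver"]:
    print(wanted, "->", fuzzy_search(wanted, indexed))
```

With this list, "Henry Seaver" and "Jonathan Tague" each find their misread index entry, but "Anna Seaver" finds nothing, because "Oliitiarr Seaver" is too badly garbled to score above the threshold.  In a case like that, searching on the surname alone, or on a date or place, is the better fallback.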

                                  =============================================

Disclosure:  I have a paid subscription to Newspapers.com and have used it extensively to find articles about my ancestral and one-name families.  I have a paid subscription to Ancestry.com and use it daily in my genealogy search and research work.



Copyright (c) 2019, Randall J. Seaver


4 comments:

Diane Gould Hall said...

That’s a pretty extreme example of bad OCR for sure. Another example of why we always need to try our best to see an actual item, image, record etc., rather than rely on indexing.
Thanks Randy.

Marian said...

In a number of cases where I knew that a particular newspaper MUST have had an obituary that I needed (but the online index wasn't finding it), I have resorted to reading it page-by-page around the time of the death. Often I have succeeded, and that means that the OCR indexing isn't measuring up to what we expect.

Randall said...

Thanks for the post! Note that in addition to OCR, which tells what the words are, there is also natural language processing (NLP) going on here, which determines which words constitute a name, date or place; which names or pronouns ("he", "her") refer to the same person; which dates and places constitute a fact that goes on which person; and how the people are related to each other. That is all powerful technology and hard to get right, too. It looks like the relationships that it came up with in this record were right, which is cool.

It looks like they don't yet let you edit the relatives' names if they're wrong (the sibling is still "Georee W. Seav er", and it doesn't appear that this can be fixed by users). It also appears there isn't (yet) a way to delete relationships that are wrong or add missing ones. And it looks like there isn't yet support for facts on the relatives (such as the husband's death date that is mentioned in this article). So there is more that can be done.

That being said, when it works right, this is far better than having to wade through every page of each paper in the area to find what you're looking for. I imagine we'll see more and more of this over time, with better and better accuracy.

Brad said...

Hi Randy
Greetings from down under. Whenever I am using our newspaper archive, "TROVE" (which is quite often), I almost always correct the spelling that is shown on the pages that I am on.
All sorts of stray symbols are shown when the scanner cannot read the print properly, and reading them can put a smile on one's face. That is what digitisation does to printed pages when scanning them.
But one can at least zoom in and work out the proper letters/names so that the text can be corrected and saved so that others can find what they are looking for.
One thing I do do is make sure that the surnames and dates are fixed and recorded, thereby making it that much easier for others when searching these archives.
Brad