Tuesday, January 17, 2023

Blast From the Past - Papers, Images, Indexes and Searches

I wrote this 14 years ago, on 17 January 2009, after the visit to Ancestry.com. My guess is that I used information from that meeting for this blog post. Indexing, imaging and search technology have come a long way in 14 years, and there are many more online record collections for us to search for elusive ancestors, but... this is still pretty much true!

========================================

How often have you heard someone complain about Ancestry.com (or any other genealogy database) that "the Search just doesn't find my people" even though the searcher "knows" that the people are in a certain location at a certain time? I hear this all the time at my society meetings, and in comments on blog posts and in message boards.

How often is the Search engine for the database blamed? Almost always, I think. Is that really fair? I don't think so.

The reality is that there are four elements for a successful search:

1) The original record paper with the desired name was in a record set (e.g., census, military, passenger list, etc.) and available to be imaged.

2) An image of the original record paper with the desired name was made and is available, and the image was digitized for the record database.

3) The indexer who transcribed the name from the digitized image of the original paper was able to accurately transcribe the desired name (and other entries) as it appears on the paper image.

4) The Search engine for the database was able to find the desired name in the database using the searcher's search criteria.

What if:

1) The original paper is not included in the image collection? Obviously, the searcher won't find it. Why would this happen? Were portions of a record set lost, or damaged by handling, before the paper collection was imaged? Just think of how many unreadable names are at the bottom of some census pages!

2) The image of the original paper is so poor (due to faded ink or pencil marks, soiled or torn pages, extra markings that obscure names, etc.) that the names cannot be read. Again, the searcher cannot find it because it is unreadable.

3) The indexer cannot read the name accurately from the digitized image of the original paper, or the name on the original paper was not accurately spelled by the writer of the paper. The searcher might find it using advanced search techniques.

4) The Search algorithm is so limited that it does not use wild cards, Soundex/Metaphone systems, or other search criteria (location, birthplace, age or birth year, keywords, etc.) to find the desired name. The searcher might find it using advanced search techniques.
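To make the "advanced search techniques" in items 3 and 4 concrete, here is a minimal sketch in Python of two of them: a Soundex match and a wild card match. It is an illustration under my own assumptions, not how Ancestry.com or any other provider implements its search. The function names (`soundex`, `search`) and the sample index entries are hypothetical, and the Soundex rules are the common American Soundex variant.

```python
from fnmatch import fnmatchcase

# American Soundex digit groups (letters not listed, such as vowels, have no digit).
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Return the 4-character American Soundex code for a surname."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate consonants with the same digit
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit  # a vowel resets prev, so the same digit may repeat after it
    return (code + "000")[:4]


def search(entries, query, mode="exact"):
    """Search index entries by exact name, Soundex code, or wild card pattern."""
    if mode == "exact":
        return [e for e in entries if e.lower() == query.lower()]
    if mode == "soundex":
        return [e for e in entries if soundex(e) == soundex(query)]
    if mode == "wildcard":  # '*' matches any run of letters, '?' exactly one
        return [e for e in entries if fnmatchcase(e.upper(), query.upper())]
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical index entries, including some plausible mis-transcriptions.
index = ["Seaver", "Sever", "Seavor", "Seever", "Leaver", "Seaven"]

print(search(index, "Seaver"))                   # exact spelling only
print(search(index, "Seaver", mode="soundex"))   # sound-alike spellings
print(search(index, "?eaver", mode="wildcard"))  # misread first letter
```

In this made-up index, the exact search returns only "Seaver", the Soundex search also picks up "Sever", "Seavor" and "Seever", and the wild card pattern catches "Leaver", where the first letter was misread. None of the three finds "Seaven" from a search for "Seaver", which is the kind of entry that stays hidden.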

Is it any wonder that even experienced and expert researchers cannot find persons in record databases even when they know that the person should be there?

It is evident to me that the "missing names" numbers pile up rather quickly - they may be as high as 15% to 20% missing names for a census records search (see my Seaver surname study here). I found that I was missing about 15% of my known Seaver families in census records, and was able to find about 33% of the missing families using advanced search techniques (i.e., I recovered about 5% of all known families, and I never could find the other 10%). Other surnames may have more or fewer problems. Other databases may have different problems due to their peculiarities.
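The percentages in that Seaver tally are easy to mix up, so here is a small worked check in Python. It is only a sketch: the variable names are mine, and the inputs are the rounded figures quoted above, expressed as shares of all known families.

```python
from fractions import Fraction

# Rounded figures quoted in the post, as shares of all known Seaver families.
missing_at_first = Fraction(15, 100)         # not found by a basic search
recovered_share_of_missing = Fraction(1, 3)  # "about 33%" of the missing families

recovered = missing_at_first * recovered_share_of_missing
never_found = missing_at_first - recovered


def as_percent(share):
    """Format a share of all known families as a whole-number percentage."""
    return f"{float(share * 100):.0f}%"


print("Recovered with advanced searches:", as_percent(recovered))
print("Never found:", as_percent(never_found))
```

So one third of the missing 15% is 5% of all known families, which leaves 10% of all known families unfound.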

I spent months looking for Robert Leroy Thompson (1880 TN - 1965 NC) and his family in the 1900, 1910, 1920 and 1930 censuses (see The Ultimate "Dodging the Census" Puzzle). I still haven't found him, even though the odds are really high that he was in at least one of those four censuses (about a 99.8% chance that he's in at least one of them if the miss rate for each census is 20% and the misses are independent of each other).
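The odds in that parenthesis can be checked with a few lines of Python. This sketch assumes that each census misses a person independently of the others and with the same miss rate, which is a simplification (a family that dodged one enumerator may well have dodged the next). The function name is my own.

```python
def chance_in_at_least_one(miss_rate, censuses):
    """Chance of appearing in at least one census, assuming independent misses."""
    # The person is absent from all of them with probability miss_rate ** censuses.
    return 1.0 - miss_rate ** censuses


for n in (1, 2, 3, 4):
    print(n, "censuses:", f"{chance_in_at_least_one(0.20, n):.2%}")
```

With a 20% miss rate, three censuses give 99.20% and four censuses give 99.84%.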

A researcher often doesn't know what s/he doesn't know. How does s/he know that "all" of the original papers were available to be imaged? How does s/he know that "all" of the images of the original paper were digitized? How does s/he know that "all" of the entries on a set of digitized images were indexed? The answer is that "s/he doesn't know for sure." S/he has to rely on the word of the repositories that hold the original paper, the digitized images, and the indexes. And that's where quality control, at all of the steps from paper to Search engine, comes into play.

The ideal for the genealogy industry is that:

1. The record repository that has the original papers provides everything that it has to the people that image the collection, with some sort of quality assurance provision that assures all involved that all available original papers are provided.

2. The image people create digital images for every paper in the collection, even those that are badly damaged or unreadable, and even use advanced imaging techniques to bring out the best image possible. Again, some sort of quality assurance provision needs to be used to ensure that all original papers were imaged.

3. The indexing people use a double-check quality assurance system that ensures the best possible index entries for every name on a record set.

4. The Search engine is versatile enough that many entries with significant problems can be found using advanced searching techniques.
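One way to picture the "double check" in item 3 is double keying: two indexers transcribe the same page independently, and any line where they disagree goes to a third person for review. The Python sketch below assumes that interpretation. I am not claiming that any particular provider works this way, and the function name and sample data are hypothetical.

```python
def compare_keyings(first_pass, second_pass):
    """Compare two independent transcriptions keyed by line number.

    Returns (agreed, disputed): agreed maps line -> name, and disputed maps
    line -> (first reading, second reading) for a reviewer to arbitrate.
    """
    agreed, disputed = {}, {}
    for line in sorted(set(first_pass) | set(second_pass)):
        a = first_pass.get(line)
        b = second_pass.get(line)
        # Readings agree if they match after ignoring case and outer spaces.
        if a is not None and b is not None and a.strip().lower() == b.strip().lower():
            agreed[line] = a.strip()
        else:
            disputed[line] = (a, b)  # includes lines that one indexer skipped
    return agreed, disputed


# Hypothetical readings of three lines on one census page.
indexer_a = {1: "Seaver", 2: "Thompson", 3: "Leaver"}
indexer_b = {1: "seaver ", 2: "Thomson"}

agreed, disputed = compare_keyings(indexer_a, indexer_b)
print("Agreed:", agreed)
print("Needs review:", disputed)
```

Here line 1 is accepted, while line 2 (two different spellings) and line 3 (skipped by the second indexer) are flagged for review instead of going straight into the index.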

My point in all of the above is this: It is not the sole fault of the Search engine or search algorithm when it cannot find your person of interest in a database. The problem is probably in one of the other categories. A large database provider like Ancestry.com, which often works from microfilms of paper records, usually does not control access to the original papers or the images of the original papers. It does control the digitizing of the images, the indexing of the records and the search engine, of course, and should have standardized quality control procedures in place for those steps.

Has anybody else done extensive investigations into records missing from specific databases, compared different databases for the same record set, or done a surname study that identifies persons missing from the records? If so, please let me know about it, with a link to, or the source of, the study if possible.

==============================================

Frankly, I think the number of digitized image and/or indexed online records is genea-mazing! There are billions more pieces of genealogy-related paper in archives and libraries to be digitized. I hope that the record providers keep finding, digitizing, and indexing them for researchers to find.

I would add that the database and indexed record should be sourced using Evidence Explained (or similar) standards. FamilySearch and Find A Grave are the only record providers that do a good job on sourcing - not Ancestry, MyHeritage, Findmypast, and many others.


Copyright (c) 2023, Randall J. Seaver

Please comment on this post on the website by clicking the URL above and then the "Comments" link at the bottom of each post.  Share it on Twitter, Facebook, or Pinterest using the icons below.  Or contact me by email at randy.seaver@gmail.com.
