Saturday, January 17, 2009

Papers, Images, Indexes and Searches

How often have you heard someone complain about Ancestry.com (or any other genealogy database) that "the Search just doesn't find my people" even though the searcher "knows" that the people are in a certain location at a certain time? I hear this all the time at my society meetings, in comments on blog posts, and on message boards.

How often is the Search engine for the database blamed? Almost always, I think. Is that really fair? I don't think so.

The reality is that there are four elements for a successful search:

1) The original record paper with the desired name was in a record set (e.g., census, military, passenger list, etc.) and available to be imaged.

2) An image of the original record paper with the desired name was made and is available, and the image was digitized for the record database.

3) The indexer who transcribed the name from the digitized image of the original paper was able to transcribe the desired name (and other entries) accurately, as it appears on the paper image.

4) The Search engine for the database was able to find the desired name in the database using the searcher's search criteria.

What if:

1) The original paper is not included in the image collection? Obviously, the searcher won't find it. Why would this happen? Were portions of a record set lost, or damaged by handling, before the paper collection was imaged? Just think of how many unreadable names are at the bottom of some census pages!

2) The image of the original paper is so poor (due to faded ink or pencil marks, soiled or torn pages, extra markings that obscure names, etc.) that the names cannot be read? Again, the searcher cannot find it because it is unreadable.

3) The indexer cannot read the name accurately from the digitized image of the original paper, or the name on the original paper was not accurately spelled by the writer of the paper? The searcher might find it using advanced search techniques.

4) The Search algorithm is so limited that it does not use wild cards, soundex/metaphone systems, or other search criteria (location, birthplace, age or birth year, keywords, etc.) to find the desired name? The searcher might find it only if the database offers some other way in, such as browsing the images page by page.
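To show what a soundex system does for a searcher, here is a minimal sketch of the standard American Soundex code in Python. It is an illustration only: the function name soundex is my own choice, and I am not claiming that this is how Ancestry.com or any other database implements its search.

```python
# Minimal sketch of standard American Soundex (illustrative only; not any
# database's actual implementation).
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code for a surname."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]                 # the first letter is kept as-is
    prev = CODES.get(letters[0], "")    # its code still suppresses a repeat
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # H and W do not separate equal codes
        code = CODES.get(ch, "")        # vowels and Y have no code
        if code and code != prev:
            result += code
        prev = code                     # a vowel resets the "previous" code
    return (result + "000")[:4]         # pad with zeros, cut to 4 characters


if __name__ == "__main__":
    for surname in ("Seaver", "Sever", "Seavor", "Thompson", "Thomson"):
        print(surname, soundex(surname))
```

In this sketch Seaver, Sever and Seavor all code to S160, so a Soundex search would catch those spelling variants. Thompson codes to T512 but Thomson codes to T525, which shows why Soundex alone is not enough and why wild cards and other search criteria still matter.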

Is it any wonder that even experienced and expert researchers cannot find persons in record databases even when they know that the person should be there?

It is evident to me that the "missing names" numbers pile up rather quickly - the share of missing names may be as high as 15% to 20% for a census records search (see my Seaver surname study here). I found that I was missing about 15% of my known Seaver families in census records, and was able to find about 33% of those missing families using advanced search techniques (i.e., 5 of the missing 15 percentage points; I never could find the other 10% of known families). Other surnames may have more or fewer problems. Other databases may have different problems due to their peculiarities.

I spent months looking for Robert Leroy Thompson (1880 TN - 1965 NC) and his family in the 1900, 1910, 1920 and 1930 censuses (see The Ultimate "Dodging the Census" Puzzle). I still haven't found him, even though the odds are really high that he was in at least one of those four censuses (about a 99.8% chance that he's in one of them if the miss rate for each census is 20% and the misses are independent of each other).
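The arithmetic behind that kind of estimate can be sketched in a few lines of Python. The sketch assumes that each census misses a person independently and at the same rate, which is a simplification; the function name chance_found is hypothetical. With a 20% miss rate, three censuses give 99.2% and four censuses give about 99.8%.

```python
# Chance that a person appears in at least one of several censuses,
# assuming (as a simplification) independent misses at the same rate.
def chance_found(miss_rate, censuses):
    """Return 1 minus the chance of being missed in every census."""
    return 1.0 - miss_rate ** censuses


if __name__ == "__main__":
    print(f"{chance_found(0.20, 3):.2%}")  # 99.20%
    print(f"{chance_found(0.20, 4):.2%}")  # 99.84%
```

If the misses are not independent (for example, a family whose name is consistently misspelled or misread), the real chance of finding them would be lower than this simple calculation suggests.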

A researcher often doesn't know what s/he doesn't know. How does s/he know that "all" of the original papers were available to be imaged? How does s/he know that "all" of the images of the original paper were digitized? How does s/he know that "all" of the entries on a set of digitized images were indexed? The answer is that "s/he doesn't know for sure." S/he has to rely on the word of the repositories that hold the original paper, the digitized images, and the indexes. And that's where quality control, at all of the steps from paper to Search engine, comes into play.

The ideal for the genealogy industry is that:

1. The record repository that has the original papers provides everything that it has to the people that image the collection, with some sort of quality assurance provision that assures all involved that all available original papers are provided.

2. The image people create digital images for every paper in the collection, even those that are badly damaged or unreadable, and even use advanced imaging techniques to bring out the best image possible. Again, some sort of quality assurance provision needs to be used to ensure that all original papers were imaged.

3. The indexing people use a double check quality assurance system that ensures the best possible index entries for every name on a record set.

4. The Search engine is versatile enough that many entries with significant problems can be found using advanced searching techniques.

My point in all of the above is this: It is not the sole fault of the Search engine or search algorithm when it cannot find your person of interest in a database. The problem is probably in one of the other categories. A large database provider like Ancestry.com, which often works from microfilms of paper records, usually does not control access to the original papers or the images of the original papers. It does control the digitizing of the images, the indexing of the records, and the search engine, of course, and should have standardized quality control procedures in place for those steps.

Has anybody else done extensive investigations into records missing from specific databases, compared different databases for the same record set, or done a surname study that identifies persons missing from the records? If so, please let me know about it, with a link to, or the source of, the study if possible.

1 comment:

Sharon said...

Randy: Thanks for your great blog!

I could not post a comment under your 2007 Leroy Thompson story, so I am posting here.

A couple of ideas for you: many of my Leroys were also known as Roy, which can also be misread as Ray. Just another way to compound your problem.

Leroy's Social Security number application may be the only record you will find where HE answered the question about his parents. His marriage record will probably not have parents' names. I note that he got his SS# in Tennessee before 1951 (probably 1937).

Gwen Thompson Nelson's obit says she was a "native of Knoxville." I found her in a 1937 Central High School yearbook in Knoxville.

Although Ancestry has many Tennessee marriages now on-line, the Knox County marriages are not included for any time period that would interest you. Some other "big city" counties (Davidson, Hamilton) are also not complete. Knox County is probably a reasonable place to look for his marriage in 1917. However, I am puzzled as to how he was married in 1917 if he was in the army from 1916 to 1918. Maybe he was home on leave?

The East Tennessee Historical Society building in Knoxville has all you could ever dream of for this area. Downstairs (library) the McClung collection includes city directories for Knoxville. Upstairs (Knox County Archives) has marriage records and much more. Both sections have very knowledgeable and helpful staff.

See their website: http://www.easttnhistory.org/

I believe you can order some records by mail, or perhaps find a researcher/volunteer to start by finding marriage and/or city directory listings. Some directories have reverse listings. If you can find an address in 1930, you may be able to find names of neighbors and locate the mystery man that way. Also ask if they have Kingsport directories.

Good luck