Monday, July 23, 2012

1940 U.S. Census Comparisons - Summary and Conclusions

I proposed a methodology for evaluating the accuracy and completeness of the 1940 United States Census indexes on Ancestry.com and FamilySearch.org in 1940 U.S. Census Index Comparisons - Post 1: Methodology. 

In 1940 U.S. Census Index Comparisons - Post 2: Carringer in California, I displayed my comparison tables and found that Ancestry had 8 more Carringer entries than FamilySearch. My judgment was that Ancestry was more correct on 7 of them, and FamilySearch on one.


In 1940 U.S. Census Index Comparisons - Post 3: Seaver in California, Part 1, Part 2 and Part 3, I displayed my comparison tables for Seaver entries in California, and found that FamilySearch had 22 more entries (147 entries) than Ancestry.com (125 entries).  I analyzed all of the Ancestry unique entries, plus the ones that disagreed but were indexed by name correctly.  My accuracy tally in these posts was: Ancestry = 13, FamilySearch = 29, Both wrong = 4, No decision = 7.


In 1940 U.S. Census Index Comparisons - Post 6: McKnew in California, I displayed my comparison tables for McKnew entries in California, and found that FamilySearch had 11 more entries (23 entries) than Ancestry.com (12 entries).  Many of the additional FamilySearch entries were because FamilySearch matched both "McKnew" and "Mc Knew" when "mcknew" was the search term, but Ancestry did not.  I think these types of names, with a prefix, should be "lumped together" because most searchers will search using the prefix.  My judgment for the accuracy tally in these posts was Ancestry = 0, FamilySearch = 12.
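The "lumping together" of prefixed names can be pictured as a normalization step applied to both the search term and the indexed surname before they are compared. The Python sketch below is only an illustration of that idea: the function names normalize_surname and same_surname are hypothetical, and nothing here describes how either site actually matches names.

```python
def normalize_surname(name: str) -> str:
    """Drop all whitespace and fold case so prefix variants compare equal."""
    return "".join(name.split()).lower()


def same_surname(search_term: str, indexed_name: str) -> bool:
    """True when two spellings differ only by spacing or capitalization."""
    return normalize_surname(search_term) == normalize_surname(indexed_name)


# "McKnew" and "Mc Knew" both reduce to "mcknew", so a search for
# "mcknew" would find either form under this (assumed) rule.
print(same_surname("mcknew", "McKnew"))
print(same_surname("mcknew", "Mc Knew"))
print(same_surname("mcknew", "McKnow"))
```

Under this rule a genuine misspelling such as "McKnow" still fails to match, so the normalization only removes the spacing and capitalization differences, not indexing errors.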

In summary then, for about 200 entries analyzed so far, the accuracy tally, based on my humble opinion and trying to be objective, is:

*  Ancestry.com = 20 (10%) (meaning FamilySearch was wrong)

*  FamilySearch = 42 (21%) (meaning Ancestry was wrong)

*  Both wrong = 4 (2%)

* No decision = 7 (3.5%)

Using the percentages, it appears that the Ancestry index was wrong 23% of the time (the 21% where FamilySearch was judged correct plus the 2% where both were wrong), and the FamilySearch index was wrong 12% of the time (10% plus 2%).
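For anyone who wants to check the arithmetic, here is a small Python sketch that adds up the per-surname tallies and derives the two error rates. It rests on two assumptions: the total of 200 is the rounded "about 200 entries" figure rather than an exact count, and the Carringer and McKnew studies had no "both wrong" or "no decision" entries, since those categories are reported only for Seaver. The names TALLIES, TOTAL_ENTRIES and summarize are made up for the example.

```python
# Tallies as reported for each surname study.
# Assumption: categories not reported for a surname are treated as zero.
TALLIES = {
    "Carringer": {"ancestry": 7, "familysearch": 1},
    "Seaver": {"ancestry": 13, "familysearch": 29, "both_wrong": 4, "no_decision": 7},
    "McKnew": {"ancestry": 0, "familysearch": 12},
}

# Assumption: "about 200 entries" is used as the denominator.
TOTAL_ENTRIES = 200


def summarize(tallies, total):
    """Sum each category over all surnames and compute index error rates."""
    totals = {"ancestry": 0, "familysearch": 0, "both_wrong": 0, "no_decision": 0}
    for counts in tallies.values():
        for category, count in counts.items():
            totals[category] += count
    return {
        "totals": totals,
        # Ancestry is wrong when FamilySearch was judged correct or both were wrong.
        "ancestry_error_rate": (totals["familysearch"] + totals["both_wrong"]) / total,
        # FamilySearch is wrong when Ancestry was judged correct or both were wrong.
        "familysearch_error_rate": (totals["ancestry"] + totals["both_wrong"]) / total,
    }


summary = summarize(TALLIES, TOTAL_ENTRIES)
print(summary["totals"])
print(f"Ancestry wrong: {summary['ancestry_error_rate']:.0%}")
print(f"FamilySearch wrong: {summary['familysearch_error_rate']:.0%}")
```

Changing TOTAL_ENTRIES, or adding another surname to TALLIES, shows how sensitive the two rates are to the size of the sample.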

Statisticians will say that the sample size is too small, and it probably is.  An expanded study that covered 2,000 entries of various surnames over all of the states might change the results.  I had hoped that many other researchers would do one or two similar studies and that we could combine the numbers, but only a few others even tried:

*  Eddie Black did a Smilie study in California, and shared it with me in email.  He found that Ancestry had 3 errors in their indexing, and that FamilySearch had 3 more entries, all of which were indexed correctly on FamilySearch and incorrectly on Ancestry. Eddie's score then would be Ancestry = 0, FamilySearch = 6.

* If others have done a study, I'll be happy to link to it here. Please make a comment to this post.

Assuming that the error rates are approximately what I calculated above, I can draw some conclusions from the study to date:

1)  Indexing census records is a difficult task because of the informant's knowledge (or lack thereof), the enumerator's attention (or lack thereof) and handwriting skills (good to poor), and indexer's attention and skills. 

2)  A 12% error rate on FamilySearch shows the imperfections of the indexing process - but it is about half of the 23% error rate on Ancestry.com.

3)  The FamilySearch process of two indexers plus an arbitrator appears to work better than whatever process Ancestry.com employs (a single indexer?  Volunteer or paid?  English speaking?  Located where?).

4)  In order to find target persons, a researcher will have to employ all available census indexes and use all of their "tricks of the search" to find the most elusive persons.

5) My recommended search strategy is to use the FamilySearch index with the target name, then use wild cards and the age, birthplace and residence filters to find the elusive targets.  If a searcher cannot find their target on FamilySearch, then they should do the same type of search on Ancestry.com.

My standard search is with exact names, but I use wild cards and the filters very early in my search efforts.  I know that many given names and surnames can be misspelled because of the problems with informants, enumerators and indexers, so I try to use wild cards for capital and lower case letters that look like other letters.  For the "Seaver" surname, it's not unusual for me to try sea*, *ver, *vers, ?eaver, se*er*, ?e*er, ?e*ers, etc.
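To see which of those patterns would catch a given spelling, the wild cards can be tried out offline. The Python sketch below uses the standard library's fnmatch module, where * stands for any run of characters and ? for exactly one. That is only an approximation of how the census sites treat wild cards, since each site has its own rules, and the test spellings are made-up examples rather than actual index entries.

```python
from fnmatch import fnmatchcase

# The wild card patterns listed above for the "Seaver" surname.
PATTERNS = ["sea*", "*ver", "*vers", "?eaver", "se*er*", "?e*er", "?e*ers"]


def matching_patterns(indexed_name, patterns=PATTERNS):
    """Return the patterns that would match a given indexed spelling."""
    name = indexed_name.lower()  # compare in lower case, like the patterns
    return [p for p in patterns if fnmatchcase(name, p)]


# Hypothetical spellings an indexer might have entered.
for spelling in ["Seaver", "Seavers", "Leaver", "Sever"]:
    print(spelling, matching_patterns(spelling))
```

A spelling that matches none of the patterns is one that a new wild card variation would be needed to find.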

The URL for this post is: http://www.geneamusings.com/2012/07/1940-us-census-comparisons-summary-and.html

Copyright(c) 2012, Randall J. Seaver

5 comments:

Jude said...

I've indexed over 22,000 names on Family Search. Sometimes, I'll be completely wrong on a name, and when I go back to the name on a second pass, I'll suddenly know that what looked like Ophiel is really Pearl. I wish I were perfect at it, but I'm at 98% accuracy, and after about 2,000 names, I got a lot better. I think that everyone who is doing Family Search's indexing, whether LDS or not (I'm not) really, really cares about getting every name right. I know that *I* feel that way. Unfortunately, perfection isn't possible.

Unknown said...

Thanks for the timely, thorough analysis of these two indexing efforts. At RootsTech an Asian company presented on indexing and I heard Ancestry.com does a subcontracted foreign single pass index. I think they might be using them, but I could be wrong. The RootsTech class was impressive though so I thought they'd do a good job. All things considered for a one pass, probably foreign index, it is really good. I'm glad the community can hopefully get insights from this--whatever those might be.

Sharon said...

Thanks for the analysis. I was a bit surprised by how different the results were between the two companies.

In time, however, this may change. Ancestry permits corrections to their 1940 census indexing by users; FamilySearch does not. So if it is wrong on FamilySearch, it will stay that way forever. If it is wrong on Ancestry, it might get corrected.

This makes it even more important for Ancestry users to submit corrections when they do locate an indexing error. You can even add alternate names if it is not truly an indexing error. For example, the census itself may have been wrong, or perhaps only initials were used instead of a full given name. So if you see something in the Ancestry index that can be improved, please submit a correction or alternate name.

Anonymous said...

This is a great comparison Randy, thank you for putting it together.
This does confirm my experience with both indexes. I recall reading that Ancestry paid a company to index for them. Not sure about onshore/offshore, but based on what I've seen there were some pretty basic mistakes.
Regarding being able to submit corrections on the Ancestry indexes perhaps Ancestry should credit my account for each entry that I fix? If they are willing to pay a company to do it, why should I fix it for free? I'm thinking a nickel per correction would be good. ;-)
Dave

Unknown said...

Thanks for your update.

Another thought – is what is written down the correct entry for that person: are the names, dates, or locations correct? There could have been communication problems between the informant and the enumerator – thus the inaccurate entries.

As a FSI indexer and arbitrator – even with several aids at my disposal – it was hard deciphering between the ‘a’ and ‘o’, or ‘e’ and ‘I’. FSI guidelines were to TWYS (type what you see) – not being influenced with the 1930 Census or other sources.

I know record indexing accuracy is important, but my main excitement will be when I find my family records – with exact spelling or with other filtering aids. I will provide my searching results when Illinois becomes searchable on FamilySearch or Ancestry.