Wednesday, January 30, 2019

A Reader's Take on Ancestry.com Problems - Part II - Ancestry Search Engine Problems

In response to Dear Ancestry.com: Are You Fixing These Problems? (posted 17 January 2019), I received 38 comments on the post, and several via email.  

One of the email correspondents was a person who has Ancestry and IT experience, and offered knowledge, experience and wisdom for users of Ancestry.com, especially on the trees and the search engines.  The comments are detailed and ring true to me based on my experience (and that of others) working on Ancestry.com.  

In this post, I want to concentrate on the search problems with Ancestry record collections because that is where some of the problems and inconsistent results seem to occur.


My correspondent offered this TL;DR (that is "Too Long; Didn't Read" for those wondering) summary (since the detailed comments are extensive), which I appreciate:
  • There are two different search engines - Solr and the original home-grown solution (Hints, at one point, ran on a separate search system as well). Global and Category searches are powered by Solr. Collection specific searches are powered by the home-grown system. They have different relevancy models and rules, which causes results to vary as you drill down into individual collections.
  • Ancestry has moved all services to AWS (Amazon Web Services) - there are growing pains as they work with that system, but I am confident they will get it right.

A)  Two Different Search Engines, and the side effects

It is no secret that Ancestry has moved its systems to the cloud - see https://www.businesswire.com/news/home/20170608006326/en/Ancestry-All-in-AWS for an announcement from 2017, when they were in the middle of the migration. 

Part of the cloud migration plan was to get all of Search powered by Solr. They also hired several Solr experts and engineers (check some of the old job postings). Because of this plan, there was no appetite for moving the home-grown system to the cloud. Ancestry had already proven that Solr could power a search experience by using it in the Ancestry Mobile App, which searched a subset of the collections, and the Solr team was certain they could get the rest of the collections indexed and make them available.

They were right. But what they have not yet been able to do is index everything the way that the home-grown system did.

First, remember that the home-grown search engine indexed content sets into "collections." The "1940 US Census" is a self-contained collection. The "England, Select Births and Christenings" is a self-contained collection. Every item that you see in the Card Catalog is a separate collection with its own index. The home-grown search engine could query these collections individually or in sets (like all "Census Collections") or in a universal set (global search).
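To make the three query scopes concrete, here is a minimal Python sketch. Everything in it is invented for illustration (the sample records, the `category` labels, and the `search` function); it is not Ancestry's implementation, only a picture of one index per collection being queried individually, as a set, or all at once.

```python
# Invented miniature data: each collection has its own separate index.
COLLECTIONS = {
    "1940 US Census": {
        "category": "Census",
        "records": ["William Marsh", "Ann Schoonover"],
    },
    "England, Select Births and Christenings": {
        "category": "BMD",
        "records": ["William Marsh"],
    },
}


def search(surname, collections=None, category=None):
    """Search one collection, a category of collections, or everything.

    With no filters this is a "global" search over every collection index.
    """
    hits = {}
    for name, coll in COLLECTIONS.items():
        if collections is not None and name not in collections:
            continue
        if category is not None and coll["category"] != category:
            continue
        # Treat the last word of each record as the surname (a simplification).
        matches = [
            r for r in coll["records"]
            if r.split()[-1].lower() == surname.lower()
        ]
        if matches:
            hits[name] = matches
    return hits


print(search("Marsh"))                               # global search
print(search("Marsh", category="Census"))            # category search
print(search("Marsh", collections={"1940 US Census"}))  # single collection
```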

In all collections, there is a core set of data fields that are typically indexed when the data is available. First Name. Surname. Birth Date. Birth Place. There are actually several thousand defined and "normalized" fields that mean the same thing in any collection. That allows the system to efficiently search for "Schoonover" as a surname in any collection - it can map the input to a common underlying field identifier that is the same in every collection index. There is a complex but understandable set of rules that governs what any single field ID means, based (again) on a 32-bit number translated to an 8-digit hexadecimal number. If a field didn't fit into the rules, it wasn't a "normalized" field.
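As a rough illustration of the normalized-field idea, here is a small Python sketch. The field names and numeric IDs are made up for the example, since the actual values and rules are not given here; the only detail taken from the description above is that a field ID is a 32-bit number written as 8 hexadecimal digits.

```python
# Hypothetical normalized field IDs; the real values are not public.
NORMALIZED_FIELDS = {
    "first_name": 0x00010001,
    "surname": 0x00010002,
    "birth_date": 0x00020001,
    "birth_place": 0x00020002,
}


def field_id_hex(field_name: str) -> str:
    """Return the 8-digit hexadecimal form of a 32-bit field ID."""
    field_id = NORMALIZED_FIELDS[field_name]
    if not 0 <= field_id <= 0xFFFFFFFF:
        raise ValueError("field ID must fit in 32 bits")
    return f"{field_id:08X}"


def build_query(field_name: str, value: str) -> dict:
    """Map a user-facing field to the common ID shared by every collection."""
    return {"field_id": field_id_hex(field_name), "value": value}


# The same query dict could be sent to any collection index,
# because the field ID means the same thing everywhere.
print(build_query("surname", "Schoonover"))
```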

But some collections have unique data items. And the search engine allowed those unique data items to be indexed in collection-specific fields. These non-standard, non-normalized fields may be unique to a single collection. The home-grown system lets you search an individual collection and search those collection-specific fields.

The Solr implementation, however, as it was originally implemented (and they may be planning a different or supplemental implementation), did not use a collection-based index. It has a global index, sharded by record type. You define the fields up-front and index any content that fits into those fields. This works with the normalized field data. It does NOT work with the collection-specific data.
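The difference can be sketched in a few lines of Python. This is an assumption-laden toy, not Solr code or Ancestry's schema: the field names, the sample record, and the two indexing functions are all hypothetical. It only shows why a field that was not defined up front never reaches a global index, while a per-collection index can keep it.

```python
# Fields defined up front for the (hypothetical) global index.
GLOBAL_SCHEMA = {"first_name", "surname", "birth_date", "birth_place"}


def index_for_global(record: dict) -> dict:
    """Keep only fields that fit the predefined schema; the rest are dropped."""
    return {k: v for k, v in record.items() if k in GLOBAL_SCHEMA}


def index_for_collection(record: dict) -> dict:
    """A per-collection index can keep every field, including unique ones."""
    return dict(record)


# An invented record with one collection-specific field ("cemetery").
record = {"first_name": "William", "surname": "Marsh", "cemetery": "Oak Hill"}

global_doc = index_for_global(record)
collection_doc = index_for_collection(record)

print("cemetery" in global_doc)      # False: not searchable globally
print("cemetery" in collection_doc)  # True: searchable within the collection
```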

That's the background. In practice, this affects the overall search experience.

Consider Global Search and Category Search (like searching all Census collections or all BMD collections). Look at the search forms. Global Search gives you the most common search fields - those that are to be found in most collections. A category search form shows different fields than the global form. Compare two different category forms (Census vs Military) and you will see that they have different fields to search on. This was true in the home-grown search, and it is true in the Solr-based search. Global and Category-based searches used the normalized fields.

But only the home-grown search handles the collection-specific searches.

So, now we have a mixed search experience. The casual user will never know the difference. Experienced users will. But Ancestry cares more about gaining new users than retaining long-time users. There are more of the former in the market than there are of the latter.

For the casual user, global search, or search from one of the primary categories on the search menu, or a search from a profile in their tree, is all they ever do. These searches are all powered by Solr and provide a reasonably good experience.

For the experienced user, who drills down into a category and opens individual collections, there is a difference. Clicking into the Category view on the results page is a good way to discover this. Using the category drill-downs on the left of the search results will also expose this. The "Results" tab is powered by Solr. The "Categories" display is powered by Solr. The actual query that is run when you click through to a collection is powered by the home-grown search engine. And the results are different. Sometimes, the number of results goes up; sometimes, it goes down.


Side note: In looking for other examples, I came across this - William Marsh, b. 1918 in Pennsylvania, USA:  https://www.ancestry.com/search/?name=William_Marsh&birth=1918_pennsylvania-usa_41&birth_x=_1-0&count=50&name_x=psi&viewMode=category. Note the settings for the Birth Place - exact to the state. Look at the BMD collections - there are 12 results in England and Wales. Click through, and there really are 12...but none of them were born in Pennsylvania. This appears to be a bug - the exact flag isn't being set on Solr and isn't being translated when the search is sent to the home-grown system. [Note by Randy:  My results in the Categories list of collections shows 12 results, but clicking on the collection provides zero results.]

The two search engines have different rules for calculating relevancy. Given the same inputs and the same content to search, they come up with different results. Is one better than the other? Depends on if you can find the record you are looking for. But when the number of results grows or shrinks dramatically, it is a problem, as it lowers your confidence in the system.

Also, if you do a collection-specific search (go to the Card Catalog, pick a collection, and search from there), you can see many of the collection-specific fields that are available. When you do a search from these forms, you are using the home-grown search system. That same search for William Marsh in the England & Wales Civil Registration Birth index correctly returns 0 results: https://www.ancestry.com/search/collections/freebmdbirth/?name=william_marsh&birth=1918_pennsylvania-usa_41&birth_x=_1-0&count=50&name_x=psi.
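For readers who want to check that the two William Marsh searches really carry the same inputs, the short Python sketch below pulls apart the two URLs quoted in this post using the standard library. It makes no claim about what parameters such as `birth_x` or `name_x` mean internally; it only compares the raw values, ignoring letter case.

```python
from urllib.parse import parse_qs, urlsplit

GLOBAL_URL = (
    "https://www.ancestry.com/search/?name=William_Marsh"
    "&birth=1918_pennsylvania-usa_41&birth_x=_1-0&count=50"
    "&name_x=psi&viewMode=category"
)
COLLECTION_URL = (
    "https://www.ancestry.com/search/collections/freebmdbirth/"
    "?name=william_marsh&birth=1918_pennsylvania-usa_41"
    "&birth_x=_1-0&count=50&name_x=psi"
)


def search_params(url: str):
    """Return the URL path and a flat dict of its query parameters."""
    parts = urlsplit(url)
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return parts.path, params


global_path, global_params = search_params(GLOBAL_URL)
coll_path, coll_params = search_params(COLLECTION_URL)

# Parameters present in both URLs with the same value (case-insensitive).
shared = {
    key for key in global_params
    if key in coll_params
    and global_params[key].lower() == coll_params[key].lower()
}
# Parameters that appear only in the global/category URL.
only_global = set(global_params) - set(coll_params)

print(global_path, "vs", coll_path)
print("shared:", sorted(shared))
print("only in global URL:", sorted(only_global))
```

The name, birth, and exactness parameters match in both URLs; only the path and the `viewMode` setting differ, which fits the explanation that the difference in results comes from the engine handling the query rather than from the search inputs.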

Two engines. Different rules. Different results.

Ancestry needs to get the collection-specific searches moved to Solr. OR, they need to move everything back to the home-grown system. If they do the latter, they need to migrate the home-grown system to a 64-bit architecture so they can handle Public Member Trees. In either case, they need to take a hard look at how they are doing relevancy. The models built up and refined over almost 20 years are only partially observed in the Solr implementation.

B)  Randy's Notes and Opinion:

My correspondent described the search problems that all of us face when we do a Search for a person in the Global Search (which gives the "Results" list), or in the Categories Search (which lists the record collections and the number of results in each category).

Using the link for Louis Deweese above, here is the Results page:

The bottom line for me is that:

1.  A Global Search (from "Home" page or "Search" page) uses the Solr search engine (which uses only the "normalized" fields), and provides the "Results" list.

2.  A Categories Search (using the categories on the left side of the "Results" page) uses the Solr search engine (which uses only the "normalized" fields).

3.  A specific Collection Search from the "Categories" list uses the Solr search engine (I'm not sure of this)

4.  A search from a specific Collection page (i.e., selected from the Card Catalog, e.g., the Find A Grave collection) uses the Home-grown search engine (and searches all collection fields)

5.  A search from the "Search" button on an Ancestry Member Tree profile uses the Solr search engine (which uses only the "normalized" fields).

C)  My opinion is:

1)   That the only reliable way to find all of the indexed records in a specific record collection (e.g., Find A Grave) at this time is to use the Card Catalog to select the record collection and then search it for your target person.  This is because the home-grown search system has "collection-specific" indexed information in it (e.g., parents, siblings, children in the case of Find A Grave) that the Solr search engine cannot find because of the present limitations of the Solr system.  

2)  A Global, Categories or Tree Search will provide results for your person who is the subject of a specific record and has the "normalized" search fields.  

D)  See earlier posts on this general subject:

*  A Reader's Take on Ancestry.com Problems - Part I - Ancestry Member Tree Indexing (posted 23 January 2019)

=============================================

Disclosure:  I have had a paid subscription to Ancestry.com since 2000, and use the site every day.  I have received material considerations from Ancestry.com in years past, but that does not affect my objectivity in writing about their products and services.

The URL for this post is:  https://www.geneamusings.com/2019/01/a-readers-take-on-ancestrycom-problems_30.html

Copyright (c) 2019, Randall J. Seaver


3 comments:

Marian B. Wood said...

Excellent analysis and thoughtful post. Another approach I'm using is to use HeritageQuest for some of my searches (Census, city directories, etc) because those collections are targeted and therefore the searches are targeted. I'm getting good, relevant results, and it's easy to locate specific collections in the uncluttered HQ interface. Just a thought for consideration!

jennyalogy said...

Thanks Randy, that's very useful. And it reinforces what I have found, namely that going in via the card catalog can be a very effective way to search.

Linda Stufflebean said...

Thanks Randy, as that helps understand some of the quirks of Ancestry's current search engine workings. However, is it possible that some collections are missed by both? Recently, I was looking at the 1860 slave schedules for Bourbon County, KY. I was looking for the Spears, Grimes and Talbott families. No hits came up even when I went in through the card catalog. However, I found the families in the regular 1860 census using FamilySearch's version, found the district they lived in and then went back to Ancestry and searched the slave schedules page by page in the right area and all three families were enumerated with handwriting on a clear, easy to read copy.