Wednesday, January 23, 2019

A Reader's Take on Ancestry.com Problems - Part I - Ancestry Member Tree Indexing

In response to Dear Ancestry.com: Are You Fixing These Problems? (posted 17 January 2019), I received 33 comments on the post, and several via email.  

One of the email correspondents has Ancestry and IT experience, and offered knowledge, experience and wisdom for users of Ancestry.com, especially about the trees and the search engines.  The comments are detailed and ring true to me based on my experience (and that of others) working on Ancestry.com.

In this post, I want to concentrate on the indexing problems with Ancestry Member Trees because that seems to be where some of the problems and inconsistent results occur.

My correspondent offered this TL;DR summary (since the detailed comments are extensive), which I appreciate:

  • "Trees are not indexed regularly because the process is time-consuming, somewhat brittle, and running on 'old code.'
  • Not all tree profiles are indexed because the old system can only handle 4 billion indexed profiles."

1)  Ancestry Member Trees are not indexed regularly.

"The Ancestry Member Trees are huge. Between the private and public member trees, there are over 6 billion records. [Several] years ago, Public Member Trees was approaching 4 billion profiles (records) and was adding 100-150 million items each month. Getting these profiles into the search index was a multi-week, multi-team effort. If everything went flawlessly after the extract from the trees system, the indexing process would take 2-4 days. If something went wrong, the indexing process had to start over. Because of the resources required, both computing capacity and people to monitor the process, a monthly update was all the team would commit to back in the day. And if the index failed, sometimes the process had to wait until the next month to allow other collections through the production pipeline."

2)  Ancestry Member Tree profiles without sources are not indexed.

"As mentioned, the Ancestry trees [several] years ago were approaching 4 billion profiles. By now, they have probably exceeded the number [Note by Randy: the corporate information page claims 6 billion now]. Unfortunately, the indexing system for Member Trees is based on a 32-bit architecture. Every record in a collection must have a unique ID. In a 32-bit system, that means that there are only 4,294,967,296 total records allowed. We were running out of space.

"The obvious solution is to move the search engine to a 64-bit architecture. That allows 18,446,744,073,709,551,616 records in an index, and is what current software / data architectures are built on. But that requires significant investment in development - not just for the search engine, but also for the tools that feed the engine. Process and people costs were too high [several] years ago...and the plan was to move everything from the home-grown search engine to Solr/Lucene, which already supported 64-bit. So, we punted. We'd get Solr spun up on some test projects, then migrate everything over to it and not have to pay the price for upgrading the home-server.

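As an editorial illustration (not part of the correspondent's comments), the two limits quoted above follow directly from the width of the record ID. The short Python sketch below reproduces both figures and estimates how little headroom a 32-bit ID space leaves. The starting count of 4,000,000,000 is a hypothetical round number chosen to match "approaching 4 billion," and the function names are invented for this example.

```python
# Illustration only: record-ID capacity for 32-bit and 64-bit unsigned IDs.
# Function names and the starting profile count are assumptions for this sketch.


def id_capacity(bits: int) -> int:
    """Number of distinct values an unsigned integer of `bits` bits can hold."""
    return 2 ** bits


CAP_32 = id_capacity(32)
CAP_64 = id_capacity(64)


def months_until_full(current: int, monthly_growth: int, capacity: int = CAP_32) -> int:
    """Whole months of growth that still fit before the ID space runs out."""
    remaining = max(capacity - current, 0)
    return remaining // monthly_growth


if __name__ == "__main__":
    print(f"32-bit IDs: {CAP_32:,}")
    print(f"64-bit IDs: {CAP_64:,}")
    # Hypothetical: 4 billion profiles today, growing by 100-150 million a month.
    for growth in (100_000_000, 150_000_000):
        print(f"{growth:,} per month -> {months_until_full(4_000_000_000, growth)} full month(s) left")
```

Under those assumed numbers, a 32-bit index has room for only a month or two of new profiles, which is consistent with the correspondent's remark that the team was running out of space.
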
"We looked for ways to reduce the index size. Were there "trash" nodes in the tree that didn't belong? Yes. In doing analysis, we found rogue profiles that had 10,000+ events associated with them. Those were obvious candidates to drop, but that wasn't enough. What about empty profiles? If there is only a name and no other information, is there any value in adding it to the index? If it is the only reference to that person, then it is potentially of value to someone. But if it doesn't have a source to back it up, then it is of little value - it is just conjecture. That pulled the projected index size down into something we could manage for a while.

"It appears the rules have been expanded to 'anything without an attached Ancestry source.' In their defense, the number of users who actually create a non-Ancestry source is small, so they probably thought this wouldn't affect many profiles. But those users who take the time to do so are usually quite good and there could be benefit to others, so that defense doesn't go too far.

"The short-term solution - prune the index until Solr was ready - seemed reasonable [several] years ago. If you check the message boards, there was some understandable feedback when the cuts accidentally went too deep. The next roll of the index corrected that, but the index continued to grow and the Solr solution wasn't coming along as fast as hoped...so some different rules were apparently applied later."

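The pruning rules my correspondent describes can be sketched in a few lines of Python. This is an illustration only, not Ancestry's code: the Profile fields, the function names, and the exact way the rules combine are my assumptions based on the description above. The early rules are read here as "drop rogue profiles with 10,000+ events, and drop name-only profiles with no source," and the later rule as "drop anything without an attached Ancestry source."

```python
from dataclasses import dataclass

# Threshold taken from the text ("10,000+ events"); everything else is assumed.
ROGUE_EVENT_THRESHOLD = 10_000


@dataclass
class Profile:
    """Hypothetical, simplified view of one tree profile."""
    name: str
    event_count: int = 0
    has_details: bool = False      # anything beyond the name (dates, places, ...)
    ancestry_sources: int = 0      # sources attached from Ancestry collections
    other_sources: int = 0         # user-created or "Other Sources" citations


def is_rogue(profile: Profile) -> bool:
    """Profiles with an implausible number of events are treated as trash nodes."""
    return profile.event_count >= ROGUE_EVENT_THRESHOLD


def indexed_under_early_rules(profile: Profile) -> bool:
    """Assumed early rule: skip rogue profiles and unsourced name-only profiles."""
    if is_rogue(profile):
        return False
    name_only = not profile.has_details
    unsourced = profile.ancestry_sources + profile.other_sources == 0
    return not (name_only and unsourced)


def indexed_under_expanded_rule(profile: Profile) -> bool:
    """Assumed later rule: index only profiles with an attached Ancestry source."""
    return not is_rogue(profile) and profile.ancestry_sources > 0


if __name__ == "__main__":
    synced = Profile("Synced from desktop software", event_count=5,
                     has_details=True, other_sources=3)
    print(indexed_under_early_rules(synced))    # True
    print(indexed_under_expanded_rule(synced))  # False
```

The example profile has details and three non-Ancestry citations, so it would pass the early rules but be left out of the index under the expanded rule.
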
3)  Randy's opinion:  What my correspondent describes above is really "Ancestry is a victim of its own success, and they didn't react fast enough."  Like Topsy, the company grew quickly, its customer base grew, and then stuff happened.  They started with a home-grown search system, and then had to react to the eventual limitations of that system.  I worked for 40 years in aerospace engineering and lived through several cycles of these kinds of business and technology challenges, but without a giant customer base.

Over about 20 years, this company, their products, and customer base grew significantly year-over-year, and resources were strained.  Computer technology, hardware and software improved exponentially at the same time, and that continues unabated.  The company was challenged by the need to improve their products, incorporate new technology, fight obsolescence, add capable staff and management, and be profitable, all while satisfying customer, corporate and investor expectations, in a dynamic, and competitive, technology and business environment.  

In my opinion, more information is preferred when problems like the Tree Indexing issue occur, rather than less information.  Customers don't always understand why things happen, and then complain about the perceived problems (this blogger included).  I am more than willing to publish information from Ancestry.com, without my editorializing, if they wish to respond to these comments from my correspondent.

In the next post, my correspondent will discuss the issues relating to the search process for records on Ancestry.com.  

=============================================

Disclosure:  I have had a paid subscription to Ancestry.com since 2000, and use the site every day.  I have received material considerations from Ancestry.com in years past, but that does not affect my objectivity in writing about their products and services.

The URL for this post is:  https://www.geneamusings.com/2019/01/a-readers-take-on-ancestrycom-problems.html

Copyright (c) 2019, Randall J. Seaver

Please comment on this post on the website by clicking the URL above and then the "Comments" link at the bottom of each post.  Or contact me by email at randy.seaver@gmail.com.

7 comments:

Leslie P said...

Just making sure I understand this. I use RootsMagic, and sync to Ancestry because my sister has a subscription and is doing DNA research via Ancestry. So what I sync from RM has my "non-Ancestry" sources. Since I don't have an ancestry subscription and don't care to spend time on that site, none of the profiles I sync from RM would be listed in the index, so other people wouldn't be able to find the folks on my tree. Is that correct?

Marcia Crawford Philbrick said...

Thanks Randy for sharing this. As one of those people that has a large sourced tree that is only partially indexed, I need to continue sharing my tree in other ways. For quite a few years, I have posted my tree on the web. With RootsMagic, this is easy and economical. I am also trying to do more with Family Search. As an Ancestry user, I need to be willing to spread a broad net when looking for others researching my ancestors.

Note: the problem of scale likely explains the issues I’ve experienced with my DNA matches. The growth in sales of kits is exponentially growing the database of matches.

Randy Seaver said...

Hi Leslie P, YES, that is exactly correct. It's been that way for over one year now, and maybe longer. The source citations would be listed as "Other Sources" on the profiles in your Ancestry Member Tree.

Your synced tree still has value, however, especially if you have a DNA test connected to it.

Also, in case of a problem with your computer files, you can download your Ancestry Tree to a GEDCOM file.

Leslie P said...

Thanks for confirming, Randy, I thought that was how things work. It's really rather a mell of a hess with all of the syncing and data loss that we have to put up with. My sister has done a bunch of work on a different tree on Ancestry (she found my Dad's birth mom, we had a reunion, it was a whole thing) and of course the GEDCOM dl from there doesn't have any images or source documents so it's better than nothing, but still lame.

As an aside - your post about Genetic Affairs was fascinating and I sent it over to my sister - she's already using the clusters to sort folks out when trying to figure where they connect.

When I started this computerized genealogy stuff 20-odd years ago I sure thought we'd be a LOT further along in the interoperability thing by now.

Ryan Ross said...

So basically it's a combination of things that we who commented on your last Ancestry post guessed. Ancestry had tech problems that they allowed to snowball out of control because they long had the subscriber growth and lack of competition to enable such "punting." Meanwhile, they chose to spend money on acquiring smaller services like Find-A-Grave, not to mention investing in a huge DNA market, rather than on improving services they already offered. (Cynically, why bother improving your technical systems when you have a subscriber base that will be there even if you don't do so? Big head-starts in records, trees, etc. afford this luxury.)

This situation shows why it's a bad thing when one company has a stranglehold on an industry...especially an industry that has until very recently been lacking in technical innovation. How old are most of the major software programs being offered now? How little have most of them been fundamentally updated apart from new layers of bells and whistles? Why are we still forced to rely upon (and work around) the ancient GEDCOM format?

Thankfully, there are glimmers of change. RootsFinder is a Godsend to me. It does so much of what I want a genealogy program to do, and I am happy to keep using it, online format though it be. Bold, skillful innovators like Dallan Quass need our support.

Thanks for reporting on this, Randy. You may not agree with everything I said here, but I nevertheless consider your blog to be one of the very best genealogy blogs out there. Keep up your good work.

cathyd said...

Thanks for sharing this information, Randy. It makes perfect sense to me (with some IT experience). I also agree with you that we're better off w/ more info rather than less. And that Ancestry is a victim of its own success. As I'm sure you know, this kind of problem is not unique to Ancestry; there is never an easy answer to balancing the needs of the customers, the investors, the resource limits, etc. etc.

Looking forward to the next post -- and meanwhile will use some of my time between jobs downloading my info from Ancestry and getting it properly sourced in my Legacy Family Tree program. (I've gotten into the questionable habit of putting all my source info online. Ugh.)

Jan Murphy said...

Since I've been taking advantage of Ancestry All-Access, I've been linking source material from fold3 and Newspapers-dot-com as well as Find a Grave back to the profiles on my Ancestry online trees. Guess where Ancestry puts all that material? Under "Other Sources".

So it may not be likely, but it's possible that someone could create a tree, have only "Other Sources" which all came from Ancestry-owned websites, and still not have their tree indexed.