Tuesday, August 28, 2007

Ancestry.com is Caching some web site data

I received an email notice from a colleague yesterday, with a message from the owners of http://www.usgennet.org/. I went to the web site, followed the "What's New at USGenNet" links for 2007 announcements, and found the following:

"Dateline: 26 August 2007

"Copycats on the Loose! Sites on the USGenNet server have been copied and cached versions of pages are being shown in Ancestry.Com's subscription only site. USGenNet President, Ginger Cisewski, sent a notice to all account holders and has assured everyone that USGenNet will be dealing with this promptly. A viewable sample of a copied page is now available at http://www.usgennet.org/usgnhome/business/ancestry.html for those who don't have a subscription to Ancestry.Com."

I followed the links given above, and then went to http://www.ancestry.com/ to see if what it said was true. Although I couldn't find the exact example, I think that what the message says is true - Ancestry has cached pages from online databases and is making them available through the "Internet Biographical Collection" database, which requires an Ancestry subscription to access.

While reading my Bloglines this morning, I noticed that Kimberly Powell posted a very informative article titled "Cache 22 - Has Ancestry.com Gone Too Far?" about this situation, with significant background information. Her article made the point that Google caches data in a similar way, was sued for it, and won the suit in a Nevada District Court. Kimberly's concern, however, is this:

"The Ancestry.com database takes things even further, however, serving up the cached pages as the first option and offering a small link to the "live Web site." There is no way to get to the link for the live page without first viewing the cached page. On the actual record page for each search result, the cached link is identified as "cached," but it is still the only option open if you want to view the content - there is no link to the live Web page until after you view the cached page. And from the search results where you are given the option only to "view Web page" you are taken directly to the cached page, with no notice that the page is indeed cached. This is where I feel that this database has stepped over the line, possibly into copyright infringement. Ancestry.com is serving up copies of copyrighted work and, to make matters worse, selling this as one of their subscription databases. Because the pages are cached, they are also depriving the Web site and/or content owner of traffic and potential income. "

I don't have an opinion about the legality of what Ancestry has done here, because I am not an attorney. But I do have thoughts about this issue and how it impacts genealogy data providers and researchers.

I appreciate that Ancestry has provided links to useful genealogy data, but I am troubled that it is behind the subscription firewall and that the URL for the cached page is for Ancestry.com. However, Ancestry does provide a clear link to the actual web page that was cached - there is a "View Live Web Page" link just below the link to the cached page. I will click on the Live Web Page link as a standard practice so that the data provider gets a hit.

Is the claim by USGenNet that Ancestry is a "copycat" true? Has Ancestry copied the data? Ancestry has not "imaged" the specific web pages or data set directly, but it appears that they have "captured" the web page and put it on Ancestry servers, waiting for a subscriber to click on the Cached link. They have also indexed the information on the page. Ancestry is acting as a data portal and search engine in this case. This is a legal issue and USGenNet is apparently going to pursue it.

Has Ancestry "stolen" the genealogy data on http://www.usgennet.org/, as Amy Crooks posited in her post "Ancestry.com Nothing but Thiefs"? It's a legitimate question, since Ancestry has apparently "captured" the actual web pages and put them on their servers. Again, this is a legal issue and should be settled by a legal agreement or in a court.

What Ancestry has done doesn't prohibit a researcher from going directly to the http://www.usgennet.org/ web site and searching their free online data. I have tested this site and found that it contains a lot of useful genealogical and historical data. I have added it to my Favorites list and will write a separate blog post about it later.

There is a bigger question. Ancestry describes their "Internet Biographical Collection" as:

"This database contains a sampling of biographical sketches found on English language web pages throughout the entire World Wide Web. Web pages can vary greatly in the amount of information they contain about a given person, and in the number of related and unrelated people mentioned on the same page. The information source and the central topic of each page will also vary greatly. Given facts should be verified using other sources. One unique and valuable feature of this web-based collection is the number of hyperlinks leading from each page in the collection to other web pages of possible interest on related topics."

So we know that Ancestry has cached http://www.usgennet.org/ pages. Have they cached pages from other web sites? Are they making agreements with these web sites, or are they just "capturing" and caching web pages? Are they indexing these pages?

If this is legal, what freely available data will be cached next? LDS data (I don't see any yet)? USGenWeb data (I don't see any yet)? Wikipedia (yep!)? Blog posts (uh oh - I just found some Cow Hampshire, Steve's Genealogy Blog and Geneablogie pages in the Ancestry collection, but Genea-Musings is not there yet - what's up with that?)? If it is legal, then what stops any free or commercial web site from doing caching and indexing of free web pages? Probably just money - bandwidth, servers, etc.

It is an interesting time for genealogy researchers and family historians, isn't it?

UPDATED Tuesday, 8/28, 3 PM: Kimberly commented that:

"I just wanted to mention that while there is now a link to the "View Live Web Page" under the link to the cached page, there was no such link this morning. Several changes have been made to the database already since early this morning, likely in response to some of the uproar from the genealogy community.

"The database is now appearing in their "free records" as well. You have to sign in as an Ancestry.com member, but you don't have to be a subscriber. "


I could not verify that this database is on the Free side, although it says "Free Index" in the list of recently added Ancestry databases (dated 8/22/07). That certainly is a major change if true.

I thought I noticed the "View Live Web Page" link was there today, but not yesterday. And putting this in the FREE collection is certainly a smart thing to do - they should encourage clicking on the "View Live Web Page" link rather than the "View Cached Page" link.

Becky Wiseman on her Kinexxions blog has posted about this issue in "Is This Fair Use?" and Janice Brown on her Cow Hampshire blog has commented in "Ancestry Hijacks Cow Hampshire."

Comments:

John said...

Have you been to archive.org? Their 'Wayback Machine' archives the entire internet. Maybe not *all*, but it doesn't narrow its focus to 'popular' sites or themes.

I know newspaper websites have successfully kept their sites from being archived there, but I would expect many genealogy blogs to appear there.

Probably usgennet as well. Though the site is currently experiencing technical difficulties, so I can't check. That said, they don't charge a fee. Ancestry's putting this under a fee-service seems questionable, though I too am not a lawyer.

Kimberly said...

Hi Randy,

Thanks for your link to my blog! I just wanted to mention that while there is now a link to the "View Live Web Page" under the link to the cached page, there was no such link this morning. Several changes have been made to the database already since early this morning, likely in response to some of the uproar from the genealogy community.

The database is now appearing in their "free records" as well. You have to sign in as an Ancestry.com member, but you don't have to be a subscriber.

Susan K said...

Archive.org's Wayback Machine is good for finding older sites. The Wayback Machine is a good netizen, tho, and if a site requests that its bots not crawl the site, they will not.

My questions about ancestry.com's crawling/scraping actions are technical. What kind of bot is it using? Has anyone been able to find evidence of ancestry.com bot activity in server logs? That kind of specific information will help tons in fighting undesired scraping activities; the web site owner or master can add code to web server software to disallow visits from that bot or that IP address.

That solution does require that you have pretty good tech access to your own website, tho... and is prolly not suitable for sites hosted on blogspot or blogger or typepad or wordpress.com etc. (i.e., free web hosting services)
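For anyone who does have access to their raw server logs, here is a minimal sketch of how one might look for a crawler in them. It assumes the logs are in the common "combined" access log format, and the sample lines, IP addresses, and the "ExampleBot" name are all made up for illustration; the function name `bot_hits_by_ip` is hypothetical too.

```python
import re
from collections import Counter

# Assumed layout (combined log format):
# ip ident user [time] "request" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)

def bot_hits_by_ip(lines, agent_substring):
    """Count requests per client IP whose User-Agent contains agent_substring."""
    needle = agent_substring.lower()
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and needle in match.group("agent").lower():
            hits[match.group("ip")] += 1
    return hits

# Made-up sample lines, for illustration only.
sample = [
    '192.0.2.10 - - [28/Aug/2007:09:15:02 -0700] "GET /index.html HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
    '192.0.2.10 - - [28/Aug/2007:09:15:04 -0700] "GET /bios.html HTTP/1.1" 200 8044 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [28/Aug/2007:09:16:30 -0700] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

hits = bot_hits_by_ip(sample, "examplebot")
print(hits)
```

With a real log you would pass the open file in place of `sample`. The per-IP counts are the kind of evidence that would let a site owner decide which user agent or address to block.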

anniegms said...

The name of the bot is MyFamilyBot. It is talked about on their MyFamily.com site.
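With a bot name in hand, a site owner who can edit robots.txt could ask that crawler to stay away. The sketch below only checks what a set of robots.txt rules would say to a crawler identifying itself as "MyFamilyBot"; whether the bot actually honors robots.txt is an assumption that would need confirming, and the rules and the example.org URL are placeholders rather than anything taken from a real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one named crawler, allow everyone else.
robots_txt = """\
User-agent: MyFamilyBot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

page = "http://www.example.org/bios.html"  # placeholder URL
bot_allowed = parser.can_fetch("MyFamilyBot", page)
other_allowed = parser.can_fetch("SomeOtherBot", page)

print("MyFamilyBot allowed:", bot_allowed)
print("Other crawlers allowed:", other_allowed)
```

Keep in mind that robots.txt is only a request. A crawler that ignores it would have to be blocked in the web server configuration instead.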

Susan K said...

Fabulous, AnnieGMS, thanks. That was the hint I needed.

Incidentally, Randy, I've posted about all of this... with a parody image of Ancestry.com's home page. Plus, my update re: tech info (hat tip to you, Annie) will be at the conclusion of the post on that page.

Jessica's thoughts said...

Hi Randy,

I've also posted my opinion on this issue on my blog:

http://jessicagenejournal.blogspot.com/2007/08/my-two-cents-on-ancestrys-internet.html

Happy Dae said...

Well, of course opinions are like perfumes -- some smell better than others. I like this new tool of Ancestry's. AND I like the fact that our community is awake and VOCAL about the changes that occur. AND I like that Ancestry has quietly responded by making it free and is linking to the web sites.

I see this as all good. I do think they erred, but they have quickly corrected their mistakes. I'm comfortable with that.

I will suggest that Ancestry add more criteria for an advanced search. Much needed for more common names.

I will remind the reader that this database says "sampling" and therefore isn't complete. (Is any database complete? Well, perhaps the database of living descendants of President Abraham Lincoln.)

So, I like it, even with its flaws and brief history.

Happy Dae.
http://www.ShoeStringGenealogy.com/ssg1.htm

kristine said...

Here is what Dear Myrtle had to say on her blog about this:

Numbers, rankings and Ancestry.com

http://blog.dearmyrtle.com/2007/08/numbers-ranking-ancestrycom.html

from http://www.familyforest.com
