Tuesday, June 9, 2009

Some Average Database Sizes

Tamura Jones asked the question on his http://www.tamurajones.net/ web site, and on Twitter and Facebook, "what is the average genealogy database size?" Please go to Tamura's web site and read his articles about the subjects of GEDCOMs, social networks, database size, etc. Here are links to his articles (thank you Tamura!):

* http://www.tamurajones.net/MyLargeIsSmallerThanYours.xhtml -- My Large is Smaller than Yours

* http://www.tamurajones.net/HowGeniBeatsWereRelated.xhtml -- How Geni beats We're Related

* http://www.tamurajones.net/AverageSizeIsAStatistic.xhtml -- Average Size is a Statistic

* http://www.tamurajones.net/SocialGenealogyMetrics.xhtml -- Social Genealogy Metrics

* http://www.tamurajones.net/SocialGEDCOMFormula.xhtml -- Social GEDCOM Formula

In response to his post on Twitter and Facebook, I guessed "900." It was just a SWAG (super wild a$$ guess) on my part... based on my own work, conversations with my society colleagues, etc.

Tamura also noted in one of his posts that the http://www.geni.com/ average database size was about 16 (50 million entries for 3 million users) and that the average Facebook database size on the FamilyLink We're Related application was about 5 (200 million entries for 40 million users). He noted that Geni may have a higher average number of names per database because some users have been able to upload a GEDCOM file (I have), whereas the We're Related application has failed (so far) to provide a working GEDCOM upload capability.

I tried to find some "average database sizes" from several online family tree databases:

* In the Rootsweb WorldConnect database, the search results page says that there are 421,867 databases with 580,636,456 names in them. That works out to be 1,376 names per database, on average.

* In the GenCircles database, you can click on a link that shows all of the databases in alphabetical order. I added up the first 100 databases listed in the A surname list, and came up with 198,140 names in 100 databases, for an average of 1,981 names per database. There were database sizes from 0 to over 40,000 in this list of 100 databases. Perhaps there is an easier way.

* The Ancestry.com One World Tree database has over 192 million names in it. They claim that these are combined into one database, so the average size is 192 million, right?

* The Ancestry.com Public Member Trees have 585,730,026 names in them, but I couldn't find out how many trees there were.

* The Ancestry.com Town Hall meeting webinar in May 2009 claimed that there were 1 billion names in 10 million family trees. That is an average of only 100 persons per family tree. Perhaps that includes One World Tree, the Member Trees, Ancestry World Tree, etc. It probably includes people that have entered family tree data one-by-one by hand, rather than uploading a GEDCOM file.

* Kindred Konnections/My Trees has over 242 million names in their database, but no indication of the number of family trees.

* The Genealogy.com World Family Tree claims to have over 1 million names in 1,000 trees on each CD-ROM they offer (190 CD-ROMs to date). That's an average of 1,000 names per tree, but the numbers are probably estimates.
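The averages in the list above are simple division: total names over total databases. A minimal Python sketch of that arithmetic follows, using only the sites where both numbers were quoted. The dictionary layout and the `average_size` helper are illustrative names, not anything the sites themselves provide.

```python
# (names, databases) totals as quoted in the list above.
sources = {
    "Rootsweb WorldConnect": (580_636_456, 421_867),
    "GenCircles (first 100 'A' databases)": (198_140, 100),
    "Ancestry.com (Town Hall webinar figures)": (1_000_000_000, 10_000_000),
    "Genealogy.com World Family Tree (per CD-ROM)": (1_000_000, 1_000),
}


def average_size(names, databases):
    """Mean number of names per database (hypothetical helper)."""
    return names / databases


averages = {site: average_size(n, d) for site, (n, d) in sources.items()}

for site, avg in averages.items():
    print(f"{site}: {avg:,.0f} names per database")
```

Sites that report only a name count (Public Member Trees, Kindred Konnections) are left out because the divisor is unknown.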

There are many other family tree databases available on the Internet. Can someone find others that have a large number of names and databases and that post both numbers?

The only site that I've found that explicitly lists both the number of names and the number of databases is Rootsweb WorldConnect. The average there is 1,376 persons per database.

Of course, the databases uploaded to all of the family tree databases in whatever form contain significant duplications. I've posted my databases at about ten family tree web sites, and I'm sure that many people have uploaded the same database(s) to a number of web sites.

Interesting problem, isn't it?

UPDATED 4:15 p.m. Tamura sent links to his posts that I couldn't get to. I'm still using IE7 for almost everything but reading his articles.

UPDATED 10:45 p.m. Tamura sent the link for his article Figuring out Average Genealogy Size. I haven't read it yet but will soon.

My first reaction was "7 inches" - the height of about 900 Individual Reports printed out.

Perhaps the better measurement would be Median GEDCOM Size rather than average size. Really big databases can skew the numbers badly.
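To see how much one big database can skew things, here is a small Python sketch comparing the mean and the median. The ten sizes are made up purely for illustration (they are not taken from any of the sites above), and the variable names are arbitrary.

```python
from statistics import mean, median

# Hypothetical database sizes, invented for this illustration only:
# mostly small trees plus one very large one.
sizes = [1, 1, 5, 40, 250, 900, 1_500, 4_000, 40_000, 750_000]

mean_size = mean(sizes)
median_size = median(sizes)

print(f"mean:   {mean_size:,.1f} names per database")
print(f"median: {median_size:,.1f} names per database")
```

With numbers like these, the single 750,000-name database pulls the mean far above what a typical tree holds, while the median stays near the middle of the pack.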

5 comments:

CMPointer said...

Randy,

Just the thought of all those trees hurts my head! Thank you for your insight on this. Also, I wanted to let you know that I, like so many, have awarded you the "Puckerbrush" award. Also, I have a special "thank you" on the same blog post: http://yourfamilystory-cmpointer.blogspot.com/2009/06/my-many-thanks.html

Caroline

TamuraJones said...

Randy,

Thanks again for the attention.
As the previous article in this subject points out, you are one of the three bloggers that made me do it.

Interesting problem?
Actually, calculating the average size is a mean problem ;-)

I've already calculated average numbers for more online sites, including the average for all 100.000+ GenCircles databases :-)
Writing up what it all means takes time.

You have replaced your SWAG with the RootsWeb average as your estimate, moving from 900 to 1.376.
As I tweeted an hour or so ago, I've already come across a GEDCOM of 846.886 individuals…

The article I posted some ten minutes ago, Figuring out Average Genealogy Size,
starts with a quick overview of what the previous articles discussed.

- Tamura

Robert Baca said...

I guess mine is larger than average ... I have 4,052 names, 1,580 marriages, 17 generations, and 635 surnames in my database. Of course, I've seen some huge databases. For instance, the Great New Mexico Pedigree Database http://www.hgrc-nm.org/surnames/surnames.htm currently has 128,614 names. That isn't even the largest single database that I've seen. I wouldn't count Ancestry.com's database as a single database.

Louis said...

The biggest GEDCOM file I have found online is one that is 320 MB in size having 741,968 individuals in it. The header says it was created with Legacy Version 6.0.

You can find it at: http://www.prpletr.com/Gedcoms.htm and it is the first one in the list, entitled: "Good, Engle, Hanks Family Gedcom".

TamuraJones said...

Randy,

I privately suggested considering the question of whether the average size matters at all,
and you've since updated your post with the suggestion that the median may be a better number.

You are thinking along the same line as I. I found something to say about the median, but the question remains whether the number matters.
If you knew the average or median is, say, 7.654.321, then what? Do these numbers matter?

By the way, large databases influence the numbers, but so do small ones.
Of all sizes, nothing influences the average more than the mode, and the mode of all these online sites seems to be 1.

- Tamura