Tuesday, February 12, 2019

Using GeneticAffairs.com to Create DNA Match AutoClusters - Part 3: Using the Cluster Information

I wrote Using GeneticAffairs.com to Create DNA Match AutoClusters - Part I: Getting the Clusters four weeks ago, describing the login process and getting to the point of creating the AutoClusters for my AncestryDNA matches.  In Part 2: The Cluster Graphic last week, I discussed the mechanics of obtaining and analyzing the graphic, and described the problems I found with the graphic (partially shown below):

The graphic contains information about the persons who match me and each other in a cluster.  I had 52 clusters - the biggest one had 37 matches (shown above in orange), and the smallest had 2 matches.  I didn't show the bottom portion of the .html cluster image - it lists the names, cMs, number of segments, tree size, predicted relationships, and AncestryDNA Notes for each person in each cluster.  That is useful information and overcomes the issue with the unreadable names on the left and top.

How can the AutoCluster graphic be used to identify the common ancestor(s) of each cluster and each DNA match?  That is really the ultimate reason to use this cluster approach.  

Here is what I did:

1)  The AutoCluster results include a .csv file called AutoCluster_Ancestry_your_name_date_time.csv.zip.  This file can be downloaded, saved to a file folder, and unzipped.  From the unzipped .csv file, a spreadsheet can be created with only the AncestryDNA matches that were estimated to be in a cluster of matches - with two or more persons in a cluster.  This file gave me only 398 persons with over 25 cM, in 52 clusters.
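
For readers who would rather script this step than unzip the file by hand, here is a minimal Python sketch that reads the .csv inside the zip and counts the persons in each cluster.  The column label "Cluster No." is my label for the column, not necessarily the header in the real file, and the file encoding is an assumption, so adjust both to match your own download.

```python
import csv
import io
import zipfile
from collections import Counter


def read_cluster_rows(zip_source):
    """Return the rows of the first .csv file inside the zip as dictionaries.

    zip_source may be a file path or an open binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        csv_name = next(
            name for name in archive.namelist() if name.lower().endswith(".csv")
        )
        with archive.open(csv_name) as raw:
            # Assumption: the file is UTF-8, possibly with a byte-order mark.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            return list(csv.DictReader(text))


def cluster_sizes(rows, cluster_column="Cluster No."):
    """Count how many matches fall in each cluster.

    cluster_column is a hypothetical header name; pass the real one.
    """
    return Counter(int(row[cluster_column]) for row in rows)
```

Calling len(rows) on the result gives the number of clustered persons, and len(cluster_sizes(rows)) gives the number of clusters, which can be checked against the totals shown on the AutoCluster report.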

2)  When I opened the .csv file in a spreadsheet program (I use OpenOffice Calc on Windows), I immediately saved it as a spreadsheet file (.ods in OpenOffice).  I also named the sheet "Test Data."  Here is the top of my "Test Data" sheet with the columns that identify names, and some other columns, reduced in width:

The columns on the screen above are:

A:  Identifier (reduced in width for privacy reasons)
B:  Name (reduced in width for privacy reasons)
C:  Total Shared cM
D:  Matches with own profile
E:  Cluster No.
F to OM (reduced in width for space reasons):  Person names that match with the DNA Match names (the rows).  Column F on the sheet is the first DNA Match (also in row 2 of the spreadsheet), and the numbers in Column F indicate that the first DNA Match also matches the persons in those rows.  From Column F, I can see that this first DNA Match matches DNA Matches 1 to 15 (rows 2 to 16), Match 17, Match 21, Matches 23-24, etc.  That Column F match is my first cousin, who has all of my Seaver and Richmond ancestry from our grandparents.
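
The same "who matches whom" reading of the name columns can be sketched in Python.  This is only an illustration under stated assumptions: the first five columns are the ones listed above, the name is in the second column, every later column is headed by a person's name, and a non-empty cell means the row person and the column person match.  Names are used as keys here for readability; the Identifier column would be safer if two matches share a name.

```python
import csv
import io

# Assumption: columns A to E (Identifier, Name, Total Shared cM,
# Matches with own profile, Cluster No.) come before the name columns.
LEADING_COLUMNS = 5
NAME_INDEX = 1  # Assumption: the Name is in the second column.


def shared_matches(csv_text):
    """Map each column person to the set of row persons they match."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    column_names = header[LEADING_COLUMNS:]
    matches = {name: set() for name in column_names}
    for row in reader:
        row_name = row[NAME_INDEX]
        for column_name, cell in zip(column_names, row[LEADING_COLUMNS:]):
            # A non-empty cell is treated as "these two persons match".
            if cell.strip() and column_name != row_name:
                matches[column_name].add(row_name)
    return matches
```

Looking up the first name column in the returned dictionary gives the same list that I read down Column F by eye.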

3)  I then created another sheet (Insert > Sheet) and called it "Cluster list."  I used Edit > Select All and then Edit > Copy on the "Test Data" sheet, clicked on the "Cluster list" sheet, and clicked Edit > Paste to fill in the sheet.  At this point, the "Cluster list" sheet is identical to the "Test Data" sheet.

a)  The magic of Spreadsheets is that the user can Sort the information by columns.  I selected all of the columns (Edit > Select All) on the "Cluster list" sheet, then clicked on Data > Sort and selected Column E and ascending (for Cluster No.) and Column C and descending (for Total Shared cM) to sort the data in the "Cluster list" sheet.

b)  I reorganized the columns to put "Cluster No." in Column A, Shared cM in Column B, Name in Column C.  

c)  I added columns (Insert > Column) for Column D = Match No., Column E for Known Rel[ationship], and Column F for Common Ancestors.  

d)  I added the "Match No." by hand - there is probably a way to do this before I moved the columns, but I didn't do that this first time around.  I haven't added the numbers for some Clusters.  The "Match number" helps me find the match in my AncestryDNA match results (until the match list changes!).

e)  I probably should have added an Ancestry Member Tree size column also.

f) The other columns moved to the right on the sheet.  

g)  Here is the resulting "Cluster list" sheet near the top of the sorted list:

As you can see, all of the clusters are ordered from 1 to 52 in Column A, and all of the persons in a cluster are listed by decreasing shared cM in Column B.
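
The sort and column rearrangement in step 3 can also be done in a few lines of Python.  This is a sketch, not the spreadsheet procedure itself: the key names "cluster", "shared_cm", and "name" are hypothetical stand-ins for whatever the real column headers are, and the three added columns are left blank to be filled in by hand.

```python
def build_cluster_list(rows):
    """Sort rows by cluster (ascending), then shared cM (descending).

    Each output row has the columns in the "Cluster list" order, with
    empty Match No., Known Rel, and Common Ancestors columns added.
    """
    ordered = sorted(
        rows,
        key=lambda row: (int(row["cluster"]), -float(row["shared_cm"])),
    )
    return [
        {
            "cluster": int(row["cluster"]),
            "shared_cm": float(row["shared_cm"]),
            "name": row["name"],
            "match_no": "",          # filled in by hand later
            "known_rel": "",         # filled in by hand later
            "common_ancestors": "",  # filled in by hand later
        }
        for row in ordered
    ]
```

Sorting on a two-part key does the same job as choosing two sort columns in the Data > Sort dialog.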

4)  I then went to my AncestryDNA Match list (on Ancestry.com) and added the "Known Relationship" information (from the Shared Ancestor list on Ancestry or from my own research) and added the known "Common Ancestor" person(s) information.

As you can see on the screen above for Cluster 7, the known common ancestors are from the Richmond/Richman or Rich families.  Even though some of the common ancestors may be White or Oatley, this cluster must be Richman or Rich because there are several matches in the cluster who only have Richman or Rich and don't have White or Oatley.  White and/or Oatley matches are probably in other Clusters (and those who have a White or Oatley common ancestor on the list above will match most persons in those clusters also).
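
The reasoning above (the surname carried by the most matches in the cluster is the likely cluster surname) can be written as a small tally.  This is a rough sketch with made-up input: the dictionary of surnames per match is hypothetical and would have to be typed in from the "Common Ancestors" column, with spelling variants such as Richmond and Richman entered under one spelling.

```python
from collections import Counter


def surname_support(ancestor_surnames_by_match):
    """Count, for each surname, how many matches in a cluster carry it.

    Returns (surname, count) pairs, most widely shared surname first.
    """
    counts = Counter()
    for surnames in ancestor_surnames_by_match.values():
        # set() so a surname is counted once per match.
        counts.update(set(surnames))
    return counts.most_common()
```

A surname that appears for nearly every annotated match, including matches that have no other surname listed, is the best candidate for the cluster; surnames that appear only alongside it probably belong to other clusters.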

Cluster 7 above has the most "Known Relationship" and "Common Ancestor" entries of my 52 clusters.  

5)  The Known Relationships and Known Common Ancestors for each AutoCluster for my AncestryDNA cluster list indicate that:

*  Cluster 1 (37 members) has 2 known common ancestor matches - this is probably a Carringer, Spangler, Feather, or Houx cluster.
*  Cluster 2 (27 members) has 1 known common ancestor match - this is probably a Rich or Hill cluster.
*  Cluster 3 (21 members) has no known common ancestor matches.
*  Cluster 4 (20 members) has no known common ancestor matches.
*  Cluster 5 (20 members) has no known common ancestor matches.
*  Cluster 6 (19 members) has 2 known common ancestor matches - this is probably a Seaver or Hildreth cluster.
*  Cluster 7 (19 members) has 16 known common ancestors - this is probably a Richmond or Rich cluster.
*  Cluster 8 (17 members) has no known common ancestor matches.
*  Cluster 9 (15 members) has no known common ancestor matches.
*  Cluster 10 (13 members) has 2 known common ancestors - this is probably an Oatley or Champlin cluster.

and so forth.  There are several smaller clusters that have 5 or 6 known common ancestors.
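
A summary like the one above can be produced from the "Cluster list" rows with a short tally.  This sketch assumes each row is a dictionary with hypothetical keys "cluster" and "common_ancestors", and treats a row with any text in "common_ancestors" as a known common ancestor match.

```python
from collections import defaultdict


def summarize_clusters(rows):
    """Return {cluster: {"members": n, "known": k}} ordered by cluster."""
    summary = defaultdict(lambda: {"members": 0, "known": 0})
    for row in rows:
        entry = summary[row["cluster"]]
        entry["members"] += 1
        # Any non-blank text counts as a known common ancestor.
        if row["common_ancestors"].strip():
            entry["known"] += 1
    return dict(sorted(summary.items()))
```

Clusters with many members and a "known" count of zero are the ones that need the most research.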

6)  Some clusters have only one or two known relationships and Common Ancestors, and many don't have any, based on the AncestryDNA Shared Ancestors and my own research.  I will focus on the largest of these clusters to try to figure out common ancestors by using Quick and Dirty Trees.

7)  While reviewing the Cluster list and the AncestryDNA Match list, I noticed that I have several significant DNA matches with over 40 cM that are not included in the Cluster list.  To me, that means that those DNA matches don't share DNA with the other matches on the Cluster list (which included matches between 25 cM and 900 cM).  Perhaps a Cluster list with a lower minimum cM would add those omitted DNA Matches.
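
Finding the omitted matches amounts to comparing two lists.  The sketch below assumes the full AncestryDNA match list has been typed or copied into a dictionary of name to shared cM (the AutoCluster .csv file does not contain it), and the 40 cM threshold is just the figure I used above.

```python
def unclustered_matches(all_matches, clustered_names, min_cm=40.0):
    """List matches above min_cm that are absent from the cluster list.

    all_matches: {name: shared cM} for the full match list (assumed input).
    clustered_names: names that appear on the "Cluster list" sheet.
    Returns names ordered by decreasing shared cM.
    """
    clustered = set(clustered_names)
    missing = [
        name
        for name, cm in all_matches.items()
        if cm > min_cm and name not in clustered
    ]
    return sorted(missing, key=lambda name: -all_matches[name])
```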

8)  I will probably run the AncestryDNA AutoCluster again in several months' time so as to keep the Cluster analysis up-to-date with my AncestryDNA Match list.

9)  I hope that this review of using a spreadsheet to analyze the AutoClusters has been helpful.  The first one I've created here has been instructive to me, and I see several ways to improve it.  I would appreciate any suggestions on ways to improve it from my readers.

                     =======================================================

Disclosure:  I have no material connection to Genetic Affairs and am "just a user" of their service. 

The URL for this post is:  https://www.geneamusings.com/2019/02/using-geneticaffairscom-to-create-dna_12.html

Copyright (c) 2019, Randall J. Seaver

Please comment on this post on the website by clicking the URL above and then the "Comments" link at the bottom of each post.  Or contact me by email at randy.seaver@gmail.com.


1 comment:

DNA Comment said...

First, thank you for doing this work and sharing it with good directions. Appreciated. When I see this work (and it's a lot of work) based on spreadsheets, I wonder why some of the vendors of DNA information can't automate the clustering for us. They have a numerical relational data base of each person's relationship to another. If the relationships can be downloaded and put into an Excel spreadsheet, then manipulated by a skilled person on their home computer, it seems like a program could be written by someone smarter than I am to give us family clusters as results. Then all we would need to do is identify one or two of the people in the cluster and we would know who is from that family unit. I know this is all new and evolving...and look forward to the day when someone figures out how to present the data without each of us having to do all of the spreadsheet work. Like the evolution of computers we need to move from having the user do the "coding" to presenting the data in a usable form. I'd pay a premium for that service.