Genea-Musings: Ask AI: What Can I Do With a Whole Genome Sequencing File, and What Can It Tell Me?

Tuesday, October 14, 2025

MyHeritage DNA just announced that it will provide a Whole Genome Sequencing (WGS) file for some of its DNA testers: those whose kits are currently being processed at its lab, and all kits sold from now on. In an email, they noted:

"Whole Genome Sequencing reads almost the entire human genetic code, covering around 3 billion base pairs (nucleotides). This is superior to the standard genotyping arrays used by most consumer DNA tests, including MyHeritage until recently, which read only about 700,000 base pairs. More data enables deeper insights across all types of genetic analysis. Whole Genome Sequencing is now being applied to most new MyHeritage DNA kits currently being processed at the lab, and to every new MyHeritage DNA kit sold moving forward. MyHeritage DNA kits already processed with the older genotyping array technology will not be reprocessed with Whole Genome Sequencing. Customers whose MyHeritage DNA kits are processed with Whole Genome Sequencing will be able to download their entire genome from MyHeritage at no cost, in CRAM format."

At first reading, MyHeritage seems to be saying "We are not going to do Whole Genome Sequencing on previously submitted DNA kits - only on currently in-process kits and new kits."

That implies that if I want to have my Whole Genome Sequencing performed, I need to buy a new MyHeritage DNA test and have it processed by MyHeritageDNA.

1) My first reaction to this was: Huh? If I have WGS test results, how can I use this information? Where can I use it? What will it tell me?

My second reaction was to ask the FREE OpenAI ChatGPT5 with this prompt: "What can I do with a Whole Genome Sequencing file of my own DNA? What companies can "read" it, and what information can they provide about my genetic makeup?"

2) OpenAI ChatGPT5 helpfully (I think!) replied:

Great question — having your whole genome sequencing (WGS) (or data derived from it) opens up a lot of possibilities — but also a lot of caveats. Below is a breakdown of what you can do, what limitations/challenges you’ll face, and what companies or services exist to help you interpret it.

What is a “whole genome sequence file”?

First, by “whole genome sequence file,” you might mean different file formats:

Raw read files (FASTQ) — the direct output from the sequencing instrument, containing millions to billions of short reads.
Aligned reads (BAM / CRAM) — those reads aligned to a reference genome.
Variant calls (VCF or gVCF) — this is the interpreted file that says: “at position X you differ from the reference in this way (SNP, insertion, deletion, structural variant, etc.).”
Annotation reports / interpreted reports — the clinical or consumer‐facing summaries of what variants might matter, what traits or risks they link to.

Most consumer / downstream services work with the variant file (VCF) or derived annotation files. The raw data files are huge and require specialized tools to handle.

As an example, companies like Dante Labs or Nebula Genomics provide your genome in multiple file formats (FASTQ, BAM, VCF) so you can use downstream tools. (Sequencing)
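If you are not sure which of these file types you have been handed, the first few bytes usually give it away. The following is a minimal Python sketch, not a validator: the function name `sniff_genome_file` is made up for illustration, and the check is only a heuristic (for example, a SAM text header also starts with "@", so it would be reported as FASTQ here).

```python
import gzip


def sniff_genome_file(path):
    """Guess a genomics file type from its leading bytes (heuristic only)."""
    with open(path, "rb") as handle:
        head = handle.read(16)
    if head[:4] == b"CRAM":
        # CRAM files begin with the literal bytes "CRAM".
        return "CRAM"
    if head[:2] == b"\x1f\x8b":
        # gzip/BGZF container: BAM, compressed VCF and compressed FASTQ
        # all look like this from the outside, so peek at the contents.
        with gzip.open(path, "rb") as handle:
            inner = handle.read(16)
        if inner.startswith(b"BAM\x01"):
            return "BAM"
        if inner.startswith(b"##fileformat=VCF"):
            return "VCF (compressed)"
        if inner.startswith(b"@"):
            return "FASTQ (compressed)"
        return "unknown gzip"
    if head.startswith(b"##fileformat=VCF"):
        return "VCF"
    if head.startswith(b"@"):
        return "FASTQ"
    return "unknown"
```

Calling `sniff_genome_file("my_genome.cram")` (a hypothetical file name) on a real CRAM download should return "CRAM".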

What you can do (on your own or with tools)

If you have the sequence / variant data, here are some things you can do (or try):

Activity	What it involves / tools	What you might learn	Challenges / caveats
Ancestry / population genetics	Use tools that compare your genome with population reference panels (e.g. PCA, admixture analyses)	Proportions of ancestry from different populations, relatedness to others	Interpretation is probabilistic; reference populations may be biased
Gene / trait association scanning	Annotate all variants (SNPs, indels) for known trait/disease associations using resources like ClinVar, GWAS catalogs	Known risk alleles, protective alleles, carrier status for recessive conditions	Most variants are of “unknown significance” — i.e. no clear known effect; many trait associations have small effect sizes
Pharmacogenomics	Look up variants known to affect drug metabolism (e.g. CYP genes)	Whether certain drugs may be more or less effective or more likely to cause side effects for you	Must be careful — drug decisions should always involve a medical professional
Rare disease / Mendelian variant search	If you or a relative has a phenotype (disease, symptom), you might search for rare or damaging mutations in genes known to cause such phenotypes	Candidate causal variants or genes	Requires good variant filtering, interpretation, domain knowledge. Many variants are inconclusive (“variants of uncertain significance”)
Structural variant / large rearrangement detection	With sufficiently good sequencing or additional analysis, detect duplications, deletions, inversions, translocations	Variation beyond SNPs / small indels	Tools are more complex; signal/noise issues; many structural variants have unclear clinical meaning
Re‐analysis over time	As scientific knowledge advances, a variant previously “unknown” might become better interpreted	You can re‐interpret your genome periodically	Requires good recordkeeping and access to updated annotation databases
Sharing / comparison with others	You might upload parts of your genome (or select variants) to genealogy / matching services, or share with researchers or family members	Identify relatives, shared segments, reconstruct pedigree relationships	Privacy risks, data security, ethics, consent issues
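
To make the "carrier status" idea in the table a little more concrete, here is a rough Python sketch of reading one line of a VCF file and saying whether the person has zero, one, or two copies of the variant. The example line is invented for illustration (it is not a real variant), and `describe_genotype` is a hypothetical helper name, not part of any tool mentioned above.

```python
def describe_genotype(vcf_line):
    """Summarize the genotype on one single-sample VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, _variant_id, ref, alt = fields[:5]
    # Column 9 names the per-sample keys (e.g. GT:DP); column 10 holds the values.
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    alleles = sample["GT"].replace("|", "/").split("/")
    if "." in alleles:
        zygosity = "no call"
    elif all(allele == "0" for allele in alleles):
        zygosity = "homozygous reference"   # no copies of the variant
    elif len(set(alleles)) == 1:
        zygosity = "homozygous alternate"   # two copies of the variant
    else:
        zygosity = "heterozygous"           # one copy (a "carrier" pattern)
    return f"{chrom}:{pos} {ref}>{alt} {zygosity}"


# Made-up example record: position and alleles are placeholders.
example = "chr1\t12345\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:30"
print(describe_genotype(example))  # chr1:12345 A>G heterozygous
```

Whether a heterozygous result actually matters depends entirely on what is known about that particular variant, which is where databases like ClinVar come in.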

Some in the bioinformatics community mention tools and services like gene.iobio, OpenCRAVAT, GATK, bcftools, etc. for doing your own analyses. (Reddit)

One blog post notes that once you have your WGS, you have many files (FASTQ, BAM, VCF) and you must decide which files to feed into “DNA analysis apps / reports.” (Sequencing)

What companies / services “read” your genome and what they provide

There are several commercial and academic/clinical services that accept your genome (or will sequence + interpret) and provide reports. The offerings vary quite a bit in depth, cost, regulatory / clinical validity, and what they promise. Below are some examples and categories.

Consumer / Direct-to-Consumer / Genomics startups

These tend to provide health / trait / ancestry reports, often not strictly clinical (i.e. for informational / wellness purposes):

Nebula Genomics — offers whole genome sequencing plus trait/health interpretations. (Wikipedia)
Dante Labs — provides WGS + analysis / reports. (Dante Labs Global)
Veritas Genetics — historically has done WGS + interpretation (cancer risk, pharmacogenomics, etc.). However, their operations have had some changes. (Veritas)
Genetic Genie — a third‐party interpretive service; you can upload VCF / raw data and get trait / methylation / pharmacogenomics style reports. (Genetic Genie)

These companies might report on:

Carrier status (for recessive diseases)
Risk or predisposition to common diseases (heart disease, Alzheimer’s, diabetes, etc.)
Pharmacogenomics (how you metabolize certain drugs)
Traits (eye color, height, etc.)
Ancestry / genealogical estimates
Genetic “wellness” traits (some feel these are speculative)

But note: many such associations are probabilistic, not deterministic. Also, consumer reports may not have clinical regulatory oversight (i.e. not necessarily CLIA / CAP certified in the U.S.).

Clinical / Diagnostic / Medical Genomics Labs

These companies or labs operate under regulatory frameworks, and aim to provide medically actionable results (particularly for rare diseases, cancer, or diagnostics). They interpret variants in the context of disease phenotypes. Examples:

Baylor Genetics — they run clinical WGS and interpret variants in disease / diagnostic contexts. (Baylor Genetics)
CentoGenome / Centogene — they offer comprehensive WGS for rare / neurodegenerative diseases. (CENTOGENE)
Illumina Clinical Services / Illumina’s clinical lab — they perform clinical-grade WGS. (Illumina)
Broad Clinical Labs — they offer research and clinical WGS services. (Broad Clinical Labs)
Revvity Omics — provides WGS as a clinical service. (Revvity)
Other specialty molecular / genomic diagnostic labs often partner with hospitals or academic medical centers.

These services typically provide:

Pathogenic / likely pathogenic variant identification in genes relevant to your condition
Variants of uncertain significance (VUS), with appropriate notes / disclaimers
Insight that may guide medical decisions (if known and validated)
Reports with genetic counseling included or recommended
In some cases, reanalysis in the future as evidence evolves

One recent review notes that in clinical practice, WGS is increasingly used and is done at ~30× coverage, paired‐end, and allows detection of structural variants. (BioMed Central)
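
As a back-of-the-envelope illustration of what "~30× coverage, paired-end" means, the arithmetic can be sketched in a few lines of Python. The genome size of about 3 billion base pairs comes from the MyHeritage email quoted above; the 150-base read length is an assumption chosen for illustration, not something stated by MyHeritage or the review.

```python
def read_pairs_for_coverage(target_coverage, genome_size_bp, read_length_bp):
    """Estimate how many paired-end read pairs give the target average coverage."""
    bases_needed = target_coverage * genome_size_bp
    bases_per_pair = 2 * read_length_bp  # each pair contributes two reads
    return bases_needed / bases_per_pair


# 30x coverage of ~3 billion bases with (assumed) 150-base paired reads.
pairs = read_pairs_for_coverage(30, 3_000_000_000, 150)
print(f"{pairs:,.0f} read pairs")  # 300,000,000 read pairs
```

In other words, each position in the genome is read about 30 times on average, which is why the resulting files are so large.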

What kind of information you can get (and what you probably won’t get)

What you can reasonably expect

Carrier status for many recessive disorders (if you carry one “bad copy”)
Strong risk / causal mutations (if present) for some monogenic diseases, particularly if there is supporting evidence
Pharmacogenomic guidance for certain drugs (metabolism, dosing)
Ancestry / population genetics insight
Some trait / phenotype associations (often small effect size)
Structural variants or CNVs in well‐characterized disease genes
Variant reanalysis over time — updating interpretations as new research emerges

What you probably won’t get (or will get with low confidence)

Complete insight into polygenic / complex disease risk (e.g. for common diseases) — many variants have very small effects, and environmental / lifestyle factors dominate
A perfect “blueprint” of health — presence of a risk allele doesn’t guarantee disease (penetrance, gene–environment interaction, epigenetics, etc.)
Definitive answers for all traits / conditions — many genes or variants are poorly understood or remain “variants of uncertain significance”
Fine‐scale regulatory / noncoding variant interpretation — these are still actively researched and often ambiguous
Accurate prediction of future disease beyond what genetics + environment allow

In clinical genetics, many variant calls remain “variants of uncertain significance (VUS)” — meaning we don’t currently know whether they’re harmful, benign, or somewhere in between.

Things to watch out for / caveats / ethical considerations

Privacy & data security — genomic data is sensitive. If a service is hacked or data leaked, there could be implications for insurance, employment, etc. (laws like GINA in the U.S. provide some protection, but not perfect).
Regulatory / clinical validity — not all variant associations are validated; consumer reports may overstate certainty.
Psychological impact — learning of risk for serious diseases can be stressful; genetic counseling is recommended with clinical tests.
False positives / false negatives — sequencing and variant calling are not perfect; also rare variants might be missed.
Reinterpretation over time — a variant might be classified as uncertain now and become known later (or vice versa).
Consent & sharing — if you share your genome with relatives, they’re indirectly affected (they share parts).

Strategy advice: how you might proceed

Check what you already have — is your file a VCF, gVCF, or something else? How many variants are called?
Choose a reputable interpretive service — ideally one that has peer reviews, uses up‐to‐date variant databases (ClinVar, gnomAD, HGMD, etc.), and provides disclaimers / counseling.
If your interest is medical / diagnostic, go via a clinical genetics lab (often via a physician).
Do your own exploration carefully — tools like OpenCRAVAT, Ensembl Variant Effect Predictor (VEP), ClinVar browser, etc., can help, if you are comfortable with bioinformatics.
Reanalyze in the future — keep your data in safe storage so you can revisit it as knowledge improves.
Be skeptical — treat any prediction or risk estimate as probabilistic, and consult professionals when using for medical decisions.

If you like, I can suggest a few currently active services (in 2025) that accept your existing WGS data (or will sequence + interpret) and compare their pros, cons, cost, and reliability. Would you like me to do that?

3) Well! I wish I understood any of that. I get carrier status, risks, traits, etc. Will I be able to tell where my eye color, hair color, baldness pattern, left-handedness, analytical nature, etc. come from? Probably - the AncestryDNA Traits tell me some of that already.

It may be a whole new Genetic Genealogy World!!

But MyHeritageDNA will provide a CRAM file. Can I use that for all of the above?

4) My next prompt was: "MyHeritageDNA will provide a CRAM file of my WGS. Does all of the above apply to using a CRAM file?"

5) The OpenAI ChatGPT5 response was:

Excellent question — yes, almost all of the above applies — but there are important technical distinctions in how a CRAM file fits into the workflow and what you can actually do with it.

Let’s break this down clearly:

🧬 What a CRAM File Is

A CRAM file stores the same kind of information as a BAM file in a more compact form: it contains your aligned sequencing reads (that is, the DNA fragments from your sequencing run, already mapped to the human reference genome).

BAM = Binary Alignment/Map (already compressed, but less compactly than CRAM)
CRAM = a reference-based compressed alignment format (it stores differences from the reference genome to save space, so the same reference is usually needed to read it back)

So, in essence, the CRAM file contains:

Each sequencing read (after quality filtering and alignment)
The chromosomal position where it aligns
Base quality scores
Metadata (sequencing instrument, sample ID, read group, etc.)

It does not directly contain “variants” (like SNPs or indels), but the information to call them is inside.
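
To show what "a read, its position, and its quality scores" look like in practice, here is a small Python sketch that parses one alignment record in SAM text form, which is the human-readable equivalent of what BAM and CRAM store in binary (tools such as `samtools view` can print it). The record below is invented for illustration, and `parse_sam_record` is a hypothetical helper name.

```python
def parse_sam_record(line):
    """Pull the main fields out of one tab-separated SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    return {
        "read_name": fields[0],
        "flag": int(fields[1]),
        "chromosome": fields[2],
        "position": int(fields[3]),          # 1-based leftmost aligned position
        "mapping_quality": int(fields[4]),
        "cigar": fields[5],                  # how the read lines up, e.g. 8M
        "sequence": fields[9],
        # Quality characters are Phred scores offset by 33 in ASCII.
        "base_qualities": [ord(char) - 33 for char in fields[10]],
    }


# Made-up record: an 8-base read aligned to chr1 at position 10000.
example = "read1\t99\tchr1\t10000\t60\t8M\t=\t10100\t108\tACGTACGT\tIIIIIIII"
record = parse_sam_record(example)
print(record["chromosome"], record["position"], record["base_qualities"][0])  # chr1 10000 40
```

A real 30× genome contains hundreds of millions of such records, which is why nobody reads them by eye.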

⚙️ What You Can (and Cannot) Do with a CRAM File

Task	Can You Do It with a CRAM File?	Notes / Tools
Generate a variant call (VCF/gVCF)	✅ Yes	You can use tools like GATK, bcftools, or DeepVariant to call variants. This is the first step to make your data usable by most interpretation services.
Upload to consumer interpretation services	⚠️ Sometimes	Most third-party DNA services (e.g., Genetic Genie, Sequencing.com, Nebula interpretation portal) require a VCF file. You’ll likely need to convert the CRAM → VCF first.
View in genome browsers	✅ Yes	You can visualize your reads in tools like IGV (Integrative Genomics Viewer) to inspect alignments, coverage, etc.
Perform ancestry analysis / health trait analysis	⚠️ Indirectly	Only after generating variant calls. CRAM is “raw-ish” — you need variants to analyze genetic makeup.
Reanalyze with improved tools later	✅ Yes	Because it keeps the reads themselves, you can re-call variants as methods improve. Moving to a newer reference (e.g., GRCh38 → T2T) also requires realigning the reads first.
Compute coverage / quality metrics	✅ Yes	You can assess sequencing quality, coverage uniformity, etc. using tools like samtools.
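
As a sketch of the "coverage / quality metrics" row, the Python below summarizes per-position depth values of the kind `samtools depth` prints: three tab-separated columns (chromosome, position, depth). The four example lines are invented, the 10× threshold is an arbitrary assumption, and `summarize_depth` is a hypothetical helper name. Note that `samtools depth` skips positions with zero coverage unless asked to include them (the `-a` option, as far as I know), which would inflate the average.

```python
def summarize_depth(lines, min_depth=10):
    """Return (mean depth, fraction of positions at or above min_depth)."""
    total_depth = 0
    positions = 0
    well_covered = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        _chrom, _pos, depth_text = line.split("\t")
        depth = int(depth_text)
        total_depth += depth
        positions += 1
        if depth >= min_depth:
            well_covered += 1
    if positions == 0:
        return 0.0, 0.0
    return total_depth / positions, well_covered / positions


# Made-up depth lines for four neighboring positions.
example = ["chr1\t100\t30", "chr1\t101\t28", "chr1\t102\t5", "chr1\t103\t33"]
mean_depth, fraction = summarize_depth(example)
print(f"mean depth {mean_depth:.1f}, fraction at >= 10x: {fraction:.2f}")
```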

🧠 How to Go From CRAM → Interpretable Data

If MyHeritage gives you a CRAM file, you can process it this way:

Obtain the reference genome they used (likely GRCh38 / hg38).

Use samtools or GATK to convert CRAM → BAM (if needed):

samtools view -b -T reference.fa -o output.bam input.cram

Run variant calling using one of:
- GATK HaplotypeCaller
- bcftools mpileup + call
- DeepVariant (Google’s AI-based caller)
You’ll get a VCF or gVCF file, which contains your variant calls (the interpretable part).
Then, you can upload the VCF to:
- Sequencing.com – health and trait reports
- Nebula Genomics – reanalysis / interpretation
- Ensembl Variant Effect Predictor (VEP) – gene and effect annotation
- OpenCRAVAT – custom annotation for disease and pharmacogenomics
- ClinVar or gnomAD lookups – to see known variant effects
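
The variant-calling step above can be sketched as a pair of bcftools commands. The Python below only builds and prints the command line; it does not run anything, so it is safe to try without the tools installed. The file names are placeholders, the function name is made up, and the flags shown (`-Ou`, `-f`, `-mv`, `-Oz`, `-o`) follow commonly documented bcftools usage, so check them against your installed version before relying on them. On a whole genome, the real run can take many hours.

```python
import shlex


def build_variant_calling_commands(cram_path, reference_path, output_vcf):
    """Return the two bcftools commands (as argument lists) for CRAM -> VCF."""
    # Step 1: pile up the reads against the reference, streaming uncompressed BCF.
    mpileup = ["bcftools", "mpileup", "-Ou", "-f", reference_path, cram_path]
    # Step 2: call variants (-m multiallelic caller, -v variants only),
    # writing a compressed VCF (-Oz).
    call = ["bcftools", "call", "-mv", "-Oz", "-o", output_vcf]
    return mpileup, call


def as_shell_pipeline(first, second):
    """Render two argument lists as one shell pipeline string."""
    return shlex.join(first) + " | " + shlex.join(second)


mpileup_cmd, call_cmd = build_variant_calling_commands(
    "input.cram", "reference.fa", "calls.vcf.gz"
)
print(as_shell_pipeline(mpileup_cmd, call_cmd))
# bcftools mpileup -Ou -f reference.fa input.cram | bcftools call -mv -Oz -o calls.vcf.gz
```

GATK HaplotypeCaller and DeepVariant are alternatives for the same step; they are not shown here.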

🔍 Why the CRAM Is Valuable

The CRAM file is the richest version of your genome data, short of the raw FASTQ reads.
Because it contains all aligned read data, you can:

Recompute variants using newer pipelines later.
Identify complex structural variants not captured by a simple SNP VCF.
Check quality if a variant call looks uncertain.

Essentially, it gives you future-proof access to your entire genome, allowing deeper or re-analyses as technology and databases improve.

⚠️ Key Caveats

File size: CRAMs are large (typically tens of GB for 30× coverage, and the equivalent BAM is larger still). You’ll need substantial disk space and RAM for processing.
Software setup: You’ll need bioinformatics tools (samtools, GATK, bcftools, etc.) or a service that accepts CRAM directly.
Privacy: Treat it like your digital fingerprint — never share it casually.
Compatibility: If a company asks for a VCF, you can easily generate one from CRAM, but uploading CRAM directly often isn’t supported.

✅ Summary

Concept	CRAM File	VCF File
Type of data	Aligned sequencing reads	Identified genetic variants
File size	Large (tens of GB)	Small (MBs)
Contains raw read data?	Yes	No
Readable by most consumer sites?	Usually no	Yes
Allows future reanalysis?	Yes	Limited
Good for immediate interpretation?	No (needs variant calling)	Yes

If you’d like, I can create a step-by-step guide (with tools and free options) for converting your MyHeritage CRAM file into a VCF and then uploading it to a few interpretation platforms for health, traits, and ancestry.

Would you like me to do that next?

6) Now I have a lot of information and not enough knowledge, and no CRAM file yet. I will wait and see how this plays out.

Here is a YouTube video about taking the Whole Genome sample to a report:

What does the information in a WGS look like? Here's an image from the video before the analysis and report:

7) What's in the future? Will MyHeritageDNA provide the analytical services for the Whole Genome Sequencing file in the future? Stay tuned!

==========================================

Disclosure: I receive a complimentary Complete subscription to MyHeritage, and have received other material considerations in past years. I uploaded my autosomal DNA raw data to their DNA product. This does not affect my objective analysis of MyHeritage products. I am a paid subscriber to Family Tree Webinars and love it.

The URL for this post is:

Copyright (c) 2025, Randall J. Seaver

Please comment on this post on the website by clicking the URL above and then the "Comments" link at the bottom of each post. Share it on X, Facebook, or Pinterest using the icons below. Or contact me by email at randy.seaver@gmail.com. Note that all comments are moderated, so they may not appear immediately.

Subscribe to receive a free daily email from Genea-Musings using www.Blogtrottr.com.

1 comment:

Louis Kessler said (Tuesday, October 14, 2025 at 12:52:00 PM PDT): Ha ha! Maybe it's time to go back to school and get an advanced degree in Genomics - a subject that didn't even exist when we were in University.
