Average Percentages of WHG, ENF, and ESH Ancestry in Modern Europeans

The percentage of the WHG (Loschbour) component in Europeans is one of the most variable and least accurate estimates there is. For example, my own value ranges from 37.03% on Eurasia K6 to 8-10% on the G25 calculators and Admixtools.

Calculator                            | WHG (%) | Eastern_HG (%) | European_HG (WHG+EHG) (%)
Eurasia K6                            | 37.03   | -              | -
Eurogenes K13 (WHG-ANE-EEF-Calc.xls)  | 23      | -              | -
FTDNA (Ancient Origins)               | 8       | 22             | 30
G25 Calculators                       | 8-10    | -              | -
Admixtools                            | 8-10    | -              | -
Illustrative DNA                      | 8.4     | 22.2           | 30.6

The Eurasia K6 value is obviously too high. Another high value is the 23% given by the EEF-WHG-ANE calculator, which is based on the Eurogenes K13 values.
The values from this calculator were used in the maps of the EEF-WHG-ANE components on Eupedia.

Most G25 calculators and Admixtools, and indirectly FTDNA Ancient Origins and Illustrative DNA as well, give me approximately 8% WHG, which I think is a more realistic value.

FTDNA Ancient Origins gives me 30% European Hunter Gatherer (Eastern + Western HG) and 13% Metal Age Invader, which represents the CHG component. My Steppe value is approximately 35%, which leads to 8% WHG with some simplistic calculations, assuming that Steppe is Eastern_HG + CHG:
35 (Steppe) - 13 (CHG) = 22 (Eastern_HG), which leads to 30 (European_HG) - 22 (Eastern_HG) = 8 (WHG)
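
The same back-of-the-envelope arithmetic can be written as a short Python sketch. The variable names are only for illustration, and the one modelling assumption is the simplistic one stated above, namely that Steppe is Eastern_HG + CHG.

```python
# Percentages reported for my own sample.
european_hg = 30.0   # FTDNA "European Hunter Gatherer" (Eastern + Western HG)
chg = 13.0           # FTDNA "Metal Age Invader", taken here as the CHG component
steppe = 35.0        # approximate Steppe value

# Simplifying assumption: Steppe = Eastern_HG + CHG.
eastern_hg = steppe - chg
whg = european_hg - eastern_hg

print(f"Eastern_HG = {eastern_hg:.0f}%, WHG = {whg:.0f}%")
```

Any error in the Steppe or CHG estimate carries straight through to the WHG figure, so this is only a rough consistency check.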

The WHG percentages that I find with Admixtools are proportionally lower for all European populations, ranging for example from a minimum of 2% in Northern Italy to 22% in Estonia, and in my opinion they are closer to reality.

The population density of the WHG was probably never very high, and their genome was diluted first by the Early Farmers in the Neolithic and later by the Steppe people from the Bronze Age on.
Currently, of the major European components (WHG, EEF, WSH), it is the smallest.
 
It depends on the calculator and whether it includes Anatolian farmers and Yamnaya (or other steppe sources). If a calculator omits the Steppe component, for example, your HG percentage will inflate quickly, as you see in some calculators. The same happens if the Anatolian farmer component is omitted. Why? Because both Anatolian Farmer (aka European Farmer) and steppe sources carry WHG-like ancestry and are themselves mixtures. So it is basic statistics, with the variables overfitting depending on the sources used.
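
To make the inflation effect concrete, here is a toy Python sketch. Everything in it is invented for illustration: the three "deep ancestry" axes, the coordinates of the sources, and the target's mixture weights are hypothetical and do not come from any real calculator. It only shows that forcing a fit with a missing source pushes the leftover ancestry onto the sources that remain.

```python
# Toy illustration with made-up numbers: each population is a point in a
# hypothetical 3-axis space (HG-like, Anatolian-like, CHG-like).
WHG = (1.0, 0.0, 0.0)
FARMER = (0.1, 0.9, 0.0)   # assumed to carry some WHG-like ancestry
STEPPE = (0.5, 0.0, 0.5)   # assumed half HG-like (via EHG), half CHG-like

def mix(weights, sources):
    """Weighted average of the source coordinates."""
    return tuple(sum(w * s[i] for w, s in zip(weights, sources)) for i in range(3))

def fit_two_sources(target, a, b):
    """Least-squares share of `a` when target is forced to be w*a + (1-w)*b."""
    diff_ab = [x - y for x, y in zip(a, b)]
    diff_tb = [x - y for x, y in zip(target, b)]
    w = sum(p * q for p, q in zip(diff_tb, diff_ab)) / sum(p * p for p in diff_ab)
    return min(1.0, max(0.0, w))  # keep the proportion between 0 and 1

true_weights = (0.10, 0.50, 0.40)  # WHG, Farmer, Steppe (constructed, not real)
target = mix(true_weights, (WHG, FARMER, STEPPE))

whg_without_steppe = fit_two_sources(target, WHG, FARMER)
print(f"constructed WHG share: {true_weights[0]:.2f}")
print(f"WHG share when Steppe is omitted: {whg_without_steppe:.2f}")
```

With these made-up coordinates the two-source fit assigns roughly 0.39 to WHG, although the target was built with only 0.10, because the HG-like part of the omitted Steppe source has nowhere else to go.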
 

Yes, the clustering methods used by those calculators (Dodecad, Eurogenes), which are based on ADMIXTURE-style runs with K ancestral components rather than K-means in the strict sense, are more likely to overfit, especially if the number of clusters (K) is too high or the reference populations are not well chosen, and this can lead to misleading ancestry estimates.

That's why I think the Admixtools qpAdm method, which is based on f-statistics, is preferable for comparing different hypothetical admixture models. It provides a more robust statistical framework for ancient DNA studies.
qpAdm is less prone to overfitting than those other calculators because it uses a more rigorous statistical approach to test different admixture models and reject implausible ones based on p-values and admixture proportions.
This helps minimize overfitting and makes qpAdm more reliable for distinguishing between different historical admixture scenarios.
 
Can you break down the statistical calculations that make qpAdm a more robust approach? I ask purely out of curiosity.
 

Sure, I'm not an expert, but here's what I consider the essentials of what I've read about qpAdm.

qpAdm relies on f4-statistics, which measure the correlation of allele frequency differences between two pairs of populations.
These statistics help detect admixture by comparing the possible admixture scenarios.
To do this, qpAdm fits a series of hypothetical models to the data.
These models represent different possible histories of admixture.
By comparing these models, qpAdm can identify the most plausible scenarios.

The tool calculates the relative proportion of ancestry that can be attributed to each source population in the model.
This helps in understanding the genetic contributions from different populations.

qpAdm works with the genetic data from the target population and potential source populations.
This data typically consists of allele frequencies at various genetic markers.
It also needs data from additional reference populations (often called "right" populations or outgroups) that are not directly involved in the admixture but help in modelling the genetic background.

After defining the admixture model, for example, the hypothesis that the target population is a mixture of two or more source populations, we use qpAdm to fit the model to the data.
The tool will calculate the f4-statistics and test the fit of the model.
qpAdm will provide estimates of the admixture proportions and assess the fit of the model.
This involves comparing the observed f4-statistics with those expected under different admixture scenarios.
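
As I understand it, the core idea behind the fit for two sources is that if Target = α·Source1 + (1−α)·Source2, then for any set of outgroups O1, O2, O3 the statistic f4(Target, O1; O2, O3) should equal α·f4(Source1, O1; O2, O3) + (1−α)·f4(Source2, O1; O2, O3). The Python sketch below estimates α by plain unweighted least squares over a few outgroup combinations. This is a simplification and not ADMIXTOOLS code: the real qpAdm also uses the covariance of the statistics and a formal test of model fit. The function name and all f4 values are made up for illustration.

```python
def estimate_alpha(f4_target, f4_source1, f4_source2):
    """Unweighted least-squares alpha for target = alpha*S1 + (1-alpha)*S2.

    Each argument is a list of f4 statistics of the form f4(X, O1; O2, O3),
    one value per combination of outgroups.
    """
    num = 0.0
    den = 0.0
    for t, s1, s2 in zip(f4_target, f4_source1, f4_source2):
        num += (t - s2) * (s1 - s2)
        den += (s1 - s2) ** 2
    return num / den

# Made-up f4 values for two hypothetical sources across four outgroup sets.
f4_s1 = [0.010, -0.004, 0.006, 0.002]
f4_s2 = [-0.002, 0.008, 0.001, -0.005]

# Build a target that is exactly 30% source 1 and 70% source 2.
true_alpha = 0.30
f4_t = [true_alpha * a + (1 - true_alpha) * b for a, b in zip(f4_s1, f4_s2)]

alpha = estimate_alpha(f4_t, f4_s1, f4_s2)
print(f"estimated share of source 1: {alpha:.2f}")
```

Because the target here is constructed without noise, the estimate recovers the constructed proportion. With real data the statistics are noisy, which is why the fit has to be assessed statistically.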

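The f4-statistic is calculated as an average over all SNPs:
\[ f_4(\mathrm{Target}, \mathrm{Source}; \mathrm{Ref1}, \mathrm{Ref2}) = \frac{1}{M} \sum_{i=1}^{M} (p_{\mathrm{Target},i} - p_{\mathrm{Source},i}) \cdot (p_{\mathrm{Ref1},i} - p_{\mathrm{Ref2},i}) \]
where \( p_{X,i} \) is the allele frequency of population X at SNP i and M is the number of SNPs.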

This statistic measures the correlation of allele frequency differences between the pairs (Target, Source) and (Ref1, Ref2).
If \( f_4 \) is significantly different from zero, it suggests that Target and Source are not symmetrically related to Ref1 and Ref2, i.e. one of them shares extra genetic drift with one of the reference populations, which is consistent with gene flow. The sign and magnitude of \( f_4 \) provide information about the direction and strength of that signal.
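
A minimal Python sketch of this calculation, averaging the product of the two frequency differences over SNPs. The allele frequencies are made up for four hypothetical populations, and real analyses use hundreds of thousands of SNPs rather than five.

```python
def f4(p_a, p_b, p_c, p_d):
    """Average over SNPs of (p_a - p_b) * (p_c - p_d)."""
    products = [(a - b) * (c - d) for a, b, c, d in zip(p_a, p_b, p_c, p_d)]
    return sum(products) / len(products)

# Made-up allele frequencies at five SNPs.
p_target = [0.10, 0.50, 0.80, 0.30, 0.60]
p_source = [0.20, 0.40, 0.70, 0.35, 0.55]
p_ref1 = [0.05, 0.60, 0.90, 0.20, 0.70]
p_ref2 = [0.15, 0.45, 0.75, 0.30, 0.60]

value = f4(p_target, p_source, p_ref1, p_ref2)
print(f"f4(Target, Source; Ref1, Ref2) = {value:.4f}")
```

Swapping the two populations within one pair flips the sign of the statistic, which is why the sign carries information about which population is closer to which reference.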

The p-value associated with an f4-statistic indicates the probability of seeing a value at least this far from zero by chance alone. A low p-value (typically < 0.05) means that the null hypothesis (f4 = 0, i.e. no excess sharing) is rejected, which points to admixture. A high p-value (≥ 0.05) means there is insufficient evidence of admixture, so the null hypothesis cannot be rejected. Note that for a qpAdm model as a whole the logic runs the other way round: the null hypothesis is that the proposed model fits the data, so a low p-value rejects the model, while a high p-value means the model cannot be rejected and remains plausible.
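
To show where such a p-value can come from, here is a simplified Python sketch that turns per-SNP f4 products into a standard error, a Z-score and a two-sided p-value using a delete-one-block jackknife. The assumptions are mine: the blocks are equal-sized runs of consecutive values, the per-SNP products are made up, and the function name is hypothetical. As far as I know, ADMIXTOOLS uses a weighted block jackknife over contiguous genomic blocks, and results are often reported as Z-scores (for example |Z| > 3) rather than p-values.

```python
import math

def block_jackknife(values, n_blocks):
    """Mean of `values`, its Z-score and a two-sided p-value (equal-sized blocks)."""
    size = len(values) // n_blocks
    used = size * n_blocks            # drop any leftover values at the end
    total = sum(values[:used])
    estimate = total / used

    # Recompute the mean with each block left out in turn.
    leave_one_out = []
    for b in range(n_blocks):
        block_sum = sum(values[b * size:(b + 1) * size])
        leave_one_out.append((total - block_sum) / (used - size))

    mean_loo = sum(leave_one_out) / n_blocks
    variance = (n_blocks - 1) / n_blocks * sum((x - mean_loo) ** 2 for x in leave_one_out)
    z = estimate / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail
    return estimate, z, p_value

# Made-up per-SNP products (p_Target - p_Source) * (p_Ref1 - p_Ref2).
products = [0.012, 0.008, 0.011, 0.009, 0.010, 0.013, 0.007, 0.010]
estimate, z, p_value = block_jackknife(products, n_blocks=4)
print(f"f4 = {estimate:.4f}, Z = {z:.1f}, p = {p_value:.2g}")
```

Leaving out whole blocks rather than single SNPs is meant to account for the fact that neighbouring SNPs are correlated, so the standard error is not underestimated.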

Multiple studies have shown that qpAdm can provide accurate results even in challenging conditions, like low data coverage, high rates of missing data, and ancient DNA damage.

There are some best practices for using qpAdm, such as limiting the number of reference populations in a single model, and being cautious with extended periods of gene flow.

Used with these steps and best practices in mind, qpAdm provides a robust and accurate method for detecting and quantifying admixture events, even in complex scenarios.
 