
Genetic study ARGMix - a graph transformer for ancient ancestry inference

Tautalus

An interesting paper on how to infer local ancestry by applying a graph transformer to ancestral recombination graphs (ARGs).
This work addresses a central challenge in genetics: determining which parts of modern human DNA come from different ancient populations. As the admixture events become more distant in time, the inherited DNA segments become smaller and more fragmented, making it increasingly difficult for traditional methods to accurately assign their origins. Earlier approaches typically relied on comparing segments of DNA to reference populations based on overall similarity, which works well for recent admixture but becomes unreliable for ancient events and can be biased by differences in total ancestry proportions between populations.
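
To put rough numbers on why older admixture is harder, here is a back-of-the-envelope sketch. It assumes a simple single-pulse admixture model in which tract lengths are approximately exponential with a mean of about 100/g centimorgans after g generations. That approximation, the function name, and the 29-year generation time are my assumptions for illustration, not something taken from the paper, and it ignores admixture proportions, drift, and later gene flow.

```python
def mean_tract_length_cm(generations: float) -> float:
    """Approximate mean ancestry tract length in cM, g generations after one admixture pulse."""
    if generations <= 0:
        raise ValueError("generations must be positive")
    # Roughly one recombination per Morgan per generation breaks tracts down,
    # so the mean length is about 1/g Morgans, i.e. 100/g cM.
    return 100.0 / generations


YEARS_PER_GENERATION = 29  # assumed value, for illustration only

for years_ago in (300, 3000, 8000):
    g = years_ago / YEARS_PER_GENERATION
    print(f"{years_ago:>5} years ago (~{g:.0f} generations): "
          f"~{mean_tract_length_cm(g):.2f} cM mean tract length")
```

The point is only the trend: tracts from a Neolithic-era event are expected to be a small fraction of a centimorgan, which is where similarity-based segment matching starts to struggle.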

The paper introduces a new method, ARGMix, which reframes the problem by modelling genetic data as a network of relationships rather than a simple sequence. Using a graph transformer, a form of deep learning designed to analyse structured data, it leverages ancestral recombination graphs to capture how DNA segments are related through evolutionary history. By incorporating information about when lineages share common ancestors and using ancient DNA samples as references, the method can more accurately trace the origin of even very small DNA fragments. This represents a shift from surface-level pattern matching to reasoning over genealogical structure, leading to substantial improvements in accuracy and robustness compared to previous approaches.

A key innovation of the method is its ability to perform ancestry-specific analyses by “masking” the genome. In practice, this means that DNA segments not belonging to a chosen ancestry are temporarily hidden, allowing comparisons to be made using only a single ancestral component. This avoids a major limitation of earlier methods, where populations could appear similar simply because they share higher proportions of a given ancestry, rather than because their ancestry is more closely related.
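
The masking idea can be illustrated with a minimal sketch. This is not the paper's implementation: the array layout (haplotypes by sites), the function names, the ancestry labels, and the choice to handle hidden alleles by per-site mean imputation before PCA are all my assumptions, and the data are random.

```python
import numpy as np


def mask_to_ancestry(genotypes, ancestry, target):
    """Hide (set to NaN) every allele whose local-ancestry label differs from `target`."""
    masked = genotypes.astype(float).copy()
    masked[ancestry != target] = np.nan
    return masked


def pca_with_missing(masked, n_components=2):
    """PCA in which hidden entries are imputed by the per-site mean (zero after centering)."""
    observed = ~np.isnan(masked)
    counts = observed.sum(axis=0)
    sums = np.where(observed, masked, 0.0).sum(axis=0)
    # Sites with no observed allele get a mean of 0 and contribute nothing.
    site_means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    centered = np.where(observed, masked - site_means, 0.0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Random toy data: 6 haplotypes, 50 sites, one hypothetical ancestry label per allele.
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 2, size=(6, 50))
ancestry = rng.choice(["Anatolian", "Steppe", "WHG"], size=(6, 50))

masked = mask_to_ancestry(genotypes, ancestry, "Anatolian")
coords = pca_with_missing(masked)
print(coords.shape)  # (6, 2)
```

Because every haplotype is compared only on alleles labelled with the target ancestry, two samples can no longer look similar just because both carry a lot of that ancestry.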

Applying this framework to ancient and modern European genomes reveals new insights into population history. In the case of Ötzi the Iceman, previous studies consistently found that he clustered most closely with Sardinians, an observation driven by their high proportion of early farmer ancestry. However, when the analysis is restricted to only the Anatolian farmer component, the paper shows that Ötzi’s ancestry aligns more closely with present-day populations from northern Italy, particularly around Bergamo. This finding suggests a degree of local genetic continuity in the Alpine region that had been obscured by later admixture events.
Abstract

Local ancestry inference classifies segments of DNA in admixed individuals by their originating population. However, as the date of admixture becomes older, these segments become shorter and determining their ancestry becomes increasingly difficult. This limits many existing segment-based methods to relatively recent historical admixture events and more highly diverged populations. The rapidly expanding availability of ancient DNA offers a promising opportunity to use these ancient samples as references for local ancestry inference. A recent approach integrates ancient samples into the ancestral recombination graph (ARG) for local ancestry inference. Here, we introduce recent advances in deep learning for graphs into this ARG framework to create ARGMix, a graph transformer that infers local ancestry using the coalescent trees of the inferred ARG. Our approach employs ancient samples as references in the marginal trees to predict local ancestry. We train ARGMix on data reflecting the well-understood ancient European demography and demonstrate improved accuracy and robustness even under demographic misspecification. We then apply ARGMix to an ARG of ancient and present-day European samples for ancestry-specific analyses, finding evidence of continuity between Ötzi the Iceman and present-day individuals from nearby regions.

Population structure of present-day Europeans and the Iceman. (A) Principal component analysis (PCA) of the Iceman and European populations from the Human Genome Diversity Project and the 1000 Genomes Project. (B) Anatolian-specific PCA of the same populations generated by masking non-Anatolian ancestry.
db1Ou4M.png
 
This is fascinating! Thus earlier comparisons may have overstated distance or closeness because they were being driven by mixed ancestry proportions rather than true ancestry-specific relatedness.

I wish I had access to this; there are so many samples I would love to test.
 
I'd be curious to see if Latins are actually closest to Central Italians. That would be the first test I would run.
 
So if I understand the high-level overview, ARGMix infers which segments in several more recent populations derive from a given earlier population, isolates that contribution, and is then able to compare the differences within that isolated component across the more recent populations in the dataset.

This is an interesting idea, though I would say that the inference of where certain types of ancestry originated, and of exactly what is being isolated, is likely prone to error as well. Neolithic Anatolian contributions are very clear-cut, so this example is a good use case. The implication here is that Neolithic farmer contributions to modern northern Italians were more local rather than coming from other parts of Europe.

It would also be great to see, as a standard output when running this type of model, what percentage of each population's total ancestry this isolate represents. A target population could have the closest isolate to a given ancient population but also carry very little or a very large amount of said ancestry.
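
As a sketch of what such a summary could look like, the snippet below turns per-segment local ancestry calls into genome-wide fractions. The segment format (start, end, label), the function name, and the example calls are hypothetical. I do not know what output format ARGMix actually produces.

```python
from collections import defaultdict


def ancestry_fractions(segments):
    """Map each ancestry label to its share of the total covered length.

    `segments` is an iterable of (start_bp, end_bp, ancestry_label) tuples.
    """
    totals = defaultdict(int)
    for start, end, label in segments:
        if end <= start:
            raise ValueError(f"empty or reversed segment: {(start, end, label)}")
        totals[label] += end - start
    covered = sum(totals.values())
    if covered == 0:
        return {}
    return {label: length / covered for label, length in totals.items()}


# Made-up calls for one haplotype over a 1 Mb region.
segments = [
    (0, 400_000, "Anatolian"),
    (400_000, 550_000, "Steppe"),
    (550_000, 900_000, "Anatolian"),
    (900_000, 1_000_000, "WHG"),
]

fractions = ancestry_fractions(segments)
for label, frac in sorted(fractions.items()):
    print(f"{label}: {frac:.1%}")
```

Reporting this number next to any ancestry-specific comparison would show whether a close match rests on a large or a very small portion of the genome.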
 
You may be able to use this to try to locate the Greek origin of contributions in Italy, as well as Italic contributions from Bronze Age populations beyond the Alps.
 
Methodologically interesting, but the results regarding Ötzi's population affinities are puzzling. As is often the case, the paper relies on limited sampling, with datasets that cover European diversity only partially. Southern Europe in particular is severely underrepresented: in the panel, the Toscani (TSI) are the genetically southeasternmost European population, which makes the analysis poorly suited for drawing robust conclusions about Neolithic continuity in southern Europe. Visually, in the ancestry-specific PCA, Ötzi appears to fall in an intermediate zone between Bergamo HGDP and TSI, closer to Bergamo HGDP, though with some TSI individuals ending up north of Ötzi and some Bergamo HGDP individuals (4 out of 11, about 36% of the sample, which is quite a lot) drifting south into the TSI cluster. Reading precise distances from a PCA is methodologically risky without the underlying numerical data, and the picture seems somewhat more complex than the paper presents it. The claim that "the clustering of the Iceman with modern day Bergamo Italians in their Neolithic farmer DNA segments suggests continuity to the present day in the Alpine region" might seem compelling and makes for a good narrative, but the ancestry-specific PCA published in the paper itself suggests the reality is a bit more nuanced.

It is also quite clear that the TSI shows, on average, a higher proportion of post-Neolithic or even post-Bronze Age ancestry compared to the Bergamo HGDP, given its more south-easterly position in a PCA. The Levant serves as an indicator of a south-easterly shift of the TSI relative to the Bergamo HGDP and could be replaced by any other sample source indicating the same direction (for example, Bronze or Iron Age Aegean-Anatolia). None of this is new. It almost feels like rediscovering something already well established.

It is also worth noting that the Bergamo HGDP sample originates from the Val Seriana, in the Prealps, according to the CEPH coordinates, and shows a Neolithic/Chalcolithic shift relative to other northern Italian samples, something visible in any decent PCA. This makes the clustering with Ötzi considerably less surprising than it might appear.

A more fundamental question remains open: what does Ötzi's Neolithic profile actually represent? Ötzi carries high EEF and low WHG ancestry. Is he within the average range of Alpine Chalcolithic samples, or is he something of an outlier? The most recent study I am aware of seems to treat him as somewhat of an outlier, which makes interpreting his similarities with modern populations even more difficult: Ötzi had lower WHG than other contemporary samples and very high EEF, which obviously links him more closely to northern Italy. (Modern Sardinians, if I recall correctly, have higher WHG than the Bergamo HGDP.)

So what has this paper actually demonstrated? That Bergamo HGDP carries more Ötzi-like descent than Sardinia HGDP? And that the well-known similarity between Ötzi and Sardinians simply reflects the fact that Sardinians retained a higher overall Neolithic contribution, rather than any specific descent from Ötzi or related populations? This could have been argued already. Personally, I never found the Sardinian connection particularly convincing as evidence of direct descent, and I thought that was the general consensus. The paper may well be correct, but it is not clear to me that it has rigorously demonstrated its claims. More diverse test cases analyzed with ARGMix would be needed to properly evaluate whether the method is truly accurate and reliable.
 
I agree that the sampling limitations, especially for southern Europe, make strong conclusions difficult.

That said, the way I read the ancestry-specific PCA, the more apparent pull of the Tuscans relative to the Bergamo sample seems to be primarily along an east–west axis rather than a north–south one. If it were north–south, the Basques would be the genetically northernmost population in Europe. (I bet that if there were more samples from Bergamo, the bulk of them would nest more decisively between the Tuscan and the French datasets, as that makes more sense geographically and historically.) In other words, in this PCA TSI appears shifted eastward rather than simply southward in relation to Bergamo HGDP, which complicates the idea of interpreting Ötzi’s position as intermediate in a strictly latitudinal sense.

This also reinforces your broader point: without the underlying numerical coordinates, interpreting relative distances and directions on the PCA remains somewhat speculative, and the visual impression alone may oversimplify a more complex structure.
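
If the numerical coordinates were released, the comparison could be made explicit rather than read off the plot. Here is a minimal sketch using made-up coordinates and placeholder population names. The function name is hypothetical, and Euclidean distance to group centroids is just one reasonable summary, not what the paper reports.

```python
import numpy as np


def centroid_distances(target, groups):
    """Euclidean distance from `target` (PC coordinates) to each group's centroid."""
    target = np.asarray(target, dtype=float)
    return {
        name: float(np.linalg.norm(np.asarray(points, dtype=float).mean(axis=0) - target))
        for name, points in groups.items()
    }


# Made-up PC1/PC2 coordinates, for illustration only.
target = [0.0, 0.0]
groups = {
    "pop_A": [[1.0, 0.0], [3.0, 0.0]],
    "pop_B": [[0.0, 1.0], [0.0, 1.0]],
}

distances = centroid_distances(target, groups)
for name, dist in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{name}: {dist:.2f}")
```

Even then, distances in the first two PCs capture only part of the variance, so they would support a visual impression rather than replace the underlying ancestry inference.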
 

There are no packages available for download yet. The indicated GitHub repository is empty.
But it will not be easy to implement, as it involves a long sequence of steps: input data quality control and sampling, ARG inference and estimation of genealogical relationships, training of the graph-based neural network on simulated data, application of the trained model to infer ancestry along real genomes, ancestry-specific masking, and visualisation and interpretation using PCA.
 

Agreed. If the goal is to demonstrate that this new method can identify which modern population carries the most Ötzi-like descent, the panel is missing genuine southeastern European samples altogether. This is not a trivial omission: the prevailing view is that EEF ancestry is more typical of southern Europeans broadly, so a more complete southern European cluster would be the most informative test case. PCA results become truly meaningful only when the sampling is reasonably complete.

You are right that in these PCAs TSI is not southeast but east of Bergamo HGDP. A Levantine source cannot explain an eastward shift, which is more plausibly explained by a Balkan source, including Bronze and Iron Age Balkan populations.

Looking at the Anatolian-specific PCA, Ötzi appears to fall in the middle of a convergence zone between TSI, French, and Iberian samples, an area partially overlapping with Bergamo HGDP. One would expect more decisive evidence for the paper's conclusions. It is also worth noting that in the Anatolian-specific PCA, TSI appears visually a bit closer to the Russians than to the Basques, who drift in a distinct direction. Bergamo HGDP, meanwhile, falls mostly within the TSI cluster and partly within the Iberian one. Given all this, what has ARGMix actually found? That remains genuinely unclear to me.

Additional methodological concerns are worth flagging: projection bias in the placement of ancient samples, and the highly uneven sample sizes across populations (some represented by 10 to 20 individuals, others by 100), both of which can distort the geometry of the PCA in ways that are difficult to assess without the underlying numerical data.
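
One common way to reduce the sample-size effect is to downsample every population to the same size before computing the PCA and then project the remaining individuals. Below is a minimal sketch of the downsampling step only. The function name and the population counts are illustrative assumptions, not the panel's actual sizes.

```python
import numpy as np


def balanced_subsample(labels, per_group, seed=0):
    """Return sorted row indices with at most `per_group` individuals per population."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    keep = []
    for pop in np.unique(labels):
        idx = np.flatnonzero(labels == pop)
        take = min(per_group, idx.size)
        # Sample without replacement so no individual is duplicated.
        keep.extend(rng.choice(idx, size=take, replace=False))
    return np.sort(np.array(keep, dtype=int))


# Illustrative counts only.
labels = ["TSI"] * 100 + ["Bergamo"] * 11 + ["Sardinian"] * 25
idx = balanced_subsample(labels, per_group=11)

kept = np.asarray(labels)[idx]
for pop in np.unique(kept):
    print(pop, int((kept == pop).sum()))
```

Repeating this with several seeds and checking whether the target sample's position is stable would give some sense of how much the plotted geometry depends on panel composition.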

NHI1y6A.png
 
The key point is to clearly separate where the ancestry inference comes from and how it is visualized. The relationship between Ötzi the Iceman and northern Italians is not derived from the PCA plot, but from the ancestral recombination graph (ARG), which represents the genealogical relationships between DNA segments. ARGMix works by looking, for each segment of Ötzi’s genome, at how it is positioned in this genetic family tree, specifically which reference samples it connects to and how recently they share common ancestors. This allows the method to distinguish not just how much ancestry is shared, but the type and structure of that ancestry. The observed affinity with northern Italians therefore reflects patterns in these underlying genealogical connections, rather than a simple visual clustering. The PCA shown in the paper is therefore only a downstream visualization of already inferred ancestry segments, not the evidence itself. As such, ambiguities or overlaps in PCA space do not invalidate the ARG based inference.
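
To make the idea of "which reference samples a segment connects to, and how recently" concrete, here is a toy sketch on a single marginal tree. It is a plain nearest-reference heuristic based on coalescence time, not ARGMix's graph transformer. The tree, node times, sample names, and function names are all invented for illustration.

```python
def ancestors(node, parent):
    """Yield `node`, then each of its ancestors up to the root."""
    while node is not None:
        yield node
        node = parent.get(node)


def tmrca(a, b, parent, time):
    """Time of the most recent common ancestor of `a` and `b` in one marginal tree."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return time[node]
    raise ValueError("nodes share no ancestor in this tree")


def closest_reference(target, references, parent, time):
    """Ancestry label of the reference that coalesces most recently with `target`."""
    best = min(references, key=lambda ref: tmrca(target, ref, parent, time))
    return references[best]


# Toy marginal tree for one genomic segment: child -> parent, plus internal node times
# in generations. All values are made up.
parent = {
    "target": "n1", "ref_anatolian": "n1",
    "n1": "n2", "ref_steppe": "n2",
    "n2": "root", "ref_whg": "root",
}
time = {"n1": 120, "n2": 300, "root": 900}
references = {"ref_anatolian": "Anatolian", "ref_steppe": "Steppe", "ref_whg": "WHG"}

print(closest_reference("target", references, parent, time))  # Anatolian
```

A learned model can use far more of the tree than the single nearest reference, such as branch lengths, topology, and neighbouring trees along the genome, which is presumably where the claimed gains in accuracy come from.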

That said, the strength of the conclusion still depends on the reference populations included. If important southeastern European groups are missing, the method may identify the closest match among the available samples rather than the true closest population overall. Adding more southern and Balkan populations would likely refine the result and test its robustness.
 
Fair enough, and I obviously agree: according to the paper, the primary evidence comes from ARGMix's local ancestry classification applied to the ancestral recombination graph (ARG), which represents the genealogical relationships between DNA segments, and the PCA is explicitly described as a downstream visualization rather than the core evidence. The TreeMix analysis on Anatolian-specific allele counts is also used as an independent confirmation. But ARGMix is not yet downloadable or testable, as you yourself noted, so the PCA is all we have to look at for now. And even as a downstream visualization, the PCA is something the authors chose to publish in support of their conclusions. If it does not clearly show what the paper claims, that is worth pointing out regardless of where the primary evidence comes from.

That said, Bergamo HGDP is also the geographically closest population to Ötzi's find site in the entire panel, and in my opinion it shows a Neolithic/Chalcolithic shift relative to other northern Italian samples, visible in any decent PCA. Given these two facts alone, finding an affinity between Ötzi and Bergamo HGDP is not particularly surprising, regardless of the method used. In my view the result would need to be considerably more robust to be genuinely informative. The sampling limitation also remains a problem. If southeastern European and Balkan populations are missing, not to mention the lack of samples from Austria and Trentino-Alto Adige, the very area where Ötzi was found, ARGMix can only identify the closest match among the available samples, not necessarily the truly closest population overall. Adding more southern European populations would be the real test of whether the signal is genuine or simply a reflection of what was available in the panel. One further observation worth noting: HGDP actually includes three Italian samples, Bergamo, Tuscan, and Sardinian. The authors used Bergamo and Sardinian HGDP, but for the Tuscans they chose the TSI sample from 1000 Genomes rather than Tuscan HGDP. This is curious, because Tuscan HGDP appears to show a slightly higher Neolithic shift compared to TSI. :)

As I understand it, one of the paper's goals was to demonstrate the effectiveness of ARGMix as a method. In that light, the choice of Ötzi as a test case looks rather safe: a geographically close and already well-characterized sample, with a panel conveniently lacking the populations that would have made the test truly challenging.

I am aware that my observations may come across as pedantic, but I have been following geneticists for too long not to be.
 
Well, most if not all of the previous tools at our disposal show similarity rather than direct descent anyway. So while I might share 10% of my genome with Croats (just for the sake of argument), all that shows is that we have one or more common ancestors at some undetermined time in the past. Unless the 10% that we share is unique and not shared with other Balkan populations, it is really meaningless. Also, genuine closeness of some individuals to ancient populations from, let's say, the LBA is nearly impossible unless they come from a genetically isolated population high up in the mountains. Even then, genetic drift would throw off the closeness calculations.
 