Monday, October 6, 2008

Interpreting Cavalli-Sforza's principal components of genetic variation in Europe


I created this page in response to a couple of people who were proffering interpretations of Cavalli-Sforza's principal components ("PCs") based on outdated information and a poor understanding of the topic. Luigi Luca Cavalli-Sforza was, for a time, a central figure in population genetics research, but he is certainly not infallible. The (mostly tentative) interpretations Cavalli-Sforza gave for the PCs in The History and Geography of Human Genes (HGHG) were not to be taken as gospel at the time the work was published, much less a decade on, now that new evidence has come to light.

Cavalli-Sforza's initial interpretations of the PCs

From Genes, Peoples, and Languages (1997) by Luca Cavalli-Sforza:

"Hidden patterns in the geography of Europe shown by the first five principal components, explaining respectively 28%, 22%, 11%, 7%, and 5% of the total genetic variation for 95 classical polymorphisms. The first component is almost superimposable to the archaeological dates of the spread of farming from the Middle East between 10,000 and 6,000 years ago. The second principal component parallels a probable spread of Uralic people and/or languages to the northeast of Europe. The third is very similar to the spread of pastoral nomads (and their successors) who domesticated the horse in the steppe towards the end of the farming expansion, and are believed by some archaeologists and linguists to have spread most Indo-European languages to Europe. The fourth is strongly reminiscent of Greek colonization in the first millennium B.C. The fifth corresponds to the progressive retreat of the boundary of the Basque language. Basques have retained, in addition to their language, believed to be descended from an original language spoken in Europe, some of their original genetic characteristics."

The above quotation sees Cavalli-Sforza at his most smug. In The History and Geography of Human Genes, Cavalli-Sforza was more circumspect about assigning meanings to the PCs, acknowledging, for example, the uncertainty over the meaning of PC2 and suggesting it could just as well represent cold adaptation. Here, in a paper in which Cavalli-Sforza is trying to link genetic and linguistic evolution, he glosses over the uncertain nature of his interpretations. A few years on, research has shown Cavalli-Sforza's interpretations of PC2 and PC3 to be incorrect or, at least, incomplete.
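To put the percentages in the quoted caption in perspective, the short tally below adds them up. It uses only the five figures given in the quotation (28%, 22%, 11%, 7%, 5%); the variable names are mine.

```python
# Shares of total variation reported in the quoted caption for PC1..PC5 (percent).
explained = [28, 22, 11, 7, 5]

# Running total after each component.
cumulative = []
running = 0
for share in explained:
    running += share
    cumulative.append(running)

# Whatever is left over is not represented on any of the five maps.
unexplained = 100 - cumulative[-1]

for index, (share, total) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC{index}: {share}% (cumulative {total}%)")
print(f"Not captured by the first five PCs: {unexplained}%")
```

PC2 and PC3 together account for a third of the total variation, so how they are interpreted matters a good deal; and roughly a quarter of the variation lies outside the five maps altogether.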

What was wrong with Cavalli-Sforza's initial interpretations?

Some of Cavalli-Sforza's early interpretations of European genetic variation were misguided. Cavalli-Sforza was working under the misconception that genetic variation in Europe must all be attributed to events from the Neolithic onward, since he believed Europeans were mostly descended from Neolithic farmers. Cavalli-Sforza also had a special interest in trying to link genetic variation to linguistic variation, leading him to favor linguistic interpretations of PCs, even if the evidence supporting such an interpretation was entirely circumstantial.

Subsequent research has shown that Europeans largely descend from Paleolithic hunter-gatherers, and scientists have determined that Europeans mostly weathered the last ice age in refuges in Iberia and the Ukraine. Following the LGM, Europe was repopulated from these refuges. One must now expect that these postglacial expansions are responsible for much of the genetic variation in Europe, and some of Cavalli-Sforza's old interpretations must be discarded -- specifically, his interpretations of PC2 and PC3.

The Genetic Legacy of Paleolithic Homo sapiens sapiens in Extant Europeans: A Y Chromosome Perspective, a paper published in 2000 (more recent than "Genes, Peoples, and Languages" obviously), coauthored by Cavalli-Sforza, checks for and finds correlations between Y-chromosome haplotypes and the principal components determined by Cavalli-Sforza.

The most comprehensive previous survey of the European gene pool has been the PC analysis of 95 autosomal protein polymorphisms (5, 8). We compared the frequency distribution of the major Eu Y chromosome haplotypes with the first three PCs of Europe (Table 2). Because Sardinians were not included in the original PC analysis because of their pronounced outlier phylogenetic status (5), they were also excluded in our correlation analysis. The first PC, which was proposed to reflect the diffusion of Neolithic farmers (5, 8), correlates with Eu4, Eu9, Eu10, and Eu11 [Eu4 is the same as HG25 and M35; Eu9 is HG9; Eu11 is a sublineage of HG2]. The second PC, whose meaning has never been fully assessed (5, 8), is correlated with the spread of Eu18 [HG1] from Spain toward Central Europe [A postglacial expansion from a refuge on the Iberian peninsula also shows up in mtDNA] and, on the opposite pole, with the spread of Uralic TAT/M178 (Eu13 and Eu14) [HG16]. The third PC, the meaning of which has been debated (3, 5, 8), correlates to the M17 mutation (Eu19) [HG3, which represents another postglacial expansion from a refuge according to the most reasonable theories. It is wrong to attribute HG3 to the Proto-Indo-Europeans]. The concordance of protein-based PC and NRY data suggests that migration, more than natural selection, has influenced the pattern of NRY variation observed.
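For readers unfamiliar with what "correlates with" means operationally here, the sketch below computes a correlation between a haplotype's frequency and a PC score across a set of populations. The excerpt does not say which coefficient the authors used, so Pearson's r is an assumption on my part, and the five pairs of numbers are invented purely for illustration; they are not taken from the paper's Table 2.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

# HYPOTHETICAL values, one entry per population (not real data).
pc_scores = [2.0, 1.0, 0.0, -1.0, -2.0]       # score of each population on some PC
hap_freq = [0.60, 0.45, 0.35, 0.20, 0.10]     # frequency of some haplotype

r = pearson(pc_scores, hap_freq)
print(f"r = {r:.3f}")

# Flipping the sign of the PC scores flips the sign of r and nothing else.
r_flipped = pearson([-s for s in pc_scores], hap_freq)
print(f"r with PC sign reversed = {r_flipped:.3f}")
```

A coefficient like this says only that two geographic patterns line up. It says nothing about when the pattern arose, which is why the dating evidence discussed on this page is needed to choose between a postglacial and a later explanation.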

First principal component of genetic variation in Europe

PC1 correlates with Eu4 [incidentally, this is a YAP+ lineage that probably derives from Ethiopians within the past 20000 years], Eu9, Eu10, and Eu11. This component has been "proposed to reflect the diffusion of Neolithic farmers". That may be its primary cause, but it may also incorporate earlier and later migrations from the Middle East.

mtDNA patterns are also consistent with a Neolithic expansion from the Near East.


Second principal component

PC2 correlates with HG1 and HG16 (in opposite directions). PC2 primarily represents an expansion from Iberia following the LGM. In the extreme North and East, Uralic influence may also be responsible for some of the variation in this component; but it can't be the primary cause (HG16 is absent from most of Western Europe, yet there is a cline in PC2 even within Spain). Further argument against Uralic genes as an important component of PC2 comes from the fact that PC2 is higher in Finns than in Hungarians, even though Cavalli-Sforza's methods determined that Finns have less Uralic ancestry than Hungarians.

Cavalli-Sforza states:

The second PC map shows a concentric gradient like that of the first but centered instead in the Iberian peninsula. The opposite pole of the second PC shows a strong peak among the Lapps, which are certainly no candidate for an expansion. There is no known demic expansion from the Iberian peninsula, and an interpretation based on a migration from this area seems unlikely. [Today, we know that there was in fact a population expansion from the Iberian peninsula following the last ice age. Due to his ignorance of this fact at the time he wrote HGHG, Cavalli-Sforza struggles to come up with an explanation for PC2, and ignores the fact that the map of PC2 features a concentric gradient centered in Iberia.] In general, there is a strong north-to-south gradient that might be interpreted on a climatic or ecological basis; but this interpretation does not take into consideration the existence of ethnic differences between the populations of northern Scandinavia, like the Lapps and other speakers of Uralic languages who occupied the northeastern areas of Europe, perhaps before the arrival of Neolithics.

The interpretation we gave in Menozzi et al. (1978a) was based on migrations from Asia. Although it would be historically absurd to think in terms of an expansion centered in Lappland, the PC peak in this region may be explained by noting that Lapps have a stronger Mongoloid component than any eastern European population. . . Migration of steppe nomads or their descendants and the "barbarian" invasions towards the end of the Roman Empire and afterwards seem to have no influence on this PC.

[The History and Geography of Human Genes, p. 292]

The "Mongoloid" explanation for PC2 is totally unfounded.
  • As mentioned elsewhere, Hungarians have more "Uralic" ancestry than Finns, yet they are lower in PC2.
  • Cavalli-Sforza admits that historical Mongoloid invasions don't show up in PC2.
  • Likewise, Cavalli-Sforza admits that an expansion from Lappland "would be historically absurd".
One of the more obvious indications that PC2 represents an expansion from Iberia rather than Mongoloid influence is the fact that PC2 features a circular cline radiating from Iberia. Two pages after his strained attempt at explaining PC2, Cavalli-Sforza states:

The interpretation of synthetic maps is not always easy. The first question is, how can one hypothesize that a population expansion was involved? If there is a radiation of circular or elliptic clines from a specific area, an expansion is a possible explanation; and its place of origin must clearly be the center of the radiation. The alternative possibility of a centripetal, rather than centrifugal, population movement may have to be considered. The best way to distinguish them would be on a historical basis, but this evidence is usually lacking. In principle, a centrifugal movement is a priori more likely, but a centripetal migration could be that directed towards an important city, especially a capital.

[The History and Geography of Human Genes, p. 294]

However, Cavalli-Sforza ignores his own observation and passes up the obvious interpretation of PC2 because of his desire at the time to align PCs with known migrations or linguistic expansions from the Neolithic onward.

The above map indicates approximate locations of refuges during the last ice age. Following the LGM, HG1 spread from the Iberian refuge and HG3 spread from the Ukraine refuge.

We interpret the differentiation and the distribution of haplotypes Eu18 and Eu19 as signatures of expansions from isolated population nuclei in the Iberian peninsula and the present Ukraine, following the Last Glacial Maximum (LGM). In fact, during this glacial period (20,000 to 13,000 years ago), human groups were forced to vacate Central Europe, with the exception of a refuge in the northern Balkans (16). Similar discrete patterns of the flora and fauna in Europe have been attributed to glaciation-modulated isolation followed by dispersal from climatic sanctuaries (18). This scenario is also supported by the finding that the maximum variation for microsatellites linked to Eu19 is found in Ukraine (19). In turn, the maximum variation for microsatellites linked to 49a,f Ht15 and its derivatives (and then to the Eu18 lineage) is in the Iberian peninsula (19). This is consistent with the diffusion of M173-marked Eu18 from its refuge after the LGM, in agreement with mitochondrial DNA (mtDNA) haplogroup V and some of the H lineages (20). Haplotype Eu19 has been also observed at substantial frequency in northern India and Pakistan (12) as well as in Central Asia (12). Its spread may have been magnified by the expansion of the Yamnaia culture from the "Kurgan culture" area (present-day southern Ukraine) into Europe and eastward, resulting in the spread of the Indo-European language (21) [This idea is entirely speculative, and based on circumstantial evidence -- HG3 is high in one of the proposed Indo-European homelands, so HG3 must have something to do with the Indo-Europeans; it is clear to me that most HG3 had spread well before the Indo-Europeans appeared, and the Proto-Indo-Europeans contributed little if anything to the spread of HG3]. An alternative hypothesis of a Middle Eastern origin of Indo-European languages was proposed on the basis of archaeological data (3).

["The Genetic Legacy of Paleolithic Homo sapiens sapiens in Extant Europeans: A Y Chromosome Perspective"]

As reported above, the distributions of different Y-chromosome haplogroups are highly correlated with the first 3 principal components of genetic variation in Europe as determined by Cavalli-Sforza. The correlations are not perfect, nor do they need to be. Y-chromosome haplogroup distributions are subject to genetic drift. Likewise, autosomal markers are subject to drift and possible selection.
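To show why drift alone keeps such correlations from being perfect, here is a minimal simulation sketch. It assumes a simple Wright-Fisher-style model of a single Y-chromosome lineage in isolated populations of constant size; the model, the population size, the number of generations, and the starting frequency are all my own illustrative choices, not parameters from any of the papers cited here.

```python
import random

def drift_trajectory(start_freq, pop_size, generations, rng):
    """Frequency of one lineage over time under pure random sampling.

    Each generation, every one of the pop_size males inherits the lineage
    with probability equal to its frequency in the previous generation.
    """
    freq = start_freq
    path = [freq]
    for _ in range(generations):
        carriers = sum(1 for _ in range(pop_size) if rng.random() < freq)
        freq = carriers / pop_size
        path.append(freq)
    return path

# HYPOTHETICAL settings: five isolated populations that all start at 50%.
rng = random.Random(0)  # fixed seed so the run is repeatable
paths = [drift_trajectory(0.5, 100, 50, rng) for _ in range(5)]
final_freqs = [path[-1] for path in paths]

for number, freq in enumerate(final_freqs, start=1):
    print(f"population {number}: final frequency {freq:.2f}")
```

Populations that begin identical end up with different frequencies for no reason other than chance, so a single-locus marker such as the Y chromosome should be expected to track a multi-locus PC only approximately.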

Though the Y-chromosome tells only half the story, studies on mtDNA confirm the migrations that show up in Y-chromosomes and protein markers.

For example,

With regard to age and frequency, there is a clear cline from west to east (fig. 4); the age for V in the west (16,000 years) is almost twice that in the east, indicating the direction of settlement. We interpret this pattern in the following way: the older age reflects the onset of the recolonization of Europe from western refugia (Housley et al. 1997), whereas later founder events are responsible for the limited occurrences and reduced diversity of V mtDNAs in the east.

A Signal, from Human mtDNA, of Postglacial Recolonization in Europe

The genetic signature of a postglacial expansion from a refuge in SW Europe.

(A) Cavalli-Sforza's second principal component of genetic variation in Europe [1].

(B) The distribution of mtDNA haplogroup V [3].

(C) The distribution of Y-chromosome haplogroup 1 [6].

Note that while neither mtDNA nor Y-chromosome haplogroup distributions precisely match PC2, when considered together, they can be seen to account for most of the variation in PC2. Also, remember that Cavalli-Sforza's maps merely show variation, and assigning "high" or "low" values to the poles is a matter for the person doing the interpreting. Here, Cavalli-Sforza has assigned "high" to Lappland, because that's how he interpreted the map at the time. This piece of outdated interpretation is irrelevant to us. All that is relevant is that PC2 shows concentric gradients centered in Iberia, consistent with a population expansion from Iberia.
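The point that "high" and "low" are arbitrary labels follows from how principal components are computed: if a vector is a principal axis, so is its negative, and the two give mirror-image scores with identical variance. The sketch below demonstrates this on a tiny made-up table of allele frequencies (four populations, three loci). The numbers are hypothetical, and power iteration is used only to keep the example dependency-free; it is not the procedure used in HGHG.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]

def covariance_matrix(rows):
    """Return (centered rows, sample covariance matrix of the columns)."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    return centered, cov

def leading_eigenvector(cov, iterations=200):
    """Power iteration: returns (largest eigenvalue, unit eigenvector)."""
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = matvec(cov, v)
        norm = math.sqrt(dot(w, w))
        v = [x / norm for x in w]
    return dot(v, matvec(cov, v)), v

def variance(values):
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)

# HYPOTHETICAL allele frequencies: rows are populations, columns are loci.
freqs = [
    [0.10, 0.20, 0.30],
    [0.20, 0.30, 0.35],
    [0.30, 0.45, 0.50],
    [0.40, 0.50, 0.60],
]

centered, cov = covariance_matrix(freqs)
eigenvalue, axis = leading_eigenvector(cov)
flipped_axis = [-x for x in axis]

scores = [dot(c, axis) for c in centered]
flipped_scores = [dot(c, flipped_axis) for c in centered]

print("scores:        ", [round(s, 3) for s in scores])
print("flipped scores:", [round(s, 3) for s in flipped_scores])
print("variance either way:", round(variance(scores), 5))
```

Both orientations describe the same gradient equally well, so which end of a map is shaded "high" carries no information about where an expansion started. Only the shape of the gradient does.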

See also mtDNA analysis reveals a major late Paleolithic population expansion from southwestern to northeastern Europe.


Third principal component

PC3 correlates with HG3. PC3 represents an expansion from north of the Black Sea following the LGM. It has been speculated that PC3 and HG3 are associated with the spread of Indo-European, but several different studies have demonstrated that the spread of HG3 predates the spread of Indo-European.

PC3 is often wrongly identified as being associated with the expansion of the Kurgan people. I have provided evidence elsewhere that this interpretation is incorrect. However, even before such evidence was available, it was irresponsible to blindly associate this PC with a linguistic/archaeological expansion based solely on geographic coincidence. In HGHG, Cavalli-Sforza acknowledges that "interpretation of synthetic maps is not always easy" and "it is often difficult to link a cline of gene frequencies with a precise historical expansion" (294). And, though in HGHG he strongly favors associating PC3 with the Gimbutas hypothesis (the expansion of the Kurgan people), Cavalli-Sforza admits "it is . . . difficult to assign unequivocally the third . . . PC to a specific migration" (293).

HG3 in India predates the Aryan invasion. In addition, HG3 is present at high levels in low caste Indians and at lower levels in high caste Indians [5]. This is not the distribution one would expect if HG3 had been brought to India by Indo-Europeans. And here again is more evidence that the spread of HG3 predates the spread of Indo-European:

The 49a,f haplotype 11 is a new marker of the EU19 lineage that traces migrations from northern regions of the Black Sea.

Passarino G, Semino O, Magri C, Al-Zahery N, Benuzzi G, Quintana-Murci L, Andellnovic S, Bullc-Jakus F, Liu A, Arslan A, Santachiara-Benerecetti AS.

Dipartimento di Genetica e Microbiologia, University of Pavia, Pavia, Italy.

Previous studies on human Y-chromosome polymorphisms in the European populations highlighted the high frequency of the 49a,f/TaqI haplotype 11 and of the Eu19 (M17) lineage in Eastern Europe. To better understand the origin and the evolution of the Eu19, and its relationship with 49a,f Ht11, this study surveyed 2,235 individuals (mainly from Europe and the Middle East) for the 49a,f Ht11 and for many biallelic markers defining the Eu19 lineage. As previously described, the highest frequency of Eu19 was found in Eastern Europe. All the Eu19 Y-chromosomes turned out to be 49a,f Ht11 or its derivatives, the distribution of which suggests that the Eu19/49a,f Ht11 emerged in Ukraine, probably in a Palaeolithic population. Thereafter, the spread of this lineage toward Europe, Asia, and India occurred at different waves over a few thousands years. At present this seems to indicate the influence of the Ukraine Palaeolithic groups in the gene pool of modern populations. For the first time it is possible to make inferences about the evolution of some haplotypes of the 49a,f system. In spite of its unknown molecular base, this is one of the first most informative polymorphisms of the Y chromosome.

PMID: 11543894 [PubMed - indexed for MEDLINE]


In response to this page, RM created his own page on genetic variation in Europe. RM's page rehashes what RM was told about Cavalli-Sforza by an Irish-Jewish teenager with a reading disability. RM latched onto this misinterpreted and outdated information because it fits his agenda of "sticking it to the Nordicists" he hates so much. RM's page does not warrant a detailed refutation (I've already repeatedly explained the facts to him, both on the message board, and on this page). However, one obvious mistake (which he picked up from the aforementioned teenager) that bears pointing out is RM's belief that the second principal component of genetic variation in Asia is "[European] PC2 extended to Asia", and that both European and Asian PC2 represent Uralic influence. In fact, Asian PC2 is no such thing. It is, quite simply Asian PC2, not European PC2, and it is generated using data from Asia, not Europe. Moreover, we can confirm by actually reading HGHG (a task I don't think RM is up to) that Asian PC2 (unsurprisingly) is correlated with a different set of genes than the ones European PC2 is correlated with. Additionally, at no point does Cavalli-Sforza suggest Asian PC2 is associated with Uralic. Rather, Asian PC2 is "the difference between northern and southern Mongoloids" (249). Trivial mistakes like this one remind us how little grasp RM actually has of the material he presumes to interpret.


1. Cavalli-Sforza, LL. Genes, peoples, and languages. Proc Natl Acad Sci USA 1997 Jul 22;94(15):7719-24.

2. Semino et al. The Genetic Legacy of Paleolithic Homo sapiens sapiens in Extant Europeans: A Y Chromosome Perspective. Science 2000 Nov 10;290(5494):1155-9.

3. Torroni et al. A signal, from human mtDNA, of postglacial recolonization in Europe. Am J Hum Genet 2001 Oct;69(4):844-52.

4. Guglielmino et al. Uralic genes in Europe. Am J Phys Anthropol 1990 Sep;83(1):57-68.

5. Majumder PP. Ethnic populations of India as seen from an evolutionary perspective. J Biosci 2001 Nov;26(4 Suppl):533-45.

6. Rosser et al. Y-Chromosomal Diversity in Europe Is Clinal and Influenced Primarily by Geography, Rather than by Language. Am J Hum Genet 2000 Dec;67(6):1526-43.

7. Cavalli-Sforza et al. The History and Geography of Human Genes. Princeton, N.J.: Princeton University Press. 1994.

Additional reading

A Y-chromosome nomenclature system (Figure 1)

Patterns of classical genetic variation in Asia
