Biobanks Beware: Men, Women, and Sampling Bias

By Carol Morton

[Photo: Andrea Ganna]

14 June 2021. Except for the X and Y chromosomes, men and women are assumed to have the same random mix of genes from their parents. So a genome-wide association study (GWAS) that compares men to women but leaves out the sex chromosomes should come up with no differences.

Instead, a new analysis of 3.3 million individuals revealed sex-specific sampling bias in large voluntary biobank cohorts. In fact, the data showed more than 150 spurious genetic associations with sex.

The authors attribute the findings to differences in the traits of people who enroll in studies or pay for direct-to-consumer genetic tests. They also call their report a cautionary tale about new statistical challenges that arise as genetic studies grow larger and dig deeper for biological clues to predict, prevent, and treat disorders and diseases.

The report, one of the largest GWAS to date, is published in the May 2021 Nature Genetics by an international collaboration that includes researchers and biobank cohorts from Denmark (iPSYCH), Finland (FinnGen), Japan (Biobank Japan), the United Kingdom (UK Biobank), and the United States (23andMe).

Who Participates?

“The cool thing is that we can use genetic correlation with sex GWAS to find what makes females and males relatively more likely to participate in the study … and we can show that determinants of participation are study-specific,” tweeted Andrea Ganna, a senior author and group leader at the Institute for Molecular Medicine Finland in Helsinki.

The bogus sex differences varied among study cohorts. For example, in the UK Biobank, more women than men appear to be genetically predisposed to achieve higher educational levels, but in 23andMe the trend was opposite. On the other hand, in both data sets, a gene variant thought to increase obesity was found more often in men.

Meanwhile, researchers found fewer sex differences in autosomal genes in the FinnGen and Biobank Japan cohorts, and none in the Danish iPSYCH study group.

A major factor may be the biobank study designs, the authors propose. People who volunteered for the UK Biobank report less obesity and smoking and fewer health conditions than the general population. iPSYCH, on the other hand, recruits a random sample from a routine collection of newborn dried blood spots.

[Figure: Among the participation biases, women in both UK Biobank and 23andMe had greater spurious genetic associations with risky behavior, autism and cannabis use, compared to men in both cohorts. Image courtesy of Nature Genetics.]

“This is a very important paper particularly for @iPSYCHdk, as this highlights a major strength for the iPSYCH cohort, which is the absence of participation bias, to the community and to the funders,” tweeted co-author Veera Rajagopal, a scientist at Regeneron who did his doctoral research at Aarhus University in Denmark.

What Is to Be Done?

The authors propose two methods to correct the data so that it better matches the general population, but the methods require a random population sample for comparison, akin to the iPSYCH sample, which does not yet exist in most countries.

The findings may be technical, but they have major implications for personalized medicine, first author Nicola Pirastu, a genetic epidemiologist at the University of Edinburgh, told NSHG-PM.

“We’re also using the same data to understand the risk for disease and to understand what drugs may work in disease,” he said. “Pharmaceutical companies use the data to prioritize molecules for further testing. We have the responsibility to be extra careful when we go from statistics to inference on people.”

Shared Questions, Shared Answers

The study arose from the close working relationships among geneticists, who often share ideas and collaborate. Several groups became curious about how well participants in several large biobank studies matched the general population.

[Figure: A genome-wide search for traits associated with men or women that leaves out the X and Y chromosomes should have come out flat. Instead, researchers found 158 loci with spurious sex associations. Image courtesy of Nature Genetics.]

In Pirastu’s case, he was puzzling over GWAS results on food consumption and body weight. Food consumption data is self-reported and known to be biased. But even taking that into account, there was something more that wasn’t making sense in his study results.

Another kind of bias was a possibility. So he and his co-authors ran a GWAS on sex, as they would for diabetes, heart disease, or schizophrenia, leaving out the X and Y chromosomes.

Instead of coming out flat, as would be expected, the GWAS detected 158 genome-wide significant loci. The researchers traced the explanation to study designs that resulted in participation bias. Supporting this, the sex associations aren’t consistent between 23andMe, for which people pay to have their genome tested, and the UK Biobank, for which people volunteer their information and samples.

The differential participation of males and females in GWAS may correlate with the complex trait being studied. For example, the authors write, “females with higher genetic susceptibility to obesity are less likely to participate in studies than their male equivalents (or that genetically lean males are more likely to).”
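The mechanism the authors describe can be illustrated with a toy simulation. The sketch below is not the authors' method. It assumes a single hypothetical autosomal variant with the same frequency in both sexes, plus an invented enrollment rule under which each copy of the variant lowers a woman's chance of taking part. Every name and number in it (simulate, freq, penalty, the 0.5 baseline) is an illustrative assumption.

```python
import random


def simulate(n=200_000, freq=0.3, penalty=0.2, seed=0):
    """Return risk-allele frequency by sex, in the population and among participants.

    All parameters are hypothetical: `freq` is the allele frequency in both sexes,
    and `penalty` is how much two copies of the allele lower a woman's
    enrollment probability (one copy lowers it by half as much).
    """
    rng = random.Random(seed)
    # Each entry holds [risk alleles seen, total alleles seen].
    counts = {
        (group, sex): [0, 0]
        for group in ("population", "sample")
        for sex in ("F", "M")
    }
    for _ in range(n):
        sex = "F" if rng.random() < 0.5 else "M"
        # Autosomal genotype (0, 1, or 2 risk alleles), drawn independently of sex.
        alleles = int(rng.random() < freq) + int(rng.random() < freq)
        # Assumed enrollment rule: only women are deterred by the risk allele.
        enroll_prob = 0.5
        if sex == "F":
            enroll_prob -= penalty * alleles / 2
        groups = ["population"]
        if rng.random() < enroll_prob:
            groups.append("sample")
        for group in groups:
            counts[(group, sex)][0] += alleles
            counts[(group, sex)][1] += 2
    return {key: risk / total for key, (risk, total) in counts.items()}


if __name__ == "__main__":
    for (group, sex), value in sorted(simulate().items()):
        print(f"{group:10s} {sex}  risk-allele frequency = {value:.3f}")
```

Under these assumed settings, the allele is about equally common in men and women in the full population but less common among the women who enroll. A GWAS on sex run on the participants alone would pick up that gap as a spurious autosomal association.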

At least half of the sex signals are associated with at least one complex trait, such as type 2 diabetes, autoimmune and psychiatric diseases, blood pressure, and bone mineral density.

The GWAS are documenting differences among the people who participate in studies, and those differences may cause researchers to misinterpret results, such as when researchers statistically test proteins for a potential causal role in disease using only GWAS data.

“Understanding which biases are influencing data is really important at the moment,” Pirastu said. “In this case, we knew that we should have gotten no result, but if you substituted cases and controls for men and women, it would have looked perfectly legit, and we would have never guessed there could be a problem.”

Reference

Genetic analyses identify widespread sex-differential participation bias.
Pirastu N, Cordioli M, Nandakumar P, Mignogna G, Abdellaoui A, Hollis B, Kanai M, Rajagopal VM, Parolo PDB, Baya N, Carey CE, Karjalainen J, Als TD, Van der Zee MD, Day FR, Ong KK; FinnGen Study; 23andMe Research Team; iPSYCH Consortium, Morisaki T, de Geus E, Bellocco R, Okada Y, Børglum AD, Joshi P, Auton A, Hinds D, Neale BM, Walters RK, Nivard MG, Perry JRB, Ganna A.
Nat Genet. 2021 Apr 22.