This post is solely about genome-wide association studies (GWAS). I have already written a few blog posts focused on the genetics of specific conditions, and specifically on which genetic variants (mutations) and genes are associated with those conditions, so I realized it would be helpful to have a general post about GWAS itself.
GWAS is the primary way researchers identify associations between genetics and a given condition. Therefore, I think this type of blog post will be helpful for those who are not entirely familiar with GWAS. So, let's start.
What is GWAS?
In very simple terms, GWAS is a type of genetic study where genetic information from a set of individuals is combined with their phenotypic information (for example, whether an individual has diabetes, what their height is, or any other quantifiable characteristic). The final goal is to identify genetic variants (genetic mutations) that are strongly associated with the given phenotype (condition, trait, etc.) and that can potentially play a causal role in the mechanisms underlying it.
What is phenotypic information?
So, let's explain that using an example. Imagine we have 10000 individuals diagnosed with diabetes (these are usually called cases) and 10000 individuals without diabetes (these are usually called controls). This information about diabetes can be considered phenotypic information. Phenotype refers to any type of human characteristic that can be quantified.
What is genotypic information?
Now, we would like to check whether genetics plays a role in whether someone gets diabetes or not. We have already collected phenotype information (the diagnosis of diabetes). The next step is to obtain genotype information. Genotypes can be obtained either by sequencing or by array-based genotyping. In our imaginary example, we have to get genotype information for both the 10000 individuals with diabetes and the 10000 without diabetes. What is genotype information? Well, for that I need to give a bit more of an introduction.
A person's genetic makeup is called the genome. In this case, we refer to human DNA that is located in the nucleus of the cell and is split into 23 pairs of chromosomes (there is also mitochondrial DNA, but let's not complicate things now). You may remember from high school that DNA is composed of extremely long chains of connected subunits called nucleotides, which come in one of four forms: adenine (A), guanine (G), cytosine (C), and thymine (T).
The human genome, the total DNA a person has, is composed of about 3 billion pairs of those nucleotides (A, C, G, and T).
As far as the sequence of nucleotides goes, humans share about 99.5% of their genetic information. The remaining positions are where people differ: for example, at a particular place in the genome, most people have nucleotide A, while the rest have nucleotide T at the same spot. These differing forms are called genetic variants (genetic mutations). These genetic variants are the things we are interested in and that we want to genotype and obtain information about.
Now, I will not go into details about the technologies and how we actually obtain genotype information. The bottom line is that we aim to obtain genotype information for as many of these variants as possible (to have maximum coverage of the whole human genome) and for as many individuals as possible (a large sample size).
How does GWAS work?
OK, now we have genotype information for both the 10000 individuals with diabetes and the 10000 individuals without diabetes. The next step is to perform a genome-wide association study. Here, I will provide a conceptual explanation of how that is done under the hood.
Just to remind you, genotype information covers millions of those genetic variants. For the sake of illustration, I will walk through a hypothetical example for just one of them:
Let's say that one out of millions of those variants is a genetic variant called rs12345 (btw, an "rs" number like this is the usual way of naming single nucleotide polymorphisms (SNPs), a specific type of genetic variant most often used in GWAS).
Now, we have genetic information for rs12345 for all 20000 individuals (10000 with diabetes and 10000 without diabetes). This rs12345 genetic variant has two alleles, allele A and allele C, and therefore there are three possible genotypes an individual can have: AA (homozygous), AC (heterozygous), and CC (homozygous).
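To make the "two alleles, three genotypes" idea concrete, here is a minimal Python sketch. The function names `possible_genotypes` and `allele_count` are my own, made up for illustration. Counting how many copies of one allele a person carries (0, 1, or 2) is a common way of turning a genotype into a number, but take it as one possible encoding rather than the only one.

```python
from itertools import combinations_with_replacement


def possible_genotypes(alleles):
    """List the unordered allele pairs a person can carry at one variant."""
    pairs = combinations_with_replacement(sorted(alleles), 2)
    return ["".join(pair) for pair in pairs]


def allele_count(genotype, allele):
    """Count how many copies (0, 1, or 2) of `allele` a genotype contains."""
    return genotype.count(allele)


# The hypothetical variant rs12345 with alleles A and C.
genotypes = possible_genotypes(["A", "C"])
print(genotypes)  # ['AA', 'AC', 'CC']

for genotype in genotypes:
    # Number of C alleles: AA has 0, AC has 1, CC has 2.
    print(genotype, allele_count(genotype, "C"))
```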
If we want to check whether this rs12345 genetic variant is associated with diabetes, we must bucket all the individuals into the three genotypes (AA, AC, CC) and, within each bucket, count how many have diabetes and how many don't.
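The bucketing step can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `samples` list is made-up toy data, each individual is represented as a `(genotype, has_diabetes)` pair, and `bucket_by_genotype` is a hypothetical helper name.

```python
from collections import Counter

GENOTYPES = ("AA", "AC", "CC")


def bucket_by_genotype(samples):
    """Count cases and controls per genotype.

    `samples` is an iterable of (genotype, has_diabetes) pairs.
    Returns counts in the order AA, AC, CC for each group.
    """
    counts = Counter(samples)
    return {
        "cases": [counts[(g, True)] for g in GENOTYPES],
        "controls": [counts[(g, False)] for g in GENOTYPES],
    }


# Made-up toy data: 5 individuals with diabetes and 5 without.
samples = [
    ("AA", True), ("AC", True), ("AC", True), ("CC", True), ("CC", True),
    ("AA", False), ("AA", False), ("AC", False), ("AC", False), ("CC", False),
]

table = bucket_by_genotype(samples)
print(table["cases"])     # [1, 2, 2]
print(table["controls"])  # [2, 2, 1]
```

The result is a small 2 x 3 table (cases and controls by genotype), which is the input to the statistical test.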
In this process, a statistical test is performed to determine whether genotype and phenotype are statistically associated. This is repeated for every genetic variant we have in our sample, so the end product is a long list of test results, one per variant. What we ultimately want is the list of genetic variants that are strongly associated with the given phenotype (condition, human characteristic, trait).
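Here is a rough sketch of what "test every variant and collect the results" could look like. I am assuming a simple chi-square test of independence on the cases-by-genotype table as the statistical test; real GWAS software typically uses regression models (for example, logistic regression with covariates), so treat this only as a conceptual stand-in. The genotype counts and the second variant name are made up, and `chi_square_2x3` is a hypothetical helper.

```python
import math


def chi_square_2x3(cases, controls):
    """Chi-square test of independence for a 2 x 3 genotype table.

    `cases` and `controls` are counts for genotypes AA, AC, CC.
    Assumes all three genotypes are observed, so the test has
    2 degrees of freedom.
    """
    total = sum(cases) + sum(controls)
    col_totals = [a + b for a, b in zip(cases, controls)]
    stat = 0.0
    for row in (cases, controls):
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    # For 2 degrees of freedom, the chi-square tail probability
    # has the closed form exp(-stat / 2).
    p_value = math.exp(-stat / 2)
    return stat, p_value


# Made-up counts (AA, AC, CC) for 10000 cases and 10000 controls.
variants = {
    "rs12345": ([2000, 5000, 3000], [3000, 5000, 2000]),
    "rs_hypothetical_2": ([2500, 5000, 2500], [2500, 5000, 2500]),
}

# Repeat the same test for every variant and collect the results.
results = {name: chi_square_2x3(cases, controls)
           for name, (cases, controls) in variants.items()}

# Sort so the most strongly associated variants come first.
for name, (stat, p_value) in sorted(results.items(), key=lambda kv: kv[1][1]):
    print(name, stat, p_value)
```

In this toy data, rs12345 has very different genotype counts in cases and controls, so it gets a large test statistic and a tiny p-value, while the second variant has identical counts in both groups and shows no association at all.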
I will come back to this blog post and try to make it a bit longer and more detailed, as there is obviously much more to GWAS than just this concise blog post, but I guess for now, it is enough.