pQTL Analysis. Protein abundance data from proteomic analyses (e.g. mass spectrometry, left) are integrated with QTL analysis from genomics (right) for pQTL analysis. This multi-omic approach evaluates the genetic basis of protein abundance. (Created in BioRender.com/p70o061)
Protein quantitative trait loci (pQTL) are regions in the genome associated with variation in protein expression levels. Since proteins are the primary mediators of biological activity, changes in their abundance influence both health and disease.[1][2] While the central dogma of biology describes the flow of genetic information from DNA to RNA to protein, current research highlights multiple, complex stages of regulation and modification throughout the process.[3][4] For example, proteins are influenced by epigenetic regulation and by post-transcriptional and post-translational modifications, meaning that protein abundance does not always correlate with RNA abundance.[5]
Identifying and mapping pQTLs combines quantitative trait loci (QTL) analysis with proteomic data. The "trait" studied in pQTL analysis is the quantity of specific proteins. Mapping genetic variants beyond the coding gene that alter protein levels can clarify the genetic and molecular mechanisms underlying disease.[6][7] Improved characterization of the genetic basis of protein abundance also has the potential to reveal previously unidentified targets for drug development.[8][9]
Background
Quantitative trait loci analyses
pQTLs are part of a broad "family" of quantitative trait loci (QTL). QTLs are regions in the genome associated with a phenotype of interest, which must be a measurable, continuous trait.[10] Basic concepts in genetics, such as the influence of genes on organisms' physical traits, are at the foundation of QTLs. QTLs help explain the gap between Mendelian inheritance and complex trait variation by highlighting the influence of multiple genomic regions on a single trait.[11] QTL analysis was first performed in 1988 by Paterson and Lander to identify regions in the genome that control tomato plant mass, concentration of soluble solids, and fruit pH.[12] Early studies had to rely on morphological markers on chromosomes, which were not evenly distributed throughout the genome. As common genetic markers were identified, such as restriction fragment length polymorphisms (RFLPs) and single nucleotide polymorphisms (SNPs), the resolution and accuracy of QTL studies improved.[10] The statistical framework for quantitative trait loci mapping was further developed in 2001 by Sen and Churchill.[13]
There are many types of QTL referred to in scientific literature (see table below for examples).[14][15]
Expression quantitative trait loci (eQTL): Genetic variants that affect the level of RNA expression
Methylation quantitative trait loci (mQTL): Genetic variants that affect the level of DNA methylation
Chromatin accessibility quantitative trait loci (caQTL): Genetic variants that affect the packaging and accessibility of DNA for transcription factors
Binding quantitative trait loci (bQTL): Genetic variants that affect transcription factor binding
Proteomics
pQTL analyses are only possible because of the growing field of proteomics. Scientists who study proteomics use techniques like mass spectrometry and affinity-based assays to measure the quantity and quality of proteins in different cells or tissues.[16] The technology available to study protein quantity, location, and interactions has expanded rapidly in the past century. For example, Olink's Proximity Extension Assay (PEA) is a proteomics assay that combines an antibody-based immunoassay with polymerase chain reaction (PCR) to measure protein levels.[17] The SomaScan platform is another common proteomic technology that can measure 11,000 different proteins.[18]
Studies identifying and characterizing eQTLs have grown in popularity faster than those that focus on pQTLs.[19][20] However, the two types of loci capture different information. Gene expression data, such as messenger RNA (mRNA) abundance, tells a scientist whether the gene is being transcribed efficiently, whereas proteomic data tells a scientist whether the mRNA is being translated efficiently and how proteins interact with each other. Gene expression and protein translation are often highly correlated, but there are cases when post-transcriptional and post-translational modifications increase or decrease the rate and timing of protein synthesis and stability independently of gene transcription.[21]
pQTL types
cis- and trans-pQTLs. The two major classifications of pQTLs are based on the position of the variant relative to the gene encoding the protein of interest. (Created in BioRender.com/f82v896)
cis-pQTLs are genetic variants located near the gene encoding the protein they influence, acting through the cognate gene to modify protein levels.[22]
trans-pQTLs are genetic variants located far from the gene encoding the associated protein, often on different chromosomes, that exert their effects indirectly.[22]
In some cases, a single variant may exhibit both cis- and trans-effects, impacting the local gene and a distant gene simultaneously, which can be referred to as cis-trans pQTLs.[6]
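The positional definitions above can be expressed as a simple rule. The following is a minimal sketch, not a standard implementation: the 1 Mb window is an assumed convention (this article does not state a distance threshold), and the `Gene` class, `classify_pqtl` function, and all coordinates are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass

# Assumed window: variants within 1 Mb of the gene are treated as cis.
# The article gives no threshold, so this value is only a placeholder.
CIS_WINDOW_BP = 1_000_000


@dataclass(frozen=True)
class Gene:
    chrom: str
    start: int  # gene start position in base pairs
    end: int    # gene end position in base pairs


def classify_pqtl(variant_chrom: str, variant_pos: int, gene: Gene,
                  window: int = CIS_WINDOW_BP) -> str:
    """Label a variant 'cis' if it lies within `window` bp of the gene that
    encodes the measured protein, and 'trans' otherwise."""
    if variant_chrom != gene.chrom:
        return "trans"
    if gene.start - window <= variant_pos <= gene.end + window:
        return "cis"
    return "trans"


# Hypothetical coordinates, for illustration only.
gene = Gene(chrom="1", start=5_000_000, end=5_020_000)
labels = [
    classify_pqtl("1", 5_500_000, gene),  # same chromosome, inside the window
    classify_pqtl("1", 9_000_000, gene),  # same chromosome, outside the window
    classify_pqtl("7", 5_010_000, gene),  # different chromosome
]
print(labels)
```

A variant associated with several proteins can be passed through the same function once per protein-coding gene; it would then be labelled cis for one protein and trans for another, matching the cis-trans case described above.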
Workflow
pQTL Workflow. Sample collection is followed by protein quantification and genotyping or sequencing. These data are used to identify and map the pQTLs. The pQTLs are further characterized with integrative genome analyses, such as genome-wide association studies (GWAS), colocalization analyses with other QTLs, and Mendelian randomization (MR) analysis to evaluate causation. (Created in BioRender.com/c86e583)
Sample collection
The process begins with the careful collection of biological specimens, such as blood,[23] serum,[24] or tissue,[19][25][26] from a well-characterized cohort. It is critical to follow standardized protocols to ensure sample integrity and minimize pre-analytical variability. Proper labeling and documentation of each sample, along with associated metadata (e.g. age, sex, and clinical parameters), are essential to facilitate later analysis and to control for confounding factors.[25]
pQTL mapping is the process that relates protein abundance measurements to genome-wide SNP data in order to identify genetic variants influencing protein expression. Researchers perform association analyses, such as linear regression models,[9][19][23] to test each SNP for a statistical association with the observed protein levels.
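The per-SNP association test can be sketched as follows. This is a simplified illustration that assumes an additive genotype coding (0, 1, or 2 copies of an allele) and omits the covariates, such as age and sex, that a real analysis would typically include. The function name `snp_association`, the SNP labels, and the data are all hypothetical.

```python
import math


def snp_association(dosages, protein_levels):
    """Ordinary least squares of protein level on genotype dosage.

    Returns (slope, standard error of the slope, t statistic).
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(protein_levels) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(dosages, protein_levels))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((y - (intercept + slope * x)) ** 2
                      for x, y in zip(dosages, protein_levels))
    # Residual variance uses n - 2 degrees of freedom (slope and intercept).
    se = math.sqrt(residual_ss / (n - 2) / sxx)
    return slope, se, slope / se


# Toy data: one protein measured in six individuals, two hypothetical SNPs.
protein = [1.0, 2.1, 2.9, 1.2, 1.9, 3.1]
genotypes = {
    "snp_A": [0, 1, 2, 0, 1, 2],
    "snp_B": [1, 0, 2, 2, 0, 1],
}

# Test every SNP against the protein and keep the strongest association.
results = {snp: snp_association(d, protein) for snp, d in genotypes.items()}
top_snp = max(results, key=lambda snp: abs(results[snp][2]))
print(top_snp, results[top_snp])
```

In a genome-wide scan the same test is repeated for every SNP and every protein, so the resulting statistics would need a multiple-testing correction before any variant is reported as a pQTL.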
Applications
pQTL studies have found applications in biological and biomedical research. Because pQTL analyses provide a direct, quantitative link between genetic variation and protein abundance, this approach can circumvent the difficulties of inferring levels of protein expression from levels of mRNA expression. While eQTL analysis can provide insight into how genetic variation affects mRNA transcription, mRNA levels correlate only moderately with the amount of protein present due to post-transcriptional regulatory factors such as translation efficiency, protein degradation or stability, and post-translational modifications.[34] As a result, pQTL analysis is useful in the investigation of complex diseases, including neurological disorders, cardiovascular and cardiometabolic disease, and inflammatory conditions, where protein dysregulation plays a key role.[35][36][37][38] Additionally, pQTLs have demonstrated utility in drug development, where they aid in the identification of novel therapeutic targets and the functional validation of existing treatments.[8]
Integration with genomics data
pQTL analysis can complement GWAS approaches to help confirm genotype-phenotype findings. For example, while GWAS findings have been instrumental in identifying genetic variants linked to complex diseases, they are primarily associative and often fall short of revealing the underlying etiology of disease. Many associated variants lie in non-coding regions, making it challenging to determine their biological function.[39] Pleiotropy can further complicate interpretation.[40] Integrating pQTL data with GWAS results provides a direct, quantitative link between genetic variation and protein abundance, a functional output that can more immediately affect cellular processes.[1][23]
Colocalization analyses can also be used to assess whether the genetic signals underlying pQTLs overlap with those detected in other types of QTL studies, such as eQTLs.[9][41][42]
Mendelian Randomization. pQTLs can serve as genetic instrumental variables for Mendelian randomization, as long as they are free of confounders (e.g. variants in linkage disequilibrium with the pQTL) and inform the target outcome (e.g. an observed phenotype) solely through a particular exposure, which for pQTLs is typically the abundance of the cognate protein.
Another approach for understanding how pQTLs drive phenotypes uses Mendelian randomization (MR). pQTL data can be used as an instrumental variable with MR to infer causality rather than association. This approach exploits the random assortment of genetic variants at conception to test whether genetically driven differences in protein expression are associated with disease risk or other trait outcomes. pQTLs that are robustly and reproducibly associated with the abundance of their protein are good candidates for instrumental variables in MR approaches because they satisfy the "relevance" criterion, i.e. the instrument is associated with the exposure. However, they must also fulfill the "independence" criterion (the variant shares no confounders with the outcome) and the "exclusion restriction" criterion (the variant affects the outcome only through the exposure). Together, the integration of GWAS and pQTL data through MR can help distinguish causal genetic variants from associated bystander variants by incorporating information on protein-level changes that are more proximal to the disease mechanism.[23][43]
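When a single pQTL is used as the instrument, one common way to combine the two sets of summary statistics is the Wald ratio. The sketch below is a hedged illustration rather than a method taken from the cited studies: the function name `wald_ratio` and the effect sizes are hypothetical, and the standard error uses a first-order approximation that ignores uncertainty in the pQTL effect.

```python
def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Single-instrument MR estimate of the effect of protein level on an outcome.

    beta_exposure: effect of the variant on protein abundance (from the pQTL study)
    beta_outcome:  effect of the same variant on the outcome (from a GWAS)
    se_outcome:    standard error of beta_outcome
    """
    if beta_exposure == 0:
        raise ValueError("the instrument must be associated with the exposure")
    estimate = beta_outcome / beta_exposure
    # First-order approximation: treats beta_exposure as known without error.
    se = se_outcome / abs(beta_exposure)
    return estimate, se


# Hypothetical summary statistics, for illustration only.
estimate, se = wald_ratio(beta_exposure=0.5, beta_outcome=0.1, se_outcome=0.02)
ci_low, ci_high = estimate - 1.96 * se, estimate + 1.96 * se
print(f"estimate={estimate:.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

The ratio is only interpretable as a causal effect if the instrumental variable criteria hold. The check for a zero `beta_exposure` reflects the relevance requirement, while independence and exclusion restriction cannot be verified from these three numbers alone.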
Therapeutic development
The identification of functionally relevant proteins and biological pathways through pQTL analysis can contribute to the development of new therapies. Specifically, mapping pQTL variants to annotated genomic regions can help clarify the precise mechanism by which a pQTL variant can alter protein expression. cis-pQTLs have a direct effect on protein abundance and can thus reveal direct effectors of a phenotype.[44] Conversely, trans-pQTLs can reveal protein-protein interactions, establishing links between cellular signals, downstream effectors that may be more easily druggable, and disease endpoints.[26][43] From a drug discovery perspective, pQTL analysis not only provides potential therapeutic targets through the identification of relevant biological pathways, but can also reduce the likelihood of pursuing ineffective or non-contributory cellular components.[45]
Beyond target identification, pQTL studies can also contribute to drug development studies by revealing biomarkers that correlate with disease progression or treatment response. This information aids in patient stratification, allowing for more precise therapeutic approaches tailored to individuals most likely to benefit from a given intervention. Furthermore, because genetic variants that influence protein levels can also reveal potential side effects, pQTL data can inform early-stage safety assessments, identifying possible adverse reactions before clinical trials. In some cases, this approach also facilitates drug repurposing by highlighting proteins that are already modulated by existing therapeutics but may be relevant to new indications.[8][35][36]
Considerations
Tissue specificity
An important consideration for pQTLs is the tissue-specific nature of proteins. The majority of pQTL studies have used plasma as the tissue, but protein levels in plasma are not always indicative of protein levels in other tissue types, especially during disease states.[7][46][47] Indeed, pQTLs, especially trans-pQTLs, are tissue-specific.[48]
Relative and absolute protein abundance
pQTL identification critically depends on how protein abundance is quantified. Most studies use relative protein quantification to compare protein levels across samples or individuals, using methods such as TMT, iTRAQ,[49] and LFQ because they are cost-effective and high-throughput. Each protein is normalized to an internal reference standard (e.g. relative fold changes), providing lower technical variance and increased statistical power in large-scale studies.[50] However, the normalization can sometimes mask subtle yet biologically significant changes in protein levels, potentially reducing the sensitivity for detecting pQTLs that exert modest effects. In contrast, absolute quantification techniques, such as LC-MS/MS and DIATPA, provide exact concentration measurements of proteins using heavy isotope-labeled peptides.[19] This precision allows researchers to detect finer differences in protein abundance, potentially revealing more complex regulatory effects on protein expression that might be overlooked with relative measures.[19]
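The idea of normalizing to an internal reference can be illustrated with a log2 ratio. This is a minimal sketch under stated assumptions: the intensities and sample names are invented, and real TMT, iTRAQ, or LFQ pipelines apply additional normalization steps that are not shown here.

```python
import math


def log2_ratio_to_reference(sample_intensity, reference_intensity):
    """Relative abundance of one protein as a log2 fold change versus a
    reference measurement of the same protein."""
    return math.log2(sample_intensity / reference_intensity)


# Hypothetical intensities for one protein; `reference` stands in for the
# internal reference standard.
reference = 1000.0
intensities = {"sample_1": 2000.0, "sample_2": 500.0, "sample_3": 1000.0}

relative = {name: log2_ratio_to_reference(value, reference)
            for name, value in intensities.items()}
print(relative)
```

The output is unitless: a value of 1 means twice the reference level, not a concentration. This is the distinction the paragraph above draws between relative and absolute quantification.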
Genotyping or sequencing
The choice between genotyping and sequencing also shapes the outcomes of pQTL studies by determining which genetic variants are captured. Traditional SNP genotyping focuses on known SNPs compiled into microarrays or panels and provides a rapid, cost-effective snapshot of genetic variation across the genome. This method is well-suited for detecting common variants (often in conjunction with imputation),[9] but may miss rare or novel mutations that could influence protein expression. Sequencing, on the other hand, offers a more comprehensive view by identifying both common and rare variants, including structural variations.[51]
Limitations
An important limitation of pQTL studies is the bias conferred by the data being used. Large population studies in the field of genomics overwhelmingly study populations of European ancestry.[52] Genetic variants that are relevant for a disease in a population with European ancestry may or may not be relevant in populations with non-European ancestry.[53][54][55] Recent studies have focused on diversifying the populations that are being studied to increase equitable access to disease-relevant research across populations.[56] For example, a 2023 study of Han Chinese participants identified 195 pQTLs.[43] Of the 60 pQTLs that have more evidence for gene-protein-phenotype associations in participants with Han Chinese ancestry, 75% have not been "prioritized" in Europeans, highlighting the need for continued diversification within the field.[43]
Another important caveat of pQTLs is that cis-pQTLs are often assumed to influence their cognate genes by direct genetic interactions (e.g. promoter regions).[22] However, this assumption should be validated as there are cases where pQTLs instead act through an intermediary mechanism to affect protein abundance.[57]
Srivastava H, Lippincott MJ, Currie J, Canfield R, Lam MPY, Lau E. Protein prediction models support widespread post-transcriptional regulation of protein abundance by interacting partners. PLoS Comput Biol. 2022 Nov 10;18(11):e1010702. doi: 10.1371/journal.pcbi.1010702. PMID: 36356032; PMCID: PMC9681107.