A model for health risks prediction based on inherited DNA variation and clinical data

Dr. Mykyta Artomov – Massachusetts General Hospital, Broad Institute, USA

Simply Put

Complex interactions between inherited disease risks, lifestyle, and environment contribute significantly to individual trajectories of aging. A large amount of physiological and genetic data is required to predict the critical timepoints of aging for a single individual: age of disease onset, overall health deterioration, and age of death. We aim to investigate such data from more than 180,000 individuals in Finland to understand which features define aging and contribute to the risks of diseases that impede healthy aging. The relevant components identified could then be brought together in a single statistical model to predict disease onset age and mortality.


Aging is commonly associated with an increased risk of onset of complex phenotypes, such as cardiovascular diseases, diabetes, and cancer. In this context, aging outcomes could be defined as the age of major disease onset, the age of critical health deterioration, or the age of death. Because of the high complexity of the genetic architecture and its interactions with the environment, modeling aging phenotypes requires large amounts of detailed clinical and genetic data.

In this project, data from several biobanks will be used to build a predictor for the individual risks of various aging outcomes. The available collections of exome sequencing, genotyping, and clinical record data enable both the construction of a predictor for disease onset age/mortality and its rigorous validation.

First, the data from Finnish biobanks assembled under the FinnGen project, one of the largest and most homogeneous genetic and clinical data collections in the world, will be analyzed. In this project, more than 180,000 FinnGen participants with genetic data generated on genotyping arrays and clinical data with 3,882 data points per sample will be considered.

The clinical data for subsets of individuals with the most complete phenotypic information and a well-defined time of disease onset will be used:

  • to identify a set of phenotypic features describing common diseases of aging, such as cardiovascular problems and cancer
  • to work out an optimal protocol for handling clinical data
  • to define a set of the clinical features most relevant for risk prediction that could be tested in currently healthy individuals
  • to design an optimal model for integrating clinical features in health risk predictions

Separately, inherited genetic features will be analyzed using genotyping arrays that cover the entire genome with about 1,000,000 DNA variants. Genome-wide association studies (GWAS) on FinnGen data allow evaluation of inherited risks contributing to the age of disease onset and mortality in the Finnish population. A polygenic risk score (PRS) model will initially be used to measure each individual's inherited predisposition to a phenotype, conferred by the cumulative effect of relevant DNA alterations. Advanced statistical techniques will then be used to develop a more complex model for estimating individual inherited disease risks from the genotype data.
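
To make the PRS idea concrete, here is a minimal sketch that computes a score as a weighted sum of effect-allele dosages, which is the standard additive form of a polygenic risk score. The variant identifiers, effect sizes, and sample names are hypothetical placeholders, not FinnGen data, and `polygenic_risk_score` is an illustrative helper rather than the project's actual implementation.

```python
from typing import Dict


def polygenic_risk_score(dosages: Dict[str, float],
                         weights: Dict[str, float]) -> float:
    """Return the sum of weight * dosage over variants present in both maps.

    dosages: effect-allele count per variant (0, 1 or 2 for array genotypes).
    weights: per-variant effect size, e.g. a log odds ratio from a GWAS.
    Variants missing from `dosages` are skipped (an assumption of this sketch).
    """
    return sum(weights[v] * dosages[v] for v in weights if v in dosages)


# Hypothetical effect sizes for three variants (not real GWAS results).
weights = {"rs_a": 0.5, "rs_b": -0.25, "rs_c": 0.125}

# Hypothetical genotypes for two individuals.
cohort = {
    "sample_1": {"rs_a": 2, "rs_b": 0, "rs_c": 1},
    "sample_2": {"rs_a": 0, "rs_b": 2, "rs_c": 0},
}

scores = {sample_id: polygenic_risk_score(genotype, weights)
          for sample_id, genotype in cohort.items()}

for sample_id, score in scores.items():
    print(sample_id, score)
```

In practice the raw score is usually standardized within the cohort before being combined with other predictors, since its absolute scale depends on how many variants are included.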

Initially, the analysis will be performed separately for inherited genetic features and clinical features to investigate technical specifics and the relative contribution of each feature type to the overall disease onset age/mortality. Next, the relevant inherited risk features will be integrated with the relevant clinical data in a unified statistical model. The design and performance of candidate models will be investigated to create an optimal predictor. The effect of gender on health prediction accuracy will be estimated to determine whether gender should be included as a feature in a unified model or whether building separate predictors for each gender achieves better predictive power.

After the statistical model with the best combination of important features has been validated on FinnGen data, the transferability of the model to other ancestries and cultural backgrounds will be investigated using similar datasets. Specifically, a well-phenotyped cohort from Russia will be used to replicate the findings and investigate the effects of genetic diversity outside the Finnish population.


The project started on April 1, 2021.