How-To: Calculate a Polygenic Risk Score (PRS)¶
This guide shows you how to calculate one or more Polygenic Risk Scores (PRS) using the aoutools.prs submodule.
Note
To get started, you’ll need a Dataproc cluster set up for Hail Genomic Analysis. A good starting point is a master node with 8 CPUs, 52 GB RAM, and 300 GB storage, plus 50 preemptible workers with 4 CPUs, 15 GB RAM, and 300 GB storage each. This setup costs around $6 per hour and is enough to handle PRS weights from about 1 million variants. The setting was tested on Whole Genome Sequencing (WGS) data from the All of Us V8 release, using all available samples and without any variant or sample filtering. If you plan to run batch calculations with several large-scale weight files, you need a more powerful setup.
Given the specified workspace configuration, the tutorial requires only 1–2 minutes for both single and batch PRS calculations, not including the file-reading time, which is under a minute.
Setup¶
import os
import logging
from importlib import resources
import pandas as pd
import hail as hl
# Import functions to calculate PRS
from aoutools.prs import (
read_prs_weights,
read_prscs,
calculate_prs,
calculate_prs_batch,
PRSConfig
)
# Set logging level to INFO to view logs and check function correctness.
logging.basicConfig(level=logging.INFO)
# Bucket path
bucket = os.getenv("WORKSPACE_BUCKET")
# Get paths for example data
data_dir_path = str(resources.files("aoutools.data"))
prs_weights_header = f"{data_dir_path}/prs_weights_header.csv"
prs_weights_noheader = f"{data_dir_path}/prs_weights_noheader.tsv"
# Initiate Hail
hl.default_reference(new_default_reference="GRCh38")
# VDS file
vds = hl.vds.read_vds(os.getenv("WGS_VDS_PATH"))
Step 1: Reading PRS Weights Files¶
The read_prs_weights function is a flexible tool for importing weights files
into a validated Hail Table. It uses a column_map dictionary to handle
different file structures.
Example 1: File with a header
# Define a map from your file's column names to the required names
column_map_header = {
'chr': 'CHR',
'pos': 'POS',
'effect_allele': 'A1',
'noneffect_allele': 'A2',
'weight': 'WEIGHT'
}
weights_ht_header = read_prs_weights(
file_path=prs_weights_header,
header=True,
column_map=column_map_header
)
Note
The files prs_weights_header and prs_weights_noheader are local, but
Hail cannot access them directly from a local Jupyter environment. Therefore,
the read_prs_weights function automatically stages an input file to a
temporary Google Cloud Storage (GCS) location at
gs://your-workspace-bucket/data/temp_prs_data so that Hail can access them.
However, it is recommended for users to upload the input files to a GCS
bucket and provide a path that starts with ‘gs://’.
Example 2: Header-less file (like PRScs output)
The column_map uses 1-based integer indices instead of names. For convenience,
you can use the read_prscs wrapper if your weight file is generated from
PRScs.
# Using the main function
column_map_noheader = {
'chr': 1,
'pos': 3,
'effect_allele': 4,
'noneffect_allele': 5,
'weight': 6
}
weights_ht_noheader = read_prs_weights(
file_path=prs_weights_noheader
header=False,
column_map=column_map_noheader,
delimiter='\t'
)
# Using the convenient wrapper for PRS-CS files
prscs_ht = read_prscs(
file_path=prs_weights_noheader
)
Step 2: Calculating a Single PRS¶
Once you have a weights table and the All of Us VDS loaded, you can calculate a
PRS. To see more detailed log information, set
PRSConfig(detailed_timings=True) and pass it to the config argument.
# Assume 'weights_ht_header' is a Hail Table from Step 1
prs_single = calculate_prs(
weights_table=weights_ht_header,
vds=vds,
output_path=f"{bucket}/single_prs.csv"
)
# Check the result
pd.read_csv(prs_single).head()
Advanced: Handling Odds Ratios (OR)
If your weights file uses Odds Ratios, the function can log-transform them into BETA values.
config_or = PRSConfig(
weight_col_name='OR',
log_transform_weight=True
)
Tip: Batch PRS Calculation¶
To calculate multiple scores efficiently, use calculate_prs_batch. This is
highly recommended as it reads the VDS only once.
# Create a dictionary mapping score names to their weights tables
weights_tables_map = {
'prs1': weights_ht_header,
'prs2': weights_ht_noheader,
}
# Calculate all scores in a single pass
prs_batch = calculate_prs_batch(
weights_tables_map=weights_map,
vds=vds,
output_path=f"{bucket}/batch_prs.csv"
)
# Check the result
pd.read_csv(prs_batch).head()