How-To: Use the calculate_pgs Workflow¶
The calculate_pgs function provides a streamlined, all-in-one workflow for
calculating Polygenic Risk Scores (PRS) directly from the PGS Catalog. It
automates the entire process, from downloading scoring files to calculating the
final scores.
This function is ideal when you know the specific PGS Catalog ID(s) you want to analyze and prefer a single command to handle all the intermediate steps.
How It Works¶
The calculate_pgs function simplifies the PRS calculation process by
combining three key steps into a single call:
- Download: It begins by calling the
download_pgsfunction internally to fetch the specified scoring files from the PGS Catalog and saves them to a temporary directory.
- Download: It begins by calling the
- Read: It automatically reads each downloaded scoring file into a Hail
Table, correctly mapping the standard PGS Catalog column names (e.g., hm_chr, hm_pos, effect_allele, other_allele, effect_weight). If a file is malformed or cannot be read, it will be skipped with a warning.
- Calculate: Finally, it uses the efficient
calculate_prs_batch function to calculate all requested PRS in a single pass, which minimizes reads of the Hail VDS.
- Calculate: Finally, it uses the efficient
Basic Usage¶
The following example demonstrates how to download two scores from the PGS Catalog and calculate them for all samples in your VDS.
import os
import pandas as pd
import hail as hl
# Import the workflow function
from aoutools.prs import calculate_pgs
# Initiate Hail and get bucket path
hl.default_reference(new_default_reference="GRCh38")
bucket = os.getenv("WORKSPACE_BUCKET")
# Load the VDS
vds = hl.vds.read_vds(os.getenv("WGS_VDS_PATH"))
# Define the PGS IDs to calculate
pgs_ids_to_calculate = ("PGS000196", "PGS000771")
# Run the end-to-end workflow
result_path = calculate_pgs(
vds=vds,
output_path=f"{bucket}/pgs_catalog_scores.csv",
pgs=pgs_ids_to_calculate,
build="GRCh38"
)
# Check the result
pd.read_csv(result_path).head()
Important Considerations¶
Input is Limited to PGS IDs: To prevent accidental downloads of a large number of files, this function only accepts PGS Catalog IDs (e.g., “PGS000123”). It does not support querying by EFO traits or PGP publication IDs.
Customization: For advanced calculation options, you can pass a
PRSConfigobject to theconfigargument, just as you would withcalculate_prsorcalculate_prs_batch.Underlying Functions: This workflow is a wrapper around other functions in the library. For more fine-grained control over downloading, reading, or calculation, you can use
download_pgs,read_prs_weights, andcalculate_prs_batchseparately.