How-To: Use the calculate_pgs Workflow

The calculate_pgs function provides a single-call workflow for calculating polygenic risk scores (PRS) from scoring files published in the PGS Catalog. It automates the entire process, from downloading the scoring files to calculating the final scores.

This function is ideal when you know the specific PGS Catalog ID(s) you want to analyze and prefer a single command to handle all the intermediate steps.

How It Works

The calculate_pgs function simplifies the PRS calculation process by combining three key steps into a single call:

  1. Download: It begins by calling the download_pgs function internally to fetch the specified scoring files from the PGS Catalog and save them to a temporary directory.

  2. Read: It automatically reads each downloaded scoring file into a Hail Table, mapping the standard PGS Catalog column names (e.g., hm_chr, hm_pos, effect_allele, other_allele, effect_weight). If a file is malformed or cannot be read, it is skipped with a warning.

  3. Calculate: Finally, it uses the calculate_prs_batch function to calculate all requested PRS in a single pass, which minimizes reads of the Hail VDS.
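To make the Read step more concrete, the following is a minimal sketch of the read-or-skip behavior using pandas rather than Hail. It is not the library's implementation: the helper read_scoring_file, the sample file contents, and the decision to treat all five columns as required are assumptions made for illustration. It relies only on the fact that PGS Catalog scoring files are tab-separated with a metadata header whose lines start with "#".

```python
import io
import warnings

import pandas as pd

# Columns named in the Read step; treating all of them as required is an
# assumption made for this sketch.
REQUIRED_COLUMNS = ["hm_chr", "hm_pos", "effect_allele", "other_allele", "effect_weight"]


def read_scoring_file(handle):
    """Hypothetical helper: read one scoring file, or return None if unusable."""
    try:
        # Metadata header lines start with '#', so they are skipped as comments.
        table = pd.read_csv(handle, sep="\t", comment="#", dtype=str)
    except (pd.errors.ParserError, pd.errors.EmptyDataError) as err:
        warnings.warn(f"Skipping unreadable scoring file: {err}")
        return None

    missing = [col for col in REQUIRED_COLUMNS if col not in table.columns]
    if missing:
        warnings.warn(f"Skipping scoring file; missing columns: {missing}")
        return None

    # Keep only the mapped columns and convert the numeric ones.
    table = table[REQUIRED_COLUMNS].copy()
    table["hm_pos"] = pd.to_numeric(table["hm_pos"])
    table["effect_weight"] = pd.to_numeric(table["effect_weight"])
    return table


# Made-up file contents, for illustration only.
GOOD = (
    "#pgs_id=EXAMPLE\n"
    "rsID\thm_chr\thm_pos\teffect_allele\tother_allele\teffect_weight\n"
    "rs1\t1\t1000\tA\tG\t0.12\n"
    "rs2\tX\t2000\tC\tT\t-0.05\n"
)
BAD = "chr\tpos\tweight\n1\t1000\t0.1\n"

good = read_scoring_file(io.StringIO(GOOD))

# Capture the warning so the skip is visible without interrupting the run.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    bad = read_scoring_file(io.StringIO(BAD))
```

Returning None for an unusable file lets the caller carry on with the remaining scores, which mirrors the skip-with-a-warning behavior described in step 2.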

Basic Usage

The following example demonstrates how to download two scores from the PGS Catalog and calculate them for all samples in your VDS.

import os
import pandas as pd
import hail as hl

# Import the workflow function
from aoutools.prs import calculate_pgs

# Set the default reference genome and get the workspace bucket path
hl.default_reference(new_default_reference="GRCh38")
bucket = os.getenv("WORKSPACE_BUCKET")

# Load the VDS
vds = hl.vds.read_vds(os.getenv("WGS_VDS_PATH"))

# Define the PGS IDs to calculate
pgs_ids_to_calculate = ("PGS000196", "PGS000771")

# Run the end-to-end workflow
result_path = calculate_pgs(
    vds=vds,
    output_path=f"{bucket}/pgs_catalog_scores.csv",
    pgs=pgs_ids_to_calculate,
    build="GRCh38"
)

# Check the result
pd.read_csv(result_path).head()

Important Considerations

  • Input is Limited to PGS IDs: To prevent accidental downloads of a large number of files, this function only accepts PGS Catalog IDs (e.g., “PGS000123”). It does not support querying by EFO traits or PGP publication IDs.

  • Customization: For advanced calculation options, you can pass a PRSConfig object to the config argument, just as you would with calculate_prs or calculate_prs_batch.

  • Underlying Functions: This workflow is a wrapper around other functions in the library. For more fine-grained control over downloading, reading, or calculation, you can use download_pgs, read_prs_weights, and calculate_prs_batch separately.
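Because only PGS Catalog IDs are accepted, it can help to check identifiers before starting a run. The sketch below is one way to do that; split_pgs_ids is a hypothetical helper, not part of the library, and how calculate_pgs validates its own input is not shown here. It assumes only the PGS Catalog convention that score IDs consist of "PGS" followed by six digits.

```python
import re

# PGS Catalog score IDs are "PGS" followed by six digits, e.g. PGS000196.
PGS_ID_PATTERN = re.compile(r"PGS\d{6}")


def split_pgs_ids(identifiers):
    """Hypothetical helper: separate well-formed PGS score IDs from the rest."""
    valid, rejected = [], []
    for identifier in identifiers:
        if PGS_ID_PATTERN.fullmatch(identifier):
            valid.append(identifier)
        else:
            rejected.append(identifier)
    return valid, rejected


# An EFO trait ID and a PGP publication ID are rejected; the score IDs pass.
valid, rejected = split_pgs_ids(
    ["PGS000196", "PGS000771", "EFO_0001645", "PGP000001"]
)
```

The valid list can then be passed as the pgs argument, while the rejected list shows which identifiers would need to be resolved to score IDs some other way first.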