aoutools.prs

This submodule contains all tools related to polygenic risk score (PRS) calculations.

aoutools.prs.calculate_pgs(*, vds, output_path, pgs, build='GRCh38', config=PRSConfig(chunk_size=20000, samples_to_keep=None, weight_col_name='weight', log_transform_weight=False, include_n_matched=False, sample_id_col='person_id', split_multi=True, ref_is_effect_allele=False, strict_allele_match=True, detailed_timings=False), user_agent=None, verbose=False)

Downloads specified PGS Catalog scoring files and calculates PRS.

This function automates a controlled workflow:

  1. Download: Fetches scoring files from the PGS Catalog for a specific list of PGS IDs.

  2. Read: Parses the downloaded scoring files into Hail Tables.

  3. Calculate: Computes the Polygenic Risk Score(s) for each downloaded file and exports a single CSV file.

Notes

This function does not accept EFO traits or PGP publication IDs to prevent the unexpected download of a large number of scoring files, which may overwhelm the storage or computational resources of a typical workspace.

Parameters:
  • vds (hail.vds.VariantDataset) – A Hail VariantDataset containing the genotype data to be scored.

  • output_path (str) – A GCS path (e.g., ‘gs://bucket/results.csv’) for the output file.

  • pgs (str or iterable of str) – One or more PGS Catalog ID(s) (e.g., “PGS000771”) to download. This argument is required.

  • build (str, optional) – The genome build for harmonized scores (“GRCh37” or “GRCh38”). Defaults to “GRCh38”.

  • config (PRSConfig, optional) – A configuration object for calculation parameters.

  • user_agent (str, optional) – A custom user agent string for PGS Catalog API requests.

  • verbose (bool, default False) – Enable verbose logging for the download process.

Returns:

The output path if results are successfully written; otherwise, None.

Return type:

str or None

Raises:
  • ValueError – If the output_path is not a valid GCS path, or if a downloaded scoring file is empty, malformed, or contains duplicate variants.

  • TypeError – If config.samples_to_keep is an unsupported type.

  • Exception – If the download process fails due to network issues, invalid PGS IDs, or other errors from the underlying pgscatalog-download tool.
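A minimal usage sketch, assuming a Hail-enabled workspace with aoutools installed and a VDS that is already loaded. The helper names build_pgs_request and score_from_catalog are hypothetical and exist only for this illustration. The bucket path is the placeholder from the parameter description, and the local gs:// check is a simplified stand-in for whatever validation the function performs internally.

```python
def build_pgs_request(pgs_ids, output_path, build="GRCh38"):
    """Hypothetical helper: collect keyword arguments for calculate_pgs."""
    # Accept a single ID or any iterable of IDs, as the `pgs` parameter does.
    ids = [pgs_ids] if isinstance(pgs_ids, str) else list(pgs_ids)
    if not ids:
        raise ValueError("at least one PGS ID is required")
    # Simplified pre-check (assumption): the documented requirement is a GCS path.
    if not output_path.startswith("gs://"):
        raise ValueError("output_path must be a GCS path such as 'gs://bucket/results.csv'")
    if build not in ("GRCh37", "GRCh38"):
        raise ValueError("build must be 'GRCh37' or 'GRCh38'")
    return {"pgs": ids, "output_path": output_path, "build": build}


def score_from_catalog(vds, request):
    """Hypothetical wrapper: run calculate_pgs on an already-loaded VDS."""
    # Imported lazily so the sketch can be read and checked without Hail.
    from aoutools.prs import PRSConfig, calculate_pgs

    config = PRSConfig(include_n_matched=True)
    # Every argument is keyword-only in the documented signature.
    return calculate_pgs(vds=vds, config=config, **request)


request = build_pgs_request("PGS000771", "gs://bucket/results.csv")
```

Calling score_from_catalog(vds, request) would then return the output path, or None if nothing was written.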

aoutools.prs.calculate_prs(weights_table, vds, output_path, config=PRSConfig(chunk_size=20000, samples_to_keep=None, weight_col_name='weight', log_transform_weight=False, include_n_matched=False, sample_id_col='person_id', split_multi=True, ref_is_effect_allele=False, strict_allele_match=True, detailed_timings=False))

Calculates a Polygenic Risk Score (PRS) and exports the result to a file.

This function is the main entry point for the PRS calculation workflow. It processes a weights table in chunks, using a filter_intervals approach to select variants from the VDS for each chunk. Partial results are then converted to Pandas DataFrames and aggregated to produce the final score file.
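The chunk-and-aggregate idea can be pictured with a small pure-Python sketch. It is only an analogy: plain dictionaries stand in for the Hail and Pandas objects, the variant keys and dosages are made up, and chunked and aggregate_partial_scores are hypothetical names rather than functions of this module.

```python
from collections import defaultdict
from itertools import islice


def chunked(items, chunk_size):
    """Yield successive lists of at most chunk_size items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def aggregate_partial_scores(weights, dosages, chunk_size):
    """Sum per-chunk partial scores into one total per sample.

    weights: {variant: weight}
    dosages: {sample: {variant: effect-allele count}}
    """
    totals = defaultdict(float)
    for chunk in chunked(sorted(weights), chunk_size):
        # One partial result per chunk, analogous to one partial DataFrame.
        for sample, calls in dosages.items():
            totals[sample] += sum(weights[v] * calls.get(v, 0) for v in chunk)
    return dict(totals)


weights = {"chr1:100": 0.5, "chr1:200": -0.25, "chr2:300": 1.0}
dosages = {"s1": {"chr1:100": 2, "chr2:300": 1}, "s2": {"chr1:200": 1}}
scores = aggregate_partial_scores(weights, dosages, chunk_size=2)
```

Because the per-chunk results are simply added together, the totals do not depend on the chunk size; the chunk size only bounds how many variants are processed at once.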

Notes

By default (config.split_multi=True), this function prioritizes robustness over performance by splitting multi-allelic variants.

This split_multi process includes creating a minimal representation for variants. For example, a variant chr1:10075251 A/G in the weights table can be matched to a multi-allelic site in the VDS (e.g., alleles=[‘AGGGC’, ‘A’, ‘GGGGC’]): after splitting, the ‘AGGGC’ -> ‘GGGGC’ allele pair is reduced to its minimal form ([‘A’, ‘G’]), which matches the weights table entry.

The non-split path (config.split_multi=False) is a faster but less robust alternative. It relies on a direct string comparison of alleles and will fail to match the complex variant described above. Furthermore, if the weights table contains multiple entries for the same locus, the non-split path will arbitrarily select only one of them. This “power-user” option should only be used if you are certain that both your VDS and weights table contain only simple, well-matched, bi-allelic variants.
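The minimal-representation step can be illustrated in pure Python. This is a sketch of the usual suffix-then-prefix trimming rule, not the code this module or Hail runs; min_rep here is a hypothetical local function.

```python
def min_rep(pos, ref, alt):
    """Trim bases shared by ref and alt, keeping at least one base in each."""
    # Drop shared trailing bases first.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then drop shared leading bases, shifting the position accordingly.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# The two alternate alleles from the example above, one bi-allelic pair each.
snp_like = min_rep(10075251, "AGGGC", "GGGGC")
deletion = min_rep(10075251, "AGGGC", "A")
```

Here snp_like reduces to position 10075251 with alleles 'A' and 'G', which a direct string comparison against the unsplit alleles would miss, while deletion is already minimal and is left unchanged.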

Parameters:
  • weights_table (hail.Table) –

    A Hail table containing variant weights. Must contain the following columns:

    • chr: str

    • pos: int32

    • effect_allele: str

    • noneffect_allele: str

    • A column for the effect weight (float64), specified by weight_col_name.

  • vds (hail.vds.VariantDataset) – A Hail VariantDataset containing both variant and sample data.

  • output_path (str) – A GCS path (starting with ‘gs://’) to write the final comma-separated output file.

  • config (PRSConfig, optional) – A configuration object for all optional parameters. If not provided, default settings will be used. See the PRSConfig class for details on all available settings.

Returns:

The output path if results are successfully written; otherwise, None. The output file is a comma-separated text file with:

  • A sample ID column (as configured in config.sample_id_col)

  • prs: The calculated PRS value

  • n_matched (optional): The number of variants used to calculate the score, included if config.include_n_matched is True.
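As an illustration of the output layout only, the following sketch renders made-up scores as comma-separated text. The function write_scores is hypothetical, the values are invented, and the column order (sample ID, prs, n_matched) is an assumption based on the order in which the columns are listed above.

```python
import csv
import io


def write_scores(rows, sample_id_col="person_id", include_n_matched=False):
    """Render (sample_id, prs, n_matched) tuples as comma-separated text."""
    header = [sample_id_col, "prs"] + (["n_matched"] if include_n_matched else [])
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    for sample_id, prs, n_matched in rows:
        row = [sample_id, prs] + ([n_matched] if include_n_matched else [])
        writer.writerow(row)
    return buffer.getvalue()


# Invented values, shown for layout only.
rows = [(1000001, 0.75, 2), (1000002, -0.25, 1)]
text = write_scores(rows, include_n_matched=True)
```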

Return type:

str or None

Raises:
  • ValueError – If output_path is not a valid GCS path, or if the weights_table is empty after validation.

  • TypeError – If the config.samples_to_keep argument is of an unsupported type.

See also

PRSConfig

A configuration class that holds parameters for PRS calculation.

aoutools.prs.calculate_prs_batch(weights_tables_map, vds, output_path, config=PRSConfig(chunk_size=20000, samples_to_keep=None, weight_col_name='weight', log_transform_weight=False, include_n_matched=False, sample_id_col='person_id', split_multi=True, ref_is_effect_allele=False, strict_allele_match=True, detailed_timings=False))

Calculates multiple Polygenic Risk Scores (PRS) concurrently using a memory-efficient, per-score annotation approach.

This function performs a batch PRS calculation on a Hail VariantDataset, using chunked aggregation and optional sample filtering.

Parameters:
  • weights_tables_map (dict[str, hl.Table]) – A dictionary mapping score names to their corresponding PRS weights tables.

  • vds (hl.vds.VariantDataset) – A Hail VariantDataset containing both variant and sample data.

  • output_path (str) – A GCS path (starting with ‘gs://’) to write the final comma-separated output file.

  • config (PRSConfig, optional) – A configuration object for all optional parameters. If not provided, default settings will be used. See the PRSConfig class for details on all available settings.

Returns:

The path to the final PRS result file if successful; otherwise, None if no valid variants were found.

Return type:

str or None

Raises:
  • ValueError – If output_path is not a valid GCS path, or if a weights table in weights_tables_map is empty after validation.

  • TypeError – If the config.samples_to_keep argument is of an unsupported type.

See also

PRSConfig

A configuration class that holds parameters for PRS calculation.
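A usage sketch for the batch entry point, assuming a Hail-enabled workspace and PRS-CS weight files. The names check_score_names and score_batch are hypothetical, and the name check is an extra safeguard for this example rather than documented behaviour.

```python
def check_score_names(weights_tables_map):
    """Hypothetical pre-check: score names must be non-empty strings."""
    if not weights_tables_map:
        raise ValueError("weights_tables_map must contain at least one score")
    for name in weights_tables_map:
        if not isinstance(name, str) or not name.strip():
            raise ValueError(f"invalid score name: {name!r}")
    return list(weights_tables_map)


def score_batch(vds, weight_files, output_path):
    """Hypothetical wrapper: read several PRS-CS files and score them together.

    weight_files: {score name: path to a PRS-CS output file}
    """
    # Imported lazily; these calls need a Hail environment.
    from aoutools.prs import calculate_prs_batch, read_prscs

    tables = {name: read_prscs(path) for name, path in weight_files.items()}
    check_score_names(tables)
    return calculate_prs_batch(tables, vds, output_path)


# Placeholder objects stand in for Hail Tables in this check.
names = check_score_names({"height": object(), "bmi": object()})
```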

aoutools.prs.download_pgs(*, outdir, pgs=None, efo=None, pgp=None, build='GRCh38', efo_include_children=True, overwrite_existing_file=False, user_agent=None, verbose=False)

Download PGS Catalog scoring files to a local directory or GCS bucket.

This function detects whether the output path is local or on GCS (gs://). For a GCS path, it downloads the files to a temporary directory and then uploads them to GCS.

Parameters:
  • outdir (str or pathlib.Path) – Local directory or GCS bucket path (e.g., ‘gs://my-bucket/path’).

  • pgs (str or iterable of str, optional) – PGS Catalog ID(s) (e.g., “PGS000194”).

  • efo (str or iterable of str, optional) – EFO term(s) (e.g., “EFO_0004611”).

  • pgp (str or iterable of str, optional) – PGP publication ID(s).

  • build (str, optional) – Genome build (“GRCh37” or “GRCh38”), default “GRCh38”.

  • efo_include_children (bool, default True) – Whether to include descendant EFO terms.

  • overwrite_existing_file (bool, default False) – Overwrite existing files if newer versions exist.

  • user_agent (str, optional) – Custom user agent string.

  • verbose (bool, default False) – Enable verbose logging.

Returns:

None. The PGS Catalog scoring file(s) are saved to the specified output path.

Return type:

None

Raises:
  • FileNotFoundError – If local output directory does not exist.

  • ValueError – If none of pgs, efo, or pgp are provided.

  • Exception – On download or upload failure.
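A short sketch of how a caller might guard a download request before invoking the function. check_download_request and fetch_scores are hypothetical names. The two checks mirror behaviour described above (a ValueError when no selector is given, and gs:// detection), but the real function may implement them differently.

```python
def check_download_request(outdir, pgs=None, efo=None, pgp=None):
    """Hypothetical pre-check; returns 'gcs' or 'local' for the destination."""
    # Documented: ValueError if none of pgs, efo, or pgp are provided.
    if pgs is None and efo is None and pgp is None:
        raise ValueError("provide at least one of pgs, efo, or pgp")
    # Documented: paths starting with gs:// are treated as GCS destinations.
    return "gcs" if str(outdir).startswith("gs://") else "local"


def fetch_scores(outdir, **selectors):
    """Hypothetical wrapper around download_pgs (all arguments keyword-only)."""
    # Imported lazily so the sketch can be checked without aoutools installed.
    from aoutools.prs import download_pgs

    check_download_request(outdir, **selectors)
    download_pgs(outdir=outdir, **selectors)


destination = check_download_request("gs://my-bucket/path", pgs="PGS000194")
```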

aoutools.prs.read_prs_weights(file_path, header, column_map, delimiter=',', comment='#', keep_other_cols=False, validate_alleles=False, **kwargs)

Reads a file containing variant effect weights for PRS calculation.

This function requires an active Hail-enabled environment. It uses a flexible column_map dictionary to handle various input file formats. After standardizing the required columns, the function performs several validation checks: it filters out variants with missing weights or, if validate_alleles=True, invalid alleles, and it raises an error if duplicate variants are found.

If a local file path is provided, it is automatically copied to a temporary directory in your GCS bucket for Hail access.

Parameters:
  • file_path (str) – A path to the weight file (local or gs://).

  • header (bool) – If True, column_map values should be strings (column names). If False, column_map values should be 1-based integers (column indices).

  • column_map (dict) – A dictionary mapping standard names to user-defined names or indices. Must contain the keys: ‘chr’, ‘pos’, ‘effect_allele’, ‘noneffect_allele’, and ‘weight’.

    Example for header=True: {‘chr’: ‘CHR’, ‘pos’: ‘BP’, …}

    Example for header=False: {‘chr’: 1, ‘pos’: 2, …}

  • delimiter (str, default ',') – A field delimiter.

  • comment (str or list[str], default '#') – A character, or list of characters, that denote comment lines to be ignored.

  • keep_other_cols (bool, default False) – If True, all columns not specified in column_map are preserved.

  • validate_alleles (bool, default False) – If True, validates that allele columns contain only ACGT characters.

  • **kwargs (dict, optional) – Other keyword arguments to pass directly to hail.import_table, such as missing or min_partitions.

Returns:

A Hail Table with standardized columns ready for PRS calculation.

Return type:

hail.Table

Raises:
  • ValueError – If column_map is missing required keys, if the input file is empty, or if duplicate variants are found in the weights file.

  • TypeError – If the value types in column_map do not match the header setting (e.g., strings for header=True, integers for header=False).

  • FileNotFoundError – If a local file_path is provided and the file does not exist.
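The column_map rules above can be sketched as a standalone check. check_column_map is a hypothetical function written for this illustration, and the column names 'A1', 'A2', and 'BETA' are invented for a made-up input file; the reader's own validation may differ in detail.

```python
REQUIRED_KEYS = ("chr", "pos", "effect_allele", "noneffect_allele", "weight")


def check_column_map(column_map, header):
    """Hypothetical check: required keys present, value types match `header`."""
    missing = [key for key in REQUIRED_KEYS if key not in column_map]
    if missing:
        raise ValueError(f"column_map is missing required keys: {missing}")
    expected = str if header else int
    for key, value in column_map.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise TypeError(f"column_map[{key!r}] must be {expected.__name__} when header={header}")
        if not header and value < 1:
            raise ValueError("column indices are 1-based")
    return True


# Names for a file with a header row (column names are invented).
named = {"chr": "CHR", "pos": "BP", "effect_allele": "A1",
         "noneffect_allele": "A2", "weight": "BETA"}
# 1-based positions for a file without a header row.
positional = {"chr": 1, "pos": 2, "effect_allele": 3,
              "noneffect_allele": 4, "weight": 5}

named_ok = check_column_map(named, header=True)
positional_ok = check_column_map(positional, header=False)
```

With a map that passes this check, a call such as read_prs_weights(file_path, header=True, column_map=named) follows the documented signature.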

aoutools.prs.read_prscs(file_path, **kwargs)

A simple wrapper to read PRS-CS output files.

This function assumes a standard PRS-CS output format, which is a header-less, tab-separated file with the following columns:

  1. Chromosome

  2. Variant ID

  3. Base Position

  4. Effect Allele (A1)

  5. Non-Effect Allele (A2)

  6. Posterior Effect Size (weight)

Note: The second column (Variant ID) is not loaded by default, as it is not required for the core functionality. To preserve this and any other columns, set keep_other_cols=True when calling this function.

Parameters:
  • file_path (str) – A path to the PRS-CS output file.

  • **kwargs – Other optional arguments to pass to read_prs_weights, such as keep_other_cols or validate_alleles.

Returns:

A processed Hail Table of the PRS-CS weights.

Return type:

hail.Table
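The PRS-CS column layout can be written down as a 1-based column map and applied to a single line in pure Python. That the wrapper passes an equivalent map to read_prs_weights with header=False and a tab delimiter is an assumption; parse_prscs_line is a hypothetical helper, and the example line (including its variant ID) is invented.

```python
# 1-based positions taken from the column list above; column 2 (Variant ID) is skipped.
PRSCS_COLUMN_MAP = {
    "chr": 1,
    "pos": 3,
    "effect_allele": 4,
    "noneffect_allele": 5,
    "weight": 6,
}


def parse_prscs_line(line):
    """Hypothetical helper: split one tab-separated PRS-CS line into named fields."""
    fields = line.rstrip("\n").split("\t")
    record = {name: fields[index - 1] for name, index in PRSCS_COLUMN_MAP.items()}
    record["pos"] = int(record["pos"])
    record["weight"] = float(record["weight"])
    return record


# Invented example line: chromosome, variant ID, position, A1, A2, effect size.
example = "1\trs0000001\t12345\tA\tG\t1.5e-04"
record = parse_prscs_line(example)
```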

class aoutools.prs.PRSConfig(chunk_size=20000, samples_to_keep=None, weight_col_name='weight', log_transform_weight=False, include_n_matched=False, sample_id_col='person_id', split_multi=True, ref_is_effect_allele=False, strict_allele_match=True, detailed_timings=False)

A configuration class for Polygenic Risk Score (PRS) calculation.

chunk_size

The number of variants to include in each processing chunk.

Type:

int, default 20000

samples_to_keep

A collection of sample IDs to keep. Accepts a Hail Table; a Python list, set, or tuple of strings or integers; or a single string or integer. If None, all samples are retained.

Type:

Union[hl.Table, Sequence[str], Sequence[int], str, int], optional
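One way to picture the accepted Python inputs is a normalization step that turns each of them into a set of ID strings. normalize_sample_ids is a hypothetical function. Converting integers to strings is an assumption made for this sketch, and the Hail Table case is left out so the example runs without Hail.

```python
def normalize_sample_ids(samples_to_keep):
    """Hypothetical helper: return a set of ID strings, or None to keep everyone."""
    if samples_to_keep is None:
        return None

    def as_id(value):
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (str, int)):
            raise TypeError(f"unsupported sample ID type: {type(value).__name__}")
        return str(value)

    # A single string or integer is treated as one ID.
    if isinstance(samples_to_keep, (str, int)):
        return {as_id(samples_to_keep)}
    if isinstance(samples_to_keep, (list, set, tuple)):
        return {as_id(value) for value in samples_to_keep}
    raise TypeError(f"unsupported samples_to_keep type: {type(samples_to_keep).__name__}")


mixed = normalize_sample_ids([1000001, "1000002"])
single = normalize_sample_ids(42)
```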

weight_col_name

The name of the column in the weights table that contains the effect sizes or weights.

Type:

str, default ‘weight’

log_transform_weight

If True, applies a natural log transformation to the weight column. Useful when weights are odds ratios (OR), since PRS assumes additive effects on the log-odds scale.

Type:

bool, default False
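A small worked example of the transformation, using invented odds ratios. log_transform is a hypothetical function that applies the natural logarithm described above; it is not the module's own code.

```python
import math


def log_transform(odds_ratios):
    """Convert odds ratios to log-odds weights using the natural logarithm."""
    for name, value in odds_ratios.items():
        if value <= 0:
            raise ValueError(f"odds ratio for {name} must be positive, got {value}")
    return {name: math.log(value) for name, value in odds_ratios.items()}


# Invented odds ratios: no effect, doubled odds, halved odds.
odds_ratios = {"variant_a": 1.0, "variant_b": 2.0, "variant_c": 0.5}
betas = log_transform(odds_ratios)
```

An odds ratio of 1.0 maps to a weight of 0, and reciprocal odds ratios (2.0 and 0.5) map to weights of equal size and opposite sign, which is what makes the transformed weights additive.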

include_n_matched

If True, adds a column ‘n_matched’ with the number of variants matched between the weights table and the VDS. This option has a performance cost and should be used only when necessary.

Type:

bool, default False

sample_id_col

The column name to use for sample IDs in the final output table.

Type:

str, default ‘person_id’

split_multi

If True, splits multi-allelic variants in the VDS into bi-allelic variants prior to calculation.

Type:

bool, default True

ref_is_effect_allele

If True, assumes the effect allele in the weights file corresponds to the reference allele in the VDS. Used only when split_multi is True.

Type:

bool, default False

strict_allele_match

Used only when split_multi is False. If True, enforces that one allele in the weights table matches the reference allele in the VDS and the other allele is a valid alternate. If False, only the effect allele is checked to correspond to either the reference or an alternate allele, and the other allele is not verified.

Type:

bool, default True
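The two matching modes can be sketched as a single predicate. alleles_match is a hypothetical function that follows the wording of the description above; the module's actual comparison may differ.

```python
def alleles_match(effect, noneffect, ref, alts, strict=True):
    """Hypothetical predicate for matching a weights-table row to a VDS site.

    effect, noneffect: alleles from the weights table
    ref, alts: reference allele and list of alternate alleles from the VDS
    """
    if strict:
        # One allele must be the reference and the other a valid alternate.
        return (effect == ref and noneffect in alts) or (noneffect == ref and effect in alts)
    # Relaxed mode: only the effect allele is checked.
    return effect == ref or effect in alts


# Site with reference 'A' and alternate 'G'; the weights row lists G (effect) and T.
strict_result = alleles_match("G", "T", "A", ["G"], strict=True)
relaxed_result = alleles_match("G", "T", "A", ["G"], strict=False)
```

In this example the strict mode rejects the row because neither 'G' nor 'T' is the reference allele, while the relaxed mode accepts it because the effect allele 'G' is a known alternate.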

detailed_timings

If True, logs timing information for each major step. Helpful for diagnosing performance issues.

Type:

bool, default False