How-To: Using the download_pgs Function
The download_pgs function is used to download harmonized score files from
the PGS Catalog.
How It Works
The download_pgs function is designed to resolve a specific dependency
conflict that can arise in the All of Us Researcher Workbench.
The core issue is a version mismatch between two important libraries:
dsub, which is used for job scheduling, and pgscatalog.core, which is
used for downloading PGS data. They both rely on a shared dependency called
tenacity, but they require different, conflicting versions, so installing
both into the same environment would cause errors.
To solve this, download_pgs works as follows:
1. Isolated Environment: The first time you call the function, it
automatically creates a separate Python virtual environment using
venv. This keeps the dependencies for pgscatalog.core completely
isolated from your main notebook environment.
2. Automatic Installation: Inside this new, isolated environment, it
automatically installs pgscatalog.core and its required version of
tenacity.
3. Execution: It then runs the necessary download commands from within this isolated, managed environment.
This method allows you to use both the dsub job scheduler and the
download_pgs function in the same project without any dependency conflicts.
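The three steps above can be illustrated with a minimal sketch built only on the standard-library venv and subprocess modules. This is not the actual implementation of download_pgs, which this page does not show; the helper names env_python, ensure_env, and run_isolated are hypothetical and exist only for illustration.

```python
import os
import subprocess
import venv
from pathlib import Path


def env_python(env_dir: Path) -> Path:
    """Return the path of the Python interpreter inside a virtual environment."""
    if os.name == "nt":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def ensure_env(env_dir: Path, packages: list[str]) -> Path:
    """Create the environment and install packages on first use only."""
    python = env_python(env_dir)
    if not python.exists():
        # Step 1: create the isolated environment, with pip available in it.
        venv.EnvBuilder(with_pip=True).create(env_dir)
        # Step 2: install into the new environment, not the notebook's own.
        subprocess.run(
            [str(python), "-m", "pip", "install", *packages], check=True
        )
    return python


def run_isolated(env_dir: Path, packages: list[str], args: list[str]) -> None:
    """Step 3: run a command with the isolated interpreter."""
    python = ensure_env(env_dir, packages)
    subprocess.run([str(python), *args], check=True)


# Hypothetical usage (requires network access, so it is left commented out):
# run_isolated(
#     Path.home() / ".aoutools" / "pgscatalog_env",
#     ["pgscatalog.core"],
#     ["-m", "pip", "show", "tenacity"],
# )
```

Because the child process uses the interpreter inside the environment, whatever version of tenacity is installed there is never imported into the notebook kernel, which is why dsub in the main environment is unaffected.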
Customizing the Virtual Environment Path
By default, the virtual environment is created in your home directory at
~/.aoutools/pgscatalog_env.
You can override this location by setting the AOUTOOLS_PGS_ENV_DIR
environment variable. This must be set before the download_pgs function is
called for the first time. To set this in a Jupyter Notebook, you can use the
os module.
For example, to create the environment in a different directory, add the following to a notebook cell:
import os
# Set the environment variable to a custom path before importing aoutools
os.environ['AOUTOOLS_PGS_ENV_DIR'] = '/path/to/your/custom/env/dir'
# Now you can import and use the function
from aoutools.prs import download_pgs
# When this is called, the virtual environment will be created
# at the custom path specified above.
download_pgs(outdir='your_output_directory', pgs='PGS000001')
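The path lookup described in this section might look roughly like the following sketch. The function name resolve_env_dir is hypothetical, and the library's actual logic, including whether the variable is read at import time or at call time, may differ. Setting the variable before importing aoutools, as in the example above, works in either case.

```python
import os
from pathlib import Path

# Default location stated in this guide.
DEFAULT_ENV_DIR = Path.home() / ".aoutools" / "pgscatalog_env"


def resolve_env_dir() -> Path:
    """Return the custom path if AOUTOOLS_PGS_ENV_DIR is set, else the default."""
    custom = os.environ.get("AOUTOOLS_PGS_ENV_DIR")
    if custom:
        # Expand a leading "~" so values like "~/envs/pgs" also work.
        return Path(custom).expanduser()
    return DEFAULT_ENV_DIR
```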