Getting Started
This short overview of GWASStudio will help you get started.
Usage on the HT-HPC
To use GWASStudio on the HT computing cluster, simply type:
source /exchange/healthds/singularity_functions
Verify with:
gwasstudio --version
Vault token
To securely access (meta)data according to your user permissions, you will be provided with a vault token.
To authenticate automatically, please save your token in the following file:
${HOME}/.vault-token
NOTE:
- The vault token is personal and confidential. Do not share it with other users.
- If ${HOME}/.vault-token is missing, you will be prompted to paste the token manually during command execution.
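Because the token is confidential, it can help to create the file so that only your own user can read it. The following is a minimal sketch, not part of GWASStudio: the helper name save_vault_token and the choice of permission mode 600 are assumptions made for illustration.

```shell
# Hypothetical helper (not a GWASStudio command): store a token in a file
# that only the owner can read or write.
# The body runs in a subshell, so the umask change does not leak out.
save_vault_token() (
    token="$1"
    target="${2:-${HOME}/.vault-token}"
    umask 077                          # new files get no group/other access
    printf '%s' "$token" > "$target"   # no trailing newline is written
    chmod 600 "$target"                # also tighten a pre-existing file
)
```

Usage: save_vault_token "<your-token>" writes to ${HOME}/.vault-token by default; an optional second argument selects a different path.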
Main commands
GWASStudio has four main commands for users:
- list: list available/accessible data
- meta-query: query metadata of interest
- ingest: ingest summary-statistics file(s)
- export: export data of interest
1. list
The list command displays all summary statistics available in MongoDB that your access permissions allow you to see (see Vault token), organized as category → project → study.
List example
gwasstudio list
List output example
Category: GWAS
Project: opengwas
Studies: ukb-a, ukb-b, ukb-d
2. meta-query
The meta-query command retrieves metadata of interest using a query file. It can be used to verify the availability and characteristics of the data to export.
Meta-query example
gwasstudio meta-query --search-file query_ex01.txt --output-prefix output_query_ex01
The output is a metadata table named output_query_ex01.csv with records filtered by the query file query_ex01.txt.
For a detailed explanation of all command options, see also meta-query command.
Query file
The query file used to retrieve (meta)data follows a structured format with two sections:
- Filtering criteria: metadata fields used to query the database, specified as metadata field: filtering value pairs
- Output specification (output:): a list of valid metadata fields to include in the output
Query file example
project: opengwas
study: ukb-d
trait:
- desc: Z42
- desc: pregnancy
output:
- build
- population
- notes_sex
- notes_source_id
- total_samples
- total_cases
- total_controls
- trait_desc
This query file searches within the ukb-d study of the opengwas project for all trait descriptions containing Z42 or pregnancy, and returns a table with the columns listed in the output: section.
NOTES:
- Filtering values can include partial matches (e.g. trait descriptions containing Z42 or pregnancy)
- Filtering values are processed by lowercasing and replacing special characters before being used to query the database
- It is possible to query across different projects and studies by not specifying project and study in the query file
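As an illustration of querying across projects and studies, the sketch below writes a query file that omits the project and study keys. The file name query_all_projects.txt and the selection of output fields are placeholders chosen for this example; the field names themselves are taken from the query file example above.

```shell
# Write a query file without project/study keys, so the search is not
# restricted to one project or study. The file name is a placeholder.
cat > query_all_projects.txt <<'EOF'
trait:
- desc: pregnancy
output:
- build
- population
- total_samples
- trait_desc
EOF
```

The file can then be passed to meta-query in the same way as before, for example: gwasstudio meta-query --search-file query_all_projects.txt --output-prefix output_query_all_projects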
Meta-query output example
| project | study | category | data_id | build | population | notes_sex | notes_source_id | total_samples | total_cases | total_controls | trait_desc |
|---|---|---|---|---|---|---|---|---|---|---|---|
| opengwas | ukb-d | GWAS | 47e96deafe | GRCh38 | European | Males and Females | ukb-d-XV_PREGNANCY_BIRTH | 361194 | 11959 | 349235 | "Pregnancy, childbirth and the puerperium" |
| opengwas | ukb-d | GWAS | 531f0d4bcc | GRCh38 | European | Males and Females | ukb-d-Z42 | 361194 | 1963 | 359231 | Diagnoses - main ICD10: Z42 Follow-up care involving plastic surgery |
| opengwas | ukb-d | GWAS | cc18ce8683 | GRCh38 | European | Males and Females | ukb-d-O26 | 361194 | 1289 | 359905 | Diagnoses - main ICD10: O26 Maternal care for other conditions predominantly related to pregnancy |
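Before exporting, a quick look at the metadata table can confirm which columns and how many records a query returned. The helpers below are a rough sketch, not GWASStudio features. They assume the output is comma-delimited with a single header line, that column names contain no commas, and that no quoted field spans several lines.

```shell
# Hypothetical helpers for a first look at a meta-query CSV.
# Assumptions: comma delimiter, one header line, no commas in column names,
# no line breaks inside quoted fields.

# Print one column name per line.
csv_columns() { head -n 1 "$1" | tr ',' '\n'; }

# Count data records (every line after the header).
csv_record_count() { tail -n +2 "$1" | wc -l | tr -d ' '; }
```

Usage: csv_columns output_query_ex01.csv lists the column names, and csv_record_count output_query_ex01.csv prints the number of matching records.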
3. ingest
The ingest command stores harmonized summary-statistics file(s) in a TileDB dataset, using the related metadata (which includes the source file paths) and the specified destination path.
For a detailed explanation of input formatting, see Summary-statistics columns and Metadata fields.
Ingest example
gwasstudio ingest --file-path metadata_ukb_d_sampled.tsv --uri destination
This command creates a folder named destination, containing the summary-statistics data stored in TileDB format.
For a detailed explanation of all command options, see also ingest command.
4. export
The export command extracts summary-statistics records (and associated metadata) from TileDB as specified in the query file.
Enter compute node
The export command is computationally intensive, so it must be run on a compute node.
To enter a compute node, run the following command:
salloc --partition=cpu-interactive --nodes=1 --ntasks-per-node=2 --mem-per-cpu=2048M --time=12:00:00
Full stats
The export command, when used without any filtering options, will export the full set of summary statistics.
gwasstudio export --search-file query_ex01.txt
Filtering options
Exports can also be performed with different filtering options.
Region and SNP filtering
Command example to export data by filtering regions and SNPIDs provided in region_or_snp_list.tsv:
gwasstudio export --search-file query_ex01.txt --get-regions-snps region_or_snp_list.tsv
The list of regions and SNPs to filter should preferably be in BED format. Example: regions_query.tsv.
Alternatively, SNPIDs can also be listed in CHR,POS format. Example: hapmap3_snps.csv.
Trait-specific lead-SNP search
Given an input table trait_snps_list.csv (SOURCE_ID,CHR,POS,EA,NEA), the --get-regions-leadsnps option creates a window of the width given by --region-width and extracts from this region the statistics (MLOG10P, BETA, SE) of:
- the lead SNP, i.e. the SNPID with the most significant P-value;
- the exact SNP, i.e. the exact CHR:POS:EA:NEA of the input. Note that the input SNPs must be harmonized to alphabetically ordered alleles.
gwasstudio export --search-file query_trait_snps.yml --get-regions-leadsnps trait_snps_list.csv --region-width 500000
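Since the input SNPs must use alphabetically ordered alleles, a table with unordered alleles can be normalized before running the command. The sketch below is a hypothetical helper, not part of GWASStudio. It assumes a comma-delimited file with a header line and the column order SOURCE_ID,CHR,POS,EA,NEA, and it takes "alphabetically ordered" to mean that EA sorts before NEA. It only swaps the two allele columns; any effect sizes kept elsewhere would need their sign flipped separately.

```shell
# Hypothetical helper: swap EA and NEA on rows where they are not in
# alphabetical order. Input: headered CSV with SOURCE_ID,CHR,POS,EA,NEA.
order_alleles() {
    awk 'BEGIN { FS = OFS = "," }
         NR == 1 { print; next }   # keep the header unchanged
         {
             if ($4 > $5) { tmp = $4; $4 = $5; $5 = tmp }
             print
         }' "$1"
}
```

Usage: order_alleles trait_snps_list.csv > trait_snps_list.ordered.csv, where the output file name is a placeholder.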
P-value filtering
Command example to export data by filtering based on a P-value threshold (in -log10 format):
gwasstudio export --search-file query_ex01.txt --pvalue-thr 4
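The threshold is given on the -log10 scale, so a value of 4 corresponds to P = 1e-4. The sketch below converts a raw P-value into that scale; the helper name p_to_mlog10p is an assumption for illustration and is not a GWASStudio command.

```shell
# Hypothetical helper: convert a P-value to the -log10 scale.
# awk's log() is the natural logarithm, so divide by log(10).
p_to_mlog10p() {
    awk -v p="$1" 'BEGIN { printf "%.2f\n", -log(p) / log(10) }'
}

p_to_mlog10p 1e-4   # prints 4.00
p_to_mlog10p 5e-8   # prints 7.30
```

For example, the commonly used genome-wide significance level of 5e-8 corresponds to a threshold of about 7.30.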
Locusbreaker
Command example to export data with locusbreaker:
gwasstudio export --search-file query_ex01.txt --locusbreaker
For a detailed explanation of all command options, see also export command.