Complete GWAS summary datasets are now abundant. A large repository of curated, harmonised and QC'd datasets is available in the IEU GWAS database. These datasets can be queried directly via the API, through the ieugwasr R package, or through the ieugwaspy Python package. However, for faster querying, particularly in an HPC environment, it is advantageous to access the data files directly rather than through cloud systems.
We developed a format for storing and harmonising GWAS summary data known as GWAS VCF format. All the data in the IEU GWAS database is available for download in this format. This R package provides fast and convenient functions for querying and creating GWAS summary data in GWAS VCF format. This package includes:
- a wrapper around the Bioconductor VariantAnnotation package, providing functions tailored to GWAS VCF for reading, querying, creating and writing GWAS VCF files
- LD-related functions that use a reference panel to extract proxies, create LD matrices and perform LD clumping
- functions for harmonising a dataset against the reference genome and creating GWAS VCF files.
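As a rough illustration of the reading and querying functions listed above, the following is a minimal sketch rather than a definitive workflow. It assumes that gwasvcf and VariantAnnotation are installed and that the small example file data.vcf.gz is shipped in the package's extdata directory; the region queried is chosen for illustration only.

```r
# Minimal sketch: read and query a GWAS VCF file.
# Assumptions: gwasvcf and VariantAnnotation are installed, and the example
# file data.vcf.gz is available under the package's extdata directory.
library(gwasvcf)
library(VariantAnnotation)

# Locate the bundled example file (returns "" if it is not found)
vcffile <- system.file("extdata", "data.vcf.gz", package = "gwasvcf")

# Read the whole file into a VCF object
vcf <- readVcf(vcffile)

# Query the file by genomic region (illustrative coordinates on chromosome 1)
region_hits <- query_gwas(vcffile, chrompos = "1:1097291-1099437")

# Convert the query result to a tibble for downstream use
region_tbl <- vcf_to_tibble(region_hits)
print(region_tbl)
```

query_gwas() can also select variants by rsid or by a p-value threshold through its rsid and pval arguments; see the vignettes for the full set of options.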
See also the gwasglue R package for methods to connect the VCF data to Mendelian randomization, colocalisation, fine mapping etc.
Install the package from GitHub by running the following in R: remotes::install_github("mrcieu/gwasvcf")
See vignettes here: https://mrcieu.github.io/gwasvcf.
If you use GWAS VCF files, please cite the studies whose data you use, along with the following paper:
The variant call format provides efficient and robust storage of GWAS summary statistics. Matthew Lyon, Shea J Andrews, Ben Elsworth, Tom R Gaunt, Gibran Hemani, Edoardo Marcora. bioRxiv 2020.05.29.115824; doi: https://doi.org/10.1101/2020.05.29.115824
Example GWAS VCF (GIANT 2010 BMI):
- http://fileserve.mrcieu.ac.uk/vcf/IEU-a-2.vcf.gz
- http://fileserve.mrcieu.ac.uk/vcf/IEU-a-2.vcf.gz.tbi
1000 Genomes reference panels for LD for each superpopulation (used by default in OpenGWAS):
1000 Genomes European reference panel for LD (legacy):
1000 Genomes VCF harmonised against the human genome reference:
- http://fileserve.mrcieu.ac.uk/vcf/1kg_v3_nomult.vcf.gz
- http://fileserve.mrcieu.ac.uk/vcf/1kg_v3_nomult.vcf.gz.tbi
data.vcf.gz and data.vcf.gz.tbi contain the first few rows of the Speliotes 2010 BMI GWAS.
The eur.bed/bim/fam files cover the same range as data.vcf.gz and are taken from http://fileserve.mrcieu.ac.uk/ld/data_maf0.01_rs_ref.tgz