API
Here is a list of available function calls. A detailed description can be found below.
EasyLD.HailBlockMatrixEasyLD._extract_alternate_allele_freqEasyLD._extract_genome_buildEasyLD._extract_ref_alt_allelesEasyLD._read_metadataEasyLD.download_gnomad_LD_matricesEasyLD.download_gnomad_variant_index_tablesEasyLD.download_ukb_LD_matricesEasyLD.download_ukb_variant_index_tablesEasyLD.get_blockEasyLD.get_gnomad_filenamesEasyLD.get_ukb_filenamesEasyLD.hail_block_matrixEasyLD.read_variant_index_tables
Exported functions
EasyLD.download_gnomad_LD_matrices — Functiondownload_gnomad_LD_matrices(population::String, outdir::String, [start_from], [num_files])Downloads the LD matrices from gnomAD panel for specified population and saves result to outdir. See https://gnomad.broadinstitute.org/downloads#v2-linkage-disequilibrium.
Inputs
population: Options includeafrfor African/African American populationamrfor Latino/Admixed American populationasjfor Ashkenazi Jewish populationeasfor East Asian populationfinfor European (Finnish) populationnfefor European (non-Finnish) populationestfor Estonian populationnwefor North-western European populationseufor Southern European population
outdir: Directory for which files will be saved.start_from: Int specifying which file to start downloading atnum_files: Int specifying how many files to download
Multiple download sessions at once
Since there are lots of files to download, one can specify the number of files to download by modifying the num_files files option. This function will download the first num_files file counting from the start_from element of the output of get_gnomad_filenames(). For example, if myfiles = get_gnomad_filenames("afr"), then the function will download files from myfiles[start_from] through myfiles[start_from+num_files-1]
EasyLD.download_ukb_LD_matrices — Functiondownload_ukb_LD_matrices(population::String, outdir::String, [start_from], [num_files])Downloads the LD matrices from pan-UK-biobank for specified population and saves result to outdir. See https://pan-dev.ukbb.broadinstitute.org/docs/hail-format/index.html.
Inputs
population: Several options are available. These codes refer only to ancestry groupings used in GWAS, not necessarily other demographic or self-reported data.EURfor European ancestryCSAfor Central/South Asian ancestryAFRfor African ancestryEASfor East Asian ancestryMIDfor Middle Eastern ancestryAMRfor Admixed American ancestry
outdir: Directory for which files will be saved.start_from: Int specifying which file to start downloading atnum_files: Int specifying how many files to download
Multiple download sessions at once
Since there are lots of files to download, one can specify the number of files to download by modifying the num_files files option. This function will download the first num_files file counting from the start_from element of the output of get_ukb_filenames(). For example, if myfiles = get_ukb_filenames("afr"), then the function will download files from myfiles[start_from] through myfiles[start_from+num_files-1]
EasyLD.download_gnomad_variant_index_tables — Functiondownload_gnomad_variant_index_tables(population::String, outdir::String)Downloads variant index hail tables from gnomAD panel for specified population and saves result to outdir. population accepts the same population labels as download_gnomad_LD_matrices. These files are typically at most a few GB.
EasyLD.download_ukb_variant_index_tables — Functiondownload_ukb_variant_index_tables(population::String, outdir::String)Downloads variant index hail tables from pan-UK-biobank for specified population and saves result to outdir. population accepts the same population labels as download_ukb_LD_matrices. These files are typically at most a few GB.
EasyLD.get_gnomad_filenames — Functionget_gnomad_filenames(population::String)Returns a list of file names that would be downloaded for the specified population. This function will not download anything. If join=true, the full path will be returned
EasyLD.get_ukb_filenames — Functionget_ukb_filenames(population::String)Returns a list of file names that would be downloaded for the specified population. This function will not download anything. If join=true, the full path will be returned
EasyLD.hail_block_matrix — Functionhail_block_matrix(bm_files::String, ht_files::String)Creates a HailBlockMatrix which allows reading chunks of LD matrix data into memory. It is a subtype of AbstractMatrix, so operations like indexing and size() works, but only accessing its elements in contiguous chunks is "fast". I may support more functionalities depending on interest.
Examples
using EasyLD
datadir = "/Users/biona001/.julia/dev/EasyLD/data"
bm_file = joinpath(datadir, "UKBB.EUR.ldadj.bm")
ht_file = joinpath(datadir, "UKBB.EUR.ldadj.variant.ht")
bm = hail_block_matrix(bm_file, ht_file); # need a ';' to avoid displaying a few entries of bm, which takes ~0.1 seconds per entry
# get matrix dimension
size(bm) # returns (23960350, 23960350)
# read a single entry
bm[1, 1] # returns 0.999959841549239
# read first 10k by 10k block into memory (takes roughly 7 seconds)
bm[1:10000, 1:10000]
# arbitrary slicing works but is very slow
bm[1:3, 1:2:100] # ~22 seconds
# read a specific chromosome region
chr = 1
start_pos = 11063
end_pos = 91588
sigma = get_block(bm, chr, start_pos, end_pos)EasyLD.read_variant_index_tables — Functionread_variant_index_tables(ht_file::String; [alt_allele_header], [reprocess])Read variant index hail tables into a DataFrame. The first time this function gets called, we will read the original .ht files into memory and write the result to a tab-separated value .tsv file into the same directory as ht_file, and we will also create a comma separated file ending in .csv which pre-process the .tsv file for easier reading. If reprocess=true, we will re-run the preprocessing step.
Note
rsIDs: Some panels (e.g. Pan-UKB) also contain rsID, but others (e.g. gnomAD) does not, so rsID may not be available.AF: Alternate Allele frequency is sometimes available in the.tsvfile in its own column (e.g. Pan-UKB), but other times it's hidden inside a column of the.tsvfile (e.g. in gnomAD it is under the headerpop_freqwhich includesAFbut also other information). In the later case, we will assume the column in.tsvhas header namealt_allele_headerand is JSON formatted, and further that alt-allele freq is available asAFkey. See_extract_alternate_allele_freq
EasyLD.get_block — Functionget_block(bm::HailBlockMatrix, chr, start_bp, end_bp; [min_maf],
[snps_to_keep], [min_eigval], [enforce_psd], [rk])Reads in a block of LD matrix from chromosome chr between basepairs start_bp and end_bp. The inputs chr, start_bp, end_bp can be Int or String. The result will always be a matrix even if start_bp == end_bp. If start_bp or end_bp is not in the LD matrix, we will return the smallest region that does NOT include them. For example, if (start_bp, end_bp) = (555, 777) and SNP positions in the LD panel are positions = [..., 400, 500, 600, 700, 800, 900, ...] with snp_names = [..., snp4, snp5, snp6, snp7, snp8, snp9, ..], then we will return the LD matrix for [snp6, snp7]
Inputs
bm: AHailBlockMatrixchr: Chromosome number, can be an Int or String (e.g.1or"1")start_bp: Starting basepair, can be an Int or Stringend_bp: Ending basepair, can be an Int or String
Optional inputs
min_maf: Minimum minor allele frequency. Only variants with alternate allele frequency between[min_maf, 1-min_maf]is kept. Defaultmin_maf=0.01snps_to_keep: Vector of SNP positions to import. If bothsnps_to_keepandmin_mafare specified, only SNPs whose position is listed insnps_to_keepwhose minor allele frequency exceedsmin_mafwill be kept.min_eigval: Smallest eigenvalue allowed in Σ. All eigenvalues smaller thanmin_eigvalwill be set tomin_eigval(default1e-5). This option is only used ifenforce_psd=true.enforce_psd: LD data stored in Pan-UKB or gnomAD LD panels only includes the upper triangular portion. Ifenforce_psdis true, we will copy the upper triangular portion to the lower triangular portion, force eigenvalues to be abovemin_eigval, and then scale the covariance matrix into a correlation matrix (defaulttrue)rk: AnInt. If specified andenforce_psd=true, we will keep only the toprkeigenvalues of importedΣbefore truncating the rest tomin_eigval. This experiments with the "truncated-SVD" approach ofhttps://www.sciencedirect.com/science/article/pii/S0002929717303919#bib19(defaultInf)
Note
Make sure start_bp and end_bp is from the same human genome build as the LD matrices. One can verify build information via bm.build. Both Pan-UKBB and gnomAD uses hg19 (i.e. GRCh37).
Internal helper functions and structs
EasyLD.HailBlockMatrix — TypeCurrently a HailBlockMatrix is just a python wrapper, where data reading is achieved within python by the Hail software, then the result is passed into Julia via PyCall.jl. The bm field behaves as a BlockMatrix from hail. The info fields stores the chr/pos/ref/alt/AF information, where pos uses coordinates in genome build build.
The Hail Block Matrix format is described here https://hail.is/docs/0.2/linalg/hail.linalg.BlockMatrix.html#blockmatrix
A more specific file description is provided here https://discuss.hail.is/t/blockmatrix-specification/3118
EasyLD._extract_alternate_allele_freq — Function_extract_alternate_allele_freq(df::DataFrame; header_name = "pop_freq")Internal helper function that extracts alternate allele frequency from the ht_file assuming this information is present in the variant index files as JSON format with header header_name. Only gnomAD panel will actually reach here since Pan-UKB has AF as a separate field in its variant index files
For example, in gnomAD, we have
julia> tsv_file = "gnomad.genomes.r2.1.1.nfe.common.adj.ld.variant_indices.ht/variant.ht.tsv"
julia> df = CSV.read(tsv_file, DataFrame)
julia> names(df)
4-element Vector{String}:
"locus"
"alleles"
"pop_freq"
"idx"
julia> df[1, "pop_freq"] # first entry with header `pop_freq` that contains AF
"{"AC":861,"AF":0.40922053231939165,"AN":2104,"homozygote_count":166}"Then the AF field for all variants will be extracted
EasyLD._extract_ref_alt_alleles — Function_extract_ref_alt_alleles(alleles::Vector{String})Internal helper function to extract the ref/alt alleles from hail tables. Each element of alleles should be a string that has 2 alleles wrapped in quotations, as in the case for Pan-UKB and gnomAD. The first is assumed to be reference and the second is alt. As an example, ["CCTAA","C"] with give ref = "CCTAA" and alt = "C".
EasyLD._extract_genome_build — Function_extract_genome_build(ht_file::String)Internal helper function to extracts the human genome build information from the variant index folder. We assume this information is stored in a file called metadata.json or metadata.json.gz in the following format Locus(XXX) where XXX is the reference build. If this format is not detected, the genome build information will be "Unknown"
EasyLD._read_metadata — Function_read_metadata(metafile::String)Internal helpder function that reads the metadata file and returns a dictionary. metafile should end with .json or .json.gz