API

Here is a list of available function calls. A detailed description can be found below.

Exported functions

EasyLD.download_gnomad_LD_matricesFunction
download_gnomad_LD_matrices(population::String, outdir::String, [start_from], [num_files])

Downloads the LD matrices from gnomAD panel for specified population and saves result to outdir. See https://gnomad.broadinstitute.org/downloads#v2-linkage-disequilibrium.

Inputs

  • population: Options include
    • afr for African/African American population
    • amr for Latino/Admixed American population
    • asj for Ashkenazi Jewish population
    • eas for East Asian population
    • fin for European (Finnish) population
    • nfe for European (non-Finnish) population
    • est for Estonian population
    • nwe for North-western European population
    • seu for Southern European population
  • outdir: Directory for which files will be saved.
  • start_from: Int specifying which file to start downloading at
  • num_files: Int specifying how many files to download

Multiple download sessions at once

Since there are lots of files to download, one can specify the number of files to download by modifying the num_files files option. This function will download the first num_files file counting from the start_from element of the output of get_gnomad_filenames(). For example, if myfiles = get_gnomad_filenames("afr"), then the function will download files from myfiles[start_from] through myfiles[start_from+num_files-1]

source
EasyLD.download_ukb_LD_matricesFunction
download_ukb_LD_matrices(population::String, outdir::String, [start_from], [num_files])

Downloads the LD matrices from pan-UK-biobank for specified population and saves result to outdir. See https://pan-dev.ukbb.broadinstitute.org/docs/hail-format/index.html.

Inputs

  • population: Several options are available. These codes refer only to ancestry groupings used in GWAS, not necessarily other demographic or self-reported data.
    • EUR for European ancestry
    • CSA for Central/South Asian ancestry
    • AFR for African ancestry
    • EAS for East Asian ancestry
    • MID for Middle Eastern ancestry
    • AMR for Admixed American ancestry
  • outdir: Directory for which files will be saved.
  • start_from: Int specifying which file to start downloading at
  • num_files: Int specifying how many files to download

Multiple download sessions at once

Since there are lots of files to download, one can specify the number of files to download by modifying the num_files files option. This function will download the first num_files file counting from the start_from element of the output of get_ukb_filenames(). For example, if myfiles = get_ukb_filenames("afr"), then the function will download files from myfiles[start_from] through myfiles[start_from+num_files-1]

source
EasyLD.download_gnomad_variant_index_tablesFunction
download_gnomad_variant_index_tables(population::String, outdir::String)

Downloads variant index hail tables from gnomAD panel for specified population and saves result to outdir. population accepts the same population labels as download_gnomad_LD_matrices. These files are typically at most a few GB.

source
EasyLD.download_ukb_variant_index_tablesFunction
download_ukb_variant_index_tables(population::String, outdir::String)

Downloads variant index hail tables from pan-UK-biobank for specified population and saves result to outdir. population accepts the same population labels as download_ukb_LD_matrices. These files are typically at most a few GB.

source
EasyLD.get_gnomad_filenamesFunction
get_gnomad_filenames(population::String)

Returns a list of file names that would be downloaded for the specified population. This function will not download anything. If join=true, the full path will be returned

source
EasyLD.get_ukb_filenamesFunction
get_ukb_filenames(population::String)

Returns a list of file names that would be downloaded for the specified population. This function will not download anything. If join=true, the full path will be returned

source
EasyLD.hail_block_matrixFunction
hail_block_matrix(bm_files::String, ht_files::String)

Creates a HailBlockMatrix which allows reading chunks of LD matrix data into memory. It is a subtype of AbstractMatrix, so operations like indexing and size() works, but only accessing its elements in contiguous chunks is "fast". I may support more functionalities depending on interest.

Examples

using EasyLD
datadir = "/Users/biona001/.julia/dev/EasyLD/data"
bm_file = joinpath(datadir, "UKBB.EUR.ldadj.bm")
ht_file = joinpath(datadir, "UKBB.EUR.ldadj.variant.ht")
bm = hail_block_matrix(bm_file, ht_file); # need a ';' to avoid displaying a few entries of bm, which takes ~0.1 seconds per entry

# get matrix dimension
size(bm) # returns (23960350, 23960350)

# read a single entry
bm[1, 1] # returns 0.999959841549239

# read first 10k by 10k block into memory (takes roughly 7 seconds)
bm[1:10000, 1:10000]

# arbitrary slicing works but is very slow
bm[1:3, 1:2:100] # ~22 seconds

# read a specific chromosome region
chr = 1
start_pos = 11063
end_pos = 91588
sigma = get_block(bm, chr, start_pos, end_pos)
source
EasyLD.read_variant_index_tablesFunction
read_variant_index_tables(ht_file::String; [alt_allele_header], [reprocess])

Read variant index hail tables into a DataFrame. The first time this function gets called, we will read the original .ht files into memory and write the result to a tab-separated value .tsv file into the same directory as ht_file, and we will also create a comma separated file ending in .csv which pre-process the .tsv file for easier reading. If reprocess=true, we will re-run the preprocessing step.

Note

  • rsIDs: Some panels (e.g. Pan-UKB) also contain rsID, but others (e.g. gnomAD) does not, so rsID may not be available.
  • AF: Alternate Allele frequency is sometimes available in the .tsv file in its own column (e.g. Pan-UKB), but other times it's hidden inside a column of the .tsv file (e.g. in gnomAD it is under the header pop_freq which includes AF but also other information). In the later case, we will assume the column in .tsv has header name alt_allele_header and is JSON formatted, and further that alt-allele freq is available as AF key. See _extract_alternate_allele_freq
source
EasyLD.get_blockFunction
get_block(bm::HailBlockMatrix, chr, start_bp, end_bp; [min_maf], 
    [snps_to_keep], [min_eigval], [enforce_psd], [rk])

Reads in a block of LD matrix from chromosome chr between basepairs start_bp and end_bp. The inputs chr, start_bp, end_bp can be Int or String. The result will always be a matrix even if start_bp == end_bp. If start_bp or end_bp is not in the LD matrix, we will return the smallest region that does NOT include them. For example, if (start_bp, end_bp) = (555, 777) and SNP positions in the LD panel are positions = [..., 400, 500, 600, 700, 800, 900, ...] with snp_names = [..., snp4, snp5, snp6, snp7, snp8, snp9, ..], then we will return the LD matrix for [snp6, snp7]

Inputs

  • bm: A HailBlockMatrix
  • chr: Chromosome number, can be an Int or String (e.g. 1 or "1")
  • start_bp: Starting basepair, can be an Int or String
  • end_bp: Ending basepair, can be an Int or String

Optional inputs

  • min_maf: Minimum minor allele frequency. Only variants with alternate allele frequency between [min_maf, 1-min_maf] is kept. Default min_maf=0.01
  • snps_to_keep: Vector of SNP positions to import. If both snps_to_keep and min_maf are specified, only SNPs whose position is listed in snps_to_keep whose minor allele frequency exceeds min_maf will be kept.
  • min_eigval: Smallest eigenvalue allowed in Σ. All eigenvalues smaller than min_eigval will be set to min_eigval (default 1e-5). This option is only used if enforce_psd=true.
  • enforce_psd: LD data stored in Pan-UKB or gnomAD LD panels only includes the upper triangular portion. If enforce_psd is true, we will copy the upper triangular portion to the lower triangular portion, force eigenvalues to be above min_eigval, and then scale the covariance matrix into a correlation matrix (default true)
  • rk: An Int. If specified and enforce_psd=true, we will keep only the top rk eigenvalues of imported Σ before truncating the rest to min_eigval. This experiments with the "truncated-SVD" approach of https://www.sciencedirect.com/science/article/pii/S0002929717303919#bib19 (default Inf)

Note

Make sure start_bp and end_bp is from the same human genome build as the LD matrices. One can verify build information via bm.build. Both Pan-UKBB and gnomAD uses hg19 (i.e. GRCh37).

source

Internal helper functions and structs

EasyLD.HailBlockMatrixType

Currently a HailBlockMatrix is just a python wrapper, where data reading is achieved within python by the Hail software, then the result is passed into Julia via PyCall.jl. The bm field behaves as a BlockMatrix from hail. The info fields stores the chr/pos/ref/alt/AF information, where pos uses coordinates in genome build build.

The Hail Block Matrix format is described here https://hail.is/docs/0.2/linalg/hail.linalg.BlockMatrix.html#blockmatrix

A more specific file description is provided here https://discuss.hail.is/t/blockmatrix-specification/3118

source
EasyLD._extract_alternate_allele_freqFunction
_extract_alternate_allele_freq(df::DataFrame; header_name = "pop_freq")

Internal helper function that extracts alternate allele frequency from the ht_file assuming this information is present in the variant index files as JSON format with header header_name. Only gnomAD panel will actually reach here since Pan-UKB has AF as a separate field in its variant index files

For example, in gnomAD, we have

julia> tsv_file = "gnomad.genomes.r2.1.1.nfe.common.adj.ld.variant_indices.ht/variant.ht.tsv"
julia> df = CSV.read(tsv_file, DataFrame)
julia> names(df)
4-element Vector{String}:
 "locus"
 "alleles"
 "pop_freq"
 "idx"
julia> df[1, "pop_freq"] # first entry with header `pop_freq` that contains AF
"{"AC":861,"AF":0.40922053231939165,"AN":2104,"homozygote_count":166}"

Then the AF field for all variants will be extracted

source
EasyLD._extract_ref_alt_allelesFunction
_extract_ref_alt_alleles(alleles::Vector{String})

Internal helper function to extract the ref/alt alleles from hail tables. Each element of alleles should be a string that has 2 alleles wrapped in quotations, as in the case for Pan-UKB and gnomAD. The first is assumed to be reference and the second is alt. As an example, ["CCTAA","C"] with give ref = "CCTAA" and alt = "C".

source
EasyLD._extract_genome_buildFunction
_extract_genome_build(ht_file::String)

Internal helper function to extracts the human genome build information from the variant index folder. We assume this information is stored in a file called metadata.json or metadata.json.gz in the following format Locus(XXX) where XXX is the reference build. If this format is not detected, the genome build information will be "Unknown"

source
EasyLD._read_metadataFunction
_read_metadata(metafile::String)

Internal helpder function that reads the metadata file and returns a dictionary. metafile should end with .json or .json.gz

source