API
Here is a list of available function calls. A detailed description can be found below.
EasyLD.HailBlockMatrix
EasyLD._extract_alternate_allele_freq
EasyLD._extract_genome_build
EasyLD._extract_ref_alt_alleles
EasyLD._read_metadata
EasyLD.download_gnomad_LD_matrices
EasyLD.download_gnomad_variant_index_tables
EasyLD.download_ukb_LD_matrices
EasyLD.download_ukb_variant_index_tables
EasyLD.get_block
EasyLD.get_gnomad_filenames
EasyLD.get_ukb_filenames
EasyLD.hail_block_matrix
EasyLD.read_variant_index_tables
Exported functions
EasyLD.download_gnomad_LD_matrices
— Functiondownload_gnomad_LD_matrices(population::String, outdir::String, [start_from], [num_files])
Downloads the LD matrices from gnomAD panel for specified population and saves result to outdir
. See https://gnomad.broadinstitute.org/downloads#v2-linkage-disequilibrium
.
Inputs
population
: Options includeafr
for African/African American populationamr
for Latino/Admixed American populationasj
for Ashkenazi Jewish populationeas
for East Asian populationfin
for European (Finnish) populationnfe
for European (non-Finnish) populationest
for Estonian populationnwe
for North-western European populationseu
for Southern European population
outdir
: Directory for which files will be saved.start_from
: Int specifying which file to start downloading atnum_files
: Int specifying how many files to download
Multiple download sessions at once
Since there are lots of files to download, one can specify the number of files to download by modifying the num_files
files option. This function will download the first num_files
file counting from the start_from
element of the output of get_gnomad_filenames()
. For example, if myfiles = get_gnomad_filenames("afr")
, then the function will download files from myfiles[start_from]
through myfiles[start_from+num_files-1]
EasyLD.download_ukb_LD_matrices
— Functiondownload_ukb_LD_matrices(population::String, outdir::String, [start_from], [num_files])
Downloads the LD matrices from pan-UK-biobank for specified population and saves result to outdir
. See https://pan-dev.ukbb.broadinstitute.org/docs/hail-format/index.html
.
Inputs
population
: Several options are available. These codes refer only to ancestry groupings used in GWAS, not necessarily other demographic or self-reported data.EUR
for European ancestryCSA
for Central/South Asian ancestryAFR
for African ancestryEAS
for East Asian ancestryMID
for Middle Eastern ancestryAMR
for Admixed American ancestry
outdir
: Directory for which files will be saved.start_from
: Int specifying which file to start downloading atnum_files
: Int specifying how many files to download
Multiple download sessions at once
Since there are lots of files to download, one can specify the number of files to download by modifying the num_files
files option. This function will download the first num_files
file counting from the start_from
element of the output of get_ukb_filenames()
. For example, if myfiles = get_ukb_filenames("afr")
, then the function will download files from myfiles[start_from]
through myfiles[start_from+num_files-1]
EasyLD.download_gnomad_variant_index_tables
— Functiondownload_gnomad_variant_index_tables(population::String, outdir::String)
Downloads variant index hail tables from gnomAD panel for specified population and saves result to outdir
. population
accepts the same population labels as download_gnomad_LD_matrices
. These files are typically at most a few GB.
EasyLD.download_ukb_variant_index_tables
— Functiondownload_ukb_variant_index_tables(population::String, outdir::String)
Downloads variant index hail tables from pan-UK-biobank for specified population and saves result to outdir
. population
accepts the same population labels as download_ukb_LD_matrices
. These files are typically at most a few GB.
EasyLD.get_gnomad_filenames
— Functionget_gnomad_filenames(population::String)
Returns a list of file names that would be downloaded for the specified population. This function will not download anything. If join=true
, the full path will be returned
EasyLD.get_ukb_filenames
— Functionget_ukb_filenames(population::String)
Returns a list of file names that would be downloaded for the specified population. This function will not download anything. If join=true
, the full path will be returned
EasyLD.hail_block_matrix
— Functionhail_block_matrix(bm_files::String, ht_files::String)
Creates a HailBlockMatrix
which allows reading chunks of LD matrix data into memory. It is a subtype of AbstractMatrix
, so operations like indexing and size()
works, but only accessing its elements in contiguous chunks is "fast". I may support more functionalities depending on interest.
Examples
using EasyLD
datadir = "/Users/biona001/.julia/dev/EasyLD/data"
bm_file = joinpath(datadir, "UKBB.EUR.ldadj.bm")
ht_file = joinpath(datadir, "UKBB.EUR.ldadj.variant.ht")
bm = hail_block_matrix(bm_file, ht_file); # need a ';' to avoid displaying a few entries of bm, which takes ~0.1 seconds per entry
# get matrix dimension
size(bm) # returns (23960350, 23960350)
# read a single entry
bm[1, 1] # returns 0.999959841549239
# read first 10k by 10k block into memory (takes roughly 7 seconds)
bm[1:10000, 1:10000]
# arbitrary slicing works but is very slow
bm[1:3, 1:2:100] # ~22 seconds
# read a specific chromosome region
chr = 1
start_pos = 11063
end_pos = 91588
sigma = get_block(bm, chr, start_pos, end_pos)
EasyLD.read_variant_index_tables
— Functionread_variant_index_tables(ht_file::String; [alt_allele_header], [reprocess])
Read variant index hail tables into a DataFrame
. The first time this function gets called, we will read the original .ht
files into memory and write the result to a tab-separated value .tsv
file into the same directory as ht_file
, and we will also create a comma separated file ending in .csv
which pre-process the .tsv
file for easier reading. If reprocess=true
, we will re-run the preprocessing step.
Note
rsIDs
: Some panels (e.g. Pan-UKB) also contain rsID, but others (e.g. gnomAD) does not, so rsID may not be available.AF
: Alternate Allele frequency is sometimes available in the.tsv
file in its own column (e.g. Pan-UKB), but other times it's hidden inside a column of the.tsv
file (e.g. in gnomAD it is under the headerpop_freq
which includesAF
but also other information). In the later case, we will assume the column in.tsv
has header namealt_allele_header
and is JSON formatted, and further that alt-allele freq is available asAF
key. See_extract_alternate_allele_freq
EasyLD.get_block
— Functionget_block(bm::HailBlockMatrix, chr, start_bp, end_bp; [min_maf],
[snps_to_keep], [min_eigval], [enforce_psd], [rk])
Reads in a block of LD matrix from chromosome chr
between basepairs start_bp
and end_bp
. The inputs chr
, start_bp
, end_bp
can be Int or String. The result will always be a matrix even if start_bp == end_bp
. If start_bp
or end_bp
is not in the LD matrix, we will return the smallest region that does NOT include them. For example, if (start_bp, end_bp) = (555, 777)
and SNP positions in the LD panel are positions = [..., 400, 500, 600, 700, 800, 900, ...]
with snp_names = [..., snp4, snp5, snp6, snp7, snp8, snp9, ..]
, then we will return the LD matrix for [snp6, snp7]
Inputs
bm
: AHailBlockMatrix
chr
: Chromosome number, can be an Int or String (e.g.1
or"1"
)start_bp
: Starting basepair, can be an Int or Stringend_bp
: Ending basepair, can be an Int or String
Optional inputs
min_maf
: Minimum minor allele frequency. Only variants with alternate allele frequency between[min_maf, 1-min_maf]
is kept. Defaultmin_maf=0.01
snps_to_keep
: Vector of SNP positions to import. If bothsnps_to_keep
andmin_maf
are specified, only SNPs whose position is listed insnps_to_keep
whose minor allele frequency exceedsmin_maf
will be kept.min_eigval
: Smallest eigenvalue allowed in Σ. All eigenvalues smaller thanmin_eigval
will be set tomin_eigval
(default1e-5
). This option is only used ifenforce_psd=true
.enforce_psd
: LD data stored in Pan-UKB or gnomAD LD panels only includes the upper triangular portion. Ifenforce_psd
is true, we will copy the upper triangular portion to the lower triangular portion, force eigenvalues to be abovemin_eigval
, and then scale the covariance matrix into a correlation matrix (defaulttrue
)rk
: AnInt
. If specified andenforce_psd=true
, we will keep only the toprk
eigenvalues of importedΣ
before truncating the rest tomin_eigval
. This experiments with the "truncated-SVD" approach ofhttps://www.sciencedirect.com/science/article/pii/S0002929717303919#bib19
(defaultInf
)
Note
Make sure start_bp
and end_bp
is from the same human genome build as the LD matrices. One can verify build information via bm.build
. Both Pan-UKBB and gnomAD uses hg19 (i.e. GRCh37).
Internal helper functions and structs
EasyLD.HailBlockMatrix
— TypeCurrently a HailBlockMatrix
is just a python wrapper, where data reading is achieved within python by the Hail software, then the result is passed into Julia via PyCall.jl. The bm
field behaves as a BlockMatrix
from hail. The info
fields stores the chr/pos/ref/alt/AF
information, where pos
uses coordinates in genome build build
.
The Hail Block Matrix format is described here https://hail.is/docs/0.2/linalg/hail.linalg.BlockMatrix.html#blockmatrix
A more specific file description is provided here https://discuss.hail.is/t/blockmatrix-specification/3118
EasyLD._extract_alternate_allele_freq
— Function_extract_alternate_allele_freq(df::DataFrame; header_name = "pop_freq")
Internal helper function that extracts alternate allele frequency from the ht_file
assuming this information is present in the variant index files as JSON format with header header_name
. Only gnomAD
panel will actually reach here since Pan-UKB has AF
as a separate field in its variant index files
For example, in gnomAD, we have
julia> tsv_file = "gnomad.genomes.r2.1.1.nfe.common.adj.ld.variant_indices.ht/variant.ht.tsv"
julia> df = CSV.read(tsv_file, DataFrame)
julia> names(df)
4-element Vector{String}:
"locus"
"alleles"
"pop_freq"
"idx"
julia> df[1, "pop_freq"] # first entry with header `pop_freq` that contains AF
"{"AC":861,"AF":0.40922053231939165,"AN":2104,"homozygote_count":166}"
Then the AF field for all variants will be extracted
EasyLD._extract_ref_alt_alleles
— Function_extract_ref_alt_alleles(alleles::Vector{String})
Internal helper function to extract the ref/alt alleles from hail tables. Each element of alleles
should be a string
that has 2 alleles wrapped in quotations, as in the case for Pan-UKB and gnomAD. The first is assumed to be reference and the second is alt. As an example, ["CCTAA","C"]
with give ref = "CCTAA"
and alt = "C"
.
EasyLD._extract_genome_build
— Function_extract_genome_build(ht_file::String)
Internal helper function to extracts the human genome build information from the variant index folder. We assume this information is stored in a file called metadata.json
or metadata.json.gz
in the following format Locus(XXX)
where XXX
is the reference build. If this format is not detected, the genome build information will be "Unknown"
EasyLD._read_metadata
— Function_read_metadata(metafile::String)
Internal helpder function that reads the metadata file and returns a dictionary. metafile
should end with .json
or .json.gz