Usage within Julia
GhostKnockoffGWAS
is a regular Julia package, which can be used directly within Julia for greater flexibility. To install it, execute the following in Julia
using Pkg
Pkg.add(url="https://github.com/biona001/ghostbasil_jll.jl")
Pkg.add(url="https://github.com/biona001/Ghostbasil.jl")
Pkg.add(url="https://github.com/biona001/HDF5.jl") # needed to resolve https://github.com/biona001/GhostKnockoffGWAS/issues/7
Pkg.add(url="https://github.com/biona001/GhostKnockoffGWAS")
This package currently only works on Linux machines with Julia 1.8.x, 1.9.x, and 1.10.0. If you need it to work on a different Julia version, let us know by filing an issue on Github.
Usage example
The following example performs summary-statistics GWAS under the GhostKnockoff framework.
using GhostKnockoffGWAS
# file paths and directories
LD_files = "/home/groups/sabatti/.julia/dev/GhostKnockoffGWAS/data/EUR"
zfile = "/home/groups/sabatti/.julia/dev/GhostKnockoffGWAS/data/AD_Zscores_Meta_modified.txt"
outdir = "/home/groups/sabatti/.julia/dev/GhostKnockoffGWAS/data"
# specify sample size and human genome build
N = 506200
hg_build = 38
# read Z-scores using built-in function read_zscores
z, chr, pos, effect_allele, non_effect_allele = GhostKnockoffGWAS.read_zscores(zfile)
# run analysis
@time ghostknockoffgwas(LD_files, z, chr, pos, effect_allele,
non_effect_allele, N, hg_build, outdir, outname="test_alzheimers_meta")
Function API
GhostKnockoffGWAS.ghostknockoffgwas
— Functionghostknockoffgwas(LD_files::String, z::Vector{Float64}, chr::Vector{Int},
effect_allele::Vector{String}, non_effect_allele::Vector{String}, N::Int,
hg_build::Int, outdir::String; [outname="result"], [seed=2023],
[target_chrs=sort!(unique(chr))], [A_scaling_factor = 0.01],
[kappa_lasso=0.6], [LD_shrinkage=false],
[target_fdrs=[0.01, 0.05, 0.1, 0.2]], [verbose=true],
[skip_shrinkage_check=false])
Runs the main GhostKnockoffGWAS
pipeline on the Z scores in z
using pre-computed knockoff data in LD_files
.
Inputs
LD_files
: Directory that stores pre-computed knockoff resultsz
: Vector of Z scoreschr
: Chromosome of each Z score (cannot be X/Y/M chromosomes)pos
: Position of each Z score (specify hg build withhg_build
)effect_allele
: Effect allele of Z score (i.e. ALT)non_effect_allele
: Non-effect allele of Z score (i.e. REF)N
: sample size of original studyhg_build
: Human genome build (must be 19 or 38)outdir
: output directory, which must exist. We will output 2 files, one containing the full analysis results, as well as a summary file.
Optional inputs
seed
: Random seed for reproducibility (default =2023
)target_chrs
: Target chromosomes to analyze. For example, one can specifytarget_chrs = 22
to only analyze 1 chromosome, ortarget_chrs = [1, 2]
to only analyze 2 chromosomes (default =sort!(unique(chr))
).A_scaling_factor
: The scaling factor forA = [X X̃]'*[X X̃]
for improving numerical stability. Scaling proceeds by addingA_scaling_factor*I
toA
(default = 0.01).kappa_lasso
: A constant between 0 and 1 for tuning Lasso's lambda path. Larger value forces earlier exit in Lasso lambda path, resulting in stronger shrinkage. See the "lasso-min" method of "Controlled Variable Selection from Summary Statistics Only? A Solution via GhostKnockoffs and Penalized Regression" by Chen et al (default0.6
).LD_shrinkage
: Whether to perform shrinkage to LD and S matrices following method in SuSiE paper (i.e. eq 24 of "Fine-mapping from summary data with the “Sum of Single Effects” model" by Zou et al). Iffalse
, we will still compute the shrinkage level, but it will not be used to adjust the LD matrices (defaultfalse
).target_fdrs
: Default target FDR levels (default = [0.01, 0.05, 0.1, 0.2])verbose
: Whether to print progress and informative intermediate results ( default =true
)skip_shrinkage_check
: Forces a result output even if there is a high estimated LD shrinkage by SuSiE's method (default =false
)random_shuffle
: Whether to randomly permute the order of the original and knockoff variables (defaultfalse
). The main purpose of this option is to take care of potential ordering bias of Lasso solvers. However, in our simulations we never observed such biases, so we turn this off by default.
Output
By default we output 2 files into outdir
- A knockoff statistics file where each SNP occupies a row and the columns include various SNP attributes include rsid, AF, chr, pos, zscores...etc. The columns
selected_fdr_FDR
indicates whether the variant was ultimately selected under the false discovery rate threshold ofFDR
. - A summary statistics file. The first dozens of rows print, for each false discovery rate threshold
FDR
, the knockoff thresholdτ̂
and the number of groups that pass this threshold. The next couple of lines print some parameters used in the knockoff analysis, as well as some timing results.
GhostKnockoffGWAS.read_zscores
— Functionread_zscores(filepath::String)
Helper function to read a Z-score file at filepath
. This function is mainly intended for Julia users running GhostKnockoffGWAS
in the REPL.
Input
filepath
: Full file path to the Z-score file. First row must be a header column withCHR
,POS
,REF
,ALT
, andZ
. If the file contains these information but has a different header name for them, use the optional input arguments below. All other columns will be ignored.
Optional inputs
chr_col
: An integer, specifying which column infilepath
should be read as CHR (by default we search for a headerCHR
)pos_col
: An integer, specifying which column infilepath
should be read as POS (by default we search for a headerPOS
)ref_col
: An integer, specifying which column infilepath
should be read as REF (by default we search for a headerREF
)alt_col
: An integer, specifying which column infilepath
should be read as ALT (by default we search for a headerALT
)z_col
: An integer, specifying which column infilepath
should be read as Z (by default we search for a headerZ
)
Output
z
: The Z scores stored in theZ
column offilepath
chr
: The chromosome number stored inCHR
column offilepath
. Only integer values are allowed.pos
: The position number stored inPOS
column offilepath
.effect_allele
: The allele stored inALT
column offilepath
.non_effect_allele
: The allele stored inREF
column offilepath
.
GhostKnockoffGWAS.solve_blocks
— Functionsolve_blocks(file::String, covfile::String, chr::Int, start_bp::Int, end_bp::Int,
outdir::String, hg_build::Int; [m=5], [tol=0.0001], [min_maf=0.01],
[min_hwe=0.0], [force_block_diag=true],
[method::String = "maxent"], [linkage::String="average"],
[force_contiguous::Bool=false], [group_cor_cutoff::Float64=0.5],
[group_rep_cutoff::Float64=0.5], [verbose=true])
Solves the group knockoff optimization problem on provided individual-level data and outputs the result into outdir
. All variants that reside on chromosome chr
with position between start_bp
and end_bp
(inclusive) will be included.
Note on large VCF files
Currently reading/parsing a VCF file is a single-threaded operation (even if it is indexed). Thus, we strongly recommend one convert to binary PLINK format.
Inputs
file
: A VCF or binary PLINK file storing individual level genotypes. Must end in.vcf
,.vcf.gz
, or.bed
. If a VCF file is used, the ALT field for each record must be unique, i.e. multiallelic records must be split first. Missing genotypes will be imputed by column mean.covfile
: An optional comma- or tab-separated file containing sample covariates (e.g. sex, age, PCs). This argument can be an empty string. The supplied covariates will be used to improve LD estimation. The first column should be sample IDs (not necessary to be in the sample order as VCF or PLINK files) and all other columns will be used as additional covariates.chr
: Target chromosome. This MUST be an integer and it must match theCHROM
field in your VCF/PLINK file. For example, if your VCF file has CHROM field likechr1
,CHR1
, orCHROM1
etc, they must be renamed into1
.start_bp
: starting basepair (position)end_bp
: ending basepair (position)outdir
: Directory that the output will be stored in (must exist)hg_build
: human genome build for position of each SNP, must be 19 (hg19) or 38 (hg38)
Optional inputs (for group knockoff optimization)
snps_to_keep
: Vector of SNP positions to import. If specified, only SNPs whose position is listed insnps_to_keep
will be kept (defaultnothing
)tol
: Convergence tolerlance for coordinate descent algorithm (default0.0001
)min_maf
: Minimum minor allele frequency for a variable to be considered ( default0.01
)min_hwe
: Cutoff for hardy-weinburg equilibrium p-values. Only SNPs with p-value >min_hwe
will be included (default0.0
)force_block_diag
: Whether to re-order the columns/rows of the correlation matrix and correspondingS
matrix so that features in the same group are contiguous (defaulttrue
). This has no impact on the final results, it is simply for computational performance.method
: group knockoff optimization algorithm, choices include "maxent" (default), "mvr", "sdp", or "equi". See sec 2 of https://arxiv.org/abs/2310.15069linkage
: cluster linkage function to use for hierarchically clustering groups. It defines how the distances between features are aggregated into the distances between groups. Valid choices include::average
(default): use the mean distance between any of the cluster members:single
: use the minimum distance between any of the cluster members:complete
: use the maximum distance between any of the members:ward
: the distance is the increase of the average squared distance of a point to its cluster centroid after merging the two clusters:ward_presquared
: same as:ward
, but assumes that the distances in d are already squared.
force_contiguous
: whether to force groups to be contiguous (defaultfalse
). Note ifforce_contiguous=true
,linkage
must be:single
)group_cor_cutoff
: correlation cutoff value for defining groups (default0.5
). Value should be between 0 and 1, where larger values correspond to larger groups.group_rep_cutoff
: cutoff value for selecting group-representatives (default0.5
). Value should be between 0 and 1, where larger values correspond to more representatives per group.verbose
: whether to print informative intermediate results (defaulttrue
)
output
Calling solve_blocks
will create 3 files in the directory outdir/chr
:
XXX.h5
: This contains data (Sigma, S, groups, ..., etc) for region XXX. It contains the following:D
: Ap × p
(dense) matrix corresponding to the S matrix for both the representative and non-representative variables. Knockoff sampling should use this matrix.S
: Matrix obtained from solving the group knockoff optimization problem on the representative (group-key) variables.Sigma
: The originalp × p
correlation matrix estimated fromvcffile
SigmaInv
: Inverse ofSigma
Sigma_reps
: The correlation matrix for the representative variables. This is the matrix actually used to solve the group-knockoff optimizationSigma_reps_inv
: Inverse ofSigma_reps
group_reps
: Indicesgroups
that are used as representatives (i.e. group-key variables)groups
: The group membership vector
Info_XXX.csv
: This includes information for each variant (chr/pos/etc) present in the corresponding.h5
file.summary_XXX.csv
: Summary file for the knockoff optimization problem