Usage within Julia

GhostKnockoffGWAS is a regular Julia package, which can be used directly within Julia for greater flexibility. To install it, execute the following in Julia

using Pkg
Pkg.add(url="https://github.com/biona001/ghostbasil_jll.jl")
Pkg.add(url="https://github.com/biona001/Ghostbasil.jl")
Pkg.add(url="https://github.com/biona001/HDF5.jl") # needed to resolve https://github.com/biona001/GhostKnockoffGWAS/issues/7
Pkg.add(url="https://github.com/biona001/GhostKnockoffGWAS")
Warning

This package currently only works on Linux machines with Julia 1.8.x, 1.9.x, and 1.10.0. If you need it to work on a different Julia version, let us know by filing an issue on Github.

Usage example

The following example performs summary-statistics GWAS under the GhostKnockoff framework.

using GhostKnockoffGWAS

# file paths and directories
LD_files = "/home/groups/sabatti/.julia/dev/GhostKnockoffGWAS/data/EUR"
zfile = "/home/groups/sabatti/.julia/dev/GhostKnockoffGWAS/data/AD_Zscores_Meta_modified.txt"
outdir = "/home/groups/sabatti/.julia/dev/GhostKnockoffGWAS/data"

# specify sample size and human genome build
N = 506200
hg_build = 38

# read Z-scores using built-in function read_zscores
z, chr, pos, effect_allele, non_effect_allele = GhostKnockoffGWAS.read_zscores(zfile)

# run analysis
@time ghostknockoffgwas(LD_files, z, chr, pos, effect_allele, 
    non_effect_allele, N, hg_build, outdir, outname="test_alzheimers_meta")

Function API

GhostKnockoffGWAS.ghostknockoffgwas (Function)
ghostknockoffgwas(LD_files::String, z::Vector{Float64}, chr::Vector{Int}, 
    pos::Vector{Int}, effect_allele::Vector{String}, 
    non_effect_allele::Vector{String}, N::Int,
    hg_build::Int, outdir::String; [outname="result"], [seed=2023], 
    [target_chrs=sort!(unique(chr))], [A_scaling_factor = 0.01], 
    [kappa_lasso=0.6], [LD_shrinkage=false], 
    [target_fdrs=[0.01, 0.05, 0.1, 0.2]], [verbose=true], 
    [skip_shrinkage_check=false], [random_shuffle=false])

Runs the main GhostKnockoffGWAS pipeline on the Z scores in z using pre-computed knockoff data in LD_files.

Inputs

  • LD_files: Directory that stores pre-computed knockoff results
  • z: Vector of Z scores
  • chr: Chromosome of each Z score (cannot be X/Y/M chromosomes)
  • pos: Position of each Z score (specify hg build with hg_build)
  • effect_allele: Effect allele of Z score (i.e. ALT)
  • non_effect_allele: Non-effect allele of Z score (i.e. REF)
  • N: sample size of original study
  • hg_build: Human genome build (must be 19 or 38)
  • outdir: output directory, which must exist. We will output 2 files: one containing the full analysis results and one containing a summary.

Optional inputs

  • seed: Random seed for reproducibility (default = 2023)
  • target_chrs: Target chromosomes to analyze. For example, one can specify target_chrs = 22 to only analyze 1 chromosome, or target_chrs = [1, 2] to only analyze 2 chromosomes (default = sort!(unique(chr))).
  • A_scaling_factor: The scaling factor for A = [X X̃]'*[X X̃] for improving numerical stability. Scaling proceeds by adding A_scaling_factor*I to A (default = 0.01).
  • kappa_lasso: A constant between 0 and 1 for tuning Lasso's lambda path. A larger value forces an earlier exit in the Lasso lambda path, resulting in stronger shrinkage. See the "lasso-min" method of "Controlled Variable Selection from Summary Statistics Only? A Solution via GhostKnockoffs and Penalized Regression" by Chen et al. (default 0.6).
  • LD_shrinkage: Whether to perform shrinkage to LD and S matrices following method in SuSiE paper (i.e. eq 24 of "Fine-mapping from summary data with the “Sum of Single Effects” model" by Zou et al). If false, we will still compute the shrinkage level, but it will not be used to adjust the LD matrices (default false).
  • target_fdrs: Default target FDR levels (default = [0.01, 0.05, 0.1, 0.2])
  • verbose: Whether to print progress and informative intermediate results (default = true)
  • skip_shrinkage_check: Forces a result output even if there is a high estimated LD shrinkage by SuSiE's method (default = false)
  • random_shuffle: Whether to randomly permute the order of the original and knockoff variables (default false). The main purpose of this option is to take care of potential ordering bias of Lasso solvers. However, in our simulations we never observed such biases, so we turn this off by default.
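
For example, the optional inputs above can be supplied as keyword arguments. The following sketch reuses the variables (z, chr, pos, etc.) from the usage example earlier; the output name, chromosome choices, and FDR level are purely illustrative:

# hypothetical sketch: analyze only chromosomes 21 and 22 at a single target
# FDR level, reusing z, chr, pos, ... from the usage example above
@time ghostknockoffgwas(LD_files, z, chr, pos, effect_allele,
    non_effect_allele, N, hg_build, outdir, outname="test_chr21_22",
    target_chrs=[21, 22], target_fdrs=[0.1])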

Output

By default we output 2 files into outdir

  • A knockoff statistics file where each SNP occupies a row and the columns include various SNP attributes, including rsid, AF, chr, pos, zscores, etc. The column selected_fdr_FDR indicates whether the variant was ultimately selected under the false discovery rate threshold FDR.
  • A summary statistics file. The first few rows print, for each false discovery rate threshold FDR, the knockoff threshold τ̂ and the number of groups that pass this threshold. The next few lines print some parameters used in the knockoff analysis, as well as some timing results.
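
As a rough post-processing sketch, one could tally the selections at a given FDR with CSV.jl and DataFrames.jl. The file name below (derived from outname in the usage example) and the column name selected_fdr_0.1 are assumptions based on the naming pattern described above; adjust them to match the actual output:

using CSV, DataFrames

# hypothetical file name: the knockoff statistics file written to outdir when
# outname = "test_alzheimers_meta"; adjust to match your actual output
results = CSV.read(joinpath(outdir, "test_alzheimers_meta.txt"), DataFrame)

# assumed column name following the selected_fdr_FDR pattern described above;
# summing works whether the column stores Bool or 0/1 indicators
n_selected = sum(results[!, "selected_fdr_0.1"])
println("variants selected at target FDR 0.1: ", n_selected)
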
GhostKnockoffGWAS.read_zscores (Function)
read_zscores(filepath::String)

Helper function to read a Z-score file at filepath. This function is mainly intended for Julia users running GhostKnockoffGWAS in the REPL.

Input

  • filepath: Full file path to the Z-score file. The first row must be a header line containing the columns CHR, POS, REF, ALT, and Z. If the file contains this information under different header names, use the optional input arguments below. All other columns will be ignored.

Optional inputs

  • chr_col: An integer, specifying which column in filepath should be read as CHR (by default we search for a header CHR)
  • pos_col: An integer, specifying which column in filepath should be read as POS (by default we search for a header POS)
  • ref_col: An integer, specifying which column in filepath should be read as REF (by default we search for a header REF)
  • alt_col: An integer, specifying which column in filepath should be read as ALT (by default we search for a header ALT)
  • z_col: An integer, specifying which column in filepath should be read as Z (by default we search for a header Z)

Output

  • z: The Z scores stored in the Z column of filepath
  • chr: The chromosome number stored in CHR column of filepath. Only integer values are allowed.
  • pos: The position number stored in POS column of filepath.
  • effect_allele: The allele stored in ALT column of filepath.
  • non_effect_allele: The allele stored in REF column of filepath.
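
If the required information is stored under non-default headers, the columns can be specified by position instead. This is a minimal sketch assuming the optional inputs above are keyword arguments; the file path and column positions are hypothetical:

using GhostKnockoffGWAS

# hypothetical Z-score file whose columns are ordered CHROM, BP, A2, A1, ZSCORE
# instead of carrying the default CHR/POS/REF/ALT/Z headers
zfile = "/path/to/my_zscores.txt"
z, chr, pos, effect_allele, non_effect_allele = GhostKnockoffGWAS.read_zscores(
    zfile, chr_col=1, pos_col=2, ref_col=3, alt_col=4, z_col=5)
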
GhostKnockoffGWAS.solve_blocks (Function)
solve_blocks(file::String, chr::Int, start_bp::Int, end_bp::Int, 
    outdir::String, hg_build::Int; [m=5], [tol=0.0001], [min_maf=0.01], 
    [min_hwe=0.0], [force_block_diag=true], 
    [method::String = "maxent"], [linkage::String="average"],
    [force_contiguous::Bool=false], [group_cor_cutoff::Float64=0.5], 
    [group_rep_cutoff::Float64=0.5], [verbose=true])

Solves the group knockoff optimization problem on provided individual-level data and outputs the result into outdir. All variants that reside on chromosome chr with position between start_bp and end_bp (inclusive) will be included.

Note on large VCF files

Currently reading/parsing a VCF file is a single-threaded operation (even if it is indexed). Thus, we strongly recommend converting to binary PLINK format first.

Inputs

  • file: A VCF or binary PLINK file storing individual level genotypes. Must end in .vcf, .vcf.gz, or .bed. If a VCF file is used, the ALT field for each record must be unique, i.e. multiallelic records must be split first. Missing genotypes will be imputed by column mean.
  • chr: Target chromosome. This MUST be an integer and it must match the CHROM field in your VCF/PLINK file. For example, if your VCF file has a CHROM field like chr1, CHR1, or CHROM1, it must be renamed to 1.
  • start_bp: starting basepair (position)
  • end_bp: ending basepair (position)
  • outdir: Directory that the output will be stored in (must exist)
  • hg_build: human genome build for position of each SNP, must be 19 (hg19) or 38 (hg38)

Optional inputs (for group knockoff optimization)

  • snps_to_keep: Vector of SNP positions to import. If specified, only SNPs whose position is listed in snps_to_keep will be kept (default nothing)
  • tol: Convergence tolerance for coordinate descent algorithm (default 0.0001)
  • min_maf: Minimum minor allele frequency for a variable to be considered (default 0.01)
  • min_hwe: Cutoff for Hardy-Weinberg equilibrium p-values. Only SNPs with p-value > min_hwe will be included (default 0.0)
  • force_block_diag: Whether to re-order the columns/rows of the correlation matrix and corresponding S matrix so that features in the same group are contiguous (default true). This has no impact on the final results, it is simply for computational performance.
  • method: group knockoff optimization algorithm, choices include "maxent" (default), "mvr", "sdp", or "equi". See sec 2 of https://arxiv.org/abs/2310.15069
  • linkage: cluster linkage function to use for hierarchically clustering groups. It defines how the distances between features are aggregated into the distances between groups. Valid choices include:
    • :average (default): use the mean distance between any of the cluster members
    • :single: use the minimum distance between any of the cluster members
    • :complete: use the maximum distance between any of the members
    • :ward: the distance is the increase of the average squared distance of a point to its cluster centroid after merging the two clusters
    • :ward_presquared: same as :ward, but assumes that the distances in d are already squared.
  • force_contiguous: whether to force groups to be contiguous (default false). Note that if force_contiguous=true, linkage must be :single.
  • group_cor_cutoff: correlation cutoff value for defining groups (default 0.5). Value should be between 0 and 1, where larger values correspond to larger groups.
  • group_rep_cutoff: cutoff value for selecting group-representatives (default 0.5). Value should be between 0 and 1, where larger values correspond to more representatives per group.
  • verbose: whether to print informative intermediate results (default true)
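
A minimal sketch of a call, assuming a binary PLINK file and an existing output directory; the paths and region boundaries below are illustrative:

using GhostKnockoffGWAS

# hypothetical paths and region: solve group knockoffs for a 1 Mb block on
# chromosome 1 using hg38 coordinates
plinkfile = "/path/to/genotypes.bed"
outdir = "/path/to/knockoff_output"   # must already exist
solve_blocks(plinkfile, 1, 1_000_000, 2_000_000, outdir, 38,
    min_maf=0.01, method="maxent", linkage="average")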

Output

Calling solve_blocks will create 3 files in the directory outdir/chr:

  • XXX.h5: This contains data (Sigma, S, groups, ..., etc) for region XXX. It contains the following:
    • D: A p × p (dense) matrix corresponding to the S matrix for both the representative and non-representative variables. Knockoff sampling should use this matrix.
    • S: Matrix obtained from solving the group knockoff optimization problem on the representative (group-key) variables.
    • Sigma: The original p × p correlation matrix estimated from the genotypes in file
    • SigmaInv: Inverse of Sigma
    • Sigma_reps: The correlation matrix for the representative variables. This is the matrix actually used to solve the group-knockoff optimization
    • Sigma_reps_inv: Inverse of Sigma_reps
    • group_reps: Indices of the variables that serve as group representatives (i.e. group-key variables)
    • groups: The group membership vector
  • Info_XXX.csv: This includes information for each variant (chr/pos/etc) present in the corresponding .h5 file.
  • summary_XXX.csv: Summary file for the knockoff optimization problem
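
The .h5 file can then be inspected with HDF5.jl. In this sketch, the region name XXX and the chromosome 1 subdirectory are placeholders for whatever your run actually produced:

using HDF5

# hypothetical file name: one region file written under outdir/1 (chromosome 1)
h5open(joinpath(outdir, "1", "XXX.h5"), "r") do file
    groups = read(file, "groups")   # group membership vector
    Sigma  = read(file, "Sigma")    # original p × p correlation matrix
    D      = read(file, "D")        # S matrix used for knockoff sampling
    println(size(Sigma, 1), " SNPs partitioned into ", length(unique(groups)), " groups")
end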