API
Here is a list of available function calls. A detailed description can be found below.
Knockoffs.Knockoff
Knockoffs.KnockoffFilter
Knockoffs.MarkovChainTable
Knockoffs.MK_statistics
Knockoffs.MK_statistics
Knockoffs.adj_constrained_hclust
Knockoffs.approx_modelX_gaussian_knockoffs
Knockoffs.block_diagonalize
Knockoffs.check_model_solution
Knockoffs.choose_group_reps
Knockoffs.cond_indep_corr
Knockoffs.condition
Knockoffs.fit_lasso
Knockoffs.fit_marginal
Knockoffs.fixed_knockoffs
Knockoffs.form_emission_prob_matrix
Knockoffs.forward_backward!
Knockoffs.forward_backward_sampling!
Knockoffs.full_knockoffscreen
Knockoffs.get_genotype_emission_probabilities
Knockoffs.get_genotype_transition_matrix
Knockoffs.get_haplotype_emission_probabilities
Knockoffs.get_haplotype_transition_matrix
Knockoffs.group_block_objective
Knockoffs.hc_partition_groups
Knockoffs.hmm_knockoff
Knockoffs.hmm_knockoff
Knockoffs.id_partition_groups
Knockoffs.initialize_S
Knockoffs.interpolative_decomposition
Knockoffs.inverse_mat_sqrt
Knockoffs.ipad
Knockoffs.likelihood_ratio
Knockoffs.lowrankdowndate_turbo!
Knockoffs.lowrankupdate_turbo!
Knockoffs.markov_knockoffs
Knockoffs.mk_threshold
Knockoffs.modelX_gaussian_group_knockoffs
Knockoffs.modelX_gaussian_knockoffs
Knockoffs.modelX_gaussian_rep_group_knockoffs
Knockoffs.normalize_col!
Knockoffs.prioritize_variants
Knockoffs.rapid
Knockoffs.search_rank
Knockoffs.shift_until_PSD!
Knockoffs.simulate_AR1
Knockoffs.simulate_ER
Knockoffs.simulate_block_covariance
Knockoffs.single_linkage_distance
Knockoffs.single_state_dmc_knockoff!
Knockoffs.solve_MVR
Knockoffs.solve_SDP
Knockoffs.solve_equi
Knockoffs.solve_group_SDP_single_block
Knockoffs.solve_group_SDP_subopt
Knockoffs.solve_group_block_update
Knockoffs.solve_group_equi
Knockoffs.solve_group_max_entropy_hybrid
Knockoffs.solve_group_mvr_hybrid
Knockoffs.solve_group_sdp_hybrid
Knockoffs.solve_max_entropy
Knockoffs.solve_s
Knockoffs.solve_s_graphical_group
Knockoffs.solve_s_group
Knockoffs.solve_sdp_ccd
Knockoffs.threshold
Knockoffs.update_normalizing_constants!
Knockoffs.MK_statistics
— MethodMK_statistics(T0::Vector, Tk::Vector{Vector}; filter_method)
Computes the multiple knockoff statistics kappa, tau, and W.
Inputs
T0
: p-vector of importance score for original variablesTk
: Vector storing T1, ..., Tm, where Ti is importance scores for thei
th knockoff copyfilter_method
: EitherStatistics.median
(default) or max (original function used in 2019 Gimenez and Zou)
output
κ
: Index of the most significant feature (κ[i] = 0
if original feature most important, otherwiseκ[i] = k
if thek
th knockoff is most important)τ
:τ[i]
stores the most significant statistic among original and knockoff variables minusfilter_method()
applied to the remaining statistics.W
: coefficient difference statisticW[i] = abs(T0[i]) - abs(Tk[i])
Knockoffs.MK_statistics
— MethodMK_statistics(T0::Vector, Tk::Vector)
Compute regular knockoff statistics tau and W.
Inputs
T0
: p-vector of importance score for original variablesTk
: p-vector of importance score for knockoff variables
output
W
: coefficient difference statisticW[i] = abs(T0[i]) - abs(Tk[i])
Knockoffs.adj_constrained_hclust
— Methodadj_constrained_hclust(distmat::AbstractMatrix, h::Number)
Performs (single-linkage) hierarchical clustering, forcing groups to be contiguous. After clustering, variables in different group is guaranteed to have distance less than h
.
Note: this is a custom (bottom-up) implementation because Clustering.jl
does not support adjacency constraints, see https://github.com/JuliaStats/Clustering.jl/issues/230
Knockoffs.approx_modelX_gaussian_knockoffs
— Methodapprox_modelX_gaussian_knockoffs(X, method; [m=1], [windowsize = 500], [covariance_approximator], kwargs...)
approx_modelX_gaussian_knockoffs(X, method, window_ranges; [m=1], [covariance_approximator], kwargs...)
Generates Gaussian knockoffs by approximating the covariance as a block diagonal matrix. Each block contains windowsize
consecutive features. One could alternatively specify the window_ranges
argument to construct blocks of different sizes.
Inputs
X
: An × p
numeric matrix orSnpArray
. Each row is a sample, and each column is covariate.method
: Can be one of the following:mvr
for minimum variance-based reconstructability knockoffs (alg 1 in ref 2):maxent
for maximum entropy knockoffs (alg 2 in ref 2):equi
for equi-distant knockoffs (eq 2.3 in ref 1),:sdp
for SDP knockoffs (eq 2.4 in ref 1):sdp_fast
for SDP knockoffs via coordiate descent (alg 2.2 in ref 3)
m
: Number of knockoff copies per variable to generate, defaults to 1.windowsize
: Number of covariates to be included in a block. Each block consists of adjacent variables. The last block could contain less thanwindowsize
variables.window_ranges
: Vector of ranges for each window. e.g. [1:97, 98:200, 201:500]covariance_approximator
: A covariance estimator, defaults toLinearShrinkage(DiagonalUnequalVariance(), :lw)
. See CovarianceEstimation.jl for more options.kwargs...
: Possible optional inputs to solvers specified inmethod
, seesolve_MVR
,solve_max_entropy
, andsolve_sdp_ccd
Multithreading (todo)
To enable multiple threads, simply start Julia with >1 threads and this routine will run with all available threads.
Covariance Approximation:
The covariance is approximated by a LinearShrinkageEstimator
using Ledoit-Wolf shrinkage with DiagonalUnequalVariance
target, which seems to perform well for p>n
cases. We do not simply use cov(X)
since isposdef(cov(X))
is typically false. For comparison of different estimators, see: https://mateuszbaran.github.io/CovarianceEstimation.jl/dev/man/msecomp/#msecomp
Knockoffs.block_diagonalize
— Methodblock_diagonalize(Σ, groups)
Internal function to block-diagonalize the covariance Σ
according to groups.
Knockoffs.check_model_solution
— Methodcheck_model_solution(model; verbose=false)
After solving a JuMP model, checks if the solution is accurate.
Knockoffs.choose_group_reps
— Methodchoose_group_reps(Σ::Symmetric, groups::AbstractVector; [threshold=0.5], [prioritize_idx], [Σinv])
Chooses group representatives. Returns indices of Σ
that are representatives. If R is the set of selected variables within a group and O is the set of variables outside the group, then we keep adding variables to R until the proportion of variance explained by R divided by the proportion of variance explained by R and O exceeds threshold
.
Inputs
Σ
: Correlation matrix wrapped in theSymmetric
argument.groups
: Vector of group membership.
Optional inputs
threshold
: Value between 0 and 1 that controls the number of representatives per group. Larger means more representatives (default 0.5)prioritize_idx
: Variable indices that should receive priority to be chosen as representatives, defaults tonothing
Σinv
: Precomputedinv(Σ)
(it will be computed if not supplied)
Knockoffs.cond_indep_corr
— MethodReturns Σnew
as a covariance matrix that strictly satisfies the conditional independence assumption.
Knockoffs.condition
— Methodcondition(x::AbstractVector, μ::AbstractVector, Σ::AbstractMatrix, S::AbstractMatrix, [m::Number=1])
Samples a knockoff x̃ from Gaussian x using conditional distribution formulas:
If (x, x̃) ~ N((μ, μ), G) where G = [Σ Σ - S; Σ - S Σ], then we sample x̃ from x̃|x = N(μ+(Σ-S)inv(Σ)(x-μ) , 2S-Sinv(Σ)S).
If we sample m
knockoffs, we use the algorithm in "Improving the Stability of the Knockoff Procedure: Multiple Simultaneous Knockoffs and Entropy Maximization" by Gimenez and Zou.
Inputs
X
: An × p
numeric matrix, each row is a sample, and each column is covariate.μ
: Ap × 1
vector of column mean ofX
Σ
: Ap × p
covariance matrix ofX
S
: Ap × p
matrix solved to satisfyS ⪰ 0
and(m+1)/m*Σ - S ⪰ 0
m
: Number of (simultaneous) knockoffs per variable to generate, defaultm=1
Output
X̃
: An × pm
numeric matrix. The firstp
columns store the first knockoff copy, and the nextp
columns store the second knockoff...etc
Todo
- When s is the zero vector, X̃ should be identical to X but it isn't
- Consider changing sampling code to using Distribution's MvNormal
- For multiple knockoffs, can we avoid storing a pm × pm matrix in memory?
Knockoffs.fit_lasso
— Methodfit_lasso(y, X, method=:maxent, ...)
fit_lasso(y, X, μ, Σ, method=:maxent, ...)
Generates model-X knockoffs with method
, runs Lasso, then applies the knockoff-filter. If μ
and Σ
are not provided, they will be estimated from data.
Inputs
y
: An × 1
response vectorX
: An × p
numeric matrix, each row is a sample, and each column is covariate.method
: Method for knockoff generation (defaults to:maxent
)μ
: Ap × 1
vector of column mean ofX
. If not provided, defaults to column mean.Σ
: Ap × p
covariance matrix ofX
. If not provided, it will be estimated based on a shrinked empirical covariance matrix, seemodelX_gaussian_knockoffs
d
: Distribution of response. DefaultsNormal()
, for binary response (logistic regression) useBinomial()
.m
: Number of simultaneous knockoffs to generate, defaults tom=1
fdrs
: Target FDRs, defaults to[0.01, 0.05, 0.1, 0.25, 0.5]
groups
: Vector of group membership. If not supplied, we generate regular knockoffs. If supplied, we run group knockoffs.filter_method
: Choices are:knockoff
or:knockoff_plus
(default)debias
: Defines how the selected coefficients are debiased. Specify:ls
for least squares or:lasso
for Lasso (only running on the support). To not debias, specifydebias=nothing
(default).kwargs
: Additional arguments to input intoglmnetcv
andglmnet
Knockoffs.fit_marginal
— Methodfit_marginal(y, X, method=:maxent, ...)
fit_marginal(y, X, μ, Σ, method=:maxent, ...)
Generates model-X knockoffs with method
and computes feature importance statistics based on squared marginal Z score: abs2(x[:, i]^t*y) / n. If μ
and Σ
are not provided, they will be estimated from data.
Inputs
y
: An × 1
response vectorX
: An × p
numeric matrix, each row is a sample, and each column is covariate.method
: Method for knockoff generation (defaults to:maxent
)μ
: Ap × 1
vector of column mean ofX
. If not provided, defaults to column mean.Σ
: Ap × p
covariance matrix ofX
. If not provided, it will be estimated based on a shrinked empirical covariance matrix, seemodelX_gaussian_knockoffs
d
: Distribution of response. DefaultsNormal()
, for binary response (logistic regression) useBinomial()
.m
: Number of simultaneous knockoffs to generate, defaults tom=1
fdrs
: Target FDRs, defaults to[0.01, 0.05, 0.1, 0.25, 0.5]
groups
: Vector of group membership. If not supplied, we generate regular knockoffs. If supplied, we run group knockoffs.filter_method
: Choices are:knockoff
or:knockoff_plus
(default)debias
: Defines how the selected coefficients are debiased. Specify:ls
for least squares or:lasso
for Lasso (only running on the support). To not debias, specifydebias=nothing
(default).kwargs
: Additional arguments to input intoglmnetcv
andglmnet
Knockoffs.fixed_knockoffs
— Methodfixed_knockoffs(X::Matrix{T}; [method], [kwargs...])
Creates fixed-X knockoffs. Internally, X
will be automatically normalized before computing its knockoff.
Inputs
X
: A column-normalizedn × p
numeric matrix, each row is a sample, and each column is covariate. We will internally normalizedX
if it is not.method
: Can be one of the following:mvr
: Minimum variance-based reconstructability knockoffs (alg 1 in ref 2):maxent
: Maximum entropy knockoffs (alg 2 in ref 2):equi
: Equi-distant knockoffs (eq 2.3 in ref 1),:sdp
: SDP knockoffs (eq 2.4 in ref 1):sdp_fast
: SDP knockoffs via coordiate descent (alg 2.2 in ref 3)
kwargs...
: Possible optional inputs tomethod
, seesolve_MVR
,solve_max_entropy
, andsolve_sdp_ccd
Output
GaussianKnockoff
: A struct containing the original (column-normalized)X
and its knockoffX̃
, in addition to other variables (e.g.s
)
Reference
- "Controlling the false discovery rate via Knockoffs" by Barber and Candes (2015).
- "Powerful knockoffs via minimizing reconstructability" by Spector, Asher, and Lucas Janson (2020)
- "FANOK: Knockoffs in Linear Time" by Askari et al. (2020).
Knockoffs.form_emission_prob_matrix
— Methodform_emission_prob_matrix(a, θ, xi::AbstractVector)
Inputs
a
:p × K
matrix with values estimated from fastPHASE (i.e. they called it the α parameter)θ
:p × K
matrix with values estimated from fastPHASExi
: Lengthp
vector with samplei
's genotypes (entries 0, 1 or 2)
Knockoffs.forward_backward!
— Functionforward_backward!(x, L, y, storage=zeros(length(x)))
Non-allocating solver for finding x
to the solution of LL'x = y where L is a cholesky factor.
Knockoffs.forward_backward_sampling!
— Methodforward_backward_sampling!(Z, X, Q, q, θ, ...)
Samples Z, the hidden states of a HMM, from observed sequence of unphased genotypes X.
Inputs
Z
: Length p
vector of integers. This will store the sampled Markov states X
: Length p
vector of genotypes (0, 1, or 2) Q
: K × K × p
array. Q[:, :, j]
is a K × K
matrix of transition probabilities for j
th state, i.e. Q[l, k, j] = P(X{j} = k | X{j - 1} = l). The first transition matrix is not used. q
: Length K
vector of initial probabilities θ
: The θ parameter estimated from fastPHASE
Preallocated storage variables
table
: a MarkovChainTable
that maps markov chain states to haplotype pairs (ka, kb) d
: Sampling distribution, probabilities in d.p are mutated α̂
: p × K
scaled forward probability matrix, where α̂[j, k] = P(x_1,...,x_k, z_k) / P(x_1,...,x_k)
c
: normalizing constants, c[k] = p(x_k | x_1,...,x_{k-1})
Reference
Algorithm 3 of "Gene hunting with hidden Markov model knockoffs" by Sesia et al
Knockoffs.full_knockoffscreen
— Methodfull_knockoffscreen(x::SnpArray; windowsize::Int=100)
Generates knockoffs X̃ⱼ
by on regressing Xⱼ
on SNPs knockoffs within a sliding window of width windowsize
.
Inputs
x
: ASnpArray
orString
for the path of the PLINK.bed
filewindowsize
:Int
specifying window width. Defaults to 100
Outputs
X̃
: An × p
dense matrix ofFloat64
, each row is a sample.
References
- He, Zihuai, Linxi Liu, Chen Wang, Yann Le Guen, Justin Lee, Stephanie Gogarten, Fred Lu et al. "Identification of putative causal loci in whole-genome sequencing data via knockoff statistics." Nature communications 12, no. 1 (2021): 1-18.
- He, Zihuai, Yann Le Guen, Linxi Liu, Justin Lee, Shiyang Ma, Andrew C. Yang, Xiaoxia Liu et al. "Genome-wide analysis of common and rare variants via multiple knockoffs at biobank scale, with an application to Alzheimer disease genetics." The American Journal of Human Genetics 108, no. 12 (2021): 2336-2353.
TODO
- Use
ElasticArrays.jl
to avoid reallocating design matrix in each loop - Write iterator interface to avoid allocating and storing all knockoffs at once
Knockoffs.get_genotype_emission_probabilities
— Methodget_genotype_emission_probabilities(θ::AbstractMatrix, xj::Number, ka::Int, kb::Int, j::Int)
Computes P(xj | k={ka,kb}, θ): emission probabilities for genotypes. This is eq 10 of "Gene hunting with hidden Markov model knockoffs" by Sesia et al.
Knockoffs.get_genotype_transition_matrix
— Methodget_genotype_transition_matrix(r, θ, α, q, table)
Compute transition matrices for the hidden Markov chains in unphased genotypes. This is equation 9 of "Gene hunting with hidden Markov model knockoffs" by Sesia et al.
Inputs
r
: Length p
vector, the "recombination rates" θ
: Size p × K
matrix, θ[j, k]
is probability that the allele is 1 for SNP p
at k
th haplotype motif α
: Size p × K
matrix, probabilities that haplotype motifs succeed each other. Rows should sum to 1. q
: Length K
vector of initial probabilities table
: a MarkovChainTable
that maps markov chain states k = 1, ..., K+(K+1)/2 to haplotype pairs (ka, kb).
Knockoffs.get_haplotype_emission_probabilities
— Methodget_haplotype_emission_probabilities(θ::AbstractMatrix, j::Int, hj::Number, zj::Int)
Computes emission probabilities for unphased HMM. This is the equation above eq8 of "Gene hunting with hidden Markov model knockoffs" by Sesia et al.
Knockoffs.get_haplotype_transition_matrix
— Methodget_haplotype_transition_matrix(r, θ, α)
Compute transition matrices for the hidden Markov chains in haplotypes. This is 2 equations above eq8 in "Gene hunting with hidden Markov model knockoffs" by Sesia et al.
Inputs
r
: Length p
vector, the "recombination rates" θ
: Size p × K
matrix, θ[j, k]
is probability that the allele is 1 for SNP p
at k
th haplotype motif α
: Size p × K
matrix, probabilities that haplotype motifs succeed each other. Rows should sum to 1.
Output
Q
: A p
-dimensional vector of K × K
matrices. Q[:, :, j]
is the j
th transition matrix.
Knockoffs.group_block_objective
— Methodgroup_block_objective(Σ, S, groups, m, method)
Evaluate the objective for SDP/MVR/ME. This is not an efficient function, so it should only be called at the start of each algorithm.
Inputs
Σ
: Covariance or correlation matrix for original dataS
: Optimization variable (group-block-diagonal)groups
: Vector of group membership. Variablei
belongs to groupgroups[i]
m
: Number of knockoffs to generate for each variablemethod
: The optimization method for group knockoffs
Knockoffs.hc_partition_groups
— Methodhc_partition_groups(X::AbstractMatrix; [cutoff], [min_clusters], [force_contiguous])
hc_partition_groups(Σ::Symmetric; [cutoff], [min_clusters], [force_contiguous])
Computes a group partition based on individual level data X
or correlation matrix Σ
using hierarchical clustering with specified linkage.
Inputs
X
:n × p
data matrix. Each row is a sampleΣ
:p × p
correlation matrix. Must be wrapped in theSymmetric
argument, otherwise we will treat it as individual level datacutoff
: Height value for which the clustering result is cut, between 0 and 1 (default 0.5). This ensures that no variables between 2 groups have correlation greater thancutoff
. 1 recovers ungrouped structure, 0 corresponds to everything in a single group.min_clusters
: The desired number of clusters.linkage
: cluster linkage function to use (whenforce_contiguous=true
,linkage
must be:single
).linkage
defines how the distances between the data points are aggregated into the distances between the clusters. Naturally, it affects what clusters are merged on each iteration. The valid choices are::single
(default): use the minimum distance between any of the cluster members:average
: use the mean distance between any of the cluster members:complete
: use the maximum distance between any of the members:ward
: the distance is the increase of the average squared distance of a point to its cluster centroid after merging the two clusters:ward_presquared
: same as:ward
, but assumes that the distances in d are already squared.
rep_method
: Method for selecting representatives for each group. Options are:id
(tends to select roughly independent variables) or:rss
(tends to select more correlated variables)
If force_contiguous = false
and both min_clusters
and cutoff
are specified, it is guaranteed that the number of clusters is not less than min_clusters
and their height is not above cutoff
. If force_contiguous = true
, min_clusters
keyword is ignored.
Outputs
groups
: Lengthp
vector of group membership for each variablegroup_reps
: Columns of X selected as representatives. Each group have at mostnrep
representatives. These are typically used to construct smaller group knockoff for extremely large groups
Knockoffs.hmm_knockoff
— Methodhmm_knockoff(plinkname; [datadir], [plink_outfile], [fastphase_outfile], [outdir], [verbose], args...)
Generates HMM knockoffs from binary PLINK formatted files. This is done by first running fastPHASE, then running Algorithm 2 of "Gene hunting with hidden Markov model knockoffs" by Sesia, Sabatti, and Candes
Input
plinkname
: Binary PLINK file names without the.bed/.bim/.fam
suffix.
Optional arguments
datadir
: Full path to the PLINK and fastPHASE files (default = current directory)plink_outfile
: Output PLINK format namefastphase_outfile
: The output file name from fastPHASE's alpha, theta, r filesargs...
: Any parameter that accepted infastPHASE.fastphase_estim_param()
Output
plink_outfile.bed
:n × p
knockoff genotypesplink_outfile.bim
: SNP mapping file. Knockoff have SNP names ending in ".k"plink_outfile.fam
: Sample mapping file, this is a copy of the originalplinkname.fam
filefastphase_outfile_rhat.txt
: averaged r hat file from fastPHASEfastphase_outfile_alphahat.txt
: averaged alpha hat file from fastPHASEfastphase_outfile_thetahat.txt
: averaged theta hat file from fastPHASE
Knockoffs.hmm_knockoff
— Methodhmm_knockoff(snpdata::SnpData, r::AbstractVecOrMat, θ::AbstractMatrix, α::AbstractMatrix)
Generates knockoff of snpdata
with loaded r, θ, α
Input
SnpData
: ASnpData
object from SnpArraysr
: The r vector estimated by fastPHASEθ
: The θ matrix estimated by fastPHASEα
: The α matrix estimated by fastPHASE
Optional Inputs
outdir
: Output directory for generated knockoffsplink_outfile
: Output file name for knockoff genotypesestimate_δ
: If true, will estimate pseudo-FDR by computing a δ value for each SNP via likelihood ratio bound
Knockoffs.id_partition_groups
— Methodid_partition_groups(X::AbstractMatrix; [rss_target], [force_contiguous])
id_partition_groups(Σ::Symmetric; [rss_target], [force_contiguous])
Compute group members based on interpolative decompositions. An initial pass first selects the most representative features such that regressing each non-represented feature on the selected will have residual less than rss_target
. The selected features are then defined as group centers and the remaining features are assigned to groups
Inputs
G
: Either individual level dataX
or a correlation matrixΣ
. If one inputsΣ
, it must be wrapped in theSymmetric
argument, otherwise we will treat it as individual level datarss_target
: Target residual level (greater than 0) for the first pass, smaller means more groupsforce_contiguous
: Whether groups are forced to be contiguous. If true, variants are assigned its left or right center, whichever has the largest correlation with it without breaking contiguity.
Outputs
groups
: Lengthp
vector of group membership for each variable
Note: interpolative decomposition is a stochastic algorithm. Set a seed to guarantee reproducible results.
Knockoffs.initialize_S
— Functioninitialize_S(Σ, groups, m, method, verbose)
Internal function to help initialize S
to a good starting value, returns the final S
matrix as well as the cholesky factorizations L
and C
where
- L.LL.U = cholesky((m+1)/mΣ - S)
- C.L*C.U = cholesky(S)
Knockoffs.interpolative_decomposition
— Methodinterpolative_decomposition(A::AbstractMatrix, rk::Int)
Computes the interpolative decomposition of A with rank rk
and returns the top rk
most representative columns of A
Knockoffs.inverse_mat_sqrt
— MethodComputes A^{-1/2} via eigen-decomposition
Knockoffs.ipad
— Methodipad(X::Matrix; [r_method], [m])
Generates knockoffs based on intertwined probabilitistic factors decoupling (IPAD). This assumes that X
can be factored as X = FΛ' + E
where F
is a n × r
random matrix of latent factors, Λ
are factor loadings, and E
are residual errors. When this assumption is met, FDR can be controlled with no power loss when applying the knockoff procedure. Internally, we need to compute an eigenfactorization for a n × n
matrix. This is often faster than standard model-X knockoffs which requires solving p
-dimensional convex optimization problem.
Inputs
X
: An × p
numeric matrix, each row is a sample, and each column is covariate.r_method
: Method used for estimatingr
, the number of latent factors. Choices include:er
(default),:gr
, or:ve
m
: Number of (simultaneous) knockoffs per variable to generate, defaultm=1
References
- Fan, Y., Lv, J., Sharifvaghefi, M. and Uematsu, Y., 2020. IPAD: stable interpretable forecasting with knockoffs inference. Journal of the American Statistical Association, 115(532), pp.1822-1834.
- Bai, J., 2003. Inferential theory for factor models of large dimensions. Econometrica, 71(1), pp.135-171.
- Ahn, S.C. and Horenstein, A.R., 2013. Eigenvalue ratio test for the number of factors. Econometrica, 81(3), pp.1203-1227.
Knockoffs.likelihood_ratio
— Methodlikelihood_ratio(θa, θb, ρ; α=0.1, n = 1000, threshold = true)
Estimates the likelihood ratio bound log(P(x)Q(x̃) / Q(x)P(x̃)) for each a single HMM state (fixed i and j)
- θa: State 1 of genotype markov state
- θb: State 2 of genotype markov state
- ρ is maf of SNP
- α is % SNPs decorrelated (defaults 10%)
Knockoffs.lowrankdowndate_turbo!
— Methodlowrankdowndate_turbo!(C::Cholesky, v::AbstractVector)
Vectorized version of lowrankdowndate!, source https://github.com/JuliaLang/julia/blob/742b9abb4dd4621b667ec5bb3434b8b3602f96fd/stdlib/LinearAlgebra/src/cholesky.jl#L753 Takes advantage of the fact that v
is 0 everywhere except at 1 position
Knockoffs.lowrankupdate_turbo!
— Methodlowrankupdate_turbo!(C::Cholesky, v::AbstractVector)
Vectorized version of lowrankupdate!, source https://github.com/JuliaLang/julia/blob/742b9abb4dd4621b667ec5bb3434b8b3602f96fd/stdlib/LinearAlgebra/src/cholesky.jl#L707 Takes advantage of the fact that v
is 0 everywhere except at 1 position
Knockoffs.markov_knockoffs
— Methodmarkov_knockoffs(Z::Vector{Int}, Q::Array{T, 3}, q::Vector{T})
Generates knockoff of variables distributed as a discrete Markov Chain with K
states.
Inputs
Z
: Lengthp
vector ofInt
whereZ[i]
is thei
th stateQ
:K × K × p
array.Q[:, :, j]
is aK × K
matrix of transition probabilities forj
th state, i.e. Q[l, k, j] = P(X{j} = k | X{j - 1} = l) The first transition matrix is not used.q
:K × 1
vector of initial probabilities
Reference
Equations 4-5 of "Gene hunting with hidden Markov model knockoffs" by Sesia, Sabatti, and Candes
Knockoffs.mk_threshold
— Methodmk_threshold(τ::Vector{T}, κ::Vector{Int}, m::Int, q::Number)
Chooses the multiple knockoff threshold τ̂ > 0
by setting τ̂ = min{ t > 0 : (1/m + 1/m * {#j: κ[j] ≥ 1 and W[j] ≥ t}) / {#j: κ[j] == 0 and W[j] ≥ τ̂} ≤ q }.
Inputs
τ
: τ[i] stores the feature importance score for the ith feature, i.e. the value T0 - median(T1,...,Tm). Note in Gimenez and Zou, the max function is used instead of medianκ
: κ[i] stores which of m knockoffs has largest importance score. When original variable has largest score, κ[i] == 0.m
: Number of knockoffs per variable generatedq
: target FDR (between 0 and 1)rej_bounds
: Number of values of top τ to consider (default = 10000)
Reference:
- Equations 8 and 9 in supplement of "Identification of putative causal loci in wholegenome sequencing data via knockoff statistics" by He et al.
- Algorithm 1 of "Improving the Stability of the Knockoff Procedure: Multiple Simultaneous Knockoffs and Entropy Maximization" by Gimenez and Zou.
Knockoffs.modelX_gaussian_group_knockoffs
— MethodmodelX_gaussian_group_knockoffs(X, method, groups, μ, Σ; [m], [covariance_approximator], [kwargs])
modelX_gaussian_group_knockoffs(X, method, groups; [m], [covariance_approximator], [kwargs])
Constructs Gaussian model-X group knockoffs. If the covariance Σ
and mean μ
are not specified, they will be estimated from data, i.e. we will make second-order group knockoffs. To incorporate group structure, the (true or estimated) covariance matrix is block-diagonalized according to groups
membership to solve a relaxed optimization problem. See reference paper and Knockoffs.jl docs for more details.
Inputs
X
: An × p
design matrix. Each row is a sample, each column is a feature.method
: Method for constructing knockoffs. Options include:maxent
: (recommended) for fully general maximum entropy group knockoffs:mvr
: for fully general minimum variance-based reconstructability (MVR) group knockoffs:equi
: for equi-correlated knockoffs. This is the methodology proposed inDai R, Barber R. The knockoff filter for FDR control in group-sparse and multitask regression. International conference on machine learning 2016 Jun 11 (pp. 1851-1859). PMLR.
:sdp
: Fully general SDP group knockoffs based on coodinate descent:sdp_block
: Fully general SDP group knockoffs where each block is solved exactly using an interior point solver.:sdp_subopt
: Chooses each blockS_{i} = γ_i * Σ_{ii}
. This slightly generalizes the equi-correlated group knockoff idea proposed in Dai and Barber 2016.
groups
: Vector of group membershipμ
: A lengthp
vector storing the true column means ofX
Σ
: Ap × p
covariance matrix for columns ofX
m
: Number of knockoffs per variable, defaults to 1.covariance_approximator
: A covariance estimator, defaults toLinearShrinkage(DiagonalUnequalVariance(), :lw)
. See CovarianceEstimation.jl for more options.kwargs
: Extra keyword arguments forsolve_s_group
How to define groups
The exported functions hc_partition_groups
and id_partition_groups
can be used to build a group membership vector.
A note on compute time
The computational complexity of group knockoffs scales quadratically with group size. Thus, very large groups (e.g. >100 members per group) dramatically slows down parameter estimation. In such cases, one can consider running the routine modelX_gaussian_rep_group_knockoffs
which constructs group knockoffs by choosing top representatives from each group.
Reference
Dai & Barber 2016, The knockoff filter for FDR control in group-sparse and multitask regression
Knockoffs.modelX_gaussian_knockoffs
— MethodmodelX_gaussian_knockoffs(X::Matrix, method::Symbol; [m], [covariance_approximator], [kwargs...])
modelX_gaussian_knockoffs(X::Matrix, method::Symbol, μ::Vector, Σ::Matrix; [m], [kwargs...])
Creates model-free multivariate normal knockoffs by sequentially sampling from conditional multivariate normal distributions. The true mean μ
and covariance Σ
is estimated from data if not supplied.
Inputs
X
: An × p
numeric matrix, each row is a sample, and each column is covariate.method
: Can be one of the following:mvr
for minimum variance-based reconstructability knockoffs (alg 1 in ref 2):maxent
for maximum entropy knockoffs (alg 2 in ref 2):equi
for equi-distant knockoffs (eq 2.3 in ref 1),:sdp
for SDP knockoffs (eq 2.4 in ref 1):sdp_ccd
for SDP knockoffs via coordiate descent (alg 2.2 in ref 3)
μ
: Ap × 1
vector of column mean ofX
, defaults to column meanΣ
: Ap × p
matrix of covariance ofX
, defaults to a shrinkage estimator specified bycovariance_approximator
.m
: Number of knockoff copies per variable to generate, defaults to 1.covariance_approximator
: A covariance estimator, defaults toLinearShrinkage(DiagonalUnequalVariance(), :lw)
which tends to give good empirical performance when p>n. See CovarianceEstimation.jl for more options.kwargs...
: Possible optional inputs to solvers specified inmethod
, seesolve_MVR
,solve_max_entropy
, andsolve_sdp_ccd
Reference:
- "Panning for Gold: Model-X Knockoffs for High-dimensional Controlled Variable Selection" by Candes, Fan, Janson, and Lv (2018)
- "Powerful knockoffs via minimizing reconstructability" by Spector, Asher, and Lucas Janson (2020)
- "FANOK: Knockoffs in Linear Time" by Askari et al. (2020).
Covariance Approximation:
The covariance is approximated by a linear shrinkage estimator using Ledoit-Wolf with DiagonalUnequalVariance
target, which seems to perform well for p>n
cases. We do not simply use cov(X)
since isposdef(cov(X))
is typically false. For comparison of various estimators, see: https://mateuszbaran.github.io/CovarianceEstimation.jl/dev/man/msecomp/#msecomp
Knockoffs.modelX_gaussian_rep_group_knockoffs
— MethodmodelX_gaussian_rep_group_knockoffs(X, method, groups; [m], [covariance_approximator], [kwargs...])
modelX_gaussian_rep_group_knockoffs(X, method, groups, μ, Σ; [m], [kwargs...])
Constructs group knockoffs by choosing representatives from each group and solving a smaller optimization problem based on the representatives only. Remaining knockoffs are generated based on a conditional independence assumption similar to a graphical model (details to be given later). The representatives are computed by choose_group_reps
Inputs
X
: An × p
design matrix. Each row is a sample, each column is a feature.method
: Method for constructing knockoffs. Options are the same asmodelX_gaussian_group_knockoffs
groups
: Vector ofInt
denoting group membership.groups[i]
is the group ofX[:, i]
covariance_approximator
: A covariance estimator, defaults toLinearShrinkage(DiagonalUnequalVariance(), :lw)
. See CovarianceEstimation.jl for more options.μ
: A lengthp
vector storing the true column means ofX
Σ
: Ap × p
covariance matrix for columns ofX
rep_threshold
: Value between 0 and 1 that controls the number of representatives per group. Larger means more representatives (default 0.5)m
: Number of knockoffs per variable, defaults to 1.kwargs
: Extra keyword arguments forsolve_s_group
Knockoffs.normalize_col!
— Methodnormalize_col!(X::AbstractVecOrMat, [center=false])
Normalize each column of X
so they sum to 1.
Knockoffs.prioritize_variants
— Methodprioritize_variants!(index::AbstractVector, priority_vars::AbstractVector)
Given (unsorted) index
, we make variables in priority_vars
appear first in index
, preserving the original order in index
and those not in priority_vars
.
Example
index = [11, 4, 5, 9, 7]
priority_vars = [4, 9]
result = prioritize_variants(index, priority_vars)
result == [4, 9, 11, 5, 7]
Knockoffs.rapid
— Methodrapid(rapid_exe, vcffile, mapfile, d, outfolder, w, r, s, [a])
Wrapper for the RaPID program.
Inputs
rapid_exe
: Full path to theRaPID_v.1.7
executable filevcffile
: Phased VCF file namemapfile
: Map file named
: Actual Minimum IBD length in cMoutfolder
: Output folder namew
: Number of SNPs in a window for sub-samplingr
: Number of runss
: Minimum number of successes to consider a hit
Optional Inputs
a
: Iftrue
, ignore MAFs. By default (a=false
) the sites are selected at random weighted by their MAFs.
Knockoffs.search_rank
— Functionsearch_rank(A::AbstractMatrix, sk::Vector{Int}, target=0.25, verbose=false)
Finds the rank (number of columns of A) that best approximates the remaining columns such that regressing each remaining variable on those selected has RSS less than some target.
Σ
: Original (p × p) correlation matrixA
: The (upper triangular) cholesky factor of Σsk
: The (unsorted) columns of A, earlier ones are more importanttarget
: Target residual level
note: we cannot do binary search because large ranks can increase residuals
Knockoffs.shift_until_PSD!
— Functionshift_until_PSD!(Σ::AbstractMatrix)
Keeps adding λI to Σ until the minimum eigenvalue > tol
Knockoffs.simulate_AR1
— Methodsimulate_AR1(p::Int, a=1, b=1, tol=1e-3, max_corr=1, rho=nothing)
Generates p
-dimensional correlation matrix for AR(1) Gaussian process, where successive correlations are drawn from Beta(a
,b
) independently. If rho
is specified, then the process is stationary with correlation rho
.
Source
https://github.com/amspector100/knockpy/blob/20eddb3eb60e0e82b206ec989cb936e3c3ee7939/knockpy/dgp.py#L61
Knockoffs.simulate_ER
— Methodsimulate_ER(p::Int; [invert])
Simulates a covariance matrix from a clustered Erdos-Renyi graph, which is a block diagonal matrix where each block is an Erdo-Renyi graph. The result is scaled back to a correlation matrix.
For details, see the 4th simulation routine in section 5.1 of Li and Maathius https://academic.oup.com/jrsssb/article/83/3/534/7056103?login=false
Inputs
p
: Dimension of covariance matrixϕ
: Probability of forming an edge between any 2 nodeslb
: lower bound for the value of an edge (drawn from uniform distribution)ub
: upper bound for the value of an edge (drawn from uniform distribution)invert
: Whther to invert the covariance matrix (to obtain the precision)λmin
: minimum eigenvalue of the resulting covariance matrixblocksize
: Number of variables within each ER graph.
Knockoffs.simulate_block_covariance
— Methodsimulate_block_covariance(groups, ρ, γ, num_v, w)
Simulates a block covariance matrix similar to the one in Dai & Barber 2016, The knockoff filter for FDR control in group-sparse and multitask regression
. That is, all diagonal elements will be 1, correlation within groups will be ρ
, and correlation between groups will be ρ*γ
.
Inputs
groups
: Vector of group membershipρ
: within group correlationγ
: between group correlation
Optional arguments
num_v
: Number of added rank 1 updateΣ + v1*v1' + ... + vn*vn'
wherev
is iidN(0, w)
(default 0)w
: variance of the rank 1 update used innum_v
(default 1)
Knockoffs.single_linkage_distance
— Methodsingle_linkage_distance(distmat, left, right)
Computes the minimum distance (i.e. single-linkage distance) between members in left
and members in right
. Member distances are precomputed in distmat
Knockoffs.single_state_dmc_knockoff!
— MethodSamples Zj
, the j
state of the hidden Markov chain.
Knockoffs.solve_MVR
— Methodsolve_MVR(Σ::AbstractMatrix)
Solves the minimum variance-based reconstructability problem for fixed-X and model-X knockoffs given correlation matrix Σ. Users should call solve_s
instead of this function.
See algorithm 1 of "Powerful knockoffs via minimizing reconstructability" by Spector, Asher, and Lucas Janson (2020) https://arxiv.org/pdf/2011.14625.pdf
Knockoffs.solve_SDP
— Methodsolve_SDP(Σ::AbstractMatrix)
Solves the SDP problem for fixed-X and model-X knockoffs given correlation matrix Σ. Users should call solve_s
instead of this function.
The optimization problem is stated in equation 3.13 of https://arxiv.org/pdf/1610.02351.pdf
Arguments
Σ
: A correlation matrix (diagonals all equal to 1)m
: Number of knockoffs to generate, defaults to 1optm
: SDP solver. Defaults toHypatia.Optimizer(verbose=false)
. This can be any solver that supports the JuMP interface. For example, useSDPT3.Optimizer
in SDPT3.jl package (which is a MATLAB dependency) for the best performance.
Knockoffs.solve_equi
— Methodsolve_equi(Σ::AbstractMatrix)
Solves the equicorrelated problem for fixed-X and model-X knockoffs given correlation matrix Σ. Users should call solve_s
instead of this function.
Knockoffs.solve_group_SDP_single_block
— Methodsolve_group_SDP_single_block(Σ11, ub)
Solves a single block of the fully general group SDP problem. The objective is min sum_{i,j} |Σ[i,j] - S[i,j]| s.t. 0 ⪯ S ⪯ A11 - [A12 A13]inv(A22-S2 A32; A23 A33-S3)[A21; A31]
Inputs
Σ11
: The block corresponding to the current group. Must be a correlation matrix.ub
: The matrix defined as A11 - [A12 A13]inv(A22-S2 A32; A23 A33-S3)[A21; A31]optm
: Any solver compatible with JuMP.jl
Knockoffs.solve_group_SDP_subopt
— MethodSolves the SDP group knockoff problem using analogy to the equi-correlated group knockoffs. Basically, the idea is to optimize a vector γ
where γ[j]
multiplies Σ_jj. In the equi-correlated setting, all γ[j]
is forced to be equal.
Details can be found in Dai & Barber 2016, The knockoff filter for FDR control in group-sparse and multitask regression
Knockoffs.solve_group_block_update
— MethodTodo
- somehow avoid reallocating ub every iteration
- When solving each individual block,
- warmstart
- avoid reallocating S1_new
- allocate vector of models
- use loose convergence criteria
- For singleton groups, don't use JuMP and directly update
- Currently all objective values are computed based on SDP case. Need to display objective values for ME/MVR objective
Knockoffs.solve_group_equi
— MethodSolves the equi-correlated group knockoff problem. Here Σ
is the true covariance matrix (scaled so that it has 1 on its diagonal) and Σblocks
is the block-diagonal covariance matrix where each block corresponds to groups.
Details can be found in Dai & Barber 2016, The knockoff filter for FDR control in group-sparse and multitask regression
Knockoffs.solve_group_max_entropy_hybrid
— Methodsolve_group_max_entropy_hybrid(Σ, groups, [outer_iter=100], [inner_pca_iter=1],
[inner_ccd_iter=1], [tol=0.0001], [ϵ=1e-6], [m=1], [robust=false], [verbose=false])
Solves the group-knockoff optimization problem based on Maximum Entropy objective. Users should call solve_s_group
instead of this function.
Inputs
Σ
: Correlation matrixgroups
: Group membership vector
Optional inputs
outer_iter
: Maximum number of outer iterations. Each outer iteration will performinner_pca_iter
PCA updatesinner_ccd_iter
full optimization updates (default = 100).inner_pca_iter
: Number of full PCA updates before changing to fully general coordinate descent updates (default = 1)inner_ccd_iter
: Number of full general coordinate descent updates before changing to PCA updates (default = 1)tol
: convergence tolerance. Algorithm converges whenabs((obj_new-obj_old)/obj_old) < tol
OR when changes inS
matrix falls below 1e-4ϵ
: tolerance added to the lower and upper bound, prevents numerical issues (default =1e-6
)m
: Number of knockoffs per variable (defaults1
)robust
: whether to use "robust" Cholesky updates. Ifrobust=true
, alg will be ~10x slower, only use this ifrobust=false
causes cholesky updates to fail. (defaultfalse
)verbose
: Whether to print intermediate results (defaultfalse
)
Knockoffs.solve_group_mvr_hybrid
— Methodsolve_group_mvr_hybrid(Σ, groups, [outer_iter=100], [inner_pca_iter=1],
[inner_ccd_iter=1], [tol=0.0001], [ϵ=1e-6], [m=1], [robust=false], [verbose=false])
Solves the group-knockoff optimization problem based on MVR objective. Users should call solve_s_group
instead of this function.
Inputs
Σ
: Correlation matrixgroups
: Group membership vector
Optional inputs
outer_iter
: Maximum number of outer iterations. Each outer iteration will performinner_pca_iter
PCA updatesinner_ccd_iter
full optimization updates (default = 100).inner_pca_iter
: Number of full PCA updates before changing to fully general coordinate descent updates (default = 1)inner_ccd_iter
: Number of full general coordinate descent updates before changing to PCA updates (default = 1)tol
: convergence tolerance. Algorithm converges whenabs((obj_new-obj_old)/obj_old) < tol
OR when changes inS
matrix falls below 1e-4ϵ
: tolerance added to the lower and upper bound, prevents numerical issues (default =1e-6
)m
: Number of knockoffs per variable (defaults1
)robust
: whether to use "robust" Cholesky updates. Ifrobust=true
, alg will be ~10x slower, only use this ifrobust=false
causes cholesky updates to fail. (defaultfalse
)verbose
: Whether to print intermediate results (defaultfalse
)
Knockoffs.solve_group_sdp_hybrid
— Methodsolve_group_sdp_hybrid(Σ, groups, [outer_iter=100], [inner_pca_iter=1],
[inner_ccd_iter=1], [tol=0.0001], [ϵ=1e-6], [m=1], [robust=false], [verbose=false])
Solves the group-knockoff optimization problem based on SDP objective. Users should call solve_s_group
instead of this function.
Inputs
Σ
: Correlation matrixgroups
: Group membership vector
Optional inputs
outer_iter
: Maximum number of outer iterations. Each outer iteration will performinner_pca_iter
PCA updatesinner_ccd_iter
full optimization updates (default = 100).inner_pca_iter
: Number of full PCA updates before changing to fully general coordinate descent updates (default = 1)inner_ccd_iter
: Number of full general coordinate descent updates before changing to PCA updates (default = 1)tol
: convergence tolerance. Algorithm converges whenabs((obj_new-obj_old)/obj_old) < tol
OR when changes inS
matrix falls below 1e-4ϵ
: tolerance added to the lower and upper bound, prevents numerical issues (default =1e-6
)m
: Number of knockoffs per variable (defaults1
)robust
: whether to use "robust" Cholesky updates. Ifrobust=true
, alg will be ~10x slower, only use this ifrobust=false
causes cholesky updates to fail. (defaultfalse
)verbose
: Whether to print intermediate results (defaultfalse
)
Knockoffs.solve_max_entropy
— Methodsolve_max_entropy(Σ::AbstractMatrix)
Solves the maximum entropy knockoff problem for fixed-X and model-X knockoffs given correlation matrix Σ. Users should call solve_s
instead of this function.
Reference
Algorithm 2.2 from Powerful Knockoffs via Minimizing Reconstructability: https://arxiv.org/pdf/2011.14625.pdf
Note
There is a typo in algorithm for computing ME knockoffs in "Powerful knockoffs via minimizing reconstructability" by Spector, Asher, and Lucas Janson (2020). In the supplemental section, equation 59, they needed to evaluate c_m = D^t_{-j,j}D^{-1}_{-j,-j}D_{-j,j}
. They claimed the FANOK paper ("FANOK: KNOCKOFFS IN LINEAR TIME" by Askari et al. (2020)) implies that c_m = ||v_m||^2
where Lv_m = u
. However, according to section A.1.2 of the FANOK paper, it seems like the actual update should be D^t_{-j,j}D^{-1}_{-j,-j}D_{-j,j} = ζ*||c_m||^2 / (ζ + ||c_m||^2)
where ζ = 2Σ_{jj} - s_j
.
Knockoffs.solve_s
— Methodsolve_s(Σ::Symmetric, method::Symbol; m=1, kwargs...)
Solves the vector s
for generating knockoffs. Σ
can be a general covariance matrix but it must be wrapped in the Symmetric
keyword.
Inputs
Σ
: A covariance matrix (one must wrapSymmetric(Σ)
explicitly)method
: Can be one of the following:mvr
for minimum variance-based reconstructability knockoffs (alg 1 in ref 2):maxent
for maximum entropy knockoffs (alg 2 in ref 2):equi
for equi-distant knockoffs (eq 2.3 in ref 1),:sdp
for SDP knockoffs (eq 2.4 in ref 1):sdp_ccd
fast SDP knockoffs via coordiate descent (alg 2.2 in ref 3)
m
: Number of knockoffs per variable, defaults to 1.kwargs
: Extra arguments available for specific methods. For example, to use less stringent convergence tolerance for MVR knockoffs, specifytol = 0.001
. For a list of available options, seesolve_MVR
,solve_max_entropy
,solve_sdp_ccd
,solve_SDP
, orsolve_equi
Reference
- "Controlling the false discovery rate via Knockoffs" by Barber and Candes (2015).
- "Powerful knockoffs via minimizing reconstructability" by Spector, Asher, and Lucas Janson (2020)
- "FANOK: Knockoffs in Linear Time" by Askari et al. (2020).
Knockoffs.solve_s_graphical_group
— Methodsolve_s_graphical_group(Σ::Symmetric, groups::Vector{Int}, group_reps::Vector{Int},
method; [m], [verbose])
Solves the group knockoff problem but the convex optimization problem only runs on the representatives. The non-representative variables are assumed to be independent by groups when conditioning on the reprensetatives.
Inputs
Σ
: Symmetricp × p
covariance matrixgroups
:p
dimensional vector of group membershipgroup_reps
: Indices for the representatives.method
: Method for solving group knockoff problemm
: Number of knockoffs to generate per featureverbose
: Whether to print informative intermediate resultskwargs...
: extra arguments forsolve_s_group
Outputs
S
: Matrix obtained from solving the optimization problem on the representatives.D
: Ap × p
(dense) matrix corresponding to the S matrix for both the representative and non-representative variables. Knockoff sampling should use this matrix. If the graphical conditional independent assumption is satisfied exactly, this matrix should be sparse, but it is always never sparse unless we usecond_indep_corr
to force the covariance matrix to satisify it.obj
: Objective value for solving the optimization problem on the representatives.
Knockoffs.solve_s_group
— Methodsolve_s_group(Σ, groups, method; [m=1], kwargs...)
Solves the group knockoff problem, returns block diagonal matrix S satisfying (m+1)/m*Σ - S ⪰ 0
where m
is number of knockoffs per feature.
Inputs
Σ
: A general covariance matrix wrapped bySymmetric
keywordgroups
: Vector of group membership, does not need to be contiguousmethod
: Method for constructing knockoffs. Options include:maxent
: (recommended) for fully general maximum entropy group knockoffs:mvr
: for fully general minimum variance-based reconstructability (MVR) group knockoffs:equi
: for equi-correlated knockoffs. This is the methodology proposed inDai R, Barber R. The knockoff filter for FDR control in group-sparse and multitask regression. International conference on machine learning 2016 Jun 11 (pp. 1851-1859). PMLR.
:sdp
: Fully general SDP group knockoffs based on coodinate descent:sdp_subopt
: Chooses each blockS_{i} = γ_i * Σ_{ii}
. This slightly generalizes the equi-correlated group knockoff idea proposed in Dai and Barber 2016.:sdp_block
: Fully general SDP group knockoffs where each block is solved exactly using an interior point solver.
m
: Number of knockoffs per variable, defaults to 1.kwargs
: Extra arguments available for specific methods. For example, to use less stringent convergence tolerance, specifytol = 0.001
. For a list of available options, seesolve_group_mvr_hybrid
,solve_group_max_entropy_hybrid
,solve_group_sdp_hybrid
, orsolve_group_equi
Output
S
: A matrix solved so that(m+1)/m*Σ - S ⪰ 0
andS ⪰ 0
γ
: A vector that is only non-empty for equi and suboptimal knockoff constructions. They correspond to values of γ whereS_{gg} = γΣ_{gg}
. So for equi, the vector is length 1. For SDP, the vector has length equal to number of groupsobj
: Final SDP/MVR/ME objective value givenS
. Equi-correlated group knockoffs and singleton (non-grouped knockoffs) returns 0 because they either no objective value or it is not necessary to evaluate the objectives
Warning
This function potentially permutes the columns/rows of Σ
, and puts them back at the end. Thus one should NOT call solve_s_group
on the same Σ
simultaneously, e.g. in a multithreaded for loop. Permutation does not happen when groups are contiguous.
Knockoffs.solve_sdp_ccd
— Methodsolve_sdp_ccd(Σ::AbstractMatrix)
Solves the SDP problem for fixed-X and model-X knockoffs using coordinate descent, given correlation matrix Σ. Users should call solve_s
instead of this function.
Reference
Algorithm 2.2 from "FANOK: Knockoffs in Linear Time" by Askari et al. (2020).
Knockoffs.threshold
— Methodthreshold(w::AbstractVector, q::Number, [method=:knockoff], [m::Int=1])
Chooses a threshold τ > 0 by choosing τ
to be one of the following τ = min{ t > 0 : {#j: w[j] ≤ -t} / {#j: w[j] ≥ t} ≤ q } (method=:knockoff
) τ = min{ t > 0 : (1 + {#j: w[j] ≤ -t}) / {#j: w[j] ≥ t} ≤ q } (method=:knockoff
)
Inputs
w
: Vector of feature important statisticsq
: target FDR (between 0 and 1)method
: either:knockoff
or:knockoff_plus
(default)rej_bounds
: Number of values of top W to consider (default = 10000)
Reference:
Equation 3.10 (method=:knockoff
) or 3.11 (method=:knockoff_plus
) of "Panning for Gold: Model-X Knockoffs for High-dimensional Controlled Variable Selection" by Candes, Fan, Janson, and Lv (2018)
Knockoffs.update_normalizing_constants!
— Methodupdate_normalizing_constants!(Q::AbstractMatrix{T}, q::AbstractVector{T})
Computes normalizing constants recursively using equation (5).
Inputs
Q
:K × K × p
array.Q[:, :, j]
is aK × K
matrix of transition probabilities forj
th state, i.e. Q[l, k, j] = P(X{j} = k | X{j - 1} = l). The first transition matrix is not used.q
:K × 1
vector of initial probabilities
todo: efficiency
Knockoffs.Knockoff
— TypeA Knockoff
holds the original design matrix X
, along with its knockoff Xko
.
Knockoffs.KnockoffFilter
— TypeA KnockoffFilter
is essentially a Knockoff
that has gone through a feature selection procedure, such as the Lasso. It stores, among other things, the final estimated parameters beta
after applying the knockoff-filter procedure.
The debiased
variable is a boolean indicating whether estimated effect size have been debiased with Lasso. The W
vector stores the feature importance statistic that satisfies the flip coin property. tau
is the knockoff threshold, which controls the empirical FDR at level q
Knockoffs.MarkovChainTable
— TypeGenotype states are index pairs (ka, kb) where ka, kb is unordered haplotype 1 and 2. If there are K=5 haplotype motifs, then the 15 possible genotype states and their index are
(1, 1) = 1 (1, 2) = 2 (2, 2) = 6 (1, 3) = 3 (2, 3) = 7 (3, 3) = 10 (1, 4) = 4 (2, 4) = 8 (3, 4) = 11 (4, 4) = 13 (1, 5) = 5 (2, 5) = 9 (3, 5) = 12 (4, 5) = 14 (5, 5) = 15