Command-line documentation and usage of GhostKnockoffGWAS
Usage
Simple run
GhostKnockoffGWAS --zfile example_zfile.txt --LD-files EUR --N 506200 --genome-build 38 --out example_output
Required inputs
Option name | Argument | Description |
---|---|---|
--zfile | String | Input file containing Z-scores as well as CHR/POS/REF/ALT. See Acceptable Z-score files for the detailed requirements on this file. |
--LD-files | String | Input directory containing the pre-processed LD files. Most users download these from the Downloads Page |
--N | Int | Sample size for target (original) study |
--genome-build | Int | The human genome build used for SNP positions in zfile (this value must be 19 or 38) |
--out | String | Output file name (without extensions) |
Optional inputs
Option name | Argument | Description |
---|---|---|
--CHR | Int | The column in zfile that will be read as chromosome number (note that chromosome values must be integers; chr22, X, chrX, etc. are NOT acceptable). [If not specified, we will search for a column with header CHR ] |
--POS | Int | The column in zfile that will be read as SNP position. [If not specified, we will search for a column with header POS ] |
--REF | Int | The column in zfile that will be read as REF (non-effect allele). [If not specified, we will search for a column with header REF ] |
--ALT | Int | The column in zfile that will be read as ALT (effect allele). [If not specified, we will search for a column with header ALT ] |
--Z | Int | The column in zfile that will be read as Z-scores. [If not specified, we will search for a column with header Z ] |
--seed | Int | Sets the random seed [If not specified, defaults to 2023 ] |
--verbose | Bool | Whether to print intermediate messages [If not specified, defaults to true ] |
--random-shuffle | Bool | Whether to randomly permute the order of Z-scores and their knockoffs. The main purpose of this option is to guard against potential ordering bias in Lasso solvers. However, we never observed such biases in our simulations, so this option is off by default. [If not specified, defaults to false ] |
--skip-shrinkage-check | Bool | Whether to allow Knockoff analysis to proceed even with large (>0.25) LD shrinkages [If not specified, defaults to false ] |
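
When GhostKnockoffGWAS is launched from a script, it can help to assemble the required and optional inputs in one place. The following is a minimal Python sketch, not part of the tool: `build_command` is a hypothetical helper, it only uses option names listed in the tables above, and it assumes that Bool options are passed as lowercase `true`/`false`, which the tables do not state explicitly.

```python
import shlex


def build_command(zfile, ld_files, n, genome_build, out, **optional):
    """Assemble a GhostKnockoffGWAS command line as a list of arguments.

    Hypothetical helper. Optional inputs are given as keyword arguments,
    with underscores in place of hyphens (random_shuffle -> --random-shuffle).
    """
    if genome_build not in (19, 38):
        raise ValueError("--genome-build must be 19 or 38")
    cmd = [
        "GhostKnockoffGWAS",
        "--zfile", str(zfile),
        "--LD-files", str(ld_files),
        "--N", str(n),
        "--genome-build", str(genome_build),
        "--out", str(out),
    ]
    for name, value in optional.items():
        flag = "--" + name.replace("_", "-")
        # Assumption: Bool options are written as lowercase true/false.
        if isinstance(value, bool):
            value = "true" if value else "false"
        cmd += [flag, str(value)]
    return cmd


# The simple run from above, plus two optional inputs.
cmd = build_command("example_zfile.txt", "EUR", 506200, 38, "example_output",
                    CHR=6, seed=2023)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the executable is on your `PATH`.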
Output format
- A summary file, e.g. example_output_summary.txt. This file contains a broad summary of the analysis.
- A comma-separated file, e.g. example_output.txt. This file contains the full GhostKnockoffGWAS output, one SNP in each row.
- (optional) Manhattan plots, which can be generated by following step 5 of the detailed example.
For a more detailed explanation of the summary file and the comma-separated file, see the Tutorial.
Acceptable Z-scores file format
The Z score file should satisfy the following requirements:
- It is a comma- or tab-separated text file (.gz compressed is acceptable)
- The first row should be a header line, and every row after the first will be treated as a different SNP.
- By default GhostKnockoffGWAS will search for column names CHR, POS, REF, ALT, and Z. Alternatively, you can specify which column should be used for each of these fields by providing the corresponding optional inputs, e.g. --CHR 6 tells GhostKnockoffGWAS to use column 6 as CHR. The ALT allele will be treated as the effect allele and REF will be treated as the non-effect allele. The POS (position) field of each variant must be from HG19 or HG38, which must be specified by the --genome-build argument.
Here is a minimal example with 9 SNPs, one of which has a missing Z score:
CHR POS REF ALT Z
17 150509 T TA 1.08773561923134
17 151035 T C 0.703898767202681
17 151041 G A NaN
17 151872 T C -0.299877259561085
17 152087 C T -0.371627135786605
17 152104 G A -0.28387322965385
17 152248 G A 0.901618600934489
17 152427 G A 1.10987516000804
17 152771 A G 0.708492545266136
A toy example is example_zfile.txt (17MB).
Missing Z scores can be specified as NaN or as an empty cell. If you do not want a SNP to be considered in the analysis, you can change its Z-score to NaN. The CHR/POS/REF/ALT fields cannot have missing values.
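
The format rules above can be checked before launching an analysis. The following Python sketch is an illustration only and is not how GhostKnockoffGWAS itself parses the file: `check_zfile` is a hypothetical helper that assumes the default headers CHR, POS, REF, ALT, and Z are used (rather than the column-index options) and guesses the delimiter from the header line.

```python
import csv
import gzip
import os
import tempfile

REQUIRED = ["CHR", "POS", "REF", "ALT", "Z"]


def check_zfile(path):
    """Return a list of format problems found in a Z-score file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    problems = []
    with opener(path, "rt", newline="") as handle:
        header = handle.readline()
        # Comma- or tab-separated: guess from the header line.
        delimiter = "\t" if "\t" in header else ","
        columns = header.rstrip("\r\n").split(delimiter)
        missing = [c for c in REQUIRED if c not in columns]
        if missing:
            return ["missing header(s): " + ", ".join(missing)]
        reader = csv.reader(handle, delimiter=delimiter)
        for line_no, row in enumerate(reader, start=2):
            if not row:
                continue
            rec = dict(zip(columns, row))
            # CHR/POS/REF/ALT cannot have missing values.
            for field in ("CHR", "POS", "REF", "ALT"):
                if rec.get(field, "").strip() == "":
                    problems.append(f"line {line_no}: {field} is missing")
            # CHR and POS must be integers (chr22, X, ... are rejected).
            for field in ("CHR", "POS"):
                value = rec.get(field, "").strip()
                if value and not value.isdecimal():
                    problems.append(
                        f"line {line_no}: {field}={value!r} is not an integer")
            # Z may be NaN or an empty cell; anything else must be numeric.
            z = rec.get("Z", "").strip()
            if z:
                try:
                    float(z)  # float() also accepts "NaN"
                except ValueError:
                    problems.append(f"line {line_no}: Z={z!r} is not numeric")
    return problems


demo = (
    "CHR\tPOS\tREF\tALT\tZ\n"
    "17\t150509\tT\tTA\t1.08773561923134\n"
    "17\t151041\tG\tA\tNaN\n"
    "chr17\t151872\tT\tC\t-0.299877259561085\n"
    "17\t152087\tC\t\t\n"
)
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo_zfile.txt")
    with open(path, "w") as handle:
        handle.write(demo)
    problems = check_zfile(path)
for problem in problems:
    print(problem)
```

In the demo, the third SNP is flagged because its chromosome is written as chr17, and the fourth because ALT is empty; the NaN and empty Z cells are accepted as missing Z scores.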
Requirements on the input Z-scores
In our papers, Z-scores are defined by $z = \frac{1}{\sqrt{N}}X^ty$ where $X$ is the $N \times P$ standardized genotype matrix with $N$ samples and $P$ SNPs, $y$ is the normalized $N \times 1$ phenotype vector, and each of these Z-scores has a $N(0, 1)$ distribution under the null.
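
To make the definition concrete, here is a small Python sketch that evaluates $z = \frac{1}{\sqrt{N}}X^ty$ on individual-level data. It is an illustration under stated assumptions, not the code used in the papers: the function names are hypothetical, and "standardized" and "normalized" are taken to mean mean 0 and variance 1 computed with the population (divide-by-$N$) formula. Under that convention each Z-score equals $\sqrt{N}$ times the Pearson correlation between a SNP and the phenotype.

```python
import math
import random
import statistics


def standardize(values):
    """Center to mean 0 and scale to population variance 1.

    Raises ZeroDivisionError for a constant column (e.g. a monomorphic SNP).
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def marginal_z_scores(genotypes, phenotype):
    """Compute z = X^t y / sqrt(N).

    genotypes: list of N rows, each holding P allele counts.
    phenotype: list of N phenotype values.
    """
    n = len(phenotype)
    y = standardize(phenotype)
    # zip(*genotypes) iterates over the P SNP columns of X.
    columns = [standardize(col) for col in zip(*genotypes)]
    return [sum(x * yi for x, yi in zip(col, y)) / math.sqrt(n)
            for col in columns]


# Simulated data: 200 samples, 3 SNPs, phenotype independent of genotype.
rng = random.Random(2023)
genotypes = [[rng.choice([0, 1, 2]) for _ in range(3)] for _ in range(200)]
phenotype = [rng.gauss(0.0, 1.0) for _ in range(200)]
z = marginal_z_scores(genotypes, phenotype)
print(z)
```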
In practice, this paper shows that other association test statistics that are $N(0, 1)$ under the null also result in FDR control. This includes commonly used tests in genetic association studies such as:
- generalized linear mixed effect model to account for sample relatedness
- saddle point approximation for extreme case-control imbalance
- meta-analysis that aggregates multiple studies.
If you have p-values, effect sizes, odds ratios, etc., it may be possible to convert them into Z-scores, for example by following the Notes on computing Z-scores section of this blog post.
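
As a rough illustration of such conversions, the Python sketch below uses standard Wald-type identities. These are assumptions of this example and are not taken from the blog post: the function names are hypothetical, p-values are assumed to be two-sided and to come from a test statistic that is $N(0, 1)$ under the null, and the effect direction is assumed to be reported with respect to the ALT (effect) allele.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def z_from_beta_se(beta, se):
    """Wald Z-score from an effect size and its standard error."""
    return beta / se


def z_from_odds_ratio(odds_ratio, se_log_or):
    """Wald Z-score from an odds ratio and the standard error of log(OR)."""
    return math.log(odds_ratio) / se_log_or


def z_from_pvalue(p, effect_sign):
    """Z-score from a two-sided p-value and the sign of the effect.

    p must lie strictly between 0 and 1 or equal 1; effect_sign can be any
    number whose sign gives the effect direction (e.g. the effect size).
    """
    # Using the lower tail keeps precision for very small p-values.
    magnitude = -_STD_NORMAL.inv_cdf(p / 2.0)
    return math.copysign(magnitude, effect_sign)


print(z_from_beta_se(0.12, 0.04))
print(z_from_odds_ratio(1.25, 0.08))
print(z_from_pvalue(0.05, -0.3))
```

If the effect direction in your summary statistics refers to the REF allele instead, flip the sign of the resulting Z-score so that it matches the ALT-as-effect-allele convention described above.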