Hi everyone, this is the final (summarizing) blog post for my Google Summer of Code project.
DNA testing is cheap, effective, and abundant. Anyone can just do it at specialized centers with a modest cost of \(\sim 200\) USD. Sometimes this may reveal the genetic basis for ongoing illnesses. Other times testing discover diseases that have a late-onset age, contributing to personalized medicine. How do scientists know what region of the genome contribute to a disease? We obtain this knowledge from Genome Wide Association Studies (GWAS). Analysis of GWAS datasets is what this project is about.
A traditional GWAS analysis would result in a graph looking like the above (generated by MendelGWAS.jl package).
Each colored dot represents a p-value for a single nucleotide polymorphism (SNP), and the dotted black lines represents the threshold for statistical significance. The p-value must pass this threshold for scientists to conclude a SNP is affecting the trait under consideration. Due to the problem of multiple testing, the more SNPs tested in a given experiment, the higher this threshold, and hence, this method admits a lot of false negatives. Those that don’t breach the threshold are effectively deemed not significant unless followed up by further studies. Enforcing such a threshold is not good science, because in the real world there is no such “cutoff”.
Therefore, we want a more effective way to interpret GWAS results. Since GWAS datasets can be huge (> 500,000 people and >8 million SNPs resulting in >100 GB datasets), this new method must also be extremely scalable. As one may guess, there is a trade-off between speed and quality of analysis, and balancing this is quite a challenge algorithmically.
We think the relatively novel algorithm iterative hard-thresholding (IHT) is the most suited for this job. It is one of the most scalable algorithms for subset selection, has great convergence guarantees, is better than LASSO & MCP for model selection (i.e. producing good quality models), simple to implement, and has low memory-footprint as desired in modern big-data analysis. We pushed through with this idea and implemented IHT in Julia. This post summarizes a few extra features that we added over this summer.
Note: The mathematics of IHT and how it is applied in genetics is explained in this paper
To run the new functions in IHT.jl
, execute the following in the Julia REPL:
using IHT
MendelIHT("gwas 1 Control.txt")
In Open Mendel, all analysis parameters are specified via the Control.txt
file. An example of a control file looks like the following:
#
# Input and Output files.
#
plink_input_basename = gwas 1 data
output_file = gwas 1 Output.txt
#
# Analysis parameters for IHT option.
#
data_type = genetic
predictors_per_group = 2
max_groups = 4
group_membership = 10_group_gwas1.dat
Note the genotype file must be stored in the PLINK binary format.
.bim
, .bed
, .fam
must be present.IHTResult
term as seen in the previous section.data_type = genetic
. This is preparing for future analysis on non-genetic data.Example files of each can be found here.
IHT results:
Compute time (sec): 2.060492161
Final loss: 1130.1137206545498
Iterations: 11
Max number of groups: 4
Max predictors/group: 2
IHT estimated 8 nonzero coefficients.
8*3 DataFrames.DataFrame
| Row | Group | Predictor | Estimated_b |
|-----|-------|-----------|-------------|
| 1 | 4 | 3544 | -0.0756442 |
| 2 | 4 | 3981 | 0.144222 |
| 3 | 7 | 6611 | 0.0833481 |
| 4 | 7 | 6627 | -0.0739322 |
| 5 | 8 | 7023 | 0.27132 |
| 6 | 8 | 7347 | -0.0660806 |
| 7 | 10 | 9467 | -0.0798 |
| 8 | 10 | 10001 | 0.140569 |