Gsoc Summary

15 minute read

Published:

GSoC Results and Summary

Hi everyone, this is the final (summarizing) blog post for my Google Summer of Code project.

Problem Motivation

DNA testing is cheap, effective, and abundant. Anyone can just do it at specialized centers with a modest cost of \(\sim 200\) USD. Sometimes this may reveal the genetic basis for ongoing illnesses. Other times testing discover diseases that have a late-onset age, contributing to personalized medicine. How do scientists know what region of the genome contribute to a disease? We obtain this knowledge from Genome Wide Association Studies (GWAS). Analysis of GWAS datasets is what this project is about.

Problem: traditional (p-value) analysis are not effective

A traditional GWAS analysis would result in a graph looking like the above (generated by MendelGWAS.jl package).

Each colored dot represents a p-value for a single nucleotide polymorphism (SNP), and the dotted black lines represents the threshold for statistical significance. The p-value must pass this threshold for scientists to conclude a SNP is affecting the trait under consideration. Due to the problem of multiple testing, the more SNPs tested in a given experiment, the higher this threshold, and hence, this method admits a lot of false negatives. Those that don’t breach the threshold are effectively deemed not significant unless followed up by further studies. Enforcing such a threshold is not good science, because in the real world there is no such “cutoff”.

Therefore, we want a more effective way to interpret GWAS results. Since GWAS datasets can be huge (> 500,000 people and >8 million SNPs resulting in >100 GB datasets), this new method must also be extremely scalable. As one may guess, there is a trade-off between speed and quality of analysis, and balancing this is quite a challenge algorithmically.

Solution: Replace (single-variate) p-value analysis with iterative hard-thresholding (IHT)

We think the relatively novel algorithm iterative hard-thresholding (IHT) is the most suited for this job. It is one of the most scalable algorithms for subset selection, has great convergence guarantees, is better than LASSO & MCP for model selection (i.e. producing good quality models), simple to implement, and has low memory-footprint as desired in modern big-data analysis. We pushed through with this idea and implemented IHT in Julia. This post summarizes a few extra features that we added over this summer.

Note: The mathematics of IHT and how it is applied in genetics is explained in this paper

Example Input/Output and usage:

To run the new functions in IHT.jl, execute the following in the Julia REPL:

using IHT
MendelIHT("gwas 1 Control.txt")

Preparing Input files

In Open Mendel, all analysis parameters are specified via the Control.txt file. An example of a control file looks like the following:

#
# Input and Output files.
#
plink_input_basename = gwas 1 data
output_file = gwas 1 Output.txt
#
# Analysis parameters for IHT option.
#
data_type = genetic
predictors_per_group = 2
max_groups = 4
group_membership = 10_group_gwas1.dat

Note the genotype file must be stored in the PLINK binary format.

  1. plink_input_basename: This is your PLINK compressed genetics data. All 3 files .bim, .bed, .fam must be present.
  2. output_file: Stores the IHTResult term as seen in the previous section.
  3. data_type: Currently only accepts data_type = genetic. This is preparing for future analysis on non-genetic data.
  4. predictors_per_group: Maximum number of active groups specified by user.
  5. max_groups: Maximum number of predictors per group, specified by user.
  6. group_membership: A text file indicating group membership, where there is the same number of lines equal to number of SNPs plus 1 (since the intercept needs a group-membership as well). Each line must be a positive integer. Currently all predictors (and the intercept) must belong to exactly 1 group.

Example files of each can be found here.

Sample output on “gwas 1 data”:

IHT results:

Compute time (sec):     2.060492161
Final loss:             1130.1137206545498
Iterations:             11
Max number of groups:   4
Max predictors/group:   2
IHT estimated 8 nonzero coefficients.
8*3 DataFrames.DataFrame
| Row | Group | Predictor | Estimated_b |
|-----|-------|-----------|-------------|
| 1   | 4     | 3544      | -0.0756442  |
| 2   | 4     | 3981      | 0.144222    |
| 3   | 7     | 6611      | 0.0833481   |
| 4   | 7     | 6627      | -0.0739322  |
| 5   | 8     | 7023      | 0.27132     |
| 6   | 8     | 7347      | -0.0660806  |
| 7   | 10    | 9467      | -0.0798     |
| 8   | 10    | 10001     | 0.140569    |

Superimposing these predictors onto the previous Manhattan plot, we would get: