# Gsoc Summary

15 minute read

Published:

GSoC Results and Summary

Hi everyone, this is the final (summarizing) blog post for my Google Summer of Code project.

## Problem Motivation

DNA testing is cheap, effective, and abundant. Anyone can just do it at specialized centers with a modest cost of $\sim 200$ USD. Sometimes this may reveal the genetic basis for ongoing illnesses. Other times testing discover diseases that have a late-onset age, contributing to personalized medicine. How do scientists know what region of the genome contribute to a disease? We obtain this knowledge from Genome Wide Association Studies (GWAS). Analysis of GWAS datasets is what this project is about.

#### Problem: traditional (p-value) analysis are not effective

A traditional GWAS analysis would result in a graph looking like the above (generated by MendelGWAS.jl package).

Each colored dot represents a p-value for a single nucleotide polymorphism (SNP), and the dotted black lines represents the threshold for statistical significance. The p-value must pass this threshold for scientists to conclude a SNP is affecting the trait under consideration. Due to the problem of multiple testing, the more SNPs tested in a given experiment, the higher this threshold, and hence, this method admits a lot of false negatives. Those that don’t breach the threshold are effectively deemed not significant unless followed up by further studies. Enforcing such a threshold is not good science, because in the real world there is no such “cutoff”.

Therefore, we want a more effective way to interpret GWAS results. Since GWAS datasets can be huge (> 500,000 people and >8 million SNPs resulting in >100 GB datasets), this new method must also be extremely scalable. As one may guess, there is a trade-off between speed and quality of analysis, and balancing this is quite a challenge algorithmically.

#### Solution: Replace (single-variate) p-value analysis with iterative hard-thresholding (IHT)

We think the relatively novel algorithm iterative hard-thresholding (IHT) is the most suited for this job. It is one of the most scalable algorithms for subset selection, has great convergence guarantees, is better than LASSO & MCP for model selection (i.e. producing good quality models), simple to implement, and has low memory-footprint as desired in modern big-data analysis. We pushed through with this idea and implemented IHT in Julia. This post summarizes a few extra features that we added over this summer.

Note: The mathematics of IHT and how it is applied in genetics is explained in this paper

## Example Input/Output and usage:

To run the new functions in IHT.jl, execute the following in the Julia REPL:

using IHT
MendelIHT("gwas 1 Control.txt")

#### Preparing Input files

In Open Mendel, all analysis parameters are specified via the Control.txt file. An example of a control file looks like the following:

#
# Input and Output files.
#
plink_input_basename = gwas 1 data
output_file = gwas 1 Output.txt
#
# Analysis parameters for IHT option.
#
data_type = genetic
predictors_per_group = 2
max_groups = 4
group_membership = 10_group_gwas1.dat

Note the genotype file must be stored in the PLINK binary format.

1. plink_input_basename: This is your PLINK compressed genetics data. All 3 files .bim, .bed, .fam must be present.
2. output_file: Stores the IHTResult term as seen in the previous section.
3. data_type: Currently only accepts data_type = genetic. This is preparing for future analysis on non-genetic data.
4. predictors_per_group: Maximum number of active groups specified by user.
5. max_groups: Maximum number of predictors per group, specified by user.
6. group_membership: A text file indicating group membership, where there is the same number of lines equal to number of SNPs plus 1 (since the intercept needs a group-membership as well). Each line must be a positive integer. Currently all predictors (and the intercept) must belong to exactly 1 group.

Example files of each can be found here.

#### Sample output on “gwas 1 data”:

IHT results:

Compute time (sec):     2.060492161
Final loss:             1130.1137206545498
Iterations:             11
Max number of groups:   4
Max predictors/group:   2
IHT estimated 8 nonzero coefficients.
8*3 DataFrames.DataFrame
| Row | Group | Predictor | Estimated_b |
|-----|-------|-----------|-------------|
| 1   | 4     | 3544      | -0.0756442  |
| 2   | 4     | 3981      | 0.144222    |
| 3   | 7     | 6611      | 0.0833481   |
| 4   | 7     | 6627      | -0.0739322  |
| 5   | 8     | 7023      | 0.27132     |
| 6   | 8     | 7347      | -0.0660806  |
| 7   | 10    | 9467      | -0.0798     |
| 8   | 10    | 10001     | 0.140569    |