keller_and_evans_lab:gscan


GSCAN, the GWAS & Sequencing Consortium of Alcohol and Nicotine use, is an international genetic association meta-analysis consortium. Our goal is to aggregate genetic association findings across scores of studies with millions of individuals. GSCAN is composed of three independent but related projects: 1) an exome chip meta-analysis of low-frequency non-synonymous variants, 2) a GWAS meta-analysis, and 3) a whole genome sequencing association meta-analysis.

This wiki page helps organize GSCAN efforts for the coordinating investigators. If you represent a study that may be interested in participating in GSCAN, you can find more information on our public website. Look on the right-hand side of the page to find analysis plans for each of the three projects.

Meetings

Regular conference calls are held and minutes are **available here**.

Other meeting materials from CO internal meetings are here:

16_--_db_ga_p_gf_g

GSCAN Exome Chip

Phenotype definitions and analysis plan

File Locations

Freeze 1. We concluded a pilot freeze of the exome chip project in 2015 and are writing up our results now. All of the summary statistics are on twins at /net/twins/svrieze/everything-else/wp/GSCAN/freeze1-25-Mar-2015.

Freeze 2. New studies that will be included in Freeze 2 are located on RC at /work/KellerLab/GSCAN/EXOME. Each folder in that directory is the name of a study and includes two subfolders, one for Phenotypes and one for Genotypes. Genotypes are split by chromosome to facilitate analyses.
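As a rough sketch of how the Freeze 2 layout described above could be sanity-checked, the following Python lists each study folder and reports which expected subfolders are absent. The exact subfolder names (Phenotypes, Genotypes) and the helper name check_study_layout are assumptions for illustration; the actual names on RC may differ.

```python
from pathlib import Path

# Assumed subfolder names; the real names on RC may differ.
EXPECTED_SUBFOLDERS = ("Phenotypes", "Genotypes")


def check_study_layout(root):
    """Map each study folder under `root` to the expected subfolders it lacks."""
    missing = {}
    for study in sorted(Path(root).iterdir()):
        if not study.is_dir():
            continue  # skip stray files sitting next to the study folders
        missing[study.name] = [
            name for name in EXPECTED_SUBFOLDERS if not (study / name).is_dir()
        ]
    return missing
```

Calling check_study_layout('/work/KellerLab/GSCAN/EXOME') on RC would return a dictionary keyed by study name, with an empty list for every study whose folder has both subfolders.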

GSCAN GWAS

Phenotype definitions and analysis plan

The analysis plan and phenotypes are described in files linked below (makes it easier to keep track of versioning!). Coding of phenotypes is described in the aptly-named “phenotype definitions” file whereas the genome-wide analysis plan is in the all-too-aptly-named “analysis plan” document. Please note that the phenotype definitions document only contains information on how to code the eight smoking/drinking phenotypes. File formats for those phenotypes, which many will recognize as standard pedigree formats, are included in the analysis plan. Everything else should be fairly straightforward.

Click here to find the GSCAN GWAS analysis plan (file_gscan_gwas_analysis_plan-v1_3.docx)

Click here to find the GSCAN GWAS phenotype definitions (file_gscan_gwas_phenotype_definitions-2-24-2016.pdf)

Coordination and organization

Progress, both internal and external, is tracked in **this Google Doc**. More specific progress on internal studies is **tracked here**.

Study contact info is tracked in **this Google Sheet**.

Studies available in dbGaP, along with accession numbers, etc. are tracked in **this Airtable**.

File locations

Study data to which we have direct access are located either on twins or RC. Twins data are organized in the folder /net/twins/svrieze/everything-else/wp/GSCAN/GWAS. Within this folder those studies to which we have raw data access are in the folder CU_Boulder_samples (for lack of a better name!). Summary stats generated on these samples are organized within summary_stats_generated_internally. Summary stats generated by outside groups and submitted for meta-analysis are organized within summary_stats_generated_externally.

On RC the organization is similar. Everything is located within the folder /work/KellerLab/GSCAN/GWAS. Study data to which we have raw data access are in the folder individual_level_study_data. Summary stats generated on these samples are organized within summary_stats_generated_internally. Summary stats generated by outside groups and submitted for meta-analysis are organized within summary_stats_generated_externally.
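The two layouts above can be summarized in a small lookup, which may be handy when scripts need to run on either system. This is only a sketch: the helper name gwas_dir and the short keys ('twins', 'rc', 'raw', 'internal', 'external') are made up for illustration, and it assumes the summary-statistics folders sit directly under each GWAS root.

```python
from pathlib import PurePosixPath

# Root folders and raw-data folder names as described in the text above.
GWAS_LAYOUT = {
    "twins": {
        "root": PurePosixPath("/net/twins/svrieze/everything-else/wp/GSCAN/GWAS"),
        "raw": "CU_Boulder_samples",
    },
    "rc": {
        "root": PurePosixPath("/work/KellerLab/GSCAN/GWAS"),
        "raw": "individual_level_study_data",
    },
}

# Summary-statistics folders use the same names on both systems.
SHARED = {
    "internal": "summary_stats_generated_internally",
    "external": "summary_stats_generated_externally",
}


def gwas_dir(host, kind):
    """Return the folder for `kind` ('raw', 'internal', 'external') on `host`."""
    layout = GWAS_LAYOUT[host]
    name = layout["raw"] if kind == "raw" else SHARED[kind]
    return layout["root"] / name
```

For example, gwas_dir('rc', 'raw') gives /work/KellerLab/GSCAN/GWAS/individual_level_study_data.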

[[gscan_db_ga_p]]

The studies included from dbGaP, and the process by which phenotypes and genotypes were constructed and merged, are outlined on the gscan_db_ga_p page.

GSCAN use of UKBiobank

More information about the files used for UKBiobank is here. In brief, we used the UK10K + 1kgp3 imputed VCFs provided by UKBiobank and added dosages with this Python script:

import argparse
import datetime
import os
import re
from subprocess import Popen, PIPE


def add_dosage(pair):
    # pair is (GT, GP); GP holds three comma-separated genotype probabilities.
    a, b = pair
    probs = b.split(b',')
    # Expected alternate-allele count: P(het) + 2 * P(hom alt).
    dose = float(probs[1]) + (float(probs[2]) * 2)
    return a + b':' + str(dose).encode('ascii') + b':' + b


def gziplines(fname):
    # Stream the decompressed VCF line by line (as bytes) through zcat.
    f = Popen(['zcat', fname], stdout=PIPE)
    for line in f.stdout:
        yield line


parser = argparse.ArgumentParser()
parser.add_argument('inputVCF', help='The path to the VCF')
args = parser.parse_args()

flag = False  # set once the extra header lines have been written

for line in gziplines(args.inputVCF):
    if line.startswith(b'#'):
        os.write(1, line.rstrip() + b'\n')
        if not flag:
            # Declare the new DS field right after the first header line.
            os.write(1, b'##FORMAT=<ID=DS,Number=1,Type=Float,Description="Genotype Dosages">\n')
            os.write(1, b'##Dosages added using the script add.dosages.subprocess.py at ' +
                     str(datetime.datetime.now()).encode('ascii') + b'\n')
            flag = True
    else:
        # Splitting on tabs and colons assumes the first eight columns contain no colons.
        elements = re.split(b'\t|:', line.rstrip())
        first8 = elements[:8]
        # elements[8:10] is the original FORMAT (GT:GP); the rest alternate GT, GP per sample.
        genotypes = elements[10:]
        form = b'GT:DS:GP'
        genotypes_split = zip(genotypes[::2], genotypes[1::2])
        try:
            dose_genos = [add_dosage(pair) for pair in genotypes_split]
        except (ValueError, IndexError):
            # Report the offending line and input file on stderr (as bytes), then re-raise.
            os.write(2, b'\n' + line + b'\n' + args.inputVCF.encode() + b'\n\n')
            raise
        os.write(1, b'\t'.join(first8) + b'\t' + form + b'\t' + b'\t'.join(dose_genos) + b'\n')
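To make the per-genotype step concrete, here is a small, self-contained sketch of what the script does to each sample field. It assumes, as the script does, that each field is GT:GP and that GP holds three comma-separated probabilities (homozygous reference, heterozygous, homozygous alternate), so the dosage is P(het) + 2 × P(hom alt). The sample values are made up.

```python
def add_dosage(pair):
    """Insert the expected alternate-allele dosage between GT and GP."""
    gt, gp = pair
    probs = gp.split(b",")
    dose = float(probs[1]) + 2 * float(probs[2])
    return gt + b":" + str(dose).encode("ascii") + b":" + gp


# Made-up genotype fields for two samples, already split into (GT, GP) pairs.
samples = [(b"0/1", b"0,1,0"), (b"1/1", b"0,0.25,0.75")]
for pair in samples:
    print(add_dosage(pair).decode("ascii"))
# prints:
# 0/1:1.0:0,1,0
# 1/1:1.75:0,0.25,0.75
```

The second sample shows why the dosage is useful: the hard call is 1/1, but the dosage of 1.75 carries the imputation uncertainty into downstream association tests.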

GSCAN Sequencing

TOPMed

Phenotype definitions and analysis plan

Phenotype definitions and analysis plans for the TOPMed studies are contained in this document (file_topmed_smoking_analysis_plan-v0_2.docx).

The list of dbGaP studies in TOPMed is in **this Airtable**.

keller_and_evans_lab/gscan.1473713996.txt.gz · Last modified: 2016/09/12 14:59 by scott