Differences

This shows you the differences between two versions of the page.

--- keller_and_evans_lab:gscan [2016/04/22 13:46]
scott /* GSCAN Sequencing */
+++ keller_and_evans_lab:gscan [2016/08/29 10:28]
scott
@@ Line 7: / Line 7: @@
 Regular conference calls are held and minutes are [[https://docs.google.com/document/d/1ZK9VIXxcej3lat_oD_oxPP0ajHwj8yX_FKaPh53svVo/edit#|**available here**.]]
+Other meeting materials from CO internal meetings are here:
+[[gscan_6:16:16_--_db_ga_p_gf_g]]
 ======= GSCAN Exome Chip =======
+====== Phenotype definitions and analysis plan ======
+{{file_gscan_exome_chip_analysis_plan-v2_2.pdf|Exome chip analysis plan and phenotype definitions.}}
@@ Line 20: / Line 29: @@
 ======= GSCAN GWAS =======
+====== Phenotype definitions and analysis plan ======
+The analysis plan and phenotypes are described in files linked below (makes it easier to keep track of versioning!). Coding of phenotypes is described in the aptly-named "phenotype definitions" file whereas the genome-wide analysis plan is in the all-too-aptly-named "analysis plan" document. Please note that the phenotype definitions document only contains information on how to code the eight smoking/drinking phenotypes. File formats for those phenotypes, which many will recognize as standard pedigree formats, are included in the analysis plan. Everything else should be fairly straightforward.
+{{file_gscan_gwas_analysis_plan-v1_2.pdf|Click here to find the GSCAN GWAS analysis plan.}}
+{{file_gscan_gwas_phenotype_definitions-2-24-2016.pdf|Click here to find the GSCAN GWAS phenotype definitions.}}
 ====== Coordination and organization ======
-All analyses, internal and external, are tracked in [[https://docs.google.com/document/d/1kWaY40n-bSURoLW7VcU9CFv08zVx360RHmxvL7DIreU/edit|**this Google Doc**]].
+Progress, internal and external, is tracked in [[https://docs.google.com/document/d/1kWaY40n-bSURoLW7VcU9CFv08zVx360RHmxvL7DIreU/edit|**this Google Doc**]]. More specific progress on internal studies is [[https://docs.google.com/spreadsheets/d/1canvCaAJW70LjSHidtvwrJgyDMa_ZlT7dvpzOsz6PNY/edit#gid=0|**tracked here**]].
 Study contact info is tracked in [[https://docs.google.com/spreadsheets/d/11apZaSyesNy4hl4MIgrKRYSASrwZM2iEJsuuFQByCfI/edit#gid=0|**this Google Sheet**]].
 Studies available in dbGaP, along with accession numbers, etc. are tracked in [[https://airtable.com/tblzZUtQWcZSlfjrA/viwhISDznphLfST8m|**this Airtable**]].
@@ Line 36: / Line 54: @@
 On RC the organization is similar. Everything is located within the folder /work/KellerLab/GSCAN/GWAS. Study data to which we have raw data access are in the folder //individual_level_study_data//. Summary stats generated on these samples are organized within //summary_stats_generated_internally//. Summary stats generated by outside groups and submitted for meta-analysis are organized within //summary_stats_generated_externally//.
+====== [[gscan_db_ga_p]] ======
+The studies included from dbGaP, and the process by which phenotypes and genotypes were constructed and merged, are outlined on the [[gscan_db_ga_p]] page.
+====== GSCAN use of UKBiobank ======
+More information about the files used for [[uk_biobank|UKBiobank is here]]. In brief, we used the UK10K + 1kgp3 imputed VCFs provided by UKBiobank and added in dosages with this Python script:
+import argparse, re, os, datetime
+from subprocess import Popen, PIPE
+def add_dosage(pair):
+        a, b = pair
+        probs = b.split(b',')
+        dose = float(probs[1]) + (float(probs[2]) * 2)
+        return a + b':' + str(dose).encode('ascii') + b':' + b
+def gziplines(fname):
+        # Stream decompressed lines (as bytes) from a gzipped file via zcat
+        f = Popen(['zcat', fname], stdout=PIPE)
+        for line in f.stdout:
+                yield line
+parser = argparse.ArgumentParser()
+parser.add_argument('inputVCF', help = 'The path to the VCF')
+args = parser.parse_args()
+flag = False
+for line in gziplines(args.inputVCF):
+        if line.startswith(b'#'):
+                os.write(1, line.rstrip() + b'\n')
+                if not flag:
+                        os.write(1, b'##FORMAT=<ID=DS,Number=1,Type=Float,Description="Genotype Dosages">\n')
+                        os.write(1, b'##Dosages added using the script add.dosages.subprocess.py at ' +
+                                str(datetime.datetime.now()).encode('ascii') + b'\n')
+                        flag = True
+        else:
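+                # Split on tabs and colons: assumes the first 8 columns contain no ':' and FORMAT is GT:GP,
+                # so elements[8:10] are the FORMAT keys and sample GT/GP values alternate from index 10.
+                elements = re.split(b'\t|:', line.rstrip())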
+                first8 = elements[:8]
+                genotypes = elements[10:]
+                form = b'GT:DS:GP'
+                genotypes_split = zip(genotypes[::2], genotypes[1::2])
+                try:
+                        dose_genos = [add_dosage(pair) for pair in genotypes_split]
+                except (ValueError, IndexError) as e:
+                        # line is bytes, so everything written to stderr (fd 2) must be bytes too
+                        os.write(2, b"\n" + line)
+                        os.write(2, line + b"\n" + args.inputVCF.encode() + b"\n\n")
+                        raise e
+                os.write(1, b'\t'.join(first8) + b'\t' + form + b'\t' + b'\t'.join(dose_genos) + b'\n')
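As a quick illustration of what the script does to each sample column, here is a minimal sketch of the dosage calculation on its own. It assumes the GP field holds three comma-separated genotype probabilities in the order P(0/0), P(0/1), P(1/1), as the script above does. The helper name dosage_from_gp and the example values are hypothetical and are not part of the original script.

```python
def dosage_from_gp(gp_field: bytes) -> float:
    """Expected alternate-allele count from a 'P(0/0),P(0/1),P(1/1)' GP field."""
    p_ref, p_het, p_alt = (float(x) for x in gp_field.split(b','))
    # Dosage = 1 * P(het) + 2 * P(hom alt); P(hom ref) contributes nothing.
    return p_het + 2.0 * p_alt


def add_dosage(gt: bytes, gp: bytes) -> bytes:
    """Rebuild one sample column in GT:DS:GP order."""
    dose = dosage_from_gp(gp)
    return gt + b':' + str(dose).encode('ascii') + b':' + gp


# Hypothetical sample column: genotype call plus its three probabilities.
sample = add_dosage(b'0/1', b'0.25,0.5,0.25')
print(sample)  # b'0/1:1.0:0.25,0.5,0.25'
```

The dosage is the expected number of alternate alleles, so it ranges from 0 (certain homozygous reference) to 2 (certain homozygous alternate).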
 ======= GSCAN Sequencing =======
+====== TOPMed ======
+Preliminary phenotype definitions for distributed analyses of TOPMed data are provided in this document.
 The list of dbGaP studies in TOPMed is in [[https://airtable.com/shryD6CMaM6R5sA3e/tblUKENXX5WmgNXQ8|**this Airtable**]].
