keller_and_evans_lab:meeting_notes

https://etherpad.net/p/ukb

09-06-2017

- location of data - what needs to happen

Communication Wiki @ https://ibg.colorado.edu/mediawiki/index.php/UK_Biobank

Overall study structure: the 500k participants fall into subcomponents:

  1. phenotyping changed over course of study (eg personality only available for a subset)
  2. QC datafile contains batch variable for every individual; see if it contains “BiLEVE”
  3. differences between online/in-person data
  4. two genotypings
  5. 50k genotyped on one of the chips, half of them heavy smokers
  6. two affy arrays, but there are significant differences in call rates for particular SNPs
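
A minimal sketch of the batch check in item 2, assuming the QC file can be read as a table with one row per individual. The column names `eid` and `batch` and the batch labels in the toy data are placeholders, not the real file layout.

```python
import csv
import io


def flag_bileve(rows, id_col="eid", batch_col="batch"):
    """Map each individual ID to True when its batch label mentions BiLEVE."""
    return {row[id_col]: "bileve" in row[batch_col].lower() for row in rows}


# Toy stand-in for the QC file; real column names and batch labels may differ.
toy_qc = io.StringIO(
    "eid,batch\n"
    "1001,UKBiLEVEAX_b1\n"
    "1002,Batch_b001\n"
    "1003,UKBiLEVEAX_b11\n"
)

flags = flag_bileve(csv.DictReader(toy_qc))
n_bileve = sum(flags.values())
print(flags, n_bileve)
```

The resulting flag can be carried along as a covariate or used to tabulate smoking status by array.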

- phenotyping confounded with SNP arrays and ascertainment for heavy smoking

  1. smoking also confounded with batch
  2. Phenotype data available as .csv and .Rdata file generated by provided R script; possible for SAS as well

!! rdata file is large and will exceed memory allocated to login nodes

  1. object is `bd`
  2. each “project”/request has its own file, as IDs have been randomized; `f.eid` is the randomized ID linking phen/gene data within requests; can establish bijection between eids across projects via plink sample files (ie eid1 ↔ pos ↔ eid2)
  3. f.50.0.0 : 0 is initial visit; 1: repeat assessment (~20k indiv); 2: imaging visit;
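
The eid1 ↔ pos ↔ eid2 idea in item 2 could be sketched as below. This assumes both projects' sample files list the same individuals in the same row order and that the ID sits in the second whitespace-separated column (as in a plink .fam file); the function names and toy IDs are hypothetical.

```python
def read_ids(fam_text):
    """Return the individual ID (second column) of each non-empty line, in file order."""
    return [line.split()[1] for line in fam_text.splitlines() if line.strip()]


def build_eid_map(ids_a, ids_b):
    """Pair IDs by row position; fail loudly if the result would not be a bijection."""
    if len(ids_a) != len(ids_b):
        raise ValueError("sample files differ in length; positions cannot be matched")
    if len(set(ids_a)) != len(ids_a) or len(set(ids_b)) != len(ids_b):
        raise ValueError("duplicate IDs; mapping would not be one-to-one")
    return dict(zip(ids_a, ids_b))


# Toy .fam-style contents for two projects (same people, different randomized eids).
fam_a = "5001 5001 0 0 1 -9\n5002 5002 0 0 2 -9\n5003 5003 0 0 1 -9\n"
fam_b = "9104 9104 0 0 1 -9\n9377 9377 0 0 2 -9\n9020 9020 0 0 1 -9\n"

a_to_b = build_eid_map(read_ids(fam_a), read_ids(fam_b))
b_to_a = {b: a for a, b in a_to_b.items()}
print(a_to_b["5002"], b_to_a["9020"])
```

Checking lengths and duplicates first guards against silently mispairing individuals if the two files are not actually row-aligned.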

- 50 (the first number in f.50.0.0 above) is the var id; details on phenotype page on wiki; can get ukb_field.tsv from data showcase to ID specific vars without loading entire data set
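
A rough sketch of pulling a few variables out of the .csv without loading everything. It assumes the header uses the `f.<var id>.<visit>.<array index>` naming from the notes (a raw export may name columns differently); `parse_field` and `read_fields` are hypothetical helpers.

```python
import csv
import io
import re

FIELD_RE = re.compile(r"^f\.(\d+)\.(\d+)\.(\d+)$")


def parse_field(name):
    """Split 'f.50.0.0' into (var id, visit, array index); None if the name does not match."""
    m = FIELD_RE.match(name)
    return tuple(int(g) for g in m.groups()) if m else None


def wanted(name, field_ids, id_col):
    """Keep the ID column plus any column whose var id is requested."""
    parsed = parse_field(name)
    return name == id_col or (parsed is not None and parsed[0] in field_ids)


def read_fields(handle, field_ids, id_col="f.eid"):
    """Stream rows one at a time, keeping only the requested columns."""
    reader = csv.reader(handle)
    header = next(reader)
    keep = [i for i, name in enumerate(header) if wanted(name, field_ids, id_col)]
    names = [header[i] for i in keep]
    for row in reader:
        yield dict(zip(names, (row[i] for i in keep)))


# Toy data; the values are made up.
toy_csv = "f.eid,f.31.0.0,f.50.0.0,f.50.1.0\n1001,1,170,171\n1002,0,160,\n"
rows = list(read_fields(io.StringIO(toy_csv), {50}))
print(parse_field("f.50.0.0"), rows)
```

Because rows are yielded one at a time, memory use depends on the number of columns kept rather than on the size of the file.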

Phenotypes available

  1. psychiatric sx data (now available) – need to submit additional application if interested in using (particularly suicide)
  2. wiki with list of fields out to email
  3. data on rc `/work/ibg/`, but some still in kellerlab; still waiting on data availability
  4. for storage, important to use generic bgen files

Data cleaning - need to ensure consistency across projects - genotype data

  1. vcf files
  2. ld-pruned relatedness files
  3. gargi will send out parameters (HWE, MAF cutoffs, etc) of cleaned files and location on directory (discussed previously by gargi and luke)

- QC

  1. raw data will remain available
  2. one set of files will have a bare min of QC (e.g., for imputed data: info score >= .3; indels removed; individs whose self-rep vs genetic sex differs excluded; singletons and doubletons excluded; the two phases of imputation have some error, so should use HRC snps, and luke removed snps present only in uk10k and 1kg)
  3. bed files / chrm done, saved into plink bin files → gzip vcf in progress but will take a long time; will likely die as wall time < compute time
  4. luke will just post QCd bgen files instead
  5. plink binaries lose uncertainty info present in bgen and vcf
  6. gargi has ID'd ethnic subsets: see /work/
  7. relatives identification only done for 350k indiv so far, but ukb provides kinship matrices up to 3rd degree for 500k; gargi will post script for IDing unrelated (currently removes both indiv, but will be modified to include only one of each pair; only for 350k currently)
  8. need a list of folks to exclude for genetically unrelated sample
  9. might recompute PCs only for caucasian subset; need a subset of ld pruned files for caucasian only, then calc PCs; gargi is going to take care of this, but luke will calc PCs
  10. HRC SNPs ~36k
  11. best practices: all derived data created in scratch (blanca: rcscratch..; summit … can't read between)
  12. use globus for LFT between scratches on blanca/summit
  13. procedure: init create in scratch; if cp to work/ explain in wiki QC, purpose, etc
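
One possible way to build the exclusion list while keeping one individual from each related pair, rather than removing both. This is a simple greedy sketch, not gargi's script: it assumes the kinship information has already been reduced to a list of related ID pairs, and `ids_to_exclude` is a hypothetical name.

```python
from collections import defaultdict


def ids_to_exclude(pairs):
    """Greedily drop the person with the most remaining relatives until no related pair is left."""
    relatives = defaultdict(set)
    for a, b in pairs:
        if a != b:
            relatives[a].add(b)
            relatives[b].add(a)

    excluded = set()
    while True:
        # (number of remaining relatives, ID); ties go to the larger ID so runs are repeatable.
        candidates = [(len(rel), person) for person, rel in relatives.items() if rel]
        if not candidates:
            break
        _, worst = max(candidates)
        for other in relatives[worst]:
            relatives[other].discard(worst)
        relatives[worst] = set()
        excluded.add(worst)
    return excluded


# Toy pairs: B is related to both A and C; D and E are related to each other.
toy_pairs = [("A", "B"), ("B", "C"), ("D", "E")]
excluded = ids_to_exclude(toy_pairs)
kept = {p for pair in toy_pairs for p in pair} - excluded
print(sorted(excluded), sorted(kept))
```

Dropping the most-connected person first tends to retain more individuals than removing both members of every pair. The rescan on each pass is quadratic in the worst case, so a full 500k run would likely want a priority queue instead.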

keller_and_evans_lab/meeting_notes.1504721669.txt.gz · Last modified: 2017/09/06 12:14 by richard_border