This is an old revision of the document!
09-06-2017
- location of data - what needs to happen
Communication Wiki @ https://ibg.colorado.edu/mediawiki/index.php/UK_Biobank
Overall study structure 500k px has subcomponents:
- phenotyping changed over course of study (eg personality only available for a subset)
- QC datafile contains batch variable for every individual- see if it contains “BiLEVE”
- differences between online/in person data
- two genotypings
- 50k on one of the chips where half heavy smokers
- two affy arrays but there are sig difs in call rates for particular SNPs
- phenotyping confounding with snp arrays and ascn for heavy smoking
- smoking also confounded with batch
- Phenotype data available as .csv and .Rdata file generated by provided R script; possible for SAS as well
!! rdata file is large and will excede memory allocated to login nodes
- object is `bd`
- each “project”/request has it's own file as IDs have been randomized; `f.eid` is randomized day linking phen/gene data within requests; can establish bijection between eids across projects via plink sample files (ie eid1 ↔ pos ↔ eid2)
- f.50.0.0 : 0 is initial visit; 1: reax (-20k indiv); 2: imaging visit;
- 50^ is var id - details on phenotype page on wiki - can get ukb_field.tsv from data showcase to id specific vars without loading entire data set
Phenotypes available
- psychiatric sx data (now available) – need to submit additional application if interested in using (particularly suicide)
- wiki with list of fields out to email
- data on rc `/work/ibg/` but some still in kellerlab still waiting on data availability
- for storage, important to use generic bgen files
Data cleaning - need to ensure consistency across projects - genotype data
- vcf files,
- ld-pruned relatedness files
- gargi will send out parameters (HWE, MAF cutoffs, etc) of cleaned files and location on directory (discussed previously by gargi and luke)
- QC - raw data will remain available - one set of files that have a bare min of QC (e.g., for imputed data, info score >=.3, removing indels, individs whose self-rep vs genetic sex differs excluded, singleton doubleton excld, two phases of imputation with some error–should use HRC snps, so luke removed uk10k and 1kg only snps) - bed files / chrm done - saved into plink bin files –» gzip vcf in progress but will take a long time; will likely die as wall time < compute time - luke will just post QCd bgen files instead - plink binaries lose uncertainty info present in bgen and vcf - gargi has ID'd ethnic subsets: see /work/ - relatives identification only done for 350k indiv so far but ukb provides kinship matrices up to 3rd degree for 500k; gargi will post script for IDing unrelated (currently removes both indiv, but will be modified to include only one of each pair; only for 350k currently) - need a list of folks to exclude for genetically unrelated sample - might recompute PCs only for caucasian subset - - need a subset of ld pruned files for caucasian only, then calc PCs; gargi is going to take care of this; but luke will calc PCs - HRC SNPs ~36k - best practices: all derrived data created in scratch (blanca: rcscratch..; summit … can't read between) - use globus for LFT between scratches on blanca/summit - procedure: init create in scratch; if cp to work/ explain in wiki QC, purpose, etc