09-06-2017
- location of data - what needs to happen
Communication Wiki @ https://ibg.colorado.edu/mediawiki/index.php/UK_Biobank
Overall study structure 500k px has subcomponents:
- phenotyping is confounded with SNP array, plus ascertainment for heavy smoking
!! rdata file is large and will exceed the memory allocated to login nodes
- 50^ is var id; details are on the phenotype page on the wiki. ukb_field.tsv from the Data Showcase can be used to identify specific variables without loading the entire data set
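A minimal sketch of looking up variables in the field listing without touching the full data set. The column names `field_id` and `title` are assumptions about the layout of ukb_field.tsv and should be checked against the file header; `find_fields` is a hypothetical helper, not an existing tool.

```python
import csv


def find_fields(tsv_path, keyword, id_col="field_id", title_col="title"):
    """Return (id, title) pairs whose title contains keyword (case-insensitive).

    id_col and title_col are ASSUMED column names; check the header of the
    actual ukb_field.tsv before relying on them.
    """
    keyword = keyword.lower()
    matches = []
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # The field listing is small, so streaming it row by row is cheap
        # and safe to run on a login node.
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            title = row.get(title_col) or ""
            if keyword in title.lower():
                matches.append((row.get(id_col), title))
    return matches
```

Something like `find_fields("ukb_field.tsv", "smoking")` would then list candidate variable IDs to pull from the main data file.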
Phenotypes available
Data cleaning - need to ensure consistency across projects - genotype data
- QC: raw data will remain available, alongside one set of files that have a bare minimum of QC. For imputed data: info score >= 0.3, indels removed, individuals whose self-reported and genetic sex differ excluded, singletons and doubletons excluded. The two phases of imputation contain some error, so HRC SNPs should be used; Luke removed the SNPs found only in UK10K and 1KG.
- BED files per chromosome are done and saved as PLINK binary files. Conversion to gzipped VCF is in progress but will take a long time and will likely die because the wall-time limit is shorter than the compute time; Luke will just post QC'd BGEN files instead.
- PLINK binaries lose the uncertainty information present in BGEN and VCF.
- Gargi has identified ethnic subsets: see /work/
- Relatives identification has only been done for 350k individuals so far, but UKB provides kinship matrices up to 3rd degree for all 500k. Gargi will post a script for identifying unrelated individuals (it currently removes both individuals of a pair, but will be modified to keep one of each pair; only for the 350k currently).
- Need a list of people to exclude to get a genetically unrelated sample.
- Might recompute PCs for the Caucasian subset only. This needs a subset of LD-pruned files for Caucasians only, then the PC calculation; Gargi is going to take care of this, but Luke will calculate the PCs.
- HRC SNPs ~36k
- Best practices: all derived data is created in scratch (blanca: rcscratch..; summit …; can't read between the two).
- Use Globus for LFT between scratches on blanca/summit.
- Procedure: initially create in scratch; if copied to work/, explain the QC, purpose, etc. in the wiki.
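The "keep one of each related pair" idea can be sketched as below. This is not Gargi's script; it is an illustrative greedy approach that assumes the related pairs have already been extracted as (id_a, id_b) tuples, for example from the kinship information. The function name and input format are hypothetical.

```python
from collections import defaultdict


def individuals_to_exclude(related_pairs):
    """Greedy exclusion list such that no related pair remains in the sample.

    related_pairs: iterable of (id_a, id_b) tuples (ASSUMED input format).
    Repeatedly drops the individual with the most remaining relatives, so
    for a simple pair only one member is excluded rather than both.
    """
    neighbours = defaultdict(set)
    for a, b in related_pairs:
        if a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)

    excluded = set()
    while True:
        # Individuals that still have at least one relative in the sample.
        candidates = {i: rel for i, rel in neighbours.items() if rel}
        if not candidates:
            break
        # Drop the most connected individual; ties are broken by ID so the
        # result is reproducible.
        worst = max(candidates, key=lambda i: (len(candidates[i]), str(i)))
        for other in neighbours[worst]:
            neighbours[other].discard(worst)
        neighbours[worst] = set()
        excluded.add(worst)
    return excluded
```

The returned set is the list of people to exclude for a genetically unrelated sample. The loop rescans all candidates on every pass, which is fine for a sketch but would need a smarter data structure for the full 500k.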