Genotype file
The genotype file is located here: https://drive.google.com/file/d/0B608ps4vtHUaWFNOWXJqZ0tDMXc/view?usp=sharing
Lab 1 Assignment
### Lab 1 assignment
### Assigned: 4/13/2017
### Due: 4/20/2017 at the beginning of class. Late assignments (even by 5 minutes)
### will not be accepted!
###
### Note: all questions should be answered with respect to the
### genotypes from hu916767_20170324191934.txt

### Question 1 (4 points)
### a) What does “positive strand” mean in the header of the genotype file?

### Question 2 (2 points)
### a) Provide a command that I can run to extract only the chromosome column of the genotype file.

### Question 3 (2 points)
### a) Provide a command that I can run that extracts only the chromosome column of the genotype
### file, and pipes it to “sort -u”.
### b) Provide the output of that command and tell me in your own words what the command did.

### Question 4 (6 points)
### a) Give me a command that I can run that will extract the most
### commonly studied SNP associated with the flushing response discussed in class.
### b) Interpret this individual's risk for alcoholism, the flushing response,
### esophageal cancer, and their response to Disulfiram.
###
### Note: you will need to use your web searching abilities!

### Question 5 (4 points)
### Find out more about SNP rs72921001 in dbSNP
### a) What is the minor allele in individuals of European ancestry? What is the MAF?
### b) What is the allele frequency of this allele in individuals of African ancestry?
### c) Is this SNP associated with any phenotypic effects?
### d) Describe the geographical distribution of allele frequency for this variant using
### the website http://popgen.uchicago.edu/ggv/
Example commands
### PSYCH 3102 Behavioral Genetics
### Lab 1 – downloading and exploring a genome
### Author: Scott Vrieze
######################################
### STEP 1, get a terminal working ###
######################################
###
### If you have a PC running Windows, install CYGWIN
### https://www.cygwin.com/
### CYGWIN will automatically install in C:/cygwin/
###
### On a Mac or Linux computer, open up a terminal
###

#################################################
### STEP 2, make a directory in which to work ###
#################################################
### On a Windows PC, open up Cygwin.
###
### On a Mac or Linux computer, open a terminal.
###
### A window with text will open up. This box is BY FAR the most
### powerful thing on your computer. The trick is learning how to use
### it.
###
### Let's practice!
### Let's see where you are by listing the contents of the directory
ls

### Let's create a new directory called “bg”, where we can work.
mkdir bg

### Check that you created that directory by running ls again
ls

## Now, move into the bg directory
cd bg

## You should be in the bg directory. Run ls to see what's in here
ls

## The result should come up blank, because there's nothing in this
## directory yet! Let's find something interesting to put in here.
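A small variation on the steps above, offered as a sketch rather than as part of the lab: `pwd` prints the directory you are currently in, and `mkdir -p` does not complain if the directory already exists. The scratch directory created with `mktemp -d` is an assumption of this example, used so that nothing in your home folder changes.

```shell
# Sketch: practice in a throwaway scratch directory (an assumption of this
# example; the lab itself works in your home directory).
scratch=$(mktemp -d)
cd "$scratch"

pwd            # print the full path of the current directory
mkdir -p bg    # -p: no error if bg already exists
cd bg
pwd            # the path now ends in /bg
ls -a          # -a also lists hidden entries; an empty directory shows only . and ..
```

Running `pwd` before and after `cd` is a quick way to confirm that you ended up where you intended.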
############################################
### STEP 3, download the practice genome ###
############################################
###
### The dataset is in our Google Drive folder, with the following
### direct link:
### https://drive.google.com/file/d/0B608ps4vtHUaWFNOWXJqZ0tDMXc/
###
### Download the file and then move the file to, on a Windows computer:
### C:/cygwin/home/<username>/bg/
### On a Mac:
### /Users/<username>/bg/
### On Linux:
### /home/<username>/bg/
### Let's check and see if you got it right. Open a terminal and run
cd bg

### Then list the contents of the directory
ls
### You should see something like the following output:
### $ ls
### hu916767_20170324191934.txt
###
### If that's what you saw, congratulations, you put the file in the
### right place!
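Because the home directory lives in a different place on each system, one hedged way to check the download is to go through `$HOME`, which the shell expands to your home directory on Cygwin, macOS, and Linux alike. The helper name `check_genome` is made up for this sketch and is not a standard command.

```shell
# Sketch: report whether the genotype file is in a given directory.
# check_genome is a hypothetical helper name, not a standard command.
check_genome() {
  dir=$1
  if [ -f "$dir/hu916767_20170324191934.txt" ]; then
    echo "found"
  else
    echo "missing"
  fi
}

# Prints "found" if the file is in the bg directory under your home, otherwise "missing".
check_genome "$HOME/bg"
```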
################################################
### STEP 4, look at the contents of the file ###
################################################
###
### In your terminal, go to the bg folder, then type
less hu916767_20170324191934.txt

### That should open the file in your terminal. You can scroll up or
### down using the arrow keys. To scroll faster you can press the
### space bar. To close the “less” session, press “q”.
### What if we just want to look at the first few lines?
head hu916767_20170324191934.txt

### The last few lines?
tail hu916767_20170324191934.txt
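By default `head` and `tail` each print 10 lines. As a sketch, the `-n` option sets a different number. The toy file below is made up for illustration (fake IDs, positions, and genotypes) and only assumes a layout of `#` header lines followed by tab-separated rows.

```shell
# Toy stand-in for the genotype file: made-up rows, not real data.
toy=$(mktemp)
printf '# toy header line\n' > "$toy"
printf 'rsFAKE1\t1\t100\tAA\nrsFAKE2\t1\t200\tAG\nrsFAKE3\t2\t300\tDD\n' >> "$toy"

head -n 2 "$toy"   # first 2 lines: the header line and rsFAKE1
tail -n 1 "$toy"   # last line only: rsFAKE3
```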
### How many variants are there? Try “wc -l”. This will give you the
### number of lines in the file, which is approximately the number of
### variants (the header lines are counted too).
wc -l hu916767_20170324191934.txt
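The line count is only approximate because `wc -l` counts every line, header included. Assuming the header lines start with `#` (an assumption about the file layout, so check it with `head` first), a sketch that counts only the data rows looks like this. The toy file is made up for illustration.

```shell
# Toy stand-in for the genotype file: 2 header lines + 3 made-up data rows.
toy=$(mktemp)
printf '# toy header line 1\n# toy header line 2\n' > "$toy"
printf 'rsFAKE1\t1\t100\tAA\nrsFAKE2\t1\t200\tAG\nrsFAKE3\t2\t300\tDD\n' >> "$toy"

wc -l < "$toy"          # all lines, header included: 5
grep -vc '^#' "$toy"    # -v inverts the match, -c counts lines: 3 data rows
```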
### That's a lot of variants. How can I extract a certain variant,
### without scrolling through the whole file?
grep 'rs9430244' hu916767_20170324191934.txt

### Try another one
grep 'rs8176719' hu916767_20170324191934.txt

### Huh, what does DD mean? I thought nucleotides could be A, C, T, or
### G. Also – Google that variant. What phenotype does it affect? What
### phenotype does this person have?

### We can also grab both variants, if we wanted to
grep -E 'rs8176719|rs9430244' hu916767_20170324191934.txt
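One caveat with plain `grep`: the pattern matches anywhere in a line, so an rsID also matches any longer rsID that begins with the same characters. The `-w` option (available in GNU and BSD grep) restricts matches to whole words. The sketch below uses a toy file with made-up IDs to show the difference.

```shell
# Toy stand-in for the genotype file: made-up IDs, positions, and genotypes.
toy=$(mktemp)
printf 'rsFAKE1\t1\t100\tAA\nrsFAKE12\t1\t200\tAG\nrsFAKE3\t2\t300\tDD\n' > "$toy"

grep -c 'rsFAKE1' "$toy"            # 2: matches rsFAKE1 and also rsFAKE12
grep -cw 'rsFAKE1' "$toy"           # 1: whole-word match only
grep -wE 'rsFAKE1|rsFAKE3' "$toy"   # two IDs at once, each matched as a whole word
```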
### What if we have a variant where we don't know the rsID,
### but only the chromosome, position, genome build, and alleles?
### Well, to get chromosome 1, position 11850750, we can do this:
grep -E '\s1\s11850750\s' hu916767_20170324191934.txt
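The `\s` shorthand is an extension that not every `grep` is guaranteed to understand. A more portable sketch uses the POSIX class `[[:space:]]` and also requires the line to begin with an ID, so the chromosome is matched in the second column only. The toy file, its IDs, and its positions are made up for illustration.

```shell
# Toy stand-in for the genotype file: made-up IDs, positions, and genotypes.
toy=$(mktemp)
printf 'rsFAKE1\t1\t100\tAA\nrsFAKE2\t1\t200\tAG\nrsFAKE3\t11\t200\tCC\n' > "$toy"

# Column 1 (anything without whitespace), then chromosome 1, then position 200.
# Only the rsFAKE2 row matches; the chromosome 11 row does not.
grep -E '^[^[:space:]]+[[:space:]]1[[:space:]]200[[:space:]]' "$toy"
```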
######################################
### STEP 5, join commands together ###
######################################
###
### Now we'll do something called “piping”. Piping allows you to run a
### command on a file, then send the output of that command to a new
### command, and possibly on to yet another command. Let's give it a try.

### We saw above that “grep” allows you to extract all lines that
### match a certain character string. We also saw that “wc -l” counts
### the number of lines. Can we combine these two commands?
###
### Let's extract all the variants that are “GG”
grep 'GG' hu916767_20170324191934.txt
### OK, that didn't work so well, the output just kept on feeding our
### screen. Instead, we'll use a pipe to send that output to wc -l
grep 'GG' hu916767_20170324191934.txt | wc -l
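Two related options, sketched on a made-up toy file: piping into `head` shows only the first few matches instead of flooding the screen, and `grep -c` counts matching lines directly, which gives the same number as piping `grep` into `wc -l`.

```shell
# Toy stand-in for the genotype file: made-up IDs, positions, and genotypes.
toy=$(mktemp)
printf 'rsFAKE1\t1\t100\tGG\nrsFAKE2\t1\t200\tAG\n' > "$toy"
printf 'rsFAKE3\t2\t300\tGG\nrsFAKE4\t2\t400\tGG\n' >> "$toy"

grep 'GG' "$toy" | head -n 2   # show only the first 2 matching lines
grep 'GG' "$toy" | wc -l       # count the matching lines: 3
grep -c 'GG' "$toy"            # same count (3) without a pipe
```

Piping into `less` instead of `head` works the same way when you want to scroll through all the matches.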
### How about all SNPs that are homozygous?
grep -E 'GG|CC|TT|AA' hu916767_20170324191934.txt | wc -l

### How about all the variants that are homozygous?
grep -E 'GG|CC|TT|AA|II|DD' hu916767_20170324191934.txt | wc -l

### How many indels are there?
grep -E 'II|DD|ID|DI' hu916767_20170324191934.txt | wc -l

### Now, a little trickier. How many variants are on chromosome 1? Is
### this command going to work? Why or why not?
grep -E '1' hu916767_20170324191934.txt
Useful databases
Geography of Genetic Variants Browser: Interactively browse geographic distribution of genetic variants. Can compare to 1000 Genomes, ExAC, and POPRES (Euro-centric). http://popgen.uchicago.edu/ggv/?data=%221000genomes%22&chr=11&pos=6889648
dbSNP: A fairly exhaustive database of SNPs in humans. https://www.ncbi.nlm.nih.gov/projects/SNP/
ExAC: A good source for exonic variants. Very user friendly. http://exac.broadinstitute.org/
