
Genotype file

The genotype file is located here: https://drive.google.com/file/d/0B608ps4vtHUaWFNOWXJqZ0tDMXc/view?usp=sharing

Lab 1 Assignment

### Lab 1 assignment
### Assigned: 4/13/2017
### Due: 4/20/2017 at the beginning of class. Late assignments (even by 5 minutes)
### will not be accepted!
###
### Note: all questions should be answered with respect to the
### genotypes from hu916767_20170324191934.txt

### Question 1 (4 points)
### a) What does “positive strand” mean in the header of the genotype file?

### Question 2 (2 points)
### a) Provide a command that I can run to extract only the chromosome column of the genotype file.

### Question 3 (2 points)
### a) Provide a command that I can run that extracts only the chromosome column of the genotype
### file, and pipes it to “sort -u”.
### b) Provide the output of that command and tell me in your own words what the command did.

### Question 4 (6 points)
### a) Give me a command that I can run that will extract the most
### commonly studied SNP associated with the flushing response discussed in class.
### b) Interpret this individual's risk for alcoholism, the flushing response,
### esophageal cancer, and their response to Disulfiram.
###
### Note: you will need to use your web searching abilities!

### Question 5 (4 points)
### Find out more about SNP rs72921001 in dbSNP
### a) What is the minor allele in individuals of European ancestry? What is the MAF?
### b) What is the allele frequency of this allele in individuals of African ancestry?
### c) Is this SNP associated with any phenotypic effects?
### d) Describe the geographical distribution of allele frequency for this variant using
### the website http://popgen.uchicago.edu/ggv/

Example commands

### PSYCH 3102 Behavioral Genetics
### Lab 1 – downloading and exploring a genome
### Author: Scott Vrieze

######################################
### STEP 1, get a terminal working ###
######################################
###
### If you have a PC running Windows, install CYGWIN
### https://www.cygwin.com/
### CYGWIN will automatically install in C:/cygwin/
###
### On a Mac or Linux computer, open up a terminal
###

#################################################
### STEP 2, make a directory in which to work ###
#################################################
### On a Windows PC, open up Cygwin.
###
### On a Mac or Linux computer, open a terminal.
###
### A window with text will open up. This box is BY FAR the most
### powerful thing on your computer. The trick is learning how to use
### it.
###
### Let's practice!

### Let's see where you are by listing the contents of the directory
ls

### Let's create a new directory called “bg”, where we can work.
mkdir bg

### Check that you created that directory by running ls again
ls

## Now, move into the bg directory
cd bg

## You should be in the bg directory. Run ls to see what's in here
ls

## The result should come up blank, because there's nothing in this
## directory yet! Let's find something interesting to put in here.

############################################
### STEP 3, download the practice genome ###
############################################
###
### The dataset is in our Google Drive folder, with the following
### direct link:
### https://drive.google.com/file/d/0B608ps4vtHUaWFNOWXJqZ0tDMXc/
###
### Download the file and then move the file to, on a Windows computer:
### C:/cygwin/home/<username>/bg/
### On a Mac:
### /Users/<username>/bg/
### On Linux:
### /home/<username>/bg/

### Let's check and see if you got it right. Open a terminal and run
cd bg

### Then list the contents of the directory
ls
### You should see something like the following output:
### $ ls
### hu916767_20170324191934.txt
###
### If that's what you saw, congratulations, you put the file in the
### right place!

################################################
### STEP 4, look at the contents of the file ###
################################################
###
### In your terminal, go to the bg folder, then type
less hu916767_20170324191934.txt

### That should open the file in your terminal. You can scroll up or
### down using the arrow keys. To scroll faster you can press the
### space bar. To close the “less” session, press “q”.

### What if we just want to look at the first few lines?
head hu916767_20170324191934.txt

### The last few lines?
tail hu916767_20170324191934.txt

### How many variants are there? Try “wc -l”. This will give you the
### number of lines in the file, which is approx the number of
### variants.
wc -l hu916767_20170324191934.txt
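The count from `wc -l` is only approximate because it includes any header lines. As a rough sketch, assuming the genotype file is tab-separated and its header lines begin with `#` (worth confirming with `head`), the header can be excluded before counting. The sample file built here is a made-up stand-in with hypothetical rsIDs, not the real genotype file.

```shell
# Build a tiny, made-up sample file (hypothetical rsIDs and positions).
tmpdir=$(mktemp -d)
sample="$tmpdir/sample_genotypes.txt"
printf '# hypothetical header comment\n' > "$sample"
printf '# rsid\tchromosome\tposition\tgenotype\n' >> "$sample"
printf 'rs_example_1\t1\t1000\tAA\n' >> "$sample"
printf 'rs_example_2\t1\t2000\tAG\n' >> "$sample"
printf 'rs_example_3\t2\t3000\tDD\n' >> "$sample"

# Count every line, header included.
total_lines=$(wc -l < "$sample")

# Count only the lines that do NOT start with "#".
variant_lines=$(grep -vc '^#' "$sample")

echo "all lines: $total_lines"
echo "variant lines: $variant_lines"
```

To try this on the real file, replace `"$sample"` with `hu916767_20170324191934.txt`.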

### That's a lot of variants. How can I extract a certain variant,
### without scrolling through the whole file?
grep 'rs9430244' hu916767_20170324191934.txt

### Try another one
grep 'rs8176719' hu916767_20170324191934.txt

### Huh, what does DD mean? I thought nucleotides could be A, C, T, or
### G. Also – Google that variant. What phenotype does it affect? What
### phenotype does this person have?

### We can also grab both variants, if we wanted to
grep -E 'rs8176719|rs9430244' hu916767_20170324191934.txt
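One caveat when searching by rsID: `grep` matches substrings, so a short rsID also matches any longer rsID that begins with it. Below is a minimal sketch of the `-w` (whole-word) option, using a made-up sample file whose rsIDs are hypothetical.

```shell
# Made-up sample: three hypothetical rsIDs that share the prefix "rs123".
tmpdir=$(mktemp -d)
sample="$tmpdir/sample_genotypes.txt"
printf 'rs123\t1\t1000\tAA\n' > "$sample"
printf 'rs1234\t1\t2000\tAG\n' >> "$sample"
printf 'rs12345\t2\t3000\tCC\n' >> "$sample"

# Substring match: every line containing "rs123" is counted.
loose=$(grep -c 'rs123' "$sample")

# Whole-word match: only the line whose rsID is exactly rs123.
strict=$(grep -cw 'rs123' "$sample")

echo "without -w: $loose"
echo "with -w: $strict"
```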

### What if we have a variant where we don't know the rsID,
### but only the chromosome, position, genome build, and alleles?
### Well, to get chromosome 1, position 11850750, we can do this:
grep -E '\s1\s11850750\s' hu916767_20170324191934.txt
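The `\s` shorthand in that pattern is a GNU extension rather than part of standard `grep -E` syntax. An alternative sketch uses `awk` to compare whole columns, assuming the file is tab-separated with columns in the order rsID, chromosome, position, genotype (an assumption to verify with `head`). The sample rows and rsIDs are hypothetical and include two near-misses to show that only an exact match is returned.

```shell
# Made-up sample with one exact match and two near-misses.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample_genotypes.txt"
printf '# rsid\tchromosome\tposition\tgenotype\n' > "$sample"
printf 'rs_example_a\t1\t11850750\tAG\n' >> "$sample"
printf 'rs_example_b\t11\t11850750\tCC\n' >> "$sample"   # wrong chromosome
printf 'rs_example_c\t1\t118507501\tTT\n' >> "$sample"   # longer position

# Keep only rows where column 2 is exactly "1" and column 3 is exactly "11850750".
match=$(awk -F'\t' '$2 == "1" && $3 == "11850750"' "$sample")

echo "$match"
```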

######################################
### STEP 5, join commands together ###
######################################
###
### Now we'll do something called “piping”. Piping allows you to run a
### command on a file, then send the output of that command to a new
### command, and possibly on to a new command. Let's give it a try.

### We saw above that “grep” allows you to extract all lines that
### match a certain character string. We also saw that “wc -l” counts
### the number of lines. Can we combine these two commands?
###
### Let's extract all the variants that are “GG”
grep 'GG' hu916767_20170324191934.txt

### OK, that didn't work so well, the output just kept on feeding our
### screen. Instead, we'll use a pipe to send that output to wc -l
grep 'GG' hu916767_20170324191934.txt | wc -l
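Another way to tame long output is to pipe it to `head`, which keeps only the first few lines. A small sketch, using a made-up sample file with hypothetical rsIDs:

```shell
# Made-up sample: eight hypothetical variants, all with genotype GG.
tmpdir=$(mktemp -d)
sample="$tmpdir/sample_genotypes.txt"
for i in 1 2 3 4 5 6 7 8; do
    printf 'rs_example_%s\t1\t%s000\tGG\n' "$i" "$i"
done > "$sample"

# Send the grep output through head to keep only the first 3 matches.
preview=$(grep 'GG' "$sample" | head -n 3)
echo "$preview"

# Pipes can be chained further: count how many lines were kept.
shown=$(printf '%s\n' "$preview" | wc -l)
echo "lines shown: $shown"
```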

### How about all SNPs that are homozygous?
grep -E 'GG|CC|TT|AA' hu916767_20170324191934.txt | wc -l

### How about all the variants that are homozygous?
grep -E 'GG|CC|TT|AA|II|DD' hu916767_20170324191934.txt | wc -l

### How many indels are there?
grep -E 'II|DD|ID|DI' hu916767_20170324191934.txt | wc -l
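Keep in mind that these `grep` patterns match the letters anywhere on a line, not only in the genotype column. A hedged alternative, assuming a tab-separated file whose fourth column is the genotype and whose header lines start with `#`, tests that column directly. The sample rows are hypothetical, including the `--` entry standing in for a missing call.

```shell
# Made-up sample; the header deliberately contains the letters "GG".
tmpdir=$(mktemp -d)
sample="$tmpdir/sample_genotypes.txt"
printf '# rsid\tchromosome\tposition\tgenotype (GG appears in this header)\n' > "$sample"
printf 'rs_example_1\t1\t1000\tAA\n' >> "$sample"
printf 'rs_example_2\t1\t2000\tAG\n' >> "$sample"
printf 'rs_example_3\t2\t3000\tDD\n' >> "$sample"
printf 'rs_example_4\t2\t4000\t--\n' >> "$sample"
printf 'rs_example_5\t3\t5000\tGG\n' >> "$sample"
printf 'rs_example_6\t3\t6000\tII\n' >> "$sample"
printf 'rs_example_7\t4\t7000\tCT\n' >> "$sample"

# Homozygous: skip header lines, require two allele letters in column 4,
# and require that both letters are the same.
homozygous=$(awk -F'\t' '
    !/^#/ && $4 ~ /^[ACGTID][ACGTID]$/ && substr($4, 1, 1) == substr($4, 2, 1) { n++ }
    END { print n + 0 }
' "$sample")

# For comparison, the whole-line grep also counts the header line.
grep_count=$(grep -cE 'GG|CC|TT|AA|II|DD' "$sample")

echo "awk (genotype column only): $homozygous"
echo "grep (anywhere on the line): $grep_count"
```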

### Now, a little trickier. How many variants are on chromosome 1? Is
### this command going to work? Why or why not?
grep -E '1' hu916767_20170324191934.txt

Useful databases

Geography of Genetic Variants Browser
Interactively browse geographic distribution of genetic variants. Can compare to 1000 Genomes, ExAC, and POPRES (Euro-centric).
http://popgen.uchicago.edu/ggv/?data=%221000genomes%22&chr=11&pos=6889648

dbSNP
A fairly exhaustive database of SNPs in humans.
https://www.ncbi.nlm.nih.gov/projects/SNP/

ExAC
A good source for exonic variants. Very user friendly.
http://exac.broadinstitute.org/
