Genotype file

The genotype file is located here: https://drive.google.com/file/d/0B608ps4vtHUaWFNOWXJqZ0tDMXc/view?usp=sharing

Lab 1 Assignment

### Lab 1 assignment
### Assigned: 4/13/2017
### Due: 4/20/2017 at the beginning of class. Late assignments (even by 5 minutes)
### will not be accepted!
###
### Note: all questions should be answered with respect to the
### genotypes from hu916767_20170324191934.txt

### Question 1 (4 points)
### a) What does “positive strand” mean in the header of the genotype file?

### Question 2 (2 points)
### a) Provide a command that I can run to extract only the chromosome column of the genotype file.

### Question 3 (2 points)
### a) Provide a command that I can run that extracts only the chromosome column of the genotype
### file, and pipes it to “sort -u”.
### b) Provide the output of that command and tell me in your own words what the command did.

### Question 4 (6 points)
### a) Give me a command that I can run that will extract the most
### commonly studied SNP associated with the flushing response discussed in class.
### b) Interpret this individual's risk for alcoholism, the flushing response,
### esophageal cancer, and their response to Disulfiram.
###
### Note: you will need to use your web searching abilities!

### Question 5 (4 points)
### Find out more about SNP rs72921001 in dbSNP
### a) What is the minor allele in individuals of European ancestry? What is the MAF?
### b) What is the allele frequency of this allele in individuals of African ancestry?
### c) Is this SNP associated with any phenotypic effects?
### d) Describe the geographical distribution of allele frequency for this variant using
### the website http://popgen.uchicago.edu/ggv/

Example full credit answers:

Question 1
1. “The positive strand refers to the leading strand of DNA being sequenced (e.g. the strand that RNA would be replicated against).”
2. “Each DNA strand is a double helix - it has two strands. The first strand given is the positive strand; the second strand is based on the first and is called the negative strand. For example, if the positive strand is ATCGG, then the negative strand is TAGCC (T always pairs with A, and G always pairs with C). The header is stating that the genome provided is only based on the first strand (the positive strand).”
Question 2
1. awk '{print $2}' hu916767_20170324191934.txt
2. cut -f2 hu916767_20170324191934.txt
Question 3
1. awk '{print $2}' hu916767_20170324191934.txt | sort -u
2. cut -f2 hu916767_20170324191934.txt | sort -u
3. The command extracts the second column from a tab-delimited file, alphanumerically sorts it, and removes all duplicate lines.
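The Question 3 pipeline can be taken one step further to show how many variants fall on each chromosome, by swapping "sort -u" for "sort | uniq -c". The sketch below is only an illustration: it builds a tiny made-up genotype file (the rsIDs, positions, and genotypes are hypothetical) and assumes header lines start with "#", so it runs without the real data.

```shell
# Build a tiny stand-in genotype file (hypothetical data, tab-delimited).
sample=$(mktemp)
printf '# rsid\tchromosome\tposition\tgenotype\n' > "$sample"
printf 'rs1\t1\t100\tAA\n' >> "$sample"
printf 'rs2\t1\t200\tAG\n' >> "$sample"
printf 'rs3\t2\t150\tCC\n' >> "$sample"
printf 'rs4\tX\t300\tTT\n' >> "$sample"

# Drop header lines, keep the chromosome column, then count each distinct value.
per_chrom=$(grep -v '^#' "$sample" | cut -f2 | sort | uniq -c)
echo "$per_chrom"

rm -f "$sample"
```

To try it on the real file, replace "$sample" with hu916767_20170324191934.txt.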
Question 4
1. grep 'rs671' hu916767_20170324191934.txt
2. Output: rs671 12 112241766 GG
3. “Interpretation: This individual does not flush, has a normal risk for alcoholism, normal risk of esophageal cancer, and Disulfiram is effective for alcoholism for this individual.”
Question 5
1. Minor allele is A in individuals of European ancestry and MAF is 0.36
2. In individuals of African ancestry MAF is 0.021
3. The SNP is associated with thinking cilantro tastes like soap
4. “The minor allele is most common in central/southern Asia and western Europe, and least common in Africa, with the Americas in between.”

Example commands

### PSYCH 3102 Behavioral Genetics
### Lab 1 – downloading and exploring a genome
### Author: Scott Vrieze

######################################
### STEP 1, get a terminal working ###
######################################
###
### If you have a PC running Windows, install CYGWIN
### https://www.cygwin.com/
### CYGWIN will automatically install in C:/cygwin/
###
### On a mac or linux computer, open up a terminal
###

#################################################
### STEP 2, make a directory in which to work ###
#################################################
### On a Windows PC, open up Cygwin.
###
### On a Mac or linux computer, open a terminal.
###
### A window with text will open up. This box is BY FAR the most
### powerful thing on your computer. The trick is learning how to use
### it.
###
### Let's practice!

### Let's see where you are by listing the contents of the directory
ls

### Let's create a new directory called “bg”, where we can work.
mkdir bg

### Check that you created that directory by running ls again
ls

## Now, move into the bg directory
cd bg

## You should be in the bg directory. Run ls to see what's in here
ls

## The result should come up blank, because there's nothing in this
## directory yet! Let's find something interesting to put in here.

############################################
### STEP 3, download the practice genome ###
############################################
###
### The dataset is in our google drive folder, with the following
### direct link:
### https://drive.google.com/file/d/0B608ps4vtHUaWFNOWXJqZ0tDMXc/
###
### Download the file and then move the file to, on a Windows computer:
### C:/cygwin/home/<username>/bg/
### On a Mac:
### /Users/<username>/bg/
### On a Linux computer:
### /home/<username>/bg/

### Let's check and see if you got it right. Open a terminal and run
cd bg

### Then list the contents of the directory
ls
### You should see something like the following output:
### $ ls
### hu916767_20170324191934.txt
###
### If that's what you saw, congratulations, you put the file in the
### right place!

################################################
### STEP 4, look at the contents of the file ###
################################################
###
### In your terminal, go to the bg folder, then type
less hu916767_20170324191934.txt

### That should open the file in your terminal. You can scroll up or
### down using the arrow keys. To scroll faster you can press the
### space bar. To close the “less” session, press “q”.

### What if we just want to look at the first few lines?
head hu916767_20170324191934.txt

### The last few lines?
tail hu916767_20170324191934.txt

### How many variants are there? Try “wc -l”. This will give you the
### number of lines in the file, which is approximately the number of
### variants.
wc -l hu916767_20170324191934.txt
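The count from "wc -l" is only approximate because it also counts the header lines. A minimal sketch of an exact count follows, assuming the header lines begin with "#" (check with "head" first). It runs on a small made-up file, so the rsIDs and positions are hypothetical.

```shell
# Small stand-in file: two header lines plus three variant lines (hypothetical data).
sample=$(mktemp)
printf '# This file was made up for illustration only\n' > "$sample"
printf '# rsid\tchromosome\tposition\tgenotype\n' >> "$sample"
printf 'rs1\t1\t100\tAA\nrs2\t1\t200\tAG\nrs3\t2\t150\tCC\n' >> "$sample"

# wc -l counts every line, including the headers.
total_lines=$(wc -l < "$sample")

# grep -v inverts the match and -c counts, so this counts lines NOT starting with "#".
variant_lines=$(grep -vc '^#' "$sample")

echo "all lines: $total_lines, variant lines: $variant_lines"
rm -f "$sample"
```

The difference between the two numbers is the number of header lines.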

### That's a lot of variants. How can I extract a certain variant,
### without scrolling through the whole file?
grep 'rs9430244' hu916767_20170324191934.txt

### Try another one
grep 'rs8176719' hu916767_20170324191934.txt

### Huh, what does DD mean? I thought nucleotides could be A, C, T, or
### G. Also – Google that variant. What phenotype does it affect? What
### phenotype does this person have?

### We can also grab both variants, if we wanted to
grep -E 'rs8176719|rs9430244' hu916767_20170324191934.txt
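One caveat with plain grep is that it matches substrings, so a short rsID such as rs671 would also pull out any longer rsID that contains it. The sketch below shows one way around this, using grep's -w (whole word) option. Only the rs671 line comes from the example answers above; the other two lines are made up for illustration.

```shell
# Stand-in file: rs671 plus two hypothetical rsIDs that contain "rs671" as a prefix.
sample=$(mktemp)
printf 'rs671\t12\t112241766\tGG\n' > "$sample"
printf 'rs6711\t2\t2000\tAG\n' >> "$sample"
printf 'rs67101\t5\t3000\tCT\n' >> "$sample"

# Substring match: counts every line containing "rs671" anywhere.
loose=$(grep -c 'rs671' "$sample")

# Whole-word match: "rs671" must not be followed by another letter or digit.
exact=$(grep -cw 'rs671' "$sample")

echo "substring matches: $loose, whole-word matches: $exact"
rm -f "$sample"
```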

### What if we have a variant where we don't know the rsID,
### but only the chromosome, position, genome build, and alleles?
### Well, to get chromosome 1, position 11850750, we can do this:
grep -E '\s1\s11850750\s' hu916767_20170324191934.txt
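The "\s" shorthand is a GNU extension and may not behave the same in every version of grep. As an alternative sketch, awk can compare the chromosome and position columns directly. This assumes the columns are tab-delimited in the order rsID, chromosome, position, genotype. The file contents and the position 5000 below are hypothetical.

```shell
# Stand-in file with hypothetical variants; two share a position, two share a chromosome.
sample=$(mktemp)
printf 'rsA\t1\t5000\tAG\n' > "$sample"
printf 'rsB\t11\t5000\tCC\n' >> "$sample"
printf 'rsC\t1\t15000\tTT\n' >> "$sample"

# Print only lines where column 2 is exactly "1" and column 3 is exactly "5000".
hit=$(awk -F'\t' '$2 == "1" && $3 == "5000"' "$sample")

echo "$hit"
rm -f "$sample"
```

Because each column is compared as a whole, chromosome 11 and position 15000 are not picked up by accident.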

######################################
### STEP 5, join commands together ###
######################################
###
### Now we'll do something called “piping”. Piping allows you to run a
### command on a file, then send the output of that command to a new
### command, and possibly on to a new command. Let's give it a try.

### We saw above that “grep” allows you to extract all lines that
### match a certain character string. We also saw that “wc -l” counts
### the number of lines. Can we combine these two commands?
###
### Let's extract all the variants that are “GG”
grep 'GG' hu916767_20170324191934.txt

### OK, that didn't work so well, the output just kept on feeding our
### screen. Instead, we'll use a pipe to send that output to wc -l
grep 'GG' hu916767_20170324191934.txt | wc -l

### How about all SNPs that are homozygous?
grep -E 'GG|CC|TT|AA' hu916767_20170324191934.txt | wc -l

### How about all the variants that are homozygous?
grep -E 'GG|CC|TT|AA|II|DD' hu916767_20170324191934.txt | wc -l

### How many indels are there?
grep -E 'II|DD|ID|DI' hu916767_20170324191934.txt | wc -l
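The grep patterns above match anywhere on a line, so text in the header could be counted too. A stricter sketch, still piping into "wc -l", restricts the match to the genotype column with awk. It assumes the genotype is the fourth tab-delimited column, that header lines start with "#", and that the file uses Unix line endings. The data below are hypothetical.

```shell
# Stand-in file: one header line and five hypothetical variants.
sample=$(mktemp)
printf '# rsid\tchromosome\tposition\tgenotype\n' > "$sample"
printf 'rs1\t1\t100\tAA\nrs2\t1\t200\tAG\nrs3\t2\t150\tII\n' >> "$sample"
printf 'rs4\t2\t175\tDI\nrs5\tX\t300\tCC\n' >> "$sample"

# Homozygous variants: column 4 is exactly one of the doubled codes.
homozygous=$(awk -F'\t' '!/^#/ && $4 ~ /^(AA|CC|GG|TT|II|DD)$/' "$sample" | wc -l)

# Indels: column 4 is made up of exactly two I or D characters.
indels=$(awk -F'\t' '!/^#/ && $4 ~ /^[ID][ID]$/' "$sample" | wc -l)

echo "homozygous: $homozygous, indels: $indels"
rm -f "$sample"
```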

### Now, a little trickier. How many variants are on chromosome 1? Is
### this command going to work? Why or why not?
grep -E '1' hu916767_20170324191934.txt

Useful databases

Geography of Genetic Variants Browser: Interactively browse the geographic distribution of genetic variants. Can compare to 1000 Genomes, ExAC, and POPRES (Euro-centric). http://popgen.uchicago.edu/ggv/?data=%221000genomes%22&chr=11&pos=6889648

dbSNP: A fairly exhaustive database of SNPs in humans. https://www.ncbi.nlm.nih.gov/projects/SNP/

ExAC: A good source for exonic variants. Very user friendly. http://exac.broadinstitute.org/

IBG Wiki
