Need a better way to work

I’m annotating my results at the moment, which means that I have to take each region, enter it into a website, search the website for the information I want, and then copy it by hand into my spreadsheet. This is unacceptable; I am a computer scientist. There has to be a better way.

What I have is:

1) a .csv file that contains the regions that I am interested in.

2) a file that contains data about these regions (though a single mutation can have 4-6 entries per variant)

3) a website that I can manually look up to find this data.

If only I had a wheelbarrow, that would be something.

What I need to do is get the data for the areas that I want and then write a new program to take that data, process it, and spit out the tables that I want. But first I must ask myself: what do I want from these tables?

What I want is:

1) Chromosome and population of sample

2) start and end of region

3) nearest gene and, if there is no overlap, the distance to that gene

4) The number of non-synonymous mutations (those that change the expressed protein)

5) The number of non-coding functional mutations in the region

6) whether the hit region overlaps one or more genes

7) gene function information
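
As a rough sketch of how items 1 to 6 could be computed in Python, assuming things the post does not specify: the region .csv has columns named chrom, population, start and end; gene coordinates are available as (name, start, end) tuples per chromosome; and the annotation file has been parsed into (position, variant_id, effect) rows. All of those names, the effect labels, and the toy data are hypothetical. Gene function (item 7) is left out because it depends on the lookup source.

```python
import csv
import io


def gene_context(start, end, genes):
    """Return (overlapping gene names, nearest gene name, distance in bp).

    genes is a list of (name, gene_start, gene_end) on one chromosome.
    The distance is 0 when the region overlaps at least one gene.
    """
    overlapping = [name for name, g_start, g_end in genes
                   if g_start <= end and g_end >= start]
    if overlapping:
        return overlapping, overlapping[0], 0

    # No overlap: measure the gap from the region to the closer gene edge.
    def gap(gene):
        _, g_start, g_end = gene
        return g_start - end if g_start > end else start - g_end

    nearest = min(genes, key=gap)
    return [], nearest[0], gap(nearest)


def count_mutations(start, end, annotations):
    """Count variants per effect class inside [start, end].

    The same variant can appear on several rows, so each
    (variant_id, effect) pair is counted only once.
    """
    seen = set()
    counts = {}
    for pos, variant_id, effect in annotations:
        if start <= pos <= end and (variant_id, effect) not in seen:
            seen.add((variant_id, effect))
            counts[effect] = counts.get(effect, 0) + 1
    return counts


def annotate_regions(regions_file, genes_by_chrom, annotations_by_chrom):
    """Build one output row per region in the regions CSV."""
    rows = []
    for rec in csv.DictReader(regions_file):
        chrom, start, end = rec["chrom"], int(rec["start"]), int(rec["end"])
        overlapping, nearest, distance = gene_context(
            start, end, genes_by_chrom[chrom])
        counts = count_mutations(
            start, end, annotations_by_chrom.get(chrom, []))
        rows.append({
            "chrom": chrom,
            "population": rec["population"],
            "start": start,
            "end": end,
            "overlapping_genes": ";".join(overlapping),
            "nearest_gene": nearest,
            "distance_to_nearest": distance,
            "nonsynonymous": counts.get("nonsynonymous", 0),
            "noncoding_functional": counts.get("noncoding_functional", 0),
        })
    return rows


# Toy inputs, made up purely to exercise the functions.
regions_csv = io.StringIO(
    "chrom,population,start,end\n"
    "chr1,popA,100,200\n"
    "chr1,popB,500,600\n"
)
genes = {"chr1": [("GENE_A", 150, 300), ("GENE_B", 900, 1000)]}
annotations = {"chr1": [
    (120, "v1", "nonsynonymous"),
    (120, "v1", "nonsynonymous"),  # duplicate entry for the same variant
    (180, "v2", "noncoding_functional"),
    (550, "v3", "nonsynonymous"),
]}

rows = annotate_regions(regions_csv, genes, annotations)
for row in rows:
    print(row)
```

The linear scan over genes is fine for a handful of regions; for whole-genome input, sorting the genes and using bisect, or an interval tree, would likely be the better choice. Note also that a variant annotated with two different effects is counted once under each, which may or may not be what the final table should show.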

This seems like a bigger task than it is. I should be able to do this fairly easily, but I am having a mental block on getting started. It is usually about this time that I start asking for advice on how to get past the wall of starting terror.
