This is a wiki guide to python for biologists, and more specifically, a guide for students in the University of Miami Introduction to Biology course.

Feel free to comment on the entries (if and when I get this feature working!), to point out what you don’t understand, to ask questions that I, or other readers of the blog, will respond to.

Good luck and have fun.

Learning Python

Google python tutorial and take your pick. The Python Tutorial for Non-Programmers is good, as is the first hit: Python Tutorial. If you are ready for it, you can read the formal documentation.

The question arises as to how you are to write your python code. One option is to go completely interactive, opening up an instance of the python interpreter, typing at it, and getting responses, as if it were some super-intellectual calculator.

You can also write the code into a file, save it and run it. For this you will need to use an editor, not a word processor. If you choose this method of working, try using IDLE, an integrated development environment. It combines python with a point-and-click editor.

If you are using Windows, you will have to download python and IDLE and install them. If you are running OS X, python appears to be preinstalled. Open up a terminal (e.g. Finder bar, Go→Utilities, Terminal) and type python, and if that works, type idle. On one machine, the default configuration could not find idle, although it was there. To correct this, type the full name:

/Library/Frameworks/Python.framework/Versions/Current/bin/idle

If this works, then it is just a matter of including the directory in your PATH variable. Further information about this is available if needed.

Instructions for Linux users can be made available, if needed.

Getting Started

The first thing to do with python is to get it installed, start it up, write and run the “Hello World” program. The “Hello World” program is a program that when run prints “Hello World”. In python it consists of the single line of instruction:

>>> print "Hello World"

The triple greater-than signs are the python prompt. You will get this when python is in interactive mode and is prompting you to input instructions.

If you get this all right, python should respond with Hello World.

Expressions: literals and operators

You have installed python, started the development environment IDLE, and have written and run the Hello World program. You are now ready to take the next steps in programming.

Read about expressions in the tutorial for non-programmers. Try this in the development environment:

>>> print 2+3

The building block of programs is the expression. An expression is a bunch of symbols which evaluate to a value. An expression is a combination of literals, operators and variables, as well as function calls.

A literal is an expression which evaluates to its face value. For instance, 2 is a literal which evaluates to the integer value 2. Expression values can be of other types than integers. The literal “Hello World” evaluates to its face value, which is a string of characters, H-e-l- and so forth.

Operators combine literals, variables, or other expressions, acting upon them to derive new values. The expression 2+3 uses the operator plus to combine the literals 2 and 3, yielding a value of 5 for the expression. An operator’s action might depend on the type of values it operates on. The plus sign operates on string values by concatenating them, in the order presented. So “2”+“3” would yield the value “23”. Got it?
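To see the plus operator acting on both types, here is a small sketch. The variable names total and joined are made up for illustration, and print is written with parentheses so that the lines should run in either an older or a newer python.

```python
# The same + operator behaves differently depending on the operand types.
total = 2 + 3          # two integer literals: addition
joined = "2" + "3"     # two string literals: concatenation, in the order given

print(total)
print(joined)
```

Note that mixing the two types, as in 2 + "3", is an error in python rather than a guess at what you meant.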

Variables

Expressions are combinations of literals and variables combined by use of operators. Expressions are evaluated to yield values. Expressions are made up of other expressions, with the values of the sub-expressions combined by operators. The simplest expressions are literals, which are expressions which evaluate to their face value, such as the literal 2 which evaluates to the integer value 2, or the literal “2” (note quotes!) that evaluates to the string containing the single character ‘2’.

Another simple expression is the variable. A variable looks like a name, but it contains a value. It is used in one of two contexts. Either its value is being used or it is being set. The typical way to set the value of a variable is to use the assignment statement:

>>> a=2

This assignment statement sets the variable of name “a” to the value 2. When the variable is named in the midst of an expression, it is intended that it evaluate to the currently set value. So:

>>> print a+3

should print out 5, assuming that the assignment statement setting a to 2 preceded this print statement, and the value of variable a has not been modified since.
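The following sketch shows both contexts of a variable, being set and being used. The name result is only for illustration.

```python
a = 2             # assignment: set the variable a to the value 2
result = a + 3    # here a is used, and evaluates to its current value
print(result)

a = 10            # setting a again changes what later expressions see
print(a + 3)
```

The second print gives a different answer from the first because a was modified in between.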

Statements

Programs are made up of a sequence of statements, and statements are built up from expressions and certain connectives which control the flow of execution. Continue with the tutorial. In doing so you will learn your first function, the input() function. You will also learn how to use IDLE to create programs in files and run them.

To do this, in IDLE’s menu, select File→New Window. In the new window, type your program. This window is not interactive; it will not execute your program at all. When done typing, go to the file menu to File→Save As, and give a file name. Typically, you will add the extension .py to the end of your filename, to help identify the file as a python script. Then run the file by choosing Run→Run Module from the IDLE menu.

Reading a gene sequence

We now must do something different than in the tutorial. Please continue on with the tutorial on your own, but to speed things along, given our needs, we will jump to files and strings.

You will calculate not with numbers, but with strings. We will be asking questions about the nucleotide sequences of DNA and RNA, and protein sequences, and so forth. These are represented not by quantity but by sequences of letters, each letter representing a nucleotide or an amino acid. To accomplish this we will make extensive use of the string type of data.

You should grab a sample nucleotide sequence off GenBank, and using cut and paste, drop the sequence into a file. I named the file sequence.txt (the .txt indicating that the file is plain text). We will read this file into a python variable so we can manipulate it. Here is the code that works:

>>> file = open("sequence.txt","r")
>>> seq = file.read()
>>> print seq

You can use the interactive python window for this, at this moment, just to get the feel for files.

What do these three python statements do? It is a somewhat mysterious prerequisite for a program to use a file that the file must be “opened”. The first line does this, giving the name of the file to read and indicating, by the string “r”, that the file is for reading only. The second line reads the entire contents of the file, as a string, and sets the variable named seq to the contents of the file. The third line is not really needed, but it proves to you that the read worked by printing out the value of seq.
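If you do not have a sequence file at hand yet, the sketch below first writes a small stand-in file and then reads it back with the same open and read steps. The sequence text is invented for illustration and is not real GenBank data, and the names folder, path, out and infile are assumptions of this example.

```python
import os
import tempfile

# Write a small stand-in for sequence.txt in a temporary folder.
# The contents are made up; a real GenBank entry would be much longer.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "sequence.txt")
out = open(path, "w")              # "w" opens the file for writing
out.write("1 gatcctccat\n11 atacaacggt\n")
out.close()

infile = open(path, "r")           # "r" opens the file for reading only
seq = infile.read()                # the entire contents, as one string
infile.close()                     # finished with the file
print(seq)
```

Notice that seq holds everything in the file, including the digits, spaces and line breaks, not only the letters of the sequence.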

Iterating through a sequence

Part One

We will now learn to manipulate the nucleotide sequence. In doing so we will learn how programs control the execution of their statements, conditionally executing only certain parts of the program and repeating execution of certain parts of the program over and over again.

Programs are sequences of statements, typically executed in the order they appear in the file. Conditional and loop statements modify the order of execution. Conditional statements, also known as if or if-else statements, place prerequisites in front of statements. If the prerequisite is satisfied, the statements are executed, else they are not executed. Loop statements enclose a collection of statements and cause the collection to be executed over and over again, until some condition is satisfied. Loop statements occur in many forms; the most useful are the while statement and the for statement.
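As a first taste, here is a minimal sketch of a conditional statement and a loop statement. The variable names n, message and countdown are invented for this example.

```python
n = 3

# Conditional: only one of the two indented parts is executed.
if n > 2:
    message = "big"
else:
    message = "small"
print(message)

# Loop: the indented part repeats until the condition n > 0 fails.
countdown = []
while n > 0:
    countdown.append(n)
    n = n - 1
print(countdown)
```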

We will illustrate these ideas by showing how to manipulate the sequence string, working character by character through the string, inspecting the character and removing it if it is not of interest.

Part Two

A string data type is a sequence of characters. We will have need to work character by character through the sequence. For instance, to compare two sequences for similarities, we will work character by character in synchrony through the two sequences, comparing character by character and counting matches.

For a string s, that is:

s = "hello world"

The first letter in s is referenced using the syntax s[0], the second letter is referenced as s[1], and so forth. We will work our way through the string by setting an integer variable, for example, a variable named i, consecutively to 0, 1, 2, and so on, and looking at the value of s[i], that is, the i-th letter of the string s (counting from zero).
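A short sketch of this indexing, using the string s from above. The names first, second and last are only for illustration.

```python
s = "hello world"

first = s[0]              # counting starts at zero, so this is 'h'
second = s[1]             # 'e'
i = 4
print(s[i])               # the letter at position 4

last = s[len(s) - 1]      # len(s) is 11, so the last valid position is 10
print(first)
print(second)
print(last)
```

Asking for s[len(s)] is an error, since the positions run from 0 up to len(s) - 1.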

The control construct we need is a loop iterating over and over again a certain operation, on each character of the string. In python this is easily done using the for statement. Try:

>>> for i in range(len(seq)): print seq[i]

This will print out the sequence stored in the variable named seq character by character. By the way, if you wish to break out of the for execution, which can be a lengthy operation, use a “control-C”, also known as ^C, by holding down the ctrl key while depressing the letter C.

The for statement is called a compound statement. It has a head and a suite. The two parts are separated by the colon (:). The head introduces the statement using the special keyword for. It also sets up the iteration variable, in this case named i, and specifies for that variable the range of values which it will take. The word in connects the variable to a range construct which specifies the range, in our case, range(len(seq)). It is enough for you to know that in this circumstance, the magic words range(len(s)) work, where s is the name of the string over which you want to iterate.

The suite follows the colon in the for compound statement. It is a collection of statements to do under the guidance of the for loop. The statements are either on the same line as the head, in which case they are separated by semicolons (;), or on separate lines, in which case they have to be carefully indented, with a common indentation amount which is something more than the indentation of the for line. This aspect of python is confusing, and you might like to see the formal requirements for indentation.
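The two ways of writing a suite can be compared side by side. This is only a sketch; the short string and the list names collected and collected2 are made up so that the two loops can be checked against each other.

```python
seq = "acgt"

# Suite on the same line as the head, statements separated by semicolons.
collected = []
for i in range(len(seq)): collected.append(i); collected.append(seq[i])

# The same loop with the suite on separate, equally indented lines.
collected2 = []
for i in range(len(seq)):
    collected2.append(i)
    collected2.append(seq[i])

print(collected)
print(collected == collected2)
```

Both loops do the same work; the indented form is usually easier to read once the suite has more than one statement.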

Part Three

We have explored the use of the for statement to work character by character through a string. We need to do something on each character, and in this case, what we do depends on the value of the character. We explore now the use of the if statement to execute statements only under conditions given by a logical expression (an expression evaluating to either the value True or False).

This conditional action requires the use of an if statement. The if statement is classed as a compound statement, as is the for statement, in that it has a head and a suite which are separated by a colon. The suite appears either on the same line as the head, or on following separate lines, with the precise indentation structure. The head is marked by the keyword if and is followed by an expression which evaluates to either True or False. Our expression is seq[i].isalpha(), which uses a special sort of function (called a method) that gives True if seq[i] is an “alpha” character, that is, a letter, and False otherwise, that is, if seq[i] is a digit, or punctuation, or a space, whatever isn’t a letter.

  seq2 = ""
  for i in range(len(seq)):
      if seq[i].isalpha():
          seq2 = seq2 + seq[i]

Try this code in the interactive IDLE window. Don’t forget to read in seq from the file sequence.txt, as explained earlier. This code gives a pattern of operations we will use over and over again, to accomplish a great deal of calculation on the gene sequences we represent and manipulate as python strings.
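To see what the pattern does without a file, here it is applied to a short made-up string containing the kind of digits, spaces and line breaks that come along when pasting from a sequence listing. The second loop is an alternative spelling, not used elsewhere in this guide, that walks over the characters directly instead of over their positions; the names seq3 and ch are invented here.

```python
seq = "  1 gatc ctcc\n 11 atac\n"    # made-up sample, not a real entry

# The pattern from above: keep only the letters.
seq2 = ""
for i in range(len(seq)):
    if seq[i].isalpha():
        seq2 = seq2 + seq[i]
print(seq2)

# Alternative: let the for statement hand us each character in turn.
seq3 = ""
for ch in seq:
    if ch.isalpha():
        seq3 = seq3 + ch
print(seq3 == seq2)
```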

Functions

A function is a name attached to a list of instructions. When the function name appears in an expression it is replaced by the value resulting from the evaluation of the instructions. A function statement defines a function. To use a function it is named in an expression, the name followed by a pair of opposing parentheses which enclose parameters to the function. The parameters allow the function to act upon different inputs each time it is used.

A function statement is a compound statement, with a head and a suite separated by a colon. The suite of a function is generally written over several lines, indenting the lines past the function name. The head is introduced with the keyword def, followed by the name of the function, followed by, inside a pair of parentheses, a comma separated list of variable names. When the function is used, these variables will be initialized to values presented when the function is used. This is a bit complicated, but should become clear in the following example and by reading the section in the tutorial.

A function uses a special return statement to terminate the function and hand back its value. The purpose of the return statement is:

  1. to cause the function to stop its evaluation, and return control to the evaluation of the expression which called the function, and
  2. to specify the value that the function has evaluated to.

The return statement consists of the keyword return followed by an expression. The resulting value of the expression is the final value of the function.
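Both purposes of the return statement can be seen in a tiny sketch. The function add_three and the variable value are invented for this example.

```python
def add_three(n):
    result = n + 3
    return result              # stop here and hand back the value of result
    print("never reached")     # the function has already returned

# The call add_three(2) is replaced by the returned value inside the expression.
value = add_three(2) * 10
print(value)
```

The line after the return statement is never executed, because return stops the evaluation of the function.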

Working with functions

To write and run (use) a function, do the following:

  1. Go to the IDLE menu and select File→New Window
  2. In the new window write the function, given below.
  3. In the IDLE menu select Run→Run Module (this option appears when the new window has focus)
  4. As a result of Run Module, a Python Shell window will be reinitialized having read the contents of the New Window and will be ready with the python prompt (the three greater-than signs) to accept interactive commands.

You can also save the contents of the new window, and later load (Open) the saved file. Here is the function:

  def getsequence(filename):
      file = open(filename,"r")
      seq = file.read()
      file.close()
      seq2 = ""
      for i in range(len(seq)):
          if seq[i].isalpha():
              seq2 = seq2 + seq[i]
      return seq2

From the IDLE menu select Run→Run Module. A Python Shell will open and halt at the python prompt. Type:

>>> print getsequence('sequence.txt')

This will execute the function getsequence that we have defined, except that the variable filename found in the definition of getsequence will have value “sequence.txt”, so the instructions will operate on this file.

Comparing Strings

We will write a function to count the number of letter differences between two strings. We shall assume that the two strings have the same number of characters in them. In use, we will probably also assume that the strings are clean, in the sense that they contain only letters, but this is not important for the function.

The idea is to use a for statement to iterate through the strings, incrementing a count by one each time characters in the corresponding location on the strings do not match. We have all the programming skills we need for this mission, except that maybe you would have to refer to some quick reference guide to find out that “not equals” is written != in python (and in many other computer languages too).

  def compareseq(s1,s2):
      count=0
      for i in range(len(s1)):
          if s1[i]!=s2[i]:
              count = count + 1
      return count

Note carefully, very carefully, the indentation. The return statement is lined up underneath the for statement. This means that it is not controlled by the for statement; it is run once, at the conclusion of the for-controlled repetitions. The suite of the for statement contains the if statement, which has its own, indented, suite, which is the increment of the count variable.
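To see why the indentation matters, the sketch below repeats compareseq and sets beside it a deliberately mistaken version, here called compareseq_wrong, in which the return statement has been pushed one level in. The two test strings are made up for illustration.

```python
def compareseq(s1, s2):
    count = 0
    for i in range(len(s1)):
        if s1[i] != s2[i]:
            count = count + 1
    return count               # lined up with for: runs once, after the loop

def compareseq_wrong(s1, s2):
    count = 0
    for i in range(len(s1)):
        if s1[i] != s2[i]:
            count = count + 1
        return count           # inside the for suite: runs on the first repetition

print(compareseq("gattaca", "gactata"))
print(compareseq_wrong("gattaca", "gactata"))
```

The two strings differ in two positions. The mistaken version stops after comparing only the first pair of letters, which match, so it reports no differences at all.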

burt/bioinfo.txt · Last modified: 2006/08/22 16:09 by burt
 
 
 