Chris Fonnesbeck edited this page Oct 6, 2011 · 7 revisions

Scientific Programming using Python

Authors: Christopher J. Fonnesbeck

Python is a modern, open source, object-oriented programming language, created by a Dutch programmer, Guido van Rossum. It is an interpreted language (meaning that source code is compiled when it is run, rather than in a separate build step), and the reference implementation of Python is itself written in C. Frequently, it is compared to Java and Perl, and is very similar to Ruby. It offers much of the power and flexibility of lower-level languages, without the steep learning curve, and without most of the associated debugging pitfalls. The language is very clean and readable, and it is available for almost every modern computing platform.

Python offers a number of advantages to scientists, for experienced and novice programmers alike:

Python is simultaneously powerful, flexible and easy to learn and use (in general, these qualities are traded off for a given programming language). Anything that can be coded in C, FORTRAN, or Java can be done in Python, usually in fewer lines of code, and with fewer debugging headaches. Its standard library is extremely rich, including modules for string manipulation, regular expressions, file compression, mathematics, profiling and debugging (to name only a few). Unnecessary language constructs, such as END statements and brackets, are absent, making the code terse and easy to read. Finally, Python is object-oriented, an important programming paradigm that is particularly well-suited to scientific programming because it allows data structures to be abstracted in a natural way.

Python may be run interactively on the command line, in much the same way as Maple or S-Plus/R. Rather than compiling and running a particular program, commands may be entered one at a time, each followed by the Return key. This is often useful for mathematical programming and debugging.

Python is often referred to as a “glue” language, meaning that it is useful in a mixed-language environment. Frequently, programmers must interact with colleagues who work in other programming languages, or use significant quantities of legacy code that would be problematic or expensive to re-code. Python was designed to interact with other programming languages, and in many cases C or FORTRAN code can be compiled directly into Python programs (using utilities such as f2py or weave). Additionally, since Python is an interpreted language, it can sometimes be slow relative to its compiled cousins. In many cases this performance deficit is due to a short loop of code that runs thousands or millions of times. Such bottlenecks may be removed by coding a function in FORTRAN or C, and compiling it into a Python module.

There is a vast body of Python modules created outside the auspices of the Python Software Foundation. These include utilities for database connectivity, mathematics, statistics, and charting/plotting.

Python is released on all platforms under the Python Software Foundation License, an open source license that is compatible with the GNU General Public License, meaning that the language and its source are freely distributable. Not only does this keep costs down for scientists and universities operating under a limited budget, but it also frees programmers from licensing concerns for any software they may develop. There is little reason to buy expensive licenses for software such as Matlab or Maple, when Python can provide much of the same functionality for free!

While Python includes a number of built-in libraries, several of the key scientific programming add-ons must be downloaded and installed separately. The easiest way to install third-party Python packages is via another package, called setuptools. To get setuptools installed, simply download this file:

http://peak.telecommunity.com/dist/ez_setup.py

Then, from the command line, in the same directory as the downloaded file, run:

python ez_setup.py

Once this is done, you will be able to automatically download and install packages from the Python Package Index (PyPI), using the command easy_install. I recommend the following packages to get you started with scientific computing:

NumPy is a set of extensions that provides the ability to specify and manipulate array data structures. It provides array manipulation and computational capabilities similar to those found in IDL, Matlab, or Octave.

easy_install numpy

(note that on Mac OS X and Linux, you may need root permissions to install packages, in which case you need to prefix the above command with sudo)

An open source library of scientific tools for Python, SciPy supplements the popular NumPy module. SciPy gathers a variety of high level science and engineering modules together as a single package. SciPy includes modules for graphics and plotting, optimization, integration, special functions, signal and image processing, genetic algorithms, ODE solvers, and others.

easy_install scipy

Matplotlib is a 2D plotting library which produces publication-quality figures in a range of formats.

easy_install matplotlib

IPython is an enhanced interactive shell for Python that allows for easy data visualization, debugging, profiling and parallel computing.

easy_install ipython

The GNU Readline library provides a set of functions for use by applications that allow users to edit command lines as they are typed in.

easy_install readline

Here is a quick example of a Python program. We will call it stats.py, because Python programs typically end with the .py suffix. This code consists of some fake data, and two functions, mean and var, which calculate mean and variance, respectively. Python can be internally documented by adding lines beginning with the # symbol, or with simple strings enclosed in quotation marks. Here is the code:

# Import modules you might use
import numpy

# Some data, in a list
my_data = [12,5,17,8,9,11,21]

def mean(data):
    # Function for calculating the mean of some data

    # Initialize sum to zero
    sum_x = 0.0

    # Loop over data
    for x in data:

        # Add to sum
        sum_x += x

    # Divide by number of elements in list, and return
    return sum_x/len(data)

def var(data):
    # Function for calculating variance of data

    # Get mean of data from function above
    x_bar = mean(data)

    # Initialize sum of squares
    sum_squares = 0.0

    # Loop over data
    for x in data:

        # Add squared difference to sum
        sum_squares += (x - x_bar)**2

    # Divide by n-1 and return
    return sum_squares/(len(data)-1)

Notice that, rather than using parentheses or brackets to enclose units of code (such as loops or conditional statements), Python simply uses indentation. This relieves the programmer from worrying about a stray bracket causing her program to crash. Also, it forces programmers to code in neat blocks, making programs easier to read. So, for the following snippet of code:

# Loop over data
for x in data:

    # Add to sum
    sum_x += x

the first line initiates a loop, where each element in the data list is given the name x, and is used in the code that is indented below. The first line of subsequent code that is not indented signifies the end of the loop. It takes some getting used to, but works rather well.

Let's start up Python, and run this program interactively. The first thing you see is the Python command prompt:

Python 2.7.1 (r271:86832, Jun 16 2011, 16:59:05)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on
darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

Our program can be imported with a simple command:

>>> import stats
>>>

Since there was no error message, we can assume that the module imported successfully. Alternately, you can import the individual contents of the program with this syntax:

>>> from stats import *
>>>

Now let's call the functions:

>>> mean(my_data)
11.857142857142858
>>> var(my_data)
30.142857142857142
>>>

Our specifications of mean and var are by no means the most efficient implementations. Python provides some syntax and built-in functions to make things easier, and sometimes faster:

# Function for calculating the mean of some data
def mean(data):

    # Call sum, then divide by the number of elements
    # (float avoids integer division when the data are integers)
    return float(sum(data))/len(data)

# Function for calculating variance of data
def var(data):

    # Get mean of data from function above
    x_bar = mean(data)

    # Do sum of squares in one line
    sum_squares = sum([(x - x_bar)**2 for x in data])

    # Divide by n-1 and return
    return sum_squares/(len(data)-1)

In the new implementation of mean, we use the built-in function sum to reduce the function to a single line. Similarly, var employs a list comprehension syntax to make a more compact and efficient loop.
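To make the equivalence concrete, here is a small sketch that builds the squared deviations three ways: with an explicit loop, with a list comprehension, and with a generator expression passed straight to sum. The generator form is not used in the text; it is shown only as a closely related variant. The variable names are illustrative.

```python
data = [12, 5, 17, 8, 9, 11, 21]
x_bar = sum(data) / float(len(data))

# Explicit loop: build the list of squared deviations one element at a time
squares_loop = []
for x in data:
    squares_loop.append((x - x_bar) ** 2)

# List comprehension: the same list in a single expression
squares_comp = [(x - x_bar) ** 2 for x in data]

# Generator expression: sum() consumes the values without building a list
sum_squares = sum((x - x_bar) ** 2 for x in data)

# Sample variance, dividing by n-1
variance = sum_squares / (len(data) - 1)
```

All three forms compute the same values; the comprehension simply removes the bookkeeping of creating and appending to a list.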

An alternative looping construct involves the map function. Suppose that we had a number of datasets, for each of which we want to calculate the mean:

>>> x = (45, 95, 100, 47, 92, 43)
>>> y = (65, 73, 10, 82, 6, 23)
>>> z = (56, 33, 110, 56, 86, 88)
>>> datasets = (x,y,z)

This can be done using a classical loop:

means = []
for d in datasets:
    means.append(mean(d))

Or, more concisely, using map:

>>> map(mean, datasets)
[70.333333333333329, 43.166666666666664, 71.5]
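As a hedged aside, the sketch below places map next to the equivalent list comprehension. It is written so that it also runs under Python 3, where map returns an iterator rather than a list, which is why the result is wrapped in list(). The mean function is restated here only to keep the example self-contained.

```python
def mean(data):
    # Convert to float so the division is not truncated under Python 2
    return sum(data) / float(len(data))

x = (45, 95, 100, 47, 92, 43)
y = (65, 73, 10, 82, 6, 23)
z = (56, 33, 110, 56, 86, 88)
datasets = (x, y, z)

# map applies mean to each dataset in turn; list() is needed under
# Python 3, where map returns an iterator
means_map = list(map(mean, datasets))

# Equivalent list comprehension
means_comp = [mean(d) for d in datasets]
```

Which form to use is largely a matter of taste; both produce one mean per dataset, in order.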

Similarly, we did not have to code these functions to get means and variances; the NumPy package that we imported at the beginning of the module has similar functions:

>>> numpy.average(my_data)
11.857142857142858
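One detail worth checking when swapping in NumPy for the hand-written functions is the variance divisor. The sketch below assumes NumPy is installed and uses numpy.mean and numpy.var; by default numpy.var divides by n, so the ddof=1 argument is needed to reproduce the n-1 divisor used in var above.

```python
import numpy

my_data = [12, 5, 17, 8, 9, 11, 21]

# numpy.mean matches numpy.average when no weights are supplied
m = numpy.mean(my_data)

# numpy.var divides by n by default; ddof=1 changes the divisor to n-1,
# matching the hand-written var function
v_population = numpy.var(my_data)
v_sample = numpy.var(my_data, ddof=1)
```

With these data, v_sample agrees with the value returned by var(my_data), while v_population is slightly smaller.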

Extensive Python documentation and links to tutorials can be found at the Python website.

As previously stated, Python is an object-oriented programming (OOP) language, in contrast to procedural languages like FORTRAN and C. As the name implies, object-oriented languages employ objects to create convenient abstractions of data structures. This allows for more flexible programs, fewer lines of code, and a more natural programming paradigm in general. An object is simply a modular unit of data and associated functions, related to the state and behavior, respectively, of some abstract entity. Object-oriented languages group similar objects into classes. For example, consider a Python class representing a bird:

class bird:
    # Class representing a bird

    name = "bird"

    def fly(self):
        # Makes bird fly

        print "Flying!"

    def nest(self):
        # Makes bird build nest

        print "Building nest ..."

You will notice that this bird class is simply a container for two functions (called methods in Python), fly and nest, as well as one variable, name. The methods represent functions in common with all members of this class. You can run this code in Python, and create birds:

>>> Tweety = bird()
>>> Tweety.name
'bird'
>>> Tweety.fly()
Flying!
>>> Foghorn = bird()
>>> Foghorn.nest()
Building nest ...

As many instances of the bird class can be generated as desired, though it may quickly become boring. One of the important benefits of using object-oriented classes is code re-use. For example, we may want more specific kinds of birds, with unique functionality:

class duck(bird):
    # Duck is a subclass of bird

    name = "duck"

    def swim(self):
        # Ducks can swim

        print "Swimming!"

    def quack(self,n):
        # Ducks can quack

        print "Quack!" * n

Notice that this new duck class refers to the bird class in parentheses after the class declaration; this is called inheritance. The subclass duck automatically inherits all of the variables and methods of the superclass, but allows new functions or variables to be added. In addition to flying and nest-building, our duck can also swim and quack:

>>> Daffy = duck()
>>> Daffy.swim()
Swimming!
>>> Daffy.quack(3)
Quack!Quack!Quack!
>>> Daffy.nest()
Building nest ...

In addition to adding new variables and methods, a subclass can also override existing variables and methods of the superclass. For example, one might define fly in the duck subclass to return an entirely different string. It is easy to see how inheritance promotes code re-use, sometimes dramatically reducing development time. Classes which are very similar need not be coded repetitiously, but rather, just extended from a single superclass.
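A minimal sketch of overriding follows. It departs from the classes above in two labeled ways: the methods return strings instead of printing them (which makes the behaviour easier to check and runs unchanged on Python 2 and 3), and the replacement text for fly is invented purely for illustration.

```python
class bird(object):
    # Class representing a bird
    name = "bird"

    def fly(self):
        return "Flying!"

    def nest(self):
        return "Building nest ..."


class duck(bird):
    # Duck is a subclass of bird
    name = "duck"

    def fly(self):
        # Overrides bird.fly; the wording is a made-up example
        return "Flying south for the winter!"


Daffy = duck()
duck_flight = Daffy.fly()    # uses the overriding method defined in duck
duck_nest = Daffy.nest()     # still inherited unchanged from bird
bird_flight = bird().fly()   # the superclass is unaffected by the override
```

Only the method that differs is rewritten in the subclass; everything else continues to come from bird.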

This brief introduction to object-oriented programming is intended only to introduce new users of Python to this programming paradigm. There are many more salient object-oriented topics, including interfaces, composition, and introspection. I encourage interested readers to refer to any number of current Python and OOP books for a more comprehensive treatment.

In the introduction above, you have already seen some of the important Python data structures, including integers, floating-point numbers, lists and tuples. It is worthwhile, however, to quickly introduce all of the built-in data structures relevant to everyday Python programming.

The simplest data structures are literals, which appear directly in programs, and include most simple strings and numbers:

42            # Integer
0.002243        # Floating-point
5.0J            # Imaginary
'foo'
"bar"            # Several string types
""" Multi-line
string"""

The first sequence data structure is the tuple, which is simply an immutable, ordered sequence of elements. These elements may be of arbitrary and mixed types. The tuple is specified by a comma-separated sequence of items, enclosed by parentheses:

(34,90,56)        # Tuple with three elements
(15,)            # Tuple with one element
(12, 'foobar')    # Mixed tuple

As with any sequence in Python, individual elements can be accessed by indexing. This amounts to specifying the appropriate element index enclosed in square brackets following the tuple name:

>>> foo = (5,7,2,8,2,-1,0,4)
>>> foo[5]
-1

Notice that the index is zero-based, so that 5 retrieves the sixth item, not the fifth. Hence, the first index is zero, rather than one (in contrast to R). Two or more elements can be indexed by slicing:

>>> foo[2:5]
(2, 8, 2)

This retrieves the third, fourth and fifth (but not the sixth!) elements -- i.e., up to, but not including, the final index. One may also slice or index starting from the end of a sequence, by using negative indices:

>>> foo[:-2]
(5, 7, 2, 8, 2, -1)
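The slicing rules can be summarised in a short sketch using the same tuple. The step form in the last line (a third value after a second colon) is not mentioned in the text and is included only as a related illustration.

```python
foo = (5, 7, 2, 8, 2, -1, 0, 4)

first_three = foo[:3]     # (5, 7, 2): the start defaults to 0
from_fourth = foo[3:]     # (8, 2, -1, 0, 4): the end defaults to len(foo)
last_item = foo[-1]       # 4: index -1 is the final element
last_two = foo[-2:]       # (0, 4): the two elements that foo[:-2] leaves out
every_other = foo[::2]    # (5, 2, 2, 0): optional third value is the step
```

Because the end index is excluded, foo[:n] and foo[n:] always split a sequence into two pieces with no overlap.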

As you can see, this returns all elements except the final two. The elements of a tuple, as defined above, are immutable. Therefore, Python takes offense if you try to change them:

>>> a = (1,2,3)
>>> a[0] = 6
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment

Finally, the tuple() function can create a tuple from any sequence:

>>> tuple('foobar')
('f', 'o', 'o', 'b', 'a', 'r')

Why does this happen? Because strings are considered a sequence of characters.

Lists complement tuples in that they are a mutable, ordered sequence of elements. To distinguish them from tuples, they are enclosed by square brackets:

[90,43.7,56,1,-4]    # List with five elements
[100]                # List with one element
[]                   # Empty list

Elements of a list can be arbitrarily substituted by assigning new values to the associated index:

>>> bar = [5,8,4,2,7,9,4,1]
>>> bar[3] = -5
>>> bar
[5, 8, 4, -5, 7, 9, 4, 1]

Operations on lists are somewhat unusual. For example, multiplying a list by an integer does not multiply each element by that integer, but rather:

>>> bar * 3
[5, 8, 4, -5, 7, 9, 4, 1, 5, 8, 4, -5, 7, 9, 4, 1, 5, 8, 4, -5, 7, 9, 4, 1]

Which is simply three copies of the list, concatenated together. This is useful for generating lists with identical elements:

>>> [0]*10
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
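If element-wise multiplication is what you actually want from a plain list, a list comprehension is one way to get it. The following is a small sketch contrasting the two behaviours; the variable names are illustrative.

```python
bar = [5, 8, 4, -5, 7, 9, 4, 1]

# Repetition: three copies of bar joined end to end
repeated = bar * 3

# Element-wise multiplication needs an explicit comprehension
tripled = [3 * value for value in bar]

# Concatenation with + follows the same sequence logic as *
padded = bar + [0] * 2
```

Here repeated has 24 elements, tripled has the original 8 with each value multiplied by 3, and padded is bar followed by two zeros.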

Since lists are mutable, they have several methods [1], some of which mutate the list. For example:

>>> bar.extend(foo)         # Adds foo to the end of bar
>>> bar.append(5)           # Appends 5 to the end of bar
>>> bar.insert(0, 4)        # Inserts 4 at index 0
>>> bar.remove(7)           # Removes the first occurrence of 7
>>> bar.remove(100)         # Oops! Doesn't exist
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: list.remove(x): x not in list
>>> bar.pop(4)          # Removes and returns indexed item
-5
>>> bar.reverse()           # Reverses bar in place
>>> bar.sort()          # Sorts bar in place

Some methods, however, do not change the list:

>>> bar.count(7) # Counts occurrences of 7 in bar
1
>>> bar.index(7) # Returns index of first 7 in bar
12

One of the more flexible built-in data structures is the dictionary. A dictionary maps a collection of values to a set of associated keys. These mappings are mutable, and unlike lists or tuples, are unordered. Hence, rather than using the sequence index to return elements of the collection, the corresponding key must be used. Dictionaries are specified by a comma-separated sequence of keys and values, which are separated in turn by colons. The dictionary is enclosed by curly braces. For example:

>>> my_dict = {'a':16, 'b':(4,5), 'foo':'(noun) a term used as a '
...     'universal substitute for something real, especially when '
...     'discussing technological ideas and problems'}
>>> my_dict['b']
(4, 5)

Notice that a indexes an integer, b a tuple, and foo a string (now you know what foo means). Hence, a dictionary is a sort of associative array. Some languages refer to such a structure as a hash.

As with lists, being mutable, dictionaries have a variety of methods, and there are functions that take dictionaries as arguments. For example, some built-in functions and operators that work on dictionaries include:

>>> len(my_dict) # Returns number of key/value pairs in my_dict
3
>>> 'a' in my_dict # Checks to see if ‘a’ is in my_dict
True

Some useful dictionary methods are:

>>> my_dict.copy()      # Returns a copy of the dictionary
{'a': 16, 'b': (4, 5), 'foo': '(noun) a term used as a universal
substitute for something real, especially when discussing
technological ideas and problems'}
>>> my_dict.has_key('bar')  # Checks to see if a key exists
False
>>> my_dict.items()     # Returns key/value pairs as list
[('a', 16), ('b', (4, 5)), ('foo', '(noun) a term used as a universal
substitute for something real, especially when discussing
technological ideas and problems')]
>>> my_dict.keys()      # Returns list of keys
['a', 'b', 'foo']
>>> my_dict.values()    # Returns list of values
[16, (4, 5), '(noun) a term used as a universal substitute for
something real, especially when discussing technological ideas
and problems']
>>> my_dict.get('c', -1)    # Gets value for ‘c’ if it exists,
-1              # if not, returns specified default

>>> my_dict.popitem()   # Removes and returns arbitrary item
('a', 16)
>>> my_dict.clear()     # Empties dictionary
>>> my_dict
{}
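A common use of these methods is tallying. The sketch below counts words with get and tests membership with the in operator, which, unlike has_key, is also available in Python 3. The word list is made up for the example.

```python
words = ["spam", "egg", "spam", "ham", "egg", "spam"]

counts = {}
for word in words:
    # get returns the current count, or 0 if the key is not yet present
    counts[word] = counts.get(word, 0) + 1

# The in operator tests for a key
has_ham = "ham" in counts
has_tofu = "tofu" in counts

# Dictionaries are unordered here, so sort the pairs for a predictable order
pairs = sorted(counts.items())
```

Supplying a default to get avoids a separate check for whether the key already exists.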

The control flow of a program determines the order in which lines of code are executed. All else being equal, Python code is executed linearly, in the order that lines appear in the program. However, all is not usually equal, and so the appropriate control flow is frequently specified with the help of control flow statements. These include loops, conditional statements and calls to functions. Let’s look at a few of these here.

One way to repeatedly execute a block of statements (i.e. loop) is to use a for statement. These statements iterate over the elements of a specified sequence, according to the following syntax:

for letter in "ciao":
    print "give me a", letter.upper()

give me a C
give me a I
give me a A
give me a O

Recall that strings are simply regarded as sequences of characters. Hence, the above for statement loops over each letter, converting each to upper case with the upper() method and printing it. Similarly, as shown in the introduction, list comprehensions may be constructed using for statements:

>>> [i**2 for i in range(10)]
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Here, the expression loops over range(10) -- the sequence from 0 to 9 -- and squares each before placing it in the returned list.

As the name implies, if statements execute particular sections of code depending on some tested condition. For example, to code an absolute value function, one might employ conditional statements:

def absval(some_list):

    # Create empty list
    absolutes = []

    # Loop over elements in some_list
    for value in some_list:

        # Conditional statement
        if value<0:
            # Negative value
            absolutes.append(-value)

        else:
            # Non-negative value
            absolutes.append(value)

    return absolutes

Here, each value in some_list is tested for the condition that it is negative, in which case it is multiplied by -1, otherwise it is appended as-is. For conditions that have more than two possible values, the elif statement can be used:
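For comparison, the same function can be sketched more compactly with the built-in abs function and a list comprehension. This is an alternative formulation, not a replacement for the version above, which is there to illustrate if and else.

```python
def absval(some_list):
    # Apply the built-in abs to every element and collect the results
    return [abs(value) for value in some_list]

result = absval([3, -1, 0, -7.5])
```

For this input the result is [3, 1, 0, 7.5], the same list the loop-based version produces.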

if x<0:
    print "x is negative"
elif x%2:
    print "x is positive and odd"
else:
    print "x is even and non-negative"
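To see which branch is taken for different inputs, the chain can be wrapped in a function. In this sketch the function name describe is hypothetical, and the messages are returned rather than printed so that the code runs on both Python 2 and 3. Note that x % 2 evaluates to 1 for odd integers, which Python treats as true.

```python
def describe(x):
    # Conditions are tested in order; the first true one wins
    if x < 0:
        return "x is negative"
    elif x % 2:
        return "x is positive and odd"
    else:
        return "x is even and non-negative"

descriptions = [describe(x) for x in (-3, 0, 4, 7)]
```

Since the branches are tested in order, a negative odd number such as -3 is reported as negative and never reaches the elif test.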

A different type of conditional loop is provided by the while statement. Rather than iterating a specified number of times, according to a given sequence, while executes its block of code repeatedly, until its condition is no longer true. For example, suppose we want to sample from a normal distribution, but are only interested in positive-valued samples. The following function is one solution:

# Import function
from random import gauss

def positive_normals(how_many):

    # Create empty list
    values = []

    # Loop until we have specified number of samples
    while (len(values) < how_many):

        # Sample from standard normal
        x = gauss(0,1)

        # Append if positive
        if x>0: values.append(x)

    return values

This function iteratively samples from a standard normal distribution, and appends each sample to the output list if it is positive, stopping to return the list once the specified number of values have been added.

Obviously, the body of the while statement should contain code that eventually renders the condition false, otherwise the loop will never end! An exception to this is if the body of the statement contains a break or return statement; in either case, the loop will be interrupted.
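As an illustration of break, here is a variant of positive_normals written with while True. The max_draws argument is a hypothetical safety limit added for this sketch, not part of the original function; it guarantees that the loop ends even if too few positive values are drawn.

```python
from random import gauss, seed

def positive_normals(how_many, max_draws=10000):
    # max_draws is an illustrative safety limit (an assumption of this sketch)
    values = []
    draws = 0

    while True:
        # Leave the loop once enough samples exist or the limit is reached
        if len(values) >= how_many or draws >= max_draws:
            break

        # Sample from standard normal
        x = gauss(0, 1)
        draws += 1

        # Append if positive
        if x > 0:
            values.append(x)

    return values

seed(42)   # fix the seed so repeated runs give the same samples
samples = positive_normals(5)
```

The condition that would normally follow while has moved into the body, where break ends the loop explicitly.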

Python includes operations for importing and exporting data from files and binary objects, and third-party packages exist for database connectivity. The easiest way to import data from a file is to parse a delimited text file, which can usually be exported from spreadsheets and databases. In fact, file is a built-in type in Python. Data may be read from and written to regular files by specifying them as file objects:

chdage = open('chdage.dat')

Here, a file containing the data in a delimited text format is opened, and assigned to an object, called chdage. The next step is to transfer the information in the file to a usable data structure in Python. Since this dataset contains three variables (id, age and chd, in that order on each line), it is convenient to use a dictionary. This allows each variable to be specified by name. First, a dictionary object is initialized, with appropriate keys and corresponding lists, initially empty:

vars = {'id':[], 'age':[], 'chd':[]}

It is then a matter of looping over each line of the data. Python file objects are iterable, essentially just a sequence of lines, and fit naturally into a for statement.

for line in chdage:
    id,age,chd = line.split()
    vars['id'].append(int(id))
    vars['age'].append(int(age))
    vars['chd'].append(int(chd))

For each line in the file, data elements are separated using the split method that is built in to string objects (called without arguments, as here, it splits on whitespace; a comma-delimited file would need line.split(',') instead). Each datum is subsequently appended to the appropriate list stored in the dictionary. After all of the data is parsed, it is polite to close the file:

chdage.close()

The data can now be readily accessed by indexing the appropriate variable by name:

>>> vars['age']
[20,23,24,25,25,26,26,28,…,69]
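The same pattern can be written with the with statement, which closes the file automatically. The sketch below is self-contained: it first writes a tiny stand-in file with made-up rows (the real chdage.dat is not reproduced here), then parses it. The names path, records and subject are illustrative.

```python
import os
import tempfile

# Write a tiny stand-in data file (made-up rows, whitespace-delimited)
path = os.path.join(tempfile.mkdtemp(), "example.dat")
with open(path, "w") as outfile:
    outfile.write("1 20 0\n2 23 0\n3 45 1\n")

records = {'id': [], 'age': [], 'chd': []}

# The with statement closes the file when the block ends,
# even if an error occurs while parsing
with open(path) as infile:
    for line in infile:
        subject, age, chd = line.split()
        records['id'].append(int(subject))
        records['age'].append(int(age))
        records['chd'].append(int(chd))
```

Using different names from vars and id also avoids shadowing two Python built-ins of the same names.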

A second approach to importing data involves interfacing directly with a relational database management system. Relational databases are far more efficient for storing, maintaining and querying data than plain text files or spreadsheets, particularly for large datasets or multiple tables. A number of packages exist for database access in Python. For example, sqlite3, which is included in the standard library, provides connectivity for SQLite databases:

>>> import sqlite3 # import database package, and connect
>>> con = sqlite3.connect("haart.sqlite")
>>> cur = con.cursor() # create a cursor object to mediate
                       # communication with database
>>> cur.execute("SELECT death, weight, aids FROM haartrecs WHERE male=='1'")  # run query
<sqlite3.Cursor at 0x10e4530a0>
>>> data = cur.fetchall() # fetch data, and assign to variable
>>> data
[(0, None, 0),
 (0, 58.0608, 0),
 (1, 48.0816, 1),
 (0, None, 0),
 (0, None, 0),
 (0, None, 0),
 (1, None, 1),
 (1, 57.0, 1),
 (0, 48.0, 1),
 (0, None, 1),
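Since the haart.sqlite file is not available here, the following sketch builds a small in-memory database with a made-up version of the haartrecs table (the column types and rows are assumptions) and runs a similar query. It also shows a parameterized query using the ? placeholder, and how NULL values arrive in Python as None.

```python
import sqlite3

# In-memory database holding a small, made-up haartrecs table
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute(
    "CREATE TABLE haartrecs "
    "(male INTEGER, death INTEGER, weight REAL, aids INTEGER)"
)
rows = [(1, 0, None, 0), (1, 1, 57.0, 1), (0, 0, 61.5, 0)]
cur.executemany("INSERT INTO haartrecs VALUES (?, ?, ?, ?)", rows)
con.commit()

# The ? placeholder lets sqlite3 substitute the value safely
cur.execute("SELECT death, weight, aids FROM haartrecs WHERE male = ?", (1,))
data = cur.fetchall()

# Missing (NULL) weights come back as None; drop them before computing
weights = [weight for (death, weight, aids) in data if weight is not None]

con.close()
```

Each row is returned as a tuple, so the result of fetchall can be processed with the same list and tuple tools described earlier.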

Let's conclude by coding a relatively sophisticated (and useful) function. Suppose we want to estimate the parameters of a linear regression model. Rather than shutting down Python and opening SAS, it is relatively easy to solve the problem in a general way using Python.

The objective of regression analysis is to specify an equation that will predict some response variable Y based on a set of predictor variables X. This is done by fitting parameter values β of a regression model using extant data for X and Y. This equation has the form:

$$Y = X\beta + \epsilon$$

where ε is a vector of errors. One way to fit this model is using the method of least squares, which, if you recall from undergraduate statistics, is given by:

$$\hat{\beta} = (X^T X)^{-1} X^T Y$$

We can write a function that calculates this estimate, with the help of some functions from other modules:

from numpy.linalg import inv
from numpy import transpose, array, dot

We will call this function solve, requiring the predictor and response variables as arguments. For simplicity, we will restrict the function to univariate regression, whereby only a single slope and intercept are estimated:

def solve(x,y):
    'Estimates regression coefficients from data'

The first step is to specify the design matrix. For this, we need to create a vector of ones (corresponding to the intercept term) and, along with x, create a 2 x n array, with one row per coefficient (that is, the transpose of the usual n x 2 design matrix):

'Add intercepts column'
X = array([[1]*len(x), x])

An array is a data structure from the NumPy package, similar to a list, but allowing for multiple dimensions. Next, we calculate the transpose of X, using another NumPy function, transpose:

Xt = transpose(X)

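Finally, we use the matrix multiplication function dot, also from NumPy, to calculate the matrix products that make up the estimate. Because X here holds one row per coefficient, X and Xt swap roles relative to the textbook formula: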

'Estimate betas'
b_hat = dot(inv(dot(X,Xt)), dot(X,y))

The inverse function, inv, is provided by the numpy.linalg module. Provided that the matrix being inverted is not singular (which would raise an exception), this yields estimates of the intercept and slope, as an array:

return b_hat

Here is solve in action:

>>> solve((10,5,10,11,14),(-4,3,0,23,0.6))
array([ 2.04380952, 0.24761905])
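For convenience, here are the fragments above assembled into a single function, followed by an independent check. The check uses numpy.polyfit, which is not part of the original walkthrough; for a degree-1 fit it returns the slope first and the intercept second, the reverse of the order returned by solve. This assumes NumPy is installed.

```python
from numpy import array, dot, transpose, polyfit
from numpy.linalg import inv

def solve(x, y):
    'Estimates regression coefficients from data'

    # Rows of X are the intercept row of ones and the predictor (2 x n)
    X = array([[1] * len(x), x])
    Xt = transpose(X)

    # Least squares estimate; X and Xt swap roles because X is 2 x n
    b_hat = dot(inv(dot(X, Xt)), dot(X, y))

    return b_hat

x_obs = (10, 5, 10, 11, 14)
y_obs = (-4, 3, 0, 23, 0.6)

b_hat = solve(x_obs, y_obs)

# Independent check: a first-degree polynomial fit of the same data
slope, intercept = polyfit(x_obs, y_obs, 1)
```

The two approaches should agree to within floating-point error, with b_hat[0] matching intercept and b_hat[1] matching slope.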


[1] Methods are just functions that are attributed to objects; functions are stand-alone.