from R to Python (2): libraries for statistical analysis

A couple of years ago, I wrote a post about differences between R and Python. It mainly covered differences in the syntax of the two languages, which I found interesting when I translated my code.

However, I found that the post is not very informative for readers looking for a primer on “Python as an environment for statistical analysis” (one that can be an alternative to R). So, I will write down what I have recently learned about statistical analysis using Python.

If you want to set up an R-like environment for interactive statistical analysis in Python, you first need to install the following libraries: pandas, statsmodels, matplotlib, numpy and scipy.

The pandas library provides functions and classes for interactive data analysis, for example an R-like data frame, and statsmodels provides statistical models such as GLM. matplotlib is required for plotting, and numpy/scipy for numerical calculations and vector handling. They are mainly called from pandas and statsmodels, but you can also call their functions directly.
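Before going further, it can help to check which of these libraries are already importable. The following is a minimal sketch that uses only the standard library; the helper name check_libraries is made up for this illustration.

```python
import importlib.util

# Top-level module names of the libraries mentioned in the text.
LIBRARIES = ["pandas", "statsmodels", "matplotlib", "numpy", "scipy"]


def check_libraries(names):
    """Map each module name to True if Python can locate it, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Report the status of each library without importing it.
for name, found in check_libraries(LIBRARIES).items():
    print(name, "OK" if found else "missing")
```

Anything reported as missing still has to be installed before the examples in this post will run.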

Installing these libraries is relatively easy on a Unix-like operating system, but may be less so on Windows or Mac. However, there is plenty of documentation on how to install them on Mac or Windows.

In addition to the libraries, you should install IPython, a better interactive shell for Python. IPython provides much better command-line completion than the default Python shell. Without good command completion, interactive analysis becomes very stressful.

Armed with IPython and these libraries, you can do almost the same things as you can with R.

import pandas

iris = pandas.read_csv("./iris.txt", sep="\t")

iris.head()

The code above reads a tab-delimited file with read_csv and shows the first few lines of the table on the console. pandas.read_csv reads the table and returns a DataFrame object. As its name indicates, DataFrame supports functions similar to those of data.frame in R. “head” is a method of the DataFrame which shows its first few rows. You will see results like those below when you execute the code in IPython.

In [1]: import pandas

In [2]: iris = pandas.read_csv("./iris.txt", sep="\t")

In [3]: iris.head()
Out[3]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0        5.1         3.5          1.4         0.2  setosa
1        4.9         3.0          1.4         0.2  setosa
2        4.7         3.2          1.3         0.2  setosa
3        4.6         3.1          1.5         0.2  setosa
4        5.0         3.6          1.4         0.2  setosa

[5 rows x 5 columns]
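If you do not have iris.txt at hand, the same call can be tried on text held in memory. This is only a small sketch: the five rows are copied from the output above, and wrapping the text in io.StringIO is a convenience chosen for this illustration.

```python
import io

import pandas

# The five rows shown above, written as tab-delimited text with a header line.
text = (
    "Sepal.Length\tSepal.Width\tPetal.Length\tPetal.Width\tSpecies\n"
    "5.1\t3.5\t1.4\t0.2\tsetosa\n"
    "4.9\t3.0\t1.4\t0.2\tsetosa\n"
    "4.7\t3.2\t1.3\t0.2\tsetosa\n"
    "4.6\t3.1\t1.5\t0.2\tsetosa\n"
    "5.0\t3.6\t1.4\t0.2\tsetosa\n"
)

# read_csv accepts a file-like object as well as a path.
iris_head = pandas.read_csv(io.StringIO(text), sep="\t")

print(type(iris_head).__name__)
print(iris_head.shape)
print(iris_head.dtypes)
```

The dtypes line shows that the four measurement columns are read as floating-point numbers, while "Species" is kept as text.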

You can access columns of a data frame by name, as you can in R. Slicing with indices and selecting entries by condition are also supported.

#select a column named "Species"
iris["Species"]

#select first 5 elements of column "Petal.Length"
iris["Petal.Length"][0:5]

#select rows for which "Species" is equal to "virginica", then show first 5 rows
iris[iris["Species"]=="virginica"].head()

The commands above return the following output.

In [4]: iris["Species"]
Out[4]:
0 setosa
1 setosa
2 setosa
3 setosa
4 setosa
...

In [5]: iris["Petal.Length"][0:5]
Out[5]:
0 1.4
1 1.4
2 1.3
3 1.5
4 1.4
Name: Petal.Length, dtype: float64

In [6]: iris[iris["Species"]=="virginica"].head()
Out[6]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
100      6.3         3.3          6.0         2.5 virginica
101      5.8         2.7          5.1         1.9 virginica
102      7.1         3.0          5.9         2.1 virginica
103      6.3         2.9          5.6         1.8 virginica
104      6.5         3.0          5.8         2.2 virginica

[5 rows x 5 columns]
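The same kinds of selection can be tried on a small hand-typed table, which makes it easier to see what each expression returns. The frame df below is made up for this sketch (its values are taken from rows shown above), and .iloc is the explicit position-based indexer in pandas, used here in place of plain slicing.

```python
import pandas

# A tiny stand-in for the iris table.
df = pandas.DataFrame({
    "Petal.Length": [1.4, 1.4, 6.0, 5.1],
    "Species": ["setosa", "setosa", "virginica", "virginica"],
})

# Selecting a column by name returns a Series.
species = df["Species"]

# First two elements of a column, selected by position.
first_two = df["Petal.Length"].iloc[0:2]

# A comparison yields a boolean Series, which then works as a row filter.
mask = df["Species"] == "virginica"
virginica = df[mask]

print(first_two.tolist())
print(mask.tolist())
print(virginica)
```

Note that Python slices start at 0 and exclude the end position, so [0:5] in pandas corresponds to [1:5] in R.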

Basic statistics are available as methods of the DataFrame.

#count the number of occurrences
iris["Species"].value_counts()

#mean of each numeric column (pandas 2.0 and later need iris.mean(numeric_only=True))
iris.mean()

The counts of each species and the means of the columns will be returned.

In [7]: iris["Species"].value_counts()
Out[7]: 
setosa        50
versicolor    50
virginica     50
dtype: int64

In [8]: iris.mean()
Out[8]: 
Sepal.Length    5.843333
Sepal.Width     3.057333
Petal.Length    3.758000
Petal.Width     1.199333
dtype: float64
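As a small self-contained check of these two methods, here is a sketch on a hand-typed table. The frame df is made up for the illustration, and numeric_only=True is passed explicitly so that the text column "Species" is skipped regardless of the pandas version.

```python
import pandas

# A tiny stand-in for the iris table.
df = pandas.DataFrame({
    "Petal.Width": [0.2, 0.2, 2.5, 1.9],
    "Species": ["setosa", "setosa", "virginica", "virginica"],
})

# Number of rows for each distinct value of "Species".
counts = df["Species"].value_counts()

# Mean of every numeric column; "Species" is left out.
means = df.mean(numeric_only=True)

print(counts)
print(means)
```

Both results are Series objects, so individual entries can be read by label, for example counts["setosa"] or means["Petal.Width"].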

Also, there are several types of plotting methods.

#boxplot
iris.boxplot(by="Species", column="Petal.Width")

#scatter plot matrix (before pandas 0.20 this was pandas.tools.plotting.scatter_matrix)
pandas.plotting.scatter_matrix(iris)

The first command draws a boxplot of “Petal.Width” by “Species”. The second draws a scatter plot matrix. Make sure that you start IPython with the “--matplotlib” option to enable plotting.

[Figures: boxplot of Petal.Width by Species; scatter plot matrix of the iris data]
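Outside an interactive IPython session, the same two plots can be drawn from a plain script. The sketch below makes several assumptions: it uses a small hand-typed table instead of the full iris data, it selects the non-interactive "Agg" backend so that no display is needed, and it imports scatter_matrix from pandas.plotting, which is where recent pandas versions keep it.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: render off-screen, no window is opened

import matplotlib.pyplot as plt
import pandas
from pandas.plotting import scatter_matrix

# A tiny stand-in for the iris table.
df = pandas.DataFrame({
    "Petal.Length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "Petal.Width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "Species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Boxplot of one column, grouped by the values of another.
df.boxplot(by="Species", column="Petal.Width")
boxplot_figure = plt.gcf()

# Scatter plot matrix of the numeric columns; returns a 2-D array of axes.
axes = scatter_matrix(df[["Petal.Length", "Petal.Width"]])

print(axes.shape)
```

To keep the pictures, each figure can be written to a file with its savefig method, for example boxplot_figure.savefig("boxplot.png"), where the file name is only an example.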

Translated into R, the Python code above looks like the following.

iris <- read.table("./iris.txt", sep="\t", header=T)

head(iris)

iris$Species
iris$Petal.Length[1:5]
head(iris[iris$Species=="virginica",])

table(iris$Species)
#Species is not numeric, so take only the four measurement columns
colMeans(iris[, 1:4])

boxplot(Petal.Width ~ Species, data=iris)
pairs(iris)

[Figures: the same boxplot and scatter plot matrix drawn in R]

The code in Python and R is very similar. pandas functions are almost like an object-oriented version of R functions. You can find the most basic pandas functions in its documentation. It appears that IPython+pandas has as many functions for basic data manipulation and visualization as R has. So, how about statistical analysis like regression modelling? I will write about that in the next post.
