A couple of years ago, I wrote a post about the differences between R and Python. It mainly covered differences in the syntax of the two languages that I found interesting when translating my code.
However, I found that the post is not very informative for readers looking for a primer on “Python as an environment for statistical analysis” (which can be an alternative to R). So, I will write down what I have recently learned about statistical analysis using Python.
If you want to set up an R-like environment for interactive statistical analysis using Python, first you need to install the following libraries:
- numpy/scipy for numerical functions and better handling of vectors and matrices
- matplotlib for plotting
- pandas for dataframe and other data structures
- statsmodels for statistical modeling
The pandas library provides functions and classes for interactive data analysis, such as an R-like data frame, and statsmodels provides statistical models such as GLMs. matplotlib and numpy/scipy provide plotting, numerical calculation, and vector handling. They are mostly called from within pandas and statsmodels, but you can also call their functions directly.
Installing these libraries is relatively easy on a Unix-like operating system, but may be less so on Windows or Mac. However, there is plenty of documentation on how to install them on Mac or Windows.
In addition to the libraries, you should install IPython, a better interactive shell for Python. IPython provides much better command-line completion than the default Python shell. Without good command completion, interactive analysis becomes very stressful.
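Before starting, it can help to confirm that everything is importable. The following is a minimal sketch using only the standard library; the helper name `find_missing` and the `REQUIRED` list are made up for this illustration and are not part of any of the libraries.

```python
import importlib.util

# Libraries named in this post; IPython is the interactive shell.
REQUIRED = ["numpy", "scipy", "matplotlib", "pandas", "statsmodels", "IPython"]


def find_missing(names):
    """Return the names that cannot be located as importable modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All libraries are available.")
```

This only checks that each module can be found; it does not import the libraries or verify their versions.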
Armed with IPython and these libraries, you can do almost the same things as you can with R.
    import pandas
    iris = pandas.read_csv("./iris.txt", sep="\t")
    iris.head()
This code reads a tab-delimited file with read_csv and shows the first few lines of the table on the console. pandas.read_csv reads the table and returns a DataFrame object. As its name indicates, a DataFrame supports functions similar to those of data.frame in R. “head” is a DataFrame method that shows the first rows (five by default). You will see results like the following when you execute the code in IPython.
    In : import pandas
    In : iris = pandas.read_csv("./iris.txt", sep="\t")
    In : iris.head()
    Out:
       Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
    0           5.1          3.5           1.4          0.2  setosa
    1           4.9          3.0           1.4          0.2  setosa
    2           4.7          3.2           1.3          0.2  setosa
    3           4.6          3.1           1.5          0.2  setosa
    4           5.0          3.6           1.4          0.2  setosa

    [5 rows x 5 columns]
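If you do not have iris.txt at hand, you can try the same call on an in-memory table. The sketch below assumes the file layout implied by the output above (a header row followed by tab-separated values) and uses only the five rows printed there; the name `TSV` is chosen just for this example.

```python
import io

import pandas

# The five rows shown above, written out as tab-delimited text.
TSV = (
    "Sepal.Length\tSepal.Width\tPetal.Length\tPetal.Width\tSpecies\n"
    "5.1\t3.5\t1.4\t0.2\tsetosa\n"
    "4.9\t3.0\t1.4\t0.2\tsetosa\n"
    "4.7\t3.2\t1.3\t0.2\tsetosa\n"
    "4.6\t3.1\t1.5\t0.2\tsetosa\n"
    "5.0\t3.6\t1.4\t0.2\tsetosa\n"
)

# read_csv accepts a file-like object as well as a path.
iris = pandas.read_csv(io.StringIO(TSV), sep="\t")

print(iris.head())    # first rows of the table
print(iris.dtypes)    # column types inferred by pandas
print(iris.shape)     # (number of rows, number of columns)
```

Checking `dtypes` right after loading is a quick way to confirm that the numeric columns were parsed as numbers, much like calling `str()` on a data frame in R.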
You can access the columns of a data frame by name, as you can in R. Slicing with indices and selecting entries by condition are also supported.
    # select a column named "Species"
    iris["Species"]
    # select the first 5 elements of column "Petal.Length"
    iris["Petal.Length"][0:5]
    # select rows in which "Species" is equal to "virginica", then show the first 5 rows
    iris[iris["Species"]=="virginica"].head()
The commands above return the following output.
    In : iris["Species"]
    Out:
    0    setosa
    1    setosa
    2    setosa
    3    setosa
    4    setosa
    ...

    In : iris["Petal.Length"][0:5]
    Out:
    0    1.4
    1    1.4
    2    1.3
    3    1.5
    4    1.4
    Name: Petal.Length, dtype: float64

    In : iris[iris["Species"]=="virginica"].head()
    Out:
         Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
    100           6.3          3.3           6.0          2.5  virginica
    101           5.8          2.7           5.1          1.9  virginica
    102           7.1          3.0           5.9          2.1  virginica
    103           6.3          2.9           5.6          1.8  virginica
    104           6.5          3.0           5.8          2.2  virginica

    [5 rows x 5 columns]
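The same selections can also be written with the explicit `.iloc` (by position) and `.loc` (by label or condition) indexers, which some people find easier to read. This is only a sketch on a ten-row sample built from the rows printed above, not the full dataset; the variable names are chosen for illustration.

```python
import pandas

columns = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"]
rows = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (4.7, 3.2, 1.3, 0.2, "setosa"),
    (4.6, 3.1, 1.5, 0.2, "setosa"),
    (5.0, 3.6, 1.4, 0.2, "setosa"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (5.8, 2.7, 5.1, 1.9, "virginica"),
    (7.1, 3.0, 5.9, 2.1, "virginica"),
    (6.3, 2.9, 5.6, 1.8, "virginica"),
    (6.5, 3.0, 5.8, 2.2, "virginica"),
]
# Keep the row labels from the printed output (0-4 and 100-104).
iris = pandas.DataFrame(rows, columns=columns,
                        index=[0, 1, 2, 3, 4, 100, 101, 102, 103, 104])

# One column, returned as a Series.
species = iris["Species"]

# First five values by position, regardless of the row labels.
first_lengths = iris["Petal.Length"].iloc[0:5]

# Rows matching a condition.
virginica = iris.loc[iris["Species"] == "virginica"]

# Two conditions combined with &, keeping only two columns.
wide = iris.loc[
    (iris["Species"] == "virginica") & (iris["Petal.Width"] > 2.0),
    ["Petal.Length", "Petal.Width"],
]
print(wide)
```

Note that each condition needs its own parentheses when combined with `&`, because `&` binds more tightly than `==` and `>` in Python.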
Basic statistics are available as DataFrame methods.
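    # count the number of occurrences of each value
    iris["Species"].value_counts()
    # mean of each column
    # (recent pandas versions need iris.mean(numeric_only=True) to skip the "Species" column)
    iris.mean()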
These return the number of rows for each species and the mean of each numeric column.
    In : iris["Species"].value_counts()
    Out:
    setosa        50
    versicolor    50
    virginica     50
    dtype: int64

    In : iris.mean()
    Out:
    Sepal.Length    5.843333
    Sepal.Width     3.057333
    Petal.Length    3.758000
    Petal.Width     1.199333
    dtype: float64
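Per-group statistics, which in R you might compute with `aggregate` or `tapply`, can be obtained with `groupby`. The sketch below uses a ten-row sample taken from the rows printed earlier in this post, so its numbers differ from the full-dataset output above.

```python
import pandas

columns = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"]
rows = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (4.7, 3.2, 1.3, 0.2, "setosa"),
    (4.6, 3.1, 1.5, 0.2, "setosa"),
    (5.0, 3.6, 1.4, 0.2, "setosa"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (5.8, 2.7, 5.1, 1.9, "virginica"),
    (7.1, 3.0, 5.9, 2.1, "virginica"),
    (6.3, 2.9, 5.6, 1.8, "virginica"),
    (6.5, 3.0, 5.8, 2.2, "virginica"),
]
iris = pandas.DataFrame(rows, columns=columns)

# Number of rows per species.
counts = iris["Species"].value_counts()

# Column means, skipping the non-numeric "Species" column explicitly.
means = iris.mean(numeric_only=True)

# Mean of every numeric column within each species.
by_species = iris.groupby("Species").mean()

print(counts)
print(means)
print(by_species)
```

`iris.describe()` is also worth trying; it plays a role similar to `summary()` in R for the numeric columns.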
Also, there are several types of plotting methods.
    # box plot
    iris.boxplot(by="Species", column="Petal.Width")
    # scatter plot matrix
    # (older pandas versions exposed this as pandas.tools.plotting.scatter_matrix)
    pandas.plotting.scatter_matrix(iris)
The first line draws a box plot of “Petal.Width” grouped by “Species”. The second line draws a scatter plot matrix. Make sure that you start IPython with the “--matplotlib” option to enable plotting.
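Outside an interactive IPython session, for example in a script, the same plots can be written to an image instead of a window. The following is a sketch under a few assumptions: it selects matplotlib's non-interactive "Agg" backend, uses a ten-row sample taken from the rows printed earlier rather than the full dataset, and saves into an in-memory buffer (a file name such as "boxplot.png" would work in the same place).

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas

columns = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"]
rows = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (4.7, 3.2, 1.3, 0.2, "setosa"),
    (4.6, 3.1, 1.5, 0.2, "setosa"),
    (5.0, 3.6, 1.4, 0.2, "setosa"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (5.8, 2.7, 5.1, 1.9, "virginica"),
    (7.1, 3.0, 5.9, 2.1, "virginica"),
    (6.3, 2.9, 5.6, 1.8, "virginica"),
    (6.5, 3.0, 5.8, 2.2, "virginica"),
]
iris = pandas.DataFrame(rows, columns=columns)

# Box plot of Petal.Width by Species, drawn on an axes we create ourselves.
fig, ax = plt.subplots()
iris.boxplot(column="Petal.Width", by="Species", ax=ax)

# Save the figure as PNG data.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")

# Scatter plot matrix; only the numeric columns are plotted.
axes = pandas.plotting.scatter_matrix(iris)
print(axes.shape)
```

Creating the axes with `plt.subplots()` and passing it via `ax=` keeps a handle on the figure, which makes saving it straightforward.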
If the Python code above is translated into R, it looks like the following.
    iris <- read.table("./iris.txt", sep="\t", header=TRUE)
    head(iris)
    iris$Species
    iris$Petal.Length[1:5]
    head(iris[iris$Species=="virginica",])
    table(iris$Species)
    # colMeans needs numeric columns only, so drop Species
    colMeans(iris[, 1:4])
    boxplot(Petal.Width ~ Species, data=iris)
    pairs(iris)
The code in Python and R is very similar. pandas functions are almost like object-oriented versions of R functions. You can find the basic functions of pandas in its documentation. IPython + pandas appears to offer about as many functions for basic data manipulation and visualization as R does. So, how about statistical analysis such as regression modelling? I will write about that in the next post.