I wrote a post last year about what I understood and what I didn’t understand about phylogenetic regression. Since then, I have read a few papers on the topic, and I want to update the post.
As I wrote in the post, and as most readers probably know, the shared ancestry of species is a problem when we do a regression analysis on variables measured from multiple species. Phylogeny-aware regression is a method to correct for the bias introduced by the shared ancestry.
At first, I thought that phylogeny-aware regression methods reconstruct ancestral states. Looking at the plot of simulated ancestral characters in the post, it is easy to notice that there is correlated evolution between the traits. However, the PGLS (phylogenetic generalized least squares) method used in the post, at least, does not reconstruct ancestral states to correct for the effect of phylogeny. So, how does it do this difficult task?
According to a few papers about the method, PGLS uses a method called generalized least squares (GLS). GLS is an extension of the ordinary least squares (OLS) method for regression that relaxes some of OLS’s assumptions. OLS regression assumes, among other things, that the data points are independent and that their variances are equal. GLS can handle data that violate these assumptions.
The plot below is a simple simulation of species traits under Brownian motion. The simulation was repeated 25 times. The observed trait values are correlated between species 1 (Sp.1) and species 2 (Sp.2) because they are very closely related. This means that the data points are not independent, and ordinary (OLS) regression is not appropriate.
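For readers who want to try this kind of simulation, here is a rough sketch in plain Python. The four-species tree and its branch lengths are made up for illustration (they are not the ones behind the original plot), and the names `BRANCHES`, `PATHS`, `simulate_traits` and `pearson` are hypothetical helpers. The idea is that each branch contributes an independent normal step whose variance is proportional to its length, and a species' trait value is the sum of the steps along its path from the root.

```python
import math
import random

# Hypothetical tree (branch lengths invented for illustration):
#   (((Sp1:0.1, Sp2:0.1):0.4, Sp3:0.5):0.5, Sp4:1.0)
BRANCHES = {
    "stem123": 0.5,  # shared by Sp1, Sp2 and Sp3
    "stem12": 0.4,   # shared by Sp1 and Sp2 only
    "tip1": 0.1,
    "tip2": 0.1,
    "tip3": 0.5,
    "tip4": 1.0,     # Sp4 shares no branch with the others
}
# Branches on the path from the root to each species.
PATHS = {
    "Sp1": ["stem123", "stem12", "tip1"],
    "Sp2": ["stem123", "stem12", "tip2"],
    "Sp3": ["stem123", "tip3"],
    "Sp4": ["tip4"],
}


def simulate_traits(n_reps, sigma2=1.0, seed=1):
    """Simulate tip trait values under Brownian motion, n_reps times."""
    rng = random.Random(seed)
    traits = {sp: [] for sp in PATHS}
    for _ in range(n_reps):
        # Change along a branch: normal, mean 0, variance sigma2 * length.
        steps = {
            name: rng.gauss(0.0, math.sqrt(sigma2 * length))
            for name, length in BRANCHES.items()
        }
        for sp, path in PATHS.items():
            traits[sp].append(sum(steps[name] for name in path))
    return traits


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


traits = simulate_traits(25)
for a, b in [("Sp1", "Sp2"), ("Sp2", "Sp3"), ("Sp1", "Sp4")]:
    print(a, b, round(pearson(traits[a], traits[b]), 2))
```

With only 25 replicates the printed correlations are noisy. Increasing `n_reps` should bring them close to the values implied by the shared branch lengths of this made-up tree.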
GLS can correct for the dependence of the data points by estimating the correlations between them. When you know the phylogeny of the species, it provides a clue to the correlation structure between the species’ trait values. Under the assumption that the trait evolves randomly (e.g. by Brownian motion), closely related species have more similar trait values. For example, in the plot above, Sp.1 and Sp.2 have strongly correlated trait values because they are closely related, while Sp.1 and Sp.4 don’t because they don’t share any common history. The correlation between Sp.2 and Sp.3 is moderate. So, the structure of interdependence between species can be estimated from the species phylogeny, like the matrix below.
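Here is a minimal sketch of how such a matrix can be built. Under Brownian motion, the expected covariance between two species is proportional to the branch length they share from the root to their most recent common ancestor. The tree below is the same hypothetical one as before, with invented branch lengths, and `phylo_cov` is an illustrative helper, not a function from any package.

```python
# Hypothetical tree (branch lengths invented for illustration):
#   (((Sp1:0.1, Sp2:0.1):0.4, Sp3:0.5):0.5, Sp4:1.0)
BRANCHES = {
    "stem123": 0.5,
    "stem12": 0.4,
    "tip1": 0.1,
    "tip2": 0.1,
    "tip3": 0.5,
    "tip4": 1.0,
}
# Branches on the path from the root to each species.
PATHS = {
    "Sp1": ["stem123", "stem12", "tip1"],
    "Sp2": ["stem123", "stem12", "tip2"],
    "Sp3": ["stem123", "tip3"],
    "Sp4": ["tip4"],
}


def phylo_cov(paths, branches):
    """Covariance matrix: entry (i, j) is the branch length shared by i and j.

    The diagonal is each species' total root-to-tip distance.
    """
    species = list(paths)
    cov = []
    for a in species:
        row = []
        for b in species:
            shared = set(paths[a]) & set(paths[b])
            row.append(sum(branches[name] for name in shared))
        cov.append(row)
    return species, cov


species, C = phylo_cov(PATHS, BRANCHES)
for sp, row in zip(species, C):
    print(sp, [round(v, 2) for v in row])
```

For this made-up tree, every species is at distance 1 from the root, so the covariances can be read directly as correlations: about 0.9 for Sp.1 and Sp.2, 0.5 for Sp.2 and Sp.3, and 0 for any pair involving Sp.4.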
PGLS corrects for the dependence of data points by using this variance-covariance matrix, derived from the branch lengths of the species phylogeny. (I am not a serious statistician. So, please don’t ask about how exactly the GLS does this.)
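For the curious, the textbook GLS estimator is b = (X' C^-1 X)^-1 X' C^-1 y, where C is the variance-covariance matrix. When C is the identity matrix, this reduces to OLS. Below is a small sketch of that formula for one predictor plus an intercept. It is not how caper or any other package implements PGLS. The helpers `solve` and `gls_fit`, the matrix `C` (from the hypothetical tree above) and the trait values are all invented for illustration.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gls_fit(x, y, C):
    """Return (intercept, slope) from b = (X' C^-1 X)^-1 X' C^-1 y."""
    ones = [1.0] * len(x)
    # Apply C^-1 to each column of X and to y without forming the inverse.
    ci_ones = solve(C, ones)
    ci_x = solve(C, x)
    ci_y = solve(C, y)
    xt_ci_x = [
        [dot(ones, ci_ones), dot(ones, ci_x)],
        [dot(x, ci_ones), dot(x, ci_x)],
    ]
    xt_ci_y = [dot(ones, ci_y), dot(x, ci_y)]
    intercept, slope = solve(xt_ci_x, xt_ci_y)
    return intercept, slope


# Covariance matrix implied by the hypothetical four-species tree.
C = [
    [1.0, 0.9, 0.5, 0.0],
    [0.9, 1.0, 0.5, 0.0],
    [0.5, 0.5, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
IDENTITY = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

# Made-up trait values for Sp1..Sp4.
x = [1.0, 1.2, 2.0, 3.5]
y = [2.1, 2.3, 3.0, 5.2]

print("OLS (identity matrix):", gls_fit(x, y, IDENTITY))
print("GLS (phylogenetic matrix):", gls_fit(x, y, C))
```

The two fits differ because the phylogenetic matrix treats Sp.1 and Sp.2 as nearly one data point rather than two independent ones.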
In most cases when we analyse species data, we don’t know if there is any phylogenetic dependence in the data. So, we need to test whether the trait values are actually interdependent in the way expected from the phylogeny.
This is done by estimating the value of lambda, a parameter that multiplies the off-diagonal elements (the upper and lower triangles) of the variance-covariance matrix in the GLS regression. When this Pagel’s lambda is 1, the trait evolves exactly as expected under Brownian motion along the phylogeny. When lambda = 0, there is no correlation due to phylogeny. There are other statistics to measure phylogenetic dependence, such as Blomberg’s K.
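The lambda transformation itself is simple to sketch: the off-diagonal elements are multiplied by lambda and the diagonal is left alone. The snippet below only shows the transformation; it does not estimate lambda (in practice that is usually done by maximum likelihood). `lambda_transform` is a hypothetical helper, and `C` is the matrix from the made-up four-species tree.

```python
def lambda_transform(C, lam):
    """Scale the off-diagonal elements of C by lam; keep the diagonal."""
    n = len(C)
    return [
        [C[i][j] if i == j else lam * C[i][j] for j in range(n)]
        for i in range(n)
    ]


# Covariance matrix implied by the hypothetical four-species tree.
C = [
    [1.0, 0.9, 0.5, 0.0],
    [0.9, 1.0, 0.5, 0.0],
    [0.5, 0.5, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

for lam in (0.0, 0.5, 1.0):
    print("lambda =", lam)
    for row in lambda_transform(C, lam):
        print("  ", [round(v, 2) for v in row])
```

With lambda = 1 the matrix is unchanged (pure Brownian motion). With lambda = 0 only the diagonal remains, which corresponds to independent data points, as in an ordinary regression.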
This is how PGLS works on comparative species data. I hope I understand it correctly. If you are interested in this topic, the following papers may be interesting and useful.
Garland et al. (2005) Phylogenetic approaches in comparative physiology.
Orme (2013) Caper package: comparative analysis of phylogenetics and evolution in R.
Grafen (1989) The phylogenetic regression
(The phylogenetic regression by Grafen and PGLS are not exactly the same. So, the titles of my posts are a bit misleading.)
Freckleton et al. (2002) Phylogenetic Analysis and Comparative Data: A Test and Review of Evidence.