Phylogenetic regression has always confused me. I ran some simulations to understand it better, so here I will write down some of what I learnt from them, along with the things that are still unclear.
Let’s start with the basic ideas. A common practice in ecology and evolutionary biology is to correlate species traits and test whether one trait has an effect on another. Typical questions are “Do generalist herbivorous insects have larger range sizes than specialists?”, “Is pollen production different between dioecious and hermaphroditic plants?” and so on.
The left plot in the figure below shows a simulated species trait (a binary trait such as specialist/generalist) evolving along a phylogeny. Its evolutionary history is shown by the colors of the tree branches. Usually we don’t know this history; all we know is the trait values at the tips.
In the right plot of the figure, the values of a continuous character (which also evolves along the tree) are plotted against the binary character. What we want to know is: “Is there a significant difference in the continuous character (let’s call it val) associated with the discrete one (let’s call it type)?”
They look different, and a regression analysis shows a significant effect of type on val.
    > anova(lm(val~type, data=tab))
    Analysis of Variance Table

    Response: val
              Df  Sum Sq Mean Sq F value   Pr(>F)
    type       1  410.62  410.62  8.3126 0.007484 **
    Residuals 28 1383.13   49.40
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Looks great. We found a significant difference. A very good result for a thesis.
But actually, this is a false correlation. There is no effect of type on val in this simulated data. val simply evolved randomly under Brownian motion with an identical parameter set for both states, so there is no difference in means or variances between the two states of type. If you check the historical values of val, there is no clear pattern of increase or decrease associated with type.
Closely related species have similar values of val. Shared ancestry and historical contingency alone created the observed false correlation. This is why we need to consider the effect of phylogeny when we correlate species traits.
Once the effect of phylogeny is accounted for by using phylogenetic regression, the spurious correlation disappears: the effect of type is no longer significant at the 5% level. (I used the “pgls” function in the “caper” R package to do this.)
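To make the setup concrete, here is a minimal base-R sketch of this kind of null simulation. It is not the code behind the figures: the balanced 32-tip tree, the helper names `balanced_vcv` and `sim_bm`, and the trick of getting a binary trait by thresholding a second Brownian trait are all assumptions chosen to keep the example short.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Covariance matrix implied by Brownian motion on a balanced binary tree with
# 2^depth tips and all branch lengths equal to 1 (a hypothetical toy tree).
# Entry [i, j] is the branch length shared by tips i and j since the root.
balanced_vcv <- function(depth) {
  C <- matrix(1, 1, 1)  # one tip hanging on a branch of length 1
  # each step joins two copies of the subtree and adds a shared branch above them
  for (i in seq_len(depth)) C <- kronecker(diag(2), C) + 1
  C - 1  # drop the extra branch above the root
}

# One trait evolved by Brownian motion: a multivariate normal draw with covariance C.
sim_bm <- function(C) drop(t(chol(C)) %*% rnorm(nrow(C)))

C <- balanced_vcv(5)      # 32 tips
val <- sim_bm(C)          # continuous trait
liability <- sim_bm(C)    # second trait, simulated independently of val
# Stand-in for the binary trait: split the second trait at its median.
type <- factor(ifelse(liability > median(liability), "generalist", "specialist"))

# Ordinary regression that ignores the tree.
p_ols <- anova(lm(val ~ type))[["Pr(>F)"]][1]
print(p_ols)
```

Because `val` and `type` are generated independently, any significant p-value here is a false positive by construction.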
    > anova(pgls(val~type, data=dat, lambda="ML"))
    Analysis of Variance Table
    Sequential SS for pgls: lambda = 1.00, delta = 1.00, kappa = 1.00

    Response: val
              Df Sum Sq Mean Sq F value  Pr(>F)
    type       1 0.8052 0.80517  3.8147 0.06086 .
    Residuals 28 5.9099 0.21107
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
With the parameter set used here, ordinary regression erroneously rejected the null hypothesis in about 30% of simulations, while PGLS had a false rejection rate of just 3%.
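A comparison of this kind can be sketched in base R as follows. This is an illustration under assumptions, not a reproduction of the numbers above: the tree is a hypothetical balanced 32-tip tree, the binary trait is a thresholded Brownian trait, and the phylogenetic fit is plain generalized least squares with the Brownian covariance fixed (the equivalent of lambda = 1), whereas `pgls(..., lambda="ML")` estimates lambda from the data. The rates it prints will therefore differ from the 30% and 3% quoted here.

```r
set.seed(42)  # arbitrary seed

# Brownian-motion covariance for a balanced tree with 2^depth tips (toy tree).
balanced_vcv <- function(depth) {
  C <- matrix(1, 1, 1)
  for (i in seq_len(depth)) C <- kronecker(diag(2), C) + 1
  C - 1
}
sim_bm <- function(C) drop(t(chol(C)) %*% rnorm(nrow(C)))

# p-value for the effect of type, ignoring the tree.
p_ols <- function(val, type) summary(lm(val ~ type))$coefficients[2, 4]

# p-value for the effect of type from GLS with known covariance C.
# Dividing through by the Cholesky factor removes the phylogenetic
# correlation, so ordinary least squares on the transformed data is GLS.
p_gls <- function(val, type, C) {
  L  <- t(chol(C))               # C = L %*% t(L), L lower triangular
  X  <- model.matrix(~ type)     # intercept + indicator for type
  ys <- forwardsolve(L, val)
  Xs <- forwardsolve(L, X)
  summary(lm(ys ~ Xs - 1))$coefficients[2, 4]
}

C <- balanced_vcv(5)
nsim <- 500
res <- replicate(nsim, {
  val  <- sim_bm(C)
  liab <- sim_bm(C)                       # independent of val: the null is true
  type <- factor(liab > median(liab))     # 16 tips in each state
  c(ols = p_ols(val, type), gls = p_gls(val, type, C))
})

# Fraction of simulations in which each method rejects at the 5% level.
rates <- rowMeans(res < 0.05)
print(rates)
```

Splitting at the median keeps both states present in every replicate, which avoids a degenerate design matrix.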
OK. So, what if there IS a correlation?
In the plot above, val was simulated under a model in which the direction of evolution actually differs between the two colors. The boxplot does not look very different from the previous one, but if you look at the ancestral character values, the trend is clear.
The dark-blue points go upward more frequently than the light-blue ones, and the result of the PGLS is significant.
    > anova(pgls(val~type, data=dat, lambda="ML"))
    Analysis of Variance Table
    Sequential SS for pgls: lambda = 1.00, delta = 1.00, kappa = 1.00

    Response: val
              Df Sum Sq Mean Sq F value  Pr(>F)
    type       1 1.0897 1.08966   4.763 0.03763 *
    Residuals 28 6.4057 0.22877
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The statistical power to detect the difference depends strongly on the size of the difference and of the error. With the parameter set used for the simulation above, PGLS detected the difference in 40% of simulations. Naturally, the larger the difference and the smaller the error I set, the more frequently PGLS found the correct answer. Phylogenetic regression actually performs really well. Considering the high error rate of ordinary regression shown above, it is always safer to use a phylogeny-aware regression when you correlate species traits.
One point that still puzzles me is how PGLS can detect the difference.
The only inputs to PGLS are the characters of extant species. If you know the ancestral character states, as in the 2nd and 4th figures, you can see how the colors are associated with increases and decreases of the points. PGLS has no explicit ancestral reconstruction step, yet it can still find the difference. My guess is that correcting for phylogenetic dependency using the variance-covariance matrix is equivalent to ancestral character reconstruction. But I am not sure.
Maybe I need to go back to the original papers on PGLS.
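For reference, the generalized least squares estimate that underlies this kind of regression can be written as beta = (X' C^-1 X)^-1 X' C^-1 y, where y holds the tip values, X is the design matrix and C is the variance-covariance matrix implied by the tree. The base-R sketch below computes it two ways on a hypothetical 8-tip balanced tree with made-up data. It assumes a pure Brownian covariance and is not how `caper` is implemented internally.

```r
set.seed(7)  # arbitrary seed

# Brownian-motion covariance for a balanced tree with 2^depth tips (toy tree).
balanced_vcv <- function(depth) {
  C <- matrix(1, 1, 1)
  for (i in seq_len(depth)) C <- kronecker(diag(2), C) + 1
  C - 1
}

C    <- balanced_vcv(3)                        # 8 tips
val  <- drop(t(chol(C)) %*% rnorm(8))          # made-up tip values
type <- factor(rep(c("a", "b"), times = 4))    # made-up binary trait
X    <- model.matrix(~ type)

# 1) Direct GLS formula: only tip values and C go in.
Ci <- solve(C)
beta_direct <- drop(solve(t(X) %*% Ci %*% X, t(X) %*% Ci %*% val))

# 2) Same estimate by transforming the data with the Cholesky factor of C
#    and then running ordinary least squares.
L <- t(chol(C))
beta_whitened <- unname(coef(lm.fit(forwardsolve(L, X), forwardsolve(L, val))))

# Ordinary least squares on the raw data, for comparison.
beta_ols <- coef(lm(val ~ type))

print(rbind(direct = unname(beta_direct), whitened = beta_whitened))
print(beta_ols)
```

Neither route computes values at internal nodes explicitly; the tree enters only through C.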
A slightly better post about PGLS is here.
Well, my guess was wrong. I skimmed through Grafen’s original paper, and it says that phylogenetic regression is NOT equivalent to ancestral reconstruction. I will read the paper in detail.