
This shows you the differences between two versions of the page.


en:pca_examples [2019/02/27 16:13] David Zelený [Example 3: Evaluation of importance of ordination axes in PCA] |
en:pca_examples [2020/03/25 21:34] (current) David Zelený |
||
---|---|---|---|

Line 15: | Line 15: | ||

grasslands.spe.log <- log1p (grasslands.spe) | grasslands.spe.log <- log1p (grasslands.spe) | ||

+ | |||

+ | library (vegan) | ||

decorana (grasslands.spe.log) | decorana (grasslands.spe.log) | ||

</code> | </code> | ||

Line 128: | Line 130: | ||

</code> | </code> | ||

- | Total variation of the whole dataset is 14 in this case, and the first axis explains 31.3% of total variation (see the row ''Proportion Explained'', or calculate it as the ratio between eigenvalue of the first PCA axis and total variance, 4.3861/14 = 0.313)((Note difference here between ''vegan'' and CANOCO 5: while ''vegan'' reports unscaled eigenvalues, CANOCO 5 directly reports eigenvalues scaled in the way that their sum equals to one, not total variation; in case of CANOCO 5 you may directly read the percentage variation explained by individual axes from the eigenvalues reported in Summary by multiplying them by 100.)). Total variation is a sum of variations of each variable in the analyzed matrix - in this case, all variables have been standardized to zero mean and unit variance (mean = 0, sd = 1), and there are 14 variables, so total variation (inertia) is 14: | + | Total variation of the whole dataset is 14 in this case, and the first axis explains 31.3% of total variation (see the row ''Proportion Explained'', or calculate it as the ratio between the eigenvalue of the first PCA axis and total variance, 4.3861/14 = 0.313)((Note the difference here between ''vegan'' and CANOCO 5: while ''vegan'' reports unscaled eigenvalues, CANOCO 5 directly reports eigenvalues scaled so that their sum equals one, not total variation; in CANOCO 5 you may read the percentage of variation explained by individual axes directly from the eigenvalues reported in Summary by multiplying them by 100.)). Total variation is the sum of the variances of all variables in the analyzed matrix - in this case, all variables have been standardized to zero mean and unit variance (mean = 0, sd = 1), and there are 14 variables, so total variation (inertia) is 14: |

<code rsplus> | <code rsplus> | ||

stand.chem <- scale (chem) | stand.chem <- scale (chem) | ||

Line 183: | Line 185: | ||

</code> | </code> | ||

- | Magnesium, calcium and condictivity have high loadings to the first axis, while potassium (K), silica (Si) and iron (Fe) to the second). | + | Magnesium, calcium and conductivity have high loadings on the first axis, while potassium (K), silica (Si) and iron (Fe) have high loadings on the second. |

- | Note that in this specific case, when we are analyzing dataset of environmental variables, data had to be standardized, either ahead of analysis (e.g. by applying ''scale (chem)'' or ''decostand (chem, method = 'standardize')''), or by seting the argument ''scale = TRUE'' in the function ''rda''. In this way, all variables have the same units and variance; otherwise, the variables with large values will have too high influence in the analysis. To draw the diagrams, you can use function ''biplot'', which is drawing arrows for species (note that function ''ordiplot'' draws both species/sample scores as centroids): | + | Note that in this specific case, when we are analyzing a dataset of environmental variables, the data had to be standardized, either ahead of analysis (e.g. by applying ''scale (chem)'' or ''decostand (chem, method = 'standardize')''), or by setting the argument ''scale = TRUE'' in the function ''rda''. In this way, all variables have the same units and variance; otherwise, the variables with large values would have too much influence on the analysis. To draw the diagrams, you can use the function ''biplot'', which draws arrows for species (note that the function ''ordiplot'' draws both species/sample scores as centroids): |

<code rsplus> | <code rsplus> | ||

biplot (PCA, display = 'species', scaling = 'species') | biplot (PCA, display = 'species', scaling = 'species') | ||

Line 202: | Line 204: | ||

- | The left figure is for scaling 1 (focus on distances among plots), the right one for scaling 2 (focus on the correlation among species/variables, which is reflected in the angle of particular vectors). The circle in the left figure is so called circle of equilibrium contribution - the variables with vectors longer than the radius of the circle could be interpreted with confidence as important for given combination of axes ((Daniel Borcard [[http://biol09.biol.umontreal.ca/ULaval08/Chapitre_4a.pdf|explains (p. 6)]] what does equilibrium contribution circle means://... it is possible to draw, on a plane made of two principal components, a circle representing the equilibrium contribution of the variables. Equilibrium contribution is the length that a descriptor-vector would have if it contributed equally to all the dimensions (principal axes) of the PCA. Variables that contribute little to a given reduced space (say, the 1×2 plane) have vectors that are shorter than the radius of the equilibrium contribution circle. Variables that contribute more have vectors whose lengths exceed the radius of that circle. The circle has a radius equal to √(d/p), where d equals the number of dimensions of the reduced space considered (usually d=2) and p equals the total number of descriptors (and hence of principal components) in the analysis.//)). In this case, first axis represents so-called poor-rich gradient((Hájek et al. 2002 comment it in their [[http://link.springer.com/article/10.1007%2FBF02804232|paper]] in the following way: //Water calcium and magnesium concentrations, pH and conductivity as well as the soil organic carbon content were the properties measured that showed the strongest correlation with the main vegetation gradient (the poor-rich gradient).//)). | + | The left figure is for scaling 1 (focus on distances among plots), the right one for scaling 2 (focus on the correlation among species/variables, which is reflected in the angle of particular vectors). 
The circle in the left figure is the so-called circle of equilibrium contribution - the variables with vectors longer than the radius of the circle can be interpreted with confidence as important for the given combination of axes ((Daniel Borcard [[http://biol09.biol.umontreal.ca/ULaval08/Chapitre_4a.pdf|explains (p. 6)]] what the equilibrium contribution circle means: //... it is possible to draw, on a plane made of two principal components, a circle representing the equilibrium contribution of the variables. Equilibrium contribution is the length that a descriptor-vector would have if it contributed equally to all the dimensions (principal axes) of the PCA. Variables that contribute little to a given reduced space (say, the 1×2 plane) have vectors that are shorter than the radius of the equilibrium contribution circle. Variables that contribute more have vectors whose lengths exceed the radius of that circle. The circle has a radius equal to √(d/p), where d equals the number of dimensions of the reduced space considered (usually d=2) and p equals the total number of descriptors (and hence of principal components) in the analysis.//)). In this case, the first axis represents the so-called poor-rich gradient((Hájek et al. 2002 comment on it in their [[http://link.springer.com/article/10.1007%2FBF02804232|paper]] in the following way: //Water calcium and magnesium concentrations, pH and conductivity as well as the soil organic carbon content were the properties measured that showed the strongest correlation with the main vegetation gradient (the poor-rich gradient).//)). |
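The radius formula quoted in the footnote, √(d/p), can be checked with a short sketch in base R. Note the assumptions: the data are simulated (not the Carpathian wetland ''chem'' dataset), ''prcomp'' is used instead of ''rda'', and the variable vectors are taken as the unscaled eigenvectors, so the radius is not multiplied by the plotting constant that ''vegan'' applies to its scores.

```r
# Simulated stand-in for 14 standardized variables (NOT the real chem data)
set.seed (1)
p <- 14
X <- scale (matrix (rnorm (50 * p), ncol = p))

# Eigenvectors (variable loadings); the p x p matrix is orthonormal,
# so every variable vector has length 1 in the full p-dimensional space
U <- prcomp (X)$rotation

# Radius of the circle of equilibrium contribution for a 2-D plane (d = 2)
d <- 2
radius <- sqrt (d / p)
round (radius, 3)   # 0.378 for d = 2, p = 14

# Length of each variable vector projected onto the plane of axes 1 and 2
len12 <- sqrt (rowSums (U[, 1:d]^2))

# Variables whose vectors exceed the radius contribute more than
# "equally to all dimensions" to this plane
which (len12 > radius)
```

The radius is the root-mean-square length of the projected vectors: the squared lengths in the first two dimensions always average to d/p, so some vectors must be longer and some shorter than the circle.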

** What if we did not standardise the variables?** | ** What if we did not standardise the variables?** | ||

Line 284: | Line 286: | ||

==== Example 3: Evaluation of importance of ordination axes in PCA ==== | ==== Example 3: Evaluation of importance of ordination axes in PCA ==== | ||

- | This example uses environmental variables from Carpathians wetlands as above. It illustrates how to decide which PCA axis or axes should be used for interpretation of results. You need the define the function ''evplot'' first (written by Francois Gillet, definition [[en:numecolr|here]], in the script below done using ''source'' method directly from this website). | + | This example uses environmental variables from Carpathian wetlands as above. It illustrates how to decide which PCA axis or axes should be used for interpretation of results. You need to define the function ''evplot'' first (written by Francois Gillet; the definition is [[en:numecolr|here]], and the script below loads it directly from this website using ''source''). |
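Two criteria commonly used for this decision are the Kaiser-Guttman criterion (keep axes whose eigenvalue exceeds the mean eigenvalue) and the broken-stick model (keep axes that explain more than expected if total variation were divided at random). The sketch below computes both by hand in base R; it is an illustration on simulated data using ''prcomp'', not the code of ''evplot'', and the assumption that ''evplot'' displays these same two criteria should be checked against its definition.

```r
# Simulated stand-in for 14 standardized variables (NOT the real chem data)
set.seed (42)
p <- 14
X <- scale (matrix (rnorm (60 * p), ncol = p))

# Eigenvalues of the PCA; for standardized data they sum to p
ev <- prcomp (X)$sdev^2

# Kaiser-Guttman criterion: axes with eigenvalue above the average
kaiser.guttman <- ev > mean (ev)

# Broken-stick model: expected proportion for the k-th axis is
# (1/p) * sum (1/i) for i = k, ..., p
broken.stick <- rev (cumsum (1 / (p:1))) / p

# Axes explaining more than the broken-stick expectation
above.stick <- ev / sum (ev) > broken.stick

data.frame (eigenvalue = round (ev, 3),
            prop.explained = round (ev / sum (ev), 3),
            broken.stick = round (broken.stick, 3),
            kaiser.guttman, above.stick)
```

With purely random data, as simulated here, few or no axes should pass the broken-stick test, which is the more conservative of the two criteria.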

<code rsplus> | <code rsplus> | ||

Line 297: | Line 299: | ||

PCA <- rda (stand.chem) | PCA <- rda (stand.chem) | ||

- | # Finally, in the PCA object select the component $eig with vector of eigenvalues: | + | # Finally, in the PCA object select the component $eig with the vector of eigenvalues: |

ev <- PCA$CA$eig | ev <- PCA$CA$eig | ||

Line 406: | Line 408: | ||

==== Example 4: tb-PCA on species data pre-transformed using Hellinger transformation ==== | ==== Example 4: tb-PCA on species data pre-transformed using Hellinger transformation ==== | ||

- | In this example we will use vegetation data from [[en:data:vltava|Vltava river valley dataset]] and analyse them by transformation-based version of principal component analysis, meaning after pre-transformation by Hellinger transformation. | + | In this example, we will use vegetation data from [[en:data:vltava|Vltava river valley dataset]] and analyse them by the transformation-based version of principal component analysis, meaning after pre-transformation by Hellinger transformation. |
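What the pre-transformation does can be sketched in base R: the Hellinger transformation divides each abundance by its sample (row) total and takes the square root, and PCA is then calculated on the transformed matrix. The species matrix below is simulated and its object names are hypothetical; it is not the Vltava dataset. In ''vegan'', the same transformation is available as ''decostand (x, method = 'hellinger')''.

```r
# Simulated community matrix: 10 plots x 6 species (NOT the Vltava data)
set.seed (7)
spe <- matrix (rpois (10 * 6, lambda = 3), nrow = 10,
               dimnames = list (paste0 ("plot", 1:10), paste0 ("sp", 1:6)))
spe[, 1] <- spe[, 1] + 1   # guarantees that no plot is completely empty

# Hellinger transformation: square root of relative abundances per plot
# (dividing a matrix by rowSums divides each row by its own total)
spe.hell <- sqrt (spe / rowSums (spe))

# After the transformation, every plot vector has unit length
rowSums (spe.hell^2)

# tb-PCA: ordinary PCA (centered, not standardized) on transformed data
tbPCA <- prcomp (spe.hell, center = TRUE, scale. = FALSE)

# Proportion of variation explained by individual axes
round (tbPCA$sdev^2 / sum (tbPCA$sdev^2), 3)
```

Because every transformed plot vector has length 1, Euclidean distances among plots in the transformed space equal Hellinger distances, which is why linear PCA becomes usable for species composition data.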

<code rsplus> | <code rsplus> |

en/pca_examples.txt · Last modified: 2020/03/25 21:34 by David Zelený