User Tools

Site Tools


en:forward_sel_examples

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
Next revision Both sides next revision
en:forward_sel_examples [2019/04/06 08:01]
David Zelený [Option b) Use of ordiR2step function]
en:forward_sel_examples [2019/04/06 08:46]
David Zelený [Example 2: CCA with forward selection on data from Carpathian wetlands]
Line 565: Line 565:
  
 Selected variables are ''​Ca'',​ ''​conduct'',​ ''​Si'',​ ''​NH3'',​ ''​NO3'',​ ''​Mg'' ​ and ''​pH''​. Selected variables are ''​Ca'',​ ''​conduct'',​ ''​Si'',​ ''​NH3'',​ ''​NO3'',​ ''​Mg'' ​ and ''​pH''​.
 +
 +==== Example 2: CCA with forward selection on data from Carpathian wetlands ====
 +
 +This example is using the same data as the Example 1 above, namely composition of vascular plants in [[en:​data:​wetlands|Carpathian wetlands]] and chemical variables measured in fen water running through them. While in Example 1 we analysed the data using tb-RDA (RDA applied on data after Hellinger transformation),​ here we do this using CCA (note: no Hellinger transformation is done). Since the function ''​forward.sel''​ from ''​adespatial''​ package is using only RDA algorithm, we cannot use it here; instead, we will do the calculation using ''​ordiR2step''​ from ''​vegan'',​ which also implements variable selection with double stopping criteria: the selection is finished if the new variable to be selected is not significant at certain alpha level, or if the adjusted R<​sup>​2</​sup>​ explained by the model with this variable exceeds the adjusted R<​sup>​2</​sup>​ of the global model. ​
 +
 +<code rsplus>
 +# Carpathian wetlands - import data
 +vasc <- read.delim ('​https://​raw.githubusercontent.com/​zdealveindy/​anadat-r/​master/​data/​vasc_plants.txt',​ row.names = 1)
 +chem <- read.delim ('​https://​raw.githubusercontent.com/​zdealveindy/​anadat-r/​master/​data/​chemistry.txt',​ row.names = 1)
 +
 +# the last variable in the '​chem'​ dataset is '​slope',​ which is not a chemical variable - remove it:
 +chem1 <- chem[,-15]
 +
 +</​code>​
 +
 +We do similar steps as in the option b) of Example 1 above: create global model with CCA (including all variables) and test it's significance,​ and if it is significant,​ we will create also an empty model with only intercept and use the ''​ordiR2step''​ function with appropriate arguments:
 +
 +<code rsplus>
 +cca.all <- cca (vasc ~ ., data = chem1)
 +anova (cca.all) # P < 0.001 - yes, significant!
 +adjRsq <- RsquareAdj (cca.all)$adj.r.square ​ # adjR2 as stopping criteria used in ordiR2step
 +cca.0 <- cca (vasc ~ 1, data = chem1) # empty model only with intercept
 +</​code>​
 +
 +The function ''​ordiR2step''​ needs to know the variables which are being selected (this is defined by the empty model and the scope taken from the global model), the ''​direction''​ of the selection (''​forward'',​ ''​backward'',​ ''​both''​ - default is ''​both'',​ which means combined forward and backward), the threshold of adjusted R<​sup>​2</​sup>​ which should be used as stopping criteria (''​R2scope''​),​ number of ''​permutations''​ for the Monte Carlo permutation test applied on each variable (default is 499). Additionally,​ I set also argument ''​trace = FALSE''​ (default is ''​TRUE''​),​ which limits the verbal output of the function (as you can see above, the function is rather talkative).
 +
 +<code rsplus>
 +sel.osR2 <- ordiR2step (tb_rda.vasc.0,​ scope = formula (tb_rda.vasc.all),​ R2scope = adjR2.tbrda,​ direction = '​forward',​ permutations = 999, trace = FALSE)
 +sel.osR2
 +</​code>​
 +<​file>​
 +Call: cca(formula = vasc ~ Ca + conduct + Si + NH3 + NO3 + SO4 + Corg, data = chem1)
 +
 +              Inertia Proportion Rank
 +Total          3.0238 ​    ​1.0000 ​    
 +Constrained ​   0.6772 ​    ​0.2240 ​   7
 +Unconstrained ​ 2.3466 ​    ​0.7760 ​  62
 +Inertia is scaled Chi-square ​
 +
 +Eigenvalues for constrained axes:
 +  CCA1   ​CCA2 ​  ​CCA3 ​  ​CCA4 ​  ​CCA5 ​  ​CCA6 ​  ​CCA7 ​
 +0.3388 0.0942 0.0618 0.0569 0.0485 0.0418 0.0352 ​
 +
 +Eigenvalues for unconstrained axes:
 +    CA1     ​CA2 ​    ​CA3 ​    ​CA4 ​    ​CA5 ​    ​CA6 ​    ​CA7 ​    ​CA8 ​
 +0.17567 0.16047 0.12242 0.10303 0.09427 0.08952 0.08455 0.07697 ​
 +(Showing 8 of 62 unconstrained eigenvalues)
 +</​file>​
 +<code rsplus>
 +sel.osR2.cca$anova
 +</​code>​
 +<​file>​
 +                  R2.adj Df    AIC      F Pr(>​F) ​   ​
 ++ Ca            0.082170 ​ 1 385.90 7.1738 ​ 0.001 ***
 ++ conduct ​      ​0.095775 ​ 1 385.81 2.0268 ​ 0.001 ***
 ++ Si            0.108412 ​ 1 385.76 1.9704 ​ 0.003 ** 
 ++ NH3           ​0.118073 ​ 1 385.96 1.6923 ​ 0.005 ** 
 ++ NO3           ​0.125542 ​ 1 386.27 1.5578 ​ 0.009 ** 
 ++ SO4           ​0.131015 ​ 1 386.73 1.4028 ​ 0.029 *  ​
 ++ Corg          0.136553 ​ 1 387.17 1.3959 ​ 0.027 *  ​
 +<All variables>​ 0.142700
 +</​file>​
 +
 +The first five variables (Ca, conductivity,​ Si, NH3 and NO3) are the same as in tb-RDA analysis above, the last two differ. If we increase the number of permutations to 49,999 (to reduce the lowest P-value we can obtain - but this takes quite some time!) and adjust the P-values with Holm's correction for multiple testing issue, only the first four variables will be selected (compared to the first five in the case of tb-RDA):
 +<code rsplus>
 +sel.osR2.cca <- ordiR2step (cca.0, scope = formula (cca.all), direction = '​forward',​ R2scope = adjRsq.cca, permutations = 49999, trace = F)
 +sel.osR2.cca_adj <- sel.osR2.cca
 +sel.osR2.cca_adj$anova$`Pr(>​F)` <- p.adjust (sel.osR2.cca$anova$`Pr(>​F)`,​ method = '​holm',​ n = ncol (chem1))
 +sel.osR2.cca_adj$anova
 +</​code>​
 +<​file>​
 +                  R2.adj Df    AIC      F  Pr(>F)
 ++ Ca            0.082144 ​ 1 385.90 7.1738 0.00028 ***
 ++ conduct ​      ​0.095628 ​ 1 385.81 2.0268 0.00156 ** 
 ++ Si            0.108806 ​ 1 385.76 1.9704 0.00336 ** 
 ++ NH3           ​0.118293 ​ 1 385.96 1.6923 0.04752 *  ​
 ++ NO3           ​0.125796 ​ 1 386.27 1.5578 0.10720 ​   ​
 ++ SO4           ​0.130940 ​ 1 386.73 1.4028 0.30276 ​   ​
 ++ Mg            0.136065 ​ 1 387.19 1.3789 0.30276 ​   ​
 ++ Corg          0.141488 ​ 1 387.65 1.3550 0.30276 ​   ​
 +<All variables>​ 0.142191 ​                            
 +</​file>​
 +
 +You may have noticed that the use of ''​ordiR2step''​ function with CCA takes considerably longer than when applied on RDA (or tb-RDA as above). This is because in the case of CCA, [[en:​expl_var#​adjusted_r2|adjusted R2 needs to be calculated using the permutation method]] introduced by [[en:​references|Peres-Neto et al. (2006)]], while in the case of RDA the adjusted R<​sup>​2</​sup>​ can be calculated analytically by Ezekiel'​s formula. This also means that the resulting adjusted R<​sup>​2</​sup>​ values will slightly change between calculations,​ because the adjusted value calculation is using mean of R2 explained by randomized environmental variables (by default 1000 randomizations is used; this number can be increased by including the argument ''​R2permutations''​ in the ''​ordiR2step''​ function with higher number).
  
  
  
  
en/forward_sel_examples.txt · Last modified: 2020/05/03 18:10 by David Zelený