User Tools

Site Tools


en:forward_sel_examples

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
en:forward_sel_examples [2019/04/06 08:01]
David Zelený [Option b) Use of ordiR2step function]
en:forward_sel_examples [2019/04/06 08:46]
David Zelený [Example 2: CCA with forward selection on data from Carpathian wetlands]
Line 565: Line 565:
  
 Selected variables are ''​Ca'',​ ''​conduct'',​ ''​Si'',​ ''​NH3'',​ ''​NO3'',​ ''​Mg'' ​ and ''​pH''​. Selected variables are ''​Ca'',​ ''​conduct'',​ ''​Si'',​ ''​NH3'',​ ''​NO3'',​ ''​Mg'' ​ and ''​pH''​.
 +
 +==== Example 2: CCA with forward selection on data from Carpathian wetlands ====
 +
 +This example is using the same data as the Example 1 above, namely composition of vascular plants in [[en:​data:​wetlands|Carpathian wetlands]] and chemical variables measured in fen water running through them. While in Example 1 we analysed the data using tb-RDA (RDA applied on data after Hellinger transformation),​ here we do this using CCA (note: no Hellinger transformation is done). Since the function ''​forward.sel''​ from ''​adespatial''​ package is using only RDA algorithm, we cannot use it here; instead, we will do the calculation using ''​ordiR2step''​ from ''​vegan'',​ which also implements variable selection with double stopping criteria: the selection is finished if the new variable to be selected is not significant at certain alpha level, or if the adjusted R<​sup>​2</​sup>​ explained by the model with this variable exceeds the adjusted R<​sup>​2</​sup>​ of the global model. ​
 +
 +<code rsplus>
 +# Carpathian wetlands - import data
 +vasc <- read.delim ('​https://​raw.githubusercontent.com/​zdealveindy/​anadat-r/​master/​data/​vasc_plants.txt',​ row.names = 1)
 +chem <- read.delim ('​https://​raw.githubusercontent.com/​zdealveindy/​anadat-r/​master/​data/​chemistry.txt',​ row.names = 1)
 +
 +# the last variable in the '​chem'​ dataset is '​slope',​ which is not a chemical variable - remove it:
 +chem1 <- chem[,-15]
 +
 +</​code>​
 +
 +We do similar steps as in the option b) of Example 1 above: create global model with CCA (including all variables) and test it's significance,​ and if it is significant,​ we will create also an empty model with only intercept and use the ''​ordiR2step''​ function with appropriate arguments:
 +
 +<code rsplus>
 +cca.all <- cca (vasc ~ ., data = chem1)
 +anova (cca.all) # P < 0.001 - yes, significant!
 +adjRsq <- RsquareAdj (cca.all)$adj.r.square ​ # adjR2 as stopping criteria used in ordiR2step
 +cca.0 <- cca (vasc ~ 1, data = chem1) # empty model only with intercept
 +</​code>​
 +
 +The function ''​ordiR2step''​ needs to know the variables which are being selected (this is defined by the empty model and the scope taken from the global model), the ''​direction''​ of the selection (''​forward'',​ ''​backward'',​ ''​both''​ - default is ''​both'',​ which means combined forward and backward), the threshold of adjusted R<​sup>​2</​sup>​ which should be used as stopping criteria (''​R2scope''​),​ number of ''​permutations''​ for the Monte Carlo permutation test applied on each variable (default is 499). Additionally,​ I set also argument ''​trace = FALSE''​ (default is ''​TRUE''​),​ which limits the verbal output of the function (as you can see above, the function is rather talkative).
 +
 +<code rsplus>
 +sel.osR2 <- ordiR2step (tb_rda.vasc.0,​ scope = formula (tb_rda.vasc.all),​ R2scope = adjR2.tbrda,​ direction = '​forward',​ permutations = 999, trace = FALSE)
 +sel.osR2
 +</​code>​
 +<​file>​
 +Call: cca(formula = vasc ~ Ca + conduct + Si + NH3 + NO3 + SO4 + Corg, data = chem1)
 +
 +              Inertia Proportion Rank
 +Total          3.0238 ​    ​1.0000 ​    
 +Constrained ​   0.6772 ​    ​0.2240 ​   7
 +Unconstrained ​ 2.3466 ​    ​0.7760 ​  62
 +Inertia is scaled Chi-square ​
 +
 +Eigenvalues for constrained axes:
 +  CCA1   ​CCA2 ​  ​CCA3 ​  ​CCA4 ​  ​CCA5 ​  ​CCA6 ​  ​CCA7 ​
 +0.3388 0.0942 0.0618 0.0569 0.0485 0.0418 0.0352 ​
 +
 +Eigenvalues for unconstrained axes:
 +    CA1     ​CA2 ​    ​CA3 ​    ​CA4 ​    ​CA5 ​    ​CA6 ​    ​CA7 ​    ​CA8 ​
 +0.17567 0.16047 0.12242 0.10303 0.09427 0.08952 0.08455 0.07697 ​
 +(Showing 8 of 62 unconstrained eigenvalues)
 +</​file>​
 +<code rsplus>
 +sel.osR2.cca$anova
 +</​code>​
 +<​file>​
 +                  R2.adj Df    AIC      F Pr(>​F) ​   ​
 ++ Ca            0.082170 ​ 1 385.90 7.1738 ​ 0.001 ***
 ++ conduct ​      ​0.095775 ​ 1 385.81 2.0268 ​ 0.001 ***
 ++ Si            0.108412 ​ 1 385.76 1.9704 ​ 0.003 ** 
 ++ NH3           ​0.118073 ​ 1 385.96 1.6923 ​ 0.005 ** 
 ++ NO3           ​0.125542 ​ 1 386.27 1.5578 ​ 0.009 ** 
 ++ SO4           ​0.131015 ​ 1 386.73 1.4028 ​ 0.029 *  ​
 ++ Corg          0.136553 ​ 1 387.17 1.3959 ​ 0.027 *  ​
 +<All variables>​ 0.142700
 +</​file>​
 +
 +The first five variables (Ca, conductivity,​ Si, NH3 and NO3) are the same as in tb-RDA analysis above, the last two differ. If we increase the number of permutations to 49,999 (to reduce the lowest P-value we can obtain - but this takes quite some time!) and adjust the P-values with Holm's correction for multiple testing issue, only the first four variables will be selected (compared to the first five in the case of tb-RDA):
 +<code rsplus>
 +sel.osR2.cca <- ordiR2step (cca.0, scope = formula (cca.all), direction = '​forward',​ R2scope = adjRsq.cca, permutations = 49999, trace = F)
 +sel.osR2.cca_adj <- sel.osR2.cca
 +sel.osR2.cca_adj$anova$`Pr(>​F)` <- p.adjust (sel.osR2.cca$anova$`Pr(>​F)`,​ method = '​holm',​ n = ncol (chem1))
 +sel.osR2.cca_adj$anova
 +</​code>​
 +<​file>​
 +                  R2.adj Df    AIC      F  Pr(>F)
 ++ Ca            0.082144 ​ 1 385.90 7.1738 0.00028 ***
 ++ conduct ​      ​0.095628 ​ 1 385.81 2.0268 0.00156 ** 
 ++ Si            0.108806 ​ 1 385.76 1.9704 0.00336 ** 
 ++ NH3           ​0.118293 ​ 1 385.96 1.6923 0.04752 *  ​
 ++ NO3           ​0.125796 ​ 1 386.27 1.5578 0.10720 ​   ​
 ++ SO4           ​0.130940 ​ 1 386.73 1.4028 0.30276 ​   ​
 ++ Mg            0.136065 ​ 1 387.19 1.3789 0.30276 ​   ​
 ++ Corg          0.141488 ​ 1 387.65 1.3550 0.30276 ​   ​
 +<All variables>​ 0.142191 ​                            
 +</​file>​
 +
 +You may have noticed that the use of ''​ordiR2step''​ function with CCA takes considerably longer than when applied on RDA (or tb-RDA as above). This is because in the case of CCA, [[en:​expl_var#​adjusted_r2|adjusted R2 needs to be calculated using the permutation method]] introduced by [[en:​references|Peres-Neto et al. (2006)]], while in the case of RDA the adjusted R<​sup>​2</​sup>​ can be calculated analytically by Ezekiel'​s formula. This also means that the resulting adjusted R<​sup>​2</​sup>​ values will slightly change between calculations,​ because the adjusted value calculation is using mean of R2 explained by randomized environmental variables (by default 1000 randomizations is used; this number can be increased by including the argument ''​R2permutations''​ in the ''​ordiR2step''​ function with higher number).
  
  
  
  
en/forward_sel_examples.txt · Last modified: 2019/04/06 08:46 by David Zelený