User Tools

Site Tools


en:monte_carlo_examples

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
en:monte_carlo_examples [2018/04/22 10:44]
David Zelený
en:monte_carlo_examples [2019/02/10 16:04] (current)
David Zelený
Line 1: Line 1:
-====== ​Constrained ordination ​====== +====== ​Ordination analysis ​====== 
-===== Monte Carlo permutation test =====+===== Monte Carlo permutation test (constrained ordination) ​===== 
 [[{|width: 7em; background-color:​ white; color: navy}monte_carlo|Theory]] [[{|width: 7em; background-color:​ white; color: navy}monte_carlo|Theory]]
 [[{|width: 7em; background-color:​ white; color: navy}monte_carlo_R|R functions]] [[{|width: 7em; background-color:​ white; color: navy}monte_carlo_R|R functions]]
Line 8: Line 9:
  
  
-==== Example 1: Is the variance explained by environmental variables significant?​ ==== 
- 
-This example directly follows the [[en:​rda_examples|Example 1 in RDA & tb-RDA section]] and [[en:​expl_var_examples|Example 1 in Explained variance section]], consider checking them first. We used Vltava river valley dataset, and two field-measured environmental variables, soil pH and soil depth (''​pH''​ and ''​SOILDPT''​),​ to explain variance in species composition. We found that they can explain 8.9% of overall variance; when compared to the variance which can be maximally explained by two explanatory variables (21.7% explained by two tb-PCA axes, see here) this sounds not bad (it is more than 40% of variance we can maximally explain in this dataset with two variables). But is the result significant?​ By significant,​ I mean: is the variance considerably higher than the variance explained (in average) by two random variables not related to species composition?​ This is the task for Monte Carlo permutation test (check Theory part to see how it works). 
- 
-First, get the data and calculate tb-RDA on them (this is esentially repeating beginning of [[en:​expl_var_examples|Example 1 in the section Explained variance]]):​ 
-<code rsplus> 
-vltava.spe <- read.delim ('​https://​raw.githubusercontent.com/​zdealveindy/​anadat-r/​master/​data/​vltava-spe.txt',​ row.names = 1) 
-vltava.env <- read.delim ('​https://​raw.githubusercontent.com/​zdealveindy/​anadat-r/​master/​data/​vltava-env.txt'​) 
-spe <- vltava.spe 
-env <- vltava.env[,​ c('​pH',​ '​SOILDPT'​)] 
-library (vegan) 
-spe.log <- log1p (spe)  
-spe.hell <- decostand (spe.log, '​hell'​) 
-tbRDA <- rda (spe.hell ~ pH + SOILDPT, data = env) 
-R2.obs <- RsquareAdj (tbRDA)$r.squared 
-R2.obs 
-</​code>​ 
-<​code>​ 
-[1] 0.08868879 
-</​code>​ 
- 
-In the next step, calculate variance explained by randomized env. variables 
-<code rsplus> 
-env.rand <- env[sample (1:​97),​] ​ # the function "​sample"​ will reshuffle the rows with environmental variabels 
-tbRDA.rand <- rda (spe.hell ~ pH + SOILDPT, data = env.rand) 
-RsquareAdj (tbRDA.rand)$r.squared 
-</​code>​ 
-This value represents the variance explained by two random explanatory variables. My result was 0.01854349, but this value will change in each run. We need to do enough repetitions (permutations) to get an idea about the distribution of this values (null model). We can use the "​for"​ loop for it, or, as in this case, function "​replicate",​ with two arguments: ''​n''​ = number of replicates, and ''​expr''​ = expression to be replicated (if more lines of script are involved, this expression needs to be enclosed in curly brackets {}): 
- 
-<code rsplus> 
-n.perm <- 99  # set the number of permutations 
-R2.rand <- replicate (n = n.perm, expr = { 
-  env.rand <- env[sample (1:97),] 
-  tbRDA.rand <- rda (spe.hell ~ pH + SOILDPT, data = env.rand) 
-  RsquareAdj (tbRDA.rand)$r.squared 
-  }) 
-<​code>​ 
-The vector ''​R2.rand''​ contains 99 values of variance explained by random variables. In the next step, we will merge them with the observed R2 (''​R2.obs''​),​ since this is considered to be also the part of null distribution (and this is also the reason why the numbers of permutation are usually ending with 9): 
-<code rsplus> 
-R2 <- c (R2.rand, R2.obs) 
-</​code>​ 
-Draw the histogram of values and highlight the observed R2 there: 
-<code rsplus> 
-hist (R2, nclass = 100)  # ; argument "​nclass"​ separates the bars into 100 categories (compare with hist without this argument) 
-abline (v = R2.obs, col = '​red'​) ​ # red line to indicate where in the histogram is the observed value 
-</​code>​ 
-{{:​obrazky:​perm-test-histogram-r2.png?​direct|}} 
- 
-Already now we can see that our observed value (0.089, red vertical line) is far higher than all values generated by randomized env. variables (''​range (R2.rand)''​ shows that these values are between 0.014 and 0.033, but again, this range will change in another run of the analysis, and will likely increase if the number of permutations increases, since there will be higher chance to get values more different from the mean). 
- 
-To calculate the significance,​ we need to know what is the number of cases in which the variance explained by random explanatory variables was higher or equal to the observed one (explained by real variables). Remember that R.obs is the part of our null distribution (we added it there), so we always obtain at least one value in such comparison (the R.obs itself). 
- 
-<code rsplus> 
-P <- sum (R2 >= R2.obs)/​(n.perm + 1)  # 0.01 
-</​code>​ 
- 
-# You can see that our observed R<​sup>​2</​sup>​ (R2.obs) is far higher than any R2 generated by a random variable, so the calculation of P-value contains only one value equal or higher than observed R2 (the observed R2 itself). The resulting P-value is 0.01, which is the lowest P-value we can get with Monte Carlo permutation test based on 99 permutations:​ 
- 
-<​code>​ 
-1/(99 + 1)  # 0.01 
-</​code>​ 
- 
-# If we are happy with rejecting the null hypothesis at the alpha level of 5% (P < 0.05), than this could be sufficient. But if we need to get lower P-value (e.g. because we are going to correct them for multiple testing, as in the case of forward selection later), we need to increase the number of permutations. As we saw in the theoretical part, the relationship between the number of permutations and the minimal P-value we can obtain is P<​sub>​min</​sub>​ = 1/​(number_of_permutations + 1). If you hope to be able reject the null hypothesis at P < 0.001, you need at least 1/0.001-1 = 999 permutations (in that case the lowest P value will be 0.001) or better more (e.g. 1499 permutations,​ with the lowest P-value 1/(1499+1) = 0.00067). 
- 
-In ''​vegan'',​ the test of the significance for constrained ordination is done by function ''​anova''​ (this may be a bit confusing name, since it is not really calculating ANOVA) 
-<code rsplus> 
-anova (tbRDA, permutations = 99) 
-</​code>​ 
-<​code>​ 
-# Permutation test for rda under reduced model 
-# Permutation:​ free 
-# Number of permutations:​ 99 
- 
-# Model: rda(formula = spe.hell ~ pH + SOILDPT, data = env) 
-# Df Variance ​    F Pr(>​F) ​   
-# Model     ​2 ​ 0.06250 4.574   0.01 ** 
-#   ​Residual 94  0.64226 ​               ​ 
-# --- 
-#   ​Signif. codes: ​ 0 '​***'​ 0.001 '​**'​ 0.01 '​*'​ 0.05 '​.'​ 0.1 ' ' 1 
-</​code>​ 
- 
-==== Example 2: Significance of RDA on species composition of plants in Carpathian wetlands ==== 
-This example continues from the chapter about RDA with [[en:​rda_examples|Example 1 on using RDA]]. 
- 
-After analysing RDA using all environmental variables as explanatory,​ the next question is whether the global model is significant:​ 
-<code rsplus> 
-anova (rda.vasc) 
-</​code>​ 
-<​code>​ 
-Permutation test for rda under reduced model 
-Permutation:​ free 
-Number of permutations:​ 999 
- 
-Model: rda(formula = vasc.hell ~ Ca + Mg + Fe + K + Na + Si + SO4 + PO4 + NO3 + NH3 + Cl + Corg + pH + conduct + slope, data = chem) 
-         Df Variance ​     F Pr(>​F) ​   ​ 
-Model    15  0.21277 2.1599 ​ 0.001 *** 
-Residual 54  0.35464 ​                 ​ 
---- 
-Signif. codes: ​ 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
-</​code>​ 
- 
-Alternatively,​ we may be interested in significance of only the first constrained axis. This is achieved by adding argument ''​first = TRUE'':​ 
-<code rsplus> 
-anova (rda.vasc, first = TRUE) 
-</​code>​ 
-<​code>​ 
-Permutation test for rda under reduced model 
-Permutation:​ free 
-Number of permutations:​ 999 
- 
-Model: rda(formula = vasc.hell ~ Ca + Mg + Fe + K + Na + Si + SO4 + PO4 + NO3 + NH3 + Cl + Corg + pH + conduct + slope, data = chem) 
-         Df Variance ​     F Pr(>​F) ​   ​ 
-RDA1      1  0.09793 14.912 ​ 0.001 *** 
-Residual 54  0.35464 ​                 ​ 
---- 
-Signif. codes: ​ 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
-</​code>​ 
- 
-Or, we may calculate significance of each constrained axis independently. This can be done by adding argument ''​by = "​axis"''​. Note that since there is 14 variables and hence 14 axes, the calculation takes rather long; you may speed it up using parallel calculation,​ if your computer has more than one core (use argument ''​parallel''​ with number of available cores, 4 in case of my computer): ​ 
-<code rsplus> 
-anova (rda.vasc, by = '​axis',​ parallel = 4) 
-</​code>​ 
-<​code>​ 
-Permutation test for rda under reduced model 
-Marginal tests for axes 
-Permutation:​ free 
-Number of permutations:​ 999 
- 
-Model: rda(formula = vasc.hell ~ Ca + Mg + Fe + K + Na + Si + SO4 + PO4 + NO3 + NH3 + Cl + Corg + pH + conduct + slope, data = chem) 
-         Df Variance ​      F Pr(>​F) ​   ​ 
-RDA1      1  0.09793 14.9120 ​ 0.001 *** 
-RDA2      1  0.02237 ​ 3.4070 ​ 0.001 *** 
-RDA3      1  0.01546 ​ 2.3547 ​ 0.001 *** 
-RDA4      1  0.01110 ​ 1.6905 ​ 0.009 **  
-RDA5      1  0.01061 ​ 1.6152 ​ 0.012 *  ​ 
-RDA6      1  0.00930 ​ 1.4155 ​ 0.050 *  ​ 
-RDA7      1  0.00840 ​ 1.2798 ​ 0.110    ​ 
-RDA8      1  0.00637 ​ 0.9703 ​ 0.498    ​ 
-RDA9      1  0.00593 ​ 0.9023 ​ 0.644    ​ 
-RDA10     ​1 ​ 0.00582 ​ 0.8861 ​ 0.677    ​ 
-RDA11     ​1 ​ 0.00496 ​ 0.7549 ​ 0.877    ​ 
-RDA12     ​1 ​ 0.00423 ​ 0.6448 ​ 0.970    ​ 
-RDA13     ​1 ​ 0.00394 ​ 0.5994 ​ 0.992    ​ 
-RDA14     ​1 ​ 0.00331 ​ 0.5034 ​ 1.000    ​ 
-RDA15     ​1 ​ 0.00304 ​ 0.4626 ​ 1.000    ​ 
-Residual 54  0.35464 ​                   
---- 
-Signif. codes: ​ 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
-</​code>​ 
- 
-Other testing option is to test each term (constraining variable) separately - this is done adding the argument ''​by = "​terms"''​. It sequentially adds each variable one by one in order in which they enter the model formula, and for each calculates partial explained variation and its significance with previous variables as covariables (i.e. for the first variable it is marginal variation explained by this variable and its significance,​ for the second variable it is partial variation explained by the second variable after removing the variation of the first variable as covariable, etc.). In the context of our data this option is not too useful, since variables in dataset are ordered without explicit meaning. 
- 
-<code rsplus> 
-anova (rda.vasc, by = '​terms',​ parallel = 4) 
-</​code>​ 
-<​code>​ 
-Permutation test for rda under reduced model 
-Terms added sequentially (first to last) 
-Permutation:​ free 
-Number of permutations:​ 999 
- 
-Model: rda(formula = vasc.hell ~ Ca + Mg + Fe + K + Na + Si + SO4 + PO4 + NO3 + NH3 + Cl + Corg + pH + conduct + slope, data = chem) 
-         Df Variance ​      F Pr(>​F) ​   ​ 
-Ca        1  0.07886 12.0079 ​ 0.001 *** 
-Mg        1  0.01395 ​ 2.1242 ​ 0.009 **  
-Fe        1  0.00962 ​ 1.4643 ​ 0.082 .  ​ 
-K         ​1 ​ 0.00822 ​ 1.2511 ​ 0.155    ​ 
-Na        1  0.01154 ​ 1.7577 ​ 0.031 *  ​ 
-Si        1  0.01387 ​ 2.1119 ​ 0.015 *  ​ 
-SO4       ​1 ​ 0.00688 ​ 1.0476 ​ 0.348    ​ 
-PO4       ​1 ​ 0.00598 ​ 0.9111 ​ 0.569    ​ 
-NO3       ​1 ​ 0.00860 ​ 1.3102 ​ 0.124    ​ 
-NH3       ​1 ​ 0.01239 ​ 1.8872 ​ 0.014 *  ​ 
-Cl        1  0.00601 ​ 0.9154 ​ 0.567    ​ 
-Corg      1  0.00905 ​ 1.3778 ​ 0.100 .  ​ 
-pH        1  0.01151 ​ 1.7525 ​ 0.034 *  ​ 
-conduct ​  ​1 ​ 0.00959 ​ 1.4609 ​ 0.100 .  ​ 
-slope     ​1 ​ 0.00669 ​ 1.0186 ​ 0.387    ​ 
-Residual 54  0.35464 ​                   
---- 
-Signif. codes: ​ 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
-</​code>​ 
- 
-The last testing option is with argument ''​by = "​margin"'',​ testing variation explained by each explanatory variable with all the others used as covariables:​ 
-<code rsplus> 
-anova (rda.vasc, by = '​margin',​ parallel = 4) 
-</​code>​ 
-<​code>​ 
-Permutation test for rda under reduced model 
-Marginal effects of terms 
-Permutation:​ free 
-Number of permutations:​ 999 
- 
-Model: rda(formula = vasc.hell ~ Ca + Mg + Fe + K + Na + Si + SO4 + PO4 + NO3 + NH3 + Cl + Corg + pH + conduct + slope, data = chem) 
-         Df Variance ​     F Pr(>​F) ​   
-Ca        1  0.01441 2.1947 ​ 0.008 ** 
-Mg        1  0.00976 1.4857 ​ 0.075 .  
-Fe        1  0.00723 1.1006 ​ 0.263    
-K         ​1 ​ 0.00690 1.0508 ​ 0.342    
-Na        1  0.00561 0.8539 ​ 0.631    
-Si        1  0.01221 1.8599 ​ 0.030 *  
-SO4       ​1 ​ 0.00742 1.1306 ​ 0.260    
-PO4       ​1 ​ 0.00470 0.7159 ​ 0.874    
-NO3       ​1 ​ 0.00893 1.3600 ​ 0.114    
-NH3       ​1 ​ 0.01077 1.6404 ​ 0.054 .  
-Cl        1  0.00581 0.8849 ​ 0.628    
-Corg      1  0.00788 1.1997 ​ 0.185    
-pH        1  0.00858 1.3066 ​ 0.154    
-conduct ​  ​1 ​ 0.00922 1.4042 ​ 0.105    
-slope     ​1 ​ 0.00669 1.0186 ​ 0.373    
-Residual 54  0.35464 ​                 
---- 
-Signif. codes: ​ 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
-</​code>​ 
  
en/monte_carlo_examples.1524365049.txt.gz · Last modified: 2018/04/22 10:44 by David Zelený