### Introduction

### Theory, R functions & Examples

en:expl_var_examples

Section: Ordination analysis

This is direct continuation of Example 1 in tb-RDA section. We used Vltava river valley dataset to calculate tb-RDA (on log-transformed and then Hellinger transformed species composition data) with two measured variables (pH and SOILDPT) as explanatory. They explained 8.9% of the variance, and we may ask: if variables explain this amount of variance (which may seem rather low, less than 10%), is it enough to report this result?

To decide, we need to do two more things. One is to test whether the analysis is significant, meaning that this amount of variance is high enough compared to the variance generated (in average) by two random variables (this is the topic for Example 2 below). The other thing is to compare the value of explained variation (R2 = 8.9%) with the variation which would be explained by two best variables, i.e. variables which represent the strongest gradient along which the species composition is changing. As you can see in the theoretical part, we can create such variable for our dataset by applying first unconstrained ordination on our data and extract the first few unconstrained ordination axes (i.e. sample scores along them). These axes represent the main directions along which species composition changes the most rapidly, which is exactly what we need. Then we can use these axes as explanatory in the constrained version of the same ordination (here tb-RDA) to see how much variance they can explain. The number of axes I should take depends on the number of real variables I want to compare the variance with (in this case two).

Let's get data in and calculate tb-RDA on two explanatory variables (this is essentially repeating steps in Example 1 from tb-RDA:

vltava.spe <- read.delim ('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/vltava-spe.txt', row.names = 1) vltava.env <- read.delim ('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/vltava-env.txt') spe <- vltava.spe env <- vltava.env[, c('pH', 'SOILDPT')] library (vegan) spe.log <- log1p (spe) # species data are in percentage scale which is strongly rightskewed, better to transform them spe.hell <- decostand (spe.log, 'hell') # we are planning to do tb-RDA, this is Hellinger pre-transformation tbRDA <- rda (spe.hell ~ pH + SOILDPT, data = env) # calculate tb-RDA with two explanatory variables

I can use the function `RsquareAd`

from `vegan`

to extract the explained variance directly from the ordination object:

R2.obs <- RsquareAdj (tbRDA)$r.squared R2.obs

[1] 0.08868879

Result is 0.089 = 8.9%. Note that the function `RsquareAdj`

returns a list with two components, `r.squared`

and `adj.r.squared`

, the first referring to the R2 we are interested now, and the second to adjusted R2, i.e. R2 corrected for the number of samples and number of explanatory variables (we will use this later).

Now we need to calculate the unconstrained version of tb-RDA on the same data and extract the first two ordination axes. The unconstrained version of tb-RDA is tb-PCA (i.e. PCA calculated on pre-transform data, where the pre-transformation should be the same in both analyses):

tbPCA <- rda (spe.hell) tbPCA

Call: rda(X = spe.hell) Inertia Rank Total 0.7048 Unconstrained 0.7048 96 Inertia is variance Eigenvalues for unconstrained axes: PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 0.09197 0.06075 0.04684 0.03537 0.02650 0.02361 0.02093 0.02035 (Showed only 8 of all 96 unconstrained eigenvalues)

For comparison, let's also include the summary of tb-RDA method:

tbRDA

Call: rda(formula = spe.hell ~ pH + SOILDPT, data = env) Inertia Proportion Rank Total 0.70476 1.00000 Constrained 0.06250 0.08869 2 Unconstrained 0.64226 0.91131 94 Inertia is variance Eigenvalues for constrained axes: RDA1 RDA2 0.04023 0.02227 Eigenvalues for unconstrained axes: PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 0.07321 0.04857 0.04074 0.03144 0.02604 0.02152 0.01917 0.01715 (Showed only 8 of all 94 unconstrained eigenvalues)

You can see that total inertia (variance of the whole dataset) is identical in both analyses (0.70476). This is the variance we aim to explain.

Extract the two PCA axes from `tbPCA`

object and use them as explanatory variables in tbRDA on the same data to see how much variance they explain:

PCA12 <- scores (tbPCA, display = 'sites', choices = 1:2) tbRDA_PCA12 <- rda (spe.hell ~ PCA12) RsquareAdj (tbRDA_PCA12)$r.squared

[1] 0.2167075

Two first tb-PCA axes, if used as explanatory in the tb-RDA on the same dataset, explain 21.7% of the variance. Note that this is the same number we would get by simply checking the amount of variance represented by the first two ordination axes in tb-PCA in the summary above: eigenvalue of PCA1 = 0.09197, eigenvalue of PCA1 = 0.06075, total variance (inertia) = 0.70476, recalculated into percentage: (0.09197+0.06075)/0.70476 = 21.7%. In fact, we don't need to really to the whole procedure (extract tb-PCA axes and use them as explanatory in tb-RDA), we can simply check the variance of *n* axes in unconstrained ordination (*n* = number of explanatory variables) and compare it with real variance explained by environmental variables.

The comparison here is: 8.9% explained by real variables (pH and SOILDPT) vs 21.7% which would be explained by two best, not correlated variables (if we had them). We see that measured variables explain something over 40% of variation they could (8.9/21.7 = 0.41), which is not bad. Remember important difference: PCA axes are (from definition) not correlated, while our real variables often will be (as in this case: `cor.test (~ pH + SOILDPT, data = env)`

shows that Pearson's correlation coefficient between pH and SOILDPT is *r* = 0.273, and this correlation is significant at P = 0.0069).

The next step is to test whether the variation explained by our variables is significant - this is a topic for Example 1 in the Permutation test section.

This example directly follows the Example 1 in RDA & tb-RDA section and Example 1 in Explained variance section, consider checking them first. We used Vltava river valley dataset, and two field-measured environmental variables, soil pH and soil depth (`pH`

and `SOILDPT`

), to explain variance in species composition. We found that they can explain 8.9% of overall variance; when compared to the variance which can be maximally explained by two explanatory variables (21.7% explained by two tb-PCA axes, see here) this sounds not bad (it is more than 40% of variance we can maximally explain in this dataset with two variables). But is the result significant? By significant, I mean: is the variance considerably higher than the variance explained (in average) by two random variables not related to species composition? This is the task for Monte Carlo permutation test (check Theory part to see how it works).

First, get the data and calculate tb-RDA on them (this is esentially repeating beginning of Example 1 in the section Explained variance):

vltava.spe <- read.delim ('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/vltava-spe.txt', row.names = 1) vltava.env <- read.delim ('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/vltava-env.txt') spe <- vltava.spe env <- vltava.env[, c('pH', 'SOILDPT')] library (vegan) spe.log <- log1p (spe) spe.hell <- decostand (spe.log, 'hell') tbRDA <- rda (spe.hell ~ pH + SOILDPT, data = env) R2.obs <- RsquareAdj (tbRDA)$r.squared R2.obs

[1] 0.08868879

In the next step, calculate variance explained by randomized env. variables

env.rand <- env[sample (1:97),] # the function "sample" will reshuffle the rows with environmental variabels tbRDA.rand <- rda (spe.hell ~ pH + SOILDPT, data = env.rand) RsquareAdj (tbRDA.rand)$r.squared

This value represents the variance explained by two random explanatory variables. My result was 0.01854349, but this value will change in each run. We need to do enough repetitions (permutations) to get an idea about the distribution of this values (null model). We can use the “for” loop for it, or, as in this case, function “replicate”, with two arguments: `n`

= number of replicates, and `expr`

= expression to be replicated (if more lines of script are involved, this expression needs to be enclosed in curly brackets {}):

n.perm <- 99 # set the number of permutations R2.rand <- replicate (n = n.perm, expr = { env.rand <- env[sample (1:97),] tbRDA.rand <- rda (spe.hell ~ pH + SOILDPT, data = env.rand) RsquareAdj (tbRDA.rand)$r.squared }) <code> The vector ''R2.rand'' contains 99 values of variance explained by random variables. In the next step, we will merge them with the observed R2 (''R2.obs''), since this is considered to be also the part of null distribution (and this is also the reason why the numbers of permutation are usually ending with 9): <code rsplus> R2 <- c (R2.rand, R2.obs)

Draw the histogram of values and highlight the observed R2 there:

hist (R2, nclass = 100) # ; argument "nclass" separates the bars into 100 categories (compare with hist without this argument) abline (v = R2.obs, col = 'red') # red line to indicate where in the histogram is the observed value

Already now we can see that our observed value (0.089, red vertical line) is far higher than all values generated by randomized env. variables (`range (R2.rand)`

shows that these values are between 0.014 and 0.033, but again, this range will change in another run of the analysis, and will likely increase if the number of permutations increases, since there will be higher chance to get values more different from the mean).

To calculate the significance, we need to know what is the number of cases in which the variance explained by random explanatory variables was higher or equal to the observed one (explained by real variables). Remember that R.obs is the part of our null distribution (we added it there), so we always obtain at least one value in such comparison (the R.obs itself).

P <- sum (R2 >= R2.obs)/(n.perm + 1) # 0.01

You can see that our observed R^{2} (R2.obs) is far higher than any R2 generated by a random variable, so the calculation of P-value contains only one value equal or higher than observed R2 (the observed R2 itself). The resulting P-value is 0.01, which is the lowest P-value we can get with Monte Carlo permutation test based on 99 permutations:

1/(99 + 1) # 0.01

If we are happy with rejecting the null hypothesis at the alpha level of 5% (P < 0.05), than this could be sufficient. But if we need to get lower P-value (e.g. because we are going to correct them for multiple testing, as in the case of forward selection later), we need to increase the number of permutations. As we saw in the theoretical part, the relationship between the number of permutations and the minimal P-value we can obtain is P_{min} = 1/(number_of_permutations + 1). If you hope to be able reject the null hypothesis at P < 0.001, you need at least 1/0.001-1 = 999 permutations (in that case the lowest P value will be 0.001) or better more (e.g. 1499 permutations, with the lowest P-value 1/(1499+1) = 0.00067).

In `vegan`

, the test of the significance for constrained ordination is done by function `anova`

(this may be a bit confusing name, since it is not really calculating ANOVA)

anova (tbRDA, permutations = 99)

Permutation test for rda under reduced model Permutation: free Number of permutations: 99 Model: rda(formula = spe.hell ~ pH + SOILDPT, data = env) Df Variance F Pr(>F) Model 2 0.06250 4.574 0.01 ** Residual 94 0.64226 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The P-value is the same (P = 0.01) as in our manual Monte-Carlo permutation test above. Still, 99 permutations is really low number; consider using at least the default 999 permutations to get P-values close to 0.001.

The dataset Carpathian spring fen meadows consists of three data matrices - species matrix with bryophytes, species matrix with vascular plants and environmental matrix with chemical variables and slope. We ignore bryophytes here, and focus on relationship between vascular plants (`vasc`

) and chemical variables (`chem`

).

# Carpathian wetlands - import data vasc <- read.delim ('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/vasc_plants.txt', row.names = 1) chem <- read.delim ('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/chemistry.txt', row.names = 1)

For the purpose of tb-RDA, we first pre-transform the species data using Hellinger's transformation (tb-RDA), and remove `slope`

from the matrix of evironmental variabels (slope of the site is not a chemical variable):

library (vegan) # if you haven't use it up to now vasc.hell <- decostand (vasc, 'hell') # pre-transform species data by Hellinger transformation chem <- chem[, -15] # remove slope rda.vasc <- rda (vasc.hell ~ ., chem)

To see the output, we can either directly print the `rda.vasc`

object, or apply the function `summary`

on it to get more extended information (it prints also all the scores, which may get quite long, so wrap it into `header`

function to shorten those tables into only first six rows):

head (summary (rda.vasc))

Call: rda(formula = vasc.hell ~ Ca + Mg + Fe + K + Na + Si + SO4 + PO4 + NO3 + NH3 + Cl + Corg + pH + conduct, data = chem) Partitioning of variance: Inertia Proportion Total 0.5674 1.0000 Constrained 0.2061 0.3632 Unconstrained 0.3613 0.6368 Eigenvalues, and their contribution to the variance Importance of components: RDA1 RDA2 RDA3 RDA4 RDA5 RDA6 RDA7 RDA8 RDA9 RDA10 RDA11 RDA12 RDA13 RDA14 PC1 PC2 PC3 Eigenvalue 0.0978 0.02165 0.01501 0.01077 0.01060 0.009272 0.008199 0.00615 0.005908 0.005756 0.004406 0.004055 0.003383 0.003123 0.03260 0.03204 0.02084 Proportion Explained 0.1724 0.03815 0.02645 0.01899 0.01868 0.016340 0.014450 0.01084 0.010410 0.010140 0.007770 0.007150 0.005960 0.005500 0.05746 0.05647 0.03672 Cumulative Proportion 0.1724 0.21051 0.23696 0.25595 0.27463 0.290970 0.305420 0.31626 0.326680 0.336820 0.344590 0.351730 0.357700 0.363200 0.42066 0.47713 0.51386 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16 PC17 PC18 PC19 PC20 Eigenvalue 0.01711 0.01638 0.01426 0.01370 0.01295 0.01166 0.01106 0.01057 0.009716 0.008781 0.008481 0.008224 0.007753 0.00759 0.007313 0.007163 0.006564 Proportion Explained 0.03015 0.02887 0.02513 0.02415 0.02282 0.02055 0.01950 0.01862 0.017120 0.015480 0.014950 0.014490 0.013660 0.01338 0.012890 0.012620 0.011570 Cumulative Proportion 0.54401 0.57287 0.59800 0.62215 0.64497 0.66552 0.68502 0.70364 0.720760 0.736240 0.751180 0.765680 0.779340 0.79272 0.805610 0.818230 0.829800 PC21 PC22 PC23 PC24 PC25 PC26 PC27 PC28 PC29 PC30 PC31 PC32 PC33 PC34 PC35 PC36 Eigenvalue 0.006371 0.005808 0.005689 0.00547 0.005307 0.004855 0.00475 0.004443 0.00403 0.003576 0.003554 0.003397 0.003139 0.002931 0.002827 0.00276 Proportion Explained 0.011230 0.010240 0.010030 0.00964 0.009350 0.008560 0.00837 0.007830 0.00710 0.006300 0.006260 0.005990 0.005530 0.005160 0.004980 0.00486 Cumulative Proportion 0.841030 0.851260 0.861290 0.87093 0.880280 0.888840 0.89721 0.905040 0.91215 0.918450 0.924710 0.930700 0.936230 0.941400 0.946380 0.95124 PC37 PC38 PC39 PC40 PC41 PC42 PC43 PC44 PC45 PC46 PC47 PC48 PC49 PC50 PC51 PC52 Eigenvalue 0.002552 0.002361 0.002284 0.002046 0.001903 0.001807 0.001691 0.00157 0.001559 0.001434 0.001328 0.001304 0.00110 0.00103 0.0009236 0.0008693 Proportion Explained 0.004500 0.004160 0.004030 0.003610 0.003350 0.003180 0.002980 0.00277 0.002750 0.002530 0.002340 0.002300 0.00194 0.00182 0.0016300 0.0015300 Cumulative Proportion 0.955740 0.959900 0.963930 0.967530 0.970890 0.974070 0.977050 0.97982 0.982570 0.985090 0.987430 0.989730 0.99167 0.99349 0.9951100 0.9966500 PC53 PC54 PC55 Eigenvalue 0.0007434 0.0006739 0.0004857 Proportion Explained 0.0013100 0.0011900 0.0008600 Cumulative Proportion 0.9979600 0.9991400 1.0000000 Accumulated constrained eigenvalues Importance of components: RDA1 RDA2 RDA3 RDA4 RDA5 RDA6 RDA7 RDA8 RDA9 RDA10 RDA11 RDA12 RDA13 RDA14 Eigenvalue 0.0978 0.02165 0.01501 0.01077 0.01060 0.009272 0.008199 0.00615 0.005908 0.005756 0.004406 0.004055 0.003383 0.003123 Proportion Explained 0.4746 0.10504 0.07282 0.05228 0.05144 0.044990 0.039790 0.02984 0.028670 0.027930 0.021380 0.019680 0.016420 0.015150 Cumulative Proportion 0.4746 0.57961 0.65243 0.70471 0.75615 0.801140 0.840920 0.87077 0.899440 0.927370 0.948750 0.968430 0.984850 1.000000 Scaling 2 for species and site scores * Species are scaled proportional to eigenvalues * Sites are unscaled: weighted dispersion equal on all dimensions * General scaling constant of scores: 2.501418 Species scores RDA1 RDA2 RDA3 RDA4 RDA5 RDA6 Acer_sp. -0.0154645 -0.0007631 -0.0075229 -0.008245 0.001732 0.007874 AchiMill 0.0003371 0.0002172 0.0099623 -0.023240 -0.010564 0.017989 AgroCani 0.2989206 -0.0534899 -0.0003443 0.019328 -0.002615 -0.012205 AgroCapi 0.0439209 0.0196134 0.0014274 -0.007778 -0.018888 0.021651 AgroStol -0.1074882 -0.0278701 0.0070877 -0.050120 -0.016408 -0.003469 AjugRept -0.0607449 0.0359534 -0.0898934 -0.005017 -0.001849 -0.002417 .... Site scores (weighted sums of species scores) RDA1 RDA2 RDA3 RDA4 RDA5 RDA6 S1 -0.13618 0.34012 -0.648298 0.3345 -0.08732 0.29732 S2 -0.02351 0.50122 0.483985 -0.4592 -0.34310 0.01176 S3 -0.36203 -0.29731 0.187990 -0.2157 0.37425 0.24345 S4 -0.45847 -0.75706 0.028869 0.1514 -0.12375 0.24845 S5 -0.34538 0.01186 -0.543172 0.0689 -0.09529 -0.01918 S6 -0.38319 -0.12405 0.008091 0.3563 0.37374 0.10958 .... Site constraints (linear combinations of constraining variables) RDA1 RDA2 RDA3 RDA4 RDA5 RDA6 S1 -0.29599 0.51952 -0.676450 0.3432 -0.11594 0.041619 S2 0.13970 0.17128 0.057787 -0.2379 -0.43640 0.078915 S3 -0.37947 -0.26887 -0.021117 0.1957 0.09659 0.112267 S4 -0.42693 -0.41047 0.057856 0.2340 0.15766 0.017807 S5 -0.32293 0.13820 -0.465399 0.3952 0.13787 -0.020922 S6 -0.09608 -0.05028 0.002395 0.1068 0.12036 0.002839 .... Biplot scores for constraining variables RDA1 RDA2 RDA3 RDA4 RDA5 RDA6 Ca -0.88402 -0.2568958 -0.10774 0.09646 0.02262 -0.022930 Mg -0.81588 -0.0021299 -0.22725 -0.33887 0.08347 -0.034787 Fe 0.12061 0.4707166 0.03481 0.12330 0.24032 -0.144807 K -0.32015 0.3987824 -0.07212 0.02156 -0.19731 -0.273450 Na -0.51577 0.4008411 0.03295 -0.06688 -0.13851 -0.227552 Si -0.29676 0.4289340 -0.67314 0.06990 -0.21501 -0.089828 SO4 -0.34191 -0.3049719 -0.25253 -0.19042 -0.24876 0.064491 PO4 0.21537 -0.0008569 0.21286 0.11162 0.12566 -0.044834 NO3 0.15926 -0.4121772 -0.09585 0.23946 -0.48853 0.245962 NH3 0.44013 -0.2471674 -0.22154 -0.02665 0.38679 0.488924 Cl -0.03537 -0.2772850 -0.08676 0.18063 0.31496 -0.002797 Corg 0.69131 -0.0833783 0.05578 0.28857 0.25577 -0.260926 pH -0.67211 0.0469155 0.35972 0.16870 -0.03229 0.552316 conduct -0.82805 0.0162761 0.06142 0.17723 0.14450 0.280628

There are 14 environmental variables, which means that we got 14 constrained ordination axes (named RDA1 to RDA14). Unconstrained axes are named as PC axes. The table `Proportion Explained`

with the partitioning of the variance between constrained axes (i.e. environmental variables) and unconstrained axes (i.e. variance not explained by environmental factors) shows that the proportion of the total variance, which is explained by all environmental factors, is 36.3% (`0.3632`

at the row `Constrained`

and column `Proportion`

). The first constrained axis explains 17.2%, while the second just 3.8% (`Importance of components`

> `Proportion Explained`

, columns `RDA1`

and `RDA2`

); it seems that those 14 variables represent mostly a single strong environmental gradient represented by the first constrained axis.

What we calculated here is a global model - it includes all available environmental variables. The next question is whether the global model is significant:

anova (rda.vasc)

Permutation test for rda under reduced model Permutation: free Number of permutations: 999 Model: rda(formula = vasc.hell ~ Ca + Mg + Fe + K + Na + Si + SO4 + PO4 + NO3 + NH3 + Cl + Corg + pH + conduct, data = chem) Df Variance F Pr(>F) Model 14 0.20608 2.2407 0.001 *** Residual 55 0.36133 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Indeed it is significant (P = 0.001, which is the lowest P-value we can obtain with the default number of permutations, 999). Alternatively, we may be interested in the significance of only the first constrained axis. This is achieved by adding the argument `first = TRUE`

:

anova (rda.vasc, first = TRUE)

Permutation test for rda under reduced model Permutation: free Number of permutations: 999 Model: rda(formula = vasc.hell ~ Ca + Mg + Fe + K + Na + Si + SO4 + PO4 + NO3 + NH3 + Cl + Corg + pH + conduct, data = chem) Df Variance F Pr(>F) RDA1 1 0.09780 14.887 0.001 *** Residual 55 0.36133 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Or, we may calculate the significance of each constrained axis independently. This can be done by adding argument `by = “axis”`

(by default, the argument `by`

is set to `NULL`

, see the helpfile `?anova.cca`

). The calculation takes rather long (there are 14 constrained axes, each tested separately); you may speed it up using parallel calculation if your computer has more than one core (use the argument `parallel`

with the number of available cores, four in the case of my computer). Also, we will directly store the results into an object (`test_axis`

) so as we can call it later for further processing:

test_axis <- anova (rda.vasc, by = 'axis', parallel = 4)

Permutation test for rda under reduced model Forward tests for axes Permutation: free Number of permutations: 999 Model: rda(formula = vasc.hell ~ Ca + Mg + Fe + K + Na + Si + SO4 + PO4 + NO3 + NH3 + Cl + Corg + pH + conduct, data = chem) Df Variance F Pr(>F) RDA1 1 0.09780 14.8869 0.001 *** RDA2 1 0.02165 3.2950 0.008 ** RDA3 1 0.01501 2.2844 0.213 RDA4 1 0.01077 1.6399 0.930 RDA5 1 0.01060 1.6137 0.913 RDA6 1 0.00927 1.4113 0.975 RDA7 1 0.00820 1.2481 0.997 RDA8 1 0.00615 0.9362 1.000 RDA9 1 0.00591 0.8993 1.000 RDA10 1 0.00576 0.8761 1.000 RDA11 1 0.00441 0.6707 1.000 RDA12 1 0.00406 0.6173 1.000 RDA13 1 0.00338 0.5150 1.000 RDA14 1 0.00312 0.4754 1.000 Residual 55 0.36133 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Our impression from above was right - here, only first two axes are significant, second not so strongly as the first one. In fact, the test above is running 14 independent tests of significance, the number high enough to have a problem with multiple testing issue (more tests = more significant results). To correct for multiple testing, consider applying some correction (e.g. Bonferroni, or, as in this case, Holm's correction). Significances are stored in the component `Pr(>F)`

of `test_axis`

object (a list), and the function conducting correction is `p.adjust`

:

test_axis.adj <- test_axis test_axis.adj$`Pr(>F)` <- p.adjust (test_axis$`Pr(>F)`, method = 'holm') test_axis.adj

Permutation test for rda under reduced model Forward tests for axes Permutation: free Number of permutations: 999 Model: rda(formula = vasc.hell ~ Ca + Mg + Fe + K + Na + Si + SO4 + PO4 + NO3 + NH3 + Cl + Corg + pH + conduct, data = chem) Df Variance F Pr(>F) RDA1 1 0.09780 14.8869 0.014 * RDA2 1 0.02165 3.2950 0.052 . RDA3 1 0.01501 2.2844 1.000 RDA4 1 0.01077 1.6399 1.000 RDA5 1 0.01060 1.6137 1.000 RDA6 1 0.00927 1.4113 1.000 RDA7 1 0.00820 1.2481 1.000 RDA8 1 0.00615 0.9362 1.000 RDA9 1 0.00591 0.8993 1.000 RDA10 1 0.00576 0.8761 1.000 RDA11 1 0.00441 0.6707 1.000 RDA12 1 0.00406 0.6173 1.000 RDA13 1 0.00338 0.5150 1.000 RDA14 1 0.00312 0.4754 1.000 Residual 55 0.36133 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

After adjusting the P-values for multiple testing issue, the second axis became only marginally significant (note that your P-values may slightly differ, since this is permutation test, each time running with different set of permuted environmental variables).

Other testing option is to test each term (constraining variable) separately in the sequence of adding them into the model - this is done adding the argument `by = “terms”`

. It sequentially adds each variable one by one in order in which they enter the model formula, and for each calculates partial explained variation and its significance with previous variables as covariables (i.e. for the first variable it is marginal variation explained by this variable and its significance, for the second variable it is partial variation explained by the second variable after removing the variation of the first variable as covariable, etc.). Since variables in the dataset are ordered without exact meaning, this option is not useful here. More meaningful is the last option, `by = “margin”`

, testing the variance explained by each explanatory variable with all the others used as covariables (independently from their order in the model):

test_margin <- anova (rda.vasc, by = 'margin', parallel = 4)

Permutation test for rda under reduced model Marginal effects of terms Permutation: free Number of permutations: 999 Model: rda(formula = vasc.hell ~ Ca + Mg + Fe + K + Na + Si + SO4 + PO4 + NO3 + NH3 + Cl + Corg + pH + conduct, data = chem) Df Variance F Pr(>F) Ca 1 0.01460 2.2228 0.007 ** Mg 1 0.00973 1.4816 0.084 . Fe 1 0.00729 1.1096 0.258 K 1 0.00668 1.0162 0.375 Na 1 0.00565 0.8603 0.651 Si 1 0.01237 1.8828 0.026 * SO4 1 0.00739 1.1246 0.291 PO4 1 0.00467 0.7108 0.872 NO3 1 0.00908 1.3828 0.108 NH3 1 0.01066 1.6231 0.042 * Cl 1 0.00584 0.8887 0.617 Corg 1 0.00781 1.1881 0.217 pH 1 0.00850 1.2938 0.146 conduct 1 0.00959 1.4604 0.078 . Residual 55 0.36133 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

If we correct for multiple testing (not done here), we will see that only calcium (`Ca`

) remains marginally significant, i.e. it can explain some important extra variance which other variables cannot.

And if we want to know the variance each variable can explain independently from others (i.e. the variance each variable will explain if it is the only variable in the model, without any covariables)? What I do below is a bit complex R coding exercise and I won't explain it in detail, but what I did is to make a loop (using `apply`

) which will take each of the 14 variables one by one, calculate tb-RDA on it using `vasc.hell`

data, test it's significance, return all values in one summary table, correct them for the multiple-testing issue by Holm's correction, and order the table by decreasing value of variance explained by each variable:

test_each <- t(apply (chem, 2, FUN = function (x) as.matrix (anova (rda (vasc.hell ~ x))[1:4][1,]))) test_each <- as.data.frame (test_each) names (test_each) <- c("Df", "Variance", "F", "Pr(>F)") test_each.adj <- test_each test_each.adj$`Pr(>F)` <- p.adjust (test_each$`Pr(>F)`, method = 'holm') test_each.adj[order (test_each.adj$Variance, decreasing = TRUE),]

Df Variance F Pr(>F) Ca 1 0.078859882 10.9763368 0.014 conduct 1 0.069408621 9.4774885 0.014 Mg 1 0.068014761 9.2612406 0.014 Corg 1 0.050866878 6.6963607 0.014 pH 1 0.049697951 6.5277054 0.014 Na 1 0.032285586 4.1026474 0.014 NH3 1 0.026047765 3.2718455 0.016 Si 1 0.021008459 2.6145234 0.016 SO4 1 0.018808570 2.3313586 0.036 K 1 0.017124964 2.1161781 0.040 NO3 1 0.012759735 1.5643455 0.236 Fe 1 0.010942632 1.3371875 0.384 PO4 1 0.009863593 1.2029965 0.384 Cl 1 0.006910560 0.8383942 0.736

Some variables (at the bottom of the table) do not explain significant part of variance, some do (with calcium, `Ca`

, explaining the most). What I did is in fact the first step of forward selection, which aims to select the subset of variables that explain the most variance in the model (in the first step the variables are sorted by explained variance, the first of them is tested, and if significant, it is included as the first selected variables into the model). This will be the topic for Example 1 in Variable selection section.

en/expl_var_examples.txt · Last modified: 2019/04/06 10:00 by David Zelený