### Introduction

### Theory, R functions & Examples

en:similarity_examples

For this example, we use data from the river valley dataset – 97 forest vegetation plots sampled along transects across deep valley of Vltava river in South Bohemia, Czech Republic (check the description of data here). First, let’s import the species composition data into R (let’s use the txt file with values separated by tabulators):

vltava.spe <- read.delim ('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/vltava-spe.txt', row.names = 1)

**Question 1:** what is the Euclidean distance between the plot 12 and 56? Since the data in the compositional matrix represent species percentage cover (which usually has highly right-skewed distribution), reduce the importance of dominant species by log-transforming the data first before calculating the distance.
For this, we need to calculate Euclidean distance on log-transformed compositional data, and find the correct value in the distance matrix:

vltava.spe.log <- log1p (vltava.spe) # log transform the values, with adding 1 to all of them to keep zero being zero vltava.eucl <- dist (vltava.spe.log) as.matrix (vltava.eucl)[56,12] # 9.895347

Note two things: first, you cannot apply the subsetting (square brackets) on the object `vltava.eucl`

, since this is not a matrix or data.frame, but an object of the class `dist`

(run `class (vltava.eucl)`

to find out), and this object does not have two dimensions. But you can convert it into the squared matrix of distance, which is symmetric, ie with zeros on the diagonal, and the values above and below diagonal symmetrically distributed (this also means that ‘’as.matrix (vltava.eucl)[12,56]’’ gives the same result as `as.matrix (vltava.eucl)[56,12]`

).
Second – in this case, the plot numbers respect the order of the plots in the table (see `rownames (vltava.spe)`

and `colnames (as.matrix (vltava.eucl))`

). If this is not true, you need to make sure which samples in the distance matrix are actually those with id 12 and 56.

**Question 2**: which two plots are the most similar, measured by Bray-Curtis distance on log-transformed data?
Ok – we need to calculate Bray-Curtis this time (not Euclidean), and figure out which of the value is the lowest – and to which pair of plots this value belongs.

library (vegan) # while function dist is in the base R, for vegdist we need vegan package vltava.bray <- vegdist (vltava.spe.log, method = "bray") # the “method = ‘bray’” is not necessary here, since this method is the default one for the vegdist function – but we put it here for clarity which (as.matrix (vltava.bray) == min (vltava.bray), arr.ind = TRUE)

The function `which`

will return the position of TRUE value in the argument – here, after setting argument `arr.ind = TRUE`

(return array index) it will return the row and column of the shortest Bray-Curtis distance. Note that the function `min`

needs to be applied on the original `vltava.bray`

object, which is of the class `dist`

– in fact, this is a vector, which is wrapped into the triangular “distance matrix”. This triangle does not contain the diagonal values, which are all zeros here (we are calculating distances) – otherwise, the `min`

function would always return zero.
Plots 2 and 4 are the most similar – no wonder, since if we print and compare their species composition, most of the species occur in both of them:

vltava.spe[c(2,4), colSums (vltava.spe[c(2,4),]) > 0] # select only those columns (species) which are not zero (have colSums higher than zero)

Coryave23 Pinusyl23 Querpet23 Querrob23 Tilicor23 Achitan1 Allisen1 Anthram1 Asplsep1 Avenfle1 2 2 3 38 5 0 3 1 3 1 3 4 4 3 0 7 2 2 1 3 2 0 Bracpin1 Camprap1 Camprot1 Cytinig1 Diancar1 Euphcyp1 Festpal1 Galespc1 Galisyl1 Geniger1 Genitin1 2 2 1 2 2 2 1 3 1 0 1 2 4 0 0 2 0 3 1 4 0 1 0 2 Hierlac1 Hierpil1 Hiersab1 Hylomax1 Hypeper1 Impagla1 Lychvis1 Poa.nem1 Polyodo1 Polyvul1 Silenut1 2 2 3 2 2 1 1 3 3 3 0 3 4 2 3 2 2 0 1 0 3 2 1 4 Solivir1 Trifalp1 Verooff1 Vinchir1 2 1 4 2 2 4 1 2 0 3

This is quite different from two plots which are the most different (as measured by Bray-Curtis distance)

which (as.matrix (vltava.bray) == max (vltava.bray), arr.ind = TRUE) # there are many, take just the first one as an example vltava.spe[c(58,1), colSums (vltava.spe[c(58,1),])>0]

Abiealb23 Coryave23 Fagusyl23 Lonixyl23 Pinusyl23 Querrob23 Sambnig23 Tilicor23 Achitan1 Actaspi1 58 38 0 1 4 0 0 4 0 0 2 1 0 4 0 0 4 20 0 4 3 0 Anthram1 Asareur1 Bracpin1 Campper1 Camprot1 Cardimp1 Caredig1 Carespe1 Clinvul1 Cytinig1 Diancar1 58 0 2 0 0 0 3 2 0 0 0 0 1 3 0 4 2 2 0 0 3 1 2 1 Dryodil1 Dryofil1 Euphcyp1 Festovi1 Fragmos1 Galemon1 Galebif1 Genitin1 Gerarob1 Hierlac1 Hiermur1 58 2 3 0 0 0 8 0 0 2 0 0 1 0 0 2 2 4 0 2 3 0 3 1 Hiersab1 Hylomax1 Lunared1 Luzuluz1 Lychvis1 Melinut1 Mercper1 Milieff1 Moehtri1 Oxalace1 Pimpsax1 58 0 0 18 0 0 3 3 3 2 8 0 1 2 2 0 8 3 0 0 0 0 0 1 Poa.ang1 Poa.nem1 Polyodo1 Prenpur1 Pulmobs1 Seneova1 Silenut1 Solivir1 Trifalp1 Urtidio1 Verooff1 58 0 0 0 3 2 3 0 0 0 1 0 1 3 18 2 0 0 0 8 2 3 0 2 Vinchir1 58 0 1 1

(these two plots do not share any species, and their Bray-Curtis distance is 1, as for many others).

**Question 3:** The plots have been classified into four different vegetation types, and each plot belongs to one of them. Which of the vegetation types is the most compositionally homogeneous? (ie the plots in that type have the most similar species composition to each other?). Use Hellinger distance calculated on log-transformed species composition data for this purpose, and for assignment of the plots into vegetation types, use the variable `GROUP`

from the `vltava.env`

file.
First, let’s import the environmental data (see the original dataset description site):

vltava.env <- read.delim ('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/vltava-env.txt') # names (vltava.env) # this you can use to see what are the names of the columns/variables in the data - but this is a "garbage" code, not needed for running the script, so better to "comment it out" lapply (1:4, FUN = function (x) mean (dist (decostand (vltava.spe.log[vltava.env$GROUP == x,], method = 'hell'))))

[[1]] [1] 1.112674 [[2]] [1] 1.04851 [[3]] [1] 1.103596 [[4]] [1] 0.9678859

The last group (four) has the lowest mean Hellinger distance - the vegetation is the most homogeneous. From the knowledge of the data, this makes quite a sense: group 4 represents a forest dominated by coniferous fir species (*Abies alba*), with rather specific species composition, unique for the region of the study (you can see this visually on the diagram of unconstrained ordination, at which plots belonging to individual groups are visualized by spiderplot and enclosed into envelopes - the group 4 seems to be the least dispersed across the ordination space (see here).

en/similarity_examples.txt · Last modified: 2020/03/24 15:58 by Lin, Po-Yu