
en:similarity

This shows you the differences between two versions of the page.


en:similarity [2017/03/21 06:52] David Zelený [Distance indices] |
en:similarity [2019/02/26 22:08] (current) David Zelený [Double-zero problem] |
||
---|---|---|---|

Line 5: | Line 5: | ||

[[{|width: 7em; background-color: white; color: navy}similarity_examples|Examples]] | [[{|width: 7em; background-color: white; color: navy}similarity_examples|Examples]] | ||

[[{|width: 7em; background-color: white; color: navy}similarity_exercise|Exercise {{::lock-icon.png?nolink|}}]] | [[{|width: 7em; background-color: white; color: navy}similarity_exercise|Exercise {{::lock-icon.png?nolink|}}]] | ||

- | |||

- | ===== Theory ===== | ||

Ecological resemblance, which includes similarities and distances between samples, is the basic tool for handling multivariate ecological data. Two samples sharing the same species in the same abundances have the highest similarity (and lowest distance), and the similarity decreases (and distance increases) with the differences in their species composition. All cluster and ordination methods operate with similarity or distance between samples. Even PCA and CA, although this is not stated explicitly, are based on Euclidean and chi-square distances, respectively. | Ecological resemblance, which includes similarities and distances between samples, is the basic tool for handling multivariate ecological data. Two samples sharing the same species in the same abundances have the highest similarity (and lowest distance), and the similarity decreases (and distance increases) with the differences in their species composition. All cluster and ordination methods operate with similarity or distance between samples. Even PCA and CA, although this is not stated explicitly, are based on Euclidean and chi-square distances, respectively. | ||

- | ==== Similarity, dissimilarity and distance ==== | + | ===== Similarity, dissimilarity and distance ===== |

- | Intuitively, one thinks about **similarity** among objects - the more are two objects similar in terms of their properties, the higher is their similarity. In the case of species composition data, the similarity is calculated using similarity indices, ranging from 0 (the samples do not share any species) to 1 (samples have identical species composition). Ordination techniques are usually based on distances, because they need to localize the samples in a multidimensional space; clustering methods could usually handle both similarities or distances. **Distances** are of two types, either dissimilarities, converted from analogous similarity indices, or specific distance measures, such as Euclidean, which doesn't have a counterpart in any similarity index. While all similarity indices can be converted into distances, not all distances could be converted into similarities (as is true e.g. for Euclidean distance). | + | Intuitively, one thinks about **similarity** among objects - the more are two objects similar in terms of their properties, the higher is their similarity. In the case of species composition data, the similarity is calculated using similarity indices, ranging from 0 (the samples do not share any species) to 1 (samples have identical species composition). Ordination techniques are usually based on distances, because they need to localize the samples in a multidimensional space; clustering methods could usually handle both similarities or distances. **Distances** are of two types, either dissimilarity, converted from analogous similarity indices, or specific distance measures, such as Euclidean, which doesn't have a counterpart in any similarity index. While all similarity indices can be converted into distances, not all distances could be converted into similarities (as is true e.g. for Euclidean distance). |

There are a number of measures of similarities or distances ([[references|Legendre & Legendre 2012]] list around 30 of them). The first decision one has to make is whether the aim is R- or Q-mode analysis (R-mode focuses on differences among species, Q-mode on differences among samples), since some of the measures differ between both modes (e.g. Pearson's //r// correlation coefficient makes sense for association between species (R-mode), but not for association between samples (Q-mode); in contrast, e.g. Sørensen index can be used in both Q- and R-mode analysis, called Dice index in R-mode analysis). Further, if focusing on differences between samples (Q-mode), the most relevant measures in ecology are asymmetric indices ignoring double zeros (more about //double-zero problem// below). Then, it also depends whether the data are qualitative (i.e. binary, presence-absence) or quantitative (species abundances). In the case of distance indices, an important criterion is whether they are metric (they can be displayed in Euclidean space) or not, since this influences the choice of the index for some ordination or clustering methods. | There are a number of measures of similarities or distances ([[references|Legendre & Legendre 2012]] list around 30 of them). The first decision one has to make is whether the aim is R- or Q-mode analysis (R-mode focuses on differences among species, Q-mode on differences among samples), since some of the measures differ between both modes (e.g. Pearson's //r// correlation coefficient makes sense for association between species (R-mode), but not for association between samples (Q-mode); in contrast, e.g. Sørensen index can be used in both Q- and R-mode analysis, called Dice index in R-mode analysis). Further, if focusing on differences between samples (Q-mode), the most relevant measures in ecology are asymmetric indices ignoring double zeros (more about //double-zero problem// below). 
Then, it also depends whether the data are qualitative (i.e. binary, presence-absence) or quantitative (species abundances). In the case of distance indices, an important criterion is whether they are metric (they can be displayed in Euclidean space) or not, since this influences the choice of the index for some ordination or clustering methods. | ||

- | [[references|Legendre & Legendre (2012)]] offers a kind of "key" how to select an appropriate measure for given data and problem (Tables 7.4-7.6). Generally, as a rule of thumb, Bray-Curtis and Hellinger distances are better choices than Euclidean or Chi-square distances. | + | [[references|Legendre & Legendre (2012)]] offer a key for selecting an appropriate measure for the given data and problem (see their Tables 7.4-7.6). Generally, as a rule of thumb, Bray-Curtis and Hellinger distances are better choices for species composition data than Euclidean or chi-square distances. |

- | ==== Double-zero problem ==== | + | ===== Double-zero problem ===== |

- | "Double zero" is a situation when certain species is missing in both compared community samples for which similarity/distance is calculated. Species missing simultaneously in two samples can mean the following: (1) samples are located outside of the species ecological niche, but one cannot say whether both samples are on the same side of the ecological gradient (i.e. they can be rather ecologically similar, samples A and B on <imgref double-zero-curve>) or they are on the opposite sides (and hence very different, samples A and C). Alternatively, (2) samples are located inside species ecological niche (samples D and E), but the species in given samples it does not occur, since it didn’t get there (dispersal limitation), or it was present, but overlooked and not sampled (sampling bias). In both cases, the double zero represents missing information, which cannot offer an insight into the ecology of compared samples. | + | "Double zero" is a situation when certain species is missing in both compared community samples for which similarity/distance is calculated. Species missing simultaneously in two samples can mean the following: (1) samples are located outside of the species ecological niche, but one cannot say whether both samples are on the same side of the ecological gradient (i.e. they can be rather ecologically similar, samples A and B on <imgref double-zero-curve>) or they are on the opposite sides (and hence very different, samples A and C). Alternatively, (2) samples are located inside species ecological niche (samples D and E), but the species in given samples does not occur, since it didn’t get there (dispersal limitation), or it was present, but overlooked and not sampled (sampling bias). In both cases, the double zero represents missing information, which cannot offer an insight into the ecology of compared samples. |

<imgcaption double-zero-curve |Response curve of a single species along environmental gradient; A, B..., E are samples located within or outside the species niche.>{{ :obrazky:double-zero-illustration.png?direct&400|}}</imgcaption> | <imgcaption double-zero-curve |Response curve of a single species along environmental gradient; A, B..., E are samples located within or outside the species niche.>{{ :obrazky:double-zero-illustration.png?direct&400|}}</imgcaption> | ||

Line 27: | Line 25: | ||

<imgcaption double-zero-table |For details see the text.>{{ :obrazky:double-zero-table.jpg?direct&400|}}</imgcaption> | <imgcaption double-zero-table |For details see the text.>{{ :obrazky:double-zero-table.jpg?direct&400|}}</imgcaption> | ||

- | <imgref double-zero-table> shows an ecological example of double zero problem. Samples 1 to 3 are sorted according to the wetness of their habitat – sample 1 is the wettest and sample 3 is the driest. In samples 1 and 3, no mesic species occur, since sample 1 is too wet and sample 3 too dry - these is the double zero. The fact that the mesic species is missing does not say anything about ecological similarity or difference between both samples; simply there is no information, and it is better to ignore it. In the case of symmetrical indices of similarity, the absence of mesic species in sample 1 and sample 3 (0-0, double zero) will increase similarity of sample 1 and 2; in asymmetrical indices, double zeros will be ignored and only presences (1-1, 1-0, 0-1) will be considered. | + | <imgref double-zero-table> shows an ecological example of the double-zero problem. Samples 1 to 3 are sorted according to the wetness of their habitat – sample 1 is the wettest and sample 3 is the driest. In samples 1 and 3, no mesic species occur, since sample 1 is too wet and sample 3 too dry - this is a double zero. The fact that the mesic species are missing does not say anything about ecological similarity or difference between the two samples; there is simply no information, and it is better to ignore it. In the case of symmetrical indices of similarity, the absence of mesic species in sample 1 and sample 3 (0-0, double zero) will increase the similarity of samples 1 and 3; in asymmetrical indices, double zeros will be ignored and only presences (1-1, 1-0, 0-1) will be considered. |
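The effect of double zeros can be sketched numerically. The following is a minimal illustration in Python, assuming a hypothetical presence-absence table with two wet, three mesic and two dry species (the actual values in the figure are not reproduced here). The symmetric index used is the simple matching coefficient, (//a//+//d//)/(//a//+//b//+//c//+//d//), where //d// is the number of double zeros; the asymmetric one is Jaccard.

```python
def match_counts(x, y):
    """Count a (1-1), b (1-0), c (0-1) and d (0-0) species for two binary samples."""
    a = sum(1 for p, q in zip(x, y) if p and q)
    b = sum(1 for p, q in zip(x, y) if p and not q)
    c = sum(1 for p, q in zip(x, y) if not p and q)
    d = sum(1 for p, q in zip(x, y) if not p and not q)
    return a, b, c, d


def simple_matching(x, y):
    """Symmetric index: double zeros (d) count as agreement."""
    a, b, c, d = match_counts(x, y)
    return (a + d) / (a + b + c + d)


def jaccard(x, y):
    """Asymmetric index: double zeros are ignored (undefined for two empty samples)."""
    a, b, c, _ = match_counts(x, y)
    return a / (a + b + c)


# Hypothetical data; species order: wet, wet, mesic, mesic, mesic, dry, dry
sample_1 = [1, 1, 0, 0, 0, 0, 0]  # wettest sample, only wet species present
sample_3 = [0, 0, 0, 0, 0, 1, 1]  # driest sample, only dry species present

smc_13 = simple_matching(sample_1, sample_3)  # 3 double zeros out of 7 species
jac_13 = jaccard(sample_1, sample_3)          # no shared species

print(f"simple matching: {smc_13:.3f}, Jaccard: {jac_13:.3f}")
```

In this made-up case the three absent mesic species alone give the symmetric index a value of 3/7, while Jaccard stays at zero because the samples share no species.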

- | ==== Similarity indices ==== | + | ===== Similarity indices ===== |

<tabref similarity-indices> summarizes categories of similarity indices. Symmetric indices, i.e. those which consider double zeros as relevant, are not further treated here since they are not useful for analysis of ecological data (although they may be useful e.g. for analysis of environmental variables if they are binary). Here we will consider only asymmetric similarity indices, i.e. those ignoring double zeros. These split into two types according to the data which they are using: qualitative (binary) indices, applied on presence-absence data, and quantitative indices, applied on raw (or transformed) species abundances. Note that some of the indices also have multi-sample alternatives (i.e. they could be calculated on more than two samples), which could be used for calculating beta diversity. | <tabref similarity-indices> summarizes categories of similarity indices. Symmetric indices, i.e. those which consider double zeros as relevant, are not further treated here since they are not useful for analysis of ecological data (although they may be useful e.g. for analysis of environmental variables if they are binary). Here we will consider only asymmetric similarity indices, i.e. those ignoring double zeros. These split into two types according to the data which they are using: qualitative (binary) indices, applied on presence-absence data, and quantitative indices, applied on raw (or transformed) species abundances. Note that some of the indices also have multi-sample alternatives (i.e. they could be calculated on more than two samples), which could be used for calculating beta diversity. | ||

Line 57: | Line 55: | ||

**Simpson similarity**: <m>Si~=~{a}/{a+min(b,c)}</m> | **Simpson similarity**: <m>Si~=~{a}/{a+min(b,c)}</m> | ||

</WRAP> | </WRAP> | ||

- | **Jaccard similarity index** divides the number of species shared by both samples (fraction //a//) by the sum of all species occurring in both samples (//a//+//b//+//c//, where //b// and //c// are numbers of species occurring only in the first and only in the second sample, respectively). **Sørensen similarity index** considers the number of species shared among both samples as more important, so it counts it twice. **Simpson similarity index** is useful in a case that compared samples largely differ in species richness (i.e. one sample has considerably more species than the other). If Jaccard or Sorensen are used on such data, their values are generally very low, since the fraction of species occurring only in the rich sample will make the denominator too large and the overall value of the index too low; Simpson index, which was originally introduced for comparison of fossil data, eliminates this problem by taking only the smaller from the fractions //b// and //c//((Note that there is yet another Simpson index at this website, namely Simpson diversity index; except the author's name, these two indices have nothing in common, and each was named after different Mr. Simspon.)). | + | **Jaccard similarity index** divides the number of species shared by both samples (fraction //a//) by the sum of all species occurring in both samples (//a//+//b//+//c//, where //b// and //c// are numbers of species occurring only in the first and only in the second sample, respectively). **Sørensen similarity index** considers the number of species shared among both samples as more important, so it counts it twice. **Simpson similarity index** is useful when the compared samples differ greatly in species richness (i.e. one sample has considerably more species than the other). 
If Jaccard or Sørensen are used on such data, their values are generally very low, since the fraction of species occurring only in the rich sample will make the denominator too large and the overall value of the index too low; Simpson index, which was originally introduced for comparison of fossil data, eliminates this problem by taking only the smaller of the fractions //b// and //c//. (Note that there is yet another Simpson index, namely //Simpson diversity index//; each of the indices was named after a different Mr. Simpson, and while the Simpson similarity index calculates similarity between a pair of compositional samples, the Simpson diversity index calculates the diversity of a single community sample; you may find details in my [[http://davidzeleny.net/blog/2017/03/18/simpsons-similarity-index-vs-simpsons-diversity-index/|blog post]]). |
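The three binary indices can be compared on a small made-up case. The sketch below, in Python, assumes two hypothetical samples stored as sets of species names: a rich one with ten species and a poor one whose three species all occur in the rich one. The function names are illustrative only; the formulas follow the description above (Jaccard //a///(//a//+//b//+//c//), Sørensen 2//a///(2//a//+//b//+//c//), Simpson //a///(//a//+min(//b//,//c//))).

```python
def shared_and_unique(sample_1, sample_2):
    """Return a (shared species), b (only in sample 1) and c (only in sample 2)."""
    a = len(sample_1 & sample_2)
    b = len(sample_1 - sample_2)
    c = len(sample_2 - sample_1)
    return a, b, c


def jaccard(sample_1, sample_2):
    a, b, c = shared_and_unique(sample_1, sample_2)
    return a / (a + b + c)


def sorensen(sample_1, sample_2):
    a, b, c = shared_and_unique(sample_1, sample_2)
    return 2 * a / (2 * a + b + c)  # shared species counted twice


def simpson(sample_1, sample_2):
    a, b, c = shared_and_unique(sample_1, sample_2)
    return a / (a + min(b, c))  # only the smaller of b and c is used


# Hypothetical samples: the poor sample is nested within the rich one
rich = {f"sp{i}" for i in range(1, 11)}  # sp1 ... sp10
poor = {"sp1", "sp2", "sp3"}

print(f"Jaccard:  {jaccard(rich, poor):.3f}")
print(f"Sørensen: {sorensen(rich, poor):.3f}")
print(f"Simpson:  {simpson(rich, poor):.3f}")
```

Here //a// = 3, //b// = 7 and //c// = 0, so Jaccard (3/10) and Sørensen (6/13) are pulled down by the seven species found only in the rich sample, while Simpson equals 1 because the poor sample is a complete subset of the rich one.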

<WRAP right box 65%> | <WRAP right box 65%> | ||

Line 69: | Line 67: | ||

</WRAP> | </WRAP> | ||

- | ** Quantitative similarity indices ** (applied on raw abundances) include **percentage similarity**, which is a quantitative version of Sorensen similarity index (which means that if calculated on presence-absence data, it gives the same results are Sørensen similarity index). Note that //percentage difference//, calculated as 1-//percentage similarity//, is called Bray-Curtis index. | + | ** Quantitative similarity indices ** (applied on raw abundances) include **percentage similarity**, which is a quantitative version of the Sørensen similarity index (which means that if calculated on presence-absence data, it gives the same results as the Sørensen similarity index). Note that //percentage difference//, calculated as 1-//percentage similarity//, is called the Bray-Curtis index. |

==== Distance indices ==== | ==== Distance indices ==== | ||

- | While similarity indices return the highest value in case that both compares samples are identical (maximally similar), distance indices are largest for two samples which do not share any species (are maximally dissimilar). There are two types of distance (or dissimilarity) indices((Note that the use of "distance" and "dissimilarity" is somewhat not systematic; some authors call distances only those indices which are metric (Euclidean), i.e. can be displayed in metric (Euclidean) geometric space, and the other indices are called dissimilarities; but sometimes these two terms are simply synonyms.)): | + | While similarity indices return the highest value when both compared samples are identical (maximally similar), distance indices are largest for two samples which do not share any species (are maximally dissimilar). There are two types of distance (or dissimilarity) indices((Note that the use of "distance" and "dissimilarity" is not entirely systematic; some authors call distances only those indices which are metric (Euclidean), i.e. can be displayed in metric (Euclidean) geometric space, and the other indices are called dissimilarities; but sometimes these two terms are simply synonyms.)): |

- | - those **calculated from similarity indices**, usually as **D = 1 - S**, where **S** is the similarity index (e.g. Jaccard, Sørensen, Simpson for qualitative (binary) indices, and percentage dissimilarity (known also as Bray-Curtis distance) for quantitative); | + | - those **calculated from similarity indices**, usually as **D = 1 - S**, where **S** is the similarity index; examples include Jaccard, Sørensen and Simpson dissimilarity for qualitative (binary) data, and percentage difference (known also as Bray-Curtis distance) for quantitative data; |

- | - those **distances which have no analog in the similarity** indices (e.g. Euclidean, chord, Hellinger or chi-square distance index). | + | - those **distances which have no analogue in the similarity** indices, e.g. Euclidean, chord, Hellinger or chi-square distance index. |

<imgcaption triangle-inequality|Triangle inequality principle.>{{ :obrazky:triangle_inequality_principle.jpg?direct&500|}}</imgcaption> | <imgcaption triangle-inequality|Triangle inequality principle.>{{ :obrazky:triangle_inequality_principle.jpg?direct&500|}}</imgcaption> | ||

- | An important criterium is **whether the distance index is metric or not** (i.e. it is semi-metric or non-metric). The term "metric" refers to the indices which can be displayed in the orthogonal Euclidean space, since they obey so-called "triangle inequality principle" (see explanation in <imgref triangle-inequality>). Some dissimilarity indices calculated from similarities are metric (e.g. Jaccard dissimilarity), some are not (e.g. Sørensen dissimilarity and it's quantitative version called Bray-Curtis dissimilarity). In the case of Sørensen and Bray-Curtis (and some others) this can be solved by calculating the dissimilarity as <m>D~=~sqrt{1-S}</m> instead of the standard <m>D~=~1-S</m> (where S is the similarity); resulting dissimilarity index is then metric. Indices which are not metric cause troubles in ordination methods relying on Euclidean space (PCoA or db-RDA) and numerical clustering algorithms which need to locate samples in the Euclidean space, such as Ward algorithm or K-means); e.g. PCoA calculated using distances which are not metric returns axes with negative eigenvalues, and this e.g. in db-RDA may result in virtually higher variation explained by explanatory variables than would reflect the data. | + | An important criterion is **whether the distance index is metric or not** (i.e. it is semi-metric or non-metric). The term "metric" refers to the indices which can be displayed in the orthogonal Euclidean space, since they obey the so-called "triangle inequality principle" (see explanation in <imgref triangle-inequality>). Some dissimilarity indices calculated from similarities are metric (e.g. Jaccard dissimilarity), some are not (e.g. Sørensen dissimilarity and its quantitative version called Bray-Curtis dissimilarity). 
In the case of Sørensen and Bray-Curtis (and some others), this can be solved by calculating the dissimilarity as <m>D~=~sqrt{1-S}</m> instead of the standard <m>D~=~1-S</m> (where S is the similarity); the resulting dissimilarity index is then metric. Indices which are not metric cause trouble in ordination methods relying on Euclidean space (PCoA or db-RDA) and in numerical clustering algorithms which need to locate samples in the Euclidean space (such as Ward algorithm or K-means). For example, PCoA calculated using distances which are not metric creates axes with negative eigenvalues, and in db-RDA this may result in an apparently higher proportion of variation explained by explanatory variables than the data actually support. |
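The triangle inequality can be checked by hand on a tiny example. The Python sketch below assumes three hypothetical samples (A contains only species 1, B only species 2, C both) and tests whether D(A,B) ≤ D(A,C) + D(C,B) for the Sørensen dissimilarity computed as 1-S and as sqrt(1-S). It covers a single triple of samples, so it demonstrates that 1-S can fail; it does not by itself prove that the square-root version is metric in general.

```python
import math


def sorensen_similarity(sample_1, sample_2):
    """Sørensen similarity 2a / (2a + b + c) for samples given as sets of species."""
    a = len(sample_1 & sample_2)
    b = len(sample_1 - sample_2)
    c = len(sample_2 - sample_1)
    return 2 * a / (2 * a + b + c)


def obeys_triangle(d_ab, d_ac, d_cb, tol=1e-12):
    """True if the direct distance A-B is not longer than the detour via C."""
    return d_ab <= d_ac + d_cb + tol


# Hypothetical samples
A, B, C = {"sp1"}, {"sp2"}, {"sp1", "sp2"}

# Standard conversion D = 1 - S
raw = {pair: 1 - sorensen_similarity(x, y)
       for pair, (x, y) in {"AB": (A, B), "AC": (A, C), "CB": (C, B)}.items()}
raw_ok = obeys_triangle(raw["AB"], raw["AC"], raw["CB"])

# Square-root conversion D = sqrt(1 - S)
rooted = {pair: math.sqrt(d) for pair, d in raw.items()}
sqrt_ok = obeys_triangle(rooted["AB"], rooted["AC"], rooted["CB"])

print("1 - S      :", raw, "triangle inequality holds:", raw_ok)
print("sqrt(1 - S):", rooted, "triangle inequality holds:", sqrt_ok)
```

With 1-S the direct distance A-B is 1, while the detour via C is only 1/3 + 1/3, so the inequality fails; after the square-root conversion the detour is about 1.155 and the inequality holds for this triple.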

- | **Bray-Curtis dissimilarity** or **percentage difference**((Note that according to P. Legendre, Bray-Curtis index should not be called after Bray and Curtis, since they have not really published it, only used it.)) is one complement of //percentage similarity// index described above. It is considered as suitable for community composition data, since it ignores double zeros, and it has a meaningful upper value equal to 1 (meaning complete mismatch between species composition of two samples, i.e. if one species in one sample is present and has some abundance, the same species in the other samples is zero, and vice versa). Bray-Curtis considers absolute species abundances in the samples, not only relative species abundances. The index is not metric, but the version calculated as <m>sqrt{1-PS}</m> (where PS is percentage similarity) is metric and can be used in PCoA. | + | **Bray-Curtis dissimilarity** or **percentage difference**((Note that according to P. Legendre, Bray-Curtis index should not be called after Bray and Curtis, since they have not really published it, only used it.)) is one complement of //percentage similarity// index described above. It is considered suitable for community composition data, since it ignores double zeros, and it has a meaningful upper value equal to one (meaning complete mismatch between species composition of two samples, i.e. if one species in one sample is present and has some abundance, the same species in the other samples is zero, and vice versa). Bray-Curtis considers absolute species abundances in the samples, not only relative species abundances. The index is not metric, but the version calculated as <m>sqrt{1-PS}</m> (where PS is percentage similarity) is metric and can be used in PCoA. |
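A short numerical sketch of percentage similarity and its complement may help here. The Python code below assumes the commonly used formulation PS = 2W/(A+B), where W is the sum of the minimum abundances of each species across the two samples and A and B are the total abundances of the samples; the sample values and function names are hypothetical.

```python
def percentage_similarity(x, y):
    """PS = 2W / (A + B); W = sum of per-species minima, A and B = sample totals."""
    w = sum(min(p, q) for p, q in zip(x, y))
    return 2 * w / (sum(x) + sum(y))


def bray_curtis(x, y):
    """Percentage difference (Bray-Curtis dissimilarity) as 1 - PS."""
    return 1 - percentage_similarity(x, y)


# Hypothetical abundance data (three species)
same_species = bray_curtis([0, 1, 1], [0, 4, 8])   # shared species, different abundances
no_overlap = bray_curtis([3, 0, 0], [0, 2, 5])     # complete mismatch

# On presence-absence data PS reduces to Sørensen: a = 1, b = 1, c = 1 -> 2a/(2a+b+c) = 0.5
binary_ps = percentage_similarity([1, 1, 0], [1, 0, 1])

print(f"shared species, different abundances: {same_species:.3f}")
print(f"no shared species:                    {no_overlap:.3f}")
print(f"PS on binary data:                    {binary_ps:.3f}")
```

The second pair reaches the upper limit of 1 because no species is present in both samples, and the binary case returns the Sørensen value, as stated above.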

<WRAP right box 35%> | <WRAP right box 35%> | ||

Line 86: | Line 84: | ||

</WRAP> | </WRAP> | ||

- | **Euclidean distance**, although not suitable for ecological data, is frequently used in a multivariate analysis (mostly because it is the implicit distance for linear ordination methods like PCA, RDA and for some clustering algorithms). Euclidean distance has no upper limit and the maximum value depends on the data. The main reason why it is not suitable for compositional data is that it is a symmetrical index, i.e. it treats double zeros in the same way as double presences. Double zeros shrink the distance between two plots, leading to the paradox when two samples having no species in common have smaller Euclidean distance than two samples actually sharing all species. The solution is to apply Euclidean distances on pre-transformed species composition data (e.g. using Hellinger, Chord or chi-square transformation). An example of calculating Euclidean distance between samples with only two species is on <imgref eucl-dist>. | + | **Euclidean distance**, although not suitable for ecological data, is frequently used in a multivariate analysis (mostly because it is the implicit distance for linear ordination methods like PCA, RDA and for some clustering algorithms). Euclidean distance has no upper limit and the maximum value depends on the data. The main reason why it is not suitable for compositional data is that it is a symmetrical index, i.e. it treats double zeros in the same way as double presences. Double zeros shrink the distance between two plots. The solution is to apply Euclidean distances on pre-transformed species composition data (e.g. using Hellinger, Chord or chi-square transformation). An example of calculating Euclidean distance between samples with only two species is on <imgref eucl-dist>. |

<imgcaption eucl-dist|Euclidean distance between two samples with only two species.>{{ :obrazky:schema-calculating-eucl-distance.png?direct&400|}}</imgcaption> | <imgcaption eucl-dist|Euclidean distance between two samples with only two species.>{{ :obrazky:schema-calculating-eucl-distance.png?direct&400|}}</imgcaption> | ||
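The effect of pre-transformation can be sketched as follows. This Python example assumes the Hellinger transformation in its usual form (square root of the relative abundance of each species within a sample) and three small hypothetical samples; Euclidean distance is then computed on the transformed values.

```python
import math


def hellinger_transform(sample):
    """Square root of relative abundances (assumes the sample total is not zero)."""
    total = sum(sample)
    return [math.sqrt(value / total) for value in sample]


def euclidean(x, y):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


# Hypothetical samples (three species)
sample_1 = [0, 1, 1]
sample_2 = [1, 0, 0]  # shares no species with sample 1
sample_3 = [0, 4, 8]  # shares all species with sample 1, higher abundances

h1, h2, h3 = (hellinger_transform(s) for s in (sample_1, sample_2, sample_3))

d12 = euclidean(h1, h2)  # samples without shared species
d13 = euclidean(h1, h3)  # samples with the same species

print(f"Hellinger distance 1-2: {d12:.3f}")
print(f"Hellinger distance 1-3: {d13:.3f}")
```

After the transformation the two samples without shared species are at distance sqrt(2), the maximum of this distance, while the two samples sharing all species are much closer.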

Line 96: | Line 94: | ||

**Chi-square distance** is rarely calculated itself, but is important since it is implicit for CA and CCA ordination. | **Chi-square distance** is rarely calculated itself, but is important since it is implicit for CA and CCA ordination. | ||

- | ==== Euclidean distance: paradox caused by double zeros ==== | + | ===== Euclidean distance: abundance paradox ===== |

- | If calculated by Euclidean distance, double zeros cause the distance between two samples to shrink, making them actually more similar. As a result, two samples not sharing any species could have lower Euclidean distance than two samples sharing species (see the example below). | + | When comparing two samples, Euclidean distance puts more weight on differences in species abundances than on difference in species presences. As a result, two samples not sharing any species could appear more similar (with lower Euclidean distance) than two samples which share species but the species largely differ in their abundances (see the example below). |

- | The original species composition matrix (note that samples 1 and 2 does not share species, while samples 1 and 3 share all species but differ in abundances). | + | In the species composition matrix below, samples 1 and 2 do not share any species, while samples 1 and 3 share all species but differ in abundances (e.g. species 3 has abundance 1 in sample 1 and abundance 8 in sample 3): |

| | **Species 1** | **Species 2** | **Species 3** | | | | **Species 1** | **Species 2** | **Species 3** | | ||

| **Sample 1** | 0 | 1 | 1 | | | **Sample 1** | 0 | 1 | 1 | | ||

Line 108: | Line 106: | ||

<m>D_Eucl~(Sample 1, Sample 3)~=~sqrt{(0-0)^2+(1-4)^2+(1-8)^2}~=~7.615</m> | <m>D_Eucl~(Sample 1, Sample 3)~=~sqrt{(0-0)^2+(1-4)^2+(1-8)^2}~=~7.615</m> | ||

+ | Euclidean distance between sample 1 and 2 is lower than between sample 1 and 3, although samples 1 and 2 have no species in common, while sample 1 and 3 share all species. | ||
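The comparison can be reproduced with a few lines of Python. Samples 1 and 3 use the abundances that appear in the formula above; the row for sample 2 is not shown in this part of the page, so the values used for it below (a single species absent from sample 1) are an assumption made only for illustration.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two samples with abundances in the same species order."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


sample_1 = [0, 1, 1]
sample_2 = [1, 0, 0]  # ASSUMED values: shares no species with sample 1
sample_3 = [0, 4, 8]  # taken from the formula for D_Eucl(Sample 1, Sample 3)

d12 = euclidean(sample_1, sample_2)
d13 = euclidean(sample_1, sample_3)

print(f"D_Eucl(sample 1, sample 2) = {d12:.3f}")
print(f"D_Eucl(sample 1, sample 3) = {d13:.3f}")
print("samples without shared species appear closer:", d12 < d13)
```

Under this assumption the distance between samples 1 and 2 is sqrt(3), well below the sqrt(58) between samples 1 and 3.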

- | ==== Matrix of similarities/distances ==== | + | ===== Matrix of similarities/distances ===== |

The matrix of similarities or distances is square (the same number of rows as columns), with the values on the diagonal either zeros (distances) or ones (similarities), and symmetric - the upper right triangle is a mirror of values in the lower left one (<imgref dist-matrix>). | The matrix of similarities or distances is square (the same number of rows as columns), with the values on the diagonal either zeros (distances) or ones (similarities), and symmetric - the upper right triangle is a mirror of values in the lower left one (<imgref dist-matrix>). | ||

- | <imgcaption dist-matrix|Matrix of Eculidean distances calculated between all pairs of samples (a subset of 10 samples from Ellenberg's Danube meadow dataset used). Diagonal values (yellow) are zeros, since distance of two identical samples is zero.>{{:obrazky:eucl-dist-matrix-danube-data.jpg?direct|}}</imgcaption> | + | <imgcaption dist-matrix|Matrix of Euclidean distances calculated between all pairs of samples (a subset of 10 samples from Ellenberg's Danube meadow dataset used). Diagonal values (yellow) are zeros since the distance of two identical samples is zero.>{{ :obrazky:eucl-dist-matrix-danube-data.jpg?direct |}}</imgcaption> |
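The structure of such a matrix can be sketched in Python. The example assumes four hypothetical samples (not the Danube meadow data) and builds the full matrix of pairwise Euclidean distances as a list of lists, then checks the properties described above.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))


def distance_matrix(samples):
    """Full square matrix of pairwise distances between all samples."""
    return [[euclidean(row_i, row_j) for row_j in samples] for row_i in samples]


# Hypothetical species composition data: 4 samples x 3 species
samples = [
    [0, 1, 1],
    [1, 0, 0],
    [0, 4, 8],
    [2, 2, 0],
]

matrix = distance_matrix(samples)
n = len(matrix)

is_square = all(len(row) == n for row in matrix)
zero_diagonal = all(matrix[i][i] == 0 for i in range(n))
is_symmetric = all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))

for row in matrix:
    print("  ".join(f"{value:6.3f}" for value in row))
print("square:", is_square, "| zero diagonal:", zero_diagonal, "| symmetric:", is_symmetric)
```

Because of the symmetry, only the lower (or upper) triangle carries information, which is why many programs store and print just one triangle.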

en/similarity.1490050360.txt.gz · Last modified: 2017/10/11 20:36 (external edit)