User Tools

Site Tools


en:data_import

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
en:data_import [2017/12/24 16:42]
David Zelený [Import of data into R]
en:data_import [2019/02/09 19:44]
David Zelený
Line 6: Line 6:
 [[{|width: 7em; background-color:​ white; color: navy}data_import_exercise|Exercise {{::​lock-icon.png?​nolink|}}]] [[{|width: 7em; background-color:​ white; color: navy}data_import_exercise|Exercise {{::​lock-icon.png?​nolink|}}]]
  
-===== Theory ===== +===== Community ecology data sets =====
-==== Community ecology data sets====+
 Data sets used in community ecology usually consist of one, two or three of the following matrices (see also <imgref three-matrix-diagram>​):​ Data sets used in community ecology usually consist of one, two or three of the following matrices (see also <imgref three-matrix-diagram>​):​
   * the matrix of species composition (//sample// × //species// matrix **L**), ​   * the matrix of species composition (//sample// × //species// matrix **L**), ​
Line 15: Line 14:
 <​imgcaption three-matrix-diagram | Three matrices used in multivariate analysis of community data: matrix of species composition,​ environmental variables (or other sample attributes) and species traits (or other species attributes). >{{ :​obrazky:​three-matrix-diagram_v2.png }}</​imgcaption>​ <​imgcaption three-matrix-diagram | Three matrices used in multivariate analysis of community data: matrix of species composition,​ environmental variables (or other sample attributes) and species traits (or other species attributes). >{{ :​obrazky:​three-matrix-diagram_v2.png }}</​imgcaption>​
  
-==== Types of variables ====+===== Types of variables ​=====
 Variables (be it environmental variables, species attributes, or values in species composition matrix) fits into on of the three main categories: Variables (be it environmental variables, species attributes, or values in species composition matrix) fits into on of the three main categories:
   * **Qualitative** (categorical,​ nominal): individual categories are unique (each observation belongs to only one of them), and they cannot be meaningfully ordered; e.g. geological type, soil type, binary (presence-absence) variables).   * **Qualitative** (categorical,​ nominal): individual categories are unique (each observation belongs to only one of them), and they cannot be meaningfully ordered; e.g. geological type, soil type, binary (presence-absence) variables).
Line 21: Line 20:
   * **Quantitative**:​ variables measuring quantities   * **Quantitative**:​ variables measuring quantities
     * //​discrete//​ vs //​continuous//:​ discrete variables are e.g. numbers of individuals,​ measurements taken with low precision; continuous variables are e.g. precise measurements.     * //​discrete//​ vs //​continuous//:​ discrete variables are e.g. numbers of individuals,​ measurements taken with low precision; continuous variables are e.g. precise measurements.
-    * variables on the //relative (or ratio) scale// or on //interval scale// (<imgref relative-interval-scale>​). The difference is in the meaning of zero: for variables on a relative scale (or ratio scale) zero is a meaningful value (usually it means that the measured variable is not present), while for variables on interval-scale zero is arbitrarily chosen. Variables on interval-scale doesn'​t allow for the relative comparisons (e.g. in case of temperature measured in degrees of Celsius we cannot say that this temperature is twice higher than that one; however, if we measure temperature in Kelvins, which is example of the relative scale variable (since 0°K is absolute zero), we can make such comparison; similarly, e.g. wind speed is the example of variable on the relative scale with meaningful zero - no wind).+    * variables on the //relative (or ratio) scale// or on //interval scale// (<imgref relative-interval-scale>​). The difference is in the meaning of zero: for variables on a relative scale (or ratio scale) zero is a meaningful value (usually it means that the measured variable is not present), while for variables on interval-scale zero is arbitrarily chosen. Variables on interval-scale doesn'​t allow for the relative comparisons (e.g. in case of temperature measured in degrees of Celsius we cannot say that this temperature is twice higher than that one; however, if we measure temperature in Kelvins, which is an example of the relative scale variable (since 0°K is absolute zero), we can make such comparison; similarly, e.g. wind speed is the example of variable on the relative scale with meaningful zero - no wind).
 <​imgcaption relative-interval-scale left|Variable on the relative scale (A), where zero has the real meaning (absence of the measured variable), and on the interval scale (B), where zero is chosen arbitrarily.>​{{ :​obrazky:​variables_on_relative_vs_interval_scale.png }}</​imgcaption>​ <​imgcaption relative-interval-scale left|Variable on the relative scale (A), where zero has the real meaning (absence of the measured variable), and on the interval scale (B), where zero is chosen arbitrarily.>​{{ :​obrazky:​variables_on_relative_vs_interval_scale.png }}</​imgcaption>​
 \\ \\
Line 36: Line 35:
 ^ :::                                    ^ :::                                ^ :::                                       | continuous ​    | temperature,​ length, leaf size                                                  | ^ :::                                    ^ :::                                ^ :::                                       | continuous ​    | temperature,​ length, leaf size                                                  |
 </​tabcaption>​ </​tabcaption>​
-==== Primary data ====+ 
 +===== Primary data =====
 Primary data are at the beginning of all analytical exercise. While collected into notebooks or newly also to some electronic devices like recorders, tablets or smartphones,​ they need to be retyped to a spreadsheet and archived. I found it useful to keep data in a multi-sheet Excel file, each matrix in separate sheet, with one extra sheet containing **metadata** - detail description what kind of data each sheet contains, what is the meaning of abbreviations (if these are used for e.g. environmental variables or species names) and whether there were some changes or corrections done after the data were retyped (there are usually some). Such file can be used in future as a base for all further analyses, keeping all primary data at one place. Primary data are at the beginning of all analytical exercise. While collected into notebooks or newly also to some electronic devices like recorders, tablets or smartphones,​ they need to be retyped to a spreadsheet and archived. I found it useful to keep data in a multi-sheet Excel file, each matrix in separate sheet, with one extra sheet containing **metadata** - detail description what kind of data each sheet contains, what is the meaning of abbreviations (if these are used for e.g. environmental variables or species names) and whether there were some changes or corrections done after the data were retyped (there are usually some). Such file can be used in future as a base for all further analyses, keeping all primary data at one place.
  
-A long term storage of primary data is still a bit complex issue - it seems that the safest way is still to print them on acid-free paper, using the laser printer. The other option is to append data to published papers (usually as an online appendix) - some journals (like //​Ecological Monographs//,​ //Journal of Ecology// etc.) requires attaching the data as a condition for accepting the paper. Another option is to store data in public and managed databases or electronic repositories (e.g. Dryad Digital Repository, www.datadryad.org),​ which (for free or some small payment) offers time-unlimited storage of your data (with the advantage of data being citable by assigning them DOI identifier).+Long term storage of primary data is still a bit complex issue - it seems that the safest way is still to print them on acid-free paper, using the laser printer. The other option is to append data to published papers (usually as an online appendix) - some journals (like //​Ecological Monographs//,​ //Journal of Ecology// etc.) requires attaching the data as a condition for accepting the paper. Another option is to store data in public and managed databases or electronic repositories (e.g. Dryad Digital Repository, www.datadryad.org),​ which (for free or some small payment) offers time-unlimited storage of your data (with the advantage of data being citable by assigning them DOI identifier).
  
-==== Import of data into R ====+===== Import of data into R =====
  
 For analysis, the first step is always to get data into R. In most cases, R expects that community ecology data will have samples in rows and species (or other descriptors,​ like environmental variables) in columns((There are, however, notable exceptions, e.g. functions in the package ''​iNEXT''​ for analysis of diversity (__iN__terpolation/​__ext__rapolation) expect samples in columns and species (descriptors) in rows; this needs to be kept in mind when using it!)). For work in R, a good practice is to save each data matrix into one file (preferentially *.txt file with cells delimited by tabulator), save these files to a certain folder, and at the beginning of the script locate a set of lines reading the data from the files into R working space. This will make the script reproducible - just wrap the script together with the files into a zip file and the analysis (using the same data) can be fully reproduced. There are also other types of import (e.g. directly via the clipboard, from other format types like *.csv, directly from Excel 's *.xls or *xlsx file, etc.), and RStudio contains also a button functionality to load data. Still, I suggest that *.txt format loaded via script is the most stable solution. For analysis, the first step is always to get data into R. In most cases, R expects that community ecology data will have samples in rows and species (or other descriptors,​ like environmental variables) in columns((There are, however, notable exceptions, e.g. functions in the package ''​iNEXT''​ for analysis of diversity (__iN__terpolation/​__ext__rapolation) expect samples in columns and species (descriptors) in rows; this needs to be kept in mind when using it!)). For work in R, a good practice is to save each data matrix into one file (preferentially *.txt file with cells delimited by tabulator), save these files to a certain folder, and at the beginning of the script locate a set of lines reading the data from the files into R working space. This will make the script reproducible - just wrap the script together with the files into a zip file and the analysis (using the same data) can be fully reproduced. There are also other types of import (e.g. directly via the clipboard, from other format types like *.csv, directly from Excel 's *.xls or *xlsx file, etc.), and RStudio contains also a button functionality to load data. Still, I suggest that *.txt format loaded via script is the most stable solution.
Line 47: Line 47:
 Before importing the file into R, make sure that **all variable names are valid R names**. Valid name can contain letters, numbers, dots and underline characters; it should start with a letter (not a number) or dot not followed by a number. R is case sensitive, which means that uppercase and lowercase letters have a different meaning. Examples of **valid names**: ''​var_1'',​ ''​var.1'',​ ''​.var1'',​ ''​var_1.1'',​ ''​Var1''​. **Not valid names** include ''​1var'',​ ''​1_var'',​ ''​.1vars''​. There is also a list of **reserved words** which cannot be used as variable names, since they have fixed meaning in R: ''​if'',​ ''​else'',​ ''​repeat'',​ ''​while'',​ ''​function'',​ ''​for'',​ ''​in'',​ ''​next'',​ ''​break'',​ and also ''​TRUE'',​ ''​FALSE'',​ ''​NULL'',​ ''​Inf'',​ ''​NaN'',​ ''​NA'',​ ''​NA_integer_'',​ ''​NA_real_'',​ ''​NA_complex_'',​ ''​NA_character_''​. Before importing the file into R, make sure that **all variable names are valid R names**. Valid name can contain letters, numbers, dots and underline characters; it should start with a letter (not a number) or dot not followed by a number. R is case sensitive, which means that uppercase and lowercase letters have a different meaning. Examples of **valid names**: ''​var_1'',​ ''​var.1'',​ ''​.var1'',​ ''​var_1.1'',​ ''​Var1''​. **Not valid names** include ''​1var'',​ ''​1_var'',​ ''​.1vars''​. There is also a list of **reserved words** which cannot be used as variable names, since they have fixed meaning in R: ''​if'',​ ''​else'',​ ''​repeat'',​ ''​while'',​ ''​function'',​ ''​for'',​ ''​in'',​ ''​next'',​ ''​break'',​ and also ''​TRUE'',​ ''​FALSE'',​ ''​NULL'',​ ''​Inf'',​ ''​NaN'',​ ''​NA'',​ ''​NA_integer_'',​ ''​NA_real_'',​ ''​NA_complex_'',​ ''​NA_character_''​.
  
-Species names of taxa, consisting of genus, species and sometimes subspecific latin name, may need to be **abbreviated**. This is not because R cannot handle long names of variables (it can, up to 255 characters),​ but because long species names may be difficult to display, e.g. in ordination diagram. Library ''​vegan''​ contains handy function ''​make.cepnames'',​ which abbreviates species names into up to 8 letters abbreviations;​ it takes four first letters from genus name and merges them with first four letters from species names.+Species names of taxa, consisting of the genus, species and sometimes subspecific latin name, may need to be **abbreviated**. This is not because R cannot handle long names of variables (it can, up to 255 characters),​ but because long species names may be difficult to display, e.g. in ordination diagram. Library ''​vegan''​ contains handy function ''​make.cepnames'',​ which abbreviates species names into up to 8 letters abbreviations;​ it takes four first letters from genus name and merges them with first four letters from species names.
  
-[[data_import_examples|Example section]] contains various examples how to import data from other sources and formats into R.+[[data_import_examples|Example section]] contains various examples ​of how to import data from other sources and formats into R.
  
en/data_import.txt · Last modified: 2019/02/09 19:44 by David Zelený