Typical data pre-processing steps

The exact nature of a data pre-processing sequence depends on the specifics of the project and is hard to generalize. The input data can be based on a variety of sources including (but not restricted to) expert knowledge, modelling, and indirect (surrogate) and direct observational data. The input data may also be derived from a combination of these source types. There are a few steps, however, that are almost always encountered when pre-processing the input data.

  1. Data type conversion. Zonation uses exclusively raster files as input data, so whatever type of spatial data is used originally, it has to be converted to raster format at some point during pre-processing. The most common spatial type conversion is from vector (polygon) data to raster format. When converting categorical vector data, such as habitat or land cover data, into input rasters, there is a conceptual and practical difference between converting the whole vector file into a single raster file, with raster cell values indicating the class each cell belongs to, and converting each class into a separate raster file (see Section 3.1.1). While doing the conversion, you will have to pay attention to both of the following issues.
  2. Setting the geographic extent and resolution. All the spatial input data for Zonation need to have exactly the same spatial extent and resolution. Be careful with this! For some input data Zonation will give you an error if this is not the case, but not always. If your data is derived from a separate modelling step, such as species distribution modelling, then you should produce all the modelling outputs in the correct spatial extent and resolution from the start. In other cases (including data already in raster format), most of the tools used to do the conversion let you set the extent and resolution correctly. But what should these values be? Extent is usually the overall extent of your study area, which, if not defined explicitly, is the spatial union of the extents of all of your inputs. Resolution is often trickier to define. Selecting between a finer and a coarser resolution is essentially a trade-off between analysis detail and computational time. Note that Arponen et al. (2012) showed that Zonation analyses based on the same data may produce different results when run using different initial resolutions for the input data. Furthermore, they concluded that the resolution of the analysis should correspond to the relevant ecological scale (i.e. taking into account species dispersal capabilities, see Section 3.1.3) and management scale (e.g. the size of planning units used). If you have the data and hardware resources available and the increase in computational time is not prohibitive, we suggest that you go for as fine a resolution as is meaningful and possible. Even if using a finer resolution becomes computationally intractable, it may still be better to convert your vector data at high resolution and then aggregate the data to a computationally more feasible lower resolution grid (see below).
  3. Determining the cell value. As explained in Section 3.1.1, the absolute value of a raster cell in a given raster feature can only be meaningfully compared to the values of the other cells within that same feature grid. For the most typical input data types (binary presence/absence of a species range or a habitat, probability of occurrence), the interpretation of the range of values within a single feature is straightforward. If you are producing the input features using some modelling technique, such as species distribution modelling software, you will want to check what the output values of that technique are and whether it would make sense to apply some form of transformation to the values before entering them into Zonation. If you have to aggregate your data to a lower resolution for the actual Zonation analysis, some care must be taken when deciding how to determine the value of the aggregate cell (Lehtomäki and Moilanen, 2013). For the more common data types (probability of occurrence, coverage of habitat type), simply taking the sum of the values of the higher resolution cells that fall within the lower resolution aggregate cell is a good choice. Although the exact spatial arrangement of the higher resolution cells is lost in this process, at least the full quantitative information from the higher resolution data is retained. If your input biodiversity features are already some kind of aggregate quantity (such as species richness), you may consider using e.g. the maximum of the high resolution cell values instead of the sum. One final consideration concerns cells with missing data and cells with the value 0. Generally speaking, if you really have missing data in some parts of the input rasters, then these cells should be assigned an appropriate NoData value instead of 0. This makes a difference, for example, when connectivity transformations are in use. Also, missing data require little RAM in Zonation, making them preferable to 0s from that perspective as well.
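
The aggregation described in step 3 can be illustrated with a minimal Python sketch. It is not part of Zonation and is not a substitute for a GIS tool. It only shows how the sum and the maximum behave on a small grid, and how cells with NoData differ from cells with the value 0. The function name aggregate, the NoData marker -9999.0 and the example values are assumptions made for illustration. The sketch also assumes that the grid dimensions are exact multiples of the aggregation factor.

```python
NODATA = -9999.0  # assumed NoData marker; use the value your raster format declares


def aggregate(grid, factor, method="sum", nodata=NODATA):
    """Aggregate a 2D grid (a list of rows) by an integer factor.

    Each output cell summarises a factor x factor block of input cells.
    NoData cells are skipped, and a block that contains only NoData
    cells stays NoData instead of becoming 0.
    """
    if method not in ("sum", "max"):
        raise ValueError("method must be 'sum' or 'max'")
    n_rows, n_cols = len(grid), len(grid[0])
    if n_rows % factor or n_cols % factor:
        raise ValueError("grid dimensions must be divisible by factor")
    reducer = sum if method == "sum" else max
    out = []
    for r0 in range(0, n_rows, factor):
        out_row = []
        for c0 in range(0, n_cols, factor):
            # Collect the valid (non-NoData) values of this block.
            block = [
                grid[r][c]
                for r in range(r0, r0 + factor)
                for c in range(c0, c0 + factor)
                if grid[r][c] != nodata
            ]
            out_row.append(reducer(block) if block else nodata)
        out.append(out_row)
    return out


# Hypothetical 4 x 4 high resolution feature (e.g. probability of occurrence).
fine = [
    [0.25, 0.5, NODATA, NODATA],
    [0.125, 0.125, NODATA, NODATA],
    [0.0, 0.0, 0.5, NODATA],
    [0.0, 0.0, 0.25, 0.25],
]

coarse_sum = aggregate(fine, 2, "sum")  # suits probabilities or habitat coverage
coarse_max = aggregate(fine, 2, "max")  # an option for already aggregated quantities
print(coarse_sum)
print(coarse_max)
```

In this example the upper right block contains only NoData cells and therefore remains NoData, whereas the lower left block contains true zeros and aggregates to 0. Skipping NoData cells inside a partially covered block is one possible design choice. Depending on the data, it may be more appropriate to mark such blocks as NoData altogether.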