Automating pre-processing tasks

Most of the time in a Zonation planning project is typically spent preparing and pre-processing input data. This is because of the complex interplay between setting the high-level objectives, identifying the factors relevant to addressing these objectives, and the data actually available for doing so. The process is almost inevitably iterative, with the different pre-processing steps being repeated at varying frequency. From the longer-term perspective that is necessary especially in operative planning, the prioritization analyses will typically have to be repeated at intervals as objectives change, data are updated, and so on. It is therefore often a good idea to automate the pre-processing tasks as much as possible, given the resources available for a project. As many pre-processing tasks involve manipulating geospatial data in different ways, the exact way of implementing automation depends on the tools you use for that manipulation. Below we briefly describe two approaches that we have used. While software other than that listed here exists, these examples should broadly cover the most typical uses.

Automation using a GIS. Using GIS software, such as ArcGIS¹ or QGIS², is probably the most common way of pre-processing any kind of spatial data. Doing things manually is certainly feasible if you only have to do them once or twice, but a manual approach quickly becomes unwieldy if tasks have to be repeated several times or if there are thousands of layers to begin with. Using the automation tools built into the software may thus be a good choice. For example, ArcGIS (version 10.2.1 at the time of writing) has an integrated facility called ModelBuilder, which lets you chain individual tools together using a graphical user interface. These chains, or models, can later be re-run relatively easily and applied to different data, possibly with different parameters.

Automation using a programming language. The functionality of many GIS software packages can also be controlled programmatically using a high-level language such as Python (Python Development Team, 2014). Furthermore, many programming languages have bindings to popular open source geospatial libraries such as GDAL (GDAL Development Team, 2014) that can be leveraged to do the heavy lifting. The programming approach usually gives you finer and more flexible control over the tools, but it also requires more know-how from the team responsible for the implementation.
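As a concrete illustration, here is a minimal sketch of this approach using the GDAL Python bindings (gdal.Warp is available in GDAL >= 2.1). The folder names, target CRS, extent, resolution and NoData value are hypothetical placeholders, not prescriptions:

```python
# A minimal sketch: batch-harmonize a folder of input feature rasters onto
# a common grid using the GDAL Python bindings. All paths and grid
# parameters below are hypothetical and should be adapted to your project.
import glob
import os

from osgeo import gdal

gdal.UseExceptions()

IN_DIR = "features_raw"        # hypothetical folder of raw input rasters
OUT_DIR = "features_zonation"  # hypothetical folder for prepared rasters
os.makedirs(OUT_DIR, exist_ok=True)

for path in sorted(glob.glob(os.path.join(IN_DIR, "*.tif"))):
    out_path = os.path.join(OUT_DIR, os.path.basename(path))
    # Warp every layer onto the same CRS, extent and resolution so that
    # Zonation receives a consistent stack of input rasters.
    gdal.Warp(
        out_path,
        path,
        dstSRS="EPSG:3035",                         # assumed common CRS
        outputBounds=(4.0e6, 2.0e6, 5.0e6, 3.0e6),  # assumed extent (xmin, ymin, xmax, ymax)
        xRes=100, yRes=100,                         # assumed common resolution
        dstNodata=-1,                               # set the NoData value explicitly
        creationOptions=["COMPRESS=DEFLATE"],       # compressed GeoTIFF output
    )
    print(f"Wrote {out_path}")
```

A script like this can simply be re-run whenever the input data are updated, and it doubles as a precise record of the pre-processing that was applied (see below).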

Automation does not, however, come without a cost. Developing automated pre-processing toolchains can require significant amounts of expertise and time from the project team. Furthermore, if the tools are to be used in the future, they need active maintenance, which again requires dedicated resources. On the other hand, automated toolchains have advantages beyond the time saved on repeated pre-processing tasks. An important but often overlooked feature of automated toolchains (such as ModelBuilder models or scripts) is that they provide detailed documentation of what was actually done to the data before it was fed into Zonation. This may sound trivial, but doing things manually rarely leaves a trace of what was done, and unfortunately people often do not document their steps. Note also that automation is not restricted to the pre-processing of data; as we show later, many aspects of running Zonation and post-processing its results can also be automated.

Box 2. Common pitfalls in data pre-processing.

Depending on the data and the type of Zonation analysis you are developing, several things can go wrong in pre-processing your data. Look out for the following common problems to avoid errors and to manage your Zonation projects more efficiently.

  • Inconsistent resolution and extent between input features. Remember that Zonation requires all of the input data you are using to be at the same resolution (spatial grain). Zonation makes some checks while loading the input data, but there are cases in which features with differing spatial resolution and extent can be loaded without warnings. Generally, the biodiversity features are checked for consistency, but checks for e.g. different analysis masks may be less stringent. A programmatic check, such as the first sketch after this list, can catch these problems early.
  • Misspecified NoData values can cause problems later on in Zonation. Internally, Zonation uses GDAL to read the values from your input feature rasters. The NoData value in your input rasters can be misspecified so that you think it is, for example, -1 whereas in reality it is something else (e.g. -3.40282346639e+038; different GIS software and programming libraries have different defaults for NoData values). In this case Zonation will not know that -1 should be treated as NoData and will erroneously treat cells with that value as legitimate data. Fortunately, this type of problem is typically easy to notice, as your results will have (priority rank) values in all cells across the land- or seascape.
  • Coordinate reference systems (CRSs) can also cause various problems in data pre-processing. While Zonation uses the spatially aware GDAL library for reading spatial data, it does not care about CRSs or projections. In other words, make sure that all input data are in the same CRS and handle any CRS transformations as a pre-processing step in your GIS. Note also that over extensive spatial extents the choice of CRS and projection matters. Most notably, if you are working with global datasets you must take into account the fact that cells at different latitudes will cover different physical areas.
  • While in practice Zonation places no limit on how large the raster files (i.e. features) it reads in can be, having a large number of very large files (gigabytes per file) will make managing your Zonation project more difficult. If this is the case, it is a good idea to consider different ways of reducing the raster file size. Many GIS software packages and spatial programming libraries (such as GDAL) will let you compress your rasters, but might not do so by default. We recommend using GeoTIFFs with compression. Explicitly setting the raster data type to the smallest sufficient type can also greatly reduce file sizes: if you have input features with values ranging between 0 and 255, use an 8-bit unsigned integer type instead of a 32-bit float, and so on. Finally, if the data still remain unwieldy, consider coarsening the resolution by summing e.g. 2x2 grid cells into a single cell of coarser resolution. Doing so loses only the locations of features within the coarser cells; the correct amount (sum) is retained (see Section 4.2). The second sketch after this list demonstrates these techniques.
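A hedged sketch of a pre-flight consistency check addressing the first three pitfalls, assuming the GDAL Python bindings and a hypothetical folder of prepared input rasters; it compares every raster's resolution, extent and CRS against the first one and warns about missing NoData values:

```python
# A sketch of a consistency check to run before Zonation: all rasters in a
# (hypothetical) folder should share the same resolution, extent and CRS,
# and each should have an explicitly set NoData value.
import glob

from osgeo import gdal

gdal.UseExceptions()

reference = None
for path in sorted(glob.glob("features_zonation/*.tif")):
    ds = gdal.Open(path)
    gt = ds.GetGeoTransform()
    props = {
        "resolution": (gt[1], gt[5]),              # cell size (x, y)
        "origin": (gt[0], gt[3]),                  # upper-left corner
        "size": (ds.RasterXSize, ds.RasterYSize),  # columns, rows
        "crs": ds.GetProjection(),                 # CRS as WKT
    }
    if ds.GetRasterBand(1).GetNoDataValue() is None:
        print(f"WARNING: {path} has no NoData value set")
    if reference is None:
        reference = props
        continue
    for key, value in props.items():
        if value != reference[key]:
            print(f"WARNING: {path} differs in {key}: {value}")
```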
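And a sketch of the size-reduction techniques from the last bullet point. The file names and resolutions are hypothetical, and the "sum" resampling algorithm requires GDAL >= 3.1:

```python
# A sketch of reducing raster file sizes: compression, a smaller data type,
# and aggregation to a coarser resolution by summing. All file names and
# cell sizes here are hypothetical.
from osgeo import gdal

gdal.UseExceptions()

# Rewrite a raster with values in 0-255 as a compressed 8-bit unsigned
# integer GeoTIFF instead of, say, a 32-bit float.
gdal.Translate(
    "feature_small.tif",
    "feature_big.tif",
    outputType=gdal.GDT_Byte,
    creationOptions=["COMPRESS=DEFLATE"],
)

# Coarsen the resolution by summing 2x2 blocks of (here) 100 m cells into
# 200 m cells; the total amount within each coarser cell is retained.
gdal.Warp(
    "feature_coarse.tif",
    "feature_small.tif",
    xRes=200, yRes=200,
    resampleAlg="sum",               # requires GDAL >= 3.1
    outputType=gdal.GDT_UInt16,      # sums of 2x2 8-bit cells can exceed 255
    creationOptions=["COMPRESS=DEFLATE"],
)
```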
1. https://www.arcgis.com
2. http://www.qgis.org