Reproducible Cartography

Timothée Giraud & Nicolas Lambert

Abstract: This paper addresses the production of statistical maps within the reproducible research paradigm. Statistical maps are currently most often produced by combining several software products in a complex toolchain that uses a variety of data and file formats. This diversity of software and formats makes it difficult to reproduce analyses and maps from A to Z. The aim of this paper is to propose a unified workflow that fully integrates map production in a reproducible process. We propose a solution based on the R software through the development of the cartography package, an extension that fills the need for a dedicated thematic mapping solution within the software.

Keywords: Reproducibility, Open-source, R, Statistical cartography, Map workflow

1. Introduction

Scientific claims have to be supported by evidence. Scientific results can be assessed only when the methods and data used by scientists are available, and reproducibility is a key element in validating studies. The main idea behind reproducible research is to release studies with the data and the computer code that support their scientific claims (Peng 2011).

This idea was first addressed by Jon Claerbout, a geophysicist who established a standard toolchain to produce (and reproduce) each figure made in his laboratory (Claerbout and Karrenbach 1992). The will to establish standards and to foster reproducibility is strong in the reproducible research movement (Stodden and Miguez 2013), and scientific journals increasingly consider publishing datasets and code alongside their articles (Peng 2009).

Discussions about reproducible research mostly take place in the computational and statistical sciences. We argue that maps, like other graphics or statistical outputs, are part of scientific studies and should be made as reproducible as possible.

To be considered as evidence, a map should be open to debate and its construction should be transparent. Yet maps produced in an academic context are currently made with a set of software products (spreadsheet, statistical software, GIS…) that fragments the cartographic process. This multiplicity of tools and formats is an impediment to reproducibility. A fully reproducible map should be accompanied by the code and data used to produce it. Figure 1 describes what the spectrum of reproducibility could be for a map.

Fig. 1. The spectrum of map reproducibility

In this paper we propose a unified workflow that fully integrates map production in a reproducible process. This solution is based on the R software through the development of the cartography package, an extension that fills the need for a dedicated thematic mapping solution within the software.

2. From GUI to Script

2.1 A Step Backward?

Most maps produced in an academic context are currently made with GIS or mapping applications that rely on graphical human-computer interaction. The use of graphical user interfaces (GUI) in computer science in general, and in cartography in particular, is undeniably a step toward more user-friendliness. But this step comes at a price: procedures become difficult, if not impossible, to reproduce. To circumvent this weakness, most applications provide languages to build maps within their framework (e.g. ModelBuilder for ArcGIS or Python scripting in QGIS). But these solutions do not easily cover the full workflow from raw data to graphic representation and statistical findings.

To solve this problem, one can use programming languages that are explicitly built on the idea of keeping a trace of computations (data management, statistics, graphics and hence maps).

Moreover, we see three main advantages in these scripting solutions. Firstly, statistical operations can be combined before and after spatial operations within the same tool (i.e. unified toolchain). Building a map involves a set of steps: data management, statistical analyses, geo-processing operations and finally graphical display of results. Using two, three or more software products to conduct these operations introduces breaks in the workflow and multiplies data and file formats. Secondly, scripts can be embedded in literate programming reports (i.e. reproducibility). Literate programming, introduced by Donald Knuth (1992), strongly links analyses, statistical outputs and the code used to produce them in a single document.[1] Finally, most scripting languages are open-source, and the use of open-source frameworks gives full transparency on the methods and tools used to conduct analyses.

Scripting solutions can appear as a step back to cartographers who learned computer cartography with proprietary software languages (ARC Macro Language designed by ESRI in 1986, or SAS macros). We argue that, in the reproducible research framework, researchers have to use literate programming solutions that enable the full traceability of their studies.

A script that describes every step of the process, from raw data to finished vector thematic maps, including all statistical and geometrical transformations, can be considered a full metadata document.
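As an illustration of a script acting as a trace of the whole process, here is a minimal sketch in base R. It is not taken from the paper: the region identifiers, the counts and the output file name are made-up toy values, and a bar chart stands in for the map so that no spatial package is needed. Each step (data management, statistics, graphics) is written out and can be rerun as is.

```r
# Step 1 - data management: raw data (toy values, inline here;
# in a real study they would be read from a data file).
raw <- read.csv(text = "id,active,unemployed
R1,1000,80
R2,2500,300
R3,1800,90
R4,900,135", stringsAsFactors = FALSE)

# Step 2 - statistics: derive an indicator, then classify it
# into two classes split at the median.
raw$rate <- 100 * raw$unemployed / raw$active
breaks <- quantile(raw$rate, probs = c(0, 0.5, 1))
raw$class <- cut(raw$rate, breaks = breaks, include.lowest = TRUE,
                 labels = c("low", "high"))

# Step 3 - graphics: write the figure to a file (hypothetical file name).
out <- file.path(tempdir(), "rate.pdf")
pdf(out)
barplot(raw$rate, names.arg = raw$id,
        col = c("grey80", "grey40")[as.integer(raw$class)],
        ylab = "Unemployment rate (%)")
dev.off()
```

Because the classification rule and the graphical choices are stated in the script, a reader can see how the figure was obtained and change any step without redoing the others by hand.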

2.2 R as a Go-To Tool for Integrated Analysis

Several programming languages and software products can be used to conduct studies in a unified workflow that integrates data handling, data processing (including spatial analysis procedures) and spatial representations. Among them, Python and R emerge as the most prominent.

Python appears to be more versatile than R, which is more focused on data. Spatial libraries exist for both languages, but the R spatial ecosystem is more dynamic and includes more spatial analysis methods. We work in the field of statistical cartography and have decided to invest in R development, since R fits our need to strongly connect statistics and cartography.

R is both a language and an environment for statistical computing and graphics. A large part of the popularity of this open-source software is due to its plug-in model. These plug-ins, called packages, are created by users and distributed through public repositories (Comprehensive R Archive Network[2]). This system makes it possible to develop and share pieces of software in a transparent way.

Among its many packages, some are dedicated to the management, modification and display of spatial features. Three main packages are essential: sp (Bivand et al. 2013; Pebesma and Bivand 2005), rgdal (Bivand et al. 2016) and rgeos (Bivand and Rundel 2016). sp provides the means to manage spatial data (vector and raster). rgdal provides bindings to the GDAL[3] and PROJ.4[4] libraries to handle geographical projections and data import/export. rgeos is an interface to the GEOS[5] library for geo-processing spatial features.

Several packages already exist to produce thematic maps in R; most of them are described in the Visualisation topic of the “Analysis of Spatial Data” CRAN Task View.[6] Nevertheless, none of them combines easy mapping features, elegantly designed functions, a complete set of cartographic layout elements (arrow, scale, legends…) and a large set of representations.

With the cartography package we propose an extension of the R ecosystem that enables fully reproducible cartography along with data collection and data analysis. cartography is built upon and benefits from the development of two of the main spatial related packages (sp & rgeos).

3. The cartography Package

3.1 Design

The aim of the cartography package is to obtain thematic maps with the visual quality of those built with other common mapping and GIS software.

Users of the package could belong to one of two categories: cartographers new to R programming or R users new to cartography. Therefore, its functions have to be intuitive to cartographers and ensure compatibility with common R workflows.

The package design follows some of the current practices of GIS workflows. Each function has two main arguments, a spatial object (e.g. GIS: vector layer / R: sp object) and an attribute table (e.g. GIS: csv / R: data frame), linked by a common identifier. Each function focuses on a single cartographic representation (e.g. proportional symbols or choropleth representation) and displays it on a georeferenced plot. This design makes it possible to treat each representation as a layer and to overlay multiple representations on the same map.
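The design principle described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's API: the functions link_values, fill_layer and symbol_layer are hypothetical, the spatial object is replaced by a named list of polygon coordinates, and the attribute values are made up. It only shows the two ideas of the design, a join through a common identifier and one function per representation drawn as a layer.

```r
# Hypothetical stand-in for a spatial layer: a named list of polygon rings
# (x/y coordinate matrices). The names act as feature identifiers.
geom <- list(
  A = cbind(x = c(0, 1, 1, 0), y = c(0, 0, 1, 1)),
  B = cbind(x = c(1, 2, 2, 1), y = c(0, 0, 1, 1)),
  C = cbind(x = c(0, 2, 2, 0), y = c(1, 1, 2, 2))
)

# Attribute table (toy values), deliberately in a different row order.
df <- data.frame(id = c("C", "A", "B"), pop = c(300, 100, 200),
                 stringsAsFactors = FALSE)

# Link attributes to features through the common identifier.
link_values <- function(geom, df, dfid, var) {
  df[[var]][match(names(geom), df[[dfid]])]
}

# Layer 1: choropleth-like fill, equal-interval classes, one colour per class.
fill_layer <- function(geom, df, dfid, var, cols, add = FALSE) {
  v <- link_values(geom, df, dfid, var)
  breaks <- seq(min(v), max(v), length.out = length(cols) + 1)
  cls <- cut(v, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  if (!add) {
    # Open an empty plot covering the extent of all features.
    plot(do.call(rbind, geom), type = "n", asp = 1, axes = FALSE,
         xlab = "", ylab = "")
  }
  for (i in seq_along(geom)) polygon(geom[[i]], col = cols[cls[i]])
  invisible(cls)
}

# Layer 2: proportional circles (area proportional to the value),
# drawn on top of the existing plot at the mean of each ring's vertices.
symbol_layer <- function(geom, df, dfid, var, max_radius = 0.3) {
  v <- link_values(geom, df, dfid, var)
  centres <- t(sapply(geom, colMeans))
  radii <- max_radius * sqrt(v / max(v))
  symbols(centres[, 1], centres[, 2], circles = radii, inches = FALSE,
          add = TRUE, bg = "white")
  invisible(radii)
}

# Two representations overlaid on the same plot.
cls <- fill_layer(geom, df, "id", "pop", cols = c("grey90", "grey60", "grey30"))
radii <- symbol_layer(geom, df, "id", "pop")
```

Keeping the geometry and the attribute table separate means the same features can be reused with any table that carries the identifier, and each additional representation is one more function call on the same plot.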