Evolutionary processes in language and culture -
Tutorial
Simple mapping with R
author:
Michael Dunn
date:
2009-05-05
description:
A short practical tutorial on mapping written for primarily for my
colleagues at Max Planck Institute for Psycholinguistics and the Radboud
University Nijmegen
R is a computer language for doing statistics, including geostatistics. But without getting too complicated, it's possible to use R as a simple Geographic Information System (GIS) to plot data onto maps of the world. You can make publication ready maps directly in R, but if you save the maps in a vector graphic format (e.g. pdf or eps) you can edit them further them using a vector graphics editor, like Adobe Illustrator, xfig, or Inkscape.

Preparation
-
If it's not already installed, install the R statistical programming language from the downloads section of http://www.r-project.org. At time of writing the current version is 2.6, and the installation programmes for the Graphical User Interface version for MacOSX and Windows are R-2.6.0.dmg and R-2.6-win32.exe respectively. There are also command line versions, Linux versions etc available here.
-
R includes lots of optional functions called 'libraries'. Lots of libraries are present by default, but you will need to install two non-default libraries , maps and mapdata.
On the MacOSX version you open the R application, select Package Installer from the Packages & Data menu and choose Get List. When the list appears, scroll down to and select maps and mapdata, tick the Install Dependencies checkbox (to be on the safe side), and click Install Selected. That should be it.
The Windows version is presumably something similar...
Testing
-
Open R, and type into the console:
2+2
and press return. Your screen should look like this:
> 2+2
[1] 4
>Two plus two equals four. I don't see that anything else could possibly go wrong.
-
To check the installation of the maps and mapdata packages type:
library(maps) library(mapdata)
If R doesn't complain it means these libraries are loaded, and their functions are available to use. Otherwise, the libraries are not installed properly (do step 2 of preparation again).
Drawing maps
The Defaults
Try the following:
map()
This should produce a default map of the world. To add axes and scale you can always type:
map.axes() map.scale()
There are lots of other options. For example, the default map.scale() function draws a distance gauge (e.g. 100, 200, etc km) and a scale ratio (e.g. 1:4000000), but you can turn this into miles and suppress the ratio by specifying the options map.scale(metric=FALSE, ratio=FALSE).
Setting attributes
To get help on a function, e.g. map, type help(map) or ?map. The help screen gives a list of the attributes that can be specified for the function.
You'll often want to make maps of just part of the world. This is done by specifying xlim (limits on the x axis, i.e. longitude) and ylim (latitude). Longitude and latitude are always given in decimal degrees. The limits are written as R vectors, as follows c(MIN, MAX), where MIN is the lower value and MAX is the higher value:
map(xlim=c(140, 160), ylim=c(-20,10))
The default map used by the maps library is called "world". If you also use the mapdata library you'll have a high resolution map available (called "worldHires"). Compare the above to the following:
map("worldHires", xlim=c(140, 160), ylim=c(-20,10))
Working with spreadsheets
R is good at working with tabular data, so long as it is saved in a text-only format. A spreadsheet program like OpenOffice or Excel allows a spreadsheet to be saved as csv (comma separated values). I personally like to use tab delimited values rather than commas because they're easier to read and write (I often have fields that include commas or quotes, which can get very confusing if these are also used to show the edges of fields). The base function for reading tabular data is called read.table. Type ?read.table to see what options it has. Some important ones are header (first row is a header, TRUE or FALSE; sep (the separator character, e.g. "," or tab). There are some convenient wrapper functions for this that bundle up a bunch of settings:
read.csv()
comma-delimited, quoted fields, as produced by Excel
read.delim()
tab-delimited file (fields don't have to be quoted)
To read in data from a tab delimited file, simply call the function with the name of the data file as its argument. You can test this with the sample data file:
data = read.delim("language-data.txt")
Enter data to look at the data you've loaded, or try head(data) to just look at the first couple of rows.
If there is no header row, then use:
data = read.delim("language-data.txt", headers=FALSE)
With no header the columns will be labeled "VN", where N is the number of the column.
Read table and its kin return a data.frame object.
If the file "language_data.txt" contained the following (underscores represent a single tab space):
name__longitude_latitude AAAA_________10_______10 BBBB________110_______25 CCCC________100_______15
Then data$name would return a list of all the names (strictly speaking, this is "vector" in R terms). The following would produce a simple scatterplot of longitude versus latitude:
plot(data$longitude, data$latitude)
The plot() function starts a new plot. If you have a map drawn with the map() function you can add points with the points() function, e.g.:
map() points(data$longitude, data$latitude)
You could add labels with:
text(data$longitude, data$latitude, labels=data$name)
More map attributes
Here are a few routine tricks that will make mapping easier and/or prettier.
-
The range() function returns the end points of a range, so from our sample spreadsheet range(data$longitude) returns the minimum and maximum, e.g. in this case the vector c(10, 100).
This is useful for setting the xlim and ylim attributes of a map:
map("world", xlim=range(data$longitude), ylim=range(data$latitude))
The extendrange() function is just the same, but it returns a little bit extra either side (by default 5%, but you can set this to a different figure, see ?extendrange)
-
R indexing is very powerful. Many attributes of functions can have vector of values, rather than a single value. For instance, instead of specifying col="blue", colours can be specified as a vector (e.g. col=c("red", "blue", "red", "yellow", ...). A simple way of getting such a vector is to add it to the same data frame that contains your data (in our example, data). Here's an example of how to do this (text from the # sign to the end of the line is a comment, and is ignored by R):
# add a new column to the data frame filled with the value "gray" data$my.col = "gray" # change the values of the $my.col column to "green" where the # $group column is "Austronesian" data$my.col[data$group == "Austronesian"] = "green" # change the values of the $my.col column to "blue" where the # $group column is "Trans New Guinea" data$my.col[data$group == "Trans New Guinea"] = "blue" # use the $my.col column of data as the values of the col attribute points(data$longitude, data$latitude, col=data$my.col)
-
Uses the legend() function. Here's an example:
legend("bottomright", legend=c("Austronesian", "Papuan"), pch=19, col=c("red", "blue"), title="Language groups")The first argument is the position of the legend. This function is quite clever: instead of specifying x-y (i.e. longitude and latitude) coordinates it is also possible to use the values bottomright, bottom, bottomleft, left, topleft, top, topright, right and center. If you use one of these then R will calculate the position itself. Read the help from ?legend for an explanation of the other arguments of this function.
The range() and extendrange() functions.
Background colour
If you read the help on ?map there's a bit of advice on setting colours. A useful combination of attributes is fill=TRUE, col="white", bg="gray". This will produce white landmasses with gray seas.
Colouring items by value
Add a legend
Here is a basic example script using all these techniques to produce the map shown at the start of this document:
library(maps) # only necessary if these libraries
library(mapdata) # are not loaded already
data = read.delim("language-data.txt")
# trim the data so that we only have longitude less than 145°
data = data[data$longitude < 145,]
col.dict = data.frame(my.col=c("red", "blue", "green"),
row.names=c("non-Trans New Guinea",
"Trans New Guinea",
"Austronesian"),
stringsAsFactors=FALSE
)
map("worldHires", # higher resolution map than "world"
xlim=extendrange(data$longitude),
ylim=extendrange(data$latitude),
fill=TRUE,
col="white",
bg="gray")
map.axes()
map.scale()
points(data$longitude,
data$latitude,
pch=19, # make the points solid dots
cex=2, # make them double default size
col=col.dict[data$group,]) # and draw them in nice colours
text(data$longitude,
data$latitude,
labels=data$name,
pos=1) # put the text below the points
Note that functions can extend over multiple lines (so long as the parentheses match up). The help to ?points lists some of the other possible values for the point character (pch) attribute. Note also the use of col to set colour and cex to set the size ("character expansion").
Advanced topics
-
Colours can also be referred to by number from a standard series (starting "black", "red", "blue"; this is mostly useful for quick-and-dirty plotting, where you don't particularly care which colours are chosen for which category (note that n:m gives a range of integers from n to m):
points(data$longitude, data$latitude, col=2:4)
If there are fewer colours specified than points the colour vector will repeat. So the previous example and the following one (which illustrates manual specification of colour) would only make sense if the data was ordered in contrasting triplets:
points(data$longitude, data$latitude, col=c("gray", "blue", "green"))
'Dictionary style' indexing can be convenient. The following command constructs a data frame with row names equal to the different groups, meaning you can refer to the contents of the rows using these names:
col.dict = data.frame(my.col=c("gray", "blue", "green"), row.names=c("non-Trans New Guinea", "Trans New Guinea", "Austronesian"), stringsAsFactors=FALSE # data.frame default makes the wrong choice )Here are some tests. Note the comma in the index. Indexes of data frames have the format [row/s, column/s], so data[3:5,1:4] would give rows 3 to 5 and columns 1 to 4 of the data object. If you leave out a specification you get the whole lot, so data[,] is the same as data:
col.dict["Austronesian",] [1] green
This is identical to the following, since there's only one column in the col.dict data frame:
col.dict["Austronesian",1] [1] green
You can index something more than once to get repeated values:
col.dict[c(1,1,1,2),] [1] gray gray gray blue
Instead of using numbers to index this, you can use the row names---and of course, the row names have been chosen to be the same as the values that we want to code:
col.dict[data$group,] [1] green gray green blue gray ...
Other ways of colouring values
Using map projections
The mapproj library gives you access to other map projections, so you can e.g. plot your map onto a globe viewed from any position you choose. This is a bit tricky, but there are examples you can copy from in the documentation.
Workflow
-
If you save the instructions for drawing a map in a script, you can effortlessly regenerate your map after making small changes (e.g. when you notice some of you longitude-latitude pairs are wrong, or you want to switch from distinguishing points by colour to distinguishing them by shape)
-
From the R application you can type source("my-script.r") to run the contents of a script as if you typed them in yourself
-
From the command line you can type R --no-save < my-script.r (won't work for Windows)
You should archive your scripts for generating maps along with the ultimate, edited version that you use in publication. People are always asking "Do you have a colour version of that figure...?".
-
-
R writes graphics to a "device". If you are using R interactively (using the R console program, rather than running a script) the device is probably your computer screen. Whatever you do should be displayed immediately. But you can also use a graphics file as a device. This device has to be initialized using a function that creates the file in the right format (useful formats include pdf, postscript, png). Once the device is initialized all the R commands you run are composed in R's memory. Finally, you must close the device with dev.off(), and the combined graphics are written to the file Here is an example:
pdf("my_file.pdf", width=8, height=4) # initialize the file map() # compose some graphics (all in R's memory at this stage) dev.off() # write the accumulated graphics to the fileIf you are doing this from a script, and the script has a bug so that it stops before getting to dev.off(), then the graphics file will not contain anything. This can be confusing...
Running everything from a script.
Comments
Write lots of comments in your scripts to help you interpret them later. Everything from a hash (#) to the end of a line is a comment, and will be ignored by the program. The example in the middle of the next paragraph is well commented.
Writing to pdf/eps/png etc
When to stop
For publication quality maps you will probably find that you that there are some things that are too much trouble to do in R. For example, if you have lots of labelled points close together, then the labels will almost certainly overlap: it would be quite difficult to calculate the bounding box of each of these labels and to try and calculate visually acceptable, non-overlapping positions for them. When I get to this stage, I take a pdf drawn by R and edit it in a graphics program. Adobe Illustrator is a good choice, although expensive if you don't have it already. The program Inkscape is a good free alternative (I've only tried using it recently, but it seems perfectly adequate). Note that Photoshop is not suitable: Photoshop does not let you select vector objects, such as lines, points, text boxes, etc. Note also that if you want to tidy up a graphics file, it has to be in a vector format (pdf, postscript for illustrator, or svg for Inkscape).

