Optimizing your dataset

Most datasets are ready for georeferencing without any technical preparation. There are, however, a few simple things you can do to get better georeferencing results:

  1. Understand your location variables.
  2. Aggregate your data in advance.
  3. Check for spaces in your fieldnames.
  4. Save as Excel or tab-delimited txt.

1: Understand your location variables.
Most importantly, your dataset needs at least one or two fields containing geographic names. To georeference points or small areas, you need one field containing either the name of a province or the name of a city/location (not both in the same field, and no street addresses), and another field containing the name of the country it is located in. Each record should contain only one location, not a list of several locations*. To georeference entire countries, you only need a country field, and optionally a year field if you wish to map changes over time.

2: Aggregate your data in advance.
Aggregate or average the individual attribute values up to their higher-level geographic unit so that there is only one record for each geographic unit (e.g. if there are multiple events in a city, or multiple respondents in a province). While you are at it, purge your numerical fields of any textual values like “NULL” so that they can be used for quantitative methods later on. This way your data will be immediately ready for visualization and analysis on a map. By aggregating your variables you won’t have to worry about your georeferenced cases overlapping each other or hiding potentially important information, and the georeferencing process will be much faster. (The only exceptions I can think of where you shouldn’t aggregate are if you only want to interactively explore and get information on individual events in each location, or if you need to dynamically visualize the distribution of your cases in a time animation.)
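As a rough illustration of this step, here is a minimal Python sketch that averages one numeric field per geographic unit and treats text such as “NULL” as missing. The function names (to_number, aggregate) and the field names in the example are my own assumptions for illustration; they are not part of the georeferencing program.

```python
from collections import defaultdict
from statistics import mean


def to_number(value):
    """Return value as a float, or None for blanks and text such as "NULL"."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def aggregate(records, key_fields, value_field):
    """Average value_field so that each geographic unit keeps one record."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in key_fields)
        values = groups[key]  # creates the group the first time a unit is seen
        number = to_number(record.get(value_field))
        if number is not None:
            values.append(number)

    result = []
    for key, values in groups.items():
        row = dict(zip(key_fields, key))
        # A unit with no usable numbers keeps an empty (None) value.
        row[value_field] = mean(values) if values else None
        result.append(row)
    return result
```

For example, two Nairobi records with event counts of 3 and 5 would collapse into a single Nairobi record with a value of 4.0. Whether you should average, sum, or count depends on the variable, so treat the use of the mean here as just one option.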

3: Check for spaces in your fieldnames.
As most statistical and GIS software recommends, check that the top row of your dataset, which defines the names of your columns/fields/variables, does not contain any spaces (including at the left and right edges, where they are difficult to see). The program will still run, but it cannot use any field whose name contains a space and will exclude such fields from the output.
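If you have many columns, a small script can do this check for you. The sketch below trims spaces at the edges of each field name and replaces any remaining spaces with underscores. The function name clean_fieldnames and the choice of an underscore are my own assumptions; any replacement without spaces should work equally well.

```python
import re


def clean_fieldnames(header):
    """Return the field names with edge spaces removed and inner spaces replaced."""
    cleaned = []
    for name in header:
        name = name.strip()               # drop hard-to-see spaces at the edges
        name = re.sub(r"\s+", "_", name)  # assumption: use "_" instead of spaces
        cleaned.append(name)
    return cleaned
```

A header row such as " Country ", "City Name", "year" would then become "Country", "City_Name", "year".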

4: Save as Excel or tab-delimited txt.
Finally, make sure your dataset is in a format that the software can read: either an Excel file (.xls or .xlsx) or a tab-delimited text file (.txt). For very large datasets you may prefer the .txt format, because the software reads those files much faster than Excel files.
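If your data already lives in a script rather than a spreadsheet, the following sketch shows one way to write it out as tab-delimited text using Python's standard csv module. The function names and the UTF-8 encoding are my own assumptions; I have not verified which text encoding the program expects, so adjust it if special characters come out wrong.

```python
import csv
import io


def to_tab_delimited(records, fieldnames):
    """Return the records as tab-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def save_tab_delimited(records, fieldnames, path):
    """Write the tab-delimited text to a .txt file (encoding is an assumption)."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        handle.write(to_tab_delimited(records, fieldnames))
```

Calling save_tab_delimited(records, ["country", "city", "events"], "mydata.txt") would produce a file with one header line followed by one line per record, with the fields separated by tabs.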

*The program can detect characters that indicate multiple locations. In the current version, the program will only georeference the first location listed and skip the others. The characters that indicate multiple locations are: ‘ and ’, ‘ & ’, ‘ in ’, ‘,’, ‘;’, ‘/’, ‘(’, ‘((’.
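If you would like to find such records yourself before georeferencing, the sketch below flags values that contain any of the characters listed above and shows what remains when only the first location is kept. This is my own approximation, not the program's actual detection logic; in particular, I am assuming that the match is case-sensitive and that everything before the earliest separator counts as the first location.

```python
# Separators taken from the list above; "((" is kept only for completeness,
# since any value containing it also contains "(".
SEPARATORS = [" and ", " & ", " in ", ",", ";", "/", "(", "(("]


def has_multiple_locations(value):
    """Return True if the value contains any of the listed separators."""
    return any(separator in value for separator in SEPARATORS)


def first_location(value):
    """Return the text before the earliest separator, with edge spaces removed."""
    cut = len(value)
    for separator in SEPARATORS:
        position = value.find(separator)
        if position != -1:
            cut = min(cut, position)
    return value[:cut].strip()
```

With this sketch, "Paris and Lyon" is flagged and reduced to "Paris", while "Nairobi" is left unchanged. Note that a legitimate single name containing a comma or a bracket would be flagged too, so review the flagged records by hand rather than trimming them blindly.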


