Easy Georeferencer should be relatively straightforward to use, but if you still want a proper introduction or are looking for tips for more advanced usage, then this tutorial is here for you. In this tutorial you will find help with:
- Optimizing your dataset
- How to use the program
- Which settings to use
- What to do with the results
PART I: OPTIMIZING YOUR DATASET
Most datasets should be ready for georeferencing without having to do any technical preparations. There are however a few simple things you can do to optimize it for better georeferencing results:
- Understand your location variables.
- Aggregate your data in advance.
- Check for spaces in your fieldnames.
- Save as Excel or tab-delimited txt.
1: Understand your location variables.
Most crucial is that your dataset has at least one or two fields containing geographic names. For georeferencing points or small areas you need one field containing either the name of a province or a city/location (not both in the same field, and no street addresses), and another field containing the country name they are located in. There should also only be one location per record, not a list of several locations*. For georeferencing entire countries you only need a country field, and optionally a year field if you wish to map changes over time.
2: Aggregate your data in advance.
Aggregate or average the individual attribute values up to their higher-level geographic unit so that there is only one record for each geographic unit (e.g. if there are multiple events in a city, or respondents in a province). While you are at it, purge your numerical fields of any textual values like “NULL” so that they can be used for quantitative methods later on. This way your data will be immediately ready for visualization and analysis on a map. By aggregating your variables you won’t have to worry about your georeferenced cases overlapping each other or hiding potentially important information, not to mention that the georeferencing process will be much faster. (The only exceptions I can think of where you shouldn’t aggregate are if you only want to interactively explore and get information on individual events in each location, or if you need to dynamically visualize the distribution of your cases in a time-animation.)
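As an illustration of the aggregation step, here is a minimal Python sketch that is not part of Easy Georeferencer itself. The field names ("city", "country", "casualties") and the sample rows are hypothetical; the sketch averages a numeric field per location and ignores textual placeholders such as "NULL".

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: several events per city, one with a textual "NULL".
records = [
    {"city": "Lagos", "country": "Nigeria", "casualties": "4"},
    {"city": "Lagos", "country": "Nigeria", "casualties": "NULL"},
    {"city": "Lagos", "country": "Nigeria", "casualties": "6"},
    {"city": "Kano", "country": "Nigeria", "casualties": "2"},
]


def to_number(value):
    """Return a float, or None for textual placeholders such as 'NULL'."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def aggregate(rows, key_fields, value_field):
    """Collapse rows to one record per geographic unit, averaging value_field."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        values = groups[key]  # creates the group even if no usable value follows
        number = to_number(row[value_field])
        if number is not None:
            values.append(number)
    result = []
    for key, values in groups.items():
        record = dict(zip(key_fields, key))
        record[value_field] = mean(values) if values else None
        record["n_values"] = len(values)  # how many usable values were averaged
        result.append(record)
    return result


aggregated = aggregate(records, ["city", "country"], "casualties")
```

Whether to average, sum, or count depends on the variable; the mean is used here only as an example.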
3: Check for spaces in your fieldnames.
As recommended by most statistical and GIS software, check that the top row of your dataset, which defines the names of your columns/fields/variables, does not contain any spaces (including at the left and right edges where they are difficult to see). The program will still function, but it will not be able to use any fields whose names contain spaces and will exclude them from the output.
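If you would rather check the header row with a script than by eye, a small Python sketch along these lines may help. The header below is hypothetical, and replacing inner spaces with underscores is only one possible convention, not something the program requires.

```python
def fields_with_spaces(header):
    """Return the field names that contain a space anywhere, including the edges."""
    return [name for name in header if " " in name]


def cleaned_header(header):
    """Trim spaces at the edges and replace inner spaces with underscores."""
    return [name.strip().replace(" ", "_") for name in header]


# Hypothetical header row with leading, inner, and trailing spaces.
header = ["city", " country", "event date", "casualties "]

problem_fields = fields_with_spaces(header)
fixed_header = cleaned_header(header)
```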
4: Save as Excel or tab-delimited txt.
Finally, make sure your dataset is in a format that the software can read: either an Excel file (.xls or .xlsx) or a tab-delimited text file (.txt). For very large datasets you may prefer the .txt format because the software reads those files much faster than Excel files.
*The program has a way of detecting characters that indicate multiple locations. In the current version, the program will only georeference the first one listed and skip the others. The characters that indicate multiple locations are: ‘ and ‘, ‘ & ‘, ‘ in ‘, ‘,’, ‘;’, ‘/’, ‘(‘, ‘((‘.
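To see in advance which part of a multi-location value would survive, the behaviour described in the note above can be approximated as follows. This is a rough sketch under the assumption that the text before the earliest separator is kept; it is not the program's actual code, and the function name is made up for illustration.

```python
# Separators listed in the note above; "((" is already covered by "(".
SEPARATORS = [" and ", " & ", " in ", ",", ";", "/", "("]


def first_location(name):
    """Keep only the text before the earliest separator found in name."""
    cut = len(name)
    for separator in SEPARATORS:
        position = name.find(separator)
        if position != -1:
            cut = min(cut, position)
    return name[:cut].strip()
```

Because the word separators include their surrounding spaces, a name such as "Portland" is left untouched even though it contains the letters "and".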
PART II: HOW TO USE THE PROGRAM
Once you have started the program, the steps are relatively straightforward.
- Choose whether to georeference cities, provinces, or countries.
- Select your dataset and define its location variables.
- Run the program!
1: Choose whether to georeference cities, provinces, or countries.
You first select which geographic unit (“Cities”, “Provinces”, or “Countries”) you wish to georeference by selecting one of the tabs at the top. Note that city-mode can be used for more than just cities, and can include any type of point-location, even the smallest towns or villages.
2: Select your dataset and settings
Next, you click on the folder button to bring up the menu where you can select your dataset and tell the program which of your variables are the location variables. If desired you may also review and alter a few basic input settings, such as the fuzzy match ratio, which fields to keep in your output, or where to save the output.* To avoid re-entering your settings after an error or crash, you can use the buttons at the top of the input screen to save your settings, return to your last save, or clear them if you want to start fresh.
3: Run it!
Finally, click the “play”-icon button to let the software georeference your dataset for you.**
*Although not really required, it is highly recommended that you click on the flag button next to the country field option to manually double check that the software’s automatic country name matching looks correct, and to change them if there were any mistakes. This will take very little time and may ensure a much higher success rate and a confidence that your cases are at least placed in the correct countries.
**While the program is running it shows you live match status updates of what it is doing in the gray area above the progress bar. If you click on the same area you will get an expanded view allowing you to backtrack and explore in more detail which places were matched or not.
PART III: WHICH SETTINGS TO USE
The ideal settings depend on the project. Here are some suggestions for common scenarios:
- Global events datasets
- Cross-cultural surveys
- The case study
- Country statistics at different points in time
1: Global Events Datasets
For visualizing and analyzing the spatial distribution of global events data, it might be most natural to use city-mode to get the most exact positions possible. If you want as many matches as possible, you may want to use the default “GNS” for the “coordinates provider” option, because GNS appears to have the most extensive coverage and to yield higher match-rates. However, GNS is so extensive that it also contains many local and peripheral places with identical names, in which case an event that occurred in a very large and famous city like New York, USA may be placed in a small town with the same name halfway across the country, simply because the software had to choose arbitrarily between the identically named locations. If your data has fewer cases, or its cases tend to occur in larger population centers, it might be more important that each case is placed correctly, and you may therefore want to go with the “GeoNames” coordinates provider instead, which only includes the larger and more relevant locations.
2: Cross-Cultural Surveys
Although multi-national surveys by organizations like Gallup, World Values Survey, or Pew are most often used to compare national-level opinions, it is quite common for such datasets to include a variable for the geographic area within each country in which the respondent was given the survey. Unfortunately, the way this “region” variable is recorded varies from country to country, with some only using vague descriptions such as “northern Italy” or “north-eastern Russia”. Others contain specific province names that can be georeferenced but that often differ widely in size and sit at different levels of the administrative chain. If you are only interested in displaying a continuous surface of response values, and the differences in province levels between countries do not matter to your project, then the “Any” option for “Administrative Level” will result in the highest match-rates and require no processing after georeferencing, though you should still watch out for multiple levels being present in the same country due to nested duplicate names (e.g. the higher-level “New York” state would hide the lower-level “New York” district and its neighbors and result in overlapping province shapes). If you think the level of the administrative chain may matter and you want to standardize to the same level across the different countries, then you can force lower-level name matches to be scaled upwards to a level of your choice. The drawback here is that the resulting shapefile will very likely contain multiple shapes of the same upper-level province (one for each of its constituent sub-districts), so after georeferencing it becomes important to use a GIS to dissolve the shapefile, so that each province has only one shape and one aggregated value per attribute field. It may also lead to a lower match-rate because provinces higher than the level you specified will not be included in the output.
3: The Case Study
If you are engaged in a case study that looks in-depth at only one or a few countries, there are several settings you can use to highlight different aspects of your country case study. Use city-mode to highlight important events or locations of socio-economic-political importance, to explore the spatial distribution of the opinions of interview respondents, or to quantitatively model the spatial distribution of events. Use province-mode to visualize contextual factors and statistics that are often available from national statistical reports, or to use such contextual factors to explain or predict city-events in a statistical model. When using province-mode, just remember to set the “administrative unit” option to the correct level at which your contextual data was recorded.
4: Country Statistics at Different Points in Time
There are many datasets that record changes in how countries score on different variables over a range of years. It is not always straightforward to visualize such changes because most country shapefiles represent only how country borders look today. By clicking the “clock” icon next to the “Country Field” input in Easy Georeferencer’s country-mode, it becomes possible to visualize country statistics for any point in time while accounting for the relevant name and boundary changes. By choosing the “Varying” option and selecting the field that contains the year value of your historical dataset, you can create an equivalent historical shapefile. Different maps at different points in history can then be correctly represented by using simple select queries in a GIS.
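The select query mentioned above amounts to keeping only the records for one year. A minimal Python sketch of the same idea on plain attribute rows follows; the field name "YEAR" and the score values are invented for illustration, and in practice you would run the equivalent query in your GIS.

```python
# Hypothetical attribute rows from a historical country shapefile.
records = [
    {"country": "Czechoslovakia", "YEAR": 1990, "score": 1},
    {"country": "Yugoslavia", "YEAR": 1990, "score": 2},
    {"country": "Czech Republic", "YEAR": 2000, "score": 3},
]


def select_year(rows, year, year_field="YEAR"):
    """Return only the rows recorded for the given year."""
    return [row for row in rows if row[year_field] == year]


map_1990 = select_year(records, 1990)
map_2000 = select_year(records, 2000)
```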
PART IV: WHAT TO DO WITH THE RESULTS
When Easy Georeferencer has completed its work, it produces a shapefile* that can be immediately opened by any GIS software or mapping website. However, if you wish to ensure the quality of the georeferenced data, there are a number of aspects in the output data that can be useful. Listed below is an overview of some common ways to make use of Easy Georeferencer’s output data, starting with the easiest and ending with the most accurate. These strategies are:
Simply use the shapefile in your map and consider yourself finished! The shapefile is a spatial representation of the cases in your dataset that could be georeferenced, and is marked with “_GIS” in the filename. Accepting this shapefile without scrutiny is the most naïve approach but may be appropriate in many circumstances when time-efficiency and the overall patterns are the most important aspects and when potentially high errors are acceptable (for instance when georeferencing very large cross-national datasets).
Use the shapefile, but in a more quality-oriented approach exclude cases whose accuracy can be doubted. Doubtful cases can be highlighted in several ways.
- One way is to sort on the “GEOINCONS” field, which measures inconsistency: whether there were multiple locations with the same geographic name as the name that was matched. In city-mode the field represents the maximum distance in kilometers between the identically named locations (calculated using the Haversine distance equation)**, and higher values therefore mean a greater risk of geographic misplacement. Small distance values should not be a concern because gazetteers often have duplicate records for each place, written with slightly different coordinates but essentially referring to the same place. It is recommended here that values higher than 100 km (about 1 ½ hours of driving) be interpreted as a significant risk that the alternative names are in fact different places. In province-mode the value is the number of locations with identical names at any administrative level; for instance, the name “New York” could refer to the level-2 county as well as the level-1 state.
- Another way is to sort on the field “GEOMRATIO”, which measures the percentage similarity between the original name and the name of the matched location. Though the match similarity threshold was originally set in the input settings screen, the similarity values can be compared with your own qualitative assessment of how well the names match by quickly eye-balling the low-value cases. If the original match threshold seems too lenient, simply decide on a new threshold where the name-matches start becoming more acceptable and exclude the cases whose values are lower.
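The two checks above can be combined into a simple filter. The following Python sketch assumes that the attribute table has been loaded as a list of dictionaries, that “GEOINCONS” holds kilometers (city-mode), and that “GEOMRATIO” is on a 0 to 100 scale; the 90 percent cut-off and the sample rows are arbitrary illustrations, while the 100 km limit is the one recommended above.

```python
def split_by_quality(rows, max_inconsistency_km=100.0, min_match_ratio=90.0):
    """Separate rows into trusted and doubtful cases using the two fields."""
    trusted, doubtful = [], []
    for row in rows:
        too_inconsistent = row["GEOINCONS"] > max_inconsistency_km
        weak_name_match = row["GEOMRATIO"] < min_match_ratio
        if too_inconsistent or weak_name_match:
            doubtful.append(row)
        else:
            trusted.append(row)
    return trusted, doubtful


# Hypothetical output rows.
rows = [
    {"name": "A", "GEOINCONS": 0.0, "GEOMRATIO": 100.0},
    {"name": "B", "GEOINCONS": 850.0, "GEOMRATIO": 100.0},
    {"name": "C", "GEOINCONS": 3.2, "GEOMRATIO": 78.0},
]

trusted, doubtful = split_by_quality(rows)
```

Keeping the doubtful rows in a separate list, rather than deleting them, leaves them available for manual review.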
Pay attention to doubtful cases like above, but instead of flat-out excluding them, manually provide your own georeferencing. The field called “GEOALTERN” contains the names of the other alternative locations (and in city-mode also the coordinates for each). These alternative names can then serve as a guide of which places need to be compared and thus where to start in your manual georeferencing efforts.
If you require total or near-total completeness of the georeferenced data you can in addition to the above steps manually georeference all the cases that failed to be automatically georeferenced and are conveniently located in the “_remainders” file. The manually georeferenced cases can then be merged with the shapefile from the automatic process to have a complete georeferenced version of the original dataset. HINT: The “_remainders” file can even be put through another round with the Easy Georeferencer software, only this time with a lower “match similarity” threshold to return more matches.
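To get a feel for how a lower threshold returns more matches, here is a small Python sketch. It uses difflib from the standard library as a stand-in similarity measure; the measure Easy Georeferencer actually uses is not described in this tutorial, so the numbers are only indicative, and the function names are made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Percentage similarity (0-100) between two names, ignoring case."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(name, candidates, threshold):
    """Return the most similar candidate, or None if it scores below threshold."""
    best = max(candidates, key=lambda candidate: similarity(name, candidate))
    return best if similarity(name, best) >= threshold else None


candidates = ["Mumbai", "Dubai"]
strict = best_match("Mumbay", candidates, threshold=90)   # no match at 90
lenient = best_match("Mumbay", candidates, threshold=80)  # matches "Mumbai"
```

A lower threshold also lets in more wrong matches, so the second-round output deserves the same scrutiny as the first.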
**The Python code for the Haversine algorithm was adapted from Chris Veness’ online JavaScript code guide for calculating distances. http://www.movable-type.co.uk/scripts/latlong.html
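For readers who want to reproduce the distance values themselves, the Haversine formula can be sketched in Python as below. This is an independent illustration rather than the program's own code; the Earth radius of 6371 km and the helper max_pairwise_km, which mimics the "maximum distance between identically named locations" idea, are assumptions.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = radians(lon2 - lon1)
    a = sin(delta_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(delta_lambda / 2) ** 2
    # min() guards against rounding slightly above 1 for antipodal points.
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))


def max_pairwise_km(points):
    """Largest distance between any two (lat, lon) points; 0.0 for fewer than two."""
    return max(
        (haversine_km(*p, *q) for p, q in combinations(points, 2)),
        default=0.0,
    )
```

One degree of longitude along the equator comes out at roughly 111 km, which is a quick sanity check for the function.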