1 of 44

Welcome to GEOGM0068:

Geographic Information Retrieval and Integration

Rui Zhu

rui.zhu@bristol.ac.uk

GEOGM0068 - TB2 2024/25

2 of 44

Lecture 05


3 of 44

Assessment

  • Summative Assessment is released on Blackboard
  • The marking criteria are attached to it. Note that marking can still be subjective, but your work will be judged academically by two members of staff in the School.
  • Formative Assessment:
    • How would you like to do it?
    • Proposal: a workshop in Week 9? Presentation? Short proposal? Free-style workshop?


4 of 44

Review: What is Georeferencing

  • Making implicit spatial information explicit
  • “References to locations”
  • Two steps:
    • Identifying references to locations as expressed in natural language text
      • Geoparsing, place name identification, toponym recognition, …
    • Relating them to positions on the Earth
      • Geocoding, toponym resolution, …
  • Sometimes, these two steps are done simultaneously
  • Ambiguity might exist in both steps → Disambiguation
  • It is also a sub-field in AI/Data Mining


5 of 44

Review: Geoparsing Pipeline

Tokenization

To split text into words/phrases or document into sentences

Tagging

To assign Part-of-Speech (POS)

Lookup

To look up lists of known locations (i.e., gazetteers), organizations, people, etc.
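The tokenization and lookup steps can be sketched as follows. This is a minimal illustration, not a full geoparser: the tagging (POS) step is skipped, and `TOY_GAZETTEER`, `tokenize`, and `lookup` are hypothetical names with approximate coordinates chosen only for the example.

```python
import re

# Hypothetical toy gazetteer: lower-cased place name -> (latitude, longitude).
# Coordinates are approximate and for illustration only.
TOY_GAZETTEER = {
    "bristol": (51.45, -2.59),
    "bath": (51.38, -2.36),
    "new york": (40.71, -74.01),
}

def tokenize(text):
    """Split text into word tokens with a deliberately simple regex."""
    return re.findall(r"[A-Za-z]+", text)

def lookup(tokens, gazetteer, max_len=3):
    """Greedy longest-match lookup of token n-grams in the gazetteer."""
    matches = []
    i = 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):
            span = tokens[i:i + n]
            phrase = " ".join(span).lower()
            if phrase in gazetteer:
                matches.append((" ".join(span), gazetteer[phrase]))
                i += n
                break
        else:
            i += 1  # no gazetteer entry starts at this token
    return matches

text = "She moved from New York to Bristol and took a bath."
found = lookup(tokenize(text), TOY_GAZETTEER)
for name, coords in found:
    print(name, coords)
```

Note that the common noun "bath" is also matched, which is the kind of false positive that a pure lookup cannot avoid without context.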


6 of 44

Lookup Resource - Gazetteers

  • The lookup step combines lists of known locations, organizations, and people with rules or machine learning that exploit the surrounding context
  • Such a list of known locations is called a gazetteer
  • Well-known (global) gazetteers

Global coverage; traditional gazetteer

Global coverage; more culturally and historically related gazetteer

Global coverage; not only a gazetteer; in graph format


7 of 44

Lookup Resource - Gazetteers

  • UK gazetteers
  • Looking up a gazetteer often involves the process of geocoding

Other potential resources:


8 of 44

Pros and Cons of Simple Lookup

Pros

  • Simple
  • Fast
  • Language independent
  • Fair precision and recall

Cons

  • Finiteness of gazetteers
  • Uncertainty in matching
  • A term might not be geographic/spatial; e.g., Washington, Chicago, …
  • Variation (uncertainty and ambiguity) in place names; e.g., UK and United Kingdom


9 of 44

Variation in Place Names

Example (Polysemy - same name refers to various things):

  • “Bath”, the city in the UK or a place to bathe?
  • “Turkey”, the country or the bird?
  • “Washington”, the person or the place?


10 of 44

Variation in Place Names

Example (Synonymy - various names refer to the same place):

  • New York, The Big Apple
  • Bristol, Bricstow, Brycgstow
  • London, The Great Wen, The Big Smoke, …

Zhu, R., Janowicz, K., Yan, B., & Hu, Y. (2016). Which Kobani? A case study on the role of spatial statistics and semantics for coreference resolution across gazetteers. In International Conference on GIScience Short Paper Proceedings (Vol. 1, No. 1).


11 of 44

Disambiguation

Types of Evidence:

  • Internal:
    • Evidence based on the internal and/or phrasal structures
    • Example rules: capitalization (CapWord), prefixes or suffixes (e.g., City, Town, Road, Avenue, Ave., Boulevard)
  • External (Contextual):
    • Evidence present within the text that makes it clear what type of entity a word or phrase is
    • Example: “President Washington chopped the tree” – Washington here is a person's name, not a place name
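Both kinds of evidence can be expressed as simple hand-crafted rules. The sketch below is only an illustration under stated assumptions: the cue lists `PLACE_SUFFIXES` and `PERSON_TITLES` and the function `classify_mention` are hypothetical, and real systems use far richer rule sets.

```python
import re

# Hypothetical cue lists, chosen only for illustration.
PLACE_SUFFIXES = ("City", "Town", "Road", "Avenue", "Ave.", "Boulevard")
PERSON_TITLES = ("President", "Mr.", "Mrs.", "Dr.")

def classify_mention(sentence, mention):
    """Guess whether `mention` is a person or a place using simple cues."""
    # External (contextual) evidence: a personal title right before the mention.
    for title in PERSON_TITLES:
        if re.search(re.escape(title) + r"\s+" + re.escape(mention), sentence):
            return "person"
    # Internal evidence: a capitalized mention followed by a place suffix.
    if mention[:1].isupper():
        for suffix in PLACE_SUFFIXES:
            if re.search(re.escape(mention) + r"\s+" + re.escape(suffix), sentence):
                return "place"
    return "unknown"

print(classify_mention("President Washington chopped the tree", "Washington"))
print(classify_mention("They live on Washington Avenue", "Washington"))
```

A mention with neither cue returns "unknown", which is where the contextual representations discussed next become useful.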


12 of 44

Context

  • How do we define context?
    • a smoothing window,
    • a sentence
    • a paragraph
    • a whole document


13 of 44

Disambiguation


14 of 44

Context

  • How do we represent context?
    • A bag-of-words approach to represent the context of the place name

    • Washington: < 10, 4, 7, … 2 > (vector-based representation)
    • Texts about each place name candidate (e.g., from Wikipedia) are also represented as a bag of words
    • A simple approach that often performs well
    • However, it only records word frequencies and ignores word order in the context
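A bag-of-words vector can be built by counting how often each vocabulary word occurs in the context. The sketch below assumes a tiny hypothetical vocabulary and example sentence; `bag_of_words` is an illustrative helper, not part of any library.

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return word counts of `text`, ordered by the given vocabulary."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and context for the place name "Washington".
vocabulary = ["capital", "president", "state", "seattle"]
context = "The president returned to the capital; the capital was quiet."
vector = bag_of_words(context, vocabulary)
print(vector)  # [2, 1, 0, 0]
```

Because only counts are kept, "capital president" and "president capital" produce the same vector, which is the loss of word order noted above.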


15 of 44

Selecting the Best Candidate

  • Given the vector-based representations of the place name and (several) candidates, how do we decide which candidate the place name refers to?
    • Similarity-based approach
    • Probabilistic approach
    • Machine Learning approach


16 of 44

Similarity-based Approach

  1. Organize the context of the target place name and the candidate place names into the same form of vector representations


17 of 44

Similarity-based Approach

2. Compute the distance (dissimilarity) between each candidate place name’s vector and the target place name’s vector.

  • There are different ways of measuring distance/similarity:
    • Euclidean distance
    • Cosine similarity
    • Manhattan distance
    • Jaccard distance


18 of 44

Similarity-based Approach

  • Euclidean distance
    • Is influenced by the length of the contexts used (longer texts give larger counts)
    • Needs additional normalization


19 of 44

Similarity-based Approach

  • Cosine similarity
    • For non-negative count vectors, the result is a value between 0 (not similar) and 1 (highly similar)
    • Accommodates contexts of different lengths
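The difference between Euclidean distance and cosine similarity shows up when a candidate's reference text is much longer than the target context. The following sketch uses hypothetical count vectors; `euclidean` and `cosine` are plain-Python helpers written for this example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical bag-of-words vectors over the same vocabulary.
target = [2, 1, 0, 0]                 # short context around "Washington"
candidates = {
    "Washington, D.C.": [20, 10, 1, 0],  # long reference text, same word mix
    "Washington State": [0, 0, 3, 2],    # short reference text, different words
}

best_by_cosine = max(candidates, key=lambda c: cosine(target, candidates[c]))
best_by_euclid = min(candidates, key=lambda c: euclidean(target, candidates[c]))
print(best_by_cosine, "|", best_by_euclid)
```

Here cosine similarity picks "Washington, D.C." because the word proportions match, while unnormalized Euclidean distance picks "Washington State" simply because its vector has a similar magnitude to the target.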


20 of 44

Probabilistic Approach

How likely is it that “Washington” (target place name) refers to “Washington D.C.” (candidate place name), given the observed context?

  • Common approach: naïve Bayes
  • Y: context, X: candidate place
  • For each place name candidate Xi, we can calculate P(Xi|Y); so we will have P(X1|Y), P(X2|Y), …, P(Xn|Y)
  • X* = argmax_i P(Xi|Y), for i = 1, …, n
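A naïve Bayes scorer treats the context words as conditionally independent given the candidate, so P(X|Y) is proportional to P(X) times the product of P(word|X). The sketch below is a minimal illustration on a made-up training set; the add-one (Laplace) smoothing and the helper names `train` and `log_posterior` are assumptions of this example.

```python
import math
from collections import Counter

def train(docs_by_candidate):
    """Estimate priors and word counts per candidate from labeled texts."""
    vocab = {w for docs in docs_by_candidate.values() for d in docs for w in d.split()}
    total_docs = sum(len(docs) for docs in docs_by_candidate.values())
    model = {}
    for cand, docs in docs_by_candidate.items():
        counts = Counter(w for d in docs for w in d.split())
        model[cand] = (len(docs) / total_docs, counts, sum(counts.values()))
    return model, vocab

def log_posterior(context, cand, model, vocab):
    """Unnormalized log P(candidate | context) with add-one smoothing."""
    prior, counts, total = model[cand]
    score = math.log(prior)
    for w in context.split():
        if w in vocab:  # words never seen in training are ignored
            score += math.log((counts[w] + 1) / (total + len(vocab)))
    return score

# Hypothetical labeled texts for two candidates of "Washington".
training = {
    "Washington, D.C.": ["capital congress president", "capital museum"],
    "Washington State": ["seattle rain forest", "seattle coast"],
}
model, vocab = train(training)
context = "president visits capital"
best = max(model, key=lambda c: log_posterior(context, c, model, vocab))
print(best)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.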


21 of 44

Machine Learning Approach

  • Disadvantages of the aforementioned approaches:
    • The representation of context is based on terms, not entities → Use NER tools to extract entities; then construct the vectors from entities
      • Such vectors are often high-dimensional
    • The importance of terms is based on frequency, which might not be accurate → Need methods to identify critical entities related to a place instance → Semantics!


22 of 44

Machine Learning Approach

Key idea: to learn a low-dimensional vector to represent the term/place name, so that the distance between semantically relevant terms/place names is small.

Popular methods:

  • Support vector machine
  • Conditional random field
  • Recurrent neural network


23 of 44

Other Approaches

  • Instead of using vector-based representation, current research investigates graph-based representation


24 of 44

Other Approaches

  • Parsing the context into graphs
  • Graph similarity: nodes and edges
  • Generally slower but can explicitly identify evidence for supporting the disambiguation


25 of 44

Other Approaches

Hu, Y., Mai, G., Cundy, C., Choi, K., Lao, N., Liu, W., ... & Joseph, K. (2023). Geo-knowledge-guided GPT models improve the extraction of location descriptions from disaster-related social media messages. International Journal of Geographical Information Science, 37(11), 2289-2318.

  • ChatGPT Supported Approach

Hu, X., et al. (2024). Toponym resolution leveraging lightweight and open-source large language models and geo-knowledge. International Journal of Geographical Information Science, 1-28.


26 of 44

Prompt Engineering

  • ChatGPT is just one example of a class of generative AI models called foundation models
  • Geo-foundation models are a trending topic in geographic data science/GIScience
  • A key technique when using foundation models is prompt engineering
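As a rough illustration of prompt engineering for geoparsing, a few-shot prompt can be assembled from an instruction, some worked examples, and the new text. The wording of the instruction, the example pairs, and the helper `build_prompt` are all hypothetical; no particular model or API is assumed.

```python
def build_prompt(text, examples):
    """Assemble a few-shot prompt asking a model to list place names."""
    lines = [
        "Extract all place names from the text. "
        "Answer with a comma-separated list."
    ]
    for example_text, example_places in examples:
        lines.append(f"Text: {example_text}")
        lines.append(f"Places: {', '.join(example_places)}")
    # The final entry is left open for the model to complete.
    lines.append(f"Text: {text}")
    lines.append("Places:")
    return "\n".join(lines)

# Hypothetical worked examples (the "shots").
examples = [
    ("Flooding reported near Bath and Bristol.", ["Bath", "Bristol"]),
    ("President Washington chopped the tree.", []),
]
prompt = build_prompt("The train from London to Bath was delayed.", examples)
print(prompt)
```

The second example deliberately has no places, which signals to the model that a person's name should not be extracted.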

Mai, G., Huang, W., Sun, J., Song, S., Mishra, D., Liu, N., ... & Lao, N. (2023). On the opportunities and challenges of foundation models for geospatial artificial intelligence. arXiv preprint arXiv:2304.06798.


27 of 44

Summary of Geoparsing

  • Simple lookup
    • Need a gazetteer
  • Knowledge or rule-based Approach
    • Hand-crafted rules: e.g., “North of CapWord” → CapWord is more likely to be a place name (internal context)
    • Similarity-based approach (external context)
  • Machine learning Approach
    • To automatically learn the (complex and often implicit) context
  • Other approaches
    • Graph-based representation instead of vector-based representation
    • Generative AI-supported approach


28 of 44

Evaluation Metrics

  • Geoparsing can be regarded as a classification problem in ML
  • Human labeled corpora
    • Collect a set of texts that contain place names, and ask humans to label the place instances → baseline/gold standard data/reference
  • Some existing labeled datasets:


29 of 44

Evaluation Metrics

  • Apply your place name identification model to the same dataset, and compare the result (system-annotated version, or response-set/predicted) with human annotations (manually-generated set, or key-set/true condition)
  • Obtain the contingency table


30 of 44

Evaluation Metrics

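  • Metrics (TP = true positives, FP = false positives, FN = false negatives, from the contingency table):
    • Precision: TP / (TP + FP)

    • Recall: TP / (TP + FN)

    • F-Score: 2 × Precision × Recall / (Precision + Recall)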

  • However, all of these metrics are biased for Geographic Information Retrieval
    • E.g., if your model simply predicts Washington as the state in the US, it can achieve a high precision but will be unable to parse it correctly in other contexts (e.g., people, organizations)
    • Plus, geoparsing can be geographically biased, and none of these metrics can capture this
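Precision, recall, and F-score can be computed by comparing the set of system-annotated place names with the human-annotated set. The sketch below assumes each annotation is a (position, name) pair and uses made-up data; `precision_recall_f1` is an illustrative helper.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F1 from two sets of annotations."""
    tp = len(predicted & gold)   # correctly identified place names
    fp = len(predicted - gold)   # predicted, but not in the gold standard
    fn = len(gold - predicted)   # in the gold standard, but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical annotations: (token position, place name).
gold = {(0, "Bristol"), (1, "Bath"), (2, "London")}
predicted = {(0, "Bristol"), (1, "Bath"), (3, "Turkey")}

p, r, f1 = precision_recall_f1(predicted, gold)
print(round(p, 3), round(r, 3), round(f1, 3))
```

With two correct predictions, one spurious prediction, and one missed place name, all three values equal 2/3.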


31 of 44

Evaluation Metrics


32 of 44

Recall: What is Georeferencing

  • Making implicit spatial information explicit
  • “References to locations”
  • Two steps:
    • Identifying references to locations as expressed in natural language text
      • Geoparsing, place name identification, toponym recognition, …
    • Relating them to positions on the Earth
      • Geocoding, toponym resolution, …
  • Sometimes, these two steps are done simultaneously
  • Ambiguity might exist in both steps → Disambiguation
  • It is also a sub-field in AI/Data Mining


33 of 44

Geocoding

  • Definition: to associate referenced place names with a unique and meaningful identifier in a knowledge base/gazetteer, which ideally can be then associated with a metric georeference to a location.
    • E.g., “Bath, UK” → (51.3794° N, 2.3656° W)
    • In addition to location, the geocoding process also provides other relevant information about the place, e.g., type of the place, population, famous people related to the place.
  • Similar ambiguities to geoparsing
    • Referent ambiguity (Polysemy and Synonymy)
    • Geographic ambiguity (e.g., a city refers to a point or a polygon?)
  • Same evaluation metrics as geoparsing
    • Precision, recall, and F1-score
    • Distance lag (distance between the predicted and the true location)


34 of 44

Ambiguity in Geocoding

Example:

  • To geocode London using the GeoNames gazetteer:


35 of 44

Geocoding Approaches Overview

  • Knowledge-based (heuristics and hand-crafted rule-based)
  • Map-based
  • Data-driven or supervised ML


36 of 44

Knowledge-based Approach

  • Example heuristics or rules (external world knowledge)
    • A place with a larger population is more likely to be selected
    • A place that is mentioned more frequently is more likely to be selected
    • The first candidate listed in the gazetteer is selected
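The population heuristic amounts to choosing the candidate with the largest population value. The sketch below uses a hypothetical candidate list with rough, illustrative population figures; the record layout and `pick_by_population` are assumptions of this example, not a gazetteer's actual schema.

```python
def pick_by_population(candidates):
    """Return the candidate record with the largest population."""
    return max(candidates, key=lambda c: c["population"])

# Hypothetical gazetteer records for "London" (populations are rough figures).
candidates = [
    {"name": "London", "country": "CA", "population": 420_000},
    {"name": "London", "country": "GB", "population": 8_900_000},
]

best = pick_by_population(candidates)
print(best["name"], best["country"])
```

This default is right whenever a text means the most populous London, and wrong for a text about Ontario, which is why context is needed.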


37 of 44

Knowledge-based Approach Example

  • Assign target place to the default location based on external world knowledge, e.g. most commonly used place; largest population.

Works well in this case

Failed


38 of 44

Again, Context is the Key!

  • Context-based knowledge:
    • Other places mentioned within the same text segment, or within the whole document
      • Candidate places can be selected based on their types
        • E.g., “Washington State” and “Washington Mountain”
      • Candidate places can be selected according to place relationships (ontology)
        • E.g., “London, England” and “London, Ontario”


39 of 44

Data-driven or Supervised Approach

  • In addition to explicitly using the place-related terms within a local or document context, we can also rely, more broadly, on context learned from non-geographic words.
    • Assumption: specific non-geographic words will often occur in context with specific places
    • Example:
      • London, UK → finance center, Romans, fashion, royal family, UCL;
      • London, Ontario → regional center, health care, manufacturing, information technology
    • How to learn this type of context? → ML with Semantics


40 of 44

Map-based Approach

  • Instead of using context from text, we can also apply spatial context
    • Assumption: Tobler’s First Law of Geography
    • Namely, locations mentioned in a document are spatially autocorrelated. So, the correct candidate is likely to be the one that minimizes the distance to the other places mentioned around the target place name.
    • Process: compute the (geographic) distance from each candidate place in the gazetteer to the places surrounding the target place name in the text, and then choose the candidate that minimizes the distance

Waldo Tobler

1931 - 2018


41 of 44

Map-based Approach Example

1. Compute the distances between the contextual places and each candidate place, 2. average the distances for each candidate, and 3. select the candidate with the minimal average distance as the geocoded result
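These three steps can be sketched with great-circle (haversine) distances. The coordinates are approximate and the contextual places are made up for illustration; `haversine_km` and `pick_by_mean_distance` are hypothetical helper names, and a spherical Earth with radius 6371 km is assumed.

```python
import math

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def pick_by_mean_distance(candidates, context_places):
    """Select the candidate with the smallest mean distance to the context."""
    def mean_distance(coords):
        total = sum(haversine_km(coords, c) for c in context_places)
        return total / len(context_places)
    return min(candidates, key=lambda name: mean_distance(candidates[name]))

# Approximate coordinates (lat, lon) for two candidates of "London".
candidates = {
    "London, England": (51.51, -0.13),
    "London, Ontario": (42.98, -81.25),
}
# Other places mentioned in the same (hypothetical) text: Toronto, Windsor.
context_places = [(43.65, -79.38), (42.31, -83.04)]

best = pick_by_mean_distance(candidates, context_places)
print(best)
```

Swapping the context for places in southern England would flip the result, since only the spatial context drives the choice.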


42 of 44

More Challenges

  • Granularity
    • Many gazetteers still record places as points. This introduces large uncertainties for place types such as rivers, mountains, streets, and cities
  • Historical places
    • Geocoding is much less successful for historical documents than for current newspapers
    • A major reason is the lack of a comprehensive gazetteer that covers places across time
  • Gazetteer integration
    • Integration can help combine information from different sources
    • Semantic interoperability is a challenge!


43 of 44

Summary of Geocoding

  • Knowledge-based approach
    • external world knowledge
    • contextual knowledge from the text and spatial relationships
  • Map-based approach
    • Geo-contextual knowledge from the text
    • Tobler’s First Law of Geography
  • Data-driven or supervised approach
    • Context in a broader sense
  • Challenges


44 of 44

Summary of Georeferencing

  • Making implicit spatial information explicit
  • “References to locations”
  • Two steps:
    • Identifying references to locations as expressed in natural language text
      • Geoparsing, place name identification, toponym recognition, …
    • Relating them to positions on the Earth
      • Geocoding, toponym resolution, …
  • Sometimes, these two steps are done simultaneously
  • Ambiguity might exist in both steps → Disambiguation
  • Still a trending topic in AI/Data Science/Geography
