Article

Learning to combine multiple string similarity metrics for effective toponym matching

Journal

INTERNATIONAL JOURNAL OF DIGITAL EARTH
Volume 11, Issue 9, Pages 913-938

Publisher

TAYLOR & FRANCIS LTD
DOI: 10.1080/17538947.2017.1371253

Keywords

Toponym matching; supervised learning; string similarity metrics; duplicate detection; ensemble learning; geographic information retrieval

Funding

  1. Trans-Atlantic Platform for the Social Sciences and Humanities, through the Digging into Data project [HJ-253525]
  2. Reassembling the Republic of Letters networking programme (EU COST Action) [IS1310]
  3. Fundação para a Ciência e a Tecnologia (FCT) [PTDC/EEI-SCR/1743/2014, CMUP-ERI/TIC/0046/2014]
  4. INESC-ID from the PIDDAC programme [UID/CEC/50021/2013]
  5. Economic and Social Research Council (ESRC) [ES/R003890/1]

Abstract

Several tasks in geographical information retrieval and the geographical information sciences involve toponym matching, that is, the problem of matching place names that share a common referent. In this article, we present the results of a wide-ranging evaluation of the performance of different string similarity metrics on the toponym matching task. We also report on experiments that use supervised machine learning to combine multiple similarity metrics, which has the natural advantage of avoiding manual tuning of similarity thresholds. Experiments with a very large dataset show that the performance differences between the individual similarity metrics are relatively small, and that carefully tuning the similarity threshold is important for achieving good results. The methods based on supervised machine learning, particularly ensembles of decision trees, achieve good results on this task, significantly outperforming the individual similarity metrics.
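
The two ideas in the abstract, tuning a threshold for a single metric and building one feature vector from several metrics, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the article: the three metrics chosen, the helper names (`levenshtein_similarity`, `bigram_jaccard`, `features`, `best_threshold`), and the six hand-labelled toponym pairs.

```python
from difflib import SequenceMatcher


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance normalised to [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def bigram_jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the sets of character bigrams."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def features(a: str, b: str) -> list[float]:
    """One similarity score per metric, computed on lower-cased names."""
    a, b = a.lower(), b.lower()
    return [
        levenshtein_similarity(a, b),
        bigram_jaccard(a, b),
        SequenceMatcher(None, a, b).ratio(),
    ]


def best_threshold(scores, labels):
    """Return the (threshold, accuracy) pair that maximises accuracy."""
    best = (0.0, -1.0)
    for t in sorted(set(scores)):
        # A pair is predicted to match when its score reaches the threshold.
        correct = sum((s >= t) == y for s, y in zip(scores, labels))
        acc = correct / len(labels)
        if acc > best[1]:
            best = (t, acc)
    return best


# Hypothetical labelled pairs (True = same place), for illustration only.
PAIRS = [
    ("Lisboa", "Lisbon", True),
    ("München", "Munchen", True),
    ("Saint Petersburg", "St. Petersburg", True),
    ("Lisbon", "London", False),
    ("Porto", "Paris", False),
    ("Vienna", "Venice", False),
]

X = [features(a, b) for a, b, _ in PAIRS]
y = [label for _, _, label in PAIRS]

# Tune a threshold for the first metric only (normalised edit distance).
threshold, accuracy = best_threshold([row[0] for row in X], y)
```

Each row of `X` holds one score per metric for a candidate pair. In a supervised setup of the kind the abstract describes, such rows and the labels in `y` would be passed to a classifier, for example an ensemble of decision trees, so that no per-metric threshold has to be chosen by hand.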
