Geolocating News Articles - A list of potential technologies
A compilation of information about extracting geographic data from unstructured text
Compiled by Catherine D'Ignazio for the MIT Center for Civic Media. Send questions, additions, or requests for edit access to dignazio AT mit.edu
Columns: Product | Main purpose | URL | Pricing | Technical | Comments | Does geographic disambiguation? | Reliability vs. humans | Comments from Charlie, Data Scientist of CLAVIN | Other comments
Product: Yahoo PlaceSpotter
Main purpose: place extractor
URL: http://developer.yahoo.com/boss/geo/ AND http://developer.yahoo.com/boss/geo/docs/key-concepts.html#d4e1755
Pricing: $8/1000 calls
Technical: web-based API, call with YQL, returns JSON or XML
Comments: Perfect service, seems to work well on their example, but super expensive
Does geographic disambiguation? no
Comments from Charlie: Their gazetteer contains only 6M location names, while CLAVIN uses the GeoNames.org gazetteer with over 8M locations and nearly 16M unique place names. It also provides exact matches only, and does not attempt to disambiguate between locations that have the same name. This is especially troublesome with place names like "Springfield" (Massachusetts? Illinois? Missouri? Virginia? etc.). CLAVIN, on the other hand, uses context-based heuristics to select the right match for ambiguous place names, and also handles both typographic errors and phonetic misspellings.
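To make the "exact matches only" point concrete, here is a minimal sketch of typo-tolerant gazetteer lookup. It is not how CLAVIN or PlaceSpotter work internally; the five-name gazetteer, the `fuzzy_lookup` helper, and the 0.8 cutoff are all assumptions for illustration, and it only covers typographic errors, not phonetic misspellings.

```python
import difflib

# Hypothetical mini-gazetteer; a real one (e.g. GeoNames) holds millions of names.
GAZETTEER = ["Springfield", "Boston", "Chicago", "London", "Oxford"]


def fuzzy_lookup(name, gazetteer=GAZETTEER, cutoff=0.8):
    """Return gazetteer names that approximately match `name`, best match first."""
    # Compare case-insensitively, then map back to the canonical spelling.
    canonical = {entry.lower(): entry for entry in gazetteer}
    hits = difflib.get_close_matches(name.lower(), list(canonical), n=3, cutoff=cutoff)
    return [canonical[hit] for hit in hits]


if __name__ == "__main__":
    # An exact-match lookup would find nothing for either misspelling.
    print(fuzzy_lookup("Sprngfield"))
    print(fuzzy_lookup("Bostn"))
```

Fuzzy lookup only recovers the name; picking which Springfield is meant still needs a separate disambiguation step.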
Product: Yahoo YQL Tables for Developers
Main purpose: place extractor
URL: http://developer.yahoo.com/boss/geo/docs/free_YQL.html
Pricing: free for non-commercial usage, limited to 2000 queries per day
Technical: use YQL queries, returns JSON or XML
Comments: won't work for big data bc of query limit
Does geographic disambiguation? no, but could use geo.placefinder table for this, same pricing applies
Product: OpenCalais
Main purpose: entity extraction
URL: http://www.opencalais.com/ OR http://viewer.opencalais.com/
Pricing: free to 50000/day (from what I can tell)
Technical: web-based API
Comments: Works well in example on their site
Does geographic disambiguation? sort of - sometimes it will know which place a place is but sometimes it won't.
Comments from Charlie: This is a good tool, but its disambiguation functionality is incomplete. If you go to the demo and enter something like "I went to London and Oxford last summer," it'll extract both locations and will even resolve London to a gazetteer record, but for Oxford it just gives you a list of cities in the world with that name.
Product: Alchemy
Main purpose: entity extraction
URL: http://www.alchemyapi.com/
Pricing: $800/month for 70000 API calls
Technical: web-based API
Comments: works well but EXPENSIVE at scale
Does geographic disambiguation? sort of - sometimes it will know which place a place is but sometimes it won't
Comments from Charlie: For what it is, Alchemy is a good tool, but it only does extraction -- it doesn't even attempt to resolve location names to actual geospatial entities. Using the same example as before with their demo ("I went to London and Oxford last summer."), Alchemy will find both location names and tell you that they're cities, but not where they are or anything like that.
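A rough back-of-the-envelope sketch of what "expensive at scale" and "won't work for big data" mean, using the pricing and quota figures quoted in the rows above. The corpus size of 1 million documents, one API call per document, and linear scaling of Alchemy's $800 per 70000 calls are all assumptions, not vendor terms.

```python
import math


def cost_and_time(docs):
    """Estimate spend or waiting time for `docs` API calls (one call per document assumed)."""
    return {
        # Yahoo PlaceSpotter: $8 per 1000 calls.
        "placespotter_usd": docs / 1000 * 8,
        # Alchemy: $800 buys 70000 calls; linear scaling is an assumption.
        "alchemy_usd": docs / 70_000 * 800,
        # Yahoo YQL free tier: 2000 queries per day.
        "yql_days": math.ceil(docs / 2000),
        # OpenCalais free tier: 50000 calls per day.
        "opencalais_days": math.ceil(docs / 50_000),
    }


if __name__ == "__main__":
    for key, value in cost_and_time(1_000_000).items():
        print(f"{key}: {value:,.2f}")
```

Under these assumptions, 1 million documents would cost about $8,000 on PlaceSpotter and about $11,429 on Alchemy, or take 500 days on the free YQL tier and 20 days on the free OpenCalais tier.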
Product: Zemanta
Main purpose: content suggestion (via entity extraction)
URL: http://developer.zemanta.com/docs/
Pricing: up to 10000 calls/day for free
Technical: web-based API, returns JSON or XML
Comments: works decently in preliminary tests, not clear if "place" is a category
Does geographic disambiguation? sort of - sometimes it will know which place a place is but sometimes it won't
Comments from Charlie: I hadn't heard of this one before. I tried out the demo, and it does seem like they're attempting to do disambiguation, but it also appears to be incomplete. It handled "I went to London and Oxford last summer" just fine, but when I gave it "I was born in Boston and grew up in Springfield," it only found Boston -- it missed Springfield completely.
Product: Stanford NER
Main purpose: entity extraction/research
URL: http://nlp.stanford.edu/software/CRF-NER.shtml
Pricing: free
Technical: Java with interfaces in Python or Perl
Comments: research project & open source, not well-documented but could run it on our server, haven't tested how well it works, possibly tricky to install bc no docs, used widely in other open source projects and reportedly works very well for entity extraction but would have to figure out geo piece
Does geographic disambiguation? no
Comments from Charlie: Stanford NER is great -- I use it all the time. But all it does is extract the location names from text (a necessary first step for CLAVIN) -- it doesn't resolve those names into geospatial entities or gazetteer records (which is what CLAVIN does).
Product: SENNA parser
Main purpose: entity extraction/research
URL: http://ml.nec-labs.com/senna/senna-v2.0/
Pricing: free, open source
Technical: C, command-line interface
Comments: works reasonably well on 1 test
Does geographic disambiguation? no
Comments from Charlie: These guys don't seem to offer an online demo, but if all they're doing is named entity recognition, then they're in the same boat as Stanford NER. However, they're reporting an F1 score of 89.59% on the CoNLL 2003 NER task, which puts them a bit ahead of Stanford NER (86.86%) but still behind the Illinois Named Entity Tagger (90.8%).
Product: MetaCarta GeoTagger
Main purpose: tags geographic references in unstructured text
URL: http://developers.metacarta.com/doc/metacarta-appliance-geotagger-api-guide-and-examples/
Pricing: $200K per appliance bc you buy their hardware
Technical: SOAP API, tags returned as XML, each tag has confidence index
Comments: Extremely pricey. Docs a little fuzzy at first glimpse. Reportedly works well. Main competitor, with NetOwl, to CLAVIN.
Does geographic disambiguation? yes
Comments from Charlie: MetaCarta has been the dominant commercial market solution in this space since 2001. It's really great, and also REALLY expensive -- prohibitively expensive for large-scale usage, in my opinion. It's something like $200k per appliance (yes, you buy their hardware) or $2 per document via a subscription-based model.
Other comments: According to this paper, http://www.cs.umd.edu/~hjs/pubs/geotag-icde2010.pdf, MetaCarta's engine works by giving precedence to places of higher prominence, e.g. Paris will almost always resolve to Paris, France because it does that 95% of the time. This works in most cases but of course will badly miss more local cases, like when events happen in Paris, TX.
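The prominence rule described above can be sketched in a few lines. This is only an illustration of the general idea, not MetaCarta's engine: the candidate list is hypothetical, the population figures are rough illustrative values, and using population as the stand-in for "prominence" is an assumption.

```python
# Hypothetical candidates; populations are rough illustrative values, not gazetteer data.
CANDIDATES = {
    "Paris": [
        {"name": "Paris, France", "population": 2_100_000},
        {"name": "Paris, Texas, USA", "population": 25_000},
    ],
}


def resolve_by_prominence(place_name, candidates=CANDIDATES):
    """Pick the most prominent candidate, ignoring the surrounding text entirely."""
    options = candidates.get(place_name, [])
    if not options:
        return None
    # Population is used here as a crude proxy for prominence (an assumption).
    return max(options, key=lambda c: c["population"])["name"]


if __name__ == "__main__":
    # A story about a county fair in Paris, TX still resolves to France.
    print(resolve_by_prominence("Paris"))
```

Because the rule never looks at context, local news about the smaller Paris is always mapped to the larger one, which is the failure mode noted above.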
Product: Gutenmap
Main purpose: Example of getting place names from Gutenberg texts & mapping them
URL: http://www.mappinghacks.com/projects/gutenmap/map.shtml
Comments: Looks a bit out of date, not sure if it's maintained
Does geographic disambiguation? n/a
Product: GeoDict
Main purpose: Get location names from unstructured text
URL: https://github.com/petewarden/geodict
Pricing: free, open source
Technical: Written in Python, command-line interface, returns JSON or CSV
Comments: looks simple to use and well-documented, not sure if as robust for place context & reliability as other packages e.g. CLAVIN, only does countries, regions, cities, i.e. not street level, addresses, etc. Also the Data Science Toolkit - http://www.datasciencetoolkit.org/ - has a host of useful open source geo & NLP tools like a free geocoder.
Does geographic disambiguation? no, but could use Data Science Toolkit's geocoder
Product: NetOwl Extractor
Main purpose: Entity Extraction with GeoTagging focus
URL: http://www.netowl.com/entity-extraction/geotagging/
Pricing: pricing not available, probably really expensive
Technical: APIs in Java, C & web services, handles input text in many formats
Comments: multiple language support! http://www.netowl.com/entity-extraction/languages/ . Will check on cost but probably prohibitive. Could install locally. Built-in geocoder.
Does geographic disambiguation? yes
Comments from Charlie: NetOwl is a great tool for entity extraction and resolution. Although it doesn't focus primarily on locations like MetaCarta, it does have a very robust geospatial component. Unfortunately, it's just as expensive as MetaCarta.
Product: Unlock
Main purpose: Geoparser for web text, focus on contemporary & classical places
URL: http://unlock.edina.ac.uk/texts/introduction
Pricing: free, code possibly available on request for non-commercial
Technical: web service, returns JSON
Comments: Re: scale - they have an asynchronous batch processing capability which hasn't been tested at huge scales. The person to contact about release of the source code is Claire Grover [grover@inf.ed.ac.uk] at the University's Language Technology Group (LTG) [http://www.ltg.ed.ac.uk/]. The underlying natural language processing technology is developed by LTG and we simply wrap the back end technology with a front end web service for convenience. LTG have been amenable to providing their code to others in the past provided it's for non-commercial application.
Does geographic disambiguation? yes
Reliability vs. humans: from James, researcher on the project: Some formal metrics such as F scores and precision and recall stats are published in some of Claire's papers, which can be found on the LTG website - http://www.ltg.ed.ac.uk/. In ideal situations where the document structure is predictable and the heuristics of the processing have been 'tweaked', it's possible to achieve high 80s to 90% in precision, i.e. the software tallies with human markup 8 or 9 times out of 10. This drops off where the document structure is unknown or highly variable, so there are always performance improvements to be had by understanding the structure and content of the corpus, but that's true of any NLP software. I believe Claire did some quick comparison with Yahoo Placemaker when it was first released, and the LTG code performed very similarly and, for some document types, better than the Yahoo stuff. Of course it can never be 100%, but it can be tuned to improve its ability.
Comments from Charlie: It's not clear how it attempts to disambiguate ambiguous place names, but it does return a ranked list of potential locations for each place name it finds in the text. It would be interesting to see how changes to the input text affect the rank of potential matches (e.g., compare the output generated by "I was born in Boston and grew up in Springfield." with that of "I was born in Chicago and grew up in Springfield.").
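The Boston/Chicago comparison suggested above can be tried against a toy context-based resolver. This is a sketch of one simple heuristic (pick the candidate nearest to an unambiguous place mentioned in the same text), not Unlock's or CLAVIN's actual method; the gazetteer, its approximate coordinates, and the `resolve` helper are all hypothetical, and place names are assumed to be already extracted.

```python
import math

# Hypothetical mini-gazetteer with approximate (lat, lon) coordinates.
GAZETTEER = {
    "Boston": [("Boston, MA", 42.36, -71.06)],
    "Chicago": [("Chicago, IL", 41.88, -87.63)],
    "Springfield": [
        ("Springfield, MA", 42.10, -72.59),
        ("Springfield, IL", 39.80, -89.64),
        ("Springfield, MO", 37.21, -93.29),
    ],
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat, dlon = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(a))


def resolve(place_names):
    """Resolve names; ambiguous ones go to the candidate nearest an unambiguous anchor."""
    anchors = [GAZETTEER[n][0] for n in place_names if len(GAZETTEER.get(n, [])) == 1]
    resolved = {}
    for name in place_names:
        candidates = GAZETTEER.get(name, [])
        if not candidates:
            continue  # unknown name: leave unresolved
        if len(candidates) == 1 or not anchors:
            resolved[name] = candidates[0][0]
        else:
            resolved[name] = min(
                candidates,
                key=lambda c: min(haversine_km(c[1], c[2], a[1], a[2]) for a in anchors),
            )[0]
    return resolved


if __name__ == "__main__":
    print(resolve(["Boston", "Springfield"]))
    print(resolve(["Chicago", "Springfield"]))
```

With this toy data, changing the co-mentioned city from Boston to Chicago flips Springfield from Massachusetts to Illinois, which is the kind of context sensitivity the comparison is meant to probe.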
Product: Clavin
Main purpose: intelligent place extraction
URL: http://clavin.bericotechnologies.com/
Pricing: free, open source
Technical: Java
Comments: By far best-sounding option if it works. Re: performance - "in a relatively small (9-node) Hadoop cluster, we're able to process 1 million documents containing 5.7 million location names in under an hour". Can switch in Stanford NER for better reliability:
Does geographic disambiguation? yes
Reliability vs. humans:
With Stanford NER
precision: 0.7397959
recall: 0.76719576
f-measure (f1): 0.7532468
elapsed time: 2 seconds

With OpenNLP
precision: 0.73049647
recall: 0.54497355
f-measure (f1): 0.6242424
elapsed time: 4 seconds
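For reference, the f-measure figures above follow from the listed precision and recall via the standard harmonic-mean formula. A small sketch to check them (the `f1_score` helper name is just for illustration):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision/recall pairs as reported in the row above.
    print("Stanford NER:", round(f1_score(0.7397959, 0.76719576), 4))
    print("OpenNLP:", round(f1_score(0.73049647, 0.54497355), 4))
```

The two runs have nearly the same precision, so the gap in f-measure comes almost entirely from OpenNLP's lower recall.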
Product: Web-a-Where
Comments: Heard about it from this paper: http://www.cs.umd.edu/~hjs/pubs/geotag-icde2010.pdf. Uses small gazetteer, so misses lots of locations.
Product: Gate