Extracting Places with Python

Ok, a quick note to remind myself – I was interested in learning how to use this: https://pypi.python.org/pypi/geograpy/0.3.7 

Installation was a bit complicated; lots of dependencies. A few pages helped sort me out, AND ultimately I had to open up the geograpy/extraction.py file and change one line of code (line 31, as it happens), as per http://stackoverflow.com/questions/27341311/notimplementederror-use-label-to-access-a-node-label

So, first, let’s get all the bits and pieces installed. I downloaded the package as a zip, unzipped it, then:

$ sudo python setup.py install

At each stage, I would run a little Python script, test.py: in my text editor, I pasted their default script, saved it as test.py, and ran it from the command line. This thing:

import geograpy
url = 'http://www.bbc.com/news/world-europe-26919928'
places = geograpy.get_place_context(url=url)

Every error message moved me one step closer as it would tell me whatever module I was missing.
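That run-and-crash loop works, but you could also probe for missing modules up front. A minimal sketch using only the standard library (the module list here is just illustrative, not geograpy’s actual dependency list):

```python
import importlib

def missing_modules(names):
    """Return the subset of module names that fail to import."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

# hypothetical check list -- substitute whatever your traceback complains about
print(missing_modules(['PIL', 'lxml', 'bs4', 'nltk']))
```

Anything in the returned list is a pip install away, which beats rediscovering the dependencies one crash at a time.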

For starters, it turned out ‘pil’ was needed. But pil isn’t maintained any more. Some googling revealed that Pillow is the answer!

$ sudo pip install pillow

Next thing missing: lxml

$ sudo pip install lxml

Then Beautiful Soup was missing. So:

$ sudo pip install beautifulsoup

At this point, the error messages got a bit more cryptic:

Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>

So, from the command line I typed `python`, then `import nltk` and `nltk.download()`. A little window pops open, and I found the punkt tokenizer package. Hit the download button, close the window, `exit()`, and run my test.py again:

Resource u'taggers/maxent_treebank_pos_tagger/english.pickle'
  not found.  

Solved that one the same way. Then:

Resource u'chunkers/maxent_ne_chunker/english_ace_multiclass.pickle'
  not found.

And then:

Resource u'corpora/words' not found.
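All four errors follow the same pattern, and the NLTK downloader’s package id appears to be the second component of the resource path, so you could script the fix rather than clicking through the GUI each time. A sketch (the actual `nltk.download` call is left commented out, since it needs a network connection, and the second-path-component assumption is mine, not the docs’):

```python
import re

def resource_id(error_message):
    """Pull the downloader package id (e.g. 'punkt') out of an
    NLTK 'Resource ... not found' error message."""
    match = re.search(r"'\w+/(\w+)", error_message)
    return match.group(1) if match else None

errors = [
    "Resource u'tokenizers/punkt/english.pickle' not found.",
    "Resource u'taggers/maxent_treebank_pos_tagger/english.pickle' not found.",
    "Resource u'chunkers/maxent_ne_chunker/english_ace_multiclass.pickle' not found.",
    "Resource u'corpora/words' not found.",
]

for err in errors:
    pkg = resource_id(err)
    # import nltk; nltk.download(pkg)  # uncomment to fetch non-interactively
    print(pkg)
```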

Then… success! My wee script ran. (It’s always rather anticlimactic when something works – often, you only know it worked because you’re presented with the $ again, without comment.) Now to get something useful out of it. So, for interest’s sake, I pointed it at a Gutenberg copy of The Case-Book of Sherlock Holmes: http://www.gutenberg.ca/ebooks/doyleac-casebookofsherlockholmes/doyleac-casebookofsherlockholmes-00-h.html

and instructed it to print things out, like so:

import geograpy
url = 'http://www.gutenberg.ca/ebooks/doyleac-casebookofsherlockholmes/doyleac-casebookofsherlockholmes-00-h.html'
places = geograpy.get_place_context(url=url)
print places.country_mentions
print places.region_mentions
print places.city_mentions

And the results in my terminal:
[(u'Canada', 2), (u'Turkey', 1), ('Central African Republic', 1), ('United Kingdom', 1), (u'Japan', 1), (u'France', 1), (u'United States', 1), (u'Australia', 1), (u'Hungary', 1), (u'South Africa', 1), (u'Norfolk Island', 1), (u'Jamaica', 1), (u'Netherlands', 1)]
… the ‘canada’ is surely because this was gutenberg.ca, of course…

[(u'Baron', 16), (u'England', 3), (u'Adelbert', 3), (u'Kingston', 2), (u'Strand', 2), (u'Southampton', 1), (u'Briton', 1), (u'Bedford', 1), (u'Baker', 1), (u'Queen', 1), (u'Liverpool', 1), (u'Doyle', 1), (u'Damery', 1), (u'Bedfordshire', 1), (u'Greyminster', 1), (u'Euston', 1)]
… a few names creeping in there…

[(u'Watson', 37), (u'Holmes', 34), (u'Godfrey', 23), (u'Ralph', 10), (u'Baron', 8), (u'Merville', 5), (u'London', 5), (u'Johnson', 4), (u'England', 3), (u'Eastern', 2), (u'Strand', 2), (u'Pretoria', 2), (u'Kingston', 2), (u'Violet', 2), (u'Turkey', 1), (u'Middlesex', 1), (u'Dickens', 1), (u'Bedford', 1), (u'God', 1), (u'Damery', 1), (u'Wainwright', 1), (u'Nara', 1), (u'Bohemia', 1), (u'Liverpool', 1), (u'Doyle', 1), (u'America', 1), (u'Southampton', 1), (u'Sultan', 1), (u'Baker', 1), (u'Richardson', 1), (u'Square', 1), (u'Four', 1), (u'Lomax', 1), (u'Emsworth', 1), (u'Scott', 1), (u'Valhalla', 1)]

So yep, a bit noisy, but promising. Incidentally, when I run it on that BBC news story, the results are much more sensible:

[(u'Ukraine', 23), ('Russian Federation', 20), ('Czech Republic', 2), (u'Lithuania', 1), (u'United States', 1), (u'Belgium', 1)]

[(u'Luhansk', 4), (u'Donetsk', 2)]

[(u'Russia', 20), (u'Moscow', 5), (u'Kharkiv', 5), (u'Donetsk', 2), (u'Independence', 2), (u'Media', 1), (u'Brussels', 1)]
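One handy thing: each of those *_mentions results is just a list of (name, count) tuples, so trimming the one-off noise is a one-liner. A minimal sketch, using a hand-copied slice of the Holmes city list above rather than a live geograpy call:

```python
def frequent(mentions, threshold=2):
    """Keep only the (name, count) pairs mentioned at least `threshold` times."""
    return [(name, n) for name, n in mentions if n >= threshold]

# a hand-copied slice of the city_mentions output above
city_mentions = [('Watson', 37), ('Holmes', 34), ('London', 5),
                 ('Strand', 2), ('Turkey', 1), ('Valhalla', 1)]
print(frequent(city_mentions))
```

The singletons drop out, and raising the threshold prunes harder – though as the Watson/Holmes counts show, frequency alone won’t separate people from places.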

So obviously the corpora that NLTK is using are geared towards more contemporary situations than the worlds described by Arthur Conan Doyle. That’s interesting, and useful to know. I expect – though I haven’t looked yet – that one could use, say, a trained 19th-century corpus with NLTK’s taggers etc., to get more useful results. Hmmm! A project for someone, perhaps in my #digh5000 class.