Ok, a quick note to remind myself – I was interested in learning how to use this: https://pypi.python.org/pypi/geograpy/0.3.7
Installation was a bit complicated; lots of dependencies. The following pages helped sort me out:
https://docs.python.org/2/install/
http://stackoverflow.com/questions/4867197/failed-loading-english-pickle-with-nltk-data-load
AND ultimately, I had to open up the package’s geograpy/extraction.py file and change one line of code (line 31, as it happens), as per http://stackoverflow.com/questions/27341311/notimplementederror-use-label-to-access-a-node-label
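For the record, the fix is the NLTK 2-to-3 API rename: the Tree.node attribute became the Tree.label() method. I’m paraphrasing the surrounding code from memory, so your line 31 may read slightly differently, but the change looks like this:

# old line in geograpy/extraction.py, raises NotImplementedError under NLTK 3:
if ne.node == 'GPE' or ne.node == 'PERSON':
# new line, since NLTK 3 renamed the .node attribute to the .label() method:
if ne.label() == 'GPE' or ne.label() == 'PERSON':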
So, first, let’s get all the bits and pieces installed. I downloaded the package as a zip, unzipped it, then:
$ sudo python setup.py install
At each stage, I would run a little Python script, test.py: in my text editor, I pasted geograpy’s default example script and saved it as test.py, which I’d then run from the command line. This thing:
import geograpy
url = 'http://www.bbc.com/news/world-europe-26919928'
places = geograpy.get_place_context(url=url)
Every error message moved me one step closer, as it would tell me which module I was missing.
For starters, it turned out PIL was needed. But PIL isn’t maintained any more; some googling revealed that Pillow is the answer!
$ sudo pip install pillow
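(Pillow is a drop-in fork of PIL; it installs under the old PIL namespace, so you can sanity-check the install from the interpreter like so:)

# Pillow provides the legacy PIL namespace, so old-style imports still work:
from PIL import Image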
Next thing missing: lxml
$ sudo pip install lxml
Then Beautiful Soup was missing. So:
$ sudo pip install beautifulsoup
At this point, the error messages got a bit more cryptic:
Resource u'tokenizers/punkt/english.pickle' not found. Please use the NLTK Downloader to obtain the resource: >>> nltk.download()
So, from the command line I typed `python`, then `import nltk` and `nltk.download()`. A little window popped open; I found the punkt tokenizer package, hit the download button, closed the window, typed `exit()`, and ran my test.py again:
Resource u'taggers/maxent_treebank_pos_tagger/english.pickle' not found.
Solved that one the same way. Then:
u'chunkers/maxent_ne_chunker/english_ace_multiclass.pickle' not found.
…then:
Resource u'corpora/words' not found
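In hindsight, you can skip the GUI and grab all four resources in one go from the interpreter; the package ids here match the resources the errors complained about:

import nltk
# download each missing NLTK data package non-interactively:
for pkg in ('punkt', 'maxent_treebank_pos_tagger', 'maxent_ne_chunker', 'words'):
    nltk.download(pkg)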
Then… success! My wee script ran. (It’s always rather anticlimactic when something works – often, you only know it worked because you’re presented with the $ again, without comment.) Now to get something useful out of it. So, for interest’s sake, I pointed it at a Project Gutenberg Canada copy of The Case-Book of Sherlock Holmes: http://www.gutenberg.ca/ebooks/doyleac-casebookofsherlockholmes/doyleac-casebookofsherlockholmes-00-h.html
and instructed it to print things out, like so:
import geograpy
url = 'http://www.gutenberg.ca/ebooks/doyleac-casebookofsherlockholmes/doyleac-casebookofsherlockholmes-00-h.html'
places = geograpy.get_place_context(url=url)
print places.country_mentions
print places.region_mentions
print places.city_mentions
And the results in my terminal:
Countries:
[(u'Canada', 2), (u'Turkey', 1), ('Central African Republic', 1), ('United Kingdom', 1), (u'Japan', 1), (u'France', 1), (u'United States', 1), (u'Australia', 1), (u'Hungary', 1), (u'South Africa', 1), (u'Norfolk Island', 1), (u'Jamaica', 1), (u'Netherlands', 1)]
… the ‘Canada’ is surely because this was gutenberg.ca, of course…
Regions:
[(u'Baron', 16), (u'England', 3), (u'Adelbert', 3), (u'Kingston', 2), (u'Strand', 2), (u'Southampton', 1), (u'Briton', 1), (u'Bedford', 1), (u'Baker', 1), (u'Queen', 1), (u'Liverpool', 1), (u'Doyle', 1), (u'Damery', 1), (u'Bedfordshire', 1), (u'Greyminster', 1), (u'Euston', 1)]
… a few names creeping in there…
Cities:
[(u'Watson', 37), (u'Holmes', 34), (u'Godfrey', 23), (u'Ralph', 10), (u'Baron', 8), (u'Merville', 5), (u'London', 5), (u'Johnson', 4), (u'England', 3), (u'Eastern', 2), (u'Strand', 2), (u'Pretoria', 2), (u'Kingston', 2), (u'Violet', 2), (u'Turkey', 1), (u'Middlesex', 1), (u'Dickens', 1), (u'Bedford', 1), (u'God', 1), (u'Damery', 1), (u'Wainwright', 1), (u'Nara', 1), (u'Bohemia', 1), (u'Liverpool', 1), (u'Doyle', 1), (u'America', 1), (u'Southampton', 1), (u'Sultan', 1), (u'Baker', 1), (u'Richardson', 1), (u'Square', 1), (u'Four', 1), (u'Lomax', 1), (u'Emsworth', 1), (u'Scott', 1), (u'Valhalla', 1)]
So yep, a bit noisy, but promising. Incidentally, when I run it on that BBC news story, the results are much more sensible:
[(u'Ukraine', 23), ('Russian Federation', 20), ('Czech Republic', 2), (u'Lithuania', 1), (u'United States', 1), (u'Belgium', 1)]
[(u'Luhansk', 4), (u'Donetsk', 2)]
[(u'Russia', 20), (u'Moscow', 5), (u'Kharkiv', 5), (u'Donetsk', 2), (u'Independence', 2), (u'Media', 1), (u'Brussels', 1)]
So obviously the corpora that NLTK uses are geared towards more contemporary situations than the worlds described by Arthur Conan Doyle. That’s interesting, and useful to know. I expect – though I haven’t looked yet – that one could train NLTK’s taggers and chunkers on a 19th-century corpus to get more useful results. Hmmm! A project for someone, perhaps in my #digh5000 class.
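To sketch what I mean (a minimal, hypothetical example: NLTK doesn’t ship a 19th-century tagged corpus, so load_my_victorian_corpus below stands in for whatever period corpus you’ve prepared in the usual [(word, tag), …] format):

import nltk

# Hypothetical helper: tagged sentences from a 19th-century corpus,
# in the same [(word, tag), ...] format as NLTK's tagged_sents() readers.
victorian_sents = load_my_victorian_corpus()

# Train a simple unigram tagger on the period corpus, with a fallback tag:
tagger = nltk.UnigramTagger(victorian_sents, backoff=nltk.DefaultTagger('NN'))

print tagger.tag(nltk.word_tokenize("Holmes took the train from Euston."))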