Six Degrees of Harry Potter – Using Talis Aspire ‘ISBN’ API

Posted on July 23, 2010

Code:
http://github.com/benosteen/SixDegreesOfHarryPotter

Requires:

Talis Aspire ‘ISBN’ API (http://prototype.talisaspire.com/isbn/)
(see http://twitter.com/robotrobot/status/19282383847)

A Redis instance running on localhost, with the redis-py Python library installed. (The scripts only use basic set
and value operations.)

To Run:

python prep_bottom_up.py

This will:

– parse the OpenLibrary JSON response for ‘books by J K Rowling’ and push the results into Redis via
the SixDegrees class.

Resultant Redis data structures:
Key ‘h’ -> Set of all Harry Potter Book ISBNs
Key ‘isbn:XXXXXXX’ -> Value, Title of ISBN XXXX

– Load the list of Potter ISBNs and pass them through the Talis ‘appears-with’ API call.

Resultant Redis structures:
Key ‘r1’ -> Set of all ‘appears-with’ ISBNs from the Harry Potter ISBNs

– build up the next level of ‘appears-with’ items from the previous set

Key ‘r{n}’ -> Set of all ‘appears-with’ ISBNs from the ‘r{n-1}’ set
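
The level-building step described above can be sketched with plain Python sets standing in for the Redis sets. This is a minimal sketch and not the code from prep_bottom_up.py: `build_levels` and the `appears_with` callable are hypothetical names, and a toy lookup table with made-up identifiers replaces real calls to the Talis API.

```python
def build_levels(seed_isbns, appears_with, max_level):
    """Return a dict mirroring the Redis key layout: 'h', 'r1', 'r2', ...

    appears_with is a hypothetical callable mapping one ISBN to an
    iterable of the ISBNs it appears with.
    """
    levels = {"h": set(seed_isbns)}
    previous = levels["h"]
    for n in range(1, max_level + 1):
        current = set()
        for isbn in previous:
            # Level n is the union of 'appears-with' results for level n-1.
            current.update(appears_with(isbn))
        levels["r%d" % n] = current
        previous = current
    return levels


# Toy data standing in for the 'appears-with' API (made-up identifiers).
TOY_GRAPH = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["D"],
    "D": ["E"],
}

levels = build_levels(["A"], lambda isbn: TOY_GRAPH.get(isbn, []), 3)
for key in ("r1", "r2", "r3"):
    print(key, sorted(levels[key]))
```

Note that a level can contain ISBNs already seen in earlier levels; nothing is filtered out at this stage.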

To Query:

Once you have run ‘prep_bottom_up.py’, you can then query your results from the Redis DB:

(Please see http://code.google.com/p/redis/wiki/CommandReference for the list of commands, and specifically the
commands you can run on sets, such as scard, smembers, sinter, etc.)

Either, via the redis-cli:

cd redis-XXXX
sixdegreesofpotter/redis-2.0.0-rc2$ ./redis-cli scard r1
(integer) 12
sixdegreesofpotter/redis-2.0.0-rc2$ ./redis-cli smembers r1
1. "0141439769"
2. "0749707119"
3. "0198124929"
4. "000711561X"
5. "0141323558"
6. "0393975428"
7. "0140366717"
8. "0590139614"
9. "0439994926"
10. "0140350047"
11. "0439999464"
12. "0141322624"

or via python:

$ python
>>> from sixdegrees import SixDegrees
>>> s = SixDegrees()
>>> s.r
<redis.client.Redis object at 0x9ff3e14>
>>> s.get_related("4")
set(['0709910932', '0521390869', '0571089062', '0393926362', '0631197575', '0710000995', '0241134730', '000711561X',
'0375751513', '1405832827', '0465017185', '019953702X', '0192829556', '0415069378', '1907439021', '0140434275',
'0439994926', '0140366717', '0860915700', '0198710429', '1853812773', '0199535639', '0140434003', '0416924506',
'0747532745', '069107819X', '041506936X', '0749707119', '0333487087', '0631234357', '0140350047', '0590139614',
'0439999464', '0192833650', '0195073886', '0141323558', '0099511487', '0198124929', '074630725X', '0141322624',
'0133555615', '0199537240', '0803235615', '0198710410', '0520048024', '0141439769', '0631234365', '0748601384',
'0393975428', '0192835203', '0674211014', '0198185065'])

Preliminary Results:

Out of 559 ISBNs found from the OpenLibrary query (including all sorts of Harry Potter editions, translations, calendars, etc) there was a single hit in the Talis Aspire ‘Appears With’ API!

ISBN 0747532745 -> Title: Harry Potter and the Philosopher’s Stone (Harry Potter)

So, one hit out of a possible 559… not great, but something at least.

Question: What’s the distribution of sheer numbers of related ISBNs then?

At the 7th level, there were a mere 182 different ISBNs 😦 I guess the ‘six degrees’ proposition cannot be true! The Talis Aspire customer institutions must not rate the work of J K Rowling!

Question: How many ‘new’ ISBNs were there at each level?

In other words, this data could be a little island of related ISBNs, which only appeared with each other, or it could be that each level is divergent and new compared to the last.

NB: this is calculated by taking the difference between the set of ISBNs in a given level and all the sets in the levels underneath it:

E.g.

sixdegreesofpotter/redis-2.0.0-rc2$ ./redis-cli sdiff r3 r2 r1
1. "0710000995"
2. "0133555615"
3. "0416924506"
....
9. "0192829556"
10. "0140434003"
sixdegreesofpotter/redis-2.0.0-rc2$ ./redis-cli sdiff r4 r3 r2 r1
1. "0803235615"
2. "0860915700"
3. "0709910932"
4. "074630725X"
...
11. "0198185065"
12. "0415069378"
13. "1853812773"
14. "069107819X"
15. "0241134730"
16. "0520048024"

It does look like it begins to break out of a cluster at around the 5th level – upwards and onwards perhaps!
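
The per-level calculation above can also be sketched in Python. This is only an illustration: `new_at_level` is a hypothetical helper, and the `levels` dict of plain sets (with made-up identifiers) stands in for the ‘r{n}’ keys in Redis. It mirrors what `sdiff r3 r2 r1` does, which is to start from the first set and subtract each of the others.

```python
def new_at_level(levels, n):
    """ISBNs in level n that appear in none of levels 1..n-1."""
    seen_before = set()
    for k in range(1, n):
        seen_before |= levels["r%d" % k]
    return levels["r%d" % n] - seen_before


# Made-up data standing in for the Redis sets r1, r2 and r3.
levels = {
    "r1": {"B", "C"},
    "r2": {"A", "D"},
    "r3": {"B", "C", "E"},
}

for n in (1, 2, 3):
    print(n, sorted(new_at_level(levels, n)))
```

A level whose ‘new’ set is empty would suggest the data is a closed island; a level with many new ISBNs suggests it is still spreading outwards.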

Question: What titles and subjects do these books have?

By using the Open Library simple API, can we look up the title and subjects for a given work and see if there are any trends?

(The script analysis.py handles the lookup and storage of titles and subjects)
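
As a rough illustration of that kind of lookup (not the code in analysis.py), the sketch below builds a request URL for the Open Library Books API and pulls the title and subject names out of a response. It assumes the `jscmd=data` response shape, where each `ISBN:...` key maps to a record with a `title` and a list of `subjects` entries that each carry a `name`. The helper names and the sample record are made up for the example, and no network request is made.

```python
import json
from urllib.parse import urlencode


def books_api_url(isbns):
    """Build a Books API URL for a list of ISBNs (assumed parameter names)."""
    params = {
        "bibkeys": ",".join("ISBN:%s" % isbn for isbn in isbns),
        "format": "json",
        "jscmd": "data",
    }
    return "https://openlibrary.org/api/books?" + urlencode(params)


def titles_and_subjects(response_text):
    """Map each ISBN in a response to (title, [subject names])."""
    data = json.loads(response_text)
    result = {}
    for bibkey, record in data.items():
        isbn = bibkey.split(":", 1)[-1]
        subjects = [s["name"] for s in record.get("subjects", []) if s.get("name")]
        result[isbn] = (record.get("title"), subjects)
    return result


# Made-up sample response in the assumed shape, so the sketch runs offline.
SAMPLE = json.dumps({
    "ISBN:0000000000": {
        "title": "Example Title",
        "subjects": [{"name": "Fantasy"}, {"name": "Orphans"}],
    }
})

print(books_api_url(["0747532745"]))
print(titles_and_subjects(SAMPLE))
```

The titles and subjects returned this way could then be stored under per-ISBN keys in Redis, in the same spirit as the ‘isbn:XXXXXXX’ title keys.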

Titles of the first ‘appears with’ level:

  • ‘The Secret Garden’
  • ‘The Railway Children (Puffin Classics)’
  • ‘Northern Lights (His Dark Materials)’
  • ‘The lion, the witch and the wardrobe’
  • ‘David Copperfield’
  • ‘The BFG’
  • ‘Jane Eyre’

No real surprise there, perhaps. And the subjects of these?

  • Brontë, Charlotte, — 1816-1855
  • Young men — Fiction
  • Fantasy
  • Boys — Fiction.
  • Fiction
  • Stepfathers — Fiction.
  • England — Fiction
  • Orphans — Fiction.
  • Child labor — Fiction.
  • Mentally ill women — Fiction
  • Governesses — Fiction

Some of the titles in the later levels are a little less juvenile in nature!

  • The History of Sexuality
  • GAY REPUBLIC: SEXUALITY, CITIZENSHIP AND SUBVERSION IN FRANCE
  • Three Jacobean Witchcraft Plays
  • Organic Memory
  • Oedipus and the Devil: Witchcraft, Religion and Sexuality in Early Modern Europe

Wordle fun: All the titles of the ‘appears with’ books:

How about a Wordle of all the subjects from all the levels?

Next up, I’ll be pushing all the data from this into some google spreadsheets for other people to play with 🙂
