GutenbergPy


I have created a library for interfacing with Gutenberg from Python code. This is the first article about GutenbergPy.

Why use my library?

  • Only needs lxml (plus pymongo if you use MongoDB)
  • SQLite cache build time: about 2 minutes (instead of more than one day)
  • SQLite cache size: about 120 MB
  • MongoDB cache build time: about 3 minutes (will probably be less in the future, as it’s not optimized yet)
  • MongoDB cache size: about 300 MB (instead of the 2 GB the previous Berkeley DB solution needed)
  • Fast queries on both solutions

Just try it!

If you need Gutenberg meta information and filtered texts for your project, and you want them fast and reliably without overloading the Gutenberg servers, try GutenbergPy.

Check the usage by reading the GitHub page.

Install the pypi package:

        pip install gutenbergpy

Or if you are feeling adventurous you can install from source:

        git clone https://github.com/raduangelescu/gutenbergpy
        cd gutenbergpy
        python setup.py install

Why did you do it?

I decided to continue improving the quote capabilities of artisticmachine42. As I said in a [previous article](https://raduangelescu.com/post/artisticmachine42/), I started building a simple reverse index for fast text searching -> wrote a library to help me scavenge for big data from the Gutenberg site.

Writing something like a reverse index implies testing it on big data (in my case, a lot of books). My go-to data source for quote extraction was Gutenberg (great initiative, I am a big fan).
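To make "reverse index" concrete, here is a minimal sketch of the idea: map every word to the set of books that contain it, then answer a query by intersecting those sets. This is only an illustration with three toy "books", not the actual artisticmachine42 code; the function names (`tokenize`, `build_index`, `search`) are made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    # Start from the first word's documents and narrow down word by word.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Toy data standing in for real book texts (assumption for the example).
books = {
    1: "It was the best of times, it was the worst of times",
    2: "Call me Ishmael",
    3: "The best laid schemes of mice and men",
}
index = build_index(books)
print(search(index, "best of"))
```

On three sentences this is trivial; the interesting part (and the reason I needed a lot of books) is seeing how it behaves when the index holds thousands of full texts.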

List of linked facts that made me do it :)

  • Searched the web for a good Python library for Gutenberg and found: this

  • Tried using the SQLite interface (great SQLite fan here :) ) -> after about 1 day + 1 night of running, it still hadn’t populated the cache.

  • Thought the SQLite cache feature was added later, so it must still be in development -> used the Berkeley DB (bsddb) cache -> took around 5 hours to populate (on a really good PC with a really good internet connection) and 2 GB of HDD space

  • I then proceeded to query all the books that are from a certain category with a certain language -> it failed.

  • Reported bug to developer

  • Read “the maintainer wanted to no longer maintain the project” -> thought it would never get fixed unless I did something about it

  • Downloaded the repository and started debugging -> saw nasty stuff, the cache base needed to be rewritten (in my opinion)

  • I decided to write my own and publish it for everyone to use -> help the open source community

  • Wrote it -> published.

  • Published GutenbergPy -> maintainer of old library said he fixed the query bug :) (didn’t test it yet)
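For reference, the kind of query that failed for me (all books from a certain category in a certain language) looks roughly like this against an SQLite cache. This is a sketch only: the table and column names below are a hypothetical schema invented for the example, not the real GutenbergPy cache layout, and the rows are toy data.

```python
import sqlite3

# In-memory database with a hypothetical, simplified cache schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, language TEXT);
CREATE TABLE subjects (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE book_subjects (book_id INTEGER, subject_id INTEGER);
""")

# Toy rows; the ids are not real Gutenberg ids.
conn.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [(1, "Moby Dick", "en"), (2, "Les Misérables", "fr"), (3, "Treasure Island", "en")],
)
conn.executemany(
    "INSERT INTO subjects VALUES (?, ?)",
    [(1, "Adventure"), (2, "Historical fiction")],
)
conn.executemany(
    "INSERT INTO book_subjects VALUES (?, ?)",
    [(1, 1), (2, 2), (3, 1)],
)


def books_by_subject_and_language(conn, subject, language):
    """Return (id, title) pairs for books with the given subject and language."""
    cursor = conn.execute(
        """
        SELECT b.id, b.title
        FROM books b
        JOIN book_subjects bs ON bs.book_id = b.id
        JOIN subjects s ON s.id = bs.subject_id
        WHERE s.name = ? AND b.language = ?
        ORDER BY b.id
        """,
        (subject, language),
    )
    return cursor.fetchall()


print(books_by_subject_and_language(conn, "Adventure", "en"))
```

For the real table names and the query helpers the library exposes, check the GitHub page.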

What about the other promised articles?

I had some crunch time at work, so I didn’t have time to write the other (bigger) articles on my site, but rest easy:

  • The Gabor feature generation code is ready (I am writing a CUDA port). I didn’t forget, the article is coming soon :)

  • The reverse index code is also ready so an article about that will also be coming soon. :)