GutenbergPy

Radu Angelescu

2017-2-20


I have created a library for interfacing with Gutenberg from Python code. This is the first article about GutenbergPy.

Why use my library?

Only needs lxml (and pymongo, but only if you use MongoDB)
SQLite cache build time: about 2 minutes (instead of more than one day)
SQLite cache size: about 120 MB
MongoDB cache build time: about 3 minutes (will probably be less in the future, as it’s not optimized yet)
MongoDB cache size: about 300 MB (instead of the 2 GB Berkeley DB cache of the previous solution)
Fast queries on both solutions

Just try it!

If your project needs Gutenberg meta information and filtered texts, fast and reliably, without overloading the Gutenberg servers, try GutenbergPy.

For usage, see the GitHub page.

Install the PyPI package:

        pip install gutenbergpy

Or if you are feeling adventurous you can install from source:

        git clone https://github.com/raduangelescu/gutenbergpy
        cd gutenbergpy
        python setup.py install

Why did you do it?

As I said in a [previous article](https://raduangelescu.com/post/artisticmachine42/), I decided to continue improving the quote capabilities of artisticmachine42. I started on a simple reverse index for fast text searching, which led me to write a library to help me gather large amounts of data from the Gutenberg site.

Writing something like a reverse index implies testing it on big data (in my case, a lot of books). My go-to data source for quote extraction was Gutenberg (great initiative, I am a big fan).
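For readers unfamiliar with the term, a reverse (inverted) index maps each word to the documents that contain it, so a search becomes a set intersection instead of a scan over every text. The following is only a minimal sketch of the general idea, not the code used in artisticmachine42; the function names and the three sample sentences are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    # Intersect the per-word id sets; an unknown word gives an empty result.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Illustrative stand-ins for book texts (hypothetical sample data).
books = [
    "Call me Ishmael.",
    "It was the best of times, it was the worst of times.",
    "It is a truth universally acknowledged.",
]
index = build_index(books)
print(search(index, "it was"))
```

With only three sentences the index is trivial, which is exactly why testing something like this needs a large corpus such as the Gutenberg books.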

List of linked facts that made me do it :)

Searched the web for a good Python library for Gutenberg and found: this
Tried using the SQLite interface (great SQLite fan here :) ) -> after about 1 day + 1 night of running, it still hadn’t populated the cache.
Thought the SQLite cache feature was added later so it must still be in development -> used the Berkeley DB cache instead -> took around 5 hours to populate (on a really good PC with a really good internet connection) and 2 GB of disk space
I then proceeded to query all the books that are from a certain category with a certain language -> it failed.
Reported bug to developer
Read that the maintainer no longer wanted to maintain the project -> thought it would never get fixed unless I did something about it
Downloaded the repository and started debugging -> saw nasty stuff, the cache base needed to be rewritten (in my opinion)
Decided to write my own and publish it for everyone to use -> help the open source community
Wrote it -> published.
Published GutenbergPy -> maintainer of the old library said he fixed the query bug :) (I haven’t tested it yet)
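The query mentioned in the list above (all books from a certain category in a certain language) is the kind of lookup a relational metadata cache handles well. Below is a minimal sketch using Python's built-in sqlite3 module. The table layout, the index names, and the sample rows are all hypothetical and chosen only for illustration; this is not GutenbergPy's actual schema or API.

```python
import sqlite3

# Hypothetical schema for illustration only; NOT GutenbergPy's real tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, language TEXT);
    CREATE TABLE subjects (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE book_subjects (book_id INTEGER, subject_id INTEGER);
    CREATE INDEX idx_books_language ON books(language);
    CREATE INDEX idx_book_subjects ON book_subjects(subject_id, book_id);
""")

# Made-up sample rows standing in for the real metadata.
conn.executemany("INSERT INTO books VALUES (?, ?, ?)", [
    (1, "Moby Dick", "en"),
    (2, "Les Misérables", "fr"),
    (3, "Treasure Island", "en"),
])
conn.executemany("INSERT INTO subjects VALUES (?, ?)", [
    (1, "Sea stories"),
    (2, "Historical fiction"),
])
conn.executemany("INSERT INTO book_subjects VALUES (?, ?)", [
    (1, 1), (3, 1), (2, 2),
])


def books_by_subject_and_language(conn, subject, language):
    """Return the titles of books with the given subject and language."""
    rows = conn.execute(
        """
        SELECT b.title
        FROM books AS b
        JOIN book_subjects AS bs ON bs.book_id = b.id
        JOIN subjects AS s ON s.id = bs.subject_id
        WHERE s.name = ? AND b.language = ?
        ORDER BY b.title
        """,
        (subject, language),
    ).fetchall()
    return [title for (title,) in rows]


titles = books_by_subject_and_language(conn, "Sea stories", "en")
print(titles)
```

The indexes on language and on the subject link table are what keep a lookup like this fast once the cache holds tens of thousands of books.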

What about the other promised articles?

I had some crunch time at work so I didn’t have the time to write the other (bigger) articles on my site, but rest easy:

The Gabor feature generation code is ready (I am writing a CUDA port). I didn’t forget; the article is coming soon :)
The reverse index code is also ready so an article about that will also be coming soon. :)