GutenbergPy
I have created a library for interfacing with Project Gutenberg from Python code. This is the first article about GutenbergPy.
Why use my library?
- Only needs lxml (and pymongo, but only if you use MongoDB)
- SQLite cache build time: about 2 minutes (instead of more than a day with the previous solution)
- SQLite cache size: about 120 MB
- MongoDB cache build time: about 3 minutes (will probably be less in the future, as it's not optimized yet)
- MongoDB cache size: about 300 MB (instead of 2 GB for the previous Berkeley DB solution)
- Fast queries on both solutions
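To give a rough idea of why a local SQLite cache makes metadata queries cheap, here is a small sketch using only Python's built-in sqlite3 module. The table layout, column names, sample rows and the helper function are all assumptions made up for this example; they are not GutenbergPy's actual schema or API.

```python
import sqlite3

# Hypothetical schema, for illustration only (not GutenbergPy's real tables).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, language TEXT);
    CREATE TABLE book_subjects (book_id INTEGER, subject TEXT);
    -- Indexes on the filtered columns keep lookups fast as the tables grow.
    CREATE INDEX idx_books_language ON books(language);
    CREATE INDEX idx_subjects_subject ON book_subjects(subject);
""")

# Made-up sample rows standing in for parsed catalog metadata.
conn.executemany("INSERT INTO books VALUES (?, ?, ?)", [
    (1, "Sample Novel", "en"),
    (2, "Roman d'exemple", "fr"),
    (3, "Sample Poems", "en"),
])
conn.executemany("INSERT INTO book_subjects VALUES (?, ?)", [
    (1, "Fiction"),
    (2, "Fiction"),
    (3, "Poetry"),
])


def books_by_language_and_subject(conn, language, subject):
    """Return (id, title) pairs for books matching both filters."""
    rows = conn.execute(
        """SELECT b.id, b.title
           FROM books b
           JOIN book_subjects s ON s.book_id = b.id
           WHERE b.language = ? AND s.subject = ?
           ORDER BY b.id""",
        (language, subject),
    )
    return rows.fetchall()


print(books_by_language_and_subject(conn, "en", "Fiction"))
```

Once the metadata sits in a local file like this, filtering by language and category is a single indexed query and never touches the Gutenberg servers.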
Just try it!
If your project needs Gutenberg meta information and filtered texts, delivered fast and reliably without overloading the Gutenberg servers, try GutenbergPy.
Check the usage by reading the GitHub page.
Install the pypi package:
pip install gutenbergpy
Or, if you are feeling adventurous, you can install from source:
git clone https://github.com/raduangelescu/gutenbergpy
cd gutenbergpy
python setup.py install
Why did you do it?
I decided to continue improving artisticmachine42's quote capabilities. As I said in a [previous article](https://raduangelescu.com/post/artisticmachine42/), I started building a simple reverse index for fast text searching, and that led me to write a library to help me scavenge big data from the Gutenberg site.
Writing something like a reverse index implies testing it on big data (in my case, a lot of books). My go-to data source for quote extraction was Gutenberg (great initiative, I am a big fan).
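For anyone who has not met the term: a reverse (inverted) index maps each word to the documents that contain it, so a search becomes a few set lookups instead of a scan over every text. Below is a minimal Python sketch of the idea. It is not the artisticmachine42 code; the function names and the toy "books" are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    # Start from the first word's documents and intersect with the rest.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Toy stand-ins for book texts (illustration only).
books = {
    1: "Call me Ishmael.",
    2: "It was the best of times, it was the worst of times.",
    3: "Call it the best of luck.",
}
index = build_index(books)
print(search(index, "best of"))
```

With only three short texts this proves little, which is exactly why a real test needs thousands of books.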
List of linked facts that made me do it :)
- Searched the web for a good Python library for Gutenberg and found: this
- Tried using the SQLite interface (great SQLite fan here :) ) -> after about 1 day + 1 night of running, it still hadn't populated the cache.
- Thought the SQLite cache feature was added later, so it must still be in development -> used the Berkeley DB cache -> took around 5 hours to populate (on a really good PC with a really good internet connection) and 2 GB of disk space.
- I then proceeded to query all the books from a certain category in a certain language -> it failed.
- Reported the bug to the developer.
- Read that "the maintainer wanted to no longer maintain the project" -> thought it would never get fixed unless I did something about it.
- Downloaded the repository and started debugging -> saw nasty stuff; the cache code needed to be rewritten (in my opinion).
- I decided to write my own and publish it for everyone to use -> help the open source community.
- Wrote it -> published.
- Published GutenbergPy -> the maintainer of the old library said he fixed the query bug :) (I haven't tested it yet)
What about the other promised articles?
I had some crunch time at work, so I didn't have time to write the other (bigger) articles for my site, but rest easy:
- The Gabor feature generation code is ready (I am writing a CUDA port). I didn't forget, the article is coming soon :)
- The reverse index code is also ready, so an article about that will be coming soon too :)