Printing an e-book for binding

Tools: pdftk, pdfjam.

Rearrange pages from the source file:

pdftk latin-out.pdf cat 1 1-317 1 1 output latin-fixed.pdf

Split the source file:

pdftk latin-fixed.pdf cat 1-16 output 016.pdf
...
pdftk latin-fixed.pdf cat 305-320 output 320.pdf
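The split points follow a simple arithmetic pattern, so the commands can be generated instead of typed out. A sketch, assuming a 320-page file and 16-page signatures (the helper name is my own):

```python
# Generate the pdftk split commands for a fixed-size file in 16-page chunks.
# Assumes the page count is an exact multiple of the signature size.
PAGES = 320
SIG = 16

def split_commands(source, pages=PAGES, sig=SIG):
    cmds = []
    for end in range(sig, pages + 1, sig):
        start = end - sig + 1
        # name each chunk after its last page, zero-padded to three digits
        cmds.append("pdftk %s cat %d-%d output %03d.pdf" % (source, start, end, end))
    return cmds

for cmd in split_commands("latin-fixed.pdf"):
    print(cmd)
```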

Fuck, a page was duplicated in the source. Find another file which has it and blend it in:

pdftk A=latin-fixed.pdf B=anthologyoflatin00tyrrrich_bw.pdf cat A145 B154 A147-160 output 160-new.pdf

Generate some commands in python for easy copy-pasting.

for x in range(16, 324, 16): print "pdfbook --short-edge --signature '16' %03d.pdf" % (x,)

Run the commands:

pdfbook --short-edge --signature '16' 016.pdf
...
pdfbook --short-edge --signature '16' 320.pdf

command line grepping

Finding certain PIDs in netstat:

netstat -lp --numeric-ports | grep -E "(9554|9512|9646|9664|17857)"
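If the PID list changes often, the alternation can be generated rather than hand-edited. A quick sketch (the helper name and PID list are my own):

```python
# Build the grep -E alternation from a list of PIDs.
pids = [9554, 9512, 9646, 9664, 17857]

def pid_pattern(pids):
    return "(" + "|".join(str(p) for p in pids) + ")"

print("netstat -lp --numeric-ports | grep -E '%s'" % pid_pattern(pids))
```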

Building a personal IMDB

I wanted to do some serious searching of IMDB so I could prettify some movie lists. There's no public API for IMDB, and although one guy built a web interface, he closed it down due to licensing concerns. Luckily, IMDB still offers its data files and some tools for searching them. Unluckily, the mkdb script gave me a segmentation fault whenever I tried to import a list, so I decided to go it alone.

First, I made a few database tables in MySQL:

create table movies (
    id int not null auto_increment primary key,
    title varchar(255) not null,
    year year not null
);
create table directors (
    id int not null auto_increment primary key,
    name varchar(255) not null
);
create table movies_directors (
    id int not null auto_increment primary key,
    director_id int not null,
    movie_id int not null
);
create table akatitles (
    id int not null auto_increment primary key,
    movie_id int not null,
    title varchar(255) not null,
    is_english boolean not null default false
);
create table movies_search (
    movie_id int not null primary key,
    search_field text,
    fulltext(search_field)
);

The IMDB movie file is a tab separated list of movies and years, like so:

Notorious (1946)	1946
Notorious (1992) (TV)	1992
Notorious (1992) (V)	1992
...
Number 13 (2006) (TV)	2006-????
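For illustration, one of these rows can be pulled apart with a regular expression. The pattern here is my reading of the format (title, a run of tabs, then a year or year range), not an official spec:

```python
import re

# title field ends at the first run of tabs; the year (or year range) follows
line_patt = re.compile(r"^(.+?)\t+(\S+)$")

def parse_movie_line(line):
    m = line_patt.match(line.rstrip("\n"))
    if m is None:
        return None
    return m.group(1), m.group(2)

print(parse_movie_line("Notorious (1946)\t1946"))
```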

In order to simplify the import, I removed the header and footer from the file and deleted a bunch of data: all direct-to-video movies (V) and all television episodes (TV). I also deleted everything with a quote ('"') at the beginning of the title or with a year range (1995-1996), on the assumption that these were probably TV series, and everything with an unknown date (????). In vim:

:g/(V)/d
:g/(TV)/d
:g/^"/d
:g/\t[0-9][0-9][0-9][0-9]-/d
:g/\t????/d
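The same deletion rules can be restated in Python, which makes them easy to test. This just mirrors the vim commands above; it isn't an extra step in the import:

```python
import re

year_range = re.compile(r"\t[0-9]{4}-")

def keep(line):
    """Mirror the vim :g/.../d deletions: drop videos, TV entries,
    quoted (series) titles, year ranges, and unknown dates."""
    if "(V)" in line or "(TV)" in line:
        return False
    if line.startswith('"'):
        return False
    if year_range.search(line):
        return False
    if "\t????" in line:
        return False
    return True

lines = ['Notorious (1946)\t1946', 'Notorious (1992) (TV)\t1992', '"Lost" (2004)\t2004']
print([l for l in lines if keep(l)])
```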

I then removed the extra tabs, so only a single tab was left between the title and the year.


And finally to each row I added unique IDs, based on the line number, followed by a tab:

:%s/^/\=line('.') . "\t"/g

With the movie list ready, it could be imported into mysql:

mysqlimport --fields-terminated-by="\t" --lines-terminated-by="\n" --user=kellen --password movies /home/kellen/IMDB/movies.list

The directors list presented a different problem: it was broken up into sections by director, each followed by a blank line. The associated movies were tab separated in the same section, each on a new line, except the first, which followed after the director name. Like so:

Herzog, Robert M.	Flights (2002)

Herzog, Rudolph		Abora - Letzte Position Atlantik (2008)
	Der Ball ist ein Sauhund (1999)
	Heil Hitler, das Schwein ist tot! - Humor unterm Hakenkreuz (2006) (TV)
	"The Heist" (2004)

Herzog, Werner		Aguirre, der Zorn Gottes (1972)
	Auch Zwerge haben klein angefangen (1970)
	Ballade vom kleinen Soldaten (1984) (TV)
	Behinderte Zukunft? (1971) (TV)
	Cerro Torre: Schrei aus Stein (1991)
	...
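For illustration, this section format can be walked directly in Python. A standalone sketch of splitting the file into (director, movies) pairs; the sample data is abbreviated from the listing above:

```python
def parse_directors(lines):
    """Yield (director, [titles]) from the blank-line-delimited format:
    director name and first film on one line, further films tab-indented."""
    director, movies = None, []
    for line in lines:
        stripped = line.rstrip("\n")
        if not stripped.strip():
            if director is not None:
                yield director, movies
            director, movies = None, []
        elif stripped.startswith("\t"):
            movies.append(stripped.strip())
        else:
            director, first = stripped.split("\t", 1)
            movies = [first.strip()]
    if director is not None:
        yield director, movies

sample = ["Herzog, Robert M.\tFlights (2002)\n", "\n",
          "Herzog, Werner\tAguirre, der Zorn Gottes (1972)\n",
          "\tAuch Zwerge haben klein angefangen (1970)\n", "\n"]
print(list(parse_directors(sample)))
```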

It seemed sensible to take a two-step approach and load up all the directors first, then join them with the movies as a second step, so I went about removing the blank lines, lines which began with whitespace, and anything following a tab (the first film for each director). I then added line numbers, as for the movies list:

:g/^$/d
:g/^\s/d
:%s/\t.*$//g
:%s/^/\=line('.') . "\t"/g

To speed up the processing, I also removed the lines for videos or TV shows or titles beginning with '"', as long as they were not also the line carrying the director's name:

:g/^\t.*(V)/d
:g/^\t.*(TV)/d
:g/^\t\+"/d

This was followed by a similar import. Note that "directors.list" is a modified version of the original directors.list which comes from IMDB.

mysqlimport --fields-terminated-by="\t" --lines-terminated-by="\n" --user=root --password movies /home/kellen/IMDB/directors.list

I then needed to iterate over each director and associate them with each of their films. I realized here that I could just as well have added the directors in the same step. Oh well. I wrote a python script to do this. Note that "directors_movies.list" contains the original directors.list file from IMDB.

#!/usr/bin/python
import MySQLdb
import sys

p = sys.stdout.write
fl = sys.stdout.flush

def insert_directors():
    f = open('directors_movies.list')
    conn = MySQLdb.connect(host="localhost", user="USER", passwd="PASS", db="movies")
    cursor = conn.cursor()
    director = ""
    movies = []
    ln = 0
    try:
        for line in f:
            ln += 1
            if len(line.strip()) == 0:
                if len(movies) > 0:
                    cursor.execute("SELECT id FROM directors WHERE name = %s", (director,))
                    row = cursor.fetchone()
                    if row is not None and row[0] > 0:
                        dir_id = row[0]
                        for movie in movies:
                            cursor.execute("SELECT id FROM movies WHERE title = %s", (movie,))
                            m_row = cursor.fetchone()
                            if m_row is not None and m_row[0] > 0:
                                movie_id = m_row[0]
                                # movie found, so insert the relationship
                                cursor.execute("INSERT INTO movies_directors (movie_id, director_id) VALUES (%s, %s)",
                                               (movie_id, dir_id))
                                if cursor.rowcount > 0:
                                    p("#")
                                    fl()
                                else:
                                    p("x")
                                    fl()
                            else:
                                # no movie, print a '.'
                                p(".")
                                fl()
                    else:
                        print "No director in the DB for %s" % director
                else:
                    print "No movies to insert for %s" % director
                director = ""
                movies = []
            elif line.startswith("\t"):
                # it's a movie row, so add it to the current director
                movies.append(line.strip())
            else:
                # a new director; parse the name and the first movie
                director, title = line.split('\t', 1)
                movies.append(title.strip())
    except ValueError:
        print "on line ", ln, ": ***", line, "***"
        raise
    cursor.close()
    conn.close()
    f.close()

insert_directors()

Before I ran the insert script, which does a lot of lookups, I created some indexes on the existing tables:

create index movie_titles on movies (title);
create index dir_name on directors (name);
create index mov_dir on movies_directors (movie_id, director_id);

Running the script:

> ./
###.####.#######.......#........#.#..#######...#..#..#.########..##.##############..#######.##..####.
...

Great! Now there's a table of movies and a table of directors. We can do things like:

mysql> select d.name as director, m.title, m.year from
    -> directors d, movies_directors md, movies m where
    -> d.name like 'Belvaux, R%' and md.director_id = d.id and md.movie_id = m.id;
+---------------+---------------------------------------+------+
| director      | title                                 | year |
+---------------+---------------------------------------+------+
| Belvaux, Rémy | C'est arrivé près de chez vous (1992) | 1992 |
| Belvaux, Rémy | Pas de C4 pour Daniel Daniel (1987)   | 1987 |
+---------------+---------------------------------------+------+
2 rows in set (0.09 sec)

This is great, but what I'd really like to be able to do is throw in a bunch of keywords and get back a relevance-ordered list of nicely formatted titles. In English. So I need two more things: first, another table with the movies' "aka" names, and second, a fulltext index with all of the relevant info (title, alternative titles, year, director names).

The alternative names have to be imported from yet another file, "aka-titles.list". And because of the formatting, and because they need to get linked by ID, it's another python script. The script also creates all the text needed for the fulltext index (title, director names, plus alternative titles).

#!/usr/bin/python
import MySQLdb
import sys
import re

p = sys.stdout.write
fl = sys.stdout.flush

def insert_alt():
    f = open('aka-titles.list')
    conn = MySQLdb.connect(host="localhost", user="USER", passwd="PASS", db="movies")
    cursor = conn.cursor()
    title = ""
    movies = []
    ln = 0
    english_patt = re.compile("(english|\(UK\)|\(USA\))", re.I)
    clean_patt = re.compile("\(aka (.+?) \(")
    try:
        for line in f:
            ln += 1
            if len(line.strip()) == 0:
                if len(movies) > 0:
                    cursor.execute("SELECT id FROM movies WHERE title = %s", (title,))
                    row = cursor.fetchone()
                    if row is not None and row[0] > 0:
                        movie_id = row[0]
                        alt_titles = ""
                        for movie in movies:
                            is_english = 0 if english_patt.search(movie) is None else 1
                            m = clean_patt.search(movie)
                            if m is not None:
                                cleaned_title = m.group(1)
                                alt_titles += " " + cleaned_title
                                cursor.execute("INSERT INTO akatitles (movie_id, title, is_english) VALUES (%s, %s, %s)",
                                               (movie_id, cleaned_title, is_english))
                                p("#") if cursor.rowcount > 0 else p("x")
                            else:
                                p("z")
                        # create the search field: title plus alternate titles plus director names
                        searchfield = title + alt_titles
                        cursor.execute("SELECT d.name FROM directors d, movies_directors md WHERE md.movie_id = %s and md.director_id = d.id",
                                       (movie_id,))
                        dir_rows = cursor.fetchall()
                        for dir_row in dir_rows:
                            searchfield += " " + dir_row[0]
                        cursor.execute("INSERT INTO movies_search (movie_id, search_field) VALUES (%s, %s) on duplicate key update search_field = VALUES(search_field)",
                                       (movie_id, searchfield))
                        p("+") if cursor.rowcount > 0 else p("_")
                    else:
                        # main movie not in the DB, skipping
                        p(".")
                else:
                    print "No movies to insert for %s" % title
                fl()
                title = ""
                movies = []
            elif line.startswith(" "):
                # it's an alternate-title row, so add it to the list
                movies.append(line.strip())
            else:
                # a new primary title
                title = line.strip()
    except ValueError:
        print "on line ", ln, ": ***", line, "***"
        raise
    cursor.close()
    conn.close()
    f.close()

insert_alt()
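The two regular expressions in the script are worth exercising in isolation. The sample lines here are my own illustration of the aka-titles format, not copies from the real file:

```python
import re

english_patt = re.compile(r"(english|\(UK\)|\(USA\))", re.I)
clean_patt = re.compile(r"\(aka (.+?) \(")

def clean_aka(line):
    """Extract the bare alternate title and an is-English flag
    from a line like '   (aka Title (1992)) (USA)'."""
    m = clean_patt.search(line)
    if m is None:
        return None
    return m.group(1), english_patt.search(line) is not None

print(clean_aka("   (aka Man Bites Dog (1992)) (USA)"))
```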

Great, now we can do:

mysql> select distinct m.* from movies m, akatitles a where
    -> a.movie_id = m.id and a.title like '%Man Bites Dog%';
+-------+---------------------------------------+------+
| id    | title                                 | year |
+-------+---------------------------------------+------+
| 62985 | C'est arrivé près de chez vous (1992) | 1992 |
+-------+---------------------------------------+------+
1 row in set (0.17 sec)

After doing this, I realized that building the index in this way would only give a search value for movies which had alternate titles, which not all have. So I had to re-create the movies_search table using the data in the DB. First create some temporary tables:

create temporary table alttitles (movie_id int not null primary key, titles text);
create temporary table dirnames (movie_id int not null primary key, names text);

Fill them with data:

insert into alttitles (movie_id, titles)
    select a.movie_id, group_concat(a.title separator " ") as titles
    from akatitles a group by movie_id;
insert into dirnames (movie_id, names)
    select movie_id, group_concat(d.name separator " ") as names
    from movies_directors md, directors d
    where director_id = d.id group by movie_id;

Index them:

create index alt_idx on alttitles (movie_id);
create index dir_idx on dirnames (movie_id);

Build the index in 4 steps:

-- with both director names and alternate titles
insert into movies_search (movie_id, search_field)
    select m.id as movie_id, concat(m.title, " ", dirnames.names, " ", alttitles.titles) as search_field
    from movies m, alttitles, dirnames
    where m.id = alttitles.movie_id and m.id = dirnames.movie_id;

-- with just director names
insert into movies_search (movie_id, search_field)
    select m.id as movie_id, concat(m.title, " ", dirnames.names) as search_field
    from movies m, dirnames
    where m.id = dirnames.movie_id
    and not exists (select movie_id from alttitles where movie_id = m.id);

-- with just alternate titles
insert into movies_search (movie_id, search_field)
    select m.id as movie_id, concat(m.title, " ", alttitles.titles) as search_field
    from movies m, alttitles
    where m.id = alttitles.movie_id
    and not exists (select movie_id from dirnames where movie_id = m.id);

-- without director names or alternate titles
insert into movies_search (movie_id, search_field)
    select m.id as movie_id, m.title as search_field
    from movies m
    where not exists (select movie_id from alttitles where movie_id = m.id)
    and not exists (select movie_id from dirnames where movie_id = m.id);

And finally verify that it all worked:

mysql> select count(*) from movies;
+----------+
| count(*) |
+----------+
|   491858 |
+----------+
1 row in set (0.00 sec)

mysql> select count(*) from movies_search;
+----------+
| count(*) |
+----------+
|   491858 |
+----------+
1 row in set (0.00 sec)

After this, it is possible to do fulltext searches on directory or file names from download services and to get reasonable results. For example:

mysql> select m.*, match(search_field)
    -> against('Psycho.1960.560x304.25fps.800kbs.AC3.MultiSub.WunSeeDee') AS score
    -> from movies_search ms, movies m where ms.movie_id = m.id
    -> order by score desc limit 3;
+--------+------------------+------+-----------------+
| id     | title            | year | score           |
+--------+------------------+------+-----------------+
| 323689 | Psycho (1960)    | 1960 | 18.416343688965 |
| 323703 | Psycho II (1983) | 1983 | 14.394601821899 |
| 323690 | Psycho (1998)    | 1998 | 12.881115913391 |
+--------+------------------+------+-----------------+
3 rows in set (0.56 sec)
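Scores tend to improve if the release-name dots are turned back into spaces before querying. A trivial normalizer, my own addition rather than part of the import scripts:

```python
import re

def normalize_name(name):
    """Turn a release-style file name into space-separated search terms."""
    name = re.sub(r"\.(avi|mkv|mp4)$", "", name, flags=re.I)  # drop a known extension
    return re.sub(r"[._]+", " ", name).strip()

print(normalize_name("Psycho.1960.560x304.25fps.800kbs.AC3.MultiSub.WunSeeDee"))
```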

So next: a script to look up titles and format them according to my silly formatting rules.

vim fanciness

Removing empty lines

:g/^$/d

Adding line numbers

:%s/^/\=line('.') . "\t"/g

words swedish doesn't have

Puppet - translates to docka or handdocka or marionett. none of these are satisfying

Installing django-threaded-multihost

I'm interested in serving multiple sites from a single django installation. The sites will be exactly the same, though with different db-driven content, so it seems unnecessary to have multiple django settings files and therefore distinct apache configurations. This is usually called "multihost." An early implementation of this idea is django-multihost. django-multihost is good, but it doesn't use the sites framework which can break some of the contrib applications.

Another implementation is django-threaded-multihost, which has been factored out of Satchmo. It does respect and use the sites framework, but it has no install documentation. Since it was in Satchmo, though, there's code using it which can serve as an example, hooray.

First, get the package and do the basic install:

hg clone ...
sudo python setup.py install

Next, add the middleware. The satchmo install docs have this as the last entry in MIDDLEWARE_CLASSES before the satchmo-specific middleware.

MIDDLEWARE_CLASSES = (
    ...
    "threaded_multihost.middleware.ThreadLocalMiddleware",
)

django-threaded-multihost has a custom implementation of get_current() for the Site manager. This is the component which allows a single django installation to "easily" distinguish between multiple sites, by freeing the django instance from reliance upon the settings.SITE_ID setting. In order to use this, the threaded_multihost.multihost_patch must be imported somewhere in your installation. The easiest place is probably in an __init__ file for the primary application on your site. So do:

from threaded_multihost import multihost_patch

And that's it. Any further calls to Site.objects.get_current() will return the appropriate site object depending upon the request URL, or fall back on settings.SITE_ID.
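The behavior can be modeled without django at all. A toy sketch of host-based lookup with a SITE_ID fallback; the names and data here are illustrative, not the package's API:

```python
# Toy model of the patched get_current(): look the site up by the
# request's host, falling back to the configured SITE_ID default.
SITE_ID = 1
SITES = {1: "", 2: ""}
SITES_BY_HOST = dict((host, sid) for sid, host in SITES.items())

def get_current(request_host=None):
    if request_host in SITES_BY_HOST:
        return SITES_BY_HOST[request_host]
    return SITE_ID  # no match (or no request): fall back to settings.SITE_ID

print(get_current(""))
print(get_current("unknown.example.net"))
```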

Except that keyedcache doesn't work. So install keyedcache from bitbucket, then set CACHE_PREFIX = 'whatever' and one other setting.

Revision numbers in Subversion, Mercurial and Git

Recently, there seems to be a proliferation of revision control systems. In the django community, many started out using google code, which offers subversion, but now many projects are found in git or mercurial repositories.

Knowing the revision number is helpful in giving unique names to checked-out versions of django applications. And here's how to do that in various systems.

With subversion it's easiest to just get the revision number from the output of checkout. For example, if we want to check out dregni:

$ svn checkout dregni-read-only
...
Checked out revision 15.

Alternatively, the info command will output the revision number:

$ svn info | grep Revision
Revision: 15

With mercurial, the latest revision can be found with the identify command.

$ hg identify -n
221

The revision number can be used to uniquely identify the version of the checked-out application. Note that this is only true if we're pulling from a centralized repository! Revision numbers can differ between developers (that's why there's a hash alongside the revision number), so if you're pulling from an independent developer you should use the full changeset hash.

Finding the revision in git is pretty much the same as in mercurial:

$ git log | head -1
commit 8952259e8d07b1e6a98897cc039853055a5d8f88

There's no helpful tiny revision number, however, so we have to use the entire commit hash instead.

But what about CVS, you say? Well, while CVS has a concept of revision numbers, these apply only to individual files and not to the entire source tree. If you do a cvs checkout you'll get the highest available version of each file (one file might be at version 1.1 while another is at 1.8). As a result, if you're checking out from HEAD, the most unique version information you have is the datetime. date can print the datetime in a CVS-compatible format:

$ date +'%F %T %Z'
2009-06-11 04:04:16 CEST

But this doesn't work so well for directory naming. Using something like %F-%H.%M.%S-%Z might be better.
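The same directory-safe stamp can be produced from Python if that's more convenient in a build script. A sketch; the format string mirrors the date flags above:

```python
import time

def checkout_stamp(t=None):
    """Directory-safe datetime, e.g. 2009-06-11-04.04.16-CEST."""
    return time.strftime("%Y-%m-%d-%H.%M.%S-%Z", time.localtime(t))

print(checkout_stamp())
```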

Alternatively, if the CVS repository has a useful tag which can be checked out with cvs checkout -r tagname, then this can be used in the same way as a revision number from subversion, etc.

django, django-tagging from debian repositories

The most recent python-django and python-django-tagging packages for debian are in the experimental repositories, but one probably doesn't want to pull all packages from experimental. APT pinning can be used to prioritize certain repositories for certain packages.

First, the repository should be added to /etc/apt/sources.list

deb experimental main contrib non-free

Then in /etc/apt/preferences, each release needs to be given a priority, and certain packages should be pinned to certain releases. In particular, the django-related packages are pinned to experimental, and python-support is pinned to unstable (the django packages require a higher python-support version than is in stable).

Package: *
Pin: release a=stable
Pin-Priority: 900

Package: *
Pin: release a=unstable
Pin-Priority: 800

Package: python-support
Pin: release a=unstable
Pin-Priority: 1001

Package: python-django python-django-tagging
Pin: release a=experimental
Pin-Priority: 1001

Package: *
Pin: release a=experimental
Pin-Priority: 700

As noted in the previous entry, I've stopped using the debian repositories for django packages, but this is how it's done, for reference.

mod_wsgi and virtualenv for multiple django versions

I'm converting a mod_python environment to a mod_wsgi one. I was using the debian packages for both django-tagging and django itself. This has proved to be unacceptable both for upgrades and for keeping sites running. I don't want an apt-get dist-upgrade to break all my django sites (as it did yesterday).

One solution to this is to use virtualenv to keep separate python environments for each site, and then upgrade them individually as necessary. It wasn't clear how mod_python would work with virtualenv (I tried), so I moved to mod_wsgi, which is also reputed to be more efficient. To begin, install both virtualenv and mod_wsgi:

apt-get install python-virtualenv libapache2-mod-wsgi

Next, set up virtualenv. In the project directory (described previously), run:

virtualenv --no-site-packages --unzip-setuptools ENV

Next, grab django and whatever other packages you might want. I move these to a directory with the SVN revision after the package name.

cd ~/libs/
svn co django
mv django django-10645
svn checkout tagging
mv tagging tagging-154

Symlink the packages in to the virtual environment.

cd ~/libs/django-10645
ln -s `pwd`/django /home/username/blog/ENV/lib/python2.5/site-packages/
cd ~/libs/tagging-154
ln -s `pwd`/tagging /home/username/blog/ENV/lib/python2.5/site-packages/

Create the wsgi config file. This goes in a separate directory from everything else so that the permissions that apache has to read and serve it don't affect anything else.

mkdir ~/blog/apache
vi ~/blog/apache/blog.wsgi

The wsgi file should look like the following; the paths and the settings module are the parts which will change for each project.

ALLDIRS = ['/home/username/blog/ENV/lib/python2.5/site-packages/']

import os, sys, site

# add the virtualenv's site-packages, then move any newly added
# entries to the front of sys.path so they win over system packages
prev_sys_path = list(sys.path)
for directory in ALLDIRS:
    site.addsitedir(directory)
new_sys_path = []
for item in list(sys.path):
    if item not in prev_sys_path:
        new_sys_path.append(item)
        sys.path.remove(item)
sys.path[:0] = new_sys_path

sys.path.append('/home/username/blog/projects/')
os.environ['PYTHON_EGG_CACHE'] = '/home/username/.python-eggs'
os.environ['DJANGO_SETTINGS_MODULE'] = 'blogsite.settings'

import django.core.handlers.wsgi
application = django.core.handlers.wsgi.WSGIHandler()

My version seems to differ slightly from other versions I've seen. In particular my "main site" app is in a subdirectory of projects, so I have appname.settings for the DJANGO_SETTINGS_MODULE. I'm not sure that my way has any advantage over the others; it's just the way it is.

Finally, set up the virtual host:

<VirtualHost XXX.XXX.XXX.XXX>
    ServerAdmin webmaster@example.com
    ServerName blog.example.com
    CustomLog /var/log/apache2/access.log combined

    Alias /adminmedia /home/username/blog/ENV/lib/python2.5/site-packages/django/contrib/admin/media
    Alias /media /home/username/blog/media

    <Directory /home/username/blog/apache/>
        Order deny,allow
        Allow from all
    </Directory>

    WSGIDaemonProcess blog user=www-data group=www-data threads=25
    WSGIProcessGroup blog
    WSGIScriptAlias / /home/username/blog/apache/blog.wsgi
</VirtualHost>

First, two aliases are set, one for the normal media (this is unchanged from a mod_python installation) and the other for the admin media. This should point to the virtualenv version of the admin media. Next, the Directory section allows apache access to the wsgi handler. And finally, wsgi is configured and pointed to the handler in the apache directory. Do a:

apache2ctl restart

And it's done!

apache2/mod_python configuration for django

The request for a working apache/mod_python configuration example comes up frequently on the django IRC channel. Usually a bad configuration is the underlying problem when admin media doesn't show up or when projects are not found on the python path. This is the model I use:

<VirtualHost XXX.XXX.XXX.XXX>
    ServerAdmin webmaster@example.com
    DocumentRoot /home/username/projectname/public_html/
    ServerName projectname.example.com
    CustomLog /var/log/apache2/access.log combined

    Alias /adminmedia /usr/share/python-support/python-django/django/contrib/admin/media
    Alias /media /home/username/projectname/media

    <Location "/">
        SetHandler python-program
        PythonHandler django.core.handlers.modpython
        PythonPath "['/home/username/projectname/projects/'] + sys.path"
        SetEnv DJANGO_SETTINGS_MODULE mysite.settings
        PythonDebug On
    </Location>

    <Location "/media">
        SetHandler None
    </Location>

    <Location "/adminmedia">
        SetHandler None
    </Location>
</VirtualHost>

This is for a directory layout like this:

/home/username/projectname/                  <- your project directory
/home/username/projectname/media/            <- media
/home/username/projectname/public_html/      <- empty, a harmless place to point apache
/home/username/projectname/projects/         <- where all your projects live
/home/username/projectname/projects/mysite/  <- your main project/app

Inside mysite/, you have In, you've set MEDIA_ROOT = '/home/username/projectname/media/', MEDIA_URL = '/media/' and ADMIN_MEDIA_PREFIX = '/adminmedia/'.