Entries tagged with regex

command line grepping

Finding certain PIDs in netstat:

netstat -lp --numeric-ports | grep -E "(9554|9512|9646|9664|17857)"

Building a personal IMDB

I wanted to do some serious searching of IMDB so I could prettify some movie lists. There's no public API for IMDB, and although one guy built a web interface, he closed it down due to licensing concerns. Luckily, IMDB still offers the raw data and some tools for searching it. Unluckily, the mkdb script gave me a segmentation fault whenever I tried to import a list, so I decided to go it alone.

First, I made a few database tables in MySQL:

create table movies (
    id int not null auto_increment primary key,
    title varchar(255) not null,
    year year not null
);

create table directors (
    id int not null auto_increment primary key,
    name varchar(255) not null
);

create table movies_directors (
    id int not null auto_increment primary key,
    director_id int not null,
    movie_id int not null
);

create table akatitles (
    id int not null auto_increment primary key,
    movie_id int not null,
    title varchar(255) not null,
    is_english boolean not null default false
);

create table movies_search (
    movie_id int not null primary key,
    search_field text,
    fulltext(search_field)
);

The IMDB movie file is a tab-separated list of movies and years, like so:

Notorious (1946)		1946
Notorious (1992) (TV)		1992
Notorious (1992) (V)		1992
...
Number 13 (2006) (TV)		2006-????

In order to simplify the import, I removed the header and footer from the file and deleted a bunch of data, namely all direct-to-video movies (V) and all television episodes (TV). I also deleted everything with a quote '"' at the beginning of the title or with a year range (1995-1996), on the assumption that these were probably TV series, and everything with an unknown date (????). In vim:

:g/(V)/d
:g/(TV)/d
:g/^"/d
:g/\t[0-9][0-9][0-9][0-9]-/d
:g/\t????/d

I then removed all the extra tabs, so only a single tab was left between the title and the year.

:%s/\t\{2,}/\t/g
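For reference, the same substitution expressed with Python's re module (just an illustration of the pattern; the actual cleanup was done in vim):

```python
import re

# \t{2,} matches runs of two or more tabs, mirroring the vim
# substitution :%s/\t\{2,}/\t/g
line = "Notorious (1946)\t\t\t1946"
cleaned = re.sub(r"\t{2,}", "\t", line)
# cleaned is now "Notorious (1946)\t1946"
```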

And finally, to each row I added a unique ID based on the line number, followed by a tab:

:%s/^/\=line('.') . "\t"/g

With the movie list ready, it could be imported into MySQL:

mysqlimport --fields-terminated-by="\t" --lines-terminated-by="\n" --user=kellen --password movies /home/kellen/IMDB/movies.list

The directors list presented a different problem: it was broken up into sections by director, each followed by a blank line. The associated movies were tab-indented in the same section, each on its own line, except the first, which followed the director's name on the same line. Like so:

Herzog, Robert M.	Flights (2002)

Herzog, Rudolph		Abora - Letzte Position Atlantik (2008)
			Der Ball ist ein Sauhund (1999)
			Heil Hitler, das Schwein ist tot! - Humor unterm Hakenkreuz (2006) (TV)
			"The Heist" (2004)

Herzog, Werner		Aguirre, der Zorn Gottes (1972)
			Auch Zwerge haben klein angefangen (1970)
			Ballade vom kleinen Soldaten (1984) (TV)
			Behinderte Zukunft? (1971) (TV)
			Cerro Torre: Schrei aus Stein (1991)
...

It seemed sensible to take a two-step approach and load up all the directors first, then join them with the movies as a second step, so I went about removing the blank lines, lines which began with whitespace, and anything following a tab (the first film for each director). I then added line numbers, as for the movies list:

:g/^$/d
:g/^\s/d
:%s/\t.*$//g
:%s/^/\=line('.') . "\t"/g

In order to speed up the processing, I also removed the lines which were videos or TV shows or began with '"', so long as they were not also the line carrying the director's name:

:g/^\t.*(V)/d
:g/^\t.*(TV)/d
:g/^\t\+"/d

This was followed by a similar import. Note that "directors.list" here is a modified version of the original directors.list which comes from IMDB.

mysqlimport --fields-terminated-by="\t" --lines-terminated-by="\n" --user=root --password movies /home/kellen/IMDB/directors.list

I then needed to iterate over each director and associate them with each of their films. I realized here that I could just as well have added the directors in this same step. Oh well. I wrote a Python script to do this. Note that "directors_movies.list" contains the original directors.list file from IMDB.

#!/usr/bin/python
import MySQLdb
import sys

p = sys.stdout.write
fl = sys.stdout.flush

def insert_directors():
    f = open('directors_movies.list')
    conn = MySQLdb.connect(host="localhost", user="USER", passwd="PASS", db="movies")
    cursor = conn.cursor()
    director = ""
    movies = []
    ln = 0
    try:
        for line in f:
            ln += 1
            if len(line.strip()) == 0:
                if len(movies) > 0:
                    cursor.execute("SELECT id FROM directors WHERE name= %s", director)
                    row = cursor.fetchone()
                    if row is not None and row[0] > 0:
                        dir_id = row[0]
                        for movie in movies:
                            cursor.execute("SELECT id FROM movies WHERE title=%s", movie)
                            m_row = cursor.fetchone()
                            if m_row is not None and m_row[0] > 0:
                                movie_id = m_row[0]
                                # movie found, so insert the relationship
                                cursor.execute("INSERT INTO movies_directors (movie_id, director_id) VALUES (%s, %s)", (movie_id, dir_id))
                                if cursor.rowcount > 0:
                                    p("#")
                                    fl()
                                else:
                                    p("x")
                                    fl()
                            else:
                                # no movie, print a '.'
                                p(".")
                                fl()
                    else:
                        print "No director in the DB for %s" % director
                else:
                    print "No movies to insert for %s" % director
                director = ""
                movies = []
            elif line.startswith("\t"):
                # it's a movie row, so add it to the director
                movies.append(line.strip())
            else:
                # add the director, then parse the first movie
                director, title = line.split('\t', 1)
                movies.append(title.strip())
    except ValueError:
        print "on line ", ln, ": ***", line, "***"
        raise
    cursor.close()
    conn.close()
    f.close()

insert_directors()

Before I ran the insert script, which does a lot of lookups, I created some indexes on the existing tables:

create index movie_titles on movies (title);
create index dir_name on directors (name);
create index mov_dir on movies_directors (movie_id, director_id);

Running the script:

> ./dirs.py
###.####.#######.......#........#.#..#######...#..#..#.########..##.##############..#######.##..####.
...

Great! Now there's a table of movies and a table of directors. We can do things like:

mysql> select d.name as director, m.title, m.year from
    -> directors d, movies_directors md, movies m where
    -> d.name like 'Belvaux, R%' and md.director_id = d.id and md.movie_id = m.id;
+---------------+---------------------------------------+------+
| director      | title                                 | year |
+---------------+---------------------------------------+------+
| Belvaux, Rémy | C'est arrivé près de chez vous (1992) | 1992 |
| Belvaux, Rémy | Pas de C4 pour Daniel Daniel (1987)   | 1987 |
+---------------+---------------------------------------+------+
2 rows in set (0.09 sec)

This is great, but what I'd really like to be able to do is throw in a bunch of keywords and get back a relevance-ordered list of nicely formatted titles. In English. So I need two more things: first, another table with the movies' "aka" names, and second, a fulltext index with all of the relevant info (title, alternative titles, year, director names).

The alternative names have to be imported from yet another file, "aka-titles.list". Because of the formatting, and because the titles need to be linked by ID, it's another Python script. The script also creates all the text needed for the fulltext index (title, director names, plus alternative titles).

#!/usr/bin/python
import MySQLdb
import sys
import re

p = sys.stdout.write
fl = sys.stdout.flush

def insert_alt():
    f = open('aka-titles.list')
    conn = MySQLdb.connect(host="localhost", user="USER", passwd="PASS", db="movies")
    cursor = conn.cursor()
    title = ""
    movies = []
    ln = 0
    english_patt = re.compile("(english|\(UK\)|\(USA\))", re.I)
    clean_patt = re.compile("\(aka (.+?) \(")
    try:
        for line in f:
            ln += 1
            if len(line.strip()) == 0:
                if len(movies) > 0:
                    cursor.execute("SELECT id FROM movies WHERE title= %s", title)
                    row = cursor.fetchone()
                    if row is not None and row[0] > 0:
                        movie_id = row[0]
                        alt_titles = ""
                        for movie in movies:
                            is_english = 0 if english_patt.search(movie) is None else 1
                            m = clean_patt.search(movie)
                            if m is not None:
                                cleaned_title = m.group(1)
                                alt_titles += " " + cleaned_title
                                cursor.execute("INSERT INTO akatitles (movie_id, title, is_english) VALUES (%s, %s, %s)", (movie_id, cleaned_title, is_english))
                                p("#") if cursor.rowcount > 0 else p("x")
                            else:
                                p("z")
                        # create the search field:
                        searchfield = title
                        cursor.execute("SELECT name FROM directors d, movies_directors md WHERE md.movie_id = %s and md.director_id = d.id", movie_id)
                        dir_rows = cursor.fetchall()
                        for dir_row in dir_rows:
                            searchfield += " " + dir_row[0]
                        cursor.execute("INSERT INTO movies_search (movie_id, search_field) VALUES (%s, %s) on duplicate key update search_field=VALUES(search_field)", (movie_id, searchfield))
                        p("+") if cursor.rowcount > 0 else p("_")
                    else:
                        # main movie not in the DB, skipping
                        p(".")
                else:
                    print "No movies to insert for %s" % title
                fl()
                title = ""
                movies = []
            elif line.startswith(" "):
                # it's an aka-title row, so add it to the list
                movies.append(line.strip())
            else:
                # add the title
                title = line.strip()
    except ValueError:
        print "on line ", ln, ": ***", line, "***"
        raise
    cursor.close()
    conn.close()
    f.close()

insert_alt()

Great, now we can do:

mysql> select distinct m.* from movies m, akatitles a where
    -> a.movie_id = m.id and a.title like '%Man Bites Dog%';
+-------+---------------------------------------+------+
| id    | title                                 | year |
+-------+---------------------------------------+------+
| 62985 | C'est arrivé près de chez vous (1992) | 1992 |
+-------+---------------------------------------+------+
1 row in set (0.17 sec)

After doing this, I realized that building the index in this way would only give a search value for movies which had alternate titles, which not all movies have. So I had to re-create the movies_search table using the data already in the DB. First, create some temporary tables:

create temporary table alttitles (movie_id int not null primary key, titles text);
create temporary table dirnames (movie_id int not null primary key, names text);

Fill them with data:

insert into alttitles (movie_id, titles)
    select a.movie_id, group_concat(a.title separator " ") as titles
    from akatitles a group by movie_id;

insert into dirnames (movie_id, names)
    select movie_id, group_concat(d.name separator " ") as names
    from movies_directors md, directors d
    where director_id = d.id group by movie_id;

Index them:

create index alt_idx on alttitles (movie_id);
create index dir_idx on dirnames (movie_id);

Build the index in 4 steps:

-- with both director names and alternate titles
insert into movies_search (movie_id, search_field)
    select m.id as movie_id,
           concat(m.title, " ", dirnames.names, " ", alttitles.titles) as search_field
    from movies m, alttitles, dirnames
    where m.id = alttitles.movie_id and m.id = dirnames.movie_id;

-- with just director names
insert into movies_search (movie_id, search_field)
    select m.id as movie_id,
           concat(m.title, " ", dirnames.names) as search_field
    from movies m, dirnames
    where m.id = dirnames.movie_id
      and not exists (select movie_id from alttitles where movie_id = m.id);

-- with just alternate titles
insert into movies_search (movie_id, search_field)
    select m.id as movie_id,
           concat(m.title, " ", alttitles.titles) as search_field
    from movies m, alttitles
    where m.id = alttitles.movie_id
      and not exists (select movie_id from dirnames where movie_id = m.id);

-- without director names or alternate titles
insert into movies_search (movie_id, search_field)
    select m.id as movie_id, m.title as search_field
    from movies m
    where not exists (select movie_id from alttitles where movie_id = m.id)
      and not exists (select movie_id from dirnames where movie_id = m.id);

And finally verify that it all worked:

mysql> select count(*) from movies;
+----------+
| count(*) |
+----------+
|   491858 |
+----------+
1 row in set (0.00 sec)

mysql> select count(*) from movies_search;
+----------+
| count(*) |
+----------+
|   491858 |
+----------+
1 row in set (0.00 sec)

After this, it is possible to do fulltext searches on directory or file names from download services and to get reasonable results. For example:

mysql> select m.*, match(search_field)
    -> against('Psycho.1960.560x304.25fps.800kbs.AC3.MultiSub.WunSeeDee') AS score
    -> from movies_search ms, movies m where ms.movie_id = m.id
    -> order by score desc limit 3;
+--------+------------------+------+-----------------+
| id     | title            | year | score           |
+--------+------------------+------+-----------------+
| 323689 | Psycho (1960)    | 1960 | 18.416343688965 |
| 323703 | Psycho II (1983) | 1983 | 14.394601821899 |
| 323690 | Psycho (1998)    | 1998 | 12.881115913391 |
+--------+------------------+------+-----------------+
3 rows in set (0.56 sec)

So next, a script to look up titles and format them according to my silly formatting rules:
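The entry stops here, but a sketch of what that lookup script might look like follows. The formatting rule in format_title is purely an invented placeholder (the "silly formatting rules" aren't spelled out); only the MATCH ... AGAINST query and the connection parameters mirror the setup above.

```python
#!/usr/bin/python
import re

def format_title(title, year):
    # The DB titles embed the year, e.g. "Psycho (1960)"; strip the
    # trailing "(year)" (optionally with a /I, /II disambiguator) and
    # re-append it uniformly. This output format is an assumption.
    bare = re.sub(r"\s*\(\d{4}(/[IVX]+)?\)\s*$", "", title)
    return "%s (%s)" % (bare, year)

def lookup(terms, limit=3):
    # Fulltext lookup against the movies_search table built above.
    import MySQLdb
    conn = MySQLdb.connect(host="localhost", user="USER", passwd="PASS", db="movies")
    cursor = conn.cursor()
    cursor.execute(
        "SELECT m.title, m.year, MATCH(search_field) AGAINST(%s) AS score "
        "FROM movies_search ms, movies m "
        "WHERE ms.movie_id = m.id ORDER BY score DESC LIMIT %s",
        (terms, limit))
    rows = cursor.fetchall()
    cursor.close()
    conn.close()
    return [format_title(t, y) for (t, y, s) in rows]
```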

vim fanciness

Removing empty lines

:g/^$/d

Adding line numbers

:%s/^/\=line('.') . "\t"/g

Using regex to validate django templatetag arguments

When writing custom templatetags, the arguments from the template are passed as a string to the tag function, and it is necessary to parse this string in some way to get the individual arguments. For example, with {% mytag arg1 arg2 %}, the templatetag function receives "mytag arg1 arg2" as a string (called the "token") and has to figure out what each of the substrings means. When there are fixed arguments, as in the official documentation's examples, splitting on whitespace can be sufficient:

tag_name, format_string = token.split_contents()

When the tag becomes more complex and offers optional arguments, the error checking gets a bit hairy and extremely repetitive. An example from some code I wrote a while back:

def first_gallery(parser, token):
    tag_name = "first_gallery"
    try:
        bits = token.contents.split()
    except ValueError:
        raise TemplateSyntaxError, "%r usage: {% %r for app.model primary_key [as gallery] %}" % (tag_name, tag_name)
    noa = len(bits)
    if noa < 4:
        raise TemplateSyntaxError, "%r tag takes at least 3 arguments" % tag_name
    if bits[1] != 'for':
        raise TemplateSyntaxError, "%r tag requires the first argument to be 'for'" % tag_name
    if noa > 4:
        if noa < 6 or noa > 6:
            raise TemplateSyntaxError, "%r tag requires exactly 5 arguments when specifying a variable name" % tag_name
        if bits[4] != 'as':
            raise TemplateSyntaxError, "%r tag requires 4th argument to be 'as' when specifying a variable name" % tag_name
        return FirstGalleryNode(bits[2], bits[3], bits[5])
    return FirstGalleryNode(bits[2], bits[3])

Pretty ugly. A much cleaner way to do this is with regular expressions. The downside of using regex is the loss of fine-grained error output to the user of the template tag. For me it's worth it; the regex shouldn't be much more complicated than those in urls.py, and the user should probably already be able to deal with those.

The tag we'll work with is {% nearby_images %}, which sets two context variables ("previous" and "next" by default) which contain images close to the specified one in a particular gallery. You'd use this most often when creating a carousel of thumbnails, as in (php)Gallery. The full syntax is: {% nearby_images [5 near] image in gallery [as previous, next] %}. This example gets ugly if we use the split()-based validation from above since both optional sections have to be accounted for and the magic index numbers shifted around appropriately.

The main regex features we'll use are grouping () and named capturing (?P<name>pattern). The simplest part to write a pattern for is the required arguments, including the tag name: nearby_images (?P<image>\w+) in (?P<gallery>\w+). This regex works on the same principle as those in urls.py: it captures the bits matched between the parentheses under the name given between the angle brackets. We'll use it like so:

>>> p = "nearby_images (?P<image>\w+) in (?P<gallery>\w+)"
>>> m = re.match(p, "nearby_images myimage in mygallery")
>>> m.groupdict()
{'image': 'myimage', 'gallery': 'mygallery'}

Next, add in the optional section by using a parenthetical grouping followed by a '?': nearby_images ((?P<num>\w+) near )?(?P<image>\w+) in (?P<gallery>\w+), making sure to include the whitespace inside the grouping. This pattern can be used in both cases:

>>> p = "nearby_images ((?P<num>\w+) near )?(?P<image>\w+) in (?P<gallery>\w+)"
>>> m = re.match(p, "nearby_images myimage in mygallery")
>>> m.groupdict()
{'image': 'myimage', 'num': None, 'gallery': 'mygallery'}
>>> m = re.match(p, "nearby_images 6 near myimage in mygallery")
>>> m.groupdict()
{'image': 'myimage', 'num': '6', 'gallery': 'mygallery'}

Finally, add the last optional group: nearby_images ((?P<num>\w+) near )?(?P<image>\w+) in (?P<gallery>\w+)( as (?P<previous>\w+), (?P<next>\w+))?, which lets us do:

>>> p = "nearby_images ((?P<num>\w+) near )?(?P<image>\w+) in (?P<gallery>\w+)( as (?P<previous>\w+), (?P<next>\w+))?"
>>> m = re.match(p, "nearby_images 6 near myimage in mygallery as FOO, BAR")
>>> m.groupdict()
{'previous': 'FOO', 'image': 'myimage', 'num': '6', 'gallery': 'mygallery', 'next': 'BAR'}

So now we've got the arguments to the template node in a dict, and so long as our arguments match the function signature, everything is peachy. Our finished tag function looks like this:

def nearby_images(parser, token):
    pattern = re.compile("nearby_images ((?P<num>\w+) near )?(?P<image>\w+) in (?P<gallery>\w+)( as (?P<previous>\w+), (?P<next>\w+))?")
    m = pattern.match(token.contents)
    if m:
        return NearbyImagesNode(**m.groupdict())
    else:
        raise TemplateSyntaxError, "FAIL"

And the matching Node class:

class NearbyImagesNode(Node):
    def __init__(self, num, image, gallery, previous, next):
        self.image = image
        ...

Alternatively, one can use kwargs in the Node as well, but then each argument has to be pulled from the dict:

class NearbyImagesNode(Node):
    def __init__(self, **kwargs):
        self.image = kwargs['image']
        ...

And that's it. Cleaner argument parsing, with the downside of less fine-grained error messages.
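One caveat worth adding: re.match only anchors at the start of the string, so a tag like {% nearby_images myimage in mygallery junk %} would still validate, with the trailing junk silently ignored. Appending '$' to the pattern (a small addition, not in the code above) makes trailing garbage fail the match:

```python
import re

# Anchored variant of the tag pattern: the trailing '$' means anything
# left over after the last valid argument causes the match to fail.
pattern = re.compile(
    r"nearby_images ((?P<num>\w+) near )?(?P<image>\w+)"
    r" in (?P<gallery>\w+)( as (?P<previous>\w+), (?P<next>\w+))?$")

assert pattern.match("nearby_images myimage in mygallery")
assert pattern.match("nearby_images 6 near myimage in mygallery as FOO, BAR")
assert pattern.match("nearby_images myimage in mygallery junk") is None
```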