I wanted to do some serious searching of IMDB so I could prettify some movie lists. There's no public API for imdb, and although one guy built a web interface he closed it down due to licensing concerns. Luckly, IMDB still offers the imdb data and some tools for searching it. Unluckily, the mkdb script gave me a segmentation fault whenever I tried to import a list, so I decided to go at it alone.
First, I made a few database tables in MySQL:
create table movies (
id int not null auto_increment primary key,
title varchar(255) not null,
year year not null
);
create table directors(
id int not null auto_increment primary key,
name varchar(255) not null
);
create table movies_directors(
id int not null auto_increment primary key,
director_id int not null, movie_id int not null
);
create table akatitles (
id int not null auto_increment primary key,
movie_id int not null,
title varchar(255) not null,
is_english boolean not null default false
);
create table movies_search(
movie_id int not null primary key,
search_field text,
fulltext(search_field)
);
The IMDB movie file is a tab separated list of movies and years, like so:
Notorious (1946) 1946
Notorious (1992) (TV) 1992
Notorious (1992) (V) 1992
...
Number 13 (2006) (TV) 2006-????
In order to simplify the import, I removed the header and footer from the file and deleted a bunch of data, namely all direct-to-video movies (V) , and all television episodes (TV). I also deleted everything with a quote '"' at the beginning of the title or with a year range (1995-1996) on the assumption that these were probably TV series, and everything with an unknown date (????). In vim:
:g/(V)/d
:g/(TV)/d
:g/^"/d
:g/\t[0-9][0-9][0-9][0-9]-/d
:g/\t????/d
I then removed all the extra tabs, so only a single tab was left between the year and the title.
:%s/\t\{2,}/\t/g
And finally to each row I added unique IDs, based on the line number, followed by a tab:
:%s/^/\=line('.') . "\t"/g
With the movie list ready, it could be imported into mysql:
mysqlimport --fields-terminated-by="\t"
--lines-terminated-by="\n" --user=kellen
--password movies /home/kellen/IMDB/movies.list
The directors list presented a different problem: it was broken up into sections by director, each followed by a blank line. The associated movies were tab separated in the same section, each on a new line, except the first, which followed after the director name. Like so:
Herzog, Robert M. Flights (2002)
Herzog, Rudolph Abora - Letzte Position Atlantik (2008)
Der Ball ist ein Sauhund (1999)
Heil Hitler, das Schwein ist tot! - Humor unterm Hakenkreuz (2006) (TV)
"The Heist" (2004)
Herzog, Werner Aguirre, der Zorn Gottes (1972)
Auch Zwerge haben klein angefangen (1970)
Ballade vom kleinen Soldaten (1984) (TV)
Behinderte Zukunft? (1971) (TV)
Cerro Torre: Schrei aus Stein (1991)
...
It seemed sensible to take a two-step approach and load up all the directors first, then join them with the movies as a second step, so I went about removing the blank lines, lines which began with whitespace, and anything following a tab (the first film for each director). I then added line numbers, as for the movies list:
:g/^$/d
:g/^\s/d
:%s/\t.*$//g
:%s/^/\=line('.') . "\t"/g
In order to speed up the processing, I also removed the lines which were videos or TV shows or began with '"' and which were not also a line for the director:
:g/^\t.*(V)/d
:g/^\t.*(TV)/d
:g/^\t\+"/d
This was followed by a similar import. Note that "directors.list" is a modified version of the original directors.list which comes from IMDB.
mysqlimport --fields-terminated-by="\t"
--lines-terminated-by="\n" --user=root
--password movies /home/kellen/IMDB/directors.list
I then needed to iterate over each director and add associate them with each of their films. I realized here that I could just as well have added the directors in the same step. Oh well. I wrote a python script to do this. Note that "directors_movies.list" contains the original directors.list file from IMDB.
#!/usr/bin/python
import MySQLdb
import sys
p = sys.stdout.write
fl = sys.stdout.flush
def insert_directors():
f = open('directors_movies.list')
conn = MySQLdb.connect(host="localhost", user="USER", passwd="PASS", db="movies")
cursor = conn.cursor()
director = ""
movies=[]
ln = 0;
try:
for line in f:
ln += 1
if len(line.strip()) == 0:
if len(movies) > 0:
cursor.execute ("SELECT id FROM directors WHERE name= %s", director)
row = cursor.fetchone()
if row is not None and row[0] > 0:
dir_id = row[0]
for movie in movies:
cursor.execute("SELECT id FROM movies WHERE title=%s", movie)
m_row = cursor.fetchone()
if m_row is not None and m_row[0] > 0:
movie_id = m_row[0]
# movie found, so insert the relationship
cursor.execute("INSERT INTO movies_directors (movie_id, director_id) VALUES (%s, %s)", (movie_id, dir_id))
if cursor.rowcount > 0:
p("#")
fl()
else:
p("x")
fl()
else:
# no movie, print a '.'
p(".")
fl()
else:
print "No director in the DB for %s" % director
else:
print "No movies to insert for %s" % director
director = ""
movies=[]
elif line.startswith("\t"):
# its a movie row, so add it to the dir.
movies.append(line.strip())
else:
# add the director, then parse the first movie
director, title = line.split('\t', 1)
movies.append(title.strip())
except ValueError:
print "on line ", ln, ": ***", line , "***"
raise
cursor.close()
conn.close ()
f.close()
insert_directors()
Before I ran the insert script, which does a lot of lookups, I created some indexes on the existing tables:
create index movie_titles on movies (title);
create index dir_name on directors (name);
create index mov_dir on movies_directors (movie_id, director_id);
Running the script:
> ./dirs.py
###.####.#######.......#........#.#..#######...#..#..#.########..##.##############..#######.##..####.
...
Great! Now there's a table of movies and a table of directors. We can do things like:
mysql> select d.name as director, m.title, m.year from
-> directors d, movies_directors md, movies m where
-> d.name like 'Belvaux, R%' and md.director_id = d.id and md.movie_id = m.id;
+---------------+---------------------------------------+------+
| director | title | year |
+---------------+---------------------------------------+------+
| Belvaux, R▒my | C'est arriv▒ pr▒s de chez vous (1992) | 1992 |
| Belvaux, R▒my | Pas de C4 pour Daniel Daniel (1987) | 1987 |
+---------------+---------------------------------------+------+
2 rows in set (0.09 sec)
This is great, but what I'd really like to be able to do is throw in a bunch of keywords and get back a relevance-ordered list of nicely formatted titles. In english. So I need two more things. First, another table with the movies' "aka names" and secondly a fulltext index with all of the relevant info (title, alternative titles, year, director names).
The alternative names have to be imported from yet another file, "aka-titles.list". And because of the formatting, and because they need to get linked by ID, it's another python script. The script also creates all the text needed for the fulltext index (title, director names, plus alternative titles).
#!/usr/bin/python
import MySQLdb
import sys
import re
p = sys.stdout.write
fl = sys.stdout.flush
def insert_alt():
f = open('aka-titles.list')
conn = MySQLdb.connect(host="localhost", user="USER", passwd="PASS", db="movies")
cursor = conn.cursor()
title = ""
movies=[]
ln = 0;
english_patt = re.compile("(english|\(UK\)|\(USA\))", re.I)
clean_patt = re.compile("\(aka (.+?) \(")
try:
for line in f:
ln += 1
if len(line.strip()) == 0:
if len(movies) > 0:
cursor.execute ("SELECT id FROM movies WHERE title= %s", title)
row = cursor.fetchone()
if row is not None and row[0] > 0:
movie_id = row[0]
alt_titles = ""
for movie in movies:
is_english = 0 if english_patt.search(movie) is None else 1
m = clean_patt.search(movie)
if m is not None:
cleaned_title = m.group(1)
alt_titles += " " + cleaned_title
cursor.execute("INSERT INTO akatitles (movie_id, title, is_english) VALUES (%s, %s, %s)", (movie_id, cleaned_title, is_english))
p("#") if cursor.rowcount > 0 else p("x")
else:
p("z")
# create the search field:
searchfield = title
cursor.execute("SELECT name FROM directors d, movies_directors md WHERE md.movie_id = %s and md.director_id = d.id", movie_id)
dir_rows = cursor.fetchall()
for dir_row in dir_rows:
searchfield += " " + dir_row[0]
cursor.execute("INSERT INTO movies_search (movie_id, search_field) VALUES (%s, %s) on duplicate key update search_field=VALUES(search_field)", (movie_id, searchfield))
p("+") if cursor.rowcount > 0 else p("_")
else:
# main movie not in the DB, skipping
p(".")
else:
print "No movies to insert for %s" % title
fl()
title= ""
movies=[]
elif line.startswith(" "):
# its a movie row, so add it to the dir.
movies.append(line.strip())
else:
# add the title
title = line.strip()
except ValueError:
print "on line ", ln, ": ***", line , "***"
raise
cursor.close()
conn.close ()
f.close()
insert_alt()
Great, now we can do:
mysql> select distinct m.* from movies m, akatitles a where
-> a.movie_id = m.id and a.title like '%Man Bites Dog%';
+-------+---------------------------------------+------+
| id | title | year |
+-------+---------------------------------------+------+
| 62985 | C'est arriv▒ pr▒s de chez vous (1992) | 1992 |
+-------+---------------------------------------+------+
1 row in set (0.17 sec)
After doing this, I realized that building the index in this way would only give a search value for movies which had alternate titles, which not all have. So I had to re-create the movies_search table using the data in the DB. First create some temporary tables:
create temporary table alttitles(movie_id int not null primary key, titles text);
create temporary table dirnames(movie_id int not null primary key, names text);
Fill them with data:
insert into alttitles (movie_id, titles)
select a.movie_id, group_concat(a.title separator " ") as titles from akatitles a group by movie_id;
insert into dirnames(movie_id, names)
select movie_id, group_concat(d.name separator " ") as names from movies_directors md, directors d where director_id = d.id group by movie_id;
Index them:
create index alt_idx on alttitles(movie_id);
create index dir_idx on dirnames(movie_id);
Build the index in 4 steps:
-- with both director names and alternate titles
insert into movies_search (movie_id, search_field)
select m.id as movie_id,
concat(m.title, " ", dirnames.names, " ", alttitles.titles) as search_field
from movies m, alttitles, dirnames
where m.id = alttitles.movie_id and m.id = dirnames.movie_id;
-- with just director names
insert into movies_search (movie_id, search_field)
select m.id as movie_id,
concat(m.title, " ", dirnames.names) as search_field
from movies m, dirnames
where m.id = dirnames.movie_id and not exists (select movie_id from alttitles where movie_id=m.id);
-- with just alternate titles
insert into movies_search (movie_id, search_field)
select m.id as movie_id,
concat(m.title, " ", alttitles.titles) as search_field
from movies m, alttitles
where m.id = alttitles.movie_id and not exists (select movie_id from dirnames where movie_id=m.id);
-- without director or alternate titles
insert into movies_search(movie_id, search_field)
select m.id as movie_id,
m.title as search_field
from movies m where not exists (select movie_id from alttitles where movie_id=m.id)
and not exists (select movie_id from dirnames where movie_id=m.id);
And finally verify that it all worked:
mysql> select count(*) from movies;
+----------+
| count(*) |
+----------+
| 491858 |
+----------+
1 row in set (0.00 sec)
mysql> select count(*) from movies_search;
+----------+
| count(*) |
+----------+
| 491858 |
+----------+
1 row in set (0.00 sec)
After this, it is possible to do fulltext searches on directory or file names from download services and to get reasonable results. For example:
mysql> select m.*, match(search_field)
-> against('Psycho.1960.560x304.25fps.800kbs.AC3.MultiSub.WunSeeDee') AS score
-> from movies_search ms, movies m where ms.movie_id = m.id
-> order by score desc limit 3;
+--------+------------------+------+-----------------+
| id | title | year | score |
+--------+------------------+------+-----------------+
| 323689 | Psycho (1960) | 1960 | 18.416343688965 |
| 323703 | Psycho II (1983) | 1983 | 14.394601821899 |
| 323690 | Psycho (1998) | 1998 | 12.881115913391 |
+--------+------------------+------+-----------------+
3 rows in set (0.56 sec)
So next, a script to look up titles and format them according to my silly formatting rules: