About vdig.net



Our Mission

Our mission is free access to public information.

When a kid from a cash-starved public school uses our search engine to learn about government, we are happy. When a solo-mum uses our search engine to discover what her MP has said in parliament, we are happy. If a computer geek downloads our data to learn about full-text searching, we are happy [there is a link for you below].

A big 'thank you' to the anonymous donor of our new server! (see below)

The VDIG group

The VDIG group (hansard@vdig.net) is a non-profit group of people who work in their spare time to provide this website. Why? Because we think that public access to parliamentary information is a good thing. All our time and equipment are donated. We have had many enquiries about how we implemented the Hansard search engine. This page will hopefully answer those questions.

  • The free Hansard search index went live in April 2002.
  • If you wish to experiment yourself, you can download the entire Hansard archive from here [68 MB]. This is a PPMd-compressed dump of our hansard_content database table. Uncompressed, the data file is around 390 MB.

The Plan

The Hansard search engine is the first step towards our final goal. That goal is a database of voting records broken down by issue, then by party, and then by individual member of parliament.

To do this we need a search engine. And here it is.
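
The breakdown we are aiming for (issue, then party, then member) can be pictured as a nested lookup. What follows is only a rough sketch of that shape in Python, not something we have built: the record layout, the names in it and the helper functions are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical vote records: (issue, party, member, vote). All names are invented.
VOTES = [
    ("Example Bill A", "Party X", "Member 1", "aye"),
    ("Example Bill A", "Party X", "Member 2", "no"),
    ("Example Bill A", "Party Y", "Member 3", "no"),
    ("Example Bill B", "Party Y", "Member 3", "aye"),
]


def build_vote_tree(records):
    """Nest the records as issue -> party -> member -> vote."""
    tree = defaultdict(lambda: defaultdict(dict))
    for issue, party, member, vote in records:
        tree[issue][party][member] = vote
    return tree


def party_tally(tree, issue, party):
    """Count how one party's members voted on one issue."""
    tally = defaultdict(int)
    for vote in tree[issue][party].values():
        tally[vote] += 1
    return dict(tally)


vote_tree = build_vote_tree(VOTES)
print(party_tally(vote_tree, "Example Bill A", "Party X"))
```

Drilling down to a single member is then one more lookup, for example vote_tree["Example Bill B"]["Party Y"]["Member 3"].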

The Server

Our server has recently been upgraded to handle the increased load. It is a dedicated EPIA Eden 5000 rack server with high-speed networking. It runs PostgreSQL, an enterprise-class database server, on Debian Linux. On our current Hansard database, the server can handle approximately 2000 searches per minute, or roughly 120,000 searches per hour. We would like to thank an anonymous donor for providing this hardware.
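
For anyone who wants to check the throughput figure, the conversion is simple arithmetic. The only input is the 2000 searches per minute quoted above; the variable names are ours.

```python
# Figure quoted in the text above.
SEARCHES_PER_MINUTE = 2000

searches_per_hour = SEARCHES_PER_MINUTE * 60
searches_per_second = SEARCHES_PER_MINUTE / 60

print(f"{searches_per_hour:,} searches per hour")        # 120,000 searches per hour
print(f"{searches_per_second:.1f} searches per second")  # 33.3 searches per second
```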

The Search Engine

The Hansard search engine is a Tomcat container running the Lucene search engine. Let's start with Lucene. This is a wonderful piece of software which has become part of the Apache project. Lucene is the perfect tool for ultra-fast searching of large amounts of text.

The Hansard data is parsed using a parser generated with the ANTLR 2.0 parser generator. This tool is free and was written by Terence Parr. The Hansard parser is written in Java, reads its data from the web and feeds a state machine that attempts to detect the title, speaker and date of the Hansard data. This is carefully done to use only the content of Hansard which is not copyright, and not to use any additional markup which might be copyright. If you are not familiar with the ANTLR tools then I suggest a look -- grammar inheritance is a feature that no parser writer should do without. This parsing is not perfect, however, and you might notice 'holes' in the data. These occur when our parser fails to correctly parse the data.
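
To give a feel for the state-machine idea, here is a small hand-written sketch. It is not our parser: the real one is generated by ANTLR and written in Java, whereas this is plain Python for brevity. The input layout (title line, then a date line, then "SPEAKER: text" lines), the two regular expressions and the function name parse_debate are all assumptions made up for the example.

```python
import re
from enum import Enum, auto


class State(Enum):
    EXPECT_TITLE = auto()
    EXPECT_DATE = auto()
    IN_BODY = auto()


# Hypothetical line formats, invented for this sketch.
DATE_RE = re.compile(r"^\d{1,2} \w+ \d{4}$")            # e.g. "9 April 2002"
SPEAKER_RE = re.compile(r"^([A-Z][A-Z .'-]+): (.*)$")    # e.g. "A SPEAKER: text"


def parse_debate(lines):
    """Return (title, date, speeches); speeches is a list of (speaker, text)."""
    state = State.EXPECT_TITLE
    title = date = None
    speeches = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if state is State.EXPECT_TITLE:
            title = line
            state = State.EXPECT_DATE
        elif state is State.EXPECT_DATE:
            if DATE_RE.match(line):
                date = line
                state = State.IN_BODY
            # Any other line before the date is skipped: a 'hole' in the data.
        else:
            match = SPEAKER_RE.match(line)
            if match:
                speeches.append((match.group(1), match.group(2)))
            elif speeches:
                # Continuation line: append it to the current speech.
                speaker, text = speeches[-1]
                speeches[-1] = (speaker, text + " " + line)
    return title, date, speeches


SAMPLE = [
    "Example Debate Title",
    "9 April 2002",
    "A SPEAKER: I rise to speak.",
    "This is a continuation line.",
    "B SPEAKER: I reply.",
]
title, date, speeches = parse_debate(SAMPLE)
print(title, date, speeches)
```

Lines the machine does not recognise are simply dropped, which is one way the 'holes' mentioned above can arise.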

The output of the parser is fed into a PostgreSQL database. PostgreSQL is an awesome free relational database which features stored procedures, triggers and many other enterprise-level features. The table that holds the parsed data is called 'hansard_content'.
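
As a rough picture of this step, the sketch below stores one parsed speech in a table named 'hansard_content'. Two loud caveats: it uses Python's built-in SQLite as a stand-in for PostgreSQL so that it runs anywhere, and the column names (title, speaker, date, content) are guesses based on what the parser detects, not our actual schema.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE hansard_content (
        id      INTEGER PRIMARY KEY,
        title   TEXT,
        speaker TEXT,
        date    TEXT,
        content TEXT NOT NULL
    )
    """
)


def store_speech(conn, title, speaker, date, content):
    """Insert one parsed speech and return its row id."""
    cur = conn.execute(
        "INSERT INTO hansard_content (title, speaker, date, content) "
        "VALUES (?, ?, ?, ?)",
        (title, speaker, date, content),
    )
    conn.commit()
    return cur.lastrowid


row_id = store_speech(
    conn, "Example Debate Title", "A SPEAKER", "2002-04-09", "I rise to speak."
)
row = conn.execute(
    "SELECT title, speaker FROM hansard_content WHERE id = ?", (row_id,)
).fetchone()
print(row_id, row)
```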

The Lucene search indexing routine is then run over the 'hansard_content' table and generates a search index -- basically a giant 100 MB bunch of files.
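
If you have not met a search index before, the core idea is an inverted index: a map from each word to the rows that contain it. The toy below shows that idea only. It is not Lucene, which adds text analysis, ranking and an efficient on-disk format on top; the sample rows and the function names are invented.

```python
import re
from collections import defaultdict

# Hypothetical rows standing in for the 'hansard_content' table: id -> text.
ROWS = {
    1: "The budget debate continued into the evening.",
    2: "Members asked questions about the education budget.",
    3: "The House adjourned.",
}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(rows):
    """Map every term to the set of row ids whose text contains it."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for term in tokenize(text):
            index[term].add(row_id)
    return index


def search(index, query):
    """Return the sorted ids of rows containing every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


index = build_index(ROWS)
print(search(index, "budget"))            # [1, 2]
print(search(index, "education budget"))  # [2]
```

Because the index is built once, ahead of time, a search only touches the entries for the query words rather than scanning all the text.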

The Web Application

Now it is Tomcat's turn. Tomcat is a servlet container that runs Java servlets and JSP pages, hence the '.jsp' files all over the place. Tomcat is Java-based and can talk directly to Lucene. This means that when you search for a string, Tomcat calls Lucene, which queries the index and returns a list of matching entries in the 'hansard_content' table. This list is displayed, and if you click on one of the entries, the full contents of the parliamentary speech corresponding to that entry are retrieved and displayed.
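
The two-step flow described above (search first, then fetch the full speech on a click) looks roughly like this. It is a Python sketch of the flow only, not our JSP code: the tiny table, the hand-made index and the names handle_search and handle_entry are all hypothetical.

```python
# Hypothetical stand-ins: a tiny 'hansard_content' table and a term index.
HANSARD_CONTENT = {
    1: {"title": "Budget Debate", "content": "Full text of the first speech."},
    2: {"title": "Question Time", "content": "Full text of the second speech."},
}
INDEX = {"budget": [1], "speech": [1, 2], "question": [2]}


def handle_search(query):
    """Step 1: look the query up in the index and return (id, title) pairs."""
    ids = INDEX.get(query.strip().lower(), [])
    return [(entry_id, HANSARD_CONTENT[entry_id]["title"]) for entry_id in ids]


def handle_entry(entry_id):
    """Step 2: fetch the full speech for the entry the user clicked on."""
    row = HANSARD_CONTENT.get(entry_id)
    return None if row is None else row["content"]


results = handle_search("speech")
first_id = results[0][0]
full_text = handle_entry(first_id)
print(results)
print(full_text)
```

The results page only needs ids and titles, so the full text of a speech is read from the table only when somebody asks for it.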


hansard@vdig.net