KinoSearch vs Tsearch2 contd.

Report2:

The working principle of tsearch2 is thus:

The document is reduced to a tsvector, a data type that holds the words in the document together with the position of each word. As mentioned in my earlier report, this data type can hold only up to 256 positions per lexeme. Each word in the tsvector is represented as a lexeme, and an index is then built over those lexemes.

The searches to be performed are stored in a data type called tsquery, which again holds the search terms as lexemes. The tsquery is then matched against the tsvector column of the table, and the matching documents are returned.
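To make the idea concrete, here is a minimal pure-Perl sketch of the two data types. It is only a toy model: `toy_tsvector` and `toy_match_any` are hypothetical names, and unlike the real `to_tsvector` this version does no stemming and removes no stop words. It simply lowercases words and records their positions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy stand-in for to_tsvector(): map each lowercased word to the list of
# positions where it occurs. Real tsearch2 also stems and drops stop words.
sub toy_tsvector {
    my ($text) = @_;
    my ( %vector, $position );
    for my $word ( $text =~ /(\w+)/g ) {
        push @{ $vector{ lc $word } }, ++$position;
    }
    return \%vector;
}

# Toy stand-in for "vector @@ to_tsquery('a|b')": count the OR-ed terms
# that appear in the vector; zero means the document does not match.
sub toy_match_any {
    my ( $vector, $query ) = @_;
    my @terms   = map { lc $_ } split /\|/, $query;
    my $matched = grep { exists $vector->{$_} } @terms;
    return $matched;
}

my $vector = toy_tsvector('Which is to implement full text search for Bricolage.');
printf "text occurs at position(s): %s\n", join ',', @{ $vector->{text} };
printf "terms matched: %d\n", toy_match_any( $vector, 'text|search' );
```

The hash of position lists plays the role of the tsvector, and the split query string plays the role of the tsquery. The real types are stored and indexed inside PostgreSQL rather than rebuilt on every search.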

I tried building a very simple search engine from an example given in the user’s guide to tsearch2.

=# CREATE TABLE docs ( id SERIAL, doc TEXT, vector tsvector );
=# CREATE INDEX docs_index ON docs USING gist(vector);
=# CREATE FUNCTION insdoc(text) RETURNS void LANGUAGE sql AS
  'INSERT INTO docs (doc, vector) VALUES ($1, to_tsvector($1));';

=# SELECT insdoc('This is a test document I made up.');
=# SELECT insdoc('For my summer of code project for google');
=# SELECT insdoc('Which is to implement full text search for Bricolage.');
=# SELECT insdoc('I am testing tsearch2 right now');
=# SELECT insdoc('And I''m running out of sentences to put in.');
=# SELECT insdoc('So I guess I''m gonna stop here.');
=# SELECT insdoc('Okay, so maybe one more then... .');
=# CREATE TYPE finddoc_t AS (id INTEGER, headline TEXT, rank REAL);
=# CREATE FUNCTION finddoc(text) RETURNS SETOF finddoc_t LANGUAGE sql AS '
   SELECT id, headline(doc, q), rank(vector, q)
     FROM docs, to_tsquery($1) AS q
     WHERE vector @@ q ORDER BY rank(vector, q) DESC';

This was my test case:
=# SELECT * FROM finddoc('text|search');
 id |                       headline                        | rank 
----+-------------------------------------------------------+------
  3 | Which is to implement full <b>text</b> <b>search</b>  | 0.19
  4 | <b>tsearch2</b> right now                             |  0.1
  
(2 rows)
=# SELECT doc FROM docs WHERE id = 3;
                       doc                       
-------------------------------------------------
 Which is to implement full text search for Bricolage.
(1 row)

Something I noticed was that tsearch returned only documents whose lexemes exactly matched the lexemes in the query. This needs attention when converting search phrases into lexemes for a tsquery.
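One way to handle that conversion is to normalise the user's input before handing it to `to_tsquery`, which expects an explicit operator between terms. The helper below is a hypothetical sketch (`build_tsquery_string` is not part of tsearch2): it only lowercases the words, strips punctuation, and joins them with `&` or `|`. Reducing the words to lexemes is still left to `to_tsquery` itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn free-form user input into a string that
# to_tsquery() can parse, by joining the words with an explicit operator.
sub build_tsquery_string {
    my ( $input, $operator ) = @_;
    $operator = '&' unless defined $operator;
    die "operator must be '&' or '|'\n"
        unless $operator eq '&' or $operator eq '|';

    # Keep only word characters so punctuation cannot break the query syntax.
    my @terms = map { lc $_ } ( $input =~ /(\w+)/g );
    return join " $operator ", @terms;
}

print build_tsquery_string('Full text search!'), "\n";
print build_tsquery_string( 'text search', '|' ), "\n";
```

The resulting string would then be passed to `finddoc` as a bound parameter rather than interpolated into the SQL text.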

I ran into some problems when I first tried to install KinoSearch.

So I checked a list of KinoSearch users, starting with www.evo.com. The search there was user-friendly and fast.
(I just thought I’d mention all the steps I followed.)

After I installed KinoSearch using cpan I tried out a few examples from this page.
I made no real changes to the example search engine or the test case mentioned on the page.

Here is the code I used:

invindex.plx

    #!/usr/bin/perl
    use strict;
    use warnings;
    
    use File::Spec;
    use KinoSearch::InvIndexer;
    use KinoSearch::Analysis::PolyAnalyzer;
    
    my $source_dir       = '';
    my $path_to_invindex = '';
    my $base_url         = '/us_constitution';
    
    opendir( my $source_dh, $source_dir )
        or die "Couldn't opendir '$source_dir': $!";
    my @filenames = grep {/\.html/} readdir $source_dh;
    closedir $source_dh or die "Couldn't closedir '$source_dir': $!";
    
    ### Analyzer.
    my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( 
        language => 'en',
    );

    ### Create an InvIndexer object.
    my $invindexer = KinoSearch::InvIndexer->new(
        analyzer => $analyzer,
        invindex => $path_to_invindex,
        create   => 1,
    );
    
    ### fields.
    $invindexer->spec_field( name => 'title' );
    $invindexer->spec_field( 
        name       => 'bodytext',
        vectorized => 1,
    );
    $invindexer->spec_field(
        name    => 'url',
        indexed => 0,
    );
    
    foreach my $filename (@filenames) {
        next if $filename eq 'index.html';
        my $filepath = File::Spec->catfile( $source_dir, $filename );
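        open( my $fh, '<', $filepath )
            or die "couldn't open file '$filepath': $!";
        # Slurp the whole file, then release the handle.
        my $content = do { local $/; <$fh> };
        close $fh or die "couldn't close file '$filepath': $!";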
    
        ### new document.
        my $doc = $invindexer->new_doc;
    
        $content =~ m#<title>(.*?)</title>#s
            or die "couldn't isolate title in '$filepath'";
        my $title = $1;
        $content =~ m#<div id="bodytext">(.*?)</div><!--bodytext-->#s
            or die "couldn't isolate bodytext in '$filepath'";
        my $bodytext = $1;
        $bodytext =~ s/<.*?>/ /gsm;    
    
        ### value for each field.
        $doc->set_value( url      => "$base_url/$filename" );
        $doc->set_value( title    => $title );
        $doc->set_value( bodytext => $bodytext );
    
        ### Add the document to the invindex.
        $invindexer->add_doc($doc);
    }
    
    $invindexer->finish;

search.cgi

    #!/usr/bin/perl -T
    use strict;
    use warnings;
    
    use CGI;
    use List::Util qw( max min );
    use POSIX qw( ceil );
    use KinoSearch::Searcher;
    use KinoSearch::Analysis::PolyAnalyzer;
    use KinoSearch::Highlight::Highlighter;
    
    my $cgi           = CGI->new;
    my $q             = $cgi->param('q');
    my $offset        = $cgi->param('offset');
    my $hits_per_page = 10;
    $q      = '' unless defined $q;
    $offset = 0  unless defined $offset;
    
    my $path_to_invindex = '';
    my $base_url         = '/us_constitution';
    
    ### specify Analyzer 
    my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( 
        language => 'en', 
    );
    
    ### Searcher object.
    my $searcher = KinoSearch::Searcher->new(
        invindex => $path_to_invindex,
        analyzer => $analyzer,
    );
    
    ### query 
    my $hits = $searcher->search($q);
    
    my $highlighter = KinoSearch::Highlight::Highlighter->new( 
        excerpt_field => 'bodytext' );
    $hits->create_excerpts( highlighter => $highlighter );
    
    
    $hits->seek( $offset, $hits_per_page );
    
    # create result list
    my $report = '';
    while ( my $hit = $hits->fetch_hit_hashref ) {
        my $score = sprintf( "%0.3f", $hit->{score} );
        $report .= qq|
            <p>
                <a href="$hit->{url}"><strong>$hit->{title}</strong></a>
                <em>$score</em>
                <br>
                $hit->{excerpt}
                <br>
                <span class="excerptURL">$hit->{url}</span>
            </p>
            |;
    }
    
    $q = CGI::escapeHTML($q);
    
    
    my $total_hits = $hits->total_hits;
    my $num_hits_info;
    if ( !length $q ) {
        # no query, no display
        $num_hits_info = '';
    }
    elsif ( $total_hits == 0 ) {
        $num_hits_info = qq|<p>No matches for <strong>$q</strong></p>|;
    }
    else {
        my $last_result  = min( ( $offset + $hits_per_page ), $total_hits );
        my $first_result = min( ( $offset + 1 ), $last_result );
    
        $num_hits_info = qq|
            <p>
                Results <strong>$first_result-$last_result</strong> 
                of <strong>$total_hits</strong> for <strong>$q</strong>.
            </p>
            <p>
                Results Page:
            |;
    
        my $current_page = int( $first_result / $hits_per_page ) + 1;
        my $last_page    = ceil( $total_hits / $hits_per_page );
        my $first_page   = max( 1, ( $current_page - 9 ) );
        $last_page = min( $last_page, ( $current_page + 10 ) );
    
       
        my $href = $cgi->url( -relative => 1 ) . "?" . $cgi->query_string;
        $href .= ";offset=0" unless $href =~ /offset=/;
    
        for my $page_num ( $first_page .. $last_page ) {
            if ( $page_num == $current_page ) {
                $num_hits_info .= qq|$page_num \n|;
            }
            else {
                my $new_offset = ( $page_num - 1 ) * $hits_per_page;
                $href =~ s/(?<=offset=)\d+/$new_offset/;
                $num_hits_info .= qq|<a href="$href">$page_num</a>\n|;
            }
        }
    
        
    
        
        $num_hits_info .= "</p>\n";
    }
    
    print "Content-type: text/html\n\n";
    print <<END_HTML;
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
    <html>
    <head>
        <meta http-equiv="Content-type" 
            content="text/html;charset=ISO-8859-1">
        <link rel="stylesheet" type="text/css" href="$base_url/uscon.css">
        <title>KinoSearch: $q</title>
    </head>
    
    <body>
    
        <div id="navigation">
            <form id="usconSearch" action="">
                <strong>
                Search the <a href="$base_url/index.html">US Constitution</a>:
                </strong>
                <input type="text" name="q" id="q" value="$q">
                <input type="submit" value="=&gt;">
                <input type="hidden" name="offset" value="0">
            </form>
        </div><!--navigation-->
    
        <div id="bodytext">
    
        $report
    
        $num_hits_info
    
        <p style="font-size: smaller; color: #666">
            <em>Powered by 
                <a href="http://www.rectangular.com/kinosearch/">
                    KinoSearch
                </a>
            </em>
        </p>
        </div><!--bodytext-->
    
    </body>
    
    </html>
    END_HTML

Details of this code are available here

Once the testing was over, I compared the two.

The first criterion checked was speed; the tsearch engine I built performed much faster.
I wrote a script to populate the tsvector table with the US Constitution, the same document I had used as the test case for KinoSearch. I ran the same phrases through both engines, and KinoSearch returned detailed results faster, though for elementary phrases tsearch performed really well. A more sophisticated search engine might do a better job.
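A comparison like this can be timed with a small harness along the following lines. This is only a sketch, not the script I actually ran: `time_searches` is a hypothetical name, the phrases are made-up samples, and the two callbacks are stubs. In a real run they would call `finddoc` through DBI and `$searcher->search` respectively.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw( gettimeofday tv_interval );

# Run every phrase through every engine $repeats times and return the
# total wall-clock seconds spent per engine.
sub time_searches {
    my ( $engines, $phrases, $repeats ) = @_;
    my %elapsed;
    for my $name ( sort keys %$engines ) {
        my $search = $engines->{$name};
        my $start  = [gettimeofday];
        for my $round ( 1 .. $repeats ) {
            $search->($_) for @$phrases;
        }
        $elapsed{$name} = tv_interval($start);
    }
    return \%elapsed;
}

# Stub callbacks (assumption): replace these with real calls to each engine.
my %engines = (
    tsearch2   => sub { my ($phrase) = @_; return length $phrase },
    kinosearch => sub { my ($phrase) = @_; return length $phrase },
);
my @phrases = ( 'congress', 'freedom of speech', 'president' );

my $elapsed = time_searches( \%engines, \@phrases, 100 );
printf "%-10s %.6f s\n", $_, $elapsed->{$_} for sort keys %$elapsed;
```

Repeating each phrase many times smooths out noise from a single query, but the totals still include whatever per-call overhead the callbacks add.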

KinoSearch builds an invindex (inverted index) over the document body, and search terms are looked up in that invindex.
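The principle behind an inverted index can be sketched in a few lines of plain Perl. This is a toy illustration only, with hypothetical function names and a deliberately crude score. It says nothing about how KinoSearch actually lays out or ranks its invindex.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a toy inverted index: term => { doc_id => occurrence count }.
sub build_inverted_index {
    my ($docs) = @_;
    my %index;
    for my $doc_id ( keys %$docs ) {
        for my $word ( $docs->{$doc_id} =~ /(\w+)/g ) {
            $index{ lc $word }{$doc_id}++;
        }
    }
    return \%index;
}

# Return the ids of documents containing every query term, best first,
# ranked by the summed occurrence counts of the terms.
sub search_all_terms {
    my ( $index, @terms ) = @_;
    my ( %score, %seen );
    for my $term ( map { lc $_ } @terms ) {
        my $postings = $index->{$term} or return;    # unknown term: no hits
        for my $doc_id ( keys %$postings ) {
            $score{$doc_id} += $postings->{$doc_id};
            $seen{$doc_id}++;
        }
    }
    my @hits   = grep { $seen{$_} == @terms } keys %score;
    my @ranked = sort { $score{$b} <=> $score{$a} || $a cmp $b } @hits;
    return @ranked;
}

my %docs = (
    1 => 'This is a test document I made up.',
    2 => 'Which is to implement full text search for Bricolage.',
    3 => 'I am testing tsearch2 right now',
);
my $index = build_inverted_index( \%docs );
print join( ', ', search_all_terms( $index, 'text', 'search' ) ), "\n";
```

A lookup touches only the postings of the query terms instead of scanning every document, which is why search engines use this structure.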

That page also lists a number of additional KinoSearch features. I cite them here anyway:

  • Incremental indexing (addition/deletion of documents to/from an existing index).
  • Full support for 12 Indo-European languages.
  • Support for boolean operators AND, OR, and AND NOT; parenthetical groupings, and prepended +plus and -minus.
  • Algorithmic selection of relevant excerpts and highlighting of search terms within excerpts.
  • Highly customizable query and indexing APIs.

Building the invindex for the document in question was much easier with KinoSearch.
The code was also easier to understand, and it is in Perl.

Please check this report and let me know. The question to be decided now is whether or not we intend to support MySQL. If not, implementing tsearch2 would be much easier, as detailed in the original application.
Implementing KinoSearch would mean more changes to the installer script and would bring in unwanted dependencies.

2 responses to “KinoSearch vs Tsearch2 contd.”

  1. Riyaad Miller

    Hi,

    What version of Kinosearch are you using here?
    I’m assuming version KinoSearch-0.162?

    Regards
    R

  2. hi R,

    you assume right.

    Cheers,
    Unni
