PostgreSQL full text search with Django

PostgreSQL 8.3 is coming out soon with full text search integrated into the core database system. It's pretty well documented in chapter 12 of the PostgreSQL docs. The docs are a bit intimidating, but it turns out to be pretty easy to use with Django.

Let's say you're doing a stereotypical blog application, named 'blog', and have a model for entries such as:

This is a generic Full Text Search engine for Django projects

Currently implements three backends: dummy, simple and pgsql.

It should be possible to easily integrate MySQL, Sphinx and Xapian backends too.

Install

To install the latest version:

svn checkout http://django-fts.googlecode.com/svn/trunk/ django-fts
cd django-fts
python setup.py install

Note: You will need to install the Snowball python bindings if you want to use the snowball stemmer. If you don't a bundled stemmer based in the Porter algorithm will be used (this is also not required if you are using the PostgreSQL backend). Get the Snowball bindings package from http://snowball.tartarus.org/wrappers/PyStemmer-1.0.1.tar.gz

Usage example

Add the fts app to your settings.py file and optionally configure a fts backend (simple by default):

INSTALLED_APPS =(
  #...
  'fts')#FTS_BACKEND = 'pgsql://' # or 'dummy://' or 'simple://'

Assume that we have this model in our imaginary application:

from django.db import models
classBlog(models.Model):
  title = models.CharField(max_length=100)
  body = models.TextField()
  def __unicode__(self):
    return u"%s"%(self.title)

And we want to apply full text search functionality for model Blog. You need to subclass your model from fts.SearchableModule instead of from django.db.models.Model. The new module may look like this:

from django.db import models
import fts
classBlog(fts.SearchableModel):
  title = models.CharField(max_length=100)
  body = models.TextField()
  # Defining a SearchManager without fields will use all CharFields and TextFields.
  # This is the default and you do not need to explicitly add the following line:
  # objects = fts.SearchManager()
  # You can pass a list of fields that should be indexed
  # objects = fts.SearchManager( fields=('title','body') )
  # The fields you pass as parameters can be foreign fields ('myfield__foreign_field')
  # or even functions (functions should receive the instance as the only parameter)
 
  # You may also specify fields as a dictionary, mapping each field to a weight for ranking purposes
  # see http://www.postgresql.org/docs/8.3/static/textsearch-features.html#TEXTSEARCH-MANIPULATE-TSVECTOR
  #objects = SearchManager( fields={
  #  'title': 'A',
  #  'body': 'B',
  #} )
  def __unicode__(self):
    return u"%s"%(self.title)

In the django shell create some instances of models:

python ./manage.py shell

>>>from core.models importBlog>>> p =Blog(title='This is the title', body='The body of the article')>>> p.save()>>> p =Blog(title='This is the second title', body='The body of another article in the blog')>>> p.save()>>> p =Blog(title='This is the third title', body='The body of yet another simple article')>>> p.save()

Now perform a search:

>>> result =Blog.objects.search('simple').all()>>> result.count()1>>> result
[<Blog:Thisis the third title>]

Additional information

You can force an index update to all or some instances:

>>> p.update_index()>>>Blog.objects.update_index()>>>Blog.objects.update_index(pk=1)>>>Blog.objects.update_index(pk=[1,2])

You can omit the search function and make the search directly:

>>> result =Blog.objects('simple')>>> result.count()1>>> result
[<Blog:Thisis the third title>]

PostgreSQL specific information

The PostgreSQL backend is heavily based in the code from http://www.djangosnippets.org/snippets/1328/ by Dan Watson.

If using the pgsql backend, don't forget to add a Gin or GiST index to your tables: http://www.postgresql.org/docs/8.3/static/textsearch-indexes.html

Example

CREATE INDEX "tablename_search_index" ON "tablename" USING gin("search_index");

Note: You should index the search_index column, not your text or char columns.

This is a generic Full Text Search engine for Django projects

Currently implements three backends: dummy, simple and pgsql.

It should be possible to easily integrate MySQL, Sphinx and Xapian backends too.

Install

To install the latest version:

svn checkout http://django-fts.googlecode.com/svn/trunk/ django-fts
cd django-fts
python setup.py install

Note: You will need to install the Snowball python bindings if you want to use the snowball stemmer. If you don't a bundled stemmer based in the Porter algorithm will be used (this is also not required if you are using the PostgreSQL backend). Get the Snowball bindings package from http://snowball.tartarus.org/wrappers/PyStemmer-1.0.1.tar.gz

Usage example

Add the fts app to your settings.py file and optionally configure a fts backend (simple by default):

INSTALLED_APPS =(
  #...
  'fts')#FTS_BACKEND = 'pgsql://' # or 'dummy://' or 'simple://'

Assume that we have this model in our imaginary application:

from django.db import models
classBlog(models.Model):
  title = models.CharField(max_length=100)
  body = models.TextField()
  def __unicode__(self):
    return u"%s"%(self.title)

And we want to apply full text search functionality for model Blog. You need to subclass your model from fts.SearchableModule instead of from django.db.models.Model. The new module may look like this:

from django.db import models
import fts
classBlog(fts.SearchableModel):
  title = models.CharField(max_length=100)
  body = models.TextField()
  # Defining a SearchManager without fields will use all CharFields and TextFields.
  # This is the default and you do not need to explicitly add the following line:
  # objects = fts.SearchManager()
  # You can pass a list of fields that should be indexed
  # objects = fts.SearchManager( fields=('title','body') )
  # The fields you pass as parameters can be foreign fields ('myfield__foreign_field')
  # or even functions (functions should receive the instance as the only parameter)
 
  # You may also specify fields as a dictionary, mapping each field to a weight for ranking purposes
  # see http://www.postgresql.org/docs/8.3/static/textsearch-features.html#TEXTSEARCH-MANIPULATE-TSVECTOR
  #objects = SearchManager( fields={
  #  'title': 'A',
  #  'body': 'B',
  #} )
  def __unicode__(self):
    return u"%s"%(self.title)

In the django shell create some instances of models:

python ./manage.py shell

>>>from core.models importBlog>>> p =Blog(title='This is the title', body='The body of the article')>>> p.save()>>> p =Blog(title='This is the second title', body='The body of another article in the blog')>>> p.save()>>> p =Blog(title='This is the third title', body='The body of yet another simple article')>>> p.save()

Now perform a search:

>>> result =Blog.objects.search('simple').all()>>> result.count()1>>> result
[<Blog:Thisis the third title>]

Additional information

You can force an index update to all or some instances:

>>> p.update_index()>>>Blog.objects.update_index()>>>Blog.objects.update_index(pk=1)>>>Blog.objects.update_index(pk=[1,2])

You can omit the search function and make the search directly:

>>> result =Blog.objects('simple')>>> result.count()1>>> result
[<Blog:Thisis the third title>]

PostgreSQL specific information

The PostgreSQL backend is heavily based in the code from http://www.djangosnippets.org/snippets/1328/ by Dan Watson.

If using the pgsql backend, don't forget to add a Gin or GiST index to your tables: http://www.postgresql.org/docs/8.3/static/textsearch-indexes.html

Example

CREATE INDEX "tablename_search_index" ON "tablename" USING gin("search_index");

Note: You should index the search_index column, not your text or char columns.

Your best bet is to use Django raw querysets, I use it with MySQL to perform full text matching. If the data is all in the database and Postgres provides the matching capability then it makes sense to use it. Plus Postgres offers some really useful things in terms of stemming etc with full text queries.

Basically it lets you write the actual query you want yet returns models (as long as you are querying a model table obviously).

The advantage this gives you is that you can test the exact query you will be using first in Postgres, the documentation covers full text queries pretty well.

The main gotcha with raw querysets at the moment is they don't support count. So if you will be returning lots of data and have memory constraints on your application you might need to do something clever.

search¶

A boolean full-text search, taking advantage of full-text indexing. This is
like contains but is significantly faster due to full-text indexing.

Example:

Entry.objects.filter(headline__search="+Django -jazz Python")

SQL equivalent:

SELECT ... WHERE MATCH(tablename, headline) AGAINST (+Django -jazz Python IN BOOLEAN MODE);

Note this is only available in MySQL and requires direct manipulation of the
database to add the full-text index. By default Django uses BOOLEAN MODE for
full text searches. Please check MySQL documentation for additional details.

12.3. Controlling Text Search

To implement full text searching there must be a function to
create a tsvector from a document and a
tsquery from a user query. Also, we need to
return results in a useful order, so we need a function that
compares documents with respect to their relevance to the query.
It's also important to be able to display the results nicely.
PostgreSQL provides support for
all of these functions.

12.3.1. Parsing
Documents

PostgreSQL provides the
function to_tsvector for
converting a document to the tsvector
data type.

    to_tsvector([ config regconfig, ] document text) returns tsvector

to_tsvector parses a textual
document into tokens, reduces the tokens to lexemes, and
returns a tsvector which lists the
lexemes together with their positions in the document. The
document is processed according to the specified or default
text search configuration. Here is a simple example:

SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
to_tsvector
-----------------------------------------------------
'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4

In the example above we see that the resulting tsvector does not contain the words a, on, or it, the word rats
became rat, and the punctuation sign
- was ignored.

The to_tsvector function
internally calls a parser which breaks the document text into
tokens and assigns a type to each token. For each token, a list
of dictionaries (Section
12.6) is consulted, where the list can vary depending on
the token type. The first dictionary that recognizes the token emits one or more
normalized lexemes to represent the
token. For example, rats became
rat because one of the dictionaries
recognized that the word rats is a
plural form of rat. Some words are
recognized as stop words (Section
12.6.1), which causes them to be ignored since they occur
too frequently to be useful in searching. In our example these
are a, on,
and it. If no dictionary in the list
recognizes the token then it is also ignored. In this example
that happened to the punctuation sign - because there are in fact no dictionaries
assigned for its token type (Space
symbols
), meaning space tokens will never be indexed. The
choices of parser, dictionaries and which types of tokens to
index are determined by the selected text search configuration
(Section 12.7). It
is possible to have many different configurations in the same
database, and predefined configurations are available for
various languages. In our example we used the default
configuration english for the English
language.

The function setweight can be
used to label the entries of a tsvector
with a given weight, where a weight is
one of the letters A, B, C, or D. This is typically used to mark entries coming
from different parts of a document, such as title versus body.
Later, this information can be used for ranking of search
results.

Because to_tsvector(NULL) will return NULL,
it is recommended to use coalesce
whenever a field might be null. Here is the recommended method
for creating a tsvector from a structured
document:

UPDATE tt SET ti =
setweight(to_tsvector(coalesce(title,'')), 'A')    ||
setweight(to_tsvector(coalesce(keyword,'')), 'B')  ||
setweight(to_tsvector(coalesce(abstract,'')), 'C') ||
setweight(to_tsvector(coalesce(body,'')), 'D');

Here we have used setweight to
label the source of each lexeme in the finished tsvector, and then merged the labeled tsvector values using the tsvector concatenation operator ||. (Section
12.4.1 gives details about these operations.)

12.3.2. Parsing Queries

PostgreSQL provides the
functions to_tsquery and
plainto_tsquery for converting a
query to the tsquery data type.
to_tsquery offers access to more
features than plainto_tsquery,
but is less forgiving about its input.

    to_tsquery([ config regconfig, ] querytext text) returns tsquery

to_tsquery creates a
tsquery value from querytext, which must consist of single
tokens separated by the Boolean operators & (AND), | (OR) and
! (NOT). These operators can be
grouped using parentheses. In other words, the input to
to_tsquery must already follow
the general rules for tsquery input, as
described in Section
8.11. The difference is that while basic tsquery input takes the tokens at face value,
to_tsquery normalizes each token
to a lexeme using the specified or default configuration, and
discards any tokens that are stop words according to the
configuration. For example:

SELECT to_tsquery('english', 'The & Fat & Rats');
to_tsquery
---------------
'fat' & 'rat'

As in basic tsquery input, weight(s)
can be attached to each lexeme to restrict it to match only
tsvector lexemes of those weight(s). For
example:

SELECT to_tsquery('english', 'Fat | Rats:AB');
to_tsquery
------------------
'fat' | 'rat':AB

to_tsquery can also accept
single-quoted phrases. This is primarily useful when the
configuration includes a thesaurus dictionary that may trigger
on such phrases. In the example below, a thesaurus contains the
rule supernovae stars : sn:

SELECT to_tsquery('''supernovae stars'' & !crab');
to_tsquery
---------------
'sn' & !'crab'

Without quotes, to_tsquery
will generate a syntax error for tokens that are not separated
by an AND or OR operator.

    plainto_tsquery([ config regconfig, ] querytext text) returns tsquery

plainto_tsquery transforms
unformatted text querytext to
tsquery. The text is parsed and
normalized much as for to_tsvector, then the & (AND) Boolean operator is inserted between
surviving words.

Example:

 SELECT plainto_tsquery('english', 'The Fat Rats');
plainto_tsquery
-----------------
'fat' & 'rat'

Note that plainto_tsquery
cannot recognize either Boolean operators or weight labels in
its input:

SELECT plainto_tsquery('english', 'The Fat & Rats:C');
plainto_tsquery
---------------------
'fat' & 'rat' & 'c'

Here, all the input punctuation was discarded as being space
symbols.

12.3.3. Ranking Search Results

Ranking attempts to measure how relevant documents are to a
particular query, so that when there are many matches the most
relevant ones can be shown first. PostgreSQL provides two predefined ranking
functions, which take into account lexical, proximity, and
structural information; that is, they consider how often the
query terms appear in the document, how close together the
terms are in the document, and how important is the part of the
document where they occur. However, the concept of relevancy is
vague and very application-specific. Different applications
might require additional information for ranking, e.g. document
modification time. The built-in ranking functions are only
examples. You can write your own ranking functions and/or
combine their results with additional factors to fit your
specific needs.

The two ranking functions currently available are:

        ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4

Standard ranking function.

        ts_rank_cd([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4

This function computes the cover
density
ranking for the given document vector and
query, as described in Clarke, Cormack, and Tudhope's
"Relevance Ranking for One to Three Term Queries" in the
journal "Information Processing and Management",
1999.

This function requires positional information in its
input. Therefore it will not work on "stripped" tsvector
values — it will always return zero.

For both these functions, the optional weights argument offers the ability to
weigh word instances more or less heavily depending on how they
are labeled. The weight arrays specify how heavily to weigh
each category of word, in the order:

{D-weight, C-weight, B-weight, A-weight}

If no weights are provided,
then these defaults are used:

{0.1, 0.2, 0.4, 1.0}

Typically weights are used to mark words from special areas
of the document, like the title or an initial abstract, so that
they can be treated as more or less important than words in the
document body.

Since a longer document has a greater chance of containing a
query term it is reasonable to take into account document size,
e.g. a hundred-word document with five instances of a search
word is probably more relevant than a thousand-word document
with five instances. Both ranking functions take an integer
normalization option that
specifies whether and how a document's length should impact its
rank. The integer option controls several behaviors, so it is a
bit mask: you can specify one or more behaviors using
| (for example, 2|4).

  • 0 (the default) ignores the document length

  • 1 divides the rank by 1 + the logarithm of the document
    length

  • 2 divides the rank by the document length

  • 4 divides the rank by the mean harmonic distance between
    extents (this is implemented only by ts_rank_cd)

  • 8 divides the rank by the number of unique words in
    document

  • 16 divides the rank by 1 + the logarithm of the number
    of unique words in document

  • 32 divides the rank by itself + 1

If more than one flag bit is specified, the transformations
are applied in the order listed.

It is important to note that the ranking functions do not
use any global information, so it is impossible to produce a
fair normalization to 1% or 100% as sometimes desired.
Normalization option 32 (rank/(rank+1)) can be applied to scale all ranks
into the range zero to one, but of course this is just a
cosmetic change; it will not affect the ordering of the search
results.

Here is an example that selects only the ten highest-ranked
matches:

SELECT title, ts_rank_cd(textsearch, query) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC LIMIT 10;
title                     |   rank
-----------------------------------------------+----------
Neutrinos in the Sun                          |      3.1
The Sudbury Neutrino Detector                 |      2.4
A MACHO View of Galactic Dark Matter          |  2.01317
Hot Gas and Dark Matter                       |  1.91171
The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953
Rafting for Solar Neutrinos                   |      1.9
NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774
Hot Gas and Dark Matter                       |   1.6123
Ice Fishing for Cosmic Neutrinos              |      1.6
Weak Lensing Distorts the Universe            | 0.818218

This is the same example using normalized ranking:

SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE  query @@ textsearch
ORDER BY rank DESC LIMIT 10;
title                     |        rank
-----------------------------------------------+-------------------
Neutrinos in the Sun                          | 0.756097569485493
The Sudbury Neutrino Detector                 | 0.705882361190954
A MACHO View of Galactic Dark Matter          | 0.668123210574724
Hot Gas and Dark Matter                       |  0.65655958650282
The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
Rafting for Solar Neutrinos                   | 0.655172410958162
NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637
Hot Gas and Dark Matter                       | 0.617195790024749
Ice Fishing for Cosmic Neutrinos              | 0.615384618911517
Weak Lensing Distorts the Universe            | 0.450010798361481

Ranking can be expensive since it requires consulting the
tsvector of each matching document, which
can be I/O bound and therefore slow. Unfortunately, it is
almost impossible to avoid since practical queries often result
in large numbers of matches.

12.3.4. Highlighting Results

To present search results it is ideal to show a part of each
document and how it is related to the query. Usually, search
engines show fragments of the document with marked search
terms. PostgreSQL provides a
function ts_headline that
implements this functionality.

    ts_headline([ config regconfig, ] document text, query tsquery [, options text ]) returns text

ts_headline accepts a document
along with a query, and returns an excerpt from the document in
which terms from the query are highlighted. The configuration
to be used to parse the document can be specified by config; if config is omitted, the default_text_search_config configuration is
used.

If an options string is
specified it must consist of a comma-separated list of one or
more option=value pairs.
The available options are:

  • StartSel, StopSel: the strings with which query words
    appearing in the document should be delimited to
    distinguish them from other excerpted words.

  • MaxWords, MinWords: these numbers determine the
    longest and shortest headlines to output.

  • ShortWord: words of this length
    or less will be dropped at the start and end of a headline.
    The default value of three eliminates the English
    articles.

  • HighlightAll: Boolean flag; if
    true the whole document will be
    highlighted.

Any unspecified options receive these defaults:

StartSel=<b>, StopSel=</b>, MaxWords=35, MinWords=15, ShortWord=3, HighlightAll=FALSE

For example:

SELECT ts_headline('english', 'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.', to_tsquery('query & similarity'));
ts_headline
------------------------------------------------------------
given <b>query</b> terms
and return them in order of their <b>similarity</b> to the
<b>query</b>.
SELECT ts_headline('english', 'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
to_tsquery('query & similarity'),
'StartSel = <, StopSel = >');
ts_headline
-------------------------------------------------------
given <query> terms
and return them in order of their <similarity> to the
<query>.

ts_headline uses the original
document, not a tsvector summary, so it
can be slow and should be used with care. A typical mistake is
to call ts_headline for
every matching
document when only ten documents are to be shown.
SQL subqueries can help;
here is an example:

SELECT id, ts_headline(body, q), rank
FROM (SELECT id, body, q, ts_rank_cd(ti, q) AS rank
FROM apod, to_tsquery('stars') q
WHERE ti @@ q
ORDER BY rank DESC LIMIT 10) AS foo;

发表回复