Zend_Search_Lucene: Limit results and some other questions

Alexander Veremyev

2008-01-10 17:58:33 UTC

Hi Ralf,

Hi,
I would like to start using Zend_Search_Lucene for the website search
engine of our travel community. We have a couple of different areas
which need to be searchable: forum, articles, gallery, destinations, and
maybe members. It should be possible to just search in one area, for
example the forum or the destinations. But it should also be possible to
search in all areas in one step.

The Lucene document model is very flexible. You can store (and index) additional document attributes in special fields.
For example, a special 'area' field may contain one or more terms that define the document's area.
You can add an 'area' clause to limit searching: "($userQuery) AND area:gallery"
Or do it through the API:
-------------------
$parsedQuery = Zend_Search_Lucene_Search_QueryParser::parse($userQuery);

$query = new Zend_Search_Lucene_Search_Query_Boolean();
$query->addSubquery($parsedQuery, true /* required */);

$areaTerm = new Zend_Search_Lucene_Index_Term('gallery', 'area');
$areaQuery = new Zend_Search_Lucene_Search_Query_Term($areaTerm);

$query->addSubquery($areaQuery, true /* required */);

$hits = $index->find($query);
--------------------

See the documentation for details (http://framework.zend.com/manual/en/zend.search.lucene.html).
The ZendCon presentation may also be helpful (http://devzone.zend.com/content/zendcon_07_slides/Evron_Shahar_Indexing_With_Zend_Search_Lucene-ZendCon07.pdf).

Another requirement is to limit the results based on the current
selected destination. For example, if the user is searching in the
Greece forum, the index should be searched for all documents from type
forum and for all destinations in Greece. The destination data structure
is kept as a binary tree in a MySQL database.

I think the best way is to store the whole branch in a special document field.

For example, suppose a document has Athens as its destination. The idea is to store the whole tree path, 'Europe Greece Athens':
-----------------
$doc->addField(Zend_Search_Lucene_Field::UnStored('destination', 'Europe Greece Athens'));
--------

That makes it possible to search documents (or limit search results) efficiently at any level of the destination tree.

If some node name is not unique, you can specify the full path with a phrase query:
"($userQuery) AND destination:\"Europe Greece\""
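
A minimal sketch of building that clause from a tree path taken from the database. The helper name buildDestinationClause() is hypothetical (it is not part of Zend_Search_Lucene), and it assumes node names are plain words; it simply drops double quotes so the phrase stays one quoted string.

```php
<?php
/**
 * Hypothetical helper: builds a phrase clause for the 'destination' field
 * from a list of tree node names, root first.
 */
function buildDestinationClause(array $path)
{
    // Join node names in tree order: array('Europe', 'Greece') -> 'Europe Greece'
    $phrase = implode(' ', $path);
    // Simplification: remove double quotes instead of escaping them
    $phrase = str_replace('"', '', $phrase);
    return 'destination:"' . $phrase . '"';
}

$userQuery   = 'beach hotel';
$queryString = '(' . $userQuery . ') AND ' . buildDestinationClause(array('Europe', 'Greece'));
echo $queryString, "\n";
```

The resulting string could then be passed to the query parser in the same way as the 'area' example.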

Hm... Good candidate for the Best Practice documentation section.

Another way is to store only the final destination in the destination field and construct an additional subquery on the fly (using the MySQL data).

The last important requirement is to limit the search results for
pagination, so on page 1 I only want to show results 1 to 10, on page 2
the results 11 to 20, and so on.

Lucene (and Zend_Search_Lucene) needs to process the whole result set to calculate hit scores and return hits in the right order.
Returned hit objects contain only internal document IDs and need no additional resources as long as you don't access stored document fields (!). When you do, the document is automatically loaded from the index.

The result set limit (Zend_Search_Lucene::setResultSetLimit($newLimit)) is intended for really huge result sets (tens or hundreds of thousands of hits). It returns the "first N hits" instead of the "best N hits" (the returned hits are still ordered by score).
That's not the best behavior, but it may help in some cases. It's not suitable for pagination.

The right way to implement pagination is to collect all document IDs from the result set somewhere (without accessing any stored field!) and retrieve documents from the index only when necessary:
------------------------------
// Frontend options are elided here; saving arrays needs 'automatic_serialization' => true
$frontendOptions = array(...);
$backendOptions  = array('cache_dir' => './tmp/');
$cache = Zend_Cache::factory('Core', 'File', $frontendOptions, $backendOptions);

// Run the search only on a cache miss; keep IDs and scores, never stored fields
if (!$result = $cache->load('myresult')) {
    $result = array();
    $scores = array();
    foreach ($index->find($query) as $hit) {
        $result[] = $hit->id;
        $scores[] = $hit->score;
    }

    $cache->save($result, 'myresult');
    $cache->save($scores, 'myscores');
} else {
    $scores = $cache->load('myscores');
}

// Output $docsPerPage documents starting from $startResultID
for ($resultID = $startResultID;
     isset($result[$resultID]) && $resultID < $startResultID + $docsPerPage;
     $resultID++) {
    // Only now is the document loaded from the index
    $doc = $index->getDocument($result[$resultID]);

    ...
    echo $doc->url;
    ...
    echo $scores[$resultID];
}
...
-------------------------
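
If the page number rather than a start offset comes from the request, the slice of cached IDs can be computed separately. This is only a sketch: pageOfIds() is a hypothetical helper name, and it assumes 1-based page numbers and an ID list that is already in score order.

```php
<?php
/**
 * Hypothetical helper: returns the cached document IDs for one page.
 * $page is 1-based; the IDs keep their original (score) order.
 */
function pageOfIds(array $ids, $page, $docsPerPage)
{
    // Pages below 1 are treated as page 1
    $offset = (max(1, (int) $page) - 1) * $docsPerPage;
    return array_slice($ids, $offset, $docsPerPage);
}

// Example: 25 cached document IDs, 10 per page
$result = range(100, 124);
print_r(pageOfIds($result, 3, 10)); // the last page holds the remaining 5 IDs
```

Each returned ID would then be passed to $index->getDocument() as in the loop above. Note that array_slice() renumbers the keys, so the matching scores have to be sliced with the same offset.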

My first idea to solve this is to add two additional fields to each document:
a) field 'area' which can only have one of the values 'forum',
'article', 'gallery', 'destination' or 'member'

Yes, that's the right way to do this.
The only thing I would recommend is to watch performance with low-selectivity fields.
For example, if you have a large enough document set (hundreds of thousands of documents) and a field like 'sex', then the engine has to construct a list containing about half of the indexed documents to intersect it with the rest of the query result.

It may be more efficient to retrieve the full result set and then filter it by checking the field value.
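
A rough sketch of such post-filtering. The helper name filterHitsByField() is hypothetical, and it assumes each hit exposes the field as a property, which for real hits means the field must be stored in the index. Reading the field also loads the document, so this only pays off when the unfiltered result set is reasonably small.

```php
<?php
/**
 * Hypothetical helper: keeps only the hits whose field equals $value.
 */
function filterHitsByField($hits, $field, $value)
{
    $filtered = array();
    foreach ($hits as $hit) {
        // With real hits this property access loads the stored document
        if ($hit->$field === $value) {
            $filtered[] = $hit;
        }
    }
    return $filtered;
}

// Stand-in objects instead of real hits, so the sketch runs without an index
$hits = array(
    (object) array('id' => 1, 'area' => 'forum'),
    (object) array('id' => 2, 'area' => 'gallery'),
    (object) array('id' => 3, 'area' => 'forum'),
);
foreach (filterHitsByField($hits, 'area', 'forum') as $hit) {
    echo $hit->id, "\n";
}
```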

b) field 'destination' which will be filled with a string that combines
the destination hierarchy of the destination primary keys, e.g.
'Athens' has key 322, 'Greece' has key 44 and 'South Europe' key 40,
so for an article about 'Athens' this field would be filled with the
value '40-44-322-'. If I want to search for all 'Greece' articles I
will search for '40-44-*'

Yes, that's also a way to do this.
You only need to a) switch the default analyzer so that terms with numbers are indexed:
---------------------------------
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive());
---------------
b) use a letter as the delimiter (otherwise the value will be tokenized into several terms), or use your own analyzer that treats '-' as part of a term.
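
A small sketch of option b) with a letter delimiter. The helper name buildDestinationKey() and the choice of 'x' as the delimiter are assumptions for illustration only; the wildcard form mirrors the '40-44-*' search from the question.

```php
<?php
/**
 * Hypothetical helper: joins destination primary keys (root first) into one
 * alphanumeric term, e.g. array(40, 44, 322) -> '40x44x322x'.
 */
function buildDestinationKey(array $ids, $delimiter = 'x')
{
    $key = '';
    foreach ($ids as $id) {
        // The trailing delimiter keeps the prefix '40x44x' from matching '40x445x...'
        $key .= (int) $id . $delimiter;
    }
    return $key;
}

// Value to index for an article about Athens
$fieldValue = buildDestinationKey(array(40, 44, 322));
// Query clause for everything below Greece
$prefixQuery = 'destination:' . buildDestinationKey(array(40, 44)) . '*';
echo $fieldValue, "\n", $prefixQuery, "\n";
```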

Besides these fields I want to add some further fields.
c) the document text to be indexed and searched
d) the document url
e) the page title
f) the date of the document (last changed)
1) Does my approach for the limitation of the area and the destination
fields make sense or did I overlook something?

That's the right way.

2) I am not quite sure which field types I should use for the six
fields mentioned above. Any suggestions?

My suggestion:

the document text - UnStored
the document url - UnIndexed (if you don't plan to search by URL)
the page title - Text
the date of the document - Keyword

But it may depend on other things.

3) Does it make sense to create one index for each area to improve
performance? If yes, I might forget about the all-area search
facility.

That depends.
Search time is generally composed of the following operations:
1. Index opening.
2. Executing subqueries (your base query + area specification subquery)
3. Subqueries result set intersection.

1. Index opening actually means preloading the dictionary index (usually every 128th dictionary term, with binary search ability) and is performed at the first query execution.
A larger index doesn't always have a larger dictionary (if the dictionary is full).
So opening a large index may take nearly the same time as opening an area sub-index.

On the other hand, searching through several sub-indexes multiplies the index opening time (depending on the number of sub-indexes you are using).

It may take significant time.
For example, for some simple queries index opening may take 0.040 sec and the search itself 0.002 sec.

2. I estimate base query execution time as a linear function of index size. The question is whether it's comparable to the index opening time.

The area specification subquery takes a fixed time that depends on the size of the areas.

3. Subquery result set intersection is efficient now, but it also takes time.

Only tests with your production data will give the right answers.

PS: This question is related to index optimization. Some tips can be found here:
http://framework.zend.com/manual/en/zend.search.lucene.index-creation.html#zend.search.lucene.index-creation.optimization

http://framework.zend.com/manual/en/zend.search.lucene.best-practice.html#zend.search.lucene.best-practice.indexing-performance

It's also planned to implement a multi-searcher to search through several indices (http://framework.zend.com/issues/browse/ZF-525). It may help in the future with multi-index configurations.

4) It might be a slight overhead to use Zend_Search_Lucene to search
for a destination which basically only consists of the destination
name. So using a simple search directly in the MySQL database for
this area might be faster and would not need any indexing. What do
others think about this?

That depends on the index size, the nature of your data, the size of the destination tree, destination cardinality (the typical number of documents per destination), and so on.
Only tests with actual data can give the right answer.

5) The documentation shows a way to limit the total amount of results.
But I did not find a way how to set an offset to limit the results
for pagination. Do I really need to fetch all results and then handle
the pagination in my controller, which would mean that each request
will return all results?

See above.

With best regards,
Alexander Veremyev.

Thanks for your comments and help.
Best Regards,
Ralf
