Own analyzer

François Parmentier

2007-07-04 15:52:05 UTC

Unfortunately, it does not seem to work with the Standard Analyzer, nor with any
other Analyzer I tested.
My index contains a 'conference-meeting-part' term in a 'type' field, which
appears in 646 documents according to Luke (and according to my own
count, as I also store this information in a MySQL database).

When I search for type:"conference-meeting-part", Zend_Search_Lucene reports that no
document is retrieved.
The same happens when I escape the '-': type:"conference\-meeting\-part"

Is this normal behavior?

You can test on this temporary URL:
http://peignier.veille.inist.fr/~niderlin/public/?notices=index&query=type%3A%22conference-meeting-part%22
(ZF 1.0.0RC2 on this site, but same results with 1.0.0)

Thanks in advance,
--
François Parmentier

I think it is very important. In English (and in Czech too) there are so
many words containing the '-' character. As I wrote: " -" or " +" (whitespace
then minus, whitespace then plus) = query operator. I never studied how Java
Lucene does this, but this could be OK.

Yes, I've just checked that Java Lucene treats '+case-law' as '+"case
law"' (with the default analyzer). Thus it treats '-' inside a lexeme as a
common character (which the default analyzer then breaks into two words).
I'll change this in Zend_Search_Lucene.

That's done.
"-" and "+" inside lexemes are now treated as common characters (and then
processed by the Analyzer).
Committed to SVN.
With best regards,
Alexander Veremyev.

--
View this message in context: http://www.nabble.com/Own-analyzer-tf2724259s16154.html#a11433283
Sent from the Zend MFS mailing list archive at Nabble.com.

Alexander Veremyev

2007-07-04 20:14:39 UTC

Hi François,

It looks like you used the 'Keyword' field type for this field.
Keyword fields are not tokenized, so the whole field value is stored as one term.

That is why you can see the 'conference-meeting-part' term in the dictionary.
Otherwise it would be three terms: 'conference', 'meeting' and 'part'.

The query parser treats 'conference-meeting-part' as one lexeme, but the
analyzer then splits it into three words, none of which matches the stored term.
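
To make the mismatch concrete, here is a minimal sketch in plain PHP with no Zend classes involved. The function splitLikeDefaultAnalyzer() is a hypothetical stand-in that assumes a letters-only, lower-casing analyzer; it is not the library's actual code.

```php
<?php
// Hypothetical stand-in for a letters-only, case-insensitive analyzer.
// This is an illustration, not the Zend_Search_Lucene implementation.
function splitLikeDefaultAnalyzer($text)
{
    preg_match_all('/[a-zA-Z]+/', $text, $matches);
    return array_map('strtolower', $matches[0]);
}

// A Keyword field stores its value as one untokenized term.
$indexedTerms = array('conference-meeting-part');

// The query text goes through the analyzer and is split at each '-'.
$queryTerms = splitLikeDefaultAnalyzer('conference-meeting-part');

// None of the query words equals the stored term, so nothing is found.
$matching = array_intersect($queryTerms, $indexedTerms);

print_r($queryTerms);
var_dump(count($matching));
```

Under these assumptions the query side yields 'conference', 'meeting' and 'part', while the index side only knows 'conference-meeting-part', which is why the search comes back empty.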

There are three ways to solve this.

1. You can re-index your data using 'Text' field type instead of 'Keyword'.

The Java Lucene query parser automatically recognizes the field type and uses
a special Keyword analyzer for keyword fields.
The Zend_Search_Lucene query parser architecture differs from the Java version.
This allows Zend_Search_Lucene to search through all
fields by default, but makes field type recognition difficult.
This is registered as a JIRA issue, and I already have some ideas about how it
could be done.

2. You can add the 'conference-meeting-part' term to the query using the query API:
---------------
$term = new Zend_Search_Lucene_Index_Term('conference-meeting-part', 'type');
$query = new Zend_Search_Lucene_Search_Query_Term($term);
$hits = $index->find($query);
...
------------

You can also combine this with a user-entered query:
---------------
// $query holds the user-entered query string
$parsedQuery = Zend_Search_Lucene_Search_QueryParser::parse($query);

$term = new Zend_Search_Lucene_Index_Term('conference-meeting-part', 'type');
$typeSubquery = new Zend_Search_Lucene_Search_Query_Term($term);

$fullQuery = new Zend_Search_Lucene_Search_Query_Boolean();
$fullQuery->addSubquery($parsedQuery, true /* required */);
$fullQuery->addSubquery($typeSubquery, true /* required */);

// search with the combined query, not the raw user string
$hits = $index->find($fullQuery);
...
------------

3. You can use your own analyzer, which treats '-' as a letter.
I think it's the best way if you prefer not to break words at '-'.
---------------
require_once 'Zend/Search/Lucene.php';

class MyAnalyzer extends Zend_Search_Lucene_Analysis_Analyzer_Common
{
    /**
     * Current position in a stream
     *
     * @var integer
     */
    private $_position;

    /**
     * Reset token stream
     */
    public function reset()
    {
        $this->_position = 0;

        if ($this->_input === null) {
            return;
        }

        // convert input into ascii
        $this->_input = iconv($this->_encoding, 'ASCII//TRANSLIT', $this->_input);
        $this->_encoding = 'ASCII';
    }

    /**
     * Tokenization stream API
     * Get next token
     * Returns null at the end of stream
     *
     * @return Zend_Search_Lucene_Analysis_Token|null
     */
    public function nextToken()
    {
        if ($this->_input === null) {
            return null;
        }

        do {
            if (! preg_match('/[a-zA-Z0-9]+/', $this->_input, $match,
                             PREG_OFFSET_CAPTURE, $this->_position)) {
                // It covers both cases:
                // a) there are no matches (preg_match(...) === 0)
                // b) an error occurred (preg_match(...) === FALSE)
                return null;
            }

            $str = $match[0][0];
            $pos = $match[0][1];
            $endpos = $pos + strlen($str);

            $this->_position = $endpos;

            $token = $this->normalize(
                new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos));
        } while ($token === null); // try again if token is skipped

        return $token;
    }
}

class MyAnalyzerCaseInsensitive extends MyAnalyzer
{
    public function __construct()
    {
        $this->addFilter(new Zend_Search_Lucene_Analysis_TokenFilter_LowerCase());
    }
}

Zend_Search_Lucene_Analysis_Analyzer::setDefault(new MyAnalyzerCaseInsensitive());

...
------------

With best regards,
Alexander Veremyev.

François Parmentier

2007-07-05 08:43:42 UTC

Post by Alexander Veremyev
Hi François,

Hi Alexander.
Thanks for answering so quickly.

Post by Alexander Veremyev
It looks like you used the 'Keyword' field type for this field.
Keyword fields are not tokenized, so the whole field value is stored as one term.

Exactly: I used "Keyword", as I thought that "conference-meeting-part" should
not be tokenized.
To me, it's a single (key)word.

Post by Alexander Veremyev
That is why you can see the 'conference-meeting-part' term in the dictionary.
Otherwise it would be three terms: 'conference', 'meeting' and 'part'.
The query parser treats 'conference-meeting-part' as one lexeme, but the
analyzer then splits it into three words, none of which matches the stored term.

That's the point I missed!

Post by Alexander Veremyev
There are three ways to solve this.
1. You can re-index your data using 'Text' field type instead of 'Keyword'.

That works!

Post by Alexander Veremyev
The Java Lucene query parser automatically recognizes the field type and uses
a special Keyword analyzer for keyword fields.
The Zend_Search_Lucene query parser architecture differs from the Java version.
This allows Zend_Search_Lucene to search through all
fields by default, but makes field type recognition difficult.
This is registered as a JIRA issue, and I already have some ideas about how it
could be done.
---------------
$term = new Zend_Search_Lucene_Index_Term('conference-meeting-part', 'type');
$query = new Zend_Search_Lucene_Search_Query_Term($term);
$hits = $index->find($query);
...
------------
---------------
$parsedQuery = Zend_Search_Lucene_Search_QueryParser::parse($query);
$term = new Zend_Search_Lucene_Index_Term('conference-meeting-part', 'type');
$typeSubquery = new Zend_Search_Lucene_Search_Query_Term($term);
$fullQuery = new Zend_Search_Lucene_Search_Query_Boolean();
$fullQuery->addSubquery($parsedQuery, true /* required */);
$fullQuery->addSubquery($typeSubquery, true /* required */);
$hits = $index->find($fullQuery);
...
------------

I don't want to do that, as I want to keep the program generic, working for any
type of index (my application creates indexes dynamically) and any content.

Post by Alexander Veremyev
3. You can use your own analyzer, which treats '-' as a letter.
I think it's the best way if you prefer not to break words at '-'.
---------------8<----------
    /**
     * Tokenization stream API
     * Get next token
     * Returns null at the end of stream
     *
     */
    public function nextToken()
    {
        if ($this->_input === null) {
            return null;
        }
        do {
            if (! preg_match('/[a-zA-Z0-9]+/', $this->_input, $match,
                             PREG_OFFSET_CAPTURE, $this->_position)) {
                // It covers both cases:
                // a) there are no matches (preg_match(...) === 0)
                // b) an error occurred (preg_match(...) === FALSE)
                return null;
            }
------------8<------------

I may have misunderstood, but I think a "\-" is missing from the character
class of the regex...

Alexander Veremyev

2007-07-05 10:28:54 UTC

Post by François Parmentier

I may have misunderstood, but I think a "\-" is missing from the character
class of the regex...

Damn! :) I didn't copy it from a test script.

You are right. That of course should be '/[a-zA-Z0-9_\-]+/' or something
like this.

With best regards,
Alexander Veremyev.
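
For readers who want to check the effect of the character class, here is a minimal sketch in plain PHP that reuses only the matching loop from nextToken(). The tokenize() helper is hypothetical and exists only for this illustration; it assumes digits 0-9 as in the analyzer code earlier in the thread, and it skips the Zend token objects and filters entirely.

```php
<?php
// Hypothetical helper: the preg_match loop from nextToken(), without Zend classes.
function tokenize($input, $pattern)
{
    $tokens = array();
    $position = 0;

    // preg_match returns 0 (no match) or FALSE (error); both end the loop.
    while (preg_match($pattern, $input, $match, PREG_OFFSET_CAPTURE, $position)) {
        $str = $match[0][0];
        $pos = $match[0][1];

        $tokens[] = $str;
        $position = $pos + strlen($str); // continue after the current token
    }

    return $tokens;
}

$text = 'type conference-meeting-part';

// Character class without '-': the hyphenated value is split into three words.
$withoutHyphen = tokenize($text, '/[a-zA-Z0-9]+/');

// Character class with '_' and '-': the value survives as one token.
$withHyphen = tokenize($text, '/[a-zA-Z0-9_\-]+/');

print_r($withoutHyphen);
print_r($withHyphen);
```

With the extended class, 'conference-meeting-part' stays a single token, so the analyzed query can match the term stored by a Keyword field.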