George Herson
2008-01-14 19:27:02 UTC
Trying to improve my indexing, I converted some of my Text fields to Keyword. Now I'm having trouble matching anything but alphanumeric characters.
Example: Luke shows "TestDirector_Features.doc" in my keyword field, and finds it when I search on "display_name:TestDirector_Features.doc" (with its org.apache.lucene.analysis.WhitespaceAnalyzer analyzer in effect). The same search with Zend_Search_Lucene (ZSL) finds 0 documents. How can I get ZSL to find this data? Put another way: to search successfully on values containing non-alphanumeric characters, must the field have been indexed as Text?
Related: I'd like to determine which are the stop-characters, i.e. the characters that end a token. http://framework.zend.com/manual/en/zend.search.lucene.charset.html has: "...the default text analyzer (which is also used within query parser) uses ctype_alpha() for tokenizing text and queries." The ctype_alpha() function simply returns true or false. So does this mean that each character in the input is tested, and the current token ends at the first character for which false is returned?
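To make the question concrete, here is my guess at the tokenizing rule, sketched in Python rather than PHP so it can be run without Zend Framework. It is only a model of what I think happens, not ZSL's actual code: the `tokenize` function and its parameters are names I made up, and I'm assuming the character tests are ASCII-only (as ctype_alpha() and ctype_alnum() would be in the default "C" locale).

```python
import string

# Assumption: in the default "C" locale only ASCII letters (and digits,
# for a TextNum-style analyzer) count as token characters.
ASCII_LETTERS = set(string.ascii_letters)
ASCII_ALNUM = ASCII_LETTERS | set(string.digits)


def tokenize(text, allowed=ASCII_ALNUM, lowercase=True):
    """Split text into maximal runs of allowed characters (hypothetical model)."""
    tokens = []
    current = []
    for ch in text:
        if ch in allowed:
            current.append(ch)
        elif current:
            # A character outside the allowed set ends the current token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    if lowercase:
        # Models the "CaseInsensitive" part of the analyzer name.
        tokens = [t.lower() for t in tokens]
    return tokens


if __name__ == "__main__":
    # Letters and digits, lowercased (my TextNum_CaseInsensitive setting).
    print(tokenize("TestDirector_Features.doc"))
    # Letters only, case preserved (the ctype_alpha() rule quoted above).
    print(tokenize("TestDirector_Features.doc", allowed=ASCII_LETTERS, lowercase=False))
```

If this model is right, the query parser would turn my search string into three separate terms, none of which equals the single untokenized value stored in the Keyword field, which might explain the 0 hits.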
From the same section: "the analyzer converts text to 'ASCII//TRANSLIT' encoding before indexing." Is that done with iconv()? If so, what is passed as its first argument, i.e., what is used as the input charset?
My platform is WinXP Pro, PHP Version 5.2.4, "Apache/2.2.6 (Win32) DAV/2", Zend Framework 1.0.2.
(There's no "LC" or "locale" in my phpinfo() output or in my php.ini, so is my locale the default?)
I'm setting
Zend_Search_Lucene_Analysis_Analyzer::setDefault(new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive());
before indexing and searching.
Thank you,
George Herson