non-alphanumeric chars a problem for Keyword fields

George Herson

2008-01-14 19:27:02 UTC

Trying to improve my indexing, I converted some of my Text fields to Keyword. Now I'm having trouble matching anything but alphanumeric characters.

Example: Luke shows "TestDirector_Features.doc" in my keyword field and finds it when I search on "display_name:TestDirector_Features.doc" (with its org.apache.lucene.analysis.WhitespaceAnalyzer analyzer in effect). The same search with Zend_Search_Lucene (ZSL) finds 0 documents. How can I get ZSL to find this data? Put another way: to search successfully on values containing non-alphanumeric characters, must the field have been indexed with the Text field type?

Related: I'd like to determine which characters act as stop-characters (token delimiters). http://framework.zend.com/manual/en/zend.search.lucene.charset.html has: "...the default text analyzer (which is also used within query parser) uses ctype_alpha() for tokenizing text and queries." The ctype_alpha() function simply returns true or false. So does this mean that each character in the input is tested, and when false is returned for a character, the current token ends there?
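
To make the question concrete, here is a minimal sketch of the behaviour I am guessing at. It is not taken from the ZSL source; the function name tokenizeAlphaSketch() is made up for illustration, and it assumes single-byte input tested one character at a time.

```php
<?php
// Hypothetical helper, NOT part of Zend_Search_Lucene: it only illustrates
// my guess that any character failing ctype_alpha() ends the current token.
function tokenizeAlphaSketch($text)
{
    $tokens  = array();
    $current = '';
    $length  = strlen($text);

    for ($i = 0; $i < $length; $i++) {
        $char = $text[$i];
        if (ctype_alpha($char)) {
            // Letter: extend the token being built.
            $current .= $char;
        } elseif ($current !== '') {
            // Non-letter: close the token (if any) and drop the character.
            $tokens[] = $current;
            $current  = '';
        }
    }

    // Flush the last token if the input ended on a letter.
    if ($current !== '') {
        $tokens[] = $current;
    }

    return $tokens;
}

print_r(tokenizeAlphaSketch('TestDirector_Features.doc'));
```

If that guess is right, "TestDirector_Features.doc" would be split at "_" and "." into three separate tokens, which might explain why the query no longer matches the single Keyword value.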

From the same section: "the analyzer converts text to 'ASCII//TRANSLIT' encoding before indexing." Is that using iconv()? If yes, what becomes its first argument, i.e., what is used as the input charset?
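
In other words, I imagine a call shaped like the one below. This is only a sketch of my assumption: the 'UTF-8' value for the input charset is a placeholder I picked, since the real value is exactly what I am asking about, and the transliterated result can vary with the iconv implementation and locale.

```php
<?php
// Assumption: the input charset is UTF-8. What ZSL actually passes
// as the first argument is the open question.
$inputCharset = 'UTF-8';

// "café" written as explicit UTF-8 bytes so the file encoding does not matter.
$text = "caf\xC3\xA9";

// iconv(in_charset, out_charset, string); returns false on failure.
$ascii = iconv($inputCharset, 'ASCII//TRANSLIT', $text);

var_dump($ascii);
```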

My platform is WinXP Pro, PHP Version 5.2.4, "Apache/2.2.6 (Win32) DAV/2", Zend Framework 1.0.2.
(There's no "LC" or "locale" in my phpinfo() output or in my php.ini, so I assume my locale is the default. Is that right?)

I'm setting
Zend_Search_Lucene_Analysis_Analyzer::setDefault(new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive());
before indexing and searching.

Thank you,
George Herson