Discussion:
Zend_Search_Lucene and UTF-8
Jack Sleight
2008-06-27 10:39:30 UTC
Hi,
I'm a little confused. I have a database from which I'm building a
search index. The database is in utf-8, and when I add fields to the
index I'm specifying utf-8 as the encoding. It builds the index fine
without any errors or warning. The index functions perfectly, except if
you search for a word containing letters with diacritics, e.g.
"Führerschein", which returns no results. I've tried setting the
analyzer to Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8, and I've
also tried setting the encoding in
Zend_Search_Lucene_Search_QueryParser::parse() to utf-8, but neither works.

Any ideas?
--
Jack
Tobias Gies
2008-06-27 13:01:28 UTC
Hi Jack,

Could it be that the word "Führerschein" itself (or any other word you
search for, for that matter) is not UTF-8 encoded?

Best regards
Tobias

Jack Sleight
2008-06-27 13:05:52 UTC
Um, I don't /think/ so. As far as I'm aware everything is in utf-8, the
database, the search index fields, the html files etc. The data in the
DB is definitely in utf-8, that's being copied directly to the index,
and output directly to the page. I'm copying the word directly from the
page and pasting it into the search box, so it should be identical to
that stored in the index.
--
Jack
Jack Sleight
2008-06-27 13:54:39 UTC
OK, weird thing: the value coming from the search form field is
definitely utf-8 (I used mb_detect_encoding to check). However, if I set
the encoding argument of Zend_Search_Lucene_Search_QueryParser::parse()
to utf-8, I get this error:

Notice: iconv_strlen() [function.iconv-strlen]: Detected an illegal
character in input string in
...1.5.2\library\Zend\Search\Lucene\Search\QueryLexer.php on line 346
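
A notice like this generally means the string is not well-formed UTF-8 by the time it reaches the lexer. mb_detect_encoding only guesses an encoding from a list of candidates, whereas mb_check_encoding answers the narrower question of whether the bytes are valid in one given encoding, so it can help to run that check after each processing step. A minimal sketch, assuming the mbstring extension and PHP 7 or later; isValidUtf8 is a hypothetical helper, not part of Zend Framework:

```php
<?php
// Hypothetical helper for illustration, not part of Zend Framework.
// Returns true only if $value is well-formed UTF-8 (requires mbstring).
function isValidUtf8(string $value): bool
{
    return mb_check_encoding($value, 'UTF-8');
}

$fromForm = "F\xC3\xBChrerschein"; // "Führerschein" as UTF-8 bytes
$damaged  = "F\xE3\xBChrerschein"; // same word with one byte altered, for illustration

var_dump(isValidUtf8($fromForm)); // bool(true)
var_dump(isValidUtf8($damaged));  // bool(false)
```

Calling such a check on the query string before and after each transformation should show which step damages it.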
--
Jack
Jack Sleight
2008-06-27 14:09:19 UTC
Right, sorry, I'm an idiot. I've been using my own extended query
parser, which was never originally written for UTF-8, and I had a call
to strtolower, which should have been mb_strtolower. Now that's fixed,
it works. Sorry for wasting your time.
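
For reference, strtolower works on single bytes. Depending on the PHP version and the active locale, it either leaves non-ASCII bytes untouched or changes them as if they were characters of a single-byte charset, which can break a multi-byte UTF-8 sequence. The extended query parser itself is not shown in this thread, so the following is only a minimal sketch of the replacement call, assuming the mbstring extension and PHP 7 or later; lowercaseQuery is a hypothetical name:

```php
<?php
// Hypothetical helper for illustration; the actual extended query
// parser is not shown in the thread. Requires the mbstring extension.
function lowercaseQuery(string $query): string
{
    // Pass the encoding explicitly rather than relying on
    // mb_internal_encoding(), which may not be UTF-8 on older setups.
    return mb_strtolower($query, 'UTF-8');
}

$lowered = lowercaseQuery("F\xC3\x9CHRERSCHEIN"); // "FÜHRERSCHEIN" as UTF-8 bytes

var_dump($lowered === "f\xC3\xBChrerschein");   // bool(true): "führerschein"
var_dump(mb_check_encoding($lowered, 'UTF-8')); // bool(true): still valid UTF-8
```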
--
Jack