Zend_Search_Lucene UTF8: different behaviour on Ubuntu and Gentoo

Discussion:

Johannes Schmidt

2007-09-24 14:18:00 UTC

Hi all,

I've moved a project that uses Zend_Search_Lucene from my test server (OS
Ubuntu, PHP 5.2.1, Apache) to our live system (OS Gentoo, PHP 5.2.1).
On Ubuntu UTF-8 works correctly, e.g. an indexed "Blöße" can be found by
entering "Blöße" (inspection with Luke shows "Blöße"). On Gentoo the
same code/application doesn't recognize German umlauts and other special
characters: fields containing German special characters are not indexed.
Inspection with Luke shows broken terms.

Does anybody have an idea? Might that be an OS-specific problem?

Regards,
Johannes

Alexander Veremyev

2007-09-24 19:12:58 UTC

Hi Johannes,

I think the problem is caused by a different current locale.

There are several ways to specify indexed data/query encoding - http://framework.zend.com/manual/en/zend.search.lucene.best-practice.html#zend.search.lucene.best-practice.encoding
Choose any of them :)

With best regards,
Alexander Veremyev.

Johannes Schmidt

2007-09-24 20:17:52 UTC

Hi Cameron and Alexander,

thank you very much. Running env in the shell shows different LANG
settings on the two machines, and setlocale solved the problem...

Best regards,
Johannes
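
A minimal sketch of the setlocale() fix mentioned above, for illustration only. The thread does not say which locale or category was used, so the candidate names and the choice of LC_CTYPE are assumptions, and use_first_available_ctype_locale() is a hypothetical helper. Locale names differ between distributions; running locale -a in the shell lists what is installed.

```php
<?php
// Hypothetical helper: try each candidate locale in turn and return the
// name that was applied, or false if none of them is installed.
function use_first_available_ctype_locale(array $candidates)
{
    // setlocale() accepts an array and tries each entry until one succeeds.
    return setlocale(LC_CTYPE, $candidates);
}

// Assumed candidates; adjust to what `locale -a` reports on the server.
$applied = use_first_available_ctype_locale(
    array('de_DE.UTF-8', 'de_DE.utf8', 'en_US.UTF-8', 'C')
);

if ($applied === false) {
    echo "No candidate locale is installed\n";
} else {
    echo "LC_CTYPE is now: $applied\n";
}
```

Setting the locale explicitly in the application, before both indexing and searching, should make the behaviour independent of the LANG value that each server's environment happens to provide.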

xeoshow

2007-09-25 06:11:54 UTC

Hi Alexander,
I found that the method Zend_Search_Lucene_Document_Html::loadHTML
does not seem to take a parameter for specifying the encoding.
My HTML page contains:
-------------------------------

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />

-------------------------------

With this page the index ends up containing garbage characters.
How can I resolve this? Thanks a bunch.

Alexander Veremyev

2007-09-25 07:32:51 UTC

Hi,

loadHTML() uses the DOMDocument::loadHTML() method to parse HTML data (http://www.php.net/manual/en/function.dom-domdocument-loadhtml.php). It determines the encoding automatically from the meta tag you mentioned.
However, I have seen examples where it wasn't recognized correctly. The problem was solved by moving the Content-Type meta tag to the beginning of the HEAD section.

If that doesn't help, please try to get the actual encoding of the parsed data:
-----------------------------
$doc = Zend_Search_Lucene_Document_Html::loadHTML($data);

echo $doc->getField('body')->encoding . "\n";
------------------
or
-----------------------------
$doc = new DOMDocument();
$doc->loadHTML($data);

echo $doc->actualEncoding . "\n";
------------------

With best regards,
Alexander Veremyev.
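
The meta-tag workaround described above can be sketched as follows. This is only an illustration under assumptions: declared_charset() and hoist_content_type_meta() are hypothetical helpers, not part of Zend_Search_Lucene or DOM, and they use simple regular expressions that will not cope with every kind of HTML.

```php
<?php
// Hypothetical helper: return the charset named in a <meta> tag
// (lowercased), or null if no charset declaration is found.
function declared_charset($html)
{
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_\-]+)/i', $html, $m)) {
        return strtolower($m[1]);
    }
    return null;
}

// Hypothetical helper: move the Content-Type <meta> tag so that it is the
// first element inside <head>. Returns the input unchanged if there is no
// such tag or no <head> element.
function hoist_content_type_meta($html)
{
    $metaPattern = '/<meta[^>]+http-equiv\s*=\s*["\']?Content-Type["\']?[^>]*>/i';
    if (!preg_match($metaPattern, $html, $m)) {
        return $html;
    }
    $meta = $m[0];

    // Remove the first occurrence, then re-insert it right after <head ...>.
    $without = preg_replace($metaPattern, '', $html, 1);
    $result = preg_replace_callback(
        '/<head(\s[^>]*)?>/i',
        function ($head) use ($meta) {
            return $head[0] . $meta;
        },
        $without,
        1,
        $count
    );

    return $count === 1 ? $result : $html;
}

$html = '<html><head><title>t</title>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />'
      . '</head><body>x</body></html>';

echo declared_charset($html), "\n";
echo hoist_content_type_meta($html), "\n";
```

The rewritten markup could then be passed to Zend_Search_Lucene_Document_Html::loadHTML() in place of the original page.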

xeoshow

2007-09-26 07:59:06 UTC

Hi Alexander,
Thank you for your help. It helped somewhat, but I still have an issue.
I have now tested this scenario:
My HTML page is already UTF-8, as you can see from the following:
---------------------------------------------------------------------------------------------------

<HTML><HEAD><TITLE>äºèçœ</TITLE>
<META http-equiv=Content-Type content="text/html; charset=UTF-8">
<META http-equiv=Content-Language content=UTF-8>
<META content=XycYc7sC1omGtfcxN8QprhS+BfnFVRvcahj/qu1g8Oo= name=verify-v1>
......
----------------------------------------------------------

Between <TITLE> and </TITLE> are 3 Chinese characters. With the
following code it works fine:
---------------------------------
$htmlBody = $response->getBody();
$body = mb_convert_encoding($htmlBody, 'HTML-ENTITIES', "utf-8");

$domdoc = new DOMDocument();
$domdoc->loadHTML($body);

$title = $domdoc->getElementsByTagName('title')->item(0)->nodeValue;

$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('title', $title, 'UTF-8'));
$index->addDocument($doc);
......
---------------------------------

But if I use the following:
----------------------------
$htmlBody = $response->getBody();
$body = mb_convert_encoding($htmlBody, 'HTML-ENTITIES', "utf-8");

$doc = Zend_Search_Lucene_Document_Html::loadHTML($body);
$index->addDocument($doc);
......
----------------------------
I get the following error:
----------------------------
Fatal error: Uncaught exception 'Zend_Search_Lucene_Exception' with message 'Invalid UTF-8 string' in E:\Others\APM-Express\htdocs\yoursite_debug\Zend\Search\Lucene\Storage\File.php:405
Stack trace:
#0 E:\Others\APM-Express\htdocs\yoursite_debug\Zend\Search\Lucene\Index\SegmentWriter.php(514): Zend_Search_Lucene_Storage_File->writeString('?')
#1 E:\Others\APM-Express\htdocs\yoursite_debug\Zend\Search\Lucene\Index\SegmentWriter.php(438): Zend_Search_Lucene_Index_SegmentWriter->_dumpTermDictEntry(Object(Zend_Search_Lucene_Storage_File_Filesystem), Object(Zend_Search_Lucene_Index_Term), Object(Zend_Search_Lucene_Index_Term), Object(Zend_Search_Lucene_Index_TermInfo), Object(Zend_Search_Lucene_Index_TermInfo))
#2 E:\Others\APM-Express\htdocs\yoursite_debug\Zend\Search\Lucene\Index\SegmentWriter\DocumentWriter.php(184): Zend_Search_Lucene_Index_SegmentWriter->addTerm(Object(Zend_Search_Lucene_Index_Term), Array)
#3 E:\Others\APM-Express\htdocs\yoursite_debug\Zend\Search\Lucene\Index\SegmentWriter\DocumentWriter.php(203): Zend_ in E:\Others\APM-Express\htdocs\yoursite_debug\Zend\Search\Lucene\Storage\File.php on line 405
----------------------------
After checking carefully, it seems the body part contains some
garbage characters, exactly as below:
-----------------
2/10_ÃŠÂÂÃ¥ÂÂÃšÂÂÃ©Â€
2/10_Ã©ÂÂÃ§ÂÃ€ÂœÂ
......
-----------------
How should I resolve this "Invalid UTF-8 string" error? Is there
any way to ignore the garbage characters? I tried the iconv function
with the IGNORE option, but that did not work: since Chinese, Japanese
and Korean are all DBCS characters, everything turned into garbage
after using the IGNORE option. :(

Thank you so much.
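
One way to narrow down an "Invalid UTF-8 string" error like the one above is to check each string for well-formed UTF-8 before handing it to the indexer. The sketch below is only a diagnostic aid, not a fix confirmed anywhere in this thread; is_valid_utf8() and find_invalid_utf8() are hypothetical helper names, and the sample strings are made up for illustration.

```php
<?php
// Return true when $text is a well-formed UTF-8 byte sequence.
function is_valid_utf8($text)
{
    // With the /u modifier PCRE validates the subject as UTF-8 first;
    // preg_match() returns false (not 0) when that validation fails.
    return preg_match('//u', $text) === 1;
}

// Hypothetical helper: return the names of the fields whose values are
// not valid UTF-8.
function find_invalid_utf8(array $fields)
{
    $bad = array();
    foreach ($fields as $name => $value) {
        if (!is_valid_utf8($value)) {
            $bad[] = $name;
        }
    }
    return $bad;
}

$fields = array(
    'title' => "Bl\xC3\xB6\xC3\x9Fe", // "Blöße" encoded as UTF-8
    'body'  => "Bl\xF6\xDFe",         // same word in ISO-8859-1, invalid as UTF-8
);

print_r(find_invalid_utf8($fields));
```

Knowing which field fails the check makes it easier to see whether the text was converted twice or never converted at all.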
