Hi Alexander,
Thank you for your help. It helped somewhat, but I still have an issue. I have
now tested this scenario:
My html page is already UTF-8, as you can see from the following:
---------------------------------------------------------------------------------------------------
<HTML><HEAD><TITLE>äºèçœ</TITLE>
<META http-equiv=Content-Type content="text/html; charset=UTF-8">
<META http-equiv=Content-Language content=UTF-8>
<META content=XycYc7sC1omGtfcxN8QprhS+BfnFVRvcahj/qu1g8Oo= name=verify-v1>
......
----------------------------------------------------------
Between <TITLE> and </TITLE> there are 3 Chinese characters. With the
following code, indexing works fine:
---------------------------------
$htmlBody = $response->getBody();
$body = mb_convert_encoding($htmlBody, 'HTML-ENTITIES', "utf-8");
$domdoc = new DOMDocument();
$domdoc->loadHTML($body);
$title = $domdoc->getElementsByTagName('title')->item(0)->nodeValue;
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('title', $title, 'UTF-8'));
$index->addDocument($doc);
......
---------------------------------
But if I use the following instead:
----------------------------
$htmlBody = $response->getBody();
$body = mb_convert_encoding($htmlBody, 'HTML-ENTITIES', "utf-8");
$doc = Zend_Search_Lucene_Document_Html::loadHTML($body);
$index->addDocument($doc);
......
----------------------------
I get the following error:
----------------------------
*Fatal error*:
Uncaught exception 'Zend_Search_Lucene_Exception' with message 'Invalid
UTF-8 string' in
E:\Others\APM-Express\htdocs\yoursite_debug\Zend\Search\Lucene\Storage\File.php:405
Stack trace:
#0
E:\Others\APM-Express\htdocs\yoursite_debug\Zend\Search\Lucene\Index\SegmentWriter.php(514):
Zend_Search_Lucene_Storage_File->writeString('?')
#1
E:\Others\APM-Express\htdocs\yoursite_debug\Zend\Search\Lucene\Index\SegmentWriter.php(438):
Zend_Search_Lucene_Index_SegmentWriter->_dumpTermDictEntry(Object(Zend_Search_Lucene_Storage_File_Filesystem),
Object(Zend_Search_Lucene_Index_Term),
Object(Zend_Search_Lucene_Index_Term),
Object(Zend_Search_Lucene_Index_TermInfo),
Object(Zend_Search_Lucene_Index_TermInfo))
#2
E:\Others\APM-Express\htdocs\yoursite_debug\Zend\Search\Lucene\Index\SegmentWriter\DocumentWriter.php(184):
Zend_Search_Lucene_Index_SegmentWriter->addTerm(Object(Zend_Search_Lucene_Index_Term),
Array)
#3
E:\Others\APM-Express\htdocs\yoursite_debug\Zend\Search\Lucene\Index\SegmentWriter\DocumentWriter.php(203):
Zend_ in *E:\Others\APM-Express\htdocs\yoursite_debug\Zend\Search\Lucene\Storage\File.php*
on line *405* <http://localhost/more/food>
----------------------------
After checking carefully, it seems that the body part contains some
garbage characters, exactly as below:
-----------------
2/10_ÊÂÂÃ¥ÂÂÚÂÂé€
2/10_éÂÂç­ÂÀœÂ
......
-----------------
How should I resolve this "Invalid UTF-8 string" error? Is there any way to
ignore the garbage characters? I tried the iconv function with the IGNORE
option, but that did not work: since Chinese, Japanese and Korean are all
DBCS characters, everything turned into garbage after using the IGNORE
option. :(
Thank you so much.
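
One possible way to drop malformed bytes before indexing is sketched below. It assumes the mbstring extension is available; stripInvalidUtf8() is a hypothetical helper, not part of Zend Framework. It removes only byte sequences that are not valid UTF-8; it cannot repair text that was decoded with the wrong charset in the first place.

```php
<?php
// Hypothetical helper (not part of Zend Framework): returns $text with
// byte sequences that are not valid UTF-8 removed.
function stripInvalidUtf8($text)
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid, nothing to do
    }

    // Tell mbstring to drop invalid sequences instead of substituting '?'.
    $previous = mb_substitute_character();
    mb_substitute_character('none');
    $clean = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous); // restore the global setting

    return $clean;
}
```

The cleaned string could then be passed to Zend_Search_Lucene_Field::Text() with 'UTF-8' as the encoding, as in the first snippet above.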
Post by Alexander Veremyev
Hi,
loadHTML() uses DOMDocument::loadHTML() method to parse HTML data (
http://www.php.net/manual/en/function.dom-domdocument-loadhtml.php). It
determines encoding automatically according to the meta tag you mentioned.
But I have seen examples where it wasn't recognized correctly. The problem was
solved by moving the Content-Type meta tag to the beginning of the HEAD section.
You can check which encoding was detected with:
-----------------------------
$doc = Zend_Search_Lucene_Document_Html::loadHTML($data);
echo $doc->getField('body')->encoding . "\n";
------------------
or
-----------------------------
$doc = new DOMDocument();
$doc->loadHTML($data);
echo $doc->actualEncoding . "\n";
------------------
With best regards,
Alexander Veremyev.
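
The suggestion above (putting the Content-Type meta tag at the beginning of HEAD) could be applied in code before parsing. This is only a sketch: prependCharsetMeta() is a hypothetical helper, and it assumes the page bytes really are in the charset passed in.

```php
<?php
// Hypothetical helper (not part of Zend Framework or PHP): inserts a
// Content-Type meta tag directly after the opening <head> tag, so the
// parser sees the charset before any text such as <title>.
function prependCharsetMeta($html, $charset = 'UTF-8')
{
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset='
          . $charset . '">';

    return preg_replace_callback(
        '/<head(\s[^>]*)?>/i',
        function ($m) use ($meta) { return $m[0] . $meta; },
        $html,
        1 // only the first <head> tag
    );
}

// Sample page where the meta tag comes after the title.
$html = '<html><head><title>Blöße</title>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
      . '</head><body><p>text</p></body></html>';

$dom = new DOMDocument();
// @ hides parser warnings about imperfect markup.
@$dom->loadHTML(prependCharsetMeta($html));
$title = $dom->getElementsByTagName('title')->item(0)->nodeValue;
```

The same pre-processed string could be handed to Zend_Search_Lucene_Document_Html::loadHTML() instead of DOMDocument.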
------------------------------
*Sent:* Tuesday, September 25, 2007 10:12 AM
*To:* Alexander Veremyev
*Subject:* Re: [fw-formats] Zend_Search_Lucene UTF8: different behaviour
on Ubuntu and Gentoo
Hi Alexander,
I found that the method Zend_Search_Lucene_Document_Html::loadHTML does not seem to have a parameter for specifying the encoding. For example, with a page that contains:
-------------------------------
<meta http-equiv="Content-Type"
content="text/html; charset=gb2312" />
-------------------------------
With a page like the above, the index ends up containing garbage characters.
How can I resolve this? Thanks a bunch.
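
One workaround that might apply here is to convert the page to UTF-8 before handing it to loadHTML. The sketch below assumes the mbstring extension is available and knows the declared charset name; detectMetaCharset() and htmlToUtf8() are hypothetical helpers, not part of Zend Framework.

```php
<?php
// Hypothetical helpers for illustration; neither is part of Zend Framework.

// Return the charset declared in a meta tag, or null if none is found.
function detectMetaCharset($html)
{
    $pattern = '/<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i';
    if (preg_match($pattern, $html, $m)) {
        return $m[1];
    }
    return null;
}

// Convert the page to UTF-8 and update the declared charset to match.
function htmlToUtf8($html)
{
    $charset = detectMetaCharset($html);
    if ($charset === null || strcasecmp($charset, 'UTF-8') === 0) {
        return $html; // nothing declared, or already UTF-8
    }

    $utf8 = mb_convert_encoding($html, 'UTF-8', $charset);

    // Rewrite the first charset declaration so the parser does not
    // decode the (now UTF-8) bytes as the old charset.
    return preg_replace(
        '/(charset\s*=\s*["\']?)' . preg_quote($charset, '/') . '/i',
        '${1}UTF-8',
        $utf8,
        1
    );
}
```

The result of htmlToUtf8() could then be passed to Zend_Search_Lucene_Document_Html::loadHTML().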
Post by Alexander Veremyev
Hi Johannes,
I think the problem is caused by a different current locale.
There are several ways to specify indexed data/query encoding -
http://framework.zend.com/manual/en/zend.search.lucene.best-practice.html#zend.search.lucene.best-practice.encoding
Choose any of them :)
With best regards,
Alexander Veremyev.
-----Original Message-----
Sent: Monday, September 24, 2007 6:18 PM
Subject: [fw-formats] Zend_Search_Lucene UTF8: different
behaviour on Ubuntu and Gentoo
Hi all,
I've moved a project including Zend_Search_Lucene from my
Testserver (OS Ubuntu, PHP 5.2.1, Apache) to our live system
(OS Gentoo, PHP 5.2.1).
On Ubuntu UTF8 works correctly, e.g. an indexed "Blöße" can be
found by entering "Blöße". (Inspection with Luke shows
"Blöße"). On Gentoo the same code/application doesn't
recognize German umlauts and other special chars - no
indexing is done for fields containing German special chars.
Inspection with Luke shows broken terms.
Does anybody have an idea? Might that be an OS specific problem?
Regards,
Johannes
No virus found in this incoming message.
Checked by AVG Free Edition.
Version: 7.5.488 / Virus Database: 269.13.30/1025 - Release Date: 23.09.2007 13:53