Discussion:
Zend_Lucene + UTF8 search problem... Help!
Maxim Savenko
2008-07-24 11:48:14 UTC
Hi everybody,
I have a problem with searching russian strings, utf8 encoded, with
Zend_Search_Lucene. Here is my short sample code:


<?php

require_once 'ZendInit.php';

require_once 'Zend/Search/Lucene.php';

require_once 'Zend/Search/Lucene/Document.php';


// Create index

$index = Zend_Search_Lucene::create('data/index');

$doc = new Zend_Search_Lucene_Document();

$doc->addField(Zend_Search_Lucene_Field::Text('samplefield', 'русский текст; english text', 'utf-8'));

$index->addDocument($doc);

$index->commit();


// Open index and search:

$index = Zend_Search_Lucene::open('data/index');

Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8');

Zend_Search_Lucene::setDefaultSearchField('samplefield');


// Query the index:

$queryStr = 'english';

$query = Zend_Search_Lucene_Search_QueryParser::parse($queryStr, 'utf-8');

$hits = $index->find($query);

foreach ($hits as $hit) {

/* @var $hit Zend_Search_Lucene_Search_QueryHit */

$doc = $hit->getDocument();

echo $doc->getField('samplefield')->value, PHP_EOL;

}

The 'samplefield' field of the document contains a string in two languages,
Russian and English (see code). Searching for 'english' works fine: the
document is found. But searching for the Russian part of the field (setting
$queryStr to 'русский') finds no documents.
What is wrong with my code? Please help me find a solution.
Thank you guys
Maxim Savenko ***@gmail.com
Maxim Savenko
2008-07-24 13:21:36 UTC
Hi everybody,

I have a problem with searching russian strings, utf8 encoded, with
Zend_Search_Lucene. Here is my short sample code:


require_once 'ZendInit.php';

require_once 'Zend/Search/Lucene.php';

require_once 'Zend/Search/Lucene/Document.php';


// Create index

$index = Zend_Search_Lucene::create('data/index');

$doc = new Zend_Search_Lucene_Document();

$doc->addField(Zend_Search_Lucene_Field::Text('samplefield', 'русский текст; english text', 'utf-8'));

$index->addDocument($doc);

$index->commit();


// Open index and search:

$index = Zend_Search_Lucene::open('data/index');

Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8');

Zend_Search_Lucene::setDefaultSearchField('samplefield');


// Query the index:

$queryStr = 'english';

$query = Zend_Search_Lucene_Search_QueryParser::parse($queryStr, 'utf-8');

$hits = $index->find($query);

foreach ($hits as $hit) {

/* @var $hit Zend_Search_Lucene_Search_QueryHit */

$doc = $hit->getDocument();

echo $doc->getField('samplefield')->value, PHP_EOL;

}


The 'samplefield' field of the document contains a string in two languages,
Russian and English (see code). Searching for 'english' works fine: the
document is found. But searching for the Russian part of the field (setting
$queryStr to 'русский') finds no documents.

What is wrong with my code? Please help me find a solution.

Thank you guys

Maxim Savenko
Christopher Östlund
2008-07-24 13:25:43 UTC
What's up with the spam?
Post by Maxim Savenko
Hi everybody,
I have a problem with searching russian strings, utf8 encoded, with
require_once 'ZendInit.php';
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Document.php';
// Create index
$index = Zend_Search_Lucene::create('data/index');
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('samplefield', 'русский текст; english text', 'utf-8'));
$index->addDocument($doc);
$index->commit();
$index = Zend_Search_Lucene::open('data/index');
Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8');
Zend_Search_Lucene::setDefaultSearchField('samplefield');
$queryStr = 'english';
$query = Zend_Search_Lucene_Search_QueryParser::parse($queryStr, 'utf-8');
$hits = $index->find($query);
foreach ($hits as $hit) {
$doc = $hit->getDocument();
echo $doc->getField('samplefield')->value, PHP_EOL;
}
The 'samplefield' field of the document contains a string in two languages,
Russian and English (see code). Searching for 'english' works fine: the
document is found. But searching for the Russian part of the field (setting
$queryStr to 'русский') finds no documents.
What is wrong with my code? Please help me find a solution.
Thank you guys
Maxim Savenko
Tobias Gies
2008-07-24 13:26:05 UTC
Maxim,

Disregard the "Your message could not be delivered" spam. Your message has been
sent to this list 7 times now. The mails saying "Your message could not be
delivered" are not sent by Zend; they come from some British bloke who
seems unable to configure his/her mail server properly.

Best regards
Tobias
Post by Maxim Savenko
Hi everybody,
[...]
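
A possible direction for the original question, which nobody in this thread confirms: as far as I know, the default Zend_Search_Lucene analyzer only tokenizes ASCII letters, so Cyrillic words may never reach the index as terms. Zend Framework 1.5 and later ship UTF-8 aware analyzers. The sketch below assumes such a version is on the include path, that PCRE is built with UTF-8 support, and that mbstring is available for the case-insensitive variant. The 'data/index' path and field name are taken from the sample code above.

<?php
// Assumption: Zend Framework 1.x (1.5 or later) is on the include path.
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Document.php';
require_once 'Zend/Search/Lucene/Search/QueryParser.php';
require_once 'Zend/Search/Lucene/Analysis/Analyzer/Common/Utf8/CaseInsensitive.php';

// Set the UTF-8 analyzer as the default before indexing AND before searching,
// so that documents and queries are tokenized the same way.
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8_CaseInsensitive()
);

// Rebuild the index: terms written with the old analyzer are not re-tokenized.
$index = Zend_Search_Lucene::create('data/index');
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('samplefield', 'русский текст; english text', 'utf-8'));
$index->addDocument($doc);
$index->commit();

// Search for the Russian word and print the stored field of each hit.
Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('utf-8');
Zend_Search_Lucene::setDefaultSearchField('samplefield');
$query = Zend_Search_Lucene_Search_QueryParser::parse('русский', 'utf-8');
foreach ($index->find($query) as $hit) {
    echo $hit->getDocument()->getField('samplefield')->value, PHP_EOL;
}

If this still finds nothing, it is worth checking that the PHP source file itself is saved as UTF-8, since the string literals are passed through with the declared 'utf-8' encoding.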