UTF-8-Encoded Japanese
h***@public.gmane.org
2007-07-19 16:20:30 UTC
I am porting a PDF generator from the FPDF class to Zend_Pdf. The
text input, in English as well as several European languages and
Japanese, is all encoded in UTF-8. European diacritics display
without a hitch in Zend_Pdf-generated documents, but Japanese text
renders blank. I've been googling for the past half hour, but have
yet to find anything on this subject.

I would appreciate any pointer.

Cheers,
Hakan
Kevin McArthur
2007-07-19 16:22:34 UTC
I've copied Willie on this; he's the resident font expert. I'm not sure how
it works, but you might need to embed the Japanese font?

K
Willie Alberty
2007-07-19 20:54:54 UTC
Zend_Pdf's Unicode support is currently incomplete. It can only
handle Latin-1 characters right now as it lacks the code to generate
a custom Encoding dictionary for a font. We're currently using the
built-in WinAnsiEncoding, hence the Latin-1 limitation (see PDF 1.6
Reference § 5.5.5).
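
To make the limitation concrete, here is a minimal sketch in Python (not
Zend_Pdf code) that reports which characters of a string cannot be
represented in a single-byte Windows encoding. It assumes that Python's
"cp1252" codec is a close enough stand-in for the PDF WinAnsiEncoding; the
two are not guaranteed to agree on every code point, and the function name
unsupported_chars is hypothetical.

```python
def unsupported_chars(text, encoding="cp1252"):
    """Return the distinct characters of text that `encoding` cannot represent.

    cp1252 is used as an approximation of the PDF WinAnsiEncoding (assumption).
    """
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            # Record each unencodable character once, in order of appearance.
            if ch not in missing:
                missing.append(ch)
    return missing


print(unsupported_chars("Café Zürich"))   # European diacritics fit
print(unsupported_chars("東京都 Tokyo"))   # the kanji do not
```

Running a check like this on text before drawing it would at least flag
strings that a WinAnsi-only text path cannot render.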

If you're embedding a TrueType font, there is a way to indicate that
the font's cmap tables should be used instead, but this requires
special support within the font program. There is currently no code
to verify that the required support exists. Some PDF generators can
add it if missing, and we could too, but the code is rather
involved. I eventually want to add support for font subsetting,
which rewrites font programs on-the-fly. When that's done, it will be
easier to add the required cmaps.

Because of the large character sets, CJK (Chinese, Japanese, Korean)
fonts are best handled in PDF as composite (CID-Keyed) fonts, which
use pure Unicode character maps (see PDF 1.6 Reference § 5.6). In
fact, this is a pretty efficient way to handle any non-Latin writing
system, such as Arabic, Hebrew, or Thai. It's far easier to
maintain than the myriad of language-specific ISO encodings in the
wild today. CID-Keyed fonts will also make it easier to support
advanced typography features such as ligatures.
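
As a rough illustration of why composite fonts cope with large character
sets, the sketch below (Python, not Zend_Pdf code) turns a string into the
two-byte-per-character hexadecimal form that a PDF content stream could
show with a composite font. It assumes the font is set up with a
UTF-16-based predefined CMap such as UniJIS-UTF16-H; with other CMaps (for
example Identity-H) the codes would be CIDs rather than Unicode values. The
function name pdf_hex_string is hypothetical.

```python
def pdf_hex_string(text):
    """Encode text as UTF-16BE and wrap it as a PDF hexadecimal string.

    Assumes the target font uses a UTF-16-based CMap (e.g. UniJIS-UTF16-H),
    so each character code is simply the character's UTF-16BE encoding.
    """
    return "<" + text.encode("utf-16-be").hex().upper() + ">"


print(pdf_hex_string("日本"))  # <65E5672C>
```

Each character occupies two bytes here (four for characters outside the
Basic Multilingual Plane), which is what lets a single font address
thousands of glyphs instead of the 256 available to a single-byte encoding.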

CID-Keyed font support is on my personal roadmap for a future
contribution to Zend_Pdf, but I haven't had a project that required
it yet, and free time is scarce. If it's something that you really
need for your current project, I'm available for hire. ;-)

--

Willie Alberty, Owner
Spenlen Media
willie-***@public.gmane.org

http://www.spenlen.com/
h***@public.gmane.org
2007-07-22 07:28:33 UTC
Thanks for the reply.

I did manage to devise a hack that serves my rather limited purpose,
which is to add Japanese addresses to the shipping label portion of a
packing slip: I first create a JPEG file with the desired text using
imagettftext() and insert that into the PDF. Not elegant, but it works.

This also keeps the file size small, as there is no need to embed the
hefty Japanese ttf file.

It may be helpful to include in the online documentation a brief note
on this -- something along the lines of "drawText currently supports
only the Latin-1 subset of UTF-8."

Cheers,
Hakan