Multilanguage Joomla and UTF-8

Joomla supports the UTF-8 character encoding out of the box, so anyone building a multilingual website should have no problems using UTF-8 on their site. Right?

Yes and no. Core Joomla is problem-free as I write this, but some non-core add-ons, templates and, yes, your own home-cooked code can produce garbled output. Let's see why, and how we can fix it!

There are two major culprits here. The first is the MySQL version used. Even though they are officially unsupported, a lot of servers still run earlier (4.*) versions of the database engine powering Joomla. Upgrade it, and make sure you have chosen the proper encoding for your database tables. There is not much else to say there.

The second culprit is that smart little monkey that does the magic of building your site from database records and files stored on the web server: the PHP engine.

Unfortunately, PHP assumes that every character in a string is stored in a single byte. And here is the catch: UTF-8 is a multi-byte character encoding that lets us store Unicode in a relatively small amount of space. Individual UTF-8 characters are stored in memory using a variable number of bytes. A good example of when this is problematic is counting the characters in a string with the PHP strlen() function. If the string contains UTF-8 data and one or more of the characters take multiple bytes, the returned value will be larger than expected, because strlen() counts bytes, not characters.
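To make the mismatch concrete, here is a minimal sketch in plain PHP. It assumes the mbstring extension is installed and uses its functions only as a character-aware reference. The sample word and variable names are made up for illustration.

```php
<?php
// "naïve" written with explicit byte escapes so the example does not
// depend on the encoding of this source file: ï is the two bytes C3 AF.
$word = "na\xC3\xAFve";

$byteCount = strlen($word);              // counts bytes: 6
$charCount = mb_strlen($word, 'UTF-8');  // counts characters: 5

// substr() cuts by byte, so it can split a character in half.
$broken = substr($word, 0, 3);              // "na" plus a lone C3 byte
$intact = mb_substr($word, 0, 3, 'UTF-8');  // "naï"

echo $byteCount, ' bytes, ', $charCount, " characters\n";
echo mb_check_encoding($broken, 'UTF-8') ? 'valid' : 'garbled', "\n";
echo mb_check_encoding($intact, 'UTF-8') ? 'valid' : 'garbled', "\n";
```

The half-cut string is exactly the kind of value that shows up as a garbled character in the browser.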

Fortunately, you don't need to write lengthy chunks of code to deal with the situations that come up when you handle multi-byte strings in Joomla. Since the arrival of Joomla 1.5 we have had an entire library of functions that treat these strings correctly: the JString class.

The Joomla! JString class contains a bunch of static methods that are UTF-8 aware. There is an equivalent JString method for each PHP string function that does not behave as expected when using UTF-8 strings.

Here are some of them (the full list, with examples, is at http://docs.joomla.org/API15:JString):

| PHP Function | JString Method | Return Type | Parameters | Description |
| --- | --- | --- | --- | --- |
| strlen | JString::strlen | int | string $str | Determines the length of $str. |
| trim | JString::trim | string | string $str, [string $charlist] | Removes leading and trailing whitespace or characters defined in $charlist. |
| ltrim | JString::ltrim | string | string $str, [string $charlist] | Removes leading whitespace or characters defined in $charlist. |
| rtrim | JString::rtrim | string | string $str, [string $charlist] | Removes trailing whitespace or characters defined in $charlist. |
| strpos | JString::strpos | int or false | string $haystack, string $needle, [int $offset = 0] | Finds the position of the first occurrence of $needle in $haystack. |
| strrpos | JString::strrpos | int or false | string $haystack, string $needle, [int $offset = 0] | Finds the position of the last occurrence of $needle in $haystack (PHP 4 behaves slightly differently). WARNING: JString does not support $offset. |
| substr | JString::substr | string | string $string, int $start, [int $length] | Gets a portion of $string based on the character position $start and maximum length $length. |
| iconv | JString::transcode | string | string $source, string $from_encoding, string $to_encoding | Converts $source from one character encoding to another. Depending on the encodings, this can result in data loss. |
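Positions are another place where bytes and characters drift apart. The sketch below uses plain PHP with the mbstring extension (an assumption, used here as a stand-in for character-aware behaviour) to show why an offset from strpos() should not be fed into a character-aware function. The sample string and variable names are illustrative only.

```php
<?php
// "été chaud" with each é written as the two bytes C3 A9.
$haystack = "\xC3\xA9t\xC3\xA9 chaud";

$bytePos = strpos($haystack, 'chaud');                 // 6: offset in bytes
$charPos = mb_strpos($haystack, 'chaud', 0, 'UTF-8');  // 4: offset in characters

// Mixing the two kinds of offset silently returns the wrong slice.
$wrong = mb_substr($haystack, $bytePos, 5, 'UTF-8');   // "aud"
$right = mb_substr($haystack, $charPos, 5, 'UTF-8');   // "chaud"

echo $bytePos, ' vs ', $charPos, "\n";
echo $wrong, ' vs ', $right, "\n";
```

Inside a Joomla extension, keeping every step on the JString side (JString::strpos together with JString::substr) avoids this kind of mix-up.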

As you can see, the syntax is very close to plain PHP: the same function name with the "JString::" prefix, and generally the same parameters and return values. So, if you want to build a truly multilanguage-capable component, module, plugin or template for Joomla, all you need to do is use these JString equivalents of PHP's string manipulation functions wherever a string may contain UTF-8 text.
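As a rough illustration of how this looks in extension code, here is a sketch of a title-shortening helper. The function name truncate_title() is hypothetical and not part of Joomla. The mbstring branch is only there so the snippet also runs outside a Joomla installation. It assumes JString::strlen and JString::substr accept the parameters listed in the table above.

```php
<?php
/**
 * Hypothetical helper (not part of Joomla): shorten a title to at most
 * $max characters without cutting a multi-byte character in half.
 */
function truncate_title($title, $max)
{
    if (class_exists('JString')) {
        // Inside Joomla: use the UTF-8 aware JString methods.
        if (JString::strlen($title) <= $max) {
            return $title;
        }
        return JString::substr($title, 0, $max) . '...';
    }

    // Outside Joomla: mbstring stand-ins so the sketch runs on its own.
    if (mb_strlen($title, 'UTF-8') <= $max) {
        return $title;
    }
    return mb_substr($title, 0, $max, 'UTF-8') . '...';
}

// "Árvíztűrő tükör" written with byte escapes; the first word is
// 9 characters long but takes 13 bytes.
$title = "\xC3\x81rv\xC3\xADzt\xC5\xB1r\xC5\x91 t\xC3\xBCk\xC3\xB6r";

echo truncate_title($title, 9), "\n";
```

With plain substr($title, 0, 9) the cut would land in the middle of a two-byte character, which is the garbled output this article is about.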

An important note: there is one exception worth remembering. The PHP iconv() function and the JString::transcode() method are not technically equivalent, although they have a lot of similar applications. JString::transcode() is intended more as a helper for using iconv(). When you use JString::transcode(), transliteration is always enabled.
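To give a feel for the difference, here is a sketch of what such a helper might look like in plain PHP. The function transcode_sketch() is a hypothetical stand-in, not the Joomla implementation. It assumes that "transliteration enabled" corresponds to appending //TRANSLIT to the target encoding, and that the iconv extension on your server supports that suffix (some builds do not).

```php
<?php
/**
 * Hypothetical stand-in for JString::transcode(), for illustration only.
 * Assumption: transliteration is switched on by appending //TRANSLIT to
 * the target encoding before calling iconv().
 */
function transcode_sketch($source, $fromEncoding, $toEncoding)
{
    // Note the argument order: iconv() takes the encodings first and the
    // string last, while the helper takes the source string first.
    return iconv($fromEncoding, $toEncoding . '//TRANSLIT', $source);
}

$utf8   = "caf\xC3\xA9";                                   // "café", 5 bytes
$latin1 = transcode_sketch($utf8, 'UTF-8', 'ISO-8859-1');  // é becomes 1 byte

echo strlen($utf8), ' -> ', strlen($latin1), " bytes\n";
```

Characters that do not exist in the target encoding are replaced with look-alikes when transliteration is on, and the exact replacements depend on the iconv library installed, so test with your own data.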