Guidelines for using Unicode

With the OpenEdge UTF-8 BASIC collation, composed and decomposed characters are treated as different characters. With the International Components for Unicode (ICU) collations, composed and decomposed characters are treated as the same character for comparisons and indexes.
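The difference between the two behaviors can be illustrated outside OpenEdge. The following is a minimal Python sketch, not ABL: it uses Python's standard unicodedata module as a stand-in, and the variable names are hypothetical. It shows that the composed and decomposed forms of the same character differ byte for byte, yet match once canonical equivalence is taken into account.

```python
import unicodedata

composed = "\u00e9"      # e with acute accent as one code point (U+00E9)
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT (U+0301)

# A binary comparison sees two different byte sequences.
binary_equal = composed.encode("utf-8") == decomposed.encode("utf-8")

# Comparing after canonical normalization treats them as the same character.
canonical_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

print(binary_equal, canonical_equal)
```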

The OpenEdge UTF-8 BASIC collation sorts Unicode data in binary order. The ICU collations instead sort Unicode data according to the language-specific requirements of a locale.
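To see what binary order means in practice, here is a small Python sketch, offered only as an illustration and not as OpenEdge behavior verbatim. It sorts a hypothetical word list by UTF-8 byte values. Locale-aware ordering is not shown, because it depends on an ICU library and a chosen locale.

```python
# Hypothetical sample data: mixed case plus one accented initial letter.
words = ["zebra", "Apple", "\u00e9clair", "apple"]

# Sort by the raw UTF-8 bytes of each word (binary order).
binary_sorted = sorted(words, key=lambda w: w.encode("utf-8"))

# Uppercase ASCII sorts before lowercase ASCII, and any non-ASCII
# character sorts after every ASCII character.
print(binary_sorted)
```

A locale-based collation would typically place the accented word near the other words starting with "e" and would not separate "Apple" from "apple" by case alone.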

Note: You can specify an OpenEdge collation or an ICU collation for sorting data using either the Collation Table (-cpcoll) startup parameter, or the COLLATE option on the FOR statement, the OPEN QUERY statement, and the PRESELECT phrase. For more information on the -cpcoll startup parameter, see OpenEdge Deployment: Startup Command and Parameter Reference. For more information on the ABL elements, see OpenEdge Development: ABL Reference.

Before sorting Unicode data with the UTF-8 BASIC collation, normalize the data using the ABL NORMALIZE function. Normalizing the data converts the data into a standardized form that allows for more accurate and consistent sorting and indexing. This is important when working with characters or sequences of characters that have multiple representations (for example, base characters and combining characters) because it ensures that equivalent strings have a unique binary representation. For more information on the ABL NORMALIZE function, see OpenEdge Development: ABL Reference.
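The effect of normalizing before a binary sort can be sketched as follows. This is a Python illustration under stated assumptions: unicodedata.normalize stands in for the ABL NORMALIZE function, the NFC (composed) form is an arbitrary choice for the example, and the sample strings are hypothetical.

```python
import unicodedata

# Two canonically equivalent spellings of the same word, plus one other word.
raw = ["caf\u00e9", "cafe\u0301", "cafz"]


def binary_key(text):
    """Sort key that mimics binary ordering on UTF-8 bytes."""
    return text.encode("utf-8")


# Without normalization, the equivalent strings are separated by "cafz".
unnormalized_sorted = sorted(raw, key=binary_key)

# After normalization, equivalent strings share one binary representation,
# so they sort together and compare as equal.
normalized = [unicodedata.normalize("NFC", s) for s in raw]
normalized_sorted = sorted(normalized, key=binary_key)

print(unnormalized_sorted)
print(normalized_sorted)
```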

When UTF-8 data contains decomposed characters, you cannot convert it to a single-byte code page. You must first compose the data using the ABL NORMALIZE function. When you convert data from a single-byte code page to Unicode, the result is always composed data.
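The same constraint appears in other environments and can be sketched in Python. This is an illustration only, not OpenEdge code: ISO-8859-1 is assumed as the single-byte code page, and unicodedata.normalize stands in for the ABL NORMALIZE function.

```python
import unicodedata

# Hypothetical sample: "resume" with two decomposed acute accents.
decomposed = "re\u0301sume\u0301"

# The combining accent has no slot in a single-byte code page,
# so direct conversion fails.
try:
    decomposed.encode("iso-8859-1")
    converted_directly = True
except UnicodeEncodeError:
    converted_directly = False

# Composing first yields characters that the code page can represent.
composed = unicodedata.normalize("NFC", decomposed)
single_byte = composed.encode("iso-8859-1")

# Converting back from the single-byte code page gives composed data.
round_trip = single_byte.decode("iso-8859-1")
still_composed = unicodedata.normalize("NFC", round_trip) == round_trip

print(converted_directly, len(single_byte), still_composed)
```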

OpenEdge supports code-page conversion to and from UTF-8 the same way it supports code-page conversion to and from other code pages. For more information on code-page conversion, see Understanding Code Pages and Understanding Character Processing Tables.

When an existing database is converted to UTF-8, the amount of storage required by each non-ASCII character increases. Roughly, each non-ASCII Latin-alphabet character converted to UTF-8 tends to require two bytes, while each double-byte Chinese, Japanese, or Korean character converted to UTF-8 tends to require three bytes.
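The growth per character can be checked with a short Python sketch. The sample characters and source code pages below are assumptions chosen for illustration (ISO-8859-1, GBK, Shift-JIS, and EUC-KR); an actual database may use different code pages.

```python
# (character, assumed source code page) pairs for illustration only.
samples = [
    ("\u00e9", "iso-8859-1"),  # Latin small letter e with acute
    ("\u4e2d", "gbk"),         # a Chinese character
    ("\u65e5", "shift_jis"),   # a Japanese kanji
    ("\ud55c", "euc_kr"),      # a Korean Hangul syllable
]

# Map each character to (bytes in source code page, bytes in UTF-8).
sizes = {
    char: (len(char.encode(code_page)), len(char.encode("utf-8")))
    for char, code_page in samples
}

for char, (before, after) in sizes.items():
    print(f"U+{ord(char):04X}: {before} byte(s) -> {after} byte(s) in UTF-8")
```

ASCII characters are unaffected, because UTF-8 encodes each of them in a single byte.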