Tizen Native API  8.0
Unormalization

The Unormalization module provides functions for standard Unicode normalization.

Required Header

#include <utils_i18n.h>

Overview

The Unormalization module provides functions for standard Unicode normalization. All instances of i18n_unormalizer_h are immutable. Instances returned by i18n_unormalization_get_instance() are singletons that must not be deleted by the caller.

Sample Code 1

Gets a normalizer and normalizes a Unicode string

    i18n_unormalizer_h normalizer = NULL;
    i18n_uchar src = 0xAC00;    // A precomposed Korean syllable (consonant + vowel)
    i18n_uchar dest[4] = {0,};
    int32_t dest_str_len = 0;
    int i = 0;
    int ret = I18N_ERROR_NONE;

    // Gets the singleton instance; "nfc" data with the DECOMPOSE mode yields NFD
    ret = i18n_unormalization_get_instance( NULL, "nfc", I18N_UNORMALIZATION_DECOMPOSE, &normalizer );

    // Normalizes the Unicode string only if the instance was obtained
    if ( ret == I18N_ERROR_NONE ) {
        ret = i18n_unormalization_normalize( normalizer, &src, 1, dest, 4, &dest_str_len );
    }

    if ( ret != I18N_ERROR_NONE ) {
        dlog_print(DLOG_ERROR, LOG_TAG, "Normalization failed: %d", ret );
    } else {
        dlog_print(DLOG_INFO, LOG_TAG, "src is 0x%x", src );    // src is 0xAC00

        for ( i = 0; i < dest_str_len; i++ ) {
            dlog_print(DLOG_INFO, LOG_TAG, "dest[%d] is 0x%x", i, dest[i] );    // dest[0] is 0x1100 (consonant), dest[1] is 0x1161 (vowel)
        }
    }
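The values logged by Sample Code 1 follow from the Hangul syllable arithmetic defined in the Unicode Standard. The following is a minimal standalone sketch of that arithmetic in plain C, not the module's implementation. The helper name hangul_decompose and the constant names are illustrative assumptions and are not part of the Tizen API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Constants of the Unicode Hangul syllable arithmetic (names are illustrative). */
enum {
    S_BASE = 0xAC00,                /* first precomposed syllable */
    L_BASE = 0x1100,                /* first leading consonant */
    V_BASE = 0x1161,                /* first vowel */
    T_BASE = 0x11A7,                /* one before the first trailing consonant */
    V_COUNT = 21,
    T_COUNT = 28,
    N_COUNT = V_COUNT * T_COUNT,    /* 588 */
    S_COUNT = 19 * N_COUNT          /* 11172 */
};

/* Hypothetical helper: writes the jamo of a precomposed syllable to out and
 * returns the number of code units written (2 or 3), or 0 if s is not a
 * precomposed Hangul syllable. */
int hangul_decompose(uint16_t s, uint16_t out[3])
{
    assert(out != NULL);
    if (s < S_BASE || s >= S_BASE + S_COUNT)
        return 0;

    int s_index = s - S_BASE;
    out[0] = (uint16_t)(L_BASE + s_index / N_COUNT);
    out[1] = (uint16_t)(V_BASE + (s_index % N_COUNT) / T_COUNT);

    int t_index = s_index % T_COUNT;
    if (t_index == 0)
        return 2;                   /* no trailing consonant */

    out[2] = (uint16_t)(T_BASE + t_index);
    return 3;
}
```

For 0xAC00 the helper writes 0x1100 and 0x1161, which are the two values the sample logs.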

Functions

int i18n_unormalization_get_instance (const char *package_name, const char *name, i18n_unormalization_mode_e mode, i18n_unormalizer_h *normalizer)
 Gets an i18n_unormalizer_h which uses the specified data file and composes or decomposes text according to the specified mode.
int i18n_unormalization_normalize (i18n_unormalizer_h normalizer, const i18n_uchar *src, int32_t len, i18n_uchar *dest, int32_t capacity, int32_t *len_deststr)
 Writes the normalized form of the source string to the destination string (replacing its contents).
int i18n_unormalization_get_nfc_instance (i18n_unormalizer_h *normalizer)
 Returns an i18n_unormalizer_h instance for Unicode NFC normalization.
int i18n_unormalization_get_nfd_instance (i18n_unormalizer_h *normalizer)
 Returns an i18n_unormalizer_h instance for Unicode NFD normalization.
int i18n_unormalization_get_nfkc_instance (i18n_unormalizer_h *normalizer)
 Returns an i18n_unormalizer_h instance for Unicode NFKC normalization.
int i18n_unormalization_get_nfkd_instance (i18n_unormalizer_h *normalizer)
 Returns an i18n_unormalizer_h instance for Unicode NFKD normalization.
int i18n_unormalization_get_nfkc_casefold_instance (i18n_unormalizer_h *normalizer)
 Returns an i18n_unormalizer_h instance for Unicode NFKC_Casefold normalization.
int i18n_unormalization_create_filtered (i18n_unormalizer_h normalizer, const i18n_uset_h *filter_set, i18n_unormalizer_h *normalizer2)
 Constructs a filtered normalizer wrapping any i18n_unormalizer_h and a filter set.
int i18n_unormalization_destroy (i18n_unormalizer_h normalizer)
 Closes an i18n_unormalizer_h from i18n_unormalization_create_filtered().
int i18n_unormalization_normalize_second_and_append (i18n_unormalizer_h normalizer, i18n_uchar *first, int32_t first_length, int32_t first_capacity, const i18n_uchar *second, int32_t second_length, int32_t *len)
 Appends the normalized form of the second string to the first string (merging them at the boundary) and returns the length of the first string.
int i18n_unormalization_append (i18n_unormalizer_h normalizer, i18n_uchar *first, int32_t first_length, int32_t first_capacity, const i18n_uchar *second, int32_t second_length, int32_t *len)
 Appends the second string to the first string (merging them at the boundary) and returns the length of the first string.
int i18n_unormalization_get_decomposition (i18n_unormalizer_h normalizer, i18n_uchar32 code_point, i18n_uchar *decomposition, int32_t capacity, int32_t *len)
 Gets the decomposition mapping of code_point.
int i18n_unormalization_get_raw_decomposition (i18n_unormalizer_h normalizer, i18n_uchar32 code_point, i18n_uchar *decomposition, int32_t capacity, int32_t *len)
 Gets the raw decomposition mapping of code_point.
int i18n_unormalization_compose_pair (i18n_unormalizer_h normalizer, i18n_uchar32 a, i18n_uchar32 b, i18n_uchar32 *c)
 Performs pairwise composition of two code points and returns the composite if there is one.
uint8_t i18n_unormalization_get_combining_class (i18n_unormalizer_h normalizer, i18n_uchar32 code_point)
 Gets the combining class of code_point.
int i18n_unormalization_is_normalized (i18n_unormalizer_h normalizer, const i18n_uchar *s, int32_t length, i18n_ubool *ret)
 Tests if the string is normalized.
int i18n_unormalization_quick_check (i18n_unormalizer_h normalizer, const i18n_uchar *s, int32_t length, i18n_unormalization_check_result_e *result)
 Tests if the string is normalized.
int i18n_unormalization_span_quick_check_yes (i18n_unormalizer_h normalizer, const i18n_uchar *s, int32_t length, int32_t *result)
 Gets the end of the normalized substring of the input string.
int i18n_unormalizer_has_boundary_before (i18n_unormalizer_h normalizer, i18n_uchar32 c, i18n_ubool *result)
 Tests if the character always has a normalization boundary before it, regardless of context.
int i18n_unormalizer_has_boundary_after (i18n_unormalizer_h normalizer, i18n_uchar32 c, i18n_ubool *result)
 Tests if the character always has a normalization boundary after it, regardless of context.
int i18n_unormalization_is_inert (i18n_unormalizer_h normalizer, i18n_uchar32 c, i18n_ubool *result)
 Tests if the character is normalization-inert.

Typedefs

typedef const void * i18n_unormalizer_h
 An opaque, immutable handle to a normalizer instance.

Typedef Documentation

typedef const void* i18n_unormalizer_h

An opaque, immutable handle to a normalizer instance.

Since :
2.3

Enumeration Type Documentation

enum i18n_unormalization_check_result_e

Result values for normalization quick check functions.

Since :
2.4
Enumerator:
I18N_UNORMALIZATION_NO 

The input string is not in the normalization form.

I18N_UNORMALIZATION_YES 

The input string is in the normalization form.

I18N_UNORMALIZATION_MAYBE 

The input string may or may not be in the normalization form.

enum i18n_unormalization_mode_e

Enumeration of constants for normalization modes. For details about standard Unicode normalization forms, and about the algorithms that are also used with custom mapping tables, see http://www.unicode.org/reports/tr15/.

Since :
2.3
Enumerator:
I18N_UNORMALIZATION_COMPOSE 

Decomposition followed by composition. Same as standard NFC when using an "nfc" instance. Same as standard NFKC when using an "nfkc" instance. For details about standard Unicode normalization forms see http://www.unicode.org/reports/tr15/

I18N_UNORMALIZATION_DECOMPOSE 

Map and reorder canonically. Same as standard NFD when using an "nfc" instance. Same as standard NFKD when using an "nfkc" instance. For details about standard Unicode normalization forms see http://www.unicode.org/reports/tr15/

I18N_UNORMALIZATION_FCD 

"Fast C or D" form. If a string is in this form, then further decomposition without reordering would yield the same form as DECOMPOSE. Text in "Fast C or D" form can be processed efficiently with data tables that are "canonically closed", that is, that provide equivalent data for equivalent text, without having to be fully normalized. Not a standard Unicode normalization form. Not a unique form: Different FCD strings can be canonically equivalent. For details see http://www.unicode.org/notes/tn5/#FCD

I18N_UNORMALIZATION_COMPOSE_CONTIGUOUS 

Compose only contiguously. Also known as "FCC" or "Fast C Contiguous". The result will often but not always be in NFC. The result will conform to FCD which is useful for processing. Not a standard Unicode normalization form. For details see http://www.unicode.org/notes/tn5/#FCC
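The COMPOSE and DECOMPOSE descriptions above imply that the standard form produced depends on both the data file name and the mode. The sketch below restates that mapping as a small lookup in plain C. The enum demo_mode and the function demo_standard_form are hypothetical names used only for illustration; they are not part of the Tizen API, and the FCD and FCC modes are left out.

```c
#include <assert.h>
#include <string.h>

/* Stand-ins for the COMPOSE and DECOMPOSE modes (hypothetical names). */
typedef enum { DEMO_COMPOSE, DEMO_DECOMPOSE } demo_mode;

/* Returns the name of the standard form selected by a built-in data file name
 * and a mode, or NULL for a combination this document does not describe. */
const char *demo_standard_form(const char *name, demo_mode mode)
{
    assert(name != NULL);
    if (strcmp(name, "nfc") == 0)
        return mode == DEMO_COMPOSE ? "NFC" : "NFD";
    if (strcmp(name, "nfkc") == 0)
        return mode == DEMO_COMPOSE ? "NFKC" : "NFKD";
    if (strcmp(name, "nfkc_cf") == 0 && mode == DEMO_COMPOSE)
        return "NFKC_Casefold";
    return NULL;
}
```

For example, demo_standard_form("nfc", DEMO_DECOMPOSE) yields "NFD", which is the combination Sample Code 1 requests.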


Function Documentation

int i18n_unormalization_append (i18n_unormalizer_h normalizer, i18n_uchar *first, int32_t first_length, int32_t first_capacity, const i18n_uchar *second, int32_t second_length, int32_t *len)

Appends the second string to the first string (merging them at the boundary) and returns the length of the first string.

The result is normalized if both the strings were normalized. The first and second strings must be different buffers.

Since :
6.0
Parameters:
[in] normalizer: An i18n_unormalizer_h
[in,out] first: A normalized string
[in] first_length: Length of the first string, or -1 if NUL-terminated
[in] first_capacity: Number of i18n_uchar that can be written to first
[in] second: String, should be normalized
[in] second_length: Length of the second string, or -1 if NUL-terminated
[out] len: The length of the first string after merging
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE: Successful
I18N_ERROR_INVALID_PARAMETER: Invalid function parameter
I18N_ERROR_OUT_OF_MEMORY: Out of memory

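int i18n_unormalization_compose_pair (i18n_unormalizer_h normalizer, i18n_uchar32 a, i18n_uchar32 b, i18n_uchar32 *c)

Performs pairwise composition of two code points and returns the composite if there is one.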

Returns a composite code point c only if c has a two-way mapping to a+b. In standard Unicode normalization, this means that c has a canonical decomposition to a+b and c does not have the Full_Composition_Exclusion property.

This function is independent of the mode of the i18n_unormalizer_h.

Since :
6.0
Parameters:
[in] normalizer: An i18n_unormalizer_h
[in] a: A (normalization starter) code point
[in] b: Another code point
[out] c: The non-negative composite code point if there is one; otherwise a negative value
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE: Successful
I18N_ERROR_INVALID_PARAMETER: Invalid function parameter
I18N_ERROR_OUT_OF_MEMORY: Out of memory
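
To illustrate what a two-way pairwise mapping looks like, the sketch below implements pairwise composition for Hangul only, using the syllable arithmetic from the Unicode Standard. It is a standalone illustration in plain C under that restriction, not the module's implementation. The function demo_compose_pair and the constant names are hypothetical; it mirrors the documented convention of yielding a negative value when no composite exists.

```c
#include <assert.h>
#include <stdint.h>

/* Constants of the Unicode Hangul syllable arithmetic (names are illustrative). */
enum {
    HANGUL_S_BASE = 0xAC00, HANGUL_L_BASE = 0x1100,
    HANGUL_V_BASE = 0x1161, HANGUL_T_BASE = 0x11A7,
    HANGUL_L_COUNT = 19, HANGUL_V_COUNT = 21, HANGUL_T_COUNT = 28,
    HANGUL_S_COUNT = HANGUL_L_COUNT * HANGUL_V_COUNT * HANGUL_T_COUNT
};

/* Hypothetical helper: returns the composite of a + b for Hangul jamo and
 * syllables, or -1 if the pair does not compose under this sketch. */
int32_t demo_compose_pair(int32_t a, int32_t b)
{
    assert(a >= 0 && b >= 0);   /* both arguments must be code points */

    /* Leading consonant + vowel -> LV syllable */
    if (a >= HANGUL_L_BASE && a < HANGUL_L_BASE + HANGUL_L_COUNT &&
        b >= HANGUL_V_BASE && b < HANGUL_V_BASE + HANGUL_V_COUNT) {
        int32_t l_index = a - HANGUL_L_BASE;
        int32_t v_index = b - HANGUL_V_BASE;
        return HANGUL_S_BASE + (l_index * HANGUL_V_COUNT + v_index) * HANGUL_T_COUNT;
    }

    /* LV syllable + trailing consonant -> LVT syllable */
    if (a >= HANGUL_S_BASE && a < HANGUL_S_BASE + HANGUL_S_COUNT &&
        (a - HANGUL_S_BASE) % HANGUL_T_COUNT == 0 &&
        b > HANGUL_T_BASE && b < HANGUL_T_BASE + HANGUL_T_COUNT) {
        return a + (b - HANGUL_T_BASE);
    }

    return -1;                  /* no composite for this pair */
}
```

With this sketch, the pair 0x1100 and 0x1161 composes to 0xAC00, the reverse of the decomposition shown in Sample Code 1.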

int i18n_unormalization_create_filtered (i18n_unormalizer_h normalizer, const i18n_uset_h *filter_set, i18n_unormalizer_h *normalizer2)

Constructs a filtered normalizer wrapping any i18n_unormalizer_h and a filter set.

Both normalizer and filter_set are aliased and must not be modified or deleted while the filtered normalizer is in use. The filter set should be frozen; otherwise the performance will suffer greatly.

Since :
6.0
Parameters:
[in] normalizer: Wrapped i18n_unormalizer_h
[in] filter_set: An i18n_uset_h which determines the characters to be normalized
[out] normalizer2: The requested filtered i18n_unormalizer_h, if successful
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE: Successful
I18N_ERROR_INVALID_PARAMETER: Invalid function parameter
I18N_ERROR_OUT_OF_MEMORY: Out of memory

int i18n_unormalization_destroy (i18n_unormalizer_h normalizer)

Closes an i18n_unormalizer_h from i18n_unormalization_create_filtered().

Do not destroy instances from i18n_unormalization_get_instance().

Since :
6.0
Parameters:
[in] normalizer: An i18n_unormalizer_h to be closed
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE: Successful
I18N_ERROR_INVALID_PARAMETER: Invalid function parameter
I18N_ERROR_OUT_OF_MEMORY: Out of memory

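uint8_t i18n_unormalization_get_combining_class (i18n_unormalizer_h normalizer, i18n_uchar32 code_point)

Gets the combining class of code_point.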

The default implementation returns 0 but all standard implementations return the Unicode Canonical_Combining_Class value.

Since :
6.0
Parameters:
[in] normalizer: An i18n_unormalizer_h
[in] code_point: Code point
Returns:
code_point's combining class
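
Combining classes drive the canonical reordering step of normalization: within a run of non-starters, marks are sorted by ascending combining class, and starters (class 0) are never crossed. The sketch below shows that step in plain C with a three-entry stand-in table. The functions demo_combining_class and demo_canonical_reorder are hypothetical illustrations, not part of the Tizen API; a real caller would take the class values from the normalizer instead of the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in lookup covering three combining marks; everything else is
 * treated as a starter (class 0) in this sketch. */
uint8_t demo_combining_class(uint32_t code_point)
{
    switch (code_point) {
    case 0x0301: return 230;    /* COMBINING ACUTE ACCENT */
    case 0x0323: return 220;    /* COMBINING DOT BELOW */
    case 0x0327: return 202;    /* COMBINING CEDILLA */
    default:     return 0;
    }
}

/* Stable insertion sort of combining marks by class. A mark moves left only
 * past marks with a strictly higher class, so it never crosses a starter. */
void demo_canonical_reorder(uint32_t *s, size_t n)
{
    assert(s != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        uint32_t cp = s[i];
        uint8_t cc = demo_combining_class(cp);
        if (cc == 0)
            continue;           /* starters stay in place */

        size_t j = i;
        while (j > 0 && demo_combining_class(s[j - 1]) > cc) {
            s[j] = s[j - 1];
            j--;
        }
        s[j] = cp;
    }
}
```

Given the sequence 0x0061, 0x0301, 0x0323, the sketch swaps the two marks so that the dot below (class 220) precedes the acute accent (class 230).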

int i18n_unormalization_get_decomposition (i18n_unormalizer_h normalizer, i18n_uchar32 code_point, i18n_uchar *decomposition, int32_t capacity, int32_t *len)

Gets the decomposition mapping of code_point.

Roughly equivalent to normalizing the string form of code_point with an I18N_UNORMALIZATION_DECOMPOSE i18n_unormalizer_h, but much faster. Unlike normalization, this function sets len to a negative value and does not write a string if code_point has no decomposition mapping in this instance's data. This function is independent of the mode of the i18n_unormalizer_h.

Since :
6.0
Parameters:
[in] normalizer: An i18n_unormalizer_h
[in] code_point: Code point
[out] decomposition: String buffer which will be set to code_point's decomposition mapping, if there is one
[in] capacity: Number of i18n_uchar that can be written to decomposition
[out] len: The non-negative length of code_point's decomposition, if there is one; otherwise a negative value
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE: Successful
I18N_ERROR_INVALID_PARAMETER: Invalid function parameter
I18N_ERROR_OUT_OF_MEMORY: Out of memory

int i18n_unormalization_get_instance (const char *package_name, const char *name, i18n_unormalization_mode_e mode, i18n_unormalizer_h *normalizer)

Gets an i18n_unormalizer_h which uses the specified data file and composes or decomposes text according to the specified mode.

Since :
2.3
Remarks:
The normalizer should not be released.
Parameters:
[in] package_name: NULL for ICU built-in data, otherwise the application data package name
[in] name: "nfc", "nfkc", "nfkc_cf", or the name of a custom data file
[in] mode: The normalization mode (compose or decompose)
[out] normalizer: The requested normalizer on success
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE: Successful
I18N_ERROR_INVALID_PARAMETER: Invalid function parameter

int i18n_unormalization_get_nfc_instance (i18n_unormalizer_h *normalizer)

Returns an i18n_unormalizer_h instance for Unicode NFC normalization.

Same as i18n_unormalization_get_instance(NULL, "nfc", I18N_UNORMALIZATION_COMPOSE, normalizer). Returns an unmodifiable singleton instance. Do not delete it.

Since :
6.0
Remarks:
The normalizer should not be released.
Parameters:
[out] normalizer: The requested i18n_unormalizer_h, if successful
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE: Successful
I18N_ERROR_INVALID_PARAMETER: Invalid function parameter
I18N_ERROR_OUT_OF_MEMORY: Out of memory

int i18n_unormalization_get_nfd_instance (i18n_unormalizer_h *normalizer)

Returns an i18n_unormalizer_h instance for Unicode NFD normalization.

Same as i18n_unormalization_get_instance(NULL, "nfc", I18N_UNORMALIZATION_DECOMPOSE, normalizer). Returns an unmodifiable singleton instance. Do not delete it.

Since :
6.0
Remarks:
The normalizer should not be released.
Parameters:
[out] normalizer: The requested i18n_unormalizer_h, if successful
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE: Successful
I18N_ERROR_INVALID_PARAMETER: Invalid function parameter
I18N_ERROR_OUT_OF_MEMORY: Out of memory

int i18n_unormalization_get_nfkc_casefold_instance (i18n_unormalizer_h *normalizer)

Returns an i18n_unormalizer_h instance for Unicode NFKC_Casefold normalization.

Same as i18n_unormalization_get_instance(NULL, "nfkc_cf", I18N_UNORMALIZATION_COMPOSE, normalizer). Returns an unmodifiable singleton instance. Do not delete it.

Since :
6.0
Remarks:
The normalizer should not be released.
Parameters:
[out] normalizer: The requested i18n_unormalizer_h, if successful
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE: Successful
I18N_ERROR_INVALID_PARAMETER: Invalid function parameter
I18N_ERROR_OUT_OF_MEMORY: Out of memory

int i18n_unormalization_get_nfkc_instance (i18n_unormalizer_h *normalizer)

Returns an i18n_unormalizer_h instance for Unicode NFKC normalization.

Same as i18n_unormalization_get_instance(NULL, "nfkc", I18N_UNORMALIZATION_COMPOSE, normalizer). Returns an unmodifiable singleton instance. Do not delete it.

Since :
6.0
Remarks:
The normalizer should not be released.
Parameters:
[out] normalizer: The requested i18n_unormalizer_h, if successful
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE: Successful
I18N_ERROR_INVALID_PARAMETER: Invalid function parameter
I18N_ERROR_OUT_OF_MEMORY: Out of memory

int i18n_unormalization_get_nfkd_instance (i18n_unormalizer_h *normalizer)

Returns an i18n_unormalizer_h instance for Unicode NFKD normalization.

Same as i18n_unormalization_get_instance(NULL, "nfkc", I18N_UNORMALIZATION_DECOMPOSE, normalizer). Returns an unmodifiable singleton instance. Do not delete it.

Since :
6.0
Remarks:
The normalizer should not be released.
Parameters:
[out] normalizer: The requested i18n_unormalizer_h, if successful
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE: Successful
I18N_ERROR_INVALID_PARAMETER: Invalid function parameter
I18N_ERROR_OUT_OF_MEMORY: Out of memory

int i18n_unormalization_get_raw_decomposition (i18n_unormalizer_h normalizer, i18n_uchar32 code_point, i18n_uchar *decomposition, int32_t capacity, int32_t *len)

Gets the raw decomposition mapping of code_point.

This is similar to the i18n_unormalization_get_decomposition() function but returns the raw decomposition mapping as specified in UnicodeData.txt or (for custom data) in the mapping files processed by the gennorm2 tool. By contrast, i18n_unormalization_get_decomposition() returns the processed, recursively-decomposed version of this mapping.

When used on a standard NFKC i18n_unormalizer_h instance, i18n_unormalization_get_raw_decomposition() returns the Unicode Decomposition_Mapping (dm) property.

When used on a standard NFC i18n_unormalizer_h instance, it returns the Decomposition_Mapping only if the Decomposition_Type (dt) is Canonical (Can); in this case, the result contains either one or two code points (1 to 4 i18n_uchar).

This function is independent of the mode of the i18n_unormalizer_h.

Since :
6.0
Parameters:
[in] normalizer: An i18n_unormalizer_h
[in] code_point: Code point
[out] decomposition: String buffer which will be set to code_point's raw decomposition mapping, if there is one
[in] capacity: Number of i18n_uchar that can be written to decomposition
[out] len: The non-negative length of code_point's raw decomposition, if there is one; otherwise a negative value
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE: Successful
I18N_ERROR_INVALID_PARAMETER: Invalid function parameter
I18N_ERROR_OUT_OF_MEMORY: Out of memory

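int i18n_unormalization_is_inert (i18n_unormalizer_h normalizer, i18n_uchar32 c, i18n_ubool *result)

Tests if the character is normalization-inert.

For details see the ICU Normalizer2 class documentation.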

Since :
6.0
Parameters:
[in] normalizer: An i18n_unormalizer_h
[in] c: Character to test
[out] result: TRUE if c is normalization-inert
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE: Successful
I18N_ERROR_INVALID_PARAMETER: Invalid function parameter
I18N_ERROR_OUT_OF_MEMORY: Out of memory

int i18n_unormalization_is_normalized (i18n_unormalizer_h normalizer, const i18n_uchar *s, int32_t length, i18n_ubool *ret)

Tests if the string is normalized.

Internally, in cases where i18n_unormalization_quick_check() would return "maybe" (which is only possible for the two COMPOSE modes), this function resolves the result to "yes" or "no" to provide a definitive answer, at the cost of doing more work in those cases.

Since :
6.0
Parameters:
[in] normalizer: An i18n_unormalizer_h
[in] s: Input string
[in] length: Length of the string, or -1 if NUL-terminated
[out] ret: TRUE if s is normalized
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE: Successful
I18N_ERROR_INVALID_PARAMETER: Invalid function parameter
I18N_ERROR_OUT_OF_MEMORY: Out of memory

int i18n_unormalization_normalize (i18n_unormalizer_h normalizer, const i18n_uchar *src, int32_t len, i18n_uchar *dest, int32_t capacity, int32_t *len_deststr)

Writes the normalized form of the source string to the destination string (replacing its contents).

The source and destination strings must be different buffers.

Since :
2.3
Parameters:
[in] normalizer: An i18n_unormalizer_h handle
[in] src: The source string
[in] len: The length of the source string, or -1 if NUL-terminated
[out] dest: The destination string; its contents are replaced with the normalized src
[in] capacity: The number of i18n_uchar that can be written to dest
[out] len_deststr: The length of the destination string
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE: Successful
I18N_ERROR_INVALID_PARAMETER: Invalid function parameter

int i18n_unormalization_normalize_second_and_append (i18n_unormalizer_h normalizer, i18n_uchar *first, int32_t first_length, int32_t first_capacity, const i18n_uchar *second, int32_t second_length, int32_t *len)

Appends the normalized form of the second string to the first string (merging them at the boundary) and returns the length of the first string.

The result is normalized if the first string was normalized. The first and second strings must be different buffers.

Since :
6.0
Parameters:
[in] normalizer: An i18n_unormalizer_h
[in,out] first: A normalized string
[in] first_length: Length of the first string, or -1 if NUL-terminated
[in] first_capacity: Number of i18n_uchar that can be written to first
[in] second: String, will be normalized
[in] second_length: Length of the second string, or -1 if NUL-terminated
[out] len: The length of the first string after merging
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE: Successful
I18N_ERROR_INVALID_PARAMETER: Invalid function parameter
I18N_ERROR_OUT_OF_MEMORY: Out of memory

int i18n_unormalization_quick_check (i18n_unormalizer_h normalizer, const i18n_uchar *s, int32_t length, i18n_unormalization_check_result_e *result)

Tests if the string is normalized.

For the two COMPOSE modes, the result could be "maybe" in cases that would take a little more work to resolve definitively. Use i18n_unormalization_span_quick_check_yes() and i18n_unormalization_normalize_second_and_append() for a faster combination of quick check + normalization, to avoid re-checking the "yes" prefix.

Since :
6.0
Parameters:
[in] normalizer: An i18n_unormalizer_h
[in] s: Input string
[in] length: Length of the string, or -1 if NUL-terminated
[out] result: Normalization check result
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE: Successful
I18N_ERROR_INVALID_PARAMETER: Invalid function parameter
I18N_ERROR_OUT_OF_MEMORY: Out of memory

int i18n_unormalization_span_quick_check_yes (i18n_unormalizer_h normalizer, const i18n_uchar *s, int32_t length, int32_t *result)

Gets the end of the normalized substring of the input string.

The returned end index is usually one or more characters before the "no" or "maybe" character, because the end index is always at a normalization boundary. (See the ICU Normalizer2 class documentation for more about normalization boundaries.)

When the goal is a normalized string and most input strings are expected to be normalized already, call this function first. If it returns a prefix shorter than the input string, copy that prefix and use i18n_unormalization_normalize_second_and_append() for the remainder.

Since :
6.0
Parameters:
[in] normalizer: An i18n_unormalizer_h
[in] s: Input string
[in] length: Length of the string, or -1 if NUL-terminated
[out] result: End index of the normalized beginning of the s string
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE: Successful
I18N_ERROR_INVALID_PARAMETER: Invalid function parameter
I18N_ERROR_OUT_OF_MEMORY: Out of memory

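int i18n_unormalizer_has_boundary_after (i18n_unormalizer_h normalizer, i18n_uchar32 c, i18n_ubool *result)

Tests if the character always has a normalization boundary after it, regardless of context.

For details see the ICU Normalizer2 class documentation.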

Since :
6.0
Parameters:
[in] normalizer: An i18n_unormalizer_h
[in] c: Character to test
[out] result: TRUE if c has a normalization boundary after it
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE: Successful
I18N_ERROR_INVALID_PARAMETER: Invalid function parameter
I18N_ERROR_OUT_OF_MEMORY: Out of memory

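int i18n_unormalizer_has_boundary_before (i18n_unormalizer_h normalizer, i18n_uchar32 c, i18n_ubool *result)

Tests if the character always has a normalization boundary before it, regardless of context.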

Since :
6.0
Parameters:
[in] normalizer: An i18n_unormalizer_h
[in] c: Character to test
[out] result: TRUE if c has a normalization boundary before it
Returns:
0 on success, otherwise a negative error value
Return values:
I18N_ERROR_NONE: Successful
I18N_ERROR_INVALID_PARAMETER: Invalid function parameter
I18N_ERROR_OUT_OF_MEMORY: Out of memory