|  | 
 | This directory contains source code for the SQLite "ICU" extension, an | 
 | integration of the "International Components for Unicode" library with | 
 | SQLite. Documentation follows. | 
 |  | 
 |     1. Features | 
 |      | 
 |         1.1  SQL Scalars upper() and lower() | 
 |         1.2  Unicode Aware LIKE Operator | 
 |         1.3  ICU Collation Sequences | 
 |         1.4  SQL REGEXP Operator | 
 |      | 
 |     2. Compilation and Usage | 
 |      | 
 |     3. Bugs, Problems and Security Issues | 
 |      | 
 |         3.1  The "case_sensitive_like" Pragma | 
 |         3.2  The SQLITE_MAX_LIKE_PATTERN_LENGTH Macro | 
 |         3.3  Collation Sequence Security Issue | 
 |  | 
 |  | 
 | 1. FEATURES | 
 |  | 
 |   1.1  SQL Scalars upper() and lower() | 
 |  | 
 |     SQLite's built-in implementations of these two functions only  | 
 |     provide case mapping for the 26 letters used in the English | 
 |     language. The ICU based functions provided by this extension | 
 |     provide case mapping, where defined, for the full range of  | 
 |     unicode characters. | 
 |  | 
 |     ICU provides two types of case mapping, "general" case mapping and | 
 |     "language specific". Refer to ICU documentation for the differences | 
 |     between the two. Specifically: | 
 |  | 
 |        http://www.icu-project.org/userguide/caseMappings.html | 
 |        http://www.icu-project.org/userguide/posix.html#case_mappings | 
 |  | 
 |     To utilise "general" case mapping, the upper() or lower() scalar  | 
 |     functions are invoked with one argument: | 
 |  | 
 |         upper('ABC') -> 'abc' | 
 |         lower('abc') -> 'ABC' | 
 |  | 
 |     To access ICU "language specific" case mapping, upper() or lower() | 
 |     should be invoked with two arguments. The second argument is the name | 
 |     of the locale to use. Passing an empty string ("") or SQL NULL value | 
 |     as the second argument is the same as invoking the 1 argument version | 
 |     of upper() or lower(): | 
 |  | 
 |         lower('I', 'en_us') -> 'i' | 
 |         lower('I', 'tr_tr') -> 'ı' (small dotless i) | 
 |  | 
 |   1.2  Unicode Aware LIKE Operator | 
 |  | 
 |     Similarly to the upper() and lower() functions, the built-in SQLite LIKE | 
 |     operator understands case equivalence for the 26 letters of the English | 
 |     language alphabet. The implementation of LIKE included in this | 
 |     extension uses the ICU function u_foldCase() to provide case | 
 |     independent comparisons for the full range of unicode characters.   | 
 |  | 
 |     The U_FOLD_CASE_DEFAULT flag is passed to u_foldCase(), meaning the | 
 |     dotless 'I' character used in the Turkish language is considered | 
 |     to be in the same equivalence class as the dotted 'I' character | 
 |     used by many languages (including English). | 
 |  | 
 |   1.3  ICU Collation Sequences | 
 |  | 
 |     A special SQL scalar function, icu_load_collation() is provided that  | 
 |     may be used to register ICU collation sequences with SQLite. It | 
 |     is always called with exactly two arguments, the ICU locale  | 
 |     identifying the collation sequence to ICU, and the name of the | 
 |     SQLite collation sequence to create. For example, to create an | 
 |     SQLite collation sequence named "turkish" using Turkish language | 
 |     sorting rules, the SQL statement: | 
 |  | 
 |         SELECT icu_load_collation('tr_TR', 'turkish'); | 
 |  | 
 |     Or, for Australian English: | 
 |  | 
 |         SELECT icu_load_collation('en_AU', 'australian'); | 
 |  | 
 |     The identifiers "turkish" and "australian" may then be used | 
 |     as collation sequence identifiers in SQL statements: | 
 |  | 
 |         CREATE TABLE aust_turkish_penpals( | 
 |           australian_penpal_name TEXT COLLATE australian, | 
 |           turkish_penpal_name    TEXT COLLATE turkish | 
 |         ); | 
 |    | 
 |   1.4 SQL REGEXP Operator | 
 |  | 
 |     This extension provides an implementation of the SQL binary | 
 |     comparision operator "REGEXP", based on the regular expression functions | 
 |     provided by the ICU library. The syntax of the operator is as described | 
 |     in SQLite documentation: | 
 |  | 
 |         <string> REGEXP <re-pattern> | 
 |  | 
 |     This extension uses the ICU defaults for regular expression matching | 
 |     behaviour. Specifically, this means that: | 
 |  | 
 |         * Matching is case-sensitive, | 
 |         * Regular expression comments are not allowed within patterns, and | 
 |         * The '^' and '$' characters match the beginning and end of the | 
 |           <string> argument, not the beginning and end of lines within | 
 |           the <string> argument. | 
 |  | 
 |     Even more specifically, the value passed to the "flags" parameter | 
 |     of ICU C function uregex_open() is 0. | 
 |  | 
 |  | 
 | 2  COMPILATION AND USAGE | 
 |  | 
 |   The easiest way to compile and use the ICU extension is to build | 
 |   and use it as a dynamically loadable SQLite extension. To do this | 
 |   using gcc on *nix: | 
 |  | 
 |     gcc -shared icu.c `icu-config --ldflags` -o libSqliteIcu.so | 
 |  | 
 |   You may need to add "-I" flags so that gcc can find sqlite3ext.h | 
 |   and sqlite3.h. The resulting shared lib, libSqliteIcu.so, may be | 
 |   loaded into sqlite in the same way as any other dynamically loadable | 
 |   extension. | 
 |  | 
 |  | 
 | 3 BUGS, PROBLEMS AND SECURITY ISSUES | 
 |  | 
 |   3.1 The "case_sensitive_like" Pragma | 
 |  | 
 |     This extension does not work well with the "case_sensitive_like" | 
 |     pragma. If this pragma is used before the ICU extension is loaded, | 
 |     then the pragma has no effect. If the pragma is used after the ICU | 
 |     extension is loaded, then SQLite ignores the ICU implementation and | 
 |     always uses the built-in LIKE operator. | 
 |  | 
 |     The ICU extension LIKE operator is always case insensitive. | 
 |  | 
 |   3.2 The SQLITE_MAX_LIKE_PATTERN_LENGTH Macro | 
 |  | 
 |     Passing very long patterns to the built-in SQLite LIKE operator can | 
 |     cause excessive CPU usage. To curb this problem, SQLite defines the | 
 |     SQLITE_MAX_LIKE_PATTERN_LENGTH macro as the maximum length of a | 
 |     pattern in bytes (irrespective of encoding). The default value is | 
 |     defined in internal header file "limits.h". | 
 |      | 
 |     The ICU extension LIKE implementation suffers from the same  | 
 |     problem and uses the same solution. However, since the ICU extension | 
 |     code does not include the SQLite file "limits.h", modifying | 
 |     the default value therein does not affect the ICU extension. | 
 |     The default value of SQLITE_MAX_LIKE_PATTERN_LENGTH used by | 
 |     the ICU extension LIKE operator is 50000, defined in source  | 
 |     file "icu.c". | 
 |  | 
 |   3.3 Collation Sequence Security Issue | 
 |  | 
 |     Internally, SQLite assumes that indices stored in database files | 
 |     are sorted according to the collation sequence indicated by the | 
 |     SQL schema. Changing the definition of a collation sequence after | 
 |     an index has been built is therefore equivalent to database | 
 |     corruption. The SQLite library is not very well tested under | 
 |     these conditions, and may contain potential buffer overruns | 
 |     or other programming errors that could be exploited by a malicious | 
 |     programmer. | 
 |  | 
 |     If the ICU extension is used in an environment where potentially | 
 |     malicious users may execute arbitrary SQL (i.e. gears), they | 
 |     should be prevented from invoking the icu_load_collation() function, | 
 |     possibly using the authorisation callback. |