= String Normalization =

== Introduction ==

There are four Unicode normalization forms defined in
[http://unicode.org/reports/tr15/ UAX #15], corresponding to two types
of character equivalence.

The first, and most important, is canonical equivalence.  This is
equivalence between sequences of codepoints which represent the same
abstract character.  This includes:

  * pre-composed characters and their separate base+combining forms (see the sketch after this list)
  * pre-composed Hangul and their Jamo sequences
  * unified singleton characters (e.g. mapping the Ohm sign to the Greek capital Omega by which it is represented)
  * canonical ordering for combining marks when multiple combining marks are used

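
For example, "é" can be written as the single pre-composed codepoint
U+00E9 or as "e" followed by the combining acute accent U+0301.  A
minimal sketch of the difference, using the R6RS normalization
procedures from `(rnrs unicode)':

{{{
(import (rnrs))

;; The same abstract character as two different codepoint sequences:
(define e-composed   "\x00E9;")    ; U+00E9 LATIN SMALL LETTER E WITH ACUTE
(define e-decomposed "e\x0301;")   ; U+0065 + U+0301 COMBINING ACUTE ACCENT

(string=? e-composed e-decomposed)               ; => #f
(string=? (string-normalize-nfc e-composed)
          (string-normalize-nfc e-decomposed))   ; => #t
(string=? (string-normalize-nfd e-composed)
          (string-normalize-nfd e-decomposed))   ; => #t
}}}
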
The second character equivalence is compatibility equivalence, which
holds when two characters represent the same abstract character but
differ in appearance or behavior.  Differences include:

  * Font variants - cursive, bold, mathematical
  * Breaking differences - different hyphen and space types
  * Circled characters
  * Width, size, rotated - variations common in Japanese scripts
  * Superscripts and subscripts - 2**5 becomes "2"+"5"
  * Squared characters - Japanese words written in a single character square
  * Fractions - representing the single 1/2 character as "1"+"/"+"2"

These two equivalences can be combined to provide four normalization
forms (a short example follows the list):

  * NFD - canonical decomposition
  * NFC - canonical decomposition, followed by canonical composition
  * NFKD - compatibility decomposition
  * NFKC - compatibility decomposition, followed by canonical composition

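
The two kinds of equivalence behave differently under these forms:
U+00C5 ("Å") has a canonical decomposition, while U+00B2 (superscript
two) has only a compatibility one, so only the K forms touch it.  A
minimal sketch, again with the R6RS procedures:

{{{
(import (rnrs))

;; U+00C5 decomposes canonically to "A" + U+030A COMBINING RING ABOVE:
(string-normalize-nfd "\x00C5;")    ; => "A\x030A;"
(string-normalize-nfc "\x00C5;")    ; => "\x00C5;"

;; U+00B2 SUPERSCRIPT TWO has only a compatibility decomposition, so
;; the canonical forms leave it alone and the K forms flatten it:
(string-normalize-nfd  "\x00B2;")   ; => "\x00B2;"
(string-normalize-nfkd "\x00B2;")   ; => "2"
(string-normalize-nfkc "\x00B2;")   ; => "2"
}}}
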
Order matters, since the canonical and compatibility normalizations
are not commutative.  Normalizing by performing canonical composition
followed by compatibility decomposition is not a defined form in the
standard.

Note that neither of these equivalences ignores control characters,
nor do they attempt to unify characters which merely look alike, such
as Latin and Cyrillic "o", so spoofing remains a security issue
regardless of Unicode normalization.

== Problems with Normalization ==

The problem inherent in a character set as complex as Unicode is that
the "same" (by either of the definitions above) character can have
multiple representations.  Thus the same conceptual word may be
represented with different sets of codepoints, and fail to match as a
search term, database key, filename, etc.  For example, Googling the
same word in its NFC and NFD forms will return different sets of
results, which makes no sense.  Since web search is inherently an
approximate art to begin with, this is perhaps not directly perceived
as a "bug" by users, though certainly their experience would be
improved if searching gave the same results.

Googling for [http://www.google.com/search?q=unicode+normalization+bug unicode normalization bug],
on the other hand, turns up a large number of hits for interoperability
problems with filenames, and with systems sharing data between
multiple users such as wikis.

A large part of the problem is that whereas Windows and Linux tend to
default to NFC normalization, and Mac OS X input tends to produce NFC,
the Mac HFS+ filesystem automatically normalizes filenames to NFD (as
well as doing full Unicode case-folding).  But even if this were not
the case, problems would arise, just less predictably.

== Approaches to Normalization ==

R6RS provided four separate procedures to convert explicitly between
the four Unicode normalization forms.  This is the obvious choice, and
is what most other languages do.  Whenever normalization matters (yes,
every programmer working with any kind of text needs to understand
normalization and when it is necessary), you need to convert all input
to a chosen normalization form.  Or, more likely, ignore it until you
get a rare but inevitable normalization bug, then go back to your
design and figure out where all the relevant inputs are.  But it's
what everyone else does, so at least we won't get laughed at for this
choice.

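
Under this approach the burden is on the application: every boundary
where text enters the program has to funnel through the chosen form.
A minimal sketch, arbitrarily picking NFC as the program-wide form
(`intern-text' and `read-line-normalized' are hypothetical helpers):

{{{
(import (rnrs))

;; Normalize anything crossing into the program.  Every read,
;; command-line argument, and filename needs the same treatment.
(define (intern-text str)
  (string-normalize-nfc str))

(define (read-line-normalized port)
  (let ((line (get-line port)))
    (if (eof-object? line)
        line
        (intern-text line))))
}}}
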
Another option is to leave all strings in their original form and just
compare them in a normalization-insensitive manner, e.g. with
`string-ni=?'.  If you want to hash with these you'll also need
`string-hash-ni', and if you want to sort strings or use them in
search trees you'll need `string-ni<?'.  Searching will require
`string-contains-ni', though this could (very inefficiently) be built
using `string-ni=?'.  In general, anything that needs to be
normalization-independent needs to be built specially.  And you get a
lot of duplicated work.  And the programmer still needs to remember to
actually _use_ this API where appropriate instead of `string=?' and
the like.

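
A sketch of how such an API could be bootstrapped (the `-ni' names
are the hypothetical procedures above, here built naively on the
explicit R6RS normalizers; a real implementation would avoid
re-normalizing on every call):

{{{
(import (rnrs))

(define (string-ni=? a b)
  (string=? (string-normalize-nfc a) (string-normalize-nfc b)))

(define (string-ni<? a b)
  (string<? (string-normalize-nfc a) (string-normalize-nfc b)))

(string-ni=? "\x00E9;" "e\x0301;")   ; => #t
}}}
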
How can we make things easier for the programmer?  One approach is to
globally represent all strings in the same normalization form in
memory.  This is mostly a matter of normalizing on port input, but
also involves some simple checks on endpoints when concatenating
strings.  String mutation is more complicated as well, though I think
strings should be immutable anyway.  The advantage is that this is all
done once, at the implementation level.  The programmer never needs to
worry about normalization, and can just compare with `string=?' and
`string<?' as before (so existing code works too).  Normalization
then becomes an encoding issue, only of concern when working with an
external system or format that expects a normalization form different
from your own.

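
The concatenation check can be sketched as follows, assuming NFC as
the internal form (`combining-mark?' is a stand-in for a predicate a
real implementation would derive from the Unicode Character Database):

{{{
(import (rnrs))

;; Sketch: covers only the basic combining diacritics block.
(define (combining-mark? ch)
  (char<=? #\x0300 ch #\x036F))

;; Appending two internally-NFC strings: the result can only fall out
;; of NFC when the seam could compose, e.g. "e" ++ "\x0301;...", so
;; re-normalize just in that case.
(define (string-append/nfc a b)
  (let ((raw (string-append a b)))
    (if (and (positive? (string-length a))
             (positive? (string-length b))
             (combining-mark? (string-ref b 0)))
        (string-normalize-nfc raw)
        raw)))
}}}
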
Automated normalization is not a complete silver bullet, though.  The
fundamental problem is that we're working with codepoints when
conceptually most of the time we want to be working with graphemes.
The problem can be seen when searching for a string ending in a base
character in a document which has that same string followed by a
combining character (possible in any of the normal forms).  Even when
both are in the same normal form the search will report a match, but
the strings don't actually match.  If we were comparing grapheme by
grapheme, on the other hand, it would correctly reject the partial
match.  And assuming graphemes are all internally normalized, this
would provide the same benefits as automatically normalized strings.
So using graphemes instead of codepoints (or layered over codepoints)
as our basic unit of strings looks like a promising way to simplify
the programmer's life.

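
Concretely, a sketch using SRFI-13's `string-contains' for the
codepoint-level search (import spelling varies by implementation):

{{{
(import (rnrs) (srfi :13))

;; Both strings are already in NFD, so normalization can't help here:
(define haystack "cafe\x0301;")   ; "café" -- the last grapheme is e + U+0301
(define needle   "cafe")

(string-contains haystack needle)   ; => 0, a codepoint-level "match"

;; Grapheme-wise the haystack is c a f é: the needle's trailing "e"
;; matches only the base of the final grapheme, so this is a false hit.
}}}
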
Of course, with the exception of the first option, none of these has
been specified formally, much less implemented, much less tested and
used in real applications.  But they deserve consideration.
Unfortunately, the first option, by exposing direct control over
codepoint sequences, makes it impossible to implement either of the
last two options, so it seems premature to standardize on it in WG1.

A practical compromise would be to provide a single `string-normalize'
procedure, which converts a string to a system-specific normalization
form.  In an ASCII-only implementation, or an auto-normalizing
implementation, this could be the identity function.  Other
implementations would simply choose a single preferred form.  Then
programmers writing portable code could code just as they do in the
first option, with the caveat that the normalization form has been
chosen for them.  This procedure would be sufficient to implement the
API described in the second option, though more optimized versions
could be provided by the implementation.  Control of explicit
normalization forms would be a matter for I/O and/or byte-vectors,
probably in a separate WG2 module.  And APIs that make the
programmer's life easier would at least be possible.
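
A sketch of the compromise, for an implementation that happens to pick
NFC (`string-normalize' is the proposed procedure, not an existing
standard binding):

{{{
(import (rnrs))

;; The single system-specific normalizer.  An ASCII-only or
;; auto-normalizing implementation would define this as the identity.
(define (string-normalize str)
  (string-normalize-nfc str))

;; The option-2 API then falls out portably:
(define (string-ni=? a b)
  (string=? (string-normalize a) (string-normalize b)))
}}}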