This site is a static rendering of the Trac instance that was used by R7RS-WG1 for its work on R7RS-small (PDF), which was ratified in 2013. For more information, see Home. For a version of this page that may be more recent, see UcdCowan in WG2's repo for R7RS-large.

UcdCowan

cowan
2010-10-22 11:34:26

Unicode Character Database

The Unicode Character Database (UCD) is a set of properties defined by the Unicode Standard and applicable to characters of the Unicode repertoire. The exact list of properties varies from version to version of the UCD, so they are not enumerated here. Instead, each property supported by a particular implementation of this package is represented by a property object belonging to a unique type. Given a property (which can be retrieved in a variety of ways) and a character, the fundamental ucd-get-property-value procedure returns the value of that property when applied to that character.

Implementations may implement whatever subset of the UCD properties they choose.

Returned values must be treated as immutable by callers.

Version procedure

(ucd-version)

Returns a list of three exact integers specifying the version of the UCD that this implementation provides. There is no mechanism for providing more than one version. If the UCD version is 5.0.0, the value of (ucd-version) is (5 0 0).
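The shape of the return value can be sketched in Python rather than Scheme. This is only an illustration: the helper names `parse_version` and `ucd_version` are hypothetical, and the version reported comes from whatever UCD the Python runtime's standard `unicodedata` module was built with.

```python
import unicodedata

def parse_version(text):
    """Split a dotted version string such as "5.0.0" into three integers."""
    parts = [int(p) for p in text.split(".")]
    if len(parts) != 3:
        raise ValueError("expected major.minor.update, got %r" % (text,))
    return parts

def ucd_version():
    """Hypothetical analogue of (ucd-version): the single UCD version on offer."""
    return parse_version(unicodedata.unidata_version)

print(parse_version("5.0.0"))  # [5, 0, 0]
print(ucd_version())           # varies with the Python build
```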

Properties

Properties can be thought of as analogous to symbols, but with multiple names. Every property has a canonical name as well as zero or more aliases. Unlike symbol names, property names are case-insensitive, and in addition the presence or absence of an underscore character in a name is not meaningful. Because there is no way to construct a new property, property objects may be compared with eqv?.
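A minimal Python sketch of the name-matching rule described above, assuming that case and underscores are the only things ignored. The function names `loose_key` and `same_property_name` are hypothetical and not part of this proposal.

```python
def loose_key(name):
    """Fold a property name so that case and underscores are not significant."""
    return name.replace("_", "").lower()

def same_property_name(a, b):
    """True if a and b would name the same property under the loose rule."""
    return loose_key(a) == loose_key(b)

print(same_property_name("General_Category", "generalcategory"))  # True
print(same_property_name("White_Space", "WHITESPACE"))            # True
print(same_property_name("Script", "Script_Extensions"))          # False
```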

Properties are typed in a way that reflects the type of the values for that property. Some properties are numeric, some are string, some are boolean, and some are enumerated or catalog properties (the difference is that a catalog property typically gains new values in new UCD versions, whereas an enumerated property has a fairly closed set of values).

Property procedures

(ucd-find-property string)

Returns the property object one of whose names is string, or #f if there is no such property known to the implementation.

(ucd-properties)

Returns a list of all properties known to the implementation.

(ucd-property? obj)

Returns #t if obj is a property, and #f otherwise.

(ucd-property-name prop)

Returns a string which is the canonical name of prop.

(ucd-property-aliases prop)

Returns a list of strings which are the aliases (including the canonical name) of prop. Names that merely differ in case or underscores from any of the others are not included.

(ucd-default-value prop)

Returns the Unicode-defined default value for prop, or #f if there is none.

(ucd-property-syntax prop)

Returns a string whose value is a regular expression characterizing the valid syntax of all the values of prop, or #f if no syntax is available.
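The lookup procedures above might be modeled as follows. This is a Python sketch under stated assumptions, not an implementation of the proposal: the `Property` class and the helper names are hypothetical, the table is a three-entry subset of the real alias data, the long name is assumed to be the canonical one, and `None` stands in for #f.

```python
class Property:
    """Hypothetical stand-in for a property object: canonical name plus aliases."""
    def __init__(self, name, aliases=()):
        self.name = name
        self.aliases = [name] + list(aliases)  # canonical name listed first

def loose_key(name):
    # Case and underscores are not significant in property names.
    return name.replace("_", "").lower()

# Illustrative subset only; a real implementation would load the UCD alias data.
_PROPERTIES = [
    Property("General_Category", ["gc"]),
    Property("Canonical_Combining_Class", ["ccc"]),
    Property("Script", ["sc"]),
]
_INDEX = {loose_key(n): p for p in _PROPERTIES for n in p.aliases}

def find_property(string):
    """Analogue of (ucd-find-property string); None plays the role of #f."""
    return _INDEX.get(loose_key(string))

def properties():
    """Analogue of (ucd-properties)."""
    return list(_PROPERTIES)

def is_property(obj):
    """Analogue of (ucd-property? obj)."""
    return isinstance(obj, Property)

gc = find_property("GENERAL_category")
print(gc.name)                            # General_Category
print(gc.aliases)                         # ['General_Category', 'gc']
print(gc is find_property("gc"))          # True
print(find_property("No_Such_Property"))  # None
```

Because every name of a property maps to the same object, an identity comparison (`is` here, eqv? in Scheme) is enough to tell properties apart.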

Enumerated property values

Property values which are booleans, numbers, or strings constitute no special problem. Enumerated and catalogued property values, however, have canonical names and aliases and are subject to the same casing and underscore rules as properties. With the exception of Unicode character names, therefore, they are represented by a disjoint object type called enums, with procedures analogous to those for properties. Property value names are not unique across properties. Like properties, enums may be compared with eqv?.

Enum procedures

(ucd-find-enum property string-or-integer)

Returns the enum object associated with property one of whose names is string-or-integer, or #f if there is no such enum known to the implementation. The property named "Canonical_Combining_Class" has integer property values, but there are enumerated aliases for some of them, so in this case either a string or an integer may appear as the second argument.

(ucd-enums prop)

Returns a list of all enums associated with prop.

(ucd-enum? obj)

Returns #t if obj is an enum, and #f otherwise.

(ucd-enum-name enum)

Returns a string or integer which is the canonical name of enum.

(ucd-enum-property enum)

Returns the property which is associated with enum.

(ucd-enum-aliases enum)

Returns a list of strings (with a possible integer) which are the aliases (including the canonical name) of enum. Names that merely differ in case or underscores from any of the others are not included.
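The enum procedures can be sketched in Python along the same lines. Everything named here is hypothetical: the `Enum` class, the helpers, the use of plain strings in place of property objects, and the choice of the integer as the canonical name for Canonical_Combining_Class values. The table is a small subset of the real value aliases, chosen to show that the name "A" means different things for different properties.

```python
class Enum:
    """Hypothetical stand-in for an enum object belonging to one property."""
    def __init__(self, prop, name, aliases=()):
        self.prop = prop      # owning property (a plain string in this sketch)
        self.name = name      # canonical name: a string or an integer
        self.aliases = [name] + list(aliases)

def loose_key(name):
    # Integers are kept as they are; strings ignore case and underscores.
    if isinstance(name, int):
        return name
    return name.replace("_", "").lower()

# Illustrative subset only.
_ENUMS = [
    Enum("Canonical_Combining_Class", 0, ["NR", "Not_Reordered"]),
    Enum("Canonical_Combining_Class", 230, ["A", "Above"]),
    Enum("East_Asian_Width", "Ambiguous", ["A"]),
]
_INDEX = {(e.prop, loose_key(n)): e for e in _ENUMS for n in e.aliases}

def find_enum(prop, string_or_integer):
    """Analogue of (ucd-find-enum property string-or-integer); None is #f."""
    return _INDEX.get((prop, loose_key(string_or_integer)))

def enums(prop):
    """Analogue of (ucd-enums prop)."""
    return [e for e in _ENUMS if e.prop == prop]

above = find_enum("Canonical_Combining_Class", 230)
print(above is find_enum("Canonical_Combining_Class", "above"))  # True
print(find_enum("Canonical_Combining_Class", "A").name)          # 230
print(find_enum("East_Asian_Width", "A").name)                   # Ambiguous
print(find_enum("East_Asian_Width", 230))                        # None
```

Keying the index on the pair of property and name is what keeps the two meanings of "A" apart.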

Retrieving property values

(ucd-get-property-value codepoint prop)

Returns the boolean, string, number, or enum which represents the value of prop at codepoint, which can be a character or an exact integer.
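A rough Python model of this retrieval step, using the standard `unicodedata` module as the data source. The function `get_property_value` and the four-entry table are assumptions made for illustration. Properties are identified by plain strings, and enumerated values come back as short strings such as "Lu" rather than as enum objects.

```python
import unicodedata

# Each supported property maps to a function of a one-character string.
_GETTERS = {
    "General_Category": unicodedata.category,
    "Canonical_Combining_Class": unicodedata.combining,
    "Bidi_Mirrored": lambda ch: bool(unicodedata.mirrored(ch)),
    # None is returned for characters that have no numeric value.
    "Numeric_Value": lambda ch: unicodedata.numeric(ch, None),
}

def get_property_value(codepoint, prop):
    """Accept either a one-character string or an integer code point."""
    ch = chr(codepoint) if isinstance(codepoint, int) else codepoint
    return _GETTERS[prop](ch)

print(get_property_value("A", "General_Category"))              # Lu
print(get_property_value(0x0301, "Canonical_Combining_Class"))  # 230
print(get_property_value("(", "Bidi_Mirrored"))                 # True
print(get_property_value(0x00BD, "Numeric_Value"))              # 0.5
```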

Property predicates

There is a group of predicates, whose semantics is specified by the Unicode Standard, that specify a set of standard characteristics that Unicode properties may have. They have names of the form ucd-*-property?, where * may be any of: obsolete, deprecated, stabilized, numeric, string, binary, enumerated, catalogued, miscellaneous, irg, mapping, dictionary-index, reading, dictionary-like, radical-stroke, variant, normative, informative, contributory, provisional. In all cases the return value is #t or #f.
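Since the predicates differ only in the characteristic they test, one way to model them is to generate each from a table of flags. The Python sketch below is an assumption-laden illustration: the table layout, the helper `make_predicate`, and the predicate names are hypothetical, and the four rows are a small subset whose flags follow the property table in UAX #44 as best understood here.

```python
# Characteristics per property, keyed by canonical name (illustrative subset).
_FLAGS = {
    "General_Category": {"enumerated", "normative"},
    "Script": {"catalogued", "normative"},
    "White_Space": {"binary", "normative"},
    "Numeric_Value": {"numeric", "normative"},
}

def make_predicate(flag):
    """Build the analogue of ucd-<flag>-property? for one characteristic."""
    def predicate(prop_name):
        return flag in _FLAGS.get(prop_name, set())
    return predicate

is_binary_property = make_predicate("binary")
is_normative_property = make_predicate("normative")
is_deprecated_property = make_predicate("deprecated")

print(is_binary_property("White_Space"))           # True
print(is_binary_property("General_Category"))      # False
print(is_normative_property("Script"))             # True
print(is_deprecated_property("General_Category"))  # False
```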

TODO: character names, blocks, standardized variants, named sequences.