The Standard ML Basis Library


The CHAR signature

The CHAR signature defines a type char of characters and provides basic operations and predicates on values of that type. There is a linear ordering supported on characters. In addition, there is an encoding of characters into a contiguous range of non-negative integers that preserves the linear ordering.

There are two structures matching the CHAR signature. The Char structure defines a superset of the usual ASCII characters and locale-independent operations on them. For this structure, Char.maxOrd = 255.

The optional WideChar structure defines wide characters, which are represented by a fixed number of 8-bit words (bytes). If the WideChar structure is provided, it is distinct from the Char structure.


Synopsis

signature CHAR
structure Char : CHAR
structure WideChar : CHAR

Interface

eqtype char
eqtype string
val minChar : char
val maxChar : char
val maxOrd : int
val ord : char -> int
val chr : int -> char
val succ : char -> char
val pred : char -> char
val < : (char * char) -> bool
val <= : (char * char) -> bool
val > : (char * char) -> bool
val >= : (char * char) -> bool
val compare : (char * char) -> order
val contains : string -> char -> bool
val notContains : string -> char -> bool
val toLower : char -> char
val toUpper : char -> char
val isAlpha : char -> bool
val isAlphaNum : char -> bool
val isAscii : char -> bool
val isCntrl : char -> bool
val isDigit : char -> bool
val isGraph : char -> bool
val isHexDigit : char -> bool
val isLower : char -> bool
val isPrint : char -> bool
val isSpace : char -> bool
val isPunct : char -> bool
val isUpper : char -> bool
val fromString : String.string -> char option
val scan : (Char.char, 'a) StringCvt.reader -> 'a -> (char * 'a) option
val toString : char -> String.string
val fromCString : String.string -> char option
val toCString : char -> String.string

Description

eqtype char

eqtype string

minChar
is the least character in the ordering. It always equals chr 0.

maxChar
is the greatest character in the ordering.

maxOrd
is the greatest character code; equals ord maxChar.

ord c
chr i
returns the integer code of the character c and the character whose code is i, respectively. The function chr raises Chr if i < 0 or i > maxOrd. When chr is restricted to the interval [0,maxOrd], these two functions denote the character encoding function and its inverse.
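
As a minimal sketch of the encoding round trip, the following assumes the Char structure, for which maxOrd = 255. The helper safeChr is hypothetical and not part of the Basis Library; it only wraps chr so that an out-of-range code yields NONE instead of raising Chr.

```sml
(* ord and chr are mutually inverse on [0, maxOrd]. *)
val code = Char.ord #"A"              (* 65 *)
val back = Char.chr code              (* #"A" *)

(* safeChr is a hypothetical helper, not a Basis function:
   it maps out-of-range codes to NONE instead of raising Chr. *)
fun safeChr (i : int) : char option =
  SOME (Char.chr i) handle Chr => NONE

val inRange  = safeChr 97                  (* SOME #"a" *)
val tooBig   = safeChr (Char.maxOrd + 1)   (* NONE *)
val negative = safeChr ~1                  (* NONE *)
```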

succ c
returns the character immediately following c in the ordering, or raises Chr if c = maxChar. When defined, succ c is equivalent to chr(ord c + 1).

pred c
returns the character immediately preceding c, or raises Chr if c = minChar. When defined, pred c is equivalent to chr(ord c - 1).
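
The following sketch illustrates succ and pred, including the boundary case. The helper charRange is hypothetical; it is one possible way to enumerate a range of characters without ever applying succ to maxChar.

```sml
val b = Char.succ #"a"    (* #"b" *)
val y = Char.pred #"z"    (* #"y" *)

(* charRange is a hypothetical helper: all characters from lo to hi inclusive.
   It tests lo = hi before calling succ, so succ is never applied to maxChar. *)
fun charRange (lo : char, hi : char) : char list =
  if lo > hi then []
  else if lo = hi then [lo]
  else lo :: charRange (Char.succ lo, hi)

val abcde = String.implode (charRange (#"a", #"e"))   (* "abcde" *)

(* pred is undefined at minChar and raises Chr there. *)
val predFails = (Char.pred Char.minChar; false) handle Chr => true
```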

c < d
c <= d
c > d
c >= d
compare characters in the character ordering. Note that the functions ord and chr preserve orderings.

compare (c, d)
returns LESS, EQUAL, or GREATER, according as c precedes, equals, or follows d in the character ordering.
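
A short sketch of compare, assuming the Char structure. The helper laterOf is hypothetical and only shows the usual pattern of dispatching on the order result.

```sml
val r1 = Char.compare (#"a", #"b")   (* LESS *)
val r2 = Char.compare (#"b", #"b")   (* EQUAL *)
val r3 = Char.compare (#"b", #"a")   (* GREATER *)

(* laterOf is a hypothetical helper: the later of two characters in the ordering. *)
fun laterOf (c, d) =
  case Char.compare (c, d) of
      LESS => d
    | _ => c

(* The character ordering agrees with the integer ordering of the codes. *)
val agrees =
  (Char.compare (#"A", #"a") = LESS) = (Char.ord #"A" < Char.ord #"a")
```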

contains s c
returns true if character c occurs in the string s; otherwise false.
Implementation note:

In some implementations, the partial application of contains to s may build a table, which is used by the resulting function to decide whether a given character is in the string or not. Hence it may be expensive to compute val p = contains s, but fast to compute p c for any given character c.
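
The note above suggests binding the partial application once and reusing it. A minimal sketch follows; the names isVowel and countVowelsSlow are illustrative only, and whether a table is actually built depends on the implementation.

```sml
(* Partially apply contains once; if the implementation builds a table,
   it can do so here rather than on every call. *)
val isVowel = Char.contains "aeiouAEIOU"

(* Reuse the resulting predicate across many characters. *)
val vowels = List.filter isVowel (String.explode "Standard ML")
(* [#"a", #"a"] *)

(* By contrast, this form re-applies contains to the string for each character. *)
fun countVowelsSlow s =
  List.length
    (List.filter (fn c => Char.contains "aeiouAEIOU" c) (String.explode s))
```

Both forms give the same answers; they may differ only in cost.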



notContains s c
returns true if character c does not occur in the string s; false otherwise. Equivalent to not(contains s c).
Implementation note:

As with contains, notContains may be implemented via table lookup.



toLower c
toUpper c
returns the lowercase (respectively, uppercase) letter corresponding to c if c is a letter; otherwise returns c.
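
For illustration, case conversion is typically lifted to strings with String.map. The helper equalsIgnoreCase below is hypothetical and assumes the locale-independent Char structure.

```sml
(* Non-letters (punctuation, digits, spaces) are returned unchanged. *)
val upper = String.map Char.toUpper "hello, world 42"   (* "HELLO, WORLD 42" *)
val lower = String.map Char.toLower "Hello, World 42"   (* "hello, world 42" *)

(* equalsIgnoreCase is a hypothetical helper: compare after lowercasing both sides. *)
fun equalsIgnoreCase (s, t) =
  String.map Char.toLower s = String.map Char.toLower t
```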

isAlpha c
returns true if c is a letter (lowercase or uppercase).

isAlphaNum c
returns true if c is alphanumeric (a letter or a decimal digit).

isAscii c
returns true if c is a (seven-bit) ASCII character, i.e., 0 <= ord c <= 127. Note that this function is independent of locale.

isCntrl c
returns true if c is a control character. Equivalent to not o isPrint.

isDigit c
returns true if c is a decimal digit (0-9).

isGraph c
returns true if c is a graphical character, that is, it is printable and not a whitespace character.

isHexDigit c
returns true if c is a hexadecimal digit (0-9, a-f, A-F).

isLower c
returns true if c is a lowercase letter.

isPrint c
returns true if c is a printable character (space or visible), i.e., not a control character.

isSpace c
returns true if c is a whitespace character (space, newline, tab, carriage return, vertical tab, formfeed).

isPunct c
returns true if c is a punctuation character: graphical but not alphanumeric.

isUpper c
returns true if c is an uppercase letter.
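
The classification predicates combine naturally with the string and list functions of the Basis. The sketch below assumes the Char structure; isIdent is a hypothetical helper, not a Basis function.

```sml
(* Split on runs of whitespace. *)
val words = String.tokens Char.isSpace "let  val x\t= 42\n"
(* ["let", "val", "x", "=", "42"] *)

(* Keep only the decimal digits. *)
val digits = List.filter Char.isDigit (String.explode "a1b22c")
(* [#"1", #"2", #"2"] *)

(* isIdent is a hypothetical helper: a nonempty string that starts with a
   letter and continues with letters or decimal digits. *)
fun isIdent s =
  case String.explode s of
      [] => false
    | c :: cs => Char.isAlpha c andalso List.all Char.isAlphaNum cs
```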

fromString s
scan getc strm
scan a character (including space) or an SML escape sequence representing a character from the prefix of a string or a character stream. After a successful conversion, fromString ignores any additional characters in s. If no conversion is possible, e.g., if the first character is non-printable (i.e., not in the ASCII range 0x20-0x7E) or starts an illegal escape sequence, NONE is returned.

The allowable escape sequences are:

          \a       Alert (ASCII 0x07)
          \b       Backspace (ASCII 0x08)
          \t       Horizontal tab (ASCII 0x09)
          \n       Linefeed or newline (ASCII 0x0A)
          \v       Vertical tab (ASCII 0x0B)
          \f       Form feed (ASCII 0x0C)
          \r       Carriage return (ASCII 0x0D)
          \\       Backslash
          \"       Double quote
          \^c      A control character whose encoding is C - 64, where C
                   is the encoding of the character c, with C in the range
                   [64,95].
          \ddd     The character whose encoding is the number ddd, three decimal
                   digits denoting an integer in the range [0,255].
          \uxxxx   The character whose encoding is the number xxxx, 
                   four hexadecimal digits denoting an integer in the 
                   ordinal range of the alphabet.
          \f...f\  This sequence is ignored, where f...f stands for a sequence
                   of one or more formatting characters.
          

In the escape sequences involving decimal or hexadecimal digits, if the resulting value cannot be represented in the character set, NONE is returned.
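
The following sketch shows fromString and scan on a few of the escape sequences listed above. Note that each SML string literal here contains a literal backslash (written "\\"), since the functions parse the escape themselves. The reader listReader is hypothetical: a simple character source over a char list.

```sml
val plain   = Char.fromString "abc"       (* SOME #"a"; the rest is ignored *)
val newline = Char.fromString "\\n"       (* SOME #"\n" *)
val decimal = Char.fromString "\\065"     (* SOME #"A" *)
val control = Char.fromString "\\^A"      (* SOME #"\001" *)
val illegal = Char.fromString "\\q"       (* NONE: illegal escape *)

(* listReader is a hypothetical reader that treats a char list as a stream. *)
fun listReader [] = NONE
  | listReader (c :: cs) = SOME (c, cs)

(* scan returns the character together with the unconsumed stream. *)
val scanned = Char.scan listReader (String.explode "\\tabc")
(* SOME (#"\t", [#"a", #"b", #"c"]) *)
```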

toString c
returns a printable string representation of the character, using, if necessary, SML escape sequences. Printable characters, except for #"\\" and #"\"", are left unchanged. Backslash #"\\" becomes "\\\\"; double quote #"\"" becomes "\\\"". The common control characters are converted to two-character escape sequences:
          Alert (ASCII 0x07)                    "\\a"
          Backspace (ASCII 0x08)                "\\b"
          Horizontal tab (ASCII 0x09)           "\\t"
          Linefeed or newline (ASCII 0x0A)      "\\n"
          Vertical tab (ASCII 0x0B)             "\\v"
          Form feed (ASCII 0x0C)                "\\f"
          Carriage return (ASCII 0x0D)          "\\r"
          
The remaining characters whose codes are less than 32 are represented by three-character strings in "control character" notation, e.g., #"\000" maps to "\\^@", #"\001" maps to "\\^A", etc. All other characters (i.e., those whose codes are 127 or greater) are mapped to four-character strings of the form "\\ddd", where ddd are the three decimal digits corresponding to a character's code.
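
A few sample conversions, as a sketch assuming the Char structure. The helper escape is hypothetical; it applies toString to every character of a string.

```sml
val a     = Char.toString #"a"       (* "a" *)
val nl    = Char.toString #"\n"      (* "\\n": a backslash followed by n *)
val quote = Char.toString #"\""      (* "\\\"": a backslash followed by a quote *)
val nul   = Char.toString #"\000"    (* "\\^@" *)
val high  = Char.toString #"\255"    (* "\\255" *)

(* escape is a hypothetical helper: escape each character of a string in turn. *)
fun escape s = String.translate Char.toString s

val shown = escape "tab\there"       (* "tab\\there" *)
```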

fromCString s
scans a character (including space) or a C escape sequence representing a character from the prefix of a string. After a successful conversion, fromCString ignores any additional characters in s. If no conversion is possible, e.g., if the first character is non-printable (i.e., not in the ASCII range 0x20-0x7E) or starts an illegal escape sequence, NONE is returned.

The allowable escape sequences are given below (cf. Section 6.1.3.4 of the ISO C standard, ISO/IEC 9899:1990).

          \a       Alert (ASCII 0x07)
          \b       Backspace (ASCII 0x08)
          \t       Horizontal tab (ASCII 0x09)
          \n       Linefeed or newline (ASCII 0x0A)
          \v       Vertical tab (ASCII 0x0B)
          \f       Form feed (ASCII 0x0C)
          \r       Carriage return (ASCII 0x0D)
          \?       Question mark
          \\       Backslash
          \"       Double quote
          \'       Single quote
          \^c      A control character whose encoding is C - 64, where C
                   is the encoding of the character c, with C in the range
                   [64,95].
          \ddd     The character whose encoding is the number ddd, where
                   ddd consists of one to three octal digits.
          \xhh     The character whose encoding is the number hh, 
                   where hh is a sequence of hexadecimal digits.
          

In the escape sequences involving octal or hexadecimal digits, the sequence of digits is taken to be the longest sequence of such characters. If the resulting value cannot be represented in the character set, NONE is returned.
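
The sketch below contrasts C-style and SML-style parsing of the same digits: the C escape is read as octal, while the SML escape is read as decimal. It assumes the Char structure and uses only Basis functions.

```sml
val plain = Char.fromCString "A"         (* SOME #"A" *)
val nl    = Char.fromCString "\\n"       (* SOME #"\n" *)
val octal = Char.fromCString "\\101"     (* SOME #"A": octal 101 is decimal 65 *)

(* The same digits read as an SML escape are decimal, so the result differs. *)
val smlDecimal = Char.fromString "\\101" (* SOME #"e": decimal 101 *)
```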

toCString c
returns a printable string corresponding to c, with non-printable characters replaced by C escape sequences. Specifically, printable characters, except for #"\\", #"\"", #"?" and #"'" are left unchanged. Backslash #"\\" becomes "\\\\"; double quote #"\"" becomes "\\\"", question mark #"?" becomes "\\?", single quote #"'" becomes "\\'". The common control characters are converted to two-character escape sequences:
          Alert (ASCII 0x07)                    "\\a"
          Backspace (ASCII 0x08)                "\\b"
          Horizontal tab (ASCII 0x09)           "\\t"
          Linefeed or newline (ASCII 0x0A)      "\\n"
          Vertical tab (ASCII 0x0B)             "\\v"
          Form feed (ASCII 0x0C)                "\\f"
          Carriage return (ASCII 0x0D)          "\\r"
          
All other characters are represented by one to three octal digits, corresponding to a character's code, preceded by a backslash.
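
As an illustration, toCString can be lifted to whole strings with String.translate. The helper cEscape is hypothetical; the expected values shown cover only the named escapes, since the number of octal digits emitted for other characters may vary by implementation.

```sml
val q  = Char.toCString #"?"     (* "\\?" *)
val sq = Char.toCString #"'"     (* "\\'" *)
val nl = Char.toCString #"\n"    (* "\\n" *)
val a  = Char.toCString #"a"     (* "a" *)

(* cEscape is a hypothetical helper: escape each character of a string for C. *)
fun cEscape s = String.translate Char.toCString s

val lit = cEscape "say \"hi\"?\n"
(* the raw characters: say \"hi\"\?\n *)
```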


Discussion

In WideChar, the functions toLower, toUpper, isAlpha, ..., isUpper are locale-dependent. In Char, these functions are locale-independent, with the following semantics:


          isUpper c      true if #"A" <= c andalso c <= #"Z"
          isLower c      true if #"a" <= c andalso c <= #"z"
          isDigit c      true if #"0" <= c andalso c <= #"9"
          isAlpha c      true if isUpper c orelse isLower c
          isAlphaNum c   true if isAlpha c orelse isDigit c
          isHexDigit c   true if isDigit c orelse (#"a" <= c andalso c <= #"f")
                         orelse (#"A" <= c andalso c <= #"F")
          isGraph c      true if #"!" <= c andalso c <= #"~"
          isPrint c      true if isGraph c orelse c = #" "
          isPunct c      true if isGraph c andalso not (isAlphaNum c)
          isCntrl c      true if not (isPrint c)
          isSpace c      true if (#"\t" <= c andalso c <= #"\r") orelse c = #" "
          isAscii c      true if 0 <= ord c andalso ord c <= 127
          toLower c      chr (ord c + 32) if isUpper c; otherwise, c
          toUpper c      chr (ord c - 32) if isLower c; otherwise, c
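
The semantics listed above can be transcribed almost directly into SML. The following is a sketch, not the library's implementation: the primed names are hypothetical local definitions, isSpace' tests for equality with the space character, and consistent checks them against the Char structure over every character code. It should evaluate to true on an implementation whose Char follows these semantics.

```sml
(* Primed names are hypothetical local definitions mirroring the listed semantics. *)
fun isUpper' (c : char) = #"A" <= c andalso c <= #"Z"
fun isLower' (c : char) = #"a" <= c andalso c <= #"z"
fun isDigit' (c : char) = #"0" <= c andalso c <= #"9"
fun isAlpha' c = isUpper' c orelse isLower' c
fun isSpace' (c : char) = (#"\t" <= c andalso c <= #"\r") orelse c = #" "
fun toLower' c = if isUpper' c then Char.chr (Char.ord c + 32) else c
fun toUpper' c = if isLower' c then Char.chr (Char.ord c - 32) else c

(* Every character of the Char structure, from minChar to maxChar. *)
val allChars = List.tabulate (Char.maxOrd + 1, Char.chr)

(* agree (f, g) holds when f and g give equal results on every character. *)
fun agree (f, g) = List.all (fn c => f c = g c) allChars

val consistent =
  agree (isUpper', Char.isUpper) andalso agree (isLower', Char.isLower)
  andalso agree (isDigit', Char.isDigit) andalso agree (isAlpha', Char.isAlpha)
  andalso agree (isSpace', Char.isSpace)
  andalso agree (toLower', Char.toLower) andalso agree (toUpper', Char.toUpper)
```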

See Also

Locale, MultiByte, STRING


Last Modified October 6, 1997
Comments to John Reppy.
Copyright © 1997 Bell Labs, Lucent Technologies