pep-0223.txt

PEP: 223
Title: Change the Meaning of \x Escapes
Version: $Revision$
Last-Modified: $Date$
Author: tim.peters@gmail.com (Tim Peters)
Status: Final
Type: Standards Track
Created: 20-Aug-2000
Python-Version: 2.0
Post-History: 23-Aug-2000


Abstract

    Change \x escapes, in both 8-bit and Unicode strings, to consume
    exactly the two hex digits following.  The proposal views this as
    correcting an original design flaw, leading to clearer expression
    in all flavors of string, a cleaner Unicode story, better
    compatibility with Perl regular expressions, and with minimal risk
    to existing code.


Syntax

    The syntax of \x escapes, in all flavors of non-raw strings, becomes

        \xhh

    where h is a hex digit (0-9, a-f, A-F).  The exact syntax in 1.5.2 is
    not clearly specified in the Reference Manual; it says

        \xhh...

    implying "two or more" hex digits, but one-digit forms are also
    accepted by the 1.5.2 compiler, and a plain \x is "expanded" to
    itself (i.e., a backslash followed by the letter x).  It's unclear
    whether the Reference Manual intended either of the 1-digit or
    0-digit behaviors.


Semantics

    In an 8-bit non-raw string,
        \xij
    expands to the character
        chr(int(ij, 16))
    Note that this is the same as in 1.6 and before.

    In a Unicode string,
        \xij
    acts the same as
        \u00ij
    i.e. it expands to the obvious Latin-1 character from the initial
    segment of the Unicode space.

    An \x not followed by at least two hex digits is a compile-time error,
    specifically ValueError in 8-bit strings, and UnicodeError (a subclass
    of ValueError) in Unicode strings.  Note that if an \x is followed by
    more than two hex digits, only the first two are "consumed".  In 1.6
    and before all but the *last* two were silently ignored.


Example

    In 1.5.2:

        >>> "\x123465"  # same as "\x65"
        'e'
        >>> "\x65"
        'e'
        >>> "\x1"
        '\001'
        >>> "\x\x"
        '\\x\\x'
        >>>

    In 2.0:

        >>> "\x123465" # \x12 -> \022, "3456" left alone
        '\0223456'
        >>> "\x65"
        'e'
        >>> "\x1"
        [ValueError is raised]
        >>> "\x\x"
        [ValueError is raised]
        >>>


History and Rationale

    \x escapes were introduced in C as a way to specify variable-width
    character encodings.  Exactly which encodings those were, and how many
    hex digits they required, was left up to each implementation.  The
    language simply stated that \x "consumed" *all* hex digits following,
    and left the meaning up to each implementation.  So, in effect, \x in C
    is a standard hook to supply platform-defined behavior.

    Because Python explicitly aims at platform independence, the \x escape
    in Python (up to and including 1.6) has been treated the same way
    across all platforms:  all *except* the last two hex digits were
    silently ignored.  So the only actual use for \x escapes in Python was
    to specify a single byte using hex notation.

    Larry Wall appears to have realized that this was the only real use for
    \x escapes in a platform-independent language, as the proposed rule for
    Python 2.0 is in fact what Perl has done from the start (although you
    need to run in Perl -w mode to get warned about \x escapes with fewer
    than 2 hex digits following -- it's clearly more Pythonic to insist on
    2 all the time).

    When Unicode strings were introduced to Python, \x was generalized so
    as to ignore all but the last *four* hex digits in Unicode strings.
    This caused a technical difficulty for the new regular expression engine:
    SRE tries very hard to allow mixing 8-bit and Unicode patterns and
    strings in intuitive ways, and it no longer had any way to guess what,
    for example, r"\x123456" should mean as a pattern:  is it asking to match
    the 8-bit character \x56 or the Unicode character \u3456?

    There are hacky ways to guess, but it doesn't end there.  The ISO C99
    standard also introduces 8-digit \U12345678 escapes to cover the entire
    ISO 10646 character space, and it's also desired that Python 2 support
    that from the start.  But then what are \x escapes supposed to mean?
    Do they ignore all but the last *eight* hex digits then?  And if less
    than 8 following in a Unicode string, all but the last 4?  And if less
    than 4, all but the last 2?

    This was getting messier by the minute, and the proposal cuts the
    Gordian knot by making \x simpler instead of more complicated.  Note
    that the 4-digit generalization to \xijkl in Unicode strings was also
    redundant, because it meant exactly the same thing as \uijkl in Unicode
    strings.  It's more Pythonic to have just one obvious way to specify a
    Unicode character via hex notation.


Development and Discussion

    The proposal was worked out among Guido van Rossum, Fredrik Lundh and
    Tim Peters in email.  It was subsequently explained and disussed on
    Python-Dev under subject "Go \x yourself", starting 2000-08-03.
    Response was overwhelmingly positive; no objections were raised.


Backward Compatibility

    Changing the meaning of \x escapes does carry risk of breaking existing
    code, although no instances of incompabitility have yet been discovered.
    The risk is believed to be minimal.

    Tim Peters verified that, except for pieces of the standard test suite
    deliberately provoking end cases, there are no instances of \xabcdef...
    with fewer or more than 2 hex digits following, in either the Python
    CVS development tree, or in assorted Python packages sitting on his
    machine.

    It's unlikely there are any with fewer than 2, because the Reference
    Manual implied they weren't legal (although this is debatable!).  If
    there are any with more than 2, Guido is ready to argue they were buggy
    anyway <0.9 wink>.

    Guido reported that the O'Reilly Python books *already* document that
    Python works the proposed way, likely due to their Perl editing
    heritage (as above, Perl worked (very close to) the proposed way from
    its start).

    Finn Bock reported that what JPython does with \x escapes is
    unpredictable today.  This proposal gives a clear meaning that can be
    consistently and easily implemented across all Python implementations.


Effects on Other Tools

    Believed to be none.  The candidates for breakage would mostly be
    parsing tools, but the author knows of none that worry about the
    internal structure of Python strings beyond the approximation "when
    there's a backslash, swallow the next character".  Tim Peters checked
    python-mode.el, the std tokenize.py and pyclbr.py, and the IDLE syntax
    coloring subsystem, and believes there's no need to change any of
    them.  Tools like tabnanny.py and checkappend.py inherit their immunity
    from tokenize.py.


Reference Implementation

    The code changes are so simple that a separate patch will not be produced.
    Fredrik Lundh is writing the code, is an expert in the area, and will
    simply check the changes in before 2.0b1 is released.


BDFL Pronouncements

    Yes, ValueError, not SyntaxError.  "Problems with literal interpretations
    traditionally raise 'runtime' exceptions rather than syntax errors."


Copyright

    This document has been placed in the public domain.


Local Variables:
mode: indented-text
indent-tabs-mode: nil
End: