Non-Printable Characters

You can use special character sequences to put non-printable characters in your regular expression. Use \t to match a tab character (ASCII 0x09), \r for carriage return (0x0D) and \n for line feed (0x0A). More exotic non-printables are \a (bell, 0x07), \e (escape, 0x1B), \f (form feed, 0x0C) and \v (vertical tab, 0x0B). Remember that Windows text files use \r\n to terminate lines, while UNIX text files use \n.

Most regex flavors also support the tokens \cA through \cZ to insert ASCII control characters. The letter after the backslash is always a lowercase c. The second letter is an uppercase letter A through Z, to indicate Control+A through Control+Z. These are equivalent to \x01 through \x1A (26 decimal). E.g. \cM matches a carriage return, just like \r, \x0D, and \u000D. Most flavors allow the second letter to be lowercase, with no difference in meaning. Only Java requires the A to Z to be uppercase.

Using characters other than letters after \c is not recommended because the behavior is inconsistent between applications. Some allow any character after \c while other allow ASCII characters. The application may take the last 5 bits that character index in the code page or its Unicode code point to form an ASCII control character. Or the application may just flip bit 0x40. Either way \c@ through \c_ would match control characters 0x00 through 0x1F. But \c* might match a line feed or the letter j. The asterisk is character 0x2A in the ASCII table, so the lower 5 bits are 0x0A while flipping bit 0x40 gives 0x6A. Metacharacters indeed lose their meaning immediately after \c in applications that support \cA through \cZ for matching control characters.

In XML Schema regular expressions and XPath, \c is a shorthand character class that matches any character allowed in an XML name.

If your regular expression engine supports Unicode, you can use \uFFFF or \x{FFFF} to insert a Unicode character. The euro currency sign occupies Unicode code point U+20AC. If you cannot type it on your keyboard, you can insert it into a regular expression with \u20AC or \x{20AC}. See the tutorial section on Unicode for more details on matching Unicode code points.

If your regex engine works with 8-bit code pages instead of Unicode, then you can include any character in your regular expression if you know its position in the character set that you are working with. In the Latin-1 character set, the copyright symbol is character 0xA9. So to search for the copyright symbol, you can use \xA9. Another way to search for a tab is to use \x09. Note that the leading zero is required. In Tcl 8.5 and prior you have to be careful with this syntax, because Tcl used to eat up all hexadecimal characters after \x and treat the last 4 as a Unicode code point. So \xA9ABC20AC would match the euro symbol. Tcl 8.6 only takes the first two hexadecimal digits as part of the \x, as all other regex flavors do, so \xA9ABC20AC matches ©ABC20AC.

Many applications also support octal escapes in the form of \0377 or \377, where 377 is the octal representation of the character's position in the character set (255 decimal in this case). There is a lot of variation between regex flavors as to the number of octal digits allowed or required after the backslash, whether the leading zero is required or not allowed, and whether \0 without additional digits matches a NULL byte. In some flavors this causes complications as \1 to \77 can be octal escapes 1 to 63 (decimal) or backreferences 1 to 77 (decimal), depending on how many capturing groups there are in the regex. Therefore, using these octal escapes in regexes is strongly discouraged. Use hexadecimal escapes instead.

Perl 5.14, PCRE 8.34, PHP 5.5.10, and R 3.0.3 support a new syntax \o{377} for octal escapes. You can have any number of octal digits between the curly braces, with or without leading zero. There is no confusion with backreferences and literal digits that follow are cleanly separated by the closing curly brace. Do be careful to only put octal digits between the curly braces. In Perl, \o{whatever} is not an error but matches a NULL byte.

Non-Printable Characters

Post a Comment

0 Comments

Popular Posts

9ice – Ike Kan

Visual Basic 2012 Lesson 11- Looping

Maradona is planning to replace Sepp Blatter as Fifa President

Facebook

Categories

Tags

Recent Posts

Categories

Tags

Recent in News

Footer Menu Widget

Non-Printable Characters

You may like these posts

Post a Comment

0 Comments

Popular Posts

9ice – Ike Kan

Visual Basic 2012 Lesson 11- Looping

Maradona is planning to replace Sepp Blatter as Fifa President

Facebook

Categories

Tags

Recent Posts

Categories

Tags

Recent in News

Footer Menu Widget