XML Character Classes

XML Schema and XPath regular expressions support the usual six shorthand character classes, plus four more. These four aren't supported by any other regular expression flavor. \i matches any character that may be the first character of an XML name. \c matches any character that may occur after the first character in an XML name. \I and \C are the respective negated shorthands. Note that the \c shorthand syntax conflicts with the control character syntax used in many other regex flavors, where \cM, for example, matches Ctrl+M.
You can use these four shorthands both inside and outside character classes using the bracket notation. They're very useful for validating XML references and values in your XML schemas. The regular expression \i\c* matches an XML name like xml:schema.
The regex <\i\c*\s*> matches an opening XML tag without any attributes. </\i\c*\s*> matches any closing tag. <\i\c*(\s+\i\c*\s*=\s*("[^"]*"|'[^']*'))*\s*> matches an opening tag with any number of attributes. Putting it all together, <(\i\c*(\s+\i\c*\s*=\s*("[^"]*"|'[^']*'))*|/\i\c*)\s*> matches either an opening tag with attributes or a closing tag.
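As a rough illustration, the combined tag regex above can be tried in Python. Python's re module does not support \i and \c, so this sketch assumes plain ASCII names and substitutes [_:A-Za-z] for \i and [-._:A-Za-z0-9] for \c. The names NAME, ATTRIBUTE, TAG and is_tag are made up for this example.

```python
import re

# ASCII-only stand-ins for \i and \c (assumption: all names are plain ASCII).
NAME_START = r"[_:A-Za-z]"
NAME_CHAR = r"[-._:A-Za-z0-9]"
NAME = NAME_START + NAME_CHAR + "*"

# One attribute: whitespace, a name, "=", and a quoted value.
ATTR_VALUE = r'''(?:"[^"]*"|'[^']*')'''
ATTRIBUTE = r"\s+" + NAME + r"\s*=\s*" + ATTR_VALUE

# Opening tag with any number of attributes, or a closing tag.
TAG = re.compile("<(?:" + NAME + "(?:" + ATTRIBUTE + ")*|/" + NAME + r")\s*>")


def is_tag(text):
    """Return True if text is exactly one opening or closing tag."""
    return TAG.fullmatch(text) is not None


print(is_tag('<img src="a.png" alt=\'x\'>'))
print(is_tag("</xml:schema >"))
print(is_tag("<1abc>"))
```

Like the regex it is based on, this sketch does not accept self-closing tags such as <br/>.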
No other regex flavors discussed in this tutorial support XML character classes. If your XML files are plain ASCII, you can use [_:A-Za-z] for \i and [-._:A-Za-z0-9] for \c. If you want to allow all Unicode characters that the XML standard allows, then you will end up with some pretty long regexes. You would have to use

[:A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]

instead of \i and

[-.0-9:A-Z_a-z\u00B7\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u037D\u037F-\u1FFF\u200C-\u200D\u203F\u2040\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]

instead of \c.
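The long classes are easier to manage when they are built up from named pieces. The following is a minimal sketch in Python, whose re module understands \uFFFF escapes inside character classes. The helper is_xml_name is a made-up name for this example. Like the classes above, it only covers characters up to U+FFFD and leaves out the supplementary planes.

```python
import re

# Replacement for \i: characters that may start an XML name.
NAME_START = (
    r"[:A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    r"\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    r"\uF900-\uFDCF\uFDF0-\uFFFD]"
)

# Replacement for \c: characters that may follow the first one.
NAME_CHAR = (
    r"[-.0-9:A-Z_a-z\u00B7\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u037D"
    r"\u037F-\u1FFF\u200C-\u200D\u203F\u2040\u2070-\u218F\u2C00-\u2FEF"
    r"\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD]"
)

# Equivalent of \i\c* in flavors without XML character classes.
XML_NAME = re.compile(NAME_START + NAME_CHAR + "*")


def is_xml_name(text):
    """Return True if the whole string is a valid XML name."""
    return XML_NAME.fullmatch(text) is not None


print(is_xml_name("xml:schema"))
print(is_xml_name("1st"))
```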
