Regex – Character Classes or Character Sets
With a “character class”, also known as a “character set”, we can tell the regex engine to match only one out of several characters. To do that, put the characters you want to match between square brackets []. E.g. if you want to match a v or an i, use [vi].
```php
<?php
$pattern = "@[vi]@";

$string = "vijay";
preg_match($pattern, $string, $a); // will match v
print_r($a);

$string = "inder";
preg_match($pattern, $string, $a); // will match i
print_r($a);
?>
```
Output:
```
Array
(
    [0] => v
)
Array
(
    [0] => i
)
```
If you want to match an a or an e, use [ae]. You could use this in gr[ae]y to match either gray or grey. This is very useful if you do not know whether the document you are searching through is written in American or British English.
A character class matches only a single character. gr[ae]y will not match graay, graey or any such thing. The order of the characters inside a character class does not matter. The results are identical.
A hyphen inside a character class specifies a range of characters. [0-9] matches a single digit between 0 and 9. You can use more than one range. [0-9a-fA-F] matches a single hexadecimal digit, case insensitively. You can combine ranges and single characters. [0-9a-fxA-FX] matches a hexadecimal digit or the letter X. Again, the order of the characters and the ranges does not matter.
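As a small sketch of how ranges behave in PHP, the snippet below collects every character that falls inside the ranges. The sample strings and variable names are made up for illustration.

```php
<?php
// Collect every hexadecimal digit found in the sample string.
preg_match_all('/[0-9a-fA-F]/', '1aG-9fZ', $hexDigits);
print_r($hexDigits[0]); // 1, a, 9, f (G, - and Z are outside the ranges)

// Ranges combined with the single characters x and X in the same class.
preg_match_all('/[0-9a-fxA-FX]/', '0x1F', $withX);
print_r($withX[0]); // 0, x, 1, F
```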
Useful Examples
Find a word, even if it is misspelled, such as sep[ae]r[ae]te or li[cs]en[cs]e.
Find an identifier in a programming language with [A-Za-z_][A-Za-z_0-9]*.
Find a C-style hexadecimal number with 0[xX][A-Fa-f0-9]+.
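The three patterns above can be tried out with preg_match. This is only a sketch; the subject strings and variable names are invented for the demonstration.

```php
<?php
// Misspelling-tolerant search: accepts "separate" as well as "seperate".
preg_match('/sep[ae]r[ae]te/', 'Please seperate the files', $word);
echo $word[0], "\n"; // seperate

// Identifier: a letter or underscore, then letters, digits or underscores.
preg_match('/[A-Za-z_][A-Za-z_0-9]*/', '42 + _total9 * x', $identifier);
echo $identifier[0], "\n"; // _total9

// C-style hexadecimal number.
preg_match('/0[xX][A-Fa-f0-9]+/', 'mask = 0x1F & flags', $hex);
echo $hex[0], "\n"; // 0x1F
```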
Negated Character Classes
Putting a caret ^ right after the opening square bracket negates the character class. The result is that the character class will match any character that is not in the character class.
```php
<?php
$pattern = "@[^vi]@";

$string = "vijay";
preg_match($pattern, $string, $a); // will match j
print_r($a);

$string = "inder";
preg_match($pattern, $string, $a); // will match n
print_r($a);
?>
```
Output:
```
Array
(
    [0] => j
)
Array
(
    [0] => n
)
```
It is important to remember that a negated character class must still match a character. v[^i] does not mean “a v not followed by an i”. It means “a v followed by a character that is not an i”. v[^i] will not match the v in the string Garv, because nothing follows that v. It will match the v, together with the space after it, in Garv is a Bollywood movie.
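A quick way to see this in PHP is to run v[^i] against both strings. The variable names here are illustrative only.

```php
<?php
// "Garv" ends with the v, so [^i] has no character left to match.
$foundAtEnd = preg_match('/v[^i]/', 'Garv', $atEnd);
var_dump($foundAtEnd); // int(0)

// Here the v is followed by a space, which is a character that is not an i.
preg_match('/v[^i]/', 'Garv is a Bollywood movie', $inSentence);
var_dump($inSentence[0]); // string(2) "v "
```

Note that the match is two characters long: the negated class consumes the space.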
Shorthand Character Classes
Since certain character classes are used often, a series of shorthand character classes is available. \d is short for [0-9].
\w stands for “word character”, usually [A-Za-z0-9_].
```php
<?php
$pattern = "@[a-zA-Z0-9_]@";
$string = "vijay";
preg_match($pattern, $string, $a);
print_r($a);

// The shorthand \w gives the same result here.
$pattern = "@[\w]@";
preg_match($pattern, $string, $b);
print_r($b);
?>
```
Output:
```
Array
(
    [0] => v
)
Array
(
    [0] => v
)
```
\s stands for “whitespace character”. Which characters this actually includes depends on the regex flavor. In all flavors discussed in this tutorial, it includes [ \t\r\n]. That is: \s will match a space, a tab or a line break. Some flavors include additional, rarely used non-printable characters such as vertical tab and form feed.
Some flavors match only the ASCII characters listed in the previous paragraphs. In other flavors, letters, digits and space characters from other languages or Unicode are also included in the shorthand classes. Notice that JavaScript uses ASCII for \d and \w, but Unicode for \s. XML does it the other way around. Python offers flags to control what the shorthands should match.
Shorthand character classes can be used both inside and outside the square brackets. \s\d matches a whitespace character followed by a digit. [\s\d] matches a single character that is either whitespace or a digit. When applied to 1 + 2 = 3, the former regex will match " 2" (a space followed by a two), while the latter matches 1 (one). [\da-fA-F] matches a hexadecimal digit, and is equivalent to [0-9a-fA-F].
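The difference between a shorthand outside and inside the brackets can be checked with a short PHP sketch. The variable names and the sample string g7Bz are assumptions for the demonstration.

```php
<?php
$subject = '1 + 2 = 3';

// Outside brackets: two tokens in sequence, whitespace and then a digit.
preg_match('/\s\d/', $subject, $sequence);
var_dump($sequence[0]); // string(2) " 2"

// Inside brackets: one token that accepts either whitespace or a digit.
preg_match('/[\s\d]/', $subject, $either);
var_dump($either[0]); // string(1) "1"

// [\da-fA-F] picks out the same characters as [0-9a-fA-F].
preg_match_all('/[\da-fA-F]/', 'g7Bz', $hex);
print_r($hex[0]); // 7, B
```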
Negated Shorthand Character Classes
The above three shorthands also have negated versions. \D is the same as [^\d], \W is short for [^\w] and \S is the equivalent of [^\s].
```php
<?php
// \D is the same as [^\d]
$pattern = "@[^\d]@";
$string = "123vijay";
preg_match($pattern, $string, $a);
print_r($a);

$pattern = "@[\D]@";
preg_match($pattern, $string, $a);
print_r($a);
?>
```
Output:
```
Array
(
    [0] => v
)
Array
(
    [0] => v
)
```
Be careful when using the negated shorthands inside square brackets. [\D\S] is not the same as [^\d\s]. The latter will match any character that is not a digit or whitespace. So it will match x, but not 8. The former, however, will match any character that is either not a digit, or is not whitespace. Because a digit is not whitespace, and whitespace is not a digit, [\D\S] will match any character, digit, whitespace or otherwise.
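To make the difference concrete, the following sketch tests a letter, a digit and a space against both classes. The variable names are made up for the example.

```php
<?php
$neitherDigitNorSpace = '/[^\d\s]/'; // not a digit AND not whitespace
$notDigitOrNotSpace   = '/[\D\S]/';  // not a digit OR not whitespace

foreach (['x', '8', ' '] as $char) {
    printf(
        "'%s': [^\\d\\s] => %d, [\\D\\S] => %d\n",
        $char,
        preg_match($neitherDigitNorSpace, $char),
        preg_match($notDigitOrNotSpace, $char)
    );
}
// 'x': [^\d\s] => 1, [\D\S] => 1
// '8': [^\d\s] => 0, [\D\S] => 1
// ' ': [^\d\s] => 0, [\D\S] => 1
```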
Repeating Character Classes
If you repeat a character class by using the ?, * or + operators, you will repeat the entire character class, and not just the character that it matched. The regex [0-9]+ can match 837 as well as 222.
```php
<?php
$pattern = "@[0-9]+@";

$string = "837";
preg_match($pattern, $string, $a); // will match 837
print_r($a);

$string = "222";
preg_match($pattern, $string, $a); // will match 222
print_r($a);
?>
```
Output:
```
Array
(
    [0] => 837
)
Array
(
    [0] => 222
)
```
If you want to repeat the matched character, rather than the class, you will need to use backreferences. ([0-9])\1+ will match 222 but not 837. When applied to the string 833337, it will match 3333 in the middle of this string. If you do not want that, you need to use lookahead and lookbehind.
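The backreference behaviour described above can be sketched in PHP as follows. The pattern is written in single quotes so that PHP passes \1 to the regex engine unchanged.

```php
<?php
$pattern = '/([0-9])\1+/'; // a digit, then that same digit one or more times

var_dump(preg_match($pattern, '222', $same));  // int(1)
echo $same[0], "\n";                           // 222

var_dump(preg_match($pattern, '837', $mixed)); // int(0)

preg_match($pattern, '833337', $middle);
echo $middle[0], "\n";                         // 3333
```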
Metacharacters Inside Character Classes
Note that the only special characters or metacharacters inside a character class are the closing bracket (]), the backslash (\), the caret (^) and the hyphen (-). The usual metacharacters are normal characters inside a character class, and do not need to be escaped by a backslash. To search for a star or plus, use [+*]. Your regex will work fine if you escape the regular metacharacters inside a character class, but doing so significantly reduces readability.
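```php
<?php
// Single quotes keep PHP from interpreting the backslashes. \\\\ reaches the
// regex engine as \\, which matches one literal backslash.
$pattern = '@[\-\^\]\\\\a-zA-Z0-9_]*@';
$string = "^vijay]-123";
preg_match($pattern, $string, $a);
print_r($a);
?>
```
Output:
```
Array
(
    [0] => ^vijay]-123
)
```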
To include a backslash as a character without any special meaning inside a character class, you have to escape it with another backslash. [\\x] matches a backslash or an x. The closing bracket (]), the caret (^) and the hyphen (-) can be included by escaping them with a backslash, or by placing them in a position where they do not take on their special meaning. I recommend the latter method, since it improves readability. To include a caret, place it anywhere except right after the opening bracket. [x^] matches an x or a caret. You can put the closing bracket right after the opening bracket, or the negating caret. []x] matches a closing bracket or an x. [^]x] matches any character that is not a closing bracket or an x. The hyphen can be included right after the opening bracket, or right before the closing bracket, or right after the negating caret. Both [-x] and [x-] match an x or a hyphen.
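The positioning rules can be verified with PHP's preg functions, which use PCRE. This is a sketch with made-up subject strings. Note that a literal backslash needs four backslashes in a PHP string so that the regex engine receives two.

```php
<?php
preg_match('/[\\\\x]/', 'a\\b', $backslash); // regex [\\x]: a backslash or an x
echo $backslash[0], "\n";                    // \

preg_match('/[x^]/', 'a^b', $caret);         // caret not in first position
echo $caret[0], "\n";                        // ^

preg_match('/[]x]/', 'a]b', $bracket);       // ] right after the opening bracket
echo $bracket[0], "\n";                      // ]

preg_match('/[^]x]/', ']xa', $negated);      // anything except ] or x
echo $negated[0], "\n";                      // a

preg_match('/[-x]/', 'a-b', $hyphenFirst);   // hyphen right after the opening bracket
preg_match('/[x-]/', 'a-b', $hyphenLast);    // hyphen right before the closing bracket
echo $hyphenFirst[0], $hyphenLast[0], "\n";  // --
```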
You can use all non-printable characters in character classes just like you can use them outside of character classes. E.g. [$\u20AC] matches a dollar or euro sign, assuming your regex flavor supports Unicode.
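PHP's preg functions are one case where the notation differs: PCRE does not accept \u20AC, so the code point is written as \x{20AC} and the u modifier turns on UTF-8 handling. The following is a sketch under those assumptions, with invented subject strings. The "\u{20AC}" string escape requires PHP 7 or later.

```php
<?php
$pattern = '/[$\x{20AC}]/u'; // a dollar sign or U+20AC (euro sign)

preg_match($pattern, "5 \u{20AC}", $euro); // subject contains a euro sign
var_dump($euro[0] === "\u{20AC}");         // bool(true)

preg_match($pattern, 'cost: $5', $dollar);
var_dump($dollar[0]);                      // string(1) "$"
```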
The JGsoft engine, Perl and PCRE also support the \Q…\E sequence inside character classes to escape a string of characters. E.g. [\Q[-]\E] matches [, - or ].
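Since PHP's preg functions are built on PCRE, the \Q…\E example can be tried directly. A minimal sketch, with a made-up subject string:

```php
<?php
// Everything between \Q and \E is taken literally, so the class holds [, - and ].
preg_match_all('/[\Q[-]\E]/', 'a[b-c]d', $literals);
print_r($literals[0]); // [, -, ]
```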
POSIX regular expressions treat the backslash as a literal character inside character classes. This means you can’t use backslashes to escape the closing bracket (]), the caret (^) and the hyphen (-). To use these characters, position them as explained above in this section. This also means that special tokens like shorthands are not available in POSIX regular expressions.