Learn to Write Regular Expressions in 60 Minutes
This tutorial covers the main aspects of regular expressions. If you are new to regular expressions, please read this article carefully and in the given order.
A regular expression “engine” is a piece of software that can process regular expressions, trying to match the pattern to the given string. As usual in the software world, different regular expression engines are not fully compatible with each other. It is not possible to describe every kind of engine and regular expression syntax (or “flavor”) in this tutorial. I will focus on the regex flavor used by Perl 5, for the simple reason that this regex flavor is the most popular one, and deservedly so.
Many more recent regex engines are very similar, but not identical, to the one in Perl 5. One example is the open source PCRE engine (used in many tools and languages, such as PHP). Below is the list of topics we are going to cover; each topic is discussed in detail.
Literal Characters and Special Characters
The simplest regular expressions contain only literal characters. Certain characters have special meanings in a regular expression and need to be escaped with a backslash when you want to match them literally.
For example, to match the literal text (111), the correct regex is \(111\), because ( and ) have a special meaning.
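When the literal text comes from a variable, escaping by hand is error-prone. A minimal sketch, assuming PHP's built-in preg_quote() is acceptable for the job (the sample strings are made up for illustration):

```php
<?php
// Hypothetical input: text that should be matched literally.
$literal = '(111)';

// preg_quote() escapes regex metacharacters; the second argument
// also escapes the delimiter used below (@).
$escaped = preg_quote($literal, '@');
echo $escaped, "\n";   // \(111\)

$found = preg_match('@' . $escaped . '@', 'Call (111) now', $m);
print_r($m);           // [0] => (111)
```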
How a Regex Engine Works Internally
First we need to understand the internals of the regular expression engine, because the other topics depend on it. Knowing how the engine works will help you craft regexes that match what you want and do not match what you do not intend.
Character Classes or Character Sets
A character class or character set matches a single character out of several possible characters, consisting of individual characters and/or ranges of characters. A negated character class matches a single character not in the character class. Shorthand character classes allow you to use common sets quickly.
[characterset], [^characterset]
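```php
<?php
$str = 'Which language you like most php or asp? New website launched, You can post your comment.';
// The character class [chi] matches a single c, h or i.
$regex = "@[chi]@";
preg_match_all($regex, $str, $a);
print_r($a);
?>
```
Output:
```
Array
(
    [0] => Array
        (
            [0] => h
            [1] => i
            [2] => c
            [3] => h
            [4] => i
            [5] => h
            [6] => i
            [7] => c
            [8] => h
            [9] => c
            [10] => c
        )

)
```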
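The paragraph above also mentions negated and shorthand character classes. A small sketch of both, using a made-up sample string:

```php
<?php
$str = 'php 5, asp 3';

// Negated class: every character that is NOT a lowercase letter or a space.
preg_match_all('@[^a-z ]@', $str, $notLetters);
print_r($notLetters[0]);   // 5 , 3

// Shorthand class: \d is short for [0-9].
preg_match_all('@\d@', $str, $digits);
print_r($digits[0]);       // 5 3
```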
The dot matches any character, though usually not line break characters unless you change an option.
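```php
<?php
$str = 'Which language you like most php or asp? New website launched, You can post your comment.';
// .* matches as many characters as possible, up to a line break.
$regex = "@.*@";
preg_match($regex, $str, $a);
print_r($a);
?>
```
Output:
```
Array
(
    [0] => Which language you like most php or asp? New website launched, You can post your comment.
)
```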
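To see the line break behaviour of the dot explicitly, here is a short sketch with a made-up two-line string. It compares the default behaviour with the /s modifier, which is covered in more detail later in this tutorial:

```php
<?php
$str = "first line\nsecond line";

// By default the dot stops at the line break.
preg_match('@.*@', $str, $a);
print_r($a);   // [0] => first line

// With the s modifier the dot also matches the line break.
preg_match('@.*@s', $str, $b);
print_r($b);   // [0] => both lines
```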
Start of String and End of String Anchors
Anchors are zero-width. They do not match any characters, but rather a position. The caret (^) and the dollar ($) sign match at the start and the end of the string. Depending on your regex flavor and its options, they can match at the start and the end of a line as well.
Start of string Anchor:
```php
<?php
// Start of string anchor
$str = 'Which language you like most php or asp?';
$regex = "@^Which@";
preg_match($regex, $str, $a);
print_r($a);
?>
```
Output:
```
Array
(
    [0] => Which
)
```
End of string anchor:
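A minimal sketch of an end-of-string match, mirroring the start-of-string example above. The pattern is an assumption chosen for illustration, and the question mark has to be escaped because it is a special character:

```php
<?php
// End of string anchor
$str = 'Which language you like most php or asp?';

// 'asp?' sits at the very end of the string, so the anchor succeeds.
$regex = '@asp\?$@';
preg_match($regex, $str, $a);
print_r($a);   // [0] => asp?

// 'php' occurs only in the middle of the string, so this fails.
$noMatch = preg_match('@php$@', $str);
var_dump($noMatch);   // int(0)
```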
Word boundaries are like anchors, but match at the start of a word and the end of a word. However, most regex flavors define the concept of a “word” differently than your English teacher in grade school.
```php
<?php
$str = 'Which language you like most php or asp ?';
$regex = "@php|asp@"; // the expression is written between the @ delimiters
preg_match($regex, $str, $a);
print_r($a);
?>
```
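A short sketch of the \b word boundary token itself, using a made-up string. Note that digits and underscores count as word characters, which is one way the regex notion of a "word" differs from the everyday one:

```php
<?php
$str = 'php phpinfo php5';

// Without boundaries, 'php' is found inside all three words.
$plain = preg_match_all('@php@', $str, $a);

// With \b on both sides, only the standalone word matches:
// 'phpinfo' and 'php5' continue with a word character after 'php'.
$whole = preg_match_all('@\bphp\b@', $str, $b);

echo "$plain $whole\n";   // 3 1
```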
Alternation
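Alternation is the regex "or" operator. By separating different sub-regexes with vertical bars, you can tell the regex engine to attempt them from left to right, and return success as soon as one of them can be matched. The engine tries all alternatives at one position in the string before moving on to the next position, so the alternative that occurs earliest in the string wins, as the second call in the example shows.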
```php
<?php
$str = 'Which language you like most php or asp ?';
$regex = "@php|asp@";
preg_match($regex, $str, $a);
print_r($a);

$str = 'Which language you like most asp or php?';
$regex = "@php|asp@";
preg_match($regex, $str, $a);
print_r($a);
?>
```
Output:
```
Array
(
    [0] => php
)
Array
(
    [0] => asp
)
```
Putting a question mark after an item tells the regex engine to match the item if possible, but continue anyway (rather than admit defeat) if it cannot be matched. The question mark makes the preceding token in the regular expression optional.
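```php
<?php
$str = 'moustache';
$str1 = 'mustache';
// The ? makes the preceding 'o' optional, so both spellings match.
$regex = "@mo?ustache@";
preg_match($regex, $str, $a);
print_r($a);
echo "\r\n";
preg_match($regex, $str1, $b);
print_r($b);
?>
```
Output:
```
Array
(
    [0] => moustache
)

Array
(
    [0] => mustache
)
```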
Repetition Using Various Quantifiers
The quantifiers *, +, ? and {} allow you to repeat an item zero or more times, once or more, zero or once, or an arbitrary number of times, respectively.
* :
```php
<?php
$string = 'vijay';
// [0-9]* allows zero or more digits, so 'vijay' alone matches.
$pattern = '@vijay[0-9]*@';
preg_match($pattern, $string, $a);
print_r($a);
?>
```
Output:
```
Array
(
    [0] => vijay
)
```
+ :
```php
<?php
// [0-9]+ requires at least one digit, so 'vijay' alone does not match.
$string = 'vijay';
$pattern = '@vijay[0-9]+@';
preg_match($pattern, $string, $a);
print_r($a);

$string = 'vijay123';
$pattern = '@vijay[0-9]+@';
preg_match($pattern, $string, $a);
print_r($a);
?>
```
Output:
```
Array
(
)
Array
(
    [0] => vijay123
)
```
{m,n} :
```php
<?php
// [0-9]{2,3} requires two or three digits, so 'vijay1' does not match.
$string = 'vijay1';
$pattern = '@vijay[0-9]{2,3}@';
preg_match($pattern, $string, $a);
print_r($a);

$string = 'vijay123';
$pattern = '@vijay[0-9]{2,3}@';
preg_match($pattern, $string, $a);
print_r($a);
?>
```
Output:
```
Array
(
)
Array
(
    [0] => vijay123
)
```
Grouping and Backreferences
By placing round brackets around part of the regex, you tell the engine to treat that part as a single item when applying operators such as quantifiers. With round brackets, you can also create backreferences that allow you to reuse the text matched by part of the regex inside the regular expression, or later in the replacement text of a search and replace operation. Backreferences are also very useful for extracting parts from a string in a programming language.
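```php
<?php
$string = 'vijay123';
// Group 1 captures the name, group 2 captures the digits.
$pattern = '@(vijay)([0-9]+)@';
preg_match($pattern, $string, $a);
print_r($a);
?>
```
Output:
```
Array
(
    [0] => vijay123
    [1] => vijay
    [2] => 123
)
```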
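The example above only extracts the groups. The two other uses mentioned, a backreference inside the regex and one in the replacement text, could be sketched as follows (the sample strings are made up):

```php
<?php
// \1 inside the pattern must repeat the text captured by group 1,
// which makes it easy to find a doubled word.
$doubled = preg_match('@\b(\w+) \1\b@', 'this is is a test', $m);
print_r($m);   // [0] => is is, [1] => is

// $1 and $2 in the replacement text refer to the captured groups.
$swapped = preg_replace('@(vijay)([0-9]+)@', '$2-$1', 'vijay123');
echo $swapped, "\n";   // 123-vijay
```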
Unicode Characters and Properties
If your regular expression flavor supports Unicode, then you can use special Unicode regex tokens to match specific Unicode characters, or to match any character that has a certain Unicode property or is part of a particular Unicode script or block.
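A minimal sketch in PHP, assuming a PCRE build with Unicode support (the usual case) and PHP 7 or later for the "\u{...}" string escapes. The u modifier tells PCRE to treat the pattern and subject as UTF-8:

```php
<?php
// Sample UTF-8 string containing two Greek letters (alpha and beta).
$str = "alpha \u{03B1} beta \u{03B2} Gamma";

// \p{Greek} matches any character from the Greek script.
preg_match_all('/\p{Greek}/u', $str, $greek);
print_r($greek[0]);

// \p{Lu} matches any character with the "uppercase letter" property.
preg_match_all('/\p{Lu}/u', $str, $upper);
print_r($upper[0]);   // G
```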
Mode modifiers change matching modes, such as “case insensitive”, for the whole regular expression or for specific parts of it. The four most common matching modes are:
- /i case-insensitive mode.
- /s single-line mode. The dot also matches newlines.
- /m multi-line mode. In this mode the caret matches after each newline and the dollar matches before each newline.
- /x free-spacing mode. In this mode, whitespace between regex tokens is ignored.
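The examples that follow apply a modifier to the whole pattern. To switch a mode on for only part of a regex, PCRE also accepts inline modifiers; a brief sketch with a made-up subject string:

```php
<?php
$str = 'ViJay';

// (?i) at the start behaves like the trailing /i modifier.
$whole = preg_match('/(?i)vijay/', $str);

// (?i:...) limits case insensitivity to the group:
// 'Vi' must match exactly, 'jay' may be in any case.
$part = preg_match('/Vi(?i:jay)/', $str);

// Here the case-sensitive 'vi' does not match 'Vi'.
$partFail = preg_match('/vi(?i:jay)/', $str);

var_dump($whole, $part, $partFail);   // int(1) int(1) int(0)
```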
Case insensitive mode:
```php
<?php
// Case insensitive mode
$str = "ViJay";
$pattern = "/vijay/i";
preg_match($pattern, $str, $a);
print_r($a);
?>
```
Output:
```
Array
(
    [0] => ViJay
)
```
Single line mode:
```php
<?php
// single line mode
$str = "There are lots of regesx engine, but we will not discuss all of them
we will discuss pcre engine.
visit my website http://webtechtuts.com for more details.";
$pattern = "/.*/s";
preg_match($pattern, $str, $a);
print_r($a);
?>
```
Output:
```
Array
(
    [0] => There are lots of regesx engine, but we will not discuss all of them
we will discuss pcre engine.
visit my website http://webtechtuts.com for more details.
)
```
Multi line mode:
```php
<?php
// multi line mode: the caret and dollar also match at the start and end of each line
$str = "There are lots of regesx engine, but we will not discuss all of them
we will discuss pcre engine.
visit my website http://webtechtuts.com for more details.";
$pattern = "/^visit/m";
preg_match($pattern, $str, $a);
print_r($a);
?>
```
Output:
```
Array
(
    [0] => visit
)
```
Free spacing mode:
```php
<?php
// free spacing mode: whitespace between regex tokens is ignored,
// but the space inside the character class still matches a space.
$str = "There are lots of regesx engine, but we will not discuss all of them
we will discuss pcre engine.
visit my website http://webtechtuts.com for more details.";
$pattern = "/[a-z A-Z]* .*/x";
preg_match($pattern, $str, $a);
print_r($a);
?>
```
Output:
```
Array
(
    [0] => There are lots of regesx engine, but we will not discuss all of them
)
```
Atomic Grouping and Possessive Quantifiers
Nested quantifiers can cause an exponentially increasing amount of backtracking that brings the regex engine to a grinding halt. Atomic grouping and possessive quantifiers provide a solution.
```php
<?php
// With the atomic group, vij(?>ay|a)y matches only vijayy but not vijay.
$str = "vijayy";
$pattern = "/vij(?>ay|a)y/";
preg_match_all($pattern, $str, $a);
print_r($a);
?>
```
Output:
```
Array
(
    [0] => Array
        (
            [0] => vijayy
        )

)
```
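To make the difference visible, the sketch below compares a normal group with an atomic group on the string vijay, and a greedy quantifier with its possessive counterpart. The test strings are chosen for illustration:

```php
<?php
// Normal group: after 'ay' fails to leave a 'y', the engine
// backtracks, tries 'a' instead, and the match succeeds.
$normal = preg_match('/vij(?:ay|a)y/', 'vijay');

// Atomic group: once 'ay' has matched, the engine may not go back
// and try 'a', so the final 'y' can never be matched.
$atomic = preg_match('/vij(?>ay|a)y/', 'vijay');

// Greedy \d+ gives back one digit so that the trailing \d can match.
$greedy = preg_match('/^\d+\d$/', '123');

// Possessive \d++ keeps every digit it matched, so the trailing \d fails.
$possessive = preg_match('/^\d++\d$/', '123');

var_dump($normal, $atomic, $greedy, $possessive);   // int(1) int(0) int(1) int(0)
```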
Lookaround with Zero-Width Assertions
Lookahead and lookbehind (collectively lookaround) are zero-width. With positive lookaround, you can specify multiple requirements (sub-regexes) to be applied to the same part of the string. With negative lookaround, you can invert the result of a regex match (i.e. match something that does not match something else). ^ looks behind, to see that there are no characters before. $ looks ahead, to see that there are no characters after. \b looks both ahead and behind, to see if the characters on either side differ in their ‘word’-ness. The lookahead assertion is denoted by (?=regexp) and the lookbehind assertion is denoted by (?<=fixed-regexp). Their negative counterparts are (?!regexp) and (?<!fixed-regexp).
```php
<?php
$string = "I catch the housecat 'Tom-cat' with catnip";

$pattern = "/cat(?=\s+)/";          // matches 'cat' in 'housecat'
preg_match($pattern, $string, $a);
print_r($a);

$pattern = "/(?<=\s)cat\w+/";       // matches 'catch' and 'catnip'
preg_match_all($pattern, $string, $a);
print_r($a);

$pattern = "/\bcat\b/";             // matches 'cat' in 'Tom-cat'
preg_match($pattern, $string, $a);
print_r($a);

$pattern = "/(?<=\s)cat(?=\s)/";    // doesn't match; no isolated 'cat' in the middle of $string
preg_match($pattern, $string, $a);
print_r($a);
?>
```
Output:
```
Array
(
    [0] => cat
)
Array
(
    [0] => Array
        (
            [0] => catch
            [1] => catnip
        )

)
Array
(
    [0] => cat
)
Array
(
)
```
Continuing from The Previous Match Attempt
Forcing a regex match to start at the end of a previous match provides an efficient way to parse text data. The \G token is supported by the JGsoft engine, .NET, Java, Perl and PCRE.
E.g. you could parse an HTML file in the following fashion:
```perl
while ($string =~ m/</g) {
    # /g lets the match update the position; /c keeps it on failure
    if ($string =~ m/\GB>/gc) {
        # Bold
    } elsif ($string =~ m/\GI>/gc) {
        # Italics
    } else {
        # ...etc...
    }
}
```
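The same token can be tried from PHP. A small sketch, assuming preg_match_all() and a made-up subject string: with \G each match has to begin exactly where the previous one ended, so scanning stops at the first character that does not fit.

```php
<?php
$str = '123abc45';

// With \G the matches must be contiguous from the start of the string,
// so the digits after 'abc' are never reached.
preg_match_all('/\G\d/', $str, $withG);
print_r($withG[0]);      // 1 2 3

// Without \G the engine skips over 'abc' and keeps going.
preg_match_all('/\d/', $str, $withoutG);
print_r($withoutG[0]);   // 1 2 3 4 5
```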
Combining Positive and Negative Lookaround with Conditionals
A conditional is a special construct that will first evaluate a condition (a lookaround, or a check whether a capturing group took part in the match), and then execute one sub-regex if the condition succeeds, and another sub-regex if it fails.
```php
<?php
$pattern = '/^(<)?[a-z]+(?(1)>)$/';
var_dump(preg_match($pattern, '<test>')); // match
var_dump(preg_match($pattern, '<foo'));   // no match
?>
```
Output:
```
int(1)
int(0)
```
In the example above, (1) refers to the first subpattern (<), which is optional since it is followed by a question mark. Only if that subpattern took part in the match does the regex require a closing bracket.
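A conditional can also test a lookaround instead of a capturing group. The following sketch uses a made-up pattern: if the string starts with a digit it must be exactly three digits, otherwise it must consist of lowercase letters only.

```php
<?php
// (?(?=\d) ... | ... ): the lookahead is the condition,
// \d{3} is the "yes" branch and [a-z]+ is the "no" branch.
$pattern = '/^(?(?=\d)\d{3}|[a-z]+)$/';

$digits  = preg_match($pattern, '123');   // condition true, \d{3} matches
$letters = preg_match($pattern, 'abc');   // condition false, [a-z]+ matches
$mixed   = preg_match($pattern, '12a');   // condition true, \d{3} fails

var_dump($digits, $letters, $mixed);      // int(1) int(1) int(0)
```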
XML Character Classes
XML Schema regular expressions support four shorthand character classes to match XML names. They also introduce a handy feature called “character class subtraction”, which is now also available in the JGsoft and .NET regex engines.
XML Schema Regular Expressions support the usual six shorthand character classes, plus four more. These four aren’t supported by any other regular expression flavor. \i matches any character that may be the first character of an XML name, i.e. [_:A-Za-z]. \c matches any character that may occur after the first character in an XML name, i.e. [-._:A-Za-z0-9]. \I and \C are the respective negated shorthands. Note that the \c shorthand syntax conflicts with the control character syntax used in many other regex flavors.
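Since PCRE does not provide \i and \c as XML shorthands, a rough equivalent in PHP is to spell out the character classes quoted above. This is only an ASCII approximation written for illustration, not a full XML name validator:

```php
<?php
// First character: [_:A-Za-z] (stands in for \i).
// Following characters: [-._:A-Za-z0-9] (stands in for \c).
$xmlName = '/^[_:A-Za-z][-._:A-Za-z0-9]*$/';

$valid   = preg_match($xmlName, 'xs:element');   // starts with a letter
$invalid = preg_match($xmlName, '1element');     // starts with a digit

var_dump($valid, $invalid);   // int(1) int(0)
```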
POSIX Bracket Expressions
If you are using a POSIX-compliant regular expression engine, you can use POSIX bracket expressions to match locale-dependent characters.
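PCRE, the engine behind PHP's preg functions, also accepts the POSIX class names inside a bracket expression. A minimal sketch with a made-up sample string:

```php
<?php
$str = 'Room 42, Floor 7';

// [[:digit:]] is the POSIX class for digits; the outer brackets
// are the normal character class, the inner ones name the POSIX class.
preg_match_all('/[[:digit:]]+/', $str, $numbers);
print_r($numbers[0]);   // 42 7

// [[:upper:]] matches uppercase letters.
preg_match_all('/[[:upper:]]/', $str, $caps);
print_r($caps[0]);      // R F
```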
Some regex flavors allow you to add comments to make complex regular expressions easier to understand.
```php
<?php
$number = '123 555 6789';

var_dump(preg_match("/^(1[-\s.])?(\()?\d{3}(?(2)\))[-\s.]?\d{3}[-\s.]?\d{4}$/", $number));

var_dump(preg_match("/^
    (1[-\s.])?   # optional '1-', '1.' or '1 '
    ( \( )?      # optional opening parenthesis
    \d{3}        # the area code
    (?(2) \) )   # if there was an opening parenthesis, close it
    [-\s.]?      # followed by '-' or '.' or space
    \d{3}        # first 3 digits
    [-\s.]?      # followed by '-' or '.' or space
    \d{4}        # last 4 digits
$/x", $number));
?>
```
Output:
```
int(1)
int(1)
```
Splitting a regular expression into multiple lines, adding comments and whitespace, makes it even more readable.
