Regex – Repetition Using Various Quantifiers

July 27th, 2011 | 5 Comments | Posted in regex by Vijay Gupta

Repetition with the Star (*) and Plus (+) Operators

I already discussed one repetition operator, or quantifier, the question mark (?), in my post regex – optional item. It tells the engine to match the preceding token zero times or once, in effect making it optional.
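As a quick refresher, the optional quantifier can be sketched as below. The pattern colou?r and the sample words are assumptions chosen for illustration; they are not taken from the earlier post.

```php
<?php
// Illustrative pattern (an assumption for this sketch): the "u" is optional.
$pattern = '/^colou?r$/';

// preg_match() returns 1 when the pattern matches and 0 when it does not.
$results = [];
foreach (['color', 'colour', 'colouur'] as $word) {
    $results[$word] = preg_match($pattern, $word);
}

print_r($results);
// color => 1, colour => 1, colouur => 0
```

The third word fails because ? allows the "u" at most once.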

The asterisk or star (*) tells the engine to match the preceding token zero or more times. The plus (+) tells the engine to match the preceding token once or more. For example, suppose I have an HTML document and I want to find all the opening tags of HTML elements that have no attributes.

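<?php
// Fetch the page to scan (requires allow_url_fopen to be enabled).
$document = file_get_contents("http://google.com");
// "<", one letter, zero or more letters or digits, then ">".
$pattern = "@<[A-Za-z][A-Za-z0-9]*>@";
// Collect every match into $a; $a[0] holds the full matches.
preg_match_all($pattern, $document, $a);
print_r($a);
?>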

output:

Array
(
    [0] => Array
        (
            [0] => <html>
            [1] => <head>
            [2] => <title>
            [3] => <script>
            [4] => <script>
            [5] => <center>
            [6] => <br>
            [7] => <br>
            [8] => <br>
            [9] => <br>
            [10] => <script>
            [11] => <script>
            [12] => <script>
        )

)

Let me explain the pattern used in the above example in detail. The angle brackets are literals. The first character class matches a letter, and the second matches a letter or a digit. The star repeats the second character class. Because we used the star, it's OK if the second character class matches nothing, so a single-letter tag such as <b> still matches.
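To see the pattern at work without depending on a live page, here is a small sketch that runs it against a fixed string. The sample markup is hypothetical and exists only to make the result repeatable.

```php
<?php
// Hypothetical sample markup (an assumption for this sketch).
$document = '<html><body style="color:red"><p>Hi <b>there</b>'
          . '<h1>Title</h1><a href="#">x</a></p></body></html>';

// Same pattern as in the article: a letter, then zero or more letters or digits.
$pattern = '@<[A-Za-z][A-Za-z0-9]*>@';
preg_match_all($pattern, $document, $matches);

print_r($matches[0]);
// <html>, <p>, <b>, <h1>
```

The <body ...> and <a ...> tags are skipped because a space follows the tag name where the pattern requires >, and the closing tags are skipped because / is not a letter.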

How to Limit Repetition?

We can limit the repetition using {min,max}, where min is zero or a positive integer indicating the minimum number of repetitions, and max is an integer equal to or greater than min indicating the maximum number of repetitions.

  • {min,} = Maximum number of matches is infinite
  • {0,} = *
  • {1,} = +
  • {min} = {min,min} Repeat the token exactly min times
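The equivalences in the list above can be checked with a short sketch. The sample text and the helper closure $allMatches are assumptions introduced for this example.

```php
<?php
// Hypothetical sample text for this sketch.
$text = 'a ab abb abbb';

// Hypothetical helper: return every full match of $pattern in $text.
$allMatches = function (string $pattern, string $text): array {
    preg_match_all($pattern, $text, $m);
    return $m[0];
};

$star     = $allMatches('/ab*/', $text);    // a, ab, abb, abbb
$zeroPlus = $allMatches('/ab{0,}/', $text); // same as with *
$plus     = $allMatches('/ab+/', $text);    // ab, abb, abbb
$onePlus  = $allMatches('/ab{1,}/', $text); // same as with +
$exactly2 = $allMatches('/ab{2}/', $text);  // abb, abb

var_dump($star === $zeroPlus, $plus === $onePlus);
// bool(true) bool(true)
```

Note that {2} finds "abb" inside "abbb" as well, because nothing in the pattern forbids a further "b" after the match.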

You could use \b[1-9][0-9]{3}\b to match a number between 1000 and 9999. \b[1-9][0-9]{2,4}\b matches a number between 100 and 99999. Notice the use of the word boundaries (\b): without them, the patterns would also match a run of digits inside a longer number.
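A minimal sketch of both number patterns follows. The sample text is an assumption, picked so that it includes values just inside and just outside each range.

```php
<?php
// Hypothetical sample text for this sketch.
$text = 'IDs: 7, 42, 999, 1000, 2011, 9999, 10000, 0123, 123456';

// Exactly four digits, the first of which is not zero.
preg_match_all('/\b[1-9][0-9]{3}\b/', $text, $fourDigits);

// Three to five digits, the first of which is not zero.
preg_match_all('/\b[1-9][0-9]{2,4}\b/', $text, $threeToFive);

print_r($fourDigits[0]);  // 1000, 2011, 9999
print_r($threeToFive[0]); // 999, 1000, 2011, 9999, 10000
```

0123 is rejected because of the leading zero, and 123456 is rejected because the word boundaries do not allow a match to start or stop in the middle of a digit run.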

Looking Inside The Regex Engine

Consider what happens when the regex <.+> is applied to a string that contains a tag such as <EM>. The first token in the regex is <. This is a literal. As we already know, the first place where it will match is the first < in the string. The next token is the dot, which matches any character except newlines. The dot is repeated by the plus. The plus is greedy. Therefore, the engine will repeat the dot as many times as it can. The dot matches E, so the regex continues to try to match the dot with the next character. M is matched, and the dot is repeated once more. The next character is the >. You should see the problem by now. The dot matches the >, and the engine continues repeating the dot. The dot will match all remaining characters in the string. The dot fails when the engine has reached the void after the end of the string. Only at this point does the regex engine continue with the next token: >. From there it has to backtrack, giving back characters until > can match, which is why the overall match ends at the last > in the string rather than the first.
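The greedy behaviour can be reproduced with a short sketch. The test string is an assumption, since the walkthrough only tells us that the tag contains "EM". A negated character class is shown alongside as one common way to stop at the first >.

```php
<?php
// Assumed test string; the walkthrough only mentions a tag containing "EM".
$subject = 'This is a <EM>first</EM> test';

// Greedy: the dot also consumes ">", so the match runs to the last ">".
preg_match('/<.+>/', $subject, $greedy);

// Negated character class: "[^>]+" cannot run past the first ">".
preg_match('/<[^>]+>/', $subject, $negated);

echo $greedy[0], "\n";  // <EM>first</EM>
echo $negated[0], "\n"; // <EM>
```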

Discussion 5 Responses

  1. Jaydeep Dave Wrote in to say:

    <[^>]+>

  2. Christiano R Wrote in to say:

    But your pattern will also match <body style='background-color:red;'>, and I don't want to match it.

  3. Jaydeep Dave Wrote in to say:

    Are you sure? Try it with a regex. It will fetch all the tags.

  4. Christiano R Wrote in to say:

    The pattern I showed in the article is meant to find all HTML elements that do not have any attributes, whereas the pattern you suggest will match all HTML elements. Just read the article; I have explained the pattern there.

  5. Jaydeep Dave Wrote in to say:

    oh got it… :)
