PHP Regular Expressions

What Is a Regular Expression

Regular Expressions, often abbreviated as "regex" or "RegExp", are specially formatted text strings used to identify patterns in text. They are highly effective tools for processing and manipulating text efficiently. For instance, they can validate user-entered data formats like names, emails, and phone numbers, as well as find and replace specific strings within text content.

PHP supports Perl-style regular expressions through its preg_ family of functions. These functions are provided by the PCRE (Perl Compatible Regular Expressions) extension, which is always available as of PHP 5.3. Why Perl-style? Perl, whose name is often expanded as "Practical Extraction and Report Language", was the first mainstream programming language to integrate robust support for regular expressions, and it is renowned for its powerful text processing capabilities.

Let's start with a quick overview of PHP's commonly used built-in functions for pattern matching before diving deeper into the world of regular expressions.

Function            Functionality
preg_match()        Matches a regular expression against a string.
preg_match_all()    Matches all occurrences of a regular expression in a string.
preg_replace()      Searches a string using a regular expression and replaces matches.
preg_grep()         Returns array elements that match a pattern.
preg_split()        Splits a string into substrings using a regular expression.
preg_quote()        Escapes regular expression characters in a string.

Note: In PHP, the preg_match() function stops searching after finding the first match, while preg_match_all() continues searching until the end of the string, finding all possible matches instead of stopping at the first one.
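
The difference described in the note can be sketched as follows. The sample text is made up for illustration; the optional third argument of each function receives the matched text.

```php
<?php
// Hypothetical sample text, chosen only for illustration
$text = "one fish, two fish, red fish";
$pattern = "/fish/";

// preg_match() returns 1 as soon as the first match is found, then stops
$found = preg_match($pattern, $text, $first);
echo $found . " match: " . $first[0] . "<br>";

// preg_match_all() keeps scanning and returns the total number of matches
$total = preg_match_all($pattern, $text, $all);
echo $total . " matches: " . implode(", ", $all[0]);
```

Here $first holds only the first match, while $all[0] holds every match found in the string.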


Regular Expression Syntax

Regular expression syntax relies on special characters, called metacharacters (not to be confused with HTML special characters): . * ? + [ ] ( ) { } ^ $ | \. To match one of these characters literally, precede it with a backslash; for example, \. matches a literal period. The delimiter that encloses the pattern (the forward slash in the examples here) must be escaped the same way when it appears inside the pattern. All other characters match themselves.
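
When the text to search for comes from outside the pattern, escaping by hand is error-prone. The sketch below assumes a made-up search term and uses preg_quote() to escape it; the forward slashes around the pattern are the delimiters, and passing "/" as the second argument escapes any slash inside the term as well.

```php
<?php
// Hypothetical search term that contains metacharacters
$term = "3.50 (USD)";

// preg_quote() escapes the metacharacters; the second argument
// also escapes the delimiter used around the pattern
$pattern = "/" . preg_quote($term, "/") . "/";

$text = "Price: 3.50 (USD) per unit, not 3x50 (USD).";
echo $pattern . "<br>";
echo preg_match_all($pattern, $text) . " literal match(es) found.";
```

Without the escaping, the unescaped dot would also accept "3x50" and the parentheses would be treated as a group rather than as literal characters.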

The following sections detail different ways to construct patterns:

Character Classes

A character class is a set of characters enclosed in square brackets, such as [abc]. It matches any single character from the set, so [abc] matches 'a', 'b', or 'c'.

Negated character classes can also be defined to match any character except those listed within the brackets. Use a caret (^) immediately after the opening bracket, as in [^abc].

You can define character ranges using a hyphen (-) inside a character class, like [0-9]. Here are some examples of character classes:

RegExp      What it Matches
[abc]       Matches any of the characters 'a', 'b', or 'c'.
[^abc]      Matches any character except 'a', 'b', or 'c'.
[a-z]       Matches any lowercase letter from 'a' to 'z'.
[A-Z]       Matches any uppercase letter from 'A' to 'Z'.
[a-zA-Z]    Matches any letter, lowercase or uppercase. (A range such as [a-Z] is invalid because 'a' comes after 'Z' in character order.)
[0-9]       Matches any single digit from 0 to 9.
[a-z0-9]    Matches any lowercase letter from 'a' to 'z' or any digit from '0' to '9'.

The next example demonstrates how to determine if a pattern exists within a string using regular expressions and the PHP preg_match() function:

<?php
$pattern = "/ca[kf]e/";
$text = "He was eating cake in the cafe.";
if (preg_match($pattern, $text)) {
    echo "Match found!";
} else {
    echo "Match not found.";
}
?>

Similarly, you can utilize the preg_match_all() function to locate all occurrences within a string:

<?php
$pattern = "/ca[kf]e/";
$text = "He was eating cake in the cafe.";
// preg_match_all() returns the number of matches; the matched text is stored in $matches
$count = preg_match_all($pattern, $text, $matches);
echo $count . " matches were found.";
?>
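
Ranges and negated classes from the table above can be combined with the same functions. The following sketch uses a made-up order string to count digits with a range and to strip everything else with a negated class.

```php
<?php
// Hypothetical sample text for illustration
$text = "Order #A-1047 shipped";

// [0-9] is a range: count every single digit in the text
$digitCount = preg_match_all("/[0-9]/", $text);
echo $digitCount . " digits found.<br>";

// [^0-9] is a negated class: remove every character that is not a digit
$digitsOnly = preg_replace("/[^0-9]/", "", $text);
echo $digitsOnly;
```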

 

Tip: Regular expressions are not limited to PHP. Languages like Java, Perl, Python, and others use similar notation to identify patterns in text.


Predefined Character Classes

Some character classes, such as digits, letters, and whitespace, are so commonly used that there are shorthand names for them. The table below lists these predefined character classes:

Shortcut    Functionality
.           Matches any single character except newline \n.
\d          Matches any digit character. Equivalent to [0-9].
\D          Matches any non-digit character. Equivalent to [^0-9].
\s          Matches any whitespace character (space, tab, newline, carriage return, form feed, or vertical tab). Equivalent to [ \t\n\r\f\x0B].
\S          Matches any non-whitespace character. Equivalent to [^ \t\n\r\f\x0B].
\w          Matches any word character (defined as a to z, A to Z, 0 to 9, and underscore _). Equivalent to [a-zA-Z_0-9].
\W          Matches any non-word character. Equivalent to [^a-zA-Z_0-9].

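The following example demonstrates how to replace every whitespace character (spaces, newlines, and tabs) with a hyphen using the \s shorthand and the PHP preg_replace() function. For comparison, it also shows str_replace(), which replaces only literal spaces: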

<?php
$pattern = "/\s/";
$replacement = "-";
$text = "Earth revolves around\nthe\tSun";
// Replace spaces, newlines and tabs
echo preg_replace($pattern, $replacement, $text);
echo "<br>";
// Replace only spaces
echo str_replace(" ", "-", $text);
?>
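
The other shorthands work the same way. As a small sketch with made-up sample strings, \d can describe a fixed digit layout one position at a time, and \W can pick out every character that is not a letter, digit, or underscore.

```php
<?php
// Hypothetical sample text for illustration
$text = "Call 555-0199 or 555-0123 after 5pm";

// \d matches one digit: three digits, a hyphen, then four digits
$count = preg_match_all('/\d\d\d-\d\d\d\d/', $text, $numbers);
echo $count . " numbers: " . implode(", ", $numbers[0]) . "<br>";

// \W matches any non-word character: turn each one into an underscore
$slug = preg_replace('/\W/', '_', "hello, world!");
echo $slug;
```

The patterns are written in single quotes here so that PHP passes the backslashes to the regex engine unchanged.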

Repetition Quantifiers

In the previous section, we learned how to match single characters in various ways. But what if you need to match more than one character? For instance, you may want to find words that contain one or more instances of the letter 'p', or words with at least two 'p's. Quantifiers serve this purpose: a quantifier specifies how many times the character or group that precedes it must occur for the pattern to match.

Below is a table listing different ways to quantify a specific pattern:

RegExp      Functionality
p+          Matches one or more occurrences of the letter 'p'.
p*          Matches zero or more occurrences of the letter 'p'.
p?          Matches zero or one occurrence of the letter 'p'.
p{2}        Matches exactly two occurrences of the letter 'p'.
p{2,3}      Matches at least two but not more than three occurrences of the letter 'p'.
p{2,}       Matches two or more occurrences of the letter 'p'.
p{0,3}      Matches at most three occurrences of the letter 'p'. Older PCRE versions treat the short form p{,3} as literal text, so spell out the lower bound.

The regular expression in the following example splits the string at commas, sequences of commas, whitespace, or any combination thereof using the PHP preg_split() function:

<?php
$pattern = "/[\s,]+/";
$text = "My favourite colors are red, green and blue";
$parts = preg_split($pattern, $text);

// Loop through parts array and display substrings
foreach ($parts as $part) {
    echo $part . "<br>";
}
?>
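
To see a bounded quantifier from the table at work, the sketch below tests a few made-up spellings against p{2,3}. The literal 'a' before and 'le' after the quantified 'p' ensure that only two or three consecutive 'p's are accepted.

```php
<?php
// Hypothetical word list for illustration
$words = array("aple", "apple", "appple", "apppple");

$results = array();
foreach ($words as $word) {
    // preg_match() returns 1 if the word contains 'a', 2 to 3 'p's, then 'le'
    $results[$word] = preg_match('/ap{2,3}le/', $word);
    echo $word . ": " . ($results[$word] ? "match" : "no match") . "<br>";
}
```

Only "apple" and "appple" match: "aple" has too few 'p's, and in "apppple" the fourth 'p' sits where the pattern requires an 'l'.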

Position Anchors

In certain scenarios, you may need to match patterns at the beginning or end of a line, word, or string. Anchors help achieve this. Two common anchors are the caret (^) and the dollar sign ($). By default, ^ matches at the start of the string and $ matches at the end of the string; with the m modifier, they also match at the start and end of each line.

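RegExp      Functionality
^p          Matches the letter 'p' at the beginning of the string (or of each line when the m modifier is used).
p$          Matches the letter 'p' at the end of the string (or of each line when the m modifier is used).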

The regular expression in the following example displays only those names from the names array that start with the letter "J" using the PHP preg_grep() function:

<?php
$pattern = "/^J/";
$names = array("John Carter", "Clark Kent", "John Rambo");
$matches = preg_grep($pattern, $names);

// Loop through matches array and display matched names
foreach ($matches as $match) {
    echo $match . "<br>";
}
?>
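
Using both anchors together forces the pattern to cover the entire string, which is the usual shape of a validation check. The sketch below assumes a made-up format of exactly five digits.

```php
<?php
// Hypothetical format: exactly five digits, nothing before or after
$pattern = '/^[0-9]{5}$/';
$inputs = array("90210", "9021", "90210-1234", "zip 90210");

$results = array();
foreach ($inputs as $input) {
    // Without ^ and $, "90210-1234" and "zip 90210" would also match
    $results[$input] = preg_match($pattern, $input) ? "valid" : "invalid";
    echo $input . ": " . $results[$input] . "<br>";
}
```

Note that $ also matches just before a single trailing newline, so an input of "90210\n" would pass this check; the D modifier makes $ match only at the very end of the string.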

Pattern Modifiers

Pattern modifiers allow you to control how a pattern match is processed. These modifiers are placed immediately after the regular expression. For example, to perform a case-insensitive search, you can use the i modifier like this: /pattern/i. Below are some commonly used pattern modifiers:

Modifier    Functionality
i           Makes the match case-insensitive.
m           Changes the behavior of ^ and $ to match against newline boundaries (start or end of each line within a multiline string), rather than just the string boundaries.
s           Changes the behavior of . (dot) to match all characters, including newlines.
x           Allows whitespace and comments within a regular expression for clarity.

Note: The g (global) and o (evaluate once) modifiers come from Perl and are not supported by PHP's preg_ functions. To find all occurrences, use preg_match_all() instead of a g modifier.

The following example demonstrates how to perform a global case-insensitive search using the i modifier and the PHP preg_match_all() function:

<?php
$pattern = "/color/i";
$text = "Color red is more visible than color blue in daylight.";
// preg_match_all() returns the number of matches; the matched text is stored in $matches
$count = preg_match_all($pattern, $text, $matches);
echo $count . " matches were found.";
?>

Similarly, the next example demonstrates how to match at the beginning of each line in a multi-line string using the ^ anchor and the m modifier with PHP's preg_match_all() function:

<?php
$pattern = "/^color/im";
$text = "Color red is more visible than \ncolor blue in daylight.";
// preg_match_all() returns the number of matches; the matched text is stored in $matches
$count = preg_match_all($pattern, $text, $matches);
echo $count . " matches were found.";
?>
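
The s and x modifiers from the table can be sketched in the same way. The sample strings below are made up: the first pair of calls shows that the dot crosses a newline only when s is set, and the last pattern uses x to spread a pattern over several commented lines.

```php
<?php
// Hypothetical sample text containing a newline
$text = "<p>first\nsecond</p>";

// Without s, the dot cannot cross the newline, so the match fails
$withoutS = preg_match('/<p>.*<\/p>/', $text);
// With s, the dot also matches the newline
$withS = preg_match('/<p>.*<\/p>/s', $text);
echo $withoutS . " without s, " . $withS . " with s<br>";

// With x, whitespace in the pattern is ignored and # starts a comment
$datePattern = '/
    \d{4}   # four-digit year
    -       # literal hyphen
    \d{2}   # two-digit month
/x';
preg_match($datePattern, "Released 2024-05", $date);
echo $date[0];
```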

Word Boundaries

A word boundary character (\b) allows you to search for words that begin and/or end with a specific pattern. For example, the regex /\bcar/ matches words that start with "car", such as cart, carrot, or cartoon, but not oscar.

Similarly, the regex /car\b/ matches words that end with "car", such as scar, oscar, or supercar, but not cart. The regex /\bcar\b/ matches only the word "car", which begins and ends with "car".

The following example demonstrates how to highlight words beginning with "car" in bold:

<?php
$pattern = '/\bcar\w*/';
$replacement = '<b>$0</b>';
$text = 'Words beginning with car: cart, carrot, cartoon. Words ending with car: scar, oscar, supercar.';
echo preg_replace($pattern, $replacement, $text);
?>
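
Placing \b on both sides restricts the match to the standalone word. The sketch below uses a made-up sentence and combines \bcar\b with the i modifier to count and replace only the whole word "car", leaving longer words untouched.

```php
<?php
// Hypothetical sample sentence for illustration
$text = "The car hit a cart near the scar-shaped carpark. Car!";

// \b on both sides: "cart", "scar" and "carpark" are not whole-word matches
$count = preg_match_all('/\bcar\b/i', $text, $found);
echo $count . " whole-word matches: " . implode(", ", $found[0]) . "<br>";

// Replace only the whole word, leaving the longer words intact
$replaced = preg_replace('/\bcar\b/i', 'bus', $text);
echo $replaced;
```

A word boundary sits between a word character (\w) and a non-word character (\W) or the edge of the string, which is why the hyphen in "scar-shaped" ends a word but the "s" before "car" does not start one.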

We hope you now grasp the fundamentals of regular expressions. To explore how to validate form data using regular expressions, please refer to the tutorial on PHP Form Validation.