JS Regular Expressions

Understanding Regular Expressions

Regular expressions, often abbreviated as "regex" or "RegExp", are specially formatted strings designed for identifying patterns within text. They are incredibly powerful tools used extensively for processing and manipulating text efficiently. For instance, they can validate the correctness of user-input data like names, emails, and phone numbers, as well as locate and replace specific strings within text content.

JavaScript supports regular expressions in the style of Perl, a programming language renowned for its robust support of regular expressions and advanced text processing capabilities.

Before diving into the intricacies of regular expressions, let's first explore the commonly used built-in methods in JavaScript for pattern-matching.

Function What it Does
exec() Search for a match in a string. It returns an array of information or null on mismatch.
test() Test whether a string matches a pattern. It returns true or false.
search() Search for a match within a string. It returns the index of the first match, or -1 if not found.
replace() Search for a match in a string and replace the matched substring with a replacement string.
match() Search for a match in a string. It returns an array of information or null on mismatch.
split() Split a string into an array of substrings using a regular expression.
 

Important: exec() and test() are methods of the RegExp object and accept a string as an argument. Conversely, search(), replace(), match(), and split() are String methods that accept a regular expression as an argument.
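As a rough illustration of the six methods listed above, the following sketch calls each one once with the same pattern. The sample string and the variable names are made up for this example.

```javascript
const pattern = /o/;
const text = "Hello World";

// RegExp methods: the string is the argument
const execResult = pattern.exec(text);   // match array, or null
const testResult = pattern.test(text);   // true or false

// String methods: the regular expression is the argument
const searchResult = text.search(pattern);        // index of the first match, or -1
const replaceResult = text.replace(pattern, "0"); // replaces only the first match (no g flag)
const matchResult = text.match(pattern);          // match array, or null
const splitResult = text.split(pattern);          // array of substrings

console.log(execResult[0], execResult.index); // o 4
console.log(testResult);     // true
console.log(searchResult);   // 4
console.log(replaceResult);  // Hell0 World
console.log(matchResult[0]); // o
console.log(splitResult);    // ["Hell", " W", "rld"]
```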


Creating Regular Expressions

In JavaScript, regular expressions are represented by the RegExp object, which is a built-in JavaScript object similar to String, Array, and others. There are two ways to create a new RegExp object: the literal syntax and the RegExp() constructor.

The literal syntax encloses the regular expression pattern within forward slashes (/pattern/), while the constructor syntax takes the pattern as a quoted string ("pattern"). Below is an example that illustrates both ways of creating a regular expression to match strings starting with "Mr.".

// Literal syntax 
var regex = /^Mr\./;

// Constructor syntax
var regex = new RegExp("^Mr\\.");

As observed, the regular expression literal syntax is more concise and readable, so it is the recommended form and the one used throughout this tutorial. The constructor remains useful when the pattern is only known at run time.

Important: When employing the constructor syntax, it's necessary to double-escape special characters. For instance, to match ".", you must write "\\." instead of "\.". Inside a string literal, "\." is read as just ".", because JavaScript's string parser consumes the single backslash before the pattern ever reaches the RegExp() constructor.
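The difference between the two escapes can be seen by inspecting the source property of the resulting objects. This is a minimal sketch; the title variable and the test strings are hypothetical.

```javascript
// Hypothetical input: a title that is only known at run time
const title = "Mr";

// "\\." in the string literal becomes \. in the pattern (a literal dot)
const strictRegex = new RegExp("^" + title + "\\.");
console.log(strictRegex.source);           // ^Mr\.
console.log(strictRegex.test("Mr. Bean")); // true
console.log(strictRegex.test("Mrs Bean")); // false

// With a single backslash the string parser drops it, leaving an unescaped dot
const looseRegex = new RegExp("^Mr\.");
console.log(looseRegex.source);            // ^Mr.
console.log(looseRegex.test("Mrs Bean"));  // true, because . matches any character
```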

Matching Patterns with Regular Expressions

Regular expression patterns can consist of letters, digits, punctuation marks, and special regular expression characters (distinct from HTML special characters).

Special characters in regular expressions include:
. * ? + [ ] ( ) { } ^ $ | \
To use any of these characters literally, you must precede it with a backslash. For example, to match ".", you would write \. in the pattern. All other characters are interpreted literally by default.
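When the text to match comes from a variable, the escaping can be automated. The sketch below uses a hypothetical helper named escapeRegExp (it is not a built-in function) that puts a backslash in front of every special character before the text is handed to the RegExp() constructor.

```javascript
// Hypothetical helper: prefixes each special character with a backslash
function escapeRegExp(text) {
  // In the replacement string, $& stands for the matched character
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const price = "$4.99";
const priceRegex = new RegExp(escapeRegExp(price));

console.log(escapeRegExp(price));             // \$4\.99
console.log(priceRegex.test("Total: $4.99")); // true
console.log(priceRegex.test("Total: $4x99")); // false

// Without escaping, the dot matches any character
console.log(/4.99/.test("4x99"));             // true
```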

The next sections detail different options for constructing patterns:

Character Classes

Enclosing characters in square brackets defines a character class, for example, [abc]. A character class matches any single character from the specified list, meaning [abc] matches either 'a', 'b', or 'c'.

Negated character classes can also be defined to match any character except those listed. They are denoted by placing a caret (^) immediately after the opening bracket, such as [^abc], which matches any character except 'a', 'b', and 'c'.

You can specify a range of characters within a character class using a hyphen (-), such as [0-9]. Here are some examples illustrating character classes:

RegExp What it Does
[abc] Matches any one of the characters a, b, or c.
[^abc] Matches any one character other than a, b, or c.
[a-z] Matches any one character from lowercase a to lowercase z.
[A-Z] Matches any one character from uppercase A to uppercase Z.
[a-zA-Z] Matches any one letter, lowercase or uppercase. (A range such as [a-Z] is invalid: its endpoints are out of order, so JavaScript throws a SyntaxError.)
[0-9] Matches a single digit between 0 and 9.
[a-z0-9] Matches a single character between a and z or between 0 and 9.
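The classes in the table can be tried out directly. The following is a small sketch with made-up sample strings; search() returns the index of the first character that the class matches.

```javascript
// A list of characters: matches "gray" or "grey", nothing else
const grayPattern = /gr[ae]y/;
console.log(grayPattern.test("grey")); // true
console.log(grayPattern.test("groy")); // false

// A range: index of the first digit
const firstDigit = "b7".search(/[0-9]/);
console.log(firstDigit); // 1

// A negated class: index of the first character that is not a vowel
const firstNonVowel = "audio".search(/[^aeiou]/);
console.log(firstNonVowel); // 2

// Two ranges in one class: index of the first letter of either case
const firstLetter = "42nd".search(/[a-zA-Z]/);
console.log(firstLetter); // 2
```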

Here's an example demonstrating how to determine whether a pattern exists within a string using regular expressions and the JavaScript test() method:

var regex = /ca[kf]e/;
var str = "He was eating cake in the cafe.";

// Test the string against the regular expression
if(regex.test(str)) {
    alert("Match found!");
} else {
    alert("Match not found.");
}

Additionally, you can apply the global flag g to a regular expression to locate all occurrences within a string:

var regex = /ca[kf]e/g;
var str = "He was eating cake in the cafe.";
var matches = str.match(regex);
alert(matches.length); // Outputs: 2

Tip: Regular expressions aren't limited to JavaScript. Programming languages like Java, Perl, Python, PHP, and others employ similar syntax to identify patterns in text.


Predefined Character Classes

Certain character groups like digits, letters, and whitespace are so commonly used that they have shorthand names. The table below outlines these predefined character classes:

Shortcut What it Does
. Matches any single character except line terminators such as \n and \r.
\d Matches any digit character. Same as [0-9].
\D Matches any non-digit character. Same as [^0-9].
\s Matches any whitespace character, including space, tab, newline, and carriage return. Roughly [ \t\n\r\f\v], plus other Unicode whitespace characters.
\S Matches any non-whitespace character. The inverse of \s.
\w Matches any word character (defined as a to z, A to Z, 0 to 9, and the underscore). Same as [a-zA-Z_0-9].
\W Matches any non-word character. Same as [^a-zA-Z_0-9].

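Here's an example demonstrating how to replace whitespace characters with hyphens in a string using regular expressions with the JavaScript replace() method. The first call replaces every whitespace character; the second replaces only the space character: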

var regex = /\s/g;
var replacement = "-";
var str = "Earth revolves around\nthe\tSun";

// Replace spaces, newlines and tabs
document.write(str.replace(regex, replacement) + "<hr>");

// Replace only spaces
document.write(str.replace(/ /g, "-"));
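The other shorthand classes work the same way. Below is a short sketch, using a made-up sample string, that counts digits with \d, strips non-digits with \D, collects runs of word characters with \w, and splits on whitespace with \s.

```javascript
const sample = "Order #42 shipped on 2024-01-15";

// \d with the g flag: every digit is a separate match
const digitCount = sample.match(/\d/g).length;
console.log(digitCount); // 10

// \D: remove everything that is not a digit
const digitsOnly = sample.replace(/\D/g, "");
console.log(digitsOnly); // 4220240115

// \w+: runs of letters, digits, and underscores
const wordRuns = sample.match(/\w+/g);
console.log(wordRuns); // ["Order", "42", "shipped", "on", "2024", "01", "15"]

// \s: split at each whitespace character
const pieces = sample.split(/\s/);
console.log(pieces); // ["Order", "#42", "shipped", "on", "2024-01-15"]
```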

Repetition Quantifiers

In the previous section, we learned how to match a single character in various ways. But what if you need to match multiple characters? For instance, suppose you want to find words that contain one or more occurrences of the letter 'p', or words with at least two 'p's, and so forth.

This is where quantifiers become useful. Quantifiers allow you to specify how many times a character in a regular expression should match. They can be applied to individual characters, as well as character classes and groups of characters enclosed in parentheses.

The table below outlines the different quantifiers available to define a specific pattern:

RegExp What it Does
p+ Matches one or more occurrences of the letter p.
p* Matches zero or more occurrences of the letter p.
p? Matches zero or one occurrence of the letter p.
p{2} Matches exactly two occurrences of the letter p.
p{2,3} Matches at least two occurrences of the letter p, but not more than three occurrences.
p{2,} Matches two or more occurrences of the letter p.
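p{0,3} Matches at most three occurrences of the letter p. (JavaScript has no p{,3} shorthand; in a pattern without the u flag, p{,3} matches the literal text "p{,3}".)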

The regular expression in this example splits the string at commas, sequences of commas, whitespace, or any combination thereof using the JavaScript split() method:

var regex = /[\s,]+/;
var str = "My favourite colors are red, green and blue";
var parts = str.split(regex);

// Loop through parts array and display substrings
for(var part of parts) {
    document.write("<p>" + part + "</p>");
}
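To see how the individual quantifiers differ, it can help to look at the text each one actually consumes. The following sketch runs several of them against the same made-up word; quantifiers are greedy by default, so each takes as many characters as it is allowed to.

```javascript
const word = "apppple"; // contains four consecutive p's

const onePlus = word.match(/p+/)[0];
console.log(onePlus);    // pppp

const exactlyTwo = word.match(/p{2}/)[0];
console.log(exactlyTwo); // pp

const twoToThree = word.match(/p{2,3}/)[0];
console.log(twoToThree); // ppp

const twoOrMore = word.match(/p{2,}/)[0];
console.log(twoOrMore);  // pppp

// * also succeeds with zero occurrences, giving an empty match
const zeroOrMore = "ale".match(/p*/)[0];
console.log(zeroOrMore === ""); // true

// ? makes the preceding character optional
const optionalU = /colou?r/;
console.log(optionalU.test("color"), optionalU.test("colour")); // true true
```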

Position Anchors

In certain cases, you might need to match patterns at the beginning or end of a line, word, or string. To achieve this, you can utilize anchors. The caret (^) represents the start of the string, and the dollar sign ($) represents the end of the string.

RegExp What it Does
^p Matches the letter p at the beginning of the string (or of each line when the m flag is set).
p$ Matches the letter p at the end of the string (or of each line when the m flag is set).

In this example, the regular expression will test and match only those names in the names array that begin with the letter "J" using the JavaScript test() function:

var regex = /^J/;
var names = ["James Bond", "Clark Kent", "John Rambo"];

// Loop through names array and display matched names
for(var name of names) {
    if(regex.test(name)) {
        document.write("<p>" + name + "</p>");
    }
}
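Using both anchors in one pattern requires the whole string to match, which is a common approach for input validation. The sketch below assumes a hypothetical rule that a code must consist of exactly five digits.

```javascript
// Anchored at both ends: the entire string must be five digits
const fiveDigits = /^\d{5}$/;

console.log(fiveDigits.test("12345"));  // true
console.log(fiveDigits.test("123456")); // false, one digit too many
console.log(fiveDigits.test("a12345")); // false, extra character at the start

// Without anchors, the pattern only needs to match somewhere in the string
const unanchored = /\d{5}/;
console.log(unanchored.test("a123456")); // true
```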

Pattern Modifiers (Flags)

Pattern modifiers enable you to control how a pattern match is processed. These modifiers are placed immediately after the regular expression. For instance, to search for a pattern in a case-insensitive manner, you can use the i modifier, like so: /pattern/i.

Below is a table listing some of the most frequently used pattern modifiers:

Modifier What it Does
g Performs a global match, i.e. finds all occurrences.
i Makes the match case-insensitive.
m Changes the behavior of ^ and $ to match against a newline boundary (i.e. start or end of each line within a multiline string), instead of a string boundary.
s Changes the behavior of . (dot) to match all characters, including newlines.
u Treats the pattern as a sequence of Unicode code points.
y Performs a sticky match, i.e. matches only at the position given by the regular expression's lastIndex property.

Note: Unlike Perl, JavaScript does not support the o and x modifiers; using either one in a regular expression literal raises a SyntaxError.

Here's an example demonstrating how to utilize the g and i modifiers in a regular expression to conduct a global and case-insensitive search using the JavaScript match() method:

var regex = /color/gi;
var str = "Color red is more visible than color blue in daylight.";
var matches = str.match(regex); // global, case-insensitive match
console.log(matches);
// expected output: ["Color", "color"]

Similarly, the following example demonstrates how to match at the beginning of every line in a multi-line string using the ^ anchor and the m modifier with the JavaScript match() method:

var regex = /^color/gim;
var str = "Color red is more visible than \ncolor blue in daylight.";
var matches = str.match(regex); // global, case-insensitive, multiline match
console.log(matches);
// expected output: ["Color", "color"]
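The s modifier can be illustrated in the same way. In this minimal sketch, with a made-up two-line string, the dot fails to match the newline until the s modifier is added.

```javascript
const twoLines = "one\ntwo";

// By default the dot does not match a newline
const withoutS = /one.two/.test(twoLines);
console.log(withoutS); // false

// With the s modifier the dot matches the newline as well
const withS = /one.two/s.test(twoLines);
console.log(withS); // true
```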

Alternation

Alternation allows you to specify alternative versions of a pattern. In regular expressions, alternation works similarly to the OR operator in an if-else conditional statement.

You can specify alternation using a vertical bar (|). For example, the regex /fox|dog|cat/ matches either "fox", "dog", or "cat". Here's an example:

var regex = /fox|dog|cat/;
var str = "The quick brown fox jumps over the lazy dog.";
var matches = str.match(regex);
console.log(matches);
// expected output: ["fox", index: 16, ...]

Note: Alternatives are evaluated from left to right. Once a match is found, further alternatives to the right are ignored, even if they could potentially match.
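The left-to-right rule matters when one alternative is a prefix of another. The sketch below, using made-up patterns, shows that the order of the alternatives changes the result, and that the order only applies at a given position: a match that starts earlier in the string still wins.

```javascript
// "cat" is tried first and succeeds, so "category" is never tried
const shortFirst = "category".match(/cat|category/)[0];
console.log(shortFirst); // cat

// Listing the longer alternative first gives the longer match
const longFirst = "category".match(/category|cat/)[0];
console.log(longFirst); // category

// "dog" is listed first, but "fox" occurs earlier in the string
const earliest = "The quick brown fox jumps over the lazy dog.".match(/dog|fox/)[0];
console.log(earliest); // fox
```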


Grouping

Regular expressions use parentheses to group subexpressions, similar to how parentheses are used in mathematical expressions. Grouping with parentheses allows a repetition quantifier to be applied to an entire subexpression.

For example, in the regex /go+/, the quantifier + applies only to the last character o, matching strings like "go", "goo", and so forth. In contrast, in the regex /(go)+/, the quantifier + applies to the group of characters go, matching strings like "go", "gogo", and so forth.

var regex = /(go)+/i; 
var str = "One day Gogo will go to school.";
var matches = str.match(regex); // case-insensitive match
console.log(matches);
// expected output: ["Gogo", "go", index: 8, ...]

Note: When a string matches the pattern, the match() method returns an array. The first element is the entire matched string, followed by any results captured in parentheses, and the index of the whole match. If no matches are found, it returns null.

Tip: When the regular expression includes the g flag, the match() method returns an array containing all matched substrings instead of a match object. Captured groups, index of the whole match, and other properties are not included.
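The difference between the two return shapes can be sketched as follows. The sample string and date pattern are made up for illustration. The last part uses String's matchAll() method, which is not covered above; it is one way to keep the captured groups while still finding every match.

```javascript
const dateText = "Released 2021-03-04, patched 2022-11-30";

// Without g: full match, then one entry per group, plus the index property
const firstDate = dateText.match(/(\d{4})-(\d{2})-(\d{2})/);
console.log(firstDate[0]);    // 2021-03-04
console.log(firstDate[1]);    // 2021
console.log(firstDate[2]);    // 03
console.log(firstDate[3]);    // 04
console.log(firstDate.index); // 9

// With g: only the full matches, no groups and no index
const allDates = dateText.match(/(\d{4})-(\d{2})-(\d{2})/g);
console.log(allDates); // ["2021-03-04", "2022-11-30"]

// matchAll() requires the g flag and yields one match array per occurrence
const years = [...dateText.matchAll(/(\d{4})-(\d{2})-(\d{2})/g)].map(function (m) {
  return m[1]; // first captured group of each match
});
console.log(years); // ["2021", "2022"]
```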


Word Boundaries

The word boundary assertion (\b) helps you search for words that begin and/or end with a specific pattern. It matches a position between a word character and a non-word character (or the start or end of the string) rather than a character itself. For instance, the regex /\bcar/ matches words starting with "car", such as "cart", "carrot", or "cartoon", but not "oscar".

Similarly, the regex /car\b/ matches words ending with "car", such as "oscar" or "supercar", but not "cart". Likewise, /\bcar\b/ matches only the whole word "car".

The following example highlights words beginning with "car" in bold:

var regex = /(\bcar\w*)/g;
var str = "Words beginning with car: cart, carrot, cartoon. Words ending with car: oscar, supercar.";
var replacement = '<b>$1</b>';
var result = str.replace(regex, replacement);
document.write(result);
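The three boundary placements described in this section can be compared side by side. This is a small sketch with a made-up sentence, using match() with the g flag to list every word each pattern finds.

```javascript
const sentence = "A car, a cart and an oscar";

// Boundary at the start: words beginning with "car"
const startsWithCar = sentence.match(/\bcar\w*/g);
console.log(startsWithCar); // ["car", "cart"]

// Boundary at the end: words ending with "car"
const endsWithCar = sentence.match(/\w*car\b/g);
console.log(endsWithCar); // ["car", "oscar"]

// Boundaries on both sides: only the whole word "car"
const wholeWordCar = sentence.match(/\bcar\b/g);
console.log(wholeWordCar); // ["car"]
```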