Everything You Need to Know About Regular Expressions in JavaScript

This detailed guide should answer all your regex questions

Published in

Better Programming

11 min read · Mar 19, 2020

Regular expressions are a necessity of software development. As a front-end developer, you can go a long time happily ignoring their existence, but sooner or later you’re going to have to deal with them.

In this guide, you will find everything that you need to know about regular expressions in JavaScript. This article is structured as follows:

What can you do with regular expressions?
What do regular expressions look like?
How do you define a regular expression in JavaScript?
How do regular expressions work?
What’s the regex syntax in JavaScript?
How to use captures and backreferences
How is a regular expression matched?
How can we optimize regular expressions?
When to not use regular expressions

What Can You Do With Regular Expressions?

Think of regular expressions as compact, fast tools that allow you to find and replace patterns in text. With regular expressions:

You can check if a text contains a particular substring or pattern.
You can find and return those pattern matches.
You can capture these substrings out of the text.
You can modify the captured substrings.
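As a rough sketch, here is how these four tasks map onto built-in JavaScript methods. The sample sentence and the variable names are made up for illustration, and the pattern is the one used as the running example in the rest of this guide.

```javascript
// Sample text, invented for this illustration.
const text = "I read Medium and Media daily.";

// 1. Check whether the text contains the pattern.
const hasMatch = /Medi[a-zA-Z]*/.test(text); // true

// 2. Find and return every match (the g flag asks for all occurrences).
const allMatches = text.match(/Medi[a-zA-Z]*/g); // ["Medium", "Media"]

// 3. Capture a substring: the parentheses capture what follows "Medi".
const firstCapture = /Medi([a-zA-Z]*)/.exec(text)[1]; // "um"

// 4. Modify the matched substrings.
const replaced = text.replace(/Medi[a-zA-Z]*/g, (word) => word.toUpperCase());
// "I read MEDIUM and MEDIA daily."

console.log(hasMatch, allMatches, firstCapture, replaced);
```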

The following article lists a few use cases of regular expressions for frontend development: “6 Handy Regular Expressions Every Frontend Developer Should Know” (blog.bitsrc.io).

Most high-level languages make use of regular expressions. JavaScript’s regular expression syntax is modeled on the Perl 5 regular expression grammar.

What Do Regular Expressions Look Like?

Here’s an example of a regular expression: /Medi[a-zA-Z]*/

Let’s try to understand what it does.

This regular expression describes words that start with the substring Medi. In this example, Medium, Media, Medical, and Medi would all match.

Try it yourself in an online regex tester or your browser console.

Note: In case you haven’t noticed, regular expressions are case sensitive by default. For example, medium will not match.
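A quick way to check this is the test method, which returns true when the pattern is found anywhere in the string. The word list below simply reuses the words mentioned above.

```javascript
const pattern = /Medi[a-zA-Z]*/;

// Words from the example above, plus the lowercase counterexample.
const words = ["Medium", "Media", "Medical", "Medi", "medium"];

// test() returns true when the pattern is found anywhere in the string.
const results = words.map((word) => pattern.test(word));

words.forEach((word, i) => console.log(word, results[i]));
// Medium true, Media true, Medical true, Medi true, medium false
```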

How Do You Define a Regular Expression in JavaScript?

A regular expression is an object that describes a pattern of characters. In JavaScript, you can define regular expressions in two different ways:

Using a regular expression literal, enclosed in a pair of slashes /.../:

const myPattern = /Medi[a-zA-Z]*/;

Or, by constructing an instance of the RegExp object:

const myPattern = new RegExp("Medi[a-zA-Z]*");

Both formats result in the same regex being created in the variable myPattern.
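The sketch below compares the two forms and shows one practical difference: in the constructor form the pattern is a string, so a backslash has to be escaped. The variable names and the prefix value are hypothetical.

```javascript
const fromLiteral = /Medi[a-zA-Z]*/;
const fromConstructor = new RegExp("Medi[a-zA-Z]*");

// Both objects carry the same pattern source.
const sameSource = fromLiteral.source === fromConstructor.source; // true

// Inside a string, a backslash must itself be escaped, so \d becomes "\\d".
const digitsLiteral = /\d+/;
const digitsConstructor = new RegExp("\\d+");
const sameDigits = digitsLiteral.source === digitsConstructor.source; // true

// The constructor is handy when the pattern is only known at runtime.
const prefix = "Medi"; // hypothetical runtime value
const dynamic = new RegExp(prefix + "[a-zA-Z]*");
const dynamicMatch = dynamic.test("Medical"); // true

console.log(sameSource, sameDigits, dynamicMatch);
```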

Options

In addition to the expression itself, a regex can carry options, known as flags. Five commonly used flags are:

i: Makes the regex case insensitive. For example, /Medi[a-zA-Z]*/i would match medium and MEDIUM as well.
g: Matches all occurrences of the pattern. When g is not specified, the regex matches only the first occurrence.
m: Enables multiline mode, in which ^ and $ match at the start and end of each line rather than only at the start and end of the whole text.
y: Enables sticky matching; the regex attempts a match only at the position stored in its lastIndex property, which is where the previous match ended.
u: Enables Unicode mode, which allows Unicode code point escapes of the form \u{…}.

When using the RegExp constructor, these options can be passed as a second parameter. For example:

const myPattern = new RegExp("Medi[a-zA-Z]*", "ig");

This is equivalent to:

const myPattern = /Medi[a-zA-Z]*/ig;
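To get a feel for the flags, here is a small sketch that exercises each one. The sample text and variable names are assumptions made for the example.

```javascript
const text = "medium Medium\nMedia";

// No flags: case sensitive, first match only.
const first = text.match(/medi[a-z]*/)[0]; // "medium"

// i ignores case; g returns every match.
const allMatches = text.match(/medi[a-z]*/gi); // ["medium", "Medium", "Media"]

// m: ^ and $ also match at the start and end of each line.
const withoutM = /^Media$/.test(text); // false
const withM = /^Media$/m.test(text);   // true

// y: the match must start exactly at lastIndex.
const sticky = /Medi/y;
sticky.lastIndex = 7;                  // index of the "M" in "Medium"
const stickyHit = sticky.test(text);   // true
sticky.lastIndex = 3;                  // index of the "i" in "medium"
const stickyMiss = sticky.test(text);  // false

// u: enables code point escapes such as \u{1F600}.
const emoji = /\u{1F600}/u.test("\u{1F600}"); // true

console.log(first, allMatches, withoutM, withM, stickyHit, stickyMiss, emoji);
```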

How Do Regular Expressions Work?

Here’s the short answer. Think of regular expressions as a mini-program that describes a pattern and tells the machine what to look for. With that in mind, it shouldn’t surprise you that:

Regular expressions define a set of instructions. In our example: “first find the uppercase letter M, then find the lowercase e,…etc”
Regular expressions have an input (the text that you are trying to search or replace) and can output the subset that you are trying to match.
Regular expressions have a syntax, and they can be compiled, executed, and even optimized to run faster!

Scroll down for a longer, more detailed answer!

What’s the Regex Syntax in JavaScript?

Exact matching

Any character that’s not a special metacharacter or an operator will match itself in a regex.

In our previous example, /Medi[a-zA-Z]*/, M is a character that matches itself, and the same goes for e, d, and i.

Placing one character after another indicates that we are looking for M followed by e, followed by d, followed by i. In such an example, Medo will not match.

Alternation

If we want to match either a or b, we can use the pipe operator |. The regex will look like this: /a|b/.
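Here is a minimal sketch of alternation. It also shows a common pitfall that is worth knowing: the pipe has the lowest precedence, so anchors or neighboring characters bind to one alternative only unless you group the alternatives with parentheses. The sample words are arbitrary.

```javascript
const aOrB = /a|b/;
const results = ["cat", "bus", "dog"].map((word) => aOrB.test(word));
// [true, true, false]: "cat" contains a, "bus" contains b, "dog" has neither

// /^cat|dog$/ reads as (^cat) OR (dog$), not ^(cat|dog)$.
const loose = /^cat|dog$/;
const strict = /^(cat|dog)$/;
const looseResult = loose.test("catalog");   // true, because it starts with "cat"
const strictResult = strict.test("catalog"); // false, it is neither "cat" nor "dog"

console.log(results, looseResult, strictResult);
```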

Matching a class of characters

In many cases, you wouldn’t want to match an exact character, but a character from a finite set of characters. The set operator allows us to do this.

An [abc] character set will match any single character in that set: either a, b, or c. If we add a ^ just after the opening bracket, the set [^abc] will match any character except a, b, and c.

In our previous example, /Medi[a-zA-Z]*/: [a-zA-Z] is a character class that matches any character from a to z lower case or uppercase.

We could write [abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ], but the dash operator - lets us express the same thing more compactly: any character from a through z or A through Z, inclusive, will be matched.
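A short sketch of sets, negated sets, and ranges. The test strings are arbitrary.

```javascript
const inSet = /[abc]/;
const notInSet = /[^abc]/;
const letter = /[a-zA-Z]/;

const checks = {
  setHit: inSet.test("b"),            // true: b is in the set
  setMiss: inSet.test("x"),           // false: x is not in the set
  negatedHit: notInSet.test("x"),     // true: x is something other than a, b, c
  negatedMiss: notInSet.test("abc"),  // false: every character is a, b, or c
  rangeHit: letter.test("Q"),         // true: Q falls in A-Z
  rangeMiss: letter.test("7"),        // false: a digit is outside both ranges
};

console.log(checks);
```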

Escaping

Not all characters in a regex represent themselves. In our previous example, /Medi[a-zA-Z]*/, the * is a quantifier: it matches a character in the set [a-zA-Z] between zero and unlimited times, as many times as possible.

How do we specify that we want to match the literal *?

Within regex, when you want to specify a literal character, you can escape it using the backslash character \. If you're going to match *, you need to specify \* to the regex. If you want to match a literal backslash, you need to specify \\.
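The sketch below shows both escapes in action. It also includes a small helper, named escapeRegExp here purely for illustration (it is not a built-in used by this article), that escapes every special character in a string so that arbitrary text can be matched literally.

```javascript
// \* matches a literal asterisk.
const star = /\*/;
const starResult = star.test("2 * 3"); // true

// \\ matches a literal backslash. The string "C:\\temp" contains one backslash.
const backslash = /\\/;
const backslashResult = backslash.test("C:\\temp"); // true

// Hypothetical helper: prefix every regex metacharacter with a backslash.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Without escaping, "a*b" would mean "zero or more a, then b".
const literalPattern = new RegExp(escapeRegExp("a*b"));
const literalHit = literalPattern.test("a*b");  // true
const literalMiss = literalPattern.test("aab"); // false

console.log(starResult, backslashResult, literalHit, literalMiss);
```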

Start and end

In some cases, you might want to ensure that a pattern matches at the beginning of the string. If we go back to our previous example, /Medi[a-zA-Z]*/, zdMedium will also be a match, because the pattern can be found anywhere in the string. If we want to ensure that the string itself starts with Medi, all we have to do is add a ^ at the start of the regular expression, like so: /^Medi[a-zA-Z]*/

Similarly, a dollar sign $ signifies that the pattern must appear at the end of the string.
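A minimal sketch of both anchors, using made-up input strings:

```javascript
const anywhere = /Medi[a-zA-Z]*/;
const atStart = /^Medi[a-zA-Z]*/;
const atEnd = /Medi[a-zA-Z]*$/;
const wholeString = /^Medi[a-zA-Z]*$/;

const anchors = {
  anywhere: anywhere.test("zdMedium"),        // true: found in the middle
  startMiss: atStart.test("zdMedium"),        // false: the string starts with "zd"
  startHit: atStart.test("Medium rocks"),     // true
  endHit: atEnd.test("I read Medium"),        // true
  endMiss: atEnd.test("Medium rocks"),        // false: the string ends with " rocks"
  whole: wholeString.test("Medium"),          // true: the entire string matches
};

console.log(anchors);
```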

Quantifiers

If we want to match a series of two m characters, we can write /mm/. Regex allows us to specify how many times we want to match a specific pattern.

For example, /m{2}/ indicates a match on two consecutive m characters.

As specified above, in our example /Medi[a-zA-Z]*/, the * indicates that we will match any character in the set [a-zA-Z] as many times as possible.
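Here is a short sketch of the quantifiers discussed so far. The last three lines use quantifiers that this section does not cover (+ for one or more, ? for zero or one, and {2,3} for a range); they are included only as a pointer to what else exists.

```javascript
const doubleM = /m{2}/;
const hammer = doubleM.test("hammer");   // true: contains "mm"
const hamster = doubleM.test("hamster"); // false: only single m

// * means zero or more, so "Medi" on its own still matches.
const zeroOrMore = "Medi".match(/Medi[a-zA-Z]*/)[0];     // "Medi"

// * is greedy: it takes as many letters as it can and stops at the digit.
const greedy = "Medical2020".match(/Medi[a-zA-Z]*/)[0];  // "Medical"

// Quantifiers not covered above:
const oneOrMore = /^Medi[a-zA-Z]+$/.test("Medi"); // false: + needs at least one letter
const optional = /^colou?r$/.test("color");       // true: the u is optional
const range = /^m{2,3}$/.test("mmmm");            // false: four m is too many

console.log(hammer, hamster, zeroOrMore, greedy, oneOrMore, optional, range);
```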

Predefined character classes

Some character classes are already predefined. For example, we often need to match digits, so the predefined \d class matches any decimal digit; it’s equivalent to [0-9].

Other predefined escapes include \t for the horizontal tab and \n for a newline.

You can find a cheat sheet of predefined character classes in the regular expressions guide from Mozilla.
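A few of these in use, as a sketch with invented sample strings. Besides \d, \t, and \n, it uses \w (a letter, digit, or underscore) and \s (whitespace).

```javascript
// \d matches one decimal digit; \d+ matches a run of them.
const numbers = "Order 66, table 7".match(/\d+/g); // ["66", "7"]

// \d and [0-9] accept the same characters.
const isYear = /^\d{4}$/.test("2020");         // true
const isYearRange = /^[0-9]{4}$/.test("2020"); // true

// \t and \n match the tab and newline characters themselves.
const fields = "name\tage\ncity".split(/[\t\n]/); // ["name", "age", "city"]

// \s matches any whitespace character.
const snake = "hello big world".replace(/\s/g, "_"); // "hello_big_world"

// \w matches a letter, digit, or underscore.
const isIdentifier = /^\w+$/.test("user_42"); // true

console.log(numbers, isYear, isYearRange, fields, snake, isIdentifier);
```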

How to Use Captures and Backreferences

When parentheses surround a part of a regex, it creates a capture.

Say we want to match an HTML tag. We can use a regular expression that looks like this: /<([a-z]\w*)\b[^>]*>/

Let’s break it down.

An HTML tag starts with the character < and ends with >.
[a-z] matches any lowercase alphabetical character.
\w* matches zero or more word characters: letters, digits, and the underscore.
\b asserts a position at a word boundary.
[^>]* matches zero or more characters other than >.

This regex will match <div>, <span>, <something>, etc.

In this example, the parentheses capture the name of the tag, the part of the input matched by ([a-z]\w*): for example, div, span, or something.

Try it yourself in an online regex tester or your browser console.
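In code, the captured text is available on the array returned by exec (or by match without the g flag): index 0 holds the whole match and index 1 holds the first capture. The sample tags below are made up.

```javascript
const tagPattern = /<([a-z]\w*)\b[^>]*>/;

const match = tagPattern.exec('<div class="card">');
const wholeMatch = match[0]; // '<div class="card">'
const tagName = match[1];    // "div"

const spanName = tagPattern.exec("<span>")[1]; // "span"

// exec returns null when there is no match, so check before indexing.
const noMatch = tagPattern.exec("no tags here"); // null

console.log(wholeMatch, tagName, spanName, noMatch);
```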

Now, let's say we want our regex to match a correct HTML element that starts with a tag and ends with the same tag closed: <div>something</div>.

In this case, we want to be able to reference the tag we captured previously. The notation of the backreference is the backslash followed by the number of the capture to be referenced: \1.

In our example, we have only one capture. To match a full HTML element, the regex will look like this: /<([a-z]\w*)\b[^>]*>.*?<\/\1>/.

It might seem confusing at first, but let’s break it down:

.*? lazily matches any characters, except for line terminators, consuming as few as possible.
<\/\1> will match a closing HTML tag: < matches its literal self, \/ matches a / character, \1 refers back to our capture, and > again matches its literal self.

This regex will match the following strings: <div> <i>something</i> </div>, <span>my span</span>, etc.

A string like <div> something</other> will not match.

Without the backreference, we would not be able to check that the closing tag matches the opening one, even for a simple HTML element.
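The examples above can be checked with a few lines of code. This is only a sketch for simple, single-line snippets, not a general way to parse HTML.

```javascript
const elementPattern = /<([a-z]\w*)\b[^>]*>.*?<\/\1>/;

// Opening and closing tags agree, so \1 finds the same name again.
const nested = elementPattern.test("<div> <i>something</i> </div>"); // true
const simple = elementPattern.test("<span>my span</span>");          // true

// The closing tag is "other", which differs from the captured "div".
const mismatched = elementPattern.test("<div> something</other>");   // false

// The capture is still available after the match.
const captured = elementPattern.exec("<span>my span</span>")[1];     // "span"

console.log(nested, simple, mismatched, captured);
```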

How Is a Regular Expression Matched?

JavaScript regular expression engines are backtracking matchers whose behavior is usually described with a model called a Nondeterministic Finite Automaton (NFA). A Finite Automaton (FA) is a computational model used to accept or reject strings of symbols.

To simplify, an FA consists of a finite set of states, with possible transitions between them. When evaluating an input string, the engine tries to match each character of the input by evaluating and transitioning through each state until it reaches the end.

Let’s go back to our first example: /Medi[a-zA-Z]*/. If we were to represent it as an FA graph it would look something like this:

[Figure: FA graph for /Medi[a-zA-Z]*/]

Let’s look at another example. Suppose we have the following regex: /Medi(um|cal|cine)/. The FA will look like this:

[Figure: FA graph for /Medi(um|cal|cine)/, with states labeled by letters]

If you test this regex with Medium, Medical, or Medicine, it will follow one of the paths. However, if the regex is checking the input Medicinal, it will first match M, e, d, i, reaching state e; then it will match c and transition to state k, then l.

We are now at Medicin; the next letter is a, which the automaton cannot match, so we cannot transition to state m. The automaton will backtrack to state h and try to match the other path. There’s no match possible there. It will backtrack to state e, where no match is possible either. Then it will fail.

The FA is called non-deterministic because, while trying to match a regex on a given input (e.g., Medicinal), each character in the input string might be checked several times against different states of the automaton.
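To make the backtracking visible, here is a toy matcher written by hand for this one pattern. It is not how a real engine is implemented, and the function name matchMedi is made up; it only mimics the "try one alternative, fail, go back, try the next" behavior and logs how far each attempt got.

```javascript
// Toy matcher for /Medi(um|cal|cine)/, anchored at the start of the input.
function matchMedi(input) {
  const log = [];
  const prefix = "Medi";
  if (!input.startsWith(prefix)) return { matched: false, log };

  // Try each alternative in order; after a failure, go back to the
  // position right after "Medi" and try the next one.
  for (const alternative of ["um", "cal", "cine"]) {
    let i = 0;
    while (i < alternative.length && input[prefix.length + i] === alternative[i]) {
      i++;
    }
    if (i === alternative.length) {
      log.push(`"${alternative}" matched`);
      return { matched: true, log };
    }
    log.push(`"${alternative}" failed after ${i} character(s), backtracking`);
  }
  return { matched: false, log };
}

const result = matchMedi("Medicinal");
console.log(result.matched); // false
console.log(result.log);     // three failed attempts: after 0, 1, and 3 characters

// The built-in engine agrees on the outcome.
const builtIn = /Medi(um|cal|cine)/.test("Medicinal"); // false
console.log(builtIn);
```

Notice that the characters c, i, and n of the input are compared more than once, which is the repeated checking described above.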

How Can We Optimize Regular Expressions?

Even in the simple example above, you saw that the engine had to backtrack several times before declaring that the input does not match.

An important part of optimizing regex is minimizing the amount of backtracking the engine does. As we saw, the engine takes longer to determine that an input is not a match than to determine a successful match. The sooner we throw out the non-matching input, the better.

Here’s a famous example of a bad regex: /^(\w+\s?)*$/. In this example, we’re trying to match words, \w+, each followed by an optional whitespace character, \s?.

If you match it against the string "this will not cause catastrophic back tracking", the engine needs 26 steps to declare that it’s a match. However, if the input ends with a character the pattern cannot match, as in "this will cause catastrophic back tracking.", the engine backtracks so much that it might use 100% of the CPU. If this happens in the browser, the UI will freeze and the browser might reload the page. All because it takes too many steps to figure out that the final . cannot be matched.
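You can observe the effect safely with short inputs. The sketch below times the regex on a matching sentence and on a few short failing inputs, each made of word characters followed by a single period. The timeMatch helper and the chosen lengths are assumptions for this demo; the lengths are kept small on purpose, because each extra character roughly doubles the work, and exact timings will vary by machine and engine.

```javascript
const risky = /^(\w+\s?)*$/;

// Hypothetical helper: run the regex once and report the elapsed time.
function timeMatch(input) {
  const start = Date.now();
  const matched = risky.test(input);
  return { matched, ms: Date.now() - start };
}

// A matching input is resolved quickly.
const ok = timeMatch("this will not cause catastrophic back tracking");
console.log("matching sentence:", ok.matched, ok.ms + " ms");

// Failing inputs: the trailing "." can never match, so the engine tries
// every way of splitting the run of letters before giving up.
const timings = [10, 15, 20].map((length) => {
  const input = "a".repeat(length) + ".";
  return { length, ...timeMatch(input) };
});

for (const { length, matched, ms } of timings) {
  console.log(`length ${length}:`, matched, ms + " ms");
}
```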

In order to optimize your regular expressions, keep the following points in mind.

Use alternation wisely

Regular expressions such as /(A|B|C)/ have a reputation for being slow, particularly when A, B, and C stand for complicated sub-expressions that require backtracking.

In some cases, alternation can be simplified. Instead of writing /(abc|abba)/, where both alternatives start with the same prefix ab, you can factor the prefix out: /ab(c|ba)/. This regex can be faster because the engine matches ab only once and, if ab is not there, fails immediately instead of retrying it for each alternative.
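Before applying this kind of rewrite, it is worth checking that both versions accept the same inputs. The sketch below does that for a handful of made-up samples; it does not measure speed, which depends on the engine.

```javascript
const original = /(abc|abba)/;
const factored = /ab(c|ba)/;

const samples = ["abc", "abba", "xabbay", "abd", "ba"];

// Both patterns should agree on every sample.
const comparison = samples.map((sample) => ({
  sample,
  original: original.test(sample),
  factored: factored.test(sample),
}));
const allAgree = comparison.every((row) => row.original === row.factored); // true

// Note: the capture group now holds less text ("c" instead of "abc").
const originalCapture = "abc".match(original)[1]; // "abc"
const factoredCapture = "abc".match(factored)[1]; // "c"

console.log(comparison, allAgree, originalCapture, factoredCapture);
```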

Only capture groups if you intend to use the text inside

Captures are a powerful feature of regex, but if you don’t intend to use the extracted text there’s no need to use them.

In the previous example, we don’t need to capture c or ba. We can turn this into a non-capturing group as follows: /ab(?:c|ba)/
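The difference shows up in the match result: a capturing group adds an entry to the returned array, while a non-capturing group does not.

```javascript
const capturing = "abba".match(/ab(c|ba)/);
const nonCapturing = "abba".match(/ab(?:c|ba)/);

// Index 0 is always the whole match; captures follow from index 1.
console.log(capturing[0], capturing[1]); // "abba" "ba"
console.log(capturing.length);           // 2
console.log(nonCapturing[0]);            // "abba"
console.log(nonCapturing.length);        // 1
```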

Optimizing greedy quantifiers

A greedy quantifier such as * or + will first try to match as many characters as possible from an input string, even if this means that the input string will not have sufficient characters left in it to match the rest of the regular expression. If this happens, the greedy quantifier will backtrack, returning characters until an overall match is found or until there are no more characters.

A lazy quantifier, on the other hand, will first try to match as few characters in the input string as possible.

In many cases, greedy quantifiers can easily be replaced by lazy quantifiers. Say you want to optimize part of a regex like Med.*m.

If the character m is located near the end of the input string it is better to use the greedy quantifier *. If the character is located near the beginning of the input string it would be better to use the lazy quantifier *? and change the sub-expression to Med.*?m.
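One caveat worth checking in code: greedy and lazy quantifiers can return different matches, so swapping one for the other is only an optimization when both give the result you want. The sample sentence is invented.

```javascript
const text = "Medium and Museum";

// Greedy: .* grabs everything, then backs off to the LAST m.
const greedy = text.match(/Med.*m/)[0]; // "Medium and Museum"

// Lazy: .*? grows one character at a time and stops at the FIRST m.
const lazy = text.match(/Med.*?m/)[0];  // "Medium"

console.log(greedy);
console.log(lazy);
```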

Be specific

When writing a regular expression, the more specific you are the better. Use general sub-expressions like .* sparingly because they can cause the engine to backtrack a lot — particularly when the rest of the expression does not match the input string.

As an alternative, you should use a more specific character class. This gives you more control over how many characters the * will cause the regex engine to consume, giving you the power to stop the excessive backtracking.
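As a small illustration, compare .* with a negated character class when matching a tag. The HTML snippet is made up for the example.

```javascript
const html = '<a href="#">link</a> and <b>bold</b>';

// .* runs to the end of the line and backtracks to the last >,
// so it swallows far more than one tag.
const vague = html.match(/<.*>/)[0];
// '<a href="#">link</a> and <b>bold</b>'

// [^>]* cannot run past the first >, so there is nothing to give back.
const specific = html.match(/<[^>]*>/)[0]; // '<a href="#">'

// With the g flag, the specific version finds each tag separately.
const tags = html.match(/<[^>]*>/g); // ['<a href="#">', '</a>', '<b>', '</b>']

console.log(vague, specific, tags);
```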

When to Not Use Regular Expressions

Regular expressions are often used to validate the user’s input, but frequently they can cause a bad user experience — especially when the programmer makes assumptions about the user input.

A recurring issue that I face as a user is strict email validation. Some forms expect the email to not contain a + sign. Some forms expect the user’s name or phone number to be limited to certain characters. This validation is useless and will end up frustrating your users.

In some cases, regular expressions can be expensive and difficult to maintain. I’ve experienced that firsthand when I used Fluentd to parse some database logs. That Fluentd container regularly reached 100% CPU when the logs were not as expected and the regex could not be matched. Every time the container stopped working, I had to try to understand the complex regex that I had written a few months earlier.

If you find yourself spending a lot of time parsing large text, you might want to consider writing a parser.