Computer Science
Regular Expressions

Introduction

A regular expression is a pattern that describes a set of strings. Regular expressions are used for a wide variety of tasks, including searching for a string within a large body of text, finding and replacing text, validating string input (e.g. email addresses) and finding patterns in DNA.

Regular expressions are written using a standard notation. An example of a regular expression would be:

zo+

This regular expression would match any string that consists of a z followed by at least one o and nothing else. So it would match zo and zoo but not z (no o) or zoom (the m is not allowed by the pattern).
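The behaviour described above can be tried out with a short sketch. Python is used here purely for illustration, since the text does not prescribe a language. The sketch assumes that "match" means the whole string must fit the pattern, which is what re.fullmatch checks.

```python
import re

pattern = re.compile(r"zo+")

# fullmatch succeeds only when the entire string fits the pattern
results = {s: bool(pattern.fullmatch(s)) for s in ["zo", "zoo", "z", "zoom"]}
print(results)
```

With a search instead of a full match, zoom would be reported as containing a match (the zoo at its start), so the choice of function matters.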

The need for a regular expression class or regex class in high level programming and scripting languages can be illustrated with the following example.

qu[a-z]*

This expression would match any string beginning with qu, followed by any number (including zero) of lower case characters in the range [a-z]. It wouldn't tell you whether they are real words or just jumbles of letters that begin with qu. Imagine having to specify the full list of valid matches to this pattern. It would include tens of thousands of words that exist and plenty that don't. Because the * places no upper limit on the number of letters, the list of valid matches is infinite. The pattern is short, concise and far easier for programmers to work with.
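As a rough illustration of how a regex class replaces that impossible list, here is a minimal Python sketch. The helper name matches_qu and the sample words are made up for this example.

```python
import re

QU_WORD = re.compile(r"qu[a-z]*")

def matches_qu(word: str) -> bool:
    """Return True if the whole word is 'qu' followed by lower case letters."""
    return QU_WORD.fullmatch(word) is not None

# A mix of real words, jumbles and non-matches (hypothetical test data)
candidates = ["qu", "queen", "quzzx", "Queen", "quiet!", "aqua"]
qu_results = [(w, matches_qu(w)) for w in candidates]
print(qu_results)
```

The jumble quzzx is accepted just as readily as queen, which is the point made above: the pattern describes a shape, not a dictionary.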

Expression  Description                                                    Match(es)
abc         matches the string 'abc'                                       abc
e+          matches a string of one or more of the letter 'e'              e, ee, eee, etc.
br*         matches 'b' followed by 0 or more of the letter 'r'            b, br, brr, etc.
th?         matches 't' followed by 0 or 1 letter 'h'                      t, th
p|f         matches only the letter 'p' or the letter 'f'                  p, f
[a-z]       matches a single lower case letter                             a, b, c, etc.
[A-Z]       matches a single upper case letter                             A, B, C, etc.
[a-zA-Z]    matches a single upper or lower case letter                    A, a, b, etc.
[abc]       matches one character from the specified set                   a, b, c
.           matches any single character other than a new line
t[^h]       matches 't' followed by a character that is not an 'h'         to, 'tp' in 'tpz', 'tr' in 'trouble', etc.
o{2}        matches exactly 2 consecutive 'o' characters                   oo
er\b        matches 'er' followed by a word boundary (i.e. ending a word)  'er' in 'never', 'er' in 'flower'
er\B        matches 'er' where it is not followed by a word boundary       'er' in 'periwinkle'
\d          matches a digit character, equivalent to [0-9]
\D          matches a non-digit character, equivalent to [^0-9]
\n          matches a newline character
\s          matches any white space (space, tab etc.)
\t          matches a tab character
\w          matches any word character (any alphanumeric or the underscore character)
\W          matches any non-word character
^d          matches a letter 'd' at the start of a string
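A few of the entries above can be checked with a short Python sketch. The sample strings are chosen for illustration only. Where the entry describes a match inside a longer word, the sketch searches within the text rather than requiring the whole string to match.

```python
import re

text = "never flower periwinkle"

# er\b: 'er' followed by a word boundary (the ends of 'never' and 'flower')
end_of_word = [m.start() for m in re.finditer(r"er\b", text)]

# er\B: 'er' not followed by a word boundary (inside 'periwinkle')
inside_word = [m.start() for m in re.finditer(r"er\B", text)]

# t[^h]: 't' followed by any character other than 'h'
t_not_h = re.findall(r"t[^h]", "the trouble to")

# o{2}: the whole string is exactly two 'o' characters
two_os = [bool(re.fullmatch(r"o{2}", s)) for s in ["o", "oo", "ooo"]]

# ^d: a 'd' at the start of the string
starts_with_d = [bool(re.search(r"^d", s)) for s in ["dog", "add"]]

print(end_of_word, inside_word, t_not_h, two_os, starts_with_d)
```

The positions reported for er\b are the ends of never and flower, while er\B finds only the er inside periwinkle.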

Example Patterns

Integers

[+\-]?\d+

The backslash is placed before the '-' in the square brackets to indicate that the symbol is a literal minus sign and is not being used to express a range, as it is in [a-z].

An integer can optionally have a sign in front of it and will consist of 1 or more digit characters.
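A minimal Python sketch of the integer pattern follows. The function name is_integer is hypothetical. The re.ASCII flag is an added assumption: it restricts \d to the digits 0 to 9, in line with the [0-9] equivalence given earlier, because Python's \d would otherwise also accept digits from other scripts.

```python
import re

# Optional sign, then one or more digits
INTEGER = re.compile(r"[+\-]?\d+", re.ASCII)

def is_integer(text: str) -> bool:
    """Return True if the whole string is an optionally signed integer."""
    return INTEGER.fullmatch(text) is not None

int_checks = {s: is_integer(s) for s in ["42", "-7", "+0", "4.2", "", "--3", "12a"]}
print(int_checks)
```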

Real Numbers

[+\-]?\d+(\.\d+)?

Again, a backslash is used, this time to indicate that the dot represents a literal dot and not any character. The round brackets group the decimal part of the number, and the ? after them indicates that the entire decimal part may be absent from a valid string.
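The same kind of check works for the real number pattern. This is a sketch only: is_real is a made-up name, and re.ASCII is again an assumption that keeps \d to the digits 0 to 9.

```python
import re

# Optional sign, digits, then an optional group of '.' plus more digits
REAL = re.compile(r"[+\-]?\d+(\.\d+)?", re.ASCII)

def is_real(text: str) -> bool:
    """Return True if the whole string fits the real number pattern."""
    return REAL.fullmatch(text) is not None

real_checks = {s: is_real(s) for s in ["3.14", "-0.5", "42", ".5", "5.", "1.2.3"]}
print(real_checks)
```

Note that the pattern as written requires digits on both sides of the dot, so .5 and 5. are rejected even though some languages accept them as number literals.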

Email Addresses

[_a-zA-Z\d\-.]+@([_a-zA-Z\d\-]+(\.[_a-zA-Z\d\-]+)+)

A monster, I know. This is still a far more elegant solution than hard-coding the logic for determining the validity of a string which purports to be an email address.
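To see what the monster accepts and rejects, here is a small Python sketch. The function name looks_like_email and the sample addresses are invented for illustration, and re.ASCII is an added assumption so that \d means 0 to 9 only. It checks only the shape described by this particular pattern, not whether an address really exists.

```python
import re

# Local part, '@', then a domain of at least two dot-separated labels
EMAIL = re.compile(
    r"[_a-zA-Z\d\-.]+@([_a-zA-Z\d\-]+(\.[_a-zA-Z\d\-]+)+)",
    re.ASCII,
)

def looks_like_email(text: str) -> bool:
    """Return True if the whole string fits the email pattern above."""
    return EMAIL.fullmatch(text) is not None

samples = [
    "alice@example.com",
    "bob.smith@mail.example.co.uk",
    "alice@localhost",        # no dot in the domain part
    "alice.example.com",      # no '@' at all
    "a@b@c.com",              # '@' is not allowed inside either part
    "first+tag@example.com",  # '+' is not in the pattern's character set
]
email_checks = [(s, looks_like_email(s)) for s in samples]
print(email_checks)
```

The last sample shows a limit of the pattern rather than of the technique: addresses containing a + are in real use, but this expression does not allow for them.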