Computer Science
Regular Expressions

Introduction

A regular expression is a pattern that describes a set of strings. Regular expressions are used for a wide variety of tasks, from searching for a string within a large body of text, to finding and replacing text, validating string input (e.g. email addresses), and finding patterns in DNA.

Regular expressions are written using a standard notation. An example of a regular expression would be,

zo+

This regular expression matches any string consisting of z followed by at least one o. Tested against a whole string, it matches zo and zoo but not z or zoom.
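As a rough illustration of the example above, here is a minimal sketch using Python's standard re module. The choice of Python and the variable names are assumptions made for this example only. It shows the difference between testing a whole string against the pattern and searching for the pattern inside a longer string.

```python
import re

# The pattern from the text: a 'z' followed by one or more 'o' characters.
pattern = re.compile(r"zo+")

# fullmatch() succeeds only if the entire string fits the pattern.
whole_string = {s: pattern.fullmatch(s) is not None for s in ["zo", "zoo", "z", "zoom"]}
print(whole_string)

# search() looks for the pattern anywhere inside the string.
found = pattern.search("zoom")
print(found.group() if found else None)
```

Note that zoom is rejected as a whole string, but a search still finds the zoo at the start of it. Whether zoom counts as a match therefore depends on which kind of test is being applied.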

The need for a regular expression (regex) class in high-level programming and scripting languages can be illustrated with the following example.

qu[a-z]*

This expression matches any string beginning with qu, followed by any number of lower case letters in the range [a-z]. It doesn't tell you whether the matches are real words or just jumbles of letters that begin with qu. Imagine having to specify the full list of valid matches to this pattern: it would include every real word that begins with qu and far more strings that are not words at all. Because there is no limit on the number of letters that may follow qu, the list of valid matches is infinite. The pattern is short, concise and far easier for programmers to work with.
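The point can be sketched in a few lines of Python, again using the standard re module. The list of candidate strings is made up for the example. One short pattern stands in for a list of matches that could never be written out in full.

```python
import re

# 'qu' followed by zero or more lower case letters.
QU_WORD = re.compile(r"qu[a-z]*")

# Hypothetical test strings: real words, nonsense, and near misses.
candidates = ["quiz", "queue", "qu", "quzzxq", "Queen", "quit!", "aqua"]

# Keep only the strings that fit the pattern from start to end.
accepted = [word for word in candidates if QU_WORD.fullmatch(word)]
print(accepted)
```

The nonsense string quzzxq is accepted alongside quiz and queue, since the pattern only describes the shape of the string. Queen is rejected because the pattern is case sensitive.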

Example Expression	Description	Match(es)
abc	matches the string 'abc'	abc
e+	matches a string which is one or more of the letter 'e'	e ee eee etc.
br*	matches a string beginning with the letter 'b' followed by 0 or more of the letter 'r'	b br brr etc.
th?	matches a string beginning with the letter 't' followed by 0 or 1 letter 'h'	t th
p|f	matches a string consisting of only the letter 'p' or the letter 'f'	p f
[a-z]	matches a single lower case letter	a b c etc.
[A-Z]	matches a single upper case letter	A B C etc.
[a-zA-Z]	matches a single upper or lower case letter	A a b etc.
[abc]	matches a string which consists of one character from the specified set	a b c
.	matches any single character other than a new line
t[^h]	matches a string consisting of a letter 't' followed by a single character that is not an 'h'	to ta t5 etc.
o{2}	matches a string consisting of exactly 2 consecutive 'o' characters.	oo
er\b	matches 'er' followed by a word boundary (i.e. as the last two letters of a word)	'er' in 'never' 'er' in 'flower'
er\B	matches 'er' where it is not followed by a word boundary	'er' in 'periwinkle'
\d	matches a digit character, equivalent to [0-9]
\D	matches a non-digit character, equivalent to [^0-9]
\n	matches a newline character
\s	matches any white space (space, tab etc.)
\t	matches a tab character
\w	matches any word character (any alphanumeric or the underscore character)
\W	matches any non-word character
^d	matches a letter 'd' at the start of a string
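A few of the table rows can be tried out with a short Python sketch. This assumes Python's re module, whose notation agrees with the table for these particular constructs. The test strings are invented for the example.

```python
import re

# Whole-string tests for some of the quantifier and set rows.
checks = {
    "e+ on 'eee'": re.fullmatch(r"e+", "eee") is not None,
    "br* on 'b'": re.fullmatch(r"br*", "b") is not None,
    "th? on 'thh'": re.fullmatch(r"th?", "thh") is not None,
    "p|f on 'f'": re.fullmatch(r"p|f", "f") is not None,
    "t[^h] on 'th'": re.fullmatch(r"t[^h]", "th") is not None,
    "o{2} on 'oo'": re.fullmatch(r"o{2}", "oo") is not None,
}

sentence = "never say flower or periwinkle"

# \b: words in which 'er' is followed by a word boundary.
words_ending_er = re.findall(r"\w*er\b", sentence)

# \B: words in which 'er' is followed by another word character.
words_with_inner_er = re.findall(r"\w*er\B\w*", sentence)

# \d: every digit character in a string.
digits = re.findall(r"\d", "a1b22")

# ^: the letter 'd' only at the start of the string.
starts_with_d = [s for s in ["dog", "add"] if re.search(r"^d", s)]

print(checks, words_ending_er, words_with_inner_er, digits, starts_with_d)
```

One caveat for Python specifically: on ordinary text strings, \d and \w also match digits and letters outside the ASCII range unless the re.ASCII flag is passed, so \d is only strictly equivalent to [0-9] with that flag.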

Example Patterns

Integers

[+\-]?\d+

The backslash before the '-' inside the square brackets indicates that the hyphen is a literal character and is not being used to express a range, as in [a-z].

An integer can optionally have a sign in front of it and will consist of 1 or more digit characters.
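A minimal sketch of how the integer pattern might be used for input validation, assuming Python's re module. The helper name is_integer is hypothetical and chosen for this example.

```python
import re

# Optional sign, then one or more digits.
INTEGER = re.compile(r"[+\-]?\d+")

def is_integer(text):
    """Return True if the whole of text fits the integer pattern."""
    return INTEGER.fullmatch(text) is not None

samples = ["42", "-7", "+0", "4.2", "", "--1", "12a"]
integer_results = {s: is_integer(s) for s in samples}
print(integer_results)
```

Using fullmatch matters here. A plain search would find the 4 in 4.2 and report a match even though the string as a whole is not an integer.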

Real Numbers

[+\-]?\d+(\.\d+)?

Here the backslash before the dot indicates that it represents a literal dot and not any character. The parentheses group the decimal part of the number, and the ? after them indicates that the entire decimal part may be absent from a valid string.
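The same approach works for the real number pattern. This is a sketch under the same assumptions as before (Python's re module, with a hypothetical helper name is_real).

```python
import re

# Optional sign, digits, then an optional group of a dot and more digits.
REAL = re.compile(r"[+\-]?\d+(\.\d+)?")

def is_real(text):
    """Return True if the whole of text fits the real number pattern."""
    return REAL.fullmatch(text) is not None

samples = ["3.14", "-0.5", "10", ".5", "5.", "1.2.3"]
real_results = {s: is_real(s) for s in samples}
print(real_results)
```

As written, the pattern requires at least one digit on each side of the dot, so .5 and 5. are both rejected. Whether that is desirable depends on the input being validated.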

Email Addresses

[_a-zA-Z\d\-.]+@([_a-zA-Z\d\-]+(\.[_a-zA-Z\d\-]+)+)

A monster, I know. This is still a far more elegant solution than hard-coding the logic for determining the validity of a string which purports to be an email address.
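To see the email pattern at work, here is a short Python sketch using the re module. The addresses are invented for the example, and the helper name email_domain is hypothetical. The outer parentheses in the pattern capture the domain part, which the sketch returns when the address is accepted.

```python
import re

# The email pattern from the text; group 1 captures everything after the '@'.
EMAIL = re.compile(r"[_a-zA-Z\d\-.]+@([_a-zA-Z\d\-]+(\.[_a-zA-Z\d\-]+)+)")

def email_domain(text):
    """Return the domain if text fits the pattern as a whole, else None."""
    match = EMAIL.fullmatch(text)
    return match.group(1) if match else None

samples = [
    "user.name@example.com",
    "first_last-1@mail.example.co.uk",
    "user@localhost",
    "user@@example.com",
    "@example.com",
]
email_results = {s: email_domain(s) for s in samples}
print(email_results)
```

The pattern insists on at least one dot in the domain, so user@localhost is rejected. It checks only the shape of the string and says nothing about whether the address exists.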
