Lesson 6
Sets, Ranges and Alternation
Introduction
Everything we've learnt so far requires you to know the exact characters that you want to match (or at least the type of character).
Regular expressions are more powerful than that, though.
Sets, ranges, and alternation allow you to perform more complex logic in your expression.
Let's take a look...
Sets
Enclosing one-or-more characters inside square brackets means 'match any of these characters'. This is called a 'set'.
For example, the set [abc] matches either a, b, or c.
This expression selects either f or d followed by ish:
Try it out!
I wish I had chosen the fish dish.
Notice how wish is not selected, because it does not begin with f or d.
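The expression itself isn't reproduced in the text above, so here is a minimal sketch of what it describes, assuming the pattern is [fd]ish and using Python's re module (my choice of tool, not something the lesson prescribes):

```python
import re

text = "I wish I had chosen the fish dish."

# Assumed pattern: the set [fd] (either f or d), then the literal text "ish".
matches = re.findall(r"[fd]ish", text)
print(matches)  # ['fish', 'dish']
```

The word wish is skipped because w is not one of the characters in the set.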
Excluding characters from sets
You can also exclude characters by placing a ^ immediately after the opening square bracket.
The expression [^abc] means any character except a, b, or c.
Here's the opposite of the previous expression. It matches any character except for f or d, followed by the text ish.
Try it out!
I wish I had chosen the fish dish.
Notice how this only matches wish now, and no longer matches fish or dish.
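As a rough illustration, assuming the negated pattern is [^fd]ish (the original expression isn't shown in the text), the same sentence can be checked in Python:

```python
import re

text = "I wish I had chosen the fish dish."

# Assumed pattern: any single character that is NOT f or d, then "ish".
matches = re.findall(r"[^fd]ish", text)
print(matches)  # ['wish']
```

Note that a negated set still has to match one character, so ish on its own (with nothing before it) would not match.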
Ranges
Specifying individual characters is fine when you only need to match a couple of them, but it's not so good when you have something more complex.
For example, to match any digit from 0 to 9 using that method, you'd have to write [0123456789]. That's not ideal.
That's where 'ranges' come in, letting you specify a range of characters to match.
Using ranges
Ranges are also enclosed in square brackets.
Instead of individual characters though, you specify the start and end of the range:
- [0-3] matches the digits 0, 1, 2, or 3
- [e-h] matches the characters e, f, g, or h
- [A-Z] matches the uppercase characters A to Z
- etc.
This expression matches any (lowercase) character from a-z, followed by the literal text ish:
Try it out!
I wish I had chosen the fish dish.
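Here is a small sketch of ranges in action, again in Python. The pattern [a-z]ish is my reading of the description above, and the strings "0123456789" and "abcdefghij" are made-up inputs for illustration:

```python
import re

text = "I wish I had chosen the fish dish."

# Assumed pattern: any lowercase letter a-z, then the literal text "ish".
ish_words = re.findall(r"[a-z]ish", text)
print(ish_words)  # ['wish', 'fish', 'dish']

# The smaller ranges from the list above, applied to illustrative inputs.
digits = re.findall(r"[0-3]", "0123456789")
letters = re.findall(r"[e-h]", "abcdefghij")
print(digits)   # ['0', '1', '2', '3']
print(letters)  # ['e', 'f', 'g', 'h']
```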
Excluding ranges
The ^ character can also be used to 'negate' a range, the same way as we did with individual characters above.
This expression selects any character except for the ones in a-f, followed by ish. See how it no longer selects fish or dish, because both start with a character in this range:
Try it out!
I wish I had chosen the fish dish.
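A quick check of the negated range, assuming the pattern is [^a-f]ish (the expression is not shown in the text, so treat this as a sketch):

```python
import re

text = "I wish I had chosen the fish dish."

# Assumed pattern: any character outside the range a-f, then "ish".
matches = re.findall(r"[^a-f]ish", text)
print(matches)  # ['wish']
```

Both d and f fall inside a-f, so only wish survives.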
Combining ranges in sets
Just as you can combine multiple characters in a set ([abc] matches the individual characters a, b or c), you can also specify multiple ranges in a set.
The expression below matches either a character from a-d or a character from p-z.
Try it out!
I wish I had chosen the fish dish.
You'll commonly see this used in the set [A-Za-z], which matches any uppercase or lowercase letter (when the /i modifier isn't being used to specify case-insensitivity, of course).
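To see combined ranges at work, here is a hedged sketch in Python. The pattern [a-dp-z] is assumed from the description above, and adding + to [A-Za-z] to pull out whole words is my own addition:

```python
import re

text = "I wish I had chosen the fish dish."

# Assumed pattern: one character from a-d OR one character from p-z.
chars = re.findall(r"[a-dp-z]", text)
print("".join(chars))  # wsadcstsds

# [A-Za-z] with a + quantifier matches runs of letters, i.e. whole words.
words = re.findall(r"[A-Za-z]+", text)
print(words)  # ['I', 'wish', 'I', 'had', 'chosen', 'the', 'fish', 'dish']
```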
Tricks with ASCII
When you specify a character in a range, you're actually referring to the numeric code of that character in the character set you're using - usually ASCII.
That means that [a-z] actually means 'ASCII character code 97 (a) to character code 122 (z)'.
Because of this, we can use ranges to do some useful things!
For example, the first printable character in the ASCII character table is the space. The characters before this are 'un-printable' characters, such as tab, the carriage return etc.
The last printable character is the ~.
That means that the range [ -~] will match all printable characters in ASCII.
If you see this in a regular expression, now you know what it means!
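The character codes mentioned above can be verified with Python's built-in ord(). The sketch below also uses [ -~]+ with re.fullmatch to test whether a string is made up entirely of printable ASCII; the two sample strings are invented for illustration:

```python
import re

# Character codes behind the ranges discussed above.
print(ord("a"), ord("z"))  # 97 122
print(ord(" "), ord("~"))  # 32 126

# One or more characters, all within the printable ASCII range.
printable = re.compile(r"[ -~]+")

is_clean = printable.fullmatch("Hello, world!") is not None
has_tab = printable.fullmatch("Hello,\tworld!") is not None
print(is_clean, has_tab)  # True False
```

The second string fails because the tab character (code 9) sits before the space in the table.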
Using Quantifiers
You can also use quantifiers with sets and ranges. They are added after the square brackets.
Remember the + quantifier from Lesson 5? It means 'one-or-more times'.
Here we use it to match any number (0-9) one or more times (+), followed by a % symbol:
Try it out!
Genius is 1% inspiration, 99% perspiration
- Thomas Edison
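A minimal sketch of the quantifier example in Python, assuming the expression is [0-9]+% as the description suggests:

```python
import re

quote = "Genius is 1% inspiration, 99% perspiration"

# Assumed pattern: one or more digits, then a literal % symbol.
percentages = re.findall(r"[0-9]+%", quote)
print(percentages)  # ['1%', '99%']
```

Without the +, the set would match only a single digit, so 99% would come back as 9%.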
Alternation
The | symbol in a regular expression acts like an 'or'.
This expression finds the word 'creativity' or the word 'intelligence':
Try it out!
Creativity is intelligence having fun.
- Albert Einstein
This is called 'alternation'.
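Here is a sketch of simple alternation in Python, assuming the pattern is creativity|intelligence. The sample quote spells Creativity with a capital C, so I've shown the result both with and without case-insensitive matching (Python's re.IGNORECASE flag plays the role of the /i modifier):

```python
import re

quote = "Creativity is intelligence having fun."

# Case-sensitive: the capitalised "Creativity" is not matched.
strict = re.findall(r"creativity|intelligence", quote)
print(strict)  # ['intelligence']

# Case-insensitive: both alternatives are found.
relaxed = re.findall(r"creativity|intelligence", quote, re.IGNORECASE)
print(relaxed)  # ['Creativity', 'intelligence']
```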
You can also use alternation as part of a bigger expression, by enclosing the options in parentheses. You can have as many options as you like, as long as you separate them with the | symbol.
This expression finds either the word Tell, Teach, or Involve, followed by the word me:
Try it out!
Tell me and I forget. Teach me and I remember. Involve me and I learn.
- Benjamin Franklin
These parentheses form a 'capturing group', and they're really useful for other reasons too.
We'll look at capturing groups in more detail in the next lesson.
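As a rough sketch of grouped alternation, assuming the expression is (Tell|Teach|Involve) me, the Python below collects both the full matches and the part captured by the group:

```python
import re

quote = "Tell me and I forget. Teach me and I remember. Involve me and I learn."

# Assumed pattern: one of three words, a space, then the word "me".
pattern = re.compile(r"(Tell|Teach|Involve) me")

# group(0) is the whole match; group(1) is what the parentheses captured.
full = [m.group(0) for m in pattern.finditer(quote)]
captured = [m.group(1) for m in pattern.finditer(quote)]
print(full)      # ['Tell me', 'Teach me', 'Involve me']
print(captured)  # ['Tell', 'Teach', 'Involve']
```

I used finditer here because Python's re.findall returns only the captured group when a pattern contains one, which can be surprising at first.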
Mini-Game
The combination of sets, ranges and alternation allows us to do some powerful things.
You're going to need all of them for this game, I'm afraid. It's a tricky one!
Let's build a tool that detects dates in an email and automatically overlays them on a calendar.
The email below contains four dates. Each date you match will be added to the calendar.
Match them all (without accidentally matching the times), to win!
To: you@youraddress.com
From: boss@yourcompany.com
Hi,
Let's meet on the 1st, 2nd, 5th or the 23rd. I can do 9am or 1pm.
Thanks.
Your expression:
- The dates contain either one or two digits
- They are then followed by either the string st, nd, rd or th
- You'll need to use 'alternation', and there's a quantifier in this expression somewhere too!
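If you'd like to experiment outside the page, here is a small test harness in Python. The helper find_matches is a hypothetical name of my own, and the pattern shown is a deliberately incomplete first attempt rather than the solution: it demonstrates the trap of matching the times as well as the dates.

```python
import re

email = "Let's meet on the 1st, 2nd, 5th or the 23rd. I can do 9am or 1pm."


def find_matches(pattern, text):
    """Return the full text of every match of pattern in text."""
    return [m.group(0) for m in re.finditer(pattern, text)]


# Incomplete attempt: digits alone also pick up the 9 in 9am and the 1 in 1pm.
naive = find_matches(r"[0-9]+", email)
print(naive)  # ['1', '2', '5', '23', '9', '1']
```

Swap in your own expression and keep going until the result contains the four dates and nothing else.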