Lesson 4
Setting Boundaries
Introduction
In Lesson 3 we learnt about 'metacharacters': characters that have a special meaning in the expression.
'Boundaries' and 'Anchors' are metacharacters that let you match positions in the string, such as the start of a word or the end of a line, rather than the characters themselves.
They are incredibly useful and you'll find yourself using them a lot.
Let's take a look...
Word Boundaries
'Word boundaries' occur whenever the text changes from a non-word character to a word character, or vice-versa. Typically, you would use this to find the start or end of words.
These are represented in a regular expression with the \b metacharacter.
The expression below, \ba, finds the letter a, but only when it is the first character in the word. Remove the word boundary and see what happens:
Try it out!
Life is either a daring adventure or nothing at all
- Helen Keller
In the expression above, only the a is selected. The 'word boundary' metacharacter itself doesn't select any characters; it just tells the regular expression engine where to look for matches.
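To reproduce this outside the interactive editor, here is a minimal sketch in Python. The choice of Python and its re module is an assumption made for illustration (the lesson does not name a language), and the variable names are hypothetical.

```python
import re

quote = "Life is either a daring adventure or nothing at all"

# \ba: an "a" that sits directly after a word boundary (start of a word).
word_initial = [m.start() for m in re.finditer(r"\ba", quote)]

# Without the boundary, every "a" in the text is matched.
every_a = [m.start() for m in re.finditer(r"a", quote)]

# The boundary is zero-width: each match is still just the single letter.
matched_text = [m.group() for m in re.finditer(r"\ba", quote)]

print(word_initial)   # positions of "a", "adventure", "at", "all"
print(len(every_a))   # also counts the "a" inside "daring"
print(matched_text)
```

The match positions differ between the two patterns only by the a inside "daring", which has no word boundary in front of it.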
Word boundaries can be confusing though, as they are also found around special characters inside a word. In "don't", for example, there is a word boundary on each side of the apostrophe.
The word boundaries in the text below are marked in yellow. Try changing the string, adding special characters etc., to get a feel for what a 'word boundary' actually is:
Try it out!
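If you are reading this without the highlighted editor, the following sketch marks each boundary with a | character instead of a yellow highlight. It assumes Python's re module; the helper name mark_boundaries and the sample text are hypothetical, chosen only for illustration.

```python
import re

def mark_boundaries(text: str) -> str:
    """Return text with a '|' inserted at every word-boundary position."""
    # \b matches an empty position, so the substitution inserts the marker
    # without removing any characters from the text.
    return re.sub(r"\b", "|", text)

print(mark_boundaries("don't stop"))   # |don|'|t| |stop|
```

Note how the apostrophe splits "don't" into two 'words' as far as \b is concerned, because an apostrophe is not a word character.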
Anchors
We use anchors to 'pin' an expression to the start or end of a string or, as we'll see later, of each line within it.
Start of a string
The ^ character represents the start of the string.
This expression, ^o, matches the letter o only when it occurs at the start of the string (^).
Try removing the anchor character to see what we mean:
Try it out!
One, two, three, four, five
Once I caught a fish alive
We can describe this expression as being 'anchored' to the start of the string.
This expression uses the i modifier from Lesson 2, which makes the expression case-insensitive: it will match both upper and lower-case o characters.
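As a rough illustration, the same check can be sketched in Python, where the i modifier corresponds to the re.IGNORECASE flag. This assumes the two lines of the rhyme are joined by a newline; the variable names are hypothetical.

```python
import re

rhyme = "One, two, three, four, five\nOnce I caught a fish alive"

# Anchored: only an "o" at the very start of the string can match.
anchored = re.findall(r"^o", rhyme, flags=re.IGNORECASE)

# Unanchored: every "o" or "O" anywhere in the string matches.
unanchored = re.findall(r"o", rhyme, flags=re.IGNORECASE)

print(anchored)     # ['O']
print(unanchored)   # ['O', 'o', 'o', 'O']
```

The "O" of "Once" starts a line but not the string, so the anchored expression skips it.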
End of the string
The $ character anchors the expression to the end of the string.
This expression, .$, matches any character (.) that is immediately followed by the end of the string ($):
Try it out!
One, two, three, four, five
Once I caught a fish alive
We say that this expression is 'anchored' to the end of the string.
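A short sketch of the end anchor, again assuming Python's re module and a newline between the two lines of the rhyme (both are assumptions for illustration, not something the lesson specifies):

```python
import re

rhyme = "One, two, three, four, five\nOnce I caught a fish alive"

# .$ : any single character that is immediately followed by the end of the string.
match = re.search(r".$", rhyme)

print(match.group())                     # e
print(match.start() == len(rhyme) - 1)   # True: it is the final character
```

Only the last "e" of "alive" is found. The "e" that ends "five" is followed by a newline, not by the end of the string.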
Multi-line strings
If we have a string with multiple lines, we can use these anchors to match the beginning or end of each line.
By adding the /m modifier to the expression, the ^ will now match the start of each line, while the $ matches the end of each line.
Try it out!
This is line 1
This is line2
This is line3
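The effect of the modifier can be sketched in Python, where /m corresponds to the re.MULTILINE flag. The patterns ^\w+ and \w+$ are hypothetical examples chosen to make the difference visible; they are not the expression used in the lesson's editor.

```python
import re

text = "This is line 1\nThis is line2\nThis is line3"

# Without the modifier, ^ and $ only see the start and end of the whole string.
first_words_plain = re.findall(r"^\w+", text)
last_words_plain = re.findall(r"\w+$", text)

# With the modifier, ^ and $ also match at the start and end of every line.
first_words_multi = re.findall(r"^\w+", text, flags=re.MULTILINE)
last_words_multi = re.findall(r"\w+$", text, flags=re.MULTILINE)

print(first_words_plain)   # ['This']
print(last_words_plain)    # ['line3']
print(first_words_multi)   # ['This', 'This', 'This']
print(last_words_multi)    # ['1', 'line2', 'line3']
```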
Mini-Game
Sentiment Analysis!
Select ONLY the tweets that contain the word 'bad'. Matching a single word in the tweet will select it.
Select these:
Trolly McTrollFace
@troll3545
Hey @baseclass, how can you be this BAD at stuff!?
Grumpy Customer
@grumpy1654
The @baseclass app just crashed on me, this app is so bad it hurts!
But DON'T select these:
Average Joe
@joe7978
Just earnt the 'super contributor' badge in the @baseclass app!!
Happy Customer
@notabot56
I can barely contain my excitement about the @baseclass app.
Your expression:
- We need to capture both upper and lower case 'bad's. You'll need a modifier from a previous lesson for this.
- To avoid accidentally matching tweets with words that contain the text 'bad' (e.g. 'badge'), think about adding word-boundary metacharacters.
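If you want to check an expression away from the game, here is one possible sketch in Python. The candidate \bbad\b with re.IGNORECASE is simply one expression that follows both hints, not necessarily the game's official answer, and the names select, reject and is_selected are hypothetical.

```python
import re

select = [
    "Hey @baseclass, how can you be this BAD at stuff!?",
    "The @baseclass app just crashed on me, this app is so bad it hurts!",
]
reject = [
    "Just earnt the 'super contributor' badge in the @baseclass app!!",
    "I can barely contain my excitement about the @baseclass app.",
]

# Candidate: 'bad' as a whole word, in any mix of upper and lower case.
candidate = re.compile(r"\bbad\b", re.IGNORECASE)

def is_selected(tweet: str) -> bool:
    """A tweet is selected if the candidate matches anywhere in it."""
    return candidate.search(tweet) is not None

print(all(is_selected(t) for t in select))       # True
print(not any(is_selected(t) for t in reject))   # True
```

Dropping either \b lets 'badge' slip through, and dropping the flag misses 'BAD', which is a quick way to see why both hints matter.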