Backreferencing In Regex For Absolute Beginners (Python)

# A color-coded regex tutorial

Liu Zuo Lin

Level Up Coding

· ~3 min read · June 17, 2023 (Updated: June 18, 2023) · Free: No

Let's say that we are given a list of words, and we want to match the words that contain consecutive repeated letters.

words = ['apple', 'orange', 'pear', 'happy', 'sad']

# words we should match
# ['apple', 'happy']

I'll say upfront that regex is not the most efficient or user-friendly way to achieve this. But this exercise is a good way to understand the backreferences that come with it.
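For comparison, here's a minimal sketch of one non-regex way to do the same check. The function name has_double_letter is made up for this example, and isalpha() is a slightly broader test than [a-zA-Z] (it also accepts non-English letters), so treat it as an approximation rather than an exact equivalent.

```python
def has_double_letter(word):
    # Pair each character with the one right after it and look for an equal pair.
    return any(a == b and a.isalpha() for a, b in zip(word, word[1:]))

words = ['apple', 'orange', 'pear', 'happy', 'sad']

plain_matches = [w for w in words if has_double_letter(w)]

print(plain_matches)
# ['apple', 'happy']
```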

The Answer

Let's begin with the end in mind and break it down afterwards.

import re

words = ['apple', 'orange', 'pear', 'happy', 'sad']

regex = r'.*([a-zA-Z])\1.*'

matches = [w for w in words if re.match(regex, w)]

print(matches)
# ['apple', 'happy']

Here's the color-coded regex for easy visualization. (Also where I write that I wish Medium had color coding)

First things first — the 'r'

First things first: notice the r in front of the regex string. This denotes that the string is a raw string, which means that any backslash \ is kept as an actual backslash character instead of starting a Python escape sequence.

I like to use raw strings in regex so I don't have to deal with double backslashes, e.g. \\ to represent a single backslash.
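Here's a small sketch of the difference, using the \1 we'll need later. The variable names are just for illustration.

```python
raw = r'\1'      # two characters: a backslash and the digit 1
not_raw = '\1'   # one character: Python reads \1 as an octal escape, i.e. chr(1)
escaped = '\\1'  # same as the raw version, but with a doubled backslash

print(len(raw), len(not_raw), len(escaped))
# 2 1 2

print(raw == escaped)
# True
```

Without the r, the regex engine would never see the backslash at all, so the backreference would silently turn into something else.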

.*

In regex, . means any character (except a newline, by default). And the * after it means that the . can repeat any number of times, including zero.

Together, .* means any number of any character. This applies to both the red .* in front and the orange .* behind.
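A quick sketch to check this. I'm using re.fullmatch here (not used elsewhere in this article) so that .* has to account for the whole string on its own.

```python
import re

whole_word = re.fullmatch(r'.*', 'apple')    # any number of any character
empty_text = re.fullmatch(r'.*', '')         # "any number" includes zero
two_lines = re.fullmatch(r'.*', 'a\nb')      # . does not match a newline by default

print(bool(whole_word), bool(empty_text), bool(two_lines))
# True True False
```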

[a-zA-Z]

This matches one letter (lowercase or uppercase does not matter)

([a-zA-Z])

When we add parentheses around [a-zA-Z], we are putting [a-zA-Z] inside a capture group.

Doing this does not affect what it matches. In fact, ([a-zA-Z]) will still match one letter (lowercase or uppercase doesn't matter here).

We add a capture group because we want to refer back to it later on.
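A small sketch to show that the parentheses change what gets remembered, not what gets matched. The variable names are just for illustration.

```python
import re

without_group = re.match(r'[a-zA-Z]', 'apple')
with_group = re.match(r'([a-zA-Z])', 'apple')

# group(0) is the whole match, and it is the same either way
print(without_group.group(0), with_group.group(0))
# a a

# only the version with parentheses has a group 1 to refer back to
print(without_group.groups(), with_group.groups())
# () ('a',)
```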

\1

This is the backreference.

Here, we use \1 to backreference the first capture group of the regex string. We use \2 to backreference the second capture group, \3 to backreference the third capture group and so on.

Essentially, this \1 matches whatever ([a-zA-Z]) matches.

If ([a-zA-Z]) matches an a, \1 can only match an a
If ([a-zA-Z]) matches a b, \1 can only match a b

Otherwise, \1 fails to match, and the regex has to try a different position.
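Here's a rough sketch of that rule on two-letter strings. I'm using re.fullmatch so nothing else in the string can interfere, and the name pair is just for this example.

```python
import re

pair = r'([a-zA-Z])\1'

results = {text: bool(re.fullmatch(pair, text)) for text in ['aa', 'bb', 'ab', 'Aa']}

print(results)
# {'aa': True, 'bb': True, 'ab': False, 'Aa': False}
```

Notice that 'Aa' does not match. The group accepts either case, but \1 has to repeat the exact character that was captured.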

([a-zA-Z])\1

And together, ([a-zA-Z])\1 matches 2 consecutive repeat letters!

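The .* at both ends matches everything else. Because re.match only matches from the start of the string, the .* in front is what lets the repeated pair appear anywhere in the word. And this is how this regex is able to match words that contain consecutive repeated letters.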

Here's a color-coded visualization of how each part of the regex matches each part of the word.
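To see the same breakdown in code, here's a minimal sketch. It uses a modified version of the regex where each .* sits in its own capture group. That change is my own assumption for illustration: it pushes the letter group to position 2, so the backreference becomes \2 instead of \1.

```python
import re

# Hypothetical variant of the regex above: group 1 = leading .*,
# group 2 = the repeated letter, group 3 = trailing .*
split_regex = r'(.*)([a-zA-Z])\2(.*)'

for word in ['apple', 'happy']:
    before, letter, after = re.match(split_regex, word).groups()
    print(repr(before), repr(letter * 2), repr(after))
# 'a' 'pp' 'le'
# 'ha' 'pp' 'y'
```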

Conclusion

Hope you learnt something new today!

Also, I wish Medium had color coding for text. Though I understand that people can start making their articles look funky as hell so it might be a risky move on Medium's part. Oh well.
