Understand the basics of Regular Expressions to become more effective in extracting patterns
Regex Example — Image by Author
If you are like me, regular expression patterns looked like gibberish at first. But once you look at them closely, you will notice that they aren’t as daunting as they seem.
As with many concepts, it helps to start with an example. Let us say we are trying to extract the name from the domain of an email address like this one: firstname.lastname@example.org. In this case, we would be trying to extract just the “example” part.
To practice along, I highly recommend you use this site: https://regex101.com/. Before we begin to tackle our example, let us look at some common characters that you may come across or need.
+ The character + in a regular expression means “match the preceding character one or more times”. For example, ab+c matches “abc”, “abbc” and “abbbc”, but it doesn’t match “ac”. The plus character, used in a regular expression, is called a Kleene plus, named after the mathematician Stephen Kleene (1909–1994), who introduced the concept.
* This character in a regular expression means “match the preceding character zero or more times”. For example ab*c matches “abc”, “abbc”, “abbbc” and “ac”. This is likewise called the Kleene star.
The question mark ? indicates zero or one occurrences of the preceding element. For example, colou?r matches both "color" and "colour".
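To see the three quantifiers side by side, here is a minimal sketch using Python's built-in re module. The choice of Python and the variable names are mine, not part of the original walkthrough; the same patterns can be pasted into regex101 instead.

```python
import re

# Candidate strings taken from the examples above.
candidates = ["ac", "abc", "abbc", "abbbc"]

# re.fullmatch succeeds only when the whole string fits the pattern.
plus_matches = [s for s in candidates if re.fullmatch(r"ab+c", s)]
star_matches = [s for s in candidates if re.fullmatch(r"ab*c", s)]

# "colouur" is an extra (hypothetical) test string with two u's.
spellings = ["color", "colour", "colouur"]
optional_matches = [s for s in spellings if re.fullmatch(r"colou?r", s)]

print(plus_matches)      # ['abc', 'abbc', 'abbbc']
print(star_matches)      # ['ac', 'abc', 'abbc', 'abbbc']
print(optional_matches)  # ['color', 'colour']
```

Notice that the only difference between + and * is whether “ac” (zero b's) is accepted.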
. The full stop matches any single character (in most engines, any character except a newline). For example a.c matches “abc”, “adc”, “aec” etc. If we wanted to match multiple characters before the letter “c”, we would just use the star * from above like this: a.*c and this would match “abdefghc”.
[a-z] : This is very useful as it defines a range of possible characters. Here it simply refers to any single lowercase letter from a to z. We can do the same for uppercase letters and digits like this: [A-Z] & [0-9] .
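Here is a small sketch of the full stop and the character ranges in action, again assuming Python's re module. The sample sentence is made up purely for illustration.

```python
import re

# "." stands for exactly one character, so "ac" and "abbc" do not fit a.c.
dot_single = [s for s in ["abc", "adc", "ac", "abbc"] if re.fullmatch(r"a.c", s)]

# ".*" allows any number of characters between "a" and "c".
dot_star = re.fullmatch(r"a.*c", "abdefghc") is not None

# Hypothetical sample text, used only to show the three ranges.
sample = "Order 66 shipped to NYC via truck"

# Adding + after a range collects runs of consecutive matching characters.
lower_runs = re.findall(r"[a-z]+", sample)
upper_runs = re.findall(r"[A-Z]+", sample)
digit_runs = re.findall(r"[0-9]+", sample)

print(dot_single)  # ['abc', 'adc']
print(dot_star)    # True
print(lower_runs)  # ['rder', 'shipped', 'to', 'via', 'truck']
print(upper_runs)  # ['O', 'NYC']
print(digit_runs)  # ['66']
```

The word “Order” is split between the uppercase and lowercase results because each range only covers its own set of characters.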
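^ : In most engines, this matches the starting position of the string, or of each line when multiline mode is turned on.
[^b]at matches all strings matched by .at except "bat". So when ^ is the first character inside the square brackets, it negates the set: the characters that follow it are excluded.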
^[hc]at matches "hat" and "cat", but only at the beginning of the string or line.
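The two roles of ^ are easy to mix up, so here is a short sketch that shows both, assuming Python's re module. The word list and the three-line text are invented for the demonstration.

```python
import re

words = ["bat", "cat", "hat", "rat"]

# Inside brackets, ^ negates the set: any character except "b", then "at".
not_bat = [w for w in words if re.fullmatch(r"[^b]at", w)]

# Hypothetical three-line text: "cat naps", "the hat", "hat rack".
text = "cat naps\nthe hat\nhat rack"

# Outside brackets, ^ anchors the match. By default, Python anchors
# only at the very start of the string.
start_only = re.findall(r"^[hc]at", text)

# With re.MULTILINE, ^ also matches at the start of every line.
per_line = re.findall(r"^[hc]at", text, flags=re.MULTILINE)

print(not_bat)     # ['cat', 'hat', 'rat']
print(start_only)  # ['cat']
print(per_line)    # ['cat', 'hat']
```

The “hat” in the second line is never matched because it does not sit at the beginning of its line.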
Now that we are equipped, let us look at our example again. To get just the name part of the email address’s domain, we would use this regular expression: