Regex is Magic

This is the written version of the talk that I presented this month at PyCon Slovakia in Bratislava.

Introduction

To me, regular expressions (or regexes) were always a very useful tool to have in the toolbox. They felt almost magical. But it was not until I started my first job that I realized they can also induce the feeling of magic in non-technical people.

Regular Expressions as AI

Let me start with a story.

When I was starting my first job, the company was advertising itself as an AI company. I did not really pay it much mind, since my job was only collecting some of the data, and I did not do any of the processing or analysis at the start. I just heard mentions of these systems from time to time.

Then one day, I was to work on one of the systems. It was a system that turned a PDF of an invoice into a JSON description of the invoice.

So the structure looked something like this. The PDF was sent to the microservice dealing with this. This microservice then sent the PDF to Google Vision in order to get the textual data. This text then got sent to the invoice2data library in order to extract the data from it.

So our work was to write the regular expressions for this. And once the system was set up, this was the entire work needed.

This is what the code that we needed to write for this system looked like:

amount:
     - Total\s*:\s*\$([\d\.\,]+)
     - Total amount\s*:\s*\$([\d\.\,]+)
date:
     - (\d+?[st|nd|rd|th]* [A-Z][a-z][a-z], \d\d\d\d)
     - (\d\d\/\d\d\/\d\d\d\d)
     - ([A-Z][a-z]+?-\d\d-\d\d\d\d)
invoice_number:
     - INV(\S+[\d]+)
     - Invoice ID:[\s\S]*?(\d+)

Basically, for each value that we wanted extracted, we needed to write a regular expression. If there were multiple possible patterns, we just needed to write each one on a separate line. Which was a good design choice, since most of these regular expressions tend to be short enough to be really readable.
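
To get a feeling for what such a pattern does, here is a rough sketch of two of them applied to a made-up piece of OCR text (the invoice text below is invented just for illustration):

import re

text = "Invoice ID: 2305\nTotal : $1,250.00"

re.search(r"Total\s*:\s*\$([\d\.\,]+)", text).group(1)
# '1,250.00'

re.search(r"Invoice ID:[\s\S]*?(\d+)", text).group(1)
# '2305'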

I once asked my team leader at the time why we were selling this as AI, since I did not really connect writing regular expressions with artificial intelligence. His answer was that we were using Google Vision, so there is an AI component. Sure, technically true. But even so, we were not working on Google Vision; we were writing regular expressions and selling them as AI.

What are Regular Expressions

So what are these so powerful regular expressions? A regular expression is a sequence of characters that specifies a search pattern in text.

It does the search on the character level, so it is possible to match anything that you can describe on the character level. This can range from a literal match of a string to a string of any characters of any length.

But since it matches on the character level, it cannot do conceptual matching, like "give me all the places in the text". It can only do this if the match can also be described on the character level, like "give me everything after the string Location: until the new line".
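
For example, here is a minimal sketch of that Location: case, on a made-up snippet of text:

import re

text = "Name: Anna\nLocation: Bratislava\nFounded: 907"

# everything after "Location:" up to the end of the line
re.search(r"Location:\s*(.+)", text).group(1)
# 'Bratislava'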

It also does not deal with the structure of the data that one would get by parsing JSON, XML or HTML with a parser. In most of these cases, it is better to use a dedicated parser, since one also gets the structure of the data that way.

Explanatory Example

Let us go through one of the examples, and I will try to explain regular expressions with it.

Let us say that I wanted a very quick overview of when in the last 1000 years Bratislava played a role in history. There is a Wikipedia article about Bratislava, but I want a really quick overview without reading all of it.

I could do this directly from the browser with some JavaScript in the console.

Array.from(document.querySelectorAll('#mw-content-text p'))
     .map(e => e.textContent)
     .join(" ")
     .match(/\d{4}/g)
     .sort()

Sure, it would be possible to do the same thing with Python.

from bs4 import BeautifulSoup
import requests
import re

data = requests.get("https://en.wikipedia.org/wiki/Bratislava")

soup = BeautifulSoup(data.text, "html.parser")
text = " ".join(e.get_text() for e in soup.find(id="mw-content-text").find_all("p"))
re.findall(r"\d{4}", text)

It is a few more lines of code, but that is mostly because the browser had already opened and parsed the page for us.

The important part of this is the \d{4} part. With this, we are telling the program to find all the strings that have four digits one after another. The \d defines the character we are looking for as a digit, while the {4} part says how many of these characters we want in a row.

Since it is matching four digits, it would also match inside numbers with more than 4 digits, since they also include a string of 4 digits one after another. In this case, it would find the leftmost match and then continue with non-overlapping matches.

So the following would be true:

re.findall("\d{4}", "123456789")
# result: ['1234', '5678']

So if you want to detect a string that is exactly 4 digits long, then you would need to add some more criteria, like that the preceding or following character should not be a digit, or something similar.
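
One way to do this is with lookarounds, requiring that the character before and after the four digits is not itself a digit. A small sketch:

import re

re.findall(r"(?<!\d)\d{4}(?!\d)", "123456789")
# []

re.findall(r"(?<!\d)\d{4}(?!\d)", "in 1291 and again in 1536")
# ['1291', '1536']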

We do not really need this in our case, since it is a throwaway example, and we will get something useful with just the basic regular expression.

We can see this if we try to plot all these values.

As you can see from the graph, there are no values with more than 4 digits. There is one value at the 5000 mark, but searching for this value in the text shows that it is actually a year, just before Christ, not something that we would want for our last-1000-years project.

So let us remove that one value and plot it again.

Looking at this graph now, we can see some recency bias, since the biggest activity seems to be around and after the year 2000, a time when the people writing Wikipedia were most likely already alive.

But even excluding this, we can see an active time around 1800 and another smaller one around 1500. The first mention seems to be in the first years after 1000, which indicates that it is an old city.

How are Regular Expressions Made

Regular expressions are made of two different types of characters:

  • literal characters
  • meta characters

Literal characters are just that: characters that represent themselves. So while it is not really the point, it is possible to find just exact matches in the text. As long as only literal characters are used, this is the result.

So using a normal English-like string would result in regex finding an exact match.

re.findall("cat", "My cat went to the cat store and met the owner.")
# returns ['cat', 'cat']

But we can find exact matches with the str.index() method, and we can check if a substring exists in a string with the in keyword. We do not really need regular expressions to do this.

Which is where the meta characters come in. With meta characters it is possible to describe what kind of character we expect, and how many of them there are supposed to be. For example, the . is a meta character describing any kind of character.

So if we go back to the Bratislava example, the code looked like this:

\d{4}

The \d part is a meta character for a digit, and the {4} describes the number of characters that we are expecting.

Meta Characters

Meta Characters for Character Types

There are different types of meta characters. Some of them are listed below:

  • \d -> digit
  • \s -> whitespace character
  • \w -> alphanumeric character or underscore
  • \b -> word boundary
  • . -> any character
  • ^ -> start of the line
  • $ -> end of the line

Usually, the lowercase letters define the group to search for, and the uppercase letters define everything but that group. So if \d defines digits, then \D defines any character but a digit.
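
A quick illustration of such a pair, on a made-up string:

import re

re.findall(r"\d", "Flight LX1234")   # ['1', '2', '3', '4']
re.findall(r"\D", "Flight LX1234")   # ['F', 'l', 'i', 'g', 'h', 't', ' ', 'L', 'X']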

Generally, they are just a search away, so I would not start by memorizing all of them. The ones that you end up using the most are the ones that you will end up remembering.

Meta Character Groupings

It is also possible to define groups of characters. For this, the square brackets [] are used. It is possible to specify any characters in a range: in English-speaking countries, [a-zA-Z] can be used to define all letters of the English alphabet. It is possible to add as many ranges or characters as needed. Getting only even digits could be done with [02468].

When using ranges, they are defined by their Unicode placement. So using [0-9] will get one all the Latin digits. But using [一-九] (the same, but with the digits replaced by their Japanese equivalents) would match more than just digits.

It is also possible to invert the selection by using the ^ inside the square brackets. So getting everything but digits could be done with [^\d].
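
A couple of small examples of such character groups, on made-up strings:

import re

re.findall(r"[02468]", "13579 24680")     # ['2', '4', '6', '8', '0']
re.findall(r"[a-zA-Z]+", "Regex 4ever")   # ['Regex', 'ever']
re.findall(r"[^\d]+", "abc123def")        # ['abc', 'def']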

Meta Characters for Letter Repetitions and Multiple Choices

It is also possible to say how many times a specific character or character meta character should repeat. This is why I could write \d{4} instead of \d\d\d\d. Below is the list, and a few quick examples follow it:

  • * -> 0 or more repetitions
  • + -> 1 or more repetitions
  • ? -> 0 or 1 repetition, or the non-greedy variant when placed after another quantifier
  • {n} -> exactly n repetitions
  • {,n} -> up to n repetitions
  • {n,} -> n or more repetitions
  • {n,m} -> between n and m repetitions
  • | -> or
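
A few quick examples of these in action, again on made-up strings:

import re

re.findall(r"ca*t", "ct cat caaat")          # ['ct', 'cat', 'caaat']
re.findall(r"ca+t", "ct cat caaat")          # ['cat', 'caaat']
re.findall(r"colou?r", "color colour")       # ['color', 'colour']
re.findall(r"\d{2,4}", "7 42 2023 123456")   # ['42', '2023', '1234', '56']
re.findall(r"cat|dog", "a cat and a dog")    # ['cat', 'dog']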

Getting Information out with Regular Expressions: the Example of Logs

At my job, my team is responsible for keeping specific types of integrations with our product working. There are currently hundreds of them, and while most of them are based on official APIs, some of them are scrapers.

Keeping track of hundreds of documentation pages and changelogs is not really practical. We certainly do not have enough people to do this. Because of this, we generally wait until something fails before we fix it.

Our logs for detecting this are always formatted in the same way (since it is our code sending them), so this is a good case for regular expressions. The format looks like this:

Fetching data for REGION CUSTOMER (CUSTOMER_ID) - SERVICE failed with status ERROR due to SOME DESCRIPTION

Generally, we can ignore the SOME DESCRIPTION part, since it does not really tell us any information that is not already implicitly included in the ERROR data. So we will only look at the string before it.

So the general regex that I would be using would look like this:

Fetching data for (\w+) ([\w\s]+?) \((\d+?)\) - ([\w\.]+?) failed with status (.+?) due to

You can see in this regular expression that the data we could potentially be interested in is enclosed in parentheses. These are called groups, and they are a way to get just the data you need from the text. The rest of the text can be discarded.

If there are multiple groups, like in the example above, it is possible to use the (?P<name>...) syntax to name them. So if I wanted to name all of them, it would look sort of like this:

Fetching data for (?P<region>\w+) (?P<customer>[\w\s]+?) \((?P<customer_id>\d+?)\) - (?P<integration>[\w\.]+?) failed with status (?P<error>.+?) due to

This allows using the names with re.Match.group(group_name) to get the match for each of the defined groups.
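
Here is a minimal sketch of this on a made-up log line (the region, customer and error values are invented, not real data):

import re

LOG_PATTERN = re.compile(
    r"Fetching data for (?P<region>\w+) (?P<customer>[\w\s]+?) "
    r"\((?P<customer_id>\d+?)\) - (?P<integration>[\w\.]+?) "
    r"failed with status (?P<error>.+?) due to"
)

line = ("Fetching data for EU Acme Corp (12345) - billing.api "
        "failed with status 403 due to expired token")

match = LOG_PATTERN.search(line)
match.group("region")       # 'EU'
match.group("customer_id")  # '12345'
match.group("error")        # '403'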

If I were using it in practice, I would not always be searching for all of them; I would change some of the groups to literal characters instead. There are two reasons for this. For me the main one is that I do not need to filter the results later. The other one is that the sooner a non-match is discarded, the faster the regex usually is. If possible, one wants to filter out all the unwanted matches in the regex itself.

So if I were looking at a specific error, I would change the error field to the error I am interested in. If I got a non-specific customer complaint, then I would fill in the region, customer and customer ID and look at the specific errors for them. If I wanted to look at a specific integration, then I would do the same for that field.

Flags in Regular Expressions

Python (as well as other programming languages) provides flags to control the behaviour of the regular expression. I am listing some of them below.

  • re.DOTALL -> the dot also matches the newline
  • re.MULTILINE -> ^ and $ also match at newlines
  • re.ASCII -> match only ASCII characters
  • re.IGNORECASE -> do a case-insensitive search

Using a regular expression with the IGNORECASE flag is a way to do case-insensitive matches without transforming the string, so I guess this is the one that I use the most frequently.
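
A quick illustration of the flag, on a made-up string:

import re

re.findall("cat", "Cat cAt cat")                        # ['cat']
re.findall("cat", "Cat cAt cat", flags=re.IGNORECASE)   # ['Cat', 'cAt', 'cat']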

My Name is Invalid, or Differences Between Programming Languages

When I enter my name in forms online, sometimes I get the message that my name is invalid. Which can get really annoying, since I am pretty sure I know my name better than some validation code on the internet. It is the same as on my official documents. But I guess the name Sara Jakša is just too unusual to accept.

But I think it is a pretty good example for showing some differences between programming languages.

Let us say we do a validation where we will accept any alphanumeric character and space in the name. This should cover everything, right? So if we want to check whether there are any characters besides these, we should be using the [^\w\s]+ regular expression.

But while Python finds nothing wrong with my name, JavaScript does. The letter š is not considered either alphanumeric or a space there, so my name would get rejected.

To get the same effect in Python, the ASCII flag needs to be used. In that case the š is no longer considered a letter.
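
A minimal sketch of the difference, using my own name as the test string:

import re

name = "Sara Jakša"

re.findall(r"[^\w\s]+", name)                  # [] because š counts as a word character
re.findall(r"[^\w\s]+", name, flags=re.ASCII)  # ['š'] because with re.ASCII it no longer does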

I was told that in JavaScript all I needed to do was add the u flag and it would work, but "Sara Jakša".match(/[^\w\s]/gu) still reports the š as a non-letter, so I am sure I am doing something wrong.

But I am really happy that Python supports all of this by default. I heard that most programming languages were created in the US, which is why they heavily use the letters easily accessible with the US keyboard layout. But I like that we are at least starting to embrace that there are different languages, and that they can be usable by default.

How to Remember all of This

Regular expressions are generally considered unreadable because of their weird syntax. But I would say that they are just as unreadable as Python code is to non-programming people. I had to tutor my cognitive science classmates so they could pass Programming 1; Python code is not clearly readable to people with no knowledge of programming. It has rules that need to be known.

The same thing happens when reading a foreign language. Whatever book is the first book that you read in a language, it is going to be frustrating, and you might end up looking up meanings a lot. But just as with reading Python programs, with practice it becomes easier.

It is the same with regular expressions. With practice it becomes easier to both read and write them. And just like you needed to check the meaning of Python built-in functions, the same is true with regular expression rules. The help() command can help here: help(re) does bring up the list of meta character meanings.

The only other thing that can help is to use recall. First try to bring it up from memory, even if just for a couple of seconds, before looking it up. This should strengthen the memory for the next time, and eventually you will not have to look it up at all.

Eventually, just like a lot of us learned to read English, programming languages or any other foreign language, the regular expression language will also become easy.

Language Learning

You could probably see from the previous paragraphs that I like languages. That is even though linguistics was the only course during my cognitive science master's whose exam I had to take multiple times, since I failed it the first time.

One of the languages that I am learning is Japanese. Since Japanese uses different writing systems, it can be intimidating to start with reading, which is how I would usually start learning a language.

Thankfully, they produce a lot of anime, and for a lot of it it is possible to find subtitles. So reading (and analyzing) subtitles while watching anime could be considered studying time.

But the subtitles usually come in two formats, both of which include things that I am not interested in at all.

Here is one example from the fifth episode of Trapped in a Dating Sim: The World of Otome Games Is Tough for Mobs. The scene starts when, during the duel, the prince asks if this is fun, and Leon delivers his monologue.

48 00:04:03,494 --> 00:04:04,745 (リオン)最高だね

49 00:04:04,828 --> 00:04:05,829 あっ

50 00:04:06,413 --> 00:04:08,916 (リオン)もう 最高の気分だよ!

51 00:04:08,999 --> 00:04:11,251 俺は 確かに傲慢だが—

52 00:04:11,335 --> 00:04:14,588 お前らは そんな俺にも勝てないわけだ

53 00:04:14,672 --> 00:04:19,718 格下に見ていた相手に負ける気分は どうですか? 王子様!

54 00:04:19,802 --> 00:04:22,096 き… 貴様!

55 00:04:22,846 --> 00:04:25,849 (リオン)何が “王族になど生まれたくなかった”だ

56 00:04:26,266 --> 00:04:30,896 お前 変態ババアに売られて 殺されそうになったこと あるのか?

57 00:04:30,980 --> 00:04:31,981 (ユリウス)何!?

58 00:04:32,398 --> 00:04:35,234 女子にペコペコ 頭下げた上に—

59 00:04:35,567 --> 00:04:38,737 お茶会を台なしにされた経験は?

60 00:04:39,113 --> 00:04:43,575 話しかけただけで 突き飛ばされた俺たちの気持ちが—

61 00:04:43,659 --> 00:04:45,744 分かるのかよ!

So taking everything we learned into account, it is pretty easy to get just the lines that are actually spoken. The MULTILINE flag, used here in its short form M, is really useful, since it allows me to find the lines without actually splitting the text into separate lines.

data = re.sub(r"^\d+?$", "\n", data, flags=re.M)
data = re.sub(r"\d\d:\d\d:\d\d,\d\d\d --> \d\d:\d\d:\d\d,\d\d\d", "", data)
data = re.sub(r"([^)]+?)", "", data, flags=re.M)
data = re.sub(r"\\u\d+?", "", data)
data = re.sub(r"\n+", "\n", data)

The first line removes all the lines indicating which subtitle entry it is. The second removes the timing data. The third removes the sound effects, the scene descriptions and the indications of who is speaking. The fourth line removes some weird escape sequences sometimes left in the text. The last one then replaces multiple new lines with just one.

So the subtitles above become a normal script.

最高だね あっ 最高の気分だよ! 俺は 確かに傲慢だが— お前らは そんな俺にも勝てないわけだ 格下に見ていた相手に負ける気分は どうですか? 王子様! き… 貴様! 何が “王族になど生まれたくなかった”だ お前 変態ババアに売られて 殺されそうになったこと あるのか? 何!? 女子にペコペコ 頭下げた上に— お茶会を台なしにされた経験は? 話しかけただけで 突き飛ばされた俺たちの気持ちが— 分かるのかよ

Once I have the scripts, I might want to do some analysis of which series to analyze first. It is much better to start with something easier, like Ascendance of a Bookworm, than with, for example, Tanya the Evil. The latter would just bury me with military vocabulary that I would only find useful when reading military fiction books or at a very advanced level.

One very crude measurement is how many different kanji there are and how frequent they are. The fewer of them there are and the more frequent they are, the easier the episode could be to understand.

It is really easy to get only the kanji from a string with regular expressions. Taking the same episode as before, we could run the following script:

import collections

kanji = re.sub(r"[^\u4E00-\u9FFF]", "", data)

len(set(kanji))
# 420

collections.Counter(kanji).most_common(10)
# [('俺', 37), ('殿', 28), ('下', 27), ('気', 24), ('分', 23), ...]

There are 420 different kanji in the entire script. The general-use list has a bit over 2000, so only about a fifth of them are needed to understand the entire episode. The most common ones also look really frequent, which are both good signs.

The regex above, [^\u4E00-\u9FFF], uses a range covering most of the kanji, which I took from a StackOverflow answer. It has been useful in the different analyses that I have been doing for my own pleasure.

Getting Emojis with Unicode Categories

Unicode categories are groupings of characters, so that they can be referenced by the group name. There are two main ways of grouping the characters.

The first one is blocks. All Unicode code points are divided into non-overlapping blocks. It is a similar idea to what makes the [0-9] or [a-z] ranges work, except that blocks have specific names, so only the name needs to be remembered, and not the first and last character of the range.

But because this means that the characters need to be next to each other in the Unicode order, blocks provide only a very coarse grouping of the characters.

The second one is scripts. A script defines the characters that a certain writing system uses, be it Latin, Cyrillic, Arabic, Japanese, Korean or many others. These characters are usually scattered through multiple blocks. For example, English uses characters from both the Basic Latin and General Punctuation blocks. The letter Š, which was the problem with my name, is in the Latin Extended-A block, so Slovenian also uses some of the letters from this block. But not all of the letters from it, as we do not use some of the letters there, like the letters with an ogonek such as ą or ę.

This means that characters from the same script may be in several different blocks, and a block can have characters from different scripts.

The script values are generally preferred to the block values, but it depends on the problem that you are solving.

Now, the standard library in Python does not yet support these values. But the third-party regex library, which the standard library documentation mentions as an alternative, does. While it does not yet support all the blocks and scripts, it supports 100 of them. Here is the list of supported scripts and blocks.

One of the things it supports is finding emojis. Here is an example with a short sentence.

import regex

regex.findall(r"\p{Emoji}", "plenty of 🐟 in the 🌊")
# result: ['🐟', '🌊']

The \p{} part is an example of how it is possible to reference the block and script names in regular expressions. It can contain any supported Unicode category, and it will be matched by all characters in that category.

The \p{} will always match one character, and it can be manipulated the same way as other regex meta and literal characters. So matching a row of 4 emojis would be \p{Emoji}{4}.
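
As a small, hedged illustration (the rating string is made up, and I am assuming the Emoji property behaves as in the previous example):

import regex

regex.findall(r"\p{Emoji}{4}", "rating: 🌟🌟🌟🌟 out of five")
# expected result: ['🌟🌟🌟🌟']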

How did I even find out about this? I once had an idea where I would compare different communities by their emoji usage, which is why I am aware of it. The support for blocks and scripts is the one reason why I would use this library instead of the standard library. I will admit, if I am not using them, then I generally stick with the standard library.

Why were we not using this in the Japanese subtitles example? Some of the Japanese groups, like the hiragana and katakana blocks, are supported, but not the entire Japanese script. So I took the solution that I did.

Last Example: Code Rewrite

One thing that I was recently doing at my job is dealing with a code rewrite. The team I am a part of is in the middle of restructuring our JavaScript project. The main reason is that it is apparently hard to debug promises inside of promises.

But that also means that there are a lot of places where the code needs to be changed, since it would not work otherwise.

One of the examples that I was working on in the week of the presentation was rewriting the CSV file processing from the .on('json') / .on('done') callback syntax to the await syntax.

The original code would look something like this:

csv()
  .fromString(csvString)
  .on('json', (jsonObject) => {
    // something is done (1)
  })
  .on('done', () => {
    // something is done (2)
    resolve(result)
  });

And the new code would then look something like this:

const csvData = await csv().fromString(csvString)
for (const jsonObject of csvData) {
  // something is done (1)
}
// something is done (2)
return result

There were multiple examples of this, so it was easier to write a regex to help me rewrite them.

csv\(\)[\s\n]*?
\.fromString\((.+)\)[\s\n]*?
\.on\(\s*?'json', \(.*?\) => \{([\s\S\n]+?)\}\)[\s\S\n]+
\.on\('done', \(.*?\) => \{([\s\S\n]+?)\}\)

Here the regular expression is written over multiple lines for readability, but in reality it only worked when written as one line. Since it is a throwaway regular expression that would have been deleted after use if not for this presentation, I am not that worried about readability. Which is very different from the first example in this presentation, since those were regular expressions that we did need to maintain in production, so it helps that they are really short. Which I think is the only tip I have on how to make regular expressions more maintainable.
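
As a hedged aside, the same pattern can also be checked in Python, which shows what the three groups capture. The snippet below is just the example code from above pasted into a string; the real testing happened in WebStorm, as described next.

import re

old_source = """csv()
  .fromString(csvString)
  .on('json', (jsonObject) => {
    // something is done (1)
  })
  .on('done', () => {
    // something is done (2)
    resolve(result)
  });"""

pattern = re.compile(
    r"csv\(\)[\s\n]*?"
    r"\.fromString\((.+)\)[\s\n]*?"
    r"\.on\(\s*?'json', \(.*?\) => \{([\s\S\n]+?)\}\)[\s\S\n]+"
    r"\.on\('done', \(.*?\) => \{([\s\S\n]+?)\}\)"
)

match = pattern.search(old_source)
match.group(1)  # 'csvString'
match.group(2)  # the body of the 'json' handler, whitespace included
match.group(3)  # the body of the 'done' handler, including resolve(result)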

In order to write this regular expression for the rewrite, I was a heavy user of JetBrains WebStorm's search function to test it. This allowed me to see which places in the code my script would find with this regular expression. So I am quite grateful that their search has the option to search with a regex. This made the testing very simple.

It is very helpful to test regular expressions in one of these environments. For regular expressions outside of a code base, I generally use https://regex101.com/. But I think any such tool would do the job.

Conclusion

When I talk to people around me, I get the feeling that they do not like regular expressions a lot. At my job we replaced some other regular expressions with machine learning models, which I hope was not done for maintenance reasons.

I also have people quoting the quote below, thinking it is going to prove an argument against regular expressions:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Which does not actually prove anything, since based on my superficial search, this was said in the context of not misusing regular expressions for problems they are not suited for.

But I actually like regular expressions, since they are a good tool in my toolbox. True, if there is a parsing library for the input, like BeautifulSoup for HTML, regular expressions are not a good fit.

But a lot of textual data cannot be parsed with one of these libraries. Non-technical people do not like seeing HTML or XML or even JSON; they prefer text. And this text contains a lot of potentially interesting data, and regular expressions are a great way to get it out.

So my hope with this was that I managed to show you that regular expressions are useful, and that you should not look at them with disdain.