Finding Overlapping Matches - Another reason to use regex over re in Python
When I gave a presentation on regex at Python Slovakia, I had some trouble answering a question about the practical differences between the re package (part of the standard library) and the regex package (a recommended alternative).
So when I noticed a case where it could help me, I was really happy.
The case was like this. We use ReadMe for our documentation. A recent change to their site meant that markdown images needed to be separated by two newlines; otherwise they would be overwritten by any change not made in the raw editor.
Since I cannot expect people to know this, I needed to find these pictures and make sure that they were separated by double newlines.
All was fine, except in cases where one picture directly followed another. In these cases, the newlines had already been consumed by the match for the previous picture, so they were not detected for the next one.
This is the result of re.findall() finding only non-overlapping matches.
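To make the problem concrete, here is a minimal sketch using only the standard library. The sample text and the names `body`, `pattern`, and `matches` are made up for illustration; the pattern is a slightly simplified version of the one used in the final script.

```python
import re

# Hypothetical sample: two images in a row, sharing the blank line between them.
body = "intro\n\n![a](a.png)\n\n![b](b.png)\n\nend"

# An image counts as "correct" when it has a blank line on both sides.
pattern = r"\n\n\s*?!\[[^\]]*?\]\([^)]*?\)\s*?\n\n"

# The first match consumes the blank line after image "a",
# so image "b" no longer has a leading "\n\n" left to match.
matches = re.findall(pattern, body)
print(matches)
```

Only the first image is reported, even though both are correctly separated.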
But the regex package does have a way to also find overlapping matches, so this is what I ended up using.
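As a small sketch of the difference, assuming the third-party regex package is installed (pip install regex), the same search can be run with and without the `overlapped` flag. The sample text and variable names are hypothetical.

```python
import regex  # third-party package, not part of the standard library

# Hypothetical sample: two images in a row, sharing the blank line between them.
body = "intro\n\n![a](a.png)\n\n![b](b.png)\n\nend"
pattern = r"\n\n\s*?!\[[^\]]*?\]\([^)]*?\)\s*?\n\n"

# Default behaviour mirrors re.findall(): matches may not share characters.
default_matches = regex.findall(pattern, body)

# With overlapped=True the shared blank line can be reused by the next match.
overlapping_matches = regex.findall(pattern, body, overlapped=True)

print(len(default_matches), len(overlapping_matches))
```

With the flag set, both images are found, because the blank line between them is allowed to appear in two matches.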
The one thing I needed to be careful about was filtering out duplicate matches. Since I was not sure whether there were any other whitespace characters between the newlines and the images, I ended up getting duplicate results with the overlapping matches. But stripping the whitespace and removing duplicates before comparing solved this problem.
This is the final code that I ended up using.
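The clean-up step can be sketched on its own. The list below is a hand-written, hypothetical stand-in for the kind of raw results an overlapping search can return when an image is preceded by three newlines instead of two: the same image shows up twice, differing only in leading whitespace.

```python
# Hypothetical raw matches: image "a" appears twice with different padding.
raw_matches = [
    "\n\n\n![a](a.png)\n\n",
    "\n\n![a](a.png)\n\n",
    "\n\n![b](b.png)\n\n",
]

# Stripping the whitespace makes the duplicates identical,
# and the set keeps only one copy of each image.
unique_images = {match.strip() for match in raw_matches}

print(sorted(unique_images))
```

After this step the count reflects distinct images rather than distinct matches, which is what the comparison needs.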
    import requests
    import regex as re

    auth = requests.auth.HTTPBasicAuth("my API token", "")
    data = requests.get("https://dash.readme.com/api/v1/categories/getting-started/docs", auth=auth)
    slugs = [site["slug"] for site in data.json()[2]["children"]]

    for slug in slugs:
        data = requests.get("https://dash.readme.com/api/v1/docs/" + slug, auth=auth)
        # Assumption: the markdown source of the doc is in the "body" field of the response.
        body = data.json()["body"]
        # Images with a blank line on both sides; overlapped=True lets neighbours share one.
        correct_images = set([i.strip() for i in re.findall(r'\n\n\s*?\!\[[^\]]*?\]\([^\)]*?\)\s*?\n\n', body, overlapped=True)])
        # Every image, regardless of what surrounds it.
        all_images = set([i.strip() for i in re.findall(r'\!\[[^\]]*?\]\([^\)]*?\)', body, overlapped=True)])
        # Report pages where some images are not properly separated.
        if len(all_images) != len(correct_images):
            print(f"{slug}: {len(all_images)} - {len(correct_images)}")