Blog of Sara Jakša

Removing All Whitespace Except Normal Spaces with Python and Regex

My sister asked me to help her scrape a website for her summer work. While I was doing this, some of the data behaved strangely when I tried to write it to the CSV file. It would jump to the next line, indented by several spaces.

I first tried replacing all the usual suspects, like tabs ("\t"), newlines ("\n"), carriage returns ("\r"), and even double spaces ("  "), among other things. But none of that solved the problem.
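The post does not say which character was actually causing the jump. As a minimal sketch of how one might find out, the helper below (the name `list_whitespace` is made up for this illustration) lists every whitespace character in a string other than the plain space, together with its code point and Unicode name:

```python
import unicodedata


def list_whitespace(text):
    """Return (index, code point, name) for each whitespace character except the plain space."""
    found = []
    for index, char in enumerate(text):
        if char.isspace() and char != " ":
            # Control characters such as "\t" have no Unicode name, so supply a default.
            name = unicodedata.name(char, "<unnamed control character>")
            found.append((index, f"U+{ord(char):04X}", name))
    return found


for entry in list_whitespace("a\u00a0b\tc d"):
    print(entry)
```

Running something like this on one of the misbehaving values would show whether the culprit is, for example, a no-break space or a line separator rather than one of the usual suspects.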

Emil Stenström had published an answer on Stack Overflow that eventually led me to my solution. The only problem with his answer was that it also removed the spaces.

So, looking for the quickest possible solution, I used the following code:

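    import re

    # Protect normal spaces with a placeholder, strip all whitespace, then restore the spaces.
    text = text.replace(" ", "ß")
    text = re.sub(r"\s+", "", text, flags=re.UNICODE)
    text = text.replace("ß", " ")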

What I did was replace all the spaces with a letter that I was sure would not appear in the text. Since the text was in Slovenian, ß worked. If I were parsing a website in German, that choice of letter would not be so good. In that case, a letter like ć would be a much better choice, since it is unlikely to appear in the text.
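The placeholder trick can be wrapped in a small function that refuses to run when the placeholder already occurs in the input. This is only a sketch under assumptions: the function name, the default placeholder (a private-use code point, U+E000) and the safety check are illustrative additions, not part of the original code.

```python
import re


def strip_whitespace_keep_spaces(text, placeholder="\ue000"):
    """Remove every whitespace character except the plain space."""
    # Assumed safety check: the placeholder must not already be in the text,
    # otherwise it would be turned into a space at the end.
    if placeholder in text:
        raise ValueError("placeholder already occurs in the text")
    text = text.replace(" ", placeholder)   # protect normal spaces
    text = re.sub(r"\s+", "", text)         # drop all remaining whitespace
    return text.replace(placeholder, " ")   # restore normal spaces


print(strip_whitespace_keep_spaces("a b\t\nc\u00a0d"))  # prints: a bcd
```

With a check like this, the choice of placeholder no longer depends on guessing the language of the text.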

I am sure that there exists a more Pythonic way of doing this, but I just was not able to come up with it. Also, Google was quite unhelpful in this case.
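One candidate for a more direct approach, offered as a sketch rather than as the method used above, is a negated character class. In Python's `re` module, `[^\S ]` matches any character that is neither non-whitespace nor a plain space, which means any whitespace character except the space, so no placeholder is needed. The names below are made up for this example.

```python
import re

# Runs of whitespace characters that are not the plain space.
NON_SPACE_WHITESPACE = re.compile(r"[^\S ]+")


def remove_non_space_whitespace(text):
    """Delete tabs, newlines, no-break spaces and so on, but keep normal spaces."""
    return NON_SPACE_WHITESPACE.sub("", text)


print(remove_non_space_whitespace("Ljubljana,\r\n\t Slovenija"))  # prints: Ljubljana, Slovenija
```

Since nothing is swapped in and out, this version does not depend on which letters appear in the text.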