Blog of Sara Jakša

Filtering CSV with Python

One word came up a couple of times while I was researching word frequencies: hapaxes. A hapax (short for hapax legomenon) is a word that appears only once in a text. Apparently they are useless for building a model.
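To make the definition concrete, here is a minimal sketch of how hapaxes could be found in a piece of text. The function name `find_hapaxes` and the sample sentence are made up for illustration, and splitting on whitespace is a simplifying assumption (real text would need proper tokenization).

```python
from collections import Counter


def find_hapaxes(text):
    """Return the words that appear exactly once in `text`, sorted alphabetically."""
    # Assumption: words are separated by whitespace and need no further cleaning.
    counts = Counter(text.split())
    return sorted(word for word, n in counts.items() if n == 1)


sample = "the cat sat on the mat while the dog sat outside"
print(find_hapaxes(sample))
```

In this sample, "the" and "sat" appear more than once, so they are not hapaxes; every other word is.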

So I wrote a small script that removes them. Well, it also removes the words that appear only twice, but considering the amount of data I had, I did not think this would affect the model much.

Here is the code:

    filenames = ["ENFJ-2",

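    for filename in filenames:

        # Read all lines first, because the same file is overwritten below.
        with open(filename, "r") as infile:
            lines = infile.readlines()

        with open(filename, "w") as outfile:
            for line in lines:
                # Each line is expected to be "word<TAB>count".
                word, number = line.split("\t")
                number = int(number)
                # Keep only the words that appear more than twice.
                if number > 2:
                    outfile.write(line)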
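The same idea can also be wrapped in a reusable function. This is only a sketch, not the script I actually ran: the name `filter_rare_words` and the `min_count` parameter are hypothetical, and it assumes every non-empty line has the form `word<TAB>count` and that the file is UTF-8 encoded.

```python
from pathlib import Path


def filter_rare_words(path, min_count=3):
    """Rewrite the file at `path`, keeping only lines whose count is >= min_count.

    Returns the number of lines that were kept.
    """
    source = Path(path)
    kept = []
    # Assumption: the file is UTF-8 and each non-empty line is "word<TAB>count".
    for line in source.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines instead of failing on them
        word, number = line.split("\t")
        if int(number) >= min_count:
            kept.append(f"{word}\t{number}")
    # Overwrite the original file with the filtered lines.
    source.write_text("".join(entry + "\n" for entry in kept), encoding="utf-8")
    return len(kept)
```

With the default `min_count=3` this drops the words that appear once or twice, the same threshold as `number > 2` above. Since the file is overwritten in place, it is probably a good idea to try it on a copy of the data first.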