Blog of Sara Jakša

Filtering CSV with Python

There was a word that came up a couple of times when I was researching word frequencies. The word is hapaxes (short for hapax legomena), which basically means words that appear only once in a text. Apparently they are useless for building a model.
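To make the definition concrete, here is a minimal sketch of how hapaxes could be found in a piece of text. The function name `find_hapaxes`, the sample sentence, and the naive whitespace tokenization are all assumptions for illustration, not part of my actual pipeline.

```python
from collections import Counter


def find_hapaxes(text):
    """Return the sorted list of words that occur exactly once in text."""
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    # Real text would also need punctuation handling.
    counts = Counter(text.lower().split())
    return sorted(word for word, count in counts.items() if count == 1)


# Hypothetical sample sentence, only for demonstration.
sample = "the cat saw the dog and the dog saw a bird"
hapaxes = find_hapaxes(sample)
print(hapaxes)
```

In this sample, "the", "saw" and "dog" repeat, so only the remaining words count as hapaxes.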

So I wrote a small script that removes them. Well, it also removes the words that appear only twice, but considering the amount of data that I had, I did not think this would impact the model much.

Here is the code:

    filenames = ["ENFJ-2",
                 "ENTJ-2",
                 "ESFJ-2",
                 "ESTJ-2",
                 "INFP-2",
                 "INTP-2",
                 "ISFP-2",
                 "ISTP-2",
                 "ENFP-2",
                 "ENTP-2",
                 "ESFP-2",
                 "ESTP-2",
                 "INFJ-2",
                 "INTJ-2",
                 "ISFJ-2",
                 "ISTJ-2",
                 ]

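    for filename in filenames:

        # Read the whole file into memory first, because the same file
        # is overwritten below.
        with open(filename, "r") as infile:
            lines = infile.readlines()

        # Write back only the words that appear more than twice.
        # Each line is expected to look like "word<TAB>count".
        with open(filename, "w") as outfile:
            for line in lines:
                word, number = line.rsplit("\t", 1)
                if int(number) > 2:
                    outfile.write(line)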
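
Since the script above overwrites the original files, a more cautious variant could write the filtered lines to a separate file and make the cutoff adjustable. The following is only a sketch under assumptions: the names `filter_frequent` and `filter_file`, the `min_count` parameter, and the sample file names are hypothetical, and the input is assumed to be UTF-8 text with one "word, tab, count" pair per line.

```python
import os
import tempfile


def filter_frequent(lines, min_count=3):
    """Keep only "word<TAB>count" lines whose count is at least min_count."""
    kept = []
    for line in lines:
        stripped = line.rstrip("\n")
        if not stripped:
            continue  # skip blank lines instead of failing on them
        # Split on the last tab, so the count is always the final field.
        word, number = stripped.rsplit("\t", 1)
        if int(number) >= min_count:
            kept.append(stripped + "\n")
    return kept


def filter_file(source, destination, min_count=3):
    """Filter source into destination and return the number of kept lines."""
    with open(source, "r", encoding="utf-8") as infile:
        kept = filter_frequent(infile, min_count)
    with open(destination, "w", encoding="utf-8") as outfile:
        outfile.writelines(kept)
    return len(kept)


# Demonstration on a temporary file with made-up counts.
with tempfile.TemporaryDirectory() as folder:
    source = os.path.join(folder, "sample-counts")
    destination = os.path.join(folder, "sample-counts-filtered")
    with open(source, "w", encoding="utf-8") as f:
        f.write("the\t120\nhapax\t1\ntwice\t2\nthrice\t3\n")
    kept_count = filter_file(source, destination)
    with open(destination, encoding="utf-8") as f:
        result = f.read()

print(result)
```

With the default `min_count=3` this keeps the same words as the script above (those with a count greater than 2), but the source file stays untouched, so the filtering can be repeated with a different cutoff.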