Blog of Sara Jakša

Finding Word Frequency with Python

Well, since it took too much time with SQL, I figured that maybe I would try it with Python only. I was quite intent on the word frequency tables, and I wanted to make them work.

This is the code that I used:

    import nltk
    from collections import defaultdict
    import string

    filenames = ["ENFJ",
                 "ENFP",
                 "ENTJ",
                 "ENTP",
                 "ESFJ",
                 "ESFP",
                 "ESTJ",
                 "ESTP",
                 "INFJ",
                 "INFP",
                 "INTJ",
                 "INTP",
                 "ISFJ",
                 "ISFP",
                 "ISTJ",
                 "ISTP"]

    stopwords = nltk.corpus.stopwords.words('english')
    stemmer = nltk.SnowballStemmer("english")

    def makeTextLower(wordlist):
        return [word.lower() for word in wordlist]

    def excludeWords(excludedlist, wordlist):
        return [word for word in wordlist if word not in excludedlist]

    def wordStemmer(wordlist):
        return [stemmer.stem(word) for word in wordlist]

    for filename in filenames:
        print(filename)
        allwords = defaultdict(int)
        # read the whole file and split it into posts on the
        # triple-newline separator
        with open(filename, "r") as infile:
            posts = infile.read().split("\n\n\n")
        for line in posts:
            # strip punctuation, tokenize, lowercase, drop stopwords, stem
            line = line.translate(str.maketrans('', '', string.punctuation))
            line = nltk.tokenize.word_tokenize(line)
            line = makeTextLower(line)
            line = excludeWords(stopwords, line)
            line = wordStemmer(line)
            # count every remaining stem
            for word in line:
                allwords[word] += 1
        # write one tab-separated word-frequency pair per line
        with open(filename + "-2", "w") as write:
            for word, freq in allwords.items():
                write.write(word + "\t" + str(freq) + "\n")
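
To get a feel for what these functions actually do, here is roughly what happens to a made-up sentence when I run it through the same steps (in the same session as the code above; the sentence is just an illustration, not from the actual data):

    sample = "Thinking about feelings is not the same as feeling about thoughts!"
    # same pipeline as above: strip punctuation, tokenize, lowercase,
    # drop stopwords, stem
    tokens = nltk.tokenize.word_tokenize(
        sample.translate(str.maketrans('', '', string.punctuation)))
    tokens = makeTextLower(tokens)
    tokens = excludeWords(stopwords, tokens)
    print(wordStemmer(tokens))
    # prints something like: ['think', 'feel', 'feel', 'thought']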

This one had one big advantage over the SQL method: I did not even need to let it run over night. It was a hell of a lot quicker. And the data was relatively clean, certainly usable.
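
Since the output files are just tab-separated word and frequency pairs, one per line, having a quick look at them is easy. Something like this should print the twenty most common stems for one type (I am using the INTP file here purely as an example):

    # load one of the generated "-2" files back into a dictionary
    freqs = {}
    with open("INTP-2", "r") as f:
        for line in f:
            word, freq = line.strip().split("\t")
            freqs[word] = int(freq)

    # show the twenty most frequent stems
    for word, freq in sorted(freqs.items(), key=lambda x: x[1], reverse=True)[:20]:
        print(word, freq)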