Blog of Sara Jakša

Cleaning Text Data with Python

I had the word frequencies in a CSV table, but I had a lot of trouble getting them into a form that I could use to build informative models. So I ended up learning how to prepare the data in a way that I might actually be able to use in the end.

While doing this, I learned a bit about cleaning and preparing data for later analysis.

Here is the code that I used:

    import nltk
    import string

    filenames = ["ENFJ",
                 "ENFP",
                 "ENTJ",
                 "ENTP",
                 "ESFJ",
                 "ESFP",
                 "ESTJ",
                 "ESTP",
                 "INFJ",
                 "INFP",
                 "INTJ",
                 "INTP",
                 "ISFJ",
                 "ISFP",
                 "ISTJ",
                 "ISTP"]

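    contentdict = dict()
    # The stop word list needs the NLTK 'stopwords' corpus: nltk.download('stopwords').
    # A set makes the "word not in excludedlist" check fast; a list works but is slow.
    stopwords = set(nltk.corpus.stopwords.words('english'))
    stemmer = nltk.SnowballStemmer("english")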

    def makeTextLower(wordlist):
        return [word.lower().strip() for word in wordlist]

    def excludeWords(excludedlist, wordlist):
        return [word for word in wordlist if word not in excludedlist]

    def wordStemmer(wordlist):
        return [stemmer.stem(word) for word in wordlist]

    def removeEmpty(wordlist):
        return [word for word in wordlist if word]

    for filename in filenames:
        contentdict[filename] = list()

    for filename in filenames:
        print(filename)
        # Read the whole file into one string
        with open(filename, "r") as read:
            content = read.read()
        # Strip punctuation, then split the text into tokens
        content = content.translate(str.maketrans('','',string.punctuation))
        content = nltk.tokenize.word_tokenize(content)
        content = makeTextLower(content)
        content = removeEmpty(content)
        content = excludeWords(stopwords, content)
        content = wordStemmer(content)
        content = nltk.FreqDist(content)
        # Drop words that occur only once (hapaxes). Iterating over a FreqDist
        # yields each distinct word once, so the result is a list without counts.
        content = excludeWords(set(content.hapaxes()), content)
        contentdict[filename].append(content)
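
One detail of the last filtering step is easy to miss: iterating over a frequency distribution yields each distinct word once, so filtering it with a list comprehension returns a plain list of words and the counts are lost. If the counts are still needed later, a dictionary comprehension can keep them. The sketch below is only an illustration under some assumptions: it uses `collections.Counter` from the standard library as a stand-in for `nltk.FreqDist` (which is a subclass of it), and the function name `drop_hapaxes` and the sample words are made up.

```python
from collections import Counter


def drop_hapaxes(counts):
    """Return a new Counter without the words that occur exactly once."""
    return Counter({word: n for word, n in counts.items() if n > 1})


# Hypothetical sample of already stemmed words.
words = ["think", "feel", "think", "plan", "feel", "think"]
counts = Counter(words)
filtered = drop_hapaxes(counts)

print(filtered)  # Counter({'think': 3, 'feel': 2})
```

Whether to keep the counts or only the vocabulary depends on what the later model needs, so this is an alternative rather than a correction.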

I stored the results in a dictionary, because I had ideas about what else I could still add. But when I tried to add another thing, I ran out of RAM, so I had to give up on this idea. It would have been good, but I guess I will either wait until I get more RAM or figure out how to do it without holding the whole thing in RAM.
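
One possible way to avoid holding a whole file in RAM is to read it line by line and update a running count, so that only one line and the counts themselves are in memory at any time. The following is a minimal sketch of that idea, not a tested replacement for the code above: the function name `count_words` is hypothetical, and it splits on whitespace instead of using the NLTK tokenizer, so the tokens may differ slightly. Stop word removal and stemming could be applied to each line in the same way.

```python
import string
from collections import Counter

# Translation table that deletes ASCII punctuation, as in the code above.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def count_words(path):
    """Count the words in one file while reading it one line at a time."""
    counts = Counter()
    with open(path, "r") as handle:
        for line in handle:  # a file object yields its lines lazily
            cleaned = line.translate(PUNCTUATION_TABLE).lower()
            # Simplified tokenizer (assumption): split on whitespace.
            counts.update(cleaned.split())
    return counts
```

Calling `count_words("ENFJ")` would then return the counts for that one file. Memory use grows with the number of distinct words rather than with the length of the text, though I cannot say whether that would be enough for the extra data mentioned above.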