I did have the word frequencies in the CSV table, but I had a lot of trouble getting them into a form I could use to build informative models. So I ended up learning how to do it in a way that might actually give me something usable in the end.
While I was doing this, I learned a bit about cleaning and preparing data for later analysis.
Here is the code that I have used:
import nltk
import string

filenames = ["ENFJ", "ENFP", "ENTJ", "ENTP", "ESFJ", "ESFP", "ESTJ", "ESTP",
             "INFJ", "INFP", "INTJ", "INTP", "ISFJ", "ISFP", "ISTJ", "ISTP"]
contentdict = dict()
# a set makes the "word not in stopwords" checks much faster than a list
stopwords = set(nltk.corpus.stopwords.words('english'))
stemmer = nltk.SnowballStemmer("english")

def makeTextLower(wordlist):
    return [word.lower().strip() for word in wordlist]

def excludeWords(excludedlist, wordlist):
    return [word for word in wordlist if word not in excludedlist]

def wordStemmer(wordlist):
    return [stemmer.stem(word) for word in wordlist]

def removeEmpty(wordlist):
    return [word for word in wordlist if word]

for filename in filenames:
    contentdict[filename] = list()

for filename in filenames:
    print(filename)
    with open(filename, "r") as read:
        content = read.read()
    content = content.translate(str.maketrans('', '', string.punctuation))
    content = nltk.tokenize.word_tokenize(content)
    content = makeTextLower(content)
    content = removeEmpty(content)
    content = excludeWords(stopwords, content)
    content = wordStemmer(content)
    content = nltk.FreqDist(content)
    # drop hapaxes (words occurring only once) while keeping the counts;
    # filtering the FreqDist through excludeWords would return a plain
    # word list and lose the frequencies
    for hapax in content.hapaxes():
        del content[hapax]
    contentdict[filename].append(content)
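The last cleaning step, dropping the hapaxes, is the one that tripped me up the most, so here is a small illustration of what it does. NLTK's FreqDist behaves like Python's built-in collections.Counter, so the step can be sketched without NLTK at all (the word list here is just made-up example data):

```python
from collections import Counter

words = ["cat", "dog", "cat", "bird", "dog", "cat"]
freq = Counter(words)

# hapaxes are words that occur exactly once in the corpus
hapaxes = [w for w, c in freq.items() if c == 1]

# delete them from the counter, keeping the counts of everything else
for w in hapaxes:
    del freq[w]

print(dict(freq))  # {'cat': 3, 'dog': 2} - 'bird' was a hapax and is gone
```

The point of the step is that words appearing only once carry almost no statistical signal for a model, so removing them shrinks the data without losing much.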
I started with a dictionary, because I had ideas about what else I could still add. But when I tried to add another thing, I ended up not having enough RAM to do it, so I had to give up on this idea. It would have been good, but I guess I will either wait until I get more RAM or figure out how to do it without holding the whole thing in RAM.
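One way to avoid holding everything in RAM could be to count words one line at a time, so that only the running counts stay in memory instead of the whole file. This is just a rough sketch of the idea, not code I have actually run on the data; the function name and the tiny in-memory "file" are only for illustration:

```python
from collections import Counter
import io
import string

def count_words_streaming(lines, stopwords=frozenset()):
    """Build word frequencies incrementally: only the Counter lives
    in memory, never the whole file at once."""
    freq = Counter()
    table = str.maketrans('', '', string.punctuation)
    for line in lines:
        words = line.translate(table).lower().split()
        freq.update(w for w in words if w and w not in stopwords)
    return freq

# stand-in for open("ENFJ") - any iterable of lines works the same way
sample = io.StringIO("The cat sat.\nThe cat ran!\n")
print(count_words_streaming(sample, stopwords={"the"}))
```

Since a file object is itself an iterable of lines, the same function would work on the real type files by passing the open file handle directly, and the stemming and hapax-removal steps could then be applied to the finished Counter.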