Well, since the SQL took too much time, I figured I would try it with Python only. I was quite intent on the word frequency tables, and I wanted to make them work.
This is the code that I used:
import nltk
from collections import defaultdict
import string

filenames = ["ENFJ", "ENFP", "ENTJ", "ENTP", "ESFJ", "ESFP", "ESTJ", "ESTP",
             "INFJ", "INFP", "INTJ", "INTP", "ISFJ", "ISFP", "ISTJ", "ISTP"]
stopwords = nltk.corpus.stopwords.words('english')
stemmer = nltk.SnowballStemmer("english")

def makeTextLower(wordlist):
    return [word.lower() for word in wordlist]

def excludeWords(excludedlist, wordlist):
    return [word for word in wordlist if word not in excludedlist]

def wordStemmer(wordlist):
    return [stemmer.stem(word) for word in wordlist]

for filename in filenames:
    print(filename)
    allwords = defaultdict(int)
    with open(filename, "r") as infile:
        # individual posts are separated by three newlines in the source files
        posts = infile.read().split("\n\n\n")
    for post in posts:
        # strip punctuation, tokenize, lowercase, drop stopwords, stem
        post = post.translate(str.maketrans('', '', string.punctuation))
        words = nltk.tokenize.word_tokenize(post)
        words = makeTextLower(words)
        words = excludeWords(stopwords, words)
        words = wordStemmer(words)
        for word in words:
            allwords[word] += 1
    # write the frequency table as tab-separated word/count pairs
    with open(filename + "-2", "w") as write:
        for word, freq in allwords.items():
            write.write(word + "\t" + str(freq) + "\n")
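If you want to sanity-check one of the resulting files, a minimal sketch like this reads a frequency table back in and prints the most common stems. (I am using "INTJ-2" as an example file name here; the tab-separated format is just the one the loop above writes out.)

import operator

# load one of the "-2" output files back into a dict
# ("INTJ-2" is only an example file name)
freqs = {}
with open("INTJ-2", "r") as f:
    for line in f:
        word, count = line.strip().split("\t")
        freqs[word] = int(count)

# print the 20 most frequent stems
for word, count in sorted(freqs.items(), key=operator.itemgetter(1), reverse=True)[:20]:
    print(word, count)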
This had one big advantage over the SQL method: I did not even need to let it run overnight. It was a hell of a lot quicker. And the data was relatively clean, certainly usable.