Blog of Sara Jakša

Creating Word Frequency Tables with SQLite and Python

Now that I had my text data in the SQLite file, I had to figure out what to do with it. One thing that was constantly repeating itself through the different tutorials and books was the word frequency table or the connected concepts bag of words or Tfi-df tables.

So I decided to try and calculate the frequencies myself. I mean, I had the data in the SQL file, and most of the examples did not. Or at least not in SQL file organized in my way.

This is the code that I used:

    import sqlite3
    import nltk
    import string
    from nltk.stem.wordnet import WordNetLemmatizer

    database = "mbti-posts.db"

    conn = sqlite3.connect(database)
    c = conn.cursor()

    #if you already have the table, then comment this line
    c.execute("CREATE TABLE frequency(id INTEGER PRIMARY KEY, type CHAR(5), word CHAR(50), freq INT)")

    stopwords = nltk.corpus.stopwords.words('english')
    stemmer = nltk.SnowballStemmer("english")
    lemmatizer = WordNetLemmatizer()

    def makeTextLower(wordlist):
        return [word.lower() for word in wordlist]

    def excludeWords(excludedlist, wordlist):
        return [word for word in wordlist if word not in excludedlist]

    def wordStemmer(wordlist):
        return [stemmer.stem(word) for word in wordlist]

    c.execute('''SELECT id, text, type FROM personalitycafe''')
    sqldata = c.fetchall()

    for idname, text, mbtitype in sqldata:
        #removes the punctiation
        text = text.translate(str.maketrans('','',string.punctuation))
        row = nltk.tokenize.word_tokenize(text)
        row = makeTextLower(row)
        row = wordStemmer(row)
        #excludes common words
        row = excludeWords(stopwords, row)
        for word in row:
            print(word + ", " + mbtitype)
            word = word.replace("'", "")
            word = word.replace('"', '')
            word = "'" + word + "'"
            searchresult = c.execute('SELECT id, type, word, freq FROM frequency WHERE type = "' + mbtitype + '" AND word = ' + word)
            searchresult = searchresult.fetchall()
            if not searchresult:
                c.execute("INSERT INTO frequency(type, word, freq) VALUES ('" + mbtitype + "', " + word + ", 1)")
                freq = searchresult[0][3] + 1
                id = searchresult[0][0]
                c.execute("UPDATE frequency SET freq=" + str(freq) + " WHERE id = " + str(id))


I left the code to use through the night, but by the time I woke up, it was only on the second type out of the 16. So I decided to not use this way and tried to find another one.