
How to Make a Nice-Looking Cake

Nutella Cake by my Sister

I baked a Nutella cake with my sister, but it ended up not looking that nice.

After that, my sister made a cake by herself. Unlike with the previous cake, she said that she let the cake cool inside the mold, and that she did not use baking paper but oil and flour to make sure that the cake would not stick.

It looks much better than my version, so that is advice worth following next time.

Nutella Cake Recipe

Nutella Cake

It is birthday season. I mean, I would like to say that it is the birthday season, but really it is just that my sister's friends all have birthdays.

So she decided that she wanted to try to make them a cake. Since it was for her group, and she and most likely a lot of her friends like Nutella, it had to be a Nutella cake.

For me, it was time spent with my sister doing something I found fun. And it was fun, even when she complained about having to wait 15 minutes with nothing to do. My mother suggesting that she could talk to her in the meantime was not helping.

It does not look that nice, as you can see. But it was tasty.

Ingredients:

  • eggs
  • sugar
  • vanilla sugar
  • flour
  • baking powder
  • Nutella
  • whipping cream
  • chocolate

Recipe:

  • Separate the egg yolks from the whites
  • Whip the egg whites with the sugar
  • Mix the yolks, sugar, vanilla sugar and Nutella
  • Fold the two mixtures together
  • Bake for 25 minutes at 180°C
  • Whip the whipping cream
  • Mix the whipping cream with Nutella
  • Melt the chocolate
  • Cut the baked cake into three layers
  • Put the whipping cream mixture in between the layers
  • Pour the melted chocolate on top

Suggestions:

  • When separating the yolks from the whites, use the shell of the egg to help you
  • It is all right to have some white in the yolks. It is not fine to have any yolk in the whites at all
  • When mixing in the whipped whites, make sure to be gentle. The air should stay inside in order to make the cake fluffier when baked. The same can be said for the whipping cream.
  • If you add some milk or whipping cream to the chocolate when melting it, it is less likely to get stuck to the pan. Plus, it is less work than doing it in a bain-marie
  • My sister and I also added banana to the whipping cream mixture. I don't think it helped the taste.
  • If you don't like dry cakes, then put some rum on the baked dough

Cleaning Text Data with Python

I did have the word frequencies in a CSV table, but I had a lot of trouble getting them into a form I could use to build informative models. So I ended up learning how to clean the data in a way that might actually leave me with something usable in the end.

While I was doing this, I learned a bit about cleaning and preparing data to be used in the analysis later.

Here is the code that I used:

    import nltk
    import string

    filenames = ["ENFJ",
                 "ENFP",
                 "ENTJ",
                 "ENTP",
                 "ESFJ",
                 "ESFP",
                 "ESTJ",
                 "ESTP",
                 "INFJ",
                 "INFP",
                 "INTJ",
                 "INTP",
                 "ISFJ",
                 "ISFP",
                 "ISTJ",
                 "ISTP"]

    contentdict = dict()
    # needs the nltk "stopwords" and "punkt" data (nltk.download("stopwords") etc.)
    stopwords = set(nltk.corpus.stopwords.words('english'))  # set for faster lookups
    stemmer = nltk.SnowballStemmer("english")

    def makeTextLower(wordlist):
        return [word.lower().strip() for word in wordlist]

    def excludeWords(excludedlist, wordlist):
        return [word for word in wordlist if word not in excludedlist]

    def wordStemmer(wordlist):
        return [stemmer.stem(word) for word in wordlist]

    def removeEmpty(wordlist):
        return [word for word in wordlist if word]

    for filename in filenames:
        contentdict[filename] = list()

    for filename in filenames:
        print(filename)
        with open(filename, "r") as read:
            content = read.readlines()
        content = "".join(content)
        # strip punctuation, tokenize, lowercase, drop empty strings and stopwords
        content = content.translate(str.maketrans('','',string.punctuation))
        content = nltk.tokenize.word_tokenize(content)
        content = makeTextLower(content)
        content = removeEmpty(content)
        content = excludeWords(stopwords, content)
        content = wordStemmer(content)
        # count the stems, then keep only the distinct words that appear more
        # than once (iterating a FreqDist yields the words, not the counts)
        content = nltk.FreqDist(content)
        content = excludeWords(content.hapaxes(), content)
        contentdict[filename].append(content)

I stored it all in a dictionary, because I had ideas about what else I could still add. But when I tried to add another thing, I ran out of RAM, so I had to give up on that idea. It would have been good, but I guess I will either wait until I get more RAM or figure out how to do it without holding the whole thing in memory.
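
If I come back to this, the direction I would try is to process one file at a time with a generator and write each result straight to disk, so that only one type's counts sit in RAM at any moment. A rough sketch (the "-clean" output name is just a placeholder):

    import nltk
    import string

    stopwords = set(nltk.corpus.stopwords.words('english'))
    stemmer = nltk.SnowballStemmer("english")

    def cleanwords(filename):
        # yield cleaned, stemmed words one input line at a time,
        # instead of loading the whole file into memory
        with open(filename, "r") as infile:
            for line in infile:
                line = line.translate(str.maketrans('', '', string.punctuation))
                for word in nltk.tokenize.word_tokenize(line):
                    word = word.lower().strip()
                    if word and word not in stopwords:
                        yield stemmer.stem(word)

    for filename in ["ENFJ", "ENFP"]:  # ... and the other 14 types
        # only one type's frequency counts live in RAM at once
        freqs = nltk.FreqDist(cleanwords(filename))
        with open(filename + "-clean", "w") as out:  # placeholder output name
            for word, freq in freqs.items():
                if freq > 1:  # drop the hapaxes, like above
                    out.write(word + "\t" + str(freq) + "\n")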

Creating a CSV Word Frequency Table with Python

Now I have finally come to the step where I could create the word frequency table. So I tried various ways, again, but I ended up using Python.

Here is the code:

    from collections import defaultdict

    filenames = ["ENFJ-2.csv",
                 "ESFJ-2.csv",
                 "INFJ-2.csv",
                 "ISFJ-2.csv",
                 "ENFP-2.csv",
                 "ESFP-2.csv",
                 "INFP-2.csv",
                 "ISFP-2.csv",
                 "ENTJ-2.csv",
                 "ESTJ-2.csv",
                 "INTJ-2.csv",
                 "ISTJ-2.csv",
                 "ENTP-2.csv",
                 "ESTP-2.csv",
                 "INTP-2.csv",
                 "ISTP-2.csv",
                 ]

    outfile = "table-freq.csv"

    # word -> {type -> summed frequency}
    allwords = defaultdict(defaultdict)

    for filename in filenames:
        with open(filename, "r") as infile:
            lines = infile.readlines()

        # skip the first line, then sum the frequencies per word and type
        # (the first four characters of the filename are the type name)
        for line in lines[1:]:
            word, freq = line.split("\t")
            word = word.strip()
            freq = int(freq.strip())
            mbtitype = filename[:4]
            if mbtitype not in allwords[word]:
                allwords[word][mbtitype] = 0
            allwords[word][mbtitype] += freq

    # row 0 holds the words, rows 1-16 the frequencies for one type each
    allwordslist = [[] for _ in range(17)]

    for word, typedict in allwords.items():
        allwordslist[0].append(word)
        for i, typename in [(1, "ESFJ"), (2, "INFJ"), (3, "ENFJ"), (4, "ISFJ"), (5, "ENFP"), (6, "ESFP"), (7, "INFP"), (8, "ISFP"), (9, "ENTJ"), (10, "ESTJ"), (11, "INTJ"), (12, "ISTJ"), (13, "ENTP"), (14, "ESTP"), (15, "INTP"), (16, "ISTP")]:
            if typename not in typedict:
                typedict[typename] = 0
            allwordslist[i].append(typedict[typename])

    # the "CSV" is actually tab-separated and transposed: one row per type
    with open(outfile, "w") as write:
        for line in allwordslist:
            write.write("\t".join([str(element) for element in line]) + "\n")

At this point I hardcoded some of the variables, because I was starting to feel that I had spent too much time on this path. And I was right, as I did not end up using it.

But if anybody is interested, there is a link to the word frequency file here. You can find the order of the types hidden in the code.
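
If anyone wants to load the table back in, something like this should work. Just remember that the file is tab-separated and transposed: the words are in the first row, and each later row holds the frequencies for one type.

    # read the transposed, tab-separated table back in
    with open("table-freq.csv", "r") as infile:
        rows = [line.rstrip("\n").split("\t") for line in infile]

    words = rows[0]      # the first row holds the words
    typerows = rows[1:]  # one row of frequencies per type, in the hidden order
    print(words[0], typerows[0][0])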

Filtering a CSV with Python

There was a word that came up a couple of times when I was researching word frequencies: hapax, which basically means a word that appears only once in the text. Apparently hapaxes are useless for building a model.
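
nltk can actually list them directly. For instance:

    import nltk

    words = ["the", "cake", "was", "tasty", "the", "cake"]
    freqs = nltk.FreqDist(words)
    # hapaxes() returns the words that appear exactly once
    print(freqs.hapaxes())  # ['was', 'tasty']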

So I created a small script that removes them. Well, it also removes the words that appeared only twice, but considering the amount of data I had, I did not think this would impact the model much.

Here is the code:

    filenames = ["ENFJ-2",
                 "ENTJ-2",
                 "ESFJ-2",
                 "ESTJ-2",
                 "INFP-2",
                 "INTP-2",
                 "ISFP-2",
                 "ISTP-2",
                 "ENFP-2",
                 "ENTP-2",
                 "ESFP-2",
                 "ESTP-2",
                 "INFJ-2",
                 "INTJ-2",
                 "ISFJ-2",
                 "ISTJ-2",
                 ]

    for filename in filenames:

        with open(filename, "r") as infile:
            lines = infile.readlines()

        # rewrite the file in place, keeping only the words that appear
        # more than twice (this drops the hapaxes and the doubles)
        with open(filename, "w") as write:
            for line in lines:
                word, number = line.split("\t")
                number = int(number)
                if number > 2:
                    write.write(line)

Finding Word Frequencies with Python

Well, since it took too much time with SQL, I figured that maybe I should try it with Python only. I was quite intent on the word frequency tables, and I wanted to make them work.

This is the code that I used:

    import nltk
    from collections import defaultdict
    import string

    filenames = ["ENFJ",
                 "ENFP",
                 "ENTJ",
                 "ENTP",
                 "ESFJ",
                 "ESFP",
                 "ESTJ",
                 "ESTP",
                 "INFJ",
                 "INFP",
                 "INTJ",
                 "INTP",
                 "ISFJ",
                 "ISFP",
                 "ISTJ",
                 "ISTP"]

    stopwords = nltk.corpus.stopwords.words('english')
    stemmer = nltk.SnowballStemmer("english")

    def makeTextLower(wordlist):
        return [word.lower() for word in wordlist]

    def excludeWords(excludedlist, wordlist):
        return [word for word in wordlist if word not in excludedlist]

    def wordStemmer(wordlist):
        return [stemmer.stem(word) for word in wordlist]

    for filename in filenames:
        print(filename)
        allwords = defaultdict(int)
        with open(filename, "r") as infile:
            content = infile.readlines()
        content = "".join(content)
        # the posts are separated by two empty lines in the input files
        content = content.split("\n\n\n")
        for line in content:
            line = line.translate(str.maketrans('','',string.punctuation))
            line = nltk.tokenize.word_tokenize(line)
            line = makeTextLower(line)
            line = excludeWords(stopwords, line)
            line = wordStemmer(line)
            for word in line:
                allwords[word] += 1
        # write one "word<TAB>frequency" line per word
        with open(filename + "-2", "w") as write:
            for word, freq in allwords.items():
                write.write(word + "\t" + str(freq) + "\n")

This one had one big advantage over the SQL method: I did not even need to let it run overnight. It was a hell of a lot quicker. And the data was relatively clean, certainly usable.

Creating Word Frequency Tables with SQLite and Python

Now that I had my text data in the SQLite file, I had to figure out what to do with it. One thing that kept repeating itself through the different tutorials and books was the word frequency table, or the connected concepts of bag of words and tf-idf tables.

So I decided to try to calculate the frequencies myself. I mean, I had the data in an SQL file, and most of the examples did not. Or at least not in an SQL file organized my way.

This is the code that I used:

    import sqlite3
    import nltk
    import string
    from nltk.stem.wordnet import WordNetLemmatizer

    database = "mbti-posts.db"

    conn = sqlite3.connect(database)
    c = conn.cursor()

    #if you already have the table, then comment this line
    c.execute("CREATE TABLE frequency(id INTEGER PRIMARY KEY, type CHAR(5), word CHAR(50), freq INT)")

    stopwords = nltk.corpus.stopwords.words('english')
    stemmer = nltk.SnowballStemmer("english")
    lemmatizer = WordNetLemmatizer()  # imported, but never actually used below

    def makeTextLower(wordlist):
        return [word.lower() for word in wordlist]

    def excludeWords(excludedlist, wordlist):
        return [word for word in wordlist if word not in excludedlist]

    def wordStemmer(wordlist):
        return [stemmer.stem(word) for word in wordlist]

    c.execute('''SELECT id, text, type FROM personalitycafe''')
    sqldata = c.fetchall()

    for idname, text, mbtitype in sqldata:
        # removes the punctuation
        text = text.translate(str.maketrans('','',string.punctuation))
        row = nltk.tokenize.word_tokenize(text)
        row = makeTextLower(row)
        row = wordStemmer(row)
        # excludes common words
        row = excludeWords(stopwords, row)
        for word in row:
            print(word + ", " + mbtitype)
            # strip quotes so the word can be pasted into the SQL string
            word = word.replace("'", "")
            word = word.replace('"', '')
            word = "'" + word + "'"
            # one SELECT (and then an INSERT or UPDATE) per word; this
            # is what made the whole thing so slow
            searchresult = c.execute('SELECT id, type, word, freq FROM frequency WHERE type = "' + mbtitype + '" AND word = ' + word)
            searchresult = searchresult.fetchall()
            if not searchresult:
                c.execute("INSERT INTO frequency(type, word, freq) VALUES ('" + mbtitype + "', " + word + ", 1)")
            else:
                freq = searchresult[0][3] + 1
                id = searchresult[0][0]
                c.execute("UPDATE frequency SET freq=" + str(freq) + " WHERE id = " + str(id))

    conn.commit()
    conn.close()

I left the code running through the night, but by the time I woke up, it was only on the second type out of the 16. So I decided not to use this approach and tried to find another one.
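
Looking back, most of the time probably went into running a separate SELECT (and a print) for every single word. Here is a rough sketch of the same idea that counts everything in Python first and only then writes to the database, one bulk insert per type with parameterized queries. It assumes the frequency table from above already exists:

    import sqlite3
    import string
    import nltk
    from collections import defaultdict, Counter

    conn = sqlite3.connect("mbti-posts.db")
    c = conn.cursor()

    stopwords = set(nltk.corpus.stopwords.words('english'))
    stemmer = nltk.SnowballStemmer("english")

    # count everything in memory first, one Counter per type
    counts = defaultdict(Counter)
    for idname, text, mbtitype in c.execute("SELECT id, text, type FROM personalitycafe"):
        text = text.translate(str.maketrans('', '', string.punctuation))
        for word in nltk.tokenize.word_tokenize(text):
            word = stemmer.stem(word.lower())
            if word not in stopwords:
                counts[mbtitype][word] += 1

    # one bulk insert per type; the ? placeholders also remove the need
    # to clean the quotes out of the words by hand
    for mbtitype, counter in counts.items():
        c.executemany("INSERT INTO frequency(type, word, freq) VALUES (?, ?, ?)",
                      [(mbtitype, word, freq) for word, freq in counter.items()])

    conn.commit()
    conn.close()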

How to Put Text Data into SQLite with Python

After I had the data that I wanted, I started to do some simple analysis.

Considering the size of the data, I figured that R was most likely not the right way to go, since R was the only place before this project where I had run into memory problems.

So the first idea that I ended up using was Orange. It has an add-on, Textable, which has some functions for analysing text, and there were some text analysis tutorials on the site. But I was quickly running into serious memory problems.

Then I tried following along with some of the tutorials that I found on the internet, but I usually ran into memory problems pretty quickly in the analysis. Yes, I am aware that my computer is not the best one there is.

Then I borrowed a book that had a chapter on text analysis as well. There was an example of classifying Reddit posts, and they were saving them to SQLite. I know SQL, since I actually had to sit through it in school.

So I decided to get all the data into an SQL file. The Python code that I used for this is below.

    import sqlite3

    filenames = ["ENFJ",
                 "ENFP",
                 "ENTJ",
                 "ENTP",
                 "ESFJ",
                 "ESFP",
                 "ESTJ",
                 "ESTP",
                 "INFJ",
                 "INFP",
                 "INTJ",
                 "INTP",
                 "ISFJ",
                 "ISFP",
                 "ISTJ",
                 "ISTP"]

    database = "mbti-posts.db"

    conn = sqlite3.connect(database)
    c = conn.cursor()

    c.execute("""CREATE TABLE personalitycafe(id INTEGER PRIMARY KEY, text TEXT, type CHAR(5))""")

    for filename in filenames:

        with open(filename, "r") as read:
            content = read.readlines()

        # the posts are separated by two empty lines
        content = "".join(content)
        content = content.split("\n\n\n")

        # quote the type name so it can be pasted into the SQL string
        filename = '"' + filename + '"'

        for element in content:
            element = element.strip()
            if not element:
                continue
            element = element.replace("\n", " ")
            # double quotes inside a post would break the hand-built SQL string
            element = element.replace('"', "'")
            element = '"' + element + '"'
            sqlstring = """INSERT INTO personalitycafe (text, type) VALUES (""" + element + """, """ + filename + """)"""
            c.execute(sqlstring)

        print(filename)
        conn.commit()

    conn.close()
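
One thing I would change now: sqlite3 can fill in the values itself through ? placeholders, which avoids building the INSERT string and juggling the quotes by hand. A sketch of the same loop in that style (no quote replacement needed, and the type name stays plain):

    import sqlite3

    conn = sqlite3.connect("mbti-posts.db")
    c = conn.cursor()

    for filename in ["ENFJ", "ENFP"]:  # ... and the other 14 types
        with open(filename, "r") as read:
            content = read.readlines()
        content = "".join(content).split("\n\n\n")

        for element in content:
            element = element.strip()
            if not element:
                continue
            element = element.replace("\n", " ")
            # the ? placeholders take care of any quotes inside the post
            c.execute("INSERT INTO personalitycafe (text, type) VALUES (?, ?)",
                      (element, filename))
        conn.commit()

    conn.close()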

How I Used Vim to Clean MBTI Data

Earlier, I had gotten the Personality Cafe posts in order to analyse them. But when I was checking the files over, I realized that there was still some sort of JavaScript code included in them.

Thankfully, by analysing it a little, I figured out that its placement is really convenient. It was always at the end of the posts, and it always started in the same way: "(function(w,d,s,i)"

I first tried to use some sort of Python script, but then I realized that I was most likely overcomplicating things. I mean, this would be simple if I could just use search and replace. But gedit, the program that I normally use, had problems dealing with 30 MB+ files.

Then I remembered that I had seen some examples of how people used Vim to clean big files. I figured it could not be that hard.

    :%s/(function(w,d,s,i)/\r(function(w,d,s,i)/g 
    :g/(function(w,d,s,i)/d

The first line finds the beginning of the JavaScript and puts it on a new line.

    :%s/what to find/what to replace with/g 

The thing to be careful about here is that the newline in Vim is \r, and not \n like in Python.

The second line simply deletes the whole line if the pattern appears anywhere in it, not only at the start.

    :g/what line to delete/d

Vim is good for manipulating a small number of big files. But I had 16 files, and I almost lost track of which ones I had already gone through. In hindsight, a small Python loop over all the files, like the one sketched below, would have saved me that bookkeeping.
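
A rough sketch of that loop, relying on the JavaScript always starting with the same marker and running to the end of the line:

    # cut each line at the JavaScript marker and rewrite the file in place
    marker = "(function(w,d,s,i)"

    for filename in ["ENFJ", "ENFP"]:  # ... and the other 14 types
        with open(filename, "r") as infile:
            lines = infile.readlines()
        with open(filename, "w") as outfile:
            for line in lines:
                if marker in line:
                    line = line[:line.index(marker)].rstrip() + "\n"
                outfile.write(line)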

Raising the Error by also Printing it

In the previous post, I mentioned that I had a problem with errors. Well, even though it did not help me later, I still researched how to see the errors without them stopping the script.

    import urllib.request

    try:
        page = urllib.request.urlopen(webpage)  # webpage is the URL to fetch
    except Exception as e:
        # e is the error; printing it shows what went wrong
        # without stopping the script
        print(e)
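
And if the message alone is not enough, the standard traceback module can print the whole stack trace, again without stopping the script:

    import traceback
    import urllib.request

    webpage = "https://example.com"  # placeholder URL

    try:
        page = urllib.request.urlopen(webpage)
    except Exception:
        # prints the full stack trace, but the script keeps running
        traceback.print_exc()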

Just in case somebody else finds it helpful.