Slovenian Cuisine (Studying the Effect of Preprocessing on Topic Modeling)

While working on topic modeling of cognitive science articles, I noticed something. Using different algorithms on the same preprocessed data gave me relatively similar results. But I could get much more interpretable results if I simply filtered out the noise, for example by removing the stop words or the verbs. For some reason, when I leave these in, I have more trouble finding meaning in the topics.
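As a minimal sketch of the stop-word part of this filtering: the stop-word list and the example sentence below are made up for illustration, and a real run would need a much longer list. Filtering out verbs would additionally need a part-of-speech tagger, which is not shown here.

```python
import re

# Hypothetical stop-word list: a handful of Slovenian function words.
# This is an assumption for illustration, not the list used in the analysis.
STOP_WORDS = {"in", "ali", "za", "v", "iz", "ki", "je", "da", "po", "pri"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters (digits and punctuation are dropped)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


# Made-up example sentence in the style of a recipe.
recipe = "Sol in poper po okusu, 2 žlici olja za peko."
tokens = remove_stop_words(tokenize(recipe))
print(tokens)
```

The filtered token list is what would then be handed to the topic model instead of the raw text.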

While preparing my PyConBalkan talk (which will happen this Friday), I tried to find examples to present. One of the things that I am interested in is cooking. And I figured that everybody eats, so the topic of food would be at least familiar to everybody.

So I downloaded over 18000 recipes from one of the Slovenian recipe sites. The code that I eventually used can be found on my GitHub. I thought I would include the results in my presentation, but when I practiced it at the Python Meetup, I realized that 12 different food categories is too many. So instead, I am going to present the results here. I also reran the analysis, while the pictures that I drew are from the first run. The results should be at least very similar, but I did not check that.

I will first present the 12 groups that I got without preprocessing. This means that almost none of my biases or decisions are included here. But I find these groups to be less representative. I have put down the 10 most representative words for each group, excluding punctuation and numbers. Still, feel free to peruse them.

| Topic 1 | Topic 2 | Topic 3 | Topic 4 |
|---|---|---|---|
| izbiri (choice) | sladkorja (sugar) | bio (bio) | kocki (cubes) |
| lastni (one own) | moke (flour) | le (only) | marinada (marinade) |
| ste (are) | za (for) | gusto (coffee) | pesta (pesto) |
| noč (night) | mleka (milk) | milfina (Milfina - brand) | paličice (sticks) |
| agar (agar) | v (in) | okus (taste) | pesti (fists) |
| vsaj (at least) | masla (butter) | natur (natural) | pekač (baking tray) |
| namočeni (soaked) | smetane (cream) | aktiv (Aktiv - brand) | bele (white) |
| jajc (eggs) | ali (or) | piranske (from Piran) | kakav (cacao) |
| občutku (feeling) | sladkor (sugar) | iz (from) | česnom (garlic) |
| podlaga (grounding) | prahu (powder) | soline (salters) | močno (strong) |

| Topic 5 | Topic 6 | Topic 7 | Topic 8 |
|---|---|---|---|
| začimbe (spices) | bananinega (banana) | janež (anise) | vodke (vodka) |
| kis (vinegar) | grobe (rough) | mandarin (mandarin) | polenovke (codfish) |
| omaka (sauce) | gre (goes) | smarties (Smarties - brand) | blue (blue) |
| sojina (soya) | kruhki (canapes) | francoski (French) | topi (blunt) |
| solate (salad) | nimamo (not having) | luskic (little scales) | losos (salmon) |
| zelenjava (vegetables) | marsale (wine) | čaj (tea) | zrno (grain) |
| file (fillet) | soka (juice) | žlico (spoon) | dimljen (smoked) |
| česen (garlic) | poljuben (optional) | marcipanove (marzipan) | dan (day) |
| olje (oil) | solata (salad) | lan (flax) | zamenjamo (exchange) |
| koruza (corn) | ostali (other) | fine (fine) | trda (hard/rigid) |

| Topic 9 | Topic 10 | Topic 11 | Topic 12 |
|---|---|---|---|
| sol (salt) | ki (which) | ravna (flat) | so (then) |
| poper (pepper) | ga (him) | mu (him) | je (is) |
| česna (garlic) | jo (her) | kupljeno (bought) | da (that) |
| olje (oil) | kavo (coffee) | polovičke (halves) | led (ice) |
| ali (or) | semen (seeds) | orehovo (walnuts) | pri (at) |
| čebula (onion) | domači (homemade) | žafranke (saffron) | ga (him) |
| po (after) | puding (pudding) | sardelinih (anchovy) | toliko (this much) |
| olja (oil) | sami (on our own) | sirova (cheese) | bedra (leg) |
| in (and) | domač (homemade) | rastlinske (plants) | kot (like) |
| sol (salt) | kakija (persimmon) | nescafeja (Nescafe - brand) | ker (because) |
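Word lists like the ones above can be pulled out of a fitted topic model by ranking the tokens of each topic by weight and skipping anything that is not a word. The sketch below assumes the topic is available as a plain token-to-weight dictionary; the function name `top_words` and the weights are made up for illustration and do not come from the actual model.

```python
def top_words(topic_weights, n=10):
    """Return the n highest-weighted tokens, skipping punctuation and numbers."""
    ranked = sorted(topic_weights.items(), key=lambda item: item[1], reverse=True)
    # str.isalpha() is True only for tokens made entirely of letters,
    # so "," and "2" are dropped while "čebula" is kept.
    words = [token for token, _ in ranked if token.isalpha()]
    return words[:n]


# Made-up weights for a single topic (an assumption, not real model output).
topic = {
    ",": 0.30,
    ".": 0.25,
    "2": 0.12,
    "sol": 0.09,
    "poper": 0.07,
    "olje": 0.05,
    "čebula": 0.04,
}
print(top_words(topic, n=3))
```

Note that this only removes punctuation and numbers from the displayed list; the model itself was still trained on them.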

One thing to keep in mind is that if topic modeling is done without preprocessing, some of the topics will be noise. But here a lot of them seem like noise to me.

I also drew a picture of these twelve groups. See it below:

Then I did the preprocessing. Since I have now structured the model more, I might get a different result. So I will first add the picture (and you can see from it how much very small personal decisions in filtering can affect the result):

Because these are supposed to be only ingredient words, I am going to describe each topic with only 5 words:

| Topic 1 | Topic 2 | Topic 3 | Topic 4 |
|---|---|---|---|
| limona (lemon) | jabolko (apple) | oreščki (nuts) | soja (soya) |
| sok (juice) | banana (banana) | muškat (nutmeg) | sezam (sesame) |
| sladkor (sugar) | breskev (peach) | ingver (ginger) | buča (pumpkin) |
| pomaranča (orange) | ananas (pineapple) | klinček (clove) | sončnica (sunflower) |
| voda (water) | jogurt (yoghurt) | maslo (butter) | ohrovt (Brussels sprout) |

| Topic 5 | Topic 6 | Topic 7 | Topic 8 |
|---|---|---|---|
| sadje (fruit) | marmelada (jam) | sladkor (sugar) | sol (salt) |
| vino (wine) | jagoda (strawberry) | jajce (egg) | poper (pepper) |
| cimet (cinnamon) | marelica (apricot) | moka (flour) | olje (oil) |
| hruška (pear) | sliva (plum) | mleko (milk) | čebula (onion) |
| sladoled (icecream) | borovnica (blueberry) | vanilija (vanilla) | česen (garlic) |

| Topic 9 | Topic 10 | Topic 11 | Topic 12 |
|---|---|---|---|
| liker (liqueur) | med (honey) | sir (cheese) | voda (water) |
| oblat (layer/wafer) | mandelj (almond) | testo (dough) | moka (flour) |
| pivo (beer) | kokos (coconut) | mascarpone (cheese) | sol (salt) |
| marcipan (marzipan) | kosmiči (cereal) | kava (coffee) | maščoba (fat) |
| ribez (currant) | cimet (cinnamon) | piškot (cookies) | ajda (buckwheat) |
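The words above are base forms (sladkor instead of sladkorja), and only ingredients are left. One way such a filter could look is sketched below. The lookup table and the whitelist are invented for illustration and are not the lists used for these results; a real run would more likely use a Slovenian lemmatizer than a hand-written dictionary.

```python
# Hypothetical lookup from inflected forms to base forms (an assumption
# standing in for a proper lemmatizer).
LEMMAS = {
    "sladkorja": "sladkor",
    "moke": "moka",
    "mleka": "mleko",
    "masla": "maslo",
    "česna": "česen",
    "olja": "olje",
}

# Hypothetical whitelist of ingredient words (also an assumption).
INGREDIENTS = {"sladkor", "moka", "mleko", "maslo", "česen", "olje", "sol", "poper"}


def keep_ingredients(tokens):
    """Map each token to its base form, then keep only known ingredients."""
    lemmas = [LEMMAS.get(token, token) for token in tokens]
    return [lemma for lemma in lemmas if lemma in INGREDIENTS]


# Tokens taken from Topic 2 of the unfiltered run.
raw_tokens = ["sladkorja", "za", "moke", "v", "mleka", "ali", "masla"]
print(keep_ingredients(raw_tokens))
```

Every entry in such a whitelist is a small personal decision, which is exactly where the bias mentioned earlier comes in.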

For some reason, when I looked at the original groups, they seemed to make more sense than these ones. But these still sort of make sense. At least for some of them, I can imagine how they came together. Topic 8 is probably a specific way of preparing vegetables and meats (we call it "na čebuli", which translates directly as "on onions"). Topic 7 is basic baking. Topic 10 is probably breakfast. Topic 1 is juicing. Topic 5 would be Christmas, if not for the ice cream. And so on.

Still, this shows that filtering can have a huge effect on the results. On the other hand, I have no idea how to interpret the results that I got.
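One rough way to put a number on that effect is to compare the top words of two topics with the Jaccard similarity (shared words divided by all distinct words). This is only a sketch of one possible check, not something that was part of the analysis. The two word lists are copied from the tables above: Topic 9 of the unfiltered run and Topic 8 of the filtered run.

```python
def jaccard(words_a, words_b):
    """Share of distinct words that the two lists have in common (0.0 to 1.0)."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a and not set_b:
        return 1.0  # two empty lists are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


unfiltered_topic_9 = ["sol", "poper", "česna", "olje", "ali",
                      "čebula", "po", "olja", "in", "sol"]
filtered_topic_8 = ["sol", "poper", "olje", "čebula", "česen"]

score = jaccard(unfiltered_topic_9, filtered_topic_8)
print(round(score, 2))  # 0.4
```

The score stays low partly because the unfiltered words are inflected forms (česna, olja), so the two lists disagree even where they mean the same ingredient.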

And I guess only the results of the 4-group solution are going to go into my presentation for PyConBalkan.

Creativity Test

I have found a pretty interesting creativity test on the internet. It uses word associations and tries to measure how disconnected the words are. It uses the LSA (latent semantic analysis) distance between each word and all preceding words as the measurement. And the results of this test seem to be connected to creativity.

Here you can find the test and the article describing it.
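To give a feeling for what such a measurement could look like, here is a minimal sketch that averages the cosine distance between each word and every word before it. The exact scoring of the test is not reproduced here, so treat the formula as an assumption. The 2-dimensional vectors are toy values made up for illustration; the real test would use vectors from an LSA model.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_distance_to_preceding(vectors):
    """Average the distance between each word and all the words before it."""
    if len(vectors) < 2:
        raise ValueError("need at least two words")
    distances = [
        cosine_distance(vectors[i], vectors[j])
        for i in range(1, len(vectors))
        for j in range(i)
    ]
    return sum(distances) / len(distances)


# Toy word vectors (an assumption, not real LSA output).
TOY_VECTORS = {
    "coffee": (1.0, 0.0),
    "tea": (1.0, 0.0),
    "bridge": (0.0, 1.0),
}

related = [TOY_VECTORS[w] for w in ["coffee", "tea"]]
mixed = [TOY_VECTORS[w] for w in ["coffee", "tea", "bridge"]]
print(mean_distance_to_preceding(related))
print(mean_distance_to_preceding(mixed))
```

A chain of closely related words gets a score near zero, while a chain that jumps between unrelated words gets a higher one.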

Needs in Communication

I have finally deleted my Facebook account. I only created it because it was the main communication channel for my cognitive science studies. And then there was always another reason why I did not delete it, usually because it was the one way to communicate with one or two people. But now I have decided to screw everything and delete it.

Instead, I prefer to simply meet people in person. It has been this way since the end of primary school, when email and MSN Messenger became popular with people in my school. I never really liked using them, and I stopped bothering when a friend of mine told me that my personality changed when using them.

In reality, I have never experienced communication through these media as more positive than face-to-face communication. At first I thought it was because of the content and quality. There are many places on the internet that have the chat room problem. But eventually I figured out that this can only be part of the explanation. The best readings on the internet were about on the same level as the average conversation.

Even if I only take the content that is not on social media, but on websites owned by people (check IndieWeb if you want to know more), the feeling was better than social media, but worse than real life.

This can also be seen in other things. I feel better after reading a book for two hours than after reading internet articles for two hours. In the first case, I feel like I either enjoyed the story, actually learned something new, or started to think differently about something. And it happens almost every single time. On the internet, this feeling is a lot less frequent, and absent in the majority of cases. Is it because I am choosing the books myself, instead of the algorithms? Or is it because I can be more focused on the ideas and go deeper into them? I actually don't know.

On the other hand, I am still writing this blog, and I do still sometimes read blogs. That is unlike social media, where I get bored within 5 minutes (which is why I ended up deleting the last one). But now thinking about it, maybe I should start changing this as well. But for some of the things I am interested in, like minimalism, it is hard to find books here in Slovenia. And if I want to order them from abroad, then I need to find out about them from somewhere. Maybe I am way too much of a thinker instead of a doer, but even that is slowly changing in a more doing direction. And eventually, I will not need the constant support of other people's ideas, or at least not that much.

Plus, most of the blogs of people that I know and follow don't actually publish much. So even if I checked once a month, I still would not get a lot of reading material.

Personality did not give me a definitive answer either. I thought that maybe low extroversion, low agreeableness and maybe high openness could explain it. As a person with low extroversion, I don't get as many positive emotions out of social status. So there is less emotional reaction to likes and readers, at least in theory. Which makes me worried for all the people that feel this more acutely than me. And low agreeableness makes me less interested in people than average, and most of the content there was what people were doing. And maybe openness wanted something more unusual? I don't know.

On a little sidetrack, most studies done in the first years of social media showed that openness predicted people's activity. So the higher the openness, the more likely they were to be active on social media. Now I have a hypothesis that the trend is changing, and that people with more openness are the ones more likely to leave it behind.

Then I have recently started getting into non-violent communication. One of the important concepts there is to understand and then express one's own needs, without the expectation that they must be addressed. But here is the problem: I am using other people as a sort of exploration of the opportunity space, so I am not sure what my needs are in this case. Or maybe my need is to be pushed to do something more? But then, I also enjoy conversations where I don't get to express it.

This reminds me of an exercise that we needed to do at university. I remember writing that I am afraid of asking people for things, because I was afraid they would say yes without wanting to. This might be a consequence of reading too many marketing and sales books before university. They hammer on the point that people generally don't like saying no. This is why I really like being in the company of people that I can trust to tell me to go fuck myself when they have had enough of me.

But that still leaves me with the problem of what need I am addressing with face-to-face communication that is not addressed by the other forms of communication. Is it simply that we evolved to be social? Do I prefer the synchronicity of it? Do I want to feel heard? What do I want from it? I don't think it is safety, or being heard, or respect. But knowing what it is not does not make it easier to realize what it is.

Which might be a reason why I have a problem with directing social energy intentionally, since I can't conceptualize what I really want from it. I just know that whatever it is, I can't get it from the internet or SMS-ing or anything similar, and these things should just stay for information sharing.

Recipe: Tiramisu

I remember that I made tiramisu once. It is one of the sweets that I really like (the creamy texture is usually so good), but for some reason I don't really make or order it that much. Which is a shame.

I created my recipe by mixing two recipes together. It is not really perfected. What it is missing is a creamier texture and a lighter taste. But maybe this will motivate me to make it more frequently.

Ingredients:

  * at least 1 box of baby cookies (so around 250g?)
  * Coffee
  * 2 eggs
  * 500g of mascarpone
  * 2 spoons of rum or other alcohol (Cognac)
  * 2 spoons of cacao
  * 100 dag of sugar

  1. Make coffee
  2. Beat the egg whites until firm
  3. Mix the egg yolks with sugar, mascarpone and alcohol
  4. Add the egg whites to the mixture
  5. Soak the cookies in coffee
  6. Put the coffee-soaked cookies in the pan, followed by the cream. Can be repeated multiple times.
  7. Put cacao on top
  8. Cool in the refrigerator for at least 1-2 hours