Slovenian Cuisine (Studying the Effect of Preprocessing on Topic Modeling)

While working on topic modeling of cognitive science articles, I noticed something. Using different algorithms on the same preprocessed data gave me relatively similar results. But I could get much more interpretable results if I simply filtered out the noise, for example by removing the stop words or the verbs. For some reason, when I leave these in, I have more trouble finding meaning in the topics.
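
To make the filtering step concrete, here is a minimal sketch of stop-word removal before topic modeling. The tiny stop-word list, the sample sentence, and the function names are assumptions made up for illustration, not the ones from the actual analysis. Filtering out verbs would additionally need a part-of-speech tagger, which is not shown here.

```python
import re

# Hypothetical, tiny stop-word list for illustration only;
# a real Slovenian stop-word list would be much longer.
STOP_WORDS = {"in", "ali", "za", "v", "je", "da", "ki", "po"}

def tokenize(text):
    """Lowercase the text and return runs of letters (digits and punctuation are dropped)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word list."""
    return [token for token in tokens if token not in stop_words]

# Made-up sample sentence, not taken from the recipe data.
sample = "Sol in poper dodamo po okusu, česen pa je v omaki."
tokens = tokenize(sample)
filtered = remove_stop_words(tokens)
print(filtered)
```

The filtered token lists, rather than the raw text, would then be passed to whichever topic modeling algorithm is used.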

While preparing my PyConBalkan talk (which will happen this Friday), I looked for examples to present. One of the things I am interested in is cooking. And I figured that everybody eats, so the topic of food would be at least familiar to everybody.

So I downloaded over 18,000 recipes from one of the Slovenian recipe sites. The code that I eventually used can be found on my GitHub. I thought of including them in my presentation, but when I practiced it at the Python Meetup, I realized that 12 different food categories are too many. So instead, I am going to present the results here. I also reran the analysis, while the pictures that I drew are from the first run. The results should be very similar, but I did not check that.

I will first present the 12 groups that I got without preprocessing. This means that none (or, to be honest, almost none) of my biases or decisions are included here. But I find these groups to be less representative. I have listed the 10 most representative words for each group, excluding punctuation and numbers. Still, feel free to peruse them.

Topic 1                 Topic 2              Topic 3                      Topic 4
izbiri (choice)         sladkorja (sugar)    bio (bio)                    kocki (cubes)
lastni (one own)        moke (flour)         le (only)                    marinada (marinade)
ste (are)               za (for)             gusto (coffee)               pesta (pesto)
noč (night)             mleka (milk)         milfina (Milfina - brand)    paličice (sticks)
agar (agar)             v (in)               okus (taste)                 pesti (fists)
vsaj (at least)         masla (butter)       natur (natural)              pekač (baking tray)
namočeni (soaked)       smetane (cream)      aktiv (Aktiv - brand)        bele (white)
jajc (eggs)             ali (or)             piranske (from Piran)        kakav (cacao)
občutku (feeling)       sladkor (sugar)      iz (from)                    česnom (garlic)
podlaga (grounding)     prahu (powder)       soline (salters)             močno (strong)

Topic 5                 Topic 6              Topic 7                      Topic 8
začimbe (spices)        bananinega (banana)  janež (anise)                vodke (vodka)
kis (vinegar)           grobe (rough)        mandarin (mandarin)          polenovke (codfish)
omaka (sauce)           gre (goes)           smarties (Smarties - brand)  blue (blue)
sojina (soya)           kruhki (canapes)     francoski (French)           topi (blunt)
solate (salad)          nimamo (not having)  luskic (little scales)       losos (salmon)
zelenjava (vegetables)  marsale (wine)       čaj (tea)                    zrno (grain)
file (fillet)           soka (juice)         žlico (spoon)                dimljen (smoked)
česen (garlic)          poljuben (optional)  marcipanove (marzipan)       dan (day)
olje (oil)              solata (salad)       lan (flax)                   zamenjamo (exchange)
koruza (corn)           ostali (other)       fine (fine)                  trda (hard/rigid)

Topic 9                 Topic 10             Topic 11                     Topic 12
sol (salt)              ki (which)           ravna (flat)                 so (then)
poper (pepper)          ga (him)             mu (him)                     je (is)
česna (garlic)          jo (her)             kupljeno (bought)            da (that)
olje (oil)              kavo (coffee)        polovičke (halves)           led (ice)
ali (or)                semen (seeds)        orehovo (walnuts)            pri (at)
čebula (onion)          domači (homemade)    žafranke (saffron)           ga (him)
po (after)              puding (pudding)     sardelinih (anchovy)         toliko (this much)
olja (oil)              sami (on our own)    sirova (cheese)              bedra (leg)
in (and)                domač (homemade)     rastlinske (plants)          kot (like)
sol (salt)              kakija (persimmon)   nescafeja (Nescafe - brand)  ker (because)
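
The word lists above are the 10 most representative words per group, excluding punctuation and numbers. A rough sketch of how such a list could be picked is below. It assumes each topic is available as a dictionary mapping a word to its weight, which may not match the output format of the library actually used; the function name and the example weights are made up.

```python
import string

def top_words(topic_weights, n=10):
    """Return the n highest-weighted words of one topic, skipping punctuation and numbers."""
    def is_word(token):
        is_punct = all(ch in string.punctuation for ch in token)
        # Treat "2", "1.5" and "1,5" as numbers.
        is_number = token.replace(".", "", 1).replace(",", "", 1).isdigit()
        return not (is_punct or is_number)

    # Sort words from the highest to the lowest weight.
    ranked = sorted(topic_weights.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked if is_word(word)][:n]

# Made-up weights for a single topic, only to show the shape of the data.
example_topic = {"sol": 0.09, ",": 0.20, "poper": 0.07, "2": 0.05, "olje": 0.04, "in": 0.03}
print(top_words(example_topic, n=3))
```

Note that stop words such as "in" survive this step; only punctuation and numbers are dropped, which matches the unfiltered lists above.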

One thing to keep in mind is that when topic modeling is done without preprocessing, some of the topics are noise. But here a lot of them seem like noise to me.

I also drew a picture of these twelve groups. See it below:

Then I did the preprocessing. Since I have now imposed more structure on the model, I might get a different result. So I will first add the picture (and you can see from it how much very small personal decisions in filtering can affect the result):

Because only ingredient words are supposed to be left here, I am going to describe each topic with only 5 words:

Topic 1              Topic 2                Topic 3              Topic 4
limona (lemon)       jabolko (apple)        oreščki (nuts)       soja (soya)
sok (juice)          banana (banana)        muškat (nutmeg)      sezam (sesame)
sladkor (sugar)      breskev (peach)        ingver (ginger)      buča (pumpkin)
pomaranča (orange)   ananas (pineapple)     klinček (clove)      sončnica (sunflower)
voda (water)         jogurt (yoghurt)       maslo (butter)       ohrovt (Brussels sprout)

Topic 5              Topic 6                Topic 7              Topic 8
sadje (fruit)        marmelada (jam)        sladkor (sugar)      sol (salt)
vino (wine)          jagoda (strawberry)    jajce (egg)          poper (pepper)
cimet (cinnamon)     marelica (apricot)     moka (flour)         olje (oil)
hruška (pear)        sliva (plum)           mleko (milk)         čebula (onion)
sladoled (icecream)  borovnica (blueberry)  vanilija (vanilla)   česen (garlic)

Topic 9              Topic 10               Topic 11             Topic 12
liker (liqueur)      med (honey)            sir (cheese)         voda (water)
oblat (layer/wafer)  mandelj (almond)       testo (dough)        moka (flour)
pivo (beer)          kokos (coconut)        mascarpone (cheese)  sol (salt)
marcipan (marzipan)  kosmiči (cereal)       kava (coffee)        maščoba (fat)
ribez (currant)      cimet (cinnamon)       piškot (cookies)     ajda (buckwheat)

For some reason, when I looked at the original groups, they seemed to make more sense than these ones. But these still make some sense. At least for some of them, I can imagine how they came together. Topic 8 is probably a specific way of preparing vegetables and meats (we say "na čebuli", which translates literally as "on onions"). Topic 7 is basic baking. Topic 10 is probably breakfast. Topic 1 is juicing. Topic 5 would be Christmas, if not for the ice cream. And so on.

Still, this shows that filtering can have a huge effect on the results. On the other hand, I have no idea how to interpret the results that I got.
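
One possible way to put a number on this effect would be to compare the word sets of topics from the two runs, for instance with the Jaccard overlap. This is only a sketch of such a check and was not part of the analysis; the function names are hypothetical. The example compares Topic 9 from the run without preprocessing with Topic 8 from the preprocessed run, using the words listed above.

```python
def jaccard(words_a, words_b):
    """Share of words two topics have in common (0 = nothing shared, 1 = identical sets)."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def best_match(topic, other_topics):
    """Return the index and score of the most similar topic in another run."""
    scores = [jaccard(topic, other) for other in other_topics]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Topic 9 without preprocessing and Topic 8 with preprocessing, as listed above.
raw_topic_9 = ["sol", "poper", "česna", "olje", "ali", "čebula", "po", "olja", "in", "sol"]
filtered_topic_8 = ["sol", "poper", "olje", "čebula", "česen"]

print(jaccard(raw_topic_9, filtered_topic_8))
```

Because the unfiltered run keeps inflected forms such as "česna" and "olja" as separate words, the overlap stays well below 1 even for two topics that look like the same group.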

And I guess only the results of the 4-group solution are going to go into my presentation for PyConBalkan.