Blog of Sara Jakša

How I got the MBTI Types from Tumblr Descriptions

I wanted to keep track of which texts came from which blogs, so I decided the best way would be to split the blog descriptions by MBTI type. With almost 13,000 blogs with descriptions, there was no way I was going to do it by hand.

To get the type for each blog, I figured that as long as a description mentions exactly one type and no others, that should be a good indicator that the writer is of that type.
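To make the rule concrete, here is a minimal sketch of it on a single description string. The names `MBTI_TYPES`, `TYPE_PATTERN` and `single_type` are made up for this illustration, and the word-boundary matching is an assumption of the sketch: it is stricter than a plain substring check, so it would miss something like "infps".

```python
import re

# All 16 MBTI type codes, lowercase.
MBTI_TYPES = [
    "intp", "intj", "istp", "istj", "entp", "entj", "estp", "estj",
    "infp", "infj", "isfp", "isfj", "enfp", "enfj", "esfp", "esfj",
]

# Match a type code only when it stands as its own word (an assumption of this sketch).
TYPE_PATTERN = re.compile(r"\b(" + "|".join(MBTI_TYPES) + r")\b")


def single_type(description):
    """Return the one type a description mentions, or None for zero or several."""
    found = set(TYPE_PATTERN.findall(description.lower()))
    if len(found) == 1:
        return found.pop()
    return None


print(single_type("INFP | she/her | books"))        # infp
print(single_type("intj or intp? still deciding"))  # None
print(single_type("just a photography blog"))       # None
```

The full script below applies this rule across the whole file, using plain substring checks instead of the regular expression.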

    import collections

    # All 16 MBTI type codes, lowercase, each listed once.
    types = ["intp", "intj", "istp", "istj", "estj", "entj", "estp", "entp", "infp", "isfp", "isfj", "infj", "esfp", "esfj", "enfj", "enfp"]
    filename = "file.csv"

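    def countTags(singlefile):
        # Map each MBTI type to the set of blogs whose description mentions it.
        typesfreq = collections.defaultdict(set)
        with open(singlefile, "r") as read:
            for line in read:
                columns = line.split("\t")
                if len(columns) < 3:
                    continue  # skip blank or malformed lines
                # The first column identifies the blog, the third holds the description.
                description = columns[2].lower()
                for mbtitype in types:
                    if mbtitype in description:
                        typesfreq[mbtitype].add(columns[0].strip())
        return typesfreq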

    typesfreq = countTags(filename)

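    # Collect every blog that mentions more than one type, then drop it from all
    # types. Discarding inside the pairwise loop only clears the first type of a
    # pair, because the intersection is already empty on the reverse pass.
    ambiguous = set()
    for mbtitype in types:
        for mbtitype2 in types:
            if mbtitype != mbtitype2:
                ambiguous |= typesfreq[mbtitype] & typesfreq[mbtitype2]

    for mbtitype in types:
        typesfreq[mbtitype] -= ambiguous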

    with open(filename, "r") as read:
        content = read.readlines()

    for mbtitype in types:
        print(mbtitype)
        sites = typesfreq[mbtitype]
        with open("blogs" + mbtitype + ".csv", "a") as write:
            for line in content:
                splitline = line.split("\t")
                if not line.strip():
                    continue  # skip blank lines
                if splitline[0].strip() in sites:
                    write.write(line)

There was probably still something wrong with the code I ran, since about 1% of the sorted descriptions still had at least two MBTI codes in them. But overall, I was quite satisfied with the results. Fewer than 5% of the blogs did not belong in the type they were classified under.
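A quick way to look into that leftover 1% would be to re-scan the sorted lines and list the ones that still mention several types. This is only a sketch: the function name `lines_with_several_types` and the sample rows are invented for the illustration, and it assumes the same tab-separated layout as the script, with the description in the third column.

```python
import re

# All 16 MBTI type codes, lowercase.
MBTI_TYPES = [
    "intp", "intj", "istp", "istj", "entp", "entj", "estp", "estj",
    "infp", "infj", "isfp", "isfj", "enfp", "enfj", "esfp", "esfj",
]

# Plain substring matching, the same kind of check the sorting script uses.
TYPE_PATTERN = re.compile("|".join(MBTI_TYPES))


def lines_with_several_types(lines):
    """Return the lines whose third column mentions two or more distinct types."""
    flagged = []
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 3:
            continue  # skip blank or malformed lines
        found = set(TYPE_PATTERN.findall(columns[2].lower()))
        if len(found) > 1:
            flagged.append(line)
    return flagged


# Made-up rows, only to show the shape of the input.
sample = [
    "blog-a\tTitle A\tinfp, reader\n",
    "blog-b\tTitle B\tintj with an enfp sister\n",
]
print(lines_with_several_types(sample))
```

Running it on the lines of each output file, for example with `lines_with_several_types(open("blogsinfp.csv"))`, should return an empty list if the sorting worked as intended.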