Since I wanted to keep track of which texts come from which blogs, I decided the best approach would be to split the descriptions by type. With almost 13,000 blog descriptions, there was no way I was doing it by hand.
My heuristic: as long as a description mentions exactly one type and no others, that should be a reasonable indicator that the writer is of that type.
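The heuristic boils down to: collect every type code a description mentions, and keep the blog only when exactly one turns up. A minimal sketch (the sample descriptions here are made up for illustration):

```python
# All 16 MBTI codes.
types = {"intp", "intj", "entp", "entj", "istp", "istj", "estp", "estj",
         "infp", "infj", "enfp", "enfj", "isfp", "isfj", "esfp", "esfj"}

def classify(description):
    """Return the single MBTI code a description mentions, or None if
    the description mentions zero or several codes."""
    found = {t for t in types if t in description.lower()}
    return found.pop() if len(found) == 1 else None

print(classify("Just your average INTP blog"))  # intp
print(classify("INFJ? ENFP? who knows"))        # None (ambiguous)
```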
```python
import collections

# All 16 MBTI codes. (The original list repeated "estj" and "isfj"
# and was missing "entp" and "infj".)
types = ["intp", "intj", "entp", "entj", "istp", "istj", "estp", "estj",
         "infp", "infj", "enfp", "enfj", "isfp", "isfj", "esfp", "esfj"]
filename = "file.csv"

def countTags(singlefile):
    # Map each type to the set of blogs whose description mentions it.
    # The blog identifier is assumed to be the first tab-separated field.
    typesfreq = collections.defaultdict(set)
    with open(singlefile, "r") as read:
        for line in read:
            splitline = line.split("\t")
            content = line.lower()
            for mbtitype in types:
                if mbtitype in content:
                    typesfreq[mbtitype].add(splitline[0].strip())
    return typesfreq

typesfreq = countTags(filename)

# Discard every blog that mentions more than one type.
for mbtitype in types:
    for mbtitype2 in types:
        if mbtitype != mbtitype2:
            intersection = typesfreq[mbtitype].intersection(typesfreq[mbtitype2])
            for element in intersection:
                typesfreq[mbtitype].discard(element)

with open(filename, "r") as read:
    content = read.readlines()

# Write each remaining blog's full line to the file for its type.
for mbtitype in types:
    print(mbtitype)
    sites = typesfreq[mbtitype]
    with open("blogs" + mbtitype + ".csv", "a") as write:
        for line in content:
            splitline = line.split("\t")
            if not splitline:
                continue
            if splitline[0].strip() in sites:
                write.write(line)
```
There is probably still something wrong with the code, since about 1% of descriptions still had at least two MBTI codes in them. But overall I was quite satisfied with the results: fewer than 5% of blogs did not belong to the type they were classified under.
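Two plausible culprits for the leftovers: the original type list repeated "estj" and "isfj" while omitting "entp" and "infj", so those two codes were never filtered on; and plain substring matching can misfire, since "istj", for example, also appears inside a word like "mistjudged". A sketch of word-boundary matching with Python's re module that avoids the second problem (sample descriptions are invented):

```python
import re

# All 16 MBTI codes.
types = ["intp", "intj", "entp", "entj", "istp", "istj", "estp", "estj",
         "infp", "infj", "enfp", "enfj", "isfp", "isfj", "esfp", "esfj"]

# \b word boundaries stop "istj" from matching inside "mistjudged".
pattern = re.compile(r"\b(" + "|".join(types) + r")\b")

def mentioned_types(description):
    """Return the set of MBTI codes mentioned as whole words."""
    return set(pattern.findall(description.lower()))

print(mentioned_types("INTJ and proud"))           # {'intj'}
print(mentioned_types("often mistjudged"))         # set()
print(mentioned_types("INFP married to an ENTJ"))  # two codes -> discard
```

Swapping this in for the `if mbtitype in content` check would tighten the classification without changing the rest of the pipeline.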