Blog of Sara Jakša

First Working Model for the MBTI Project

I have presented my model for predicting MBTI types at school. I am not yet satisfied with it, as it only gets the type right around 35% of the time. That is still better than the roughly 6% it would get from guessing among the 16 types.
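
For reference, the 6% figure matches the chance level for 16 equally likely types. Here is a quick sketch of that arithmetic, assuming a guesser that picks every type with the same probability (the variable names are only illustrative):

```python
# Assumption: a guesser that picks each of the 16 MBTI types with equal probability.
n_types = 16
chance_accuracy = 1 / n_types
model_accuracy = 0.35  # the rough figure quoted above

print(f"chance: {chance_accuracy:.2%}")
print(f"improvement over chance: {model_accuracy / chance_accuracy:.1f}x")
```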

In the end, I only used one model, since I could not get any better results by using multiple models. So I decided that in this case it would be better to simply use a single one. It was also the only one where I did not see a bias in the predicted/true diagram.

Interestingly, all of the other models had the same type of bias. I have not figured out why.
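
One way to put a number on such a bias, assuming it means that some types get predicted far more often than they actually occur, is to compare how often each type is predicted with how often it is the true label. This is only a sketch with made-up labels, not results from the project:

```python
from collections import Counter

# Made-up labels for illustration only; these are not results from the project.
true_types = ["ENFJ", "ENFJ", "INTP", "INTP", "INFJ", "INFJ"]
predicted_types = ["INTP", "ENFJ", "INTP", "INTP", "INTP", "INFJ"]

true_counts = Counter(true_types)
predicted_counts = Counter(predicted_types)

# Positive values: the type is predicted more often than it occurs.
# Negative values: the type is predicted less often than it occurs.
over_prediction = {
    label: predicted_counts[label] - true_counts[label]
    for label in sorted(true_counts)
}
print(over_prediction)
```

A type with a large positive value is one the model falls back on too often, which would show up as a dark column in the predicted/true diagram.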

Here is the code that I used to build the final model.

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import Pipeline
    #sklearn.externals.joblib was removed in newer scikit-learn releases, so joblib is imported directly
    import joblib
    import matplotlib.pyplot as plt
    import numpy as np
    import random
    import sklearn.metrics

    #the rest of this list is cut off in the source; only the first file name survives
    filenames = ["ENFJ"]

    texts = list()
    types = list()

    for filename in filenames:
        with open(filename, "r") as read:
            content = read.readlines()
        content = "".join(content)
        content = content.split("\n\n\n")
        content = [post.strip() for post in content]
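        for post in content[:1281]:
            #loop body reconstructed from the variables used below: keep the post and label it with its file name
            texts.append(post)
            types.append(filename)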

    #preparing data for building the model
    data = [element for element in zip(texts, types)]
    texts2 = [element[0] for element in data]
    types2 = [element[1] for element in data]

    #building the model
    model = Pipeline([('vect', CountVectorizer(stop_words="english")),
                      ('tfidf', TfidfTransformer()),
                      ('svc', LinearSVC())])

    dtfitted = model.fit(texts2, types2)

    #this part is just how I checked whether it is a good model
    #usually it is better to check with a different dataset than the one used to build the model
    predicted = dtfitted.predict(texts2)
    print(np.mean(predicted == types2))
    #confusion_matrix expects the true labels first: rows are true types, columns are predicted types
    conf = sklearn.metrics.confusion_matrix(types2, predicted)
    plt.imshow(conf, cmap='binary', interpolation='none')

    #saving the model
    joblib.dump(dtfitted, 'finalmodel.pkl') 
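
The comment in the code above mentions that it is better to check the model on different data than it was built on. Below is a minimal sketch of such a check, using scikit-learn's `train_test_split` on a tiny made-up corpus. The example posts, the two labels and the split size are all assumptions for illustration, and this is not necessarily how the 35% figure was measured.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up stand-ins for the real posts and their type labels.
texts = [
    "I love organising events for my friends",
    "Helping people grow is what motivates me",
    "We should all meet and talk about our feelings",
    "My friends say I always bring people together",
    "I spent the weekend alone debugging a compiler",
    "Logical systems and theories fascinate me",
    "I would rather read about physics than go to a party",
    "Abstract puzzles keep me busy for hours",
]
types = ["ENFJ"] * 4 + ["INTP"] * 4

# Hold out a quarter of the posts; stratify keeps both types in the test set.
texts_train, texts_test, types_train, types_test = train_test_split(
    texts, types, test_size=0.25, stratify=types, random_state=0)

# Same pipeline as in the post, but fitted on the training part only.
model = Pipeline([('vect', CountVectorizer(stop_words="english")),
                  ('tfidf', TfidfTransformer()),
                  ('svc', LinearSVC())])
model.fit(texts_train, types_train)

# Pipeline.score reports the mean accuracy of the final classifier.
train_accuracy = model.score(texts_train, types_train)
test_accuracy = model.score(texts_test, types_test)
print("accuracy on training posts:", train_accuracy)
print("accuracy on held-out posts:", test_accuracy)
```

The accuracy on the held-out posts is the more honest of the two numbers, since the model has never seen those posts.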

The model with a very simple version of the program, where you can test your own type, can be found on the