Blog of Sara Jakša

First Working Model for the MBTI Project

I have presented the model for predicting MBTI types at school. I am not yet satisfied with it, as it only gets the type right around 35% of the time. That is still better than the roughly 6% it would get from guessing one of the 16 types.
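
As a quick check of the guessing figure, here is a small sketch of the arithmetic. It assumes that every one of the 16 types is equally common in the data, and the 0.35 is only the approximate accuracy quoted above, not a freshly measured value.

```python
# Chance baseline: picking one of the 16 MBTI types uniformly at random.
# Assumes the classes are balanced (same number of posts per type).
NUM_TYPES = 16
chance_accuracy = 1 / NUM_TYPES

# Approximate accuracy quoted in the post, not a measured value.
model_accuracy = 0.35

print(f"chance: {chance_accuracy:.2%}")
print(f"model vs. chance: {model_accuracy / chance_accuracy:.1f}x")
```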

In the end, I only used one model, since I could not get any better results by combining multiple ones. So I decided that in this case it would be better to simply use a single one. It was also the only one where I did not see a bias on the predicted/true diagram.

Interestingly, all of the other models had the same type of bias. I have not figured out why.
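
One possible way to put a number on such a bias, besides looking at the diagram, is to compare how often each type is predicted with how often it really occurs. This is only a sketch: the function name `prediction_bias` and the example labels are made up for illustration and are not part of the project code.

```python
from collections import Counter


def prediction_bias(true_labels, predicted_labels):
    """Return {label: predicted_count - true_count} for every label seen."""
    true_counts = Counter(true_labels)
    predicted_counts = Counter(predicted_labels)
    labels = sorted(set(true_counts) | set(predicted_counts))
    # A positive value means the model predicts this type too often.
    return {label: predicted_counts[label] - true_counts[label] for label in labels}


# Made-up labels, purely for illustration.
true_labels = ["INTJ", "INTJ", "ENFP", "ENFP", "ISTP", "ISTP"]
predicted_labels = ["INTJ", "INTJ", "INTJ", "ENFP", "INTJ", "ISTP"]

bias = prediction_bias(true_labels, predicted_labels)
print(bias)
```

A model without this kind of bias would have values close to zero for every type.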

Here is the code that I used for building the final model.

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import Pipeline
    import joblib
    import matplotlib.pyplot as plt
    import numpy as np
    import random
    import sklearn.metrics

    filenames = ["ENFJ",
                 "ENFP",
                 "ENTJ",
                 "ENTP",
                 "ESFJ",
                 "ESFP",
                 "ESTJ",
                 "ESTP",
                 "INFJ",
                 "INFP",
                 "INTJ",
                 "INTP",
                 "ISFJ",
                 "ISFP",
                 "ISTJ",
                 "ISTP"]

    texts = list()
    types = list()

    for filename in filenames:
        with open(filename, "r") as read:
            content = read.read()
        content = content.split("\n\n\n")
        content = [post.strip() for post in content]
        print(filename)
        for post in content[:1281]:
            texts.append(post)
            types.append(filename)

    # preparing data for building the model: shuffle texts and types together
    data = list(zip(texts, types))
    random.shuffle(data)
    texts2 = [text for text, _ in data]
    types2 = [mbti_type for _, mbti_type in data]

    #building the model
    model = Pipeline([('vect', CountVectorizer(stop_words="english")),
                         ('tfidf', TfidfTransformer()),
                         ('svc', LinearSVC()),
                        ])

    dtfitted = model.fit(texts2, types2)

    # this part is just a quick check of whether it is a good model
    # it predicts on the same data the model was built from, so the score is optimistic
    # -> usually it is better to check with a different dataset than the one used to build the model
    predicted = dtfitted.predict(texts2)
    print(np.mean(predicted == types2))
    # confusion_matrix expects (true, predicted): rows are true types, columns are predicted types
    conf = sklearn.metrics.confusion_matrix(types2, predicted)
    plt.imshow(conf, cmap='binary', interpolation='None')
    plt.show()

    #saving the model
    joblib.dump(dtfitted, 'finalmodel.pkl') 
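
The check in the code above predicts on the same posts the model was trained on. A minimal sketch of checking on held-out posts instead could look like this. It uses `train_test_split` from `sklearn.model_selection`. The two example sentences, the two types, and the 25% split size are assumptions made up for illustration, so the score it prints says nothing about the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the real posts; sentences and labels are made up.
texts = ["I love quiet evenings with a book"] * 10 + \
        ["I love loud parties with friends"] * 10
types = ["INTJ"] * 10 + ["ENFP"] * 10

# Keep 25% of the posts aside; stratify keeps the types balanced in both parts.
train_texts, test_texts, train_types, test_types = train_test_split(
    texts, types, test_size=0.25, stratify=types, random_state=0)

# Same pipeline steps as in the post.
model = Pipeline([('vect', CountVectorizer(stop_words="english")),
                  ('tfidf', TfidfTransformer()),
                  ('svc', LinearSVC())])
model.fit(train_texts, train_types)

# Accuracy on posts the model has not seen during training.
accuracy = model.score(test_texts, test_types)
print(accuracy)
```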

The model, together with a very simple version of the program where you can test your own type, can be found at https://github.com/sarajaksa/schoolwork/tree/master/personalitytype.
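
A pipeline saved with `joblib.dump` can be loaded again with `joblib.load` and used to predict the type of a new text. The sketch below is not the program from the repository. It trains a tiny pipeline on made-up sentences and saves it to a temporary folder, only so that the example runs without the real data files.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny made-up training set standing in for the real posts.
texts = ["quiet evening reading alone", "big party dancing friends"] * 5
types = ["INTJ", "ENFP"] * 5

model = Pipeline([('vect', CountVectorizer(stop_words="english")),
                  ('tfidf', TfidfTransformer()),
                  ('svc', LinearSVC())])
model.fit(texts, types)

# Save the fitted pipeline and load it back, as a separate program would.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "finalmodel.pkl")
    joblib.dump(model, path)
    loaded = joblib.load(path)

# The loaded pipeline takes raw text, since vectorizing is part of it.
my_text = "I spent the evening reading alone"
predicted_type = loaded.predict([my_text])[0]
print(predicted_type)
```

Because the vectorizer and the classifier are saved together in one pipeline, the program that loads the model does not need to repeat any of the preprocessing steps.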