I presented my model for predicting MBTI types at school. I am not yet satisfied with it, as it only has around a 35% chance of getting the type right. That is still better than the roughly 6% (1 in 16) it would get from guessing.
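As a quick check of the guessing baseline, here is a minimal sketch. It assumes that all 16 types are equally common in the data, so a uniform random guess is right 1 time in 16. The variable names are only for illustration.

```python
# Build the 16 MBTI type codes from the four letter pairs.
mbti_types = [
    e + n + f + j
    for e in "EI"
    for n in "NS"
    for f in "FT"
    for j in "JP"
]

# With balanced classes, a uniform random guess is correct 1 / (number of types) of the time.
chance_accuracy = 1 / len(mbti_types)
print(f"{len(mbti_types)} types, chance accuracy = {chance_accuracy:.2%}")
```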
In the end, I used only one model, since I could not get any better results by combining several. So I decided that in this case it would be better to simply use a single one. It was also the only one where I did not see a bias in the predicted/true diagram.
Interestingly, all of the other models had the same type of bias. I have not figured out why.
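One possible way to put a number on such a bias, instead of only reading it off the diagram, is to compare how often each type is predicted with how often it really occurs. This is only a sketch: it assumes that "bias" means some types being predicted far too often. The function name `prediction_bias` and the example labels are made up for illustration.

```python
from collections import Counter


def prediction_bias(true_types, predicted_types):
    """Return, per type, how many more times it was predicted than it occurred."""
    true_counts = Counter(true_types)
    predicted_counts = Counter(predicted_types)
    labels = sorted(set(true_counts) | set(predicted_counts))
    # Positive value: the type is over-predicted. Negative: under-predicted.
    return {label: predicted_counts[label] - true_counts[label] for label in labels}


# Made-up labels, for illustration only.
true_types = ["INFP", "INFP", "ENTJ", "ENTJ", "ISTP", "ISTP"]
predicted_types = ["INFP", "INFP", "INFP", "ENTJ", "INFP", "ISTP"]
print(prediction_bias(true_types, predicted_types))
```

A model without this kind of bias would give values close to zero for every type.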
Here is the code that I used to build the final model.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.metrics import confusion_matrix
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import Pipeline
    import joblib  # older scikit-learn versions shipped this as sklearn.externals.joblib
    import matplotlib.pyplot as plt
    import numpy as np
    import random

    filenames = ["ENFJ", "ENFP", "ENTJ", "ENTP", "ESFJ", "ESFP", "ESTJ", "ESTP",
                 "INFJ", "INFP", "INTJ", "INTP", "ISFJ", "ISFP", "ISTJ", "ISTP"]

    texts = list()
    types = list()

    #reading the posts: one file per type, posts separated by two empty lines
    for filename in filenames:
        with open(filename, "r") as read:
            content = read.read()
        content = content.split("\n\n\n")
        content = [post.strip() for post in content]
        print(filename)
        #use at most 1281 posts per type
        for post in content[:1281]:
            texts.append(post)
            types.append(filename)

    #preparing data for building the model
    data = [element for element in zip(texts, types)]
    random.shuffle(data)
    texts2 = [element[0] for element in data]
    types2 = [element[1] for element in data]

    #building the model
    model = Pipeline([('vect', CountVectorizer(stop_words="english")),
                      ('tfidf', TfidfTransformer()),
                      ('svc', LinearSVC()),
                      ])

    dtfitted = model.fit(texts2, types2)

    #this part is just how I checked if it is a good model -> usually it is better to check with a different dataset than the one used to build the model
    predicted = dtfitted.predict(texts2)
    print(np.mean(predicted == np.array(types2)))
    #rows are the predicted types, columns are the true types
    conf = confusion_matrix(predicted, types2)
    plt.imshow(conf, cmap='binary', interpolation='none')
    plt.show()

    #saving the model
    joblib.dump(dtfitted, 'finalmodel.pkl')
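The comment in the code notes that it is usually better to check the model on data that was not used to build it. A minimal sketch of how that could look with scikit-learn's `train_test_split` follows. This is not what I did for the final model. The function name `held_out_accuracy`, the 25% test share, and the tiny corpus are assumptions made only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def held_out_accuracy(texts, types, test_size=0.25, seed=0):
    """Fit the pipeline on one part of the data and score it on the rest."""
    # stratify keeps the share of each type the same in both parts.
    train_texts, test_texts, train_types, test_types = train_test_split(
        texts, types, test_size=test_size, random_state=seed, stratify=types
    )
    model = Pipeline([
        ("vect", CountVectorizer(stop_words="english")),
        ("tfidf", TfidfTransformer()),
        ("svc", LinearSVC()),
    ])
    model.fit(train_texts, train_types)
    # Pipeline.score uses the classifier's score, which is mean accuracy.
    return model.score(test_texts, test_types)


# Tiny made-up corpus, only to show that the function runs.
toy_texts = ["party friends talking loud"] * 8 + ["book quiet reading study"] * 8
toy_types = ["ENFP"] * 8 + ["INTJ"] * 8
print(held_out_accuracy(toy_texts, toy_types))
```

On the real posts, this number would likely be lower than the accuracy measured on the training data itself, and it is the fairer one to compare against guessing.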
The model, together with a very simple version of the program where you can test your own type, can be found at https://github.com/sarajaksa/schoolwork/tree/master/personalitytype.