Blog of Sara Jakša

Scraping PersonalityCafe with Python and BeautifulSoup

In the previous post, I told you how I figured out that Tumblr would take too much time. I really did not want to spend more than a month getting enough data, because I would run out of time for the analysis.

But I had already registered the project, so I could not just change it. So I had to think about where I could get texts for which I would know the MBTI type of the writer. I first thought about Reddit, but after having problems with the Tumblr API, I decided not to go that route.

I eventually remembered the PersonalityCafe forum. People, at least on some subforums, put their MBTI types under their handle names. So I figured that if I scraped a subset of it, I would have the information I needed.

As a bonus, the whole data collection took less than 48 hours.

Here is the code that I used to scrape their forum:

    import urllib.request
    from bs4 import BeautifulSoup
    import re

    # match an MBTI type (e.g. INTP) under the user's handle name
    gettype = r"\b[IE][NS][TF][JP]\b"
    # pull the page number out of a thread URL like thread-title-12.html
    getpagenumber = r"-(\d*?)\.html"


    webpages = ["http://personalitycafe.com/istj-forum-duty-fulfillers/", 
                "http://personalitycafe.com/intp-forum-thinkers/",
                "http://personalitycafe.com/isfj-forum-nurturers/",
                "http://personalitycafe.com/estj-forum-guardians/",
                "http://personalitycafe.com/esfj-forum-caregivers/",
                "http://personalitycafe.com/istp-forum-mechanics/",
                "http://personalitycafe.com/isfp-forum-artists/",
                "http://personalitycafe.com/estp-forum-doers/",
                "http://personalitycafe.com/esfp-forum-performers/",
                "http://personalitycafe.com/intj-forum-scientists/",
                "http://personalitycafe.com/entj-forum-executives/",
                "http://personalitycafe.com/entp-forum-visionaries/",
                "http://personalitycafe.com/infj-forum-protectors/",
                "http://personalitycafe.com/infp-forum-idealists/",
                "http://personalitycafe.com/enfj-forum-givers/",
                "http://personalitycafe.com/enfp-forum-inspirers/"]

    def gettextfrompersonalitycaffee(webpage):
        page = urllib.request.urlopen(webpage)
        content = page.read()
        soup = BeautifulSoup(content, 'html.parser')
        allposts = soup.find_all("div", class_="content")
        users = soup.find_all("div", class_="userinfo")
        infos = zip(users, allposts)
        for user, post in infos:
            post = post.find_all("blockquote")[0]
            # strip any embedded scripts before extracting the text
            for script in post.find_all("script"):
                script.decompose()
            post = post.get_text()
            pertype = re.search(gettype, user.get_text())
            if not pertype:
                continue
            pertype = pertype.group()
            with open(pertype, "a") as write:
                write.write(post)
                write.write("\n\n\n\n\n")

    def getnumberofpages(webpage):
        page = urllib.request.urlopen(webpage)
        content = page.read()
        soup = BeautifulSoup(content, 'html.parser')
        numbers = soup.find_all("span", class_="first_last")
        if not numbers:
            return None
        link = numbers[-1].find_all("a")
        link = link[0].get("href")
        number = re.search(getpagenumber, link)
        number = number.group(1)
        return int(number)

    def getlinksfromfrontpage(webpage):
        allthreads = []
        page = urllib.request.urlopen(webpage)
        content = page.read()
        soup = BeautifulSoup(content, 'html.parser')
        threads = soup.find_all("h3", class_="threadtitle")
        for thread in threads:
            link = thread.find_all("a")
            link = link[0].get("href")
            allthreads.append(link)
        return allthreads

    def getallthreadlinks(website, number):
        allwebsites = []
        if not number:
            return [website]
        for i in range(number):
            webpage = website.split(".")
            webpage[-2] = webpage[-2] + "-" + str(i + 1)
            webpage = ".".join(webpage)
            allwebsites.append(webpage)
        return allwebsites

    for webpage in webpages:
        print(webpage)
        alllinks = getlinksfromfrontpage(webpage)
        for link in alllinks:
            number = getnumberofpages(link)
            if not number:
                allthreadlinks = [link]
            else:
                allthreadlinks = getallthreadlinks(link, number)
            for threadlink in allthreadlinks:
                gettextfrompersonalitycaffee(threadlink)

There were parts that I later added because I got an error and wanted to make sure I would not need to start from the beginning because of it. But the error never repeated itself, so I decided to post the code without them.

From this script, I got about 330 MB of data, of which 300 MB was tagged with valid MBTI types. But it was in a form where the MBTI type was the name of the file, and inside it the posts were separated by runs of newlines (\n\n\n\n\n).
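To read the data back, a minimal sketch like this would do; it assumes the files produced by the scraper above (file name = MBTI type in uppercase, with the five-newline run the scraper writes after every post):

    def read_posts(mbti_type):
        # one file per type; posts are separated by runs of newlines
        with open(mbti_type, "r") as read:
            content = read.read()
        return [post for post in content.split("\n\n\n\n\n") if post.strip()]

    intp_posts = read_posts("INTP")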

I later wondered from time to time whether this was the best format, but it is the one I ultimately ended up using.

Getting Blogs From Tumblr

Up until now, I had been working on the list of Tumblr blogs by the type of the writer. Now I wanted to get the text of their blog posts.

Here I made sure to no longer bother with a timeout for the API restrictions, since the daily limit was always reached quite soon anyway. And it did not matter whether I ran the script by hand or had a timeout function programmed in.

    import pytumblr
    import re
    import os
    import sys

    #this is a regex, to be able to get the url of the Tumblr blog from the url post
    tumblr_url = r"\w+.tumblr.com"
    blog_url = r"/(\d+)/([\w-]+)"
    name_after_url = r"/\d+/[\w-]+"
    digits_only = r"/(\d+)"

    # Authenticate via OAuth
    client = pytumblr.TumblrRestClient()

    def get_all_webpages(filename):
        allwebpages = set()
        with open(filename, "r") as read:
            content = read.readlines()
            for line in content:
                line = line.split("\t")
                allwebpages.add(line[0].strip())
        return allwebpages

    def write_set_to_file(settowrite, singlefile):
        with open(singlefile, "w") as write:
            for element in settowrite:
                write.write(element + "\n")
        return None

    def get_blog_post(mbti, url, offset):
        blogpost = client.posts(url, type='text', limit=20, offset=offset)

        try:
            if not blogpost["posts"]:
                print("Finished with blog: " + url)
                return "finished"

            url = blogpost["posts"][0]["post_url"]
            title = blogpost["posts"][0]["title"]
            body = blogpost["posts"][0]["body"]
        except KeyError:
            print(blogpost)
            if blogpost["meta"]["msg"] == "Limit Exceeded":
                print("Next offset for " + mbti + " is: " + str(offset))
                with open("offset", "w") as write:
                    write.write(str(offset))
            return None

        try:
            os.mkdir(mbti)
        except OSError:
            pass

        print(url)
        if re.compile(name_after_url).search(url) is not None:
            blog_name = re.search(blog_url, url.strip())
            blog_name = blog_name.groups()
            blog_name = "-".join(blog_name)
        else:
            blog_name = re.search(digits_only, url.strip())
            blog_name = blog_name.groups()
            blog_name = "".join(blog_name)


        with open(mbti + "/" + blog_name + ".txt", "w") as write:
            if not title:
                title = ""
            if not body:
                body = ""
            title = title.encode("utf8")
            body = body.encode("utf8")
            write.write(title + "\n\n\n" + body)

        return offset + 20

    offset = 0
    webpages = get_all_webpages("file.csv")
    write_set_to_file(webpages, "file.csv")
    url = list(webpages)[0]
    tag = ""

    while 1:
        offset = get_blog_post(tag, url, offset)
        if offset is None:
            write_set_to_file(webpages, "blogsistp-cleaned-2.csv")
            sys.exit()
        if offset == "finished":
            webpages.remove(url)
            if not webpages:
                break
            url = list(webpages)[0]
            offset = 0
            write_set_to_file(webpages, "file-out.csv")

Because of this restriction, I decided to abandon this way of getting information, since it would take at least a month to get the amount of data that I wanted.

Since I planned to use this for a school presentation as well, that was not something I could afford. So I decided to try to find a different source of data.

How I got the MBTI Types from Tumblr Descriptions

Since I wanted to keep track of which texts came from which blogs, I figured the best way would be to divide the descriptions by type. And since I had almost 13000 blogs with descriptions, there was no way I was doing it by hand.

To do that, I assumed that as long as a description mentioned one type and no others, it should be a good indicator that the writer is of that type.

    import collections

    types = ["intp", "intj", "istp", "istj", "estj", "entj", "estp", "entp", "infp", "infj", "isfp", "isfj", "esfp", "esfj", "enfj", "enfp"]
    filename = "file.csv"

    def countTags(singlefile):
        typesfreq = collections.defaultdict(set)
        with open(singlefile, "r") as read:
            content = read.readlines()
            for line in content:
                fields = line.split("\t")
                description = fields[2].lower()
                for mbtitype in types:
                    if mbtitype in description:
                        typesfreq[mbtitype].add(fields[0])
        return typesfreq

    typesfreq = countTags(filename)

    for mbtitype in types:
        for mbtitype2 in types:
            if not mbtitype == mbtitype2:
                intersection = typesfreq[mbtitype].intersection(typesfreq[mbtitype2])
                for element in intersection:
                    typesfreq[mbtitype].discard(element)
                    typesfreq[mbtitype2].discard(element)

    with open(filename, "r") as read:
        content = read.readlines()

    for mbtitype in types:
        print(mbtitype)
        sites = typesfreq[mbtitype]
        with open("blogs" + mbtitype + ".csv", "a") as write:
            for line in content:
                splitline = line.split("\t")
                if not splitline:
                    continue
                if splitline[0].strip() in sites:
                    write.write(line)

There is probably still something wrong with the code, since about 1% of descriptions still had at least 2 MBTI types in them. But overall, I was quite satisfied with the results. Less than 5% of blogs did not belong to the type they were classified under.
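A quick way to check this would be a sketch like the following, which counts how many lines in one of the per-type files still mention more than one distinct type (the file name is just an example of the blogs<type>.csv files written above):

    types = ["intp", "intj", "istp", "istj", "estj", "entj", "estp", "entp",
             "infp", "infj", "isfp", "isfj", "esfp", "esfj", "enfj", "enfp"]

    def count_multi_type(filename):
        # count classified descriptions that still mention 2+ types
        count = 0
        with open(filename, "r") as read:
            for line in read:
                fields = line.split("\t")
                if len(fields) < 3:
                    continue
                description = fields[2].lower()
                if len([t for t in types if t in description]) > 1:
                    count += 1
        return count

    print(count_multi_type("blogsintp.csv"))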

Getting the List of Blog Descriptions from Tumblr

Last time, before I went on a detour, I ended up with the list of Tumblr blogs that used MBTI-related tags. What I needed to do now was figure out for which of these blogs I could find the MBTI type of the writer.

I spend too much of my time browsing Tumblr, so I knew that a lot of people write their MBTI types in their blog descriptions. I had the blog addresses; now I only needed to get their descriptions as well.

    import pytumblr
    import re
    import time

    files = ["file.csv"]

    #this is a regex, to be able to get the url of the Tumblr blog from the url post
    tumblr_url = r"[\w-]+.tumblr.com"

    # Authenticate via OAuth
    client = pytumblr.TumblrRestClient()

    sites = set()
    with open("blogs.csv", "r") as data:
        content = data.readlines()
        for line in content[1:]:
            line = line.split("\t")
            sites.add(line[0].strip())

    postsid = set()

    for singlefile in files:
        with open("blogs.csv", "a") as write:
            write.write("user url" + "\t" + "number of posts" + "\t" + "description" + "\n")
            with open(singlefile) as read:
                content = read.readlines()
                for line in content[1:]:
                    fields = line.split("\t")
                    if fields[0].strip() in postsid:
                        continue
                    if fields[1].strip() == "post_url":
                        continue
                    if not "tumblr.com" in fields[1].strip():
                        continue
                    postsid.add(fields[0].strip())

                    #now find the blog url from the post url
                    user_url = re.search(tumblr_url, fields[1].strip())
                    user_url = user_url.group()

                    if user_url in sites:
                        print("SKIP: " + user_url)
                        continue

                    #get the information about the blog
                    blog = client.blog_info(user_url)

                    try:
                        #get blog description and number of posts
                        blog_description = blog[u"blog"][u"description"]
                        number_of_posts = blog[u"blog"][u"posts"]

                        blog_description = blog_description.replace("\n", " ")

                        blog_description = blog_description.encode("utf8")

                        write.write(user_url + "\t" + str(number_of_posts) + "\t" + blog_description + "\n")

                        sites.add(user_url)
                        print("ADDED: " + user_url + "   :)")
                    except KeyError:
                        if blog["meta"]["msg"] == "Limit Exceeded":
                            print(blog)
                            print("SLEEP TIME")
                            time.sleep(3600)

This is when I started to add an automated way to keep track of the API calls. Tumblr has per-hour and per-day limits on how many calls a person can make. But when a person exceeds the limit, there is no exception; the response comes back as JSON with an error message written inside.

So what I added was that, if there was a problem, the script checked whether it was JSON with an error message, and then waited for an hour.
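Pulled out as a helper, the pattern looks something like this minimal sketch (as in the code above, pytumblr hands back the API's JSON error body as a dictionary instead of raising an exception):

    import time

    def wait_if_limited(response):
        # the rate-limit "error" is just a meta message in the returned JSON
        try:
            if response["meta"]["msg"] == "Limit Exceeded":
                print("SLEEP TIME")
                time.sleep(3600)
                return True
        except (KeyError, TypeError):
            pass
        return False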

I then removed the duplicates the same way as before.

What Tags Are Used With Different MBTI Tags on Tumblr

While working on my personality type project, I also made a detour into tag analysis. I wanted to know what I could find out from the kinds of tags people use. For the first exploratory analysis, I used the code below.

    data <- read.csv("results.csv", header=TRUE, sep="\t")

    gettagbarplot <- function(data) {
        # sort the tags by descending frequency
        data <- data[order(-data$freq),]
        png()
        # pie chart and bar plot of the most frequent tags
        pie(head(data$freq, 20), labels = head(data$tag, 20))
        barplot(head(data$freq, 10), names.arg = head(data$tag, 10))
        # frequency distribution over the 500 most frequent tags
        barplot(head(data$freq, 500))
        dev.off()
    }

    intp <- subset(data, type=="intp")
    gettagbarplot(intp)

The first plot is the following.

INTP Most Frequent Tags

Yes, I am aware that the pie chart is one of the worst ways to represent data. But I also wanted to learn how to make one, in case I ever get a weird request.

For all of you purists, here is a different representation that shows quite similar data.

INTP most frequent tags

Here I think I need to put in a warning for people trying to interpret this graph. These tags were taken from posts that were tagged with either 'mbti' or any of the type tags. So the overrepresentation of these tags could be an artifact of that.

And it shows which tags appear together, not which types use which tags.

Still, does the first non-type-like tag for INTPs have to be 'intp problems'? I mean, we don't have that many problems, right?

But then again, here is the list of the most frequent tags for each type, if we ignore the type tags (intp, istp, infj, estj, enfp, ...) and the mbti tags (mbti, myers briggs, mbti types, all types, ...). A sketch of how this filtering could be done follows the table.

type   1st most frequent     2nd most frequent     3rd most frequent     4th most frequent
intp   intp problems         intp things           introvert             intp thoughts
intj   intj problems         introvert             intj thoughts         personality
infp   introvert             personal              infp thoughts         personality
infj   introvert             infj problems         personal              personality
istp   introvert             personality           cognitive functions   submission
istj   personality           introvert             submission            cognitive functions
isfp   personality           introvert             personal              cognitive functions
isfj   personal              personality           introvert             submission
entp   entp problems         ne                    personality           submission
entj   personality           submission            cognitive functions   psychology
enfp   enfp problems         personality           ne                    personal
enfj   personality           personal              cognitive functions   enfj problems
estp   cognitive functions   personality           submission            se
estj   personality           cognitive functions   submission            mine
esfp   personality           submission            cognitive functions   se
esfj   personality           cognitive functions   submission            fe
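The filtering itself could look something like this sketch, which reads the frequency file written by the tag-frequency script further down (type, tag, and frequency columns), drops the ignored tags, and keeps the four most frequent tags per type; the exact ignore list is my choice here:

    ignored = set(["mbti", "myers briggs", "mbti types", "all types",
                   "intp", "intj", "istp", "istj", "infp", "infj", "isfp", "isfj",
                   "entp", "entj", "estp", "estj", "enfp", "enfj", "esfp", "esfj"])

    toptags = {}
    with open("results.csv", "r") as read:
        for line in read:
            fields = line.strip().split("\t")
            if len(fields) != 3:
                continue
            mbtitype, tag, freq = fields
            if tag in ignored:
                continue
            try:
                toptags.setdefault(mbtitype, []).append((int(freq), tag))
            except ValueError:
                continue  # skip the header line

    # print the four most frequent remaining tags per type
    for mbtitype, tags in toptags.items():
        print(mbtitype, [tag for freq, tag in sorted(tags, reverse=True)[:4]])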

What we can see from the table is that introversion is talked about more than extraversion, and usually in connection with the introverted types. On the other hand, cognitive functions are usually discussed in connection with the extraverted types, be it overall or as specific functions.

There are some unsurprising things, like people talking about personality in connection with the personality types.

I am not entirely sure what the submission tag is supposed to mean. I hope it is submission as in 'I submit something to the art competition' rather than 'I submit to my master'. It could be either.

For the last part, I would like to show the distribution of the tags.

INTP tag frequency distribution

I have plotted here the 500 most frequent tags. As you can see from the picture, the power law is easily seen. It shows that just a small number of tags is used often in combination with this tag, while there are many tags that are used only a couple of times or even only once.
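The same thing can be checked more directly with a log-log plot, where a power law shows up as a roughly straight line. A minimal sketch, assuming the same results.csv file with type, tag, and frequency columns:

    import matplotlib.pyplot as plt

    freqs = []
    with open("results.csv", "r") as read:
        for line in read:
            fields = line.strip().split("\t")
            if fields[0] != "intp":
                continue
            try:
                freqs.append(int(fields[-1]))
            except ValueError:
                continue

    # rank/frequency plot on a log-log scale
    freqs.sort(reverse=True)
    plt.loglog(range(1, len(freqs) + 1), freqs)
    plt.xlabel("tag rank")
    plt.ylabel("tag frequency")
    plt.savefig("intp-tag-distribution.png")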

If you want to do some other interesting analysis, you are free to take my tag file.

I Made a Detour With Tumblr By Calculating Tag Frequency

While I was waiting for the next batch of data to be collected, I decided to make a little detour. I remembered an article that calculated the frequency of words for the different Big Five types and found some interesting differences. I took that and tried to figure out whether a difference can also be seen in the usage of different tags.

The following code was used to get the frequency of the tags.

    allTags = set(["intp", "intj", "istp", "istj", "infp", "infj", "isfp", "isfj", "entp", "entj", "estj", "estp", "esfp", "esfj", "enfp", "enfj"])

    def getNeighborTag(singlefile):
        with open(singlefile, "r") as read:
            content = read.readlines()
            tagsDictionary = dict()
            for line in content[1:]:
                currentTags = line.split("\t")[2]
                currentTags = getListFromString(currentTags)
                currentTags = cleanTumblrTags(currentTags)
                for tag in currentTags:
                    if tag in allTags:
                        for tagNeighbor in currentTags:
                            if tagNeighbor == tag:
                                continue
                            if not tag in tagsDictionary:
                                tagsDictionary[tag] = dict()
                            if not tagNeighbor in tagsDictionary[tag]:
                                tagsDictionary[tag][tagNeighbor] = 0
                            tagsDictionary[tag][tagNeighbor] = tagsDictionary[tag][tagNeighbor] + 1
        return tagsDictionary

    def getFrequencyDictionary(allTags, singlefile, output):
        tagsDict = getNeighborTag(singlefile)
        for key in tagsDict.keys():
            tagsDict[key] = reverseFrequencyDictionary(tagsDict[key])
        with open(output, "w") as write:
            for key in tagsDict.keys():
                for freq, tags in tagsDict[key].items():
                    for tag in tags:
                        write.write(key + "\t" + tag + "\t" + str(freq) + "\n")
        return None

    def reverseFrequencyDictionary(inputDict):
        finalDict = dict()
        for key, value in inputDict.items():
            if not value in finalDict:
                finalDict[value] = set()
            finalDict[value].add(key)              
        return finalDict


    def getListFromString(listAsString):
        listAsString = listAsString[1:-1].split(", ")
        for i in range(len(listAsString)):
            if "u'" in listAsString[i]:
                listAsString[i] =  listAsString[i][2:-1]
        return listAsString

    def cleanTumblrTags(tagList):
        # strip the "c: " prefix some tags have and lowercase everything
        for i in range(len(tagList)):
            if tagList[i].startswith("c: "):
                tagList[i] = tagList[i][3:]
            tagList[i] = tagList[i].lower()
        return tagList

    getFrequencyDictionary(allTags, "input-file.csv", "output-file.csv")

First Cleaning of Tumblr Data

After using the script to get the tagged posts, I quickly figured out that there were duplicate posts included in the data. Since they were spread across multiple files, I figured that doing the cleaning in a spreadsheet program was probably going to be a pain in the ass. I apologize here to all my professors of business informatics, but I don't care if most companies use Excel. This program is still a pain in the ass to use for anything but calculations.

What this small script does is take the list of all the files to check. After that, it writes only the first appearance of each line to the output file and discards all the rest.

    def removeDuplicate(singlefile):
        with open(singlefile, "r") as read:
            content = read.readlines()
            with open(singlefile, "w") as write:
                readLines = set()
                for line in content:
                    if line not in readLines:
                        write.write(line)
                        readLines.add(line)
        return None

    def removeDuplicateMultipleFile(files, output):
        with open(output, "w") as write:
            addedLines = set()
            for singlefile in files:
                with open(singlefile, "r") as read:
                    content = read.readlines()
                    for line in content:
                        if line not in addedLines:
                            write.write(line)
                            addedLines.add(line)
        return None


    files = ["file.csv"]

    removeDuplicateMultipleFile(files, "file-cleaned.csv")

I am pretty sure that I am going to use this in the future as well.

How I Got Tagged Posts from Tumblr

Recently, I wrote about how I am going to use Tumblr data for my project, and how I could attempt to do this.

After that, my code still needed some work before I could just start the script and get the data. Below is the code that I eventually used to get a lot of posts with different tags.

What this code does is take a certain tag and then write to a file the id of the post, the URL of the post, the list of all its tags, and the timestamp.

    import pytumblr

    # Authenticate via OAuth
    client = pytumblr.TumblrRestClient("")

    #find 20 posts tagged MBTI
    def getTumblrPosts(before=None, tag="MBTI", limit=20, filename="data.csv"):
        posts = client.tagged(tag, limit=limit, before=before)
        if len(posts) == 0:
            return None
        with open(filename, "a") as f:
            #get information out of the post
            f.write("post_id" + "\t" + "post_url" + "\t" + "tags" + tag + "\t" + "timestamp" + "\n")
            for post in posts:             
                if post["type"] == "text":
                    timestamp = post[u"timestamp"]
                    post_id = post[u"id"]
                    tags = post[u"tags"]
                    if not tags:
                        tags = ""
                    post_url = post[u"post_url"]
                    string = str(post_id) + "\t" + post_url + "\t" + str(tags) + "\t" + str(timestamp) + "\n" 
                    string = string.encode("utf8")
                    try:
                        f.write(string)
                    except UnicodeEncodeError:
                        print("Unicode Error:" + post_url)
                        pass
                else:
                    pass
        return post[u"timestamp"]

    #Loop the post getting function
    def getData(tag, timestamp=None, filename="data.csv"):
        while 1:
            print(tag)
            timestamp = getTumblrPosts(before=timestamp, tag=tag, limit=20, filename=filename)
            if not timestamp:
                return None

    timestamp = getData("tag", filename="file-name.csv")

I used this code to get posts for all the Myers-Briggs related tags, like MBTI and the names of all the types (INTP, ENFP, ESFJ, ESTJ, INTJ, ISFP, …). In the end I got around 13 MB of data this way.

One thing that I did not yet include in this code is keeping track of how many API calls I can still make. I am sure that if I did that, I would be able to let the script run longer and get more data.
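A minimal sketch of what that bookkeeping could look like (the per-hour budget here is an assumption; the real numbers are in Tumblr's API documentation):

    import time

    CALLS_PER_HOUR = 1000  # assumed budget, not the documented limit

    calls = 0
    hour_start = time.time()

    def track_call():
        # call this before every API request; sleeps out the rest of the
        # hour once the budget is used up
        global calls, hour_start
        if time.time() - hour_start > 3600:
            calls = 0
            hour_start = time.time()
        calls += 1
        if calls >= CALLS_PER_HOUR:
            time.sleep(max(0, 3600 - (time.time() - hour_start)))
            calls = 0
            hour_start = time.time()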

I also ignored all the non-ASCII characters. This is one of the drawbacks of this script, as it is still using Python 2, and Python 2 is the one with the pain-in-the-ass encoding. I mean, it is not hard, I just don't know why a programmer should have to bother with it.

Well, pytumblr, the library that I am using to access Tumblr, is written in Python 2, and I did not find a Python 3 version, so that is why I used Python 2.
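For comparison, in Python 3 the same write would not need the manual encoding at all; a sketch (write_post is a hypothetical helper, not part of the script above):

    def write_post(filename, post_id, post_url, tags, timestamp):
        # opening the file with an explicit encoding makes .encode("utf8")
        # and the UnicodeEncodeError handling unnecessary
        with open(filename, "a", encoding="utf8") as f:
            f.write("\t".join([str(post_id), post_url, str(tags), str(timestamp)]) + "\n")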

Are You Productive?

I have a confession to make: I am a productivity info addict. Just like every addict, there are things I can't resist, and one of them is information about how to improve productivity. The other addiction, or guilty pleasure, that I have is chocolate.

But just because I am reading all this information does not mean that I am actually more productive. I would consider myself rather average at it. If all this reading does not result in change, then it is just entertainment, an alternative to tuning in for another episode of The Big Bang Theory.

This is mostly a tension between epistemic and pragmatic action. Reading, at least in my case, is mostly an epistemic activity. But being productive would usually be in the realm of pragmatic action.

Let me first explain what I mean by these two terms. The difference is in the intent of the action.

Epistemic action

An epistemic action is an action that we do to help our cognitive process. It is rotating the shape in a Tetris game. It is writing down notes or doing calculations on paper. I mean, I am not the only one who still does this, right?

So when I say that for me reading productivity information is an epistemic activity, I mean that these actions frequently lead to a change in my cognitive process, but they are a lot less likely to lead to being more productive. At least I can't really show much for it.

Pragmatic action

On the other hand, the purpose of a pragmatic action is a change in the world: for example, building a house, making roads, or voting in an election. Though the last one also frequently does not lead to any change, the purpose is there.

And this is where productivity should reside. The main purpose of increasing productivity is to make a mark on the world, be it by making more things that change the world in some way or by starting to make things that change it more.

This is what I still struggle with. In my twenty-something years, I have yet to produce something that leaves a satisfying mark on the world. I have not even come close.

I guess that just like some people escape reality through television, I escape it through productivity porn, ending up addicted to the feeling that this is something that will be easily managed.

Why are you reading it?

The Constant Attention

What is your attention focused on right now? I recently started a class on first-person research, and we spend a lot of time going over our own experience. For example, my problem solving most of the time is either asking myself questions in my inner voice or placing verbal words into space.

Because of that, I have become more aware of where my attention is focused. And it is a sad thing that most of my attention is spent on the inner voice. I have commentary on random concepts or ideas, I argue with myself, I analyse my own thinking and actions, and so on.

I wonder if this is why I find walks where I make up some story so relaxing. My inner voice has something simple but interesting to do.

I then talked to a guy in my class who has introverted intuition (Ni) as his dominant function. He was saying that he just observes what is going on all the time, without evaluating it.

I wonder if this has something to do with the differences in Jung's functions. My dominant function, which is introverted thinking (Ti), is concerned with building theories, regardless of the outside data. And this is what my inner voice is doing. It tries to build theories, though even I am aware that at this stage they are most likely not a good way to explain the world.

I will be the first to admit that I still don't understand introverted intuition (Ni), but from what I understood from conversations and Jung's writings, introverted intuition is a constant sensing of the inner world. It is like they are constantly aware of the experience they are having, which is quite similar to what the guy from my class said happens to him.

So I started to wonder whether the other functions work the same way. I have a little understanding of the mentality of extraverted feeling (Fe), even if I lack its tools. People with extraverted feeling evaluate the meaning of everything based on the group's opinion. Their attention goes to being constantly aware of how actions are going to be perceived.

Extraverted intuition (Ne), meanwhile, is more concerned with new possibilities for already existing objects. So the mentality of this type is similar to a brainstorming session going well. There are just constantly new ideas, which lead to new ideas, which lead to new ideas, ...

The others would be just conjecture, but the extraverted thinking type (Te) could potentially keep their attention on data and rules, introverted feeling (Fi) on their feelings (not emotions!) or values, introverted sensing (Si) on the impressions it is getting, and extraverted sensing (Se) on perception of the outside world.

Makes me wonder how much this changes the perception of the world...