Blog of Sara Jakša

Scraping PersonalityCafe with Python and BeautifulSoup

In the previous post, I described how I figured out that Tumblr would take too much time. I really did not want to spend more than a month just collecting enough data, because I would then run out of time for the analysis.

But I had already registered the project, so I could not just change it. I therefore had to think about where I could get texts for which I would know the writer's MBTI type. I first thought about Reddit, but after the problems I had with the Tumblr API, I decided not to go that route.

I eventually remembered the PersonalityCafe forum. There, at least on some subforums, people put their MBTI type under their handle. So I figured that if I scraped a subset of it, I would have the information I needed.

As a bonus, the whole data collection took less than 48 hours.
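
The whole approach hinges on pulling a four-letter type out of the text near a user's handle. Here is a minimal sketch of that idea; the user-info string is a made-up stand-in for the real markup:

    import re

    gettype = r"\b[A-Z]{4}\b"

    # a made-up example of what the text of a user info box might contain
    userinfo = "SomeUser\nINTP\n1,234 posts"
    match = re.search(gettype, userinfo)
    if match:
        print(match.group())  # prints: INTP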

Here is the code that I used to scrape their forum:

    import urllib.request
    from bs4 import BeautifulSoup
    import re

    gettype = r"\b[A-Z]{4}\b"
    getpagenumber = r"-(\d*?)\.html"


    webpages = ["http://personalitycafe.com/istj-forum-duty-fulfillers/", 
                "http://personalitycafe.com/intp-forum-thinkers/",
                "http://personalitycafe.com/isfj-forum-nurturers/",
                "http://personalitycafe.com/estj-forum-guardians/",
                "http://personalitycafe.com/esfj-forum-caregivers/",
                "http://personalitycafe.com/istp-forum-mechanics/",
                "http://personalitycafe.com/isfp-forum-artists/",
                "http://personalitycafe.com/estp-forum-doers/",
                "http://personalitycafe.com/esfp-forum-performers/",
                "http://personalitycafe.com/intj-forum-scientists/",
                "http://personalitycafe.com/entj-forum-executives/",
                "http://personalitycafe.com/entp-forum-visionaries/",
                "http://personalitycafe.com/infj-forum-protectors/",
                "http://personalitycafe.com/infp-forum-idealists/",
                "http://personalitycafe.com/enfj-forum-givers/",
                "http://personalitycafe.com/enfp-forum-inspirers/"]

    def gettextfrompersonalitycaffee(webpage):
        page = urllib.request.urlopen(webpage)
        content = page.read()
        soup = BeautifulSoup(content, 'html.parser')
        allposts = soup.find_all("div", class_="content")
        users = soup.find_all("div", class_="userinfo")
        infos = zip(users, allposts)
        for user, post in infos:
            # the actual text of the post sits inside a blockquote element
            post = post.find_all("blockquote")[0]
            # drop any embedded <script> tag so it does not end up in the text
            if post.script:
                post.script.decompose()
            post = post.get_text()
            # the poster's four-letter type comes from the user info box
            pertype = re.search(gettype, user.get_text())
            if not pertype:
                continue
            pertype = pertype.group()
            # append the post to a file named after the MBTI type
            with open(pertype, "a") as write:
                write.write(post)
                write.write("\n\n\n\n\n")

    def getnumberofpages(webpage):
        page = urllib.request.urlopen(webpage)
        content = page.read()
        soup = BeautifulSoup(content, 'html.parser')
        numbers = soup.find_all("span", class_="first_last")
        if not numbers:
            return None
        link = numbers[-1].find_all("a")
        link = link[0].get("href")
        number = re.search(getpagenumber, link)
        # the regex already captures just the digits, e.g. 17 from "...-17.html"
        return int(number.group(1))

    def getlinksfromfrontpage(webpage):
        allthreads = []
        page = urllib.request.urlopen(webpage)
        content = page.read()
        soup = BeautifulSoup(content, 'html.parser')
        threads = soup.find_all("h3", class_="threadtitle")
        for thread in threads:
            link = thread.find_all("a")
            link = link[0].get("href")
            allthreads.append(link)
        return allthreads

    def getallthreadlinks(website, number):
        allwebsites = []
        if not number:
            return [website]
        for i in range(number):
            webpage = website.split(".")
            webpage[-2] = webpage[-2] + "-" + str(i + 1)
            webpage = ".".join(webpage)
            allwebsites.append(webpage)
        return allwebsites

    for webpage in webpages:
        print(webpage)
        alllinks = getlinksfromfrontpage(webpage)
        for link in alllinks:
            number = getnumberofpages(link)
            # getallthreadlinks returns the link itself for single-page threads
            allthreadlinks = getallthreadlinks(link, number)
            for threadlink in allthreadlinks:
                gettextfrompersonalitycaffee(threadlink)

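In short, the script visits the front page of each subforum, collects the thread links from it, works out how many pages each thread has from the pagination block, and then builds the per-page URLs by splicing the page number into the thread URL. Note that only the first page of each subforum's thread listing is visited, so this collects just a subset of the threads, which was the plan anyway. With a made-up thread slug, getallthreadlinks expands a link like this:

    >>> getallthreadlinks("http://personalitycafe.com/intp-forum-thinkers/12345-example-thread.html", 3)
    ['http://personalitycafe.com/intp-forum-thinkers/12345-example-thread-1.html',
     'http://personalitycafe.com/intp-forum-thinkers/12345-example-thread-2.html',
     'http://personalitycafe.com/intp-forum-thinkers/12345-example-thread-3.html']
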
There were parts that I later added because I got an error, and I wanted to make sure I would not need to start from the beginning because of it. But the error never repeated itself, so I decided to post the code without them.
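
For anyone who runs into the same thing, a sketch of the kind of guard I mean, one that skips a page that fails to download instead of crashing the whole run:

    import urllib.error

    for threadlink in allthreadlinks:
        try:
            gettextfrompersonalitycaffee(threadlink)
        except urllib.error.URLError as error:
            # skip pages that fail to download instead of stopping the crawl
            print("skipping", threadlink, error)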

From this script, I got about 330MB of data, of which about 300MB was tagged with valid MBTI types. But it was in a form where the MBTI tag was the name of the file, and inside it the posts were separated by runs of newlines (the script writes five \n characters after each post).
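
To read the posts back in, it is enough to split each file on that separator. A minimal sketch, assuming the sixteen type files (INTP, INFJ and so on) sit in the working directory:

    import os

    posts = {}
    for filename in os.listdir("."):
        # the type files are named with exactly four capital letters
        if len(filename) == 4 and filename.isupper():
            with open(filename) as infile:
                content = infile.read()
            posts[filename] = [p.strip() for p in content.split("\n\n\n\n\n") if p.strip()]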

From time to time I later wondered whether this was the best way to store the data, but it is the way I ultimately ended up using.