Scraping PersonalityCafe with Python and BeautifulSoup

In the previous post, I described how I figured out that scraping Tumblr would take too much time. I did not want to spend more than a month just collecting the data, because then I would run out of time for the analysis.

But I had already registered the project, so I could not simply change it. I had to think about where else I could get texts for which I would know the MBTI type of the author. I first thought about Reddit, but after the problems I had with the Tumblr API, I decided not to go that route.

I eventually remembered the PersonalityCafe forum. There, at least on some subforums, people put their MBTI type under their handle. So I figured that if I scraped a subset of the forum, I would have the information I needed.

As a bonus, the whole data collection took less than 48 hours.

Here is the code that I used to scrape their forum:

    import urllib.request
    from bs4 import BeautifulSoup
    import re

    gettype = r"\b[A-Z]{4}\b"
    getpagenumber = r"-(\d*?)\.html"

    webpages = [""]  # list of subforum URLs to scrape (contents omitted here)

    def gettextfrompersonalitycaffee(webpage):
        page = urllib.request.urlopen(webpage)
        content = page.read()
        soup = BeautifulSoup(content, 'html.parser')
        allposts = soup.find_all("div", class_="content")
        users = soup.find_all("div", class_="userinfo")
        infos = zip(users, allposts)
        for user, post in infos:
            post = post.find_all("blockquote")[0]
            # remove any embedded <script> tag before extracting the text
            if post.script:
                post.script.decompose()
            post = post.get_text()
            pertype = re.search(gettype, user.get_text())
            if not pertype:
                continue
            # the matched four-letter type doubles as the output file name
            pertype = pertype.group(0)
            with open(pertype, "a") as write:
                write.write(post + "\n\n\n")
    def getnumberofpages(webpage):
        page = urllib.request.urlopen(webpage)
        content = page.read()
        soup = BeautifulSoup(content, 'html.parser')
        numbers = soup.find_all("span", class_="first_last")
        if not numbers:
            return None
        link = numbers[-1].find_all("a")
        link = link[0].get("href")
        number = re.search(getpagenumber, link)
        number = number.group(0).replace(".html", "").replace("-", "")
        return int(number)

    def getlinksfromfrontpage(webpage):
        allthreads = []
        page = urllib.request.urlopen(webpage)
        content = page.read()
        soup = BeautifulSoup(content, 'html.parser')
        threads = soup.find_all("h3", class_="threadtitle")
        for thread in threads:
            link = thread.find_all("a")
            link = link[0].get("href")
            allthreads.append(link)
        return allthreads

    def getallthreadlinks(website, number):
        allwebsites = []
        if not number:
            return [website]
        for i in range(number):
            webpage = website.split(".")
            webpage[-2] = webpage[-2] + "-" + str(i + 1)
            webpage = ".".join(webpage)
            allwebsites.append(webpage)
        return allwebsites

    for webpage in webpages:
        alllinks = getlinksfromfrontpage(webpage)
        for link in alllinks:
            number = getnumberofpages(link)
            if not number:
                allthreadlinks = [link]
            else:
                allthreadlinks = getallthreadlinks(link, number)
            for threadlink in allthreadlinks:
                gettextfrompersonalitycaffee(threadlink)
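The two regular expressions do most of the interesting work here: `gettype` grabs the first run of exactly four capital letters from the user-info block (which is where members display their MBTI type), and `getpagenumber` pulls the page number out of a thread URL. A small sketch of how they behave, using a made-up user block and URL:

```python
import re

gettype = r"\b[A-Z]{4}\b"
getpagenumber = r"-(\d*?)\.html"

# Made-up user-info block: handle on one line, MBTI type below it.
userinfo = "SomeUser\nINTP\nJoined Mar 2014"
pertype = re.search(gettype, userinfo)
print(pertype.group(0))  # INTP

# Made-up thread URL; multi-page threads end in "-<page>.html".
link = "http://personalitycafe.com/example-thread-12.html"
number = re.search(getpagenumber, link)
number = number.group(0).replace(".html", "").replace("-", "")
print(int(number))  # 12
```

Note that `\b[A-Z]{4}\b` will also match any other four-letter all-caps word that happens to appear first in the user block, which is why only the 16 valid MBTI tags should be trusted downstream.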

There were parts that I added later because I hit an error and wanted to make sure I would not have to start over from the beginning because of it. But since the error never repeated itself, I decided to post the code without them.

From this script, I got about 330 MB of data, of which about 300 MB was tagged with valid MBTI types. The data was stored so that the MBTI tag was the name of the file, and inside it the posts were separated by three newlines (\n\n\n).
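Since each file is named after an MBTI type and the posts inside are separated by three newlines, reading the data back is a single split. A minimal sketch, with an inline string standing in for the contents of one of the files:

```python
# The string below stands in for the contents of a scraped file such as
# "INTP"; posts are separated by three newlines, so one split recovers them.
raw = "first post\n\n\nsecond post\n\n\nthird post"
posts = raw.split("\n\n\n")
print(posts)  # ['first post', 'second post', 'third post']
```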

From time to time I later wondered whether this was the best format, but it is the one I ultimately ended up using.