Scraping PersonalityCafe with Python and BeautifulSoup

In the previous post, I described how I figured out that scraping Tumblr would take too much time. I did not want to spend more than a month just collecting the data, because then I would run out of time for the analysis.

But I had already registered the project, so I could not simply change it. I had to think about where else I could get texts for which I would know the MBTI type of the author. I first thought about Reddit, but after the problems I had with the Tumblr API, I decided not to go that route.

I eventually remembered the PersonalityCafe forum. There, at least on some subforums, people put their MBTI type under their handle. So I figured that if I scraped a subset of the forum, I would have the information I needed.

As a bonus, the whole data collection took less than 48 hours.

Here is the code that I used to scrape their forum:

    import urllib.request
    from bs4 import BeautifulSoup
    import re

    gettype = r"\b[A-Z]{4}\b"
    getpagenumber = r"-(\d*?)\.html"

    webpages = [""]  # list of subforum URLs to scrape (contents omitted here)

    def gettextfrompersonalitycaffee(webpage):
        page = urllib.request.urlopen(webpage)
        content = page.read()
        soup = BeautifulSoup(content, 'html.parser')
        allposts = soup.find_all("div", class_="content")
        users = soup.find_all("div", class_="userinfo")
        infos = zip(users, allposts)
        for user, post in infos:
            post = post.find_all("blockquote")[0]
            # remove any embedded <script> tag before extracting the text
            if post.script:
                post.script.decompose()
            post = post.get_text()
            pertype = re.search(gettype, user.get_text())
            if not pertype:
                continue
            # the matched four-letter type doubles as the output file name
            pertype = pertype.group(0)
            with open(pertype, "a") as write:
                write.write(post + "\n\n\n")
    def getnumberofpages(webpage):
        page = urllib.request.urlopen(webpage)
        content = page.read()
        soup = BeautifulSoup(content, 'html.parser')
        numbers = soup.find_all("span", class_="first_last")
        if not numbers:
            return None
        link = numbers[-1].find_all("a")
        link = link[0].get("href")
        number = re.search(getpagenumber, link)
        number = number.group(0).replace(".html", "").replace("-", "")
        return int(number)

    def getlinksfromfrontpage(webpage):
        allthreads = []
        page = urllib.request.urlopen(webpage)
        content = page.read()
        soup = BeautifulSoup(content, 'html.parser')
        threads = soup.find_all("h3", class_="threadtitle")
        for thread in threads:
            link = thread.find_all("a")
            link = link[0].get("href")
            allthreads.append(link)
        return allthreads

    def getallthreadlinks(website, number):
        allwebsites = []
        if not number:
            return [website]
        for i in range(number):
            webpage = website.split(".")
            webpage[-2] = webpage[-2] + "-" + str(i + 1)
            webpage = ".".join(webpage)
            allwebsites.append(webpage)
        return allwebsites

    for webpage in webpages:
        alllinks = getlinksfromfrontpage(webpage)
        for link in alllinks:
            number = getnumberofpages(link)
            if not number:
                allthreadlinks = [link]
            else:
                allthreadlinks = getallthreadlinks(link, number)
            for threadlink in allthreadlinks:
                gettextfrompersonalitycaffee(threadlink)
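The two regular expressions do most of the interesting work here: `gettype` grabs the first run of exactly four capital letters from the user-info block (which is where members display their MBTI type), and `getpagenumber` pulls the page number out of a thread URL. A small sketch of how they behave, using a made-up user block and URL:

```python
import re

gettype = r"\b[A-Z]{4}\b"
getpagenumber = r"-(\d*?)\.html"

# Made-up user-info block: handle on one line, MBTI type below it.
userinfo = "SomeUser\nINTP\nJoined Mar 2014"
pertype = re.search(gettype, userinfo)
print(pertype.group(0))  # INTP

# Made-up thread URL; multi-page threads end in "-<page>.html".
link = "http://personalitycafe.com/example-thread-12.html"
number = re.search(getpagenumber, link)
number = number.group(0).replace(".html", "").replace("-", "")
print(int(number))  # 12
```

Note that `\b[A-Z]{4}\b` will also match any other four-letter all-caps word that happens to appear first in the user block, which is why only the 16 valid MBTI tags should be trusted downstream.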

There were parts that I added later because I hit an error and wanted to make sure I would not have to start over from the beginning because of it. But since the error never repeated itself, I decided to post the code without them.

From this script, I got about 330 MB of data, of which about 300 MB was tagged with valid MBTI types. The data was stored so that the MBTI tag was the name of the file, and inside it the posts were separated by three newlines (\n\n\n).
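Since each file is named after an MBTI type and the posts inside are separated by three newlines, reading the data back is a single split. A minimal sketch, with an inline string standing in for the contents of one of the files:

```python
# The string below stands in for the contents of a scraped file such as
# "INTP"; posts are separated by three newlines, so one split recovers them.
raw = "first post\n\n\nsecond post\n\n\nthird post"
posts = raw.split("\n\n\n")
print(posts)  # ['first post', 'second post', 'third post']
```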

From time to time I later wondered whether this was the best format, but it is the one I ultimately ended up using.