Last time, before I went on a detour, I ended up with a list of Tumblr blogs that used MBTI-related tags. What I needed to do now was figure out for which of these blogs I could find the MBTI type of the writer.
I spend too much of my time browsing Tumblr, so I knew that a lot of people write their MBTI type in their blog description. I already had the blog web addresses; now I only needed to get their descriptions as well.
```python
import pytumblr
import re
import time

files = ["file.csv"]

# this is a regex, to be able to get the url of the Tumblr blog from the post url
tumblr_url = r"[\w-]+\.tumblr\.com"

# Authenticate via OAuth (credentials go here)
client = pytumblr.TumblrRestClient()

# load the blog urls already collected, so they are not fetched again
sites = set()
with open("blogs.csv", "r") as data:
    content = data.readlines()
for line in content[1:]:
    fields = line.split("\t")
    sites.add(fields[0].strip())

postsid = set()
for singlefile in files:
    with open("blogs.csv", "a") as write:
        write.write("user url" + "\t" + "number of posts" + "\t" + "description" + "\n")
        with open(singlefile) as read:
            content = read.readlines()
        for line in content[1:]:
            post_url = line.split("\t")[0].strip()
            if post_url in postsid:
                continue
            if post_url == "post_url":
                continue
            if "tumblr.com" not in post_url:
                continue
            postsid.add(post_url)
            # now find the blog url from the post url
            user_url = re.search(tumblr_url, post_url)
            user_url = user_url.group()
            if user_url in sites:
                print("SKIP: " + user_url)
                continue
            # get the information about the blog
            blog = client.blog_info(user_url)
            try:
                # get blog description and number of posts
                blog_description = blog[u"blog"][u"description"]
                number_of_posts = blog[u"blog"][u"posts"]
                blog_description = blog_description.replace("\n", " ")
                write.write(user_url + "\t" + str(number_of_posts) + "\t" + blog_description + "\n")
                sites.add(user_url)
                print("ADDED: " + user_url + " :)")
            except KeyError:
                if blog["meta"]["msg"] == "Limit Exceeded":
                    print(blog)
                    print("SLEEP TIME")
                    time.sleep(3600)
```
This is where I started adding an automated way to keep track of the API calls. Tumblr has per-hour and per-day limits on how many calls a person can make. But when you exceed the limit, there is no exception; instead, the API sends back a JSON response with an error message written inside.
So what I added was a check: if there was a problem, the script looked at whether the response was JSON with an error message, and if so, waited for an hour.
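That retry logic can also be pulled into a small wrapper function. This is only a sketch, not code from the script above: `fetch_blog_info` and its parameters are names I made up, and it assumes the rate-limited response has the same `{"meta": {"msg": "Limit Exceeded"}}` shape shown earlier.

```python
import time

def fetch_blog_info(client, user_url, wait=3600, max_retries=3):
    """Call client.blog_info, sleeping and retrying when rate-limited.

    Hypothetical helper: assumes a rate-limited response is a plain dict
    with {"meta": {"msg": "Limit Exceeded"}} instead of the blog data.
    """
    for _ in range(max_retries):
        blog = client.blog_info(user_url)
        # a successful response carries the data under the "blog" key
        if "blog" in blog:
            return blog
        # a rate-limited response is JSON with an error message inside
        if blog.get("meta", {}).get("msg") == "Limit Exceeded":
            print("SLEEP TIME")
            time.sleep(wait)
        else:
            # some other error; give up on this blog
            return None
    return None
```

With this in place, the main loop would just call `fetch_blog_info(client, user_url)` and skip the blog whenever it returns `None`.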
I then removed the duplicates the same way I had before.
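The post does not repeat the deduplication code, so here is a minimal sketch of the idea, under the assumption that it works the same way as the sets in the script above: keep a set of user urls already seen and only keep the first line for each. The `dedupe_lines` name and the tab-separated layout with the url in the first column (matching the output file written above) are my assumptions.

```python
def dedupe_lines(lines):
    """Keep only the first line for each user url (first tab-separated column)."""
    seen = set()
    unique = []
    for line in lines:
        key = line.split("\t")[0].strip()
        if key in seen:
            continue
        seen.add(key)
        unique.append(line)
    return unique
```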