Blog of Sara Jakša

Getting the List of Blog Descriptions from Tumblr

Last time, before I went on a detour, I ended up with a list of Tumblr blogs that used MBTI-related tags. What I needed to do next was figure out for which of these blogs I could find the MBTI type of the writer.

I spend too much of my time browsing Tumblr, so I knew that a lot of people write their MBTI types in their blog descriptions. I had the blog web addresses; now I only needed to get their descriptions as well.

    import pytumblr
    import re
    import time

    files = ["file.csv"]

    #regex to pull the blog url (e.g. example.tumblr.com) out of a post url;
    #the dots are escaped so they only match literal dots
    tumblr_url = r"[\w-]+\.tumblr\.com"

    # Authenticate via OAuth (the real keys are left out here)
    client = pytumblr.TumblrRestClient(
        "CONSUMER_KEY",
        "CONSUMER_SECRET",
        "OAUTH_TOKEN",
        "OAUTH_SECRET",
    )

    #blogs that already have a saved description from an earlier run
    sites = set()
    with open("blogs.csv", "r") as data:
        content = data.readlines()
        for line in content[1:]:
            line = line.split("\t")
            sites.add(line[0].strip())

    #ids of the posts that were already processed
    postsid = set()

    for singlefile in files:
        with open("blogs.csv", "aw") as write:
            write.write("user url" + "\t" + "number of posts" + "\t" + "description" + "\n")
            with open(singlefile) as read:
                content = read.readlines()
                for line in content[1:]:
                    fields = line.split("\t")
                    if fields[0].strip() in postsid:
                        continue
                    if fields[1].strip() == "post_url":
                        continue
                    if not "tumblr.com" in fields[1].strip():
                        continue
                    postsid.add(fields[0].strip())

                    #now find the blog url from the post url
                    user_url = re.search(tumblr_url, fields[1].strip())
                    if user_url is None:
                        continue
                    user_url = user_url.group()

                    if user_url in sites:
                        print("SKIP: " + user_url)
                        continue

                    #get the information about the blog
                    blog = client.blog_info(user_url)

                    try:
                        #get blog description and number of posts
                        blog_description = blog["blog"]["description"]
                        number_of_posts = blog["blog"]["posts"]

                        #keep each record on a single line in the file
                        blog_description = blog_description.replace("\n", " ")

                        write.write(user_url + "\t" + str(number_of_posts) + "\t" + blog_description + "\n")

                        sites.add(user_url)
                        print("ADDED: " + user_url + "   :)")
                    except KeyError:
                        #a rate-limited response has no "blog" key, only "meta"
                        if blog["meta"]["msg"] == "Limit Exceeded":
                            print(blog)
                            print("SLEEP TIME")
                            time.sleep(3600)
This was the point where I started adding an automated way to keep track of the API calls. Tumblr has per-hour and per-day limits on how many calls a person can make. But when the limit is exceeded, there is no error; the API instead sends back JSON with an error message written inside.
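
Based on the check in the code above, such a rate-limited response has roughly this shape (only the "meta" and "msg" parts are what the script actually relies on; the status code and the rest of the fields are my assumption):

    #approximate shape of a rate-limited response; only meta/msg is
    #what the script checks, the other fields are an assumption
    blog = {
        "meta": {"status": 429, "msg": "Limit Exceeded"},
        "response": [],
    }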

So what I added was a check: if there was a problem, the script looks at whether the response is JSON with an error message, and if it is, it waits for an hour.
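
If I wanted to reuse this logic, it could be pulled out into a small helper. Here is a minimal sketch (the function name is mine, and unlike the loop above it retries the same blog after sleeping instead of moving on to the next one):

    import time

    def blog_info_with_retry(client, user_url):
        #keep asking until the answer is a real blog instead of an error
        while True:
            blog = client.blog_info(user_url)
            if blog.get("meta", {}).get("msg") == "Limit Exceeded":
                print("SLEEP TIME")
                time.sleep(3600)
                continue
            return blog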

I then removed the duplicates the same way I did before.
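
For completeness, the deduplication can be something like this minimal sketch, which keeps only the first row for each blog url (assuming the tab-separated blogs.csv format above; it is not necessarily the exact code from the earlier post):

    #keep only the first row per blog url, preserving the original order
    seen = set()
    unique_lines = []
    with open("blogs.csv", "r", encoding="utf8") as data:
        for line in data:
            key = line.split("\t")[0].strip()
            if key in seen:
                continue
            seen.add(key)
            unique_lines.append(line)

    with open("blogs.csv", "w", encoding="utf8") as write:
        write.writelines(unique_lines)

This also takes care of the repeated header lines, since they all start with the same "user url" field.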