Blog of Sara Jakša

How I Got Tagged Posts from Tumblr

Recently, I wrote about how I am going to use Tumblr data for my project, and how I could attempt to do this.

After that, my code still needed some work before I could just start the script and get the data. Below is the code that I eventually used to get a lot of posts with different tags.

What this code does is take a certain tag and then write to a file the id of each post, the URL of the post (which includes the name of the blog), the list of all its tags and its timestamp.

    import pytumblr

    # Authenticate via OAuth
    client = pytumblr.TumblrRestClient("")

    # Fetch up to `limit` posts with the given tag, published before `before`
    def getTumblrPosts(before=None, tag="MBTI", limit=20, filename="data.csv"):
        posts = client.tagged(tag, limit=limit, before=before)
        if len(posts) == 0:
            return None
        # open in append mode, so each batch is added to the end of the file
        with open(filename, "a") as f:
            # write the header only on the first request, not before every batch
            if before is None:
                f.write("post_id" + "\t" + "post_url" + "\t" + "tags" + "\t" + "timestamp" + "\n")
            for post in posts:
                # keep only the text posts
                if post["type"] == "text":
                    timestamp = post[u"timestamp"]
                    post_id = post[u"id"]
                    tags = post[u"tags"]
                    if not tags:
                        tags = ""
                    post_url = post[u"post_url"]
                    try:
                        string = str(post_id) + "\t" + post_url + "\t" + str(tags) + "\t" + str(timestamp) + "\n"
                        string = string.encode("utf8")
                        f.write(string)
                    except UnicodeEncodeError:
                        # skip the posts that can not be encoded
                        print("Unicode Error:" + post_url)
        # timestamp of the last post in this batch, used as `before` in the next request
        return posts[-1][u"timestamp"]

    # Keep requesting older posts until no more posts are returned
    def getData(tag, timestamp=None, filename="data.csv"):
        last_timestamp = timestamp
        while True:
            timestamp = getTumblrPosts(before=last_timestamp, tag=tag, limit=20, filename=filename)
            if not timestamp:
                # return the last timestamp seen, so the download can be resumed from there
                return last_timestamp
            last_timestamp = timestamp

    timestamp = getData("tag", filename="file-name.csv")

I used this code to get posts for all the Myers-Briggs related tags, like MBTI and the names of all the types (like: INTP, ENFP, ESFJ, ESTJ, INTJ, ISFP,…). In the end I got around 13 MB of data this way.
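Running the script for every tag can be sketched roughly like this. It is only an illustration: the names `TYPES`, `TAGS` and `collect_all` are made up for this example, and writing one file per tag is an assumption, not necessarily how the data above was collected.

```python
from itertools import product

# the 16 type names are every combination of the four letter pairs
TYPES = ["".join(letters) for letters in product("EI", "SN", "TF", "JP")]
TAGS = ["MBTI"] + TYPES


def collect_all(tags, get_data):
    """Call get_data(tag, filename=...) once for every tag, one file per tag."""
    for tag in tags:
        get_data(tag, filename=tag + ".csv")
```

With the functions from the script above, this would be started with `collect_all(TAGS, getData)`.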

One thing that I did not yet include in this code is keeping track of how many API calls I can still make. I am sure that if I did that, I would be able to let the script run longer and get more data.
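A minimal sketch of how such tracking could look is below. It counts the calls locally instead of asking Tumblr how many are left, and the class `CallBudget` with its parameters `max_calls` and `period` is hypothetical, made up for this example. The actual limits are not part of this post and would have to be taken from the Tumblr API documentation.

```python
import time
from collections import deque


class CallBudget(object):
    """Allow at most `max_calls` calls in any window of `period` seconds."""

    def __init__(self, max_calls, period, clock=time.time, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.calls = deque()  # times of the calls that are still inside the window

    def remaining(self):
        """Number of calls that can still be made right now."""
        now = self.clock()
        # forget the calls that have already left the window
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        return self.max_calls - len(self.calls)

    def wait(self):
        """Block until a call is allowed, then record it."""
        while self.remaining() <= 0:
            # sleep until the oldest recorded call leaves the window
            self.sleep(self.period - (self.clock() - self.calls[0]))
        self.calls.append(self.clock())
```

The idea would be to create one budget for the whole run and to call its `wait()` method right before every `client.tagged` request, so the script pauses instead of running into the limit.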

I also ignored all the non-ASCII characters. This is one of the drawbacks of this script, as it is still using Python2, and Python2 is the one where encoding is a pain in the ass. I mean, it is not hard, I just don't know why a programmer should have to bother with it.
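For what it is worth, one possible way to keep the non-ASCII characters is to let `io.open` do the encoding, since it behaves the same in Python2 and Python3. This is a sketch and not what the script above does: the function `append_row` is hypothetical, and joining the tags with commas is an assumption made for this example.

```python
import io


def append_row(filename, post_id, post_url, tags, timestamp):
    """Append one tab-separated row to `filename`, encoded as UTF-8."""
    fields = [u"%d" % post_id, post_url, u",".join(tags), u"%d" % timestamp]
    # io.open takes an encoding and expects unicode text, in Python2 as well as Python3
    with io.open(filename, "a", encoding="utf8") as f:
        f.write(u"\t".join(fields) + u"\n")
```

Because the file object does the encoding, there is no separate `encode` step that could fail on a tag with an accented letter in it.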

Well, pytumblr, the library that I am using to access Tumblr, is written in Python2, and I did not find a Python3 version, so that is why I used Python2.