Recently, I wrote about how I am going to use Tumblr data for my project, and how I could attempt to do this.
After that, my code still needed some work before I could just start the script and get the data. Below is the code that I eventually used to get a lot of posts with different tags.
What this code does is take a certain tag and then, for each text post with that tag, write to a file the id of the post, its URL (which includes the name of the blog), the list of all its tags and its timestamp.
import pytumblr

# Authenticate via OAuth (put your own consumer key here)
client = pytumblr.TumblrRestClient("")


# Find up to 20 posts with the given tag, older than the `before` timestamp
def getTumblrPosts(before=None, tag="MBTI", limit=20, filename="data.csv"):
    posts = client.tagged(tag, limit=limit, before=before)
    if len(posts) == 0:
        return None
    # "a" appends to the file, so every batch of posts ends up in the same file
    with open(filename, "a") as f:
        # Note: this header line is written again for every batch
        f.write("post_id" + "\t" + "post_url" + "\t" + "tags" + tag + "\t" + "timestamp" + "\n")
        # Get information out of each post, keeping only the text posts
        for post in posts:
            if post["type"] == "text":
                timestamp = post[u"timestamp"]
                post_id = post[u"id"]
                tags = post[u"tags"]
                if not tags:
                    tags = ""
                post_url = post[u"post_url"]
                string = str(post_id) + "\t" + post_url + "\t" + str(tags) + "\t" + str(timestamp) + "\n"
                # Python 2: turn the unicode string into UTF-8 bytes before writing
                string = string.encode("utf8")
                try:
                    f.write(string)
                except UnicodeEncodeError:
                    print("Unicode Error:" + post_url)
    # The timestamp of the last post is where the next request continues from
    return post[u"timestamp"]


# Loop the post getting function until no more posts come back
def getData(tag, timestamp=None, filename="data.csv"):
    while 1:
        print(tag)
        timestamp = getTumblrPosts(before=timestamp, tag=tag, limit=20, filename=filename)
        if not timestamp:
            return None


getData("tag", filename="file-name.csv")
I used this code to get posts for all the Myers-Briggs related tags, like MBTI and the names of all the types (like: INTP, ENFP, ESFJ, ESTJ, INTJ, ISFP,…). In the end I got around 13 MB of data this way.
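Going through all the tags can itself be scripted. The snippet below is only a sketch of one way to do it: `mbti_tags` is a made-up helper name, and it builds the 16 type names from the four letter pairs instead of typing them out by hand. The commented loop at the end assumes the `getData` function from the code above, and the one-file-per-tag naming is just an illustration.

```python
from itertools import product


def mbti_tags():
    """Return "MBTI" followed by the 16 four-letter type names."""
    # One letter from each pair, in the usual order: E/I, N/S, T/F, J/P
    types = ["".join(letters) for letters in product("EI", "NS", "TF", "JP")]
    return ["MBTI"] + types


# Hypothetical usage with the getData function from the script above:
# for tag in mbti_tags():
#     getData(tag, filename=tag + ".csv")
```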
One thing that I did not yet include in this code is keeping track of how many API calls I can still make. I am sure that if I did that, I would be able to let the script run longer and get more data.
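A rough sketch of how such call tracking could look is below. It is not part of the script above. `CallBudget` and `wait_for_budget` are hypothetical names, and the limit and window are parameters, because the real numbers have to come from the Tumblr API documentation. The idea is simply to count calls inside a fixed time window and to sleep until the window resets once the budget is used up.

```python
import time


class CallBudget(object):
    """Counts calls inside a fixed time window."""

    def __init__(self, max_calls, window_seconds, clock=time.time):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock  # injectable, so the class can be tested without waiting
        self.window_start = clock()
        self.calls = 0

    def _roll_window(self):
        # Start a fresh window (and reset the counter) once the old one has expired
        now = self.clock()
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.calls = 0

    def remaining(self):
        self._roll_window()
        return self.max_calls - self.calls

    def seconds_until_reset(self):
        self._roll_window()
        return max(0.0, self.window_start + self.window_seconds - self.clock())

    def record_call(self):
        self._roll_window()
        self.calls += 1


def wait_for_budget(budget, sleep=time.sleep):
    """Sleep until a call is available, then count it as used."""
    while budget.remaining() <= 0:
        sleep(max(budget.seconds_until_reset(), 0.01))
    budget.record_call()
```

With something like this, the loop in `getData` could create one budget up front, for example `budget = CallBudget(max_calls=1000, window_seconds=3600)` (the 1000 calls per hour is an assumed placeholder, not a checked value), and call `wait_for_budget(budget)` right before every `getTumblrPosts` call. The script would then pause instead of stopping when the limit is reached.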
I also ignored all the non-ASCII characters. This is one of the drawbacks of this script, as it is still using Python 2, and Python 2 is the one with the pain-in-the-ass encoding. I mean, it is not hard, I just don't know why a programmer should have to bother with it.
Well, pytumblr, the library that I am using to access Tumblr, is written in Python 2, and I did not find a Python 3 version, so that is why I used Python 2.
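For what it is worth, the non-ASCII characters could probably be kept even on Python 2. The sketch below is one possible approach, not what the script above does: it builds each row as a unicode string and lets `io.open` do the UTF-8 encoding, which behaves the same on Python 2 and Python 3. The names `format_row` and `append_row` are made up for this example, and joining the tags with commas is a choice of this sketch (the original writes the Python list as it is).

```python
import io


def format_row(post_id, post_url, tags, timestamp):
    """Build one tab-separated line as a unicode string."""
    fields = [u"%d" % post_id, post_url, u",".join(tags), u"%d" % timestamp]
    return u"\t".join(fields) + u"\n"


def append_row(filename, row):
    # io.open encodes the text on write, so no manual .encode() call is needed
    with io.open(filename, "a", encoding="utf-8") as f:
        f.write(row)
```

Inside the loop over the posts, the two calls would replace the string building and the `try`/`except` around `f.write`, assuming that `post_url` and the tags arrive as unicode strings, which is what the decoded JSON response normally gives.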