Blog of Sara Jakša

First Cleaning of Tumblr Data

After using the script to get the tagged posts, I quickly figured out that some posts were included multiple times. Since they were spread over multiple files, I figured that cleaning them up in a spreadsheet program was probably going to be a pain in the ass. I apologize here to all my professors of business informatics, but I don't care that most companies use Excel. It is still a pain in the ass to use for anything but calculations.

What this small script does is take the list of all the files to check. After that, it writes only the first appearance of each line to the output file and discards all the repeats.

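    def removeDuplicate(singlefile):
        # Read everything first, because opening the file for writing empties it.
        with open(singlefile, "r") as read:
            content = read.readlines()
        with open(singlefile, "w") as write:
            readLines = set()  # lines that were already written
            for line in content:
                # Keep only the first appearance of each line.
                if line not in readLines:
                    write.write(line)
                    readLines.add(line)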

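    def removeDuplicateMultipleFile(files, output):
        # Read all the inputs before opening the output. Opening a file with "w"
        # empties it, so the output can only be one of the inputs if it is read first.
        content = []
        for singlefile in files:
            with open(singlefile, "r") as read:
                content.extend(read.readlines())
        with open(output, "w") as write:
            addedLines = set()  # lines that were already written
            for line in content:
                # Keep only the first appearance of each line across all files.
                if line not in addedLines:
                    write.write(line)
                    addedLines.add(line)


    files = ["file.csv"]

    removeDuplicateMultipleFile(files, "file.csv")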
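
The list of files above is typed in by hand. Here is a minimal sketch of how it could be collected automatically. It assumes that all the exported files sit in one folder and end in `.csv`. The names `collect_csv_files` and `merge_unique_lines` and the folder `tumblr_export` are made up for this illustration and are not part of the script above. The sketch also strips the line ending before comparing, so a last line that is missing its newline still counts as a duplicate.

```python
import glob
import os


def collect_csv_files(folder):
    """Return the .csv files in `folder`, sorted so every run uses the same order."""
    return sorted(glob.glob(os.path.join(folder, "*.csv")))


def merge_unique_lines(files, output):
    """Write each distinct line from `files` to `output` once, in first-seen order.

    Returns the number of lines written.
    """
    seen = set()
    unique = []
    # Read every input before touching the output, in case they overlap.
    for path in files:
        with open(path, "r") as source:
            for line in source:
                # Drop the line ending so "x" and "x\n" are treated as the same line.
                line = line.rstrip("\n")
                if line not in seen:
                    seen.add(line)
                    unique.append(line)
    with open(output, "w") as target:
        for line in unique:
            target.write(line + "\n")
    return len(unique)


# Hypothetical usage; the folder and output names are placeholders:
# files = collect_csv_files("tumblr_export")
# merge_unique_lines(files, "cleaned.txt")
```

One thing to watch with this version: if the output is a `.csv` file inside the same folder, the next run will pick it up as an input. Writing it somewhere else, or with a different extension, avoids that.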

I am pretty sure that I am going to use this in the future as well.