Blog of Sara Jakša

Presumptions and Programming

Last week, a teacher at school asked for a volunteer for a data visualization project. I volunteered, figuring that in the worst case I would brush up on some of my programming skills. Yes, I knew from the start that I would do it in code, and not in any kind of graphics program.

The professor only said that he wanted a timeline, and he did not want to tell me what it was going to be used for. He did say that he hoped it would help him prove a point.

When a person mentions a timeline, the picture below is what comes to my mind.

Timeline - Inventions

I got the homework from all my classmates, and I was supposed to create a timeline of important inventions and discoveries. The timeline above shows the five discoveries that I wrote down as the most important.

I used the file that looked like this:

    year    discovery           name
    -650    Number 0            Brahmagupta
    -360    Dualism             Plato
    1600    Magnetism           William Gilbert
    1859    Evolution           Charles Darwin
    1900    Quantum Mechanics   Max Planck

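And below is the code that produces the plot above:

    # Read the tab-separated file shown above
    data <- read.csv("data.csv", header=TRUE, sep="\t")

    # Every discovery gets a stem of height 1; the year labels get a random
    # vertical offset so that neighbouring years overlap less
    data$weight <- 1
    data$yearplace <- runif(nrow(data), -0.7, -0.2)

    # Draw the stems without frame or axes, then add the rotated names and the years
    plot(data$year, data$weight, type="h", frame.plot=FALSE, axes=FALSE, xlab="", ylab="", xlim=c(min(data$year), max(data$year)+100), ylim=c(-5,7))
    text(data$year, 3, data$discovery, cex=1, srt=90)
    text(data$year, data$yearplace, data$year, cex=0.75)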

The timeline that I actually needed to create had five discoveries from each person. Then there was a second one, with roughly the same amount of data, for inventions.

That is where I ran into big problems: most inventions happened in the last two centuries, but the earliest one, fire, was more than 100,000 years ago. The problem was how to put something like that on any kind of meaningful scale. I prepared one version with only the data from the last 1000 years, and one with a logarithmic scale showing how many years ago each invention happened.
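The logarithmic version can be sketched roughly like this. It is only a minimal sketch, not the exact code I used: the `inventions` data frame holds placeholder values rather than my classmates' answers, and `reference_year` is an assumed point from which "years ago" is counted.

```r
# Placeholder values, not the real homework data; negative years are BCE
inventions <- data.frame(
  year  = c(-100000, -3500, 1440, 1876, 1990),
  label = c("Fire", "Wheel", "Printing press", "Telephone", "Web"),
  stringsAsFactors = FALSE
)

# Assumption: count "years ago" from this year
reference_year <- 2020
inventions$years_ago <- reference_year - inventions$year

# log="x" spaces the stems by order of magnitude;
# the reversed xlim puts the oldest invention on the left
plot(inventions$years_ago, rep(1, nrow(inventions)), type="h", log="x",
     xlim=rev(range(inventions$years_ago)), ylim=c(-5, 7),
     frame.plot=FALSE, axes=FALSE, xlab="", ylab="")
axis(1, at=10^(1:5), labels=c("10", "100", "1,000", "10,000", "100,000"))
text(inventions$years_ago, 3, inventions$label, srt=90)
```

Counting years backwards keeps every value positive, which a logarithmic axis requires. The calendar years themselves could not be used, since some of them are negative.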

When I showed the timeline to the professor, it turned out not to be what he needed. He explained to me that he had a graph of citations over the last 100 years. He was hoping to show that the most important discoveries happened around the year 1900, while citations only rose in the second half of the 20th century.

I figured that this meant he wanted something like a histogram, which is a lot easier to do than the timeline above.

    data <- read.csv("data.csv", header=TRUE, sep="\t", stringsAsFactors=FALSE)
    # Keep 1900 itself, so that it lands in the first bin
    data <- subset(data, year>=1900)
    # One bar per decade, without axes or axis labels
    hist(data$year, breaks=seq(1900,2020,10), freq=FALSE, axes=FALSE, col="lightblue", xlab="", ylab="", main="Discoveries since 1900")

Frequency of important discoveries since 1900
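Since the plot has no axes, it can help to look at the numbers behind the bars. Here is a small sketch of that, assuming a made-up vector `years` in place of the real data. `hist()` with `plot=FALSE` returns the counts per bin without drawing anything.

```r
# Made-up years standing in for the discovery data
years <- c(1900, 1905, 1905, 1915, 1927, 1953)

decades <- seq(1900, 2020, 10)

# By default the bins are closed on the right, e.g. (1900, 1910],
# and the lowest break (1900) is included in the first bin
h <- hist(years, breaks=decades, plot=FALSE)

# Name each count after the year in which its decade starts
counts <- setNames(h$counts, head(decades, -1))
print(counts)
```

With the default settings, a year such as 1910 is counted in the bin that starts at 1900, not in the one that starts at 1910.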

While I was searching the internet for how to remove all the axes and additional labels, I discovered something called density. From a very quick read, it seems to be a smoothed estimate of the probability distribution of the data, so it shows how likely a new data point is to fall around each year.

    data <- read.csv("data.csv", header=TRUE, sep="\t", stringsAsFactors=FALSE)
    data <- subset(data, year>1800)
    # ann=FALSE suppresses the default title and axis labels
    plot(density(data$year), axes=FALSE, col="black", ann=FALSE)
    title("Discoveries since 1800")

Frequency of important discoveries since 1800
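To convince myself of what `density()` returns, a small check like the following can help. It is a sketch on made-up years, not on the real data: the result holds a grid of `x` values and the estimated density `y` at each of them, and the area under the curve should come out close to 1, as for any probability density.

```r
# Made-up years standing in for the discovery data
years <- c(1905, 1905, 1915, 1925, 1927, 1928, 1932, 1953, 1953, 1964)

# Kernel density estimate with the default (Gaussian) kernel and bandwidth
d <- density(years)

# Approximate the area under the curve with the trapezoid rule
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
print(area)

# With freq=FALSE the histogram is on the same scale, so the curve can be drawn over it
hist(years, breaks=seq(1900, 1970, 10), freq=FALSE, col="lightblue", main="", xlab="", ylab="")
lines(d)
```

The curve is essentially the histogram with the bin edges smoothed away, which is why it does not depend on where the decades happen to start.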

I think the density plot is the most elegant visualization of the data so far. Now I just hope that this time I understood the professor correctly.