Blog of Sara Jakša

Analysis of My Citations for Economic Master Thesis

The Jupyter notebook can also be found here: My_Citations_For_Economic_Master_Thesis

I have finally sent the final version of my economics master thesis to my mentor. While I was doing this, I decided to try and analyse what kind of citations I was using in my master thesis.

Importing the libraries

import os
import re
import pandas

Regex patterns

citations_re = r"cite{.+?}"
re_entry = r"@\w*{.+?timestamp.+?}"
re_type = r"@\w*{"
re_journal = r"journal[\s]+?=[\s]+?{.+?}"
re_name = r"@\w*{.+?,"
re_year = r"year.+?=.+?{.+?\d+?.+?}"
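As a quick sanity check on the first pattern, here is how it behaves on a made-up line of LaTeX (the citation keys are hypothetical):

```python
import re

citations_re = r"cite{.+?}"

# A made-up line using two of the citation commands.
sample = r"As shown by \textcite{smith2018}, usage differs \parencite{jones2017, lee2015}."

# The pattern matches the "cite{...}" tail of \cite, \parencite and \textcite alike.
print(re.findall(citations_re, sample))
# ['cite{smith2018}', 'cite{jones2017, lee2015}']
```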

Get all citations from tex files

In this stage, I went over all my tex files and pulled out all the citations (\parencite{}, \cite{}, \textcite{}).

all_citations_in_my_work = set()
for filename in os.listdir("files"):
    with open(os.path.join("files", filename)) as f:
        data = f.read()
        # Find every \cite, \parencite and \textcite command in the file.
        all_citations = re.findall(citations_re, data)
        for s in all_citations:
            # Strip the command name and the braces, leaving only the keys.
            s = s.replace("parencite{", "")
            s = s.replace("textcite{", "")
            s = s.replace("cite{", "")
            s = s.replace(" ", "").replace("}", "")
            # A single command can cite several works, separated by commas.
            for c in s.split(","):
                all_citations_in_my_work.add(c)
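The same extraction can be done in one pass with a capturing group; this is just an equivalent sketch, not the code I actually ran:

```python
import re

def extract_keys(text):
    """Return the set of citation keys from \\cite, \\parencite and \\textcite commands."""
    keys = set()
    # Capture everything between "cite{" and the closing brace, then split on commas.
    for group in re.findall(r"cite\{(.+?)\}", text):
        for key in group.split(","):
            keys.add(key.strip())
    return keys

print(extract_keys(r"\textcite{smith2018} and \parencite{jones2017, lee2015}"))
```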

I used 157 different citations in my work, which I think is not bad for a master thesis.

len(all_citations_in_my_work)
157

Preparing bib for parsing

In the next stage, I parsed the bib files so that I could search them for what I wanted to find.

lines = ""
for filename in os.listdir("bib"):
    with open(os.path.join("bib", filename)) as f:
        lines = lines + f.read()
# Flatten everything to one line so each entry can be matched as a whole.
lines = lines.replace("\n", " ")
# Split the combined text into individual BibTeX entries.
lines = re.findall(re_entry, lines)
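To show what the other patterns pull out of an entry, here is a single made-up BibTeX entry run through them (the entry and its field values are hypothetical):

```python
import re

re_entry = r"@\w*{.+?timestamp.+?}"
re_type = r"@\w*{"
re_journal = r"journal[\s]+?=[\s]+?{.+?}"
re_name = r"@\w*{.+?,"

# A hypothetical entry, already flattened to a single line.
entry = ("@article{smith2018, title = {An Example}, "
         "journal = {Computers in Human Behavior}, year = {2018}, "
         "timestamp = {2019-01-01} }")

# The entry pattern matches because the entry contains a timestamp field.
assert re.findall(re_entry, entry)

# Citation key: between "@type{" and the first comma.
name = re.findall(re_name, entry)[0].split("{")[1].replace(",", "")
# Entry type: between "@" and "{".
entry_type = re.findall(re_type, entry)[0][1:-1]
# Journal: between the braces of the journal field.
journal = re.findall(re_journal, entry)[0].split("{")[1].replace("}", "")

print(name, entry_type, journal)
# smith2018 article Computers in Human Behavior
```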

Which scientific journals my articles came from

In the next step, I parsed the data to figure out which scientific journals I was using.

my_journuals = dict()
for line in lines:
    # The citation key sits between "@type{" and the first comma.
    name = re.findall(re_name, line)
    try:
        name = name[0].split("{")[1].replace(",", "")
    except IndexError:
        continue
    if name in all_citations_in_my_work:
        # Only journal articles carry a journal field, so filter by entry type.
        t = re.findall(re_type, line)
        t = t[0][1:-1]
        if t.lower().strip() == "article":
            j = re.findall(re_journal, line)
            if j:
                j = j[0].split("{")[1].replace("}", "")
                my_journuals[j] = my_journuals.get(j, 0) + 1

Here I first counted the number of articles.

articles = sum(my_journuals.values())
articles
97

And then I counted the number of journals I was using.

len(my_journuals)
66

So I took about 1.5 articles from each journal.

articles/len(my_journuals)
1.4696969696969697

I then checked whether there were any journals that I used more often. I used Computers in Human Behavior the most. Below you can see the ones I used more than twice.

my_journuals = pandas.DataFrame.from_dict(my_journuals, orient="index", columns=["Count"])
my_journuals.sort_values("Count", ascending=False, inplace=True)
my_journuals.reset_index(level=0, inplace=True)
my_journuals.head(5)
   index                                   Count
0  Computers in Human Behavior                13
1  Personality and Individual Differences      6
2  Annual Review of Psychology                 5
3  Social Media + Society                      4
4  Information Systems Frontiers               3

What types my sources were

Next I wanted to see what different types my sources were. Here is the code.

types = dict()
for line in lines:
    name = re.findall(re_name, line)
    name = name[0].split("{")[1].replace(",", "")
    if name in all_citations_in_my_work:
        # The entry type is the word between "@" and "{".
        t = re.findall(re_type, line)
        t = t[0][1:-1].lower()
        types[t] = types.get(t, 0) + 1
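The same tally can also be written with collections.Counter; a small sketch on hypothetical entry types (named type_counts here so it does not clash with the types dict above):

```python
from collections import Counter

# Hypothetical entry types, as if already extracted from the bib entries.
extracted = ["article", "article", "book", "incollection", "article"]

type_counts = Counter(extracted)
print(type_counts.most_common(1))
# [('article', 3)]
```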

As you can see, articles were the most frequent (99). Books were less frequent, even when combining whole books and chapters (18). The remaining types were used 5 times or less.

types
{'online': 2,
 'www': 1,
 'electronic': 1,
 'report': 3,
 'manual': 1,
 'inproceedings': 5,
 'incollection': 5,
 'book': 13,
 'article': 99,
 'thesis': 2}

What years my sources were from

Next I tried to see what years the sources I used were from.

my_years = dict()
for line in lines:
    name = re.findall(re_name, line)
    name = name[0].split("{")[1].replace(",", "")
    if name in all_citations_in_my_work:
        t = re.findall(re_year, line)
        if t:
            # Keep what is between the braces of the year field.
            t = t[0].split("{")[1][:-1]
            my_years[t] = my_years.get(t, 0) + 1
my_years = pandas.DataFrame.from_dict(my_years, orient="index", columns=["Count"])
my_years.sort_values("Count", ascending=False, inplace=True)
my_years.reset_index(level=0, inplace=True)
# Sorting the year strings lexicographically works because all years have four digits.
my_years.sort_values("index", ascending=False, inplace=True)

I used one source from this year. It seems that most of my sources were recent: the most were from last year, then the year before, then from four years back (I am not sure why there are not more sources from 2016).

Looking further into the past, the oldest reference was from 1970. I used 4 sources from the 70s, 1 from the 80s (so from before I was born), 3 from the 90s and another 33 from the 00s. All the rest are from the time when I was already attending university.

my_years
    index  Count
18   2019      1
0    2018     26
1    2017     15
7    2016      6
2    2015     12
9    2014      5
3    2013      8
6    2012      7
10   2011      5
8    2010      5
4    2009      8
13   2008      3
5    2007      7
22   2006      1
12   2005      3
11   2004      4
26   2003      1
14   2002      3
24   2001      1
16   2000      2
17   1999      1
23   1991      1
19   1990      1
25   1988      1
15   1977      2
21   1973      1
20   1970      1
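The decade counts mentioned above can be reproduced from the year table with a small groupby; a sketch using only the pre-2000 rows of the output (the column name year is mine, and the frame is named old_years so it does not clash with my_years above):

```python
import pandas

# Year counts for the pre-2000 sources, copied from the table above.
old_years = pandas.DataFrame(
    {"year": [1999, 1991, 1990, 1988, 1977, 1973, 1970],
     "Count": [1, 1, 1, 1, 2, 1, 1]})

# Bucket each year into its decade and sum the counts.
decade = old_years["year"] // 10 * 10
by_decade = old_years.groupby(decade)["Count"].sum()
print(by_decade)
# 1970s: 4, 1980s: 1, 1990s: 3
```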