Network Analysis with Harry Potter Fanfiction Talk
At PyCon Italy, I presented a talk titled 'Introduction to Network Analysis by Analyzing Characters in Harry Potter Fanfiction'. Here is the write-up for people who prefer reading to listening.
What is Network Analysis
In the last two years, we have had this huge event called the pandemic. Along with everything else it brought, it also popularized network analysis. We were talking about things like contact tracing, which is a technique for creating networks. We were talking about super spreaders, which are related to the hubs of a network, that is, the people most responsible for spreading the virus through the network. We were talking about lowering the rate of infection, which is another topic in network analysis.
So what is network analysis? Network analysis is another method of data analysis, alongside machine learning and text analysis. It deals with data where the relationships are the most important part. These could be the train lines between cities, the friendship ties between classmates, the marriage relationships between noble families, the trade connections between countries, or the genetic similarity between viruses. So it is not just used to study pandemics, but can be used to study many other things. One interesting example is the electrical grid and how it needs to change to stay resilient even when it runs only on renewable energy. And there are many, many more.
In the list above I mentioned a couple of hard problems that will need to be solved in the future. But I prefer to do something a bit more light-hearted and fun, which is why my example is going to be fanfiction instead of all this serious stuff.
What is Fanfiction
So what is fanfiction? Taking the definition from Fanlore:
Fanfiction (fanfic, fic) is a work of fiction written by fans for other fans, taking a source text or a famous person as a point of departure.
Simply put, these are stories where people take a movie, a series, a book or even real people, and write stories about them. It can be because the writers are dissatisfied with a part of the story, or because they just want to spend more time with the characters. Maybe they want to explore some topics that were not explored that well. The same can happen with characters. Some people even write to create representation for themselves that does not exist in the mainstream media.
This is where fanfiction comes in. These stories are written for all of these reasons, and many others as well.
In this talk, we are going to explore the world of Harry Potter fanfiction. I am going to try to explain the basics of network analysis through examples taken from it.
Dataset
The dataset that I am going to be using is the list of all the works with their tags from Archive of Our Own. Archive of Our Own, or AO3 for short, is one of the bigger English-language fanfiction websites, though you can also find fanfiction in other languages there.
The other reason for choosing it is that they were nice enough to provide this data in an easy-to-consume format. The dataset is from March 2021 and can be found at: https://archiveofourown.org/admin_posts/18804
Creating a Graph
When trying to create a network in Python, you first need to define what the objects in the network are and what the connections between them are.
So let us start with the first network. In the first network, I am going to be using the 50 most popular pairings in the Harry Potter fandom.
There are two main concepts in network analysis: nodes and edges.
The nodes are the objects or people that appear in the network, that is, the things that the connections run between. The nodes in this network are going to be people like Harry or Snape.
The second concept is edges. The edges are the connections between the nodes. Here an edge, or relationship, exists between two people if they appear together in one of the 50 most popular pairings.
import networkx

# edges_list holds (node1, node2, weight) tuples, one per pairing
G = networkx.Graph()
for node1, node2, weight in edges_list:
    G.add_edge(node1, node2, weight=weight)
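The snippet above assumes that `edges_list` already exists. As a minimal sketch of how it might be built, assuming pairing tags of the form `Name A/Name B` with a number of works attached, something like the following could work. The counts in `pairing_counts` are made up for illustration and are not taken from the AO3 dump.

```python
import networkx as nx

# Hypothetical pairing tags with made-up work counts (not real AO3 numbers).
pairing_counts = {
    "Draco Malfoy/Harry Potter": 100,
    "Hermione Granger/Ron Weasley": 40,
    "Hermione Granger/Harry Potter/Luna Lovegood": 7,
}

edges_list = []
for tag, works in pairing_counts.items():
    names = tag.split("/")
    # Keep only pairings between exactly two people.
    if len(names) == 2:
        edges_list.append((names[0], names[1], works))

G = nx.Graph()
# add_weighted_edges_from stores the third tuple element as the 'weight' attribute.
G.add_weighted_edges_from(edges_list)
```

Pairings with more than two people are simply skipped here, which matches the simple graph used in this section.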
Types of Graphs
There are actually different types of networks that we could use in the analysis. A graph can be directed or undirected. In this case, we have an undirected graph because we are using tags. Tags on AO3 go through the wrangling process, so unlike in some other web spaces, the order of the people in a tag is not important. A tag usually indicates a two-sided relationship, and therefore does not have an order.
If we also wanted to track one-sided loves, we would need a directed graph. The person who fell in love would have a connection to the person they love, but not the other way around. We do not have this information in this case, so we cannot use a directed graph.
The next distinction is whether some nodes or edges are more important than others. In the relationship graph, the pairing between, for example, Sirius and Snape is a lot more popular (almost 2000 works) than the one between Santa Claus and Dumbledore (2 works). We could have a graph where both of these connections are equally important, like in the example above. This would be an unweighted graph. Or we could decide that the more frequent pairing should have a stronger connection. This would be a weighted graph.
The last distinction is whether there can be multiple relationships between two nodes. In the relationship data, there are pairings involving more than two people. For example, there are about 70 stories involving the pairing of Hermione Granger, Harry Potter and Luna Lovegood. So while this relationship exists, the relationships between each pair of these people also exist. Harry and Luna could therefore be connected through several different love relationships. Whether one treats this as one relationship or as multiple ones decides the type of graph.
In this case, we had no relationships with more than two people, so the graph above allows just one relationship between two people.
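The graph types described above map onto different networkx classes. The following is a small sketch with toy edges: the one-sided love and the `source` attribute are hypothetical, and the weights are only rounded from the numbers mentioned in the text.

```python
import networkx as nx

# Undirected: a mutual pairing, the order of the names does not matter.
undirected = nx.Graph()
undirected.add_edge("Harry", "Luna")

# Directed: a (hypothetical) one-sided love points from the admirer to the admired.
directed = nx.DiGraph()
directed.add_edge("Ginny", "Harry")

# Weighted: store the approximate number of works on each edge.
weighted = nx.Graph()
weighted.add_edge("Sirius", "Snape", weight=2000)
weighted.add_edge("Santa Claus", "Dumbledore", weight=2)

# Multigraph: several parallel edges between the same two people.
multi = nx.MultiGraph()
multi.add_edge("Harry", "Luna", source="Harry/Luna")
multi.add_edge("Harry", "Luna", source="Hermione/Harry/Luna")
```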
Some Metrics
Once we have a graph, it helps to have some easy description of what it looks like. A visualization would normally be a good way, but it fails when the data has too many nodes. Maybe I just have not found a good way to visualize large networks yet.
So there are some simple metrics that can be used to describe the graph.
In order to have a bigger graph, I am going to use the character tags in the next example. If two people appear together in a story, then there is a connection between them. If not, then there is not.
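As a rough sketch of this co-appearance rule, assuming each work is available as a list of its character tags, the edges could be counted with `itertools.combinations`. The `works` list below is a toy stand-in, not data from the dump, and storing the count as a weight is my own choice for the illustration.

```python
from collections import Counter
from itertools import combinations

import networkx as nx

# Toy stand-in: one list of character tags per work.
works = [
    ["Harry Potter", "Hermione Granger", "Ron Weasley"],
    ["Harry Potter", "Draco Malfoy"],
    ["Harry Potter", "Hermione Granger"],
]

pair_counts = Counter()
for characters in works:
    # Sorting makes (a, b) and (b, a) count as the same pair.
    for a, b in combinations(sorted(set(characters)), 2):
        pair_counts[(a, b)] += 1

G = nx.Graph()
for (a, b), count in pair_counts.items():
    # The number of shared works is kept as the edge weight.
    G.add_edge(a, b, weight=count)
```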
The first thing we can get is simply the number of nodes and edges.
G.nodes()
G.edges()
len(G.edges()) / len(G.nodes())
Taking the character graph into account, there are 12,844 nodes and 825,051 edges. This means that almost 13 thousand different characters appear in Harry Potter fanfiction. You can probably imagine why visualizing this is not such a simple problem.
Based on these two numbers, we can also calculate that there are about 64 edges per node. Since every edge connects two characters, the average character appears in stories with about 128 other characters. The average depends on the kind of network. But normally this number would be below 10. For comparison, the Facebook network has an average of 190. So this network is somewhere between the usual network and the Facebook one.
networkx.density(G)
networkx.average_shortest_path_length(G)
networkx.diameter(G)
The same information can also be expressed as the density, where 0 means no connections at all and 1 means every node is connected with every other node. In this case it is around 0.010. I just find the average number of connections more understandable.
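To make the relation between these numbers concrete, here is a small sketch on a toy graph. For an undirected graph with n nodes and m edges, every edge has two endpoints, so the mean number of neighbours is 2m/n, and the density is 2m/(n(n-1)). The graph itself is made up for the illustration.

```python
import networkx as nx

# Toy undirected graph with 4 nodes and 4 edges.
G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")])

n = G.number_of_nodes()
m = G.number_of_edges()

edges_per_node = m / n      # what len(G.edges()) / len(G.nodes()) computes
mean_degree = 2 * m / n     # average number of neighbours per node
density = nx.density(G)     # equals 2 * m / (n * (n - 1)) for an undirected graph
```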
The average shortest path length tells us what we could expect as the shortest distance between two random nodes.
For example, have you heard of the six degrees of separation? This was an actual experiment before the internet age. People had to get a letter to another person chosen by the experimenter. But they were only allowed to mail it to people they knew. For the letters that arrived at the address, it took on average a chain of 6 people. There were also letters that did not get to the right address even with a chain more than 20 people long. But in the experiment, the letters that did not arrive did not count.
Here it is the same thing, except that people do not need to guess the best person to pass the letter to. Since we are analyzing the data on a computer, the computer takes care of finding the shortest paths.
The diameter calculates something similar. It tells you the longest of the shortest distances between any two nodes.
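A tiny example may help to see the difference between the two measures. This sketch uses a made-up chain of five nodes with one shortcut, not the fanfiction data.

```python
import networkx as nx

# A chain 0-1-2-3-4 with one shortcut between 1 and 3.
G = nx.path_graph(5)
G.add_edge(1, 3)

# Shortest route from one end of the chain to the other, using the shortcut.
route = nx.shortest_path(G, 0, 4)

# Mean of the shortest distances over all pairs of nodes.
average_length = nx.average_shortest_path_length(G)

# The largest of those shortest distances.
diameter = nx.diameter(G)
```

Here the route is `[0, 1, 3, 4]`, the average shortest path length is 1.6 and the diameter is 3.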
Connected components
The diameter can only be calculated if the network is connected. What do I mean by that?
Well, when we were talking about the relationships, I did not actually show the entire graph. The picture below shows the entire graph.
As you can see from the graph, there are some groups that are not connected to the main group. These can be groups connected to the sequel, like Rose Weasley, Scorpius Malfoy and Albus Severus Potter. Or some canon pairings where the characters do not have other popular pairings, like Arthur and Molly Weasley, Bill Weasley and Fleur Delacour, and Dumbledore and Grindelwald. And so on.
These all represent disconnected components. We can find the connected components and analyse each of them separately.
networkx.connected_components(G)
For analysis, usually the biggest component is taken. It is usually pretty simple to find the biggest one. Counting the nodes, one component usually contains most of them, and this is the one that is used.
Logically, only one component can have more than half of the nodes. Even if none does, the number of nodes can still be used to tell them apart.
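Picking the biggest component can be sketched as follows. The edges are a toy example, and sorting the components by size is just one simple way to do it.

```python
import networkx as nx

# Toy pairing graph with one main group and two small separate pairs.
G = nx.Graph()
G.add_edges_from([
    ("Harry", "Hermione"),
    ("Hermione", "Ron"),
    ("Harry", "Draco"),
    ("Rose", "Scorpius"),
    ("Arthur", "Molly"),
])

# connected_components yields one set of nodes per component.
components = sorted(nx.connected_components(G), key=len, reverse=True)

# Copy the biggest component into its own graph for further analysis.
largest = G.subgraph(components[0]).copy()
```

Measures such as the diameter can then be calculated on `largest`, since it is connected by construction.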
But sometimes it makes sense to analyse the smaller components as well. For example, the Harry Potter world is often used just as a stage, without any of the original characters present. In this case, we would get smaller components with these characters. And there are some cases of that.
Do you know the story A Christmas Carol? The one where the ghosts of Christmas past, present and future visit a person, and this person has their life changed. Well, all three of these ghosts also appear as characters in Harry Potter fanfiction.
The next one that I would point out is this graph of what appears to be the English royalty of the past. Or maybe they were just nobility. I do not know enough about English history to judge this. But we have the Princes in the Tower, Richard of England, Edward of England and so on.
I guess since Hogwarts has existed for millennia, it could have played a role in English politics as well.
But maybe this is just a justification. That is something I do not have for the last example. I generally avoid real person fanfiction, so I am not that familiar with the norms there. But it is something that I have a problem explaining.
What do Kimi Räikkönen, Sebastian Vettel, Fernando Alonso, Sergio Pérez and Lewis Hamilton have in common? They are all Formula 1 drivers who appear in Harry Potter fanfiction.
The graph also shows which drivers are written with multiple other characters and which only appear with specific other drivers. We could get some idea of the fandom from this. But I will skip this part, since real person fanfiction is not really my cup of tea.
I actually tried to research Elon Musk fanfiction, because I hoped I could get a joke for the speech out of it. I gave up once I realized that there is an Alpha/Omega Trump/Musk fanfiction there. To each their own cup of tea, but I would rather be reading about Kudou Shinichi, Albert James Moriarty or Leon Fou Bartfort. All of these are fictional characters from some of my smaller fandoms. Only two of them have a crossover with Harry Potter at this point. Maybe I should write the third one? Maybe a cliche one, where Leon becomes the Defence Against the Dark Arts teacher?
So let us move on from this digression, as interesting as it is to me.
Centrality
When we talk about a network, one of the things that we want to know is which nodes are the most important. They can be important because these are usually the ones that are best at spreading things, from information to viruses and memes.
There is, however, interesting research by Centola showing that these nodes are important for spreading simple things, but can actually impede the spread of more complex behaviour. This is because they have a lot of influences, so none of them is the prevailing influence. It is also the reason why changes usually start on the fringe - it only shows later whether they will spread further than that.
The most important nodes can also mean the people who make sure the network does not get disconnected. This is important for the flow of information, since information cannot spread without at least some connection between groups.
We are going to talk about centrality using the example of the network of people appearing in stories tagged with BAMF. BAMF means bad ass mother fucker, and it is one of my favorite tropes. It is actually my guilty pleasure to read these stories - I am a sucker for competent characters, no matter in which way. The Fanlore definition is:
It is often used to describe a certain way of writing a character that emphasizes their badass qualities
And this is what the subset of the graph with the most commonly used characters looks like:
It looks more or less like what one would expect from the popularity of certain characters. It is basically the cast of Harry Potter.
Degree Centrality
So let us now try to find the most important person. We are going to start with degree centrality.
networkx.degree_centrality(G)
Degree centrality just measures how many edges each node has, so who the most connected person is. The list below shows the most important nodes found by this method:
- Harry Potter
- Hermione Granger
- Draco Malfoy
- Luna Lovegood
- Albus Dumbledore
- Sirius Black
- Ron Weasley
- Severus Snape
- Neville Longbottom
- Minerva McGonagall
What this list tells us is which characters appear with the largest number of other characters in the fanfiction stories with the BAMF tag. This is similar to popularity, but not exactly the same. For example, if we rank the characters by the number of stories with this tag that they appear in, the order would be like this:
- Harry Potter
- Hermione Granger
- Draco Malfoy
- Ron Weasley
- Sirius Black
- Severus Snape
- Albus Dumbledore
- Remus Lupin
- Ginny Weasley
- Minerva McGonagall
The list is mostly the same. But there are some differences. For example, Luna appears with a lot more varied characters than the number of stories she appears in would suggest.
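A ranked list like the ones above can be produced by sorting the result of `degree_centrality`. This is a minimal sketch on a made-up four-person graph. networkx normalizes the value by dividing each node's degree by the number of other nodes, so a node connected to everybody gets 1.0.

```python
import networkx as nx

# Toy co-appearance graph (made up for the illustration).
G = nx.Graph()
G.add_edges_from([
    ("Harry", "Hermione"),
    ("Harry", "Ron"),
    ("Harry", "Luna"),
    ("Hermione", "Ron"),
])

# Dictionary: node -> degree / (number of nodes - 1).
centrality = nx.degree_centrality(G)

# Sort from most to least connected and keep the top three.
top = sorted(centrality.items(), key=lambda item: item[1], reverse=True)[:3]
```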
Betweenness Centrality
Next we are going to look at betweenness centrality.
networkx.betweenness_centrality(G, weight='weight')
Betweenness centrality tells us which nodes are important for connecting different groups. A person who is the only link between two groups would have a high betweenness centrality, even if they do not have a lot of edges.
So in our case, it tells us which people connect the largest number of other people to each other.
An example of this in the Harry Potter world would be if Hermione Granger ended up in a relationship with Voldemort. You might think it is a joke relationship, but it has over 200 stories on AO3, so it would be possible. In this case, both of them would have a high betweenness centrality, since they would connect two very different groups which normally would not really communicate.
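The bridge effect can be sketched with a toy graph of two tight groups joined by a single Hermione and Voldemort edge. The group memberships are made up for the illustration and the edges carry no weights.

```python
import networkx as nx

G = nx.Graph()
# Two tightly knit toy groups.
G.add_edges_from([("Hermione", "Harry"), ("Hermione", "Ron"), ("Harry", "Ron")])
G.add_edges_from([("Voldemort", "Bellatrix"), ("Voldemort", "Lucius"), ("Bellatrix", "Lucius")])
# The single link between the groups.
G.add_edge("Hermione", "Voldemort")

# Share of shortest paths between other nodes that pass through each node.
betweenness = nx.betweenness_centrality(G)
```

Every shortest path between the two groups has to pass through both Hermione and Voldemort, so they each score 0.6 here, while everybody else scores 0. Note that networkx treats the `weight` argument as a distance when it searches for shortest paths, which is worth keeping in mind when the weights are counts of works.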
Below is the list of important nodes by this criterion for our case. The number in parentheses is the rank according to degree centrality.
- Harry Potter (1)
- Hermione Granger (2)
- Albus Dumbledore (5)
- Luna Lovegood (4)
- Draco Malfoy (3)
- Sirius Black (6)
- Severus Snape (8)
- Ginny Weasley (15)
- Original Characters (23)
- Original Female Character(s) (14)
While some of the people are the same as in the two lists above, some of them are a lot higher here than before.
One example that I want to discuss is Original Characters and Original Female Character(s). They are higher here than in the previous lists. The reason for this is that these are characters that do not exist in canon, and are therefore not limited by what happened in canon. So they can be used with basically all the characters.
Page Rank
The third centrality that I want to discuss is the page rank.
networkx.pagerank(G, weight='weight')
Page rank is a centrality measure that tries to take into account the importance of the nodes a node connects to.
So for example, the business associate of Mr Dursley who visited them in the second book could be important in their own story. In the Harry Potter world, they are not important at all, because this character does not have a direct connection to anybody of importance in the magical world.
On the other hand, Luna does not have many friends in the books. So in network terms, she does not have many edges. But she is friends with Harry Potter, which puts her in an important position.
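To show the idea of importance flowing in from neighbours, here is a plain power-iteration sketch. It is not how `networkx.pagerank` is implemented and it ignores edge weights. The function name `simple_pagerank` is my own, the damping value 0.85 is the commonly used default, and the sketch assumes an undirected graph without isolated nodes.

```python
import networkx as nx

def simple_pagerank(G, damping=0.85, iterations=100):
    """Unweighted page rank by repeated averaging (illustrative sketch only)."""
    n = G.number_of_nodes()
    rank = {node: 1.0 / n for node in G}
    for _ in range(iterations):
        new_rank = {}
        for node in G:
            # Each neighbour passes on an equal share of its own rank.
            incoming = sum(rank[nb] / G.degree(nb) for nb in G.neighbors(node))
            new_rank[node] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank

# Toy star graph: node 0 in the middle, nodes 1 to 4 around it.
G = nx.star_graph(4)
ranks = simple_pagerank(G)
most_important = max(ranks, key=ranks.get)
```

In the star graph the middle node collects the rank of all the outer nodes, so it ends up as the most important one.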
So here are the results, with the ranks from the previous two centrality measures (degree - betweenness) in parentheses:
- Harry Potter (1 - 1)
- Hermione Granger (2 - 2)
- Draco Malfoy (3 - 5)
- Ron Weasley (7 - 11)
- Albus Dumbledore (5 - 3)
- Sirius Black (6 - 6)
- Severus Snape (8 - 7)
- Remus Lupin (11 - 14)
- Luna Lovegood (4 - 4)
- Minerva McGonagall (10 - 12)
There are some changes, but not that many. The people who rose here are the ones who are well connected to the higher-placed people. I will leave to your imagination what that means for them.
Correlations
In a lot of cases, the centrality measures correlate strongly, just like in the case that we just went through. Below are the correlations.
measure 1 | measure 2 | corr |
---|---|---|
degree | betweenness | 0.68 |
degree | page rank | 0.89 |
page rank | betweenness | 0.82 |
But this is not necessarily the case. I have seen cases where the correlations were even slightly negative. So it is important to pick the centrality measure that you are interested in, or to use a combination of multiple ones.
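A correlation between two centrality measures could be computed along these lines. This is only a sketch: I am assuming a Pearson correlation here, which may not be the type used for the table above, and the karate club graph that ships with networkx stands in for the fanfiction data.

```python
import networkx as nx

def pearson(xs, ys):
    """Pearson correlation of two equally long lists of numbers."""
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / (variance_x * variance_y) ** 0.5

# Stand-in graph, not the fanfiction network.
G = nx.karate_club_graph()
degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)

# Use the same node order for both measures.
nodes = list(G)
corr = pearson([degree[v] for v in nodes], [betweenness[v] for v in nodes])
```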
How to pick the nodes and edges?
Next we are going to talk about a problem that normally arises at the beginning: how to actually pick the nodes and the edges for the network. This depends on the problem that you need analysed.
In our examples so far, the nodes were always the characters in fanfiction stories. But the edges first indicated a romantic relationship between two people, and later their co-appearance in stories.
So let me show the difference on another fanfiction tag. In this case we are going to be using the 'Remus Lupin Needs a Hug' tag.
This tag usually indicates that the person will go through something bad in the story. A lot of the time this is an abusive relationship, not necessarily a romantic one. The story is then written in a way that resolves this with a happy ending.
In Harry Potter, there are four people for whom the tag is used frequently enough to be a canonical tag: Harry Potter, Draco Malfoy, Sirius Black and Remus Lupin.
For the edges, we have a couple of options. All of them could be included, or there could be a threshold. Maybe the count could be used as a weight.
For me, the interesting part is how this tag differs from the rest of the tags. So I can simply take into account only the relationships that happen more frequently than I would expect from the general Harry Potter fandom numbers.
What we are left with is a graph that shows the relationships that are representative of this tag only, and how the co-appearance of people differs from the general Harry Potter fandom.
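One possible reading of "more frequently than expected" is sketched below. It is an assumption on my part, not necessarily the exact rule used for the graph in the talk: the expected count of a pair inside the tag is the size of the tag multiplied by the share of works the pair has in the whole fandom. All the numbers are made up.

```python
import networkx as nx

# Hypothetical totals: works in the whole fandom and works carrying the tag.
general_total = 1000
tag_total = 100

# Hypothetical co-appearance counts in the whole fandom and inside the tag.
general_pairs = {
    ("Remus Lupin", "Sirius Black"): 50,
    ("Harry Potter", "Remus Lupin"): 200,
}
tag_pairs = {
    ("Remus Lupin", "Sirius Black"): 20,
    ("Harry Potter", "Remus Lupin"): 15,
}

G = nx.Graph()
for pair, observed in tag_pairs.items():
    # Count we would expect if the tag behaved like the general fandom.
    expected = tag_total * general_pairs.get(pair, 0) / general_total
    if observed > expected:
        # Keep only the over-represented pairs; the surplus becomes the weight.
        G.add_edge(*pair, weight=observed - expected)
```

With these toy numbers, Remus and Sirius stay in the graph (20 observed against 5 expected), while Harry and Remus are dropped (15 observed against 20 expected).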
If I look at this graph, it does not look like a Harry Potter graph. There is a Fairy Tail fandom group that appears there. I am not familiar with either the fandom or the canon of Fairy Tail, so I cannot comment on this. Then there is an interesting group of people including DC characters, Dracula and Churchill. Do not ask me what the connection between them is.
Then there are groups that could be explained by the wolf connection: basically crossovers where werewolves from multiple stories and/or fandoms appear. It is these weird connections that make researching fanfiction so interesting.
Community Finding
For the last point, I would like to talk about community finding.
Normally, not all nodes are equally connected to the other nodes. Just as we have groups in our social life, groups also exist in networks. It simply means that some parts of the graph are a bit more separated from the rest.
I will explain this using the example of the Alpha/Beta/Omega tag. The Fanlore definition is:
Alpha/Beta/Omega is a kink trope wherein some people have defined biological roles based on a hierarchical system. [...] Alphas are able to impregnate Omegas. Alphas are often (but not always) "dominant" in some fashion (depending on the worldbuilding of the specific story). [...] Omegas can get pregnant and go into heat. They are generally lowest on the hierarchy (although in some fanworks omegas are rare and prized). Some mangas don't have this hierarchy; their society is just like the actual human world. Male Omegas are self-lubricating and have the ability to become pregnant, sometimes referred to as being bred or mated. In some fanworks Omegas are the most fragile of the hierarchy, with frailer bodies and painful presentations.
Just as in the previous example, we are going to use edges based on the difference from the general Harry Potter fandom.
Just looking at the picture, it seems to be a pretty good example for finding groups. Based on a visual inspection of the graph, the nodes seem to be well divided into groups.
from networkx.algorithms import community

groups = community.louvain_communities(G)
Let us now try to divide the graph with the help of an algorithm as well. The picture below represents the same graph as above, but this time colored by group.
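Turning the groups into colors for a picture could be sketched like this. The barbell graph (two fully connected groups of five joined by one edge) is a toy stand-in for the real data, and fixing `seed` is only there to make the run repeatable, since the Louvain method involves randomness.

```python
import networkx as nx
from networkx.algorithms import community

# Toy graph with two obvious groups: nodes 0-4 and nodes 5-9.
G = nx.barbell_graph(5, 0)

# louvain_communities returns a list of sets of nodes.
groups = community.louvain_communities(G, seed=42)

# Map each node to the index of its group, then build one color index per node.
node_to_group = {node: index for index, members in enumerate(groups) for node in members}
colors = [node_to_group[node] for node in G]
```

The `colors` list is in the same order as the nodes of `G`, so it can be passed as `node_color` to `networkx.draw` if matplotlib is available.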
It seems that it found a couple of well-defined groups. Let us go through them group by group and try to find what each group has in common. While the algorithm found six groups, I will only describe the four bigger ones.
The first group seems to have mostly people from the Harry Potter universe. The characters range from the popular ones, like Harry Potter, to the rare ones, like the Giant Squid. Some characters from outside of the Harry Potter canon also appear, like Rigo Vasquez or Steve McGarrett. But they are on the outskirts of the group. The core seems to be Harry Potter and Voldemort. So we could call it the Harry Potter cluster.
The next one is pretty interesting. Most of the nodes are connected to one node, and that is Reader. There are basically no other connections between the nodes in this group. This makes sense, since the group ranges from real-life people like Chris Evans and Tom Hiddleston, to book characters like Tolkien's Thranduil, comics characters like Marvel's Wade Wilson and Peter Parker, movie characters like Star Wars' Kylo Ren, series characters like Lucifer's Lucifer Morningstar, and game characters like Asra from The Arcana. Basically, a mix of everything.
Reader stories are usually written in a way that lets the reader insert themselves into the story. I am not sure why stories with these pairings are more popular in the Omegaverse part of the Harry Potter fandom. But I usually skip Reader stories, so I cannot judge or give an answer.
The third group is either DC or the Arrowverse. If I remember correctly, Felicity Smoak is an Arrowverse original character, and she appears here, which is a good indication that this might be the Arrowverse. But since it is based on the DC comics and a lot of the same characters appear in both, it is hard to judge.
I have to admit that I have watched multiple seasons of different Arrowverse series. But I do not really know much about the DC comics universe. So maybe there is a good explanation for the results somewhere in the DC lore.
But I still find it interesting that Barry Allen is the main character in this subgraph. Most of the characters appear to be The Flash characters, as much as you can categorize them like this. There is a small subgraph with three people from Arrow and two people from Legends of Tomorrow. For some reason, there are no Supergirl characters though, which is interesting.
The last one is either the MCU or Marvel. Again, I am only familiar with the movies and the MCU fandom. I am not familiar enough with Marvel to figure out which one of these two it is.
Unlike the previous ones, here it seems that multiple people form the core, not just one person like in the previous examples. For some reason it is Lily Potter who is in this group, even though she is a Harry Potter character. I guess this makes it easy to figure out the connecting character. Though why Lily?
The last two are a good example of how well Harry Potter works with crossovers. There are a lot of crossovers within the Harry Potter fandom, and it is one of the things that I like the most about this fandom. I have probably read more Harry Potter crossovers than non-crossover stories.
Conclusion
My main goal with this was to give a hopefully interesting introduction to network analysis to people who might be interested in using it. I think that, unlike text analysis and machine learning, network analysis is not presented that frequently, so it could be overlooked.
My other, secret reason was that I wanted to share some of my passion with people. A lot of people are not aware of how fun and interesting the internet is. So I wanted to share a part of it.