Best pony, as told by Fimfiction
Background
When JockeTF released his archive of every single story on Fimfiction two months ago, I immediately realized the potential for calculating some interesting statistics, though apparently, I took my time. (Fimfiction, for the uninitiated, is the brony community's premier platform for sharing fan fiction. If this confuses you more, read this Wikipedia page, or just continue on and laugh at how absurdly niche the contents of this article are). The dump consists of a whopping 135,000 stories published over the last 5 years, totalling 8.5 GB of uncompressed text. It's a huge sample size and is largely representative of all but the earliest of the fandom's written works. But where to start?
The one question that gets brought up more often than anything else in this fandom is, by far, "who is best pony?" So, why not start there? Who, really, is best pony?
Best pony by story votes
Well ah'll be - wouldja look at that? The set of stories in which Applejack appears (by name) has a higher average rating than the set of stories in which any other given character appears. That is to say, if you randomly pick one story in which Applejack appears and a different story in which one of the other mane 6 appears, you can expect the former to be rated higher.
By the way, here's the same figure extended for more characters. If you want to read a good story, pick one about Nightmare Moon!
So that's one interpretation of the "best pony" question (and probably the best approach for the data we have available), but why stop there? Maybe being the best pony isn't about having the highest story ratings. Maybe it's about being written about most frequently, or being portrayed in the most positive light? The question is rather ill-defined, which is what makes it so much fun in the first place.
So, here we go.
When I say a character "appears" in a story, I mean that their name, or nickname, appears in the text. For example, if the text contains "Rainbow", "Dash", or "Dashie" (case-sensitive, to avoid attributing the verb use of "dash" to the character), then I say that Rainbow Dash makes an appearance in that story. This is flawed, as it could be that a different character is actually speaking about RD instead, but I think it's reasonably accurate.
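As a minimal sketch of that matching rule (not necessarily how the published code does it), the check can be written as a case-sensitive regular expression per character. The Rainbow Dash names come from the example above; the Applejack entry, the ALIASES table, and the function names are assumptions made for illustration.

```python
import re

# Hypothetical alias table. Only the Rainbow Dash names are taken from the
# article; the Applejack entry is an assumption for illustration.
ALIASES = {
    "Rainbow Dash": ["Rainbow", "Dash", "Dashie"],
    "Applejack": ["Applejack"],
}


def compile_aliases(aliases):
    """Build one case-sensitive pattern per character.

    The word boundaries stop "Dash" from matching inside "Dashboard", and
    leaving out re.IGNORECASE means the verb "dash" is not counted.
    """
    return {
        character: re.compile(
            r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b"
        )
        for character, names in aliases.items()
    }


PATTERNS = compile_aliases(ALIASES)


def characters_in(text):
    """Return the set of characters whose name or nickname occurs in text."""
    return {c for c, pattern in PATTERNS.items() if pattern.search(text)}
```

This shares the weakness of any name match: a sentence that opens with the verb ("Dash to the barn!") or a capitalized "Rainbow" in the sky would still be counted as an appearance.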
Using the same method, we can also infer something about the fandom's favorite ship. Spoiler, it's Twilight Sparkle/Rainbow Dash.
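One plausible way to turn name matching into a ship ranking is to count how many stories each pair of characters shares. This is only a sketch under that assumption: counting per story rather than per sentence is a guess, and pair_counts is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations


def pair_counts(appearances):
    """Count co-appearances of character pairs.

    appearances: iterable of sets of character names, one set per story.
    Pairs are stored as alphabetically sorted tuples so that (A, B) and
    (B, A) land on the same key.
    """
    counts = Counter()
    for characters in appearances:
        for pair in combinations(sorted(characters), 2):
            counts[pair] += 1
    return counts
```

Calling counts.most_common(1) on the result gives the pair that shares the most stories. Keep in mind that two characters appearing in the same story says nothing about whether they are romantically paired in it.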
If we go back to the frequency of character appearances, there's something a little odd: although Celestia appears in as many stories as Twilight, Twilight appears in far more sentences than anyone else. This could mean that Celestia tends to play a less important role in the stories she's mentioned in, while Twilight plays more central roles (or plays central roles in lengthier stories). Or it could mean that - for whatever reason - authors tend to mention Twilight by her name, using fewer pronouns or non-name titles (e.g. "the librarian") than with other characters.
Next up, how has the quality (or expectations) of fandom writing changed with time?
I'll say that this is not a trend I expected. Unfortunately, it's difficult to determine if the peaks correspond to creative bursts within the fandom, or readers becoming more liberal in their "likes". However, the large bump around Sept. 2013 - Jan. 2014 does correspond to the premiere of Season 4, and specifically, "Princess Twilight". That might not be an anomaly.
As pointed out by TheOtherHarryPotter, Fimfiction no longer exposes vote counts for stories with fewer than 10 votes, as of July 2015. This may account for the current upward trend if we assume that these stories have disproportionally poor ratings.
Additionally, there's an unmistakable trend for longer stories to have higher ratings. Note: stories with 0 votes (likes and dislikes) were removed from the dataset for all these plots that mention rating.
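For the curious, here is a rough sketch of how a length-versus-rating summary like this could be computed. It assumes a story's rating is likes / (likes + dislikes), which is a guess at the definition, and it groups stories by order of magnitude of word count; the function names and the (word_count, likes, dislikes) tuple layout are made up for the example.

```python
import math
from collections import defaultdict


def rating(likes, dislikes):
    """Assumed definition: fraction of votes that are likes.

    Returns None for stories with 0 votes so they can be dropped.
    """
    total = likes + dislikes
    return None if total == 0 else likes / total


def mean_rating_by_length(stories):
    """Average rating per order of magnitude of word count.

    stories: iterable of (word_count, likes, dislikes) tuples.
    Returns a dict keyed by the lower edge of each bucket (10, 100, 1000...).
    """
    buckets = defaultdict(list)
    for words, likes, dislikes in stories:
        r = rating(likes, dislikes)
        if r is None or words <= 0:
            continue  # skip unrated stories, as in the plots
        buckets[int(math.log10(words))].append(r)
    return {10 ** k: sum(rs) / len(rs) for k, rs in sorted(buckets.items())}
```

Log-spaced buckets are used because story lengths span several orders of magnitude; evenly spaced buckets would put nearly every story in the first one.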
Here's the same length-versus-rating relationship as a scatterplot, and before you ask, yes - I checked my work and that sharp edge at 1000 words doesn't seem to be an error! Perhaps Fimfiction has (or used to have) a minimum word requirement for stories? Several readers have confirmed that Fimfiction does have a 1000 word requirement for new stories. By the way, the strong horizontal lines in the plot below are due to quantization - the number of different possible ratings for stories with few votes is small.
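To see how coarse that quantization is, here is a small sketch that enumerates every rating a story can have with a given number of votes or fewer. It assumes the rating is likes divided by total votes, which is my guess at the definition; possible_ratings is a name made up for this example.

```python
from fractions import Fraction


def possible_ratings(max_votes):
    """All distinct ratings reachable with 1..max_votes total votes.

    Assumes rating = likes / votes. Fractions are used so that 1/2 and
    2/4 count as the same rating.
    """
    return sorted(
        {
            Fraction(likes, votes)
            for votes in range(1, max_votes + 1)
            for likes in range(votes + 1)
        }
    )
```

With at most 4 votes there are only 7 possible ratings (0, 1/4, 1/3, 1/2, 2/3, 3/4, 1), and with at most 10 votes there are 33, so lightly voted stories pile up on a handful of horizontal lines.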
Best pony by sentiment analysis
Ok, let's get back on track. Can we figure out which character is written in the most positive manner? Even though stories containing Applejack have the highest ratings, it would be questionable to proclaim her as best pony if all these stories were about how hilariously boring she is.
To some degree, we can, using something called Sentiment Analysis. A very simple form of sentiment analysis is to take a look at a single sentence and attach some number indicating how "positive" the sentence is. For example, on a scale from -1 to +1, "your smile fills me with glee" might score +0.8 because that's a happy thought. On the other hand, "cows are terrifying because they're fat and they talk a lot" might get -0.6. The exact process is hard to define, since we're attempting to quantify a qualitative and subjective measurement. It's also fairly limited. Consider "Pinkie moved out today. Thank God - she was so annoying." Each sentence would score negatively on its own, even though the overall meaning depicts something that's (arguably) good. Regardless, sentiment analysis can still be useful.
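To make the idea concrete, here is a deliberately tiny word-list scorer. It is a toy stand-in and not the tool used for the plots in this article: the LEXICON entries and their weights are invented for illustration, and real sentiment analyzers use far larger word lists plus rules for negation and emphasis.

```python
import re

# Hypothetical toy lexicon; the words and weights are invented for this
# example and are not taken from any real sentiment library.
LEXICON = {
    "glee": 0.8,
    "smile": 0.6,
    "happy": 0.7,
    "terrifying": -0.9,
    "annoying": -0.7,
    "fat": -0.4,
}


def sentence_sentiment(sentence, lexicon=LEXICON):
    """Average the weights of the known words in one sentence.

    The result stays within -1..+1 as long as every weight does, and a
    sentence with no known words scores a neutral 0.0.
    """
    words = re.findall(r"[a-z']+", sentence.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0
```

Because each sentence is scored in isolation, a scorer like this has no way to notice that a negative-sounding sentence is good news in context, which is exactly the limitation described above.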
If we take every sentence in which a given character appears (by name) and measure the sentiment, we can infer something about how "positively" that character is spoken of. Of course, the scope is pretty limited - a character can be cheerful and happy, or determined and hard-working, yet unbearably shallow, and sentiment analysis wouldn't have anything to say for the latter attribute.
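Put together, the per-character measurement might look something like the sketch below. The sentence splitter is a naive regular expression, the scoring function is passed in as a parameter (any function mapping a sentence to a number will do), and both function names are assumptions for the sake of the example.

```python
import re
from statistics import mean


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def character_sentiment(text, names, score):
    """Mean sentiment of the sentences that mention a character by name.

    names: list of case-sensitive names and nicknames for the character.
    score: function taking a sentence and returning a number in -1..+1.
    Returns None when the character is never mentioned.
    """
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    scores = [score(s) for s in split_sentences(text) if pattern.search(s)]
    return mean(scores) if scores else None
```

The naive splitter will stumble on abbreviations and on dialogue punctuation, so a proper sentence tokenizer would be a better choice for real stories.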
I would expect Rainbow Dash and Pinkie Pie to be associated with the most positive sentiments - the former because of how "awesome" she is and the latter because of all the parties and "fun" she likes to have. Rarity would probably be the lowest, as drama doesn't tend to go over well and she is known to frequently exaggerate the "worst possible things". Surprisingly, this isn't quite the case:
I'm glad I decided to look at the sentiments as a function of time, rather than averaging all the text in one go. I wish I had thought to do that with the ratings earlier. In any case, Rainbow Dash actually comes out on the bottom, with Rarity in 2nd for most of the time. It looks like the antics of Rarity and Pinkie Pie have been chronicled in a noticeably more pleasant manner over the years. The trend is also evident for Twilight Sparkle, though the other characters - especially Applejack - appear remarkably consistent. Noteworthy is that no character has a downward trend. Perhaps the general upward motion is due to the many trollfics that were astoundingly popular in the early days. At least to my knowledge, these seem to have faded.
On the other hand, these trends aren't quite the same when we look at the sentiment of every sentence within a story (not just the ones in which a specific character is mentioned). This doesn't support the trollfic idea, since topics like murder would absolutely bring down the sentiment of a text.
Next, take a look at the sentiment as a function of location within a given story.
And with short, medium-length, and long stories plotted separately:
The takeaway here is that Fimfiction authors love happy endings.
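A position-within-story curve like the ones above can be sketched by splitting each story's sentences into a fixed number of equal-sized slices and averaging the sentiment in each slice. This is an assumed approach rather than a description of the published code; sentiment_profile and the default of 10 slices are choices made for this example.

```python
def sentiment_profile(scores, bins=10):
    """Average sentiment per relative position in a story.

    scores: per-sentence sentiment values in reading order.
    Returns a list of length bins; slot 0 covers the start of the story
    and the last slot covers the ending. Empty slots are None.
    """
    totals = [0.0] * bins
    counts = [0] * bins
    n = len(scores)
    for i, s in enumerate(scores):
        b = i * bins // n  # relative position i/n mapped onto 0..bins-1
        totals[b] += s
        counts[b] += 1
    return [t / c if c else None for t, c in zip(totals, counts)]
```

Using relative position rather than sentence index is what lets short and long stories be averaged on the same axis.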
The state of Fimfiction
That's it for the best pony analysis. But while I was gathering those statistics, I found some others worthy of mention as well.
I suppose that with 135,000 stories, it's inevitable that a few will share the same title. Additionally, Fimfiction has more incomplete stories than completed stories, though I would expect this to be the case for fan fiction in general.
Finally, I'll leave you with the most common words found in sentences mentioning each given character. This ignores all words that occur less than 50% above the baseline (i.e. across all the text on Fimfiction) in order to weed out uninteresting words like "the" and "it". Have fun with your stereotyping! (Click for larger versions).
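Here is one way the baseline filter could work, as a sketch only. It reads "50% above the baseline" as: keep a word if its relative frequency in the character's sentences is at least 1.5 times its relative frequency across all text. That reading, the function names, and the ratio and top parameters are assumptions.

```python
import re
from collections import Counter


def word_counts(sentences):
    """Lowercased word counts over a list of sentences."""
    return Counter(
        w for s in sentences for w in re.findall(r"[a-z']+", s.lower())
    )


def distinctive_words(char_sentences, all_sentences, ratio=1.5, top=10):
    """Most common words in char_sentences that beat the baseline.

    A word is kept when its relative frequency among the character's
    sentences is at least ratio times its relative frequency overall.
    """
    char = word_counts(char_sentences)
    base = word_counts(all_sentences)
    char_total = sum(char.values())
    base_total = sum(base.values())
    kept = [
        w
        for w, c in char.items()
        if c / char_total >= ratio * base[w] / base_total
    ]
    # Stable sort: ties keep the order in which words were first seen.
    return sorted(kept, key=lambda w: -char[w])[:top]
```

Words like "the" appear at about the same rate everywhere, so they fall below the ratio and drop out, while words tied to a character survive.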
The code to perform these analyses and generate all the plots found in this article is available publicly on Github.
[Update] It turns out I'm not the only one out there doing analyses on Fimfiction. Here are some more analyses in reverse-chronological order, with emphasis on the ones I think you're most likely to enjoy if you found this article intriguing:
- Chukker: Analyzing Fimfiction data with Tableau and postgres, part 1
- Chinchillax: 2016.05.25 Fimfiction Archive release statistics
- Bad Horse: What tags correlate with popularity on Fimfiction
- Bad Horse: The Great Shakespeare / My Little Pony Showdown: Part 2
- Bad Horse: Writer attrition: 1/3 per year
- Bad Horse: Stylometrics and part 2, part 3
- Bad Horse: Does grammar matter?
- Bad Horse: Ranking authors
Let me know if you come across any others & I'll link them here.