I’ve been meaning to write this article for about 6 months now, and it’s about time I got it off my chest. I’ll preface this by saying I don’t “know the lingo” of statistics and might get a few phrasings incorrect, but the article should otherwise stand on its own.
The Problem
Protonaut is a game that relies heavily on user-generated content, and I wasn’t sure how to approach presenting it. As Community Manager for Fantastic Contraption, I’ve watched the community struggle with content management. Common questions or feature requests would revolve around a few key issues:
- How do I find the latest good levels to play?
- How do I find the best all-time content?
- How do I know which of my levels were the most popular... in the last week?
- What are some promising newly-created levels?
- What new levels are complete crap?
- What authors produce the best levels?
- How can we deal with people who always down-vote content other than their own?
- And conversely, how do we deal with people who only up-vote their own content?
Even with me diving into Google Analytics for the whole domain (or using some third-party applications), most of these questions are impossible to answer; even those that came close were very difficult to deal with in an in-game setting.
The Common Solution
Most games implement the most basic rating system: offering a choice of (usually) 0 to 5 stars, and letting you sort the resulting list by average score. This, unfortunately, has dozens of problems. The only real value in such a system is finding the “best score of all time,” and even then it does so poorly (it doesn’t account for chronic upvoters or downvoters). Not to say that it isn’t easy to implement; indeed, it could hardly get simpler, and any lazy programmer such as myself will argue that “good enough is perfect.” But is it?
Each ranking system has several components: the voting system, the ranking system, the sorting system, and an optional adjustment system. There are various pros and cons to consider in the foundation of your system, the voting portion: Should you simply have a “like” button, as in your Facebook feed? Should you have an upvote/downvote system? Should you have a scale of 3 stars? 5 stars? 9 chainsaws? 10 stars?
Each choice you make in the construction of your content-feedback system adds a layer of complexity and results in a whack of cascading consequences that will likely change your subsequent decisions. What looks simple on the surface can ruin your community-friendly ecosystem, and what looks dauntingly complex is shied away from by most developers.
Thankfully, many people have done the thinking for us, and this will be one more article for the mix.
Understanding the Consequences
There are dozens of research papers out there on some of the different voting styles. I’m too lazy to make it convenient for you by linking some here; it’s been 6 months since I’ve read up on them, but feel free to Google them for yourself. Some of the key takeaways from actual research have been:
- Given a neutral choice in addition to a positive/negative choice, neutral will be chosen way more often than the others. Since neutral is a non-vote that hardly affects the statistics, it’s just giving someone an “out” from making a decision.
- Given a large-scale choice (such as 1-10 stars), people will either vote to the extremes (1 or 10) to game the system, or vote 7. “7” votes are so statistically huge that any 10-star system skews towards it; it becomes the new “neutral” and is the “safe vote” that makes everyone feel happy. There’s even a running joke that game magazines only ever rank video games “7/10”.
- Given a “like” button with no alternative, or only positive-scale responses, it’s hard to generate true statistics (there’s no way to differentiate between “all-right” and “this content is so horrible nobody should ever see it”).
- After your user base reaches a large enough size, more and more people will take to “strategic voting”: voting the maximum on things they like and the minimum on things they dislike.
- Giving an incentive to voting will drastically increase your response rate, but also predictably produce spam votes, or votes “just for the reward” without any thought. It will also amplify any of the above listed effects.
I strongly recommend you take a look at your planned voting style and do a full bout of research on it. Check out Google Scholar and find some actual results, instead of going from what your heart says. Or better yet, build a system however you want and observe the results, and change them to suit your needs.
How Bayesian Sets Up The Stage
A Bayesian Rating system tackles most of our problems head-on. Let’s step through how it works.
Let’s say we’re dealing with user-generated levels in a video game. A brand new level shouldn’t be rated at 0; users will never find your new content, as it would be buried under all your crap content. Likewise, it’s unfair to start new levels at the maximum possible score, as people will spam bad levels just to remain at the top of the list. Most often, people will set the starting Bayesian rating somewhere in the middle, say 50%. This way, every level starts off with a “fair chance”.
As more and more people vote on your level, the score will start to normalize towards the level’s actual average vote instead of the artificially-engineered 50% rating. This way, if a single user-made level gets 10 up-votes in a row, its rating will climb along these lines:
- 50% with 0 votes
- 60% with 1 vote
- 80% with 4 votes
- 90% with 8 votes
- 100% with 10 votes
The magic behind the Bayesian system is this weighting (or dampening) system. It’s influenced by what is referred to as a “magic number”; this figure tells you how many votes need to be cast before the Bayesian weighting stops dominating, and the score truly reflects the average of the votes the level actually received.
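One common way to get this dampening, and the one that matches the formula quoted later in this article, is to treat the magic number as a pile of phantom votes cast at the starting value. Here is a minimal Python sketch under that assumption; the function name and the defaults of 50% and 10 votes are mine, purely for illustration. Note that with this reading the score creeps towards the real average gradually rather than switching over at an exact vote count, so the figures come out lower than the rounded illustration above.

```python
def damped_rating(up_votes, total_votes, prior=0.5, magic_number=10):
    """Blend real votes with `magic_number` phantom votes cast at `prior`.

    Up-votes count as 1 and down-votes as 0, so the result is a 0..1 score.
    """
    return (magic_number * prior + up_votes) / (magic_number + total_votes)


# A level that receives nothing but up-votes:
for n in (0, 1, 4, 8, 10):
    print(n, "votes ->", round(damped_rating(n, n) * 100, 1), "%")
# 0 votes -> 50.0 %
# 1 votes -> 54.5 %
# 4 votes -> 64.3 %
# 8 votes -> 72.2 %
# 10 votes -> 75.0 %
```

At exactly `magic_number` votes the real votes and the phantom votes carry equal weight, which is why ten straight up-votes land halfway between 50% and 100%.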
The weighting system actually takes into account the variety of scores, as well. If you only have two levels, and one gets a 75% score while the other gets a 30% score, the 30% level will show up as 0 stars and the 75% level as 10 stars. There is always a minimum and a maximum represented, and everything in between is taken as a scale of those two figures. This solves a great deal of weighting and balancing problems with large-range systems, and compensates for situations where nobody has ever voted 100% or 0% (or where everybody votes for a narrow range of percentages).
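That stretching step can be sketched as a plain min-max rescale. This is only one way to read the paragraph above: the function name, the 10-star ceiling, and the choice to show the midpoint when every score is identical are all assumptions of mine.

```python
def rescale_to_stars(scores, max_stars=10):
    """Stretch raw 0..1 scores so the lowest maps to 0 stars and the highest to max_stars."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Everything scored the same; showing the midpoint is an arbitrary choice.
        return {name: max_stars / 2 for name in scores}
    return {name: max_stars * ((s - lo) / (hi - lo)) for name, s in scores.items()}


print(rescale_to_stars({"level_a": 0.75, "level_b": 0.30}))
# {'level_a': 10.0, 'level_b': 0.0}
```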
Essentially, what the weighting mechanic is doing is “making up” votes for things that don’t have enough votes to get real meaningful data to start with.
How We Can Take It To The Next Level
It’s not just that simple though! We can crank this baby up to the next level so you never need to tune it yourself ever again.
For example, who says 50% is the optimal “starting value”? Isn’t it more fair to make the entry-level rating the average of all current scores? Why not! You can calculate your new starting value on-the-fly by looking at all your current content. This means that every user-contributed piece of content will have its relative score affected by every single vote cast.
And the magic doesn’t stop there; why hard-code your magic number to something like “10 votes” when you can make it a variable as well? Make the magic number equivalent to the average number of votes across all your content. This again will make the scores of everything change with a single vote cast, but it helps corral fringe cases and unexpected userbase expansion (or contraction) as well. Imagine you run an online store where almost everything gets a 60% vote. Then all of a sudden, one particular item you have for sale attracts a huge bag of traffic and everyone votes 100% on this new product (without casting votes on all your older stuff).
As each vote is cast, your older products would normally be pushed farther and farther down the “top” lists. However, they are also sinking deeper and deeper into the “magic number”, so their scores are being balanced more towards your 0-vote figure. This figure in turn is going upwards, as the new product is heavily increasing your average score!
The system is amazingly resilient and self-balancing. If a lot of levels are made and nobody votes, the system compensates. If a single item gets a lot of votes, the system compensates. If you have a thousand votes on every item, the system compensates.
Protonaut uses all of these tweaks for its level ranking and system generation. To keep my database from being fried, I recalculate my Bayesian constants once per day (instead of on every vote being cast), so as the clock turns over at midnight you will often see levels rise or drop in ranking to reflect the day’s votes.
Furthermore, Protonaut looks at your average vote cast and uses that as a weight against how much your vote should be “worth” in the first place. This makes everyone’s vote count, but chronic upvoters (or downvoters!) will be technically punished for their preference.
This also helps stem runaway-voting on levels; a popular level floats to the top of the list, everyone plays it, and everyone votes it upward because it’s good — well if you only ever vote up, the system knows this and scales your votes back a bit. Players that explore the deep content of the game (as opposed to just browsing the top ten list) get a heftier vote. If that wasn’t fancy enough, the average-user-vote value is bayesian-ranked itself!
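The article doesn’t spell out the exact weighting, so the following Python sketch is only one plausible scheme, not Protonaut’s actual code. It assumes +/- votes, dampens each voter’s own up-vote rate with the same phantom-vote trick (so new voters start out neutral), and then makes a vote worth more the less typical it is for that voter. Every name and constant here is hypothetical.

```python
def damped_up_rate(user_ups, user_total, global_up_rate=0.5, magic_number=10):
    """Voter's up-vote rate, pulled towards the global rate until they have voted a lot."""
    return (magic_number * global_up_rate + user_ups) / (magic_number + user_total)


def vote_weight(is_up_vote, user_ups, user_total, global_up_rate=0.5, magic_number=10):
    """Weight of one vote: 1.0 for a balanced voter, less for a voter's habitual choice."""
    p = damped_up_rate(user_ups, user_total, global_up_rate, magic_number)
    return 2 * (1 - p) if is_up_vote else 2 * p


print(vote_weight(True, 0, 0))               # brand new voter: 1.0
print(round(vote_weight(True, 90, 90), 2))   # chronic upvoter voting up: 0.1
print(round(vote_weight(False, 90, 90), 2))  # chronic upvoter voting down: 1.9
```

The factor of 2 is just there so an even-handed voter comes out at exactly one vote; a gentler curve would scale chronic voters back “a bit” rather than this sharply.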
And BAM — there we have it! A benefit for voting, that self-regulates and balances as well. It’s a whole lot harder to game this system, and it rewards the true fans of the game.
Furthermore, I optionally artificially reduce each Bayesian-generated score by how many level IDs precede it, thus making it an excellent adjustable time-based system (a truly epic level could stay at the top of the list, but most levels will disappear in the natural “churn” of things).
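A rough Python sketch of that time-based adjustment, assuming level IDs are handed out sequentially and reading the penalty as “how many newer levels now sit ahead of this one”. The linear penalty and the `decay_per_level` knob are my own illustrative choices, not values from Protonaut.

```python
def age_adjusted(score, level_id, newest_id, decay_per_level=0.001):
    """Knock a small amount off a 0..1 score for every level created after this one."""
    newer_levels = newest_id - level_id
    return score - decay_per_level * newer_levels


# An old 90% level versus a fresh 60% level, 500 levels later:
print(round(age_adjusted(0.90, level_id=100, newest_id=600), 3))  # 0.4
print(round(age_adjusted(0.60, level_id=600, newest_id=600), 3))  # 0.6
```

Setting `decay_per_level` to 0 switches the adjustment off, which is what makes it optional; a smaller value lets a truly epic level hang on at the top for longer.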
It really is a beautiful system. And easy to implement, too. Taken from the link below, this is the basic bayesian formula:
Use this equation:
br = ( (avg_num_votes * avg_rating) + (this_num_votes * this_rating) ) / (avg_num_votes + this_num_votes)
Legend:
- avg_num_votes: The average number of votes of all items that have num_votes>0
- avg_rating: The average rating of each item (again, of those that have num_votes>0)
- this_num_votes: number of votes for this item
- this_rating: the rating of this item
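The formula and legend translate almost line for line into Python. This is a minimal sketch, not production code: the dictionary layout, the 0..1 rating scale, and the `fallback_rating` used when nothing has been voted on yet are assumptions of mine.

```python
def bayesian_ratings(items, fallback_rating=0.5):
    """Apply the br formula to every item.

    `items` maps a name to (this_num_votes, this_rating), with ratings in 0..1.
    `fallback_rating` is returned for everything if no item has any votes.
    """
    voted = [(n, r) for n, r in items.values() if n > 0]
    if not voted:
        return {name: fallback_rating for name in items}
    # Both constants only look at items that have at least one vote.
    avg_num_votes = sum(n for n, _ in voted) / len(voted)
    avg_rating = sum(r for _, r in voted) / len(voted)
    return {
        name: (avg_num_votes * avg_rating + n * r) / (avg_num_votes + n)
        for name, (n, r) in items.items()
    }


levels = {
    "old_favourite": (40, 0.9),
    "one_lucky_vote": (1, 1.0),
    "brand_new": (0, 0.0),
    "dud": (19, 0.2),
}
for name, br in bayesian_ratings(levels).items():
    print(name, round(br, 3))
```

Here the constants work out to an average of 20 votes and an average rating of 0.7, so the unvoted level starts at about 0.7, the level with a single perfect vote only reaches about 0.714, and the well-voted favourite earns about 0.833.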
Some links for further reading
Have any questions about Bayesian ratings? Used them yourself? Success story? Horror story? Post it to the comments! I’d love to hear about it.

Lost Garden
And since I forgot to address it (and it’s not really bayesian specific) – Protonaut uses a simple +/- voting system with no neutral vote. People have to have an opinion, no matter how slight!
I was careful to phrase them nicely, too: “Worth Playing” or “Not Worth Playing”. It helps to stem preference issues: some people think “10 stars” should mean “best level ever” and some take it to mean “great job! one of the best!”. Not having properly labelled ratings is suicide! You need to bracket your votes with context.
To get over the lack-of-voting you get when you have neutral votes, there is a super-secret set of gold coins you collect per-vote. When this system goes public it’ll help motivate more people to get involved.
That’s a pretty interesting use of Bayesian weighting. Did any of this go into Fantastic Contraption too?
Unfortunately, no! I’m hoping I can change that someday, though. :)
Great post, thanks for shedding some light on how you manage your community. Another avenue which might be worth exploring is collaborative filtering. You could potentially have your players help each other find levels they like, sort of similar to how Netflix will recommend a movie for you based on your past rentals. Though it’s an open question whether such an improved filtering mechanism would be worth requiring users to enter additional metadata about each level.
Great article, 7/10! ;)
Andy, interesting post. Rankings/ratings and that paraphernalia happen to be my daily work, so thought I’d chime in.
One way I’ve used recently on a multi-country emerging market study is conjoint analysis; that’s where people have a list of choices and must also rank those choices. Along with your Bayesian, this can give greater nuance (i.e. it switches from a one-dimensional list sorted in some way to a two-dimensional grid, also sorted in some way).
The one I prefer for everyday use, though, is a self-benchmarked relative rating system (a demo of which can be downloaded from my website).
Without going into the maths, consider the process that Amazon uses to recommend books to users: their ranking is based both on what you purchased (but may not have enjoyed) as well as on what you searched for (but didn’t buy, and probably weren’t interested in). In other words, while this type of ranking system appears to be sophisticated (and does increase sales), it still does a poor job of recommending things I will really enjoy.
The alternative is for a system that has all possible books in it grouped into segments that are relevant to me and, whenever I’m interested in finding a new book, I compare all books in the system to the type of book I have read before and am looking for now.
Feel free to email me if you find this approach interesting.