Game Graduate

Your Game Guru

What is A/B Testing? – Bayesian Approach vs Null Hypothesis

Screenshot from the GDC talk "A/B Testing for Game Design Iteration: A Bayesian Approach". YouTube link: https://www.youtube.com/watch?v=-OfmPhYXrxY

TAKEAWAYS

  • General lifecycle of development journey
  • Basic approach of A/B Testing
    • Also A/B/C Testing…
  • What can be tested; and what is noise?
  • What is Null Hypothesis and p-value?
  • What is Bayesian Approach and prior?
  • Player and time independence assumptions

Update from the future!

Hello everyone! In this blog post, I want to share my takeaways from a beautiful GDC talk. I wrote this post roughly a year ago and published it on one of my other blogs. When I decided to focus on this blog, I thought the post fit this blog’s theme better, so I revised it with my own humble additions and opinions before sharing it here. If you want to read more of my posts about the technical side of games, don’t forget to check the Technical Spot of Games category.

We, as game developers, try to publish our games as fast as possible most of the time. First comes the development phase, which ends with the Beta version. This is also called the “Soft Launch”. Then we test our beta and continue development until the “Launch”. After that, our development journey continues as a service: dev + test: release, dev + test: release, …

This development journey is made up of our good and bad design choices, whose quality we can only estimate through several proxy metrics. Whatever design choice we make, it should bring in enough (or hopefully more) acquisition to justify the effort.

Mr. Steven Collins from Swrve highlights one important point: a large share of our revenue, about 50% of it, comes in the first 30 days after installation. Therefore, we need to make good design choices, iterations, and “tests” early on if we want good revenue.

General Lifecycle of The Game and A/B Testing

Our general lifecycle of the game is formed by three main points: Understand, Test Hypotheses, Take Action; then repeat.

Understand: We use metrics to analyze and understand users’ behaviors.

Test Hypotheses: Then we test our reasonable hypotheses.

Take Action: Finally, we choose to accept or reject our hypotheses.

A/B Testing is also similar to this approach:

  1. We split our players into different groups
  2. Different groups play different versions of our game
  3. We analyze and measure the test results
  4. If there is a clear difference, we pick the winning version
  5. Then we deploy the winning version to all of our players
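The splitting step in the list above can be sketched in a few lines. This is only a minimal sketch, not how any particular analytics platform does it: the function name `assign_variant`, the experiment name, and the player ids are all hypothetical. It hashes the player id together with the experiment name, so each player always lands in the same group without any stored state.

```python
import hashlib
from collections import Counter


def assign_variant(player_id: str, experiment: str, variants=("A", "B")) -> str:
    """Deterministically map a player to one variant of an experiment.

    Hypothetical helper: hashing (experiment, player_id) gives a stable,
    roughly uniform assignment, so a player sees the same version every session.
    """
    key = f"{experiment}:{player_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


# Example: split 10,000 made-up player ids for a made-up tutorial experiment.
counts = Counter(assign_variant(f"player-{i}", "tutorial_length") for i in range(10_000))
print(counts)  # roughly half of the players in each group
```

Passing `variants=("A", "B", "C")` gives the A/B/C split with the same code.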

A/B Testing Explanation with an Example

We have different servers: one is our actual game server and the other is for A/B testing. The A or B version is published to the users we want to test. Here comes a big question: what can we test? The answer is simple: everything, even button positions in the UI. Let’s discuss one example of A/B testing:

You have a tutorial at the beginning of a new game. The tutorial has 10 stages; however, people tend to skip it at stage 5. This should give you a clue that something is wrong with the tutorial: maybe it is too hard, too boring, too long, etc. You can prepare two variations of the tutorial: one with the same difficulty but only 5 stages, and a second with easier parts but all 10 stages. Then you analyze the results again to try to understand the problem.

The game economy is a topic of its own, but you can use A/B testing for in-game stores too. You can offer the same package at the same price (let’s say $5) but present it with different discounts (let’s say A: a 50% discount that ends up at $5; B: a 20% discount that ends up at $5). Some discount rates can look fishy and put players off. You can also test different package contents at the same price (A: 100 gems for $5; B: 150 gems for $5) to see their effects. To avoid being unfair to some players, you can keep the value per gem equal and see how different-looking prices affect players’ decisions (A: 100 gems for $5; B: 150 gems for $7.50).

Figure 1: VIPs repeat their shopping at $70. Graph from [1]

The bottom part of the graph shows the first purchases. You can see that some players keep buying the $70 packages. With A/B testing, you can change a $15 price to $16 and $14, then see how your users react to those changes. Do not forget that even such small changes can have a huge impact on players’ choices.

Testing call-to-action Choices

Let’s discuss call-to-action choices. The biggest mistake can be asking players to rate the game immediately after installation. “Do you want to share this screenshot on Twitter?” and “Do you want to watch a video to double your reward?” are just two examples of actions that can be tested. The key point: you do not need to know the single best place to put a call-to-action up front; you will usually have several candidate ideas instead. Try them all to see when and where they work best for your users.

Noise & Wrong Analysis in Testing

Noise and wrong analysis are your two enemies while doing A/B testing. You cannot say test A is better than test B after two days; maybe one week later, B will dramatically overtake A. You should have a mathematical model to understand what the data from your users “actually” shows. What is hidden beneath the data? Draw your conclusions only once the data is reliable enough to support them.

Conversion Rate should be increased with testing

Conversion rate is another basic topic that needs to be considered. To explain this term with an example: if 50% of your users come back and play your game on day 2, then your conversion rate for day 1 is 50%. This can show how effective your “welcome” design is for your players. Let’s assume that you made a big change in an A/B test and your conversion rate for 70 players is now 10%. Is this bad? How should we approach this result?

Null Hypothesis

Under the null hypothesis, we assume that different entities have no effect on each other: a modification to A has no consequence for B. In A/B testing terms, this means assuming that our change made no difference to the metric. Here is a good figure that illustrates the definition:

Figure 2: The null hypothesis explained. Figure from [2]

According to the null hypothesis, there is no relationship between the day 1 conversion rate and the day 2 conversion rate. Our job is to try to disprove this. At the start, we do not have enough evidence to reject it, so we keep it. If the result is far off the chart, like 0%, we can start to think the hypothesis is wrong.

Figure 3: Figure from [3]

But what does this 30% conversion rate tell us? It only tells us that it was not sustainable, nothing more. The dark orange parts of the graph are the extreme tails of the distribution. If our conversion rate lands in these areas, we can reject the hypothesis. The p-value is the probability of seeing a result at least this extreme if the null hypothesis were true. It lies between 0 and 1, and if it is less than 0.05, we usually treat the change as significant. In other words, under the null hypothesis there is less than a 5% chance of seeing such a result. As you can see, p-values are hard for us as game designers to reason about intuitively; instead, we need something that tells us whether our design choice is good.
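To make the p-value less abstract, here is a minimal sketch that computes one by hand. It borrows the numbers mentioned in this post (a 30% rate, and 10% of 70 players), but treating 30% as the baseline under the null hypothesis is my own assumption, and the function names are hypothetical. It uses an exact one-sided binomial calculation instead of the bell curve from the figure.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k conversions out of n players at true rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def lower_tail_p_value(conversions: int, players: int, null_rate: float) -> float:
    """One-sided p-value: chance of seeing this few conversions or fewer
    if the null hypothesis (true rate == null_rate) were correct."""
    return sum(binomial_pmf(k, players, null_rate) for k in range(conversions + 1))


# Assumed numbers: 30% baseline, 70 players in the test, 7 of them converted (10%).
p_value = lower_tail_p_value(conversions=7, players=70, null_rate=0.30)
print(f"p-value = {p_value:.6f}")
print("reject the null hypothesis" if p_value < 0.05 else "not enough evidence")
```

Note what the number does and does not say: it is the probability of the data under the null hypothesis, not the probability that the design change is good or bad.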

Problem of Noise in Testing

Another problem with this approach is noise: when the p-value drops below 5%, it is tempting to say “yeah, this is what I want”. But it may be nothing more than noise: a false positive.

There is also the family-wise error problem: the more treatments we test, the higher the chance of getting at least one false positive.
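A quick back-of-the-envelope sketch shows how fast this grows. It assumes the tests are independent and each one uses the same 5% threshold; the function name is just for illustration.

```python
def family_wise_error_rate(alpha: float, num_tests: int) -> float:
    """Chance of at least one false positive across num_tests independent tests,
    each run at significance level alpha."""
    return 1 - (1 - alpha) ** num_tests


for num_tests in (1, 3, 10, 20):
    rate = family_wise_error_rate(0.05, num_tests)
    print(f"{num_tests:>2} treatments -> {rate:.1%} chance of a false positive")
```

With 10 treatments at the 5% level, the chance of at least one false positive is already about 40%.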

A better approach to A/B testing is the Bayesian approach. With the example given before, we can say that our 30% conversion rate is our “retention” rate. We know that, without any changes, we can reach this rate. Then we build our test upon this data. The Bayesian approach gives us the “probability” of the model given the data.

Classical Example for Test: Tossing the Coin

Figure 4: Tossing a coin. Figure from [4]

Let’s try to clarify this approach with a classical example: tossing a coin. We simply have two possible outcomes when we toss, heads and tails, each with a probability of 0.5. In the Bayesian approach, even for this basic example, we have noise at the beginning of the experiment, as you can see in Figure 4. Getting more observations is the essential step toward getting as close to the true value as possible.
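The coin example can be simulated in a few lines. This is a sketch under my own assumptions, not the model from the talk: I use a flat Beta(1, 1) prior (a common textbook choice for a coin), a simulated fair coin with a fixed seed, and hypothetical helper names. It shows the estimate being noisy at first and tightening as observations accumulate.

```python
import random
from math import sqrt


def beta_posterior(heads: int, tails: int, prior_a: float = 1.0, prior_b: float = 1.0):
    """Posterior Beta(a, b) parameters for the coin's heads probability,
    starting from a Beta(prior_a, prior_b) prior (Beta(1, 1) is flat)."""
    return prior_a + heads, prior_b + tails


def beta_mean_and_sd(a: float, b: float):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    sd = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, sd


rng = random.Random(42)  # fixed seed so the run is repeatable
heads = tails = 0
summary = {}
for toss in range(1, 1001):
    if rng.random() < 0.5:  # simulated fair coin
        heads += 1
    else:
        tails += 1
    if toss in (10, 100, 1000):
        summary[toss] = beta_mean_and_sd(*beta_posterior(heads, tails))
        mean, sd = summary[toss]
        print(f"after {toss:>4} tosses: estimate = {mean:.3f} +/- {sd:.3f}")
```

The "+/-" value is the posterior standard deviation, which shrinks as the number of tosses grows.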

We need to be clear that we want the probability of a model given the data, not the probability of the data given a model. In other words, we want to be able to say how confident we are that our conversion rate is 30%. Mr. Collins gives a valuable example to clarify this point:

Figure 5: Figure from [5]

The probability of it being cloudy given that it is raining is not the same as the probability of it raining given that it is cloudy. We want to derive the right side from the left side we are given.
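The link between the two sides is Bayes' theorem. Written out for the weather example, and then for the general "model given data" case we care about in A/B testing (the labels are mine, the figure in the talk may use different notation):

```latex
P(\text{rain} \mid \text{cloudy})
  = \frac{P(\text{cloudy} \mid \text{rain}) \, P(\text{rain})}{P(\text{cloudy})}
\qquad\Longrightarrow\qquad
P(\text{model} \mid \text{data})
  = \frac{P(\text{data} \mid \text{model}) \, P(\text{model})}{P(\text{data})}
```

Here P(model) is the prior, P(data | model) is the likelihood, and P(model | data) is the posterior we actually want.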

Figure 6: Figure from [6]

We continue with more iterations, and every iteration leads us to a better result. We repeat this until we reach “actionable” certainty. Every repetition brings more data, and more data increases our certainty.
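One way to put a number on “actionable” certainty is to estimate the probability that version B really converts better than version A. The sketch below is my own illustration, not the method from the talk: it assumes a flat Beta(1, 1) prior for each version, made-up conversion counts, and a hypothetical function name, and it estimates the probability by sampling from the two posteriors.

```python
import random


def probability_b_beats_a(conv_a, n_a, conv_b, n_b, samples=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) given observed conversions.

    Each variant's conversion rate gets a flat Beta(1, 1) prior, so its
    posterior is Beta(1 + conversions, 1 + non-conversions).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(samples):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / samples


# Made-up results: same observed rates (30% vs 36%), growing sample sizes.
for players in (50, 200, 1000):
    conv_a, conv_b = round(0.30 * players), round(0.36 * players)
    prob = probability_b_beats_a(conv_a, players, conv_b, players)
    print(f"{players:>4} players per group: P(B beats A) = {prob:.3f}")
```

The observed rates stay the same in every row; only the amount of data changes, and with it how sure we can be that B is the better version.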

What is Prior in the topic of A/B Testing?

Setting the “prior” is another step to consider carefully. The prior is our starting point: what we believe, and how confidently, before we see any data. As the experiment keeps running, the results move either closer to or further from our prior, which shows how accurate our initial setup was.
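A small sketch of how much the prior matters when data is scarce. The Beta prior and the specific parameters are my assumptions for illustration: Beta(1, 1) stands for "no strong belief", while Beta(30, 70) encodes a belief of about 30% that is worth roughly 100 earlier players.

```python
def posterior_mean(conversions, players, prior_a, prior_b):
    """Mean of the Beta posterior for a conversion rate.

    The prior Beta(prior_a, prior_b) acts like prior_a earlier conversions
    and prior_b earlier non-conversions added to the observed data.
    """
    a = prior_a + conversions
    b = prior_b + (players - conversions)
    return a / (a + b)


# Assumed data: 7 conversions out of 70 players (10%).
flat = posterior_mean(7, 70, prior_a=1, prior_b=1)        # no strong belief
informed = posterior_mean(7, 70, prior_a=30, prior_b=70)  # "we usually see ~30%"
print(f"flat prior:     {flat:.3f}")
print(f"informed prior: {informed:.3f}")
```

With the flat prior the estimate follows the data almost exactly (about 0.11), while the informed prior pulls it back toward 30% (about 0.22) until more data arrives.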

There is also a more advanced version of the A/B test, run with three versions of the game (A/B/C testing). In this case, we have a volumetric graph in which we deal with three dimensions.

Mr. Collins talks about general assumptions that we need to consider:

  • Users are independent from each other
    • This assumption does not always hold, especially if you are running a test for multiplayer, team-based games.
  • Probability of conversion does not depend on time
    • This is not a very accurate assumption either; 8 o’clock in the morning will give you very different results from 1 PM on a Sunday.

Some of the benefits and features of the Bayesian Approach are:

  • You can continuously observe the results on your graph.
  • Population size is not fixed during the tests.
  • We have the precious concept of the prior, through which we can rely on our previous experience and/or knowledge.
  • We get an accurate probability.
  • We have the opportunity to consider the magnitude of the difference.
  • Lots of different situations can be adapted to this approach.


RESOURCES

[1], [3], [4], [5], [6]: from the video above, https://www.youtube.com/watch?v=-OfmPhYXrxY&feature=youtu.be

[2] — https://www.thoughtco.com/thmb/ayMTs7HtvLoJHeWqqN7C7a-l9Oo=/1333×1000/smart/filters:no_upscale()/null-hypothesis-examples-609097_FINAL-100262e70b70426fb0633304eb2f49f4.png