What’s Wrong With A/B Testing

A/B testing is an internet marketing standard. In order to optimize response rates, you compare one page against another. You run with the page that gives you the best response rates.

But anyone who has tried A/B testing will know that whilst it sounds simple in concept, it can be problematic in execution. For example, it can be difficult to determine if what you’re seeing is a tangible difference in customer behaviour or simply a result of chance. Is A/B testing an appropriate choice in all cases? Or is it best suited to specific applications? Does A/B testing obscure what customers really want?

In this article, we’ll look at some of the gotchas for those new to A/B testing.

1. Insufficient Sample Size

You set up test. You’ve got one page featuring call to action A and one page featuring call to action B. You enable your PPC campaign and leave it running for a day.

When you stop the test, you’ve found call-to-action A converted at twice the rate of call-to-action B. So call-to-action A is the winner and we should run with it, and eliminate option B.

But this would be a mistake.

The sample size may be insufficient. If we only tested one hundred clicks, we might get a significant difference in results between two pages, but that change doesn't show up when we get to 1,000 clicks. In fact, the result may even be reversed!

So, how do we determine a sample size that is statistically significant? This excellent article explains the maths. However, there are various online sample size calculators that will do the calculations for you, including Evan’s. Most A/B tracking tools will include sample size calculators, but it’s a good idea to understand what they’re calculating, and how, to ensure the accuracy of your tests.

In short, make sure you've tested enough of the audience to determine a trend.

2. Collateral Damage

We might want to test a call to action metric. We want to test the number of people who click on the “find out more” link on a landing page. We find that a lot more people click on this link we use the term “find out more” than if we use the term “buy now”.

Great, right?

But what if the conversion rate for those who actually make a purchase falls as a result? We achieved higher click-thrus on one landing page at the expense of actual sales.

This is why it’s important to be clear about the end goal when designing and executing tests. Also, ensure we look at the process as a whole, especially when we’re chopping the process up into bits for testing purposes. Does a change in one place affect something else further down the line?

In this example, you might A/B test the landing page whilst keeping an eye on your total customer numbers deeming the change effective only if customer numbers also rise. If your aim was only to increase click-thru, say to boost quality scores, then the change was effective.

3. What, Not Why

In the example above, we know the “what”. We changed the wording of a call-to-action link, and we achieved higher click thru’s, although we’re still in the dark as to why. We’re also in the dark as to why the change of wording resulted in fewer sales.

Was it because we attracted more people who were information seekers? Were buyers confused about the nature of the site? Did visitors think they couldn’t buy from us? Were they price shoppers who wanted to compare price information up front?

We don’t really know.

But that’s good, so long as we keep asking questions. These types of questions lead to more ideas for A/B tests. By turning testing into an ongoing process, supported by asking more and hopefully better questions, we’re more likely to discover a whole range of “why’s”.

4. Small Might Be A Problem

If you’re a small company competing directly with big companies, you may already be on the back foot when it comes to A/B testing.

It’s clear that its very modularity can cause problems. But what about in cases where the number of tests that can be run at once is low? While A/B testing makes sense on big websites where you can run hundreds of tests per day and have hundreds of thousands of hits, only a few offers can be tested at one time in cases like direct mail. The variance that these tests reveal is often so low that any meaningful statistical analysis is impossible.

Put simply, you might not have the traffic to generate statistically significant results. There’s no easy way around this problem, but the answer may lay in getting tricky with the maths.

Experimental design massively and deliberately increases the amount of variance in direct marketing campaigns. It lets marketers project the impact of many variables by testing just a few of them. Mathematical formulas use a subset of combinations of variables to represent the complexity of all the original variables. That allows the marketing organization to more quickly adjust messages and offers and, based on the responses, to improve marketing effectiveness and the company’s overall economics

Another thing to consider is that if you’re certain the bigger company is running A/B tests, and achieving good results, then “steal” their landing page*. Take their ideas for landing pages and use that as a test against your existing pages. *Of course, you can’t really steal their landing page, but you can be "influenced by” their approach.

What your competitors do is often a good starting point for your own tests. Try taking their approach and refine it.

5. Might There Be A Better Way?

Are there alternatives to A/B testing?

Some swear by the Multi Armed Bandit methodology:

The multi-armed bandit problem takes its terminology from a casino. You are faced with a wall of slot machines, each with its own lever. You suspect that some slot machines pay out more frequently than others. How can you learn which machine is the best, and get the most coins in the fewest trials?
Like many techniques in machine learning, the simplest strategy is hard to beat. More complicated techniques are worth considering, but they may eke out only a few hundredths of a percentage point of performance.

Then again…..

What multi-armed bandit algorithm does is that it aggressively (and greedily) optimizes for currently best performing variation, so the actual worse performing versions end up receiving very little traffic (mostly in the explorative 10% phase). This little traffic means when you try to calculate statistical significance, there’s still a lot of uncertainty whether the variation is “really” worse performing or the current worse performance is due to random chance. So, in a multi-armed bandit algorithm, it takes a lot more traffic to declare statistical significance as compared to simple randomization of A/B testing. (But, of course, in a multi-armed bandit campaign, the average conversion rate is higher).

Multivariate testing may be suitable if you’re testing a combination of variables, as opposed to just one i.e.

Product Image: Big vs. Medium vs Small
Price Text Style: Bold vs Normal
Price Text Color: Blue vs. Black vs. Red

There would be 3x2x3 different versions to test.

The problem with multivariate tests is they can get complicated pretty quickly and require a lot of traffic to produce statistically significant results. One advantage of multivariate testing over A/B testing is that it can tell you which part of the page is most influential. Was it a graphic? A headline? A video? If you're testing a page using an A/B test, you won't know. Multivariate testing will tell you which page sections influence the conversion rate and which don’t.

6. Methodology Is Only One Part Of The Puzzle

So is A/B testing worthwhile? Are the alternatives better?

The methodology we choose will only be as good as the test design. If tests are poorly designed, then the maths, the tests, the data and the software tools won’t be much use.

To construct good tests, you should first take a high level view:

Start the test by first asking yourself a question. Something on the lines of, “Why is the engagement rate of my site lower than that of the competitors…..Collect information about your product from customers before setting up any big test. If you plan to test your tagline, run a quick survey among your customers asking how they would define your product.

Secondly, consider the limits of testing. Testing can be a bit of a heartless exercise. It’s cold. We can’t really test how memorable and how liked one design is over the other, and typically have to go by instinct on some questions. Sometimes, certain designs just work for our audience, and other designs don’t. How do we test if we're winning not just business, but also hearts and minds?

Does it mean we really understand our customers if they click this version over that one? We might see how they react to an offer, but that doesn’t mean we understand their desires and needs. If we’re getting click-backs most of the time, then it’s pretty clear we don’t understand the visitors. Changing a graphic here, and wording there, isn’t going to help if the underlying offer is not what potential customers want. No amount of testing ad copy will sell a pink train.

The understanding of customers is gained in part by tests, and in part by direct experience with customers and the market we’re in. Understanding comes from empathy. From asking questions. From listening to, and understanding, the answers. From knowing what’s good, and bad, about your competitors. From providing options. From open communication channels. From reassuring people. You're probably armed with this information already, and that information is highly useful when it comes to constructing effective tests.

Do you really need A/B testing? Used well, it can markedly improve and hone offers. It isn't a magic bullet. Understanding your audience is the most important thing. Google, a company that uses testing extensively, seem to be most vulnerable when it comes to areas that require a more intuitive understanding of people. Google Glass is a prime example of failing to understand social context. Apple, on the other hand, were driven more by an intuitive approach. Jobs: "We built [the Mac] for ourselves. We were the group of people who were going to judge whether it was great or not. We weren’t going to go out and do market research"

A/B testing is can work wonders, just so long as it isn’t used as a substitute for understanding people.