
This week, the 3Q blog takes on the topic of testing – one near and dear to all digital marketers. 

I see myself as the contrarian, or maybe the voice of reason, on the importance of ad testing. It can be valuable, sure, but I’ve argued (and will do so again here) that it’s a lot harder than you think – and should be deprioritized in favor of more impactful initiatives.

Skeptical? I’ll start by describing several truths we’ve discovered over years of experience across thousands of SEM accounts; we call them 3Q’s Test Traps. Then I’ll show you two case studies that should have you reevaluating how you’re spending all that A/B testing time.

3Q’s Test Traps

  1. It’s nearly impossible to get ads to enter an auction on even footing. Quality Score or ad relevance will differ between your two ad variations, which results in a difference in cost. Most statistical significance calculations require metrics that represent a population and a sample (see more detail below), which means we’re unable to take efficiency, or CPA, into account. Because of those cost differences, the ad with the highest CPI won’t always have the lowest CPA (see the sketch after this list).
  2. Results are only meaningful for a limited time. Seasonality, limited-time sales, changes to the SERP, or shifts in competitor activity can make test results irrelevant in the future. If we can’t expect our “winner” to keep winning through those changes, then the test was a waste of time in the first place.
  3. Ad platforms are not designed to support controlled testing. Google effectively rewards a lack of ad copy testing: its recent ad-serving changes favor the “optimize” rotation setting and ad groups with many ads. Facebook, as described in one of our case studies below, severely limits any ability to split traffic in a meaningful way between two unique ad variations.
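
To make Test Trap #1 concrete, here’s a minimal sketch in Python with purely hypothetical numbers (every figure is invented for illustration): the variation that wins on the rate metric a significance test can evaluate may still lose on CPA once its higher cost is factored in.

```python
# Hypothetical two-ad comparison (all numbers invented for illustration).
# Rate metrics such as conversions-per-impression are what significance
# tests can evaluate; CPA depends on cost, which those tests never see.

ads = {
    "Ad A": {"impressions": 100_000, "conversions": 60, "cost": 4_000.00},
    "Ad B": {"impressions": 100_000, "conversions": 50, "cost": 2_250.00},
}

for name, a in ads.items():
    conv_per_impression = a["conversions"] / a["impressions"]
    cpa = a["cost"] / a["conversions"]
    print(f"{name}: conv/impr = {conv_per_impression:.4%}, CPA = ${cpa:.2f}")

# Ad A: conv/impr = 0.0600%, CPA = $66.67
# Ad B: conv/impr = 0.0500%, CPA = $45.00
# Ad A wins on the testable rate metric but loses on CPA, because its
# cost (driven by Quality Score / ad relevance) is higher.
```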

Falling into Test Traps: Examples

We’ve seen and tried to correct all manner of flawed A/B tests, whether the issues were in the process, the analysis, or both. Here are a few examples that might look all too familiar.

Google Display Network Ad Formats

When Google added Responsive Ads as an available format for the GDN, a brand decided to run a test to evaluate the performance of these new ads.

Based on the results, the test evaluator recommended pausing image ads to focus more spend on responsive ads. That recommendation was not statistically valid: not only was the brand working with a very small sample size, but the impressions for the two formats also did not represent the same users on the same sites. Responsive ads can be placed across all available GDN inventory, while image ads are limited to the placements that match the image ad sizes live in the account.

The performance differences between formats were therefore most likely the result of differences in inventory, not the creative itself; there is no fair way to compare one ad size to another. Instead of evaluating the new ad format under the assumptions of a controlled A/B test, each ad format should be evaluated on its own merit, and any format that achieved the brand’s goals should have remained live in the account.

Facebook Ad Copy

A brand introduced a video ad into its campaigns to evaluate the impact on performance of running video against the current static image ad. After a few weeks, it found that CTR was 20% higher on the video ad, and that the result was statistically significant.
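
For context on what “statistically significant” typically means in a comparison like this, here’s a minimal sketch of a two-proportion z-test on CTR; the impression and click counts are hypothetical, not the brand’s actual data.

```python
import math

# Hypothetical impression/click counts; not the brand's actual data.
video_impressions, video_clicks = 50_000, 600  # CTR 1.20%
image_impressions, image_clicks = 50_000, 500  # CTR 1.00%

p_video = video_clicks / video_impressions
p_image = image_clicks / image_impressions
p_pooled = (video_clicks + image_clicks) / (video_impressions + image_impressions)

# Standard error of the difference under the null hypothesis of equal CTRs
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / video_impressions + 1 / image_impressions))
z = (p_video - p_image) / se

print(f"CTR lift: {(p_video - p_image) / p_image:.0%}, z = {z:.2f}")
# |z| > 1.96 is conventionally "significant" at the 95% level, but the test
# assumes both ads reached comparable audiences, which Facebook does not guarantee.
```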

When we dove deeper into the two ads, though, we found more differences than just image vs. video.

Each Facebook ad has a life of its own – you can never control for the likes, comments, and shares that impact performance for an individual ad. As a result, it’s nearly impossible to have a truly controlled A/B test on Facebook.

Instead of using statistical significance to evaluate tests on Facebook that are, in fact, not controlled, advertisers should look for patterns in performance. For example, a certain advertiser may find that:

  • Lifestyle images tend to outperform utility images
  • Videos tend to perform better than static images

The ultimate decisions may be similar with or without statistical significance; the difference is that we need to be more conservative when drawing conclusions about Facebook performance and should run many variations before concluding that a certain type of ad really is better.
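
As a rough sketch of what looking for patterns could mean in practice (the creative tags and numbers below are hypothetical), you might aggregate results by creative type across many ads instead of running one-off pairwise tests:

```python
from collections import defaultdict

# Hypothetical ad-level results tagged by creative type (invented for illustration).
ads = [
    {"type": "lifestyle image", "impressions": 40_000, "clicks": 520},
    {"type": "lifestyle image", "impressions": 35_000, "clicks": 430},
    {"type": "utility image",   "impressions": 42_000, "clicks": 380},
    {"type": "utility image",   "impressions": 38_000, "clicks": 350},
    {"type": "video",           "impressions": 45_000, "clicks": 640},
    {"type": "video",           "impressions": 30_000, "clicks": 410},
]

totals = defaultdict(lambda: {"impressions": 0, "clicks": 0})
for ad in ads:
    totals[ad["type"]]["impressions"] += ad["impressions"]
    totals[ad["type"]]["clicks"] += ad["clicks"]

# A directional read across many variations, not a verdict from a single pair.
for creative_type, t in sorted(totals.items()):
    ctr = t["clicks"] / t["impressions"]
    print(f"{creative_type}: CTR {ctr:.2%} across {t['impressions']:,} impressions")
```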

Stay tuned for my next post, where I’ll show you the requirements of setting up a true A/B test in Google. For a start-to-finish look at how (and when) to put better ad testing into play, download our Introduction to SEM AdQ, 3Q’s approach to ad copy testing.