Another anecdotal argument against optimizing for Quality Score
Published: September 23, 2013
Author: Susan Waldes
After last week’s great round-up on the Quality Score debate from Clix, I’ve been noodling on QS again – and as the single named antagonist, I feel compelled to antagonize a bit more. (Which might make for a fun dynamic on the SMX East Enhanced Campaigns panel I’m sharing with brilliant QS advocate Larry Kim on Oct. 1.)
Seriously, though, I just happened to notice some really odd QS behavior this week. Some caveats: the campaign numbers aren’t statistically significant, and the results have been bounced through some AdWords fluctuations (ahem, Enhanced Campaigns). But since most marketers optimizing less-than-massive accounts are subject to these conditions when analyzing data, I firmly believe the conclusions are relevant.
Let me set the stage:
We have a client whose head terms are extremely position-sensitive and sensitive to competitor actions. We set up a campaign a little over 6 months ago to serve as a “test environment” and have run a series of various position tests within this campaign. We purposely duplicated a handful of key exact-match head terms within this campaign – each duplicate lives in its own ad group like this:
Test Ad Group [keyword 1]
Control Ad Group [keyword 1]
Test Ad Group [keyword 2]
Control Ad Group [keyword 2]
And so on…
Over these 6 months, we have put the test groups vs. the control groups through a series of tests that look at the best way to approach position for this advertiser. We’ve used different bidding tools and methodologies, and we’ve tested positions that differ by an average of only 0.10 between the 2 groups. These tests are executed in shorter intervals, and in most of them we’ve used AdWords experiment tools to split the impressions in an A/B fashion.
(Oh, and just to get this out of the way, this client only has 1 lead gen landing page, and the ad texts are also cloned for the test and control ad groups.)
Earlier in the week, I happened to be looking at an aggregate 6 months of data for this whole campaign at the keyword level and noticed some really weird numbers around QS!
So here are the 6 months of data for 1 of the keywords – we have relatively even impressions and clicks for this one, resulting in ~11% better CTR for one variation. Avg. position is only 0.1 different. The conversion rate is quite different, with one variation ~50% better than the other – but with the same ad text and LP, I can say this likely comes down to one of two things. Either it has to do with how we were treating position in some earlier test that collided with some seasonality or a shifting competitor strategy, or it’s just random.
In either case, it’s not statistically significant enough to be sure. Still, a 2-slot difference in QS? And the higher QS group has double the first page CPC? Huh?
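For anyone who wants to gut-check a CTR gap like this one, here is a minimal sketch of a two-proportion z-test in Python. It is not the method used for the analysis above, and the impression and click counts in it are hypothetical placeholders chosen only to mimic an ~11% relative CTR difference; they are not the client’s numbers. The function name `two_proportion_z_test` is also just an illustrative choice.

```python
import math

def two_proportion_z_test(clicks_a, impressions_a, clicks_b, impressions_b):
    """Two-sided z-test for the difference between two CTRs.

    Returns (z, p_value). Uses the pooled-proportion standard error,
    which assumes both groups share one true CTR under the null hypothesis.
    """
    ctr_a = clicks_a / impressions_a
    ctr_b = clicks_b / impressions_b
    pooled = (clicks_a + clicks_b) / (impressions_a + impressions_b)
    std_err = math.sqrt(
        pooled * (1 - pooled) * (1 / impressions_a + 1 / impressions_b)
    )
    z = (ctr_a - ctr_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# HYPOTHETICAL numbers: 2,000 impressions per ad group,
# 111 clicks vs. 100 clicks (an ~11% relative CTR lift).
z, p = two_proportion_z_test(111, 2000, 100, 2000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

With placeholder volumes like these, the p-value lands far above the usual 0.05 threshold, so an 11% lift could easily be noise. The same relative lift on 100 times the traffic would clear that threshold comfortably, which is why account scale matters so much here.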
Here’s another one:
Here we have a much bigger difference in average position, less activity overall, and less parity in the impression/click numbers. There is a way bigger CTR delta and a way bigger conversion rate delta. Both favor the higher QS. First-page CPC is relatively the same here though.
And one more:
Higher QS aligned with lower CTR and only slightly higher CVR. CPA is overall higher – certainly not benefiting from CPC discounting on the higher QS.
I’ll also say that within this campaign there are examples of these duplicate pairs that also DO have the same QS despite wildly different metrics. Such as this:
Better CTR and CVR on one variation but the same QS.
So, what does this all mean? Well, as I mentioned, these are all small, statistically insignificant data sets. Additionally, this data was gathered over 6 months, across several forced and varied test intervals that we executed. In those 6 months, there have been large shifts in AdWords (Enhanced Campaigns!), a shifting competitor landscape for this advertiser, and some major seasonal spikes. All of those factors are layered on top of each other, so this is in no sense “clean.”
However, to be realistic, most advertisers are consistently working with: a) data sets that are far too small to be statistically significant, even when you aggregate their entire account activity; and b) shifting marketplace conditions.
Most advertisers also can only react to the data that they have and that is reflected in their account. So, what would the proper reaction/optimization here be? It seems strange that the anecdotal evidence isn’t even consistent enough to support a “gut check” optimization.
I will say that the data Larry Kim has gathered by aggregating hundreds of accounts is surely accurate and does show strong correlations. I also know that Google is pretty brilliant at algorithmically rewarding things that ultimately make more money for them, and that their data pool is of staggering scale, covering 100% of AdWords activity. However, I repeat my mantra that for most (perhaps all) advertisers working within the scale of even a single huge account, reacting to QS is the wrong thing to do. Once again, exercise the best practices and ignore the QS column.