Page Rank – Google’s system of ranking Web site relevancy based on the number of “inbound” links to a site – was truly a revolution in search engine relevancy. Every day that goes by, however, Page Rank – at least in its purest form – becomes less and less relevant. This post will provide a brief history of Page Rank, explain its current failures, and suggest future improvements to solve Page Rank’s deficiencies.
A Brief History of Page Rank
Until Page Rank, search engines relied almost entirely on site content to determine the relevancy of a Web site. The search engines looked at the “meta content” (title tags, description tags, meta-keywords) as well as the actual words on a page and tried to match this content with a user’s search query. This worked for a while, until search engine optimization professionals (SEOs) figured out that it was very easy to manipulate content to trick a search engine into thinking a non-relevant page was actually quite relevant. As a result, professional SEO pages quickly rose to the top of search engine result pages (SERPs), regardless of relevancy. Without relevancy, search engines serve no purpose and undoubtedly would have eventually ridden off into the sunset.
Enter Page Rank. Based on the concept in academia that the number of citations an academic article receives is a good indication of the article’s importance, Page Rank combined site content with “link popularity.” Thus, an SEO expert sitting at home creating non-relevant pages filled with spider-friendly content could no longer expect to shoot to the top of the SERPs. Instead, these “SEO spam” sites were replaced with, well, actually relevant sites. After all, who would create a link to a non-relevant site?
In the academic world, this makes a ton of sense. When you write an academic paper, your reputation is on the line. Write a bad article, and you risk the ridicule of your peers, and more importantly, you risk losing the chance to achieve tenure or a greater position in your field. Moreover, since all academic articles are peer-reviewed, it’s unlikely that an article with a bunch of irrelevant links would even get included in a prominent journal. Thus, due to author self-interest and community policing, a citation in an academic journal is truly a very strong indicator of relevance.
Why Page Rank Fails
Sadly, self-interest and community policing don’t apply to the Internet world, at least not with respect to Page Rank. It turns out that it is very easy to create what are essentially fake links to a page to bolster its Page Rank. This can be done by buying links (for example, from the company Text Link Ads), by trading links (reciprocal link exchanges), by creating fake sites that link to a master site, by spamming other sites with “comment spam,” and so on.
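To see why fake sites that link to a master site are so effective, here is a minimal sketch of a simplified, textbook-style PageRank calculation. It is not Google’s production algorithm, which has never been fully public. The page names, the 20-page link farm, and the damping value of 0.85 (a commonly cited default) are all assumptions made up for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Simplified PageRank; links maps each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            # A page with no outbound links spreads its rank over every page.
            targets = links.get(page) or pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical honest web: nobody links to "spam".
honest_web = {
    "blog": ["useful", "news"],
    "news": ["useful"],
    "useful": ["blog"],
    "spam": ["useful"],
}

# The same web after the spammer creates 20 fake pages that link to "spam",
# with "spam" linking back to them so the rank keeps circulating.
farm = [f"farm{i}" for i in range(20)]
farmed_web = dict(honest_web)
farmed_web["spam"] = farm
for page in farm:
    farmed_web[page] = ["spam"]

before = pagerank(honest_web)
after = pagerank(farmed_web)

print(min(before, key=before.get))  # lowest-ranked page in the honest web
print(max(after, key=after.get))    # highest-ranked page once the farm exists
```

In this toy web, "spam" is the lowest-ranked page before the farm exists and the highest-ranked page afterward, even though not a single honest site has changed its opinion of it.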
In other words, as soon as linking became a factor in a page’s rank on Google, an entire industry of link manipulation sprang up overnight. Thus, unlike in academia, self-interest online is often rewarded for gaming the system by buying links or creating fake links. And community policing is basically impossible, since Google and Google alone determines what is a “quality” or “trusted” link – the greater community really has no say in this decision.
Page Rank – or any system based on linking – is doomed to failure simply because “bad actors” can manipulate the system. This basically creates an arms race between search engines and SEOs – Google changes its algorithm to address a particular manipulation of Page Rank (for example, reciprocal linking), the SEOs assess the algorithm change, and then figure out a loophole. Google has to then go back to the drawing board to create a new algorithm.
This reminds me of trying to build a sand castle at the edge of the ocean. You build the castle, a wave comes in and knocks down part of your creation, so you build a wall to protect the castle from the wave, but then a wave comes in and knocks down the wall, so you build a moat, which eventually overflows, which you protect with a bigger wall, which is closer to the water, which gets hit by bigger waves, and so on. After about 30 minutes, you realize that you’ve spent all of your efforts trying to create ways to protect your castle, and that you actually haven’t had any time to improve the castle itself. This is what it must feel like to work on a search engine algorithm team.
When you combine a lack of the right kind of self-interest (and, in fact, an incentive to cheat the system), a lack of community policing, and an arms race that requires the search engines to spend more and more time fighting SEOs instead of building a better search algorithm, it just makes sense that Page Rank will eventually die.
So what will work then? Let’s return to the original impetus of Page Rank – the world of academia. As noted, academic citations work for two reasons – self-interest and community policing. Neither of these principles applies to Page Rank or any link ranking algorithm. They do, however, apply to two emerging areas of online search: personalization and vertical directories. I believe that one (or a combination) of these search functionalities will eventually replace or greatly diminish the importance of algorithmic search results. Let’s look at each individually.
Personalization essentially combines algorithmic search with user-specific preferences. This can be anything from enabling a user to rank or remove sites from their results (which Google is currently beta-testing), to actually tracking every movement a user makes online and inferring user preferences from user behavior (sort of the concept behind AllAdvantage).
The bottom line, however, is that the ultimate determinant of what is shown in the search results is not the algorithm but the user. As such, Web sites that are attractive to search engines but turn out to be horribly irrelevant to users eventually get blacklisted – by the users themselves – and this information can then be used by the search engine to adjust the algorithm going forward for each user.
Thus, personalization is the very definition of self-interest. As such, sites that will rank well through a personalization engine will have to focus their efforts on one and only one thing: usability. Lots of links, dummy content, and “spider food” may help you initially get a user to click on your site, but without truly outstanding user experience, your site will eventually be blacklisted from most users’ personalization algorithm.
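As a rough illustration of the idea, the sketch below layers one user’s preferences on top of a generic algorithmic ranking. Everything in it is hypothetical: the UserProfile class, the personalize function, the site names, and the scores are invented for this example and do not describe how Google or any other engine actually implements personalization.

```python
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    """Hypothetical record of one user's explicit feedback."""
    blocked: set = field(default_factory=set)    # sites the user removed
    boosts: dict = field(default_factory=dict)   # sites the user ranked up


def personalize(results, profile):
    """Re-rank (site, score) pairs from the generic algorithm for one user."""
    kept = [
        (site, score + profile.boosts.get(site, 0.0))
        for site, score in results
        if site not in profile.blocked
    ]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [site for site, _ in kept]


# Generic results: the "spider food" site wins on algorithmic score alone.
generic_results = [
    ("spider-food.example", 0.9),
    ("useful.example", 0.7),
    ("ok.example", 0.6),
]

profile = UserProfile(
    blocked={"spider-food.example"},
    boosts={"ok.example": 0.2},
)

print(personalize(generic_results, profile))
# ['ok.example', 'useful.example']
```

The site that fooled the generic algorithm disappears for this user no matter how many links point at it, which is the whole point: the only way back into the results is to be worth visiting.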
You could argue, however, that personalization only works after a lot of observation of a user’s searching behavior. After all, if a user suddenly decides to get an MBA and has never searched for anything related to MBAs, it’s hard to create a personalized algorithm that understands the user’s needs. Thus, if personalization solves the “self-interest” element of academic citations, there is still a “community policing” element – the peer review to ensure that a submitted paper is up to snuff – that is missing.
One possible solution is “collaborative filtering.” Collaborative filtering is the science of matching similar users to each other and using these similarities to predict results for a user, even if the user has never searched for a particular topic. The best example of this online is found on Amazon. When you search for a book on Amazon, you’ll always see a note at the bottom that says “People like you also bought . . .” Another great example is email spam filters, such as through Yahoo. Every time you tell Yahoo a particular email is spam, Yahoo counts your vote against that particular sender. If Yahoo gets enough votes (community policing), it concludes that the email must be spam.
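Here is a minimal sketch of user-based collaborative filtering, using the MBA example from above. The user names, the bookmarked topics, and the choice of Jaccard similarity (size of the overlap divided by size of the union) are all assumptions made for illustration; real systems like Amazon’s are far more elaborate, and this is not a description of how they work.

```python
def jaccard(a, b):
    """Similarity of two sets: size of the overlap divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, likes, top_n=3):
    """Suggest items liked by users whose tastes overlap with the target's."""
    mine = likes[target]
    scores = {}
    for user, items in likes.items():
        if user == target:
            continue
        similarity = jaccard(mine, items)
        if similarity == 0.0:
            continue  # nothing in common, so this user's tastes tell us nothing
        for item in items - mine:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:top_n]


# Hypothetical users and the topics each one has bookmarked.
likes = {
    "ana": {"gmat-prep", "mba-rankings", "essay-tips"},
    "ben": {"gmat-prep", "mba-rankings", "loan-guide"},
    "cho": {"gardening", "bird-watching"},
    "dee": {"gmat-prep"},
}

print(recommend("dee", likes))
# ['mba-rankings', 'essay-tips', 'loan-guide']
```

Dee has only ever bookmarked one MBA-related page, but because Ana and Ben overlap with her, their other bookmarks become her suggestions, and the item they both liked ranks first. Cho shares nothing with her and contributes nothing.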
Collaborative filtering combined with personalization is now emerging as a major force in search. Del.icio.us, Flickr and Technorati are all great examples. With Del.icio.us, you tag sites based on your personal preferences, and you use collaborative filtering to find out what other people with similar tagging styles liked online.
The problem with the Del.icio.us model, however, is that it is susceptible to the same manipulation (from link manipulation to tag manipulation) that occurred with Page Rank. In fact, this is already happening. Without an incentive for people to vote honestly (on Amazon, you vote with your wallet; on Yahoo Mail, you vote with your bulk folder), collaborative filtering will eventually fail. So, while I do think that this technique will become more important to search in the future, and will be a good initial bridge for personalization, collaborative filtering ultimately can’t solve for the lack of community policing online.
The answer to community policing is vertical directories that combine vertical algorithms with human editors. A fully-automated search engine like Google can never really be a good community police officer, simply because Google’s algorithm doesn’t work well enough for any particular community. Google’s algorithm tries too hard to be all things to all people. As a result, it does a pretty good job for most searches, but a very good job for few searches.
Say, for example, that you want to buy a house in Pacifica, CA. I went to Google and typed in “Pacifica California Houses for Sale.” The first ad I got was for used Chrysler cars (they have a car named the “Pacifica” apparently), and the first organic result was for HomeGain, which sent me to a screen without any results and a button that said “show me Pacifica results,” which I pressed. I was then redirected to ZipRealty, which required me to register online before I could see any results. The next result was for Homes.com, which had a total of three listings in Pacifica, all from Wells Fargo.
Now compare this user experience to a vertical search site, for example Trulia.com. The Trulia homepage asks you six questions: where do you want to buy your home, how much do you want to pay, how many bedrooms, how many bathrooms, how much square footage, and what sort of residence do you want. It takes about 20 seconds to fill out this information. I clicked search and I was shown 17 properties, a nice map, and tons of information. Exactly what I was looking for.
At a base level then, simply having a search focused on a specific vertical will always beat a general algorithmic search. Yes, Google could build a real estate search engine, but frankly they will never be able to invest enough resources to be able to build good search results for the hundreds or thousands of verticals that exist online.
My thought, though, is that a verticalized algorithm like that of Trulia is still not enough. Frankly, if Trulia becomes the Google of real estate sites, a whole army of Trulia-optimization experts will emerge ready to manipulate Web sites specifically for the purpose of getting a high ranking on Trulia. In other words, vertical search works better right now in part because it is more focused, but also in part because there is not yet an incentive for bad actors to try to manipulate the system.
What’s great about vertical search, however, is that it is a much more controlled environment. The folks at Trulia can “eyeball” results and improve them. For example, they may conclude over time that there are a few real estate brokers in Pacifica that basically dominate the market. As such, they can “tag” these sites for preferential rank in their algorithm. Thus, because the result sets are more defined, it is much easier to be a good community police officer with a vertical search site than it is on a generic site like Google.
You might argue that this isn’t as pure as the Google algorithm. I agree, it absolutely isn’t. But it would be better. For example, a human would never serve an ad for the Chrysler Pacifica based on a user query for “Pacifica Homes for Sale” (I doubt that Trulia will ever sell an ad to Chrysler, period). Nor would a human allow a site like HomeGain – which is clearly not providing what the user is looking for – to show up as the #1 result. As smart as computers are, they don’t get nuances like humans do.
One thing that I’ve assumed all along in this discussion of vertical search is that these vertical search engines have a strong incentive to do accurate community policing. After all, it would be pretty easy to edit your results to make sure that your top-paying advertiser always shows up #1 for every result.
Ultimately, as with academic citations, for community policing to work, there must be a strong element of self-interest. For commercial vertical search engines, the entire raison d’etre for the site is to provide a service better than that of a Google or a Yahoo. If these sites begin to manipulate their results to the point that their relevancy decreases to the level of a generic search engine, they will end up losing the users who originally came to their sites because of their increased relevancy. While I do think that most vertical search engines will make some decisions for commercial purposes, the ones that thrive will find a happy medium between revenue opportunities and appropriate community policing.
Putting It All Together
Personalization, collaborative filtering, vertical algorithms, and human-edited adjustments. Imagine del.icio.us, combined with Amazon, focused on a vertical like Trulia, and edited with the care of Wikipedia. Now compare that to a generic algorithm that is ripe for manipulation, attempts to provide the same results for all verticals, and has only limited understanding of a specific user’s unique preferences. It doesn’t seem like much of a comparison.
Way back in the 1800s, the field of medicine had one specialty – “doctor.” Over time, medicine has developed specialties, sub-specialties, and so on. Now we have neurologists, and within neurology there are movement disorder specialists, and among movement disorder specialists there are Parkinson’s Disease gurus, etc. Just as you wouldn’t want your family doctor to perform brain surgery on a family member, consumers, given the choice, will eventually avoid purely algorithmic sites when it comes to anything but the most obscure or generic searches.
Editor’s Note: As this is the longest posting I have ever written, anyone who has read to the end certainly deserves a prize. Write your name as a comment to this post and you will receive recognition later on as a “VIP Blogation Reader”!