Nick Plante

Dr. Dobb's Bloggers

Spam, The Final (Search) Frontier

January 30, 2011

I don't usually read TechCrunch, but it's a lazy Sunday here in snowy Maine, and Vivek Wadhwa's piece on search engine spam and the future of search caught my attention and tickled a personal pain point. Search spam or "spamdexing" is clearly out of control and becoming increasingly frustrating. But although it may be a problem, it's also an opportunity for innovation.

For the last 6-8 years, web search hasn't fundamentally changed. Many forget that when Google entered the space, there were already thoroughly entrenched players, and true innovation (and market dominance) seemed like a long shot. However, Google's introduction of PageRank proved to be a significant sea change and ultimately changed the way we searched for information on the web -- and where we went to perform searches.

And thus, the SEO expert was born.

I should say that there's nothing fundamentally wrong with the concept of search engine optimization (SEO). Really. After all, at its core, it's just about capitalizing on communicating your message well, putting your most important copy front and center, and getting others to link to you and recognize you as an authority on the topic. But like any system used to ultimately feed profits, it's frequently gamed and seemingly peripheral aspects of it are exploited in ridiculous ways. This isn't a new problem by any means, but the rise of spamdexing and the success of the content farm as a business model is particularly disturbing. And growing with a sick velocity.

For example, just this last week Demand Media, perhaps the most infamous content farm of all time, IPOed with a valuation of $1.5 billion. In a nutshell, Demand and other content farms identify topics with high advertising potential using an algorithm that takes search query data and bids on advertising auctions into account, and then produce content to satisfy those criteria. Almost a million pieces of content a month are created by Demand alone. Some of those items are for its own sites (eHow, Trails.com, Cracked.com), but many are for other mainstream sites such as YouTube as well.
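To make the topic-selection idea concrete, here is a rough sketch in Python. It is purely illustrative: I have no knowledge of Demand's actual algorithm, and the `Topic` fields, the `ad_potential` score (search volume times average bid), and the sample numbers are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Topic:
    # Hypothetical inputs; not any content farm's real data model.
    phrase: str
    monthly_searches: int   # assumed search query volume
    avg_ad_bid: float       # assumed average ad auction bid, dollars per click


def ad_potential(topic: Topic) -> float:
    """Illustrative score: search volume multiplied by average bid."""
    return topic.monthly_searches * topic.avg_ad_bid


def rank_topics(topics: List[Topic]) -> List[Topic]:
    """Order topics from highest to lowest illustrative score."""
    return sorted(topics, key=ad_potential, reverse=True)


# Made-up sample data, for demonstration only.
topics = [
    Topic("how to tie a tie", 50_000, 0.25),
    Topic("cat log messages", 1_000, 0.10),
    Topic("refinance mortgage", 20_000, 4.00),
]

for t in rank_topics(topics):
    print(f"{t.phrase}: {ad_potential(t):.2f}")
```

Even with a score this naive, the point is visible: nothing in it measures whether the resulting content is any good, only whether it is likely to pay.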

The problem isn't that they're creating tailored content as much as it is that the content created isn't high quality, and usually isn't what you want to find when you search for "cat log messages". But it's effective. And it's happening more and more in niche focus areas too, sometimes with even less value than the content that is explicitly produced by Demand's low-paid freelancers.

Take, for example, efreedom.com, which pretty much exclusively republishes content from StackOverflow. I don't know about you, but as a developer, I'm frequently using Google as a debugging tool when I run into some sort of new, esoteric error message or fault (more frequently than usual lately, as I'm still coming up to speed on Android). StackOverflow is a fantastic repository of questions and answers (which I should probably be searching directly), and more often than not it has the answer that I want, but efreedom seems to almost always manage to have a higher-ranked search result than the original content that it's repurposing. I still get the answer I'm searching for, but I'd argue that at best this is disingenuous and frustrating. It doesn't help that efreedom is poorly designed and difficult to navigate, either.

It seems clear that the web, or at least search on the web, is suffering from a major spam problem. Although Google suggests that its results are more accurate than ever, the overall feeling of satisfaction I get from typical search results has certainly decreased over the last year or two. Google engineers recognize this low-quality content problem, of course, and are working on solutions. But can small tweaks be made to the existing rank algorithms to handle these seemingly legitimate scenarios? Or is a whole new strategy required?

New search engines like Blekko are banking on the latter. By leveraging "slashtags" or context groupings, and allowing users to categorize and curate content for others, Blekko is trying to push crowdsourced search annotations into the mainstream, while cutting down on spam.

So is curated crowdsourced search the answer? Only time -- and use -- will tell. Google doesn't seem to think so, having shelved their brief SearchWiki crowdsourcing experiment in favor of starred search (I think I was one of about 7 people who actually liked SearchWiki, oh well).

I have my doubts about whether curated or annotated search is really the answer to filtering unwanted spam from the "toxic waste dump" of search engine result listings. Blekko's results are interesting but far from perfect thus far, and they'll need to get some serious critical mass to test the hypothesis thoroughly. Will they fare better than the over-hyped "Google killer" Cuil, which shut down in September? Or will something completely new emerge that will topple Google's dominance of the space?

One thing is for sure: search is still an interesting and unsolved problem. Thanks to the spammers.
