There are a fair number of reputable people making a lot of noise right now about the fact that the Google search engine is broken. The main complaint is that Google’s search results are “poisoned by spam”. Here are several examples of the complaints:
Google has become a snake that too readily consumes its own keyword tail. Identify some words that show up in profitable searches — from appliances, to mesothelioma suits, to kayak lessons — churn out content cheaply and regularly, and you’re done. On the web, no-one knows you’re a content-grinder.
The result, however, is awful. Pages and pages of Google results that are just, for practical purposes, advertisements in the loose guise of articles, original or re-purposed. It hearkens back to the dark days of 1999, before Google arrived, when search had become largely useless, with results completely overwhelmed by spam and info-clutter.
In 2010, our mailboxes suddenly started overflowing with complaints from users – complaints that they were doing perfectly reasonable Google searches, and ending up on scraper sites that mirrored Stack Overflow content with added advertisements. Even worse, in some cases, the original Stack Overflow question was nowhere to be found in the search results! That’s particularly odd because our attribution terms require linking directly back to us, the canonical source for the question, without nofollow. Google, in indexing the scraped page, cannot avoid seeing that the scraped page links back to the canonical source. This culminated in, of all things, a special browser plug-in that redirects to Stack Overflow from the ripoff sites. How totally depressing. Joel and I thought this was impossible. And I felt like I had personally failed all of you.
But it turns out that you can’t easily do such searches in Google any more. Google has become a jungle: a tropical paradise for spammers and marketers. Almost every search takes you to websites that want you to click on links that make them money, or to sponsored sites that make Google money. There’s no way to do a meaningful chronological search.
The problem of SEO poisoning has been growing at an alarming rate. At any one time, more than 100 of the top 300 search terms contained at least 10 percent malicious links, according to one security firm. That means that an Internet searcher has a one in three chance of being sent to a malicious website from a search page. That same security firm estimates that 15 of the first 70 hits in a search will contain malicious links.
So the question is, are these complaints true? Or do they fall under Louis C.K.’s “Everything’s Amazing & Nobody’s Happy” observation:
So I am trying to think of different kinds of searches people would do. One would be a “general factual question” like “why is the sky blue?” When I type this into Google, the first result is from sciencemadesimple.com and it seems completely legitimate. Does it have ads? Sure, but so what? The second result happens to be HowStuffWorks. Legitimate, and it has ads too. The third result is from NASA – spaceplace.nasa.gov. Legitimate. And no ads. But then we helped pay for that content with our taxes – maybe they SHOULD put ads on it so NASA has more money to spend on space missions.
And so on. I don’t see any evidence of spam poisoning here. Maybe I am missing something, but it all seems legitimate.
Here is another search that people might do – a history question. How did Marie Antoinette die? Let me point out that I have no idea. In fact, I need Google’s help to correctly spell “Antoinette”. First result is Wikipedia. Second is Yahoo Answers. I will admit a bit of prejudice here and say that I think Yahoo Answers should be purged from Google search results because many of the “answers” are plain awful, but that’s just me. And in this case the answer is correct. The next result is legitimate. The next result is Answers.com – another place that should be purged.
So this brings up a good point. Yahoo Answers often has bad answers. I don’t even click into Yahoo Answers when I see it in the search results. On the other hand, it is an incredibly popular service, and it must be popular for a reason. Answers.com is pure spam as best as I can tell. I don’t know why it ranks so high. Google (in my opinion) should purge it like it purged …
OK, now here is a real world case where I need to use Google. I know that there was an article about a month ago that talked about a guy who abused his customers. His deal was that he would abuse them, so they would complain. The complaints would link back to his site and his site would rise to the top in the search engine rankings. He got good Google rank through abuse. I know that somebody exposed him, and then just a couple days later Google responded. I want to find the original article that exposed him plus Google’s response.
So I go to Google. I type in a search string. I try this “man who abused customers to get google rank”. Let’s admit that that’s a pretty nebulous search string. OK – it’s really nebulous. How does Google perform? The first two results are Google sites helping customers resolve problems with Google, which makes sense given my search string. One is “www.google.com/support/forum” and the other is “www.google.com/support/webmasters”, so those seem legitimate. The third is an article on smartcompany.com.au talking about the incident I am referring to. The article is referencing the NY Times article, but does not give me a link.
Since I now know it is in the NY Times, I add “site:nytimes.com” to my query and I get the exact article I want:
That’s amazing. In 30 seconds, out of trillions and trillions of web pages, I have gotten to the exact page I want using a really nebulous search string.
Now I want to find Google’s response to that article. So I type in “Google’s response to A Bully Finds a Pulpit on the Web”. The first result is the NY Times article. The next result is a syndicated version of the article. The next result is a democraticunderground.com forum referencing the article. And now we get into a problem that is not “spam poisoning” but what I call “news poisoning”. Legitimate news sites report on the story because it causes outrage (and therefore appeals to readers). News poisoning has been a problem since the very first day Google went live. But then, at the bottom of the first page of results, there is a Reddit link that points straight to Google’s response:
Again, in less than 30 seconds I have found the exact piece of content I am looking for.
Why isn’t Google’s response the first result? I don’t know. But I did find what I was looking for very quickly. In 10 years there will be an AI that works a lot more like a human librarian and gives better results, but we are not there yet. If I can find exactly what I am looking for in 30 seconds using what is available now, does it really matter?
So now back to my original point…. I think Google should purge Answers.com in the same way it purged DecorMyEyes.
One of the articles up top indicated that the author was trying to buy a dishwasher and he got mired in spam rather than legitimate results. So for my next query I type in “Dishwasher Reviews”. The first result, as best I can tell, is useless. I have no idea why it gets the top slot. Score 1 for the complainers. The second result looks really good to me – it appears to be a review aggregator, sort of like RottenTomatoes is for movies. The next result is a set of reviews from CNET. The next is buzzillions.com – one of the sites picked up by the aggregator. Next is epinions.com. And so on. With the exception of the first result, it all looks legitimate.
So, I am not seeing the problem. But let’s say there really is a problem with spam poisoning. How could Google fix it? I have no “insider knowledge” of Google. But my impression is that Google gives the highest placement to the sites with the highest number of legitimate incoming links. If there is a problem with spam poisoning, it seems like that approach would weed out the spam, because no one would link to the spam. What if the spammers are smart enough to build networks of sites to link to their spam? I would think the word “legitimate” takes care of that, and Google can come up with data mining algorithms that detect this problem.
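To make that idea concrete, here is a toy sketch in Python. It is not Google's actual algorithm, which I have no knowledge of. It simply ranks sites by the number of incoming links and, as a stand-in for "legitimate", ignores any link that is reciprocated (site A links to B and B links back to A), which is a crude signal of a link-swapping network. The function name, the reciprocity rule, and the sample site names are all assumptions made up for illustration.

```python
from collections import defaultdict


def rank_by_legitimate_links(links):
    """Rank sites by inbound links, ignoring reciprocated links.

    links: iterable of (source_site, target_site) pairs.
    The reciprocity rule is a hypothetical heuristic, not a real
    search engine's definition of a legitimate link.
    """
    link_set = set(links)  # de-duplicate repeated links
    score = defaultdict(int)
    for source, target in link_set:
        score[source] += 0  # make sure every site appears in the ranking
        score[target] += 0
        if source == target:
            continue  # self-links never count
        if (target, source) in link_set:
            continue  # reciprocal link: treated as suspicious
        score[target] += 1
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(score, key=lambda site: (-score[site], site))


# Hypothetical sample data: three independent sites link to "nasa",
# while the three "spam" sites only link to each other.
sample_links = [
    ("blog-a", "nasa"), ("blog-b", "nasa"), ("news", "nasa"),
    ("blog-a", "news"),
    ("spam-1", "spam-2"), ("spam-2", "spam-1"),
    ("spam-1", "spam-3"), ("spam-3", "spam-1"),
]
ranking = rank_by_legitimate_links(sample_links)
print(ranking[:2])
```

A smarter spammer could get around this particular rule, for example by linking sites in a ring rather than in pairs, so a real system would need much better detection. The point is only that the signal is there in the link data to be mined.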
Let’s say that approach is insufficient. Google could let its audience tag spam – use crowd-sourcing to solve the problem. It might then need a little human help to avoid the “Google bomb” effect, but again those effects can be detected by a large number of votes coming in all at once, for example. Or by looking for pages that have too many votes.
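Detecting "a large number of votes coming in all at once" could look something like the sketch below. It flags a page when more than a chosen number of spam votes land inside any single time window. The function name, the one-hour window, and the 100-vote threshold are assumptions I picked for illustration, not anything Google has described.

```python
def has_vote_burst(timestamps, window_seconds=3600, max_votes=100):
    """Return True if more than max_votes fall within any window.

    timestamps: vote times in seconds (any order).
    window_seconds and max_votes are hypothetical tuning knobs.
    """
    times = sorted(timestamps)
    start = 0  # index of the oldest vote still inside the window
    for end, current in enumerate(times):
        # Slide the window forward until it spans at most window_seconds.
        while current - times[start] > window_seconds:
            start += 1
        if end - start + 1 > max_votes:
            return True
    return False


# 500 votes arriving one second apart looks like a coordinated campaign.
burst = has_vote_burst(range(500))
# 500 votes arriving one hour apart looks like ordinary feedback.
steady = has_vote_burst([hour * 3600 for hour in range(500)])
print(burst, steady)
```

Pages flagged this way would be the ones to hand to a human reviewer rather than purge automatically.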
Or Google could use humans to help weed out the spammers. Google knows what its top 1,000,000 searches are. Google also has the resources to hire people to look at the top-25 results for each of those searches and detect the spammers. It could then purge the spammers and a huge part of the problem would disappear.
In other words, this is a solvable problem. That’s why I don’t think it is a problem.
But let me ask you – do you have a Google query that, to you, seems to have “poisoned” results? If so, please add it into the comments. I will also keep my eyes open – I do a lot of Google searches. I will add anything I find in the comments as well.