Teh Xiggeh


Hand-made PageRank?

Posted in Search Engines by Xiggeh on August 17, 2006

A few months ago I was pulling apart the Google Toolbar to see what kind of data was being thrown around in the background. This was nothing new; once you had the hash you could query anything.

It was interesting in two regards: the first was the PR response the toolbarqueries server returned, and the second was a much more comprehensive page of data that was also available (more on that later).

I couldn’t find the hash again, or get the toolbarqueries site to work in my browser, so for the moment I’m using Ethereal to pull out the server responses. Let’s see what happens on various sites:

digg.com “Rank_1:1:7”

theregister.co.uk “Rank_1:1:8”

slashdot.org “Rank_1:1:9”

cnn.com “Rank_1:1:9”

adobe.com “Rank_1:2:10”

Hello .. something’s different with that last one. Not only is it a PR 10 site, but the second parameter has changed from “1” to “2”. Does this happen on any other sites?

nasa.gov “Rank_1:2:10”

apple.com “Rank_1:2:10”

statcounter.com “Rank_1:2:10”

Yes it does. In fact I couldn’t find a page with PR 10 without that number changing. What does it mean? I have a theory…

When it comes to ranking the web, you have to start somewhere. Back when Google first started crawling the web, no doubt Larry and Sergey popped in a few URLs and watched it explore links outward. It makes sense to continue this when calculating PRs through the index. To get the most accurate results by the 40th or 50th iteration, one could start with known trustworthy or “authoritative” pages.

I believe the second value in “Rank_1:2:10” is exactly that – a flag saying this isn’t just any old web page, it’s an authoritative page. To support this theory further, The PageRank Citation Ranking: Bringing Order to the Web, the paper published in 1998, has a section titled Personalized PageRank. It outlines an “important component of the PageRank calculation” and “a powerful parameter to adjust the page ranks”. This could be used to tone down PR for a site with an unusually large number of incoming links, or to adjust the PR for known authority sites in the database.
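To make the theory a bit more concrete, here’s a rough sketch (in Python, which is just my choice for illustration) of personalised PageRank as the 1998 paper describes it: instead of spreading the “teleport” probability evenly over every page, you concentrate it on a handful of trusted seed pages. The three-page link graph, the seed set and all the names below are made up by me – it’s only meant to show the mechanism, not anything Google actually runs.

    # Minimal personalised PageRank sketch (purely illustrative).
    # links[page] = pages that 'page' links out to.
    links = {
        "adobe.com":    ["cnn.com", "slashdot.org"],
        "cnn.com":      ["adobe.com"],
        "slashdot.org": ["cnn.com", "adobe.com"],
    }

    damping = 0.85          # probability of following a link
    seeds = {"adobe.com"}   # hypothetical "authoritative" pages
    pages = list(links)

    # Personalisation vector: all teleport probability lands on the seeds
    # instead of being spread evenly over every page in the index.
    teleport = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}

    pr = {p: 1.0 / len(pages) for p in pages}

    for _ in range(50):     # the 40-50 iterations mentioned above
        new = {}
        for p in pages:
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - damping) * teleport[p] + damping * incoming
        pr = new

    for page, score in sorted(pr.items(), key=lambda kv: -kv[1]):
        print(page, round(score, 4))

Pages in (or close to) the seed set end up with a boost compared to the evenly-seeded calculation – which is exactly the sort of knob the “2” flag could be feeding, if my guess is right.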

Does anyone have any information that could prove/disprove this?

Google Project Hosting

Posted in Search Engines by Xiggeh on July 28, 2006

Well, as speculated, it has finally arrived – Google now provide hosting for open source projects as announced yesterday at OSCON. Philipp Lenssen provides an excellent overview of the system.

Google Gearing Up For Corporate Web Builders?

Posted in Search Engines by Xiggeh on July 27, 2006

ResourceShelf spotted some new domain registrations from Google yesterday, and there’s a couple of interesting ones on the list;

  • Your-company-site.com
  • Your-company-website.com
  • Your-online-site.com

A bit spammy in my opinion, not like Google at all, but it shows a clear intent to market web sites to corporates. This isn’t surprising considering the popularity of Google Page Creator, and the commercial nature of many other Google products.

Supplemental Hell

Posted in Search Engines by Xiggeh on July 25, 2006

On 27 August 2003 GoogleGuy announced a new feature in Google – supplemental results. Although Google had done a great job of removing irrelevant pages from queries that returned a large number of results, they discovered that useful information was not being displayed for more obscure queries. To rectify this Google introduced the ‘supplemental index’ (SI). The supplemental index contained pages not recognised as useful enough to be returned in standard results, but not spammy or irrelevant enough to be shunned entirely.

Nate Tyler (Google Media Contact) explained; “The supplemental is simply a new Google experiment. As you know we’re always trying new and different ways to provide high quality search results.”

In fact this kind of thinking wasn’t new. In June 2000 Inktomi introduced a smaller index of authority sites, and a larger index with ‘the rest’ of the web, similar to Google’s main and supplemental indices.

Few webmasters complained when the SI was introduced in 2003, but trouble started to creep in on 24 January 2006 – a substantial part of Google’s main index began shifting into the supplemental index. It was a serious bug.

Without much acknowledgement from Google, forums exploded with speculation about why this happened. I’ve done a bit of research on the topic and present below some of the facts I’ve discovered, along with my own (equally unproven) speculation.

What is the supplemental index (SI)?

First let’s get Google’s official view on supplemental results when it was launched in 2003;

“Supplemental sites are part of Google’s auxiliary index. We’re able to place fewer restraints on sites that we crawl for this supplemental index than we do on sites that are crawled for our main index. For example, the number of parameters in a URL might exclude a site from being crawled for inclusion in our main index; however, it could still be crawled and added to our supplemental index. The index in which a site is included is completely automated; there’s no way for you to select or change the index in which your site appears. Please be assured that the index in which a site is included does not affect its PageRank.”

The official line is rather vague. Let’s see what else we knew about the supplemental index before the bug(s) were introduced in 2006;

  • Pages could be moved to SI without being crawled (Google support pages)
  • Pages could be moved to SI after being crawled (personal experience)
  • “Some” of the moving process “happens in the crawl/index cycle” (Matt Cutts)
  • SI does not affect PR (Matt Cutts)
  • SI is not affected by PR (personal experience)
  • Under certain circumstances SI results will be listed above results from the main index
  • The SI is held separately from the main index
  • The SI has dedicated crawlers, running on different cycles and agendas from the main index crawlers and AdSense crawlers

Some very good observations from the community, and some snippets of wheat amongst the chaff from Google. Piecing all of this together took the dedicated SE community – and the occasional Google spokesbot – just over three years. So what’s the problem? Big Daddy…

Big Daddy – Supplemental Hell

Between November 2005 and April 2006 Google rolled out Big Daddy – new software on a new architecture – across its worldwide datacentres. Servers were taken offline, upgraded, and brought back up. A datacentre took around 10 days to upgrade.

Matt Cutts originally said “changes on Big Daddy are relatively subtle (less ranking changes and more infrastructure changes)”, but once the rollout had started it became clear the changes were a bit more disruptive than planned.

An interesting live commentary of the rollout can be found at the WMW forums (24 January 2006).

Over a period of weeks a substantial amount of Google’s main index was placed in the supplemental index, and the problems haven’t completely cleared up yet.

In March Google “identified and changed” a “threshold” on Big Daddy which brought many of the supplemental pages back into the main index. At the end of March Google did the same again, telling Big Daddy to crawl more pages.

Getting out of the supplemental index

To quote Matt Cutts on this one:

In general, the best way I know of to move sites from more supplemental to normal is to get high-quality links (don’t bother to get low-quality links just for links’ sake).

From my experience over the last few months this advice works great. A site that went supplemental for 8 weeks shot right to the top for popular search terms using this method.

However it won’t work for everyone. You should do your own research (there’s plenty of advice out there, good and bad) to find your own solution.


Google Product For Open Source Developers

Posted in Search Engines by Xiggeh on July 25, 2006

Google’s Greg Stein has announced a forthcoming Google product aimed at the open source community, but has refused to give us any further information. The launch is to tie in with OSCON – the Open Source Convention, and on Monday 24 July 2006 Greg said “we’re putting the final touches on it as I write this blog post”. He promised full details on Thursday at his OSCON talk.

Greg has long been part of the open source community – he is the Apache Software Foundation’s chairman and an engineering manager at Google. He’s previously worked on Subversion (SVN) and WebDAV.

Google Borg

Posted in Search Engines by Xiggeh on July 24, 2006

We noticed the “using_borg” string appear in the Google error message last week, and now the phrase has come back again through another Google leak. To quote Garett Rogers at ZDNet:

“When checking out Google’s impressive second quarter on Google Finance today, I stumbled across something that leads me to believe they are testing ‘version 2’ of Google Finance.”

What he actually saw was a link at the top-right of the page (next to “My Account”, etc) titled “v2 (test)”. The really interesting part is the URL contained in the link:

http://0.frontend-live.sfe.scrooge.hs.borg.google.com/finance

The URL isn’t accessible from here, and others have discovered the same. Assuming ‘borg.google.com’ can only be accessed from inside Google’s network, this may shed a bit of light on what “borg” actually means to the engineers.

Oh, and it’s been spotted again by Philipp Lenssen, this time in Google Video. In fact there are rather a lot of “borg” references across Google.
Could it be an acronym? Here are some that might fit:

  • BMRT Ordinary Rendering GUI
  • Business and Organisational Leadership

Or could it refer to Anita Borg – a woman who “spent her life revolutionizing the way we think about technology”?

I believe “borg.google.com” is used for internal testing, and perhaps as a repository for code not intended to leave the building. This would explain the accidental link from Google Finance, and the Google error message if they’re pulling test code for new features.

Any more sightings or possible explanations?

Google Supplementals .. Itself

Posted in Search Engines by Xiggeh on July 19, 2006

This afternoon I was looking for a local I.T. support company (they’re a client of mine) and came across the Google Web Directory in the search results. What made me chuckle, though, was that the result came back supplemental.

[Screenshot: the Google Web Directory result returned as a Supplemental Result – view full size screenshot]

What goes around comes around, eh?

Microsoft vs. Google

Posted in Search Engines by Xiggeh on July 14, 2006

According to an article at El Reg, Google have finally awoken the beast by threatening to encroach on Microsoft’s territory. Kevin Turner, Microsoft’s COO, is quoted as saying:

“Enterprise search is our business, it’s our house and Google is not going to take that business. Those people are not going to be allowed to take food off our plate, because that is what they are intending to do.”

Not allowed? Them’s fighting words, partner.

But hold on – Microsoft doesn’t own the enterprise search business. In fact it hardly dents the market; there are clear leaders, and Microsoft isn’t one of them. Could it be using the fight with Google as a spur to become a serious contender? Or is it all bluster?

Google Error Message

Posted in Search Engines by Xiggeh on July 14, 2006

There’s been quite a lot of talk about an error message discovered on a Google repository server, which has been confirmed as real by Matt Cutts. Here’s the error message in question:

pacemaker-alarm-delay-in-ms-overall-sum 2341989
pacemaker-alarm-delay-in-ms-total-count 7776761
cpu-utilization 1.28
cpu-speed 2800000000
timedout-queries_total 14227
num-docinfo_total 10680907
avg-latency-ms_total 3545152552
num-docinfo_total 10680907
num-docinfo-disk_total 2200918
queries_total 1229799558
e_supplemental=150000 --pagerank_cutoff_decrease_per_round=100 --pagerank_cutoff_increase_per_round=500 --parents=12,13,14,15,16,17,18,19,20,21,22,23 --pass_country_to_leaves --phil_max_doc_activation=0.5 --port_base=32311 --production --rewrite_noncompositional_compounds --rpc_resolve_unreachable_servers --scale_prvec4_to_prvec --sections_to_retrieve=body+url+compactanchors --servlets=ascorer --supplemental_tier_section=body+url+compactanchors --threaded_logging --nouse_compressed_urls --use_domain_match --nouse_experimental_indyrank --use_experimental_spamscore --use_gwd --use_query_classifier --use_spamscore --using_borg

How revealing is that! Unfortunately Google say they’ve put procedures in place to stop it happening again, but I’m determined to enjoy it while it lasts.

Here’s my take on the whole deal:

pacemaker-alarm: I don’t know what this is, but it sounds like a system Google have in place to keep everything up and running. It’s an alarm, so it has a trigger. And it’s a pacemaker, so it may be triggered to prevent request timeouts. If these numbers are right, the average alarm delay works out at 0.301ms, and it triggers on roughly 0.6% of queries (I’ve sketched the working for these figures a little further down).

cpu-utilization: I assume this is a number in the same format as your standard *nix load readout. It could be an average over 1 minute, 5 minutes, 15 minutes, or something random Google considers important, but whatever the timescale there are, on average, 1.28 processes in the run queue.

cpu-speed: 2800000000 works out to be 2.8GHz – could this be a Pentium 4 with the 533MHz FSB? Or maybe Intel Xeon 2.8GHz. It’s not likely to be AMD (yet).

timedout-queries: Well, it looks like queries can time out, so maybe that’s not what the pacemaker alarm is for. But on this server 14,227 queries have timed out, which works out to be 0.0012% of total queries.

num-docinfo_total: Most likely the total number of documents stored on this particular box. If we do some number crunching with other values in the message, it looks like the average document size is about 4.85KiB. Makes sense to me, and I assume Google are still compressing documents in the repository.

avg-latency-ms_total: Could this be the average latency per query? In that case it works out to be 2.88ms per query on average.

queries_total: Fucking hell! 1.2 billion queries, presumably on that box alone.
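For anyone who wants to check the figures above, here’s the back-of-the-envelope working in a few lines of Python. The raw numbers come straight from the error message; what each counter actually measures is my assumption, not Google’s.

    # Derived figures from the leaked counters (the interpretations are mine).
    pacemaker_sum_ms   = 2341989
    pacemaker_count    = 7776761
    timedout_queries   = 14227
    queries_total      = 1229799558
    avg_latency_ms_sum = 3545152552
    num_docinfo        = 10680907
    num_docinfo_disk   = 2200918

    print("avg pacemaker delay :", round(pacemaker_sum_ms / pacemaker_count, 3), "ms")      # ~0.301 ms
    print("pacemaker trigger   :", round(100.0 * pacemaker_count / queries_total, 2), "%")  # ~0.63 %
    print("query timeout rate  :", round(100.0 * timedout_queries / queries_total, 4), "%") # ~0.0012 %
    print("avg latency / query :", round(avg_latency_ms_sum / queries_total, 2), "ms")      # ~2.88 ms
    print("docinfo / disk ratio:", round(num_docinfo / num_docinfo_disk, 2))                # ~4.85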

Now all that stuff above seems to be debug values returned from the server. The next lot seems to be settings on the box;

pagerank_cutoff: Increase and decrease per round? Now we know Google needs to go through 40-60 iterations of the PageRank algorithm to get vaguely accurate figures, and documents would need to be pulled from the repository (which is what this box does). This could relate to the maximum increase/decrease in PR per iteration. It would certainly prevent drastic variations per iteration. What’s interesting are the values – “100” and “500”. Now we see PR values 1-10 (which we know is just a fluffy number that is almost meaningless), but Google patents and research papers refer to the Internet having a total PR of 1. So could these values be percentages?
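Just to illustrate how I’m reading those two flags, here’s a tiny sketch of what a per-round cap on PageRank movement might look like inside the iteration loop. Treating “100” and “500” as the maximum allowed decrease/increase per round (in whatever internal units Google uses) is entirely my interpretation – the function and the numbers in the example are made up.

    # Speculative sketch: cap how far a page's score may move in one iteration.
    CUTOFF_DECREASE_PER_ROUND = 100
    CUTOFF_INCREASE_PER_ROUND = 500

    def clamp_round(old_score, new_score):
        """Limit the change in a page's internal score for a single round."""
        lower = old_score - CUTOFF_DECREASE_PER_ROUND
        upper = old_score + CUTOFF_INCREASE_PER_ROUND
        return max(lower, min(upper, new_score))

    # A page whose recomputed score jumps by 2000 units only rises by 500 this round,
    # while a drop of 50 is small enough to pass through untouched.
    print(clamp_round(1000, 3000))   # -> 1500
    print(clamp_round(1000, 950))    # -> 950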

parents: Every good server needs a parent, and every good system needs a backup. It appears this box has 12 sequentially-numbered parent servers, probably to assign workloads and retrieve documents.

port_base: Well I tried a port scan (sorry Google), but I really didn’t expect to find anything. Could servers in the GooglePlexi communicate in the 32000 port range?

production: Obviously, as we’re using it. It looks like Google can switch servers between testing and production at the click of a mouse.

rewrite_noncompositional_compounds: Non-Compositional Compounds (NCC) are phrases such as “hot dog” or “cold turkey” that make absolutely no sense when split up. When a computer is expected to understand the meanings of a document, and find related key terms, NCCs need to be extracted and treated differently.
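As a toy example of what such a rewrite might do before indexing – the phrase list and the underscore-joining convention below are mine, purely for illustration:

    # Toy sketch: fold known non-compositional compounds into single index terms.
    NCC_PHRASES = ["hot dog", "cold turkey", "kick the bucket"]

    def rewrite_nccs(text):
        """Replace each known NCC with a single underscore-joined token."""
        lowered = text.lower()
        for phrase in NCC_PHRASES:
            lowered = lowered.replace(phrase, phrase.replace(" ", "_"))
        return lowered.split()

    print(rewrite_nccs("He quit cold turkey and bought a hot dog"))
    # -> ['he', 'quit', 'cold_turkey', 'and', 'bought', 'a', 'hot_dog']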

scale_prvec4_to_prvec: I’m still working on the math on this one. Vectors scare me and I spent too much time on IRC at school.

sections_to_retrieve: Appears to be a list of document elements to return, either when a user requests a cached document, or when the indexer requests documents for scoring.

supplemental_tier_section: As above, but for supplemental documents?

servlets=ascorer: I don’t know what the ‘ascorer’ is, but it’s live. What begins with ‘a’ that Google would want to score? Hah, what doesn’t Google want to score 😉

threaded_logging: That’s some serious logging going on, although I’d probably do the same.

nouse_experimental_indyrank: I can’t think what “IndyRank” might be, what it would rank or how. I want to know though 🙂

use_experimental_spamscore: No surprises there. The “bad data push” (uh huh) caused a helluva ruckus with spam in the results, and the problem is slowly going away. This appears to be our knight in shining armour.

use_gwd: Google Web Directory?

use_query_classifier: I believe this is related to Google’s OneBox results (e.g. health, stocks, companies, and so forth).

use_spamscore: Obviously this is the older method of scoring spammy pages. Didn’t work out so great, did it?

phil_max_doc_activation: Maximum document activation? Not sure on this one. I looked up some Phils at Google. We have a Phil Winterbottom, who’s published some papers on Plan 9 from Bell Labs – a distributed system built from terminals, CPU servers and file servers – but that doesn’t fit.

And now I’ve finished writing this, I found another good interpretation of this error message over at Stuntdubl. It’s interesting how he’s taken the values to be related to the cached document on the server, and I assumed they were values related to the server itself.

What are your thoughts? Am I tapping away at the vague truth, or am I on the wrong track completely?

And can I have a job at Google Ireland please? 🙂