TORONTO — It uses highly sophisticated software to help serve more than 200 million queries a day and employs some of the brightest PhDs to develop search algorithms, but when it’s time to buy more hardware, Google heads for the low end.
At its 13th annual CASCON, one of the world’s largest computer science and engineering events, IBM Canada welcomed a host of students from schools across the country, many of whom probably look forward to getting a job where they’ll use best-in-class computers. They won’t find them at Google, though. According to Craig Nevill-Manning, Google’s senior research scientist and CASCON’s keynote speaker, the search engine firm stockpiles approximately 10,000 servers to keep its business running. They’re cheap servers, Nevill-Manning said — the kind of unlabelled, commodity-type computers that might be purchased by home users.
“They’re very unreliable; we have failures,” he said, “but what we try to do is push the processing power of the PCs and achieve reliability in the software.”
Many of the servers Google uses are redundant, Nevill-Manning explained. The company estimates that a server running Google applications all day undergoes the equivalent of 40 years of regular use. Approximately 82 of these servers die every day, but not completely; Google employs maintenance people who walk around with carts of hard disks, for example, and replace them in malfunctioning servers or UPSes. Though staff must be notified within seconds of a server failure so its workload can be offloaded, some of the maintenance may take a week or more. “We can save a lot of money by doing this in a lazy fashion,” he said.
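The approach Nevill-Manning describes, masking hardware failures in software through replication, fast failover and lazy repair, can be sketched as a toy Python model. The shard layout, server names and methods below are invented for illustration, not Google's actual design:

```python
class Cluster:
    """Toy model: each shard of data lives on several cheap servers,
    so losing one box never interrupts query serving."""

    def __init__(self):
        # shard id -> servers currently holding a replica of that shard
        self.shards = {0: ["a1", "a2", "a3"], 1: ["b1", "b2", "b3"]}
        self.repair_queue = []   # dead machines awaiting a cart visit

    def report_failure(self, shard, server):
        # Called within seconds of a failed health check: stop routing
        # work to the box immediately...
        self.shards[shard].remove(server)
        # ...but queue the physical repair for whenever it's convenient.
        self.repair_queue.append(server)

    def serve(self, shard):
        replicas = self.shards[shard]
        if not replicas:
            raise RuntimeError("all replicas of shard %d are down" % shard)
        return replicas[0]   # any surviving replica can answer

cluster = Cluster()
cluster.report_failure(0, "a1")   # one of the ~82 daily failures
print(cluster.serve(0))           # queries keep flowing via another replica
print(cluster.repair_queue)       # "a1" waits, lazily, for the repair cart
```

The point of the sketch is the asymmetry: failover is immediate, but repair can wait a week, which is where the cost savings come from.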
IDC Canada infrastructure analyst Alan Freedman said Google’s level of redundancy isn’t uncommon.
“There’s a couple of companies that can’t afford not to have that,” he said, citing payment processing companies as an example. “They’ll do that either by horizontally-scaling — by getting all those different boxes in and physically managing those — or by getting fault-tolerant, more expensive types of systems with greater software functionality built into them as well.”
When a query is entered into Google’s Web site, it fans out in several directions. These include the index servers, where the company has categorized about four billion Web documents (including 400 million images and 35 million pieces of non-HTML content), and the document servers, where the results of a particular search are parsed and then presented as three-line summaries on the search results page. Everything is replicated throughout the chain, Nevill-Manning said: at the server level, in the set of server clusters, and at the sites themselves.
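The two-stage fan-out Nevill-Manning describes can be sketched in miniature, with in-memory dictionaries standing in for the index and document server tiers. The data, shard layout and function names here are all illustrative assumptions:

```python
def search(query, index_shards, doc_servers):
    """Stage 1: every index shard reports matching document ids.
    Stage 2: document servers turn those ids into the short
    summaries shown on the results page."""
    doc_ids = []
    for shard in index_shards:
        doc_ids.extend(shard.get(query, []))
    return [doc_servers[d] for d in doc_ids]

# Tiny stand-ins for the two server tiers.
index_shards = [{"cascon": [1]}, {"cascon": [2], "ibm": [2]}]
doc_servers = {1: "CASCON 2003 home page", 2: "IBM Canada research news"}

print(search("cascon", index_shards, doc_servers))
```

Because the index is split across shards, each query touches every shard in parallel in the real system; the loop above is the sequential equivalent.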
Google also sets up databases around the world in case one is affected by natural disasters like earthquakes, Nevill-Manning said. The data centres are still stuffed with relatively inexpensive boxes, a strategy that hasn’t changed since Google’s origins at Stanford University in 1998. As part of his presentation, Nevill-Manning showed a series of slides of Google’s first server racks, which were built using pieces of folded aluminum. Components placed on top were in danger of causing electrical problems until the company put a cork lining on top. As for disk enclosures, Google’s two founders came up with a colorful solution: Lego blocks.
“I think there are more pieces there than are absolutely necessary,” Nevill-Manning mused, looking at the slide. “Some of that may be decorative.”
Nevill-Manning showed a pyramid diagram to illustrate how Google organizes its searches. At the bottom is “main,” where there tends to be higher latency for pages that don’t change very much. “Fresh,” in the middle, includes portals that need more up-to-the-minute checks, like e-commerce sites during the December holiday shopping season. On top is “news,” for CNN.com and other sites that change all the time.
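One way to picture the pyramid is as a recrawl schedule, with each tier checked at a different interval. The intervals below are illustrative guesses, not figures from the talk:

```python
# tier -> recrawl interval in seconds (invented numbers for illustration)
TIERS = {
    "main":  30 * 24 * 3600,   # slow-changing pages: higher latency is fine
    "fresh": 24 * 3600,        # portals and e-commerce: daily checks
    "news":  15 * 60,          # CNN.com-style sites: near-continuous
}

def due_for_recrawl(tier, last_crawled, now):
    """True if a page in the given tier should be fetched again."""
    return now - last_crawled >= TIERS[tier]

now = 100 * 24 * 3600
print(due_for_recrawl("main", now - 3600, now))   # a main page crawled an hour ago can wait
print(due_for_recrawl("news", now - 3600, now))   # a news page crawled an hour ago is stale
```

The pyramid shape reflects volume: the vast "main" tier is cheap to maintain precisely because it is revisited so rarely.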
To maintain the relevancy of its search results, Google hires PhDs with expertise in machine learning to create algorithms that look for links to other sites, what Nevill-Manning called the company’s secret sauce. The Web poses a lot of challenges to the traditional methods of information retrieval, he said. Where the offline world (like libraries) assumes that queries will be well-defined, that documents are coherent and that the vocabulary is small, online queries are all over the map. Google’s user base is just as varied; the majority, Nevill-Manning said, come from outside the United States.
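Nevill-Manning didn't spell out the algorithms, but the link-analysis idea Google has published is PageRank: a page matters if important pages link to it. A minimal power-iteration sketch follows; it is a simplification, and a real ranker combines many other signals:

```python
def link_score(links, iterations=50, damping=0.85):
    """Iteratively share each page's score among the pages it links to,
    with a damping factor modelling a surfer who sometimes jumps at random."""
    pages = sorted(set(links) | {q for tgts in links.values() for q in tgts})
    score = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, targets in links.items():
            if targets:
                share = damping * score[p] / len(targets)
                for q in targets:
                    new[q] += share
        score = new
    return score

# "b" is linked to by both "a" and "c", so it ends up ranked highest.
scores = link_score({"a": ["b"], "b": ["c"], "c": ["b"]})
print(max(scores, key=scores.get))
```

The appeal of the idea for the Web is exactly the problem described above: it needs no assumptions about query quality or document coherence, only the link structure.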
Google’s main challenge right now is to handle the billing and syndication of its advertising, which often includes transactions that are worth only pennies each. On the R&D side, Nevill-Manning said the company is hoping to extend its portfolio with a number of new services, including a Google Glossary of hard-to-find terms and Google Sets, which would bring up related searches. Typing in Prada, Armani and Hugo Boss, for example, might bring up Versace and a number of other designer clothing labels. These kinds of ideas are turned into prototypes in employees’ spare time, Nevill-Manning added; a company rule stipulates that 20 per cent of their work hours can be devoted to brainstorming.
“We don’t quite know what it will be useful for,” he said of Google Sets, “but it’s awfully fun.”
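A hypothetical sketch of how a Google Sets-style expansion might work is to rank items that co-occur with the seed terms in known lists. The tiny corpus and the scoring below are invented purely for illustration:

```python
from collections import Counter

def expand_set(seed, lists):
    """Suggest items that appear alongside the seed terms: any list
    sharing at least one seed term votes for its other members."""
    seed = set(seed)
    counts = Counter()
    for items in lists:
        if seed & set(items):          # this list shares a seed term
            for item in items:
                if item not in seed:
                    counts[item] += 1  # one vote per co-occurring list
    return [item for item, _ in counts.most_common()]

# A toy corpus of lists found "in the wild".
lists = [
    ["Prada", "Armani", "Versace"],
    ["Hugo Boss", "Versace", "Gucci"],
    ["Prada", "Gucci"],
]
print(expand_set(["Prada", "Armani", "Hugo Boss"], lists))
```

With the three seed labels from the article's example, the sketch surfaces Versace and Gucci, the other designer names in the toy corpus.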