Google recently announced its new search feature called “Instant”. Early reports about the new service suggest that a lot of back-end work had to take place first. In fact, it sounds like deficiencies in the speed at which Google was able to update its index were a motivating factor. The Register points to Google’s “Caffeine” release earlier this year as a departure from its reliance on MapReduce in favor of improvements to BigTable. The real story here is the much-needed reimplementation of the Google File System.
A Constantly Changing Landscape
Google undertook the improved search infrastructure in part as a reaction to the speed at which information is added to the internet, specifically on Twitter and then a few other social media sites. Closed systems like Facebook limit Google’s ability “to organize the world’s information and make it universally accessible and useful.” Yes, publicly visible pages are displayed in search results, but the overwhelming majority of data in Facebook may never be accessed by Google. That might not be a bad thing.
I’m sure many people would not want their profile indexed and displayed online to just anyone searching on Google. Despite the fact that many Facebook users complain about privacy issues, most users don’t understand that Facebook has had a good security system for some time. Not everyone knows how to use it effectively.
Google Caffeine
The Caffeine release in June of this year was a complete overhaul of Google’s indexing system, with an obvious eye toward future growth, scalability and flexibility.
Google wrote on the Webmaster Central blog in June about the Caffeine release:
Caffeine lets us index web pages on an enormous scale. In fact, every second Caffeine processes hundreds of thousands of pages in parallel. If this were a pile of paper it would grow three miles taller every second. Caffeine takes up nearly 100 million gigabytes of storage in one database and adds new information at a rate of hundreds of thousands of gigabytes per day. You would need 625,000 of the largest iPods to store that much information; if these were stacked end-to-end they would go for more than 40 miles.
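The numbers in that quote can be sanity-checked with a little arithmetic. The sketch below uses only the figures Google quotes; the function names are my own, and the only outside fact assumed is the standard miles-to-meters conversion. It works out what each of those 625,000 iPods would have to hold, and how long each would have to be, for the comparison to add up.

```python
# Figures taken directly from the quoted Google post.
INDEX_SIZE_GB = 100_000_000   # "nearly 100 million gigabytes"
IPOD_COUNT = 625_000          # "625,000 of the largest iPods"
STACK_MILES = 40              # "more than 40 miles"

METERS_PER_MILE = 1609.344    # standard conversion factor


def implied_ipod_capacity_gb(index_gb=INDEX_SIZE_GB, ipods=IPOD_COUNT):
    """Storage each iPod must hold for the comparison to work."""
    return index_gb / ipods


def implied_ipod_length_cm(miles=STACK_MILES, ipods=IPOD_COUNT):
    """Length of each iPod if the stack spans the quoted distance."""
    return miles * METERS_PER_MILE * 100 / ipods


print(implied_ipod_capacity_gb())           # GB per iPod
print(round(implied_ipod_length_cm(), 1))   # cm per iPod
```

The result is 160 GB and roughly 10 cm per device, which is consistent with a 160 GB iPod of about that height, so the comparison holds together.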
And that is just for today’s World Wide Web. Back in the early 2000s, for those of us who remember, the Google index was rebuilt every month and the rebuilds had code names. In subsequent years Google started updating the index more frequently, first once a day and then once an hour.
Now, in this latest release, the index is being updated constantly: every 10 seconds, as Peter Norvig stated in March. It also sounds like the previous indexing system was tiered or layered, which let some layers become as much as two weeks old before being updated.
If you’re reworking the indexing system that heavily, you can imagine the infrastructure updates needed to support its new requirements.
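To make the difference between periodic rebuilds and constant updates concrete, here is a toy sketch of an index that is updated one document at a time. It is purely illustrative and says nothing about how Caffeine is actually built; the class `IncrementalIndex` and its methods are hypothetical names of my own.

```python
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index updated per document (hypothetical, not Caffeine)."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.doc_terms = {}               # doc id -> terms currently indexed for it

    def upsert(self, doc_id, text):
        """Index a new or changed document without rebuilding anything else."""
        new_terms = set(text.lower().split())
        old_terms = self.doc_terms.get(doc_id, set())
        # Remove the document from terms it no longer contains.
        for term in old_terms - new_terms:
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]
        # Add the document under terms it has gained.
        for term in new_terms - old_terms:
            self.postings[term].add(doc_id)
        self.doc_terms[doc_id] = new_terms

    def search(self, term):
        """Return a copy of the ids of documents containing the term."""
        return set(self.postings.get(term.lower(), set()))


index = IncrementalIndex()
index.upsert("a", "google instant search")
index.upsert("b", "caffeine index")
index.upsert("a", "google caffeine")   # page "a" changed; only its entries move
print(index.search("caffeine"))
print(index.search("instant"))
```

A changed page touches only its own entries, so it becomes searchable as soon as it is processed instead of waiting for the next full rebuild of its layer.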
GFS2 aka Colossus
The original Google File System is about 10 years old now, and its biggest bottleneck has been known for some time: a single master. Sometimes the shortest path to solving today’s business problem is to defer other decisions to the future.
Sean Quinlan of Google states in an interview with ACM:
The decision to build the original GFS around the single master really helped get something out into the hands of users much more rapidly than would have otherwise been possible.
The new file system implementation breaks away from a single master in favor of distributed masters and distributed slave nodes for managing the file system. It sounds like the new approach in GFS2 borrows design principles from large LDAP deployments such as Active Directory, where strategically deploying more masters and slaves improves response time and adds fault tolerance. The new implementation also relies more heavily on BigTable to store and retrieve node information.
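As a rough illustration of why more than one master helps, the sketch below spreads file metadata across several masters by hashing the file path. This is my own assumption about one way to partition the work, not a description of how GFS2 does it; `MetadataMaster`, `MasterCluster` and the server names are all hypothetical.

```python
import hashlib


class MetadataMaster:
    """One master holding metadata for a share of the files (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.chunk_locations = {}  # file path -> list of chunk server names


class MasterCluster:
    """Routes each file path to one of several masters by hashing the path."""

    def __init__(self, master_names):
        self.masters = [MetadataMaster(name) for name in master_names]

    def master_for(self, path):
        # A stable hash means every client picks the same master for a path.
        digest = hashlib.sha256(path.encode("utf-8")).digest()
        slot = int.from_bytes(digest[:8], "big") % len(self.masters)
        return self.masters[slot]

    def register(self, path, chunk_servers):
        self.master_for(path).chunk_locations[path] = list(chunk_servers)

    def lookup(self, path):
        return self.master_for(path).chunk_locations.get(path)


cluster = MasterCluster(["master-0", "master-1", "master-2"])
cluster.register("/logs/2010-09-08", ["chunk-17", "chunk-42"])
print(cluster.master_for("/logs/2010-09-08").name)
print(cluster.lookup("/logs/2010-09-08"))
```

With a single master every lookup lands on the same machine. Here each master answers only for its own share of paths, so lookups spread out as masters are added. A real system would also need replication and a way to rebalance when masters join or leave, which this sketch leaves out.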
Constant Improvement
The long-term success of a software project is not determined by its initial success.
Imagine if Google had stopped at online advertising and called it a day. There would be no Answers, no Gmail, no Google News, no Webmaster Tools, no Maps, no Blogger, no Calendar, no Talk, no Docs, no Picasa purchase, no YouTube purchase, no Analytics, no Voice (GrandCentral), no Checkout, no Android, no Google Code, no App Engine, no Base, no Google Tablet, no …
Successful software projects have good managers who understand that success is only temporary if you’re not constantly improving.