A Search Engine Architecture Based on Collection Selection
Google Tech Talks
December, 19 2007
We present a distributed architecture for a Web search engine, based on the concept of collection selection. We introduce a novel approach to partition the collection of documents, able to greatly improve the effectiveness of standard collection selection techniques (CORI), and a new selection function outperforming the state of the art. Our technique is based on the novel query-vector (QV) document model, built from the analysis of query logs, and on our strategy of co-clustering queries and documents at the same time. By suitably partitioning the documents in the collection, our system is able to select the subset of servers containing the most relevant documents for each query. Instead of broadcasting the query to every server in the computing platform, only the most relevant will be polled, this way reducing the average computing cost to solve a query. We introduce a novel strategy to use the instant load at each server to drive the query routing. Also, we describe a new approach to caching, able to incrementally improve the quality of the stored
results. Our caching strategy is effectively both in reducing computing load and in improving result quality. The proposed architecture, overall, presents a trade-off between computing cost and result quality, and we show how to guarantee very precise results in face of a dramatic reduction to computing load. This means that, with the same computing infrastructure, our system can serve more users, more queries and more documents.
Speaker: Diego Puppin