Near-optimal Monitoring of Online Data Sources
Crawling the Web for interesting and relevant changes has become increasingly difficult due to the abundance of frequently changing information. Common techniques for solving such problems make use of heuristics, which do not provide performance guarantees and tend to be tailored to specific scenarios or benchmarks.
In this talk, I will present a principled approach based on mathematical optimization for monitoring high-volume online data sources. We have built and deployed a distributed system called Corona that enables clients to subscribe to Web pages and notifies clients of updates asynchronously via instant messages. Corona assigns multiple nodes to cooperatively monitor each Web page and employs a novel decentralized optimization technique for distributing the monitoring load. In its currently running form, the optimization algorithm guarantees the best update detection time on average without exceeding resource constraints on the monitoring servers. Based on simulations and measurements on our deployed system, I will show that Corona performs substantially better than commonly used heuristics.
July 27, 2006