- use traits in RewindInputStream, alternative wrappers for char/string traits depending on the underlying I/O stream
- spooling in RIS:
  - thread-safe temp names (sketch below)
  - on Windows, respect environment variables like %TEMP%
- module loader
  - use weak pointers to create objects (pointers which don't transfer ownership but act more as a handle); avoids funny trouble with objects originating from modules, especially DLLs on Windows (sketch below)
  - ctor parameter types must be part of the register function signature
- type detection
  - content-based type detection on Windows
    - port of libmagic?
    - something from Microsoft (around the indexing service)?
- robots.txt
  - handle Sitemap directives
  - parse URLs from sitemaps
- SpoolRewindInputStream
  - thread-safe name of the spool file, not a hard-coded /tmp/spool.tmp
  - Windows and Linux should respect %TEMP% resp. $TMPDIR
- architectural premises
  - avoid disk; memory is abundant nowadays, so rather distribute. We can also distribute easily on a cluster, while disk and network are not so simple to distribute in clouds. Latency is a big issue in clouds.
- urlseen
  - in-memory set implementation
  - Bloom filter implementation (sketch below)
    - keep it mostly in memory and swap out portions not currently needed (IRLbot)
  - a database implementation is too slow
  - scalable implementation
    - plain FIFOs hit single hosts, not good
  - for recrawling we must also support deletion
- robots.txt
  - key-value storage, shared between instances (can be organized in the classical distributed linear-hashing way)
- DNS
  - DNS cache sharing (how, when several implementations are used in the modules?)
- frontier
  - centralized vs. decentralized
    - again, the hash(host) linear-hashing approach should work
  - the frontier should keep the URLs (so we can even compress parts of the frontier if needed); it should be kept apart from the per-URL crawl data
  - tricky question: can a centralized DB be partitioned better with database means (partitioning function), or is it better to use dedicated URL servers?
  - consistent hashing: avoid hash shifts on reorganization (UbiCrawler). What matters is that a distributed downloader can jump in for another one running on a machine that is down. Maybe in a modern cloud environment it is more important to continue where another one left off and to make sure zombies are avoided. (sketch below)
  - multiple queues in the frontier implementing policies
    - discrete priority levels with a queue for each; URLs are moved between levels (promotion and demotion). Low-priority queues can use slower, more disk-based storage; high-priority queues should be kept in memory. Picking URLs is simple this way. (sketch below)
- duplicate detection
  - easy, based on hashes of shingles as we know it (sketch below)
- near-duplicate detection
  - still sort of a challenge, as the problem is not symmetric
- Heritrix is close to Mercator, so we could migrate easily to my crawler later, as it is also mainly based on Mercator
- nanomsg between components to synchronize behaviour? Centralized data structures should be lock-free or lock-light and again be physically distributable
- crawling local pages in the same download process
- we should avoid a multi-threaded approach, as it tends to get complicated too quickly. But how are multiple fetchers done then?
- forgotten components:
  - URLPrioritizer: picks the place of newly detected or rechecked URLs in the frontier
    - or should the frontier do it itself?
    - should host netiquette also be put in here?
  - URLDistributor: the component that decides the policy for exchanging URLs with other processes
    - distribute randomly
    - keep links within a host on the local process
- debugging embedded Lua
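
For the spooling/temp-name point above, a minimal sketch of how SpoolRewindInputStream could pick a unique, thread-safe spool file that honours %TEMP% on Windows and $TMPDIR on Linux. The function name makeSpoolFile() is illustrative only, not existing code.

    // cross-platform unique spool file name; avoids the hard-coded /tmp/spool.tmp
    #include <cstdlib>
    #include <stdexcept>
    #include <string>

    #ifdef _WIN32
    #include <windows.h>
    #else
    #include <stdlib.h>
    #include <unistd.h>
    #endif

    // Returns the path of a freshly created, unique spool file. mkstemp and
    // GetTempFileNameA create the file atomically, so concurrent threads can
    // never race for the same name.
    static std::string makeSpoolFile()
    {
    #ifdef _WIN32
        char dir[MAX_PATH + 1];
        char name[MAX_PATH + 1];
        // GetTempPathA honours %TMP%/%TEMP% and falls back to the Windows directory
        if (GetTempPathA(sizeof(dir), dir) == 0)
            throw std::runtime_error("cannot determine temp directory");
        if (GetTempFileNameA(dir, "spl", 0, name) == 0)
            throw std::runtime_error("cannot create spool file");
        return std::string(name);
    #else
        const char *dir = std::getenv("TMPDIR");
        if (!dir || !*dir) dir = "/tmp";
        std::string templ = std::string(dir) + "/spool.XXXXXX";
        // mkstemp replaces the XXXXXX placeholder and opens the file exclusively
        int fd = mkstemp(&templ[0]);
        if (fd < 0)
            throw std::runtime_error("cannot create spool file");
        close(fd);   // the stream wrapper reopens the file by name
        return templ;
    #endif
    }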
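
For the module-loader point, a sketch of the "handle, not ownership" idea: a registry owns every object built by a module via shared_ptr and clients only ever get weak_ptr handles, so unloading the module clears the registry and stale handles expire instead of pointing into an unloaded DLL. ModuleRegistry and Object are assumed names, not existing classes.

    #include <memory>
    #include <vector>

    struct Object {
        virtual ~Object() {}
        virtual void run() = 0;
    };

    class ModuleRegistry {
    public:
        // factory signature registered by the module; ctor parameter types
        // would also have to be encoded in the register function signature
        using Factory = std::shared_ptr<Object> (*)();

        std::weak_ptr<Object> create(Factory factory) {
            std::shared_ptr<Object> obj = factory();
            m_owned.push_back(obj);            // ownership stays with the registry
            return std::weak_ptr<Object>(obj); // caller only gets a handle
        }

        // called before dlclose()/FreeLibrary(): all outstanding handles expire safely
        void unloadAll() { m_owned.clear(); }

    private:
        std::vector<std::shared_ptr<Object>> m_owned;
    };

    // client side: always lock() the handle before use
    //   if (auto obj = handle.lock()) obj->run();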
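
For the urlseen Bloom filter, a minimal in-memory sketch using two FNV-style hashes and double hashing; a real deployment would size the bit array and hash count from the expected URL volume. Note that a plain Bloom filter cannot delete entries, so the recrawling/deletion requirement would need a counting variant or a companion exact set.

    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    class UrlSeenFilter {
    public:
        UrlSeenFilter(std::size_t bits, unsigned hashes)
            : m_bits(bits), m_hashes(hashes), m_filter(bits, false) {}

        void add(const std::string &url) {
            std::uint64_t h1, h2;
            hashPair(url, h1, h2);
            for (unsigned i = 0; i < m_hashes; ++i)
                m_filter[(h1 + i * h2) % m_bits] = true;
        }

        // false: definitely new URL; true: probably seen (false positives possible)
        bool probablySeen(const std::string &url) const {
            std::uint64_t h1, h2;
            hashPair(url, h1, h2);
            for (unsigned i = 0; i < m_hashes; ++i)
                if (!m_filter[(h1 + i * h2) % m_bits]) return false;
            return true;
        }

    private:
        // two simple FNV-style mixing hashes used for double hashing
        static void hashPair(const std::string &s, std::uint64_t &h1, std::uint64_t &h2) {
            h1 = 14695981039346656037ULL;
            h2 = 0x9E3779B97F4A7C15ULL;
            for (unsigned char c : s) {
                h1 = (h1 ^ c) * 1099511628211ULL;
                h2 = (h2 ^ c) * 0xC2B2AE3D27D4EB4FULL;
            }
            if (h2 == 0) h2 = 1;   // a step of 0 would probe a single bit only
        }

        std::size_t m_bits;
        unsigned m_hashes;
        std::vector<bool> m_filter;
    };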
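
For the consistent-hashing point in the frontier section, a sketch of host-to-downloader assignment with a ring of virtual nodes keyed by std::hash. When a machine dies, only its arc of hosts moves to the next node instead of everything being rehashed. Node names like "crawler-1" and the replica count are placeholder assumptions.

    #include <cstdint>
    #include <functional>
    #include <map>
    #include <stdexcept>
    #include <string>

    class ConsistentRing {
    public:
        explicit ConsistentRing(unsigned replicas = 100) : m_replicas(replicas) {}

        void addNode(const std::string &node) {
            for (unsigned i = 0; i < m_replicas; ++i)
                m_ring[hash(node + "#" + std::to_string(i))] = node;
        }

        void removeNode(const std::string &node) {
            for (unsigned i = 0; i < m_replicas; ++i)
                m_ring.erase(hash(node + "#" + std::to_string(i)));
        }

        // the downloader responsible for all URLs of this host
        const std::string &nodeForHost(const std::string &host) const {
            if (m_ring.empty()) throw std::runtime_error("no downloader nodes");
            auto it = m_ring.lower_bound(hash(host));
            if (it == m_ring.end()) it = m_ring.begin();   // wrap around the ring
            return it->second;
        }

    private:
        static std::uint64_t hash(const std::string &s) {
            return std::hash<std::string>()(s);
        }

        unsigned m_replicas;
        std::map<std::uint64_t, std::string> m_ring;
    };

    // usage: ring.addNode("crawler-1"); ring.nodeForHost("example.com");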
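
For the multiple-queue frontier, a sketch (C++17) with one FIFO per discrete priority level, level 0 being the highest. Picking a URL scans from the top level down; promotion and demotion are just re-enqueueing at another level. In a real frontier the lower levels could live on disk while the top levels stay in memory. The class and method names are assumptions.

    #include <deque>
    #include <optional>
    #include <string>
    #include <vector>

    class Frontier {
    public:
        explicit Frontier(unsigned levels) : m_queues(levels) {}

        void push(const std::string &url, unsigned level) {
            if (level >= m_queues.size()) level = m_queues.size() - 1;
            m_queues[level].push_back(url);
        }

        // take the next URL from the best non-empty priority level
        std::optional<std::string> pop() {
            for (auto &q : m_queues) {
                if (!q.empty()) {
                    std::string url = q.front();
                    q.pop_front();
                    return url;
                }
            }
            return std::nullopt;
        }

        // promotion/demotion is simply enqueueing at a different level
        void reprioritize(const std::string &url, unsigned newLevel) {
            push(url, newLevel);
        }

    private:
        std::vector<std::deque<std::string>> m_queues;
    };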
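
For shingle-based (near) duplicate detection, a sketch assuming whitespace tokenization and 4-word shingles; resemblance is the Jaccard overlap of the two shingle-hash sets. Exact duplicates give a resemblance of 1.0, near duplicates something close to it. The tokenization and window size are illustrative assumptions.

    #include <cstdint>
    #include <functional>
    #include <set>
    #include <sstream>
    #include <string>
    #include <vector>

    // hash every w-word shingle of a document into a set
    static std::set<std::uint64_t> shingleHashes(const std::string &text, std::size_t w = 4)
    {
        std::istringstream in(text);
        std::vector<std::string> words;
        for (std::string word; in >> word; ) words.push_back(word);

        std::set<std::uint64_t> hashes;
        for (std::size_t i = 0; i + w <= words.size(); ++i) {
            std::string shingle;
            for (std::size_t j = 0; j < w; ++j) shingle += words[i + j] + ' ';
            hashes.insert(std::hash<std::string>()(shingle));
        }
        return hashes;
    }

    // Jaccard resemblance of two documents' shingle sets, in [0, 1]
    static double resemblance(const std::set<std::uint64_t> &a, const std::set<std::uint64_t> &b)
    {
        if (a.empty() && b.empty()) return 1.0;
        std::size_t common = 0;
        for (std::uint64_t h : a) common += b.count(h);
        return static_cast<double>(common) / (a.size() + b.size() - common);
    }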