From d8cef8971717d689fb206bfb3e4df6d9ae96d823 Mon Sep 17 00:00:00 2001
From: Andreas Baumann
Date: Wed, 15 Oct 2014 09:47:15 +0200
Subject: updated todo list

---
 TODOS | 60 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 60 insertions(+)

diff --git a/TODOS b/TODOS
index 59701a7..76c42c1 100755
--- a/TODOS
+++ b/TODOS
@@ -18,3 +18,63 @@
 - SpoolRewindInputStream
   - thread-safe name of spool file, not /tmp/spool.tmp
   - Windows and Linux should respect %TEMP% resp. $TEMPDIR
+- architectural premises
+  - avoid disk; memory is available in abundance nowadays, so rather
+    distribute. We can also distribute easily across a cluster; disk
+    and network are not so simple to distribute in clouds. Latency
+    is a big issue in clouds.
+- urlseen
+  - in-memory set implementation
+  - bloom filter implementation (sketch 1 after the patch)
+  - keep mostly in memory and swap out portions not needed (IRLBot)
+  - a database implementation is too slow
+  - scalable implementation
+  - FIFOs hit single hosts, not good
+  - for recrawling we must also support deletion
+- robots.txt
+  - key-value storage, shared between instances (can be organized
+    in the classical distributed linear hashing way)
+- DNS
+  - DNS cache sharing (how, when several implementations are used
+    in the modules?)
+- frontier
+  - centralized vs. decentralized
+  - again, using the hash(host) linear hashing approach should work
+  - the frontier should keep the URLs (so we can even compress
+    parts of the frontier if needed); it should be kept apart
+    from the per-URL crawl data.
+  - tricky question: can a centralized DB be partitioned better
+    with database means (a partitioning function), or is it better
+    to use dedicated URL servers?
+  - consistent hashing: avoids hash shifts on reorganization
+    (UbiCrawler); important is that a distributed downloader can
+    "jump in" for another one on a machine that is down. Maybe in
+    a modern cloud it's more important to continue where another
+    one left off and to avoid zombies (sketch 2 after the patch).
+  - multiple queues in the frontier implementing policies
+    - discrete priority levels with a queue for each; URLs are
+      moved between levels (pro- and demotion). Low-priority queues
+      can use slower, disk-based storage; high-priority queues stay
+      in memory. Picking URLs is simple (sketch 3 after the patch).
+- duplicate detection
+  - easy, based on hashes of shingles, as we know it (sketch 4
+    after the patch)
+- near-duplicate detection
+  - still sort of a challenge, as the problem is not symmetrical
+- Heritrix is close to Mercator, so we could later migrate easily
+  to my crawler, as it is also mainly based on Mercator
+- nanomsg between components to synchronize behaviour (sketch 5
+  after the patch)? Centralized data structures should be lock-free
+  or lock-light and again be physically distributable
+- crawl local pages in the same download process
+- we should avoid a multi-threaded approach, as it tends to get
+  complicated too soon. But how are multiple fetchers done then?
+- forgotten components:
+  - URLPrioritizer: picks the place of detected or rechecked URLs
+    in the frontier.
+    - Or should the frontier do it itself?
+    - Should the host netiquette also be put in here?
+  - URLDistributor: this component decides the policy for exchanging
+    URLs with other processes (sketch 6 after the patch).
+    - random distribution
+    - keep local links of a host local
--
cgit v1.2.3-54-g00ecf
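
Sketch 1: urlseen as a Bloom filter. A minimal in-memory sketch of the
"bloom filter implementation" item in the patch, assuming double hashing
over std::hash is good enough for illustration; the name UrlSeenBloom and
the parameters m and k are made up here, not taken from the crawler's
code. One caveat relevant to the deletion item: a plain Bloom filter
cannot delete entries, so recrawling would need a counting variant or
periodic rebuilds.

#include <cstddef>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

class UrlSeenBloom {
public:
    // m: number of bits, k: number of hash functions
    UrlSeenBloom(std::size_t m, unsigned k) : m_bits(m, false), m_k(k) {}

    void add(const std::string &url) {
        for (unsigned i = 0; i < m_k; i++)
            m_bits[index(url, i)] = true;
    }

    // may answer 'true' for a URL never added (false positive),
    // but never 'false' for one that was added
    bool probablySeen(const std::string &url) const {
        for (unsigned i = 0; i < m_k; i++)
            if (!m_bits[index(url, i)]) return false;
        return true;
    }

private:
    // double hashing h1 + i * h2, a common way to derive k hash values
    std::size_t index(const std::string &url, unsigned i) const {
        std::size_t h1 = std::hash<std::string>{}(url);
        std::size_t h2 = std::hash<std::string>{}(url + "#") | 1;
        return (h1 + i * h2) % m_bits.size();
    }

    std::vector<bool> m_bits;
    unsigned m_k;
};

int main() {
    UrlSeenBloom seen(1 << 20, 4);
    seen.add("http://www.example.com/");
    std::cout << seen.probablySeen("http://www.example.com/") << "\n"; // 1
    std::cout << seen.probablySeen("http://www.example.org/") << "\n"; // almost surely 0
    return 0;
}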
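
Sketch 2: consistent hashing for host-to-downloader assignment. A sketch
of the ring structure hinted at by the UbiCrawler remark, assuming
virtual node points on a ring keyed by std::hash; ConsistentRing, the
node names and the replica count are illustrative choices, not the
crawler's API. When a node is removed, only the hosts that hashed to its
ring points move to a neighbour, which is exactly what lets another
downloader jump in for a machine that is down.

#include <cstddef>
#include <functional>
#include <iostream>
#include <map>
#include <string>

class ConsistentRing {
public:
    // each node gets 'replicas' points on the ring to smooth the load
    void addNode(const std::string &node, int replicas = 64) {
        for (int i = 0; i < replicas; i++)
            m_ring[hash(node + "#" + std::to_string(i))] = node;
    }

    void removeNode(const std::string &node, int replicas = 64) {
        for (int i = 0; i < replicas; i++)
            m_ring.erase(hash(node + "#" + std::to_string(i)));
    }

    // first ring point at or after hash(host), wrapping around
    std::string nodeFor(const std::string &host) const {
        auto it = m_ring.lower_bound(hash(host));
        if (it == m_ring.end()) it = m_ring.begin();
        return it->second;
    }

private:
    static std::size_t hash(const std::string &s) {
        return std::hash<std::string>{}(s);
    }

    std::map<std::size_t, std::string> m_ring;
};

int main() {
    ConsistentRing ring;
    ring.addNode("downloader-1");
    ring.addNode("downloader-2");
    ring.addNode("downloader-3");
    std::cout << ring.nodeFor("www.example.com") << "\n";
    // if downloader-2 dies, only its hosts get reassigned;
    // all other host-to-node mappings stay put
    ring.removeNode("downloader-2");
    std::cout << ring.nodeFor("www.example.com") << "\n";
    return 0;
}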
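
Sketch 3: discrete priority levels in the frontier. A toy in-memory
version of the multiple-queue idea: one FIFO per level, pop() scans from
the highest level down, and a prioritizer would move URLs between levels
(pro- and demotion). In a real frontier the low levels could be backed by
slower disk storage, as the item says; the names Frontier, push and pop
are hypothetical.

#include <algorithm>
#include <cstddef>
#include <deque>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

class Frontier {
public:
    explicit Frontier(std::size_t levels) : m_queues(levels) {}

    // level 0 is the highest priority; out-of-range levels are clamped
    void push(const std::string &url, std::size_t level) {
        m_queues[std::min(level, m_queues.size() - 1)].push_back(url);
    }

    // picking is simple: take from the first non-empty level
    std::optional<std::string> pop() {
        for (auto &q : m_queues) {
            if (!q.empty()) {
                std::string url = q.front();
                q.pop_front();
                return url;
            }
        }
        return std::nullopt;
    }

private:
    std::vector<std::deque<std::string>> m_queues;
};

int main() {
    Frontier f(3);
    f.push("http://slow.example.com/archive/1998", 2); // demoted
    f.push("http://news.example.com/", 0);             // promoted
    while (auto url = f.pop())
        std::cout << *url << "\n"; // news first, archive last
    return 0;
}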
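
Sketch 4: duplicate detection via shingle hashes. A sketch of the
"hashes of shingles" idea: tokenize the page, hash every w-word shingle,
and compare the resulting sets with the Jaccard resemblance; 1.0 then
means an exact duplicate and values close to 1.0 flag near duplicates.
The shingle width w = 4 and the function names are arbitrary choices for
the example.

#include <algorithm>
#include <cstddef>
#include <functional>
#include <iostream>
#include <iterator>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// hash all w-word shingles of a whitespace-tokenized text
std::set<std::size_t> shingleHashes(const std::string &text, std::size_t w = 4) {
    std::vector<std::string> tok;
    std::istringstream in(text);
    for (std::string t; in >> t;) tok.push_back(t);
    std::set<std::size_t> out;
    if (tok.size() < w) return out;
    for (std::size_t i = 0; i + w <= tok.size(); i++) {
        std::string sh;
        for (std::size_t j = 0; j < w; j++) sh += tok[i + j] + " ";
        out.insert(std::hash<std::string>{}(sh));
    }
    return out;
}

// Jaccard resemblance of two shingle sets: |A n B| / |A u B|
double resemblance(const std::set<std::size_t> &a, const std::set<std::size_t> &b) {
    std::vector<std::size_t> i, u;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(i));
    std::set_union(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(u));
    return u.empty() ? 0.0 : double(i.size()) / double(u.size());
}

int main() {
    auto a = shingleHashes("the quick brown fox jumps over the lazy dog");
    auto b = shingleHashes("the quick brown fox jumps over the lazy cat");
    std::cout << resemblance(a, a) << "\n"; // 1: exact duplicate
    std::cout << resemblance(a, b) << "\n"; // < 1: near duplicate
    return 0;
}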
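
Sketch 5: nanomsg between components. A minimal pub/sub round trip over
an inproc transport, assuming the nanomsg C API (nn_socket, nn_bind,
nn_connect, nn_send, nn_recv); the endpoint name and the HOSTDONE
message format are invented for the example, and whether pub/sub is the
right pattern is exactly the open question in the item above. Build with
-lnanomsg.

#include <cassert>
#include <cstring>
#include <iostream>
#include <unistd.h>
#include <nanomsg/nn.h>
#include <nanomsg/pubsub.h>

int main() {
    // publisher, e.g. a downloader announcing finished hosts
    int pub = nn_socket(AF_SP, NN_PUB);
    assert(pub >= 0);
    assert(nn_bind(pub, "inproc://frontier-events") >= 0);

    // subscriber, e.g. the frontier; empty prefix = subscribe to all
    int sub = nn_socket(AF_SP, NN_SUB);
    assert(sub >= 0);
    assert(nn_setsockopt(sub, NN_SUB, NN_SUB_SUBSCRIBE, "", 0) >= 0);
    assert(nn_connect(sub, "inproc://frontier-events") >= 0);

    // sketch-level robustness: give the inproc pipe time to settle
    usleep(100000);

    const char *msg = "HOSTDONE www.example.com";
    assert(nn_send(pub, msg, strlen(msg) + 1, 0) >= 0);

    char *buf = nullptr;
    int n = nn_recv(sub, &buf, NN_MSG, 0); // NN_MSG: library allocates
    assert(n >= 0);
    std::cout << buf << "\n";
    nn_freemsg(buf);

    nn_close(sub);
    nn_close(pub);
    return 0;
}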
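
Sketch 6: URLDistributor policies. The two policies from the "forgotten
components" list side by side: pure random distribution versus keeping
links local when the target host matches the host currently being
crawled, falling back to hash(host) otherwise. The Policy enum and the
targetProcess signature are illustrative only, not the component's
actual interface.

#include <cstdlib>
#include <functional>
#include <iostream>
#include <string>

enum class Policy { Random, KeepHostLocal };

// decides which crawl process should receive a newly found URL
int targetProcess(const std::string &host, const std::string &currentHost,
                  int self, int nprocs, Policy p) {
    switch (p) {
    case Policy::Random:
        return std::rand() % nprocs; // random distribution
    case Policy::KeepHostLocal:
        if (host == currentHost)
            return self; // local links of a host stay local
        return int(std::hash<std::string>{}(host) % std::size_t(nprocs));
    }
    return self; // unreachable; silences compiler warnings
}

int main() {
    std::cout << targetProcess("www.example.com", "www.example.com",
                               2, 8, Policy::KeepHostLocal) << "\n"; // 2
    std::cout << targetProcess("www.example.org", "www.example.com",
                               2, 8, Policy::KeepHostLocal) << "\n"; // by hash
    return 0;
}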