author | Andreas Baumann <mail@andreasbaumann.cc> | 2014-10-15 09:52:50 +0200
committer | Andreas Baumann <mail@andreasbaumann.cc> | 2014-10-15 09:52:50 +0200
commit | e2f8ce69fdaf9da19eb0adb9595d8daa18668137 (patch)
tree | 7939c11fcb6d173d0d57671ec7add5aaf1580796
parent | 70caaffb21a0830911c8c281a16324ac0af64061 (diff)
parent | d8cef8971717d689fb206bfb3e4df6d9ae96d823 (diff)
Merge branch 'master' of ssh://andreasbaumann.dyndns.org:2222/crawler
-rwxr-xr-x | TODOS | 60
1 file changed, 60 insertions(+), 0 deletions(-)
@@ -18,3 +18,63 @@
 - SpoolRewindInputStream
   - thread-safe name of spool file, not /tmp/spool.tmp
   - Windows and Linux should respect %TEMP% resp. $TEMPDIR
+- architectural premises
+  - avoid disk; memory is abundant nowadays, rather distribute.
+    We can also distribute easily on a cluster; disk and network
+    are not so simple to distribute in clouds. Latency is a big
+    issue in clouds.
+- urlseen
+  - in-memory set<URL> implementation
+  - bloom filter implementation
+  - keep mostly in memory and swap out portions not needed (IRLBot)
+  - a database implementation is too slow
+  - scalable implementation
+    - FIFOs hit single hosts, not good
+  - for recrawling we must also support deletion
+- robots.txt
+  - key-value storage, shared between instances (can be organized
+    in the classical distributed linear hashing way)
+- DNS
+  - DNS cache sharing (how, when several implementations are used
+    in the modules?)
+- frontier
+  - centralized vs. decentralized
+    - again, the hash(host) linear hashing approach should work
+  - the frontier should keep the URLs (so we can even compress
+    parts of the frontier if needed); it should be kept apart
+    from the crawl data per URL.
+  - tricky question: can a centralized DB be better partitioned
+    with database means (partitioning function), or is it better
+    to use dedicated URL servers?
+  - consistent hashing: avoid reorganization hash shifts (UbiCrawler);
+    what matters is that a distributed downloader can "jump in" for
+    another one that is on a down machine. Maybe in a modern cloud
+    environment it's more important to continue where another one
+    left off and to make sure to avoid zombies.
+  - multiple queues in the frontier implementing policies
+    - discrete priority levels with a queue for each. URLs are
+      moved between levels (pro- and demotion). Low-priority queues
+      can choose slower, more disk-based storage; high-priority
+      queues should be kept in memory. Picking URLs is simple this way.
+- duplicate detection
+  - easy, based on hashes of shingles as we know it
+- near-duplicate detection
+  - still sort of a challenge, as the problem is not symmetrical
+- Heritrix is close to Mercator, so we could later migrate easily
+  to my crawler, as it is also mainly based on Mercator
+- nanomsg between components to synchronize behaviour? centralized
+  data structures should be lock-free or lock-light and again be
+  physically distributable
+- crawl local pages in the same download process
+- we should avoid a multi-threaded approach, as it tends to get
+  complicated too soon. How are multiple fetchers done then?
+- forgotten components:
+  - URLPrioritizer: picks the place of detected or rechecked URLs
+    in the frontier.
+    - Or should the frontier do it itself?
+    - Should the host netiquette also be put in here?
+  - URLDistributor: this component decides the policy of exchanging
+    URLs with other processes.
+    - distribute randomly
+    - keep links local to a host on the local process