author    Andreas Baumann <mail@andreasbaumann.cc>  2014-10-15 09:52:50 +0200
committer Andreas Baumann <mail@andreasbaumann.cc>  2014-10-15 09:52:50 +0200
commit    e2f8ce69fdaf9da19eb0adb9595d8daa18668137 (patch)
tree      7939c11fcb6d173d0d57671ec7add5aaf1580796
parent    70caaffb21a0830911c8c281a16324ac0af64061 (diff)
parent    d8cef8971717d689fb206bfb3e4df6d9ae96d823 (diff)
download  crawler-e2f8ce69fdaf9da19eb0adb9595d8daa18668137.tar.gz
          crawler-e2f8ce69fdaf9da19eb0adb9595d8daa18668137.tar.bz2
Merge branch 'master' of ssh://andreasbaumann.dyndns.org:2222/crawler
-rwxr-xr-x  TODOS  60
1 files changed, 60 insertions, 0 deletions
diff --git a/TODOS b/TODOS
index 59701a7..76c42c1 100755
--- a/TODOS
+++ b/TODOS
@@ -18,3 +18,63 @@
- SpoolRewindInputStream
- thread-safe name of spool file, not /tmp/spool.tmp
- Windows and Linux should respect %TEMP% and $TEMPDIR, respectively
+- architectural premises
+  - avoid disk; memory is abundant nowadays, so rather distribute.
+    We can also distribute easily across a cluster; disk and network
+    are not so simple to distribute in clouds. Latency is a big issue
+    in clouds.
+- urlseen
+  - in-memory implementation is a set<URL>
+  - bloom filter implementation (see the sketch below)
+  - keep mostly in memory and swap out portions not needed (IRLBot)
+  - a database implementation is too slow
+  - scalable implementation
+  - FIFOs hit single hosts, which is not good
+  - for recrawling we must also support deletion
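Roughly what the Bloom-filter-based urlseen test could look like; this is a sketch only, and the filter size, the double hashing via std::hash, and all names are assumptions rather than existing crawler code. A plain Bloom filter cannot delete entries, so the deletion needed for recrawling would call for a counting variant.

#include <bitset>
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>

// Sketch only: fixed-size Bloom filter for the "have we seen this URL?" test.
// Parameters are illustrative; a counting Bloom filter would be needed to
// support the deletion mentioned above for recrawling.
class BloomFilter {
public:
    void add(const std::string &url) {
        for (std::size_t i = 0; i < K; ++i) bits_.set(index(url, i));
    }
    // false: definitely unseen; true: probably seen (false positives possible)
    bool maybeContains(const std::string &url) const {
        for (std::size_t i = 0; i < K; ++i)
            if (!bits_.test(index(url, i))) return false;
        return true;
    }
private:
    static constexpr std::size_t M = 1u << 20;  // number of bits
    static constexpr std::size_t K = 4;         // number of hash functions
    static std::size_t index(const std::string &url, std::size_t i) {
        // double hashing: derive K indices from two hash values (h2 made odd)
        std::size_t h1 = std::hash<std::string>()(url);
        std::size_t h2 = std::hash<std::string>()(url + "#salt") | 1;
        return (h1 + i * h2) % M;
    }
    std::bitset<M> bits_;
};

int main() {
    BloomFilter seen;
    seen.add("http://example.com/");
    std::cout << seen.maybeContains("http://example.com/")       // 1
              << seen.maybeContains("http://example.com/other")  // almost surely 0
              << "\n";
}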
+- robots.txt
+  - key-value storage, shared between instances (can be organized in
+    the classical distributed linear hashing way); see the sketch below
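A small sketch of a robots.txt cache keyed by host and shared between crawler instances, partitioned by hash(host); plain modulo placement stands in for the distributed linear hashing mentioned above, and RobotsStore and its members are invented names.

#include <cstddef>
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch only: robots.txt bodies keyed by host, spread over N partitions by
// hash(host). In reality each partition would live in a different crawler
// instance and be reached over the network.
struct RobotsStore {
    explicit RobotsStore(std::size_t nodes) : partitions(nodes) {}

    std::size_t nodeFor(const std::string &host) const {
        return std::hash<std::string>()(host) % partitions.size();
    }
    void put(const std::string &host, const std::string &robotsTxt) {
        partitions[nodeFor(host)][host] = robotsTxt;
    }
    const std::string *get(const std::string &host) const {
        const auto &part = partitions[nodeFor(host)];
        auto it = part.find(host);
        return it == part.end() ? nullptr : &it->second;
    }

    std::vector<std::unordered_map<std::string, std::string>> partitions;
};

int main() {
    RobotsStore store(4);
    store.put("example.com", "User-agent: *\nDisallow: /tmp/\n");
    std::cout << "example.com is stored on node " << store.nodeFor("example.com")
              << ", found: " << (store.get("example.com") != nullptr) << "\n";
}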
+- DNS
+  - DNS cache sharing (how do we share it when several resolver
+    implementations are used in the modules?)
+- frontier
+  - centralized vs. decentralized
+  - again, the hash(host) linear hashing approach should work
+  - the frontier should keep the URLs (so we can even compress parts
+    of the frontier if needed); it should be kept apart from the
+    per-URL crawl data
+  - tricky question: can a centralized DB be partitioned better with
+    database means (a partitioning function), or is it better to use
+    dedicated URL servers?
+  - consistent hashing: avoids reorganization hash shifts (UbiCrawler);
+    the important point is that a distributed downloader can "jump in"
+    for another one on a down machine. In a modern cloud environment it
+    may be more important to continue where another one left off and to
+    avoid zombies (see the hash ring sketch below)
+  - multiple queues in the frontier implementing policies
+  - discrete priority levels with a queue per level; URLs are moved
+    between levels (promotion and demotion). Low-priority queues can use
+    slower, disk-based storage, while high-priority queues should stay
+    in memory; picking URLs is simple this way (see the queue sketch below)
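A minimal consistent hashing sketch for assigning hosts to downloader nodes, in the spirit of the UbiCrawler remark above: removing a down node only moves that node's hosts to the next node on the ring, so another downloader can jump in. The HashRing class, the number of virtual points, and the node names are illustrative only.

#include <cstddef>
#include <functional>
#include <iostream>
#include <map>
#include <string>

// Sketch only: a consistent hash ring mapping hosts to downloader nodes.
class HashRing {
public:
    void addNode(const std::string &node) {
        // several virtual points per node smooth out the distribution
        for (int v = 0; v < 64; ++v)
            ring_[hash(node + "#" + std::to_string(v))] = node;
    }
    void removeNode(const std::string &node) {
        for (auto it = ring_.begin(); it != ring_.end(); )
            if (it->second == node) it = ring_.erase(it); else ++it;
    }
    // the node responsible for a host: first ring point at or after hash(host)
    std::string nodeFor(const std::string &host) const {
        auto it = ring_.lower_bound(hash(host));
        if (it == ring_.end()) it = ring_.begin();  // wrap around
        return it->second;
    }
private:
    static std::size_t hash(const std::string &s) {
        return std::hash<std::string>()(s);
    }
    std::map<std::size_t, std::string> ring_;
};

int main() {
    HashRing ring;
    ring.addNode("crawler-a");
    ring.addNode("crawler-b");
    ring.addNode("crawler-c");
    std::cout << "example.com -> " << ring.nodeFor("example.com") << "\n";
    ring.removeNode("crawler-b");  // node down: only its hosts are reassigned
    std::cout << "example.com -> " << ring.nodeFor("example.com") << "\n";
}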
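And a sketch of the multi-level queue idea for the frontier: one queue per discrete priority level, always picking from the highest non-empty level. Promotion and demotion would simply re-enqueue a URL at another level and are omitted here; a real low-priority level would likely be disk-backed rather than a std::deque, and all names are made up.

#include <algorithm>
#include <cstddef>
#include <deque>
#include <iostream>
#include <string>
#include <vector>

// Sketch only: a frontier with discrete priority levels, one queue per level.
class Frontier {
public:
    explicit Frontier(std::size_t levels) : queues_(levels) {}

    // level 0 is the highest priority
    void push(const std::string &url, std::size_t level) {
        queues_[std::min(level, queues_.size() - 1)].push_back(url);
    }
    // picking is simple: take from the highest-priority non-empty level
    bool pop(std::string &url) {
        for (auto &q : queues_) {
            if (!q.empty()) {
                url = q.front();
                q.pop_front();
                return true;
            }
        }
        return false;
    }
private:
    std::vector<std::deque<std::string>> queues_;
};

int main() {
    Frontier frontier(3);
    frontier.push("http://example.com/archive/1997", 2);  // low priority
    frontier.push("http://example.com/news", 0);          // high priority
    std::string url;
    while (frontier.pop(url))
        std::cout << url << "\n";  // news first, archive second
}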
+- duplicate detection
+  - easy, based on hashes of shingles, as we know it (see the sketch below)
+- near duplicate detection
+  - still something of a challenge, as the problem is not symmetrical
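A sketch of the shingle hashing mentioned above: documents are cut into word shingles, the shingles are hashed into a set, and the overlap (Jaccard similarity) of two sets gives a duplicate score, 1.0 for exact duplicates and higher the more similar two documents are. The shingle length and the use of std::hash are assumptions.

#include <cstddef>
#include <functional>
#include <iostream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Sketch only: hash every k-word shingle of a document into a set.
std::set<std::size_t> shingleHashes(const std::string &text, std::size_t k = 4) {
    std::vector<std::string> words;
    std::istringstream in(text);
    for (std::string w; in >> w; ) words.push_back(w);
    std::set<std::size_t> hashes;
    for (std::size_t i = 0; words.size() >= k && i + k <= words.size(); ++i) {
        std::string shingle;
        for (std::size_t j = 0; j < k; ++j) shingle += words[i + j] + " ";
        hashes.insert(std::hash<std::string>()(shingle));
    }
    return hashes;
}

// Jaccard similarity of two shingle-hash sets: 1.0 for exact duplicates.
double jaccard(const std::set<std::size_t> &a, const std::set<std::size_t> &b) {
    if (a.empty() && b.empty()) return 1.0;
    std::size_t common = 0;
    for (std::size_t h : a) common += b.count(h);
    return double(common) / double(a.size() + b.size() - common);
}

int main() {
    auto a = shingleHashes("the quick brown fox jumps over the lazy dog again");
    auto b = shingleHashes("the quick brown fox jumps over the lazy cat again");
    std::cout << jaccard(a, a) << " " << jaccard(a, b) << "\n";
}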
+- Heritrix is close to Mercator, so we could later migrate easily to
+  my crawler, as it is also mainly based on Mercator
+- nanomsg between components to synchronize behaviour (see the pipeline
+  sketch below)? Centralized data structures should be lock-free or
+  lock-light and again be physically distributable
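A minimal sketch of passing URLs between components over nanomsg's push/pull pipeline; the endpoint address and the raw-string message format are assumptions, only the nn_* calls themselves are the actual nanomsg C API (link with -lnanomsg).

#include <nanomsg/nn.h>
#include <nanomsg/pipeline.h>
#include <cstring>
#include <iostream>

// Sketch only: run with "push" to send one URL, without arguments to receive.
// In a real component the sockets would stay open for the process lifetime.
int main(int argc, char **argv) {
    bool producer = (argc > 1 && std::strcmp(argv[1], "push") == 0);

    if (producer) {
        int s = nn_socket(AF_SP, NN_PUSH);
        nn_connect(s, "tcp://127.0.0.1:5555");
        const char *url = "http://example.com/";
        nn_send(s, url, std::strlen(url) + 1, 0);
        nn_close(s);
    } else {
        int s = nn_socket(AF_SP, NN_PULL);
        nn_bind(s, "tcp://127.0.0.1:5555");
        char *buf = nullptr;
        int n = nn_recv(s, &buf, NN_MSG, 0);  // NN_MSG: nanomsg allocates the buffer
        if (n >= 0) {
            std::cout << "got URL: " << buf << "\n";
            nn_freemsg(buf);
        }
        nn_close(s);
    }
    return 0;
}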
+- crawling of local pages in the same download process
+- we should avoid a multi-threaded approach, as it tends to get
+  complicated too soon. How are multiple fetchers done then?
+- forgotten components:
+  - URLPrioritizer: picks the place of detected or rechecked URLs
+    in the frontier
+    - or should the frontier do it itself?
+    - should the host netiquette also be put in here?
+  - URLDistributor: this component decides the policy for exchanging
+    URLs with other processes (see the sketch below)
+    - random distribution
+    - keep links within a host local to that process
+
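Finally, a sketch of a URLDistributor policy that keeps a host's links on the local process and forwards everything else, using hash(host) modulo the number of processes; the struct and its fields are invented for illustration.

#include <cstddef>
#include <functional>
#include <iostream>
#include <string>

// Sketch only: decide per host whether a URL stays local or is handed to
// another crawler process.
struct UrlDistributor {
    std::size_t self;        // id of this crawler process
    std::size_t processes;   // total number of processes

    std::size_t ownerOf(const std::string &host) const {
        return std::hash<std::string>()(host) % processes;
    }
    // true: enqueue locally; false: forward the URL to process ownerOf(host)
    bool keepLocal(const std::string &host) const {
        return ownerOf(host) == self;
    }
};

int main() {
    UrlDistributor d{0, 4};
    for (const std::string host : {"example.com", "example.org", "example.net"}) {
        std::cout << host << " -> "
                  << (d.keepLocal(host) ? std::string("local")
                                        : "process " + std::to_string(d.ownerOf(host)))
                  << "\n";
    }
}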