author | Andreas Baumann <mail@andreasbaumann.cc> | 2014-10-15 09:52:50 +0200
committer | Andreas Baumann <mail@andreasbaumann.cc> | 2014-10-15 09:52:50 +0200
commit | e2f8ce69fdaf9da19eb0adb9595d8daa18668137 (patch)
tree | 7939c11fcb6d173d0d57671ec7add5aaf1580796
parent | 70caaffb21a0830911c8c281a16324ac0af64061 (diff)
parent | d8cef8971717d689fb206bfb3e4df6d9ae96d823 (diff)
Merge branch 'master' of ssh://andreasbaumann.dyndns.org:2222/crawler
-rwxr-xr-x | TODOS | 60
1 file changed, 60 insertions(+), 0 deletions(-)
@@ -18,3 +18,63 @@
 - SpoolRewindInputStream
   - thread-safe name of spool file, not /tmp/spool.tmp
   - Windows and Linux should respect %TEMP% resp. $TEMPDIR
+- architectural premises
+  - avoid disk; memory is abundant nowadays, rather distribute.
+    We can also distribute easily on a cluster; disk and network
+    are not so simple to distribute in clouds. Latency is a big
+    issue in clouds.
+- urlseen
+  - in-memory set<URL> implementation
+  - bloom filter implementation
+  - keep mostly in memory and swap out portions not needed (IRLBot)
+  - a database implementation is too slow
+  - scalable implementation
+    - FIFOs hit single hosts, not good
+  - for recrawling we must also support deletion
+- robots.txt
+  - key-value storage, shared between instances (can be organized
+    in the classical distributed linear hashing way)
+- DNS
+  - DNS cache sharing (how, when several implementations are used
+    in the modules?)
+- frontier
+  - centralized vs. decentralized
+    - again, the hash(host) linear hashing approach should work
+  - the frontier should keep the URLs (so we can even compress
+    parts of the frontier if needed); it should be kept apart
+    from the crawl data per URL.
+  - tricky question: can a centralized DB be better partitioned
+    with database means (partitioning function), or is it better
+    to use dedicated URL servers?
+  - consistent hashing: avoid reorganization hash shifts (UbiCrawler);
+    what matters is that a distributed downloader can "jump in" for
+    another one that is on a down machine. Maybe in a modern cloud
+    environment it's more important to continue where another one
+    left off and to make sure to avoid zombies.
+  - multiple queues in the frontier implementing policies
+    - discrete priority levels with a queue for each. URLs are
+      moved between levels (pro- and demotion). Low-priority queues
+      can choose slower, more disk-based storage; high-priority
+      queues should be kept in memory. Picking URLs is simple this way.
+- duplicate detection
+  - easy, based on hashes of shingles as we know it
+- near-duplicate detection
+  - still sort of a challenge, as the problem is not symmetrical
+- Heritrix is close to Mercator, so we could later migrate easily
+  to my crawler, as it is also mainly based on Mercator
+- nanomsg between components to synchronize behaviour? centralized
+  data structures should be lock-free or lock-light and again be
+  physically distributable
+- crawl local pages in the same download process
+- we should avoid a multi-threaded approach, as it tends to get
+  complicated too soon. How are multiple fetchers done then?
+- forgotten components:
+  - URLPrioritizer: picks the place of detected or rechecked URLs
+    in the frontier.
+    - Or should the frontier do it itself?
+    - Should the host netiquette also be put in here?
+  - URLDistributor: this component decides the policy of exchanging
+    URLs with other processes.
+    - distribute randomly
+    - keep links local to a host on the local process