- use traits in RewindInputStream, alternative wrappers for char/string
traits depending on the underlying IO stream
- spooling in RIS:
- thread-safe tempnames
- Windows, respect environment variables like TEMP
- module loader
- use weak pointers to create objects (pointers which don't transfer
ownership but act more as a handle). Avoids funny trouble with
objects originating from modules (especially DLLs on Windows)
- ctor parameter types must be part of the register function signature
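The two module-loader points above can be sketched together: the registry owns created objects and hands out only weak handles, so nothing created inside a DLL outlives its module, and the ctor parameter types appear in the registered factory signature. All names (`Module`, `ModuleRegistry`, `HtmlParser`) are illustrative assumptions, not existing project code.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical module base class; names are illustrative only.
struct Module {
    virtual ~Module() = default;
    virtual std::string name() const = 0;
};

class ModuleRegistry {
public:
    // The ctor parameter types are part of the registered factory
    // signature, so module creation stays type-safe.
    using Factory = std::function<std::shared_ptr<Module>(const std::string& config)>;

    void registerFactory(const std::string& type, Factory f) {
        factories_[type] = std::move(f);
    }

    // The registry keeps ownership; callers only get a weak handle,
    // so objects from a DLL never outlive the registry's shutdown.
    std::weak_ptr<Module> create(const std::string& type, const std::string& config) {
        auto obj = factories_.at(type)(config);
        instances_.push_back(obj);
        return obj;
    }

    void shutdown() { instances_.clear(); }  // all weak handles expire here

private:
    std::map<std::string, Factory> factories_;
    std::vector<std::shared_ptr<Module>> instances_;
};

// Example module a DLL might register.
struct HtmlParser : Module {
    explicit HtmlParser(std::string cfg) : cfg_(std::move(cfg)) {}
    std::string name() const override { return "HtmlParser(" + cfg_ + ")"; }
    std::string cfg_;
};
```

Calling `shutdown()` before unloading the module invalidates every handle at once, which is exactly the behaviour wanted for Windows DLLs.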
- type detection
- content based type detection on Windows
- port of libmagic?
- something from Microsoft (around the index service)?
- robots.txt
- handle Sitemap
- Parse URLs from sitemaps
- SpoolRewindInputStream
- thread-safe name of the spool file, not /tmp/spool.tmp
- Windows and Linux should respect %TEMP% resp. $TMPDIR
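A minimal POSIX sketch of the race-free spool-file creation meant above: `mkstemp()` atomically creates and opens a unique file, so concurrent threads never collide the way a fixed /tmp/spool.tmp would. On Windows one would use `GetTempPath`/`GetTempFileName` instead; the function name is an assumption.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <unistd.h>
#include <vector>

// Create a unique, already-open spool file; honours $TMPDIR, falls
// back to /tmp. mkstemp() replaces the XXXXXX atomically (O_EXCL),
// so two crawler threads can never race on the same name.
std::string makeSpoolFile(int& fd) {
    const char* dir = std::getenv("TMPDIR");
    std::string templ = std::string(dir ? dir : "/tmp") + "/spool.XXXXXX";
    std::vector<char> buf(templ.begin(), templ.end());
    buf.push_back('\0');
    fd = mkstemp(buf.data());   // -1 on failure, open fd otherwise
    return std::string(buf.data());
}
```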
- architectural premises
- avoid disk, memory is available in abundance nowadays; rather
distribute. We can also distribute easily on a cluster, while disk
and network are not so simple to distribute in clouds. Latency
is a big issue in clouds.
- urlseen
- in-memory variant: a set<URL> implementation
- bloom filter implementation
- keep mostly in memory and swap out portions not needed (IRLBot)
- database implementation is too slow
- scalable implementation
- FIFOs hit single hosts, not good
- for recrawling we must also support deletion
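The Bloom filter variant mentioned above could look like this minimal sketch: a fixed bit array with k derived hashes. False positives are possible (an unseen URL may be skipped), false negatives are not; note that a plain Bloom filter cannot delete, which matters for the recrawling point — a counting filter would be needed there. All sizes are illustrative.

```cpp
#include <bitset>
#include <cassert>
#include <functional>
#include <string>

// Minimal Bloom filter sketch for urlseen.
class BloomFilter {
public:
    void add(const std::string& url) {
        for (std::size_t i = 0; i < kHashes; ++i)
            bits_.set(index(url, i));
    }
    bool mayContain(const std::string& url) const {
        for (std::size_t i = 0; i < kHashes; ++i)
            if (!bits_.test(index(url, i))) return false;
        return true;
    }
private:
    static constexpr std::size_t kBits = 1 << 20;  // ~1M bits, illustrative
    static constexpr std::size_t kHashes = 4;
    // Derive k hashes from two base hashes (Kirsch-Mitzenmacher trick).
    static std::size_t index(const std::string& url, std::size_t i) {
        std::size_t h1 = std::hash<std::string>{}(url);
        std::size_t h2 = std::hash<std::string>{}(url + "#");
        return (h1 + i * h2) % kBits;
    }
    std::bitset<kBits> bits_;
};
```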
- robots.txt
- key value storage, shared between instances (can be organized
in classical distributed linear hashing way)
- DNS
- DNS cache sharing (how, when several implementations are used
in the modules?)
- frontier
- centralized vs. decentralized
- again using the hash(host) linearhash approach should work
- the frontier should keep the URLs (so we can even compress
parts of the frontier if needed); it should be kept apart
from the per-URL crawl data.
- tricky question, can a centralized DB be better partitioned
with database means (partitioning function) or is it better
to use dedicated URL servers?
- consistent hashing: avoid reorganization hash shifts (UbiCrawler);
important is that a distributed downloader can "jump in" for
another one on a down machine. In a modern cloud environment
it's maybe more important to continue where another one
left off and make sure to avoid zombies.
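A sketch of the consistent-hash ring meant above: hosts map to the first ring point clockwise from hash(host), so adding or removing a downloader only remaps the hosts that fell on it, and a live node can take over a down machine's share without a global reshuffle. Node names and the virtual-node count are assumptions.

```cpp
#include <cassert>
#include <functional>
#include <iterator>
#include <map>
#include <string>

// Consistent-hash ring (as in UbiCrawler-style crawler partitioning).
class HashRing {
public:
    void addNode(const std::string& node, int vnodes = 64) {
        // Virtual nodes smooth the load distribution.
        for (int i = 0; i < vnodes; ++i)
            ring_[hash(node + "#" + std::to_string(i))] = node;
    }
    void removeNode(const std::string& node) {
        for (auto it = ring_.begin(); it != ring_.end();)
            it = (it->second == node) ? ring_.erase(it) : std::next(it);
    }
    // First ring point clockwise from hash(host), wrapping around.
    const std::string& nodeFor(const std::string& host) const {
        auto it = ring_.lower_bound(hash(host));
        if (it == ring_.end()) it = ring_.begin();
        return it->second;
    }
private:
    static std::size_t hash(const std::string& s) {
        return std::hash<std::string>{}(s);
    }
    std::map<std::size_t, std::string> ring_;
};
```

The stability property is what matters for the "jump-in" scenario: removing a node a host does not map to leaves that host's assignment unchanged.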
- multiple queues in the frontier implementing policies
- discrete priority levels with a queue for each. URLs are
moved between levels (pro- and demotion). Low-priority queues
can use slower, disk-based storage; high-priority
queues should be kept in memory. Picking URLs is simple this way.
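The discrete-priority scheme above, as a minimal sketch: one FIFO per level, demotion moves a URL one level down, and picking scans from the highest level. In a real crawler the low levels would be disk-backed; the level count and class name are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Frontier with discrete priority levels; level 0 is highest.
class Frontier {
public:
    explicit Frontier(std::size_t levels = 4) : queues_(levels) {}

    void push(const std::string& url, std::size_t level) {
        queues_.at(level).push_back(url);
    }
    // Pop from the highest-priority non-empty queue.
    bool pop(std::string& url) {
        for (auto& q : queues_) {
            if (!q.empty()) { url = q.front(); q.pop_front(); return true; }
        }
        return false;
    }
    // Demotion: re-insert one level lower (e.g. after a slow fetch).
    void demote(const std::string& url, std::size_t level) {
        push(url, std::min(level + 1, queues_.size() - 1));
    }
private:
    std::vector<std::deque<std::string>> queues_;
};
```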
- duplicate detection
- easy, based on hashes of shingles as we know it
- near duplicate detection
- still sort of a challenge as the problem is not symmetrical
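The shingle-hash approach from the two points above, sketched minimally: documents become sets of hashed word k-shingles; exact duplicates have identical sets, near-duplicates a high Jaccard overlap. (A production system would use MinHash sketches instead of full sets; k = 3 and the function names are assumptions.)

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hash every k-word shingle of a document.
std::set<std::size_t> shingleHashes(const std::string& text, std::size_t k = 3) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string w;
    while (in >> w) words.push_back(w);
    std::set<std::size_t> hashes;
    if (words.size() < k) return hashes;
    for (std::size_t i = 0; i + k <= words.size(); ++i) {
        std::string shingle;
        for (std::size_t j = 0; j < k; ++j) shingle += words[i + j] + " ";
        hashes.insert(std::hash<std::string>{}(shingle));
    }
    return hashes;
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|; 1.0 means exact duplicate.
double jaccard(const std::set<std::size_t>& a, const std::set<std::size_t>& b) {
    if (a.empty() && b.empty()) return 1.0;
    std::size_t inter = 0;
    for (auto h : a) inter += b.count(h);
    return double(inter) / double(a.size() + b.size() - inter);
}
```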
- Heritrix is close to Mercator, so we could later migrate easily
to my crawler, as it is also mainly based on Mercator
- nanomsg between components to synchronize behaviour? centralized
data structures should be lock-free or lock-light and be
distributable physically again
- crawling local pages on same download process
- we should avoid a multi-threaded approach as it tends to get
complicated too soon. But how are multiple fetchers implemented then?
- forgotten components:
- URLPrioritizer: picks the place of detected or rechecked URLs
in the frontier.
- Or should the frontier do it itself?
- Should the host netiquette also be put in here?
- URLDistributor: this component decides the policy for exchanging
URLs with other processes.
- random distribute
- keep local links of host local