- use traits in RewindInputStream, alternative wrappers for char/string
traits depending on the underlying IO stream
- spooling in RIS:
- thread-safe tempnames
- Windows, respect environment variables like TEMP
- module loader
- use weak pointers to create objects (pointers which don't transfer
ownership but act more as a handle). Avoids funny trouble with
objects originating from modules (especially DLLs on Windows)
- ctor parameter types must be part of the register function signature
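The two module-loader points above can be sketched together: the registry owns created objects and hands out only weak handles, so nothing created inside a DLL outlives its module, and the ctor parameter types appear in the registered factory signature. All names (`Module`, `ModuleRegistry`, `HtmlParser`) are illustrative assumptions, not existing project code.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical module base class; names are illustrative only.
struct Module {
    virtual ~Module() = default;
    virtual std::string name() const = 0;
};

class ModuleRegistry {
public:
    // The ctor parameter types are part of the registered factory
    // signature, so module creation stays type-safe.
    using Factory = std::function<std::shared_ptr<Module>(const std::string& config)>;

    void registerFactory(const std::string& type, Factory f) {
        factories_[type] = std::move(f);
    }

    // The registry keeps ownership; callers only get a weak handle,
    // so objects from a DLL never outlive the registry's shutdown.
    std::weak_ptr<Module> create(const std::string& type, const std::string& config) {
        auto obj = factories_.at(type)(config);
        instances_.push_back(obj);
        return obj;
    }

    void shutdown() { instances_.clear(); }  // all weak handles expire here

private:
    std::map<std::string, Factory> factories_;
    std::vector<std::shared_ptr<Module>> instances_;
};

// Example module a DLL might register.
struct HtmlParser : Module {
    explicit HtmlParser(std::string cfg) : cfg_(std::move(cfg)) {}
    std::string name() const override { return "HtmlParser(" + cfg_ + ")"; }
    std::string cfg_;
};
```

Calling `shutdown()` before unloading the module invalidates every handle at once, which is exactly the behaviour wanted for Windows DLLs.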
- type detection
- content based type detection on Windows
- port of libmagic?
- something from Microsoft (around the index service)?
- robots.txt
- handle Sitemap
- Parse URLs from sitemaps
- SpoolRewindInputStream
- thread-safe name of the spool file, not /tmp/spool.tmp
- Windows and Linux should respect %TEMP% resp. $TMPDIR
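A minimal POSIX sketch of the race-free spool-file creation meant above: `mkstemp()` atomically creates and opens a unique file, so concurrent threads never collide the way a fixed /tmp/spool.tmp would. On Windows one would use `GetTempPath`/`GetTempFileName` instead; the function name is an assumption.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <unistd.h>
#include <vector>

// Create a unique, already-open spool file; honours $TMPDIR, falls
// back to /tmp. mkstemp() replaces the XXXXXX atomically (O_EXCL),
// so two crawler threads can never race on the same name.
std::string makeSpoolFile(int& fd) {
    const char* dir = std::getenv("TMPDIR");
    std::string templ = std::string(dir ? dir : "/tmp") + "/spool.XXXXXX";
    std::vector<char> buf(templ.begin(), templ.end());
    buf.push_back('\0');
    fd = mkstemp(buf.data());   // -1 on failure, open fd otherwise
    return std::string(buf.data());
}
```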
- architectural premises
- avoid disk, memory is available in abundance nowadays; rather
distribute. We can also distribute easily on a cluster, while disk
and network are not so simple to distribute in clouds. Latency
is a big issue in clouds.
- urlseen
- in-memory variant: a set<URL> implementation
- bloom filter implementation
- keep mostly in memory and swap out portions not needed (IRLBot)
- database implementation is too slow
- scalable implementation
- FIFOs hit single hosts, not good
- for recrawling we must also support deletion
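The Bloom filter variant mentioned above could look like this minimal sketch: a fixed bit array with k derived hashes. False positives are possible (an unseen URL may be skipped), false negatives are not; note that a plain Bloom filter cannot delete, which matters for the recrawling point — a counting filter would be needed there. All sizes are illustrative.

```cpp
#include <bitset>
#include <cassert>
#include <functional>
#include <string>

// Minimal Bloom filter sketch for urlseen.
class BloomFilter {
public:
    void add(const std::string& url) {
        for (std::size_t i = 0; i < kHashes; ++i)
            bits_.set(index(url, i));
    }
    bool mayContain(const std::string& url) const {
        for (std::size_t i = 0; i < kHashes; ++i)
            if (!bits_.test(index(url, i))) return false;
        return true;
    }
private:
    static constexpr std::size_t kBits = 1 << 20;  // ~1M bits, illustrative
    static constexpr std::size_t kHashes = 4;
    // Derive k hashes from two base hashes (Kirsch-Mitzenmacher trick).
    static std::size_t index(const std::string& url, std::size_t i) {
        std::size_t h1 = std::hash<std::string>{}(url);
        std::size_t h2 = std::hash<std::string>{}(url + "#");
        return (h1 + i * h2) % kBits;
    }
    std::bitset<kBits> bits_;
};
```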
- robots.txt
- key value storage, shared between instances (can be organized
in classical distributed linear hashing way)
- DNS
- DNS cache sharing (how, when several implementations are used
in the modules?)
- frontier
- centralized vs. decentralized
- again using the hash(host) linearhash approach should work
- the frontier should keep the URLs (so we can even compress
parts of the frontier if needed); it should be kept apart
from the per-URL crawl data.
- tricky question, can a centralized DB be better partitioned
with database means (partitioning function) or is it better
to use dedicated URL servers?
- consistent hashing: avoid reorganization hash shifts (UbiCrawler);
important is that a distributed downloader can "jump in" for
another one on a down machine. In a modern cloud environment
it's maybe more important to continue where another one
left off and make sure to avoid zombies.
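A sketch of the consistent-hash ring meant above: hosts map to the first ring point clockwise from hash(host), so adding or removing a downloader only remaps the hosts that fell on it, and a live node can take over a down machine's share without a global reshuffle. Node names and the virtual-node count are assumptions.

```cpp
#include <cassert>
#include <functional>
#include <iterator>
#include <map>
#include <string>

// Consistent-hash ring (as in UbiCrawler-style crawler partitioning).
class HashRing {
public:
    void addNode(const std::string& node, int vnodes = 64) {
        // Virtual nodes smooth the load distribution.
        for (int i = 0; i < vnodes; ++i)
            ring_[hash(node + "#" + std::to_string(i))] = node;
    }
    void removeNode(const std::string& node) {
        for (auto it = ring_.begin(); it != ring_.end();)
            it = (it->second == node) ? ring_.erase(it) : std::next(it);
    }
    // First ring point clockwise from hash(host), wrapping around.
    const std::string& nodeFor(const std::string& host) const {
        auto it = ring_.lower_bound(hash(host));
        if (it == ring_.end()) it = ring_.begin();
        return it->second;
    }
private:
    static std::size_t hash(const std::string& s) {
        return std::hash<std::string>{}(s);
    }
    std::map<std::size_t, std::string> ring_;
};
```

The stability property is what matters for the "jump-in" scenario: removing a node a host does not map to leaves that host's assignment unchanged.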
- multiple queues in the frontier implementing policies
- discrete priority levels with a queue for each. URLs are
moved between levels (pro- and demotion). Low-priority queues
can use slower, disk-based storage; high-priority
queues should be kept in memory. Picking URLs is simple this way.
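The discrete-priority scheme above, as a minimal sketch: one FIFO per level, demotion moves a URL one level down, and picking scans from the highest level. In a real crawler the low levels would be disk-backed; the level count and class name are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Frontier with discrete priority levels; level 0 is highest.
class Frontier {
public:
    explicit Frontier(std::size_t levels = 4) : queues_(levels) {}

    void push(const std::string& url, std::size_t level) {
        queues_.at(level).push_back(url);
    }
    // Pop from the highest-priority non-empty queue.
    bool pop(std::string& url) {
        for (auto& q : queues_) {
            if (!q.empty()) { url = q.front(); q.pop_front(); return true; }
        }
        return false;
    }
    // Demotion: re-insert one level lower (e.g. after a slow fetch).
    void demote(const std::string& url, std::size_t level) {
        push(url, std::min(level + 1, queues_.size() - 1));
    }
private:
    std::vector<std::deque<std::string>> queues_;
};
```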
- duplicate detection
- easy, based on hashes of shingles as we know it
- near duplicate detection
- still sort of a challenge as the problem is not symmetrical
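The shingle-hash approach from the two points above, sketched minimally: documents become sets of hashed word k-shingles; exact duplicates have identical sets, near-duplicates a high Jaccard overlap. (A production system would use MinHash sketches instead of full sets; k = 3 and the function names are assumptions.)

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hash every k-word shingle of a document.
std::set<std::size_t> shingleHashes(const std::string& text, std::size_t k = 3) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string w;
    while (in >> w) words.push_back(w);
    std::set<std::size_t> hashes;
    if (words.size() < k) return hashes;
    for (std::size_t i = 0; i + k <= words.size(); ++i) {
        std::string shingle;
        for (std::size_t j = 0; j < k; ++j) shingle += words[i + j] + " ";
        hashes.insert(std::hash<std::string>{}(shingle));
    }
    return hashes;
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|; 1.0 means exact duplicate.
double jaccard(const std::set<std::size_t>& a, const std::set<std::size_t>& b) {
    if (a.empty() && b.empty()) return 1.0;
    std::size_t inter = 0;
    for (auto h : a) inter += b.count(h);
    return double(inter) / double(a.size() + b.size() - inter);
}
```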
- Heritrix is close to Mercator, so we could later migrate easily
to my crawler, as it is also mainly based on Mercator
- nanomsg between components to synchronize behaviour? centralized
data structures should be lock-free or lock-light and be
distributable physically again
- crawling local pages on same download process
- we should avoid a multi-threaded approach as it tends to get
complicated too soon. But how are multiple fetchers implemented then?
- forgotten components:
- URLPrioritizer: picks the place of detected or rechecked URLs
in the frontier.
- Or should the frontier do it itself?
- Should the host netiquette also be put in here?
- URLDistributor: this component decides the policy for exchanging
URLs with other processes.
- random distribute
- keep local links of host local