Internal documentation for developers
-------------------------------------

   Coding guidelines
   Design decisions
      Storage of binary data
      Directory tree in database
      Transaction Policies
      Self-containment
   Testing
   References

Coding guidelines
-----------------

  8-character TABs.
    
  Standard C89, portable to all platforms supporting FUSE (Linux,
  FreeBSD, OpenBSD, NetBSD, Darwin/MacOSX). No C++, no C99. Some
  newer 64-bit things like 'endian.h' are allowed, if there is no
  other option.
    
  Do not introduce unnecessary 3rd party dependencies in addition
  to the required 'libfuse' and 'libpq'.
  
  Use the internal fuse_opts to parse command line options.
    
  Use native 'libpq', not abstractions. The database operations
  are simple enough. Where possible, avoid string manipulation,
  for example for timestamps (we are on a low-level OS abstraction
  layer, so 'struct timespec' and epochs are fine).
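
  As an illustration of epoch arithmetic without string handling, here
  is a minimal sketch. It assumes timestamps travel to and from the
  database as 64-bit microsecond counts relative to 2000-01-01 (the
  PostgreSQL epoch); whether pgfuse transfers them this way is an
  assumption. The function names are hypothetical, and 'stdint.h' is
  one of the newer 64-bit things that C89 itself does not provide.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Seconds from the Unix epoch (1970-01-01) to 2000-01-01. */
#define PG_EPOCH_OFFSET_SECS 946684800

/* Hypothetical helper: timespec to microseconds since 2000-01-01. */
int64_t timespec_to_pg_usecs(const struct timespec *ts)
{
	assert(ts->tv_nsec >= 0 && ts->tv_nsec < 1000000000L);
	return ((int64_t)ts->tv_sec - PG_EPOCH_OFFSET_SECS) * 1000000
	       + ts->tv_nsec / 1000;
}

/* Hypothetical helper: the reverse conversion. */
struct timespec pg_usecs_to_timespec(int64_t usecs)
{
	struct timespec ts;
	int64_t secs = usecs / 1000000;
	int64_t rem = usecs % 1000000;

	if (rem < 0) {		/* keep the sub-second part non-negative */
		rem += 1000000;
		secs -= 1;
	}
	ts.tv_sec = (time_t)(secs + PG_EPOCH_OFFSET_SECS);
	ts.tv_nsec = (long)(rem * 1000);
	return ts;
}
```

  Sub-microsecond precision is dropped in this sketch.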
          
Design decisions
----------------

Storage of binary data
----------------------

Options:

One ByteA field

  All data in one big bytea column: needs memory of the size of the
  complete file on both the client (pgfuse) and the server
  (PostgreSQL) side. This is fine for small files (it was the first
  proof-of-concept implementation).

Multiple ByteA of equal size

  As in Mysqlfs, simulate blocks as bytea fields of fixed size with
  a block number. The block size has to be tuned carefully against
  file system, PostgreSQL and FUSE parameters.
  
  This should give good average performance; small files that fit
  into a single block are still handled as efficiently as in the
  "One ByteA field" variant.
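
  The block arithmetic behind this variant can be sketched as follows.
  The block size of 4096 and all names are hypothetical; real code
  would use 'off_t' and the tuned block size.

```c
#include <assert.h>

/* Hypothetical block size; the real value needs tuning. */
#define EXAMPLE_BLOCK_SIZE 4096UL

struct block_span {
	unsigned long first_block;	/* number of the first block touched */
	unsigned long last_block;	/* number of the last block touched */
	unsigned long first_offset;	/* byte offset inside the first block */
};

/* Map the byte range [offset, offset + len) of a file to block numbers. */
struct block_span block_span_for(unsigned long offset, unsigned long len)
{
	struct block_span s;

	assert(len > 0);
	s.first_block = offset / EXAMPLE_BLOCK_SIZE;
	s.first_offset = offset % EXAMPLE_BLOCK_SIZE;
	s.last_block = (offset + len - 1) / EXAMPLE_BLOCK_SIZE;
	return s;
}
```

  A read or write then only has to fetch or update the rows whose
  block number lies between first_block and last_block.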

Blobs

  They are streamable, but we lack some security (before PostgreSQL
  9.0 they were publicly readable) and we lack referential
  integrity.
  
  The functions to manipulate the blobs are not so nice.
  
  It's also questionable whether they could be faster than a bytea.

Some unsorted thoughts:

Streams are mere abstractions and not really needed from the database
interface.

COPY FROM and COPY TO as a fast, non-transactional mode?

Pad the data blocks or not? Or pad all but the last one, allowing
very small files to be stored efficiently?

How to tune the block sizes? What factors influence the experiment?
At the moment we store padded blocks of fixed size (STANDARD_BLOCK_SIZE),
not really sure if that is good or bad.

Directory tree in database
--------------------------

Every inode has a self-referencing 'parent_id'. To avoid a NULL
reference, there is a root element pointing to itself. Paths are
decomposed programmatically by descending the tree. This requires no
additional redundant storage, is easy to change on renames, and gives
acceptable read performance.
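
The descent can be sketched as below. The in-memory table stands in
for the database; in practice each lookup would be a query on
('parent_id', name). All names and the sample entries are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct dir_entry {
	int id;
	int parent_id;		/* the root entry refers to itself */
	const char *name;
};

/* Hypothetical in-memory stand-in for the directory table. */
const struct dir_entry example_tree[] = {
	{ 0, 0, "/" },
	{ 1, 0, "home" },
	{ 2, 1, "user" },
	{ 3, 2, "notes.txt" },
	{ 4, 0, "user" }
};
#define EXAMPLE_TREE_LEN (sizeof(example_tree) / sizeof(example_tree[0]))

/* Find the child with the given name below parent; -1 if missing. */
int lookup_child(int parent, const char *name, size_t len)
{
	size_t i;

	for (i = 0; i < EXAMPLE_TREE_LEN; i++) {
		const struct dir_entry *e = &example_tree[i];

		if (e->id != e->parent_id	/* skip the root itself */
		    && e->parent_id == parent
		    && strlen(e->name) == len
		    && strncmp(e->name, name, len) == 0)
			return e->id;
	}
	return -1;
}

/* Resolve a path to an id by descending from the root. */
int resolve_path(const char *path)
{
	int id = 0;		/* id of the root entry */

	assert(path != NULL);
	while (*path != '\0') {
		size_t len;

		if (*path == '/') {	/* skip separators */
			path++;
			continue;
		}
		len = strcspn(path, "/");
		id = lookup_child(id, path, len);
		if (id < 0)
			return -1;
		path += len;
	}
	return id;
}
```

A rename only changes 'parent_id' and the name of a single entry; no
stored path strings need rewriting.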

Transaction Policies
--------------------

Fundamental question: What file operations should form a database transaction?
  
One extreme: isolate threads and put all file operations in one
transaction. This is most likely an illusion, as we can't assume that
FUSE threads are bound to a specific file (on the contrary!).

Other extreme: "autocommit" (every write, every read, etc.). This
allows for parallel usage. We trust in FUSE and the locking there. The
database should help us serialize the operations.

Currently the second option is used.
  
Self-containment
----------------

React gracefully to the loss of database connections. Try to
re-establish the connection, since the loss could be temporary, and
then re-execute the file system operation.

What should be reported back to FUSE as a permanent error state when
the database does not reappear within a certain timeout?
EIO seems a good option (as if the disk had temporary I/O
problems).
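
A minimal sketch of such a retry policy follows. It uses a bounded
number of attempts in place of a real timeout, and callbacks in place
of the database layer; the names and the fake database used to
exercise it are hypothetical. Real code would sleep between attempts
and compare against a deadline.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical callbacks standing in for the database layer. */
typedef int (*fs_op_fn)(void *ctx);	/* 0 on success, -1 on lost connection */
typedef int (*reconnect_fn)(void *ctx);	/* 0 if the connection is back */

/* Run op; on failure try to reconnect and run it again. Give up with
 * -EIO once max_attempts executions have failed. */
int run_with_reconnect(fs_op_fn op, reconnect_fn reconnect, void *ctx,
		       int max_attempts)
{
	int attempt;

	assert(max_attempts > 0);
	for (attempt = 0; attempt < max_attempts; attempt++) {
		if (op(ctx) == 0)
			return 0;
		(void)reconnect(ctx);	/* the next op() shows if it worked */
	}
	return -EIO;
}

/* Fake database used only to exercise the sketch. */
struct fake_db {
	int outages_left;	/* reconnect attempts that will still fail */
	int connected;
	int calls;		/* how often the operation was executed */
};

int fake_op(void *ctx)
{
	struct fake_db *db = (struct fake_db *)ctx;

	db->calls++;
	return db->connected ? 0 : -1;
}

int fake_reconnect(void *ctx)
{
	struct fake_db *db = (struct fake_db *)ctx;

	if (db->outages_left > 0) {
		db->outages_left--;
		return -1;
	}
	db->connected = 1;
	return 0;
}
```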

Testing
-------

The makefile contains some basic functionality tests (mostly using
shell commands).

bonnie is a good stress and performance tester. Don't despair over
poor performance; that's normal. :-)

Another option is to have many shells open and do some things in
parallel, like:

while true; do
	mkdir mnt/bla
	rmdir mnt/bla
done

We should actually write a filesystem operation simulator doing all
kinds of random weird operations with certain probabilities.
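
The core of such a simulator, choosing the next operation by weight,
could look like the sketch below. The set of operations and the
weights are hypothetical; no such simulator exists yet.

```c
#include <assert.h>
#include <stdlib.h>

enum sim_op { SIM_MKDIR, SIM_RMDIR, SIM_WRITE, SIM_READ, SIM_OP_COUNT };

/* Hypothetical weights in percent; they sum to SIM_WEIGHT_TOTAL. */
#define SIM_WEIGHT_TOTAL 100
const int sim_weights[SIM_OP_COUNT] = { 10, 10, 40, 40 };

/* Map a roll in [0, SIM_WEIGHT_TOTAL) to an operation. */
enum sim_op sim_pick(int roll)
{
	int op;

	assert(roll >= 0 && roll < SIM_WEIGHT_TOTAL);
	for (op = 0; op < SIM_OP_COUNT - 1; op++) {
		if (roll < sim_weights[op])
			break;
		roll -= sim_weights[op];
	}
	return (enum sim_op)op;
}

/* Draw the next operation at random; seed with srand() beforehand. */
enum sim_op sim_next(void)
{
	return sim_pick(rand() % SIM_WEIGHT_TOTAL);
}
```

Keeping the roll separate from the random source makes a failing run
reproducible from a recorded sequence of rolls.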

References
----------

Good FUSE tutorials at:

http://www.cs.hmc.edu/~geoff/classes/hmc.cs135.201109/homework/fuse/fuse_doc.html
http://www.cs.nmsu.edu/~pfeiffer/fuse-tutorial/