% Introduction to the Global Data Plane

For any questions about this document, send comments to:

- Nitesh Mor, U.C. Berkeley Ubiquitous Swarm Lab, <mor@eecs.berkeley.edu>
- Eric Allman, U.C. Berkeley Ubiquitous Swarm Lab, <eric@cs.berkeley.edu>

---

This is a brief, high-level introduction to the Global Data Plane, primarily for
users wishing to incorporate GDP in their existing applications. It is a
living document, subject to change. In particular, we are planning some
major new functionality that will partly change the interface.

# Introduction

Let's imagine we have a tiny little sensor that measures temperature
periodically and spits out the current temperature. To make this sensor *smart*,
we add some communication capabilities to it. But how does one use the data
generated by this sensor?

To preserve the data generated by this sensor, let's put the sensor values in a
log file. A new temperature reading gets appended to this file as the sensor
generates new data. With the log-file approach, we suddenly get quite a few new
capabilities--a user can look at the log file to see historical values, query it
for the latest reading, and so on--all this without adding any functionality to
the sensor itself. As a matter of fact, one doesn't even need to know about the
real sensor anymore; in a way, the log file virtualizes the sensor.

Now, let's bring an actuator (a light bulb, for example) into the picture. The
actuation happens when the actuator receives a certain command over some
communication channel. Let's apply the log-file approach to the actuator
too--the actuator is *subscribed* to a log file (subscription just means that
the actuator receives any new data in a particular log file as it arrives). This
gives us a way to virtualize the actuator as well. Anyone intending to trigger a
particular actuator only needs to write the actuation data to that log file.

# Global Data Plane (GDP)

In GDP, we take this basic abstraction of a log and use it as a communication
and storage mechanism for various entities producing/consuming data. A log in
GDP is nothing more than a place where you can write records (each called a
datum) in a queue- or list-like fashion. Any structure in the datum is assigned
by your program; to the GDP, a datum is just an opaque blob of data. Other than
the real data, each datum has a sequence number (starting from one), a
timestamp, and potentially other information (TBD). Each log also has metadata
that gives general information about the log, e.g., a public key for checking
signatures (details TBD). Metadata can only be specified when the log is
created.

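To make the abstraction concrete, here is a minimal in-memory sketch of a log
as described above. It is a toy model, not the GDP library: the names `Datum`,
`ToyLog`, and the metadata key are hypothetical, and the only properties taken
from the text are the opaque payload, sequence numbers starting from one, a
timestamp per datum, and metadata fixed at creation.

```python
import time
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import List, Mapping


@dataclass(frozen=True)
class Datum:
    """One record in the log (hypothetical model, not the GDP wire format)."""
    recno: int        # sequence number, starting from one
    timestamp: float  # seconds since the epoch, taken at append time
    data: bytes       # opaque payload; its structure is up to the application


@dataclass
class ToyLog:
    """Append-only list of datums with metadata fixed at creation."""
    metadata: Mapping[str, str]
    _records: List[Datum] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Wrap a copy in a read-only view: metadata cannot change after creation.
        self.metadata = MappingProxyType(dict(self.metadata))

    def append(self, data: bytes) -> Datum:
        # The next sequence number is one more than the current record count.
        datum = Datum(recno=len(self._records) + 1,
                      timestamp=time.time(),
                      data=data)
        self._records.append(datum)
        return datum

    def __len__(self) -> int:
        return len(self._records)


log = ToyLog(metadata={"description": "temperature sensor"})
first = log.append(b"21.5")
second = log.append(b"21.7")
```

Note that the toy log never interprets `data`; any encoding of the temperature
reading is purely the application's choice.
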
51 |  |
||
52 | |||
As for the different software components, there are clients (readers/writers),
routers, and log servers. As a user, you probably need to know only about the
client side. Still, for a better understanding of the entire system: a log
server (gdplogd) is a server process that actually stores the data and responds
to commands for reading, writing, etc. A router (gdp-router), as the name
implies, routes all the GDP traffic appropriately. As a client interested in a
particular log, you connect to *any* router over a TCP channel. It is the job of
the router to figure out which log server hosts the log in question and then
route all the communication appropriately.

63 |  |
||
64 | |||
65 | |||
## Writes and Reads

The write operation for the logs is *append*. A datum written to the log cannot
be overwritten or changed at a later time. As of now, neither logs nor records
expire (one or both may change in the future). There are two versions of append
available in the library provided: synchronous and asynchronous; however, they
are just two different ways of using the same underlying operation. In
synchronous operation, a writer waits for an acknowledgment before commencing
the next append operation. This may not be efficient enough for applications
with high performance requirements; hence the asynchronous operation, where the
writer can have multiple append operations in transit at a time. Note, however,
that synchronous vs. asynchronous operation is purely a difference in how the
client library processes the data--internally all operations are asynchronous.

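The relationship between the two append styles can be sketched as follows. This
is an illustration only, assuming a hypothetical `ToyLogServer` and `ToyWriter`
built on Python's standard thread pool; it is not the GDP client API. The point
it shows is that a synchronous append is simply an asynchronous append followed
by a wait for the acknowledgment.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List


class ToyLogServer:
    """Stand-in for a log server: stores data and acknowledges with a recno."""

    def __init__(self) -> None:
        self._records: List[bytes] = []
        self._lock = threading.Lock()

    def handle_append(self, data: bytes) -> int:
        with self._lock:
            self._records.append(data)
            return len(self._records)  # acknowledgment: the assigned recno


class ToyWriter:
    """Hypothetical client: every append is asynchronous internally."""

    def __init__(self, server: ToyLogServer) -> None:
        self._server = server
        # A single worker sends requests in submission order.
        self._pool = ThreadPoolExecutor(max_workers=1)

    def append_async(self, data: bytes) -> Future:
        # Returns immediately; the acknowledgment arrives via the Future.
        return self._pool.submit(self._server.handle_append, data)

    def append(self, data: bytes) -> int:
        # Synchronous append: issue the async request, then wait for the ack.
        return self.append_async(data).result()

    def close(self) -> None:
        self._pool.shutdown(wait=True)


server = ToyLogServer()
writer = ToyWriter(server)

sync_recno = writer.append(b"a")  # blocks until acknowledged

# Several appends in transit at once; acknowledgments are collected afterwards.
pending = [writer.append_async(payload) for payload in (b"b", b"c", b"d")]
async_recnos = [future.result() for future in pending]
writer.close()
```

In this sketch the asynchronous writer keeps issuing requests without waiting,
which is where the performance benefit for high-rate writers would come from.
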
The read operation can be performed in a few different ways--read a
single record by record number, read multiple records together with a single
read request, or subscribe to a log to be notified of potential future data.
However, querying a record by its record number is not the best way in all
situations, because often the latest few values in a log are of interest.
The question is: as a reader, how are you supposed to know which record number
to use? For this reason, the record number for a read request can take a
negative value to address records starting from the most recent. For instance,
record number -1 refers to the most recent record in a log.

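The record-number convention can be sketched as a small resolution function.
This is a hypothetical helper, not GDP code. The text only states that -1 is
the most recent record; that -2 means the second most recent, and that record
number 0 is rejected, are assumptions made for this illustration.

```python
from typing import List, Sequence


def resolve_recno(recno: int, num_records: int) -> int:
    """Map a possibly negative record number to an absolute one (1-based)."""
    if recno == 0:
        # Assumption: 0 is not a valid record number in this sketch.
        raise ValueError("record numbers start from one")
    absolute = recno if recno > 0 else num_records + recno + 1
    if not 1 <= absolute <= num_records:
        raise IndexError(f"no record {recno} in a log of {num_records} records")
    return absolute


def read(records: Sequence[bytes], recno: int) -> bytes:
    """Read a single record by (possibly negative) record number."""
    return records[resolve_recno(recno, len(records)) - 1]


def read_multiple(records: Sequence[bytes], start: int, count: int) -> List[bytes]:
    """Read up to `count` records beginning at record number `start`."""
    first = resolve_recno(start, len(records))
    return list(records[first - 1:first - 1 + count])


readings = [b"20.9", b"21.5", b"21.7"]
latest = read(readings, -1)
oldest = read(readings, 1)
last_two = read_multiple(readings, -2, 2)
```

A reader interested in only the newest value can therefore always ask for
record -1 without first learning how long the log is.
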
Another important thing to note is that the logs are single-writer--each sensor
has its own log. This makes quite a few things easier for us (encryption,
signatures, data ordering, etc.). However, there are situations where a
composite log is desired, such as one log for all the temperature sensors in a
building. We envision that a simple aggregation service can subscribe to all the
individual logs, perform some kind of aggregation on those values (for example,
calculating the average temperature in a building), and then write those values
to a *composite* log. A *composite* log is also a single-writer log: the
aggregation service is its sole writer.

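The aggregation idea above can be sketched with an in-memory model. Everything
here is hypothetical (`ToyLog`, `AveragingService`, the log names, and the
callback-style subscription); the sketch only illustrates a service that
subscribes to per-sensor logs and is the sole writer of a composite log. It
averages the latest reading from each sensor that has reported so far, which is
one of many possible aggregation policies.

```python
from typing import Callable, Dict, List


class ToyLog:
    """In-memory log of float readings with callback subscriptions."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.records: List[float] = []
        self._subscribers: List[Callable[[str, float], None]] = []

    def subscribe(self, callback: Callable[[str, float], None]) -> None:
        self._subscribers.append(callback)

    def append(self, value: float) -> None:
        self.records.append(value)
        # Notify every subscriber of the newly appended record.
        for callback in self._subscribers:
            callback(self.name, value)


class AveragingService:
    """Subscribes to sensor logs; the only writer of the composite log."""

    def __init__(self, sources: List[ToyLog], composite: ToyLog) -> None:
        self._latest: Dict[str, float] = {}
        self._composite = composite
        for source in sources:
            source.subscribe(self._on_record)

    def _on_record(self, log_name: str, value: float) -> None:
        # Remember the latest reading per sensor, then publish the average.
        self._latest[log_name] = value
        average = sum(self._latest.values()) / len(self._latest)
        self._composite.append(average)


room_a = ToyLog("room-a")
room_b = ToyLog("room-b")
building = ToyLog("building-average")
service = AveragingService([room_a, room_b], building)

room_a.append(20.0)
room_b.append(22.0)
room_a.append(24.0)
```

Each sensor still writes only to its own log, so the single-writer property
holds for every log involved, including the composite one.
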
## Addressing

All objects known to the GDP, including logs and services, exist in a single,
flat, global namespace based on 256-bit numbers (loosely referred to as the
internal name). For a log, the name is assigned by GDP when the log is created,
using a hash of the metadata and public key (the exact composition of this is
TBD). *(At the moment it is possible to specify the data that is hashed for the
name, but this will be changed soon).* A directory service will map
human-understandable names into internal names (coming soon).

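As a rough illustration of how a 256-bit name could be derived from log
metadata and a public key, consider the sketch below. Since the exact
composition is TBD, every detail here is an assumption: the use of SHA-256, the
JSON serialization of the metadata, the length prefix, and the function name
`derive_internal_name` are all hypothetical and do not describe what GDP does.

```python
import hashlib
import json
from typing import Mapping


def derive_internal_name(metadata: Mapping[str, str], public_key: bytes) -> bytes:
    """Return a 32-byte (256-bit) name from metadata and a public key.

    Assumptions for this sketch only: SHA-256 as the hash, and a canonical
    JSON encoding of the metadata so that key order does not affect the name.
    """
    canonical = json.dumps(dict(metadata), sort_keys=True,
                           separators=(",", ":")).encode("utf-8")
    hasher = hashlib.sha256()
    # Length-prefix the metadata so its boundary with the key is unambiguous.
    hasher.update(len(canonical).to_bytes(4, "big"))
    hasher.update(canonical)
    hasher.update(public_key)
    return hasher.digest()


name = derive_internal_name({"description": "temperature sensor"},
                            b"hypothetical-public-key-bytes")
printable_name = name.hex()  # 64 hexadecimal characters
```

Because the name is a hash of inputs fixed at creation time, the same metadata
and key always yield the same name, and changing either yields a different one.
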
<!--

All communications in the GDP are provided by the GDP Routing Layer. At the
moment the implementation has limited scalability and does not work with
routers behind NAT firewalls. We are planning on dropping the second
restriction soon, but the truly global, scalable, peer-to-peer based
implementation will take a bit longer.

At the moment, logs must be created by hand on a specific log server. This
will be replaced by a log creation service soon. To have a log created,
contact Eric (eric@cs.berkeley.edu).

-->

## Client side implementation

As of now, at the core of the client-side distribution is a C library. There are
wrappers around the C library for Python and Java. The wrappers tend to run
slightly behind the C interface in features; however, you might find them an
easier option if you want an object-oriented interface or are not comfortable
programming in C.

The Java interface is moderately supported. Contact Christopher
(<cxh@eecs.berkeley.edu>) or Nitesh (<mor@eecs.berkeley.edu>) if you are in
desperate need of a Java interface. There is also a REST-based interface,
which is covered elsewhere.

## Security and Privacy

The long-term plan is to use encryption for data secrecy and signatures for
authenticity. Ideally, a user reading/writing data from/to GDP should not have
to trust any other component of the system for any verification (not even the
log servers). We envision that we can achieve such guarantees by making the
client side a little smarter and using cryptographic techniques such as
encryption. However, this is a work in progress, and at the current stage we do
not provide any such guarantees.

As mentioned earlier, logs are single-writer. Roughly, each log has a public
signature key included in the log-level metadata; the private part stays with
the intended writer of the log. All writes are signed using the private
signature key, and the log server verifies these signatures to enforce (1) write
access control and (2) data authenticity. The signatures are included in the
on-disk data and can be queried by a client to make sure that the data was not
modified in any way. Any tampering with record ordering can also be detected,
since the single-writer model allows the writer to create an implicit
linked list protected by signatures. However, this is only partly implemented
as of now.

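The linked-list idea can be illustrated with a hash chain in which each record
commits to its predecessor. This is a sketch under loud assumptions, not the
GDP record format: the field layout, the use of SHA-256, and all names are
hypothetical. In particular, an HMAC with a shared key stands in for the
signature so that the example runs with only the standard library; a real
deployment would use a public-key signature so that verifiers need only the
public key from the log metadata.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import List

GENESIS = b"\x00" * 32  # placeholder "previous hash" for the first record


@dataclass(frozen=True)
class SignedRecord:
    recno: int
    data: bytes
    prev_hash: bytes  # digest of the preceding record: the implicit link
    signature: bytes  # stand-in signature over this record's digest


def _digest(recno: int, data: bytes, prev_hash: bytes) -> bytes:
    hasher = hashlib.sha256()
    hasher.update(recno.to_bytes(8, "big"))
    hasher.update(prev_hash)
    hasher.update(data)
    return hasher.digest()


def _sign(key: bytes, digest: bytes) -> bytes:
    # ASSUMPTION: HMAC is a symmetric stand-in for a real private-key signature.
    return hmac.new(key, digest, hashlib.sha256).digest()


def append_signed(chain: List[SignedRecord], data: bytes, key: bytes) -> None:
    if chain:
        last = chain[-1]
        prev_hash = _digest(last.recno, last.data, last.prev_hash)
    else:
        prev_hash = GENESIS
    recno = len(chain) + 1
    signature = _sign(key, _digest(recno, data, prev_hash))
    chain.append(SignedRecord(recno, data, prev_hash, signature))


def verify_chain(chain: List[SignedRecord], key: bytes) -> bool:
    prev_hash = GENESIS
    for expected_recno, record in enumerate(chain, start=1):
        digest = _digest(record.recno, record.data, record.prev_hash)
        # Ordering check: the recno and the link to the predecessor must match.
        if record.recno != expected_recno or record.prev_hash != prev_hash:
            return False
        # Authenticity check: the signature must cover this exact record.
        if not hmac.compare_digest(record.signature, _sign(key, digest)):
            return False
        prev_hash = digest
    return True


key = b"hypothetical-writer-key"
chain: List[SignedRecord] = []
for reading in (b"20.9", b"21.5", b"21.7"):
    append_signed(chain, reading, key)

intact = verify_chain(chain, key)
reordered = [chain[1], chain[0], chain[2]]
tampering_detected = not verify_chain(reordered, key)
```

Because each signature covers the previous record's digest, reordering,
dropping, or altering a record breaks verification for that record or a later
one.
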
Encryption is used to make data available to only certain users. Ideally,
anyone can read encrypted data but cannot make any sense of it without
access to the decryption key. However, there are limitations to this approach:
side channels, weak encryption, and key management are some of the many
challenges that need to be solved.

As for the overhead of signatures/encryption, in the current phase we assume
that if a device can speak the GDP protocol, it can also perform
signing and encryption. Ultra-low-power devices might offload the entire process
to a more powerful gateway device (a smartphone, for instance).

<!-- Ideally devices will have a key-pair assigned at the factory with only the
public key exposed, although low-power crypto-unaware devices may speak to a
gateway that (optionally) collects data from multiple devices and then signs
and encrypts the data before submitting it to the GDP.

Security is enforced by cryptographic techniques--for example, there are no
ACLs.

Starting from the data acquisition device (for example, a sensor), the model is
that each device will have a key pair created at the factory. The secret key
will be burned into the device and not be accessible. Each device has a
corresponding log, and the public key for the device will be included in the
log metadata when the log is created. The device will sign all outgoing
records using the secret key, and the GDP log server hosting the log will check
the signature before writes are permitted. The signature is retained with the
record, so consumers (readers) may verify the signature themselves. Very small
and/or low power devices that cannot sign themselves will have an intermediate
gateway that does the signing, at the risk of lowered security should the
gateway be compromised or if a Man In The Middle attack can be mounted between
the sensor and the gateway. -->
191 |