% Introduction to the Global Data Plane

For any questions about this document, send comments to:

- Nitesh Mor, U.C. Berkeley Ubiquitous Swarm Lab, <mor@eecs.berkeley.edu>
- Eric Allman, U.C. Berkeley Ubiquitous Swarm Lab, <eric@cs.berkeley.edu>

---

This is a brief, high-level introduction to the Global Data Plane, primarily for
users wishing to incorporate GDP in their existing applications.  It is a
living document, subject to change.  In particular, we are planning some
major new functionality that will partly change the interface.

# Introduction

Let's imagine we have a tiny little sensor that measures temperature
periodically and spits out the current temperature. To make this sensor *smart*,
we add some communication capabilities to it. But how does one use the data
generated by this sensor?

To preserve the data generated by this sensor, let's put the sensor values in a
log-file. A new temperature reading gets appended to this file as the sensor
generates new data. With the log-file approach, we suddenly get quite a few new
capabilities--a user can look at the log file to see historic values, query it
for the latest reading, and so on--all this without adding any functionality to
the sensor itself. As a matter of fact, one doesn't even need to know about the
real sensor anymore; a log file virtualizes this sensor in a way.

Now, let's bring an actuator (a light bulb, for example) into the picture. The
actuation happens when the actuator receives a certain command over some
communication channel. Let's apply the log-file approach to the actuator
too--the actuator is *subscribed* to a log-file (subscription just means that
the actuator receives any new data in a particular log-file as it arrives). This
gives us a way to virtualize the actuator as well. Anyone intending to actuate a
particular actuator only needs to write the actuation data to a log-file.

# Global Data Plane (GDP)

In GDP, we take this basic abstraction of a log and use it as a communication
and storage mechanism for various entities producing/consuming data. A log in
GDP is nothing more than a place where you can write records (each called a
datum) in a queue/list-like fashion.  Any structure in the datum is assigned by
your program; a datum is just an opaque blob of data for the GDP.  Other than
the real data, each datum has a sequence number (starting from one), a
timestamp, and potentially other information (TBD). Logs have metadata that
gives some general information about the log, e.g., a public key for checking
signatures (details TBD). Metadata can only be specified when the log is
created.

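To make the log abstraction concrete, here is a minimal in-memory sketch in
Python. It is not the GDP client API: `Datum`, `ToyLog`, and the `creator`
metadata key are hypothetical names chosen for illustration. It only models
what is described above: opaque data, sequence numbers starting from one, a
timestamp per datum, and metadata fixed at creation time.

```python
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Datum:
    """One record; `data` is an opaque blob as far as the log is concerned."""
    recno: int        # sequence number, starting from 1
    timestamp: float  # seconds since the epoch, set when the datum is appended
    data: bytes


@dataclass
class ToyLog:
    """Hypothetical in-memory stand-in for a GDP log (not the real API)."""
    metadata: dict                                 # supplied once, at creation
    _records: list = field(default_factory=list)   # append-only storage

    def append(self, data: bytes) -> Datum:
        # The next sequence number is always one past the current length.
        datum = Datum(recno=len(self._records) + 1,
                      timestamp=time.time(),
                      data=bytes(data))
        self._records.append(datum)
        return datum

    def read(self, recno: int) -> Datum:
        if not 1 <= recno <= len(self._records):
            raise IndexError(f"no record {recno}")
        return self._records[recno - 1]

    def __len__(self) -> int:
        return len(self._records)


log = ToyLog(metadata={"creator": "temperature-sensor-1"})
log.append(b"21.5")
log.append(b"21.7")
```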
![Logical representation of GDP log](log-structure-1.png)

As for the different software components, there are clients (readers/writers),
routers, and log-servers. As a user, you probably need to know only about the
client side. For a slightly better understanding of the entire system, though:
a log-server (gdplogd) is a server process that actually stores the
data and responds to commands for reading, writing, etc. A router (gdp-router),
as the name implies, routes all the GDP traffic appropriately. As a client
interested in a particular log, you connect to *any* router over a TCP channel.
It is the job of the router to figure out which log-server hosts the log in
question and then route all the communication appropriately.

![GDP software components](gdp-components.png)


## Writes and Reads

The write operation for the logs is *append*. A datum written to the log cannot
be overwritten or changed at a later time. As of now, neither logs nor records
expire (one or both may change in the future). There are two versions of append
available in the library provided, synchronous and asynchronous; however, they
are just two different ways of implementing the same underlying operation. In
synchronous operation, a writer waits for an acknowledgment before commencing
the next append operation. This might not be the most efficient approach
for applications with high performance requirements, hence the asynchronous
operation, where the writer can have multiple append operations in transit at a
time. Note, however, that synchronous vs. asynchronous operation is purely a
difference in how the client library processes the data--internally all
operations are asynchronous.

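The difference between the two append styles can be illustrated without any
GDP code. The sketch below is an assumption-laden toy: `send_append` is a
hypothetical transport that merely sleeps to simulate the round trip to a
log-server, and a thread pool stands in for whatever mechanism the client
library really uses. It also ignores record ordering, which a real writer
would have to preserve.

```python
import time
from concurrent.futures import ThreadPoolExecutor

ACK_DELAY = 0.05  # simulated round-trip time to the log-server, in seconds


def send_append(data: bytes) -> int:
    """Hypothetical transport: send one append and block until it is acked."""
    time.sleep(ACK_DELAY)
    return len(data)  # stand-in for the acknowledgment


def append_sync(items):
    # Synchronous style: wait for each ack before starting the next append.
    return [send_append(item) for item in items]


def append_async(items, max_in_flight=8):
    # Asynchronous style: several appends are in transit at the same time,
    # and the acknowledgments are collected afterwards.
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        futures = [pool.submit(send_append, item) for item in items]
        return [future.result() for future in futures]


items = [b"21.5", b"21.7", b"21.6", b"21.9"]

start = time.perf_counter()
sync_acks = append_sync(items)
sync_elapsed = time.perf_counter() - start

start = time.perf_counter()
async_acks = append_async(items)
async_elapsed = time.perf_counter() - start
```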
The read operation can be performed in several ways--read a
single record by record number, read multiple records together with a single
read request, or subscribe to a log to be notified of potential future data.
However, querying a record by its record number is not the best way in all
situations, because often the latest few values in a log are of interest.
The question is: as a reader, how are you supposed to know which record number
to use? For this reason, the record number for a read request can take a
negative value to address records counting back from the most recent. For
instance, record number -1 refers to the most recent record in a log.

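One way to picture negative record numbers is as offsets from the end of the
log. The helper below is a sketch, not GDP code: `resolve_recno` is a
hypothetical name, and rejecting record number 0 is an assumption made here,
since the text above only says that numbering starts from one and that -1 is
the most recent record.

```python
def resolve_recno(recno: int, num_records: int) -> int:
    """Map a possibly negative record number to an absolute, 1-based one."""
    if recno == 0:
        # Assumption: 0 addresses nothing, because numbering starts from one.
        raise ValueError("record number 0 is not addressable in this sketch")
    # -1 is the most recent record, -2 the one before it, and so on.
    absolute = recno if recno > 0 else num_records + 1 + recno
    if not 1 <= absolute <= num_records:
        raise IndexError(f"record {recno} is out of range")
    return absolute


records = [b"20.9", b"21.5", b"21.7"]  # records 1, 2, and 3


def read(recno: int) -> bytes:
    # Convert the 1-based record number to a 0-based list index.
    return records[resolve_recno(recno, len(records)) - 1]


latest = read(-1)
```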
Another important thing to note is that the logs are single-writer--each sensor
has its own log. This makes quite a few things easier for us (encryption,
signatures, data-ordering, etc.). However, there are situations where a
composite log is desired, such as one log for all the temperature sensors in a
building. We envision that a simple aggregation service can subscribe to all the
individual logs, perform some kind of aggregation on those values (for example,
calculating average temperature in a building), and then write those values to a
*composite* log. A *composite* log is also a single-writer log, where the
aggregation service is the sole writer to the *composite* log.

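The aggregation idea can be sketched with a toy model. Everything here is
hypothetical (`SingleWriterLog`, `Aggregator`, and the writer names are not
part of GDP); the point is only that the aggregator subscribes to each sensor
log and is itself the sole writer of the composite log.

```python
from statistics import mean


class SingleWriterLog:
    """Hypothetical toy log: one writer, append-only, with subscribers."""

    def __init__(self, writer: str):
        self.writer = writer
        self.records = []
        self.subscribers = []

    def append(self, writer: str, value: float) -> None:
        # Only the designated writer may append to this log.
        if writer != self.writer:
            raise PermissionError(f"{writer} may not write to this log")
        self.records.append(value)
        # Notify every subscriber of the newly arrived record.
        for callback in self.subscribers:
            callback(self, value)

    def subscribe(self, callback) -> None:
        self.subscribers.append(callback)


class Aggregator:
    """Subscribes to sensor logs and writes their average to a composite log."""

    def __init__(self, name: str, sensor_logs):
        self.name = name
        self.latest = {}  # most recent value seen from each sensor
        self.composite = SingleWriterLog(writer=name)
        for log in sensor_logs:
            log.subscribe(self.on_record)

    def on_record(self, log: SingleWriterLog, value: float) -> None:
        self.latest[log.writer] = value
        # The aggregator is the sole writer of the composite log.
        self.composite.append(self.name, mean(self.latest.values()))


sensors = [SingleWriterLog("sensor-a"), SingleWriterLog("sensor-b")]
agg = Aggregator("building-average", sensors)
sensors[0].append("sensor-a", 20.0)
sensors[1].append("sensor-b", 22.0)
```

After the two appends, the composite log holds one averaged value per incoming
reading, and an attempt by any sensor to write to the composite log directly
is rejected.
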
## Addressing

All objects known to the GDP, including logs and services, exist in a single,
flat, global namespace based on 256-bit numbers (loosely referred to as the
internal name). For a log, this name is assigned by GDP when the log is created,
using a hash of the metadata and public key (the exact composition of this is TBD).
*(At the moment it is possible to specify the data that is hashed for the name,
but this will be changed soon).*  A directory service will map
human-understandable names into internal names (coming soon).

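Since the exact composition of the hashed data is still TBD, the following is
only a sketch of how a 256-bit name could be derived. SHA-256, the canonical
JSON encoding of the metadata, the length prefix, and the function name
`internal_name` are all assumptions made for illustration, not the GDP
definition.

```python
import hashlib
import json


def internal_name(metadata: dict, public_key: bytes) -> bytes:
    """Derive a 256-bit name from metadata and a public key (assumed layout)."""
    # Canonical JSON, so that equal metadata always hashes to the same name.
    encoded = json.dumps(metadata, sort_keys=True,
                         separators=(",", ":")).encode("utf-8")
    digest = hashlib.sha256()
    # A length prefix keeps the metadata/key boundary unambiguous.
    digest.update(len(encoded).to_bytes(4, "big"))
    digest.update(encoded)
    digest.update(public_key)
    return digest.digest()  # 32 bytes = 256 bits


# The key below is a placeholder byte string, not a real public key.
name = internal_name({"creator": "sensor-a"}, b"placeholder-public-key")
printable = name.hex()
```

Because the name depends only on its inputs, two parties holding the same
metadata and public key can compute it independently and arrive at the same
value.
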
<!--

All communications in the GDP are provided by the GDP Routing Layer.  At the
moment the implementation has limited scalability and does not work with
routers behind NAT firewalls.  We are planning on dropping the second
restriction soon, but the truly global, scalable, peer-to-peer based
implementation will take a bit longer.

At the moment, logs must be created by hand on a specific log server.  This
will be replaced by a log creation service soon.  To have a log created,
contact Eric (eric@cs.berkeley.edu).

-->

## Client side implementation

As of now, the core of the client-side distribution is a C library. There are
wrappers around the C library for Python and Java. These wrappers
tend to lag slightly behind the C interface in features; however,
you might find them an easier option if you want an object-oriented interface
or are not comfortable programming in C.

The Java interface is moderately supported. Contact Christopher
(<cxh@eecs.berkeley.edu>) or Nitesh (<mor@eecs.berkeley.edu>) if you are in
desperate need of a Java interface.  There is also a REST-based interface,
which is covered elsewhere.


## Security and Privacy

The long-term plan is to use encryption for data secrecy and signatures for
authenticity. Ideally, a user reading/writing data from/to GDP should not have
to trust any other component of the system for any verification (not even the
log servers). We envision that we can achieve such guarantees by making the
client side a little smarter and using mathematical techniques such as
encryption. However, this is a work in progress, and at the current stage we do
not provide any such guarantees.

As mentioned earlier, logs are single-writer. Roughly, each log has a public
signature key included in the log-level metadata; the private part stays with
the intended writer of the log. All writes are signed using the private
signature key, and the log-server verifies these signatures to enforce (1) write
access control and (2) data authenticity. The signatures are included in the
on-disk data and could be queried by a client to make sure that the data was not
modified in any way. Any tampering with record ordering can also be detected,
since the single-writer model allows the writer to create an implicit
linked-list protected by signatures. However, this is only partly implemented
as of now.

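The "implicit linked-list protected by signatures" can be sketched as a hash
chain: each record commits to the hash of its predecessor, and the writer
signs the result. The code below is an illustration under stated assumptions,
not the GDP record format. In particular, HMAC-SHA256 with a shared demo key
stands in for a public-key signature (the Python standard library has none),
so unlike the real design the verifier here needs the writer's key. All names
are hypothetical.

```python
import hashlib
import hmac

WRITER_KEY = b"demo-writer-key"  # stand-in for the writer's private key
GENESIS = b"\x00" * 32           # "previous hash" used by the first record


def sign(key: bytes, message: bytes) -> bytes:
    # HMAC is only a stand-in for a real public-key signature.
    return hmac.new(key, message, hashlib.sha256).digest()


def append(chain: list, data: bytes, key: bytes = WRITER_KEY) -> None:
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    recno = len(chain) + 1
    # The signed body binds the record number, the predecessor, and the data.
    body = recno.to_bytes(8, "big") + prev_hash + data
    chain.append({"recno": recno, "prev": prev_hash, "data": data,
                  "hash": hashlib.sha256(body).digest(),
                  "sig": sign(key, body)})


def verify(chain: list, key: bytes = WRITER_KEY) -> bool:
    prev_hash = GENESIS
    for index, rec in enumerate(chain, start=1):
        # Recompute the body from the record's position, not its stored recno.
        body = index.to_bytes(8, "big") + prev_hash + rec["data"]
        if rec["prev"] != prev_hash:
            return False  # a record was reordered, inserted, or dropped
        if not hmac.compare_digest(rec["sig"], sign(key, body)):
            return False  # data or position was modified after signing
        prev_hash = hashlib.sha256(body).digest()
    return True


chain = []
for reading in (b"21.5", b"21.7", b"21.6"):
    append(chain, reading)

# Two ways to tamper: change a record's data, or swap two records.
tampered = [dict(rec) for rec in chain]
tampered[1]["data"] = b"99.9"
reordered = [chain[1], chain[0], chain[2]]
```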
Encryption is used to make data available to only certain users. Ideally,
anyone can read encrypted data but cannot make any sense of it without
access to the decryption key. However, there are limitations to this approach.
Side channels, weak encryption, and key management are some of the many
challenges that need to be solved.

As for the overhead of signatures/encryption, in the current phase we envision
that if a device can speak the GDP protocol, it can also perform
signature/encryption. Ultra-low-power devices might offload the entire process
to a more powerful gateway device (a smartphone, for instance).

<!-- Ideally devices will have a key-pair assigned at the factory with only the
public key exposed, although low-power crypto-unaware devices may speak to a
gateway that (optionally) collects data from multiple devices and then signs
and encrypts the data before submitting it to the GDP.

Security is enforced by cryptographic techniques; for example, there are no
ACLs.

Starting from the data acquisition device (for example, a sensor), the model is
that each device will have a key pair created at the factory.  The secret key
will be burned into the device and not be accessible.  Each device has a
corresponding log, and the public key for the device will be included in the
log metadata when the log is created.  The device will sign all outgoing
records using the secret key, and the GDP log server hosting the log will check
the signature before writes are permitted.  The signature is retained with the
record, so consumers (readers) may verify the signature themselves.  Very small
and/or low power devices that cannot sign themselves will have an intermediate
gateway that does the signing, at the risk of lowered security should the
gateway be compromised or if a Man In The Middle attack can be mounted between
the sensor and the gateway.  -->