# Tensorflow extensions to use GDP as a file system

This directory contains a set of tools, source files, and patches to
[TensorFlow](https://www.tensorflow.org). Please see [Custom
Filesystem](https://www.tensorflow.org/extend/add_filesys) documentation
for a high level overview.

**Note: This is experimental software. No guarantees are made regarding
  the protection of data. Please use it at your own risk. Specifically,
  do not use this to store sensitive data, or data that you can not
  afford to lose.**

# Mapping files to GDP logs

In an ideal world, one could encapsulate an entire filesystem inside a
single GDP log. This would ensure that all the updates, even across different
files, have a sequential ordering.

A slightly less than ideal, but much easier to implement/deploy, solution
is to dedicate one log for each file. Directories are just a special type
of file that contains directory entries (just like any other file system).
The current implementation follows this approach with one extra caveat:
there is only one directory in a filesystem; all subdirectories are merely
empty entries. For example: `gdp://x/y/z/1/2/` is merely a filesystem
rooted in a log `x` with empty entries `y`, `y/z`, `y/z/1`, and `y/z/1/2`.
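
The example above can be sketched in a few lines of Python. This is only an
illustration of the naming scheme, not the code used by the implementation:
`split_gdp_path` is a hypothetical helper, and any translation from the root
name to an actual GDP log name is not modeled here.

```python
def split_gdp_path(path):
    """Split a gdp:// path into (root_log, empty_directory_entries).

    Hypothetical helper for illustration; not part of this repository.
    """
    prefix = "gdp://"
    if not path.startswith(prefix):
        raise ValueError("not a gdp:// path: %r" % path)
    # Drop empty components produced by trailing or repeated slashes.
    parts = [p for p in path[len(prefix):].split("/") if p]
    if not parts:
        raise ValueError("path has no root log: %r" % path)
    root_log, components = parts[0], parts[1:]
    # Every prefix of the remaining components is one (empty) entry.
    entries = ["/".join(components[:i]) for i in range(1, len(components) + 1)]
    return root_log, entries


root, entries = split_gdp_path("gdp://x/y/z/1/2/")
print(root)     # x
print(entries)  # ['y', 'y/z', 'y/z/1', 'y/z/1/2']
```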

## Encoding data

See `GDPfs.proto`. Each record is a message of type `GDPfsMsg`. Only the
first message in a log has `GDPfsMeta` set; all other records have either
`GDPfsFchunk` or `GDPfsDentry` set (depending on the type set in the first
record). No mixing and matching of `GDPfsFchunk` and `GDPfsDentry` is allowed
in the same log.
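
The layout rule above can be expressed as a small check. The sketch below
(Python 3) does not use the real protobuf classes: `Record` is a hypothetical
stand-in for `GDPfsMsg`, the strings `"file"` and `"dir"` are assumed
placeholders for whatever type `GDPfsMeta` actually carries, and the field
names are invented for illustration. See `GDPfs.proto` for the real
definitions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    """Hypothetical stand-in for a GDPfsMsg record."""
    meta: Optional[str] = None      # stands in for GDPfsMeta: "file" or "dir"
    fchunk: Optional[bytes] = None  # stands in for GDPfsFchunk
    dentry: Optional[str] = None    # stands in for GDPfsDentry


def check_log(records: List[Record]) -> bool:
    """Return True if the records follow the layout rule described above."""
    # The first record must carry the metadata that fixes the log type.
    if not records or records[0].meta not in ("file", "dir"):
        return False
    want_chunks = records[0].meta == "file"
    for rec in records[1:]:
        # Only the first record may carry metadata.
        if rec.meta is not None:
            return False
        has_chunk = rec.fchunk is not None
        has_dentry = rec.dentry is not None
        # File logs hold only chunks; directory logs hold only entries.
        if has_chunk != want_chunks or has_dentry == want_chunks:
            return False
    return True


file_log = [Record(meta="file"), Record(fchunk=b"abc"), Record(fchunk=b"def")]
mixed_log = [Record(meta="dir"), Record(dentry="a"), Record(fchunk=b"abc")]
print(check_log(file_log))   # True
print(check_log(mixed_log))  # False
```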

# Compiling

There are two high-level ways of providing support for `gdp://` paths
with tensorflow:

- Compile as a standalone library that can be loaded at python runtime
  by using `load_library`. This is easier to use, but requires any
  tensorflow code that needs `gdp://` support to be updated.
- Integrate with the tensorflow code and compile into the main
  tensorflow runtime. This requires a custom-built tensorflow package, but
  can run existing tensorflow code unmodified.

## Compiling as a loadable library

*See a set of full instructions in a shell script form at the end*

Simply use `make` (assuming GDP is installed properly). Then, at
runtime, use the following:

```
from tensorflow.python.framework import load_library
load_library.load_file_system_library("gdpfs_tf.so")
```
Any tensorflow operations that use, say `tf.gfile.GFile`, should
now be able to resolve paths that begin with `gdp://`.

## Compiling with tensorflow source

**Note: this doesn't quite work yet. Still a work in progress.**

*The instructions are outdated; kept only for future reference*

- First, install GDP on the machine where we are compiling
- Clone tensorflow; revert to git commit `35287be3bb7` (this is the
  commit that is known to work with the patch below).
- Apply patch from `extra/gdp_config.patch` by using `git apply <patch>`
- Copy `extra/gdp.BUILD` to tensorflow root
- Copy `CAAPI` to `~/tensorflow.gdp/CAAPI`
- Copy `tf` to `tensorflow/tensorflow/core/platform/gdp`
- Follow instructions for compiling tensorflow. Briefly:
  - `./configure`. Say `-march=x86-64` for optimization
  - `bazel build --config=opt --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" //tensorflow/tools/pip_package:build_pip_package`
  - `bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg`
  - `sudo pip install --upgrade /tmp/tensorflow_pkg/tensorflow-*.whl`

## Testing whether the installation works

(Based on some instructions [here](https://www.tensorflow.org/versions/master/deploy/s3#smoke_test))

After a successful installation (whether by pip package, or by local
compilation and runtime instantiation), something like the following
should work (with appropriate paths, and loading the `.so` file).

```
$ python
>>> from tensorflow.python.lib.io import file_io
>>> print(file_io.stat('gdp://X/')) # this should also work with a correct path
```

# Command line utility

For ease of use, a command line utility `gdpfs_cmd` is also provided. At the
very least, this utility allows for viewing the contents of a GDP filesystem,
adding files from local disk to a given GDP filesystem, copying files from a
GDP filesystem to the local disk, and removing files and directories.

Please run `./gdpfs_cmd` for usage instructions.

# Limitations

- The current code should not be assumed to be bug-free. In fact, the current
  code has a number of security vulnerabilities.
- For the moment, no guarantees are made regarding protection (confidentiality
  and/or integrity) of data. Please use this at your own risk.
- No signature/encryption keys are used at the moment. This will change at some
  point in the near future.
- Since signature keys are not implemented right now, the single-writer model
  is not really enforced. Please avoid architecting a solution that relies on
  multiple writers for a given filesystem.
- Recursive operations are not supported from `gdpfs_cmd`. This will change.
- Creating new files in the filesystem has a fixed overhead. Thus, copying
  a large number of very small files is very slow. Partially, this is because
  of the limited recursive functionality.
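
Until recursive operations are supported, a directory tree has to be copied
one file at a time. The following is a minimal sketch of how the per-file
source and destination pairs could be enumerated. `plan_copies` is a
hypothetical helper, not part of this repository, and the actual `gdpfs_cmd`
invocation for each pair is deliberately left out because its syntax is not
documented here (run `./gdpfs_cmd` to see it). This does nothing about the
fixed per-file overhead; it only automates the enumeration.

```python
import os


def plan_copies(local_root, gdp_root):
    """Yield (local_file, gdp_path) pairs for every file under local_root.

    Hypothetical helper: it only computes paths and copies nothing.
    """
    gdp_root = gdp_root.rstrip("/")
    for dirpath, dirnames, filenames in os.walk(local_root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            local_file = os.path.join(dirpath, name)
            rel = os.path.relpath(local_file, local_root)
            # GDP paths always use forward slashes.
            yield local_file, gdp_root + "/" + rel.replace(os.sep, "/")


if __name__ == "__main__":
    # "gdp://x/" is the placeholder filesystem name used earlier in this README.
    for src, dst in plan_copies(".", "gdp://x/"):
        print(src, "->", dst)
```

Each pair can then be passed to whatever single-file copy command is in use.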

# Miscellaneous

Also see:

- [Tensorflow: extending file system](https://www.tensorflow.org/extend/add_filesys)
- [Tensorflow file_system.h](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/platform/file_system.h)
- [Tensorflow on S3](https://www.tensorflow.org/versions/master/deploy/s3#smoke_test)
- [Tensorflow compilation instructions](https://www.tensorflow.org/install/install_sources)
- [Loadable extensions at runtime](https://www.tensorflow.org/extend/adding_an_op#compile_the_op_using_your_system_compiler_tensorflow_binary_installation)
- [IPTF: IPFS with tensorflow](https://github.com/tesserai/iptf)
- [Bazel: adding external dependencies](https://docs.bazel.build/versions/master/cpp-use-cases.html#adding-dependencies-on-precompiled-libraries)

# Amazon EC2 installation instructions (Ubuntu 16.04)

Here's a list of steps that one can perform on a vanilla EC2 image:

```
#!/bin/bash

NUM_PROCS=`cat /proc/cpuinfo  | grep "^processor" | wc -l`
NUM_JOBS=$((NUM_PROCS*2))

# sudo apt-get update && sudo apt-get dist-upgrade -y && sudo reboot

# Download protobuf
wget https://github.com/google/protobuf/releases/download/v3.5.1/protobuf-cpp-3.5.1.tar.gz
tar xvzf protobuf-cpp-3.5.1.tar.gz

## Install protobuf pre-reqs
sudo apt-get -y install autoconf automake libtool curl make g++ unzip

( cd protobuf-3.5.1 && \
    ./configure && \
    make -j $NUM_JOBS && \
    make -j $NUM_JOBS check && \
    sudo make install && \
    sudo ldconfig \
)

# Clone GDP
git clone git://repo.eecs.berkeley.edu/projects/swarmlab/gdp.git

# Install GDP dependencies.
(cd gdp/adm/ && ./gdp-setup.sh)

# Download protobuf-c-compiler; we need a new version
sudo apt-get -y --purge autoremove protobuf-c-compiler
git clone https://github.com/protobuf-c/protobuf-c.git
( cd protobuf-c && \
    ./autogen.sh && \
    ./configure && \
    make && \
    sudo make install \
)

# compile GDP
(cd gdp && make && sudo make install-client)

# configure gdp (-p: do not fail if the directory already exists)
( sudo mkdir -p /etc/ep_adm_params && \
  echo "swarm.gdp.zeroconf.enable=false" >> /tmp/gdp && \
  echo "swarm.gdp.routers=gdp-03.eecs.berkeley.edu:8009" >> /tmp/gdp && \
  echo "swarm.gdp.gdp-create.server=gdplogd.mor" >> /tmp/gdp && \
  sudo mv /tmp/gdp /etc/ep_adm_params \
)

# install tensorflow
sudo apt-get -y install python-pip
sudo pip install --upgrade pip
sudo pip install tensorflow

# Get the gdp-if (GDP interfaces) repository
git clone git://repo.eecs.berkeley.edu/projects/swarmlab/gdp-if.git

# compile
(cd gdp-if/tensorflow && make)

# configure gdpfs
( echo "swarm.gdpfs.logprefix = edu.berkerley.tensorflow." >> /tmp/gdpfs && \
  echo "swarm.gdp.debug = gdp.*=5" >> /tmp/gdpfs && \
  echo "swarm.gdpfs.debug = gdpfs.*=19" >> /tmp/gdpfs && \
  sudo mv /tmp/gdpfs /etc/ep_adm_params \
)

##
echo "Include this in your LD path>>>>"
echo export LD_LIBRARY_PATH=`python -c "import tensorflow as tf; print(tf.sysconfig.get_lib())"`:\$LD_LIBRARY_PATH

```
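
The script above writes its GDP settings as `name = value` lines into files
under `/etc/ep_adm_params`. As a quick sanity check after running it, the
sketch below reads such text back into a dictionary. It rests on assumptions
that are not confirmed by this README: one setting per line, the first `=`
separating name from value (values such as `gdp.*=5` contain `=` themselves),
and `#` starting a comment line. `parse_params` is a hypothetical helper, not
part of GDP.

```python
def parse_params(text):
    """Parse 'name = value' lines into a dict (assumed file format)."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and (assumed) '#' comment lines.
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, since values may contain '='.
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a 'name = value' line
        params[key.strip()] = value.strip()
    return params


sample = """\
swarm.gdp.zeroconf.enable=false
swarm.gdp.debug = gdp.*=5
"""
print(parse_params(sample))
# {'swarm.gdp.zeroconf.enable': 'false', 'swarm.gdp.debug': 'gdp.*=5'}
```

To check a real installation, read one of the files written by the script
(for example `/etc/ep_adm_params/gdp`) and pass its contents to the function.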