Tensorflow extensions to use GDP as a file system
This directory contains a set of tools, source files, and patches to TensorFlow. Please see the Custom Filesystem documentation for a high-level overview.
Note: This is experimental software. No guarantees are made regarding the protection of data. Please use it at your own risk. Specifically, do not use this to store sensitive data, or data that you cannot afford to lose.
Mapping files to GDP logs
In an ideal world, one could encapsulate an entire filesystem inside a single GDP log. This would ensure that all updates, even across different files, have a sequential ordering.
A slightly less than ideal, but much easier to implement and deploy, solution is to dedicate one log to each file. Directories are just a special type of file that contains directory entries (just like in any other file system). The current implementation follows this approach with one extra caveat: there is only one directory in a filesystem; all subdirectories are merely empty entries. For example, gdp://x/y/z/1/2/ is merely a filesystem rooted in a log x with empty entries y, y/z, y/z/1, and y/z/1/2.
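The mapping from a path to a root log and its empty directory entries can be illustrated with a small Python sketch. The helper name split_gdp_path is hypothetical and not part of this repository; it only mirrors the naming scheme described above and does not talk to GDP.

```python
def split_gdp_path(uri):
    """Split a gdp:// URI into (root_log, entries).

    Hypothetical helper for illustration only: the first path component is
    taken to be the root log, and every prefix of the remaining components
    is one (empty) entry in the single root directory.
    """
    prefix = "gdp://"
    if not uri.startswith(prefix):
        raise ValueError("not a gdp:// path: %r" % uri)
    # Drop empty components produced by a trailing or doubled slash.
    parts = [p for p in uri[len(prefix):].split("/") if p]
    if not parts:
        raise ValueError("missing log name: %r" % uri)
    root_log = parts[0]
    entries = ["/".join(parts[1:i + 1]) for i in range(1, len(parts))]
    return root_log, entries
```

For the example path above, split_gdp_path("gdp://x/y/z/1/2/") yields the log x and the entries y, y/z, y/z/1, and y/z/1/2.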
Encoding data
See GDPfs.proto. Each record is a message of type GDPfsMsg. Only the first message in a log has GDPfsMeta set; all other records have either GDPfsFchunk or GDPfsDentry set (depending on the type set in the first record). No mixing and matching of GDPfsFchunk and GDPfsDentry is allowed in the same log.
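The layout rule can be expressed as a short check. This is a sketch under stated assumptions, not the repository's code: records are modelled as plain (kind, payload) tuples rather than real GDPfsMsg messages, the kind names "meta", "fchunk", "dentry" and the type names "file" and "dir" are invented for illustration, and an empty log is treated as invalid. The real field names are defined in GDPfs.proto.

```python
def check_log_layout(records):
    """Return True if the records follow the layout rule described above.

    Each record is a (kind, payload) tuple. For the leading "meta" record the
    payload is the log type: "file" or "dir" (assumed names, see lead-in).
    """
    if not records:
        return False  # assumption: a log must at least carry its meta record
    first_kind, log_type = records[0]
    if first_kind != "meta" or log_type not in ("file", "dir"):
        return False
    # The type in the first record decides the only kind allowed afterwards.
    expected = "fchunk" if log_type == "file" else "dentry"
    return all(kind == expected for kind, _payload in records[1:])
```

A file log such as [("meta", "file"), ("fchunk", b"..."), ("fchunk", b"...")] passes, while a directory log that also contains an "fchunk" record is rejected.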
Compiling
There are two high-level ways of providing support for gdp:// paths with tensorflow:
- Compile as a standalone library that can be loaded at python runtime by using load_library. This is easier to use, but requires any tensorflow code that requires gdp:// support to be updated.
- Integrate with the tensorflow code and compile into the main tensorflow runtime. This requires a custom-built tensorflow package, but can run existing tensorflow code unmodified.
Compiling as a loadable library
A full set of instructions, in shell script form, is at the end of this document.
Simply use make (assuming GDP is installed properly). Then, at runtime, use the following:

    from tensorflow.python.framework import load_library
    load_library.load_file_system_library("gdpfs_tf.so")

Any tensorflow operations that use, say, tf.gfile.GFile should now be able to resolve paths that begin with gdp://.
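As a rough usage sketch, the loading step and a read through tf.gfile.GFile can be wrapped in one function. This assumes a TensorFlow 1.x installation (where tf.gfile is available) and that gdpfs_tf.so can be found by the dynamic loader. The function name read_gdp_file is hypothetical.

```python
def read_gdp_file(path, library="gdpfs_tf.so"):
    """Load the GDP filesystem plugin, then read `path` via tf.gfile.GFile."""
    # Imported here so that defining the function does not require TensorFlow.
    import tensorflow as tf
    from tensorflow.python.framework import load_library

    # Registers the gdp:// scheme with TensorFlow's filesystem layer.
    load_library.load_file_system_library(library)
    with tf.gfile.GFile(path, "r") as f:
        return f.read()
```

A call would look like read_gdp_file("gdp://X/some-file"), where the path is a placeholder for a file that exists in your GDP filesystem.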
Compiling with tensorflow source
Note: this doesn't quite work yet. Still a work in progress.
The instructions are outdated; kept only for future reference.
- First, install GDP on the machine where we are compiling.
- Clone tensorflow; revert to git commit 35287be3bb7 (this is the commit that is known to work with the patch below).
- Apply the patch from extra/gdp_config.patch by using git apply <patch>.
- Copy extra/gdp.BUILD to the tensorflow root.
- Copy CAAPI to ~/tensorflow.gdp/CAAPI.
- Copy tf to tensorflow/tensorflow/core/platform/gdp.
- Follow the instructions for compiling tensorflow. Briefly:
  - ./configure. Say -march=x86-64 for optimization
  - bazel build --config=opt --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" //tensorflow/tools/pip_package:build_pip_package
  - bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
  - sudo pip install --upgrade /tmp/tensorflow_pkg/tensorflow-*.whl
Testing whether the installation works
(Based on some instructions here)
After a successful installation (whether by pip package, or by local compilation and runtime instantiation), something like the following should work (with appropriate paths, and after loading the .so file).

    $ python
    >>> from tensorflow.python.lib.io import file_io
    >>> print(file_io.stat('gdp://X/'))  # this should also work with a correct path

Command line utility
For ease of use, a command line utility gdpfs_cmd is also provided. At the very least, this utility allows for viewing the contents of a GDP filesystem, adding files from local disk to a given GDP filesystem, copying files from a GDP filesystem to the local disk, and removing files and directories. Please run ./gdpfs_cmd for usage instructions.
Limitations
- The current code should not be relied upon to be bug-free. In fact, there are a number of security vulnerabilities present in the current code.
- For the moment, no guarantees are made regarding protection (confidentiality and/or integrity) of data. Please use this at your own risk.
- No signature/encryption keys are used at the moment. This will change at some point in the near future.
- Since signature keys are not implemented right now, the single-writer model is not really enforced. Please avoid architecting a solution that relies on multiple writers for a given filesystem.
- Recursive operations are not supported from gdpfs_cmd. This will change.
- Creating new files in the filesystem has a fixed overhead. Thus, copying a large number of very small files is very slow. Partially, this is because of the limited recursive functionality.
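Until recursive operations are supported, one possible workaround is to enumerate the files of a local tree yourself and copy them one at a time. The sketch below only builds the list of (local file, GDP path) pairs; it does not invoke gdpfs_cmd, whose options are not documented here (run ./gdpfs_cmd to see them). The function name plan_copies and the gdp://X/ root are placeholders.

```python
import os


def plan_copies(local_root, gdp_root):
    """Return (local_file, gdp_path) pairs for every file under local_root."""
    pairs = []
    for dirpath, _dirnames, filenames in os.walk(local_root):
        for name in sorted(filenames):
            local_file = os.path.join(dirpath, name)
            # Path relative to the tree root, always with forward slashes.
            rel = os.path.relpath(local_file, local_root).replace(os.sep, "/")
            pairs.append((local_file, gdp_root.rstrip("/") + "/" + rel))
    return pairs
```

Each pair can then be passed to whatever single-file copy command gdpfs_cmd offers. Note that the fixed per-file overhead mentioned above still applies to every pair.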
Miscellaneous
Also see:
- Tensorflow: extending file system
- Tensorflow file_system.h
- Tensorflow on S3
- Tensorflow compilation instructions
- Loadable extensions at runtime
- IPTF: IPFS with tensorflow
- Bazel: adding external dependencies
Amazon EC2 installation instructions (Ubuntu 16.04)
Here's a list of steps that one can perform on a vanilla EC2 image:

    #!/bin/bash

    NUM_PROCS=`cat /proc/cpuinfo | grep "^processor" | wc -l`
    NUM_JOBS=$((NUM_PROCS*2))

    # sudo apt-get update && sudo apt-get dist-upgrade -y && sudo reboot

    # Download protobuf
    wget https://github.com/google/protobuf/releases/download/v3.5.1/protobuf-cpp-3.5.1.tar.gz
    tar xvzf protobuf-cpp-3.5.1.tar.gz

    ## Install protobuf pre-reqs
    sudo apt-get -y install autoconf automake libtool curl make g++ unzip
    ( cd protobuf-3.5.1 && \
        ./configure && \
        make -j $NUM_JOBS && \
        make -j $NUM_JOBS check && \
        sudo make install && \
        sudo ldconfig \
    )

    # Clone GDP
    git clone git://repo.eecs.berkeley.edu/projects/swarmlab/gdp.git

    # Install GDP dependencies.
    (cd gdp/adm/ && ./gdp-setup.sh)

    # Download protobuf-c-compiler; we need a new version
    sudo apt-get -y --purge autoremove protobuf-c-compiler
    git clone https://github.com/protobuf-c/protobuf-c.git
    ( cd protobuf-c && \
        ./autogen.sh && \
        ./configure && \
        make && \
        sudo make install \
    )

    # compile GDP
    (cd gdp && make && sudo make install-client)

    # configure gdp
    ( sudo mkdir /etc/ep_adm_params && \
        echo "swarm.gdp.zeroconf.enable=false" >> /tmp/gdp && \
        echo "swarm.gdp.routers=gdp-03.eecs.berkeley.edu:8009" >> /tmp/gdp && \
        echo "swarm.gdp.gdp-create.server=gdplogd.mor" >> /tmp/gdp && \
        sudo mv /tmp/gdp /etc/ep_adm_params \
    )

    # install tensorflow
    sudo apt-get -y install python-pip
    sudo pip install --upgrade pip
    sudo pip install tensorflow

    # Get the gdp-if (GDP interfaces) repository
    git clone git://repo.eecs.berkeley.edu/projects/swarmlab/gdp-if.git

    # compile
    (cd gdp-if/tensorflow && make)

    ( echo "swarm.gdpfs.logprefix = edu.berkerley.tensorflow." >> /tmp/gdpfs && \
        echo "swarm.gdp.debug = gdp.*=5" >> /tmp/gdpfs && \
        echo "swarm.gdpfs.debug = gdpfs.*=19" >> /tmp/gdpfs && \
        sudo mv /tmp/gdpfs /etc/ep_adm_params \
    )

    ##
    echo "Include this in your LD path>>>>"
    echo export LD_LIBRARY_PATH=`python -c "import tensorflow as tf; print(tf.sysconfig.get_lib())"`:\$LD_LIBRARY_PATH