# Tensorflow extensions to use GDP as a file system
Note: This is experimental software. No guarantees are made regarding the protection of data. Please use it at your own risk. Specifically, do not use this to store sensitive data, or data that you cannot afford to lose.
## Mapping files to GDP logs
In an ideal world, one could encapsulate an entire filesystem inside a single GDP log. This would ensure that all updates, even across different files, have a sequential ordering.
A slightly less than ideal, but much easier to implement and deploy, solution is to dedicate one log to each file. Directories are just a special type of file that contains directory entries (just like in any other file system).
The current implementation follows this approach with one extra caveat: there is only one directory in a filesystem; all subdirectories are merely empty entries. For example, `gdp://x/y/z/1/2/` is merely a filesystem rooted in a log `x`, with empty entries for each subdirectory along the path.
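The path-to-log mapping above can be sketched as follows. This is a minimal illustration only; `path_to_log` and its return shape are hypothetical and not the actual implementation's API:

```python
def path_to_log(path):
    """Map a gdp:// path to (root log name, subdirectory entries).

    Every path under gdp://x/ lives in the single log "x"; subdirectories
    are not separate logs, just empty entries recorded in that filesystem.
    """
    assert path.startswith("gdp://")
    parts = [p for p in path[len("gdp://"):].split("/") if p]
    root_log = parts[0]  # one GDP log serves as the filesystem root
    # Each path prefix below the root becomes an empty directory entry.
    entries = ["/".join(parts[1:i + 1]) + "/" for i in range(1, len(parts))]
    return root_log, entries

# gdp://x/y/z/1/2/ -> log "x" with entries y/, y/z/, y/z/1/, y/z/1/2/
print(path_to_log("gdp://x/y/z/1/2/"))
```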
The log format is defined in `GDPfs.proto`. Each record is a message of type `GDPfsMsg`. Only the first message in a log has `GDPfsMeta` set; all other records have either `GDPfsDentry` or the corresponding data field set (depending on the type set in the first record). There is no mixing and matching of record types in the same log.
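The per-log invariant above can be illustrated with a small check. The real records are protobuf messages from `GDPfs.proto` (`GDPfsMsg`, `GDPfsMeta`, `GDPfsDentry`); the dicts below are only a stand-in for illustration:

```python
def valid_log(records):
    """Check the invariant: record 0 carries the metadata (GDPfsMeta);
    every later record is of a single kind -- no mixing in one log."""
    if not records or "meta" not in records[0]:
        return False
    kinds = {r["kind"] for r in records[1:]}
    return len(kinds) <= 1

# A directory log: metadata first, then only directory entries.
dir_log = [{"meta": {"type": "dir"}}, {"kind": "dentry"}, {"kind": "dentry"}]
# A malformed log that mixes record kinds after the metadata record.
mixed_log = [{"meta": {"type": "dir"}}, {"kind": "dentry"}, {"kind": "data"}]
print(valid_log(dir_log), valid_log(mixed_log))  # True False
```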
There are two high-level ways of providing support for `gdp://` paths:

- Compile as a standalone library that can be loaded at python runtime via `load_library`. This is easier to use, but requires any tensorflow code that needs `gdp://` support to be updated.
- Integrate with the tensorflow code and compile into the main tensorflow runtime. This requires a custom-built tensorflow package, but can run existing tensorflow code unmodified.
## Compiling as a loadable library
See a set of full instructions in shell-script form at the end. Run `make` (assuming GDP is installed properly). Then, at runtime, use the following:

```python
from tensorflow.python.framework import load_library
load_library.load_file_system_library("gdpfs_tf.so")
```
Any tensorflow operations that use, say, `file_io` should now be able to resolve paths that begin with `gdp://`.
## Compiling with tensorflow source
Note: this doesn't quite work yet; it is still a work in progress. The instructions below are outdated and kept only for future reference.
- First, install GDP on the machine where we are compiling
- Clone tensorflow; revert to git commit `35287be3bb7` (this is the commit that is known to work with the patch below).
- Apply the patch: `git apply <patch>`. Copy `extra/gdp.BUILD` to the tensorflow root.
- Follow instructions for compiling tensorflow. Briefly:

  ```shell
  bazel build --config=opt --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" //tensorflow/tools/pip_package:build_pip_package
  bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
  sudo pip install --upgrade /tmp/tensorflow_pkg/tensorflow-*.whl
  ```
## Testing whether the installation works
(Based on some instructions here)
After a successful installation (whether by pip package, or by local compilation and runtime instantiation), something like the following should work (with appropriate paths, and loading the library if using the loadable-library approach):

```
$ python
>>> from tensorflow.python.lib.io import file_io
>>> print file_io.stat('gdp://X/')  # this should also work with a correct path
```
## Command line utility
For ease of use, a command line utility `gdpfs_cmd` is also provided. At the very least, this utility allows for viewing the contents of a GDP filesystem, adding files from local disk to a given GDP filesystem, copying files from a GDP filesystem to the local disk, and removing files and directories. Run `./gdpfs_cmd` for usage instructions.
## Known issues

- The current code should not be relied upon to be bug-free. In fact, there are a number of security vulnerabilities present in the current code.
- For the moment, no guarantees are made regarding protection (confidentiality and/or integrity) of data. Please use this at your own risk.
- No signature/encryption keys are used at the moment. This will change at some point in the near future.
- Since signature keys are not implemented right now, the single-writer model is not really enforced. Please avoid architecting a solution that relies on multiple writers for a given filesystem.
- Recursive operations are not supported from `gdpfs_cmd`. This will change.
- Creating new files in the filesystem has a fixed overhead. Thus, copying a large number of very small files is very slow. This is partially because of the limited recursive functionality.
## References

- Tensorflow: extending file system
- Tensorflow `file_system.h`
- Tensorflow on S3
- Tensorflow compilation instructions
- Loadable extensions at runtime
- IPTF: IPFS with tensorflow
- Bazel: adding external dependencies
## Amazon EC2 installation instructions (Ubuntu 16.04)
Here's a list of steps that one can perform on a vanilla EC2 image:
```shell
#!/bin/bash
NUM_PROCS=`cat /proc/cpuinfo | grep "^processor" | wc -l`
NUM_JOBS=$((NUM_PROCS*2))

# sudo apt-get update && sudo apt-get dist-upgrade -y && sudo reboot

# Download protobuf
wget https://github.com/google/protobuf/releases/download/v3.5.1/protobuf-cpp-3.5.1.tar.gz
tar xvzf protobuf-cpp-3.5.1.tar.gz

## Install protobuf pre-reqs
sudo apt-get -y install autoconf automake libtool curl make g++ unzip
( cd protobuf-3.5.1 && \
  ./configure && \
  make -j $NUM_JOBS && \
  make -j $NUM_JOBS check && \
  sudo make install && \
  sudo ldconfig \
)

# Clone GDP
git clone git://repo.eecs.berkeley.edu/projects/swarmlab/gdp.git

# Install GDP dependencies.
(cd gdp/adm/ && ./gdp-setup.sh)

# Download protobuf-c-compiler; we need a new version
sudo apt-get -y --purge autoremove protobuf-c-compiler
git clone https://github.com/protobuf-c/protobuf-c.git
( cd protobuf-c && \
  ./autogen.sh && \
  ./configure && \
  make && \
  sudo make install \
)

# compile GDP
(cd gdp && make && sudo make install-client)

# configure gdp
( sudo mkdir /etc/ep_adm_params && \
  echo "swarm.gdp.zeroconf.enable=false" >> /tmp/gdp && \
  echo "swarm.gdp.routers=gdp-03.eecs.berkeley.edu:8009" >> /tmp/gdp && \
  echo "swarm.gdp.gdp-create.server=gdplogd.mor" >> /tmp/gdp && \
  sudo mv /tmp/gdp /etc/ep_adm_params \
)

# install tensorflow
sudo apt-get -y install python-pip
sudo pip install --upgrade pip
sudo pip install tensorflow

# Get the gdp-if (GDP interfaces) repository
git clone git://repo.eecs.berkeley.edu/projects/swarmlab/gdp-if.git

# compile
(cd gdp-if/tensorflow && make)

( echo "swarm.gdpfs.logprefix = edu.berkerley.tensorflow." >> /tmp/gdpfs && \
  echo "swarm.gdp.debug = gdp.*=5" >> /tmp/gdpfs && \
  echo "swarm.gdpfs.debug = gdpfs.*=19" >> /tmp/gdpfs && \
  sudo mv /tmp/gdpfs /etc/ep_adm_params \
)

##
echo "Include this in your LD path>>>>"
echo export LD_LIBRARY_PATH=`python -c "import tensorflow as tf; print tf.sysconfig.get_lib()"`:\$LD_LIBRARY_PATH
```