Tensorflow extensions to use GDP as a file system

This directory contains a set of tools, source files, and patches for TensorFlow. Please see the Custom Filesystem documentation for a high-level overview.

Note: This is experimental software. No guarantees are made regarding the protection of data. Please use it at your own risk. Specifically, do not use this to store sensitive data, or data that you cannot afford to lose.

Mapping files to GDP logs

In an ideal world, one could encapsulate an entire filesystem inside a single GDP log. This would ensure that all updates, even across different files, have a sequential ordering.

A slightly less than ideal, but much easier to implement and deploy, solution is to dedicate one log to each file. Directories are just a special type of file that contains directory entries (just like in any other file system). The current implementation follows this approach with one extra caveat: there is only one directory in a filesystem; all subdirectories are merely empty entries. For example, gdp://x/y/z/1/2/ is merely a filesystem rooted in a log x with empty entries y, y/z, y/z/1, and y/z/1/2.
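The naming scheme in the example above can be sketched in a few lines of Python. This only illustrates how a path decomposes into a root log and its empty directory entries; it is not the code in this repository, and the function name `split_gdp_path` is hypothetical.

```python
def split_gdp_path(path):
    """Split a gdp:// path into (root log name, list of directory entries).

    Illustrative only: the first path component names the log holding the
    single directory; every subdirectory prefix below it corresponds to one
    empty entry in that directory.
    """
    scheme = "gdp://"
    if not path.startswith(scheme):
        raise ValueError("not a gdp:// path: %r" % path)
    # Drop empty components produced by a trailing or doubled slash.
    parts = [p for p in path[len(scheme):].split("/") if p]
    if not parts:
        raise ValueError("missing root log name: %r" % path)
    root, rest = parts[0], parts[1:]
    # Each prefix of the remaining components is one empty entry.
    entries = ["/".join(rest[:i]) for i in range(1, len(rest) + 1)]
    return root, entries
```

With this sketch, `split_gdp_path("gdp://x/y/z/1/2/")` returns `("x", ["y", "y/z", "y/z/1", "y/z/1/2"])`, matching the example above.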

Encoding data

See GDPfs.proto. Each record is a message of type GDPfsMsg. Only the first message in a log has GDPfsMeta set; all other records have either GDPfsFchunk or GDPfsDentry set (depending on the type set in the first record). Mixing GDPfsFchunk and GDPfsDentry records in the same log is not allowed.
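The record-ordering rule can be expressed as a small check. The sketch below models a log as a list of `(kind, value)` pairs rather than real protobuf messages; the `"file"` and `"directory"` type labels and the function name are assumptions made for illustration, and the actual field names are defined in GDPfs.proto.

```python
# Stand-ins for the GDPfsMsg variants named above (not real protobuf types).
META, FCHUNK, DENTRY = "GDPfsMeta", "GDPfsFchunk", "GDPfsDentry"

# Assumed labels for the type stored in the first record of a log.
BODY_FOR_TYPE = {"file": FCHUNK, "directory": DENTRY}


def log_is_well_formed(records):
    """Check the layout rule for a log given as (kind, value) pairs in order.

    The first record must be metadata; every later record must be the single
    body kind implied by the type declared in that metadata.
    """
    if not records:
        return False
    first_kind, log_type = records[0]
    if first_kind != META or log_type not in BODY_FOR_TYPE:
        return False
    expected = BODY_FOR_TYPE[log_type]
    return all(kind == expected for kind, _ in records[1:])
```

For instance, a file log made of one metadata record followed only by chunks passes the check, while a log that contains both chunks and directory entries does not.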

Compiling

There are two high-level ways of providing support for gdp:// paths with tensorflow:

  • Compile as a standalone library that can be loaded at Python runtime using load_library. This is easier to use, but any tensorflow code that needs gdp:// support must be updated to load the library.
  • Integrate with the tensorflow source and compile into the main tensorflow runtime. This requires a custom-built tensorflow package, but existing tensorflow code can run unmodified.

Compiling as a loadable library

A full set of instructions, in shell script form, appears at the end.

Simply use make (assuming GDP is installed properly). Then, at runtime, use the following:

from tensorflow.python.framework import load_library
load_library.load_file_system_library("gdpfs_tf.so")

Any tensorflow operations that use, say, tf.gfile.GFile should now be able to resolve paths that begin with gdp://.
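A script that should still start when the library has not been built might guard the load step. The sketch below is one way to do that under stated assumptions: the helper name `load_gdpfs` and the file-existence check are not part of this repository, and the check expects a filesystem path to the shared object rather than a bare name resolved through the library search path.

```python
import os


def load_gdpfs(so_path="./gdpfs_tf.so"):
    """Register the gdp:// filesystem with TensorFlow if the library exists.

    Returns True when the library was loaded, False when so_path is missing.
    TensorFlow is imported lazily so the check itself has no dependencies.
    """
    if not os.path.isfile(so_path):
        return False
    from tensorflow.python.framework import load_library
    load_library.load_file_system_library(so_path)
    return True
```

Once `load_gdpfs()` returns True, opening a path such as `gdp://X/hello.txt` (a placeholder name) through tf.gfile.GFile should behave like opening a local file, subject to the limitations listed later in this document.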

Compiling with tensorflow source

Note: this doesn't quite work yet. Still a work in progress.

These instructions are outdated and are kept only for future reference.

  • First, install GDP on the machine where we are compiling
  • Clone tensorflow; revert to git commit 35287be3bb7 (this is the commit that is known to work with the patch below).
  • Apply patch from extra/gdp_config.patch by using git apply <patch>
  • Copy extra/gdp.BUILD to tensorflow root
  • Copy CAAPI to ~/tensorflow.gdp/CAAPI
  • Copy tf to tensorflow/tensorflow/core/platform/gdp
  • Follow instructions for compiling tensorflow. Briefly:
    • ./configure. Say -march=x86-64 for optimization
    • bazel build --config=opt --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" //tensorflow/tools/pip_package:build_pip_package
    • bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
    • sudo pip install --upgrade /tmp/tensorflow_pkg/tensorflow-*.whl

Testing whether the installation works

(Based on some instructions here)

After a successful installation (whether by pip package, or by local compilation and runtime instantiation), something like the following should work (with appropriate paths, and loading the .so file).

$ python
>>> from tensorflow.python.lib.io import file_io
>>> print(file_io.stat('gdp://X/'))  # replace X with the name of an existing GDP filesystem

Command line utility

For ease of use, a command line utility gdpfs_cmd is also provided. At the very least, this utility allows for viewing the contents of a GDP filesystem, adding files from local disk to a given GDP filesystem, copying files from a GDP filesystem to the local disk, and removing files and directories.

Please run ./gdpfs_cmd for usage instructions.

Limitations

  • The current code should not be assumed to be bug-free. In fact, it contains a number of known security vulnerabilities.
  • For the moment, no guarantees are made regarding protection (confidentiality and/or integrity) of data. Please use this at your own risk.
  • No signature/encryption keys are used at the moment. This will change at some point in the near future.
  • Since signature keys are not implemented right now, the single-writer model is not really enforced. Please avoid architecting a solution that relies on multiple writers for a given filesystem.
  • Recursive operations are not supported from gdpfs_cmd. This will change.
  • Creating new files in the filesystem has a fixed overhead. Thus, copying a large number of very small files is very slow. Partially, this is because of the limited recursive functionality.
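To see why a fixed per-file cost dominates for small files, consider a rough cost model: total time is the number of files times the per-file overhead, plus the bytes transferred divided by the throughput. The function and every number below are illustrative assumptions, not measurements of gdpfs_cmd.

```python
def estimated_copy_seconds(num_files, total_bytes, per_file_overhead_s, bytes_per_s):
    """Rough model: a fixed cost per file plus the time to move the bytes."""
    return num_files * per_file_overhead_s + total_bytes / float(bytes_per_s)


# Same 100 MB payload; assumed 0.5 s creation overhead and 10 MB/s throughput.
one_big = estimated_copy_seconds(1, 100 * 10**6, 0.5, 10 * 10**6)
many_small = estimated_copy_seconds(10000, 100 * 10**6, 0.5, 10 * 10**6)
```

Under these made-up numbers, the single large file takes about 10.5 seconds, while the same data split into 10,000 small files takes about 5,010 seconds, almost all of it per-file overhead.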

Miscellaneous

Also see:

Amazon EC2 installation instructions (Ubuntu 16.04)

Here's a list of steps that one can perform on a vanilla EC2 image:

#!/bin/bash

NUM_PROCS=`cat /proc/cpuinfo  | grep "^processor" | wc -l`
NUM_JOBS=$((NUM_PROCS*2))

# sudo apt-get update && sudo apt-get dist-upgrade -y && sudo reboot

# Download protobuf
wget https://github.com/google/protobuf/releases/download/v3.5.1/protobuf-cpp-3.5.1.tar.gz
tar xvzf protobuf-cpp-3.5.1.tar.gz

## Install protobuf pre-reqs
sudo apt-get -y install autoconf automake libtool curl make g++ unzip

( cd protobuf-3.5.1 && \
    ./configure && \
    make -j $NUM_JOBS && \
    make -j $NUM_JOBS check && \
    sudo make install && \
    sudo ldconfig \
)

# Clone GDP
git clone git://repo.eecs.berkeley.edu/projects/swarmlab/gdp.git

# Install GDP dependencies.
(cd gdp/adm/ && ./gdp-setup.sh)
# Download protobuf-c-compiler; we need a new version
sudo apt-get -y --purge autoremove protobuf-c-compiler
git clone https://github.com/protobuf-c/protobuf-c.git
( cd protobuf-c && \
    ./autogen.sh && \
    ./configure && \
    make && \
    sudo make install \
)

# compile GDP
(cd gdp && make && sudo make install-client)

# configure gdp
( sudo mkdir /etc/ep_adm_params && \
  echo "swarm.gdp.zeroconf.enable=false" >> /tmp/gdp && \
  echo "swarm.gdp.routers=gdp-03.eecs.berkeley.edu:8009" >> /tmp/gdp && \
  echo "swarm.gdp.gdp-create.server=gdplogd.mor" >> /tmp/gdp && \
  sudo mv /tmp/gdp /etc/ep_adm_params \
)

# install tensorflow
sudo apt-get -y install python-pip
sudo pip install --upgrade pip
sudo pip install tensorflow

# Get the gdp-if (GDP interfaces) repository
git clone git://repo.eecs.berkeley.edu/projects/swarmlab/gdp-if.git

# compile
(cd gdp-if/tensorflow && make)

( echo "swarm.gdpfs.logprefix = edu.berkerley.tensorflow." >> /tmp/gdpfs && \
  echo "swarm.gdp.debug = gdp.*=5" >> /tmp/gdpfs && \
  echo "swarm.gdpfs.debug = gdpfs.*=19" >> /tmp/gdpfs && \
  sudo mv /tmp/gdpfs /etc/ep_adm_params \
)

##
echo "Include this in your LD path>>>>"
echo export LD_LIBRARY_PATH=`python -c "import tensorflow as tf; print(tf.sysconfig.get_lib())"`:\$LD_LIBRARY_PATH