# Tensorflow extensions to use GDP as a file system

This directory contains a set of tools, source files, and patches to [TensorFlow](https://www.tensorflow.org). Please see the [Custom Filesystem](https://www.tensorflow.org/extend/add_filesys) documentation for a high-level overview.

**Note: This is experimental software. No guarantees are made regarding the protection of data. Please use it at your own risk. Specifically, do not use this to store sensitive data, or data that you cannot afford to lose.**

# Mapping files to GDP logs

In an ideal world, one could encapsulate an entire filesystem inside a single GDP log. This ensures that all updates, even across different files, have a sequential ordering. A slightly less than ideal, but much easier to implement/deploy, solution is to dedicate one log to each file. Directories are just a special type of file that contains directory entries (just like in any other file system). The current implementation follows this approach with one extra caveat: there is only one directory in a filesystem; all subdirectories are merely empty entries. For example, `gdp://x/y/z/1/2/` is a filesystem rooted in a log `x` with empty entries `y`, `y/z`, `y/z/1`, and `y/z/1/2`.

## Encoding data

See `GDPfs.proto`. Each record is a message of type `GDPfsMsg`. Only the first message in a log has `GDPfsMeta` set; all other records have either `GDPfsFchunk` or `GDPfsDentry` set (depending on the type set in the first record). No mixing and matching of `GDPfsFchunk` and `GDPfsDentry` is allowed in the same log.

# Compiling

There are two high-level ways of providing support for `gdp://` paths in tensorflow:

- Compile as a standalone library that can be loaded at python runtime using `load_library`. This is easier to use, but requires any tensorflow code that needs `gdp://` support to be updated.
- Integrate with the tensorflow source and compile it into the main tensorflow runtime. This requires a custom-built tensorflow package, but can run existing tensorflow code unmodified.

## Compiling as a loadable library

*See a full set of instructions in shell-script form at the end.*

Simply use `make` (assuming GDP is installed properly). Then, at runtime, use the following:

```
from tensorflow.python.framework import load_library
load_library.load_file_system_library("gdpfs_tf.so")
```

Any tensorflow operations that use, say, `tf.gfile.GFile` should now be able to resolve paths that begin with `gdp://`.

## Compiling with tensorflow source

**Note: this doesn't quite work yet. Still a work in progress.**

*The instructions are outdated; kept only for future reference.*

- First, install GDP on the machine where we are compiling.
- Clone tensorflow; revert to git commit `35287be3bb7` (this is the commit that is known to work with the patch below).
- Apply the patch from `extra/gdp_config.patch` by using `git apply`.
- Copy `extra/gdp.BUILD` to the tensorflow root.
- Copy `CAAPI` to `~/tensorflow.gdp/CAAPI`.
- Copy `tf` to `tensorflow/tensorflow/core/platform/gdp`.
- Follow the instructions for compiling tensorflow. Briefly:
  - `./configure`. Say `-march=x86-64` for optimization.
  - `bazel build --config=opt --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" //tensorflow/tools/pip_package:build_pip_package`
  - `bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg`
  - `sudo pip install --upgrade /tmp/tensorflow_pkg/tensorflow-*.whl`
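## Example usage

Whichever of the two routes above is used, ordinary `tf.gfile` calls should then be able to operate on GDP-backed paths. The snippet below is a minimal sketch rather than part of the shipped code: the root log `X` and the file name are placeholders, writing assumes the write path is enabled in the current implementation, and the `load_file_system_library` call is only needed for the loadable-library build.

```
from tensorflow.python.framework import load_library
import tensorflow as tf

# Only needed when GDP support is built as a loadable library; skip this
# if it is compiled into the tensorflow runtime itself.
load_library.load_file_system_library("gdpfs_tf.so")

# 'X' is a placeholder for the root log of an existing GDP filesystem.
path = "gdp://X/hello.txt"

with tf.gfile.GFile(path, "w") as f:
    f.write("hello from tensorflow\n")

with tf.gfile.GFile(path, "r") as f:
    print(f.read())

# Directory entries are stored as GDPfsDentry records in the root log.
print(tf.gfile.ListDirectory("gdp://X/"))
```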
## Testing whether the installation works

(Based on the instructions [here](https://www.tensorflow.org/versions/master/deploy/s3#smoke_test).)

After a successful installation (whether by pip package, or by local compilation and runtime instantiation), something like the following should work (with appropriate paths, and loading the `.so` file):

```
$ python
>>> from tensorflow.python.lib.io import file_io
>>> print file_io.stat('gdp://X/')  # this should also work with a correct path
```

# Command line utility

For ease of use, a command line utility `gdpfs_cmd` is also provided. At the very least, this utility allows for viewing the contents of a GDP filesystem, adding files from local disk to a given GDP filesystem, copying files from a GDP filesystem to the local disk, and removing files and directories. Please run `./gdpfs_cmd` for usage instructions.

# Limitations

- The current code should not be relied upon to be bug-free. In fact, there are a number of security vulnerabilities present in the current code.
- For the moment, no guarantees are made regarding protection (confidentiality and/or integrity) of data. Please use this at your own risk.
- No signature/encryption keys are used at the moment. This will change at some point in the near future.
- Since signature keys are not implemented right now, the single-writer model is not really enforced. Please avoid architecting a solution that relies on multiple writers for a given filesystem.
- Recursive operations are not supported from `gdpfs_cmd`. This will change.
- Creating new files in the filesystem has a fixed overhead. Thus, copying a large number of very small files is very slow. This is partially because of the limited recursive functionality.

# Miscellaneous

Also see:

- [Tensorflow: extending file system](https://www.tensorflow.org/extend/add_filesys)
- [Tensorflow file_system.h](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/platform/file_system.h)
- [Tensorflow on S3](https://www.tensorflow.org/versions/master/deploy/s3#smoke_test)
- [Tensorflow compilation instructions](https://www.tensorflow.org/install/install_sources)
- [Loadable extensions at runtime](https://www.tensorflow.org/extend/adding_an_op#compile_the_op_using_your_system_compiler_tensorflow_binary_installation)
- [IPTF: IPFS with tensorflow](https://github.com/tesserai/iptf)
- [Bazel: adding external dependencies](https://docs.bazel.build/versions/master/cpp-use-cases.html#adding-dependencies-on-precompiled-libraries)

# Amazon EC2 installation instructions (Ubuntu 16.04)

Here's a list of steps that one can perform on a vanilla EC2 image:

```
#!/bin/bash

NUM_PROCS=`cat /proc/cpuinfo | grep "^processor" | wc -l`
NUM_JOBS=$((NUM_PROCS*2))

# sudo apt-get update && sudo apt-get dist-upgrade -y && sudo reboot

# Download protobuf
wget https://github.com/google/protobuf/releases/download/v3.5.1/protobuf-cpp-3.5.1.tar.gz
tar xvzf protobuf-cpp-3.5.1.tar.gz

## Install protobuf pre-reqs
sudo apt-get -y install autoconf automake libtool curl make g++ unzip

( cd protobuf-3.5.1 && \
    ./configure && \
    make -j $NUM_JOBS && \
    make -j $NUM_JOBS check && \
    sudo make install && \
    sudo ldconfig \
)

# Clone GDP
git clone git://repo.eecs.berkeley.edu/projects/swarmlab/gdp.git

# Install GDP dependencies.
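# (gdp-setup.sh inspects the OS and pulls in GDP's build dependencies;
#  on Ubuntu 16.04 these are ordinary apt packages.)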
(cd gdp/adm/ && ./gdp-setup.sh)

# Download protobuf-c-compiler; we need a new version
sudo apt-get -y --purge autoremove protobuf-c-compiler
git clone https://github.com/protobuf-c/protobuf-c.git
( cd protobuf-c && \
    ./autogen.sh && \
    ./configure && \
    make && \
    sudo make install \
)

# compile GDP
(cd gdp && make && sudo make install-client)

# configure gdp
( sudo mkdir /etc/ep_adm_params && \
    echo "swarm.gdp.zeroconf.enable=false" >> /tmp/gdp && \
    echo "swarm.gdp.routers=gdp-03.eecs.berkeley.edu:8009" >> /tmp/gdp && \
    echo "swarm.gdp.gdp-create.server=gdplogd.mor" >> /tmp/gdp && \
    sudo mv /tmp/gdp /etc/ep_adm_params \
)

# install tensorflow
sudo apt-get -y install python-pip
sudo pip install --upgrade pip
sudo pip install tensorflow

# Get the gdp-if (GDP interfaces) repository
git clone git://repo.eecs.berkeley.edu/projects/swarmlab/gdp-if.git

# compile
(cd gdp-if/tensorflow && make)

( echo "swarm.gdpfs.logprefix = edu.berkerley.tensorflow." >> /tmp/gdpfs && \
    echo "swarm.gdp.debug = gdp.*=5" >> /tmp/gdpfs && \
    echo "swarm.gdpfs.debug = gdpfs.*=19" >> /tmp/gdpfs && \
    sudo mv /tmp/gdpfs /etc/ep_adm_params \
)

##
echo "Include this in your LD path>>>>"
echo export LD_LIBRARY_PATH=`python -c "import tensorflow as tf; print tf.sysconfig.get_lib()"`:\$LD_LIBRARY_PATH
```
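Once the script has finished (and with `LD_LIBRARY_PATH` extended as its last line suggests), existing tensorflow code can point its output at `gdp://` paths. The sketch below writes TensorBoard event files through the GDP filesystem; it is not part of the shipped code, the library path assumes the loadable library was built in `gdp-if/tensorflow` as above, and the root log `X` is a placeholder for an existing filesystem.

```
import tensorflow as tf
from tensorflow.python.framework import load_library

# Path is an assumption: the library built by `make` in gdp-if/tensorflow.
load_library.load_file_system_library("gdp-if/tensorflow/gdpfs_tf.so")

# 'X' is a placeholder for the root log of an existing GDP filesystem.
logdir = "gdp://X/train"

with tf.Graph().as_default():
    loss = tf.constant(0.5, name="loss")
    tf.summary.scalar("loss", loss)
    summary_op = tf.summary.merge_all()
    with tf.Session() as sess:
        # Event files land in the GDP-backed directory instead of local disk.
        writer = tf.summary.FileWriter(logdir, sess.graph)
        writer.add_summary(sess.run(summary_op), global_step=0)
        writer.close()
```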