diff --git a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/DeveloperGuide.md b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/DeveloperGuide.md index 76e3ae07a4..9ab0641235 100644 --- a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/DeveloperGuide.md +++ b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/DeveloperGuide.md @@ -14,7 +14,7 @@ # Developer Guide -By default, submarine uses YARN service framework as runtime. If you want to add your own implementation. You can add a new `RuntimeFactory` implementation and configure following option to `submarine.xml` (which should be placed under same `$HADOOP_CONF_DIR`) +By default, Submarine uses the YARN service framework as its runtime. If you want to add your own implementation, you can add a new `RuntimeFactory` implementation and configure the following option in `submarine.xml` (which should be placed under the same `$HADOOP_CONF_DIR`) ``` diff --git a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/Examples.md b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/Examples.md index 3e7f02feb1..d878adde25 100644 --- a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/Examples.md +++ b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/Examples.md @@ -18,4 +18,6 @@ Here're some examples about Submarine usage. 
[Running Distributed CIFAR 10 Tensorflow Job](RunningDistributedCifar10TFJobs.html) +[Running Standalone CIFAR 10 PyTorch Job](RunningSingleNodeCifar10PTJobs.html) + [Running Zeppelin Notebook on YARN](RunningZeppelinOnYARN.html) \ No newline at end of file diff --git a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/Index.md b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/Index.md index baeaa15da2..f8556a6c10 100644 --- a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/Index.md +++ b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/Index.md @@ -22,6 +22,8 @@ Goals of Submarine: - Support run distributed Tensorflow jobs with simple configs. +- Support run standalone PyTorch jobs with simple configs. + - Support run user-specified Docker images. - Support specify GPU and other resources. @@ -37,7 +39,9 @@ Click below contents if you want to understand more. - [Examples](Examples.html) -- [How to write Dockerfile for Submarine jobs](WriteDockerfile.html) +- [How to write Dockerfile for Submarine TensorFlow jobs](WriteDockerfileTF.html) + +- [How to write Dockerfile for Submarine PyTorch jobs](WriteDockerfilePT.html) - [Developer guide](DeveloperGuide.html) diff --git a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/InstallationGuide.md b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/InstallationGuide.md index 1c7812ba8e..e73887eb72 100644 --- a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/InstallationGuide.md +++ b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/InstallationGuide.md @@ -304,7 +304,7 @@ https://github.com/NVIDIA/nvidia-docker ### Tensorflow Image -There is no need to install CUDNN and CUDA on the servers, because CUDNN and CUDA can be added in the docker images. we can get basic docker images by referring to WriteDockerfile.md. +There is no need to install CUDNN and CUDA on the servers, because CUDNN and CUDA can be added in the docker images. 
We can get basic docker images by referring to [Write Dockerfile](WriteDockerfileTF.html). ### Test tensorflow in a docker container diff --git a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/InstallationGuideChineseVersion.md b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/InstallationGuideChineseVersion.md index ba996e8d21..7667c1c10b 100644 --- a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/InstallationGuideChineseVersion.md +++ b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/InstallationGuideChineseVersion.md @@ -293,7 +293,7 @@ https://github.com/NVIDIA/nvidia-docker ### Tensorflow Image -CUDNN 和 CUDA 其实不需要在物理机上安装,因为 Sumbmarine 中提供了已经包含了CUDNN 和 CUDA 的镜像文件,基础的Dockfile可参见WriteDockerfile.md +CUDNN 和 CUDA 其实不需要在物理机上安装,因为 Submarine 中提供了已经包含了CUDNN 和 CUDA 的镜像文件,基础的Dockfile可参见WriteDockerfileTF.md ### 测试 TF 环境 diff --git a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/QuickStart.md b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/QuickStart.md index 3b68f51239..5648c11e64 100644 --- a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/QuickStart.md +++ b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/QuickStart.md @@ -24,15 +24,18 @@ Optional: - Enable YARN DNS. (When yarn service runtime is required.) - Enable GPU on YARN support. (When GPU-based training is required.) -- Docker images for submarine jobs. (When docker container is required.) +- Docker images for Submarine jobs. (When docker container is required.) ``` # Get prebuilt docker images (No liability) docker pull hadoopsubmarine/tf-1.13.1-gpu:0.0.1 # Or build your own docker images docker build . 
-f Dockerfile.gpu.tf_1.13.1 -t tf-1.13.1-gpu-base:0.0.1 ``` -More details, please refer to -[How to write Dockerfile for Submarine jobs](WriteDockerfile.html) +For more details, please refer to: + +- [How to write Dockerfile for Submarine TensorFlow jobs](WriteDockerfileTF.html) + +- [How to write Dockerfile for Submarine PyTorch jobs](WriteDockerfilePT.html) ## Run jobs @@ -120,7 +123,7 @@ reported from `entry_script.py`. ### Submarine Configuration -For submarine internal configuration, please create a `submarine.xml` which should be placed under `$HADOOP_CONF_DIR`. +For Submarine internal configuration, please create a `submarine.xml` which should be placed under `$HADOOP_CONF_DIR`. |Configuration Name | Description | |:---- |:---- | @@ -235,7 +238,7 @@ Or you can use `yarn logs -applicationId ` to get logs from CLI ## Build from source code -If you want to build submarine project by yourself, you can follow the steps: +If you want to build the Submarine project by yourself, you can follow the steps: - Run 'mvn install -DskipTests' from Hadoop source top level once. 
diff --git a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/RunningDistributedCifar10TFJobs.md b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/RunningDistributedCifar10TFJobs.md index 7da98d55cf..c0cf088774 100644 --- a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/RunningDistributedCifar10TFJobs.md +++ b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/RunningDistributedCifar10TFJobs.md @@ -39,9 +39,9 @@ python generate_cifar10_tfrecords.py --data-dir=cifar-10-data hadoop fs -put cifar-10-data/ /dataset/cifar-10-data ``` -**Please note that:** +**Warning:** -YARN service doesn't allow multiple services with the same name, so please run following command +Please note that YARN service doesn't allow multiple services with the same name, so please run following command ``` yarn application -destroy ``` @@ -49,7 +49,7 @@ to delete services if you want to reuse the same service name. ## Prepare Docker images -Refer to [Write Dockerfile](WriteDockerfile.md) to build a Docker image or use prebuilt one. +Refer to [Write Dockerfile](WriteDockerfileTF.html) to build a Docker image or use prebuilt one. ## Run Tensorflow jobs @@ -92,6 +92,8 @@ Explanations: - `>1` num_workers indicates it is a distributed training. - Parameters / resources / Docker image of parameter server can be specified separately. For many cases, parameter server doesn't require GPU. +For the meaning of the individual parameters, see the [QuickStart](QuickStart.html) page! 
+ *Outputs of distributed training* Sample output of master: diff --git a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/RunningSingleNodeCifar10PTJobs.md b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/RunningSingleNodeCifar10PTJobs.md new file mode 100644 index 0000000000..ca77c829ac --- /dev/null +++ b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/RunningSingleNodeCifar10PTJobs.md @@ -0,0 +1,62 @@ + +# Tutorial: Running a standalone Cifar10 PyTorch Estimator Example. + +Currently, PyTorch integration with Submarine only supports PyTorch in standalone (non-distributed) mode. +Please also note that HDFS as a data source is not yet supported by PyTorch. + +## What is CIFAR-10? +CIFAR-10 is a common benchmark in machine learning for image recognition. The example below is based on the CIFAR-10 dataset. + +**Warning:** + +Please note that YARN service doesn't allow multiple services with the same name, so please run the following command +``` +yarn application -destroy +``` +to delete services if you want to reuse the same service name. + +## Prepare Docker images + +Refer to [Write Dockerfile](WriteDockerfilePT.html) to build a Docker image or use a prebuilt one. 
+ +## Running PyTorch jobs + +### Run standalone training + +``` +export HADOOP_CLASSPATH="/home/systest/hadoop-submarine-score-yarnservice-runtime-0.2.0-SNAPSHOT.jar:/home/systest/hadoop-submarine-core-0.2.0-SNAPSHOT.jar" +/opt/hadoop/bin/yarn jar /home/systest/hadoop-submarine-core-0.2.0-SNAPSHOT.jar job run \ +--name pytorch-job-001 \ +--verbose \ +--framework pytorch \ +--wait_job_finish \ +--docker_image pytorch-latest-gpu:0.0.1 \ +--input_path hdfs://unused \ +--env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre \ +--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.2 \ +--env YARN_CONTAINER_RUNTIME_DOCKER_DELAYED_REMOVAL=true \ +--num_workers 1 \ +--worker_resources memory=5G,vcores=2 \ +--worker_launch_cmd "cd /test/ && python cifar10_tutorial.py" + +``` + +For the meaning of the individual parameters, see the [QuickStart](QuickStart.html) page! + +**Remarks:** +Please note that the input path parameter is mandatory, but not yet used by the PyTorch docker container. \ No newline at end of file diff --git a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/WriteDockerfilePT.md b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/WriteDockerfilePT.md new file mode 100644 index 0000000000..84ca479978 --- /dev/null +++ b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/WriteDockerfilePT.md @@ -0,0 +1,114 @@ + + +# Creating Docker Images for Running PyTorch on YARN + +## How to create docker images to run PyTorch on YARN + +Dockerfile to run PyTorch on YARN needs two parts: + +**Base libraries which PyTorch depends on** + +1) OS base image, for example ```ubuntu:16.04``` + +2) PyTorch dependent libraries and packages. For example ```python```, ```scipy```. For GPU support, you also need ```cuda```, ```cudnn```, etc. + +3) PyTorch package. 
+ +**Libraries to access HDFS** + +1) JDK + +2) Hadoop + +Here's an example of a base image (with GPU support) to install PyTorch: +``` +FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04 +ARG PYTHON_VERSION=3.6 +RUN apt-get update && apt-get install -y --no-install-recommends \ + build-essential \ + cmake \ + git \ + curl \ + vim \ + ca-certificates \ + libjpeg-dev \ + libpng-dev \ + wget &&\ + rm -rf /var/lib/apt/lists/* + + +RUN curl -o ~/miniconda.sh -O https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \ + chmod +x ~/miniconda.sh && \ + ~/miniconda.sh -b -p /opt/conda && \ + rm ~/miniconda.sh && \ + /opt/conda/bin/conda install -y python=$PYTHON_VERSION numpy pyyaml scipy ipython mkl mkl-include cython typing && \ + /opt/conda/bin/conda install -y -c pytorch magma-cuda100 && \ + /opt/conda/bin/conda clean -ya +ENV PATH /opt/conda/bin:$PATH +RUN pip install ninja +# This must be done before pip so that requirements.txt is available +WORKDIR /opt/pytorch +RUN git clone https://github.com/pytorch/pytorch.git +WORKDIR pytorch +RUN git submodule update --init +RUN TORCH_CUDA_ARCH_LIST="3.5 5.2 6.0 6.1 7.0+PTX" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \ + CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \ + pip install -v . + +WORKDIR /opt/pytorch +RUN git clone https://github.com/pytorch/vision.git && cd vision && pip install -v . + +``` + +On top of above image, add files, install packages to access HDFS +``` +RUN apt-get update && apt-get install -y openjdk-8-jdk wget +# Install hadoop +ENV HADOOP_VERSION="3.1.2" +RUN wget http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz +RUN tar zxf hadoop-${HADOOP_VERSION}.tar.gz +RUN ln -s hadoop-${HADOOP_VERSION} hadoop-current +RUN rm hadoop-${HADOOP_VERSION}.tar.gz +``` + +Build and push to your own docker registry: Use ```docker build ... ``` and ```docker push ...``` to finish this step. 
+ +## Use examples to build your own PyTorch docker images + +We provide some example Dockerfiles for you to build your own PyTorch docker images. + +For the latest PyTorch + +- *docker/pytorch/base/ubuntu-16.04/Dockerfile.gpu.pytorch_latest*: Latest PyTorch that supports GPU, which is prebuilt to CUDA10. +- *docker/pytorch/with-cifar10-models/ubuntu-16.04/Dockerfile.gpu.pytorch_latest*: Latest PyTorch that supports GPU, which is prebuilt to CUDA10, with models. + +## Build Docker images + +### Manually build Docker image: + +Under the `docker/pytorch` directory, run `build-all.sh` to build all Docker images. This command will build the following Docker images: + +- `pytorch-latest-gpu-base:0.0.1` for the base Docker image, which includes Hadoop, PyTorch, and GPU base libraries. +- `pytorch-latest-gpu:0.0.1` which includes the cifar10 model as well + +### Use prebuilt images + +(No liability) +You can also use prebuilt images for convenience: + +- hadoopsubmarine/pytorch-latest-gpu-base:0.0.1 diff --git a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/WriteDockerfile.md b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/WriteDockerfileTF.md similarity index 87% rename from hadoop-submarine/hadoop-submarine-core/src/site/markdown/WriteDockerfile.md rename to hadoop-submarine/hadoop-submarine-core/src/site/markdown/WriteDockerfileTF.md index 0d4c6c1fda..5dc565d068 100644 --- a/hadoop-submarine/hadoop-submarine-core/src/site/markdown/WriteDockerfile.md +++ b/hadoop-submarine/hadoop-submarine-core/src/site/markdown/WriteDockerfileTF.md @@ -98,10 +98,10 @@ We provided following examples for you to build tensorflow docker images. For Tensorflow 1.13.1 (Precompiled to CUDA 10.x) -- *docker/base/ubuntu-16.04/Dockerfile.cpu.tf_1.13.1*: Tensorflow 1.13.1 supports CPU only. 
-- *docker/with-cifar10-models/ubuntu-16.04/Dockerfile.cpu.tf_1.13.1*: Tensorflow 1.13.1 supports CPU only, and included models -- *docker/base/ubuntu-16.04/Dockerfile.gpu.tf_1.13.1*: Tensorflow 1.13.1 supports GPU, which is prebuilt to CUDA10. -- *docker/with-cifar10-models/ubuntu-16.04/Dockerfile.gpu.tf_1.13.1*: Tensorflow 1.13.1 supports GPU, which is prebuilt to CUDA10, with models. +- *docker/tensorflow/base/ubuntu-16.04/Dockerfile.cpu.tf_1.13.1*: Tensorflow 1.13.1 supports CPU only. +- *docker/tensorflow/with-cifar10-models/ubuntu-16.04/Dockerfile.cpu.tf_1.13.1*: Tensorflow 1.13.1 supports CPU only, and included models +- *docker/tensorflow/base/ubuntu-16.04/Dockerfile.gpu.tf_1.13.1*: Tensorflow 1.13.1 supports GPU, which is prebuilt to CUDA10. +- *docker/tensorflow/with-cifar10-models/ubuntu-16.04/Dockerfile.gpu.tf_1.13.1*: Tensorflow 1.13.1 supports GPU, which is prebuilt to CUDA10, with models. ## Build Docker images