HADOOP-16670. Stripping Submarine code from Hadoop codebase. Contributed by Zhankun Tang.

Reviewed-by: Akira Ajisaka <aajisaka@apache.org>
Signed-off-by: Wei-Chiu Chuang <weichiu@apache.org>
Zhankun Tang 2020-01-21 20:06:53 -08:00 committed by Wei-Chiu Chuang
parent b4870bce3a
commit d40d7cc4f9
229 changed files with 0 additions and 23587 deletions


@ -1,56 +0,0 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<assembly xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.3"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.3 http://maven.apache.org/xsd/assembly-1.1.3.xsd">
<id>hadoop-src</id>
<formats>
<format>tar.gz</format>
</formats>
<includeBaseDirectory>true</includeBaseDirectory>
<fileSets>
<fileSet>
<directory>.</directory>
<includes>
<include>LICENSE.txt</include>
<include>README.txt</include>
<include>NOTICE.txt</include>
</includes>
</fileSet>
<fileSet>
<directory>.</directory>
<useDefaultExcludes>true</useDefaultExcludes>
<excludes>
<exclude>.git/**</exclude>
<exclude>**/.gitignore</exclude>
<exclude>**/.svn</exclude>
<exclude>**/*.iws</exclude>
<exclude>**/*.ipr</exclude>
<exclude>**/*.iml</exclude>
<exclude>**/.classpath</exclude>
<exclude>**/.project</exclude>
<exclude>**/.settings</exclude>
<exclude>**/target/**</exclude>
<!-- until the code that does this is fixed -->
<exclude>**/*.log</exclude>
<exclude>**/build/**</exclude>
<exclude>**/file:/**</exclude>
<exclude>**/SecurityAuth.audit*</exclude>
</excludes>
</fileSet>
</fileSets>
</assembly>


@ -56,7 +56,6 @@
<exclude>**/build/**</exclude>
<exclude>**/file:/**</exclude>
<exclude>**/SecurityAuth.audit*</exclude>
<exclude>hadoop-submarine/**</exclude>
</excludes>
</fileSet>
</fileSets>


@ -430,4 +430,3 @@ please contact the developer mailing list for the relevant component(s):
* [hdfs-dev](mailto:hdfs-dev@hadoop.apache.org)
* [mapreduce-dev](mailto:mapreduce-dev@hadoop.apache.org)
* [yarn-dev](mailto:yarn-dev@hadoop.apache.org)
* [submarine-dev](mailto:submarine-dev@hadoop.apache.org)


@ -175,12 +175,6 @@
<item name="System Services" href="hadoop-yarn/hadoop-yarn-site/yarn-service/SystemServices.html"/>
</menu>
<menu name="Submarine" inherit="top">
<item name="Index" href="hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/Index.html"/>
<item name="QuickStart" href="hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/QuickStart.html"/>
<item name="Examples" href="hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/Examples.html"/>
</menu>
<menu name="Hadoop Compatible File Systems" inherit="top">
<item name="Aliyun OSS" href="hadoop-aliyun/tools/hadoop-aliyun/index.html"/>
<item name="Amazon S3" href="hadoop-aws/tools/hadoop-aws/index.html"/>


@ -1,24 +0,0 @@
#!/usr/bin/env bash
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
mkdir -p target
rm -f target/rat-aggregated.txt
mvn apache-rat:check
grep -r --include=rat.txt -E "\!\?\?\?\?" ./* | tee ./target/rat-aggregated.txt
if [ "$(cat target/rat-aggregated.txt)" ]; then
  echo "Failed to pass apache rat check!"
  exit 1
fi


@ -1,183 +0,0 @@
<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
https://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<artifactId>hadoop-submarine</artifactId>
<groupId>org.apache.hadoop</groupId>
<version>0.3.0-SNAPSHOT</version>
</parent>
<artifactId>${project.artifactId}</artifactId>
<version>${project.version}</version>
<name>Hadoop Submarine All</name>
<properties>
<!-- Needed for generating FindBugs warnings using parent pom -->
<yarn.basedir>${project.parent.parent.basedir}</yarn.basedir>
<project.artifactId>hadoop-submarine-all</project.artifactId>
<project.version>0.3.0-SNAPSHOT</project.version>
</properties>
<dependencies>
<!-- Dependencies for Hadoop commons -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs</artifactId>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-submarine-core</artifactId>
<version>${project.version}</version>
</dependency>
</dependencies>
<profiles>
<profile>
<id>hadoop-3.2</id>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-submarine-yarnservice-runtime</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-submarine-tony-runtime</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
</dependencies>
</profile>
<!-- Default profile-->
<profile>
<id>hadoop-3.1</id>
<activation>
<activeByDefault>true</activeByDefault>
</activation>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-submarine-yarnservice-runtime</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-submarine-tony-runtime</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
</dependencies>
</profile>
<profile>
<id>hadoop-2.9</id>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-submarine-tony-runtime</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-hdfs-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
</dependencies>
</profile>
<profile>
<id>hadoop-2.7</id>
<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-submarine-tony-runtime</artifactId>
<version>${project.version}</version>
</dependency>
</dependencies>
</profile>
</profiles>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.2.1</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<!--
<shadedArtifactAttached>true</shadedArtifactAttached>
<shadedClassifierName>with-all-dependencies</shadedClassifierName>
-->
<outputFile>target/${project.artifactId}-${project.version}-${project.activeProfiles[0].id}.jar</outputFile>
<artifactSet>
<excludes>
<exclude>classworlds:classworlds</exclude>
<exclude>junit:junit</exclude>
<exclude>jmock:*</exclude>
<exclude>*:xml-apis</exclude>
<exclude>org.apache.maven:lib:tests</exclude>
</excludes>
</artifactSet>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>org.apache.hadoop.yarn.submarine.client.cli.Cli</mainClass>
</transformer>
<transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
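For reference, building this all-in-one module against one of the profiles above looked roughly like the sketch below; the profile id and the shaded-jar name follow the `<profiles>` and `<outputFile>` settings in the pom, but the exact invocation may differ in your environment.

```bash
# Build the shaded "all" jar against the Hadoop 3.2 profile (tests skipped for speed).
# Per the shade plugin's <outputFile> template above, the result is written to
#   target/hadoop-submarine-all-0.3.0-SNAPSHOT-hadoop-3.2.jar
mvn clean package -DskipTests -Phadoop-3.2
```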


@ -1,54 +0,0 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
# Overview
```
_ _
| | (_)
___ _ _ | |__ _ __ ___ __ _ _ __ _ _ __ ___
/ __|| | | || '_ \ | '_ ` _ \ / _` || '__|| || '_ \ / _ \
\__ \| |_| || |_) || | | | | || (_| || | | || | | || __/
|___/ \__,_||_.__/ |_| |_| |_| \__,_||_| |_||_| |_| \___|
?
~~~~~~~~~~~~~~~~~~~~~~~~~~~|^"~~~~~~~~~~~~~~~~~~~~~~~~~o~~~~~~~~~~~
o | o __o
o | o |X__>
___o | __o
(X___>-- __|__ |X__> o
| \ __o
| \ |X__>
_______________________|_______\________________
< \____________ _
\ \ (_)
\ O O O >=)
\__________________________________________________________/ (_)
```
Submarine is a project which allows infrastructure engineers and data scientists to run
*unmodified* TensorFlow or PyTorch programs on YARN or Kubernetes.

Goals of Submarine:
- Allow jobs to easily access data/models in HDFS and other storage systems.
- Launch services to serve TensorFlow/PyTorch models.
- Support running distributed TensorFlow jobs with simple configs.
- Support running user-specified Docker images.
- Support specifying GPUs and other resources.
- Support launching TensorBoard for training jobs if the user requests it.
- Support customized DNS names for roles (like tensorboard.$user.$domain:6006).

Please jump to the [QuickStart](src/site/markdown/QuickStart.md) guide to quickly understand how to use this framework.

Please jump to [Examples](src/site/markdown/Examples.md) to try other examples, like running distributed TensorFlow training for CIFAR-10.
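As an illustration of the "simple configs" goal above, submitting a training job looked roughly like the sketch below. The jar name and flag names follow the (now removed) QuickStart guide, while the image name, HDFS paths, resources and launch command are placeholders, so treat this as illustrative rather than authoritative.

```shell
yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
  --name tf-job-001 \
  --docker_image <your_tf_docker_image> \
  --input_path hdfs://default/dataset/cifar-10-data \
  --checkpoint_path hdfs://default/tmp/cifar-10-jobdir \
  --num_workers 2 \
  --worker_resources memory=8G,vcores=2,gpu=1 \
  --worker_launch_cmd "python ... --train-steps=1000" \
  --tensorboard
```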


@ -1,158 +0,0 @@
<?xml version="1.0"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
https://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<artifactId>hadoop-submarine</artifactId>
<groupId>org.apache.hadoop</groupId>
<version>0.3.0-SNAPSHOT</version>
</parent>
<artifactId>hadoop-submarine-core</artifactId>
<version>0.3.0-SNAPSHOT</version>
<name>Hadoop Submarine Core</name>
<properties>
<!-- Needed for generating FindBugs warnings using parent pom -->
<yarn.basedir>${project.parent.parent.basedir}</yarn.basedir>
</properties>
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
</dependency>
<dependency>
<groupId>commons-logging</groupId>
<artifactId>commons-logging</artifactId>
</dependency>
<dependency>
<groupId>commons-cli</groupId>
<artifactId>commons-cli</artifactId>
</dependency>
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</dependency>
<dependency>
<groupId>org.yaml</groupId>
<artifactId>snakeyaml</artifactId>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
</dependency>
<!-- Dependencies for Hadoop commons -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-yarn-api</artifactId>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-yarn-common</artifactId>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-yarn-client</artifactId>
</dependency>
<dependency>
<groupId>org.mockito</groupId>
<artifactId>mockito-core</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-yarn-common</artifactId>
<type>test-jar</type>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<artifactId>maven-jar-plugin</artifactId>
<executions>
<execution>
<goals>
<goal>jar</goal>
</goals>
<!-- strictly speaking, the unit test is really a regression test. It
needs the main jar to be available to be able to run. -->
<phase>test-compile</phase>
</execution>
</executions>
<configuration>
<archive>
<manifest>
<mainClass>org.apache.hadoop.yarn.submarine.client.cli.Cli</mainClass>
</manifest>
</archive>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<configuration>
<environmentVariables>
<JAVA_HOME>${java.home}</JAVA_HOME>
</environmentVariables>
</configuration>
</plugin>
<plugin>
<artifactId>maven-jar-plugin</artifactId>
<executions>
<execution>
<goals>
<goal>test-jar</goal>
</goals>
<phase>test-compile</phase>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>


@ -1,77 +0,0 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04
ARG PYTHON_VERSION=3.6
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
cmake \
git \
curl \
vim \
ca-certificates \
libjpeg-dev \
libpng-dev \
wget &&\
rm -rf /var/lib/apt/lists/*
RUN curl -o ~/miniconda.sh -O https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
chmod +x ~/miniconda.sh && \
~/miniconda.sh -b -p /opt/conda && \
rm ~/miniconda.sh && \
/opt/conda/bin/conda install -y python=$PYTHON_VERSION numpy pyyaml scipy ipython mkl mkl-include cython typing && \
/opt/conda/bin/conda install -y -c pytorch magma-cuda100 && \
/opt/conda/bin/conda clean -ya
ENV PATH /opt/conda/bin:$PATH
RUN pip install ninja
# This must be done before pip so that requirements.txt is available
WORKDIR /opt/pytorch
RUN git clone https://github.com/pytorch/pytorch.git
WORKDIR pytorch
RUN git submodule update --init
RUN TORCH_CUDA_ARCH_LIST="3.5 5.2 6.0 6.1 7.0+PTX" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
pip install -v .
WORKDIR /opt/pytorch
RUN git clone https://github.com/pytorch/vision.git && cd vision && pip install -v .
WORKDIR /
# Install Hadoop
ENV HADOOP_VERSION="3.1.2"
RUN wget https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz
RUN tar zxf hadoop-${HADOOP_VERSION}.tar.gz
RUN ln -s hadoop-${HADOOP_VERSION} hadoop-current
RUN rm hadoop-${HADOOP_VERSION}.tar.gz
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
RUN echo "$LOG_TAG Install java8" && \
apt-get update && \
apt-get install -y --no-install-recommends openjdk-8-jdk && \
apt-get clean && rm -rf /var/lib/apt/lists/*
RUN echo "Install python related packages" && \
pip --no-cache-dir install Pillow h5py ipykernel jupyter matplotlib numpy pandas scipy sklearn && \
python -m ipykernel.kernelspec
# Set the locale to fix bash warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
RUN apt-get update && apt-get install -y --no-install-recommends locales && \
apt-get clean && rm -rf /var/lib/apt/lists/*
RUN locale-gen en_US.UTF-8
WORKDIR /workspace
RUN chmod -R a+w /workspace


@ -1,30 +0,0 @@
#!/usr/bin/env bash
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
echo "Building base images"
set -e
cd base/ubuntu-16.04
docker build . -f Dockerfile.gpu.pytorch_latest -t pytorch-latest-gpu-base:0.0.1
echo "Finished building base images"
cd ../../with-cifar10-models/ubuntu-16.04
docker build . -f Dockerfile.gpu.pytorch_latest -t pytorch-latest-gpu:0.0.1


@ -1,354 +0,0 @@
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
# http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# -*- coding: utf-8 -*-
"""
Training a Classifier
=====================
This is it. You have seen how to define neural networks, compute loss and make
updates to the weights of the network.
Now you might be thinking,
What about data?
----------------
Generally, when you have to deal with image, text, audio or video data,
you can use standard python packages that load data into a numpy array.
Then you can convert this array into a ``torch.*Tensor``.
- For images, packages such as Pillow, OpenCV are useful
- For audio, packages such as scipy and librosa
- For text, either raw Python or Cython based loading, or NLTK and
SpaCy are useful
Specifically for vision, we have created a package called
``torchvision``, that has data loaders for common datasets such as
Imagenet, CIFAR10, MNIST, etc. and data transformers for images, viz.,
``torchvision.datasets`` and ``torch.utils.data.DataLoader``.
This provides a huge convenience and avoids writing boilerplate code.
For this tutorial, we will use the CIFAR10 dataset.
It has the classes: airplane, automobile, bird, cat, deer,
dog, frog, horse, ship, truck. The images in CIFAR-10 are of
size 3x32x32, i.e. 3-channel color images of 32x32 pixels in size.
.. figure:: /_static/img/cifar10.png
:alt: cifar10
cifar10
Training an image classifier
----------------------------
We will do the following steps in order:
1. Load and normalize the CIFAR10 training and test datasets using
``torchvision``
2. Define a Convolutional Neural Network
3. Define a loss function
4. Train the network on the training data
5. Test the network on the test data
1. Loading and normalizing CIFAR10
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Using ``torchvision``, it's extremely easy to load CIFAR10.
"""
import torch
import torchvision
import torchvision.transforms as transforms
########################################################################
# The output of torchvision datasets are PILImage images of range [0, 1].
# We transform them to Tensors of normalized range [-1, 1].
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
########################################################################
# Let us show some of the training images, for fun.
import matplotlib.pyplot as plt
import numpy as np
# functions to show an image
def imshow(img):
    img = img / 2 + 0.5     # unnormalize
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.show()
# get some random training images
dataiter = iter(trainloader)
images, labels = next(dataiter)
# show images
imshow(torchvision.utils.make_grid(images))
# print labels
print(' '.join('%5s' % classes[labels[j]] for j in range(4)))
########################################################################
# 2. Define a Convolutional Neural Network
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Copy the neural network from the Neural Networks section before and modify it to
# take 3-channel images (instead of 1-channel images as it was defined).
import torch.nn as nn
import torch.nn.functional as F
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
net = Net()
########################################################################
# 3. Define a Loss function and optimizer
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Let's use a Classification Cross-Entropy loss and SGD with momentum.
import torch.optim as optim
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
########################################################################
# 4. Train the network
# ^^^^^^^^^^^^^^^^^^^^
#
# This is when things start to get interesting.
# We simply have to loop over our data iterator, and feed the inputs to the
# network and optimize.
for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0
print('Finished Training')
########################################################################
# 5. Test the network on the test data
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#
# We have trained the network for 2 passes over the training dataset.
# But we need to check if the network has learnt anything at all.
#
# We will check this by predicting the class label that the neural network
# outputs, and checking it against the ground-truth. If the prediction is
# correct, we add the sample to the list of correct predictions.
#
# Okay, first step. Let us display an image from the test set to get familiar.
dataiter = iter(testloader)
images, labels = next(dataiter)
# print images
imshow(torchvision.utils.make_grid(images))
print('GroundTruth: ', ' '.join('%5s' % classes[labels[j]] for j in range(4)))
########################################################################
# Okay, now let us see what the neural network thinks these examples above are:
outputs = net(images)
########################################################################
# The outputs are energies for the 10 classes.
# The higher the energy for a class, the more the network
# thinks that the image is of the particular class.
# So, let's get the index of the highest energy:
_, predicted = torch.max(outputs, 1)
print('Predicted: ', ' '.join('%5s' % classes[predicted[j]]
                              for j in range(4)))
########################################################################
# The results seem pretty good.
#
# Let us look at how the network performs on the whole dataset.
correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (
    100 * correct / total))
########################################################################
# That looks waaay better than chance, which is 10% accuracy (randomly picking
# a class out of 10 classes).
# Seems like the network learnt something.
#
# Hmmm, what are the classes that performed well, and the classes that did
# not perform well:
class_correct = list(0. for i in range(10))
class_total = list(0. for i in range(10))
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs, 1)
        c = (predicted == labels).squeeze()
        for i in range(4):
            label = labels[i]
            class_correct[label] += c[i].item()
            class_total[label] += 1

for i in range(10):
    print('Accuracy of %5s : %2d %%' % (
        classes[i], 100 * class_correct[i] / class_total[i]))
########################################################################
# Okay, so what next?
#
# How do we run these neural networks on the GPU?
#
# Training on GPU
# ----------------
# Just like how you transfer a Tensor onto the GPU, you transfer the neural
# net onto the GPU.
#
# Let's first define our device as the first visible cuda device if we have
# CUDA available:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Assuming that we are on a CUDA machine, this should print a CUDA device:
print(device)
########################################################################
# The rest of this section assumes that ``device`` is a CUDA device.
#
# Then these methods will recursively go over all modules and convert their
# parameters and buffers to CUDA tensors:
#
# .. code:: python
#
# net.to(device)
#
#
# Remember that you will have to send the inputs and targets at every step
# to the GPU too:
#
# .. code:: python
#
# inputs, labels = inputs.to(device), labels.to(device)
#
# Why don't I notice a MASSIVE speedup compared to CPU? Because your network
# is realllly small.
#
# **Exercise:** Try increasing the width of your network (argument 2 of
# the first ``nn.Conv2d``, and argument 1 of the second ``nn.Conv2d``
# they need to be the same number), see what kind of speedup you get.
#
# **Goals achieved**:
#
# - Understanding PyTorch's Tensor library and neural networks at a high level.
# - Train a small neural network to classify images
#
# Training on multiple GPUs
# -------------------------
# If you want to see even more MASSIVE speedup using all of your GPUs,
# please check out :doc:`data_parallel_tutorial`.
#
# Where do I go next?
# -------------------
#
# - :doc:`Train neural nets to play video games </intermediate/reinforcement_q_learning>`
# - `Train a state-of-the-art ResNet network on imagenet`_
# - `Train a face generator using Generative Adversarial Networks`_
# - `Train a word-level language model using Recurrent LSTM networks`_
# - `More examples`_
# - `More tutorials`_
# - `Discuss PyTorch on the Forums`_
# - `Chat with other users on Slack`_
#
# .. _Train a state-of-the-art ResNet network on imagenet: https://github.com/pytorch/examples/tree/master/imagenet
# .. _Train a face generator using Generative Adversarial Networks: https://github.com/pytorch/examples/tree/master/dcgan
# .. _Train a word-level language model using Recurrent LSTM networks: https://github.com/pytorch/examples/tree/master/word_language_model
# .. _More examples: https://github.com/pytorch/examples
# .. _More tutorials: https://github.com/pytorch/tutorials
# .. _Discuss PyTorch on the Forums: https://discuss.pytorch.org/
# .. _Chat with other users on Slack: https://pytorch.slack.com/messages/beginner/
# %%%%%%INVISIBLE_CODE_BLOCK%%%%%%
del dataiter
# %%%%%%INVISIBLE_CODE_BLOCK%%%%%%


@ -1,21 +0,0 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
FROM pytorch-latest-gpu-base:0.0.1
RUN mkdir -p /test/data
RUN chmod -R 777 /test
ADD cifar10_tutorial.py /test/cifar10_tutorial.py
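A quick way to sanity-check the resulting image is sketched below, assuming it was built as `pytorch-latest-gpu:0.0.1` by the build-all.sh script earlier in this change and that the Docker host has the NVIDIA container runtime; the `--gpus` flag, the `MPLBACKEND` variable and the head-less invocation are assumptions about your local setup rather than part of this repo.

```bash
# Smoke-test the image by running the bundled CIFAR-10 tutorial head-less.
# MPLBACKEND=Agg keeps the tutorial's plt.show() calls from needing a display;
# the CIFAR-10 dataset is downloaded inside the container at runtime.
docker run --rm --gpus all -e MPLBACKEND=Agg pytorch-latest-gpu:0.0.1 \
  python /test/cifar10_tutorial.py
```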


@ -1,71 +0,0 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
FROM ubuntu:16.04
# Pick up some TF dependencies
RUN apt-get update && apt-get install -y --allow-downgrades --no-install-recommends \
--allow-change-held-packages --allow-unauthenticated \
build-essential libfreetype6-dev libpng12-dev \
libzmq3-dev pkg-config python python-dev \
rsync software-properties-common curl unzip wget grep sed vim iputils-ping net-tools gdb python2.7-dbg tzdata && \
apt-get clean && rm -rf /var/lib/apt/lists/*
RUN export DEBIAN_FRONTEND=noninteractive && apt-get update && apt-get install -yq --no-install-recommends \
krb5-user libpam-krb5 && \
apt-get clean && rm -rf /var/lib/apt/lists/*
RUN wget https://bootstrap.pypa.io/get-pip.py && \
python get-pip.py && \
rm get-pip.py
RUN echo "Install python related packages" && \
apt-get update && \
apt-get install -y --no-install-recommends gfortran \
# numerical/algebra packages
libblas-dev libatlas-dev liblapack-dev \
# font, image for matplotlib
libpng-dev libxft-dev \
# for tkinter
python-tk libxml2-dev libxslt-dev zlib1g-dev && \
apt-get clean && rm -rf /var/lib/apt/lists/*
RUN pip --no-cache-dir install Pillow h5py ipykernel jupyter matplotlib numpy pandas scipy sklearn && \
python -m ipykernel.kernelspec
# Install TensorFlow CPU version.
ENV TENSORFLOW_VERSION="1.13.1"
RUN pip --no-cache-dir install \
http://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-${TENSORFLOW_VERSION}-cp27-none-linux_x86_64.whl
RUN apt-get update && apt-get install -y --no-install-recommends git && \
apt-get clean && rm -rf /var/lib/apt/lists/*
# Install hadoop
ENV HADOOP_VERSION="3.1.2"
RUN wget http://mirrors.shu.edu.cn/apache/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz
RUN tar zxf hadoop-${HADOOP_VERSION}.tar.gz
RUN ln -s hadoop-${HADOOP_VERSION} hadoop-current
RUN rm hadoop-${HADOOP_VERSION}.tar.gz
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
RUN echo "$LOG_TAG Install java8" && \
apt-get update && \
apt-get install -y --no-install-recommends openjdk-8-jdk && \
apt-get clean && rm -rf /var/lib/apt/lists/*
# Set the locale to fix bash warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
RUN apt-get update && apt-get install -y --no-install-recommends locales && \
apt-get clean && rm -rf /var/lib/apt/lists/*
RUN locale-gen en_US.UTF-8


@ -1,85 +0,0 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04
# Pick up some TF dependencies
RUN apt-get update && apt-get install -y --allow-downgrades --no-install-recommends \
--allow-change-held-packages --allow-unauthenticated \
build-essential libfreetype6-dev libpng12-dev \
libzmq3-dev pkg-config python python-dev \
rsync software-properties-common curl unzip wget grep sed vim \
iputils-ping net-tools gdb python2.7-dbg tzdata \
cuda-command-line-tools-10-0 cuda-cublas-10-0 \
cuda-cufft-10-0 cuda-curand-10-0 cuda-cusolver-10-0 \
cuda-cusparse-10-0 libcudnn7=7.4.1.5-1+cuda10.0 && \
apt-get clean && rm -rf /var/lib/apt/lists/*
# Install TensorRT
RUN apt-get update && \
apt-get install -y --allow-unauthenticated --no-install-recommends \
nvinfer-runtime-trt-repo-ubuntu1604-5.0.2-ga-cuda10.0 && \
apt-get update && \
apt-get install -y --no-install-recommends \
libnvinfer5=5.0.2-1+cuda10.0 && \
apt-get clean && rm -rf /var/lib/apt/lists/*
RUN export DEBIAN_FRONTEND=noninteractive && apt-get update && \
apt-get install -yq --no-install-recommends krb5-user libpam-krb5 \
&& apt-get clean && rm -rf /var/lib/apt/lists/*
RUN wget https://bootstrap.pypa.io/get-pip.py && \
python get-pip.py && \
rm get-pip.py
RUN echo "Install python related packages" && \
apt-get -y update && \
apt-get install -y --no-install-recommends gfortran \
# numerical/algebra packages
libblas-dev libatlas-dev liblapack-dev \
# font, image for matplotlib
libpng-dev libxft-dev \
# for tkinter
python-tk libxml2-dev libxslt-dev zlib1g-dev && \
apt-get clean && rm -rf /var/lib/apt/lists/*
RUN pip --no-cache-dir install Pillow h5py ipykernel jupyter matplotlib numpy pandas scipy sklearn && \
python -m ipykernel.kernelspec
# Install TensorFlow GPU version.
ENV TENSORFLOW_VERSION="1.13.1"
RUN pip --no-cache-dir install \
http://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-${TENSORFLOW_VERSION}-cp27-none-linux_x86_64.whl
RUN apt-get update && apt-get install -y --no-install-recommends git && \
apt-get clean && rm -rf /var/lib/apt/lists/*
# Install hadoop
ENV HADOOP_VERSION="3.1.2"
RUN wget http://mirrors.shu.edu.cn/apache/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz
RUN tar zxf hadoop-${HADOOP_VERSION}.tar.gz
RUN ln -s hadoop-${HADOOP_VERSION} hadoop-current
RUN rm hadoop-${HADOOP_VERSION}.tar.gz
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
RUN echo "$LOG_TAG Install java8" && \
apt-get -y update && \
apt-get install -y --no-install-recommends openjdk-8-jdk && \
rm -rf /var/lib/apt/lists/*
# Set the locale to fix bash warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
RUN apt-get update && apt-get install -y --no-install-recommends locales && \
apt-get clean && rm -rf /var/lib/apt/lists/*
RUN locale-gen en_US.UTF-8


@ -1,32 +0,0 @@
#!/usr/bin/env bash
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
echo "Building base images"
set -e
cd base/ubuntu-16.04
docker build . -f Dockerfile.cpu.tf_1.13.1 -t tf-1.13.1-cpu-base:0.0.1
docker build . -f Dockerfile.gpu.tf_1.13.1 -t tf-1.13.1-gpu-base:0.0.1
echo "Finished building base images"
cd ../../with-cifar10-models/ubuntu-16.04
docker build . -f Dockerfile.cpu.tf_1.13.1 -t tf-1.13.1-cpu:0.0.1
docker build . -f Dockerfile.gpu.tf_1.13.1 -t tf-1.13.1-gpu:0.0.1


@ -1,22 +0,0 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
FROM tf-1.13.1-cpu-base:0.0.1
# Include models
RUN mkdir /test
ADD cifar10_estimator_tf_1.13.1 /test/cifar10_estimator
RUN chown -R nobody /test


@ -1,22 +0,0 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
FROM tf-1.13.1-gpu-base:0.0.1
# Include models
RUN mkdir /test
ADD cifar10_estimator_tf_1.13.1 /test/cifar10_estimator
RUN chown -R nobody /test


@ -1,542 +0,0 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
(Copied from https://github.com/tensorflow/models/tree/master/tutorials/image/cifar10_estimator)
CIFAR-10 is a common benchmark in machine learning for image recognition.
http://www.cs.toronto.edu/~kriz/cifar.html
Code in this directory focuses on how to use TensorFlow Estimators to train and
evaluate a CIFAR-10 ResNet model on:
* A single host with one CPU;
* A single host with multiple GPUs;
* Multiple hosts with CPU or multiple GPUs;
Before trying to run the model, we highly encourage you to read all of this README.
## Prerequisite
1. [Install](https://www.tensorflow.org/install/) TensorFlow version 1.2.1 or
later.
2. Download the CIFAR-10 dataset and generate TFRecord files using the provided
script. The script and associated command below will download the CIFAR-10
dataset and then generate a TFRecord for the training, validation, and
evaluation datasets.
```shell
python generate_cifar10_tfrecords.py --data-dir=${PWD}/cifar-10-data
```
After running the command above, you should see the following files in the
--data-dir (```ls -R cifar-10-data```):
* train.tfrecords
* validation.tfrecords
* eval.tfrecords
## Training on a single machine with GPUs or CPU
Run the training on CPU only. After training, it runs the evaluation.
```
python cifar10_main.py --data-dir=${PWD}/cifar-10-data \
--job-dir=/tmp/cifar10 \
--num-gpus=0 \
--train-steps=1000
```
Run the model on 2 GPUs using CPU as parameter server. After training, it runs
the evaluation.
```
python cifar10_main.py --data-dir=${PWD}/cifar-10-data \
--job-dir=/tmp/cifar10 \
--num-gpus=2 \
--train-steps=1000
```
Run the model on 2 GPUs using GPU as parameter server.
It will run an experiment, which for a local setting basically means it will
stop training a couple of times to perform evaluation.
```
python cifar10_main.py --data-dir=${PWD}/cifar-10-data \
--job-dir=/tmp/cifar10 \
--variable-strategy GPU \
--num-gpus=2
```
There are more command line flags to play with; run
`python cifar10_main.py --help` for details.
## Run distributed training
### (Optional) Running on Google Cloud Machine Learning Engine
This example can be run on Google Cloud Machine Learning Engine (ML Engine),
which will configure the environment and take care of running workers,
parameters servers, and masters in a fault tolerant way.
To install the command line tool, and set up a project and billing, see the
quickstart [here](https://cloud.google.com/ml-engine/docs/quickstarts/command-line).
You'll also need a Google Cloud Storage bucket for the data. If you followed the
instructions above, you can just run:
```
MY_BUCKET=gs://<my-bucket-name>
gsutil cp -r ${PWD}/cifar-10-data $MY_BUCKET/
```
Then run the following command from the `tutorials/image` directory of this
repository (the parent directory of this README):
```
gcloud ml-engine jobs submit training cifarmultigpu \
--runtime-version 1.2 \
--job-dir=$MY_BUCKET/model_dirs/cifarmultigpu \
--config cifar10_estimator/cmle_config.yaml \
--package-path cifar10_estimator/ \
--module-name cifar10_estimator.cifar10_main \
-- \
--data-dir=$MY_BUCKET/cifar-10-data \
--num-gpus=4 \
--train-steps=1000
```
### Set TF_CONFIG
Considering that you already have multiple hosts configured, all you need is a
`TF_CONFIG` environment variable on each host. You can set up the hosts manually
or check [tensorflow/ecosystem](https://github.com/tensorflow/ecosystem) for
instructions about how to set up a Cluster.
The `TF_CONFIG` will be used by the `RunConfig` to know the existing hosts and
their task: `master`, `ps` or `worker`.
Here's an example of `TF_CONFIG`.
```python
import json

cluster = {'master': ['master-ip:8000'],
           'ps': ['ps-ip:8000'],
           'worker': ['worker-ip:8000']}

TF_CONFIG = json.dumps(
    {'cluster': cluster,
     'task': {'type': 'master', 'index': 0},
     'model_dir': 'gs://<bucket_path>/<dir_path>',
     'environment': 'cloud'})
```
*Cluster*
A cluster spec, which is basically a dictionary that describes all of the tasks
in the cluster. More about it [here](https://www.tensorflow.org/deploy/distributed).
In this cluster spec we are defining a cluster with 1 master, 1 ps and 1 worker.
* `ps`: saves the parameters among all workers. All workers can
  read/write/update the parameters for the model via the ps. As some models are
  extremely large, the parameters are shared among the ps (each ps stores a
  subset).
* `worker`: does the training.
* `master`: basically a special worker; it does training, but also restores and
  saves checkpoints and does evaluation.
*Task*
The Task defines the role of the current node; in this example the node
is the master at index 0 in the cluster spec. The task will be different for
each node. An example of the `TF_CONFIG` for a worker would be:
```python
cluster = {'master': ['master-ip:8000'],
           'ps': ['ps-ip:8000'],
           'worker': ['worker-ip:8000']}

TF_CONFIG = json.dumps(
    {'cluster': cluster,
     'task': {'type': 'worker', 'index': 0},
     'model_dir': 'gs://<bucket_path>/<dir_path>',
     'environment': 'cloud'})
```
*Model_dir*
This is the path where the master will save the checkpoints, graph and
TensorBoard files. For a multi-host environment you may want to use a
distributed file system; Google Storage and DFS are supported.
*Environment*
By default the environment is *local*; for a distributed setting we need to
change it to *cloud*.
### Running script
Once you have a `TF_CONFIG` configured properly on each host, you're ready to run
in a distributed setting.
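For example, a minimal sketch of what that looks like on one worker host is shown below; the host names, bucket paths and flag values are the placeholders used elsewhere in this README.

```shell
# Set TF_CONFIG for this host (the task type/index differ per host), then launch.
export TF_CONFIG='{
  "cluster": {"master": ["master-ip:8000"],
              "ps":     ["ps-ip:8000"],
              "worker": ["worker-ip:8000"]},
  "task": {"type": "worker", "index": 0},
  "model_dir": "gs://<bucket_path>/<dir_path>",
  "environment": "cloud"
}'
python cifar10_main.py --data-dir=gs://path/cifar-10-data \
                       --job-dir=gs://path/model_dir/ \
                       --num-gpus=4 \
                       --train-steps=40000 \
                       --sync
```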
#### Master
Run this on master:
Runs an Experiment in sync mode on 4 GPUs using CPU as parameter server for
40000 steps. It will run evaluation a couple of times during training. The
num_workers argument is used only to update the learning rate correctly. Make
sure the model_dir is the same as defined on the TF_CONFIG.
```shell
python cifar10_main.py --data-dir=gs://path/cifar-10-data \
--job-dir=gs://path/model_dir/ \
--num-gpus=4 \
--train-steps=40000 \
--sync \
--num-workers=2
```
*Output:*
```shell
INFO:tensorflow:Using model_dir in TF_CONFIG: gs://path/model_dir/
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 1, '_keep_checkpoint_max': 5, '_task_type': u'master', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fd16fb2be10>, '_model_dir': 'gs://path/model_dir/', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_session_config': intra_op_parallelism_threads: 1
gpu_options {
}
allow_soft_placement: true
, '_tf_random_seed': None, '_environment': u'cloud', '_num_worker_replicas': 1, '_task_id': 0, '_save_summary_steps': 100, '_tf_config': gpu_options {
per_process_gpu_memory_fraction: 1.0
}
, '_evaluation_master': '', '_master': u'grpc://master-ip:8000'}
...
2017-08-01 19:59:26.496208: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:04.0
Total memory: 11.17GiB
Free memory: 11.09GiB
2017-08-01 19:59:26.775660: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 1 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:05.0
Total memory: 11.17GiB
Free memory: 11.10GiB
...
2017-08-01 19:59:29.675171: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:8000
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1/: (?, 16, 32, 32)
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_1/: (?, 16, 32, 32)
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_2/: (?, 16, 32, 32)
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_3/: (?, 16, 32, 32)
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_4/: (?, 16, 32, 32)
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_5/: (?, 16, 32, 32)
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_6/: (?, 16, 32, 32)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1/avg_pool/: (?, 16, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1/: (?, 32, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_1/: (?, 32, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_2/: (?, 32, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_3/: (?, 32, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_4/: (?, 32, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1/: (?, 32, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_1/: (?, 32, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_2/: (?, 32, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_3/: (?, 32, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_4/: (?, 32, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_5/: (?, 32, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_6/: (?, 32, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1/avg_pool/: (?, 32, 8, 8)
INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1/: (?, 64, 8, 8)
INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_1/: (?, 64, 8, 8)
INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_2/: (?, 64, 8, 8)
INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_3/: (?, 64, 8, 8)
INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_4/: (?, 64, 8, 8)
INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_5/: (?, 64, 8, 8)
INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_6/: (?, 64, 8, 8)
INFO:tensorflow:image after unit resnet/tower_0/global_avg_pool/: (?, 64)
INFO:tensorflow:image after unit resnet/tower_0/fully_connected/: (?, 11)
INFO:tensorflow:SyncReplicasV2: replicas_to_aggregate=1; total_num_replicas=1
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Restoring parameters from gs://path/model_dir/model.ckpt-0
2017-08-01 19:59:37.560775: I tensorflow/core/distributed_runtime/master_session.cc:999] Start master session 156fcb55fe6648d6 with config:
intra_op_parallelism_threads: 1
gpu_options {
per_process_gpu_memory_fraction: 1
}
allow_soft_placement: true
INFO:tensorflow:Saving checkpoints for 1 into gs://path/model_dir/model.ckpt.
INFO:tensorflow:loss = 1.20682, step = 1
INFO:tensorflow:loss = 1.20682, learning_rate = 0.1
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1/: (?, 16, 32, 32)
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_1/: (?, 16, 32, 32)
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_2/: (?, 16, 32, 32)
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_3/: (?, 16, 32, 32)
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_4/: (?, 16, 32, 32)
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_5/: (?, 16, 32, 32)
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_6/: (?, 16, 32, 32)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1/avg_pool/: (?, 16, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1/: (?, 32, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_1/: (?, 32, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_2/: (?, 32, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_3/: (?, 32, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_4/: (?, 32, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_5/: (?, 32, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_6/: (?, 32, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1/avg_pool/: (?, 32, 8, 8)
INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1/: (?, 64, 8, 8)
INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_1/: (?, 64, 8, 8)
INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_2/: (?, 64, 8, 8)
INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_3/: (?, 64, 8, 8)
INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_4/: (?, 64, 8, 8)
INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_5/: (?, 64, 8, 8)
INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_6/: (?, 64, 8, 8)
INFO:tensorflow:image after unit resnet/tower_0/global_avg_pool/: (?, 64)
INFO:tensorflow:image after unit resnet/tower_0/fully_connected/: (?, 11)
INFO:tensorflow:SyncReplicasV2: replicas_to_aggregate=2; total_num_replicas=2
INFO:tensorflow:Starting evaluation at 2017-08-01-20:00:14
2017-08-01 20:00:15.745881: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0)
2017-08-01 20:00:15.745949: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:00:05.0)
2017-08-01 20:00:15.745958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K80, pci bus id: 0000:00:06.0)
2017-08-01 20:00:15.745964: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K80, pci bus id: 0000:00:07.0)
2017-08-01 20:00:15.745969: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:4) -> (device: 4, name: Tesla K80, pci bus id: 0000:00:08.0)
2017-08-01 20:00:15.745975: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:5) -> (device: 5, name: Tesla K80, pci bus id: 0000:00:09.0)
2017-08-01 20:00:15.745987: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:6) -> (device: 6, name: Tesla K80, pci bus id: 0000:00:0a.0)
2017-08-01 20:00:15.745997: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:7) -> (device: 7, name: Tesla K80, pci bus id: 0000:00:0b.0)
INFO:tensorflow:Restoring parameters from gs://path/model_dir/model.ckpt-10023
INFO:tensorflow:Evaluation [1/100]
INFO:tensorflow:Evaluation [2/100]
INFO:tensorflow:Evaluation [3/100]
INFO:tensorflow:Evaluation [4/100]
INFO:tensorflow:Evaluation [5/100]
INFO:tensorflow:Evaluation [6/100]
INFO:tensorflow:Evaluation [7/100]
INFO:tensorflow:Evaluation [8/100]
INFO:tensorflow:Evaluation [9/100]
INFO:tensorflow:Evaluation [10/100]
INFO:tensorflow:Evaluation [11/100]
INFO:tensorflow:Evaluation [12/100]
INFO:tensorflow:Evaluation [13/100]
...
INFO:tensorflow:Evaluation [100/100]
INFO:tensorflow:Finished evaluation at 2017-08-01-20:00:31
INFO:tensorflow:Saving dict for global step 1: accuracy = 0.0994, global_step = 1, loss = 630.425
```
#### Worker
Run this on the worker node:
Runs an Experiment in sync mode on 4 GPUs, using the CPU as the parameter
server, for 40000 steps. It will run evaluation a couple of times during
training. Make sure the model_dir is the same as the one defined in TF_CONFIG
(a sketch of the expected TF_CONFIG layout follows the PS section below).
```shell
python cifar10_main.py --data-dir=gs://path/cifar-10-data \
--job-dir=gs://path/model_dir/ \
--num-gpus=4 \
--train-steps=40000 \
--sync
```
*Output:*
```shell
INFO:tensorflow:Using model_dir in TF_CONFIG: gs://path/model_dir/
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600,
'_num_ps_replicas': 1, '_keep_checkpoint_max': 5, '_task_type': u'worker',
'_is_chief': False, '_cluster_spec':
<tensorflow.python.training.server_lib.ClusterSpec object at 0x7f6918438e10>,
'_model_dir': 'gs://<path>/model_dir/',
'_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000,
'_session_config': intra_op_parallelism_threads: 1
gpu_options {
}
allow_soft_placement: true
, '_tf_random_seed': None, '_environment': u'cloud', '_num_worker_replicas': 1,
'_task_id': 0, '_save_summary_steps': 100, '_tf_config': gpu_options {
per_process_gpu_memory_fraction: 1.0
}
...
2017-08-01 19:59:26.496208: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:04.0
Total memory: 11.17GiB
Free memory: 11.09GiB
2017-08-01 19:59:26.775660: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 1 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:05.0
Total memory: 11.17GiB
Free memory: 11.10GiB
...
2017-08-01 19:59:29.675171: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:8000
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1/: (?, 16, 32, 32)
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_1/: (?, 16, 32, 32)
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_2/: (?, 16, 32, 32)
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_3/: (?, 16, 32, 32)
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_4/: (?, 16, 32, 32)
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_5/: (?, 16, 32, 32)
INFO:tensorflow:image after unit resnet/tower_0/stage/residual_v1_6/: (?, 16, 32, 32)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1/avg_pool/: (?, 16, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1/: (?, 32, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_1/: (?, 32, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_2/: (?, 32, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_3/: (?, 32, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_4/: (?, 32, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_5/: (?, 32, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_1/residual_v1_6/: (?, 32, 16, 16)
INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1/avg_pool/: (?, 32, 8, 8)
INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1/: (?, 64, 8, 8)
INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_1/: (?, 64, 8, 8)
INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_2/: (?, 64, 8, 8)
INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_3/: (?, 64, 8, 8)
INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_4/: (?, 64, 8, 8)
INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_5/: (?, 64, 8, 8)
INFO:tensorflow:image after unit resnet/tower_0/stage_2/residual_v1_6/: (?, 64, 8, 8)
INFO:tensorflow:image after unit resnet/tower_0/global_avg_pool/: (?, 64)
INFO:tensorflow:image after unit resnet/tower_0/fully_connected/: (?, 11)
INFO:tensorflow:SyncReplicasV2: replicas_to_aggregate=2; total_num_replicas=2
INFO:tensorflow:Create CheckpointSaverHook.
2017-07-31 22:38:04.629150: I
tensorflow/core/distributed_runtime/master.cc:209] CreateSession still waiting
for response from worker: /job:master/replica:0/task:0
2017-07-31 22:38:09.263492: I
tensorflow/core/distributed_runtime/master_session.cc:999] Start master
session cc58f93b1e259b0c with config:
intra_op_parallelism_threads: 1
gpu_options {
per_process_gpu_memory_fraction: 1
}
allow_soft_placement: true
INFO:tensorflow:loss = 5.82382, step = 0
INFO:tensorflow:loss = 5.82382, learning_rate = 0.8
INFO:tensorflow:Average examples/sec: 1116.92 (1116.92), step = 10
INFO:tensorflow:Average examples/sec: 1233.73 (1377.83), step = 20
INFO:tensorflow:Average examples/sec: 1485.43 (2509.3), step = 30
INFO:tensorflow:Average examples/sec: 1680.27 (2770.39), step = 40
INFO:tensorflow:Average examples/sec: 1825.38 (2788.78), step = 50
INFO:tensorflow:Average examples/sec: 1929.32 (2697.27), step = 60
INFO:tensorflow:Average examples/sec: 2015.17 (2749.05), step = 70
INFO:tensorflow:loss = 37.6272, step = 79 (19.554 sec)
INFO:tensorflow:loss = 37.6272, learning_rate = 0.8 (19.554 sec)
INFO:tensorflow:Average examples/sec: 2074.92 (2618.36), step = 80
INFO:tensorflow:Average examples/sec: 2132.71 (2744.13), step = 90
INFO:tensorflow:Average examples/sec: 2183.38 (2777.21), step = 100
INFO:tensorflow:Average examples/sec: 2224.4 (2739.03), step = 110
INFO:tensorflow:Average examples/sec: 2240.28 (2431.26), step = 120
INFO:tensorflow:Average examples/sec: 2272.12 (2739.32), step = 130
INFO:tensorflow:Average examples/sec: 2300.68 (2750.03), step = 140
INFO:tensorflow:Average examples/sec: 2325.81 (2745.63), step = 150
INFO:tensorflow:Average examples/sec: 2347.14 (2721.53), step = 160
INFO:tensorflow:Average examples/sec: 2367.74 (2754.54), step = 170
INFO:tensorflow:loss = 27.8453, step = 179 (18.893 sec)
...
```
#### PS
Run this on the ps node:
The ps does not do any training, so most of the arguments will not affect the
execution.
```shell
python cifar10_main.py --job-dir=gs://path/model_dir/
```
*Output:*
```shell
INFO:tensorflow:Using model_dir in TF_CONFIG: gs://path/model_dir/
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_num_ps_replicas': 1, '_keep_checkpoint_max': 5, '_task_type': u'ps', '_is_chief': False, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f48f1addf90>, '_model_dir': 'gs://path/model_dir/', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_session_config': intra_op_parallelism_threads: 1
gpu_options {
}
allow_soft_placement: true
, '_tf_random_seed': None, '_environment': u'cloud', '_num_worker_replicas': 1, '_task_id': 0, '_save_summary_steps': 100, '_tf_config': gpu_options {
per_process_gpu_memory_fraction: 1.0
}
, '_evaluation_master': '', '_master': u'grpc://master-ip:8000'}
2017-07-31 22:54:58.928088: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> master-ip:8000}
2017-07-31 22:54:58.928153: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:8000}
2017-07-31 22:54:58.928160: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> worker-ip:8000}
2017-07-31 22:54:58.929873: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:8000
```
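For reference, the master, worker, and ps processes find each other through the
`TF_CONFIG` environment variable, which must be set before launching
`cifar10_main.py` on each node. Below is a minimal sketch of what the worker's
`TF_CONFIG` might look like, matching the host names and port 8000 that appear
in the logs above; `ps-ip` is a placeholder (the ps log shows only its own
`localhost:8000`), and the master and ps nodes would use `"type": "master"`
and `"type": "ps"` respectively.

```shell
# Sketch only: replace the addresses with the hosts in your cluster.
export TF_CONFIG='{
  "cluster": {
    "master": ["master-ip:8000"],
    "worker": ["worker-ip:8000"],
    "ps": ["ps-ip:8000"]
  },
  "task": {"type": "worker", "index": 0},
  "environment": "cloud"
}'
```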
## Visualizing results with TensorBoard
When using Estimators you can also visualize your data in TensorBoard with no
changes to your code. You can use TensorBoard to visualize your TensorFlow
graph, plot quantitative metrics about the execution of your graph, and show
additional data, such as images, that pass through it.
You can check TensorBoard during training or after it finishes: just point
TensorBoard to the `job-dir` (the model_dir) you used in the previous steps.
```shell
tensorboard --logdir="<job dir>"
```
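For example, a hypothetical invocation pointing TensorBoard at the GCS job dir
used in the commands above (this assumes your TensorFlow/TensorBoard
installation can read `gs://` paths; the port is optional):

```shell
# Adjust the path and port to your setup.
tensorboard --logdir=gs://path/model_dir/ --port 6006
```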
## Warnings
When running `cifar10_main.py` with the `--sync` argument you may see an error
similar to:
```python
File "cifar10_main.py", line 538, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "cifar10_main.py", line 518, in main
hooks), run_config=config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 210, in run
return _execute_schedule(experiment, schedule)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 47, in _execute_schedule
return task()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 501, in train_and_evaluate
hooks=self._eval_hooks)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 681, in _call_evaluate
hooks=hooks)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 292, in evaluate
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 638, in _evaluate_model
features, labels, model_fn_lib.ModeKeys.EVAL)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 545, in _call_model_fn
features=features, labels=labels, **kwargs)
File "cifar10_main.py", line 331, in _resnet_model_fn
gradvars, global_step=tf.train.get_global_step())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/sync_replicas_optimizer.py", line 252, in apply_gradients
variables.global_variables())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 170, in wrapped
return _add_should_use_warning(fn(*args, **kwargs))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 139, in _add_should_use_warning
wrapped = TFShouldUseWarningWrapper(x)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 96, in __init__
stack = [s.strip() for s in traceback.format_stack()]
```
This should not affect your training, and should be fixed in future releases.

View File

@ -1,113 +0,0 @@
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""CIFAR-10 data set.
See http://www.cs.toronto.edu/~kriz/cifar.html.
"""
import os
import tensorflow as tf
HEIGHT = 32
WIDTH = 32
DEPTH = 3
class Cifar10DataSet(object):
"""Cifar10 data set.
Described by http://www.cs.toronto.edu/~kriz/cifar.html.
"""
def __init__(self, data_dir, subset='train', use_distortion=True):
self.data_dir = data_dir
self.subset = subset
self.use_distortion = use_distortion
def get_filenames(self):
if self.subset in ['train', 'validation', 'eval']:
return [os.path.join(self.data_dir, self.subset + '.tfrecords')]
else:
raise ValueError('Invalid data subset "%s"' % self.subset)
def parser(self, serialized_example):
"""Parses a single tf.Example into image and label tensors."""
# Dimensions of the images in the CIFAR-10 dataset.
# See http://www.cs.toronto.edu/~kriz/cifar.html for a description of the
# input format.
features = tf.parse_single_example(
serialized_example,
features={
'image': tf.FixedLenFeature([], tf.string),
'label': tf.FixedLenFeature([], tf.int64),
})
image = tf.decode_raw(features['image'], tf.uint8)
image.set_shape([DEPTH * HEIGHT * WIDTH])
# Reshape from [depth * height * width] to [depth, height, width].
image = tf.cast(
tf.transpose(tf.reshape(image, [DEPTH, HEIGHT, WIDTH]), [1, 2, 0]),
tf.float32)
label = tf.cast(features['label'], tf.int32)
# Custom preprocessing.
image = self.preprocess(image)
return image, label
def make_batch(self, batch_size):
"""Read the images and labels from 'filenames'."""
filenames = self.get_filenames()
# Repeat infinitely.
dataset = tf.data.TFRecordDataset(filenames).repeat()
# Parse records.
dataset = dataset.map(
self.parser, num_parallel_calls=batch_size)
# Potentially shuffle records.
if self.subset == 'train':
min_queue_examples = int(
Cifar10DataSet.num_examples_per_epoch(self.subset) * 0.4)
# Ensure that the capacity is sufficiently large to provide good random
# shuffling.
dataset = dataset.shuffle(buffer_size=min_queue_examples + 3 * batch_size)
# Batch it up.
dataset = dataset.batch(batch_size)
iterator = dataset.make_one_shot_iterator()
image_batch, label_batch = iterator.get_next()
return image_batch, label_batch
def preprocess(self, image):
"""Preprocess a single image in [height, width, depth] layout."""
if self.subset == 'train' and self.use_distortion:
# Pad 4 pixels on each dimension of feature map, done in mini-batch
image = tf.image.resize_image_with_crop_or_pad(image, 40, 40)
image = tf.random_crop(image, [HEIGHT, WIDTH, DEPTH])
image = tf.image.random_flip_left_right(image)
return image
@staticmethod
def num_examples_per_epoch(subset='train'):
if subset == 'train':
return 45000
elif subset == 'validation':
return 5000
elif subset == 'eval':
return 10000
else:
raise ValueError('Invalid data subset "%s"' % subset)

View File

@ -1,521 +0,0 @@
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""ResNet model for classifying images from CIFAR-10 dataset.
Support single-host training with one or multiple devices.
ResNet as proposed in:
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
Deep Residual Learning for Image Recognition. arXiv:1512.03385
CIFAR-10 as in:
http://www.cs.toronto.edu/~kriz/cifar.html
"""
from __future__ import division
from __future__ import print_function
import argparse
import functools
import itertools
import os
import cifar10
import cifar10_model
import cifar10_utils
import numpy as np
import six
from six.moves import xrange # pylint: disable=redefined-builtin
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.INFO)
def get_model_fn(num_gpus, variable_strategy, num_workers):
"""Returns a function that will build the resnet model."""
def _resnet_model_fn(features, labels, mode, params):
"""Resnet model body.
Support single host, one or more GPU training. Parameter distribution can
be either one of the following scheme.
1. CPU is the parameter server and manages gradient updates.
2. Parameters are distributed evenly across all GPUs, and the first GPU
manages gradient updates.
Args:
features: a list of tensors, one for each tower
labels: a list of tensors, one for each tower
mode: ModeKeys.TRAIN or EVAL
params: Hyperparameters suitable for tuning
Returns:
A EstimatorSpec object.
"""
is_training = (mode == tf.estimator.ModeKeys.TRAIN)
weight_decay = params.weight_decay
momentum = params.momentum
tower_features = features
tower_labels = labels
tower_losses = []
tower_gradvars = []
tower_preds = []
# channels first (NCHW) is normally optimal on GPU and channels last (NHWC)
# on CPU. The exception is Intel MKL on CPU which is optimal with
# channels_last.
data_format = params.data_format
if not data_format:
if num_gpus == 0:
data_format = 'channels_last'
else:
data_format = 'channels_first'
if num_gpus == 0:
num_devices = 1
device_type = 'cpu'
else:
num_devices = num_gpus
device_type = 'gpu'
for i in range(num_devices):
worker_device = '/{}:{}'.format(device_type, i)
if variable_strategy == 'CPU':
device_setter = cifar10_utils.local_device_setter(
worker_device=worker_device)
elif variable_strategy == 'GPU':
device_setter = cifar10_utils.local_device_setter(
ps_device_type='gpu',
worker_device=worker_device,
ps_strategy=tf.contrib.training.GreedyLoadBalancingStrategy(
num_gpus, tf.contrib.training.byte_size_load_fn))
with tf.variable_scope('resnet', reuse=bool(i != 0)):
with tf.name_scope('tower_%d' % i) as name_scope:
with tf.device(device_setter):
loss, gradvars, preds = _tower_fn(
is_training, weight_decay, tower_features[i], tower_labels[i],
data_format, params.num_layers, params.batch_norm_decay,
params.batch_norm_epsilon)
tower_losses.append(loss)
tower_gradvars.append(gradvars)
tower_preds.append(preds)
if i == 0:
# Only trigger batch_norm moving mean and variance update from
# the 1st tower. Ideally, we should grab the updates from all
# towers but these stats accumulate extremely fast so we can
# ignore the other stats from the other towers without
# significant detriment.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS,
name_scope)
# Now compute global loss and gradients.
gradvars = []
with tf.name_scope('gradient_averaging'):
all_grads = {}
for grad, var in itertools.chain(*tower_gradvars):
if grad is not None:
all_grads.setdefault(var, []).append(grad)
for var, grads in six.iteritems(all_grads):
# Average gradients on the same device as the variables
# to which they apply.
with tf.device(var.device):
if len(grads) == 1:
avg_grad = grads[0]
else:
avg_grad = tf.multiply(tf.add_n(grads), 1. / len(grads))
gradvars.append((avg_grad, var))
# Device that runs the ops to apply global gradient updates.
consolidation_device = '/gpu:0' if variable_strategy == 'GPU' else '/cpu:0'
with tf.device(consolidation_device):
# Suggested learning rate scheduling from
# https://github.com/ppwwyyxx/tensorpack/blob/master/examples/ResNet/cifar10-resnet.py#L155
num_batches_per_epoch = cifar10.Cifar10DataSet.num_examples_per_epoch(
'train') // (params.train_batch_size * num_workers)
boundaries = [
num_batches_per_epoch * x
for x in np.array([82, 123, 300], dtype=np.int64)
]
staged_lr = [params.learning_rate * x for x in [1, 0.1, 0.01, 0.002]]
learning_rate = tf.train.piecewise_constant(tf.train.get_global_step(),
boundaries, staged_lr)
loss = tf.reduce_mean(tower_losses, name='loss')
examples_sec_hook = cifar10_utils.ExamplesPerSecondHook(
params.train_batch_size, every_n_steps=10)
tensors_to_log = {'learning_rate': learning_rate, 'loss': loss}
logging_hook = tf.train.LoggingTensorHook(
tensors=tensors_to_log, every_n_iter=100)
train_hooks = [logging_hook, examples_sec_hook]
optimizer = tf.train.MomentumOptimizer(
learning_rate=learning_rate, momentum=momentum)
if params.sync:
optimizer = tf.train.SyncReplicasOptimizer(
optimizer, replicas_to_aggregate=num_workers)
sync_replicas_hook = optimizer.make_session_run_hook(params.is_chief)
train_hooks.append(sync_replicas_hook)
# Create single grouped train op
train_op = [
optimizer.apply_gradients(
gradvars, global_step=tf.train.get_global_step())
]
train_op.extend(update_ops)
train_op = tf.group(*train_op)
predictions = {
'classes':
tf.concat([p['classes'] for p in tower_preds], axis=0),
'probabilities':
tf.concat([p['probabilities'] for p in tower_preds], axis=0)
}
stacked_labels = tf.concat(labels, axis=0)
metrics = {
'accuracy':
tf.metrics.accuracy(stacked_labels, predictions['classes'])
}
return tf.estimator.EstimatorSpec(
mode=mode,
predictions=predictions,
loss=loss,
train_op=train_op,
training_hooks=train_hooks,
eval_metric_ops=metrics)
return _resnet_model_fn
def _tower_fn(is_training, weight_decay, feature, label, data_format,
num_layers, batch_norm_decay, batch_norm_epsilon):
"""Build computation tower (Resnet).
Args:
is_training: true if is training graph.
weight_decay: weight regularization strength, a float.
feature: a Tensor.
label: a Tensor.
data_format: channels_last (NHWC) or channels_first (NCHW).
num_layers: number of layers, an int.
batch_norm_decay: decay for batch normalization, a float.
batch_norm_epsilon: epsilon for batch normalization, a float.
Returns:
A tuple with the loss for the tower, the gradients and parameters, and
predictions.
"""
model = cifar10_model.ResNetCifar10(
num_layers,
batch_norm_decay=batch_norm_decay,
batch_norm_epsilon=batch_norm_epsilon,
is_training=is_training,
data_format=data_format)
logits = model.forward_pass(feature, input_data_format='channels_last')
tower_pred = {
'classes': tf.argmax(input=logits, axis=1),
'probabilities': tf.nn.softmax(logits)
}
tower_loss = tf.losses.sparse_softmax_cross_entropy(
logits=logits, labels=label)
tower_loss = tf.reduce_mean(tower_loss)
model_params = tf.trainable_variables()
tower_loss += weight_decay * tf.add_n(
[tf.nn.l2_loss(v) for v in model_params])
tower_grad = tf.gradients(tower_loss, model_params)
return tower_loss, zip(tower_grad, model_params), tower_pred
def input_fn(data_dir,
subset,
num_shards,
batch_size,
use_distortion_for_training=True):
"""Create input graph for model.
Args:
data_dir: Directory where TFRecords representing the dataset are located.
subset: one of 'train', 'validate' and 'eval'.
num_shards: num of towers participating in data-parallel training.
batch_size: total batch size for training to be divided by the number of
shards.
use_distortion_for_training: True to use distortions.
Returns:
two lists of tensors for features and labels, each of num_shards length.
"""
with tf.device('/cpu:0'):
use_distortion = subset == 'train' and use_distortion_for_training
dataset = cifar10.Cifar10DataSet(data_dir, subset, use_distortion)
image_batch, label_batch = dataset.make_batch(batch_size)
if num_shards <= 1:
# No GPU available or only 1 GPU.
return [image_batch], [label_batch]
# Note that passing num=batch_size is safe here, even though
# dataset.batch(batch_size) can, in some cases, return fewer than batch_size
# examples. This is because it does so only when repeating for a limited
# number of epochs, but our dataset repeats forever.
image_batch = tf.unstack(image_batch, num=batch_size, axis=0)
label_batch = tf.unstack(label_batch, num=batch_size, axis=0)
feature_shards = [[] for i in range(num_shards)]
label_shards = [[] for i in range(num_shards)]
for i in xrange(batch_size):
idx = i % num_shards
feature_shards[idx].append(image_batch[i])
label_shards[idx].append(label_batch[i])
feature_shards = [tf.parallel_stack(x) for x in feature_shards]
label_shards = [tf.parallel_stack(x) for x in label_shards]
return feature_shards, label_shards
def get_experiment_fn(data_dir,
num_gpus,
variable_strategy,
use_distortion_for_training=True):
"""Returns an Experiment function.
Experiments perform training on several workers in parallel;
in other words, experiments know how to invoke train and eval in a sensible
fashion for distributed training. Arguments passed directly to this
function are not tunable; all other arguments should be passed within
tf.HParams, passed to the enclosed function.
Args:
data_dir: str. Location of the data for input_fns.
num_gpus: int. Number of GPUs on each worker.
variable_strategy: String. CPU to use CPU as the parameter server
and GPU to use the GPUs as the parameter server.
use_distortion_for_training: bool. See cifar10.Cifar10DataSet.
Returns:
A function (tf.estimator.RunConfig, tf.contrib.training.HParams) ->
tf.contrib.learn.Experiment.
Suitable for use by tf.contrib.learn.learn_runner, which will run various
methods on Experiment (train, evaluate) based on information
about the current runner in `run_config`.
"""
def _experiment_fn(run_config, hparams):
"""Returns an Experiment."""
# Create estimator.
train_input_fn = functools.partial(
input_fn,
data_dir,
subset='train',
num_shards=num_gpus,
batch_size=hparams.train_batch_size,
use_distortion_for_training=use_distortion_for_training)
eval_input_fn = functools.partial(
input_fn,
data_dir,
subset='eval',
batch_size=hparams.eval_batch_size,
num_shards=num_gpus)
num_eval_examples = cifar10.Cifar10DataSet.num_examples_per_epoch('eval')
if num_eval_examples % hparams.eval_batch_size != 0:
raise ValueError(
'validation set size must be multiple of eval_batch_size')
train_steps = hparams.train_steps
eval_steps = num_eval_examples // hparams.eval_batch_size
classifier = tf.estimator.Estimator(
model_fn=get_model_fn(num_gpus, variable_strategy,
run_config.num_worker_replicas or 1),
config=run_config,
params=hparams)
# Create experiment.
return tf.contrib.learn.Experiment(
classifier,
train_input_fn=train_input_fn,
eval_input_fn=eval_input_fn,
train_steps=train_steps,
eval_steps=eval_steps)
return _experiment_fn
def main(job_dir, data_dir, num_gpus, variable_strategy,
use_distortion_for_training, log_device_placement, num_intra_threads,
**hparams):
# The env variable is on deprecation path, default is set to off.
os.environ['TF_SYNC_ON_FINISH'] = '0'
os.environ['TF_ENABLE_WINOGRAD_NONFUSED'] = '1'
# Session configuration.
sess_config = tf.ConfigProto(
allow_soft_placement=True,
log_device_placement=log_device_placement,
intra_op_parallelism_threads=num_intra_threads,
gpu_options=tf.GPUOptions(force_gpu_compatible=True))
config = cifar10_utils.RunConfig(
session_config=sess_config, model_dir=job_dir)
tf.contrib.learn.learn_runner.run(
get_experiment_fn(data_dir, num_gpus, variable_strategy,
use_distortion_for_training),
run_config=config,
hparams=tf.contrib.training.HParams(
is_chief=config.is_chief,
**hparams))
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument(
'--data-dir',
type=str,
required=True,
help='The directory where the CIFAR-10 input data is stored.')
parser.add_argument(
'--job-dir',
type=str,
required=True,
help='The directory where the model will be stored.')
parser.add_argument(
'--variable-strategy',
choices=['CPU', 'GPU'],
type=str,
default='CPU',
help='Where to locate variable operations')
parser.add_argument(
'--num-gpus',
type=int,
default=1,
help='The number of gpus used. Uses only CPU if set to 0.')
parser.add_argument(
'--num-layers',
type=int,
default=44,
help='The number of layers of the model.')
parser.add_argument(
'--train-steps',
type=int,
default=80000,
help='The number of steps to use for training.')
parser.add_argument(
'--train-batch-size',
type=int,
default=128,
help='Batch size for training.')
parser.add_argument(
'--eval-batch-size',
type=int,
default=100,
help='Batch size for validation.')
parser.add_argument(
'--momentum',
type=float,
default=0.9,
help='Momentum for MomentumOptimizer.')
parser.add_argument(
'--weight-decay',
type=float,
default=2e-4,
help='Weight decay for convolutions.')
parser.add_argument(
'--learning-rate',
type=float,
default=0.1,
help="""\
This is the initial learning rate value. The learning rate will decrease
during training. For more details check the model_fn implementation in
this file.\
""")
parser.add_argument(
'--use-distortion-for-training',
type=bool,
default=True,
help='Whether to use image distortion for training.')
parser.add_argument(
'--sync',
action='store_true',
default=False,
help="""\
If present when running in a distributed environment, training will run in sync mode.\
""")
parser.add_argument(
'--num-intra-threads',
type=int,
default=0,
help="""\
Number of threads to use for intra-op parallelism. When training on CPU
set to 0 to have the system pick the appropriate number or alternatively
set it to the number of physical CPU cores.\
""")
parser.add_argument(
'--num-inter-threads',
type=int,
default=0,
help="""\
Number of threads to use for inter-op parallelism. If set to 0, the
system will pick an appropriate number.\
""")
parser.add_argument(
'--data-format',
type=str,
default=None,
help="""\
If not set, the data format best for the training device is used.
Allowed values: channels_first (NCHW) or channels_last (NHWC).\
""")
parser.add_argument(
'--log-device-placement',
action='store_true',
default=False,
help='Whether to log device placement.')
parser.add_argument(
'--batch-norm-decay',
type=float,
default=0.997,
help='Decay for batch norm.')
parser.add_argument(
'--batch-norm-epsilon',
type=float,
default=1e-5,
help='Epsilon for batch norm.')
args = parser.parse_args()
if args.num_gpus > 0:
assert tf.test.is_gpu_available(), "Requested GPUs but none found."
if args.num_gpus < 0:
raise ValueError(
'Invalid GPU count: \"--num-gpus\" must be 0 or a positive integer.')
if args.num_gpus == 0 and args.variable_strategy == 'GPU':
raise ValueError('num-gpus=0, CPU must be used as parameter server. Set '
'--variable-strategy=CPU.')
if (args.num_layers - 2) % 6 != 0:
raise ValueError('Invalid --num-layers parameter.')
if args.num_gpus != 0 and args.train_batch_size % args.num_gpus != 0:
raise ValueError('--train-batch-size must be multiple of --num-gpus.')
if args.num_gpus != 0 and args.eval_batch_size % args.num_gpus != 0:
raise ValueError('--eval-batch-size must be multiple of --num-gpus.')
main(**vars(args))

View File

@ -1,80 +0,0 @@
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Model class for Cifar10 Dataset."""
from __future__ import division
from __future__ import print_function
import tensorflow as tf
import model_base
class ResNetCifar10(model_base.ResNet):
"""Cifar10 model with ResNetV1 and basic residual block."""
def __init__(self,
num_layers,
is_training,
batch_norm_decay,
batch_norm_epsilon,
data_format='channels_first'):
super(ResNetCifar10, self).__init__(
is_training,
data_format,
batch_norm_decay,
batch_norm_epsilon
)
self.n = (num_layers - 2) // 6
# Add one in case label starts with 1. No impact if label starts with 0.
self.num_classes = 10 + 1
self.filters = [16, 16, 32, 64]
self.strides = [1, 2, 2]
def forward_pass(self, x, input_data_format='channels_last'):
"""Build the core model within the graph."""
if self._data_format != input_data_format:
if input_data_format == 'channels_last':
# Computation requires channels_first.
x = tf.transpose(x, [0, 3, 1, 2])
else:
# Computation requires channels_last.
x = tf.transpose(x, [0, 2, 3, 1])
# Image standardization.
x = x / 128 - 1
x = self._conv(x, 3, 16, 1)
x = self._batch_norm(x)
x = self._relu(x)
# Use basic (non-bottleneck) block and ResNet V1 (post-activation).
res_func = self._residual_v1
# 3 stages of block stacking.
for i in range(3):
with tf.name_scope('stage'):
for j in range(self.n):
if j == 0:
# First block in a stage, filters and strides may change.
x = res_func(x, 3, self.filters[i], self.filters[i + 1],
self.strides[i])
else:
# Following blocks in a stage, constant filters and unit stride.
x = res_func(x, 3, self.filters[i + 1], self.filters[i + 1], 1)
x = self._global_avg_pool(x)
x = self._fully_connected(x, self.num_classes)
return x

View File

@ -1,153 +0,0 @@
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
import collections
import six
import tensorflow as tf
from tensorflow.python.platform import tf_logging as logging
from tensorflow.core.framework import node_def_pb2
from tensorflow.python.framework import device as pydev
from tensorflow.python.training import basic_session_run_hooks
from tensorflow.python.training import session_run_hook
from tensorflow.python.training import training_util
from tensorflow.python.training import device_setter
from tensorflow.contrib.learn.python.learn import run_config
# TODO(b/64848083) Remove once uid bug is fixed
class RunConfig(tf.contrib.learn.RunConfig):
def uid(self, whitelist=None):
"""Generates a 'Unique Identifier' based on all internal fields.
Caller should use the uid string to check `RunConfig` instance integrity
in one session use, but should not rely on the implementation details, which
are subject to change.
Args:
whitelist: A list of the string names of the properties uid should not
include. If `None`, defaults to `_DEFAULT_UID_WHITE_LIST`, which
includes most properties users are allowed to change.
Returns:
A uid string.
"""
if whitelist is None:
whitelist = run_config._DEFAULT_UID_WHITE_LIST
state = {k: v for k, v in self.__dict__.items() if not k.startswith('__')}
# Pop out the keys in whitelist.
for k in whitelist:
state.pop('_' + k, None)
ordered_state = collections.OrderedDict(
sorted(state.items(), key=lambda t: t[0]))
# For class instances without __repr__, some special care is required.
# Otherwise, the object address will be used.
if '_cluster_spec' in ordered_state:
ordered_state['_cluster_spec'] = collections.OrderedDict(
sorted(ordered_state['_cluster_spec'].as_dict().items(),
key=lambda t: t[0])
)
return ', '.join(
'%s=%r' % (k, v) for (k, v) in six.iteritems(ordered_state))
class ExamplesPerSecondHook(session_run_hook.SessionRunHook):
"""Hook to print out examples per second.
Total time is tracked and then divided by the total number of steps
to get the average step time and then batch_size is used to determine
the running average of examples per second. The examples per second for the
most recent interval is also logged.
"""
def __init__(
self,
batch_size,
every_n_steps=100,
every_n_secs=None,):
"""Initializer for ExamplesPerSecondHook.
Args:
batch_size: Total batch size used to calculate examples/second from
global time.
every_n_steps: Log stats every n steps.
every_n_secs: Log stats every n seconds.
"""
if (every_n_steps is None) == (every_n_secs is None):
raise ValueError('exactly one of every_n_steps'
' and every_n_secs should be provided.')
self._timer = basic_session_run_hooks.SecondOrStepTimer(
every_steps=every_n_steps, every_secs=every_n_secs)
self._step_train_time = 0
self._total_steps = 0
self._batch_size = batch_size
def begin(self):
self._global_step_tensor = training_util.get_global_step()
if self._global_step_tensor is None:
raise RuntimeError(
'Global step should be created to use StepCounterHook.')
def before_run(self, run_context): # pylint: disable=unused-argument
return basic_session_run_hooks.SessionRunArgs(self._global_step_tensor)
def after_run(self, run_context, run_values):
_ = run_context
global_step = run_values.results
if self._timer.should_trigger_for_step(global_step):
elapsed_time, elapsed_steps = self._timer.update_last_triggered_step(
global_step)
if elapsed_time is not None:
steps_per_sec = elapsed_steps / elapsed_time
self._step_train_time += elapsed_time
self._total_steps += elapsed_steps
average_examples_per_sec = self._batch_size * (
self._total_steps / self._step_train_time)
current_examples_per_sec = steps_per_sec * self._batch_size
# Average examples/sec followed by current examples/sec
logging.info('%s: %g (%g), step = %g', 'Average examples/sec',
average_examples_per_sec, current_examples_per_sec,
self._total_steps)
def local_device_setter(num_devices=1,
ps_device_type='cpu',
worker_device='/cpu:0',
ps_ops=None,
ps_strategy=None):
if ps_ops is None:
ps_ops = ['Variable', 'VariableV2', 'VarHandleOp']
if ps_strategy is None:
ps_strategy = device_setter._RoundRobinStrategy(num_devices)
if not six.callable(ps_strategy):
raise TypeError("ps_strategy must be callable")
def _local_device_chooser(op):
current_device = pydev.DeviceSpec.from_string(op.device or "")
node_def = op if isinstance(op, node_def_pb2.NodeDef) else op.node_def
if node_def.op in ps_ops:
ps_device_spec = pydev.DeviceSpec.from_string(
'/{}:{}'.format(ps_device_type, ps_strategy(op)))
ps_device_spec.merge_from(current_device)
return ps_device_spec.to_string()
else:
worker_device_spec = pydev.DeviceSpec.from_string(worker_device or "")
worker_device_spec.merge_from(current_device)
return worker_device_spec.to_string()
return _local_device_chooser

View File

@ -1,118 +0,0 @@
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""Read CIFAR-10 data from pickled numpy arrays and writes TFRecords.
Generates tf.train.Example protos and writes them to TFRecord files from the
python version of the CIFAR-10 dataset downloaded from
https://www.cs.toronto.edu/~kriz/cifar.html.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import argparse
import os
import sys
import tarfile
from six.moves import cPickle as pickle
from six.moves import xrange # pylint: disable=redefined-builtin
import tensorflow as tf
CIFAR_FILENAME = 'cifar-10-python.tar.gz'
CIFAR_DOWNLOAD_URL = 'https://www.cs.toronto.edu/~kriz/' + CIFAR_FILENAME
CIFAR_LOCAL_FOLDER = 'cifar-10-batches-py'
def download_and_extract(data_dir):
# download CIFAR-10 if not already downloaded.
tf.contrib.learn.datasets.base.maybe_download(CIFAR_FILENAME, data_dir,
CIFAR_DOWNLOAD_URL)
tarfile.open(os.path.join(data_dir, CIFAR_FILENAME),
'r:gz').extractall(data_dir)
def _int64_feature(value):
return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
def _bytes_feature(value):
return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
def _get_file_names():
"""Returns the file names expected to exist in the input_dir."""
file_names = {}
file_names['train'] = ['data_batch_%d' % i for i in xrange(1, 5)]
file_names['validation'] = ['data_batch_5']
file_names['eval'] = ['test_batch']
return file_names
def read_pickle_from_file(filename):
with tf.gfile.Open(filename, 'rb') as f:
if sys.version_info >= (3, 0):
data_dict = pickle.load(f, encoding='bytes')
else:
data_dict = pickle.load(f)
return data_dict
def convert_to_tfrecord(input_files, output_file):
"""Converts a file to TFRecords."""
print('Generating %s' % output_file)
with tf.python_io.TFRecordWriter(output_file) as record_writer:
for input_file in input_files:
data_dict = read_pickle_from_file(input_file)
data = data_dict[b'data']
labels = data_dict[b'labels']
num_entries_in_batch = len(labels)
for i in range(num_entries_in_batch):
example = tf.train.Example(features=tf.train.Features(
feature={
'image': _bytes_feature(data[i].tobytes()),
'label': _int64_feature(labels[i])
}))
record_writer.write(example.SerializeToString())
def main(data_dir):
print('Download from {} and extract.'.format(CIFAR_DOWNLOAD_URL))
download_and_extract(data_dir)
file_names = _get_file_names()
input_dir = os.path.join(data_dir, CIFAR_LOCAL_FOLDER)
for mode, files in file_names.items():
input_files = [os.path.join(input_dir, f) for f in files]
output_file = os.path.join(data_dir, mode + '.tfrecords')
try:
os.remove(output_file)
except OSError:
pass
# Convert to tf.train.Example protos and write them to TFRecords.
convert_to_tfrecord(input_files, output_file)
print('Done!')
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument(
'--data-dir',
type=str,
default='',
help='Directory to download and extract CIFAR-10 to.')
args = parser.parse_args()
main(args.data_dir)

View File

@ -1,219 +0,0 @@
# Copyright 2017 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""ResNet model.
Related papers:
https://arxiv.org/pdf/1603.05027v2.pdf
https://arxiv.org/pdf/1512.03385v1.pdf
https://arxiv.org/pdf/1605.07146v1.pdf
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import tensorflow as tf
class ResNet(object):
"""ResNet model."""
def __init__(self, is_training, data_format, batch_norm_decay, batch_norm_epsilon):
"""ResNet constructor.
Args:
is_training: if build training or inference model.
data_format: the data_format used during computation.
one of 'channels_first' or 'channels_last'.
"""
self._batch_norm_decay = batch_norm_decay
self._batch_norm_epsilon = batch_norm_epsilon
self._is_training = is_training
assert data_format in ('channels_first', 'channels_last')
self._data_format = data_format
def forward_pass(self, x):
raise NotImplementedError(
'forward_pass() is implemented in ResNet sub classes')
def _residual_v1(self,
x,
kernel_size,
in_filter,
out_filter,
stride,
activate_before_residual=False):
"""Residual unit with 2 sub layers, using Plan A for shortcut connection."""
del activate_before_residual
with tf.name_scope('residual_v1') as name_scope:
orig_x = x
x = self._conv(x, kernel_size, out_filter, stride)
x = self._batch_norm(x)
x = self._relu(x)
x = self._conv(x, kernel_size, out_filter, 1)
x = self._batch_norm(x)
if in_filter != out_filter:
orig_x = self._avg_pool(orig_x, stride, stride)
pad = (out_filter - in_filter) // 2
if self._data_format == 'channels_first':
orig_x = tf.pad(orig_x, [[0, 0], [pad, pad], [0, 0], [0, 0]])
else:
orig_x = tf.pad(orig_x, [[0, 0], [0, 0], [0, 0], [pad, pad]])
x = self._relu(tf.add(x, orig_x))
tf.logging.info('image after unit %s: %s', name_scope, x.get_shape())
return x
def _residual_v2(self,
x,
in_filter,
out_filter,
stride,
activate_before_residual=False):
"""Residual unit with 2 sub layers with preactivation, plan A shortcut."""
with tf.name_scope('residual_v2') as name_scope:
if activate_before_residual:
x = self._batch_norm(x)
x = self._relu(x)
orig_x = x
else:
orig_x = x
x = self._batch_norm(x)
x = self._relu(x)
x = self._conv(x, 3, out_filter, stride)
x = self._batch_norm(x)
x = self._relu(x)
x = self._conv(x, 3, out_filter, [1, 1, 1, 1])
if in_filter != out_filter:
pad = (out_filter - in_filter) // 2
orig_x = self._avg_pool(orig_x, stride, stride)
if self._data_format == 'channels_first':
orig_x = tf.pad(orig_x, [[0, 0], [pad, pad], [0, 0], [0, 0]])
else:
orig_x = tf.pad(orig_x, [[0, 0], [0, 0], [0, 0], [pad, pad]])
x = tf.add(x, orig_x)
tf.logging.info('image after unit %s: %s', name_scope, x.get_shape())
return x
def _bottleneck_residual_v2(self,
x,
in_filter,
out_filter,
stride,
activate_before_residual=False):
"""Bottleneck residual unit with 3 sub layers, plan B shortcut."""
with tf.name_scope('bottle_residual_v2') as name_scope:
if activate_before_residual:
x = self._batch_norm(x)
x = self._relu(x)
orig_x = x
else:
orig_x = x
x = self._batch_norm(x)
x = self._relu(x)
x = self._conv(x, 1, out_filter // 4, stride, is_atrous=True)
x = self._batch_norm(x)
x = self._relu(x)
# pad when stride isn't unit
x = self._conv(x, 3, out_filter // 4, 1, is_atrous=True)
x = self._batch_norm(x)
x = self._relu(x)
x = self._conv(x, 1, out_filter, 1, is_atrous=True)
if in_filter != out_filter:
orig_x = self._conv(orig_x, 1, out_filter, stride, is_atrous=True)
x = tf.add(x, orig_x)
tf.logging.info('image after unit %s: %s', name_scope, x.get_shape())
return x
def _conv(self, x, kernel_size, filters, strides, is_atrous=False):
"""Convolution."""
padding = 'SAME'
if not is_atrous and strides > 1:
pad = kernel_size - 1
pad_beg = pad // 2
pad_end = pad - pad_beg
if self._data_format == 'channels_first':
x = tf.pad(x, [[0, 0], [0, 0], [pad_beg, pad_end], [pad_beg, pad_end]])
else:
x = tf.pad(x, [[0, 0], [pad_beg, pad_end], [pad_beg, pad_end], [0, 0]])
padding = 'VALID'
return tf.layers.conv2d(
inputs=x,
kernel_size=kernel_size,
filters=filters,
strides=strides,
padding=padding,
use_bias=False,
data_format=self._data_format)
def _batch_norm(self, x):
if self._data_format == 'channels_first':
data_format = 'NCHW'
else:
data_format = 'NHWC'
return tf.contrib.layers.batch_norm(
x,
decay=self._batch_norm_decay,
center=True,
scale=True,
epsilon=self._batch_norm_epsilon,
is_training=self._is_training,
fused=True,
data_format=data_format)
def _relu(self, x):
return tf.nn.relu(x)
def _fully_connected(self, x, out_dim):
with tf.name_scope('fully_connected') as name_scope:
x = tf.layers.dense(x, out_dim)
tf.logging.info('image after unit %s: %s', name_scope, x.get_shape())
return x
def _avg_pool(self, x, pool_size, stride):
with tf.name_scope('avg_pool') as name_scope:
x = tf.layers.average_pooling2d(
x, pool_size, stride, 'SAME', data_format=self._data_format)
tf.logging.info('image after unit %s: %s', name_scope, x.get_shape())
return x
def _global_avg_pool(self, x):
with tf.name_scope('global_avg_pool') as name_scope:
assert x.get_shape().ndims == 4
if self._data_format == 'channels_first':
x = tf.reduce_mean(x, [2, 3])
else:
x = tf.reduce_mean(x, [1, 2])
tf.logging.info('image after unit %s: %s', name_scope, x.get_shape())
return x

View File

@ -1,75 +0,0 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
FROM nvidia/cuda:9.0-base-ubuntu16.04
RUN echo "$LOG_TAG update and install basic packages" && \
apt-get -y update && apt-get install -y --no-install-recommends \
build-essential \
curl \
libfreetype6-dev \
libpng12-dev \
libzmq3-dev \
pkg-config \
rsync \
software-properties-common \
unzip \
vim \
wget \
&& \
apt-get install -y locales && \
locale-gen $LANG && \
apt-get clean && \
apt -y autoclean && \
apt -y dist-upgrade && \
apt-get install -y build-essential && \
rm -rf /var/lib/apt/lists/*
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
RUN echo "$LOG_TAG Install java8" && \
apt-get -y update && \
apt-get install -y openjdk-8-jdk && \
rm -rf /var/lib/apt/lists/*
# Install Zeppelin
ENV Z_VERSION="0.7.3" \
Z_HOME="/zeppelin"
RUN echo "$LOG_TAG Download Zeppelin binary" && \
wget -O /tmp/zeppelin-${Z_VERSION}-bin-all.tgz http://archive.apache.org/dist/zeppelin/zeppelin-${Z_VERSION}/zeppelin-${Z_VERSION}-bin-all.tgz && \
tar -zxvf /tmp/zeppelin-${Z_VERSION}-bin-all.tgz && \
rm -rf /tmp/zeppelin-${Z_VERSION}-bin-all.tgz && \
mv /zeppelin-${Z_VERSION}-bin-all ${Z_HOME}
ENV PATH="${Z_HOME}/bin:${PATH}"
RUN echo "$LOG_TAG Set locale" && \
echo "LC_ALL=en_US.UTF-8" >> /etc/environment && \
echo "en_US.UTF-8 UTF-8" >> /etc/locale.gen && \
echo "LANG=en_US.UTF-8" > /etc/locale.conf && \
locale-gen en_US.UTF-8
ENV LANG=en_US.UTF-8 \
LC_ALL=en_US.UTF-8
COPY zeppelin-site.xml $Z_HOME/conf/zeppelin-site.xml
COPY shiro.ini ${Z_HOME}/conf/shiro.ini
RUN chmod 777 -R ${Z_HOME}
COPY run_container.sh /usr/local/bin/run_container.sh
RUN chmod 755 /usr/local/bin/run_container.sh
EXPOSE 8080
CMD ["/usr/local/bin/run_container.sh"]

View File

@ -1,22 +0,0 @@
#!/usr/bin/env bash
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"${Z_HOME}/bin/zeppelin-daemon.sh" start
while true; do
#perform the test
sleep 5
done

View File

@ -1,120 +0,0 @@
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
[users]
# List of users with their password allowed to access Zeppelin.
# To use a different strategy (LDAP / Database / ...) check the shiro doc at http://shiro.apache.org/configuration.html#Configuration-INISections
# To enable admin user, uncomment the following line and set an appropriate password.
admin = admin, admin
user1 = password2, role1, role2
user2 = password3, role3
user3 = password4, role2
# Sample LDAP configuration, for user Authentication, currently tested for single Realm
[main]
### A sample for configuring Active Directory Realm
#activeDirectoryRealm = org.apache.zeppelin.realm.ActiveDirectoryGroupRealm
#activeDirectoryRealm.systemUsername = userNameA
#use either systemPassword or hadoopSecurityCredentialPath, more details in http://zeppelin.apache.org/docs/latest/security/shiroauthentication.html
#activeDirectoryRealm.systemPassword = passwordA
#activeDirectoryRealm.hadoopSecurityCredentialPath = jceks://file/user/zeppelin/zeppelin.jceks
#activeDirectoryRealm.searchBase = CN=Users,DC=SOME_GROUP,DC=COMPANY,DC=COM
#activeDirectoryRealm.url = ldap://ldap.test.com:389
#activeDirectoryRealm.groupRolesMap = "CN=admin,OU=groups,DC=SOME_GROUP,DC=COMPANY,DC=COM":"admin","CN=finance,OU=groups,DC=SOME_GROUP,DC=COMPANY,DC=COM":"finance","CN=hr,OU=groups,DC=SOME_GROUP,DC=COMPANY,DC=COM":"hr"
#activeDirectoryRealm.authorizationCachingEnabled = false
### A sample for configuring LDAP Directory Realm
#ldapRealm = org.apache.zeppelin.realm.LdapGroupRealm
## search base for ldap groups (only relevant for LdapGroupRealm):
#ldapRealm.contextFactory.environment[ldap.searchBase] = dc=COMPANY,dc=COM
#ldapRealm.contextFactory.url = ldap://ldap.test.com:389
#ldapRealm.userDnTemplate = uid={0},ou=Users,dc=COMPANY,dc=COM
#ldapRealm.contextFactory.authenticationMechanism = simple
### A sample PAM configuration
#pamRealm=org.apache.zeppelin.realm.PamRealm
#pamRealm.service=sshd
### A sample for configuring ZeppelinHub Realm
#zeppelinHubRealm = org.apache.zeppelin.realm.ZeppelinHubRealm
## Url of ZeppelinHub
#zeppelinHubRealm.zeppelinhubUrl = https://www.zeppelinhub.com
#securityManager.realms = $zeppelinHubRealm
## A same for configuring Knox SSO Realm
#knoxJwtRealm = org.apache.zeppelin.realm.jwt.KnoxJwtRealm
#knoxJwtRealm.providerUrl = https://domain.example.com/
#knoxJwtRealm.login = gateway/knoxsso/knoxauth/login.html
#knoxJwtRealm.logout = gateway/knoxssout/api/v1/webssout
#knoxJwtRealm.logoutAPI = true
#knoxJwtRealm.redirectParam = originalUrl
#knoxJwtRealm.cookieName = hadoop-jwt
#knoxJwtRealm.publicKeyPath = /etc/zeppelin/conf/knox-sso.pem
#
#knoxJwtRealm.groupPrincipalMapping = group.principal.mapping
#knoxJwtRealm.principalMapping = principal.mapping
#authc = org.apache.zeppelin.realm.jwt.KnoxAuthenticationFilter
sessionManager = org.apache.shiro.web.session.mgt.DefaultWebSessionManager
### If caching of user is required then uncomment below lines
#cacheManager = org.apache.shiro.cache.MemoryConstrainedCacheManager
#securityManager.cacheManager = $cacheManager
### Enables 'HttpOnly' flag in Zeppelin cookies
cookie = org.apache.shiro.web.servlet.SimpleCookie
cookie.name = JSESSIONID
cookie.httpOnly = true
### Uncomment the below line only when Zeppelin is running over HTTPS
#cookie.secure = true
sessionManager.sessionIdCookie = $cookie
securityManager.sessionManager = $sessionManager
# 86,400,000 milliseconds = 24 hours
securityManager.sessionManager.globalSessionTimeout = 86400000
shiro.loginUrl = /api/login
[roles]
role1 = *
role2 = *
role3 = *
admin = *
[urls]
# This section is used for url-based security. For details see the shiro.ini documentation.
#
# You can secure interpreter, configuration and credential information by urls.
# Comment or uncomment the below urls that you want to hide:
# anon means the access is anonymous.
# authc means form based auth Security.
#
# IMPORTANT: Order matters: URL path expressions are evaluated against an incoming request
# in the order they are defined and the FIRST MATCH WINS.
#
# To allow anonymous access to all but the stated urls,
# uncomment the line second last line (/** = anon) and comment the last line (/** = authc)
#
/api/version = anon
# Allow all authenticated users to restart interpreters on a notebook page.
# Comment out the following line if you would like to authorize only admin users to restart interpreters.
/api/interpreter/setting/restart/** = authc
/api/interpreter/** = authc, roles[admin]
/api/configurations/** = authc, roles[admin]
/api/credential/** = authc, roles[admin]
#/** = anon
/** = authc

View File

@ -1,569 +0,0 @@
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<configuration>
<property>
<name>zeppelin.server.addr</name>
<value>0.0.0.0</value>
<description>Server address</description>
</property>
<property>
<name>zeppelin.server.port</name>
<value>8080</value>
<description>Server port.</description>
</property>
<property>
<name>zeppelin.server.ssl.port</name>
<value>8443</value>
<description>Server ssl port. (used when ssl property is set to true)</description>
</property>
<property>
<name>zeppelin.server.context.path</name>
<value>/</value>
<description>Context Path of the Web Application</description>
</property>
<property>
<name>zeppelin.war.tempdir</name>
<value>webapps</value>
<description>Location of jetty temporary directory</description>
</property>
<property>
<name>zeppelin.notebook.dir</name>
<value>notebook</value>
<description>path or URI for notebook persistence</description>
</property>
<property>
<name>zeppelin.notebook.homescreen</name>
<value></value>
<description>id of notebook to be displayed in homescreen, e.g. 2A94M5J1Z. Empty value displays the default home screen</description>
</property>
<property>
<name>zeppelin.notebook.homescreen.hide</name>
<value>false</value>
<description>hide homescreen notebook from the list when this value is set to true</description>
</property>
<property>
<name>zeppelin.notebook.collaborative.mode.enable</name>
<value>true</value>
<description>Enable collaborative mode</description>
</property>
<!-- Google Cloud Storage notebook storage -->
<!--
<property>
<name>zeppelin.notebook.gcs.dir</name>
<value></value>
<description>
A GCS path in the form gs://bucketname/path/to/dir.
Notes are stored at {zeppelin.notebook.gcs.dir}/{notebook-id}/note.json
</description>
</property>
<property>
<name>zeppelin.notebook.storage</name>
<value>org.apache.zeppelin.notebook.repo.GCSNotebookRepo</value>
<description>notebook persistence layer implementation</description>
</property>
-->
<!-- Amazon S3 notebook storage -->
<!-- Creates the following directory structure: s3://{bucket}/{username}/{notebook-id}/note.json -->
<!--
<property>
<name>zeppelin.notebook.s3.user</name>
<value>user</value>
<description>user name for s3 folder structure</description>
</property>
<property>
<name>zeppelin.notebook.s3.bucket</name>
<value>zeppelin</value>
<description>bucket name for notebook storage</description>
</property>
<property>
<name>zeppelin.notebook.s3.endpoint</name>
<value>s3.amazonaws.com</value>
<description>endpoint for s3 bucket</description>
</property>
<property>
<name>zeppelin.notebook.storage</name>
<value>org.apache.zeppelin.notebook.repo.S3NotebookRepo</value>
<description>notebook persistence layer implementation</description>
</property>
-->
<!-- Additionally, encryption is supported for notebook data stored in S3 -->
<!-- Use the AWS KMS to encrypt data -->
<!-- If used, the EC2 role assigned to the EMR cluster must have rights to use the given key -->
<!-- See https://aws.amazon.com/kms/ and http://docs.aws.amazon.com/kms/latest/developerguide/concepts.html -->
<!--
<property>
<name>zeppelin.notebook.s3.kmsKeyID</name>
<value>AWS-KMS-Key-UUID</value>
<description>AWS KMS key ID used to encrypt notebook data in S3</description>
</property>
-->
<!-- provide region of your KMS key -->
<!-- See http://docs.aws.amazon.com/general/latest/gr/rande.html#kms_region for region codes names -->
<!--
<property>
<name>zeppelin.notebook.s3.kmsKeyRegion</name>
<value>us-east-1</value>
<description>AWS KMS key region in your AWS account</description>
</property>
-->
<!-- Use a custom encryption materials provider to encrypt data -->
<!-- No configuration is given to the provider, so you must use system properties or another means to configure -->
<!-- See https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/EncryptionMaterialsProvider.html -->
<!--
<property>
<name>zeppelin.notebook.s3.encryptionMaterialsProvider</name>
<value>provider implementation class name</value>
<description>Custom encryption materials provider used to encrypt notebook data in S3</description>
</property>
-->
<!-- Server-side encryption enabled for notebooks -->
<!--
<property>
<name>zeppelin.notebook.s3.sse</name>
<value>true</value>
<description>Server-side encryption enabled for notebooks</description>
</property>
-->
<!-- Optional override to control which signature algorithm should be used to sign AWS requests -->
<!-- Set this property to "S3SignerType" if your AWS S3 compatible APIs support only AWS Signature Version 2 such as Ceph. -->
<!--
<property>
<name>zeppelin.notebook.s3.signerOverride</name>
<value>S3SignerType</value>
<description>optional override to control which signature algorithm should be used to sign AWS requests</description>
</property>
-->
<!-- If using Azure for storage use the following settings -->
<!--
<property>
<name>zeppelin.notebook.azure.connectionString</name>
<value>DefaultEndpointsProtocol=https;AccountName=<accountName>;AccountKey=<accountKey></value>
<description>Azure account credentials</description>
</property>
<property>
<name>zeppelin.notebook.azure.share</name>
<value>zeppelin</value>
<description>share name for notebook storage</description>
</property>
<property>
<name>zeppelin.notebook.azure.user</name>
<value>user</value>
<description>optional user name for Azure folder structure</description>
</property>
<property>
<name>zeppelin.notebook.storage</name>
<value>org.apache.zeppelin.notebook.repo.AzureNotebookRepo</value>
<description>notebook persistence layer implementation</description>
</property>
-->
<!-- Notebook storage layer using local file system
<property>
<name>zeppelin.notebook.storage</name>
<value>org.apache.zeppelin.notebook.repo.VFSNotebookRepo</value>
<description>local notebook persistence layer implementation</description>
</property>
-->
<!-- Notebook storage layer using hadoop compatible file system
<property>
<name>zeppelin.notebook.storage</name>
<value>org.apache.zeppelin.notebook.repo.FileSystemNotebookRepo</value>
<description>Hadoop compatible file system notebook persistence layer implementation, such as local file system, hdfs, azure wasb, s3, etc.</description>
</property>
<property>
<name>zeppelin.server.kerberos.keytab</name>
<value></value>
<description>keytab for accessing kerberized hdfs</description>
</property>
<property>
<name>zeppelin.server.kerberos.principal</name>
<value></value>
<description>principal for accessing kerberized hdfs</description>
</property>
-->
<!-- For connecting your Zeppelin with ZeppelinHub -->
<!--
<property>
<name>zeppelin.notebook.storage</name>
<value>org.apache.zeppelin.notebook.repo.GitNotebookRepo, org.apache.zeppelin.notebook.repo.zeppelinhub.ZeppelinHubRepo</value>
<description>two notebook persistence layers (versioned local + ZeppelinHub)</description>
</property>
-->
<!-- MongoDB notebook storage -->
<!--
<property>
<name>zeppelin.notebook.storage</name>
<value>org.apache.zeppelin.notebook.repo.MongoNotebookRepo</value>
<description>notebook persistence layer implementation</description>
</property>
<property>
<name>zeppelin.notebook.mongo.uri</name>
<value>mongodb://localhost</value>
<description>MongoDB connection URI used to connect to a MongoDB database server</description>
</property>
<property>
<name>zeppelin.notebook.mongo.database</name>
<value>zeppelin</value>
<description>database name for notebook storage</description>
</property>
<property>
<name>zeppelin.notebook.mongo.collection</name>
<value>notes</value>
<description>collection name for notebook storage</description>
</property>
<property>
<name>zeppelin.notebook.mongo.autoimport</name>
<value>false</value>
<description>import local notes into MongoDB automatically on startup</description>
</property>
-->
<property>
<name>zeppelin.notebook.storage</name>
<value>org.apache.zeppelin.notebook.repo.GitNotebookRepo</value>
<description>versioned notebook persistence layer implementation</description>
</property>
<property>
<name>zeppelin.notebook.one.way.sync</name>
<value>false</value>
<description>If there are multiple notebook storages, should we treat the first one as the only source of truth?</description>
</property>
<property>
<name>zeppelin.interpreter.dir</name>
<value>interpreter</value>
<description>Interpreter implementation base directory</description>
</property>
<property>
<name>zeppelin.interpreter.localRepo</name>
<value>local-repo</value>
<description>Local repository for interpreter's additional dependency loading</description>
</property>
<property>
<name>zeppelin.interpreter.dep.mvnRepo</name>
<value>http://repo1.maven.org/maven2/</value>
<description>Remote principal repository for interpreter's additional dependency loading</description>
</property>
<property>
<name>zeppelin.dep.localrepo</name>
<value>local-repo</value>
<description>Local repository for dependency loader</description>
</property>
<property>
<name>zeppelin.helium.node.installer.url</name>
<value>https://nodejs.org/dist/</value>
<description>Remote Node installer url for Helium dependency loader</description>
</property>
<property>
<name>zeppelin.helium.npm.installer.url</name>
<value>http://registry.npmjs.org/</value>
<description>Remote Npm installer url for Helium dependency loader</description>
</property>
<property>
<name>zeppelin.helium.yarnpkg.installer.url</name>
<value>https://github.com/yarnpkg/yarn/releases/download/</value>
<description>Remote Yarn package installer url for Helium dependency loader</description>
</property>
<property>
<name>zeppelin.interpreters</name>
<value>org.apache.zeppelin.spark.SparkInterpreter,org.apache.zeppelin.spark.PySparkInterpreter,org.apache.zeppelin.rinterpreter.RRepl,org.apache.zeppelin.rinterpreter.KnitR,org.apache.zeppelin.spark.SparkRInterpreter,org.apache.zeppelin.spark.SparkSqlInterpreter,org.apache.zeppelin.spark.DepInterpreter,org.apache.zeppelin.markdown.Markdown,org.apache.zeppelin.angular.AngularInterpreter,org.apache.zeppelin.shell.ShellInterpreter,org.apache.zeppelin.file.HDFSFileInterpreter,org.apache.zeppelin.flink.FlinkInterpreter,org.apache.zeppelin.python.PythonInterpreter,org.apache.zeppelin.python.PythonInterpreterPandasSql,org.apache.zeppelin.python.PythonCondaInterpreter,org.apache.zeppelin.python.PythonDockerInterpreter,org.apache.zeppelin.lens.LensInterpreter,org.apache.zeppelin.ignite.IgniteInterpreter,org.apache.zeppelin.ignite.IgniteSqlInterpreter,org.apache.zeppelin.cassandra.CassandraInterpreter,org.apache.zeppelin.geode.GeodeOqlInterpreter,org.apache.zeppelin.jdbc.JDBCInterpreter,org.apache.zeppelin.kylin.KylinInterpreter,org.apache.zeppelin.elasticsearch.ElasticsearchInterpreter,org.apache.zeppelin.scalding.ScaldingInterpreter,org.apache.zeppelin.alluxio.AlluxioInterpreter,org.apache.zeppelin.hbase.HbaseInterpreter,org.apache.zeppelin.livy.LivySparkInterpreter,org.apache.zeppelin.livy.LivyPySparkInterpreter,org.apache.zeppelin.livy.LivyPySpark3Interpreter,org.apache.zeppelin.livy.LivySparkRInterpreter,org.apache.zeppelin.livy.LivySparkSQLInterpreter,org.apache.zeppelin.bigquery.BigQueryInterpreter,org.apache.zeppelin.beam.BeamInterpreter,org.apache.zeppelin.pig.PigInterpreter,org.apache.zeppelin.pig.PigQueryInterpreter,org.apache.zeppelin.scio.ScioInterpreter,org.apache.zeppelin.groovy.GroovyInterpreter</value>
<description>Comma separated interpreter configurations. The first interpreter becomes the default</description>
</property>
<property>
<name>zeppelin.interpreter.group.order</name>
<value>spark,md,angular,sh,livy,alluxio,file,psql,flink,python,ignite,lens,cassandra,geode,kylin,elasticsearch,scalding,jdbc,hbase,bigquery,beam,groovy</value>
<description></description>
</property>
<property>
<name>zeppelin.interpreter.connect.timeout</name>
<value>30000</value>
<description>Interpreter process connect timeout in msec.</description>
</property>
<property>
<name>zeppelin.interpreter.output.limit</name>
<value>102400</value>
<description>Output message from interpreter exceeding the limit will be truncated</description>
</property>
<property>
<name>zeppelin.ssl</name>
<value>false</value>
<description>Should SSL be used by the servers?</description>
</property>
<property>
<name>zeppelin.ssl.client.auth</name>
<value>false</value>
<description>Should client authentication be used for SSL connections?</description>
</property>
<property>
<name>zeppelin.ssl.keystore.path</name>
<value>keystore</value>
<description>Path to keystore relative to Zeppelin configuration directory</description>
</property>
<property>
<name>zeppelin.ssl.keystore.type</name>
<value>JKS</value>
<description>The format of the given keystore (e.g. JKS or PKCS12)</description>
</property>
<property>
<name>zeppelin.ssl.keystore.password</name>
<value>change me</value>
<description>Keystore password. Can be obfuscated by the Jetty Password tool</description>
</property>
<!--
<property>
<name>zeppelin.ssl.key.manager.password</name>
<value>change me</value>
<description>Key Manager password. Defaults to keystore password. Can be obfuscated.</description>
</property>
-->
<property>
<name>zeppelin.ssl.truststore.path</name>
<value>truststore</value>
<description>Path to truststore relative to Zeppelin configuration directory. Defaults to the keystore path</description>
</property>
<property>
<name>zeppelin.ssl.truststore.type</name>
<value>JKS</value>
<description>The format of the given truststore (e.g. JKS or PKCS12). Defaults to the same type as the keystore type</description>
</property>
<!--
<property>
<name>zeppelin.ssl.truststore.password</name>
<value>change me</value>
<description>Truststore password. Can be obfuscated by the Jetty Password tool. Defaults to the keystore password</description>
</property>
-->
<property>
<name>zeppelin.server.allowed.origins</name>
<value>*</value>
<description>Allowed sources for REST and WebSocket requests (i.e. http://onehost:8080,http://otherhost.com). If you leave * you are vulnerable to https://issues.apache.org/jira/browse/ZEPPELIN-173</description>
</property>
<property>
<name>zeppelin.anonymous.allowed</name>
<value>false</value>
<description>Anonymous user allowed by default</description>
</property>
<property>
<name>zeppelin.username.force.lowercase</name>
<value>false</value>
<description>Force converting the username to lower case, useful for Active Directory/LDAP. Default is not to change case</description>
</property>
<property>
<name>zeppelin.notebook.default.owner.username</name>
<value></value>
<description>Set owner role by default</description>
</property>
<property>
<name>zeppelin.notebook.public</name>
<value>true</value>
<description>Make notebook public by default when created, private otherwise</description>
</property>
<property>
<name>zeppelin.websocket.max.text.message.size</name>
<value>1024000</value>
<description>Size in characters of the maximum text message to be received by websocket. Defaults to 1024000</description>
</property>
<property>
<name>zeppelin.server.default.dir.allowed</name>
<value>false</value>
<description>Enable directory listings on server.</description>
</property>
<!--
<property>
<name>zeppelin.interpreter.lifecyclemanager.class</name>
<value>org.apache.zeppelin.interpreter.lifecycle.TimeoutLifecycleManager</value>
<description>LifecycleManager class for managing the lifecycle of interpreters, by default interpreter will
be closed after timeout</description>
</property>
<property>
<name>zeppelin.interpreter.lifecyclemanager.timeout.checkinterval</name>
<value>60000</value>
<description>Milliseconds of the interval for checking whether the interpreter has timed out</description>
</property>
<property>
<name>zeppelin.interpreter.lifecyclemanager.timeout.threshold</name>
<value>3600000</value>
<description>Milliseconds of the interpreter timeout threshold, by default it is 1 hour</description>
</property>
-->
<!--
<property>
<name>zeppelin.server.jetty.name</name>
<value>Jetty(7.6.0.v20120127)</value>
<description>Hardcoding Application Server name to Prevent Fingerprinting</description>
</property>
-->
<!--
<property>
<name>zeppelin.server.jetty.request.header.size</name>
<value>8192</value>
<description>Http Request Header Size Limit (to prevent HTTP 413)</description>
</property>
-->
<!--
<property>
<name>zeppelin.server.xframe.options</name>
<value>SAMEORIGIN</value>
<description>The X-Frame-Options HTTP response header can be used to indicate whether or not a browser should be allowed to render a page in a frame/iframe/object.</description>
</property>
-->
<!--
<property>
<name>zeppelin.server.strict.transport</name>
<value>max-age=631138519</value>
<description>The HTTP Strict-Transport-Security response header is a security feature that lets a web site tell browsers that it should only be communicated with using HTTPS, instead of using HTTP. Enable this when Zeppelin is running on HTTPS. Value is in Seconds, the default value is equivalent to 20 years.</description>
</property>
-->
<!--
<property>
<name>zeppelin.server.xxss.protection</name>
<value>1</value>
<description>The HTTP X-XSS-Protection response header is a feature of Internet Explorer, Chrome and Safari that stops pages from loading when they detect reflected cross-site scripting (XSS) attacks. When value is set to 1 and a cross-site scripting attack is detected, the browser will sanitize the page (remove the unsafe parts).</description>
</property>
-->
<!--
<property>
<name>zeppelin.interpreter.callback.portRange</name>
<value>10000:10010</value>
</property>
-->
<!--
<property>
<name>zeppelin.recovery.storage.class</name>
<value>org.apache.zeppelin.interpreter.recovery.FileSystemRecoveryStorage</value>
<description>RecoveryStorage implementation</description>
</property>
-->
<!--
<property>
<name>zeppelin.recovery.dir</name>
<value>recovery</value>
<description>Location where recovery metadata is stored</description>
</property>
-->
<!-- GitHub configurations
<property>
<name>zeppelin.notebook.git.remote.url</name>
<value></value>
<description>remote Git repository URL</description>
</property>
<property>
<name>zeppelin.notebook.git.remote.username</name>
<value>token</value>
<description>remote Git repository username</description>
</property>
<property>
<name>zeppelin.notebook.git.remote.access-token</name>
<value></value>
<description>remote Git repository password</description>
</property>
<property>
<name>zeppelin.notebook.git.remote.origin</name>
<value>origin</value>
<description>Git repository remote</description>
</property>
<property>
<name>zeppelin.notebook.cron.enable</name>
<value>false</value>
<description>Notebook enable cron scheduler feature</description>
</property>
<property>
<name>zeppelin.notebook.cron.folders</name>
<value></value>
<description>Notebook cron folders</description>
</property>
-->
</configuration>

View File

@ -1,47 +0,0 @@
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License. See accompanying LICENSE file.
*/
package org.apache.hadoop.yarn.submarine.client.cli;
import org.apache.commons.cli.ParseException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.yarn.submarine.common.ClientContext;
import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.submarine.common.exception.SubmarineException;
import java.io.IOException;
public abstract class AbstractCli implements Tool {
protected ClientContext clientContext;
public AbstractCli(ClientContext cliContext) {
this.clientContext = cliContext;
}
@Override
public abstract int run(String[] args)
throws ParseException, IOException, YarnException, InterruptedException,
SubmarineException;
@Override
public void setConf(Configuration conf) {
clientContext.setSubmarineConfig(conf);
}
@Override
public Configuration getConf() {
return clientContext.getSubmarineConfig();
}
}

View File

@ -1,106 +0,0 @@
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License. See accompanying LICENSE file.
*/
package org.apache.hadoop.yarn.submarine.client.cli;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.submarine.client.cli.runjob.RunJobCli;
import org.apache.hadoop.yarn.submarine.common.ClientContext;
import org.apache.hadoop.yarn.submarine.runtimes.RuntimeFactory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class Cli {
private static final Logger LOG =
LoggerFactory.getLogger(Cli.class);
private static void printHelp() {
StringBuilder helpMsg = new StringBuilder();
helpMsg.append("\n\nUsage: <object> [<action>] [<args>]\n");
helpMsg.append(" Below are all objects / actions:\n");
helpMsg.append(" job \n");
helpMsg.append(" run : run a job, please see 'job run --help' for usage \n");
helpMsg.append(" show : get status of job, please see 'job show --help' for usage \n");
helpMsg.append(" kill : kill a job, please see 'job kill --help' for usage \n");
System.out.println(helpMsg.toString());
}
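// Illustrative invocations (added for clarity, not part of the original source). Assuming
// the client is started through "yarn jar" with the Submarine jar (the jar name below is a
// placeholder), the sub-commands dispatched by main() look like:
//   yarn jar hadoop-yarn-submarine-<version>.jar job run --name tf-job-001 ...
//   yarn jar hadoop-yarn-submarine-<version>.jar job show --name tf-job-001
//   yarn jar hadoop-yarn-submarine-<version>.jar job kill --name tf-job-001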
private static ClientContext getClientContext() {
Configuration conf = new YarnConfiguration();
ClientContext clientContext = new ClientContext();
clientContext.setConfiguration(conf);
RuntimeFactory runtimeFactory = RuntimeFactory.getRuntimeFactory(
clientContext);
clientContext.setRuntimeFactory(runtimeFactory);
return clientContext;
}
public static void main(String[] args) throws Exception {
System.out.println(" _ _ \n"
+ " | | (_) \n"
+ " ___ _ _ | |__ _ __ ___ __ _ _ __ _ _ __ ___ \n"
+ " / __|| | | || '_ \\ | '_ ` _ \\ / _` || '__|| || '_ \\ / _ \\\n"
+ " \\__ \\| |_| || |_) || | | | | || (_| || | | || | | || __/\n"
+ " |___/ \\__,_||_.__/ |_| |_| |_| \\__,_||_| |_||_| |_| \\___|\n"
+ " \n"
+ " ?\n"
+ " ~~~~~~~~~~~~~~~~~~~~~~~~~~~|^\"~~~~~~~~~~~~~~~~~~~~~~~~~o~~~~~~~~~~~\n"
+ " o | o __o\n"
+ " o | o |X__>\n"
+ " ___o | __o\n"
+ " (X___>-- __|__ |X__> o\n"
+ " | \\ __o\n"
+ " | \\ |X__>\n"
+ " _______________________|_______\\________________\n"
+ " < \\____________ _\n"
+ " \\ \\ (_)\n"
+ " \\ O O O >=)\n"
+ " \\__________________________________________________________/ (_)\n"
+ "\n");
if (CliUtils.argsForHelp(args)) {
printHelp();
System.exit(0);
}
if (args.length < 2) {
LOG.error("Bad parameters specified.");
printHelp();
System.exit(-1);
}
String[] moduleArgs = Arrays.copyOfRange(args, 2, args.length);
ClientContext clientContext = getClientContext();
if (args[0].equals("job")) {
String subCmd = args[1];
if (subCmd.equals(CliConstants.RUN)) {
new RunJobCli(clientContext).run(moduleArgs);
} else if (subCmd.equals(CliConstants.SHOW)) {
new ShowJobCli(clientContext).run(moduleArgs);
} else if (subCmd.equals(CliConstants.KILL)) {
new KillJobCli(clientContext).run(moduleArgs);
} else {
printHelp();
throw new IllegalArgumentException("Unknown option for job");
}
} else {
printHelp();
throw new IllegalArgumentException("Unrecognized option: " + args[0]);
}
}
}

View File

@ -1,65 +0,0 @@
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License. See accompanying LICENSE file.
*/
package org.apache.hadoop.yarn.submarine.client.cli;
/*
* NOTE: use lowercase + "_" for the option name
*/
public class CliConstants {
public static final String KILL = "kill";
public static final String RUN = "run";
public static final String SERVE = "serve";
public static final String LIST = "list";
public static final String SHOW = "show";
public static final String NAME = "name";
public static final String INPUT_PATH = "input_path";
public static final String CHECKPOINT_PATH = "checkpoint_path";
public static final String SAVED_MODEL_PATH = "saved_model_path";
public static final String N_WORKERS = "num_workers";
public static final String N_SERVING_TASKS = "num_serving_tasks";
public static final String N_PS = "num_ps";
public static final String WORKER_RES = "worker_resources";
public static final String SERVING_RES = "serving_resources";
public static final String PS_RES = "ps_resources";
public static final String DOCKER_IMAGE = "docker_image";
public static final String QUEUE = "queue";
public static final String TENSORBOARD = "tensorboard";
public static final String TENSORBOARD_RESOURCES = "tensorboard_resources";
public static final String TENSORBOARD_DEFAULT_RESOURCES =
"memory=4G,vcores=1";
public static final String ARG_CONF = "conf";
public static final String WORKER_LAUNCH_CMD = "worker_launch_cmd";
public static final String SERVING_LAUNCH_CMD = "serving_launch_cmd";
public static final String PS_LAUNCH_CMD = "ps_launch_cmd";
public static final String ENV = "env";
public static final String VERBOSE = "verbose";
public static final String SERVING_FRAMEWORK = "serving_framework";
public static final String STOP = "stop";
public static final String WAIT_JOB_FINISH = "wait_job_finish";
public static final String PS_DOCKER_IMAGE = "ps_docker_image";
public static final String WORKER_DOCKER_IMAGE = "worker_docker_image";
public static final String QUICKLINK = "quicklink";
public static final String TENSORBOARD_DOCKER_IMAGE =
"tensorboard_docker_image";
public static final String LOCALIZATION = "localization";
public static final String KEYTAB = "keytab";
public static final String PRINCIPAL = "principal";
public static final String DISTRIBUTE_KEYTAB = "distribute_keytab";
public static final String YAML_CONFIG = "f";
public static final String INSECURE_CLUSTER = "insecure";
public static final String FRAMEWORK = "framework";
}

View File

@ -1,124 +0,0 @@
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
* <p>
* http://www.apache.org/licenses/LICENSE-2.0
* <p>
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License. See accompanying LICENSE file.
*/
package org.apache.hadoop.yarn.submarine.client.cli;
import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.yarn.submarine.client.cli.param.runjob.RunJobParameters;
import org.apache.hadoop.yarn.submarine.common.exception.SubmarineRuntimeException;
import org.apache.hadoop.yarn.submarine.common.fs.RemoteDirectoryManager;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import static org.apache.hadoop.yarn.submarine.client.cli.CliConstants.KEYTAB;
import static org.apache.hadoop.yarn.submarine.client.cli.CliConstants.PRINCIPAL;
public class CliUtils {
private static final Logger LOG =
LoggerFactory.getLogger(CliUtils.class);
/**
* Replaces placeholder patterns inside the specified CLI launch command.
*
* @return launch command after pattern replacement
*/
public static String replacePatternsInLaunchCommand(String specifiedCli,
RunJobParameters jobRunParameters,
RemoteDirectoryManager directoryManager) throws IOException {
String input = jobRunParameters.getInputPath();
String jobDir = jobRunParameters.getCheckpointPath();
String savedModelDir = jobRunParameters.getSavedModelPath();
Map<String, String> replacePattern = new HashMap<>();
if (jobDir != null) {
replacePattern.put("%" + CliConstants.CHECKPOINT_PATH + "%", jobDir);
}
if (input != null) {
replacePattern.put("%" + CliConstants.INPUT_PATH + "%", input);
}
if (savedModelDir != null) {
replacePattern.put("%" + CliConstants.SAVED_MODEL_PATH + "%",
savedModelDir);
}
String newCli = specifiedCli;
for (Map.Entry<String, String> replace : replacePattern.entrySet()) {
newCli = newCli.replace(replace.getKey(), replace.getValue());
}
return newCli;
}
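// Illustrative example (added for clarity, not part of the original source): given
//   --worker_launch_cmd "python train.py --input=%input_path% --dir=%checkpoint_path%"
//   --input_path hdfs://default/dataset/cifar-10-data
//   --checkpoint_path hdfs://default/tmp/cifar-10-jobdir
// the returned launch command is
//   "python train.py --input=hdfs://default/dataset/cifar-10-data --dir=hdfs://default/tmp/cifar-10-jobdir".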
// Is it for help?
public static boolean argsForHelp(String[] args) {
if (args == null || args.length == 0)
return true;
if (args.length == 1) {
return args[0].equals("-h") || args[0].equals("--help");
}
return false;
}
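/**
* Descriptive note (added for clarity, not part of the original source): this method is a
* no-op when Kerberos security is disabled. When security is enabled it either logs in
* from the supplied keytab/principal pair or, if neither is given, requires an existing
* login obtained beforehand (e.g. via kinit); supplying only one of the two parameters
* results in a SubmarineRuntimeException.
*/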
public static void doLoginIfSecure(String keytab, String principal) throws
IOException {
if (!UserGroupInformation.isSecurityEnabled()) {
return;
}
if (StringUtils.isEmpty(keytab) || StringUtils.isEmpty(principal)) {
if (StringUtils.isNotEmpty(keytab)) {
SubmarineRuntimeException e = new SubmarineRuntimeException("The " +
"parameter of " + PRINCIPAL + " is missing.");
LOG.error(e.getMessage(), e);
throw e;
}
if (StringUtils.isNotEmpty(principal)) {
SubmarineRuntimeException e = new SubmarineRuntimeException("The " +
"parameter of " + KEYTAB + " is missing.");
LOG.error(e.getMessage(), e);
throw e;
}
UserGroupInformation user = UserGroupInformation.getCurrentUser();
if(user == null || user.getAuthenticationMethod() ==
UserGroupInformation.AuthenticationMethod.SIMPLE) {
SubmarineRuntimeException e = new SubmarineRuntimeException("Failed " +
"to authenticate in secure environment. Please run kinit " +
"command in advance or use " + "--" + KEYTAB + "/--" + PRINCIPAL +
" parameters");
LOG.error(e.getMessage(), e);
throw e;
}
LOG.info("Submarine job is submitted by user: " + user.getUserName());
return;
}
File keytabFile = new File(keytab);
if (!keytabFile.exists()) {
SubmarineRuntimeException e = new SubmarineRuntimeException("No " +
"keytab localized at " + keytab);
LOG.error(e.getMessage(), e);
throw e;
}
UserGroupInformation.loginUserFromKeytab(principal, keytab);
}
}

View File

@ -1,24 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.client.cli;
/**
* Represents a Submarine command.
*/
public enum Command {
RUN_JOB, SHOW_JOB, KILL_JOB
}

View File

@ -1,113 +0,0 @@
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License. See accompanying LICENSE file.
*/
package org.apache.hadoop.yarn.submarine.client.cli;
import static org.apache.hadoop.yarn.client.api.AppAdminClient.DEFAULT_TYPE;
import java.io.IOException;
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.GnuParser;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;
import org.apache.hadoop.yarn.client.api.AppAdminClient;
import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.submarine.client.cli.param.KillJobParameters;
import org.apache.hadoop.yarn.submarine.client.cli.param.ParametersHolder;
import org.apache.hadoop.yarn.submarine.common.ClientContext;
import org.apache.hadoop.yarn.submarine.common.exception.SubmarineException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.google.common.annotations.VisibleForTesting;
public class KillJobCli extends AbstractCli {
private static final Logger LOG = LoggerFactory.getLogger(KillJobCli.class);
private Options options;
private ParametersHolder parametersHolder;
public KillJobCli(ClientContext cliContext) {
super(cliContext);
options = generateOptions();
}
public void printUsages() {
new HelpFormatter().printHelp("job kill", options);
}
private Options generateOptions() {
Options options = new Options();
options.addOption(CliConstants.NAME, true, "Name of the job");
options.addOption("h", "help", false, "Print help");
return options;
}
private void parseCommandLineAndGetKillJobParameters(String[] args)
throws IOException, YarnException {
// Do parsing
GnuParser parser = new GnuParser();
CommandLine cli;
try {
cli = parser.parse(options, args);
parametersHolder =
ParametersHolder.createWithCmdLine(cli, Command.KILL_JOB);
parametersHolder.updateParameters(clientContext);
} catch (ParseException e) {
LOG.error("Error parsing command-line options: " + e.getMessage());
printUsages();
}
}
@VisibleForTesting
protected boolean killJob() throws IOException, YarnException {
String jobName = getParameters().getName();
AppAdminClient appAdminClient = AppAdminClient
.createAppAdminClient(DEFAULT_TYPE, clientContext.getYarnConfig());
if (appAdminClient.actionStop(jobName) != 0) {
LOG.error("appAdminClient failed to stop application");
return false;
}
if (appAdminClient.actionDestroy(jobName) != 0) {
LOG.error("appAdminClient failed to destroy application");
return false;
}
appAdminClient.stop();
return true;
}
@VisibleForTesting
public KillJobParameters getParameters() {
return (KillJobParameters) parametersHolder.getParameters();
}
@Override
public int run(String[] args) throws ParseException, IOException,
YarnException, InterruptedException, SubmarineException {
if (CliUtils.argsForHelp(args)) {
printUsages();
return 0;
}
parseCommandLineAndGetKillJobParameters(args);
if (killJob()) {
LOG.info("Killed job successfully!");
}
return 0;
}
}

View File

@ -1,126 +0,0 @@
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
* <p>
* http://www.apache.org/licenses/LICENSE-2.0
* <p>
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License. See accompanying LICENSE file.
*/
package org.apache.hadoop.yarn.submarine.client.cli;
import com.google.common.annotations.VisibleForTesting;
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.GnuParser;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;
import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.submarine.client.cli.param.ParametersHolder;
import org.apache.hadoop.yarn.submarine.client.cli.param.ShowJobParameters;
import org.apache.hadoop.yarn.submarine.common.ClientContext;
import org.apache.hadoop.yarn.submarine.common.exception.SubmarineException;
import org.apache.hadoop.yarn.submarine.runtimes.common.StorageKeyConstants;
import org.apache.hadoop.yarn.submarine.runtimes.common.SubmarineStorage;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import java.util.Map;
public class ShowJobCli extends AbstractCli {
private static final Logger LOG = LoggerFactory.getLogger(ShowJobCli.class);
private Options options;
private ParametersHolder parametersHolder;
public ShowJobCli(ClientContext cliContext) {
super(cliContext);
options = generateOptions();
}
public void printUsages() {
new HelpFormatter().printHelp("job show", options);
}
private Options generateOptions() {
Options options = new Options();
options.addOption(CliConstants.NAME, true, "Name of the job");
options.addOption("h", "help", false, "Print help");
return options;
}
private void parseCommandLineAndGetShowJobParameters(String[] args)
throws IOException, YarnException {
// Do parsing
GnuParser parser = new GnuParser();
CommandLine cli;
try {
cli = parser.parse(options, args);
parametersHolder = ParametersHolder
.createWithCmdLine(cli, Command.SHOW_JOB);
parametersHolder.updateParameters(clientContext);
} catch (ParseException e) {
printUsages();
}
}
private void printIfNotNull(String keyForPrint, String keyInStorage,
Map<String, String> jobInfo) {
if (jobInfo.containsKey(keyInStorage)) {
System.out.println("\t" + keyForPrint + ": " + jobInfo.get(keyInStorage));
}
}
private void printJobInfo(Map<String, String> jobInfo) {
System.out.println("Job Meta Info:");
printIfNotNull("Application Id", StorageKeyConstants.APPLICATION_ID,
jobInfo);
printIfNotNull("Input Path", StorageKeyConstants.INPUT_PATH, jobInfo);
printIfNotNull("Saved Model Path", StorageKeyConstants.SAVED_MODEL_PATH,
jobInfo);
printIfNotNull("Checkpoint Path", StorageKeyConstants.CHECKPOINT_PATH,
jobInfo);
printIfNotNull("Run Parameters", StorageKeyConstants.JOB_RUN_ARGS,
jobInfo);
}
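// Illustrative output (added for clarity, not part of the original source), assuming all
// keys are present in the storage map:
//   Job Meta Info:
//     Application Id: application_1234567890123_0001
//     Input Path: hdfs://default/dataset/cifar-10-data
//     Saved Model Path: hdfs://default/tmp/cifar-10-jobdir
//     Checkpoint Path: hdfs://default/tmp/cifar-10-jobdir
//     Run Parameters: ...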
@VisibleForTesting
protected void getAndPrintJobInfo() throws IOException {
SubmarineStorage storage =
clientContext.getRuntimeFactory().getSubmarineStorage();
Map<String, String> jobInfo = null;
try {
jobInfo = storage.getJobInfoByName(getParameters().getName());
} catch (IOException e) {
LOG.error("Failed to retrieve job info", e);
throw e;
}
printJobInfo(jobInfo);
}
@VisibleForTesting
public ShowJobParameters getParameters() {
return (ShowJobParameters) parametersHolder.getParameters();
}
@Override
public int run(String[] args)
throws ParseException, IOException, YarnException, InterruptedException,
SubmarineException {
if (CliUtils.argsForHelp(args)) {
printUsages();
return 0;
}
parseCommandLineAndGetShowJobParameters(args);
getAndPrintJobInfo();
return 0;
}
}

View File

@ -1,54 +0,0 @@
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License. See accompanying LICENSE file.
*/
package org.apache.hadoop.yarn.submarine.client.cli.param;
import org.apache.commons.cli.ParseException;
import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.submarine.client.cli.CliConstants;
import org.apache.hadoop.yarn.submarine.common.ClientContext;
import org.apache.hadoop.yarn.submarine.common.conf.SubmarineLogs;
import java.io.IOException;
/**
* Base class of all parameters.
*/
public abstract class BaseParameters {
private String name;
public void updateParameters(ParametersHolder parametersHolder,
ClientContext clientContext)
throws ParseException, IOException, YarnException {
String name = parametersHolder.getOptionValue(CliConstants.NAME);
if (name == null) {
throw new ParseException("--name is absent");
}
if (parametersHolder.hasOption(CliConstants.VERBOSE)) {
SubmarineLogs.verboseOn();
}
this.setName(name);
}
public String getName() {
return name;
}
public BaseParameters setName(String name) {
this.name = name;
return this;
}
}

View File

@ -1,24 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.client.cli.param;
/**
* Represents the source of configuration.
*/
public enum ConfigType {
YAML, CLI
}

View File

@ -1,19 +0,0 @@
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License. See accompanying LICENSE file.
*/
package org.apache.hadoop.yarn.submarine.client.cli.param;
public class KillJobParameters extends BaseParameters {
}

View File

@ -1,133 +0,0 @@
/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* <p>
* http://www.apache.org/licenses/LICENSE-2.0
* <p>
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.client.cli.param;
import org.apache.commons.cli.ParseException;
import java.util.Arrays;
import java.util.List;
/**
* Localization parameter.
* */
public class Localization {
private String mountPermissionPattern = "(wr|rw)$";
/**
* Regex for directory/file path in container.
* YARN only supports absolute paths for mounts, but we can
* support some relative paths.
* For relative paths, we only allow ".", "./" and "./name";
* a relative path like "./a/b" is not allowed.
* "." and "./" mean the original dir/file name in the container working directory.
* "./name" means use the same or a new "name" in the container working directory.
* An absolute path means the same path in the container filesystem.
*/
private String localPathPattern = "((^\\.$)|(^\\./$)|(^\\./[^/]+)|(^/.*))";
private String remoteUri;
private String localPath;
// Read write by default
private String mountPermission = "rw";
private static final List<String> SUPPORTED_SCHEME = Arrays.asList(
"hdfs", "oss", "s3a", "s3n", "wasb",
"wasbs", "abfs", "abfss", "adl", "har",
"ftp", "http", "https", "viewfs", "swebhdfs",
"webhdfs", "swift");
public void parse(String arg) throws ParseException {
String[] tokens = arg.split(":");
int minimum = "a:b".split(":").length;
int minimumWithPermission = "a:b:rw".split(":").length;
int minimumParts = minimum;
int miniPartsWithRemoteScheme = "scheme://a:b".split(":").length;
int maximumParts = "scheme://a:b:rw".split(":").length;
// If remote uri starts with a remote scheme
if (isSupportedScheme(tokens[0])) {
minimumParts = miniPartsWithRemoteScheme;
}
if (tokens.length < minimumParts
|| tokens.length > maximumParts) {
throw new ParseException("Invalid parameter,"
+ "should be \"remoteUri:localPath[:rw|:wr]\" "
+ "format for --localizations");
}
/**
* RemoteUri starts with remote scheme.
* Merge parts 0 and 1 to build an hdfs path in token[0].
* token[1] will be localPath to ease the following logic.
* */
if (minimumParts == miniPartsWithRemoteScheme) {
tokens[0] = tokens[0] + ":" + tokens[1];
tokens[1] = tokens[2];
if (tokens.length == maximumParts) {
// Has permission part
mountPermission = tokens[maximumParts - 1];
}
}
// RemoteUri starts with linux file path
if (minimumParts == minimum
&& tokens.length == minimumWithPermission) {
// Has permission part
mountPermission = tokens[minimumWithPermission - 1];
}
remoteUri = tokens[0];
localPath = tokens[1];
if (!localPath.matches(localPathPattern)) {
throw new ParseException("Invalid local file path: "
+ localPath
+ ", it only supports \".\", \"./\", \"./name\" and "
+ "absolute paths.");
}
if (!mountPermission.matches(mountPermissionPattern)) {
throw new ParseException("Invalid mount permission (ro is not "
+ "supported yet), " + mountPermission);
}
}
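// Illustrative inputs (added for clarity, not part of the original source), all accepted
// by parse():
//   "hdfs://ns1/user/data:."           -> remoteUri=hdfs://ns1/user/data, localPath=".", permission=rw (default)
//   "hdfs://ns1/user/data:./data:rw"   -> remoteUri=hdfs://ns1/user/data, localPath="./data", permission=rw
//   "/etc/krb5.conf:/etc/krb5.conf:wr" -> plain local remoteUri, absolute container path, permission=wr
// A value such as "hdfs://ns1/user/data:./a/b" is rejected by localPathPattern.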
public String getRemoteUri() {
return remoteUri;
}
public void setRemoteUri(String rUti) {
this.remoteUri = rUti;
}
public String getLocalPath() {
return localPath;
}
public void setLocalPath(String lPath) {
this.localPath = lPath;
}
public String getMountPermission() {
return mountPermission;
}
public void setMountPermission(String mPermission) {
this.mountPermission = mPermission;
}
private boolean isSupportedScheme(String scheme) {
return SUPPORTED_SCHEME.contains(scheme);
}
}

View File

@ -1,443 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.client.cli.param;
import static org.apache.hadoop.yarn.submarine.client.cli.runjob.RunJobCli.YAML_PARSE_FAILED;
import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.ParseException;
import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.submarine.client.cli.CliConstants;
import org.apache.hadoop.yarn.submarine.client.cli.Command;
import org.apache.hadoop.yarn.submarine.client.cli.param.runjob.PyTorchRunJobParameters;
import org.apache.hadoop.yarn.submarine.client.cli.param.runjob.TensorFlowRunJobParameters;
import org.apache.hadoop.yarn.submarine.client.cli.param.yaml.Configs;
import org.apache.hadoop.yarn.submarine.client.cli.param.yaml.Role;
import org.apache.hadoop.yarn.submarine.client.cli.param.yaml.Roles;
import org.apache.hadoop.yarn.submarine.client.cli.param.yaml.Scheduling;
import org.apache.hadoop.yarn.submarine.client.cli.param.yaml.Security;
import org.apache.hadoop.yarn.submarine.client.cli.param.yaml.TensorBoard;
import org.apache.hadoop.yarn.submarine.client.cli.param.yaml.YamlConfigFile;
import org.apache.hadoop.yarn.submarine.client.cli.param.yaml.YamlParseException;
import org.apache.hadoop.yarn.submarine.client.cli.runjob.Framework;
import org.apache.hadoop.yarn.submarine.common.ClientContext;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.google.common.collect.ImmutableSet;
import com.google.common.collect.Lists;
import com.google.common.collect.Maps;
/**
* This class acts as a wrapper of {@code CommandLine} values along with
* YAML configuration values.
* YAML configuration is only stored if the -f &lt;filename&gt;
* option is specified along with the CLI arguments.
* Using this wrapper class makes it easy to deal with
* any form of configuration source potentially added into Submarine
* in the future.
* If both a YAML and a CLI value are found for a config, it is an error.
*/
public final class ParametersHolder {
private static final Logger LOG =
LoggerFactory.getLogger(ParametersHolder.class);
public static final String SUPPORTED_FRAMEWORKS_MESSAGE =
"TensorFlow and PyTorch are the only supported frameworks for now!";
public static final String SUPPORTED_COMMANDS_MESSAGE =
"'Show job' and 'run job' are the only supported commands for now!";
private final CommandLine parsedCommandLine;
private final Map<String, String> yamlStringConfigs;
private final Map<String, List<String>> yamlListConfigs;
private final ConfigType configType;
private Command command;
private final Set<String> onlyDefinedWithCliArgs = ImmutableSet.of(
CliConstants.VERBOSE);
private final Framework framework;
private final BaseParameters parameters;
private ParametersHolder(CommandLine parsedCommandLine,
YamlConfigFile yamlConfig, ConfigType configType, Command command)
throws ParseException, YarnException {
this.parsedCommandLine = parsedCommandLine;
this.yamlStringConfigs = initStringConfigValues(yamlConfig);
this.yamlListConfigs = initListConfigValues(yamlConfig);
this.configType = configType;
this.command = command;
this.framework = determineFrameworkType();
this.ensureOnlyValidSectionsAreDefined(yamlConfig);
this.parameters = createParameters();
}
private BaseParameters createParameters() {
if (command == Command.RUN_JOB) {
if (framework == Framework.TENSORFLOW) {
return new TensorFlowRunJobParameters();
} else if (framework == Framework.PYTORCH) {
return new PyTorchRunJobParameters();
} else {
throw new UnsupportedOperationException(SUPPORTED_FRAMEWORKS_MESSAGE);
}
} else if (command == Command.SHOW_JOB) {
return new ShowJobParameters();
} else if (command == Command.KILL_JOB) {
return new KillJobParameters();
} else {
throw new UnsupportedOperationException(SUPPORTED_COMMANDS_MESSAGE);
}
}
private void ensureOnlyValidSectionsAreDefined(YamlConfigFile yamlConfig) {
if (isCommandRunJob() && isFrameworkPyTorch() &&
isPsSectionDefined(yamlConfig)) {
throw new YamlParseException(
"PS section should not be defined when PyTorch " +
"is the selected framework!");
}
if (isCommandRunJob() && isFrameworkPyTorch() &&
isTensorboardSectionDefined(yamlConfig)) {
throw new YamlParseException(
"TensorBoard section should not be defined when PyTorch " +
"is the selected framework!");
}
}
private boolean isCommandRunJob() {
return command == Command.RUN_JOB;
}
private boolean isFrameworkPyTorch() {
return framework == Framework.PYTORCH;
}
private boolean isPsSectionDefined(YamlConfigFile yamlConfig) {
return yamlConfig != null &&
yamlConfig.getRoles() != null &&
yamlConfig.getRoles().getPs() != null;
}
private boolean isTensorboardSectionDefined(YamlConfigFile yamlConfig) {
return yamlConfig != null &&
yamlConfig.getTensorBoard() != null;
}
private Framework determineFrameworkType()
throws ParseException, YarnException {
if (!isCommandRunJob()) {
return null;
}
String frameworkStr = getOptionValue(CliConstants.FRAMEWORK);
if (frameworkStr == null) {
LOG.info("Framework is not defined in config, falling back to " +
"TensorFlow as a default.");
return Framework.TENSORFLOW;
}
Framework framework = Framework.parseByValue(frameworkStr);
if (framework == null) {
if (getConfigType() == ConfigType.CLI) {
throw new ParseException("Failed to parse Framework type! "
+ "Valid values are: " + Framework.getValues());
} else {
throw new YamlParseException(YAML_PARSE_FAILED +
", framework is defined, but it has an invalid value! " +
"Valid values are: " + Framework.getValues());
}
}
return framework;
}
/**
* Maps every value coming from the passed yamlConfig to {@code CliConstants}.
* @param yamlConfig Parsed YAML config
* @return A map of config values, keys are {@code CliConstants}
* and values are Strings.
*/
private Map<String, String> initStringConfigValues(
YamlConfigFile yamlConfig) {
if (yamlConfig == null) {
return Collections.emptyMap();
}
Map<String, String> yamlConfigValues = Maps.newHashMap();
Roles roles = yamlConfig.getRoles();
initGenericConfigs(yamlConfig, yamlConfigValues);
initPs(yamlConfigValues, roles.getPs());
initWorker(yamlConfigValues, roles.getWorker());
initScheduling(yamlConfigValues, yamlConfig.getScheduling());
initSecurity(yamlConfigValues, yamlConfig.getSecurity());
initTensorBoard(yamlConfigValues, yamlConfig.getTensorBoard());
return yamlConfigValues;
}
private Map<String, List<String>> initListConfigValues(
YamlConfigFile yamlConfig) {
if (yamlConfig == null) {
return Collections.emptyMap();
}
Map<String, List<String>> yamlConfigValues = Maps.newHashMap();
Configs configs = yamlConfig.getConfigs();
yamlConfigValues.put(CliConstants.LOCALIZATION, configs.getLocalizations());
yamlConfigValues.put(CliConstants.ENV,
convertToEnvsList(configs.getEnvs()));
yamlConfigValues.put(CliConstants.QUICKLINK, configs.getQuicklinks());
return yamlConfigValues;
}
private void initGenericConfigs(YamlConfigFile yamlConfig,
Map<String, String> yamlConfigs) {
yamlConfigs.put(CliConstants.NAME, yamlConfig.getSpec().getName());
yamlConfigs.put(CliConstants.FRAMEWORK,
yamlConfig.getSpec().getFramework());
Configs configs = yamlConfig.getConfigs();
yamlConfigs.put(CliConstants.INPUT_PATH, configs.getInputPath());
yamlConfigs.put(CliConstants.CHECKPOINT_PATH, configs.getCheckpointPath());
yamlConfigs.put(CliConstants.SAVED_MODEL_PATH, configs.getSavedModelPath());
yamlConfigs.put(CliConstants.DOCKER_IMAGE, configs.getDockerImage());
yamlConfigs.put(CliConstants.WAIT_JOB_FINISH, configs.getWaitJobFinish());
}
private void initPs(Map<String, String> yamlConfigs, Role ps) {
if (ps == null) {
return;
}
yamlConfigs.put(CliConstants.N_PS, String.valueOf(ps.getReplicas()));
yamlConfigs.put(CliConstants.PS_RES, ps.getResources());
yamlConfigs.put(CliConstants.PS_DOCKER_IMAGE, ps.getDockerImage());
yamlConfigs.put(CliConstants.PS_LAUNCH_CMD, ps.getLaunchCmd());
}
private void initWorker(Map<String, String> yamlConfigs, Role worker) {
if (worker == null) {
return;
}
yamlConfigs.put(CliConstants.N_WORKERS,
String.valueOf(worker.getReplicas()));
yamlConfigs.put(CliConstants.WORKER_RES, worker.getResources());
yamlConfigs.put(CliConstants.WORKER_DOCKER_IMAGE, worker.getDockerImage());
yamlConfigs.put(CliConstants.WORKER_LAUNCH_CMD, worker.getLaunchCmd());
}
private void initScheduling(Map<String, String> yamlConfigValues,
Scheduling scheduling) {
if (scheduling == null) {
return;
}
yamlConfigValues.put(CliConstants.QUEUE, scheduling.getQueue());
}
private void initSecurity(Map<String, String> yamlConfigValues,
Security security) {
if (security == null) {
return;
}
yamlConfigValues.put(CliConstants.KEYTAB, security.getKeytab());
yamlConfigValues.put(CliConstants.PRINCIPAL, security.getPrincipal());
yamlConfigValues.put(CliConstants.DISTRIBUTE_KEYTAB,
String.valueOf(security.isDistributeKeytab()));
}
private void initTensorBoard(Map<String, String> yamlConfigValues,
TensorBoard tensorBoard) {
if (tensorBoard == null) {
return;
}
yamlConfigValues.put(CliConstants.TENSORBOARD, Boolean.TRUE.toString());
yamlConfigValues.put(CliConstants.TENSORBOARD_DOCKER_IMAGE,
tensorBoard.getDockerImage());
yamlConfigValues.put(CliConstants.TENSORBOARD_RESOURCES,
tensorBoard.getResources());
}
private List<String> convertToEnvsList(Map<String, String> envs) {
if (envs == null) {
return Collections.emptyList();
}
return envs.entrySet().stream()
.map(e -> String.format("%s=%s", e.getKey(), e.getValue()))
.collect(Collectors.toList());
}
public static ParametersHolder createWithCmdLine(CommandLine cli,
Command command) throws ParseException, YarnException {
return new ParametersHolder(cli, null, ConfigType.CLI, command);
}
public static ParametersHolder createWithCmdLineAndYaml(CommandLine cli,
YamlConfigFile yamlConfig, Command command) throws ParseException,
YarnException {
return new ParametersHolder(cli, yamlConfig, ConfigType.YAML, command);
}
/**
* Gets the option value, either from the CLI arguments or YAML config,
* if present.
* @param option Name of the config.
* @return The value of the config
*/
public String getOptionValue(String option) throws YarnException {
ensureConfigIsDefinedOnce(option, true);
if (onlyDefinedWithCliArgs.contains(option) ||
parsedCommandLine.hasOption(option)) {
return getValueFromCLI(option);
}
return getValueFromYaml(option);
}
/**
* Gets the option values, either from the CLI arguments or YAML config,
* if present.
* @param option Name of the config.
* @return The values of the config
*/
public List<String> getOptionValues(String option) throws YarnException {
ensureConfigIsDefinedOnce(option, false);
if (onlyDefinedWithCliArgs.contains(option) ||
parsedCommandLine.hasOption(option)) {
return getValuesFromCLI(option);
}
return getValuesFromYaml(option);
}
private void ensureConfigIsDefinedOnce(String option, boolean stringValue)
throws YarnException {
boolean definedWithYaml;
if (stringValue) {
definedWithYaml = yamlStringConfigs.containsKey(option);
} else {
definedWithYaml = yamlListConfigs.containsKey(option);
}
if (parsedCommandLine.hasOption(option) && definedWithYaml) {
throw new YarnException(String.format("Config '%s' is defined both " +
"with YAML config and with CLI argument, please only use either way!",
option));
}
}
private String getValueFromCLI(String option) {
String value = parsedCommandLine.getOptionValue(option);
if (LOG.isDebugEnabled()) {
LOG.debug("Found config value {} for key {} " +
"from CLI configuration.", value, option);
}
return value;
}
private List<String> getValuesFromCLI(String option) {
String[] optionValues = parsedCommandLine.getOptionValues(option);
if (optionValues != null) {
List<String> values = Arrays.asList(optionValues);
if (LOG.isDebugEnabled()) {
LOG.debug("Found config values {} for key {} " +
"from CLI configuration.", values, option);
}
return values;
} else {
if (LOG.isDebugEnabled()) {
LOG.debug("No config values found for key {} " +
"from CLI configuration.", option);
}
return Lists.newArrayList();
}
}
private String getValueFromYaml(String option) {
String value = yamlStringConfigs.get(option);
if (LOG.isDebugEnabled()) {
LOG.debug("Found config value {} for key {} " +
"from YAML configuration.", value, option);
}
return value;
}
private List<String> getValuesFromYaml(String option) {
List<String> values = yamlListConfigs.get(option);
if (LOG.isDebugEnabled()) {
LOG.debug("Found config values {} for key {} " +
"from YAML configuration.", values, option);
}
return values;
}
/**
* Returns the boolean value of option.
* First, we check if the CLI value is defined for the option.
* If not, then we check the YAML value.
* @param option name of the option
* @return true, if the option is found in the CLI args or in the YAML config,
* false otherwise.
*/
public boolean hasOption(String option) {
if (onlyDefinedWithCliArgs.contains(option)) {
boolean value = parsedCommandLine.hasOption(option);
if (LOG.isDebugEnabled()) {
LOG.debug("Found boolean config with value {} for key {} " +
"from CLI configuration.", value, option);
}
return value;
}
if (parsedCommandLine.hasOption(option)) {
if (LOG.isDebugEnabled()) {
LOG.debug("Found boolean config value for key {} " +
"from CLI configuration.", option);
}
return true;
}
return getBooleanValueFromYaml(option);
}
private boolean getBooleanValueFromYaml(String option) {
String stringValue = yamlStringConfigs.get(option);
boolean result = stringValue != null
&& Boolean.valueOf(stringValue).equals(Boolean.TRUE);
LOG.debug("Found config value {} for key {} " +
"from YAML configuration.", result, option);
return result;
}
public ConfigType getConfigType() {
return configType;
}
public Framework getFramework() {
return framework;
}
public void updateParameters(ClientContext clientContext)
throws ParseException, YarnException, IOException {
parameters.updateParameters(this, clientContext);
}
public BaseParameters getParameters() {
return parameters;
}
}
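To make the lookup precedence above concrete, here is a minimal usage sketch, not taken from the original sources: values given on the command line always win, anything else falls back to the parsed YAML config, and options listed as CLI-only are never read from YAML. The option registration, the --framework/--name flags and the Command.RUN_JOB constant are assumptions based on the surrounding Submarine CLI code.

// Hypothetical sketch: CLI-only lookup through ParametersHolder.
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.GnuParser;
import org.apache.commons.cli.Options;
import org.apache.hadoop.yarn.submarine.client.cli.CliConstants;
import org.apache.hadoop.yarn.submarine.client.cli.Command;
import org.apache.hadoop.yarn.submarine.client.cli.param.ParametersHolder;

public class ParametersHolderSketch {
  public static void main(String[] args) throws Exception {
    Options options = new Options();
    options.addOption(CliConstants.FRAMEWORK, true, "Framework to use");
    options.addOption(CliConstants.NAME, true, "Name of the job");

    CommandLine cli = new GnuParser().parse(options,
        new String[] {"--framework", "tensorflow", "--name", "my-job"});

    // No YAML config is passed, so every value is served from the CLI.
    ParametersHolder holder =
        ParametersHolder.createWithCmdLine(cli, Command.RUN_JOB);
    System.out.println(holder.getOptionValue(CliConstants.NAME));
  }
}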


@ -1,71 +0,0 @@
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License. See accompanying LICENSE file.
*/
package org.apache.hadoop.yarn.submarine.client.cli.param;
import org.apache.commons.cli.ParseException;
/**
* A class that represents a quick link to a web page.
*/
public class Quicklink {
private String label;
private String componentInstanceName;
private String protocol;
private int port;
public void parse(String quicklinkStr) throws ParseException {
if (!quicklinkStr.contains("=")) {
throw new ParseException("Should be <label>=<link> format for quicklink");
}
int index = quicklinkStr.indexOf("=");
label = quicklinkStr.substring(0, index);
quicklinkStr = quicklinkStr.substring(index + 1);
if (quicklinkStr.startsWith("http://")) {
protocol = "http://";
} else if (quicklinkStr.startsWith("https://")) {
protocol = "https://";
} else {
throw new ParseException("Quicklink should start with http or https");
}
quicklinkStr = quicklinkStr.substring(protocol.length());
index = quicklinkStr.indexOf(":");
if (index == -1) {
throw new ParseException("Quicklink should be componet-id:port form");
}
componentInstanceName = quicklinkStr.substring(0, index);
port = Integer.parseInt(quicklinkStr.substring(index + 1));
}
public String getLabel() {
return label;
}
public String getComponentInstanceName() {
return componentInstanceName;
}
public String getProtocol() {
return protocol;
}
public int getPort() {
return port;
}
}
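For illustration, a quicklink string in the <label>=<protocol><component-instance>:<port> shape accepted by parse() above would resolve as follows; this is a hypothetical snippet, not code from the original sources.

import org.apache.hadoop.yarn.submarine.client.cli.param.Quicklink;

public class QuicklinkParseSketch {
  public static void main(String[] args) throws Exception {
    // "<label>=<protocol><component-instance>:<port>"
    Quicklink link = new Quicklink();
    link.parse("Notebook_UI=https://master-0:7070");
    // label = "Notebook_UI", protocol = "https://",
    // componentInstanceName = "master-0", port = 7070
    System.out.println(link.getLabel() + " -> " + link.getProtocol()
        + link.getComponentInstanceName() + ":" + link.getPort());
  }
}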


@ -1,104 +0,0 @@
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License. See accompanying LICENSE file.
*/
package org.apache.hadoop.yarn.submarine.client.cli.param;
import org.apache.commons.cli.ParseException;
import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.submarine.client.cli.CliConstants;
import org.apache.hadoop.yarn.submarine.common.ClientContext;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
/**
* Parameters required to run anything on the cluster, such as running a job or serving a model.
*/
public abstract class RunParameters extends BaseParameters {
private String savedModelPath;
private String dockerImageName;
private List<String> envars = new ArrayList<>();
private String queue;
@Override
public void updateParameters(ParametersHolder parametersHolder,
ClientContext clientContext) throws ParseException,
IOException, YarnException {
String savedModelPath = parametersHolder.getOptionValue(
CliConstants.SAVED_MODEL_PATH);
this.setSavedModelPath(savedModelPath);
List<String> envVars = getEnvVars(parametersHolder);
this.setEnvars(envVars);
String queue = parametersHolder.getOptionValue(
CliConstants.QUEUE);
this.setQueue(queue);
String dockerImage = parametersHolder.getOptionValue(
CliConstants.DOCKER_IMAGE);
this.setDockerImageName(dockerImage);
super.updateParameters(parametersHolder, clientContext);
}
private List<String> getEnvVars(ParametersHolder parametersHolder)
throws YarnException {
List<String> result = new ArrayList<>();
List<String> envVarsArray = parametersHolder.getOptionValues(
CliConstants.ENV);
if (envVarsArray != null) {
result.addAll(envVarsArray);
}
return result;
}
public String getQueue() {
return queue;
}
public RunParameters setQueue(String queue) {
this.queue = queue;
return this;
}
public String getDockerImageName() {
return dockerImageName;
}
public RunParameters setDockerImageName(String dockerImageName) {
this.dockerImageName = dockerImageName;
return this;
}
public List<String> getEnvars() {
return envars;
}
public RunParameters setEnvars(List<String> envars) {
this.envars = envars;
return this;
}
public String getSavedModelPath() {
return savedModelPath;
}
public RunParameters setSavedModelPath(String savedModelPath) {
this.savedModelPath = savedModelPath;
return this;
}
}


@ -1,18 +0,0 @@
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License. See accompanying LICENSE file.
*/
package org.apache.hadoop.yarn.submarine.client.cli.param;
public class ShowJobParameters extends BaseParameters {
}


@ -1,19 +0,0 @@
/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* <p>
* http://www.apache.org/licenses/LICENSE-2.0
* <p>
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.client.cli.param;


@ -1,120 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.client.cli.param.runjob;
import java.io.IOException;
import java.util.List;
import org.apache.commons.cli.ParseException;
import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.submarine.client.cli.CliConstants;
import org.apache.hadoop.yarn.submarine.client.cli.CliUtils;
import org.apache.hadoop.yarn.submarine.client.cli.param.ParametersHolder;
import org.apache.hadoop.yarn.submarine.common.ClientContext;
import com.google.common.collect.Lists;
/**
* Parameters for PyTorch job.
*/
public class PyTorchRunJobParameters extends RunJobParameters {
private static final String CANNOT_BE_DEFINED_FOR_PYTORCH =
"cannot be defined for PyTorch jobs!";
@Override
public void updateParameters(ParametersHolder parametersHolder,
ClientContext clientContext)
throws ParseException, IOException, YarnException {
checkArguments(parametersHolder);
super.updateParameters(parametersHolder, clientContext);
String input = parametersHolder.getOptionValue(CliConstants.INPUT_PATH);
this.workerParameters =
getWorkerParameters(clientContext, parametersHolder, input);
this.distributed = determineIfDistributed(workerParameters.getReplicas());
executePostOperations(clientContext);
}
private void checkArguments(ParametersHolder parametersHolder)
throws YarnException, ParseException {
if (parametersHolder.getOptionValue(CliConstants.N_PS) != null) {
throw new ParseException(getParamCannotBeDefinedErrorMessage(
CliConstants.N_PS));
} else if (parametersHolder.getOptionValue(CliConstants.PS_RES) != null) {
throw new ParseException(getParamCannotBeDefinedErrorMessage(
CliConstants.PS_RES));
} else if (parametersHolder
.getOptionValue(CliConstants.PS_DOCKER_IMAGE) != null) {
throw new ParseException(getParamCannotBeDefinedErrorMessage(
CliConstants.PS_DOCKER_IMAGE));
} else if (parametersHolder
.getOptionValue(CliConstants.PS_LAUNCH_CMD) != null) {
throw new ParseException(getParamCannotBeDefinedErrorMessage(
CliConstants.PS_LAUNCH_CMD));
} else if (parametersHolder.hasOption(CliConstants.TENSORBOARD)) {
throw new ParseException(getParamCannotBeDefinedErrorMessage(
CliConstants.TENSORBOARD));
} else if (parametersHolder
.getOptionValue(CliConstants.TENSORBOARD_RESOURCES) != null) {
throw new ParseException(getParamCannotBeDefinedErrorMessage(
CliConstants.TENSORBOARD_RESOURCES));
} else if (parametersHolder
.getOptionValue(CliConstants.TENSORBOARD_DOCKER_IMAGE) != null) {
throw new ParseException(getParamCannotBeDefinedErrorMessage(
CliConstants.TENSORBOARD_DOCKER_IMAGE));
}
}
private String getParamCannotBeDefinedErrorMessage(String cliName) {
return String.format(
"Parameter '%s' " + CANNOT_BE_DEFINED_FOR_PYTORCH, cliName);
}
@Override
void executePostOperations(ClientContext clientContext) throws IOException {
// Set default job dir / saved model dir, etc.
setDefaultDirs(clientContext);
replacePatternsInParameters(clientContext);
}
private void replacePatternsInParameters(ClientContext clientContext)
throws IOException {
if (StringUtils.isNotEmpty(getWorkerLaunchCmd())) {
String afterReplace =
CliUtils.replacePatternsInLaunchCommand(getWorkerLaunchCmd(), this,
clientContext.getRemoteDirectoryManager());
setWorkerLaunchCmd(afterReplace);
}
}
@Override
public List<String> getLaunchCommands() {
return Lists.newArrayList(getWorkerLaunchCmd());
}
/**
* We only support non-distributed PyTorch integration for now.
* @param nWorkers number of worker instances
* @return always false, since distributed PyTorch jobs are not yet supported
*/
private boolean determineIfDistributed(int nWorkers) {
return false;
}
}


@ -1,348 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.client.cli.param.runjob;
import com.google.common.annotations.VisibleForTesting;
import com.google.common.base.CaseFormat;
import org.apache.commons.cli.ParseException;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.submarine.client.cli.CliConstants;
import org.apache.hadoop.yarn.submarine.client.cli.CliUtils;
import org.apache.hadoop.yarn.submarine.client.cli.param.Localization;
import org.apache.hadoop.yarn.submarine.client.cli.param.ParametersHolder;
import org.apache.hadoop.yarn.submarine.client.cli.param.Quicklink;
import org.apache.hadoop.yarn.submarine.client.cli.param.RunParameters;
import org.apache.hadoop.yarn.submarine.client.cli.runjob.RoleParameters;
import org.apache.hadoop.yarn.submarine.common.ClientContext;
import org.apache.hadoop.yarn.submarine.common.api.TensorFlowRole;
import org.apache.hadoop.yarn.submarine.common.fs.RemoteDirectoryManager;
import org.apache.hadoop.yarn.submarine.common.resource.ResourceUtils;
import org.yaml.snakeyaml.introspector.Property;
import org.yaml.snakeyaml.introspector.PropertyUtils;
import java.beans.IntrospectionException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
/**
* Parameters used to run a job
*/
public abstract class RunJobParameters extends RunParameters {
private String input;
private String checkpointPath;
private List<Quicklink> quicklinks = new ArrayList<>();
private List<Localization> localizations = new ArrayList<>();
private boolean waitJobFinish = false;
protected boolean distributed = false;
private boolean securityDisabled = false;
private String keytab;
private String principal;
private boolean distributeKeytab = false;
private List<String> confPairs = new ArrayList<>();
RoleParameters workerParameters =
RoleParameters.createEmpty(TensorFlowRole.WORKER);
@Override
public void updateParameters(ParametersHolder parametersHolder,
ClientContext clientContext)
throws ParseException, IOException, YarnException {
String input = parametersHolder.getOptionValue(CliConstants.INPUT_PATH);
String jobDir = parametersHolder.getOptionValue(
CliConstants.CHECKPOINT_PATH);
if (parametersHolder.hasOption(CliConstants.INSECURE_CLUSTER)) {
setSecurityDisabled(true);
}
String kerberosKeytab = parametersHolder.getOptionValue(
CliConstants.KEYTAB);
String kerberosPrincipal = parametersHolder.getOptionValue(
CliConstants.PRINCIPAL);
CliUtils.doLoginIfSecure(kerberosKeytab, kerberosPrincipal);
if (parametersHolder.hasOption(CliConstants.WAIT_JOB_FINISH)) {
this.waitJobFinish = true;
}
// Quicklinks
List<String> quicklinkStrs = parametersHolder.getOptionValues(
CliConstants.QUICKLINK);
if (quicklinkStrs != null) {
for (String ql : quicklinkStrs) {
Quicklink quicklink = new Quicklink();
quicklink.parse(ql);
quicklinks.add(quicklink);
}
}
// Localizations
List<String> localizationsStr = parametersHolder.getOptionValues(
CliConstants.LOCALIZATION);
if (null != localizationsStr) {
for (String loc : localizationsStr) {
Localization localization = new Localization();
localization.parse(loc);
localizations.add(localization);
}
}
boolean distributeKerberosKeytab = parametersHolder.hasOption(CliConstants
.DISTRIBUTE_KEYTAB);
List<String> configPairs = parametersHolder
.getOptionValues(CliConstants.ARG_CONF);
this.setInputPath(input).setCheckpointPath(jobDir)
.setKeytab(kerberosKeytab)
.setPrincipal(kerberosPrincipal)
.setDistributeKeytab(distributeKerberosKeytab)
.setConfPairs(configPairs);
super.updateParameters(parametersHolder, clientContext);
}
abstract void executePostOperations(ClientContext clientContext)
throws IOException;
void setDefaultDirs(ClientContext clientContext) throws IOException {
// Create directories if needed
String jobDir = getCheckpointPath();
if (jobDir == null) {
jobDir = getJobDir(clientContext);
setCheckpointPath(jobDir);
}
if (getNumWorkers() > 0) {
String savedModelDir = getSavedModelPath();
if (savedModelDir == null) {
savedModelDir = jobDir;
setSavedModelPath(savedModelDir);
}
}
}
private String getJobDir(ClientContext clientContext) throws IOException {
RemoteDirectoryManager rdm = clientContext.getRemoteDirectoryManager();
if (getNumWorkers() > 0) {
return rdm.getJobCheckpointDir(getName(), true).toString();
} else {
// When #workers == 0, it means we only launch TensorBoard. In that case,
// point the job dir to the root dir so that all jobs' metrics will be shown.
return rdm.getUserRootFolder().toString();
}
}
public abstract List<String> getLaunchCommands();
public String getInputPath() {
return input;
}
public RunJobParameters setInputPath(String input) {
this.input = input;
return this;
}
public String getCheckpointPath() {
return checkpointPath;
}
public RunJobParameters setCheckpointPath(String checkpointPath) {
this.checkpointPath = checkpointPath;
return this;
}
public boolean isWaitJobFinish() {
return waitJobFinish;
}
public List<Quicklink> getQuicklinks() {
return quicklinks;
}
public List<Localization> getLocalizations() {
return localizations;
}
public String getKeytab() {
return keytab;
}
public RunJobParameters setKeytab(String kerberosKeytab) {
this.keytab = kerberosKeytab;
return this;
}
public String getPrincipal() {
return principal;
}
public RunJobParameters setPrincipal(String kerberosPrincipal) {
this.principal = kerberosPrincipal;
return this;
}
public boolean isSecurityDisabled() {
return securityDisabled;
}
public void setSecurityDisabled(boolean securityDisabled) {
this.securityDisabled = securityDisabled;
}
public boolean isDistributeKeytab() {
return distributeKeytab;
}
public RunJobParameters setDistributeKeytab(
boolean distributeKerberosKeytab) {
this.distributeKeytab = distributeKerberosKeytab;
return this;
}
public List<String> getConfPairs() {
return confPairs;
}
public RunJobParameters setConfPairs(List<String> confPairs) {
this.confPairs = confPairs;
return this;
}
public void setDistributed(boolean distributed) {
this.distributed = distributed;
}
RoleParameters getWorkerParameters(ClientContext clientContext,
ParametersHolder parametersHolder, String input)
throws ParseException, YarnException, IOException {
int nWorkers = getNumberOfWorkers(parametersHolder, input);
Resource workerResource =
determineWorkerResource(parametersHolder, nWorkers, clientContext);
String workerDockerImage =
parametersHolder.getOptionValue(CliConstants.WORKER_DOCKER_IMAGE);
String workerLaunchCmd =
parametersHolder.getOptionValue(CliConstants.WORKER_LAUNCH_CMD);
return new RoleParameters(TensorFlowRole.WORKER, nWorkers,
workerLaunchCmd, workerDockerImage, workerResource);
}
private Resource determineWorkerResource(ParametersHolder parametersHolder,
int nWorkers, ClientContext clientContext)
throws ParseException, YarnException, IOException {
if (nWorkers > 0) {
String workerResourceStr =
parametersHolder.getOptionValue(CliConstants.WORKER_RES);
if (workerResourceStr == null) {
throw new ParseException(
"--" + CliConstants.WORKER_RES + " is absent.");
}
return ResourceUtils.createResourceFromString(workerResourceStr);
}
return null;
}
private int getNumberOfWorkers(ParametersHolder parametersHolder,
String input) throws ParseException, YarnException {
int nWorkers = 1;
if (parametersHolder.getOptionValue(CliConstants.N_WORKERS) != null) {
nWorkers = Integer
.parseInt(parametersHolder.getOptionValue(CliConstants.N_WORKERS));
// Only check for a null value.
// A training job shouldn't ignore the INPUT_PATH option,
// but if nWorkers is 0, INPUT_PATH can be ignored because
// the user is only running TensorBoard.
if (null == input && 0 != nWorkers) {
throw new ParseException(
"\"--" + CliConstants.INPUT_PATH + "\" is absent");
}
}
return nWorkers;
}
public String getWorkerLaunchCmd() {
return workerParameters.getLaunchCommand();
}
public void setWorkerLaunchCmd(String launchCmd) {
workerParameters.setLaunchCommand(launchCmd);
}
public int getNumWorkers() {
return workerParameters.getReplicas();
}
public void setNumWorkers(int numWorkers) {
workerParameters.setReplicas(numWorkers);
}
public Resource getWorkerResource() {
return workerParameters.getResource();
}
public void setWorkerResource(Resource resource) {
workerParameters.setResource(resource);
}
public String getWorkerDockerImage() {
return workerParameters.getDockerImage();
}
public void setWorkerDockerImage(String image) {
workerParameters.setDockerImage(image);
}
public boolean isDistributed() {
return distributed;
}
@VisibleForTesting
public static class UnderscoreConverterPropertyUtils extends PropertyUtils {
@Override
public Property getProperty(Class<? extends Object> type, String name)
throws IntrospectionException {
if (name.indexOf('_') > -1) {
name = convertName(name);
}
return super.getProperty(type, name);
}
private static String convertName(String name) {
return CaseFormat.UPPER_UNDERSCORE.to(CaseFormat.LOWER_CAMEL, name);
}
}
}
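The UnderscoreConverterPropertyUtils above is what lets underscore-separated YAML keys such as docker_image land on camelCase bean properties. Below is a hedged sketch of how it could be wired into SnakeYAML, mirroring what the run job CLI does when it reads a --yaml file; the sample YAML content and its values are illustrative assumptions.

import org.apache.hadoop.yarn.submarine.client.cli.param.runjob.RunJobParameters.UnderscoreConverterPropertyUtils;
import org.apache.hadoop.yarn.submarine.client.cli.param.yaml.YamlConfigFile;
import org.yaml.snakeyaml.Yaml;
import org.yaml.snakeyaml.constructor.Constructor;

public class YamlUnderscoreSketch {
  public static void main(String[] args) {
    String yamlSource =
        "spec:\n"
        + "  name: my-job\n"
        + "  framework: tensorflow\n"
        + "configs:\n"
        + "  docker_image: tf-latest\n"    // maps onto Configs#setDockerImage
        + "  input_path: hdfs://default/user/submarine/input\n";

    Constructor constructor = new Constructor(YamlConfigFile.class);
    constructor.setPropertyUtils(new UnderscoreConverterPropertyUtils());
    Yaml yaml = new Yaml(constructor);

    YamlConfigFile config = yaml.loadAs(yamlSource, YamlConfigFile.class);
    // Prints "tf-latest": the underscore key was converted to camelCase.
    System.out.println(config.getConfigs().getDockerImage());
  }
}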


@ -1,213 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.client.cli.param.runjob;
import com.google.common.collect.Lists;
import org.apache.commons.cli.ParseException;
import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.submarine.client.cli.CliConstants;
import org.apache.hadoop.yarn.submarine.client.cli.CliUtils;
import org.apache.hadoop.yarn.submarine.client.cli.param.ParametersHolder;
import org.apache.hadoop.yarn.submarine.client.cli.runjob.RoleParameters;
import org.apache.hadoop.yarn.submarine.common.ClientContext;
import org.apache.hadoop.yarn.submarine.common.api.TensorFlowRole;
import org.apache.hadoop.yarn.submarine.common.resource.ResourceUtils;
import java.io.IOException;
import java.util.List;
/**
* Parameters for TensorFlow job.
*/
public class TensorFlowRunJobParameters extends RunJobParameters {
private boolean tensorboardEnabled;
private RoleParameters psParameters =
RoleParameters.createEmpty(TensorFlowRole.PS);
private RoleParameters tensorBoardParameters =
RoleParameters.createEmpty(TensorFlowRole.TENSORBOARD);
@Override
public void updateParameters(ParametersHolder parametersHolder,
ClientContext clientContext)
throws ParseException, IOException, YarnException {
super.updateParameters(parametersHolder, clientContext);
String input = parametersHolder.getOptionValue(CliConstants.INPUT_PATH);
this.workerParameters =
getWorkerParameters(clientContext, parametersHolder, input);
this.psParameters = getPSParameters(clientContext, parametersHolder);
this.distributed = determineIfDistributed(workerParameters.getReplicas(),
psParameters.getReplicas());
if (parametersHolder.hasOption(CliConstants.TENSORBOARD)) {
this.tensorboardEnabled = true;
this.tensorBoardParameters =
getTensorBoardParameters(parametersHolder, clientContext);
}
executePostOperations(clientContext);
}
@Override
void executePostOperations(ClientContext clientContext) throws IOException {
// Set default job dir / saved model dir, etc.
setDefaultDirs(clientContext);
replacePatternsInParameters(clientContext);
}
private void replacePatternsInParameters(ClientContext clientContext)
throws IOException {
if (StringUtils.isNotEmpty(getPSLaunchCmd())) {
String afterReplace = CliUtils.replacePatternsInLaunchCommand(
getPSLaunchCmd(), this, clientContext.getRemoteDirectoryManager());
setPSLaunchCmd(afterReplace);
}
if (StringUtils.isNotEmpty(getWorkerLaunchCmd())) {
String afterReplace =
CliUtils.replacePatternsInLaunchCommand(getWorkerLaunchCmd(), this,
clientContext.getRemoteDirectoryManager());
setWorkerLaunchCmd(afterReplace);
}
}
@Override
public List<String> getLaunchCommands() {
return Lists.newArrayList(getWorkerLaunchCmd(), getPSLaunchCmd());
}
private boolean determineIfDistributed(int nWorkers, int nPS)
throws ParseException {
// Check #workers and #ps.
// When distributed training is required
if (nWorkers >= 2 && nPS > 0) {
return true;
} else if (nWorkers <= 1 && nPS > 0) {
throw new ParseException("Only specified one worker but non-zero PS, "
+ "please double check.");
}
return false;
}
private RoleParameters getPSParameters(ClientContext clientContext,
ParametersHolder parametersHolder)
throws YarnException, IOException, ParseException {
int nPS = getNumberOfPS(parametersHolder);
Resource psResource =
determinePSResource(parametersHolder, nPS, clientContext);
String psDockerImage =
parametersHolder.getOptionValue(CliConstants.PS_DOCKER_IMAGE);
String psLaunchCommand =
parametersHolder.getOptionValue(CliConstants.PS_LAUNCH_CMD);
return new RoleParameters(TensorFlowRole.PS, nPS, psLaunchCommand,
psDockerImage, psResource);
}
private Resource determinePSResource(ParametersHolder parametersHolder,
int nPS, ClientContext clientContext)
throws ParseException, YarnException, IOException {
if (nPS > 0) {
String psResourceStr =
parametersHolder.getOptionValue(CliConstants.PS_RES);
if (psResourceStr == null) {
throw new ParseException("--" + CliConstants.PS_RES + " is absent.");
}
return ResourceUtils.createResourceFromString(psResourceStr);
}
return null;
}
private int getNumberOfPS(ParametersHolder parametersHolder)
throws YarnException {
int nPS = 0;
if (parametersHolder.getOptionValue(CliConstants.N_PS) != null) {
nPS =
Integer.parseInt(parametersHolder.getOptionValue(CliConstants.N_PS));
}
return nPS;
}
private RoleParameters getTensorBoardParameters(
ParametersHolder parametersHolder, ClientContext clientContext)
throws YarnException, IOException {
String tensorboardResourceStr =
parametersHolder.getOptionValue(CliConstants.TENSORBOARD_RESOURCES);
if (tensorboardResourceStr == null || tensorboardResourceStr.isEmpty()) {
tensorboardResourceStr = CliConstants.TENSORBOARD_DEFAULT_RESOURCES;
}
Resource tensorboardResource = ResourceUtils.createResourceFromString(
tensorboardResourceStr);
String tensorboardDockerImage =
parametersHolder.getOptionValue(CliConstants.TENSORBOARD_DOCKER_IMAGE);
return new RoleParameters(TensorFlowRole.TENSORBOARD, 1, null,
tensorboardDockerImage, tensorboardResource);
}
public int getNumPS() {
return psParameters.getReplicas();
}
public void setNumPS(int numPS) {
psParameters.setReplicas(numPS);
}
public Resource getPsResource() {
return psParameters.getResource();
}
public void setPsResource(Resource resource) {
psParameters.setResource(resource);
}
public String getPsDockerImage() {
return psParameters.getDockerImage();
}
public void setPsDockerImage(String image) {
psParameters.setDockerImage(image);
}
public String getPSLaunchCmd() {
return psParameters.getLaunchCommand();
}
public void setPSLaunchCmd(String launchCmd) {
psParameters.setLaunchCommand(launchCmd);
}
public boolean isTensorboardEnabled() {
return tensorboardEnabled;
}
public Resource getTensorboardResource() {
return tensorBoardParameters.getResource();
}
public void setTensorboardResource(Resource resource) {
tensorBoardParameters.setResource(resource);
}
public String getTensorboardDockerImage() {
return tensorBoardParameters.getDockerImage();
}
public void setTensorboardDockerImage(String image) {
tensorBoardParameters.setDockerImage(image);
}
}
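As a worked illustration of the distribution rule in determineIfDistributed above, here is a standalone sketch that mirrors the private method rather than calling it; the --num_workers/--num_ps flag names in the comments are assumed to be the corresponding Submarine CLI options.

import org.apache.commons.cli.ParseException;

public final class DistributionRuleSketch {
  private DistributionRuleSketch() {
  }

  /** Mirrors the decision rule of determineIfDistributed, for illustration. */
  static boolean isDistributedTensorFlowJob(int nWorkers, int nPS)
      throws ParseException {
    if (nWorkers >= 2 && nPS > 0) {
      return true;    // e.g. --num_workers 2 --num_ps 1: distributed training
    }
    if (nWorkers <= 1 && nPS > 0) {
      // e.g. --num_workers 1 --num_ps 1 is rejected
      throw new ParseException("Only one worker is specified but the number "
          + "of PS is non-zero, please double check.");
    }
    return false;     // e.g. --num_workers 1 --num_ps 0: single-node training
  }
}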


@ -1,20 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/**
* This package contains classes that hold run job parameters for
* TensorFlow / PyTorch jobs.
*/
package org.apache.hadoop.yarn.submarine.client.cli.param.runjob;


@ -1,107 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.client.cli.param.yaml;
import java.util.List;
import java.util.Map;
/**
* Class that holds values found in 'configs' section of YAML configuration.
*/
public class Configs {
private String dockerImage;
private String inputPath;
private String savedModelPath;
private String checkpointPath;
private List<String> quicklinks;
private String waitJobFinish;
private Map<String, String> envs;
private List<String> localizations;
private List<String> mounts;
public String getDockerImage() {
return dockerImage;
}
public void setDockerImage(String dockerImage) {
this.dockerImage = dockerImage;
}
public String getInputPath() {
return inputPath;
}
public void setInputPath(String inputPath) {
this.inputPath = inputPath;
}
public String getSavedModelPath() {
return savedModelPath;
}
public void setSavedModelPath(String savedModelPath) {
this.savedModelPath = savedModelPath;
}
public String getCheckpointPath() {
return checkpointPath;
}
public void setCheckpointPath(String checkpointPath) {
this.checkpointPath = checkpointPath;
}
public Map<String, String> getEnvs() {
return envs;
}
public void setEnvs(Map<String, String> envs) {
this.envs = envs;
}
public List<String> getLocalizations() {
return localizations;
}
public void setLocalizations(List<String> localizations) {
this.localizations = localizations;
}
public List<String> getMounts() {
return mounts;
}
public void setMounts(List<String> mounts) {
this.mounts = mounts;
}
public List<String> getQuicklinks() {
return quicklinks;
}
public void setQuicklinks(List<String> quicklinks) {
this.quicklinks = quicklinks;
}
public String getWaitJobFinish() {
return waitJobFinish;
}
public void setWaitJobFinish(String waitJobFinish) {
this.waitJobFinish = waitJobFinish;
}
}


@ -1,25 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.client.cli.param.yaml;
/**
* Holds configuration values for PS (parameter server).
* 'ps' is a section underneath the 'roles' section of the YAML
* configuration file.
*/
public class PsRole extends Role {
}


@ -1,91 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.client.cli.param.yaml;
import java.util.List;
import java.util.Map;
/**
* Base class for Roles. 'roles' is a section of the YAML configuration file.
*/
public class Role {
private String resources;
private int replicas;
private String launchCmd;
//Optional parameters (Can override global config)
private String dockerImage;
private Map<String, String> envs;
private List<String> localizations;
private List<String> mounts;
public String getResources() {
return resources;
}
public void setResources(String resources) {
this.resources = resources;
}
public int getReplicas() {
return replicas;
}
public void setReplicas(int replicas) {
this.replicas = replicas;
}
public String getLaunchCmd() {
return launchCmd;
}
public void setLaunchCmd(String launchCmd) {
this.launchCmd = launchCmd;
}
public String getDockerImage() {
return dockerImage;
}
public void setDockerImage(String dockerImage) {
this.dockerImage = dockerImage;
}
public Map<String, String> getEnvs() {
return envs;
}
public void setEnvs(Map<String, String> envs) {
this.envs = envs;
}
public List<String> getLocalizations() {
return localizations;
}
public void setLocalizations(List<String> localizations) {
this.localizations = localizations;
}
public List<String> getMounts() {
return mounts;
}
public void setMounts(List<String> mounts) {
this.mounts = mounts;
}
}


@ -1,41 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.client.cli.param.yaml;
/**
* This class represents the 'roles' section of the YAML configuration file.
*/
public class Roles {
private Role worker;
private Role ps;
public Role getWorker() {
return worker;
}
public void setWorker(Role worker) {
this.worker = worker;
}
public Role getPs() {
return ps;
}
public void setPs(Role ps) {
this.ps = ps;
}
}


@ -1,32 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.client.cli.param.yaml;
/**
* Class that holds values found in 'scheduling' section of YAML configuration.
*/
public class Scheduling {
private String queue;
public String getQueue() {
return queue;
}
public void setQueue(String queue) {
this.queue = queue;
}
}


@ -1,50 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.client.cli.param.yaml;
/**
* Class that holds values found in 'security' section of YAML configuration.
*/
public class Security {
private String keytab;
private String principal;
private boolean distributeKeytab;
public String getKeytab() {
return keytab;
}
public void setKeytab(String keytab) {
this.keytab = keytab;
}
public String getPrincipal() {
return principal;
}
public void setPrincipal(String principal) {
this.principal = principal;
}
public boolean isDistributeKeytab() {
return distributeKeytab;
}
public void setDistributeKeytab(boolean distributeKeytab) {
this.distributeKeytab = distributeKeytab;
}
}


@ -1,50 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.client.cli.param.yaml;
/**
* Class that holds values found in 'spec' section of YAML configuration.
*/
public class Spec {
private String name;
private String jobType;
private String framework;
public String getJobType() {
return jobType;
}
public void setJobType(String jobtype) {
this.jobType = jobtype;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public String getFramework() {
return framework;
}
public void setFramework(String framework) {
this.framework = framework;
}
}


@ -1,41 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.client.cli.param.yaml;
/**
* Class that holds values found in 'tensorboard' section of YAML configuration.
*/
public class TensorBoard {
private String dockerImage;
private String resources;
public String getDockerImage() {
return dockerImage;
}
public void setDockerImage(String dockerImage) {
this.dockerImage = dockerImage;
}
public String getResources() {
return resources;
}
public void setResources(String resources) {
this.resources = resources;
}
}


@ -1,25 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.client.cli.param.yaml;
/**
* Holds configuration values for the worker role.
* 'worker' is a section underneath the 'roles' section of the YAML
* configuration file.
*/
public class WorkerRole extends Role {
}


@ -1,77 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.client.cli.param.yaml;
/**
* Root class of YAML configuration.
*/
public class YamlConfigFile {
private Spec spec;
private Configs configs;
private Roles roles;
private Scheduling scheduling;
private Security security;
private TensorBoard tensorBoard;
public Spec getSpec() {
return spec;
}
public void setSpec(Spec spec) {
this.spec = spec;
}
public Configs getConfigs() {
return configs;
}
public void setConfigs(Configs configs) {
this.configs = configs;
}
public Roles getRoles() {
return roles;
}
public void setRoles(Roles roles) {
this.roles = roles;
}
public Scheduling getScheduling() {
return scheduling;
}
public void setScheduling(Scheduling scheduling) {
this.scheduling = scheduling;
}
public Security getSecurity() {
return security;
}
public void setSecurity(Security security) {
this.security = security;
}
public TensorBoard getTensorBoard() {
return tensorBoard;
}
public void setTensorBoard(TensorBoard tensorBoard) {
this.tensorBoard = tensorBoard;
}
}
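Putting the value classes of this package together, a YAML job file that this root class could be populated from might look roughly like the following. Every key and value is an illustrative assumption inferred from the getters above and the underscore-to-camelCase property conversion, not a sample copied from the original documentation.

spec:
  name: my-tf-job
  job_type: yarn_service
  framework: tensorflow
configs:
  docker_image: tf-latest
  input_path: hdfs://default/user/submarine/input
  checkpoint_path: hdfs://default/user/submarine/checkpoint
  wait_job_finish: "true"
  envs:
    ENV_1: value1
roles:
  worker:
    resources: memory=4G,vcores=2
    replicas: 2
    launch_cmd: python train.py
  ps:
    resources: memory=2G,vcores=2
    replicas: 1
    launch_cmd: python train.py --ps
scheduling:
  queue: default
security:
  keytab: /path/to/keytab
  principal: user/host@EXAMPLE.COM
  distribute_keytab: true
tensorBoard:
  docker_image: tf-latest
  resources: memory=2G,vcores=2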


@ -1,27 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.client.cli.param.yaml;
/**
* This exception is thrown if any issue arises while parsing the
* YAML configuration.
*/
public class YamlParseException extends RuntimeException {
public YamlParseException(String message) {
super(message);
}
}


@ -1,19 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/**
* This package contains value classes for the YAML parser.
*/
package org.apache.hadoop.yarn.submarine.client.cli.param.yaml;


@ -1,59 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.client.cli.runjob;
import com.google.common.collect.Lists;
import java.util.List;
import java.util.stream.Collectors;
/**
* Represents the type of machine learning framework to work with.
*/
public enum Framework {
TENSORFLOW(Constants.TENSORFLOW_NAME), PYTORCH(Constants.PYTORCH_NAME);
private String value;
Framework(String value) {
this.value = value;
}
public String getValue() {
return value;
}
public static Framework parseByValue(String value) {
for (Framework fw : Framework.values()) {
if (fw.value.equalsIgnoreCase(value)) {
return fw;
}
}
return null;
}
public static String getValues() {
List<String> values = Lists.newArrayList(Framework.values()).stream()
.map(fw -> fw.value).collect(Collectors.toList());
return String.join(",", values);
}
private static class Constants {
static final String TENSORFLOW_NAME = "tensorflow";
static final String PYTORCH_NAME = "pytorch";
}
}
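A tiny, hypothetical illustration (not from the original sources) of how the enum above resolves user input:

import org.apache.hadoop.yarn.submarine.client.cli.runjob.Framework;

public class FrameworkSketch {
  public static void main(String[] args) {
    // Matching is case-insensitive against the declared values.
    System.out.println(Framework.parseByValue("TensorFlow")); // TENSORFLOW
    System.out.println(Framework.parseByValue("pytorch"));    // PYTORCH
    System.out.println(Framework.parseByValue("mxnet"));      // null (unsupported)
    System.out.println(Framework.getValues());                // tensorflow,pytorch
  }
}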


@ -1,81 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.client.cli.runjob;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.submarine.common.api.Role;
/**
* This class encapsulates data related to a particular Role.
* Some examples: TF Worker process, TF PS process or a PyTorch worker process.
*/
public class RoleParameters {
private final Role role;
private int replicas;
private String launchCommand;
private String dockerImage;
private Resource resource;
public RoleParameters(Role role, int replicas,
String launchCommand, String dockerImage, Resource resource) {
this.role = role;
this.replicas = replicas;
this.launchCommand = launchCommand;
this.dockerImage = dockerImage;
this.resource = resource;
}
public static RoleParameters createEmpty(Role role) {
return new RoleParameters(role, 0, null, null, null);
}
public Role getRole() {
return role;
}
public int getReplicas() {
return replicas;
}
public String getLaunchCommand() {
return launchCommand;
}
public void setLaunchCommand(String launchCommand) {
this.launchCommand = launchCommand;
}
public String getDockerImage() {
return dockerImage;
}
public void setDockerImage(String dockerImage) {
this.dockerImage = dockerImage;
}
public Resource getResource() {
return resource;
}
public void setResource(Resource resource) {
this.resource = resource;
}
public void setReplicas(int replicas) {
this.replicas = replicas;
}
}


@ -1,381 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.client.cli.runjob;
import com.google.common.annotations.VisibleForTesting;
import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.GnuParser;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Options;
import org.apache.commons.cli.ParseException;
import org.apache.commons.io.FileUtils;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.submarine.client.cli.AbstractCli;
import org.apache.hadoop.yarn.submarine.client.cli.CliConstants;
import org.apache.hadoop.yarn.submarine.client.cli.CliUtils;
import org.apache.hadoop.yarn.submarine.client.cli.Command;
import org.apache.hadoop.yarn.submarine.client.cli.param.ParametersHolder;
import org.apache.hadoop.yarn.submarine.client.cli.param.runjob.RunJobParameters;
import org.apache.hadoop.yarn.submarine.client.cli.param.runjob.RunJobParameters.UnderscoreConverterPropertyUtils;
import org.apache.hadoop.yarn.submarine.client.cli.param.yaml.YamlConfigFile;
import org.apache.hadoop.yarn.submarine.client.cli.param.yaml.YamlParseException;
import org.apache.hadoop.yarn.submarine.common.ClientContext;
import org.apache.hadoop.yarn.submarine.common.exception.SubmarineException;
import org.apache.hadoop.yarn.submarine.runtimes.common.JobMonitor;
import org.apache.hadoop.yarn.submarine.runtimes.common.JobSubmitter;
import org.apache.hadoop.yarn.submarine.runtimes.common.StorageKeyConstants;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.yaml.snakeyaml.Yaml;
import org.yaml.snakeyaml.constructor.Constructor;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
/**
 * The purpose of this class is to handle / parse CLI arguments related to
* the run job Submarine command.
*/
public class RunJobCli extends AbstractCli {
private static final Logger LOG =
LoggerFactory.getLogger(RunJobCli.class);
private static final String TENSORFLOW = "TensorFlow";
private static final String PYTORCH = "PyTorch";
private static final String PS = "PS";
private static final String WORKER = "worker";
private static final String TENSORBOARD = "TensorBoard";
  private static final String CAN_BE_USED_WITH_TF_PYTORCH =
      String.format(" Can be used with %s or %s frameworks.",
          TENSORFLOW, PYTORCH);
  private static final String TENSORFLOW_ONLY =
      String.format(" Can only be used with %s framework.", TENSORFLOW);
public static final String YAML_PARSE_FAILED = "Failed to parse " +
"YAML config";
private static final String LOCAL_OR_ANY_FS_DIRECTORY = "Could be a local " +
"directory or any other directory on the file system.";
private Options options;
private JobSubmitter jobSubmitter;
private JobMonitor jobMonitor;
private ParametersHolder parametersHolder;
public RunJobCli(ClientContext cliContext) {
this(cliContext, cliContext.getRuntimeFactory().getJobSubmitterInstance(),
cliContext.getRuntimeFactory().getJobMonitorInstance());
}
@VisibleForTesting
public RunJobCli(ClientContext cliContext, JobSubmitter jobSubmitter,
JobMonitor jobMonitor) {
super(cliContext);
this.options = generateOptions();
this.jobSubmitter = jobSubmitter;
this.jobMonitor = jobMonitor;
}
public void printUsages() {
new HelpFormatter().printHelp("job run", options);
}
private Options generateOptions() {
Options options = new Options();
options.addOption(CliConstants.YAML_CONFIG, true,
"Config file (in YAML format)");
options.addOption(CliConstants.FRAMEWORK, true,
String.format("Framework to use. Valid values are: %s! " +
"The default framework is Tensorflow.",
Framework.getValues()));
options.addOption(CliConstants.NAME, true, "Name of the job");
options.addOption(CliConstants.INPUT_PATH, true,
"Input of the job. " + LOCAL_OR_ANY_FS_DIRECTORY);
options.addOption(CliConstants.CHECKPOINT_PATH, true,
"Training output directory of the job. " + LOCAL_OR_ANY_FS_DIRECTORY +
"This typically includes checkpoint files and exported model");
options.addOption(CliConstants.SAVED_MODEL_PATH, true,
"Model exported path (saved model) of the job, which is needed when " +
"exported model is not placed under ${checkpoint_path}. " +
LOCAL_OR_ANY_FS_DIRECTORY + "This will be used to serve");
options.addOption(CliConstants.DOCKER_IMAGE, true, "Docker image name/tag");
options.addOption(CliConstants.PS_DOCKER_IMAGE, true,
getDockerImageMessage(PS));
options.addOption(CliConstants.WORKER_DOCKER_IMAGE, true,
getDockerImageMessage(WORKER));
options.addOption(CliConstants.QUEUE, true,
"Name of queue to run the job. By default, the default queue is used");
addWorkerOptions(options);
addPSOptions(options);
addTensorboardOptions(options);
options.addOption(CliConstants.ENV, true,
"Common environment variable passed to worker / PS");
options.addOption(CliConstants.VERBOSE, false,
"Print verbose log for troubleshooting");
options.addOption(CliConstants.WAIT_JOB_FINISH, false,
"Specified when user wants to wait for jobs to finish");
options.addOption(CliConstants.QUICKLINK, true, "Specify quicklink so YARN "
+ "web UI shows link to the given role instance and port. " +
"When --tensorboard is specified, quicklink to the " +
TENSORBOARD + " instance will be added automatically. " +
"The format of quick link is: "
+ "Quick_link_label=http(or https)://role-name:port. " +
"For example, if users want to link to the first worker's 7070 port, " +
"and text of quicklink is Notebook_UI, " +
"users need to specify --quicklink Notebook_UI=https://master-0:7070");
options.addOption(CliConstants.LOCALIZATION, true, "Specify"
+ " localization to make remote/local file/directory available to"
+ " all container(Docker)."
+ " Argument format is: \"RemoteUri:LocalFilePath[:rw] \" "
+ "(ro permission is not supported yet)."
+ " The RemoteUri can be a local file or directory on the filesystem."
+ " Alternatively, the following remote file systems / "
+ "transmit mechanisms can be used: "
+ " HDFS, S3 or abfs, HTTP, etc."
+ " The LocalFilePath can be absolute or relative."
+ " If it is a relative path, it will be"
+ " under container's implied working directory"
+ " but sub-directory is not supported yet."
+ " This option can be set multiple times."
+ " Examples are \n"
+ "-localization \"hdfs:///user/yarn/mydir2:/opt/data\"\n"
+ "-localization \"s3a:///a/b/myfile1:./\"\n"
+ "-localization \"https:///a/b/myfile2:./myfile\"\n"
+ "-localization \"/user/yarn/mydir3:/opt/mydir3\"\n"
+ "-localization \"./mydir1:.\"\n");
options.addOption(CliConstants.KEYTAB, true, "Specify keytab used by the " +
"job under a secured environment");
options.addOption(CliConstants.PRINCIPAL, true, "Specify principal used " +
"by the job under a secured environment");
options.addOption(CliConstants.DISTRIBUTE_KEYTAB, false, "Distribute " +
"local keytab to cluster machines for service authentication. " +
"If not specified, pre-distributed keytab of which path specified by" +
" parameter" + CliConstants.KEYTAB + " on cluster machines will be " +
"used");
options.addOption("h", "help", false, "Print help");
options.addOption("insecure", false, "Cluster is not Kerberos enabled.");
options.addOption("conf", true,
"User specified configuration, as key=val pairs.");
return options;
}
private void addWorkerOptions(Options options) {
options.addOption(CliConstants.N_WORKERS, true,
getNumberOfServiceMessage(WORKER, 1) +
CAN_BE_USED_WITH_TF_PYTORCH);
options.addOption(CliConstants.WORKER_DOCKER_IMAGE, true,
getDockerImageMessage(WORKER) +
CAN_BE_USED_WITH_TF_PYTORCH);
options.addOption(CliConstants.WORKER_LAUNCH_CMD, true,
getLaunchCommandMessage(WORKER) +
CAN_BE_USED_WITH_TF_PYTORCH);
options.addOption(CliConstants.WORKER_RES, true,
getServiceResourceMessage(WORKER) +
CAN_BE_USED_WITH_TF_PYTORCH);
}
private void addPSOptions(Options options) {
options.addOption(CliConstants.N_PS, true,
getNumberOfServiceMessage("PS", 0) +
TENSORFLOW_ONLY);
options.addOption(CliConstants.PS_DOCKER_IMAGE, true,
getDockerImageMessage(PS) +
TENSORFLOW_ONLY);
options.addOption(CliConstants.PS_LAUNCH_CMD, true,
getLaunchCommandMessage("PS") +
TENSORFLOW_ONLY);
options.addOption(CliConstants.PS_RES, true,
getServiceResourceMessage("PS") +
TENSORFLOW_ONLY);
}
private void addTensorboardOptions(Options options) {
options.addOption(CliConstants.TENSORBOARD, false,
"Should we run TensorBoard for this job? " +
"By default, TensorBoard is disabled." +
TENSORFLOW_ONLY);
options.addOption(CliConstants.TENSORBOARD_RESOURCES, true,
"Specifies resources of Tensorboard. The default resource is: "
+ CliConstants.TENSORBOARD_DEFAULT_RESOURCES + "." +
TENSORFLOW_ONLY);
options.addOption(CliConstants.TENSORBOARD_DOCKER_IMAGE, true,
getDockerImageMessage(TENSORBOARD));
}
  private String getLaunchCommandMessage(String service) {
    return String.format("Launch command of the %s; the arguments will be "
        + "directly used to launch the %s.", service, service);
  }
private String getServiceResourceMessage(String serviceType) {
return String.format("Resource of each %s process, for example: "
+ "memory-mb=2048,vcores=2,yarn.io/gpu=2", serviceType);
}
private String getNumberOfServiceMessage(String serviceType,
int defaultValue) {
return String.format("Number of %s processes for the job. " +
"The default value is %d.", serviceType, defaultValue);
}
private String getDockerImageMessage(String serviceType) {
return String.format("Specifies docker image for the %s process. " +
"When not specified, %s uses --%s as a default value.",
serviceType, serviceType, CliConstants.DOCKER_IMAGE);
}
private void parseCommandLineAndGetRunJobParameters(String[] args)
throws ParseException, IOException, YarnException {
try {
GnuParser parser = new GnuParser();
CommandLine cli = parser.parse(options, args);
parametersHolder = createParametersHolder(cli);
parametersHolder.updateParameters(clientContext);
} catch (ParseException e) {
LOG.error("Exception in parse: {}", e.getMessage());
printUsages();
throw e;
}
}
private ParametersHolder createParametersHolder(CommandLine cli)
throws ParseException, YarnException {
String yamlConfigFile =
cli.getOptionValue(CliConstants.YAML_CONFIG);
if (yamlConfigFile != null) {
YamlConfigFile yamlConfig = readYamlConfigFile(yamlConfigFile);
checkYamlConfig(yamlConfigFile, yamlConfig);
LOG.info("Using YAML configuration!");
return ParametersHolder.createWithCmdLineAndYaml(cli, yamlConfig,
Command.RUN_JOB);
} else {
LOG.info("Using CLI configuration!");
return ParametersHolder.createWithCmdLine(cli, Command.RUN_JOB);
}
}
private void checkYamlConfig(String yamlConfigFile,
YamlConfigFile yamlConfig) {
if (yamlConfig == null) {
throw new YamlParseException(String.format(
YAML_PARSE_FAILED + ", file is empty: %s", yamlConfigFile));
} else if (yamlConfig.getConfigs() == null) {
throw new YamlParseException(String.format(YAML_PARSE_FAILED +
", config section should be defined, but it cannot be found in " +
"YAML file '%s'!", yamlConfigFile));
}
}
private YamlConfigFile readYamlConfigFile(String filename) {
Constructor constructor = new Constructor(YamlConfigFile.class);
constructor.setPropertyUtils(new UnderscoreConverterPropertyUtils());
try {
LOG.info("Reading YAML configuration from file: {}", filename);
Yaml yaml = new Yaml(constructor);
return yaml.loadAs(FileUtils.openInputStream(new File(filename)),
YamlConfigFile.class);
} catch (FileNotFoundException e) {
logExceptionOfYamlParse(filename, e);
throw new YamlParseException(YAML_PARSE_FAILED +
", file does not exist!");
} catch (Exception e) {
logExceptionOfYamlParse(filename, e);
throw new YamlParseException(
String.format(YAML_PARSE_FAILED + ", details: %s", e.getMessage()));
}
}
private void logExceptionOfYamlParse(String filename, Exception e) {
LOG.error(String.format("Exception while parsing YAML file %s", filename),
e);
}
private void storeJobInformation(RunJobParameters parameters,
ApplicationId applicationId, String[] args) throws IOException {
String jobName = parameters.getName();
Map<String, String> jobInfo = new HashMap<>();
jobInfo.put(StorageKeyConstants.JOB_NAME, jobName);
jobInfo.put(StorageKeyConstants.APPLICATION_ID, applicationId.toString());
if (parameters.getCheckpointPath() != null) {
jobInfo.put(StorageKeyConstants.CHECKPOINT_PATH,
parameters.getCheckpointPath());
}
if (parameters.getInputPath() != null) {
jobInfo.put(StorageKeyConstants.INPUT_PATH,
parameters.getInputPath());
}
if (parameters.getSavedModelPath() != null) {
jobInfo.put(StorageKeyConstants.SAVED_MODEL_PATH,
parameters.getSavedModelPath());
}
String joinedArgs = String.join(" ", args);
jobInfo.put(StorageKeyConstants.JOB_RUN_ARGS, joinedArgs);
clientContext.getRuntimeFactory().getSubmarineStorage().addNewJob(jobName,
jobInfo);
}
@Override
public int run(String[] args)
throws ParseException, IOException, YarnException, SubmarineException {
if (CliUtils.argsForHelp(args)) {
printUsages();
return 0;
}
parseCommandLineAndGetRunJobParameters(args);
ApplicationId applicationId = jobSubmitter.submitJob(parametersHolder);
RunJobParameters parameters =
(RunJobParameters) parametersHolder.getParameters();
storeJobInformation(parameters, applicationId, args);
if (parameters.isWaitJobFinish()) {
this.jobMonitor.waitTrainingFinal(parameters.getName());
}
return 0;
}
@VisibleForTesting
public JobSubmitter getJobSubmitter() {
return jobSubmitter;
}
@VisibleForTesting
public RunJobParameters getRunJobParameters() {
return (RunJobParameters) parametersHolder.getParameters();
}
}
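
A hedged sketch of how this CLI is typically driven from code. The long option names are assumed to match the CliConstants values referenced above (for example --num_workers), and the job name, image and launch command are illustrative placeholders; a real submission needs a running cluster behind the configured runtime:

import org.apache.hadoop.yarn.submarine.client.cli.runjob.RunJobCli;
import org.apache.hadoop.yarn.submarine.common.ClientContext;
import org.apache.hadoop.yarn.submarine.runtimes.RuntimeFactory;

public class RunJobCliSketch {
  public static void main(String[] args) throws Exception {
    ClientContext clientContext = new ClientContext();
    clientContext.setRuntimeFactory(
        RuntimeFactory.getRuntimeFactory(clientContext));
    // Submit a two-worker TensorFlow job with made-up argument values.
    new RunJobCli(clientContext).run(new String[] {
        "--name", "tf-job-001",
        "--docker_image", "example/tf-image:latest",
        "--num_workers", "2",
        "--worker_resources", "memory-mb=2048,vcores=2",
        "--worker_launch_cmd", "python train.py",
        "--verbose"
    });
  }
}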

View File

@ -1,19 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/**
* This package contains classes that are related to the run job command.
*/
package org.apache.hadoop.yarn.submarine.client.cli.runjob;

View File

@ -1,80 +0,0 @@
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License. See accompanying LICENSE file.
*/
package org.apache.hadoop.yarn.submarine.common;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.submarine.common.conf.SubmarineConfiguration;
import org.apache.hadoop.yarn.submarine.common.fs.DefaultRemoteDirectoryManager;
import org.apache.hadoop.yarn.submarine.common.fs.RemoteDirectoryManager;
import org.apache.hadoop.yarn.submarine.runtimes.RuntimeFactory;
public class ClientContext {
private Configuration yarnConf = new YarnConfiguration();
private volatile RemoteDirectoryManager remoteDirectoryManager;
private YarnClient yarnClient;
private Configuration submarineConfig;
private RuntimeFactory runtimeFactory;
public ClientContext() {
submarineConfig = new SubmarineConfiguration();
}
public synchronized YarnClient getOrCreateYarnClient() {
if (yarnClient == null) {
yarnClient = YarnClient.createYarnClient();
yarnClient.init(yarnConf);
yarnClient.start();
}
return yarnClient;
}
public Configuration getYarnConfig() {
return yarnConf;
}
public void setConfiguration(Configuration conf) {
this.yarnConf = conf;
}
public RemoteDirectoryManager getRemoteDirectoryManager() {
if(remoteDirectoryManager == null) {
synchronized (this) {
if(remoteDirectoryManager == null) {
remoteDirectoryManager = new DefaultRemoteDirectoryManager(this);
}
}
}
return remoteDirectoryManager;
}
public Configuration getSubmarineConfig() {
return submarineConfig;
}
public void setSubmarineConfig(Configuration submarineConfig) {
this.submarineConfig = submarineConfig;
}
public RuntimeFactory getRuntimeFactory() {
return runtimeFactory;
}
public void setRuntimeFactory(RuntimeFactory runtimeFactory) {
this.runtimeFactory = runtimeFactory;
}
}

View File

@ -1,27 +0,0 @@
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License. See accompanying LICENSE file.
*/
package org.apache.hadoop.yarn.submarine.common;
public class Envs {
public static final String TASK_TYPE_ENV = "_TASK_TYPE";
public static final String TASK_INDEX_ENV = "_TASK_INDEX";
/*
* HDFS/HADOOP-related configs
*/
public static final String HADOOP_HDFS_HOME = "HADOOP_HDFS_HOME";
public static final String JAVA_HOME = "JAVA_HOME";
public static final String HADOOP_CONF_DIR = "HADOOP_CONF_DIR";
}

View File

@ -1,69 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.common.api;
/**
* Status of component of training job
*/
public class JobComponentStatus {
private String compName;
private long numReadyContainers = 0;
private long numRunningButUnreadyContainers = 0;
private long totalAskedContainers;
public JobComponentStatus(String compName, long nReadyContainers,
long nRunningButUnreadyContainers, long totalAskedContainers) {
this.compName = compName;
this.numReadyContainers = nReadyContainers;
this.numRunningButUnreadyContainers = nRunningButUnreadyContainers;
this.totalAskedContainers = totalAskedContainers;
}
public String getCompName() {
return compName;
}
public void setCompName(String compName) {
this.compName = compName;
}
public long getNumReadyContainers() {
return numReadyContainers;
}
public void setNumReadyContainers(long numReadyContainers) {
this.numReadyContainers = numReadyContainers;
}
public long getNumRunningButUnreadyContainers() {
return numRunningButUnreadyContainers;
}
public void setNumRunningButUnreadyContainers(
long numRunningButUnreadyContainers) {
this.numRunningButUnreadyContainers = numRunningButUnreadyContainers;
}
public long getTotalAskedContainers() {
return totalAskedContainers;
}
public void setTotalAskedContainers(long totalAskedContainers) {
this.totalAskedContainers = totalAskedContainers;
}
}

View File

@ -1,52 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.common.api;
/**
* State of training job
*/
public enum JobState {
/**
   * Job accepted by the scheduler and currently running
*/
RUNNING,
/**
* Job killed by user
*/
KILLED,
/**
* Job failed
*/
FAILED,
/**
* Job succeeded
*/
SUCCEEDED,
/**
* Job paused by user
*/
PAUSED;
public static boolean isFinal(JobState state) {
return state == KILLED || state == SUCCEEDED || state == FAILED;
}
}

View File

@ -1,87 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.common.api;
import java.io.PrintStream;
import java.time.LocalDateTime;
import java.util.List;
/**
* Status of training job.
*/
public class JobStatus {
protected String jobName;
protected JobState state;
protected String tensorboardLink = "N/A";
protected List<JobComponentStatus> componentStatus;
public void nicePrint(PrintStream out) {
out.println(
"Job Name=" + this.jobName + ", status=" + state.name() + " time="
+ LocalDateTime.now());
if (JobState.isFinal(this.state)) {
return;
}
if (tensorboardLink.startsWith("http")) {
out.println(" Tensorboard link: " + tensorboardLink);
}
out.println(" Components:");
for (JobComponentStatus comp : componentStatus) {
out.println(" [" + comp.getCompName() + "] Ready=" + comp
.getNumReadyContainers() + " + Running-But-Non-Ready=" + comp
.getNumRunningButUnreadyContainers() + " | Asked=" + comp
.getTotalAskedContainers());
}
out.println("------------------");
}
public JobState getState() {
return state;
}
public String getTensorboardLink() {
return tensorboardLink;
}
public List<JobComponentStatus> getComponentStatus() {
return componentStatus;
}
public String getJobName() {
return jobName;
}
public void setJobName(String jobName) {
this.jobName = jobName;
}
public void setState(JobState state) {
this.state = state;
}
public void setTensorboardLink(String tensorboardLink) {
this.tensorboardLink = tensorboardLink;
}
public void setComponentStatus(List<JobComponentStatus> componentStatus) {
this.componentStatus = componentStatus;
}
}
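
A small, self-contained illustration (with fabricated numbers) of how a status object is assembled and rendered by nicePrint:

import java.util.Collections;
import org.apache.hadoop.yarn.submarine.common.api.JobComponentStatus;
import org.apache.hadoop.yarn.submarine.common.api.JobState;
import org.apache.hadoop.yarn.submarine.common.api.JobStatus;

public class JobStatusSketch {
  public static void main(String[] args) {
    // One "worker" component: 1 ready, 1 running-but-unready, 2 asked for.
    JobComponentStatus workerStatus =
        new JobComponentStatus("worker", 1, 1, 2);
    JobStatus status = new JobStatus();
    status.setJobName("tf-job-001");
    status.setState(JobState.RUNNING);
    status.setComponentStatus(Collections.singletonList(workerStatus));
    status.nicePrint(System.out);  // prints name, state, time and components
  }
}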

View File

@ -1,54 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.common.api;
/**
* Enum to represent a PyTorch Role.
*/
public enum PyTorchRole implements Role {
PRIMARY_WORKER("master"),
WORKER("worker");
private String compName;
PyTorchRole(String compName) {
this.compName = compName;
}
public String getComponentName() {
return compName;
}
@Override
public String getName() {
return name();
}
}

View File

@ -1,25 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.common.api;
/**
* Interface for a Role.
*/
public interface Role {
String getComponentName();
String getName();
}

View File

@ -1,58 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.common.api;
import com.google.common.collect.Lists;
import java.util.List;
import java.util.stream.Collectors;
/**
* Represents the type of Runtime.
*/
public enum Runtime {
TONY(Constants.TONY), YARN_SERVICE(Constants.YARN_SERVICE);
private String value;
Runtime(String value) {
this.value = value;
}
public String getValue() {
return value;
}
public static Runtime parseByValue(String value) {
for (Runtime rt : Runtime.values()) {
if (rt.value.equalsIgnoreCase(value)) {
return rt;
}
}
return null;
}
public static String getValues() {
List<String> values = Lists.newArrayList(Runtime.values()).stream()
.map(rt -> rt.value).collect(Collectors.toList());
return String.join(",", values);
}
public static class Constants {
public static final String TONY = "tony";
public static final String YARN_SERVICE = "yarnservice";
}
}
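
A brief sketch of resolving a runtime from a configuration string; parseByValue returns null for unknown values, so the caller is expected to fall back or fail:

import org.apache.hadoop.yarn.submarine.common.api.Runtime;

public class RuntimeLookupSketch {
  public static void main(String[] args) {
    // The explicit import shadows java.lang.Runtime inside this file.
    Runtime runtime = Runtime.parseByValue("tony");
    if (runtime == null) {
      runtime = Runtime.YARN_SERVICE;  // fall back to the default runtime
    }
    System.out.println("Selected " + runtime.getValue()
        + "; known values: " + Runtime.getValues());
  }
}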

View File

@ -1,41 +0,0 @@
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License. See accompanying LICENSE file.
*/
package org.apache.hadoop.yarn.submarine.common.api;
/**
* Enum to represent a TensorFlow Role.
*/
public enum TensorFlowRole implements Role {
PRIMARY_WORKER("master"),
WORKER("worker"),
PS("ps"),
TENSORBOARD("tensorboard");
private String compName;
TensorFlowRole(String compName) {
this.compName = compName;
}
@Override
public String getComponentName() {
return compName;
}
@Override
public String getName() {
return name();
}
}

View File

@ -1,66 +0,0 @@
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License. See accompanying LICENSE file.
*/
package org.apache.hadoop.yarn.submarine.common.conf;
import org.apache.hadoop.conf.Configuration;
public class SubmarineConfiguration extends Configuration {
private static final String SUBMARINE_CONFIGURATION_FILE = "submarine.xml";
public static final String SUBMARINE_CONFIGURATION_PREFIX = "submarine.";
public static final String SUBMARINE_LOCALIZATION_PREFIX =
SUBMARINE_CONFIGURATION_PREFIX + "localization.";
/**
   * Limits the size of a directory/file to be localized.
   * To avoid exhausting local disk space,
   * this limit applies to both remote and local files to be localized.
*/
public static final String LOCALIZATION_MAX_ALLOWED_FILE_SIZE_MB =
SUBMARINE_LOCALIZATION_PREFIX + "max-allowed-file-size-mb";
// Default 2GB
public static final long DEFAULT_MAX_ALLOWED_REMOTE_URI_SIZE_MB = 2048;
public SubmarineConfiguration() {
this(new Configuration(false), true);
}
public SubmarineConfiguration(Configuration configuration) {
this(configuration, false);
}
public SubmarineConfiguration(Configuration configuration,
boolean loadLocalConfig) {
super(configuration);
if (loadLocalConfig) {
addResource(SUBMARINE_CONFIGURATION_FILE);
}
}
/*
* Runtime of submarine
*/
private static final String PREFIX = "submarine.";
public static final String RUNTIME_CLASS = PREFIX + "runtime.class";
public static final String DEFAULT_RUNTIME_CLASS =
"org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceRuntimeFactory";
public void setSubmarineRuntimeClass(String runtimeClass) {
set(RUNTIME_CLASS, runtimeClass);
}
}
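
A short sketch of how the runtime class is normally wired up: the default constructor loads submarine.xml from the classpath, and callers (tests in particular) may override the factory class programmatically:

import org.apache.hadoop.yarn.submarine.common.conf.SubmarineConfiguration;

public class SubmarineConfSketch {
  public static void main(String[] args) {
    SubmarineConfiguration conf = new SubmarineConfiguration();
    // Pick the YARN service runtime explicitly (this is also the default).
    conf.setSubmarineRuntimeClass(
        SubmarineConfiguration.DEFAULT_RUNTIME_CLASS);
    System.out.println(conf.get(SubmarineConfiguration.RUNTIME_CLASS));
  }
}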

View File

@ -1,31 +0,0 @@
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License. See accompanying LICENSE file.
*/
package org.apache.hadoop.yarn.submarine.common.conf;
public class SubmarineLogs {
private static volatile boolean verbose = false;
public static boolean isVerbose() {
return SubmarineLogs.verbose;
}
public static void verboseOn() {
SubmarineLogs.verbose = true;
}
public static void verboseOff() {
SubmarineLogs.verbose = false;
}
}

View File

@ -1,21 +0,0 @@
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License. See accompanying LICENSE file.
*/
package org.apache.hadoop.yarn.submarine.common.exception;
public class SubmarineException extends Exception {
public SubmarineException(String msg) {
super(msg);
}
}

View File

@ -1,25 +0,0 @@
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
* <p>
* http://www.apache.org/licenses/LICENSE-2.0
* <p>
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License. See accompanying LICENSE file.
*/
package org.apache.hadoop.yarn.submarine.common.exception;
public class SubmarineRuntimeException extends RuntimeException {
public SubmarineRuntimeException(String s) {
super(s);
}
public SubmarineRuntimeException(String message, Throwable cause) {
super(message, cause);
}
}

View File

@ -1,164 +0,0 @@
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License. See accompanying LICENSE file.
*/
package org.apache.hadoop.yarn.submarine.common.fs;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.submarine.client.cli.CliConstants;
import org.apache.hadoop.yarn.submarine.common.ClientContext;
import java.io.File;
import java.io.IOException;
import java.net.URI;
/**
* Manages remote directories for staging, log, etc.
* TODO, need to properly handle permission / name validation, etc.
*/
public class DefaultRemoteDirectoryManager implements RemoteDirectoryManager {
private FileSystem fs;
private Configuration conf;
public DefaultRemoteDirectoryManager(ClientContext context) {
this.conf = context.getYarnConfig();
try {
this.fs = FileSystem.get(context.getYarnConfig());
} catch (IOException e) {
throw new RuntimeException(e);
}
}
@Override
public Path getJobStagingArea(String jobName, boolean create)
throws IOException {
Path staging = new Path(getJobRootFolder(jobName), "staging");
if (create) {
createFolderIfNotExist(staging);
}
    // Get a file status to make sure it is an absolute path.
FileStatus fStatus = fs.getFileStatus(staging);
return fStatus.getPath();
}
@Override
public Path getJobCheckpointDir(String jobName, boolean create)
throws IOException {
Path jobDir = new Path(getJobStagingArea(jobName, create),
CliConstants.CHECKPOINT_PATH);
if (create) {
createFolderIfNotExist(jobDir);
}
return jobDir;
}
@Override
public Path getModelDir(String modelName, boolean create)
throws IOException {
Path modelDir = new Path(new Path("submarine", "models"), modelName);
if (create) {
createFolderIfNotExist(modelDir);
}
return modelDir;
}
@Override
public FileSystem getDefaultFileSystem() {
return fs;
}
@Override
public FileSystem getFileSystemByUri(String uri) throws IOException {
return FileSystem.get(URI.create(uri), conf);
}
@Override
public Path getUserRootFolder() throws IOException {
Path rootPath = new Path("submarine", "jobs");
createFolderIfNotExist(rootPath);
    // Get a file status to make sure it is an absolute path.
FileStatus fStatus = fs.getFileStatus(rootPath);
return fStatus.getPath();
}
@Override
public boolean isDir(String uri) throws IOException {
if (isRemote(uri)) {
return getFileSystemByUri(uri).getFileStatus(new Path(uri)).isDirectory();
}
return new File(uri).isDirectory();
}
@Override
public boolean isRemote(String uri) {
String scheme = new Path(uri).toUri().getScheme();
if (null == scheme) {
return false;
}
return !scheme.startsWith("file://");
}
@Override
public boolean copyRemoteToLocal(String remoteUri, String localUri)
throws IOException {
// Delete old to avoid failure in FileUtil.copy
File old = new File(localUri);
if (old.exists()) {
if (!FileUtil.fullyDelete(old)) {
throw new IOException("Failed to delete dir:"
+ old.getAbsolutePath());
}
}
return FileUtil.copy(getFileSystemByUri(remoteUri), new Path(remoteUri),
new File(localUri), false,
conf);
}
@Override
public boolean existsRemoteFile(Path url) throws IOException {
return getFileSystemByUri(url.toUri().toString()).exists(url);
}
@Override
public FileStatus getRemoteFileStatus(Path url) throws IOException {
return getFileSystemByUri(url.toUri().toString()).getFileStatus(url);
}
@Override
public long getRemoteFileSize(String uri) throws IOException {
return getFileSystemByUri(uri)
.getContentSummary(new Path(uri)).getSpaceConsumed();
}
private Path getJobRootFolder(String jobName) throws IOException {
Path userRoot = getUserRootFolder();
Path jobRootPath = new Path(userRoot, jobName);
createFolderIfNotExist(jobRootPath);
    // Get a file status to make sure it is an absolute path.
FileStatus fStatus = fs.getFileStatus(jobRootPath);
return fStatus.getPath();
}
private void createFolderIfNotExist(Path path) throws IOException {
if (!fs.exists(path)) {
if (!fs.mkdirs(path)) {
throw new IOException("Failed to create folder=" + path);
}
}
}
}
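
A hedged sketch of how the staging directory is resolved. It relies on whatever fs.defaultFS the surrounding YARN configuration points at, so it should only be run against a scratch cluster or a local filesystem:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.submarine.common.ClientContext;
import org.apache.hadoop.yarn.submarine.common.fs.DefaultRemoteDirectoryManager;
import org.apache.hadoop.yarn.submarine.common.fs.RemoteDirectoryManager;

public class StagingDirSketch {
  public static void main(String[] args) throws Exception {
    RemoteDirectoryManager rdm =
        new DefaultRemoteDirectoryManager(new ClientContext());
    // Creates submarine/jobs/tf-job-001/staging (relative to the filesystem
    // working directory) if absent and returns its fully qualified path.
    Path staging = rdm.getJobStagingArea("tf-job-001", true);
    System.out.println("Staging dir: " + staging);
  }
}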

View File

@ -1,48 +0,0 @@
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License. See accompanying LICENSE file.
*/
package org.apache.hadoop.yarn.submarine.common.fs;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.io.IOException;
public interface RemoteDirectoryManager {
Path getJobStagingArea(String jobName, boolean create) throws IOException;
Path getJobCheckpointDir(String jobName, boolean create) throws IOException;
Path getModelDir(String modelName, boolean create) throws IOException;
FileSystem getDefaultFileSystem() throws IOException;
FileSystem getFileSystemByUri(String uri) throws IOException;
Path getUserRootFolder() throws IOException;
boolean isDir(String uri) throws IOException;
boolean isRemote(String uri) throws IOException;
boolean copyRemoteToLocal(String remoteUri, String localUri)
throws IOException;
boolean existsRemoteFile(Path uri) throws IOException;
FileStatus getRemoteFileStatus(Path uri) throws IOException;
long getRemoteFileSize(String uri) throws IOException;
}

View File

@ -1,332 +0,0 @@
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License. See accompanying LICENSE file.
*/
package org.apache.hadoop.yarn.submarine.common.resource;
import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.submarine.common.exception.SubmarineRuntimeException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
/**
 * This class implements some methods with almost the same logic as
* org.apache.hadoop.yarn.util.resource.ResourceUtils of hadoop 3.3.
* If the hadoop dependencies are upgraded to 3.3, this class can be refactored
* with org.apache.hadoop.yarn.util.resource.ResourceUtils.
*/
public final class ResourceUtils {
private final static String RES_PATTERN = "^[^=]+=\\d+\\s?\\w*$";
private final static String SET_RESOURCE_VALUE_METHOD = "setResourceValue";
private final static String SET_MEMORY_SIZE_METHOD = "setMemorySize";
private final static String DEPRECATED_SET_MEMORY_SIZE_METHOD =
"setMemory";
private final static String GET_MEMORY_SIZE_METHOD = "getMemorySize";
private final static String DEPRECATED_GET_MEMORY_SIZE_METHOD =
"getMemory";
private final static String GET_RESOURCE_VALUE_METHOD = "getResourceValue";
private final static String GET_RESOURCE_TYPE_METHOD =
"getResourcesTypeInfo";
private final static String REINITIALIZE_RESOURCES_METHOD =
"reinitializeResources";
public static final String MEMORY_URI = "memory-mb";
public static final String VCORES_URI = "vcores";
public static final String GPU_URI = "yarn.io/gpu";
public static final String FPGA_URI = "yarn.io/fpga";
private static final Logger LOG =
LoggerFactory.getLogger(ResourceUtils.class);
private ResourceUtils() {}
public static Resource createResourceFromString(String resourceStr) {
Map<String, Long> typeToValue = parseResourcesString(resourceStr);
Resource resource = Resource.newInstance(0, 0);
for (Map.Entry<String, Long> entry : typeToValue.entrySet()) {
if(entry.getKey().equals(VCORES_URI)) {
resource.setVirtualCores(entry.getValue().intValue());
continue;
} else if (entry.getKey().equals(MEMORY_URI)) {
setMemorySize(resource, entry.getValue());
continue;
}
setResource(resource, entry.getKey(), entry.getValue().intValue());
}
return resource;
}
private static Map<String, Long> parseResourcesString(String resourcesStr) {
Map<String, Long> resources = new HashMap<>();
String[] pairs = resourcesStr.trim().split(",");
for (String resource : pairs) {
resource = resource.trim();
if (!resource.matches(RES_PATTERN)) {
throw new IllegalArgumentException("\"" + resource + "\" is not a "
+ "valid resource type/amount pair. "
+ "Please provide key=amount pairs separated by commas.");
}
String[] splits = resource.split("=");
String key = splits[0], value = splits[1];
String units = getUnits(value);
String valueWithoutUnit = value.substring(0,
value.length()- units.length()).trim();
long resourceValue = Long.parseLong(valueWithoutUnit);
// Convert commandline unit to standard YARN unit.
if (units.equals("M") || units.equals("m")) {
units = "Mi";
} else if (units.equals("G") || units.equals("g")) {
units = "Gi";
} else if (!units.isEmpty()){
throw new IllegalArgumentException("Acceptable units are M/G or empty");
}
// special handle memory-mb and memory
if (key.equals(MEMORY_URI)) {
if (!units.isEmpty()) {
resourceValue = UnitsConversionUtil.convert(units, "Mi",
resourceValue);
}
}
if (key.equals("memory")) {
key = MEMORY_URI;
resourceValue = UnitsConversionUtil.convert(units, "Mi",
resourceValue);
}
// special handle gpu
if (key.equals("gpu")) {
key = GPU_URI;
}
// special handle fpga
if (key.equals("fpga")) {
key = FPGA_URI;
}
resources.put(key, resourceValue);
}
return resources;
}
/**
   * As Hadoop 2.9.2 and lower don't support resources other than CPU and
   * memory, use reflection to set GPU or other resources for compatibility
   * with Hadoop 2.9.2.
*/
public static void setResource(Resource resource, String resourceName,
int resourceValue) {
try {
Method method = resource.getClass().getMethod(SET_RESOURCE_VALUE_METHOD,
String.class, long.class);
method.invoke(resource, resourceName, resourceValue);
} catch (NoSuchMethodException e) {
LOG.error("There is no '" + SET_RESOURCE_VALUE_METHOD + "' API in this" +
"version of YARN", e);
throw new SubmarineRuntimeException(e.getMessage(), e.getCause());
} catch (IllegalAccessException | InvocationTargetException e) {
LOG.error("Failed to invoke '" + SET_RESOURCE_VALUE_METHOD +
"' method to set GPU resources", e);
throw new SubmarineRuntimeException(e.getMessage(), e.getCause());
}
return;
}
public static void setMemorySize(Resource resource, Long memorySize) {
boolean useWithIntParameter = false;
// For hadoop 2.9.2 and above
try {
Method method = resource.getClass().getMethod(SET_MEMORY_SIZE_METHOD,
long.class);
method.setAccessible(true);
method.invoke(resource, memorySize);
} catch (NoSuchMethodException nsme) {
LOG.info("There is no '" + SET_MEMORY_SIZE_METHOD + "(long)' API in" +
" this version of YARN");
useWithIntParameter = true;
} catch (IllegalAccessException | InvocationTargetException e) {
LOG.error("Failed to invoke '" + SET_MEMORY_SIZE_METHOD +
"' method", e);
throw new SubmarineRuntimeException(e.getMessage(), e.getCause());
}
// For hadoop 2.7.3
if (useWithIntParameter) {
try {
LOG.info("Trying to use '" + DEPRECATED_SET_MEMORY_SIZE_METHOD +
"(int)' API for this version of YARN");
Method method = resource.getClass().getMethod(
DEPRECATED_SET_MEMORY_SIZE_METHOD, int.class);
method.invoke(resource, memorySize.intValue());
} catch (NoSuchMethodException e) {
LOG.error("There is no '" + DEPRECATED_SET_MEMORY_SIZE_METHOD +
"(int)' API in this version of YARN", e);
throw new SubmarineRuntimeException(e.getMessage(), e.getCause());
} catch (IllegalAccessException | InvocationTargetException e) {
LOG.error("Failed to invoke '" + DEPRECATED_SET_MEMORY_SIZE_METHOD +
"' method", e);
throw new SubmarineRuntimeException(e.getMessage(), e.getCause());
}
}
}
public static long getMemorySize(Resource resource) {
boolean useWithIntParameter = false;
long memory = 0;
// For hadoop 2.9.2 and above
try {
Method method = resource.getClass().getMethod(GET_MEMORY_SIZE_METHOD);
method.setAccessible(true);
memory = (long) method.invoke(resource);
} catch (NoSuchMethodException e) {
LOG.info("There is no '" + GET_MEMORY_SIZE_METHOD + "' API in" +
" this version of YARN");
useWithIntParameter = true;
} catch (IllegalAccessException | InvocationTargetException e) {
LOG.error("Failed to invoke '" + GET_MEMORY_SIZE_METHOD +
"' method", e);
throw new SubmarineRuntimeException(e.getMessage(), e.getCause());
}
// For hadoop 2.7.3
if (useWithIntParameter) {
try {
LOG.info("Trying to use '" + DEPRECATED_GET_MEMORY_SIZE_METHOD +
"' API for this version of YARN");
Method method = resource.getClass().getMethod(
DEPRECATED_GET_MEMORY_SIZE_METHOD);
method.setAccessible(true);
memory = ((Integer) method.invoke(resource)).longValue();
} catch (NoSuchMethodException e) {
LOG.error("There is no '" + DEPRECATED_GET_MEMORY_SIZE_METHOD +
"' API in this version of YARN", e);
throw new SubmarineRuntimeException(e.getMessage(), e.getCause());
} catch (IllegalAccessException | InvocationTargetException e) {
LOG.error("Failed to invoke '" + DEPRECATED_GET_MEMORY_SIZE_METHOD +
"' method", e);
throw new SubmarineRuntimeException(e.getMessage(), e.getCause());
}
}
return memory;
}
/**
   * As Hadoop 2.9.2 and lower don't support resources other than CPU and
   * memory, use reflection to read GPU or other resource values for
   * compatibility with Hadoop 2.9.2.
*/
public static long getResourceValue(Resource resource, String resourceName) {
long resourceValue = 0;
try {
Method method = resource.getClass().getMethod(GET_RESOURCE_VALUE_METHOD,
String.class);
Object value = method.invoke(resource, resourceName);
resourceValue = (long) value;
} catch (NoSuchMethodException e) {
LOG.info("There is no '" + GET_RESOURCE_VALUE_METHOD + "' API in this" +
" version of YARN");
} catch (InvocationTargetException e) {
if (e.getTargetException().getClass().getName().equals(
"org.apache.hadoop.yarn.exceptions.ResourceNotFoundException")) {
LOG.info("Not found resource " + resourceName);
} else {
LOG.info("Failed to invoke '" + GET_RESOURCE_VALUE_METHOD + "'" +
" method to get resource " + resourceName);
throw new SubmarineRuntimeException(e.getMessage(), e.getCause());
}
} catch (IllegalAccessException | ClassCastException e) {
LOG.error("Failed to invoke '" + GET_RESOURCE_VALUE_METHOD +
"' method to get resource " + resourceName, e);
throw new SubmarineRuntimeException(e.getMessage(), e.getCause());
}
return resourceValue;
}
/**
   * As Hadoop 2.9.2 and lower don't support resources other than CPU and
   * memory, use reflection to register GPU or other resource types for
   * compatibility with Hadoop 2.9.2.
*/
  public static void configureResourceType(String resourceName) {
Class resourceTypeInfo;
try{
resourceTypeInfo = Class.forName(
"org.apache.hadoop.yarn.api.records.ResourceTypeInfo");
Class resourceUtils = Class.forName(
"org.apache.hadoop.yarn.util.resource.ResourceUtils");
Method method = resourceUtils.getMethod(GET_RESOURCE_TYPE_METHOD);
Object resTypes = method.invoke(null);
Method resourceTypeInstance = resourceTypeInfo.getMethod("newInstance",
String.class, String.class);
Object resourceType = resourceTypeInstance.invoke(null, resrouceName, "");
((ArrayList)resTypes).add(resourceType);
Method reInitialMethod = resourceUtils.getMethod(
REINITIALIZE_RESOURCES_METHOD, List.class);
reInitialMethod.invoke(null, resTypes);
} catch (ClassNotFoundException e) {
LOG.info("There is no specified class API in this" +
" version of YARN");
LOG.info(e.getMessage());
throw new SubmarineRuntimeException(e.getMessage(), e.getCause());
} catch (NoSuchMethodException nsme) {
LOG.info("There is no '" + GET_RESOURCE_VALUE_METHOD + "' API in this" +
" version of YARN");
} catch (IllegalAccessException | InvocationTargetException e) {
LOG.info("Failed to invoke 'configureResourceType' method ", e);
throw new SubmarineRuntimeException(e.getMessage(), e.getCause());
}
}
private static String getUnits(String resourceValue) {
return parseResourceValue(resourceValue)[0];
}
/**
* Extract unit and actual value from resource value.
* @param resourceValue Value of the resource
* @return Array containing unit and value. [0]=unit, [1]=value
   * @throws IllegalArgumentException if the units contain non-alphabetic characters
*/
private static String[] parseResourceValue(String resourceValue) {
String[] resource = new String[2];
int i = 0;
for (; i < resourceValue.length(); i++) {
if (Character.isAlphabetic(resourceValue.charAt(i))) {
break;
}
}
String units = resourceValue.substring(i);
if (StringUtils.isAlpha(units) || units.equals("")) {
resource[0] = units;
resource[1] = resourceValue.substring(0, i);
return resource;
} else {
throw new IllegalArgumentException("Units '" + units + "'"
+ " contains non alphabet characters, which is not allowed.");
}
}
}
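
A hedged example of the resource-string format this class accepts. Note that "memory" with a G/M suffix is normalized to memory-mb, and "gpu" maps to yarn.io/gpu, which has to be registered on the client side first or setResourceValue rejects the unknown type:

import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.submarine.common.resource.ResourceUtils;

public class ResourceStringSketch {
  public static void main(String[] args) {
    // Register the GPU resource type with the client-side resource registry.
    ResourceUtils.configureResourceType(ResourceUtils.GPU_URI);
    // "memory=4G" becomes memory-mb=4096; "gpu=1" becomes yarn.io/gpu=1.
    Resource res =
        ResourceUtils.createResourceFromString("memory=4G,vcores=4,gpu=1");
    System.out.println("memory-mb=" + ResourceUtils.getMemorySize(res)
        + ", vcores=" + res.getVirtualCores()
        + ", gpu=" + ResourceUtils.getResourceValue(res, ResourceUtils.GPU_URI));
  }
}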

View File

@ -1,164 +0,0 @@
/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.yarn.submarine.common.resource;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
/**
* Almost the same logic as UnitsConversionUtil[YARN-4081]. If the dependencies
* are upgraded to hadoop 3.*, this class can be replaced.
*/
public final class UnitsConversionUtil {
private UnitsConversionUtil() {}
/**
* Helper class for encapsulating conversion values.
*/
public static class Converter {
private long numerator;
private long denominator;
Converter(long n, long d) {
this.numerator = n;
this.denominator = d;
}
}
private static final String[] UNITS = {"p", "n", "u", "m", "", "k", "M", "G",
"T", "P", "Ki", "Mi", "Gi", "Ti", "Pi"};
private static final List<String> SORTED_UNITS = Arrays.asList(UNITS);
public static final Set<String> KNOWN_UNITS = createKnownUnitsSet();
private static final Converter PICO =
new Converter(1L, 1000L * 1000L * 1000L * 1000L);
private static final Converter NANO =
new Converter(1L, 1000L * 1000L * 1000L);
private static final Converter MICRO = new Converter(1L, 1000L * 1000L);
private static final Converter MILLI = new Converter(1L, 1000L);
private static final Converter BASE = new Converter(1L, 1L);
private static final Converter KILO = new Converter(1000L, 1L);
private static final Converter MEGA = new Converter(1000L * 1000L, 1L);
private static final Converter GIGA =
new Converter(1000L * 1000L * 1000L, 1L);
private static final Converter TERA =
new Converter(1000L * 1000L * 1000L * 1000L, 1L);
private static final Converter PETA =
new Converter(1000L * 1000L * 1000L * 1000L * 1000L, 1L);
private static final Converter KILO_BINARY = new Converter(1024L, 1L);
private static final Converter MEGA_BINARY = new Converter(1024L * 1024L, 1L);
private static final Converter GIGA_BINARY =
new Converter(1024L * 1024L * 1024L, 1L);
private static final Converter TERA_BINARY =
new Converter(1024L * 1024L * 1024L * 1024L, 1L);
private static final Converter PETA_BINARY =
new Converter(1024L * 1024L * 1024L * 1024L * 1024L, 1L);
private static Set<String> createKnownUnitsSet() {
Set<String> ret = new HashSet<>();
ret.addAll(Arrays.asList(UNITS));
return ret;
}
private static Converter getConverter(String unit) {
switch (unit) {
case "p":
return PICO;
case "n":
return NANO;
case "u":
return MICRO;
case "m":
return MILLI;
case "":
return BASE;
case "k":
return KILO;
case "M":
return MEGA;
case "G":
return GIGA;
case "T":
return TERA;
case "P":
return PETA;
case "Ki":
return KILO_BINARY;
case "Mi":
return MEGA_BINARY;
case "Gi":
return GIGA_BINARY;
case "Ti":
return TERA_BINARY;
case "Pi":
return PETA_BINARY;
default:
throw new IllegalArgumentException(
"Unknown unit '" + unit + "'. Known units are " + KNOWN_UNITS);
}
}
/**
* Converts a value from one unit to another. Supported units can be obtained
* by inspecting the KNOWN_UNITS set.
*
* @param fromUnit the unit of the from value
* @param toUnit the target unit
* @param fromValue the value you wish to convert
* @return the value in toUnit
*/
public static long convert(String fromUnit, String toUnit, long fromValue) {
if (toUnit == null || fromUnit == null) {
throw new IllegalArgumentException("One or more arguments are null");
}
if (fromUnit.equals(toUnit)) {
return fromValue;
}
Converter fc = getConverter(fromUnit);
Converter tc = getConverter(toUnit);
long numerator = fc.numerator * tc.denominator;
long denominator = fc.denominator * tc.numerator;
long numeratorMultiplierLimit = Long.MAX_VALUE / numerator;
if (numerator < denominator) {
if (numeratorMultiplierLimit < fromValue) {
String overflowMsg =
"Converting " + fromValue + " from '" + fromUnit + "' to '" + toUnit
+ "' will result in an overflow of Long";
throw new IllegalArgumentException(overflowMsg);
}
return (fromValue * numerator) / denominator;
}
if (numeratorMultiplierLimit > fromValue) {
return (numerator * fromValue) / denominator;
}
long tmp = numerator / denominator;
if ((Long.MAX_VALUE / tmp) < fromValue) {
String overflowMsg =
"Converting " + fromValue + " from '" + fromUnit + "' to '" + toUnit
+ "' will result in an overflow of Long";
throw new IllegalArgumentException(overflowMsg);
}
return fromValue * tmp;
}
}
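
Two worked conversions to make the unit table concrete (binary units use powers of 1024, decimal units powers of 1000):

import org.apache.hadoop.yarn.submarine.common.resource.UnitsConversionUtil;

public class UnitsSketch {
  public static void main(String[] args) {
    // 2 Gi expressed in Mi: 2 * 1024 = 2048.
    System.out.println(UnitsConversionUtil.convert("Gi", "Mi", 2));
    // 3 G expressed in base units: 3 * 1000^3 = 3000000000.
    System.out.println(UnitsConversionUtil.convert("G", "", 3));
  }
}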

View File

@ -1,19 +0,0 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* http://www.apache.org/licenses/LICENSE-2.0
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
/**
* This package contains resource utility classes.
*/
package org.apache.hadoop.yarn.submarine.common.resource;

View File

@ -1,103 +0,0 @@
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License. See accompanying LICENSE file.
*/
package org.apache.hadoop.yarn.submarine.runtimes;
import com.google.common.annotations.VisibleForTesting;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.submarine.common.ClientContext;
import org.apache.hadoop.yarn.submarine.common.conf.SubmarineConfiguration;
import org.apache.hadoop.yarn.submarine.common.exception.SubmarineRuntimeException;
import org.apache.hadoop.yarn.submarine.runtimes.common.JobMonitor;
import org.apache.hadoop.yarn.submarine.runtimes.common.JobSubmitter;
import org.apache.hadoop.yarn.submarine.runtimes.common.SubmarineStorage;
import java.lang.reflect.InvocationTargetException;
public abstract class RuntimeFactory {
protected ClientContext clientContext;
private JobSubmitter jobSubmitter;
private JobMonitor jobMonitor;
private SubmarineStorage submarineStorage;
public RuntimeFactory(ClientContext clientContext) {
this.clientContext = clientContext;
}
public static RuntimeFactory getRuntimeFactory(
ClientContext clientContext) {
Configuration submarineConfiguration =
clientContext.getSubmarineConfig();
String runtimeClass = submarineConfiguration.get(
SubmarineConfiguration.RUNTIME_CLASS,
SubmarineConfiguration.DEFAULT_RUNTIME_CLASS);
try {
Class<?> runtimeClazz = Class.forName(runtimeClass);
if (RuntimeFactory.class.isAssignableFrom(runtimeClazz)) {
return (RuntimeFactory) runtimeClazz.getConstructor(ClientContext.class).newInstance(clientContext);
} else {
throw new SubmarineRuntimeException("Class: " + runtimeClass
+ " not instance of " + RuntimeFactory.class.getCanonicalName());
}
} catch (ClassNotFoundException | IllegalAccessException |
InstantiationException | NoSuchMethodException |
InvocationTargetException e) {
throw new SubmarineRuntimeException(
"Could not instantiate RuntimeFactory: " + runtimeClass, e);
}
}
protected abstract JobSubmitter internalCreateJobSubmitter();
protected abstract JobMonitor internalCreateJobMonitor();
protected abstract SubmarineStorage internalCreateSubmarineStorage();
public synchronized JobSubmitter getJobSubmitterInstance() {
if (jobSubmitter == null) {
jobSubmitter = internalCreateJobSubmitter();
}
return jobSubmitter;
}
public synchronized JobMonitor getJobMonitorInstance() {
if (jobMonitor == null) {
jobMonitor = internalCreateJobMonitor();
}
return jobMonitor;
}
public synchronized SubmarineStorage getSubmarineStorage() {
if (submarineStorage == null) {
submarineStorage = internalCreateSubmarineStorage();
}
return submarineStorage;
}
@VisibleForTesting
public synchronized void setJobSubmitterInstance(JobSubmitter jobSubmitter) {
this.jobSubmitter = jobSubmitter;
}
@VisibleForTesting
public synchronized void setJobMonitorInstance(JobMonitor jobMonitor) {
this.jobMonitor = jobMonitor;
}
@VisibleForTesting
public synchronized void setSubmarineStorage(SubmarineStorage storage) {
this.submarineStorage = storage;
}
}

View File

@ -1,102 +0,0 @@
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
* <p>
* http://www.apache.org/licenses/LICENSE-2.0
* <p>
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License. See accompanying LICENSE file.
*/
package org.apache.hadoop.yarn.submarine.runtimes.common;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.submarine.common.ClientContext;
import org.apache.hadoop.yarn.submarine.common.fs.RemoteDirectoryManager;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.util.Map;
/**
* A super naive FS-based storage.
*/
public class FSBasedSubmarineStorageImpl extends SubmarineStorage {
RemoteDirectoryManager rdm;
public FSBasedSubmarineStorageImpl(ClientContext clientContext) {
rdm = clientContext.getRemoteDirectoryManager();
}
@Override
public void addNewJob(String jobName, Map<String, String> jobInfo)
throws IOException {
Path jobInfoPath = getJobInfoPath(jobName, true);
FSDataOutputStream fos = rdm.getDefaultFileSystem().create(jobInfoPath);
serializeMap(fos, jobInfo);
}
@Override
public Map<String, String> getJobInfoByName(String jobName)
throws IOException {
Path jobInfoPath = getJobInfoPath(jobName, false);
FSDataInputStream fis = rdm.getDefaultFileSystem().open(jobInfoPath);
return deserializeMap(fis);
}
@Override
public void addNewModel(String modelName, String version,
Map<String, String> modelInfo) throws IOException {
Path modelInfoPath = getModelInfoPath(modelName, version, true);
FSDataOutputStream fos = rdm.getDefaultFileSystem().create(modelInfoPath);
serializeMap(fos, modelInfo);
}
@Override
public Map<String, String> getModelInfoByName(String modelName,
String version) throws IOException {
Path modelInfoPath = getModelInfoPath(modelName, version, false);
FSDataInputStream fis = rdm.getDefaultFileSystem().open(modelInfoPath);
return deserializeMap(fis);
}
private Path getModelInfoPath(String modelName, String version, boolean create)
throws IOException {
Path modelDir = rdm.getModelDir(modelName, create);
return new Path(modelDir, version + ".info");
}
private void serializeMap(FSDataOutputStream fos, Map<String, String> map)
throws IOException {
ObjectOutput oo = new ObjectOutputStream(fos);
oo.writeObject(map);
oo.close();
}
private Map<String, String> deserializeMap(FSDataInputStream fis)
throws IOException {
ObjectInput oi = new ObjectInputStream(fis);
Map<String, String> newMap;
try {
newMap = (Map<String, String>) oi.readObject();
} catch (ClassNotFoundException e) {
throw new IOException(e);
}
return newMap;
}
private Path getJobInfoPath(String jobName, boolean create) throws IOException {
Path path = rdm.getJobStagingArea(jobName, create);
return new Path(path, "job.info");
}
}

View File

@ -1,90 +0,0 @@
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License. See accompanying LICENSE file.
*/
package org.apache.hadoop.yarn.submarine.runtimes.common;
import org.apache.hadoop.yarn.submarine.common.ClientContext;
import org.apache.hadoop.yarn.submarine.common.api.JobState;
import org.apache.hadoop.yarn.submarine.common.api.JobStatus;
import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.submarine.common.exception.SubmarineException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
/**
* Monitor status of job(s)
*/
public abstract class JobMonitor {
private static final Logger LOG =
LoggerFactory.getLogger(JobMonitor.class);
protected ClientContext clientContext;
public JobMonitor(ClientContext clientContext) {
this.clientContext = clientContext;
}
/**
* Returns status of training job.
*
* @param jobName name of job
* @return job status
* @throws IOException if anything else goes wrong
* @throws YarnException if anything related to YARN fails
*/
public abstract JobStatus getTrainingJobStatus(String jobName)
throws IOException, YarnException;
/**
* Cleanup AppAdminClient, etc.
*/
public void cleanup() throws IOException {}
/**
* Keep waiting and printing the job status until the job reaches a final state.
* @param jobName name of the job
* @throws IOException if an IO error happens
* @throws YarnException if a YARN related error happens
* @throws SubmarineException if the job fails or is killed
*/
public void waitTrainingFinal(String jobName)
throws IOException, YarnException, SubmarineException {
// Wait 5 sec between each fetch.
int waitIntervalSec = 5;
JobStatus js;
while (true) {
js = getTrainingJobStatus(jobName);
JobState jobState = js.getState();
js.nicePrint(System.err);
if (JobState.isFinal(jobState)) {
if (jobState.equals(JobState.FAILED)) {
throw new SubmarineException("Job failed");
} else if (jobState.equals(JobState.KILLED)) {
throw new SubmarineException("Job killed");
}
LOG.info("Job exited with state=" + jobState);
break;
}
try {
Thread.sleep(waitIntervalSec * 1000);
} catch (InterruptedException e) {
throw new IOException(e);
}
}
cleanup();
}
}

View File

@ -1,36 +0,0 @@
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License. See accompanying LICENSE file.
*/
package org.apache.hadoop.yarn.submarine.runtimes.common;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.exceptions.YarnException;
import org.apache.hadoop.yarn.submarine.client.cli.param.ParametersHolder;
import java.io.IOException;
/**
* Submit job to cluster master.
*/
public interface JobSubmitter {
/**
* Submit a job to cluster.
* @param parameters run job parameters
* @return applicationId when successfully submitted
* @throws YarnException for issues while contacting YARN daemons
* @throws IOException for other issues.
*/
ApplicationId submitJob(ParametersHolder parameters)
throws IOException, YarnException;
}

View File

@ -1,24 +0,0 @@
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
* <p>
* http://www.apache.org/licenses/LICENSE-2.0
* <p>
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License. See accompanying LICENSE file.
*/
package org.apache.hadoop.yarn.submarine.runtimes.common;
public class StorageKeyConstants {
public static final String JOB_NAME = "JOB_NAME";
public static final String JOB_RUN_ARGS = "JOB_RUN_ARGS";
public static final String APPLICATION_ID = "APPLICATION_ID";
public static final String CHECKPOINT_PATH = "CHECKPOINT_PATH";
public static final String INPUT_PATH = "INPUT_PATH";
public static final String SAVED_MODEL_PATH = "SAVED_MODEL_PATH";
}

View File

@ -1,57 +0,0 @@
/**
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
* <p>
* http://www.apache.org/licenses/LICENSE-2.0
* <p>
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License. See accompanying LICENSE file.
*/
package org.apache.hadoop.yarn.submarine.runtimes.common;
import java.io.IOException;
import java.util.Map;
/**
* Persistent job/model, etc.
*/
public abstract class SubmarineStorage {
/**
* Add a new job by name
* @param jobName name of job.
* @param jobInfo info of the job.
*/
public abstract void addNewJob(String jobName, Map<String, String> jobInfo)
throws IOException;
/**
* Get job info by job name.
* @param jobName name of job
* @return info of the job.
*/
public abstract Map<String, String> getJobInfoByName(String jobName)
throws IOException;
/**
* Add a new model
* @param modelName name of model
* @param version version of the model, when null is specified, it will be
* "default"
* @param modelInfo info of the model.
*/
public abstract void addNewModel(String modelName, String version,
Map<String, String> modelInfo) throws IOException;
/**
* Get model info by name and version.
* @param modelName name of model.
* @param version version of the model, when null is specified, it will be
*                "default".
* @return info of the model.
*/
public abstract Map<String, String> getModelInfoByName(String modelName, String version)
throws IOException;
}

View File

@ -1,21 +0,0 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
# Examples
Here are some examples of how to use Submarine:
[Running Distributed CIFAR 10 Tensorflow Job](RunningDistributedCifar10TFJobs.html)
[Running Standalone CIFAR 10 PyTorch Job](RunningSingleNodeCifar10PTJobs.html)

View File

@ -1,36 +0,0 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
# How to Install Dependencies
The Submarine project uses YARN Service, Docker containers and GPUs.
GPUs can only be used if GPU hardware is available and properly configured.
As an administrator, you have to properly set up the YARN Service related dependencies, including:
- YARN Registry DNS
- Docker related dependencies, including:
- Docker binary with expected versions
- Docker network that allows Docker containers to talk to each other across different nodes
If you would like to use GPU, you need to set up:
- GPU Driver
- Nvidia-docker
For your convenience, we provide some installation documents to help you set up your environment. You can always choose to install the dependencies in your own way.
Use Submarine installer to install dependencies: [EN](https://github.com/hadoopsubmarine/hadoop-submarine-ecosystem/tree/master/submarine-installer) [CN](https://github.com/hadoopsubmarine/hadoop-submarine-ecosystem/blob/master/submarine-installer/README-CN.md)
Alternatively, you can follow this guide to manually install dependencies: [EN](InstallationGuide.html) [CN](InstallationGuideChineseVersion.html)
Once you have installed all the dependencies, please follow this guide: [TestAndTroubleshooting](TestAndTroubleshooting.html).

View File

@ -1,47 +0,0 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Submarine is a project which allows infra engineers / data scientists to run
*unmodified* Tensorflow or PyTorch programs on YARN or Kubernetes.
Goals of Submarine:
- It allows jobs easy access to data/models in HDFS and other storage systems.
- Can launch services to serve Tensorflow/MXNet models.
- Supports running distributed Tensorflow jobs with simple configs.
- Supports running standalone PyTorch jobs with simple configs.
- Supports running user-specified Docker images.
- Supports specifying GPU and other resources.
- Supports launching Tensorboard for training jobs (optional, if specified).
- Supports customized DNS name for roles (like tensorboard.$user.$domain:6006)
If you want to deep-dive, please check these resources:
- [QuickStart Guide](QuickStart.html)
- [Examples](Examples.html)
- [How to write Dockerfile for Submarine TensorFlow jobs](WriteDockerfileTF.html)
- [How to write Dockerfile for Submarine PyTorch jobs](WriteDockerfilePT.html)
- [Installation guides](HowToInstall.html)

View File

@ -1,594 +0,0 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
# Submarine Installation Guide
## Prerequisites
Please note that the following prerequisites are just one example of how to install Submarine.
You can always choose your own kernel version, users, drivers, etc.
### Operating System
The operating system and kernel versions we have tested against are shown in the following table.
The versions in the table are the recommended minimum required versions.
| Environment | Version |
| ------ | ------ |
| Operating System | centos-release-7-5.1804.el7.centos.x86_64 |
| Kernel | 3.10.0-862.el7.x86_64 |
### User & Group
There are specific users and groups recommended to be created to install Hadoop with Docker.
Please create these users if they do not exist.
```
adduser hdfs
adduser mapred
adduser yarn
addgroup hadoop
usermod -aG hdfs,hadoop hdfs
usermod -aG mapred,hadoop mapred
usermod -aG yarn,hadoop yarn
usermod -aG hdfs,hadoop hadoop
groupadd docker
usermod -aG docker yarn
usermod -aG docker hadoop
```
### GCC Version
Check the version of the GCC toolchain (needed to compile kernel modules).
```bash
gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)
# install if needed
yum install gcc make g++
```
### Kernel header & Kernel devel
```bash
# Approach 1
yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
# Approach 2
wget http://vault.centos.org/7.3.1611/os/x86_64/Packages/kernel-headers-3.10.0-862.el7.x86_64.rpm
rpm -ivh kernel-headers-3.10.0-862.el7.x86_64.rpm
```
### GPU Servers (Only for Nvidia GPU equipped nodes)
```
lspci | grep -i nvidia
# If the server has gpus, you can get info like this
04:00.0 3D controller: NVIDIA Corporation Device 1b38 (rev a1)
82:00.0 3D controller: NVIDIA Corporation Device 1b38 (rev a1)
```
### Nvidia Driver Installation (Only for Nvidia GPU equipped nodes)
To make a clean installation when you need to upgrade GPU drivers:
if an Nvidia driver / CUDA has been installed before, it should be uninstalled first.
```
# uninstall cuda
sudo /usr/local/cuda-10.0/bin/uninstall_cuda_10.0.pl
# uninstall nvidia-driver
sudo /usr/bin/nvidia-uninstall
```
To check GPU version, install nvidia-detect:
```
yum install nvidia-detect
# run 'nvidia-detect -v' to get the required nvidia driver version
nvidia-detect -v
Probing for supported NVIDIA devices...
[10de:13bb] NVIDIA Corporation GM107GL [Quadro K620]
This device requires the current xyz.nm NVIDIA driver kmod-nvidia
[8086:1912] Intel Corporation HD Graphics 530
An Intel display controller was also detected
```
Pay attention to `This device requires the current xyz.nm NVIDIA driver kmod-nvidia`.
Download the installer like [NVIDIA-Linux-x86_64-390.87.run](https://www.nvidia.com/object/linux-amd64-display-archive.html).
Some preparatory work is needed before installing the Nvidia driver.
The steps below are for the Nvidia GPU driver installation, listed here for your convenience.
```
# It may take a while to update
yum -y update
yum -y install kernel-devel
yum -y install epel-release
yum -y install dkms
# Disable nouveau
vim /etc/default/grub
# Add the following configuration in “GRUB_CMDLINE_LINUX” part
rd.driver.blacklist=nouveau nouveau.modeset=0
# Generate configuration
grub2-mkconfig -o /boot/grub2/grub.cfg
vim /etc/modprobe.d/blacklist.conf
# Add configuration:
blacklist nouveau
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img
dracut /boot/initramfs-$(uname -r).img $(uname -r)
reboot
```
Check whether nouveau is disabled
```
lsmod | grep nouveau # return null
# install nvidia driver
sh NVIDIA-Linux-x86_64-390.87.run
```
Some options during the installation
```
Install NVIDIA's 32-bit compatibility libraries (Yes)
centos Install NVIDIA's 32-bit compatibility libraries (Yes)
Would you like to run the nvidia-xconfig utility to automatically update your X configuration file... (NO)
```
Check Nvidia driver installation
```
nvidia-smi
```
Reference
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
### Docker Installation
The following steps show you how to install docker 18.06.1.ce. You can choose other approaches to install Docker.
```
# Remove old version docker
sudo yum remove docker \
docker-client \
docker-client-latest \
docker-common \
docker-latest \
docker-latest-logrotate \
docker-logrotate \
docker-engine
# Docker version
export DOCKER_VERSION="18.06.1.ce"
# Setup the repository
sudo yum install -y yum-utils \
device-mapper-persistent-data \
lvm2
sudo yum-config-manager \
--add-repo \
https://download.docker.com/linux/centos/docker-ce.repo
# Check docker version
yum list docker-ce --showduplicates | sort -r
# Install docker with specified DOCKER_VERSION
sudo yum install -y docker-ce-${DOCKER_VERSION} docker-ce-cli-${DOCKER_VERSION} containerd.io
# Start docker
systemctl start docker
chown hadoop:netease /var/run/docker.sock
chown hadoop:netease /usr/bin/docker
```
Reference: https://docs.docker.com/install/linux/docker-ce/centos/
### Docker Configuration
Add a file named daemon.json under /etc/docker/.
Replace the variables image_registry_ip, etcd_host_ip, localhost_ip, yarn_dns_registry_host_ip and dns_host_ip with the IPs of your environment.
```
{
"insecure-registries": ["${image_registry_ip}:5000"],
"cluster-store":"etcd://${etcd_host_ip1}:2379,${etcd_host_ip2}:2379,${etcd_host_ip3}:2379",
"cluster-advertise":"{localhost_ip}:2375",
"dns": ["${yarn_dns_registry_host_ip}", "${dns_host_ip1}"],
"hosts": ["tcp://{localhost_ip}:2375", "unix:///var/run/docker.sock"]
}
```
Restart docker daemon
```
sudo systemctl restart docker
```
### Check docker version
```bash
$ docker version
Client:
Version: 18.06.1-ce
API version: 1.38
Go version: go1.10.3
Git commit: e68fc7a
Built: Tue Aug 21 17:23:03 2018
OS/Arch: linux/amd64
Experimental: false
Server:
Version: 18.06.1-ce
API version: 1.38 (minimum version 1.12)
Go version: go1.10.3
Git commit: e68fc7a
Built: Tue Aug 21 17:23:03 2018
OS/Arch: linux/amd64
Experimental: false
```
### Nvidia-docker Installation (Only for Nvidia GPU equipped nodes)
Submarine already supports nvidia-docker V2.
```
# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.repo | \
sudo tee /etc/yum.repos.d/nvidia-container-runtime.repo
sudo yum install -y nvidia-docker2-2.0.3-1.docker18.06.1.ce
```
According to the `nvidia-driver` version, create folders under `/var/lib/nvidia-docker/volumes/nvidia_driver/`:
```
mkdir /var/lib/nvidia-docker/volumes/nvidia_driver/390.87
# 390.87 is the nvidia driver version
mkdir /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/bin
mkdir /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/lib64
cp /usr/bin/nvidia* /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/bin
cp /usr/lib64/libcuda* /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/lib64
cp /usr/lib64/libnvidia* /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/lib64
# Test with nvidia-smi
nvidia-docker run --rm nvidia/cuda:10.0-devel nvidia-smi
```
Test docker, nvidia-docker, nvidia-driver installation
```
# Test 1
nvidia-docker run --rm nvidia/cuda nvidia-smi
```
```
# Test 2
nvidia-docker run -it tensorflow/tensorflow:1.9.0-gpu bash
# In docker container
python
import tensorflow as tf
tf.test.is_gpu_available()
```
If you want to uninstall nvidia-docker V2:
```
sudo yum remove -y nvidia-docker2-2.0.3-1.docker18.06.1.ce
```
Reference:
https://github.com/NVIDIA/nvidia-docker
### Tensorflow Image
There is no need to install CUDNN and CUDA on the servers, because CUDNN and CUDA can be included in the docker images.
We can pull or build basic docker images by referring to [Write Dockerfile](WriteDockerfileTF.html), for example:
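A minimal sketch is shown below; the image names, tag and Dockerfile name follow the QuickStart examples and are only placeholders for your own build:
```bash
# pull a prebuilt image from Docker Hub
docker pull hadoopsubmarine/tf-1.13.1-gpu:0.0.1

# or build your own image from a local Dockerfile
docker build . -f Dockerfile.gpu.tf_1.13.1 -t tf-1.13.1-gpu:0.0.1

# optionally tag and push it to the private registry configured in daemon.json
docker tag tf-1.13.1-gpu:0.0.1 ${image_registry_ip}:5000/tf-1.13.1-gpu:0.0.1
docker push ${image_registry_ip}:5000/tf-1.13.1-gpu:0.0.1
```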
### Test tensorflow in a docker container
After the docker image is built, we can check the
Tensorflow environment before submitting a Submarine job.
```shell
$ docker run -it ${docker_image_name} /bin/bash
# >>> In the docker container
$ python
$ python >> import tensorflow as tf
$ python >> tf.__version__
```
If there are errors, check the following configuration items.
1. LD_LIBRARY_PATH environment variable
```
echo $LD_LIBRARY_PATH
/usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
```
2. The location of libcuda.so.1, libcuda.so
```
ls -l /usr/local/nvidia/lib64 | grep libcuda.so
```
### Etcd Installation
Etcd is a distributed, reliable key-value store for the most critical data of a distributed system; here it is used for registration and discovery of the services running in containers.
You can also choose alternatives like ZooKeeper, Consul or others.
To install Etcd on specified servers, we can run Submarine-installer/install.sh
```shell
$ ./Submarine-installer/install.sh
# Etcd status
systemctl status Etcd.service
```
Check Etcd cluster health
```shell
$ etcdctl cluster-health
member 3adf2673436aa824 is healthy: got healthy result from http://${etcd_host_ip1}:2379
member 85ffe9aafb7745cc is healthy: got healthy result from http://${etcd_host_ip2}:2379
member b3d05464c356441a is healthy: got healthy result from http://${etcd_host_ip3}:2379
cluster is healthy
$ etcdctl member list
3adf2673436aa824: name=etcdnode3 peerURLs=http://${etcd_host_ip1}:2380 clientURLs=http://${etcd_host_ip1}:2379 isLeader=false
85ffe9aafb7745cc: name=etcdnode2 peerURLs=http://${etcd_host_ip2}:2380 clientURLs=http://${etcd_host_ip2}:2379 isLeader=false
b3d05464c356441a: name=etcdnode1 peerURLs=http://${etcd_host_ip3}:2380 clientURLs=http://${etcd_host_ip3}:2379 isLeader=true
```
### Calico Installation
Calico creates and manages a flat layer-3 network, and each container is assigned a routable IP address.
We are listing the steps here for your convenience.
You can also choose alternatives like Flannel, OVS or others.
To install Calico on specified servers, we can run Submarine-installer/install.sh
```
systemctl start calico-node.service
systemctl status calico-node.service
```
#### Check Calico Network
```shell
# Run the following command to show all host status in the cluster except localhost.
$ calicoctl node status
Calico process is running.
IPv4 BGP status
+---------------+-------------------+-------+------------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+---------------+-------------------+-------+------------+-------------+
| ${host_ip1} | node-to-node mesh | up | 2018-09-21 | Established |
| ${host_ip2} | node-to-node mesh | up | 2018-09-21 | Established |
| ${host_ip3} | node-to-node mesh | up | 2018-09-21 | Established |
+---------------+-------------------+-------+------------+-------------+
IPv6 BGP status
No IPv6 peers found.
```
Create containers to validate calico network
```
docker network create --driver calico --ipam-driver calico-ipam calico-network
docker run --net calico-network --name workload-A -tid busybox
docker run --net calico-network --name workload-B -tid busybox
docker exec workload-A ping workload-B
```
## Hadoop Installation
### Get Hadoop Release
You can either get Hadoop release binary or compile from source code. Please follow the https://hadoop.apache.org/ guides.
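As a rough sketch of both options (the version number below is only an example and should match the release you actually use):
```bash
# Option 1: build a binary distribution from the Hadoop source tree
mvn clean package -Pdist -DskipTests -Dtar

# Option 2: download a prebuilt release tarball and unpack it
tar -xzf hadoop-3.2.0.tar.gz -C /home/hadoop/
ln -s /home/hadoop/hadoop-3.2.0 /home/hadoop/hadoop-current
```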
### Start YARN service
```
YARN_LOGFILE=resourcemanager.log ./sbin/yarn-daemon.sh start resourcemanager
YARN_LOGFILE=nodemanager.log ./sbin/yarn-daemon.sh start nodemanager
YARN_LOGFILE=timeline.log ./sbin/yarn-daemon.sh start timelineserver
YARN_LOGFILE=mr-historyserver.log ./sbin/mr-jobhistory-daemon.sh start historyserver
```
### Start YARN registry DNS service
```
sudo YARN_LOGFILE=registrydns.log ./yarn-daemon.sh start registrydns
```
### Test with a MR wordcount job
```
./bin/hadoop jar /home/hadoop/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0-SNAPSHOT.jar wordcount /tmp/wordcount.txt /tmp/wordcount-output4
```
## Tensorflow Job with CPU
### Standalone Mode
#### Clean up apps with the same name
Suppose we want to submit a TensorFlow job named standalone-tf; first destroy any application with the same name and clean up the historical job directories.
```bash
./bin/yarn app -destroy standalone-tf
./bin/hdfs dfs -rmr hdfs://${dfs_name_service}/tmp/cifar-10-jobdir
```
where ${dfs_name_service} is the HDFS name service you use
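If you are unsure which name service is configured, a quick way to look it up on a client node is sketched below (assuming an HA HDFS setup where dfs.nameservices is set):
```bash
# print the configured HDFS name service(s)
hdfs getconf -confKey dfs.nameservices
```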
#### Run a standalone tensorflow job
```bash
./bin/yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
--env DOCKER_JAVA_HOME=/opt/java \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-current --name standalone-tf \
--docker_image tf-1.13.1-cpu:0.0.1 \
--input_path hdfs://${dfs_name_service}/tmp/cifar-10-data \
--checkpoint_path hdfs://${dfs_name_service}/user/hadoop/tf-checkpoint \
--worker_resources memory=4G,vcores=2 --verbose \
--worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --num-gpus=0"
```
### Distributed Mode
#### Clean up apps with the same name
```bash
./bin/yarn app -destroy distributed-tf
./bin/hdfs dfs -rmr hdfs://${dfs_name_service}/tmp/cifar-10-jobdir
```
#### Run a distributed TensorFlow job
```bash
./bin/yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
--env DOCKER_JAVA_HOME=/opt/java \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-current --name distributed-tf \
--env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
--docker_image tf-1.13.1-cpu:0.0.1 \
--input_path hdfs://${dfs_name_service}/tmp/cifar-10-data \
--checkpoint_path hdfs://${dfs_name_service}/user/hadoop/tf-distributed-checkpoint \
--worker_resources memory=4G,vcores=2 --verbose \
--num_ps 1 \
--ps_resources memory=4G,vcores=2 \
--ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --num-gpus=0" \
--num_workers 4 \
--worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=0"
```
## TensorFlow Job with GPU
### GPU configurations for both ResourceManager and NodeManager
Add the YARN resource configuration file, named resource-types.xml
```
<configuration>
<property>
<name>yarn.resource-types</name>
<value>yarn.io/gpu</value>
</property>
</configuration>
```
#### GPU configurations for ResourceManager
The scheduler used by ResourceManager must be the capacity scheduler, and yarn.scheduler.capacity.resource-calculator in capacity-scheduler.xml should be DominantResourceCalculator
```
<configuration>
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
</configuration>
```
#### GPU configurations for NodeManager
Add configurations in yarn-site.xml
```
<configuration>
<property>
<name>yarn.nodemanager.resource-plugins</name>
<value>yarn.io/gpu</value>
</property>
<!--Use nvidia docker v2-->
<property>
<name>yarn.nodemanager.resource-plugins.gpu.docker-plugin</name>
<value>nvidia-docker-v2</value>
</property>
</configuration>
```
Add configurations to container-executor.cfg
```
[docker]
...
# Add configurations in `[docker]` part
# /usr/bin/nvidia-docker is the path of the nvidia-docker command
# nvidia_driver_<version> must match the installed nvidia driver version, e.g. nvidia_driver_375.26; the nvidia-smi command can be used to check it
docker.allowed.volume-drivers=/usr/bin/nvidia-docker
docker.allowed.devices=/dev/nvidiactl,/dev/nvidia-uvm,/dev/nvidia-uvm-tools,/dev/nvidia1,/dev/nvidia0
docker.allowed.ro-mounts=nvidia_driver_<version>
# Use nvidia docker v2
docker.allowed.runtimes=nvidia
[gpu]
module.enabled=true
[cgroups]
# /sys/fs/cgroup is the cgroup mount destination
# /hadoop-yarn is the path yarn creates by default
root=/sys/fs/cgroup
yarn-hierarchy=/hadoop-yarn
```
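Before restarting the NodeManager, it can help to sanity-check that the cgroup mount point and GPU devices assumed above actually exist on the node. A minimal sketch:
```bash
# verify the cgroup mount destination referenced by root=/sys/fs/cgroup
mount | grep cgroup

# verify the GPU device files listed in docker.allowed.devices
ls -l /dev/nvidia*
```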
### Run a distributed TensorFlow GPU job
```bash
./yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
--env DOCKER_JAVA_HOME=/opt/java \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-current --name distributed-tf-gpu \
--env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
--docker_image tf-1.13.1-gpu:0.0.1 \
--input_path hdfs://${dfs_name_service}/tmp/cifar-10-data \
--checkpoint_path hdfs://${dfs_name_service}/user/hadoop/tf-distributed-checkpoint \
--num_ps 0 \
--ps_resources memory=4G,vcores=2,gpu=0 \
--ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --num-gpus=0" \
--worker_resources memory=4G,vcores=2,gpu=1 --verbose \
--num_workers 1 \
--worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1"
```
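After submission, the state of the launched service can be checked, and the service torn down when finished, with the same YARN application CLI used earlier, for example:
```bash
# check the overall state of the launched service
./bin/yarn app -status distributed-tf-gpu

# destroy it once you are done
./bin/yarn app -destroy distributed-tf-gpu
```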

View File

@ -1,704 +0,0 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
# Submarine Installation Guide (Chinese Version)
## Prerequisites
### Operating System
The operating system version we use is centos-release-7-5.1804.el7.centos.x86_64, and the kernel version is 3.10.0-862.el7.x86_64.
| Environment | Version |
| ------ | ------ |
| Operating System | centos-release-7-5.1804.el7.centos.x86_64 |
| Kernel | 3.10.0-862.el7.x86_64 |
### User & Group
If these groups and users do not exist in the operating system, they must be added. Some of the users are required to run hadoop, and some are required to run docker.
```
adduser hdfs
adduser mapred
adduser yarn
addgroup hadoop
usermod -aG hdfs,hadoop hdfs
usermod -aG mapred,hadoop mapred
usermod -aG yarn,hadoop yarn
usermod -aG hdfs,hadoop hadoop
groupadd docker
usermod -aG docker yarn
usermod -aG docker hadoop
```
### GCC Version
```bash
gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11)
# If not installed, run the following command to install it
yum install gcc make g++
```
### Kernel header & devel
```bash
# Approach 1:
yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
# Approach 2:
wget http://vault.centos.org/7.3.1611/os/x86_64/Packages/kernel-headers-3.10.0-862.el7.x86_64.rpm
rpm -ivh kernel-headers-3.10.0-862.el7.x86_64.rpm
```
### Check the GPU Version
```
lspci | grep -i nvidia
# If nothing is printed, the GPU model is not suitable. Below is the output on my machine:
# 04:00.0 3D controller: NVIDIA Corporation Device 1b38 (rev a1)
# 82:00.0 3D controller: NVIDIA Corporation Device 1b38 (rev a1)
```
### Install the Nvidia Driver
Before installing the nvidia driver / cuda, make sure any previously installed nvidia driver / cuda has been cleaned up.
```
# Uninstall cuda
sudo /usr/local/cuda-10.0/bin/uninstall_cuda_10.0.pl
# Uninstall the nvidia-driver
sudo /usr/bin/nvidia-uninstall
```
Install nvidia-detect to check the GPU version:
```
yum install nvidia-detect
# running nvidia-detect -v returns:
nvidia-detect -v
Probing for supported NVIDIA devices...
[10de:13bb] NVIDIA Corporation GM107GL [Quadro K620]
This device requires the current 390.87 NVIDIA driver kmod-nvidia
[8086:1912] Intel Corporation HD Graphics 530
An Intel display controller was also detected
```
Note the [Quadro K620] and 390.87 information here.
Download [NVIDIA-Linux-x86_64-390.87.run](https://www.nvidia.com/object/linux-amd64-display-archive.html).
A series of preparatory steps before the installation:
```
# If the system has not been updated for a long time, this step may take a while
yum -y update
yum -y install kernel-devel
yum -y install epel-release
yum -y install dkms
# Disable nouveau
vim /etc/default/grub # add rd.driver.blacklist=nouveau nouveau.modeset=0 to "GRUB_CMDLINE_LINUX"
grub2-mkconfig -o /boot/grub2/grub.cfg # generate the configuration
vim /etc/modprobe.d/blacklist.conf # create the file and add: blacklist nouveau
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img
dracut /boot/initramfs-$(uname -r).img $(uname -r) # update the configuration, then reboot
reboot
```
After rebooting, confirm that nouveau is disabled
```
lsmod | grep nouveau # should return nothing
# start the driver installation
sh NVIDIA-Linux-x86_64-390.87.run
```
During the installation, you will see some options:
```
Install NVIDIA's 32-bit compatibility libraries (Yes)
centos Install NVIDIA's 32-bit compatibility libraries (Yes)
Would you like to run the nvidia-xconfig utility to automatically update your X configuration file... (NO)
```
Finally, check the nvidia gpu status
```
nvidia-smi
```
Reference
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
### Install Docker
```
# Remove old version docker
sudo yum remove docker \
docker-client \
docker-client-latest \
docker-common \
docker-latest \
docker-latest-logrotate \
docker-logrotate \
docker-engine
# Docker version
export DOCKER_VERSION="18.06.1.ce"
# Setup the repository
sudo yum install -y yum-utils \
device-mapper-persistent-data \
lvm2
sudo yum-config-manager \
--add-repo \
https://download.docker.com/linux/centos/docker-ce.repo
# Check docker version
yum list docker-ce --showduplicates | sort -r
# Install docker with specified DOCKER_VERSION
sudo yum install -y docker-ce-${DOCKER_VERSION} docker-ce-cli-${DOCKER_VERSION} containerd.io
# Start docker
systemctl start docker
chown hadoop:netease /var/run/docker.sock
chown hadoop:netease /usr/bin/docker
```
Reference: https://docs.docker.com/install/linux/docker-ce/centos/
### Configure Docker
Under the `/etc/docker/` directory, create a `daemon.json` file and add the following configuration. Variables such as image_registry_ip, etcd_host_ip, localhost_ip, yarn_dns_registry_host_ip and dns_host_ip need to be modified according to your environment.
```
{
"insecure-registries": ["${image_registry_ip}:5000"],
"cluster-store":"etcd://${etcd_host_ip1}:2379,${etcd_host_ip2}:2379,${etcd_host_ip3}:2379",
"cluster-advertise":"{localhost_ip}:2375",
"dns": ["${yarn_dns_registry_host_ip}", "${dns_host_ip1}"],
"hosts": ["tcp://{localhost_ip}:2375", "unix:///var/run/docker.sock"]
}
```
Restart the docker daemon
```
sudo systemctl restart docker
```
### Check the Docker version
```bash
$ docker version
Client:
Version: 18.06.1-ce
API version: 1.38
Go version: go1.10.3
Git commit: e68fc7a
Built: Tue Aug 21 17:23:03 2018
OS/Arch: linux/amd64
Experimental: false
Server:
Version: 18.06.1-ce
API version: 1.38 (minimum version 1.12)
Go version: go1.10.3
Git commit: e68fc7a
Built: Tue Aug 21 17:23:03 2018
OS/Arch: linux/amd64
Experimental: false
```
### Install nvidia-docker
Submarine in Hadoop-3.2 already supports nvidia-docker V2
```
# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.repo | \
sudo tee /etc/yum.repos.d/nvidia-container-runtime.repo
sudo yum install -y nvidia-docker2-2.0.3-1.docker18.06.1.ce
```
Under `/var/lib/nvidia-docker/volumes/nvidia_driver/`, create folders according to the `nvidia-driver` version:
```
mkdir /var/lib/nvidia-docker/volumes/nvidia_driver/390.87
# where 390.87 is the nvidia driver version number
mkdir /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/bin
mkdir /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/lib64
cp /usr/bin/nvidia* /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/bin
cp /usr/lib64/libcuda* /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/lib64
cp /usr/lib64/libnvidia* /var/lib/nvidia-docker/volumes/nvidia_driver/390.87/lib64
# Test nvidia-smi
nvidia-docker run --rm nvidia/cuda:10.0-devel nvidia-smi
```
Test the docker, nvidia-docker and nvidia-driver installation
```
# Test 1
nvidia-docker run --rm nvidia/cuda nvidia-smi
```
```
# Test 2
nvidia-docker run -it tensorflow/tensorflow:1.9.0-gpu bash
# run inside the docker container
python
import tensorflow as tf
tf.test.is_gpu_available()
```
To uninstall nvidia-docker V2:
```
sudo yum remove -y nvidia-docker2-2.0.3-1.docker18.06.1.ce
```
Reference:
https://github.com/NVIDIA/nvidia-docker
### Tensorflow Image
CUDNN and CUDA do not actually need to be installed on the physical machines, because Submarine provides image files that already contain CUDNN and CUDA. For the base Dockerfile, see WriteDockerfileTF.md.
### Test the TF Environment
After the docker image is created, manually check that TensorFlow can be used properly before scheduling through YARN, to avoid problems later. Run the following commands:
```shell
$ docker run -it ${docker_image_name} /bin/bash
# >>> Inside the container
$ python
$ python >> import tensorflow as tf
$ python >> tf.__version__
```
If problems occur, troubleshoot along the following lines:
1. Whether the environment variables are set correctly
```
echo $LD_LIBRARY_PATH
/usr/local/cuda/extras/CUPTI/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
```
2. Whether libcuda.so.1 and libcuda.so are in the paths specified by LD_LIBRARY_PATH
```
ls -l /usr/local/nvidia/lib64 | grep libcuda.so
```
### Install Etcd
Run the Submarine/install.sh script to install the Etcd component and the service auto-start script on the specified servers.
```shell
$ ./Submarine/install.sh
# Check the Etcd service status with the following command
systemctl status Etcd.service
```
Check the Etcd service status
```shell
$ etcdctl cluster-health
member 3adf2673436aa824 is healthy: got healthy result from http://${etcd_host_ip1}:2379
member 85ffe9aafb7745cc is healthy: got healthy result from http://${etcd_host_ip2}:2379
member b3d05464c356441a is healthy: got healthy result from http://${etcd_host_ip3}:2379
cluster is healthy
$ etcdctl member list
3adf2673436aa824: name=etcdnode3 peerURLs=http://${etcd_host_ip1}:2380 clientURLs=http://${etcd_host_ip1}:2379 isLeader=false
85ffe9aafb7745cc: name=etcdnode2 peerURLs=http://${etcd_host_ip2}:2380 clientURLs=http://${etcd_host_ip2}:2379 isLeader=false
b3d05464c356441a: name=etcdnode1 peerURLs=http://${etcd_host_ip3}:2380 clientURLs=http://${etcd_host_ip3}:2379 isLeader=true
```
where ${etcd_host_ip*} are the IPs of the etcd servers.
### Install Calico
Run the Submarine/install.sh script to install the Calico component and the service auto-start script on the specified servers.
```
systemctl start calico-node.service
systemctl status calico-node.service
```
#### Check the Calico Network
```shell
# Run the following command. Note: it does not show the status of the local server, only that of the other servers.
$ calicoctl node status
Calico process is running.
IPv4 BGP status
+---------------+-------------------+-------+------------+-------------+
| PEER ADDRESS | PEER TYPE | STATE | SINCE | INFO |
+---------------+-------------------+-------+------------+-------------+
| ${host_ip1} | node-to-node mesh | up | 2018-09-21 | Established |
| ${host_ip2} | node-to-node mesh | up | 2018-09-21 | Established |
| ${host_ip3} | node-to-node mesh | up | 2018-09-21 | Established |
+---------------+-------------------+-------+------------+-------------+
IPv6 BGP status
No IPv6 peers found.
```
Create docker containers to validate the calico network
```
docker network create --driver calico --ipam-driver calico-ipam calico-network
docker run --net calico-network --name workload-A -tid busybox
docker run --net calico-network --name workload-B -tid busybox
docker exec workload-A ping workload-B
```
## Install Hadoop
### Build Hadoop
```
mvn package -Pdist -DskipTests -Dtar
```
### Start the YARN services
```
YARN_LOGFILE=resourcemanager.log ./sbin/yarn-daemon.sh start resourcemanager
YARN_LOGFILE=nodemanager.log ./sbin/yarn-daemon.sh start nodemanager
YARN_LOGFILE=timeline.log ./sbin/yarn-daemon.sh start timelineserver
YARN_LOGFILE=mr-historyserver.log ./sbin/mr-jobhistory-daemon.sh start historyserver
```
### Start the registry dns service
```
sudo YARN_LOGFILE=registrydns.log ./yarn-daemon.sh start registrydns
```
### Test wordcount
Run the simplest wordcount example to check whether YARN is correctly installed:
```
./bin/hadoop jar /home/hadoop/hadoop-current/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0-SNAPSHOT.jar wordcount /tmp/wordcount.txt /tmp/wordcount-output4
```
## Tensorflow Jobs with CPU
### Standalone Mode
#### Clean up applications with the same name
```bash
# Run before every submission:
./bin/yarn app -destroy standalone-tf
# and delete the hdfs path
./bin/hdfs dfs -rmr hdfs://${dfs_name_service}/tmp/cifar-10-jobdir
# to make sure the previous job has finished
```
where the variable ${dfs_name_service} should be replaced with the name of the hdfs name service in your environment.
#### Run a standalone tensorflow job
```bash
./bin/yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
--env DOCKER_JAVA_HOME=/opt/java \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-current --name standalone-tf \
--docker_image tf-1.13.1-cpu:0.0.1 \
--input_path hdfs://${dfs_name_service}/tmp/cifar-10-data \
--checkpoint_path hdfs://${dfs_name_service}/user/hadoop/tf-checkpoint \
--worker_resources memory=4G,vcores=2 --verbose \
--worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --num-gpus=0"
```
### Distributed Mode
#### Clean up applications with the same name
```bash
# Run before every submission:
./bin/yarn app -destroy distributed-tf
# and delete the hdfs path
./bin/hdfs dfs -rmr hdfs://${dfs_name_service}/tmp/cifar-10-jobdir
# to make sure the previous job has finished
```
#### Submit a distributed tensorflow job
```bash
./bin/yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
--env DOCKER_JAVA_HOME=/opt/java \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-current --name distributed-tf \
--env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
--docker_image tf-1.13.1-cpu:0.0.1 \
--input_path hdfs://${dfs_name_service}/tmp/cifar-10-data \
--checkpoint_path hdfs://${dfs_name_service}/user/hadoop/tf-distributed-checkpoint \
--worker_resources memory=4G,vcores=2 --verbose \
--num_ps 1 \
--ps_resources memory=4G,vcores=2 \
--ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --num-gpus=0" \
--num_workers 4 \
--worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=0"
```
## Tensorflow Jobs with GPU
### Add GPU support to the Resourcemanager and Nodemanager
Create resource-types.xml in the yarn configuration folder (conf or etc/hadoop) and add:
```
<configuration>
<property>
<name>yarn.resource-types</name>
<value>yarn.io/gpu</value>
</property>
</configuration>
```
### GPU configuration for the Resourcemanager
The scheduler used by the resourcemanager must be the capacity scheduler; modify the property in capacity-scheduler.xml:
```
<configuration>
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
</configuration>
```
### GPU configuration for the Nodemanager
Add the following configuration to the nodemanager's yarn-site.xml:
```
<configuration>
<property>
<name>yarn.nodemanager.resource-plugins</name>
<value>yarn.io/gpu</value>
</property>
<!--Use nvidia docker v2-->
<property>
<name>yarn.nodemanager.resource-plugins.gpu.docker-plugin</name>
<value>nvidia-docker-v2</value>
</property>
</configuration>
```
Add the following configuration to container-executor.cfg:
```
[docker]
...
# Add the following to the existing [docker] section:
# /usr/bin/nvidia-docker is the path of nvidia-docker
# the version number 375.26 in nvidia_driver_375.26 can be checked with nvidia-smi
docker.allowed.volume-drivers=/usr/bin/nvidia-docker
docker.allowed.devices=/dev/nvidiactl,/dev/nvidia-uvm,/dev/nvidia-uvm-tools,/dev/nvidia1,/dev/nvidia0
docker.allowed.ro-mounts=nvidia_driver_375.26
# Use nvidia docker v2
docker.allowed.runtimes=nvidia
[gpu]
module.enabled=true
[cgroups]
# /sys/fs/cgroup is the cgroup mount path
# /hadoop-yarn is the path yarn creates under the cgroup path by default
root=/sys/fs/cgroup
yarn-hierarchy=/hadoop-yarn
```
### Submit a job to verify
Distributed-shell + GPU + cgroup
```bash
./yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
--env DOCKER_JAVA_HOME=/opt/java \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-current --name distributed-tf-gpu \
--env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
--docker_image tf-1.13.1-gpu:0.0.1 \
--input_path hdfs://${dfs_name_service}/tmp/cifar-10-data \
--checkpoint_path hdfs://${dfs_name_service}/user/hadoop/tf-distributed-checkpoint \
--num_ps 0 \
--ps_resources memory=4G,vcores=2,gpu=0 \
--ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --num-gpus=0" \
--worker_resources memory=4G,vcores=2,gpu=1 --verbose \
--num_workers 1 \
--worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1"
```
## Issues
### Issue 1: An operating system reboot causes the nodemanager to fail to start
```
2018-09-20 18:54:39,785 ERROR org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to bootstrap configured resource subsystems!
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: Unexpected: Cannot create yarn cgroup Subsystem:cpu Mount points:/proc/mounts User:yarn Path:/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializePreMountedCGroupController(CGroupsHandlerImpl.java:425)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializeCGroupController(CGroupsHandlerImpl.java:377)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsCpuResourceHandlerImpl.bootstrap(CGroupsCpuResourceHandlerImpl.java:98)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsCpuResourceHandlerImpl.bootstrap(CGroupsCpuResourceHandlerImpl.java:87)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.bootstrap(ResourceHandlerChain.java:58)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:320)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:389)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:929)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:997)
2018-09-20 18:54:39,789 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state INITED
```
Solution: as the `root` user, change the permissions of `/sys/fs/cgroup/cpu,cpuacct` for the `yarn` user
```
chown :yarn -R /sys/fs/cgroup/cpu,cpuacct
chmod g+rwx -R /sys/fs/cgroup/cpu,cpuacct
```
When GPU support is enabled, permissions on the cgroup devices path are also required
```
chown :yarn -R /sys/fs/cgroup/devices
chmod g+rwx -R /sys/fs/cgroup/devices
```
### Issue 2: container-executor permission problem
```
2018-09-21 09:36:26,102 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: IOException executing command:
java.io.IOException: Cannot run program "/etc/yarn/sbin/Linux-amd64-64/container-executor": error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:938)
at org.apache.hadoop.util.Shell.run(Shell.java:901)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
```
The permission of the file `/etc/yarn/sbin/Linux-amd64-64/container-executor` should be 6050.
### Issue 3: Check the system service startup logs
```
journalctl -u docker
```
### Issue 4: Docker cannot remove a container: `device or resource busy`
```bash
$ docker rm 0bfafa146431
Error response from daemon: Unable to remove filesystem for 0bfafa146431771f6024dcb9775ef47f170edb2f1852f71916ba44209ca6120a: remove /app/docker/containers/0bfafa146431771f6024dcb9775ef47f170edb2f152f71916ba44209ca6120a/shm: device or resource busy
```
Write a `find-busy-mnt.sh` script to find the mount files of containers in the `device or resource busy` state
```bash
#!/bin/bash
# A simple script to get information about mount points and pids and their
# mount namespaces.
if [ $# -ne 1 ];then
echo "Usage: $0 <devicemapper-device-id>"
exit 1
fi
ID=$1
MOUNTS=`find /proc/*/mounts | xargs grep $ID 2>/dev/null`
[ -z "$MOUNTS" ] && echo "No pids found" && exit 0
printf "PID\tNAME\t\tMNTNS\n"
echo "$MOUNTS" | while read LINE; do
PID=`echo $LINE | cut -d ":" -f1 | cut -d "/" -f3`
# Ignore self and thread-self
if [ "$PID" == "self" ] || [ "$PID" == "thread-self" ]; then
continue
fi
NAME=`ps -q $PID -o comm=`
MNTNS=`readlink /proc/$PID/ns/mnt`
printf "%s\t%s\t\t%s\n" "$PID" "$NAME" "$MNTNS"
done
```
Find the process occupying the directory
```bash
$ chmod +x find-busy-mnt.sh
./find-busy-mnt.sh 0bfafa146431771f6024dcb9775ef47f170edb2f152f71916ba44209ca6120a
# PID NAME MNTNS
# 5007 ntpd mnt:[4026533598]
$ kill -9 5007
```
### Issue 5: The command sudo nvidia-docker run reports an error
```
docker: Error response from daemon: create nvidia_driver_361.42: VolumeDriver.Create: internal error, check logs for details.
See 'docker run --help'.
```
Solution:
```
# Check whether the nvidia-docker service has a startup problem
$ systemctl status nvidia-docker
$ journalctl -n -u nvidia-docker
# Restart nvidia-docker
systemctl stop nvidia-docker
systemctl start nvidia-docker
```
### Issue 6: YARN fails to launch containers
If the number of containers you create (PS + Worker) is larger than the total number of GPUs, container creation may fail, because more containers than the number of GPUs on the host were created on one server at the same time.

View File

@ -1,322 +0,0 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
# Quick Start Guide
## Prerequisite
Must:
- Apache Hadoop version newer than 2.7.3
Optional:
- Enable YARN DNS. (Only when YARN Service runtime is required)
- Enable GPU on YARN support. (When GPU-based training is required)
- Docker images for Submarine jobs. (When docker container is required)
```
# Get prebuilt docker images (No liability)
docker pull hadoopsubmarine/tf-1.13.1-gpu:0.0.1
# Or build your own docker images
docker build . -f Dockerfile.gpu.tf_1.13.1 -t tf-1.13.1-gpu-base:0.0.1
```
For more details, please refer to:
- [How to write Dockerfile for Submarine TensorFlow jobs](WriteDockerfileTF.html)
- [How to write Dockerfile for Submarine PyTorch jobs](WriteDockerfilePT.html)
## Submarine runtimes
Since Submarine 0.2.0, two runtimes are supported: the YARN native service
runtime and LinkedIn's TonY runtime. Each runtime supports both the Tensorflow
and Pytorch frameworks, and users do not need to worry about the difference in
usage because both runtimes implement the same interface.
To use the TonY runtime, please set the value below in the submarine configuration.
|Configuration Name | Description |
|:---- |:---- |
| `submarine.runtime.class` | org.apache.hadoop.yarn.submarine.runtimes.tony.TonyRuntimeFactory |
For more details of TonY runtime, please check [TonY runtime guide](TonYRuntimeGuide.html)
## Run jobs
### Commandline options
```
usage: job run
-framework <arg> Framework to use.
Valid values are: tensorflow, pytorch.
The default framework is Tensorflow.
-checkpoint_path <arg> Training output directory of the job, could
be local or other FS directory. This
typically includes checkpoint files and
exported model
-docker_image <arg> Docker image name/tag
-env <arg> Common environment variable of worker/ps
-input_path <arg> Input of the job, could be local or other FS
directory
-name <arg> Name of the job
-num_ps <arg> Number of PS tasks of the job, by default
it's 0
-num_workers <arg> Number of worker tasks of the job, by
default it's 1
-ps_docker_image <arg> Specify docker image for PS, when this is
not specified, PS uses --docker_image as
default.
-ps_launch_cmd <arg> Commandline of PS, arguments will be
directly used to launch the PS
-ps_resources <arg> Resource of each PS, for example
memory-mb=2048,vcores=2,yarn.io/gpu=2
-queue <arg> Name of queue to run the job, by default it
uses default queue
-saved_model_path <arg> Model exported path (savedmodel) of the job,
which is needed when exported model is not
placed under ${checkpoint_path}, could be
local or other FS directory. This will be
used to serve.
-tensorboard <arg> Should we run TensorBoard for this job? By
default it's true
-verbose Print verbose log for troubleshooting
-wait_job_finish Specify this when you want to wait until the job
finishes
-worker_docker_image <arg> Specify docker image for WORKER, when this
is not specified, WORKER uses --docker_image
as default.
-worker_launch_cmd <arg> Commandline of worker, arguments will be
directly used to launch the worker
-worker_resources <arg> Resource of each worker, for example
memory-mb=2048,vcores=2,yarn.io/gpu=2
-localization <arg> Specify localization to remote/local
file/directory available to all containers (Docker).
Argument format is "RemoteUri:LocalFilePath[:rw]"
(ro permission is not supported yet).
The RemoteUri can be a file or directory in local
or HDFS or s3 or abfs or http, etc.
The LocalFilePath can be absolute or relative.
If relative, it'll be under container's implied
working directory.
This option can be set multiple times.
Examples are
-localization "hdfs:///user/yarn/mydir2:/opt/data"
-localization "s3a:///a/b/myfile1:./"
-localization "https:///a/b/myfile2:./myfile"
-localization "/user/yarn/mydir3:/opt/mydir3"
-localization "./mydir1:."
```
#### Notes:
When using the `localization` option to make a collection of dependency Python
scripts available to the entry Python script in the container, you may also need to
set the `PYTHONPATH` environment variable as below to avoid module import errors
reported from `entry_script.py`.
```
... job run
# the entry point
--localization entry_script.py:<path>/entry_script.py
# the dependency Python scripts of the entry point
--localization other_scripts_dir:<path>/other_scripts_dir
# the PYTHONPATH env to make dependency available to entry script
--env PYTHONPATH="<path>/other_scripts_dir"
--worker_launch_cmd "python <path>/entry_script.py ..."
```
### Submarine Configuration
For Submarine internal configuration, please create a `submarine.xml` file which should be placed under `$HADOOP_CONF_DIR`.
|Configuration Name | Description |
|:---- |:---- |
| `submarine.runtime.class` | Optional. Full qualified class name for your runtime factory. |
| `submarine.localization.max-allowed-file-size-mb` | Optional. This sets a size limit to the file/directory to be localized in "-localization" CLI option. 2GB by default. |
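As a rough illustration, a minimal `submarine.xml` could be created as follows. This is a sketch only; the property values shown, including the TonY runtime factory class from the table above, are examples rather than required settings.
```
# Sketch: write a minimal submarine.xml under $HADOOP_CONF_DIR.
# The runtime class below is the TonY factory mentioned above; omit that
# property to use the default runtime. 2048 MB is the default size limit.
cat > "$HADOOP_CONF_DIR/submarine.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>submarine.runtime.class</name>
    <value>org.apache.hadoop.yarn.submarine.runtimes.tony.TonyRuntimeFactory</value>
  </property>
  <property>
    <name>submarine.localization.max-allowed-file-size-mb</name>
    <value>2048</value>
  </property>
</configuration>
EOF
```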
### Launch Standalone Tensorflow Application:
#### Commandline
```
yarn jar path-to/hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar job run \
--framework tensorflow \
--env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-current --name tf-job-001 \
--docker_image <your-docker-image> \
--input_path hdfs://default/dataset/cifar-10-data \
--checkpoint_path hdfs://default/tmp/cifar-10-jobdir \
--worker_resources memory=4G,vcores=2,gpu=2 \
--worker_launch_cmd "python ... (Your training application cmd)" \
--tensorboard # this will launch a companion tensorboard container for monitoring
```
#### Notes:
1) `DOCKER_JAVA_HOME` points to JAVA_HOME inside Docker image.
2) `DOCKER_HADOOP_HDFS_HOME` points to HADOOP_HDFS_HOME inside Docker image.
3) `--worker_resources` can include GPU when you need GPU to train your task.
4) When `--tensorboard` is specified, a companion TensorBoard instance is launched to monitor *all your jobs*.
To access it, open the YARN UI (new UI), go to the Services page, open the service you specified (or the `tensorboard-service`), and click the quick link (`Tensorboard`).
This will lead you to TensorBoard.
See below screenshot:
![alt text](./images/tensorboard-service.png "Tensorboard service")
Since v0.2.0, if no Hadoop client is available, we can also use the java command
and the uber jar, hadoop-submarine-all-*.jar, to submit the job.
```
java -cp /path-to/hadoop-conf:/path-to/hadoop-submarine-all-*.jar \
org.apache.hadoop.yarn.submarine.client.cli.Cli job run \
--env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 --name tf-job-001 \
--docker_image <your-docker-image> \
--input_path hdfs://default/dataset/cifar-10-data \
--checkpoint_path hdfs://default/tmp/cifar-10-jobdir \
--worker_resources memory=4G,vcores=2,gpu=2 \
--worker_launch_cmd "python ... (Your training application cmd)" \
--tensorboard # this will launch a companion tensorboard container for monitoring
```
### Launch Distributed Tensorflow Application:
#### Commandline
```
yarn jar hadoop-yarn-applications-submarine-<version>.jar job run \
--name tf-job-001 --docker_image <your-docker-image> \
--framework tensorflow \
--input_path hdfs://default/dataset/cifar-10-data \
--checkpoint_path hdfs://default/tmp/cifar-10-jobdir \
--env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-current \
--num_workers 2 \
--worker_resources memory=8G,vcores=2,gpu=1 --worker_launch_cmd "cmd for worker ..." \
--num_ps 2 \
--ps_resources memory=4G,vcores=2,gpu=0 --ps_launch_cmd "cmd for ps" \
```
Or
```
java -cp /path-to/hadoop-conf:/path-to/hadoop-submarine-all-*.jar \
org.apache.hadoop.yarn.submarine.client.cli.Cli job run \
--name tf-job-001 --docker_image <your docker image> \
--input_path hdfs://default/dataset/cifar-10-data \
--checkpoint_path hdfs://default/tmp/cifar-10-jobdir \
--env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
--num_workers 2 \
--worker_resources memory=8G,vcores=2,gpu=1 --worker_launch_cmd "cmd for worker ..." \
--num_ps 2 \
--ps_resources memory=4G,vcores=2,gpu=0 --ps_launch_cmd "cmd for ps" \
```
#### Notes:
1) Very similar to the standalone TF application, but you need to specify the number of worker / PS processes.
2) Different resources can be specified for worker and PS.
3) The `TF_CONFIG` environment variable will be auto-generated and set before executing the user's launch command; a sketch of its typical shape is shown below.
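For reference, here is a minimal sketch of what the generated `TF_CONFIG` value typically looks like for worker 0 of the distributed job above. The hostnames, ports, and domain are illustrative assumptions, not the exact values Submarine produces.
```
# Illustrative only: the runtime exports a value of this shape into each task's
# environment before running the launch command.
export TF_CONFIG='{
  "cluster": {
    "master": ["master-0.tf-job-001.<user>.<domain>:8000"],
    "worker": ["worker-0.tf-job-001.<user>.<domain>:8000"],
    "ps": ["ps-0.tf-job-001.<user>.<domain>:8000"]
  },
  "task": {"type": "worker", "index": 0}
}'
```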
## Get job history / logs
### Get Job Status from CLI
```
yarn jar hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar job show --name tf-job-001
```
Or
```
java -cp /path-to/hadoop-conf:/path-to/hadoop-submarine-all-*.jar \
org.apache.hadoop.yarn.submarine.client.cli.Cli job show --name tf-job-001
```
Output looks like:
```
Job Meta Info:
Application Id: application_1532131617202_0005
Input Path: hdfs://default/dataset/cifar-10-data
Checkpoint Path: hdfs://default/tmp/cifar-10-jobdir
Run Parameters: --name tf-job-001 --docker_image <your-docker-image>
(... all your commandline before run the job)
```
After that, you can run ```tensorboard --logdir=<checkpoint-path>``` to view the job's TensorBoard.
### Run tensorboard to monitor your jobs
```
# Cleanup previous service if needed
yarn app -destroy tensorboard-service; \
yarn jar /tmp/hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar \
job run --name tensorboard-service --verbose --docker_image <your-docker-image> \
--framework tensorflow \
--env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-current \
--num_workers 0 --tensorboard
```
Or
```
# Cleanup previous service if needed
yarn app -destroy tensorboard-service; \
java -cp /path-to/hadoop-conf:/path-to/hadoop-submarine-all-*.jar \
org.apache.hadoop.yarn.submarine.client.cli.Cli job run \
--name tensorboard-service --verbose --docker_image wtan/tf-1.8.0-cpu:0.0.3 \
--env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
--num_workers 0 --tensorboard
```
You can view multiple job training history from the `Tensorboard` link:
![alt text](./images/multiple-tensorboard-jobs.png "Tensorboard for multiple jobs")
### Get component logs from a training job
There are two ways to get the logs of a training job.
First, from YARN UI (new or old):
![alt text](./images/job-logs-ui.png "Job logs UI")
Alternatively, you can use `yarn logs -applicationId <applicationId>` to get logs from CLI.
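For example, using the application ID from the sample `job show` output above:
```
# Fetch all container logs of the job via the YARN log CLI
yarn logs -applicationId application_1532131617202_0005
```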
## Build from source code
If you want to build the Submarine project by yourself, you should follow these steps:
- Run 'mvn install -DskipTests' from Hadoop source top level once.
- Navigate to hadoop-submarine folder and run 'mvn clean package'.
- By default, hadoop-submarine is built against Hadoop 3.1.2 dependencies,
and both the YARN service runtime and the TonY runtime are built.
You can also add the "-Phadoop-3.2" parameter to build against Hadoop 3.2.0
dependencies instead.
- Hadoop-submarine can support hadoop 2.9.2 and hadoop 2.7.4 as well.
You can add "-Phadoop-2.9" to build submarine based on hadoop 2.9.2.
For example:
```
mvn clean package -Phadoop-2.9
```
Since the YARN service runtime requires Hadoop 3.*, only the TonY runtime is built
in this case. A consolidated build sketch is shown below.
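Putting the steps together, a build session might look like the following sketch (paths and the chosen profile are assumptions; adjust them for your checkout):
```
# From the top level of the Hadoop source tree (one-time step)
mvn install -DskipTests

# Then build Submarine itself
cd hadoop-submarine
mvn clean package                 # Hadoop 3.1.2 dependencies, both runtimes
# or:
# mvn clean package -Phadoop-3.2  # Hadoop 3.2.0 dependencies
# mvn clean package -Phadoop-2.9  # Hadoop 2.9.2 dependencies, TonY runtime only
```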

View File

@ -1,164 +0,0 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Tutorial: Running Distributed Cifar10 Tensorflow Estimator Example.
## Prepare data for training
CIFAR-10 is a common benchmark in machine learning for image recognition. The example below is based on CIFAR-10 dataset.
1) Checkout https://github.com/tensorflow/models/:
```
git clone https://github.com/tensorflow/models/
```
2) Go to `models/tutorials/image/cifar10_estimator`
3) Generate data by using the following command (requires TensorFlow to be installed):
```
python generate_cifar10_tfrecords.py --data-dir=cifar-10-data
```
4) Upload data to HDFS
```
hadoop fs -put cifar-10-data/ /dataset/cifar-10-data
```
**Warning:**
Please note that YARN service does not allow multiple services with the same name, so please run the following command
```
yarn application -destroy <service-name>
```
to delete services if you want to reuse the same service name.
## Prepare Docker images
Refer to [Write Dockerfile](WriteDockerfileTF.html) to build a Docker image or use prebuilt one.
## Run Tensorflow jobs
### Run standalone training
```
yarn jar path/to/hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar \
job run --name tf-job-001 --verbose --docker_image tf-1.13.1-gpu:0.0.1 \
--input_path hdfs://default/dataset/cifar-10-data \
--env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-current \
--num_workers 1 --worker_resources memory=8G,vcores=2,gpu=1 \
--worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --num-gpus=2 --sync" \
--tensorboard --tensorboard_docker_image tf-1.13.1-cpu:0.0.1
```
Explanations:
- When access to HDFS is required, two environment variables must be set to indicate JAVA_HOME and HDFS_HOME, so that the libhdfs libraries *inside the Docker image* can be used. We will try to eliminate the need to specify these in the future.
- The Docker images for the worker and TensorBoard can be specified separately. In this case, TensorBoard does not need a GPU, so we use the CPU Docker image for TensorBoard. (The same applies to the parameter server in the distributed example below.)
### Run distributed training
```
yarn jar path/to/hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar \
job run --name tf-job-001 --verbose --docker_image tf-1.13.1-gpu:0.0.1 \
--input_path hdfs://default/dataset/cifar-10-data \
--env(s) (same as standalone) \
--num_workers 2 \
--worker_resources memory=8G,vcores=2,gpu=1 \
--worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --num-gpus=2 --sync" \
--ps_docker_image tf-1.13.1-cpu:0.0.1 \
--num_ps 1 --ps_resources memory=4G,vcores=2,gpu=0 \
--ps_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --num-gpus=0" \
--tensorboard --tensorboard_docker_image tf-1.13.1-cpu:0.0.1
```
Explanations:
- A `num_workers` value greater than 1 indicates distributed training.
- The parameters / resources / Docker image of the parameter server can be specified separately. In many cases, the parameter server does not require a GPU.
For the meaning of the individual parameters, see the [QuickStart](QuickStart.html) page!
*Outputs of distributed training*
Sample output of master:
```
...
allow_soft_placement: true
, '_tf_random_seed': None, '_task_type': u'master', '_environment': u'cloud', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fe77cb15050>, '_tf_config': gpu_options {
per_process_gpu_memory_fraction: 1.0
}
...
2018-05-06 22:29:14.656022: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> localhost:8000}
2018-05-06 22:29:14.656097: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> ps-0.distributed-tf.root.tensorflow.site:8000}
2018-05-06 22:29:14.656112: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> worker-0.distributed-tf.root.tensorflow.site:8000}
2018-05-06 22:29:14.659359: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:8000
...
INFO:tensorflow:Restoring parameters from hdfs://default/tmp/cifar-10-jobdir/model.ckpt-0
INFO:tensorflow:Evaluation [1/625]
INFO:tensorflow:Evaluation [2/625]
INFO:tensorflow:Evaluation [3/625]
INFO:tensorflow:Evaluation [4/625]
INFO:tensorflow:Evaluation [5/625]
INFO:tensorflow:Evaluation [6/625]
...
INFO:tensorflow:Validation (step 1): loss = 1220.6445, global_step = 1, accuracy = 0.1
INFO:tensorflow:loss = 6.3980675, step = 0
INFO:tensorflow:loss = 6.3980675, learning_rate = 0.1
INFO:tensorflow:global_step/sec: 2.34092
INFO:tensorflow:Average examples/sec: 1931.22 (1931.22), step = 100
INFO:tensorflow:Average examples/sec: 354.236 (38.6479), step = 110
INFO:tensorflow:Average examples/sec: 211.096 (38.7693), step = 120
INFO:tensorflow:Average examples/sec: 156.533 (38.1633), step = 130
INFO:tensorflow:Average examples/sec: 128.6 (38.7372), step = 140
INFO:tensorflow:Average examples/sec: 111.533 (39.0239), step = 150
```
Sample output of worker:
```
, '_tf_random_seed': None, '_task_type': u'worker', '_environment': u'cloud', '_is_chief': False, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fc2a490b050>, '_tf_config': gpu_options {
per_process_gpu_memory_fraction: 1.0
}
...
2018-05-06 22:28:45.807936: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> master-0.distributed-tf.root.tensorflow.site:8000}
2018-05-06 22:28:45.808040: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> ps-0.distributed-tf.root.tensorflow.site:8000}
2018-05-06 22:28:45.808064: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:8000}
2018-05-06 22:28:45.809919: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:8000
...
INFO:tensorflow:loss = 5.319096, step = 0
INFO:tensorflow:loss = 5.319096, learning_rate = 0.1
INFO:tensorflow:Average examples/sec: 49.2338 (49.2338), step = 10
INFO:tensorflow:Average examples/sec: 52.117 (55.3589), step = 20
INFO:tensorflow:Average examples/sec: 53.2754 (55.7541), step = 30
INFO:tensorflow:Average examples/sec: 53.8388 (55.6028), step = 40
INFO:tensorflow:Average examples/sec: 54.1082 (55.2134), step = 50
INFO:tensorflow:Average examples/sec: 54.3141 (55.3676), step = 60
```
Sample output of PS:
```
...
, '_tf_random_seed': None, '_task_type': u'ps', '_environment': u'cloud', '_is_chief': False, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f4be54dff90>, '_tf_config': gpu_options {
per_process_gpu_memory_fraction: 1.0
}
...
2018-05-06 22:28:42.562316: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> master-0.distributed-tf.root.tensorflow.site:8000}
2018-05-06 22:28:42.562408: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:8000}
2018-05-06 22:28:42.562433: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> worker-0.distributed-tf.root.tensorflow.site:8000}
2018-05-06 22:28:42.564242: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:316] Started server with target: grpc://localhost:8000
```

View File

@ -1,62 +0,0 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Tutorial: Running a standalone Cifar10 PyTorch Estimator Example.
Currently, PyTorch integration with Submarine only supports PyTorch in standalone (non-distributed) mode.
Please also note that HDFS as a data source is not yet supported by PyTorch.
## What is CIFAR-10?
CIFAR-10 is a common benchmark in machine learning for image recognition. Below example is based on CIFAR-10 dataset.
**Warning:**
Please note that YARN service doesn't allow multiple services with the same name, so please run the following command
```
yarn application -destroy <service-name>
```
to delete services if you want to reuse the same service name.
## Prepare Docker images
Refer to [Write Dockerfile](WriteDockerfilePT.html) to build a Docker image or use prebuilt one.
## Running PyTorch jobs
### Run standalone training
```
export HADOOP_CLASSPATH="/home/systest/hadoop-submarine-score-yarnservice-runtime-0.2.0-SNAPSHOT.jar:/home/systest/hadoop-submarine-core-0.2.0-SNAPSHOT.jar"
/opt/hadoop/bin/yarn jar /home/systest/hadoop-submarine-core-0.2.0-SNAPSHOT.jar job run \
--name pytorch-job-001 \
--verbose \
--framework pytorch \
--wait_job_finish \
--docker_image pytorch-latest-gpu:0.0.1 \
--input_path hdfs://unused \
--env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.2 \
--env YARN_CONTAINER_RUNTIME_DOCKER_DELAYED_REMOVAL=true \
--num_workers 1 \
--worker_resources memory=5G,vcores=2 \
--worker_launch_cmd "cd /test/ && python cifar10_tutorial.py"
```
For the meaning of the individual parameters, see the [QuickStart](QuickStart.html) page!
**Remarks:**
Please note that the input path parameter is mandatory, but not yet used by the PyTorch docker container.

View File

@ -1,146 +0,0 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
#### Test with a tensorflow job
Distributed-shell + GPU + cgroup
```bash
./yarn jar /home/hadoop/hadoop-current/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
--env DOCKER_JAVA_HOME=/opt/java \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-current --name distributed-tf-gpu \
--env YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=calico-network \
--worker_docker_image tf-1.13.1-gpu:0.0.1 \
--ps_docker_image tf-1.13.1-cpu:0.0.1 \
--input_path hdfs://${dfs_name_service}/tmp/cifar-10-data \
--checkpoint_path hdfs://${dfs_name_service}/user/hadoop/tf-distributed-checkpoint \
--num_ps 0 \
--ps_resources memory=4G,vcores=2,gpu=0 \
--ps_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --num-gpus=0" \
--worker_resources memory=4G,vcores=2,gpu=1 --verbose \
--num_workers 1 \
--worker_launch_cmd "python /test/cifar10_estimator/cifar10_main.py --data-dir=hdfs://${dfs_name_service}/tmp/cifar-10-data --job-dir=hdfs://${dfs_name_service}/tmp/cifar-10-jobdir --train-steps=500 --eval-batch-size=16 --train-batch-size=16 --sync --num-gpus=1"
```
## Issues:
### Issue 1: Fail to start NodeManager after system reboot
```
2018-09-20 18:54:39,785 ERROR org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor: Failed to bootstrap configured resource subsystems!
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerException: Unexpected: Cannot create yarn cgroup Subsystem:cpu Mount points:/proc/mounts User:yarn Path:/sys/fs/cgroup/cpu,cpuacct/hadoop-yarn
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializePreMountedCGroupController(CGroupsHandlerImpl.java:425)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsHandlerImpl.initializeCGroupController(CGroupsHandlerImpl.java:377)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsCpuResourceHandlerImpl.bootstrap(CGroupsCpuResourceHandlerImpl.java:98)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.CGroupsCpuResourceHandlerImpl.bootstrap(CGroupsCpuResourceHandlerImpl.java:87)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.resources.ResourceHandlerChain.bootstrap(ResourceHandlerChain.java:58)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.init(LinuxContainerExecutor.java:320)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:389)
at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:929)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:997)
2018-09-20 18:54:39,789 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state INITED
```
Solution: Grant the user `yarn` access to `/sys/fs/cgroup/cpu,cpuacct`, which is a subfolder of the cgroup mount destination.
```
chown :yarn -R /sys/fs/cgroup/cpu,cpuacct
chmod g+rwx -R /sys/fs/cgroup/cpu,cpuacct
```
If GPUs are used, access to the cgroup devices folder is required as well.
```
chown :yarn -R /sys/fs/cgroup/devices
chmod g+rwx -R /sys/fs/cgroup/devices
```
### Issue 2: container-executor permission denied
```
2018-09-21 09:36:26,102 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: IOException executing command:
java.io.IOException: Cannot run program "/etc/yarn/sbin/Linux-amd64-64/container-executor": error=13, Permission denied
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:938)
at org.apache.hadoop.util.Shell.run(Shell.java:901)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1213)
```
Solution: The permission of `/etc/yarn/sbin/Linux-amd64-64/container-executor` should be 6050
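A sketch of the typical fix (assuming `hadoop` is the group configured for the NodeManager in `container-executor.cfg`; substitute your own group):
```
# container-executor must be owned by root, group-owned by the NodeManager group,
# and setuid/setgid (mode 6050) so YARN can launch containers as other users.
chown root:hadoop /etc/yarn/sbin/Linux-amd64-64/container-executor
chmod 6050 /etc/yarn/sbin/Linux-amd64-64/container-executor
```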
### Issue 3: How to get the Docker service log
Solution: We can get the Docker log with the following command
```
journalctl -u docker
```
### Issue 4: Docker can't remove containers, with errors like `device or resource busy`
```bash
$ docker rm 0bfafa146431
Error response from daemon: Unable to remove filesystem for 0bfafa146431771f6024dcb9775ef47f170edb2f1852f71916ba44209ca6120a: remove /app/docker/containers/0bfafa146431771f6024dcb9775ef47f170edb2f152f71916ba44209ca6120a/shm: device or resource busy
```
Solution: To find which process causes the `device or resource busy` error, we can add a shell script named `find-busy-mnt.sh`
```bash
#!/bin/bash
# A simple script to get information about mount points and pids and their
# mount namespaces.
if [ $# -ne 1 ];then
echo "Usage: $0 <devicemapper-device-id>"
exit 1
fi
ID=$1
MOUNTS=`find /proc/*/mounts | xargs grep $ID 2>/dev/null`
[ -z "$MOUNTS" ] && echo "No pids found" && exit 0
printf "PID\tNAME\t\tMNTNS\n"
echo "$MOUNTS" | while read LINE; do
PID=`echo $LINE | cut -d ":" -f1 | cut -d "/" -f3`
# Ignore self and thread-self
if [ "$PID" == "self" ] || [ "$PID" == "thread-self" ]; then
continue
fi
NAME=`ps -q $PID -o comm=`
MNTNS=`readlink /proc/$PID/ns/mnt`
printf "%s\t%s\t\t%s\n" "$PID" "$NAME" "$MNTNS"
done
```
Kill the process by the PID found by the script
```bash
$ chmod +x find-busy-mnt.sh
./find-busy-mnt.sh 0bfafa146431771f6024dcb9775ef47f170edb2f152f71916ba44209ca6120a
# PID NAME MNTNS
# 5007 ntpd mnt:[4026533598]
$ kill -9 5007
```
### Issue 5: YARN fails to start containers
If the number of GPUs required by an application is greater than the number of GPUs in the cluster, some containers will not be created.
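One way to check how many GPUs the cluster actually offers before submitting is to list node resources (a sketch; the exact output format depends on your Hadoop version):
```
# List NodeManagers with their total and used resources, including yarn.io/gpu
yarn node -list -showDetails
```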

View File

@ -1,309 +0,0 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
# Quick Start Guide
## Prerequisite
Must:
- Apache Hadoop 2.7 or above.
- TonY library 0.3.2 or above. You could download latest TonY jar from
https://github.com/linkedin/TonY/releases.
Optional:
- Enable GPU on YARN support (when GPU-based training is required, Hadoop 3.1 and above).
- Enable Docker support on Hadoop (Hadoop 2.9 and above).
## Run jobs
### Commandline options
```
usage:
-docker_image <arg> Docker image name/tag
-env <arg> Common environment variable of worker/ps
-name <arg> Name of the job
-num_ps <arg> Number of PS tasks of the job, by default
it's 0
-num_workers <arg>         Number of worker tasks of the job, by
default it's 1
-ps_docker_image <arg> Specify docker image for PS, when this is
not specified, PS uses --docker_image as
default.
-ps_launch_cmd <arg>       Commandline of PS, arguments will be
directly used to launch the PS
-ps_resources <arg> Resource of each PS, for example
memory-mb=2048,vcores=2,yarn.io/gpu=2
-queue <arg> Name of queue to run the job, by default it
uses default queue
-saved_model_path <arg> Model exported path (savedmodel) of the job,
which is needed when exported model is not
placed under ${checkpoint_path}, could be
local or other FS directory. This will be
used to serve.
-tensorboard <arg> Should we run TensorBoard for this job? By
default it's true
-verbose Print verbose log for troubleshooting
-wait_job_finish           Specify this option when the user wants to
                           wait for the job to finish
-worker_docker_image <arg> Specify docker image for WORKER, when this
is not specified, WORKER uses --docker_image
as default.
-worker_launch_cmd <arg> Commandline of worker, arguments will be
directly used to launch the worker
-worker_resources <arg> Resource of each worker, for example
memory-mb=2048,vcores=2,yarn.io/gpu=2
-localization <arg> Specify localization to remote/local
file/directory available to all containers (Docker).
Argument format is "RemoteUri:LocalFileName".
The LocalFileName is the local file or folder name;
you should access it with a path relative to the working directory.
This option can be set multiple times.
Examples are
-localization "hdfs:///user/yarn/mydir2:data"
-localization "s3a:///a/b/myfile1:file1"
-localization "https:///a/b/myfile2:myfile"
-localization "/user/yarn/mydir3:mydir3"
-localization "./mydir1:mydir1"
-insecure Whether running in an insecure cluster
-conf Override configurations via commandline
```
> Note: all `--localization` files will be localized to the working directory. You should access them using
relative paths. Alternatively, you could use `--conf tony.containers.resources
=src_file::dest_file_name,src_file2::dest_file_name2`. It accepts a list of resources to be localized to all containers,
delimited by commas. If a resource has no scheme like `hdfs://` or `s3://`, the file is considered a local file. You
can also add the `#archive` annotation: if an entry has `#archive`, the file will be automatically unzipped when localized to the
containers, and the folder name is the same as the file name. For example: `/user/khu/abc.zip#archive` would be inferred as a
local file and will be unarchived in the containers, so you would then see an abc.zip/ folder in your container's working
directory. The `::` annotation was added in TonY 0.3.3. If you use `PATH/TO/abc.txt::def.txt`, the `abc.txt` file
would be localized as `def.txt` in the container working directory.
Details: [tony configurations](https://github.com/linkedin/TonY/wiki/TonY-Configurations)
### Submarine Configuration
For submarine internal configuration, please create a `submarine.xml` which should be placed under `$HADOOP_CONF_DIR`.
Make sure you set `submarine.runtime.class` to `org.apache.hadoop.yarn.submarine.runtimes.tony.TonyRuntimeFactory`
|Configuration Name | Description |
|:---- |:---- |
| `submarine.runtime.class` | org.apache.hadoop.yarn.submarine.runtimes.tony.TonyRuntimeFactory
| `submarine.localization.max-allowed-file-size-mb` | Optional. This sets a size limit to the file/directory to be localized in "-localization" CLI option. 2GB by default. |
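A minimal sketch of such a `submarine.xml` follows; only the runtime class is strictly required to select TonY, and the heredoc form is just one convenient way to create the file.
```
# Sketch: place this under $HADOOP_CONF_DIR to select the TonY runtime.
cat > "$HADOOP_CONF_DIR/submarine.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>submarine.runtime.class</name>
    <value>org.apache.hadoop.yarn.submarine.runtimes.tony.TonyRuntimeFactory</value>
  </property>
</configuration>
EOF
```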
### Launch TensorFlow Application:
#### Commandline
### Without Docker
You need:
* Build a Python virtual environment with TensorFlow 1.13.1 installed
* A cluster with Hadoop 2.7 or above.
### Building a Python virtual environment with TensorFlow
TonY requires a Python virtual environment zip with TensorFlow and any needed Python libraries already installed.
```
wget https://files.pythonhosted.org/packages/33/bc/fa0b5347139cd9564f0d44ebd2b147ac97c36b2403943dbee8a25fd74012/virtualenv-16.0.0.tar.gz
tar xf virtualenv-16.0.0.tar.gz
# Make sure to install using Python 3, as TensorFlow only provides Python 3 artifacts
python virtualenv-16.0.0/virtualenv.py venv
. venv/bin/activate
pip install tensorflow==1.13.1
zip -r venv.zip venv
```
### TensorFlow version
- Version 1.13.1
**Note:** If you require a past version of TensorFlow and TensorBoard, take a look at [this](https://github.com/linkedin/TonY/issues/42) issue.
### Installing Hadoop
TonY only requires YARN, not HDFS. Please see the [open-source documentation](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html) on how to set YARN up.
### Get the training examples
Get mnist_distributed.py from https://github.com/linkedin/TonY/tree/master/tony-examples/mnist-tensorflow
```
CLASSPATH=$(hadoop classpath --glob): \
./hadoop-submarine-core/target/hadoop-submarine-core-0.2.0-SNAPSHOT.jar: \
./hadoop-submarine-yarnservice-runtime/target/hadoop-submarine-score-yarnservice-runtime-0.2.0-SNAPSHOT.jar: \
./hadoop-submarine-tony-runtime/target/hadoop-submarine-tony-runtime-0.2.0-SNAPSHOT.jar: \
/home/pi/hadoop/TonY/tony-cli/build/libs/tony-cli-0.3.11-all.jar \
java org.apache.hadoop.yarn.submarine.client.cli.Cli job run --name tf-job-001 \
--framework tensorflow \
--num_workers 2 \
--worker_resources memory=3G,vcores=2 \
--num_ps 2 \
--ps_resources memory=3G,vcores=2 \
--worker_launch_cmd "venv.zip/venv/bin/python mnist_distributed.py --steps 1000 --data_dir /tmp/data --working_dir /tmp/mode" \
--ps_launch_cmd "venv.zip/venv/bin/python mnist_distributed.py --steps 1000 --data_dir /tmp/data --working_dir /tmp/mode" \
--insecure
--conf tony.containers.resources=PATH_TO_VENV_YOU_CREATED/venv.zip#archive,PATH_TO_MNIST_EXAMPLE/mnist_distributed.py, \
PATH_TO_TONY_CLI_JAR/tony-cli-0.3.11-all.jar
```
You should then be able to see links and status of the jobs from command line:
```
2019-04-22 20:30:42,611 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000003/pi status: RUNNING
2019-04-22 20:30:42,612 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 1 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000004/pi status: RUNNING
2019-04-22 20:30:42,612 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: ps index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000002/pi status: RUNNING
2019-04-22 20:30:42,612 INFO tony.TonyClient: Logs for ps 0 at: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000002/pi
2019-04-22 20:30:42,612 INFO tony.TonyClient: Logs for worker 0 at: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000003/pi
2019-04-22 20:30:42,612 INFO tony.TonyClient: Logs for worker 1 at: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000004/pi
2019-04-22 20:30:44,625 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: ps index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000002/pi status: FINISHED
2019-04-22 20:30:44,625 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000003/pi status: FINISHED
2019-04-22 20:30:44,626 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 1 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000004/pi status: FINISHED
```
### With Docker
```
CLASSPATH=$(hadoop classpath --glob): \
./hadoop-submarine-core/target/hadoop-submarine-core-0.2.0-SNAPSHOT.jar: \
./hadoop-submarine-yarnservice-runtime/target/hadoop-submarine-score-yarnservice-runtime-0.2.0-SNAPSHOT.jar: \
./hadoop-submarine-tony-runtime/target/hadoop-submarine-tony-runtime-0.2.0-SNAPSHOT.jar: \
/home/pi/hadoop/TonY/tony-cli/build/libs/tony-cli-0.3.11-all.jar \
java org.apache.hadoop.yarn.submarine.client.cli.Cli job run --name tf-job-001 \
--framework tensorflow \
--docker_image hadoopsubmarine/tf-1.8.0-cpu:0.0.3 \
--input_path hdfs://pi-aw:9000/dataset/cifar-10-data \
--worker_resources memory=3G,vcores=2 \
--worker_launch_cmd "export CLASSPATH=\$(/hadoop-3.1.0/bin/hadoop classpath --glob) && cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --variable-strategy=CPU --num-gpus=0 --sync" \
--env JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
--env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
--env HADOOP_HOME=/hadoop-3.1.0 \
--env HADOOP_YARN_HOME=/hadoop-3.1.0 \
--env HADOOP_COMMON_HOME=/hadoop-3.1.0 \
--env HADOOP_HDFS_HOME=/hadoop-3.1.0 \
--env HADOOP_CONF_DIR=/hadoop-3.1.0/etc/hadoop \
--conf tony.containers.resources=/home/pi/hadoop/TonY/tony-cli/build/libs/tony-cli-0.3.11-all.jar
```
### Launch PyTorch Application:
#### Commandline
### Without Docker
You need:
* Build a Python virtual environment with PyTorch 0.4.* installed
* A cluster with Hadoop 2.7 or above.
### Building a Python virtual environment with PyTorch
TonY requires a Python virtual environment zip with PyTorch and any needed Python libraries already installed.
```
wget https://files.pythonhosted.org/packages/33/bc/fa0b5347139cd9564f0d44ebd2b147ac97c36b2403943dbee8a25fd74012/virtualenv-16.0.0.tar.gz
tar xf virtualenv-16.0.0.tar.gz
python virtualenv-16.0.0/virtualenv.py venv
. venv/bin/activate
pip install torch==0.4.0
zip -r venv.zip venv
```
### PyTorch version
- Version 0.4.0+
### Installing Hadoop
TonY only requires YARN, not HDFS. Please see the [open-source documentation](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html) on how to set YARN up.
### Get the training examples
Get mnist_distributed.py from https://github.com/linkedin/TonY/tree/master/tony-examples/mnist-pytorch
```
CLASSPATH=$(hadoop classpath --glob): \
./hadoop-submarine-core/target/hadoop-submarine-core-0.2.0-SNAPSHOT.jar: \
./hadoop-submarine-yarnservice-runtime/target/hadoop-submarine-score-yarnservice-runtime-0.2.0-SNAPSHOT.jar: \
./hadoop-submarine-tony-runtime/target/hadoop-submarine-tony-runtime-0.2.0-SNAPSHOT.jar: \
/home/pi/hadoop/TonY/tony-cli/build/libs/tony-cli-0.3.11-all.jar \
java org.apache.hadoop.yarn.submarine.client.cli.Cli job run --name tf-job-001 \
--num_workers 2 \
--worker_resources memory=3G,vcores=2 \
--num_ps 2 \
--ps_resources memory=3G,vcores=2 \
--worker_launch_cmd "venv.zip/venv/bin/python mnist_distributed.py" \
--ps_launch_cmd "venv.zip/venv/bin/python mnist_distributed.py" \
--insecure \
--conf tony.containers.resources=PATH_TO_VENV_YOU_CREATED/venv.zip#archive,PATH_TO_MNIST_EXAMPLE/mnist_distributed.py, \
PATH_TO_TONY_CLI_JAR/tony-cli-0.3.11-all.jar \
--conf tony.application.framework=pytorch
```
You should then be able to see links and status of the jobs from command line:
```
2019-04-22 20:30:42,611 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000003/pi status: RUNNING
2019-04-22 20:30:42,612 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 1 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000004/pi status: RUNNING
2019-04-22 20:30:42,612 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: ps index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000002/pi status: RUNNING
2019-04-22 20:30:42,612 INFO tony.TonyClient: Logs for ps 0 at: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000002/pi
2019-04-22 20:30:42,612 INFO tony.TonyClient: Logs for worker 0 at: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000003/pi
2019-04-22 20:30:42,612 INFO tony.TonyClient: Logs for worker 1 at: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000004/pi
2019-04-22 20:30:44,625 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: ps index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000002/pi status: FINISHED
2019-04-22 20:30:44,625 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000003/pi status: FINISHED
2019-04-22 20:30:44,626 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 1 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000004/pi status: FINISHED
```
### With Docker
```
CLASSPATH=$(hadoop classpath --glob): \
./hadoop-submarine-core/target/hadoop-submarine-core-0.2.0-SNAPSHOT.jar: \
./hadoop-submarine-yarnservice-runtime/target/hadoop-submarine-score-yarnservice-runtime-0.2.0-SNAPSHOT.jar: \
./hadoop-submarine-tony-runtime/target/hadoop-submarine-tony-runtime-0.2.0-SNAPSHOT.jar: \
/home/pi/hadoop/TonY/tony-cli/build/libs/tony-cli-0.3.11-all.jar \
java org.apache.hadoop.yarn.submarine.client.cli.Cli job run --name tf-job-001 \
--docker_image hadoopsubmarine/tf-1.8.0-cpu:0.0.3 \
--input_path hdfs://pi-aw:9000/dataset/cifar-10-data \
--worker_resources memory=3G,vcores=2 \
--worker_launch_cmd "export CLASSPATH=\$(/hadoop-3.1.0/bin/hadoop classpath --glob) && cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --variable-strategy=CPU --num-gpus=0 --sync" \
--env JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
--env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
--env HADOOP_HOME=/hadoop-3.1.0 \
--env HADOOP_YARN_HOME=/hadoop-3.1.0 \
--env HADOOP_COMMON_HOME=/hadoop-3.1.0 \
--env HADOOP_HDFS_HOME=/hadoop-3.1.0 \
--env HADOOP_CONF_DIR=/hadoop-3.1.0/etc/hadoop \
--conf tony.containers.resources=PATH_TO_TONY_CLI_JAR/tony-cli-0.3.11-all.jar \
--conf tony.application.framework=pytorch
```

View File

@ -1,114 +0,0 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Creating Docker Images for Running PyTorch on YARN
## How to create docker images to run PyTorch on YARN
Dockerfile to run PyTorch on YARN needs two parts:
**Base libraries which PyTorch depends on**
1) OS base image, for example ```ubuntu:16.04```
2) PyTorch dependent libraries and packages. For example ```python```, ```scipy```. For GPU support, you also need ```cuda```, ```cudnn```, etc.
3) PyTorch package.
**Libraries to access HDFS**
1) JDK
2) Hadoop
Here's an example of a base image (with GPU support) to install PyTorch:
```
FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04
ARG PYTHON_VERSION=3.6
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
cmake \
git \
curl \
vim \
ca-certificates \
libjpeg-dev \
libpng-dev \
wget &&\
rm -rf /var/lib/apt/lists/*
RUN curl -o ~/miniconda.sh -O https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
chmod +x ~/miniconda.sh && \
~/miniconda.sh -b -p /opt/conda && \
rm ~/miniconda.sh && \
/opt/conda/bin/conda install -y python=$PYTHON_VERSION numpy pyyaml scipy ipython mkl mkl-include cython typing && \
/opt/conda/bin/conda install -y -c pytorch magma-cuda100 && \
/opt/conda/bin/conda clean -ya
ENV PATH /opt/conda/bin:$PATH
RUN pip install ninja
# This must be done before pip so that requirements.txt is available
WORKDIR /opt/pytorch
RUN git clone https://github.com/pytorch/pytorch.git
WORKDIR pytorch
RUN git submodule update --init
RUN TORCH_CUDA_ARCH_LIST="3.5 5.2 6.0 6.1 7.0+PTX" TORCH_NVCC_FLAGS="-Xfatbin -compress-all" \
CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \
pip install -v .
WORKDIR /opt/pytorch
RUN git clone https://github.com/pytorch/vision.git && cd vision && pip install -v .
```
On top of the above image, add files and install packages to access HDFS
```
RUN apt-get update && apt-get install -y openjdk-8-jdk wget
# Install hadoop
ENV HADOOP_VERSION="3.1.2"
RUN wget http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz
RUN tar zxf hadoop-${HADOOP_VERSION}.tar.gz
RUN ln -s hadoop-${HADOOP_VERSION} hadoop-current
RUN rm hadoop-${HADOOP_VERSION}.tar.gz
```
Build and push to your own docker registry: Use ```docker build ... ``` and ```docker push ...``` to finish this step.
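For example, a sketch of that step (the image name, tag, and registry host are assumptions; use your own):
```
# Build the image from the PyTorch Dockerfile, tag it for your registry, and push it
docker build -t pytorch-latest-gpu:0.0.1 -f Dockerfile.gpu.pytorch_latest .
docker tag pytorch-latest-gpu:0.0.1 <your-registry>/pytorch-latest-gpu:0.0.1
docker push <your-registry>/pytorch-latest-gpu:0.0.1
```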
## Use examples to build your own PyTorch docker images
We provide some example Dockerfiles for you to build your own PyTorch Docker images.
For the latest PyTorch:
- *docker/pytorch/base/ubuntu-16.04/Dockerfile.gpu.pytorch_latest*: Latest PyTorch with GPU support, prebuilt for CUDA 10.
- *docker/pytorch/with-cifar10-models/ubuntu-16.04/Dockerfile.gpu.pytorch_latest*: Latest PyTorch with GPU support, prebuilt for CUDA 10, with models.
## Build Docker images
### Manually build Docker image:
Under `docker/pytorch` directory, run `build-all.sh` to build all Docker images. This command will build the following Docker images:
- `pytorch-latest-gpu-base:0.0.1` for base Docker image which includes Hadoop, PyTorch, GPU base libraries.
- `pytorch-latest-gpu:0.0.1` which includes cifar10 model as well
### Use prebuilt images
(No liability)
You can also use prebuilt images for convenience:
- hadoopsubmarine/pytorch-latest-gpu-base:0.0.1

View File

@ -1,123 +0,0 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Creating Docker Images for Running Tensorflow on YARN
## How to create docker images to run Tensorflow on YARN
A Dockerfile to run Tensorflow on YARN needs two parts:
**Base libraries which Tensorflow depends on**
1) OS base image, for example ```ubuntu:16.04```
2) Tensorflow-dependent libraries and packages, for example ```python``` and ```scipy```. For GPU support, you also need ```cuda```, ```cudnn```, etc.
3) Tensorflow package.
**Libraries to access HDFS**
1) JDK
2) Hadoop
Here's an example of a base image (w/o GPU support) to install Tensorflow:
```
FROM ubuntu:16.04
# Pick up some TF dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
curl \
libfreetype6-dev \
libpng12-dev \
libzmq3-dev \
pkg-config \
python \
python-dev \
rsync \
software-properties-common \
unzip \
&& \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
RUN export DEBIAN_FRONTEND=noninteractive && apt-get update && apt-get install -yq krb5-user libpam-krb5 && apt-get clean
RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
python get-pip.py && \
rm get-pip.py
RUN pip --no-cache-dir install \
Pillow \
h5py \
ipykernel \
jupyter \
matplotlib \
numpy \
pandas \
scipy \
sklearn \
&& \
python -m ipykernel.kernelspec
RUN pip --no-cache-dir install \
http://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.13.1-cp27-none-linux_x86_64.whl
```
On top of the above image, add files and install packages to access HDFS
```
RUN apt-get update && apt-get install -y openjdk-8-jdk wget
# Install hadoop
ENV HADOOP_VERSION="3.1.2"
RUN wget http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz
RUN tar zxf hadoop-${HADOOP_VERSION}.tar.gz
RUN ln -s hadoop-${HADOOP_VERSION} hadoop-current
RUN rm hadoop-${HADOOP_VERSION}.tar.gz
```
Build and push to your own docker registry: Use ```docker build ... ``` and ```docker push ...``` to finish this step.
## Use examples to build your own Tensorflow docker images
We provide the following examples for you to build your own Tensorflow Docker images.
For Tensorflow 1.13.1 (Precompiled to CUDA 10.x)
- *docker/tensorflow/base/ubuntu-16.04/Dockerfile.cpu.tf_1.13.1*: Tensorflow 1.13.1, CPU only.
- *docker/tensorflow/with-cifar10-models/ubuntu-16.04/Dockerfile.cpu.tf_1.13.1*: Tensorflow 1.13.1, CPU only, and includes models.
- *docker/tensorflow/base/ubuntu-16.04/Dockerfile.gpu.tf_1.13.1*: Tensorflow 1.13.1 with GPU support, prebuilt for CUDA 10.
- *docker/tensorflow/with-cifar10-models/ubuntu-16.04/Dockerfile.gpu.tf_1.13.1*: Tensorflow 1.13.1 with GPU support, prebuilt for CUDA 10, with models.
## Build Docker images
### Manually build Docker image:
Under the `docker/` directory, run `build-all.sh` to build the Docker images. It will build the following images:
- `tf-1.13.1-gpu-base:0.0.1` for the base Docker image which includes Hadoop, Tensorflow, and the GPU base libraries.
- `tf-1.13.1-cpu-base:0.0.1` for the base Docker image which includes Hadoop and Tensorflow (CPU only).
- `tf-1.13.1-gpu:0.0.1` which includes the cifar10 model.
- `tf-1.13.1-cpu:0.0.1` which includes the cifar10 model (CPU only).
### Use prebuilt images
(No liability)
You can also use prebuilt images for convenience:
- hadoopsubmarine/tf-1.13.1-gpu:0.0.1
- hadoopsubmarine/tf-1.13.1-cpu:0.0.1

Some files were not shown because too many files have changed in this diff Show More