SUBMARINE-64. Improve TonY runtime's document. Contributed by Keqiu Hu.
parent 382a962d8d
commit 24f218aef8
@@ -19,6 +19,8 @@
Must:

- Apache Hadoop 2.7 or above.
- TonY library 0.3.2 or above. You can download the latest TonY jar from
  https://github.com/linkedin/TonY/releases.

Optional:
@@ -149,9 +151,106 @@ java org.apache.hadoop.yarn.submarine.client.cli.Cli job run --name tf-job-001 \
--worker_resources memory=3G,vcores=2 \
--num_ps 2 \
--ps_resources memory=3G,vcores=2 \
- --worker_launch_cmd "venv.zip/venv/bin/python --steps 1000 --data_dir /tmp/data --working_dir /tmp/mode" \
- --ps_launch_cmd "venv.zip/venv/bin/python --steps 1000 --data_dir /tmp/data --working_dir /tmp/mode" \
- --container_resources /home/pi/hadoop/TonY/tony-cli/build/libs/tony-cli-0.3.2-all.jar
--worker_launch_cmd "venv.zip/venv/bin/python mnist_distributed.py --steps 1000 --data_dir /tmp/data --working_dir /tmp/mode" \
--ps_launch_cmd "venv.zip/venv/bin/python mnist_distributed.py --steps 1000 --data_dir /tmp/data --working_dir /tmp/mode" \
--insecure \
--conf tony.containers.resources=PATH_TO_VENV_YOU_CREATED/venv.zip#archive,PATH_TO_MNIST_EXAMPLE/mnist_distributed.py,\
PATH_TO_TONY_CLI_JAR/tony-cli-0.3.2-all.jar

```
You should then see links to the task logs and the status of each task on the command line:

```
2019-04-22 20:30:42,611 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000003/pi status: RUNNING
2019-04-22 20:30:42,612 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 1 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000004/pi status: RUNNING
2019-04-22 20:30:42,612 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: ps index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000002/pi status: RUNNING
2019-04-22 20:30:42,612 INFO tony.TonyClient: Logs for ps 0 at: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000002/pi
2019-04-22 20:30:42,612 INFO tony.TonyClient: Logs for worker 0 at: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000003/pi
2019-04-22 20:30:42,612 INFO tony.TonyClient: Logs for worker 1 at: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000004/pi
2019-04-22 20:30:44,625 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: ps index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000002/pi status: FINISHED
2019-04-22 20:30:44,625 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 0 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000003/pi status: FINISHED
2019-04-22 20:30:44,626 INFO tony.TonyClient: Tasks Status Updated: [TaskInfo] name: worker index: 1 url: http://pi-aw:8042/node/containerlogs/container_1555916523933_0030_01_000004/pi status: FINISHED

```
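
Each `[TaskInfo]` line in the sample output carries a task's name, index and status, so the final state of a job can be summarised with a small filter. The sketch below is one possible way to do that, assuming the client output keeps the layout shown in the sample; `task_status` and the file name `submarine-client.log` are illustrative names, not part of TonY or Submarine.

```shell
# Print the most recent status of each task found in TonY client output on stdin.
# Assumes lines of the form:
#   ... [TaskInfo] name: <name> index: <n> url: <url> status: <STATUS>
task_status() {
  awk '
    /\[TaskInfo\]/ {
      name = ""; idx = ""; status = ""
      # Each value is the field that follows its label.
      for (i = 1; i < NF; i++) {
        if ($i == "name:")   name   = $(i + 1)
        if ($i == "index:")  idx    = $(i + 1)
        if ($i == "status:") status = $(i + 1)
      }
      if (name != "" && idx != "" && status != "") {
        key = name ":" idx
        if (!(key in latest)) order[++n] = key   # remember first-seen order
        latest[key] = status                     # later lines overwrite earlier ones
      }
    }
    END { for (j = 1; j <= n; j++) print order[j], latest[order[j]] }
  '
}

# Usage, assuming the client output was saved to a file (hypothetical name):
#   task_status < submarine-client.log
```

Lines that do not contain `[TaskInfo]`, such as the `Logs for ...` lines, are ignored, and only the last status seen for each task is reported.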

### With Docker

```
CLASSPATH=$(hadoop classpath --glob):\
./hadoop-submarine-core/target/hadoop-submarine-core-0.2.0-SNAPSHOT.jar:\
./hadoop-submarine-yarnservice-runtime/target/hadoop-submarine-yarnservice-runtime-0.2.0-SNAPSHOT.jar:\
./hadoop-submarine-tony-runtime/target/hadoop-submarine-tony-runtime-0.2.0-SNAPSHOT.jar:\
/home/pi/hadoop/TonY/tony-cli/build/libs/tony-cli-0.3.2-all.jar \
java org.apache.hadoop.yarn.submarine.client.cli.Cli job run --name tf-job-001 \
--docker_image hadoopsubmarine/tf-1.8.0-cpu:0.0.3 \
--input_path hdfs://pi-aw:9000/dataset/cifar-10-data \
--worker_resources memory=3G,vcores=2 \
--worker_launch_cmd "export CLASSPATH=\$(/hadoop-3.1.0/bin/hadoop classpath --glob) && cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --variable-strategy=CPU --num-gpus=0 --sync" \
--env JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
--env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
--env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
--env HADOOP_HOME=/hadoop-3.1.0 \
--env HADOOP_YARN_HOME=/hadoop-3.1.0 \
--env HADOOP_COMMON_HOME=/hadoop-3.1.0 \
--env HADOOP_HDFS_HOME=/hadoop-3.1.0 \
--env HADOOP_CONF_DIR=/hadoop-3.1.0/etc/hadoop \
--conf tony.containers.resources=/home/pi/hadoop/TonY/tony-cli/build/libs/tony-cli-0.3.2-all.jar
```

### Launch PyTorch Application:

#### Commandline

### Without Docker

You need:
* A Python virtual environment with PyTorch 0.4.* installed
* A cluster with Hadoop 2.7 or above.

### Building a Python virtual environment with PyTorch

TonY requires a Python virtual environment zip with PyTorch and any needed Python libraries already installed.

```
wget https://files.pythonhosted.org/packages/33/bc/fa0b5347139cd9564f0d44ebd2b147ac97c36b2403943dbee8a25fd74012/virtualenv-16.0.0.tar.gz
tar xf virtualenv-16.0.0.tar.gz

python virtualenv-16.0.0/virtualenv.py venv
. venv/bin/activate
pip install torch==0.4.0
zip -r venv.zip venv
```
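
The launch commands in this guide invoke `venv.zip/venv/bin/python`, so the archive needs a top-level `venv/` directory that contains `bin/python`. The following is a minimal sketch of a layout check, assuming Info-ZIP `unzip` is installed; `has_venv_python` is a hypothetical helper name introduced here for illustration.

```shell
# Read a zip listing (one path per line) on stdin and succeed only if it
# contains the interpreter path that the launch commands expect.
has_venv_python() {
  grep -qxF 'venv/bin/python'
}

# Usage (unzip -Z1 prints one archive member per line):
#   unzip -Z1 venv.zip | has_venv_python && echo "venv.zip layout looks right"
```

If the check fails, the zip was probably created from inside the `venv` directory rather than from its parent.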

### PyTorch version

- Version 0.4.0+

### Installing Hadoop

TonY only requires YARN, not HDFS. Please see the [open-source documentation](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html) on how to set YARN up.

### Get the training examples

Get mnist_distributed.py from https://github.com/linkedin/TonY/tree/master/tony-examples/mnist-pytorch
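
The job commands in this guide hand the virtual environment archive, the training script and the TonY jar to the containers as one comma-separated `tony.containers.resources` value. As a rough sketch, that value can be assembled from variables so that no stray space ends up inside the list; `join_resources` and the variable names are hypothetical, and the paths are the placeholders used in this guide.

```shell
# Join the arguments with commas; "$*" uses the first character of IFS as separator.
# The function body runs in a subshell, so the caller's IFS is left untouched.
join_resources() (
  IFS=,
  printf '%s\n' "$*"
)

# Placeholder paths copied from this guide; replace them with real locations.
VENV_ZIP="PATH_TO_VENV_YOU_CREATED/venv.zip"
TRAIN_SCRIPT="PATH_TO_MNIST_EXAMPLE/mnist_distributed.py"
TONY_JAR="PATH_TO_TONY_CLI_JAR/tony-cli-0.3.2-all.jar"

# The "#archive" suffix on the zip is kept exactly as it appears in the guide.
RESOURCES=$(join_resources "${VENV_ZIP}#archive" "$TRAIN_SCRIPT" "$TONY_JAR")
echo "--conf tony.containers.resources=${RESOURCES}"
```

Quoting each argument keeps the `#` in `venv.zip#archive` from ever being read as the start of a shell comment.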

```
CLASSPATH=$(hadoop classpath --glob):\
./hadoop-submarine-core/target/hadoop-submarine-core-0.2.0-SNAPSHOT.jar:\
./hadoop-submarine-yarnservice-runtime/target/hadoop-submarine-yarnservice-runtime-0.2.0-SNAPSHOT.jar:\
./hadoop-submarine-tony-runtime/target/hadoop-submarine-tony-runtime-0.2.0-SNAPSHOT.jar:\
/home/pi/hadoop/TonY/tony-cli/build/libs/tony-cli-0.3.2-all.jar \
java org.apache.hadoop.yarn.submarine.client.cli.Cli job run --name tf-job-001 \
--num_workers 2 \
--worker_resources memory=3G,vcores=2 \
--num_ps 2 \
--ps_resources memory=3G,vcores=2 \
--worker_launch_cmd "venv.zip/venv/bin/python mnist_distributed.py" \
--ps_launch_cmd "venv.zip/venv/bin/python mnist_distributed.py" \
--insecure \
--conf tony.containers.resources=PATH_TO_VENV_YOU_CREATED/venv.zip#archive,PATH_TO_MNIST_EXAMPLE/mnist_distributed.py,\
PATH_TO_TONY_CLI_JAR/tony-cli-0.3.2-all.jar