hadoopScript/hadoop/hadoop
LingZhaoHui 7bc58000b4 update 2022-09-28 22:47:19 +08:00

README.md

Apache Hadoop

DockerHub Hadoop

https://hadoop.apache.org/

Big Data Distributed Storage and Compute Software

  • Yarn - Distributed Processing Framework for running MapReduce, Spark and other application frameworks
  • HDFS - Distributed Storage

By default this image starts a pseudo-distributed cluster of 4 daemons in a single container:

  • Yarn
    • ResourceManager - Cluster Processing Master (submit jobs here)
    • NodeManager - Cluster Processing Worker
  • HDFS
    • NameNode - Filesystem Master
    • DataNode - Filesystem Worker
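These daemons serve web UIs on the Hadoop 2 default ports. A small hypothetical helper (not part of the image, assuming each container port is published to the same port on localhost) maps daemon names to their UI URLs:

```shell
#!/bin/sh
# Hypothetical helper: print the web UI URL for each daemon,
# assuming its port is published to the same port on localhost
ui_url() {
  case "$1" in
    resourcemanager) echo "http://localhost:8088"  ;;
    nodemanager)     echo "http://localhost:8042"  ;;
    namenode)        echo "http://localhost:50070" ;;
    datanode)        echo "http://localhost:50075" ;;
    *) echo "unknown daemon: $1" >&2; return 1 ;;
  esac
}

ui_url namenode   # http://localhost:50070
```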

Perfect for development and testing. It is recommended to allocate 4GB+ RAM to Docker for this pseudo-cluster container.

For real scaling, run a single daemon per container to get a fully distributed setup.
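A minimal sketch of such a distributed setup, assuming docker-compose and that the daemons can be started directly via the stock hdfs/yarn CLIs instead of the image's all-in-one entrypoint.sh (the service names and command overrides here are illustrative assumptions, not the image's documented interface):

```yaml
version: '2'
services:
  namenode:
    image: harisekhon/hadoop
    command: hdfs namenode        # one daemon per container
  datanode:
    image: harisekhon/hadoop
    command: hdfs datanode
  resourcemanager:
    image: harisekhon/hadoop
    command: yarn resourcemanager
  nodemanager:
    image: harisekhon/hadoop
    command: yarn nodemanager
```

A real deployment would also need shared Hadoop configuration (fs.defaultFS pointing at the namenode service) and persistent volumes for HDFS data.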

To run the all-in-one-container cluster and expose the web UIs for the NodeManager (8042), ResourceManager (8088), JobHistory Server (19888), NameNode (50070) and DataNode (50075), do:

docker run -ti -p 8042 -p 8088 -p 19888 -p 50070 -p 50075 harisekhon/hadoop
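With single -p flags as above, Docker publishes each container port to a random host port. As a hypothetical convenience (not part of the repo), the mappings can be spelled out explicitly so each UI lands on the identical host port, assuming those ports are free:

```shell
#!/bin/sh
# UI ports exposed by the image: NodeManager, ResourceManager,
# JobHistory Server, NameNode, DataNode respectively
PORTS="8042 8088 19888 50070 50075"

# Build explicit host:container mappings, e.g. "-p 8042:8042"
PORT_ARGS=""
for p in $PORTS; do
  PORT_ARGS="$PORT_ARGS -p $p:$p"
done

# Prints the full command; drop the echo to actually run it
echo "docker run -ti$PORT_ARGS harisekhon/hadoop"
```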

or with docker-compose:

docker-compose up

or, without docker-compose, using the Makefile shortcut for the docker run command:

make run

Related Docker images can be found for many Open Source, Big Data and NoSQL technologies on my DockerHub profile. The source for them all can be found in the master Dockerfiles GitHub repo.