
Commit ce6c22b

Added more details to the Big Data chapter
1 parent 623954d commit ce6c22b

File tree: 3 files changed, +53 −16 lines changed


developer-tools/java/chapters/ch11-bigdata.adoc

Lines changed: 53 additions & 16 deletions
@@ -1,10 +1,19 @@
 :imagesdir: images
 
-= Example: Big Data Processing with Docker and Hadoop
+= Big Data Processing with Docker and Hadoop
 
 *PURPOSE*: This chapter explains how to use Docker to create a Hadoop cluster and a Big Data application in Java. It highlights several concepts like service scale, dynamic port allocation, container links, integration tests, debugging, etc.
 
-== Download images and application
+Big Data applications usually involve distributed processing with tools like Hadoop or Spark. These services can be scaled up, running with several nodes to support more parallelism. Running tools like Hadoop and Spark on Docker makes it easy to scale them up and down. This is very useful for simulating a cluster at development time and for running integration tests before taking your application to production.
+
+The application in this example reads a file, counts how many words are in it using a MapReduce job implemented on Hadoop, and then saves the result in MongoDB. In order to do that, we will run a Hadoop cluster and a MongoDB server on Docker.
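As a mental model of what the MapReduce job computes, here is a minimal plain-Java sketch of the word-count logic. This is illustrative only: the actual application implements it as a Hadoop Mapper/Reducer pair running on the cluster and writes the result to MongoDB.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of the word-count logic the Hadoop job performs.
// The "map" phase tokenizes the input and emits (word, 1) pairs; the
// "reduce" phase sums the counts per word. Here both phases are folded
// into a single local loop.
class WordCountSketch {

    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Map phase: split the input into words.
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) continue;
            // Reduce phase: sum the 1s emitted for each word.
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("to be or not to be");
        System.out.println(counts.get("to")); // prints 2
        System.out.println(counts.get("be")); // prints 2
        System.out.println(counts.get("or")); // prints 1
    }
}
```

On a real cluster, the map phase runs in parallel on each block of the input file, which is what makes scaling the nodemanager service (shown later in this chapter) speed the job up.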
+
+[NOTE]
+====
+Apache Hadoop is an open-source software framework used for distributed storage and processing of big data sets using the MapReduce programming model. The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part, which is the MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code to the nodes to process the data in parallel. The Hadoop framework itself is mostly written in Java.
+====
+
+== Clone the sample application
 
 Clone the project at `https://github.com/fabianenardon/hadoop-docker-demo`
 

@@ -18,7 +27,7 @@ cd sample
 mvn clean install -Papp-docker-image
 ----
 
-In the command above, `-Papp-docker-image` will fire up the `app-docker-image` profile, defined in the application pom.xml. This profile will create a dockerized version of the application, creating two images:
+In the command above, `-Papp-docker-image` will fire up the `app-docker-image` profile, defined in the application `pom.xml`. This profile will create a dockerized version of the application, creating two images:
 
 . `docker-hadoop-example`: docker image used to run the application
 . `docker-hadoop-example-tests`: docker image used to run integration tests
@@ -29,6 +38,7 @@ Go to the `sample/docker` folder and start the services:
 
 [source, text]
 ----
+cd docker
 docker-compose up -d
 ----
 

@@ -39,35 +49,48 @@ See the logs and wait until everything is up:
 docker-compose logs -f
 ----
 
-Open `http://localhost:8088/cluster` to see your if your cluster is running. You should see 1 active node when everything is up.
+In order to see if everything is up, open `http://localhost:8088/cluster`. You should see 1 active node when everything is up.
+
+image::docker-bigdata-03.png[]
 
 == Running the application
 
-This application reads a text file from hdfs and counts how many words it has. The result is saved on MongoDB.
+This application reads a text file from HDFS and counts how many words it has. The result is saved in MongoDB.
 
-First, create a folder on hdfs. We will save the file to be processed on it:
+First, create a folder on HDFS. We will save the file to be processed in it:
 
 [source, text]
 ----
 docker-compose exec yarn hdfs dfs -mkdir /files/
 ----
 
-Put the file we are going to process on hdfs:
+In the command above, we are executing `hdfs dfs -mkdir /files/` on the `yarn` service. This command creates a new folder called `/files/` on HDFS, the distributed file system used by Hadoop.
+
+Put the file we are going to process on HDFS:
 
 [source, text]
 ----
-docker-compose run docker-hadoop-example hdfs dfs -put /maven/test-data/text_for_word_count.txt /files/
+docker-compose run docker-hadoop-example \
+    hdfs dfs -put /maven/test-data/text_for_word_count.txt /files/
 ----
 
-The `text_for_word_count.txt` was added to the application image by maven when we built it, so we can use it to test.
+The `text_for_word_count.txt` file was added to the application image by Maven when we built it, so we can use it for testing. The command above transfers the `text_for_word_count.txt` file from the local disk to the `/files/` folder on HDFS, so the Hadoop process can access it.
 
 Run our application:
 
 [source, text]
 ----
-docker-compose run docker-hadoop-example hadoop jar /maven/jar/docker-hadoop-example-1.0-SNAPSHOT-mr.jar hdfs://namenode:9000 /files mongo yarn:8050
+docker-compose run docker-hadoop-example \
+    hadoop jar /maven/jar/docker-hadoop-example-1.0-SNAPSHOT-mr.jar \
+    hdfs://namenode:9000 /files mongo yarn:8050
 ----
 
+The command above runs our jar file on the Hadoop cluster. The `hdfs://namenode:9000` parameter is the HDFS address. The `/files` parameter is where the file to process can be found on HDFS. The `mongo` parameter is the MongoDB host address. The `yarn:8050` parameter is the Hadoop YARN address, where the MapReduce job will be deployed. Note that since we are running the Hadoop components (namenode, yarn), MongoDB, and our application as Docker services, they can all find each other, and we can use the service names as host addresses.
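The way a driver typically consumes four positional arguments like these can be sketched in plain Java. This is a hypothetical illustration: the class and field names below are not taken from the project, they only make the role of each argument explicit.

```java
// Hypothetical sketch of parsing the four positional arguments passed
// to the jar. Names are illustrative, not the project's actual code.
class DriverArgs {
    final String hdfsUrl;     // e.g. hdfs://namenode:9000 - HDFS address
    final String inputPath;   // e.g. /files - input folder on HDFS
    final String mongoHost;   // e.g. mongo - MongoDB host (Docker service name)
    final String yarnAddress; // e.g. yarn:8050 - YARN address for job submission

    DriverArgs(String[] args) {
        if (args.length != 4) {
            throw new IllegalArgumentException(
                "usage: <hdfs-url> <input-path> <mongo-host> <yarn-address>");
        }
        this.hdfsUrl = args[0];
        this.inputPath = args[1];
        this.mongoHost = args[2];
        this.yarnAddress = args[3];
    }

    public static void main(String[] args) {
        DriverArgs a = new DriverArgs(new String[] {
            "hdfs://namenode:9000", "/files", "mongo", "yarn:8050"});
        System.out.println(a.mongoHost); // prints mongo
    }
}
```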
+
+If you go to `http://localhost:8088/cluster`, you can see your job running. When the job finishes, you should see this:
+
+image::docker-bigdata-04.png[]
+
 If everything ran successfully, you should be able to see the results in MongoDB.
 
 Connect to the Mongo container:
@@ -105,7 +128,7 @@ docker-compose scale nodemanager=2
 
 This means that you want to have 2 nodes in your Hadoop cluster. Go to `http://localhost:8088/cluster` and refresh until you see 2 active nodes.
 
-The trick to scale the nodes is to use dynamically allocated ports and let docker assign a different port to each new nodemanager. See this approach in this snippet of the docker-compose.yml file:
+The trick to scaling the nodes is to use dynamically allocated ports and let Docker assign a different port to each new nodemanager. See this approach in this snippet of the `docker-compose.yml` file:
 
 [source, text]
 ----
@@ -130,7 +153,7 @@ Stop all the services
 docker-compose down
 ----
 
-Note that since our docker-compose.yml file defines volume mappings for hdfs and mongoDB, next time you start the services again, your data will still be there.
+Note that since our `docker-compose.yml` file defines volume mappings for HDFS and MongoDB, the next time you start the services, your data will still be there.
 
 
 == Debugging your code
@@ -198,22 +221,29 @@ Put the test file on hdfs:
 
 [source, text]
 ----
-docker-compose --file src/test/resources/docker-compose.yml run docker-hadoop-example hdfs dfs -put /maven/test-data/text_for_word_count.txt /files/
+docker-compose --file src/test/resources/docker-compose.yml \
+    run docker-hadoop-example \
+    hdfs dfs -put /maven/test-data/text_for_word_count.txt /files/
 ----
 
 
 Run the application:
 
 [source, text]
 ----
-docker-compose --file src/test/resources/docker-compose.yml run docker-hadoop-example hadoop jar /maven/jar/docker-hadoop-example-1.0-SNAPSHOT-mr.jar hdfs://namenode:9000 /files mongo yarn:8050
+docker-compose --file src/test/resources/docker-compose.yml \
+    run docker-hadoop-example \
+    hadoop jar /maven/jar/docker-hadoop-example-1.0-SNAPSHOT-mr.jar \
+    hdfs://namenode:9000 /files mongo yarn:8050
 ----
 
 Run our integration tests:
 
 [source, text]
 ----
-docker-compose --file src/test/resources/docker-compose.yml run docker-hadoop-example-tests mvn -f /maven/code/pom.xml -Dmaven.repo.local=/m2/repository -Pintegration-test verify
+docker-compose --file src/test/resources/docker-compose.yml \
+    run docker-hadoop-example-tests mvn -f /maven/code/pom.xml \
+    -Dmaven.repo.local=/m2/repository -Pintegration-test verify
 ----
 
 Stop all the services:
@@ -227,7 +257,14 @@ If you want to remote debug tests, run the tests this way instead:
 
 [source, text]
 ----
-docker run -v ~/.m2:/m2 -p 5005:5005 --link mongo:mongo --net resources_default docker-hadoop-example-tests mvn -f /maven/code/pom.xml -Dmaven.repo.local=/m2/repository -Pintegration-test verify -Dmaven.failsafe.debug
+docker run -v ~/.m2:/m2 -p 5005:5005 \
+    --link mongo:mongo \
+    --net resources_default \
+    docker-hadoop-example-tests \
+    mvn -f /maven/code/pom.xml \
+    -Dmaven.repo.local=/m2/repository \
+    -Pintegration-test verify \
+    -Dmaven.failsafe.debug
 ----
 
 Running with this configuration, the application will wait until an IDE connects for remote debugging on port 5005.
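Under the hood, `-Dmaven.failsafe.debug` tells the Maven Failsafe plugin to start the forked test JVM with a JDWP debug agent. Per the Failsafe documentation, the effect is roughly that of launching the JVM with an option like the following (the exact flags depend on the plugin version; shown here only to clarify why the JVM listens and suspends):

```text
-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005
```

Because `suspend=y` is set, the forked JVM blocks until a debugger attaches on port 5005, which is why the `-p 5005:5005` port mapping in the `docker run` command above is needed.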