// developer-tools/java/chapters/ch11-bigdata.adoc

:imagesdir: images
= Big Data Processing with Docker and Hadoop
*PURPOSE*: This chapter explains how to use Docker to create a Hadoop cluster and a Big Data application in Java. It highlights several concepts like service scale, dynamic port allocation, container links, integration tests, debugging, etc.
Big Data applications usually involve distributed processing with tools like Hadoop or Spark. These services can be scaled up, running with several nodes to support more parallelism. Running tools like Hadoop and Spark on Docker makes it easy to scale them up and down, which is very useful for simulating a cluster at development time and for running integration tests before taking your application to production.

The application in this example reads a file, counts how many words are in that file using a MapReduce job implemented on Hadoop, and then saves the result in MongoDB. In order to do that, we will run a Hadoop cluster and a MongoDB server on Docker.

[NOTE]
====
Apache Hadoop is an open-source software framework used for distributed storage and processing of big data sets using the MapReduce programming model. The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part which is a MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code into nodes to process the data in parallel. The Hadoop framework itself is mostly written in Java.
====
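The word-count MapReduce at the heart of this chapter can be sketched in plain Java, with no Hadoop dependency, to illustrate the model: a map phase emits `(word, 1)` pairs and a reduce phase sums the counts per word. This is only an illustration of the programming model, not the sample project's actual job.

[source, java]
----
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WordCountSketch {

    // "Map" phase: split each line into words and emit (word, 1) pairs.
    static Stream<Entry<String, Integer>> map(String line) {
        return Stream.of(line.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty())
                .map(w -> new SimpleEntry<>(w, 1));
    }

    // "Reduce" phase: group the pairs by word and sum the counts.
    static Map<String, Integer> reduce(Stream<Entry<String, Integer>> pairs) {
        return pairs.collect(Collectors.groupingBy(Entry::getKey,
                Collectors.summingInt(Entry::getValue)));
    }

    public static Map<String, Integer> wordCount(List<String> lines) {
        return reduce(lines.stream().flatMap(WordCountSketch::map));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
                wordCount(List.of("to be or not to be", "be happy"));
        System.out.println(counts.get("be")); // 3
        System.out.println(counts.get("to")); // 2
    }
}
----

In the real Hadoop job, the map and reduce phases run on different cluster nodes and the framework handles the grouping (the "shuffle") between them.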
== Clone the sample application
Clone the project at `https://github.com/fabianenardon/hadoop-docker-demo`

Then build it:

[source, text]
----
cd sample
mvn clean install -Papp-docker-image
----
In the command above, `-Papp-docker-image` will fire up the `app-docker-image` profile, defined in the application `pom.xml`. This profile will create a dockerized version of the application, creating two images:

. `docker-hadoop-example`: Docker image used to run the application
. `docker-hadoop-example-tests`: Docker image used to run integration tests

Go to the `sample/docker` folder and start the services:

[source, text]
----
cd docker
docker-compose up -d
----

See the logs and wait until everything is up:

[source, text]
----
docker-compose logs -f
----

To check that everything is up, open `http://localhost:8088/cluster`. You should see 1 active node.
image::docker-bigdata-03.png[]
== Running the application
This application reads a text file from HDFS and counts how many words it has. The result is saved on MongoDB.

First, create a folder on HDFS where we will save the file to be processed:
[source, text]
----
docker-compose exec yarn hdfs dfs -mkdir /files/
----
In the command above, we are executing `hdfs dfs -mkdir /files/` on the service `yarn`. This command creates a new folder called `/files/` on HDFS, the distributed file system used by Hadoop.
Put the file we are going to process on HDFS:
[source, text]
----
docker-compose run docker-hadoop-example hdfs dfs -put /maven/test-data/text_for_word_count.txt /files/
----

The `text_for_word_count.txt` file was added to the application image by Maven when we built it, so we can use it to test. The command above transfers the `text_for_word_count.txt` file from the local disk to the `/files/` folder on HDFS, so the Hadoop job can access it.

Run our application:

[source, text]
----
docker-compose run docker-hadoop-example \
  hadoop jar /maven/jar/docker-hadoop-example-1.0-SNAPSHOT-mr.jar \
  hdfs://namenode:9000 /files mongo yarn:8050
----

The command above will run our jar file on the Hadoop cluster. The `hdfs://namenode:9000` parameter is the HDFS address. The `/files` parameter is where the file to process can be found on HDFS. The `mongo` parameter is the MongoDB host address. The `yarn:8050` parameter is the Hadoop YARN address, where the MapReduce job will be deployed. Note that since we are running the Hadoop components (namenode, yarn), MongoDB, and our application as Docker services, they can all find each other, and we can use the service names as host addresses.
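The chapter does not show the driver's argument handling, but a driver that receives these four parameters might collect them as below. This is a hypothetical sketch: the class and field names are illustrative, not the sample project's actual code.

[source, java]
----
// Hypothetical sketch of how a driver could interpret the four arguments
// passed on the command line above; NOT the sample project's actual code.
public class JobArgs {
    final String hdfsUri;     // e.g. hdfs://namenode:9000
    final String inputPath;   // e.g. /files
    final String mongoHost;   // e.g. mongo
    final String yarnAddress; // e.g. yarn:8050

    JobArgs(String[] args) {
        if (args.length != 4) {
            throw new IllegalArgumentException(
                "usage: <hdfsUri> <inputPath> <mongoHost> <yarnAddress>");
        }
        hdfsUri = args[0];
        inputPath = args[1];
        mongoHost = args[2];
        yarnAddress = args[3];
    }

    public static void main(String[] argv) {
        JobArgs a = new JobArgs(new String[] {
            "hdfs://namenode:9000", "/files", "mongo", "yarn:8050"});
        System.out.println(a.yarnAddress); // yarn:8050
    }
}
----

A real driver would then pass these values to the Hadoop `Job` configuration and to the MongoDB client.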
If you go to `http://localhost:8088/cluster`, you can see your job running. When the job finishes, you should see this:
image::docker-bigdata-04.png[]
If everything ran successfully, you should be able to see the results in MongoDB.

After scaling the nodemanager service to 2 instances, you should have 2 nodes in your Hadoop cluster. Go to `http://localhost:8088/cluster` and refresh until you see 2 active nodes.
The trick to scaling the nodes is to use dynamically allocated ports and let Docker assign a different port to each new nodemanager. See this approach in this snippet of the `docker-compose.yml` file:

[source, text]
----
...
----

Stop all the services:

[source, text]
----
docker-compose down
----
Note that since our `docker-compose.yml` file defines volume mappings for HDFS and MongoDB, the next time you start the services, your data will still be there.

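Such persistence typically comes from mapping host directories (or named volumes) into the containers. A minimal sketch of what those mappings could look like — the host paths here are illustrative assumptions, not the sample project's actual configuration; `/data/db` is MongoDB's default data directory:

[source, yaml]
----
services:
  namenode:
    volumes:
      - ./data/hdfs:/data/hdfs   # hypothetical host path for HDFS data
  mongo:
    volumes:
      - ./data/mongo:/data/db    # MongoDB stores its data in /data/db by default
----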
== Debugging your code

Put the test file on HDFS:

[source, text]
----
docker-compose --file src/test/resources/docker-compose.yml run docker-hadoop-example hdfs dfs -put /maven/test-data/text_for_word_count.txt /files/
----