Hadoop MapReduce vs. Spark

This content originally appeared on Level Up Coding - Medium and was authored by Anello

Managing distributed processing.

One of the main advantages of Apache Spark is its ability to apply MapReduce-style operations to large datasets. Hadoop popularized MapReduce operations; however, Spark can execute these operations up to 100x faster when the data fits in memory. Now we'll look at the differences between Hadoop MapReduce and Apache Spark.

Hadoop MapReduce and Apache Spark are the two most popular frameworks for cluster computing and large-scale data analysis (Big Data). Both frameworks hide the complexity of parallelizing tasks and handling fault tolerance by exposing a simple software API to users. We cannot process Big Data on a single computer; the data is too large, so we need to distribute the processing across multiple machines.

Therefore, the Hadoop and Apache Spark teams have simplified this work by developing software that masks the complexity. Instead of programming distributed execution manually, we simply call a Spark or Hadoop MapReduce function.

Hadoop HDFS

Hadoop is a framework divided into several parts. One of these parts is the Hadoop Distributed File System — HDFS. Just as we have to process in a distributed way, we also need to store in a distributed manner. Therefore, the data is distributed across multiple computers.

Hadoop MapReduce

The other module, MapReduce, handles distributed processing. So we store data in a distributed way with HDFS and then process it in a distributed manner with MapReduce. However, developers noticed some limitations (notably the heavy disk I/O between processing stages), and from there came the idea to develop Spark. Therefore, when we refer to Apache Spark, we compare it to Hadoop MapReduce and not to Hadoop as a whole.

What do these modules do?

These two modules implement the map-and-reduce paradigm. We take Big Data and load it into a distributed storage environment (HDFS), from where the data is spread across multiple machines.

However, we want to examine this data: look for patterns, organize it, give it greater consistency, and then extract insights from it. Mapping and reduction do just that: mapping searches for patterns, and reduction aggregates the results through programming.
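The paradigm can be sketched in plain Python, with no cluster involved: map transforms each record independently (the step that can run in parallel), and reduce combines the mapped records into one result.

```python
from functools import reduce

data = [1, 2, 3, 4, 5]

# Map: apply a function to every record independently.
# In a cluster, each machine would run this on its own partition.
mapped = list(map(lambda x: x * x, data))

# Reduce: combine all mapped records into a single result.
total = reduce(lambda a, b: a + b, mapped)

print(mapped)  # [1, 4, 9, 16, 25]
print(total)   # 55
```

This is only a single-machine illustration of the idea; frameworks like Spark and Hadoop MapReduce execute the same two steps across many machines.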

Data Scientists determine which tasks will run in MapReduce, producing data that is valuable for analysis, visualization, and machine learning applications, whether through Apache Spark or Hadoop MapReduce. Spark performs distributed processing similar to Hadoop MapReduce, but with much greater speed. Spark has no storage system of its own and can use Hadoop HDFS as a data source or destination.

MapReduce in Apache Spark

The diagram summarizes mapping and reduction with Apache Spark: on the left side, the map operations; on the right side, the reduce operations; and in between, the shuffle operation that redistributes the data.

In the first Map column, we feed in Big Data: large datasets with high volume and high variety, generated at high velocity. With Big Data, we want to look for patterns, so we apply a mapping operation to extract them.

After mapping, Spark performs a shuffle: the output of each map operation is partitioned, and fragments that share the same key are grouped together to feed the final reduce step. Data that was initially messy is transformed, by the end, into a structure we can use for analysis.
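The three stages can be made explicit with the classic word-count example, again sketched in plain Python: map emits (word, 1) pairs, the shuffle groups pairs that share the same key, and reduce collapses each group into a count.

```python
from collections import defaultdict

lines = ["spark is fast", "hadoop is robust", "spark uses memory"]

# Map: emit a (word, 1) pair for every word in every line.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group values that share the same key (the word).
# In Spark this is where data moves between machines.
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# Reduce: collapse each group into a final count.
counts = {word: sum(values) for word, values in groups.items()}

print(counts["spark"])  # 2
print(counts["is"])     # 2
```

In Spark, the same flow is expressed as a `map` followed by `reduceByKey`, and the shuffle happens automatically between the two.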

Apache Spark Summary

Here is a summary of a typical operation with Apache Spark. On the left side, we have the user program, which is the program that data scientists will develop.

We develop our program and submit a job to Apache Spark. Within Spark, we first have the SparkContext, the entry point that connects our program to the execution environment, which can be local or a cluster of computers depending on how we run Spark.

Within this environment, Spark creates the RDDs: it takes the dataset and records it in its own format for distributed execution. RDDs are processed through the DAGScheduler, the task scheduler that routes tasks to the cluster manager.

The cluster manager receives the tasks scheduled by the DAGScheduler and sends them to the executors, the processes on the cluster machines that actually perform the work. Depending on the operations we run, intermediate data is stored in cache (a memory area), which makes Apache Spark much faster. We can build this Spark cluster on the company's internal network, or in the cloud if we lack sufficient infrastructure.

For all of this to work, we need a key component: the Resilient Distributed Dataset (RDD). The data will not be processed on just one machine; it will be processed on several computers. Therefore, we cannot use the default objects of Python, Scala, or Java, because those built-in objects were not designed for distributed processing. Instead, such an object (a list, tuple, or dictionary) is converted into an RDD to be processed in a distributed environment.

With RDDs, we process everything in memory, and that is what makes Apache Spark such a distinctive product. Instead of processing on disk, it loads everything into memory, runs the developed program, and generates the final result. RDDs are, therefore, the essence of distributed processing in Apache Spark.

Hadoop MapReduce writes intermediate results to disk, while Apache Spark writes intermediate results to memory, which is much faster.

Spark supports more than just the Map and Reduce functions; it also provides concise and consistent APIs in Scala, Java, Python, and R for building the processing pipeline, plus an interactive shell for running queries without having to create an entire data-analysis application.

We can use Hadoop HDFS as one of our data sources or destinations, and it is not mandatory to use HDFS. We can also use Apache Spark with other data tools as long as they are supported. HDFS ends up being a good option because it allows distributed storage of large data sets.

The Data Scientist is responsible for defining the rules of data manipulation and analysis. The Data Engineer is responsible for the processing pipeline, data sources, data destinations, and data security. In general, the scientist develops the program and hands it to the Data Engineer, who runs it and ensures that the data is secure, stored, and so on.

And there we have it. I hope you found this helpful. Thank you for reading.





