ARDC Research Link Australia

ORCID Profile
0000-0002-0844-5819

Current Organisation
Massey University

Publications

Publication

A comprehensive performance analysis of Apache Hadoop and Apache Spark for large scale data sets using HiBench

Publisher: Springer Science and Business Media LLC

Date: 12-2020

DOI: 10.1186/S40537-020-00388-5

Abstract: Big Data analytics for storing, processing, and analyzing large-scale datasets has become an essential tool for industry. Distributed computing frameworks such as Hadoop and Spark offer efficient solutions for analyzing vast amounts of data. Owing to its application programming interface (API) availability and its performance, Spark has become very popular, even more popular than the MapReduce framework. Both frameworks have more than 150 parameters, and the combination of these parameters has a massive impact on cluster performance. The default system parameters help system administrators deploy their applications without much effort and measure their specific cluster performance with factory-set parameters. However, an open question remains: can new parameter selection improve cluster performance for large datasets? This study therefore investigates the most impactful parameters, grouped under resource utilization, input splits, and shuffle, to compare the performance of Hadoop and Spark on a cluster implemented in our laboratory. We used a trial-and-error approach to tune these parameters based on a large number of experiments. For the comparative analysis of the frameworks, we selected two workloads: WordCount and TeraSort. Performance is measured on three criteria: execution time, throughput, and speedup. Our experimental results revealed that the performance of both systems depends heavily on input data size and correct parameter selection. The analysis of the results shows that Spark performs better than Hadoop when data sets are small, achieving up to a twofold speedup in WordCount workloads and up to 14 times in TeraSort workloads when default parameter values are reconfigured.

Publication

A parallelization model for performance characterization of Spark Big Data jobs on Hadoop clusters

Publisher: Springer Science and Business Media LLC

Date: 14-08-2021

DOI: 10.1186/S40537-021-00499-7

Abstract: This article proposes a new parallel performance model for different workloads of Spark Big Data applications running on Hadoop clusters. The proposed model can predict the runtime for generic workloads as a function of the number of executors, without necessarily knowing how the algorithms were implemented. For a certain problem size, it is shown that a model based on serial boundaries for a 2D arrangement of executors can fit the empirical data for various workloads. The empirical data was obtained from a real Hadoop cluster, using Spark and HiBench. The workloads used in this work included WordCount, SVM, Kmeans, PageRank and Graph (Nweight). A particular runtime pattern emerged when adding more executors to run a job. For some workloads, the runtime was longer with more executors added. This phenomenon is predicted by the new model of parallelisation. The resulting equation from the model explains certain performance patterns that fit neither Amdahl’s law predictions nor Gustafson’s equation. The results show that the proposed model achieved the best fit with all workloads and most of the data sizes, using the R-squared metric to assess how accurately it fits the empirical data. The proposed model has advantages over machine learning models due to its simplicity, requiring a smaller number of experiments to fit the data. This is very useful to practitioners in the area of Big Data because they can predict the runtime of specific applications by analysing the logs. In this work, the model is limited to changes in the number of executors for a fixed problem size.
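
For readers unfamiliar with the two baselines named in the abstract above, the following is a minimal Python sketch of the textbook forms of Amdahl's law and Gustafson's law, together with the R-squared statistic used to judge goodness of fit. It is not the model proposed in the article, whose equation is not reproduced on this page. The function names, the parallel fraction `p`, and the single-worker runtime `t1` are illustrative assumptions.

```python
from typing import Sequence


def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl's law: speedup on n workers for a fixed problem size,
    where p is the parallelisable fraction of the work (0 <= p <= 1)."""
    return 1.0 / ((1.0 - p) + p / n)


def amdahl_runtime(t1: float, p: float, n: int) -> float:
    """Runtime on n workers implied by Amdahl's law, given the
    single-worker runtime t1 (hypothetical input, in any time unit)."""
    return t1 * ((1.0 - p) + p / n)


def gustafson_speedup(p: float, n: int) -> float:
    """Gustafson's law: scaled speedup when the parallel part of the
    workload grows with the number of workers n."""
    return (1.0 - p) + p * n


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - q) ** 2 for o, q in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

Under Amdahl's law the predicted runtime never increases as workers are added, so it cannot reproduce the longer runtimes that the abstract reports for some workloads when more executors are used.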

Publication

An Enhanced Parallelisation Model for Performance Prediction of Apache Spark on a Multinode Hadoop Cluster

Publisher: MDPI AG

Date: 05-11-2021

DOI: 10.3390/BDCC5040065

Abstract: Big data frameworks play a vital role in storing, processing, and analysing large datasets. Apache Spark has been established as one of the most popular big data engines for its efficiency and reliability. However, one of the significant problems of the Spark system is performance prediction. Spark has more than 150 configurable parameters, and determining suitable values for so many parameters is a challenging task. In this paper, we propose two distinct parallelisation models for performance prediction. Our insight is that each node in a Hadoop cluster can communicate with identical nodes, and a certain function of the non-parallelisable runtime can be estimated accordingly. Both models use simple equations that allow us to predict the runtime when the size of the job and the number of executables are known. The proposed models were evaluated on five HiBench workloads: Kmeans, PageRank, Graph (NWeight), SVM, and WordCount. Each workload’s empirical data were fitted with one of the two models, meeting the accuracy requirements. Finally, the experimental findings show that the model can be a handy and helpful tool for scheduling and planning system deployment.

Publication

Bluetooth-based wireless personal area network for multimedia communication

Publisher: IEEE Comput. Soc

Date: 2002

DOI: 10.1109/DELTA.2002.994587

Related Organisations

Organisation

University Of Dhaka

Location: Bangladesh

Organisation

Massey University

Location: New Zealand

Organisation

International Islamic University Malaysia

Location: Malaysia

Organisation

University Brunei Darussalam

Location: Brunei Darussalam

Related Funding Activities

No related grants have been discovered for Mohammad Rashid.

Mohammad Rashid

Researcher
