Published: 2025-12-01
Apache Spark for Business and Financial Data Engineering: A Systematic Literature Review
DOI: 10.35870/ijsecs.v5i3.5419
Ahmad Bilal Almagribi, Bambang Purnomosidi Dwi Putranto
Abstract
This paper presents a systematic literature review (SLR) that maps how Apache Spark is applied to data engineering in the business and finance domains. As organizations continue to handle large volumes of data, practitioners and researchers alike benefit from knowing how Apache Spark has been used to solve big data problems. By analyzing publications indexed in the Scopus database from 2021 to 2025, we identify current trends and methodologies as well as research gaps in the field. We find that Apache Spark is primarily used for sentiment analysis and trend analysis on social media, particularly Twitter, because its real-time processing capability helps reveal market dynamics and consumer behavior. The platform is also used for predictive tasks such as customer churn prediction and the pricing of financial assets (stocks, bonds, options), which demonstrates its versatility across business applications. It is likewise popular for anomaly detection, such as transaction fraud detection, with efficiency and cost being the main drivers of adoption. The landscape is not monolithic, however: some studies propose alternative platforms, indicating that Apache Spark may not be the best option for every scenario. Based on these findings, we suggest three directions for future research: using social media data sources other than Twitter to capture broader market sentiment, applying more varied algorithms to improve prediction accuracy, and extending Spark to new areas such as currency exchange rate forecasting, credit risk analysis, Anti-Money Laundering (AML) detection, and Data Lakehouse architecture implementation. These recommendations are intended to steer researchers toward underexplored areas where Apache Spark could deliver significant value for business and finance.
Keywords
Apache Spark; Business; Finance; Data Engineering; Systematic Literature Review (SLR)
Article Metadata
Peer Review Process
This article has undergone a double-blind peer review process to ensure quality and impartiality.
Article Information
This article has been peer-reviewed and published in the International Journal Software Engineering and Computer Science (IJSECS). The content is available under the terms of the Creative Commons Attribution 4.0 International License.
- Issue: Vol. 5 No. 3 (2025)
- Section: Articles
- Published: 2025-12-01
- License: CC BY 4.0
- Copyright: © 2025 Authors
- DOI: 10.35870/ijsecs.v5i3.5419
Ahmad Bilal Almagribi
Department of Information Technology, Universitas Teknologi Digital Indonesia, Bantul Regency, Special Region of Yogyakarta, Indonesia
This work is licensed under a Creative Commons Attribution 4.0 International License.