Episode Summary for Data Mechanics: Data Engineering with Jean-Yves Stephan

Apache Spark is a unified analytics engine for large-scale data processing, and Spark applications play a crucial role in computing environments such as data warehouses. Spark holds a central place as a general-purpose framework for big data and distributed computing. Its main use cases are streaming (Spark Streaming) and data engineering procedures such as extract, transform, and load (ETL), and it is also widely used across data science and data engineering workflows. Spark has been very popular in the industry since its 1.0 release in May 2014. More recently, it has become possible to run Spark on top of Kubernetes, which allows a standard architecture for big data workflows. Spark is most useful on large volumes of data, such as data sets over 100 gigabytes, and that is what makes it such a useful tool for data science.

That brings us to the question of how Spark applications differ from data warehouse applications. Data warehouses generally interact with data using only SQL, and the main problem with this approach is that it becomes hard to manage. Spark, on the other hand, gives engineers the flexibility they are looking for.

A data lake is a centralized storage repository that holds massive amounts of structured and unstructured data. Running Spark on a data lake gives you a choice of programming language: Java, Scala, Python, or R. That extra flexibility helps when you need to implement fairly complex business logic, especially once you start managing a lot of ETL jobs and do not want to implement everything in SQL, because past a certain point SQL queries become hard to manage. Spark lets users write functions and modularize their code, which is one of the main reasons people choose Spark over data warehouses. In terms of cost-effectiveness,


This article is purposely trimmed; please visit the source to read the full article.

The post Episode Summary for Data Mechanics: Data Engineering with Jean-Yves Stephan appeared first on Software Engineering Daily.
