Do you know how Spark SQL works internally and gives you results in seconds or minutes from large volumes of data that would usually take hours or days in an RDBMS?
This is Siddharth Garg, with around 6+ years of experience in Big Data technologies like MapReduce, Hive, HBase, Sqoop, Oozie, Flume, Airflow, Phoenix, Spark, Scala, and Python. For the last 1.5+ years, I have been working with Luxoft as a Software Development Engineer 1 (Big Data).
Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform additional optimizations. There are several ways to interact with Spark SQL, including SQL and the Dataset API. When computing a result, the same execution engine is used, independent of which API or language you use to express the computation. This unification means developers can easily switch back and forth between the different APIs, depending on which provides the most natural way to express a given transformation.
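To make this concrete, here is a minimal sketch showing the same aggregation written once with SQL and once with the DataFrame API. The `employees` data, its column names, and the local SparkSession are just assumptions for illustration; the point is that both versions are compiled by the same engine into effectively the same plan. The later sketches in this article reuse this `spark` session and the `employees` view.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

// Local session just for the demo; in spark-shell a `spark` session already exists.
val spark = SparkSession.builder()
  .appName("spark-sql-internals-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A tiny, made-up dataset registered as a temporary view.
val employees = Seq(("HR", 50000.0), ("IT", 90000.0), ("IT", 70000.0))
  .toDF("dept", "salary")
employees.createOrReplaceTempView("employees")

// The same aggregation through the SQL interface...
val viaSql = spark.sql(
  "SELECT dept, AVG(salary) AS avg_salary FROM employees GROUP BY dept")

// ...and through the DataFrame (Dataset) API.
val viaApi = employees.groupBy("dept").agg(avg("salary").as("avg_salary"))

// Both go through the same analyzer, optimizer, and execution engine,
// so the printed physical plans are effectively identical.
viaSql.explain()
viaApi.explain()
```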
Have you ever thought about what Spark SQL does internally whenever you fire a query through it, and how it manages to give you results within seconds or minutes from a volume of data that would take hours or days to query through an RDBMS?
The Internal Working of Spark SQL
A Spark SQL query goes through several phases.
Let's understand each of them.
1. Parsed Logical Plan — unresolved
The query is parsed, and the parser checks for syntax errors.
If the syntax is correct, the plan moves on to step 2.
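For instance, here is a minimal sketch (reusing the `spark` session from the first example) of a query that fails already at this parsing stage because of a typo in the keyword; in recent Spark versions this surfaces as a `ParseException` before any table or column is even looked at.

```scala
import org.apache.spark.sql.catalyst.parser.ParseException

try {
  // "SELEC" is not valid SQL, so the parser rejects the query immediately.
  spark.sql("SELEC * FROM employees")
} catch {
  case e: ParseException =>
    println(s"Caught at the parsing stage: ${e.getMessage}")
}
```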
2. Resolved/Analyzed Logical Plan
The analyzer tries to resolve the table names, column names, and so on.
It refers to the catalog to resolve them.
If a column name or table name is not available, we get an AnalysisException.
If everything resolves correctly, the plan moves on to step 3.
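As a minimal sketch (reusing the `employees` view from the first example), the query below is syntactically valid but refers to a column the catalog does not know about, so it fails at the analysis stage with an `AnalysisException` rather than a parse error. The catalog the analyzer consults can also be browsed directly.

```scala
import org.apache.spark.sql.AnalysisException

try {
  // Valid syntax, but "salry" is not a column of the employees view,
  // so resolution against the catalog fails.
  spark.sql("SELECT salry FROM employees")
} catch {
  case e: AnalysisException =>
    println(s"Caught at the analysis stage: ${e.getMessage}")
}

// The catalog used for resolution is exposed through the public Catalog API.
spark.catalog.listTables().show()
spark.catalog.listColumns("employees").show()
```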
3. Optimized Logical Plan
The resolved logical plan goes through the Catalyst optimizer.
Catalyst is a rule-based engine.
The plan is optimized by applying various rules.
Some of these rules are:
* filter push down
* combining of filters
* combining of projections
There are many such rules already in place.
If we want, we can also add our own custom rules to the Catalyst optimizer.
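As a rough sketch of that extension point (reusing the `spark` session from the first example), a custom rule can be registered through `spark.experimental.extraOptimizations`. The rule below is purely illustrative: it rewrites a multiplication by the literal 1.0 into the original expression. Note that `Rule`, `LogicalPlan`, and the expression classes are Catalyst internals, so their exact shapes can change between Spark versions.

```scala
import org.apache.spark.sql.catalyst.expressions.{Literal, Multiply}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import spark.implicits._

// Illustrative rule: "expr * 1.0" (or "1.0 * expr") is simplified to just "expr".
object RemoveMultiplyByOne extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case m: Multiply if m.right == Literal(1.0) => m.left
    case m: Multiply if m.left == Literal(1.0)  => m.right
  }
}

// Register the rule; it runs alongside the built-in optimization rules
// for every query in this session.
spark.experimental.extraOptimizations = Seq(RemoveMultiplyByOne)

// One way to check: the pattern appears in the analyzed plan of this query,
// and the optimized logical plan shows whether the rewrite was applied.
employees.select(($"salary" * 1.0).as("s")).explain(extended = true)
```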
4. Generation of the Physical Plan
The optimized logical plan is converted into multiple physical plans.
Out of these, the one with the lowest cost is selected.
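The physical plan that was finally selected can be inspected on any DataFrame, either with `explain()` or programmatically through its `queryExecution` (a developer-facing API whose output format varies between versions). A minimal sketch, reusing the `employees` view from the first example:

```scala
val agg = spark.sql(
  "SELECT dept, AVG(salary) AS avg_salary FROM employees GROUP BY dept")

agg.explain()                             // prints only the selected physical plan
println(agg.queryExecution.sparkPlan)     // physical plan chosen by the planner
println(agg.queryExecution.executedPlan)  // same plan after final preparations (e.g. adding exchanges)
```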
5. Code Generation
The selected physical plan is converted into low-level RDD code.
This code is then executed.
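The generated code itself can be dumped for inspection using Spark's debug helpers, as in this small sketch (again reusing the earlier `employees` view):

```scala
import org.apache.spark.sql.execution.debug._

val query = spark.sql(
  "SELECT dept, AVG(salary) AS avg_salary FROM employees GROUP BY dept")

// Prints the Java source produced by whole-stage code generation
// for each code-generated subtree of the selected physical plan.
query.debugCodegen()
```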
This is how Spark SQL works internally and gives you results within seconds or minutes whenever you query a large volume of data.
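If you want to watch all of these phases for your own query, `explain(extended = true)` prints every plan the engine produced, from the parsed logical plan down to the physical plan. For example, with the `employees` view from the first sketch:

```scala
val q = spark.sql(
  "SELECT dept, AVG(salary) AS avg_salary FROM employees WHERE salary > 60000 GROUP BY dept")
q.explain(extended = true)

// The output contains four sections, matching the phases above:
// == Parsed Logical Plan ==
// == Analyzed Logical Plan ==
// == Optimized Logical Plan ==
// == Physical Plan ==
```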