How do you fix queries that run for hours or days over a large volume of data in an RDBMS, and get results for the same data within seconds or minutes?

Siddharth Garg
6 min read · Dec 15, 2020

This is Siddharth Garg. I have around 6+ years of experience in Big Data technologies like MapReduce, Hive, HBase, Sqoop, Oozie, Flume, Airflow, Phoenix, Spark, Scala, and Python. For the last 1.5+ years, I have been working with Luxoft as a Software Development Engineer 1 (Big Data).

A relational database organizes data into multiple tables, and every table is arranged into rows (also called records or tuples) and columns (also known as fields or attributes). Fetching results can take a long time when you fire a query over a very large volume of data.

But you can solve this problem by moving the data from the legacy RDBMS to HDFS (where the data is stored as files) and using the Spark engine to query it.

Spark SQL integrates relational processing with Spark’s functional programming. It provides support for various data sources and makes it possible to weave SQL queries with code transformations, resulting in a very powerful tool.
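As a minimal sketch (the HDFS path and the region/amount columns are made up for illustration), the snippet below loads data that was offloaded from the RDBMS into HDFS, runs a SQL query over it, and then keeps transforming the result with DataFrame code:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SqlWithTransformations {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql-with-transformations")
      .getOrCreate()

    // Load structured data that was offloaded from the RDBMS into HDFS.
    // The path and column names (region, amount) are illustrative.
    val sales = spark.read.parquet("hdfs:///data/sales.parquet")
    sales.createOrReplaceTempView("sales")

    // A plain SQL query...
    val byRegion = spark.sql(
      "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")

    // ...whose result is a DataFrame, so it composes with further
    // functional transformations in the same program.
    val topRegions = byRegion
      .filter(col("total") > 10000)
      .orderBy(desc("total"))

    topRegions.show()
    spark.stop()
  }
}
```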

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform additional optimizations. There are several ways to interact with Spark SQL, including SQL and the Dataset API. When computing a result, the same execution engine is used regardless of which API or language you use to express the computation. This unification means that developers can easily switch back and forth between the different APIs, depending on which provides the most natural way to express a given transformation.

One use of Spark SQL is to execute SQL queries. Spark SQL can also be used to read data from an existing Hive installation. When running SQL from within another programming language, the results are returned as a Dataset/DataFrame. You can also interact with the SQL interface using the command line or over JDBC/ODBC.
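Here is a hedged sketch of querying an existing Hive installation from a Spark program; it assumes hive-site.xml is visible to Spark and that a table named warehouse.orders exists, both purely illustrative:

```scala
import org.apache.spark.sql.SparkSession

object HiveQuery {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-query")
      .enableHiveSupport() // connect Spark SQL to the existing Hive metastore
      .getOrCreate()

    // SQL issued from inside the program comes back as a DataFrame,
    // not as a plain result set.
    val orders = spark.sql(
      "SELECT order_id, status FROM warehouse.orders WHERE status = 'OPEN'")
    orders.printSchema()
    orders.show(20)

    spark.stop()
  }
}
```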

Why is Spark SQL used?

Spark SQL grew out of an effort to run Apache Hive on top of Spark and is now integrated with the Spark stack. Apache Hive has certain limitations, mentioned below. Spark SQL was built to overcome these drawbacks and replace Apache Hive.

Is Spark SQL faster than Hive?

Spark SQL is faster than Hive when it comes to processing speed. Below I have listed a few limitations of Hive compared to Spark SQL.

Limitations With Hive:

  • Hive launches MapReduce jobs internally to execute ad-hoc queries, and MapReduce lags in performance when analysing even medium-sized datasets (10 to 200 GB).
  • Hive has no resume capability. This means that if processing dies in the middle of a workflow, you cannot resume from where it got stuck.
  • Hive cannot drop encrypted databases in cascade when trash is enabled, which leads to an execution error. To overcome this, users have to use the PURGE option so the data skips the trash, instead of a plain DROP.

These drawbacks led to the birth of Spark SQL. But the question that still lingers in many of our minds is,

Is Spark SQL a database?

Spark SQL is not a database but a module for structured data processing. It works mainly on DataFrames, which are its programming abstraction, and it acts as a distributed SQL query engine.

How does Spark SQL work?

Let us explore what Spark SQL has to offer. Spark SQL blurs the line between RDDs and relational tables. It offers much tighter integration between relational and procedural processing, through a declarative DataFrame API that integrates with Spark code, and it enables additional optimizations. The DataFrame API and the Dataset API are the two ways to interact with Spark SQL.

With Spark SQL, Apache Spark becomes accessible to more users and improves optimization for existing ones. Spark SQL provides DataFrame APIs that perform relational operations on both external data sources and Spark’s built-in distributed collections. It introduces an extensible optimizer called Catalyst, which helps support a wide range of data sources and algorithms in big data.

Spark runs on both Windows and UNIX-like systems (e.g. Linux, macOS). It is easy to run locally on one machine; all you need is Java installed on your system PATH, or the JAVA_HOME environment variable pointing to a Java installation.
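For example, a minimal local run could look like the sketch below; local[*] simply uses all cores of the current machine, and the application name is arbitrary:

```scala
import org.apache.spark.sql.SparkSession

object LocalSparkSql {
  def main(args: Array[String]): Unit = {
    // Run Spark SQL locally on one machine, using all available cores.
    val spark = SparkSession.builder()
      .appName("local-spark-sql")
      .master("local[*]")
      .getOrCreate()

    spark.sql("SELECT 1 + 1 AS two").show()
    spark.stop()
  }
}
```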

Spark SQL Libraries

Spark SQL has the following four libraries, which bridge relational and procedural processing:

1. Data Source API (Application Programming Interface):

This is a universal API for loading and storing structured data; a short sketch follows the list below.

  • It has built-in support for Hive, Avro, JSON, JDBC, Parquet, etc.
  • Supports third-party integration through Spark packages.
  • Support for smart sources.
  • It is a data abstraction and Domain Specific Language (DSL) applicable to structured and semi-structured data.
  • The DataFrame API is a distributed collection of data organized into named columns and rows.
  • It is lazily evaluated, like Apache Spark transformations, and can be accessed through SQLContext and HiveContext.
  • It can process data ranging in size from kilobytes to petabytes, on anything from a single-node cluster to multi-node clusters.
  • Supports different data formats (Avro, CSV, Elasticsearch, Cassandra) and storage systems (HDFS, Hive tables, MySQL, etc.).
  • Can be easily integrated with all Big Data tools and frameworks via Spark Core.
  • Provides APIs for Python, Java, Scala, and R.
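As a rough sketch of the Data Source API (assuming a SparkSession named spark is already in scope, as in spark-shell, and using made-up paths and connection details), the same read/write interface covers files in HDFS as well as JDBC sources:

```scala
// Load JSON from HDFS; the path and the event_date column are illustrative.
val events = spark.read
  .option("multiLine", "true")
  .json("hdfs:///raw/events.json")

// Store the same data as partitioned Parquet.
events.write
  .mode("overwrite")
  .partitionBy("event_date")
  .parquet("hdfs:///curated/events_parquet")

// Third-party or JDBC sources plug into the same API, e.g. a MySQL table
// (URL, table name and credentials are placeholders):
val customers = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://db-host:3306/shop")
  .option("dbtable", "customers")
  .option("user", "reader")
  .option("password", "secret")
  .load()
```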

2. DataFrame API:

A DataFrame is a distributed collection of data organized into named columns. It is the equivalent of a relational table in SQL.
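A tiny illustration (assuming a SparkSession named spark, as in spark-shell; names and salaries are invented) shows how the named columns make a DataFrame behave like a relational table:

```scala
// Needed for the .toDF syntax; `spark` is the active SparkSession.
import spark.implicits._

// A DataFrame built from local rows, organized into named columns.
val employees = Seq(
  ("Asha",  "Engineering", 95000),
  ("Ravi",  "Finance",     72000),
  ("Meena", "Engineering", 88000)
).toDF("name", "department", "salary")

// Query it like a table: average salary per department.
employees.groupBy("department").avg("salary").show()
```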

3. SQL Interpreter And Optimizer:

The SQL Interpreter and Optimizer is based on functional programming constructs in Scala.

  • It is the newest and most technically evolved component of Spark SQL.
  • It provides a general framework for transforming trees, which is used to perform analysis/evaluation, optimization, planning, and runtime code generation.
  • It supports cost-based optimization (where runtime and resource utilization are treated as the cost) and rule-based optimization, making queries run much faster than their RDD (Resilient Distributed Dataset) counterparts.

For example, Catalyst is a modular library built as a rule-based system, where each rule in the framework focuses on a distinct optimization.
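One way to observe Catalyst at work is to print the query plans; in the sketch below (path and column names are assumptions, and a SparkSession named spark is assumed to be in scope), explain(true) shows the parsed, analyzed, optimized, and physical plans:

```scala
import org.apache.spark.sql.functions.col

// Read some Parquet data; the path and column names are illustrative.
val events = spark.read.parquet("hdfs:///curated/events_parquet")

events
  .select("user_id", "event_date")
  .filter(col("event_date") === "2020-12-01")
  .explain(true) // the optimized plan pushes the filter down toward the scan
```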

4. SQL Service:

The SQL Service is the entry point for working with structured data in Spark. It allows the creation of DataFrame objects as well as the execution of SQL queries.
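A small sketch of that entry point, with invented table names and values:

```scala
import org.apache.spark.sql.SparkSession

object SqlServiceExample {
  def main(args: Array[String]): Unit = {
    // One SparkSession exposes DataFrame creation, SQL execution,
    // and the catalog of registered tables.
    val spark = SparkSession.builder().appName("sql-service").getOrCreate()
    import spark.implicits._

    val tickets = Seq((1, "open"), (2, "closed"), (3, "open")).toDF("id", "status")
    tickets.createOrReplaceTempView("tickets")

    spark.sql("SELECT status, COUNT(*) AS n FROM tickets GROUP BY status").show()
    spark.catalog.listTables().show() // the temp view is visible in the catalog

    spark.stop()
  }
}
```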

Features Of Spark SQL

The following are the features of Spark SQL:

  • Integration With Spark

Spark SQL queries are integrated with Spark programs. Spark SQL lets us query structured data inside Spark programs, using either SQL or the DataFrame API, in Java, Scala, Python and R. To run a streaming computation, developers simply write a batch computation against the DataFrame/Dataset API, and Spark automatically runs it incrementally in a streaming fashion. This powerful design means that developers don’t have to manually manage state, handle failures, or keep the application in sync with batch jobs. Instead, the streaming job always gives the same answer as a batch job on the same data.
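To illustrate this unification, the sketch below expresses one aggregation and runs it both as a batch job and as an incrementally updated stream; the input directory and schema are assumptions:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.types.{DoubleType, StringType, StructType}

object BatchOrStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("batch-or-stream").getOrCreate()

    val schema = new StructType()
      .add("region", StringType)
      .add("amount", DoubleType)

    // The shared computation: total sales per region.
    def totals(df: DataFrame): DataFrame =
      df.groupBy("region").agg(sum("amount").alias("total"))

    // Batch: process everything already sitting in the directory.
    totals(spark.read.schema(schema).json("hdfs:///incoming/sales")).show()

    // Streaming: the same logic, updated incrementally as new files arrive.
    val query = totals(spark.readStream.schema(schema).json("hdfs:///incoming/sales"))
      .writeStream
      .outputMode("complete") // keep the full aggregated result up to date
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```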

  • Uniform Data Access

DataFrames and SQL share a common way to access a variety of data sources, such as Hive, Avro, Parquet, ORC, JSON, and JDBC. This makes it possible to join data across these sources, and it helps bring existing users of those systems onto Spark SQL.
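For instance, DataFrames coming from Hive, Parquet files, and a JDBC database can be joined directly; the table names, paths, and connection details below are placeholders, and a SparkSession named spark with Hive support is assumed:

```scala
val hiveOrders   = spark.table("warehouse.orders")              // Hive table
val parquetItems = spark.read.parquet("hdfs:///curated/items")  // Parquet files
val customers    = spark.read.format("jdbc")                    // JDBC source
  .option("url", "jdbc:mysql://db-host:3306/shop")
  .option("dbtable", "customers")
  .option("user", "reader")
  .option("password", "secret")
  .load()

// All three are DataFrames, so they can be joined with one API.
hiveOrders
  .join(customers, "customer_id")
  .join(parquetItems, "item_id")
  .groupBy("country")
  .count()
  .show()
```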

  • Hive Compatibility

Spark SQL runs unmodified Hive queries on existing data. It reuses the Hive frontend and metastore, giving full compatibility with existing Hive data, queries, and UDFs.
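As a hedged example, a HiveQL query using LATERAL VIEW/explode runs unmodified; the warehouse.clickstream table, its pages array column, and the dt partition are assumptions for illustration, and a session created with enableHiveSupport() is assumed:

```scala
// HiveQL syntax runs as-is against the existing Hive table.
spark.sql("""
  SELECT user_id, page
  FROM warehouse.clickstream
  LATERAL VIEW explode(pages) p AS page
  WHERE dt = '2020-12-01'
""").show()
```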

  • Standard Connectivity

Connections are made through JDBC or ODBC, the industry standards for connectivity from business intelligence tools.
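Below is a minimal sketch of the JDBC route from a program (a BI tool would do the same under the hood); it assumes the Spark Thrift Server has been started with sbin/start-thriftserver.sh and is listening on localhost:10000, that a sales table is registered, and that the Hive JDBC driver is on the classpath:

```scala
import java.sql.DriverManager

object JdbcClient {
  def main(args: Array[String]): Unit = {
    // Connect to the Spark Thrift Server over the HiveServer2 protocol.
    val conn = DriverManager.getConnection(
      "jdbc:hive2://localhost:10000/default", "user", "")
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery(
      "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
    while (rs.next()) {
      println(s"${rs.getString("region")} -> ${rs.getDouble("total")}")
    }
    rs.close(); stmt.close(); conn.close()
  }
}
```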

  • Performance And Scalability

Spark SQL incorporates a cost-based optimizer, code generation, and columnar storage to make queries fast while scaling to thousands of nodes on the Spark engine, which provides full mid-query fault tolerance. The interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed, and internally Spark SQL uses this extra information to perform additional optimizations. Spark SQL can read directly from multiple sources (local files, HDFS, JSON/Parquet files, existing RDDs, Hive tables, etc.), and it ensures fast execution of existing Hive queries. Compared to Hadoop MapReduce, Spark SQL can execute queries up to 100x faster.
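One concrete piece of that story is the in-memory columnar cache; in the sketch below (path and table name are illustrative, with a SparkSession named spark assumed), caching the table lets repeated queries avoid re-reading files from HDFS:

```scala
// Register the data as a temporary view so it can be queried by name.
spark.read.parquet("hdfs:///data/sales.parquet").createOrReplaceTempView("sales")

spark.catalog.cacheTable("sales") // stored in a compressed in-memory columnar format
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()
spark.sql("SELECT COUNT(*) FROM sales WHERE amount > 100").show()

spark.catalog.uncacheTable("sales") // release the cached memory
```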

This is how you can solve this problem.


Siddharth Garg

SDE(Big Data) - 1 at Luxoft | Ex-Xebia | Ex-Impetus | Ex-Wipro | Data Engineer | Spark | Scala | Python | Hadoop | Cloud