Distributed SQL Query using SparkSQL, HDFS and Sqoop
Spark SQL: A brief introduction
Spark SQL is a component of the Spark framework. It allows you to manipulate large unstructured data files and extract useful information using SQL.
It introduces a new data abstraction called DataFrames, which allows the analysis of structured and semi-structured data.
Spark SQL provides APIs in Scala, Python and Java for manipulating DataFrames.
It also provides support for the SQL language, a command-line interface and an ODBC/JDBC server.
The example described in this post shows how to write a simple Spark application that executes an SQL query with Spark SQL.
Import MySQL data into HDFS
In this section I show how to import a MySQL database into Hadoop using Sqoop; in the next section I use the data loaded into HDFS to execute an SQL query.
I’m using the “world” database, which can be downloaded from this [link](https://dev.mysql.com/doc/index-other.html). It contains data about cities and countries around the world and the languages spoken in each country.
I import all the tables of the world database into HDFS, using as output format a text file with fields separated by the tab character. The following command imports all the tables into the HDFS directory /user/cloudera/world.
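A command along these lines does the job (the JDBC connection string, the credentials and the number of mappers are assumptions that depend on your environment):

```sh
# Import every table of the MySQL "world" database into HDFS as tab-separated text.
# Connection string, username and mapper count are placeholders for your cluster.
sqoop import-all-tables \
  --connect jdbc:mysql://localhost/world \
  --username root -P \
  --as-textfile \
  --fields-terminated-by '\t' \
  --warehouse-dir /user/cloudera/world \
  -m 1
```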
As you can see from the output of the following command, Sqoop created a subdirectory for each MySQL table and split each table’s data across several files whose names share the prefix “part-m-”.
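A recursive listing of the target directory shows one subdirectory per table (city, country, countrylanguage), each containing its part files:

```sh
# List the imported data; each table lives in its own subdirectory
# and its rows are spread over one or more part-m-* files.
hdfs dfs -ls -R /user/cloudera/world
```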
Spark SQL application
This section describes the simple application that I wrote to execute the SQL query with Spark SQL on the HDFS data imported in the previous section.
The SQL query joins the three tables of the database and extracts the top ten country names and their capitals, ordered by life expectancy, for the countries where more than 50% of the population speaks English.
First of all, I created the function “loadCSV” to load the data from HDFS into my application. This function accepts three parameters: the HDFS location, the name of the table to use in Spark SQL, and a lambda function mapping each field of the HDFS text files to a specific type (string, integer, float) of a table column. It reads the text file at the HDFS location, splits it into lines and parses each line to extract the content of each field; at the end the function registers the data as a Spark temporary table.
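A minimal sketch of this helper, assuming the Spark 1.x SQLContext API and the standard column layout of the MySQL “world” schema (the field positions and the handling of Sqoop’s "null" strings are assumptions, not the original code):

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="spark_sql")
sqlContext = SQLContext(sc)

def loadCSV(path, table_name, mapper):
    """Load tab-separated files from HDFS, type the fields with `mapper`
    and register the result as a Spark SQL temporary table."""
    rdd = sc.textFile(path) \
            .map(lambda line: line.split("\t")) \
            .map(mapper)
    df = sqlContext.createDataFrame(rdd)
    df.registerTempTable(table_name)
    return df

def to_float(value):
    # Sqoop writes SQL NULLs as the string "null" by default (assumption).
    return None if value == "null" else float(value)

# Register the three tables of the "world" database; the column positions
# follow the standard MySQL "world" schema.
loadCSV("/user/cloudera/world/city", "city",
        lambda f: Row(ID=int(f[0]), Name=f[1], CountryCode=f[2],
                      District=f[3], Population=int(f[4])))
loadCSV("/user/cloudera/world/country", "country",
        lambda f: Row(Code=f[0], Name=f[1],
                      LifeExpectancy=to_float(f[7]),
                      Capital=None if f[13] == "null" else int(f[13])))
loadCSV("/user/cloudera/world/countrylanguage", "countrylanguage",
        lambda f: Row(CountryCode=f[0], Language=f[1],
                      IsOfficial=f[2], Percentage=to_float(f[3])))
```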
Then I execute the query described above and save the result to a Parquet file.
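Under the same assumptions, the query and the Parquet output could look like this (the output path is illustrative):

```python
# Top ten countries by life expectancy where English is spoken by more than
# 50% of the population, together with their capital cities.
top10 = sqlContext.sql("""
    SELECT co.Name AS Country, ci.Name AS Capital, co.LifeExpectancy
    FROM country co
    JOIN city ci ON ci.ID = co.Capital
    JOIN countrylanguage cl ON cl.CountryCode = co.Code
    WHERE cl.Language = 'English' AND cl.Percentage > 50
    ORDER BY co.LifeExpectancy DESC
    LIMIT 10
""")

# Save the result as a Parquet file on HDFS (illustrative path).
top10.write.parquet("/user/cloudera/world/top10_result")
```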
Execute the application and view the result
To run the application on my Hadoop cluster I simply saved the source code described in the previous section in the file spark_sql.py and ran it with the spark-submit command.
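For instance, something like the following (the --master value depends on your cluster setup):

```sh
# Submit the PySpark application to the cluster; the master URL is an assumption.
spark-submit --master yarn spark_sql.py
```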
The next snippet uses parquet-tools to view the result stored in the Parquet file in a human-readable format.
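For example, with the parquet-tools jar available on the cluster (the jar name and the output path are assumptions):

```sh
# Print the content of the Parquet result in a human-readable form.
hadoop jar parquet-tools.jar cat /user/cloudera/world/top10_result
```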