Apache Impala Gets Top-Level Status As Open Source Hadoop Tool

Since its introduction, Hadoop has evolved steadily: each release adds new abstractions and features and removes drawbacks of earlier versions.

Apache Hive, originally developed at Facebook, manages and processes large data sets stored in Hadoop's distributed environment. It has its own SQL-like language, HiveQL. Cloudera introduced Impala to overcome the low interactivity of SQL on Hadoop. Cloudera Impala runs high-performance, low-latency SQL queries for analyzing and processing data stored on Hadoop clusters.

This article discusses the features and architecture of Cloudera Impala and compares it with Apache Hive.

Introduction to Impala

Impala is a massively parallel processing (MPP) SQL engine that offers a powerful way to process massive amounts of data. Its only requirement is that the data to be processed is stored on a Hadoop cluster, which is rarely a constraint where Hadoop is already used for data warehousing. Cloudera announced Impala as a public beta in October 2012 and made it generally available in May 2013.

Cloudera Impala is an excellent choice for Hadoop programmers who need to query data in Apache HBase and HDFS, because the data does not have to be transformed or moved before it is processed. Impala integrates easily with the Hadoop ecosystem because it uses the same data and file formats.

In addition, Impala uses the same metadata, resource management and security frameworks as Apache Hive, Apache Pig and MapReduce. Its architecture combines the strengths of Hadoop with the multi-user performance of a traditional database and the familiarity of SQL syntax. Two technologies in particular make Impala attractive compared with other processing engines:

1) Columnar Storage

Impala can store data in a columnar layout, for example in Parquet files, which gives efficient storage and a high compression ratio.
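
To illustrate why a columnar layout compresses well, here is a minimal Python sketch. It is not Impala or Parquet code: the sample table and the helper names `to_columns` and `run_length_encode` are assumptions made up for this example, and run-length encoding stands in for the more elaborate encodings real columnar formats use.

```python
from itertools import groupby

# Hypothetical sample table: each dict is one row.
rows = [
    {"country": "US", "status": "ok", "bytes": 120},
    {"country": "US", "status": "ok", "bytes": 300},
    {"country": "US", "status": "error", "bytes": 80},
    {"country": "DE", "status": "ok", "bytes": 150},
    {"country": "DE", "status": "ok", "bytes": 90},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# Values of one column sit next to each other, so repeats collapse nicely.
encoded_country = run_length_encode(columns["country"])

# An aggregate over one column never touches the other columns.
total_bytes = sum(columns["bytes"])

print(encoded_country)
print(total_bytes)
```

Because similar values end up side by side, a query such as a sum over one column reads only that column, and repeated values shrink to a handful of pairs.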

2) Tree Architecture

A query is pushed down a tree of nodes and the results are aggregated from the leaves back up to the root. This architecture enables massively parallel, multi-level distributed query processing.
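
The push-down and aggregate-up idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the tree is a nested tuple invented for this example, the function names are hypothetical, and the nodes are visited one after another here, whereas a real engine would run the leaves in parallel on different machines.

```python
def leaf_partial(values):
    """Leaf node: scan its local data and return a partial (sum, count)."""
    return (sum(values), len(values))


def merge_partials(partials):
    """Intermediate or root node: combine the partial results of its children."""
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return (total, count)


def run_tree(node):
    """Push the query down the tree, then aggregate results on the way up.

    A node is either a list of numbers (a leaf holding local data)
    or a tuple of child nodes.
    """
    if isinstance(node, list):
        return leaf_partial(node)
    return merge_partials([run_tree(child) for child in node])


# Hypothetical two-level tree: a root with two intermediate nodes.
tree = (
    ([10, 20], [30]),
    ([40, 50, 60],),
)

total, count = run_tree(tree)
average = total / count
print(total, count, average)
```

Each level only passes small partial results upward, which is why an average over a huge data set can be computed without moving the raw rows to a single node.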

Besides improving performance and eliminating the need to migrate huge data sets or convert data formats before analysis, there are a few additional reasons to adopt Impala:
  • It supports both Apache HBase storage and HDFS
  • It recognizes the common Hadoop file formats, such as text, LZO, Avro, RCFile, Parquet and SequenceFile
  • It supports Hadoop security through Kerberos authentication
  • It supports fine-grained, role-based authorization with Apache Sentry
  • It reuses the metadata, ODBC driver and SQL syntax of Apache Hive

In just two years Impala gained considerable popularity. The addition of Impala support by MapR and Amazon Web Services is evidence of its success.

Architecture of Cloudera Impala

Cloudera Impala consists of three key components: impalad, impala-shell and impala-state-store. The interaction between the components involved in an SQL query is shown in the following figure:

Impala-shell is a shell script that starts the impala_shell.py Python program, which is used to run queries.
Impalad runs on each node of the Hadoop cluster; it plans and executes the queries sent from the impala-shell.
Impala-state-store stores information such as the status and location of the impalad instances.
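
The division of labour between the state store and the impalad instances can be pictured with a toy Python model. This is not Impala's actual code or API: the `StateStore` class, the `plan_query` function, the heartbeat timeout and the node addresses are all assumptions invented for this sketch.

```python
import time


class StateStore:
    """Toy registry tracking the location and last heartbeat of each daemon."""

    def __init__(self, timeout_seconds=10.0):
        self.timeout_seconds = timeout_seconds
        self._last_seen = {}  # address -> timestamp of the last heartbeat

    def heartbeat(self, address, now=None):
        """Record that the daemon at `address` is alive."""
        self._last_seen[address] = time.time() if now is None else now

    def live_daemons(self, now=None):
        """Return the addresses whose heartbeat is recent enough."""
        now = time.time() if now is None else now
        return sorted(
            address
            for address, seen in self._last_seen.items()
            if now - seen <= self.timeout_seconds
        )


def plan_query(statestore, num_fragments, now=None):
    """Assign query fragments round-robin to daemons that are currently alive."""
    daemons = statestore.live_daemons(now)
    if not daemons:
        raise RuntimeError("no live daemons available")
    return {i: daemons[i % len(daemons)] for i in range(num_fragments)}


store = StateStore(timeout_seconds=10.0)
store.heartbeat("node1:22000", now=100.0)
store.heartbeat("node2:22000", now=100.0)
store.heartbeat("node3:22000", now=85.0)  # stale: last seen too long ago

plan = plan_query(store, num_fragments=4, now=101.0)
print(plan)
```

The point of the sketch is that the planner only hands work to daemons the registry still considers healthy, so a node that has stopped reporting is skipped for new queries.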

Difference between Impala and Apache Hive

Impala is often considered better than Apache Hive for interactive queries. Here is a closer look at the differences between the two:
Cloudera Impala is implemented in native code (C++) for query processing, which avoids the startup overhead commonly seen in MapReduce/Tez-based jobs. Its daemon processes are started at boot time, so Impala is always ready to process a query, whereas Hive suffers from a "cold start" problem.

Hive generates query expressions at compile time, whereas Impala generates code at run time, especially for "big loops".

Apache Hive is not ideal for interactive computing, whereas Impala is designed specifically for it.

Impala behaves like an MPP database, whereas Hive is used for batch processing.

Impala's support for complex types is limited (early releases had none), whereas Hive supports them.
Impala does not support fault tolerance, whereas Hive does. In Hive, a query still produces its result even if a DataNode goes down while it is running; in Impala, the whole query has to be restarted in that situation.
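
The cost of the two recovery strategies can be compared with a small Python simulation. It is a deliberately simplified model, not a description of either engine's internals: the task names, the two function names and the assumption that each failing task fails exactly once are all made up for this sketch, and "cost" is just the number of task executions.

```python
def run_with_task_retry(tasks, fails_once):
    """Hive-style recovery: only the failed task is run again."""
    executed = 0
    pending_failures = set(fails_once)
    for task in tasks:
        executed += 1
        if task in pending_failures:
            pending_failures.discard(task)
            executed += 1  # retry just this one task
    return executed


def run_with_query_restart(tasks, fails_once):
    """Impala-style recovery: any failure restarts the whole query."""
    executed = 0
    pending_failures = set(fails_once)
    while True:
        failed = False
        for task in tasks:
            executed += 1
            if task in pending_failures:
                pending_failures.discard(task)
                failed = True
                break  # abandon this attempt and start over
        if not failed:
            return executed


tasks = ["scan-1", "scan-2", "scan-3", "aggregate"]
retry_cost = run_with_task_retry(tasks, fails_once={"scan-3"})
restart_cost = run_with_query_restart(tasks, fails_once={"scan-3"})
print(retry_cost, restart_cost)
```

For short interactive queries the restart penalty is small, which is part of why the trade-off suits Impala; for long batch jobs, redoing only the failed task matters much more.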

The following graph shows that Impala is much faster than Apache Hive, and it has several other performance-related advantages. If you are starting a new project, Impala can be the best choice, while for an upgrade project where compatibility matters, Hive can be a better option.

In short, Apache Hive and Impala cannot be ranked simply on their advantages and disadvantages; the right choice depends on the situation and use case. Hive has a MapReduce foundation for query execution, while Impala uses its own execution engine. In some cases the two are used together, which can give better compatibility and performance than either alone.

Final Thoughts

Many other efforts in the big data world also aim to support fast, real-time and ad hoc queries over large data sets. Cloudera Impala uses the same concept as Google BigQuery, supports a wider range of input formats, and is available as open source technology. Being open source, it can also attract external developers, which may further improve its performance. With Cloudera Impala, programmers can improve the performance of their applications and take them to the next level.