Big Data Solutions for Enterprises

Hive, Pig and Sqoop


Hive is a data warehousing infrastructure built on top of Hadoop. It is designed to enable easy data summarization through ad-hoc queries over large volumes of data. When data is loaded into Hive, Hive applies a schema to it, but the original data remains stored in HDFS. When SQL-like queries are submitted to Hive, it internally converts them into MapReduce jobs. This lets users familiar with SQL run ad-hoc queries on data in HDFS, while traditional MapReduce programmers can still plug in their custom mappers and reducers. Hive does not offer real-time queries or row-level updates; it is best suited to batch jobs over large sets of immutable data.
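As a sketch of the workflow above, the following HiveQL defines a schema over data already sitting in HDFS and then summarizes it. The table name, columns, and HDFS path are hypothetical, chosen only for illustration:

```sql
-- Define a schema over existing HDFS data; the files themselves stay in HDFS.
-- (Table name, columns, and path are illustrative assumptions.)
CREATE EXTERNAL TABLE page_views (
  user_id    STRING,
  url        STRING,
  view_time  TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/page_views';

-- An ad-hoc summarization query; Hive compiles this into MapReduce jobs.
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;
```

Because the table is declared EXTERNAL, dropping it removes only the schema, not the underlying HDFS files, which matches Hive's schema-on-read model described above.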


Pig is a platform for analyzing large data sets. It consists of a high-level language (Pig Latin) for expressing data analysis programs, together with infrastructure for executing those programs. The language makes programs simple to write; when a program is executed, it is internally converted into MapReduce jobs. Pig's programming structure is amenable to substantial parallelization, and the way tasks are encoded permits the system to optimize their execution automatically, so programmers need not focus on efficiency.
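To make the conversion concrete, here is a short Pig Latin script that performs the same kind of grouped count a MapReduce job would. The input path and field names are hypothetical:

```pig
-- Load tab-separated records from HDFS (path and fields are assumptions).
views   = LOAD '/data/page_views' USING PigStorage('\t')
          AS (user_id:chararray, url:chararray);

-- Group by URL and count occurrences; Pig translates these
-- relational operators into one or more MapReduce jobs.
grouped = GROUP views BY url;
counts  = FOREACH grouped GENERATE group AS url, COUNT(views) AS views;

STORE counts INTO '/output/view_counts';
```

Each statement defines a relation rather than running immediately; Pig builds a logical plan from the whole script and optimizes it before launching any jobs, which is why the programmer need not hand-tune execution.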


Sqoop is a tool designed to transfer data between Hadoop and relational database management systems (RDBMS). We can use Sqoop to import data from an RDBMS such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop by running MapReduce jobs, and then export the results back into an RDBMS. Sqoop automates this process, relying on the database to describe the schema of the data to be imported. Sqoop uses MapReduce to perform imports and exports, which provides parallel operation as well as fault tolerance.
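The import/transform/export cycle described above might look like the following from the command line. The host, database, table names, and directories are hypothetical placeholders:

```shell
# Import a table from MySQL into HDFS, split across 4 parallel map tasks.
# (Connection string, credentials, and table names are illustrative.)
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username analyst -P \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4

# ... transform the data in Hadoop (e.g. with Hive or Pig) ...

# Export the transformed results back into the RDBMS.
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username analyst -P \
  --table order_summary \
  --export-dir /output/order_summary
```

The `--num-mappers` option controls how many parallel map tasks Sqoop launches, which is how it achieves the parallelism and fault tolerance noted above.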

