Tuesday, November 26, 2013

Hive Bucketed Tables and Sampling

Bucketing is a simple idea. You create a fixed number of buckets, read each record, and place it into one of the buckets based on some logic, usually a hashing algorithm. This lets you organize your data by decomposing it into multiple parts. You might think that partitioning achieves the same thing, so why bother with bucketing? There is one difference. When we partition, we create a partition for each unique value of the column, which can leave you with thousands of tiny partitions. With bucketing, you choose the number of buckets up front and decompose your data into exactly that many parts. In Hive, a partition is a directory but a bucket is a file. We will see it in action.
In Hive, bucketing is not enforced by default. You have to set the following variable to enable it: set hive.enforce.bucketing=true; Once that variable is set, let's create a bucketed table. In my previous post, we had an external table named employees. We will create a bucketed Hive table named emphive and load data into it from the employees table. Once the data is loaded into the bucketed table, we will go back to our warehouse directory to check that multiple data files were created, and that their count equals the number of buckets in our table.
Here is the command to create a bucketed table. I have bucketed the data on the country column into 3 buckets.
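The create statement from the screenshot would look roughly like this. The emphive table name, country column, and bucket count come from the post; the remaining column names and types are my assumptions:

```sql
-- Enable bucketing enforcement before creating and loading bucketed tables.
SET hive.enforce.bucketing=true;

-- 3 buckets, rows assigned by hashing the country column.
CREATE TABLE emphive (
  name    STRING,
  salary  FLOAT,
  country STRING
)
CLUSTERED BY (country) INTO 3 BUCKETS;
```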
I executed this command at the hive prompt, and now it's time to load data. The screen below shows the command to load it.
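The load step in the screenshot is most likely an insert-select from the external table; a sketch, assuming the employees table has matching columns:

```sql
-- Populate the bucketed table from the external employees table.
-- Hive hashes each row's country value to pick one of the 3 buckets.
INSERT OVERWRITE TABLE emphive
SELECT name, salary, country
FROM employees;
```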
I executed this command at the hive prompt; now it's time to go and check how many data files were created in our warehouse. If you have read all my Hive articles, you already know where these files will be created. Check the screen below and you will see three files named 000000_0, 000001_0 and 000002_0. These are our data files, one per bucket.


You might have a question in your mind: how can we select data from only one bucket? I will answer that question. We can use sampling queries. Keep reading.
There are two benefits of bucketing. First, it makes certain types of queries more efficient, particularly joins between two tables that are bucketed on the same column. Second, sampling queries are more efficient when they run on bucketed columns.
Do you understand sampling? No worries, let me explain. You might have a situation where accuracy is not very important and you just need to run your queries on a smaller set of data instead of the entire table. One use case is testing: you have a huge table which takes a lot of time to read and answer your queries, and you want to test on just a small sample of your data instead of the entire set. Another use case is when you need to spot patterns or compute an aggregation like an average and accuracy is not your top priority; a general sense with a predictable degree of error is fine, so you can work on a smaller sample instead of the full data set. Sampling can also be used to obtain a random subset of the actual rows in a table for auditing purposes. Hive supports two types of sampling, which we will explore through examples.
Let's count how many records we have in each of our buckets. We will use a Hive sampling query for this.
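The per-bucket counts behind the screenshot would come from queries along these lines, using Hive's TABLESAMPLE bucket syntax (bucket numbers are 1-based):

```sql
-- Count the rows that landed in each of the three buckets.
SELECT count(*) FROM emphive TABLESAMPLE (BUCKET 1 OUT OF 3 ON country);
SELECT count(*) FROM emphive TABLESAMPLE (BUCKET 2 OUT OF 3 ON country);
SELECT count(*) FROM emphive TABLESAMPLE (BUCKET 3 OUT OF 3 ON country);
```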
You noticed that the 3rd bucket has just one record. Let's check which country this record belongs to, and whether my first and second buckets also contain data for this country.
What does this mean? A specific country does not go into two buckets, which is exactly what we expect. Let's check which countries buckets 1 and 2 contain.
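A likely shape for these checks (using distinct is my assumption; any query restricted to a single bucket would do):

```sql
-- Which countries ended up in each bucket?
SELECT DISTINCT country FROM emphive TABLESAMPLE (BUCKET 1 OUT OF 3 ON country);
SELECT DISTINCT country FROM emphive TABLESAMPLE (BUCKET 2 OUT OF 3 ON country);
```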



So, bucket 1 contains data for CA and US whereas bucket 2 contains data for UK. I hope you understood what Hive did when we inserted data into this bucketed table. Let me summarize. Hive created three buckets because I instructed it to in the create table statement. When I loaded data into the table, Hive applied a hashing technique to each country value to generate a bucket number in the range 1 to 3. Based on the outcome of the hash, Hive placed each row into the appropriate bucket. This hashing technique is deterministic, so when Hive has to read the data back or write more rows, it goes to the correct bucket.
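The idea above can be sketched in a few lines of Python. This is an illustration of deterministic hash bucketing, not Hive's actual Java hash function; the character-sum hash below is a stand-in I chose for readability:

```python
def bucket_for(value: str, num_buckets: int) -> int:
    """Assign a value to one of num_buckets buckets, deterministically.

    Hive conceptually does hash(column) mod num_buckets; the hash used
    here (sum of character codes) is a simplified stand-in.
    """
    return sum(ord(c) for c in value) % num_buckets

# The same country always hashes to the same bucket, so every row for a
# given country lands in one file, and later reads/writes find it there.
rows = ["US", "CA", "UK", "US", "CA"]
assignments = {country: bucket_for(country, 3) for country in rows}
```

Because the function is pure, re-running it on the same value always returns the same bucket, which is exactly the property Hive relies on when it reads a bucket back or appends more data.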
One thing about the tablesample clause is worth mentioning: it is part of the from clause, so you can still add a where clause or other clauses like order by as usual. Check the example below if you have any doubt.
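For instance, a filtered and ordered query over a single bucket might look like this (the salary column and the filter value are just illustrations):

```sql
-- TABLESAMPLE lives in the FROM clause, so WHERE and ORDER BY
-- apply to the sampled rows as usual.
SELECT name, salary
FROM emphive TABLESAMPLE (BUCKET 1 OUT OF 3 ON country)
WHERE salary > 50000
ORDER BY salary DESC;
```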

Hive supports another type of sampling called block sampling. In block sampling, we specify the percentage of data to be sampled. Let's see an example.
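The percentage query from the screenshot would be along these lines:

```sql
-- Ask Hive to sample roughly 10% of the table's input data.
SELECT count(*) FROM emphive TABLESAMPLE (10 PERCENT);
```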
Oops! It doesn't seem to be working correctly. I have 106 records in total and I asked to sample 10%, so I expected the count to be close to 10. How come it is 70? The answer is straightforward: because it is block sampling. When I asked Hive to sample 10%, I actually asked it to read approximately 10% of the input blocks, but I have only two blocks of data in this table and the minimum Hive can read is one block. Yes, the granularity of block sampling is at the block level. For example, if your HDFS block size is 256MB, even if n% of the input size is only 100MB, you get 256MB of data.
Keep Reading and Keep Learning.