Big Data: Handling Update/Delete by leveraging Hive dynamic partitioning


I started with the objective of using my Apache Hadoop prototype cluster (ref: Apache Hadoop on Mac OSX platform) as a reliable ETL/ELT engine for generating pre-built aggregates of the semantic layer exposed to the reporting and analytics engine.

If data processing can be taken out of the data mart, the resources thus released can be used to improve data-extraction performance. On a few VLDB RDBMS solutions this also translates directly into cost savings, given the way they implement ETL.

Two immediate problems surfaced:

  1. I tried to stick to an RDBMS-style solution instead of going for a Java/script-heavy implementation that involves writing and controlling my own MR jobs. Easier said than done - there was an immediate need for an ETL tool, for obvious reasons: a unified development platform, focusing on the solution rather than channelling energy into writing my own big-data-friendly ETL scripts, and an inclination towards open source. I zeroed in on Talend Big Data Open Studio after some deliberation; I liked the Talend OS ETL tool for its sheer simplicity.
  2. Traditional ETL demands handling of IUD operations (insert/update/delete, essentially the CRUD handled in other circles) on the data. This is a strict no-no in the Big Data world due to the inherent nature of the implementation. After all, why delete, purge or alter data when one can simply add another record with the latest image? But the objective here was a full-blown ETL solution, not just leveraging what Hadoop/Hive provides out of the box. So it came down to handling UPDATE and DELETE transactions in the most easily available manner the current platform allows.

One very good solution approach is presented by the Hortonworks team in this article. Maybe someday it will get implemented and Big Data can then be leveraged as an ETL platform in the true sense, becoming an integral part of traditional data warehousing.

There is another way of handling DELETE operations at the HBase level, making Hive interact with HBase tables instead. But this method has inherent issues with atomicity and the transient state of data.

But I wasn't really happy with myself for going to these additional lengths to achieve my objective. While studying Hive features I had found dynamic partitioning quite interesting and was always curious to put it to effective use in a solution. It finally appeared that I had found its best use: achieving UPDATE and DELETE functionality with minimal code changes and maintenance issues.

So I took the following approach, and it worked out quite well for my use case:

"How to update or delete a record in aggregated table, in-place"

As expected, there is no straightforward answer. The generic approach is to dynamically partition the aggregate table down to the lowest grain plus one level above it. Thereafter, a simple INSERT OVERWRITE containing data down to that grain will refresh the records in the aggregate table in-place.
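
As a rough sketch of what this looks like in HiveQL (table and column names here are purely illustrative, not from my actual implementation), the session needs dynamic partitioning enabled, and the refresh is a single INSERT OVERWRITE whose trailing SELECT columns supply the partition values:

    -- Enable dynamic partitioning for the session (illustrative settings)
    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;

    -- Generic refresh pattern: the partition column(s) come last in the
    -- SELECT, so Hive itself decides which partitions get overwritten.
    INSERT OVERWRITE TABLE agg_table PARTITION (part_col)
    SELECT grp_col,
           SUM(measure_col) AS measure_col,
           part_col
    FROM   staging_table
    WHERE  part_col = '2014-11-21'
    GROUP  BY grp_col, part_col;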

There are a few downsides that I could identify immediately:

  1. Dynamic partitioning invariably introduces additional processing in the form of MapReduce jobs triggered for each INSERT as well as for data purges. This slows down what would otherwise be the same transaction on an unpartitioned table.
  2. As of Hive 0.13, if I try to replace data at the lowest grain level, the dynamic partitions go for a toss (though I haven't bothered to file a JIRA yet). The workaround is to replace the dataset one level higher than the lowest grain, which again means additional computation and slower operations.
  3. Any query can see a transient state of records while data operations are in progress (but isn't that the case traditionally as well?).

Here is an example to explain the above approach:

Use case: create and populate two summary tables with the top 10 Twitter hashtags (collected at a frequency of 15 minutes), aggregated at the hourly and daily levels.

The summary data needs to be up to date as of now, as is usual with any typical DW aggregate.

For the daily aggregate table, the expectation is one record per day, and the top 10 hashtags keep changing throughout the day based on the popularity index. It no longer gets updated after the day is over.

Similarly, the hourly aggregate table is expected to have 24 records per day, and the top 10 hashtags keep changing within the hour based on their popularity index.

Talend proved extremely helpful in quickly developing a Search program with custom keywords and setting it up to run every 15 minutes to fetch the raw data.

The next step involved designing the tables and setting up the aggregation logic to handle the UPDATEs.

An UPDATE can also be translated into a DELETE + INSERT, and this scheme was implemented using the INSERT OVERWRITE feature provided by Hive.

The hourly aggregate table was dynamically partitioned at date and hour level.

The daily aggregate table was dynamically partitioned at year, week, day level.
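
For illustration, the two aggregate tables could be declared roughly along these lines; the table and column names are placeholders of mine, not the original DDL:

    -- Hourly aggregate: dynamically partitioned by date and hour
    CREATE TABLE top_hashtags_hourly (
      hashtag         STRING,
      tweet_count     BIGINT,
      popularity_rank INT
    )
    PARTITIONED BY (tweet_date STRING, tweet_hour INT);

    -- Daily aggregate: dynamically partitioned by year, week and day
    CREATE TABLE top_hashtags_daily (
      hashtag         STRING,
      tweet_count     BIGINT,
      popularity_rank INT
    )
    PARTITIONED BY (tweet_year INT, tweet_week INT, tweet_day STRING);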

All it needed was to select the data suitably from the raw Twitter feed (e.g. aggregate the top 10 hashtags at the hourly level for the current date and current hour) and execute an INSERT OVERWRITE into the hourly aggregate table.
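
Here is a sketch of that hourly refresh, using the placeholder tables above, a simple tweet count as the popularity index, and hiveconf variables (run_date, run_hour) assumed to be supplied by the calling job:

    -- Recompute the current hour's top 10 hashtags; Hive derives the
    -- (tweet_date, tweet_hour) partition from the last two SELECT columns,
    -- so only that hour's partition is overwritten.
    INSERT OVERWRITE TABLE top_hashtags_hourly PARTITION (tweet_date, tweet_hour)
    SELECT hashtag,
           tweet_count,
           popularity_rank,
           tweet_date,
           tweet_hour
    FROM (
      SELECT hashtag,
             COUNT(*)                              AS tweet_count,
             RANK() OVER (ORDER BY COUNT(*) DESC)  AS popularity_rank,
             to_date(created_at)                   AS tweet_date,
             hour(created_at)                      AS tweet_hour
      FROM   raw_tweets
      WHERE  to_date(created_at) = '${hiveconf:run_date}'
      AND    hour(created_at)    = ${hiveconf:run_hour}
      GROUP  BY hashtag, to_date(created_at), hour(created_at)
    ) ranked
    WHERE popularity_rank <= 10;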

Everything else is handled by the Hive/Hadoop engine seamlessly!! There is no need to mention the partition name or sub-partition name, or even calculate it upfront.

Remember, all I needed was to select the current hour's top 10 hashtags from the raw feed and insert them into the hourly aggregate table, without bothering with any partition reference or caring about the time window. Similarly, for the daily aggregate I effectively replaced the data for the current day of the week by selecting the top 10 hashtags for the current day.
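
The daily refresh follows the same pattern one level up; again, all names and the run_date variable are assumptions of mine:

    -- Recompute the current day's top 10 hashtags; only the matching
    -- (tweet_year, tweet_week, tweet_day) partition is overwritten.
    INSERT OVERWRITE TABLE top_hashtags_daily PARTITION (tweet_year, tweet_week, tweet_day)
    SELECT hashtag,
           tweet_count,
           popularity_rank,
           year(tweet_day)       AS tweet_year,
           weekofyear(tweet_day) AS tweet_week,
           tweet_day
    FROM (
      SELECT hashtag,
             COUNT(*)                              AS tweet_count,
             RANK() OVER (ORDER BY COUNT(*) DESC)  AS popularity_rank,
             to_date(created_at)                   AS tweet_day
      FROM   raw_tweets
      WHERE  to_date(created_at) = '${hiveconf:run_date}'
      GROUP  BY hashtag, to_date(created_at)
    ) ranked
    WHERE popularity_rank <= 10;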

In a similar manner, exchanging an empty partition handles DELETE seamlessly.
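
I haven't shown the partition exchange itself here, but in most cases either of these simpler statements achieves the same effect of removing a partition's data (placeholder names and values again):

    -- Option 1: drop the obsolete partition entirely
    ALTER TABLE top_hashtags_hourly
      DROP IF EXISTS PARTITION (tweet_date = '2014-11-21', tweet_hour = 9);

    -- Option 2: keep the partition definition but remove its data
    TRUNCATE TABLE top_hashtags_hourly
      PARTITION (tweet_date = '2014-11-21', tweet_hour = 9);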

Conclusion

  • DELETE and UPDATE can be emulated in Hive quite effortlessly.
  • Big Data is not inherently meant to handle DELETE or UPDATE, so take this approach only if it is what you really need. Otherwise, inserting a new record with the refreshed image is sufficient to handle CDC.
  • Hive 0.14 and upcoming versions are adding more functionality, and handling UPDATE/DELETE operations will surely become even easier (a rough sketch of the new syntax follows below).
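
For reference, this is roughly what the Hive 0.14 ACID syntax looks like; it requires ORC storage, bucketing and the transactional table property (plus the transaction manager configured on the cluster), and the table here is just a placeholder:

    -- Hive 0.14+: UPDATE/DELETE work only on bucketed, ORC-backed,
    -- transactional tables (with hive.txn.manager set up accordingly).
    CREATE TABLE top_hashtags_acid (
      hashtag     STRING,
      tweet_count BIGINT
    )
    CLUSTERED BY (hashtag) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');

    UPDATE top_hashtags_acid SET tweet_count = 0 WHERE hashtag = '#bigdata';
    DELETE FROM top_hashtags_acid WHERE hashtag = '#spam';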

Please feel free to share your thoughts in comments.

As usual, please don't try this directly in a production environment before testing it out properly :)
