How To Remove Row Duplicates From HIVE Table?

Removing Exact Row Level Duplicates From HIVE Tables Made Easy!

Remove Row Level Duplicates From Hive Table
Remove Row Level Duplicates From Hive Table

Key Words: Duplicates, HIVE, Table, Dedup, Deduplication, Row, Row Level Duplicates, Apache, Group By

Hey there folks! Working as a Big Data engineer, I often encountered a situation where I had to remove EXACT Row Level Duplicates from a HIVE Table. There are no clear techniques to be found anywhere, be it StackOverflow or Medium. After scouring the internet and various other community platforms, I am glad to share a working approach ( Tested in Live Production Environments ) to de-duplicate HIVE Tables or in short, remove the exact row-level duplicates!

Let’s have a ‘device_table‘ with following schema:

CREATE EXTERNAL TABLE IF NOT EXISTS device_table (
     device_id STRING,
     device_name STRING,
     os STRING,
     country STRING,
     unix_time STRING,
     app_id STRING
)
PARTITIONED BY (
     dt STRING
)
STORED AS PARQUET

Scenario 1: All the columns are the same (Exact Row Level Duplicates) or (Pure Duplicate)

Run the below HIVE query while ingesting data into a new ‘dedup_device_table‘ ( That has the same schema of parent table i.e. device_table)

SET hive.groupby.orderby.position.alias=true;

INSERT OVERWRITE TABLE dedup_device_table PARTITION (dt='2020-YY-XX')
SELECT 
    device_id,
    device_name,
    os,
    country,
    unix_time,
    app_id
FROM 
    device_table
GROUP BY 1,2,3,4,5,6

Scenario 2: Selecting based on the Latest Timestamp

Run the below HIVE query while ingesting data into a new ‘dedup_device_table‘ ( That has the same schema of parent table i.e. device_table)

INSERT OVERWRITE TABLE dedup_device_table PARTITION (dt='2020-YY-XX')
SELECT 
    device_id,
    device_name,
    os,
    country,
    unix_time,
    app_id 
FROM (
    SELECT *, 
           ROW_NUMBER() 
           OVER (PARTITION BY device_id, 
                  ORDER BY unix_time DESC) as rank
    FROM impressions
     ) ranked_device_data
WHERE ranked_device_data.rank=1;

Another alternative approach to get this done is using STRUCT and MAX functions together. Essentially, we can build a STRUCT, and use the MAX function on the first value. Follow the query below:

INSERT OVERWRITE TABLE dedup_device_table PARTITION (dt='2020-0X-XX')
SELECT 
    device_id,
    MAX(STRUCT(unix_time, device_id)).col2 as device_id,
    MAX(STRUCT(unix_time, device_name)).col2 as device_name,
    MAX(STRUCT(unix_time, os)).col2 as os,
    MAX(STRUCT(unix_time, country)).col2 as country,
    MAX(STRUCT(unix_time, unix_time)).col2 as unix_time,
    MAX(STRUCT(unix_time, app_id)).col2 as app_id
FROM 
    device_table
GROUP BY device_id

Let me know what you guys think in the comments below! Also, Follow us on Instagram and Facebook to get notified when we drop new content!

Also, check this project to practice your HIVE Skills!

Peace.

LEAVE A REPLY

Please enter your comment!
Please enter your name here

This site uses Akismet to reduce spam. Learn how your comment data is processed.