Hello! When ingesting data into Hadoop through Hive tables, we often run into raw data whose field values are wrapped in double quotes, and those quotes then show up in the table.
Of course, if the datasets are small, Excel will be our go-to tool for cleaning them up. But what if you are dealing with huge datasets (more than 10 GB, say)?
Well, not to worry! To remove double quotes from the data in a Hive table, Hive offers a built-in SerDe that is specified in the table definition.
The feature I’m referring to is the OpenCSV SerDe:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
The CREATE TABLE statement that strips the double quotes from the Hive table is given below:
CREATE TABLE IF NOT EXISTS user_info_tab (
  userid bigint,
  name string,
  occupation string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = "\;",
  "quoteChar" = '\"'
);
"separatorChar" = "\;"
This specifies the delimiter of the raw data. In this case, the delimiter is a semicolon, so we “escape” it with a backslash. Otherwise, Hive will throw an error, because an unescaped semicolon can be read by the Hive command line as the end of the statement.
If the delimiter is a comma or a pipe, you don’t need to escape it.
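For comparison, here is a minimal sketch of the same kind of table for comma-delimited data, where the separator is written without a backslash. The table name user_info_csv is made up for illustration. For a pipe-delimited file, "separatorChar" = "|" should work the same way.

```sql
-- Hypothetical table name, for illustration only.
-- The comma separator is given as-is, with no backslash in front of it.
CREATE TABLE IF NOT EXISTS user_info_csv (
  userid string,
  name string,
  occupation string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar" = "\""
);
```

A comma separator and a double-quote quoteChar are also the SerDe's defaults, so for a plain quoted CSV file the SERDEPROPERTIES clause can usually be left out altogether.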
"quoteChar" = '\"'
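This tells the SerDe which character encloses the field values, so that it is stripped when the data is read. That is what removes the double quotes from the data in the Hive table. Again, we “escape” the double quote when specifying it. The same goes for any other quote character that you want stripped.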
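One thing to keep in mind: according to the Hive documentation, OpenCSVSerde treats every column as a string, even when the table declares another type (such as userid bigint above). If you need typed columns, one option is a view that casts them. The sketch below assumes the user_info_tab table from this post, and the view name user_info_typed is made up.

```sql
-- Hypothetical view name. OpenCSVSerde exposes every column as STRING,
-- so the numeric column is cast explicitly here.
CREATE VIEW IF NOT EXISTS user_info_typed AS
SELECT
  CAST(userid AS BIGINT) AS userid,
  name,
  occupation
FROM user_info_tab;
```

Querying user_info_typed instead of the base table then gives userid as a number. Values that cannot be cast come back as NULL.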
So, with this simple yet effective method, you can easily remove the double quotes from the data in a Hive table: queries on the table return the values without the surrounding quotes.
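If you want to try it yourself, here is a rough sketch of loading a file into the table and checking the result. The HDFS path is made up, and the file is assumed to contain lines such as "1";"Asha";"Engineer" (a made-up sample row, not real data).

```sql
-- Hypothetical HDFS path. Replace it with the location of your own raw file.
LOAD DATA INPATH '/data/raw/user_info.csv' INTO TABLE user_info_tab;

-- Confirm that the SerDe and its properties were picked up.
DESCRIBE FORMATTED user_info_tab;

-- The values should come back without the surrounding double quotes.
SELECT userid, name, occupation FROM user_info_tab LIMIT 5;
```

Note that LOAD DATA INPATH moves the file into the table's directory rather than copying it, so keep a copy of the raw file if you need it elsewhere.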
Please let me know in the comments below if this was useful. 🙂 Also, let me know if you have any queries or run into issues with the above method. 😀