Analysing Book Dataset Hadoop Project

Hive
Image Source: Internet

This Hadoop project involves analyzing the book datasets to solve a few problem statements.

Problem Statement

  1. Find out the frequency of books published each year. (Hint: Use Boooks.csv file for this)
  2. Find out in which year the maximum number of books were published
  3. Find out how many books were published based on ranking in the year 2002. ( Hint: Use Book.csv and Book-Ratings.csv)

Data Set Description

The Book-Crossing dataset consists of 3 tables.

  1. BX-Users: This file contains the list of the users, their age and where they are collected. If that data is unavailable for any field then it is filled with NULL.
  2. BX -Books: It gives us details about the book such as Book-Title, Book-Author, Year-Of-Publication, Publisher, Image-URL and ISBN. Here ISBN will act as a unique code for a book. Invalid ISBNs have already been removed from the dataset. URLs linking to cover images are also given, appearing in three different flavors (Image-URL-SImage-URL-MImage-URL-L) i.e. small, medium, large. These URLs point to the Amazon web site.
  3. BX-Book-Ratings: It contains the book rating information. Ratings are either explicitly expressed on a scale from 1-10 (higher values denoting higher appreciation) or implicitly expressed by 0.

The Raw Datasets are available in my Github Repository. Please click here.

Solutions

The data in each of the columns are enclosed within double quotes in the raw datasets. Hence, it’s necessary to perform data cleansing to remove these double-quotes. There are multiple ways to perform data cleansing, using Map Reduce pattern algorithms, PIG, etc.
I’m going to take the HIVE approach, as it’s the most widely used technique. So, in this approach, we’ll be using HIVE to clean the data, as well as perform the necessary analysis to get the solutions.

1) Cleaning and Loading BX-Book-Ratings.csv

CREATING TABLE TO LOAD DATA WITH DOUBLE-QUOTES:

create table if not exists bookratings
(userid string, 
 isbn string, 
 bookrating string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\;",
"quoteChar" = '\"'
)

Note: SERDE is used to remove the double quotes from the raw dataset. When SERDE is used in HIVE, by default, the datatypes of the columns are converted to String. Use “describe bookratings” query to check the data type of the columns.

LOADING DATA INTO TABLE:

load data local inpath "BX-Book-Ratings.csv" into table bookratings

Note: Since the datatype is by default String due to SERDE, we need to cast String to BigInt while querying for analysis.

SAMPLE QUERY FOR TYPE CAST:

select distinct(rating) from bookratings
order by cast(rating as bigint)

Note: If you don’t typecast, the query will run. But, the data will be ordered lexically. (Run the above query without casting, and you’ll see the difference.)

2) Cleaning and Loading BX-Books.csv

CREATING TABLE TO LOAD DATA WITH DOUBLE-QUOTES:

create table if not exists bookstable
(isbn string, 
 title string, 
 author string, 
 year string, 
 publisher string,
 urls string,
 urlm string, 
 urll string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\;",
"quoteChar" = '\"'
)
tblproperties ("skip.header.line.count"="1")

Note: SERDE is used to remove the double quotes from the raw dataset. When SERDE is used in HIVE, by default, the datatypes of the columns are converted to String. Use “describe bookstable” query to check the datatype of the columns.

LOADING DATA INTO TABLE:

load data local inpath "BX-Books.csv" into table bookstable

Note: Since the data type is by default String due to SERDE, we need to castString to BigInt while querying for analysis.

3) BX-Users.csv

We do not need this dataset to solve the given problem statements. It’s an extra dataset. 🙂

4) Solution for Problem Statement-1

select year,count(*) from bookstable
group by year
order by cast(year as bigint)

5) Solution for Problem Statement-2

NOTE: WITHOUT USING MAX()

select year, count(*) as A
from bookstable
group by year
order by A desc
limit 1

NOTE: WITH MAX() (ONLY MAXCOUNT IS DISPLAYED)

select max(A) as highestcount
from
(select year, count(*) as A from bookstable group by year) as bookcount

6) Solution for Problem Statement-3

select count(*) from bookstable
join bookratings on bookstable.isbn=bookratings.isbn
where rating='10' and year='2002'
group by year

Note: You can give your own year and rating. It is user specific.

All the solution queries and their outputs (screenshots) are in my Github repository. Please click here.

Please comment below if you have any doubts/queries.

Peace! 🙂

LEAVE A REPLY

Please enter your comment!
Please enter your name here

This site uses Akismet to reduce spam. Learn how your comment data is processed.