create database SocialNetwork;
create table USER (
  email VARCHAR(255) NOT NULL PRIMARY KEY,
  fullname VARCHAR(255) NOT NULL,
  age INT NOT NULL,
  followers_count INT NOT NULL
);
create table POST (
  post_id INT NOT NULL PRIMARY KEY,
  category VARCHAR(255) NOT NULL,
  date DATE NOT NULL,
  user_email VARCHAR(255) NOT NULL,
  FOREIGN KEY (user_email) REFERENCES USER(email)
);
create table LIKES (
  user_email VARCHAR(255) NOT NULL,
  post_id INT NOT NULL,
  PRIMARY KEY (user_email, post_id),
  FOREIGN KEY (user_email) REFERENCES USER(email),
  FOREIGN KEY (post_id) REFERENCES POST(post_id)
);
load data local infile 'users.csv' into table USER fields terminated by ',';
load data local infile 'posts.csv' into table POST fields terminated by ',';
load data local infile 'likes.csv' into table LIKES fields terminated by ',';
sqoop-import-all-tables --connect jdbc:mysql://master/SocialNetwork --username gebre -P --warehouse-dir /home/gebre/SocialNetwork -m 1
To solve this problem we used two mapper classes and one reducer class:
1. PostsMapper, which parses the posts.csv file and emits key-value pairs with the post id as the key and the category, date and user email as the value. We also prepend a tag named POST to the value string, which is later used on the reducer side to distinguish post records from likes records.
2. LikesMapper, which parses the likes.csv file and emits key-value pairs with the post id as the key and the email of the user who liked that post as the value. Here we also prepend a tag named LIKES to the value string, again used on the reducer side to distinguish likes records from post records. The image below shows the output of the two mapper classes (i.e. the data that is ready to be sent to the reducer).
- The mappers thus send to the reducer, for each post id key, the list of users who liked that post, the date the post was made, the post category and the user who made the post; a sketch of both mappers is shown below.
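As a rough illustration, the two mappers could look like the sketch below. The CSV column order is assumed to match the table definitions above, and the tab-separated POST/LIKES tag format is our own choice for the sketch, not necessarily the exact code.

// Sketch only: in practice each mapper lives in its own source file.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PostsMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // posts.csv: post_id,category,date,user_email (assumed column order)
        String[] f = value.toString().split(",");
        // key = post id, value = POST tag + category, date, user email
        context.write(new Text(f[0]), new Text("POST\t" + f[1] + "," + f[2] + "," + f[3]));
    }
}

class LikesMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // likes.csv: user_email,post_id (assumed column order)
        String[] f = value.toString().split(",");
        // key = post id, value = LIKES tag + email of the user who liked the post
        context.write(new Text(f[1]), new Text("LIKES\t" + f[0]));
    }
}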
- The PostLikesReducer then receives the data from the mappers, parses it to collect all the users who liked a particular post id into a JSON array, and writes one row per post to an HBase table named posts. It uses postid_date as the row key; the info column family holds the category and user email qualifiers, and the likes column family holds a users qualifier (the list of users who liked the post, in JSON array format) and a numberoflikes qualifier, which we added ourselves to store the number of users who liked the post because it is needed in the second task. A sketch of the reducer is shown below.
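The reducer could be sketched as follows. The use of the org.json library for the JSON array and the exact qualifier names (category, user_email, users, numberoflikes under the info and likes column families) are assumptions based on the description above.

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.json.JSONArray;

public class PostLikesReducer extends TableReducer<Text, Text, NullWritable> {
    @Override
    protected void reduce(Text postId, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String category = "", date = "", userEmail = "";
        JSONArray likers = new JSONArray();

        for (Text v : values) {
            String[] parts = v.toString().split("\t", 2);
            if (parts[0].equals("POST")) {
                String[] post = parts[1].split(",");   // category,date,user_email
                category = post[0];
                date = post[1];
                userEmail = post[2];
            } else {                                   // LIKES record
                likers.put(parts[1]);
            }
        }

        // Row key = postid_date, as described above.
        Put put = new Put(Bytes.toBytes(postId.toString() + "_" + date));
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("category"), Bytes.toBytes(category));
        put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("user_email"), Bytes.toBytes(userEmail));
        put.addColumn(Bytes.toBytes("likes"), Bytes.toBytes("users"), Bytes.toBytes(likers.toString()));
        put.addColumn(Bytes.toBytes("likes"), Bytes.toBytes("numberoflikes"),
                Bytes.toBytes(String.valueOf(likers.length())));
        // Written to the HBase table "posts" configured in the job driver.
        context.write(NullWritable.get(), put);
    }
}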
To solve this problem we used one mapper, one combiner and one reducer class, where:
- The mapper class named PostCountAndAvgLikesMapper reads data from the HBase table and extracts the post category, the email of the user who made the post and the number of likes. It then constructs a composite key from the user email and the category, emits the composite key as the key and the number of likes as the value, and sends this data to the combiner for map-side aggregation. The result of this mapper is shown in the image below.
- The combiner class named PostCountAndAvgLikesCombiner takes the key-value pairs produced by the mapper, groups all posts that share the same composite key and sums their likes. It produces a new key-value pair with the same composite key, the count of posts of that category for that user and the sum of likes for that category. The result of the combiner class is shown in the image below.
- The reducer class named PostCountAndAvgLikesReducer then receives the aggregated post count and total number of likes per user and category, and computes the average number of likes as Average = total likes for the user's post category / count of posts in that category. It finally emits the composite key as the key and the count of posts, the total number of likes and the average likes as the values. The output of the reducer is shown in the image below; a sketch of the three classes follows.