Stack Exchange has released data dumps of all its publicly available content, covering every Stack Exchange community, including Stack Overflow, Super User, Ask Ubuntu, and Server Fault. The data includes posts, users, comments, badges, post feedback, post history, post tags, post types, review tasks, tags, and votes. The goal of this project is to analyze the Stack Exchange data and generate insights: process the posts data and develop a program that computes the KPIs and solves the problems listed below. A further purpose of the project is to parse the unstructured data into a structured format.
- <row Id="41" PostTypeId="1" AcceptedAnswerId="44" CreationDate="2014-05-14T11:15:40.907" Score="28" ViewCount="1897" Body="<p>R has many libraries which are aimed at Data Analysis (e.g. JAGS, BUGS, ARULES etc..), and is mentioned in popular textbooks such as: J.Krusche, Doing Bayesian Data Analysis; B.Lantz, "Machine Learning with R".</p> <p>I've seen a guideline of 5TB for a dataset to be considered as Big Data.</p> <p>My question is: Is R suitable for the amount of Data typically seen in Big Data problems? Are there strategies to be employed when using R with this size of dataset?</p> " OwnerUserId="136" LastEditorUserId="118" LastEditDate="2014-05-14T13:06:28.407" LastActivityDate="2015-04-12T05:00:23.663" Title="Is the R language suitable for Big Data" Tags="<bigdata><r>" AnswerCount="8" CommentCount="1" FavoriteCount="13" />
• Id
• PostTypeId (listed in the PostTypes table)
  - 1 = Question
  - 2 = Answer
  - 3 = Orphaned tag wiki
  - 4 = Tag wiki excerpt
  - 5 = Tag wiki
  - 6 = Moderator nomination
  - 7 = “Wiki placeholder” (seems to only be the election description)
  - 8 = Privilege wiki
• AcceptedAnswerId (only present if PostTypeId is 1)
• ParentId (only present if PostTypeId is 2)
• CreationDate
• DeletionDate (only non-null in the SEDE PostsWithDeleted table. Deleted posts are not present in Posts. This column is not present in the data dump.)
• Score
• ViewCount (nullable)
• Body (as rendered HTML, not Markdown)
• OwnerUserId (only present if user has not been deleted; always -1 for tag wiki entries, i.e. the community user owns them)
• OwnerDisplayName (nullable)
• LastEditorUserId (nullable)
• LastEditorDisplayName (nullable)
• LastEditDate="2009-03-05T22:28:34.823" - the date and time of the most recent edit to the post (nullable)
• LastActivityDate="2009-03-11T12:51:01.480" - the date and time of the most recent activity on the post. For a question, this could be the post being edited, a new answer being posted, a bounty being started, etc.
• Title (nullable)
• Tags (nullable)
• AnswerCount (nullable)
• CommentCount
• FavoriteCount
• ClosedDate (present only if the post is closed)
• CommunityOwnedDate (present only if post is community wikied)
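The field list above suggests how a single `<row .../>` line could be turned into a structured record. The following is a minimal sketch, not the project's actual parser: the helper names (`parse_row`, `parse_tags`), the choice of which columns to convert to integers or dates, and the abbreviated `SAMPLE_ROW` are all assumptions made for illustration. It also assumes that attribute values in the dump file are XML-escaped (`&lt;p&gt;` rather than `<p>`), so that each line is well-formed XML.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime

# Assumed typing of the columns named in the field list above.
INT_FIELDS = {
    "Id", "PostTypeId", "AcceptedAnswerId", "ParentId", "Score", "ViewCount",
    "OwnerUserId", "LastEditorUserId", "AnswerCount", "CommentCount",
    "FavoriteCount",
}
DATE_FIELDS = {
    "CreationDate", "LastEditDate", "LastActivityDate",
    "ClosedDate", "CommunityOwnedDate",
}
DATE_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"  # e.g. 2014-05-14T11:15:40.907


def parse_tags(raw):
    """Split a Tags value such as '<bigdata><r>' into ['bigdata', 'r']."""
    return re.findall(r"<([^<>]+)>", raw)


def parse_row(line):
    """Turn one <row .../> line into a dict with typed values.

    Attributes that are absent from the row (nullable columns) are simply
    missing from the returned dict.
    """
    attrs = ET.fromstring(line.strip()).attrib
    record = {}
    for name, value in attrs.items():
        if name in INT_FIELDS:
            record[name] = int(value)
        elif name in DATE_FIELDS:
            record[name] = datetime.strptime(value, DATE_FORMAT)
        elif name == "Tags":
            record[name] = parse_tags(value)
        else:
            record[name] = value  # Body, Title, display names stay as text
    return record


# Abbreviated, XML-escaped version of the sample row (Body shortened here).
SAMPLE_ROW = (
    '<row Id="41" PostTypeId="1" AcceptedAnswerId="44" '
    'CreationDate="2014-05-14T11:15:40.907" Score="28" ViewCount="1897" '
    'Body="&lt;p&gt;Is R suitable for Big Data?&lt;/p&gt;" OwnerUserId="136" '
    'Title="Is the R language suitable for Big Data" '
    'Tags="&lt;bigdata&gt;&lt;r&gt;" AnswerCount="8" CommentCount="1" '
    'FavoriteCount="13" />'
)

if __name__ == "__main__":
    print(parse_row(SAMPLE_ROW))
```

Parsing line by line keeps memory use flat for large dump files; the first and last lines of a real Posts file (the XML declaration and the enclosing element) would need to be skipped before calling `parse_row`.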
- Count the total number of questions in the available data set and collect the IDs of all the questions
- Monthly question count – provide the distribution of the number of questions asked per month
- Provide the number of posts that are questions and contain specified words in their title (like data, science, nosql, hadoop, spark)
- The trending questions, which are viewed and scored highly by users – top 10 most viewed questions with specific tags
- The questions that don’t have any answers – number of questions with 0 answers
- Number of questions with more than 2 answers
- Number of questions that have been active in the last 6 months
- Questions that are marked closed for each category – provide the distribution of the number of closed questions per month
- The highest-scored questions with specific tags – top 10 questions having the tags hadoop, spark
- List of all the tags along with their counts
- Number of questions with specific tags (nosql, big data) that were asked in the specified time range (from 01-01-2015 to 31-12-2015)
- Average time for a post to get a correct answer
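A few of the KPIs above can be sketched in plain Python over already-parsed posts. This is an illustrative sketch under stated assumptions rather than the project's implementation: each post is assumed to be a dict keyed by the column names from the field list, with `CreationDate` as a `datetime` and `Tags` as a list of strings; all function names and the three-post `SAMPLE_POSTS` data are hypothetical. "Correct answer" is interpreted here as the accepted answer, and title matching is a simple case-insensitive substring test.

```python
from collections import Counter
from datetime import datetime, timedelta

QUESTION = 1  # PostTypeId of a question


def questions(posts):
    """Keep only the posts that are questions."""
    return [p for p in posts if p["PostTypeId"] == QUESTION]


def question_ids(posts):
    """IDs of all questions; len() of the result is the total count."""
    return [p["Id"] for p in questions(posts)]


def monthly_question_counts(posts):
    """Number of questions asked per month, keyed by 'YYYY-MM'."""
    return Counter(p["CreationDate"].strftime("%Y-%m") for p in questions(posts))


def count_title_matches(posts, words):
    """Questions whose title contains at least one of the given words."""
    words = [w.lower() for w in words]
    return sum(
        1 for p in questions(posts)
        if any(w in p.get("Title", "").lower() for w in words)
    )


def unanswered_count(posts):
    """Questions with 0 answers."""
    return sum(1 for p in questions(posts) if p.get("AnswerCount", 0) == 0)


def top_viewed(posts, tags, n=10):
    """Top n most viewed questions carrying at least one of the tags."""
    wanted = set(tags)
    tagged = [p for p in questions(posts) if wanted & set(p.get("Tags", []))]
    return sorted(tagged, key=lambda p: p.get("ViewCount", 0), reverse=True)[:n]


def tag_counts(posts):
    """Every tag with the number of questions that use it."""
    return Counter(t for p in questions(posts) for t in p.get("Tags", []))


def average_time_to_accepted_answer(posts):
    """Mean delay between a question and its accepted answer, or None."""
    by_id = {p["Id"]: p for p in posts}
    deltas = []
    for q in questions(posts):
        answer = by_id.get(q.get("AcceptedAnswerId"))
        if answer is not None:
            deltas.append(answer["CreationDate"] - q["CreationDate"])
    if not deltas:
        return None
    return sum(deltas, timedelta()) / len(deltas)


# Hypothetical sample data, for illustration only.
SAMPLE_POSTS = [
    {"Id": 1, "PostTypeId": 1, "CreationDate": datetime(2015, 1, 10, 9, 0),
     "Title": "Spark vs Hadoop for data science", "Tags": ["spark", "hadoop"],
     "ViewCount": 500, "AnswerCount": 1, "AcceptedAnswerId": 2},
    {"Id": 2, "PostTypeId": 2, "ParentId": 1,
     "CreationDate": datetime(2015, 1, 10, 11, 0)},
    {"Id": 3, "PostTypeId": 1, "CreationDate": datetime(2015, 2, 1, 8, 0),
     "Title": "Choosing a NoSQL store", "Tags": ["nosql"],
     "ViewCount": 120, "AnswerCount": 0},
]

if __name__ == "__main__":
    print(len(question_ids(SAMPLE_POSTS)), "questions")
    print(monthly_question_counts(SAMPLE_POSTS))
    print(average_time_to_accepted_answer(SAMPLE_POSTS))
```

The remaining KPIs (closed questions per month, questions in a date range, recently active questions) follow the same filter-then-aggregate pattern using `ClosedDate`, `CreationDate`, and `LastActivityDate`. For a full-size dump, the same logic would more likely be expressed in a distributed framework than over in-memory lists.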
Ojas Gupta
Made with ❤️ by DS Community SRM