Note: potential security vulnerability in the package 'requests'.
Crawl news content from websites such as Medium (Towards Data Science) and TechNews (科技新報), store it in MongoDB, and extract information with NLTK.
Python packages:
- pymongo
- requests
- BeautifulSoup
- nltk
Some NLTK modules might be needed. Please modify and run the script below if necessary:
import nltk
nltk.download()
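Calling `nltk.download()` with no arguments opens the interactive downloader. As a minimal sketch, specific resources can be fetched by name instead. The resource names in the comment (`punkt`, `stopwords`) are assumptions, since the exact modules the scripts need are not listed here, and `ensure_resources` is a hypothetical helper, not part of this repository.

```python
def ensure_resources(names, download=None):
    """Download each named NLTK resource and return the names that failed."""
    if download is None:
        import nltk  # imported lazily so the helper can be exercised without NLTK
        download = nltk.download
    failed = []
    for name in names:
        # nltk.download(name) returns True on success and False otherwise.
        if not download(name):
            failed.append(name)
    return failed


# Example (needs NLTK installed and network access; resource names are assumed):
# ensure_resources(["punkt", "stopwords"])
```

Passing a different callable as `download` makes it possible to check the logic offline.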
1. git clone https://github.com/LeeKLTW/AI_News_tracker
2. Install MongoDB, then start it from the command line:
mongod
Remember your host, port, account, and password.
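The host, port, account, and password are what a client needs to reach the database. As a hedged sketch, they can be combined into a standard `mongodb://` connection URI. `build_mongo_uri` is a hypothetical helper, and the host and port in the usage note are only the usual local MongoDB settings, not values taken from this repository.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host, port, account=None, password=None):
    """Build a mongodb:// URI, percent-escaping the credentials."""
    if account is None:
        # No authentication: host and port are enough.
        return f"mongodb://{host}:{port}/"
    # Characters such as '@', ':' or '/' in credentials must be escaped.
    user = quote_plus(account)
    secret = quote_plus(password or "")
    return f"mongodb://{user}:{secret}@{host}:{port}/"
```

The resulting string can be passed to `pymongo.MongoClient`, for example `MongoClient(build_mongo_uri("localhost", 27017))`.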
python crawler.py
positional arguments:
url string. The URL of tag website or tag of medium. e.g.
'https://medium.com/tag/machine-learning',
https://technews.tw/category/cutting-edge/ai/
layout_type The layout type of html. e.g. medium, technews
optional arguments:
-h, --help show this help message and exit
--host HOST The host of MongoDB
--port PORT The port of MongoDB
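A command-line interface with this shape can be sketched with `argparse`. This is an illustration of the interface described in the help text, not the repository's actual code; the default host and port below are assumptions (MongoDB's usual local settings).

```python
import argparse


def build_parser():
    """Build a parser mirroring the crawler's documented arguments."""
    parser = argparse.ArgumentParser(prog="crawler.py")
    parser.add_argument("url", help="The URL of tag website or tag of medium.")
    parser.add_argument("layout_type", help="The layout type of html. e.g. medium, technews")
    # Defaults are assumed; the help text does not state them.
    parser.add_argument("--host", default="localhost", help="The host of MongoDB")
    parser.add_argument("--port", type=int, default=27017, help="The port of MongoDB")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["https://medium.com/tag/machine-learning", "medium", "--port", "27017"]
)
print(args.url, args.layout_type, args.host, args.port)
```

On the command line, the equivalent call would be `python crawler.py 'https://medium.com/tag/machine-learning' medium --port 27017`.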
The information stored in MongoDB is shown below:
layout_type | 'title' | 'author' | 'date' | 'content' | 'tags' |
---|---|---|---|---|---|
medium(towardsdatascience) | Yes | Yes | Yes | Yes | Yes |
technews | Yes | Yes | Yes | Yes | Yes |
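To make the table concrete, a stored article might look like the dictionary below. This is only a sketch: the title, author, and date values are borrowed from the examples in the help text of `extracter.py`, while the type of `tags` (a list of strings), the truncated `content`, and the helper `missing_fields` are assumptions for illustration.

```python
REQUIRED_FIELDS = ("title", "author", "date", "content", "tags")

example_doc = {
    "title": "Machine Learning: A Gentle Introduction. – Towards Data Science",
    "author": "Nvs Abhishek",
    "date": "Sep 22",
    "content": "...",  # full article text would go here
    "tags": ["Machine Learning"],  # assumed to be a list of strings
}


def missing_fields(doc):
    """Return the required field names that are absent from doc."""
    return [field for field in REQUIRED_FIELDS if field not in doc]
```

A check like `missing_fields(doc)` before inserting can catch pages where the layout parser failed to find one of the fields.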
python extracter.py
optional arguments:
-h, --help show this help message and exit
--title TITLE Query for title. Default:None e.g. 'Machine Learning: A
Gentle Introduction. – Towards Data Science'
--author AUTHOR Query for author. Default:None e.g. 'Nvs Abhishek'
--date DATE Query for date. Default:None e.g.'Sep 22'
--host HOST The host of MongoDB
--port PORT The port of MongoDB
--show SHOW Show the title and summary or not.
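The `--title`, `--author`, and `--date` options all default to None, which suggests that only the options actually given narrow the search. A minimal sketch of that idea follows. `build_query` is a hypothetical helper, and exact-match filtering is an assumption about how the script queries MongoDB.

```python
def build_query(title=None, author=None, date=None):
    """Return a MongoDB filter dict containing only the options that were given."""
    candidates = {"title": title, "author": author, "date": date}
    # Options left at None are dropped so they do not restrict the search.
    return {key: value for key, value in candidates.items() if value is not None}
```

With pymongo, the result can be passed to `collection.find(...)`; an empty filter matches every document in the collection.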
Demo
Suppose we want to know what this article is trying to tell us: 'Machine Learning: A Gentle Introduction. – Towards Data Science'
Machine Learning: A Gentle Introduction. – Towards Data Science
This AI was developed by Google’s DeepMind.
In 2012 Harvard Business Review called the job of a Data Scientist as “The Sexiest Job of the 21st Century”, and six years hence, it still holds that tag tight and high.
In the past decade or so, Machine Learning and broadly Data Science has taken over the technological front by storm.
It follows the concept of hit and trial method.
Deep learning is a subfield of machine learning which makes use of a certain kind of machine learning algorithm known as Artificial Neural Networks (ANNs), vaguely inspired by the human brain.
Judging by these sentences, we can tell that: in 2012, HBR called Data Scientist “The Sexiest Job of the 21st Century”. Deep learning is a subfield of machine learning that uses artificial neural networks, vaguely inspired by the human brain. Machine learning and data science have taken over the technological front. Google’s DeepMind is a notable team in deep learning.
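This README does not describe how the summary sentences are chosen. One common approach to this kind of extractive summary is to score each sentence by the frequency of its words and keep the top few. The sketch below illustrates that technique using only the standard library; it is not the repository's method, and the stopword list and `summarize` function are assumptions made for the example.

```python
import re
from collections import Counter

# A tiny stopword list for illustration; NLTK's stopwords corpus is far more complete.
STOPWORDS = {"a", "an", "the", "is", "of", "and", "to", "in", "it", "by", "this", "that"}


def summarize(text, n_sentences=2):
    """Return the n highest-scoring sentences, kept in their original order."""
    # Naive sentence split on '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def tokens(sentence):
        return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]

    freq = Counter(w for s in sentences for w in tokens(s))

    def score(sentence):
        words = tokens(sentence)
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])  # restore document order
    return [sentences[i] for i in keep]
```

Averaging rather than summing the word frequencies is a design choice: a plain sum tends to pick the longest sentences regardless of content.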
- Update the summary extractor to be based on Seq2Seq
- Add a search function on title and content based on tf-idf and cosine similarity
- Check whether doc2vec can be used for a recommendation system
- Check whether POS tagging of Chinese articles can be improved with a hidden Markov model.
  Yes.
- Check whether it is feasible to translate Chinese to English, summarize, and translate the result back to Chinese.
  Don't do it.
- Check the NLTK Sinica treebank. -> It's not suitable for technews ...
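For the tf-idf and cosine-similarity search item in the list above, a minimal standard-library sketch could look like the following. It is not implemented in the repository as far as this README states; the function names, the tokenizer, and the smoothed idf formula are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document plus the idf table."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed idf so terms that occur in every document keep a positive weight.
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(toks).items()} for toks in tokenized]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def search(query, docs):
    """Return document indices ranked by cosine similarity to the query."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms never seen in the corpus are ignored.
    q = {t: c * idf[t] for t, c in Counter(tokenize(query)).items() if t in idf}
    scores = [cosine(q, v) for v in vectors]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Here `docs` could be the stored titles or contents; the first index returned by `search` is the best match. Chinese text would need a word segmenter in place of the regular-expression tokenizer.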
LeeKLTW
MIT