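📰 Newspaper4k: Web Article Scraping, Analysis, and Processing
Newspaper4k is a fork of the well-known newspaper3k project, which was developed by codelucas and has not been updated since September 2020. The initial goal of the fork is to keep the project alive, add new features, and fix bugs. The existing coding API is preserved wherever possible.
Python compatibility
Python 3.8+ is the minimum required version.
Quick overview: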
$ pip3 install newspaper4k
import newspaper
article = newspaper.article('https://edition.cnn.com/2023/10/29/sport/nfl-week-8-how-to-watch-spt-intl/index.html')
print(article.authors)
# ['Hannah Brewitt', 'Minute Read', 'Published', 'Am Edt', 'Sun October']
print(article.publish_date)
# 2023-10-29 09:00:15.717000+00:00
print(article.text)
# New England Patriots head coach Bill Belichick, right, embraces Buffalo Bills head coach Sean McDermott ...
print(article.top_image)
# https://media.cnn.com/api/v1/images/stellar/prod/231015223702-06-nfl-season-gallery-1015.jpg?c=16x9&q=w_800,c_fill
print(article.movies)
# []
article.nlp()
print(article.keywords)
# ['broncos', 'game', 'et', 'wide', 'chiefs', 'mahomes', 'patrick', 'denver', 'nfl', 'stadium', 'week', 'quarterback', 'win', 'history', 'images']
print(article.summary)
# Kevin Sabitus/Getty Images Denver Broncos running back Javonte Williams evades Green Bay Packers safety Darnell Savage, bottom.
# Kathryn Riley/Getty Images Kansas City Chiefs quarterback Patrick Mahomes calls a play during the Chiefs' 19-8 Thursday Night Football win over the Denver Broncos on October 12.
# Paul Sancya/AP New York Jets running back Breece Hall carries the ball during a game against the Denver Broncos.
# The Broncos have not beaten the Chiefs since 2015, and have never beaten Chiefs quarterback Patrick Mahomes.
# Australia: NFL+, ESPN, 7Plus Brazil: NFL+, ESPN Canada: NFL+, CTV, TSN, RDS Germany: NFL+, ProSieben MAXX, DAZN Mexico: NFL+, TUDN, ESPN, Fox Sports, Sky Sports UK: NFL+, Sky Sports, ITV, Channel 5 US: NFL+, CBS, NBC, FOX, ESPN, Amazon Prime
Using the builder API
import newspaper
cnn_paper = newspaper.build('http://cnn.com')
print(cnn_paper.category_urls())
# ['https://cnn.com', 'https://money.cnn.com', 'https://arabic.cnn.com', 'https://cnnespanol.cnn.com', 'http://edition.cnn.com', 'https://edition.cnn.com', 'https://us.cnn.com', 'https://www.cnn.com']
article_urls = [article.url for article in cnn_paper.articles]
print(article_urls[:3])
# ['https://arabic.cnn.com/middle-east/article/2023/10/30/number-of-hostages-held-in-gaza-now-up-to-239-idf-spokesperson', 'https://arabic.cnn.com/middle-east/video/2023/10/30/v146619-sotu-sullivan-hostage-negotiations', 'https://arabic.cnn.com/middle-east/article/2023/10/29/norwegian-pm-israel-gaza']
article = cnn_paper.articles[0]
article.download()
article.parse()
print(article.title)
# المتحدث باسم الجيش الإسرائيلي: عدد الرهائن المحتجزين في غزة يصل إلى
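The category list printed above contains the same host under both `http` and `https`, and the first article URLs all come from one regional subdomain. A minimal sketch of tidying such lists with the standard library follows. The helpers `unique_hosts` and `filter_by_host` are hypothetical names introduced here for illustration; they are not part of Newspaper4k, and the input is simply the sample output shown above.

```python
from urllib.parse import urlsplit


def unique_hosts(urls):
    """Return hostnames in first-seen order, ignoring scheme differences."""
    seen = []
    for url in urls:
        host = urlsplit(url).hostname
        if host and host not in seen:
            seen.append(host)
    return seen


def filter_by_host(urls, host):
    """Keep only the URLs whose hostname equals `host` exactly."""
    return [url for url in urls if urlsplit(url).hostname == host]


# Sample data copied from the category_urls() output shown above.
category_urls = [
    'https://cnn.com', 'https://money.cnn.com', 'https://arabic.cnn.com',
    'https://cnnespanol.cnn.com', 'http://edition.cnn.com',
    'https://edition.cnn.com', 'https://us.cnn.com', 'https://www.cnn.com',
]

hosts = unique_hosts(category_urls)
edition_only = filter_by_host(category_urls, 'edition.cnn.com')
print(hosts)
print(edition_only)
```

The same `filter_by_host` call could be applied to the `article_urls` list to keep only one edition before downloading anything. Exact hostname matching is a deliberate simplification: it treats `cnn.com` and `www.cnn.com` as different hosts.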
# Extract the main text from HTML that has already been fetched
import requests
from newspaper import fulltext
html = requests.get(...).text
text = fulltext(html)
Newspaper can seamlessly extract and detect languages. If no language is specified, Newspaper attempts to detect it automatically.
import newspaper
article = newspaper.article('https://www.bbc.com/zhongwen/simp/chinese-news-67084358')
print(article.title)
# 晶片大战:台湾厂商助攻华为突破美国封锁?
Installation
✅ pip3 install newspaper4k ✅
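User guide
License
Authored and maintained by [Andrei Paraschiv].
Newspaper4k is based on newspaper3k, which was originally developed by Lucas Ou-Yang (codelucas); the original repository can be found [here](https://github.com/codelucas/newspaper). Newspaper4k is licensed under the MIT license.
Credits
Thanks to Lucas Ou-Yang for creating the original Newspaper3k project, and to all contributors to the original project.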