📰 Newspaper4k: Web Article Scraping, Analysis, and Processing


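Newspaper4k is currently a fork of the well-known newspaper3k project, which was developed by codelucas and has not been updated since September 2020. The initial goal of this fork is to keep the project alive, add new features, and fix bugs, while preserving the existing API wherever possible.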

View the project on GitHub.

Python compatibility

  • Requires Python 3.8 or later

Overview:

$ pip3 install newspaper4k

import newspaper

article = newspaper.article('https://edition.cnn.com/2023/10/29/sport/nfl-week-8-how-to-watch-spt-intl/index.html')

print(article.authors)
# ['Hannah Brewitt', 'Minute Read', 'Published', 'Am Edt', 'Sun October']

print(article.publish_date)
# 2023-10-29 09:00:15.717000+00:00

print(article.text)
# New England Patriots head coach Bill Belichick, right, embraces Buffalo Bills head coach Sean McDermott ...

print(article.top_image)
# https://media.cnn.com/api/v1/images/stellar/prod/231015223702-06-nfl-season-gallery-1015.jpg?c=16x9&q=w_800,c_fill

print(article.movies)
# []

article.nlp()
print(article.keywords)
# ['broncos', 'game', 'et', 'wide', 'chiefs', 'mahomes', 'patrick', 'denver', 'nfl', 'stadium', 'week', 'quarterback', 'win', 'history', 'images']

print(article.summary)
# Kevin Sabitus/Getty Images Denver Broncos running back Javonte Williams evades Green Bay Packers safety Darnell Savage, bottom.
# Kathryn Riley/Getty Images Kansas City Chiefs quarterback Patrick Mahomes calls a play during the Chiefs' 19-8 Thursday Night Football win over the Denver Broncos on October 12.
# Paul Sancya/AP New York Jets running back Breece Hall carries the ball during a game against the Denver Broncos.
# The Broncos have not beaten the Chiefs since 2015, and have never beaten Chiefs quarterback Patrick Mahomes.
# Australia: NFL+, ESPN, 7Plus Brazil: NFL+, ESPN Canada: NFL+, CTV, TSN, RDS Germany: NFL+, ProSieben MAXX, DAZN Mexico: NFL+, TUDN, ESPN, Fox Sports, Sky Sports UK: NFL+, Sky Sports, ITV, Channel 5 US: NFL+, CBS, NBC, FOX, ESPN, Amazon Prime
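
The sample output above shows that `article.authors` can pick up byline debris such as 'Minute Read' or 'Am Edt' alongside the real author. The following is a minimal post-processing sketch, not part of Newspaper4k: the `NOISE_WORDS` set and the `clean_authors` helper are hypothetical names introduced here, and the word list is only an assumption based on that one sample.

```python
import re

# Hypothetical noise vocabulary (an assumption, not part of Newspaper4k):
# words that tend to come from timestamps and "x minute read" labels.
NOISE_WORDS = {
    "read", "minute", "min", "published", "updated", "am", "pm",
    "edt", "est", "utc", "gmt",
    "mon", "tue", "wed", "thu", "fri", "sat", "sun",
    "january", "february", "march", "april", "may", "june", "july",
    "august", "september", "october", "november", "december",
}


def clean_authors(authors):
    """Return only the entries that contain no known noise word."""
    cleaned = []
    for name in authors:
        # Split the entry into lowercase alphabetic tokens.
        tokens = re.findall(r"[a-z']+", name.lower())
        if tokens and not any(token in NOISE_WORDS for token in tokens):
            cleaned.append(name)
    return cleaned


raw_authors = ['Hannah Brewitt', 'Minute Read', 'Published', 'Am Edt', 'Sun October']
print(clean_authors(raw_authors))
```

A filter like this is deliberately crude: it would also drop a real author whose name contains one of the listed words (for example "April" or "May"), so the word list should be tuned to the sites being scraped.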

Using the builder API

import newspaper

cnn_paper = newspaper.build('http://cnn.com')
print(cnn_paper.category_urls())
# ['https://cnn.com', 'https://money.cnn.com', 'https://arabic.cnn.com', 'https://cnnespanol.cnn.com', 'http://edition.cnn.com', 'https://edition.cnn.com', 'https://us.cnn.com', 'https://www.cnn.com']

article_urls = [article.url for article in cnn_paper.articles]
print(article_urls[:3])
# ['https://arabic.cnn.com/middle-east/article/2023/10/30/number-of-hostages-held-in-gaza-now-up-to-239-idf-spokesperson', 'https://arabic.cnn.com/middle-east/video/2023/10/30/v146619-sotu-sullivan-hostage-negotiations', 'https://arabic.cnn.com/middle-east/article/2023/10/29/norwegian-pm-israel-gaza']

article = cnn_paper.articles[0]
article.download()
article.parse()

print(article.title)
# المتحدث باسم الجيش الإسرائيلي: عدد الرهائن المحتجزين في غزة يصل إلى
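
As the output above suggests, a built source can return article URLs from several regional subdomains (here the first results come from arabic.cnn.com). A minimal sketch of grouping the collected URLs by host with the standard library follows; `group_by_host` is a hypothetical helper, not a Newspaper4k function, and the sample URLs are copied from the examples in this document.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_host(urls):
    """Group URLs by their lowercase host name, keeping input order."""
    groups = defaultdict(list)
    for url in urls:
        host = urlparse(url).netloc.lower()
        groups[host].append(url)
    return dict(groups)


# Sample URLs taken from the examples in this README.
sample_urls = [
    'https://arabic.cnn.com/middle-east/article/2023/10/30/number-of-hostages-held-in-gaza-now-up-to-239-idf-spokesperson',
    'https://arabic.cnn.com/middle-east/article/2023/10/29/norwegian-pm-israel-gaza',
    'https://edition.cnn.com/2023/10/29/sport/nfl-week-8-how-to-watch-spt-intl/index.html',
]

for host, host_urls in group_by_host(sample_urls).items():
    print(host, len(host_urls))
```

With a real source, the list comprehension from the example above (`[article.url for article in cnn_paper.articles]`) could be passed to the helper in place of `sample_urls`.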

To extract the text from HTML that you have already downloaded, use fulltext:

import requests
from newspaper import fulltext

html = requests.get(...).text
text = fulltext(html)

Newspaper can seamlessly extract text and detect languages. If no language is specified, Newspaper will attempt to detect the language automatically.

import newspaper

article = newspaper.article('https://www.bbc.com/zhongwen/simp/chinese-news-67084358')

print(article.title)
# 晶片大战:台湾厂商助攻华为突破美国封锁?
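
When the language is known in advance, it may be preferable to state it rather than rely on detection. The sketch below assumes that the `Article` class accepts a `language` keyword argument (as in newspaper3k); `language_kwargs` and `fetch_article` are hypothetical helper names used only for illustration.

```python
from typing import Optional


def language_kwargs(language: Optional[str] = None) -> dict:
    """Build keyword arguments, passing `language` only when it is known."""
    return {"language": language} if language else {}


def fetch_article(url: str, language: Optional[str] = None):
    """Download and parse an article, optionally forcing its language."""
    # Imported lazily so the helper above can be used without the package.
    # Assumption: Article(url, language=...) is accepted by the installed version.
    from newspaper import Article

    article = Article(url, **language_kwargs(language))
    article.download()
    article.parse()
    return article
```

For the BBC Chinese example above, a call might look like `fetch_article('https://www.bbc.com/zhongwen/simp/chinese-news-67084358', language='zh')`; omitting `language` leaves the choice to Newspaper.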

Installation

pip3 install newspaper4k

User guide

License

Written and maintained by Andrei Paraschiv.

Newspaper was originally developed by Lucas Ou-Yang (codelucas); the original repository is available [here](https://github.com/codelucas/newspaper). Newspaper4k is licensed under the MIT license.

Acknowledgments

Thanks to Lucas Ou-Yang for creating the original Newspaper3k project, and to all contributors to the original project.