Advanced

This section of the documentation shows how to accomplish some useful but advanced tasks with newspaper.

Multi-threaded article downloading

Downloading articles one at a time is slow. But hammering a single news source (such as cnn.com) with a large number of threads or with async IO will trigger rate limiting, and doing so may also get your IP blocked by the website.

We solve this by allocating 1-2 threads per news source, which both significantly speeds up downloading and keeps us respectful to each news source.

import newspaper
from newspaper.mthreading import fetch_news

slate_paper = newspaper.build('http://slate.com')
tc_paper = newspaper.build('http://techcrunch.com')
espn_paper = newspaper.build('http://espn.com')

papers = [slate_paper, tc_paper, espn_paper]
results = fetch_news(papers, threads=4)


# At this point, you can safely assume that download() has been
# called on every single article for all 3 sources.

print(slate_paper.articles[10].html)
# '<html> ...'

In addition to Source objects, fetch_news also accepts Article objects or plain URLs.

from newspaper import Article

article_urls = [f'https://abcnews.go.com/US/x/story?id={i}' for i in range(106379500, 106379520)]
articles = [Article(url=u) for u in article_urls]

results = fetch_news(articles, threads=4)

urls = [
    "https://www.foxnews.com/media/homeowner-new-florida-bill-close-squatting-loophole-return-some-fairness",
    "https://edition.cnn.com/2023/12/27/middleeast/dutch-diplomat-humanitarian-aid-gaza-sigrid-kaag-intl/index.html",
]

results = fetch_news(urls, threads=4)

# or everything at once
papers = [slate_paper, tc_paper, espn_paper]
papers.extend(articles)
papers.extend(urls)

results = fetch_news(papers, threads=4)

Note: in previous versions of newspaper this could be done with the news_pool call, but it was not robust enough and has been replaced by a ThreadPoolExecutor-based implementation.

Keeping only the HTML of the article body

Keeping only the HTML of the article body can be helpful when you want to preserve some of the formatting information from the original html. It can also help with formatting if you want to embed the article in a website.

For example, you can:

from newspaper import article

# we are calling the shortcut function ``article()`` which will do the
# downloading and parsing for us and return an ``Article`` object.

a = article('http://www.cnn.com/2014/01/12/world/asia/north-korea-charles-smith/index.html')

print(a.article_html)
# '<div> \n<p><strong>(CNN)</strong> -- Charles Smith insisted Sunda...'

# You can also access the article's top node (lxml node) directly

print(a.top_node)
# '<Element div at 0x7f2b8c0b6b90>'

# Additionally we create a separate DOM tree with cleaned html.
# This can be useful in some cases.

print(a.clean_doc)
# '<Element html at 0x7f2b8c0b6b90>'

print(a.clean_top_node)
# '<Element div at 0x7f2b8c0b6b90>'

Adding new languages

We are currently planning to change (simplify) the way new languages are added. If you still want to submit a new language, please follow the instructions below.

For languages using Latin characters, this is fairly straightforward. You need to provide a list of stopwords in the form of a stopwords-<language-code>.txt text file.

For non-Latin-alphabet languages we need a specialized tokenizer, since splitting on whitespace simply does not work for languages such as Chinese or Arabic. For Chinese we use the additional open-source library jieba to split the text into words. For Arabic we use a special nltk tokenizer to do the same job.

So, to add full-text extraction support for a new (non-Latin) language, we need:

1. Push a stopwords file in the format stopwords-<2-char-language-code>.txt to newspaper/resources/text/.

2. Provide a way of splitting/tokenizing text in that foreign language into words (a sketch is shown after these lists).

For Latin-alphabet languages:

1. Upload a stopwords file in the format stopwords-<2-char-language-code>.txt to newspaper/resources/text/ and we are done!
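As an illustration of point 2 for non-Latin-alphabet languages, here is a minimal sketch of a Chinese word tokenizer built on the open-source jieba library mentioned above. The function name tokenize_chinese is purely illustrative and is not part of newspaper's API.

import jieba

def tokenize_chinese(text):
    # jieba.cut returns a generator of word segments; drop pure-whitespace tokens
    return [word for word in jieba.cut(text) if word.strip()]

print(tokenize_chinese("自然语言处理很有趣"))
# e.g. ['自然语言', '处理', '很', '有趣'] (exact segmentation depends on jieba's dictionary)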

Explicitly building a news source

Besides using the newspaper.build(..) API, we can go one step further and use newspaper's Source API directly.

from newspaper import Source
cnn_paper = Source('http://cnn.com')

print(cnn_paper.size()) # no articles, we have not built the source
# 0

cnn_paper.build()
print(cnn_paper.size())
# 3100

Note the build() method above. The code above is equivalent to the following sequence of calls:

cnn_paper = Source('http://cnn.com')

# These calls are taken care of in build():
cnn_paper.download()
cnn_paper.parse()
cnn_paper.set_categories()
cnn_paper.download_categories()
cnn_paper.parse_categories()
cnn_paper.set_feeds()
cnn_paper.download_feeds()
cnn_paper.generate_articles()

print(cnn_paper.size())
# 3100

Parameters and configurations

Newspaper provides two APIs for users to configure their Article and Source objects. One, which is recommended, is passing named parameters; the other is using a Configuration object. Any attribute of Configuration can be passed as a named parameter to the article() function, the Article constructor, or the Source constructor.

Here are some examples of passing parameters:

import newspaper
from newspaper import Article, Source

cnn = newspaper.build('http://cnn.com', language='en', memoize_articles=False)

article = Article(url='http://cnn.com/french/...', language='fr', fetch_images=False)

cnn = Source(url='http://latino.cnn.com/...', language='es', request_timeout=10,
             number_threads=20)

Here are some examples of how to use the Configuration object.

import newspaper
from newspaper import Config, Article, Source

config = Config()
config.memoize_articles = False
config.language = 'en'
config.proxies = {'http': '192.168.1.100:8080',
                  'https': '192.168.1.100:8080'}

cbs_paper = newspaper.build('http://cbs.com', config=config)

article_1 = Article(url='http://espn/2013/09/...', config=config)

cbs_paper = Source('http://cbs.com', config=config)

The full list of available options can be found in the Configuration section.

Caching

The Newspaper4k library provides a simple caching mechanism that can be used to avoid repeatedly downloading the same article. In addition, when building a Source object, the category URL detection is cached for 24 hours.

Both mechanisms are enabled by default. Article caching is controlled by the memoize_articles parameter of the newspaper.build() function, or, when creating a Source object, by the memoize_articles parameter of its constructor. Setting it to False disables the caching mechanism.
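For instance, a minimal sketch of turning off the article cache through the memoize_articles parameter described above:

import newspaper

# The first build records which article URLs have already been seen
cbs_paper = newspaper.build('http://cbs.com')

# With memoize_articles=False the build returns all articles again,
# instead of only the ones not seen in previous runs
cbs_paper_fresh = newspaper.build('http://cbs.com', memoize_articles=False)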

The category detection cache is controlled by the utils.cache_disk.enabled setting. This disables the caching decorator on the Source._get_category_urls(..) method.

For example:

import newspaper
from newspaper import utils

cbs_paper = newspaper.build('http://cbs.com')

# Disable category caching
utils.cache_disk.enabled = False

cbs_paper2 = newspaper.build('http://cbs.com') # The categories will be re-detected

# Enable category caching
utils.cache_disk.enabled = True

cbs_paper3 = newspaper.build('http://cbs.com') # The cached category urls will be loaded

Proxy usage

Quite often, websites block repeated access from a single IP address. Also, some websites may restrict access from certain geographic locations (for legal reasons, etc.). To bypass these restrictions, you can use a proxy. Newspaper supports proxies through the proxies parameter, passed to the Article constructor or to the Source constructor. The proxies parameter should be a dictionary, as required by the requests library, in the following format:

from newspaper import Article

# Define your proxy
proxies = {
    'http': 'http://your_http_proxy:port',
    'https': 'https://your_https_proxy:port'
}

# URL of the article you want to scrape
url = 'https://abcnews.go.com/Technology/wireStory/indonesias-mount-marapi-erupts-leading-evacuations-reported-casualties-106358667'

# Create an Article object, passing the proxies parameter
article = Article(url, proxies=proxies)

# Download and parse the article
article.download()
article.parse()

# Access the article's title and text
print("Title:", article.title)
print("Text:", article.text)
Or, a shorter version:
from newspaper import article

# Define your proxy
proxies = {
    'http': 'http://your_http_proxy:port',
    'https': 'https://your_https_proxy:port'
}

# URL of the article you want to scrape
url = 'https://abcnews.go.com/Technology/wireStory/indonesias-mount-marapi-erupts-leading-evacuations-reported-casualties-106358667'

# Create the Article object; the article() shortcut downloads and parses it for us
news_article = article(url, proxies=proxies)

# Access the article's title and text
print("Title:", news_article.title)
print("Text:", news_article.text)