Advanced
This section of the documentation shows how to do some useful but advanced things with newspaper.
Multi-threaded article downloads
Downloading articles one at a time is slow. But throwing a large number of threads or async IO at a single news source (such as cnn.com) will trigger rate limiting, and doing so may also get your IP blocked by the site.
We solve this by allocating only 1-2 threads per news source, which speeds up downloads considerably while remaining respectful to the sources.
import newspaper
from newspaper.mthreading import fetch_news
slate_paper = newspaper.build('http://slate.com')
tc_paper = newspaper.build('http://techcrunch.com')
espn_paper = newspaper.build('http://espn.com')
papers = [slate_paper, tc_paper, espn_paper]
results = fetch_news(papers, threads=4)
#At this point, you can safely assume that download() has been
#called on every single article for all 3 sources.
print(slate_paper.articles[10].html)
#'<html> ...'
In addition to Source objects, fetch_news also accepts Article objects or plain URLs.
from newspaper import Article

article_urls = [f'https://abcnews.go.com/US/x/story?id={i}' for i in range(106379500, 106379520)]
articles = [Article(url=u) for u in article_urls]
results = fetch_news(articles, threads=4)
urls = [
    "https://www.foxnews.com/media/homeowner-new-florida-bill-close-squatting-loophole-return-some-fairness",
    "https://edition.cnn.com/2023/12/27/middleeast/dutch-diplomat-humanitarian-aid-gaza-sigrid-kaag-intl/index.html",
]
results = fetch_news(urls, threads=4)
# or everything at once
papers = [slate_paper, tc_paper, espn_paper]
papers.extend(articles)
papers.extend(urls)
results = fetch_news(papers, threads=4)
Note: In previous versions of newspaper this could be done with the news_pool call, but it was not very robust and has been replaced by a ThreadPoolExecutor-based implementation.
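As a follow-up to the example above, keep in mind that fetch_news() only takes care of downloading; each article still needs parse() before fields such as title or text are populated. A minimal sketch (the slicing and print calls are illustrative only):
# After fetch_news(), download() has been called on every article; parse()
# still needs to run before title/text are available.
for paper in (slate_paper, tc_paper, espn_paper):
    for art in paper.articles[:5]:  # just the first few articles per source
        art.parse()
        print(art.title)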
Keeping just the HTML of the article body
Keeping only the HTML of the article body can be useful when you want to preserve some of the formatting information from the original HTML. It can also help with formatting if you want to embed the article into a website.
For instance, you could:
import newspaper
from newspaper import article
# we are calling the shortcut function ``article()`` which will do the
# downloading and parsing for us and return an ``Article`` object.
a = article('http://www.cnn.com/2014/01/12/world/asia/north-korea-charles-smith/index.html')
print(a.article_html)
# '<div> \n<p><strong>(CNN)</strong> -- Charles Smith insisted Sunda...'
# You can also access the article's top node (lxml node) directly
print(a.top_node)
# '<Element div at 0x7f2b8c0b6b90>'
# Additionally we create a separate DOM tree with cleaned html.
# This can be useful in some cases.
print(a.clean_doc)
# '<Element html at 0x7f2b8c0b6b90>'
print(a.clean_top_node)
# '<Element div at 0x7f2b8c0b6b90>'
Adding new languages
We are currently planning to change (and simplify) the way new languages are added. If you still want to submit a new language, please follow the instructions below.
For languages that use Latin characters, this is fairly straightforward.
You need to provide a list of stopwords in the form of a stopwords-<language-code>.txt text file.
For non-Latin languages we need a specialized tokenizer, since splitting on whitespace simply does not work for languages such as Chinese or Arabic. For Chinese we use the additional open-source library jieba to split the text into words. For Arabic we use a special nltk tokenizer to do the same job.
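To illustrate why whitespace splitting is not enough, here is a small standalone sketch using jieba (independent of newspaper itself) to segment a Chinese sentence:
import jieba

text = "新闻是传播最新事件的一种方式"
print(text.split())           # whitespace splitting: one unsegmented chunk
print(list(jieba.cut(text)))  # jieba: word-level tokens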
So, to add full-text extraction for a new (non-Latin) language, we need:
1. Push a stopwords file in the format stopwords-<2-char-language-code>.txt to newspaper/resources/text/.
2. Provide a way of splitting/tokenizing the text of that foreign language into words.
For Latin languages:
1. Push a stopwords file in the format stopwords-<2-char-language-code>.txt to newspaper/resources/text/ and we are done!
Explicitly building a news source
Instead of using the newspaper.build(..) api, we can take one step lower and use newspaper's Source api.
from newspaper import Source
cnn_paper = Source('http://cnn.com')
print(cnn_paper.size()) # no articles, we have not built the source
# 0
cnn_paper.build()
print(cnn_paper.size())
# 3100
Note the build() method above. The code above is equivalent to the following sequence of calls:
cnn_paper = Source('http://cnn.com')
# These calls are all handled inside build():
cnn_paper.download()
cnn_paper.parse()
cnn_paper.set_categories()
cnn_paper.download_categories()
cnn_paper.parse_categories()
cnn_paper.set_feeds()
cnn_paper.download_feeds()
cnn_paper.generate_articles()
print(cnn_paper.size())
# 3100
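Once the source is built, you can inspect what was discovered. A hedged sketch of typical follow-up calls (method names as in the Source API; the article index is arbitrary):
# Category and feed URLs discovered during build()
print(cnn_paper.category_urls())
print(cnn_paper.feed_urls())

# Articles are generated but not yet downloaded or parsed
first_article = cnn_paper.articles[0]
first_article.download()
first_article.parse()
print(first_article.title)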
Parameters and Configurations
Newspaper provides two APIs for configuring your Article and Source objects: one via named parameter passing (recommended), the other via Configuration objects.
Any attribute of Configuration can be passed as a named argument to the article() function, the Article constructor, or the Source constructor.
Here are some examples of parameter passing:
import newspaper
from newspaper import Article, Source
cnn = newspaper.build('http://cnn.com', language='en', memoize_articles=False)
article = Article(url='http://cnn.com/french/...', language='fr', fetch_images=False)
cnn = Source(url='http://latino.cnn.com/...', language='es', request_timeout=10,
             number_threads=20)
Here are some examples of how to use the Configuration object:
import newspaper
from newspaper import Config, Article, Source
config = Config()
config.memoize_articles = False
config.language = 'en'
config.proxies = {'http': '192.168.1.100:8080',
'https': '192.168.1.100:8080'}
cbs_paper = newspaper.build('http://cbs.com', config=config)
article_1 = Article(url='http://espn/2013/09/...', config=config)
cbs_paper = Source('http://cbs.com', config=config)
The full list of available options can be found in the Configuration section.
Caching
The Newspaper4k library provides a simple caching mechanism that avoids repeatedly downloading the same article. In addition, when building a Source object, the detected category URLs are cached for 24 hours.
Both mechanisms are enabled by default. Article caching is controlled by the memoize_articles parameter of the newspaper.build() function or, when creating a Source object, by the memoize_articles parameter of its constructor. Setting it to False disables the caching mechanism.
The category detection cache is controlled by the utils.cache_disk.enabled setting. Disabling it turns off the caching decorator on the Source._get_category_urls(..) method.
For example:
import newspaper
from newspaper import utils
cbs_paper = newspaper.build('http://cbs.com')
# Disable article caching
utils.cache_disk.enabled = False
cbs_paper2 = newspaper.build('http://cbs.com') # The categories will be re-detected
# Enable article caching
utils.cache_disk.enabled = True
cbs_paper3 = newspaper.build('http://cbs.com') # The cached category urls will be loaded
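The article-level cache mentioned above is controlled separately through memoize_articles. A brief sketch, assuming the same cbs.com source as in the example:
# Disable article caching so previously seen articles are not skipped
cbs_paper_fresh = newspaper.build('http://cbs.com', memoize_articles=False)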
Proxy Usage
Websites often block repeated access from a single IP address. Alternatively, some websites may restrict access from certain geographic locations (for legal reasons, etc.). To bypass such restrictions you can use a proxy. Newspaper supports proxies by passing the proxies parameter to the Article or Source object constructor. The proxies parameter should be a dictionary in the format required by the requests library, as follows:
from newspaper import Article
# Define your proxy
proxies = {
    'http': 'http://your_http_proxy:port',
    'https': 'https://your_https_proxy:port'
}
# URL of the article you want to scrape
url = 'https://abcnews.go.com/Technology/wireStory/indonesias-mount-marapi-erupts-leading-evacuations-reported-casualties-106358667'
# Create an Article object, passing the proxies parameter
article = Article(url, proxies=proxies)
# Download and parse the article
article.download()
article.parse()
# Access the article's title and text
print("Title:", article.title)
print("Text:", article.text)
The same can be achieved with the article() shortcut function:
from newspaper import article
# Define your proxy
proxies = {
    'http': 'http://your_http_proxy:port',
    'https': 'https://your_https_proxy:port'
}
# URL of the article you want to scrape
url = 'https://abcnews.go.com/Technology/wireStory/indonesias-mount-marapi-erupts-leading-evacuations-reported-casualties-106358667'
# Download and parse the article via the shortcut function
news_article = article(url, proxies=proxies)
# Access the article's title and text
print("Title:", news_article.title)
print("Text:", news_article.text)