text plot

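This notebook is designed to demonstrate (and so document) how to use the shap.plots.text function. It uses a distilled PyTorch BERT model from the transformers package to do sentiment analysis of IMDB movie reviews.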

Note that the prediction function we define takes a list of strings and returns the log-odds (logit) of the positive class.

[9]:
import nlp
import numpy as np
import scipy as sp
import torch
import transformers

import shap

# load a BERT sentiment analysis model
tokenizer = transformers.DistilBertTokenizerFast.from_pretrained(
    "distilbert-base-uncased"
)
model = transformers.DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
).cuda()


# define a prediction function
def f(x):
    tv = torch.tensor(
        [
            tokenizer.encode(v, padding="max_length", max_length=500, truncation=True)
            for v in x
        ]
    ).cuda()
    outputs = model(tv)[0].detach().cpu().numpy()
    scores = (np.exp(outputs).T / np.exp(outputs).sum(-1)).T
    val = sp.special.logit(scores[:, 1])  # use one vs rest logit units
    return val


# build an explainer using a token masker
explainer = shap.Explainer(f, tokenizer)

# explain the model's predictions on IMDB reviews
imdb_train = nlp.load_dataset("imdb")["train"]
shap_values = explainer(imdb_train[:10], fixed_context=1)
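
The conversion at the end of `f` can be checked in isolation. The sketch below uses made-up two-class logits (an assumption for illustration, not real model output) and NumPy only: it applies the same softmax as `f`, then writes out the logit log(p / (1 - p)) by hand in place of `sp.special.logit`. With only two classes, the result should equal the positive raw score minus the negative raw score.

```python
import numpy as np

# Hypothetical raw model outputs (logits) for three reviews; the columns are
# [negative, positive]. These numbers are invented for illustration.
outputs = np.array([[2.0, -1.0], [0.5, 0.5], [-3.0, 1.5]])

# Same softmax as in f(x): normalize exp(outputs) across the class axis.
scores = (np.exp(outputs).T / np.exp(outputs).sum(-1)).T

# Log-odds of the positive class, written out instead of sp.special.logit.
p = scores[:, 1]
log_odds = np.log(p / (1 - p))

# For two classes this is the positive logit minus the negative logit.
print(log_odds)
print(outputs[:, 1] - outputs[:, 0])
```

Working in log-odds rather than probabilities keeps the explained quantity unbounded, which suits additive attributions better than values squeezed into [0, 1].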

Single instance text plot

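When we pass a single instance to the text plot, the importance of each token is overlaid on the original text that corresponds to that token. Red regions correspond to parts of the text that increase the model output when they are included, while blue regions correspond to parts that decrease the model output when they are included. For the sentiment analysis model used here, red corresponds to a more positive review and blue to a more negative review.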

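Note that importance values returned for text models are often hierarchical and follow the structure of the text. Nonlinear interactions between groups of tokens are often saved and can be used during plotting. If the Explanation object passed to the text plot has a ``.hierarchical_values`` attribute, then small groups of tokens with strong nonlinear effects among them are automatically merged into coherent chunks. The presence of ``.hierarchical_values`` also means that the explainer may not have enumerated all possible token perturbations, and so has treated chunks of the text as essentially a single unit. This happens because we often want to explain a text model using fewer model evaluations than there are tokens in the document. Whenever a region of the input text is not split by the explainer, the text plot shows it as a single unit.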

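The force plot above the text is designed to provide an overview of how all the parts of the text combine to produce the model's output. See the `force plot <>`__ notebook for more details, but the general structure of the plot is that positive red features "push" the model output higher while negative blue features "push" the model output lower. The force plot provides much more quantitative information than the text coloring. Hovering over a chunk of text highlights the corresponding part of the force plot, and hovering over a part of the force plot highlights the corresponding chunk of text.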

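Note that clicking on any chunk of text shows the sum of the SHAP values attributed to the tokens in that chunk (clicking again hides the value).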

[10]:
# plot the explanation for the review at index 3
shap.plots.text(shap_values[3])
[Interactive text plot for the review at index 3. The force plot runs from a base value of -2.171297 to f(x) = 3.633372. The largest positive chunks are "But" (2.49), "lovable" (2.385), and "impressive" (2.222); the largest negative chunks are "society" (-0.958) and "Many of the jokes fall flat." (-0.775).]
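
The additive structure behind the force plot can be checked by hand. The sketch below copies the chunk-level values, the base value, and f(x) from the plot output for the review at index 3, and confirms that the base value plus the sum of the chunk values lands on the model output. The plot rounds each chunk value to three decimals, so the match is only expected to hold up to rounding error.

```python
# Chunk-level SHAP values read off the plot output for the review at index 3
# (each one is rounded to three decimals by the plot).
chunk_values = [
    2.49, 2.385, 2.222, 1.676, 1.319, 0.977, 0.518, 0.484, 0.083,
    -0.958, -0.775, -0.684, -0.627, -0.554, -0.518, -0.511, -0.4,
    -0.357, -0.346, -0.176, -0.167, -0.167, -0.093, -0.012, -0.004,
]
base_value = -2.171297  # labeled "base value" in the force plot
model_output = 3.633372  # labeled "f(x)" in the force plot

# Additivity: base value plus all attributions should reproduce f(x).
reconstructed = base_value + sum(chunk_values)
print(reconstructed, model_output)
```

The two printed numbers should differ only in the later decimal places, which reflects the rounding of the displayed chunk values.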

Multiple instance text plot

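When we pass a multi-row Explanation object to the text plot, we get a single-instance plot for each input instance, with the x-axis and color range scaled consistently across the plots so that they are directly comparable.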

[11]:
# plot the explanations for the first three reviews
shap.plots.text(shap_values[:3])
[Interactive text plots for the first three reviews, drawn with a shared x-axis and color range.]

0th instance: base value -2.165315, f(x) = -3.729354
1st instance: base value -0.722620, f(x) = -4.128328
2nd instance: base value -2.184386, f(x) = 4.346902

Summarizing text explanations

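While plotting several instance-level explanations with the text plot can be very informative, sometimes you want a global summary of the impact of tokens over a large set of instances. See the `Explanation object <>`__ documentation for more details, but you can easily summarize the importance of tokens in a dataset by collapsing a multi-row Explanation object over all of its rows (in this case by summing). Doing this treats every text input token type as a feature, so the collapsed Explanation object has as many columns as there were unique tokens in the original multi-row Explanation object. If hierarchical values are present in the Explanation object, then any large groups are divided up and each token in the group is given an equal share of the overall group importance value.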
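
To make the collapsing step concrete, here is a minimal pure-Python sketch of the idea described above. It is not SHAP's implementation: the data layout (a list of chunks per instance) and all numbers are invented for illustration. Each chunk's value is split equally among its tokens, and the absolute shares are then summed per token type across instances.

```python
from collections import defaultdict

# Toy explanations for two inputs (invented numbers). Each entry is a chunk:
# a list of tokens plus one SHAP value for the whole chunk.
explanations = [
    [(["a"], 0.2), (["truly", "superb"], 1.0), (["film"], -0.1)],
    [(["a"], 0.5), (["film"], 0.25)],
]

summary = defaultdict(float)
for chunks in explanations:
    for tokens, value in chunks:
        share = value / len(tokens)  # equal share for every token in a chunk
        for token in tokens:
            summary[token] += abs(share)  # sum of absolute values per token type

# One entry per unique token, ordered by total absolute importance.
for token, total in sorted(summary.items(), key=lambda kv: -kv[1]):
    print(token, round(total, 3))
```

The summary ends up with one entry per unique token, which corresponds to the one-column-per-token-type layout of the collapsed Explanation object.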

[12]:
shap.plots.bar(shap_values.abs.sum(0))
../../../_images/example_notebooks_api_examples_plots_text_7_0.png

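Note that how you summarize the importance of features can make a big difference. In the plot above, the token "a" ranks as very important both because it has an impact on the model and because it is very common. Below we instead summarize the instances with the max function to see the largest impact a token has in any single instance.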

[13]:
shap.plots.bar(shap_values.abs.max(0))
../../../_images/example_notebooks_api_examples_plots_text_9_0.png
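
The difference between the two summaries shows up even on a tiny example. The sketch below uses invented absolute values for two token types (these are assumptions for illustration, not values from the IMDB run): a common token with a modest effect in every instance, and a rare token with one large effect.

```python
import numpy as np

# Invented absolute SHAP values per instance for two token types.
# "a" appears in every instance with a modest effect; "superb" appears once.
abs_values = {
    "a": np.array([0.3, 0.4, 0.3, 0.5]),
    "superb": np.array([0.0, 0.0, 1.2, 0.0]),
}

by_sum = {tok: float(v.sum()) for tok, v in abs_values.items()}
by_max = {tok: float(v.max()) for tok, v in abs_values.items()}

print(max(by_sum, key=by_sum.get))  # token ranked first by summed impact
print(max(by_max, key=by_max.get))  # token ranked first by largest single impact
```

Summing rewards frequency as well as effect size, while the maximum surfaces tokens that matter a lot in at least one instance.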

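You can also slice out a single token from all the instances by using that token as an input name (note that the gray values to the left of the input names are the original text that the token was generated from).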

[14]:
shap.plots.bar(shap_values[:, "but"])
../../../_images/example_notebooks_api_examples_plots_text_11_0.png
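
Conceptually, slicing by token name gathers that token's value from every instance in which it occurs. The pure-Python sketch below illustrates the idea on invented data; the tuple layout and the numbers are assumptions for illustration and do not reflect how the Explanation object stores its values.

```python
# Toy tokenized inputs with one invented SHAP value per token.
instances = [
    (["good", "but", "long"], [0.8, -0.3, -0.2]),
    (["slow", "but", "lovable"], [-0.6, 0.9, 1.1]),
    (["a", "classic"], [0.1, 0.7]),
]

# Collect the value of the token "but" from every instance that contains it,
# keyed by the instance index.
but_values = {
    i: values[tokens.index("but")]
    for i, (tokens, values) in enumerate(instances)
    if "but" in tokens
}
print(but_values)
```

Instances that do not contain the token are simply absent from the result, and the same token can push the output in different directions depending on its context.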

Text-to-text visualization

[16]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

import shap

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-es").cuda()

s = [
    "In this picture, there are four persons: my father, my mother, my brother and my sister."
]

explainer = shap.Explainer(model, tokenizer)

shap_values = explainer(s)

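The text-to-text visualization shows the model's input text on the left and the output text on the right (in the default layout). When hovering over a token on the right (output) side, the importance of each input token is overlaid on it and indicated by the background color of the token. Red regions correspond to parts of the text that increase the model output when they are included, while blue regions correspond to parts that decrease the model output when they are included. The explanation for a particular output token can be anchored by clicking on that output token (clicking again removes the anchor).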

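Note that, as in the single-output plots described above, importance values returned for text models are often hierarchical and follow the structure of the text. Small groups of tokens with strong nonlinear effects among them are automatically merged into coherent chunks. Similarly, the explainer may not have enumerated all possible token perturbations, and so has treated chunks of the text as essentially a single unit. This preprocessing is done for each output token, and the merging behavior can differ between output tokens because the interaction effects may differ for each of them. Once an output token is anchored, the merged chunks can be viewed by hovering over the input text. All tokens of a merged chunk are shown in bold.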

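Once the output text is anchored, the input tokens can be clicked to view the exact SHAP value (hovering over an input token also brings up a tooltip with the value). Automatically merged tokens show the chunk's total value divided by the number of tokens in that chunk.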

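Hovering over the input text shows the SHAP value for each output token. This is again indicated by the background color of the output tokens. The selection can be anchored by clicking on the input token.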

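Note: The color scaling is consistent across all tokens (input and output), with the brightest red assigned to the largest SHAP value of any input token for any output token.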

Note: The layout of the two pieces of text can be changed using the "Layout" drop-down menu.
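
A shared color scale of the kind described in the first note can be sketched as follows. This is an illustration under stated assumptions rather than the plot's actual code: the matrix of values is invented, and dividing by the largest absolute value is one plausible way to get a single scale for all input/output token pairs.

```python
import numpy as np

# Invented SHAP values: rows are input tokens, columns are output tokens.
values = np.array([
    [0.9, -0.1, 0.0],
    [0.2, 1.8, -0.4],
    [-0.3, 0.1, 0.6],
])

# One shared scale: divide by the largest absolute value over all pairs,
# so the strongest attribution maps to full intensity.
scale = np.abs(values).max()
intensity = values / scale  # in [-1, 1]; the sign picks red (+) or blue (-)

print(intensity.round(2))
```

Because every pair is divided by the same constant, colors stay comparable when moving between output tokens.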

[17]:
shap.plots.text(shap_values)

0th instance:
[Interactive text-to-text plot. Visualization Type: Input/Output - Heatmap, with a Layout selector.]
Input Text: In this picture, there are four persons: my father, my mother, my brother and my sister.
Output Text: En este cuadro, hay cuatro personas: mi padre, mi madre, mi hermano y mi hermana.

Have an idea for more helpful examples? Pull requests that add to this documentation notebook are encouraged!