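Specifying Data

Altair's basic data model is tabular data, similar to a spreadsheet or database table. Individual datasets are assumed to contain a collection of records (rows), which may contain any number of named data fields (columns). Each top-level chart object (i.e. Chart, LayerChart, VConcatChart, HConcatChart, RepeatChart, and FacetChart) accepts a dataset as its first argument.

There are many different ways of specifying a dataset:

as a pandas DataFrame
as a DataFrame that supports the DataFrame Interchange Protocol (contains a __dataframe__ attribute), e.g. polars and pyarrow. This is experimental.
as a Data or related object (i.e. UrlData, InlineData, NamedData)
as a URL string pointing to a JSON- or CSV-formatted text file
as a geopandas GeoDataFrame, Shapely Geometries, GeoJSON Objects or other objects that support the __geo_interface__
as a generated dataset such as numerical sequences or geographic reference elements

When data is specified as a pandas DataFrame, Altair uses the data type information provided by pandas to automatically determine the data types required in the encoding. For example, here we specify data via a pandas DataFrame, and Altair automatically detects that the x-column should be visualized on a categorical (nominal) scale and that the y-column should be visualized on a quantitative scale: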

import altair as alt
import pandas as pd

data = pd.DataFrame({'x': ['A', 'B', 'C', 'D', 'E'],
                     'y': [5, 3, 6, 7, 2]})
alt.Chart(data).mark_bar().encode(
    x='x',
    y='y',
)
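
To see what information Altair has to work with in this case, it can help to look at the dtypes that pandas reports. The following is a minimal sketch using only pandas. The `kinds` dictionary is a hypothetical helper for illustration and not part of Altair, and it only mirrors the broad idea (numeric columns suggest a quantitative scale, other columns a nominal one) rather than Altair's actual inference code.

```python
import pandas as pd

data = pd.DataFrame({'x': ['A', 'B', 'C', 'D', 'E'],
                     'y': [5, 3, 6, 7, 2]})

# Classify each column roughly the way a type-inference step might:
# numeric dtypes suggest a quantitative scale, anything else a nominal one.
# (Illustrative only; this is not Altair's own implementation.)
kinds = {
    name: 'quantitative' if pd.api.types.is_numeric_dtype(dtype) else 'nominal'
    for name, dtype in data.dtypes.items()
}
print(kinds)
```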

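By comparison, all other ways of specifying data (including non-pandas DataFrames) require the encoding types to be declared explicitly. Here we create the same chart as above using a Data object, with the data specified as a JSON-style list of records: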

import altair as alt

data = alt.Data(values=[{'x': 'A', 'y': 5},
                        {'x': 'B', 'y': 3},
                        {'x': 'C', 'y': 6},
                        {'x': 'D', 'y': 7},
                        {'x': 'E', 'y': 2}])
alt.Chart(data).mark_bar().encode(
    x='x:N',  # specify nominal data
    y='y:Q',  # specify quantitative data
)

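Notice the extra markup required in the encoding: because Altair cannot infer the types within a Data object, we must specify them manually (here we use Encoding Shorthands to specify nominal (N) for x and quantitative (Q) for y; see Encoding Data Types).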
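
A JSON-style list of records like the one above can also be derived from an existing DataFrame. The sketch below uses pandas' `to_dict(orient='records')`; the variable names are illustrative. It shows that the resulting plain dictionaries no longer carry the DataFrame's dtype information, which is one way to see why explicit type codes such as `'x:N'` become necessary.

```python
import pandas as pd

df = pd.DataFrame({'x': ['A', 'B', 'C'], 'y': [5, 3, 6]})

# One dict per row. The column dtypes of the DataFrame are not carried along,
# so a chart built on these records needs explicit types ('x:N', 'y:Q').
records = df.to_dict(orient='records')
print(records)
```

A list in this shape is what the `values` argument of `alt.Data` receives in the example above.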

Similarly, we must also specify the data type when referencing data by URL:

import altair as alt
from vega_datasets import data
url = data.cars.url

alt.Chart(url).mark_point().encode(
    x='Horsepower:Q',
    y='Miles_per_Gallon:Q'
)

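Encodings and their associated types are discussed further in Encodings. Below we go into more detail about the different ways of specifying data in an Altair chart.

pandas DataFrame

Including Index Data

By design Altair only accesses DataFrame columns, not DataFrame indices. At times, relevant data appears in the index. For example: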

import numpy as np
rand = np.random.RandomState(0)

data = pd.DataFrame({'value': rand.randn(100).cumsum()},
                    index=pd.date_range('2018', freq='D', periods=100))
data.head()

                   value
    2018-01-01  1.764052
    2018-01-02  2.164210
    2018-01-03  3.142948
    2018-01-04  5.383841
    2018-01-05  7.251399

If you would like the index to be available to the chart, you can explicitly turn it into a column using the reset_index() method:

alt.Chart(data.reset_index()).mark_line().encode(
    x='index:T',
    y='value:Q'
)

If the index object does not have a name attribute set, the resulting column will be called "index". More information is available in the pandas documentation.
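
The effect of the index name on the resulting column can be checked directly. This is a small sketch using pandas' `rename_axis`; the name `'date'` is chosen here purely for illustration.

```python
import numpy as np
import pandas as pd

rand = np.random.RandomState(0)
data = pd.DataFrame({'value': rand.randn(100).cumsum()},
                    index=pd.date_range('2018', freq='D', periods=100))

# Unnamed index: reset_index() produces a column called "index".
unnamed = data.reset_index()

# Named index: the new column takes the index name instead.
named = data.rename_axis('date').reset_index()

print(list(unnamed.columns))
print(list(named.columns))
```

With a named index, the encoding would refer to that name instead, for example `x='date:T'`.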

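Long-form vs. Wide-form Data

There are two common conventions for storing data in a DataFrame, sometimes called long-form and wide-form. Both are sensible patterns for storing data in a tabular format; briefly, the difference is this:

Wide-form data has one row per independent variable, with metadata recorded in the row and column labels.
Long-form data has one row per observation, with metadata recorded within the table as values.

Altair's grammar works best with long-form data, in which each row corresponds to a single observation along with its metadata.

A concrete example will help make this distinction clearer. Consider a dataset consisting of stock prices of several companies over time. The wide-form version of the data might be arranged as follows: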

wide_form = pd.DataFrame({'Date': ['2007-10-01', '2007-11-01', '2007-12-01'],
                          'AAPL': [189.95, 182.22, 198.08],
                          'AMZN': [89.15, 90.56, 92.64],
                          'GOOG': [707.00, 693.00, 691.48]})
print(wide_form)

             Date    AAPL   AMZN    GOOG
    0  2007-10-01  189.95  89.15  707.00
    1  2007-11-01  182.22  90.56  693.00
    2  2007-12-01  198.08  92.64  691.48

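Notice that each row corresponds to a single time-stamp (here time is the independent variable), while the metadata for each observation (i.e. the company name) is stored within the column labels.

The long-form version of the same data might look like this: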

long_form = pd.DataFrame({'Date': ['2007-10-01', '2007-11-01', '2007-12-01',
                                   '2007-10-01', '2007-11-01', '2007-12-01',
                                   '2007-10-01', '2007-11-01', '2007-12-01'],
                          'company': ['AAPL', 'AAPL', 'AAPL',
                                      'AMZN', 'AMZN', 'AMZN',
                                      'GOOG', 'GOOG', 'GOOG'],
                          'price': [189.95, 182.22, 198.08,
                                     89.15,  90.56,  92.64,
                                    707.00, 693.00, 691.48]})
print(long_form)

             Date company   price
    0  2007-10-01    AAPL  189.95
    1  2007-11-01    AAPL  182.22
    2  2007-12-01    AAPL  198.08
    3  2007-10-01    AMZN   89.15
    4  2007-11-01    AMZN   90.56
    5  2007-12-01    AMZN   92.64
    6  2007-10-01    GOOG  707.00
    7  2007-11-01    GOOG  693.00
    8  2007-12-01    GOOG  691.48

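Notice here that each row contains a single observation (i.e. the price), along with the metadata for this observation (the date and company name). Importantly, the column and index labels no longer contain any useful metadata.

As mentioned above, Altair works best with this long-form data, because the relevant data and metadata are stored within the table itself, rather than within the labels of rows and columns: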

alt.Chart(long_form).mark_line().encode(
  x='Date:T',
  y='price:Q',
  color='company:N'
)

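Wide-form data can be similarly visualized using e.g. layering (see Layered Charts), but this is far less convenient within Altair's grammar.

If you would like to convert data from wide-form to long-form, there are two possible approaches: it can be done as a preprocessing step using pandas, or as a transform step within the chart itself. We will detail both approaches below.

Converting with pandas

This sort of data manipulation can be done as a preprocessing step using pandas, and is discussed in detail in the Reshaping and Pivot Tables section of the pandas documentation.

For converting wide-form data to the long-form data used by Altair, the melt method of DataFrames can be used. The first argument to melt is the column or list of columns to treat as index variables; the remaining columns will be combined into an indicator variable and a value variable whose names can be optionally specified: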

wide_form.melt('Date', var_name='company', value_name='price')

             Date company   price
    0  2007-10-01    AAPL  189.95
    1  2007-11-01    AAPL  182.22
    2  2007-12-01    AAPL  198.08
    3  2007-10-01    AMZN   89.15
    4  2007-11-01    AMZN   90.56
    5  2007-12-01    AMZN   92.64
    6  2007-10-01    GOOG  707.00
    7  2007-11-01    GOOG  693.00
    8  2007-12-01    GOOG  691.48

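For more information on the melt method, see the pandas melt documentation.

If you would like to undo this operation and convert from long-form back to wide-form, the pivot method of the DataFrame is useful.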

long_form.pivot(index='Date', columns='company', values='price').reset_index()

    company        Date    AAPL   AMZN    GOOG
    0        2007-10-01  189.95  89.15  707.00
    1        2007-11-01  182.22  90.56  693.00
    2        2007-12-01  198.08  92.64  691.48

For more information on the pivot method, see the pandas pivot documentation.
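
As a quick sanity check on the two conversions, the sketch below melts the wide-form table and pivots it back. The only extra step, added here for illustration, is clearing the name that pivot leaves on the column index (shown as "company" in the output above).

```python
import pandas as pd

wide_form = pd.DataFrame({'Date': ['2007-10-01', '2007-11-01', '2007-12-01'],
                          'AAPL': [189.95, 182.22, 198.08],
                          'AMZN': [89.15, 90.56, 92.64],
                          'GOOG': [707.00, 693.00, 691.48]})

# Wide -> long: one row per (Date, company) observation.
long_form = wide_form.melt('Date', var_name='company', value_name='price')

# Long -> wide again.
round_trip = long_form.pivot(index='Date', columns='company',
                             values='price').reset_index()

# pivot() keeps "company" as the name of the column index; clear it so the
# result has the same plain column labels as the original table.
round_trip.columns.name = None

print(round_trip)
```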

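Converting with Fold Transform

If you would like to avoid data preprocessing, you can reshape your data using Altair's Fold Transform (see Fold for a full discussion). With it, the above chart can be reproduced as follows: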

alt.Chart(wide_form).transform_fold(
    ['AAPL', 'AMZN', 'GOOG'],
    as_=['company', 'price']
).mark_line().encode(
    x='Date:T',
    y='price:Q',
    color='company:N'
)

Notice that unlike the pandas melt function we must explicitly specify the columns to be folded. The as_ argument is optional, with the default being ["key", "value"].
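
To make the reshaping concrete, here is a plain-Python sketch of what a fold does to each record. The `fold` function is a hypothetical helper written for this illustration, not Altair's or Vega-Lite's implementation, and it assumes that folded records keep their original fields alongside the new key and value fields.

```python
def fold(records, fields, as_=("key", "value")):
    """Emit one output record per (input record, folded field) pair."""
    key_name, value_name = as_
    out = []
    for record in records:
        for field in fields:
            folded = dict(record)            # copy, keeping the original fields
            folded[key_name] = field         # name of the folded column
            folded[value_name] = record[field]
            out.append(folded)
    return out


wide_records = [
    {"Date": "2007-10-01", "AAPL": 189.95, "AMZN": 89.15, "GOOG": 707.00},
    {"Date": "2007-11-01", "AAPL": 182.22, "AMZN": 90.56, "GOOG": 693.00},
]

long_records = fold(wide_records, ["AAPL", "AMZN", "GOOG"],
                    as_=("company", "price"))
print(len(long_records))
print(long_records[0]["company"], long_records[0]["price"])
```

Note that this sketch walks the data row by row, so the output order differs from melt, which stacks one column after another; the set of observations is the same.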

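Generated Data

At times it is convenient not to use an external data source, but rather to generate data for display within the chart specification itself. The benefit is that the chart specification can be made much smaller for generated data than for embedded data.

Sequence Generator

Here is an example of using the sequence() function to generate a sequence of x data, along with a Calculate transform to compute y data.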

import altair as alt

# Note that the following generator is functionally similar to
# data = pd.DataFrame({'x': np.arange(0, 10, 0.1)})
data = alt.sequence(0, 10, 0.1, as_='x')

alt.Chart(data).transform_calculate(
    y='sin(datum.x)'
).mark_line().encode(
    x='x:Q',
    y='y:Q',
)
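
The size benefit of generated data can be estimated with the standard library alone. In the sketch below, both dictionaries are hand-written stand-ins for the data portion of a specification (they are assumptions for illustration, not output produced by Altair): one embeds every x value, the other stores only the generator parameters.

```python
import json

# Inline data: one record per x value, embedded directly in the specification.
inline = {"values": [{"x": round(i * 0.1, 1)} for i in range(100)]}

# Generated data: only the generator parameters are stored.
generated = {"sequence": {"start": 0, "stop": 10, "step": 0.1, "as": "x"}}

inline_size = len(json.dumps(inline))
generated_size = len(json.dumps(generated))
print(inline_size, generated_size)
```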

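Graticule Generator

Another type of data that is convenient to generate in the chart itself is the latitude/longitude lines on a geographic visualization, known as a graticule. These can be created using Altair's graticule() generator function. Here is a simple example: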

import altair as alt

data = alt.graticule(step=[15, 15])

alt.Chart(data).mark_geoshape(stroke='black').project(
    'orthographic',
    rotate=[0, -45, 0]
)

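Sphere Generator

Finally, when visualizing the globe, a sphere can be used as a background layer within a map to represent the extent of the Earth. This sphere data can be created using Altair's sphere() generator function. Here is an example: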

import altair as alt

sphere_data = alt.sphere()
grat_data = alt.graticule(step=[15, 15])

background = alt.Chart(sphere_data).mark_geoshape(fill='aliceblue')
lines = alt.Chart(grat_data).mark_geoshape(stroke='lightgrey')

alt.layer(background, lines).project('naturalEarth1')

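Spatial Data

In this section, we explain different methods for reading spatial data into Altair. To learn more about how to work with this data after you have read it in, please see the Geoshape mark page.

GeoPandas GeoDataFrame

It is convenient to use GeoPandas as the source for your spatial data. GeoPandas can read many types of spatial data and Altair works well with GeoDataFrames. Here we define four polygon geometries in a GeoDataFrame and visualize them using mark_geoshape.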

from shapely import geometry
import geopandas as gpd
import altair as alt

data_geoms = [
    {"color": "#F3C14F", "geometry": geometry.Polygon([[1.45, 3.75], [1.45, 0], [0, 0], [1.45, 3.75]])},
    {"color": "#4098D7", "geometry": geometry.Polygon([[1.45, 0], [1.45, 3.75], [2.57, 3.75], [2.57, 0], [2.33, 0], [1.45, 0]])},
    {"color": "#66B4E2", "geometry": geometry.Polygon([[2.33, 0], [2.33, 2.5], [3.47, 2.5], [3.47, 0], [3.2, 0], [2.57, 0], [2.33, 0]])},
    {"color": "#A9CDE0", "geometry": geometry.Polygon([[3.2, 0], [3.2, 1.25], [4.32, 1.25], [4.32, 0], [3.47, 0], [3.2, 0]])},
]

gdf_geoms = gpd.GeoDataFrame(data_geoms)
gdf_geoms

         color                                           geometry
    0  #F3C14F      POLYGON ((1.45 3.75, 1.45 0, 0 0, 1.45 3.75))
    1  #4098D7  POLYGON ((1.45 0, 1.45 3.75, 2.57 3.75, 2.57 0...
    2  #66B4E2  POLYGON ((2.33 0, 2.33 2.5, 3.47 2.5, 3.47 0, ...
    3  #A9CDE0  POLYGON ((3.2 0, 3.2 1.25, 4.32 1.25, 4.32 0, ...

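Since the spatial data in our example is not geographic, we use the project configuration type="identity", reflectY=True to draw the geometries without applying a geographic projection. By using alt.Color(...).scale(None) we disable the automatic color assignment in Altair and instead directly use the provided hex color codes.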

alt.Chart(gdf_geoms, title="Vega-Altair").mark_geoshape().encode(
    alt.Color("color:N").scale(None)
).project(type="identity", reflectY=True)

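Inline GeoJSON Object

If your source data is a GeoJSON file and you do not want to load it into a GeoPandas GeoDataFrame, you can provide it as a dictionary to the Altair Data class. A GeoJSON file normally consists of a FeatureCollection with a list of features, where the information for each geometry is specified within a properties dictionary. In the following example a GeoJSON-like data object is passed to the Data class, with the property value set to the key that contains the nested list (here named features).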

obj_geojson = {
    "type": "FeatureCollection",
    "features":[
        {"type": "Feature", "properties": {"location": "left"}, "geometry": {"type": "Polygon", "coordinates": [[[1.45, 3.75], [1.45, 0], [0, 0], [1.45, 3.75]]]}},
        {"type": "Feature", "properties": {"location": "middle-left"}, "geometry": {"type": "Polygon", "coordinates": [[[1.45, 0], [1.45, 3.75], [2.57, 3.75], [2.57, 0], [2.33, 0], [1.45, 0]]]}},
        {"type": "Feature", "properties": {"location": "middle-right"}, "geometry": {"type": "Polygon", "coordinates": [[[2.33, 0], [2.33, 2.5], [3.47, 2.5], [3.47, 0], [3.2, 0], [2.57, 0], [2.33, 0]]]}},
        {"type": "Feature", "properties": {"location": "right"}, "geometry": {"type": "Polygon", "coordinates": [[[3.2, 0], [3.2, 1.25], [4.32, 1.25], [4.32, 0], [3.47, 0], [3.2, 0]]]}}
    ]
}
data_obj_geojson = alt.Data(values=obj_geojson, format=alt.DataFormat(property="features"))
data_obj_geojson

    Data({
      format: DataFormat({
        property: 'features'
      }),
      values: {'type': 'FeatureCollection', 'features': [{'type': 'Feature', 'properties': {'location': 'left'}, 'geometry': {'type': 'Polygon', 'coordinates': [[[1.45, 3.75], [1.45, 0], [0, 0], [1.45, 3.75]]]}}, {'type': 'Feature', 'properties': {'location': 'middle-left'}, 'geometry': {'type': 'Polygon', 'coordinates': [[[1.45, 0], [1.45, 3.75], [2.57, 3.75], [2.57, 0], [2.33, 0], [1.45, 0]]]}}, {'type': 'Feature', 'properties': {'location': 'middle-right'}, 'geometry': {'type': 'Polygon', 'coordinates': [[[2.33, 0], [2.33, 2.5], [3.47, 2.5], [3.47, 0], [3.2, 0], [2.57, 0], [2.33, 0]]]}}, {'type': 'Feature', 'properties': {'location': 'right'}, 'geometry': {'type': 'Polygon', 'coordinates': [[[3.2, 0], [3.2, 1.25], [4.32, 1.25], [4.32, 0], [3.47, 0], [3.2, 0]]]}}]}
    })

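The label for the location of each object is stored within the properties dictionary. To access these values you can specify a nested variable name (here properties.location) within the color channel encoding. Here we change the coloring encoding to be based on this location label, and apply a magma color scheme instead of the default one. The :O suffix indicates that we want Altair to treat these values as ordinal; you can read more about ordinal data on the Encoding Data Types page.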

alt.Chart(data_obj_geojson, title="Vega-Altair - ordinal scale").mark_geoshape().encode(
    alt.Color("properties.location:O").scale(scheme='magma')
).project(type="identity", reflectY=True)
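
The two lookups involved here, selecting the feature list via the property value and reaching into properties with a dotted field name, can be mimicked in plain Python. The following is a sketch under that reading; `get_nested` is a hypothetical helper and the object is shortened to two features.

```python
obj_geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"location": "left"},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[1.45, 3.75], [1.45, 0], [0, 0], [1.45, 3.75]]]}},
        {"type": "Feature", "properties": {"location": "right"},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[3.2, 0], [3.2, 1.25], [4.32, 1.25], [4.32, 0], [3.2, 0]]]}},
    ],
}


def get_nested(record, dotted_name):
    """Follow a dotted field name such as 'properties.location' into a record."""
    value = record
    for part in dotted_name.split("."):
        value = value[part]
    return value


# property="features" means: use this nested list as the rows of the dataset.
rows = obj_geojson["features"]

# "properties.location" then addresses a field nested inside each row.
locations = [get_nested(row, "properties.location") for row in rows]
print(locations)
```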

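GeoJSON File by URL

Altair can load GeoJSON resources directly from a web URL. Here we use an example from geojson.xyz. As explained in Inline GeoJSON Object, we specify features as the value of the property parameter in the alt.DataFormat() object, and prefix the attribute we want to plot (continent) with the name of the nested dictionary in which the information for each geometry is stored (properties).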

url_geojson = "https://d2ad6b4ur7yvpq.cloudfront.net/naturalearth-3.3.0/ne_110m_admin_0_countries.geojson"
data_url_geojson = alt.Data(url=url_geojson, format=alt.DataFormat(property="features"))
data_url_geojson

    Data({
      format: DataFormat({
        property: 'features'
      }),
      url: 'https://d2ad6b4ur7yvpq.cloudfront.net/naturalearth-3.3.0/ne_110m_admin_0_countries.geojson'
    })

alt.Chart(data_url_geojson).mark_geoshape().encode(color='properties.continent:N')

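Inline TopoJSON Object

TopoJSON is an extension of GeoJSON in which the geometry of the features is referred to from a top-level object named arcs. Each shared arc is stored only once to reduce the size of the data. A TopoJSON file object can contain multiple objects (e.g. boundary borders and province borders). When defining a TopoJSON object for Altair, we specify the topojson data format type and the name of the object we want to visualize using the feature parameter. Here the name of this object key is MY_DATA, but this differs in each dataset.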

obj_topojson = {
    "arcs": [
        [[1.0, 1.0], [0.0, 1.0], [0.0, 0.0], [1.0, 0.0]],
        [[1.0, 0.0], [2.0, 0.0], [2.0, 1.0], [1.0, 1.0]],
        [[1.0, 1.0], [1.0, 0.0]],
    ],
    "objects": {
        "MY_DATA": {
            "geometries": [
                {"arcs": [[-3, 0]], "properties": {"name": "abc"}, "type": "Polygon"},
                {"arcs": [[1, 2]], "properties": {"name": "def"}, "type": "Polygon"},
            ],
            "type": "GeometryCollection",
        }
    },
    "type": "Topology",
}
data_obj_topojson = alt.Data(
    values=obj_topojson, format=alt.DataFormat(feature="MY_DATA", type="topojson")
)
data_obj_topojson

    Data({
      format: DataFormat({
        feature: 'MY_DATA',
        type: 'topojson'
      }),
      values: {'arcs': [[[1.0, 1.0], [0.0, 1.0], [0.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [2.0, 0.0], [2.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 0.0]]], 'objects': {'MY_DATA': {'geometries': [{'arcs': [[-3, 0]], 'properties': {'name': 'abc'}, 'type': 'Polygon'}, {'arcs': [[1, 2]], 'properties': {'name': 'def'}, 'type': 'Polygon'}], 'type': 'GeometryCollection'}}, 'type': 'Topology'}
    })

alt.Chart(data_obj_topojson).mark_geoshape(
).encode(
    color="properties.name:N"
).project(
    type='identity', reflectY=True
)
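
To illustrate how geometries refer to shared arcs, the sketch below rebuilds the two polygon rings from the arcs of the example object in plain Python. `ring_from_arcs` is a hypothetical helper written for this illustration; it assumes an unquantized topology (no transform entry, so arc coordinates are absolute), as in the example above, and follows the TopoJSON convention that a negative index i refers to arc ~i traversed in reverse.

```python
def ring_from_arcs(arc_indexes, arcs):
    """Stitch the referenced arcs into a single coordinate ring."""
    ring = []
    for index in arc_indexes:
        if index < 0:
            # Negative index: take arc ~index and walk it backwards.
            points = arcs[~index][::-1]
        else:
            points = arcs[index]
        # Consecutive arcs share an endpoint, so drop the duplicated point.
        ring.extend(points if not ring else points[1:])
    return ring


arcs = [
    [[1.0, 1.0], [0.0, 1.0], [0.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [2.0, 0.0], [2.0, 1.0], [1.0, 1.0]],
    [[1.0, 1.0], [1.0, 0.0]],
]

ring_abc = ring_from_arcs([-3, 0], arcs)   # polygon named "abc"
ring_def = ring_from_arcs([1, 2], arcs)    # polygon named "def"
print(ring_abc)
print(ring_def)
```

Both rings reuse the arc between (1, 1) and (1, 0), once in each direction, which is how the shared border is stored only once.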

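TopoJSON File by URL

Altair can load TopoJSON resources directly from a web URL. As explained in Inline TopoJSON Object, we have to use the feature parameter to specify the object name (here boroughs) and define the type of data as topojson in the alt.DataFormat() object.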

from vega_datasets import data

url_topojson = data.londonBoroughs.url

data_url_topojson = alt.Data(
    url=url_topojson, format=alt.DataFormat(feature="boroughs", type="topojson")
)

data_url_topojson

    Data({
      format: DataFormat({
        feature: 'boroughs',
        type: 'topojson'
      }),
      url: 'https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/londonBoroughs.json'
    })

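Note: if the file is accessible by URL, there is also a shorthand for extracting the objects: alt.topo_feature(url=url_topojson, feature="boroughs")

We color encode the boroughs by their names, as these are stored as a unique identifier (id). We use a symbolLimit of 33 in two columns to display all entries in the legend, and change the color scheme to have more distinct colors. We also add a tooltip that shows the name of the borough as we hover over it with the mouse.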

alt.Chart(data_url_topojson, title="London-Boroughs").mark_geoshape(
    tooltip=True
).encode(
    alt.Color("id:N").scale(scheme='tableau20').legend(columns=2, symbolLimit=33)
)

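Similar to the feature option, there also exists the mesh parameter. This parameter extracts a named TopoJSON object set. Unlike the feature option, the corresponding geo data is returned as a single, unified mesh instance, not as individual GeoJSON features. Extracting a mesh is useful for more efficiently drawing borders or other geographic elements that you do not need to associate with specific regions such as individual countries, states or counties.

Below we draw the same boroughs of London, but now as a mesh only.

Note: you have to explicitly define filled=False to draw (multi)lines without a fill color.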

from vega_datasets import data

url_topojson = data.londonBoroughs.url

data_url_topojson_mesh = alt.Data(
    url=url_topojson, format=alt.DataFormat(mesh="boroughs", type="topojson")
)

alt.Chart(data_url_topojson_mesh, title="Border London-Boroughs").mark_geoshape(
    filled=False
)

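Nested GeoJSON Objects

GeoJSON data can also be nested within another dataset. In this case it is possible to use the shape encoding channel in combination with the :G suffix to visualize the nested features as GeoJSON objects. In the following example the GeoJSON objects are nested under the geo key in the list of dictionaries: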

nested_features = [
    {"color": "#F3C14F", "geo": {"type": "Feature", "geometry": {"type": "Polygon", "coordinates": [[[1.45, 3.75], [1.45, 0], [0, 0], [1.45, 3.75]]]}}},
    {"color": "#4098D7", "geo": {"type": "Feature", "geometry": {"type": "Polygon", "coordinates": [[[1.45, 0], [1.45, 3.75], [2.57, 3.75], [2.57, 0], [2.33, 0], [1.45, 0]]]}}},
    {"color": "#66B4E2", "geo": {"type": "Feature", "geometry": {"type": "Polygon", "coordinates": [[[2.33, 0], [2.33, 2.5], [3.47, 2.5], [3.47, 0], [3.2, 0], [2.57, 0], [2.33, 0]]]}}},
    {"color": "#A9CDE0", "geo": {"type": "Feature", "geometry": {"type": "Polygon", "coordinates": [[[3.2, 0], [3.2, 1.25], [4.32, 1.25], [4.32, 0], [3.47, 0], [3.2, 0]]]}}},
]
data_nested_features = alt.Data(values=nested_features)

alt.Chart(data_nested_features, title="Vega-Altair").mark_geoshape().encode(
    shape="geo:G",
    color=alt.Color("color:N").scale(None)
).project(type="identity", reflectY=True)

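Projections

For geographic data it is best to use the World Geodetic System 1984 as its geographic coordinate reference system, with units in decimal degrees. Try to avoid putting projected data into Altair; instead, reproject your spatial data to EPSG:4326 first. If your data comes in a different projection (e.g. with units in meters) and you do not have the option to reproject the data, try using the project configuration (type: 'identity', reflectY: True). It draws the geometries without applying a projection.

Winding Order

LineString, Polygon and MultiPolygon geometries contain coordinates in an order: lines go in a certain direction, and polygon rings do too. The GeoJSON-like structure of the __geo_interface__ recommends the right-hand rule winding order for Polygons and MultiPolygons. This means that exterior rings should be counterclockwise and interior rings clockwise. While it recommends the right-hand rule winding order, it does not reject geometries that do not use the right-hand rule.

Altair does not follow the right-hand rule for geometries, but uses the left-hand rule. This means that exterior rings should be clockwise and interior rings should be counterclockwise. If you face a problem regarding winding order, try to force the left-hand rule on your data before using it in Altair, using GeoPandas for example as follows: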

from shapely.ops import orient
gdf.geometry = gdf.geometry.apply(orient, args=(-1,))
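
For readers who want to check the orientation of a ring by hand, the following plain-Python sketch computes the signed area with the shoelace formula and reverses a ring when needed. Both functions are hypothetical helpers for illustration; they treat coordinates as planar with the y-axis pointing up and expect a closed ring (first point repeated at the end).

```python
def signed_area(ring):
    """Shoelace formula: positive for counterclockwise rings, negative for clockwise."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


def make_clockwise(ring):
    """Return the ring in clockwise order, reversing it if it is counterclockwise."""
    return ring[::-1] if signed_area(ring) > 0 else ring


# A unit square listed counterclockwise (right-hand rule exterior ring).
ccw_square = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]

# The same square reordered as a clockwise exterior ring (left-hand rule).
cw_square = make_clockwise(ccw_square)

print(signed_area(ccw_square), signed_area(cw_square))
```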