What's new in 0.25.0 (July 18, 2019)#

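Warning

Starting with the 0.25.x series of releases, pandas only supports Python 3.5.3 and higher. See Dropping Python 2.7 for more details.

Warning

The minimum supported Python version will be bumped to 3.6 in a future release.

Warning

Panel has been fully removed. For N-D labeled data structures, please use xarray.

Warning

read_pickle() and read_msgpack() are only guaranteed to be backwards compatible back to pandas version 0.20.3 (GH 27082).

These are the changes in pandas 0.25.0. See Release notes for a full changelog including other versions of pandas.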

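Enhancements#

GroupBy aggregation with relabeling#

pandas has added special groupby behavior, known as "named aggregation", for naming the output columns when applying multiple aggregation functions to specific columns (GH 18366, GH 26512).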

In [1]: animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
   ...:                         'height': [9.1, 6.0, 9.5, 34.0],
   ...:                         'weight': [7.9, 7.5, 9.9, 198.0]})
   ...: 

In [2]: animals
Out[2]: 
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

[4 rows x 3 columns]

In [3]: animals.groupby("kind").agg(
   ...:     min_height=pd.NamedAgg(column='height', aggfunc='min'),
   ...:     max_height=pd.NamedAgg(column='height', aggfunc='max'),
   ...:     average_weight=pd.NamedAgg(column='weight', aggfunc="mean"),
   ...: )
   ...: 
Out[3]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

[2 rows x 3 columns]

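Pass the desired column names as the **kwargs to .agg. The values of **kwargs should be tuples where the first element is the column selection and the second element is the aggregation function to apply. pandas provides the pandas.NamedAgg namedtuple to make it clearer what the arguments to the function are, but plain tuples are accepted as well.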

In [4]: animals.groupby("kind").agg(
   ...:     min_height=('height', 'min'),
   ...:     max_height=('height', 'max'),
   ...:     average_weight=('weight', 'mean'),
   ...: )
   ...: 
Out[4]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

[2 rows x 3 columns]

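Named aggregation is the recommended replacement for the deprecated "dict-of-dicts" approach to naming the output of column-specific aggregations (see Deprecate groupby.agg() with a dictionary when renaming).

A similar approach is now available for Series groupby objects as well. Because there is no need for column selection, the values can just be the functions to apply.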

In [5]: animals.groupby("kind").height.agg(
   ...:     min_height="min",
   ...:     max_height="max",
   ...: )
   ...: 
Out[5]: 
      min_height  max_height
kind                        
cat          9.1         9.5
dog          6.0        34.0

[2 rows x 2 columns]

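This type of aggregation is the recommended alternative to the deprecated behavior of passing a dict to a Series groupby aggregation (see Deprecate groupby.agg() with a dictionary when renaming).

See Named aggregation for more.

GroupBy aggregation with multiple lambdas#

You can now provide multiple lambda functions to a list-like aggregation in GroupBy.agg (GH 26430).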

In [6]: animals.groupby('kind').height.agg([
   ...:     lambda x: x.iloc[0], lambda x: x.iloc[-1]
   ...: ])
   ...: 
Out[6]: 
      <lambda_0>  <lambda_1>
kind                        
cat          9.1         9.5
dog          6.0        34.0

[2 rows x 2 columns]

In [7]: animals.groupby('kind').agg([
   ...:     lambda x: x.iloc[0] - x.iloc[1],
   ...:     lambda x: x.iloc[0] + x.iloc[1]
   ...: ])
   ...: 
Out[7]: 
         height                weight           
     <lambda_0> <lambda_1> <lambda_0> <lambda_1>
kind                                            
cat        -0.4       18.6       -2.0       17.8
dog       -28.0       40.0     -190.5      205.5

[2 rows x 4 columns]

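Previously, these raised a SpecificationError.

Better repr for MultiIndex#

Printing of MultiIndex instances now shows the tuple for each row and ensures that the tuple items are vertically aligned, so it is now easier to understand the structure of the MultiIndex. (GH 13480)

The repr now looks like this: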

In [8]: pd.MultiIndex.from_product([['a', 'abc'], range(500)])
Out[8]: 
MultiIndex([(  'a',   0),
            (  'a',   1),
            (  'a',   2),
            (  'a',   3),
            (  'a',   4),
            (  'a',   5),
            (  'a',   6),
            (  'a',   7),
            (  'a',   8),
            (  'a',   9),
            ...
            ('abc', 490),
            ('abc', 491),
            ('abc', 492),
            ('abc', 493),
            ('abc', 494),
            ('abc', 495),
            ('abc', 496),
            ('abc', 497),
            ('abc', 498),
            ('abc', 499)],
           length=1000)

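Previously, outputting a MultiIndex printed all the levels and codes of the MultiIndex, which was visually unappealing and made the output more difficult to navigate. For example (limiting the range to 5):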

In [1]: pd.MultiIndex.from_product([['a', 'abc'], range(5)])
Out[1]: MultiIndex(levels=[['a', 'abc'], [0, 1, 2, 3, 4]],
                   codes=[[0, 0, 0, 0, 0, 1, 1, 1, 1, 1], [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]])

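In the new repr, all values will be shown if the number of rows is smaller than options.display.max_seq_items (default: 100 items). Horizontally, the output will be truncated if it is wider than options.display.width (default: 80 characters).

Shorter truncated repr for Series and DataFrame#

Currently, the default display options of pandas ensure that when a Series or DataFrame has more than 60 rows, its repr gets truncated to this maximum of 60 rows (the display.max_rows option). However, this still gives a repr that takes up a large part of the vertical screen estate. Therefore, a new option display.min_rows is introduced with a default of 10, which determines the number of rows shown in the truncated repr:

For small Series or DataFrames, up to max_rows number of rows is shown (default: 60).
For larger Series or DataFrames with a length above max_rows, only min_rows number of rows is shown (default: 10, i.e. the first and last 5 rows).

This dual option still allows you to see the full content of relatively small objects (e.g. df.head(20) shows all 20 rows), while giving a brief repr for large objects.

To restore the previous behaviour of a single threshold, set pd.options.display.min_rows = None.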
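The interaction of the two display options can be checked with a short sketch. The 100-row Series and the variable names are only illustrative, and the options are set through pd.option_context so that the global settings are left untouched.

```python
import pandas as pd

s = pd.Series(range(100))

# More rows than max_rows, so the repr is truncated to min_rows rows.
with pd.option_context("display.max_rows", 60, "display.min_rows", 10):
    truncated = repr(s)

# min_rows=None restores the single threshold: truncate to max_rows rows.
with pd.option_context("display.max_rows", 60, "display.min_rows", None):
    single_threshold = repr(s)

truncated_lines = len(truncated.splitlines())
single_threshold_lines = len(single_threshold.splitlines())
print(truncated_lines, single_threshold_lines)
```

The first count should be noticeably smaller than the second, because only the head and tail of the Series are printed when min_rows applies.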

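JSON normalize with max_level param support#

json_normalize() normalizes the provided input dict to all nested levels. The new max_level parameter provides more control over the level at which normalization ends (GH 23843):

For example: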

from pandas.io.json import json_normalize
data = [{
    'CreatedBy': {'Name': 'User001'},
    'Lookup': {'TextField': 'Some text',
               'UserField': {'Id': 'ID001', 'Name': 'Name001'}},
    'Image': {'a': 'b'}
}]
json_normalize(data, max_level=1)
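To make the effect of max_level concrete, the sketch below normalizes the same record at two depths and compares the resulting column names. The try/except import is an assumption made so that the snippet also runs on later pandas releases, where json_normalize is exposed at the top level; the variable names are illustrative.

```python
try:
    from pandas import json_normalize  # top-level location in later releases
except ImportError:
    from pandas.io.json import json_normalize  # location used in pandas 0.25

data = [{
    'CreatedBy': {'Name': 'User001'},
    'Lookup': {'TextField': 'Some text',
               'UserField': {'Id': 'ID001', 'Name': 'Name001'}},
    'Image': {'a': 'b'}
}]

# max_level=0 leaves every nested dict as a single object column.
flat_zero = json_normalize(data, max_level=0)

# max_level=1 flattens one level; 'Lookup.UserField' stays a dict.
flat_one = json_normalize(data, max_level=1)

print(sorted(flat_zero.columns))
print(sorted(flat_one.columns))
```

With max_level=1 the inner UserField dict is kept as a single value rather than being split into Lookup.UserField.Id and Lookup.UserField.Name.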

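Series.explode to split list-like values to rows#

Series and DataFrame have each gained an explode() method (Series.explode(), DataFrame.explode()) to transform list-likes to individual rows. See the section on Exploding list-like column in the docs for more information (GH 16538, GH 10511)

Here is a typical use case. You have comma-separated strings in a column.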

In [9]: df = pd.DataFrame([{'var1': 'a,b,c', 'var2': 1},
   ...:                    {'var1': 'd,e,f', 'var2': 2}])
   ...: 

In [10]: df
Out[10]: 
    var1  var2
0  a,b,c     1
1  d,e,f     2

[2 rows x 2 columns]

Creating a long-form DataFrame is now straightforward using chained operations

In [11]: df.assign(var1=df.var1.str.split(',')).explode('var1')
Out[11]: 
  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
1    f     2

[6 rows x 2 columns]
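The Series variant behaves the same way. The following sketch, with made-up values, shows how scalars and empty lists are handled: a scalar is kept as a single row and an empty list becomes a missing value, while the original index label is repeated for every element.

```python
import pandas as pd

s = pd.Series([[1, 2, 3], "foo", [], [3, 4]])

# Each list element gets its own row; the index label is repeated.
exploded = s.explode()
print(exploded)
```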

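Other enhancements#

DataFrame.plot() keywords logy, logx and loglog can now accept the value 'sym' for symlog scaling. (GH 24867)
Added support for ISO week year format ('%G-%V-%u') when parsing datetimes using to_datetime() (GH 16607)
Indexing of DataFrame and Series now accepts zero-dimensional np.ndarray (GH 24919)
Timestamp.replace() now supports the fold argument to disambiguate DST transition times (GH 25017)
DataFrame.at_time() and Series.at_time() now support datetime.time objects with timezones (GH 24043)
DataFrame.pivot_table() now accepts an observed parameter which is passed to underlying calls to DataFrame.groupby() to speed up grouping categorical data. (GH 24923)
Series.str has gained the Series.str.casefold() method to remove all case distinctions present in a string (GH 25405)
DataFrame.set_index() now works for instances of abc.Iterator, provided their output is of the same length as the calling frame (GH 22484, GH 24984)
DatetimeIndex.union() now supports the sort argument. The behavior of the sort parameter matches that of Index.union() (GH 24994)
RangeIndex.union() now supports the sort argument. If sort=False, an unsorted Int64Index is always returned. sort=None is the default and returns a monotonically increasing RangeIndex if possible or a sorted Int64Index if not (GH 24471)
TimedeltaIndex.intersection() now also supports the sort keyword (GH 24471)
DataFrame.rename() now supports the errors argument to raise errors when attempting to rename nonexistent keys (GH 13473)
Added a Sparse accessor for working with a DataFrame whose values are sparse (GH 25681)
RangeIndex has gained start, stop, and step attributes (GH 25710)
datetime.timezone objects are now supported as arguments to timezone methods and constructors (GH 25065)
DataFrame.query() and DataFrame.eval() now support quoting column names with backticks to refer to names with spaces (GH 6508)
merge_asof() now gives a clearer error message when merge keys are categoricals that are not equal (GH 26136)
Rolling() supports the exponential (or Poisson) window type (GH 21303)
The error message for missing required imports now includes the original import error's text (GH 23868)
DatetimeIndex and TimedeltaIndex now have a mean method (GH 24757)
DataFrame.describe() now formats integer percentiles without a decimal point (GH 26660)
Added support for reading SPSS .sav files using read_spss() (GH 26537)
Added a new option plotting.backend to be able to select a plotting backend different from the existing matplotlib one. Use pandas.set_option('plotting.backend', '<backend-module>') where <backend-module> is a library implementing the pandas plotting API (GH 14130)
pandas.offsets.BusinessHour supports multiple opening hours intervals (GH 15481)
read_excel() can now use openpyxl to read Excel files via the engine='openpyxl' argument. This will become the default in a future release (GH 11499)
pandas.io.excel.read_excel() supports reading OpenDocument tables. Specify engine='odf' to enable. Consult the IO User Guide for more details (GH 9070)
Interval, IntervalIndex, and IntervalArray have gained an is_empty attribute denoting if the given interval(s) are empty (GH 27219)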
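A few of the smaller enhancements listed above can be tried together in one short sketch. The frame, its column names and the sample strings are made up for illustration; only query() with backticks, Series.str.casefold(), the errors argument of rename() and Interval.is_empty come from the list.

```python
import pandas as pd

df = pd.DataFrame({"total sales": [10, 20, 30], "region": ["N", "S", "N"]})

# Backticks let query() refer to a column whose name contains a space.
big = df.query("`total sales` > 15")

# casefold() removes case distinctions more aggressively than lower().
folded = pd.Series(["Straße", "PANDAS"]).str.casefold()

# errors="raise" turns a silently ignored rename into a KeyError.
try:
    df.rename(columns={"no such column": "x"}, errors="raise")
    rename_failed = False
except KeyError:
    rename_failed = True

# An interval that contains no points reports is_empty as True.
empty = pd.Interval(0, 0, closed="neither").is_empty
not_empty = pd.Interval(0, 0, closed="both").is_empty

print(big.index.tolist(), folded.tolist(), rename_failed, empty, not_empty)
```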

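Backwards incompatible API changes#

Indexing with date strings with UTC offsets#

Indexing a DataFrame or Series with a DatetimeIndex using a date string with a UTC offset would previously ignore the UTC offset. Now, the UTC offset is respected in indexing. (GH 24076, GH 16785)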

In [12]: df = pd.DataFrame([0], index=pd.DatetimeIndex(['2019-01-01'], tz='US/Pacific'))

In [13]: df
Out[13]: 
                           0
2019-01-01 00:00:00-08:00  0

[1 rows x 1 columns]

Previous behavior:

In [3]: df['2019-01-01 00:00:00+04:00':'2019-01-01 01:00:00+04:00']
Out[3]:
                           0
2019-01-01 00:00:00-08:00  0

New behavior:

In [14]: df['2019-01-01 12:00:00+04:00':'2019-01-01 13:00:00+04:00']
Out[14]: 
                           0
2019-01-01 00:00:00-08:00  0

[1 rows x 1 columns]

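`MultiIndex` constructed from levels and codes#

Constructing a MultiIndex with NaN levels or a codes value < -1 was allowed previously. Now, construction with a codes value < -1 is not allowed, and the codes corresponding to NaN levels are reassigned as -1. (GH 19387)

Previous behavior: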

In [1]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
   ...:               codes=[[0, -1, 1, 2, 3, 4]])
   ...:
Out[1]: MultiIndex(levels=[[nan, None, NaT, 128, 2]],
                   codes=[[0, -1, 1, 2, 3, 4]])

In [2]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
Out[2]: MultiIndex(levels=[[1, 2]],
                   codes=[[0, -2]])

New behavior:

In [15]: pd.MultiIndex(levels=[[np.nan, None, pd.NaT, 128, 2]],
   ....:               codes=[[0, -1, 1, 2, 3, 4]])
   ....: 
Out[15]: 
MultiIndex([(nan,),
            (nan,),
            (nan,),
            (nan,),
            (128,),
            (  2,)],
           )

In [16]: pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[16], line 1
----> 1 pd.MultiIndex(levels=[[1, 2]], codes=[[0, -2]])

File /home/pandas/pandas/core/indexes/multi.py:338, in MultiIndex.__new__(cls, levels, codes, sortorder, names, dtype, copy, name, verify_integrity)
    335     result.sortorder = sortorder
    337 if verify_integrity:
--> 338     new_codes = result._verify_integrity()
    339     result._codes = new_codes
    341 result._reset_identity()

File /home/pandas/pandas/core/indexes/multi.py:422, in MultiIndex._verify_integrity(self, codes, levels, levels_to_verify)
    416     raise ValueError(
    417         f"On level {i}, code max ({level_codes.max()}) >= length of "
    418         f"level ({len(level)}). NOTE: this index is in an "
    419         "inconsistent state"
    420     )
    421 if len(level_codes) and level_codes.min() < -1:
--> 422     raise ValueError(f"On level {i}, code value ({level_codes.min()}) < -1")
    423 if not level.is_unique:
    424     raise ValueError(
    425         f"Level values must be unique: {list(level)} on level {i}"
    426     )

ValueError: On level 0, code value (-2) < -1

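`GroupBy.apply` on `DataFrame` evaluates first group only once#

The implementation of DataFrameGroupBy.apply() previously evaluated the supplied function consistently twice on the first group to infer whether it is safe to use a fast code path. Particularly for functions with side effects, this was undesired behavior and may have led to surprises. (GH 2936, GH 2656, GH 7739, GH 10519, GH 12155, GH 20084, GH 21417)

Now every group is evaluated only a single time.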

In [17]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

In [18]: df
Out[18]: 
   a  b
0  x  1
1  y  2

[2 rows x 2 columns]

In [19]: def func(group):
   ....:     print(group.name)
   ....:     return group
   ....: 

Previous behavior:

In [3]: df.groupby('a').apply(func)
x
x
y
Out[3]:
   a  b
0  x  1
1  y  2

New behavior:

In [3]: df.groupby('a').apply(func)
x
y
Out[3]:
   a  b
0  x  1
1  y  2
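If the output of a print call is awkward to inspect, the same check can be made by recording the calls in a list. This is only a sketch: the helper name record_and_sum and the list seen are illustrative, and the value column is selected explicitly so that the result does not depend on how a given pandas version treats the grouping column inside apply().

```python
import pandas as pd

df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})
seen = []


def record_and_sum(group):
    # Side effect: remember the label of the group being processed.
    seen.append(group.name)
    return group["b"].sum()


result = df.groupby("a")[["b"]].apply(record_and_sum)
print(seen)
```

With the single-evaluation behavior, each group label appears in seen exactly once.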

Concatenating sparse values#

When passed DataFrames whose values are sparse, concat() will now return a Series or DataFrame with sparse values, rather than a SparseDataFrame (GH 25702).

In [20]: df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 1])})

Previous behavior:

In [2]: type(pd.concat([df, df]))
pandas.core.sparse.frame.SparseDataFrame

New behavior:

In [21]: type(pd.concat([df, df]))
Out[21]: pandas.DataFrame

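This now matches the existing behavior of concat on Series with sparse values. concat() will continue to return a SparseDataFrame when all the values are instances of SparseDataFrame.

This change also affects routines using concat() internally, like get_dummies(), which now returns a DataFrame in all cases (previously a SparseDataFrame was returned if all the columns were dummy encoded, and a DataFrame otherwise).

Providing any SparseSeries or SparseDataFrame to concat() will cause a SparseSeries or SparseDataFrame to be returned, as before.

The `.str`-accessor performs stricter type checks#

Due to the lack of more fine-grained dtypes, Series.str so far only checked whether the data was of object dtype. Series.str will now infer the dtype of the data within the Series; in particular, 'bytes'-only data will raise an exception (except for Series.str.decode(), Series.str.get(), Series.str.len(), Series.str.slice()), see GH 23163, GH 23011, GH 23551.

Previous behavior: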

In [1]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)

In [2]: s
Out[2]:
0      b'a'
1     b'ba'
2    b'cba'
dtype: object

In [3]: s.str.startswith(b'a')
Out[3]:
0     True
1    False
2    False
dtype: bool

New behavior:

In [22]: s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)

In [23]: s
Out[23]: 
0      b'a'
1     b'ba'
2    b'cba'
Length: 3, dtype: object

In [24]: s.str.startswith(b'a')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[24], line 1
----> 1 s.str.startswith(b'a')

File /home/pandas/pandas/core/strings/accessor.py:136, in forbid_nonstring_types.<locals>._forbid_nonstring_types.<locals>.wrapper(self, *args, **kwargs)
    131 if self._inferred_dtype not in allowed_types:
    132     msg = (
    133         f"Cannot use .str.{func_name} with values of "
    134         f"inferred dtype '{self._inferred_dtype}'."
    135     )
--> 136     raise TypeError(msg)
    137 return func(self, *args, **kwargs)

TypeError: Cannot use .str.startswith with values of inferred dtype 'bytes'.
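One way to keep working with such data is to decode it first, since Series.str.decode() is among the methods that still accept bytes. The sketch below assumes the bytes are UTF-8 encoded; that encoding is an assumption of the example, not something the release notes prescribe.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array(['a', 'ba', 'cba'], 'S'), dtype=object)

# Calling a string method directly on bytes data is rejected.
try:
    s.str.startswith(b'a')
    raised = False
except TypeError:
    raised = True

# Decoding first yields text values that the accessor accepts.
decoded = s.str.decode('utf-8')
starts_with_a = decoded.str.startswith('a')
print(raised, starts_with_a.tolist())
```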

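Categorical dtypes are preserved during GroupBy#

Previously, columns that were categorical, but not the groupby key(s), would be converted to object dtype during groupby operations. pandas will now preserve these dtypes. (GH 18502)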

In [25]: cat = pd.Categorical(["foo", "bar", "bar", "qux"], ordered=True)

In [26]: df = pd.DataFrame({'payload': [-1, -2, -1, -2], 'col': cat})

In [27]: df
Out[27]: 
   payload  col
0       -1  foo
1       -2  bar
2       -1  bar
3       -2  qux

[4 rows x 2 columns]

In [28]: df.dtypes
Out[28]: 
payload       int64
col        category
Length: 2, dtype: object

Previous behavior:

In [5]: df.groupby('payload').first().col.dtype
Out[5]: dtype('O')

New behavior:

In [29]: df.groupby('payload').first().col.dtype
Out[29]: CategoricalDtype(categories=['bar', 'foo', 'qux'], ordered=True, categories_dtype=object)

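Incompatible Index type unions#

When performing Index.union() operations between objects of incompatible dtypes, the result will be a base Index of dtype object. This behavior holds true for unions between Index objects that previously would have been prohibited. The dtype of empty Index objects will now be evaluated before performing union operations rather than simply returning the other Index object. Index.union() can now be considered commutative, such that A.union(B) == B.union(A) (GH 23525).

Previous behavior: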

In [1]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
...
ValueError: can only call with other PeriodIndex-ed objects

In [2]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
Out[2]: Int64Index([1, 2, 3], dtype='int64')

New behavior:

In [3]: pd.period_range('19910905', periods=2).union(pd.Int64Index([1, 2, 3]))
Out[3]: Index([1991-09-05, 1991-09-06, 1, 2, 3], dtype='object')
In [4]: pd.Index([], dtype=object).union(pd.Index([1, 2, 3]))
Out[4]: Index([1, 2, 3], dtype='object')

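Note that integer- and floating-dtype indexes are considered "compatible". The integer values are coerced to floating point, which may result in loss of precision. See Set operations on Index objects for more.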
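The difference between compatible and incompatible dtypes can be seen in a small sketch. The index values are made up, and sort=False is passed for the mixed integer/string case as an assumption of this example, so that no attempt is made to order values that cannot be compared with each other.

```python
import pandas as pd

ints = pd.Index([1, 2, 3])
floats = pd.Index([1.5, 2.5])
labels = pd.Index(["a", "b"])

# Integer and float indexes are compatible: the result is a float index.
compatible = ints.union(floats)

# Integers and strings are not: the result falls back to object dtype,
# and both orders of the operands contain the same elements.
incompatible = ints.union(labels, sort=False)
reverse = labels.union(ints, sort=False)

print(compatible.dtype, incompatible.dtype)
```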

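`DataFrame` GroupBy ffill/bfill no longer return group labels#

The methods ffill, bfill, pad and backfill of DataFrameGroupBy previously included the group labels in the return value, which was inconsistent with other groupby transforms. Now only the filled values are returned. (GH 21521)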

In [30]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

In [31]: df
Out[31]: 
   a  b
0  x  1
1  y  2

[2 rows x 2 columns]

Previous behavior:

In [3]: df.groupby("a").ffill()
Out[3]:
   a  b
0  x  1
1  y  2

New behavior:

In [32]: df.groupby("a").ffill()
Out[32]: 
   b
0  1
1  2

[2 rows x 1 columns]

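`DataFrame` describe on an empty Categorical / object column will return top and freq#

When calling DataFrame.describe() with an empty categorical / object column, the 'top' and 'freq' rows were previously omitted, which was inconsistent with the output for non-empty columns. Now the 'top' and 'freq' rows will always be included, with numpy.nan in the case of an empty DataFrame (GH 26397)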

In [33]: df = pd.DataFrame({"empty_col": pd.Categorical([])})

In [34]: df
Out[34]: 
Empty DataFrame
Columns: [empty_col]
Index: []

[0 rows x 1 columns]

Previous behavior:

In [3]: df.describe()
Out[3]:
        empty_col
count           0
unique          0

New behavior:

In [35]: df.describe()
Out[35]: 
       empty_col
count          0
unique         0
top          NaN
freq         NaN

[4 rows x 1 columns]

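`__str__` methods now call `__repr__` rather than vice versa#

pandas has until now mostly defined string representations in a pandas object's __str__/__unicode__/__bytes__ methods, and called __str__ from the __repr__ method if a specific __repr__ method was not found. This is not needed for Python 3. In pandas 0.25, the string representations of pandas objects are now generally defined in __repr__, and calls to __str__ in general now pass the call on to __repr__ if a specific __str__ method doesn't exist, as is standard for Python. This change is backward compatible for direct usage of pandas, but if you subclass pandas objects and give your subclasses specific __str__/__repr__ methods, you may have to adjust your __str__/__repr__ methods (GH 26495).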
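For subclass authors, the practical consequence can be sketched as follows. LabeledSeries is a hypothetical subclass invented for this illustration: it overrides only __repr__, and str() is expected to pick that override up through the standard Python fallback.

```python
import pandas as pd


class LabeledSeries(pd.Series):
    """Hypothetical subclass used only for illustration."""

    def __repr__(self):
        return "LabeledSeries(n={})".format(len(self))


s = LabeledSeries([1, 2, 3])

# No __str__ is defined, so str() falls back to the overridden __repr__.
print(str(s))
print(repr(s))
```

A subclass that previously customised only __str__ may therefore want to move that logic into __repr__.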

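Indexing an `IntervalIndex` with `Interval` objects#

Indexing methods for IntervalIndex have been modified to require exact matches only for Interval queries. IntervalIndex methods previously matched on any overlapping Interval. Behavior with scalar points, e.g. querying with an integer, is unchanged (GH 16316).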

In [36]: ii = pd.IntervalIndex.from_tuples([(0, 4), (1, 5), (5, 8)])

In [37]: ii
Out[37]: IntervalIndex([(0, 4], (1, 5], (5, 8]], dtype='interval[int64, right]')

The in operator (__contains__) now only returns True for exact matches to Intervals in the IntervalIndex, whereas this would previously return True for any Interval overlapping an Interval in the IntervalIndex.

Previous behavior:

In [4]: pd.Interval(1, 2, closed='neither') in ii
Out[4]: True

In [5]: pd.Interval(-10, 10, closed='both') in ii
Out[5]: True

New behavior:

In [38]: pd.Interval(1, 2, closed='neither') in ii
Out[38]: False

In [39]: pd.Interval(-10, 10, closed='both') in ii
Out[39]: False

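The get_loc() method now only returns locations for exact matches to Interval queries, as opposed to the previous behavior of returning locations for overlapping matches. A KeyError will be raised if an exact match is not found.

Previous behavior: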

In [6]: ii.get_loc(pd.Interval(1, 5))
Out[6]: array([0, 1])

In [7]: ii.get_loc(pd.Interval(2, 6))
Out[7]: array([0, 1, 2])

New behavior:

In [6]: ii.get_loc(pd.Interval(1, 5))
Out[6]: 1

In [7]: ii.get_loc(pd.Interval(2, 6))
---------------------------------------------------------------------------
KeyError: Interval(2, 6, closed='right')

Likewise, get_indexer() and get_indexer_non_unique() will also only return locations for exact matches to Interval queries, with -1 denoting that an exact match was not found.
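A minimal sketch of the exact-match rule for get_indexer() follows. It deliberately uses a small non-overlapping index built with IntervalIndex.from_breaks rather than the overlapping ii from above, an assumption made so that each query can match at most one position.

```python
import pandas as pd

# Three adjacent intervals: (0, 1], (1, 2], (2, 3]
index = pd.IntervalIndex.from_breaks([0, 1, 2, 3])

queries = [
    pd.Interval(1, 2),  # exactly equal to the second interval
    pd.Interval(0, 2),  # overlaps two intervals but equals none
]

positions = index.get_indexer(queries)
print(positions)
```

The second query only overlaps existing intervals, so it is reported as -1 rather than as a match.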

These indexing changes extend to querying a Series or DataFrame with an IntervalIndex index.

In [40]: s = pd.Series(list('abc'), index=ii)

In [41]: s
Out[41]: 
(0, 4]    a
(1, 5]    b
(5, 8]    c
Length: 3, dtype: object

Selecting from a Series or DataFrame using [] (__getitem__) or loc now only returns exact matches for Interval queries.

Previous behavior:

In [8]: s[pd.Interval(1, 5)]
Out[8]:
(0, 4]    a
(1, 5]    b
dtype: object

In [9]: s.loc[pd.Interval(1, 5)]
Out[9]:
(0, 4]    a
(1, 5]    b
dtype: object

New behavior:

In [42]: s[pd.Interval(1, 5)]
Out[42]: 'b'

In [43]: s.loc[pd.Interval(1, 5)]
Out[43]: 'b'

Similarly, a KeyError will be raised for non-exact matches instead of returning overlapping matches.

Previous behavior:

In [9]: s[pd.Interval(2, 3)]
Out[9]:
(0, 4]    a
(1, 5]    b
dtype: object

In [10]: s.loc[pd.Interval(2, 3)]
Out[10]:
(0, 4]    a
(1, 5]    b
dtype: object

New behavior:

In [6]: s[pd.Interval(2, 3)]
---------------------------------------------------------------------------
KeyError: Interval(2, 3, closed='right')

In [7]: s.loc[pd.Interval(2, 3)]
---------------------------------------------------------------------------
KeyError: Interval(2, 3, closed='right')

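The overlaps() method can be used to create a boolean indexer that replicates the previous behavior of returning overlapping matches.

New behavior: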

In [44]: idxr = s.index.overlaps(pd.Interval(2, 3))

In [45]: idxr
Out[45]: array([ True,  True, False])

In [46]: s[idxr]
Out[46]: 
(0, 4]    a
(1, 5]    b
Length: 2, dtype: object

In [47]: s.loc[idxr]
Out[47]: 
(0, 4]    a
(1, 5]    b
Length: 2, dtype: object

Binary ufuncs on Series now align#

Applying a binary ufunc like numpy.power() now aligns the inputs when both are Series (GH 23293).

In [48]: s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

In [49]: s2 = pd.Series([3, 4, 5], index=['d', 'c', 'b'])

In [50]: s1
Out[50]: 
a    1
b    2
c    3
Length: 3, dtype: int64

In [51]: s2
Out[51]: 
d    3
c    4
b    5
Length: 3, dtype: int64

Previous behavior

In [5]: np.power(s1, s2)
Out[5]:
a      1
b     16
c    243
dtype: int64

New behavior

In [52]: np.power(s1, s2)
Out[52]: 
a     1.0
b    32.0
c    81.0
d     NaN
Length: 4, dtype: float64

This matches the behavior of other binary operations in pandas, like Series.add(). To retain the previous behavior, convert the other Series to an array before applying the ufunc.

In [53]: np.power(s1, s2.array)
Out[53]: 
a      1
b     16
c    243
Length: 3, dtype: int64

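Categorical.argsort now places missing values at the end#

Categorical.argsort() now places missing values at the end of the array, making it consistent with NumPy and the rest of pandas (GH 21801).

In [54]: cat = pd.Categorical(['b', None, 'a'], categories=['a', 'b'], ordered=True)

Previous behavior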

In [2]: cat = pd.Categorical(['b', None, 'a'], categories=['a', 'b'], ordered=True)

In [3]: cat.argsort()
Out[3]: array([1, 2, 0])

In [4]: cat[cat.argsort()]
Out[4]:
[NaN, a, b]
categories (2, object): [a < b]

New behavior

In [55]: cat.argsort()
Out[55]: array([2, 0, 1])

In [56]: cat[cat.argsort()]
Out[56]: 
['a', 'b', NaN]
Categories (2, object): ['a' < 'b']

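Column order is preserved when passing a list of dicts to DataFrame#

Starting with Python 3.7 the key order of dict is guaranteed. In practice, this has been true since Python 3.6. The DataFrame constructor now treats a list of dicts in the same way as it does a list of OrderedDict, i.e. preserving the order of the dicts. This change applies only when pandas is running on Python>=3.6 (GH 27309).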

In [57]: data = [
   ....:     {'name': 'Joe', 'state': 'NY', 'age': 18},
   ....:     {'name': 'Jane', 'state': 'KY', 'age': 19, 'hobby': 'Minecraft'},
   ....:     {'name': 'Jean', 'state': 'OK', 'age': 20, 'finances': 'good'}
   ....: ]
   ....: 

Previous behavior:

The columns were lexicographically sorted previously,

In [1]: pd.DataFrame(data)
Out[1]:
   age finances      hobby  name state
0   18      NaN        NaN   Joe    NY
1   19      NaN  Minecraft  Jane    KY
2   20     good        NaN  Jean    OK

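New behavior:

The column order now matches the insertion order of the keys in the dict, considering all the records from top to bottom. As a consequence, the column order of the resulting DataFrame has changed compared to previous pandas versions.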

In [58]: pd.DataFrame(data)
Out[58]: 
   name state  age      hobby finances
0   Joe    NY   18        NaN      NaN
1  Jane    KY   19  Minecraft      NaN
2  Jean    OK   20        NaN     good

[3 rows x 5 columns]

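Increased minimum versions for dependencies#

Due to dropping support for Python 2.7, a number of optional dependencies have updated minimum versions (GH 25725, GH 24942, GH 25752). Independently, some minimum supported versions of dependencies were updated (GH 23519, GH 25554). If installed, we now require:

Package	Minimum Version	Required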
numpy	1.13.3	X
pytz	2015.4	X
python-dateutil	2.6.1	X
bottleneck	1.2.1
numexpr	2.6.2
pytest (dev)	4.0.2

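For optional libraries the general recommendation is to use the latest version. The following table lists the lowest version per library that is currently being tested throughout the development of pandas. Optional libraries below the lowest tested version may still work, but are not considered supported.

Package	Minimum Version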
beautifulsoup4	4.6.0
fastparquet	0.2.1
gcsfs	0.2.2
lxml	3.8.0
matplotlib	2.2.2
openpyxl	2.4.8
pyarrow	0.9.0
pymysql	0.7.1
pytables	3.4.2
scipy	0.19.0
sqlalchemy	1.1.4
xarray	0.8.2
xlrd	1.1.0
xlsxwriter	0.9.8
xlwt	1.2.0

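See Dependencies and Optional dependencies for more.

Other API changes#

DatetimeTZDtype will now standardize pytz timezones to a common timezone instance (GH 24713)
Timestamp and Timedelta scalars now implement the to_numpy() method as aliases to Timestamp.to_datetime64() and Timedelta.to_timedelta64(), respectively. (GH 24653)
Timestamp.strptime() will now raise a NotImplementedError (GH 25016)
Comparing Timestamp with unsupported objects now returns NotImplemented instead of raising TypeError. This implies that unsupported rich comparisons are delegated to the other object, and are now consistent with Python 3 behavior for datetime objects (GH 24011)
Bug in DatetimeIndex.snap() which didn't preserve the name of the input Index (GH 25575)
The arg argument in DataFrameGroupBy.agg() has been renamed to func (GH 26089)
The arg argument in Window.aggregate() has been renamed to func (GH 26372)
Most pandas classes had a __bytes__ method, which was used for getting a Python 2-style bytestring representation of the object. This method has been removed as a part of dropping Python 2 (GH 26447)
The .str-accessor has been disabled for 1-level MultiIndex, use MultiIndex.to_flat_index() if necessary (GH 23679)
Removed support of the gtk package for clipboards (GH 26563)
Using an unsupported version of Beautiful Soup 4 will now raise an ImportError instead of a ValueError (GH 27063)
Series.to_excel() and DataFrame.to_excel() will now raise a ValueError when saving timezone-aware data. (GH 27008, GH 7056)
ExtensionArray.argsort() places NA values at the end of the sorted array. (GH 21801)
DataFrame.to_hdf() and Series.to_hdf() will now raise a NotImplementedError when saving a MultiIndex with extension data types for a fixed format. (GH 7775)
Passing duplicate names in read_csv() will now raise a ValueError (GH 17346)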
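Two of the changes in this list are easy to check interactively. The sketch below uses made-up inputs (an arbitrary date, a one-day Timedelta and a tiny in-memory CSV) to exercise the new to_numpy() scalar methods and the duplicate-names check in read_csv().

```python
import io

import numpy as np
import pandas as pd

# to_numpy() on scalars returns the corresponding NumPy scalar types.
ts64 = pd.Timestamp("2019-07-18").to_numpy()
td64 = pd.Timedelta("1 day").to_numpy()

# Duplicate entries in names are rejected with a ValueError.
try:
    pd.read_csv(io.StringIO("1,2\n3,4\n"), names=["a", "a"])
    duplicate_names_rejected = False
except ValueError:
    duplicate_names_rejected = True

print(type(ts64).__name__, type(td64).__name__, duplicate_names_rejected)
```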

Deprecations#

Sparse subclasses#

The SparseSeries and SparseDataFrame subclasses are deprecated. Their functionality is better provided by a Series or DataFrame with sparse values.

Previous way

df = pd.SparseDataFrame({"A": [0, 0, 1, 2]})
df.dtypes

New way

In [59]: df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 0, 1, 2])})

In [60]: df.dtypes
Out[60]: 
A    Sparse[int64, 0]
Length: 1, dtype: object

The memory usage of the two approaches is identical (GH 19239).
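A DataFrame built this way can be inspected through the sparse accessor mentioned under the enhancements. The sketch below is illustrative only; it reads the density (the share of values that are actually stored) and converts back to a dense frame.

```python
import pandas as pd

df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 0, 1, 2])})

# Two of the four values differ from the fill value 0.
density = df.sparse.density

# to_dense() returns an ordinary DataFrame with the same values.
dense = df.sparse.to_dense()

print(density, dense["A"].tolist())
```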

msgpack format#

The msgpack format is deprecated as of 0.25 and will be removed in a future version. It is recommended to use pyarrow for on-the-wire transmission of pandas objects. (GH 27084)

Other deprecations#

The deprecated .ix[] indexer now raises a more visible FutureWarning instead of DeprecationWarning (GH 26438).
Deprecated the units=M (months) and units=Y (year) parameters for units of pandas.to_timedelta(), pandas.Timedelta() and pandas.TimedeltaIndex() (GH 16344)
pandas.concat() has deprecated the join_axes keyword. Instead, use DataFrame.reindex() or DataFrame.reindex_like() on the result or on the inputs (GH 21951)
The SparseArray.values attribute is deprecated. You can use np.asarray(...) or the SparseArray.to_dense() method instead (GH 26421).
The functions pandas.to_datetime() and pandas.to_timedelta() have deprecated the box keyword. Instead, use to_numpy() or Timestamp.to_datetime64() or Timedelta.to_timedelta64(). (GH 24416)
The DataFrame.compound() and Series.compound() methods are deprecated and will be removed in a future version (GH 26405).
The internal attributes _start, _stop and _step of RangeIndex have been deprecated. Use the public attributes start, stop and step instead (GH 26581).
The Series.ftype(), Series.ftypes() and DataFrame.ftypes() methods are deprecated and will be removed in a future version. Instead, use Series.dtype() and DataFrame.dtypes() (GH 26705).
The Series.get_values(), DataFrame.get_values(), Index.get_values(), SparseArray.get_values() and Categorical.get_values() methods are deprecated. One of np.asarray(..) or to_numpy() can be used instead (GH 19617).
The 'outer' method on NumPy ufuncs, e.g. np.subtract.outer, has been deprecated on Series objects. Convert the input to an array with Series.array first (GH 27186)
Timedelta.resolution() is deprecated and replaced with Timedelta.resolution_string(). In a future version, Timedelta.resolution() will be changed to behave like the standard library datetime.timedelta.resolution (GH 21344)
read_table() has been undeprecated. (GH 25220)
Index.dtype_str is deprecated. (GH 18262)
Series.imag and Series.real are deprecated. (GH 18262)
Series.put() is deprecated. (GH 18262)
Index.item() and Series.item() are deprecated. (GH 18262)
The default value ordered=None in CategoricalDtype has been deprecated in favor of ordered=False. When converting between categorical types, ordered=True must be explicitly passed in order to be preserved. (GH 26336)
Index.contains() is deprecated. Use key in index (__contains__) instead (GH 17753).
DataFrame.get_dtype_counts() is deprecated. (GH 18262)
Categorical.ravel() will return a Categorical instead of a np.ndarray (GH 27199)
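Several of the suggested replacements can be shown side by side. This is a sketch with made-up data; note that the outer-product line converts with to_numpy(), which is a substitution chosen for this example rather than the Series.array conversion named in the list.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])

# Instead of Index.contains(key): use the in operator.
has_b = "b" in s.index

# Instead of get_values(): use to_numpy() or np.asarray().
values = s.to_numpy()
same_values = np.asarray(s)

# Instead of the private _start/_stop/_step: use the public attributes.
rng = pd.RangeIndex(0, 10, 2)
bounds = (rng.start, rng.stop, rng.step)

# Instead of calling ufunc.outer on a Series: convert to arrays first.
outer = np.subtract.outer(s.to_numpy(), s.to_numpy())

print(has_b, values.tolist(), bounds, outer.shape)
```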

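Removal of prior version deprecations/changes#

Removed Panel (GH 25047, GH 25191, GH 25231)
Removed the previously deprecated sheetname keyword in read_excel() (GH 16442, GH 20938)
Removed the previously deprecated TimeGrouper (GH 16942)
Removed the previously deprecated parse_cols keyword in read_excel() (GH 16488)
Removed the previously deprecated pd.options.html.border (GH 16970)
Removed the previously deprecated convert_objects (GH 11221)
Removed the previously deprecated select method of DataFrame and Series (GH 17633)
Removed the previously deprecated behavior of Series being treated as list-like in rename_categories() (GH 17982)
Removed the previously deprecated DataFrame.reindex_axis and Series.reindex_axis (GH 17842)
Removed the previously deprecated behavior of altering column or index labels with Series.rename_axis() or DataFrame.rename_axis() (GH 17842)
Removed the previously deprecated tupleize_cols keyword argument in read_html(), read_csv(), and DataFrame.to_csv() (GH 17877, GH 17820)
Removed the previously deprecated DataFrame.from_csv and Series.from_csv (GH 17812)
Removed the previously deprecated raise_on_error keyword argument in DataFrame.where() and DataFrame.mask() (GH 17744)
Removed the previously deprecated ordered and categories keyword arguments in astype (GH 17742)
Removed the previously deprecated cdate_range (GH 17691)
Removed the previously deprecated True option for the dropna keyword argument in SeriesGroupBy.nth() (GH 17493)
Removed the previously deprecated convert keyword argument in Series.take() and DataFrame.take() (GH 17352)
Removed the previously deprecated behavior of arithmetic operations with datetime.date objects (GH 21152)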

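Performance improvements#

Significant speedup in SparseArray initialization that benefits most operations, fixing a performance regression introduced in v0.20.0 (GH 24985)
DataFrame.to_stata() is now faster when outputting data with any string or non-native endian columns (GH 25045)
Improved performance of Series.searchsorted(). The speedup is especially large when the dtype is int8/int16/int32 and the searched key is within the integer bounds for the dtype (GH 22034)
Improved performance of GroupBy.quantile() (GH 20405)
Improved performance of slicing and other selected operations on a RangeIndex (GH 26565, GH 26617, GH 26722)
RangeIndex now performs standard lookup without instantiating an actual hashtable, hence saving memory (GH 16685)
Improved performance of read_csv() by faster tokenizing and faster parsing of small float numbers (GH 25784)
Improved performance of read_csv() by faster parsing of N/A and boolean values (GH 25804)
Improved performance of IntervalIndex.is_monotonic, IntervalIndex.is_monotonic_increasing and IntervalIndex.is_monotonic_decreasing by removing conversion to MultiIndex (GH 24813)
Improved performance of DataFrame.to_csv() when writing datetime dtypes (GH 25708)
Improved performance of read_csv() by much faster parsing of MM/YYYY and DD/MM/YYYY datetime formats (GH 25922)
Improved performance of nanops for dtypes that cannot store NaNs. The speedup is particularly prominent for Series.all() and Series.any() (GH 25070)
Improved performance of Series.map() for dictionary mappers on categorical series by mapping the categories instead of mapping all values (GH 23785)
Improved performance of IntervalIndex.intersection() (GH 24813)
Improved performance of read_csv() by faster concatenation of date columns without extra conversion to string for integer/float zero and float NaN, and by faster checking of whether a string could be a date (GH 25754)
Improved performance of IntervalIndex.is_unique by removing conversion to MultiIndex (GH 24813)
Restored performance of DatetimeIndex.__iter__() by re-enabling a specialized code path (GH 26702)
Improved performance when building a MultiIndex with at least one CategoricalIndex level (GH 22044)
Improved performance by removing the need for a garbage collection when checking for SettingWithCopyWarning (GH 27031)
For to_datetime(), changed the default value of the cache parameter to True (GH 26043)
Improved performance of DatetimeIndex and PeriodIndex slicing given non-unique, monotonic data (GH 27136).
Improved performance of pd.read_json() for index-oriented data. (GH 26773)
Improved performance of MultiIndex.shape() (GH 27384).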

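Bug fixes#

Categorical#

Bug in DataFrame.at() and Series.at() that would raise an exception if the index was a CategoricalIndex (GH 20629)
Fixed bug in comparison of an ordered Categorical that contained missing values with a scalar, which sometimes incorrectly resulted in True (GH 26504)
Bug in DataFrame.dropna() when the DataFrame has a CategoricalIndex containing Interval objects incorrectly raised a TypeError (GH 25087)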

Datetimelike#

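Bug in to_datetime() which would raise an (incorrect) ValueError when called with a date far into the future and the format argument specified, instead of raising OutOfBoundsDatetime (GH 23830)
Bug in to_datetime() which would raise InvalidIndexError: Reindexing only valid with uniquely valued Index objects when called with cache=True, with arg including at least two different elements from the set {None, numpy.nan, pandas.NaT} (GH 22305)
Bug in DataFrame and Series where timezone-aware data with dtype='datetime64[ns]' was not cast to naive (GH 25843)
Improved Timestamp type checking in various datetime functions to prevent exceptions when using a subclassed datetime (GH 25851)
Bug in Series and DataFrame repr where np.datetime64('NaT') and np.timedelta64('NaT') with dtype=object would be represented as NaN (GH 25445)
Bug in to_datetime() which does not replace the invalid argument with NaT when errors is set to coerce (GH 26122)
Bug in adding a DateOffset with nonzero month to a DatetimeIndex would raise ValueError (GH 26258)
Bug in to_datetime() which raises an unhandled OverflowError when called with a mix of invalid dates and NaN values with format='%Y%m%d' and errors='coerce' (GH 25512)
Bug in isin() for datetimelike indexes; DatetimeIndex, TimedeltaIndex and PeriodIndex where the levels parameter was ignored. (GH 26675)
Bug in to_datetime() which raises TypeError for format='%Y%m%d' when called for invalid integer dates with length >= 6 digits with errors='ignore'
Bug when comparing a PeriodIndex against a zero-dimensional numpy array (GH 26689)
Bug in constructing a Series or DataFrame from a numpy datetime64 array with a non-ns unit and out-of-bound timestamps generating rubbish data, which will now correctly raise an OutOfBoundsDatetime error (GH 26206).
Bug in date_range() with an unnecessary OverflowError being raised for very large or very small dates (GH 26651)
Bug where adding a Timestamp to a np.timedelta64 object would raise instead of returning a Timestamp (GH 24775)
Bug where comparing a zero-dimensional numpy array containing a np.datetime64 object to a Timestamp would incorrectly raise TypeError (GH 26916)
Bug in to_datetime() which would raise ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True when called with cache=True, with arg including datetime strings with different offsets (GH 26097)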

Timedelta#

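Bug in TimedeltaIndex.intersection() where for non-monotonic indices in some cases an empty Index was returned when in fact an intersection existed (GH 25913)
Bug with comparisons between Timedelta and NaT raising TypeError (GH 26039)
Bug when adding or subtracting a BusinessHour to a Timestamp with the resulting time landing in a following or prior day respectively (GH 26381)
Bug when comparing a TimedeltaIndex against a zero-dimensional numpy array (GH 26689)

Timezones#

Bug in DatetimeIndex.to_frame() where timezone-aware data would be converted to timezone-naive data (GH 25809)
Bug in to_datetime() with utc=True and datetime strings that would apply previously parsed UTC offsets to subsequent arguments (GH 24992)
Bug in Timestamp.tz_localize() and Timestamp.tz_convert() not propagating freq (GH 25241)
Bug in Series.at() where setting a Timestamp with timezone raises TypeError (GH 25506)
Bug in DataFrame.update() when updating with timezone-aware data would return timezone-naive data (GH 25807)
Bug in to_datetime() where an uninformative RuntimeError was raised when passing a naive Timestamp with datetime strings with mixed UTC offsets (GH 25978)
Bug in to_datetime() with unit='ns' would drop timezone information from the parsed argument (GH 26168)
Bug in DataFrame.join() where joining a timezone-aware index with a timezone-aware column would result in a column of NaN (GH 26335)
Bug in date_range() where ambiguous or nonexistent start or end times were not handled by the ambiguous or nonexistent keywords respectively (GH 27088)
Bug in DatetimeIndex.union() when combining a timezone-aware and a timezone-unaware DatetimeIndex (GH 21671)
Bug when applying a numpy reduction function (e.g. numpy.minimum()) to a timezone-aware Series (GH 15552)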

Numeric#

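Bug in to_numeric() in which large negative numbers were being improperly handled (GH 24910)
Bug in to_numeric() in which numbers were being coerced to float, even though errors was not coerce (GH 24910)
Bug in to_numeric() in which invalid values for errors were being allowed (GH 26466)
Bug in format in which floating point complex numbers were not being formatted to proper display precision and trimming (GH 25514)
Bug in error messages in DataFrame.corr() and Series.corr(). Added the possibility of using a callable. (GH 25729)
Bug in Series.divmod() and Series.rdivmod() which would raise an (incorrect) ValueError rather than return a pair of Series objects as result (GH 25557)
Raises a helpful exception when a non-numeric index is sent to interpolate() with methods which require a numeric index. (GH 21662)
Bug in eval() when comparing floats with scalar operators, for example: x < -0.1 (GH 25928)
Fixed bug where casting an all-boolean array to an integer extension array failed (GH 25211)
Bug in divmod with a Series object containing zeros incorrectly raising AttributeError (GH 26987)
Inconsistency in Series floor-division (//) and divmod filling positive//zero with NaN instead of Inf (GH 27321)

Conversion#

Bug in DataFrame.astype() where the errors parameter was ignored when passing a dict of columns and types. (GH 25905)

Strings#

Bug in the __name__ attribute of several methods of Series.str, which were set incorrectly (GH 23551)
Improved error message when passing a Series of wrong dtype to Series.str.cat() (GH 22722)

Interval#

Construction of Interval is restricted to numeric, Timestamp and Timedelta endpoints (GH 23013)
Fixed bug in Series/DataFrame not displaying NaN in an IntervalIndex with missing values (GH 25984)
Bug in IntervalIndex.get_loc() where a KeyError would be incorrectly raised for a decreasing IntervalIndex (GH 25860)
Bug in the Index constructor where passing mixed closed Interval objects would result in a ValueError instead of an object dtype Index (GH 27172)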

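Indexing#

Improved exception message when calling DataFrame.iloc() with a list of non-numeric objects (GH 25753).
Improved exception message when calling .iloc or .loc with a boolean indexer of a different length (GH 26658).
Bug in the KeyError exception message when indexing a MultiIndex with a non-existent key not displaying the original key (GH 27250).
Bug in .iloc and .loc with a boolean indexer not raising an IndexError when too few items are passed (GH 26658).
Bug in DataFrame.loc() and Series.loc() where KeyError was not raised for a MultiIndex when the key was less than or equal to the number of levels in the MultiIndex (GH 14885).
Bug in which DataFrame.append() produced an erroneous warning indicating that a KeyError will be thrown in the future when the data to be appended contains new columns (GH 22252).
Bug in which DataFrame.to_csv() caused a segfault for a reindexed data frame when the index was a single-level MultiIndex (GH 26303).
Fixed bug where assigning an arrays.PandasArray to a DataFrame would raise an error (GH 26390)
Allow keyword arguments for a callable local reference used in the DataFrame.query() string (GH 26426)
Fixed a KeyError when indexing a MultiIndex level with a list containing exactly one label, which is missing (GH 27148)
Bug which produced AttributeError on partial matching of a Timestamp in a MultiIndex (GH 26944)
Bug in Categorical and CategoricalIndex with Interval values when using the in operator (__contains__) with objects that are not comparable to the values in the Interval (GH 23705)
Bug in DataFrame.loc() and DataFrame.iloc() on a DataFrame with a single timezone-aware datetime64[ns] column incorrectly returning a scalar instead of a Series (GH 27110)
Bug in CategoricalIndex and Categorical incorrectly raising ValueError instead of TypeError when a list is passed using the in operator (__contains__) (GH 21729)
Bug in setting a new value in a Series with a Timedelta object incorrectly casting the value to an integer (GH 22717)
Bug in Series setting a new key (__setitem__) with a timezone-aware datetime incorrectly raising ValueError (GH 12862)
Bug in DataFrame.iloc() when indexing with a read-only indexer (GH 17192)
Bug in Series setting an existing tuple key (__setitem__) with timezone-aware datetime values incorrectly raising TypeError (GH 20441)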
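
The boolean-indexer entries (GH 26658) can be illustrated with a short sketch. The Series and mask are made-up examples, and the exact wording of the exception message is not shown because it may differ between versions:

```python
import pandas as pd

s = pd.Series([10, 20, 30])

# The mask has two entries but the Series has three.
too_short = [True, False]

errors = {}
for name, indexer in (("loc", s.loc), ("iloc", s.iloc)):
    try:
        indexer[too_short]
    except IndexError as exc:
        # Both accessors reject a mask whose length does not match.
        errors[name] = str(exc)
```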

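Missing#

Fixed misleading exception message in Series.interpolate() if the argument order is required but omitted (GH 10633, GH 24014).
Fixed the class type displayed in the exception message in DataFrame.dropna() if an invalid axis parameter is passed (GH 25555)
A ValueError will now be thrown by DataFrame.fillna() when limit is not a positive integer (GH 27042)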
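
A minimal sketch of the fillna() limit check, using a hypothetical single-column frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, np.nan, 4.0]})

# ``limit`` must be a positive integer, so zero is rejected.
try:
    df.fillna(0.0, limit=0)
except ValueError as exc:
    fillna_error = str(exc)

# A valid limit caps how many missing values are filled per column.
filled = df.fillna(0.0, limit=1)
```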

MultiIndex#

Bug in which an incorrect exception was raised by Timedelta when testing the membership of a MultiIndex (GH 24570)

IO#

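Bug in DataFrame.to_html() where values were truncated using display options instead of outputting the full content (GH 17004)
Fixed bug in missing text when using to_clipboard() if copying utf-16 characters in Python 3 on Windows (GH 25040)
Bug in read_json() for orient='table' when it tries to infer dtypes by default, which is not applicable as dtypes are already defined in the JSON schema (GH 21345)
Bug in read_json() for orient='table' and a float index, as it infers the index dtype by default, which is not applicable because the index dtype is already defined in the JSON schema (GH 25433)
Bug in read_json() for orient='table' and strings of float column names, as it converts the column name type to Timestamp, which is not applicable because column names are already defined in the JSON schema (GH 25435)
Bug in json_normalize() for errors='ignore' where missing values in the input data were filled in the resulting DataFrame with the string "nan" instead of numpy.nan (GH 25468)
DataFrame.to_html() now raises TypeError when using an invalid type for the classes parameter instead of AssertionError (GH 25608)
Bug in DataFrame.to_string() and DataFrame.to_latex() that would lead to incorrect output when the header keyword is used (GH 16718)
Bug in read_csv() not properly interpreting UTF8 encoded filenames on Windows on Python 3.6+ (GH 15086)
Improved performance in pandas.read_stata() and pandas.io.stata.StataReader when converting columns that have missing values (GH 25772)
Bug in DataFrame.to_html() where header numbers would ignore display options when rounding (GH 17280)
Bug in read_hdf() where reading a table from an HDF5 file written directly with PyTables fails with a ValueError when using a sub-selection via the start or stop arguments (GH 11188)
Bug in read_hdf() not properly closing the store after a KeyError is raised (GH 25766)
Improved the explanation for the failure when value labels are repeated in Stata dta files and suggested work-arounds (GH 25772)
Improved pandas.read_stata() and pandas.io.stata.StataReader to read incorrectly formatted 118 format files saved by Stata (GH 25960)
Improved the col_space parameter in DataFrame.to_html() to accept a string so CSS length values can be set correctly (GH 25941)
Fixed bug in loading objects from S3 that contain # characters in the URL (GH 25945)
Added the use_bqstorage_api parameter to read_gbq() to speed up downloads of large data frames. This feature requires version 0.10.0 of the pandas-gbq library as well as the google-cloud-bigquery-storage and fastavro libraries. (GH 26104)
Fixed memory leak in DataFrame.to_json() when dealing with numeric data (GH 24889)
Bug in read_json() where date strings with Z were not converted to a UTC timezone (GH 26168)
Added the cache_dates=True parameter to read_csv(), which allows unique dates to be cached when they are parsed (GH 25990)
DataFrame.to_excel() now raises a ValueError when the caller's dimensions exceed the limitations of Excel (GH 26051)
Fixed bug in pandas.read_csv() where a BOM would result in incorrect parsing using engine='python' (GH 26545)
read_excel() now raises a ValueError when the input is of type pandas.io.excel.ExcelFile and the engine param is passed, since pandas.io.excel.ExcelFile has an engine defined (GH 26566)
Bug while selecting from HDFStore with where='' specified (GH 26610).
Fixed bug in DataFrame.to_excel() where custom objects (i.e. PeriodIndex) inside merged cells were not being converted into types safe for the Excel writer (GH 27006)
Bug in read_hdf() where reading a timezone-aware DatetimeIndex would raise a TypeError (GH 11926)
Bug in to_msgpack() and read_msgpack() which would raise a ValueError rather than a FileNotFoundError for an invalid path (GH 27160)
Fixed bug in DataFrame.to_parquet() which would raise a ValueError when the dataframe had no columns (GH 27339)
Allow parsing of PeriodDtype columns when using read_csv() (GH 26934)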
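
Two of the additions above, the cache_dates parameter of read_csv() and string values for col_space in DataFrame.to_html(), can be tried with a small sketch. The CSV text and the "120px" width are invented for illustration:

```python
from io import StringIO

import pandas as pd

csv_text = "day,value\n2019-07-18,1\n2019-07-18,2\n2019-07-19,3\n"

# cache_dates=True lets the parser convert each unique date string once
# and reuse the result for repeated values.
df = pd.read_csv(StringIO(csv_text), parse_dates=["day"], cache_dates=True)

# col_space accepts a CSS length string, which is carried into the markup.
html = df.to_html(col_space="120px")
```

The cache mainly pays off when a column contains many repeated date strings; for mostly unique dates the benefit is likely small.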

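Plotting#

Fixed bug where api.extensions.ExtensionArray could not be used in matplotlib plotting (GH 25587)
Bug in an error message in DataFrame.plot(). Improved the error message if non-numerics are passed to DataFrame.plot() (GH 25481)
Bug in incorrect ticklabel positions when plotting an index that is non-numeric / non-datetime (GH 7612, GH 15912, GH 22334)
Fixed bug causing plots of PeriodIndex timeseries to fail if the frequency is a multiple of the frequency rule code (GH 14763)
Fixed bug when plotting a DatetimeIndex with datetime.timezone.utc timezone (GH 17173)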

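GroupBy/resample/rolling#

Bug in Resampler.agg() with a timezone-aware index where OverflowError would raise when passing a list of functions (GH 22660)
Bug in DataFrameGroupBy.nunique() in which the names of column levels were lost (GH 23222)
Bug in GroupBy.agg() when applying an aggregation function to timezone-aware data (GH 23683)
Bug in GroupBy.first() and GroupBy.last() where timezone information would be dropped (GH 21603)
Bug in GroupBy.size() when grouping only NA values (GH 23050)
Bug in Series.groupby() where the observed kwarg was previously ignored (GH 24880)
Bug in Series.groupby() where using groupby with a MultiIndex Series with a list of labels equal to the length of the series caused incorrect grouping (GH 25704)
Ensured that ordering of outputs in groupby aggregation functions is consistent across all versions of Python (GH 25692)
Ensured that result group order is correct when grouping on an ordered Categorical and specifying observed=True (GH 25871, GH 25167)
Bug in Rolling.min() and Rolling.max() that caused a memory leak (GH 25893)
Bug in Rolling.count() and Expanding.count() which previously ignored the axis keyword (GH 13503)
Bug in GroupBy.idxmax() and GroupBy.idxmin() with a datetime column would return an incorrect dtype (GH 25444, GH 15306)
Bug in GroupBy.cumsum(), GroupBy.cumprod(), GroupBy.cummin() and GroupBy.cummax() with a categorical column having absent categories, which would return incorrect results or segfault (GH 16771)
Bug in GroupBy.nth() where NA values in the grouping would return incorrect results (GH 26011)
Bug in SeriesGroupBy.transform() where transforming an empty group would raise a ValueError (GH 26208)
Bug in DataFrame.groupby() where passing a Grouper would return incorrect groups when using the .groups accessor (GH 26326)
Bug in GroupBy.agg() where incorrect results were returned for uint64 columns (GH 26310)
Bug in Rolling.median() and Rolling.quantile() where MemoryError is raised with empty window (GH 26005)
Bug in Rolling.median() and Rolling.quantile() where incorrect results are returned with closed='left' and closed='neither' (GH 26005)
Improved Rolling, Window and ExponentialMovingWindow functions to exclude nuisance columns from results instead of raising errors, and to raise a DataError only if all columns are nuisance (GH 12537)
Bug in Rolling.max() and Rolling.min() where incorrect results are returned with an empty variable window (GH 26005)
Raise a helpful exception when an unsupported weighted window function is used as an argument of Window.aggregate() (GH 26597)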
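
To make the observed fix for Series.groupby() concrete, here is a small sketch with a made-up categorical key in which category "c" never occurs:

```python
import pandas as pd

values = pd.Series([1, 2, 3])
keys = pd.Categorical(["a", "a", "b"], categories=["a", "b", "c"])

# observed=True keeps only the categories that actually appear in the data.
only_observed = values.groupby(keys, observed=True).sum()

# observed=False reports every declared category, including the unused "c".
all_categories = values.groupby(keys, observed=False).sum()
```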

Reshaping#

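Bug in pandas.merge() which added the string None if None was assigned in suffixes, instead of leaving the column name as-is (GH 24782).
Bug in merge() when merging by index name would sometimes result in an incorrectly numbered index (missing index values are now assigned NA) (GH 24212, GH 25009)
to_records() now accepts dtypes for its column_dtypes parameter (GH 24895)
Bug in concat() where the order of an OrderedDict (and a dict in Python 3.6+) was not respected when passed in as the objs argument (GH 21510)
Bug in pivot_table() where columns with NaN values are dropped even if the dropna argument is False, when the aggfunc argument contains a list (GH 22159)
Bug in concat() where the resulting freq of two DatetimeIndex with the same freq would be dropped (GH 3232).
Bug in merge() where merging with equivalent Categorical dtypes was raising an error (GH 22501)
Bug in DataFrame where instantiating with a dict of iterators or generators (e.g. pd.DataFrame({'A': reversed(range(3))})) raised an error (GH 26349).
Bug in DataFrame where instantiating with a range (e.g. pd.DataFrame(range(3))) raised an error (GH 26342).
Bug in DataFrame constructor when passing non-empty tuples would cause a segmentation fault (GH 25691)
Bug in Series.apply() which failed when the series is a timezone-aware DatetimeIndex (GH 25959)
Bug in pandas.cut() where large bins could incorrectly raise an error due to an integer overflow (GH 26045)
Bug in DataFrame.sort_index() where an error is thrown when a multi-indexed DataFrame is sorted on all levels with the initial level sorted last (GH 26053)
Bug in Series.nlargest() which treated True as smaller than False (GH 26154)
Bug in DataFrame.pivot_table() with an IntervalIndex as pivot index would raise TypeError (GH 25814)
Bug in which DataFrame.from_dict() ignored the order of an OrderedDict when orient='index' (GH 8425).
Bug in DataFrame.transpose() where transposing a DataFrame with a timezone-aware datetime column would incorrectly raise ValueError (GH 26825)
Bug in pivot_table() when pivoting a timezone-aware column as the values would remove timezone information (GH 14948)
Bug in merge_asof() when specifying multiple by columns where one is of datetime64[ns, tz] dtype (GH 26649)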
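
The constructor and suffixes fixes above lend themselves to a quick check. This is a sketch with invented frames; the column names key, value and the "_right" suffix are placeholders:

```python
import pandas as pd

# A range, or a dict whose values are iterators, can be passed directly.
from_range = pd.DataFrame(range(3))
from_iterator = pd.DataFrame({"A": reversed(range(3))})

left = pd.DataFrame({"key": [1, 2], "value": [10, 20]})
right = pd.DataFrame({"key": [1, 2], "value": [30, 40]})

# A None suffix leaves the overlapping column name from that side unchanged.
merged = pd.merge(left, right, on="key", suffixes=(None, "_right"))
```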

Sparse#

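Significant speedup in SparseArray initialization that benefits most operations, fixing a performance regression introduced in v0.20.0 (GH 24985)
Bug in SparseFrame constructor where passing None as the data would cause default_fill_value to be ignored (GH 16807)
Bug in SparseDataFrame when adding a column in which the length of values does not match the length of the index, where AssertionError was raised instead of ValueError (GH 25484)
Introduced a better error message in Series.sparse.from_coo() so it returns a TypeError for inputs that are not coo matrices (GH 26554)
Bug in numpy.modf() on a SparseArray. Now a tuple of SparseArray is returned (GH 26946).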
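
The numpy.modf() entry can be sketched as below. The array contents and the explicit fill_value=0.0 are illustrative choices, not taken from the notes:

```python
import numpy as np
import pandas as pd

sparse = pd.arrays.SparseArray([0.0, 1.5, 0.0, 2.25], fill_value=0.0)

# modf is a two-output ufunc; each output comes back as a SparseArray.
fractional, integral = np.modf(sparse)

# Densify only to inspect the values.
fractional_dense = np.asarray(fractional)
integral_dense = np.asarray(integral)
```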

Build changes#

Fixed an install error with PyPy on macOS (GH 26536)

ExtensionArray#

Bug in factorize() when passing an ExtensionArray with a custom na_sentinel (GH 25696).
Series.count() miscounted NA values in ExtensionArrays (GH 26835)
Added Series.__array_ufunc__ to better handle NumPy ufuncs applied to Series backed by extension arrays (GH 23293).
Keyword argument deep has been removed from ExtensionArray.copy() (GH 27083)
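
As a small illustration of the ufunc and count() entries, the sketch below uses the nullable "Int64" extension dtype as an example of an extension-array-backed Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

# The ufunc is dispatched to the underlying extension array, so the
# nullable integer dtype and the missing value both survive.
result = np.add(s, 1)

# count() skips the missing entry.
non_missing = s.count()
```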

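Other#

Removed unused C functions from the vendored UltraJSON implementation (GH 26198)
Allow Index and RangeIndex to be passed to numpy min and max functions (GH 26125)
Use the actual class name in the repr of empty objects of a Series subclass (GH 27001).
Bug in DataFrame where passing an object array of timezone-aware datetime objects would incorrectly raise ValueError (GH 13287)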
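
The numpy min/max entry can be checked with a short sketch; the index values are arbitrary:

```python
import numpy as np
import pandas as pd

idx = pd.Index([3, 1, 2])
rng = pd.RangeIndex(2, 10)

# NumPy reductions accept both index types directly.
smallest, largest = np.min(idx), np.max(idx)
range_min, range_max = np.min(rng), np.max(rng)
```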

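Contributors#

A total of 231 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.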

1_x7 +
Abdullah İhsan Seçer +
Adam Bull +
Adam Hooper
Albert Villanova del Moral
Alex Watt +
AlexTereshenkov +
Alexander Buchkovsky
Alexander Hendorf +
Alexander Nordin +
Alexander Ponomaroff
Alexandre Batisse +
Alexandre Decan +
Allen Downey +
Alyssa Fu Ward +
Andrew Gaspari +
Andrew Wood +
Antoine Viscardi +
Antonio Gutierrez +
Arno Veenstra +
ArtinSarraf
Batalex +
Baurzhan Muftakhidinov
Benjamin Rowell
Bharat Raghunathan +
Bhavani Ravi +
Big Head +
Brett Randall +
Bryan Cutler +
C John Klehm +
Caleb Braun +
Cecilia +
Chris Bertinato +
Chris Stadler +
Christian Haege +
Christian Hudon
Christopher Whelan
Chuanzhu Xu +
Clemens Brunner
Damian Kula +
Daniel Hrisca +
Daniel Luis Costa +
Daniel Saxton
DanielFEvans +
David Liu +
Deepyaman Datta +
Denis Belavin +
Devin Petersohn +
Diane Trout +
EdAbati +
Enrico Rotundo +
EternalLearner42 +
Evan +
Evan Livelo +
Fabian Rost +
Flavien Lambert +
Florian Rathgeber +
Frank Hoang +
Gaibo Zhang +
Gioia Ballin
Giuseppe Romagnuolo +
Gordon Blackadder +
Gregory Rome +
Guillaume Gay
HHest +
Hielke Walinga +
How Si Wei +
Hubert
Huize Wang +
Hyukjin Kwon +
Ian Dunn +
Inevitable-Marzipan +
Irv Lustig
JElfner +
Jacob Bundgaard +
James Cobon-Kerr +
Jan-Philip Gehrcke +
Jarrod Millman +
Jayanth Katuri +
Jeff Reback
Jeremy Schendel
Jiang Yue +
Joel Ostblom
Johan von Forstner +
Johnny Chiu +
Jonas +
Jonathon Vandezande +
Jop Vermeer +
Joris Van den Bossche
Josh
Josh Friedlander +
Justin Zheng
Kaiqi Dong
Kane +
Kapil Patel +
Kara de la Marck +
Katherine Surta +
Katrin Leinweber +
Kendall Masse
Kevin Sheppard
Kyle Kosic +
Lorenzo Stella +
Maarten Rietbergen +
Mak Sze Chun
Marc Garcia
Mateusz Woś
Matias Heikkilä
Mats Maiwald +
Matthew Roeschke
Max Bolingbroke +
Max Kovalovs +
Max van Deursen +
Michael
Michael Davis +
Michael P. Moran +
Mike Cramblett +
Min ho Kim +
Misha Veldhoen +
Mukul Ashwath Ram +
MusTheDataGuy +
Nanda H Krishna +
Nicholas Musolino
Noam Hershtig +
Noora Husseini +
Paul
Paul Reidy
Pauli Virtanen
Pav A +
Peter Leimbigler +
Philippe Ombredanne +
Pietro Battiston
Richard Eames +
Roman Yurchak
Ruijing Li
Ryan
Ryan Joyce +
Ryan Nazareth
Ryan Rehman +
Sakar Panta +
Samuel Sinayoko
Sandeep Pathak +
Sangwoong Yoon
Saurav Chakravorty
Scott Talbert +
Sergey Kopylov +
Shantanu Gontia +
Shivam Rana +
Shorokhov Sergey +
Simon Hawkins
Soyoun(Rose) Kim
Stephan Hoyer
Stephen Cowley +
Stephen Rauch
Sterling Paramore +
Steven +
Stijn Van Hoey
Sumanau Sareen +
Takuya N +
Tan Tran +
Tao He +
Tarbo Fukazawa
Terji Petersen +
Thein Oo
ThibTrip +
Thijs Damsma +
Thiviyan Thanapalasingam
Thomas A Caswell
Thomas Kluiters +
Tilen Kusterle +
Tim Gates +
Tim Hoffmann
Tim Swast
Tom Augspurger
Tom Neep +
Tomáš Chvátal +
Tyler Reddy
Vaibhav Vishal +
Vasily Litvinov +
Vibhu Agarwal +
Vikramjeet Das +
Vladislav +
Víctor Moron Tejero +
Wenhuan
Will Ayd +
William Ayd
Wouter De Coster +
Yoann Goular +
Zach Angell +
alimcmaster1
anmyachev +
chris-b1
danielplawrence +
endenis +
enisnazif +
ezcitron +
fjetter
froessler
gfyoung
gwrome +
h-vetinari
haison +
hannah-c +
heckeop +
iamshwin +
jamesoliverh +
jbrockmendel
jkovacevic +
killerontherun1 +
knuu +
kpapdac +
kpflugshaupt +
krsnik93 +
leerssej +
lrjball +
mazayo +
nathalier +
nrebena +
nullptr +
pilkibun +
pmaxey83 +
rbenes +
robbuckley
shawnbrown +
sudhir mohanraj +
tadeja +
tamuhey +
thatneat
topper-123
willweil +
yehia67 +
yhaque1213 +
