Less Basic (somewhat more advanced concepts worth learning during the quickstart)

Broadcasting rules

Broadcasting allows universal functions to deal in a meaningful way with inputs that do not have exactly the same shape.

The first rule of broadcasting is that if the input arrays do not all have the same number of dimensions, a "1" is repeatedly prepended to the shapes of the smaller arrays until all the arrays have the same number of dimensions.

The second rule of broadcasting ensures that arrays with a size of 1 along a particular dimension act as if they had the size of the array with the largest shape along that dimension. The value of the array element is assumed to be the same along that dimension for the "broadcast" array.

After application of the broadcasting rules, the sizes of all arrays must match. More details can be found in the NumPy broadcasting documentation (basics.broadcasting).
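As a minimal sketch of the two rules, the example below adds an array of shape (3, 1) to one of shape (4,). The array names are illustrative and not part of the original text.

```python
import numpy as np

a = np.arange(3).reshape(3, 1)   # shape (3, 1)
b = np.arange(4)                 # shape (4,)

# Rule 1: b has fewer dimensions, so its shape is treated as (1, 4).
# Rule 2: size-1 axes are stretched, giving a common shape of (3, 4).
total = a + b
print(total.shape)               # (3, 4)

# Shapes that still disagree after both rules cannot be combined.
try:
    np.arange(3) + np.arange(4)  # (3,) vs (4,): neither size is 1
except ValueError as exc:
    print("broadcast failed:", exc)
```

Neither input is copied to the full (3, 4) size before the addition; the stretching is only conceptual.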

Advanced indexing and index tricks

NumPy offers more indexing facilities than regular Python sequences. In addition to indexing by integers and slices, as we saw before, arrays can be indexed by arrays of integers and arrays of booleans.

Indexing with Arrays of Indices

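>>> a = np.arange(12)**2  # the first 12 square numbers
>>> i = np.array([1, 1, 3, 8, 5])  # an array of indices
>>> a[i]  # the elements of `a` at the positions `i`
array([ 1,  1,  9, 64, 25])
>>>
>>> j = np.array([[3, 4], [9, 7]])  # a two-dimensional array of indices
>>> a[j]  # the result has the same shape as `j`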
array([[ 9, 16],
       [81, 49]])

When the indexed array a is multidimensional, a single array of indices refers to the first dimension of a. The following example shows this behavior by converting an image of labels into a color image using a palette.

>>> palette = np.array([[0, 0, 0],         # black
...                     [255, 0, 0],       # red
...                     [0, 255, 0],       # green
...                     [0, 0, 255],       # blue
...                     [255, 255, 255]])  # white
>>> image = np.array([[0, 1, 2, 0],  # each value corresponds to a color in the palette
...                   [0, 3, 4, 0]])
>>> palette[image]  # the (2, 4, 3) color image
array([[[  0,   0,   0],
        [255,   0,   0],
        [  0, 255,   0],
        [  0,   0,   0]],
<BLANKLINE>
       [[  0,   0,   0],
        [  0,   0, 255],
        [255, 255, 255],
        [  0,   0,   0]]])

We can also give indices for more than one dimension. The arrays of indices for each dimension must have the same shape.

>>> a = np.arange(12).reshape(3, 4)
>>> a
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>> i = np.array([[0, 1],  # indices for the first dimension of `a`
...               [1, 2]])
>>> j = np.array([[2, 1],  # indices for the second dimension of `a`
...               [3, 3]])
>>>
>>> a[i, j]  # i and j must have the same shape
array([[ 2,  5],
       [ 7, 11]])
>>>
>>> a[i, 2]
array([[ 2,  6],
       [ 6, 10]])
>>>
>>> a[:, j]
array([[[ 2,  1],
        [ 3,  3]],
<BLANKLINE>
       [[ 6,  5],
        [ 7,  7]],
<BLANKLINE>
       [[10,  9],
        [11, 11]]])

In Python, arr[i, j] is exactly the same as arr[(i, j)], so we can put i and j in a tuple and then do the indexing with that.

>>> l = (i, j)
>>> # equivalent to a[i, j]
>>> a[l]
array([[ 2,  5],
       [ 7, 11]])

However, we cannot do this by putting i and j into an array, because this array will be interpreted as indexing the first dimension of a.

>>> s = np.array([i, j])
>>> # not what we want
>>> a[s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: index 3 is out of bounds for axis 0 with size 3
>>> # same as `a[i, j]`
>>> a[tuple(s)]
array([[ 2,  5],
       [ 7, 11]])

Another common use of indexing with arrays is the search for the maximum value of time-dependent series:

>>> time = np.linspace(20, 145, 5)  # time scale
>>> data = np.sin(np.arange(20)).reshape(5, 4)  # 4 time-dependent series
>>> time
array([ 20.  ,  51.25,  82.5 , 113.75, 145.  ])
>>> data
array([[ 0.        ,  0.84147098,  0.90929743,  0.14112001],
       [-0.7568025 , -0.95892427, -0.2794155 ,  0.6569866 ],
       [ 0.98935825,  0.41211849, -0.54402111, -0.99999021],
       [-0.53657292,  0.42016704,  0.99060736,  0.65028784],
       [-0.28790332, -0.96139749, -0.75098725,  0.14987721]])
>>> # index of the maximum for each series
>>> ind = data.argmax(axis=0)
>>> ind
array([2, 0, 3, 1])
>>> # times corresponding to the maxima
>>> time_max = time[ind]
>>>
>>> data_max = data[ind, range(data.shape[1])]  # => data[ind[0], 0], data[ind[1], 1]...
>>> time_max
array([ 82.5 ,  20.  , 113.75,  51.25])
>>> data_max
array([0.98935825, 0.84147098, 0.99060736, 0.6569866 ])
>>> np.all(data_max == data.max(axis=0))
True

You can also use indexing with arrays as a target to assign to:

>>> a = np.arange(5)
>>> a
array([0, 1, 2, 3, 4])
>>> a[[1, 3, 4]] = 0
>>> a
array([0, 0, 2, 0, 0])

However, when the list of indices contains repetitions, the assignment is done several times, leaving behind the last value:

>>> a = np.arange(5)
>>> a[[0, 0, 2]] = [1, 2, 3]
>>> a
array([2, 1, 3, 3, 4])

This is reasonable enough, but watch out if you want to use Python's += construct, as it may not do what you expect:

>>> a = np.arange(5)
>>> a[[0, 0, 2]] += 1
>>> a
array([1, 1, 3, 3, 4])

Even though 0 occurs twice in the list of indices, the 0th element is only incremented once. This is because Python requires a += 1 to be equivalent to a = a + 1.
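If the intent is to count every repeated index, one option (not mentioned in the original text) is the unbuffered ufunc method np.add.at. The following is a small sketch of that alternative:

```python
import numpy as np

a = np.arange(5)
# Unbuffered in-place addition: each occurrence of an index is applied,
# so index 0 is incremented twice and index 2 once.
np.add.at(a, [0, 0, 2], 1)
print(a)   # [2 1 3 3 4]
```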

Indexing with Boolean Arrays

When we index arrays with arrays of (integer) indices we are providing the list of indices to pick. With boolean indices the approach is different; we explicitly choose which items in the array we want and which ones we don't.

The most natural way one can think of for boolean indexing is to use boolean arrays that have the same shape as the original array:

>>> a = np.arange(12).reshape(3, 4)
>>> b = a > 4
>>> b  # `b` is a boolean array with the same shape as `a`
array([[False, False, False, False],
       [False,  True,  True,  True],
       [ True,  True,  True,  True]])
>>> a[b]  # 1-D array with the selected elements
array([ 5,  6,  7,  8,  9, 10, 11])

This property can be very useful in assignments:

>>> a[b] = 0  # all elements of `a` greater than 4 become 0
>>> a
array([[0, 1, 2, 3],
       [4, 0, 0, 0],
       [0, 0, 0, 0]])
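The assignment above modifies a in place. When the original values should be kept, a possible alternative (an addition here, not part of the original text) is np.where, which builds a new array from the same boolean condition:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
# Take 0 where the condition holds, otherwise the value from `a`.
clipped = np.where(a > 4, 0, a)   # new array; `a` itself is unchanged
print(clipped)
```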

You can look at the following example to see how to use boolean indexing to generate an image of the Mandelbrot set:

import numpy as np
import matplotlib.pyplot as plt
def mandelbrot(h, w, maxit=20, r=2):
    """Returns an image of the Mandelbrot fractal of size (h,w)."""
    x = np.linspace(-2.5, 1.5, 4*h+1)
    y = np.linspace(-1.5, 1.5, 3*w+1)
    A, B = np.meshgrid(x, y)
    C = A + B*1j
    z = np.zeros_like(C)
    divtime = maxit + np.zeros(z.shape, dtype=int)

    for i in range(maxit):
        z = z**2 + C
        diverge = abs(z) > r                    # who is diverging
        div_now = diverge & (divtime == maxit)  # who is diverging now
        divtime[div_now] = i                    # note when
        z[diverge] = r                          # avoid diverging too much

    return divtime
plt.clf()
plt.imshow(mandelbrot(400, 400))

The second way of indexing with booleans is more similar to integer indexing; for each dimension of the array we give a 1D boolean array selecting the slices we want:

>>> a = np.arange(12).reshape(3, 4)
>>> b1 = np.array([False, True, True])         # first dim selection
>>> b2 = np.array([True, False, True, False])  # second dim selection
>>>
>>> a[b1, :]                                   # selecting rows
array([[ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>>
>>> a[b1]                                      # same thing
array([[ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
>>>
>>> a[:, b2]                                   # selecting columns
array([[ 0,  2],
       [ 4,  6],
       [ 8, 10]])
>>>
>>> a[b1, b2]                                  # a weird thing to do
array([ 4, 10])

Note that the length of the 1D boolean array must coincide with the length of the dimension (or axis) you want to slice. In the previous example, b1 has length 3 (the number of rows in a), and b2 (of length 4) is suitable to index the 2nd axis (columns) of a.
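To see why a[b1, b2] looks odd, it may help to convert each mask to integer positions. The sketch below (variable names rows, cols, paired and block are illustrative) shows that the two masks are paired element by element, and that np.ix_, covered in the next section, gives the row-by-column block instead:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
b1 = np.array([False, True, True])
b2 = np.array([True, False, True, False])

rows = np.nonzero(b1)[0]   # positions of True in b1: [1, 2]
cols = np.nonzero(b2)[0]   # positions of True in b2: [0, 2]

paired = a[rows, cols]     # picks a[1, 0] and a[2, 2], same as a[b1, b2]
block = a[np.ix_(b1, b2)]  # rows 1 and 2 crossed with columns 0 and 2
print(paired)              # [ 4 10]
print(block)
```

The pairing only works here because both masks happen to contain two True values.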

The ix_() function

The ix_ function can be used to combine different vectors so as to obtain the result for each n-tuple. For example, if you want to compute a+b*c for all the triplets taken from each of the vectors a, b and c:

>>> a = np.array([2, 3, 4, 5])
>>> b = np.array([8, 5, 4])
>>> c = np.array([5, 4, 6, 8, 3])
>>> ax, bx, cx = np.ix_(a, b, c)
>>> ax
array([[[2]],
<BLANKLINE>
       [[3]],
<BLANKLINE>
       [[4]],
<BLANKLINE>
       [[5]]])
>>> bx
array([[[8],
        [5],
        [4]]])
>>> cx
array([[[5, 4, 6, 8, 3]]])
>>> ax.shape, bx.shape, cx.shape
((4, 1, 1), (1, 3, 1), (1, 1, 5))
>>> result = ax + bx * cx
>>> result
array([[[42, 34, 50, 66, 26],
        [27, 22, 32, 42, 17],
        [22, 18, 26, 34, 14]],
<BLANKLINE>
       [[43, 35, 51, 67, 27],
        [28, 23, 33, 43, 18],
        [23, 19, 27, 35, 15]],
<BLANKLINE>
       [[44, 36, 52, 68, 28],
        [29, 24, 34, 44, 19],
        [24, 20, 28, 36, 16]],
<BLANKLINE>
       [[45, 37, 53, 69, 29],
        [30, 25, 35, 45, 20],
        [25, 21, 29, 37, 17]]])
>>> result[3, 2, 4]
17
>>> a[3] + b[2] * c[4]
17

You could also implement the reduce as follows:

>>> def ufunc_reduce(ufct, *vectors):
...     vs = np.ix_(*vectors)
...     r = ufct.identity
...     for v in vs:
...         r = ufct(r, v)
...     return r

and then use it as:

>>> ufunc_reduce(np.add, a, b, c)
array([[[15, 14, 16, 18, 13],
        [12, 11, 13, 15, 10],
        [11, 10, 12, 14,  9]],
<BLANKLINE>
       [[16, 15, 17, 19, 14],
        [13, 12, 14, 16, 11],
        [12, 11, 13, 15, 10]],
<BLANKLINE>
       [[17, 16, 18, 20, 15],
        [14, 13, 15, 17, 12],
        [13, 12, 14, 16, 11]],
<BLANKLINE>
       [[18, 17, 19, 21, 16],
        [15, 14, 16, 18, 13],
        [14, 13, 15, 17, 12]]])

The advantage of this version of reduce compared to the normal ufunc.reduce is that it makes use of the broadcasting rules in order to avoid creating an argument array the size of the output times the number of vectors.

Tricks and tips

Here is a list of short and useful tips.

"Automatic" Reshaping

To change the dimensions of an array, you can omit one of the sizes, which will then be deduced automatically:

>>> a = np.arange(30)
>>> b = a.reshape((2, -1, 3))  # -1 means "whatever is needed"
>>> b.shape
(2, 5, 3)
>>> b
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8],
        [ 9, 10, 11],
        [12, 13, 14]],
<BLANKLINE>
       [[15, 16, 17],
        [18, 19, 20],
        [21, 22, 23],
        [24, 25, 26],
        [27, 28, 29]]])
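The deduction only works when the total size is divisible by the product of the sizes that are given, and only one size may be left as -1. A short sketch of both the working and failing cases (the chosen shapes are illustrative):

```python
import numpy as np

a = np.arange(30)
print(a.reshape(-1, 6).shape)   # (5, 6)
print(a.reshape(3, -1).shape)   # (3, 10)

errors = []
for shape in [(-1, 7), (-1, -1)]:
    try:
        a.reshape(shape)        # 30 is not divisible by 7; two unknowns are ambiguous
    except ValueError as exc:
        errors.append(str(exc))
print(errors)
```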

Vector stacking

How do we construct a 2D array from a list of equally-sized row vectors? In MATLAB this is quite easy: if x and y are two vectors of the same length you only need to write m=[x;y]. In NumPy this works via the functions column_stack, dstack, hstack and vstack, depending on the dimension in which the stacking is to be done. For example:

>>> x = np.arange(0, 10, 2)
>>> y = np.arange(5)
>>> m = np.vstack([x, y])
>>> m
array([[0, 2, 4, 6, 8],
       [0, 1, 2, 3, 4]])
>>> xy = np.hstack([x, y])
>>> xy
array([0, 2, 4, 6, 8, 0, 1, 2, 3, 4])

The logic behind those functions in more than two dimensions can be strange.
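One way to keep track of what each function does is to compare the result shapes for the same pair of 1D inputs. The sketch below does that, and also shows np.stack, which is not mentioned above but takes the new axis as an explicit argument:

```python
import numpy as np

x = np.arange(0, 10, 2)   # shape (5,)
y = np.arange(5)          # shape (5,)

shapes = {
    "vstack": np.vstack([x, y]).shape,              # (2, 5): one row per vector
    "hstack": np.hstack([x, y]).shape,              # (10,): joined end to end
    "column_stack": np.column_stack([x, y]).shape,  # (5, 2): one column per vector
    "dstack": np.dstack([x, y]).shape,              # (1, 5, 2): stacked along a third axis
    "stack axis=1": np.stack([x, y], axis=1).shape, # (5, 2): new axis chosen explicitly
}
for name, shape in shapes.items():
    print(name, shape)
```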

See also: NumPy for MATLAB users (numpy-for-matlab-users)

Histograms

The NumPy histogram function applied to an array returns a pair of vectors: the histogram of the array and a vector of the bin edges. Beware: matplotlib also has a function to build histograms (called hist, as in MATLAB) that differs from the one in NumPy. The main difference is that pylab.hist plots the histogram automatically, while numpy.histogram only generates the data.

import numpy as np
import matplotlib.pyplot as plt

rg = np.random.default_rng(1)
# Build a vector of 10000 normal deviates with variance 0.5^2 and mean 2
mu, sigma = 2, 0.5
v = rg.normal(mu, sigma, 10000)
# Plot a normalized histogram with 50 bins
plt.hist(v, bins=50, density=True)       # matplotlib version (plot)
# Compute the histogram with numpy and then plot it
(n, bins) = np.histogram(v, bins=50, density=True)  # NumPy version (no plot)
plt.plot(.5 * (bins[1:] + bins[:-1]), n)

With Matplotlib 3.4 or later you can also use plt.stairs(n, bins).
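The relationship between the two vectors returned by numpy.histogram can be checked without plotting. The sketch below reuses the same random data and only relies on NumPy; the variable names counts, edges, density, area and centers are illustrative:

```python
import numpy as np

rg = np.random.default_rng(1)
v = rg.normal(2, 0.5, 10000)

counts, edges = np.histogram(v, bins=50)
# 50 bins are delimited by 51 edges, and every sample falls in some bin.
print(len(counts), len(edges))

density, _ = np.histogram(v, bins=50, density=True)
# With density=True the bar areas (height times bin width) sum to about 1.
area = np.sum(density * np.diff(edges))
# Bin centers, as used for the x coordinates when plotting the histogram as a line.
centers = 0.5 * (edges[1:] + edges[:-1])
```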

Further reading