diff options
Diffstat (limited to 'PVCM/cama/en/ma1 np03 Manipulations.ipynb')
| -rw-r--r-- | PVCM/cama/en/ma1 np03 Manipulations.ipynb | 821 |
1 files changed, 821 insertions, 0 deletions
diff --git a/PVCM/cama/en/ma1 np03 Manipulations.ipynb b/PVCM/cama/en/ma1 np03 Manipulations.ipynb new file mode 100644 index 0000000..3d713e9 --- /dev/null +++ b/PVCM/cama/en/ma1 np03 Manipulations.ipynb @@ -0,0 +1,821 @@ +{ + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.10" + } + }, + "nbformat": 4, + "nbformat_minor": 2, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "lang": "en" + }, + "source": [ + "[Array manipulations](https://docs.scipy.org/doc/numpy-1.15.1/reference/routines.array-manipulation.html) include:\n", + "\n", + "* the reorganization of the table (reindexing)\n", + "* the aggregation of 2 or more arrays\n", + "* the division of a table in 2 or more\n", + "\n", + "Before looking at these points, let's look at how Numpy presents the\n", + "dimensions of a multidimensional array with the notion of axes." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(2, 4, 3)" + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import numpy as np\n", + "\n", + "# An array of marks of 3 exams for 4 students in two subjects \n", + "# (therefore 6 marks per students or 12 per subjects)\n", + "\n", + "# stud.1 stud.2 stud.3 stud.4\n", + "marks = np.array([[[7,13,11], [7,7,13], [5,9,11], [7,17,15]], # subject 1\n", + " [[8,12,14], [8,12,12], [8,12,10], [12,16,12]]]) # subject 2\n", + "marks.shape" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lang": "en" + }, + "source": [ + "## The axes\n", + "\n", + "A table has axes which correspond to the axes of a coordinate system in space. The order of the axes\n", + "is that of the inclusion of square brackets. In 2D an array of array is an array of rows with\n", + "each row which is a 1D array of values. The order is therefore rows then columns (unlike the $(x,y)$ axis in\n", + "space). In 3D the order is line, column, depth if you want to have an image, otherwise it's 0, 1 and 2.\n", + "\n", + "Many operations on tables are done along one of the axes of the table so it is important to\n", + "understand what axes are.\n", + "\n", + "Let's look at the example notes above. The axes are\n", + "\n", + "0. materials\n", + "1. students\n", + "2. exams" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lang": "en" + }, + "source": [ + "Making the average of the values along axis 1 means to take data along axis 1 and performing the calculations on it, so here outputting an average." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[ 6.5, 11.5, 12.5],\n", + " [ 9. , 13. , 12. ]])" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "marks.mean(axis=1) # give means for each exam in each subject" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lang": "en" + }, + "source": [ + "Another way to look at axes is to think of them as __projection axes__. If I project a 3D object\n", + "along the $y$ axis, the result is a 2D object in $(x,z)$. There is thus a reduction in dimension.\n", + "\n", + "If I sum on the 0 axis an array of dimension (2,4,3) as is our marks array, this means that I lose the 0 dimension and therefore the dimension of the result is (4,3)." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(4, 3)" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "marks.mean(axis=0).shape # mean along axis 0 (subjects) therefore this axis disapears" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lang": "en" + }, + "source": [ + "#### Some functions that support axes\n", + "\n", + "All functions that apply a set of values to produce a result\n", + "should be able to use axis concept (I don't have them all\n", + "checked but do not hesitate to give me a counter-example). We have the following mathematical functions:\n", + "\n", + "* arithmetic: `sum`, `prod`, `cumsum`, `cumprod`\n", + "* statistics: `min`, `max`, `argmin`, `argmax`, `mean` (average), `average` (weighted average), `std` (standard deviation), `var`, `median `, `percentile`, `quantile`\n", + "* others: `gradient`, `diff`, `fft`\n", + "\n", + "Moreover, it is possible to sort the values of an array according to the axis of your choice with `sort`.\n", + "However, they can be mixed, with [`shuffle`](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.random.shuffle.html#numpy.random.shuffle ), only along the 0 axis.\n", + "\n", + "#### Apply a function along an axis\n", + "\n", + "The function [`apply_along_axis`](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.apply_along_axis.html)\n", + "allows to apply a 1D functionto a table along an axis. This is the axis that will disappear in the result:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "-> [ 7 13 11] 6\n", + "-> [ 7 7 13] 6\n", + "-> [ 5 9 11] 6\n", + "-> [ 7 17 15] 10\n", + "-> [ 8 12 14] 6\n", + "-> [ 8 12 12] 4\n", + "-> [ 8 12 10] 4\n", + "-> [12 16 12] 4\n" + ] + }, + { + "data": { + "text/plain": [ + "array([[ 6, 6, 6, 10],\n", + " [ 6, 4, 4, 4]])" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def diff_min_max(a):\n", + " print('->', a, a.max() - a.min())\n", + " return a.max() - a.min()\n", + "\n", + "np.apply_along_axis(diff_min_max, axis=-1, arr=marks) # -1 is the last axis, marks in our case" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lang": "en" + }, + "source": [ + "Question: this is the difference between the marks, but which ones?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lang": "en" + }, + "source": [ + "#### Apply a function along several axes\n", + "\n", + "Some operations may take a list of axes and not a single axis." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "a.max \n", + " [17 16] \n", + "\n", + "a.max keepdim \n", + " [[[17]]\n", + "\n", + " [[16]]] \n", + "\n" + ] + } + ], + "source": [ + "print('a.max \\n', marks.max(axis=(1,2)), '\\n') \n", + "print('a.max keepdim \\n', marks.max(axis=(1,2), keepdims=True), '\\n') " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lang": "en" + }, + "source": [ + "Question: what do the 2 output values correspond to?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lang": "en" + }, + "source": [ + "One can also use the function [`apply_over_axes`](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.apply_over_axes.html#numpy.apply_over_axes) to apply a\n", + "function along the given axes.\n", + "\n", + "Beware, the function given in argument will receive the whole table and the axis on which it must\n", + "work, the axes being given one after the other and the table being modified at each stage." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Apply over axis 1\n", + "[[[ 7 13 11]\n", + " [ 7 7 13]\n", + " [ 5 9 11]\n", + " [ 7 17 15]]\n", + "\n", + " [[ 8 12 14]\n", + " [ 8 12 12]\n", + " [ 8 12 10]\n", + " [12 16 12]]] \n", + "\n", + "Apply over axis 2\n", + "[[[ 7 17 15]]\n", + "\n", + " [[12 16 14]]] \n", + "\n" + ] + }, + { + "data": { + "text/plain": [ + "array([[[17]],\n", + "\n", + " [[16]]])" + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def mymax(array, axis):\n", + " print('Apply over axis', axis)\n", + " print(array, '\\n')\n", + " return array.max(axis)\n", + "\n", + "np.apply_over_axes(mymax, marks, axes=(1,2))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lang": "en" + }, + "source": [ + "## Arranging a table\n", + "\n", + "We have already seen `reshape` to change the shape of an array, `flatten` to flatten it in 1 dimension, let's have a look at\n", + "other array manipulation functions.\n", + "\n", + "### Reorder axes\n", + "\n", + "#### `moveaxis` moves an axis\n", + "\n", + "In our marks example, the 3 axes are subjects, students and exams.\n", + "The `moveaxis` function is used to move an axis. If so I want the exams to become the first axis\n", + "in order to make the marks stand out, I move axis 2 to position 0 and the other axes slide to make room, axis 0 becomes axis 1 and axis 1 becomes axis 2:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "marks.shape = (2, 4, 3) \n", + "\n", + "b.shape = (3, 2, 4)\n" + ] + }, + { + "data": { + "text/plain": [ + "array([[[ 7, 7, 5, 7],\n", + " [ 8, 8, 8, 12]],\n", + "\n", + " [[13, 7, 9, 17],\n", + " [12, 12, 12, 16]],\n", + "\n", + " [[11, 13, 11, 15],\n", + " [14, 12, 10, 12]]])" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "print('marks.shape = ',marks.shape, '\\n')\n", + "b = np.moveaxis(marks, 2, 0) \n", + "print('b.shape = ', b.shape)\n", + "b" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lang": "en" + }, + "source": [ + "It is easier to see that the first examination was difficult." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lang": "en" + }, + "source": [ + "#### `swapaxes` 2 axis swap\n", + "\n", + "Rather than inserting one axis at a new position and dragging the others, you may want to swap two of them.\n", + "Here is how to get the marks for each subject and for each exam:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[[ 7, 7, 5, 7],\n", + " [13, 7, 9, 17],\n", + " [11, 13, 11, 15]],\n", + "\n", + " [[ 8, 8, 8, 12],\n", + " [12, 12, 12, 16],\n", + " [14, 12, 10, 12]]])" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "marks.swapaxes(1,2)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lang": "en" + }, + "source": [ + "#### `transpose` to do everything\n", + "\n", + "Finally `transpose` allows you to reorder all the axes as you want, thus: `transpose((2,0,1))` puts\n", + "\n", + "* axis 2 in position 0,\n", + "* axis 0 in place 1\n", + "* axis 1 in place 2.\n", + "\n", + "#### A simpler and faster apply over axis\n", + "\n", + "Unfortunately the `apply_over_axis` function is not optimized, so in some cases it may\n", + "be preferable to make a loop on its table, which means to put the axes which will remain at the beginning and those on which we make our reduction at the end:" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Means per students [10.833333333333334, 9.833333333333334, 9.166666666666666, 13.166666666666666]\n", + "24 µs ± 2.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n", + "29.6 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)\n" + ] + } + ], + "source": [ + "print(\"Means per students\", [m.mean() for m in marks.transpose((1,0,2))])\n", + "\n", + "%timeit [m.mean() for m in marks.transpose((1,0,2))]\n", + "%timeit np.apply_over_axes(np.mean, marks, axes=(0,2))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lang": "en" + }, + "source": [ + "### Changing the order of array elements\n", + "\n", + "You can flip the values of an array along an axis with [`flip`](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.flip.html#numpy.flip )\n", + "which can also be done by indicating it at the level of the indices. Thus `np.flip(a, n)` is equivalent to\n", + "`a[:,:,..,::-1,:,..,:]` with `::-1` in $n$-th position." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[[12, 13, 14, 15],\n", + " [16, 17, 18, 19],\n", + " [20, 21, 22, 23]],\n", + "\n", + " [[ 0, 1, 2, 3],\n", + " [ 4, 5, 6, 7],\n", + " [ 8, 9, 10, 11]]])" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "a = np.arange(24).reshape([2,3,4])\n", + "np.flip(a,0)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lang": "en" + }, + "source": [ + "You can roll values along an axis with `roll` by specifying how much to slide them:" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[[ 4, 5, 6, 7],\n", + " [ 8, 9, 10, 11],\n", + " [ 0, 1, 2, 3]],\n", + "\n", + " [[16, 17, 18, 19],\n", + " [20, 21, 22, 23],\n", + " [12, 13, 14, 15]]])" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.roll(a, 2, axis=1) # roll elements by 2 along axis 1" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lang": "en" + }, + "source": [ + "The transpose also applies regardless of the dimension. By default it reverses the order of the axes but you can\n", + "specify the desired output order." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[[ 0, 12],\n", + " [ 4, 16],\n", + " [ 8, 20]],\n", + "\n", + " [[ 1, 13],\n", + " [ 5, 17],\n", + " [ 9, 21]],\n", + "\n", + " [[ 2, 14],\n", + " [ 6, 18],\n", + " [10, 22]],\n", + "\n", + " [[ 3, 15],\n", + " [ 7, 19],\n", + " [11, 23]]])" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "a.T" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[[ 0, 4, 8],\n", + " [ 1, 5, 9],\n", + " [ 2, 6, 10],\n", + " [ 3, 7, 11]],\n", + "\n", + " [[12, 16, 20],\n", + " [13, 17, 21],\n", + " [14, 18, 22],\n", + " [15, 19, 23]]])" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.transpose(a, (0,2,1))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lang": "en" + }, + "source": [ + "## Aggregation\n", + "\n", + "### Concatenation\n", + "\n", + "The basic function is [`concatenate`](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.concatenate.html) indicating the axis chosen for concatenation. This is, in my opinion, the method\n", + "the safest and it works whatever the size.\n", + "\n", + "However, for 2D or 3D arrays, we can use:\n", + "\n", + "* `vstack` or `row_stack` for vertical concatenation\n", + "* `hstack` or `column_stack` for horizontal concatenation\n", + "* `dstack` for deep concatenation\n", + "\n", + "All of these functions take a list of arrays to concatenate as an argument. Of course the sizes of the tables must be compatible." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[[0. 0. 0.]\n", + " [0. 0. 0.]\n", + " [1. 1. 1.]\n", + " [1. 1. 1.]] \n", + "\n", + "[[0. 0. 0. 1. 1. 1.]\n", + " [0. 0. 0. 1. 1. 1.]]\n" + ] + } + ], + "source": [ + "a = np.zeros((2,3))\n", + "b = np.ones((2,3))\n", + "\n", + "print(np.concatenate((a,b), axis=0), '\\n') # same than vstack\n", + "print(np.hstack((a,b))) # same than concatenate with axis=1" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lang": "en" + }, + "source": [ + "### Stacking\n", + "\n", + "Unlike concatenation, stacking adds dimension.\n", + "Stack is useful for storing a bunch of 2D arrays, images for example, in a 3D array.\n", + "We use the function [`stack`](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.stack.html)." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "array([[[0., 0., 0.],\n", + " [0., 0., 0.]],\n", + "\n", + " [[1., 1., 1.],\n", + " [1., 1., 1.]]])" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "c = np.stack((a,b)) # c[0] is a\n", + "c" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lang": "en" + }, + "source": [ + "Note that `stack` has an `axis` option to indicate the direction in which one wishes to store the given arrays." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lang": "en" + }, + "source": [ + "## Splitting\n", + "\n", + "The inverse function of concatenation is splitting with [`split`](https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.split.html#numpy.split) which asks as arguments:\n", + "\n", + "* the array to split\n", + "* in how many pieces or at what indices\n", + "* the direction (the axis)\n", + "\n", + "To find our two tables that generated the result of the previous cell, we cut in 2 along the 0 axis. We can also cut along another axis." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "split part 1\n", + " [[[0. 0. 0.]]\n", + "\n", + " [[1. 1. 1.]]] \n", + "\n", + "split part 2\n", + " [[[0. 0. 0.]]\n", + "\n", + " [[1. 1. 1.]]]\n" + ] + } + ], + "source": [ + "e,f = np.split(c, 2, 1) # splits in 2 along axis 1\n", + "print(\"split part 1\\n\", e, '\\n')\n", + "print(\"split part 2\\n\", f)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lang": "en" + }, + "source": [ + "There are also `hsplit`, `vsplit` and `dsplit` to split along axes 0, 1 and 2.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lang": "en" + }, + "source": [ + "## From Python to Numpy\n", + "\n", + "If you want to dig and look at many examples, you can read N. Rougier's book [From Python to Numpy](https://www.labri.fr/perso/nrougier/from-python-to-numpy/)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "lang": "en" + }, + "source": [ + "## Pandas too\n", + "\n", + "We will find these manipulations with Pandas which is the super spreadsheet of Python. It also works on\n", + "array-like structures but without the constraint that all values are of the same type." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "variables": { + " PreviousNext(\"np02 Filtres.ipynb\", \"np04 Xarray.ipynb\")": " <br/><center><a href=\"np02 Filtres.ipynb\">np02 Filtres</a> ← <a href=\"http://python3.mooc.lrde.epita.fr/notebooks/Table%20des%20mati%C3%A8res.ipynb\" style=\"text-decoration:none\"> △ </a> → <a href=\"np04 Xarray.ipynb\">np04 Xarray</a></center><br/> " + } + }, + "source": [ + "{{ PreviousNext(\"np02 Filtres.ipynb\", \"np04 Xarray.ipynb\")}}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ] +}
\ No newline at end of file |
