{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Programming K-means (the unsupervised clustering algorithm) in Spark" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Algorithm k-means (k)\n", "\n", "1. Initialize the location of the k prototypes kj (usually at random).\n", "2. (MAP) Assign each instance xi to its closest prototype (usually, closeness = Euclidean distance).\n", "3. (REDUCE) Update the location of each prototype kj as the average of the instances xi assigned to its cluster.\n", "4. Go to step 2 and repeat until the clusters no longer change.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Start the Spark context**" ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import sys\n", "import os\n", "import os.path\n", "SPARK_HOME = r\"C:\\spark-1.5.0-bin-hadoop2.6\" # CHANGE THIS PATH TO YOURS!\n", "\n", "sys.path.append(os.path.join(SPARK_HOME, \"python\", \"lib\", \"py4j-0.8.2.1-src.zip\"))\n", "sys.path.append(os.path.join(SPARK_HOME, \"python\", \"lib\", \"pyspark.zip\"))\n", "os.environ[\"SPARK_HOME\"] = SPARK_HOME\n", "\n", "from pyspark import SparkContext\n", "sc = SparkContext(master=\"local[*]\", appName=\"PythonKMeans\")\n", "\n", "# sc.stop()\n", "\n", "# from pyspark.sql import SQLContext\n", "# sqlContext = SQLContext(sc)\n", "# While the context is running, the Spark web UI can be seen at http://localhost:4040" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Check that the Spark context exists**" ] }, { "cell_type": "code", "execution_count": 98, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n" ] } ], "source": [ "print(sc)\n", "print(type(sc))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Relevant packages are loaded**" ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline\n", "from 
pyspark.mllib.regression import LabeledPoint\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**The iris dataset is loaded in the driver program**" ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.datasets import load_iris\n", "iris = load_iris()\n", "X = iris.data # Input attributes\n", "y = iris.target # Label\n", "# zip is used so that each instance is a tuple of (label, input attributes).\n", "# This will make life easier later\n", "# Note: list(zip([1,2,3], [\"a\",\"b\",\"c\"])) => [(1, 'a'), (2, 'b'), (3, 'c')]\n", "# list() makes sure a real list (not a lazy iterator) is passed to sc.parallelize\n", "data = list(zip(y, X))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**It is then distributed into 4 partitions (or as many as the number of cores in your computer)**" ] }, { "cell_type": "code", "execution_count": 101, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4\n" ] } ], "source": [ "data_rdd = sc.parallelize(data, 4)\n", "print(data_rdd.getNumPartitions())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Now, each instance is transformed into a Spark LabeledPoint. K-means could be programmed using only standard RDDs, without LabeledPoints, but we will use them in order to handle all datasets in the same way.**" ] }, { "cell_type": "code", "execution_count": 102, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[LabeledPoint(0.0, [5.1,3.5,1.4,0.2])]" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_rdd = data_rdd.map(lambda x: LabeledPoint(x[0], x[1]))\n", "data_rdd.take(1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Now we separate input attributes (X_rdd) from labels (y_rdd). 
In order to program K-means, only X_rdd will be used, because K-means is an unsupervised algorithm.**" ] }, { "cell_type": "code", "execution_count": 103, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X_rdd = data_rdd.map(lambda x: x.features)\n", "y_rdd = data_rdd.map(lambda x: x.label)" ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[DenseVector([5.1, 3.5, 1.4, 0.2]), DenseVector([4.9, 3.0, 1.4, 0.2])]\n", "[0.0, 0.0]\n" ] } ], "source": [ "print(X_rdd.take(2))\n", "print(y_rdd.take(2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**The K-means algorithm starts here. First, the number of clusters is set to 3 and the initial prototypes are set to three randomly chosen instances (the original algorithm initializes them to three random locations).**" ] }, { "cell_type": "code", "execution_count": 105, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[DenseVector([6.4, 2.8, 5.6, 2.2]),\n", " DenseVector([5.5, 2.5, 4.0, 1.3]),\n", " DenseVector([5.2, 4.1, 1.5, 0.1])]" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "K = 3\n", "# takeSample(withReplacement, num, seed): K distinct instances, with a fixed seed for reproducibility\n", "kPrototypes = X_rdd.takeSample(False, K, 1)\n", "kPrototypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**The following function is not part of the K-means algorithm, but it will help us plot the iris data and the current location of the three prototypes. Remark: iris has four input attributes and therefore the prototypes are also four-dimensional. For the sake of simplicity, the plot shows only the first two input attributes. 
**" ] }, { "cell_type": "code", "execution_count": 106, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def plotPrototypes():\n", " plt.scatter(X[:, 0], X[:, 1])\n", " # kPrototypes is transformed into a numpy matrix (attributes in the columns, instances in the rows)\n", " kProtos = np.array(map(lambda x: x, np.array(kPrototypes)))\n", " plt.scatter(kProtos[:,0], kProtos[:,1], color='red', s=200)" ] }, { "cell_type": "code", "execution_count": 107, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXkAAAEACAYAAABWLgY0AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAG3dJREFUeJzt3X+MHOWd5/H3dzJDPMaBXTvLnA6C2YMgi73NOuxBnMBl\n+rSecWyksQi7CpdEjK3VgiJWTphR1tnYEXNazyasZMKhrESsrMDccYRbEnz2sfbYEdvHzh9xwMZZ\nEuwVLDjhSLAWCJcDTALH9/7oGnto93RV14+uH/15SaXp7nqq6ttP13y7+qmnnjJ3R0REqqkv7wBE\nRCQ7SvIiIhWmJC8iUmFK8iIiFaYkLyJSYUryIiIVFinJm9lxM/uhmT1hZj9YoMydZva0mR0xs5Xp\nhikiInH0Ryz3DlBz91+0mmlma4GL3f2DZvYR4C5gVUoxiohITFGbayyk7HrgXgB3Pwica2ZDCWMT\nEZGEoiZ5Bw6Y2WNm9ict5p8PPD/v+QvBayIikqOozTVXufvPzey3aCT7o+4+m2VgIiKSXKQk7+4/\nD/7+i5k9BFwJzE/yLwAfmPf8guC1dzEzDZQjIhKDu1uc5UKba8xssZktCR6fDYwCP2oqthu4ISiz\nCnjV3U8sEGihpltvvTX3GMoSl2JSTL0QVxFjSiLKkfwQ8FBwFN4P3Ofu+83spkbO9h3u/ndmts7M\nngFeBzYmikpERFIRmuTd/TngjH7v7v7Npud/mmJcIiKSgp6/4rVWq+UdQktFjEsxRaOYoitiXEWM\nKQlL2t7T0cbMvJvbExGpAjPDszrxKiIi5aUkLyJSYUryIiIVpiQvIlJhSvIiIhWmJC8iUmFK8iIi\nFaYkLyJSYUryIiIVpiQvIlJhSvIiIhWmJC8iUmFK8iIiFaYkLyJSYUryIiIVpiQvIlJhSvIiIhWm\nJC8iUmFK8iIiFaYkLyJSYUryIiIVFjnJm1mfmR02s90t5g2b2avB/MNmtjXdMEVEJI7+Dsp+HngK\nOGeB+Y+6+1jykEREJC2RjuTN7AJgHfCtdsVSiUhERFITtbnm68AXAW9T5qNmdsTMHjazy5KHJiIi\nSYU215jZNcAJdz9iZjVaH7EfAi509zfMbC2wC7i01fqmpqZOPa7VatRqtc6jFhGpsHq9Tr1eT2Vd\n5t7u4BzM7C+BzwJvA4PA+4DvuvsNbZZ5Dvh9d3+l6XUP256IiLybmeHusZrEQ5N804aGgcnmE6xm\nNuTuJ4LHVwL/3d0varG8kryISIeSJPlOetc0b/QmwN19B/CHZvY54C3gJPCpuOsVEZH0dHQkn3hj\nOpLvOTMzM2zfvgOAyckbWbNmTc4RiZRP15prklKS7y0zMzNce+04J0/eBsDg4GYeeminEr1Ih5Tk\npZBGR6/jwIExYDx4ZScjI7vZv/87eYYlUjpJkrzGrhERqbDYJ15F
wkxO3sjs7DgnTzaeDw5uZnJy\nZ75BifQYNddIpnTiVSQ5tcmLiFSY2uRFRKQlJXkRkQpTkhcRqTAleRGRClOSFxGpMCV5EZEKU5KX\nUDMzM4yOXsfo6HXMzMzkHY6IdED95KvgxAn49rfh+PHG84suguuvh6GhxKvWIGMi+dPFUL3q2DHY\nvBn27288f/PNxt9Fixp/R0fhtttgxYrYm9AgYyL508VQvWh2Fq64AvbsaST3uQQPp5/v2dMoMzub\nX5wikisl+TI6dgzWroXXXoN2v4zcG2XWrm0sE8Pk5I0MDm4GdgI7g0HGboy1LhHpPjXXlNH69Y2j\n9Kh1aQZjY7BrV6zNaZAxkXypTb6XnDjROLE6v3kmikWLGidmUzgZKyLdpTb5XvLtb+ezrIiUkpJ8\n2Rw/3vlRPDSW+clPUg9HRIpNSV5EpMKU5MvmootO94PvxKJFsHx56uGISLFFTvJm1mdmh81s9wLz\n7zSzp83siJmtTC9EeZfrr89nWREppU6O5D8PPNVqhpmtBS529w8CNwF3pRCbtDI01LiS1To40W4G\na9YUrmeNxsQRyV6kJG9mFwDrgG8tUGQ9cC+Aux8EzjWzYmWUKrntNjj77Ojlzz4bvva17OKJYW5M\nnAMHxjhwYIxrrx1XohfJQNQj+a8DXwQW6uR+PvD8vOcvBK9JFlasgL17YcmS9kf0Zo0ye/cmGr8m\nC9u37wgGPRsHGgOgzV1wJSLp6Q8rYGbXACfc/YiZ1YBYHfLnTE1NnXpcq9Wo1WpJVte7rr4aHnsM\nvvQlmDsCbh6gbM2axhF8wRK8iLRXr9ep1+uprCv0ilcz+0vgs8DbwCDwPuC77n7DvDJ3AX/v7g8E\nz48Bw+5+omlduuI1C3NDDc/1g1++PLWhhrOiIYxFouvasAZmNgxMuvtY0+vrgJvd/RozWwXc4e6r\nWiyvJC+naEwckWhySfJmdhPg7r4jmPcN4BPA68BGdz/cYnkleRGRDmmAMklkenqa22+/G4CJiY1s\n2bIl54hEZL4kST70xKtU2/T0NFu3/hVwJwBbt24CUKIXqQgdyfe4Zcsu4ZVXvsL82/stXfoXvPzy\nM3mGJSLzaKhhERFpSUm+x01MbAQ2MXd7P9gUvCYiVaDmGtGJV5GCU3NNj9iwYQMDA0MMDAyxYcOG\n1Na7ZcsWXn75GV5++ZnUE7wGIZM0JdmfenZfdPeuTY3NSRzj4+MO5zjcE0zn+Pj4eN5htbVv3z4f\nHBw6FfPg4JDv27cv77CkpJLsT2XfF4PcGS/vxl0w1saU5GPr7z8v2EE9mO7x/v7z8g6rrZGRT54R\n88jIJ/MOS0oqyf5U9n0xSZJXc42ISIXpYqiS+Mxn1rJz56Z5r2ziM5+5Nrd4opicvJHZ2XFOnmw8\nHxzczOTkznyDktJKsj/18r6o3jUlsmHDBu67by/QSPr33HNPvgFFoEHIJE1J9qcy74sau0ZEpMLU\nhbJHZNV9rGe7lon0grhnbONMqHdNbFl1Hyt71zKRXkCC3jVqrimJ0dHrOHBgjPkDiY2M7Gb//u8k\nWjbJekWkO9RcIyIiLakLZUlk1X2sl7uWifQCNdeUSFbdx8rctUykF6gLZQbySnxKuFIU2heLI0mS\nV++aFvLqcaKeLlIU2heLBfWuSVdePU7U00WKQvtisah3jYiItKTeNS3k1eNEPV2kKLQvVkdoc42Z\nvRd4FDiLxpfCg+7+n5rKDAP/A3g2eOm77r6txbpK0VwDOvEqon2xODI/8QosDv6+B/g+cGXT/GFg\nd4T1ZHFOolK2bdvmS5de7EuXXuzbtm2LPG/fvn0+MvJJHxn5ZOonyLJct4iEo1t3hgIWA48DVzS9\nPgzsibB8phVRdtu2bTvjFn9zybzdvCx7QqiXhUj+Mk/yNE7QPgH8Evhqi/nDwEvAEeBh4LIF1pN5\nZZTZ0qUXn3GLsqVLLw6dl+Wt
zcp+2zSRKkiS5COdeHX3d4APm9k5wC4zu8zdn5pX5BBwobu/YWZr\ngV3Apa3WNTU1depxrVajVqtFCUFEpGfU63Xq9Xo6K+v0WwH4CjARUuY5YGmL1zP7pqsCNdeISCtk\n2VwDvB84N3g8SKOnzbqmMkPzHl8JHF9gXVnXRenpxKuINEuS5KN0ofxdYCeNdvk+4AF3nzazm4IN\n7zCzm4HPAW8BJ4Fb3P1gi3V52PZEROTdMr3i1d2fdPfL3X2lu3/I3aeD17/p7juCx3/t7v/W3T/s\n7h9rleDLJskt8aanp1m27BKWLbuE6enp1JbN6hZ+SeLNi26FKBJR3J8AcSZK0lyTpB26Xdt5kmWz\nuoVfknjzolshSq+hW/3kk05lSfJJug226+qYZNl2MeUVb16SvN+s6lEkS0mSvAYoExGpsrjfDnEm\nSnIkr+YaNdeouUaKBDXXpC9Jt8F2XR2TLNsuprzizUuS95tVPYpkRUm+QpSA8pXXF14Zv2ile5Tk\nK0JNCfnKq+mqjE1m0l1K8hWhnh/5yqunURl7OEl3JUny6l0jIlJhuv1fgbS75Zpux5a9iYmNbN26\nad4rm5iY+LPKbld6RNyfAHEm1FwTSide86UTr1JEqE2+tay62WX5D6lEfloR66KII4FmdWBQxPrv\nVUryLWR1wUyWPSHUg+a0ItZFEcf0z6pHVhHrv5cpybeQ1fgmWfaEUA+a04pYF0W8BWNWPbKKWP+9\nLEmSV+8aEZEqi/vtEGdCzTWZxVw1RawLNdf05r5YBKi5pjWdeC23ItaFTrxKHpIk+dDb/6VJt/8T\nEelcprf/61VZ3V5uw4YNDAwMMTAwxIYNG1KMWLohq9sDJtkvtE9JW3F/AsSZKMnFUFm1ZY6Pj5/R\npjs+Pp7hO5E0ZdVOnWS/0D7VG1CbfLqy6nrW33/eGfP6+8/L8q1IirLqVphkv9A+1RuSJHk114iI\nVFncb4c4EyU5kldzjbSi5hrJC1k21wDvBQ4CTwBPArcuUO5O4GngCLBygTJZ10Vqsup6Nj4+7v39\n53l//3n6ZyyhrLoVJtkvtE9VX6ZJvrF+Fgd/3wN8H7iyaf5a4OHg8UeA7y+wntTffBFHbUzSlzqP\nmLO8P2xW687rOoZ2wpJtkpiz2i+K+P9TxJjylnmS99NJejHwOHBF0+t3AZ+a9/woMNRi+VTfeBFv\nl5fkqsg8Yk6yzbCrf7Nad15XJLcT1mySJOas9osi/v8UMaYi6MaRfF/QXPNL4Kst5u8BPjbv+feA\ny1uUS/WNF/F2eUkGscoj5iTbDBusLat15zWAXDthvVySxJzVflHE/58ixlQESZJ8pDtDufs7wIfN\n7Bxgl5ld5u5PRVm22dTU1KnHtVqNWq0WZzUiIpVVr9ep1+vprKzTbwXgK8BE02vNzTXHUHNNR80M\necWs5pp0qLkmHUWMqQjIuHfN+4Fzg8eDwKPAuqYy6zh94nUVOvGqE68Zr1snXtNRxP+fIsaUtyRJ\nPnSAMjP7XWAnjXb5PuABd582s5uCDe8Iyn0D+ATwOrDR3Q+3WJeHbU9ERN4t0wHK3P1Jd7/c3Ve6\n+4fcfTp4/ZtzCT54/qfufom7/16rBJ+HrAaTktPC6jirz6DdepPElOX7qdr+WLX3U1lxfwLEmSjJ\nTUMkmrzOMSRpt02y7CP33++TA+/z2xn12xn1yYH3+SP3359KXZVN1d5P0aEBys7Uy92tuiWvLqFJ\nutnFWvboUfexMf9VX5+/zsDcTH+dAf9VX5/72FijTIK6KpuqvZ+iS5LkNUCZSBu/84uX4IorYM8e\nznrnHRbz1ql5i3mLs955B/bsaZSZnc0xUpEFxP12iDOh5ppKqXpzze+9d5m/NTjo8w5X209Llix4\nRF+1/bFq76foUHNNa73a3aqb8uoSmqSbXdRlX1y1yt0sepI3c1+/PlbMZVS191NkSZJ8pCtey2
rN\nmjWsWbMm7zB62uOPP86hQz889Xj+5zEzM8P27Y0OWpOTN3b0WbX7bNttM6rf+NWbvP/QoUb6jsqd\nX+/Zw2eHr+GPv7zpjO0m2R+T1FVW2r2frOItYj0UXtxvhzgTJRpqWMKF/WRPcvVvXEmuHp0/bxOf\n9jeiHsHPm15nwDfx6VSbL8rWNJJHM13VoeYayUNYD4skg7XFlWSwr/nzbucLHSf4uWk7t6Ta26Rs\nPVny6FVVdUmSvHrXiIhUWdxvhzgTOpKvFDXXqLmmFTXXpA8110hewnpYJBmsLa4kg33Nzfujj6/z\ntwdOX/gUdfpVX5//0cfXpZ58ytaTJY9eVVWWJMmHDlCWJg1Qlp28eh0k2e7IyAjf+94TAKxe/WEO\nHDjQle1GXe9/+b8/Y+jgweg9bMxgbAx27UolFpE5SQYo05F8BRRx7O8wq1evPqNZZfXq1Zlvt5P1\npnkxlEgSqLmmtxXxVm1hYNkZy8KyzLfb6Xq/8O8+3kje7S6KMmuU+Yd/SByDSCtJkrx614i08ePf\nfD889hiMjfHrvj5OMnBq3kkG+HVfX6OJ5rHH4Oqrc4xUZAFxvx3iTOhIPhNqrulO7425oYa3M+rb\nOxxqWCQJ1FwjRbxVW5hGol/msCxygk9ju0nW26u9OyRfSZK8eteIiBRcprf/E8nqlnd53BpQyk2f\nbQxxfwLEmVBzTelkdcs7XRUpnerlzxa1yUvqMr7lnQaxkk718mebJMmruUbONDurW96JVEXcb4c4\nEzqSL76jRxsX9qRwlaeaayRNvfzZkmVzDXAB8AjwY+BJYFOLMsPAq8DhYNq6wLqyrw1JZmysa7e8\n0yBW0qle/WyzTvL/ClgZPF4C/BOwoqnMMLA7wroyroriS7KTZr6Dv/ii+6JF0RN8hJEXy/ZPGTaC\nZVbKVk/SXZkm+TMWgF3AHzS9NgzsibBslvVQeEl+bnblp+odd8RK8guNoV62n9dhY9FnpWz1JN3X\ntSQPXAQcB5Y0vT4MvAQcAR4GLltg+azrotCS9A7oSs+CL6R7y7uy9YYIu3VgVspWT9J9SZJ8f9QT\ntGa2BHgQ+Ly7v9Y0+xBwobu/YWZrg6P9S1utZ2pq6tTjWq1GrVaLGoKISE+o1+vU6/V0VhblmwDo\nB/bRSPBRyj8HLG3xeobfdcWn5ppiN0OouUaKiqyba4B7gdvbzB+a9/hK4PgC5TKtiDLQiddi04lX\nKaIkST50gDIzuwp4lEb3SQ+mLwPLgw3vMLObgc8BbwEngVvc/WCLdXnY9qour9v0RbZ+feNCp6if\nk255J5K5JAOUaRTKLpqZmeHaa8c5efI2AAYHN/PQQzuLleiPHWtcyfpa82mXBSxZ0rhhxooV2cYl\n0sM0CmVJbN++I0jw40Aj2c8d1RfGihWwd28jeVubfcqsUWbvXiV4kQJTkpczXX31qVvevYmdccu7\nNzHd8k6kJNRc00WlaK5pcsef/znPfe3rLGcYgJ/wv/jtL93CF7761ZwjE+kdapMvkcKfeG1henqa\n22+/G4CJiY1s2bIl54hEeouSvIhIhenEawxFvY1YEeMqYkxZ6aX3Kj0ibgf7OBMFuRiqqFcYFjGu\nIsaUlV56r1IudHMUyiRTUZJ8UQeEKmJcRYwpK730XqVckiT5nm2uERHpBZFHoaySyckbmZ0d5+TJ\nxvPBwc1MTu7MNyiKGVcRY8pKL71X6R0927umqF0ZixhXEWPKSi+9VykPdaGUnpOk774SuZRNkiTf\nk801Um7T09Ns3fpXwJ0AbN26CSBSom++6nh2drzwVx2LJKEjeSmdZcsu4ZVXvkJjoDeAnSxd+he8\n/PIzocuOjl7HgQNj71p2ZGQ3+/d/J6NoRZLTxVAiItKSmmukdCYmNp5qomnYxMTEn0VaVj1opNeo\nuUZKSSdepZeod42ISIWpTV5ERFpSkhcRqTAleRGRClOSFx
GpMCV5EZEKC03yZnaBmT1iZj82syfN\nbNMC5e40s6fN7IiZrUw/VBER6VSUI/m3gQl3/x3go8DNZrZifgEzWwtc7O4fBG4C7ko90h6h28+J\nSJpCr3h19xeBF4PHr5nZUeB84Ni8YuuBe4MyB83sXDMbcvcTGcRcWRo8S0TS1lGbvJldBKwEDjbN\nOh94ft7zF4LXpAPbt+8IEvw40Ej2c1dmiojEEXnsGjNbAjwIfN7dX4u7wampqVOPa7UatVot7qpE\nRCqpXq9Tr9dTWVekYQ3MrB/4n8Bed//PLebfBfy9uz8QPD8GDDc312hYg/aam2sGBzeruUZEsh+7\nxszuBV5y94kF5q8Dbnb3a8xsFXCHu69qUU5JPoQGzxKRZpkmeTO7CngUeBLwYPoysBxwd98RlPsG\n8AngdWCjux9usS4leRGRDmkUShGRCtMolCIi0pKSvIhIhSnJi4hUmJK8iEiFKcmLiFSYkryISIUp\nyYuIVJiSvIhIhSnJi4hUmJK8iEiFKcmLiFSYkryISIUpyYuIVJiSvIhIhSnJi4hUmJK8iEiFKcmL\niFSYkryISIUpyYuIVJiSvIhIhSnJi4hUmJK8iEiFhSZ5M/sbMzthZv+4wPxhM3vVzA4H09b0wxQR\nkTiiHMnfDawJKfOou18eTNtSiKtr6vV63iG0VMS4FFM0iim6IsZVxJiSCE3y7j4L/CKkmKUTTvcV\n9QMtYlyKKRrFFF0R4ypiTEmk1Sb/UTM7YmYPm9llKa1TREQS6k9hHYeAC939DTNbC+wCLk1hvSIi\nkpC5e3ghs+XAHnf/UISyzwG/7+6vtJgXvjERETmDu8dqFo96JG8s0O5uZkPufiJ4fCWNL44zEnyS\nIEVEJJ7QJG9m/w2oAcvM7KfArcBZgLv7DuAPzexzwFvASeBT2YUrIiKdiNRcIyIi5ZTZFa9m1hdc\nHLV7gfl3mtnTQa+clVnFETWmPC7qMrPjZvZDM3vCzH6wQJk86qltXDnV1blm9rdmdtTMfmxmH2lR\npqt1FRZTt+vJzC4NPrPDwd//Y2abWpTrWj1FiSmn/ekWM/uRmf2jmd1nZme1KJPH/17buGLVlbtn\nMgG3AP8V2N1i3lrg4eDxR4DvZxVHBzENt3o943ieBX6zzfy86iksrjzq6h5gY/C4Hzgn77qKEFPX\n62netvuAnwEfyLueIsTU1XoC/nWwj58VPH8AuCHveooYV8d1lcmRvJldAKwDvrVAkfXAvQDufhA4\n18yGsoilg5ig+xd1Ge1/TXW9niLGNVemK8zsHODfu/vdAO7+trv/sqlYV+sqYkyQ34WCq4F/dvfn\nm17Pa59qFxN0v57eA5xtZv3AYhpfPvPlVU9hcUGHdZVVc83XgS8CCzX4nw/M/6BfCF7LUlhM0P2L\nuhw4YGaPmdmftJifRz1FiQu6W1e/DbxkZncHP1F3mNlgU5lu11WUmCC/CwU/Bdzf4vW89ilYOCbo\nYj25+8+A7cBPabz/V939e03Ful5PEeOCDusq9SRvZtcAJ9z9CG26XnZTxJjmLupaCXyDxkVdWbvK\n3S+n8QvjZjO7ugvbjCIsrm7XVT9wOfDXQVxvAF/KeJthosSUxz6FmQ0AY8DfdmN7UYTE1NV6MrPf\noHGkvpxGE8kSM/t0ltuMImJcHddVFkfyVwFjZvYsjW/t/2Bm9zaVeQH4wLznFwSvZSU0Jnd/zd3f\nCB7vBQbMbGmGMeHuPw/+/gvwEHBlU5Fu11OkuHKoq/8NPO/ujwfPH6SRYOfrdl2FxpTHPhVYCxwK\nPr9muexT7WLKoZ5WA8+6+yvu/v+A7wIfayqTRz2FxhWnrlJP8u7+ZXe/0N3/DXA98Ii739BUbDdw\nA4CZraLxs+RE2rF0EtP89jYLuagrDWa22MyWBI/PBkaBHzUV62o9RY2r23UVvOfnzWxuuIw/AJ5q\nKtbtfSo0pm7X0zz/kY
WbRbq+T4XFlEM9/RRYZWaLzMxofHZHm8rkUU+hccWpqzTGronEzG4iuIDK\n3f/OzNaZ2TPA68DGbsWxUEx0/6KuIeAhawz10A/c5+77C1BPoXGRzwVwm4D7gp/9zwIbC1BXbWMi\nh3oys8U0jghvnPdarvUUFhNdrid3/4GZPQg8EWzzMLAj73qKEhcx6koXQ4mIVJhu/yciUmFK8iIi\nFaYkLyJSYUryIiIVpiQvIlJhSvIiIhWmJC8iUmFK8iIiFfb/AeelSQTtj491AAAAAElFTkSuQmCC\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plotPrototypes()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Step 2 of the K-means algorithm is \"2.Assign each instance xi to its closest prototype\". Function closestPrototype is defined for this purpose** " ] }, { "cell_type": "code", "execution_count": 108, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def closestPrototype(x):\n", " # First, \n", " distances = map(lambda kp: np.linalg.norm(x-kp), kPrototypes)\n", " distances = np.array(distances)\n", " return(np.ndarray.argmin(distances))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Step 2.Assign each instance xi to its closest prototype.** \n", " closest_rdd is an RDD that contains a (key, value) tuple. The key is the index (0,1,2) of the closest of the three prototypes. The value is (x,1). x are the input attributes *x* of the instance. The \"1\" is difficult to understand at this point, but it will be useful later." 
] }, { "cell_type": "code", "execution_count": 109, "metadata": { "collapsed": false }, "outputs": [], "source": [ "closest_rdd = X_rdd.map(lambda x: (closestPrototype(x), (x,1)))" ] }, { "cell_type": "code", "execution_count": 110, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(2, (DenseVector([5.1, 3.5, 1.4, 0.3]), 1)),\n", " (2, (DenseVector([5.0, 3.0, 1.6, 0.2]), 1)),\n", " (1, (DenseVector([6.1, 2.8, 4.7, 1.2]), 1)),\n", " (0, (DenseVector([6.3, 2.5, 5.0, 1.9]), 1)),\n", " (0, (DenseVector([7.4, 2.8, 6.1, 1.9]), 1)),\n", " (2, (DenseVector([4.7, 3.2, 1.6, 0.2]), 1)),\n", " (0, (DenseVector([6.3, 2.9, 5.6, 1.8]), 1)),\n", " (2, (DenseVector([4.6, 3.4, 1.4, 0.3]), 1)),\n", " (0, (DenseVector([6.8, 3.0, 5.5, 2.1]), 1)),\n", " (1, (DenseVector([5.5, 2.5, 4.0, 1.3]), 1))]" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "closest_rdd.takeSample(False, 10, 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Step 3.(REDUCE) Update the location of prototypes kj as the average of the instances xi assigned to each cluster. 
**" ] }, { "cell_type": "code", "execution_count": 111, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(0, (DenseVector([383.1, 173.1, 316.7, 113.6]), 58)),\n", " (1, (DenseVector([243.1, 114.1, 173.9, 54.0]), 42)),\n", " (2, (DenseVector([250.3, 170.9, 73.2, 12.2]), 50))]" ] }, "execution_count": 111, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# For every prototype, compute the sum of the input attributes of the instances assigned to it: p1[0] + p2[0]\n", "# The aim is to eventually compute the average, which is this sum divided by the number of instances assigned to the prototype\n", "# In order to count the instances assigned to each prototype, we add up the \"1\"s: p1[1] + p2[1]\n", "# Notice that after collect(), kPrototypes is a plain list held in the (local) driver program\n", "kPrototypes = closest_rdd.reduceByKey(lambda p1, p2: (p1[0] + p2[0], p1[1] + p2[1])).collect()\n", "kPrototypes\n" ] }, { "cell_type": "code", "execution_count": 112, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[DenseVector([6.6052, 2.9845, 5.4603, 1.9586]),\n", " DenseVector([5.7881, 2.7167, 4.1405, 1.2857]),\n", " DenseVector([5.006, 3.418, 1.464, 0.244])]" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The average is now computed locally (in the driver program)\n", "# Each collected element is (prototype index, (sum of attribute vectors, number of instances))\n", "kPrototypes = [summation / n for (k, (summation, n)) in kPrototypes]\n", "kPrototypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Let's see if the prototypes have a better location now**" ] }, { "cell_type": "code", "execution_count": 113, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAXkAAAEACAYAAABWLgY0AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAHIRJREFUeJzt3X+QHPV55/H3I3ZhVwhwVjlt7vilOwjmSOIIMCAs7jQp\nRytLVK0KkzpzccJKrjKUC9cKrc6Gs0SxV9YmTqqEBYUrWPYFRB0BbIyIHIxWyjkTUFKWjYQcbEsp\nE1BMiLUVSyE+JJFI9nN/TK8YRrMzPd3T0z/m86rq2vnR0/3Md3qe6f320982d0dERIppVtoBiIhI\ncpTkRUQKTEleRKTAlORFRApMSV5EpMCU5EVECixUkjezg2b2XTN7ycy+PcM8D5jZD81sn5ktaG+Y\nIiISRU/I+X4OlNz9n+s9aWbLgEvc/ZfN7DrgIWBhm2IUEZGIwnbXWJN5VwCPArj7buA8MxuMGZuI\niMQUNsk7sNPMvmNmH6/z/PnA61X33wgeExGRFIXtrlnk7j82s39HJdnvd/ddSQYmIiLxhUry7v7j\n4O8/mdlW4FqgOsm/AVxYdf+C4LF3MTMNlCMiEoG7W5TXNe2uMbPZZjYnuH02MAR8r2a2bcCtwTwL\ngTfdfWqGQDM13XvvvanHkJe4FJNi6oa4shhTHGH25AeBrcFeeA/wmLvvMLPbKznbN7v7N8xsuZm9\nAhwFVsWKSkRE2qJpknf314DT6t7d/Ys19z/ZxrhERKQNuv6M11KplHYIdWUxLsUUjmIKL4txZTGm\nOCxuf09LKzPzTq5PRKQIzAxP6sCriIjkl5K8iEiBKcmLiBSYkryISIEpyYuIFJiSvIhIgSnJi4gU\nmJK8iEiBKcmLiBSYkryISIEpyYuIFJiSvIhIgSnJi4gUmJK8iEiBKcmLiBSYkryISIEpyYuIFJiS\nvIhIgSnJi4gUmJK8iEiBKcmLiBRY6CRvZrPMbK+Zbavz3GIzezN4fq+ZrW9vmCIiEkVPC/OuBn4A\nnDvD88+7+3D8kEREpF1C7cmb2QXAcuDLjWZrS0QiItI2YbtrPg98CvAG81xvZvvM7FkzuyJ+aCIi\nElfT7hozuxGYcvd9Zlai/h77HuAidz9mZsuAZ4DL6i1vfHz81O1SqUSpVGo9ahGRAiuXy5TL5bYs\ny9wb7ZyDmf0e8DvASaAfOAd42t1vbfCa14Cr3f1IzePebH0iIvJuZoa7R+oSb5rka1a0GFhbe4DV\nzAbdfSq4fS3wFXefX+f1SvIiIi2Kk+Rbqa6pXentgLv7ZuC3zOwTwAngOPCRqMsVEZH2aWlPPvbK\ntCffdSYnJ9m4cTMAa9fextKlS1OOSCR/OtZdE5eSfHeZnJzkpptGOH78DwDo77+LrVu3KNGLtEhJ\nXjJpaOhmdu4cBkaCR7awZMk2duz4WpphieROnCSvsWtERAos8oFXkWbWrr2NXbtGOH68cr+//y7W\nrt2SblAiXUbdNZIoHXgViU998iIiBaY+eRERqUtJXkSkwJTkRUQKTEleRKTAlORFRApMSV5EpMCU\n5KWpyclJhoZuZmjoZiYnJ9MOR0RaoDp5aUiDjImkTydDSWI0yJhI+nQylIiI1KUByqQhDTImkm/q\nrpGmNMiYSLrUJy9KxCIFpiTf5VQBI1JsSvJdThUwIsWm6hoREalL1TUFoAoYEZlJ6O4aM5sFvAj8\ng7sP13n+AWAZcBRY6e776syj7pqE6MCrSHF1pE/ezNYAVwPn1iZ5M1sGfNLdbzSz64D73X1hnWUo\nycsp+mESCSfxPnkzuwBYDnx5hllWAI8CuPtu4DwzG4wSkHSH6YqgnTuH2blzmJtuGtHgZyIJCHvg\n9fPAp4CZdsPPB16vuv9G8JhIXRs3bg5KPkeASvnn9F69iLRP0wOvZnYjMOXu+8ysBET6l2Ha+Pj4\nqdulUolSqRRncSIihVMulymXy21ZVtM+eTP7PeB3gJNAP3AO8
LS731o1z0PAX7j7k8H9A8Bid5+q\nWZb65AXQCVwirejYyVBmthhYW+fA63LgjuDA60Jgkw68SjM68CoSTipJ3sxuB9zdNwfPPQh8iEoJ\n5Sp331vn9UryIiIt0rAGEsvExAT33fcwAGNjq1i3bl3KEYlItThJXme8drmJiQnWr/9D4AEA1q8f\nBVCiFykI7cl3ublzL+XIkXuoHtxsYOCzHD78SpphiUgVDVAmIiJ1Kcl3ubGxVcAosCWYRoPHRKQI\n1F0jOvAqknHqrukSK1eupLd3kN7eQVauXNm25a5bt47Dh1/h8OFX2p7gJycnGRq6maGhmzU2jcQW\nZ3vq2m3R3Ts2VVYnUYyMjDic6/BIMJ3rIyMjaYfV0Pbt272/f/BUzP39g759+/a0w5KcirM95X1b\nDHJntLwb9YWRVqYkH1lPz7xgA/VgesR7eualHVZDS5Z8+LSYlyz5cNphSU7F2Z7yvi3GSfLqrhER\nKTCdDJUTH/3oMrZsGa16ZJSPfvSm1OIJQ5cllHaKsz1187ao6pocWblyJY899hxQSfqPPPJIugGF\noEHIpJ3ibE953hY1do2ISIGphLJLJFU+1rWlZSLdIOoR2ygTqq6JLKnysbyXlol0A2JU16i7JieG\nhm5m585hqgcSW7JkGzt2fC3Wa+MsV0Q6Q901IiJSl0oocyKp8rFuLi0T6QbqrsmRpMrH8lxaJtIN\nVEKZgLQSnxKuZIW2xeyIk+RVXVNHWhUnqnSRrNC2mC2ouqa90qo4UaWLZIW2xWxRdY2IiNSl6po6\n0qo4UaWLZIW2xeJo2l1jZmcBzwNnUvlReMrd/1fNPIuBPwVeDR562t031FlWLrprQAdeRbQtZkfi\n1TVmNtvdj5nZGcBfAaPu/u2q5xcDa919uMlycpPkUzE1xY6PfYy/+79/BcAlH1zE0B//MQwOAo2v\nxZrkF1JfdpF0day6BpgNvAhcU/P4YuDrIV7fzgPOxbF/v/vwsJ/o6fGj71y6xo+Cn+jpcR8e9j9a\nvfq0y/9t2LDB3ZOthFCVhUj6SPryf1QO0L4E/BT4/TrPLwZ+AuwDngWumGE5iTdG7rzwgvucOe5m\np5L7aZOZ/z/MF/GZd12+bGDgEndP9tJmeb9smkgRxEnyoQ68uvvPgSvN7FzgGTO7wt1/UDXLHuAi\nr3TpLAOeAS6rt6zx8fFTt0ulEqVSKUwIxXTgACxbBm+91Xg+d+YAz7GRa/hd/pbLOxKeiKSjXC5T\nLpfbs7BWfxWAe4CxJvO8BgzUeTyxX7pcGh5uvAdfM50Ef5or1V0j0mVI8mQoM/tF4IS7/4uZ9QOT\nwOfc/RtV8wy6+1Rw+1rgK+4+v86yvNn6usbUFMyfD2+/3dLL3sZY8J6L+N3/8XEdeBXpEolW15jZ\nrwFbqPTLzwKedPcJM7udyq/LZjO7A/gEcAI4Dqxx9911lqUkP+3+++Huu1tO8vT1wec+B6tXJxOX\niGROome8uvvL7n6Vuy9w9/e5+0Tw+BfdfXNw+wvu/qvufqW7f6Begs+bOJfEm5iYYO7cS5k791Im\nJibqz3TwYOsJHuDtt/nafQ+29RJ+oeLNGF0KUSSkqP08USZy0icfpx96w4YNM5Y6vsudd4bui6+d\nNjLUtkv4hY43Q3QpROk2JF1C2a4pL0k+TtngwMAlp712utTxXTZtcu/raznBH6XPR9n0rpg6Em+G\nxHm/jV6rclHJqjhJXgOUpeWWWyK9zIAniPZaEelCUX8dokzkZE++Y90fMUoo1V2j7hrpHqi7pv22\nb9/uS5Z82Jcs+XDLX/QNGzb4wMAlPjBwSeOEuX9/5WzXkEn+RH+/f+wDS+rG1JF4MyTO+2302jjL\nFUlKnCSvi4akbdeuylmvR4NRa+oxg7PPZvf4OPdM/jWgevWkNBoErojrlXzQ5f/ybv9+9xUr/GRv\nrx+r2nM/Bn6yt9d9xQp/4
UtfUldCwtLquspjl5l0FtqTL4b/tvhG/v3z7+FiKkML/z1T/Pi/vslX\n/vJZXY6tA+bOvZQjR+6huo0HBj7L4cOvFHK9kh9x9uR1ZagMefOsPr7KEO9K5GdtSzMkEck5JfkM\naXTJNV2OLXljY6tYv3606pFRxsY+Xdj1SpeI2s8TZUJ98k2p8iNdaVUa5bHCSToHlVDWl1SZXZJf\nSCXyd2SxLRp99knGG/XHP6nvgHSWknwdSZ0wk2QlhE7GeUcW26LRZ5/WmP5JndyVxfbvZkrydSQ1\nvkmSY71o7JR3ZLEtGn32aV2CMamxeLLY/t0sTpLX2DUiIkUW9dchyoS6axKLuWiy2BbqrunObTEL\nUHdNfTrwmm9ZbAsdeJU0xEnyOuNVRCTjEr38X7dK6vJyK1eupLd3kN7eQVauXNnGiKUTkro8YJzt\nQtuUNBT1X4AoEzk5GSqpvsyRkZHT+nRHRkYSfCfSTkn1U8fZLrRNdQfUJ99eSZWe9fTMO+25np55\nSb4VaaNYZYWHDlUu+XjnnZVp06bKYx5vu9A21R3iJHmNXSOSpAMH4K67YMeOyv2336787euDu++G\noSHe6yf5fnoRStFF/XWIMpGTPXl110g9LW8XL7xQufJXo0s8mvmxnh5fxGx118iMSLK7BjgL2A28\nBLwM3DvDfA8APwT2AQtmmCfptmibpErPRkZGvKdnnvf0zNOXMYdCbxctXtrxeE+P/8oZA5G2C21T\nxZdokq8sn9nB3zOAbwHX1jy/DHg2uH0d8K0ZltP2N5/FURvj1FKnEXOS14dNatlpncfQSHWy3XPh\nhS1dpP1nZv5nvbMTacdGsvj9yWJMaUs8yZ+aGWYDLwLX1Dz+EPCRqvv7gcE6r2/rG0/qbL844pwV\nmUbMcdbZ7OzfpJad1hnJjVR3m8zj/nddxjHsdIxen8f9bW3HRrL4/cliTFnQiT35WUF3zU+B36/z\n/NeBD1Td/3PgqjrztfWNJzU4UxxxBrFKI+Y462w2WFtSy05rALlGqqtcRtnkR+ltOckfpc9H2dTW\ndmwki9+fLMaUBXGSfKjqGnf/OXClmZ0LPGNmV7j7D8K8ttb4+Pip26VSiVKpFGUxIpk1n4PM5kTL\nr5vN21zM3wPvaX9QkivlcplyudyehbX6qwDcA4zVPFbbXXMAddec1l2g7pru6K65j6GW9+Knp40M\ntbUdG8ni9yeLMWUBCVfX/CJwXnC7H3geWF4zz3LeOfC6EB141YHXhJed5QOva2bN8X8944yWE/wx\nzO+ePVcHXjMYU9riJPmmA5SZ2a8BW6j0y88CnnT3CTO7PVjx5mC+B4EPAUeBVe6+t86yvNn6RHJv\nagrmz3/nxKew+vrg4EEYHEwiKsmxRAcoc/eX3f0qd1/g7u9z94ng8S9OJ/jg/ifd/VJ3//V6CT4N\nSQ0mJe9o1sZJfQaNlhsnpra8n8FBGBoCa+E7aQZLl+Yqwev7lRNR/wWIMtHBk6G6uf+uU9I6xhCn\n3zap156mxZOhfM6cymtyQt+vzqJTdfJxp04m+W4ut+qUtEpC45TZJfXaukIOa+Bz5lTmzRF9vzor\nTpLXePIiSbnhBvjOd2B4mH+bNYvj9J566ji9/NusWTA8XJnnhhtSDFQKLeqvQ5QJddcUirprwr+f\nbz7+uK/tPcc3MuQbGfK1vef4Nx9/PHZbpEXfr85C3TX1dWu5VSelVRIap8wuqdfGiTmPivZ+sixO\nki/0ePJLly5l6dKlaYfR1V588UX27PnuqdtLly6tlBg+8QQHy2X27P0eU/2z+c/j/5PfuOWW0Mtt\n9NnWXWebNFr25OQkGzdWCs7Wrr3ttPXG2R6bLTsNjd5PUvFmsR0yL+qvQ5SJDu/JS7Ka/ctee/bp\neznbD1x+uXtfn58888zpI3Z+lF4/Dn5o4cLYFSZxzsJt9f20cjZzHHnrGkmjm67oUHeNpKF
ZhUX1\nYGGLeMF/yll+skEZ4c+mSwljVJrEGTStlfdTu+wkq03yVsmSRlVV0cVJ8qqukcS9lwM8xzLO4V85\no8F8swDeeguWLatcNk9E4ov66xBlQnvyhRK2e2MrV/pJwl9Aw83cV6yIFJO6a9Kn7pr2Q901kpZm\nFRafv/tuP95Kgp+e+vrcDx2KFFOcQdOavZ84g8/FkbdKljSqqoosTpIvdHVNN8lq1cGdv/RL0HdW\n64N1ATzxBKxeXfepRu933bp1rFu3bsbFxqlyef/738/VV+89dbtdy20mb5ViScWbt3bIhKi/DlEm\ntCefiCyO/X3KnXe2vhc/Pa1Z09H3m9bJXSLNoO6a7pbFS7WdkkCST6t6o5urOyRdcZK8qmskWfPn\nV8ZJb1VfH1x8cdvDEek6UX8dokxoTz4Rme6uOXSochC1jQde1V0j3QZ110gWL9V2yvBw4+F2I5RQ\nplW90a3VHZKuOEm+6eX/2kmX/+tSBw7ANddUTnQKY86cyvC7l1+ebFwiOZHo5f9E4lzmbXJykqHR\nday5/CpO9vc3viSeWSXBP/dcrASvy9IVlz7bCKL+CxBlQt01uROnH7r2tb9+1lw/dP31pw1Qdmx6\ngLLrr489QJn6zYurmz9b1CcvSYlTNjjjaw8d8i+8933BBTTW+CibfB73axAraaibP9s4SV5nvErn\nDQ7yzEWXsvNvh4GR4MEtaUYkUlxRfx2iTGhPPnfa2V3TymBgacQr2dbNny1JVteY2QXAo8Ag8HPg\nS+7+QM08i4E/BV4NHnra3TfUWZY3W59kT5xxcRq9VlcPklZ162cbp7omTHfNSWDM3feZ2Rxgj5nt\ncPfaAb+fd/fhKEF0k6QSZlrixJTFQawmJia4776HARgbW9VwoLN2yuJnm0UaoCyCVnf9gWeAD9Y8\nthj4eojXJvCPTH4k1fWRpDhdLnn797rZWPRJyVs7SefRqeoaYD5wEJhT8/hi4CfAPuBZ4IoZXp90\nW2RaIpUqCYtzuby8VUM0u3RgUvLWTtJ5cZJ86OqaoKvmKWC1u9eeurgHuMjdj5nZsmBv/7J6yxkf\nHz91u1QqUSqVwoYgItIVyuUy5XK5PQsL80tApe9+O5UEH2b+14CBOo8n+FuXfequyXY3hLprJKtI\nuruGSnXNfQ2eH6y6fS1wcIb5Em2IPIgzwFUWByEr2oBezS4dmJS8tZN0VpwkH6aEchHwPPAy4MH0\nGeDiYMWbzewO4BPACeA4sMbdd9dZljdbX9HlsYoirYoTEamIU0KpUSg7aHJykptuGuH48T8AoL//\nLrZu3ZLpRD8xMcH69X8ITJ8aMcqGDZ9WohfpICX5nBgaupmdO999Kv+SJdvYseNraYbV0Ny5l3Lk\nyD1Uxzww8FkOH34lzbBEuoqGGhYRkbo0QFkHrV17G7t2jXD8eOV+f/9drF2b7YG5xsZWsX79aNUj\no4yNfTq1eESkNequ6TAdeBWRVqlPXkSkwNQnH0FWLyOWxbiyGFNSuum9SpeIWmAfZSIjJ0Nl9QzD\nLMaVxZiS0k3vVfIFXf6vNVkdECqLcWUxpqR003uVfImT5Lu2u0ZEpBt0ZQllVksZsxhXFmNKSje9\nV+keXVtdk9VSxizGlcWYktJN71XyQyWU0nXi1O4rkUveJH2NV5FMqR00bfqM3DCJvnaQuF27RjI/\nSJxIHNqTl9yJM2haHgeJE9HJUCIiUpe6ayR34gyapgoa6TbqrpFc0oFX6SaqrhERKTD1yYuISF1K\n8iIiBaYkLyJSYEryIiIFpiQvIlJgTZO8mV1gZt80s++b2ctmNjrDfA+Y2Q/NbJ+ZLWh/qCIi0qow\ne/IngTF3/xXgeuAOM7u8egYzWwZc4u6/DNwOPNT2SLuELj8nIu3U9IxXdz8EHApuv2Vm+4HzgQNV\ns60AHg3m2W1m55nZoLtPJRBzYWnwLBFpt5b65M1sPrA
A2F3z1PnA61X33wgekxZs3Lg5SPAjQCXZ\nT5+ZKSISReixa8xsDvAUsNrd34q6wvHx8VO3S6USpVIp6qJERAqpXC5TLpfbsqxQwxqYWQ/wZ8Bz\n7n5/necfAv7C3Z8M7h8AFtd212hYg8Zqu2v6++9Sd42IJD92jZk9CvzE3cdmeH45cIe732hmC4FN\n7r6wznxK8k1o8CwRqZVokjezRcDzwMuAB9NngIsBd/fNwXwPAh8CjgKr3H1vnWUpyYuItEijUIqI\nFJhGoRQRkbqU5EVECkxJXkSkwJTkRUQKTEleRKTAlORFRApMSV5EpMCU5EVECkxJXkSkwJTkRUQK\nTEleRKTAlORFRApMSV5EpMCU5EVECkxJXkSkwJTkRUQKTEleRKTAlORFRApMSV5EpMCU5EVECkxJ\nXkSkwJTkRUQKrGmSN7P/bWZTZvY3Mzy/2MzeNLO9wbS+/WGKiEgUYfbkHwaWNpnneXe/Kpg2tCGu\njimXy2mHUFcW41JM4Sim8LIYVxZjiqNpknf3XcA/N5nN2hNO52X1A81iXIopHMUUXhbjymJMcbSr\nT/56M9tnZs+a2RVtWqaIiMTU04Zl7AEucvdjZrYMeAa4rA3LFRGRmMzdm89kdjHwdXd/X4h5XwOu\ndvcjdZ5rvjIRETmNu0fqFg+7J2/M0O9uZoPuPhXcvpbKD8dpCT5OkCIiEk3TJG9mfwKUgLlm9iPg\nXuBMwN19M/BbZvYJ4ARwHPhIcuGKiEgrQnXXiIhIPiV2xquZzQpOjto2w/MPmNkPg6qcBUnFETam\nNE7qMrODZvZdM3vJzL49wzxptFPDuFJqq/PM7Ktmtt/Mvm9m19WZp6Nt1SymTreTmV0WfGZ7g7//\nYmajdebrWDuFiSml7WmNmX3PzP7GzB4zszPrzJPGd69hXJHayt0TmYA1wP8BttV5bhnwbHD7OuBb\nScXRQkyL6z2ecDyvAr/Q4Pm02qlZXGm01SPAquB2D3Bu2m0VIqaOt1PVumcB/whcmHY7hYipo+0E\n/IdgGz8zuP8kcGva7RQyrpbbKpE9eTO7AFgOfHmGWVYAjwK4+27gPDMbTCKWFmKCzp/UZTT+b6rj\n7RQyrul5OsLMzgX+i7s/DODuJ939pzWzdbStQsYE6Z0o+JvA37n76zWPp7VNNYoJOt9OZwBnm1kP\nMJvKj0+1tNqpWVzQYlsl1V3zeeBTwEwd/ucD1R/0G8FjSWoWE3T+pC4HdprZd8zs43WeT6OdwsQF\nnW2r/wj8xMweDv5F3Wxm/TXzdLqtwsQE6Z0o+BHg8TqPp7VNwcwxQQfbyd3/EdgI/IjK+3/T3f+8\nZraOt1PIuKDFtmp7kjezG4Epd99Hg9LLTgoZ0/RJXQuAB6mc1JW0Re5+FZX/MO4wsxs6sM4wmsXV\n6bbqAa4CvhDEdQy4O+F1NhMmpjS2KcysFxgGvtqJ9YXRJKaOtpOZvYfKnvrFVLpI5pjZbye5zjBC\nxtVyWyWxJ78IGDazV6n8av+GmT1aM88bwIVV9y8IHktK05jc/S13Pxbcfg7oNbOBBGPC3X8c/P0n\nYCtwbc0snW6nUHGl0Fb/ALzu7i8G95+ikmCrdbqtmsaUxjYVWAbsCT6/WqlsU41iSqGdfhN41d2P\nuPvPgKeBD9TMk0Y7NY0rSlu1Pcm7+2fc/SJ3/0/ALcA33f3Wmtm2AbcCmNlCKv+WTLU7llZiqu5v\nsyYndbWDmc02sznB7bOBIeB7NbN1tJ3CxtXptgre8+tmNj1cxgeBH9TM1ultqmlMnW6nKv+dmbtF\nOr5NNYsphXb6EbDQzPrMzKh8dvtr5kmjnZrGFaWt2jF2TShmdjvBCVTu/g0zW25mrwBHgVWdimOm\nmOj8SV2DwFarDPXQAzzm7jsy0E5N4yKdE+BGgceCf/tfBVZloK0axkQK7WRms6nsEd5W9Viq7dQs\nJjrcTu7+bTN7Cng
pWOdeYHPa7RQmLiK0lU6GEhEpMF3+T0SkwJTkRUQKTEleRKTAlORFRApMSV5E\npMCU5EVECkxJXkSkwJTkRUQK7P8DCRofOgUO+ogAAAAASUVORK5CYII=\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plotPrototypes()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**The K-means steps are carried out 4 times more (actually, they should be carried out until the prototypes remain approximately in the same place)**" ] }, { "cell_type": "code", "execution_count": 114, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for i in range(4):\n", " # 2. (MAP) Assign each instance xi to its closest prototype \n", " closest_rdd = X_rdd.map(lambda x: (closestPrototype(x), (x,1)))\n", " # 3. (REDUCE) Update the location of prototypes kj as the average of the instances xi assigned to each cluster.\n", " kPrototypes = closest_rdd.reduceByKey(lambda p1, p2: (p1[0] + p2[0], p1[1] + p2[1])).collect()\n", " kPrototypes = map(lambda (k,(summation, n)): summation / n, kPrototypes)" ] }, { "cell_type": "code", "execution_count": 115, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[DenseVector([6.8023, 3.0442, 5.6488, 2.0302]),\n", " DenseVector([5.8544, 2.7421, 4.3456, 1.4088]),\n", " DenseVector([5.006, 3.418, 1.464, 0.244])]" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kPrototypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** And check the final result **" ] }, { "cell_type": "code", "execution_count": 116, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAXkAAAEACAYAAABWLgY0AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAHBpJREFUeJzt3X+QHPWZ3/H3A7uglXTgW9naJGAggcMu4nMBLkAGJ5qc\nTyuvqKwK6yom9h0r2WXAxdWCVrGhLFFsYq0PX5WwTHEJJzsHoo5LsDHiRDBa6XI3ATll2fyQDw4p\nBQfEHGdtDuk4giQqyH7yx/SKYTQ73dM9/WN6Pq+qrp0fPd3PfKfnmd6nv/1tc3dERKScTso7ABER\nSY+SvIhIiSnJi4iUmJK8iEiJKcmLiJSYkryISIlFSvJm9oqZ/dTMnjGzH88xz51m9oKZ7TWzCzsb\npoiIxNEXcb5fAhV3//tmT5rZCHCuu/+amV0G3A0s6VCMIiISU9RyjYXMuxK4D8Dd9wCnm9lQwthE\nRCShqEnegV1m9hMz+2KT588AXq27/1rwmIiI5ChqueYKd/+5mX2AWrLf5+670wxMRESSi5Tk3f3n\nwd+/M7NtwKVAfZJ/Dfhg3f0zg8few8w0UI6ISAzubnFeF1quMbP5ZrYwuL0AGAaea5htO3BNMM8S\n4A13n5kj0EJNt912W+4xdEtcikkx9UJcRYwpiSh78kPAtmAvvA+43913mtl1tZztW9z9B2a2wsxe\nBA4DaxJFJSIiHRGa5N39ZeCEfu/u/ocN93+3g3GJiEgH9PwZr5VKJe8QmipiXIopGsUUXRHjKmJM\nSVjSek9bKzPzLNcnIlIGZoandeBVRES6l5K8iEiJKcmLiJSYkryISIkpyYuIlJiSvIhIiSnJi4iU\nmJK8iEiJKcmLiJSYkryISIkpyYuIlJiSvIhIiSnJi4iUmJK8iEiJKcmLiJSYkryISIkpyYuIlJiS\nvIhIiSnJi4iUmJK8iEiJKcmLiJRY5CRvZieZ2dNmtr3Jc0vN7I3g+afNbENnwxQRkTj62pj3RuB5\n4LQ5nn/c3UeThyQiIp0SaU/ezM4EVgDfaTVbRyISEZGOiVqu+SbwZcBbzPNxM9trZo+a2QXJQxMR\nkaRCyzVmdiUw4+57zaxC8z32p4Cz3P2ImY0ADwPnN1ve5OTk8duVSoVKpdJ+1CIiJVatVqlWqx1Z\nlrm32jkHM/s68NvAMWAA+BXgIXe/psVrXgY+5u6HGh73sPWJiMh7mRnuHqskHprkG1a0FFjXeIDV\nzIbcfSa4fSnwXXc/p8nrleRFRNqUJMm307umcaXXAe7uW4DfMrMvAe8AR4HPxF2uiIh0Tlt78olX\npj35njM9Pc2mTVsAWLfuWpYvX55zRCLdJ7NyTVJK8r1lenqaq64a4+jRbwAwMHAz27ZtVaIXaZOS\nvBTS8PAqdu0aBcaCR7aybNl2du78fp5hiXSdJEleY9eIiJRY7AOvImHWrbuW3bvHOHq0dn9g4GbW\nrduab1AiPUblGkmVDryKJKeavIhIiakmLyIiTSnJi4iUmJK8iEiJKcmLiJSYkryISIkpyYuIlJiS\nvISanp5meHgVw8OrmJ6ezjscEWmD+slLSxpkTCR/OhlKUqNBxkTyp5OhRESkKQ1QJi1pkDGR7qZy\njYTSIGMi+VJNXpSIRUpMSb7HqQeMSLkpyfc49YARKTf1rhERkabUu6YE1ANGROYSuVxjZicBTwJ/\n4+6jTZ6/ExgBDgOr3X1vk3lUrkmJDryKlFcmNXkzWwt8DDitMcmb2Qjwu+5+pZldBnzL3Zc0WYaS\nvBynHyaRaFKvyZvZmcAK4DtzzLISuA/A3fcAp5vZUJyApDfM9gjatWuUXbtGueqqMQ1+JpKCqAde\nvwl8GZhrN/wM4NW6+68Fj4k0tWnTlqDL5xhQ6/45u1cvIp0TeuDVzK4EZtx9r5lVgFj/MsyanJw8\nfrtSqVCpVJIsTkSkdKrVKtVqtSPLCq3Jm9nXgd8GjgEDwK8AD
7n7NXXz3A38hbs/ENzfDyx195mG\nZakmL4BO4BJpR2YnQ5nZUmBdkwOvK4AbggOvS4DNOvAqYXTgVSSaXJK8mV0HuLtvCZ67C/gUtS6U\na9z96SavV5IXEWmThjWQRKamprjjjnsAmJhYw/r163OOSETqJUnyOuO1x01NTbFhw+8DdwKwYcM4\ngBK9SEloT77HLVp0HocO3Ur94GaDg1/j4MEX8wxLROpogDIREWlKSb7HTUysAcaBrcE0HjwmImWg\nco3owKtIwalc0yNWr15Nf/8Q/f1DrF69umPLXb9+PQcPvsjBgy92PMFPT08zPLyK4eFVGptGEkuy\nPfXstujumU211UkcY2NjDqc53BtMp/nY2FjeYbW0Y8cOHxgYOh7zwMCQ79ixI++wpEsl2Z66fVsM\ncme8vBv3hbFWpiQfW1/f4mAD9WC61/v6FucdVkvLln36hJiXLft03mFJl0qyPXX7tpgkyatcIyJS\nYjoZqkt87nMjbN06XvfIOJ/73FW5xROFLksonZRke+rlbVG9a7rI6tWruf/+x4Ba0r/33nvzDSgC\nDUImnZRke+rmbVFj14iIlJi6UPaItLqP9WzXMpFeEPeIbZwJ9a6JLa3uY93etUykF5Cgd43KNV1i\neHgVu3aNUj+Q2LJl29m58/uJXptkuSKSDZVrRESkKXWh7BJpdR/r5a5lIr1A5Zouklb3sW7uWibS\nC9SFMgV5JT4lXCkKbYvFkSTJq3dNE3n1OFFPFykKbYvFgnrXdFZePU7U00WKQttisah3jYiINKXe\nNU3k1eNEPV2kKLQtlkdoucbMTgUeB06h9qPwoLv/+4Z5lgJ/CrwUPPSQu29ssqyuKNeADryKaFss\njtR715jZfHc/YmYnAz8Ext39x3XPLwXWuftoyHK6JsnnYmaGnZ//PH/9338IwLmfvILhP/ojGBoC\nWl+LNc0vpL7sIvnKrHcNMB94Erik4fGlwCMRXt/JA87lsW+f++iov9PX54ffvXSNHwZ/p6/PfXTU\n/9ONN55w+b+NGze6e7o9IdTLQiR/pH35P2oHaJ8B3gR+r8nzS4HXgb3Ao8AFcywn9cboOk884b5w\nobvZ8eR+wmTm/xfzK/jqey5fNjh4rrune2mzbr9smkgZJEnykQ68uvsvgYvM7DTgYTO7wN2fr5vl\nKeAsr5V0RoCHgfObLWtycvL47UqlQqVSiRJCOe3fDyMj8NZbredzZyHwGJu4hN/hf/HhTMITkXxU\nq1Wq1WpnFtburwJwKzARMs/LwGCTx1P7petKo6Ot9+AbpmPgD3GRyjUiPYY0T4Yys/cD77j7P5jZ\nADAN3O7uP6ibZ8jdZ4LblwLfdfdzmizLw9bXM2Zm4Jxz4O2323rZ2xgXvu8sfufffVEHXkV6RKq9\na8zs14Gt1OryJwEPuPuUmV1H7ddli5ndAHwJeAc4Cqx19z1NlqUkP+tb34Jbbmk7yTNvHtx+O9x4\nYzpxiUjhpHrGq7s/6+4Xu/uF7v5Rd58KHv9Dd98S3P4Dd/+Iu1/k7pc3S/DdJskl8aampli06DwW\nLTqPqamp5jO98kr7CR7g7bf5/h13dfQSfpHiLRhdClEkorh1njgTXVKTT1KH3rhx45xdHd/jppsi\n1+Ibp00Md+wSfpHjLRBdClF6DWl3oezU1C1JPkm3wcHBc0947WxXx/fYvNl93ry2E/xh5vk4m98T\nUybxFkiS99vqteouKkWVJMlrgLK8XH11rJcZ8F+J91oR6UFxfx3iTHTJnnxm5Y8EXShVrlG5RnoH\nKtd03o4dO3zZsk/7smWfbvuLvnHjRh8cPNcHB89tnTD37aud7Roxyb8zMOCfv3xZ05gyibdAkrzf\nVq9NslyRtCRJ8rpoSN52766d9Xo4GLWmGTNYsIA9k5PcOv0/AfVXT0urQeDKuF7pDrr8X7fbt899\n5Uo/1t/vR+r23I+AH+vvd
1+50p/49rdVSkhZXqWrbiyZSbbQnnw5/JulV/KPH38fZ1MbWvh/M8PP\n/+UbfPd/PKrLsWVg0aLzOHToVurbeHDwaxw8+GIp1yvdI8mevK4MVSBvnDqP7zHMexL5qdvzDElE\nupySfIG0uuSaLseWvomJNWzYMF73yDgTE18p7XqlR8St88SZUE0+lHp+5Cuvnkbd2MNJsoO6UDaX\nVje7NL+QSuTvKmJbtPrs04w37o9/Wt8ByZaSfBNpnTCTZk8InYzzriK2RavPPq8x/dM6uauI7d/L\nlOSbSGt8kzTHetHYKe8qYlu0+uzzugRjWmPxFLH9e1mSJK+xa0REyizur0OcCZVrUou5bIrYFirX\n9Oa2WASoXNOcDrx2tyK2RS4HXg8c8Oevv94fPOs8f/Cs8/z56693P3Ag0np14LUckiR5nfEqUlT7\n98PNN8POnbX7s1cSmzev9nd4GL7xDfjwh/OJTzKT6uX/elVal5dbvXo1/f1D9PcPsXr16g5GLFlI\n6/KAJ2wXu3fDJZfAI4/Uknv9pSJn7z/yCFxyCVMjI9qmZG5x/wWIM9ElJ0OlVcscGxs7oaY7NjaW\n4juRTkqrTt24XXyIBX6kr8/rura0nN4E/xBf1zZVYqgm31lpdT3r61t8wnN9fYvTfCvSQWl1K2zc\nLrZxkR9r45KQxzB/iJXapkosSZJXuUakQBYzw3Ke4+Q2XnMyzqeYZjEzqcUlXSzur0OciS7Zk1e5\nRprJolwzzmf9cBt78bPTYfp9nM9qmyop0izXAKcCe4BngGeB2+aY707gBWAvcOEc86TdFh2TVtez\nsbEx7+tb7H19i/Vl7EJpdSuc3S4220DbCX52+qYNaJsqqVSTfG35zA/+ngz8CLi04fkR4NHg9mXA\nj+ZYTsfffBFHbUzSlzqPmNO8Pmxay87rPIZWwn7Ao8T8H089PXaS97Vr2465iN+fIsaUt9ST/PGZ\nYT7wJHBJw+N3A5+pu78PGGry+o6+8bTO9ksiyVmRecScZJ1hZ/+mtey8zkhuJawUFzXmuOUanzfP\nffPmtmIu4veniDEVQRZ78icF5Zo3gd9r8vwjwOV19/8MuLjJfB1942kNzpREkkGs8og5yTrDBmtL\na9l5DSDXSljPqagxL+aAH6E/XpKvOws2iiJ+f4oYUxEkSfKRrgzl7r8ELjKz04CHzewCd38+ymsb\nTU5OHr9dqVSoVCpxFiNSSv+HIab5CP+aZ6L3sDGD5cthaCjN0CRD1WqVarXamYW1+6sA3ApMNDzW\nWK7Zj8o1x/9tV7lG5Zp2Yv4QC/ztU06Jvhe/cKH7vn1tx1zE708RYyoCUu5d837g9OD2APA4sKJh\nnhW8e+B1CTrwqgOvKS+7rAdej8f8xBO15G02d3I3q83zxBOxYy7i96eIMeUtSZIPHaDMzH4d2Eqt\nLn8S8IC7T5nZdcGKtwTz3QV8CjgMrHH3p5ssy8PWJyKB/fvhlltgdoycxgHKli+H22/XAGU9IMkA\nZbF+GeJOZNxPvld/9bOU138mSfb20nptkphbOnCg1nNm7dratHlz2wdZ06DvV3bIqgtl0inLJN/L\n9bus5HWMIUndNq3XJm2rblO291N0SvJN9HJ3q6zk1SU0STe7tF6btK26TdneT9ElSfIaoExEpMzi\n/jrEmVC5plRUrlG5pizvp+hQuaY5HRhKX8cOVM4eXLzpptoUcnCxpw68FlTZ3k+RJUnykc547VbL\nly9n+fLleYfR05588kmeeuqnx2/Xfx7T09N89z9s4gsvPMdlb7zOySef/N5ugrfcMud1TFt9tq3W\nmfb72bRpCwDr1l17wnqTbI9hy85Dq/eTVrxFbIfCi/vrEGci4z15SVfYv+xhZ63+xim/6m9yqh+j\ncyf8JDkLN+n7Sat80W2lkTzKdGWHyjWSh7AeFq0GC/v85cv8TU6dO7nHPHU/yaBpSd5Pmr1
Nuq0n\nSx69qsouSZJX7xrJxRdeeI75/L/oLzh8uFa+EZH2xP11iDOhPflSiV3eOHDAj/WnM5yuyjX5U7mm\n81C5RvIS1sOi6WBhmzfXEnacJB/hwhhJBk2L9X4ivjaJbuvJksdwFmWmJC+5bfyx1nvTTe0n+Nkp\nuMRdXkmkV5OM5EtJvsfl9W9s7PUmTPJ5lQN6uVwg+VKS73FFvFRbS5s3J7qOaV69N3q5d4fkK0mS\nV+8ayd7VVxNvYOzaa0WkDXF/HeJMaE8+FV1XrnH33R/4gB9rZy/ezH3lysTrTfJ+VK6RvJBgTz70\nylCdpCtDpSev071jr3f/fo585CPM/8Uvos2/cCH85CfHhzfI67R5nVYveUhyZSglecnP7t0wMlI7\n0Wmu7cIMFiyAxx6DT3wi2/hECiJJkldNXkJNT08zPLyK4eFVTM9eb7QDr50+fJgvfPTj/PD9/4hf\n9Pe/e+1SqN2eNw9GR2t78G0k+CTxSrHps40hbp0nzoRq8l0nrTHUG587e977/fnrr098HVPVzcur\nlz9b1IVS0pLWJe80iJW0q5c/2yRJXuUaEZEyi/vrEGdCe/JdJ6tyjQaxkjC9/NmSZhdKMzsTuA8Y\nAn4JfNvd72yYZynwp8BLwUMPufvGJsvysPVJ8STpNtjqtbp6kLSrVz/bJL1rolz+7xgw4e57zWwh\n8JSZ7XT3/Q3zPe7uo3GC6CVpJcy8JIkprcszJlnu1NQUd9xxDwATE2tYv359J0ObUxE/2yLSJT1j\naHfXH3gY+GTDY0uBRyK8NoV/ZLpHWqWPNCUpuXTbv9dhY9GnpdvaSbJHVr1rgHOAV4CFDY8vBV4H\n9gKPAhfM8fq026LQ0uqpkqYkPWS6rTdE2KUD09Jt7STZS5Lko5RrAAhKNQ8CN7r7Ww1PPwWc5e5H\nzGwk2Ns/v9lyJicnj9+uVCpUKpWoIYiI9IRqtUq1Wu3MwqL8ElCr3e+gluCjzP8yMNjk8RR/64pP\n5ZpilyFUrpGiIu1yDbXeNXe0eH6o7valwCtzzJdqQ3SDJFcWKuLVn8p2JaWwSwempdvaSbKVJMlH\n6UJ5BfA48CzgwfRV4OxgxVvM7AbgS8A7wFFgrbvvabIsD1tf2XVjL4q8epyISI1GoewS09PTXHXV\nGEePfgOAgYGb2bZta6ET/dTUFBs2/D4we2rEOBs3fkWJXiRDSvJdYnh4Fbt2jQJjwSNbWbZsOzt3\nfj/PsFpatOg8Dh26lfqYBwe/xsGDL+YZlkhP0VDDIiLSVOQulJLcunXXsnv3GEeP1u4PDNzMunVb\n8w0qxMTEGjZsGK97ZJyJia/kFo+ItEflmozpwKuItEs1eRGRElNNPoaiXkasiHEVMaa09NJ7lR4R\nt4N9nImCnAxV1DMMixhXEWNKSy+9V+ku6PJ/7SnqgFBFjKuIMaWll96rdJckSb5nyzUiIr2gJ7tQ\nFrUrYxHjKmJMaeml9yq9o2d71xS1K2MR4ypiTGnppfcq3UNdKKXnJOm7r0Qu3Sbta7yKFErjoGmz\nZ+RGSfSNg8Tt3j1W+EHiRJLQnrx0nSSDpnXjIHEiOhlKRESaUrlGuk6SQdPUg0Z6jco10pV04FV6\niXrXiIiUmGryIiLSlJK8iEiJKcmLiJSYkryISIkpyYuIlFhokjezM83sz83sr8zsWTMbn2O+O83s\nBTPba2YXdj5UERFpV5Q9+WPAhLv/c+DjwA1m9uH6GcxsBDjX3X8NuA64u+OR9ghdfk5EOin0jFd3\nPwAcCG6/ZWb7gDOA/XWzrQTuC+bZY2anm9mQu8+kEHNpafAsEem0tmryZnYOcCGwp+GpM4BX6+6/\nFjwmbdi0aUuQ4MeAWrKfPTNTRCSOyGPXmNlC4EHgRnd/K+4KJycnj9+uVCpUKpW4ixIRKaVqtUq1\nWu3IsiINa2BmfcB/Ax5z9281ef5u4C/c/YHg/n5gaWO
5RsMatNZYrhkYuFnlGhFJf+waM7sPeN3d\nJ+Z4fgVwg7tfaWZLgM3uvqTJfEryITR4log0SjXJm9kVwOPAs4AH01eBswF39y3BfHcBnwIOA2vc\n/ekmy1KSFxFpk0ahFBEpMY1CKSIiTSnJi4iUmJK8iEiJKcmLiJSYkryISIkpyYuIlJiSvIhIiSnJ\ni4iUmJK8iEiJKcmLiJSYkryISIkpyYuIlJiSvIhIiSnJi4iUmJK8iEiJKcmLiJSYkryISIkpyYuI\nlJiSvIhIiSnJi4iUmJK8iEiJKcmLiJRYaJI3s/9sZjNm9pdzPL/UzN4ws6eDaUPnwxQRkTii7Mnf\nAywPmedxd784mDZ2IK7MVKvVvENoqohxKaZoFFN0RYyriDElEZrk3X038Pchs1lnwsleUT/QIsal\nmKJRTNEVMa4ixpREp2ryHzezvWb2qJld0KFliohIQn0dWMZTwFnufsTMRoCHgfM7sFwREUnI3D18\nJrOzgUfc/aMR5n0Z+Ji7H2ryXPjKRETkBO4eqywedU/emKPubmZD7j4T3L6U2g/HCQk+SZAiIhJP\naJI3sz8BKsAiM/sZcBtwCuDuvgX4LTP7EvAOcBT4THrhiohIOyKVa0REpDuldsarmZ0UnBy1fY7n\n7zSzF4JeORemFUfUmPI4qcvMXjGzn5rZM2b24znmyaOdWsaVU1udbmbfM7N9ZvZXZnZZk3kybauw\nmLJuJzM7P/jMng7+/oOZjTeZL7N2ihJTTtvTWjN7zsz+0szuN7NTmsyTx3evZVyx2srdU5mAtcAf\nA9ubPDcCPBrcvgz4UVpxtBHT0maPpxzPS8Cvtng+r3YKiyuPtroXWBPc7gNOy7utIsSUeTvVrfsk\n4G+BD+bdThFiyrSdgH8SbOOnBPcfAK7Ju50ixtV2W6WyJ29mZwIrgO/MMctK4D4Ad98DnG5mQ2nE\n0kZMkP1JXUbr/6Yyb6eIcc3OkwkzOw34F+5+D4C7H3P3Nxtmy7StIsYE+Z0o+JvAX7v7qw2P57VN\ntYoJsm+nk4EFZtYHzKf241Mvr3YKiwvabKu0yjXfBL4MzFXwPwOo/6BfCx5LU1hMkP1JXQ7sMrOf\nmNkXmzyfRztFiQuybat/CrxuZvcE/6JuMbOBhnmybqsoMUF+Jwp+BvgvTR7Pa5uCuWOCDNvJ3f8W\n2AT8jNr7f8Pd/6xhtszbKWJc0GZbdTzJm9mVwIy776VF18ssRYxp9qSuC4G7qJ3UlbYr3P1iav9h\n3GBmn8hgnVGExZV1W/UBFwN/EMR1BLgl5XWGiRJTHtsUZtYPjALfy2J9UYTElGk7mdn7qO2pn02t\nRLLQzD6b5jqjiBhX222Vxp78FcComb1E7Vf7X5nZfQ3zvAZ8sO7+mcFjaQmNyd3fcvcjwe3HgH4z\nG0wxJtz958HfvwO2AZc2zJJ1O0WKK4e2+hvgVXd/Mrj/ILUEWy/rtgqNKY9tKjACPBV8fo1y2aZa\nxZRDO/0m8JK7H3L3XwAPAZc3zJNHO4XGFaetOp7k3f2r7n6Wu/8z4Grgz939mobZtgPXAJjZEmr/\nlsx0OpZ2Yqqvt1nISV2dYGbzzWxhcHsBMAw81zBbpu0UNa6s2yp4z6+a2exwGZ8Enm+YLettKjSm\nrNupzr9l7rJI5ttUWEw5tNPPgCVmNs/MjNpnt69hnjzaKTSuOG3VibFrIjGz6whOoHL3H5jZCjN7\nETgMrMkqjrliIvuTuoaAbVYb6qEPuN/ddxagnULjIp8T4MaB+4N/+18C1hSgrVrGRA7tZGbzqe0R\nXlv3WK7tFBYTGbeTu//YzB4EngnW+TSwJe92ihIXMdpKJ0OJiJSYLv8nIlJiSvIiIiWmJC8iUmJK\n8iIiJaYkLyJSYkryIiIlpiQvIlJiSvIiIiX2/wEIfQGVzLGcCwAAAABJRU5ErkJggg==\n", "text/plain": [ "" 
] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plotPrototypes()" ] }, { "cell_type": "code", "execution_count": 117, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Finally, stop the Spark context to release its resources\n", "sc.stop()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 0 }