{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# First assignment: feature extraction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Directory **20news-18828** contains a list of subdirectories. Each one is a newsgroup. We can check the names of these subdirectories by using *os.listdir*. We can see there are 20 newsgroups and their names:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os\n", "\n", "# Prefix is a variable that contains the directory where each newsgroup is located\n", "# Notice that a double \\\\ has to be used when \\ is desired within a string\n", "prefix = \"20news-18828\\\\\" \n", "directories = os.listdir(prefix)\n", "print(len(directories))\n", "print(directories)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Inside each subdirectory, there are many files. Each file constains a message that was posted long time ago in that newsgroup. Let's see the names of the files that contain the first 5 messages, by using again *os.listdir*:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fileNames = os.listdir(prefix+\"alt.atheism\")\n", "print(fileNames[:])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's see the contents of the first of those messages. This can be done by reading the first file and displaying it, like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fileName = prefix+\"alt.atheism\\\\\"+\"49960\"\n", "print \"The name of the file that contains the first message is: {}\".format(fileName)\n", "print \"================================\"\n", "\n", "# Now, we open the file for reading\n", "myFile = open(fileName, \"r\")\n", "# and it's read whole.\n", "firstMessage = myFile.read()\n", "myFile.close() # The file is no longer needed, so it is closed\n", "print(firstMessage)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now it's your turn: read and print the **second** message of newsgroup *comp.graphics* by filling in the gaps." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "fileNames = os.listdir(prefix+)\n", "fileName = prefix+\"comp.graphics\\\\\"+\n", "print \"The name of the file that contains the first message is: {}\".format(fileName)\n", "print \"================================\"\n", "\n", "# Now, we open the file for reading\n", "myFile = open(fileName, \"r\")\n", "# and it's read whole.\n", "firstMessage = myFile.read()\n", "myFile.close() # The file is no longer needed, so it is closed\n", "print(firstMessage)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we are going to read the first two messages from newsgroup *alt.atheism* and the first two messages from *comp.graphics*. But first, let's define a function that reads messages, in order to save ourselves some typing." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def readMessage(fileName):\n", " # Now, we open the file for reading\n", " myFile = open(fileName, \"r\")\n", " # and it's read whole.\n", " message = myFile.read()\n", " myFile.close() # The file is no longer needed, so it is closed \n", " return(message)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's use function *readMessage* to read two messages from *alt.atheism*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "group = \"alt.atheism\\\\\"\n", "# Files contains the list of files (messages) of newsgroup alt.atheism\n", "files = os.listdir(prefix+group)\n", "\n", "# Let's read the first two messages\n", "\n", "message1_aa = readMessage(prefix+group+files[])\n", "message2_aa = readMessage(prefix+group+files[1])\n", "\n", "# Let's put them into a list:\n", "\n", "messages_aa = [message1_aa, message2_aa]\n", "\n", "print \"We have read {} messages\".format(len(messages_aa))\n", "print \"==================================================\"\n", "print \"The first message starts with these words:\\n{}\".format(messages_aa[0][0:200])\n", "print \"==================================================\"\n", "print \"And the second message starts with these words:\\n{}\".format(messages_aa[1][0:200])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, read a third message and *append* it into list *messages_aa*. Use the *apend* function" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "message3_aa = readMessage(prefix+group+files[])\n", "messages_aa.(message3_aa)\n", "print(len(messages_aa))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can do even better by using a loop to read the n first messages" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "group = \"alt.atheism\\\\\"\n", "files = os.listdir(prefix+group)\n", "\n", "messages_aa = [] # Initially, we have no read any message (empty list)\n", "# Let's read 5 messages (but it could be any number)\n", "for myFile in files[:5]:\n", " messages_aa.append(readMessage(prefix+group+))\n", "print \"We have read {} messages\".format(len(messages_aa)) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, use the previous code in order to read 4 messages from newsgroup *comp.graphics*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "group = \"comp.graphics\\\\\"\n", "files = os.listdir(prefix+group)\n", "\n", " = [] # Initially, we have no read any message (empty list)\n", "# Let's read 5 messages (but it could be any number)\n", "for myFile in files[:]:\n", " messages_cg.append(readMessage(prefix+group+))\n", "print \"We have read {} messages from {}\".format(len(messages_cg), group) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next step: let's decompose each message into words. We can do this by means of *split*. Let's see an example. Also, there are a couple of problems. What happens with words **remember** and **racing**?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "myMessage = \"Somewhere in la Mancha, in a place whose name I do not care to remember, a gentleman lived not long ago, one of those who has a lance and ancient shield on a shelf and keeps a skinny nag and a greyhound for racing.\" \n", "print(myMessage)\n", "print(\"\")\n", "words = myMessage.split()\n", "print \"The list of words is:\\n{}\".format(words)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A way of fixing the problem of separators like \",\" and \".\" is to substitute them by spaces (\" \"), so that there is only one delimiter (space). We can use *replace* for this. Try to understand the following code. Do we get the correct list of words now (i.e. do **remember** and **racing** look right now?)?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "myMessage = \"Somewhere in la Mancha, in a place whose name I do not care to remember, a gentleman lived not long ago, one of those who has a lance and ancient shield on a shelf and keeps a skinny nag and a greyhound for racing.\" \n", "print(myMessage)\n", "print(\"\")\n", "myMessage = myMessage.replace(\",\", \" \")\n", "myMessage = myMessage.replace(\".\", \" \")\n", "\n", "words = myMessage.split()\n", "print \"The list of words is:\\n{}\".format(words)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we could do the same for the three messages of *alt.atheism* which are contained in list *messages_aa*. Try to understand this code." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# First, let's make a copy of the original list \"messages_aa\", \n", "# so that messages_aa is not modified and we can use it later unchanged\n", "\n", "import copy\n", "messages_aa_copy = copy.deepcopy(messages_aa)\n", "\n", "# Now, we will modify messages_aa_copy and remove , and . from them\n", "# Let's remove \",\" and \".\" from the first message\n", "messages_aa_copy[0] = messages_aa_copy[0].replace(\",\", )\n", "messages_aa_copy[0] = messages_aa_copy[0].replace(, \" \")\n", "messages_aa_copy[0] = messages_aa_copy[0].split()\n", "\n", "# Same for the second message\n", "messages_aa_copy[1] = messages_aa_copy[1].replace(, \" \")\n", "messages_aa_copy[1] = messages_aa_copy[1].replace(\".\", )\n", "messages_aa_copy[1] = messages_aa_copy[1].split()\n", "\n", "# Same for the third message\n", "messages_aa_copy[2] = messages_aa_copy[2].replace(, )\n", "messages_aa_copy[2] = messages_aa_copy[2].replace(, )\n", "messages_aa_copy[2] = messages_aa_copy[2].split()\n", "\n", "\n", "print \"==================================================\"\n", "print \"We can check what happenned to the first 20 words of the first message:\"\n", "print(messages_aa_copy[0][0:50]) \n", "print \"==================================================\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But using the above code for every message is rather tedious, specially if there are many messages. Please, try to improve the above code by using a **for** loop. Reminder: range(3) == [0, 1, 2]. Please, use variable **messages_aa** that contains a list with the three messages." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# len(messages_aa) is the number of elements in the list (i.e. the number of messages, \n", "# 3 in this case)\n", "for messageNumber in (len(messages_aa)):\n", " messages_aa[messageNumber] = messages_aa[messageNumber].replace(\",\", \" \")\n", " messages_aa[messageNumber] = messages_aa[messageNumber].replace(\".\", \" \")\n", " messages_aa[messageNumber] = messages_aa[messageNumber].\n", " \n", " \n", "print \"==================================================\"\n", "print \"We can check what happened to the first 50 words of the first message:\"\n", "print(messages_aa[0][0:50]) \n", "print \"==================================================\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we are going to compute the frequency of each word for every message. We will first count the words, and then divide by the total number of words in order to get the frequency. In order to save you some work, the function *createDictionary* is provided to you." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def createDictionary(words):\n", " d = {}\n", " for word in words:\n", " if word in d:\n", " d[word] = d[word] + 1\n", " else:\n", " d[word] = 1\n", " return(d)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see how *createDictionary* works. Please, notice that the dictionary is a list of pairs (key, count), where count is the number of times each word (key) appears in the sentence." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "myMessage = \"Somewhere in la Mancha, in a place whose name I do not care to remember, a gentleman lived not long ago, one of those who has a lance and ancient shield on a shelf and keeps a skinny nag and a greyhound for racing.\" \n", "print(myMessage)\n", "print(\"\")\n", "myMessage = myMessage.replace(\",\", \" \")\n", "myMessage = myMessage.replace(\".\", \" \")\n", "words = myMessage.split()\n", "\n", "# Here, we create a dictionary with the count of words\n", "myDictionary = createDictionary(words)\n", "print(myDictionary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to compute the frequency, we need to divide each count by the total number of words in the message. Reminder: function *len* computes the number of elements in a list. We will multiply by 100 in order to compute percentages. Something like this: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for in myDictionary:\n", " myDictionary[word] = 100*myDictionary[word]/\n", " \n", "print(myDictionary) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, compute a similar dictionary, but for the **first** message of messages_aa." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What follows is a solution with a loop, in order to process the three messages of messages_aa. Just try to understand it, and you are done." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "list_of_dictionaries = [] # Initially, an empty list\n", "for messageNumber in range(len(messages_aa)):\n", " words = messages_aa[messageNumber]\n", " myDictionary = createDictionary(words)\n", " for word in myDictionary:\n", " myDictionary[word] = 100*myDictionary[word]/len(words) \n", " list_of_dictionaries.append(myDictionary)\n", " \n", "print(list_of_dictionaries)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }