{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Notebook for the feature extraction Python programming assignment (2.0 points)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Directory **20news-18828** contains a list of subdirectories. Each one is a newsgroup. Each newsgroup contains a list of files. If you have uncompressed the data file in the current directory, the next cell should run without errors. The paths use Windows-style backslashes, so on a Mac you may need to adapt them." ] }, { "cell_type": "code", "execution_count": 214, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The current directory is: C:\\Users\\Aler\\Google Drive\\TRABAJO LOCAL ALLWAYSSYNC\\DOCENCIA\\MASTER DATA SCIENCE\\ADVANCED PROGRAMMING\\practicas\\assignment feature extraction\n", "\n", "The subdirectories / newsgroups are: ['alt.atheism', 'comp.graphics']\n", "\n", "The first two files of alt.atheism are: \n", "20news-18828\\alt.atheism\\49960\n", "20news-18828\\alt.atheism\\51060\n" ] } ], "source": [ "import os\n", "print('The current directory is: ' + os.getcwd())\n", "prefix = \"20news-18828\\\\\" \n", "directories = os.listdir(prefix)\n", "files = os.listdir(prefix+directories[0])\n", "\n", "print()\n", "print('The subdirectories / newsgroups are:', directories)\n", "\n", "print()\n", "print(\"The first two files of alt.atheism are: \")\n", "print(prefix + directories[0] + \"\\\\\" + files[0])\n", "print(prefix + directories[0] + \"\\\\\" + files[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next cell defines the *getMessage()* function. It opens a file that contains a message, reads the message, closes the file, and returns the message as a string." 
] }, { "cell_type": "code", "execution_count": 215, "metadata": {}, "outputs": [], "source": [ "def getMessage(fileName):\n", " # Now, we open the file for reading\n", " myFile = open(fileName, \"r\")\n", " # read the complete message.\n", " message = myFile.read()\n", " # And close the file\n", " myFile.close() \n", " return(message)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we test that the function is working properly. " ] }, { "cell_type": "code", "execution_count": 216, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I'm going to read the following file: 20news-18828\\alt.atheism\\49960\n", "-----------------------------------------------------------\n", "From: mathew \n", "Subject: Alt.Atheism FAQ: Atheist Resources\n", "\n", "Archive-name: atheism/resources\n", "Alt-atheism-archive-name: resources\n", "Last-modified: 11 December 1992\n", "Version: 1.0\n", "\n", " Atheist Resources\n", "\n", " Addresses of Atheist Organizations\n", "\n", " USA\n", "\n", "FREEDOM FROM RELIGION FOUNDATION\n", "\n", "Darwin fish bumper stickers and assorted other atheist paraphernalia are\n", "available from the Freedom From Religion Foundation in the US.\n", "\n", "Write to: FFRF, P.O. Box 750, Madison, WI 53701.\n", "Telephone: (608) 256-8900\n", "\n", "EVOLUTION DESIGNS\n", "\n", "Evolution Designs sell the \"Darwin fish\". It's a fish symbol, like the ones\n", "Christians stick on their cars, but with feet and the word \"Darwin\" written\n", "inside. The deluxe moulded 3D plastic fish is $4.95 postpaid in the US.\n", "\n", "Write to: Evolution Designs, 7119 Laurel Canyon #4, North Hollywood,\n", " CA 91605.\n", "\n", "People in the San Francisco Bay area can get Darwin Fish from Lynn Gold --\n", "try mailing . 
For net people who go to Lynn directly, the\n", "price is $4.95 per fish.\n", "\n", "AMERICAN ATHEIST PRESS\n", "\n", "AAP publish various atheist books -- critiques of the Bible, lists of\n", "Biblical contradictions, and so on. One such book is:\n", "\n", "\"The Bible Handbook\" by W.P. Ball and G.W. Foote. American Atheist Press.\n", "372 pp. ISBN 0-910309-26-4, 2nd edition, 1986. Bible contradictions,\n", "absurdities, atrocities, immoralities... contains Ball, Foote: \"The Bible\n", "Contradicts Itself\", AAP. Based on the King James version of the Bible.\n", "\n", "Write to: American Atheist Press, P.O. Box 140195, Austin, TX 78714-0195.\n", " or: 7215 Cameron Road, Austin, TX 78752-2973.\n", "Telephone: (512) 458-1244\n", "Fax: (512) 467-9525\n", "\n", "PROMETHEUS BOOKS\n", "\n", "Sell books including Haught's \"Holy Horrors\" (see below).\n", "\n", "Write to: 700 East Amherst Street, Buffalo, New York 14215.\n", "Telephone: (716) 837-2475.\n", "\n", "An alternate address (which may be newer or older) is:\n", "Prometheus Books, 59 Glenn Drive, Buffalo, NY 14228-2197.\n", "\n", "AFRICAN-AMERICANS FOR HUMANISM\n", "\n", "An organization promoting black secular humanism and uncovering the history of\n", "black freethought. They publish a quarterly newsletter, AAH EXAMINER.\n", "\n", "Write to: Norm R. Allen, Jr., African Americans for Humanism, P.O. 
Box 664,\n", " Buffalo, NY 14226.\n", "\n", " United Kingdom\n", "\n", "Rationalist Press Association National Secular Society\n", "88 Islington High Street 702 Holloway Road\n", "London N1 8EW London N19 3NL\n", "071 226 7251 071 272 1266\n", "\n", "British Humanist Association South Place Ethical Society\n", "14 Lamb's Conduit Passage Conway Hall\n", "London WC1R 4RH Red Lion Square\n", "071 430 0908 London WC1R 4RL\n", "fax 071 430 1271 071 831 7723\n", "\n", "The National Secular Society publish \"The Freethinker\", a monthly magazine\n", "founded in 1881.\n", "\n", " Germany\n", "\n", "IBKA e.V.\n", "Internationaler Bund der Konfessionslosen und Atheisten\n", "Postfach 880, D-1000 Berlin 41. Germany.\n", "\n", "IBKA publish a journal:\n", "MIZ. (Materialien und Informationen zur Zeit. Politisches\n", "Journal der Konfessionslosesn und Atheisten. Hrsg. IBKA e.V.)\n", "MIZ-Vertrieb, Postfach 880, D-1000 Berlin 41. Germany.\n", "\n", "For atheist books, write to:\n", "\n", "IBDK, Internationaler B\"ucherdienst der Konfessionslosen\n", "Postfach 3005, D-3000 Hannover 1. Germany.\n", "Telephone: 0511/211216\n", "\n", "\n", " Books -- Fiction\n", "\n", "THOMAS M. DISCH\n", "\n", "\"The Santa Claus Compromise\"\n", "Short story. The ultimate proof that Santa exists. All characters and \n", "events are fictitious. Any similarity to living or dead gods -- uh, well...\n", "\n", "WALTER M. MILLER, JR\n", "\n", "\"A Canticle for Leibowitz\"\n", "One gem in this post atomic doomsday novel is the monks who spent their lives\n", "copying blueprints from \"Saint Leibowitz\", filling the sheets of paper with\n", "ink and leaving white lines and letters.\n", "\n", "EDGAR PANGBORN\n", "\n", "\"Davy\"\n", "Post atomic doomsday novel set in clerical states. The church, for example,\n", "forbids that anyone \"produce, describe or use any substance containing...\n", "atoms\". \n", "\n", "PHILIP K. DICK\n", "\n", "Philip K. 
Dick Dick wrote many philosophical and thought-provoking short \n", "stories and novels. His stories are bizarre at times, but very approachable.\n", "He wrote mainly SF, but he wrote about people, truth and religion rather than\n", "technology. Although he often believed that he had met some sort of God, he\n", "remained sceptical. Amongst his novels, the following are of some relevance:\n", "\n", "\"Galactic Pot-Healer\"\n", "A fallible alien deity summons a group of Earth craftsmen and women to a\n", "remote planet to raise a giant cathedral from beneath the oceans. When the\n", "deity begins to demand faith from the earthers, pot-healer Joe Fernwright is\n", "unable to comply. A polished, ironic and amusing novel.\n", "\n", "\"A Maze of Death\"\n", "Noteworthy for its description of a technology-based religion.\n", "\n", "\"VALIS\"\n", "The schizophrenic hero searches for the hidden mysteries of Gnostic\n", "Christianity after reality is fired into his brain by a pink laser beam of\n", "unknown but possibly divine origin. He is accompanied by his dogmatic and\n", "dismissively atheist friend and assorted other odd characters.\n", "\n", "\"The Divine Invasion\"\n", "God invades Earth by making a young woman pregnant as she returns from\n", "another star system. Unfortunately she is terminally ill, and must be\n", "assisted by a dead man whose brain is wired to 24-hour easy listening music.\n", "\n", "MARGARET ATWOOD\n", "\n", "\"The Handmaid's Tale\"\n", "A story based on the premise that the US Congress is mysteriously\n", "assassinated, and fundamentalists quickly take charge of the nation to set it\n", "\"right\" again. The book is the diary of a woman's life as she tries to live\n", "under the new Christian theocracy. Women's right to own property is revoked,\n", "and their bank accounts are closed; sinful luxuries are outlawed, and the\n", "radio is only used for readings from the Bible. 
Crimes are punished\n", "retroactively: doctors who performed legal abortions in the \"old world\" are\n", "hunted down and hanged. Atwood's writing style is difficult to get used to\n", "at first, but the tale grows more and more chilling as it goes on.\n", "\n", "VARIOUS AUTHORS\n", "\n", "\"The Bible\"\n", "This somewhat dull and rambling work has often been criticized. However, it\n", "is probably worth reading, if only so that you'll know what all the fuss is\n", "about. It exists in many different versions, so make sure you get the one\n", "true version.\n", "\n", " Books -- Non-fiction\n", "\n", "PETER DE ROSA\n", "\n", "\"Vicars of Christ\", Bantam Press, 1988\n", "Although de Rosa seems to be Christian or even Catholic this is a very\n", "enlighting history of papal immoralities, adulteries, fallacies etc.\n", "(German translation: \"Gottes erste Diener. Die dunkle Seite des Papsttums\",\n", "Droemer-Knaur, 1989)\n", "\n", "MICHAEL MARTIN\n", "\n", "\"Atheism: A Philosophical Justification\", Temple University Press,\n", " Philadelphia, USA.\n", "A detailed and scholarly justification of atheism. Contains an outstanding\n", "appendix defining terminology and usage in this (necessarily) tendentious\n", "area. Argues both for \"negative atheism\" (i.e. the \"non-belief in the\n", "existence of god(s)\") and also for \"positive atheism\" (\"the belief in the\n", "non-existence of god(s)\"). Includes great refutations of the most\n", "challenging arguments for god; particular attention is paid to refuting\n", "contempory theists such as Platinga and Swinburne.\n", "541 pages. ISBN 0-87722-642-3 (hardcover; paperback also available)\n", "\n", "\"The Case Against Christianity\", Temple University Press\n", "A comprehensive critique of Christianity, in which he considers\n", "the best contemporary defences of Christianity and (ultimately)\n", "demonstrates that they are unsupportable and/or incoherent.\n", "273 pages. 
ISBN 0-87722-767-5\n", "\n", "JAMES TURNER\n", "\n", "\"Without God, Without Creed\", The Johns Hopkins University Press, Baltimore,\n", " MD, USA\n", "Subtitled \"The Origins of Unbelief in America\". Examines the way in which\n", "unbelief (whether agnostic or atheistic) became a mainstream alternative\n", "world-view. Focusses on the period 1770-1900, and while considering France\n", "and Britain the emphasis is on American, and particularly New England\n", "developments. \"Neither a religious history of secularization or atheism,\n", "Without God, Without Creed is, rather, the intellectual history of the fate\n", "of a single idea, the belief that God exists.\" \n", "316 pages. ISBN (hardcover) 0-8018-2494-X (paper) 0-8018-3407-4\n", "\n", "GEORGE SELDES (Editor)\n", "\n", "\"The great thoughts\", Ballantine Books, New York, USA\n", "A \"dictionary of quotations\" of a different kind, concentrating on statements\n", "and writings which, explicitly or implicitly, present the person's philosophy\n", "and world-view. Includes obscure (and often suppressed) opinions from many\n", "people. For some popular observations, traces the way in which various\n", "people expressed and twisted the idea over the centuries. Quite a number of\n", "the quotations are derived from Cardiff's \"What Great Men Think of Religion\"\n", "and Noyes' \"Views of Religion\".\n", "490 pages. ISBN (paper) 0-345-29887-X.\n", "\n", "RICHARD SWINBURNE\n", "\n", "\"The Existence of God (Revised Edition)\", Clarendon Paperbacks, Oxford\n", "This book is the second volume in a trilogy that began with \"The Coherence of\n", "Theism\" (1977) and was concluded with \"Faith and Reason\" (1981). In this\n", "work, Swinburne attempts to construct a series of inductive arguments for the\n", "existence of God. 
His arguments, which are somewhat tendentious and rely\n", "upon the imputation of late 20th century western Christian values and\n", "aesthetics to a God which is supposedly as simple as can be conceived, were\n", "decisively rejected in Mackie's \"The Miracle of Theism\". In the revised\n", "edition of \"The Existence of God\", Swinburne includes an Appendix in which he\n", "makes a somewhat incoherent attempt to rebut Mackie.\n", "\n", "J. L. MACKIE\n", "\n", "\"The Miracle of Theism\", Oxford\n", "This (posthumous) volume contains a comprehensive review of the principal\n", "arguments for and against the existence of God. It ranges from the classical\n", "philosophical positions of Descartes, Anselm, Berkeley, Hume et al, through\n", "the moral arguments of Newman, Kant and Sidgwick, to the recent restatements\n", "of the classical theses by Plantinga and Swinburne. It also addresses those\n", "positions which push the concept of God beyond the realm of the rational,\n", "such as those of Kierkegaard, Kung and Philips, as well as \"replacements for\n", "God\" such as Lelie's axiarchism. The book is a delight to read - less\n", "formalistic and better written than Martin's works, and refreshingly direct\n", "when compared with the hand-waving of Swinburne.\n", "\n", "JAMES A. HAUGHT\n", "\n", "\"Holy Horrors: An Illustrated History of Religious Murder and Madness\",\n", " Prometheus Books\n", "Looks at religious persecution from ancient times to the present day -- and\n", "not only by Christians.\n", "Library of Congress Catalog Card Number 89-64079. 1990.\n", "\n", "NORM R. ALLEN, JR.\n", "\n", "\"African American Humanism: an Anthology\"\n", "See the listing for African Americans for Humanism above.\n", "\n", "GORDON STEIN\n", "\n", "\"An Anthology of Atheism and Rationalism\", Prometheus Books\n", "An anthology covering a wide range of subjects, including 'The Devil, Evil\n", "and Morality' and 'The History of Freethought'. 
Comprehensive bibliography.\n", "\n", "EDMUND D. COHEN\n", "\n", "\"The Mind of The Bible-Believer\", Prometheus Books\n", "A study of why people become Christian fundamentalists, and what effect it\n", "has on them.\n", "\n", " Net Resources\n", "\n", "There's a small mail-based archive server at mantis.co.uk which carries\n", "archives of old alt.atheism.moderated articles and assorted other files. For\n", "more information, send mail to archive-server@mantis.co.uk saying\n", "\n", " help\n", " send atheism/index\n", "\n", "and it will mail back a reply.\n", "\n", "\n", "mathew\n", "ÿ\n", "\n" ] } ], "source": [ "my_file = prefix + directories[0] + \"\\\\\" + files[0]\n", "print(\"I'm going to read the following file: \", my_file)\n", "print('-----------------------------------------------------------')\n", "message = getMessage(my_file)\n", "print(message)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Point 1.** Grade: 0.2 points.\n", "\n", "Write a function (the test cell calls it *returnWords*) that takes a message (a string) as input and returns a list of the words in it. The function should:\n", "\n", "- Convert the message to lowercase\n", "- Convert '.', ',', '\\n', '\\t' to ' ', so that every separator becomes a blank space. 
You may also replace other separators in addition to these.\n", "- Split the message into words\n", "- Remove any empty words ('')\n", "- Return the list of words" ] }, { "cell_type": "code", "execution_count": 217, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 218, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['in', 'a', 'hole', 'in', 'the', 'ground', 'there', 'lived', 'a', 'hobbit']\n" ] } ], "source": [ "# If your function works properly, the following should work\n", "\n", "message = 'In a hole in the ground, there lived a Hobbit'\n", "print(returnWords(message))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Point 2.** Grade: 0.30 points.\n", "\n", "Define a function (the test cell calls it *createDictionary*) that takes a list of words and returns a dictionary with the number of times each word appears. Use a *defaultdict*." ] }, { "cell_type": "code", "execution_count": 219, "metadata": {}, "outputs": [], "source": [ "from collections import defaultdict\n", "\n" ] }, { "cell_type": "code", "execution_count": 220, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('in', 2), ('a', 2), ('hole', 1), ('the', 1), ('ground', 1), ('there', 1), ('lived', 1), ('hobbit', 1)]\n" ] } ], "source": [ "# If your function works properly, the following should work\n", "\n", "words = ['in', 'a', 'hole', 'in', 'the', 'ground', 'there', 'lived', 'a', 'hobbit']\n", "dictio = createDictionary(words)\n", "print(list(dictio.items()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Point 3.** Grade: 0.4 points.\n", "\n", "Write a function (the test cell calls it *computeRatios*) that takes the previous dictionary and computes the ratio for each word: its count divided by the total number of words. For example, a word that appears 2 times in a 10-word message has a ratio of 0.2." 
] }, { "cell_type": "code", "execution_count": 221, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 222, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('in', 0.2), ('a', 0.2), ('hole', 0.1), ('the', 0.1), ('ground', 0.1), ('there', 0.1), ('lived', 0.1), ('hobbit', 0.1)]\n" ] } ], "source": [ "# If your function is correct, you should get what follows\n", "\n", "words = ['in', 'a', 'hole', 'in', 'the', 'ground', 'there', 'lived', 'a', 'hobbit']\n", "dictio = createDictionary(words)\n", "dictio = computeRatios(dictio)\n", "print(list(dictio.items()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Point 4.** Grade: 0.5 points.\n", "\n", "The code to write next is a bit harder. You have to loop through all subdirectories in \"20news-18828\\\\\". In fact, there are only two subdirectories: alt.atheism and comp.graphics. For each subdirectory, you have to loop through the **first 10 files**. We read only 10 files in each subdirectory to save time; in a real situation, we would read all the files. You will therefore need two nested loops: one over subdirectories and one over files. 
\n", "\n", "For each file, you have to:\n", "\n", "- read the message using the *getMessage* function\n", "- obtain the words in the message using the function you defined in Point 1\n", "- obtain the counts of the words of the message using the function in Point 2\n", "- compute the ratio of each word using the function in Point 3\n", "- once you have the word ratios for a message, append them to the *messages* list and append the name of its newsgroup to the *classes* list" ] }, { "cell_type": "code", "execution_count": 223, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['alt.atheism', 'comp.graphics']\n" ] } ], "source": [ "# Prefix is a variable that contains the directory where each newsgroup is located\n", "# Notice that a double \\\\ has to be used when \\ is desired within a string\n", "prefix = \"20news-18828\\\\\" \n", "directories = os.listdir(prefix)\n", "print(directories)\n", "\n", "# The messages list will contain one element per message. \n", "# Each element contains the word ratios for that message.\n", "messages = []\n", "\n", "# The classes list contains the class of each message\n", "# In this example, there are only two classes: alt.atheism and comp.graphics\n", "classes = []\n", "\n", "# Please write your code below\n", "\n", "# YOUR CODE HERE\n", "\n", " " ] }, { "cell_type": "code", "execution_count": 224, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The first four messages are:\n", "[ [ ('from:', 0.00039603960396039607),\n", " ('mathew', 0.0007920792079207921),\n", " ('