
Module 4 Project

Authors: Luis Sánchez Fernández, Carlos Delgado Kloos, Vicente Luque Centeno, José Jesús García Rueda, Norberto Fernández García
Semantic Annotation lab

Module 4: Semantic Annotation


Goals

 

In this Lab we are going to exercise several concepts that we have studied in the lectures on semantic annotation. The Lab consists of the development of an instance recognition module, and therefore deals with two important annotation types: named entities (input) and instances (output). You will have to use available software components and develop new ones. You are going to use available NLP tools (GATE) to extract linguistic information from a text in English, as well as a number of named entities. You have to develop a tool that performs instance recognition over the set of named entities extracted by GATE. Although a basic system architecture and implementation is provided, you are free to enhance it with any idea or resource you think may be helpful in your task: algorithms, lexicons and other data sources, Web search engines, ... You will probably have to deal with several concepts studied in the lectures: context definition, context similarity algorithms, ...

The system could be used together with the ontology that we extended in the previous module to annotate news items in our newspaper. This would be the basis for the development of advanced search components over the newspaper news archive using the technologies that we will study in the next module.

Introduction

 

A piece of software is provided to you. This software implements a simple automatic semantic annotation system that takes as input a text file in English, looks for named entities (persons, locations and organisations) in the text, and disambiguates the entities by associating an identifier with each of them.

The inner architecture of the semantic annotation system consists of the following components:

  1. Text extractor: reads the text from a file, returning a String with the contents of the file.
  2. Entity finder: processes a String using NLP techniques and returns a vector of entities, each of them consisting of a piece of text (e.g. Rome) and a type (e.g. Location).
  3. Instance finder: looks for instances that are potential candidates to be represented by a certain entity occurrence. For instance, given the entity (Rome, Location), a hypothetical ontology could contain candidate instances representing the city of Rome in Georgia, USA, and Rome, the capital of Italy. These two instances would be adequate candidates for the entity, because both of them refer to a location named Rome.
  4. Instance ranker: as a certain entity can have several candidate instances, a ranking process is needed in order to decide which one best represents the meaning of the entity. This process is carried out by the instance ranker module.
  5. Evaluator: compares the annotations produced by the automatic system with annotations provided by a human expert, trying to evaluate the quality of the automatic approach. It takes advantage of the functionality provided by the Annotation Reader, which is responsible for reading the annotations provided by the human user from a file.

These components are connected in a pipeline, so that the output of each component is taken as input by the next one. The final architecture of the system is depicted in the following figure.

[Figure: Annotation system architecture]

In practice, each of the components is defined by a Java interface. This design decouples the concrete implementation used for a certain element in the pipeline from the rest of the system. In the source code of the annotation application there is a set of modules, represented by Java packages, in one-to-one correspondence with the components of the application. Within each package, the interface that defines the corresponding component and one or several implementations of that interface are provided.

In the next subsections we describe the contents of the different application components/packages.

Text extraction (package gimi.annot.extraction)

The text extraction component is defined by the Java interface gimi.annot.extraction.TextExtractorService:

    public interface TextExtractorService {
        public String extractText(File f) throws Exception;
    }

The interface defines a single method extractText, which receives as input a java.io.File object representing a concrete text file in the filesystem and returns a String containing the text inside the file. An exception may be thrown if a problem arises when the file is accessed.

An implementation of the TextExtractorService is already provided. It is available in the class gimi.annot.extraction.TextFileTextExtractorImpl. It simply opens a text file, reads its contents, appends the contents to a buffer and finally returns the String obtained from the buffer.
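For illustration, a minimal implementation along those lines might look as follows. This is a sketch; the actual source of TextFileTextExtractorImpl may differ in its details:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;

    // Minimal sketch of a text extractor in the spirit of
    // TextFileTextExtractorImpl; not the literal provided source.
    public class SimpleTextExtractorImpl implements TextExtractorService {
        public String extractText(File f) throws Exception {
            StringBuilder buffer = new StringBuilder();
            BufferedReader reader = new BufferedReader(new FileReader(f));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    buffer.append(line).append('\n'); // accumulate contents
                }
            } finally {
                reader.close();
            }
            return buffer.toString();
        }
    }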

Entity finder (package gimi.annot.entfind)

The entity finder component is defined by the Java interface gimi.annot.entfind.EntityFinderService:

    public interface EntityFinderService {
        public Vector<Entity> findEntities(String text) throws Exception;
    }

The method defined by the interface, findEntities, receives as input a String with the text to be analyzed and returns a java.util.Vector of objects of class gimi.annot.util.Entity. This class encapsulates the information regarding a concrete entity: its text (e.g. Rome) and its type (e.g. Location). An exception may be thrown if a problem arises at analysis time.

A concrete implementation of the EntityFinderService is already provided. It is coded in the class gimi.annot.entfind.GATEEntityFinderImpl. This implementation relies on the natural language processing environment provided by GATE 4.0 to process the input text and detect entity occurrences in it. By default GATE is trained and configured to process texts in English, so we have to take this restriction into account when selecting the documents to be analyzed by the application.
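As an illustration, a caller could use the service as follows. This is a hedged sketch: the Entity getter names (getText, getType) are assumptions based on the description above, and finder stands for any EntityFinderService implementation:

    // Hedged usage sketch; the Entity getter names are assumed.
    Vector<Entity> entities = finder.findEntities(text);
    for (Entity e : entities) {
        System.out.println(e.getText() + " (" + e.getType() + ")");
    }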

Instance finder (package gimi.annot.insfind)

The instance finder component is defined by the Java interface gimi.annot.insfind.InstanceFinderService:

    public interface InstanceFinderService {   
        public Vector<Instance> findCandidates(Entity e, Context ctx) throws Exception;
    }

A single method is defined by the interface, findCandidates. This method receives two input parameters. The first one is an object of class gimi.annot.util.Entity, which represents the entity whose candidate instances we are looking for. The second one is an object of class gimi.annot.util.Context, which provides context information that can be used to find the candidate instances for the entity. The context information provided by the system to the component currently consists of the absolute path of the input file being annotated (inputFilePath) and the text of that file as provided by the text extraction component (inputText). Proper getters and setters are defined in the Context class to access these data.

As a result of its execution, the component produces a java.util.Vector of objects of class gimi.annot.util.Instance. Each object of the class Instance encapsulates the information about a concrete instance in a certain ontology: its identifier (usually a URI), its label and a text snippet describing the instance.

The class gimi.annot.insfind.GoogleWikipediaInstanceFinderImpl provides an implementation of the InstanceFinderService interface. In this lab, instead of using URIs from an ontology as the annotation vocabulary, we rely on Wikipedia URLs as identifiers. The GoogleWikipediaInstanceFinderImpl uses the APIs provided by Google to look for candidate Wikipedia articles to be associated with a certain entity. For instance, when looking for candidates for the entity (Rome, Location), the instance finder implementation sends Google the query Rome site:en.wikipedia.org, obtaining a list of articles from the English Wikipedia related to the word Rome (the entity text). For each article in the result set provided by Google, an Instance object is created: the identifier attribute of the object is the article URL, the label attribute is the Wikipedia article title, and the instance description is filled with the article abstract returned by Google. At the end, the finder returns a vector with all the gathered instances.
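Schematically, the candidate construction could look like the following sketch. SearchResult and googleSearch are hypothetical placeholders for whatever the Google API actually returns, and the Entity/Instance accessor names are assumptions:

    // Hedged sketch of candidate construction; SearchResult and
    // googleSearch are hypothetical placeholders, and the accessor
    // names are assumptions, not confirmed API.
    String query = ent.getText() + " site:en.wikipedia.org";
    Vector<Instance> candidates = new Vector<Instance>();
    for (SearchResult r : googleSearch(query)) {
        Instance ins = new Instance();
        ins.setIdentifier(r.getUrl());        // Wikipedia article URL
        ins.setLabel(r.getTitle());           // Wikipedia article title
        ins.setDescription(r.getAbstract());  // abstract returned by Google
        candidates.add(ins);
    }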

Instance ranker (package gimi.annot.insrank)

The instance ranker component is defined by the Java interface gimi.annot.insrank.InstanceRankerService:

    public interface InstanceRankerService {
        public Instance rankInstances(Entity ent, Vector<Instance> candidates, Context ctx) throws Exception;
    }

The method defined by the interface takes as input a gimi.annot.util.Entity object, a java.util.Vector of gimi.annot.util.Instance objects that represent potential candidates for the entity, and a gimi.annot.util.Context, which provides context information for disambiguation. One of the instances in the candidates input argument is selected as the most adequate to represent the entity and returned as the result. If no candidate is considered valid for the entity, the method may return null. If an error occurs in the process, an exception is thrown.

The package gimi.annot.insrank provides two different implementations of the interface InstanceRankerService. The first of these implementations (class gimi.annot.insrank.RandomInstanceRankerImpl) simply selects randomly one of the candidate instances from the input vector. The second implementation (class gimi.annot.insrank.DummyInstanceRankerImpl) always selects the first instance in the candidates input vector.
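For reference, the behaviour of the dummy ranker can be expressed roughly as follows (a sketch, not necessarily the literal source of DummyInstanceRankerImpl):

    // Sketch of a trivial ranker in the spirit of
    // DummyInstanceRankerImpl: it ignores the context and returns the
    // first candidate, or null if there are no candidates.
    public class FirstCandidateRankerImpl implements InstanceRankerService {
        public Instance rankInstances(Entity ent, Vector<Instance> candidates,
                                      Context ctx) throws Exception {
            if (candidates == null || candidates.isEmpty()) {
                return null;
            }
            return candidates.firstElement();
        }
    }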

Annotation reader (package gimi.annot.annotread)

The annotation reader component is defined by the Java interface gimi.annot.annotread.AnnotationReaderService:

    public interface AnnotationReaderService {
        public Hashtable<Entity,Instance> readAnnotations(File annotFile) throws Exception;
    }

The method defined by the interface, readAnnotations, receives a single input parameter: a java.io.File where the annotations provided by the human annotator are stored. As a result, the method should return a java.util.Hashtable of pairs (gimi.annot.util.Entity, gimi.annot.util.Instance). Each pair defines a correspondence between an entity and the instance that the human expert has considered the most adequate to represent the meaning of the entity. Please note that this interface implies the assumption that a certain entity (text, type pair) cannot have two different instances associated with it in the same document.

The default implementation provided in the package (class gimi.annot.annotread.TextFileAnnotationReaderImpl) reads the manual annotations from a text file. Within that file, each line represents an annotation with the following format:

EntityType::EntityText=Wikipedia_URL

For instance:

Location::Rio de Janeiro=http://en.wikipedia.org/wiki/Rio_de_Janeiro

It might happen that the human annotator finds no adequate instance for a certain entity. In that case, use the text string null instead of the Wikipedia URL within the manual annotations file. The system then includes a null Instance object in the corresponding entry of the result hash table. This object is obtained by calling the static method Instance.getNullInstance().
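A line in this format could be parsed roughly as follows. This is a sketch: the Entity and Instance constructors shown are assumptions about the gimi.annot.util classes; only Instance.getNullInstance() is confirmed above:

    // Hedged parsing sketch for one line of the form
    // EntityType::EntityText=Wikipedia_URL; the Entity and Instance
    // constructors are assumed signatures.
    Hashtable<Entity, Instance> annotations = new Hashtable<Entity, Instance>();
    String line = "Location::Rio de Janeiro=http://en.wikipedia.org/wiki/Rio_de_Janeiro";
    int sep = line.indexOf("::");
    int eq = line.indexOf('=', sep + 2);
    String type = line.substring(0, sep);
    String text = line.substring(sep + 2, eq);
    String url = line.substring(eq + 1);
    Entity entity = new Entity(text, type);          // assumed constructor
    Instance instance = url.equals("null")
            ? Instance.getNullInstance()             // confirmed static method
            : new Instance(url);                     // assumed constructor
    annotations.put(entity, instance);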

Evaluator (package gimi.annot.eval)

The evaluator component is defined by the Java interface gimi.annot.eval.AnnotationEvaluationService:

    public interface AnnotationEvaluationService {
        public double evaluate(Hashtable<Entity,Instance> manual, Hashtable<Entity,Instance> auto) throws Exception;
    }

The method evaluate defined by the interface takes as input two hash tables. The first one represents the annotations (entity-instance correspondences) provided by the human expert. The second one contains the annotations automatically detected by the system. The implementation should compare the manual annotations one by one with the automatic annotations and compute the precision (percentage of successful entity-instance associations) of the automatic approach, returning that value as a double.

For the purposes of this lab, we are mainly interested in measuring the quality of the disambiguation process, so the implementation ignores errors due to entities not found by the EntityFinderService. In order to do so, it operates only on the entities that appear both in the manual and in the automatically detected annotations.
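A minimal sketch of such an evaluator is shown below, assuming that Entity and Instance implement equals()/hashCode() so that hash table lookups and comparisons behave as expected:

    // Hedged sketch of the precision computation described above; only
    // entities present in both tables are taken into account.
    public double evaluate(Hashtable<Entity, Instance> manual,
                           Hashtable<Entity, Instance> auto) throws Exception {
        int common = 0;  // entities annotated both manually and automatically
        int correct = 0; // matching entity-instance associations
        for (Entity ent : manual.keySet()) {
            Instance autoIns = auto.get(ent);
            if (autoIns == null) {
                continue; // entity missed by the system: ignored here
            }
            common++;
            if (autoIns.equals(manual.get(ent))) {
                correct++;
            }
        }
        return (common == 0) ? 0.0 : ((double) correct) / common;
    }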

Installing, configuring and running the application

 

The source code of the automatic annotation tool is provided to you. Download it and decompress it in a local folder. In order to compile and run the system, you will need a Java environment (version 1.6.0 or higher) and two additional software libraries: GATE 4.0, a natural language processing engine, and Apache Ant, a build system for the Java platform. For your convenience, the additional libraries are also provided in the software distribution that you have already downloaded.

In order to check whether the system is successfully installed, compile the application. An Ant build file (build.xml) is provided to you for this purpose. If you have decompressed the software distribution in the folder ROOT_FOLDER, the build file should already be available at ROOT_FOLDER/Anotator/build.xml. In order to use it, you need Ant correctly installed on your system; a tutorial on how to do that is available on Ant's web site. Once Ant is installed, open a system shell, change the directory to the root of the annotation application folder (ROOT_FOLDER/Anotator/) and type ant. You should see an output similar to this one:

    [echo] Ant makefile of project Anotator
    [echo] ------------------------------------------------
    [echo]
    [echo] Usage options:
    [echo]
    [echo] ant help: shows this help message
    [echo] ant clean: removes files from previous build
    [echo] ant build: compiles the application
    [echo] ant run: executes the application
    [echo] ant all: cleans, builds and runs the application

You need to modify the property gate.home defined in the Ant build file so that it points to the path where GATE is installed on your system. For instance, if you use the GATE distribution included with the annotation software, the value of the property should be ROOT_FOLDER/GATE-4.0/. The usage of relative paths is not recommended.
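For example, the property definition in build.xml would look roughly like this (a sketch; the exact declaration in the provided build file may differ, and <ROOT_FOLDER> stands for the absolute path where you decompressed the distribution):

    <!-- Sketch of the gate.home property; use an absolute path. -->
    <property name="gate.home" value="<ROOT_FOLDER>/GATE-4.0/"/>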

Type ant build in the shell in order to compile the application. If everything works as expected the output should be similar to this one:

    Buildfile: build.xml

    build:
    [mkdir] Created dir: <ROOT_FOLDER>/Anotator/build/classes
    [javac] Compiling 17 source files to <ROOT_FOLDER>/Anotator/build/classes
    
    BUILD SUCCESSFUL

Before running the tool, it is important that you look at the source code and understand its parts. The information provided in the previous section may help you with this. It is also worth analyzing the source code of the main class of the system (gimi.annot.Main), which is where the component pipeline is implemented. The main class uses Spring dependency injection to dynamically load the objects that implement each of the component interfaces of the annotation system. An XML configuration file, <ROOT_FOLDER>/Anotator/conf/spring-beans.xml, defines the concrete classes that implement each of the relevant interfaces. For instance, the following XML fragment in that file:

<bean id="instance-ranker" class="gimi.annot.insrank.DummyInstanceRankerImpl"></bean>

indicates that the service named instance-ranker will be implemented by the class gimi.annot.insrank.DummyInstanceRankerImpl. In order to load an instance of that class from the application's main class, the following Java code is included in gimi.annot.Main:

insRanker = (InstanceRankerService) factory.getBean("instance-ranker");
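Putting it all together, the flow implemented in the main class looks roughly like the following sketch. The "text-extractor" bean id and the Context setter names are assumptions; consult gimi.annot.Main for the actual code:

    // Hedged sketch of the pipeline in gimi.annot.Main; factory is the
    // Spring bean factory and inputFile the java.io.File to annotate.
    // Bean ids other than the ones quoted in this document and the
    // Context setters are assumed names.
    TextExtractorService extractor =
            (TextExtractorService) factory.getBean("text-extractor");
    EntityFinderService entFinder =
            (EntityFinderService) factory.getBean("entity-finder");
    InstanceFinderService insFinder =
            (InstanceFinderService) factory.getBean("instance-finder");
    InstanceRankerService insRanker =
            (InstanceRankerService) factory.getBean("instance-ranker");

    String text = extractor.extractText(inputFile);

    Context ctx = new Context();
    ctx.setInputFilePath(inputFile.getAbsolutePath()); // assumed setter
    ctx.setInputText(text);                            // assumed setter

    Hashtable<Entity, Instance> autoAnnots = new Hashtable<Entity, Instance>();
    for (Entity ent : entFinder.findEntities(text)) {
        Vector<Instance> candidates = insFinder.findCandidates(ent, ctx);
        Instance best = insRanker.rankInstances(ent, candidates, ctx);
        if (best != null) {
            autoAnnots.put(ent, best);
        }
    }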

Once you have compiled the application and analyzed its source code, it is time to run it. In order to do so, type the command ant run in the shell. The Ant script should ask you for the path of the file to be annotated:

    run:
        [input] Please enter the path of the text file to be annotated:

In the directory <ROOT_FOLDER>/Anotator/test, you will find a corpus of text files (.txt) that you can use as input for the annotation process. All of these files are actual news items from the Spanish news agency EFE and from Wikinews. As you can see, they are plain text files in English.

Select one of the files in the corpus and type its path in the shell. Relative paths should be enough, so for instance you could simply type test/testX.txt in order to annotate the file <ROOT_FOLDER>/Anotator/test/testX.txt. The system loads the file, processes it looking for entities, disambiguates these entities using Wikipedia URLs as instance identifiers, and finally compares the automatically generated annotations with the ones provided by a human expert. The precision of the automatic approach is shown at the end.

Note that the annotation process takes a few minutes. The reason is that queries are sent to Google in order to look for candidate Wikipedia URLs for each detected entity. In order not to overload Google, a timer waits several seconds between successive requests to the search engine, delaying the execution of the annotation system.

Note also that the Ant script does not ask you for the path of the file that contains the manual annotations to be used in the evaluation step. This is due to the fact that the application uses the following convention to locate that file: if the system is annotating the file <ROOT_FOLDER>/Anotator/test/testX.txt it will look for the file <ROOT_FOLDER>/Anotator/test/testX.annot to load the corresponding manual annotations.

Finally, note that the main class is coded so that a file with the annotations automatically detected by the system is dumped. The format of this file is the same as the one used to store the manual annotations, that is, a text file that contains one annotation per line. If the application is annotating the file <ROOT_FOLDER>/Anotator/test/testX.txt, it will dump the automatic annotations to the file <ROOT_FOLDER>/Anotator/test/testX.out.

Exercise

 

You are requested to modify the disambiguation component of the annotation application. The goal is to improve, if possible, the precision of the automatic annotation system. Specifically, the two naive implementations of the interface gimi.annot.insrank.InstanceRankerService that are provided do not take the context information (text, other entities, etc.) into account when deciding which is the best instance for a certain entity.

The precision results of running the annotation system over the test corpus using the default instance finder and the two instance rankers provided by default are shown below. Please note that these are the results of a single execution; if you run the system, the results will probably differ, especially if you use the random instance ranker.

    Ranker                     test1  test2  test3  test4  test5  test6  test7  test8  test9  test10
    RandomInstanceRankerImpl   0.0    0.273  0.077  0.2    0.0    0.071  0.25   0.1    0.2    0.133
    DummyInstanceRankerImpl    0.5    0.818  1.0    1.0    0.615  0.714  0.916  0.5    0.8    0.467

You are expected to provide your own implementation of the interface gimi.annot.insrank.InstanceRankerService. In order to test your solution, you have to configure the system to use your implementation instead of the one provided by default. Modify the Spring configuration file (<ROOT_FOLDER>/Anotator/conf/spring-beans.xml), replacing, within the fragment:

<bean id="instance-ranker" class="gimi.annot.insrank.DummyInstanceRankerImpl"></bean>

the value of the class attribute with the fully qualified name (including package) of your class (e.g. this.is.my.InstanceRankerImpl).
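As a starting point, a context-aware ranker could score each candidate by the word overlap between the input text and the candidate's description, along the lines of the following sketch (the getter names getInputText() and getDescription() are assumptions based on the descriptions given earlier):

    import java.util.HashSet;
    import java.util.Set;
    import java.util.Vector;

    import gimi.annot.insrank.InstanceRankerService;
    import gimi.annot.util.Context;
    import gimi.annot.util.Entity;
    import gimi.annot.util.Instance;

    // Hedged sketch of a context-aware ranker: it picks the candidate
    // whose description shares the most words with the input text. The
    // getInputText() and getDescription() getters are assumed names.
    public class WordOverlapInstanceRankerImpl implements InstanceRankerService {

        public Instance rankInstances(Entity ent, Vector<Instance> candidates,
                                      Context ctx) throws Exception {
            if (candidates == null || candidates.isEmpty()) {
                return null;
            }
            Set<String> contextWords = toWordSet(ctx.getInputText());
            Instance best = null;
            int bestScore = -1;
            for (Instance cand : candidates) {
                int score = 0;
                for (String w : toWordSet(cand.getDescription())) {
                    if (contextWords.contains(w)) {
                        score++; // one point per shared word
                    }
                }
                if (score > bestScore) {
                    bestScore = score;
                    best = cand;
                }
            }
            return best;
        }

        // Lowercase the text and keep words longer than three characters,
        // as a crude way of skipping stop-like words.
        private Set<String> toWordSet(String text) {
            Set<String> words = new HashSet<String>();
            if (text != null) {
                for (String w : text.toLowerCase().split("\\W+")) {
                    if (w.length() > 3) {
                        words.add(w);
                    }
                }
            }
            return words;
        }
    }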

Optionally, you can also modify the default implementation of the interface gimi.annot.insfind.InstanceFinderService. Note that this implementation does not use any context information (entity type, text, other entities) to look for candidate instances. Again, if you want to provide your own implementation of the interface, you need to configure the system to use it. In this case, modify the Spring configuration file, replacing, within the fragment:

    <bean id="instance-finder" class="gimi.annot.insfind.GoogleWikipediaInstanceFinderImpl">
        <property name="secondsBetweenTwoInvocationsToGoogle" value="20"/>
    </bean>

the value of the class attribute with the fully qualified name (including package) of your class (e.g. this.is.my.InstanceFinderImpl). It is strongly recommended that you do not reduce the time between successive Google invocations; otherwise, Google may ban your IP address.

In both cases, in order to extract richer context information, you can process the text of the input file, which is provided to the components through a Context object. You can reuse the Java code already provided in the class gimi.annot.entfind.GATEEntityFinderImpl to process that text with GATE. Note that the implementation provided to you is configured through Spring to only provide annotations about named entities (Person, Location, Organization):

    <bean id="entity-finder" class="gimi.annot.entfind.GATEEntityFinderImpl" init-method="init">
        <property name="annotTypesRequired" value="Person,Location,Organization"/>
    </bean>

But if you want to reuse that code to implement your own solution, you can use the method setAnnotTypesRequired to obtain a richer set of metadata from GATE (for instance, Token, Date, Sentence, Lookup, JobTitle, etc.). Another possibility is to get an object of the class gate.AnnotationSet by invoking the method doNLPProcessing. This object provides raw access to all the metadata detected by GATE when analyzing the text.
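For instance (a hedged sketch: the exact signatures of setAnnotTypesRequired and doNLPProcessing are assumed from the description above, and init() is the init-method that Spring invokes in the default setup):

    // Hedged sketch of reusing GATEEntityFinderImpl for richer context;
    // the method signatures are assumed from the description above.
    GATEEntityFinderImpl gateFinder = new GATEEntityFinderImpl();
    gateFinder.init();
    gateFinder.setAnnotTypesRequired("Person,Location,Organization,Date,JobTitle");
    Vector<Entity> richerEntities = gateFinder.findEntities(text);

    // Alternatively, raw access to all GATE metadata:
    gate.AnnotationSet annots = gateFinder.doNLPProcessing(text);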

Finally, an alternative is to rely on the capabilities of other open source NLP tools, like LingPipe. You may also find the Wikipedia API useful to obtain information about Wikipedia articles.

Clean, build and run the application using the Ant script as many times as you need to test and improve your approach. Please note that your solution should not depend on the specific texts in the corpus; that is, it should work if the input files are changed, as long as they are English text files.
