Second HAREM: Advancing the State of the Art of Named Entity Recognition in Portuguese

Second HAREM: Advancing the State of the Art of Named Entity Recognition in Portuguese
Cláudia Freitas*, Cristina Mota, Diana Santos**, Hugo Oliveira* and Paula Carvalho***
Linguateca, FCCN
* at Univ. of Coimbra, CISUC / DEI; ** at SINTEF ICT; *** at Univ. of Lisbon, Faculty of Sciences, Lasige
LREC 2010 Conference, Valletta, Malta, May 2010

Linguateca
Linguateca is a distributed network for fostering the computational processing of the Portuguese language:
- Organization of evaluation contests for Portuguese (Morfolimpíadas, HAREM and CLEF [GeoCLEF, adhoc CLEF, GikiP, LogCLEF, GikiCLEF])
- Creation of free resources that enable sophisticated processing of Portuguese
- Monitoring and cataloguing the area
Acknowledgement: Linguateca and HAREM were funded by the Portuguese government and the European Union with contract number 339/1.3/C/NAC, UMIC and FCCN.

HAREM
- Evaluation of named entity recognition in Portuguese texts
- Second HAREM
  - 10 participants; 27 official runs
  - New tracks:
    - recognition and normalization of temporal entities (Hagège et al., 2008)
    - detection of relations between named entities (Freitas et al., 2008, 2009)
Timeline:
- September 2007: call for participation
- November 2007: proposal of 3 tasks
- January 2008: release of training material
- April 2008: submission period
- September 2008: workshop

Main features (Santos, 2007b)
I. Semantic model
- NE classified in context: the same name, Diário de Notícias, receives a different classification in each sentence
- A morte é reportada no Diário de Notícias do dia ('The death is announced in Diário de Notícias of that day'): LOCAL VIRTUAL COMSOC / place
- A diferença entre o 'Jornal de Notícias' e o 'Diário de Notícias' ('The difference between Jornal de Notícias and Diário de Notícias'): COISA CLASSE / thing
- O seu pai era funcionário público do Ministério da Justiça e crítico musical do 'Diário de Notícias' ('His father was an employee of the Ministry of Justice and a music reviewer for Diário de Notícias'): ORGANIZACAO EMPRESA / org
- … foi fotografado pelo Diário de Notícias (DN) a fumar uma cigarrilha... ('had a picture taken by Diário de Notícias smoking a cigarette'): PESSOA GRUPOMEMBRO / person

Main features
II. Vagueness
- NE may belong simultaneously to more than one category or type
- A Administração Bush identifica-se com a Justiça Divina ('Bush Administration takes the role of Divine Providence')
- Administração Bush / Bush Administration: PERSON? ORG? BOTH!

Main features
III. Categories
- Initial corpus-based approach + participant wishes
- 10 categories, 43 types, 22 subtypes
Categories: PESSOA, ABSTRACCAO, ACONTECIMENTO, COISA, OBRA, ORGANIZACAO, VALOR, LOCAL, TEMPO
Type and subtype labels shown in the category diagram: DATA, HORA, GRUPOCARGO, GRUPOIND, GRUPOMEMBRO, INDIVIDUAL, MEMBRO, CARGO, ADMINISTRACAO, EMPRESA, INSTITUICAO, DISCIPLINA, ESTADO, IDEIA, NOME, CLASSE, MEMBROCLASSE, OBJECTO, SUBSTANCIA, ORGANIZADO, EFEMERIDE, EVENTO, MOEDA, QUANTIDADE, CLASSIFICACAO, ARTE, REPRODUZIDA, VIRTUAL, POVO, DURACAO, FREQUENCIA, GENERICO, TEMPO_CALEND, PLANO, HUMANO, FISICO, EM, INTERVALO, COMSOCIAL, SITIO, OBRA, CONSTRUCAO, REGIAO, DIVISAO, PAIS, RUA, AGUAMASSA, RELEVO, REGIAO, PLANETA, AGUACURSO, ILHA, VARIADO, ESCOLA, NOME, PLANO, SUB, CICLICO, PERIODO

Main features
IV. Embedded NEs
- ALT mechanism
- Quantos atletas participaram nos Jogos Olímpicos de Barcelona? / How many athletes participated in Barcelona Olympic Games?
- Alternatives: [Jogos Olímpicos de Barcelona] | [Jogos Olímpicos] de [Barcelona]
  - Barcelona Olympic Games: EVENT
  - Olympic Games: EVENT; Barcelona: PLACE

Main features
V. Evaluation setup
- Flexibility: identification and classification; participants' selective scenarios

Participant system   Scenario
Cage2                Sel2
DobrEM               Pes
PorTexTO             Temp
Priberam             Tot
R3M                  Sel3
REMBRANDT            Tot
REMMA                Sel4
SEI-Geo              Sel5
SeRELeP              Tot
XIP/L2F/XEROX        Sel6

Category columns in the table: PES, ORG, LOC, OBR, ACO, ABS, COI, TEM, VAL.
Restrictions marked in the table cells: Cage2: CAT, CAT, F + H, CAT; REMMA: C/T, C/T; XIP/L2F/XEROX: NORM.
Legend: CAT = only CATEGORY; F + H = only PLACEs (human and natural); C/T = only CATEGORY and TYPE; NORM = normalization of temporal expressions.

New track: ReRelEM
ReRelEM: Reconhecimento de Relações entre Entidades Mencionadas (relation detection between named entities)
- Anaphora resolution (Mitkov, 2000; Collovini et al., 2007; de Souza et al.): co-reference, anaphoric chains in texts
- + Relation detection (Agichtein and Gravano, 2000; Zhao and Grishman, 2005; Culotta and Sorensen, 2004): fact extraction, world knowledge
- = Investigate which relations could be found in texts; devise a pilot task to compare systems that recognize those relations

Relation inventory
- Identity (ident)
- Inclusion (inclui (includes) / incluido (included))
- Placement (ocorre-em (occurs-in) / sede-de (place_of))
Examples:
- … foi fundada em 1131 por D. Telo (São Teotónio) / It was founded in 1131 by D. Telo (São Teotónio)
- Hamilton, colega de Alonso na McLaren / Lewis Hamilton, Alonso's team-mate in McLaren
- GP Brasil – Não faltou emoção em Interlagos no Circuito José Carlos Pace desde a primeira volta… / GP Brasil – There was no lack of excitement in Interlagos at the José Carlos Pace Circuit
- Os adeptos do Porto invadiram a cidade do Porto em júbilo / The (FC) Porto fans invaded the (city of) Porto, very happy

Relation inventory
- Other (outra)

Relation / gloss: #
vinculo-inst / affiliation: 936
obra-de / work-of: 300
participante-em / participant-in: 202
ter-participacao-de / has-participant: 202
relacao-familiar / family-tie: 90
residencia-de / home-of: 75
natural-de / born-in: 47
relacao-profissional / professional-tie: 46
povo-de / people-of: 30
representante-de / representative-of: 19
residente-de / living-in: 15
personagem-de / character-of: 12
periodo-vida / life-period: 11
propriedade-de / owned-by: 10
proprietario-de / owner-of: 10
representado-por / represented-by: 7
praticado-em / practised-in: 7
outra-rel / other: 6
nome-de-ident / name-of: 4
outra-edicao / other-edition: 2

Second HAREM Collection
Docs: 1,040; paragraphs: 15,737; words: 670,610
[Figure: distribution by text genre]

Second HAREM Golden Collection
Docs: 129; paragraphs: 2,274; words: 147,991; NEs: 7,847; vague NEs: 633 [52 classes]
[Figure: NE distribution]

ReRelEM Golden Collection – full version
Docs: 129; paragraphs: 2,274; words: 147,991; NEs: 7,847; relations: 4,803

ReRelEM relation types
Relation type: #
autor_de/obra_de (authorship): 142
causador_de (agent): 22
consequencia_de (result_of): 1
data_de/datado_de (date of): 105
data_morte (death date): 10
data_nascimento (birth date): 5
ident (identity): 2229
inclui/incluido (inclusion): 854
local_nascimento_de/natural_de (birth place): 142
localizado_em/localizacao_de (place of): 24
nome_de/nomeado_por (name-of): 56
ocorre_em/sede_de (location): 358
outra_edicao (other edition): 3
outrarel (other relation): 93
participante_em/ter_participacao_de (participation-in): 153
periodo_vida (lifetime): 5
personagem_de (character of): 14
praticado_em/pratica_se/praticante_de/praticado_por (practicing): 99
produtor_de/produzido_por (manufacturing): 50
proprietario_de/propriedade_de (ownership): 39
relacao_familiar (kinship relation): 88
relacao_profissional (professional relation): 17
residente_de/residencia_de (place of residence): 19
vinculo_inst (affiliation): 275
TOTAL: 4803
Legend: the slide distinguishes relations that the systems had to explicitly name from relations under OUTRA/OTHER.
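The counts in the relation table can be checked with a few lines of Python. The sketch below only copies the figures from the table; the dictionary name and the grouping of `ident`, `inclui/incluido` and `ocorre_em/sede_de` as the explicitly named relations (following the relation inventory slide) are choices made for this illustration, not part of the HAREM tooling.

```python
# Counts copied from the ReRelEM relation types table (illustrative check only).
RELATION_COUNTS = {
    "autor_de/obra_de": 142,
    "causador_de": 22,
    "consequencia_de": 1,
    "data_de/datado_de": 105,
    "data_morte": 10,
    "data_nascimento": 5,
    "ident": 2229,
    "inclui/incluido": 854,
    "local_nascimento_de/natural_de": 142,
    "localizado_em/localizacao_de": 24,
    "nome_de/nomeado_por": 56,
    "ocorre_em/sede_de": 358,
    "outra_edicao": 3,
    "outrarel": 93,
    "participante_em/ter_participacao_de": 153,
    "periodo_vida": 5,
    "personagem_de": 14,
    "praticado_em/pratica_se/praticante_de/praticado_por": 99,
    "produtor_de/produzido_por": 50,
    "proprietario_de/propriedade_de": 39,
    "relacao_familiar": 88,
    "relacao_profissional": 17,
    "residente_de/residencia_de": 19,
    "vinculo_inst": 275,
}

# Relations listed by name in the relation inventory (identity, inclusion, placement).
NAMED = ("ident", "inclui/incluido", "ocorre_em/sede_de")

total = sum(RELATION_COUNTS.values())
named = sum(RELATION_COUNTS[r] for r in NAMED)

print(total)                    # 4803, matching the TOTAL row
print(named)                    # 3441
print(f"{named / total:.1%}")   # 71.6%
```

Under this grouping, identity alone accounts for 2229 of the 4803 relations, and the three named relations together for about 72%.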

ReRelEM Golden Collection – full version
ReRelEM relations per category

Category: #
ABSTRACCAO / abstraction: 255
ACONTECIMENTO / event: 168
COISA / thing: 175
LOCAL / place: 960
OBRA / title: 274
ORGANIZACAO / org: 783
OUTRO / other: 25
PESSOA / person: 1286
TEMPO / time: 192
VALOR / value: 19

Evaluation: HAREM

$$
\text{HAREM score} = 1 + \sum_{N}\Big((1 - W_{cat})\,cat_{certa}\,\alpha + (1 - W_{tipo})\,tipo_{certa}\,\beta + (1 - W_{sub})\,sub_{certa}\,\gamma\Big) - \sum_{M}\Big(W_{cat}\,cat_{esp}\,\alpha + W_{tipo}\,tipo_{esp}\,\beta + W_{sub}\,sub_{esp}\,\gamma\Big)
$$

where
- N = number of classifications in the GC
- M = number of spurious classifications in the participant's run
- W_cat = 1/number of categories in the scenario; W_tipo = 1/number of types…
- α, β, γ = weights for categories (1), types (0.5) and subtypes (0.25)
- (cat, tipo, sub)_certa = 1 when it is right; = 0 when wrong
- (cat, tipo, sub)_esp = 1 when spurious; = 0 when not
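The score formula can be read as a reward for each correct level of a GC classification and a penalty for each spurious level in the run. A minimal Python sketch of that reading follows; the names `Judgement` and `harem_score` are hypothetical, and the values used for W_cat, W_tipo and W_sub in the example are made up for round numbers rather than taken from any real scenario.

```python
from dataclasses import dataclass


@dataclass
class Judgement:
    """Hypothetical record: which levels of one classification are flagged."""
    cat: bool = False
    tipo: bool = False
    sub: bool = False


def harem_score(correct, spurious, w_cat, w_tipo, w_sub,
                alpha=1.0, beta=0.5, gamma=0.25):
    """Score one NE: 1 + rewards over GC classifications - penalties over spurious ones."""
    reward = sum(
        (1 - w_cat) * j.cat * alpha
        + (1 - w_tipo) * j.tipo * beta
        + (1 - w_sub) * j.sub * gamma
        for j in correct
    )
    penalty = sum(
        w_cat * j.cat * alpha
        + w_tipo * j.tipo * beta
        + w_sub * j.sub * gamma
        for j in spurious
    )
    return 1 + reward - penalty


# Illustrative weights only (not a real scenario): 10 categories, 4 types, 2 subtypes.
W = dict(w_cat=0.1, w_tipo=0.25, w_sub=0.5)

# One GC classification with category, type and subtype all right.
all_right = harem_score([Judgement(True, True, True)], [], **W)

# Same, plus one spurious category in the participant's run.
with_spurious = harem_score([Judgement(True, True, True)], [Judgement(cat=True)], **W)

print(round(all_right, 3))      # 2.4
print(round(with_spurious, 3))  # 2.3
```

With these weights the fully correct classification earns 0.9 + 0.375 + 0.125 on top of the base 1, and the spurious category costs 0.1.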

Evaluation: ReRelEM
- Evaluate JUST the relations (not the NE)
  - System: Portugal_ORG inclui Lisboa_LOCAL
  - GC: Portugal_LOCAL inclui Lisboa_LOCAL
- Relations with mismatched arguments were ignored
- Alternative segmentations were ignored: [Universidade de Lisboa] | [Universidade] de [Lisboa]
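The Portugal/Lisboa example can be sketched in Python: if the NE category is dropped from each argument, the system relation and the GC relation compare equal even though the system classified Portugal differently. The `NAME_CATEGORY` string notation is taken from the slide; the helper names `strip_category` and `relation_key` are hypothetical and do not come from the official evaluation programs.

```python
def strip_category(argument: str) -> str:
    """Drop a trailing _CATEGORY suffix, e.g. 'Portugal_ORG' -> 'Portugal'."""
    name, sep, suffix = argument.rpartition("_")
    # Only strip when there is a separator and the suffix looks like a category label.
    return name if sep and suffix.isupper() else argument


def relation_key(triple):
    """Reduce (arg1, relation, arg2) to a key that ignores NE categories."""
    arg1, relation, arg2 = triple
    return (strip_category(arg1), relation, strip_category(arg2))


system = ("Portugal_ORG", "inclui", "Lisboa_LOCAL")
golden = ("Portugal_LOCAL", "inclui", "Lisboa_LOCAL")

print(relation_key(system))                          # ('Portugal', 'inclui', 'Lisboa')
print(relation_key(system) == relation_key(golden))  # True
```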

[Figure: ReRelEM evaluation pipeline]
Inputs: CDReRelEM.xml, participacao.xml
Modules shown: Maximization, Filtering, Selection, Translation, Individual EVAL, Normalization, Aligner, ALT Organizer, HAREM Filtering, Global EVAL
Operations shown:
- Remove relations of types not being evaluated
- Score the triples
- Apply expansion rules
- Normalize NE identifiers
- Remove alignments where NEs don't match, and all relations involving removed NEs
- Create triples: arg1 relation arg2
- Compute precision, recall and F-measure
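The last steps of the pipeline (filter by relation type, score the triples, compute precision, recall and F-measure) can be sketched as set operations over `(arg1, relation, arg2)` triples. This is a simplified illustration under stated assumptions: it uses exact triple matching and skips alignment, expansion rules and identifier normalization, the function name `score_triples` is hypothetical, and the example triples are invented from names that appear on the slides.

```python
def score_triples(system, golden, evaluated_relations):
    """Return (precision, recall, F-measure) over exact-match triples."""
    # Remove relations of types not being evaluated, on both sides.
    sys_set = {t for t in system if t[1] in evaluated_relations}
    gold_set = {t for t in golden if t[1] in evaluated_relations}

    correct = len(sys_set & gold_set)
    precision = correct / len(sys_set) if sys_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Invented example triples (not real GC data).
golden = [
    ("D. Telo", "ident", "São Teotónio"),
    ("Portugal", "inclui", "Lisboa"),
    ("GP Brasil", "ocorre_em", "Interlagos"),
    ("A", "outra_edicao", "B"),
]
system = [
    ("D. Telo", "ident", "São Teotónio"),
    ("Portugal", "inclui", "Porto"),
    ("A", "outra_edicao", "B"),
]

precision, recall, f_measure = score_triples(
    system, golden, evaluated_relations={"ident", "inclui", "ocorre_em"}
)
print(round(precision, 3), round(recall, 3), round(f_measure, 3))  # 0.5 0.333 0.4
```

The `outra_edicao` triple is dropped on both sides before scoring, so it neither helps nor hurts the run in this scenario.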

Participation and results: HAREM
- Only two systems (Priberam and REMBRANDT) tried to recognize the complete set of categories;
- Only one system (R3M) adopted a machine learning approach; the others relied on hand-coded rules + dictionaries, gazetteers, and ontologies;
- Two of them (REMBRANDT and REMMA) made use of the Portuguese Wikipedia, in different ways

Participation and results: ReRelEM

System: NE task; relations; goal
- REMBRANDT: all; all; answer complex questions based on Wikipedia (PhD work in progress)
- SeRELeP: only identification; all but outra; develop a hot news portal based on NEs
- SEI-Geo: only LOCAL detection; inclusion; evaluate a system for ontology creation (PhD work)

Second HAREM Resources
LÂMPADA, the Second HAREM Resource Package, comprises:
- Second HAREM Collection and its metadata
- Second HAREM Golden Collection (GC), including ReRelEM
- Extended TEMPO Golden Collection
- ReRelEM triples
- Evaluation programs
- System runs
- Documentation

SAHARA and AC/DC: further access to HAREM and ReRelEM resources
- SAHARA web service (Gonçalo Oliveira & Cardoso, 2009)
  - Submit new runs and:
    - select different options for scoring against the GC(s);
    - use several scenarios;
    - check the relative performance against the official runs.
- AC/DC: interaction with the parsed GC (Rocha & Santos, 2007)

Discussion  Undeniable relevance for Portuguese processing community, but of possible interest to a wider audience  Multilingual comparison  Are there relevant differences regarding categories?  Do cohesive devices differ between languages?  Differences between explicit / implicit relations  Relationship with QA  Questions for as one text genre  Relationship with GIR  Use of GeoCLEF pool documents in the Second HAREM collection, that allow detailed assess of the importance of NER for this application

Comments and reuse welcome!
- Studies of NER and RD difficulty for Portuguese, by text genre
- Studies of other subjects that may involve NE
- Training material
- Further linguistic analysis
- Conversion to other formats/theories