Published by Stella Alencar Paranhos. Modified more than 8 years ago.
1
Thesaurus Design (from analysed corpora)
Pablo Gamallo, Alexandre Agustini, G.P. Lopes
{gamallo,aagustini}@di.fct.unl.pt
GLINt (Grupo de Lingua Natural), FCT, Universidade Nova de Lisboa
2
Thesaurus design: Linguistic goals
Words shown on the slide: fine, sanction, president, secretary, small, big, ministry, minister, bank, organisation
3
Thesaurus design: Properties
Distributional Hypothesis: words sharing similar contexts are semantically related.
Domain-specific corpus.
Types of context:
  simple co-occurrence (bigrams)
  co-occurrence within a window (n-grams)
  syntactic structures
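The window-based context type can be illustrated with a minimal Python sketch. The function name `window_contexts` and the `window` parameter are hypothetical names chosen for this example, not part of the presentation; with `window=1` the sketch reduces to simple adjacent co-occurrence.

```python
from collections import Counter, defaultdict

def window_contexts(tokens, window=2):
    """Count, for every token, the tokens that co-occur within `window` positions."""
    contexts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                contexts[word][tokens[j]] += 1
    return contexts

# Illustrative input: one clause from the presentation's micro-corpus.
tokens = "Pedro is reading a book".split()
ctx = window_contexts(tokens, window=1)
print(ctx["reading"])
```

Syntactic contexts, the third type listed, need a tagger and a chunker rather than a fixed window, which is why the presentation treats them separately.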
4
Thesaurus design: Steps
1. Extraction of syntactic contexts from the corpus
2. Similarity measure between words (based on their syntactic contexts)
3. For each word, identify its most similar words
5
Extraction of syntactic contexts
1. Tagging (PoS tags)
2. Chunking (parsing into basic chunks)
3. Attachment heuristics
4. Identification of binary dependencies
5. Extraction of syntactic contexts
6
Tagging and chunking
Sentence: Clinton sent a clear message to the authorities of Portugal
Tagger: Clinton_N sent_V a_ART clear_ADJ message_N to_PREP the_ART authorities_N of_PREP Portugal_N
Chunking: NP (Clinton) VP (send) NP (message, clear) PP (to, NP(authority)) PP (of, NP(portugal))
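As a small illustration of how the tagger output above could be consumed by a later step, the following Python sketch splits the `token_TAG` notation into pairs. The helper name `parse_tagged` is hypothetical, and the sketch assumes the tag is whatever follows the last underscore; it does not reproduce the authors' tagger or chunker.

```python
def parse_tagged(line):
    """Split 'token_TAG' items into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the last underscore, so tokens may contain underscores
        token, _, tag = item.rpartition("_")
        pairs.append((token, tag))
    return pairs

tagged = ("Clinton_N sent_V a_ART clear_ADJ message_N to_PREP "
          "the_ART authorities_N of_PREP Portugal_N")
pairs = parse_tagged(tagged)

# The nouns are the words whose contexts the later steps compare.
nouns = [token for token, tag in pairs if tag == "N"]
print(nouns)
```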
7
Attachment Heuristics and Syntactic Dependencies
Attachment of Basic Chunks:
Binary Dependencies:
8
Syntactic Contexts
9
Similarity Measure: Binary Jaccard coefficient
The similarity between two words relies on the ratio between the number of contexts that are common to both words and the total number of their contexts.
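A minimal Python sketch of this ratio, reading "the total number of their contexts" as the size of the union of the two context sets (the usual Jaccard definition). The function name and the placeholder context labels `c1` to `c4` are hypothetical.

```python
def binary_jaccard(contexts_a, contexts_b):
    """Shared contexts divided by all distinct contexts of the two words."""
    a, b = set(contexts_a), set(contexts_b)
    union = a | b
    if not union:
        return 0.0  # two words without any context are treated as unrelated
    return len(a & b) / len(union)

# Placeholder labels: the first word has three contexts, the second has the same three plus one.
print(binary_jaccard({"c1", "c2", "c3"}, {"c1", "c2", "c3", "c4"}))  # 0.75
```

The binary version ignores how often a context occurs and how discriminating it is, which is what the weighted variant addresses.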
10
Similarity Measure: Weighted Jaccard coefficient
11
MicroCorpus
Pedro is reading a book and Maria is reading a book, Pedro is reading a novel and Maria read a novel yesterday, Pedro is reading a lot of things, but Pedro loves Maria, Maria loves books, in fact Maria loves a lot of things.
Maria is eating an apple and Pedro is eating an apple too, Pedro ate eggs yesterday, Pedro eats a lot of things, Maria is eating eggs, Maria loves eggs a lot.
12
Thesaurus relations between nouns
Pedro, Maria
book, novel
apple, egg
thing: book, egg, apple, novel
(book egg)? (Maria thing)?? (Pedro egg)???
13
Extracting syntactic contexts of nouns
Pedro : (, 3) (, 1) (, 3)
Maria : (, 2) (, 3) (, 2) (, 1)
novel : (, 2)
book : (, 3) (, 1)
thing : (, 1) (, 1) (, 1)
apple : (, 2)
egg : (, 2) (, 1)
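The counting step can be sketched in Python. The context labels did not survive in the list above, so the `("subj", verb)` notation is an assumption made for this example, and the subject-verb pairs were read off the micro-corpus by hand rather than produced by a parser.

```python
from collections import Counter, defaultdict

# (verb lemma, subject) pairs read off the micro-corpus by hand.
subject_pairs = [
    ("read", "Pedro"), ("read", "Maria"), ("read", "Pedro"),
    ("read", "Maria"), ("read", "Pedro"),
    ("love", "Pedro"), ("love", "Maria"), ("love", "Maria"), ("love", "Maria"),
    ("eat", "Maria"), ("eat", "Pedro"), ("eat", "Pedro"),
    ("eat", "Pedro"), ("eat", "Maria"),
]

# Map each noun to a counter over its contexts; the "subj" label is hypothetical.
contexts = defaultdict(Counter)
for verb, noun in subject_pairs:
    contexts[noun][("subj", verb)] += 1

for noun, counts in contexts.items():
    print(noun, dict(counts))
```

Under this reading, Pedro gets three contexts with frequencies 3, 1 and 3, which is consistent with the frequencies listed for Pedro.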
14
Computing the weight of a context for each word (1):
Pedro: (, 3)
  GW( ) = log (3/3 + 2/4) / log(2) = 0.17 / 0.3 = 0.56
  LW(Pedro, ) = log(3) = 0.47
  W(Pedro, ) = 1.03
Pedro: (, 1)
  GW( ) = log (1/3 + 3/4) / log(2) = 0.034 / 0.3 = 0.11
  LW(Pedro, ) = log(1) = 0
  W(Pedro, ) = 0.11
Pedro: (, 3)
  GW( ) = log (3/3 + 2/4) / log(2) = 0.17 / 0.3 = 0.56
  LW(Pedro, ) = log(3) = 0.47
  W(Pedro, ) = 1.03
15
Computing the weight of a context for each word (2):
Maria: (, 2)
  GW( ) = log (3/3 + 2/4) / log(2) = 0.17 / 0.3 = 0.56
  LW(Maria, ) = log(2) = 0.3
  W(Maria, ) = 0.86
Maria: (, 3)
  GW( ) = log (1/3 + 3/4) / log(2) = 0.034 / 0.3 = 0.11
  LW(Maria, ) = log(3) = 0.47
  W(Maria, ) = 0.58
Maria: (, 2)
  GW( ) = log (3/3 + 2/4) / log(2) = 0.17 / 0.3 = 0.56
  LW(Maria, ) = log(2) = 0.3
  W(Maria, ) = 0.86
Maria: (, 1)
  GW( ) = log (1/2 + 1/4 + 1/3 + 1/2) / log(4) = 0.19 / 0.6 = 0.31
  LW(Maria, ) = log(1) = 0
  W(Maria, ) = 0.31
16
Computing the weight of a context for each word (3):
novel: (, 2)
  GW( ) = log (2/1 + 3/2 + 1/3) / log(3) = 0.54 / 0.47 = 1.15
  LW(novel, ) = log(2) = 0.3
  W(novel, ) = 1.45
book: (, 3)
  GW( ) = log (2/1 + 3/2 + 1/3) / log(3) = 0.54 / 0.47 = 1.15
  LW(book, ) = log(3) = 0.47
  W(book, ) = 1.62
book: (, 1)
  GW( ) = log (1/2 + 1/4 + 1/3 + 1/2) / log(4) = 0.19 / 0.6 = 0.31
  LW(book, ) = log(1) = 0
  W(book, ) = 0.31
17
Computing the weight of a context for each word (4):
thing: (, 1)
  GW( ) = log (2/1 + 3/2 + 1/3) / log(3) = 0.54 / 0.47 = 1.15
  LW(thing, ) = log(1) = 0
  W(thing, ) = 1.15
thing: (, 1)
  GW( ) = log (1/3 + 2/1 + 2/2) / log(3) = 0.52 / 0.47 = 1.1
  LW(thing, ) = log(1) = 0
  W(thing, ) = 1.1
thing: (, 1)
  GW( ) = log (1/2 + 1/4 + 1/3 + 1/2) / log(4) = 0.19 / 0.6 = 0.31
  LW(thing, ) = log(1) = 0
  W(thing, ) = 0.31
18
Computing the weight of a context for each word (5):
apple: (, 2)
  GW( ) = log (1/3 + 2/1 + 2/2) / log(3) = 0.52 / 0.47 = 1.1
  LW(apple, ) = log(2) = 0.3
  W(apple, ) = 1.4
egg: (, 2)
  GW( ) = log (1/3 + 2/1 + 2/2) / log(3) = 0.52 / 0.47 = 1.1
  LW(egg, ) = log(2) = 0.3
  W(egg, ) = 1.4
egg: (, 1)
  GW( ) = log (1/2 + 1/4 + 1/3 + 1/2) / log(4) = 0.19 / 0.6 = 0.31
  LW(egg, ) = log(1) = 0
  W(egg, ) = 0.31
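The worked figures above appear to follow W = GW + LW with base-10 logarithms (log(2) = 0.3). The Python sketch below encodes that reading. It is inferred from the numbers only: each ratio is assumed to be a word's frequency with the context divided by that word's number of distinct contexts, and the divisor is assumed to be the logarithm of the number of words sharing the context. The function names are hypothetical.

```python
import math

def global_weight(ratios):
    """GW of a context: log10 of the summed per-word ratios over log10 of the word count."""
    if len(ratios) < 2:
        raise ValueError("needs at least two words, otherwise the divisor log10(1) is zero")
    return math.log10(sum(ratios)) / math.log10(len(ratios))

def local_weight(freq):
    """LW of a (word, context) pair: log10 of their co-occurrence frequency."""
    return math.log10(freq)

def weight(freq, ratios):
    """W = GW + LW, as in the worked examples."""
    return global_weight(ratios) + local_weight(freq)

# First Pedro context: frequency 3, ratios 3/3 (Pedro) and 2/4 (Maria).
print(round(global_weight([3 / 3, 2 / 4]), 3))
print(round(weight(3, [3 / 3, 2 / 4]), 3))
```

Unrounded arithmetic gives slightly larger values than the worked figures (about 0.58 and 1.06 instead of 0.56 and 1.03), because the worked figures truncate the intermediate logarithms.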
19
Similarity between words (1)
WJ(Pedro, Maria) = 2.17 / 2.61 = 0.83
  min( (1.03+0.11+1.03), (0.86+0.58+0.86) ) = 2.17
  max( (1.03+0.11+1.03), (0.86+0.58+0.86+0.31) ) = 2.61
WJ(book, novel) = 1.45 / 1.93 = 0.75
  min( (1.45), (1.62) ) = 1.45
  max( (1.45), (1.62+0.31) ) = 1.93
WJ(book, thing) = 1.58 / 2.69 = 0.58
  min( (1.62+0.33), (1.27+0.31) ) = 1.58
  max( (1.62+0.33), (1.27+0.31+1.1) ) = 2.69
20
Similarity between words (2)
WJ(apple, egg) = 1.4 / 1.71 = 0.81
  min( (1.4), (1.4) ) = 1.4
  max( (1.4), (1.4+0.31) ) = 1.71
WJ(apple, thing) = 1.1 / 2.68 = 0.41
  min( (1.4), (1.1) ) = 1.1
  max( (1.4), (1.27+0.31+1.1) ) = 2.68
WJ(egg, thing) = 1.41 / 2.68 = 0.51
  min( (1.4+0.25), (1.1+0.31) ) = 1.41
  max( (1.4+0.25), (1.27+0.31+1.1) ) = 2.68
WJ(novel, thing) = 1.1 / 2.68 = 0.41
  min( (1.45), (1.1) ) = 1.1
  max( (1.45), (1.27+0.31+1.1) ) = 2.68
21
Similarity between words (3)
WJ(Maria, thing) = 0.31 / 2.68 = 0.09
  min( (0.31), (0.31) ) = 0.31
  max( (0.86+0.58+0.86+0.31), (1.27+0.31+1.1) ) = 2.68
WJ(book, egg) = 0.31 / 1.93 = 0.16
  min( (0.31), (0.31) ) = 0.31
  max( (1.62+0.31), (1.4+0.31) ) = 1.93
WJ(Pedro, thing) = 0 / 2.62 = 0
WJ(novel, egg) = 0 / 1.65 = 0
WJ(book, apple) = 0 / 1.87 = 0
WJ(Maria, egg) = 0.31 / 2.61 = 0.11
  min( (0.31), (0.31) ) = 0.31
  max( (0.86+0.58+0.86+0.31), (1.4+0.31) ) = 2.61
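The min/max arrangement used in these worked examples can be sketched in Python as follows. This mirrors the arithmetic shown here (the smaller of the two sums over shared contexts, divided by the larger of the two sums over all contexts); it is not necessarily the authors' general definition, and it differs from the common weighted Jaccard form that sums per-context minima and maxima. The context keys `c1` to `c5` are placeholders, since the original labels are missing.

```python
def weighted_jaccard(weights_a, weights_b):
    """Smaller shared-context weight sum over larger total weight sum."""
    common = set(weights_a) & set(weights_b)
    shared = min(sum(weights_a[c] for c in common),
                 sum(weights_b[c] for c in common))
    total = max(sum(weights_a.values()), sum(weights_b.values()))
    return shared / total if total else 0.0

# Weights taken from the worked examples; keys are placeholder context labels.
pedro = {"c1": 1.03, "c2": 0.11, "c3": 1.03}
maria = {"c1": 0.86, "c2": 0.58, "c3": 0.86, "c4": 0.31}
book = {"c5": 1.62, "c4": 0.31}
novel = {"c5": 1.45}

print(round(weighted_jaccard(pedro, maria), 2))  # 0.83
print(round(weighted_jaccard(book, novel), 2))   # 0.75
```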
22
Similarity between words (Sorting)
(0.83) Pedro Maria
(0.81) apple egg
(0.75) book novel
(0.58) thing book
(0.51) thing egg
(0.41) thing apple, novel
(0.16) book egg
(0.11) Maria egg
(0.09) Maria thing
(0.0) Pedro egg
(0.0) novel egg
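Turning the sorted pairs into a thesaurus entry per word ("for each word, identify its most similar words") could look like the Python sketch below. The function name `build_thesaurus` and the `top_n` cut-off are hypothetical, and dropping pairs with a score of zero is an assumption made for this example.

```python
from collections import defaultdict

def build_thesaurus(scored_pairs, top_n=5):
    """Map each word to its most similar words, best first."""
    neighbours = defaultdict(list)
    for score, a, b in scored_pairs:
        if score > 0:  # assumption: unrelated pairs are left out
            neighbours[a].append((score, b))
            neighbours[b].append((score, a))
    return {
        word: [other for _, other in
               sorted(items, key=lambda pair: pair[0], reverse=True)[:top_n]]
        for word, items in neighbours.items()
    }

# Scores copied from the sorted list above.
scored_pairs = [
    (0.83, "Pedro", "Maria"), (0.81, "apple", "egg"), (0.75, "book", "novel"),
    (0.58, "thing", "book"), (0.51, "thing", "egg"), (0.41, "thing", "apple"),
    (0.41, "thing", "novel"), (0.16, "book", "egg"), (0.11, "Maria", "egg"),
    (0.09, "Maria", "thing"),
]
thesaurus = build_thesaurus(scored_pairs)
print(thesaurus["thing"])
```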
23
Lists of similar words
Corpus “Procuradoria Geral da República” (P.G.R.)
juíz | {dirigente, presidente, subinspector, governador, árbitros}
diploma | {decreto, lei, artigo, convenção, regulamento}
decreto | {diploma, lei, artigo, nº, código}
regulamento | {estatuto, código, sistema, decreto, norma}
regra | {norma, princípio, regime, legislação, plano}
renda | {caução, indemnização, reintegração, multa, quota}
conceito | {noção, estatuto, regime, temática, montante}