archive-com.com » COM » C » CLRES.COM

Total: 469

  • The Synergy of NLP and Computational Lexicography Tasks
    capture this hierarchical structure by explicitly identifying that a sense is so related; this will be used to analyze the ways in which the subsenses add components of meaning. Many entries come with various forms of usage notes specifying other words (typically particles) or accompanying phrase types (e.g., adverbials of direction). These are extracted into specific features that can then be used as disambiguating devices. 2. Definition parsing and pattern matching for semantic network creation. The principal functionality of DIMAP is to parse and analyze definitions to create semantic relations between entries. This functionality is an end in itself, but also serves lexicographic needs (described in the next section on definition consistency analysis) and allows bootstrapping more lexical information, first noted by Richardson (1994). Identification of semantic links takes place in two ways in DIMAP: (1) hard-coded routines that place definitions into sentence frames, parsed into constituent structures and then analyzed to extract links, and (2) regular expression patterns, containing literals or parts of speech, added as supplemental parts of speech (defining patterns) to the parsing dictionary. While definitions generally correspond to constituents of sentences, such as NPs for noun definitions and infinitive phrases for verb definitions, there are several nuances that may provide misleading results and that make it difficult to parse them directly. Transitive verb definitions frequently contain a parenthesized expression specifying lexical preferences for the object of the verb; the parentheses (but not the contents) need to be removed, while remembering that what is contained in the parentheses should be extracted as the sense's TypicalObject. Many transitive verb senses have no object, e.g., hurry, "cause to move or proceed with haste", where a placeholder "something" should be inserted after "cause". Many definitions contain words such as "especially" that need to be treated differently. For habit, an
addictive practice, especially one of taking drugs, there are really two definitions present. The first is "an addictive practice" and the second should be transformed into "the addictive practice of taking drugs". Further, as described in the previous section, we may know the typical subject of a verb or the typical modificand of an adjective. These can be inserted into sentence frames. Thus, for the example of habit-forming, which has the definition "addictive", where we identified the typical modificand "a drug or activity", we would want to parse a sentence "This is a drug or activity that is addictive." Analyzing such a definition would then give rise to a qualia structure for the adjective habit-forming as modifying the formal quale of "drug" or "activity", as outlined in Pustejovsky (1995). After creating and parsing the sentences, the parse output (a parse tree showing the sentence constituents, with the words in the sentence as leaf nodes of the tree) is analyzed to identify some key semantic relations, most notably the hypernym(s) to be associated with the sense. Defining patterns in definitions are also significant sources of semantic relations (Ahlswede & Evens, 1988). They were also used extensively in creating EuroWordNet (Vossen, 1998). Analysis of definition parse output is a significant source of semantic relations identified in MindNet (Montemagni & Vanderwende, 1993). In addition to using these previous findings, interactions with lexicographers at Macquarie and Oxford are assisting with further identification of appropriate patterns. [note 2] For noun, verb, and preposition definitions, we seek the leading NP and its head noun(s), the main verb(s), and the final preposition (or verb, if none) as the hypernym. For nouns, we examine whether the head of the first NP is "empty" (e.g., the phrase "a kind of"), where we would then look for the head noun of the following PP as the hypernym(s). We search the parse tree for "manner" PPs, extracting the adjective modifying "manner" to provide a Manner relation (usually for a verb definition). In
examining the parse tree, we look particularly for prepositional phrases and whether the preposition has an associated defining pattern as a part of speech. For "of", we have associated several such patterns, including (1) made rep01 det rep0n adj noun sr has constituent; (2) adj nbrth noun rep01 det 0 rep0n adj noun sr mem of; (3) purpose adj noun sr purpose, where the tilde corresponds to the target preposition (i.e., "of"), specific words correspond to literals to be matched, and parts of speech correspond to matching any word with that part of speech. The rep01, rep0n, and rep1n correspond to matching 0 or 1, 0 to n, or 1 to n occurrences of the given part of speech. The final sr identifies the type of relation that is created. All material to the right of the tilde is extracted as the link from the semantic relation, except that parts of speech with a following 0 are not included. The result of the definition parsing and pattern matching is an augmentation of the dictionaries with semantic relation links. The resultant semantic network, particularly when viewed through its hypernymic links, is in effect an ontology; the other types of links provide slots and values for various semantic components of the individual senses. A considerable number of synonym links are also established, so that the dictionaries are similar to WordNet as well. In parsing the entire Macquarie dictionary (130,000 entries, 281,000 definitions, taking 17 hours), about 223,000 semantic relations were identified automatically. [note 3] In a limited assessment of the hypernyms, lexicographers found an agreement rate of about 75 percent for the nouns and verbs. The Macquarie lexical database, including a thesaurus, was successfully used in the TREC-9 question-answering track (Litkowski, 2000b). For "where" questions, the database was able to use location components in definitions in judging whether there was a match between a proposed answer's definition and the specifications in the question. For questions involving size determinations, in potential answers containing
numbers modifying some noun, it was possible to examine the hypernym of the noun to determine if it is a unit. For questions like "which city", it was possible to combine the dictionary and the thesaurus data, since Macquarie provides direct links to its thesaurus from individual definitions. Thus, for example, where Shanghai has "municipality" as its hypernym, it was possible to compare the thesaurus categories of "municipality" and "city" and make the judgment that Shanghai was a viable answer. 3. Definition consistency analysis. An important lexicographic task is that of maintaining consistency throughout a single dictionary and across dictionaries. We describe three such tasks: (1) maintaining consistency among several dictionaries from a single publisher, (2) examining the integrity of a thesaurus, and (3) populating dictionary entries with a superset of information from definitions with similar information. We also identify how these tasks have benefits for NLP tasks. 3.1 Mapping multiple dictionaries. In Senseval (Kilgarriff & Rosenzweig, 2000), the issue of using dictionary definitions arose in two contexts: one concerned with mapping between two dictionary sources (WordNet and the Hector dictionary provided for Senseval), and the other with providing a baseline for disambiguation using definition text and example sentences. The mapping was considered not altogether satisfactory and may have contributed to some reductions in performance. The baseline was important as a mechanism for using a straightforward statistical analysis for disambiguation. The baseline technique followed Lesk (1986) in using dictionary definitions and examples as the basis for computing an inverse document frequency and matching surrounding context in the Senseval sentences. Litkowski (1999) reported using a variation of this technique for mapping definitions between WordNet and Hector, but reported a relatively low success rate (36.1%) when measured against lexicographer mappings. The issue of mapping senses is important to dictionary
publishers who may have many dictionaries, such as thumbnail, children's, learner's, collegiate, and unabridged. Macquarie publishes 15 such dictionaries, in principle all derived from the unabridged dictionary. However, as reported to us, these dictionaries have been developed each using slightly different editorial policies (based on the type of dictionary) over a period of time when the unabridged dictionary underwent three major editions. The lexicographic task was to map each of these dictionaries into the unabridged dictionary as a step for maintaining definitional consistency. Given the fact that the dictionaries had at least been published by one publisher, it was expected that the mapping problem would not be as difficult as that of mapping between dictionaries of different publishers. The Lesk-style word overlap reported in Litkowski (1999) was not quite satisfactory and was modified to include labels (such as register, geographic coverage, and subject domain) attached to definitions. In general, the technique used content words only (i.e., using a stop list to eliminate function words) as a percentage of the definitions in the unabridged dictionary. The method also used certain syntactic information, e.g., restriction to same part of speech and, within part of speech, to senses having identical syntactic properties (such as verb transitivity). After experimenting with various options (e.g., with and without a stop list), extensive samples for several of the dictionaries were examined by lexicographers to assess the success of the mappings. The agreement rates for the several dictionaries ranged from 90 to 95 percent. Many of the failures could be attributed to the presence of similar wording in several definitions of an entry, and hence indicative of defining issues that require the lexicographers' attention. Many definitions were not mapped, indicating the presence of completely different wording and perhaps that a smaller dictionary had senses not included in the unabridged dictionary. The agreement
rates were judged satisfactory and full mappings were undertaken for the 15 smaller dictionaries. All dictionaries were uploaded into DIMAP (approximately 20 hours per dictionary) and the mappings performed (a few hours per dictionary). The mapping results are now being used to link the smaller dictionaries to the unabridged dictionary. Corrections to the mapping will be recorded. This will enable the results to be used as a gold standard, which can then be used to examine other mapping techniques, such as the componential analysis described in Litkowski (1999). Not only will such techniques improve the quality of the mappings, but they would then be available for application across dictionaries and for use in more general word-sense disambiguation. Further, the linkages between the smaller and the unabridged dictionaries are available for further analyses, as described in the next two sections. 3.2 Analysis of primitives in a thesaurus category. As indicated above, the Macquarie dictionary is unique in having sense-by-sense linkages to thesaurus entries. [note 4] Such linkages have been used directly in question answering and would likely have considerable benefit in many other NLP applications, since, when combined with the DIMAP enhancements of adding hypernym links, the resultant lexical database has many structural similarities to WordNet. As described in Fellbaum (1998), there are many potential applications for such lexical databases. The thesaurus links also provide an opportunity for serving additional lexicographic tasks, particularly in improving consistency within both the dictionary and the thesaurus. The Roget-style thesaurus is organized into more than 800 sections (e.g., 038 Approval), with each section broken down into paragraphs (perhaps several for each part of speech) and subparagraphs (separated by semicolons), with one word highlighted as the key concept of the semicolon-delimited group. Kilgarriff & Yallop (2000) noted that the semicolon-delimited groups are similar to WordNet synsets, with the key
concept being either synonymous or acting as a hypernym for the other words in the group. They noted that the paragraphs and subparagraphs are frequently related by simple linguistic operations, such as morphological derivations or scope, but also other kinds of semantic relations. For Nature, the main category of terms for nature was surrounded by noun groups for "balance of nature", "study of nature", and "person who studies nature", and adjective groups pertaining to humans, animals, and humans (derogatorily). They concluded that the compilers had made use of implicit categorization schemes, possibly with inconsistencies, but not made explicit. The linkages to the unabridged dictionary make it possible to examine the thesaurus groups in a more principled way. With DIMAP, it is possible to create a subdictionary consisting only of definitions linked to a single thesaurus group. While this is convenient for visually examining a set of definitions, the functionality of DIMAP allows for a more rigorous analysis. DIMAP can analyze the dictionary digraph, as established by the hypernym links, for a set of definitions. This analysis identifies non-primitive words (defined only by words within the group, but not themselves used in defining words in the group), definitional cycles (leading to identification of strong components within the set), and primitive definitions (those used in the formation of the core concepts in the group). See Litkowski (1978) and Litkowski (1980) for details of modeling the semantic structure of a dictionary with labelled directed graphs. For verbs of approval, we are able to eliminate as non-primitive such dictionary entries as approbate, advocate, and hold a brief for. We see that approve and sanction are in the same strong component (defining one another and hence not adding to the meaning of either). We see that verbs such as treat, accord, take, think, and receive are used to define the core concepts of approve. We also notice that the verbs confirm and ratify are used synonymously in defining
the core concepts of approve, but are not present in the thesaurus category. These few observations, produced automatically after parsing the definitions to identify hypernyms, provide the beginning of a rationalization of the thesaurus category. Examination of the other semantic relations produced by the definition parsing provides further information that can be used to identify meaning components in the concept "approval". This process provides a firmer basis on which to (1) group the synonyms in subparagraphs, (2) make the relationships between subparagraphs and paragraphs explicit, and (3) make changes to the underlying definitions for this group of words, so that they are more consistent and are phrased in the simplest terms possible to highlight their meaning. At a higher level, this type of analysis provides a more consistent dictionary from which a more complete and more accurate semantic network can be created and used for NLP applications. We are currently working to integrate these methods into the analysis of the thesaurus. 3.3 Automatic template and slot generation. Most of the preceding discussion focused on dictionary data as used for syntactic and semantic analysis, i.e., word-sense disambiguation. In general, this is what Allen (1995) terms the syntactic pattern. However, just as important is the representation of meaning, the logical form that is to be used in creating a meaning of the text within which a particular sense appears. Several lexicographic tasks move in this direction. NODE frequently indicates that a verb requires an adverbial of direction. Such an adverbial can be expressed by an adverb or an adverbial prepositional phrase, overtly ("in the opposite direction") or indirectly ("to Montreal"). Many definitions of verbs so labelled contain the phrase "in a particular direction", e.g., herd, "to move in a particular direction". However, many definitions containing similar phrases (e.g., hand, "hold the hand of someone in order to help them move in the specified direction") do not. Identifying such
instances assists in bringing greater consistency. In any event, this suggests that adverbs containing such phrases can usefully be labelled with a feature "direction". Most importantly, such definitions legitimize the creation of a slot labelled "direction". For definitions that indicate a particular direction, the value of the slot has been lexicalized. More generally, the appearance in a definition of a word like "specified", "particular", or "certain" indicates the presence of a slot that must be filled by the context. Thus, for hail, "acclaim enthusiastically as being a specified thing", the object of the verb must be characterized in some way (here, either through an "as" prepositional phrase or modified by a relative clause). This can even occur in definitions of nouns, such as half-life, "the time taken for the radioactivity of a specified isotope to fall to half its original value", which indicates not only that an isotope must be present in the context, but also that "isotope" can have a property "half-life" (something not indicated in the definition of isotope). More commonly, a slot is predicted by the words "someone" and "something". Frequently, these occur in a verb definition in parentheses and serve as general placeholders for the object of the verb (a very general lexical preference). When they occur in other positions, they provide both semantic and syntactic information. For example, halo, "a circle or ring of something resembling a halo", creates a slot for the "something" and attaches properties to the slot that it has "shape" ("circle" or "ring") and a relation "resembling halo". In FrameNet (Baker et al., 1998), with its finer granularity of semantic roles (cf. Fillmore, 1968), the preceding considerations provide some methods for automatic generation of frame elements and frame element groups, with some indication of their required syntactic and semantic contexts. Further analysis of definitions can lead to an even richer identification of frame elements. At SIGLEX99, Fillmore noted that an utterance implicitly contained many
nested frames. Using the example of approval, "the act of approving", cited earlier, the appearance of the word in context implicitly requires filling a slot "approver" and "approval object" to instantiate an "approve" event. [note 5] These methods bear some similarity to those described in Collier (1998) for automatic template creation, but switch the relative importance of corpus and dictionary evidence. Thus, improving definitional consistency will contribute greatly to the goal of automatic template generation. Another lexicographic task will benefit the characterization of templates. Given the general interest in collocations, examination of adjective-noun and noun-noun collocations present in a dictionary will provide additional co-compositional characterizations (Pustejovsky, 1995). In MindNet (Richardson, 1997), such collocations present in both definitions and examples were an important source (i.e., a corpus) of statistical associations established in the dictionary entries. We have just begun efforts to extract such collocations for
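
The definition preprocessing steps mentioned in this excerpt (dropping parentheses while recording a TypicalObject, inserting a placeholder "something" after "cause", and splitting "especially" definitions) could be sketched roughly as follows. This is only an illustrative Python sketch under stated assumptions, not DIMAP's code: the function names are hypothetical, the "pull (a cart) along" definition is an invented example, and the "especially" rule only handles the "one of ..." shape shown for habit.

```python
import re
from typing import List, Optional, Tuple


def extract_typical_object(definition: str) -> Tuple[str, Optional[str]]:
    """Remove the parentheses (not their contents) and record the contents.

    Returns the cleaned definition and the text to store as TypicalObject
    (None when there is no parenthesized expression).
    """
    match = re.search(r"\(([^()]*)\)", definition)
    if match is None:
        return definition, None
    cleaned = definition[:match.start()] + match.group(1) + definition[match.end():]
    return cleaned, match.group(1).strip()


def add_placeholder_object(definition: str) -> str:
    """Insert the placeholder 'something' after a leading 'cause'."""
    prefix = "cause to "
    if definition.startswith(prefix):
        return "cause something to " + definition[len(prefix):]
    return definition


def split_especially(definition: str) -> List[str]:
    """Split 'X, especially one of Y' into 'X' and 'the X of Y' (assumed rule)."""
    parts = re.split(r",?\s+especially\s+", definition, maxsplit=1)
    if len(parts) == 1:
        return parts
    head, tail = parts
    if tail.startswith("one "):
        # Turn the leading indefinite article into "the" and reuse the head phrase.
        general = re.sub(r"^an?\s+", "the ", head, count=1)
        tail = general + tail[len("one"):]
    return [head, tail]


print(extract_typical_object("pull (a cart) along"))
print(add_placeholder_object("cause to move or proceed with haste"))
print(split_especially("an addictive practice, especially one of taking drugs"))
```

The outputs of such steps would then be placed into sentence frames before parsing; that part depends on the parser and is not sketched here.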

    Original URL path: http://www.clres.com/synergy.html (2016-02-11)


  • Dictionary Parsing Project (CL Research)
    showing the headword and all semantic relations generated from each sense, (2) a file containing the WordNet analysis results, (3) definitions for which a parse tree was not generated (currently less than 100 for the entire W2), (4) parses which were not completely clean (about 30 percent), and (5) definitions containing words unknown to the parsing dictionary (about 10 percent, and almost always resulting in parses that were not clean). The last three files are used primarily as diagnostics to identify areas for improving the overall system. Semantic Relations. We identify several semantic relations (semrels) during the parsing, principally the hypernymic relation, which establishes the basic ontological hierarchy for nouns and verbs. We also identify synonymic, meronymic, pertainymic, and several other relations. With DIMAP, it is now possible for the user to define and characterize additional relations by adding defining patterns to the lexical entries for specific words in the separate parsing dictionary. In addition, we compare our results and assess whether these are consistent with WordNet for those relations that are common. See "next steps" below for a description of the process for further development of semantic relations and links to inventories of semantic relations that are being investigated. The rules for recognizing the individual relations are as follows. Hypernyms: For noun definitions, the hypernym is the head noun of the first noun phrase in the definition, unless the head is "empty" (currently kind, sort, type, species, suborder, one) and is followed by the word "of", in which case we take the hypernym as the head of the NP following "of". For verb definitions, the hypernym is the first verb. In both cases, if the head is a conjunctive phrase, all heads are taken as hypernyms. Synonyms: A definition consisting of a single word is taken as a synonym. In a noun definition such as "a horse", "horse" would be a hypernym, not a synonym. This will be extended to include phrases that are also headwords in the
dictionary. Meronyms: For noun definitions whose head is "part" and is followed by "of", an is-part-of relation is created between the headword and the head noun of the following NP. Pertainyms: For adjective definitions beginning with "of, relating to, or pertaining to", a pertains-to relationship is created between the headword and the head of the first NP. Other semantic relations: Additional semantic relations are defined by keying off prepositions identified as a result of parsing. These are entered in the parsing dictionary as dpats (defining patterns) associated with the particular preposition (several may be defined for each preposition). For example, the entry for "for" has the pattern "adj noun sr purpose", giving rise to a purpose relation between the headword and the adj-noun phrase that has been matched. A defining pattern consists of the symbol for the word itself, literals, or parts of speech, and will be extended to look for whole constituents (such as an AP or a NP) and to allow for 0 or
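
The noun-definition rules listed in this excerpt (empty heads, single-word synonyms, and the "part of" meronym rule) might be expressed along these lines. This is a rough Python sketch, assuming the parser has already supplied the head of the first NP and of any NP following "of"; the `ParsedNounDef` record and `noun_relation` function are hypothetical names introduced for illustration, and the example definitions in the usage lines are invented.

```python
from typing import NamedTuple, Optional, Tuple

# Heads currently treated as "empty" according to the rules above.
EMPTY_HEADS = {"kind", "sort", "type", "species", "suborder", "one"}


class ParsedNounDef(NamedTuple):
    """Hypothetical summary of a parsed noun definition (not a DIMAP structure)."""
    first_np_head: str                        # head noun of the first NP
    followed_by_of: bool = False              # is the head followed by "of"?
    following_np_head: Optional[str] = None   # head of the NP after "of"
    is_single_bare_word: bool = False         # definition is one word, no article


def noun_relation(d: ParsedNounDef) -> Tuple[str, str]:
    """Return (relation, target word) for a noun definition."""
    if d.is_single_bare_word:
        return ("synonym", d.first_np_head)
    if d.followed_by_of and d.following_np_head is not None:
        if d.first_np_head == "part":
            return ("is-part-of", d.following_np_head)
        if d.first_np_head in EMPTY_HEADS:
            # Skip the empty head and take the head of the NP after "of".
            return ("hypernym", d.following_np_head)
    return ("hypernym", d.first_np_head)


# "a kind of boat": empty head, so the hypernym is taken from the following NP.
print(noun_relation(ParsedNounDef("kind", True, "boat")))
# "a horse": an article plus noun is a hypernym, not a synonym.
print(noun_relation(ParsedNounDef("horse")))
# "part of a wheel": meronym rule.
print(noun_relation(ParsedNounDef("part", True, "wheel")))
```

Conjunctive heads, verb definitions, and pertainyms would need further fields on the record and are left out of this sketch.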

    Original URL path: http://www.clres.com/dpp.html (2016-02-11)

  • Comparison of Lexical Resources - WR.wpd
    mostly prepositions, pronouns, and conjunctions. The Hector senses selected in the word overlap analysis contained about 960 words; all Hector senses contained 1878 words. We performed a strict word overlap analysis (with and without a stop list) between the definitions in WordNet and the Hector senses; that is, we did not attempt to identify root forms of inflected words. We took each word in a WordNet sense and determined whether it appeared in a Hector sense; we selected a Hector sense based on the highest percentage of words over all Hector senses. An empty selection was made if none of the words in the WordNet sense appeared in any Hector sense; only content words were considered when the stop list was used. For example, for bet, WordNet sense 2 ("stake money on the outcome of an issue") mapped into Hector sense 4 ("of a person to risk a sum of money or property in this way"). In this case, there was an overlap on two words (money, of) in the Hector definition (0.13 of its 15 words) without the stop list. When the stop list was invoked, there was an overlap of only one word (money, 0.07 of the Hector definition). In this case, the lexicographer had made three assignments (Hector senses 2, 3, and 4); our scoring method treated this as only 1 out of 3 correct, not using the relaxed method employed in Senseval of treating this as completely correct. Without the stop list, our selections matched the lexicographer's in 28 of 86 cases (32.6%); using the stop list, we were successful in 31 of 86 cases (36.1%). The improvement arising when the stop list was used is deceptive, where 8 cases were due to empty assignments, so that only 23 cases (26.7%) were due to matching content words. Overall, only 41 content words were involved in these 23 successes when the stop list was used, an average of 1.8 content words. To summarize the word overlap analysis: (1) despite a richer set of definitions in Hector, 9 of 66 WordNet senses (13.6%) could not be assigned; (2) despite the greater detail in Hector senses compared to WordNet senses, with 2.8
times as many words, only 1.8 content words participated in the assignments; and (3) therefore, the defining vocabulary between these two definition sets seems to be somewhat divergent. Although it might appear as if the word overlap analysis does not perform well, this is not the case. The analysis provides a broad overview of the definition comparison process between two definition sets and frames a deeper analysis of the differences. Moreover, it appears that the accuracy of a gold standard mapping is not crucially important. The quality of the mapping may help frame the subsequent analysis more precisely, but it seems sufficient that any reasonable mapping will suffice. This will be discussed further after presenting the results of the componential analysis of the definitions. Meaning-Full Analysis of Definitions. The deeper analysis of the mapping between two definition sets relies primarily on two major steps: (1) parsing definitions and using defining patterns to identify semrels present in the definitions and (2) relaxing values to these relations by allowing "synonymic" substitution (using WordNet). Thus, for example, if we identify hypernyms or instruments from parsing a definition, we would say that the definitions are "equal" not just if the hypernym or instrument is the same word, but also if the hypernyms or instruments are members of the same synset. This approach is based on the finding (Litkowski, 1978) that a dictionary induces a semantic network where nodes represent "concepts" that may be lexicalized and verbalized in more than one way. This finding implies, in general, the absence of true synonyms, and instead the kind of concept embodied in WordNet synsets (with several lexical items and phraseologies). A similar approach, parsing definitions and relaxing semrel values, was followed in Dolan (1994) for clustering related senses within a single dictionary. The ideal toward which this approach strives is a complete identification of the meaning components included in a definition. The meaning
components can include syntactic features and characteristics (including subcategorization patterns), semantic components realized through identification of semrels, selectional restrictions, and collocational specifications. The first stage of the analysis parses the definitions (CL Research, 1999b; Litkowski, to appear) and uses the parse results to extract (via defining patterns) semrels. Since definitions have many idiosyncrasies (that do not follow ordinary text), an important first step in this stage is preprocessing the definition text to put it into a sentence frame that facilitates the extraction of semrels. [note 2] The extraction of semrels examines the parse results, i.e., a tree whose intermediate nodes represent non-terminals and whose leaves represent the lexical items that comprise the definitions, where any node may also include annotations such as characterizations of number and tense. For all noun or verb definitions, this includes identification of the head noun (with recognition of "empty" heads) or verb; for verbs, we signal whether the definition contained any selectional restrictions (that is, particular parenthesized expressions) for the subject and object. We then examine prepositional phrases in the definition and determine whether we have a defining pattern for the preposition which we can use as indicative of a particular semrel. We also identify adverbs in the parse tree and look these up in WordNet to identify an adjective synset from which they are derived (if one is given). The defining patterns are actually part of the dictionary used by the parser. That is, we do not have to develop specific routines to look for specific patterns. A defining pattern is a regular expression that articulates a syntactic pattern to be matched. Thus, to recognize a manner semrel, we have the following entry for "in": in dpat rep01 det 0 adj manner 0 sr manner. This allows us to recognize "in" as possibly giving rise to a manner component, where we recognize "in", the tilde, which allows us to specify particular
elements before the "in" as well, with a noun phrase that consists of 0 or 1 determiner, an adjective, and the literal "manner". The 0 after the determiner and the literal indicates that these words are not copied into the value for a manner role, so that the value to the manner semrel becomes only the adjective that is recognized. The second stage of the analysis uses the populated lexical database to compare senses and make the selections. This process follows the general methodology used in Senseval (Litkowski, to appear). Specifically, in the definition comparison, we first examine exclusion criteria to rule out specific mappings. These criteria include syntactic properties (e.g., a verb sense that is only transitive cannot map into one that is only intransitive) and collocational properties (e.g., a sense that is used with a particle cannot map into one that uses a different particle). At the present time, these are used only minimally. We next score each viable sense based on its semrels. We increment the score if the senses have a common hypernym or if a sense's hypernyms belong to the same synset as the other sense's hypernyms. If a particular sense contains a large number of synonyms (that is, no differentiae on the hypernym) and they overlap considerably in the synsets they evoke, the score can be increased substantially. Currently, we add 5 points for each match. [note 3] We increment the score based on common semrels. In this initial implementation, we have defining patterns (usually quite minimal) for recognizing instrument, means, location, purpose, source, manner, has-constituents, has-members, is-part-of, locale, and goal. [note 4] We increment the score by 2 points when we have a common semrel and then by another 5 points when the value is identical or in the same synset. After all possible increments to the scores have been made, we then select the sense(s) with the highest score. Finally, we compare our selection with that of the gold standard to assess our mapping over all senses. Another way in which our methodology
follows the Senseval process is that it proceeds incrementally. Thus, it is not necessary to have a "final" perfect parse and mapping routine. We can make continual refinements at any stage of the process and examine the overall effect. As in Senseval, we may make changes to deal with a particular phenomenon with the result that overall performance declines, but with a sounder basis for making subsequent improvements. Results of Componential Analysis. The gold standard analysis involves mapping 66 WordNet senses with 348 words into 102 Hector senses with 1878 words. Using the method described above, we obtained 35 out of 86 correct mappings (40.7%), a slight improvement over the 31 correct assignments using the stop-list word overlap technique. However, as mentioned above, the stop-list technique had achieved 8 of its successes by matching null assignments. Considered on this basis, it seems that the componential analysis technique provides substantial improvement. In addition, our technique "erred" on 4 cases by making assignments where none were made by the lexicographer. We suggest that these cases do contain some common elements of meaning and may conceivably not be construed as errors. Perhaps more importantly, the componential analysis method exploits considerably more information than the word overlap methods. Whereas the stop-list word overlap mapping was based on only 41 content words, the componential approach in the selected mappings had 228 hits in developing its scores, with only a small number of defining patterns. Comparison of Dictionaries. We next examined the nature of the interrelations between pairs of dictionaries without use of a gold standard to assess the process of mapping. For this purpose, we mapped in both directions between the pairs WordNet-Hector, W3-OALD, and W3-AHD. We examine Dorr's lexical knowledge base for the implications it may have in the mapping process. Neither WordNet nor Hector are properly viewed as dictionaries, since there was no intention to publish them as
such. WordNet glosses are generally smaller (5.3 words per sense) compared to Hector (18.4 words per sense), which contains many words specifying selectional restrictions on the subject and object of the verbs. Hector was used primarily for a large-scale sense tagging project. The three formal dictionaries were subject to rigorous publishing and style standards. The average number of words per sense was 8.7 (OALD), 7.1 (AHD), and 9.9 (W3), with an average of 3.4, 6.2, and 12.0 senses per word. Each table shows the average number of senses being mapped, the average number of assignments in the target dictionary, the average number of senses for which no assignment could be made, the average number of multiple assignments per word, and the average score of the assignments that were made. The mapping from WordNet to Hector had relatively few empty mappings (senses for which it was not possible to make an assignment). These are the cases where it appears that the dictionaries do not overlap, and they thus provide a tentative indication of where two dictionaries may have different coverage. The cases of multiple assignments indicate the degree of ambiguity in the mapping. The averages in both directions between Hector and WordNet were dominated by the inability to obtain good discrimination for the word seize. Thus, this method identifies individual words where the discriminative ability needs to be further refined.

WordNet - Hector   Senses   Assignments   Empty   Multiple   Scores
WN - Hector           3.7           4.7     0.6        1.7     11.9
Hector - WN           5.7           6.4     1.4        2.2     11.3

These points are further emphasized in the mapping between W3 and OALD, where the disparity between the empty and multiple assignments indicates that we are mapping between quite disparate dictionaries. This tends to be the case not only for the entire set of words, but is also evident for individual words where there is a considerable disparity in the number of senses, which then dominate the overall disparity. Thus, for example, W3 has 41 definitions for float, while OALD has 10. We tend
to be unable to find the specific sense in going from W3 to OALD, because it is likely that we have many more specific definitions that are not present. In the other direction, we are likely to have considerable ambiguity and multiple assignments.

W3 - OALD   Senses   Assignments   Empty   Multiple   Scores
W3 - OALD     12.0           7.8     6.0        1.8      9.9
OALD - W3      3.4           6.0     0.7        3.2      8.6

Between W3 and AHD, there is less overall disparity between the definition sets, although, since W3 is unabridged, we still have a relatively high number of senses in W3 that do not appear to be present in AHD. Finally, it should be noted that the scores for the published dictionaries tend to be a little lower than for WordNet and Hector. This reflects the likelihood that we have not extracted as much information as we did in parsing and analyzing the definition sets used in Senseval.

W3 - AHD   Senses   Assignments   Empty   Multiple   Scores
W3 - AHD     12.0          11.5     4.0        3.6      9.0
AHD - W3      6.2           9.1     1.2        4.1      9.1

We next considered Dorr's lexical database. We first transformed her theta grids into syntactic specifications (transitive or intransitive) and identification of semrels (e.g., where she identified an instr component, we added such a semrel to the DIMAP sense). We were able to identify a mapping from WordNet to her senses for two words (float and shake) for which Dorr has several entries. However, since she has considerably more semantic components than we are currently able to recognize, we did not pursue this avenue any further at this time. More important than just mapping between two words, Dorr's data indicates the possibility of further exploitation of a richer set of semantic components. Specifically, as reported in Olsen et al. (1998), in describing procedures for automatically acquiring thematic grids for Mandarin Chinese, it was noted that verbs that incorporate thematic elements in their meaning would not allow that element to appear in the complement structure. Thus, by using Dorr's thematic grids when verbs are parsed in definitions, it is possible to identify
where particular semantic components are lexicalized and which others are transmitted through to the thematic grid (complement or subcategorization pattern) for the definiendum. The transmission of semantic components to the thematic grid is also reflected overtly in many definitions. For example, shake has one definition, "to bring to a specified condition by or as if by repeated quick jerky movements." We would thus expect that the thematic grid for this definition should include a goal. And indeed, Dorr's database has two senses which require a goal as part of their thematic grid. Similarly, for many definitions in the sample set, we identified a source defining pattern based on the word from; frequently, the object of the preposition was the word source itself, indicating that the subcategorization properties of the definiendum should include a source component.

Discussion

While the improvement in mapping by using the componential analysis technique over the word overlap methods is modest, we consider these results quite significant in view of the very small number of defining patterns we have implemented. Most of the improvement stems from the word substitution principle described earlier (as evidenced by the preponderance of 5-point scores). This technique also provides a mechanism for bringing back the stop words, viz., the prepositions, which are the carriers of information about semrels (the 2-point scores). The more general conclusion from the word substitution is that the success arises from no longer considering a definition in isolation. The proper context for a word and its definitions consists not just of the words that make up the definition, but also the total semantic network represented by the dictionary. We have achieved our results by exploiting only a small part of that network. We have moved only a few steps into that network beyond the individual words and their definitions. We would expect that further expansion, first by the addition of further and improved semrel defining
patterns, and second through the identification of more primitive semantic components, will add considerably to our ability to map between lexical resources. We also expect improvements from consideration of other techniques, such as attempts at ontology alignment (Hovy 1998). Although the definition analysis provided here was performed on definitions within a single language, the various meaning components correspond to those used in an Interlingua. The use of the extinction method (developed in order to characterize verbs in another language, Chinese) can fruitfully be applied here as well. Two further observations about this process can be made. The first is that reliance on a well-established semantic network such as WordNet is not necessary. The componential analysis method relies on the local neighborhood of words in the definitions, not on the completeness of the network. Indeed, the network itself can be bootstrapped based on the parsing results. The method can work with any semantic network or ontology and may be used to refine or flesh out the network or ontology. The second observation is that it is not necessary to have a well-established gold standard. Any mapping will do. All that is necessary is for any investigator (lexicographer or not) to create a judgmental mapping. The methods employed here can then quantify this mapping based on a word overlap analysis and then further examine it based on the componential analysis. The componential analysis method can then be used to examine underlying subtleties and nuances in the definitions, which a lexicographer or analyst can then examine in further detail to assess the mapping.

Future Work

This work has marked the first time that all the necessary infrastructure has been combined in a rudimentary form. Because of
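The stop-list word overlap baseline and the per-direction summary columns (senses, assignments, empty, multiple) discussed in this excerpt can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the stop list, the tokenizer, the function names, and the rule of assigning every target sense that ties for the highest overlap. It is not the DIMAP implementation and it does not attempt the componential analysis scoring.

```python
import re

# Hypothetical stop list; the list actually used in the paper is not given here.
STOP_WORDS = {"a", "an", "the", "to", "of", "or", "and", "in", "with", "by", "as", "if"}


def content_words(definition):
    """Lowercase the definition, keep alphabetic tokens, and drop stop words."""
    return {w for w in re.findall(r"[a-z]+", definition.lower()) if w not in STOP_WORDS}


def map_senses(source, target):
    """Map each source sense to the target sense(s) with the highest word overlap.

    source and target are {sense_id: definition}. The result is
    {source_id: [target_ids]}; an empty list means no assignment was possible.
    """
    target_words = {tid: content_words(d) for tid, d in target.items()}
    mapping = {}
    for sid, definition in source.items():
        words = content_words(definition)
        scores = {tid: len(words & tw) for tid, tw in target_words.items()}
        best = max(scores.values(), default=0)
        # Keep every target sense that ties for the best non-zero overlap.
        mapping[sid] = sorted(t for t, s in scores.items() if s == best) if best > 0 else []
    return mapping


def summarize(mapping):
    """Count senses, total assignments, empty mappings, and multiple mappings."""
    return {
        "senses": len(mapping),
        "assignments": sum(len(t) for t in mapping.values()),
        "empty": sum(1 for t in mapping.values() if not t),
        "multiple": sum(1 for t in mapping.values() if len(t) > 1),
    }


# Invented toy definitions, not taken from any of the dictionaries discussed.
source = {"s1": "to move quickly", "s2": "cause to tremble", "s3": "to float on water"}
target = {"t1": "move or proceed with haste quickly",
          "t2": "tremble or vibrate",
          "t3": "shake and tremble"}
mapping = map_senses(source, target)
print(mapping)
print(summarize(mapping))
```

In this toy run one sense maps uniquely, one ties between two target senses (a multiple assignment), and one finds no overlap at all (an empty mapping), which mirrors the three situations the tables distinguish.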

    Original URL path: http://www.clres.com/Comparison_of_Lexical_Resources.html (2016-02-11)
    Open archived version from archive

  • Analysis of Subordinating Conjunctions
    the digraph. Lexicographic Rule 3 (LR3): A direct link xRy can be eliminated from the digraph as a cyclical synonymic link if the link is synonymic, the direct link yRx also exists, and y has only the single link yRx in the digraph or additional links $yRz_i$ where there is a synonymic path from x to $z_i$. In the latter case, the links $yRz_i$ may also be removed. Removing the direct cyclical link is equivalent to removing the sense x in the set of definitions of y. However, making this elimination does not remove any information from the digraph that remains at this point. If y has other uses $z_i$ in the digraph, the synonymic path from x to any of these uses is preserved, even when the links between y and these $z_i$ are eliminated. Eliminating the link xRy also has an effect on the digraph, making the senses of y (now minus x) more primitive than x. Only one such link, between "if" and "in case", was removed from the subordinating conjunction digraph through application of the first part of the rule. The link between "if" and "on condition that" was eliminated through application of the second part of the rule, along with the link between "on condition that" and "when". The second case of direct cyclical links is where one of the links is non-synonymic. We might describe this situation as $xR_{NS}y$ and $yR_{S}x$. The non-synonymic link indicates that x appears with some differentiae. In this situation, the carriage of senses from y to x would result in x being defined by itself plus some differentiae. In general, this is an unacceptable lexicographic practice, so we want to eliminate it. There may be situations where the differentiae are inherent in the meaning of x, so that carriage of the differentiae would actually add nothing but redundant information and would be acceptable and indicative of a true cycle in a dictionary. We will consider this situation in more detail below. To eliminate this situation, we can eliminate the link $xR_{NS}y$ so that $yR_{S}x$ will not carry the unacceptable sense back to x. This is the first instance where we have modified
the nature of the link R between two nodes, from "is used to define" to "is used non-synonymically to define" and "is used synonymically to define." We will consider other modifications more systematically below. Lexicographic Rule 4 (LR4): A direct link $xR_{NS}y$ can be eliminated from the digraph as a cyclical non-synonymic link if the link is non-synonymic, the direct link $yR_{S}x$ also exists and is synonymic, and y has only the single link yRx in the digraph. Another way of viewing this situation is that what we have done is to partition the definitions of y into two sets, one containing the single sense containing x and the other containing the remaining definitions of y. We treat this as splitting the node y in the digraph into the nodes $y_1$, with the single sense containing x, and $y_2$, with the remaining senses. Since y has only one outgoing link in the digraph, we have to consider what will happen when we split the node. Since we have precluded the non-synonymic link from carrying the definition at $y_1$ back to x, we have made $y_1$ a leaf in the digraph with no uses. We now have to consider whether making this elimination removes any information from the digraph that remains at this point. The use involving x has clearly involved the use of differentiae, and so, intuitively, is probably not primitive. Eliminating the link $xR_{NS}y$ also has an effect on the digraph, making the senses of y (now minus the definition involving x) more primitive than x. Only one such link, between "if" and "supposing", was removed from the subordinating conjunction digraph. At this point, we have made several reductions in the size of the digraph, in the first several steps eliminating nodes and in the last few steps focusing on eliminating links. We would like next to examine the overall effect and character of the digraph that remains. To do this, we make use of the most important characteristic of digraphs for our purpose, namely, that every digraph has a basis set, that is, a set of points from which all nodes in the digraph are reachable. This
basis set is the set of primitives from which all other definitions are derived. A crucial notion in the determination of the basis set is that of a strong component, a set of nodes that are mutually reachable by at least one path of the digraph. A strong component is an equivalence class, based here on the relation "is used to define." We can make use of an algorithm from digraph theory for partitioning the nodes of the digraph into its strong components (that is, equivalence classes) in order to view the superstructure of the digraph (which is itself a digraph). In general, a digraph created in the manner described here, after the application of the several lexicographic and reduction rules, will not be a single equivalence class. As a result, examination of the superstructure will identify sets of nodes that are relatively more primitive than other sets. Identification of strong components will, in general, provide sets of nodes that are leaves in the superstructure, to which we can apply the two reduction rules. Moreover, the superstructure has no cycles and provides a consistent topological sorting that will focus further analysis on the primitives, enabling us to put aside extraneous information. The next reduction rule identifies leaves in the superstructure that can be put aside. Reduction Rule 3 (RR3): Let S be a strong component of a digraph and let T be the set of all nodes that are defined with members of S and that are not members of S. If T is a subset of E, and there is at least one node not in S and not in T used to define a node in S, then S and all its definitions can be placed in E. What this says is that (1) the members of S are mutually reachable, that is, there are cycles in the definitional paths between members of S; (2) the members of S are used as superconcepts only for members of S and for subordinating conjunctions that have already been eliminated as non-primitive; and (3) there is at least one subordinating conjunction that is not in S or T that is more primitive than the
subordinating conjunctions in S. Analyzing the subordinating conjunction digraph at this point identifies nine strong components. This superstructure is shown in Figure 1. As can be seen, the superstructure digraph eliminates the internal structure for components with more than one word. Any component with more than one word has an internal cyclical structure, so that there exists a path from any word to any other word. Following RR3, all nodes of the graph except the one containing the word "that" are eliminated as non-primitive. The other nodes containing one word (those for whether, in case, supposing, notwithstanding that, and on condition that) are essentially carriers of relatively small components of meaning into the large node containing what appear to be the 14 most dominant subordinating conjunctions. Figure 2 shows the internal structure of this strong component, where the remaining analysis will focus.

3. Analysis of Syntactic and Semantic Structure

To this point, we have used only the defining structure of the subordinating conjunctions to provide a general ordering. Now we must delve deeper into the meaning components associated with these lexical items. As with all lexical items, the meaning must be captured in syntactic characteristics (components that describe the usage and context) and semantic characteristics (components that describe the meaning brought to the usage and context). We will use these meaning components to break the cycles shown in Figure 2. The essential method employed to accomplish this is to eliminate cycles that introduce an inconsistency. With respect to syntactic characteristics, we cannot allow a cycle to require that a lexical item have usage or context that differs. With respect to semantic characteristics, we cannot allow a lexical item to bring conflicting information to the usage and context, and we cannot allow a more complex item to define a less complex item. In both cases, a simple way of viewing this is to say that we will break cycles by finding
unification attempts that fail. This section describes efforts to characterize the overarching meaning structure associated with subordinating conjunctions, particularly identifying features and meaning components that are associated with the subordinating conjunctions. To accomplish this, we attempt to integrate information and insights from WordNet, Quirk et al., Barker, and Knott. In summary, the results of this effort indicate an elegant hierarchical structure that provides an intermediate level of representation between sentence structure and discourse structure. The results suggest that subordinating conjunctions enable us to characterize clauses as descriptions of such things as times, causes, reasons, places, conditions, and points of reference.

3.1 WordNet Analysis

To begin, we note provisionally that only one of the 15 definitions of "that" seems to be the primitive from which the others are derived. This definition consists entirely of the usage note, "used as a function word to introduce a subordinate clause that is joined as complement or modifier to a noun or adjective or is in apposition with a noun." According to Quirk et al. (p. 1047 and pp. 1260-2), these are instances of postmodifier subordinate clauses in a noun phrase, so that the following clause would be treated as an appositive to the head noun (and linkable with the copula be), which must be a general abstract noun. A few examples of such definitions are as soon as ("immediately at or just after the time that"), provided ("with the understanding that"), and inasmuch as ("in view of the fact that" and "for the reason that"). A large number of the subordinating conjunction definitions are of the form "PP that," with the PP of the form "P det N," with det "the." The determiner here is cataphoric, referring to the S following the subordinating conjunction and characterizing that S as the N. In W3, the 28 words in Table 7 fill the N position.

Table 7. Words Used to Characterize Sentences Following Subordinating Conjunctions

assumption, belief, cause, circumstance,
concomitant, condition, consideration, degree, event, extent, fact, hope, manner, measure, moment, place, point, possibility, provision, purpose, qualification, reason, restriction, result, sort, time, understanding, way

Subordinating conjunctions with this pattern thus appear to be characterizing the subordinate clause. Moreover, the preposition at the beginning of the syntactic pattern indicates that the clause is serving as an adverbial, and is thus likely to fill one of the seven semantic roles posited by Quirk et al. (8.2), namely, space, time, process, respect, contingency, modality, and degree. Since subordinating conjunctions essentially serve a rhetorical function, we would view the subordinate clause as an entity functioning in a larger rhetorical structure. The clause may describe anything, but we are here concerned with the rhetorical role played by the description. The words in Table 7 were examined in WordNet in order to characterize them and understand better the nature of the subordinating clause. Not surprisingly, the set of words exhibits hierarchical structure and patterns within the WordNet hierarchy. Considering all senses in WordNet, the words in this set fall into only 10 of the WordNet noun tops, only five of which seem legitimate characterizations for the subordinating clause (abstraction, psychological feature, event, state, and location), with the other five being senses with a different orientation (act, entity, phenomenon, shape, and possession). A clause may describe an act, phenomenon, or entity (as in senses of words such as consideration and qualification), but we are not concerned with this, but rather the rhetorical status of this act, phenomenon, or entity, perhaps as an abstraction or state. It is useful to examine where these words fit within the WordNet hierarchy. For each sense of each word, we extracted the hypernymic path to a top and then merged all the paths. This list (about six pages in length) is available, showing the subset of the WordNet hierarchy induced by the list of words in Table 7. For
abstraction (glossed as "a concept formed by extracting common features from examples"), some of the words fell under five of the six hyponyms of this concept, that is, the synsets time; space; attribute; relation; and measure, quantity, amount, quantum. Most of the hyponyms under relation were down a few hyponymic levels to the synset statement. Most of the hyponyms under attribute were under synsets for quality, property, and trait. There were relatively few words inducing the other branches of the tree under abstraction. For psychological feature ("a feature of the mental life of a living organism"), the induced tree included all immediate hyponyms: cognition, knowledge ("the psychological result of perception and learning and reasoning"), most of these falling under the synsets content, cognitive content, mental object and information; motivation, motive, need ("the psychological feature that arouses an organism to action"), only reason and purpose; and feeling ("the psychological feature of experiencing affective and emotional states"), only hope. For event, the induced tree was shallow and contained only the hyponymic synset happening, occurrence, natural event ("an event that happens"), with only its synsets experience; case, instance; beginning; accompaniment, concomitant, co-occurrence; and ending, conclusion. For state ("the way something is with respect to its main attributes"), the induced tree was also quite shallow, with only a small number of its hyponyms: condition, status; condition; situation, state of affairs; degree, level, stage, point; status, position; and being, beingness, existence. For location ("a point or extent in space"), the induced tree was small and shallow, induced by the words way, point, and place. To summarize the results from WordNet, a subordinating clause can be viewed as expressing an abstraction, a piece of knowledge, an event, a state, or a location, with more specific characterizations depending on the particular subordinating conjunction. The subtrees induced from WordNet correspond well to the semantic roles of adverbial
clauses described by Quirk et al. (pp. 1077-1118), that is, clauses of time, contingency, place, condition, concession, contrast, exception, reason, purpose, result, similarity, comparison, proportion, and preference, but probably not and comment. It is an interesting observation that a word in Table 7 may appear several places in the induced WordNet subtree. For example, the word place is in each of the five major categories, indicating that its use may convey several possibilities. I hesitate to say that it is ambiguous; rather, I would say that it opens up opportunities for its use, enabling the merging of meanings. Thus, as Quirk et al. suggest (p. 1087), the meaning of place may merge with meanings of contingency, contrast, and time.

3.2 Hierarchizing Quirk et al. and Integrating Barker

Quirk et al. (15.24-52) describes semantic roles for adverbial clauses. These are conveniently discussed in several categories, such as clauses of time. However, as noted (p. 1077), many subordinators introduce clauses with different meanings, frequently combining meanings such as time and purpose. We can thus easily posit that, in general, subordinating conjunctions are composites of underlying semantic components and features. The issue, then, is one of identifying these components and features and arranging them in such a way as to ensure consistency and appropriate composition or unification. We thus began by identifying and grouping all the distinct semantic roles mentioned in this discussion, shown in Table 8.

Table 8. Semantic Roles of Adverbial Clauses

temporal duration repetition temporal time before time after time overlap time beginning time proximity contingency cause reason motivation circumstance condition presupposition purpose prevention result concession unexpectedness fulfillment exception although distinguished this feature always seems to be blended with condition comparison contrast antithesis preference proportion similarity points of reference place although distinguished this feature always seems to be blended
with either contingency or comparison modality fulfillment negation plausibility unexpectedness

The key factor underlying the groups shown in Table 8 arose from trying to deal with contingency. Each of the semantic roles identified in this group was discussed in several places throughout the referenced sections in Quirk et al. As such, they seemed to stand independent of the others and not fit into an overall structure. However, in their analyses, both Barker and Knott make considerable reference to an implicational propositional structure of the form $P_1 \wedge \cdots \wedge P_n \rightarrow Q$. When the various semantic roles identified under contingency were considered in light of this propositional structure, it appeared that the semantic roles were ways of characterizing the clauses on either the left- or right-hand side of the rule. It was this observation that made it possible to consider all the semantic roles as part of an overarching hierarchy that could elaborate the one posited by Barker. Barker suggests three types of clause-level relationships (each with subtypes): conjunctive, temporal, and causal. Moreover, he suggests that there is a ranking of the types: (1) conjunctive relationships merely state a number of propositions $P_i$ without any additional information; (2) temporal relationships add temporal ordering to the propositions, while keeping the conjunctive relations; and (3) causal relationships add causal ordering to the propositions, while keeping temporal and conjunctive relations. This ranking seems to correspond well to the nature of rational thought: first positing a number of propositions, then noticing a temporal ordering, and finally articulating a causal ordering. This process of adding information to the relationships is one that is well suited to making sense of the semantic roles in Table 8. Thus, for example, we can see that reason adds the information that one of the $P_i$ is to be treated as not just a cause, but also that a human has attached significance to the proposition
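The partition of a definitional digraph into strong components, its acyclic superstructure, and the basis set mentioned in this excerpt can be sketched in Python as follows. This is only an illustration under stated assumptions: the excerpt does not say which algorithm from digraph theory was used, so Kosaraju's two-pass algorithm is swapped in here; an edge x -> y is taken to mean "x is used to define y"; the function names and the toy graph with letter nodes are invented and do not reproduce the subordinating conjunction digraph.

```python
from collections import defaultdict


def strong_components(graph):
    """Kosaraju's algorithm. graph maps each node to an iterable of successors."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    # First pass: iterative DFS, recording nodes in order of completion.
    order, seen = [], set()
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(sorted(graph.get(start, ()))))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(sorted(graph.get(nxt, ())))))
                    break
            else:
                # All successors explored: this node is finished.
                order.append(node)
                stack.pop()
    # Second pass: DFS on the reversed graph, in reverse order of completion.
    reverse = defaultdict(list)
    for u, vs in graph.items():
        for v in vs:
            reverse[v].append(u)
    component = {}
    for root in reversed(order):
        if root in component:
            continue
        component[root] = root
        stack = [root]
        while stack:
            node = stack.pop()
            for prev in reverse[node]:
                if prev not in component:
                    component[prev] = root
                    stack.append(prev)
    groups = defaultdict(set)
    for node, root in component.items():
        groups[root].add(node)
    return [frozenset(g) for g in groups.values()]


def superstructure(graph):
    """Return the strong components and the acyclic links between them."""
    comps = strong_components(graph)
    comp_of = {n: c for c in comps for n in c}
    edges = {(comp_of[u], comp_of[v])
             for u, vs in graph.items() for v in vs
             if comp_of[u] != comp_of[v]}
    return comps, edges


def basis_components(graph):
    """Components with no incoming superstructure link: the candidate primitives."""
    comps, edges = superstructure(graph)
    has_incoming = {dst for _, dst in edges}
    return [c for c in comps if c not in has_incoming]


# Invented toy digraph; letters stand in for lexical items.
toy = {"a": ["b", "e"], "b": ["c", "d"], "c": ["b"], "d": [], "e": ["f"], "f": ["e"]}
print(strong_components(toy))
print(basis_components(toy))
```

In the toy graph, b and c define each other and so collapse into one component, as do e and f; only the component containing a has nothing more primitive pointing into it, so it plays the role of the basis.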

    Original URL path: http://www.clres.com/online-papers/sc.html (2016-02-11)
    Open archived version from archive

  • DIMAP Implementation of MCCA
    each individual text group. These include, for all text groups in the file and for each individual text group: (1) the total number of words; (2) the percentage of unique words in the texts; (3) the total number of words for which a category was available; (4) the percentage of tokens in the text that were categorized; (5) the percentage of unique words that were categorized; and (6) statistics on

    Original URL path: http://www.clres.com/mcca.php?show=mcca-wa (2016-02-11)
    Open archived version from archive

  • DIMAP Implementation of MCCA
    Implementation of Minnesota Contextual Content Analysis: Basic Statistics: Lookup List. A list of all unique tokens in each of the text groups in the input text, identifying those not in

    Original URL path: http://www.clres.com/mcca.php?show=mcca-walook (2016-02-11)
    Open archived version from archive

  • DIMAP Implementation of MCCA
    CL Research Implementation of Minnesota Contextual Content Analysis: Basic Statistics: Keywords in Context. A concordance of selected words, showing the left context, the keywords, and the right context
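    A keyword-in-context concordance of the kind described here can be sketched in a few lines of Python. This is a generic illustration, not the MCCA or DIMAP implementation: the function name `kwic`, the `width` parameter, and the word-level tokenizer are all assumptions.

```python
import re


def kwic(text, keywords, width=3):
    """Return (left context, keyword, right context) for each keyword hit.

    width is the number of tokens kept on each side (an illustrative choice).
    """
    tokens = re.findall(r"\w+", text)
    wanted = {k.lower() for k in keywords}
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() in wanted:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


rows = kwic("The cat sat on the mat because the cat was tired", ["cat"], width=2)
for left, key, right in rows:
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>15}  {key}  {right}")
```

    Matching is case-insensitive, while the keyword is reported as it appears in the text.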

    Original URL path: http://www.clres.com/mcca.php?show=mcca-waconcord (2016-02-11)
    Open archived version from archive

  • DIMAP Implementation of MCCA
    Analysis: Basic Statistics: Words in Category. Tokens in a specified text group that have been used at least a specified number of times, sortable into alphabetic order, category name and number, its use percentage relative to the total number of

    Original URL path: http://www.clres.com/mcca.php?show=mcca-wcat (2016-02-11)
    Open archived version from archive



  •