Turk Bootstrap Word Sense Inventory V2.0 (includes data from V1.0)
===================================================================

Chris Biemann, biemann@ukp.informatik.tu-darmstadt.de, March 2010, May 2012


1. Introduction
---------------

This file describes the data format of the TWSI (Turk bootstrap Word Sense
Inventory) version 2.0. This is the complete version, including the data from
TWSI 1.0. For a description of the acquisition process, please consult
[Biemann and Nygaard, 2010], [Biemann, 2012a] or [Biemann, 2012b].

In short, three MTurk tasks were used to produce the data provided here:

- "Substitutable words in context": Workers are presented with a sentence
  containing a target word and supply substitutions.
- "Are these words used with the same meaning?": Workers are presented with a
  pair of sentences containing the same target word marked in bold and decide
  whether the meanings are identical, similar or different.
- "Match the Meaning": Workers are presented with a sense inventory
  represented by prototypical sentences and align further sentences
  containing the same target word to those senses.

The TWSI is organized by target word: for the most frequent 615 nouns in
English Wikipedia (dump of January 3rd, 2008) that were not already included
in TWSI 1.0, all targets are organized into senses. Each sense is associated
with substitutions and with sentences in which the target word is used in
this sense. This data has been curated and extracted from the output of a
Turk bootstrapping acquisition cycle. The list of targets is found in
targets.txt.


2. Organization
---------------

There are 6 top-level directories, containing different aspects of the sense
inventory:

- inventory
- substitutions
- contexts
- corpus
- lexsub_task
- doc

In each of the directories except lexsub_task and doc, there are files for
each of the 615 nouns; the first part of the filename reflects the target
noun. Under lexsub_task, you find substitution data in the format of the
lexical substitution task [McCarthy and Navigli, 2009]. The doc folder
contains publications describing this resource.


2.1 Inventory
-------------

This directory contains, for each target, a file with prototypical sentences
per sense. The files have three tab-separated columns:

- sense-ID
- sentence-ID
- sentence with the target word marked in bold tags

Sample:

train@@2	train++5708554	In 1986 , an annual conference was developed and called Hillsong Conference , which was created to teach and <b>train</b> Christians from around Australia and from all over the world .
train@@1	train++13332587	One <b>train</b> makes additional stops at Ichinoseki and Morioka .
train@@4	train++41336108	Then in 1847 Waite headed to Oregon Country in a wagon <b>train</b> of 40 wagons .

Note that senses are represented by @@ followed by a number, and numbers may
be missing, i.e. there is no train@@3. These prototypical sentences are the
ones that have been verified not to share the same sense. They serve as the
inventory for the "Match the Meaning" task, so the contexts are aligned to
these sentences rather than to substitutions.


2.2 Substitutions
-----------------

This directory contains, for each target, a file with substitutions per
sense. The files have four tab-separated columns:

- sense-ID
- target word (redundant with filename)
- substitution
- count: how often this substitution was supplied

Sample:

weight@@2	weight	heaviness	22
weight@@3	weight	significance	6

Note that sense-IDs can be used to link the substitutions to the inventory.
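For illustration, the following minimal Python sketch (not part of the
resource) reads one target's inventory and substitutions files and joins them
by sense-ID. The per-target file names "inventory/train.txt" and
"substitutions/train.txt" are assumptions; adjust them to the file names in
your copy of the resource.

from collections import defaultdict

def read_inventory(path):
    # sense-ID <TAB> sentence-ID <TAB> sentence with the target in bold tags
    senses = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            sense_id, sentence_id, sentence = line.rstrip("\n").split("\t", 2)
            senses[sense_id].append((sentence_id, sentence))
    return senses

def read_substitutions(path):
    # sense-ID <TAB> target <TAB> substitution <TAB> count
    subs = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            sense_id, target, substitution, count = line.rstrip("\n").split("\t")
            subs[sense_id].append((substitution, int(count)))
    return subs

inventory = read_inventory("inventory/train.txt")
substitutions = read_substitutions("substitutions/train.txt")
for sense_id in sorted(inventory):
    sub_list = ", ".join(s for s, _ in substitutions.get(sense_id, []))
    print(sense_id, "-", len(inventory[sense_id]), "prototypical sentence(s);",
          "substitutions:", sub_list)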
The data has been filtered in the following way: rare substitutions have been
omitted (minimal frequency = 2, at least 1 substitution per sense),
substitutions for contexts that were not assigned to any sense have been
omitted, and the maximal number of substitutions per sense is 10 (more only
in case of ties on counts). Note that words that have entered the acquisition
cycle several times get higher counts on substitutions. Also note that highly
frequent substitutions for words with a single sense are almost safely
context-free synonyms. The data is sorted by sense-ID, then descending by
substitution count.

2.2.1 Raw data
--------------

The raw_data subdirectory contains all substitutions given in the process in
the subdirectory "all-substitutions" (same format as in 2.2 above) and
substitutions on a sentence level in the subdirectory
"substitutions_per_sentence". For each word, there are two files:
target.turkresults and target.difficulty.

The format of target.turkresults is 4 tab-separated columns:

- ID consisting of target fullform, sentence-ID from corpus, and source
  (0 = init, TURK-UNCOV = from matches marked as uncovered)
- target word
- substitution
- count

Sample:

years++8875374||0	year	365-day period	1
years++8875374||0	year	60 months	1
years++8875374||0	year	annum	1

The format of target.difficulty is 2 tab-separated columns:

- ID consisting of target fullform, sentence-ID from corpus, and source
  (0 = init, TURK-UNCOV = from matches marked as uncovered)
- difficulty score: sum of difficulty judgments easy (3), medium (1),
  hard (0) and impossible (-3); not normalized

Sample:

years++8875374||0	7
years++23697091||0	-5
year++39673730||0	15


2.3 Contexts (Sample Sentences)
-------------------------------

This directory contains, for each target, a file with sentences that are
labeled by sense-ID. The files have 6 tab-separated columns:

- sense-ID
- target word
- target word as it appears and is marked up in the sentence (singular or
  plural form for nouns)
- sentence-ID from corpus
- sentence with the target word marked in bold tags
- confidence from the matching process; only high-confidence items are
  included here

Sample:

time@@1	time	time	17538931	For a long <b>time</b> it was assumed that The Hyena was male and in costume .	1.0
time@@1	time	time	37655926	With <b>time</b> running out and the Xindi weapon about to be armed , Archer has to persuade the Xindi - Aquatics to help destroy the weapon .	0.75

Note that these are only the results from "Match the Meaning" tasks.
Sentences used for the clustering are not contained here. Also, the
prototypical sentences defining the inventory (see 2.1) are not contained in
these files.


2.4 Corpus
----------

This directory contains the relevant parts of the corpus used: the
information can be used to link sentences throughout this resource to the
original Wikipedia articles.

The file "wiki_titles.txt" contains 3 tab-separated columns:

- sentence-ID from corpus
- number of the sentence within the article
- title of the article

Sample:

195	27	António Ferreira
811	150	Polish-Soviet War in 1920
1020	13	Minka
4010	4	Geography of Texas

This data is only available for 97.5% of the sentence IDs in this resource.
A short example that links contexts to article titles is sketched after
Section 2.5.


2.5 Lexical Substitution Task Format
------------------------------------

This folder contains the data for all sentences where substitutions in
context were collected, in the XML format of the SemEval lexical substitution
task [McCarthy and Navigli, 2009]. Thus, it can be used to evaluate lexical
substitution systems that can read this format.
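The per-target files can be combined via the sentence-IDs. The following
minimal Python sketch (not part of the resource) reads a contexts file
(Section 2.3) and attaches the Wikipedia article titles from
corpus/wiki_titles.txt (Section 2.4). The per-target file name
"contexts/time.txt" is an assumption; adjust it to the file names in your
copy of the resource.

def read_wiki_titles(path):
    # sentence-ID <TAB> sentence number within article <TAB> article title
    titles = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            sentence_id, _sent_no, title = line.rstrip("\n").split("\t", 2)
            titles[sentence_id] = title
    return titles

def read_contexts(path, min_confidence=0.0):
    # sense-ID, target, surface form, sentence-ID, sentence, confidence
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            sense_id, _target, _surface, sentence_id, sentence, conf = \
                line.rstrip("\n").split("\t")
            if float(conf) >= min_confidence:
                rows.append((sense_id, sentence_id, sentence))
    return rows

titles = read_wiki_titles("corpus/wiki_titles.txt")
for sense_id, sentence_id, sentence in read_contexts("contexts/time.txt",
                                                     min_confidence=0.75):
    # Article titles are only available for about 97.5% of the sentence IDs.
    print(sense_id, "|", titles.get(sentence_id, "<no title>"), "|", sentence)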
Update 2015-10-30: Minor corrections are provided as twsi_clean.xml and
twsi_clean.gold. In these versions, the following issues are addressed:

- the sentence context in twsi_clean.xml is wrapped in a CDATA element,
  enabling correct processing of the XML
- substitution elements in twsi_clean.gold containing the character ";" are
  escaped as "\;", resolving ambiguities of the file format (a small helper
  for splitting such fields is sketched at the end of this file)


3. Stats
--------

3.1 Sense inventory
-------------------

Histogram: number of senses per word

senses  count
1       375
2       284
3       155
4       93
5       44
6       32
7       11
8       5
9       6
10      4
12      1
13      1
14      1

Read: there are 44 words with exactly 5 senses.

3.2 Substitutions
-----------------

Histogram: number of substitutions per sense

substitutions  sense count
1              105
2              69
3              179
4              189
5              191
6              186
7              183
8              163
9              177
10             398
11             198
12             131
13             73
14             51
>14            45

Read: there are 183 senses that have exactly 7 substitutions.


4. Distribution and Citation
----------------------------

This data has been fully funded by Powerset, a Microsoft company. It is
licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported
license.

You are free:

* to Share - to copy, distribute and transmit the work
* to Remix - to adapt the work

Under the following conditions:

* Attribution - You must attribute the work by referencing the following
  paper: [Biemann and Nygaard, 2010] C. Biemann and V. Nygaard (2010):
  Crowdsourcing WordNet. In Proceedings of the 5th Global WordNet Conference,
  Mumbai, India.
* Share Alike - If you alter, transform, or build upon this work, you may
  distribute the resulting work only under the same, similar or a compatible
  license.

With the understanding that:

* Waiver - Any of the above conditions can be waived if you get permission
  from the copyright holder.
* Public Domain - Where the work or any of its elements is in the public
  domain under applicable law, that status is in no way affected by the
  license.
* Other Rights - In no way are any of the following rights affected by the
  license:
  o your fair dealing or fair use rights, or other applicable copyright
    exceptions and limitations;
  o the author's moral rights;
  o rights other persons may have either in the work itself or in how the
    work is used, such as publicity or privacy rights.
* Notice - For any reuse or distribution, you must make clear to others the
  license terms of this work. The best way to do this is with a link to the
  license web page.


References:
-----------

[Biemann and Nygaard, 2010] C. Biemann and V. Nygaard (2010): Crowdsourcing
WordNet. In Proceedings of the 5th Global WordNet Conference, Mumbai, India.

[Biemann, 2012a] C. Biemann (2012): Turk Bootstrap Word Sense Inventory 2.0:
A Large-Scale Resource for Lexical Substitution. In Proceedings of LREC 2012,
Istanbul, Turkey.

[Biemann, 2012b] C. Biemann (2012): Creating a system for lexical
substitutions from scratch using crowdsourcing. Language Resources and
Evaluation. Springer. doi:10.1007/s10579-012-9180-5

[McCarthy and Navigli, 2009] D. McCarthy and R. Navigli (2009): The English
lexical substitution task. Language Resources and Evaluation. Springer.
doi:10.1007/s10579-009-9084-1
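The "\;" escaping in twsi_clean.gold (see the update note after Section 2.5)
can be handled with a small helper like the following Python sketch. It only
shows how to split a ";"-separated substitution field while treating "\;" as
a literal semicolon; the example input is made up, and the surrounding gold
file layout is not assumed here.

import re

def split_substitutions(field):
    # Split on ";" that is not preceded by a backslash, then unescape "\;".
    parts = re.split(r"(?<!\\);", field)
    return [p.replace("\\;", ";").strip() for p in parts if p.strip()]

# Made-up example: "give up\; abandon;drop" -> ["give up; abandon", "drop"]
print(split_substitutions(r"give up\; abandon;drop"))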