Controlling the Effect of Crowd Noisy Labels in NLP Tasks

Azad Abad

Crowdsourcing, QA, RTE, Noisy Labels, Autonomous Learning, Relation Extraction, NLP

Recently, crowdsourcing has emerged as a cheap alternative for collecting the labels needed to train statistical methods for NLP tasks. However, crowd annotations are noise-prone, and the quality of the collected labels depends heavily on the difficulty of the crowdsourced task and on the annotators' expertise. In this research, we focus on several approaches for controlling the effect of such noisy labels in various NLP tasks (e.g., Question Answering and Relation Extraction).
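For reference, the simplest mechanism for controlling annotation noise is to collect redundant judgments per item and aggregate them. The Python sketch below shows plain majority voting, a standard baseline rather than the specific approaches studied in this research:

    from collections import Counter, defaultdict

    def majority_vote(annotations):
        """Aggregate redundant crowd labels by majority voting.

        annotations: iterable of (item_id, worker_id, label) tuples.
        Returns a dict mapping each item to its most frequent label.
        """
        votes = defaultdict(Counter)
        for item_id, _worker_id, label in annotations:
            votes[item_id][label] += 1
        return {item: counts.most_common(1)[0][0]
                for item, counts in votes.items()}

More refined aggregation models, e.g., weighting annotators by their estimated expertise, build on the same idea.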




Automatic Translation of Morphologically Rich Languages

Duygu Ataman

Machine Learning, Statistical Machine Translation, Natural Language Processing

There are an estimated 7.3 billion people on Earth, speaking 7,097 different languages. As global industrialization advances, automated translation services become ever more vital for digital products. Generating high-quality localized content requires language-specific tools that optimize the models used in machine translation. In this research, we investigate novel methods to increase the quality of translation into and from morphologically rich languages. Our study evaluates the lexical and syntactic representation techniques used in computational linguistics, as well as different methods for transferring statistical information between monolingual and bilingual data resources, in order to overcome the data sparseness issues encountered in processing languages with derivational and inflectional morphologies.




Automatic Post-Editing for Machine Translation

Rajen Chatterjee

Automatic Post-Editing, Neural Machine Translation, Machine Translation, Sequence to Sequence

Automatic post-editing aims to correct the errors in machine-translated text. This automatic error-correction mechanism can speed up the work of translators and, eventually, increase the productivity of the translation industry.




Achieving Open Vocabulary in Neural Machine Translation

Mattia Antonino Di Gangi

Machine Translation, Natural Language Processing, Deep Learning

Neural machine translation is a new paradigm that has become relevant both for academia and for industry. Systems of this kind generally perform well, but they need massive resources, in terms of both storage and computational power, to use a large vocabulary. My research focuses on finding methods for using a larger vocabulary, and thus obtaining better translations, while keeping the required resources limited.
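For background, one widely adopted way to obtain an open vocabulary without a huge symbol set is byte-pair encoding (BPE), which iteratively merges the most frequent pair of adjacent symbols to learn subword units. A minimal Python sketch of the learning loop on a toy vocabulary (illustrative background, not necessarily the method pursued in this research):

    import re
    from collections import Counter

    def pair_stats(vocab):
        """Count frequencies of adjacent symbol pairs in the vocabulary."""
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs

    def merge_pair(pair, vocab):
        """Replace every occurrence of the pair with its concatenation."""
        pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
        return {pattern.sub(''.join(pair), word): freq
                for word, freq in vocab.items()}

    # Toy vocabulary: words as space-separated characters with an end marker.
    vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
             'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
    for _ in range(10):  # number of merges = subword vocabulary budget
        stats = pair_stats(vocab)
        if not stats:
            break
        best = max(stats, key=stats.get)
        vocab = merge_pair(best, vocab)
        print(best)  # each learned merge becomes a subword unit

The number of merge operations directly controls the trade-off between vocabulary size and sequence length.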




Multi-domain Neural Machine Translation

Mohammad Amin Farajian

machine translation, neural machine translation, multi-domain, domain adaptation

State-of-the-art neural machine translation (NMT) systems are generally very sensitive to the training domain, and their performance degrades if the test set belongs to a different domain than the training data. Therefore, current NMT systems are trained on specific domains by carefully selecting the training sets and applying proper domain adaptation techniques. However, in real-world applications it is very hard, if not impossible, to develop and maintain several specific MT systems for multiple domains. This is mostly due to the fact that usually: i) the target domain is not known in advance, and users might query sentences from different domains; ii) the application domains are very diverse, which makes developing and fine-tuning one system for each domain infeasible; iii) there is no (or a very limited amount of) in-domain training data with which to train domain-specific MT engines. In this situation, it is necessary to have high-quality MT systems that perform consistently well in all (or most of) the domains. In my PhD, I am exploring effective solutions for developing multi-domain NMT systems whose performance does not degrade when the application domain changes and that perform equally well in all domains.
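One line of work on this problem performs unsupervised on-the-fly adaptation: for each incoming sentence, retrieve the most similar sentence pairs from the available training data and adapt the model to them. The Python sketch below shows only the retrieval step, using plain TF-IDF cosine similarity via scikit-learn; it is an illustrative baseline, not the exact procedure developed in this PhD:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def retrieve_similar(src_sentence, training_corpus, top_k=5):
        """Return the top_k (src, tgt) training pairs whose source side
        is most similar (TF-IDF cosine) to the input sentence."""
        sources = [src for src, _tgt in training_corpus]
        vectorizer = TfidfVectorizer()
        matrix = vectorizer.fit_transform([src_sentence] + sources)
        sims = cosine_similarity(matrix[0:1], matrix[1:]).ravel()
        ranked = sims.argsort()[::-1][:top_k]
        return [training_corpus[i] for i in ranked]

The retrieved pairs can then be used, for instance, for a few fine-tuning steps before translating the input sentence.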




Adaptation Methods in Statistical Machine Translation

Domain Adaptation without In-domain Parallel Data

Prashant Mathur

domain adaptation, statistical machine translation

We address a challenging problem frequently faced by MT service providers: creating a domain-specific system based on a purely source-monolingual sample of text from the domain. We solve this problem by introducing methods for domain adaptation that require no in-domain parallel data. Our approach yields results comparable to state-of-the-art systems optimized on an in-domain parallel set.
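One standard technique that fits this exact setting is cross-entropy difference selection (Moore and Lewis, 2010): score each generic sentence pair with language models trained on the in-domain source sample and on the generic data, and keep the pairs that look most in-domain. A minimal Python sketch with add-one-smoothed unigram models (real systems typically use higher-order models; this is illustrative, not necessarily the authors' exact method):

    import math
    from collections import Counter

    def unigram_lm(sentences):
        """Train an add-one-smoothed unigram LM; returns token -> log-prob."""
        counts = Counter(tok for s in sentences for tok in s.split())
        total = sum(counts.values())
        vocab = len(counts) + 1  # +1 slot for unseen tokens
        return lambda tok: math.log((counts[tok] + 1) / (total + vocab))

    def cross_entropy(lm, sentence):
        """Per-token negative log-likelihood of a sentence under the LM."""
        toks = sentence.split()
        return -sum(lm(t) for t in toks) / max(len(toks), 1)

    def select_pseudo_in_domain(in_domain_src, generic_pairs, top_k):
        """Rank generic (src, tgt) pairs by cross-entropy difference of
        their source side; lower scores look more in-domain."""
        lm_in = unigram_lm(in_domain_src)
        lm_gen = unigram_lm([src for src, _tgt in generic_pairs])
        score = lambda s: cross_entropy(lm_in, s) - cross_entropy(lm_gen, s)
        return sorted(generic_pairs, key=lambda p: score(p[0]))[:top_k]

The selected pseudo in-domain pairs can then serve as adaptation data in place of a genuine in-domain parallel corpus.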





Diachronic and Synchronic Comparisons of Points of View

Stefano Menini

Digital Humanities, Quantitative History, Natural Language Processing

To deal with the large number of political documents available, we need to integrate traditional humanistic approaches with computational ones. Political documents present a multitude of interconnected points of view and opinions. We focus on the automatic evaluation of ideological positions, detecting divergences and similarities between authors.





Social Annotation and User Profiling

Yaroslav Nechaev

Social Media, Machine Learning, Natural Language Processing

Social Media will be the cornerstone of any future knowledge-based system. Therefore, we need to be able to efficiently gather and process Social Media data and to learn how to use user-generated content to solve a wide variety of problems. This is what I define as Social Annotation: the process of enriching typical Computer Science problems, for example Entity Linking, User Profiling, and Event Detection, with knowledge from Social Media. In my research, I introduce techniques that greatly simplify the processing of Social Media data. Specifically, I work on efficient user representations and on novel social media-based evaluation approaches.
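To make "efficient user representations" concrete, one common baseline is to average word embeddings over everything a user has posted, yielding a fixed-size vector for downstream profiling models. A minimal Python sketch, assuming a pre-trained embeddings map from token to vector (illustrative, not the specific representation developed in this research):

    import numpy as np

    def user_vector(posts, embeddings, dim=300):
        """Average the embeddings of all known tokens in a user's posts
        into a single fixed-size vector for downstream profiling models."""
        vecs = [embeddings[tok]
                for post in posts
                for tok in post.lower().split()
                if tok in embeddings]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)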





Supervised Similarity in Semantic Tree Kernels

Massimo Nicosia

Semantic tree kernels compute the similarity between the structural and semantic representations of two pieces of text. Matches between words can be established through the similarity of their word embeddings. Our research aims at producing word representations that are more effective in the kernel computation, by including contextual information through explicit modeling of the context around words and by adopting supervision in the word-encoding mechanism.
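The basic building block here is a soft lexical match: two leaf nodes match not only when their words are identical, but also to the degree that their embeddings are similar. A minimal Python sketch of such a node matcher, assuming an embeddings map from word to vector; the recursive kernel computation over tree fragments is omitted:

    import numpy as np

    def cosine(u, v):
        """Cosine similarity between two embedding vectors."""
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

    def soft_lexical_match(word_a, word_b, embeddings, threshold=0.6):
        """Node-matching score for a semantic tree kernel: exact matches
        score 1.0; otherwise the embedding cosine counts when it clears
        the threshold, so near-synonyms still contribute."""
        if word_a == word_b:
            return 1.0
        if word_a in embeddings and word_b in embeddings:
            sim = cosine(embeddings[word_a], embeddings[word_b])
            return sim if sim >= threshold else 0.0
        return 0.0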




Deep Knowledge Extraction from Text

Giulio Petrucci

Natural Language Processing, Knowledge Representation, Machine Learning

Ontologies are used to represent knowledge in a formal and unambiguous way, facilitating its reuse and sharing among people and computer systems. A large amount of knowledge is traditionally available in unstructured text sources, and manually encoding their content into a formal representation is costly and time-consuming. Several methods have been proposed to support ontology engineers in the ontology-building process, but they have mostly turned out to be inadequate for building rich and expressive ontologies. We propose concrete research directions for designing an effective methodology for semi-supervised ontology learning, with a special focus on expressive axioms and concept definitions.





Speech Adaptation Modeling for Statistical Machine Translation

Nicholas Ruiz

Spoken language translation (SLT) lies at one of the most challenging intersections of automatic speech recognition (ASR) and natural language processing (NLP). The ASR system is responsible for transcribing recorded human speech into a sequence of words. These words comprise a combination of content words, which carry information, and function words, which give structure to each utterance. The transcribed utterances are often segmented and punctuated into natural sentences to induce grammaticality. After the transcribed words are pre-processed, a machine translation (MT) system converts each sentence into a target language. Optionally, a speech synthesizer, or text-to-speech (TTS) system, may be applied to generate an audio signal from the machine-translated text. While ASR and MT systems can be trained separately on different data, the mismatch in training conditions can cause a well-performing statistical machine translation (SMT) system to translate spoken language poorly. Since machine translation systems are primarily trained on natural language texts, they are not well equipped to handle the differences in style and genre. Additionally, errors occurring in speech transcripts propagate into the machine translation system and complicate the search space for the best translation. In this research, we explore the differences between translating written text and translating speech, and we propose several techniques to adapt the machine translation system to anticipate artifacts of ASR outputs that impede its ability to translate adequately. In particular, we focus on translation from spoken English into French and German, the two parent languages of English, and demonstrate that information about ASR errors can improve the robustness of machine translation for spoken language.




Temporal Processing of Historical Texts

Rachele Sprugnoli

Natural Language Processing, Digital Humanities, Temporal Information Processing

The elaboration of temporal information is crucial when dealing with historical texts. Finding new approaches to the identification of such information in this type of text can assist historians in enhancing their work and can have an impact on both NLP and Digital Humanities research.




Exploring Sensorial Association of Words for Computational Linguistics Applications

Serra Sinem Tekiroglu

Lexical Semantics, Figurative Language

Language is the main communication device for representing the environment and sharing an understanding of the world that we perceive through our sensory organs. Therefore, each language might contain a great number of sensorial elements to express perceptions, in both literal and figurative usage. Far from being an anomaly, figurative language is pervasive in everyday language and must be handled by any NLP application that includes a semantic task. In order to tackle the semantics of figurative language, we propose to use the sensorial affinity of words as a feature for metaphor identification. Additionally, we analyze the transition from perceptual to conceptual knowledge and conduct a creativity-detection task on a multimodal dataset that contains both linguistic and visual dimensions of a given concept.
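To make the proposed feature concrete: given a sensorial lexicon associating words with affinity scores for the five senses (such a resource is assumed here purely for illustration), per-sense features for a text can be aggregated as in the following Python sketch:

    SENSES = ('sight', 'hearing', 'touch', 'smell', 'taste')

    def sensorial_features(tokens, lexicon):
        """Aggregate per-sense affinity scores over a token sequence.
        lexicon: dict mapping word -> dict of per-sense scores (a
        hypothetical resource for this sketch). Returns the mean score
        per sense, usable as features in a metaphor classifier."""
        totals = {s: 0.0 for s in SENSES}
        hits = 0
        for tok in tokens:
            scores = lexicon.get(tok.lower())
            if scores:
                hits += 1
                for s in SENSES:
                    totals[s] += scores.get(s, 0.0)
        return {s: (v / hits if hits else 0.0) for s, v in totals.items()}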




Beyond Factoid Question Answering

Text Snippet Interpretation in Social Media Communities

Antonio Uva

Computer Science, Natural Language Processing, Community Question Answering

Social Media applications, e.g., forums and social networks, allow users to pose questions about a given topic to a community of experts and/or other users. Although successful, this approach suffers from two main drawbacks: (i) it is rather complex to find similar questions with traditional keyword-based search; and (ii) even assuming accurate search, similar questions may not be available. Community Question Answering (cQA) is a branch of QA that aims at automatically answering user questions by (i) first looking for the questions most similar to the input question and (ii) then selecting the best answer to those questions. This way, users do not need to wait for an answer from the community, which may come after a considerable amount of time. In this work, we propose a new solution to the two main problems in building automatic cQA systems: detecting similar questions (question-question similarity) and determining answer relevancy for non-factoid questions (question-answer similarity). Our solution consists in adapting techniques that have been used in other QA areas (e.g., factoid QA) but have not been exploited in cQA so far. The resulting system can be used, for example, to automatically answer questions asked by customers about the products or services sold by a company.
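A toy Python sketch of the two-step cQA pipeline described above, using plain word overlap for question-question similarity and a precomputed relevance score standing in for the question-answer model (both are simple placeholders for the techniques adapted in this work):

    def jaccard(a, b):
        """Word-overlap similarity between two texts."""
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

    def answer_question(new_question, archive, sim_threshold=0.3):
        """Two-step cQA baseline.
        archive: list of (question, answer, answer_score) triples, where
        answer_score is a hypothetical precomputed question-answer
        relevance score. Step (i): keep archived questions similar
        enough to the new one. Step (ii): return the answer of the best
        candidate, ranking by similarity, then answer relevance."""
        scored = [(jaccard(new_question, q), score, answer)
                  for q, answer, score in archive]
        candidates = [c for c in scored if c[0] >= sim_threshold]
        if not candidates:
            return None  # no sufficiently similar question in the archive
        return max(candidates, key=lambda c: (c[0], c[1]))[2]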




Semantic Linguistic Resources

Hanyu Zhang

crowdsourcing, GWAP, mobile application, linguistic resource, NLP

Multilingual semantic linguistic resources are critical for many applications in Natural Language Processing (NLP). However, building large-scale lexico-semantic resources manually from scratch is extremely expensive, which has promoted the use of automatic extraction and merging algorithms. These algorithms have helped in the creation of large-scale resources, but they introduce many kinds of errors as a side effect. For example, the Chinese WordNet follows the WordNet structure and is generated via several algorithms; this automatic generation introduces errors such as wrong translations, typos, and false mappings between multilingual terms. Since the quality of a linguistic resource directly influences the performance of the applications built on top of it, finding and correcting such errors is essential.