Thi Minh Huyen Nguyen

Also published as: Thi Minh Huyền Nguyễn, Thi-Minh-Huyen Nguyen, T. M. Huyen Nguyen, Thị Minh Huyền Nguyễn

This paper presents an empirical study on the application of the maximum entropy approach for part-of-speech tagging of Vietnamese text, a language with special characteristics which largely distinguish it from occidental languages. Our best tagger explores and includes useful knowledge sources for tagging Vietnamese text and gives a 93.40% overall accuracy and an 80.69% unknown-word accuracy on a test set of the Vietnamese treebank. Our tagger significantly outperforms the tagger that is being used for building the Vietnamese treebank, and as far as we are aware, this is the best tagging result ever published for the Vietnamese language.

Automated Extraction of Tree Adjoining Grammars from a Treebank for Vietnamese
Phuong Le-Hong | Thi Minh Huyen Nguyen | Phuong Thai Nguyen | Azim Roussanaly
Proceedings of the 10th International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+10)

2009

Building a Large Syntactically-Annotated Corpus of Vietnamese
Phuong-Thai Nguyen | Xuan-Luong Vu | Thi-Minh-Huyen Nguyen | Van-Hiep Nguyen | Hong-Phuong Le
Proceedings of the Third Linguistic Annotation Workshop (LAW III)

Finite-State Description of Vietnamese Reduplication
Phuong Le Hong | Thi Minh Huyen Nguyen | Azim Roussanaly
Proceedings of the 7th Workshop on Asian Language Resources (ALR7)

2008

Word Segmentation of Vietnamese Texts: a Comparison of Approaches
Quang Thắng Đinh | Hồng Phương Lê | Thị Minh Huyền Nguyễn | Cẩm Tú Nguyễn | Mathias Rossignol | Xuân Lương Vũ
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We present in this paper a comparison between three segmentation systems for the Vietnamese language. The majority of Vietnamese words are built by semantic composition from about 7,000 syllables, which also have a meaning as isolated words. The identification of word boundaries in a text is therefore not a simple task, and ambiguities often appear. Beyond the presentation of the tested systems, we also propose a standard definition for word segmentation in Vietnamese, and introduce a reference corpus developed for the purpose of evaluating such a task. The results observed confirm that the task can be handled relatively well by automatic means, although a solution needs to be found to take into account out-of-vocabulary words.

A Metagrammar for Vietnamese LTAG
Phương Lê Hồng | Thị Minh Huyền Nguyễn | Azim Roussanaly
Proceedings of the Ninth International Workshop on Tree Adjoining Grammar and Related Frameworks (TAG+9)

2006

This paper describes the ARCADE II project, concerned with the evaluation of parallel text alignment systems. The ARCADE II project aims at exploring the techniques of multilingual text alignment through a fine evaluation of the existing techniques and the development of new alignment methods. The evaluation campaign consists of two tracks devoted to the evaluation of alignment at sentence and word level respectively. It differs from ARCADE I in the multilingual aspect and the investigation of lexical alignment.

A Lexicalized Tree-Adjoining Grammar for Vietnamese
H. Phuong Le | T. M. Huyen Nguyen | Laurent Romary | Azim Roussanaly
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper, we present the first sizable grammar built for Vietnamese using LTAG, developed over the past two years, named vnLTAG. This grammar aims at modelling written language and is general enough to be both application- and domain-independent. It can be used for the morpho-syntactic tagging and syntactic parsing of Vietnamese texts, as well as text generation. We then present a robust parsing scheme using vnLTAG and a parser for the grammar. We finish with an evaluation using a test suite.

A language-independent method for the alignment of parallel corpora
Thi Minh Huyền Nguyễn | Mathias Rossignol
Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation