Indigenous Education - National Academy for Educational Research | Forward-looking. Pioneering. Education Think Tank

A Study on Automatic Identification and Tagging of Chinese Separable Words

Basic Information and Abstract

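Record type
Research project
Project number
NAER-107-12-F-2-01-00-1-01
GRB number
PG10701-0116
Project title
A Study on Automatic Identification and Tagging of Chinese Separable Words
Project type
Individual project
Principal investigator
白明弘
Funding source
National Academy for Educational Research
Mode of execution
In-house research (Academy funding, Academy staff)
Executing institution
National Academy for Educational Research
Executing unit
Research Center for Translation, Compilation and Language Education
Year
2018
Start date
2018-01-01
End date
2018-12-31
Status
In progress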
Keywords
Corpus, lexicography, natural language processing, computational linguistics, machine learning, Chinese language teaching, separable words
Research focus

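Separable words (離合詞) are a linguistic phenomenon peculiar to Chinese. Because their grammar is complex, they are regarded as one of the major difficulties in teaching Chinese as a foreign language. Although many scholars have made suggestions for teaching separable words, no systematic teaching method has yet been established. Research on the teaching of separable words indicates that, for learners to fully grasp the structure and regularities of separable words, their awareness of the formal features of the separated and combined forms must be raised. Teaching materials therefore need to mark the features of separable words and provide representative example sentences and appropriate contexts.
In recent years, research on separable words has increasingly drawn on corpus observation. However, few tools exist for tagging Chinese separable words, so most Chinese-language corpora lack separable word annotation. Many researchers instead observe separable words in corpora indirectly, by exhaustively enumerating them. This approach is extremely costly in labor and time, and it depends on existing lists of separable words, so omissions in those lists may bias the observations. To address the lack of separable word annotation in corpora, some researchers have worked on automatic tagging of separable words. These studies, however, are still based on manually constructed lists of separable words and share the omission problem of the enumeration approach. Moreover, the tools they developed have not been made available to the public. Chinese corpus processing therefore still lacks a reliable separable word tagging system.
The main purpose of this project is to build a reliable system for the automatic identification and tagging of separable words. By analyzing the properties of separable words in depth, defining a metadata format for separable word tagging, and combining the latest machine learning theory with recent deep learning techniques, the project will develop a tool for automatic separable word tagging. The project will also use this tool to tag the whole of the NAER Chinese-language corpus (COCT). The tagging results will be integrated with the corpus query tool, adding separable word query and statistics functions, so as to provide Chinese-language researchers, textbook editors, and teachers with a more comprehensive Chinese-language corpus. The tagging tool produced by the project will also be made available to all corpus researchers, so that all corpus-based researchers of Chinese can share the project's results.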

Separable words are a linguistic phenomenon characteristic of Chinese. Because their grammar is complex, they are regarded as one of the major difficulties in teaching Chinese as a foreign language. Although many scholars have proposed suggestions for teaching separable words, a systematic teaching method is still lacking. Research on the teaching of separable words indicates that learners' awareness of the structural features of the separated and combined forms needs to be improved. It suggests marking the features of these words in textbooks and providing representative example sentences and contexts.
In recent years, researchers have gradually recognized that corpora are an important tool for observing separable words. However, few tools exist for identifying separable words in a corpus, so most Chinese corpora lack this annotation. Many researchers observe separable words in corpora indirectly, by enumerating all known separable words. This method is extremely laborious and time-consuming, and it relies on manually compiled lists of separable words, which are likely to be incomplete. Some researchers have therefore proposed methods for the automatic identification of separable words. However, these studies are still based on manually compiled lists of separable words. Furthermore, their identification tools have not been released, so a reliable separable word identification tool is still lacking.
The main purpose of this project is to develop a reliable automatic identification and tagging system for separable words, through in-depth analysis of separable words and the use of the latest machine learning technology.
Meanwhile, the project will use the automatic identification tool to identify and tag all separable words in the COCT corpus. The tagged COCT corpus will be combined with the corpus query system to provide query and statistics functions for separable words to Chinese language researchers, textbook editors, and instructors. The separable word identification tool will also be made openly available, so that corpus-based researchers of Chinese can share the results of this project.
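
The list-based enumeration approach that the abstract criticizes can be illustrated with a minimal Python sketch. This is not the project's system: the word list, the function name `find_separable_words`, the `max_gap` limit, and the punctuation set are all assumptions made for illustration. The sketch looks up each listed two-character word and accepts a few intervening characters between its two morphemes.

```python
import re

# Hypothetical hand-compiled list of two-character separable words.
# A real list would be far longer; any word missing from it is never found.
SEPARABLE_WORDS = ["洗澡", "見面", "幫忙"]

def find_separable_words(sentence, word_list=SEPARABLE_WORDS, max_gap=4):
    """Return (word, form, matched_text) tuples found in `sentence`.

    `form` is "joined" when the two morphemes are adjacent and
    "separated" when up to `max_gap` characters stand between them.
    This naive pattern can over-match; it does no grammatical analysis.
    """
    hits = []
    for word in word_list:
        head, tail = word[0], word[1]
        # head + shortest run of 0..max_gap non-punctuation characters + tail
        pattern = re.compile(
            re.escape(head)
            + r"([^，。！？、；：\s]{0," + str(max_gap) + r"}?)"
            + re.escape(tail)
        )
        for m in pattern.finditer(sentence):
            form = "joined" if m.group(1) == "" else "separated"
            hits.append((word, form, m.group(0)))
    return hits

if __name__ == "__main__":
    for s in ["他洗了一個澡。", "我們明天見面。", "她請了三天假。"]:
        print(s, find_separable_words(s))
```

The third sentence contains a separated form of 請假, but because 請假 is not in the assumed list, nothing is reported. This is the omission problem that motivates identifying separable words without relying solely on a fixed list.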
