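Automatic Construction of a Chinese Spelling Error Corpus and AI-Based Automatic Detection of Chinese Spelling Errors (中文錯別字語料庫自動建置與基於人工智慧技術的錯別字自動偵測之研究)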

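  • Record type

    Research project

  • Project number

  • GRB number

  • Project title

    Automatic Construction of a Chinese Spelling Error Corpus and AI-Based Automatic Detection of Chinese Spelling Errors

  • Project type

    Individual project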

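  • Principal investigator

    白明弘

  • Funding source

    National Academy for Educational Research (國家教育研究院)

  • Mode of execution

    In-house research (Academy funding, Academy staff)

  • Executing institution

    National Academy for Educational Research (國家教育研究院)

  • Executing unit

    Research Center for Translation, Compilation and Language Education (語文教育及編譯研究中心)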

  • Year

    2018

  • Start date

    2018-05-01

  • End date

    2019-04-30

  • Status

    In progress

  • Keywords

    Chinese spelling error detection, corpus, machine learning, artificial intelligence, natural language processing

  • Research theme

  •   Because the number of Chinese characters is very large, recognizing and writing them is one of the greatest difficulties for beginning learners of Chinese. In recent years, handwriting has declined, online writing pays little attention to correct character use, and advertisements deliberately substitute homophones. These trends have unwittingly shaped students' sense of correct usage and made wrongly written characters a hallmark of the internet era.
      Automatic detection of English spelling errors has been studied for decades. Building on the theoretical foundations of natural language processing, its accuracy now exceeds 95%, and most search engines and document editors support it. For Chinese, by contrast, even 70% accuracy is hard to reach. According to the researcher's analysis, there are two main reasons. First, Chinese has no delimiter between words, which makes it hard to decide whether a given character is an error. Second, the large number of characters means the error model contains a very large number of probability parameters, and estimating them requires a very large corpus of tagged spelling errors. Unfortunately, building such a corpus by hand is extremely costly and hard to scale.
      The project has two main goals. The first is to automatically build a large corpus of Chinese spelling errors from the large volume of Chinese text available on the internet. This corpus can serve as reference material for teaching and research in Chinese language education, and it can be used to develop applications related to spelling errors. The second is to combine the corpus with artificial intelligence techniques to develop an automatic Chinese spelling error detection system. Such a system can help students learn independently and help news media and publishers improve the quality of their documents, creating a virtuous circle that reduces spelling errors.
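      The second difficulty named above, estimating an error model's probability parameters from a tagged corpus, can be illustrated with a minimal Python sketch. It is not the project's method: the function name `estimate_confusion_probs`, the three-pair toy corpus, and the restriction to same-length substitution errors are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def estimate_confusion_probs(pairs):
    """Estimate P(written char | intended char) by relative frequency.

    `pairs` is an iterable of (written, intended) sentence pairs.
    Hypothetical helper: only same-length pairs are used, so only
    character substitutions are modeled.
    """
    counts = defaultdict(Counter)
    for written, intended in pairs:
        if len(written) != len(intended):
            continue  # insertions and deletions are out of scope here
        for w, c in zip(written, intended):
            counts[c][w] += 1

    probs = {}
    for intended_char, written_counts in counts.items():
        total = sum(written_counts.values())
        probs[intended_char] = {
            w: n / total for w, n in written_counts.items()
        }
    return probs


# Toy tagged corpus (invented for illustration): (written, intended).
toy_pairs = [
    ("我門去學校", "我們去學校"),  # 門 written in place of 們
    ("我們在學校", "我們在學校"),  # no error
    ("他們再家", "他們在家"),      # 再 written in place of 在
]

probs = estimate_confusion_probs(toy_pairs)
print(probs["們"])
print(probs["在"])
```

      With a vocabulary of V characters, a table like this has up to V × V entries, and most of them are never observed in a small corpus. That is one way to see why a large tagged corpus is needed.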
