General note: A character sequence often admits several segmentation alternatives, which leads to segmentation ambiguity. Handling this ambiguity properly with multi-granularity linguistic units, such as character clusters, subwords, and words, can improve word segmentation performance and reduce ambiguous boundary decisions. We investigate the potential of using various linguistic units and leveraging segmentation alternatives for character-based word segmentation. Our experimental results demonstrate improvements in segmentation performance, outperforming previous work on the BCCWJ, CTB6, and BEST2010 datasets in Japanese, Chinese, and Thai, respectively.
identifier:oai:t2r2.star.titech.ac.jp:50672947
Link to primary source (URL): http://t2r2.star.titech.ac.jp/rrws/file/CTT100902372/ATD100000413/19D10554_CHAY-INTR-Thodsaporn_thesis.pdf (fulltext)
Partner institution / database: National Institute of Informatics: Institutional Repositories DataBase (IRDB) (institutional repository)