Language resource management — Word segmentation of written texts — Part 1: Basic concepts and general principles
Publication Date: 1 November 2010
ICS Code (Writing and transliteration): 01.140.10
This part of ISO 24614 presents the basic concepts and general
principles of word segmentation, and provides language-independent
NOTE 1 In language-related research and industry, the word is a fundamental and necessary concept. It is thus critical to have a universal definition of what constitutes a word for the purposes of segmenting a text into words. One cannot simply use rules based only on spaces and punctuation to delimit words. Such rules do not account for situations such as hyphenated compounds, abbreviations, idioms or word-like expressions that contain symbols or numbers. Word segmentation is even more problematic for languages that do not use spaces to separate words, such as Chinese and Japanese, and for agglutinative languages such as Korean, where some functional word classes are realized as affixes.
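The limits of space-and-punctuation rules described in NOTE 1 can be illustrated with a minimal Python sketch. The sample sentences and both splitting functions are assumptions made for illustration only; neither function is a method defined by ISO 24614.

```python
import re

# Illustrative sample text (an assumption, not taken from the standard).
english = "Dr. Smith's state-of-the-art e-mail filter cost $3.50 in the U.S."
chinese = "我喜欢自然语言处理"  # written without spaces between words


def split_on_spaces(text):
    """Rule A: a word is any run of characters between whitespace."""
    return text.split()


def split_on_spaces_and_punctuation(text):
    """Rule B: a word is any run of ASCII letters or digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


space_tokens = split_on_spaces(english)
punct_tokens = split_on_spaces_and_punctuation(english)

# Rule A leaves punctuation attached to tokens ("Dr.", "$3.50").
print(len(space_tokens), space_tokens)
# Rule B breaks the hyphenated compound, the abbreviation and the number apart.
print(len(punct_tokens), punct_tokens)
# Rule A returns the whole Chinese sentence as a single token.
print(split_on_spaces(chinese))
```

The two rules disagree on the word count of the same sentence, and neither one keeps "state-of-the-art", "U.S." and "$3.50" intact as units while also separating punctuation cleanly.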
The many applications and fields that need to segment texts into words, and to which this part of ISO 24614 can therefore be applied, include the following.
Word count is the principal method for calculating the cost of a translation. Word segmentation is a standard function in translation memory systems and computer-assisted translation (CAT) tools. Word segmentation is performed by term extraction tools, which are sometimes provided in terminology management systems and CAT tools.
Most content management systems and databases allow for searching by individual words. The content being searched has to be segmented to permit matching with a search word. Furthermore, search functions require knowledge of the boundaries of words.
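Why searching by word depends on segmentation can be sketched as follows. The document collection, the `segment` helper and both search functions are hypothetical, and the regular-expression segmenter is only a stand-in for whatever segmentation a real system applies.

```python
import re

# Hypothetical two-document collection, keyed by document identifier.
documents = {
    1: "The cat sat on the mat.",
    2: "Concatenate the two strings.",
}


def segment(text):
    """Stand-in segmenter: lower-cased runs of word characters."""
    return re.findall(r"\w+", text.lower())


def substring_search(query):
    """Match the query anywhere in the raw text, ignoring word boundaries."""
    return [doc_id for doc_id, text in documents.items()
            if query.lower() in text.lower()]


def word_search(query):
    """Match the query only against whole segmented words."""
    return [doc_id for doc_id, text in documents.items()
            if query.lower() in segment(text)]


print(substring_search("cat"))  # also matches inside "Concatenate"
print(word_search("cat"))       # matches only the document containing the word "cat"
```

Without word boundaries, a search for "cat" also returns the document that merely contains "Concatenate"; with segmented content, only the whole-word match is returned.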
Text-to-speech systems generate speech based on words and therefore require word segmentation for lexicon lookup, stress assignment, prosodic pattern assignment, etc.
Various natural language processing (NLP) systems must segment text into words in order to carry out their functions. NLP systems include
- morphosyntactic processors,
- syntactic parsers,
- text classification systems, and
- corpus linguistics annotators.
Lexical resources are often evaluated by size, usually by referring to the number of words.
NOTE 2 The size of language resources is an essential benchmark
for their management. Quantifying the size of language resources is
typically achieved by counting the words. However, because NLP
applications use different segmentation methods, each calculates
the number of words differently and arrives at a different sum for
the same text. A reliable, reproducible, standard measure would
allow comparable results. This is not to say that applications may
not use their own, application-specific