IEEE - WHITE PAPER: INDIAN LANGUAGE RESOURCES - TEXT PROCESSING SUBCOMMITTEE REPORT
INDIAN LANGUAGE RESOURCES—TEXT PROCESSING SUBCOMMITTEE REPORT
| Organization: | IEEE |
| Publication Date: | 1 January 2023 |
| Status: | active |
| Page Count: | 41 |
Scope:
The scope of this survey includes the following:
Many text processing tasks exist in the field of NLP. While some tasks require standards [e.g., POS tagsets, named entity recognition (NER)], others may not (e.g., word sense, domain terms). The survey identifies the tasks for which standardization is required. This covers standardization of the annotation (including tagset and guidelines), the definition, and the formatting for the different text processing tasks. In this survey, the authors categorize the tasks by their input along the following dimensions and explore the tasks in each dimension and their need for standardization:
a. Character-level
b. Word-level
c. Sentence-level
d. Discourse-level
e. Code-mixed
In addition to the above, the survey also includes a few end-user applications (e.g., question answering (QnA), summarization) as case studies to understand the need for standardization for specific tasks.
A detailed survey of the existing standards, focusing primarily on Indian languages (ILs). The survey also includes global standards, where available, to understand the broader picture across international languages.
Furthermore, the scope of the survey also includes documenting available resources and tools.
Identifying the gaps.
The scope of the task does not include the following:
There are many approaches to solving each text processing task, and some may correlate with the granularity of annotation (e.g., number of tags). The survey does not include any study of the approaches used to solve the tasks.
Often, multiple standards (originating from different research groups at the same time) are available for a task. This survey lists all of them but does not compare them side by side, as such a comparison is outside the scope of prestandardization. Instead, it generalizes the gaps across them.
Document History