Author: Daniel Hardt
Position paper submitted for the AMTA 2010 Workshop on Collaborative Translation and Crowdsourcing.
If a translation problem has already been solved, it shouldn't have to be solved again. This is just common sense, but at the same time it constitutes an ambitious and distant goal. There are two main issues: accessibility of data and translation technology.
On the technology side, increasingly sophisticated techniques have been developed to learn from translation data and apply what has been learned to new texts. For example, SMT systems have moved from word-by-word translation to phrase translation, and techniques to exploit hierarchical structure are being developed. This progress is important, but incremental.
Accessibility of data is a very different problem, involving difficult social, economic and legal issues. While initiatives like TAUS/TDA have taken promising steps, again progress has been modest, especially in light of the ambitious goal we have set ourselves.
I would like to suggest that more dramatic progress is possible in the area of data accessibility, not with respect to the complex social issues, but by developing relevant technologies. Here's an example of this: consider a translator interactively translating a document by automatically translating each sentence and then post-editing the output. At any given point in the process, the post-edited portion of the file is very likely to contain solutions to translation problems arising later in that same file. Thus it is well worth making that data available to the MT system. In a research paper in the current AMTA, I'll present work (done together with Jakob Elming) giving evidence that within-file translation data can dramatically improve SMT translation quality. We also describe a prototype showing how this data can be exploited in real time.
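The post-editing loop described above can be illustrated with a minimal sketch. This is not the method of the research paper or its prototype, which adapt an SMT system; as a much simpler stand-in, it reuses earlier post-edits through fuzzy string matching. All names here (WithinFileMemory, translate_file, baseline_mt, post_edit) and the 0.7 similarity threshold are hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher


class WithinFileMemory:
    """Toy store of post-edited sentence pairs from the file being translated."""

    def __init__(self, threshold=0.7):
        # Assumed cutoff: minimum similarity needed to reuse an earlier solution.
        self.threshold = threshold
        self.pairs = []  # (source, post_edited_target), in document order

    def add(self, source, post_edited):
        """Record a sentence as soon as the translator has post-edited it."""
        self.pairs.append((source, post_edited))

    def lookup(self, source):
        """Return the post-edit of the most similar earlier source, or None."""
        best_score, best_target = 0.0, None
        for earlier_source, target in self.pairs:
            score = SequenceMatcher(
                None, source.lower(), earlier_source.lower()
            ).ratio()
            if score > best_score:
                best_score, best_target = score, target
        return best_target if best_score >= self.threshold else None


def translate_file(sentences, baseline_mt, post_edit):
    """Translate sentence by sentence, feeding each post-edit back into memory.

    baseline_mt(source) -> str and post_edit(source, suggestion) -> str are
    caller-supplied stand-ins for the MT system and the human translator.
    """
    memory = WithinFileMemory()
    output = []
    for source in sentences:
        # Prefer a solution already found earlier in this same file.
        suggestion = memory.lookup(source) or baseline_mt(source)
        final = post_edit(source, suggestion)
        memory.add(source, final)
        output.append(final)
    return output
```

The point of the sketch is the feedback step: every post-edited sentence becomes available before the next sentence is translated, so a problem solved once in the file does not reach the translator a second time.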
Right now, SMT systems exploit only a fraction of the potentially available data in producing translations. This is because there are technological barriers to data accessibility, and I think it might be much easier to remove these technological barriers than it is to remove the social barriers. We should build technologies that ensure that, whenever a text is being translated, the SMT system exploits all available data, tuned to the specifics of the text at hand.
Existing, accessible translation data contains the distilled wisdom of the world's translators. This is knowledge that ultimately will solve the problem of translation. SMT technology is already quite effective at extracting this knowledge from the data it is given. What we lack right now is the technology that lets systems take advantage of all of this data whenever a text is being translated. By removing these technological barriers, we can make great strides towards solving the world's translation challenges.