Connectionists: DGT-TM - Translation Memory for 231 language pairs available for distribution

Ralf Steinberger ralf.steinberger at jrc.it
Wed Nov 28 08:53:19 EST 2007


Apologies for cross-postings.

 

This dataset may be of interest to people and organisations working on
Statistical Machine Translation and other multilingual Machine Learning
applications.

 

 

   DGT-TM Translation Memory

   Freely available

   22 languages

   231 language pairs

   Format: TMX version 1

   http://langtech.jrc.it/DGT-TM.html

 

 

The European Commission's Directorate General for Translation (DGT) and the
Joint Research Centre (JRC) have made available a multilingual Translation
Memory (sentences and their translations, in standard TMX format) for the 22
official European Union languages Bulgarian, Czech, Danish, Dutch, English,
Estonian, German, Greek, Finnish, French, Hungarian, Italian, Latvian,
Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovene, Spanish
and Swedish.
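
The figure of 231 language pairs follows from counting unordered pairs among
the 22 languages: 22 x 21 / 2 = 231. A minimal Python sketch of that
arithmetic is given below; the variable names are illustrative only and are
not part of the DGT-TM distribution.

```python
from itertools import combinations

# The 22 languages named in the announcement.
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "German", "Greek", "Finnish", "French", "Hungarian", "Italian",
    "Latvian", "Lithuanian", "Maltese", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]

# Each unordered pair counts once: (English, French) and
# (French, English) are the same language pair.
language_pairs = list(combinations(LANGUAGES, 2))

print(len(LANGUAGES))       # 22
print(len(language_pairs))  # 231
```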

 

This release follows the public release, in May 2006, of the JRC-Acquis
multilingual parallel corpus (http://langtech.jrc.it/JRC-Acquis.html), with
sentence alignment for 231 language pairs and a total size of over 1 billion
words.

 

The data releases of DGT and JRC are in line with the general effort of the
European Commission to support multilingualism, language diversity and the
re-use of Commission information. 

 

The Translation Memory contains most, but not all, of the Acquis
Communautaire, which is the entire body of European legislation, including
all the treaties, regulations and directives adopted by the European Union
(EU) and the rulings of the European Court of Justice. Since each new
country joining the EU is required to accept the whole Acquis Communautaire,
this body of legislation is translated into 22 official EU languages. For
the 23rd official EU language, Irish, the Acquis is not translated on a
regular basis.

 

A translation memory is a collection of small text segments and their
translations. These segments can be sentences or sentence parts. Translation
memories are used to support translators by ensuring that pieces of text
that have already been translated do not need to be translated again. 
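
As an illustration of how segments and their translations can be read from a
TMX file, here is a minimal Python sketch. It assumes the general TMX layout
of translation units (tu) holding one variant (tuv) per language, each with a
segment (seg). The sample text, the language codes and the function name
extract_pairs are hypothetical and hand-written for this example; they are
not taken from DGT-TM, whose files may differ in detail.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hand-written sample for illustration only (not DGT-TM data).
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0"
          segtype="sentence" o-tmf="none" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="EN"><seg>This Regulation shall enter into force.</seg></tuv>
      <tuv xml:lang="DE"><seg>Diese Verordnung tritt in Kraft.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="EN"><seg>Done at Brussels.</seg></tuv>
      <tuv xml:lang="FR"><seg>Fait a Bruxelles.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def extract_pairs(tmx_text, src, tgt):
    """Return (source, target) segment pairs for one language pair."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text inside inline markup.
                segments[lang] = "".join(seg.itertext()).strip()
        # Keep only units that have both requested languages.
        if src in segments and tgt in segments:
            pairs.append((segments[src], segments[tgt]))
    return pairs


en_de = extract_pairs(SAMPLE_TMX, "en", "de")
print(en_de)
```

Language codes are compared in lower case here because their capitalisation
varies between tools; that choice is an assumption of this sketch.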

 

Both translation memories and parallel texts are important linguistic
resources that can be used for a variety of purposes, including:

 

*       training automatic systems for Statistical Machine Translation
(SMT); 

*       producing monolingual or multilingual lexical and semantic resources
such as dictionaries and ontologies; 

*       training and testing multilingual information extraction software; 

*       checking translation consistency automatically; 

*       testing and benchmarking alignment software (for sentences, words,
etc.). 

For usage conditions, details regarding the difference between DGT-TM and
the JRC-Acquis (http://langtech.jrc.it/JRC-Acquis.html), size information,
downloading instructions, etc., go to http://langtech.jrc.it/DGT-TM.html.
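
One of the uses listed above, checking translation consistency automatically,
can be sketched in a few lines of Python. This is only one simple reading of
the idea: it flags source segments that appear with more than one distinct
translation. The function name find_inconsistencies and the sample pairs are
hypothetical and are not drawn from DGT-TM.

```python
from collections import defaultdict


def find_inconsistencies(pairs):
    """Map each source segment to its translations if there are several."""
    targets_by_source = defaultdict(set)
    for source, target in pairs:
        targets_by_source[source.strip()].add(target.strip())
    # A source segment with more than one distinct translation is flagged.
    return {
        source: sorted(targets)
        for source, targets in targets_by_source.items()
        if len(targets) > 1
    }


# Hand-written sample pairs for illustration only.
sample_pairs = [
    ("Done at Brussels.", "Geschehen zu Bruessel."),
    ("Done at Brussels.", "Bruessel, den"),
    ("Article 1", "Artikel 1"),
    ("Article 1", "Artikel 1"),
]

flagged = find_inconsistencies(sample_pairs)
print(flagged)
```

Exact string matching is a deliberate simplification; a practical check would
probably normalise whitespace, case and numbers before comparing segments.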

 

 

Achim Blatt

Directorate General for Translation (DGT)

Unit DGT.R.3 Informatics (http://ec.europa.eu/dgs/translation/)

 

Ralf Steinberger 
European Commission - Joint Research Centre (JRC)
IPSC - SeS - Language Technology (http://langtech.jrc.it/)

 

 

The JRC's Language Technology group specialises in the development of highly
multilingual text analysis tools and in cross-lingual applications. Many
applications are accessible online, e.g.:

*       NewsExplorer (http://press.jrc.it/NewsExplorer/): multilingual news
aggregation and analysis (19 languages); allows users to navigate the news
over time and across languages; trend analysis; collects information about
people from the news; social network detection.

*       NewsBrief (http://press.jrc.it/): breaking news detection and
display of the very latest thematic news from around the world; email
alerting (22+ languages).

*       MedISys Medical Information System (http://medusa.jrc.it/): latest
health-related news from around the world according to themes and diseases
(22+ languages).

 

 

 


