Connectionists: JRC-Acquis: bilingual alignments for 231 language pairs now available

Ralf Steinberger ralf.steinberger at jrc.it
Mon Aug 6 10:17:53 EDT 2007


Bilingual alignments for all 231 language pairs of the JRC-Acquis parallel
corpus are now freely available online.

 

 

We are pleased to announce that the bilingual alignments for all 231
language pairs of the JRC-Acquis corpus are now available online for
download. The JRC-Acquis is a freely downloadable multilingual parallel
corpus in 22 languages comprising a total of over 1 billion words.

 

SIZE AND FORMAT

 

- 22 languages (all official EU languages except Irish)

- Average corpus size per language: 28.9 million words, plus 19 million words
in annexes, etc.

- 23,000 texts per language (fewer in Bulgarian, Maltese and Romanian)

- XML format according to TEI P4, UTF-8-encoded

- Aligned bilingually at paragraph level (often equivalent to sentences or
sentence parts), using Vanilla.

- Modular: download the languages you need.
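
As a rough illustration of working with UTF-8-encoded, TEI-P4-style XML, the following Python sketch reads numbered paragraphs from a small document. The sample document, the element layout (p elements carrying an n attribute) and the function name read_paragraphs are assumptions made for this example; the layout of the actual corpus files may differ and should be checked against the downloaded data.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-P4-style document, invented for illustration only;
# the element and attribute layout of the real corpus files may differ.
SAMPLE = """<TEI.2 id="doc-en" lang="en">
  <text>
    <body>
      <p n="1">First paragraph.</p>
      <p n="2">Second paragraph.</p>
    </body>
  </text>
</TEI.2>"""


def read_paragraphs(xml_text):
    """Map each paragraph's n attribute to its text content."""
    root = ET.fromstring(xml_text)
    return {p.get("n"): "".join(p.itertext()).strip() for p in root.iter("p")}


paragraphs = read_paragraphs(SAMPLE)
print(paragraphs)
```

Keying paragraphs by their number is one convenient choice when paragraphs in two languages are later matched up by alignment links.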

 

LANGUAGES

 

Bulgarian, Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish,
French, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish,
Portuguese, Romanian, Slovak, Slovene, Spanish, Swedish.
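
The figure of 231 language pairs follows from counting the unordered pairs that can be formed from these 22 languages (22 x 21 / 2). A minimal Python sketch of that count; the function name language_pairs is chosen for illustration and is not part of the corpus distribution:

```python
from itertools import combinations

# The 22 languages listed in the announcement.
LANGUAGES = [
    "Bulgarian", "Czech", "Danish", "Dutch", "English", "Estonian",
    "German", "Greek", "Finnish", "French", "Hungarian", "Italian",
    "Latvian", "Lithuanian", "Maltese", "Polish", "Portuguese",
    "Romanian", "Slovak", "Slovene", "Spanish", "Swedish",
]


def language_pairs(languages):
    """Return every unordered pair of distinct languages."""
    return list(combinations(languages, 2))


pairs = language_pairs(LANGUAGES)
print(len(pairs))  # 22 * 21 / 2 = 231
```

The same count for 21 languages gives 210 pairs.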

 

TEXT TYPES

 

- Documents on contents, principles and political objectives of the EU
Treaties;

- EU legislation;

- Declarations;

- Resolutions;

- Acts;

- International agreements.

 

PARAGRAPH ALIGNMENT

 

Paragraph alignment for all 231 language pairs was carried out with the
Vanilla aligner and is available for download. Paragraphs in the JRC-Acquis
are frequently equivalent to sentences or even sentence parts. Version 2.2
of the JRC-Acquis corpus (210 language pairs, still available on the same
website) was additionally aligned with HunAlign.

 

- Paragraph-aligned for all 231 language pairs;

- Paragraphs are sentence parts, sentences, or groups of sentences;

- Using the Vanilla aligner;

- Over 1 million alignments per language pair (on average for all language
pairs);

- 85.43% one-to-one alignments (on average for all language pairs).
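
To make the one-to-one figure concrete, the following Python sketch computes the share of one-to-one links in a list of alignment links. It assumes each link is represented as a pair of paragraph-id lists (source, target); this representation, the sample data and the function name one_to_one_share are hypothetical and do not reflect the aligner's actual output format.

```python
def one_to_one_share(links):
    """Percentage of links joining exactly one source to one target paragraph."""
    if not links:
        return 0.0
    one_to_one = sum(1 for src, tgt in links if len(src) == 1 and len(tgt) == 1)
    return 100.0 * one_to_one / len(links)


# Hypothetical alignment links: (source paragraph ids, target paragraph ids).
sample_links = [
    ([1], [1]),
    ([2], [2]),
    ([3, 4], [3]),  # two source paragraphs aligned to one target paragraph
    ([5], [4]),
]
print(one_to_one_share(sample_links))  # 75.0
```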

 

MANUAL SUBJECT DOMAIN CLASSIFICATION

 

- Manually classified according to EUROVOC subject domains;

- Selected from 6000 hierarchically organised, wide-coverage classes;

- Suitable for experiments with multilingual multi-label categorisation.

 

USE / DOWNLOAD

 

- Download from http://langtech.jrc.it/JRC-Acquis.html;

- Usage free for research purposes.

 

FOR MORE DETAILS

 

You will find a detailed description of version 2.2 of the corpus in the
following paper. Please use the following reference when you mention the
JRC-Acquis in any publications. We would be pleased to hear how you use the
corpus.

 

Steinberger Ralf, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž
Erjavec, Dan Tufiş, Dániel Varga (2006). 'The JRC-Acquis: A multilingual
aligned parallel corpus with 20+ languages'. Proceedings of the 5th
International Conference on Language Resources and Evaluation (LREC'2006).
Genoa, Italy, 24-26 May 2006. Available at
http://langtech.jrc.it/#Publications.


 

 

The JRC's Language Technology group
specialises in the development of highly multilingual text analysis tools
and in cross-lingual applications. An example is our multilingual (19
languages) news analysis application NewsExplorer, publicly accessible at
http://press.jrc.it/NewsExplorer.

Related JRC developments (both covering
22+ languages):

- NewsBrief (http://press.jrc.it): breaking news detection and display of
the very latest thematic news from around the world;

- Medical Information System MedISys (http://medusa.jrc.it): displays the
latest health-related news from around the world according to themes and
diseases.

Ralf Steinberger
European Commission - Joint Research Centre (JRC)
IPSC - SeS - EMM - Language Technology

http://langtech.jrc.it,
http://press.jrc.it/NewsExplorer




