Abstracts
Abstract
This paper presents the Dutch Parallel Corpus, a high-quality parallel corpus for Dutch, French and English comprising more than ten million words. The corpus covers five text types and is balanced with respect to text type and translation direction. Copyright has been cleared for all texts included in the corpus. We discuss the importance of parallel corpora in various research domains and contrast the Dutch Parallel Corpus with existing parallel corpora. It distinguishes itself from other parallel corpora by its balanced composition and, thanks to its copyright clearance, by its availability to the wider research community. All texts in the corpus are sentence-aligned and enriched with basic linguistic annotations (lemmas and word class information). Approximately 25,000 words of the Dutch-English part have been manually aligned at the sub-sentential level. Rich metadata make the corpus easy to navigate and enable users to select the texts that meet their needs. The entire corpus is released as full texts in XML format and is also available via a web interface, which supports basic and complex search queries and presents the results as parallel concordances. The corpus will be distributed by the Flemish-Dutch Human Language Technology Agency (TST-Centrale).
Keywords:
- parallel corpus
- corpus-based translation studies
- corpus linguistics
- copyright clearance
- web interface
Résumé
Le présent article décrit un corpus parallèle de grande qualité en néerlandais, en français et en anglais contenant 10 millions de mots (DPC, pour Dutch Parallel Corpus). Les différents types textuels, au nombre de cinq, sont équilibrés, ainsi que les différentes directions de traduction. Tous les problèmes relatifs aux droits d’auteurs ont été résolus. L’importance de la disponibilité des corpus parallèles dans plusieurs domaines de recherche est discutée et nous comparons le DPC avec d’autres corpus multilingues actuels. Le DPC se distingue par sa composition équilibrée et par le fait qu’il est offert à l’ensemble des chercheurs, car il est libre de droits. Les textes sont alignés au niveau de la phrase et enrichis avec des annotations linguistiques (lemme, étiquettes morphologiques). De plus, environ 25 000 mots (dans la partie néerlandais-anglais) ont fait l’objet d’un alignement manuel sous-phrastique. La richesse des métadonnées permet d’effectuer un certain nombre de sélections adaptées aux besoins de l’utilisateur. L’exploitation se fait de deux manières : d’une part, il est possible d’accéder à l’intégralité du corpus et de s’en servir en format XML. D’autre part, le corpus est consultable à travers une interface web qui autorise des requêtes simples ou complexes et présente les résultats sous forme de concordances parallèles. Le corpus sera distribué par l’Agence néerlandaise et flamande pour le traitement automatique des langues (TST-Centrale).
Mots-clés :
- corpus parallèle
- traductologie fondée sur corpus
- linguistique de corpus
- affranchissement des droits d’auteurs
- interface web