
Pedro Ortiz Suarez

Researcher in the Speech and Language Technology team at DFKI GmbH Berlin.
27
Documents
Current affiliations
  • 1143775
  • 258630
Researcher identifiers
Contact
Website
  • https://portizs.eu

Presentation

I'm a researcher in the [Speech and Language Technology Team](https://www.dfki.de/en/web/research/research-departments/speech-and-language-technology) at [DFKI GmbH](https://www.dfki.de/en/web) Berlin. I am interested in [large corpora](https://oscar-corpus.com) for training language models, especially for under-resourced and historical languages. I work on tasks such as Named Entity Recognition (NER), dependency parsing, part-of-speech tagging, machine translation and document structuring.


Publications


The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del Moral
Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, Nov 2022, New Orleans, United States
Conference paper hal-03823922v1

Towards a Cleaner Document-Oriented Multilingual Crawled Corpus

Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot
Thirteenth Language Resources and Evaluation Conference - LREC 2022, Jun 2022, Marseille, France
Conference paper hal-03536361v1

From FreEM to D'AlemBERT

Simon Gabay, Pedro Ortiz Suarez, Alexandre Bartz, Alix Chagué, Rachel Bawden
13th Language Resources and Evaluation Conference - LREC 2022, European Language Resources Association, Jun 2022, Marseille, France. pp.3367-3374
Conference paper hal-03596653v1

Le projet FREEM : ressources, outils et enjeux pour l’étude du français d’Ancien Régime

Simon Gabay, Pedro Ortiz Suarez, Rachel Bawden, Alexandre Bartz, Philippe Gambette
TALN 2022 - Traitement Automatique des Langues Naturelles, Jun 2022, Avignon, France. pp.154-165
Conference paper hal-03701524v1

A Data-driven Approach to Named Entity Recognition for Early Modern French

Pedro Ortiz Suarez, Simon Gabay
Proceedings of the 29th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Oct 2022, Gyeongju, South Korea
Conference paper hal-03814449v1

Gallic(orpor)a : Extraction, annotation et diffusion de l’information textuelle et visuelle en diachronie longue

Benoît Sagot, Laurent Romary, Rachel Bawden, Pedro Javier Ortiz Suárez, Kelly Christensen
DataLab de la BnF : Restitution des travaux 2022, DataLab de la BnF, Dec 2022, Paris, France
Conference paper hal-03930542v1

BERTrade: Using Contextual Embeddings to Parse Old French

Loïc Grobol, Mathilde Regnault, Pedro Ortiz Suarez, Benoît Sagot, Laurent Romary
13th Language Resources and Evaluation Conference, European Language Resources Association, Jun 2022, Marseille, France
Conference paper hal-03736840v1

Expanding the content model of annotationBlock

Alexandre Bartz, Juliette Janes, Laurent Romary, Philippe Gambette, Rachel Bawden
Next Gen TEI, 2021 - TEI Conference and Members’ Meeting, Oct 2021, Virtual, United States
Conference paper hal-03380805v1

A dataset for automatic detection of places in (early) modern French texts

Simon Gabay, Pedro Javier Ortiz Suárez
NASSCFL 2021 - 50th Annual North American Society for Seventeenth-Century French Literature Conference, NASSCFL, May 2021, Iowa City / Virtual, United States. pp.5
Conference paper hal-03187097v1

Ungoliant: An Optimized Pipeline for the Generation of a Very Large-Scale Multilingual Web Corpus

Julien Abadji, Pedro Javier Ortiz Suárez, Laurent Romary, Benoît Sagot
CMLC 2021 - 9th Workshop on Challenges in the Management of Large Corpora, Jul 2021, Limerick / Virtual, Ireland. ⟨10.14618/ids-pub-10468⟩
Conference paper hal-03301590v1

SinNer@Clef-Hipe2020 : Sinful adaptation of SotA models for Named Entity Recognition in French and German

Pedro Javier Ortiz Suárez, Yoann Dupont, Gaël Lejeune, Tian Tian
CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Sep 2020, Thessaloniki / Virtual, Greece
Conference paper hal-02984746v1

CamemBERT: a Tasty French Language Model

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary
ACL 2020 - 58th Annual Meeting of the Association for Computational Linguistics, Jul 2020, Seattle / Virtual, United States. ⟨10.18653/v1/2020.acl-main.645⟩
Conference paper hal-02889805v1

Establishing a New State-of-the-Art for French Named Entity Recognition

Pedro Javier Ortiz Suárez, Yoann Dupont, Benjamin Muller, Laurent Romary, Benoît Sagot
LREC 2020 - 12th Language Resources and Evaluation Conference, May 2020, Marseille, France
Conference paper hal-02617950v2

A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages

Pedro Javier Ortiz Suárez, Laurent Romary, Benoît Sagot
ACL 2020 - 58th Annual Meeting of the Association for Computational Linguistics, Jul 2020, Seattle / Virtual, United States. ⟨10.18653/v1/2020.acl-main.156⟩
Conference paper hal-02863875v2

Building a User-Generated Content North-African Arabizi Treebank: Tackling Hell

Djamé Seddah, Farah Essaidi, Amal Fethi, Matthieu Futeral, Benjamin Muller
ACL 2020 - 58th Annual Meeting of the Association for Computational Linguistics, Jul 2020, Seattle / Virtual, United States. ⟨10.18653/v1/2020.acl-main.107⟩
Conference paper hal-02889804v1

Les modèles de langue contextuels Camembert pour le français : impact de la taille et de l'hétérogénéité des données d'entrainement

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary
JEP-TALN-RECITAL 2020 - 33ème Journées d’Études sur la Parole, 27ème Conférence sur le Traitement Automatique des Langues Naturelles, 22ème Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues, Jun 2020, Nancy / Virtual, France. pp.54-65
Conference paper hal-02784755v3

French Contextualized Word-Embeddings with a sip of CaBeRnet: a New French Balanced Reference Corpus

Murielle Fabre, Pedro Javier Ortiz Suárez, Benoît Sagot, Éric Villemonte de La Clergerie
CMLC-8 - 8th Workshop on the Challenges in the Management of Large Corpora, May 2020, Marseille, France
Conference paper hal-02678358v1

Preparing the Dictionnaire Universel for Automatic Enrichment

Pedro Javier Ortiz Suárez, Laurent Romary, Benoît Sagot
10th International Conference on Historical Lexicography and Lexicology (ICHLL), Jun 2019, Leeuwarden, Netherlands
Conference paper hal-02131598v1

How OCR Performance can Impact on the Automatic Extraction of Dictionary Content Structures

Mohamed Khemakhem, Ioana Galleron, Geoffrey Williams, Laurent Romary, Pedro Javier Ortiz Suárez
19th annual Conference and Members’ Meeting of the Text Encoding Initiative Consortium (TEI) - What is text, really? TEI and beyond, Sep 2019, Graz, Austria
Conference paper hal-02263276v1

Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures

Pedro Javier Ortiz Suárez, Benoît Sagot, Laurent Romary
7th Workshop on the Challenges in the Management of Large Corpora (CMLC-7), Jul 2019, Cardiff, United Kingdom. ⟨10.14618/IDS-PUB-9021⟩
Conference paper hal-02148693v1

A Data-driven Approach to Named Entity Recognition for Early Modern French

Pedro Ortiz Suarez, Simon Gabay
Computational Linguistics, Oct 2022, Gyeongju, South Korea. Proceedings of the 29th International Conference on Computational Linguistics, pp.3722-3730
Conference poster hal-04246946v1