Web Corpora
Italian Corpora. Spoken, written and web corpora for research and language teaching
Massimo Moneglia, LABLITA (University of Florence)

• Marco Baroni, "Corpora di italiano", entry in the Enciclopedia dell'Italiano (2010)
  http://www.treccani.it/enciclopedia/corpora-diitaliano_(Enciclopedia-dell'Italiano)/
• Isabella Chiari, web page
  http://www.alphabit.net/home/index.php?option=com_content&view=section&id=6&Itemid=11
• Manuel Barbera, Linguistica dei corpora e linguistica dei corpora italiana. Un'introduzione
  http://www.bmanuel.org/man/cl-HOME.htm

Types of Resources
• Spoken corpora
• Broadcasting corpora
• Written corpora
• Learner corpora
• Web corpora

Distribution means
• Corpora available through a web service
• Corpora on DVD
• Corpora available for free download

Query types
• Concordances
  – by lemma / form / phrase
  – ordering by context
  – ranking of types
• Patterns
• Collocation
  – general
  – restricted by PoS
  – word sketches
  – sketch difference
• CQL
• Frequency lists

Corpora in the classroom
• TALC conferences (since 1994)
• Data-driven learning
  – Students explore concordances
• Discover language facts for themselves
  – Real language
  – Test hypotheses
  – If they learn like this, they will remember
• After twenty years
  – Minority interest
  – Advanced level (university) only
  – Most teachers haven't heard of it

Do they meet student needs?
• A dictionary is much easier
• Concordances
  – slow and arduous
  – distractions, confusions
• Motivation
  – Not sexy
• “I want to learn English, not Corpus Linguistics”
Adam Kilgarriff

Spoken corpora
• Acoustic information
• Context variation
• Lessico di frequenza dell'italiano parlato (LIP) (De Mauro, Mancini, Vedovelli and Voghera, 1993)
  – Around 500,000 words (57 hours of speech)
  – No acoustic source available
  – Diaphasic and diatopic variation (recordings in Milano, Firenze, Roma and Napoli)
  – Now searchable online: http://badip.uni-graz.at/
• API/AVIP/IPAR (1999-2001)
  – Map task in different Italian varieties (Pisa, Napoli, Bari)
  – High acoustic quality
  – 75 minutes segmented into phonemes
  – Distributed on CD-ROM by CIRASS and through ftp.cirass.unina.it
  – URL: http://www.cirass.unina.it/ and www.parlaritaliano.it (server not active at the moment)

Corpora Linguistici per l'Italiano Parlato e Scritto (CLIPS, 2001-2003)
  – Around 100 hours of spoken Italian (50% male, 50% female voices)
  – Partially transcribed, with phonetic transcription
  – Recordings in Bari, Bergamo, Bologna, Cagliari, Catanzaro, Firenze, Genova, Lecce, Milano, Napoli, Palermo, Parma, Perugia, Roma, Venezia
For each city:
  - Broadcasting (news, interviews, talk shows)
  - Semi-spontaneous dialogues (map task, 240 speakers)
  - Texts read by non-professional speakers (20 sentences covering the high-frequency lexicon)
  - Telephone conversations (300 speakers)
  - Texts read by 20 professional speakers in an anechoic room (160 sentences covering high-frequency phonotactic sequences)
Free download: http://www.clips.unina.it/

Corpus LABLITA-C-ORAL-ROM
• Spontaneous Spoken Italian Corpus (Cresti, 2000)
• Large context variation (recorded in Tuscany since 1965; 1.00.000 words)
  – Corpus design
  – Text / speech synchronization by utterance
  – Tools for exploiting the acoustic information
  – Comparative approach
• Partially published as the Italian corpus within the multilingual Romance corpus C-ORAL-ROM (300,000 words, 32 hours)
  – Encrypted DVD edition for personal use (Cresti & Moneglia, eds., C-ORAL-ROM. Integrated Reference Corpora for Spoken Romance Languages. Amsterdam: Benjamins)
    http://www.benjamins.nl/#catalog/books/scl.15/
  – DVDs for laboratories, distributed by ELDA (Paris)
    http://www.elda.org/catalogue/en/speech/S0172.html

Corpus Design

Acoustic source
• Segmentation of spontaneous speech by learners
  – exploitation of transcripts
  – utterance boundaries
• Isolated listening
• Repetition
• Lexical and prosodic patterns

IPIC Data Base
• C-ORAL-ROM Italian Informal
• Multilingual platform with a comparable Brazilian Portuguese minicorpus (Spanish in preparation)
• Annotation
  – PoS / lemma
  – Information structure / prosodic parsing
• Available on the web from the LABLITA web site
• http://lablita.dit.unifi.it/ipic/ipic_access

Broadcasting
• Lessico di frequenza dell'italiano radiofonico (LIR)
  – Around 60 hours (transcribed, lemmatized, aligned)
  – On CD-ROM; at present available only on site at the Accademia della Crusca, but expected online within the VIVIT infrastructure
• Il Lessico Italiano Televisivo
• LIT and DIA-LIT: multimodal, multichannel, with transcripts searchable online
  – LIT: sampling of Rai and Mediaset broadcasts during 2006 (around 168 hours)
  – DIA-LIT: 40 hours from 1954 up to now
• http://www.italianotelevisivo.org/

Written corpora
• Language of the origins
• Literature
• Targeted resources
• Learner corpora
• Standard Italian

Language of the Origins
• Tesoro della lingua italiana delle origini (TLIO)
• The TLIO is based on the OVI textual corpus of Old Italian, which can be consulted in full.
  – Text database: 2,001 texts; 21,911,171 tokens (2012); 26,504 lemmas
  – Italian written before 1375 (death of Boccaccio), both poetry and prose
  – Searchable online
• Banche Dati dell'Opera del Vocabolario Italiano
• http://www.ovi.cnr.it/index.php?page=banchedati
• http://gattoweb.ovi.cnr.it/(S(fbi1pu45pc0jdxqnq2d2wi55))/CatForm01.aspx
• CT "Corpus Taurinense"
  – Corpus of Old Italian (21 texts, 13th century, Firenze): 259,299 tokens; 21,087 types; 7,599 lemmas
  – Built from the same set of Old Florentine texts chosen by Lorenzo Renzi and Giampaolo Salvi for their ItalAnt, Grammatica dell'italiano antico
• This set of texts is a subset of the TLIO, Tesoro della lingua italiana delle origini, kindly supplied by Pietro Beltrami (OVI). Lemmatization and PoS tagging follow the EAGLES specifications.
  – http://www.corpora.unito.it/italant/index.html

Italian Literature
Letteratura Italiana Zanichelli (Picchi & Stoppelli, CD-ROM)
1000 works of Italian literature by 245 authors, from Francesco d’Assisi’s Cantico delle creature to Italo Svevo’s La Coscienza di Zeno. The search interface allows the creation of word indices by alphabetical order, frequency, and incipit.

Primo Tesoro della Lingua Letteraria del Novecento (De Mauro, ed.)
• Selection of 100 novels among those presented at the Premio Strega from 1947 to 2006 (the winners and some of the more significant others)
• 8 million words, lemma / PoS annotation
• DVD published by UTET, with internal search engine and statistics

• Corpus e Lessico di Frequenza dell'Italiano Scritto (CoLFIS)
• 3,150,075 words taken from newspapers, magazines and miscellaneous texts, balanced according to their impact on the Italian audience
• Description:
http://www.istc.cnr.it/material/database/colfis/
• Download (partial): http://www.ge.ilc.cnr.it/strumenti.php

Targeted Corpora from UNITO and UNIBO
• Athenaeum Corpus: corpus of written academic Italian from the University of Torino
  – Various text types: 306,927 tokens; 32,221 types; 11,748 lemmas
• Jus Jurium (in progress): a free Italian corpus covering the full legal universe of discourse current in Italy
  http://www.bmanuel.org/projects/
• The Bononia Legal Corpus (BoLC): a multilingual comparable legal corpus, with parallel corpora in Italian and English
  http://corpora.ficlit.unibo.it/

Learner corpora
Corpus LIPS (Lessico Italiano Parlato di Stranieri)
• Transcripts from CILS (Certificazione di Italiano come Lingua Straniera) of the Università per Stranieri di Siena
• Oral exams only (bidirectional and monodirectional exchanges); around 700,000 words (100 hours)
• Lemmatized with TreeTagger
• Frequency lists for each learning level
• Free download from www.Parlaritaliano.it (not active at the moment?)

VALICO: Varietà di Apprendimento della Lingua Italiana
VINCA: Varietà di Italiano di Nativi Corpus Appaiato
(Barbera, Marello & Corino)
VALICO is a multilingual learner corpus: free texts, translations, and written texts elicited by iconic stimuli
  – Main languages: English, French, Spanish, German
  – Samples of texts from less represented languages (Maltese, Polish, Japanese, Arabic, Serbian, Portuguese, Hungarian)
VINCA is a parallel corpus of texts written by native-speaker informants
http://www.valico.org/
Main Italian corpora on the web
Reference corpus
  – CORIS/CODIS
Opportunistic
  – Corpus la Repubblica
Web corpora
  – NUNC
  – WEBBIT
  – ItWaC
  – itTenTen
  – Paisà
  – RIDIRE

Corpus of contemporary written Italian: CORIS/CODIS
  – Around 130 million words (updated every two years)
  – Corpus design: balanced as a reference corpus
  – PoS-tagged / lemmatized
  – Searchable online
  – Allows selection of subcorpora for domain-specific research
• http://corpora.dslo.unibo.it/coris_ita.html
  – COrpus di Riferimento dell'Italiano Scritto (CORIS)
  – COrpus Dinamico dell'Italiano Scritto (CODIS): allows searches on balanced subcorpora of the CODIS corpus
Example query: [lemma="andare"] [pos="PREP"]

Corpus la Repubblica (Bologna / Forlì), 2004
• Searchable online
• http://dev.sslmit.unibo.it/corpora/corpus.php?path=&name=Repubblica
• Opportunistic corpus taken from the newspaper “La Repubblica”, 1985-2000
  – PoS-tagged / lemmatized
  – Searchable online
  – Categorized by genre and topic
General labels: news report and comment
Topic labels: church, culture, economics, education, news, politics, science, society, sport, weather

New generation Web corpora
• Representativeness
• Technical problems (boilerplate cleaning and deduplication tools)
  – Cleaning HTML pages: defining what counts as text (HTML code, images, banners, menus, headers, links)
  – Duplicated pages
  – Ephemeral pages
  – Processing format

Representativeness of the language on the web
• The Internet is the largest repository of linguistic information
• It is the main environment for the use of written information in all domains

NUNC "NewsgroupsUseNet Corpora"
• Multilingual corpus based on newsgroups in various semantic domains
• More than 600 million words per language: It. De. Fr. En. Es. Ma. Su. Ee. Pt.
  – NUNC Italiano (part I)
  – NUNC Italiano (part II)
• NUNC Cucina
• NUNC Motori
• NUNC Foto
• NUNC Cinema
M. Barbera, S. Colombo, E. Corino, C.
Marello, http://www.corpora.unito.it http://www.bmanuel.org/projects/

WEBBIT
• Corpus of Italian web pages, over 150 million words
  http://clic.cimec.unitn.it/marco/webbit/
• Searchable online
Sampling strategy
• 1. Selection of keywords: 500 frequent forms
• 2. Querying Google: 5,000-8,000 queries, each a string of 4 words
• 3. Downloading: processing of the first 10 pages returned

WaCky: Web-as-Corpus kool ynitiative
ItWaC, the first Italian web corpus
  – 2 billion words from the Web in the .it domain
  – PoS / lemma tagged
WebBootCaT: a web tool for instant corpora. Marco Baroni, Adam Kilgarriff, Jan Pomikálek, Pavel Rychlý (2006), Proc. Euralex, Torino, Italy

WaCky / ItWaC sampling strategy
  – Selection of seeds: querying Google with pairs of mid-frequency words taken from the la Repubblica corpus, plus a basic Italian vocabulary list
  – Crawling of the web sites corresponding to the seeds
• Marco Baroni, Silvia Bernardini, Adriano Ferraresi, Eros Zanchetta, The WaCky wide web: a collection of very large linguistically processed web-crawled corpora, Language Resources and Evaluation, 2009

ItWaC
ItWaC is searchable through various web interfaces based on IMS/CWB and NoSketch Engine
Download free of charge on request
Free search: http://nl.ijs.si/noske/wacs.cgi/first_form
Info and frequency lists: http://wacky.sslmit.unibo.it/doku.php?id=download

TenTen corpora
• New generation of Web corpora
• Created by Web crawling and processed with the latest boilerplate cleaning and de-duplication tools
• "TenTen" designates the target size of the corpora, which is 10^10 (10 billion) words
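
The TenTen bullets mention de-duplication without showing what it involves. As a rough illustration only, here is a minimal Python sketch of one common idea: treating two pages as near-duplicates when their sets of word n-grams ("shingles") overlap heavily, measured by Jaccard similarity. This is not the tool used to build the TenTen corpora; the function names, the shingle size of 3 and the 0.8 threshold are all assumptions made for the example.

```python
# Illustrative near-duplicate filter; NOT the actual TenTen pipeline.
# The shingle size (3) and the threshold (0.8) are arbitrary assumptions.

def shingles(text, n=3):
    """Return the set of word n-grams of a lowercased text."""
    words = text.lower().split()
    if len(words) < n:
        # Very short texts are represented by a single shingle (or none).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (defined as 1.0 for two empty sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, threshold=0.8):
    """Keep a document only if it is not too similar to one already kept."""
    kept, kept_shingles = [], []
    for doc in docs:
        sh = shingles(doc)
        if all(jaccard(sh, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(sh)
    return kept


pages = [
    "Il corpus è stato raccolto dal web italiano nel 2010",
    "Il corpus è stato raccolto dal web italiano nel 2010",  # exact copy
    "La cucina italiana è famosa in tutto il mondo",
]
unique_pages = deduplicate(pages)
print(len(unique_pages))  # the exact copy is dropped, so 2 pages remain
```

The pairwise comparison is quadratic in the number of pages, so it only suits small collections; corpora of billions of words need more scalable techniques.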
itTenTen, initial version: 3.1 billion tokens
https://the.sketchengine.co.uk/login/

Web corpus Paisà
• Web corpus (around 250 million tokens) from Creative Commons texts
• PoS tagging (Istituto di Linguistica Computazionale, Pisa)
• Syntactic dependencies in CoNLL format, produced with the DeSR parser
• Free download
• Online search
  – Search by form and lemma, plus syntactic distribution
  – http://www.corpusitaliano.it/

Relations

RIDIRE: targeted web corpus (around 2 billion words)
Sampling strategy: target language usage in the domains that can be of interest to a learner
• Domains that characterize a functional use of the language
• Semantic domains in which Italian culture excels

Semantic Domains vs Functional Domains
Semantic domains
1- Cooking (100 MLN)
2- Literature and Theatre (100 MLN)
3- Architecture & Design (100 MLN)
4- Sport (100 MLN)
5- Fashion (100 MLN)
6- Music (100 MLN)
7- Religion (100 MLN)
8- Cinema (100 MLN)
9- Fine arts (100 MLN)
Functional domains
1- News (400 MLN)
2- Law & Administration (300 MLN)
3- Business (300 MLN)

http://www.ridire.it/it.drwolf.ridire/home.seam
Beta version: 750 million words. User: demo, Pass: demolima (version 1.0 release: December 2013)

CORpora DIdattiCi - LABLITA
• CorDIC-scritto
• CorDIC-parlato
  – Two strictly comparable resources for comparing the spoken and written varieties for didactic purposes
• http://corporadidattici.lablita.it/
• 500,000 words each, in 200 samples (2,500 words on average)

Written
Domain         N. texts   N. words   Share
art                  40    101,299   20.15%
bureaucracy          40     98,814   19.66%
creative             40    101,725   20.24%
economy              40    100,072   19.91%
newspapers           40    100,755   20.04%
Total               200    502,665

Spoken
Context        N. texts   N. words   Share
private              82    193,905   38.86%
public               86    198,468   39.77%
broadcasting         32    106,638   21.37%
Total               200    499,011

Interaction (natural context)
               N. texts   N. words   Share
dialogues           115    266,095   67.82%
monologues           53    126,278   32.18%
Total               168    392,373
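
The percentage figures reported for the CorDIC written and spoken corpora can be recomputed from the word counts. The following Python sketch uses only the counts given on the slide; the dictionary and function names are illustrative, not part of any CorDIC tool.

```python
# Word counts copied from the CorDIC figures above; all names are illustrative.
written = {
    "art": 101299,
    "bureaucracy": 98814,
    "creative": 101725,
    "economy": 100072,
    "newspapers": 100755,
}
spoken = {"private": 193905, "public": 198468, "broadcasting": 106638}


def shares(counts):
    """Map each category to its percentage of the total, rounded to 2 places."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


written_shares = shares(written)
spoken_shares = shares(spoken)

for label, counts, pct in (("written", written, written_shares),
                           ("spoken", spoken, spoken_shares)):
    print(label, "total words:", sum(counts.values()))
    for name, value in pct.items():
        print(f"  {name:<12} {value:5.2f}%")
```

Each written domain comes out close to 20%, which matches the design goal of two balanced, strictly comparable resources of about 500,000 words each.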