Sketch Engine

Original author(s)	Adam Kilgarriff, Pavel Rychlý
Developer(s)	Lexical Computing CZ s.r.o.
Initial release	23 July 2003
Written in	Go, JavaScript, jQuery, C++, Python
Operating system	Linux, Mac OS X
Platform	IA-32, x64 or IA-64
Standard(s)	Unicode
Available in	11 languages
	List of languages Arabic, Crimean Tatar, Czech, English, French, German, Irish, Italian, Nko, Spanish, Ukrainian
Type	Corpus manager for 90+ languages, database management system
License	Proprietary software; both commercial and freeware editions are available
Website	www.sketchengine.eu

Sketch Engine is a corpus manager and text analysis software developed by Lexical Computing CZ s.r.o. since 2003. Its purpose is to enable people studying language behaviour (lexicographers, researchers in corpus linguistics, translators or language learners) to search large text collections using complex, linguistically motivated queries. Sketch Engine takes its name from one of its key features, word sketches: one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour.^[2] It currently supports and provides corpora in 90+ languages.^[3]

History of development

Sketch Engine is a product of Lexical Computing Limited, a company founded in 2003 by the lexicographer and research scientist Adam Kilgarriff.^[4] Kilgarriff, who introduced the concept of word sketches, began collaborating with Pavel Rychlý, a computer scientist at the Natural Language Processing Centre, Masaryk University,^[5] and the developer of Manatee and Bonito (two major parts of the software suite).

Since then, Sketch Engine has been commercial software. However, all the core features of Manatee and Bonito that were developed by 2003 (and extended since then) are freely available under the GPL within the NoSketch Engine suite.^[6]

Features

A list of tools available in Sketch Engine:

Word sketches – a one-page, automatically derived summary of a word's grammatical and collocational behaviour
Word sketch difference – compares and contrasts two words by analysing their collocations
Distributional Thesaurus – automated thesaurus finding words with similar meaning or appearing in the same/similar context
Concordance search – finds examples of a word form, lemma, phrase, tag or complex structure
Collocation search – word co-occurrence analysis displaying the words that most frequently co-occur with a search word and can be regarded as collocation candidates
Word lists – generates frequency lists which can be filtered with complex criteria
n-grams – generates frequency lists of multi-word expressions
Terminology / Keyword extraction (both monolingual and bilingual) – automatic extraction of keywords and multi-word terms from texts (based on frequency counts and linguistic criteria)
Diachronic analysis (Trends)^[7] – detects words whose frequency of use changes over time (shows trending words)
Corpus building and management – creates corpora from the Web or from uploaded texts, including part-of-speech tagging and lemmatization; the resulting corpora can be used for data mining
Parallel corpus (bilingual) facilities – looking up translation examples (EUR-Lex corpus, Europarl corpus, OPUS corpus, etc.) or building a parallel corpus from the user's own aligned texts
Text type analysis – statistics of metadata in the corpus
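
Two of the tools listed above, concordance search and n-gram lists, can be illustrated with a minimal Python sketch. It is not Sketch Engine's implementation: the function names kwic and ngram_counts, the whitespace tokenisation and the sample sentence are all assumptions made for illustration.

```python
from collections import Counter

def kwic(tokens, keyword, width=3):
    """Return (left context, hit, right context) for each occurrence of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

def ngram_counts(tokens, n=2):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical toy corpus, tokenised on whitespace.
tokens = "the cat sat on the mat and the cat slept".split()

concordance = kwic(tokens, "cat", width=2)
bigrams = ngram_counts(tokens, 2)

for left, hit, right in concordance:
    print(f"{left:>10} [{hit}] {right}")
print(bigrams.most_common(1))
```

A real corpus manager would run such searches over lemmatised and tagged text and against an index rather than by scanning a token list.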

Keywords and terminology extraction

This tool performs automatic term extraction, identifying words typical of a particular corpus, document, or text. It supports extracting one-word and multi-word units from monolingual and bilingual texts. The terminology extraction feature provides a list of relevant terms based on comparison with a large corpus of general language. The tool is also available as a separate service, OneClick Terms, with a dedicated interface.^[8]
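
The idea of comparing a text with a reference corpus can be sketched as follows. This is one simple way of scoring keywords, a smoothed ratio of per-million frequencies; the text above does not specify the formula Sketch Engine uses, so the function name keyness, the smoothing constant and the toy data are assumptions.

```python
from collections import Counter

def keyness(focus_tokens, reference_tokens, smoothing=1.0, top=5):
    """Rank words of the focus text by smoothed relative-frequency ratio.

    Assumes both token lists are non-empty.
    """
    focus = Counter(focus_tokens)
    ref = Counter(reference_tokens)
    scores = {}
    for word, count in focus.items():
        # Normalise raw counts to frequency per million tokens.
        f_pm = count / len(focus_tokens) * 1_000_000
        r_pm = ref[word] / len(reference_tokens) * 1_000_000
        # The smoothing constant avoids division by zero for unseen words.
        scores[word] = (f_pm + smoothing) / (r_pm + smoothing)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]

# Hypothetical toy data; real reference corpora contain billions of words.
focus = "corpus query corpus index the the".split()
reference = "the the the the a of corpus and in to".split()

ranked = keyness(focus, reference)
for word, score in ranked:
    print(word, round(score, 2))
```

Words absent from the reference corpus rank highest, while frequent general-language words such as "the" fall to the bottom. A larger smoothing constant shifts the ranking towards more frequent words.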

List of text corpora

Sketch Engine provides access to more than 700 text corpora. There are monolingual as well as multilingual corpora of different sizes (from thousands of words up to 60 billion words) and from various sources (web, books, subtitles, legal documents, etc.). The list of corpora includes the British National Corpus, the Brown Corpus, the Cambridge Academic English Corpus and the Cambridge Learner Corpus, the CHILDES corpora of child language, OpenSubtitles (a set of 60 parallel corpora), 24 multilingual corpora of EUR-Lex documents, the TenTen Corpus Family (multi-billion-word web corpora), trends corpora (monitor corpora with daily updates), etc.

Architecture

Thesaurus cloud of the lemma work in Sketch Engine

Sketch Engine consists of three main components: an underlying database management system called Manatee, a web-based search front end called Bonito, and a web interface for corpus building and management called Corpus Architect.^[9]

Manatee

Manatee is a database management system specifically devised for effective indexing of large text corpora. It is based on the idea of inverted indexing (keeping an index of all positions of a given word in the text). It has been used to index text corpora comprising tens of billions of words.^[10]
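
The principle of inverted indexing can be shown with a minimal Python sketch. It only illustrates the idea of storing every position of each word; it is not Manatee's data structure, and the names build_index and find_phrase are hypothetical.

```python
from collections import defaultdict

def build_index(tokens):
    """Map each word to the sorted list of positions where it occurs."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok].append(pos)
    return index

def find_phrase(index, phrase):
    """Return start positions where the words of phrase occur consecutively."""
    if not phrase:
        return []
    # Positions of each later word, as sets for constant-time lookup.
    later = [set(index.get(word, ())) for word in phrase[1:]]
    # A start position is a hit if word k+1 occurs exactly k+1 tokens later.
    return [
        start
        for start in index.get(phrase[0], ())
        if all(start + k + 1 in positions for k, positions in enumerate(later))
    ]

# Hypothetical toy corpus.
tokens = "the cat sat on the mat and the cat slept".split()
index = build_index(tokens)

print(index["the"])
print(find_phrase(index, ["the", "cat"]))
```

Because a lookup touches only the position lists of the words in the query, its cost depends on how often those words occur rather than on the size of the whole corpus.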

Searching corpora indexed by Manatee is performed by formulating queries in the Corpus Query Language (CQL).^[11]
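
A CQL query describes a sequence of tokens, each constrained by attribute values written as regular expressions, for example [lemma="work" & tag="V.*"]. The following Python sketch imitates only this small subset and scans the corpus linearly; it is not Manatee's query evaluator, and the tag names and the helper functions matches and search are assumptions.

```python
import re

def matches(token, conditions):
    """True if every constrained attribute fully matches its regular expression."""
    return all(
        re.fullmatch(pattern, token.get(attr, ""))
        for attr, pattern in conditions.items()
    )

def search(tokens, query):
    """Return start positions where consecutive tokens satisfy the query."""
    hits = []
    for start in range(len(tokens) - len(query) + 1):
        if all(matches(tokens[start + k], cond) for k, cond in enumerate(query)):
            hits.append(start)
    return hits

# Hypothetical annotated corpus; the tagset is invented for this example.
corpus = [
    {"word": "She", "lemma": "she", "tag": "PRON"},
    {"word": "works", "lemma": "work", "tag": "VERB"},
    {"word": "hard", "lemma": "hard", "tag": "ADV"},
    {"word": "at", "lemma": "at", "tag": "ADP"},
    {"word": "work", "lemma": "work", "tag": "NOUN"},
]

# Roughly: [lemma="work" & tag="V.*"]
verb_work = search(corpus, [{"lemma": "work", "tag": "V.*"}])
# Roughly: [lemma="work"] [tag="ADV"]
work_adv = search(corpus, [{"lemma": "work"}, {"tag": "ADV"}])
# Roughly: [lemma="work"]
any_work = search(corpus, [{"lemma": "work"}])

print(verb_work, work_adv, any_work)
```

Querying by lemma rather than by word form finds both "works" and "work", and adding a tag condition separates the verb from the noun.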

Manatee is written in C++ and offers an API for a number of other programming languages, including Python, Java, Perl and Ruby. It was later rewritten in Go for faster processing of corpus queries.^[12]

Bonito

Bonito is a web interface for Manatee providing access to corpus search. In the client–server model, Manatee is the server and Bonito plays the client part. It is written in Python.^[9]

Corpus Architect

Corpus Architect is a web interface providing corpus building and management features. It is also written in Python.

Applications

Sketch Engine has been used by major publishing houses in Britain and elsewhere for producing dictionaries; users include the Macmillan English Dictionary, Dictionnaires Le Robert, Oxford University Press and Shogakukan. Four of the UK's five biggest dictionary publishers use Sketch Engine.^[13]

References

[1] Companies House. Searched on the United Kingdom's registrar of companies (Company name: LEXICAL COMPUTING LIMITED or Company number: 04841901).
[2] Kilgarriff, Adam; Baisa, Vít; Bušta, Jan; Jakubíček, Miloš; Kovář, Vojtěch; Michelfeit, Jan; Rychlý, Pavel; Suchomel, Vít (10 July 2014). "The Sketch Engine: ten years on". Lexicography. 1 (1): 7–36. doi:10.1007/s40607-014-0009-9. ISSN 2197-4292.
[3] "Languages in Sketch Engine". Sketch Engine. Lexical Computing s.r.o. 7 June 2016. Retrieved 22 January 2018.
[4] Adam Kilgarriff's home page
[5] Natural Language Processing Centre, Masaryk University
[6] NoSketch Engine
[7] Kilgarriff, Adam; Herman, Ondřej; Bušta, Jan; Rychlý, Pavel; Jakubíček, Miloš (2015). "DIACRAN: a framework for diachronic analysis" (PDF). Corpus Linguistics 2015: 65–70.
[8] Baisa, Vít (2017). "Simplifying terminology extraction: OneClick Terms" (PDF). Proceedings of the 9th International Corpus Linguistics Conference.
[9] Rychlý, Pavel (2007). "Manatee/bonito–a modular corpus manager" (PDF). 1st Workshop on Recent Advances in Slavonic Natural Language Processing: 65–70.
[10] Pomikálek, Jan; Jakubíček, Miloš; Rychlý, Pavel (2012). "Building a 70 billion word corpus of English from ClueWeb" (PDF). Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12).
[11] "CQL – Corpus Query Language". Sketch Engine. Lexical Computing s.r.o. 15 May 2015. Retrieved 22 January 2018.
[12] Rychlý, Pavel; Rábara, Radoslav (2015). "Concurrent Processing of Text Corpus Queries" (PDF). Workshop on Recent Advances in Slavonic Natural Language Processing: 49–58.
[13] "Using Computational Lexicography for Dictionary Production with the Sketch Engine". REF Impact Case Studies. University of Brighton. Retrieved 18 April 2015.

Related publications

Thomas, James (March 2016). Discovering English with Sketch Engine : a corpus-based approach to language exploration. Workbook and glossary. Brno: Versatile. ISBN 9788026095798.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.
