Text lab in Pavia

Materials for Python seminar

Motivation

Use existing tools to save your time. Learn some programming to save even more of your time.

Sketch Engine (account)

ELEXIS login for European universities (including Pavia). For the rest, sign up (join a multiuser account) with phrase paviaMay2019. This will be valid for one month starting 8th May 2019. Access the new Sketch Engine.

Sketch Engine API

REST: HTTP GET; accessible from any programming language with network support. Examples online in bash, Java, R and Python.

JSON

JSON (JavaScript Object Notation) is a simple format for storing data (lists, dictionaries, strings, numbers and boolean values). Example

Online parser and validator and formatter.

To process JSON in Python you need the json library. It is part of the standard library, so import json should just work. You can then use two functions: loads converts a JSON string into a Python object, and dumps does the reverse.

Let’s try to process the example JSON from above.

import json
jsondata = json.loads('{"key": 1, "key2": "test"}')
print(jsondata["key"])
print(jsondata["key2"])
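
The reverse direction, from a Python object back to a JSON string, can be sketched as follows. The record dictionary is made up for illustration.

```python
import json

# Python object -> JSON string (True becomes true, None becomes null)
record = {"key": 1, "key2": "test", "flags": [True, None]}
text = json.dumps(record)
print(text)

# JSON string -> Python object; the round trip gives back an equal object
restored = json.loads(text)
print(restored == record)
```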

Alternatively, you may use the repl.it online Python 3 IDE.

Python requests module

In your console (or in a short script you execute), try import requests

If this doesn’t work, have a look at this how-to and install it using pip in your console.

python -m pip install requests or py -m pip install requests should work.

You can download a webpage (its HTML source) using the get function.

import requests
response = requests.get("https://www.google.com")
# response.text response.json() response.status_code
print(response.status_code)

Now we want to send some parameters (the query string in the URL after “?”):

import requests
r = requests.get("https://www.google.com/search", params={"q": "pavia"})
print("Pavia" in r.text)
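
To see what a params dictionary turns into, you can build the query string by hand with the standard library. This is only a sketch of the encoding step; requests performs a comparable encoding for you, so you normally never do this yourself. The search phrase is an illustrative value.

```python
from urllib.parse import urlencode

# Spaces and non-ASCII characters are escaped so they are safe inside a URL
params = {"q": "città di pavia"}
query_string = urlencode(params)
url = "https://www.google.com/search?" + query_string
print(url)
```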

Supported languages in SkE

import requests
r = requests.get("https://app.sketchengine.eu/ca/api/languages")
for language in r.json()["data"]:
    print(language["name"])

List of corpora

import requests
r = requests.get("https://app.sketchengine.eu/ca/api/corpora")
for corpus in r.json()["data"]:
    print(corpus["name"], corpus["language_name"], corpus["sizes"]["wordcount"])

API server, username, API key

In the UI, click the top right menu and in My account, you should see API key, or you can generate a new one. This is to tell SkE API that it is really you with an existing account.

We use a separate API server, the address is https://api.sketchengine.co.uk/corpus/. We will always send three parameters in our queries:

A trick

The new UI sends API queries in the background. Open the browser's developer console (Network tab) and have a look at what is sent to our server. You can easily replicate the same query in your own API requests.
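
A URL copied from the Network tab can be split into a method name and a parameters dictionary with the standard library. This is a minimal sketch: the URL below is a hypothetical placeholder on example.org, and its parameter names and values are illustrative only, so substitute whatever your browser actually shows.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical URL standing in for one copied from the browser's Network tab
copied = ("https://example.org/api/wordlist"
          "?corpname=preloaded%2Fittenten16_2&wlattr=lemma&wlpat=s.*")

parts = urlsplit(copied)
method = parts.path.rsplit("/", 1)[-1]      # last path segment
parameters = dict(parse_qsl(parts.query))   # percent-escapes are decoded

print(method)
print(parameters)
```

The resulting method and parameters can then be reused in your own request, after adding your own credentials.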

Request template

import requests

base_url = "https://api.sketchengine.co.uk/corpus/"
method = "..." # view, thes, wsketch, wordlist, ...
parameters = {
    "corpname": "...",
    "username": "...",
    "api_key": "...",
    "format": "json"
}

r = requests.get(base_url + method, params=parameters)
jsondata = r.json()
# process the jsondata dictionary

API capabilities

Concordance (full-text search), parallel concordances, word sketch (collocational profiles of words), distributional thesaurus, word sketch difference (comparing two words and their collocational behaviour), wordlists, n-grams, keyword and term extraction, good dictionary examples (GDEX), bilingual terminology extraction (OneClick Terms), diachronic analysis (trends in time), …

CQL: corpus query language

Have a look at the cheat sheets and at the online documentation. Basics. What every linguist should know: regular expressions.
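
If regular expressions are new to you, you can experiment with them in Python first. The sketch below assumes that a pattern has to match the whole token, as it does for attribute values in CQL, which is why it uses re.fullmatch. The word list is made up.

```python
import re

words = ["suonare", "sentire", "ascoltare", "musica", "Suono"]

# "s.*" means: the letter s followed by any characters (matching is case-sensitive)
starts_with_s = [w for w in words if re.fullmatch(r"s.*", w)]
print(starts_with_s)

# ".*are" means: anything that ends in "are"
ends_in_are = [w for w in words if re.fullmatch(r".*are", w)]
print(ends_in_are)
```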

We will try to get sentences from an Italian corpus that contain the lemma suonare followed by any noun. We will use the view method.

import requests

base_url = "https://api.sketchengine.co.uk/corpus/"
method = "view"
parameters = {
    "corpname": "ittenten16_2",
    "username": "xbaisa",
    "api_key": "...",
    "format": "json",
    "q": 'q[lemma="suonare"][tag="NOUN"]',  # CQL
    "viewmode": "sen",  # KWIC vs. sentence mode
    "pagesize": 10,  # limit the number of sentences per page
    "structs": "g"  # what structures should be in the result
}

r = requests.get(base_url + method, params=parameters)
jsondata = r.json()
for line in jsondata["Lines"]:
    ...
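
What goes into the loop body depends on the structure of each line. The sketch below assumes that every line has "Left", "Kwic" and "Right" lists whose items carry their text under "str". That shape is an assumption, and sample_line is invented data, so print a real response and adjust the keys before relying on it. The helper line_to_text is hypothetical, not part of any library.

```python
# Invented sample in the ASSUMED shape of one item of jsondata["Lines"]
sample_line = {
    "Left": [{"str": "Ho sentito"}],
    "Kwic": [{"str": "suonare"}, {"str": "campane"}],
    "Right": [{"str": "in lontananza ."}],
}

def line_to_text(line):
    """Join the strings of the left context, the match and the right context."""
    parts = []
    for segment in ("Left", "Kwic", "Right"):
        parts.extend(token.get("str", "") for token in line.get(segment, []))
    # drop empty pieces and normalise the surrounding whitespace
    return " ".join(p.strip() for p in parts if p.strip())

print(line_to_text(sample_line))
```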

API methods and parameters

Things to be aware of:

Wordlists

For any positional attribute from the corpus, you may access its lexicon (types).

The task is to get the verbs in Italian starting with “s”. We will use the console trick to eavesdrop on the right API request.

Or you may want to see all tags from the Italian tagset used in a corpus.

Thesaurus (similar words)

Distributional semantics in SkE. Very similar to word embeddings, in some situations even better.

The task is to find which similar words the adjectives “orange” and “red” have in common.

import requests

base_url = "https://api.sketchengine.co.uk/corpus/"
method = "thes"
parameters = {
    "corpname": "ententen15_tt21",
    "username": "xbaisa",
    "api_key": "...",
    "format": "json",
    "lemma": "orange",
    "pos": "-j"
}

r = requests.get(base_url + method, params=parameters)
jsondata = r.json()
orange_items = [x["word"] for x in jsondata["Words"]]

parameters["lemma"] = "red"
r = requests.get(base_url + method, params=parameters)
jsondata = r.json()
red_items = [x["word"] for x in jsondata["Words"]]

for common in set(orange_items) & set(red_items):
    print(common)

Keyword/term extraction

Create a corpus from text documents in the web interface. Extract keywords and terminology from it via API.

Simple math (Kilgarriff, 2009)

$${{\rm relfreq}_{\rm focus} + N} \over {{\rm relfreq}_{\rm ref} + N}$$
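
The formula can be tried out on a toy example. This is a sketch that assumes relative frequencies are expressed per million tokens. The counts and corpus sizes are invented, and the function names are ours, not part of the API.

```python
def relative_frequency(hits, corpus_size, per=1_000_000):
    """Frequency normalised to hits per `per` tokens."""
    return hits * per / corpus_size

def keyness(relfreq_focus, relfreq_ref, n=1.0):
    """Simple maths score: (relfreq_focus + N) / (relfreq_ref + N)."""
    return (relfreq_focus + n) / (relfreq_ref + n)

# Invented counts: 40 hits in a 200,000-token focus corpus,
# 1,000 hits in a 100,000,000-token reference corpus
focus = relative_frequency(40, 200_000)
ref = relative_frequency(1_000, 100_000_000)

print(keyness(focus, ref, n=1))
print(keyness(focus, ref, n=100))
```

A larger N dampens the score of this word, which is how N shifts the resulting list towards more frequent words.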

Have a look at the parameters, in particular, we need a reference corpus for the relative frequency.

import requests

base_url = "https://api.sketchengine.co.uk/corpus/"
method = "extract_keywords"
parameters = {
    "corpname": "user/xbaisa/kw_test",
    "username": "xbaisa",
    "api_key": "...",
    "format": "json",
    "ref_corpname": "preloaded/ententen_13_tt2_1"
}

r = requests.get(base_url + method, params=parameters)
jsondata = r.json()
for kw in jsondata["keywords"]:
    print(kw["item"])

parameters["ref_corpname"] = "preloaded/ententen13_tt2_1_term_ref"
r = requests.get(base_url + "extract_terms", params=parameters)
jsondata = r.json()
for tm in jsondata["terms"]:
    print(tm["item"])

Here the JSON response contains an “error” key. You should always check not just for “error” in the JSON but also the status_code, which reveals other issues (fair use policy (FUP) limits, authentication problems, …).
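
Both checks can be wrapped in a small helper. This is a minimal sketch: check_response is a hypothetical function of ours, and treating every status other than 200 as a failure is a simplifying assumption.

```python
def check_response(status_code, jsondata):
    """Return jsondata, or raise RuntimeError if the status or the body signals a problem."""
    if status_code != 200:
        raise RuntimeError(f"HTTP status {status_code}")
    if isinstance(jsondata, dict) and "error" in jsondata:
        raise RuntimeError(f"API error: {jsondata['error']}")
    return jsondata

# Invented payloads standing in for real responses
print(check_response(200, {"keywords": []}))
try:
    check_response(200, {"error": "something went wrong"})
except RuntimeError as exc:
    print(exc)
```

With requests you would check r.status_code before calling r.json(), because the body of a failed request may not be valid JSON, and then pass both to the helper.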

Further functions and resources