Constraining the CLAROS SPARQL endpoint

SPARQL is a powerful query language for linked data. As one might expect, it is therefore quite easy to write expensive queries: those that consume significant time, computation or memory to execute. When one knows the people doing the querying this isn’t too much of an issue, but once an endpoint has been opened to the public web things are a little different. After all, one would certainly think twice about allowing arbitrary SQL queries through a public interface.

The CLAROS SPARQL endpoint uses rate-limiting and query time-outs to protect it from abuse, unintentional or otherwise.

Rate-limiting

The CLAROS data site is built on top of Fuseki (a web-based interface to a Jena triplestore) and humfrey (a RESTful web framework for displaying data from SPARQL endpoints). humfrey allows us to mediate requests to the underlying Fuseki instance.

humfrey uses redis to maintain a lock for each IP address, which prevents a single user from issuing multiple queries concurrently. Each IP address also has a score associated with it. When a query is performed, the score is increased by the number of seconds the query takes to run; it also decays at a constant rate of 0.05 per second. If a query arrives when the score exceeds 10, we delay it by (score – 10) seconds. If the score exceeds 20, we reject the query. As an example, if I run a query that takes 7 seconds (score 7), wait a minute (reducing the score by 3, to 4) and then execute another 7-second query (score 11), a query run immediately afterwards would be delayed by one second. The code that implements this can be found on GitHub.
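The scoring policy can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not humfrey's actual code: it keeps scores in a plain dictionary rather than redis, leaves out the per-IP lock, takes the current time as an explicit argument, and the names RateLimiter, check and record are made up for the example.

```python
from dataclasses import dataclass

DECAY_PER_SECOND = 0.05   # score decay rate, from the text
DELAY_THRESHOLD = 10.0    # above this score, queries are delayed
REJECT_THRESHOLD = 20.0   # above this score, queries are rejected


@dataclass
class ClientScore:
    score: float = 0.0
    updated: float = 0.0  # time (seconds) at which `score` was last brought up to date


class RateLimiter:
    """In-memory stand-in for the redis-backed scores (hypothetical names)."""

    def __init__(self):
        self._clients = {}

    def _decayed(self, ip, now):
        # Apply the linear decay for the time elapsed since the last update.
        state = self._clients.setdefault(ip, ClientScore(updated=now))
        elapsed = max(0.0, now - state.updated)
        state.score = max(0.0, state.score - DECAY_PER_SECOND * elapsed)
        state.updated = now
        return state

    def check(self, ip, now):
        """Return the delay in seconds to apply, or None if the query is rejected."""
        score = self._decayed(ip, now).score
        if score > REJECT_THRESHOLD:
            return None
        return max(0.0, score - DELAY_THRESHOLD)

    def record(self, ip, now, duration):
        """Add a finished query's running time (in seconds) to the client's score."""
        self._decayed(ip, now).score += duration


# The worked example from the text. Like the prose, it ignores the small
# amount of decay that would accrue while the queries themselves are running.
limiter = RateLimiter()
limiter.record("192.0.2.1", now=0.0, duration=7.0)    # score: 7
limiter.record("192.0.2.1", now=60.0, duration=7.0)   # decays to 4, then rises to 11
delay = limiter.check("192.0.2.1", now=60.0)           # about one second
```

In a real deployment the score and timestamp would need to be updated atomically in the shared store, which is one reason to keep them in something like redis rather than in process memory.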

This policy allows users a buffer-zone before they hit any limits, and we don’t expect that most users will notice. However, it should have an effect on people trying to spider the data to the detriment of other users. If you need to do this, there are easier ways; contact us!

Query time-outs

Jena’s ARQ recently gained the ability to time out queries. This hasn’t yet been exposed through Fuseki, so we’re currently running a forked version of Fuseki with time-outs hard-coded at eight seconds. We hope to abandon our fork as soon as this functionality appears in Fuseki itself.



Filed under Uncategorized

The launch of CLAROS

CLAROS launched its first public service on May 17th with a web-based explorer interface (http://www.clarosnet.org), and a data-oriented service (http://data.clarosnet.org/).

Based at the e-Research Centre in Oxford, CLAROS is an international research collaboration to enable simultaneous searching of major collections of digital material about archaeology and art in university research institutes and museums. It contains material from a wide range of data partners, including the Beazley Archive, various digital archives in the Ashmolean Museum, the Arachne archive, the Lexicon of Greek Personal Names, and the Lexicon Iconographicum Mythologiae Classicae, recording over 2 million objects, places, photographs, and people.

CLAROS is a resource discovery service, and its job is to provide caching, indexing, querying and visualization services. The working practice is one of federation: CLAROS ingests a catalogue of records from each data partner and amalgamates them into a single entity, but for more detailed information about a hit we return to the partner’s original web site. CLAROS data is modelled using RDF against the CIDOC CRM ontology, and can be accessed through an open SPARQL endpoint as well as through the web site.
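For readers who have not used a SPARQL endpoint before, here is a rough sketch of how a query is sent over HTTP using only the Python standard library. The endpoint path below is an assumption made for illustration (check the data site for the real address), and the query is deliberately generic and small; the function name build_sparql_request is also made up for the example.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint path, assumed for this example only.
ENDPOINT = "http://data.clarosnet.org/sparql/"

# A deliberately cheap query: ten arbitrary triples.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""


def build_sparql_request(endpoint, query):
    """Build (but do not send) a SPARQL Protocol GET request."""
    # The SPARQL Protocol passes the query text in the `query` parameter.
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for results in the SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
```

Passing the request to urllib.request.urlopen would actually send it. Keeping a LIMIT on exploratory queries like this one is a simple way to stay clear of the limits on a public endpoint.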

CLAROS is a work in progress, with more data partners to come and a large amount of work to be done on both internal linking and linking to the wider semantic web. The first fruit of this will be the completion of work to join up the places inside CLAROS with those in geonames (http://www.geonames.org/) and Pleiades (http://pleiades.stoa.org/).

Within the 20 million data records in CLAROS there will inevitably be errors and omissions. We welcome your comments.

