SimSeerX API

SimSeerX provides a RESTful API for submitting documents and retrieving search results.
Please note: this API is experimental and may change at any time; it is not versioned.

Python Client

If you want to dive straight into using SimSeerX programmatically, a Python 2.7 client is available at https://gist.github.com/kylemarkwilliams/2b8e32a98b379d968c8d
(It is probably still worth reading the description below to understand how the client works.)

RESTful Methods

The methods supported by SimSeerX are summarized in the Table below.

Method | URL | Description | Returns | Required Parameters | Optional Parameters
------ | --- | ----------- | ------- | ------------------- | -------------------
POST | /submitFileForToken | Uploads a new file document via a form | A token uniquely identifying the submitted document | A form with file=filename set | extraction flag
POST | /submitTextForToken | Submit a string of text (internally represented as a document) | A token uniquely identifying the submitted text as a document | A form with text=your_text set | extraction flag
GET | /search/token/method | Conducts a search | A set of search results in XML format | option, ranking, hops, collection, extraction | xml, numresults, divider

Detailed Description and Examples

submitFileForToken

This is the most common way of submitting a file to SimSeerX. The method returns a token that can then be used to search a SimSeerX collection. The expected input is a multipart/form-data request with a form element named "file" that holds the file to be submitted. An optional form value "extraction" can also be specified. If it is set to "on", the document is run through an academic filter and, if it passes, header information is extracted from it using CiteSeerExtractor. The extraction filter is generally only useful when interacting with SimSeerX via the Web interface.

Curl Example

    curl -X POST -F file=@/data/file "http://simseerx.ist.psu.edu/submitFileForToken"

Python Requests Example

    import requests

    # Open the file in binary mode and close it once the upload finishes
    with open('/data/file', 'rb') as f:
        response = requests.post('http://simseerx.ist.psu.edu/submitFileForToken', files={'file': f})
    print(response.status_code, response.content)
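To make the expected input concrete, the following is a minimal sketch of what a multipart/form-data body with a "file" element and the optional "extraction" element looks like on the wire. It uses only the Python standard library. The helper name encode_multipart, the generic application/octet-stream part type, and the example filename are illustrative assumptions, not part of SimSeerX or its client.

```python
import uuid


def encode_multipart(file_bytes, filename, extraction=False):
    """Build (content_type, body) for a multipart/form-data upload.

    Hypothetical helper: the file goes in a form element named "file";
    the optional "extraction" element is added only when requested.
    """
    boundary = uuid.uuid4().hex  # random delimiter between form parts
    parts = []
    if extraction:
        parts += [
            "--" + boundary,
            'Content-Disposition: form-data; name="extraction"',
            "",
            "on",
        ]
    parts += [
        "--" + boundary,
        'Content-Disposition: form-data; name="file"; filename="%s"' % filename,
        "Content-Type: application/octet-stream",  # assumed generic type
        "",
    ]
    # Part headers end with a blank line, then the raw file bytes follow.
    head = "\r\n".join(parts).encode("utf-8") + b"\r\n"
    tail = ("\r\n--" + boundary + "--\r\n").encode("utf-8")
    return "multipart/form-data; boundary=" + boundary, head + file_bytes + tail


content_type, body = encode_multipart(b"example document text", "example.txt",
                                      extraction=True)
print(content_type)
print(len(body), "bytes")

# Sending the body performs a network request, so it is left commented out:
# import urllib.request
# req = urllib.request.Request(
#     "http://simseerx.ist.psu.edu/submitFileForToken",
#     data=body, headers={"Content-Type": content_type})
# print(urllib.request.urlopen(req).read())
```

In practice a library such as Requests builds this body for you; the sketch is only meant to show which form elements the request carries.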

submitTextForToken

This method is essentially the same as submitFileForToken, except that it lets you submit a string instead of a file. Instead of multipart/form-data, it expects a URL-encoded form (application/x-www-form-urlencoded) with a form element named "text" that holds the text to be submitted. The optional "extraction" form element is also allowed.

Curl Example

    curl -X POST -d text="This is my string" "http://simseerx.ist.psu.edu/submitText"

Python Requests Example

    import requests

    response = requests.post('http://simseerx.ist.psu.edu/submitText', data={'text': 'This is my long string'})
    print(response.status_code, response.content)
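As a rough illustration of the URL-encoded form described above, the sketch below builds (but does not send) the POST request with the Python standard library, including the optional "extraction" element. The function name build_text_submission is hypothetical, and the endpoint is simply copied from the curl example; treat both as assumptions.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint copied from the curl example; adjust if the service differs.
SUBMIT_TEXT_URL = "http://simseerx.ist.psu.edu/submitText"


def build_text_submission(text, extraction=False, url=SUBMIT_TEXT_URL):
    """Return a urllib Request carrying a URL-encoded form (hypothetical helper)."""
    fields = {"text": text}
    if extraction:
        fields["extraction"] = "on"  # optional form element
    body = urlencode(fields).encode("ascii")
    # Supplying data makes urllib issue a POST rather than a GET.
    return Request(url, data=body,
                   headers={"Content-Type": "application/x-www-form-urlencoded"})


request = build_text_submission("This is my string", extraction=True)
print(request.get_method(), request.data)

# Sending the request is a network call, so it is left commented out:
# from urllib.request import urlopen
# print(urlopen(request).read())
```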

search

This method performs the actual search. The submitFileForToken and submitTextForToken methods each return a random token if successful. This token uniquely (though only temporarily) identifies the submitted document or text within SimSeerX. A search is performed by making a GET request to search/token/method, where token is the token returned after document/text submission and method is the similarity method to use (described below). How the different search parameters work is described in:

Williams, K., Wu, J., & Giles, C. L. (2014). SimSeerX: a similar document search engine. In Proceedings of the 2014 ACM symposium on Document engineering (pp. 143-146). ACM.

The following three similarity methods are currently available: keyphrase, shingles, simhash. The following parameters are required in order to search SimSeerX:

Name | Description | Possible Values
---- | ----------- | ---------------
option | an option for the similarity method | keyphrase similarity: keyphrases, text; shingles similarity: 3, 5, 8; simhash similarity: 1, 2, 3, 4, 5
ranking | the ranking method to be used | keyphrase similarity: cosine or lucene; shingles similarity: jaccard; simhash similarity: hamming
hops | SimSeerX uses pseudo-relevance feedback. This parameter specifies how many times it should be applied | 0, 1, 2
collection | The SimSeerX collection to search through | See /listCollections
extraction | whether or not extraction was used on this document | 0 or 1 (generally this should be set to 0)


In addition to the required parameters, the following parameters are optional:

Name | Description | Possible Values
---- | ----------- | ---------------
xml | whether or not the results should be returned in XML | 1 or 0 (in almost all cases you want this to be 1)
numresults | the number of results that should be returned per hop | any integer
divider | when doing multiple hops, this is the factor by which the number of returned results should be decreased | any positive integer


This example searches with a token tmpERiZD9 using the keyphrase similarity method with the keyphrases option. Cosine similarity is used for ranking with 2 hops. The Wikipedia collection is searched, no extraction is used and the results are returned in XML.

Curl Example

    curl "http://simseerx.ist.psu.edu/search/tmpERiZD9/keyphrase?option=keyphrases&ranking=cosine&hops=2&collection=Wikipedia&extraction=0&xml=1"

Python Requests Example

    import requests

    response = requests.get('http://simseerx.ist.psu.edu/search/tmpERiZD9/keyphrase?option=keyphrases&ranking=cosine&hops=2&collection=Wikipedia&extraction=0&xml=1')
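Rather than writing the query string by hand, the search URL can be assembled from the parameters. The sketch below is one possible way to do that with the standard library; it also checks the option, ranking and hops values against the parameter tables before building the URL. The function name build_search_url is hypothetical, and the pairing of ranking methods with similarity methods (cosine or lucene for keyphrase, jaccard for shingles, hamming for simhash) is an assumed reading of the ranking row.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://simseerx.ist.psu.edu"

# Allowed values per similarity method, transcribed from the parameter tables.
OPTIONS = {
    "keyphrase": {"keyphrases", "text"},
    "shingles": {"3", "5", "8"},
    "simhash": {"1", "2", "3", "4", "5"},
}
# Assumed pairing of ranking methods with similarity methods.
RANKINGS = {
    "keyphrase": {"cosine", "lucene"},
    "shingles": {"jaccard"},
    "simhash": {"hamming"},
}


def build_search_url(token, method, option, ranking, hops, collection,
                     extraction=0, **optional):
    """Return a search URL; optional may hold xml, numresults, divider."""
    if method not in OPTIONS:
        raise ValueError("unknown similarity method: %r" % (method,))
    if str(option) not in OPTIONS[method]:
        raise ValueError("option %r is not valid for %s" % (option, method))
    if ranking not in RANKINGS[method]:
        raise ValueError("ranking %r is not valid for %s" % (ranking, method))
    if hops not in (0, 1, 2):
        raise ValueError("hops must be 0, 1 or 2")
    # Required parameters first, in the order used by the examples.
    params = {
        "option": option,
        "ranking": ranking,
        "hops": hops,
        "collection": collection,
        "extraction": extraction,
    }
    params.update(optional)
    return "%s/search/%s/%s?%s" % (BASE_URL, quote(token), method,
                                   urlencode(params))


url = build_search_url("tmpERiZD9", "keyphrase", option="keyphrases",
                       ranking="cosine", hops=2, collection="Wikipedia",
                       extraction=0, xml=1)
print(url)
```

The resulting string can be passed to curl or to requests.get in place of a hand-written URL.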