About SimSeerX

SimSeerX is a search engine for finding similar documents. Given a document as input, it applies several similarity functions to identify similar documents and rank them.
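The query-then-rank flow described above can be illustrated with a minimal sketch. It is not SimSeerX's actual implementation: the similarity function here (Jaccard overlap of word shingles, in the spirit of Broder et al., 1997) is only an assumed stand-in, and the names shingles, jaccard and rank_similar, the in-memory collection and the shingle size k are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, k: int = 3) -> Set[Shingle]:
    """Return the set of k-word shingles in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    if len(words) < k:
        # Texts shorter than k words yield a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard similarity |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_similar(query: str, collection: Dict[str, str], k: int = 3) -> List[Tuple[str, float]]:
    """Score every document in the (hypothetical) collection against the query
    document and return (doc_id, score) pairs, most similar first."""
    q = shingles(query, k)
    scored = [(doc_id, jaccard(q, shingles(text, k))) for doc_id, text in collection.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    docs = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "a completely unrelated sentence about search engines",
        "c": "the quick brown fox sleeps",
    }
    for doc_id, score in rank_similar("the quick brown fox jumps over the lazy dog", docs):
        print(doc_id, round(score, 3))
```

A production system would not compare the query against every document as this sketch does; it would query an index, which is the role Solr/Lucene plays in SimSeerX.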

Similarity in SimSeerX

SimSeerX currently supports 3 notions of similarity:

Document Collections

SimSeerX currently indexes the following document collections:

More collections coming soon!


SimSeerX is built on the Play! Framework and uses Solr/Lucene with custom similarity functions for indexing and searching. Information extraction is performed using CiteSeerExtractor (Williams et al., 2014), and keyphrase extraction is performed using Maui.


The SimSeerX API is described at http://simseerx.ist.psu.edu/api.


Kyle Williams
Kyle Williams developed, runs, and maintains SimSeerX as part of his PhD research.
Prof. C. Lee Giles
Prof. C. Lee Giles is the principal investigator (PI) on the SimSeerX project.


A paper describing SimSeerX appeared in ACM Document Engineering 2014.

Williams, K., Wu, J., & Giles, C. L. (2014). SimSeerX: a similar document search engine. In Proceedings of the 2014 ACM symposium on Document engineering (pp. 143-146). ACM.


Broder, A. Z., Glassman, S. C., Manasse, M. S., & Zweig, G. (1997). Syntactic clustering of the web. Computer Networks and ISDN Systems, 29(8), 1157-1166.

Charikar, M. S. (2002). Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing (pp. 380-388). ACM.

Manku, G. S., Jain, A., & Das Sarma, A. (2007). Detecting near-duplicates for web crawling. In Proceedings of the 16th international conference on World Wide Web (pp. 141-150). ACM.

Medelyan, O., Frank, E., & Witten, I. H. (2009). Human-competitive tagging using automatic keyphrase extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3 (pp. 1318-1327). Association for Computational Linguistics.

Williams, K., Li, L., Khabsa, M., Wu, J., Shih, P. C., & Giles, C. L. (2014, June). A web service for scholarly big data information extraction. In Web Services (ICWS), 2014 IEEE International Conference on (pp. 105-112). IEEE.