SimSeerX is a similar document search engine. It accepts a document as input and then uses several similarity functions to identify similar documents and rank them.
Similarity in SimSeerX
SimSeerX currently supports 3 notions of similarity:
- Keyphrase Similarity with keyphrases extracted using the Maui tool (Medelyan et al., 2009)
- Shingle Similarity based on sequences of tokens in a document (Broder et al., 1997)
- Simhash Similarity based on locality sensitive hashing (Charikar, 2002) and an efficient algorithm for lookup (Manku et al., 2007)
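To make the shingle and simhash notions concrete, here is a minimal Python sketch of both. It is an illustration only and not SimSeerX's implementation: the function names, the shingle length of 3 tokens, the 64-bit fingerprint, the unweighted tokens, and the use of MD5 as the per-token hash are all assumptions made for this example. The efficient fingerprint lookup of Manku et al. (2007) is not shown.

```python
import hashlib
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def shingles(tokens, k=3):
    """Return the set of contiguous k-token sequences (k=3 is an assumed value)."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def simhash(tokens, bits=64):
    """Charikar-style fingerprint: every token votes +1 or -1 on each bit."""
    counts = [0] * bits
    for token in tokens:
        # MD5 is only a convenient stable hash here (an assumption of this sketch).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:  # a bit is set only when the positive votes win
            fingerprint |= 1 << i
    return fingerprint


def hamming(x, y):
    """Number of bit positions in which two fingerprints differ."""
    return bin(x ^ y).count("1")


doc_a = tokenize("the quick brown fox jumps over the lazy dog")
doc_b = tokenize("the quick brown fox leaps over the lazy dog")

# Shingle similarity: 4 shared shingles out of 10 distinct ones.
print(jaccard(shingles(doc_a), shingles(doc_b)))  # 0.4

# Simhash similarity: a smaller Hamming distance means more similar documents.
print(hamming(simhash(doc_a), simhash(doc_b)))
```

Changing one word changes three overlapping shingles, so the shingle score drops quickly. The simhash fingerprints of two documents that share most of their tokens typically differ in only a few bits, which is the property that makes near-duplicate lookup by Hamming distance possible.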
SimSeerX currently indexes the following document collections:
More collections coming soon!
SimSeerX is built on the Play! Framework and makes use of Solr/Lucene with custom similarity functions for indexing and searching.
Information extraction is performed using CiteSeerExtractor (Williams et al., 2014) and keyphrase extraction is performed using Maui (Medelyan et al., 2009).
The SimSeerX API is described at http://simseerx.ist.psu.edu/api
Kyle Williams developed, runs and maintains SimSeerX as part of his PhD research.
Prof. C. Lee Giles is the PI on the SimSeerX project.
A paper describing SimSeerX appeared in ACM Document Engineering 2014.
Williams, K., Wu, J., & Giles, C. L. (2014). SimSeerX: a similar document search engine. In Proceedings of the 2014 ACM symposium on Document engineering (pp. 143-146). ACM.
Broder, A. Z., Glassman, S. C., Manasse, M. S., & Zweig, G. (1997). Syntactic clustering of the web. Computer Networks and ISDN Systems, 29(8), 1157-1166.
Charikar, M. S. (2002). Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing (pp. 380-388). ACM.
Manku, G. S., Jain, A., & Das Sarma, A. (2007). Detecting near-duplicates for web crawling. In Proceedings of the 16th international conference on World Wide Web (pp. 141-150). ACM.
Medelyan, O., Frank, E., & Witten, I. H. (2009). Human-competitive tagging using automatic keyphrase extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3 (pp. 1318-1327). Association for Computational Linguistics.
Williams, K., Li, L., Khabsa, M., Wu, J., Shih, P. C., & Giles, C. L. (2014, June). A web service for scholarly big data information extraction. In Web Services (ICWS), 2014 IEEE International Conference on (pp. 105-112). IEEE.