=====================
PhD thesis position
=====================

Summary :
---------
3-year position in Computer Science
Laboratoire Hubert Curien, UMR CNRS 5516, Saint-Etienne, France.
Apply before 7th May 2015 to christophe.gravier@univ-st-etienne.fr
Start your thesis between September and October 2015.

For more information, contact Christophe Gravier <christophe.gravier@univ-st-etienne.fr> or Julien Subercaze <julien.subercaze@univ-st-etienne.fr>

Description :
-------------
On the Web and in social networks, textual similarity search is a problem of the utmost practical interest. Devising metrics for textual content (Web pages, tweets, ...) serves today's Web services: search engines, recommender systems, and online advertising, to name a few. In this context, one of the popular strategies to overcome scalability issues is semantic hashing, which was proposed in the early 2000s [2, 4]. Semantic hashing aims at embedding the data points from the high-dimensional spaces of traditional approaches into a Hamming hypercube of dimension n. In semantic hashing, each data point is associated with a binary code of length n, so that the Hamming distance between two codes approximates the distance between the corresponding points in the original space. How to perform such an embedding is an active research topic in computer science.

Many semantic hashing schemes rely on machine learning: using a corpus, binary classifiers are trained in order to satisfy different optimisation functions [6, 12, 13, 11, 3, 8, 5]. The main issues with these commonly accepted approaches are the following:
- data dependency : the entire corpus, or a sufficient portion of it, needs to be known in advance, which also makes the approach subject to the cold-start problem.
- concept-insensitive : as these processes rely on keyword feature spaces, two documents that are semantically similar but use different terms are not mapped to close binary codes.
- language-dependent : while some of the most recent works consider hashing textual and multimedia items together, few works focus on multi-lingual corpora [10].

Instead of high-dimensional vector spaces, we consider in this thesis other document representations as candidates for hashing. We especially consider graphs as a document representation that can be tuned to be data-independent, concept-sensitive, and language-independent. We aim at exploring further whether this representation is more suitable for semantic hashing.

This line of research has been established in the institute over the last couple of years. In particular, in [9] we demonstrated that a graphical model is a suitable document representation for semantic hashing, as it offers a massive speed-up for pairwise similarity computation at the cost of a limited loss of semantic similarity with respect to high-dimensional models. In [1], we extended our model and showed that an external taxonomy used in the semantic hashing scheme provides concept-sensitivity to our semantic hashing process. Candidates are strongly advised to read these two publications from the team.

In this thesis, we aim at improving this method and at evaluating the performance and limitations of a semantic hashing scheme based on a graphical representation of documents. We will focus on under-studied applications of semantic hashing : multi-lingual semantic hashing and the speed-up of natural language processing tasks.

Supervisors : Christophe Gravier and Julien Subercaze

Requirements:
-------------
The candidate MUST have :
1. a Master's degree in Computer Science,
2. a good mathematical background,
3. a strong background in programming (Java experience is a plus),
4. excellent English writing skills.

In addition, although speaking French is obviously a very useful skill in general, it is not mandatory in order to apply.

How to apply:
--------------
* Agenda
Submission deadline : May 7th 2015, 11:59pm Paris time.
Interview, if selected : between 11-13 May 2015.
Notification : between 10-15 June

* Application file
Your application file MUST contain the following items :
1. A curriculum vitae,
2. The Master's diploma as a PDF file,
3. The details of your marks for the last two years of the Master's,
4. Contact details of two associate professors or professors you have interacted with, as references.

Your application file MAY also contain any information or link you think is appropriate (e.g. scientific publications you have contributed to, online services you maintain, a link to your social coding account, ...).

You must send your application as a single zipped PDF file to : christophe.gravier@univ-st-etienne.fr

* Evaluation of applications
All applications will be reviewed by all the members of the Knowledge and Representation project, a subdivision of the Connected Intelligence group at the Hubert Curien laboratory. A first selection will be made from your application file. If selected, you will enter a second phase of selection based on three exercises :
- A motivation interview, so that we can get to know each other. The interview includes the advisors, but also other researchers from the laboratory with a more remote perspective on the research,
- A technical assignment : a basic 2-hour Java (or C++) programming task,
- An English assignment.

The entire process should not exceed half a day, and all three exercises are held on the same day. On that day, you are invited to come to the laboratory.

Location :
----------
Saint-Etienne is located about 2 hours from the Mediterranean Sea and 2 hours from the slopes of the Alps. Lyon is 50 km away. Saint-Etienne has about 180,000 inhabitants, including more than 20,000 students. Surrounded by hills where hiking and mountain biking are popular, Saint-Etienne is also a member of the UNESCO Creative Cities Network for design.

References :
------------
[1] Bamba, P., Subercaze, J., Gravier, C., Benmira, N., and Fontaine, J. (October 30th, 2012). The Twitaholic Next Door.
In Proc. of 21st ACM International Conference on Information and Knowledge Management (CIKM'12), pages 2275-2278, Maui, Hawai'i, USA. ACM.
[2] Gionis, A., Indyk, P., Motwani, R., et al. (1999). Similarity search in high dimensions via hashing. In VLDB, volume 99, pages 518-529.
[3] Gu, X., Zhang, Y., Zhang, L., Zhang, D., and Li, J. (2013). An improved method of locality sensitive hashing for indexing large-scale and high-dimensional features. Signal Processing, 93(8):2244-2255.
[4] Indyk, P. and Motwani, R. (1998). Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604-613. ACM.
[5] Ji, J., Li, J., Yan, S., Zhang, B., and Tian, Q. (2012). Super-bit locality-sensitive hashing. In NIPS, pages 108-116.
[6] Lin, R.-S., Ross, D. A., and Yagnik, J. (2010). Spec hashing: Similarity preserving algorithm for entropy-based coding. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 848-854. IEEE.
[7] Matousek, J. (2013). Lecture notes on metric embeddings. In Department of Applied Mathematics, Czech Republic, and Institute of theoretical Computer Science Zurich, 126 pages.
[8] Shrivastava, A. and Li, P. (2014). Densifying one permutation hashing via rotation for fast near neighbor search. In Proceedings of The 31st International Conference on Machine Learning, pages 557-565.
[9] Subercaze, J., Gravier, C., and Laforest, F. (2013). Towards an expressive and scalable twitter's users profiles. In Proceeding of Web Intelligence, pages 101-108.
[10] Ture, F., Elsayed, T., and Lin, J. (2011). No free lunch: brute force vs. locality-sensitive hashing for cross-lingual pairwise similarity. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 943-952. ACM.
[11] Wang, Q., Si, L., Zhang, Z., and Zhang, N. (2014).
Active hashing with joint data example and tag selection. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, pages 405-414. ACM.
[12] Zhang, D., Wang, J., Cai, D., and Lu, J. (2010). Self-taught hashing for fast similarity search. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pages 18-25. ACM.
[13] Zhang, L., Zhang, Y., Tang, J., Gu, X., Li, J., and Tian, Q. (2013). Topology preserving hashing for similarity search. In Proceedings of the 21st ACM international conference on Multimedia, pages 123-132. ACM.

Posted by: Tri Kurniawan Wijaya <trikurniawanwijaya@yahoo.com>