BiTeM site report for the claims to passage task in CLEF-IP 2012

Gobeill, Julien ; Ruch, Patrick

In: CLEF 2012, 2012, unpaginated

    Summary
    In CLEF-IP 2012, we participated in the Claims to Passage task, where the goal was to return relevant passages for sets of claims, for patentability or novelty search purposes. The collection contained 2.3M documents, corresponding to an estimated volume of 250M passages. To cope with the problems induced by this large dataset, we designed a two-step retrieval system. In the first step, the 2.3M patent application documents were indexed; for each topic, we then retrieved the k most similar documents with a classical Prior Art Search. Document representations and the tuning of the IR engine were set relying on training data and on the expertise we acquired in past similar tasks. In particular, we used not only the claims of the topic but also the full description of the application document and the applicant/inventor details; moreover, we discarded retrieved documents that did not share at least one IPC code with the topic. The k parameter ranged from 5 to 1000 depending on the computed run. In the second step, for each topic (i.e. "on the fly"), we indexed the passages contained in these k most similar documents and queried them with the topic claims in order to obtain the final runs. Thus, we dealt with approximately 11M passages instead of 250M. The best k parameter on the training data was 10. Hence, we decided to submit four runs with k set to 10, 20, 50, and 100. Finally, we analyzed the training data and observed that the position of a passage in the document played a role, as passages at the end of the description were more likely to be relevant. Thus, we re-ranked each run according to the passages' positions in the document in order to submit four supplementary runs.
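
    The sketch below illustrates the two-step strategy and the position-based re-ranking described in the summary. It is a minimal, hypothetical reconstruction: the dictionary-based data model, the simple term-overlap scoring, and the position boost are assumptions standing in for the authors' actual IR engine, document representations, and weighting scheme.

    ```python
    # Minimal sketch of the two-step passage retrieval described above.
    # Assumptions (not from the paper): documents and topics are dicts with
    # "claims", "description", "applicants", "ipc", "passages" fields, and
    # ranking uses plain term overlap instead of the authors' IR engine.

    def tokens(text):
        return set(text.lower().split())

    def overlap_score(query, text):
        q = tokens(query)
        return len(q & tokens(text)) / (len(q) or 1)

    def step1_document_retrieval(topic, collection, k=10):
        # Step 1: prior-art-style search over the full document collection,
        # querying with the claims plus the description and applicant/inventor
        # details, and discarding candidates sharing no IPC code with the topic.
        query = " ".join([topic["claims"], topic["description"], topic["applicants"]])
        candidates = [d for d in collection if set(d["ipc"]) & set(topic["ipc"])]
        candidates.sort(key=lambda d: overlap_score(query, d["text"]), reverse=True)
        return candidates[:k]

    def step2_passage_retrieval(topic, top_documents):
        # Step 2: index "on the fly" only the passages of the k retained
        # documents and query them with the topic claims.
        scored = []
        for doc in top_documents:
            n = len(doc["passages"])
            for i, passage in enumerate(doc["passages"]):
                scored.append({
                    "doc_id": doc["id"],
                    "passage": passage,
                    "score": overlap_score(topic["claims"], passage),
                    "relative_position": i / max(n - 1, 1),  # 0 = start, 1 = end
                })
        scored.sort(key=lambda p: p["score"], reverse=True)
        return scored

    def rerank_by_position(passages):
        # Supplementary runs: passages near the end of the description were
        # observed to be more often relevant, so boost scores by relative
        # position (an assumed multiplicative boost for illustration).
        return sorted(passages,
                      key=lambda p: p["score"] * (1.0 + p["relative_position"]),
                      reverse=True)
    ```

    In this toy form, restricting step 2 to the k documents returned by step 1 is what keeps the passage index small (roughly 11M passages instead of 250M in the paper's setting); the re-ranking function would produce the four supplementary runs from the four base runs.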