Articles

G. Tummarello, R. Cyganiak, M. Catasta, S. Danielczyk, R. Delbru, S. Decker. Sig.ma: Live views on the Web of Data. In Journal of Web Semantics, 2010.
We present Sig.ma, both a service and an end-user application to access the Web of Data as an integrated information space. Sig.ma uses a holistic approach in which large-scale semantic web indexing, logic reasoning, data aggregation heuristics, ad-hoc ontology consolidation, external services and responsive user interaction all play together to create rich entity descriptions. These consolidated entity descriptions then form the base for embeddable data mashups, machine-oriented services as well as data browsing services. Finally, we discuss Sig.ma's peculiar characteristics and report on lessons learned and ideas it inspires.
E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: A document-oriented lookup index for open linked data. In International Journal of Metadata, Semantics and Ontologies, 3(1), 2008.
Developers of Semantic Web applications face a challenge with respect to the decentralised publication model: how and where to find statements about encountered resources. The "linked data" approach mandates that resource URIs should be de-referenced to return resource metadata. But for data discovery linkage itself is not enough, and crawling and indexing of data is necessary. Existing Semantic Web search engines are focused on database-like functionality, compromising on index size, query performance and live updates. We present Sindice, a lookup index over resources crawled on the Semantic Web. Our index allows applications to automatically locate documents containing information about a given resource. In addition, we allow resource retrieval through uniquely identifying inverse-functional properties, offer a full-text search and index SPARQL endpoints. Finally we introduce an extension to the sitemap protocol which allows us to efficiently index large Semantic Web datasets with minimal impact on the data providers.

Book Chapters

M. Catasta, R. Delbru, N. Toupikov and G. Tummarello. Managing Terabytes of Web Semantics Data. Invited paper in R. De Virgilio, F. Giunchiglia, and L. Tanca, editors, Semantic Web Information Management: A Model-Based Perspective. Springer, 2009 (in press).
A large amount of semi-structured data is now made available on the Web in the form of RDF, RDFa and Microformats. In this chapter we discuss a general model for the Web of Data and, based on our experience in Sindice.com, we discuss how this is reflected in the architecture and components of a large-scale infrastructure. Aspects such as data collection, processing, indexing and ranking are touched upon, and we give an extended example of an application built on top of this infrastructure.
R. Delbru, N. Toupikov, M. Catasta, R. Fuller and G. Tummarello. SIREn: Efficient Search on Semi-Structured Documents. In Lucene in Action 2nd Edition (In Action series). Manning Publications Co., 2009 (in press).
While the specifications for RDF (Resource Description Framework) and Microformats have been out for quite some time now, it is only in the last few years that many web sites have begun to make use of them, thus effectively starting a "Web of Data" or, as some refer to it, a "Web 3.0". Sites such as LinkedIn, Eventful, Digg, LastFM and others are using these specifications to share pieces of information that can be automatically reused by other web sites or by smart clients.
Traditionally, querying graph structured data (RDF) has been done using ad-hoc solutions, called Triplestores, typically based on DBMS backends. In Sindice we needed something much more scalable than DBMS and with the desirable features of the typical Web Search engines: top-k query processing, real time updates, full text search, distributed indexes over shards, etc. While Lucene has long offered these capabilities, we will see that its native capabilities are not intended for large semi-structured document collections with very different schemata. For this reason we developed SIREn (Semantic Information Retrieval Engine), a Lucene extension to overcome these shortcomings and efficiently index and query RDF, as well as any textual document with an arbitrary number of metadata fields.

Conference papers

R. Delbru, N. Toupikov, M. Catasta, G. Tummarello. A Node Indexing Scheme for Web Entity Retrieval. In Proceedings of the 7th Extended Semantic Web Conference (ESWC). 2010. [slides]
Motivated in part by the partial support of major search engines, hundreds of millions of documents are now being published on the web embedding semi-structured data in RDF, RDFa and Microformats. This scenario calls for novel information search systems which provide effective means of retrieving relevant semi-structured information. In this paper, we present an "entity retrieval system" designed to provide entity search capabilities over datasets as large as the entire Web of Data. Our system supports full-text search, semi-structural queries and top-k query results while exhibiting a concise index and efficient incremental updates. We advocate the use of a node indexing scheme and show that it offers a good compromise between query expressiveness, query processing time and update complexity in comparison to three other indexing techniques. We then demonstrate how such a system can effectively answer queries over 10 billion triples on a single commodity machine.
R. Delbru, N. Toupikov, M. Catasta, G. Tummarello, S. Decker. Hierarchical Link Analysis for Ranking Web Data. In Proceedings of the 7th Extended Semantic Web Conference (ESWC). 2010. [slides]
On the Web of Data, entities are often interconnected in a way similar to web documents. Previous works have shown how PageRank can be adapted to achieve entity ranking. In this paper, we propose to exploit locality on the Web of Data by taking a layered approach, similar to hierarchical PageRank approaches. We provide justifications for a two-layer model of the Web of Data, and introduce DING (Dataset Ranking), a novel ranking methodology based on this two-layer model. DING uses links between datasets to compute dataset ranks and combines the resulting values with semantic-dependent entity ranking strategies. We compare the effectiveness of the approach against other link-based algorithms on large datasets coming from the Sindice search engine. The evaluation, which includes a user study, indicates that the resulting rank is better than the other approaches. Also, the resulting algorithm is shown to have desirable computational properties such as parallelisation.
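The two-layer idea in the abstract above can be illustrated with a minimal sketch. It is not the DING algorithm from the paper: it assumes a plain weighted PageRank over the dataset graph (link counts as edge weights) and a simple product to combine dataset rank with a local entity score. The function names `dataset_rank` and `entity_rank`, the damping factor and the sample data are all hypothetical.

```python
from collections import defaultdict


def dataset_rank(links, damping=0.85, iterations=100):
    """Weighted PageRank over a dataset graph (an assumed stand-in for layer one).

    `links` maps (source_dataset, target_dataset) to the number of links
    observed between the two datasets.
    """
    nodes = sorted({d for pair in links for d in pair})
    n = len(nodes)
    out_weight = defaultdict(float)
    for (src, _dst), weight in links.items():
        out_weight[src] += weight

    rank = {d: 1.0 / n for d in nodes}
    for _ in range(iterations):
        # Rank held by datasets without outgoing links is spread uniformly.
        dangling = sum(rank[d] for d in nodes if out_weight[d] == 0)
        new_rank = {d: (1 - damping) / n + damping * dangling / n for d in nodes}
        for (src, dst), weight in links.items():
            new_rank[dst] += damping * rank[src] * weight / out_weight[src]
        rank = new_rank
    return rank


def entity_rank(dataset_ranks, local_ranks):
    """Layer two (assumed combination): dataset rank times local entity score."""
    return {
        (dataset, entity): dataset_ranks[dataset] * score
        for dataset, entities in local_ranks.items()
        for entity, score in entities.items()
    }


# Hypothetical data: three datasets and per-dataset local entity scores.
links = {("A", "B"): 3, ("C", "B"): 1, ("B", "A"): 1}
local = {"A": {"a1": 0.7, "a2": 0.3}, "B": {"b1": 1.0}}
ranks = dataset_rank(links)
scores = entity_rank(ranks, local)
```

Because the dataset graph is much smaller than the entity graph, the first layer is cheap to recompute, and the local entity scores can be computed independently per dataset.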
S. Corlosquet, R. Delbru, T. Clark, A. Polleres and S. Decker. Produce and Consume Linked Data with Drupal!. In Proceedings of the 8th International Semantic Web Conference (ISWC). 2009.
Currently a large number of Web sites are driven by Content Management Systems (CMS) which manage textual and multimedia content but also inherently carry valuable information about a site's structure and content model. Exposing this structured information to the Web of Data has so far required considerable expertise in RDF and OWL modelling and additional programming effort. In this paper we tackle one of the most popular CMS: Drupal. We enable site administrators to export their site content model and data to the Web of Data without requiring extensive knowledge of Semantic Web technologies. Our modules create RDFa annotations and, optionally, a SPARQL endpoint for any Drupal site out of the box. Likewise, we add the means to map the site data to existing ontologies on the Web with a search interface to find commonly used ontology terms. We also allow a Drupal site administrator to include existing RDF data from remote SPARQL endpoints on the Web in the site. When brought together, these features allow networked RDF Drupal sites that reuse and enrich Linked Data. We finally discuss the adoption of our modules and report on a use case in the biomedical field and the current status of its deployment.
X. Bai, R. Delbru and G. Tummarello. RDF Snippets for Semantic Web Search Engines. In Proceedings of the International Conference on Ontologies, Databases and Applications of Semantics (ODBASE). 2008.
There has recently been interest in ranking resources and generating corresponding expressive descriptions from the Semantic Web. This paper proposes an approach for automatically generating snippets from RDF documents and assisting users in better understanding the content of RDF documents returned by Semantic Web search engines. A heuristic method for discovering topics, based on the occurrences of RDF nodes and the URIs of original RDF documents, is presented and evaluated in this paper. In order to make the snippets more understandable, two strategies are proposed and used for ranking the topic-related statements and the query-related statements respectively. Finally, conclusions are drawn from a discussion of the performance of our topic discovery and overall snippet generation approaches on a test dataset provided by Sindice.
R. Cyganiak, H. Stenzhorn, R. Delbru, S. Decker and G. Tummarello. Semantic Sitemaps: Efficient and Flexible Access to Datasets on the Semantic Web. In Proceedings of the 5th European Semantic Web Conference (ESWC). 2008.
Increasing amounts of RDF data are available on the Web for consumption by Semantic Web browsers and indexing by Semantic Web search engines. Current Semantic Web publishing practices, however, do not directly support efficient discovery and high-performance retrieval by clients and search engines. We propose an extension to the Sitemaps protocol which provides a simple and effective solution: Data publishers create Semantic Sitemaps to announce and describe their data so that clients can choose the most appropriate access method. We show how this protocol enables an extended notion of authoritative information across different access methods.
G. Tummarello, R. Delbru and E. Oren. Sindice.com: Weaving the open linked data. In Proceedings of the 6th International Semantic Web Conference (ISWC). 2007.
Developers of Semantic Web applications face a challenge with respect to the decentralised publication model: where to find statements about encountered resources. The "linked data" approach, which mandates that resource URIs should be de-referenced and yield metadata about the resource, helps but is only a partial solution. We present Sindice, a lookup index over resources crawled on the Semantic Web. Our index allows applications to automatically retrieve sources with information about a given resource. In addition, we allow resource retrieval through inverse-functional properties, offer full-text search and index SPARQL endpoints.
E. Oren, R. Delbru, S. Gerke, A. Haller and S. Decker. ActiveRDF: Object-oriented semantic web programming. In Proceedings of the 16th International World-Wide Web Conference (WWW). May 2007.
Object-oriented programming is the current mainstream programming paradigm but existing RDF APIs are mostly triple-oriented. Traditional techniques for bridging a similar gap between relational databases and object-oriented programs cannot be applied directly, given the different nature of Semantic Web data, as can for example be seen in the semantics of class membership, inheritance relations, and object conformance to schemas. We present ActiveRDF, an object-oriented API for managing RDF data that offers full manipulation and querying of RDF data, does not rely on a schema and fully conforms to RDF(S) semantics. ActiveRDF can be used with different RDF data stores: adapters have been implemented for generic SPARQL endpoints, Sesame, Jena, Redland and YARS, and new adapters can be added easily. In addition, integration with the popular Ruby on Rails framework enables fast development of Semantic Web applications.
E. Oren, R. Delbru, and S. Decker. Extending faceted navigation for RDF data. In Proceedings of the 5th International Semantic Web Conference (ISWC). November 2006.
Data on the Semantic Web is semi-structured and does not follow one fixed schema. Faceted browsing is a natural technique for navigating such data, partitioning the information space into orthogonal conceptual dimensions. Current faceted interfaces are manually constructed and have limited query expressiveness. We develop an expressive faceted interface for semi-structured data and formally show the improvement over existing interfaces. Secondly, we develop metrics for automatic ranking of facet quality, bypassing the need for manual construction of the interface. We develop a prototype for faceted navigation of arbitrary RDF data. Experimental evaluation shows improved usability over current interfaces.
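As a rough illustration of the abstract above, the sketch below derives facets automatically from arbitrary triples, narrows a result set by selecting facet values, and orders facets by a quality score. The score used here (normalised entropy of the partition sizes, as a measure of how evenly a facet splits the data) is an assumption chosen for the example and is not necessarily one of the metrics defined in the paper. All function names and the sample triples are hypothetical.

```python
import math
from collections import defaultdict


def build_facets(triples):
    """Group subjects by predicate and value: facets[predicate][value] = {subjects}."""
    facets = defaultdict(lambda: defaultdict(set))
    for subject, predicate, value in triples:
        facets[predicate][value].add(subject)
    return {p: dict(values) for p, values in facets.items()}


def select(facets, constraints):
    """Return the subjects that satisfy every (predicate, value) constraint."""
    matches = [facets.get(p, {}).get(v, set()) for p, v in constraints]
    return set.intersection(*matches) if matches else set()


def balance(facet):
    """Normalised entropy of the partition sizes: 1.0 means a perfectly even split."""
    sizes = [len(subjects) for subjects in facet.values()]
    if len(sizes) < 2:
        return 0.0
    total = sum(sizes)
    entropy = -sum((s / total) * math.log(s / total) for s in sizes)
    return entropy / math.log(len(sizes))


def rank_facets(facets):
    """Order predicates from most to least balanced."""
    return sorted(facets, key=lambda p: balance(facets[p]), reverse=True)


# Hypothetical data: four resources described by two predicates.
triples = [
    ("a", "type", "Person"), ("b", "type", "Person"),
    ("c", "type", "Doc"), ("d", "type", "Doc"),
    ("a", "lang", "en"), ("b", "lang", "en"),
    ("c", "lang", "en"), ("d", "lang", "fr"),
]
facets = build_facets(triples)
ordering = rank_facets(facets)
english_docs = select(facets, [("type", "Doc"), ("lang", "en")])
```

In this data the "type" facet splits the resources evenly while "lang" does not, so "type" is offered first; no facet had to be configured by hand.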

Workshop papers

N. Toupikov, J. Umbrich, R. Delbru, M. Hausenblas and G. Tummarello. DING! Dataset Ranking using Formal Descriptions. In Proceedings of the WWW-2009 Workshop on Linked Data on the Web (LDOW-2009). Madrid, Spain, 2009.
Considering that thousands if not millions of linked datasets will be published soon, we motivate in this paper the need for an efficient and effective way to rank interlinked datasets based on formal descriptions of their characteristics. We propose DING (from Dataset RankING) as a new approach to rank linked datasets using information provided by the voiD vocabulary. DING is a domain-independent link analysis that measures the popularity of datasets by considering the cardinality and types of the relationships. We also propose a methodology to automatically assign weights to link types. We evaluate the proposed ranking algorithm against other well-known ones, such as PageRank or HITS, using synthetic voiD descriptions. Early results show that DING performs better than the standard Web ranking algorithms.
R. Delbru, A. Polleres, G. Tummarello and S. Decker. Context Dependent Reasoning for Semantic Documents in Sindice. In Proceedings of the 4th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS). Karlsruhe, Germany, 2008. [slides]
The Sindice Semantic Web index currently provides search capabilities over more than 30 million documents. A scalable reasoning mechanism for real-world web data is important in order to increase the precision and recall of the Sindice index by inferring useful information (e.g. RDF Schema features, equality, property characteristics such as inverse functional properties or annotation properties from OWL). In this paper, we introduce our notion of context dependent reasoning for RDF documents published on the Web according to the linked data principle. We then illustrate an efficient methodology to perform context dependent RDFS and partial OWL inference based on a persistent TBox composed of a network of web ontologies. Finally, we report preliminary evaluation results of our implementation underlying the Sindice web data index.
G. Tummarello and R. Delbru. Entity Coreference Resolution Services in Sindice.com: Identification on the current Web of Data. In Proceedings of the 1st international workshop on Identity and Reference on the Semantic Web (IRSW). Tenerife, Spain. 2008.
A. Harth, A. Hogan, R. Delbru, J. Umbrich, S. O'Riain and S. Decker. SWSE: Answers Before Links!. In Proceedings of the Semantic Web Challenge (ISWC). Busan, Korea. 2007.
We present a system that improves on current document-centric Web search engine technology; adopting an entity-centric perspective, we are able to integrate data from both static and live sources into a coherent, interlinked information space. Users can then search and navigate the integrated information space through relationships, both existing and newly materialised, for improved knowledge discovery and understanding.
E. Oren and R. Delbru. ActiveRDF: Object-oriented RDF in Ruby. In Proceedings of the European Semantic Web Conference Workshop on Scripting for the Semantic Web (ESWC). Budva, Montenegro. June 2006.
Although most developers are object-oriented, programming RDF is triple-oriented. Bridging this gap, by developing a truly object-oriented API that uses domain terminology, is not straightforward, because of the dynamic and semi-structured nature of RDF and the open-world semantics of RDF Schema. We present ActiveRDF, our object-oriented library for accessing RDF data. ActiveRDF is completely dynamic, offers full manipulation and querying of RDF data, does not rely on a schema and can be used against different data-stores. In addition, the integration with the popular Rails framework enables very easy development of Semantic Web applications.
E. Oren and R. Delbru. A prototype for faceted browsing of RDF data. In Proceedings of the Workshop on Scripting for Semantic Web (ESWC). Budva, Montenegro. June 2006.
E. Oren, R. Delbru, K. Möller, M. Völkel and S. Handschuh. Annotation and navigation in semantic wikis. In Proceedings of the European Semantic Web Conference Workshop on Semantic Wikis. Budva, Montenegro. June 2006.
Semantic Wikis allow users to semantically annotate their Wiki content. The particular annotations can differ in expressive power, simplicity, and meaning. We present an elaborate conceptual model for semantic annotations, introduce a unique and rich Wiki syntax for these annotations, and discuss how to best formally represent the augmented Wiki content. We improve existing navigation techniques to automatically construct faceted browsing for semistructured data. By utilising the Wiki annotations we provide greatly enhanced information retrieval. Further we report on our ongoing development of these techniques in our prototype SemperWiki.

Symposium

Renaud Delbru. SIREn: Entity Retrieval System for the Web of Data. In Proceedings of the 3rd Symposium on Future Directions in Information Access (FDIA). University of Padua, Italy. September 2009.
We present ongoing work on the Semantic Information Retrieval Engine (SIREn), an "entity retrieval system" specifically designed to meet the requirements of indexing and searching a large amount of semi-structured data, e.g. the entire Web of Data. SIREn supports efficient full-text search with semi-structural queries, exhibits a concise index and constant-time updates, and inherits Information Retrieval features such as top-k queries, efficient caching and scalability via distribution over shards. We demonstrate how SIREn can effectively answer queries over 10 billion triples on a single commodity machine. The prototype is currently in use in the Sindice search engine, which at present indexes more than 50 million harvested documents containing semi-structured data.
Renaud Delbru. Methodology for Searching Entities on the Web. In Proceedings of the European Semantic Web Conference Ph.D Symposium (ESWC). Tenerife, Spain. June 2008.

Reports

Renaud Delbru. Manipulation and Exploration of Semantic Web Knowledge, Internship Report
DERI and EPITA France, Jan-Jul 2006
Describing web resources with machine-understandable metadata is one of the foundations of the Semantic Web. The Resource Description Framework (RDF) is the language for describing and exchanging Semantic Web knowledge. As this data becomes more and more common, techniques for manipulating and exploring it become necessary.
However, manipulation of RDF data is triple-oriented. This kind of representation is less intuitive and harder to get to grips with than the object-oriented approach. Our goal was therefore to reconcile the two paradigms by developing an application programming interface (API) that exposes RDF data as objects. ActiveRDF is a dynamic, high-level API that abstracts access to different types of RDF data stores. This interface provides access to RDF data in the form of objects, using domain terminology.
To navigate RDF data and search for information, we propose Faceteer, a faceted navigation technique for semi-structured data. This technique extends the navigation possibilities offered by existing techniques. It makes it possible to build very complex queries visually and easily. The navigation interface is generated automatically for arbitrary RDF data. A set of metrics lets us rank the facets of the browser in order to improve navigability.
The results of our research on ActiveRDF and Faceteer allow substantial time savings in manipulating and exploring RDF data for Semantic Web users.