Articles
- G. Tummarello, R. Cyganiak, M. Catasta, S. Danielczyk,
R. Delbru,
S. Decker.
Sig.ma: Live views on the Web of Data.
In Journal of Web Semantics,
2010.
- We present Sig.ma, both a service and an end
user application to access the Web of Data as an integrated information
space. Sig.ma uses a holistic approach in which large scale semantic
web indexing, logic reasoning, data aggregation heuristics, ad-hoc
ontology consolidation, external services and responsive user
interaction all play together to create rich entity descriptions. These
consolidated entity descriptions then form the base for embeddable data
mashups, machine oriented services as well as data browsing services.
Finally, we discuss Sig.ma's peculiar characteristics and report on
lessons learned and ideas it inspires.
- E. Oren,
R. Delbru,
M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello.
Sindice.com: A document-oriented lookup index for open linked data.
In International Journal of Metadata, Semantics and
Ontologies, 3(1), 2008.
- Developers of Semantic Web applications face a challenge with respect
to the decentralised publication model: how and where to find statements
about encountered resources. The "linked data" approach mandates that
resource URIs should be de-referenced to return resource metadata. But
for data discovery linkage itself is not enough, and crawling and indexing
of data is necessary. Existing Semantic Web search engines are focused on
database-like functionality, compromising on index size, query performance
and live updates. We present Sindice, a lookup index over resources crawled
on the Semantic Web. Our index allows applications to automatically locate
documents containing information about a given resource. In addition, we
allow resource retrieval through uniquely identifying inverse-functional
properties, offer full-text search, and index SPARQL endpoints. Finally,
we introduce an extension to the sitemap protocol which allows us to
efficiently index large Semantic Web datasets with minimal impact on the
data providers.
Book Chapters
- M. Catasta, R. Delbru,
N. Toupikov and G. Tummarello.
Managing Terabytes of Web Semantics Data.
Invited paper in R. De Virgilio, F. Giunchiglia, and L. Tanca, editors,
Semantic Web Information Management: A Model-Based Perspective.
Springer, 2009 (in press).
- A large amount of semi-structured data is now made available on the Web
in the form of RDF, RDFa and Microformats. In this chapter we discuss a general model
for the Web of Data and, based on our experience with Sindice.com, we discuss how
this is reflected in the architecture and components of a large scale infrastructure.
Aspects such as data collection, processing, indexing and ranking are touched upon, and we
give an extended example of an application built on top of said infrastructure.
- R. Delbru,
N. Toupikov, M. Catasta, R. Fuller and G. Tummarello.
SIREn: Efficient Search on Semi-Structured Documents.
In Lucene in Action 2nd Edition (In Action series).
Manning Publications Co., 2009 (in press).
- While the specifications for RDF (Resource Description Framework) and Microformats
have been out for quite some time now, it is only in the last few years that many web sites
have begun to make use of them, thus effectively starting a "Web of Data" or, as some refer
to it, a "Web 3.0". Sites such as LinkedIn, Eventful, Digg, LastFM and others are using these
specifications to share pieces of information that can be automatically reused by other web
sites or by smart clients.
Traditionally, querying graph structured data (RDF) has been done using ad-hoc
solutions, called Triplestores, typically based on DBMS backends. In Sindice we needed
something much more scalable than DBMS and with the desirable features of the typical Web
Search engines: top-k query processing, real time updates, full text search, distributed
indexes over shards, etc.
While Lucene has long offered these capabilities, we will see that its
native capabilities are not intended for large semi-structured document collections with very
different schemata. For this reason we developed SIREn (Semantic Information Retrieval
Engine), a Lucene extension to overcome these shortcomings and efficiently index and query
RDF, as well as any textual document with an arbitrary number of metadata fields.
Conference papers
- R. Delbru,
N. Toupikov, M. Catasta, G. Tummarello.
A Node Indexing Scheme for Web Entity Retrieval.
In Proceedings of the 7th Extended Semantic
Web Conference (ESWC). 2010.
- Now motivated also by the partial support of
major search engines, hundreds of millions of documents are being
published on the web embedding semi-structured data in RDF, RDFa and
Microformats. This scenario calls for novel information search systems
which provide effective means of retrieving relevant semi-structured
information. In this paper, we present an "entity retrieval system"
designed to provide entity search capabilities over datasets as large as
the entire Web of Data. Our system supports full-text search,
semi-structural queries and top-k query results while exhibiting a
concise index and efficient incremental updates. We advocate the use of
a node indexing scheme and show that it offers a good compromise between
query expressiveness, query processing time and update complexity in
comparison to three other indexing techniques. We then demonstrate how
such a system can effectively answer queries over 10 billion triples on a
single commodity machine.
- R. Delbru,
N. Toupikov, M. Catasta, G. Tummarello, S. Decker.
Hierarchical Link Analysis for Ranking Web Data.
In Proceedings of the 7th Extended Semantic
Web Conference (ESWC). 2010.
- On the Web of Data, entities are often
interconnected in a way similar to web documents. Previous works have
shown how PageRank can be adapted to achieve entity ranking. In this
paper, we propose to exploit locality on the Web of Data by taking a
layered approach, similar to hierarchical PageRank approaches. We
provide justifications for a two-layer model of the Web of Data, and
introduce DING (Dataset Ranking), a novel ranking methodology based on
this two-layer model. DING uses links between datasets to compute
dataset ranks and combines the resulting values with semantic-dependent
entity ranking strategies. We compare the effectiveness of the approach
with other link-based algorithms on large datasets coming from the
Sindice search engine. The evaluation, which includes a user study,
indicates that the resulting rank is better than the other approaches.
Also, the resulting algorithm is shown to have desirable computational
properties such as parallelisation.
- S. Corlosquet,
R. Delbru,
T. Clark, A. Polleres and S. Decker.
Produce and Consume Linked Data with Drupal!.
In Proceedings of the 8th International
Semantic Web Conference (ISWC). 2009.
- Currently a large number of Web sites are
driven by Content Management Systems (CMS) which manage textual and
multimedia content but also - inherently - carry valuable information
about a site's structure and content model. Exposing this structured
information to the Web of Data has so far required considerable expertise
in RDF and OWL modelling and additional programming effort. In this
paper we tackle one of the most popular CMS: Drupal. We enable site
administrators to export their site content model and data to the Web of
Data without requiring extensive knowledge of Semantic Web technologies.
Our modules create RDFa annotations and, optionally, a SPARQL
endpoint for any Drupal site out of the box. Likewise, we add the means
to map the site data to existing ontologies on the Web with a search
interface to find commonly used ontology terms. We also allow a Drupal
site administrator to include existing RDF data from remote SPARQL
endpoints on the Web in the site. When brought together, these features
allow networked RDF Drupal sites that reuse and enrich Linked Data. We
finally discuss the adoption of our modules and report on a use case in
the biomedical field and the current status of its deployment.
- X. Bai, R. Delbru and
G. Tummarello.
RDF Snippets for Semantic Web Search Engines. In
Proceedings of the
International Conference on Ontologies, Databases and Applications of
Semantics (ODBASE). 2008.
- There has recently been interest in ranking resources and generating
corresponding expressive descriptions from the Semantic Web. This paper
proposes an approach for automatically generating snippets from RDF
documents, helping users better understand the content of RDF documents
returned by Semantic Web search engines. A heuristic method for
discovering topics, based on the occurrences of RDF nodes and the URIs of
the original RDF documents, is presented and evaluated. To make the
snippets more understandable, two strategies are proposed for ranking the
topic-related statements and the query-related statements respectively.
Finally, we draw conclusions from a discussion of the performance of our
topic discovery and overall snippet generation approaches on a test
dataset provided by Sindice.
- R. Cyganiak, H. Stenzhorn,
R. Delbru,
S. Decker and G. Tummarello.
Semantic Sitemaps: Efficient and
Flexible Access to Datasets on the Semantic Web.
In Proceedings of
the 5th European Semantic Web Conference (ESWC). 2008.
- Increasing amounts of RDF data are available on the Web for
consumption by Semantic Web browsers and indexing by Semantic Web
search engines. Current Semantic Web publishing practices, however, do
not directly support efficient discovery and high-performance retrieval
by clients and search engines. We propose an extension to the Sitemaps
protocol which provides a simple and effective solution: Data publishers
create Semantic Sitemaps to announce and describe their data so that
clients can choose the most appropriate access method. We show how
this protocol enables an extended notion of authoritative information
across different access methods.
- G. Tummarello,
R. Delbru and
E. Oren.
Sindice.com: Weaving the open linked data.
In Proceedings of the 6th International
Semantic Web Conference (ISWC). 2007.
- Developers of Semantic Web applications face a challenge with
respect to the decentralised publication model: where to find statements
about encountered resources. The "linked data" approach, which mandates
that resource URIs should be de-referenced and yield metadata
about the resource, helps but is only a partial solution. We present Sindice,
a lookup index over resources crawled on the Semantic Web. Our index
allows applications to automatically retrieve sources with information about
a given resource. In addition we allow resource retrieval through
inverse-functional properties, offer full-text search and index SPARQL endpoints.
- E. Oren,
R. Delbru,
S. Gerke, A. Haller and S. Decker.
ActiveRDF: Object-oriented
semantic web programming. In
Proceedings of the 16th International World-Wide Web Conference
(WWW). May 2007.
- Object-oriented programming is the current mainstream programming paradigm but existing RDF
APIs are mostly triple-oriented. Traditional techniques for bridging a similar gap between
relational databases and object-oriented programs cannot be applied directly, given the
different nature of Semantic Web data, as can for example be seen in the semantics of class
membership, inheritance relations, and object conformance to schemas. We present ActiveRDF,
an object-oriented API for managing RDF data that offers full manipulation and querying of
RDF data, does not rely on a schema and fully conforms to RDF(S) semantics. ActiveRDF can
be used with different RDF data stores: adapters have been implemented for generic SPARQL
endpoints, Sesame, Jena, Redland and YARS, and new adapters can be added easily. In addition,
integration with the popular Ruby on Rails framework enables fast development of Semantic Web
applications.
- E. Oren,
R. Delbru, and
S. Decker.
Extending faceted navigation for RDF data. In
Proceedings of the 5th International Semantic Web Conference (ISWC).
November 2006.
- Data on the Semantic Web is semi-structured and does not follow one fixed schema.
Faceted browsing is a natural technique for navigating such data, partitioning the
information space into orthogonal conceptual dimensions. Current faceted interfaces
are manually constructed and have limited query expressiveness. We develop an expressive
faceted interface for semi-structured data and formally show the improvement over
existing interfaces. Secondly, we develop metrics for automatic ranking of facet
quality, bypassing the need for manual construction of the interface. We develop a
prototype for faceted navigation of arbitrary RDF data. Experimental evaluation shows
improved usability over current interfaces.
Workshop papers
- N. Toupikov, J. Umbrich,
R. Delbru, M. Hausenblas and G. Tummarello.
DING! Dataset Ranking using Formal Descriptions. In
Proceedings of the WWW-2009 Workshop on Linked Data on the Web
(LDOW-2009). Madrid, Spain, 2009.
- Considering that thousands if not millions of linked datasets will
be published soon, we motivate in this paper the need for an efficient
and effective way to rank interlinked datasets based on formal
descriptions of their characteristics. We propose DING (from
Dataset RankING) as a new approach to rank linked datasets
using information provided by the voiD vocabulary. DING is a
domain-independent link analysis that measures the popularity of
datasets by considering the cardinality and types of the relationships.
We also propose a methodology to automatically assign weights to link
types. We evaluate the proposed ranking algorithm against other well
known ones, such as PageRank or HITS, using synthetic voiD descriptions.
Early results show that DING performs better than the standard Web
ranking algorithms.
- R. Delbru, A. Polleres,
G. Tummarello and S. Decker.
Context Dependent Reasoning for Semantic Documents in Sindice. In
Proceedings of the 4th International Workshop on Scalable Semantic
Web Knowledge Base Systems (SSWS). Karlsruhe, Germany, 2008.
- The Sindice Semantic Web index currently provides search capabilities over
more than 30 million documents. A scalable reasoning mechanism for
real-world web data is important in order to increase the precision and
recall of the Sindice index by inferring useful information (e.g. RDF
Schema features, equality, property characteristics such as inverse
functional properties or annotation properties from OWL). In this paper,
we introduce our notion of context dependent reasoning for RDF documents
published on the Web according to the linked data principle. We then
illustrate an efficient methodology to perform context dependent RDFS
and partial OWL inference based on a persistent TBox composed of a
network of web ontologies. Finally we report preliminary evaluation
results of our implementation underlying the Sindice web data index.
- G. Tummarello and R. Delbru.
Entity Coreference Resolution Services in Sindice.com: Identification on
the current Web of Data. In Proceedings
of the 1st international workshop on Identity and Reference on the
Semantic Web (IRSW). Tenerife, Spain. 2008.
- A. Harth, A. Hogan,
R. Delbru, J. Umbrich, S. O'Riain and S. Decker.
SWSE: Answers Before Links!. In Proceedings of the
Semantic Web Challenge (ISWC). Busan, Korea. 2007.
- We present a system that improves on current document-centric Web
search engine technology; adopting an entity-centric
perspective, we are able to integrate data from both static and live sources
into a coherent, interlinked information space. Users can then search and
navigate the integrated information space through relationships, both
existing and newly materialised, for improved knowledge discovery and
understanding.
- E. Oren and R. Delbru.
ActiveRDF: Object-oriented RDF in Ruby. In Proceedings of the
European Semantic Web Conference Workshop on Scripting for the
Semantic Web (ESWC). Budva, Montenegro. June 2006.
- Although most developers are object-oriented, programming RDF is triple-oriented.
Bridging this gap, by developing a truly object-oriented API that uses domain
terminology, is not straightforward, because of the dynamic and semi-structured nature
of RDF and the open-world semantics of RDF Schema. We present ActiveRDF, our
object-oriented library for accessing RDF data. ActiveRDF is completely dynamic,
offers full manipulation and querying of RDF data, does not rely on a schema and can
be used against different data-stores. In addition, the integration with the popular
Rails framework enables very easy development of Semantic Web applications.
- E. Oren and R. Delbru.
A prototype for faceted browsing of RDF data. In Proceedings of the
Workshop on Scripting for Semantic Web (ESWC). Budva, Montenegro.
June 2006.
- E. Oren, R. Delbru,
K. Möller, M. Völkel and S. Handschuh. Annotation and navigation in
semantic wikis. In Proceedings of the European Semantic Web
Conference Workshop on Semantic Wikis. Budva, Montenegro. June 2006.
- Semantic Wikis allow users to semantically annotate their Wiki content. The particular
annotations can differ in expressive power, simplicity, and meaning. We present an
elaborate conceptual model for semantic annotations, introduce a unique and rich Wiki
syntax for these annotations, and discuss how to best formally represent the augmented
Wiki content. We improve existing navigation techniques to automatically construct
faceted browsing for semistructured data. By utilising the Wiki annotations we
provide greatly enhanced information retrieval. Further we report on our ongoing
development of these techniques in our prototype SemperWiki.
Symposium
- Renaud Delbru.
SIREn: Entity Retrieval System for the Web of Data.
In Proceedings of the 3rd Symposium on Future Directions in
Information Access (FDIA). University of Padua, Italy. September 2009.
- We present ongoing work on the Semantic Information Retrieval Engine (SIREn), an
"entity retrieval system" specifically designed to meet the requirements of indexing
and searching a large amount of semi-structured data, e.g. the entire Web of Data.
SIREn supports efficient full text search with semi-structural queries and exhibits
a concise index, constant time updates and inherits Information Retrieval features
such as top-k queries, efficient caching and scalability via distribution over shards.
We demonstrate how SIREn can effectively answer queries over 10 billion triples on
a single commodity machine. The prototype is currently in use in the Sindice search
engine, which at present indexes more than 50 million harvested documents
containing semi-structured data.
- Renaud Delbru.
Methodology for Searching Entities on the Web.
In Proceedings of the European Semantic Web Conference Ph.D
Symposium (ESWC). Tenerife, Spain. June 2008.
Reports
- Renaud Delbru.
Manipulation and Exploration of Semantic Web Knowledge.
Internship Report, DERI and EPITA France, Jan-Jul 2006.
- Describing web resources with machine-understandable metadata is one of the foundations
of the Semantic Web. The Resource Description Framework (RDF) is the language for
describing and exchanging Semantic Web knowledge. As such data becomes more and more
common, techniques for manipulating and exploring this information become necessary.
However, the manipulation of RDF data is triple-oriented. This kind of representation
is less intuitive and harder to get to grips with than the object-oriented approach. Our
goal was therefore to reconcile the two paradigms by developing an application programming
interface (API) that exposes RDF data as objects. ActiveRDF is a dynamic, high-level API
that abstracts access to different kinds of RDF data stores. This interface offers access
to RDF data in the form of objects, using the terminology of the domain.
To navigate through RDF data and search for information, we propose Faceteer, a faceted
navigation technique for semi-structured data. This technique extends the navigation
possibilities offered by existing techniques. It makes it possible to build very complex
queries visually and easily. The navigation interface is generated automatically for
arbitrary RDF data. A set of metrics lets us order the facets of the browser in order to
improve navigability.
The results of our research on ActiveRDF and Faceteer allow substantial time savings in
the manipulation and exploration of RDF data for users of the Semantic Web.