Web Content Quality has various and often subjective aspects. In this year's Discovery Challenge we try to explore various properties that may determine the overall rank, quality and importance of a Web site, with the task of developing automatic methods that can be used to estimate web content quality.
Our main target is to help organizations, both commercial (such as commercial search engines) and non-commercial (such as non-commercial web archives), in their efforts to prioritize their procedures to gather, store and organize their collection of Web pages. The objectives of these entities may vary from institution to institution: for example, an archive may want to include even Web spam, but with lower priority, while others may prefer frequent refreshes, with extensive resources allocated to news sites. As another example, content generated by amateurs individually or in informal organizations may be considered either as an important part of our culture to be preserved, or as something that needs to be handled separately from content generated by professionals in formal organizations.
High quality is not simply the opposite of Web Spam. The recent Web Spam Challenges have explored the aspects of filtering as a binary decision. In this year's Discovery Challenge we target more and different aspects. We want to develop site-level classification for the genre of the web sites (editorial, news, commercial, educational, "deep Web", Web spam and more) as well as their readability, authoritativeness, trustworthiness and neutrality.
The data set consists of sample Web hosts from Europe. The training and testing samples are biased towards the interesting aspects and have been manually cleansed of mixed sites, Web hosting, and adult content. Features similar to those used to filter Web spam, based on content and linkage information, are provided on the host level, along with natural language processing annotations of a large set of sample pages.
As multilingualism is imperative in Europe, for ECML/PKDD we decided to include non-English tasks as well, in two major European languages. We have pre-selected sites that are primarily in English, French or German. We provide training labels for English only and expect either cross-lingual tools or language-independent features to be used to classify the French and German test sets.
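The cross-lingual setup described above can be illustrated with a minimal sketch: a classifier is fitted on labeled English hosts using only language-independent features and then applied to unlabeled French and German hosts. Everything specific here is an assumption made for illustration and is not taken from the challenge data: the two features (a log in-degree and a fraction of external outlinks), the "high"/"low" labels, the host names, the numeric values, and the choice of a nearest-centroid classifier.

```python
from collections import defaultdict
from math import dist

# Hypothetical language-independent host features:
# (log in-degree, fraction of outlinks pointing to other hosts).
# All values and labels below are made up for illustration.
english_train = [
    ((3.2, 0.10), "high"),
    ((2.9, 0.15), "high"),
    ((0.4, 0.90), "low"),
    ((0.7, 0.85), "low"),
]

def fit_centroids(samples):
    """Average the feature vectors of each label into one centroid per label."""
    sums = {}
    counts = defaultdict(int)
    for features, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: tuple(s / counts[label] for s in total)
        for label, total in sums.items()
    }

def predict(centroids, features):
    """Return the label whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

# Fit on English hosts only, since only they carry labels.
centroids = fit_centroids(english_train)

# Unlabeled French and German hosts, described by the same features.
test_hosts = {"example.fr": (3.0, 0.20), "example.de": (0.5, 0.80)}
predictions = {host: predict(centroids, f) for host, f in test_hosts.items()}
print(predictions)
```

Because the features in this sketch do not depend on the text of the pages, the model fitted on English hosts can be applied to the other languages unchanged; content-based features would instead need cross-lingual tools such as translation.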
This year's competition will have cash prizes sponsored by Google.
- 1st place (measured as average over all three tasks): USD 2500
- 2nd place: USD 1000
- 1st place for the English quality task: USD 1500
USD 2500 in travel grants for up to 5 students was offered by Yahoo! These grants were not awarded because, in the end, no student applied for them.
Data preparation and assessment are supported by the EU FP7 Project LIWA (Living Web Archives) and by the Hungarian national grant OTKA NK 72845.