Simeon Warner, Research Associate in Computer and Information Science at Cornell and member of the Open Archives Initiative Technical Committee, will speak from his experience working with the e-print archive, discussing "The arXiv metadata format: problems, shortcomings, and service-driven motivations for change".
Descriptive metadata is used as the basis for searching among the 250,000 articles in arXiv. This author-supplied metadata undergoes just a few automated checks and very brief human inspection at ingest. Warner will first describe arXiv's current metadata format and its problems and shortcomings. He will then talk about the plans for a new internal metadata format and the service-driven motivations for the design choices being made.
Simeon Warner began his presentation with a review of the history of arXiv, from its beginnings in 1991 as a mail reflector, through its evolution to an early web interface in 1993-94, to the current system containing about a quarter-million articles. He then described the ingest process: authors enter metadata through web forms, and automatic checks reject some submissions and flag others for human examination.
He described the motivations for changing the Author and Subject Classification metadata elements, particularly the need to improve author searching, arXiv's most-used search mode. Requirements for a new format include support for multi-word searches and ways to accommodate single-word names, non-Latin characters, and multi-author collaborations, among others.
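To make the author-search requirements concrete, here is a minimal Python sketch of how a structured author record might support them. It is not arXiv's actual or planned format: the `Author` fields, the accent-folding rule, and the matching logic are all assumptions chosen for illustration.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class Author:
    # Hypothetical record; the field names are illustrative, not arXiv's schema.
    keyname: str                    # family name, or the whole name if it is a single word
    forenames: str = ""             # left empty for single-word names
    is_collaboration: bool = False  # True for entries such as "ATLAS Collaboration"


def fold(text):
    """Strip combining accents and case-fold, so that 'Erdős' matches 'erdos'."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def search_tokens(author):
    """Return the set of folded words that make up the author's name."""
    words = f"{author.forenames} {author.keyname}".split()
    return {fold(word) for word in words}


def matches(query, author):
    """Multi-word search: every query word must equal one word of the name."""
    words = query.split()
    if not words:
        return False  # an empty query matches nothing
    tokens = search_tokens(author)
    return all(fold(word) in tokens for word in words)
```

With these assumed definitions, `matches("erdos paul", Author("Erdős", "Paul"))` succeeds regardless of word order or accents, a single-word name is stored in `keyname` alone, and a collaboration is searched by the words of its name. Folding accents is only one possible policy; a real service might instead index both the original and the folded forms.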
He went on to discuss conflicts inherent in operating the system, which arise from the desire to provide a system that is easy to learn, easy to use, and understandable. He pointed out that this goal is hard to satisfy alongside the simultaneous need for appropriate granularity, validation, and generalized (but understandable) categories, and that it is difficult to present a complex system without demanding a large learning commitment from users.
Other topics mentioned were the special problems of non-ASCII characters, math markup, the relationships among internal metadata elements, and interoperability with external systems.
Questions followed the presentation.