Reading and writing bibliographic metadata


 

Reading metadata

Metadata in scientific documents is, unfortunately, rarely appreciated and not widely used. For bibliographic metadata the situation is even more disappointing: there is no accepted format specification, and publishers' metadata, when present at all, is often unreliable.

The cb2Bib reads all XMP packets found in the document (XMP is an XML-based standard devised for metadata storage). It then parses the XML strings, looking for nodes and attributes whose key names are meaningful to bibliographic references. If a given bibliographic field is found in multiple packets, the cb2Bib takes the last one, which most often, and according to the PDF specification, is the most up-to-date one. The fields file, which would be the document itself, and pages, which is usually the actual number of pages, are skipped.
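The reading rules above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the cb2Bib source: the list of recognized field names (KNOWN_FIELDS), the helper names, and the packet-matching regular expression are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical list of key names considered bibliographic; the actual list is not given here.
KNOWN_FIELDS = {"author", "title", "journal", "year", "volume", "file", "pages"}
# Fields that are recognized but deliberately skipped when reading.
SKIPPED_FIELDS = {"file", "pages"}

# An XMP packet is delimited by <?xpacket begin=...?> and <?xpacket end=...?>.
PACKET_RE = re.compile(rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL)


def local_name(qualified):
    """Strip an ElementTree '{namespace}' prefix from a tag or attribute name."""
    return qualified.rsplit("}", 1)[-1]


def read_fields(data):
    """Collect bibliographic fields from every XMP packet found in raw document bytes."""
    fields = {}
    for match in PACKET_RE.finditer(data):
        try:
            root = ET.fromstring(match.group(1).strip())
        except ET.ParseError:
            continue  # ignore packets that are not well-formed XML
        for node in root.iter():
            # Consider both the node itself and its attributes as key/value candidates.
            candidates = [(node.tag, "".join(node.itertext()))]
            candidates += list(node.attrib.items())
            for name, value in candidates:
                key = local_name(name).lower()
                if key in KNOWN_FIELDS and key not in SKIPPED_FIELDS and value.strip():
                    fields[key] = value.strip()  # a later packet overwrites an earlier one
    return fields
```

Because packets are visited in document order and each value simply overwrites the previous one, the last packet wins, as described above.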

The metadata is then summarized in the cb2Bib clipboard panel as, for instance,

[Bibliographic Metadata
<title>arXiv:0705.0751v1  [cs.IR]  5 May 2007</title>
/Bibliographic Metadata]

This data, whenever the user considers it to be correct, can be easily imported by the built-in 'Heuristic Guess' capability. On the other hand, if keys are found with the prefix bibtex, the cb2Bib will assume that the document does contain bibliographic metadata, and it will only consider the keys having this prefix. Assuming, therefore, that the metadata is bibliographic, the cb2Bib will automatically import the reference. This way, when using PDFImport, BibTeX-aware documents will be processed as successfully recognized, without requiring any user-supplied regular expression.
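The prefix rule can be illustrated with a short Python sketch. The function name, the (key, value) pair representation, and the returned flag are hypothetical and chosen only for this example; they are not taken from the cb2Bib code.

```python
PREFIX = "bibtex:"


def select_bibliographic(pairs):
    """Return (fields, auto_import) from a list of ('prefix:key', value) pairs.

    If any key carries the bibtex prefix, only those keys are kept and the
    reference is treated as importable without user review. Otherwise all
    keys are kept, and the result is left for the user to check.
    """
    prefixed = [(k, v) for k, v in pairs if k.lower().startswith(PREFIX)]
    chosen = prefixed if prefixed else pairs
    fields = {}
    for key, value in chosen:
        # Drop the namespace prefix, keeping only the field name.
        fields[key.split(":", 1)[-1].lower()] = value
    return fields, bool(prefixed)
```

For a document carrying both generic and bibtex-prefixed keys, only the latter survive, and the flag signals that the reference can be imported automatically.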

See also Release Note cb2Bib 1.0.0, Configuring Clipboard, and PDF Reference Import.

 

Writing metadata

Once an extracted reference is saved and there is a document attached to it, the cb2Bib will optionally insert the bibliographic metadata into the document itself. The cb2Bib writes an XMP packet as, for instance,

<bibtex:author>P. Constans</bibtex:author>
<bibtex:journal>arXiv 0705.0751</bibtex:journal>
<bibtex:title>Approximate textual retrieval</bibtex:title>
<bibtex:type>article</bibtex:type>
<bibtex:year>2007</bibtex:year>

which is similar to what JabRef writes, but differs in that the cb2Bib sticks strictly to BibTeX and avoids (perhaps unnecessary) syntax specialization in author strings.

The BibTeX fields file and id are skipped when writing: the former for the reason mentioned above, and the latter because it is easily generated by specialized BibTeX software according to each user's preferences. LaTeX escaped characters for non-ASCII letters are converted to UTF-8, as XMP already specifies this encoding.
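As a rough sketch of these writing rules, the following Python fragment builds the bibtex lines of a packet from a reference. The conversion table is a tiny, hypothetical subset of LaTeX escapes, and the function names are assumptions for the example; the enclosing XMP packet wrapper is intentionally left out.

```python
from xml.sax.saxutils import escape

# Small illustrative subset of LaTeX escapes; a real table would be much longer.
LATEX_TO_UTF8 = {
    r"{\'e}": "é",
    r'{\"o}': "ö",
    r'{\"u}': "ü",
    r"{\~n}": "ñ",
}
# Fields that are never written into the document.
NOT_WRITTEN = {"file", "id"}


def to_utf8(value):
    """Replace the LaTeX escapes listed above with their UTF-8 characters."""
    for latex, char in LATEX_TO_UTF8.items():
        value = value.replace(latex, char)
    return value


def bibtex_xmp_lines(reference):
    """Return one <bibtex:key>value</bibtex:key> line per written field, sorted by key."""
    lines = []
    for key in sorted(reference):
        if key in NOT_WRITTEN:
            continue
        text = escape(to_utf8(reference[key]))  # escape &, < and > for XML
        lines.append(f"<bibtex:{key}>{text}</bibtex:{key}>")
    return lines
```

Applied to the reference shown earlier in this section, with file and id fields added, the function yields the same five lines, and the two skipped fields do not appear.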

The actual writing of the packet into the document is performed by ExifTool, an excellent Perl program written by Phil Harvey. See http://www.sno.phy.queensu.ca/~phil/exiftool/. ExifTool supports writing to several document formats. The most relevant here are PostScript and PDF. For PDF documents, metadata is written as an incremental update of the document. This exactly preserves the binary structure of the document, and changes can be easily reversed or modified if so desired. Whenever ExifTool is unable to insert metadata, e.g., because the document format is not supported or the document has structural errors, the cb2Bib will issue an information message, and the document will remain untouched.
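For illustration, ExifTool can be driven from Python as sketched below. This is not necessarily how the cb2Bib invokes it: the function names are hypothetical, and the sketch assumes that a complete XMP packet has already been saved to a separate file. It relies on two documented ExifTool features: the `-xmp<=FILE` form, which writes an XMP block read from a file, and `-PDF-update:all=`, which reverts the incremental updates ExifTool made to a PDF.

```python
import shutil
import subprocess


def exiftool_command(document, packet_file):
    """Command line that writes the XMP packet stored in packet_file into document."""
    return ["exiftool", f"-xmp<={packet_file}", document]


def undo_command(document):
    """Command line that reverts ExifTool's incremental updates to a PDF document."""
    return ["exiftool", "-PDF-update:all=", document]


def write_metadata(document, packet_file):
    """Run ExifTool; return False if it is missing or reports an error."""
    if shutil.which("exiftool") is None:
        return False  # ExifTool is not installed or not on the PATH
    result = subprocess.run(
        exiftool_command(document, packet_file),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

A caller would report a message to the user whenever write_metadata returns False, mirroring the information message mentioned above.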

See also Configuring Documents and Update Documents Metadata.