cb2Bib Reading and Writing Bibliographic Metadata

Reading Metadata

For decades, metadata in scientific documents was rarely appreciated or used, and no format specification for bibliographic metadata gained wide acceptance. Back in 2008, cb2Bib adapted the predefined metadata capabilities of PDF to set BibTeX bibliographic keys in document files.

cb2Bib reads all XMP packets found in the document (XMP is an XML-based standard devised for metadata storage). It then parses the XML strings, looking for nodes and attributes whose key names are meaningful to bibliographic references. If a given bibliographic field is found in multiple packets, cb2Bib takes the last one, which most often, and according to the PDF specification, is the most up-to-date. The fields file, which would be the document itself, and pages, which usually holds the document's actual number of pages, are skipped.
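
The reading steps above can be sketched in Python as follows. This is only an illustration, not cb2Bib's own code: the names MEANINGFUL, SKIPPED, PACKET_RE, local_name, and read_fields are hypothetical, and the set of "meaningful" keys is an assumed sample.

```python
import re
import xml.etree.ElementTree as ET

# Assumed sample of key names considered meaningful (hypothetical list).
MEANINGFUL = {"title", "author", "journal", "year", "volume", "pages", "file"}
# Fields the text says are skipped on reading.
SKIPPED = {"file", "pages"}

# An XMP packet sits between <?xpacket begin=...?> and <?xpacket end=...?>.
PACKET_RE = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)

def local_name(name):
    """Drop the '{namespace}' part of an ElementTree tag or attribute name."""
    return name.rsplit("}", 1)[-1].lower()

def read_fields(data):
    """Collect bibliographic fields from every XMP packet in raw bytes."""
    fields = {}
    for match in PACKET_RE.finditer(data):
        try:
            root = ET.fromstring(match.group(1).strip())
        except ET.ParseError:
            continue  # ignore packets that are not well-formed XML
        for node in root.iter():
            # Look at the node itself and at each of its attributes.
            found = [(node.tag, "".join(node.itertext()))]
            found += list(node.attrib.items())
            for name, value in found:
                key = local_name(name)
                value = value.strip()
                if key in MEANINGFUL and key not in SKIPPED and value:
                    fields[key] = value  # later packets overwrite earlier ones
    return fields
```

Because packets are visited in file order and each match overwrites the previous value, the last packet wins, which mirrors the rule described above.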

The metadata is then summarized in the cb2Bib clipboard panel as, for instance:

[Bibliographic Metadata
<title>arXiv:0705.0751v1  [cs.IR]  5 May 2007</title>
/Bibliographic Metadata]

Whenever the user considers this data to be correct, it can be easily imported with the built-in ‘Heuristic Guess’ capability. If, on the other hand, keys with the prefix bibtex are found, cb2Bib assumes that the document does contain bibliographic metadata, and it considers only the keys having this prefix. Since the metadata is then taken to be bibliographic, cb2Bib imports the reference automatically. This way, when using PDFImport, BibTeX-aware documents are processed as successfully recognized, without requiring any user-supplied regular expression.
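
The prefix rule can be illustrated with a short Python sketch. The function name select_fields and its input shape, a list of (qualified key, value) pairs in packet order, are assumptions made for this example and are not taken from cb2Bib.

```python
def select_fields(pairs):
    """Split metadata into bibtex-prefixed and generic keys.

    Returns (fields, auto_import). If any key carries the 'bibtex' prefix,
    only those keys are kept and the reference can be imported automatically;
    otherwise all keys are kept and left to the user and a heuristic guess.
    """
    bibtex = {}
    generic = {}
    for qualified_key, value in pairs:
        # 'bibtex:title' -> prefix 'bibtex', name 'title'; 'title' -> prefix ''.
        prefix, _, name = qualified_key.rpartition(":")
        target = bibtex if prefix == "bibtex" else generic
        target[name] = value
    if bibtex:
        return bibtex, True
    return generic, False
```

For instance, a document carrying both dc:title and bibtex:title would yield only the bibtex value, with the automatic-import flag set.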

See also Release Note cb2Bib 1.0.0, Configuring Clipboard, and PDF Reference Import.


Writing Metadata

Once an extracted reference is saved and a document is attached to it, cb2Bib can optionally insert the bibliographic metadata into the document itself. cb2Bib writes an XMP packet such as, for instance:

<bibtex:author>P. Constans</bibtex:author>
<bibtex:journal>arXiv 0705.0751</bibtex:journal>
<bibtex:title>Approximate textual retrieval</bibtex:title>

The BibTeX fields file and id are skipped when writing. The former is skipped for the reason mentioned above, and the latter because it is easily generated by specialized BibTeX software according to each user's preferences. LaTeX escape sequences for non-ASCII letters are converted to UTF-8, as XMP already specifies this encoding.
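
A minimal Python sketch of these writing rules follows. It produces only the bibtex-prefixed element lines shown in the example above; the enclosing packet wrapper and namespace declaration are deliberately left out. The names SKIPPED_ON_WRITE, LATEX_TO_UTF8, to_utf8, and bibtex_xmp_lines are hypothetical, and the conversion table is a tiny assumed sample, not cb2Bib's actual table.

```python
from xml.sax.saxutils import escape

# Fields the text says are not written to the document.
SKIPPED_ON_WRITE = {"file", "id"}

# Tiny assumed sample of LaTeX escapes and their UTF-8 letters.
LATEX_TO_UTF8 = {
    r"{\'e}": "é",
    r'{\"o}': "ö",
    r"{\~n}": "ñ",
}

def to_utf8(value):
    """Replace the sample LaTeX escapes with their Unicode letters."""
    for latex, letter in LATEX_TO_UTF8.items():
        value = value.replace(latex, letter)
    return value

def bibtex_xmp_lines(reference):
    """Return one <bibtex:key>value</bibtex:key> line per written field."""
    lines = []
    for key, value in reference.items():
        if key in SKIPPED_ON_WRITE:
            continue
        # escape() protects &, < and > so the line stays well-formed XML.
        text = escape(to_utf8(value))
        lines.append(f"<bibtex:{key}>{text}</bibtex:{key}>")
    return lines
```

Converting before escaping keeps the two concerns separate: the first step deals with LaTeX notation, the second only with XML special characters.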

The actual writing of the packet into the document is performed by ExifTool, an excellent Perl program written by Phil Harvey (see https://exiftool.org). ExifTool supports writing to several document formats; the most relevant here are PostScript and PDF. For PDF documents, metadata is written as an incremental update of the document. This exactly preserves the binary structure of the document, and changes can be easily reversed or modified if so desired. Whenever ExifTool is unable to insert metadata, e.g., because the document format is not supported or the document has structural errors, cb2Bib issues an information message, and the document remains untouched.
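
As a rough illustration of delegating the write to ExifTool, the Python sketch below builds and runs a command line. How cb2Bib actually invokes ExifTool is not stated here, so the whole flow is an assumption: it relies on ExifTool's -TAG<=DATFILE syntax to set the XMP block from a packet file, and on deleting the PDF-update pseudo-group to reverse an incremental update. The function names and the packet file are hypothetical.

```python
import shutil
import subprocess

def write_command(document, packet_file, exiftool="exiftool"):
    """Command that sets the document's XMP block from a packet file."""
    return [exiftool, "-XMP<=" + packet_file, document]

def undo_command(document, exiftool="exiftool"):
    """Command that removes ExifTool's incremental update from a PDF."""
    return [exiftool, "-PDF-update:all=", document]

def insert_metadata(document, packet_file):
    """Try to write the packet; return (ok, message) without raising."""
    if shutil.which("exiftool") is None:
        return False, "ExifTool was not found on this system"
    result = subprocess.run(
        write_command(document, packet_file),
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Assumed behavior: a non-zero status means nothing was written.
        return False, result.stderr.strip()
    return True, result.stdout.strip()
```

Passing the arguments as a list avoids shell quoting of the < character. A caller would show the returned message to the user when the first element is False.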

See also Configuring Documents and Update Documents Metadata.