cb2Bib PDF Reference Import


Articles in PDF or other formats that can be converted to plain text can be processed and indexed by cb2Bib. Files can be selected using the Select Files button, or by dragging them from the desktop or the file manager to the PDFImport dialog panel. Files are converted to plain text using any external translation tool or script. This tool, and optionally its parameters, is set in the cb2Bib configure dialog. See the Configuring Utilities section for details.

Once the file is converted, the text and, optionally, the preparsed metadata are sent to cb2Bib for reference recognition. This is the usual two-step process. First, the text is optionally preprocessed, using a simple set of rules and/or any external script or tool. See Configuring Clipboard. Second, the text is processed for reference extraction. cb2Bib currently uses two methods. The first considers the text as a full pattern, which is checked against the user’s set of regular expressions. The better designed these rules are, the better and more reliable the extraction will be. The second method, used when no regular expression matches the text, considers instead a set of predefined subpatterns. See Field Recognition Rules.

At this point users can interact with and supervise their references, right before saving them. Allowing user intervention is and has been a design goal in cb2Bib. Therefore, at this point, cb2Bib helps users to check their references. Poorly translated characters, accented letters, ‘forgotten’ words, or some minor formatting in the titles might be worth checking. See Glyph & Cog’s Text Extraction for a description of the intricacies of PDF to text conversion. In addition, if too few fields were extracted, one might perform a network query. If, say, only the DOI was caught, there is a good chance that such a query will fill in the remaining fields.

The references are saved from the cb2Bib main panel. Once Save is pressed, and depending on the configuration (see Configuring Documents), the document file will be either renamed, copied, moved, or simply linked in the file field of the reference. If Insert BibTeX metadata to document files is checked, the current reference will also be inserted into the document itself.

When several files are going to be indexed, the sequence can be as follows:

  • Process next after saving
    Once files are loaded and Process is pressed, the PDFImport dialog can be minimized (but not closed) for convenience. All operations required to completely fill the desired fields (e.g. dynamic bookmarks, open DOI, etc., which might be needed if the data in the document is incomplete) are at this point accessible from the main panel. The link in the file field will persist, regardless of which operations (e.g. clipboard copying) are needed, until the reference is saved. The source file can be opened at any time by right-clicking the file line edit. Once the reference is saved, the next file will be automatically processed. To skip a document file without saving its reference, press the Process button.
  • Unsupervised processing
    In this operation mode, all files will be processed sequentially, following the chosen steps and rules. If the process is successful, the reference is automatically saved, and the next file is processed. If it is not, the file is skipped and no reference is saved. While processing, the clipboard is disabled for safety. Once finished, this box is unchecked, to avoid accidentally saving a void reference. Network queries that require intervention, i.e., whose result is launching a given page, are skipped. The process continues until all files are processed. However, it will stop to avoid a file being overwritten as a result of a repeated key. In this case, it will resume after manual renaming and saving. See also cb2Bib Command Line, commands --txt2bib and --doc2bib.
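
The control flow of unsupervised processing described above can be sketched as follows. This is only an illustration, not cb2Bib's actual code: the function `run_unsupervised`, the `extract` callback, and the use of a `"key"` entry to stand for the reference key are all assumptions made for the example. Resuming after manual renaming is not modeled; the sketch simply reports the file at which processing stopped.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

Reference = Dict[str, str]


def run_unsupervised(
    files: Iterable[str],
    extract: Callable[[str], Optional[Reference]],
    saved_keys: Set[str],
) -> Tuple[List[str], List[str], Optional[str]]:
    """Return (saved files, skipped files, file that stopped the run or None)."""
    saved: List[str] = []
    skipped: List[str] = []
    for path in files:
        # Hypothetical callback: returns None when extraction fails.
        reference = extract(path)
        if reference is None:
            # Failed extraction: the file is skipped and nothing is saved.
            skipped.append(path)
            continue
        key = reference["key"]
        if key in saved_keys:
            # A repeated key could overwrite an existing file: stop here.
            return saved, skipped, path
        saved_keys.add(key)
        saved.append(path)
    return saved, skipped, None
```

Calling the function with a list of file names and a stand-in `extract` callback returns the files that would be saved, the files left for manual attention, and, if a repeated key was found, the file where the run stopped.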


Automatic Extraction: Questions and Answers

  • When does cb2Bib do automatic extractions?
    cb2Bib is conceived as a lightweight tool to extract references and manage bibliographies in a simple, fast, and accurate way. Accuracy is best achieved in semi-automatic extractions. Such extractions are handy, and allow user intervention and verification. However, in cases where one has accumulated a large number of unindexed documents, automatic processing can be convenient. cb2Bib does automatic extraction when, in PDFImport mode, ‘Unsupervised processing’ is checked, or, in command line mode, when typing cb2bib --doc2bib *.pdf tmp_references.bib (on Windows, use c2bconsole.exe instead of cb2bib).
  • Are PDFImport and command line modes equivalent?
    Yes. There are, however, two minor differences. First, PDFImport adds each reference to the current BibTeX file, as this is the normal behavior in cb2Bib. Command line mode will, instead, overwrite tmp_references.bib if it exists, as this is the expected behavior for almost all command line tools. Second, for now, command line mode does not follow the configuration option ‘Check Repeated On Save’.
  • How do I do automatic extraction?
    To test and learn about automatic extractions, the cb2Bib distribution includes a set of four PDF files that mimic a paper title page. For these files, the distribution also includes a regular expression, in file regexps.txt, capable of extracting the reference fields, provided the pdftotext flags are set to their default values. Processing these files should, therefore, be automatic, and four messages stating Processed as 'PDF Import Example' should be seen in the logs. Note that extractions are configurable. Reading Configuration will provide additional, useful information.
  • Why are some entries not saved and files not renamed?
    Once you move from the fabricated examples to real cases, you will realize that some of the files, while being processed, are not renamed and their corresponding BibTeX data is not written. For each document file, cb2Bib converts its first page to text, and from this text it attempts to extract the bibliographic reference. By design, when extraction fails, cb2Bib does nothing: no file is moved, no BibTeX is written. This way, you know that the remaining files in the origin directory need special, manual attention. Extractions are seen as failed, unless reliable data is found in the text.
  • What is reliable data?
    Note that computer processing of natural text, such as extracting the bibliographic data from a title page, is still an approximate procedure. cb2Bib tries several strategies: 1) allow for including user regular expressions very specific to the extraction at hand, 2) use metadata if available, 3) guess what is reasonable, and, based on this, make customized queries. cb2Bib then considers extracted data reliable if i) the data comes from a match to a user-supplied regular expression, ii) the document contains BibTeX metadata, or iii) a guess is transformed through a query into formatted bibliographic data. As formatted bibliographic data, cb2Bib understands BibTeX, PubMed XML, arXiv XML, and CR JSON data. In addition, it allows external processing if needed. Other data, metadata, guesses, and guesses on query results are considered unreliable data.
  • Is metadata reliable data?
    No. Only author, title, and keywords in standard PDF metadata can be mapped to their corresponding bibliographic fields. Furthermore, publishers most often misuse these three keys, placing, for instance, the DOI in title, or setting author to, perhaps, the document typesetter. Only BibTeX XMP metadata is considered reliable. If you consider that a set of PDF files does contain reliable data, you may force cb2Bib to accept it using the command line switch --sloppy together with --doc2bib.
  • How successful is automatic extraction?
    As follows from the given definition of reliable data, running automatic extractions without ad hoc regexps.txt and netqinf.txt files will certainly give a zero success ratio. In practice, scenario 3) often applies: cb2Bib guesses several fields, and, based on the out-of-the-box netqinf.txt file, it obtains from the web either BibTeX, PubMed XML, arXiv XML, or CR JSON data.
  • What can I do to increase the success ratio?
    First, set your favorite journals in file abbreviations.txt. Besides increasing the chances of journal name recognition, this will provide consistency across your BibTeX database. In general, do not write regular expressions to extract directly from the PDF text. Conversion is often poor. Special characters often break lines, thus breaking your regular expressions too. Write customized queries instead. For instance, if your PDFs have the DOI on the title page, set the simple query

    journal=The Journal of Everything|

    Then, if it is feasible to extract the reference from the document’s web page using a regular expression, include it in file regexps.txt. Note that querying in cb2Bib was designed with minority fields of research in mind, for which established databases might not be available. If cb2Bib fails to make reasonable guesses, you might then consider writing very simple regular expressions to extract directly from the PDF text, for instance, to obtain the title only. The subsequent query step can then provide the remaining information. Note also that, especially for old documents, the journal name is often missing from the paper title page. If you need to process a series of such papers, consider using a simple script that, in the cb2Bib preprocessing step, adds this missing information.
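
A preprocessing script of the kind just mentioned could look like the following minimal sketch. It assumes the script is called with an input file and an output file as its two arguments; check Configuring Clipboard for how external tools are actually invoked. The journal name is the one from the query example above, and the helper `add_journal` is a name made up for this illustration.

```python
import sys

# Assumption: every paper in the batch belongs to this journal.
JOURNAL = "The Journal of Everything"


def add_journal(text: str, journal: str) -> str:
    """Append a journal line unless the journal name is already in the text."""
    if journal.lower() in text.lower():
        return text
    return text.rstrip("\n") + "\n" + journal + "\n"


def main(argv):
    # Assumed calling convention: script <input file> <output file>
    in_path, out_path = argv[1], argv[2]
    with open(in_path, encoding="utf-8") as handle:
        text = handle.read()
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write(add_journal(text, JOURNAL))


# Run only when both file arguments are given on the command line.
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv)
```

Appending the name as a separate last line keeps the converted text otherwise untouched, so any existing regular expressions or guesses that work on the title page are not disturbed.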

  • Does successful extraction mean accurate extraction?
    No. An extraction is successful if reliable data, as defined above, is found in the text, in the metadata, or in the text returned by a query. Reference accuracy depends on whether user regular expressions are robust, BibTeX metadata is correct, a guess is appropriate, a set of queries can correct a partially incorrect guess, and the text returned by a query is accurate. In general, well designed sets of regular expressions are accurate. Publishers’ abstract pages and PubMed are accurate. However, some publishers still use images for non-ASCII characters, and PubMed algorithms may drop author middle names if a given author has ‘too many names’. Expect convenience over accuracy from other sources.
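
The reliability rule given in the answers above can be summarized in a few lines of code. This is a sketch for illustration only, not cb2Bib's implementation: the `Extraction` class, its field names, and the `is_reliable` function are hypothetical, and the `sloppy` argument models the --sloppy switch only in the sense described above, that is, accepting standard PDF metadata that would otherwise be rejected.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    """Where the extracted data came from (hypothetical field names)."""
    matched_user_regexp: bool = False
    has_bibtex_metadata: bool = False
    query_returned_formatted_data: bool = False
    has_plain_pdf_metadata: bool = False


def is_reliable(result: Extraction, sloppy: bool = False) -> bool:
    """Apply conditions i) to iii); plain PDF metadata counts only if sloppy."""
    if (
        result.matched_user_regexp
        or result.has_bibtex_metadata
        or result.query_returned_formatted_data
    ):
        return True
    # Standard PDF metadata is unreliable unless acceptance is forced.
    return sloppy and result.has_plain_pdf_metadata
```

Note that the function says nothing about accuracy: a reference that passes this test is saved, but it can still contain errors inherited from the regular expression, the metadata, or the query result.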

  • Can I use cb2Bib to extract comma-separated value (CSV) references?
    Yes. To automatically import multiple CSV references you will need one regular expression. If you can control the CSV export, choose | as the separator, since commas might be used, for instance, in titles. The regular expression for

    AuthName1, AuthName2 | Title | 2010

    will simply be

    author title year

    The reference file references.csv can then be split to single-line files typing

    split -l 1 references.csv slineref

    and the command

    cb2bib --txt2bib slineref* references.bib
    rm -f slineref*

    will convert references.csv to the BibTeX file references.bib.
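
To see how a single expression can map a |-separated line to the fields author, title, and year, the following sketch uses Python's `re` module. The pattern is an assumption written for this illustration; it is not the text to place in regexps.txt, and the regular expression syntax accepted by cb2Bib may differ from Python's.

```python
import re

# Field names, in the order of the capture groups below.
FIELDS = ("author", "title", "year")

# Hypothetical pattern: three groups separated by "|", the last a 4-digit year.
PATTERN = re.compile(r"^\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|\s*(\d{4})\s*$")


def parse_line(line: str):
    """Return a dict of fields for one CSV line, or None if it does not match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))
```

Because the groups exclude only the | character, commas inside the author list or the title are kept as part of the field, which is the reason for preferring | over a comma as separator.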