cb2Bib: PDF Reference Import
Articles in PDF or other formats that can be converted to plain text can be
processed and indexed by the cb2Bib. Files can be selected using the Select Files
button, or by dragging them from the desktop or the file manager to the PDFImport
dialog panel. Files are converted to plain text by using any external translation
tool or script. This tool, and optionally its parameters, are set in the cb2Bib
configure dialog. See the Configuring Utilities section for
details.
Once the file is converted, the text and, optionally, the preparsed metadata
are sent to the cb2Bib for reference recognition. This is the usual two-step
process. First, text is optionally preprocessed, using a simple set of rules
and/or any external script or tool. See Configuring Clipboard. Second, text
is processed for reference extraction. The cb2Bib so far uses two methods. One
considers the text as a full pattern, which is checked against the user's set of
regular expressions. The better designed these rules are, the better and more
reliable the extraction will be. The second method, used when no regular
expression matches the text, considers instead a set of predefined subpatterns.
See Field Recognition Rules.
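The two recognition methods can be pictured with a minimal Python sketch. It is
not the cb2Bib implementation: the rule list, the subpatterns, and the function
name recognize are hypothetical, chosen only to show the order in which a full
pattern and the fallback subpatterns would be tried.

```python
import re

# Hypothetical user rule: field names paired with one full-text pattern.
USER_RULES = [
    (("author", "title", "year"),
     re.compile(r"^([^|]*)\|([^|]*)\|([^|]*)$")),
]

# Hypothetical predefined subpatterns, used only as a fallback.
SUBPATTERNS = {
    "doi": re.compile(r"\b(10\.\d{4,9}/\S+)"),
    "year": re.compile(r"\b(19\d{2}|20\d{2})\b"),
}


def recognize(text):
    """Return (method, fields) for a piece of extracted text."""
    # Method 1: the whole text must match one of the user's expressions.
    for names, pattern in USER_RULES:
        match = pattern.match(text.strip())
        if match:
            fields = {n: g.strip() for n, g in zip(names, match.groups())}
            return "regexp", fields
    # Method 2: no full match, so collect whatever subpatterns are found.
    fields = {}
    for name, pattern in SUBPATTERNS.items():
        found = pattern.search(text)
        if found:
            fields[name] = found.group(1)
    return "subpatterns", fields
```

In this sketch a full match always wins, and the subpattern pass runs only when
every user expression has failed, mirroring the order described above.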
At this point users can interact and supervise their references, right before
saving them. Allowing user intervention is and has been a design goal in the
cb2Bib. Thus, at this point, the cb2Bib invites users to check their references.
Poorly translated characters, accented letters, 'forgotten' words, or some minor
formatting in the titles might be worth considering. See Text Extraction for
a description of the intricacies of PDF to text conversions. In addition, if too
few fields were extracted, one might perform a network query. Say only the DOI
was caught; then there is a good chance that such a query will fill the remaining
fields.
The references are saved from the cb2Bib main panel. Once Save is pressed, and
depending on the configuration, see Configuring Documents, the document
file will be either renamed, copied, moved or simply linked to the
file field of the reference. If Insert BibTeX metadata to document
files is checked, the current reference will also be inserted into the
document itself.
When several files are going to be indexed, the sequence can be as follows:
- Process next after saving
Once files are loaded and Process is pressed, the PDFImport dialog can be minimized
(but not closed) for convenience. All required operations to completely fill the
desired fields (e.g. dynamic bookmarks, open DOI, etc., which might be required if
the data in the document is not complete) are at this point accessible from the main
panel. The link in the file field will be permanent, without
regard to which operations (e.g. clipboard copying) are needed, until the
reference is saved. The source file can be opened at any time by right clicking the
file line edit. Once the reference is saved, the next file will be
automatically processed. To skip a document file without saving its reference,
press the Process button.
- Unsupervised processing
In this operation mode, all files will be sequentially processed, following the
chosen steps and rules. If the process is successful, the reference is
automatically saved, and the next file is processed. If it is not, the
file is skipped and no reference is saved. While processing, the clipboard is
disabled for safety. Once finished, this box is unchecked, to avoid a possible
accidental saving of a void reference. Network queries that require intervention,
i.e., whose result is launching a given page, are skipped. Processing continues
until all files are processed. However, it will stop to avoid a file being
overwritten as a result of a repeated key. In this case, it will resume after
manual renaming and saving. See also The cb2Bib Command Line, commands '--txt2bib' and
'--doc2bib'.
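The control flow of unsupervised processing can be summarized in a short Python
sketch. It only illustrates the behavior described above under stated
assumptions: the function process_unsupervised, the extract callback, and the
dictionary-based reference with a "key" entry are hypothetical and do not
correspond to cb2Bib internals.

```python
def process_unsupervised(files, extract, existing_keys):
    """Process files in order; return (saved, skipped, stopped_at)."""
    saved, skipped = [], []
    for path in files:
        reference = extract(path)  # None stands for a failed extraction
        if reference is None:
            skipped.append(path)   # nothing is saved, the file is left alone
            continue
        if reference["key"] in existing_keys:
            # Stop here: a repeated key could overwrite an existing file.
            return saved, skipped, path
        existing_keys.add(reference["key"])
        saved.append(reference)
    return saved, skipped, None
```

A non-empty third return value marks the file at which processing stopped and
where, after manual renaming and saving, it would resume.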
Automatic Extraction: Questions and Answers
- When does cb2Bib do automatic extractions? The cb2Bib is conceived as
a lightweight tool to extract references and manage bibliographies in a simple,
fast, and accurate way. Accuracy is better achieved in semi-automatic
extractions. Such extractions are handy, and allow user intervention and
verification. However, in cases where one has accumulated a large number of
unindexed documents, automatic processing can be convenient. The cb2Bib does
automatic extraction when, in PDFImport mode, 'Unsupervised processing' is
checked, or, in command line mode, when typing
cb2bib --doc2bib *.pdf tmp_references.bib
or, on Windows, c2bconsole.exe instead of cb2bib.
- Are PDFImport and command line modes equivalent? Yes. There are,
however, two minor differences. First, PDFImport adds each reference to the
current BibTeX file, as this behavior is the normal one in cb2Bib. On the other
hand, command line mode will, instead, overwrite tmp_references.bib
if it exists, as this is the expected behavior for almost all command line tools.
Second, for now, command line mode does not follow the configuration option
'Check Repeated On Save'.
- How do I do automatic extraction? To test and learn about automatic
extractions, the cb2Bib distribution includes a set of four PDF files that mimic
a paper title page. For these files, the distribution also includes a regular
expression, in the file regexps.txt, capable of extracting the reference
fields, provided the pdftotext flags are set to their default values.
Processing these files should therefore be automatic, and four messages
stating Processed as 'PDF Import Example' should be seen in the
logs. Note that extractions are configurable. A reading of Configuration will
provide additional, useful information.
- Why are some entries not saved and files not renamed? Once you move
from the fabricated examples to real cases, you will realize that some of the
files, while being processed, are not renamed and their corresponding BibTeX data
is not written. For each document file, cb2Bib converts its first page to text,
and from this text it attempts to extract the bibliographic reference. By design,
when extraction fails, cb2Bib does nothing: no file is moved, no BibTeX is
written. This way, you know that the remaining files in the origin directory need
special, manual attention. Extractions are always seen as failed, unless
reliable data is found in the text.
- What is reliable data? Note that computer processing of
natural texts, such as extracting the bibliographic data from a title page, is
still an approximate procedure. The cb2Bib tries several strategies:
1) allow for including user regular expressions very specific to the
extraction at hand, 2) use metadata if available, 3) guess what is
reasonable, and, based on this, make customized queries. Then, cb2Bib considers
extracted data reliable if i) the data comes from a match to a
user-supplied regular expression, ii) the document contains BibTeX metadata, or
iii) a guess is transformed through a query to formatted bibliographic
data. As formatted bibliographic data, cb2Bib only understands BibTeX, and, as an
exception, PubMed XML data. However, it allows external processing if needed. Any
other data, metadata, guesses, and guesses on query results are considered
unreliable data.
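The reliability criteria above amount to a small classification, sketched here
in Python. The enumeration Source and the function is_reliable are hypothetical
names introduced for illustration only; they restate the three accepted cases
and are not taken from the cb2Bib code.

```python
from enum import Enum


class Source(Enum):
    """Where a set of extracted fields came from (illustrative labels)."""
    USER_REGEXP = "match to a user-supplied regular expression"
    BIBTEX_METADATA = "BibTeX metadata found in the document"
    QUERY_FORMATTED = "guess turned by a query into BibTeX or PubMed XML"
    OTHER_METADATA = "any other metadata"
    GUESS = "heuristic guess"
    GUESS_ON_QUERY = "guess on query results"


# Cases i), ii) and iii) from the text; everything else is unreliable.
RELIABLE = {Source.USER_REGEXP, Source.BIBTEX_METADATA, Source.QUERY_FORMATTED}


def is_reliable(source):
    """Return True only for the three cases accepted as reliable."""
    return source in RELIABLE
```

Under this sketch an extraction would count as successful only when
is_reliable returns True for the origin of its data.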
- Is metadata reliable data? No. Only author, title, and keywords in
standard PDF metadata can be mapped to their corresponding bibliographic fields.
Furthermore, publishers most often misuse these three keys, placing, for
instance, DOI in title, or setting author to, perhaps, the document typesetter.
Only BibTeX XMP metadata is considered reliable, and only documents already
processed with cb2Bib or JabRef will have it. If you consider that a set of PDF
files does contain reliable data, you may force cb2Bib to accept it by using the
command line switch --sloppy together with --doc2bib.
- How successful is automatic extraction? As follows from the given
definition of reliable data, running automatic extractions without ad hoc
regexps.txt and netqinf.txt files will certainly give a
zero success ratio. In practice, scenario 3) often applies: cb2Bib guesses
several fields, and, based on the out-of-the-box netqinf.txt file,
it obtains from the web either BibTeX or PubMed XML data. Thus, biologists, for
instance, usually have success ratios close to 100%, since PubMed is almost
complete for them, and its data is extremely accurate.
- What can I do to increase the success ratio? First, set your favorite
journals in the file abbreviations.txt. Besides increasing the chances
of journal name recognition, it will provide consistency across your BibTeX
database. In general, do not write regular expressions to extract directly from
the PDF text. Conversion is often poor. Special characters often break lines,
thus breaking your regular expressions too. Write customized queries instead. For
instance, if your PDFs have the DOI on the title page, set the simple query
journal=The Journal of Everything|
query=http://dx.doi.org/<<doi>>
capture_from_query=
referenceurl_prefix=
referenceurl_sufix=
pdfurl_prefix=
pdfurl_sufix=
action=htm2txt_query
Then, if it is feasible to extract the reference from the document's web
page using a regular expression, include it in the file regexps.txt.
Note that querying in cb2Bib has been designed with minority fields of research
in mind, for which established databases might not be available. If cb2Bib
fails to make reasonable guesses, you might consider writing very simple
regular expressions to extract directly from the PDF text, for instance, to
obtain the title only. Then, the subsequent query step can provide the remaining
information. Note also that, especially for old documents, the journal name is
often missing from the paper title page. If you need to process a series of such
papers, consider using a simple script that, in the cb2Bib preprocessing step,
adds this missing information.
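A preprocessing script of the kind just mentioned could look like the following
Python sketch. It assumes that the external tool is called with an input file
and an output file as its two arguments; check the Configuring Clipboard section
for the actual calling convention. The journal name is the placeholder used in
the query example, and the function names are hypothetical.

```python
import sys

# Placeholder journal name; replace it with the journal of the paper series.
JOURNAL = "The Journal of Everything"


def add_journal(text, journal=JOURNAL):
    """Prepend the journal name unless the text already mentions it."""
    if journal.lower() in text.lower():
        return text
    return journal + "\n" + text


def main(argv):
    # Assumed calling convention: script <input file> <output file>
    in_path, out_path = argv[1], argv[2]
    with open(in_path, encoding="utf-8") as src:
        text = src.read()
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write(add_journal(text))


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv)
```

The check for an existing mention keeps the script harmless for documents whose
title page already carries the journal name.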
- Does successful extraction mean accurate extraction? No. An extraction
is successful if reliable data, as defined above, is found in the text, in the
metadata, or in the text returned by a query. Reference accuracy relies on
whether or not user regular expressions are robust, BibTeX metadata is correct, a
guess is appropriate, a set of queries can correct a partially incorrect guess,
and the text returned by a query is accurate. In general, well designed sets of
regular expressions are accurate. Publishers' abstract pages and PubMed are
accurate. However, some publishers are still using images for non-ASCII
characters, and PubMed algorithms may drop author middle names if a given author
has 'too many names'. Expect convenience over accuracy on other sources.
- Can I use cb2Bib to extract comma-separated value (CSV) references? Yes. To
automatically import multiple CSV references you will need one regular
expression. If you can control CSV export, choose | as separator, since a comma
might be used, for instance, in titles. The regular expression for
AuthName1, AuthName2 | Title | 2010
will simply be
author title year
^([^|]*)\|([^|]*)\|([^|]*)$
The reference file references.csv can then be split into
single-line files by typing
split -l 1 references.csv slineref
and the commands
cb2bib --txt2bib slineref* references.bib
rm -f slineref*
will convert references.csv to the BibTeX file references.bib.
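Before running the conversion, the regular expression can be checked against a
sample line. The Python sketch below is only a test aid, not part of cb2Bib: the
function parse_line is hypothetical, and trimming the blanks around each field
is done here for readability.

```python
import re

# The expression and field order from the example above.
CSV_LINE = re.compile(r"^([^|]*)\|([^|]*)\|([^|]*)$")
FIELDS = ("author", "title", "year")


def parse_line(line):
    """Return a field dictionary, or None if the line does not match."""
    match = CSV_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    # Surrounding blanks are trimmed in this sketch.
    return {name: value.strip() for name, value in zip(FIELDS, match.groups())}


print(parse_line("AuthName1, AuthName2 | Title | 2010"))
```

A line with more or fewer than two separators does not match, which is a quick
way to spot rows that would need manual attention.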