Once manual processing is done, the cb2Bib clipboard area contains the extraction
tags plus, possibly, some other cb2Bib tags introduced during preprocessing
(see Extracting Data from the
Clipboard). The RegExp Editor will generate a tentative regular expression,
or matching pattern, usable for automated extractions.
The cb2Bib matching patterns consist of four lines: a brief description, the
reference type, an ordered list of captured fields, and the regular expression
itself.
# cb2Bib 1.4.1 Pattern:
American Chemical Society Publications
article
journal volume pages year title author abstract
^(.+), (\d+) \(.+\), ([\d|\-|\s]+),(\d\d\d\d)\..+<NewLine3>(.+)<NewLine4>(.+)<NewLine5>.+Abstract:<NewLine\d+>(.+)$
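To make the four-line layout concrete, here is a minimal sketch in Python that splits such a pattern into its parts and applies it to a piece of tagged text. It is only an illustration, not how cb2Bib itself is implemented: cb2Bib uses Qt's regular expression engine, whose syntax differs slightly from Python's `re`. The names `Cb2BibPattern`, `parse_pattern`, and `extract`, the sample text, and the treatment of `#` lines as comments are all assumptions made for this example.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Cb2BibPattern:
    """Hypothetical container for the four lines of a pattern."""
    description: str
    reference_type: str
    fields: List[str]
    regex: str


def parse_pattern(block: str) -> Cb2BibPattern:
    # Assumption: blank lines and lines starting with '#' are not pattern lines.
    lines = [ln for ln in block.splitlines()
             if ln.strip() and not ln.startswith("#")]
    if len(lines) != 4:
        raise ValueError(f"expected 4 pattern lines, got {len(lines)}")
    description, reference_type, fields, regex = lines
    return Cb2BibPattern(description, reference_type, fields.split(), regex)


def extract(pattern: Cb2BibPattern, tagged_text: str) -> Optional[Dict[str, str]]:
    # Pair each capture group with the field name at the same position.
    match = re.search(pattern.regex, tagged_text)
    if match is None:
        return None
    return dict(zip(pattern.fields, match.groups()))


# The example pattern, with the regular expression on a single line.
PATTERN_TEXT = r"""# cb2Bib 1.4.1 Pattern:
American Chemical Society Publications
article
journal volume pages year title author abstract
^(.+), (\d+) \(.+\), ([\d|\-|\s]+),(\d\d\d\d)\..+<NewLine3>(.+)<NewLine4>(.+)<NewLine5>.+Abstract:<NewLine\d+>(.+)$
"""

# Invented tagged clipboard text, for illustration only.
SAMPLE = (
    "J. Hypothetical Chem., 12 (3), 100-110,2009. Extra text"
    "<NewLine3>A Made-Up Title"
    "<NewLine4>A. Author and B. Author"
    "<NewLine5>More text Abstract:"
    "<NewLine6>Placeholder abstract text."
)

pattern = parse_pattern(PATTERN_TEXT)
fields = extract(pattern, SAMPLE)
if fields is not None:
    for name, value in fields.items():
        print(f"{name}: {value}")
```

The order of names on the field line must follow the order of the capturing parentheses, since the two are paired by position.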
The Regular Expression Editor provides the basic skeleton and a set of
predefined suggestions. The regular expressions follow a Perl-like syntax, with
some slight differences and minor limitations. An introduction to
writing and editing regular expressions as used by cb2Bib can be
found in the Qt documentation at http://doc.trolltech.com/4.5/qregexp.html#introduction.
Keep the following in mind when creating and editing regular expressions:
- Switch the clipboard mode to 'Tagged Clipboard Data', using the clipboard
panel context menu.
- Extract the bibliographic reference manually. Some cb2Bib tags will appear on
the clipboard panel, indicating which fields are being extracted. Once done,
press Alt+I to open the regular expression editor. The editor contains the
four line edits that define a cb2Bib pattern, a copy of the clipboard panel, and
an information panel. The information panel displays possible issues and, once
everything is correct, the actual extracted fields. The clipboard panel highlights
the captures for the current regular expression and current input text.
- Patterns can be modified at any time by pressing Alt+E to edit the regular
expression file. Patterns are reloaded each time the automatic pattern recognition
is started, so a pattern can be edited and tested repeatedly.
- cb2Bib processes the list of regular expressions sequentially, in the order
found in the regular expression file. It stops at the first match for the current
input and uses it. Therefore, the order of the regular expressions is important.
To avoid clashes among similar patterns, consider sorting
them from the most restrictive to the least restrictive. As a rule of thumb, the
more captures a pattern has, the more restrictive it is.
- The patterns proposed by cb2Bib are general, and not necessarily the most
appropriate for a particular capture. For example, the tag
pages becomes
([\d|\-|\s]+), which accepts digits, hyphens, and spaces. It must
be modified accordingly for reference sources with, e.g., pages
written as Roman numerals.
- Avoid general patterns such as
(.+) whenever possible. There is a
risk that such a capture could include text intended for a later capture. This
is why the pattern proposed by cb2Bib sometimes fails to match the very input that
originated it. Use, whenever possible, specific cb2Bib anchors like
<NewLine1> instead of <NewLine\d+>. They
prevent (.+) captures from overextending.
- To debug a large regular expression, it might be useful to cut it back to its
first capturing parentheses. For instance, the above pattern would become
# cb2Bib 1.4.1 Pattern:
American Chemical Society Publications
article
journal
^(.+),
- Then, check whether anything is captured and whether it corresponds to
journal.
- Add the remaining captures and BibTeX fields in successive steps.
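The first-match rule and the advice to sort patterns from most to least restrictive can be illustrated with a small Python sketch. It uses Python's `re` module rather than the Qt engine used by cb2Bib, and the two patterns, the sample text, and the helper names `first_match` and `count_captures` are invented for this example.

```python
import re
from typing import List, Optional, Tuple

PatternEntry = Tuple[str, str]  # (description, regular expression)


def count_captures(regex: str) -> int:
    # Number of capturing groups, used here as a rough measure of restrictiveness.
    return re.compile(regex).groups


def first_match(patterns: List[PatternEntry],
                text: str) -> Optional[Tuple[str, Tuple[str, ...]]]:
    # Try the patterns in list order and stop at the first one that matches.
    for description, regex in patterns:
        match = re.search(regex, text)
        if match is not None:
            return description, match.groups()
    return None


# Two hypothetical patterns: the general one also matches the specific input.
general = ("general", r"^(.+), (\d+)")
specific = ("specific", r"^(.+), (\d+), ([\d\-\s]+),(\d\d\d\d)")

text = "J. Hypothetical Chem., 12, 100-110,2009"

# General pattern listed first: it wins, and its greedy (.+) overextends.
unsorted_result = first_match([general, specific], text)

# Sorted by number of captures, most restrictive first.
ordered = sorted([general, specific],
                 key=lambda entry: count_captures(entry[1]),
                 reverse=True)
sorted_result = first_match(ordered, text)

print(unsorted_result)  # ('general', ('J. Hypothetical Chem., 12', '100'))
print(sorted_result)    # ('specific', ('J. Hypothetical Chem.', '12', '100-110', '2009'))
```

With the general pattern first, the journal capture swallows the volume, which is the kind of clash that sorting by restrictiveness is meant to avoid.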
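The step-by-step debugging described in the last items of the list can be sketched as follows. This is a hedged illustration in Python, whose `re` module is close to, but not the same as, the Qt engine used by cb2Bib. The sample text and the helper `debug_steps` are invented for this example.

```python
import re
from typing import Dict, List, Optional, Tuple


def debug_steps(steps: List[Tuple[str, str]],
                text: str) -> List[Tuple[str, Optional[Dict[str, str]]]]:
    # For each (field list, regex) step, report what the regex captures so far.
    report = []
    for fields, regex in steps:
        match = re.search(regex, text)
        captured = dict(zip(fields.split(), match.groups())) if match else None
        report.append((regex, captured))
    return report


# Invented tagged clipboard text, for illustration only.
text = ("J. Hypothetical Chem., 12 (3), 100-110,2009. Extra text"
        "<NewLine3>A Made-Up Title")

# The pattern is rebuilt one piece at a time, starting from the first capture.
steps = [
    ("journal", r"^(.+),"),
    ("journal volume", r"^(.+), (\d+) \(.+\),"),
    ("journal volume pages year",
     r"^(.+), (\d+) \(.+\), ([\d|\-|\s]+),(\d\d\d\d)\."),
]

report = debug_steps(steps, text)
for regex, captured in report:
    print(regex)
    print("   ", captured)
```

In this sample, the first step captures everything up to the last comma, because (.+) is greedy. The journal capture only narrows to the intended text once the volume part is added, which is why it is worth checking the captures after each step.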