cb2Bib Regular Expression Editor

Once a reference has been extracted manually, the cb2Bib clipboard area contains the extraction tags and, possibly, other cb2Bib tags introduced during preprocessing (see Extracting Data from the Clipboard). From these, the RegExp Editor generates a tentative regular expression, or matching pattern, that can be used for automated extraction.

The cb2Bib matching patterns consist of four lines: a brief description, the reference type, an ordered list of captured fields, and the regular expression itself.

# cb2Bib 2.0.1 Pattern:
American Chemical Society Publications
journal volume pages year title author abstract
^(.+), (\d+) \(.+\), ([\d|\-|\s]+),(\d\d\d\d)\..+<NewLine3>(.+)<NewLine4>
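
The way the field line and the regular expression work together can be sketched outside cb2Bib. The following is a minimal illustration in Python, not cb2Bib code: Python's re module stands in for the Qt engine, the sample input text is invented for this example, and the helper name extract is hypothetical. Because the expression as printed above contains five capturing groups, only the first five field names are paired with captures.

```python
import re

# Field names and regular expression copied from the pattern above.
# Only the first five fields are used, because the expression as
# printed has five capturing groups.
FIELDS = "journal volume pages year title".split()
PATTERN = r"^(.+), (\d+) \(.+\), ([\d|\-|\s]+),(\d\d\d\d)\..+<NewLine3>(.+)<NewLine4>"

# Hypothetical tagged clipboard text, invented for this sketch.
SAMPLE = (
    "J. Am. Chem. Soc., 125 (1), 296-304,2003. Web Release"
    "<NewLine3>A Hypothetical Title<NewLine4>A. Author"
)


def extract(pattern, fields, text):
    """Pair each capturing group, in order, with the field at the same position."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return dict(zip(fields, match.groups()))


result = extract(PATTERN, FIELDS, SAMPLE)
print(result)
```

The point of the sketch is the positional pairing: the first capture goes to the first listed field, the second to the second, and so on, which is why the field list must be ordered.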

The Regular Expression Editor provides the basic skeleton and a set of predefined suggestions. The regular expressions follow a Perl-like syntax, with some slight differences and minor limitations. The basics of editing and working with regular expressions as used by cb2Bib are described in the Qt Documentation’s QRegExp Class page.

Remember when creating and editing regular expressions:

  • Switch the clipboard mode to ‘Tagged Clipboard Data’, using the clipboard panel context menu.
  • Extract the bibliographic reference manually. cb2Bib tags appear on the clipboard panel, indicating which fields are being extracted. Once done, type Alt+I to enter the regular expression editor. The editor contains the four line edits that define a cb2Bib pattern, a copy of the clipboard panel, and an information panel. The information panel displays possible issues and, once everything is correct, the fields actually extracted. The clipboard panel highlights the captures for the current regular expression and current input text.
  • Patterns can be modified at any time by typing Alt+E to edit the regular expression file. Patterns are reloaded each time the automatic pattern recognition is started. This permits editing and testing.
  • cb2Bib processes the regular expressions sequentially, in the order in which they appear in the regular expression file, and stops at the first one that matches the current input. The order of the patterns is therefore important: to avoid clashes among similar patterns, consider sorting them from the most restrictive to the least restrictive. As a rule of thumb, the more captures a pattern has, the more restrictive it is.
  • The patterns proposed by cb2Bib are general, and not necessarily the most appropriate for a particular capture. For example, the tag pages becomes ([\d|\-|\s]+), which accepts digits, hyphens, and spaces. It must be modified for reference sources in which, e.g., pages are written as Roman numerals.
  • Avoid general patterns such as (.+) whenever possible. There is a risk that such a capture will include text intended for a later capture. This is why the pattern proposed by cb2Bib sometimes fails to match the very input that originated it. Whenever possible, use specific cb2Bib anchors like <NewLine1> instead of <NewLine\d+>; they prevent (.+) captures from overextending.
  • To debug a large regular expression, it can be useful to cut it back to its first capturing parenthesis. For instance, the above pattern becomes
# cb2Bib 2.0.1 Pattern:
American Chemical Society Publications
journal
^(.+),
  • Then check whether anything is captured and whether it corresponds to journal.
  • Add the remaining captures and BibTeX fields in successive steps.
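
The incremental debugging steps in the list above can be sketched as follows. This is an illustrative Python snippet rather than cb2Bib itself: Python's re module stands in for the Qt engine, and the sample text, the STEPS list, and the helper run_steps are assumptions made for the example.

```python
import re

# Hypothetical tagged clipboard text, invented for this sketch.
SAMPLE = (
    "J. Am. Chem. Soc., 125 (1), 296-304,2003. Web Release"
    "<NewLine3>A Hypothetical Title<NewLine4>A. Author"
)

# The example pattern, grown one capturing group at a time.
STEPS = [
    (["journal"], r"^(.+),"),
    (["journal", "volume"], r"^(.+), (\d+) \(.+\),"),
    (["journal", "volume", "pages"], r"^(.+), (\d+) \(.+\), ([\d|\-|\s]+),"),
]


def run_steps(steps, text):
    """Apply each partial pattern and pair its captures with its field names."""
    results = []
    for fields, pattern in steps:
        match = re.search(pattern, text)
        captured = dict(zip(fields, match.groups())) if match else None
        results.append(captured)
    return results


results = run_steps(STEPS, SAMPLE)
for (fields, pattern), captured in zip(STEPS, results):
    print(pattern, "->", captured)
```

With this invented sample, the one-group step captures text well past the journal name, because (.+) is greedy and runs to the last comma; the later steps pin it down. Checking each step this way shows where a capture starts to overextend.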
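
The first-match rule from the list above, and why ordering from most to least restrictive matters, can be illustrated with a small Python sketch. Both patterns, their descriptions, the sample line, and the helper first_match are hypothetical and invented for this example; cb2Bib reads its patterns from the regular expression file and uses Qt's engine, not Python.

```python
import re

# Hypothetical patterns as (description, regular expression) pairs.
GENERAL = ("General", r"^(.+), (\d+)")
RESTRICTIVE = ("Restrictive", r"^(.+), (\d+) \(\d+\), ([\d\-\s]+),(\d\d\d\d)\.")

# Hypothetical input line, invented for this sketch.
SAMPLE = "J. Am. Chem. Soc., 125 (1), 296-304,2003."


def first_match(patterns, text):
    """Return the description and captures of the first pattern that matches."""
    for description, regex in patterns:
        match = re.search(regex, text)
        if match:
            return description, match.groups()
    return None


# Same input, two different orderings of the pattern list.
general_first = first_match([GENERAL, RESTRICTIVE], SAMPLE)
restrictive_first = first_match([RESTRICTIVE, GENERAL], SAMPLE)

print(general_first)
print(restrictive_first)
```

When the general pattern is listed first, it matches and the search stops, so the restrictive pattern is never tried and the captures are poor. Listing the restrictive pattern first gives it the chance to match before the general one is reached.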