# Release Note cb2Bib 2.0.1

To optimize searches on PDF contents, cb2Bib keeps a cache of the extracted text streams, which are compressed to reduce disk space and reading overhead. Nowadays, compressors with extremely high decompression speed are available. Two of them are LZSSE, for SSE4 capable architectures, and LZ4, for a broader range of CPUs. cb2Bib can now use either of them, with LZ4 set as the default compression library in cb2Bib builds. When upgrading to version 2.0.1, the first search on the document collection will recreate the cache, and this step will be noticeably slow.

Additionally, cb2Bib 2.0.1 includes original, optimized text matching code for AVX2 capable architectures that is used for search matching and BibTeX parsing. This code is not set in default builds and needs to be explicitly enabled at compilation time.

Finally, it is worth mentioning two further additions in version 2.0.1: stemmed context search (see Contextual Search for details), and contributed feedback on handling citations and extending cite commands to markdown syntax (see Predefined Placeholders).

# Release Note cb2Bib 2.0.0

Throughout the 1.9.x series, the cb2Bib sources were updated to the improved string processing capabilities of Qt5 and PCRE libraries. This update has brought a remarkable speedup for in-document searches and full search indexing.

Alternate normalization of journal titles and abbreviations, upgrading jsMath to MathJax, extending network queries syntax, and a PDF user manual are the additional enhancements in cb2Bib 2.0.0.

Back in version 0.3.3, cb2Bib introduced network queries to obtain the data for a citation. While convenient, queries to publishers’ websites were difficult to set up and fragile. Nowadays, fortunately, arXiv, PubMed and Crossref offer structured APIs. These interfaces give the end user an easy setup for completing bibliographic citations.

# Release Note cb2Bib 1.9.0

The cb2Bib sources have been ported to Qt5. To highlight this major update in library requirements, the version number is set to 1.9.0. Later, once the port has stabilized and new functionality related to Qt5 enhancements has been added, the version number will be set to 2.

At this point cb2Bib has exactly the same functionality as its preceding version 1.5.0. To build the program, however, only qmake and its related config procedure are available. The cmake scripts have not yet been ported.

Qt5 brings important enhancements related to regular expressions and string processing. Some careful updates to the cb2Bib sources are needed to fully benefit from them. They will be implemented through the 1.9.x series. By then, we expect a performance boost in full-text, regular expression based searches.

# Release Note cb2Bib 1.5.0

The version 1.5.0 sources include a patch for XPDF 3.0.4, the default tool to convert PDF documents to plain text. The modified code separates superscripts to avoid words being joined to reference numbers and author names being joined to affiliations’ glyphs. Interested users will need to download the package, apply the patch, and compile it.

Additionally, this version improves converted text postprocessing. This step normalizes character codes, reverts ligatures, restores when possible orphan diacritics and broken words, and undoes text hyphenation.
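Two of these steps, reverting ligatures and undoing hyphenation, can be illustrated with a minimal Python sketch. It is not cb2Bib's actual code: the function name `postprocess` is hypothetical, Unicode NFKC normalization stands in for the ligature handling, and the hyphenation rule is deliberately naive.

```python
import re
import unicodedata

def postprocess(text: str) -> str:
    """Illustrative cleanup of PDF-to-text output (hypothetical helper)."""
    # NFKC folds compatibility characters, so the ligature U+FB01 becomes "fi".
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words split by a hyphen at the end of a line.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    return text

sample = "The ef\ufb01cient extrac-\ntion of references"
print(postprocess(sample))
```

Note that such a simple rule also joins genuinely hyphenated compounds that happen to break at a line end, which is one reason a real postprocessing step needs more care.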

Conversion to text and postprocessing are important for reference extraction and for document indexing and searching. It is therefore recommended to delete cached document-to-text data to benefit from the present improvements. cb2Bib stores cached texts in *c2b files in a user-specified directory. Once these files are deleted, performing a search or initiating indexing will create an updated cache.

# Release Note cb2Bib 1.4.7

Approximate and context searches effectively locate our references of interest. As collections grow in size, and low-performance devices such as netbooks and tablets come into use, complete document searches become demanding. Besides, it is often not clear what to query for, and then a glossary of terms provides guidance. Often too, the interest lies in selecting the subset of documents that are similar to a given one.

Version 1.4.7 adds a pragmatic term or keyword extraction from the document contents. Accepted keywords are the substrings that appear at least twice in one document, appear in at least three documents, and conform to predefined part-of-speech (POS) sequences. Keyword extraction is performed either by clicking on Index Documents in the c2bciter desktop tray menu, or by typing cb2bib --index [bibdirname] in a shell. During extraction, the Part Of Speech (POS) Lexicon distribution file must be available and readable. On termination, indexing files are saved in the Search In Files Cache Directory. Simply copying this directory will synchronize keyword indexing to a second computer.
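The two frequency criteria can be sketched in a few lines of Python. This is only an illustration under stated assumptions: candidate terms are taken to be runs of one or two lowercase words, the POS filter is omitted entirely, and `candidate_keywords` is a hypothetical name, not part of cb2Bib.

```python
import re
from collections import Counter

def candidate_keywords(documents, max_words=2):
    """Terms found at least twice in some document and in at least three documents."""
    repeated_somewhere = set()
    document_frequency = Counter()
    for text in documents:
        words = re.findall(r"[a-z]+", text.lower())
        # Count every run of 1..max_words consecutive words in this document.
        counts = Counter(
            " ".join(words[i:i + n])
            for n in range(1, max_words + 1)
            for i in range(len(words) - n + 1)
        )
        document_frequency.update(counts.keys())
        repeated_somewhere.update(t for t, c in counts.items() if c >= 2)
    return sorted(t for t in repeated_somewhere if document_frequency[t] >= 3)

docs = [
    "density matrix methods; the density matrix is sparse",
    "a density matrix study",
    "on the density matrix",
]
print(candidate_keywords(docs))
```

In this toy collection, a word like "the" is rejected because it appears in only two documents and is never repeated within one of them.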

After refreshing the c2bciter module, pressing G displays the glossary of terms. On a reference, pressing K displays its list of keywords. Pressing R on a keyword lists the references related to that keyword. Pressing R on a reference lists similarly related references. Similarity is assessed based on keyword occurrences. The Left and Right keys provide previous and next navigation. Pressing V on either a reference keyword, or a keyword reference, visualizes the keyword excerpts from the reference’s document. To close the excerpt dialog, press the Esc or Left key.

# Release Note cb2Bib 1.4.0

The c2bciter module was introduced in version 1.3.0. Its name, as it was described, states its purpose of being “aimed to ease inserting citation IDs into documents”. In fact, it does have such functionality. It also has another, equally important one: it provides a very fast way to retrieve a given work from our personal collections.

Retrieving is accomplished through pre-sorted views of the references and filtering. Both views and filtering scale to (tens of) thousands of references. Usually, we recall a work from its publication year, a few words from its title, or (some of the letters of) one of its authors’ names. Often, what we remember is when a reference was included in our collection. Therefore, having such a chronological view was desirable.

The implementation of this sorted-by-inclusion-date view was not done during the 1.3.x series, but postponed to version 1.4.0, in part to indicate that some sort of ‘proprietary’ BibTeX tag might be required to specify inclusion timestamps. I have been reluctant throughout cb2Bib’s life span to introduce ‘cb2Bib-only’ tags in the BibTeX outputs. I believe that there is little gain, and that the cost, possibly, is breaking interoperability.

In the end, the choice was to not write any ‘timestamp’ tag in references. Instead, c2bciter checks the last modified date of the linked documents to build an approximate chronological view. The advantage is that all references, not just those from ‘version 1.4.0 or later’, are sorted. Furthermore, if a reference is later corrected, and the document metadata is updated too, the modification date is reflected in the view. The obvious drawback is that no such sorting can be done for references without an attached document.
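The idea of ordering references by the modification time of their linked documents can be sketched as follows. This is a minimal Python illustration, not c2bciter's implementation; the function name `chronological_view` and the cite IDs are hypothetical.

```python
import os
import tempfile

def chronological_view(references):
    """Order (cite_id, document_path) pairs by document mtime, newest first."""
    dated = [
        (os.path.getmtime(path), cite_id)
        for cite_id, path in references
        # References without an attached document cannot be placed in the view.
        if path and os.path.exists(path)
    ]
    return [cite_id for _, cite_id in sorted(dated, reverse=True)]

# Demo: two temporary files with fixed modification times, one reference without a file.
with tempfile.TemporaryDirectory() as folder:
    refs = []
    for cite_id, mtime in [("older2004", 1_000_000_000), ("newer2010", 1_300_000_000)]:
        path = os.path.join(folder, cite_id + ".pdf")
        open(path, "w").close()
        os.utime(path, (mtime, mtime))
        refs.append((cite_id, path))
    refs.append(("nodocument", None))
    view = chronological_view(refs)
print(view)
```

The reference without a document simply drops out of the view, which mirrors the limitation noted above.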

# Release Note cb2Bib 1.3.0

When version 0.2.7 came out, it was mentioned in Release Note cb2Bib 0.2.7 that cb2Bib ‘doesn’t have the means to automatically discern an author name from a department or street name’. I forgot to mention that I did not expect cb2Bib would ever have such a feature. Since the last Release Note cb2Bib 1.1.0, the cb2Bib internals have changed significantly. Some changes, such as heuristic recognition of interlaced authors and affiliations, are easily noticed. Other changes, however, are not, and need additional explanation.

From version 1.2.3, the switches --txt2bib and --doc2bib set cb2Bib to work in console mode. The non-exact nature of the extractions involved makes logging necessary. On Windows, the choice between graphic and console mode must be made not at run time, but when the application is built. So far, logging and globbing were missing there. This release adds the convenience wrapper c2bconsole. Typing c2bconsole --txt2bib i*.txt out.bib, for instance, will work as it does on the other platforms.

Lists of references are now sorted case- and diacritic-insensitively. For some languages such a choice is not the expected one, and some operating systems offer locale-aware collation. Given the usual inconsistencies and inaccuracies in references, this decision was taken to group together ‘Density Matrix’ with ‘Density-matrix’, and Møller with Moller, which, in a personal collection, most probably refer to the same concept and to the same person. Additionally, document-to-text converted strings are now cleaned of extraneous, non-textual symbols. Therefore, recreating cache files is recommended.
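A case- and diacritic-insensitive ordering of this kind can be approximated with a sort key, as in the Python sketch below. It is an assumption-laden illustration rather than cb2Bib's method: `sort_key` and `EXTRA_FOLDS` are hypothetical names, and the explicit table is needed because letters such as ø have no Unicode decomposition.

```python
import unicodedata

# Letters without a Unicode decomposition must be folded explicitly.
EXTRA_FOLDS = str.maketrans({"ø": "o", "Ø": "O"})

def sort_key(text: str) -> str:
    """Fold case and strip diacritics so that variant spellings sort together."""
    decomposed = unicodedata.normalize("NFKD", text.translate(EXTRA_FOLDS))
    # Drop the combining marks left behind by the decomposition.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

names = ["Møller", "moller", "Zewail", "Élan", "density-matrix", "Density Matrix"]
ordered = sorted(names, key=sort_key)
print(ordered)
```

With this key, Møller and Moller compare as equal, and the two spellings of the density matrix entry end up next to each other in the sorted list.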

Finally, this release introduces a new module, named c2bciter, and aimed to ease inserting citation IDs into documents. The module should ideally stay idle at the system tray, and be recalled as needed by pressing a global, desktop shortcut. This functionality, while desirable, and usual in dictionaries, is platform and desktop dependent. On KDE there are currently known issues when switching among virtual desktops.

# Release Note cb2Bib 1.1.0

A frequent request from cb2Bib users has been to expand the command line functionality. So far, little progress has been made in this regard. First, the addition of in-document searches and reading/inserting metadata were priorities. Second, cb2Bib is not the tool to interconvert among bibliographic formats. And third, cb2Bib is designed to involve the user in the search process, in the archiving and validation of the discovered works and references.

For the latter reason, and for not knowing a priori how such a tool would be designed, the cb2Bib internals had been interlaced with its graphical interface. At the time of version 0.7.0, when the graphical libraries changed and a major refactoring was required, the code started moving toward better modularization and structure. The current release pushes code organization further. As a result, it adds two new command line switches: --html-annote and --view-annote.

The new cb2Bib module is named after the BibTeX key ‘annote’. Annote is not for a ‘one reference annotation’ though. Instead, Annote is for short notes that interrelate several references. Annote takes a plain text note, with minimal or no markup, inserts the bibliographic citations, and converts it to an HTML page with links to the referenced documents.

From within cb2Bib, to write your notes, type Alt+A, enter a filename, either new or existing, and once in Annote, type E to launch your default text editor. For help, type F1. Each time you save the document the viewer will be updated. To display mathematical notations, install jsMath locally. And, remember, code refactoring introduces bugs.

# Release Note cb2Bib 1.0.0

Approximately four years ago the first cb2Bib was released. It included the possibility of easily linking a document to its bibliographic reference, in a handy way, by dragging the file to the main (at that time, single) panel. Now, in version 1.0.0, when a file is dropped, cb2Bib scans the document for metadata packets, and checks, in a rather experimental way, whether or not they contain relevant bibliographic information.

Publishers’ metadata might or might not be accurate. Some, for instance, assign the DOI to the key Title. cb2Bib extracts possibly relevant key-value pairs and adds them to the clipboard panel. Whenever the key-value pairs are found to be accurate, just pressing Alt+G imports them to the line edits. If keys with the prefix bibtex are found, their values are automatically imported.

# Release Note cb2Bib 0.6.0

cb2Bib uses the internal tags <<NewLine_n>> and <<Tab_n>> to ease the creation of regular expressions for reference extraction. New line and tabular codes from the input stream are substituted by these numbered tags. Numbering new lines and tabulars gives extra safety when writing down a regular expression. For example, suppose the title field is ‘anything’ between ‘<<NewLine1>>’ and ‘<<NewLine2>>’. We can then easily write ‘anything’ as ‘.+’ without the risk of overextending the caption to several ‘\n’ codes. On the other hand, one can still use <<NewLine\d>> if not interested in a specific numbering. All these internal tags are later removed, once cb2Bib postprocesses the entry fields.
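The substitution and the resulting safety can be seen in a small Python sketch. It is illustrative only: `tag_line_breaks` is a hypothetical helper, and it assumes that numbering starts at 1 and that new lines and tabs are counted separately.

```python
import re

def tag_line_breaks(text: str) -> str:
    """Replace each newline and tab with a numbered internal tag."""
    counters = {"\n": 0, "\t": 0}
    names = {"\n": "NewLine", "\t": "Tab"}

    def numbered(match):
        ch = match.group(0)
        counters[ch] += 1
        return f"<<{names[ch]}{counters[ch]}>>"

    return re.sub(r"[\n\t]", numbered, text)

clip = "Journal Name, 10, 1100\nA Title Here\nF. N. First\nAbstract text"
tagged = tag_line_breaks(clip)
# '.+' cannot run past <<NewLine2>>, because that numbered tag occurs only once.
title = re.search(r"<<NewLine1>>(.+)<<NewLine2>>", tagged).group(1)
print(title)
```

Because each numbered tag is unique in the tagged stream, the greedy `.+` is bounded on both sides and cannot swallow later lines.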

Until now, cb2Bib identified new lines by checking for ‘\n’ codes. I was unaware that this was a platform-dependent, as well as not completely accurate, way of detecting new lines. McKay Euan reported that <<NewLine_n>> tags were not appearing as expected in the MacOSX version. I later learned that MacOSX uses ‘\r’ codes, and that Windows uses ‘\r\n’, instead of ‘\n’ for new line encoding.

This release addresses this issue. The cb2Bib regular expressions are now expected to be more transferable among the different platforms. Extraction from plain text sources is expected to be completely platform independent. Extraction from web pages will still remain browser dependent. In fact, each browser adds its own peculiar interpretation of a given HTML source. For example, in Wiley webpages we see the sectioning header ‘Abstract’ in the source and in several browsers, but we see, and get, ‘ABSTRACT’ when using Konqueror.
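One common way to make new line detection platform independent is to treat ‘\r\n’, ‘\r’ and ‘\n’ alike. The Python sketch below shows the idea; it is an assumption about the general approach, not a transcription of the cb2Bib sources, and `normalize_newlines` is a hypothetical name.

```python
import re

def normalize_newlines(text: str) -> str:
    """Map every new line encoding to a single '\\n'."""
    # '\r\n' is listed first so a Windows line break is not counted twice.
    return re.sub(r"\r\n|\r|\n", "\n", text)

mixed = "Title\r\nAuthors\rJournal\nYear"
print(normalize_newlines(mixed).split("\n"))
```

Trying the two-character sequence before the single characters is what keeps each line break as exactly one new line, whatever the platform of origin.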

What we pay for this more uniform approach is, however, a break in compatibility with previous versions of cb2Bib. Unix/Linux users should not expect many differences, though. Only one of the nine regular expressions in the examples needed to be modified, and the two contributed regular expressions work perfectly without any change. Windows users will not see a duplication of <<NewLine_n>> tags. To update previous expressions, it should be enough to shift the <<NewLine_n>> numbering. And, of course, any working regular expression that does not use <<NewLine_n>> tags will still work in this new version.

Finally, I should mention that I do not have a MacOSX machine on which to test any of the cb2Bib releases on this particular platform. I am therefore assuming that these changes will fix the problem at hand. If not, please let me know. Also, let me know if release 0.6.0 ‘breaks’ your own expressions. I consider this release a sort of experimental or beta version, and the previous version 0.5.3 will remain available during this testing period.

# Release Note cb2Bib 0.5.0

Two issues have appeared regarding cb2Bib installation and deployment on MacOSX platforms.

First, if you encounter a ‘nothing to install’-error during installation on MacOSX 10.4.x using the cb2Bib binary installer available at naranja.umh.es/~atg/, please delete the cb2bib-receipts from /Library/Receipts and then rerun the installer. See also M. Bongard’s clarifying note ‘MACOSX 10.4.X “NOTHING TO INSTALL”-ERROR’ for details.

Second, and this also applies to other cb2Bib platform versions, if PDFImport issues the error message ‘Failed to call some_format_to_text’ tool, make sure such a tool is installed and available. Go to Configure->PDFImport, click on the ‘Select External Convert Tool’ button, and navigate to set its full path. Since version 0.5.0, the default full path for MacOSX is already set, pointing to /usr/local/bin/pdftotext.

# Release Note cb2Bib 0.4.1

Qt/KDE applications emit notifications whenever they change the clipboard contents. cb2Bib uses these notifications to automatically start its ‘clipboard to BibTeX’ processing. Other applications, however, do not emit such notifications. Since version 0.2.1 (see Release Note cb2Bib 0.2.1), cb2Bib has checked the clipboard periodically. This checking was later disabled by default, and a few lines of code had to be uncommented to activate it. Without such checking, cb2Bib appears unresponsive when selecting/copying from, e.g., acroread or Mozilla. This release includes the class clipboardpoll written by L. Lunak for KDE’s Klipper. Checking is performed in a very optimized way and is enabled by default. If you experience problems with this feature, or if the required X11 headers aren’t available, consider disabling it by typing ./configure --disable-cbpoll prior to compilation. This will disable checking completely. If the naive, old checking is preferred, uncomment the four usual lines, run ./configure --disable-cbpoll, and compile.

# Release Note cb2Bib 0.3.5

Releases 0.3.3 and 0.3.4 brought querying functionality to cb2Bib. In essence, cb2Bib was rearranged to accommodate copying and opening of network files. Queries were then implemented as user customizable HTML posts to journal databases. In addition, these arrangements permitted defining convenience, dynamic bookmarks that were placed at the cb2Bib’s ‘About’ panel.

cb2Bib contains three viewing panels: ‘About’, ‘Clipboard’ and ‘View BibTeX’, with the ‘Clipboard’ panel being the main working area. To keep cb2Bib simple, only two buttons, ‘About’ and ‘View BibTeX’, are set to navigate through the panels. The ‘About’ and ‘View BibTeX’ buttons are toggle buttons for momentarily displaying their corresponding panels. Guidance was so far provided by enabling/disabling the buttons.

After the introduction of bookmarks, the ‘About’ panel has greatly increased its usefulness. Button functionality has now been slightly redesigned to avoid as many keystrokes and mouse clicks as possible. The buttons remain switchable, but they no longer disable the other buttons. The user is guided by icon changes instead. Hopefully these changes will not be confusing or counterintuitive.

Bookmarks and querying functionality are customizable through the netqinf.txt file, which is editable by pressing the Alt+B keys. Supported queries are of the form ‘Journal-Volume-First Page’. cb2Bib parses netqinf.txt each time a query is performed. It looks for journal=Full_Name|[code] to obtain the required information for a specific journal. Empty ‘journal=’ entries mean ‘any journal’. New in this release, cb2Bib will test all possible queries for a given journal instead of giving up at the first No article found message. The query process stops at the first successful hit or, otherwise, once netqinf.txt has been parsed completely (in the same way as automatic pattern recognition works). This permits querying multiple, and incomplete, journal databases.
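The "try every matching entry, stop at the first hit" behaviour can be sketched in Python as below. Everything in it is hypothetical: the entries are assumed to be already parsed into (journal, source) pairs, and `fake_fetch` merely stands in for a real network query, so nothing here reflects the actual netqinf.txt syntax beyond the empty journal meaning ‘any journal’.

```python
def first_successful_query(entries, journal, fetch):
    """Try every matching entry in file order; return the first hit, or None."""
    for entry_journal, source in entries:
        # An empty journal name stands for 'any journal'.
        if entry_journal and entry_journal != journal:
            continue
        result = fetch(source, journal)
        if result is not None:
            return source, result
    return None

# Hypothetical parsed entries, in the order they appear in the file.
entries = [("J. Am. Chem. Soc.", "publisher-site"), ("", "PubMed")]

def fake_fetch(source, journal):
    # Stand-in for a network query: only the 'PubMed' source finds the article.
    return "reference data" if source == "PubMed" else None

hit = first_successful_query(entries, "J. Am. Chem. Soc.", fake_fetch)
print(hit)
```

Since the first matching entry fails here, the loop carries on to the next one instead of giving up, and the order of the entries decides which source is tried first.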

Users should order the netqinf.txt file in whatever way is most convenient. E.g., put PubMed in front of JACS if automatic extraction is desired, or JACS in front of PubMed, extracting from the journal web page, if accented characters in author names are wanted.

So far, this querying functionality is still tagged as experimental. Both the querying itself and its syntax seem quite successful. However, downloading of PDF files, on a Windows OS with a T1 network, was found to freeze once progress reaches 30-50%. Any feedback on this issue will be greatly appreciated. Also, information on kfmclient equivalent tools for non-KDE desktops would be worth including in the cb2Bib documentation.

# Release Note cb2Bib 0.3.0

cb2Bib considers the whole set of authors as an author-string pattern. This string is later postprocessed, without requirements on the actual number of authors it may contain, or on how the names are written. Once considered author-string patterns, the extraction of bibliographic references by means of regular expressions becomes relatively simple.

There are situations, however, where several author-strings are required. The following box shows one of these cases. Authors are grouped according to their affiliations. Selecting from ‘F. N. First’ to ‘F. N. Fifth’ would include ‘First Affiliation’ within the author string. Cleaning up whatever wording ‘First Affiliation’ may contain is a rather ill-posed problem. Instead, cb2Bib includes an Add Authors option. The way to operate is then to select ‘F. N. First, F. N. Second, F. N. Third’ and choose Authors and, right after, to select ‘F. N. Fourth and F. N. Fifth’ and choose Add Authors.

                                                 Journal Name, 10, 1100-1105, 2004

    AN EXAMPLE WITH MULTIPLE AUTHOR SETS

    F. N. First, F. N. Second, F. N. Third
    First Affiliation

    F. N. Fourth and F. N. Fifth
    Second Affiliation

    Abstract: Select from "Journal Name ..." to "... second author set.". The 'F.
    N. First, F. N. Second, F. N. Third' author string is automatically processed
    as one author set, while 'F. N. Fourth and F. N. Fifth' is processed as
    another, second author set.


At this point in the manual extraction, the user was faced with a red <<moreauthors>> tag in the cb2Bib clipboard panel. The <<moreauthors>> tag was intended to warn the user about the fact that cb2Bib would not be able to consider the resulting extraction pattern as a valid, general regular expression. Usual regular expressions are built up from an a priori known level of nesting. In these cases, however, the level of nesting is variable. It depends on the number of different affiliations occurring in a particular reference.

So far the <<moreauthors>> tag has become a true FAQ about cb2Bib and a source of much confusion. There is no real need, however, for such a user warning. The <<moreauthors>> tag has therefore been removed and cb2Bib has taken a step further, to its 0.3.0 version.

The cb2Bib 0.3.0 manual extraction works as usual. By clicking Authors, the Authors edit line is reset and the selection contents are moved there. Alternatively, if Add Authors is clicked, the selection contents are added to the author field. In this version, however, both operations are tagged as <<author>> (singular form, as it is the BibTeX keyword for Authors). The generated extraction pattern can now contain any number of <<author>> fields.

In automatic mode, cb2Bib now adds all author captions to Authors. In this way, cb2Bib can treat interlaced author-affiliation cases. Obviously, users needing such extractions will have to write particular regular expressions for cases with one set of authors, for two sets, and so on. Even though it is not rare for a work to have a hundred authors, it would be quite improbable for them to be working at that many different institutions. Therefore, few regular expressions should actually be required in practice. Although not elegant, this removes what was a cb2Bib limitation and broadens its use when extracting from PDF sources. Remember to sort these regular expressions in decreasing order, since at present cb2Bib stops at the first hit. Also, consider Any Pattern to get rid of the actual affiliation contents, as you might not want to extract authors’ addresses.
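Why the ordering matters can be seen in a short Python sketch. It uses ordinary named groups instead of cb2Bib's <<author>> tags, and the two patterns, the group names `a1` and `a2`, and the function `extract_authors` are all assumptions made for illustration.

```python
import re

# The pattern with more author sets comes first, since matching stops at the first hit.
PATTERNS = [
    re.compile(r"^(?P<a1>.+)\n.+\n\n(?P<a2>.+)\n.+\n\nAbstract", re.M),
    re.compile(r"^(?P<a1>.+)\n.+\n\nAbstract", re.M),
]

def extract_authors(text):
    """Return the author strings captured by the first pattern that matches."""
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return [match.group(name) for name in sorted(pattern.groupindex)]
    return []

text = (
    "F. N. First, F. N. Second, F. N. Third\nFirst Affiliation\n\n"
    "F. N. Fourth and F. N. Fifth\nSecond Affiliation\n\n"
    "Abstract: ..."
)
authors = extract_authors(text)
print(authors)
```

If the one-set pattern were tried first, it would also match this text, but it would capture only the second author set and silently lose the first one.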

# Release Note cb2Bib 0.2.7

The cb2Bib 0.2.7 release introduces multiple retrieving from PDF files. PDF documents are becoming more and more widely used, not only to transfer and print articles, but also to replace personal paper files and classifiers with their electronic equivalents.

cb2Bib is intended to help update personal databases of papers. It is a tool focused on what is left behind in database retrieving. Email alerts, references passed between colleagues, and PDF sharing are example situations. Although in electronic format, such sources are not standardized, or not used widely enough, to permit the usual import filters of reference managers. cb2Bib is designed around direct user intervention, either by creating one’s own useful filters or through simple copy-paste assistance when typing by hand.

Hopefully someday cb2Bib will be able to take that old directory, with perhaps a few hundred papers, and automatically index the references and rename the files by author, in a consistent manner. The required mechanism is already there, in this version. But I guess that this new feature will expose some present limitations in cb2Bib. For instance, most printed and PDF papers interlace author names and affiliations. cb2Bib doesn’t have the means to automatically discern an author name from a department or street name. So far one needs to manually use the ‘Add to Authors’ feature to deal with these situations. Also, the management of regular expressions needs developing, especially considering the wide variety of design patterns in publications.

In summary, the current version is already useful for classifying and extracting the references of that couple of papers that someone sends right before a work is submitted. Completely unsupervised extraction is still far away, however.

# Release Note cb2Bib 0.2.1

The cb2Bib mechanism ‘select-and-catch’ failed in some cases. Acrobat and Mozilla selections were not always notified to cb2Bib. Indeed, this ‘window manager - application’ connection seems to be broken on a KDE 3.3.0 Qt 3.3.3 system.

cb2Bib 0.2.1 continues to listen to system clipboard change notifications, whenever they are received and whenever cb2Bib is in connected mode. Additionally, cb2Bib 0.2.1 periodically checks for changes in the system clipboard. Checks are performed approximately every second. This permits cb2Bib to work as usual, although one could experience delays of 1-2 seconds on systems where the automatic notification is broken.
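A periodic check of this kind amounts to comparing the current clipboard contents with the last ones seen. The Python sketch below shows the loop in isolation; it is not the code in c2bclipboard.cpp, and `poll_clipboard`, `read_clipboard` and `on_change` are hypothetical names, with a scripted sequence standing in for a real clipboard.

```python
import time

def poll_clipboard(read_clipboard, on_change, interval_s=1.0, max_checks=None):
    """Call on_change(text) whenever read_clipboard() returns new content."""
    last = read_clipboard()
    checks = 0
    while max_checks is None or checks < max_checks:
        time.sleep(interval_s)
        current = read_clipboard()
        if current != last:
            last = current
            on_change(current)
        checks += 1

# Demo with a scripted 'clipboard' and a zero interval so that it finishes at once.
contents = iter(["old", "old", "new reference", "new reference"])
seen = []
poll_clipboard(lambda: next(contents), seen.append, interval_s=0.0, max_checks=3)
print(seen)
```

With a one second interval, a change made just after a check is noticed about a second later, which is consistent with the short delays mentioned above.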

If the ‘select-and-catch’ functionality appears ‘sticky’, which may happen when text is selected in non-KDE applications, check the source file c2bclipboard.cpp, look for 'Setting timer', and set the variable interval to 1000. This is the interval, in ms, that cb2Bib uses to check for clipboard changes.