bibextract?
bibextract
is a small program that I have written. Given a set of LaTeX
sources and a BibTeX file (that acts as a database), it generates a BibTeX file
containing just the ones that are used in those sources, provided they are also
present in the database.
Now if all that is going over your head, in brief, LaTeX is a typesetting system for documents1 and BibTeX is a format used for reference management, that is used along with LaTeX.
Why?
I use LaTeX a lot. When using LaTeX, it is common to maintain a set of bibliography entries as BibTeX files for citation. Since I use a monolithic BibTeX file with all the entries I will ever need, I run into problems when I need to share my LaTeX project with others. Rather than ship a ~2000 line BibTeX file containing mostly irrelevant entries, it is much better to ship a BibTeX file only with entries that are used in that project. Now, it can be easily done for small articles, but for research papers and theses, it will be quite the task. Manually, one would have to look at the generated bibliography in the document output and get the required entries from the database BibTeX. This process can be easily automated given the structured nature of LaTeX.
How?
This is accomplished in two stages:
- Extract
- Search and write
Extraction
We look to the ever-reliable coreutils
, and Perl-Compatible Regular
Expressions (PCREs) for help here. grep
can be used to search for \cite{}
in the LaTeX sources. sed
can then be used to extract the BibTeX entry name.
The resulting one-liner is:
# ext.sh f1.tex
grep -Po '\\cite\{.*?\}' $1 | sed -E 's/\\cite\{(.*)\}/\1/g'
grep -Po '\\cite\{.*?\}' $1
searches for the text\cite{}
in the file provided as the first argument to the script ($1
). E.g. if the file contains both\cite{123}
and\cite{abc}
, thegrep
command will return:\cite{123} \cite{abc}
Note the lazy matching that is being done using
?
, causing the*
quantifier to match as few instances as possible. This is important as it helps deal with edge cases like this line:\cite{abc}. Foo \textbf{baz}
. If lazy matching was not used,grep
would return the entire line, instead of just\cite{abc}
(why?).sed -E 's/\\cite\{(.*)\}/\1/g'
processes this output by removing the unwanted\cite
and the brackets, leaving us with just what we need: the BibTeX entry names. For the discussed example, this command will return:123 abc
I learnt something new about
sed
and sub-expressions while doing this. When usingsed
, you can use\1
through\9
as references to the corresponding matching sub-expression. In our case, the sub-expression is the part within the braces in\cite{}
, indicated by the use of()
in the regular expression.
Search and write
The rest of the work is done using Python, because I am not writing a
BibTeX parser in bash
🤣.
I used the
python-bibtexparser
library for this. The API is straightforward, and offers enough control. It
uses dictionaries to store the parsed BibTeX. Once the database BibTeX has been
read, it can be queried, and matches can be stored in a separate dictionary for
writing. In order to prevent duplicate entries, I used Python sets
instead of
dictionaries
directly.
Please contribute!
This project is completely free and open-source! Hacktoberfest 2022 is live, and I invite you to take this opportunity to contribute to this project. So head over to GitHub and get started!
Happy hacking!