So I’ve been writing a bunch of R using the fantastic RTextTools and topicmodels packages for LDA. The math in those packages is fantastic, the string manipulation in R is not great. So, what’s a guy to do, but ask a stackoverflow question regarding how to do the task at hand.
The task is:
Given a string input, and two vectors (start and end), such that start[i] and end[i] contain a pair of offsets to *remove* from input (thereby shortening the length of the string.)
In Perl, the answer is just “loop, call substr, and go on to the next problem.” In R, it’s not so easy, we ended up with:
Which actually has to take the slices you want, instead compute the parts you want to keep, and then iterates through the list. It’s not a bad implementation, but it’s incredibly slow and memory inefficient (it’s keeping N copies of the document in memories, where N is the total number of slice pairs.) After talking to my R guru, I learned about the fantastic Rcpp package. That same task could be coded in “modern” C++ (using std) like:
Which can then be called in R like:
The output is identical, and the performance in the Rcpp version is literally 25x better. Here are two graphs showing the different performance profiles*. The nativeR version, the call stack is *12* layers deep of R code. 10,000 iterations of the R version took over 40 seconds to run.
By comparison, the C++ version, for the same 10,000 iterations it only took 1.5 seconds! Mind you, the profiler hides the actual C++ call graph, lord only knows what is happening underneath the hood that is std and the stl. However, it’s super fast, and I don’t really have to care too much.
As a caveat, there *was* difficulty in this original task, as the string encoding was not passed real cleanly between R and C++, the length() values that were returned in R and in C++ were completely differently, but that can be chalked up to nothing ever working right in unicode ever. I ended up “fixing” it (for my purposes, I would never suggest that this is the actual way to handle this by following this stackoverflow answer.)
The good news is that this really saves a lot of runtime and memory on the document preparation component of my LDA task, and it has opened up my eyes to powerful ways of dealing with performance critical R code. For my next trick, I can’t wait to tie Dan Bloomberg’s fantastic Leptonica library. Being able to do morphological operations on an image in C and have the resultant image directly exported into R as a numeric matrix is going to make a whole bunch of tasks possible that were previously simply computationally unfeasible.
* As an aside, I don’t think I’ve ever learned more about a programming language and environment than when learning about how to do performance profiling. I don’t know if that’s just me though. Those cheesy graph’s were made with Hadley Wickham‘s fantastic profr package.