So I’ve been writing a bunch of R using the fantastic RTextTools and topicmodels packages for LDA. The math in those packages is great; R’s string manipulation is not. So, what’s a guy to do but ask a stackoverflow question about how to do the task at hand?
The task is:
Given a string input and two vectors (start and end), such that start[i] and end[i] contain a pair of 1-based offsets to *remove* from input (thereby shortening the string).
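To make that concrete, here’s a tiny made-up example (hypothetical data, not from my actual corpus):

s <- "hello cruel world"
start <- c(7)    # offsets are 1-based and inclusive
end <- c(12)
# removing characters 7..12 ("cruel ") should leave "hello world"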
In Perl, the answer is just “loop, call substr, and go on to the next problem.” In R, it’s not so easy; we ended up with:
s <- "some source text…" | |
cutpoints <-data.frame(start=text$start, end=text$end) | |
keeps <- data.frame(start=c(1, cutpoints$end+1), end=c(cutpoints$start–1, nchar(s))) | |
pieces <- apply(keeps, 1, function(x) substr(s, x[1], x[2])) | |
sliced_string <- paste(pieces, collapse="") |
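Concretely, with made-up offsets: say nchar(s) is 20 and there are two cut pairs.

cutpoints <- data.frame(start=c(5, 12), end=c(8, 15))
# keeps works out to start = c(1, 9, 16), end = c(4, 11, 20):
# keep 1..4, cut 5..8, keep 9..11, cut 12..15, keep 16..20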
So instead of taking out the slices you want directly, this has to compute the parts you want to keep and then iterate through that list. It’s not a bad implementation, but it’s incredibly slow and memory inefficient (it keeps N copies of the document in memory, where N is the total number of slice pairs). After talking to my R guru, I learned about the fantastic Rcpp package. The same task can be coded in “modern” C++ (using std) like:
#include <Rcpp.h>
#include <iostream>
using namespace Rcpp;
using namespace std;

// [[Rcpp::export]]
std::string string_slicer( std::string input, std::vector< int > start, std::vector< int > end) {
  std::string working_copy = input;
  // walk the cut list backwards so earlier offsets stay valid after each erase;
  // size_type is unsigned, so the loop ends when i wraps around past zero
  for(std::vector<int>::size_type i = start.size() - 1; i != (std::vector<int>::size_type) -1; i--) {
    working_copy.erase(start[i]-1, (end[i] - start[i])+1);  // R offsets are 1-based and inclusive
  }
  return working_copy;
}
Which can then be called in R like:
library(Rcpp)
sourceCpp('../src/string_slicer.cpp')
sliced_string <- string_slicer(s, cutpoints$start, cutpoints$end)
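A quick sanity check (my snippet, not from the original script) confirms the two versions agree:

# pure-R result vs. Rcpp result
stopifnot(identical(paste(pieces, collapse=""),
                    string_slicer(s, cutpoints$start, cutpoints$end)))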
The output is identical, and the performance of the Rcpp version is literally 25x better. Here are two graphs showing the different performance profiles*. In the native R version, the call stack is *12* layers deep in R code, and 10,000 iterations took over 40 seconds to run.
By comparison, the C++ version took only 1.5 seconds for the same 10,000 iterations! Mind you, the profiler hides the actual C++ call graph, so lord only knows what is happening under the hood in std and the STL. However, it’s super fast, and I don’t really have to care too much.
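If you want to reproduce the comparison, a minimal timing harness (my sketch, using base R’s system.time rather than a real profiler) looks like this:

# 10,000 iterations of the pure-R version
system.time(for (i in 1:10000) {
  pieces <- apply(keeps, 1, function(x) substr(s, x[1], x[2]))
  sliced_string <- paste(pieces, collapse="")
})
# 10,000 iterations of the Rcpp version
system.time(for (i in 1:10000) sliced_string <- string_slicer(s, cutpoints$start, cutpoints$end))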
As a caveat, there *was* some difficulty in the original task: the string encoding was not passed cleanly between R and C++, and the string lengths reported in R and in C++ were completely different, but that can be chalked up to nothing ever working right in Unicode, ever. I ended up “fixing” it (for my purposes; I would never suggest this is the proper way to handle it) by following this stackoverflow answer.
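The length mismatch itself is easy to see from the R side: a std::string counts bytes, while nchar() defaults to counting characters, and for multi-byte UTF-8 text those disagree. A toy illustration (not my actual fix):

x <- "naïve"              # "ï" is two bytes in UTF-8
nchar(x)                   # 5 characters
nchar(x, type="bytes")     # 6 bytes -- the length a std::string would report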
The good news is that this saves a lot of runtime and memory in the document-preparation component of my LDA task, and it has opened my eyes to powerful ways of dealing with performance-critical R code. For my next trick, I can’t wait to tie in Dan Bloomberg’s fantastic Leptonica library. Being able to do morphological operations on an image in C and have the resulting image exported directly into R as a numeric matrix is going to make a whole bunch of tasks possible that were previously computationally infeasible.
* As an aside, I don’t think I’ve ever learned more about a programming language and environment than when learning how to do performance profiling. I don’t know if that’s just me, though. Those cheesy graphs were made with Hadley Wickham’s fantastic profr package.