Making a Ridiculously Fast™ API Client
I recently had the pleasure of publishing the R package
{arcgisgeocode}. It is
an R interface to the ArcGIS World
Geocoder.
You could say it is the “official” Esri geocoding R package.
To my knowledge, it is the fastest geocoding library available in the
R ecosystem. The ArcGIS World Geocoder is made available through
{tidygeocoder} as well
as {arcgeocoder}.
{arcgisgeocode} provides the full functionality of the World Geocoder,
including bulk geocoding, which the other two packages do not.
They instead provide an interface to the
/findAddressCandidates
and
/reverseGeocode
API endpoints. The former performs single-address forward geocoding
and the latter performs reverse geocoding.
{arcgisgeocode} is ~17x faster when performing single-address
geocoding and ~40x faster when performing reverse geocoding when
compared to the community counterparts. There are two primary reasons
why this is.
The prolific Kyle Barron responded to one of my tweets a few months ago.
isn't geocoding always bottlenecked by the server?
— Kyle Barron @kylebarron@mapstodon.space (@kylebarron2) March 29, 2024
This statement is true in an absolute sense. But if it is only the
server that is the bottleneck, why does {arcgisgeocode} outperform
two other packages calling the exact same API endpoints?
The reasons are twofold.
JSON parsing is slow
The first is that both tidygeocoder and arcgeocoder rely on
{jsonlite}
to both encode and parse JSON. I have said it many times before and
I'll say it again: jsonlite was a revolutionary R package, but it has
proven to be slow.
The way these API requests work is that we need to craft JSON from R objects, inject it into our API request, and then process the JSON that we get back from the server.
Encoding R objects as text strings is slow. Reading text and converting it back into R objects is also slow.
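To make this concrete, here is a minimal sketch of that round trip using jsonlite. The field names and records are made up for illustration, not the World Geocoder's exact schema:

```r
library(jsonlite)

# A tiny, made-up batch of addresses to geocode
addresses <- data.frame(
  OBJECTID   = 1:2,
  SingleLine = c(
    "380 New York St, Redlands, CA",
    "1600 Pennsylvania Ave NW, Washington, DC"
  )
)

# 1. Encode R objects as JSON text for the request body
body <- toJSON(list(records = addresses), auto_unbox = TRUE)

# 2. ...the request is sent and the server responds with more JSON text...

# 3. Parse that text back into R objects
parsed <- fromJSON(body)
```

Steps 1 and 3 are both text conversions, and every single request pays that tax.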
This is tangentially why Apache Arrow is so amazing. It uses the same memory layout regardless of where you are. If we were using Arrow arrays and the API received and sent Arrow IPC, we would be able to serialize and deserialize much faster!
Handling JSON with serde
serde_json is a Rust crate that handles
serialization and deserialization of Rust structs. It takes the
guesswork out of encoding and decoding JSON responses because it
requires that we specify what the JSON will look like. {arcgisgeocode}
uses serde_json to perform JSON serialization and deserialization.
For example, the package defines structs that mirror the JSON the service returns. A simplified sketch of that kind of struct definition (the real ones carry more fields) looks like this:
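```rust
use serde::{Deserialize, Serialize};

// Illustrative only: a pared-down shape for a /findAddressCandidates
// response. The actual structs in {arcgisgeocode} have more fields.
#[derive(Debug, Serialize, Deserialize)]
struct FindCandidatesResponse {
    candidates: Vec<Candidate>,
}

#[derive(Debug, Serialize, Deserialize)]
struct Candidate {
    address: String,
    score: f64,
    location: Point,
}

#[derive(Debug, Serialize, Deserialize)]
struct Point {
    x: f64,
    y: f64,
}

fn parse_response(body: &str) -> serde_json::Result<FindCandidatesResponse> {
    // Deserialize straight into the typed structs above. If the payload
    // doesn't match the declared shape, we get an error instead of
    // silently malformed data.
    serde_json::from_str(body)
}
```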
These struct definitions and serde_json, coupled with the
extendr library, mean that I can
process and create JSON extremely fast!
Using a request pool
Both {tidygeocoder} and {arcgeocoder} use
{httr}, whereas {arcgisgeocode} uses
{httr2}. There may be speed-ups inherent
in switching.
But the primary difference is that in {arcgisgeocode}, we use
req_perform_parallel()
with a small connection pool. This allows multiple workers to
handle requests concurrently, which means less time is spent
waiting for each request to be handled and then processed by our R
code.
Note that with great power comes great responsibility. Using
req_perform_parallel() without care may lead to accidentally
committing a DDoS
attack.
For that reason we use a conservative number of workers.
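As a sketch of what this looks like with httr2 (the addresses and worker count here are illustrative, not the package's internal values):

```r
library(httr2)

# Illustrative inputs; {arcgisgeocode} builds its requests differently
addresses <- c(
  "380 New York St, Redlands, CA",
  "1600 Pennsylvania Ave NW, Washington, DC"
)

reqs <- lapply(addresses, function(address) {
  request("https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer/findAddressCandidates") |>
    req_url_query(SingleLine = address, f = "json")
})

# A small, conservative pool: recent httr2 releases cap concurrency with
# `max_active`; older releases take `pool = curl::new_pool()` instead.
resps <- req_perform_parallel(reqs, max_active = 5, on_error = "continue")
```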
Closing notes
While Kyle is correct in the absolute sense that performance ultimately comes down to the geocoding service, it is also true that the clients we write to call these services may add performance overhead of their own.
To improve performance, I would recommend identifying the slowest part and making it faster. In general, when it comes to API clients, this is almost always the (de)serialization and the request handling.
I don’t expect everyone to learn how to write Rust. But you can make informed decisions about what libraries you are using.
Learn how to parse JSON with Rust
If you are using jsonlite and you care about performance, stop. I
strongly recommend using RcppSimdJson (for parsing only), yyjson (for
both), or jsonify, in that order. You will find your code to be much
faster.
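For instance, here is a quick sketch of the drop-in swaps (using {RcppSimdJson}, and {yyjsonr}, the R package that wraps the yyjson C library):

```r
json <- '{"candidates":[{"address":"380 New York St","score":100}]}'

# RcppSimdJson: parsing only, backed by the simdjson C++ library
parsed <- RcppSimdJson::fparse(json)

# yyjsonr handles both directions: parsing and encoding
parsed  <- yyjsonr::read_json_str(json)
encoded <- yyjsonr::write_json_str(parsed)
```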
Next, if you are making multiple requests to the same endpoint,
consider using a small worker pool via req_perform_parallel(), as
sketched above, and watch how the speed improves.