Reduce the memory impact of reference data
The documentation currently states:
With the current implementation, each join operation with reference data keeps a copy of the reference data in memory, even if you join with the same reference data multiple times.
This should not be the case. We aim to process more than 100k events per second, and up to 20% of those events may be held in memory for at least half an hour. The events are joined to a reference data blob of ~3MB and then, after the long window, joined again to another reference file of ~2GB. Keeping a separate copy of the reference data for each join is wasteful and could cause jobs to fail unnecessarily.
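To give a rough sense of the scale, here is a back-of-the-envelope sketch in Python. It uses only the figures quoted above. The number of joins per reference data set (`JOIN_COUNT`) is a hypothetical value chosen for illustration, and 1GB is taken as 1024MB. The half-hour window is a lower bound, so the event count is a minimum as well.

```python
# Figures taken from the request above.
EVENTS_PER_SECOND = 100_000
HELD_PERCENT = 20            # up to 20% of events are held in memory
WINDOW_SECONDS = 30 * 60     # held for at least half an hour

SMALL_REF_MB = 3             # ~3MB reference blob
LARGE_REF_MB = 2 * 1024      # ~2GB reference file, assuming 1GB = 1024MB

# Hypothetical: how many times a query joins the same reference data.
JOIN_COUNT = 3


def events_held(rate_per_second: int, held_percent: int, seconds: int) -> int:
    """Events resident in memory at once if held_percent of the stream is kept for `seconds`."""
    return rate_per_second * held_percent // 100 * seconds


def reference_memory_mb(size_mb: int, join_count: int, shared: bool) -> int:
    """Memory used by reference data: one copy if shared, otherwise one copy per join."""
    copies = 1 if shared else join_count
    return size_mb * copies


held = events_held(EVENTS_PER_SECOND, HELD_PERCENT, WINDOW_SECONDS)
per_join = reference_memory_mb(LARGE_REF_MB, JOIN_COUNT, shared=False)
shared = reference_memory_mb(LARGE_REF_MB, JOIN_COUNT, shared=True)

print(f"events held in memory: {held:,}")
print(f"large reference data, one copy per join: {per_join} MB")
print(f"large reference data, shared copy: {shared} MB")
```

Under these assumptions the job already holds 36 million events in memory, so every additional copy of the ~2GB file competes directly with the event buffer. The ~3MB blob is negligible by comparison; the per-join copy matters mainly for the large file.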