Unpredictable data is stale, inconsistent, or simply invalid data produced by a data store. For example, the design of a Cache Augmented SQL (CASQL) system may incur dirty reads or suffer from race conditions that leave the cache and the database in an inconsistent state, and a data store may employ an eventual consistency technique that produces stale or inconsistent data for some time. The requirements of an application dictate whether these techniques are appropriate or not. A key question is how much unpredictable data a data store produces for interactive social networking actions. BG quantifies this important metric.
Conceptually, BG is aware of the initial state of resources in the database (by creating them using deterministic functions) and the change in value applied by each update operation. There is a finite number of ways for a read of a resource to overlap with concurrent actions that write it. BG enumerates these to compute a range of acceptable values that should be observed by the read operation. If a data store produces a value outside this range, then it has produced unpredictable data. This process is named validation, and its details are as follows.
BG implements validation in an off-line manner after it rates a data store, preventing validation from exhausting the resources of a BGClient. During the benchmarking phase, each thread of a BGClient invokes an action that generates one log record. There are two types of log records, read and write, corresponding to a read or a write of a resource. A log record consists of a unique identifier, the action that produced it, the resource referenced by the action, its socialite session id, and the start and end time stamps of the action. A read log record also contains the value observed from the data store. A write log record contains either the new value of its referenced resource (named an Absolute Write Log, AWL, record) or the change applied to its existing value (named a Delta Write Log, DWL, record).
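The log records above can be sketched as the following Python data classes. This is an illustrative layout, not BG's actual serialization format; the field names are assumptions drawn from the description in the text.

```python
from dataclasses import dataclass

@dataclass
class LogRecord:
    record_id: int    # unique identifier of the log record
    action: str       # action that produced the record
    resource: str     # resource referenced by the action
    session_id: int   # socialite session id
    start_ts: float   # time stamp at which the action started
    end_ts: float     # time stamp at which the action ended

@dataclass
class ReadLogRecord(LogRecord):
    observed: int     # value observed from the data store

@dataclass
class WriteLogRecord(LogRecord):
    value: int        # new value (AWL) or change to the existing value (DWL)
    is_delta: bool    # True for a Delta Write Log (DWL) record
```

For example, a thread performing a View Profile action on member 5 might emit a `ReadLogRecord` carrying the friend count it observed, while a concurrent Invite Friend action emits a `WriteLogRecord` with `value=1, is_delta=True`.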
The start and end time stamps of each log record identify the duration of the action that read or wrote a resource. They enable BG to detect the four possible ways a write operation may overlap a read operation. During the validation phase, for each read log record, BG enumerates all write log records that reference the same resource and that either completed before or overlap the read, and computes from them a range of possible values for the resource. If the read log record contains a value outside this range, then its corresponding action observed unpredictable data.
Currently, there are two centralized implementations of the validation phase: one using interval trees as an in-memory data structure, and one using a relational database as a persistent store. The latter is more appropriate when the total size of the write log records exceeds the available memory of a BGClient. We also have a preliminary implementation of the validation phase using MapReduce, which requires the log files to be scattered across a distributed file system. One may configure the validation phase to use either the in-memory or the persistent implementation.
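The in-memory variant can be illustrated by indexing write records per resource so that validating a read scans only the writes referencing the same resource. The sketch below uses a plain per-resource list in place of an interval tree; it shows the queries such an index must answer, not BG's actual implementation.

```python
from collections import defaultdict

class WriteIndex:
    """Per-resource index of write intervals (simplified stand-in for
    the interval tree used by the in-memory validation phase)."""

    def __init__(self):
        self._by_resource = defaultdict(list)

    def add(self, resource, start_ts, end_ts, delta):
        self._by_resource[resource].append((start_ts, end_ts, delta))

    def overlapping(self, resource, start_ts, end_ts):
        """Writes whose interval overlaps [start_ts, end_ts]."""
        return [w for w in self._by_resource[resource]
                if w[0] < end_ts and w[1] > start_ts]

    def completed_before(self, resource, ts):
        """Writes that finished before time ts."""
        return [w for w in self._by_resource[resource] if w[1] <= ts]
```

An interval tree answers the `overlapping` query in logarithmic time per match, which matters once a popular resource accumulates many write records; the persistent variant answers the same two queries with SQL range predicates instead.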