<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Data on traviscj/blog</title>
    <link>https://traviscj.com/blog/tags/data/</link>
    <description>Recent content in Data on traviscj/blog</description>
    <generator>Hugo</generator>
    <language>en-us</language>
    <lastBuildDate>Thu, 25 Apr 2019 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://traviscj.com/blog/tags/data/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>absolutely minimal OLTP to OLAP pipeline</title>
      <link>https://traviscj.com/blog/post/2019-04-25-oltp_to_olap/</link>
      <pubDate>Thu, 25 Apr 2019 00:00:00 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2019-04-25-oltp_to_olap/</guid>
      <description>&lt;p&gt;Suppose we have some data in a production OLTP database, and we need to send it to some OLAP database.&#xA;This post describes one of the simplest approaches, and how to make it production-ready enough to rely on.&lt;/p&gt;&#xA;&lt;p&gt;For every table &lt;code&gt;t&lt;/code&gt;, we need to:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;introduce a new field, &lt;code&gt;updated_at&lt;/code&gt;&lt;/li&gt;&#xA;&lt;li&gt;introduce a new index on that field, so we can get the records that changed after a certain &lt;code&gt;updated_at&lt;/code&gt;.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;For example,&lt;/p&gt;</description>
    </item>
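The two bullets in the excerpt above can be sketched end to end. Here is a minimal illustration using Python's sqlite3 as a stand-in for the OLTP side; the table name `t`, the integer `updated_at` values, and the watermark handling are all hypothetical, not from the post:

```python
import sqlite3

# Hypothetical OLTP table with an updated_at column and an index on it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT, updated_at INTEGER)")
conn.execute("CREATE INDEX idx_t_updated_at ON t (updated_at)")
conn.executemany("INSERT INTO t (v, updated_at) VALUES (?, ?)",
                 [("a", 100), ("b", 150), ("c", 200)])

def pull_changes(conn, watermark):
    """Fetch rows changed strictly after `watermark`, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, v, updated_at FROM t WHERE updated_at > ? ORDER BY updated_at",
        (watermark,)).fetchall()
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark

# First incremental run from watermark 100 picks up only the later two rows.
rows, wm = pull_changes(conn, 100)
# Note: a real pipeline must also handle rows that share an updated_at value.
```

Each run ships only the delta to the OLAP side and advances the watermark for the next run.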
    <item>
      <title>feed sequences</title>
      <link>https://traviscj.com/blog/post/2019-01-08-feed_sequences/</link>
      <pubDate>Tue, 08 Jan 2019 00:00:00 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2019-01-08-feed_sequences/</guid>
      <description>&lt;p&gt;In the &lt;a href=&#34;https://traviscj.com/blog/post/2018-06-29-mysql_feeds/&#34;&gt;mysql feeds&lt;/a&gt; post, I mentioned that the publisher could do&lt;/p&gt;&#xA;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-sql&#34; data-lang=&#34;sql&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;SELECT&lt;/span&gt; &lt;span style=&#34;color:#66d9ef&#34;&gt;MAX&lt;/span&gt;(feed_sync_id)&lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;1&lt;/span&gt; &#xA;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;FROM&lt;/span&gt; kv&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;to find the next &lt;code&gt;feed_sync_id&lt;/code&gt; during the publishing process, but&#xA;&lt;strong&gt;this is actually a really bad idea.&lt;/strong&gt;&#xA;(And I knew it at the time, so forgive me for selling lies&amp;hellip;)&lt;/p&gt;&#xA;&lt;h2 id=&#34;republishing&#34;&gt;Republishing&lt;/h2&gt;&#xA;&lt;p&gt;Before we jump into the problematic scenario, I&amp;rsquo;d like to motivate it with a tiny bit of background.&lt;/p&gt;&#xA;&lt;p&gt;The &lt;em&gt;republish&lt;/em&gt; operation is extremely useful when consumers need to receive updates.&#xA;It is also extremely simple!&#xA;A query like&lt;/p&gt;</description>
    </item>
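One concrete way `SELECT MAX(feed_sync_id)+1` can bite, sketched in plain Python (the in-memory list standing in for the `kv` table is hypothetical): two publishers that each read before either commits will mint the same id.

```python
# Hypothetical stand-in for the kv table's committed feed_sync_id values.
feed_sync_ids = [1, 2, 3]

def next_id(snapshot):
    # the SELECT MAX(feed_sync_id)+1 from the quoted query
    return max(snapshot) + 1

a = next_id(feed_sync_ids)   # publisher A computes MAX+1
b = next_id(feed_sync_ids)   # publisher B computes MAX+1 before A commits
feed_sync_ids.append(a)
feed_sync_ids.append(b)
# Both got 4: a consumer tracking "last seen feed_sync_id" can now skip a row.
```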
    <item>
      <title>Feeds as cache invalidation mechanism</title>
      <link>https://traviscj.com/blog/post/2018-10-03-feeds_as_cache_invalidation_mechanism/</link>
      <pubDate>Wed, 03 Oct 2018 02:31:43 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2018-10-03-feeds_as_cache_invalidation_mechanism/</guid>
      <description>&lt;p&gt;One really cool use of &lt;a href=&#34;https://traviscj.com/blog/post/2018-06-29-mysql_feeds/&#34;&gt;feeds&lt;/a&gt; we&amp;rsquo;ve realized is that they give a very efficient mechanism for application code to load the most recent version of a table into memory.&#xA;The basic idea is:&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;Set it up as a usual feed published table with an appropriate index on &lt;code&gt;feed_sync_id&lt;/code&gt;.&lt;/li&gt;&#xA;&lt;li&gt;Either alongside or within the cache, represent the latest loaded &lt;code&gt;feed_sync_id&lt;/code&gt;.&lt;/li&gt;&#xA;&lt;li&gt;Set up a cronjob/etc that reads the latest &lt;code&gt;feed_sync_id&lt;/code&gt; and compares it to the cache&amp;rsquo;s &lt;code&gt;feed_sync_id&lt;/code&gt;.&lt;/li&gt;&#xA;&lt;li&gt;If they differ, reload the cache.&lt;/li&gt;&#xA;&lt;li&gt;Ensure that all changes set &lt;code&gt;feed_sync_id&lt;/code&gt; to null!&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;p&gt;This works really well because the &lt;code&gt;feed_sync_id&lt;/code&gt; in the database only gets updated on changes, so the reload cronjob is mostly a no-op.&#xA;This means we can reload very frequently!&lt;/p&gt;</description>
    </item>
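Steps 1 through 4 above can be sketched in a few lines of Python; the dict stand-ins for the table and the cache are hypothetical, and the publishing side (assigning `feed_sync_id`, step 5) is elided:

```python
# key -> (value, feed_sync_id); both structures are hypothetical stand-ins.
table = {"k1": ("v1", 7), "k2": ("v2", 9)}
cache = {"feed_sync_id": 0, "data": {}}

def latest_feed_sync_id(table):
    return max(fsid for _, fsid in table.values())

def maybe_reload(table, cache):
    """Reload the cache only when the table's latest feed_sync_id moved."""
    latest = latest_feed_sync_id(table)
    if latest == cache["feed_sync_id"]:
        return False  # common case: nothing changed, no-op
    cache["data"] = {k: v for k, (v, _) in table.items()}
    cache["feed_sync_id"] = latest
    return True

first = maybe_reload(table, cache)   # first run loads the cache
second = maybe_reload(table, cache)  # immediately after: no-op
```

Because the no-op path is a single comparison, the cron cadence can be very tight.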
    <item>
      <title>cross-dc sync with feed published KV</title>
      <link>https://traviscj.com/blog/post/2018-07-10-cross-dc-sync-with-feed-published_kv/</link>
      <pubDate>Tue, 10 Jul 2018 11:34:39 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2018-07-10-cross-dc-sync-with-feed-published_kv/</guid>
      <description>&lt;p&gt;It&amp;rsquo;s been fun describing the &lt;a href=&#34;https://traviscj.com/blog/post/2018-06-29-mysql_feeds/&#34;&gt;feeds&lt;/a&gt; framework we use at Square.&#xA;Today we&amp;rsquo;ll dive into a concrete problem:&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;We&amp;rsquo;ll stick with the feed-published &lt;code&gt;kv&lt;/code&gt; table again.&lt;/li&gt;&#xA;&lt;li&gt;We want two instances of some application code to bidirectionally synchronize the writes that happened on each instance to the other.&lt;/li&gt;&#xA;&lt;li&gt;Eventual consistency is ok.&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;p&gt;First, a bit of justification, though:&#xA;I use this KV table to remember things I might have to look up without usual context, like my motorcycle&amp;rsquo;s license plate number, or that one weird python snippet I can never remember.&#xA;I also have a whole slew of them at work &amp;ndash; a bunch of random representative IDs for a bunch of things in our systems that I use from time to time.&#xA;I &lt;em&gt;also&lt;/em&gt; use a bunch of these as todo items at work, but that happens to work differently and is a topic for a future blog post :-)&lt;/p&gt;</description>
    </item>
    <item>
      <title>history preserving data models</title>
      <link>https://traviscj.com/blog/post/2018-07-02-history-preserving-data-models/</link>
      <pubDate>Mon, 02 Jul 2018 11:34:39 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2018-07-02-history-preserving-data-models/</guid>
      <description>&lt;p&gt;Start with a super simple data model:&lt;/p&gt;&#xA;&lt;pre&gt;&lt;code&gt;CREATE TABLE kv (&#xA;  id BIGINT(22) NOT NULL AUTO_INCREMENT,&#xA;  k VARCHAR(255) NOT NULL,&#xA;  v LONGBLOB NOT NULL,&#xA;  PRIMARY KEY (`id`),&#xA;  UNIQUE KEY u_k (`k`)&#xA;) Engine=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;Suppose we want to audit &amp;ldquo;changes&amp;rdquo; to this data model.&lt;/p&gt;&#xA;&lt;h2 id=&#34;approach-1-kv_log&#34;&gt;Approach 1: &lt;code&gt;kv_log&lt;/code&gt;&lt;/h2&gt;&#xA;&lt;p&gt;Add a data model like:&lt;/p&gt;&#xA;&lt;pre&gt;&lt;code&gt;CREATE TABLE `kv_log` (&#xA;  id BIGINT(22) NOT NULL AUTO_INCREMENT,&#xA;  changed_at TIMESTAMP NOT NULL,&#xA;  k VARCHAR(255) NOT NULL,&#xA;  old_v LONGBLOB NOT NULL,&#xA;  new_v LONGBLOB NOT NULL,&#xA;  PRIMARY KEY (`id`)&#xA;)&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;Current value query: unchanged&lt;/p&gt;</description>
    </item>
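A sketch of how application code might keep `kv_log` in step with `kv`, using sqlite3 as a stand-in (the `put` helper and the upsert syntax are assumptions, not from the post, and this sketch allows a NULL `old_v` for the first write of a key): the audit row is written in the same transaction as the change.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (id INTEGER PRIMARY KEY, k TEXT UNIQUE, v BLOB)")
conn.execute("""CREATE TABLE kv_log (id INTEGER PRIMARY KEY,
    changed_at TIMESTAMP, k TEXT, old_v BLOB, new_v BLOB)""")

def put(conn, k, v):
    """Write a key and its audit trail row in one transaction."""
    with conn:
        row = conn.execute("SELECT v FROM kv WHERE k = ?", (k,)).fetchone()
        old_v = row[0] if row else None
        # Upsert needs SQLite 3.24+ (an assumption about the environment).
        conn.execute("INSERT INTO kv (k, v) VALUES (?, ?) "
                     "ON CONFLICT(k) DO UPDATE SET v = excluded.v", (k, v))
        conn.execute("INSERT INTO kv_log (changed_at, k, old_v, new_v) "
                     "VALUES (?, ?, ?, ?)", (time.time(), k, old_v, v))

put(conn, "plate", b"ABC123")
put(conn, "plate", b"XYZ789")  # the log now holds the old/new pair
```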
    <item>
      <title>mysql feeds</title>
      <link>https://traviscj.com/blog/post/2018-06-29-mysql_feeds/</link>
      <pubDate>Fri, 29 Jun 2018 00:00:00 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2018-06-29-mysql_feeds/</guid>
      <description>&lt;p&gt;At work, we use a pattern called &lt;em&gt;feeds&lt;/em&gt; that gets an incredible amount of work done.&#xA;I&amp;rsquo;ve been wanting to describe it here for quite a while, and now seems as good a time as any.&lt;/p&gt;&#xA;&lt;p&gt;The basic premise is: You have a service A with some data that other &amp;ldquo;consuming&amp;rdquo; services B, C, and D want to find out about.&#xA;Maybe the data is payments, maybe it&amp;rsquo;s support cases, maybe it&amp;rsquo;s password changes&amp;hellip; whatever.&#xA;The other services might include your data warehouse, some event listeners, and so on.&lt;/p&gt;</description>
    </item>
    <item>
      <title>sfpark api</title>
      <link>https://traviscj.com/blog/post/2018-04-24-sfpark-api/</link>
      <pubDate>Tue, 24 Apr 2018 16:29:39 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2018-04-24-sfpark-api/</guid>
      <description>&lt;p&gt;I recently came to learn of the &lt;a href=&#34;http://sfpark.org/wp-content/uploads/2013/12/SFpark_API_Dec2013.pdf&#34;&gt;SFpark API&lt;/a&gt;, which lets one make queries like:&lt;/p&gt;&#xA;&lt;pre&gt;&lt;code&gt;LAT=37.787702&#xA;LONG=-122.407796&#xA;curl &amp;quot;http://api.sfpark.org/sfpark/rest/availabilityservice?lat=${LAT}&amp;amp;long=${LONG}&amp;amp;radius=0.25&amp;amp;uom=mile&amp;amp;response=json&amp;quot; | pbcopy&#xA;pbpaste | jq &#39;.AVL[] | select (.TYPE | contains(&amp;quot;OFF&amp;quot;))&#39;&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;and get a response including records like:&lt;/p&gt;&#xA;&lt;pre&gt;&lt;code&gt;{&#xA;  &amp;quot;TYPE&amp;quot;: &amp;quot;OFF&amp;quot;,&#xA;  &amp;quot;OSPID&amp;quot;: &amp;quot;950&amp;quot;,&#xA;  &amp;quot;NAME&amp;quot;: &amp;quot;Union Square Garage&amp;quot;,&#xA;  &amp;quot;DESC&amp;quot;: &amp;quot;333 Post Street&amp;quot;,&#xA;  &amp;quot;INTER&amp;quot;: &amp;quot;Geary between Stockton &amp;amp; Powell&amp;quot;,&#xA;  &amp;quot;TEL&amp;quot;: &amp;quot;(415) 397-0631&amp;quot;,&#xA;  &amp;quot;OPHRS&amp;quot;: {&#xA;    &amp;quot;OPS&amp;quot;: {&#xA;      &amp;quot;FROM&amp;quot;: &amp;quot;7 Days/Wk&amp;quot;,&#xA;      &amp;quot;BEG&amp;quot;: &amp;quot;24 Hrs/Day&amp;quot;&#xA;    }&#xA;  },&#xA;  &amp;quot;OCC&amp;quot;: &amp;quot;381&amp;quot;,&#xA;  &amp;quot;OPER&amp;quot;: &amp;quot;670&amp;quot;,&#xA;  &amp;quot;PTS&amp;quot;: &amp;quot;1&amp;quot;,&#xA;  &amp;quot;LOC&amp;quot;: &amp;quot;-122.407447946,37.7876789151&amp;quot;&#xA;}&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;Pretty cool!&lt;/p&gt;</description>
    </item>
    <item>
      <title>python attrs</title>
      <link>https://traviscj.com/blog/post/2017-08-31-python-attrs/</link>
      <pubDate>Thu, 31 Aug 2017 08:42:43 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2017-08-31-python-attrs/</guid>
      <description>&lt;p&gt;I came across a very interesting library in a &lt;a href=&#34;https://news.ycombinator.com/item?id=15131981&#34;&gt;HN thread&lt;/a&gt;: the python &lt;a href=&#34;http://www.attrs.org/en/stable/examples.html&#34;&gt;attrs&lt;/a&gt; library.&lt;/p&gt;&#xA;&lt;p&gt;In particular, this seems like a great way to do the &amp;ldquo;dumb data objects&amp;rdquo; they talk about in &lt;a href=&#34;https://www.youtube.com/watch?v=3MNVP9-hglc&#34;&gt;the end of object inheritance&lt;/a&gt;, and also related to (but maybe lighter weight than) &lt;a href=&#34;https://zopeinterface.readthedocs.io/en/latest/README.html&#34;&gt;zope.interface&lt;/a&gt;.&lt;/p&gt;&#xA;&lt;p&gt;This also seems very similar to what I use &lt;a href=&#34;https://github.com/google/auto&#34;&gt;autovalue&lt;/a&gt; for at work.&lt;/p&gt;&#xA;&lt;p&gt;One particularly interesting application is a &amp;ldquo;code database&amp;rdquo; &amp;ndash; using static, checked-in-to-version-control definitions of some data model as a sort of very-fast-to-read, very-slow-to-update &amp;ldquo;Data Model&amp;rdquo;.&#xA;I find this fascinating:&#xA;Code shares a lot of properties with great data stores: ability to rollback (&lt;code&gt;git revert&lt;/code&gt;) and accountability/auditability (&lt;code&gt;git blame&lt;/code&gt;).&#xA;It also makes a lot of fairly hard problems much simpler: you don&amp;rsquo;t need to poll the database for changes.&#xA;You don&amp;rsquo;t need to invalidate any caches.&#xA;You don&amp;rsquo;t need to consider a &amp;ldquo;split brain&amp;rdquo; environment where half of the in-memory caches have updated but the other half haven&amp;rsquo;t.&#xA;You don&amp;rsquo;t need to consider failure cases of how long the in-memory cache is allowed to be invalid: you just fail to boot up on deploy.&#xA;(Admittedly, there&amp;rsquo;s still an opportunity window for split brain behavior for the duration of the deploy, but this is a lot easier to reason about than an 
essentially arbitrary one.)&lt;/p&gt;</description>
    </item>
    <item>
      <title>toy/life data models</title>
      <link>https://traviscj.com/blog/post/2017-04-19-toy_life_data_models/</link>
      <pubDate>Wed, 19 Apr 2017 00:00:00 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2017-04-19-toy_life_data_models/</guid>
      <description>&lt;p&gt;I&amp;rsquo;ve been experimenting a lot with some kinda &amp;ldquo;toy&amp;rdquo; data models based on random things I wish there were a database to query, but there isn&amp;rsquo;t.&#xA;For example:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;What was my average arrival time this week?&lt;/li&gt;&#xA;&lt;li&gt;How much of my equity has vested before a certain date?&lt;/li&gt;&#xA;&lt;li&gt;When was the last time we had spaghetti for dinner?&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;I&amp;rsquo;ve been doing this with flat JSON files.&#xA;This is a bit of an odd choice for me; I actually love schematizing data models in protobuf and MySQL and designing proper indices for the data models I work on during work hours.&lt;/p&gt;</description>
    </item>
    <item>
      <title>filter vs spec (draft)</title>
      <link>https://traviscj.com/blog/post/2017-01-18-filter_vs_spec/</link>
      <pubDate>Wed, 18 Jan 2017 00:00:00 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2017-01-18-filter_vs_spec/</guid>
      <description>&lt;p&gt;Consider a silly data model to store data about cities like&lt;/p&gt;&#xA;&lt;pre&gt;&lt;code&gt;message City {&#xA;  optional string city_name = 1;&#xA;  optional string state = 2;&#xA;  optional int32 population = 3;&#xA;  optional int32 year_founded = 4;&#xA;  // ... presumably others :-)&#xA;}&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;and some sample data like:&lt;/p&gt;&#xA;&lt;pre&gt;&lt;code&gt;[&#xA;  {&amp;quot;city_name&amp;quot;: &amp;quot;Portland&amp;quot;, &amp;quot;state&amp;quot;: &amp;quot;OR&amp;quot;, &amp;quot;population&amp;quot;: ...},&#xA;  {&amp;quot;city_name&amp;quot;: &amp;quot;Portland&amp;quot;, &amp;quot;state&amp;quot;: &amp;quot;ME&amp;quot;, &amp;quot;population&amp;quot;: ...},&#xA;  {&amp;quot;city_name&amp;quot;: &amp;quot;Springfield&amp;quot;, &amp;quot;state&amp;quot;: &amp;quot;FL&amp;quot;, &amp;quot;population&amp;quot;: ...},&#xA;  {&amp;quot;city_name&amp;quot;: &amp;quot;Springfield&amp;quot;, &amp;quot;state&amp;quot;: &amp;quot;IL&amp;quot;, &amp;quot;population&amp;quot;: ...},&#xA;  {&amp;quot;city_name&amp;quot;: &amp;quot;Springfield&amp;quot;, &amp;quot;state&amp;quot;: &amp;quot;CO&amp;quot;, &amp;quot;population&amp;quot;: ...}&#xA;]&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;There are some useful entities we can define: (DRAFT NB: don&amp;rsquo;t read too much into the matcher vs filter lingo.)&lt;/p&gt;</description>
    </item>
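The sample data above suggests the simplest possible matcher: a spec as a dict of required field values. A hypothetical Python sketch (field names follow the proto above; the `matches` and `filter_cities` helpers are made up for illustration):

```python
# Sample data in the shape of the City message above (populations elided).
cities = [
    {"city_name": "Portland", "state": "OR"},
    {"city_name": "Portland", "state": "ME"},
    {"city_name": "Springfield", "state": "FL"},
    {"city_name": "Springfield", "state": "IL"},
    {"city_name": "Springfield", "state": "CO"},
]

def matches(city, spec):
    """A spec is a dict of required field values; all must match."""
    return all(city.get(k) == v for k, v in spec.items())

def filter_cities(cities, spec):
    return [c for c in cities if matches(c, spec)]

springfields = filter_cities(cities, {"city_name": "Springfield"})
```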
    <item>
      <title>streaks vs statistical streaks</title>
      <link>https://traviscj.com/blog/post/2016-09-26-streaks_vs_statistical_streaks/</link>
      <pubDate>Mon, 26 Sep 2016 00:00:00 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2016-09-26-streaks_vs_statistical_streaks/</guid>
      <description>&lt;p&gt;&lt;a href=&#34;https://hn.algolia.com/?query=streaks&amp;amp;sort=byPopularity&amp;amp;prefix&amp;amp;page=0&amp;amp;dateRange=all&amp;amp;type=story&#34;&gt;Hacker News et al are obsessed with streaks&lt;/a&gt;, but I think they have some problems:&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;&#xA;&lt;p&gt;A single regression resets to zero.&lt;/p&gt;&#xA;&lt;/li&gt;&#xA;&lt;li&gt;&#xA;&lt;p&gt;There&amp;rsquo;s not an easy way to gradually ramp up your streak-commitment over time.&lt;/p&gt;&#xA;&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;p&gt;I prefer a different approach: statistical streaks.&lt;/p&gt;&#xA;&lt;p&gt;Suppose I made a commitment to do something differently on 2016-08-26, and did it for the next 5 days; then my 30-day statistical streak avg = 0.166, but my 5-day statistical streak avg = 1.0.&lt;/p&gt;</description>
    </item>
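The closing example can be computed directly; a small Python sketch (the `statistical_streak` helper is hypothetical) treating each day as 1 for done and 0 for missed:

```python
def statistical_streak(days_done, window):
    """Fraction of the trailing `window` days marked done (0/1)."""
    return sum(days_done[-window:]) / window

# 30 days of history: the commitment started 5 days ago and held since.
history = [0] * 25 + [1] * 5
five_day = statistical_streak(history, 5)     # 1.0
thirty_day = statistical_streak(history, 30)  # 5/30, the 0.166 in the example
```

A single miss only dents the average instead of zeroing a run length, which is the whole point.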
    <item>
      <title>build_json.sh</title>
      <link>https://traviscj.com/blog/post/2015-08-19-build_json.sh/</link>
      <pubDate>Wed, 19 Aug 2015 00:00:00 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2015-08-19-build_json.sh/</guid>
      <description>&lt;p&gt;This might seem silly, but I&amp;rsquo;ve been playing with some&#xA;&lt;a href=&#34;http://traviscj.com/ZeroBin/?1d4c5e66662c6306#V91+G7w0NYxN4ui/sDPBivPA8Fo5PzB7mZHAPboau7U=&#34;&gt;json.sh&lt;/a&gt; scripts&#xA;that build legitimate json bodies and are easily filled into a shell script variable as needed.&lt;/p&gt;&#xA;&lt;p&gt;The basic driving idea was that there are lots of slick ways to pull data &lt;em&gt;out&lt;/em&gt; of JSON (either by programming something with python&amp;rsquo;s json or running a command line tool like jq or whatever), but not as many friendly ways to build some JSON out of a given token.&#xA;Often, you have a list of identifiers and you need to build a bunch of JSON blobs from that list.&lt;/p&gt;</description>
    </item>
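The same list-of-identifiers-to-JSON-bodies idea, sketched in Python rather than shell (the `token`/`action` field names and the sample identifiers are hypothetical):

```python
import json

# Turn a list of identifiers into well-formed JSON request bodies.
ids = ["tok_123", "tok_456"]
bodies = [json.dumps({"token": t, "action": "lookup"}) for t in ids]
# Each element is a complete JSON object string, safe to drop into a
# shell variable or a request payload without hand-quoting.
```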
    <item>
      <title>logging</title>
      <link>https://traviscj.com/blog/post/2014-09-26-logging/</link>
      <pubDate>Fri, 26 Sep 2014 00:00:00 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2014-09-26-logging/</guid>
      <description>&lt;p&gt;In grad school, I spent a lot of time writing code that read output from nonlinear optimization solvers, and tried&#xA;to do useful things with it.&#xA;A much better way to do that is called &amp;ldquo;structured logging&amp;rdquo;, an idea I experimented with a bit back then.&#xA;It has also been coming up in my working life, so I wanted to delve into it a bit deeper.&#xA;For a quick introduction, check out &lt;a href=&#34;http://gregoryszorc.com/blog/category/logging/&#34;&gt;Thoughts on Logging&lt;/a&gt;.&#xA;For a much longer introduction, see &lt;a href=&#34;http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying&#34;&gt;The Log: What every software engineer should know about real-time data&amp;rsquo;s unifying abstraction&lt;/a&gt;.&lt;/p&gt;</description>
    </item>
    <item>
      <title>org-mode emacs to track mileage</title>
      <link>https://traviscj.com/blog/post/2013-05-15-org-mode_emacs_to_track_mileage/</link>
      <pubDate>Wed, 15 May 2013 00:00:00 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2013-05-15-org-mode_emacs_to_track_mileage/</guid>
      <description>&lt;p&gt;I&amp;rsquo;ve been trying out emacs org-mode lately for keeping track of everything.&#xA;One thing that seemed worth tracking is the mileage the new Focus gets.&#xA;Turns out that org-mode supports a kind of spreadsheet, so I made a quick little video demo on how I use it:&lt;/p&gt;&#xA;&lt;!-- raw HTML omitted --&gt;</description>
    </item>
    <item>
      <title>implementation of set operations</title>
      <link>https://traviscj.com/blog/post/2013-03-13-implementation_of_set_operations/</link>
      <pubDate>Wed, 13 Mar 2013 00:00:00 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2013-03-13-implementation_of_set_operations/</guid>
      <description>&lt;p&gt;We got in a bit of a debate yesterday in the office over the implementation of associative containers, which I thought was pretty fun.&#xA;We made up the big chart of complexity results you see below.&lt;/p&gt;&#xA;&lt;h2 id=&#34;nomenclature&#34;&gt;nomenclature:&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;$S$, $S_1$, and $S_2$ are subsets of $\Omega$.&lt;/li&gt;&#xA;&lt;li&gt;Denote an element by $e\in\Omega$.&lt;/li&gt;&#xA;&lt;li&gt;$n$,$n_1$,$n_2$,$N$ are the sizes of the set $S$, $S_1$, $S_2$, and $\Omega$, respectively, and $n_1 \geq n_2$.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;complexity&#34;&gt;Complexity&lt;/h2&gt;&#xA;&lt;table&gt;&#xA;  &lt;thead&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;th&gt;Operation\Approach&lt;/th&gt;&#xA;          &lt;th&gt;Hash Table&lt;/th&gt;&#xA;          &lt;th&gt;Hash Tree&lt;/th&gt;&#xA;          &lt;th&gt;Binary List&lt;/th&gt;&#xA;          &lt;th&gt;Entry List (sorted)&lt;/th&gt;&#xA;          &lt;th&gt;Entry List (unsorted)&lt;/th&gt;&#xA;          &lt;th&gt;&lt;/th&gt;&#xA;      &lt;/tr&gt;&#xA;  &lt;/thead&gt;&#xA;  &lt;tbody&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;$e \in S&#x9;&#x9;  $&lt;/td&gt;&#xA;          &lt;td&gt;$O(1)   &#x9;&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;$O(log(n))&#x9;&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;$O(1)&#x9;&#x9;&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;$O(log(n))&#x9;&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;$O(n)&#x9;&#x9;&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;$S_1 \cup S_2    $&lt;/td&gt;&#xA;          &lt;td&gt;$O(n_1+n_2)&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;$O(n_1+n_2)&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;$O(N)&#x9;&#x9;&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;$O(n_1+n_2)&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;$O(n_1n_2)&#x9;&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          
&lt;td&gt;$S_1 \cap S_2    $&lt;/td&gt;&#xA;          &lt;td&gt;$O(n_1)&#x9;&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;$O(log(n_1)n_2)$&lt;/td&gt;&#xA;          &lt;td&gt;$O(N)&#x9;&#x9;&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;$O(n_2)&#x9;&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;$O(n_1n_2)&#x9;&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;space complexity&lt;/td&gt;&#xA;          &lt;td&gt;$O(n)&#x9;&#x9;&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;$O(n)          $&lt;/td&gt;&#xA;          &lt;td&gt;$O(N)$ bits.&lt;/td&gt;&#xA;          &lt;td&gt;$O(n)          $&lt;/td&gt;&#xA;          &lt;td&gt;$O(n)          $&lt;/td&gt;&#xA;          &lt;td&gt;&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;  &lt;/tbody&gt;&#xA;&lt;/table&gt;&#xA;&lt;p&gt;As I said&amp;ndash;this was just what came out of my memory of an informal discussion, so I make no guarantees that any of it is correct.&#xA;Let me know if you spot something wrong!&#xA;We used the examples  $S_1 = {1,2,3,4,5}$ and $S_2 = {500000}$ to think through some things.&lt;/p&gt;</description>
    </item>
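The hash-table column above is easy to sanity-check with Python sets, using the example sets from the discussion:

```python
# The example sets from the discussion above.
S1 = {1, 2, 3, 4, 5}
S2 = {500000}

union = S1.union(S2)          # hashing: expected O(n_1 + n_2)
inter = S1.intersection(S2)   # CPython iterates the smaller operand
# Membership is an expected O(1) hash lookup: 500000 in S2 touches one bucket.
```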
    <item>
      <title>Whack it with an X squared!</title>
      <link>https://traviscj.com/blog/post/2008-08-07-whack_it_with_an_x_squared/</link>
      <pubDate>Thu, 07 Aug 2008 00:00:00 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2008-08-07-whack_it_with_an_x_squared/</guid>
      <description>&lt;p&gt;David and I were working on our Math381 model, and I was getting frustrated because the data we collected and the results from the simulation were not lining up properly. We were hoping to see something like this:&lt;/p&gt;&#xA;&lt;p&gt;Number of Logins from Data&lt;/p&gt;&#xA;&lt;p&gt;Instead, we were getting stuff distributed like this:&lt;/p&gt;&#xA;&lt;p&gt;Simulated Number of Logins&lt;/p&gt;&#xA;&lt;p&gt;I realized that we needed some function to force a bunch of this junk further left. Recalling an old adage from Mr. Cone’s AP Chemistry class, I decided it was the right time to whack it with an X squared. This is vaguely appropriate, because rand() has a range [0,1), so squaring it should put a whole bunch of stuff further left, but not everything uniformly (i.e., the first half will end up in the first quarter, the first 3/4 will end up in the first 9/16, etc). Imagine my shock when I saw this:&lt;/p&gt;</description>
    </item>
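The squaring trick is easy to check numerically; a small Python sketch confirming that for uniform u on [0,1), half of the squared values land below 0.25:

```python
import random

random.seed(0)  # deterministic for the sketch
squared = [random.random() ** 2 for _ in range(10000)]
# For uniform u on [0,1), P(u*u below 0.25) = P(u below 0.5) = 0.5,
# versus 0.25 before squaring: the mass shifts toward zero.
above = sum(s > 0.25 for s in squared)
frac_below_quarter = 1 - above / len(squared)
```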
  </channel>
</rss>
