<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Data on traviscj/blog</title>
    <link>https://traviscj.com/blog/tags/data/</link>
    <description>Recent content in Data on traviscj/blog</description>
    <generator>Hugo</generator>
    <language>en-us</language>
    <lastBuildDate>Thu, 25 Apr 2019 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://traviscj.com/blog/tags/data/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>absolutely minimal OLTP to OLAP pipeline</title>
      <link>https://traviscj.com/blog/post/2019-04-25-oltp_to_olap/</link>
      <pubDate>Thu, 25 Apr 2019 00:00:00 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2019-04-25-oltp_to_olap/</guid>
      <description>&lt;p&gt;Suppose we have some data in a production OLTP database, and we need to send it to some OLAP database.&#xA;This post describes one of the simplest approaches, and how to make it production-ready enough to rely on.&lt;/p&gt;&#xA;&lt;p&gt;For every table &lt;code&gt;t&lt;/code&gt;, we need to:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;introduce a new field, &lt;code&gt;updated_at&lt;/code&gt;&lt;/li&gt;&#xA;&lt;li&gt;introduce a new index on that field, so we can get the records that changed after a certain &lt;code&gt;updated_at&lt;/code&gt;.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;For example,&lt;/p&gt;</description>
    </item>
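The two bullets in the excerpt above can be sketched end to end. Here is a minimal illustration using Python's sqlite3 as a stand-in for the OLTP side; the table name `t`, the integer `updated_at` values, and the watermark handling are all hypothetical, not from the post:

```python
import sqlite3

# Hypothetical OLTP table with an updated_at column and an index on it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT, updated_at INTEGER)")
conn.execute("CREATE INDEX idx_t_updated_at ON t (updated_at)")
conn.executemany("INSERT INTO t (v, updated_at) VALUES (?, ?)",
                 [("a", 100), ("b", 150), ("c", 200)])

def pull_changes(conn, watermark):
    """Fetch rows changed strictly after `watermark`, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, v, updated_at FROM t WHERE updated_at > ? ORDER BY updated_at",
        (watermark,)).fetchall()
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark

# First incremental run from watermark 100 picks up only the later two rows.
rows, wm = pull_changes(conn, 100)
# Note: a real pipeline must also handle rows that share an updated_at value.
```

Each run ships only the delta to the OLAP side and advances the watermark for the next run.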
    <item>
      <title>feed sequences</title>
      <link>https://traviscj.com/blog/post/2019-01-08-feed_sequences/</link>
      <pubDate>Tue, 08 Jan 2019 00:00:00 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2019-01-08-feed_sequences/</guid>
      <description>&lt;p&gt;In the &lt;a href=&#34;https://traviscj.com/blog/post/2018-06-29-mysql_feeds/&#34;&gt;mysql feeds&lt;/a&gt; post, I mentioned that the publisher could do&lt;/p&gt;&#xA;&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-sql&#34; data-lang=&#34;sql&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;SELECT&lt;/span&gt; &lt;span style=&#34;color:#66d9ef&#34;&gt;MAX&lt;/span&gt;(feed_sync_id)&lt;span style=&#34;color:#f92672&#34;&gt;+&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;1&lt;/span&gt; &#xA;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;FROM&lt;/span&gt; kv&#xA;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;to find the next &lt;code&gt;feed_sync_id&lt;/code&gt; during the publishing process, but&#xA;&lt;strong&gt;this is actually a really bad idea.&lt;/strong&gt;&#xA;(And I knew it at the time, so forgive me for selling lies&amp;hellip;)&lt;/p&gt;&#xA;&lt;h2 id=&#34;republishing&#34;&gt;Republishing&lt;/h2&gt;&#xA;&lt;p&gt;Before we jump into the problematic scenario, I&amp;rsquo;d like to motivate it with a tiny bit of background.&lt;/p&gt;&#xA;&lt;p&gt;The &lt;em&gt;republish&lt;/em&gt; operation is extremely useful when consumers need to receive updates.&#xA;It is also extremely simple!&#xA;A query like&lt;/p&gt;</description>
    </item>
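One concrete way `SELECT MAX(feed_sync_id)+1` can bite, sketched in plain Python (the in-memory list standing in for the `kv` table is hypothetical): two publishers that each read before either commits will mint the same id.

```python
# Hypothetical stand-in for the kv table's committed feed_sync_id values.
feed_sync_ids = [1, 2, 3]

def next_id(snapshot):
    # the SELECT MAX(feed_sync_id)+1 from the quoted query
    return max(snapshot) + 1

a = next_id(feed_sync_ids)   # publisher A computes MAX+1
b = next_id(feed_sync_ids)   # publisher B computes MAX+1 before A commits
feed_sync_ids.append(a)
feed_sync_ids.append(b)
# Both got 4: a consumer tracking "last seen feed_sync_id" can now skip a row.
```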
    <item>
      <title>Feeds as cache invalidation mechanism</title>
      <link>https://traviscj.com/blog/post/2018-10-03-feeds_as_cache_invalidation_mechanism/</link>
      <pubDate>Wed, 03 Oct 2018 02:31:43 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2018-10-03-feeds_as_cache_invalidation_mechanism/</guid>
      <description>&lt;p&gt;One really cool use of &lt;a href=&#34;https://traviscj.com/blog/post/2018-06-29-mysql_feeds/&#34;&gt;feeds&lt;/a&gt; we&amp;rsquo;ve realized is that they give a very efficient mechanism for application code to load the most recent version of a table into memory.&#xA;The basic idea is:&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;Set it up as a usual feed published table with an appropriate index on &lt;code&gt;feed_sync_id&lt;/code&gt;.&lt;/li&gt;&#xA;&lt;li&gt;Either alongside or within the cache, represent the latest loaded &lt;code&gt;feed_sync_id&lt;/code&gt;.&lt;/li&gt;&#xA;&lt;li&gt;Set up a cronjob/etc that reads the latest &lt;code&gt;feed_sync_id&lt;/code&gt; and compares it to the cache&amp;rsquo;s &lt;code&gt;feed_sync_id&lt;/code&gt;.&lt;/li&gt;&#xA;&lt;li&gt;If they differ, reload the cache.&lt;/li&gt;&#xA;&lt;li&gt;Ensure that all changes set &lt;code&gt;feed_sync_id&lt;/code&gt; to null!&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;p&gt;This works really well because the &lt;code&gt;feed_sync_id&lt;/code&gt; in the database only gets updated on changes, so the reload cronjob is mostly a no-op.&#xA;This means we can reload very frequently!&lt;/p&gt;</description>
    </item>
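Steps 1 through 4 above can be sketched in a few lines of Python; the dict stand-ins for the table and the cache are hypothetical, and the publishing side (assigning `feed_sync_id`, step 5) is elided:

```python
# key -> (value, feed_sync_id); both structures are hypothetical stand-ins.
table = {"k1": ("v1", 7), "k2": ("v2", 9)}
cache = {"feed_sync_id": 0, "data": {}}

def latest_feed_sync_id(table):
    return max(fsid for _, fsid in table.values())

def maybe_reload(table, cache):
    """Reload the cache only when the table's latest feed_sync_id moved."""
    latest = latest_feed_sync_id(table)
    if latest == cache["feed_sync_id"]:
        return False  # common case: nothing changed, no-op
    cache["data"] = {k: v for k, (v, _) in table.items()}
    cache["feed_sync_id"] = latest
    return True

first = maybe_reload(table, cache)   # first run loads the cache
second = maybe_reload(table, cache)  # immediately after: no-op
```

Because the no-op path is a single comparison, the cron cadence can be very tight.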
    <item>
      <title>cross-dc sync with feed published KV</title>
      <link>https://traviscj.com/blog/post/2018-07-10-cross-dc-sync-with-feed-published_kv/</link>
      <pubDate>Tue, 10 Jul 2018 11:34:39 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2018-07-10-cross-dc-sync-with-feed-published_kv/</guid>
      <description>&lt;p&gt;It&amp;rsquo;s been fun describing the &lt;a href=&#34;https://traviscj.com/blog/post/2018-06-29-mysql_feeds/&#34;&gt;feeds&lt;/a&gt; framework we use at Square.&#xA;Today we&amp;rsquo;ll dive into a concrete problem:&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;We&amp;rsquo;ll stick with the feed-published &lt;code&gt;kv&lt;/code&gt; table again.&lt;/li&gt;&#xA;&lt;li&gt;We want two instances of some application code to bidirectionally synchronize the writes that happened on each instance to the other.&lt;/li&gt;&#xA;&lt;li&gt;Eventual consistency is ok.&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;p&gt;First, a bit of justification, though:&#xA;I use this KV table to remember things I might have to look up without usual context, like my motorcycle&amp;rsquo;s license plate number, or that one weird python snippet I can never remember.&#xA;I also have a whole slew of them at work &amp;ndash; a bunch of random representative IDs for a bunch of things in our systems that I use from time to time.&#xA;I &lt;em&gt;also&lt;/em&gt; use a bunch of these as todo items at work, but that happens to work differently and is a topic for a future blog post :-)&lt;/p&gt;</description>
    </item>
    <item>
      <title>history preserving data models</title>
      <link>https://traviscj.com/blog/post/2018-07-02-history-preserving-data-models/</link>
      <pubDate>Mon, 02 Jul 2018 11:34:39 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2018-07-02-history-preserving-data-models/</guid>
      <description>&lt;p&gt;Start with a super simple data model:&lt;/p&gt;&#xA;&lt;pre&gt;&lt;code&gt;CREATE TABLE kv (&#xA;  id BIGINT(22) NOT NULL AUTO_INCREMENT,&#xA;  k VARCHAR(255) NOT NULL,&#xA;  v LONGBLOB NOT NULL,&#xA;  PRIMARY KEY (`id`),&#xA;  UNIQUE KEY u_k (`k`)&#xA;) Engine=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;Suppose we want to audit &amp;ldquo;changes&amp;rdquo; to this data model.&lt;/p&gt;&#xA;&lt;h2 id=&#34;approach-1-kv_log&#34;&gt;Approach 1: &lt;code&gt;kv_log&lt;/code&gt;&lt;/h2&gt;&#xA;&lt;p&gt;Add a data model like:&lt;/p&gt;&#xA;&lt;pre&gt;&lt;code&gt;CREATE TABLE `kv_log` (&#xA;  id BIGINT(22) NOT NULL AUTO_INCREMENT,&#xA;  changed_at TIMESTAMP NOT NULL,&#xA;  k VARCHAR(255) NOT NULL,&#xA;  old_v LONGBLOB NOT NULL,&#xA;  new_v LONGBLOB NOT NULL,&#xA;  PRIMARY KEY (`id`)&#xA;)&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;Current value query: unchanged&lt;/p&gt;</description>
    </item>
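A sketch of how application code might keep `kv_log` in step with `kv`, using sqlite3 as a stand-in (the `put` helper and the upsert syntax are assumptions, not from the post, and this sketch allows a NULL `old_v` for the first write of a key): the audit row is written in the same transaction as the change.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (id INTEGER PRIMARY KEY, k TEXT UNIQUE, v BLOB)")
conn.execute("""CREATE TABLE kv_log (id INTEGER PRIMARY KEY,
    changed_at TIMESTAMP, k TEXT, old_v BLOB, new_v BLOB)""")

def put(conn, k, v):
    """Write a key and its audit trail row in one transaction."""
    with conn:
        row = conn.execute("SELECT v FROM kv WHERE k = ?", (k,)).fetchone()
        old_v = row[0] if row else None
        # Upsert needs SQLite 3.24+ (an assumption about the environment).
        conn.execute("INSERT INTO kv (k, v) VALUES (?, ?) "
                     "ON CONFLICT(k) DO UPDATE SET v = excluded.v", (k, v))
        conn.execute("INSERT INTO kv_log (changed_at, k, old_v, new_v) "
                     "VALUES (?, ?, ?, ?)", (time.time(), k, old_v, v))

put(conn, "plate", b"ABC123")
put(conn, "plate", b"XYZ789")  # the log now holds the old/new pair
```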
    <item>
      <title>mysql feeds</title>
      <link>https://traviscj.com/blog/post/2018-06-29-mysql_feeds/</link>
      <pubDate>Fri, 29 Jun 2018 00:00:00 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2018-06-29-mysql_feeds/</guid>
      <description>&lt;p&gt;At work, we use a pattern called &lt;em&gt;feeds&lt;/em&gt; that gets an incredible amount of work done.&#xA;I&amp;rsquo;ve been wanting to describe it here for quite a while, and now seems as good a time as any.&lt;/p&gt;&#xA;&lt;p&gt;The basic premise is: You have a service A with some data that other &amp;ldquo;consuming&amp;rdquo; services B, C, and D want to find out about.&#xA;Maybe the data is payments, maybe it&amp;rsquo;s support cases, maybe it&amp;rsquo;s password changes&amp;hellip; whatever.&#xA;The other services might include your data warehouse, some event listeners, and so on.&lt;/p&gt;</description>
    </item>
    <item>
      <title>sfpark api</title>
      <link>https://traviscj.com/blog/post/2018-04-24-sfpark-api/</link>
      <pubDate>Tue, 24 Apr 2018 16:29:39 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2018-04-24-sfpark-api/</guid>
      <description>&lt;p&gt;I recently came to learn of the &lt;a href=&#34;http://sfpark.org/wp-content/uploads/2013/12/SFpark_API_Dec2013.pdf&#34;&gt;SFpark API&lt;/a&gt;, which lets one make queries like:&lt;/p&gt;&#xA;&lt;pre&gt;&lt;code&gt;LAT=37.787702&#xA;LONG=-122.407796&#xA;curl &amp;quot;http://api.sfpark.org/sfpark/rest/availabilityservice?lat=${LAT}&amp;amp;long=${LONG}&amp;amp;radius=0.25&amp;amp;uom=mile&amp;amp;response=json&amp;quot; | pbcopy&#xA;pbpaste | jq &#39;.AVL[] | select (.TYPE | contains(&amp;quot;OFF&amp;quot;))&#39;&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;and get a response including records like:&lt;/p&gt;&#xA;&lt;pre&gt;&lt;code&gt;{&#xA;  &amp;quot;TYPE&amp;quot;: &amp;quot;OFF&amp;quot;,&#xA;  &amp;quot;OSPID&amp;quot;: &amp;quot;950&amp;quot;,&#xA;  &amp;quot;NAME&amp;quot;: &amp;quot;Union Square Garage&amp;quot;,&#xA;  &amp;quot;DESC&amp;quot;: &amp;quot;333 Post Street&amp;quot;,&#xA;  &amp;quot;INTER&amp;quot;: &amp;quot;Geary between Stockton &amp;amp; Powell&amp;quot;,&#xA;  &amp;quot;TEL&amp;quot;: &amp;quot;(415) 397-0631&amp;quot;,&#xA;  &amp;quot;OPHRS&amp;quot;: {&#xA;    &amp;quot;OPS&amp;quot;: {&#xA;      &amp;quot;FROM&amp;quot;: &amp;quot;7 Days/Wk&amp;quot;,&#xA;      &amp;quot;BEG&amp;quot;: &amp;quot;24 Hrs/Day&amp;quot;&#xA;    }&#xA;  },&#xA;  &amp;quot;OCC&amp;quot;: &amp;quot;381&amp;quot;,&#xA;  &amp;quot;OPER&amp;quot;: &amp;quot;670&amp;quot;,&#xA;  &amp;quot;PTS&amp;quot;: &amp;quot;1&amp;quot;,&#xA;  &amp;quot;LOC&amp;quot;: &amp;quot;-122.407447946,37.7876789151&amp;quot;&#xA;}&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;Pretty cool!&lt;/p&gt;</description>
    </item>
    <item>
      <title>python attrs</title>
      <link>https://traviscj.com/blog/post/2017-08-31-python-attrs/</link>
      <pubDate>Thu, 31 Aug 2017 08:42:43 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2017-08-31-python-attrs/</guid>
      <description>&lt;p&gt;I came across a very interesting library in a &lt;a href=&#34;https://news.ycombinator.com/item?id=15131981&#34;&gt;HN thread&lt;/a&gt;: the python &lt;a href=&#34;http://www.attrs.org/en/stable/examples.html&#34;&gt;attrs&lt;/a&gt; library.&lt;/p&gt;&#xA;&lt;p&gt;In particular, this seems like a great way to do the &amp;ldquo;dumb data objects&amp;rdquo; they talk about in &lt;a href=&#34;https://www.youtube.com/watch?v=3MNVP9-hglc&#34;&gt;the end of object inheritance&lt;/a&gt;, and also related to (but maybe lighter weight than) &lt;a href=&#34;https://zopeinterface.readthedocs.io/en/latest/README.html&#34;&gt;zope.interface&lt;/a&gt;.&lt;/p&gt;&#xA;&lt;p&gt;This also seems very similar to what I use &lt;a href=&#34;https://github.com/google/auto&#34;&gt;autovalue&lt;/a&gt; for at work.&lt;/p&gt;&#xA;&lt;p&gt;One particularly interesting application is a &amp;ldquo;code database&amp;rdquo; &amp;ndash; using static, checked-in-to-version-control definitions of some data model as a sort of very-fast-to-read, very-slow-to-update &amp;ldquo;Data Model&amp;rdquo;.&#xA;I find this fascinating:&#xA;Code shares a lot of properties with great data stores: ability to rollback (&lt;code&gt;git revert&lt;/code&gt;) and accountability/auditability (&lt;code&gt;git blame&lt;/code&gt;).&#xA;It also makes a lot of fairly hard problems much simpler: you don&amp;rsquo;t need to poll the database for changes.&#xA;You don&amp;rsquo;t need to invalidate any caches.&#xA;You don&amp;rsquo;t need to consider a &amp;ldquo;split brain&amp;rdquo; environment where half of the in-memory caches have updated but the other half haven&amp;rsquo;t.&#xA;You don&amp;rsquo;t need to consider failure cases of how long the in-memory cache is allowed to be invalid: you just fail to boot up on deploy.&#xA;(Admittedly, there&amp;rsquo;s still an opportunity window for split brain behavior for the duration of the deploy, but this is a lot easier to reason about than an 
essentially arbitrary one.)&lt;/p&gt;</description>
    </item>
    <item>
      <title>toy/life data models</title>
      <link>https://traviscj.com/blog/post/2017-04-19-toy_life_data_models/</link>
      <pubDate>Wed, 19 Apr 2017 00:00:00 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2017-04-19-toy_life_data_models/</guid>
      <description>&lt;p&gt;I&amp;rsquo;ve been experimenting a lot with some kinda &amp;ldquo;toy&amp;rdquo; data models based on random things I wish there were a database to query, but there isn&amp;rsquo;t.&#xA;For example:&lt;/p&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;What was my average arrival time this week?&lt;/li&gt;&#xA;&lt;li&gt;How much of my equity has vested before a certain date?&lt;/li&gt;&#xA;&lt;li&gt;When was the last time we had spaghetti for dinner?&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;p&gt;I&amp;rsquo;ve been doing this with flat JSON files.&#xA;This is a bit of an odd choice for me; I actually love schematizing data models in protobuf and MySQL and designing proper indices for the data models I work on during work hours.&lt;/p&gt;</description>
    </item>
    <item>
      <title>filter vs spec (draft)</title>
      <link>https://traviscj.com/blog/post/2017-01-18-filter_vs_spec/</link>
      <pubDate>Wed, 18 Jan 2017 00:00:00 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2017-01-18-filter_vs_spec/</guid>
      <description>&lt;p&gt;Consider a silly data model to store data about cities like&lt;/p&gt;&#xA;&lt;pre&gt;&lt;code&gt;message City {&#xA;  optional string city_name = 1;&#xA;  optional string state = 2;&#xA;  optional int32 population = 3;&#xA;  optional int32 year_founded = 4;&#xA;  // ... presumably others :-)&#xA;}&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;and some sample data like:&lt;/p&gt;&#xA;&lt;pre&gt;&lt;code&gt;[&#xA;  {&amp;quot;city_name&amp;quot;: &amp;quot;Portland&amp;quot;, &amp;quot;state&amp;quot;: &amp;quot;OR&amp;quot;, &amp;quot;population&amp;quot;: ...},&#xA;  {&amp;quot;city_name&amp;quot;: &amp;quot;Portland&amp;quot;, &amp;quot;state&amp;quot;: &amp;quot;ME&amp;quot;, &amp;quot;population&amp;quot;: ...},&#xA;  {&amp;quot;city_name&amp;quot;: &amp;quot;Springfield&amp;quot;, &amp;quot;state&amp;quot;: &amp;quot;FL&amp;quot;, &amp;quot;population&amp;quot;: ...},&#xA;  {&amp;quot;city_name&amp;quot;: &amp;quot;Springfield&amp;quot;, &amp;quot;state&amp;quot;: &amp;quot;IL&amp;quot;, &amp;quot;population&amp;quot;: ...},&#xA;  {&amp;quot;city_name&amp;quot;: &amp;quot;Springfield&amp;quot;, &amp;quot;state&amp;quot;: &amp;quot;CO&amp;quot;, &amp;quot;population&amp;quot;: ...}&#xA;]&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;There are some useful entities we can define: (DRAFT NB: don&amp;rsquo;t read too much into the matcher vs filter lingo.)&lt;/p&gt;</description>
    </item>
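The sample data above suggests the simplest possible matcher: a spec as a dict of required field values. A hypothetical Python sketch (field names follow the proto above; the `matches` and `filter_cities` helpers are made up for illustration):

```python
# Sample data in the shape of the City message above (populations elided).
cities = [
    {"city_name": "Portland", "state": "OR"},
    {"city_name": "Portland", "state": "ME"},
    {"city_name": "Springfield", "state": "FL"},
    {"city_name": "Springfield", "state": "IL"},
    {"city_name": "Springfield", "state": "CO"},
]

def matches(city, spec):
    """A spec is a dict of required field values; all must match."""
    return all(city.get(k) == v for k, v in spec.items())

def filter_cities(cities, spec):
    return [c for c in cities if matches(c, spec)]

springfields = filter_cities(cities, {"city_name": "Springfield"})
```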
    <item>
      <title>streaks vs statistical streaks</title>
      <link>https://traviscj.com/blog/post/2016-09-26-streaks_vs_statistical_streaks/</link>
      <pubDate>Mon, 26 Sep 2016 00:00:00 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2016-09-26-streaks_vs_statistical_streaks/</guid>
      <description>&lt;p&gt;&lt;a href=&#34;https://hn.algolia.com/?query=streaks&amp;amp;sort=byPopularity&amp;amp;prefix&amp;amp;page=0&amp;amp;dateRange=all&amp;amp;type=story&#34;&gt;Hacker News et al are obsessed with streaks&lt;/a&gt;, but I think they have some problems:&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;&#xA;&lt;p&gt;A single regression resets to zero.&lt;/p&gt;&#xA;&lt;/li&gt;&#xA;&lt;li&gt;&#xA;&lt;p&gt;There&amp;rsquo;s not an easy way to gradually ramp up your streak-commitment over time.&lt;/p&gt;&#xA;&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;p&gt;I prefer a different approach: statistical streaks.&lt;/p&gt;&#xA;&lt;p&gt;Suppose I made a commitment to do something differently on 2016-08-26, and did it for the next 5 days; then my 30-day statistical streak avg = 0.166, but my 5-day statistical streak avg = 1.0.&lt;/p&gt;</description>
    </item>
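The closing example can be computed directly; a small Python sketch (the `statistical_streak` helper is hypothetical) treating each day as 1 for done and 0 for missed:

```python
def statistical_streak(days_done, window):
    """Fraction of the trailing `window` days marked done (0/1)."""
    return sum(days_done[-window:]) / window

# 30 days of history: the commitment started 5 days ago and held since.
history = [0] * 25 + [1] * 5
five_day = statistical_streak(history, 5)     # 1.0
thirty_day = statistical_streak(history, 30)  # 5/30, the 0.166 in the example
```

A single miss only dents the average instead of zeroing a run length, which is the whole point.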
    <item>
      <title>build_json.sh</title>
      <link>https://traviscj.com/blog/post/2015-08-19-build_json.sh/</link>
      <pubDate>Wed, 19 Aug 2015 00:00:00 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2015-08-19-build_json.sh/</guid>
      <description>&lt;p&gt;This might seem silly, but I&amp;rsquo;ve been playing with some&#xA;&lt;a href=&#34;http://traviscj.com/ZeroBin/?1d4c5e66662c6306#V91+G7w0NYxN4ui/sDPBivPA8Fo5PzB7mZHAPboau7U=&#34;&gt;json.sh&lt;/a&gt; scripts&#xA;that build legitimate json bodies and are easily filled into a shell script variable as needed.&lt;/p&gt;&#xA;&lt;p&gt;The basic driving idea was that there are lots of slick ways to pull data &lt;em&gt;out&lt;/em&gt; of JSON (either by programming something with python&amp;rsquo;s json or running a command line tool like jq or whatever), but not as many friendly ways to build some JSON out of a given token.&#xA;Often, you have a list of identifiers and you need to build a bunch of JSON blobs from that list.&lt;/p&gt;</description>
    </item>
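The same list-of-identifiers-to-JSON-bodies idea, sketched in Python rather than shell (the `token`/`action` field names and the sample identifiers are hypothetical):

```python
import json

# Turn a list of identifiers into well-formed JSON request bodies.
ids = ["tok_123", "tok_456"]
bodies = [json.dumps({"token": t, "action": "lookup"}) for t in ids]
# Each element is a complete JSON object string, safe to drop into a
# shell variable or a request payload without hand-quoting.
```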
    <item>
      <title>logging</title>
      <link>https://traviscj.com/blog/post/2014-09-26-logging/</link>
      <pubDate>Fri, 26 Sep 2014 00:00:00 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2014-09-26-logging/</guid>
      <description>&lt;p&gt;In grad school, I spent a lot of time writing code that read output from nonlinear optimization solvers, and tried&#xA;to do useful things with it.&#xA;A much better way to do that is called &amp;ldquo;structured logging&amp;rdquo;, an idea I experimented with a bit back then.&#xA;It has also been coming up in my working life, so I wanted to delve into it a bit deeper.&#xA;For a quick introduction, check out &lt;a href=&#34;http://gregoryszorc.com/blog/category/logging/&#34;&gt;Thoughts on Logging&lt;/a&gt;.&#xA;For a much longer introduction, see &lt;a href=&#34;http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying&#34;&gt;The Log: What every software engineer should know about real-time data&amp;rsquo;s unifying abstraction&lt;/a&gt;.&lt;/p&gt;</description>
    </item>
    <item>
      <title>org-mode emacs to track mileage</title>
      <link>https://traviscj.com/blog/post/2013-05-15-org-mode_emacs_to_track_mileage/</link>
      <pubDate>Wed, 15 May 2013 00:00:00 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2013-05-15-org-mode_emacs_to_track_mileage/</guid>
      <description>&lt;p&gt;I&amp;rsquo;ve been trying out emacs org-mode lately for keeping track of everything.&#xA;One thing that seemed worth tracking is the mileage the new Focus gets.&#xA;Turns out that org-mode supports a kind of spreadsheet, so I made a quick little video demo on how I use it:&lt;/p&gt;&#xA;&lt;!-- raw HTML omitted --&gt;</description>
    </item>
    <item>
      <title>implementation of set operations</title>
      <link>https://traviscj.com/blog/post/2013-03-13-implementation_of_set_operations/</link>
      <pubDate>Wed, 13 Mar 2013 00:00:00 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2013-03-13-implementation_of_set_operations/</guid>
      <description>&lt;p&gt;We got in a bit of a debate yesterday in the office over the implementation of associative containers, which I thought was pretty fun.&#xA;We made up the big chart of complexity results you see below.&lt;/p&gt;&#xA;&lt;h2 id=&#34;nomenclature&#34;&gt;nomenclature:&lt;/h2&gt;&#xA;&lt;ul&gt;&#xA;&lt;li&gt;$S$, $S_1$, and $S_2$ are subsets of $\Omega$.&lt;/li&gt;&#xA;&lt;li&gt;Denote an element by $e\in\Omega$.&lt;/li&gt;&#xA;&lt;li&gt;$n$,$n_1$,$n_2$,$N$ are the sizes of the set $S$, $S_1$, $S_2$, and $\Omega$, respectively, and $n_1 \geq n_2$.&lt;/li&gt;&#xA;&lt;/ul&gt;&#xA;&lt;h2 id=&#34;complexity&#34;&gt;Complexity&lt;/h2&gt;&#xA;&lt;table&gt;&#xA;  &lt;thead&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;th&gt;Operation\Approach&lt;/th&gt;&#xA;          &lt;th&gt;Hash Table&lt;/th&gt;&#xA;          &lt;th&gt;Hash Tree&lt;/th&gt;&#xA;          &lt;th&gt;Binary List&lt;/th&gt;&#xA;          &lt;th&gt;Entry List (sorted)&lt;/th&gt;&#xA;          &lt;th&gt;Entry List (unsorted)&lt;/th&gt;&#xA;          &lt;th&gt;&lt;/th&gt;&#xA;      &lt;/tr&gt;&#xA;  &lt;/thead&gt;&#xA;  &lt;tbody&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;$e \in S&#x9;&#x9;  $&lt;/td&gt;&#xA;          &lt;td&gt;$O(1)   &#x9;&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;$O(log(n))&#x9;&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;$O(1)&#x9;&#x9;&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;$O(log(n))&#x9;&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;$O(n)&#x9;&#x9;&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;$S_1 \cup S_2    $&lt;/td&gt;&#xA;          &lt;td&gt;$O(n_1+n_2)&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;$O(n_1+n_2)&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;$O(N)&#x9;&#x9;&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;$O(n_1+n_2)&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;$O(n_1n_2)&#x9;&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          
&lt;td&gt;$S_1 \cap S_2    $&lt;/td&gt;&#xA;          &lt;td&gt;$O(n_1)&#x9;&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;$O(log(n_1)n_2)$&lt;/td&gt;&#xA;          &lt;td&gt;$O(N)&#x9;&#x9;&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;$O(n_2)&#x9;&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;$O(n_1n_2)&#x9;&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;      &lt;tr&gt;&#xA;          &lt;td&gt;space complexity&lt;/td&gt;&#xA;          &lt;td&gt;$O(n)&#x9;&#x9;&#x9;$&lt;/td&gt;&#xA;          &lt;td&gt;$O(n)          $&lt;/td&gt;&#xA;          &lt;td&gt;$O(N)$ bits.&lt;/td&gt;&#xA;          &lt;td&gt;$O(n)          $&lt;/td&gt;&#xA;          &lt;td&gt;$O(n)          $&lt;/td&gt;&#xA;          &lt;td&gt;&lt;/td&gt;&#xA;      &lt;/tr&gt;&#xA;  &lt;/tbody&gt;&#xA;&lt;/table&gt;&#xA;&lt;p&gt;As I said&amp;ndash;this was just what came out of my memory of an informal discussion, so I make no guarantees that any of it is correct.&#xA;Let me know if you spot something wrong!&#xA;We used the examples  $S_1 = {1,2,3,4,5}$ and $S_2 = {500000}$ to think through some things.&lt;/p&gt;</description>
    </item>
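The hash-table column above is easy to sanity-check with Python sets, using the example sets from the discussion:

```python
# The example sets from the discussion above.
S1 = {1, 2, 3, 4, 5}
S2 = {500000}

union = S1.union(S2)          # hashing: expected O(n_1 + n_2)
inter = S1.intersection(S2)   # CPython iterates the smaller operand
# Membership is an expected O(1) hash lookup: 500000 in S2 touches one bucket.
```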
    <item>
      <title>Whack it with an X squared!</title>
      <link>https://traviscj.com/blog/post/2008-08-07-whack_it_with_an_x_squared/</link>
      <pubDate>Thu, 07 Aug 2008 00:00:00 +0000</pubDate>
      <guid>https://traviscj.com/blog/post/2008-08-07-whack_it_with_an_x_squared/</guid>
      <description>&lt;p&gt;David and I were working on our Math381 model, and I was getting frustrated because the data we collected and the results from the simulation were not lining up properly. We were hoping to see something like this:&lt;/p&gt;&#xA;&lt;p&gt;Number of Logins from Data&lt;/p&gt;&#xA;&lt;p&gt;Instead, we were getting stuff distributed like this:&lt;/p&gt;&#xA;&lt;p&gt;Simulated Number of Logins&lt;/p&gt;&#xA;&lt;p&gt;I realized that we needed some function to force a bunch of this junk further left. Recalling an old adage from Mr. Cone’s AP Chemistry class, I decided it was the right time to whack it with an X squared. This is vaguely appropriate, because rand() has a range [0,1), so squaring it should put a whole bunch of stuff further left, but not everything uniformly (i.e., the first half will end up in the first quarter, the first 3/4 will end up in the first 9/16, etc). Imagine my shock when I saw this:&lt;/p&gt;</description>
    </item>
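The squaring trick is easy to check numerically; a small Python sketch confirming that for uniform u on [0,1), half of the squared values land below 0.25:

```python
import random

random.seed(0)  # deterministic for the sketch
squared = [random.random() ** 2 for _ in range(10000)]
# For uniform u on [0,1), P(u*u below 0.25) = P(u below 0.5) = 0.5,
# versus 0.25 before squaring: the mass shifts toward zero.
above = sum(s > 0.25 for s in squared)
frac_below_quarter = 1 - above / len(squared)
```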
  </channel>
</rss>
