My public coding obsessions fall into tidy categories of data storage, programming productivity, inductive programming scalability, experiments / toys / fun, and other repos I fork and edit.
This page describes only released projects. I have a dozen more projects hiding in my systems with thousands of lines of code in various states of disrepair waiting to be tidied up for release one day.
(This page hasn’t been updated in a while; for more recent activity, see my github page, mattsta.)
Data Storage

Data storage problems keep me up at night. Here are my many attempts at fixing, prototyping, repairing, or reconciling my twisted views of how data and the world should coexist.
car is an attempt at providing large deduplicating storage based on content. For example, if your user community uploads cat_rainbow.png 65,000 times, it only gets stored once and you have 65,000 references to the same data. car is built on a mix of my own ideas and early Camlistore architecture documents. car is written in erlang, uses statebox to maintain CAP stability, and relies on Riak to store the nodes (pointers to data), the data itself, and indexes into the nodes so we can find data based on multiple criteria.
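The core trick is easy to sketch (in Python here for brevity; the names are illustrative, and car itself is erlang on Riak): hash the content, store the bytes once per hash, and hand back the hash as the pointer.

```python
import hashlib

class ContentStore:
    """Toy content-addressed store: identical blobs are stored once."""
    def __init__(self):
        self.blobs = {}  # sha256 digest -> bytes (the actual data)
        self.refs = {}   # sha256 digest -> reference count

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blobs:
            self.blobs[digest] = data            # first copy: store it
        self.refs[digest] = self.refs.get(digest, 0) + 1
        return digest                            # the pointer callers keep

    def get(self, digest: str) -> bytes:
        return self.blobs[digest]

store = ContentStore()
first = store.put(b"cat_rainbow.png bytes")
second = store.put(b"cat_rainbow.png bytes")     # a duplicate upload is free
```

The real system also has to index the pointers so data is findable by multiple criteria, which is what the Riak-backed node storage handles.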
redisfuse mounts a redis instance as a local filesystem. Why does this exist? I was creating a self-hosting website creation service (you would create the website within the website itself), and I wanted a way to edit the hosted website outside of the website editor itself. So I created a filesystem to mount redis locally, letting me edit things in redis using vim directly. redisfuse is written in python.
ghost and nose form a redis-backed hierarchical object storage system. What good is hierarchical storage? Well, we can use it to make threaded discussion systems, trees of users, or other things containing other things containing other things.
ghost and nose are written in lisp using lfe (lisp flavored erlang) and compile down to erlang bytecode.
chatty was my attempt at writing a threaded discussion storage service. All actual comments are stored in riak, while the comment topology is stored in redis.
You can upvote/downvote stories and comments and get everything sorted by hot/confidence/controversy as necessary.
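The repo has the exact ranking code; “confidence” sorting is conventionally the lower bound of the Wilson score interval on the upvote fraction, which can be sketched in a few lines:

```python
from math import sqrt

def confidence(ups: int, downs: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval on the upvote fraction."""
    n = ups + downs
    if n == 0:
        return 0.0
    phat = ups / n
    return ((phat + z * z / (2 * n)
             - z * sqrt((phat * (1 - phat) + z * z / (4 * n)) / n))
            / (1 + z * z / n))

# one lonely upvote ranks below a well-voted 100-up/10-down item
ranked = sorted([(1, 0), (100, 10)], key=lambda v: confidence(*v), reverse=True)
```

The point of the interval bound is that a single upvote is weak evidence, so new items don’t leapfrog items with a proven track record.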
notifier is a generic status/update/info dissemination system, except all notifications are fully hierarchical. It’s more like a single instance AMQP setup with a persistent backing store instead of delete-on-consume behavior. You ‘tell’ a key (e.g. ‘users.signups.thursday’) something happened (e.g. ‘signed-up firstname.lastname@example.org’), then all subscribers to ‘users’, ‘users.signups’, and ‘users.signups.thursday’ get notified ‘signed-up firstname.lastname@example.org’. All events are stored in redis.
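The fan-out rule is simple enough to sketch (Python, illustrative names; notifier itself stores everything in redis): split the key on dots and notify every prefix.

```python
class Notifier:
    """Toy hierarchical pub/sub: telling 'a.b.c' notifies subscribers
    of 'a', 'a.b', and 'a.b.c'."""
    def __init__(self):
        self.subscribers = {}  # key -> list of callback functions

    def subscribe(self, key, callback):
        self.subscribers.setdefault(key, []).append(callback)

    def tell(self, key, event):
        parts = key.split(".")
        # notify every prefix of the key: 'users', 'users.signups', ...
        for i in range(1, len(parts) + 1):
            prefix = ".".join(parts[:i])
            for cb in self.subscribers.get(prefix, []):
                cb(event)

n = Notifier()
seen = []
n.subscribe("users", seen.append)
n.subscribe("users.signups", seen.append)
n.tell("users.signups.thursday", "signed-up someone@example.org")
```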
tt is one of my Bigtable clones, written in erlang and backed by tokyo cabinet/tokyo tyrant. tiny table supports fully three-dimensional data like Bigtable (all data is stored as (Key, Column, 64-bit Timestamp)). I used tiny table in a few short-lived projects in 2009.
er is my redis client for erlang, written in lisp. It returns erlang-appropriate return values (example: if you incr a key, you get back an integer, not a binary) and erlang-appropriate errors. The client isn’t updated in lockstep with redis feature releases, but it works well enough for my needs. If a command isn’t natively exported by er, you can still call it directly, so you aren’t locked into only the commands explicitly exposed to erlang.
pcache was one of the first significant applications I wrote in erlang. It’s a caching server where each object you store lives in its own erlang process. If you store objects with ttls, each ttl is maintained by using the ‘receive … after TTL -> die’ semantics of the erlang receive statement.
ecache is my attempt at fixing some deficiencies in pcache. ecache persists data in ets, and ets has a nice compression option, so your cache can transparently compress and decompress your data for you. If you’re using the ttl option on an ecache server, the ttl is still maintained using erlang’s internal ‘receive…after’ semantics too. I’ve been using ecache in production environments since I wrote it and it works great.
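A minimal sketch of the per-object ttl idea (in Python, with invented names; ecache really uses one erlang process per object with its ‘receive … after’ timeout, while this sketch just expires entries lazily on read):

```python
import time

class TTLCache:
    """Toy per-object TTL cache that expires entries lazily on read."""
    def __init__(self):
        self.store = {}  # key -> (value, expires_at or None)

    def put(self, key, value, ttl=None):
        expires = time.monotonic() + ttl if ttl is not None else None
        self.store[key] = (value, expires)

    def get(self, key, default=None):
        if key not in self.store:
            return default
        value, expires = self.store[key]
        if expires is not None and time.monotonic() >= expires:
            del self.store[key]          # past its ttl: drop and miss
            return default
        return value

cache = TTLCache()
cache.put("greeting", "hello", ttl=0.05)
hit = cache.get("greeting")
time.sleep(0.06)
miss = cache.get("greeting")
```

The process-per-object design gets expiry for free from the runtime; the timestamp check above is the boring single-threaded equivalent.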
weighter is an overly complicated way of managing user karma values. It has settings to optionally auto-decay weights after time, to inherit karma from a parent (say, if on signup you get 1/16th your parents karma automatically), and for giving your karma to another user in an atomic fashion. All weights and karma details are stored in redis.
allegiance is a way of storing membership in capped collections. It was created to store course membership for an online learning site, but was then extended to allow anything to be stored in anything else. We call the top level containers ‘bottles.’ Things get put into bottles. Bottles can have a maximum size if requested, you can generate access tokens for bottles so only people with an invite key can join, and you can ask a bottle if something is a member of it. All bottles and membership information is stored in redis.
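A bottle boils down to a capped set plus optional invite tokens; a toy sketch (Python, illustrative names; allegiance itself keeps everything in redis):

```python
import secrets

class Bottle:
    """Toy capped membership container ('bottle'), optionally invite-only."""
    def __init__(self, max_size=None, invite_only=False):
        self.max_size = max_size
        self.invite_only = invite_only
        self.members = set()
        self.tokens = set()

    def make_token(self) -> str:
        token = secrets.token_hex(8)     # an invite key for joining
        self.tokens.add(token)
        return token

    def join(self, member, token=None) -> bool:
        if self.invite_only and token not in self.tokens:
            return False                 # no valid invite key
        if self.max_size is not None and len(self.members) >= self.max_size:
            return False                 # bottle is full
        self.members.add(member)
        return True

    def __contains__(self, member) -> bool:
        return member in self.members

course = Bottle(max_size=2, invite_only=True)
key = course.make_token()
joined = course.join("alice", key)
rejected = course.join("mallory")        # no invite key
```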
beas is a simple CRUD user account system. It stores user details including, but not limited to, usernames, email addresses, passwords, subscription settings, payment history, location details, availability at certain locations, and arbitrary key/value storage per user. It’s written in lfe and stores everything in redis.
bess stores the valuable user-client-cookie-to-user-id map. You can also associate specific details with a session, like user agent, ip, referrer, or a non-cookie unique ID. It’s written in lfe and stores everything in redis. bess matches the zog_web session storage interface, so zog_web can manage sessions internally.
erlwg retrieves webpages, caches them for a configured number of seconds, then returns the cached page until the timeout expires. When the timeout expires, you still get the old cached page until the new page finishes downloading. The keep-returning-old-until-new-available behavior allows for non-blocking reads, so a slow/lagging/down site won’t block your requests for webpage data (you get old data, but old data is better than blocking. if old data isn’t better than blocking for your use case, modify the config values so you block instead).
You can also request erlwg transform all downloaded pages and store the transformed version (and return the transformed version instead of HTML). This allows you to cache modifications or transformations to the HTML without running them each time you request the cached page (e.g. convert the HTML into a tree of erlang terms so you can traverse the DOM as erlang tuples. just convert once on download then always get back your converted tree. new downloads of the webpage after cache expiration will auto-convert into a transformed version as well).
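The serve-stale behavior plus the transform hook can be sketched like this (Python, illustrative names; erlwg refreshes asynchronously, while this sketch refreshes inline and only falls back to stale data when a fetch fails):

```python
import time

class StaleCache:
    """Toy fetch cache with a transform hook. Expired entries are only
    replaced once a fresh fetch succeeds; a failing fetch serves stale."""
    def __init__(self, fetch, ttl, transform=lambda body: body):
        self.fetch, self.ttl, self.transform = fetch, ttl, transform
        self.cache = {}  # url -> (transformed value, fetched_at)

    def get(self, url):
        now = time.monotonic()
        entry = self.cache.get(url)
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]              # still fresh: cached transform
        try:
            value = self.transform(self.fetch(url))
        except Exception:
            if entry is not None:
                return entry[0]          # fetch failed: serve stale data
            raise
        self.cache[url] = (value, now)
        return value

calls = []
def fake_fetch(url):
    calls.append(url)
    return "<html>hi</html>"

cache = StaleCache(fake_fetch, ttl=60, transform=str.upper)
first = cache.get("http://example.org/")
second = cache.get("http://example.org/")   # cached: no second fetch
```

Note the transform runs once per download, not once per read, which is the whole point of caching the transformed version.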
liveconfig is a dumb timer-based file freshness checker. You tell liveconfig to watch a directory with a filename glob. When a file is detected as added/updated/deleted, your callback function gets invoked with three arguments: the Added, Updated, and Deleted filenames. Files may be simultaneously added, updated, and deleted on each run.
We aren’t using inotify or anything fancy. If you monitor a directory with too many files or a directory with too few files too often, your performance will suffer. If you check a small number of files not very often (say, once every 5 seconds), everything should stay happily in your kernel caches so you won’t be fighting disk at all.
liveconfig is how zog_web detects and provisions new configurations just by changing some files without having to restart or hup the server itself.
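The whole mechanism fits in a dozen lines (a Python sketch with invented names): glob the directory on a timer, diff mtimes against the previous scan, and report the three sets.

```python
import glob
import os
import pathlib
import tempfile

class DirWatcher:
    """Toy timer-driven freshness checker: each scan reports added,
    updated, and deleted paths matching a glob pattern."""
    def __init__(self, pattern):
        self.pattern = pattern
        self.mtimes = {}  # path -> mtime seen on the previous scan

    def scan(self):
        current = {p: os.path.getmtime(p) for p in glob.glob(self.pattern)}
        added = [p for p in current if p not in self.mtimes]
        updated = [p for p in current
                   if p in self.mtimes and current[p] != self.mtimes[p]]
        deleted = [p for p in self.mtimes if p not in current]
        self.mtimes = current
        return added, updated, deleted

# demo against a scratch directory
scratch = tempfile.mkdtemp()
watcher = DirWatcher(os.path.join(scratch, "*.conf"))
pathlib.Path(scratch, "app.conf").write_text("port = 8080\n")
added, updated, deleted = watcher.scan()
```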
cbuf is a circular buffer storing all entries in ets. I wrote it to keep a capped user scrollback log for a web based terminal thing.
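A capped scrollback is just a fixed array plus a write counter; a toy Python sketch (cbuf itself lives in ets):

```python
class CircularBuffer:
    """Toy capped scrollback: keeps only the last `size` entries."""
    def __init__(self, size):
        self.size = size
        self.entries = [None] * size
        self.count = 0  # total entries ever written

    def add(self, entry):
        self.entries[self.count % self.size] = entry  # overwrite oldest slot
        self.count += 1

    def contents(self):
        """Entries in insertion order, oldest first."""
        if self.count <= self.size:
            return self.entries[:self.count]
        start = self.count % self.size
        return self.entries[start:] + self.entries[:start]

buf = CircularBuffer(3)
for line in ["a", "b", "c", "d"]:
    buf.add(line)
```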
libgeoip-erlang is an erlang C port to read maxmind geoip databases using their C library. Geocoding an IP address with libgeoip-erlang also returns the geohash of the location. When I wrote libgeoip-erlang, it was the fastest way to get maxmind reads in erlang, but I think mochiweb has since updated its libgeoip module to fix some of its inherent performance problems.
cg is a really simple way of breaking out configs based on dev/staging/beta/production services. cg relies on the directory name where your application runs to determine which config to present to you. If your application is in “myapp-beta” you get your beta config keys (things like payment provider auth tokens). If your application is in “myapp-prod” you get your production config keys. Simple, but necessary.
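The entire trick is a suffix check on the directory name; a sketch (Python, with made-up config keys):

```python
import os

# made-up example configs keyed by environment suffix
CONFIGS = {
    "dev":  {"payment_token": "tok_dev"},
    "beta": {"payment_token": "tok_beta"},
    "prod": {"payment_token": "tok_live"},
}

def config_for(app_path):
    """Pick a config from the directory the app runs in:
    /srv/myapp-beta -> the 'beta' config."""
    dirname = os.path.basename(os.path.normpath(app_path))
    env = dirname.rsplit("-", 1)[-1]     # suffix after the last dash
    return CONFIGS.get(env, CONFIGS["dev"])

beta_cfg = config_for("/srv/myapp-beta")
prod_cfg = config_for("/srv/myapp-prod")
```

The payoff is that deploying the same code to a differently named directory changes its whole configuration, with nothing to edit.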
zog_web is an erlang web framework and web server encapsulating everything I’ve learned from developing website backends in erlang since 2006. zog_web sits on top of mochiweb and adds URL routing based on function name, easy form value extraction, easy cookie value setting and extraction, URLs requiring authentication, URLs requiring minimum access levels, and a few other nice to have features.
To see zog_web in action, check out hnf.
zog_web powers all of my live erlang websites.
stripe-erlang is my stripe.com erlang client. The Stripe API is nice and consistent, so the client was pretty easy to write. I even went out of my way to erlang-ize common return values (with types!) for the Stripe API methods the client supports. Note: not all methods are supported because I don’t have the time or the need to keep feature parity with the production API (which also changes kinda randomly without proper release notes at times. You would think api/v1 would mean the version is stable and additions would go to another api/v2 or api/v1.1, etc, but no, they keep rev’ing the API under api/v1. We can live with that).
balanced-erlang is my balancedpayments.com erlang client. Since the balanced API is a more RESTy version of the Stripe API, this application is mostly a copy of my stripe-erlang application with endpoints, methods, and parameters changed. Note: for balanced-erlang, I didn’t bother populating detailed records for every result; you can deal with fully populated proplists as necessary.
riak_pool wraps the default erlang riak client in a client pool. riak_pool operates under a registered name, so you can access your riak services by name instead of by carrying around connection references everywhere. Naming something gives you power over it.
Inductive Programming Scalability
What is inductive programming scalability? It’s when you abstract away a complicated initial condition so you can use it without knowing the inner details. You can use the complexity of something without having to worry about all the intricate details underneath. (Now, you may say “that’s what programming is all about in the first place,” but there’s a difference between encapsulating two() -> 1 + 1. and encapsulating base functionality that enables further, higher abstractions to be built at layer n+1.)
oneshot abstracts the extremely common, and extremely verbose, erlang non-blocking server pattern of listening to a TCP port, receiving data, then doing something with it.
Instead of maintaining the internal TCP state, failure conditions, and partial receive states on your own, you just tell oneshot to run a function when TCP data is received.
I use oneshot now whenever I need to listen for a TCP connection instead of setting up the networking internals by hand each time.
oneshot was created so I wasn’t copying and pasting the non-blocking TCP receiver from tt into erlang-stdinout-pool to enable its magic zero-config “stdinout pool over the network” capabilities.
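The shape of the pattern oneshot hides can be sketched in Python with socketserver (illustrative names only: you hand over a function from received bytes to reply bytes, and the listen/accept/receive plumbing disappears):

```python
import socket
import socketserver
import threading

def oneshot_serve(port, handler):
    """Toy oneshot: run `handler(bytes) -> bytes` per TCP connection."""
    class Handler(socketserver.StreamRequestHandler):
        def handle(self):
            data = self.rfile.read()       # read until client closes its side
            self.wfile.write(handler(data))

    server = socketserver.TCPServer(("127.0.0.1", port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

server = oneshot_serve(0, lambda data: data.upper())  # port 0 = ephemeral
port = server.server_address[1]

with socket.create_connection(("127.0.0.1", port)) as s:
    s.sendall(b"hello")
    s.shutdown(socket.SHUT_WR)             # signal end of request
    chunks = []
    while True:
        chunk = s.recv(1024)
        if not chunk:
            break
        chunks.append(chunk)
reply = b"".join(chunks)
server.shutdown()
server.server_close()
```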
egsf provides locking across an erlang cluster. You can pick the storage backend for your locks (by default, just ets) and how persistent locks are (if you don’t want locks to time out, give them a really high timeout value).
I created this for an online service where users had free access to a shared DB with no way to synchronize access themselves. I auto-locked certain records so users couldn’t completely break the database on their own.
racl (written in lfe) provides hierarchical ACLs based on any user levels you define. You define your ACL topology in a per-namespace acl module, like this content access example, and it gets transformed into an automatically left-inheriting permissions model: racl.lfe.
You can also allow and block people/entities/whatever-you’re-ACLing at each level as well, so it’s as flexible as you need.
It’s pretty nifty and I’ve used it in a few production-level projects.
erlang-stdinout-pool is one of my favorite applications. erlang has a fundamental flaw when dealing with external programs: it can’t close stdin. So if you want to run a program that waits for input on stdin, receives it, then writes something to stdout, you just can’t do it.
This is where stdinout-pool comes in. I wrote a tiny C port to act as a proxy to funnel data to a process of your choice. You launch a stdinout-pool with the path to something taking input on stdin, then send data to the pool. Through the magic of unix pipes, when you send data to the pool, it picks an unused idle pre-spawned process, sends your data, then closes stdin, and your output magically shows up on stdout for you.
Your data follows: erlang -> stdinout-pool C program -> your program -> stdinout-pool C program -> erlang. There’s no noticeable speed penalty for reasonable amounts of data because unix pipes are incredibly fast already.
As a bonus, you can tell your stdinout-pool to listen over the network, so you can easily spawn a network addressable pool of non-blocking things to return results for you.
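The per-request data flow is the classic write-stdin, close-stdin, read-stdout dance; in Python that’s subprocess.communicate (an illustrative sketch: the real pool keeps idle pre-spawned workers, while this spawns per call, and the example assumes a POSIX tr is on the PATH):

```python
import subprocess

def run_via_stdin(program_args, data: bytes) -> bytes:
    """One hop of the stdinout-pool flow: write the request to the
    child's stdin, close it, and collect everything from stdout."""
    proc = subprocess.Popen(program_args,
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    out, _ = proc.communicate(data)  # sends data, closes stdin, reads stdout
    return out

# e.g. feed text through the system 'tr' utility
result = run_via_stdin(["tr", "a-z", "A-Z"], b"pipes are fast\n")
```

Closing stdin is the crucial step: it’s how a filter-style program knows the request is complete and can emit its answer.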
Experiments / Toys / Fun
zlang is a programming language I wrote as an attempt at a safe “anybody can program” language.
zlang has some notable features including implicit database access (you just write “put key, value” or “get key” to get back your data), atoms, access to a global locking server (egsf), functions can be direct HTTP/REST endpoints with full access to query/post params and cookies, pattern matching, and much more.
For an example of zlang syntax, check out editor.zlang which has test cases for most language features.
zlang also has a neat scatter/gather functionality so you can ask for ten things at once, they all get fetched at once, then you get them back in order once they are all finally retrieved (ideally faster than requesting them sequentially).
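The scatter/gather behavior is easy to sketch with a thread pool (Python, illustrative names; this is the general technique, not zlang’s implementation):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def scatter_gather(fetch, keys):
    """Toy scatter/gather: issue all fetches concurrently, return results
    in the original request order once every fetch has finished."""
    with ThreadPoolExecutor(max_workers=len(keys)) as pool:
        futures = [pool.submit(fetch, k) for k in keys]  # scatter
        return [f.result() for f in futures]             # gather, in order

def slow_fetch(key):
    time.sleep(0.05)     # simulate per-request network latency
    return key * 2

# four concurrent 0.05s fetches finish in roughly one latency period,
# not four, yet the results still come back in request order
results = scatter_gather(slow_fetch, [1, 2, 3, 4])
```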
cudacam is my attempt at bundling, playing with, and extending some sample code using GPU programming with a Kinect.
It didn’t get as far as I would have liked because Life got in the way, but I’m sure I’ll revisit live GPU programming one day.
restcp is a proxy service, created as part of a job interview project, whose routes you define and update through a REST interface; the proxy then forwards your ports to backend hosts as configured. Neat in concept, and I had written things like this before for other projects (like tt), but there are probably better code bases to use in production settings.
hnf is a proxy for hacker news created for two purposes: improving HN response time and removing undesirable HN headlines from my view. Back when I wrote hnf, HN would sometimes stall for 30 seconds to multiple minutes at a time. hnf caches all retrieved pages, so some HN result was always available. HN is more responsive these days, but annoying (or blatantly self-promotional) articles still show up.
A query param holds a regex, and any headline matching the regex gets removed (or you can alter the last param in the URL to show only the removed items).
hnf parses the HN table layout by converting the HTML to a tree of erlang terms using the excellent mochiweb_html module. Check out how the HN site gets traversed starting at line 92 of hnf.erl.
hnf uses erlwg to fetch pages, cache the transformed parse tree, and re-request new pages when the cache timeout expires.
hnf runs live at http://diff.biz/
fantasy_payroll is a dumb little application front end to Intuit’s After-Taxes Take Home Pay calculator API. Give fantasy_payroll your pre-tax income and it tells you how much your taxes would vary state to state. (I found the API completely by accident just from poking around some URLs from Web Inspector. Their API is completely hosted on AWS, is completely self-describing, and has very helpful error messages. Good job, Intuit!)
rets listens to an old redis replication protocol and populates an ets table with all replicated redis data. Kinda nifty, but it doesn’t work with the updated redis replication protocol.
Other Repos I Fork and Edit

etherpad was acquired by google then subsequently released as a code dump with no support or improvements. I took their initial code dump, cleaned up some ugliness, added postgresql support, removed all hardcoded “etherpad.com” links, and added a custom installer so you didn’t have to edit ten files to get a fresh etherpad install going.
Other people have tried to “improve” etherpad over the years, but they mostly just ruined the interface and made it slower. Keep it simple.
I like spine. Not many people like spine. spine is simple. I don’t like spine’s global “fetch all objects from server” behavior though, so I added a pull-per-id function in a few places.
I was SSL benchmarking a few years ago and found massive performance discrepancies, which were later discovered to be default cipher choices (the faster servers were using simpler ciphers). During that time in my life, I added HTTP x-forwarded-for header injection to stud. It was a nice exercise, but there’s no reason to use this in production; better methods exist.
stud master did end up pulling some of my compile/ansi fixes though.
I maintain colorized console output in my branch of rebar. Upstream doesn’t seem interested in my nice colorized console output, so… their loss.
I didn’t do much here except fix rebar compile and erlang standard layout issues. It’s a pretty great application overall.
sgte is old. It pre-dates rebar. It uses a Makefile. It’s a templating engine, and a pretty nice one too. sgte is based on StringTemplate research, which was novel at the time. StringTemplate says “templates may only replace variables, include other templates, and iterate over lists/collections.” So, no erb garbage of having your entire programming environment available when writing templates.
My original erlang-based content site uses sgte for server-rendered templates. For new projects, I use mustache templates with walrus on the server or hogan.js on the client.
I added a standard CMake build system to GEPD, but I haven’t ended up using GEPD for anything. Anyway, yay CMake!