February 2015

Microservices allow for localized tech debt

At my current company we employ a microservice-style cloud architecture. One of the side benefits of this approach has been our ability to localize our tech debt into sizeable chunks. When you’re a startup you need to be able to build fast and fail fast if you need to pivot. Sometimes perfect is the enemy of building a company, and sometimes it is not. This is where microservices can help mitigate long-term issues: they let you apply perfection to the pieces where getting it wrong would block your growth, and let tech debt pool in places where you can address it in a reasonable manner down the road. That essentially allows tech debt to truly behave like financial debt. It’s much easier mentally to tackle $10,000 of debt spread across 4 credit cards at $2,500 each than on 1 card at the full $10,000; it lets you apply a sharper, more targeted focus with more immediate gratification.

Let’s level-set on the phrase “tech debt” for the purposes of this article: “You want to make a ‘quick change’ to your software of a kind you mean to be able to, and it isn’t quick. Whatever made that happen, that’s tech debt.” – Dave Diehl

When building a microservice architecture at a startup, you need to identify the core services that must be designed and well structured to allow for future growth, because getting those wrong will hurt your chances of moving at a fast, safe pace. At first glance this might seem easy: “We’re going to be a huge consumer play and will have a billion users,” so you go off and build a complex database sharding layer from scratch, because hey, you’re going to grow too big for a single master DB.

That, however, is highly dangerous. Startups are fluid organisms, and assumptions made on day 1 can quickly be invalidated in practice. At a previous startup, where the founding team came from an extremely large-scale web property and scalability was baked into their DNA, the decision on how we’d store user information for our new product quickly devolved into a data center order of half a dozen DB-class master boxes with replicas, while the dev team spent weeks of effort on a user sharding layer. In the end the product didn’t take off, that code was ripped out in multi-night hackathons, and the DB boxes were repurposed for other projects, leaving one teeny little box to handle the load. All of that speculative investing was missed opportunity cost that could have gone toward features the users actually wanted. Perhaps instead we should have just stubbed out the interfaces and APIs so that all of the sharding/routing logic was localized to a set of client methods, and in the future those methods could be fully implemented or backed by a remote service.
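To make that concrete, here’s a minimal sketch in Go of what stubbing the interface looks like. All names here are hypothetical, not code from either company; the point is that callers only ever see UserStore, so a sharded or remote-service implementation can be dropped in later without touching them.

  package user

  import "database/sql"

  // UserStore is the only contract callers depend on. No sharding or
  // routing details leak out, so the implementation can change later.
  type UserStore interface {
      Get(id string) (*User, error)
      Save(u *User) error
  }

  type User struct {
      ID    string
      Email string
  }

  // singleDBStore is the day-1 implementation: one plain database, no sharding.
  type singleDBStore struct{ db *sql.DB }

  func (s *singleDBStore) Get(id string) (*User, error) {
      u := &User{}
      err := s.db.QueryRow(`SELECT id, email FROM users WHERE id = ?`, id).
          Scan(&u.ID, &u.Email)
      return u, err
  }

  func (s *singleDBStore) Save(u *User) error {
      _, err := s.db.Exec(`INSERT INTO users (id, email) VALUES (?, ?)`, u.ID, u.Email)
      return err
  }

  // Later, a shardedStore (or a client for a remote user service) can satisfy
  // the same UserStore interface, and no caller has to change.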

Proper technical debt assessment means making known tradeoffs. If you don’t even know the tradeoffs you’re making, that’s not so much technical debt as poor engineering and planning. As long as you focus on getting your APIs and interfaces modeled out, you should be able to refactor the details behind them if the service turns out to be core to your platform. Creating smaller, standalone microservices allows you to iteratively build scalable, well-tested, easily digested services. At CrowdStrike we wrote some of our initial services in Python for quick prototyping, focusing on the API interactions with those services and less on the scalability of the services themselves. For one particular service, as we ramped up, we had 10 of those Python boxes running with questionable stability, so we refactored the service in Go (golang) and got down to 2 boxes. We actually only needed 1 box but kept 2 for redundancy. The API did not change whatsoever, and clients happily connected and were serviced as before.

We took on tech debt to test our hypothesis that those services would be critical down the road. It turned out they were, and we were able to swap in highly performant, stable, tested code and pay off our technical debt. $2,500 less debt! Now time to tackle that AMEX bill 🙂

If those sound like fun problems to work on, join the mission at CrowdStrike: http://www.crowdstrike.com/senior-software-engineer/

 


 

Cassandra’s DateTieredCompactionStrategy does not work as billed

UPDATE: While DTCS at first seemed like a workable solution, it broke down completely once I ran it under real-world workloads while needing to scale up the cluster. I recommend looking at TWCS (TimeWindowCompactionStrategy) by Jeff Jirsa of CrowdStrike; it’s what DTCS should have been. Reference: https://issues.apache.org/jira/browse/CASSANDRA-9666
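If you do go the TWCS route, here’s a hedged sketch of what the compaction clause looks like in Cassandra releases that bundle TWCS (the option names below are from the merged-in strategy; the 1-hour window is illustrative, not a recommendation from my testing):

  ALTER TABLE ttl_test WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'HOURS',
    'compaction_window_size': '1'
  };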

 

I’ve been looking at throwing Cassandra at a use case I had come up with: storing billions of items a day but only needing to keep that data for a couple of weeks on a rolling window. I was obviously nervous about having to TTL out that large a data volume on a rolling window, basically write(2TB), delete(2TB) in the same day. Hunting through the latest C* docs, I came across a gem recently released in 2.0.11 and contributed by Spotify: DateTieredCompactionStrategy. This strategy is great if you want a table that lives on a rolling time window, for use cases like time series and analytics where every write has the same TTL and arrives in a forward-only manner, meaning you’re not backfilling data later.

The feature that really interested me is that DTCS can look at an SSTable, determine the whole table is past expiry, and simply delete the file (effectively rm *foo-Data.db) instead of spending CPU cycles merging and dropping individual rows. That should make large-scale TTL’ing much less I/O- and CPU-intensive. I wanted to see if it lived up to the hype, so I set up a cluster of 3 m3.2xlarge machines and created brand new keyspaces and a new table (defined later in the article). I fired up another machine to act as a writer and tuned it to write 10,000 events per second to the new cluster in a continuous stream for 12-24 hours. Here’s what I set up:

  • 3-node DSE 4.6.0 test cluster with Cassandra 2.0.11
  • created a new table with DateTieredCompactionStrategy
  • inserter written in golang using gocql, sustaining 10,000 inserts per second at a load of 3 on an 8-core box (a sketch of this kind of loader follows the list)
  • each insert had a 30-minute TTL via “USING TTL 1800”
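Here’s a minimal sketch of that kind of loader, assuming the gocql driver. The host names, keyspace, and payload values are placeholders, not the actual test harness:

  package main

  import (
      "log"
      "time"

      "github.com/gocql/gocql"
  )

  func main() {
      cluster := gocql.NewCluster("node1", "node2", "node3") // placeholder hosts
      cluster.Keyspace = "ttl_test_ks"                       // hypothetical keyspace
      session, err := cluster.CreateSession()
      if err != nil {
          log.Fatal(err)
      }
      defer session.Close()

      // Pace writes at ~10,000/second: 10 inserts every millisecond.
      tick := time.NewTicker(time.Millisecond)
      defer tick.Stop()
      for range tick.C {
          for i := 0; i < 10; i++ {
              // Every row expires after 30 minutes via USING TTL 1800.
              if err := session.Query(
                  `INSERT INTO ttl_test (id, metatype, event_time, rawbytes)
                   VALUES (?, ?, ?, ?) USING TTL 1800`,
                  gocql.TimeUUID().String(), "event", time.Now(), []byte("payload"),
              ).Exec(); err != nil {
                  log.Println("insert failed:", err)
              }
          }
      }
  }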

I started the test hoping to see a sawtooth pattern in the number of SSTables on disk and the disk space consumed by that column family. Instead, to my shock, it just kept going up and up all day. I scoured the docs but came up empty. DataStax was kind enough to point out it might be related to “tombstone_compaction_interval”, which defaults to 1 day; that basically means an SSTable won’t be a candidate for deletion until it’s older than that interval. Once I changed that setting to 1, everything worked like a champ.
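The change itself is a single table property. Here’s a sketch of the ALTER against the ttl_test table defined at the end of this post (note that setting compaction replaces the whole option map, so carry over any other options you rely on):

  ALTER TABLE ttl_test WITH compaction = {
    'class': 'DateTieredCompactionStrategy',
    'tombstone_compaction_interval': '1'
  };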

[Screenshot 2015-01-30 16.01.30: disk space consumed climbing continuously, before the interval change]

After the change:

[Screenshot 2015-01-30 16.02.13: disk space consumed after the change]

You can also see that DTCS produces a nice sawtooth in the SSTable count.

[Screenshot 2015-01-30 16.04.02: SSTable count sawtooth under DTCS]

 

For comparison, below is the exact same test with LeveledCompaction instead of DateTiered. You can see the data volumes just continue to grow.

[Screenshot 2015-02-03 08.05.18: LeveledCompaction test, data volume growing]

[Screenshot 2015-02-03 08.08.26: LeveledCompaction test, data volume still growing]

After 24 hours the LeveledCompaction table has 10x the data on disk on a single node. Sadness.

[Screenshot 2015-02-03 22.13.32: 10x on-disk data after 24 hours of LeveledCompaction]

 

UPDATE:

30 hours later, the cluster running the LeveledCompaction 30-minute-TTL inserts died a horrible death. It filled up all available disk space on a couple of nodes and started crashing the Cassandra process. </endofcluster>

[Screenshot 2015-02-04 09.05.01: nodes out of disk space, Cassandra process crashing]

If you want to come work on high-scale distributed systems, we’re hiring!

http://www.crowdstrike.com/senior-software-engineer/

 

Here is a CREATE TABLE script if you’re interested in running the test on your own hardware.

CREATE TABLE IF NOT EXISTS ttl_test (
  id text,
  metatype text,
  event_time timestamp,
  rawbytes blob,
  PRIMARY KEY ((id, metatype), event_time)
) WITH
  CLUSTERING ORDER BY (event_time DESC) AND
  gc_grace_seconds = 0 AND
  compaction={'class': 'DateTieredCompactionStrategy', 'tombstone_compaction_interval': '1', 'tombstone_threshold': '.01', 'timestamp_resolution': 'MICROSECONDS', 'base_time_seconds': '3600', 'max_sstable_age_days': '365'} AND
  compression={'sstable_compression': 'LZ4Compressor'};
And the gocql insert statement the loader ran (this version targets the ttl_leveled table):

  // Every write carries a 30-minute TTL; error handling elided for brevity.
  session.Query(`INSERT INTO ttl_leveled (id, metatype, event_time, rawbytes) VALUES (?, ?, ?, ?) USING TTL 1800`,
      randomId, metaType, eventTime, rawBytes).Exec()

Leveled Compaction Table Create:

CREATE TABLE ttl_leveled (
  id text,
  metatype text,
  event_time timestamp,
  rawbytes blob,
  PRIMARY KEY ((id, metatype), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC) AND
  bloom_filter_fp_chance=0.100000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.100000 AND
  gc_grace_seconds=0 AND
  read_repair_chance=0.000000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  compaction={'tombstone_threshold': '.01', 'tombstone_compaction_interval': '1', 'class': 'LeveledCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};
