To start things off, there’s a great article on distributed systems theory and practice over at The Paper Trail. The article links to a number of great articles about the theory (not the dry and boring kind) that engineers who work with distributed systems should know. Definitely worth a read.
Also, for folks who work with NoSQL, there’s a great article over at Planet Cassandra about getting rid of the SQL mentality when working with Cassandra. Much of the article is Cassandra-specific (the consistency configuration settings, for example), but much of it generalizes to other NoSQL systems as well. That is especially true of Cassandra’s nearest cousins, Riak and Dynamo, and the advice on denormalization also carries over to systems like Redis and Hadoop itself. It’s a particularly good read if you’re still getting used to thinking about NoSQL systems.
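To make the denormalization point a little more concrete, here is a minimal sketch in plain Python. It is not taken from the Planet Cassandra article, and it does not use any Cassandra API. The two dictionaries stand in for query-specific tables, and the names `posts_by_author`, `posts_by_tag`, and `save_post` are hypothetical ones I made up for illustration. The idea is simply that you write the same record once per query you plan to serve, instead of joining at read time.

```python
from collections import defaultdict

# Two hypothetical "tables", each keyed for exactly one read query.
posts_by_author = defaultdict(list)
posts_by_tag = defaultdict(list)


def save_post(author, title, tags):
    """Write the same post into every query-specific table (denormalized)."""
    post = {"author": author, "title": title, "tags": list(tags)}
    posts_by_author[author].append(post)
    for tag in tags:
        posts_by_tag[tag].append(post)


save_post("alice", "Tuning consistency levels", ["cassandra", "ops"])
save_post("bob", "Modeling time series", ["cassandra"])

# Each read is a single key lookup; no join is needed.
cassandra_titles = [p["title"] for p in posts_by_tag["cassandra"]]
alice_titles = [p["title"] for p in posts_by_author["alice"]]
```

The trade-off is the usual one: writes do more work and data is duplicated, but every read becomes a lookup by key.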
Cloudera also shared a great article on how to count events like a data scientist. It’s an extremely thoughtful piece that uses SQL syntax as a means of doing data science, and it highlights one of the biggest challenges in using something like SQL for actual analytical work. The notion of a Lateral View in Hive was new to me, and the author’s Hive extension “Exhibit” looks extremely promising. Very informative, and well worth a read.
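If Lateral View is new to you too, here is a rough sketch of what it does when paired with Hive’s `explode`. This is plain Python rather than Hive, and it is not from the Cloudera article. The `sessions` data and the `lateral_view_explode` helper are hypothetical, made up purely for illustration. Each input row carries an array column, and the output has one row per array element joined back to its parent row. As with a plain (non-OUTER) Lateral View, rows whose array is empty produce no output.

```python
from collections import Counter

# Hypothetical session rows: a user plus an array of event names.
sessions = [
    {"user": "alice", "events": ["view", "click", "view"]},
    {"user": "bob", "events": ["view"]},
    {"user": "carol", "events": []},
]


def lateral_view_explode(rows, array_col, alias):
    """Yield one output row per array element, joined to its parent row."""
    for row in rows:
        for item in row[array_col]:
            # Copy the parent's other columns; the array column is left out
            # here only to keep the output rows small.
            out = {k: v for k, v in row.items() if k != array_col}
            out[alias] = item
            yield out


exploded = list(lateral_view_explode(sessions, "events", "event"))

# Once the arrays are flattened, counting events is an ordinary group-by.
event_counts = Counter(r["event"] for r in exploded)
```

Here `exploded` holds four rows (three for alice, one for bob, none for carol), which is the flattened shape that makes per-event counting straightforward.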
Last but not least, Gigaom ran an article earlier this month talking about what’s next for the researchers at Berkeley’s AMPLab. AMPLab is a research center at the University of California, Berkeley, and they are working on some incredible projects in the big data space. One of their projects, Spark, graduated to top-level Apache project status and has enjoyed tremendous growth and buy-in from the open source community.
The folks who brought us Spark aren’t finished, though, and they’ve got (at least) two very exciting projects still waiting in the wings. One of those projects, Tachyon, essentially takes the in-memory and computation-lineage ideas from Spark and makes them available directly to MapReduce (Spark can benefit from running on Tachyon too). Another project, Succinct, has not yet been released as an alpha, but it promises large gains in big data indexing and compression. If you’re not already paying attention to the work coming out of Berkeley’s AMPLab, you should be. The Gigaom article is a great place to start.
I’ll return next month with more “Best of Big Data” reads from across the net. If you have any favorites you’d like to share in the meantime, please let me know. Otherwise, see you next time.