Spark Summit East 2016 has come and gone. Technical issues hampered the training courses on the first day, but the next couple of days offered a lot of really good information and great discussion with the attendees. After a few days of thought, three main areas stand out to me.
In the morning set of keynotes on the second day of the conference, Databricks CTO Matei Zaharia gave a quick walkthrough of some of the key highlights of the upcoming Spark 2.0 release. He covers it well, and it's worth spending a few minutes hearing him review it rather than having me attempt to explain it.
The speed improvements coming out of the rapid development of Project Tungsten are definitely cool to see, and faster is never a bad thing. The increased focus on Datasets going forward is very interesting too. What caught my attention most, though, was the streaming component of the platform, called Structured Streaming. It layers new work on top of the strengths of Spark Streaming, Spark SQL, and Datasets to create a framework for developing what Matei refers to as continuous applications. It's great to see this happening, because demand has been growing for the ability to apply batch processing logic and other advanced tasks (machine learning, ad-hoc analytics, etc.) to live streams, and nobody wants to build separate ETL processes (and redundant data silos) to accomplish all of that. Other projects such as Apache Beam, along with the toolsets being developed at legacy enterprise software vendors, show that this concept, and the demand for it, isn't new; it's great to see it become a core part of the Spark platform going forward. This matters because the recent shift toward real-time applications means the operating nature of the underlying platforms needs to shift with it, and Structured Streaming shows a major focus on supporting that.
The idea that nobody can go it alone is especially true in the software space. Given the ecosystem (and the enthusiasm) that has been built around Spark, the variety of partners on display across the sessions was no surprise. From a company like Arimo leveraging Spark to create a distributed TensorFlow environment, all the way to university students in South Korea partnering with SK Telecom on an open library extending ggplot2 to Spark-based big data workloads, everyone is (somewhat predictably) taking the technology to all sorts of cool places.
What made me take note even more than the technical advancements themselves is the enterprise support the technology is receiving, which brings me to the next part.
Enterprise Readiness and Adoption
To kick things off, keynote presentations from companies such as IBM, SAP, and Capital One indicate that the large software vendors and large enterprises are getting on board the Spark train. It was also interesting to see what these companies are doing with Spark, but the one major concern I have about Spark in the enterprise seems to be widely shared: governance and compliance capabilities. The need is apparent: control over the accessibility of information is still not where it needs to be. Bloomberg showed an interesting solution in one session, but in my opinion, if Spark is really to become a core component of enterprise architectures, these functions have to start being built into the platform itself rather than left for each enterprise to build on its own. That will be critical for enterprise adoption to break out of the tactical use case approach. Tactical uses are important (Netflix's time travel is a good example), but they tend to be targeted deployments rather than the broader enterprise platform that, judging from my conversations with attendees, is what companies actually want.
Another piece of the enterprise equation is infrastructure and deployment. Hadoop and Spark environments can be complex and expensive to build and maintain, especially as data sizes and processing complexity increase. A number of cloud vendors are working to address this so that data scientists and analysts can stay in the business of analyzing data rather than becoming Hadoop administrators; Google Cloud Platform's Dataproc offering is one example.