
Expert Interview: Part 3 — Dr. Ellen Friedman Discusses Streaming Data Architectures and What to Look for in a Good Streaming Message Layer


With big data, not only can a shipping company keep track of the data for one ship, but the port authorities can collect and track metrics from all the ships in all the ports, all over the world.

In this Syncsort Expert Interview, Syncsort’s Paige Roberts speaks with scientist, writer, and author of numerous books on big data, Ellen Friedman. The two discuss how Hadoop fits in the industry, what other tools work well for big data streaming and batch processing, and Friedman’s latest book.

Tell me about your book.

Well, I just did! [laughs] The content I was talking about [in Part 1 and Part 2 of this interview] is kind of the heart of the book. But, there is more. The book is called “Streaming Architecture: New Designs Using Apache Kafka and MapR Streams.” It’s a book that should work well both for people who are actually the ones technically building these systems, and for people who are not. That’s the approach we take with all six of our books.

High level for the business person, and then drill down into the code for the technical person?

Right. It helps the very technical implementer because it gives them a chance to think about the basics behind what they’re doing. They don’t always have the time to do that.

We talk about why people use streaming and give a number of use cases. We talk about the stream-based architecture that I just described to you and why the messaging system is very important and how it can be used differently.

The third chapter is all about microservices … what the concept of microservices is, why that’s useful, why organizations that move to that style have seen a lot of success with it. You don’t have to do streaming, obviously, to set up microservices. Streaming is a new way to support microservices, and I think sometimes people are surprised to realize it does support them. We explain how.

The fourth chapter is called Apache Kafka, and we explain the power of Kafka, how it works, templates, some sample programs … Chapter five turns around and does the same thing with MapR Streams. Then we have a couple of chapters that just take specific use cases. One is an anomaly detection case. The book shows how to build it using a stream-based architecture, and why that could be an advantage to you.

The last use case … [laughing] I’m laughing because one of our figures has a little joke built into it, but it’s using the example of IoT (internet of things) data, looking at container shipping, just a mind-boggling scale of data to transport …

Ted Dunning used that as an example in his talk (at Strata + Hadoop World 2016).

Well, I was at Strata Singapore in December. I was on the 22nd floor of some building meeting with a customer but I was distracted and looked out the window, and I could see the container port. A huge percentage of the world’s container shipping goes through there. I’ve written about it before, but I’d never been there before. Staring out the window there looking at the scale, the sheer number of ships … it’s like your brain melts. It’s just stunning. Then you think that all those containers can be just covered with sensors that are reporting back. There are sensors on the ship. You can have an onboard cluster. You can stream data to that cluster. It can then stream data to a cluster at the port, which is maybe owned by the shipping company, so they’re tracking what’s happening with their stuff. They can send that data around the world.

Like to the port authority.

…who is interested not just in that one company at that one port. The port authority is interested in what’s happening in all the ports. That’s where the geo-distributed feature of MapR Streams comes in. Then the ship leaves, loads up its stuff, and chugs off to the next port. While it’s at sea, it’s collecting data about what’s happened on its on-board cluster. I’m not saying everyone’s doing this right now. I’m saying it’s the potential of what we see happening. Meanwhile, that shipping cluster the company has in Tokyo can be, with MapR Streams replication, sending that data to Singapore before the ship ever gets there. So, now Singapore has an accurate record of what’s supposed to be coming in on the ship. The ship comes in and says, “This is what’s happened while we were at sea. Let me update you about what’s happened.” It’s this beautifully synchronized thing.

Pretty amazing. We live in interesting times.

I think we do. I just find that to be a mind-boggling example, even more so because … I could see the scale, see all those ships and all those containers. I just thought, “Oh, my God. What a huge job.” I tell people, “If you read the book, you have to look for that little Easter Egg of an example.”

At the end of the book, we talk about how to migrate your system if you are interested in taking this style of approach with Apache Kafka and MapR Streams. It gives some pointers for how to do it. MapR has the rights to the book for something like 30 days, so they are giving it away, and doing a book signing here as well. MapR has it available online for free download. I know there is a PDF and I think they are also sending it as an e-book, which is a little easier reading. The other books published by O’Reilly are available as free PDFs at MapR.com, including the series called “Practical Machine Learning.” Two are set up as active e-books.


Syncsort blog

Expert Interview: Part 2 — Dr. Ellen Friedman Discusses Streaming Data Architectures and What to Look for in a Good Streaming Message Layer


Spark Streaming actually isn’t real-time. It is very, very close to real-time. However, it still won’t be replacing Hadoop.

In this Syncsort Expert Interview, Syncsort’s Paige Roberts speaks with scientist, writer, and author of numerous books on big data, Ellen Friedman. The two discuss how Hadoop fits in the industry, what other tools work well for big data streaming and batch processing, and Friedman’s latest book in the “Practical Machine Learning” series, called “Streaming Architecture: New Designs Using Apache Kafka and MapR Streams”.

In Part 1, Ellen discussed what people are using Hadoop for, and where Spark and Drill fit in. In Part 2, she talks about streaming data: what she finds most exciting about technologies and strategies for streaming, including cool things happening in the streaming data processor space, streaming architecture, metadata management for streaming data and streaming messaging systems.

Let’s talk about streaming data … What looks most exciting to you right now, as far as technologies and strategies?

I’m so excited about this topic, I really am.

People start off by saying, “I need real-time insight, I need an analytic tool, I need the right algorithm, and a processor that can handle streaming data.”

I think one of the tools that comes to mind first is Spark Streaming. It’s so popular, and it is a very good tool for working in memory. People say that they use it for real-time. It actually can’t do real-time … It approaches it by micro-batching, which is very clever. And within the window of what people actually require, it is often adequate.

People were going in before and adding Apache Storm, the pioneer in real-time processing. They’ve done a beautiful job. People do say, though, Storm is a little bit hard, a little crankier to use, but it’s worth the effort.

Right now, though, I’m so excited about a new project, a top-level Apache project called Apache Flink. Just as Spark started out of Berkeley, Flink started out of several universities in Europe, including ones in Berlin and Stockholm. It was a Ph.D. project, and was called Stratosphere. Then, when it came into the Apache Foundation, it became Flink. Flink does the same sorts of things as Spark Streaming or Storm. The difference between Spark Streaming and Flink is that Flink is real-time. It isn’t approximately real-time, it is real-time.

They took Spark Streaming’s approach and turned it around. Instead of, “Go from batch and approximate real time by micro-batching,” they say, “Let’s do it the other way around. Let’s do good, clean street-accessible real-time streaming,” and you can have it go back toward batch by changing the time window. That’s a more efficient way to do it.
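As a rough illustration of that idea (this is plain Python, not Flink or Spark Streaming code), the sketch below handles events one at a time and groups results by a time window. Widening the window moves the same logic back toward batch. The function name `tumbling_window_counts` and the sample events are hypothetical.

```python
def tumbling_window_counts(events, window_seconds):
    """Count events per fixed-size (tumbling) time window.

    `events` is an iterable of (timestamp_seconds, payload) pairs.
    Each event is handled individually as it arrives; the window size
    only controls how the results are grouped.
    """
    counts = {}
    for timestamp, _payload in events:
        # Align the timestamp to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] = counts.get(window_start, 0) + 1
    return counts


# Hypothetical sample events: (timestamp in seconds, payload).
events = [(0, "a"), (1, "b"), (4, "c"), (5, "d"), (9, "e"), (12, "f")]

per_5s = tumbling_window_counts(events, 5)    # small windows: near real-time view
per_60s = tumbling_window_counts(events, 60)  # one large window: batch-like view
```

With a 5-second window the six events fall into three groups; with a 60-second window they collapse into a single group, which is what a batch job over the same data would see.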

People say Flink is very developer friendly. It’s a tremendously growing community. It gives people another alternative, but one they can use for both batch and streaming processes. I think it’s great that people now have a selection of tools to look at.

For real-time insight with very low latency, those are great processing tools. There are others, Apache Apex, for instance. There are a lot of tools; it’s not just one, or two, or three. Look at what fits your situation and your previous expertise the best.

There are some cool things happening in the streaming data processor space.

Definitely. But, let’s move upstream from that. If people are good and clever and safe, they realize that to deliver streaming data to a process like that, you don’t want to just throw it in there. You want some kind of a clue if there’s a rupture in the process. You don’t want to lose your data. So, you begin to look at the whole range of tools you can use to deliver data.

You can use Flume, or others, but the tool that we think is so powerful is Apache Kafka. It works differently than the others. And now, I’m additionally excited because MapR has developed a streaming app called MapR Streams, a messaging system feature that’s integrated into the whole MapR platform. It uses the Apache Kafka 0.9 API. They work very similarly. There are a few things you can do with Streams that Kafka wouldn’t be able to do for you, and there’s the fact that it’s integrated into the whole platform. As I said, it simplifies things. But at the heart of it, they are approaching this the same way. I think MapR Streams and Apache Kafka are both excellent tools for this messaging.

But I want to talk to you about something more fundamental than the technologies, and that’s really what our book “Streaming Architecture” is about.

The architecture for streaming.

Exactly. Instead of just talking about tools, what I do in the book is talk about the capabilities you want to look for in that messaging layer to support the kind of architecture that we think makes the best use of streaming data. Because right now, those two tools, Apache Kafka and MapR Streams, are absolutely the tools of choice. But people constantly develop new tools. So, it’s not about this tool or that tool. It’s about what a tool needs to do for you. Do you understand its capabilities and why they’re an advantage? If so, you’ll recognize other good new tools as they get developed.

So, what do you feel are the capabilities to look for in a good streaming messaging system?

I think the big idea is that it’s not just about using streaming data for a single workflow, a single data flow, toward that real-time insight. The value of the messaging layer technology and the value of that right architecture goes way beyond that. It’s much broader.

Kafka and MapR Streams are very scalable, very high throughput, without the performance being a tradeoff against latency. Usually, if you can do one, you can’t do the other. Well, Kafka and Streams each do both very well. The level at which they perform is a little different, but they’re both off in a class almost by themselves. They also have to be reliable, obviously, but they’re both excellent at that.

Another feature to look for is that they need to be able to support multiple producers of data in that same stream or topic, and multiple consumers or consumer groups. They both can be partitioned, so that helps with load balancing and so forth.

The consumers subscribe to topics and so the data shows up and is available for immediate use, but it’s decoupled from the consumer. So these messaging systems provide the data for immediate use, yet the data is durable. It’s persistent. You don’t have to have the consumer running constantly. They don’t have to be coordinated. The consumer may be there and use the data immediately; the consumer may start up later; you may add a new consumer.

That decoupling is incredibly important.

It makes that message stream be re-playable. What that does for you is make that stream become a way to support micro services, which is hugely powerful. Both Kafka and MapR Streams have that feature.
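To make the decoupling and replay idea concrete, here is a minimal in-memory sketch. It is not the Kafka or MapR Streams API: the class `TopicLog` and its methods are hypothetical, and real systems track offsets per consumer group and partition and persist messages to disk. The point it illustrates is that reading never removes data, so each consumer keeps its own position and can start late or rewind.

```python
class TopicLog:
    """Hypothetical in-memory stand-in for a durable, replayable topic."""

    def __init__(self):
        self._messages = []  # append-only; reading never removes anything
        self._offsets = {}   # consumer name -> next position to read

    def publish(self, message):
        self._messages.append(message)

    def poll(self, consumer):
        """Return the messages this consumer has not seen yet."""
        start = self._offsets.get(consumer, 0)
        batch = self._messages[start:]
        self._offsets[consumer] = len(self._messages)
        return batch

    def rewind(self, consumer, offset=0):
        """Move a consumer back so it can replay earlier messages."""
        self._offsets[consumer] = offset


log = TopicLog()
log.publish("ship-42 left Tokyo")
log.publish("ship-42 temperature 4C")

dashboard = log.poll("dashboard")        # a consumer that is already running
log.publish("ship-42 arrived Singapore")
dashboard_more = log.poll("dashboard")   # sees only the new message
analytics = log.poll("analytics")        # added later, still sees everything

log.rewind("dashboard")
replayed = log.poll("dashboard")         # replay from the beginning
```

The late-joining `analytics` consumer and the rewound `dashboard` consumer both receive all three messages, and neither affects what the other sees.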

Back to the idea of flexibility that we discussed earlier. These message systems work for batch processes as well as streaming processes. It’s no longer just a queue that’s upstream from a streaming application, it becomes the heart of a whole architecture where you put your organization together.

You can have multiple consumers coming in and saying, “Oh, you were streaming that data toward the streaming application because you needed to make a real-time dashboard, blah blah blah. But, look. The data in that event stream is something I want to analyze differently for this database or for this search document.” You just tap into that data stream and use the data, and it doesn’t interfere with what’s going on over there. It’s absolutely a different way of doing things. It simplifies life. Both Kafka and MapR Streams support all of those features and that architecture. And I think this is a shift that people are just beginning to relate to.

Shifting to a new way of thinking and building can be difficult.

One of the nice things about this decoupling and the flexibility of using a good messaging system is that it makes the transition easier, as well. You can start it in parallel and then take the original offline. A transition is never easy to do, but it makes it much less painful than it could be.

MapR has one aspect that’s different. It’s a very new feature. It goes one step further, and actually collects a lot of topics, up to thousands, hundreds of thousands, even millions of topics, into a structure that MapR calls the Stream. There isn’t an equivalent in Kafka. The Stream is a beautiful thing. It’s a management-level piece of technology. You don’t need it for everything. But, if you have a lot of topics, this is a really great thing.

Kind of a metadata management for streaming data?

Well … for the topics that you want to manage in a similar way, you can set up multiple streams. There may be one topic in a stream, there may be 100,000 topics collected into that stream. But for all the ones that you want to manage together, you can set various policies at the Stream level. That makes it really convenient. You can set policies on access, for instance at the Stream level. You can set time-to-live.

People should get it out of their head that if you’re streaming data it’s going to be a “use it or lose it” kind of situation, that the data is gone because you don’t have room to store it. It doesn’t mean you have to store all your streaming data, it just means that if you want to, you have the option. You have a configurable time-to-live. The time-to-live can be…

Seven seconds or seven days …

Or, if you want to just keep it, you basically set that date to infinity and you’ve persisted everything. You have the option to do what you want, and you can go back and change it, too. You can set those policies. With MapR Streams, you don’t have to go in and set it …

To 100,000 different topics.

Right. You can do it collectively. Say, that whole project, we want to keep for one year, and then we’re done. Or, that whole project, we want to keep it for one day, and we’re done. You can set access rights as well at the project level. You can set who in your organization has access to what data.

This group can access this project, this group can access that project, but they can’t access this other data.

That’s right. MapR Streams gives you that additional advantage.
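The stream-level policy idea can be sketched as follows. This is an assumption-laden illustration in plain Python, not the MapR Streams API: the `Stream` class, its fields, and the topic names are all hypothetical. It shows one time-to-live and one access list being set once and applied to every topic grouped in the stream.

```python
from dataclasses import dataclass, field


@dataclass
class Stream:
    """Hypothetical grouping of topics that share one policy."""
    ttl_seconds: float                          # float("inf") keeps everything
    readers: set = field(default_factory=set)   # users allowed to read
    topics: dict = field(default_factory=dict)  # topic -> [(timestamp, message)]

    def publish(self, topic, timestamp, message):
        self.topics.setdefault(topic, []).append((timestamp, message))

    def read(self, user, topic, now):
        """Return unexpired messages if `user` may read this stream."""
        if user not in self.readers:
            raise PermissionError(f"{user} may not read this stream")
        return [msg for ts, msg in self.topics.get(topic, [])
                if now - ts <= self.ttl_seconds]


one_day = 24 * 60 * 60
project = Stream(ttl_seconds=one_day, readers={"analyst"})
project.publish("sensor-1", timestamp=0, message="old reading")
project.publish("sensor-1", timestamp=one_day, message="fresh reading")
project.publish("sensor-2", timestamp=one_day, message="other topic")

# One day and one minute after the first message, it has expired.
visible = project.read("analyst", "sensor-1", now=one_day + 60)

# Changing the policy once affects every topic in the stream.
project.ttl_seconds = float("inf")
everything = project.read("analyst", "sensor-1", now=one_day + 60)
```

A real system would enforce expiry and access control on the server side; here both are checked at read time only to keep the sketch short.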

And here’s another, different way of thinking about streaming architecture. MapR Streams has a really efficient geo-distributed replication. Say you’re trying to push streaming data out to data centers in multiple places around the world, you want that to happen right away, and you want to do it efficiently. You just replicate the stream to the other data center. It’s a very powerful capability, and …

That is organized at the stream level, as well, so again, you might say, “These topics, I want the same time-to-live or I want the same access, but these I want to replicate to five data centers, and these ones I don’t. So, I’ll make two different streams.”

It’s a good management system. These are elegant additional features, but I think at the heart of it, even if you don’t have that capability, you still have the capability to bring a stream-first architecture to most systems. Then, streaming isn’t the specialized thing, it becomes the norm.

You pull data away from that into various kinds of streams, and decide, “I’m going to do a batch process here, a virtualized database there, and I’m going to do this thing in real time.”

Right now, Kafka and MapR Streams are the two messaging technologies that we like, but it doesn’t mean they will be the only ones in the field. That’s why I think it’s important for people to look at what the capability is, rather than just looking at the tools. A tool may be the popular tool now, but there may be even better ones later.

Is there anything else people should keep in mind relating to streaming architectures?

In this whole industry, looking at how people use anything related to the whole Hadoop ecosystem, I think future-proofing is something you need to keep in mind. People are very new to this, in some cases, and one thing they can do by jumping in and using any of these technologies is build up expertise. They’re not even sure, in some cases, exactly how they want to use it. The sooner they begin to build expertise internally, the better position they’ll be in by the time they want to put something into production.

On Friday, in Part 3, Ellen will talk about her book, “Streaming Architecture: New Designs Using Apache Kafka and MapR Streams”.


Syncsort blog

New data landscape augurs discovery-based architectures

The changes in the data landscape over recent years have ramifications that are not immediately apparent. Some basic tenets of the data profession are coming under review. If nothing else, these shifts require flexibility on the part of data practitioners, according to Lakshmi Randall, principal at the Unabashed Advice consultancy. In 19-plus years, she has focused to a great extent on data preparation and quality issues. We caught up with her following her appearance on a panel that pitted data warehouses against data lakes at the recent Enterprise Data World 2016 event in San Diego.

I suppose pitting data warehouses against data lakes has some purpose. But isn’t it just a fact that the data landscape is shifting? How do you see the warehouse and the lake today?

Lakshmi Randall: What is breaking down is a strictly linear approach to data management and analytics. That is, one in which data travels a step-by-step path from acquisition to insights. It works when you understand the data, when it’s predominantly structured and it originates from familiar data sources.

But in the case of big data — notes from a physician or insurance claims form data — the data is semi-structured or unstructured, making the linear approach no longer feasible. These examples require discovering the data sources, filing the data and facilitating the understanding of the data before we decide on the path to the insights.


You could move it to the data warehouse or, if after the discovery process you find it’s not useful, throw it away. I think with the change in the data landscape, you have to think about more than just the linear approach. You also have to think about discovery and exploratory approaches. Based on that, you decide on the next best actions for either processing or storing the data.

As the data landscape is changing, we are seeing new types of data. We should be open to different architectures, where appropriate. Data governance is still key, but you have to have some level of agility and flexibility too.

“With the new use cases, data becomes part of a more iterative process.” Lakshmi Randall, principal, Unabashed Advice

There seems to be a growing need for IT to support a somewhat different user than it may have in the past, something like a power user on steroids, one might say.

Randall: Well, different use cases drive the different tactics. Data becomes part of a more iterative process. The personas that must be supported change. It is not just a persona that typically does day-to-day analysis. It may be what you call a power user or a data discovery user or a data scientist. It may be someone who combines the skills of domain knowledge along with some level of technical knowledge, a hybrid persona. Really, there is a need for a continuum of personas in the enterprise.

Let’s look at another aspect of the data landscape: NoSQL. What are some forces driving interest in using NoSQL?

Randall: When you’re modeling data that holds true relationships — ones that are more affinity driven — data modeling is different than it is with a traditional relational database. That is a great example of the need for a NoSQL database.

For example, as part of a customer experience management solution, there are different touch points in the customer journey. These can be across many different channels. And finding those special connections, I think, is only possible if we have NoSQL, given that it stores the data in something close to its natural form. That is, as opposed to having to translate the data into rows and columns. People are finding that there are some use cases, like this one, that are really good candidates for NoSQL databases. It all has to do with the nature of the data. If it is relational data, then relational databases and data warehouses are better candidates.

In your experience as of late, where is the data profession on all this? For example, with governance and modeling, there can be a natural inclination to ask for more upfront control. Are you seeing changes in the way teams are organizing?

Randall: The business is justified in demanding the ability to conduct ad-hoc analysis or to have access to the appropriate and relevant data in order to accelerate the time-to-insights. At the same time, the business should be a sponsor of IT in establishing governance and stewardship initiatives.

Today, the data profession extends across IT and the business. And the reality is the enterprise needs a continuum of personas — that means people with quantitative skills, qualitative skills, domain experts, process experts, data scientists, data stewards and so on — to support the multitude of business objectives.



SearchBusinessAnalytics: BI, CPM and analytics news, tips and resources