
Expert Interview: Part 3 — Dr. Ellen Friedman Discusses Streaming Data Architectures and What to Look for in a Good Streaming Message Layer

Big Data | July 15, 2016

With big data, not only can a shipping company keep track of the data for one ship, but the port authorities can collect and track metrics from all the ships in all the ports, all over the world.

In this Syncsort Expert Interview, Syncsort’s Paige Roberts speaks with scientist, writer, and author of numerous books on big data, Ellen Friedman. The two discuss how Hadoop fits in the industry, what other tools work well for big data streaming and batch processing, and about Friedman’s latest book.

Tell me about your book.

Well, I just did! [laughs] The content I was talking about [in Part 1 and Part 2 of this interview] is kind of the heart of the book. But, there is more. The book is called “Streaming Architecture: New Designs Using Apache Kafka and MapR Streams.” It’s a book that should work well both for people who are actually the ones technically building these systems, and for people who are not. That’s the approach we take with all six of our books.

High level for the business person, and then drill down into the code for the technical person?

Right. It helps the very technical implementer because it gives them a chance to think about the basics behind what they’re doing. They don’t always have the time to do that.

We talk about why people use streaming and give a number of use cases. We talk about the stream-based architecture that I just described to you and why the messaging system is very important and how it can be used differently.

The third chapter is all about microservices … what the concept of microservices is, why that's useful, and why organizations that move to that style have seen a lot of success with it. You don't have to do streaming, obviously, to set up microservices. But streaming is a new way to support microservices, and I think sometimes people are surprised to realize how well it does. We explain how.

The fourth chapter is called Apache Kafka, and we explain the power of Kafka, how it works, templates, some sample programs … Chapter five turns around and does the same thing with MapR Streams. Then we have a couple of chapters that just take specific use cases. One is an anomaly detection case. The book shows how to build it with a stream-based architecture, and why that could be an advantage to you.
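The book's actual anomaly-detection example isn't reproduced here, but the core idea of flagging events that deviate from a recent baseline in a stream can be sketched in a few lines of Python (a toy illustration, not code from the book):

```python
from collections import deque

def detect_anomalies(stream, window=5, threshold=3.0):
    """Flag values that deviate sharply from a trailing window's mean.

    A toy stand-in for a streaming anomaly detector: keep a small window
    of recent values and flag any new value more than `threshold`
    standard deviations away from the window mean.
    """
    recent = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(stream):
        if len(recent) == window:
            mean = sum(recent) / window
            var = sum((x - mean) ** 2 for x in recent) / window
            std = var ** 0.5 or 1e-9   # guard against a zero-variance window
            if abs(value - mean) / std > threshold:
                anomalies.append(i)
        recent.append(value)
    return anomalies

readings = [10, 11, 10, 12, 11, 10, 95, 11, 10, 12]
print(detect_anomalies(readings))  # the spike at index 6 is flagged
```

In a real stream-based design, the detector would be one consumer reading from the message layer, so other consumers could analyze the same events independently.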

The last use case … [laughing] I’m laughing because one of our figures has a little joke built into it, but it’s using the example of IoT (internet of things) data, looking at container shipping, just a mind-boggling scale of data to transport …

Ted Dunning used that as an example in his talk. (At Strata + Hadoop World 2016)

Well, I was at Strata Singapore in December. I was on the 22nd floor of some building meeting with a customer, but I was distracted and looked out the window, and I could see the container port. A huge percentage of the world's container shipping goes through there. I've written about it before, but I'd never been there before. Staring out the window at that scale, the sheer number of ships … it's like your brain melts. It's just stunning. When you think that all those containers can be covered with sensors that are reporting back. There are sensors on the ship. You can have an onboard cluster. You can stream data to that cluster. It can then stream data to a cluster at the port, which is maybe owned by the shipping company, so they're tracking what's happening with their stuff. They can send that data around the world.

Like to the port authority.

…who is interested not just in that one company at that one port. The port authority is interested in what's happening in all the ports. That's where the geo-distributed feature of MapR Streams comes in. Then the ship leaves, loads up its stuff, and chugs off to the next port. While it's at sea, it's collecting data about what's happened on its on-board cluster. I'm not saying everyone's doing this right now. I'm saying it's the potential of what we see happening. Meanwhile, the shipping cluster the company has in Tokyo can be, with MapR Streams replication, sending that data to Singapore before the ship ever gets there. So, now Singapore has an accurate record of what's supposed to be coming in on the ship. The ship comes in and says, "Let me update you about what's happened while we were at sea." It's this beautifully synchronized thing.

Pretty amazing. We live in interesting times.

I think we do. I just find that to be a mind-boggling example, even more so because … I could see the scale, see all those ships and all those containers. I just thought, "Oh, my God. What a huge job." I tell people, "If you read the book, you have to look for that little Easter egg of an example."

At the end of the book, we talk about how, if you are interested in taking this style of approach with Apache Kafka and MapR Streams, you migrate your system. It gives some pointers for how to do it. MapR has the rights to the book for something like 30 days, so they are giving it away, and doing a book signing here as well. MapR has it available online for free download. I know there is a PDF, and I think they are also offering it as an e-book, which is a little easier reading. The other books published by O'Reilly are available as free PDFs at MapR.com, which includes the series called "Practical Machine Learning." Two are also set up as interactive e-books.


Syncsort blog

Expert Interview: Part 2 — Dr. Ellen Friedman Discusses Streaming Data Architectures and What to Look for in a Good Streaming Message Layer


Spark Streaming actually isn’t real-time. It is very, very close to real-time. However, it still won’t be replacing Hadoop.

In this Syncsort Expert Interview, Syncsort's Paige Roberts speaks with scientist, writer, and author of numerous books on big data, Ellen Friedman. The two discuss how Hadoop fits in the industry, what other tools work well for big data streaming and batch processing, and Friedman's latest book, "Streaming Architecture: New Designs Using Apache Kafka and MapR Streams".

In Part 1, Ellen discussed what people are using Hadoop for, and where Spark and Drill fit in. In Part 2, she talks about streaming data: what she finds most exciting among technologies and strategies for streaming, including cool things happening in the streaming data processor space, streaming architecture, metadata management for streaming data, and streaming messaging systems.

Let’s talk about streaming data … What looks most exciting to you right now, as far as technologies and strategies?

I’m so excited about this topic, I really am.

People start off by saying, “I need real-time insight, I need an analytic tool, I need the right algorithm, and a processor that can handle streaming data.”

I think one of the tools that comes to mind first is Spark Streaming. It's so popular, and it is a very good tool for working in memory. People say that they use it for real-time. It actually can't do true real-time … it approximates it by micro-batching, which is very clever. And within many people's window of what's required, it is often adequate.
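The micro-batching idea is easy to sketch: events are grouped by arrival time into small fixed-interval buckets, and each bucket is processed as a unit, so latency can never drop below the batch interval. This is a toy illustration of the concept, not Spark Streaming's actual implementation:

```python
from collections import defaultdict

def micro_batch(events, interval_ms):
    """Group (timestamp_ms, payload) events into fixed-interval buckets.

    Each bucket is then handed to a batch computation, which is how a
    micro-batch engine approximates streaming: results arrive only when
    a batch closes, never faster than the batch interval.
    """
    batches = defaultdict(list)
    for ts, payload in events:
        batches[ts // interval_ms].append(payload)
    return [batches[k] for k in sorted(batches)]

events = [(5, "a"), (120, "b"), (130, "c"), (260, "d")]
print(micro_batch(events, interval_ms=100))
# → [['a'], ['b', 'c'], ['d']]
```

Event "a" at 5 ms is not visible to downstream processing until its 100 ms batch closes, which is exactly the latency floor being described.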

Before that, people were going in and adding Apache Storm, the pioneer in real-time processing. They've done a beautiful job. People do say, though, that Storm is a little bit hard, a little crankier to use, but it's worth the effort.

Right now, though, I'm so excited about a new project, a top-level Apache project called Apache Flink. Just as Spark started out of Berkeley, Flink started out of several universities in Europe, one in Berlin and one in Stockholm. It started as a Ph.D. project called Stratosphere. Then, when it came into the Apache Foundation, it became Flink. Flink does the same sorts of things as Spark Streaming or Storm. The difference between Spark Streaming and Flink is that Flink is real-time. It isn't approximately real-time, it is real-time.

They took Spark Streaming's approach and turned it around. Instead of, "Start from batch and approximate real time by micro-batching," they say, "Let's do it the other way around. Let's do good, clean, true real-time streaming," and you can have it go back toward batch by changing the time window. That's a more efficient way to do it.
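That "turned around" approach can be sketched the same way: process events one at a time, emitting a result whenever the current time window closes, and recover batch behavior simply by widening the window. Again a toy illustration of the concept, not Flink's actual API:

```python
def tumbling_windows(events, window_ms):
    """Process (timestamp_ms, value) events one at a time, emitting a
    windowed sum whenever the current window closes.

    Shrinking window_ms gives near per-event (streaming) results;
    widening it toward infinity recovers one big batch result.
    """
    results, current, window_end = [], [], None
    for ts, value in events:
        if window_end is None:
            window_end = (ts // window_ms + 1) * window_ms
        while ts >= window_end:        # window closed: emit and advance
            results.append(sum(current))
            current, window_end = [], window_end + window_ms
        current.append(value)
    if current:                        # flush the final open window
        results.append(sum(current))
    return results

events = [(10, 1), (50, 2), (150, 3), (180, 4)]
print(tumbling_windows(events, window_ms=100))
# → [3, 7]
```

With `window_ms=1000` the same events produce a single batch total of 10, which is the sense in which batch becomes just a special case of streaming.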

People say Flink is very developer friendly. It has a tremendously fast-growing community. It gives people another alternative, and one they can use for both batch and streaming processes. I think it's great that people now have a selection of tools to look at.

For real-time insight with very low latency, those are great processing tools. There are others, Apache Apex, for instance. There are a lot of tools; those aren't the only one, or two, or three. Look at what fits your situation and your previous expertise the best.

There are some cool things happening in the streaming data processor space.

Definitely. But, let’s move upstream from that. If people are good and clever and safe, they realize that to deliver streaming data to a process like that, you don’t want to just throw it in there. You want some kind of a clue if there’s a rupture in the process. You don’t want to lose your data. So, you begin to look at the whole range of tools you can use to deliver data.

You can use Flume, or others, but the tool that we think is so powerful is Apache Kafka. It works differently than the others. And now, I'm additionally excited because MapR has developed MapR Streams, a messaging system that's integrated into the whole MapR platform. It uses the Apache Kafka 0.9 API, so they work very similarly. There are a few things you can do with Streams that Kafka wouldn't be able to do for you, and there's the fact that it's integrated into the whole platform. As I said, that simplifies things. But at the heart of it, they are approaching this the same way. I think MapR Streams and Apache Kafka are both excellent tools for this messaging.

But I want to talk to you about something more fundamental than the technologies, and that’s really what our book “Streaming Architectures” is about.

The architecture for streaming.

Exactly. Instead of just talking about tools, what I do in the book is talk about the capabilities you want to look for in that messaging layer to support the kind of architecture that we think makes the best use of streaming data. Because right now, those two tools, Apache Kafka and MapR Streams, are absolutely the tools of choice. But people constantly develop new tools. So, it's not about this tool or that tool. It's about what a tool needs to do for you. Do you understand its capabilities and why they're an advantage? If so, you'll recognize other good new tools as they get developed.

So, what do you feel are the capabilities to look for in a good streaming messaging system?

I think the big idea is that it’s not just about using streaming data for a single workflow, a single data flow, toward that real-time insight. The value of the messaging layer technology and the value of that right architecture goes way beyond that. It’s much broader.

Kafka and MapR Streams are very scalable, with very high throughput, without that performance being a tradeoff against latency. Usually, if you can do one, you can't do the other. Well, Kafka and Streams both do both very well. The level at which they perform is a little different, but they're both off in a class almost by themselves. They also have to be reliable, obviously, and they're both excellent at that.

Another feature to look for is that they need to be able to support multiple producers of data in that same stream or topic, and multiple consumers or consumer groups. They both can be partitioned, so that helps with load balancing and so forth.

The consumers subscribe to topics and so the data shows up and is available for immediate use, but it’s decoupled from the consumer. So these messaging systems provide the data for immediate use, yet the data is durable. It’s persistent. You don’t have to have the consumer running constantly. They don’t have to be coordinated. The consumer may be there and use the data immediately; the consumer may start up later; you may add a new consumer.

That decoupling is incredibly important.

It makes the message stream replayable. What that does for you is make the stream a way to support microservices, which is hugely powerful. Both Kafka and MapR Streams have that feature.
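That replay behavior is easy to model: an append-only log where each consumer keeps its own offset, so consumers are decoupled from producers and from each other, and a late-arriving consumer can replay everything from the beginning. A minimal in-memory sketch, not the real Kafka or MapR Streams API:

```python
class Log:
    """Toy in-memory model of a Kafka/MapR Streams-style topic: an
    append-only log where each consumer keeps its own offset, so
    consumers are decoupled from producers and can replay at any time."""

    def __init__(self):
        self.messages = []
        self.offsets = {}          # consumer name -> next offset to read

    def produce(self, msg):
        self.messages.append(msg)  # durable: nothing is deleted on read

    def consume(self, consumer):
        start = self.offsets.get(consumer, 0)
        self.offsets[consumer] = len(self.messages)
        return self.messages[start:]

topic = Log()
topic.produce("order-1")
topic.produce("order-2")
print(topic.consume("dashboard"))  # a live consumer reads both messages
topic.produce("order-3")
print(topic.consume("dashboard"))  # only the new message this time
print(topic.consume("auditor"))    # a late consumer replays everything
```

The "auditor" consumer here was added after the data arrived, yet it sees the full history, which is exactly the decoupling that lets new microservices tap into an existing stream.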

Back to the idea of flexibility that we discussed earlier. These messaging systems work for batch processes as well as streaming processes. It's no longer just a queue that sits upstream from a streaming application; it becomes the heart of a whole architecture around which you put your organization together.

You can have multiple consumers coming in and saying, “Oh, you were streaming that data toward the streaming application because you needed to make a real-time dashboard, blah blah blah. But, look. The data in that event stream, is something I want to analyze differently for this database or for this search document.” You just tap into that data stream and use the data, and it doesn’t interfere with what’s going on over there. It’s absolutely a different way of doing things. It simplifies life. Both Kafka and MapR Streams support all of those features and architecture. And I think this is a shift that people are just beginning to relate to.

Shifting to a new way of thinking and building can be difficult.

One of the nice things about this decoupling and the flexibility of using a good messaging system is that it makes the transition easier, as well. You can start it in parallel and then take the original offline. A transition is never easy to do, but it makes it much less painful than it could be.

MapR has one aspect that's different. It's a very new feature. It goes one step further, and actually collects together a lot of topics, thousands, hundreds of thousands, even millions of topics, into a structure that MapR calls a Stream. There isn't an equivalent in Kafka. The Stream is a beautiful thing. It's a management-level piece of technology. You don't need it for everything. But, if you have a lot of topics, this is a really great thing.

Kind of a metadata management for streaming data?

Well … for the topics that you want to manage in a similar way, you can set up multiple streams. There may be one topic in a stream, there may be 100,000 topics collected into that stream. But for all the ones that you want to manage together, you can set various policies at the Stream level. That makes it really convenient. You can set policies on access, for instance at the Stream level. You can set time-to-live.

People should get it out of their head that if you’re streaming data it’s going to be a “use it or lose it” kind of situation, that the data is gone because you don’t have room to store it. It doesn’t mean you have to store all your streaming data, it just means that if you want to, you have the option. You have a configurable time-to-live. The time-to-live can be…

Seven seconds or seven days …

Or, if you want to just keep it, you basically set that date to infinity and you've persisted everything. You have the option to do what you want, and you can go back and change it, too. You can set those policies. With MapR Streams, you don't have to go in and set it …

To 100,000 different topics.

Right. You can do it collectively. Say, that whole project, we want to keep for one year, and then we’re done. Or, that whole project, we want to keep it for one day, and we’re done. You can set access rights as well at the project level. You can set who in your organization has access to what data.

This group can access this project, this group can access that project, but they can’t access this other data.

That’s right. MapR Streams gives you that additional advantage.
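The stream-level policy idea can be sketched as a time-to-live set once on the stream that applies to every topic it contains, with `None` standing in for the "set it to infinity" case. This is a toy model of the concept; MapR's actual administrative API is not shown here:

```python
class Stream:
    """Toy model of stream-level policy: a time-to-live set once on the
    stream applies to every topic it contains. ttl_seconds=None means
    keep forever (the 'set that date to infinity' case)."""

    def __init__(self, ttl_seconds=None):
        self.ttl = ttl_seconds
        self.topics = {}           # topic -> list of (timestamp, msg)

    def publish(self, topic, msg, now):
        self.topics.setdefault(topic, []).append((now, msg))

    def read(self, topic, now):
        msgs = self.topics.get(topic, [])
        if self.ttl is None:       # persist everything
            return [m for _, m in msgs]
        return [m for ts, m in msgs if now - ts < self.ttl]

s = Stream(ttl_seconds=7)          # 'seven seconds or seven days ...'
s.publish("sensors", "old reading", now=0)
s.publish("sensors", "new reading", now=5)
print(s.read("sensors", now=10))   # the old reading has expired
```

The point of the sketch: one policy on the `Stream` object governs every topic inside it, instead of being set 100,000 times on 100,000 topics.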

And here’s another, different way of thinking about streaming architecture. MapR Streams has a really efficient geo-distributed replication. Say you’re trying to push streaming data out to data centers in multiple places around the world, you want that to happen right away, and you want to do it efficiently. You just replicate the stream to the other data center. It’s a very powerful capability, and

That is organized at the stream level, as well, so again, you might say, “These topics, I want the same time-to-live or I want the same access, but these I want to replicate to five data centers, and these ones I don’t. So, I’ll make two different streams.”
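That per-stream replication choice can be modeled in a few lines: one stream fans its messages out to replica data centers, another is kept purely local. A hypothetical sketch of the idea, not MapR's actual replication mechanism (the stream and data center names are made up):

```python
class ReplicatedStream:
    """Toy sketch of stream-level geo-replication: messages published to
    the primary are fanned out to each replica data center, and the
    replication choice is made per stream, not per topic."""

    def __init__(self, name, replicas=()):
        self.name = name
        self.replicas = list(replicas)
        self.data = {dc: [] for dc in ["primary", *self.replicas]}

    def publish(self, msg):
        for dc in self.data:             # primary plus every replica
            self.data[dc].append(msg)

# Replicate ship telemetry to two other data centers; keep local
# metrics in one place only.
telemetry = ReplicatedStream("telemetry", replicas=["singapore", "frankfurt"])
metrics = ReplicatedStream("metrics")
telemetry.publish("container 42 docked")
print(telemetry.data["singapore"])       # replica already has the event
print(metrics.data)                      # nothing left the primary
```

Making two differently configured streams, as described above, is just constructing two of these objects with different `replicas` lists.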

It’s a good management system. These are elegant additional features, but I think at the heart of it, even if you don’t have that capability, you still have the capability to bring a stream-first architecture to most systems. Then, streaming isn’t the specialized thing, it becomes the norm.

You pull data away from that into various kinds of streams, and decide, “I’m going to do a batch process here, a virtualized database there, and I’m going to do this thing in real time.”

Right now, Kafka and MapR Streams are the two messaging technologies that we like, but it doesn’t mean they will be the only ones in the field. That’s why I think it’s important for people to look at what the capability is, rather than just looking at the tools. A tool may be the popular tool now, but there may be even better ones later.

Is there anything else people should keep in mind relating to streaming architectures?

In this whole industry, looking at how people use anything related to the whole Hadoop ecosystem, I think future-proofing is something you need to keep in mind. People are very new to this, in some cases, and one thing they can do by jumping in and using any of these technologies is build up expertise. They’re not even sure, in some cases, exactly how they want to use it. The sooner they begin to build expertise internally, the better position they’ll be in by the time they want to put something into production.

On Friday, in Part 3, Ellen will talk about her book, "Streaming Architecture: New Designs Using Apache Kafka and MapR Streams".



Expert Interview Series: Part 1 — Dr. Ellen Friedman Discusses Increased Flexibility in Big Data Tools and Changing Business Cultures

In this Syncsort Expert Interview, Syncsort's Paige Roberts speaks with scientist, writer, and author of numerous books on big data, Dr. Ellen Friedman. The two discuss how Hadoop fits in the industry, what other tools work well for big data streaming and batch processing, and Friedman's latest book, "Streaming Architecture: New Designs Using Apache Kafka and MapR Streams".

Ellen Friedman is a consultant and commentator, currently writing mainly about big data topics. She is a committer for the Apache Mahout project and a contributor to the Apache Drill project. With a PhD in Biochemistry, she has years of experience as a research scientist and has written about a variety of technical topics including molecular biology, nontraditional inheritance, and oceanography. Ellen is on Twitter at @Ellen_Friedman.

July 12, 2016

Hadoop isn’t just one thing. There is Hadoop itself, with all its multiple parts, plus the whole ecosystem of Hadoop infrastructure and architecture.

In your experience, what are people using Hadoop for?

I wrote a book a little over a year ago with Ted Dunning called “Real World Hadoop”. For that book, I looked at how MapR customers are using Hadoop. We tried to talk about Hadoop so that it wasn’t MapR specific. MapR is a different kind of platform, so some of the workflows we showed were a little simpler than in standard Hadoop. But you could do the same work with other systems, you would just have to have more pieces. The goal was to understand why people are doing this, and not just what drives them to it, like marketing. What are their needs? In some cases, companies are just now starting to recognize their own needs and how Hadoop fits.

Why is that, do you think?

Hadoop sounds like a specialized thing, but Hadoop isn’t just one thing. There’s Hadoop itself, which has multiple parts. A larger issue is the whole Hadoop ecosystem, which includes the whole collection of tools people use.

In general terms, Hadoop allows companies to scale up to levels of data they might not have even considered using before. And by "allow," I mean that it makes it feasible; it's practical, and it's affordable. So, Hadoop is opening doors, and not just to do the same things people have been doing at lower cost. People are beginning to ask questions that have never been asked, and to take on applications they could never have approached, because you actually do need a threshold of data to be able to do that.

The other purpose, that I think is less obvious to people until they really start working with the system, is that the Hadoop style approach, including the MapR platform, opens the door to a new kind of flexibility. Now with the new ability to handle streaming data, that agility changes the way people can analyze and process data.

Part of what I try to do is to help people recognize that they have a different option in terms of flexibility. That means they can begin on the human culture side of their organization to rethink how they approach their problems. Because otherwise, they aren’t using the tool to its full advantage.

Can you give some examples of the kind of flexibility you mean?

One example is, as people begin to use a NoSQL database, like MapR Database … it uses the Apache HBase API, so it's like a columnar data store. It actually has a second version, which uses a JSON document-style interface that works sort of like MongoDB, so it really goes in two directions. But I think the basic principle is the same. … With those NoSQL databases (Apache HBase, MongoDB, and the others), you can put in more data, you can put in raw data, and you can process data. You can do those early steps in the ETL, even if you're going to use a traditional relational database later. You're doing that early-stage processing, starting with huge data and going down to smaller data …

Aggregating and filtering.

Yes, and you can do it cheaper. It’s better to hand a relational database the good stuff, right? So you do your early processing on your Hadoop platform, in HBase or whatever. It gives you a different kind of flexibility.
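Handing the relational database "the good stuff" often amounts to an aggregate-and-filter pass over raw records before anything is loaded. A minimal sketch of that early-stage processing, with made-up field names for illustration:

```python
from collections import Counter

def aggregate_and_filter(raw_events, min_count=2):
    """Early-stage ETL: collapse raw events to per-user counts and keep
    only users above a threshold, so the downstream relational database
    receives small, clean aggregates instead of huge raw data."""
    counts = Counter(e["user"] for e in raw_events if e.get("user"))
    return {user: n for user, n in counts.items() if n >= min_count}

raw = [{"user": "ana"}, {"user": "bo"}, {"user": "ana"},
       {"user": None}, {"user": "ana"}, {"user": "bo"}]
print(aggregate_and_filter(raw))   # {'ana': 3, 'bo': 2}
```

At production scale this pass would run on the Hadoop platform (in HBase, MapReduce, Spark, or similar), but the shape of the work is the same: huge raw input in, small trusted aggregates out.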

Another flexibility example is that, unlike with a traditional relational database, you don't have to pre-decide every question you want to ask. You don't have to know exactly how you're going to use the data. That's not to say you start with no clue; I don't mean that at all. But let's say I know I want to use it for this particular purpose; I'm going toward a BI relational database, for example.

But you’re still storing that raw data in HBase or wherever. You didn’t have to use it and throw it away because you couldn’t afford to store it. Now you can ask, “What other questions, what other exploration do I want to do in that data?”

You don’t have to have a big committee to decide it. Because the expense of going back and asking a different question in Hadoop is different than saying I want to do an entire new line of analytics through a relational BI database. That’s a big commitment.

That type of flexibility means, people can defer decisions. They can save the data and then figure out what they want to do with it later. On the human side, they need to transform their thinking in order to take real advantage of that.

How do you feel about Spark?

As new tools come along like Apache Spark, people say, "Well, hasn't Spark replaced Hadoop?" I say, "Well, some of the processing in Spark certainly is replacing many of the situations where people were using the computational framework in Hadoop, which is MapReduce, but that doesn't mean it's replacing all of Hadoop. It means it's replacing just the piece that's running in Spark."

Tell me about Drill. MapR has put a lot of emphasis on Drill, I’ve noticed.

Apache Drill is a very fast, very efficient SQL query engine that works on Hadoop, or without Hadoop.

What makes Drill different?

Drill works on some of these new styles of data, like Parquet and JSON; it even works on nested data. And you don't have to do a lot of pre-defining of schema. It discovers schema on the fly. The other thing that's unusual about it, as opposed to things like Hive or Impala that are SQL-like, is that it's standard ANSI SQL. So it connects the traditional people and their traditional BI tools directly into this Hadoop-based, big data world in a much more seamless way.
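Schema discovery "on the fly" can be illustrated with a toy version of what a schema-on-read engine does: walk the records at query time and observe field names and types, with no DDL declared up front. Drill does this far more thoroughly as part of executing a query; this sketch only shows the idea:

```python
import json

def discover_schema(json_lines):
    """Mimic schema-on-read: scan JSON records and report each field's
    name and observed types, with no schema declared up front. Note that
    the 'flag' field below appears in only some records, which a
    schema-on-read engine handles without complaint."""
    schema = {}
    for line in json_lines:
        for key, value in json.loads(line).items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return {k: sorted(v) for k, v in schema.items()}

records = [
    '{"name": "ship-1", "teu": 18000}',
    '{"name": "ship-2", "teu": 21000, "flag": "SG"}',
]
print(discover_schema(records))
# → {'name': ['str'], 'teu': ['int'], 'flag': ['str']}
```

The contrast with schema-on-write is that nothing had to be declared before the data arrived; the structure is recovered from the records themselves at read time.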

But, one of the great things about Drill, back to this theme of flexibility, is that … people say, “How fast do its queries run? How do you compare it directly to one or the other choice?” Well, it’s very fast. It’s not always the fastest, it depends on the situation … But really, to understand if it’s a good tool for people, the question is not, “How fast did the race run once the starting gun went off and you ran that query,” that’s important. But how long did it take you to get to the starting line? Was that three weeks of preparation of your data to run the query? Well, with Drill, that three weeks may become an hour or two hours.

So, if I got to the race in two hours plus ran it in 20 seconds, versus three weeks to get there plus 15 seconds to run, which was faster?

Not only do you get to the end of that race faster, but now you realize you can take insight from that first query and turn around in the moment and say, “Oh, now I see. I also have another question.” You loop back and you ask that second question.

So when people say, "How fast is it?" I say, "One question you should ask is, 'How long does it take me to form the second query?'" Because you can begin to build insight on insight at the rate people sit down and talk, like a conversation.

They look at a result, have a cup of coffee, chat with a colleague, and go back in the same day to take it to the next step. That's just not possible in other systems where it takes hours and hours, or days, or weeks to prepare the next query.

When it takes that long, your thinking is different. The expense of doing the next set of questions is different. So you say, “Do I really want to ask that question? Is it worth it?”

As opposed to following that train of thought and going, "Well, what about this? What if we do that?"

Exactly. So one thing that I try to help people see is that with Apache Drill, it's not just that initial convenience, which is huge. You also want to structure your teams differently to take full advantage of it: how they do their work, how they think about things, how they move the work forward, what their goals are.

So, to go back to the beginning, the purpose of Hadoop is to let you have access to new and unstructured data, and let you have access to traditional data that you’ve been using before, but at larger scale, much less expensively, and, on the other side, let you start thinking in new ways of “save data, go back to it later”.

You don’t always know what is going to be the important question about that data. You see that in research all the time. When people do a study, they save what they think is important at the time. You go back and say, “If only we had asked people in the case study this question.” And you can’t help that.

I think Hadoop lets people start to use data in a much more fundamental and exciting way. But it’s not Hadoop versus traditional ways … it’s how you connect those together that optimizes the system.

Tomorrow, in Part 2, Ellen talks about streaming data: what she finds most exciting among technologies and strategies for streaming, including cool things happening in the streaming data processor space, streaming architecture, metadata management for streaming data, and streaming messaging systems.

