
Video: Syncsort CTO on Trends in Data Science, Streaming & Cloud, and their Impact on Data Governance

During Strata Data Conference 2018 in San Jose, California, Syncsort CTO Dr. Tendü Yoğurtçu sat down with theCUBE co-hosts George Gilbert and Lisa Martin at Big Data SV 2018. In the recorded interview, they discuss three key industry trends – Data Science, streaming and the Cloud – and how all of them create data governance challenges.

Watch the video to learn more about what organizations are doing as they work to make data their core strategy, and how Syncsort is working to help them.


Data Science Trends Complicate Data Governance

First, Tendü talks about how organizations are focused on preparing data for deep learning and artificial intelligence use. She also addresses how the data must be trusted for use with these technologies, heightening the importance of data integration and data quality to prepare, cleanse and match data. Tendü also discusses the advantage Syncsort has in its domain expertise, which lets it infuse machine learning algorithms and connect data profiling with data quality capabilities. That approach can help organizations by recommending business rules and automating mandated tasks.

Ensuring Data Governance Doesn’t Get “Cloudy”

Tendü explains that many organizations now have multiple workloads in hybrid clouds, creating governance challenges as well as necessitating more scoping and planning for the Cloud. She points out that Data Governance is the “umbrella focus for everything we are doing at Syncsort,” because these other trends and developing next-generation analytics environments require good data governance. The big driver is regulatory compliance, such as GDPR, which is “on the mind of every C-level exec” – not just for European companies – since most companies have European data sources in their environments. Security and availability of the data are key, and another critical aspect is delivering high-quality data to data scientists.

Tendü talks about the importance of Syncsort’s design once, deploy anywhere strategy to enable organizations to run the same applications, without requiring any changes, across all their environments.

Data Governance Must Swim Up and Down Stream

Tendü also discussed another macro trend – streaming with connected devices. So much data is being generated, driving the need to process and stream data on the edge. In addition, the Kafka data bus now consumes streaming data and publishes it, making it available for applications and analytics in the data pipeline. Syncsort helps meet the resulting data governance challenges by providing CDC and real-time data replication capabilities.

For more on how industry trends in Data Science can be game-changing for IT organizations, be sure to check out our Strata Data Conference recap tomorrow!

Also, make sure to download our eBook, “The New Rules for Your Data Landscape,” and take a look at the rules that are transforming the relationship between business and IT.


Syncsort + Trillium Software Blog

Top 10 Big Data Blogs of 2017 – Streaming Data *Spark*les

As 2017 draws to a close, we’re reviewing our best content of the year starting with the best of our Big Data blogs. While last year was all about Hadoop, 2017 can easily be considered the year of streaming data (and Spark).

Let’s get this countdown started!

Back in January, open source software was taking the world of Big Data and analytics by storm – so much so that it was hard to keep track of all the open source data tools out there. Here’s our guide to the top open source data products we anticipated leading the market in 2017, including Spark and Kafka. Read more >

If you’re like most organizations, you collect a lot of dark data – which means data that you don’t put to work. But your data doesn’t have to stay dark. Keep reading for examples of ways you can put dark data to use by using it to gain new insights. Read more >

You know that Big Data involves lots of data. But have you ever stopped to think about just how much data, exactly, goes into Big Data? In other words, how big is Big Data, actually? Read more >

Dakshinamurthy V. Kolluru is the Founder & President of INSOFE (International School of Engineering), an organization that champions training & certification, consulting, research and product development in Data Science and Big Data Analytics. His expertise lies in simplifying complex ideas and communicating them in clear and exciting ways. Read more >

If you work with Big Data, you might not think DevOps has much to do with you – and vice versa. But you’d be wrong. Here’s why Big Data and DevOps make sense together. Read more >


Not all expert interviews are created equal. Here are the five most popular Big Data experts we spoke to in the first half of 2017, including our conversations with Spark‘s Xin and INSOFE’s Kolluru. See who else made the cut. Read more >

Reynold Xin is the Chief Architect for Spark core at Databricks and one of Spark’s founding fathers. During his interview with Syncsort’s Paige Roberts, he discusses the details on the driving factors behind Spark 2.0 and its newest features. Read more >

You’ve heard all about Big Data and the theory behind it. But do you know how data analytics are actually being used to change the way we work in the real world? Keep reading for an overview of five industries and how they are being reshaped by data analytics. Read more >

At the cusp of 2017, there were still quite a number of Big Data products and platforms to pick from to assemble an infrastructure that meets your needs. In this blog post from January, we lined up the sexiest prospects for streaming Big Data to ramp up your 2017 projects. Read more >

Are you trying to understand Big Data and data analytics, but are confused by the difference between streaming data and batch data processing? If so, our most popular blog post of the year is for you! Read more >

We hope you’ve enjoyed our Big Data blogs this year. For a look ahead, check out our report, 2018 Big Data Trends: Liberate, Integrate & Trust, to see what every business needs to know in the upcoming year about Big Data, including 5 key trends to watch for in the next 12 months!


Syncsort + Trillium Software Blog

TIBCO Named a Leader in The Forrester Wave™: Streaming Analytics, Q3 2017


Forrester has named TIBCO a Leader in The Forrester Wave™: Streaming Analytics, Q3 2017 among thirteen vendors that were evaluated. For the Strategy category, we received a 4.9 out of a possible 5 points.

TIBCO StreamBase is also recognized for “[unifying] real-time analytics” with a “full-featured streaming analytics solution that integrates with applications to automate actions and also offers Live DataMart to create a real-time visual command center.”

Today’s organizations don’t just want streaming analytics or analytics at rest. They want to operationalize analytics insights: capture streams (both the raw input and the resulting predictions), analyze them to generate new insights, and then operationalize those insights. Streaming analytics customers will be more successful – and more satisfied in the long term – doing the full analytics round trip, and TIBCO has the tools to do it.

Learn more about TIBCO StreamBase here.

Download a complimentary copy of the report here.

TIBCO is focused on insights. Not the garden variety insights that lay dormant and unactionable on someone’s desk. Rather, TIBCO focuses on perishable insights that companies must act upon immediately to retain customers, remove friction from business processes, and prevent logistics chains from stopping cold. —Excerpt from The Forrester Wave: Streaming Analytics, Q3 2017


The TIBCO Blog

Big Data 101: Dummy’s Guide to Batch vs. Streaming Data

Are you trying to understand Big Data and data analytics, but are confused by the difference between stream processing and batch data processing? If so, this article’s for you!

Batch Processing vs. Stream Processing

The distinction between batch processing and stream processing is one of the most fundamental principles within the Big Data world.

There is no official definition of these two terms, but when most people use them, they mean the following:

  • Under the batch processing model, a set of data is collected over time, then fed into an analytics system. In other words, you collect a batch of information, then send it in for processing.
  • Under the streaming model, data is fed into analytics tools piece-by-piece. The processing is usually done in real time.

Those are the basic definitions. To illustrate the concept better, let’s look at the reasons why you’d use batch processing or streaming, and examples of use cases for each one.
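The two models can be sketched in a few lines of plain Python (an illustration only, not any particular framework's API): the batch version waits for the whole set and returns one answer, while the streaming version produces a result after every piece of data.

```python
transactions = [120, 45, 300, 80, 95]  # hypothetical data points

# Batch model: collect the full set, then analyze it in one pass.
def batch_total(batch):
    return sum(batch)

# Streaming model: update the result as each record arrives.
def stream_totals(events):
    running = 0
    for amount in events:
        running += amount
        yield running  # a fresh result is available after every event

print(batch_total(transactions))          # one answer at the end: 640
print(list(stream_totals(transactions)))  # answers as data arrives
```

Same data, same arithmetic; the difference is purely *when* results become available.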


Batch Processing Purposes and Use Cases

Batch processing is most often used when dealing with very large amounts of data, and/or when data sources are legacy systems that are not capable of delivering data in streams.

Data generated on mainframes is a good example of data that, by default, is processed in batch form. Accessing and integrating mainframe data into modern analytics environments takes time, which makes turning it into streaming data unfeasible in most cases.


Batch processing works well in situations where you don’t need real-time analytics results, and when it is more important to process large volumes of information than it is to get fast analytics results (although data streams can involve “big” data, too – batch processing is not a strict requirement for working with large amounts of data).

Stream Processing Purposes and Use Cases

Stream processing is key if you want analytics results in real time. By building data streams, you can feed data into analytics tools as soon as it is generated and get near-instant analytics results using platforms like Spark Streaming.

Stream processing is useful for tasks like fraud detection. If you stream-process transaction data, you can detect anomalies that signal fraud in real time, then stop fraudulent transactions before they are completed.
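As a toy illustration (not a real fraud engine), here is a plain-Python sketch of that idea: each transaction is checked the moment it arrives, using a made-up rule that flags anything more than three times the running average.

```python
# Flag a transaction as it streams in, before it completes.
# The 3x-running-average rule is an arbitrary example threshold.
def detect_anomalies(stream):
    total, count, flagged = 0.0, 0, []
    for tx in stream:
        if count and tx > 3 * (total / count):
            flagged.append(tx)  # in a real system: block the transaction here
        total += tx
        count += 1
    return flagged

print(detect_anomalies([20, 25, 22, 400, 21]))  # [400]
```

A batch job would find the same anomaly hours later; stream processing finds it while the transaction can still be stopped.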


Turning Batch Data into Streaming Data

As noted, the nature of your data sources plays a big role in defining whether the data is suited for batch or streaming processing.

That doesn’t mean, however, that there’s nothing you can do to turn batch data into streaming data to take advantage of real-time analytics. If you’re working with legacy data sources like mainframes, you can use a tool like DMX-h to automate the data access and integration process and turn your mainframe batch data into streaming data.

This can be very useful because by setting up streaming, you can do things with your data that would not be possible with batch processing. You can obtain faster results and react to problems or opportunities before you lose the ability to leverage them.

To learn more about how Syncsort’s data tools can help you make the most of your data – and develop an agile data management strategy – download our new eBook: The New Rules for Your Data Landscape.



Syncsort + Trillium Software Blog

Expert Interview (Part 2): Databricks’ Reynold Xin on Structured Streaming, Apache Kafka and the Future of Spark

The new major version release of Spark has been getting a lot of attention in the Big Data community. One of the most significant strides forward has been the introduction of Structured Streaming.

At the last Strata + Hadoop World in San Jose, Syncsort’s Big Data Product Manager, Paige Roberts sat down with Reynold Xin (@rxin) to get the details on the driving factors behind Spark 2.0 and its newest features. Reynold Xin is the Chief Architect for Spark core at Databricks and one of Spark’s founding fathers. He had just finished giving a presentation on the full history of Spark from taking inspirations from mainframe databases to the cutting edge features of Spark 2.0.

In part 1 of this interview, he talked about some of the driving factors behind this major change in Spark. In today’s part 2, Reynold Xin gives us some good information on the differences between stream and Structured Streaming; how to integrate Structured Streaming with Apache Kafka; and some hints about the future of Spark.


Paige Roberts: I’m still learning about Structured Streaming. Can you contrast Structured Streaming versus stream? Essentially, what’s the difference?

Reynold Xin: Sure. In many ways, you can think of it as the RDD API and the DataFrame API. The old stream was built on top of the RDD API. The way it works is to keep re-running the RDD API over and over again. This fact is reflected in the API itself.

Whereas Structured Streaming moves the API to a higher level. It just asks the user, “What kind of business logic do you want to happen?” And then the engine will automatically incrementalize the operation. For example, in Structured Streaming, you just say, “I want a running sum on my data.” Versus if you go back to Spark Streaming, you have to think, “How do I compute a running sum? Well, the way I compute a running sum is for each of the batches to compute a sum, and then I’ll be summing them all up myself.” So, this is one big difference.
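His running-sum example can be mimicked in plain Python (an analogy only, not the Spark API): the first version stitches per-batch sums together by hand, the way old Spark Streaming users had to, while the second just declares an accumulation and lets a library play the role of the engine.

```python
from itertools import accumulate

batches = [[1, 2, 3], [4, 5], [6]]  # made-up micro-batches

# Old style: the user writes the incrementalization themselves.
manual_running_sum = []
acc = 0
for batch in batches:
    acc += sum(batch)            # per-batch sum, stitched together by hand
    manual_running_sum.append(acc)

# Declarative style: state *what* you want (a running sum) and let the
# "engine" (here, itertools.accumulate) incrementalize it.
declarative = list(accumulate(sum(b) for b in batches))

print(manual_running_sum)  # [6, 15, 21]
print(declarative)         # [6, 15, 21]
```

Same result either way; the point is who carries the burden of incremental state, the user or the engine.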

Related: What Spark “Structured Streaming” Really Has to Offer

The other big difference is Structured Streaming takes the transactional concept as a first-class citizen. Essentially, the users don’t have to worry about how to guarantee exactly once delivery. And, data integrity is a first-class concern. We made it very difficult for users to screw it up. Because in the past we have seen with Spark Streaming, while exactly once was possible to do, often the users would screw up data integrity accidentally.

Last, but not least, the API and the integration with the batch component, it just works with batch. The API is the same, so you can write to the same data destination. You can read it back directly. You get stuff that actually makes sense. It’s just a lot easier.

Roberts: Yeah. Okay, um, so again more ease of use, and I see the emphasis on data integrity. You also get the batch and streaming together. That’s nice, that you don’t have to rewrite. I guess before, Spark Streaming was very micro-batch, and it sounds like Structured Streaming can do true streaming processing?

Xin: From the API point of view there is no concept of a batch. Now from the execution engine point of view, it is still going through micro-batching. But the comparison of micro-batch and true streaming is kind of a misnomer in my mind. Typically, people think if you process one event at a time, that’s real streaming.

If you do three events at a time …

Right, if you do a bunch of events at a time, it’s batch. But in reality, everything else out there is batch. There’s no true event-at-a-time engine, because processing one event at a time has enormous overhead. It’s basically not practical when you have a huge amount of data. So, every other engine actually batches to some degree.

Including engines like Storm, Apex and Flink?

Yes, absolutely. All of them batch. They all batch at some point.

Alright. Okay, so I saw in your presentation that you guys integrate with Kafka really well.


Is that integration difficult to accomplish? I mean if you’re feeding into Kafka, you’re going into Structured Streaming, and you’re using the DataFrame API, is there a special procedure or something?

It just works out of the box. We took care of all the details so the user doesn’t have to worry about it. All they need to do is call spark.readStream with the Kafka stream information, put in the topic you want to subscribe to, and now you’ve got a DataFrame.

That’s really simple.

In many cases, it even automatically infers a schema. For example, if you have JSON data coming in, Spark will infer the schema automatically. You usually don’t even have to declare it. Although for sanity’s sake, I would recommend users do declare it, because you can’t guarantee that your data will be perfect, and you’d want the right error to surface when it is not.

Does it integrate at all with the schema registry for Kafka?

It doesn’t currently integrate with that part. I think that part is fairly new.

It is very new, yeah. Okay well, you mentioned the Tungsten project, and I saw you said something about, in the future, looking to move into other execution platforms like GPUs, and other ways to make that even more efficient, or at least more flexible. How far ahead is that? Is that like way ahead or…

It’s in an exploratory stage right now. And there are also different teams outside of Databricks looking at that. It’s a pretty major project. We should only do it if it makes sense. For example, sometimes it might not make sense at all because the time is not spent in a specific part of the processing but rather in, for example, reading IO. What we have found, at least for a lot of the Databricks projects, is that IO is currently the biggest bottleneck, so we have a lot of work coming that will address IO performance. And then maybe processing becomes the next bottleneck.

Yeah, it seems to ricochet back and forth. It’s CPU bound, now it’s I/O bound, now it’s CPU bound …

When you optimize one, the other one becomes the bottleneck.

Yeah. So, is there anything exciting coming up fairly soon that you would like to talk about?

Oh, yeah. For open source Spark, I think there are a few things. One is that Structured Streaming will GA, hopefully soon. It will be a pretty important milestone for the project. Another thing is we are looking at how we can make Spark more usable on essentially a single-node laptop. This includes, for example, being able to publish Spark to the Python package index. So users can just run pip install pyspark, and Spark shows up on their laptop. It’s becoming more efficient, and this broadens the addressable users for the open source project. It would be really nice if, with a single tool, you could process a small amount of data on your laptop, and then when you want to scale up to a larger amount of data, it just runs on the cloud.

Being able to move from laptop to server to cluster to cloud is something Syncsort has been all about for a while now so we’re really happy to see that. We call it Design Once, Deploy Anywhere. So, that’s great to hear! Is there anything else you wanted to mention?

Yes. Come to Spark Summit!

[laughing] Come to Spark Summit! Alright! Yeah, some of our guys were at Spark Summit East so we’ll probably be at the next one.

This one will probably be much larger.

Download Syncsort’s latest white paper, “Accessing and Integrating Mainframe Application Data with Hadoop and Spark,” to learn about the architecture and technical capabilities that make Syncsort DMX-h the best solution for accessing the most complex application data from mainframes and integrating that data using Hadoop.


Syncsort blog

Setting up data alerts on streaming data sets

Streaming data sets in Power BI are a cool feature that allows you to analyze data as it occurs. I was playing around with setting up this cool Twitter demo using Flow, as described here by Sirui: https://powerbi.microsoft.com/en-us/blog/push-rows-to-a-power-bi-streaming-dataset-without-writing-any-code-using-microsoft-flow/ But I was thinking, wouldn’t it be cool if I could get alerts based on the data that comes in? For example, I want to get an alert when I get more than 20 negative Power BI tweets in the last hour. Unfortunately, you cannot create measures at this time to add any logic, but there is a way. Let’s take a look.

If you follow the above instructions you will end up with a dataset in Power BI that gets fed tweets and their sentiments in real time. One change I made to the flow above is that I turned on “Historic data analysis”:


This gives me a dataset that I can build reports on and use to analyze past data, similar to a push API dataset; more on this here.

So now that I have this dataset, I can start creating reports and dashboards (you can also add more data, as described here, to make things a bit more interesting):


and pin them to your dashboard:


So far so good, but now I want to see only the count of tweets with sentiment <= 0.5 in the last hour. Here is where the trouble starts, as I can’t express “last hour” in the report designer. Luckily there is another smart feature that will help us here, called Q&A: I can just ask the question.


This immediately gives me the answer I need, just not completely in the right shape, as I need this to be a single card, not a chart, for data alerts to work. There is also an option for me to change this manually, so in this case I open the viz pane and select card:


Now I pin it to the dashboard and after renaming the tile I have the number of negative tweets in the last hour:


Now as a last step I can configure my data alert to send me an alert when I get more than 20 negative tweets in an hour:


Done! This again goes to show you the power of Q&A. Pretty cool scenario and NO code required …


Kasper On BI

Expert Interview (Part 2): Sean Anderson Talks about Spark Structured Streaming and Cloud Support

In Part 1, Cloudera’s Sean Anderson (@SeanAndersonBD), summarized what’s new in Spark 2.0. In Part 2, he talks more about new features for Spark Structured Streaming, including how unified APIs simplify support for streaming and batch workloads, and support for Spark in the Cloud.

In Spark 2.0, the ecosystem combined the functional APIs, and now you have a unified API for both batch and streaming jobs. It’s pretty nice to not have to use different interfaces to achieve this. There’s still native language support, and they are still very simplified and easy-to-use APIs, but for both of those types of workloads.


Roberts: Ooh! Streaming and batch together in one interface is something Syncsort has been pushing for a while! That’s great to hear. Very validating.

Anderson: Then the last improvement was around Spark Structured Streaming, which is a streaming API that runs on top of Spark SQL. That generally gives us better performance on micro-batch or streaming workloads, and really helps with things like out of order data handling.

There was this issue with Spark Streaming before where you may have outputs that resolve themselves quicker than the actual inputs or variables. So you have a lot of really messy out of order data that people had to come up with homegrown solutions to address.

And now that Spark Structured Streaming essentially models the stream as a table whose rows extend forever, you can really do that out-of-order data handling a lot better.

Related: Syncsort goes native for Hadoop and Spark mainframe data integration

Streaming and batch seem like they’ve always been two separate things, and they’re becoming more and more just two different ways to handle data. We are also seeing a lot of push towards Cloud. What else are you seeing coming up that looks exciting?

For us, really understanding how we guide our customers on deploying in the Cloud is great. There’s persistent clusters, there’s transient clusters. For ETL, what’s the best design pattern for that? For exploratory data science, what’s the best for that? For machine learning, what’s the best for cloud based scoring? So giving customers some guidance on those aspects is key.


Recently, we announced S3 integration for Apache Spark which allows us to run Spark jobs on data that already lives in S3. The transient aspects of clusters makes it very easy to just spin up compute resources, and run a Spark job on data that lives in S3. And then you don’t have to spend all that time moving the data and going through all the manual work on the front end.

Really work on the data right where it is.

Exactly. That’s Spark in the Cloud.

Syncsort recently announced support for Spark 2.0 in our DMX-h Intelligent Execution (IX) capabilities. Be sure to check that out, and see what the experts have to say about Spark in our recent eBook.  

Also, be sure to read the third and final part of this interview on Friday. Paige and Sean talk about two new projects that Cloudera is excited about, Apache Livy and Apache Spot.


Syncsort blog

Data loading into HDFS – Part 3. Streaming data loading

In my previous blogs, I already wrote about data loading into HDFS. In the first blog, I covered data loading from generic servers to HDFS. The second blog was devoted to offloading data from Oracle RDBMS. Here I want to explain how to load streaming data into Hadoop. Before anything else, I want to note that I will not cover Oracle GoldenGate for Big Data here, just because it deserves a dedicated blog post. Today I’m going to talk about Flume and Kafka.

What is Kafka? 

Kafka is a distributed service bus. Ok, but what is a service bus? Let’s imagine that you have a few data systems, and each one needs data from the others. You could link them directly, like this:


but this becomes very hard to manage. Instead, you could have one centralized system that accumulates data from all sources and is a single point of contact for all systems. Like this:


What is Flume? 

“Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.” – this definition from the documentation explains pretty well what Flume is. Flume was historically developed for loading data into HDFS. But why couldn’t I just use the Hadoop client?

Challenge 1. Small files.

Hadoop was designed for storing large files, and despite a lot of optimizations around the NameNode in the last few years, it’s still recommended to store only big files. If your source has a lot of small files, Flume can collect them and flush the collection in batch mode, as a single big file. I always use the analogy of a glass and drops: you can collect one million drops in one glass, and after this you have one glass of water instead of one million drops.
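The glass-of-drops idea can be sketched in a few lines of Python (an illustration of the buffering pattern, not Flume's actual implementation; the class name and flush size are made up):

```python
# Buffer small records ("drops") and flush them as one large batch
# ("a glass"), the way Flume turns many small files into one big write.
class Collector:
    def __init__(self, flush_size, sink):
        self.flush_size, self.sink, self.buffer = flush_size, sink, []

    def collect(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink.append(list(self.buffer))  # one big write, not many small ones
            self.buffer.clear()

hdfs_writes = []                      # stands in for HDFS
c = Collector(flush_size=3, sink=hdfs_writes)
for i in range(7):
    c.collect(f"drop-{i}")
c.flush()                             # flush the remainder (timeout/shutdown)
print(len(hdfs_writes))               # 3 writes instead of 7
```

Seven tiny records become three writes; at Flume scale, millions of small files become a handful of NameNode-friendly big ones.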

Challenge 2. Lots of data sources

Let’s imagine that I have an application (or even two, on two different servers) that produces files which I want to load into HDFS.


Life is good; if the files are large enough, this is not going to be a problem.

But now let’s imagine that I have 1000 application servers and each one wants to write data into HDFS. Even if the files are large, this workload will collapse your Hadoop cluster. If you don’t believe it – just try it (but not on a production cluster!). So, we have to have something in between HDFS and our data sources.


Now it’s time for Flume. You could build a two-tier architecture: the first tier collects data from different sources, and the second one aggregates it and loads it into HDFS.


In my example I depict 1000 sources, handled by 100 Flume servers on the first tier, which load data onto the second tier, which connects directly to HDFS – and in my example that is only two connections, which is affordable. Here you can find more details; I just want to add that the general practice is to use one aggregation agent for 4-16 client agents.

I also want to note that it’s good practice to use an Avro sink when you move data from one tier to the next. Here is an example of the Flume config file:



agent.sinks = avroSink
agent.sinks.avroSink.type = avro
agent.sinks.avroSink.channel = memory
agent.sinks.avroSink.hostname = avrosrchost.example.com
agent.sinks.avroSink.port = 4353



Kafka Architecture.

A deep technical presentation about Kafka you can find here and here (actually, I got a few screens from there). A very interesting technical video you can find here. In my article, I will just remind you of the key terms and concepts.


Producer – a process that writes data into the Kafka cluster. It could be part of an application, or edge nodes could play this role.

Consumer – a process that reads data from the Kafka cluster.

Broker – a member of the Kafka cluster. A set of brokers is the Kafka cluster.
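To make these terms concrete, here is a toy in-memory "bus" in Python – not the Kafka API, just an illustration of producers appending to a topic log while consumer groups pull from it at their own pace, each keeping its own read position:

```python
from collections import defaultdict

class MiniBus:
    """Toy service bus: append-only topic logs, pull-based consumers."""
    def __init__(self):
        self.topics = defaultdict(list)   # topic -> append-only log
        self.offsets = defaultdict(int)   # (group, topic) -> read position

    def produce(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, group, topic):
        pos = self.offsets[(group, topic)]
        new = self.topics[topic][pos:]    # pull everything since last read
        self.offsets[(group, topic)] = len(self.topics[topic])
        return new

bus = MiniBus()
bus.produce("clicks", {"user": 1})
bus.produce("clicks", {"user": 2})
print(bus.consume("analytics", "clicks"))  # both messages
print(bus.consume("analytics", "clicks"))  # [] - nothing new to pull
```

Note that a second consumer group would start from the beginning of the log independently; that decoupling is exactly why a bus beats point-to-point links between systems.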

Flume Architecture.

You can find a lot of useful information about Flume in this book; here I just highlight the key concepts.


Flume has 3 major components:

1) Source – where I get the data.

2) Channel – where I buffer it. It could be memory or disk, for example.

3) Sink – where I load my data. For example, it could be another tier of Flume agents, HDFS or HBase.


Between the source and the channel, there are two minor components: the Interceptor and the Selector.

With an Interceptor you can do simple processing; with a Selector you can choose a channel depending on the message header.
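The components above can be mimicked as a minimal Python pipeline – purely to illustrate the source → interceptor → channel → sink data flow; real Flume agents are configuration-driven, and all names here are made up:

```python
# Source: where the data comes from.
def source():
    yield from ["event-a", "event-b", "event-c"]

# Interceptor: simple per-event processing between source and channel.
def interceptor(event):
    return {"body": event, "host": "web01"}   # enrich with a (made-up) header

# Channel: the buffer (a memory channel here).
channel = []
for e in source():
    channel.append(interceptor(e))

# Sink: where the data lands (HDFS or another Flume tier in real life).
sink = []
while channel:
    sink.append(channel.pop(0))

print(len(sink))  # 3 enriched events delivered
```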

Flume and Kafka similarities and differences.

It’s a frequent question: “What is the difference between Flume and Kafka?” The answer could be very long, but let me briefly explain the key points.

1) Pull and Push.

Flume accumulates data up to some condition (number of events, size of the buffer, or a timeout) and then pushes it onward.

Kafka accumulates data until a client initiates a read, so the client pulls data whenever it wants.

2)  Data processing

Flume can do simple transformations via interceptors.

Kafka doesn’t do any data processing; it just stores the data.

3) Clustering

Flume is usually a set of independent single instances.

Kafka is a cluster, which means it has benefits such as high availability and scalability out of the box, without extra effort.

4) Message size

Flume doesn’t have any obvious restrictions on the size of a message.

Kafka was designed for messages of a few KB.

5) Coding vs Configuring

Flume is usually a configurable tool (users usually don’t write code; instead, they use its configuration capabilities).

With Kafka, you have to write code to load/unload the data.


Many customers are thinking about choosing the right technology – either Flume or Kafka – for handling their streaming data. Stop choosing: use both. It’s quite a common use case, and it’s named Flafka. A good explanation and nice pictures you can find here (actually, I borrowed a few screens from there).

First of all, Flafka is not a dedicated project. It’s just bunch of Java classes for integration Flume and Kafka.


Now Kafka can act as a source for Flume, via this Flume config line:

flume1.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource

or as a channel, via the following directive:

flume1.channels.kafka-channel-1.type = org.apache.flume.channel.kafka.KafkaChannel 

Use case 1. Kafka as a source or channel

If you have Kafka as an enterprise service bus (see my example above), you may want to load data from the service bus into HDFS. You could do this by writing a Java program, but if you’d rather not, you can use Kafka as a Flume source.


In this case, Kafka can also be useful for smoothing peak loads, while Flume provides flexible routing.

You could also use Kafka as a Flume channel for high-availability purposes (it’s distributed by design).
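A sketch of use case 1, then: a Kafka source feeding an HDFS sink through a memory channel. The topic, ZooKeeper address and path are hypothetical, and note that property names changed in later Flume releases (kafka.bootstrap.servers and kafka.topics replace the older zookeeperConnect and topic shown here):

```properties
flume1.sources = kafka-source-1
flume1.channels = channel-1
flume1.sinks = hdfs-sink-1

# Pull messages out of a Kafka topic
flume1.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource
flume1.sources.kafka-source-1.zookeeperConnect = zk01:2181
flume1.sources.kafka-source-1.topic = esb-events
flume1.sources.kafka-source-1.channels = channel-1

flume1.channels.channel-1.type = memory

# Land them in HDFS
flume1.sinks.hdfs-sink-1.type = hdfs
flume1.sinks.hdfs-sink-1.hdfs.path = /flafka/events
flume1.sinks.hdfs-sink-1.channel = channel-1
```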

Use case 2. Kafka as a sink.

If you use Kafka as an enterprise service bus, you may want to load data into it. The native way for Kafka is a Java program, but if you feel it would be far more convenient with Flume (just a few config files), you have that option. All you need is to configure Kafka as a sink.
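A minimal sketch of the sink side, assuming the same agent as above (broker address and topic name are made up; newer Flume versions use kafka.bootstrap.servers and kafka.topic instead of brokerList and topic):

```properties
flume1.sinks.kafka-sink-1.type = org.apache.flume.sink.kafka.KafkaSink
flume1.sinks.kafka-sink-1.brokerList = kafka01:9092
flume1.sinks.kafka-sink-1.topic = esb-events
flume1.sinks.kafka-sink-1.channel = channel-1
```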


Use case 3. Flume as a tool to enrich data.

As I already said, Kafka can’t do any data processing: it just stores data without any transformation. You can use Flume as a way to add extra information to your Kafka messages. To do this, define Kafka as a source, implement an interceptor that adds the information to each message, and write the result back to Kafka in a different topic.
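A hedged sketch of that pipeline, using Flume's built-in static interceptor to attach a header (topic names and the header are hypothetical; a custom interceptor class would be needed for richer enrichment):

```properties
# Read from one topic...
flume1.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource
flume1.sources.kafka-source-1.topic = raw-events
flume1.sources.kafka-source-1.channels = channel-1

# ...add a fixed header to every event with the static interceptor
flume1.sources.kafka-source-1.interceptors = enrich
flume1.sources.kafka-source-1.interceptors.enrich.type = static
flume1.sources.kafka-source-1.interceptors.enrich.key = source-dc
flume1.sources.kafka-source-1.interceptors.enrich.value = dc-east

# ...and write the enriched events to a different topic
flume1.sinks.kafka-sink-1.type = org.apache.flume.sink.kafka.KafkaSink
flume1.sinks.kafka-sink-1.topic = enriched-events
flume1.sinks.kafka-sink-1.channel = channel-1
```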



To sum up: there are two major tools for loading streaming data, Flume and Kafka. There is no single right answer on which to use, because each tool has its own advantages and disadvantages. That, in general, is why Flafka was created – it’s simply a combination of the two tools.


The Data Warehouse Insider

Drizzle on tap to spur Spark Streaming architecture


Innovation in Spark Streaming architecture continued apace last week as Spark originator Databricks discussed an upcoming add-on expected to reduce streaming latency.

Based on ongoing work by a lab at the University of California, Berkeley, elements of what is being called the Drizzle framework are expected to become part of Apache Spark later this year, according to the company.

The anticipated streaming update is part of Databricks’ larger efforts to provide a platform for broad new analytics uses. Drizzle is intended to help promote users’ moves to so-called Lambda architectures that combine batch and real-time data processing approaches.

Spark trending now at Netflix

The move to embrace both batch and real-time processing isn’t an easy one, even for fast-flying web companies. But it is a natural step, according to Shriya Arora, a senior data engineer at Netflix.

Arora is part of a Netflix team that employs Spark processing and streaming to transform and push data to data scientists who develop algorithms that personalize the company’s movie recommendations to subscribers. As Netflix converts some applications from batch to real time, she’s working to fine-tune Spark Streaming to ensure there are monitoring alerts that warn when streaming jobs may fail.

“Streaming is better than having long-running jobs, but it comes at a cost. For example, streaming failures have to be addressed immediately. If an application is down too long, you run into data loss,” she told an audience at last week’s Spark Summit East 2017 in Boston.

Real time means ‘why wait?’

The real-time effort is worthwhile, however, because it can better align Netflix’s movie recommendations with the immediate interests of customers. “Trending now” viewing choices, for example, can be more completely up to date, Arora said. “Why wait 24 hours when you can pick up the new information in an hour?”

But the Spark Streaming architecture today doesn’t support pure event streaming — it still has roots in a “micro-batching” formula that rapidly processes small batches of data. So, there are cases where time-sensitive applications might better opt for streaming as supported by alternative frameworks such as Flink or Storm, Arora said.
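The micro-batching formula Arora refers to can be illustrated with a toy sketch in plain Python (this is not Spark code; the timestamps and the batch interval are made up for illustration):

```python
from typing import List, Tuple

Event = Tuple[int, str]  # (timestamp in ms, payload)

def micro_batch(events: List[Event], interval_ms: int) -> List[List[Event]]:
    """Group events into fixed-interval buckets, mimicking the micro-batch
    model: records are not processed one at a time as they arrive, but
    collected and handled together once per interval."""
    buckets = {}
    for ts, payload in events:
        buckets.setdefault(ts // interval_ms, []).append((ts, payload))
    return [buckets[k] for k in sorted(buckets)]

# With a 250 ms interval, the first three events land in one micro-batch
# and the fourth in the next; an event arriving just after a batch closes
# waits up to a full interval, which is the latency pure event streaming
# avoids.
events = [(0, "a"), (120, "b"), (130, "c"), (480, "d")]
print(micro_batch(events, 250))
```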

Such use cases are a prime target for Drizzle, a project within the UC Berkeley RISELab — itself a descendant of the AMPLab project that begat Apache Spark. [RISE stands for Real-time Intelligence with Secure Execution.]

Drizzle aims to unify record-at-a-time streaming with micro-batch models, and it is in some part an answer to Flink, an emerging streaming architecture that has shown performance benefits over present Spark Streaming.

Hearing Flink steps?

As he discussed Drizzle in a Spark Summit keynote, Ion Stoica didn’t try to cover up Spark Streaming architecture’s present latency shortcomings in streaming versus Apache Flink. He said Drizzle is intended to reduce Spark Streaming’s performance latency by about 10 times.

Stoica is executive chairman and a co-founder of Databricks, and is also a professor of computer science at UC Berkeley and a part of the RISELab. In graphs, he showed Spark trailing Apache Flink by hundreds of milliseconds in handling event throughput.

He also showed data in which early versions of Drizzle and a companion Drizzle-Opt execution engine slightly improve upon present Apache Flink performance. While details were sparse, Drizzle architecture as depicted on the RISELab’s website is meant to “decouple execution granularity from coordination granularity” for workloads on clusters.

In an interview, Spark inventor Matei Zaharia, who is CTO at Databricks and another co-founder — as well as Stoica’s former grad student — said parts of Drizzle would likely appear in Apache Spark during the third quarter of 2017.

Pursuing a unified model

Both Stoica and Zaharia emphasized that recent advances in streaming technology for Spark, including a Structured Streaming engine and API added as part of Spark 2.0 last year, have focused on enabling a more cohesive approach for programmers who combine real-time and batch data processing on a single platform. They positioned Spark overall as a unified approach to diverse data management and analytical needs that include ETL, machine learning and SQL querying, as well as streaming.

“We think of Spark as the infrastructure for machine learning, which itself is really a small part of the entire workflow,” Stoica said. “You have to clean the data, and transform it. Then, at the end, when it is curated, you apply machine learning algorithms on top.”

This unified approach has merit, according to a machine learning user at a marketing analytics firm who attended the Boston event.

“Previous to our use of Spark, we had ETL, machine learning and other analytics processes, and they were all on different software stacks,” said Saket Mengle, senior principal data scientist at Boston-based DataXu Inc. “Spark allows us to put this on one stack. It is something you have to tweak, but uniformity is good.”

Spark in context

Improvements to Spark Streaming should be viewed in the context of Spark’s overall analytical adoption, said one industry analyst on hand at the conference.

“Spark’s long-term appeal has been as an ensemble of analytical approaches, and its ability to address a variety of workloads,” said Doug Henschen, a principal analyst at Constellation Research Inc.

In a blog post following the conference, Henschen remarked that Spark was progressing more quickly than was predecessor Hadoop at a comparable stage of development, and that it promises “wider hands-on use” by a variety of developers and data scientists.

One measure of Spark’s progress is its adoption by vendors beyond Databricks, he said. In fact, the open source version, Apache Spark, is offered by traditional enterprise players like IBM and Oracle, as well as Hadoop distribution providers Cloudera, Hortonworks and MapR.

It’s noteworthy, too, that Spark is offered on the cloud by the likes of Amazon, Google, Microsoft and others. So far, Databricks has focused its efforts on providing cloud services, which is where its new approach to streaming will likely first be tested.


SearchBusinessAnalytics: BI, CPM and analytics news, tips and resources

Streaming data analytics puts real-time pressure on project teams


When I worked at a fast-food restaurant in high school, a co-worker friend and I decided its motto should be “Speed, Not Perfection.” We silkscreened t-shirts for the two of us with that phrase embedded in the corporate logo — two smart-aleck teenagers gently sticking it to the man.

Nowadays, data management and analytics teams increasingly find themselves being asked to fulfill the speed part to enable real-time data analysis in their organizations. But they don’t have the luxury of being able to get by with the same sort of occasional sloppiness that my friend and I did in slapping burgers together. And that puts them under a lot of pressure, because creating a real-time architecture and using it to run streaming data analytics applications is a complicated undertaking.

For starters, streaming analytics systems don’t come in a box — not even a large one. Setting them up is an artisanal process that requires prospective users to piece together various data processing technologies and analytics tools to meet their particular application needs. In addition, the technology options have increased significantly over the past few years, thanks largely to the emergence of multiple big data platforms that provide stream processing capabilities in different ways.

A plethora of streaming platforms

Spark Streaming, Flink, Storm, Samza, Pulsar, Druid, Kylin — they’re all open source processing engines vying for a piece of the data streaming and real-time analytics action. Even Kafka, originally a messaging technology for feeding data from one system to another, now also functions as a stream processing platform in its own right. In addition to the open source tools, various IT vendors offer more traditional complex event processing systems that began emerging in the late 1990s. Specialized databases — in-memory ones, for example — are also built to handle streaming data analytics.

On the analytics software side, broader use of machine learning algorithms is making it more feasible to build predictive models that can churn through large amounts of streaming data on things like financial transactions, equipment performance and internet clickstreams. But again, there are a multitude of technology choices to consider: tools from mainstream analytics vendors and machine learning specialists, cloud-based services, open source platforms.

As with building a big data architecture in general, the surfeit of software available to underpin a real-time analytics architecture can be a boon for users — or mire them in a veritable boondoggle of a deployment. Finding the right technologies and combining them into an effective analytics framework is a perilous process; missteps can send a project careening off the intended path.

Streaming forward on real-time projects

That isn’t stopping companies, particularly large ones with lots of data and ample IT resources, from giving it a go. In an ongoing survey being conducted by SearchBusinessAnalytics publisher TechTarget Inc., 28.1% of the 7,000-plus IT, analytics and business professionals who had responded as of mid-January said their organizations were looking to invest in real-time analytics technology over the ensuing 12 months. In addition, 13.4% said they planned to buy stream processing software.

Why do it? The ability to pull useful information out of data streams in real time lets business operations act fast, and that clearly can be to their advantage. Predictive analytics applications run against streaming data on the web activity of consumers can drive website personalization programs and targeted online advertising and marketing campaigns. Fraud detection, predictive maintenance and satellite imaging are other applications that can benefit from streaming data analytics.

In many cases, real time might be the only time to take advantage of what’s in the data being collected. Streaming analytics tools point to “perishable insights” that need to be acted on quickly before the opportunity is lost, Forrester Research analyst Mike Gualtieri and then-colleague Rowan Curran wrote in a 2016 Forrester Wave report. And you can’t get those kinds of insights simply by throwing data into a Hadoop cluster, as Darryl Smith, chief data platform architect at Dell EMC, said during a presentation on the data storage vendor’s real-time streaming efforts at Strata + Hadoop World 2016 in New York.

Speed is indeed a wonderful thing. Just be sure your team has a well-thought-out plan before turning up the heat on a streaming analytics initiative. Otherwise, it might end up getting flame-grilled by disappointed business executives.

