Tag Archives: Streaming

TIBCO Named a Leader in The Forrester Wave™: Streaming Analytics, Q3 2017


Forrester has named TIBCO a Leader in The Forrester Wave™: Streaming Analytics, Q3 2017 among thirteen vendors that were evaluated. For the Strategy category, we received a 4.9 out of a possible 5 points.

TIBCO StreamBase is also recognized for “[unifying] real-time analytics” with a “full-featured streaming analytics solution that integrates with applications to automate actions and also offers Live DataMart to create a real-time visual command center.”

Today’s organizations don’t just want streaming analytics or analytics at rest. They want the ability to operationalize analytics insights, and the ability to capture streams, both the raw input and the resulting predictions, and to analyze them to generate new insights, which they then operationalize as well. Streaming analytics customers will be more successful, and more satisfied in the long term, doing the full analytics round trip, and TIBCO has the tools to do it.

Learn more about TIBCO StreamBase here.

Download a complimentary copy of the report here.

TIBCO is focused on insights. Not the garden variety insights that lay dormant and unactionable on someone’s desk. Rather, TIBCO focuses on perishable insights that companies must act upon immediately to retain customers, remove friction from business processes, and prevent logistics chains from stopping cold. —Excerpt from The Forrester Wave: Streaming Analytics, Q3 2017


The TIBCO Blog

Big Data 101: Dummy’s Guide to Batch vs. Streaming Data

Are you trying to understand Big Data and data analytics, but are confused by the difference between stream processing and batch data processing? If so, this article’s for you!

Batch Processing vs. Stream Processing

The distinction between batch processing and stream processing is one of the most fundamental principles within the Big Data world.

There is no official definition of these two terms, but when most people use them, they mean the following:

  • Under the batch processing model, a set of data is collected over time, then fed into an analytics system. In other words, you collect a batch of information, then send it in for processing.
  • Under the streaming model, data is fed into analytics tools piece-by-piece. The processing is usually done in real time.

Those are the basic definitions. To illustrate the concept better, let’s look at the reasons why you’d use batch processing or streaming, and examples of use cases for each one.
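
But first, to make the contrast concrete, here is a minimal, hypothetical Python sketch; the function names and data are purely illustrative and not tied to any particular tool:

# Batch model: collect a whole set of records first, then process them in one go.
def batch_total(amounts):
    collected = list(amounts)        # wait until the full batch has been gathered
    return sum(collected)            # then run the analysis once

# Streaming model: process each record as it arrives and keep the result up to date.
def stream_totals(amounts):
    running_total = 0.0
    for amount in amounts:           # records are handled one by one, as they come in
        running_total += amount
        yield running_total          # an up-to-date answer is available continuously

print(batch_total([5.0, 3.5, 7.25]))            # one answer, after all data is in
for total in stream_totals([5.0, 3.5, 7.25]):   # an answer after every record
    print(total)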


Batch Processing Purposes and Use Cases

Batch processing is most often used when dealing with very large amounts of data, and/or when data sources are legacy systems that are not capable of delivering data in streams.

Data generated on mainframes is a good example of data that, by default, is processed in batch form. Accessing and integrating mainframe data into modern analytics environments takes time, which makes it unfeasible to turn it into streaming data in most cases.


Batch processing works well in situations where you don’t need real-time analytics results, and when it is more important to process large volumes of information than it is to get fast analytics results (although data streams can involve “big” data, too – batch processing is not a strict requirement for working with large amounts of data).

Stream Processing Purposes and Use Cases

Stream processing is key if you want analytics results in real time. By building data streams, you can feed data into analytics tools as soon as it is generated and get near-instant analytics results using platforms like Spark Streaming.

Stream processing is useful for tasks like fraud detection. If you stream-process transaction data, you can detect anomalies that signal fraud in real time, then stop fraudulent transactions before they are completed.


Turning Batch Data into Streaming Data

As noted, the nature of your data sources plays a big role in defining whether the data is suited for batch or streaming processing.

That doesn’t mean, however, that there’s nothing you can do to turn batch data into streaming data to take advantage of real-time analytics. If you’re working with legacy data sources like mainframes, you can use a tool like DMX-h to automate the data access and integration process and turn your mainframe batch data into streaming data.

This can be very useful because, by setting up streaming, you can do things with your data that would not be possible with batch processing alone. You can obtain faster results and react to problems or opportunities before you lose the ability to act on them.

To learn more about how Syncsort’s data tools can help you make the most of your data and develop an agile data management strategy, download our new eBook: The New Rules for Your Data Landscape.



Syncsort + Trillium Software Blog

Expert Interview (Part 2): Databricks’ Reynold Xin on Structured Streaming, Apache Kafka and the Future of Spark

The new major version release of Spark has been getting a lot of attention in the Big Data community. One of the most significant strides forward has been the introduction of Structured Streaming.

At the last Strata + Hadoop World in San Jose, Syncsort’s Big Data Product Manager, Paige Roberts sat down with Reynold Xin (@rxin) to get the details on the driving factors behind Spark 2.0 and its newest features. Reynold Xin is the Chief Architect for Spark core at Databricks and one of Spark’s founding fathers. He had just finished giving a presentation on the full history of Spark from taking inspirations from mainframe databases to the cutting edge features of Spark 2.0.

In part 1 of this interview, he talked about some of the driving factors behind this major change in Spark. In today’s part 2, Reynold Xin gives us some good information on the differences between Spark Streaming and Structured Streaming, how to integrate Structured Streaming with Apache Kafka, and some hints about the future of Spark.


Paige Roberts: I’m still learning about Structured Streaming. Can you contrast Structured Streaming versus the old Spark Streaming? Essentially, what’s the difference?

Reynold Xin: Sure. In many ways, you can think of it like the RDD API versus the DataFrame API. The old Spark Streaming was built on top of the RDD API. The way it works is to keep re-running the RDD operations over and over again. This fact is reflected in the API itself.

Whereas Structured Streaming moves the API to a higher level. It just asks the user, “What kind of business logic do you want to happen?” And then the engine will automatically incrementalize the operation. For example, in Structured Streaming, you just say, “I want a running sum on my data.” Versus if you go back to Spark Streaming, you have to think, “How do I compute a running sum? Well, the way I compute a running sum is for each of the batches to compute a sum, and then I’ll be summing them all up myself.” So, this is one big difference.
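
As an illustration of that “running sum” idea (this is not code from the interview), a minimal PySpark sketch using the built-in rate source, which generates test rows so the example needs no external systems (Spark 2.2 or later):

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as sql_sum

spark = SparkSession.builder.appName("running-sum-sketch").getOrCreate()

# The rate source continuously produces rows with 'timestamp' and 'value' columns.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Declare the business logic once; the engine incrementalizes it across micro-batches.
running_sum = events.agg(sql_sum("value").alias("running_sum"))

query = (running_sum.writeStream
         .outputMode("complete")   # emit the full aggregate after each micro-batch
         .format("console")
         .start())
query.awaitTermination()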

Related: What Spark “Structured Streaming” Really Has to Offer

The other big difference is Structured Streaming takes the transactional concept as a first-class citizen. Essentially, the users don’t have to worry about how to guarantee exactly once delivery. And, data integrity is a first-class concern. We made it very difficult for users to screw it up. Because in the past we have seen with Spark Streaming, while exactly once was possible to do, often the users would screw up data integrity accidentally.

Last, but not least, the API and the integration with the batch component, it just works with batch. The API is the same, so you can write to the same data destination. You can read it back directly. You get stuff that actually makes sense. It’s just a lot easier.

Roberts: Yeah. Okay, um, so again more ease of use, and I see the emphasis on data integrity. You also get the batch and streaming together. That’s nice, that you don’t have to rewrite. I guess before, Spark Streaming was very micro-batch, and it sounds like Structured Streaming can do true streaming processing?

Xin: From the API point of view there is no concept of a batch. Now from the execution engine point of view, it is still going through micro-batching. But the comparison of micro-batch and true streaming is kind of a misnomer in my mind. Typically, people think if you do one event at a time, that’s real streaming.

If you do three events at a time …

Right, if you do a bunch of events at a time, it’s batch. But in reality, everything else out there is batch. There’s no true event-at-a-time engine, because processing one event at a time has enormous overhead. It’s basically not practical when you have a huge amount of data. So, every other engine actually batches to some degree.

Including engines like Storm, Apex and Flink?

Yes, absolutely. All of them batch. They all batch at some point.

Alright. Okay, so I saw in your presentation that you guys integrate with Kafka really well.

Yeah.

Is that integration difficult to accomplish? I mean if you’re feeding into Kafka, you’re going into Structured Streaming, and you’re using the DataFrame API, is there a special procedure or something?

It just works out of the box. We took care of all the details so the user doesn’t have to worry about it. All they need to do is spark.readstream and then the Kafka stream information, and put in the topic you want to subscribe to, and now you’ve got a DataFrame.

That’s really simple.

In many cases, it even automatically infers a schema. For example, if you have JSON data coming in, Spark will infer the schema automatically. You usually don’t even have to declare it. Although, for sanity’s sake, I would recommend users do declare it, because you can’t guarantee that your data will be perfect, and you’d want the right error to surface when it is not.
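
As a sketch of what that looks like in practice (the broker address, topic, and field names below are placeholders, and the spark-sql-kafka package must be on the classpath), reading a Kafka topic into a DataFrame with an explicitly declared JSON schema might be written like this:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

# Subscribe to a Kafka topic; each record arrives with binary key and value columns.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "transactions")
       .load())

# Declare the expected schema instead of relying on inference, as recommended above.
schema = (StructType()
          .add("id", StringType())
          .add("amount", DoubleType())
          .add("event_time", TimestampType()))

parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("data"))
          .select("data.*"))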

Does it integrate at all with the schema registry for Kafka?

It doesn’t currently integrate with that part. I think that part is fairly new.

It is very new, yeah. Okay well, you mentioned the Tungsten project, and I saw you said something about how, in the future, it’s looking to move onto other execution platforms like GPUs, and other ways to make that even more efficient, or at least more flexible. How far ahead is that? Is that like way ahead or…

It’s in an exploratory stage right now. And there are also different teams outside of Databricks looking at that. It’s a pretty major project. We should only do it if it makes sense. For example, sometimes it might not make sense at all because the time is not spent in a specific part of the processing, but rather in, for example, reading IO. What we have found, at least for a lot of the Databricks projects, is that IO is currently the biggest bottleneck, so we have a lot of work coming that will address IO performance. And then maybe processing becomes the next bottleneck.

Yeah, it seems to ricochet back and forth. It’s CPU bound, now it’s I/O bound, now it’s CPU bound …

When you optimize one more, the other one becomes the bottleneck.

Yeah. So, is there anything exciting coming up fairly soon that you would like to talk about?

Oh, yeah. For open source Spark, I think there are a few things. One is that Structured Streaming will GA, hopefully soon. It will be a pretty important milestone for the project. Another thing is we are looking at how we can make Spark more usable on essentially a single-node laptop. This includes, for example, being able to publish Spark to the Python package index, so users can just pip install pyspark and Spark shows up on their laptop. It’s becoming more efficient, and this broadens the addressable users for the open source project. It would be really nice if, with a single tool, you could process a small amount of data on your laptop, and then when you want to scale up to a larger amount of data, it just runs on the cloud.
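
For readers who want to try the laptop scenario he describes, a minimal local-mode session after installing PySpark looks roughly like this (the dataset and names are made up for the example):

# pip install pyspark
from pyspark.sql import SparkSession

# local[*] runs Spark inside this single process, using all local cores.
spark = SparkSession.builder.master("local[*]").appName("laptop-spark").getOrCreate()

df = spark.createDataFrame([(1, "small"), (2, "data"), (3, "small")], ["id", "label"])
df.groupBy("label").count().show()

# The same code can later run on a cluster or in the cloud by changing the master URL.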

Being able to move from laptop to server to cluster to cloud is something Syncsort has been all about for a while now so we’re really happy to see that. We call it Design Once, Deploy Anywhere. So, that’s great to hear! Is there anything else you wanted to mention?

Yes. Come to Spark Summit!

[laughing] Come to Spark Summit! Alright! Yeah, some of our guys were at Spark Summit East so we’ll probably be at the next one.

This one will probably be much larger.

Download Syncsort’s latest white paper, “Accessing and Integrating Mainframe Application Data with Hadoop and Spark,” to learn about the architecture and technical capabilities that make Syncsort DMX-h the best solution for accessing the most complex application data from mainframes and integrating that data using Hadoop.


Syncsort blog

Setting up data alerts on streaming data sets

Streaming datasets in Power BI are a cool feature that allows you to analyze data as it occurs. I was playing around with setting up the cool Twitter demo using Flow, as described here by Sirui: https://powerbi.microsoft.com/en-us/blog/push-rows-to-a-power-bi-streaming-dataset-without-writing-any-code-using-microsoft-flow/, but was thinking: wouldn’t it be cool if I could get alerts based on the data that comes in? For example, I want to get an alert when I get more than 20 negative Power BI tweets in the last hour. Unfortunately, you cannot create measures on this kind of dataset at this time to add any logic, but there is a way. Let’s take a look.

If you follow the above instructions you will end up with a dataset in Power BI that gets fed tweets and their sentiments in real time. One change I made to the flow above is that I turned on “Historic data analysis”:


This gives me a dataset that I can build reports on and use to analyze data from the past, similar to a push API dataset; more on this here.

So now that I have this dataset, I can start creating reports and dashboards (you can also add more data as described here to make things a bit more interesting):


and pin them to your dashboard:


So far so good, but now I want to see only the count of tweets with sentiment <= 0.5 in the last hour. Here is where the trouble starts, as I can’t express “last hour” in the report designer. Luckily, there is another smart feature that will help us here, called Q&A: I can just ask the question.


This immediately gives me the answer I need, just not completely in the right shape: for data alerts to work, I need this to be a single card, not a chart. There is also an option for me to change this manually, so in this case I open the visualizations pane and select the card visual:


Now I pin it to the dashboard and after renaming the tile I have the number of negative tweets in the last hour:


Now, as the last step, I can configure my data alert to send me an alert when I get more than 20 negative tweets in an hour:


Done! This again goes to show you the power of Q&A. Pretty cool scenario, and NO code required …


Kasper On BI

Expert Interview (Part 2): Sean Anderson Talks about Spark Structured Streaming and Cloud Support

In Part 1, Cloudera’s Sean Anderson (@SeanAndersonBD) summarized what’s new in Spark 2.0. In Part 2, he talks more about new features for Spark Structured Streaming, including how unified APIs simplify support for streaming and batch workloads, and about support for Spark in the Cloud.

Anderson: In Spark 2.0, the ecosystem combined the functional APIs, and now you have a unified API for both batch and streaming jobs. It’s pretty nice to not have to use different interfaces to achieve this. There’s still native language support, and they are still very simplified and easy-to-use APIs, but for both of those types of workloads.


Roberts: Ooh! Streaming and batch together in one interface is something Syncsort has been pushing for a while! That’s great to hear. Very validating.

Anderson: Then the last improvement was around Spark Structured Streaming, which is a streaming API that runs on top of Spark SQL. That generally gives us better performance on micro-batch or streaming workloads, and really helps with things like out of order data handling.

There was this issue with Spark Streaming before where you may have outputs that resolve themselves quicker than the actual inputs or variables. So you have a lot of really messy out of order data that people had to come up with homegrown solutions to address.

And now that Spark Structured Streaming essentially models the stream as a table whose rows are extended forever, you can really do that out-of-order data handling a lot better.
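
As a rough sketch of the out-of-order handling he mentions (this is not code from the interview), Structured Streaming lets you aggregate on event time and set a watermark that bounds how late data may arrive; the source, column names, and thresholds below are illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("late-data-sketch").getOrCreate()

# The built-in rate source stands in for a real stream; its 'timestamp' column acts as event time.
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

counts = (events
          .withWatermark("timestamp", "10 minutes")         # accept data up to 10 minutes late
          .groupBy(window(col("timestamp"), "5 minutes"))   # 5-minute event-time windows
          .count())

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()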

Related: Syncsort goes native for Hadoop and Spark mainframe data integration

Streaming and batch seem like they’ve always been two separate things, and they’re becoming more and more just two different ways to handle data. We are also seeing a lot of push towards Cloud. What else are you seeing coming up that looks exciting?

For us, really understanding how we guide our customers on deploying in the Cloud is great. There’s persistent clusters, there’s transient clusters. For ETL, what’s the best design pattern for that? For exploratory data science, what’s the best for that? For machine learning, what’s the best for cloud based scoring? So giving customers some guidance on those aspects is key.


Recently, we announced S3 integration for Apache Spark, which allows us to run Spark jobs on data that already lives in S3. The transient nature of clusters makes it very easy to just spin up compute resources and run a Spark job on data that lives in S3. And then you don’t have to spend all that time moving the data and going through all the manual work on the front end.

Really work on the data right where it is.

Exactly. That’s Spark in the Cloud.
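
For a sense of what running Spark directly against S3 data looks like, here is a hedged sketch; it assumes the cluster has the S3A connector (hadoop-aws) and credentials configured, and the bucket and paths are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-on-s3-sketch").getOrCreate()

# Read Parquet files straight from S3; no copy into HDFS first.
events = spark.read.parquet("s3a://example-bucket/events/2017/")

# Transform and write the result back to S3.
(events.groupBy("event_type").count()
       .write.mode("overwrite")
       .parquet("s3a://example-bucket/reports/event_counts/"))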

Syncsort recently announced support for Spark 2.0 in our DMX-h Intelligent Execution (IX) capabilities. Be sure to check that out, and see what the experts have to say about Spark in our recent eBook.  

Also, be sure to read the third and final part of this interview on Friday. Paige and Sean talk about two new projects that Cloudera is excited about, Apache Livy and Apache Spot.


Syncsort blog

Data loading into HDFS – Part3. Streaming data loading

In my previous blogs, I already wrote about data loading into HDFS. In the first blog, I covered data loading from generic servers to HDFS. The second blog was devoted to offloading data from Oracle RDBMS. Here I want to explain how to load streaming data into Hadoop. First of all, I want to note that I will not cover Oracle GoldenGate for Big Data here, simply because it deserves a dedicated blog post. Today I’m going to talk about Flume and Kafka.

What is Kafka? 

Kafka is a distributed service bus. OK, but what is a service bus? Let’s imagine that you have a few data systems, and each one needs data from the others. You could link them directly, like this:


but this becomes very hard to manage. Instead, you could have one centralized system that accumulates data from all sources and serves as a single point of contact for all systems. Like this:


What is Flume? 

“Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.” – this definition from the documentation explains pretty well what Flume is. Flume was historically developed for loading data into HDFS. But why couldn’t I just use the Hadoop client?

Challenge 1. Small files.

Hadoop was designed for storing large files, and despite a lot of optimizations around the NameNode over the last few years, it is still recommended to store only big files. If your source produces a lot of small files, Flume can collect them and flush the collection in batch mode, as a single big file. I always use the analogy of a glass and drops: you can collect one million drops in one glass, and after that you have one glass of water instead of one million drops.

Challenge 2. Lots of data sources

Let’s imagine that I have an application (or even two, on two different servers) that produces files which I want to load into HDFS.


Life is good; if the files are large enough, it’s not going to be a problem.

But now let’s imagine that I have 1000 application servers, and each one wants to write data into HDFS. Even if the files are large, this workload will collapse your Hadoop cluster. If you don’t believe it, just try it (but not on a production cluster!). So, we have to have something in between HDFS and our data sources.


Now it’s time for Flume. You can build a two-tier architecture: the first tier collects data from the different sources, and the second one aggregates it and loads it into HDFS.


In my example, I depict 1000 sources handled by 100 Flume servers on the first tier, which load the data onto the second tier; only the second tier connects directly to HDFS, and in my example that is just two connections, which is affordable. Here you can find more details; I just want to add that the general practice is to use one aggregation agent for every 4-16 client agents.

I also want to note that it is good practice to use the Avro sink when you move data from one tier to the next. Here is an example snippet from a Flume config file:

agent.sinks = avroSink
agent.sinks.avroSink.type = avro
agent.sinks.avroSink.channel = memory
agent.sinks.avroSink.hostname = avrosrchost.example.com
agent.sinks.avroSink.port = 4353

Kafka Architecture.

You can find deep technical presentations about Kafka here and here (actually, I took a few of the screenshots from there). You can also find a very interesting technical video here. In my article, I will just recap the key terms and concepts.


Producer – a process that writes data into the Kafka cluster. It could be part of an application, or edge nodes could play this role.

Consumer – a process that reads data from the Kafka cluster.

Broker – a member of the Kafka cluster. The set of brokers makes up the Kafka cluster.

Flume Architecture.

You can find a lot of useful information about Flume in this book; here I just highlight the key concepts.


Flume has 3 major components:

1) Source – where I get the data

2) Channel – where I buffer it. It could be memory or disk, for example.

3) Sink – where I load my data. For example, it could be another tier of Flume agents, HDFS, or HBase.


Between source and channel, there are two minor components: Interceptor and Selector.

With an Interceptor you can do simple processing; with a Selector you can choose the channel depending on the message header.

Flume and Kafka similarities and differences.

It’s a frequent question: “What is the difference between Flume and Kafka?” The answer could be very long, but let me briefly explain the key points.

1) Pull and Push.

Flume accumulates data until some condition is met (number of events, size of the buffer, or a timeout) and then pushes it onward to the sink.

Kafka accumulates data until a client initiates a read. So the client pulls data whenever it wants.

2)  Data processing

Flume can do simple transformations via interceptors.

Kafka doesn’t do any data processing; it just stores the data.

3) Clustering

Flume is usually a collection of independent single instances.

Kafka is a cluster, which means it has benefits such as high availability and scalability out of the box, without extra effort.

4) Message size

Flume doesn’t have any obvious restrictions on the size of a message.

Kafka was designed for messages of a few KB.

5) Coding vs Configuring

Flume is usually a configuration-driven tool (users usually don’t write code; instead they use its configuration capabilities).

With Kafka, you have to write code to load and unload the data.
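
For a sense of what that code looks like, here is a minimal sketch using the third-party kafka-python client (the broker address and topic name are placeholders):

# pip install kafka-python
import json
from kafka import KafkaProducer, KafkaConsumer

# Loading data into Kafka: a producer writes messages to a topic.
producer = KafkaProducer(bootstrap_servers="broker1:9092",
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))
producer.send("app-logs", {"host": "web01", "message": "user login ok"})
producer.flush()

# Unloading data from Kafka: a consumer reads messages back from the topic.
consumer = KafkaConsumer("app-logs",
                         bootstrap_servers="broker1:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)   # stop iterating after 5 seconds of silence
for record in consumer:
    print(record.value)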

Flafka.

Many customers are thinking about choosing the right technology, either Flume or Kafka, for handling their streaming data. Stop choosing: use both. It’s quite a common use case, and it is named Flafka. You can find a good explanation and nice pictures here (actually, I borrowed a few of the screenshots from there).

First of all, Flafka is not a dedicated project. It’s just a bunch of Java classes for integrating Flume and Kafka.


Now Kafka can be either a source for Flume, via this Flume config:

flume1.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource

or a channel, via the following directive:

flume1.channels.kafka-channel-1.type = org.apache.flume.channel.kafka.KafkaChannel 

Use Case 1. Kafka as a source or channel

If you have Kafka as an enterprise service bus (see my example above), you may want to load data from your service bus into HDFS. You could do this by writing a Java program, but if you don’t like that, you can use Kafka as a Flume source.


In this case, Kafka can also be useful for smoothing peak load, and Flume provides flexible routing.

Also, you could use Kafka as a Flume channel for high-availability purposes (it is distributed by design).

Use case 2. Kafka as a sink.

If you use Kafka as an enterprise service bus, you may want to load data into it. The native way for Kafka is a Java program, but if you feel it would be much more convenient with Flume (just using a few config files), you have this option. The only thing you need is to configure Kafka as a sink.


Use case 3. Flume as the tool to enrich data.

As I already mentioned, Kafka does not do any data processing. It just stores data without any transformation. You can use Flume as a way to add extra information to your Kafka messages. To do this, define Kafka as a source, implement an interceptor that adds the extra information to each message, and write the result back to Kafka in a different topic.


Conclusion.

There are two major tools for loading streaming data: Flume and Kafka. There is no single right answer about which to use, because each tool has its own advantages and disadvantages. That is generally why Flafka was created; it is just a combination of the two tools.


The Data Warehouse Insider

Drizzle on tap to spur Spark Streaming architecture


Innovation in Spark Streaming architecture continued apace last week as Spark originator Databricks discussed an upcoming add-on expected to reduce streaming latency.

Based on ongoing work by a lab at the University of California, Berkeley, elements of what is being called the Drizzle framework are expected to become part of Apache Spark later this year, according to the company.

The anticipated streaming update is part of Databricks’ larger efforts to provide a platform for broad new analytics uses. Drizzle is intended to help promote users’ moves to so-called Lambda architectures that combine batch and real-time data processing approaches.

Spark trending now at Netflix

The move to embrace both batch and real-time processing isn’t an easy one, even for fast-flying web companies. But it is a natural step, according to Shriya Arora, a senior data engineer at Netflix.

Arora is part of a Netflix team that employs Spark processing and streaming to transform and push data to data scientists who develop algorithms that personalize the company’s movie recommendations to subscribers. As Netflix converts some applications from batch to real time, she’s working to fine-tune Spark Streaming to ensure there are monitoring alerts that warn when streaming jobs may fail.

“Streaming is better than having long-running jobs, but it comes at a cost. For example, streaming failures have to be addressed immediately. If an application is down too long, you run into data loss,” she told an audience at last week’s Spark Summit East 2017 in Boston.

Real time means ‘why wait?’

The real-time effort is worthwhile, however, because it can better align Netflix’s movie recommendations with the immediate interests of customers. “Trending now” viewing choices, for example, can be more completely up to date, Arora said. “Why wait 24 hours when you can pick up the new information in an hour?”

But the Spark Streaming architecture today doesn’t support pure event streaming — it still has roots in a “micro-batching” formula that rapidly processes small batches of data. So, there are cases where time-sensitive applications might better opt for streaming as supported by alternative frameworks such as Flink or Storm, Arora said.

Such use cases are a prime target for Drizzle, a project within the UC Berkeley RISELab — itself a descendent of the AMPLab project that begat Apache Spark. [RISE stands for Real-time Intelligence with Secure Execution.]

Drizzle aims to unify record-at-a-time streaming with micro-batch models, and is in some part an answer to Flink, an emerging streaming architecture that has shown performance benefits over the present Spark Streaming.

Hearing Flink steps?

As he discussed Drizzle in a Spark Summit keynote, Ion Stoica didn’t try to cover up Spark Streaming architecture’s present latency shortcomings in streaming versus Apache Flink. He said Drizzle is intended to reduce Spark Streaming’s performance latency by about 10 times.

Stoica is executive chairman and a co-founder of Databricks, and is also a professor of computer science at UC Berkeley and a part of the RISELab. In graphs, he showed Spark trailing Apache Flink by hundreds of milliseconds in handling event throughput.

He also showed data in which early versions of Drizzle and a companion Drizzle-Opt execution engine slightly improve upon present Apache Flink performance. While details were sparse, Drizzle architecture as depicted on the RISELab’s website is meant to “decouple execution granularity from coordination granularity” for workloads on clusters.

In an interview, Spark inventor Matei Zaharia, who is CTO at Databricks and another co-founder — as well as Stoica’s former grad student — said parts of Drizzle would likely appear in Apache Spark during the third quarter of 2017.

Pursuing a unified model

Both Stoica and Zaharia emphasized that recent advances in streaming technology for Spark, including a Structured Streaming engine and API added as part of Spark 2.0 last year, have focused on enabling a more cohesive approach for programmers that combine real-time and batch data processing on a single platform. They positioned Spark overall as a unified approach to diverse data management and analytical needs that include ETL, machine learning and SQL querying, as well as streaming.

“We think of Spark as the infrastructure for machine learning, which itself is really a small part of the entire workflow,” Stoica said. “You have to clean the data, and transform it. Then, at the end, when it is curated, you apply machine learning algorithms on top.”

This unified approach has merit, according to a machine learning user at a marketing analytics firm who attended the Boston event.

“Previous to our use of Spark, we had ETL, machine learning and other analytics processes, and they were all on different software stacks,” said Saket Mengle, senior principal data scientist at Boston-based DataXu Inc. “Spark allows us to put this on one stack. It is something you have to tweak, but uniformity is good.”

Spark in context

Improvements to Spark Streaming should be viewed in the context of Spark’s overall analytical adoption, said one industry analyst on hand at the conference.

“Spark’s long-term appeal has been as an ensemble of analytical approaches, and its ability to address a variety of workloads,” said Doug Henschen, a principal analyst at Constellation Research Inc.

In a blog post following the conference, Henschen remarked that Spark was progressing more quickly than was predecessor Hadoop at a comparable stage of development, and that it promises “wider hands-on use” by a variety of developers and data scientists.

One measure of Spark’s progress is its adoption by vendors beyond Databricks, he said. In fact, the open source version, Apache Spark, is offered by traditional enterprise players like IBM and Oracle, as well as Hadoop distribution providers Cloudera, Hortonworks and MapR.

It’s noteworthy, too, that Spark is offered on the cloud by the likes of Amazon, Google, Microsoft and others. So far, Databricks has focused its efforts on providing cloud services, which is where its new approach to streaming will likely first be tested.



SearchBusinessAnalytics: BI, CPM and analytics news, tips and resources

Streaming data analytics puts real-time pressure on project teams


When I worked at a fast-food restaurant in high school, a co-worker friend and I decided its motto should be “Speed, Not Perfection.” We silkscreened t-shirts for the two of us with that phrase embedded in the corporate logo — two smart-aleck teenagers gently sticking it to the man.

Nowadays, data management and analytics teams increasingly find themselves being asked to fulfill the speed part to enable real-time data analysis in their organizations. But they don’t have the luxury of being able to get by with the same sort of occasional sloppiness that my friend and I did in slapping burgers together. And that puts them under a lot of pressure, because creating a real-time architecture and using it to run streaming data analytics applications is a complicated undertaking.

For starters, streaming analytics systems don’t come in a box — not even a large one. Setting them up is an artisanal process that requires prospective users to piece together various data processing technologies and analytics tools to meet their particular application needs. In addition, the technology options have increased significantly over the past few years, thanks largely to the emergence of multiple big data platforms that provide stream processing capabilities in different ways.

A plethora of streaming platforms

Spark Streaming, Flink, Storm, Samza, Pulsar, Druid, Kylin — they’re all open source processing engines vying for a piece of the data streaming and real-time analytics action. Even Kafka, originally a messaging technology for feeding data from one system to another, now also functions as a stream processing platform in its own right. In addition to the open source tools, various IT vendors offer more traditional complex event processing systems that began emerging in the late 1990s. Specialized databases — in-memory ones, for example — are also built to handle streaming data analytics.

On the analytics software side, broader use of machine learning algorithms is making it more feasible to build predictive models that can churn through large amounts of streaming data on things like financial transactions, equipment performance and internet clickstreams. But again, there are a multitude of technology choices to consider: tools from mainstream analytics vendors and machine learning specialists, cloud-based services, open source platforms.

As with building a big data architecture in general, the surfeit of software available to underpin a real-time analytics architecture can be a boon for users — or mire them in a veritable boondoggle of a deployment. Finding the right technologies and combining them into an effective analytics framework is a perilous process; missteps can send a project careening off the intended path.

Streaming forward on real-time projects

That isn’t stopping companies, particularly large ones with lots of data and ample IT resources, from giving it a go. In an ongoing survey being conducted by SearchBusinessAnalytics publisher TechTarget Inc., 28.1% of the 7,000-plus IT, analytics and business professionals who had responded as of mid-January said their organizations were looking to invest in real-time analytics technology over the ensuing 12 months. In addition, 13.4% said they planned to buy stream processing software.

Why do it? The ability to pull useful information out of data streams in real time lets business operations act fast, and that clearly can be to their advantage. Predictive analytics applications run against streaming data on the web activity of consumers can drive website personalization programs and targeted online advertising and marketing campaigns. Fraud detection, predictive maintenance and satellite imaging are other applications that can benefit from streaming data analytics.

In many cases, real time might be the only time to take advantage of what’s in the data being collected. Streaming analytics tools point to “perishable insights” that need to be acted on quickly before the opportunity is lost, Forrester Research analyst Mike Gualtieri and then-colleague Rowan Curran wrote in a 2016 Forrester Wave report. And you can’t get those kinds of insights simply by throwing data into a Hadoop cluster, as Darryl Smith, chief data platform architect at Dell EMC, said during a presentation on the data storage vendor’s real-time streaming efforts at Strata + Hadoop World 2016 in New York.

Speed is indeed a wonderful thing. Just be sure your team has a well-thought-out plan before turning up the heat on a streaming analytics initiative. Otherwise, it might end up getting flame-grilled by disappointed business executives.



SearchBusinessAnalytics: BI, CPM and analytics news, tips and resources

Announcing General Availability of Power BI Real-Time Streaming Datasets

Today, I am happy to announce the general availability of real-time streaming datasets in Power BI. This feature set allows users to easily stream data to Power BI via the REST API, Azure Stream Analytics, or PubNub, and to see that data instantly light up on their dashboards. Since we announced public preview earlier last year, we’ve been delighted to see thousands of users across a dozen industries leverage these capabilities to gain insights and take action on their data, right as it happens.

As part of this news, I am also happy to announce the general availability of Azure Stream Analytics outputting to Power BI streaming datasets. This feature allows users to build streaming tiles on top of datasets pushed to Power BI by Azure Stream Analytics, while still supporting all existing functionality (e.g. using the dataset to build reports). These streaming tiles augment the existing Stream Analytics to Power BI workflow by enabling support for highly requested scenarios such as showing latest value, and showing values over a set time window. See our previous post for further details and a step-by-step walkthrough.
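
As an illustration of the REST API path mentioned above, pushing rows into a streaming dataset is just an HTTP POST to the dataset’s push URL, which you copy (including its key) from the dataset’s API info page in Power BI; the URL and field names in this sketch are placeholders:

# pip install requests
import requests
from datetime import datetime, timezone

# Placeholder: use the push URL shown for your own streaming dataset.
PUSH_URL = "https://api.powerbi.com/beta/<workspace-id>/datasets/<dataset-id>/rows?key=<key>"

# The push URL accepts a JSON array of row objects matching the dataset's schema.
rows = [{"time": datetime.now(timezone.utc).isoformat(), "value": 42.0}]

response = requests.post(PUSH_URL, json=rows)
response.raise_for_status()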

For the rest of this blog post, we’d like to shine the spotlight on a set of innovative and inspiring deployments of Power BI real-time streaming. Read on to see some exciting examples of real-time data in action!

TransAlta is Canada’s largest renewable power generation company, operating over 70 power plants spread across the globe. TransAlta uses Azure Stream Analytics in conjunction with Power BI to facilitate real-time data collection and monitoring for its power generation facilities. Azure Stream Analytics provides the ability to perform engineering calculations on streams of data in-flight, while Power BI is leveraged to display that data in real-time to technicians and engineers in the field.

“What is exciting about streaming datasets is the latest value now being available,” says Kent Weare, Senior Enterprise Architecture and Integration Lead, “while aggregating data is often required for trending analysis, latest value provides immediate feedback about how a device, or piece of equipment is performing.”

Piraeus Bank S.A. is a Greek multinational financial services company with hundreds of branches across Europe. Piraeus Bank uses Power BI streaming datasets together with Azure Stream Analytics to understand the key metrics of their web banking solution in real time. In addition, the solution also provides the flexibility to allow for historical analysis on different time scales when the need arises.


“Power BI delivers actionable events that help us understand how our platform evolves and needs to adapt, so that we can provision our service for different scalability needs and loads,” says John Raptakis, Head of Remote Banking in Transactional Banking Systems at Piraeus Bank.

IntelliScape.io uses Power BI and PubNub together to implement real-time dashboards for its Curb Utilization Analytics solution, which leverages video analytics to capture and understand traffic patterns in real-time. Data flows seamlessly from curb-side camera to intelligent video analytics software to Power BI, allowing modern cities to get up-to-the-second updates on traffic events as they unfold.


“Undeniably, on the very near horizon, is a new reality; the perfect storm of event data and the desire to interpret insights will overwhelm even the best-prepared IT and BI organizations. Microsoft Power BI is ideally integrated with PubNub to meet these new requirements,” says Bill French, Chief Architect at IntelliScape.io

Further Reading


Microsoft Power BI Blog | Microsoft Power BI

Push data to Power BI streaming datasets without writing any code using Microsoft Flow

Today, I am happy to announce an exciting new update to the Power BI connector for Microsoft Flow. Coming hot on the heels of our data alert Flow trigger, we have added a new action which pushes rows of data to a Power BI streaming dataset.

Since their release last year, thousands of users have used Power BI streaming datasets to easily build real-time dashboards by pushing data into the REST API endpoint, and having that data update in seconds on their streaming visuals. With this connector, this process can now be automated without writing a single line of code. Simply create a Flow with the “push rows to streaming dataset” action and Flow will automatically push data to that endpoint, in the schema that you specify, whenever the Flow is triggered. Even better, you’ll be able to choose from hundreds of Flow triggers to act as data sources.

The richness of the Flow ecosystem enables countless use cases for this action. Create a Flow to monitor the Twitter sentiment in Power BI via incorporating the Twitter trigger and the Microsoft Cognitive Services Sentiment Analysis action. Add real-time weather data into your dashboards via the MSN Weather trigger. You could even create a log of the Power BI data alerts that have been triggered by piping the alert trigger into a streaming dataset.

To get started, jump over to Flow to begin building your own Flows. Read on for a full end-to-end tutorial.

End-to-end tutorial: gauging Twitter sentiment with Flow and Power BI

In this tutorial, we will create a real-time dashboard which charts the sentiment of a keyword on Twitter. You could imagine using this dashboard to monitor the status of your social media campaign in real-time in Power BI. We’ll start by creating a streaming dataset in Power BI, and then from there push Twitter sentiment data to that dataset via Flow.


Creating the streaming dataset in Power BI

To create the Power BI streaming dataset, we will go to powerbi.com and select “Streaming datasets.”

From there, we will create a dataset of type API:


Name the dataset whatever you want (but remember the name!). Then, add the following fields to the streaming dataset:


To summarize – we’ve now created a dataset with the following fields

  • time (DateTime) – when the Tweet was sent
  • tweet (Text) – the contents of the Tweet
  • sentiment (Number) – a number between 0 and 1, representing the sentiment of the tweet, with 0 being extremely negative, and 1 being extremely positive

Create a Flow to push Tweet sentiments to Power BI

Next, we will create a Flow which will push Tweets and their sentiments to Power BI. Start by navigating to Flow. Sign in, and then go to “My Flows”, then “Create from blank.” You should see the following:


Now, click the Twitter category, and select the “When a new tweet is posted” trigger, and enter your search term. For the purposes of this walkthrough, we’ll use “PowerBI.” As the title suggests, this trigger will start a Flow whenever a tweet which contains the search term is posted.


Next, we will pipe these Tweets into the Microsoft Cognitive Services Sentiment Detection Flow action to understand the positivity of the Tweet content. This action takes in a Tweet, and outputs a number from zero (very negative) to one (very positive).

Select “New Step,” then “Add an action,” then search for “Sentiment Analysis” and select the “Cognitive Services Text Analytics – Detect Sentiment” action. To continue, you’ll need an API key with Microsoft Cognitive Services – you can get one for free.

Once you’ve entered your API key and related information, go ahead and pipe in the Tweet text to the sentiment detection action by selecting it from the dynamic content pane on the right side of the screen. Your Flow should now look like the following:


Now, we’re at the last step of the Flow: we’re going to push this data into the Power BI streaming dataset that we created earlier. Go to “New step,” then “Add an action,” and then enter “Power BI” into the search box. Select “Add row to streaming dataset” from the actions list.


Select the name of the workspace, then the name of the streaming dataset in the first step, and select the Table titled “RealTimeData.” Note that all streaming datasets created in powerbi.com will have one table named “RealTimeData.” Next, in the data field, add the following:


Go ahead and give your Flow a name, and select “Create Flow” to start the Flow.

Building the real-time dashboard in Power BI

Now that the data is flowing, the last thing we’ll want to do is create a dashboard with a streaming visual in Power BI. Back in Power BI, go to a dashboard, select “Add tile,” then “Custom streaming data” and finally the name of the streaming dataset that we created in the first step. Configure the streaming visual as follows:


You should now see a line chart appear in your dashboard, graphing the Tweet sentiments over time. And you’re done!

Consider for a second that we’ve built a fairly complex pipeline of components, spanning social media, Artificial Intelligence and Business Intelligence domains, all in a short amount of time and with zero lines of code – no small feat!

What’s next?

For more ideas on Flows, check out the Flow templates page to see a gallery of the most popular ideas. We’ll also be adding several Flow templates with the new “Push to streaming dataset” action, including the one featured in the tutorial above – so stay tuned for that.

Have an idea for another way that Power BI can connect to Flow? Head to the Power BI UserVoice and cast your vote to make your voice heard.

Made a cool Flow? Share it with the community! Either post in the comments below, or in the Power BI forums. Your Flow might even be featured in the next Power BI Flow blog post!

Documentation links


Microsoft Power BI Blog | Microsoft Power BI