Tag Archives: Start

Big Data SQL Quick Start. Multi-user Authorization – Part 25

One of the major benefits of Big Data SQL is security. You access the data stored in HDFS or other sources through Oracle Database, which means you can apply many database features, such as Data Redaction, VPD, or Database Vault. These features, in conjunction with the database schema/grant privilege model, allow you to protect the data on the database side (when an intruder tries to reach the data through the database).

But it’s also important to keep in mind that data stored on HDFS may be needed for other purposes (Spark, Solr, Impala…), and those need some other mechanism for protection. In the Hadoop world, Kerberos is the most popular way to protect data (it is the authentication method). Kerberos in conjunction with HDFS ACLs gives you the opportunity to protect data at the file system level. HDFS, as a file system, has the concepts of user and group, and the files you store on HDFS have different privileges for the owner, the group, and everybody else.
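For illustration, here is a minimal sketch of locking down an HDFS path with standard permissions plus an extra ACL entry; the path, owner, group, and user names are hypothetical:

$   hadoop fs -chmod 750 /data/sales                      # owner: full, group: read/list, others: nothing
$   hadoop fs -chown etl:analysts /data/sales             # hypothetical owner and group
$   hadoop fs -setfacl -m user:auditor1:r-x /data/sales   # grant one extra user read access via an ACL
$   hadoop fs -getfacl /data/sales                        # verify the resulting ACL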

Conclusion: to work with Kerberized clusters, Big Data SQL needs a valid Kerberos ticket in order to access HDFS files. Fortunately, all of this setup has been automated and is available within the standard Oracle Big Data SQL installer. For more details, please check here.

Big Data SQL and Kerberos.

Well, customers usually have a Kerberized cluster, and to work with it we need a valid Kerberos ticket. But this raises the question – which principal do you need for Big Data SQL?

The answer is easy – oracle. In prior Big Data SQL releases, all Big Data SQL queries ran on the Hadoop cluster as the same user: oracle. This has the following consequences:

- You are unable to authorize access to data based on the user that is running a query

- Hadoop cluster audits show that all data queried through Big Data SQL is accessed by oracle

What if I already have data that is used by other applications and carries different privileges (belonging to different users and groups)? This is where Big Data SQL 3.2 introduces a new feature – Multi-User Authorization.

Hadoop impersonation.

The foundation of Multi-User Authorization is a Hadoop feature called impersonation. I took this description from here:

“A superuser with username ‘super’ wants to submit job and access hdfs on behalf of a user joe. The superuser has Kerberos credentials but user joe doesn’t have any. The tasks are required to run as user joe and any file accesses on namenode are required to be done as user joe. It is required that user joe can connect to the namenode or job tracker on a connection authenticated with super’s Kerberos credentials. In other words super is impersonating the user joe.”

In the same manner, “oracle” is the superuser and the other users are impersonated.
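For reference, impersonation in Hadoop is driven by proxy-user settings in core-site.xml. The Big Data SQL installer normally configures this for you, so the snippet below is only a hedged sketch; the host and group values are assumptions you would tighten for your own cluster:

<!-- core-site.xml: allow the oracle superuser to impersonate other users -->
<property>
  <name>hadoop.proxyuser.oracle.hosts</name>
  <value>*</value>   <!-- or restrict to the hosts that run Big Data SQL -->
</property>
<property>
  <name>hadoop.proxyuser.oracle.groups</name>
  <value>*</value>   <!-- or restrict to the groups that may be impersonated -->
</property>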

Multi-User Authorization key concepts.

1) Big Data SQL will identify the trusted user that is accessing data on the cluster.  By executing the query as the trusted user:

- Authorization rules specified in Hadoop will be respected
- Authorization rules specified in Hadoop do not need to be replicated in the database
- Hadoop cluster audits identify the actual Big Data SQL query user

2) Consider the Oracle Database as the entity that is providing the trusted user to Hadoop

3) Must map the database user that is running a query in Oracle Database to a Hadoop user

4) Must identify the actual user that is querying the Oracle table and pass that identity to Hadoop 
- This may be an Oracle Database user (i.e. schema)
- Lightweight user comes from session-based contexts (see SYS_CONTEXT)
- User/Group map must be available through OS lookup in Hadoop


You can find the full documentation for this feature here; now I’m going to show a few of the most popular cases with code examples.

To work with the relevant objects, you need to grant the following permissions to the user who will manage the mapping table:

SQL> grant select on BDSQL_USER_MAP to bikes;
SQL> grant execute on DBMS_BDSQL to bikes;
SQL> grant BDSQL_ADMIN to bikes;

In my case, this is the user “bikes”.

Just in case, clean up any existing mapping rule for the user BIKES:

SQL> begin
       DBMS_BDSQL.REMOVE_USER_MAP(current_database_user => 'BIKES');
     end;
     /

Check that the mapping table is empty:

SQL> select * from SYS.BDSQL_USER_MAP;

and after this, run a query:

SQL> select /*+ MONITOR */ * from bikes.weather_ext;

This is the default mode, without any mapping, so I expect to access HDFS as the oracle user.

To double-check this, I review the HDFS audit files:

$   cd /var/log/hadoop-hdfs
$   tail -f hdfs-audit.log |grep central_park
2018-03-01 17:42:10,938 INFO ... ugi=oracle ... ip=/ cmd=open ... src=/data/weather/central_park_weather.csv..

Here it is clear that the oracle user reads the file (ugi=oracle).

Let’s check the permissions for the file behind this external table:

$   hadoop fs -ls /data/weather/central_park_weather.csv
-rw-r--r--   3 oracle oinstall      26103 2017-10-24 13:03 /data/weather/central_park_weather.csv

So, everybody may read it. Remember this, and let’s try to create the first mapping.

SQL> begin
       DBMS_BDSQL.ADD_USER_MAP(
         current_database_user       => 'BIKES',
         syscontext_namespace        => null,
         syscontext_parm_hadoop_user => 'user1');
     end;
     /

This mapping tells the database that the user BIKES will always be mapped to the OS user user1.

Run the query again and check which user reads the file:

SQL> select /*+ MONITOR */ * from bikes.weather_ext;
$   cd /var/log/hadoop-hdfs
$   tail -f hdfs-audit.log |grep central_park
2018-03-01 17:42:10,938 INFO ... ugi=user1... ip=/ cmd=open ... src=/data/weather/central_park_weather.csv..

It’s interesting that user1 doesn’t exist on the Hadoop OS:

# id user1
id: user1: No such user

If the user doesn’t exist on the OS (the user1 case), it can only read world-readable files. Let me revoke the read permission from everyone else and run the query again:

$   sudo -u hdfs hadoop fs -chmod 640 /data/weather/central_park_weather.csv
$   hadoop fs -ls /data/weather/central_park_weather.csv
-rw-r-----   3 oracle oinstall      26103 2017-10-24 13:03 /data/weather/central_park_weather.csv

Now it fails. To make it work, I can create a “user1” account on each Hadoop node and add it to the oinstall group.

$   useradd user1
$   usermod -a -G oinstall user1
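Since the account has to exist on every Hadoop node, it is convenient to push the same commands to all nodes at once; a sketch using dcli (adjust to whatever cluster tooling you use):

$   dcli -C "useradd user1"
$   dcli -C "usermod -a -G oinstall user1"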

Run the query again and check which user reads the file:

SQL> select /*+ MONITOR */ * from bikes.weather_ext;
$   cd /var/log/hadoop-hdfs
$   tail -f hdfs-audit.log |grep central_park
2018-03-01 17:42:10,938 INFO ... ugi=user1... ip=/ cmd=open ... src=/data/weather/central_park_weather.csv..

Here we are! We can read the file thanks to the group permissions.

What if I want to map this schema to hdfs or some other powerful user? Let’s try:

SQL> begin
       DBMS_BDSQL.REMOVE_USER_MAP(current_database_user => 'BIKES');
       DBMS_BDSQL.ADD_USER_MAP(
         current_database_user       => 'BIKES',
         syscontext_namespace        => null,
         syscontext_parm_hadoop_user => 'hdfs');
     end;
     /

This call raises an exception because the hdfs user is on the blacklist for impersonation.

$   cat $ORACLE_HOME/bigdatasql/databases/orcl/bigdata_config/bigdata.properties | grep impersonation
# Impersonation properties

The second scenario is authorization with a thin client, using CLIENT_IDENTIFIER. In a multi-tier architecture (where we have an application tier and a database tier), it may be a challenge to differentiate multiple users of the same application who share the same database schema.

Below is an example that illustrates this:

We have an application that connects to the database as the HR_APP user, but many people may use this application and this database login. To differentiate these human users we can use the dbms_session.set_identifier procedure (you can find more details here).

So, the Big Data SQL multi-user authorization feature allows a SYS_CONTEXT value to be used for authorization on Hadoop.

Below is a test case that illustrates this.

-- Remove the previous rule related to the BIKES user --
SQL> begin
       DBMS_BDSQL.REMOVE_USER_MAP(current_database_user => 'BIKES');
     end;
     /

-- Add a new rule: if the database user is BIKES, the Hadoop user is taken from USERENV as CLIENT_IDENTIFIER --
SQL> begin
       DBMS_BDSQL.ADD_USER_MAP(
         current_database_user       => 'BIKES',
         syscontext_namespace        => 'USERENV',
         syscontext_parm_hadoop_user => 'CLIENT_IDENTIFIER');
     end;
     /

--Check current database user (schema) --
SQL> select user from dual;




-- Run any query against Hadoop --
SQL> select /*+ MONITOR */ * from bikes.weather_ext;

-- check in the Hadoop audit logs --
-bash-4.1$   tail -f hdfs-audit.log |grep central_park
2018-03-01 18:14:40 ... ugi=oracle ... src=/data/weather/central_park_weather.csv

-- Set CLIENT_IDENTIFIER for the current session --
SQL> begin dbms_session.set_identifier('Alexey'); end;
     /

-- Check CLIENT_IDENTIFIER for the current session --
SQL> select sys_context('USERENV', 'CLIENT_IDENTIFIER') from dual;

-- Run the query again over HDFS data --
SQL> select /*+ MONITOR */ * from bikes.weather_ext;

-- check in the Hadoop audit logs: --
-bash-4.1$    tail -f hdfs-audit.log |grep central_park
2018-03-01 18:17:43 ... ugi=Alexey ... src=/data/weather/central_park_weather.csv

The third option is to use the authenticated user identity. Users connecting to the database (via Kerberos, as a database user, etc.) have their authenticated identity passed to Hadoop. To make it work, simply run:

SQL> begin
       DBMS_BDSQL.ADD_USER_MAP(
         current_database_user       => '*',
         syscontext_namespace        => 'USERENV',
         syscontext_parm_hadoop_user => 'AUTHENTICATED_IDENTITY');
     end;
     /

and after this, your user on HDFS will be the one returned by SYS_CONTEXT('USERENV', 'AUTHENTICATED_IDENTITY').

For example, if I log on to the database as BIKES (as a database user), on HDFS I’ll be authenticated as the bikes user:

-bash-4.1 $    tail -f hdfs-audit.log |grep central_park
2018-03-01 18:23:23 ... ugi=bikes... src=/data/weather/central_park_weather.csv

To check all the rules that you have for multi-user authorization, you can run the following query:

SQL> select * from SYS.BDSQL_USER_MAP;
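With the wildcard rule created above in place, the output should look roughly like the sketch below; the column names are an approximation of the BDSQL_USER_MAP view, not verbatim output:

CURRENT_DATABASE_USER  SYSCONTEXT_NAMESPACE  SYSCONTEXT_PARM_HADOOP_USER
---------------------  --------------------  ---------------------------
*                      USERENV               AUTHENTICATED_IDENTITY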

I hope this feature helps you build a robust security bastion around your data in HDFS.


Oracle Blogs | Oracle The Data Warehouse Insider Blog

Want to Operationalize Analytics? Here’s Where to Start


Thoughts From Gartner’s Data and Analytics Summit

This week I was at Gartner’s Data and Analytics Summit in Grapevine, Texas, with my FICO team. This is an event we never miss; it’s attended by thousands of experts, analysts, data scientists, and analytics leaders.

This year’s Summit was dominated by presentations and conversations about data, analytics, explainable artificial intelligence (AI), and machine learning; nearly every discussion came to the same point — we want to use data and technology to gain competitive advantage and deliver real business value, but how? There was a wide variety of opinions. Gartner asserted that “to accomplish this feat, data and analytic leaders must master four key dimensions of scale: diversity, literacy, complexity, and trust.”

In many of my conversations at the Summit, I shared my own view of what it takes to operationalize analytics. By this I mean taking all of the data and insights gleaned from advanced analytics and connecting them to day-to-day operations. Surprisingly, the first two steps really have nothing to do with technology.

What it takes to operationalize analytics

First, companies need to start by putting the decision before the data. With a decision-first strategy you define the business objective, then determine what data and analytics you need to achieve the goal. If the modeling and data analytics requirements are defined by the business outcome first, data exploration and analytic development is faster and more productive. This helps enterprises narrow in on meaningful outcomes, shutting out extraneous noise to focus on the insights that address specific objectives.

Then, enterprises need to get data science into the hands of business decision makers. Empower the business leaders with the ability to evaluate the complete spectrum of potential opportunities. Experience has shown that, when business experts have access to the data, insight, and the tools to exploit analytics, they can visualize relationships between different variables and actions to quickly identify the preferred outcomes for maximum impact.

Rita Sallam, the conference chair and VP at Gartner, opened the event with a telling statement, “This is a consequential time to be a data and analytics leader.” I couldn’t agree more; leading a digital transformation is no small task. But if you start with the business challenge, then look at the analytics and arm business leaders with access, you will at least be headed in the right direction.



Six New Tech Trends To Start Pursuing Today

Businesses share something important with lions. When a lion captures and consumes its prey, only about 10% to 20% of the prey’s energy is directly transferred into the lion’s metabolism. The rest evaporates away, mostly as heat loss, according to research done in the 1940s by ecologist Raymond Lindeman.

Today, businesses do only about as well as the big cats. When you consider the energy required to manage, power, and move products and services, less than 20% goes directly into the typical product or service—what economists call aggregate efficiency (the ratio of potential work to the actual useful work that gets embedded into a product or service at the expense of the energy lost in moving products and services through all of the steps of their value chains). Aggregate efficiency is a key factor in determining productivity.

After making steady gains during much of the 20th century, businesses’ aggregate energy efficiency peaked in the 1980s and then stalled. Japan, home of the world’s most energy-efficient economy, has been skating along at or near 20% ever since. The U.S. economy, meanwhile, topped out at about 13% aggregate efficiency in the 1990s, according to research.

Why does this matter? Jeremy Rifkin says he knows why. Rifkin is an economic and social theorist, author, consultant, and lecturer at the Wharton School’s Executive Education program who believes that economies experience major increases in growth and productivity only when big shifts occur in three integrated infrastructure segments around the same time: communications, energy, and transportation.

But it’s only a matter of time before information technology blows all three wide open, says Rifkin. He envisions a new economic infrastructure based on digital integration of communications, energy, and transportation, riding atop an Internet of Things (IoT) platform that incorporates Big Data, analytics, and artificial intelligence. This platform will disrupt the world economy and bring dramatic levels of efficiency and productivity to businesses that take advantage of it, he says.

Some economists consider Rifkin’s ideas controversial. And his vision of a new economic platform may be problematic—at least globally. It will require massive investments and unusually high levels of government, community, and private sector cooperation, all of which seem to be at depressingly low levels these days.

However, Rifkin has some influential adherents to his philosophy. He has advised three presidents of the European Commission—Romano Prodi, José Manuel Barroso, and the current president, Jean-Claude Juncker—as well as the European Parliament and numerous European Union (EU) heads of state, including Angela Merkel, on the ushering in of what he calls “a smart, green Third Industrial Revolution.” Rifkin is also advising the leadership of the People’s Republic of China on the build out and scale up of the “Internet Plus” Third Industrial Revolution infrastructure to usher in a sustainable low-carbon economy.

The internet has already shaken up one of the three major economic sectors: communications. Today it takes little more than a cell phone, an internet connection, and social media to publish a book or music video for free—what Rifkin calls zero marginal cost. The result has been a hollowing out of once-mighty media empires in just over 10 years. Much of what remains of their business models and revenues has been converted from physical (remember CDs and video stores?) to digital.

But we haven’t hit the trifecta yet. Transportation and energy have changed little since the middle of the last century, says Rifkin. That’s when superhighways reached their saturation point across the developed world and the internal-combustion engine came close to the limits of its potential on the roads, in the air, and at sea. “We have all these killer new technology products, but they’re being plugged into the same old infrastructure, and it’s not creating enough new business opportunities,” he says.

All that may be about to undergo a big shake-up, however. The digitalization of information on the IoT at near-zero marginal cost generates Big Data that can be mined with analytics to create algorithms and apps enabling ubiquitous networking. This digital transformation is beginning to have a big impact on the energy and transportation sectors. If that trend continues, we could see a metamorphosis in the economy and society not unlike previous industrial revolutions in history. And given the pace of technology change today, the shift could happen much faster than ever before.

The speed of change is dictated by the increase in digitalization of these three main sectors; expensive physical assets and processes are partially replaced by low-cost virtual ones. The cost efficiencies brought on by digitalization drive disruption in existing business models toward zero marginal cost, as we’ve already seen in entertainment and publishing. According to research company Gartner, when an industry gets to the point where digital drives at least 20% of revenues, you reach the tipping point.

“A clear pattern has emerged,” says Peter Sondergaard, executive vice president and head of research and advisory for Gartner. “Once digital revenues for a sector hit 20% of total revenue, the digital bloodbath begins,” he told the audience at Gartner’s annual 2017 IT Symposium/ITxpo, according to The Wall Street Journal. “No matter what industry you are in, 20% will be the point of no return.”

Communications is already there, and energy and transportation are heading down that path. If they hit the magic 20% mark, the impact will be felt not just within those industries but across all industries. After all, who doesn’t rely on energy and transportation to power their value chains?

That’s why businesses need to factor potentially massive business model disruptions into their plans for digital transformation today if they want to remain competitive with organizations in early adopter countries like China and Germany. China, for example, is already halfway through a US$88 billion upgrade to its state electricity grid that will enable renewable energy transmission around the country—all managed and moved digitally, according to an article in The Economist magazine. And it is competing with the United States for leadership in self-driving vehicles, which will shift the transportation process and revenue streams heavily to digital, according to an article in Wired magazine.

Once China’s and Germany’s renewables and driverless infrastructures are in place, the only additional costs are management and maintenance. That could bring businesses in these countries dramatic cost savings over those that still rely on fossil fuels and nuclear energy to power their supply chains and logistics. “Once you pay the fixed costs of renewables, the marginal costs are near zero,” says Rifkin. “The sun and wind haven’t sent us invoices yet.”

In other words, zero marginal cost has become a zero-sum game.

To understand why that is, consider the major industrial revolutions in history, writes Rifkin in his books, The Zero Marginal Cost Society and The Third Industrial Revolution. The first major shift occurred in the 19th century when cheap, abundant coal provided an efficient new source of power (steam) for manufacturing and enabled the creation of a vast railway transportation network. Meanwhile, the telegraph gave the world near-instant communication over a globally connected network.

The second big change occurred at the beginning of the 20th century, when inexpensive oil began to displace coal and gave rise to a much more flexible new transportation network of cars and trucks. Telephones, radios, and televisions had a similar impact on communications.

Breaking Down the Walls Between Sectors

Now, according to Rifkin, we’re poised for the third big shift. The eye of the technology disruption hurricane has moved beyond communications and is heading toward—or as publishing and entertainment executives might warn, coming for—the rest of the economy. With its assemblage of global internet and cellular network connectivity and ever-smaller and more powerful sensors, the IoT, along with Big Data analytics and artificial intelligence, is breaking down the economic walls that have protected the energy and transportation sectors for the past 50 years.

Daimler is now among the first movers in transitioning into a digitalized mobility internet. The company has equipped nearly 400,000 of its trucks with external sensors, transforming the vehicles into mobile Big Data centers. The sensors are picking up real-time Big Data on weather conditions, traffic flows, and warehouse availability. Daimler plans to establish collaborations with thousands of companies, providing them with Big Data and analytics that can help dramatically increase their aggregate efficiency and productivity in shipping goods across their value chains. The Daimler trucks are autonomous and capable of establishing platoons of multiple trucks driving across highways.

It won’t be long before vehicles that navigate the more complex transportation infrastructures around the world begin to think for themselves. Autonomous vehicles will bring massive economic disruption to transportation and logistics thanks to new aggregate efficiencies. Without the cost of having a human at the wheel, autonomous cars could achieve a shared cost per mile below that of owned vehicles by as early as 2030, according to research from financial services company Morgan Stanley.

The transition is getting a push from governments pledging to give up their addiction to cars powered by combustion engines. Great Britain, France, India, and Norway are seeking to go all electric as early as 2025 and by 2040 at the latest.

The Final Piece of the Transition

Considering that automobiles account for 47% of petroleum consumption in the United States alone—more than twice the amount used for generators and heating for homes and businesses, according to the U.S. Energy Information Administration—Rifkin argues that the shift to autonomous electric vehicles could provide the momentum needed to upend the final pillar of the economic platform: energy. Though energy has gone through three major disruptions over the past 150 years, from coal to oil to natural gas—each causing massive teardowns and rebuilds of infrastructure—the underlying economic model has remained constant: highly concentrated and easily accessible fossil fuels and highly centralized, vertically integrated, and enormous (and enormously powerful) energy and utility companies.

Now, according to Rifkin, the “Third Industrial Revolution Internet of Things infrastructure” is on course to disrupt all of it. It’s neither centralized nor vertically integrated; instead, it’s distributed and networked. And that fits perfectly with the commercial evolution of two energy sources that, until the efficiencies of the IoT came along, made no sense for large-scale energy production: the sun and the wind.

But the IoT gives power utilities the means to harness these batches together and to account for variable energy flows. Sensors on solar panels and wind turbines, along with intelligent meters and a smart grid based on the internet, manage a new, two-way flow of energy to and from the grid.

Today, fossil fuel–based power plants need to kick in extra energy if insufficient energy is collected from the sun and wind. But industrial-strength batteries and hydrogen fuel cells are beginning to take their place by storing large reservoirs of reserve power for rainy or windless days. In addition, electric vehicles will be able to send some of their stored energy to the digitalized energy internet during peak use. Demand for ever-more efficient cell phone and vehicle batteries is helping push the evolution of batteries along, but batteries will need to get a lot better if renewables are to completely replace fossil fuel energy generation.

Meanwhile, silicon-based solar cells have not yet approached their limits of efficiency. They have their own version of computing’s Moore’s Law called Swanson’s Law. According to data from research company Bloomberg New Energy Finance (BNEF), Swanson’s Law means that for each doubling of global solar panel manufacturing capacity, the price falls by 28%, from $76 per watt in 1977 to $0.41 in 2016. (Wind power is on a similar plunging exponential cost curve, according to data from the U.S. Department of Energy.)

Thanks to the plummeting solar price, by 2028, the cost of building and operating new sun-based generation capacity will drop below the cost of running existing fossil power plants, according to BNEF. “One of the surprising things in this year’s forecast,” says Seb Henbest, lead author of BNEF’s annual long-term forecast, the New Energy Outlook, “is that the crossover points in the economics of new and old technologies are happening much sooner than we thought last year … and those were all happening a bit sooner than we thought the year before. There’s this sense that it’s not some distant risk or distant opportunity. A lot of these realities are rushing toward us.”

The conclusion, he says, is irrefutable. “We can see the data and when we map that forward with conservative assumptions, these technologies just get cheaper than everything else.”

The smart money, then—72% of total new power generation capacity investment worldwide by 2040—will go to renewable energy, according to BNEF. The firm’s research also suggests that there’s more room in Swanson’s Law along the way, with solar prices expected to drop another 66% by 2040.

Another factor could push the economic shift to renewables even faster. Just as computers transitioned from being strictly corporate infrastructure to becoming consumer products with the invention of the PC in the 1980s, ultimately causing a dramatic increase in corporate IT investments, energy generation has also made the transition to the consumer side.

Thanks to future tech media star Elon Musk, consumers can go to his Tesla Energy company website and order tempered glass solar panels that look like chic, designer versions of old-fashioned roof shingles. Models that look like slate or a curved, terracotta-colored, ceramic-style glass that will make roofs look like those of Tuscan country villas, are promised soon. Consumers can also buy a sleek-looking battery called a Powerwall to store energy from the roof.

The combination of solar panels, batteries, and smart meters transforms homeowners from passive consumers of energy into active producers and traders who can choose to take energy from the grid during off-peak hours, when some utilities offer discounts, and sell energy back to the grid during periods when prices are higher. And new blockchain applications promise to accelerate the shift to an energy market that is laterally integrated rather than vertically integrated as it is now. Consumers like their newfound sense of control, according to Henbest. “Energy’s never been an interesting consumer decision before and suddenly it is,” he says.

As the price of solar equipment continues to drop, homes, offices, and factories will become like nodes on a computer network. And if promising new solar cell technologies, such as organic polymers, small molecules, and inorganic compounds, supplant silicon, which is not nearly as efficient with sunlight as it is with ones and zeroes, solar receivers could become embedded into windows and building compounds. Solar production could move off the roof and become integrated into the external facades of homes and office buildings, making nearly every edifice in town a node.

The big question, of course, is how quickly those nodes will become linked together—if, say doubters, they become linked at all. As we learned from Metcalfe’s Law, the value of a network is proportional to its number of connected users.

The Will Determines the Way

Right now, the network is limited. Wind and solar account for just 5% of global energy production today, according to Bloomberg.

But, says Rifkin, technology exists that could enable the network to grow exponentially. We are seeing the beginnings of a digital energy network, which uses a combination of the IoT, Big Data, analytics, and artificial intelligence to manage distributed energy sources, such as solar and wind power from homes and businesses.

As nodes on this network, consumers and businesses could take a more active role in energy production, management, and efficiency, according to Rifkin. Utilities, in turn, could transition from simply transmitting power and maintaining power plants and lines to managing the flow to and from many different energy nodes; selling and maintaining smart home energy management products; and monitoring and maintaining solar panels and wind turbines. By analyzing energy use in the network, utilities could create algorithms that automatically smooth the flow of renewables. Consumers and businesses, meanwhile, would not have to worry about connecting their wind and solar assets to the grid and keeping them up and running; utilities could take on those tasks more efficiently.

Already in Germany, two utility companies, E.ON and RWE, have each split their businesses into legacy fossil and nuclear fuel companies and new services companies based on distributed generation from renewables, new technologies, and digitalization.

The reason is simple: it’s about survival. As fossil fuel generation winds down, the utilities need a new business model to make up for lost revenue. Due to Germany’s population density, “the utilities realize that they won’t ever have access to enough land to scale renewables themselves,” says Rifkin. “So they are starting service companies to link together all the different communities that are building solar and wind and are managing energy flows for them and for their customers, doing their analytics, and managing their Big Data. That’s how they will make more money while selling less energy in the future.”


The digital energy internet is already starting out in pockets and at different levels of intensity around the world, depending on a combination of citizen support, utility company investments, governmental power, and economic incentives.

China and some countries within the EU, such as Germany and France, are the most likely leaders in the transition toward a renewable, energy-based infrastructure because they have been able to align the government and private sectors in long-term energy planning. In the EU, for example, wind has already overtaken coal as the second largest form of power capacity behind natural gas, according to an article in The Guardian newspaper. Indeed, Rifkin has been working with China, the EU, and governments, communities, and utilities in Northern France, the Netherlands, and Luxembourg to begin building these new internets.

Hauts-de-France, a region that borders the English Channel and Belgium and has one of the highest poverty rates in France, enlisted Rifkin to develop a plan to lift it out of its downward spiral of shuttered factories and abandoned coal mines. In collaboration with a diverse group of CEOs, politicians, teachers, scientists, and others, it developed Rev3, a plan to put people to work building a renewable energy network, according to an article in Vice.

Today, more than 1,000 Rev3 projects are underway, encompassing everything from residential windmills made from local linen to a fully electric car–sharing system. Rev3 has received financial support from the European Investment Bank and a handful of private investment funds, and startups have benefited from crowdfunding mechanisms sponsored by Rev3. Today, 90% of new energy in the region is renewable and 1,500 new jobs have been created in the wind energy sector alone.

Meanwhile, thanks in part to generous government financial support, Germany is already producing 35% of its energy from renewables, according to an article in The Independent, and there is near-unanimous citizen support (95%, according to a recent government poll) for its expansion.

If renewable energy is to move forward in other areas of the world that don’t enjoy such strong economic and political support, however, it must come from the ability to make green, not act green.

Not everyone agrees that renewables will produce cost savings sufficient to cause widespread cost disruption anytime soon. A recent forecast by the U.S. Energy Information Administration predicts that in 2040, oil, natural gas, and coal will still be the planet’s major electricity producers, powering 77% of worldwide production, while renewables such as wind, solar, and biofuels will account for just 15%.

Skeptics also say that renewables’ complex management needs, combined with the need to store reserve power, will make them less economical than fossil fuels through at least 2035. “All advanced economies demand full-time electricity,” Benjamin Sporton, chief executive officer of the World Coal Association told Bloomberg. “Wind and solar can only generate part-time, intermittent electricity. While some renewable technologies have achieved significant cost reductions in recent years, it’s important to look at total system costs.”

On the other hand, there are many areas of the world where distributed, decentralized, renewable power generation already makes more sense than a centralized fossil fuel–powered grid. More than 20% of Indians in far-flung areas of the country have no access to power today, according to an article in The Guardian. Locally owned and managed solar and wind farms are the most economical way forward. The same is true in other developing countries, such as Afghanistan, where rugged terrain, war, and tribal territorialism make a centralized grid an easy target, and mountainous Costa Rica, where strong winds and rivers have pushed the country to near 100% renewable energy, according to The Guardian.

The Light and the Darknet

Even if all the different IoT-enabled economic platforms become financially advantageous, there is another concern that could disrupt progress and potentially cause widespread disaster once the new platforms are up and running: hacking. Poorly secured IoT sensors have allowed hackers to take over everything from Wi-Fi enabled Barbie dolls to Jeep Cherokees, according to an article in Wired magazine.

Humans may be lousy drivers, but at least we can’t be hacked (yet). And while the grid may be prone to outages, it is tightly controlled, has few access points for hackers, and is physically separated from the Wild West of the internet.

If our transportation and energy networks join the fray, however, every sensor, from those in the steering system on vehicles to grid-connected toasters, becomes as vulnerable as a credit card number. Fake news and election hacking are bad enough, but what about fake drivers or fake energy? Now we’re talking dangerous disruptions and putting millions of people in harm’s way.

The only answer, according to Rifkin, is for businesses and governments to start taking the hacking threat much more seriously than they do today and to begin pouring money into research and technologies for making the internet less vulnerable. That means establishing “a fully distributed, redundant, and resilient digital infrastructure less vulnerable to the kind of disruptions experienced by Second Industrial Revolution–centralized communication systems and power grids that are increasingly subject to climate change, disasters, cybercrime, and cyberterrorism,” he says. “The ability of neighborhoods and communities to go off centralized grids during crises and re-aggregate in locally decentralized networks is the key to advancing societal security in the digital era,” he adds.

Start Looking Ahead

Until today, digital transformation has come mainly through the networking and communications efficiencies made possible by the internet. Airbnb thrives because web communications make it possible to create virtual trust markets that allow people to feel safe about swapping their most private spaces with one another.

But now these same efficiencies are coming to two other areas that have never been considered core to business strategy. That’s why businesses need to begin managing energy and transportation as key elements of their digital transformation portfolios.

Microsoft, for example, formed a senior energy team to develop an energy strategy to mitigate risk from fluctuating energy prices and increasing demands from customers to reduce carbon emissions, according to an article in Harvard Business Review. “Energy has become a C-suite issue,” Rob Bernard, Microsoft’s top environmental and sustainability executive told the magazine. “The CFO and president are now actively involved in our energy road map.”

As Daimler’s experience shows, driverless vehicles will push autonomous transportation and automated logistics up the strategic agenda within the next few years. Boston Consulting Group predicts that the driverless vehicle market will hit $42 billion by 2025. If that happens, it could have a lateral impact across many industries, from insurance to healthcare to the military.

Businesses must start planning now. “There’s always a period when businesses have to live in the new and the old worlds at the same time,” says Rifkin. “So businesses need to be considering new business models and structures now while continuing to operate their existing models.”

He worries that many businesses will be left behind if their communications, energy, and transportation infrastructures don’t evolve. Companies that still rely on fossil fuels for powering traditional transportation and logistics could be at a major competitive disadvantage to those that have moved to the new, IoT-based energy and transportation infrastructures.

Germany, for example, has set a target of 80% renewables for gross power consumption by 2050, according to The Independent. If the cost advantages of renewables bear out, German businesses, which are already the world’s third-largest exporters behind China and the United States, could have a major competitive advantage.

“How would a second industrial revolution society or country compete with one that has energy at zero marginal cost and driverless vehicles?” asks Rifkin. “It can’t be done.”

About the Authors

Maurizio Cattaneo is Director, Delivery Execution, Energy and Natural Resources, at SAP.

Joerg Ferchow is Senior Utilities Expert and Design Thinking Coach, Digital Transformation, at SAP.

Daniel Wellers is Digital Futures Lead, Global Marketing, at SAP.

Christopher Koch is Editorial Director, SAP Center for Business Insight, at SAP.

Read more thought provoking articles in the latest issue of the Digitalist Magazine, Executive Quarterly.



Digitalist Magazine

How CRM Can Give Your Content Marketing a Jump Start


If you are familiar with content writing, you may occasionally experience difficulties getting your work to rank on search engines. There are also struggles with finding ways to gain attention from readers and subsequently build a following. Finally, another struggle with content writing is making your subject interesting and worth reading, especially if it is a topic that the general public may not be too interested in.

In today’s blog, we are going to talk about content marketing, what it is and how to make people stop and notice through the usage of CRM software.

Content Marketing. What Is It?

Before we begin discussing the ways in which customer relationship management software can assist with giving your content marketing the push it needs, let’s define the phrase and make you a bit more familiar with it.

Content marketing is a strategic marketing approach that focuses on creating and distributing relevant, consistent content through online platforms. With traditional marketing tactics such as cold calls, TV and radio spots, and print advertisements no longer proving as successful as they once were, content marketing has become the more widely accepted and effective form of marketing. That is because it works from an often-used and popular platform: the internet. Essentially, you are meeting your customer base exactly where they are, and using the internet allows for more creative ways to market.

How Does CRM Help Content Marketing?

One of the benefits of using customer relationship management software is its wealth of data. No matter what your goal is for using the software, having information about your customers and all interactions between your business and its customers can help you in more ways than one. When it comes to content marketing, you can use the information collected within the system to learn more about your customer base, their interests, and how to market to them. Here are three specific ways CRM can bring life to your content marketing.

  • Gain customer social media understanding: CRM allows you to gain a better understanding of your customers in order to create a more personal relationship with them. Social media usage can easily tell a business what their customers are interested in or do not care for, making marketing easier. Social CRM is a tool that can make this process easier.
  • Monitoring metrics: Utilizing CRM data is the perfect way to surface information that rests in email threads and phone conversations. You can look through metrics and data to see what was successful or a failure, and from there make changes to your marketing.
  • Make businesses better listeners: This one is simple, but effective. Paying attention to your customers’ actions, needs, and requests will make content marketing easier to perform. The needs of a customer are often documented in customer relationship management software. Knowing this information doesn’t just simplify marketing; it also shows your customer base that you care and pay attention to them.

CRM is an incredibly helpful piece of software and can be used for more than just building relationships with customers. It also gives your content marketing the boost it needs to perform better on search engines.


OnContact CRM

Start Planning for Black Friday and Cyber Monday 2018 Now


Black Friday and Cyber Monday are about to hit yet again. Last year, 154 million people shopped over the holiday weekend, more than in 2015, and the numbers are only predicted to get bigger. It’ll be interesting to see how 2017 shapes up, but in the meantime, there are some things you can do as a retailer, for both brick-and-mortar and online stores, to start planning for next year and beyond.

How to handle the crush of shoppers

Every retailer’s biggest focus and investment, for both e-commerce and brick-and-mortar, needs to be on scaling. If your backend falls apart because of a crush of traffic, everything you set up will be for naught. You need a system that can handle the large traffic load and the pounding of people hitting refresh and trampling through your store.

Moving to the cloud or at least hybridizing part of your environment prior to Black Friday must be a priority. The cloud allows you to rapidly scale your operations based on traffic load for a significant cost savings over running your datacenter in-house. It’s a key part of keeping your network flexible, agile, and scalable.

For brick-and-mortar stores, the key to increasing revenue is optimizing the movement of people in and out of the store. The traffic flooding into these stores on Black Friday can represent several multiples of the typical daily traffic. In order to be successful, the floor staff must be equipped with real-time information about inventory levels, customer flows, in-store deal announcements, and answers to the questions asked by panicked, manic shoppers.

The best way to arm staff is to supply mobile devices that enable them to communicate with each other and provide that real-time information at their fingertips. You can even provide POS systems directly on those devices so they can check people out right in the middle of the store, significantly reducing the crush at the front counters.

Brick-and-mortar retailers must also stand out from the noise of their competitors beyond just giving excellent customer service. To differentiate, they must offer compelling real-time deals that can draw the crowds and satisfy shoppers’ needs. But customers must receive this information in a timely manner. The same digital systems that keep floor staff up to date can help customers determine their Black Friday game plan. With the right analytics in place offering a 360 degree view of the customer, you can provide personalized deals and recommendations that compel shoppers to prioritize your shops on their Black Friday agenda.

How to make the online experience as exciting as the in-store experience

Online goes a bit differently. Shoppers may plan their Cyber Monday or Black Friday based on email newsletters and ads they’ve received, but there’s a ton of competition. Sending real-time offers and encouraging shoppers to subscribe to deal alerts throughout the day can help you stand out from the noise and leverage the loyalty you’ve built throughout the rest of the year.

Online retailers can leverage the same technologies driving in-store personalization to send customized deals and ads to consumers based on their shopping habits, with the help of products like analytics and integration. Researchers have found that consumers who receive more targeted ads are more than twice as likely to buy the advertised product as consumers who get non-targeted ads.

The same systems can identify shopping trends in real time, allowing for dynamic pricing and deal creation based on predicted shopping trends and inventory levels. This can help dramatically decrease the logistical headaches that come with coordinating your Black Friday and Cyber Monday sales while offering deals that drive purchasing behavior without discounting so deeply that you lose profits.

To see how TIBCO is helping retailers digitally transform, download the whitepaper: The Digital Challenge in Retail or visit our Retail page.


The TIBCO Blog

Big Data SQL Quick Start. Correlate real-time data with historical benchmarks – Part 24

In Big Data SQL 3.2 we introduced a new capability – Kafka as a data source. I’ve posted some details about how it works, with some simple examples, over here. But now I want to talk about why you would want to run queries over Kafka. Here is Oracle’s concept picture of a data warehouse:

You have a stream (real-time data), a data lake where you land raw information, and cleaned enterprise data. This is just a concept, which could be implemented in many different ways; one of them is depicted here:

Kafka is the hub for streaming events, where you accumulate data from multiple real-time producers and provide this data to many consumers (which could be real-time processing, such as Spark Streaming, or batch loads into the next data warehouse tier, such as Hadoop).

In this architecture, Kafka contains the stream data and is able to answer the question “what is going on right now”, whereas the database stores operational data and Hadoop stores historical data, and those two sources are able to answer the question “how it used to be”. Big Data SQL allows you to run SQL over these three sources and correlate real-time events with history.

Example of using Big Data SQL over Kafka and other sources.

Above, I’ve explained why you may want to query Kafka with Big Data SQL; now let me give a concrete example.

Input for demo example:

- We have a company called MoviePlex, which sells video content all around the world

- There are two stream datasets – network data, which contains information about network errors, the condition of routing devices, and so on; and the second data source, which captures movie sales

- Both stream into Kafka in real time

- Also, we have historical network data, which we store in HDFS (because of the cost of this data), historical sales data (which we store in the database), and multiple dimension tables, stored in the RDBMS as well

Based on this we have a business case – monitor the revenue flow, correlate current traffic with the historical benchmark (depending on the day of the week and hour of the day), and try to find the reason in case of failures (network errors, for example).

Using Oracle Data Visualization Desktop, we’ve created a dashboard which shows how the real-time traffic correlates with the statistical benchmark and also shows the number of network errors by country:

The blue line is a historical benchmark.

Over time we see that some errors appear in some countries (left dashboard), but current revenue is more or less the same as it used to be.

After a while revenue starts going down.

This trend keeps going.

A lot of network errors in France. Let’s drill down into itemized traffic:

Indeed, we caught that overall revenue goes down because of France, and the cause is network errors.


1) Kafka stores real-time data and answers the question “what is going on right now”

2) The database and Hadoop store historical data and answer the question “how it used to be”

3) Big Data SQL can query data from Kafka, Hadoop, and the database within a single query, joining the datasets (a sketch follows below)

4) This allows us to correlate historical benchmarks with real-time data through a SQL interface and use it with any SQL-compatible BI tool
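To make point 3 concrete, here is a hedged sketch of such a query. The object names (movieplex.sales_stream_kafka over the Kafka topic, movieplex.sales_benchmark in the database, movieplex.country_dim as a dimension) and their columns are hypothetical stand-ins for the demo objects, and the sketch assumes the streamed JSON payload has already been exposed as relational columns:

-- current revenue per country from the Kafka stream vs. the historical benchmark
SELECT d.country_name,
       SUM(s.revenue)     AS revenue_last_hour,      -- real-time sales landed in Kafka
       MAX(b.avg_revenue) AS historical_benchmark    -- pre-aggregated history in the database
FROM   movieplex.sales_stream_kafka s
JOIN   movieplex.country_dim        d ON d.country_id  = s.country_id
JOIN   movieplex.sales_benchmark    b ON b.country_id  = s.country_id
                                     AND b.day_of_week = TO_CHAR(SYSDATE, 'D')
                                     AND b.hour_of_day = TO_CHAR(SYSDATE, 'HH24')
WHERE  s.timestamp > SYSTIMESTAMP - INTERVAL '1' HOUR
GROUP  BY d.country_name;

In exactly the same way, the historical network data sitting on HDFS can be joined in to explain any drop that the comparison reveals.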


Oracle Blogs | Oracle The Data Warehouse Insider Blog

Big Data SQL Quick Start. Big Data SQL over Kafka – Part 23

Big Data SQL 3.2 brings a few interesting features. Among them, one of the most interesting is the ability to read Kafka. Before drilling down into the details, I’d like to explain in a nutshell what Kafka is.

What is Kafka?

You can find the full scope of information about Kafka here, but in a nutshell, it’s a distributed, fault-tolerant messaging system. It allows you to connect many systems in an organized fashion. Instead of connecting each system peer to peer:

you can land all your messages company-wide on one system and consume them from there, like this:

Kafka is a kind of data hub system, where you land the messages and serve them afterwards.

More technical details.

I’d like to introduce a few key Kafka terms.

1) Kafka Broker. This is the Kafka service, which you run on each server and which handles all read and write requests.

2) Kafka Producer. The process that writes data to Kafka.

3) Kafka Consumer. The process that reads data from Kafka.

4) Message. The name describes itself; I just want to add that messages have a key and a value. In contrast to NoSQL databases, Kafka’s key is not indexed. It has application purposes (you may put some application logic in the key) and administrative purposes (each message with the same key goes to the same partition).

5) Topic. Messages are organized into topics. Database people would compare a topic with a table.

6) Partition. It’s good practice to divide a topic into partitions for performance and maintenance purposes. Messages with the same key go to the same partition. If a key is absent, messages are distributed in round-robin fashion.

7) Offset. The offset is the position of each message in the topic. The offset is indexed, and it allows you to quickly access a particular message.

When do you delete data?

One of the basic Kafka concepts is retention – Kafka does not keep data forever, nor does it wait for all consumers to read a message before deleting it. Instead, the Kafka administrator configures a retention period for each topic – either an amount of time for which to store messages before deleting them, or how much data to store before older messages are purged. Two parameters control this: log.retention.ms and log.retention.bytes.

log.retention.bytes sets the amount of data to retain in the log for each topic partition. This is a per-partition limit: multiply it by the number of partitions to get the total data retained for the topic.
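For example, broker-level defaults in server.properties might look like this (the values are just an assumption; pick whatever fits your use case):

# server.properties: delete messages older than 7 days, or once a partition exceeds ~1 GB
log.retention.ms=604800000
log.retention.bytes=1073741824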

How to query Kafka data with Big Data SQL?

To query Kafka data, you need to create a Hive table first. Let me show an end-to-end example. I have a JSON file:

$   cat web_clicks.json
{ click_date: "38041", click_time: "67786", date: "2004-02-26", am_pm: "PM", shift: "second", sub_shift: "evening", item_sk: "396439", web_page: "646"}
{ click_date: "38041", click_time: "41831", date: "2004-02-26", am_pm: "AM", shift: "first", sub_shift: "morning", item_sk: "90714", web_page: "804"}
{ click_date: "38041", click_time: "60334", date: "2004-02-26", am_pm: "PM", shift: "second", sub_shift: "afternoon", item_sk: "151944", web_page: "867"}
{ click_date: "38041", click_time: "53225", date: "2004-02-26", am_pm: "PM", shift: "first", sub_shift: "afternoon", item_sk: "175796", web_page: "563"}
{ click_date: "38041", click_time: "47515", date: "2004-02-26", am_pm: "PM", shift: "first", sub_shift: "afternoon", item_sk: "186943", web_page: "777"}
{ click_date: "38041", click_time: "73633", date: "2004-02-26", am_pm: "PM", shift: "second", sub_shift: "evening", item_sk: "118004", web_page: "647"}
{ click_date: "38041", click_time: "43133", date: "2004-02-26", am_pm: "AM", shift: "first", sub_shift: "morning", item_sk: "148210", web_page: "930"}
{ click_date: "38041", click_time: "80675", date: "2004-02-26", am_pm: "PM", shift: "second", sub_shift: "evening", item_sk: "380306", web_page: "484"}
{ click_date: "38041", click_time: "21847", date: "2004-02-26", am_pm: "AM", shift: "third", sub_shift: "morning", item_sk: "55425", web_page: "95"}
{ click_date: "38041", click_time: "35131", date: "2004-02-26", am_pm: "AM", shift: "first", sub_shift: "morning", item_sk: "185071", web_page: "118"}

and I’m going to load it into Kafka with the standard Kafka tool “kafka-console-producer”:

$   cat web_clicks.json|kafka-console-producer --broker-list bds2:9092,bds3:9092,bds4:9092,bds5:9092,bds6:9092 --topic json_clickstream

To check that the messages have appeared in the topic, you can use the following command:

$   kafka-console-consumer --zookeeper bds1:2181,bds2:2181,bds3:2181 --topic json_clickstream --from-beginning

After I’ve loaded this file into the Kafka topic, I create a table in Hive.

Make sure that you have oracle-kafka.jar and kafka-clients*.jar in your hive.aux.jars.path:

and here:

After this, you can run the following DDL in Hive:

hive> CREATE EXTERNAL TABLE json_web_clicks_kafka
      row format serde 'oracle.hadoop.kafka.hive.KafkaSerDe'
      stored by 'oracle.hadoop.kafka.hive.KafkaStorageHandler'
      tblproperties(...);   -- topic name, broker list, and key/value types are set here
hive> describe json_web_clicks_kafka;
hive> select * from json_web_clicks_kafka limit 1;

As soon as the Hive table has been created, I create an ORACLE_HIVE external table in Oracle:

SQL> CREATE TABLE json_web_clicks_kafka (
       topic         varchar2(50),
       partitionid   integer,
       value         varchar2(4000),
       offset        integer,
       timestamp     timestamp,
       timestamptype integer
     )
     ORGANIZATION EXTERNAL (
       TYPE ORACLE_HIVE
       DEFAULT DIRECTORY DEFAULT_DIR
     )
     REJECT LIMIT UNLIMITED;
Here you also have to keep in mind that you need to add oracle-kafka.jar and kafka-clients*.jar to your bigdata.properties file on both the database and the Hadoop side. I have a dedicated blog post about how to do this here.

Now we are ready to query:

SQL> SELECT * FROM json_web_clicks_kafka

json_clickstream	209	{ click_date: "38041", click_time: "43213"..."}	0	26-JUL-17 PM	1
json_clickstream	209	{ click_date: "38041", click_time: "74669"... }	1	26-JUL-17 PM	1

Oracle 12c provides powerful capabilities for working with JSON, such as the dot notation API. It allows us to easily query the JSON data as a structure:

SQL> SELECT t.value.click_date,
            t.value.click_time
       FROM json_web_clicks_kafka t;

38041	40629
38041	48699
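
The same dot notation also works in predicates; for instance, a quick sketch counting the PM clicks from the sample JSON above:

SQL> SELECT COUNT(1)
       FROM json_web_clicks_kafka t
      WHERE t.value.am_pm = 'PM';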

Working with AVRO messages.

In many cases customers use Avro as a flexible, self-describing format for exchanging messages through Kafka. We certainly support it, and in an easy and flexible way.

I have a topic which contains Avro messages, and I define a Hive table over it (the Avro schema string is truncated here):

hive> CREATE EXTERNAL TABLE web_sales_kafka
row format serde 'oracle.hadoop.kafka.hive.KafkaSerDe'
stored by 'oracle.hadoop.kafka.hive.KafkaStorageHandler'
tblproperties('oracle.kafka.table.value.type'='avro',
              'oracle.kafka.table.value.schema'='{"type":"record", ... }');
hive> describe web_sales_kafka;
hive> select * from web_sales_kafka limit 1;

Here I set 'oracle.kafka.table.value.type'='avro' and I also have to specify the Avro schema in 'oracle.kafka.table.value.schema'. After this the table exposes the message structure.

In a similar way I define a table in Oracle RDBMS:

SQL> CREATE TABLE web_sales_kafka (
  topic varchar2(50),
  partitionid integer,
  value varchar2(4000),
  offset integer,
  timestamp timestamp,
  timestamptype integer)
ORGANIZATION EXTERNAL (TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS (com.oracle.bigdata.tablename: web_sales_kafka))
REJECT LIMIT UNLIMITED;

And we are good to query the data!

Performance considerations.

1) Number of Partitions.

This is the most important thing to keep in mind; there is a nice article about how to choose the right number of partitions. For Big Data SQL purposes I'd recommend a partition count somewhat higher than the number of CPU cores in your Big Data SQL cluster.
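
If an existing topic has too few partitions, the count can be raised with the standard kafka-topics tool (the target count here is just an example):

$   kafka-topics --alter --zookeeper bds1:2181,bds2:2181,bds3:2181 --topic json_clickstream --partitions 96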

2) Query fewer columns

Use the column pruning feature. In other words, list only the columns you actually need in your SELECT list and WHERE clause. Here is an example.

I've created a PL/SQL function which does nothing. PL/SQL can't be offloaded to the cell side, so wrapping a column in it forces all the data to be moved to the database side:

SQL> create or replace function fnull(input number) return number is
  Result number;
begin
  Result := input;
  return(Result);
end fnull;
/
After this I ran a query which touches only one column and checked how much data was returned to the database side.
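
The query was along these lines (a sketch; the table and column are the ones from the clickstream example above):

SQL> SELECT MIN(fnull(t.value.item_sk))
       FROM json_web_clicks_kafka t;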


“cell interconnect bytes returned by XT smart scan” 5741.81MB

Then I repeated the same test case with 10 columns:


“cell interconnect bytes returned by XT smart scan” 32193.98 MB

Hopefully this test case clearly shows that you should query only the columns you actually need.

3) Indexes

There are no indexes other than the offset. Don't let the key column mislead you: it is not indexed. Only the offset gives you quick random access.
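
A targeted read is therefore typically expressed as a range on the offset column, something like this sketch (the offset values are arbitrary):

SQL> SELECT COUNT(1)
       FROM json_web_clicks_kafka
      WHERE offset BETWEEN 100000 AND 200000;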

4) Warm up your data

If you are going to read the same data many times, warm it up first by running a "select *" style query.

Kafka relies on the Linux filesystem cache, so the first read populates the cache and subsequent reads of the same dataset are much faster.

Here is an example:

- I clean up the Linux filesystem cache

dcli -C "sync; echo 3 > /proc/sys/vm/drop_caches"

- I run the first query:
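
For reference, it was a plain full scan of the Kafka-backed table, something like this sketch (the table name is illustrative):

SQL> SELECT * FROM store_sales_kafka;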


It took 278 seconds.

- The second and third runs took only 92 seconds.

5) Use bigger Replication Factor

Here is an example. I have two tables: one created over a Kafka topic with replication factor 1, the second over a topic with replication factor 3.


The query over the first topic (replication factor 1) took 278 seconds for the first run and 92 seconds for the next runs.


The query over the second topic (replication factor 3) took 279 seconds for the first run, but only 34 seconds for the next runs.

6) Compression considerations

Kafka supports different types of compression. If you store the data in JSON or XML format, the compression ratio can be significant. Here are some example numbers:

Data format and compression type    Size of the data, GB
JSON on HDFS, uncompressed          273.1
JSON in Kafka, uncompressed         286.191
JSON in Kafka, Snappy               180.706
JSON in Kafka, GZIP                  52.2649
AVRO in Kafka, uncompressed         252.975
AVRO in Kafka, Snappy               158.117
AVRO in Kafka, GZIP                  54.49
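
For reference, compression is enabled on the producer side; with the console producer that is the --compression-codec flag (the topic name below is illustrative):

$   cat web_clicks.json|kafka-console-producer --broker-list bds2:9092,bds3:9092,bds4:9092 --topic json_clickstream_gzip --compression-codec gzip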

Compression may save some disk space, but considering that Kafka is mostly used as a temporary store (data is typically kept for a week or a month), I'm not sure it makes much sense. You will also pay a performance penalty (and burn more CPU) when querying this data.

I’ve run a query like:

SQL> select count(1) from ...

and got the following results:

Type of compression    Elapsed time, sec
uncompressed           76
Snappy                 80
GZIP                   92

Uncompressed is the leader; GZIP and Snappy are slower (not dramatically, but measurably). Taking this into account, as well as the fact that Kafka is a temporary store, I wouldn't recommend using compression without an exceptional need.

7) Parallelize your processing.

If for some reason you are using a small number of partitions, you can use the Hive table property "oracle.kafka.partition.chunk.size" to increase parallelism. This parameter defines the size of an input split, so if you set it to 1 MB and your topic holds 4 MB in total, it will be processed by 4 parallel threads. The test case below demonstrates this.

Here is the test case:

- Drop Kafka topic

$   kafka-topics --delete --zookeeper cfclbv3870:2181,cfclbv3871:2181,cfclbv3872:2181 --topic store_sales

- Create again with only one partition

$   kafka-topics --create --zookeeper cfclbv3870:2181,cfclbv3871:2181,cfclbv3872:2181 --replication-factor 3 --partitions 1 --topic store_sales

- Check it

$   kafka-topics --describe --zookeeper cfclbv3870:2181,cfclbv3871:2181,cfclbv3872:2181 --topic store_sales
Topic:store_sales       PartitionCount:1        ReplicationFactor:3     Configs:
      Topic: store_sales      Partition: 0    Leader: 79      Replicas: 79,76,77      Isr: 79,76,77

- Check the size of input file:

$   du -h store_sales.dat
19G     store_sales.dat

- Load data to the Kafka topic

$   cat store_sales.dat|kafka-console-producer --broker-list cfclbv3870.us2.oraclecloud.com:9092,cfclbv3871.us2.oraclecloud.com:9092,cfclbv3872.us2.oraclecloud.com:9092,cfclbv3873.us2.oraclecloud.com:9092,cfclbv3874.us2.oraclecloud.com:9092 --topic store_sales  --request-timeout-ms 30000  --batch-size 1000000

- Create Hive External table

hive> CREATE EXTERNAL TABLE store_sales_kafka
row format serde 'oracle.hadoop.kafka.hive.KafkaSerDe'
stored by 'oracle.hadoop.kafka.hive.KafkaStorageHandler'
tblproperties('oracle.kafka.table.topics'='store_sales',
              'oracle.kafka.bootstrap.servers'='cfclbv3870.us2.oraclecloud.com:9092');

- Create Oracle external table

SQL> CREATE TABLE store_sales_kafka (
      topic varchar2(50), partitionid integer, value varchar2(4000),
      offset integer, timestamp timestamp, timestamptype integer)
     ORGANIZATION EXTERNAL (TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR
      ACCESS PARAMETERS (com.oracle.bigdata.tablename=default.store_sales_kafka))
     REJECT LIMIT UNLIMITED;

- Run test query

SQL> SELECT COUNT(1) FROM store_sales_kafka;

It took 142 seconds.

- Re-create the Hive external table with the 'oracle.kafka.partition.chunk.size' parameter set to 1 MB

hive> CREATE EXTERNAL TABLE store_sales_kafka
row format serde 'oracle.hadoop.kafka.hive.KafkaSerDe'
stored by 'oracle.hadoop.kafka.hive.KafkaStorageHandler'
tblproperties('oracle.kafka.table.topics'='store_sales',
              'oracle.kafka.bootstrap.servers'='cfclbv3870.us2.oraclecloud.com:9092',
              'oracle.kafka.partition.chunk.size'='1048576');

- Run query again:

SQL> SELECT COUNT(1) FROM store_sales_kafka;

Now it took only 7 seconds

A 1 MB split is quite low; for big topics we recommend 256 MB (for the 19 GB topic above that would still give roughly 76 parallel splits).

8) Querying small topics.

Sometimes you need to query really small topics (a few hundred messages, for example), but very frequently. In this case it makes sense to create the topic with fewer partitions.

Here is the test case example:

- Create topic with 1000 partitions

$   kafka-topics --create --zookeeper cfclbv3870:2181,cfclbv3871:2181,cfclbv3872:2181 --replication-factor 3 --partitions 1000 --topic small_topic

- Load only one message there

$   echo "test"|kafka-console-producer --broker-list cfclbv3870.us2.oraclecloud.com:9092,cfclbv3871.us2.oraclecloud.com:9092,cfclbv3872.us2.oraclecloud.com:9092,cfclbv3873.us2.oraclecloud.com:9092,cfclbv3874.us2.oraclecloud.com:9092 --topic small_topic

- Create hive external table

hive> CREATE EXTERNAL TABLE small_topic_kafka
row format serde 'oracle.hadoop.kafka.hive.KafkaSerDe'
stored by 'oracle.hadoop.kafka.hive.KafkaStorageHandler'
tblproperties('oracle.kafka.table.topics'='small_topic',
              'oracle.kafka.bootstrap.servers'='cfclbv3870.us2.oraclecloud.com:9092');

- Create Oracle external table

SQL> CREATE TABLE small_topic_kafka (
  topic varchar2(50),
  partitionid integer,
  value varchar2(4000),
  offset integer,
  timestamp timestamp,
  timestamptype integer)
ORGANIZATION EXTERNAL (TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS (com.oracle.bigdata.tablename=default.small_topic_kafka))
REJECT LIMIT UNLIMITED;

- Query all rows from it

SQL> SELECT * FROM small_topic_kafka

it took 6 seconds

- Create topic with only one partition and put only one message there and run same SQL query over it

$   kafka-topics --create --zookeeper cfclbv3870:2181,cfclbv3871:2181,cfclbv3872:2181 --replication-factor 3 --partitions 1 --topic small_topic
$   echo "test"|kafka-console-producer --broker-list cfclbv3870.us2.oraclecloud.com:9092,cfclbv3871.us2.oraclecloud.com:9092,cfclbv3872.us2.oraclecloud.com:9092,cfclbv3873.us2.oraclecloud.com:9092,cfclbv3874.us2.oraclecloud.com:9092 --topic small_topic
SQL> SELECT * FROM small_topic_kafka

Now it takes only 0.5 seconds.

9) Type of data in Kafka messages.

You have a few options for storing data in Kafka messages, and you certainly want pushdown processing. Big Data SQL supports pushdown operations only for JSON. This means that everything you can expose as JSON will be pushed down to the cell side and processed there.


- The query which could be pushed down to the cell side (JSON):
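
A sketch of such a query (the table is the JSON clickstream table from earlier; the predicate value is taken from the sample data):

SQL> SELECT COUNT(1)
       FROM json_web_clicks_kafka t
      WHERE t.value.item_sk = '396439';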


- The query which could not be pushed down to the cell side (XML) looked roughly like this:

SQL> SELECT COUNT(1) FROM web_returns_xml_kafka t
     WHERE xmltype(t.value).extract('/operation/col[@name="WR_ORDER_NUMBER"]/after/text()').getNumberVal() = 233183247;

If the amount of data is not significant, you can use Big Data SQL to process it as-is. If we are talking about big data volumes, you can process it once and convert it into a different file format on HDFS with a Hive query:

hive> select xpath_int(value,'/operation/col[@name="WR_ORDER_NUMBER"]/after/text()') from WEB_RETURNS_XML_KAFKA limit 1 ;
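
For example, a one-off conversion could be a CTAS into Parquet on HDFS (the target table and column names here are hypothetical):

hive> CREATE TABLE web_returns_parquet STORED AS PARQUET AS
      SELECT xpath_int(value,'/operation/col[@name="WR_ORDER_NUMBER"]/after/text()') AS wr_order_number
      FROM WEB_RETURNS_XML_KAFKA;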

10) JSON vs AVRO format in the Kafka topics

Continuing from the previous point, you may be wondering which semi-structured format to use. The answer is easy: use whatever your data source produces, since there is no significant performance difference between Avro and JSON. For example, take a simple aggregation query over the same dataset.
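
A sketch (the JSON-variant table name is hypothetical; web_sales_kafka is the Avro table defined earlier):

SQL> SELECT COUNT(1) FROM json_web_sales_kafka;
SQL> SELECT COUNT(1) FROM web_sales_kafka;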


Such a query completes in 112 seconds in the JSON case and in 105 seconds in the Avro case.

The JSON topic takes 286.33 GB while the Avro topic takes 202.568 GB. There is some difference, but not enough to be worth converting the original format.

How do you bring data from OLTP databases into Kafka? Use GoldenGate!

Oracle GoldenGate is a well-known product for capturing commit logs on the database side and bringing the changes into a target system. The good news is that Kafka can play the role of the target system. I'll skip the detailed explanation of this feature, because it's already explained in great detail here.


Whether it’s a Garage Start up or 100-year manufacturer, NetSuite Fuels Business Growth

Posted by David Turner, Senior Marketing Director, EMEA, Oracle NetSuite

There are exciting opportunities for organisations today to grow and innovate. That could mean going into new markets, launching new products and services, or coming up with new business models. There are always ways to expand your business.


There are also challenges to that growth, however. Data is locked away in siloes across the organisation. It's not real-time, it's not accessible, and you can't always analyse it. Compliance and regulation are growing ever more complex, country by country. Systems don't talk to each other. Attracting and retaining talent is tough, and on top of all that there are always new competitors entering the market.

This ‘hairball’ of disconnected systems hinders visibility across the organisation. As operations become ever more complex many companies are forced to resort to spreadsheets and manual processes to paper over the cracks between disparate systems, making it hard to see what’s really going on. With half of start-ups failing within five years, it’s more vital than ever to monitor the health of the business and identify the drivers of growth.

Having a unified cloud business system is absolutely key to tackling these challenges. At our NetSuite Next Ready Business Tour in London this week, NetSuite customers detailed the real-world challenges they’re facing and how NetSuite is helping to overcome them.


London-based home fashion label Buster + Punch has grown rapidly over the past four years. Founded in a London garage, the company has since grown to include an ecommerce website, a showroom in London and a retail store in Stockholm. It now has more than 71 stockists selling products across 27 countries.

Buster + Punch CEO Martin Preen explained to the audience at the event: “We were growing very fast in lots of different markets, different languages, an omnichannel business and lots of different siloes everywhere – and quite frankly getting one picture of the organisation was impossible.”

That drove Buster + Punch to standardize, streamline and scale its operations on NetSuite OneWorld. With one unified cloud solution, Buster + Punch can extend its global growth, including a move to build a presence in the US.

At the other end of the scale is Sheffield-based OSL Cutting Technologies, a manufacturing business that has been around since 1865. OSL Cutting Technologies manufactures and imports magnetic drilling machines and cutting tools. Matthew Grey, managing director at OSL Cutting Technologies, told the Next Ready Business Tour attendees that his business has seen a lot of change and transition in the last few years.

“We have a distribution hub in the US and China and supply chain all over the world. That offers some interesting challenges in terms of building systems to support it. We acquired a business in 2015 and that left us with four systems in one business,” he said.

The company implemented NetSuite OneWorld in May 2017 to manage financials, multi-currency accounting and financial consolidation, CRM, email marketing and advanced manufacturing processes, and it has already improved its on-time delivery and reporting and streamlined its financial operations.

These organisations are using NetSuite to regain control of their data and systems and extract clear, actionable insights. Since our acquisition by Oracle, and with the increased resources that gives us, that's something we are better placed than ever to help businesses do, as we expand our cloud platform capabilities to cater for any industry, country, language and currency. Ultimately our mission remains the same as ever: to help you grow your business.

Posted on Fri, October 20, 2017 by NetSuite


Big Data SQL Quick Start. Custom SerDe – Part 20

Many thanks to Bilal Ibdah, who is the actual author of this content; I'm just publishing it on the Big Data SQL blog.

A modernized data warehouse is a data warehouse augmented with insights and data from a Big Data environment, typically Hadoop. Rather than moving and pushing the Hadoop data into a database, companies now tend to expose this data through a unified layer that allows access to all data storage platforms: Hadoop, Oracle Database and NoSQL, to be more specific.

The problem arises when the data we want to expose is stored in its native format at the lowest possible granularity, for example packet data, which can be in a binary format (PCAP). A typical use of packet data is in the telecommunications industry, where it is generated by the packet core and can contain raw data records, known in the telecom industry as XDRs.

Here is an example of a traditional architecture, where source data is loaded into a mediation layer, the resulting text (CSV) files are parsed by an ETL engine, and the data is then loaded into the database:

[Figure: traditional architecture with mediation and ETL in front of the database]

Here is an alternative architecture, where you load the data directly into HDFS (which is part of your logical data warehouse) and then parse it on the fly while the SQL runs:

[Figure: alternative architecture with PCAP files landed directly in HDFS and parsed at query time]

In this blog we're going to use Oracle Big Data SQL to expose and access raw data stored in PCAP format living in Hadoop.

The first step is to store the PCAP files in HDFS using the "copyFromLocal" command.
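
A sketch of the command (the file name is an example; the target directory matches the LOCATION used in the Hive DDL below):

$   hdfs dfs -copyFromLocal dns_traffic.pcap /user/oracle/pcap/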


This is what the pcap file looks like in HDFS:

[Screenshot: the pcap file listed in HDFS]

In order to expose this file using Big Data SQL, we need to register it in the Hadoop metastore. Once it's registered in the metastore, Big Data SQL can access the metadata, create an external table, and run pure Oracle SQL queries on the file. Registering the file, however, requires unlocking its content with a custom SerDe; more details here.

Start by downloading the PCAP project from GitHub here; the project contains two components:

  • The hadoop-pcap-lib, which can be used in MapReduce jobs and,
  • The hadoop-pcap-serde, which can be used to query PCAPs in HIVE

For this blog, we will only use the serde component.

If the serde project hasn’t been compiled, compile it in an IDE or in a cmd window using the command “mvn package -e -X”


Copy the output jar named “hadoop-pcap-serde-1.1-SNAPSHOT-jar-with-dependencies.jar” found in the target folder to each node in your hadoop cluster:


Then add the pcap serde to the HIVE environment variables through Cloudera Manager:

[Screenshot: the Hive auxiliary JARs setting in Cloudera Manager]

Then save the changes and restart HIVE (you might also need to redeploy the configuration and restart the stale services).

Now let's create a Hive table and test the SerDe; copy the DDL below to create the table:

DROP TABLE pcap;
ADD JAR hadoop-pcap-serde-0.1-jar-with-dependencies.jar;
SET net.ripe.hadoop.pcap.io.reader.class=net.ripe.hadoop.pcap.DnsPcapReader;
CREATE EXTERNAL TABLE pcap (ts bigint,
                             ts_usec string,
                             protocol string,
                             src string,
                             src_port int,
                             dst string,
                             dst_port int,
                             len int,
                             ttl int,
                             dns_queryid int,
                             dns_flags string,
                             dns_opcode string,
                             dns_rcode string,
                             dns_question string,
                             dns_answer array<string>,
                             dns_authority array<string>,
                             dns_additional array<string>)
ROW FORMAT SERDE 'net.ripe.hadoop.pcap.serde.PcapDeserializer'
STORED AS INPUTFORMAT 'net.ripe.hadoop.pcap.io.PcapInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs:///user/oracle/pcap/';

Now it's time to test the SerDe in Hive; let's run the query below:

select * from pcap limit 5;


The query ran successfully. Next we will create an Oracle external table that points to the pcap file using Big Data SQL. For this purpose we need to add the PCAP SerDe jar to the Big Data SQL environment variables (this must be done on each node in your Hadoop cluster): create a directory on each server of the Oracle Big Data Appliance, such as "/home/oracle/pcapserde/", copy the SerDe jar to each node, and then browse to /opt/oracle/bigdatasql/bdcell-12.1.

Add the pcap jar file to the environment variables list in the configuration file "bigdata.properties".

[Screenshot: bigdata.properties with the pcap SerDe jar added to the classpath]
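
The change boils down to appending the jar to the user classpath property, roughly like this (assuming the jar was copied to /home/oracle/pcapserde/; keep whatever entries are already listed):

java.classpath.user=<existing entries>:/home/oracle/pcapserde/hadoop-pcap-serde-1.1-SNAPSHOT-jar-with-dependencies.jar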

The classpath also needs to be updated in the bigdata.properties file on the database nodes.

First we need to copy the jar to the database nodes: 

  • Copy jar to db side
  • Add jar to class path
  • Create db external table and run query
  • Restart “bdsql” service in Cloudera Manager

After this we are good to define the external table in Oracle RDBMS and query it!

[Screenshots: the Oracle external table definition over the pcap data and a sample query]
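
A minimal sketch of that external table and a sample query (the column list is a subset of the Hive columns; the Hive table is assumed to live in the default database):

SQL> CREATE TABLE pcap (
       ts_usec      varchar2(100),
       protocol     varchar2(20),
       src          varchar2(100),
       src_port     number,
       dst          varchar2(100),
       dst_port     number,
       dns_question varchar2(4000))
     ORGANIZATION EXTERNAL (TYPE ORACLE_HIVE DEFAULT DIRECTORY DEFAULT_DIR
       ACCESS PARAMETERS (com.oracle.bigdata.tablename=default.pcap))
     REJECT LIMIT UNLIMITED;

SQL> SELECT protocol, COUNT(1) FROM pcap GROUP BY protocol;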

Just to highlight: in the last query we parse and query binary data on the fly.


Big Data SQL Quick Start. Complex Data Types – Part 21

Many thanks to Dario Vega, who is the actual author of this content. I’m just publishing it on this blog.

A common, potentially mistaken approach people take when integrating NoSQL, Hive and ultimately Big Data SQL is to look at it only from an RDBMS perspective rather than from an integration point of view. People generally think about all the features and data types they are already familiar with from one of these products, rather than realizing that the actual data is stored in Hive (or NoSQL) rather than in the RDBMS, and that it will nevertheless be queried from the RDBMS.

When using Big Data SQL with complex types, we tend to reach for JSON/SQL without paying attention to the differences between how Oracle Database and Hive handle complex types. Why? Because the complex types are mapped to VARCHAR2 in JSON format, so we read the data in JSON style rather than in the style of the original system.

The best illustration of this: from a JSON perspective (ECMA-404), the Map type does not exist.

Programming languages vary widely on whether they support objects, and if so, what characteristics and constraints the objects offer. The models of object systems can be wildly divergent and are continuing to evolve. JSON instead provides a simple notation for expressing collections of name/value pairs. Most programming languages will have some feature for representing such collections, which can go by names like record, struct, dict, map, hash, or object.

The following built-in collection functions are supported in Hive (a short usage sketch follows the list):

  • int size(Map) Returns the number of elements in the map type.

  • array map_keys(Map) Returns an unordered array containing the keys of the input map.

  • array map_values(Map) Returns an unordered array containing the values of the input map.
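
On the Hive side a usage sketch looks like this (assuming the phoneinfo map column used in the samples below):

hive> select size(phoneinfo), map_keys(phoneinfo), map_values(phoneinfo)
      from rmvtable_hive_parquet limit 1;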

Are they supported in the RDBMS? The answer is no, though it may become yes if you use APEX, PL/SQL or Java programs.

In the same way, there is also a difference between Impala and Hive.

Lateral views. In CDH 5.5 / Impala 2.3 and higher, Impala supports queries on complex types (STRUCT, ARRAY, or MAP), using join notation rather than the EXPLODE() keyword. See Complex Types (CDH 5.5 or higher only) for details about Impala support for complex types.

The Impala complex type support produces result sets with all scalar values, and the scalar components of complex types can be used with all SQL clauses, such as GROUP BY, ORDER BY, all kinds of joins, subqueries, and inline views. The ability to process complex type data entirely in SQL reduces the need to write application-specific code in Java or other programming languages to deconstruct the underlying data structures.

Best practices: we would advise taking a conservative approach.

This is because the mappings between the NoSQL data model, the Hive data model, and the Oracle RDBMS data model are not 1-to-1.
For example, the NoSQL data model is quite rich, and there are many things one can do with nested classes in NoSQL that have no counterpart in either Hive or Oracle Database (or both). As a result, the integration of the three technologies has to take a 'least-common-denominator' approach, employing mechanisms common to all three.

But let me show a sample.

Impala code

SELECT zipcode, lastname, firstname, ssn, gender, PHONEINFO.*
FROM rmvtable_hive_parquet, rmvtable_hive_parquet.PHONEINFO AS PHONEINFO
WHERE zipcode = '02610' AND lastname = 'ACEVEDO' AND firstname = 'TAMMY' AND ssn = 576228946;
+---------+----------+-----------+-----------+--------+------+--------------+
| zipcode | lastname | firstname | ssn       | gender | key  | value        |
+---------+----------+-----------+-----------+--------+------+--------------+
| 02610   | ACEVEDO  | TAMMY     | 576228946 | female | work | 617-656-9208 |
| 02610   | ACEVEDO  | TAMMY     | 576228946 | female | cell | 408-656-2016 |
| 02610   | ACEVEDO  | TAMMY     | 576228946 | female | home | 213-879-2134 |
+---------+----------+-----------+-----------+--------+------+--------------+

Oracle code:

`phoneinfo` IS JSON
FROM pmt_rmvtable_hive_json_api a
WHERE a.json_column.zipcode = '02610' AND a.json_column.lastname = 'ACEVEDO'
  AND a.json_column.firstname = 'TAMMY' AND a.json_column.ssn = 576228946;

ZIPCODE : 02610
SSN : 576228946
GENDER : female
PHONEINFO : {"work":"617-656-9208","cell":"408-656-2016","home":"213-879-2134"}

QUESTION: how do we transform this PHONEINFO JSON into two arrays (keys and values), i.e. get the Map behavior we expect?

Unfortunately, the NESTED PATH clause of JSON_TABLE is only available for JSON arrays. On the other hand, when using JSON we can access each field as a column.

FROM pmt_rmvtable_hive_orc a
WHERE zipcode = '02610' AND lastname = 'ACEVEDO' AND firstname = 'TAMMY' AND ssn = 576228946;

02610  ACEVEDO  TAMMY  576228946  female  533933353734363933  617-656-9208  213-879-2134  408-656-2016

And what about using map columns in the WHERE clause, looking for a specific phone number?

Impala code

SELECT zipcode, lastname, firstname, ssn, gender, PHONEINFO.*
FROM rmvtable_hive_parquet, rmvtable_hive_parquet.PHONEINFO AS PHONEINFO
WHERE PHONEINFO.key = 'work' AND PHONEINFO.value = '617-656-9208';
+---------+------------+-----------+-----------+--------+------+--------------+
| zipcode | lastname   | firstname | ssn       | gender | key  | value        |
+---------+------------+-----------+-----------+--------+------+--------------+
| 89878   | ANDREWS    | JEREMY    | 848834686 | male   | work | 617-656-9208 |
| 00183   | GRIFFIN    | JUSTIN    | 976396720 | male   | work | 617-656-9208 |
| 02979   | MORGAN     | BONNIE    | 904775071 | female | work | 617-656-9208 |
| 14462   | MCLAUGHLIN | BRIAN     | 253990562 | male   | work | 617-656-9208 |
| 83193   | BUSH       | JANICE    | 843046328 | female | work | 617-656-9208 |
| 57300   | PAUL       | JASON     | 655837757 | male   | work | 617-656-9208 |
| 92762   | NOLAN      | LINDA     | 270271902 | female | work | 617-656-9208 |
| 14057   | GIBSON     | GREGORY   | 345334831 | male   | work | 617-656-9208 |
| 04336   | SAUNDERS   | MATTHEW   | 180588967 | male   | work | 617-656-9208 |
| 23993   | VEGA       | JEREMY    | 123967808 | male   | work | 617-656-9208 |
+---------+------------+-----------+-----------+--------+------+--------------+
Fetched 852 row(s) in 99.80s

But let me continue showing the same code on Oracle (querying on work phone).

Oracle code

`phoneinfo` IS JSON
FROM pmt_rmvtable_hive_parquet a

35330  SIMS        DOUGLAS  295204437  male    {"work":"617-656-9208","cell":"901-656-9237","home":"303-804-7540"}
43466  KIM         GLORIA   358875034  female  {"work":"617-656-9208","cell":"978-804-8373","home":"415-234-2176"}
67056  REEVES      PAUL     538254872  male    {"work":"617-656-9208","cell":"603-234-2730","home":"617-804-1330"}
07492  GLOVER      ALBERT   919913658  male    {"work":"617-656-9208","cell":"901-656-2562","home":"303-804-9784"}
20815  ERICKSON    REBECCA  912769190  female  {"work":"617-656-9208","cell":"978-656-0517","home":"978-541-0065"}
48250  KNOWLES     NANCY    325157978  female  {"work":"617-656-9208","cell":"901-351-7476","home":"213-234-8287"}
48250  VELEZ       RUSSELL  408064553  male    {"work":"617-656-9208","cell":"978-227-2172","home":"901-630-7787"}
43595  HALL        BRANDON  658275487  male    {"work":"617-656-9208","cell":"901-351-6168","home":"213-227-4413"}
77100  STEPHENSON  ALBERT   865468261  male    {"work":"617-656-9208","cell":"408-227-4167","home":"408-879-1270"}

852 rows selected.
Elapsed: 00:05:29.56

In this case we can also use the dot notation: a.phoneinfo.work = '617-656-9208'.
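
A sketch of the full statement (the column list follows the output above):

SQL> SELECT a.zipcode, a.lastname, a.firstname, a.phoneinfo
       FROM pmt_rmvtable_hive_parquet a
      WHERE a.phoneinfo.work = '617-656-9208';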

Note: to become familiar with the Database JSON API you can follow this blog series: https://blogs.oracle.com/jsondb
