Category Archives: DWH News and Info

UPDATED: Big Data Warehousing Must See Guide for Oracle OpenWorld 2017


*** UPDATED *** Must-See Guide now available as PDF and via Apple iBooks Store

This updated version now contains details of all the most important hands-on labs AND a day-by-day calendar. This means that our comprehensive guide now covers absolutely everything you need to know about this year’s Oracle OpenWorld conference. When you arrive at the Moscone Conference Center you will be ready to get the absolute most out of this amazing conference.

The updated, and still completely free, big data warehousing Must-See Guide for OpenWorld 2017 is now available for download from the Apple iBooks Store (click here) and in PDF format (click here).

Just so you know…this guide contains the following information:

Chapter 1

 – Introduction to the must-see guide. 

Chapter 2

 – A guide to the key highlights from last year’s conference so you can relive the experience or see what you missed. Catch the most important highlights from last year’s OpenWorld conference with our on-demand video service, which covers all the major keynote sessions. Sit back and enjoy the highlights. The second section explains why you need to attend this year’s conference and how to justify it to your company.

Chapter 3

- Full list of Oracle Product Management and Development presenters who will be at this year’s OpenWorld. Links to all their social media sites are included alongside each profile. Read on to find out about the key people who can help you and your teams build the FUTURE using Oracle’s Data Warehouse and Big Data technologies. 

Chapter 4

 – List of the “must-see” sessions and hands-on labs at this year’s OpenWorld by category. It includes all the sessions and hands-on labs by the Oracle Product Management and Development teams along with key customer sessions. Read on for the list of the best, most innovative sessions at Oracle OpenWorld 2017.

Chapter 5

 – Day-by-Day “must-see” guide. It includes all the sessions and hands-on labs by the Oracle Product Management and Development teams along with key customer sessions. Read on for the list of the best, most innovative sessions at Oracle OpenWorld 2017. 

Chapter 6

 – Details of all the links you need to keep up to date on Oracle’s strategy and products for Data Warehousing and Big Data. This covers all our websites, blogs and social media pages. 

Chapter 7

 – Details of our exclusive web application for smartphones and tablets, which provides you with a complete guide to everything related to data warehousing and big data at OpenWorld 2017.

Chapter 8

 – Information to help you find your way around the area surrounding the Moscone Conference Center; this section includes some helpful maps.

Let me know if you have any comments. Enjoy and see you in San Francisco.


The Colbran Group of Companies – Home

To thrive and excel in today’s business environment, you have to be able to focus on your core business. 

All kinds of distractions can slow your company down, and you may end up losing your competitive edge.

We analyze your business processes and provide support in all other areas of management so you can focus on your business.


Big Data SQL Quick Start. Complex Data Types – Part 21

Many thanks to Dario Vega, who is the actual author of this content. I’m just publishing it on this blog.

A common, and potentially mistaken, approach to the integration of NoSQL, Hive and ultimately Big Data SQL is to take a pure RDBMS perspective rather than an integration point of view. People generally think about all the features and data types they are already familiar with from one of these products, rather than recognising that the actual data is stored in the Hive (or NoSQL) database rather than in the RDBMS, and that it will be queried from the RDBMS.

When using Big Data SQL with complex types, the temptation is to use JSON/SQL without taking care of the differences between how Oracle Database and Hive handle complex types. Why? Because the complex types are mapped to VARCHAR2 in JSON format, so we read the data in JSON style instead of in the style of the original system.
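To make that mapping concrete, here is a minimal sketch of what a Big Data SQL external table over such a Hive table could look like – the table, directory and column names are assumptions for illustration, not the exact DDL used for the examples below:

CREATE TABLE rmvtable_ora (
  zipcode    VARCHAR2(5),
  lastname   VARCHAR2(30),
  firstname  VARCHAR2(30),
  ssn        NUMBER,
  gender     VARCHAR2(10),
  phoneinfo  VARCHAR2(4000)   -- the Hive map<string,string> surfaces as JSON text
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS (
    com.oracle.bigdata.tablename = default.rmvtable_hive_parquet
  )
)
REJECT LIMIT UNLIMITED;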

The best example of this: from a JSON perspective (JSON, ECMA-404), the Map type does not exist.

Programming languages vary widely on whether they support objects, and if so, what characteristics and constraints the objects offer. The models of object systems can be wildly divergent and are continuing to evolve. JSON instead provides a simple notation for expressing collections of name/value pairs. Most programming languages will have some feature for representing such collections, which can go by names like record, struct, dict, map, hash, or object.

The following built-in collection functions are supported in Hive:

  • int size(Map) – Returns the number of elements in the map type.

  • array map_keys(Map) – Returns an unordered array containing the keys of the input map.

  • array map_values(Map) – Returns an unordered array containing the values of the input map.

Are they supported in the RDBMS? The answer is no – although you could approximate them using APEX, PL/SQL or Java programs.

In the same way, there is also a difference between Impala and Hive.

Lateral views. In CDH 5.5 / Impala 2.3 and higher, Impala supports queries on complex types (STRUCT, ARRAY, or MAP), using join notation rather than the EXPLODE() keyword. See Complex Types (CDH 5.5 or higher only) for details about Impala support for complex types.

The Impala complex type support produces result sets with all scalar values, and the scalar components of complex types can be used with all SQL clauses, such as GROUP BY, ORDER BY, all kinds of joins, subqueries, and inline views. The ability to process complex type data entirely in SQL reduces the need to write application-specific code in Java or other programming languages to deconstruct the underlying data structures.


Best practices

We would advise taking a conservative approach.

This is because the mappings between the NoSQL data model, the Hive data model, and the Oracle RDBMS data model are not 1-to-1.
For example, the NoSQL data model is quite rich and there are many things one can do with nested classes in NoSQL that have no counterpart in either Hive or Oracle Database (or both). As a result, integration of the three technologies has to take a ‘least-common-denominator’ approach, employing mechanisms common to all three.

But let me show a sample

Impala code

`phoneinfo` map<string,string>
 
impala> SELECT
         ZIPCODE
        ,LASTNAME
        ,FIRSTNAME
        ,SSN
        ,GENDER
        ,PHONEINFO.*
FROM rmvtable_hive_parquet, rmvtable_hive_parquet.PHONEINFO AS PHONEINFO
WHERE zipcode = '02610' AND lastname = 'ACEVEDO' AND firstname = 'TAMMY' AND ssn = 576228946;
 
+---------+----------+-----------+-----------+--------+------+--------------+
| zipcode | lastname | firstname | ssn       | gender | key  | value        |
+---------+----------+-----------+-----------+--------+------+--------------+
| 02610   | ACEVEDO  | TAMMY     | 576228946 | female | work | 617-656-9208 |
| 02610   | ACEVEDO  | TAMMY     | 576228946 | female | cell | 408-656-2016 |
| 02610   | ACEVEDO  | TAMMY     | 576228946 | female | home | 213-879-2134 |
+---------+----------+-----------+-----------+--------+------+--------------+

Oracle code:

`phoneinfo` IS JSON
 
 
SQL> SELECT /*+ MONITOR */
      a.json_column.zipcode
     ,a.json_column.lastname
     ,a.json_column.firstname
     ,a.json_column.ssn
     ,a.json_column.gender
     ,a.json_column.phoneinfo
FROM pmt_rmvtable_hive_json_api a
WHERE a.json_column.zipcode = '02610' AND a.json_column.lastname = 'ACEVEDO'
  AND a.json_column.firstname = 'TAMMY' AND a.json_column.ssn = 576228946;
 
 
ZIPCODE : 02610 
LASTNAME : ACEVEDO
FIRSTNAME : TAMMY
SSN : 576228946
GENDER : female
PHONEINFO :{"work":"617-656-9208","cell":"408-656-2016","home":"213-879-2134"}

QUESTION: How do we transform this JSON PHONEINFO object into two “arrays” – keys and values – i.e. the Map behaviour we would expect?

Unfortunately, the nested path JSON_TABLE operator is only available for JSON arrays. On the other hand, when using JSON we can access each field as a column.

SQL> SELECT /*+ MONITOR */
    ZIPCODE
   ,LASTNAME
   ,FIRSTNAME
   ,SSN
   ,GENDER
   ,LICENSE
   ,a.PHONEINFO.work
   ,a.PHONEINFO.home
   ,a.PHONEINFO.cell
FROM pmt_rmvtable_hive_orc a
WHERE zipcode = '02610' AND lastname = 'ACEVEDO' AND firstname = 'TAMMY' AND ssn = 576228946;
 
 
 
ZIPCODE  LASTNAME  FIRSTNAME  SSN        GENDER  LICENSE             WORK          HOME          CELL
-------  --------  ---------  ---------  ------  ------------------  ------------  ------------  ------------
02610    ACEVEDO   TAMMY      576228946  female  533933353734363933  617-656-9208  213-879-2134  408-656-2016
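Just to illustrate the JSON_TABLE point: if phoneinfo had been stored as a JSON array rather than a map-style object, the nested rows could be unnested with JSON_TABLE. This is a hypothetical sketch only – the array layout shown in the comment does not match the actual data above:

-- hypothetical layout: [{"type":"work","num":"617-656-9208"},{"type":"cell","num":"408-656-2016"}]
SELECT jt.phone_type, jt.phone_num
FROM pmt_rmvtable_hive_orc a,
     JSON_TABLE(a.PHONEINFO, '$[*]'
       COLUMNS (phone_type VARCHAR2(10) PATH '$.type',
                phone_num  VARCHAR2(20) PATH '$.num')) jt
WHERE a.zipcode = '02610';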

And what about using map columns in the WHERE clause – looking for a specific phone number?

Impala code

`phoneinfo` map<string,string>

SELECT
   ZIPCODE
  ,LASTNAME
  ,FIRSTNAME
  ,SSN
  ,GENDER
  ,PHONEINFO.*
FROM rmvtable_hive_parquet, rmvtable_hive_parquet.PHONEINFO AS PHONEINFO
WHERE PHONEINFO.key = 'work' AND PHONEINFO.value = '617-656-9208';
 
+---------+------------+-----------+-----------+--------+------+--------------+
| zipcode | lastname   | firstname | ssn       | gender | key  | value        |
+---------+------------+-----------+-----------+--------+------+--------------+
| 89878   | ANDREWS    | JEREMY    | 848834686 | male   | work | 617-656-9208 |
| 00183   | GRIFFIN    | JUSTIN    | 976396720 | male   | work | 617-656-9208 |
| 02979   | MORGAN     | BONNIE    | 904775071 | female | work | 617-656-9208 |
| 14462   | MCLAUGHLIN | BRIAN     | 253990562 | male   | work | 617-656-9208 |
| 83193   | BUSH       | JANICE    | 843046328 | female | work | 617-656-9208 |
| 57300   | PAUL       | JASON     | 655837757 | male   | work | 617-656-9208 |
| 92762   | NOLAN      | LINDA     | 270271902 | female | work | 617-656-9208 |
| 14057   | GIBSON     | GREGORY   | 345334831 | male   | work | 617-656-9208 |
| 04336   | SAUNDERS   | MATTHEW   | 180588967 | male   | work | 617-656-9208 |
...
| 23993   | VEGA       | JEREMY    | 123967808 | male   | work | 617-656-9208 |
+---------+------------+-----------+-----------+--------+------+--------------+

Fetched 852 row(s) in 99.80s

But let me continue by showing the same query on Oracle (querying on the work phone).

Oracle code

`phoneinfo` IS JSON
 
 
SELECT /*+ MONITOR */
   ZIPCODE
  ,LASTNAME
  ,FIRSTNAME
  ,SSN
  ,GENDER
  ,PHONEINFO
FROM pmt_rmvtable_hive_parquet a
WHERE JSON_QUERY("A"."PHONEINFO" FORMAT JSON, '$.work'
        RETURNING VARCHAR2(4000) ASIS WITHOUT ARRAY WRAPPER NULL ON ERROR) = '617-656-9208';
 
35330  SIMS        DOUGLAS  295204437  male    {"work":"617-656-9208","cell":"901-656-9237","home":"303-804-7540"}
43466  KIM         GLORIA   358875034  female  {"work":"617-656-9208","cell":"978-804-8373","home":"415-234-2176"}
67056  REEVES      PAUL     538254872  male    {"work":"617-656-9208","cell":"603-234-2730","home":"617-804-1330"}
07492  GLOVER      ALBERT   919913658  male    {"work":"617-656-9208","cell":"901-656-2562","home":"303-804-9784"}
20815  ERICKSON    REBECCA  912769190  female  {"work":"617-656-9208","cell":"978-656-0517","home":"978-541-0065"}
48250  KNOWLES     NANCY    325157978  female  {"work":"617-656-9208","cell":"901-351-7476","home":"213-234-8287"}
48250  VELEZ       RUSSELL  408064553  male    {"work":"617-656-9208","cell":"978-227-2172","home":"901-630-7787"}
43595  HALL        BRANDON  658275487  male    {"work":"617-656-9208","cell":"901-351-6168","home":"213-227-4413"}
77100  STEPHENSON  ALBERT   865468261  male    {"work":"617-656-9208","cell":"408-227-4167","home":"408-879-1270"}

852 rows selected.
 
Elapsed: 00:05:29.56

In this case, we can also use the dot-notation: a.PHONEINFO.work = '617-656-9208'.
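A sketch of the equivalent query using dot notation (same filter as the JSON_QUERY version above):

SELECT /*+ MONITOR */
   ZIPCODE
  ,LASTNAME
  ,FIRSTNAME
  ,SSN
  ,GENDER
  ,PHONEINFO
FROM pmt_rmvtable_hive_parquet a
WHERE a.PHONEINFO.work = '617-656-9208';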

Note: to get familiar with the Database JSON API you can follow this blog series: https://blogs.oracle.com/jsondb


The first really hidden gem in Oracle Database 12c Release 2: runtime modification of external table parameters

We forgot to document some functionality!!!

With the next milestone for Oracle Database 12c Release 2 having just taken place – availability on premise for Linux x86-64, Solaris SPARC64, and Solaris x86-64, in addition to the Oracle Cloud – I managed to use this as an excuse to play around with it for a bit .. and found that we somehow forgot to document new functionality. Bummer. But still better than the other way around ;-)

We forgot to document the capability to override some parameters of an external table at runtime.

So I decided to quickly blog about this to not only fill the gap in documentation (a doc bug is filed already) but also to ruthlessly hijack the momentum and start highlighting new functionality (there are more blog posts to come, specifically around my pet peeve Partitioning, but that’s for later).

So what does it mean to override some parameters of an external table at runtime?

It simply means that you can use one external table definition stub as a proxy for external data access of different files, with different reject limits, at different points in time – without the need to run DDL to modify the external table definition.

The usage is pretty simple and straightforward, so let me quickly demonstrate this with a not-so-business-relevant sample table. The prerequisite SQL for this one to run is at the end of this blog and might make its way onto GitHub as well; I have not managed that yet and just wanted to get this blog post out.

Here is my rather trivial external table definition. It has worked for me since version 9, so why not use it with 12.2 as well.

CREATE TABLE et1 (col1 NUMBER, col2 NUMBER, col3 NUMBER)
ORGANIZATION EXTERNAL
(TYPE ORACLE_LOADER
  DEFAULT DIRECTORY d1
  ACCESS PARAMETERS
  ( RECORDS DELIMITED BY NEWLINE CHARACTERSET US7ASCII
    NOBADFILE
    NOLOGFILE
    FIELDS TERMINATED BY ","
   )
  LOCATION ('file1.txt')
)
   REJECT LIMIT UNLIMITED
;

Pretty straightforward vanilla external table. Let’s now see how many rows this external table returns (the simple “data generation” is at the end of this blog):

SELECT count(*) FROM et1;

  COUNT(*)
———-
        99
So far, so good. And now the new functionality. We will now access the exact
same external table but tell the database to do a runtime modification of the file (location) we are accessing:

SELECT count(*) FROM et1
EXTERNAL MODIFY
(LOCATION ('file2.txt'));

  COUNT(*)
———-
         9
As you can see, the row count changes without me having done any change
to the external table definition like an ALTER TABLE. You will also see that nothing has
changed in the external table definition:

SQL> SELECT table_name, location FROM user_external_locations WHERE table_name='ET1';

TABLE_NAME                     LOCATION
—————————— ——————————
ET1                            file1.txt
And there’s one more thing. You might have asked yourself right now, right this moment … why do I have to specify a location then for the initial external table creation? The
answer is simple: you do not have to do this anymore.

Here is my external table without a location:

CREATE TABLE et2 (col1 NUMBER, col2 NUMBER, col3 NUMBER)
ORGANIZATION EXTERNAL
(TYPE ORACLE_LOADER
  DEFAULT DIRECTORY d1
  ACCESS PARAMETERS
  ( RECORDS DELIMITED BY NEWLINE CHARACTERSET US7ASCII
    NOBADFILE
    NOLOGFILE
    FIELDS TERMINATED BY ","
   )
)
   REJECT LIMIT UNLIMITED
;

When I now select from it, guess what: you won’t get any rows back. The location is NULL.

SQL> SELECT * FROM et2;

no rows selected
Personally, I am not sure I would go for this approach. If you plan to always override the location, I would rather have at least one dummy file in the location containing a control record with some pre-defined values, so you can tell whether there are really no records or whether there is a programmatic mistake. But like many things in life, that’s a choice. As you saw, you don’t have to.

Using this stub table in the same way as before gives me access to my data.

SELECT count(*) FROM et2
EXTERNAL MODIFY
(LOCATION ('file2.txt'));

  COUNT(*)
———-
         9

You get the idea. Pretty cool stuff. 

Ah, and to complete the short functional introduction: the following clauses can be overridden: DEFAULT DIRECTORY, LOCATION, ACCESS PARAMETERS (BADFILE, LOGFILE, DISCARDFILE) and REJECT LIMIT.
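For example, here is a quick sketch combining several of these overrides against the stub table; treat it as illustrative until the documentation catches up:

SELECT count(*) FROM et2
EXTERNAL MODIFY
  (DEFAULT DIRECTORY d1
   LOCATION ('file2.txt')
   REJECT LIMIT 10);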

That’s about it for now for the online modification capabilities of external tables. I am sure I have forgotten some little details here and there, but there are always so many things to talk (write) about that you can never catch it all. And hopefully the documentation will cover it all sooner rather than later.

Stay tuned for now. There are more blog posts about 12.2 to come. And please, if you have any comments about this specific one or suggestions for future ones, let me know. You can always reach me at hermann.baer@oracle.com.

Cheers, over and out.

And here’s the simplest “data generation” I used for the examples above to get “something” into my files. Have fun playing.

rem my directory
rem
create or replace directory d1 as '/tmp';

rem create some dummy data in /tmp
rem
set line 300 pagesize 5000
spool /tmp/file1.txt
select rownum ||','|| 1 ||','|| 1 ||','  from dual connect by level < 100;
spool off

spool /tmp/file2.txt
select rownum ||','|| 22 ||','|| 22 ||','  from dual connect by level < 10;
spool off


Data loading into HDFS – Part3. Streaming data loading

In my previous blogs I wrote about data loading into HDFS. In the first blog I covered data loading from generic servers to HDFS; the second blog was devoted to offloading data from Oracle RDBMS. Here I want to explain how to load streaming data into Hadoop. Before anything else, I want to note that I will not cover Oracle GoldenGate for Big Data here, simply because it deserves a dedicated blog post. Today I’m going to talk about Flume and Kafka.

What is Kafka? 

Kafka is a distributed service bus. OK, but what is a service bus? Imagine that you have a few data systems, and each one needs data from the others. You could link them directly, point to point, but that quickly becomes very hard to manage.


Instead, you could have one centralized system that accumulates data from all sources and acts as a single point of contact for all systems.


What is Flume? 

“Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store.” – this definition from the documentation explains pretty well what Flume is. Flume was historically developed for loading data into HDFS. But why couldn’t I just use the Hadoop client?

Challenge 1. Small files.

Hadoop has been designed for storing large files, and despite a lot of optimization work around the NameNode over the last few years, it’s still recommended to store only big files. If your source has a lot of small files, Flume can collect them and flush the collection in batch mode, like a single big file. I always use the analogy of a glass and drops: you could collect one million drops in one glass, and after this you will have one glass of water instead of one million drops.

Challenge 2. Lots of data sources

Let’s imagine that I have an application (or even two, on two different servers) that produces files which I want to load into HDFS.


Life is good: if the files are large enough, this is not going to be a problem.

But now let’s imagine that I have 1000 application servers and each one wants to write data into HDFS. Even if the files are large, this workload will collapse your Hadoop cluster. If you don’t believe it, just try it (but not on a production cluster!). So we have to have something in between HDFS and our data sources.


Now it’s time for Flume. You could build a two-tier architecture: the first tier collects data from the different sources, and the second tier aggregates it and loads it into HDFS.


In my example, 1000 sources are handled by 100 Flume servers on the first tier, which in turn load data onto the second tier; the second tier connects directly to HDFS, and in my example that means only two connections – which is affordable. You can find more details here; I just want to add that the general practice is to use one aggregation agent for every 4-16 client agents.

I also want to note that it’s good practice to use an Avro sink when you move data from one tier to the next. Here is an example of the Flume config file:

agent.sinks = avroSink
agent.sinks.avroSink.type = avro
agent.sinks.avroSink.channel = memory
agent.sinks.avroSink.hostname = avrosrchost.example.com
agent.sinks.avroSink.port = 4353
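And, for completeness, a minimal sketch of the matching configuration on the aggregation tier – the agent, channel and sink names here are made up, and the HDFS path is just an example:

# aggregation-tier agent: receives events from the first tier via Avro RPC
collector.sources = avroSrc
collector.sources.avroSrc.type = avro
collector.sources.avroSrc.bind = 0.0.0.0
collector.sources.avroSrc.port = 4353
collector.sources.avroSrc.channels = memChannel

collector.channels = memChannel
collector.channels.memChannel.type = memory

# the collector then fans the events into HDFS
collector.sinks = hdfsSink
collector.sinks.hdfsSink.type = hdfs
collector.sinks.hdfsSink.channel = memChannel
collector.sinks.hdfsSink.hdfs.path = /flume/events/%Y-%m-%d
collector.sinks.hdfsSink.hdfs.useLocalTimeStamp = true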

Kafka Architecture.

You can find deep technical presentations about Kafka here and here (I actually borrowed a few screens from them), and a very interesting technical video here. In this article I will just review the key terms and concepts.


Producer – a process that writes data into the Kafka cluster. It could be part of an application, or edge nodes could play this role.

Consumer – a process that reads data from the Kafka cluster.

Broker – a member of the Kafka cluster. A set of brokers makes up the Kafka cluster.

Flume Architecture.

You can find a lot of useful information about Flume in this book; here I will just highlight the key concepts.


Flume has 3 major components:

1) Source – where I get the data

2) Channel – where I buffer it. It could be memory or disk, for example.

3) Sink – where I load my data. For example, it could be another tier of Flume agents, HDFS or HBase.


Between the source and the channel there are two minor components: the Interceptor and the Selector.

With an Interceptor you can do simple processing; with a Selector you can choose a channel depending on the message header, as sketched below.
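As an illustration only – the agent, source and channel names below are made up, but the interceptor and selector properties are the standard Flume ones:

# add a static header to every event from this source
agent.sources.src1.interceptors = i1
agent.sources.src1.interceptors.i1.type = static
agent.sources.src1.interceptors.i1.key = datacenter
agent.sources.src1.interceptors.i1.value = dc1

# route events to a channel based on a header value
agent.sources.src1.selector.type = multiplexing
agent.sources.src1.selector.header = datacenter
agent.sources.src1.selector.mapping.dc1 = channel1
agent.sources.src1.selector.default = channel2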

Flume and Kafka similarities and differences.

It’s a frequent question: “what is the difference between Flume and Kafka?” The answer could be very long, but let me briefly explain the key points.

1) Pull and Push.

Flume accumulates data until some condition is met (number of events, size of the buffer, or a timeout) and then pushes it to its destination.

Kafka accumulates data until a client initiates a read, so clients pull data whenever they want.

2)  Data processing

Flume can do simple transformations via interceptors.

Kafka doesn’t do any data processing; it just stores the data.

3) Clustering

Flume is usually deployed as a set of independent agents.

Kafka is a cluster, which means it gives you benefits such as high availability and scalability out of the box, without extra effort.

4) Message size

Flume doesn’t have any obvious restrictions on the size of a message.

Kafka was designed for messages of a few KB.

5) Coding vs Configuring

Flume is usually a configuration-driven tool (users generally don’t write code; instead they use its configuration capabilities).

With Kafka, you have to write code to load/unload the data.

Flafka.

Many customers wonder whether to choose Flume or Kafka for handling their streaming data. Stop choosing – use both. It’s quite a common use case and it is named Flafka. A good explanation with nice pictures can be found here (I actually borrowed a few screens from there).

First of all, Flafka is not a dedicated project. It’s just a bunch of Java classes for integrating Flume and Kafka.


Now Kafka can be either a source for Flume, via this Flume config directive:

flume1.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource

or a channel, via the following directive:

flume1.channels.kafka-channel-1.type = org.apache.flume.channel.kafka.KafkaChannel 

Use Case 1. Kafka as a source or channel

If you have Kafka as an enterprise service bus (see my example above), you may want to load data from your service bus into HDFS. You could do this by writing a Java program, but if you don’t like that option, you can use Kafka as a Flume source.


In this case, Kafka is also useful for smoothing peak load, while Flume provides flexible routing.

You can also use Kafka as a Flume channel for high-availability purposes (it’s distributed by design).

Use case 2. Kafka as a sink.

If you use Kafka as an enterprise service bus, you may want to load data into it. The native way for Kafka is a Java program, but if you feel it will be way more convenient with Flume (just a few config files), you have that option. The only thing you need is to configure Kafka as a sink, as sketched below.
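A minimal sketch of such a config – the broker and topic values are placeholders, and the exact property names (brokerList/topic vs. kafka.bootstrap.servers/kafka.topic) depend on your Flume version:

# write events from a Flume channel into a Kafka topic
flume1.sinks = kafka-sink-1
flume1.sinks.kafka-sink-1.type = org.apache.flume.sink.kafka.KafkaSink
flume1.sinks.kafka-sink-1.channel = channel1
flume1.sinks.kafka-sink-1.brokerList = kafkabroker1:9092,kafkabroker2:9092
flume1.sinks.kafka-sink-1.topic = enterprise-bus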


Use case 3. Flume as the tool to enrich data.

As I already mentioned, Kafka can’t do any data processing – it just stores data without any transformation. You can use Flume as a way to add some extra information to your Kafka messages. To do this you define Kafka as a source, implement an interceptor which adds the information to each message, and write the result back to Kafka in a different topic.


Conclusion.

There are two major tools for loading streaming data – Flume and Kafka. There is no single right answer about which to use, because each tool has its own advantages and disadvantages. Generally, that’s why Flafka was created – it’s just a combination of the two.


Dealing with very very long string lists using Database 12.2


Oracle RDBMS 11gR2 introduced the LISTAGG function for working with string values. It can be used to aggregate values from groups of rows and return a concatenated string where the values are typically separated by a comma or semi-colon – you can determine this yourself within the code by supplying your own separator symbol.

Based on the number of posts across various forums and blogs, it is widely used by developers. However, there is one key issue that has been highlighted by many people: when using LISTAGG on data sets that contain very large strings it is possible to create a list that is too long. This causes the following overflow error to be generated:

ORA-01489: result of string concatenation is too long.

Rather annoyingly for developers and DBAs, it is very difficult to determine ahead of time if the concatenation of the values within the specified LISTAGG measure_expr will cause an ORA-01489 error. Many people have posted workarounds to resolve this problem – including myself. Probably the most elegant and simple solution has been to use the 12c MATCH_RECOGNIZE feature, however, this required use of 12c Release 1 which was not always available to all DBAs and/or developers.

If you want to replicate the problem and you have access to the sample SH schema then try executing this query:

SELECT
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',') WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

All the samples in this post use our sample SH schema. Once we release the on-premise version of 12.2 you will be able to download the Examples file for your platform from the database home page on OTN. I do have a tutorial ready-to-go on LiveSQL, however, there is currently a technical issue with using very very very long strings – essentially, running my LISTAGG workshop generates a ‘Data value out of range’ error so as soon as it’s fixed I will update this blog post with a link to the tutorial.

What have we changed in 12.2?

One way of resolving ORA-01489 errors is to simply increase the size of VARCHAR2 objects.

Larger object sizes

The size limit for VARCHAR2 objects is determined by the database parameter MAX_STRING_SIZE. You can check the setting in your database using the following command:

show parameter MAX_STRING_SIZE

in my demo environment this returns the following:

NAME            TYPE   VALUE
————— —— ——–
max_string_size string STANDARD

Prior to Oracle RDBMS 12.1.0.2 the upper limit for VARCHAR2 was 4K. With Oracle RDBMS 12.1.0.2 this limit has been raised to 32K. This increase may solve a lot of issues but it does require a change to the database parameter MAX_STRING_SIZE. By setting MAX_STRING_SIZE = EXTENDED this enables the new 32767 byte limit.

ALTER SYSTEM SET max_string_size=extended SCOPE= SPFILE;

However, with the increasing interest in big data sources it is clear that there is still considerable potential for ORA-01489 errors as you use the LISTAGG feature within queries against extremely large data sets.

What is needed is a richer syntax within the LISTAGG function and this has now been implemented as part of Database 12c Release 2.

Better list management

With 12.2 we have made it easier to manage lists that are likely to generate an error because they are too long. There is a whole series of new keywords that can be used:

  • ON OVERFLOW ERROR
  • ON OVERFLOW TRUNCATE
  • WITH COUNT vs. WITHOUT COUNT

Let’s look a little closer at each of these features….. 

1. Keeping Pre-12.2 functionality

If you want your existing code to continue to return an error if the string is too long then the great news is that this is the default behaviour. When the length of the LISTAGG string exceeds the VARCHAR2 limit then the standard error will be returned:

ERROR at line xxx:
ORA-01489: result of string concatenation is too long

However, where possible I would recommend adding “ON OVERFLOW ERROR” to your LISTAGG code to make it completely clear that you are expecting an error when an overflow happens:

SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW ERROR) WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

So it’s important to note that by default the truncation features are disabled, and you will need to change any existing code if you don’t want an error to be raised.

2. New ON OVERFLOW TRUNCATE… keywords

If you want to truncate the list of values at the 4K or 32K boundary then you need to use the newly added keywords ON OVERFLOW TRUNCATE as shown here:

SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW TRUNCATE) WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

When truncation occurs we truncate back to the next full value, at which point you can control how you tell the user that the list has been truncated. By default we append three dots ‘…’ to the string as an indicator that truncation has occurred, but you can override this as follows:

SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW TRUNCATE '***') WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

If you want to keep the existing pre-12.2 behaviour where we return an error if the string is too long then you can either rely on the default behaviour or explicitly state that an error should be returned (always a good idea to avoid relying on default behaviour in my opinion) by using the keywords:

SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW ERROR) WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

which will now generate the normal error message – i.e. replicates the pre-12.2 behaviour:

ORA-01489: result of string concatenation is too long
01489. 00000 – “result of string concatenation is too long”
*Cause: String concatenation result is more than the maximum size.
*Action: Make sure that the result is less than the maximum size.

Of course you can simply omit the new keywords and get the same behaviour:

SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',') WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;

which, as before, generates the normal error message – i.e. replicates the pre-12.2 behaviour:

ORA-01489: result of string concatenation is too long
01489. 00000 – “result of string concatenation is too long”
*Cause: String concatenation result is more than the maximum size.
*Action: Make sure that the result is less than the maximum size.

3. How many values are missing?

If you need to know how many values were removed from the list to make it fit into the available space then you can use the keywords ‘WITH COUNT’ – this is the default behaviour. Alternatively, if you don’t want a count at the end of the truncated string, you can use the keywords ‘WITHOUT COUNT’.

SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW TRUNCATE '***' WITH COUNT) WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;
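And here is the same query sketched with the counter suppressed:

SELECT 
g.country_region,
LISTAGG(c.cust_first_name||' '||c.cust_last_name, ',' ON OVERFLOW TRUNCATE '***' WITHOUT COUNT) WITHIN GROUP (ORDER BY c.country_id) AS Customer
FROM customers c, countries g
WHERE g.country_id = c.country_id
GROUP BY country_region
ORDER BY country_region;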

4. Do we split values when truncation occurs?

No. When determining where to force the truncation we take into account the full length of each value. Therefore, if you consider the example that we have been using which creates a list of customer names within each country we will always include the customer full name  “Keith Laker” (i.e. first name + last name). There has to be enough space to add the complete string (first+last name) to the list otherwise the whole string, “Keith Laker” is removed and the truncation indicator is inserted. It’s not possible for the last value in the string to be only the first name where the last name has been truncated/removed. 

5. How do we calculate the overall length of the string values?

The characters that indicate an overflow has occurred are appended at the end of the list of values; in this case that is the default value of three dots “. . .”. The overflow functionality traverses backwards from the maximum possible length to the end of the last complete value in the LISTAGG clause, then it adds the user-defined separator, followed by the user-defined overflow indicator, followed by the output from the ‘WITH COUNT’ clause, which adds a counter at the end of a truncated string to indicate the number of values that have been removed/truncated from the list.

Summary

With Database 12c Release 2 we have tackled the ORA-01489 error in two ways: 1) increased the size of VARCHAR2 objects to 32K and 2) extended the functionality of LISTAGG to allow greater control over the management of extremely long lists. Specifically, there are several new keywords:

  • ON OVERFLOW TRUNCATE
  • ON OVERFLOW ERROR (default behaviour)
  • WITH COUNT (default behaviour)
  • WITHOUT COUNT

Hopefully this new functionality will mean that all those wonderful workarounds for dealing with “ORA-01489: result of string concatenation is too long“ errors that have been created over the years can now be replaced by standard SQL functionality. 


SQL Pattern Matching Deep Dive – Part 6, state machines


The obvious way to start this particular post is to pose a couple of simple questions: what is a state machine and why should you care? In general I would say that you don’t need to know about or care about state machines. That’s the beauty of using SQL for pattern matching. The MATCH_RECOGNIZE clause encapsulates all the deep technical modelling and processing that has to be performed to run pattern matching on a data set. However, there are times when it is useful, probably vital, that you understand what is going on behind the scenes and one of the most obvious situations is when backtracking happens.

Therefore, the content covered in this post is going to be a gentle lead-in to my next post, where I am going to discuss the concept of “backtracking” and the dreaded ORA-30009 error.

Let’s start our voyage of discovery… when you attempt to run a SQL statement containing a MATCH_RECOGNIZE clause, during the compilation phase we generate a finite state machine based on the PATTERN and DEFINE clauses in your statement. What is a Finite State Machine? According to Wikipedia:

A finite-state machine (FSM)…is a mathematical model of computation.…it is conceived as an abstract machine that can be in one of a finite number of states. The machine is in only one state at a time…changes from one state to another when initiated by a triggering event or condition; this is called a transition. A particular FSM is defined by a list of its states, and the triggering condition for each transition.

Reference from wikipedia – https://en.wikipedia.org/wiki/Finite-state_machine

A state machine, which is the PATTERN and DEFINE elements of your MATCH_RECOGNIZE clause, can be represented by a directed graph called a state diagram. This diagram shows each of the possible states for the “machine” and the conditions that force the machine to either remain in its current state or move to the next state. Below is a simple example of a state machine:  


On the above diagram each state is represented by a node (ellipse), marked in this case as “State 1” to “State 4”. The arrows, known as edges, show the transition(s) from one state to another. If you look at states 2 and 4 you will notice that they each have two edges, although these edges are shown at different vertical positions on the diagram. When drawing a proper state diagram, each edge is labeled with the event (condition) that triggers the transition. Events (conditions) that don’t cause a change of state are represented by a circular arrow returning to the original state, and these can be seen on states 2 and 4.

The precedence for reading the information is top-down. What this means is that when in State 2 the FSM will test whether State 3 can be reached, and if it can’t it will then test whether State 2 can be maintained. The reverse is true for State 4, where the FSM will first test whether State 4 can be maintained; if it can’t it will then, in this example, either end, having determined that a match has completed, or start backtracking to try and complete a match. I am sure you can now see how this links into my next blog post.

State machines are not limited to just pattern matching. They have all sorts of other uses. If you want a gentle diversion to look at state machines in a little more detail then try this article by Enrique Ortiz from the OTN magazine in August 2004: Managing the MIDlet Life-Cycle with a Finite State Machine.

All of this flows directly into keywords that appear (or don’t appear) in the explain plans, which was covered in the post MATCH_RECOGNIZE and the Optimizer from January 2015. As a quick refresher…essentially there are four new keywords that you need to be aware of:

  • MATCH RECOGNIZE
  • SORT
  • BUFFER
  • DETERMINISTIC FINITE AUTO

The first three bullet points are reasonably obvious. The last keyword is linked to the use of a “state machine”. Its appearance, or lack of appearance, affects the way our pattern is applied to our data set, but that is all explained in the earlier blog post. Most of my MATCH_RECOGNIZE examples are based on the stock ticker data set. Let’s assume that we are searching for V-shaped patterns in our data set (https://docs.oracle.com/database/121/DWHSG/pattern.htm#CACHHJJG):

SELECT *
FROM Ticker MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES STRT.tstamp AS start_tstamp,
LAST(DOWN.tstamp) AS bottom_tstamp,
LAST(UP.tstamp) AS end_tstamp
ONE ROW PER MATCH
AFTER MATCH SKIP TO LAST UP
PATTERN (STRT DOWN+ UP+)
DEFINE
DOWN AS price < PREV(down.price),
UP AS price > PREV(up.price)
) MR
ORDER BY MR.symbol, MR.start_tstamp;

this is what the state diagram would look like:


These diagrams can be really helpful when you have more complex patterns and you need to consider the impact of backtracking. This post is all about laying the building blocks for my next post on backtracking and the dreaded ORA-30009 error. If you have managed to read this far then you are guaranteed to be ready for an in-depth look at what happens inside MATCH_RECOGNIZE when we move from right to left through our state diagram in an attempt to find a complete match. Now you should know everything you need to know about state machines, and I am going to carry over the “why care” part to the next post…

If you want a recap of where we are in this series of pattern matching deep dive posts here is the full list:

As per usual, if you have any comments or questions then feel free to contact me directly via email: keith.laker@oracle.com 



Amazon fiber optic project seeks funding

A project aimed at connecting remote areas in Brazil’s Amazon rainforest needs R$ 70m ($21m) in funding in order to resume work in 2017.

The resources are required for the purchase and roll-out of additional fibre optic infrastructure as well as other equipment needed for the implementation.

Currently, the Amazônia Conectada (Connected Amazon) project, led by the Brazilian Army, connects some 52 riverside municipalities to the Internet through subfluvial fiber optic cables.

One of the heads at the Army’s Integrated Telematics Center, Colonel Eduardo Wolski, told Brazilian portal Convergência Digital that the body is hoping to secure backing from the federal and state governments as well as the private sector – particularly telecom providers interested in expanding their foothold in the north of the country – so that it can carry on.

Lenovo launches New Glass C200: A quick tour






Cyanogen shutting down services and OS by December 31

Image: Cyanogen

Cyanogen late Friday announced all services and nightly builds will be discontinued no later than December 31.