See How Easily You Can Copy Data Between Object Store and HDFS

Object Stores tend to be the place where people put their data in the cloud (see also The New Data Lake – You Need More Than HDFS). You add data there and then share it, load it, or use it across various other services. Here we won’t discuss the architecture, or whether the data lake is now the object store (hint: not yet…), but instead focus on how to easily move data back and forth between object stores and your Big Data Cloud Service (BDCS) cluster(s).

ODCP

The underlying foundation for the screenshots that follow, and for Big Data Manager – a free component included with Big Data Cloud Service – is Oracle Distributed CoPy (ODCP). The utility is loosely based on DistCp, but it makes data movement to and from Object Stores scalable and simple.

For a good overview of ODCP, some performance numbers, and a comparison with a host of other ways of loading data into BDCS, I recommend reviewing this post from the A-Team at Oracle.

For production workloads I would expect everyone to use the command line, as it enables scripting jobs or embedding them in your favorite ETL tool for execution as part of a more comprehensive flow. The command line reference manual is published here.
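To make the scripting idea concrete, here is a minimal sketch of wrapping an ODCP copy in Python – the kind of step an ETL tool might call in a larger flow. It assumes the odcp CLI is on the PATH of a cluster node and accepts source and destination URIs in the form odcp <source> <destination>; the container and HDFS paths are placeholders, so check the reference manual for the exact syntax and flags.

```python
# Minimal sketch: scripting an ODCP copy job from Python.
# Assumptions (not from this post): the `odcp` CLI is on the PATH of a
# cluster node and takes `odcp <source> <destination>` URIs. The
# container and HDFS paths below are placeholders.
import subprocess

SOURCE = "swift://myContainer.myProvider/landing/"  # Object Store source
DESTINATION = "hdfs:///user/oracle/landing/"        # HDFS destination

def copy_to_hdfs(source: str, destination: str) -> None:
    # check=True raises CalledProcessError on a non-zero exit code,
    # so a surrounding ETL flow can react to a failed transfer.
    subprocess.run(["odcp", source, destination], check=True)

if __name__ == "__main__":
    copy_to_hdfs(SOURCE, DESTINATION)
```

Because the transfer is just a child process with a meaningful exit code, the same wrapper drops into cron, a scheduler, or an ETL tool without modification.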

Big Data Manager

For those looking to get going, the command line may be a bit intimidating. Big Data Manager addresses that by providing an elegant way of:

  • Creating reusable storage providers and managing access to them
  • Providing an intuitive file browser and drag and drop capabilities between providers
  • Providing a simple GUI to choose between scheduled (and repeated) and immediate execution of jobs

Creating Data Providers (Storages)

The cluster pages for BDCS have a link to Big Data Manager. The tool requires its own login once you are working in the cluster. After you log in, you land on the main page:

[Screenshot: Big Data Manager main page]


Selecting the Administration tab in the tool lets you create and edit what are called Storages. You can create these for a growing number of providers – for example, Oracle Storage Cloud, Amazon S3, and BDCS HDFS. Check back frequently for new ones, or simply keep an eye on your updated Big Data Manager.

Tip: When creating a Storage for Oracle Object Store, the Tenant starts with “storage-” followed by your identity domain. For example, if your identity domain is mydomain, the Tenant is storage-mydomain.

Once your Storages are created, you are in business and the dragging and dropping can start. In my example, I am going from Oracle Storage Cloud Service to HDFS in my BDCS cluster; that is, I am loading data into my BDCS system:

[Screenshot: dragging and dropping files between Storages]

Now simply drag and drop from left to right (or, of course, the other way) and you will be asked whether to run the copy from Object Store to HDFS immediately, or to schedule it and repeat it at a specified frequency.

[Screenshot: running the copy immediately]

Clicking Create spawns an Apache Spark job on the BDCS cluster, opens a connection to Object Store, and runs the data transfer in parallel based on the settings you can tweak in the Advanced tab.
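The same kind of tuning is available when scripting the copy. As a rough illustration only – the --num-executors flag below is an assumption based on ODCP running as a Spark job, not a confirmed option, so verify it against the command line reference manual – a tuned invocation might look like:

```python
import subprocess

# Illustrative only: "--num-executors" is an assumed, Spark-style flag,
# not a confirmed ODCP option; the paths are placeholders as before.
subprocess.run([
    "odcp",
    "--num-executors", "10",  # hypothetical parallelism setting
    "swift://myContainer.myProvider/landing/",
    "hdfs:///user/oracle/landing/",
], check=True)
```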

Switching the “Run Immediately” toggle to “Repeated Execution” exposes the scheduling options:

[Screenshot: scheduling a repeated copy job]

Once submitted, the job runs and can be monitored in Big Data Manager:

[Screenshot: a completed job in the job monitor]

SDK

Last but not least, there are both Python and Java SDKs for Big Data Manager. Feel free to give all of this a whirl in your BDCS instance and let us know how things go.
