Category Archives: Revolution

Because it’s Friday: The winding river

Ever wondered why rivers take such meandering paths on their way to the sea? Minute Earth explains in this short video:


This process goes on all the time: unless the banks are reinforced (as they usually are where rivers flow through big cities), a river's path keeps changing over time. Vox recently featured this animated GIF (from the Time Timelapse site, which has several other cool animations) showing the path of Peru's Ucayali River from 1982 to 2012.


Incidentally, much the same process is behind the formation of river deltas. As the river reaches the sea and slows down, it deposits sediment. The river’s shifting position acts much like a random walk, and the sediment creates the triangular shape of the delta.
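You can see the random-walk intuition in miniature with a base-R toy (entirely illustrative, not a hydrological model): let the river mouth drift one unit left or right per time step, and let sediment accumulate wherever it sits. The deposition spreads out around the center, much as a delta fans out from the river's last fixed point.

```r
set.seed(1)
steps <- 10000
# the mouth drifts one unit left or right each tick: a simple random walk
pos <- cumsum(sample(c(-1L, 1L), steps, replace = TRUE))
# sediment piles up wherever the mouth happens to sit
deposit <- table(pos)
# the spread of deposition grows roughly like sqrt(steps)
range(pos)
```

Plotting `deposit` shows the characteristic mound of sediment centered on the starting position.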

That’s all we have for this week. Join us back here on Monday for more from the Revolutions blog.



Now’s a great time to learn R. Here’s how.

In a recent article, I offer up some reasons why now is the time to learn R: data scientists are in high demand, R is the natural language for data scientists, and companies around the world are using R (and hiring R programmers) to make sense of new data sources. Sharp Sight Labs also offers some excellent reasons why you should choose the R language for data science.

So, if you’d like to take the plunge, here are some tips to help you get started with R:
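To give you a taste of what a first session looks like, here is a tiny, self-contained example using only base R and the built-in mtcars data set:

```r
x <- c(4.1, 9.7, 6.5, 3.2)          # a vector of numbers
mean(x)                              # basic statistics are one-liners
summary(mtcars$mpg)                  # built-in data sets to practice on
fit <- lm(mpg ~ wt, data = mtcars)   # regress fuel economy on car weight
coef(fit)                            # the slope is negative: heavier cars use more fuel
```

Everything above runs in a stock R installation with no extra packages.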

Got any other tips for beginners getting started with R? Let us know in the comments.



Because it’s Friday: Learn to Kern

Are you one of those people obsessed by typefaces? I am, and one of my bugbears is bad kerning. (Follow that link for many infuriating examples of keming.) But kerning isn’t as easy as it looks, as the Kerntype game will prove to you.


Use your mouse or touchscreen to move the interior letters around to improve the appearance of the word. You’ll be scored against a “perfect” kerning. My score was a disappointing 84 out of 100. Looks like I need to lem to kem.

That’s all for this week. See you on Monday!



What’s your KatRisk Score?

by Joseph Rickert

KatRisk, a Berkeley-based catastrophe modeling company specializing in wind and flood risk, has put three R- and Shiny-powered interactive demos on its website. Together these provide a nice introduction to the practical aspects of weather-based risk modeling and give a good indication of the kinds of data that are important. Two of the models, the US & Caribbean Hurricane Model and the Asia Typhoon Model, provide a tremendous amount of information, but they require a little background knowledge to understand the data required to drive them and the computed loss statistics.

The Flood Data Lookup Model, however, can really hit home for anybody. Just bring up the model, type in the address of the location of interest, and press the red “Geocode” button to get the associated longitude and latitude. Then click on the “Get Data” button. The resulting information will give you an idea of the level of risk for the property and show you what a 100-year flood and a 500-year flood would look like. Next, switch to the “Flood Map” tab and press the “Get Map” button to see some of the information overlaid on a Google map.

Not being able to resist the opportunity to have Google Maps google Google, I thought it would be interesting to see how bad things could get at the Googleplex.


Uh oh! The Googleplex gets a pretty high KatRisk score. A 100 year flood would put the place under 7 feet of water!


Not to worry though: Google has already completed their first round of feasibility tests for a navy. (Nobody does long range planning like Google.)

The KatRisk models are based on R code that makes heavy use of data.table for fast table lookups of the risk results. As the company says on its website:

KatRisk has developed a suite of analytic tools to make it easy to access our data and models. We use open source software tools including R Shiny for our web applications. By using R shiny we can develop on-line products that can also easily be deployed to a client site. Our software is completely open, so if you decide to host our analytical tools you will be able to see all of the details in easy to understand and modify R code.
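The data.table feature behind that kind of fast lookup is the keyed subset: once a table is keyed, retrieving rows is a binary search rather than a full vector scan. A minimal sketch (the column names are invented for illustration, not KatRisk's actual schema):

```r
library(data.table)

# a toy risk table: one row per location (schema invented for illustration)
risk <- data.table(loc_id = 1:100000,
                   flood_100yr_ft = runif(100000, 0, 12))
setkey(risk, loc_id)              # sort once; subsequent lookups use the key
hit <- risk[J(c(42L, 7777L))]     # keyed join: binary search, not a scan
hit$flood_100yr_ft                # the 100-year flood depths for those locations
```

On tables with millions of rows, the difference between a keyed lookup and a vector scan is what makes an interactive Shiny front end feel instantaneous.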

For some details on the underlying analytics, have a look at this previous post, which was based on a talk Dag Lohmann gave to the Bay Area useR Group last year.

So, go ahead and compute your KatRisk score, but please do be mindful of the company’s request not to run the model for more than 3 locations in one day.



R wins a 2014 Bossie Award

I missed this when it was announced back on September 29, but R won a 2014 Bossie Award for best open-source big-data tools from InfoWorld (see entry number 5):

A specialized computer language for statistical analysis, R continues to evolve to meet new challenges. Since displacing lisp-stat in the early 2000s, R is the de-facto statistical processing language, with thousands of high-quality algorithms readily available from the Comprehensive R Archive Network (CRAN); a large, vibrant community; and a healthy ecosystem of supporting tools and IDEs. The 3.0 release of R removes the memory limitations previously plaguing the language: 64-bit builds are now able to allocate as much RAM as the host operating system will allow.

Traditionally R has focused on solving problems that best fit in local RAM, utilizing multiple cores, but with the rise of big data, several options have emerged to process large-scale data sets. These options include packages that can be installed into a standard R environment as well as integrations into big data systems like Hadoop and Spark (that is, RHive and SparkR).

Check out the full list of winners at the link below. (Thanks to RG for the tip!)

InfoWorld: Bossie Awards 2014: The best open source big data tools



Because it’s Friday: Not a spiral

Regular readers of this blog know that I love optical illusions, and I recently found the most (literally!) mind-boggling one I’ve seen yet. This is not a spiral:


(There are many similar illusions, but I was unable to find the source for this exact one. If you know who created it, let me know in the comments.) 

But why do our brains have such difficulty with images like this? It turns out it has a lot to do with angles and colors, as this nifty post at Carlos Scheidegger’s visualization blog demonstrates. Using interactive sliders, you can adjust the parameters of a similar illusion to see how the image on the left looks normal, while the one on the right boggles the mind.


But if you want a real trip, click the “Engage” button near the bottom of the post. Whoa, indeed! Remember, you’re looking at concentric circles, not spirals.

That’s all for this week. Have a great weekend, and we’ll be back on Monday!



Quandl Chapter 2: The Democratization of Commercial Data

by Tammer Kamel
Quandl’s Founder

About 22 months ago I had the privilege of introducing Quandl to the world on this blog. At that time Quandl had about 2 million datasets and a few hundred users. (And we thought that was fabulous.) Now, at the end of 2014, we have some 12 million datasets on the site and tens of thousands of registered users. On most days we serve about 1 million API requests.

One thing that has not changed, however, is the simplicity with which R users can access Quandl. Joseph’s post last year and Ilya’s post this year both demonstrated the ease of connecting to Quandl via R.
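For anyone who hasn't tried it, the access pattern really is a one-liner with the Quandl R package; and since the API ultimately serves plain CSV, you can see exactly what is being fetched. (The dataset code below is an arbitrary example, and the package call needs network access plus, for heavy use, an API key.)

```r
# with the Quandl package (requires network access):
# install.packages("Quandl"); library(Quandl)
# gdp <- Quandl("FRED/GDP")   # returns a data frame of dates and values

# the equivalent raw request is just a CSV URL:
code <- "FRED/GDP"
url  <- paste0("https://www.quandl.com/api/v1/datasets/", code, ".csv")
url
# read.csv(url) would pull the same data without any package at all
```

This transparency is part of why the R community adopted Quandl so quickly: there is nothing proprietary between you and the data.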

Adoption of Quandl in the R community was perhaps the biggest factor in our early success. Thus it is fitting that I am back guest-blogging here at this moment in time because we are actually at the dawn of a new chapter at Quandl: We’re adding commercial data to the site. We are going to make hundreds of commercial databases from domain experts like Zacks, ORATS, OptionWorks, Corre Group, MP Maritime, DelphX, Benzinga and many others available via the same simple API.

What makes this new foray interesting is that we won’t be playing by the rules that the incumbent oligarchy of data distributors has established. Their decades-old model has not served consumers well: it keeps data prices artificially high, it cripples innovation, and it is antithetical to modern patterns of data consumption and usage. In fact, the business models around commercial data predate the internet itself. They can and should be disrupted. So we’re going to give that a go.

Our plan is nothing less than democratizing supply and demand of commercial data. Anyone will be able to buy data on Quandl. There will be no compulsory bundling, forcing you to pay for extra services you don’t need; no lock-in to expensive long-term contracts; no opaque pricing; no usage monitoring or consumption limits; no artificial scarcity or degradation. Users will be able to buy just the datasets they need, a la carte, as and when they need them. They will get their data delivered precisely the way they want, with generous free previews, minimal usage restrictions and all the advantages of the Quandl platform. And of course, the data itself will be of the highest quality; professional grade data manufactured by the best curators in the world.

We will also democratize the supply of data. Anyone, from existing data vendors and primary data producers to individuals and entrepreneurs, will have equal access to the Quandl platform and the unmet demand of the Quandl user base. We want to create a situation where anyone capable of curating and maintaining a database can monetize their work. In time, we hope that competition among vendors will force prices to their economic minimum. This is the best possible way to deliver the lowest possible prices to our users.

At the same time, this democratization should empower capable curators to realize the full value of their skills: if someone can build and maintain a database that commands $25 a month from 1,000 people, then Quandl can be the vehicle that transforms that person from skilled analyst to successful data vendor.

If you were to characterize what we are doing as a marketplace for data you would be absolutely correct. We are convinced that fair and open competition will do great things, both for data consumers who are, frankly, being gouged, and for existing and aspirational data vendors who are disempowered. Open and fair competition is a panacea for both ills: it effects lower prices, wider distribution, better data quality, better documentation and better customer service.

Our foray into commercial data has already started with 6 pilot vendors. They range from entrepreneurially-minded analysts who are building databases to rival what the incumbents currently sell for exorbitant fees, to long-established data vendors progressive enough to embrace Quandl’s modern paradigm. We have no less than 25 vendors coming online in Q1 2015.

So, Quandl in 2015 should very quickly become everything an analyst needs: A free and unlimited API, dozens of package connections including to R, 12 million (and growing) free and open datasets, and access to commercial data from the best companies in the world at ever decreasing prices. Wish us luck!



Some R Highlights from H2O World

by Joseph Rickert

H2O held its first H2O World conference over two days at the Computer History Museum in Mountain View, CA. Although the main purpose of the conference was to promote the company’s rich set of Java-based machine learning algorithms and announce its new products, Flow and Play, there were quite a few sessions devoted to R and statistics in general.


Before I describe some of these, a few words about the conference itself. H2O World was exceptionally well run, especially for a first try with over 500 people attending (my estimate). The venue is an interesting, accommodating space with plenty of parking that played well with what, I think, must have been an underlying theme of the conference: acknowledging the contributions of past generations of computer scientists and statisticians. There were two stages offering simultaneous talks for at least part of the conference: the Paul Erdős stage and the John Tukey stage. Tukey I got, but why put such an eccentric mathematician as Erdős front and center? I was puzzled until Sri Ambati, H2O’s CEO and co-founder, remarked that he admired Erdős for his great generosity with collaboration. To a greater extent than most similar events, H2O World itself felt like a collaboration. There was plenty of opportunity to interact with other attendees, speakers and H2O technical staff (the whole company must have been there). Data scientists, developers and marketing staff were accessible and gracious with their time. Well done!

R was center stage for a good bit of the hands-on training that occupied the first day of the conference. There were several sessions (Exploratory Data Analysis, Regression, Deep Learning, Clustering and Dimensionality Reduction) on accessing various H2O algorithms through the h2o R package and the H2O API. All of these moved quickly from R to running the custom H2O algorithms on the JVM. However, the message that came through is that R is the right environment for sophisticated machine learning.

Two great pleasures from the second day of the conference were Trevor Hastie’s tutorial on the Gradient Boosting Machine and John Chambers’s personal remembrances of John Tukey. It is unusual for a speaker to announce that he has been asked to condense a two-hour talk into something just under an hour, and then go on to speak slowly and with great clarity, each sentence beguiling you into imagining that you are really following the details. (It would be very nice if the video of this talk were made available.)

Two notable points from Trevor’s lecture were understanding gradient boosting as minimizing the exponential loss function, and the openness of the gbm algorithm to “tinkering”. For the former point, see Chapter 10 of The Elements of Statistical Learning or the more extended discussion in Schapire and Freund’s Boosting: Foundations and Algorithms.
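That openness to tinkering is easier to appreciate once you see how little machinery boosting actually needs. Here is a toy base-R version for squared-error loss (the exponential-loss framing is the AdaBoost connection; I use squared error because its negative gradient is simply the residual). Every piece is swappable: the loss, the base learner, the shrinkage.

```r
set.seed(2)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.2)

# base learner: a regression stump (best single split on x)
fit_stump <- function(x, y) {
  best <- NULL; best_sse <- Inf
  for (cut in quantile(x, seq(0.05, 0.95, by = 0.05))) {
    left  <- mean(y[x <= cut]); right <- mean(y[x > cut])
    sse   <- sum((y - ifelse(x <= cut, left, right))^2)
    if (sse < best_sse) {
      best_sse <- sse
      best <- list(cut = cut, left = left, right = right)
    }
  }
  best
}
predict_stump <- function(s, x) ifelse(x <= s$cut, s$left, s$right)

# gradient boosting for squared loss: fit stumps to residuals, shrink, repeat
boost <- function(x, y, M = 100, shrink = 0.1) {
  f <- rep(mean(y), length(y))
  for (m in seq_len(M)) {
    r <- y - f                 # negative gradient of squared loss = residual
    f <- f + shrink * predict_stump(fit_stump(x, r), x)
  }
  f
}
f_hat <- boost(x, y)
mean((y - f_hat)^2)            # training MSE, well below var(y)
```

This is a sketch for intuition only; the gbm package adds subsampling, other losses, and out-of-bag estimates on top of exactly this loop.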

John Tukey spent 40 years at Bell Labs (1945 – 1985), and John Chambers’s tenure there overlapped the last 20 years of Tukey’s stay. Chambers, who had the opportunity to observe Tukey over this extended period, painted a moving and lifelike portrait of the man. According to Chambers, Tukey could be patient and gracious with customers and staff, provocative with his statistician colleagues, and “intellectually intimidating”. John remembered Richard Hamming saying: “John (Tukey) was a genius. I was not.” Tukey apparently delighted in making up new terms when talking with fellow statisticians. For example, he called the top and bottom lines that identify the interquartile range on a box plot “hinges”, not quartiles. I found it particularly interesting that Tukey would describe a statistic in terms of the process used to compute it, and not in terms of any underlying theory. Very unusual, I would think, for someone who earned a PhD in topology under Solomon Lefschetz. For more memories of John Tukey, including more from John Chambers, look here.

Other R-related highlights were talks by Matt Dowle and Erin LeDell. Matt reprised the update on new features in data.table that he recently gave to the Bay Area useR Group, and also presented interesting applications of data.table from UK insurance company Landmark and from KatRisk (look here for the KatRisk part of Matt’s presentation).

Erin, author of the h2oEnsemble package available on GitHub, delivered an exciting and informative talk on using ensembles of learners (combining gbm models and logistic regression models, for example) to create “superlearners”.

Finally, I gave a short talk on Revolution Analytics’ recent work towards achieving reproducibility in R. The presentation motivates the need for reproducibility by examining the use of R in industry and science, and describes how the checkpoint package and Revolution R Open, an open-source distribution of R that points to a static repository, can be helpful.



The three types of Reddit posts, and how they make it to the front page

Todd Schneider’s blog post on solving the traveling salesman problem with R hit the front page of reddit. This is a big deal: front-page placement on the popular social news site can drive a ton of traffic (in Todd’s case, 1.3 million pageviews). But what factors determine which of reddit’s contributed links make it to the front page? (There are 25 front-page slots, but more than 100,000 reddit posts on an average day.)

Todd set out to answer this question using the statistical language R, and reported his results on Mashable. He collected 6 weeks of data including 1.2 million rankings for about 15,000 posts, and looked for commonalities amongst those posts that made the top 25.

Now, you might expect that a post’s front page ranking is determined by its score (the number of times it has been “liked” by a reddit user, most likely after having seen it in the “subreddit” special topic area where it was posted), and how long since it was posted (reddit’s front page generally contains recent posts). But it turns out that not all subreddits are treated equally. Todd discovered that there are three different types of subreddits when it comes to how posts are promoted to the front page:

  • “Viral Candy” subreddits like funny, gifs and todayilearned. Posts from this category dominate page one.
  • “Page Two” subreddits, which include Documentaries, Fitness and personalfinance. As the name suggests, posts in these subreddits almost never make it to page 1, but are often promoted to page 2.
  • “The Rest”, which includes food, LifeProTips, and sports. Todd’s post was in this category, in the subreddit dataisbeautiful. Posts in these subreddits make up a small but significant fraction of page 1 posts.

It seems that reddit’s front page (and pages 2, 3 and 4, which follow) follows a well-defined mix of posts from each of the three categories, as you can see in the chart below:


Starting from the left of the chart above, you can see the #1 post (on page 1) is from one of the “Viral Candy” subreddits about 97% of the time, but that a “The Rest” post does occasionally make top billing. By contrast, posts from the “Page Two” subreddits almost never appear above #10, but dominate page two (ranks 26-50). There’s a pretty consistent mix on pages 3 and 4: about 65% “viral candy”, about 15% “page twos” and about 25% “the rest”.

As for post scores, Todd noted that posts from “Viral Candy” and “The Rest” subreddits need high scores to get on page 1: about 3500-4500 and 3000-4000 respectively for the top slot. By contrast, posts in “Page Two” subreddits only need scores in the 500-1500 range to hit the lower ranks of page 1 (but are much more likely to appear on page 2).

If you’re interested in the details of what gets a post on reddit’s front page, Todd’s blog post has lots more information. And if you’re an R user and want to do a similar analysis, Todd’s data and R code are available on github.

Todd W Schneider: The reddit Front Page is Not a Meritocracy



Looking into a very messy data set

by Joseph Rickert

I recently had the opportunity to look at the data used for the 2009 KDD Cup competition. There are actually two sets of files still available from this competition. The “large” set is a series of five .csv files that, when concatenated, form a data set with 50,000 rows and 15,000 columns. The “small” set also contains 50,000 rows but only 230 columns. “Target” files are provided for both the large and small data sets; these contain three sets of labels, for “appetency”, “churn” and “upselling”, so that the data can be used to train models for three different classification problems.

The really nice feature of both the large and small data sets is that they are extravagantly ugly, containing large numbers of missing variables, factor variables with thousands of levels, factor variables with only one level, numeric variables with constant values, and correlated independent variables. To top it off, the targets are severely unbalanced containing very low proportions of positive examples for all three of the classification problems. These are perfect files for practice.

Often, the most difficult part of working with data like this is just knowing where to begin. Since getting a good look is usually a good place to start, let’s look at a couple of R tools that I found helpful for taking that first dive into messy data.
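Even before reaching for packages, base R will answer the first two questions: how much is missing, and which columns are degenerate? A minimal sketch on a toy frame (the column names just mimic the KDD file's VarN convention):

```r
# a tiny stand-in for the KDD file (names and values invented)
df <- data.frame(
  Var1 = c(1, NA, 3, NA),
  Var2 = factor(c("a", "a", NA, "b")),
  Var3 = c(7, 7, 7, 7)                  # a constant column
)
colMeans(is.na(df))                     # share of missing values per column
sapply(df, function(v) length(unique(na.omit(v))))  # a 1 flags a constant column
```

On the real 230-column file, sorting these two summaries immediately surfaces the all-missing and constant variables.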

The mi package takes a sophisticated approach to multiple imputation and provides some very advanced capabilities. However, it also contains simple and powerful tools for looking at data. The function missing.pattern.plot() lets you see the pattern of missing values. The following line of code provides a gestalt for the small data set.

missing.pattern.plot(small.df, main = "KDD 2009 Small Data Set")  # small.df holds the small data set (name illustrative)


Observations (rows) go from left to right and variables from bottom to top. Red indicates missing values. Looking at just the first 25 variables makes it easier to see what the plot is showing.


Another function in the mi package provides a tremendous amount of information about a data set. Here is the output for the first 10 variables. The first thing the function does is list the variables with no data and the variables that are highly correlated with each other. Thereafter, it lists a row for each variable that includes the number of missing values and the variable type. This is remarkably useful information that would otherwise take a bit of work to discover.

variable(s) Var8, Var15, Var20, Var31, Var32, Var39, Var42, Var48, Var52, Var55, Var79, Var141, Var167, Var169, Var175, Var185 has(have) no observed value, and will be omitted.
following variables are collinear
[1] "Var156" "Var66"  "Var9"  
[1] "Var104" "Var105"
[1] "Var111" "Var157" "Var202" "Var33"  "Var61"  "Var71"  "Var91" 
     names include order number.mis all.mis          type                    collinear
1     Var1     Yes     1      49298      No           nonnegative                  No
2     Var2     Yes     2      48759      No           binary                       No
3     Var3     Yes     3      48760      No           nonnegative                  No
4     Var4     Yes     4      48421      No           ordered-categorical          No
5     Var5     Yes     5      48513      No           nonnegative                  No
6     Var6     Yes     6       5529      No           nonnegative                  No
7     Var7     Yes     7       5539      No           nonnegative                  No
8     Var8      No    NA      50000     Yes           proportion                   No
9     Var9      No    NA      49298      No           nonnegative              Var156, Var66
10   Var10     Yes     8      48513      No           nonnegative                  No

For Revolution R Enterprise users, the function rxGetInfo() is a real workhorse. It applies to data frames as well as data stored in .xdf files, and for data in these files there is essentially no limit to how many observations can be analysed. rxGetInfo() is an example of an external-memory algorithm that reads only a chunk of data at a time from the file, so there is no need to try to stuff all of the data into memory.

The following is a portion of the output from running the function with the getVarInfo flag set to TRUE.

rxGetInfo(DF, getVarInfo=TRUE)

Data frame: DF 
Number of observations: 50000 
Number of variables: 230 
Variable information: 
Var 1: Var1, Type: numeric, Low/High: (0.0000, 680.0000)
Var 2: Var2, Type: numeric, Low/High: (0.0000, 5.0000)
Var 3: Var3, Type: numeric, Low/High: (0.0000, 130668.0000)
...
Var 187: Var187, Type: numeric, Low/High: (0.0000, 910.0000)
Var 188: Var188, Type: numeric, Low/High: (-6.4200, 628.6200)
Var 189: Var189, Type: numeric, Low/High: (6.0000, 642.0000)
Var 190: Var190, Type: numeric, Low/High: (0.0000, 230427.0000)
Var 191: Var191 2 factor levels: r__I
Var 192: Var192 362 factor levels: _hrvyxM6OP _v2gUHXZeb _v2rjIKQ76 _v2TmBftjz ... zKnrjIPxRp ZlOBLJED1x ZSNq9atbb6 ZSNq9aX0Db ZSNrjIX0Db
Var 193: Var193 51 factor levels: _7J0OGNN8s6gFzbM 2Knk1KF 2wnefc9ISdLjfQoAYBI 5QKIjwyXr4MCZTEp7uAkS8PtBLcn 8kO9LslBGNXoLvWEuN6tPuN59TdYxfL9Sm6oU ... X1rJx42ksaRn3qcM X2uI6IsGev yaM_UXtlxCFW5NHTcftwou7BmXcP9VITdHAto z3s4Ji522ZB1FauqOOqbkl zPhCMhkz9XiOF7LgT9VfJZ3yI
Var 194: Var194 4 factor levels: CTUH lvza SEuy
Var 195: Var195 23 factor levels: ArtjQZ8ftr3NB ArtjQZmIvr94p ArtjQZQO1r9fC b_3Q BNjsq81k1tWAYigY ... taul TnJpfvsJgF V10_0kx3ZF2we XMIgoIlPqx ZZBPiZh
Var 196: Var196 4 factor levels: 1K8T JA1C mKeq z3mO
Var 197: Var197 226 factor levels: _8YK _Clr _vzJ 0aHy ... ZEGa ZF5Q ZHNR ZNsX ZSv9
Var 198: Var198 4291 factor levels: _0Ong1z _0OwruN _0OX0q9 _3J0EW7 _3J6Cnn ... ZY74iqB ZY7dCxx ZY7YHP2 ZyTABeL zZbYk2K
Var 199: Var199 5074 factor levels: _03fc1AIgInD8 _03fc1AIgL6pC _03jtWMIkkSXy _03wXMo6nInD8 ... zyR5BuUrkb8I9Lth ZZ5

rxGetInfo() doesn’t provide all of the information that the mi function does, but it does do a particularly nice job on factor data, giving the number of levels and showing the first few. The two functions are complementary.

For a full listing of the output shown above, download the file: Download Mi_info_output.
