  • Astrostatistics school
    What a wonderful week at the Astrostat [Indian] summer school in Autrans! The setting was superb, on the high Vercors plateau overlooking both Grenoble [north] and Valence [west], with the colours of the Fall at their brightest on the foliage of the…
    - 13 hours ago 17 Oct 17, 7:17am -
  • Data from Public Bicycle Hire Systems
    A new rOpenSci package provides access to data to which users may already have directly contributed, and for which contribution is fun, keeps you fit, and helps make the world a better place. The data come from using public bicycle hire schemes, and…
    - 13 hours ago 17 Oct 17, 7:00am -
  • How we built a Shiny App for 700 users?
    One of our senior data scientists, Olga Mierzwa-Sulima, spoke at the useR! conference in Brussels to a packed house. The seats were full and there were audience members spilling out the doors. Source: https://twitter.com/matlabulous/status/8825304…
    - 20 hours ago 17 Oct 17, 12:00am -
  • Data acquisition in R (1/4)
    R is an incredible tool for reproducible research. In the present series of blog posts I want to show how one can easily acquire data within an R session, documenting every step in a fully reproducible way. There are numerous data acquisition options…
    - 20 hours ago 17 Oct 17, 12:00am -
  • colourpicker package v1.0: You can now select semi-transparent colours in R (& more!)
    For those who aren’t familiar with the colourpicker package, it provides a colour picker for R that can be used in Shiny, as well as other related tools. Today it’s leaving behind its 0.x days and moving on to version 1.0! colourpicker h...
    - 1 day ago 16 Oct 17, 5:00pm -

High Scalability


  • SQL Fundamentals
    The pandas workflow is a common favorite among data analysts and data scientists. The workflow looks something like this: The pandas workflow works well when: the data fits in memory (a few gigabytes but not terabytes); the data is relatively static…
    - 12 days ago 6 Oct 17, 7:00am -
  • Loading Data into Postgres using Python and CSVs
    An introduction to Postgres with Python Data storage is one of (if not the) most integral parts of a data system. You will find hundreds of articles online detailing how to write insane SQL analysis queries, how to run complex machine learning algori…
    - 18 days ago 30 Sep 17, 7:00am -
  • Explore Happiness Data Using Python Pivot Tables
    One of the biggest challenges when facing a new data set is knowing where to start and what to focus on. Being able to quickly summarize hundreds of rows and columns can save you a lot of time and frustration. A simple tool you can use to achie…
    - 22 days ago 25 Sep 17, 3:00pm -
  • How to Generate FiveThirtyEight Graphs in Python
    If you read data science articles, you may have already stumbled upon FiveThirtyEight’s content. Naturally, you were impressed by their awesome visualizations. You wanted to make your own awesome visualizations and so asked Quora and Reddit how to…
    - 41 days ago 7 Sep 17, 7:00am -
  • Machine Learning Fundamentals: Predicting Airbnb Prices
    Machine learning is easily one of the biggest buzzwords in tech right now. Over the past three years Google searches for “machine learning” have increased by over 350%. But understanding machine learning can be difficult — you either use pre-bu…
    - 47 days ago 31 Aug 17, 10:00am -


  • My interview with ROpenSci
    The ROpenSci team has started publishing a new series of interviews with the goal of “demystifying the creative and development processes of R community members”. I had the great pleasure of being interviewed by Kelly O'Briant earlier this year,…
    - 1 day ago 16 Oct 17, 3:21pm -
  • Because it's Friday: Line Rider
    Line Rider is a simple web-based game: draw a line (or a series of lines), and watch an animated sledder ride along it like it was a snow slope. It's remained much the same since it was created in 2006 by Boštjan Čadež as a student (although I not…
    - 4 days ago 13 Oct 17, 9:43pm -
  • An AI pitches startup ideas
    Take a look at this list of 13 hot startups, from a list compiled by Alex Bresler. Perhaps one of them is the next Juicero? FAR ATHERA: A CLINICAL AI PLATFORM THAT CAN BE ACCESSED ON DEMAND. ZAPSY: TRY-AT-HOME SERVICE FOR CONSUMER ELECTRONICS. MADESS…
    - 4 days ago 13 Oct 17, 8:47pm -
  • A cRyptic crossword with an R twist
    Last week's R-themed crossword from R-Ladies DC was popular, so here's another R-related crossword, this time by Barry Rowlingson and published on page 39 of the June 2003 issue of R-news (now known as the R Journal). Unlike the last crossword, this…
    - 5 days ago 12 Oct 17, 7:30pm -
  • Tutorial: Azure Data Lake analytics with R
    The Azure Data Lake store is an Apache Hadoop file system compatible with HDFS, hosted and managed in the Azure Cloud. You can store and access the data within directly via the API, by connecting the filesystem directly to Azure HDInsight services, o…
    - 6 days ago 11 Oct 17, 9:26pm -

Data Analytics and R

  • Distilled News
    Data Lake Business Model Maturity Index “Our organization is abuzz with the concept of data lakes!” a customer recently told …Continue reading →
    - 2 hours ago 17 Oct 17, 6:19pm -
  • If you did not already know
    Log-Linear Model A log-linear model is a mathematical model that takes the form of a function whose logarithm is a …Continue reading →
    - 14 hours ago 17 Oct 17, 6:07am -
  • Magister Dixit
    “A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing …Continue reading →
    - 16 hours ago 17 Oct 17, 4:05am -
  • Distilled News
    How Data-Driven Businesses Can Benefit from Machine Learning Advancements in machine learning and artificial intelligence (AI) open new doors for …Continue reading →
    - 18 hours ago 17 Oct 17, 2:03am -
  • Document worth reading: “Time Series Management Systems: A Survey”
    The collection of time series data increases as more monitoring and automation are being deployed. These deployments range in scale …Continue reading →
    - 20 hours ago 17 Oct 17, 12:01am -

Google Cloud

  • Control and granularity with Spark and Hadoop on Cloud Dataproc
    Posted by James Malone, Product Manager. It’s always great when customer feedback leads to great new feature ideas, which is exactly the case in the latest release of Google Cloud Dataproc. Cloud Dataproc’s primary goal is to simplify, speed up…
    - 5 days ago 13 Oct 17, 12:00am -
  • My summer project: a rock-paper-scissors machine built on TensorFlow
    Posted by Kaz Sato, Developer Advocate, Google Cloud. After looking for a fun project to do with my son this past summer, I decided to build a rock-paper-scissors machine powered by TensorFlow. Rock-paper-scissors machine, powered by TensorFlow…
    - 6 days ago 12 Oct 17, 12:00am -
  • Separation of compute and state in Google BigQuery and Cloud Dataflow (and why it matters)
    Posted by Tino Tereshko, Big Data Lead, Google Cloud Platform Office of CTO. (Thanks to Rodd Zurcher, Engineering Director, Motorola Mobility, and Matthew Baird, Cofounder and CTO, AtScale, Alexey Maloletkin, Advisory Software Engineer, Motorola Mobi…
    - 8 days ago 10 Oct 17, 12:00am -
  • Intro to text classification with Keras: automatically tagging Stack Overflow posts
    Posted by Sara Robinson (Developer Advocate), Josh Gordon (Developer Advocate), and Marianne Linhares Monteiro (DA Intern). As humans, our brains can easily read a piece of text and extract the topic, tone, and sentiment. Up until just a few years…
    - 12 days ago 6 Oct 17, 12:00am -
  • Genomic ancestry inference with deep learning
    Posted by Puneith Kaul, Developer Programs Engineer, Cloud Machine Learning; Nicole Deflaux, Software Engineer, Verily; Allen Day, Developer Advocate, Google Cloud Health AI; Elmer Garduno, Software Engineer, Cloud AI. For the past several years, our…
    - 21 days ago 27 Sep 17, 12:00am -

The HortonWorks Blog

  • Leveraging Data to Make Decisions in Financial Services
    The financial services industry encounters a variety of unique challenges with data. These companies not only need the ability to process huge amounts of data from traditional and non-traditional sources, but they also need to ensure the data is secu…
    - 1 day ago 16 Oct 17, 4:00pm -
  • APM with Unravel and Hortonworks to Ensure Mission Critical, Fast and Error Free Performance
    This blog post is from one of our newest partners, Unravel Data, an MDS ISV/IHV Partner in the Partnerworks Program. Our guest blogger is Oliver Claude, CMO. Application Performance Management (APM) As more and more modern data applications get deplo…
    - 1 day ago 16 Oct 17, 1:00pm -
  • Why The Big Data Landscape Is All Shades of Grey
    At this point, the number of blogs and articles about Big Data probably surpasses the amount of data collected by a typical organization. For every company trying to solve the “data problem”, the issue isn’t just sheer size. That is just the st…
    - 4 days ago 13 Oct 17, 4:00pm -
  • Big SQL: SQL on Apache Hadoop Across the Enterprise
    Guest author Nagapriya Tiruthani, Offering Manager, IBM Big SQL, IBM Why Big SQL? Enterprise Data Warehousing (EDW) emerged as a logical home for all enterprise data that captures the essence of all enterprise systems. But in recent years, there’s…
    - 5 days ago 12 Oct 17, 3:05pm -
  • Data Virtualization: Enabling Faster Adoption of Big Data
    Recent versions of Hortonworks Data Platform (HDP) introduced several innovations in the areas of security, data governance, business cataloging, query optimization, visualization, and backup-and-restore. To help us keep pace with the rapid adoption…
    - 6 days ago 11 Oct 17, 10:00pm -


  • Raspberry Pi: Deep learning object detection with OpenCV
    A few weeks ago I demonstrated how to perform real-time object detection using deep learning and OpenCV on a standard laptop/desktop. After the post was published I received a number of emails from PyImageSearch readers who were curious if the Raspbe…
    - 1 day ago 16 Oct 17, 2:00pm -
  • Optimizing OpenCV on the Raspberry Pi
    This tutorial is meant for advanced Raspberry Pi users who are looking to milk every last bit of performance out of their Pi for computer vision and image processing using OpenCV. I’ll be assuming: You have worked through my previous Raspberry Pi +…
    - 8 days ago 9 Oct 17, 2:00pm -
  • Deep learning on the Raspberry Pi with OpenCV
    I’ve received a number of emails from PyImageSearch readers who are interested in performing deep learning in their Raspberry Pi. Most of the questions go something like this: Hey Adrian, thanks for all the tutorials on deep learning. You’ve real…
    - 15 days ago 2 Oct 17, 2:00pm -
  • macOS for deep learning with Python, TensorFlow, and Keras
    In today’s tutorial, I’ll demonstrate how you can configure your macOS system for deep learning using Python, TensorFlow, and Keras. This tutorial is the final part of a series on configuring your development environment for deep learning. I crea…
    - 18 days ago 29 Sep 17, 2:00pm -
  • Setting up Ubuntu 16.04 + CUDA + GPU for deep learning with Python
    Welcome back! This is the fourth post in the deep learning development environment configuration series which accompany my new book, Deep Learning for Computer Vision with Python. Today, we will configure Ubuntu + NVIDIA GPU + CUDA with everything yo…
    - 20 days ago 27 Sep 17, 2:00pm -

Walking Randomly

  • HPC-centric Research Software Engineering role within RSE Sheffield
    A job opportunity within the RSE Sheffield group is available under the job title of “Research Software Engineer in High Performance Computing (HPC) enabled Multi-Scale Modelling”. This is an EU-funded position with a focus on supporting the biome…
    - 24 May 17, 6:43am -
  • Faster transpose matrix multiplication in R
    I’m working on optimising some R code written by a researcher at the University of Sheffield and it’s very much a war of attrition! There’s no easily optimisable hotspot and there’s no obvious way to leverage parallelism. Progress is being made by s…
    - 23 May 17, 9:42am -
  • How powerful are Microsoft Azure’s free Jupyter notebooks?
    For a while now, Microsoft have provided a free Jupyter Notebook service on Microsoft Azure. At the moment they provide compute kernels for Python, R and F#, providing up to 4 GB of memory per session. Anyone with a Microsoft account can upload their o…
    - 15 May 17, 7:05am -
  • Research Software Engineering: State of the Nation 2017
    I am a co-investigator on an EPSRC-funded grant called the RSE-N (Research Software Engineering Network), the aim of which is to co-ordinate various Research Software Engineering activities nationally. One of the outputs of this work is a ‘State…
    - 10 Apr 17, 3:40pm -
  • High Performance Computing – There’s plenty of room at the bottom
    UK to launch 6 major HPC centres Tomorrow, I’ll be attending the launch event for the UK’s new HPC centres and have been asked to deliver a short talk as part of the program. As someone who paddles in the shallow-end of the HPC pool I find this b…
    - 29 Mar 17, 8:13pm -