• useR!2017 Roundup
    Organising useR!2017 was a challenge but a very rewarding experience. With about 1200 attendees of over 55 nationalities exploring an interesting program, we believe it is appropriate to call it a success - something the aftermovie only seems to conf…
    - 8 hours ago 23 Aug 17, 8:00am -
  • Gender roles in film direction, analyzed with R
    What do women do in films? If you analyze the stage directions in film scripts — as Julia Silge, Russell Goldenberg and Amber Thomas have done for this visual essay for ThePudding — it seems that women (but not men) are written to snuggle, giggle…
    - 18 hours ago 22 Aug 17, 9:33pm -
  • Caching httr Requests? This means WAR[C]!
    I’ve blathered about my crawl_delay project before and am just waiting for a rainy weekend to be able to crank out a follow-up post on it. Working on that project involved sifting through thousands of Web Archive (WARC) files. While I have a nascen…
    - 22 hours ago 22 Aug 17, 5:53pm -
  • Some Neat New R Notations
    The R package seplyr supplies a few neat new coding notations. The first notation is an operator called the “named map builder”. This is a cute notation that essentially does the job of stats::se…
    - 1 day ago 22 Aug 17, 1:39pm -
  • So you (don’t) think you can review a package
    Contributing to an open-source community without contributing code is an oft-vaunted idea that can seem nebulous. Luckily, putting vague ideas into action is one of the strengths of the rOpenSci Community, and their package onboarding system offers a…
    - 1 day ago 22 Aug 17, 7:00am -

High Scalability


  • How to get your first job as a data scientist.
    Many aspiring data scientists focus on doing Kaggle competitions as a way to build their portfolios. Kaggle is an excellent way to practice, but it should only be one of many avenues you use to work on data science projects. This is because Kag…
    - 8 days ago 15 Aug 17, 8:00am -
  • Introducing our new Interface
    Our new mission design has arrived! Over the past few months we’ve been tirelessly talking to students like you to learn how we can improve the mission interface. Today we are unveiling the results of this hard work. Since a lot has changed,…
    - 12 days ago 11 Aug 17, 7:00am -
  • SQL Intermediate: PostgreSQL, Subqueries and more!
    If you’re in the early phases of learning SQL and have completed one or more introductory-level courses, you’ve probably learned most of the basic fundamentals and possibly even some high-level database concepts. As you prepare to embark on the n…
    - 13 days ago 10 Aug 17, 8:00am -
  • Using pandas with large data
    Tips for reducing memory usage by up to 90%. When using pandas with small data (under 100 megabytes), performance is rarely a problem. When we move to larger data (100 megabytes to multiple gigabytes), performance issues can make run times muc…
    - 18 days ago 5 Aug 17, 8:00am -
  • Python Cheat Sheet for Data Science
    It’s common when first learning Python for Data Science to have trouble remembering all the syntax that you need. While at Dataquest we advocate getting used to consulting the Python documentation, s…
    - 34 days ago 20 Jul 17, 8:00am -


  • Gender roles in film direction, analyzed with R
    What do women do in films? If you analyze the stage directions in film scripts — as Julia Silge, Russell Goldenberg and Amber Thomas have done for this visual essay for ThePudding — it seems that women (but not men) are written to snuggle, giggle…
    - 18 hours ago 22 Aug 17, 9:33pm -
  • Highlights of the Data Science Track at Microsoft Ignite
    I will be at the AI Summit in San Francisco next month, which means I can't make it to Ignite in Orlando this year. Which is a bit of a shame, because there's a fantastic Data Science track at Ignite. There are 25 sessions on offer, with presentation…
    - 2 days ago 21 Aug 17, 4:30pm -
  • Because it's Friday: Movie Trailer
    Via Gizmodo, this generic template for a AAA movie trailer recalls that generic brand video from a couple of years back. That's all for us for this week. Have a great weekend, we'll be back on Monday!
    - 5 days ago 18 Aug 17, 10:39pm -
  • Obstacles to performance in parallel programming
    Making your code run faster is often the primary goal when using parallel programming techniques in R, but sometimes the effort of converting your code to use a parallel framework leads only to disappointment, at least initially. Norman Matloff, auth…
    - 5 days ago 18 Aug 17, 7:18pm -
  • 20 years of the R Core Group
    The first "official" version of R, version 1.0.0, was released on February 29, 2000. But the R Project had already been underway for several years before then. Sharing this tweet, from yesterday, from R Core member Peter Dalgaard: It was twenty years…
    - 6 days ago 17 Aug 17, 9:03pm -

Data Analytics and R

  • R Packages worth a look
    Static ‘SAS’ Code Analysis (sasMap): A static code analysis tool for ‘SAS’ scripts. It is designed to load, count, extract, remove, …
    - 9 hours ago 23 Aug 17, 6:07am -
  • Document worth reading: “The Probability of Causation”
    Many legal cases require decisions about causality, responsibility or blame, and these may be based on statistical data. However, causal …
    - 11 hours ago 23 Aug 17, 4:05am -
  • Distilled News
    In the Time of Big Data and Machine Learning, It’s Important to Ask “Why?” In this special guest feature, Sundeep …
    - 13 hours ago 23 Aug 17, 2:03am -
  • Magister Dixit
    “The most important questions of life are, for the most part, really only problems of probability.” Pierre Simon, Marquis de …
    - 15 hours ago 23 Aug 17, 12:01am -
  • If you did not already know
    Paragraph Vector-based Matrix Factorization Recommender System (ParVecMF): Review-based recommender systems have gained noticeable ground in recent years. In addition to …
    - 17 hours ago 22 Aug 17, 10:23pm -

Google Cloud

The HortonWorks Blog

  • Data Has Sparked a Retail Revolution
    Late last year Amazon made waves when reports surfaced that they were testing out a new retail store location in downtown Seattle. Not only is the fact that Amazon is experimenting with retail locations newsworthy, but their plans will transform groc…
    - 2 days ago 21 Aug 17, 4:00pm -
  • Worldpay: Influencing Open Source for Enterprise Readiness via Hortonworks Support
    As we announced a few weeks ago, well over half of the Fortune 100 companies, and over one quarter of the Fortune Global 500 companies, leverage Hortonworks’ open source data platforms, HDP and HDF. Open source has won. What these top companies rec…
    - 5 days ago 18 Aug 17, 9:17pm -
  • What is a Data Science Workbench and Why Do Data Scientists Need One?
    Data science is inherently an exploratory and creative process because there is usually neither a definitive answer to the problem at hand nor a well-defined approach to reaching one. Data scientists research problems, explore data, visualize pattern…
    - 6 days ago 17 Aug 17, 4:09pm -
  • Model as Service: Modern Streaming Data Science with Apache Metron
    The Motivation: Many cybersecurity problems are also big data problems. What is more and more apparent, though, is that these problems are also problems solved by data science. The modern cybersecurity practitioner solves cybersecurity problems…
    - 6 days ago 17 Aug 17, 1:00pm -
  • How Do Insurance Companies Find Data in a Rain Drop?
    We may not be able to look into the future yet, but we are at a major crossroads of data and predictive analytics that allows us to mitigate risks to prevent disasters. Whether it be an accidental house fire or a car accident, insurance companies are…
    - 7 days ago 16 Aug 17, 6:33pm -


  • Deep Learning with OpenCV
    Two weeks ago OpenCV 3.3 was officially released, bringing with it a highly improved deep learning (dnn) module. This module now supports a number of deep learning frameworks, including Caffe, TensorFlow, and Torch/Py…
    - 2 days ago 21 Aug 17, 2:00pm -
  • Long exposure with OpenCV and Python
    One of my favorite photography techniques is long exposure, the process of creating a photo that shows the effect of passing time, something that traditional photography does not capture. When applying this technique, water becomes silky smooth, star…
    - 9 days ago 14 Aug 17, 2:00pm -
  • Announcing PyImageJobs: A computer vision and deep learning jobs board
    Today, I am pleased to announce that PyImageJobs has officially launched. Whether you are (1) looking to find a job in the computer vision, OpenCV, or deep learning space or (2) trying to fill a computer vision position for your company, organiz…
    - 16 days ago 7 Aug 17, 2:00pm -
  • Bank check OCR with OpenCV and Python (Part II)
    Today’s blog post is Part II in our two-part series on OCR’ing bank check account and routing numbers using OpenCV, Python, and computer vision techniques. Last week we learned how to extract MICR E-13B digits and symbols from input images. Today…
    - 23 days ago 31 Jul 17, 2:00pm -
  • Bank check OCR with OpenCV and Python (Part I)
    Today’s blog post is inspired by Li Wei, a PyImageSearch reader who emailed me last week and asked: Hi Adrian, Thank you for the PyImageSearch blog. I read it each week and look forward to your new posts every Monday. I really enjoyed last week’s…
    - 30 days ago 24 Jul 17, 2:00pm -

Walking Randomly

  • HPC-centric Research Software Engineering role within RSE Sheffield
    A job opportunity within the RSE Sheffield group is available under the job title of “Research Software Engineer in High Performance Computing (HPC) enabled Multi-Scale Modelling”. This is an EU-funded position with a focus on supporting the biome…
    - 91 days ago 24 May 17, 6:43am -
  • Faster transpose matrix multiplication in R
    I’m working on optimising some R code written by a researcher at the University of Sheffield and it’s very much a war of attrition! There’s no easily optimisable hotspot and there’s no obvious way to leverage parallelism. Progress is being made by s…
    - 92 days ago 23 May 17, 9:42am -
  • How powerful are Microsoft Azure’s free Jupyter notebooks?
    For a while now, Microsoft have provided a free Jupyter Notebook service on Microsoft Azure. At the moment they provide compute kernels for Python, R and F#, providing up to 4GB of memory per session. Anyone with a Microsoft account can upload their o…
    - 15 May 17, 7:05am -
  • Research Software Engineering: State of the Nation 2017
    I am a co-investigator on an EPSRC-funded grant called the RSE-N (Research Software Engineering Network), the aim of which is to co-ordinate various Research Software Engineering activities nationally.  One of the outputs of this work is a ‘State…
    - 10 Apr 17, 3:40pm -
  • High Performance Computing – There’s plenty of room at the bottom
    UK to launch 6 major HPC centres. Tomorrow, I’ll be attending the launch event for the UK’s new HPC centres and have been asked to deliver a short talk as part of the program. As someone who paddles in the shallow end of the HPC pool I find this b…
    - 29 Mar 17, 8:13pm -