  • !Analyzing GitHub pull requests with Neural Embeddings, in R
    At the useR!2017 conference earlier this month, my colleague Ali Zaidi gave a presentation on using Neural Embeddings to analyze GitHub pull request comments (processed using the tidy text framework). The data analysis was done using R and distribute…
    - 9 hours ago 24 Jul 17, 9:52pm -
  • !Are computers needed to teach Data Science?

    - 10 hours ago 24 Jul 17, 8:48pm -
  • !Hacking Highcharter: observations per group in boxplots
    Highcharts has long been a favourite visualisation library of mine, and I’ve written before about Highcharter, my preferred way to use Highcharts in R. Highcharter has a nice simple function, hcboxplot(), to generate boxplots. I recently generated…
    - 18 hours ago 24 Jul 17, 12:46pm -
  • !Random Forests in R
    Ensemble Learning is a type of Supervised Learning Technique in which the basic idea is to generate multiple Models on a training dataset and then simply combine (average) their Output Rules or their Hypothesis \( H_x \) to generate a Strong Model w…
    - 20 hours ago 24 Jul 17, 10:55am -
  • !Stippling and TSP art in R: emulating StippleGen
    Stippling is the creation of a pattern simulating varying degrees of solidity or shading by using small dots (Wikipedia). StippleGen is a piece of software that renders images using stipple patterns, which I discovered on Xi’an’s blog a couple day…
    - 21 hours ago 24 Jul 17, 10:06am -

High Scalability


  • Python Cheat Sheet for Data Science
    It’s common when first learning Python for Data Science to have trouble remembering all the syntax that you need. While at Dataquest we advocate getting used to consulting the Python documentation, s…
    - 5 days ago 20 Jul 17, 8:00am -
  • Should I learn Python 2 or 3?
    One of the biggest sources of confusion and misinformation for people wanting to learn Python is which version they should learn. Should I learn Python 2.x or Python 3.x? Indeed, this is one of the questions we are…
    - 12 days ago 13 Jul 17, 8:00am -
  • Understanding SettingWithCopyWarning in pandas
    SettingWithCopyWarning is one of the most common hurdles people run into when learning pandas. A quick web search will reveal scores of Stack Overflow questions, GitHub issues and forum posts from programmers trying to wrap their heads around w…
    - 20 days ago 5 Jul 17, 3:00pm -
  • Web Scraping with Python and BeautifulSoup
    To source data for data science projects, you’ll often rely on SQL and NoSQL databases, APIs, or ready-made CSV data sets. The problem is that you can’t always find a data set on your topic, databases are not kept current and APIs are either expe…
    - 26 days ago 29 Jun 17, 3:00pm -
  • The tips and tricks I used to succeed on Kaggle
    I learned machine learning through competing in Kaggle competitions. I entered my first competitions in 2011, with almost no data science knowledge. I soon ended up in fifth place out of a hundred or so in a stock trading competition. Over the…
    - 33 days ago 22 Jun 17, 3:00pm -


  • !Analyzing GitHub pull requests with Neural Embeddings, in R
    At the useR!2017 conference earlier this month, my colleague Ali Zaidi gave a presentation on using Neural Embeddings to analyze GitHub pull request comments (processed using the tidy text framework). The data analysis was done using R and distribute…
    - 9 hours ago 24 Jul 17, 9:52pm -
  • Because it's Friday: How Bitcoin works
    Cryptocurrencies have been in the news quite a bit lately. Bitcoin prices have been soaring recently after the community narrowly avoided the need for a fork, while $32M in rival currency Ethereum was recently stolen, thanks to a coding error in wall…
    - 3 days ago 21 Jul 17, 10:10pm -
  • IEEE Spectrum 2017 Top Programming Languages
    IEEE Spectrum has published its fourth annual ranking of top programming languages, and the R language is again featured in the Top 10. This year R ranks at #6, down a spot from its 2016 ranking (and with an IEEE score, derived from search, soc…
    - 4 days ago 21 Jul 17, 6:41pm -
  • Data Analysis for Life Sciences
    Rafael Irizarry from the Harvard T.H. Chan School of Public Health has presented a number of courses on R and Biostatistics on edX, and he recently also provided an index of all of the course modules as YouTube videos with supplemental materials. The…
    - 5 days ago 20 Jul 17, 3:00pm -
  • Securely store API keys in R scripts with the "secret" package
    If you use an API key to access a secure service, or need to use a password to access a protected database, you'll need to provide these "secrets" in your R code somewhere. That's easy to do if you just include those keys as strings in your code —…
    - 6 days ago 19 Jul 17, 12:00pm -

Data Analytics and R

  • !Magister Dixit
    “Prediction is very difficult, especially about the future.” Niels Bohr
    - 3 hours ago 25 Jul 17, 4:05am -
  • !What’s new on arXiv
    RAIL: Risk-Averse Imitation Learning Imitation learning algorithms learn viable policies by imitating an expert’s behavior when reward signals are not …
    - 5 hours ago 25 Jul 17, 2:03am -
  • !Book Memo: “Supplier Selection”
    An MCDA-Based Approach The purpose of this book is to present a comprehensive review of the latest research and development …
    - 7 hours ago 25 Jul 17, 12:01am -
  • !Distilled News
    Decision Trees, Classification & Interpretation Using Scikit-Learn… Wouldn’t it be nice if defects and product failures can be predicted in …
    - 8 hours ago 24 Jul 17, 10:23pm -
  • !Document worth reading: “A Contemporary Overview of Probabilistic Latent Variable Models”
    In this paper we provide a conceptual overview of latent variable models within a probabilistic modeling framework, an overview that …
    - 10 hours ago 24 Jul 17, 8:21pm -

Google Cloud

The Hortonworks Blog

  • !Don’t Leave Your Customers out in the Cold
    A 30-minute pizza delivery window used to be cutting edge, but times have changed. Not only have our attention spans shortened, but as customers, we have higher expectations. We live in the age of right now; “good enough” is no more. Companies mu…
    - 7 hours ago 25 Jul 17, 12:05am -
  • Join the Big Data Revolution! (Apply Inside)
    Last Saturday, the Financial Times published an article titled, “BP looks to big data to help weather weak oil price” which explained why BP is expanding their data storage from one petabyte to six petabytes. It almost seems counter-intuitive, es…
    - 3 days ago 22 Jul 17, 1:10am -
  • What Does Hortonworks SmartSense Mean To You?
    Here at Hortonworks, we help big data projects be as successful as possible. Part of that success is based on Hortonworks SmartSense. SmartSense provides a collection of tools and services to proactively prevent common cluster problems. With SmartS…
    - 5 days ago 20 Jul 17, 5:10pm -
  • Doing Nothing About Cyber Security Will Cost You Everything
    With every swipe of a credit card, every social media interaction, and every opened email advertisement, businesses are continuously trying to gain a single view of customers. Along with this mountain of data comes a lot of risk. Cyber crime has reac…
    - 5 days ago 19 Jul 17, 11:51pm -
  • How Telecommunications Companies Answer the Call for Innovation
    Usually, customer demand prompts the need for innovation. If enough people need something, companies adapt to meet the requests of their consumers. Today, the development and evolution of mobility and telecommunications have outpaced our ability to c…
    - 7 days ago 18 Jul 17, 4:00pm -


  • !Bank check OCR with OpenCV and Python (Part I)
    Today’s blog post is inspired by Li Wei, a PyImageSearch reader who emailed me last week and asked: Hi Adrian, Thank you for the PyImageSearch blog. I read it each week and look forward to your new posts every Monday. I really enjoyed last week’s…
    - 17 hours ago 24 Jul 17, 2:00pm -
  • Credit card OCR with OpenCV and Python
    Today’s blog post is a continuation of our recent series on Optical Character Recognition (OCR) and computer vision. In a previous blog post, we learned how to install the Tesseract binary and use it for OCR. We then learned how to clean up images u…
    - 8 days ago 17 Jul 17, 2:00pm -
  • Using Tesseract OCR with Python
    In last week’s blog post we learned how to install the Tesseract binary for Optical Character Recognition (OCR). We then applied the Tesseract program to test and evaluate the performance of the OCR engine on a very small set of example images. As…
    - 15 days ago 10 Jul 17, 2:00pm -
  • Installing Tesseract for OCR
    Today’s blog post is part one in a two-part series on installing and using the Tesseract library for Optical Character Recognition (OCR). OCR is the automatic process of converting typed, handwritten, or printed text to machine-encoded text that we…
    - 22 days ago 3 Jul 17, 2:00pm -
  • Labeling superpixel colorfulness with OpenCV and Python
    After our previous post on computing image colorfulness was published, Stephan, a PyImageSearch reader, left a comment on the tutorial asking if there was a method to compute the colorfulness of specific regions of an image (rather than the entire im…
    - 29 days ago 26 Jun 17, 2:00pm -

Walking Randomly

  • HPC-centric Research Software Engineering role within RSE Sheffield
    A job opportunity within the RSE Sheffield group is available under the job title of “Research Software Engineer in High Performance Computing (HPC) enabled Multi-Scale Modelling”. This is an EU-funded position with a focus on supporting the biome…
    - 62 days ago 24 May 17, 6:43am -
  • Faster transpose matrix multiplication in R
    I’m working on optimising some R code written by a researcher at University of Sheffield and it’s very much a war of attrition! There’s no easily optimisable hotspot and there’s no obvious way to leverage parallelism. Progress is being made by s…
    - 63 days ago 23 May 17, 9:42am -
  • How powerful are Microsoft Azure’s free Jupyter notebooks?
    For a while now, Microsoft have provided a free Jupyter Notebook service on Microsoft Azure. At the moment they provide compute kernels for Python, R and F#, providing up to 4 GB of memory per session. Anyone with a Microsoft account can upload their o…
    - 71 days ago 15 May 17, 7:05am -
  • Research Software Engineering: State of the Nation 2017
    I am a co-investigator on an EPSRC-funded grant called the RSE-N (Research Software Engineering Network), the aim of which is to co-ordinate various Research Software Engineering activities nationally. One of the outputs of this work is a ‘State…
    - 10 Apr 17, 3:40pm -
  • High Performance Computing – There’s plenty of room at the bottom
    UK to launch 6 major HPC centres Tomorrow, I’ll be attending the launch event for the UK’s new HPC centres and have been asked to deliver a short talk as part of the program. As someone who paddles in the shallow end of the HPC pool I find this b…
    - 29 Mar 17, 8:13pm -