I managed to find a few spare hours this weekend so I’m trying out Python for the first time. I usually use Matlab and R for data processing, visualisation and statistics, but I wanted to give Python a try, since some of my friends at Vokke seem to really love it.
It’s early days so I haven’t actually managed to produce anything useful with Python yet, but I thought I’d start to document the steps I’m taking to learn Python for data science, from the point of view of a Matlab and R user.
First off, I downloaded and installed Anaconda, which includes a distribution of Python plus all the popular Python packages you might need for data science.
Then I searched for an IDE that I like the feel of. Anaconda comes with a couple of IDEs, including one called “Spyder”, which I thought seemed very good. However, I ended up choosing the Rodeo IDE for starters. The reason I decided on Rodeo is that it is laid out very similarly to the RStudio and MATLAB IDEs, so I’m a little more comfortable with it to start with.
Third, I started searching for the “Python equivalents” of my favourite R packages for data science. I’m a major fan of most of Hadley Wickham’s R packages, including ggplot2, dplyr, tidyr, lubridate, readr and readxl. So far for Python I’ve found:
Pandas seems to be the popular package for manipulating data in Python, but another package that seems closer to dplyr in R is dplython, which maintains the functional programming ideas of dplyr, including my favourite feature from magrittr and dplyr: the pipe operator!
The Python plotting packages seaborn, bokeh and matplotlib all seem really nice. Matplotlib in particular seems very similar to the plotting system in MATLAB. But since I’ve recently become very comfortable using Hadley’s ggplot2 ‘grammar of graphics’ type plotting system, I think ggplot for Python will suit me perfectly for starters!
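For anyone else coming from R who wonders how a pipe operator can even exist in Python: one way is to overload `>>`. The snippet below is only a toy sketch of that idea on plain lists of dicts. It is not how dplython is actually implemented, and the names `Step`, `where`, `mutate` and `select` (and the example data) are made up here for illustration.

```python
class Step:
    """Wraps a function so it can be applied with `value >> step`,
    loosely mimicking `%>%` from magrittr/dplyr. Hypothetical helper."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, value):
        # Python calls this for `value >> step` when `value` itself
        # does not know how to handle `>>` (e.g. a plain list).
        return self.func(value)


def where(predicate):
    # Keep only the rows for which predicate(row) is true.
    return Step(lambda rows: [r for r in rows if predicate(r)])


def mutate(**new_cols):
    # Add columns; each keyword maps a new column name to a function of the row.
    return Step(
        lambda rows: [{**r, **{k: f(r) for k, f in new_cols.items()}} for r in rows]
    )


def select(*cols):
    # Keep only the named columns.
    return Step(lambda rows: [{c: r[c] for c in cols} for r in rows])


# Made-up example data: one dict per row.
rows = [
    {"name": "a", "rt_ms": 420},
    {"name": "b", "rt_ms": 950},
    {"name": "c", "rt_ms": 510},
]

result = (
    rows
    >> where(lambda r: r["rt_ms"] < 900)
    >> mutate(rt_s=lambda r: r["rt_ms"] / 1000)
    >> select("name", "rt_s")
)
print(result)
```

Each step reads left to right, much like a dplyr chain. Pandas itself usually gets a similar feel from method chaining rather than an operator.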
…annnd that’s all I’ve got time for today, BUT I plan to keep updating this post with more info, as I come across it, that I think could be useful for somebody learning python for data science who is coming from a background of R and Matlab….so stay tuned!!
I’ll be finishing my PhD over the next two months, exciting times!
Since I’ve got a thesis to write, I’ll try to keep this post short (or at least written in a short amount of time!).
I have to give another shout out to the Peer Reviewers’ Openness Initiative (PRO), which is one of several excellent new initiatives in support of open science, and which has already received over 200 signatories.
Basically, PRO outlines a mechanism whereby peer reviewers require access to data/analysis code/materials (or at least a reason from the authors why these things are not provided) before conducting a comprehensive review. This is designed to shift incentives and achieve the goal of creating the expectation of open science practices.
The advantages that will come with mass uptake of open science practices, particularly in relation to the PRO initiative, have recently been outlined in excellent blog posts by researchers who are more accomplished and qualified than me (e.g. see here, here, here, here, here and here).
So this post is not about rehashing their excellent points. Rather, I wish to add another perspective to this discussion, from the viewpoint of a very early career researcher.
Since I am currently considering post-PhD career paths (e.g. post-doc positions, industry positions), initiatives like PRO are important to me, because they give me a sense of hope that over time the incentives in academic science will change to encourage open science. I’ve noticed over the last few years that the scientific publication process (at least in psychology, cognitive science, and neuroscience where I’ve been interested) is very slowly moving more in line with the ideals of open, transparent and reproducible research. I’m excited to get on board with this open science movement as much as possible early in my career - I’ve just submitted a final research paper to contribute towards my PhD, and I’ve chosen to submit it to a fully open access journal and make all of the related raw data, analysis scripts and paradigm code open source, so my results are reproducible.
Hurrah!!! (I’m yet to submit a pre-registered report, but that’s next on my list of publication goals).
So personally I am enthusiastic about open science. And I wonder if this attitude is shared amongst my peers?
Are early career researchers enthusiastic about open science?
I would love to see some valid data addressing this question.
Anecdotal evidence from my conversations with friends/colleagues who are at similar career stages suggests that many early career researchers agree that open science is the way forward.
When I’ve chatted about getting on board with the open science movement (e.g. by signing PRO, sharing data/analysis scripts, pre-registering studies etc.), my colleagues have unanimously agreed that it is a good idea and the best way forward for science, for these reasons.
So what’s stopping early career researchers from practising open science?
There may be many perceived barriers to implementing open science practices, which are worth addressing (see links in first paragraph).
But here I just wanted to comment on one of these barriers, which seems to be at the forefront of early career researchers’ minds. People have pointed out to me that the pressures and incentives currently set up in academic science, particularly for early career researchers, do not always encourage or reward the extra time taken to learn and implement open science practices. One main concern that I’ve heard is a potential loss in the number of papers you can produce given that extra time, since criteria for awards, post-doc positions, promotions, etc. are often heavily weighted on the number of publications, and not necessarily on how ‘open’ the science is.
Is this a valid concern? I cannot comment on changes (or lack of changes) in the weighting of open science practices as a criterion for awards, post-doc positions, promotions, etc. across research institutes, though Felix Schonbrodt has an excellent piece about changing hiring practices towards research transparency here. I can, however, comment on the perceived potential loss in the number of papers you can produce given the extra time taken to learn/implement open science practices:
It definitely doesn’t take much extra time! HURRAY! It can feel like a burden at first, but there are many online tools to help with open science (e.g. the OSF, or see this excellent post, Full Stack Science: a guide to open science infrastructure, about using GitHub, Docker Hub, FigShare, Travis CI and Zenodo), and once you get started it’s faster, easier and more enjoyable than you may imagine.
It took me 4 weeks part-time (20 hours per week, so a total of 80 hours) back at the end of 2014/start of 2015 to learn how to use R Markdown and GitHub properly to share analysis and paradigm code. I took my time to learn it thoroughly, and this was coming from a point of complete ignorance about R Markdown and GitHub. There are many free online options to help learn such skills; for example, I learned for free via these two courses: The Data Scientist’s Toolbox and Reproducible Research.
And it took me about half an hour to work out that my university has an account with figshare, upload my 24GB of raw data, and use figshare’s neat “Generate private link” function so reviewers can access my data. The link can then be manually switched to public when you need it to be (i.e. once your paper is accepted for publication).
So a total of ~80 hours of my time to gain the skills to make my analysis reproducible and share raw data.
And now, because I have these skills and enjoy implementing them as a usual part of my analysis pipeline, it will take me no extra time to make my data and analysis code open in the future. In fact, these 80 hours of work actually save me time in the long run because of the advantages of reproducible code: it is easier to check for errors, and if you comment it well you spend less time figuring out what you did later. So you get back all of those 80 hours pretty quickly. In the long run, the initial time investment pays recurring dividends.
So regarding applying for awards, post-doc positions, etc. I’m banking on the hope that the reputational gain from doing fully open science from now on, not to mention the time it saves me in the long run,
will be worth more than a small one-off expenditure of ~80 hours that it took me to learn the necessary skills.
Furthermore, if I ever need to leave academia and pursue work in industry, then having programming, or at least scripting, skills for data analysis, along with git/GitHub for version control and reproducible research, makes me more valuable for data science/analytics jobs than only knowing how to analyse data with the old point-and-click style methods in software like SPSS.
So from my viewpoint, as a very early career researcher, it definitely appears that open science is the way forward.
I’m excited about a new initiative to promote data and analysis/paradigm code sharing, called the Peer Reviewers’ Openness Initiative (PRO): https://opennessinitiative.org//. Openness and transparency are core values of science. PRO outlines practical steps to improve open science, and I would like to see improved openness of code and data, particularly in my areas of the behavioural sciences and cognitive neuroscience.
Technology (the internet) has advanced to the point where open access data and code for scientific publications are possible, but the uptake of open science practices has been slow for a number of reasons. For one thing, better incentives are needed for the transition to open science practices. But I think another key thing holding our area of science back is that most researchers in our area don’t even know what things like GitHub, Mozilla Science Lab and FigShare are (I didn’t until very recently). Also, at the outset it seems like a bit of a hassle to learn how to use these tools, even though I’m now finding it is not too bad. I’m a beginner with these kinds of open science methods, but there are some really good free short courses from Johns Hopkins University to help learn how to use some of the tools for open science. I’ve provided links to 3 of these below:
(2) https://www.coursera.org/course/rprog: a short course on using **R** to script your whole analysis for a publication from start to finish, to show others exactly what was done from raw data to results (any stats software that allows scripting will do, but R is better than SPSS, for example, since R can be downloaded for free). I’ve just started making the switch from SPSS to R for my inferential statistics, since R makes sharing analysis code easier. I will still likely do my signal processing in MATLAB though.
(3) https://www.coursera.org/course/repdata: a short course on tools for **Reproducible Research**, e.g. combining GitHub, RPubs and FigShare to make both the data and code easily available and citable with a DOI.
I’m going through these courses in my spare time at the moment, and hope to make my next scientific publication fully open access, in line with the ideal of reproducible research, so that other scientists can verify and build upon my findings.
This is my first blog post, and it is actually going to be a shout out to somebody else’s blog! A couple of weeks ago I went over to Deakin University and gave a short presentation to the Cognitive Neuroscience Unit (CNU) at Deakin. I presented some of the work we have been doing in our lab at Monash. The CNU blogged about my talk here:
It was great to meet the members of the newly formed Deakin CNU. They have attracted an excellent team of researchers there, led by Peter Enticott. I was impressed by the enthusiasm for cognitive neuroscience shown by the members of the CNU. After the talk I was given a tour of the labs at the Deakin CNU, and I was very impressed with their facilities for transcranial magnetic stimulation (TMS). I believe the CNU was only established this year (2014), but it has already been productive and I’m expecting to see more exciting research from them in the years to come!