RDocumentation: Scoring and Ranking

One of the core features of RDocumentation.org is its search functionality. From the start, we wanted to have a super simple search bar that finds what you are looking for, without a complex form asking for a package name, function name, versions or anything else. Just a simple search bar.

In today’s technical blog post, we will highlight the technologies and techniques that we use to provide relevant and meaningful results to our users.

Elasticsearch

RDocumentation.org uses Elasticsearch to index and search through all R packages and topics.

Elasticsearch is an open-source, scalable, distributed, enterprise-grade search engine.

Elasticsearch is perfect for querying documentation because it doesn’t use a conventional SQL data model; instead, it stores documents in a JSON-like data structure. Each document is just a set of key-value pairs with simple data types (strings, numbers, lists, dates, …). Its distributed nature also means Elasticsearch can be extremely fast.

An Elasticsearch cluster can have multiple indexes and each index can have multiple document types. A document type just describes what the structure of the document should look like. To learn more about Elasticsearch types, you can visit the guide on elastic.co.

RDocumentation.org uses three different types: package_version, topic, and package. The first two are the main ones; we’ll set package aside here.

Because RDocumentation.org is open-source, you can see the Elasticsearch mappings in our GitHub repo.

package_version type

The package_version type is essentially a translation of a package’s DESCRIPTION file; it features the main fields that one can find there: package_name, version, title, description, release_date, license, url, copyright, created_at, updated_at, latest_version, maintainer, and collaborators. The maintainer and collaborators are extracted from the Authors field of the DESCRIPTION file.

topic type

The topic documents are parsed from the Rd files, the standard format of documentation in R. The topic type has the following keys: name, title, description, usage, details, value, references, note, author, seealso, examples, created_at, updated_at, sections, aliases and keywords.
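To make this more concrete, here is a minimal, invented sketch of what a topic document might look like once indexed. The field values below are made up for illustration; the real mappings live in the GitHub repo.

# A hypothetical topic document (values invented for illustration)
topic_doc = {
    "name": "glm",
    "title": "Fitting Generalized Linear Models",
    "description": "glm is used to fit generalized linear models.",
    "usage": "glm(formula, family = gaussian, data, ...)",
    "aliases": ["glm", "glm.fit"],
    "keywords": ["models", "regression"],
    "created_at": "2017-05-01T00:00:00Z",
    "updated_at": "2017-05-01T00:00:00Z"
}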

Scoring in Elasticsearch

Before doing any scoring, Elasticsearch first tries to reduce the set of candidates by checking whether a document actually matches the query. Basically, a query is a word (or a set of words). Based on the query settings, Elasticsearch searches for a match in certain fields of certain types.

However, a match does not necessarily mean that the document is relevant; the same word can have different meanings in different contexts. Based on the query settings, we can filter by type and field and include more contextual information. This contextual information improves relevancy, and this is where scoring comes into play.

Elasticsearch uses Lucene under the hood, so the scoring is based on Lucene’s Practical Scoring Function, which combines models such as TF-IDF, the Vector Space Model, and the Boolean Model to score documents.

If you want to learn more about how that function is used in Elasticsearch, you can check out this section of the elastic.co guide.

One way to improve relevancy is to apply a boost to some fields. For example, in the RDocumentation.org full search, we naturally boost fields like package_name and title for packages and aliases and name for topics.
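As a rough sketch of what such boosting can look like in the Elasticsearch query DSL, a multi_match query lets you weight individual fields with the ^ operator. The field names and boost factors below are illustrative assumptions, not the production query, which lives in the SearchController.

# Illustrative query body with per-field boosts (field names and weights are assumptions)
query = {
    "query": {
        "multi_match": {
            "query": "linear regression",
            "fields": ["aliases^4", "name^3", "title^2", "description"]
        }
    }
}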

Another effective way to improve relevancy is to boost documents based on their popularity. The idea is that if a package is more popular, the user is more likely to be searching for it. Showing the more popular packages first increases the probability that we show what the user is actually looking for.

Using downloads as a popularity measure

There are multiple ways to measure popularity. We could use direct measures like votes or rankings that users give (like ratings on Amazon products), or indirect measures like the number of items sold or the number of views (for YouTube videos).

At RDocumentation.org, we chose the latter. More specifically, we use the number of downloads as a measure of popularity. Indirect measures are typically easier to collect because they don’t require active user input.

Timeframing

One problem that arises when using the number of downloads is that old packages naturally have more total downloads than newer packages. That does not mean they are more popular, however; they have just been around longer. What if a package was very popular years ago, but has now become obsolete and is no longer actively used by the community?

To solve this problem, we only take into account the number of downloads in the last month. That way, older packages’ popularity score is not artificially boosted, and obsolete packages will quickly fade out.

Direct vs Indirect downloads

Another problem arises from reverse dependencies. R packages typically depend on a wide range of other packages. Packages with a lot of reverse dependencies get downloaded far more than others. However, these packages are typically more low-level and are not used directly by the end user, so we have to be careful not to give their download counts too much weight.

As an example, take Rcpp. Over 70 percent of all packages on CRAN, the comprehensive R archive network, depend on this package, which obviously makes it the most downloaded R package. However, rather few R users will directly use this package and search for its documentation.

To solve this problem, we needed to separate direct downloads (downloads that happen because a user requested them) from indirect downloads (downloads that happen because a dependent package was downloaded). To distinguish direct from indirect downloads in the CRAN logs, we use the same heuristic described in the cran.stats package by Arun Srinivasan.

We now have a meaningful popularity metric: the number of direct downloads in the last month. Elasticsearch provides an easy way to inject this additional information; for more details, check out this article on elastic.co.

The score is modified as follows:

new_score = old_score * log(1 + number of direct downloads in the last month)

We use a log() function to smooth out the number of downloads, because each additional download should carry less weight; the difference between 0 and 1,000 downloads should have a bigger impact on a popularity score than the difference between 100,000 and 101,000 downloads.
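One way to express this in Elasticsearch’s query DSL is a function_score query with a field_value_factor function and the log1p modifier, which multiplies the text-relevance score by log(1 + value). The sketch below is a minimal illustration, not the production query; the index and field names (rdocumentation, last_month_direct_downloads) are assumptions.

# Minimal sketch of a popularity-boosted search (index and field names are assumptions)
from elasticsearch import Elasticsearch

es = Elasticsearch()
query = {
    "query": {
        "function_score": {
            "query": {
                "multi_match": {
                    "query": "linear regression",
                    "fields": ["aliases^4", "name^3", "title^2", "description"]
                }
            },
            "field_value_factor": {
                "field": "last_month_direct_downloads",
                "modifier": "log1p",        # log(1 + value), matching the formula above
                "missing": 0
            },
            "boost_mode": "multiply"        # new_score = old_score * log(1 + downloads)
        }
    }
}
result = es.search(index="rdocumentation", body=query)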

This re-scoring improves the overall relevancy of the search results presented by RDocumentation.org and as a result, users can focus on reading documentation instead of searching for it.

If you want to find out more about how exactly the Elasticsearch query is implemented, you can take a look at the RDocumentation project on GitHub. The query itself is located in the SearchController.

If you want to learn more about how RDocumentation.org is implemented, check out our repositories on GitHub.


About RDocumentation  

RDocumentation aggregates help documentation for R packages from CRAN, BioConductor, and GitHub – the three most common sources of current R documentation. RDocumentation.org goes beyond simply aggregating this information, however, by bringing all of this documentation to your fingertips via the RDocumentation package. The RDocumentation package overwrites the basic help functions from the utils package and gives you access to RDocumentation.org from the comfort of your RStudio IDE. Look up the newest and most popular R packages, search through documentation and post community examples. 

Create an RDocumentation account today!

from DataCamp Blog http://www.datacamp.com/community/blog/rdocumentation-ranking-scoring

New Course: Merging Dataframes with pandas

Hi there – today we’re launching a new course on Merging Dataframes with pandas by Dhavide Aruliah.

As a Data Scientist, you’ll often find that the data you need is not in a single file. It may be spread across a number of text files, spreadsheets, or databases. You want to be able to import the data of interest as a collection of DataFrames and figure out how to combine them to answer your central questions. This course is all about the act of combining, or merging, DataFrames, an essential part of any working Data Scientist’s toolbox. You’ll hone your pandas skills by learning how to organize, reshape, and aggregate multiple data sets to answer your specific questions.

Start for free

Merging Dataframes with pandas features interactive exercises that combine high-quality video, in-browser coding, and gamification for an engaging learning experience that will make you a master at Data Science with Python!

What you’ll learn:

In chapter 1, you’ll learn about different techniques you can use to import multiple files into DataFrames. Having imported your data into individual DataFrames, you’ll then learn how to share information between DataFrames using their Indexes. Understanding how Indexes work is essential information that you’ll need for merging DataFrames later in the course. Start first chapter for free!

Having learned how to import multiple DataFrames and share information using Indexes, in chapter 2 you’ll learn how to perform database-style operations to combine DataFrames. In particular, you’ll learn about appending and concatenating DataFrames while working with a variety of real-world datasets.
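As a small taste of what that looks like in pandas (a minimal sketch with made-up data):

import pandas as pd

# Two small DataFrames with the same columns (invented data)
jan = pd.DataFrame({"city": ["Austin", "Dallas"], "sales": [100, 150]})
feb = pd.DataFrame({"city": ["Austin", "Dallas"], "sales": [120, 130]})

# Append one DataFrame to another, or concatenate a list of them
both = jan.append(feb, ignore_index=True)         # append (deprecated in newer pandas)
both = pd.concat([jan, feb], ignore_index=True)   # same result with concat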

Here in chapter 3, you’ll learn all about merging pandas DataFrames. You’ll explore different techniques for merging, and learn about left joins, right joins, inner joins, and outer joins, as well as when to use which. You’ll also learn about ordered merging, which is useful when you want to merge DataFrames whose columns have natural orderings, like date-time columns.
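For example, here is a minimal sketch of the different join types with pd.merge, plus an ordered merge on a date column (all data invented):

import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [4, 5, 6]})

# Inner join keeps only keys present in both; outer join keeps all keys
inner = pd.merge(left, right, on="key", how="inner")
outer = pd.merge(left, right, on="key", how="outer")

# Left and right joins keep all keys from one side
left_join = pd.merge(left, right, on="key", how="left")
right_join = pd.merge(left, right, on="key", how="right")

# Ordered merge, useful for columns with a natural ordering such as dates
stocks = pd.DataFrame({"date": pd.to_datetime(["2017-01-02", "2017-01-04"]), "price": [10, 12]})
rates = pd.DataFrame({"date": pd.to_datetime(["2017-01-03", "2017-01-04"]), "rate": [1.1, 1.2]})
ordered = pd.merge_ordered(stocks, rates, on="date", fill_method="ffill")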

The last chapter will focus on applying your skills with a case study on summer Olympics medal data.

About Dhavide: Dhavide is Director of Training at Continuum Analytics, the creator and driving force behind Anaconda—the leading Open Data Science platform powered by Python. Dhavide was previously an Associate Professor at the University of Ontario Institute of Technology (UOIT). He served as Program Director for various undergraduate & postgraduate programs at UOIT. His research interests include computational inverse problems, numerical linear algebra, & high-performance computing. The materials for this course were produced by the Continuum training team.

Start course for free

from DataCamp Blog http://www.datacamp.com/community/blog/new-course-merging-dataframes-with-pandas

Pandas Cheat Sheet: Data Wrangling in Python

By now, you’ll already know the Pandas library is one of the most preferred tools for data manipulation and analysis, and you’ll have explored the fast, flexible, and expressive Pandas data structures, maybe with the help of DataCamp’s Pandas Basics cheat sheet.

Yet, there is still much functionality that is built into this package to explore, especially when you get hands-on with the data: you’ll need to reshape or rearrange your data, iterate over DataFrames, visualize your data, and much more. And this might be even more difficult than “just” mastering the basics. 

That’s why today’s post introduces a new, more advanced Pandas cheat sheet. 

It’s a quick guide through the functionalities that Pandas can offer you when you get into more advanced data wrangling with Python. 

Pandas Cheat Sheet

The Pandas cheat sheet will guide you through some more advanced indexing techniques, DataFrame iteration, handling missing values or duplicate data, grouping and combining data, data functionality, and data visualization. 
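To give a flavour of the kind of operations the cheat sheet covers, here is a small sketch (with invented data) that touches on missing values, duplicates, grouping, and iteration:

import pandas as pd
import numpy as np

df = pd.DataFrame({"group": ["a", "a", "b", "b", "b"],
                   "value": [1.0, np.nan, 3.0, 3.0, 5.0]})

# Handle missing values and duplicate rows
clean = df.dropna().drop_duplicates()

# Group and aggregate
summary = clean.groupby("group")["value"].mean()

# Iterate over rows (usually a last resort; prefer vectorized operations)
for index, row in clean.iterrows():
    print(index, row["value"])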

In short, everything that you need to complete your data manipulation with Python!

Do you want to learn more? Start our Pandas Foundations course for free now or try out our Pandas DataFrame tutorial.

Don’t miss out on our other cheat sheets for data science that cover Matplotlib, SciPy, NumPy, and the Python basics.

from DataCamp Blog http://www.datacamp.com/community/blog/pandas-cheat-sheet-python

DataChats: An Interview with David Stoffer

Hi R enthusiasts, we’ve released episode 12 of our DataChats video series.

In this episode, we interview David Stoffer. David is a Professor of Statistics at the University of Pittsburgh. He is a member of the editorial boards of the Journal of Time Series Analysis and the Journal of Forecasting. David is the coauthor of the book “Time Series Analysis and Its Applications: With R Examples”, which is the basis of his course. Another (free) book he wrote on Time Series Analysis is available here. Check out his course to learn more: ARIMA Modeling with R.

We hope that you enjoy watching this series and make sure not to miss any of our upcoming episodes by subscribing to DataCamp’s YouTube channel!

from DataCamp Blog http://www.datacamp.com/community/blog/datachats-an-interview-with-david-stoffer

IPython Or Jupyter?

IPython and Jupyter Notebook

For learners as well as for more advanced data scientists, the Jupyter Notebook is one of the most popular data science tools out there: the interactive environment is not only ideal for teaching, learning, and sharing your work with peers, but it also supports reproducible research. Yet, as you’re discovering how to work with this notebook, you’ll often bump into IPython.

The two seem to be synonyms in some cases and you’ll agree with me when I say that it’s very confusing when you want to dig deeper: are magics part of Jupyter or IPython? Is saving and loading notebooks a feature of IPython or Jupyter?

You can probably keep on going with the questions.

Today’s blog post intends to illustrate some of the core differences between the two more explicitly, not only starting from the origins of both to explain how the two relate, but also covering some specific features that are either part of IPython or Jupyter, so that it will be easier for you to make the distinction between the two!

Consider also reading DataCamp’s Definitive Guide to Jupyter Notebook for tips and tricks, best practices, examples, and much more. 

The Origins of IPython and Jupyter

To fully understand what the Jupyter Notebook is and how it differs from IPython, it might be interesting to first read a bit more about how these two fit into the history and the future of computational notebooks.

The Start of Computational Notebooks: MATLAB, Mathematica & Maple

In the mid-1980s, MATLAB was released by The MathWorks, founded by Jack Little, Steve Bangert, and Cleve Moler.

Let’s go to the late 1980s, 1987 to be exact. Theodore Gray started working on what was to become the Mathematica notebook frontend, and a year later it was released to the public. The GUI allowed for the interactive creation and editing of notebook documents that contain pretty-printed program code, formatted text, and a whole bunch of other things such as typeset mathematics, graphics, GUI components, tables, and sounds. Standard word-processing capabilities were there, such as real-time multilingual spell checking, and you could output the documents in a slideshow environment for presentations.

When you look at how these notebooks were structured, you notice straight away that they depended on a hierarchy of cells that allowed for the outlining and sectioning of documents, which you now also find in Jupyter notebooks.

Also in the late 1980s, in 1989, Maple introduced their first notebook-style GUI. It was included with version 4.3 for the Macintosh. Versions of the new interface for X11 and Windows followed in 1990.

These notebooks would all be an inspiration for others to develop what would come to be called “data science notebooks”.

The Rise of Data Science Notebooks

There have been many computational notebooks in between those early systems and the ones that are now widely used for interactive data science. This section will focus on the notebooks that have been most notable in the rise of data science notebooks.

Sage Notebook

The Sage notebook, a browser-based system, was first released in the mid-2000s; in 2007, a new version was released that was more powerful, had user accounts, and could be used to make documents public. It resembled the Google Docs UI since the layout of the Sage notebook was based on the layout of Google notebooks.

The creators of the Sage notebook have confirmed that they were avid users of the Mathematica notebooks and Maple worksheets. Other important drivers behind the development of the Sage notebook were the close contact the developers had with the team behind IPython, earlier failed attempts at GUIs for IPython, and the rise of “AJAX” web applications, which didn’t require users to refresh the whole page every time they did something.

IPython and Jupyter Notebook

In late 2001, twenty years after Guido van Rossum began working on Python at the National Research Institute for Mathematics and Computer Science in the Netherlands, Fernando Pérez started developing IPython. The project was heavily influenced by the Mathematica notebooks and Maple worksheets, just like the Sage notebook and many other projects that followed.

In 2005, both Robert Kern and Fernando Pérez attempted to build a notebook system. Unfortunately, the prototype never became fully usable.

Fast forward two years: the IPython team had kept on working, and in 2007 they made another attempt at implementing a notebook-type system. By October 2010, there was a prototype of a web notebook, and in the summer of 2011 this prototype was incorporated; it was released with IPython 0.12 on December 21, 2011. In subsequent years, the team received awards, such as the Award for the Advancement of Free Software for Fernando Pérez on March 23, 2013 and the Jolt Productivity Award, and funding from the Alfred P. Sloan Foundation, among others.

Lastly, in 2014, Project Jupyter started as a spin-off project from IPython.

The last release of IPython before the split to Jupyter contained an interactive shell, the notebook server, the Qt console, and more. The project had become really big, with tools that were increasingly turning into distinct projects that just happened to live under the same name. After the Jupyter project started, the language-agnostic parts of the IPython project, such as the notebook format, the message protocol, the Qt console, and the notebook web application, were moved into the Jupyter project. This is called “The Big Split”.

IPython now has only two roles to fulfill: being the Python backend to the Jupyter Notebook, which is also known as the kernel, and being an interactive Python shell. But this is not all: within the IPython ecosystem, you’ll also find a parallel computing framework. You’ll read more about this later on!

And just like IPython, Project Jupyter is actually one name for a bunch of projects: the three applications that it harbors are the Notebook itself, a Console and a Qt console, but there are also subprojects such as Jupyterhub to support notebook deployment, nbgrader for educational purposes, etc. You can see an overview of the Jupyter architecture here.

Note that it’s exactly the evolution of this project that explains the confusion many Pythonistas have when it comes to IPython and Jupyter: since one came out of the other quite recently, some people still have difficulty using the right names for the concepts. What makes it even more complicated is that, because of this shared history, there is considerable overlap between IPython and Jupyter Notebook features that is sometimes hard to untangle!

How to distinguish between the two will become clear in the next sections of this post.

If you want to know more details about how the development of IPython came about, check out the personal accounts of Fernando Pérez and William Stein about the history of their notebooks.

R Notebooks

R Markdown and Jupyter notebooks share the goal of delivering a reproducible workflow: weaving code, output, and text together in a single document, supporting interactive widgets, and outputting to multiple formats.

However, the two also differ: the former focuses on reproducible batch execution, plain-text representation, version control, and production output, and offers the same editor and tools that you use for R scripts. The latter focuses on displaying output inline with code, caching output across sessions, and sharing code and output in a single file. Notebooks emphasize an interactive execution model, and they don’t use a plain-text representation but a structured data representation, such as JSON.

That all explains the purpose of RStudio’s notebook application: it combines all the advantages of R Markdown with the good things that computational notebooks have to offer.

To learn more about how you can work with R notebooks and what the exact differences are between Jupyter and R Markdown notebooks in terms of notebook sharing, project management, version control and more, check out DataCamp’s Jupyter and R: Notebooks with R post.

Other Data Science Notebooks

Of course, there are even more notebooks that you can consider when you’re getting into data science. In recent years, a lot of new alternatives have found their way to data scientists and data science enthusiasts: think not only of Beaker Notebook, Apache Zeppelin, Spark Notebook, DataBricks Cloud, etc., but also of other tools such as the Rodeo IDE which also make your data science analyses interactive and reproducible.

The Future of Notebooks

And notebooks seem to be here to stay. Recently, the next generation of the Jupyter Notebook was introduced to the community: JupyterLab. The new application includes not only support for notebooks but also a file manager, a text editor, a terminal emulator, a monitor for running Jupyter processes, an IPython cluster manager, and a pager to display help.

The rich toolset of the Jupyter Notebook has evolved organically, driven by the needs of its users and developers. JupyterLab is a next-generation architecture to support all these tools, but with a flexible and responsive UI, offering a user-controlled layout that ties the tools together.

Read more about it here.


IPython or Jupyter?

The evolution of the project and the consequent “Big Split” are the foundation for understanding the true differences between the two. But, as the two are inherently connected, you’ll sometimes find yourself doubting what is part of what.

The following section goes over some features that are either part of the IPython ecosystem or of the Jupyter Project.

Up to you to select the right answer and discover more about each feature!

Kernels?

Kernels are a feature of the Jupyter Notebook Application. A kernel is a program that runs and introspects the user’s code: it provides computation and communication with the frontend interfaces, such as notebooks. The Jupyter Notebook Application has three main kernels: the IPython, IRkernel and IJulia kernels.

Since the name “Jupyter” is actually short for “Julia, Python and R”, that really doesn’t come as too much of a surprise. The IPython kernel is maintained by the Jupyter team, as a result of the evolution from the IPython to the Jupyter project.

However, you can also run many other languages, such as Scala, JavaScript, Haskell, Ruby, and many more in the Jupyter Notebook Application. Those are community maintained kernels.

Notebook Deployment?

Deploying notebooks is something that you’ll typically find or look into when you’re working with the Jupyter notebooks. There are quite a number of packages that will help you to deploy your notebooks and that are part of the Jupyter ecosystem.

Here are some of them:

  • docker-stacks will come in handy when you need stacks of Jupyter applications and kernels as Docker containers.
  • ipywidgets provides interactive HTML & JavaScript widgets (such as sliders, checkboxes, text boxes, charts, etc.) for the Jupyter architecture that combine front-end controls coupled to a Jupyter kernel.
  • jupyter-drive allows IPython to use Google Drive for file management.
  • jupyter-sphinx-theme to add a Jupyter Sphinx theme to your notebook. It will make it easier to create intelligent and beautiful documentation.
  • kernel_gateway is a web server that supports different mechanisms for spawning and communicating with Jupyter kernels. Look here to see some use cases in which this package can come in handy.
  • nbviewer to share your notebooks. Check out the gallery here.
  • tmpnb to create temporary Jupyter Notebook servers using Docker containers. Try it out for yourself here.
  • traitlets is a framework that lets Python classes have attributes with type checking, dynamically calculated default values, and ‘on change’ callbacks. You can also use the package for configuration purposes, to load values from files or from command line arguments. traitlets powers the configuration system of IPython and Jupyter and the declarative API of IPython interactive widgets.

System Shell Usage?

It is possible to adapt IPython for system shell usage with magics: lines that start with ! are passed directly to the system shell. For example, !ls will run ls in the current directory. You can assign the result of a system command to a Python variable with the syntax myfiles=!ls.

However, if you want the result of an ls command explicitly printed out as a list of strings, without assigning it to a variable, use two exclamation marks (!!ls) or the %sx magic command without an assignment.

# Assign the result to `ls`
ls = !ls

# Explicit `ls`
!!ls

# Or with the %sx magic
%sx ls

# Assign the magic's result
ls = %sx ls

Note that !!commands cannot be assigned to a variable, but that the result of a magic (as long as it returns a value) can be assigned to a variable.

IPython also allows you to expand the value of Python variables when making system calls: just wrap your variables or expressions in braces ({}). Also, in a shell command with ! or !!, any Python variable prefixed with $ is expanded. In the code chunk below, you’ll see that you echo the argv attribute of the sys variable. The output of such system calls can in turn be captured in Python variables for further scripting.

To pass a literal $ to the shell, use a double $$. You’ll need this literal $ if you want to access the shell and environment variables like $PATH:

# Import the modules used below and initialize a variable
import math
import sys
x = 4

# System call with variable
!echo {math.factorial(x)}

# Expand a variable
!echo $sys.argv

# Use $$ for a literal $
!echo "A system variable: $$HOME"

Read more here.

Note that besides IPython, there are also other kernels that have magics to make sure that lines are executed as shell commands!

Also, there are aliases that you can define for system commands. These aliases are basically shortcuts to bash commands; internally, an alias is stored as a tuple of the alias name and the command, for example (“showTheDirectory”, “ls”). Run %alias? to get more information! Tip: use %rehashx to load all of your $PATH as IPython aliases.
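A quick sketch of how defining and using such an alias might look in an IPython session (the alias name is made up):

# Define an alias; internally it is stored as the tuple ("showTheDirectory", "ls")
%alias showTheDirectory ls

# Use the alias just like a regular command
showTheDirectory

# Load every executable on your $PATH as an IPython alias
%rehashx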

Magics?

If you have gone through DataCamp’s Definitive Guide to Jupyter Notebook or if you have already worked with Jupyter, you might already know the so-called “magic commands”. Magics usually consist of a syntax element that is not valid in the underlying language plus some kind of word that implies a command. Under the hood, magic functions are actually Python functions.

The IPython kernel uses, as you might already know, the % syntax element because it’s not a valid unary operator in Python. However, lines that begin with %% signal a cell magic: they take as arguments not only the rest of the current line, but all lines below them as well, in the current execution block. Cell magics can in fact make arbitrary modifications to the input they receive, which need not even be valid Python code at all. They receive the whole block as a single string.
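A minimal illustration of the difference, using the built-in %timeit magic: the line magic receives only the rest of its line, while the cell magic (which must be the first thing in its own cell) receives the whole cell body.

# Line magic: times just this single expression
%timeit sum(range(1000))

# Cell magic: placed at the top of its own cell, it times everything below it
%%timeit
total = 0
for i in range(1000):
    total += i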

Magics are specific to and provided by kernels and are designed to make your work and experience within the Jupyter Notebook a lot more interactive. Whether magic commands are available depends on the kernel developer(s) and varies from kernel to kernel. You already see it: magics are a kernel feature.

When you’re using the Python backend to the Jupyter Notebook, IPython, which is also known as the kernel, you might want to make use of the following tricks to gain access to functionalities that will make your programming faster, easier, and more interactive. Note that the ones that will be listed are not meant to be exhaustive. Check out this list of built-in magic commands for a complete overview.

Plotting

One major feature of the IPython kernel is the ability to display plots that are the output of running code cells. The IPython kernel is designed to work seamlessly with the matplotlib data visualization library to provide this functionality. To make use of it, use the magic command %matplotlib.

By default, your plot will be displayed in a separate window. However, you can also specify a backend, such as inline or qt, so that the output of the plotting commands is shown inline in the notebook or through a different GUI backend. You can read more about it here.
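For example, a minimal sketch of inline plotting in a notebook cell:

# Render matplotlib output inline in the notebook
%matplotlib inline

import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.title("A quick inline plot")
plt.show()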

FileSystem Navigation

The kernel’s magic commands also provide a way to navigate through your file system. The magics %cd and %bookmark can be used to either change directory or to bookmark a folder to have faster access to directories you often use.
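For example (the directory path is hypothetical):

# Change to a working directory and bookmark it under a name
%cd ~/projects/analysis
%bookmark analysis

# Later, jump back to the bookmarked directory by name
%cd -b analysis

# List all defined bookmarks
%bookmark -l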

Debugger Access

Next, you can also use magics to call up the Python debugger with %pdb every time there is an uncaught exception. This will direct you to the part of the code that triggered the exception, which makes it possible to rapidly find the source of a bug.

You can also use the %run magic command with the -d option to run scripts under the Python debugger’s control. It will automatically set up initial breakpoints for you. Lastly, you can also use the %debug magic for even easier debugger access.
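A short sketch of these three debugging entry points (the script name is hypothetical):

# Automatically drop into the debugger on any uncaught exception
%pdb on

# Run a script under the debugger's control, with automatic breakpoints
%run -d my_script.py

# Open the debugger on the most recent traceback, after the fact
%debug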

IPython Extensions

You can use the %load_ext magic to load an IPython extension by its module name. IPython extensions are Python modules that modify the behaviour of the shell: extensions can register magics, define variables, and generally modify the user namespace to provide new features for use within code cells.

Here are some examples:

  • Use %load_ext oct2py.ipython to seamlessly call M-files and Octave functions from Python,
  • Use %load_ext rpy2.ipython to use an interface to R running embedded in a Python process,
  • Use %load_ext Cython to use a Python to C compiler,
  • Use sympy.init_printing() to pretty print SymPy Basic objects automatically, and
  • To use Fortran in your interactive session, you can use %load_ext fortranmagic.
  • … There are many more! You can create your own IPython extensions and register them on PyPI: this also means that there are many other user-defined extensions and magics out there! One example is ipython_unittest, but also check out this Extensions Index.

One of the other extensions that you should keep an eye out on is sparkmagic, a set of tools for interactively working with remote Spark clusters through Livy, a Spark REST server, in Jupyter notebooks. The sparkmagic library provides a %%spark magic that you can use to easily run code against a remote Spark cluster from a normal IPython notebook.

# Load in sparkmagic
%load_ext sparkmagic.magics

# Set the endpoint
%manage_spark

# Ask for help
%spark?

Go here for more examples of how you can make use of these magics to work with Spark cluster interactively.

Note that besides %load_ext, IPython has two other magics that allow you to manage extensions from within your Jupyter Notebook: %reload_ext, which unloads, reimports, and loads an extension again, and %unload_ext, which simply unloads it.
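In practice, managing an extension looks something like this (my_extension is a hypothetical module name):

# Load an extension by its module name
%load_ext my_extension

# Unload it, reimport it, and load it again (handy while developing an extension)
%reload_ext my_extension

# Unload it
%unload_ext my_extension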

Different Kernels, Other Magics

However, in other languages the syntax element used for magic commands might already have a meaning of its own.

The R kernel, IRkernel, doesn’t have a magic system. To execute bash commands, for example, you’ll use R functions such as system() to invoke an OS command. An example would be system("head -5 *.csv", intern=TRUE). Note that by including the intern argument, you specify that you want to capture the output of the command as a character vector in R. To display Markdown input, you make use of display_markdown(), to which you pass the Markdown code as a character vector.

Likewise, the Julia Kernel IJulia also doesn’t use “magics”. Instead, other syntaxes to accomplish the same goals are more natural in Julia, work in environments outside of IJulia code cells, and are often more powerful. However, the developers of the IJulia kernel have made sure that whenever you enter an IPython magic command in an IJulia code cell, you will see a printout with help that explains how you can achieve a similar effect in Julia if possible.

For example, the analog of IPython’s %load in IJulia is IJulia.load().

On the other hand, there are kernels such as the Scala kernel IScala that do support magic commands, similarly to IPython. However, the set of magics is different as it has to match the specifics of Scala and the JVM. Magic commands consist of percent sign % followed by an identifier and optional input to a magic. Some of the most notable magics are:

# Type Information
%type 1

# Library Management 
%libraryDependencies
%update

As you have read above, the sparkmagic library also provides a set of Scala and Python kernels that allow you to automatically connect to a remote Spark cluster, run code and SQL queries, manage your Livy server and Spark job configuration, and generate automatic visualizations. And this without needing any code!

For example, you can easily execute SparkSQL queries with %%sql or access Spark application information and logs via %%info magic.

If you’re working with another kernel and you wonder if you can make use of magic commands, it might be handy to know that there are some kernels that build on the metakernel project and that will use, in most cases, the same magics that you’ll also find in the IPython kernel. You can find a list of the metakernel magics here. The metakernel is a Jupyter/IPython kernel template which includes core magic functions.

For example, when you’re using the MATLAB kernel, you’ll have the following magics available:

Available line magics:
%cd  %connect_info  %download  %edit  %get  %help  %html  %install  %install_magic  %javascript  %kernel  %kx  %latex  %load  %ls  %lsmagic  %magic  %parallel  %plot  %pmap  %px  %python  %reload_magics  %restart  %run  %set  %shell  %spell

Available cell magics:
%%debug  %%file  %%help  %%html  %%javascript  %%kx  %%latex  %%processing  %%px  %%python  %%shell  %%show  %%spell 

If you look at the chunk that is printed above, you’ll see that some of those magic commands seem very familiar. For those who don’t know the magics so well, compare the above chunk to the magics that you have available in the IPython kernel by default and you’ll see that some of the magics are the same:

Available line magics:
%alias  %alias_magic  %autocall  %automagic  %autosave  %bookmark  %cat  %cd  %clear  %colors  %config  %connect_info  %cp  %debug  %dhist  %dirs  %doctest_mode  %ed  %edit  %env  %gui  %hist  %history  %killbgscripts  %ldir  %less  %lf  %lk  %ll  %load  %load_ext  %loadpy  %logoff  %logon  %logstart  %logstate  %logstop  %ls  %lsmagic  %lx  %macro  %magic  %man  %matplotlib  %mkdir  %more  %mv  %notebook  %page  %pastebin  %pdb  %pdef  %pdoc  %pfile  %pinfo  %pinfo2  %popd  %pprint  %precision  %profile  %prun  %psearch  %psource  %pushd  %pwd  %pycat  %pylab  %qtconsole  %quickref  %recall  %rehashx  %reload_ext  %rep  %rerun  %reset  %reset_selective  %rm  %rmdir  %run  %save  %sc  %set_env  %store  %sx  %system  %tb  %time  %timeit  %unalias  %unload_ext  %who  %who_ls  %whos  %xdel  %xmode

Available cell magics:
%%!  %%HTML  %%SVG  %%bash  %%capture  %%debug  %%file  %%html  %%javascript  %%js  %%latex  %%perl  %%prun  %%pypy  %%python  %%python2  %%python3  %%ruby  %%script  %%sh  %%svg  %%sx  %%system  %%time  %%timeit  %%writefile

In essence, you can use one question to distinguish which magics are specific to IPython and which ones can be used in other kernels: is this functionality Python-specific or is it a general thing that can also be used in the language that you’re working with?

For example, %pdb (the Python debugger) or %matplotlib is specific to Python and wouldn’t make sense when you’re working with the JavaScript kernel. However, changing directories with %cd is such a general operation that it should work in any language, so chances are that this will be a magic that can be used in other kernels. Of course, you’ll still need to check whether your kernel makes use of magics at all.

Conversion and Formatting Notebooks?

Converting and formatting notebooks are features that you’ll find in the Jupyter ecosystem. Two tools that you’ll typically find for these tasks are nbconvert and nbformat.

You can use the former to convert notebooks to various other formats to present information in familiar formats, to publish research and to embed notebooks in papers, to collaborate with others and to share content with a larger audience.

The latter basically contains the Jupyter Notebook format and is the key to understanding that notebook files are simple JSON documents that contain: metadata (such as the kernel or language info), the version of the notebook format (major and minor), and the cells in which all text, code, etc. is stored.
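A minimal sketch with nbformat makes that structure visible (the filename is made up):

import nbformat

# Read a notebook file as a version-4 notebook node
nb = nbformat.read("analysis.ipynb", as_version=4)

# A notebook is just structured JSON: format version, metadata, and cells
print(nb.nbformat, nb.nbformat_minor)
print(nb.metadata.get("kernelspec", {}))
print(len(nb.cells), nb.cells[0].cell_type)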

Saving and Loading Notebooks?

Saving and loading notebooks is a feature of the Jupyter Notebook Application. You can load IPython Notebooks, saved as files with an .ipynb filename extension, which other people have created by downloading and opening up the file in the Jupyter application. More specifically, you can create a new notebook and then choose to open the file by clicking on the “File” tab, clicking “Open” and selecting your downloaded notebook.

Conversely, you can also save your own notebooks by clicking on the same “File” tab and selecting “Download as” to get your hands on your own notebook file, or you can choose to save the file and set a checkpoint. This is very handy when you want to do some minor version control and maybe revert to an earlier version of your notebook. Of course, your modifications are saved automatically every few minutes, so there isn’t always a need to do this explicitly.

Note that you can also opt to not save any changes to an original notebook by making a copy of it and saving all changes to that copy!

Keyboard Shortcuts & Multicursor Support?

Selecting multiple cells, toggling the cell output, inserting new cells, etc. For all these actions, you have keyboard shortcuts that are part of the Jupyter Notebook. You can find a list of keyboard shortcuts under the menu at the top: go to the “Help” tab and select “Keyboard Shortcuts”. 

Also, the multi cursor support is a feature that you’ll find in the Jupyter Notebook!

Parallel Computing Network?

The parallel computing network was part of the IPython project, but as of IPython 4.0 it’s a standalone package called ipyparallel. The package is basically a collection of CLI scripts for controlling clusters for Jupyter.

Even though it’s split off, it’s still a powerful component of the IPython ecosystem that is generally overlooked; it’s so powerful because instead of running a single Python kernel, it allows you to start many distributed kernels over many machines.

Typical use cases for ipyparallel are, for example, cases in which you need to run models many different times to estimate the distributions of its outputs or how they vary with input parameters. When the runs of the model are independent, you can speed up the process by running them in parallel across multiple computers in a cluster. Think about distributed model training or simulations.
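A minimal sketch of that pattern with ipyparallel, assuming a local cluster has already been started (for example with ipcluster start -n 4):

import ipyparallel as ipp

# Connect to the running cluster and get a view over all engines
rc = ipp.Client()
view = rc[:]

# Run independent computations in parallel across the engines
results = view.map_sync(lambda x: x ** 2, range(16))
print(results)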


Terminal?

This feature is one that is part of the Jupyter ecosystem: you have the Jupyter Console and a Jupyter terminal application. However, since the start, IPython has been used to indicate the original, interactive command-line terminal for Python. It offers an enhanced read-eval-print loop (REPL) environment particularly well adapted to scientific computing. This was the standard before 2011 when the Notebook tool was introduced and started offering a modern and powerful web interface to Python.

Next, you also had the IPython console, which started two processes: the original IPython terminal shell and the default profile or kernel which gets started if not otherwise noted. By default, this was Python.

The IPython console is now deprecated and if you want to start it, you’ll need to use the Jupyter Console, which is a terminal-based console frontend for Jupyter kernels. This code is based on the single-process IPython terminal. The Jupyter Console provides the interactive client-side experience of IPython at the terminal, but with the ability to connect to any Jupyter kernel instead of only to IPython.

This lets you test any Jupyter kernel you may have installed at the terminal, without needing to fire up a full-blown notebook for it. The Console allows for console-based interaction with other Jupyter kernels such as IJulia and IRkernel.

However, when you start up the console, you’ll see that, if you don’t add a --kernel argument, you’ll start with Python by default. The screen that you get to see is very similar to when you start up the IPython terminal.

Lastly, the Jupyter Notebook Application also has a Terminal Application: a simple bash shell terminal that runs in your browser. You can easily find it when you start the application and select a new terminal from the dropdown menu.

Qt Console?

The Qt console used to be a part of the IPython project, but it has now moved to the Jupyter project. It’s a lightweight application that largely feels like a terminal but provides a number of enhancements only possible in a GUI, such as inline figures, proper multi-line editing with syntax highlighting, graphical call tips, and much more. The Qt console can use any Jupyter kernel.

Conclusion

Today’s blog post was an addition to DataCamp’s Definitive Guide and covered not only the history of computational notebooks in more detail, so that you can understand the evolution of and the difference between the IPython and Jupyter projects more clearly, but it also went deeper into some features that are IPython- or Jupyter-specific. The goal was to show that the distinction between the two is sometimes hard to make if you don’t take the historical perspective of the two projects into account. In some cases, there is a gray zone, an “in between”, that isn’t easily classified.

from DataCamp Blog http://www.datacamp.com/community/blog/ipython-jupyter