# ROC and AUC details

For a detailed explanation of how ROC and AUC work:

This paper from HP Laboratories is simply incredible and very clear.
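As a rough illustration of the idea (not taken from the paper), here is a minimal sketch in plain Python. It builds a ROC curve by sweeping a threshold over classifier scores and then computes the AUC with the trapezoidal rule. The function names `roc_points` and `auc`, as well as the toy labels and scores, are made up for this example.

```python
def roc_points(labels, scores):
    """Return ROC points (FPR, TPR), sweeping the threshold from high to low.

    labels: 1 for positive, 0 for negative (both classes must be present).
    scores: classifier scores, higher means "more likely positive".
    """
    pos = sum(labels)
    neg = len(labels) - pos
    # Rank the examples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    prev_score = None
    for score, label in ranked:
        # Emit a point only when the score changes, so ties move together.
        if prev_score is not None and score != prev_score:
            points.append((fp / neg, tp / pos))
        if label == 1:
            tp += 1
        else:
            fp += 1
        prev_score = score
    points.append((fp / neg, tp / pos))  # always ends at (1.0, 1.0)
    return points


def auc(points):
    """Area under the ROC curve, using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data (hypothetical): three positives and three negatives.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(labels, scores)
print(curve)
print(auc(curve))  # 8/9, about 0.889, for this toy data
```

The AUC can also be read as the probability that a randomly chosen positive gets a higher score than a randomly chosen negative: here 8 of the 9 positive/negative pairs are ordered correctly.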

# Online courses for Git and GitHub

For those interested in Git and GitHub:

https://www.udacity.com/course/ud775

The following two are very good:

And for the most common problems, use this link, which is very, very good:

https://www.coursera.org/course/datascitoolbox (includes some lessons on Git and GitHub)

For those who are starting out with R:

– First, install R: http://cran.rstudio.com
– Third, always work in RStudio. It provides a user interface quite similar to MATLAB's and is very easy to use.
– Fourth, take this free online course: https://www.coursera.org/course/rprog

(the more energetic can take the complete series: https://www.coursera.org/specialization/jhudatascience/1?utm_medium=listingPage)
– Fifth, here is some reference information:

https://mlopezm.wordpress.com/2014/12/28/courses-to-learn-r/
There is also plenty of information about R on the web.

– Finally, when you search Google for information about R, always write R as [R]; otherwise the results won’t be very useful.

Good luck!

# Using Python for data analysis and machine learning

Here is an excellent list of Python tools that you can use for your machine learning projects, extracted from this reference:

http://stats.stackexchange.com/questions/1595/python-as-a-statistics-workbench

• NumPy/SciPy: You probably know about these already. But let me point out the Cookbook, where you can read about many statistical facilities already available, and the Example List, which is a great reference for functions (including data manipulation and other operations). Another handy reference is John Cook’s Distributions in SciPy.
• pandas: This is a really nice library for working with statistical data: tabular data, time series, panel data. Includes many built-in functions for data summaries, grouping/aggregation, pivoting. Also has a statistics/econometrics library.
• larry: Labeled array that plays nice with NumPy. Provides statistical functions not present in NumPy and good for data manipulation.
• python-statlib: A fairly recent effort which combined a number of scattered statistics libraries. Useful for basic and descriptive statistics if you’re not using NumPy or pandas.
• statsmodels: Statistical modeling: linear models, GLMs, among others.
• scikits: Statistical and scientific computing packages, notably smoothing, optimization and machine learning.
• PyMC: For your Bayesian/MCMC/hierarchical modeling needs. Highly recommended.
• PyMix: Mixture models.

If speed becomes a problem, consider Theano — used with good success by the deep learning people.

Information for the tools:

For pandas:

http://pandas.pydata.org/pandas-docs/dev/10min.html

For a short summary of pandas:

http://www.bigdataexaminer.com/exploratory-data-analysis-in-python-using-pandas-matplotlib-and-numpy/
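To give a quick taste of the summary, grouping/aggregation, and pivoting features mentioned in the pandas entry above, here is a small sketch. The table and its column names (`city`, `year`, `sales`) are invented for illustration; it only assumes that pandas is installed.

```python
import pandas as pd

# Hypothetical toy data: yearly sales for two cities.
df = pd.DataFrame({
    "city": ["Madrid", "Madrid", "Paris", "Paris"],
    "year": [2013, 2014, 2013, 2014],
    "sales": [10, 12, 7, 9],
})

# Data summary: count, mean, std, min, quartiles and max of a numeric column.
print(df["sales"].describe())

# Grouping/aggregation: total sales per city.
totals = df.groupby("city")["sales"].sum()
print(totals)

# Pivoting: one row per city, one column per year.
table = df.pivot_table(index="city", columns="year", values="sales", aggfunc="sum")
print(table)
```

The same three steps (summarize, group, pivot) cover a large part of everyday exploratory analysis, which is why the tutorials linked above are worth working through.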

For NumPy/SciPy:

http://wiki.scipy.org/Cookbook

For GLMs (statsmodels):

http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/glm.html

For Monte Carlo methods (PyMC):

http://pymc-devs.github.io/pymc/tutorial.html#