Famous Datasets

The ATOMS toolbox rdataset provides a collection of 761 datasets that were originally distributed alongside R.

Download it on ATOMS:

https://atoms.scilab.org/toolboxes/rdataset

Contribute to the code on our Forge:

https://forge.scilab.org/index.php/p/rdataset/

You can easily import these famous datasets and draw inspiration from the open-source software R, which specializes in statistics.

93 Cars on Sale in the USA in 1993

// Read the dataset Cars93 from MASS
[data,desc] = rdataset_read("MASS","Cars93");

// Read the description
disp(desc)

// Read the header of the data
disp(data(1))

// View the hierarchy of the data as a tree
tree_show(data)

// Store the columns as a matrix of strings
M = [];
for i = 2:29
    M = [M, string(data(i))];
end

// Add the header
M = [data(1)(2:29); M];

// Export as CSV
csvWrite(M, "Cars93.csv")

 

Package:       MASS
Item:          Cars93
Title:         Data from 93 Cars on Sale in the USA in 1993
Rows:          93
Cols:          27
has_logical:   FALSE
has_binary:    TRUE
has_numeric:   TRUE
has_character: FALSE
Downloads:     CSV | DOC

You can also import the data in a more classic way after downloading it above:

// Read everything as strings (header row included)
Cars93 = csvRead('Cars93.csv',',','.', 'string');

// Or read the numeric values only, skipping the header row
D = csvRead('Cars93.csv'); data = D(2:$,:);

This gives us a single matrix containing all the data in our variable workspace.
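As a quick sanity check on the imported matrix, we can inspect its size and look at a few basic statistics. A minimal sketch, assuming (as the histogram below does) that the 6th column holds the car prices:

```scilab
// Numeric part of the CSV, header row skipped
D = csvRead('Cars93.csv');
data = D(2:$,:);

// Should report 93 rows and 27 columns
disp(size(data))

// Basic statistics of the assumed price column (column 6)
prices = data(:,6);
mprintf("Price: min=%.1f  mean=%.1f  max=%.1f\n", min(prices), mean(prices), max(prices))
```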

 

The first distribution of interest would be the price of the different models:

histplot(12,data(:,6))

title("Prices of 93 Models of 1993 Cars")

 

One way to represent the relative values of the data is a pie chart:

// Frequency count of the categories in the 4th column
m = tabul(Cars93(2:$,4));

// Pie chart with labels on the slices
pie(m(2), m(1))

// Or with a legend instead
pie(m(2)); legend(m(1));

Some might argue that it is easier to compare the values of the categories when they are plotted on the same scale.

For that, a bar chart is more appropriate. However, you need to change the tick labels to get a representation of the non-numeric categories.

// Bar chart
bar(m(2));

// Retrieve the axes through the bar handle and relabel the x-ticks
h = get("current_entity");
h.parent.x_ticks.labels = m(1);

The Wealth of Nations

Another interesting dataset covers the United Nations (1998) social indicators: infant mortality and gross domestic product (GDP).

// Read the dataset UN from the car package

[data,desc] = rdataset_read("car","UN");

// read description

disp(desc)

// read header of data

disp(data(1))

 

Some values might not be defined (NaN = Not a Number) in our dataset, so we can use the function thrownan to discard them.

// History plot

// Getting rid of the NaN entries

[gdp,k] = thrownan(data.gdp);

[rationonan,kk] = thrownan(data.infant_mortality(k));

gdpnonan = gdp(kk);

// Plot

[p,i] = gsort(gdpnonan,'g','i');

scf(1); clf(1);

plot([1:30],rationonan(i(1:30)),'bo-')

plot([1:30],data(k(kk(i(1:30))),15),'go-')

[m,im] = min(rationonan(i(1:30)));

plot([0,im],[m,m],'r')

[M,iM] = max(rationonan(i(1:30)));

plot([0,iM],[M,M],'r')

set(gca(),"grid",[1 1]*color('gray'));

set(gca(),"data_bounds",[0 30 0 200]);

title('Infant mortality ratio in the poor states');

 

Now let's look at a Pareto chart:

// Bar Chart and cumulative

[gdpnonan,i] = thrownan(data.gdp);

[gdp,j] = gsort(gdpnonan);

tot_gdp = sum(gdp);

cum_gdp = cumsum(gdp(1:10));

scf(4); clf(4);

bar(gdp(1:10)*100/tot_gdp)

plot([1:10],cum_gdp*100/tot_gdp,'ro-')

set(gca(),"grid",[1 1]*color('gray'));

We have plotted the per-capita GDP of the 10 states with the highest values, each as a percentage of the global total, while the red line shows the cumulative sum of these values: it points out that these 10 states alone account for about 30% of the global per-capita GDP.
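That 30% figure can be checked numerically, reusing the variables gdp and tot_gdp computed in the script above:

```scilab
// Share of the 10 highest per-capita GDPs in the global total
share = sum(gdp(1:10)) / tot_gdp * 100;
mprintf("Top-10 share: %.1f %%\n", share)
```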

You can find more in this tutorial from Openeering on data mining.

Hot dogs and buns

Another way to find data is to crawl the internet. For instance, you can find tremendous sources of wisdom, such as this great proof of cultural domination from the US: Nathan's Hot Dog Eating Contest.

This graphic was produced with Scilab, exported in the SVG vector format, and edited with the open-source software Inkscape.

We could dedicate a full tutorial to this subject, but let us briefly go through the basics:

First we can download the content of the webpage with the function getURL:

--> filename=getURL("en.wikipedia.org/wiki/Nathan%27s_Hot_Dog_Eating_Contest")
 filename  = 

 C:\myFolder\Nathan's_Hot_Dog_Eating_Contest

Then we can read the HTML with the functions Scilab provides for XML, such as xmlRead:

--> doc=xmlRead(filename)
 doc  = 

XML Document
url: file:///C:/myFolder/Nathan's_Hot_Dog_Eating_Contest
root: XML Element

We need to know a little bit about HTML in order to find the data we are looking for:

You can inspect webpages in your browser (Ctrl+Shift+I in Chrome).

Extracting the nodes corresponding to the cells of an HTML table can be performed with the function xmlXPath:

--> td=xmlXPath(doc, "//td")
 td  = 

XML List
size: 367

We find 367 occurrences. By browsing through them, we finally find the content we are looking for:

--> td(10).content
 ans  =

 2018

You can write a program based on this principle to extract valuable data from webpages. Enjoy hacking ;)
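As a sketch of such a program (the layout of the Wikipedia table changes over time, so which cells hold the values you want is an assumption to verify by browsing, as we did above):

```scilab
// Download the page, parse it, and collect the content of every table cell
filename = getURL("en.wikipedia.org/wiki/Nathan%27s_Hot_Dog_Eating_Contest");
doc = xmlRead(filename);
td = xmlXPath(doc, "//td");

// Gather the text of each cell into a column of strings
cells = [];
for i = 1:td.size
    cells = [cells; td(i).content];
end

// Release the XML document once the strings are extracted
xmlDelete(doc);
```

The cells vector can then be filtered for the years and hot-dog counts you are after.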