AAAI Accepted Papers: This data set comprises the metadata for the AAAI MicroMass: A dataset to explore machine learning approaches for the benchmark problem for data integration of two entity relationship databases. Understanding what type of data relationships to look for helps you find those stories faster. Start to get a sense of what else you might like to explore in other data sets. As you dive into these, consider what types of interesting angles your. What do the demographics data tell you about the relationship between interesting data banks out there that are just waiting to be explored.
From these digital images, objects have been extracted, and an object catalogue has been composed. For each object, useful astronomical characteristics have been registered, such as the size, the brightness, the position, etc. A project was then carried out to classify the objects as stars or galaxies. External labeling to evaluate the classification algorithm was obtained from the more precise data of the Sloan Digital Sky Survey.
There are 4 object sets: one each for B and I, and two for R (one set from pictures taken in the 50's and one more recent set). Each of these is divided into a set of paired objects, for which a corresponding SDSS object was found, and a set of unpaired ones. The size of the datasets is as follows: Introduction and description by N.
This paper is an introduction to the SSS project. Image detection, parameterisation, classification and photometry by N.
A description of the methods for image detection, parameterisation, classification and photometry. A useful paper for you to read, as it gives explanations about how the data were obtained and what they mean, and about the object classification efforts by the SSS people.
Astrometry by N. An overview of how the astrometric parameters of the data were derived. Probably less interesting for you. This paper uses a similar astronomical dataset. It is quite interesting, as it is much more understandable than paper II above. It uses a similar two-step classification method and should therefore give you some insight into what is happening in paper II.
Then perform exploratory data analysis and prepare the data for data mining. You can concentrate on one of the paired datasets. Classify sky objects as stars or galaxies (use the SDSS classification as the label).
Also, do a performance evaluation with respect to the magnitude, as was done in paper II. These are astronomical data, and all the documentation is written in 'astronomical language', so it is quite difficult to understand what the data are all about and how the previous research has been carried out. Furthermore, the dataset is quite big, so case reduction might be necessary.
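A performance evaluation with respect to magnitude can be sketched as accuracy computed separately per magnitude bin. This is a minimal illustration, not the procedure of paper II: the function name `accuracy_by_magnitude`, the half-open bins, and the bin edges in the usage lines are all assumptions, and the sample values are made up.

```python
from bisect import bisect_right


def accuracy_by_magnitude(magnitudes, true_labels, predicted_labels, bin_edges):
    """Return {(low, high): accuracy} for every magnitude bin that has objects.

    Bins are half-open intervals [low, high) built from the sorted `bin_edges`.
    Objects outside the outermost edges are ignored.
    """
    counts = {}  # (low, high) -> (total, correct)
    for mag, truth, pred in zip(magnitudes, true_labels, predicted_labels):
        i = bisect_right(bin_edges, mag) - 1
        if i < 0 or i >= len(bin_edges) - 1:
            continue  # magnitude falls outside the binned range
        key = (bin_edges[i], bin_edges[i + 1])
        total, correct = counts.get(key, (0, 0))
        counts[key] = (total + 1, correct + (truth == pred))
    return {key: correct / total for key, (total, correct) in counts.items()}


# Hypothetical usage with made-up values (labels would come from SDSS).
mags = [15.2, 15.8, 18.1, 18.9, 19.5]
sdss = ["star", "galaxy", "star", "galaxy", "star"]
pred = ["star", "galaxy", "galaxy", "galaxy", "star"]
per_bin = accuracy_by_magnitude(mags, sdss, pred, [15, 17, 19, 21])
```

Reporting accuracy per bin rather than one overall figure shows whether the classifier degrades for fainter objects.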
Less interesting datasets

You are allowed to come up with your own dataset for this project. In order to guide you in this search, we present here some examples of datasets which were considered less interesting.

The Landsat image data from Statlog

Description: For each pixel, 4 frequency values each between are given.
This is because photos are taken in 4 different spectral bands. The aim is to classify the central pixel (a piece of land) into 1 of 6 classes based on its values and those of its immediate neighbours.
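As a sketch of what "its values and those of its immediate neighbours" means as classifier input, the 3x3 neighbourhood of a pixel can be flattened into one feature vector (9 pixels times 4 bands gives 36 values). The image layout, a list of rows of per-pixel band tuples, and the name `neighbourhood_features` are assumptions for illustration only; the Statlog files may already store the data in flattened form.

```python
def neighbourhood_features(image, row, col):
    """Flatten the 3x3 neighbourhood around (row, col) into one feature list.

    `image` is a list of rows; each pixel is a sequence of band values.
    Pixels are visited row by row, so the central pixel is the fifth one.
    """
    if not (1 <= row < len(image) - 1 and 1 <= col < len(image[0]) - 1):
        raise ValueError("central pixel must not lie on the image border")
    features = []
    for r in range(row - 1, row + 2):
        for c in range(col - 1, col + 2):
            features.extend(image[r][c])
    return features


# Hypothetical 3x3 image with 4 made-up band values per pixel.
image = [[(r, c, r + c, r * c) for c in range(3)] for r in range(3)]
vector = neighbourhood_features(image, 1, 1)  # 36 values
```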
More info about this is given in the documentation file. The data have been perfectly preprocessed, and the classes are quite well balanced. This dataset is not very challenging, as very good results can be obtained very easily. It contains texts to be searched and classified. These texts are records of medical articles containing fields for the author, the title, the source, the publication type, a number of human-assigned relevant terms, and in about two thirds of the cases also the abstract.
Full texts are not available. In the TREC filtering task, the program gets a user profile (a query) and a sample of texts that match this profile.
6 Amazing Sources of Practice Data Sets
The aim is to search the massive text database to find more texts that match the profile. Solutions to the filtering problem presented in TREC-9 include, e.g., a kNN method and an adaptive term and threshold selection method. This dataset is too hard, mainly because of its sheer magnitude. In An Evaluation of Statistical Approaches to Text Categorization, the use of different text classification methods on different text datasets was examined.
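To give an idea of what a kNN method for profile matching could look like, here is a generic sketch using bag-of-words vectors and cosine similarity. It is not the TREC-9 system mentioned above, whose details are not given here; the function names, the whitespace tokenisation, the majority vote, and the sample texts are all assumptions.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count lower-cased, whitespace-separated terms."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_matches_profile(text, labelled_texts, k=3):
    """Majority vote of the k most similar labelled texts.

    `labelled_texts` is a list of (text, matches_profile) pairs.
    """
    query = bag_of_words(text)
    ranked = sorted(
        labelled_texts,
        key=lambda item: cosine_similarity(query, bag_of_words(item[0])),
        reverse=True,
    )
    top = ranked[:k]
    votes = sum(1 for _, relevant in top if relevant)
    return votes * 2 > len(top)


# Hypothetical sample of texts, labelled by whether they match the profile.
sample = [
    ("heart disease treatment", True),
    ("heart surgery outcomes", True),
    ("skin rash cream", False),
    ("skin allergy treatment", False),
]
is_match = knn_matches_profile("heart surgery treatment", sample, k=3)
```

Comparing a new text against every labelled text is the expensive part, which is one reason the sheer size of a collection matters for this approach.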
It was pointed out that this set is much harder than, for example, the Reuters set.

The predictive toxicology dataset

Description: The PKDD conferences organize discovery challenges, not as a competition, but with the aim that different researchers work together to try to find solutions for certain KDD problems. One of the tasks in the challenge used a dataset of chemical structures.
For each structure, it was indicated whether or not this substance caused cancer in mice or rats. The aim was to create a predictor of toxicology for substances based on their chemical structure. This is in fact a very difficult task.
Chemical structures can take on an enormous variety of shapes and sizes. They are a perfect example of data that do not fit in the classical attribute-value format. This means that traditional data mining techniques cannot be used on these data. On the dataset website, there are links to solution papers. All of these seem to try to derive attribute-value formats from the chemical structures, using domain background knowledge. This is definitely an interesting dataset, but certainly too difficult for a mini-project.
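To make "deriving an attribute-value format" concrete, here is a deliberately crude sketch that turns a molecular formula string into a fixed-length vector of element counts. It is only an illustration: it assumes formulas without parentheses or charges, it throws away all bonding information, and it is far simpler than the background-knowledge features used in the solution papers. The function name and the example formulas are hypothetical.

```python
import re
from collections import Counter

# An element symbol (one capital, optional lowercase) followed by an optional count.
_ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")


def formula_to_attributes(formula, elements):
    """Return the count of each element in `elements`, in that order.

    Assumes a plain formula such as "C6H5Cl" (no parentheses, no charges).
    """
    counts = Counter()
    for symbol, number in _ELEMENT.findall(formula):
        counts[symbol] += int(number) if number else 1
    return [counts[element] for element in elements]


# Every structure becomes a row with the same columns: C, H, Cl, N.
row = formula_to_attributes("C6H5Cl", ["C", "H", "Cl", "N"])
```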
This is a well-known data set for text classification, used mainly for training classifiers by using both labeled and unlabeled data (see references below). The data set is a collection of 20, messages, collected from UseNet postings over a period of several months in The data are divided almost evenly among 20 different UseNet discussion groups. Many of the categories fall into overlapping topics; for example, 5 of them are about companies discussion groups and 3 of them discuss religion.
Other topics included in News Groups are: This dataset is too well known and is in fact used as the example dataset for the rainbow software documentation. Yeast Gene Regulation Prediction dataset Description: I did that recently with the AutoDiscovery tool from Butler Scientifics.
Exploratory Data Analysis (EDA) for Small Data

First, note that this tool is not explicitly for big data, though it is certainly useful for small subsets of big data: The focus is therefore on scientific discovery from small data.
This is the style of data science that nearly every scientist needs to carry out on a routine basis, since data from daily experiments are rarely in the rarified realm of big data, but modern scientific instruments often do generate large numbers of measured parameters per data object. Consequently, AutoDiscovery aims to satisfy a very particular scientific discovery requirement: It is a complement to those other more comprehensive statistical packages, not a competitor. Correlation discovery alone may seem relatively simple and thus a specialized tool for it seems unnecessary.
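As a baseline for what "correlation discovery" means at its simplest, the sketch below computes the Pearson correlation for every pair of measured parameters and ranks the pairs by strength. This is not how AutoDiscovery works internally, which is not described here; the function names and the sample columns are assumptions, and a real analysis would also need significance testing and subgroup handling.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def rank_correlations(columns):
    """Return (name_a, name_b, r) for all column pairs, strongest |r| first.

    `columns` maps a parameter name to its list of measurements.
    """
    pairs = [
        (a, b, pearson(columns[a], columns[b]))
        for a, b in combinations(columns, 2)
    ]
    return sorted(pairs, key=lambda pair: abs(pair[2]), reverse=True)


# Hypothetical measurements: three parameters for four objects.
ranked = rank_correlations({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "z": [1, -1, 1, -1],
})
```

With many parameters per object the number of pairs grows quadratically, which is where automating the search starts to pay off.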
The top 10 features of AutoDiscovery for exploring complex relationships in data for scientific discovery are: GET-Evidence has put up public genomes for download. Maybe you could make yourself a clone? In what is the smallest data set on this list, the survival rates of men and women on the Titanic. Want a super-specific breakdown of the contents of your food?
Invented a new image compression algorithm (Pied Piper, anyone)? Or maybe tiny images are too tiny. In that case, try the ImageNet database, which is structured around the WordNet hierarchy. So if you want to teach an algorithm what a narwhal looks like, this would be a good place to start.
How about all the Wikipedia images? Stanford in association with Google Research has you covered with their English-phrase-to-associated-Wikipedia-article database. The research paper can be downloaded here. Yandex, the Russian search engine, has made a bunch of search data available.
Namely, if someone searches for something, what do they click on? Did you know that Google has a search engine for data sets? Questions this data could answer: Is the world becoming more progressive over time?
IRDS: Datasets for Mini-Projects
How have attitudes towards religion shifted over time? Speaking of public attitudes over time, you can download a set of the General Social Survey from until about, which should answer both of those questions.
But what about the real life celebrity problem? Need a billion webpages from February ? Maybe to train a never ending language learner named NELL? Once we manage to teach machines natural language, we can just have a computer read it all and give us the cliff notes and the scientific breakthroughs.
If you need economic census data on any industry, check out census. If finance is really evil, you ought to be able to find something damning in the data. It was much more popular before the rise of the world wide web. Anyways, you can download a huge data set of postings to Usenet here.
It might be pretty good for some kind of textual analysis project or for training a machine learning algorithm (maybe a spellchecker?).
You could use the data to build out a Google Groups competitor, too. One way to start saving all those future lives might be by digging into this data set of every recorded meteor impact on Earth from BCE to How do gender and mental illness affect crime? This data set was collected explicitly with that question in mind. There are a lot of lonely men and women out there, and some of those lonely men and women have excellent analytical skills.
Not creatine levels, unfortunately. Are modern jobs worse than those of the past? My grandparents built tires at Firestone. Today, people rarely have that level of control and visceral experience of the finished product of their work. This set of five surveys regarding how different groups experience employment could answer that question.
Do early positive reviews beget more positive reviews? You can download it after filling out a form. Today, some algorithms are actually more accurate than human judges! This would have been nice to have back when I was in grade school.
I distinctly recall once arguing with a teacher over missing a question because she insisted that I had written the letter j when it was clearly a d. UCI has a poker hand data set available. Machines have won in at least one tournament. Another data set from UCI: This is good for building up classification algorithms that decide whether a new image is an ad, which might be good for, say, automatic ad blocking or spam detection. Or maybe a Google Glass application that filters out real life advertisements.
Look at a billboard and instead see a virtual extension of the natural landscape. Remember the whole Star Wars Kid debacle? Wikipedia informs me that Attack of the Show rated it the number 1 viral video of all time. Someone could take this data and produce a visualization of who saw it when via maps, along with annotations of where the traffic was coming from. With this WordPress crawl you can find out. Or maybe clustering people by interest.
Is Obama in bed with big oil? Or the corn lobbies? And who was backing that Herman Cain dude, anyways?