ODDS provides open access to a large collection of outlier detection datasets, with ground truth where available. Our focus is to gather datasets from different domains and present them under a single umbrella for the research community. The datasets are arranged by type into separate tables, in the order listed below.
Multi-dimensional point datasets: There is one record per data point, and each record contains several attributes.
Time series graph datasets for event detection: Temporal graph data in which the graph changes dynamically over time as new nodes and edges arrive or existing nodes and edges disappear.
Time series point datasets (Multivariate/Univariate): Temporal point data where each point has one or more attributes and the attributes change over time.
Adversarial/Attack scenario and security datasets: Opinion fraud detection data from online review systems, and cyber security data such as intrusion detection under DoS, DDoS, and other attack scenarios.
Crowded scene video data for anomaly detection: Video clips acquired with a camera.
Multi-dimensional point datasets
| Dataset | #points | #dim. | #outliers (%) |
|---|---|---|---|
| Lympho | 148 | 18 | 6 (4.1%) |
| WBC | 378 | 30 | 21 (5.6%) |
| Glass | 214 | 9 | 9 (4.2%) |
| Vowels | 1456 | 12 | 50 (3.4%) |
| Cardio | 1831 | 21 | 176 (9.6%) |
| Thyroid | 3772 | 6 | 93 (2.5%) |
| Musk | 3062 | 166 | 97 (3.2%) |
| Satimage-2 | 5803 | 36 | 71 (1.2%) |
| Letter Recognition | 1600 | 32 | 100 (6.25%) |
| Speech | 3686 | 400 | 61 (1.65%) |
| Pima | 768 | 8 | 268 (35%) |
| Satellite | 6435 | 36 | 2036 (32%) |
| Shuttle | 49097 | 9 | 3511 (7%) |
| BreastW | 683 | 9 | 239 (35%) |
| Arrhythmia | 452 | 274 | 66 (15%) |
| Ionosphere | 351 | 33 | 126 (36%) |
| Mnist | 7603 | 100 | 700 (9.2%) |
| Optdigits | 5216 | 64 | 150 (3%) |
| Http (KDDCUP99) | 567479 | 3 | 2211 (0.4%) |
| ForestCover | 286048 | 10 | 2747 (0.9%) |
| Mulcross | 262144 | 4 | 26214 (10%) |
| Smtp (KDDCUP99) | 95156 | 3 | 30 (0.03%) |
| Mammography | 11183 | 6 | 260 (2.32%) |
| Annthyroid | 7200 | 6 | 534 (7.42%) |
| Pendigits | 6870 | 16 | 156 (2.27%) |
| Ecoli | 336 | 7 | 9 (2.6%) |
| Wine | 129 | 13 | 10 (7.7%) |
| Vertebral | 240 | 6 | 30 (12.5%) |
| Yeast | 1364 | 8 | 64 (4.7%) |
| Seismic | 2584 | 11 | 170 (6.5%) |
| Heart | 224 | 44 | 10 (4.4%) |
| OSAD Benchmark Datasets | Multiple datasets | -- | -- |
| One-class dataset by David Tax | Multiple datasets | -- | -- |
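
The percentage in the last column of the table above is the number of outliers divided by the number of points. As a minimal sketch of that arithmetic, the snippet below recomputes it for a few rows copied from the table; the function name `outlier_percentage` and the `rows` list are illustrative choices and not part of ODDS.

```python
# A few rows copied from the table above: (dataset name, #points, #outliers).
rows = [
    ("Lympho", 148, 6),
    ("Glass", 214, 9),
    ("Letter Recognition", 1600, 100),
    ("Vertebral", 240, 30),
    ("Mulcross", 262144, 26214),
]


def outlier_percentage(n_points: int, n_outliers: int) -> float:
    """Return the share of outliers as a percentage of all points."""
    if n_points <= 0:
        raise ValueError("n_points must be positive")
    return 100.0 * n_outliers / n_points


for name, n_points, n_outliers in rows:
    # Two decimals are printed; the table itself rounds more coarsely.
    print(f"{name}: {outlier_percentage(n_points, n_outliers):.2f}%")
```

Recomputing the ratio this way is a quick sanity check when a downloaded file's shape does not seem to match the listed counts.
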
Time series graph datasets for event detection
| Dataset | #nodes | Duration | Description |
|---|---|---|---|
| EnronInc | 80,884 | 4 years | Email communication network over time in Enron Inc. |
| RealityMining | 9104 | 50 weeks | Communication and proximity data of 97 faculty, students, and staff at MIT. |
| TwitterWorldCup2014 | 54K | 1 month | Entity co-mention network from Twitter related to the 2014 Soccer World Cup. |
| TwitterSecurity2014 | 130K | 4 months | Entity co-mention network from Twitter related to terrorism and domestic security. |
| NYTNews | 320K | 7.5 years | Entity co-mention graph for New York Times News Corpus over 7.5 years. |
| ChallengeNetwork | 125 | 9 days | Simulated cyber challenge network traffic flow data. |
| VAST2012MC2 | 5K | 2 days | Bank of Money Regional Office Network Operations Forensics. |
| VAST2013MC3 | 1.2K | 2 weeks | Big Marketing computer network flow data. |
| VAST2014 | -- | 3 days | Timestamped text, network, and transaction data from GAStech. |
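
Temporal graph data of the kind listed above, where edges arrive or disappear over time, could be represented as a sequence of edge-set snapshots. The sketch below is one hypothetical representation, assuming undirected edges between string node labels; the datasets in this table ship in their own formats, and the names `normalize` and `edge_changes` are illustrative only.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str]


def normalize(edge: Edge) -> Edge:
    """Order the endpoints so (u, v) and (v, u) denote the same undirected edge."""
    u, v = edge
    return (u, v) if u <= v else (v, u)


def edge_changes(snapshots: List[Set[Edge]]) -> List[Tuple[Set[Edge], Set[Edge]]]:
    """For each consecutive pair of snapshots, return (arrived, disappeared) edges."""
    changes = []
    for prev, curr in zip(snapshots, snapshots[1:]):
        prev_n = {normalize(e) for e in prev}
        curr_n = {normalize(e) for e in curr}
        changes.append((curr_n - prev_n, prev_n - curr_n))
    return changes


# Toy example with three time steps (made-up data, not from any listed dataset).
snapshots = [
    {("a", "b"), ("b", "c")},
    {("a", "b"), ("c", "d")},
    {("a", "b"), ("c", "d"), ("d", "a")},
]

for t, (arrived, disappeared) in enumerate(edge_changes(snapshots), start=1):
    print(f"step {t}: arrived={sorted(arrived)} disappeared={sorted(disappeared)}")
```

An event detector would typically summarize each snapshot or each change set with features and flag time steps that deviate from the rest; that step is left out here.
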
Time series point datasets (Multivariate/Univariate)
| Dataset | Type | Size | Duration | Description |
|---|---|---|---|---|
| DataMarket - TSDL | Univariate | Multiple datasets | -- | The Time Series Data Library (TSDL) was created by Rob Hyndman, Professor of Statistics at Monash University, Australia. |
| Yahoo - a benchmark dataset for TSAD | Multivariate | 367 time series | between 741 and 1680 observations per series at regular intervals | This dataset was released by Yahoo Labs for detecting unusual traffic on Yahoo servers. |
| Numenta Anomaly Benchmark (NAB) | Multivariate | Multiple datasets | -- | Numenta Anomaly Benchmark, a benchmark for streaming anomaly detection that uses sensor-provided time-series data. |
Adversarial/Attack scenario and security datasets
| Dataset | Size | Description |
|---|---|---|
| YelpCHI | 67,395 hotel and restaurant reviews | Reviews from Yelp.com for Chicago Hotels and Restaurants. |
| YelpNYC | 359,052 restaurant reviews | Reviews from Yelp.com for NYC restaurants |
| YelpZip | 608,598 restaurant reviews | Reviews from Yelp.com collected by zip code for NY, NJ, VT, CT, and PA. |
| YelpAcademic | 2.7M Yelp reviews | Reviews of various businesses from Yelp.com for the academic challenge. |
| AmazonReview | 34,686,770 product reviews | Reviews from Amazon.com |
| SWMReview | 1,132,373 reviews | The SWM Review dataset contains reviews under the entertainment category from a popular online software marketplace. |
| BeerAdvocate | 1,586,259 beer reviews | Beer reviews from BeerAdvocate |
| RateBeer | 2,924,127 beer reviews | Beer reviews from RateBeer |
| CellarTracker | 2,025,995 wine reviews | Wine reviews from CellarTracker |
| FineFoods | 568,454 food reviews | Food reviews from Amazon |
| Movies | 7,911,684 movie reviews | Movie reviews from Amazon |
| AZSecure-data | Multiple datasets | Data Science Testbed for Security Researchers |
| CAIDA datasets | Multiple datasets | Collection and sharing site of data for scientific analysis of Internet traffic, topology, routing, performance, and security-related events. |
| DARPA intrusion detection | Multiple datasets | The Cyber Systems and Technology Group of MIT Lincoln Laboratory, under DARPA ITO and AFRL/SNHS sponsorship, has collected and distributed the first standard corpora of intrusion detection datasets. |
| KDDCUP99 | 4,900K connection records | The dataset includes a wide variety of intrusions simulated in a military network environment. |
| MAWI Working Group Traffic Archive | 2006 - present collection | A traffic data repository maintained by the MAWI Working Group of the WIDE Project, where traffic traces are collected at several sampling points every day. |
| MOME | Multiple datasets | Cluster of European Projects aimed at Monitoring and Measurement. |
| Waikato Internet Traffic Storage | Multiple datasets | The Waikato Internet Traffic Storage project aims to collect and document all the Internet traces that the WAND Group has in their possession. |
| RIPE | Multiple datasets (currently ~100TB) | The RIPE Data Repository is a collection of diverse datasets that are useful for scientific and operational Internet research. |
| The Internet Traffic Archive | Multiple datasets | The Internet Traffic Archive is a moderated repository to support widespread access to traces of Internet network traffic, sponsored by ACM SIGCOMM. |
| UMassTraceRepository | Multiple datasets | The UMass Trace Repository provides network, storage, and other traces to the research community for analysis. |
Crowded scene video data for anomaly detection
| Dataset | Size | Description |
|---|---|---|
| UCSD Anomaly Detection Dataset | 98 video clips | The UCSD anomaly detection annotated dataset was acquired with a stationary camera mounted at an elevation, overlooking pedestrian walkways. |
| University of Minnesota crowd activity datasets | Multiple datasets | Data for monitoring human activity by University of Minnesota. |
| Anomalous Behavior Data Set | Multiple datasets | Datasets for anomalous behavior detection in videos. |
| Virat video dataset | ~8.5 hours of videos | Video surveillance data for human activity/event detection. |
| McGill University Dominant and Rare Event Detection Data | 3 video clips (43, 96 mins) | Video surveillance data for dominant and rare event detection, captured by cameras in a subway station. |
