Semantic MediaWikis

Introduction: 2012 Semantic MediaWikis Summer Internship

This document describes the activities of three student interns who worked on the Semantic Mediawikis project in the summer of 2012 at the USC Information Sciences Institute (ISI). The interns were:

  • Angela Knight
  • Larry Zhang
  • Kevin Zhang

They were aided by Varun Ratnakar and mentored by Yolanda Gil, both of whom work at USC ISI. Over a period of several days, the interns:

  • created one-page resumes
  • discovered Semantic MediaWikis
  • analyzed various examples of Semantic MediaWikis

The rest of this document contains their report describing these activities and their findings on a day-to-day basis.

Definitions

General

Editing - Altering any kind of information related to the wiki.

Users - People who access the wiki, whether to edit it or simply to access the information stored on the pages.

Viewers - People who simply access information stored on the pages in the wiki without editing or contributing.

Editors - People who edit the wiki and contribute in some form or another.

Project-Relevant Terms

Ontology - A method of capturing information about a group of objects, or "individuals". It does so by describing the categories that each individual falls under ("classes") and by attaching information to these classes and individuals using "properties". OWL ontologies are one example. Ontologies form the basic structure of Semantic MediaWikis.

Semantic MediaWikis vs Regular MediaWikis - Semantic MediaWikis allow for listing of all objects and individuals associated with a certain property. The difference between these and regular MediaWikis is described in detail in Day 2: Create a report to describe Semantic Mediawikis. Examples may be found here: Semantic MediaWiki Examples

  • Warning: many of the so-called Semantic MediaWiki examples are in fact merely MediaWikis. Confirmation of a site as a Semantic MediaWiki may be found at each site's main page in the lower right hand corner in the form of a badge saying "Powered by Semantic MediaWiki".
  • Note: Semantic MediaWikis can be abbreviated as SMW's.

Spider - A type of program that allows users to download the entire content of a wiki. This was used by Varun Ratnakar to download data on various SMW's that the interns then analyzed.

Infoboxes - Small boxes of information about a certain object, almost always accompanied by a picture of the object. The information in these is stored in the form of properties, which is what makes them especially significant in Semantic MediaWikis.

Properties - Variables attached to objects/individuals which have some kind of significance, e.g. price.

  • Structured properties - The backbone of Semantic MediaWikis. These are properties that may be processed by machines and integrated, especially variables that represent numerical values. Semantic MediaWikis use these to link several different pages, e.g. by price value, size, etc. of a group of objects all falling under the category specified by the property.
  • Unstructured properties - Properties that are attached to an object but cannot be fed into a machine without manual extraction, e.g. a value stated in a sentence within a page.

________ bot - A "user" that automatically edits both the wiki's properties and pages. The blank in the definition is filled by the wiki's name. This bot appears in metadata when it edits anything, just like a regular user.

The Semantic Web - Described in minor detail in the Historical Background section of Day 2: Create a report to describe Semantic MediaWikis.

Superuser - A user who does the majority of the editing of properties and pages on the wiki to which the user belongs.

Day 1: Learning About Semantic Mediawikis

Goals

  1. [x] Create one-page resumes
  2. [x] Learn what ISI is
  3. [x] Learn about Semantic Mediawikis by reading the tutorial on Mediawikis
  4. [x] Begin to work on a report to describe Semantic Mediawikis

Accomplishments

  1. Completed individually before we collaborated to give each other advice. These resumes were submitted for review to Yolanda Gil and approved.
  2. We accomplished this with the help of Yolanda Gil, the professor who gave us the opportunity for this internship. The Information Sciences Institute is a branch of USC's Viterbi School of Engineering that supports many projects in computer science and artificial intelligence. Researchers at ISI developed the Domain Name System and played a large part in the Internet's addressing system. Today, ISI employs roughly 350 researchers who focus on computer science and artificial intelligence research, sometimes work with the government, and also teach students.
  3. We were able to discover Semantic Mediawikis by reading about them.
  4. We learned the difference between Semantic Mediawikis and ordinary Mediawikis: Semantic Mediawikis tag articles with categories and properties, which allows users to search and analyze those articles more quickly and easily. In short, Semantic Mediawikis are similar to searchable databases, while ordinary Mediawikis are mere archives of articles. Time did not allow us to create a full report on them, so we will finish it tomorrow.

Day 2: Create a report to describe Semantic Mediawikis

Goals

  1. [x] Finish the report to describe Semantic Mediawikis
  2. [x] Go through some of the OWL modelling tutorial at http://owl.cs.manchester.ac.uk/tutorials/protegeowltutorial/resources/ProtegeOWLTutorialP4_v1_3.pdf

Accomplishments

  1. We were able to complete the Report on Semantic Mediawikis (see below).
  2. The tutorial was extremely long, so time did not permit us to completely finish reading it and going through it. We did as much as we could.

Report on Semantic Mediawikis

Semantic Mediawikis are a branch of Mediawikis. They differ from normal Mediawikis in that they use tags and pointers to link topics and to tabulate data as properties and values on a single page, rather than leaving it in blocks of text scattered across multiple pages. These links and pointers can then be used to categorize and analyze data by specific keywords. For example, instead of searching an entire database of articles for details that cannot normally be "searched" for, such as the latitude of a geographical feature, an SMW can be used to trawl the database for all stored latitudes. Semantic Mediawikis are especially helpful because they offer automatically generated lists, improved data structures such as semantic templates, easier searching (users can create their own queries), consistency between articles in different languages, easily exportable data, semantic data that integrates readily with external data, and better displays of visual information such as calendars, timelines, graphs, and maps instead of plain lists. Additionally, not every piece of information has to be fully specific, since properties and categories may have subcategories.

One of the most important properties of Semantic Mediawikis is that they may be shared using the Semantic Web, an extension of the World Wide Web. All information on this Web is given explicit meaning; in other words, the data is not stored in the form of verbal expressions but as properties and values. Processing and integrating the data is then facilitated by the fact that users do not have to go through blocks of text looking for information or jump from site to site to obtain the data, but rather may download it from one source all at once. A simple way of putting this is that Semantic Mediawikis convert unstructured data into metadata and relationships that machines can process in response to a user query.

We were unable to show Yugi Muto's entire infobox due to its extensive detail
This infobox from the Peace Corps wiki shows how sparse the uploaded personal information about specific people is

Semantic mediawikis can differ greatly from one another. We have picked two examples, the Yu-Gi-Oh! Wiki and the Peace Corps Wiki, to demonstrate such differences. The latter is based on an official organization that sends volunteers to aid people in need, while the Yu-Gi-Oh! wiki collects information about the TV show and card game. The Peace Corps wiki has many instructional articles such as Advice for Applicants, whereas the Yu-Gi-Oh! wiki has more pages describing characters like Yugi Muto. In every wiki article there is an "infobox", where basic properties are shown for easy viewing. In comparing infoboxes, we found that the Yu-Gi-Oh! wiki has more personal information about each character, such as age, weight, and height, even down to details like favorite foods. The Peace Corps wiki includes less personal information and more about what the person has accomplished by working with others on projects that benefit the organization, as in the article on Chuck Ludlam.


Running a semantic mediawiki requires installing the software bundle from semantic-mediawiki.org. Data created by SMW users can be shared online, regardless of whether or not viewers have installed SMW themselves. Extensions are listed at SMW Extensions; these extensions allow SMW users to add semantic capabilities to the framework of a mediawiki. For example, Semantic Drilldown, written by Yaron Koren, allows users to view categories and data in a parent-child hierarchy. Categories contain subcategories, and clicking on any filter allows users to "drill down" through the data until they have retrieved the desired articles. These extensions add much more functionality to mediawikis, allowing users to browse data more easily and giving editors ways to add and display more data.


Different Articles, Same Property
Structured properties are being used outside of ISI


The main benefit of semantic mediawikis is that they organize data efficiently with structured properties, which allow people to organize and compare data from different articles. In many online wikis, information is tabulated in boxes called "infoboxes". Often, however, the data in these infoboxes is hard to search for and use; semantic mediawikis aid in such searches by arranging these data into tables that users can look up at their leisure. Instead of going article by article to collect a property such as latitude, as you would on Wikipedia, semantic mediawikis can organize information by property rather than by the subject of the article. For example, the article on Lake Mendota has many structured properties. If you want to compare a piece of data that applies to Lake Mendota with other lakes, all you have to do is click on the name of that structured property, and it will take you to a page focused on that specific property that lists its values from many different articles. This makes it easier to compare data and relate different subjects to one another, which would be very useful in, say, a medical setting where one wants to compare the symptoms (properties) of different patients (articles) to see if any are correlated.
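Because structured property values are machine-readable, they can also be retrieved programmatically. Below is a minimal sketch of such a lookup in Java, assuming a wiki that exposes Semantic MediaWiki's "ask" API module through api.php; the wiki address, category, and property names here are hypothetical placeholders rather than a site we actually analyzed.

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;
    
    public class AskQueryDemo
    {
        public static void main(String[] args) throws Exception
        {
            // Hypothetical wiki address, category, and property; any SMW exposing the "ask" API module should behave similarly.
            String ask = "[[Category:Lakes]]|?Has latitude|limit=50";
            String url = "https://smw.example.org/api.php?action=ask&format=json&query="
                         + URLEncoder.encode(ask, StandardCharsets.UTF_8);
            
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            
            // The JSON response lists every matching page together with its "Has latitude" value,
            // i.e. one structured property collected across many articles in a single request.
            System.out.println(response.body());
        }
    }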

Historical Background

The Semantic Web - Originally called the Semantic Network Model, the idea behind the Semantic Web first appeared in publications of the 1960s by the cognitive scientist Allan M. Collins, the psychologist Elizabeth F. Loftus, and the linguist M. Ross Quillian; it was only called the Semantic Web after Tim Berners-Lee, the British scientist who invented the World Wide Web, coined the term. The idea was brought to life as a communal movement started by the World Wide Web Consortium (W3C). Tim Berners-Lee defined the Semantic Web as "a web of data that can be processed directly and indirectly by machines".

Semantic Mediawikis - These are the backbone and building blocks of the Semantic Web. The Semantic MediaWiki software was created in 2005 by Markus Krötzsch, Denny Vrandečić, and Max Völkel; its development was funded by the European Union as part of the Sixth Framework Programme for Research and Technological Development.

Day 3: Understand Changes to Wikis and Their Ontological Properties

Definitions

Users - People who access the wiki, whether to edit it or simply to access the information stored on the pages.

Viewers - People who simply access information stored on the pages in the wiki without editing or contributing.

Editors - People who edit the wiki and contribute in some form or another.

Properties - Variables attached to objects/individuals which have some kind of significance, e.g. price.

Structured properties - Properties that may be processed by machines and integrated, especially numerical values.

Unstructured properties - Properties that are attached to an object but are impossible to input into a machine, e.g. a value stated in an English sentence within a page.

Goals

  1. [x] Download and understand data from wikis such as the FoodFinds, Beachapedia, and Biodiversity of India wikis.
  2. [x] Create a parser to analyze metadata from the wikis and create databases of usernames, timestamps of edits, and how many times a given page has been edited.
  3. [x] Determine a method to answer the following questions about wikis' users and ontological properties:


Questions

  1. Who are the most active users and how do they use the properties?
  2. When are these properties created/modified by users, in relation to the users' overall use of the website and their first involvement with it?
  3. How many people created the structural properties? How many people simply use the structural properties? (# structural property editors vs. viewers)
  4. How often are specific structured properties used?
  5. How do properties' rate/frequency of usage match user activity over time in general?

Accomplishments

  1. Varun Ratnakar gave us the initial data for the wikis listed above.
  2. Kevin created a Java program that decomposes the metadata into the editors and how many times they have edited the site's pages. He is currently working on the timestamps and edits-per-page databases.
  3. Larry and Angela delegated the questions among themselves and wrote possible plans of attack that they could use to answer the questions with the help of the databases that Kevin collected with his parsing program.

Questions

  1. Completed. -Larry
  2. Requires data on users' access to pages/properties over time to complete. -Larry


Possible Methods to Create a Parsing Program

Our main goal is to create databases of usernames, timestamps of edits, and how many times a given page has been edited.

Without a cache of all the pages on the wiki since they were created, we cannot look at the frequency of usage over time, which prevents us from answering questions 3 and 5.

The first line of the meta file contains the number and the name of the restaurant. Below it is a list of edits that have been made by users, sorted by the overall edit #, followed by the username or IP address and then the timestamp.

The Meta

To gather timestamps, we can have the parser search for years (the earliest being 2008). Once it finds one, we have the year, month, and day in the same group of characters. To collect the smaller part of the timestamp (hours, minutes, seconds), the parser must take the rest of the line after the date, since there is a space within the timestamp. Having the timestamps in a database would help us answer questions 3 and 5.

To gather the usernames of the edits that have been made, the parser can read each line as a String and split it on its delimiters, discarding everything except the usernames. By doing this, we can create a database of usernames. The frequency of each username will allow us to determine who the most active users are. This helps answer questions 1 and 4.
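As a rough illustration of the plan above (not the code we actually ran), the sketch below reads the tab-separated meta file and collects an edit count and a list of timestamps per username in maps, so nothing has to be hard-coded about the number of edits. It assumes the layout described above, with edit lines starting with a tab.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    
    public class MetaSummary
    {
        public static void main(String[] args) throws IOException
        {
            Map<String, Integer> editsPerUser = new HashMap<>();
            Map<String, List<String>> timestampsPerUser = new HashMap<>();
            
            try (BufferedReader br = new BufferedReader(new FileReader("meta")))
            {
                String line;
                while ((line = br.readLine()) != null)
                {
                    String[] tokens = line.split("\t");
                    // Edit lines start with a tab, so tokens[0] is empty:
                    // <blank> <edit #> <username or IP> <timestamp>
                    if (tokens.length == 4 && tokens[0].isEmpty())
                    {
                        editsPerUser.merge(tokens[2], 1, Integer::sum);
                        timestampsPerUser.computeIfAbsent(tokens[2], k -> new ArrayList<>()).add(tokens[3]);
                    }
                }
            }
            
            editsPerUser.forEach((user, count) ->
                System.out.println(user + " edited the site " + count + " times."));
        }
    }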

Code Written - Parser

Class 1 - ParserForMetaTxt

    import java.io.*;
    import java.util.*;
    
    public class ParserForMetaTxt
    {
        public static String[][] Parser (String file) throws IOException
        {
            BufferedReader br = new BufferedReader(new FileReader(file));
            /* Initialize file reader. */
            
            String strline1;
            String[][] editArray = new String[1292][3];
            /* Compile information into a table of three columns.
             * Rows signify individual edits (1292 is the total number of edits in meta.txt).
             * Column 1 = page/edit #, column 2 = editor name or IP, column 3 = timestamp.
             */
            int counter = 0;
            /* Counter denoting the index of the current edit in the array. */
            String[] tempArray;
            /* Temporary array to hold the fields of one edit line. */
            
            while ((strline1 = br.readLine()) != null)
            {
                /* Edit lines begin with a tab rather than a digit;
                 * lines beginning with a digit are page headers. */
                if (!strline1.isEmpty() && !Character.isDigit(strline1.charAt(0)))
                {
                    tempArray = strline1.split("\t", 4);
                    /* Split the line on tabs: <blank> <edit #> <editor> <timestamp>. */
                    if (tempArray.length == 4)
                    {
                        for (int i = 1; i < 4; i++)
                        {
                            editArray[counter][i-1] = tempArray[i];
                        }
                        /* Assign data from tempArray to the two-dimensional array editArray. */
                        counter++;
                        /* Note that another edit has been added to the array. */
                    }
                }
            }
            br.close();
            
            return editArray;
        }
    }

Class 2 - Editors

    import java.util.*;
    import java.io.*;
    
    public class Editors extends ParserForMetaTxt
    {
        public static void main (String[] args) throws IOException
        {
            try
            {
                String[][] DataArray = Parser("meta");
                //Import data.
                List<String> editors = new ArrayList<String>();
                
                for (int i = 0; i < 1292; i++)
                {
                    if (!editors.contains(DataArray[i][1]))
                    {
                        editors.add(DataArray[i][1]);
                    }
                }
                //Create list of unique editors.
                
                Collections.sort(editors);
                //Alphabetize list of editors.
                
                int[] editInstances = new int[editors.size()];
                //Create array to tabulate the number of times each editor has edited the site.
                
                for (int i = 0; i < 1292; i++)
                {
                    String temp = DataArray[i][1];
                    int number = editors.indexOf(temp);
                    editInstances[number]++;
                    //Add one to the count at the editor's index.
                }
                for (int i = 0; i < editors.size(); i++)
                {
                    System.out.println(editors.get(i) + " edited the site " + editInstances[i] + " times.");
                }
            } catch (Exception e)
            {
                e.printStackTrace();
            }
        }
    }

Class 3 - Pages

    import java.util.*;
    import java.io.*;
    
    public class Pages extends ParserForMetaTxt
    {
        public static void main (String[] args) throws IOException
        {
            try
            {
                String[][] DataArray = Parser("meta");
                //Import data.
                List<String> pages = new ArrayList<String>();
                
                for (int i = 0; i < 1292; i++)
                {
                    if (!pages.contains(DataArray[i][0]))
                    {
                        pages.add(DataArray[i][0]);
                    }
                }
                //Create list of unique pages.
                
                pages.sort(Comparator.comparingInt(Integer::parseInt));
                //Sort the list of pages numerically.
                
                int[] editInstances = new int[pages.size()];
                //Create array to tabulate the number of times each page has been edited.
                
                for (int i = 0; i < 1292; i++)
                {
                    String temp = DataArray[i][0];
                    int number = pages.indexOf(temp);
                    editInstances[number]++;
                    //Add one to the count at the page's index.
                }
                for (int i = 0; i < pages.size(); i++)
                {
                    System.out.println("Page " + pages.get(i) + " was edited " + editInstances[i] + " times.");
                }
            } catch (Exception e)
            {
                e.printStackTrace();
            }
        }
    }
  • Note: the constant 1292 is the total number of edits recorded in the meta.txt file; the array size in the program was hard-coded to this value.
  • The "meta" file needed to be saved in the same folder as the Java classes, with no file extension.

Day 4: Redirect Focus and Continue Work On Questions

A roadblock appeared on Day 3: it proved impossible to retrieve chronological information about the number of users who have accessed each page. Though there is a distant possibility of accessing caches of each page over time, the work involved would be enormous, requiring days or weeks, especially for the larger wikis. Therefore, the focus of the research was readjusted, as recorded below.

Goals

  1. [x] Readjust focus of questions to account for irretrievable information.
  2. [x] Answer the readjusted questions.


Rewritten Questions

  1. Who are the most active editors?
  2. How often are specific structured properties created/edited?
  3. How many people created the structural properties?
  4. How are these properties created/modified by users, in relation to the users' editing of the website?


Accomplishments

  1. We managed to come up with new questions that we can answer successfully.
  2. Angela answered question #1 and Larry answered question #3. Question #2 is assigned to Angela and question #4 to Larry for completion on a later date.

Questions

Answer: #1 // Who are the most active editors?

With the help of a parser program that Kevin wrote, I was able to go through the meta document, which holds the names and id numbers of articles, the users who have edited them along with the number of each edit, and the timestamps of the edits. Kevin was able to strip out everything except the usernames, which allowed me to find the most active editors by seeing which names came up most frequently. By tabulating each username with the number of times it occurred, I was able to create the graphs needed to answer question 1 for both FoodFinds and Beachapedia, since I applied the parser to each of their meta documents. Some editors are anonymous and were added to the database under their IP addresses.
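For reference, here is a small sketch of how the most active editors could be ranked once per-editor counts are available (for example, from a parser like the one shown on Day 3); the class and method names are illustrative only, not part of the code we ran.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    
    public class TopEditors
    {
        // Prints the n most active editors from a map of editor name -> edit count.
        // The counts could come from a parser such as the one shown on Day 3.
        public static void printTopEditors(Map<String, Integer> editsPerUser, int n)
        {
            List<Map.Entry<String, Integer>> ranked = new ArrayList<>(editsPerUser.entrySet());
            ranked.sort((a, b) -> Integer.compare(b.getValue(), a.getValue())); // most edits first
            for (int i = 0; i < Math.min(n, ranked.size()); i++)
            {
                Map.Entry<String, Integer> e = ranked.get(i);
                System.out.println((i + 1) + ". " + e.getKey() + " - " + e.getValue() + " edits");
            }
        }
    }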

User Activity in Editing Documents on FoodFinds
User Activity in Editing Documents on Beachapedia

FoodFinds.referata.com: Since FoodFinds has a large number of article edits, I was only able to include the top 20 most frequent editors. Most of the edits have been made by HungryG, who reached the milestone of 1100 edits, while the second most frequent editor racks up only 57. Also, 75% of FoodFinds' editors are anonymous, represented on the user activity graph by their IP addresses. The pie chart was created to demonstrate how large the group of anonymous editors on FoodFinds is. However, most of them have only edited once or twice, so HungryG has edited more times than all of them put together. HungryG is a more vital editor to the website than the anonymous people who visit it.

Beachapedia.org: Although the graph shows that FoodFinds has more edits overall than Beachapedia, most of those edits were made by HungryG. Beachapedia's edits are spread out among more users, with fewer anonymous editors than FoodFinds. Out of 19 editors in total, only one is anonymous.


Conclusion: With the help of these graphs and the analysis above, it is highly probable that Beachapedia is a more tight-knit online community (in terms of editors) than FoodFinds. Although there are fewer editors on Beachapedia than on FoodFinds, their edit counts are much closer to one another. The edits on FoodFinds seem to be mainly the work of one person plus a large number of anonymous users. It seems that on Beachapedia the editors can communicate with one another and correct each other's mistakes, while it is much harder for the single main editor on FoodFinds to catch their own mistakes.

-Angela

Answer: #3 // How many people created the structural properties?

FoodFinds.referata.com: It seems that the majority of properties were created and are edited by two users, HG and Hungry G. Moreover, apart from their initial creation and some early editing, the properties are largely untouched, except where Hungry G added a "form" component to allow other users to edit the properties more easily or created pages (instead of simple strings or numerical values) for a property. The only other user to have modified a property is Robin Patterson (her username). However, that change seems to have been simply a comment or a minor change to the property's function; Hungry G changed it back a short time later.

Because of the similarity of the two usernames and the timing of the edits, HG and Hungry G seem to have been two accounts corresponding to the same person, though this is impossible to confirm as the user HG has been deleted from the website.

Beachapedia.org: Similar to foodfinds.referata.com, all the properties were created and edited by two users. However, the vast majority seem to have been the work of the user "Gwsuperfan", as only 4 of the 26 properties were created by the other user, "Akozak". Moreover, Akozak made 3 of his 4 properties in one field, language, of which two were already equivalent to one introduced by Gwsuperfan; the third merely identified the language used by a page. The fourth of his properties simply marked whether a category had any members. Therefore, almost all of the main work seems to have been done by Gwsuperfan. Furthermore, all of Gwsuperfan's changes slightly improve or fix errors in his own properties, the most common change being to the datatype he chose.

biodiversityofindia.org: Interestingly, in this case almost all work was carried out by a single user who goes by two user names, Gauravm and Indlad11. Only one other user altered the properties: Shwetankverma, who changed only the "class" property and whose change was overridden less than a minute later by Gauravm. In all other cases, only Gauravm/Indlad11 created or modified properties. However, contrary to the findings in the other two wikis, the properties were created over a large stretch of time during 2010-2011, although a large chunk of the earliest ones were created at about the same time, as expected. Some more research into the site gave us the following information:

The biodiversityofindia.org website was originally conceived as the Brahma project by several scientists, including Michigan State University PhD student Gaurav Moghe and Indian Institute of Science PhD student Shwetank Verma. Gauravm obviously corresponds to Gaurav Moghe, and Shwetankverma to Shwetank Verma. Indlad11, after a bit of research, was also revealed to be Gaurav Moghe (shown at http://www.biodiversityofindia.org/index.php?title=User:Indlad11).

HYPOTHESIS: For all of the Semantic Mediawikis that we have examined so far, it appears that the vast majority of property creation and editing is carried out by an extremely small group of people, usually just a single person. Furthermore, most of the properties are modified only within a few days of their creation, and a large portion are created at a single point in time. Interestingly, that interval also tends to be the first point at which any properties are created, which makes sense, seeing as the user who created the Semantic Mediawiki would have to come up with a few initial properties. Overall, unfortunately, this implies one thing: for most wikis, the only users who actually edit the properties are the creators themselves. One explanation is that the creators are the only people allowed to edit the ontology of the Semantic Mediawiki and alter or add properties. However, on many sites, such as FoodFinds.referata.com, this is simply not the case, which indicates either a lack of awareness of the properties' usefulness (or of the option to create them), or an unwillingness to contribute to the website's structure rather than just its pages.


A future possibility for Semantic Mediawikis is to allow users in general to submit new or edited properties to the admins, who can then approve them. This would allow for greater user contribution to the website structure itself instead of simply the pages on it; it would also allow the public at large, or the targeted domain of users, to make suggestions that help make the site easier to use and more efficient.

-Larry

Implementation of Varun Ratnakar's Parser

    import java.util.*;
    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Date;
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    
    import edu.isi.smw.PageSemanticData;
    import edu.isi.smw.Property;
    import edu.isi.smw.SMWPageParser;
    import edu.isi.wiki.EventChain;
    import edu.isi.wiki.PageVersion;
    import edu.isi.wiki.WikiEvent;
    import edu.isi.wiki.WikiPage;
    
    public class WikiParser {
        String wikiDir;
        File metaFile;
        File propMetaFile;
        
        int NUMBER_OF_PAGES_TO_PARSE = -1; // Set to -1 to parse all pages, or to a small positive number (e.g. 100) for testing
        
        /**
         * @param args The wiki directory
         * @throws IOException 
         * @throws ParseException 
         */
        public static void main(String[] args) throws IOException, ParseException {
            Scanner keyboard = new Scanner (System.in);
            System.out.println("Directory: ");
            String wikiDirectory = keyboard.nextLine();
            args = new String[1];
            args[0] = wikiDirectory;
            //These lines were added to Varun's original program to allow users to specify the specific database to analyze.
            
            if(args.length < 1) {
                System.err.println("No arguments given. Need to give the wiki folder as an argument");
                System.exit(1);
            }
            String wikiDir = args[0];
            WikiParser parser = new WikiParser(wikiDir);
            
            EventChain chain = parser.parse();
            /*for(WikiEvent ev: chain.getEvents()) {
                Date timestamp = ev.getTimestamp();
                WikiPage page = ev.getPage();
                PageVersion version = ev.getVersion();
                // Need a Semantic Media Wiki parser for the Wiki Content
                // System.out.println(version.getContent());
            }*/
            System.out.println(chain.getEvents());
        }
        
        public WikiParser(String dir) {
            this.wikiDir = dir;
            this.metaFile = new File(dir+"/meta.txt");
            this.propMetaFile = new File(dir+"/prop_meta.txt");
        }
        
        public EventChain parse() throws IOException, ParseException {
            EventChain chain = new EventChain();
            SMWPageParser pageParser = new SMWPageParser();
            /**
             * Read the property metadata file
             */ 
            String line;
            Property curprop = null;
            BufferedReader reader = new BufferedReader(new FileReader(this.propMetaFile.getAbsolutePath()));
            while ((line = reader.readLine()) != null) {
                // Split the line by tabs
                String[] tokens = line.split("\t");
                // If there are 2 tokens, then it is the <propid> <propname> line
                if(tokens.length == 2 && !tokens[0].equals("")) {
                    int propid = Integer.parseInt(tokens[0]);
                    String propname = tokens[1];
                    if(curprop != null) {
                        pageParser.addKnownProperty(curprop);
                    }
                    curprop = new Property(propid, propname);
                }
                // If there are 4 tokens, then it is the <space> <versionid> <author> <timestamp> line
                else if(tokens.length == 4 && curprop != null) {
                    int versionid = Integer.parseInt(tokens[1]);
                    String author = tokens[2];
                    // Parse date using a date formatter
                    Date timestamp = (new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")).parse(tokens[3]);
                    // Read content of the wiki file itself (i.e. <pageid>_<versionid>)
                    String content = readFile(this.wikiDir+"/prop_"+curprop.getId()+"_"+versionid);
                    PageSemanticData data = pageParser.parse(content);
                    ArrayList<String> types = data.getPropertyValue("Has name");
                    if(curprop != null && types != null)
                        for(String type: types)
                            curprop.setType(type);
                    
                    // Create a new PageVersion object
                    PageVersion version = new PageVersion(curprop.getId(), versionid, author, timestamp, data);
                    // Create the WikiEvent object
                    WikiEvent event = new WikiEvent(timestamp, author, curprop, version);
                    // Add event to the event chain
                    chain.addEvent(event);
                }
            }
            reader.close();
            
            /**
             * Read the main metadata file
             */
            int i=0;
            WikiPage curpage = null;
            reader = new BufferedReader(new FileReader(this.metaFile.getAbsolutePath()));
            while ((line = reader.readLine()) != null) {
                if(NUMBER_OF_PAGES_TO_PARSE > 0 && i > NUMBER_OF_PAGES_TO_PARSE) break;
                // Split the line by tabs
                String[] tokens = line.split("\t");
                // If there are 2 tokens, then it is the <pageid> <pagename> line
                if(tokens.length == 2 && !tokens[0].equals("")) {
                    i++;
                    int pageid = Integer.parseInt(tokens[0]);
                    String pagename = tokens[1];
                    curpage = new WikiPage(pageid, pagename);
                }
                // If there are 4 tokens, then it is the <space> <versionid> <author> <timestamp> line
                else if(tokens.length == 4 && curpage != null) {
                    int versionid = Integer.parseInt(tokens[1]);
                    String author = tokens[2];
                    // Parse date using a date formatter
                    Date timestamp = (new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")).parse(tokens[3]);
                    // Read content of the wiki file itself (i.e. <pageid>_<versionid>)
                    String content = readFile(this.wikiDir+"/"+curpage.getId()+"_"+versionid);
                    PageSemanticData data = pageParser.parse(content);
                    // Create a new PageVersion object
                    PageVersion version = new PageVersion(curpage.getId(), versionid, author, timestamp, data);
                    // Create the WikiEvent object
                    WikiEvent event = new WikiEvent(timestamp, author, curpage, version);
                    // Add event to the event chain
                    chain.addEvent(event);
                }
            }
            reader.close();
            return chain;
        }
        
        private String readFile( String file ) throws IOException {
            BufferedReader reader = new BufferedReader( new FileReader (file));
            String         line = null;
            StringBuilder  stringBuilder = new StringBuilder();
            String         ls = System.getProperty("line.separator");
    
            while( ( line = reader.readLine() ) != null ) {
                stringBuilder.append( line );
                stringBuilder.append( ls );
            }
    
            return stringBuilder.toString();
        }
    }
  • By far, the majority of the code belongs to Varun. Six lines were added to allow users to run the code on various databases; implementing it involved adding the custom edu.isi packages to a project in the BlueJ IDE, and a short sketch of how the resulting chain could be used follows the example below. The code builds an object called an EventChain, which stores data in the format

TIME: (DAYOFTHEWEEK) (MONTH) (DAYOFTHEMONTH) (HH:MM:SS) (TIMEZONE) (YEAR), PAGE:(TYPE):(TYPE), VERSION:(VERSION), AUTHOR:(AUTHOR).

  • Example: Time: Mon Mar 17 00:56:08 PDT 2008, Page:Home, Version:8, Author:HG
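As a small illustration of how the resulting EventChain might be put to use for the time-related questions, the sketch below tallies events per calendar year. It relies only on the getEvents() and getTimestamp() calls that already appear in the parser above, and is an assumption-laden sketch rather than code we actually ran.

    import java.util.Calendar;
    import java.util.HashMap;
    import java.util.Map;
    
    import edu.isi.wiki.EventChain;
    import edu.isi.wiki.WikiEvent;
    
    public class EditsPerYear
    {
        // Tallies how many events in an EventChain fall in each calendar year.
        // Uses only the getEvents() and getTimestamp() calls already seen in the parser above.
        public static Map<Integer, Integer> countByYear(EventChain chain)
        {
            Map<Integer, Integer> editsPerYear = new HashMap<>();
            Calendar cal = Calendar.getInstance();
            for (WikiEvent ev : chain.getEvents())
            {
                cal.setTime(ev.getTimestamp());
                editsPerYear.merge(cal.get(Calendar.YEAR), 1, Integer::sum);
            }
            return editsPerYear;
        }
    }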

Day 5/6: Expanding the Dataset

Now that we've analyzed our three wikis, it's time to expand to other wikis and begin to compare these wikis to the ones that we've already analyzed and check to see if the pattern we've established continues. We also need to keep an eye out for anything else that looks interesting.

Goals

  1. [x] Examine the list of Semantic MediaWiki examples and choose 10 to work on (each).
  2. [x] Send to Varun Ratnakar in order to obtain Wiki data
  3. [x] Continue work on questions 2/4

Accomplishments

  1. Went through list of Semantic MediaWikis and chose 10 each.
  2. Sent to Varun Ratnakar for Wiki data.
  3. We worked on the questions, though we didn't manage to complete them.


Semantic MediaWikis vs MediaWikis

We found that the list of Semantic MediaWiki examples we used to select our sites contains many sites that are simply MediaWikis and not Semantic MediaWikis. Half of the sites we had chosen were not SMW's but merely MediaWikis, so we were forced to omit them and reduce our lists of data.

List of links (wikis) to be investigated further:

Larry


  1. http://artwiki.org/ArtWiki
  2. http://www.deepskypedia.com/
  3. https://wiki.mozilla.org/Main_Page
  4. http://rosettacode.org/wiki/Welcome_to_Rosetta_Code
  5. http://www.wikidevi.com/wiki/Main_Page

not semantic mediawikis:

  1. http://animanga.wikia.com/wiki/Animanga_Wiki
  2. http://www.bioinformatics.org/wiki/Main_Page
  3. http://grey.colorado.edu/emergent
  4. http://fp0804.emu.ee/wiki/index.php/Main_Page
  5. http://en.wikiarquitectura.com/index.php?title=Main_Page

Kevin

  1. http://www.dexid.org/wiki/Dexid
  2. http://enipedia.tudelft.nl/
  3. http://navi.referata.com/
  4. http://protegewiki.stanford.edu/index.php/Main_Page
  5. http://scientolipedia.org/index.php?title=Main_Page

not semantic mediawikis:

  1. http://creationwiki.org/Main_Page
  2. http://www.crisiswiki.org/Main_Page
  3. http://neurolex.org/wiki/Main_Page
  4. http://souleater.wikia.com/wiki/Soul_Eater_Wiki
  5. http://www.wikijava.org/wiki/Main_Page

Angela

  1. http://roadsignmath.com/wiki/Welcome
  2. http://www.gardenology.org/wiki/Main_Page
  3. http://dnd-wiki.org/wiki/Main_Page
  4. http://www.stowiki.org/Main_Page
  5. http://farmafripedia.ikmemergent.net/index.php/Main_Page

not semantic mediawikis:

  1. http://www.archiplanet.org/wiki/Main_Page
  2. http://portlandwiki.org/PortlandWiki
  3. http://practicalplants.org/wiki/Arbutus_unedo
  4. http://lexingtoncompanies.com/wiki/index.php?title=Certificate_server
  5. http://stormravengaming.net/wiki/Main_Page

Questions (Continued)

Answer: #2 // How often are specific structured properties created/edited?

Although there is an incredibly large number of structured properties on both FoodFinds and Beachapedia, not all of them have been edited at one time or another. We received data for the number of edits that every property had received during its existence. Properties that had never been edited in any article on the wiki were omitted from the data tables and the graphs.

Property Edits on FoodFinds
Property Edits on Beachapedia

FoodFinds.referata.com: FoodFinds had thirteen properties that had been edited in total, with varying numbers of edits. It was interesting that PriceRange was the structured property that had been edited the most, perhaps because of the recent economic situation. It was edited more than five times as often as the second most frequently edited property, City. It seems strange that City would be changed so frequently, since restaurants do not move; on the other hand, it might have had to do with a chain of restaurants and someone adding multiple cities to one restaurant. Overall, the graph seemed unusual for structured-property edits on a mediawiki, since there were not that many edits.

Beachapedia.org: Although the highest number of edits on any one property on Beachapedia was lower than on FoodFinds, more properties had been edited on Beachapedia (26) than on FoodFinds (13). The graph of property edits for Beachapedia looks more like a typical graph of this type for a wiki: the edits are not sparse and add up to a respectable total. Not all the structured properties on the wiki have been edited, but a larger number of properties have been edited on Beachapedia than on FoodFinds.

Conclusion: FoodFinds has fewer edited properties, though it has a higher maximum edit count for a single property. Both have around the same overall number of property edits, though Beachapedia has slightly more, which makes sense since it is a larger website.


-Angela

Answer: #4 // How are these properties created/modified by users, in relation to the users' editing of the website?

foodfinds.referata.com: Like the edits with the properties, the edits with the pages themselves were dominated by the user Hungry G. Curiously enough, HG edited the pages only 4 times, compared to Hungry G's 1040. The user with the second highest number of edits was Kghbln with 57 edits. The total was 1292, which means that Hungry G made over 80% of the total edits. Out of the total 107 editors, Hungry G was the only one who contributed a large enough portion to be counted as even 5%, which, considering the size of the site, means that no one besides Hungry G spent a significant amount of time editing the site. This is mirrored in the properties' creation/editing count.

beachapedia.org: This wiki actually had a large number of contributing users. Of the 15479 total edits, at least 7 users contributed over 200 edits each. Of course, the majority of the edits was still made by a small group of users, but in this case it wasn't just one user who contributed, but several. The highest count, as expected from the trend, came from Gwsuperfan at 6545 edits; this is the same person who made a large majority of the edits to the properties. However, contrary to expectation, another user actually came close to Gwsuperfan in the number of edits: Rwilson, at 5046. The interesting thing is that Rwilson never edited or created any properties; the fact that he contributed so much to the site but not to the properties implies that Rwilson does not know how to edit the properties or is unaware of their true utility, even though the site gives users the ability to create/edit properties.

There seem to be multiple contributing editors of the site itself, compared to the nearly single editor of the properties themselves. This does not match the pattern on foodfinds.referata.com; however, Beachapedia.org is a larger site that is visited more frequently, which might have some bearing on the number of page editors versus property editors.

biodiversityofindia.org: This wiki matched the pattern established with the first wiki, foodfinds.referata.com. The page edits were mostly performed by a single user, Gauravm, who also dominated the property editing. Of the 4238 page edits, Gauravm made 3866 of them, 91.222%. Although this is a significantly lower percentage than the percentage of property edits made by Gauravm (99.722%), the overall meaning is still clear: a large majority of all the work is being done by Gauravm alone. The difference may be attributed to the fact that, again, most users never actually alter the site structure itself, but rather, contribute (if indeed they ever do so) to the pages. Therefore, they merely add data instead of improving upon the database structure ontology.

HYPOTHESIS:

In all three examples of Semantic MediaWikis that I examined, there is a recurring pattern of the same single user editing the majority of all properties and pages, a kind of "superuser". However, on the more visited, accessible, and attractive site, beachapedia.org, there is also a large number of page contributors, though the properties are still mostly edited by the superuser. The foodfinds.referata.com website is plainer and less visited, and the biodiversityofindia.org site is probably accessed mainly by scientists. Therefore, there also seems to be the following pattern among Semantic MediaWikis: users tend to contribute more to pages on a site that is attractive and accessible than on one that is plain (e.g. the FoodFinds website) or hard to understand (e.g. the biodiversityofindia website). It remains to be seen whether these patterns continue in the other SMW's that we have decided to examine as further examples. -Larry

Day 7: The Semantic MediaWiki Examples

Between our meetings on Thursday 8/9/12 and Wednesday 8/15/12, we were given the task of choosing additional SMW's (as recorded in the report for Day 5/6) and examining their data to look for more creative uses of properties and to determine whether the patterns already established continue in these SMW's. We were also told to keep an eye out for anything else interesting.

Goals:

  1. [x] Examine wikis and do the following:
    1. Look for other ways for properties to be put to use other than those in the wikis we have already examined.
    2. Determine if the patterns seen before continue in these SMW's.
    3. Keep an eye out for anything else that looks interesting.

Accomplishments

  1. We managed to go through most of the SMW's, though we did not quite finish.

Larry's SMW's

artwiki.org: Artwiki.org fits the pattern already established by beachapedia.org; that is, the site's properties were maintained almost exclusively by one user, the "superuser". The pages themselves were maintained and edited by a slew of users, though the superuser still made the majority of the edits.

I discovered something quite interesting here: wikis sometimes have "bots", i.e. "users" created by the administrator of the wiki that automatically edit the properties and pages of the site. In this case, ArtWiki bot was such a bot; in fact, it made the second-largest number of page edits at 5645, after the superuser Dusan's 10292. Additionally, it is the only other "user" to have edited properties besides Dusan, at 4 edits out of the total 76. As for the pages themselves, the real user with the next largest number of page edits is Anja Christine Rob, at 518 edits, just over one-twentieth of Dusan's.

The fact that the percentage of "actual edits" (that is, edits not made by a bot) made by Dusan is 10292/18531 = 55.539% for pages, compared to 100% for properties, can again most likely be attributed to the fact that Artwiki.org is a fairly well-visited and attractive site, compared to the less-visited foodfinds.referata.com and the semi-esoteric biodiversityofindia.org, which encourages users to actually contribute to the site. The fact that they can also edit properties is, again, most likely unknown to these users. Additionally, even users who do know that they may edit the properties may be dissuaded by the notion that they are editing the structure of the site itself, out of fear that they might damage the site in some way. Hopefully, this kind of unawareness can be erased in the future.

rosettacode.org: This wiki completely shattered both patterns established by the previous wikis. Out of the 99 property edits, the user with the largest number of them was Dkf at 38. Short Circuit followed at 19, with Colderjoe and Mwn3d not far behind at 14 and 13 respectively. The page edits, on the other hand, totaled 80114, of which Mwn3d made 3036, the most of any user. Not a single user contributed more than 3.8% of the page edits, and none more than 19.2% of the property edits; in short, there was no "superuser" at all. Furthermore, the biggest contributor to the properties was not the same person as the biggest contributor to the pages.

Perhaps the most interesting point was that the properties were not edited mostly by a single person; in fact, 7 real users (not bots) contributed. Additionally, the person who created the largest number of properties was not the founder of the website, Mike Mol (a.k.a. Short Circuit), but Donal Fellows (a.k.a. Dkf), and the largest contributor to the pages was Mike Neurohr (a.k.a. Mwn3d). This may be related to the statement Mike Mol made on his user page: "I try to leave the decisions on these things to the CS professionals, academics and hackers... In short, I just run the servers, try to keep things running smoothly, add whatever features I think of or that people ask for, if possible. I enable the process; I try not to control it", and to the fact that this wiki is about coding. Nearly every user therefore has at least some basic coding knowledge and can get more involved in the actual structure of the website, as users Dkf and Mwn3d obviously have. Additionally, there are fully 471 programming languages represented on the site, which requires a wide range of users to contribute in order to cover the scope of the website.

wiki.mozilla.org: This wiki contains 371009 page edits, of which 7213 (1.944%) were made by Jbecerra, the largest number of page edits by a single user. Similarly, there were 5905 property edits, of which 909 (15.394%) were made by Ctalbert, the largest number of property edits by a single user. This matches the pattern started by rosettacode.org; no "superuser" exists. In fact, the work was done by so many users that one may say no single user contributed a significant percentage of the overall whole. This is not to say that the users did not contribute a great deal, as the size of this wiki far dwarfs even the largest of the other wikis I have examined. Additionally, the user who edited the pages the most and the user who edited the properties the most are not even the same person.

Again, one of the most interesting points is that contributions to the properties of the site were not made by a single user, but rather a large conglomeration of several users who worked on different projects; this is probably because these users created properties to suit their own, specific projects, e.g. the Thunderbird properties, whereas some of the properties were used by many different users, e.g. the Test Case List property. Similarly, no single user could have edited a large majority of the articles on the wiki, as that would have entailed several lifetimes' worth of work.

deepskypedia.com: This wiki seems to fit pattern #1: that is, the large majority of edits to both pages and properties are made by a "superuser", in this case "Deepsky". Two bots, "Deepsky2" and "DeepskyBot", made a comparable number of edits, though they are not counted simply because they are not actual users but automated programs. The next registered user to contribute to the site's properties was Haley B.E. with 1 edit, and the next registered user to contribute to the site's pages was W with 1400 edits, compared to Deepsky's 33765. Of the total number of page edits (65538), Deepsky made 33765/65538 = 51.520%, and of the property edits (1394), Deepsky made 652/1394 = 46.772%. Though these numbers may seem rather small for pattern #1, remember that an almost equivalent portion of edits was made by bots, though the exact proportion is impossible to determine without going through every user and checking whether they are bots.

An interesting piece of information cropped up here in the form of a bot named "Deepsky2". Though it did not have "bot" appended to the end of its name, it was still a bot and performed automated edits. This showed that seemingly "real" users might actually be bots; in this case the fact was still fairly apparent from the name, since the account was named after the site itself. Looking at this SMW's user list, all five bots on the list are obviously bots, as they either have the site's name in their usernames or have the suffix "bot" attached. However, we should still keep an eye out for bots whose usernames do not mark them as such, as this could mean that some of the most active editors are actually bots.
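A rough, name-based heuristic like the sketch below could be used to flag likely bots while scanning editor lists, though, as noted above, checking the wiki's user list remains the more reliable test; this is only an illustrative sketch, not part of our analysis code.

    public class BotNameCheck
    {
        // Rough, name-based heuristic following the two patterns observed above.
        // The site's user list remains the more reliable test for the "bot" flag.
        public static boolean looksLikeBot(String username, String siteName)
        {
            String u = username.toLowerCase();
            return u.endsWith("bot") || u.contains(siteName.toLowerCase());
        }
    }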

wikidevi.com: This matches the first pattern as well; in fact, it is one of the clearest examples of a "superuser". All 178 property edits were made by the same user, M86. Additionally, of the 49977 page edits, 48407 (96.859%) were made by M86. The user "Dave" edited the wiki the next largest number of times, 658 (1.317%). Quite clearly, this site is maintained almost exclusively by the superuser M86.

Interestingly enough, it seems that none of the users on this wiki are bots. I guessed at this while looking at the list of users who had edited the website, and confirmed it by looking at the site's user list; none of them had the property "bot" associated with them. This means that M86 took on the entire burden of maintaining this website all by himself without using automated methods to fix the more minor errors in the pages, whether because he fixed them himself or didn't bother fixing them. The latter is more likely, given the sheer number of edits he has made.

Kevin's SMW's

Angela's SMW's

roadsignmath.com:

User Activity on Road Sign Math
Property Edits on Road Sign Math

gardenology.org:

User Activity on Gardenology
Property Edits on Gardenology

dnd-wiki.org:

User Activity on Dungeons & Dragons Wiki
Property Edits on Dungeons & Dragons Wiki

www.stowiki.org:

User Activity on Star Trek Online Wiki
Property Edits on Star Trek Online Wiki

farmafripedia.ikmemergent.net:

User Activity on Farmafripedia
Property Edits on Farmafripedia


Day 8: September 30, 2012

Paper Draft

As a Google Doc

Data for the 19 wikis

All data as 1 file
