<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Tagide</title>
	<atom:link href="http://tagide.com/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://tagide.com/blog</link>
	<description>Software and Musings</description>
	<lastBuildDate>Sun, 06 May 2012 23:01:40 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Simulating a City</title>
		<link>http://tagide.com/blog/2012/05/simulating-a-city/</link>
		<comments>http://tagide.com/blog/2012/05/simulating-a-city/#comments</comments>
		<pubDate>Sun, 06 May 2012 23:01:40 +0000</pubDate>
		<dc:creator>crista</dc:creator>
				<category><![CDATA[simulation]]></category>
		<category><![CDATA[social software systems]]></category>
		<category><![CDATA[Encitra]]></category>
		<category><![CDATA[OpenSim]]></category>

		<guid isPermaLink="false">http://tagide.com/blog/?p=928</guid>
		<description><![CDATA[For the past 4 years or so, in my spare time, I have been working with a small start-up company, Encitra, whose goal is to help cities and real estate developers make sustainable urban plans come to life in the &#8230; <a href="http://tagide.com/blog/2012/05/simulating-a-city/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><object style="height: 390px; width: 640px;" width="640" height="360" classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0"><param name="allowFullScreen" value="true" /><param name="allowScriptAccess" value="always" /><param name="src" value="http://www.youtube.com/v/291yE_9eefU?version=3&amp;feature=player_detailpage" /><param name="allowfullscreen" value="true" /><param name="allowscriptaccess" value="always" /><embed style="height: 390px; width: 640px;" width="640" height="360" type="application/x-shockwave-flash" src="http://www.youtube.com/v/291yE_9eefU?version=3&amp;feature=player_detailpage" allowFullScreen="true" allowScriptAccess="always" allowfullscreen="true" allowscriptaccess="always" /></object></p>
<p>For the past 4 years or so, in my spare time, I have been working with a small start-up company, <a href="http://encitra.com">Encitra</a>, whose goal is to help cities and real estate developers make sustainable urban plans come to life in the minds and hearts of stakeholders and the general public. We go at it with virtual reality. Not just computer animation movies; we develop complete multi-user interactive virtual environments that are built and re-built over time by multiple people, and that simulate urban areas &#8212; both structural and dynamic aspects &#8212; as faithfully as possible. Recently, we reached an important milestone: we were able to simulate an area of 3km x 1.5km of the city of Uppsala, Sweden. This includes the actual terrain, the major landmarks of the city, several hundred assorted buildings, as well as traffic and pedestrians. It&#8217;s all live and accessible on the Internet, although not through a Web browser. This post explains the technology behind it. For the most part, it&#8217;s all based on open source software!</p>
<h3><span id="more-928"></span>The Server Side</h3>
<p>Since I am a developer of <a href="http://opensimulator.org">OpenSimulator</a>, it&#8217;s no surprise that the core of the infrastructure is OpenSimulator. I did seriously consider other alternatives, but I will leave the technology decision process for another post.</p>
<p><a href="http://tagide.com/blog/wp-content/uploads/2012/05/grid.jpg"><img class="alignleft size-medium wp-image-1008" title="grid" src="http://tagide.com/blog/wp-content/uploads/2012/05/grid-225x300.jpg" alt="" width="225" height="300" /></a> The area is divided into a 2D grid of 256-meter &#8220;regions&#8221;, meaning that we have 6&#215;12=72 regions total. In turn, the regions are grouped in &#8220;sectors&#8221; of 3&#215;3 regions, each sector running in a separate simulator. So we have 8 simulators for this area, like the picture on the left (not to scale, it&#8217;s just an illustration).</p>
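<p>The region and sector arithmetic above can be sketched in a few lines of Python. This is only an illustration of the numbers in the text, not part of our deployment tooling; the function names <code>grid_layout</code> and <code>sector_of</code> are made up for the example.</p>

```python
import math

REGION_SIZE_M = 256   # side of one OpenSimulator region, from the text
SECTOR_SIDE = 3       # a sector is 3x3 regions, from the text


def grid_layout(width_m, height_m):
    """Return (columns, rows, regions, simulators) for a rectangular area."""
    columns = math.ceil(width_m / REGION_SIZE_M)
    rows = math.ceil(height_m / REGION_SIZE_M)
    # One simulator per sector; partial sectors still need a simulator.
    sectors = math.ceil(columns / SECTOR_SIDE) * math.ceil(rows / SECTOR_SIDE)
    return columns, rows, columns * rows, sectors


def sector_of(region_x, region_y):
    """Map a region's grid coordinates to the sector (simulator) hosting it."""
    return region_x // SECTOR_SIDE, region_y // SECTOR_SIDE


print(grid_layout(3000, 1500))   # (12, 6, 72, 8)
print(sector_of(7, 4))           # (2, 1)
```

<p>Rounding up matters here: 3km is a little under 12 regions and 1.5km a little under 6, so the grid slightly overshoots the simulated area.</p>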
<p>We use OpenSimulator in a standard grid configuration, with one central &#8220;Robust&#8221; server, and with the simulators all sharing the grid resources. During different phases of this project, we placed the simulators in a variety of servers, from our own machines to <a href="http://aws.amazon.com/ec2/">AWS EC2</a>. We have developed a simple systems administration layer on top of OpenSim that allows us to very quickly deploy simulators in whatever servers we need them to be on.</p>
<p>We have configured the simulators using advanced options that are available in OpenSim, but that aren&#8217;t well known. For example, we open child agents in a span of 6 regions all around, instead of the default 1. This allows us to view almost the entire area independent of where our avatars are.</p>
<p>One characteristic of the Encitra virtual reality environment vs. Second Life and the OpenSim grids out there is that our environment has very few visitors. During the most active phases of the build, there are 3-4 people logged in, and when we have meetings, there are 7-10 of us. We don&#8217;t expect ever having to cope with large numbers of users in the same simulators, since this is not a social environment, it&#8217;s just a collaborative one. We do, however, produce relatively large builds in terms of numbers of prims, meshes and textures.</p>
<h3>The Viewers</h3>
<p>We primarily use two viewers: <a href="http://wiki.kokuaviewer.org/wiki/Imprudence:Downloads">Imprudence</a> and the <a href="https://bitbucket.org/Zena_Juran/zen-viewer">Zen viewer</a>. Imprudence is a great utility viewer, as it allows us to create very large structures and to import and export objects. More recently, we have all come to use and love the Zen viewer for doing routine builds and for capturing video and pictures. Zen is based on the Linden Lab V3 code base, meaning that it supports all the latest eye candy like media-on-a-prim, shadows, etc. It also comes with presets for the sky, the water and the light that make for stunning environments. We have been very pleased with the videos that Zen allows us to produce!</p>
<h3>Terrain</h3>
<p><a href="http://tagide.com/blog/wp-content/uploads/2012/05/Snapshot_001.png"><img class="alignright size-medium wp-image-947" style="border-image: initial; border-width: 2px; border-color: black; border-style: solid;" title="Terrain detail" src="http://tagide.com/blog/wp-content/uploads/2012/05/Snapshot_001-300x179.png" alt="" width="300" height="179" /></a></p>
<p>With an external toolset that I developed, we are able to generate terrains from GIS data, embedded with aerial images. (One of these tools is based on old code from <a href="http://www.sinewavecompany.com/about/adam-frisby/">Adam Frisby</a>.) These realistic aerial-image terrains, by themselves, provide a fair amount of immersion, even before any building is modeled. They also give us the footprint for placing the buildings in the scene and for visualizing the roads and vegetation.</p>
<p><a href="http://tagide.com/blog/wp-content/uploads/2012/05/Snapshot_008.png"><img class="alignright size-medium wp-image-962" style="border-image: initial; border-width: 2px; border-color: black; border-style: solid;" title="Underground" src="http://tagide.com/blog/wp-content/uploads/2012/05/Snapshot_008-300x179.png" alt="" width="300" height="179" /></a>These terrains allow us to work under them. Even though this was not part of the requirements for the Uppsala simulation project,  I simply had to place something underground&#8230; after all, cities are not just what&#8217;s visible, they are also what&#8217;s invisible.</p>
<h3>The Build Process</h3>
<p>The build process is exactly the same as any build in Second Life / OpenSim. We have been using primarily prim buildings, but we also have sculpties and meshes. We have taken a fair number of pictures of the building façades, and used them to texture the buildings. Here are pictures of some of the Uppsala landmarks:</p>
<p><a href="http://tagide.com/blog/wp-content/uploads/2012/05/Snapshot_007.png"><img class="size-medium wp-image-959 alignleft" style="border-image: initial; border-width: 2px; border-color: black; border-style: solid;" title="Statue" src="http://tagide.com/blog/wp-content/uploads/2012/05/Snapshot_007-300x179.png" alt="" width="300" height="179" /></a></p>
<p><a href="http://tagide.com/blog/wp-content/uploads/2012/05/Snapshot_006.png"><img class="size-medium wp-image-960 alignnone" style="border-image: initial; border-width: 2px; border-color: black; border-style: solid;" title="Cathedral" src="http://tagide.com/blog/wp-content/uploads/2012/05/Snapshot_006-300x179.png" alt="" width="300" height="179" /></a></p>
<p>For the urban plans that are being considered but that don&#8217;t yet exist, like the podcar system, for example, we do something special with them so that we can add them to or remove them from the scene with the click of a button.</p>
<h3>Traffic Simulation</h3>
<p>We developed a traffic simulation addon that is capable of driving thousands of vehicles all over the simulated area without degrading the visiting user&#8217;s experience. The traffic simulation runs on a separate server. The vehicles go from one end to the other without regard for &#8220;region&#8221; or simulator borders.</p>
<p>The current traffic simulation/visualization has been developed by me from scratch, but the intention is to hook up this technology to external traffic simulators, of which there are a few out there. That will be a future milestone. But since I developed this one, I developed an appreciation for traffic simulators. They&#8217;re fun pieces of software! My traffic simulation includes working traffic lights and stops to which the vehicles react, as well as collision detection, all wrapped up in very simple, <a href="http://en.wikipedia.org/wiki/Boids">boid</a>-like rules.</p>
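<p>To give a flavor of what such simple local rules can look like, here is a minimal sketch in Python. It is not the code of our traffic addon: it models a single one-way route, and the names <code>Vehicle</code>, <code>step</code>, <code>SAFE_GAP_M</code> and <code>STOP_ZONE_M</code>, as well as their values, are assumptions made up for the illustration.</p>

```python
from dataclasses import dataclass

SAFE_GAP_M = 8.0     # assumed minimum distance to the vehicle ahead
STOP_ZONE_M = 5.0    # assumed distance before a red light where vehicles halt


@dataclass
class Vehicle:
    position: float      # meters travelled along the route
    speed_limit: float   # meters per second


def step(vehicles, red_light_at, dt):
    """Advance every vehicle by dt seconds along a single one-way route.

    vehicles must be ordered front to back (largest position first).
    red_light_at is the position of a red light, or None when it is green.
    """
    leader_position = None
    for v in vehicles:
        target = v.position + v.speed_limit * dt
        # Rule 1: never get closer than SAFE_GAP_M to the vehicle ahead.
        if leader_position is not None:
            target = min(target, leader_position - SAFE_GAP_M)
        # Rule 2: stop short of a red light that is still ahead of us.
        if red_light_at is not None and red_light_at - STOP_ZONE_M >= v.position:
            target = min(target, red_light_at - STOP_ZONE_M)
        # Rule 3: vehicles never move backwards.
        v.position = max(v.position, target)
        leader_position = v.position


cars = [Vehicle(40.0, 10.0), Vehicle(35.0, 10.0)]
step(cars, red_light_at=50.0, dt=1.0)
step(cars, red_light_at=50.0, dt=1.0)
print([c.position for c in cars])   # [45.0, 37.0]: both queued at the red light
step(cars, red_light_at=None, dt=1.0)
print([c.position for c in cars])   # [55.0, 47.0]: moving again on green
```

<p>Each vehicle only looks at the vehicle directly ahead and at the next light, which is the sense in which the rules are boid-like: the queue at the light emerges from local decisions rather than from any global plan.</p>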
<p><a href="http://tagide.com/blog/wp-content/uploads/2012/05/Snapshot_002.png"><img class="alignleft size-medium wp-image-955" style="border-image: initial; border-width: 2px; border-color: black; border-style: solid;" title="Waypoints -- Cars" src="http://tagide.com/blog/wp-content/uploads/2012/05/Snapshot_002-300x179.png" alt="" width="300" height="179" /></a><a href="http://tagide.com/blog/wp-content/uploads/2012/05/Snapshot_011.png"><img class="alignleft size-medium wp-image-964" style="border-image: initial; border-width: 2px; border-color: black; border-style: solid;" title="Snapshot_011" src="http://tagide.com/blog/wp-content/uploads/2012/05/Snapshot_011-300x179.png" alt="" width="300" height="179" /></a>Currently, the painful part is establishing the routes. We have no way to infer the routes other than by visual inspection and by insider knowledge of the traffic flow and signs in Uppsala. The picture on the left shows the waypoints for cars and buses around the station area. With areas like this one, 3km wide, setting up the paths that the vehicles use, and the speed limits, is a daunting task, almost as daunting as doing it in real life! Perhaps when we hook this up to real traffic simulators that information will already be available.</p>
<p>On the positive side, the podcar system, given that it doesn&#8217;t yet exist, has been much more amenable to automation. We have automatically generated both the tracks and the routes from an existing plan.</p>
<h3>Pedestrians</h3>
<p><a href="http://tagide.com/blog/wp-content/uploads/2012/05/Snapshot_012.png"><img class="size-medium wp-image-968 alignright" style="border-image: initial; border-width: 2px; border-color: black; border-style: solid;" title="Snapshot_012" src="http://tagide.com/blog/wp-content/uploads/2012/05/Snapshot_012-300x179.png" alt="" width="300" height="179" /></a><a href="http://tagide.com/blog/wp-content/uploads/2012/05/Snapshot_022.jpg"><img class="size-medium wp-image-969 alignright" style="border-image: initial; border-width: 2px; border-color: black; border-style: solid;" title="Bot paths" src="http://tagide.com/blog/wp-content/uploads/2012/05/Snapshot_022-300x179.jpg" alt="" width="300" height="179" /></a>Since 0.7.3, OpenSim supports <a href="http://opensimulator.org/wiki/OSSLNPC">server-side bots</a> that are scriptable inworld. We have used that facility to create over 60 standing/sitting bots and a dozen walking ones. The standing/sitting bots are very lightweight; we could easily have hundreds of them in every simulator. The walking bots, on the other hand, are relatively heavy, as they are part of the physics scene.</p>
<p>Besides contributing to lag, walking bots come with the same major hassle as traffic: establishing the routes that they walk. Our bot developer has scripted a very pretty system that allows us to visualize the routes of the bots using particle systems. This can&#8217;t exist in real life, but it&#8217;s so pretty that I wish it could!</p>
<h3>Can Cities be Simulated, Really?</h3>
<p>You may very well be asking that question. Clearly, this simulation ignores an enormous amount of things that happen in the real city of Uppsala. In fact, it focuses only on a very small number of aspects &#8212; the buildings, the roads, the traffic, the pedestrians around the stations and, most importantly in this case, the podcar system connecting the main station to the Hospital and the University. As with any computational model, big chunks of reality are discarded. That&#8217;s how these models work.</p>
<p>In this particular case, the main question at hand is the viability of the podcar system &#8212; its concept, its look &amp; feel, its utility with respect to the alternative (buses), its layout throughout the city, and the positive and negative interference with existing infrastructure. Since there aren&#8217;t many podcar systems in the world, people aren&#8217;t used to this urban transportation concept. The decision to have one is part of a long process of technical, political and strategic deliberation involving many stakeholders. This simulation and visualization is part of a larger set of artifacts that are being produced. It will be interesting to see what the outcome of those deliberations will be&#8230;</p>
<h3><a href="http://tagide.com/blog/wp-content/uploads/2012/05/Uppsala_City_030.png"><img class="aligncenter" style="border-image: initial; border-width: 2px; border-color: black; border-style: solid;" title="Uppsala Simulation" src="http://tagide.com/blog/wp-content/uploads/2012/05/Uppsala_City_030-300x159.png" alt="Uppsala Simulation" width="300" height="159" /></a></h3>
]]></content:encoded>
			<wfw:commentRss>http://tagide.com/blog/2012/05/simulating-a-city/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>The Single Most Important Thing</title>
		<link>http://tagide.com/blog/2012/03/the-single-most-important-thing/</link>
		<comments>http://tagide.com/blog/2012/03/the-single-most-important-thing/#comments</comments>
		<pubDate>Sun, 25 Mar 2012 15:33:37 +0000</pubDate>
		<dc:creator>crista</dc:creator>
				<category><![CDATA[commentary]]></category>
		<category><![CDATA[social software systems]]></category>

		<guid isPermaLink="false">http://tagide.com/blog/?p=916</guid>
		<description><![CDATA[What is the single most important feature of a programming system without which you can&#8217;t write programs effectively? My answer is: the vast amount of accessible documentation and knowledge out there. For example, I can&#8217;t program in an airplane without &#8230; <a href="http://tagide.com/blog/2012/03/the-single-most-important-thing/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>What is the single most important feature of a programming system without which you can&#8217;t write programs effectively?</p>
<p><span id="more-916"></span>My answer is: the vast amount of accessible documentation and knowledge out there. For example, I can&#8217;t program on an airplane without Internet; I roll for 10 minutes until I get stuck on something for which I don&#8217;t know the answer, or for which I know there&#8217;s a much better way of doing it but I just don&#8217;t remember the API. This is followed by a feeling of frustration that makes me quit the IDE/emacs/vi and write a blog post or start working on the PowerPoint slides for a talk. Forget about types, objects, and whatnot. On-demand documentation, examples and question answering à la Stack Overflow is the single most important thing for me. I bet for others too.</p>
<p>This is quite a dramatic change from how things were back in the 20th century. The Web and search engines have raised the bar really high, a bar that we didn&#8217;t even know existed: the acquisition of &#8220;brick knowledge.&#8221; Brick knowledge is knowledge about specificities of how to do things. For example, if I want to add a feature to my program that uses gzip compression/decompression of certain data, the non-brick knowledge is knowledge about compression in general, about gzip in particular, and what it does to the performance of my program; while the brick knowledge is knowledge about the specific APIs or example code that implements compression in the various programming systems (say, .NET or Python or Haskell). Once I know what I want to do, and why, I just pull up a search engine and type &#8220;.net gzip&#8221; or &#8220;python gzip&#8221; or &#8220;haskell gzip&#8221; and the brick knowledge magically appears in seconds. Without the Web, I&#8217;m stuck.</p>
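<p>As an illustration of how small a piece of brick knowledge can be, this is roughly what a &#8220;python gzip&#8221; search leads to: a round trip through the <code>gzip</code> module of the Python standard library. The sample data is made up for the example.</p>

```python
import gzip

# Made-up sample data; highly repetitive, so it compresses well.
original = b"brick knowledge " * 100

compressed = gzip.compress(original)      # bytes in gzip format
restored = gzip.decompress(compressed)    # back to the original bytes

print(len(original), len(compressed))
print(restored == original)   # True
```

<p>Knowing that these two calls exist is the brick knowledge; knowing whether compressing this data is worth the CPU time in my program is not.</p>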
<p>In the old days, people coped with this by specializing in specific programming systems. This would allow them to acquire vast amounts of brick knowledge, so they had it when they needed it. The Web has changed the game. These days, I feel I can program in just about any language/system out there&#8230; I CAN HAZ POWER&#8230; as long as there&#8217;s searchable brick knowledge about it.</p>
<p>The more the better. So the best programming systems, for me, are those for which all my information needs are satisfied with a search using reasonable keywords.</p>
<p>One of these days, I&#8217;d like to prototype a programming system that lets me write programs as a sequence of search queries <img src='http://tagide.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://tagide.com/blog/2012/03/the-single-most-important-thing/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Research in Programming Languages</title>
		<link>http://tagide.com/blog/2012/03/research-in-programming-languages/</link>
		<comments>http://tagide.com/blog/2012/03/research-in-programming-languages/#comments</comments>
		<pubDate>Fri, 02 Mar 2012 17:56:01 +0000</pubDate>
		<dc:creator>crista</dc:creator>
				<category><![CDATA[academia]]></category>
		<category><![CDATA[research]]></category>
		<category><![CDATA[Programming languages]]></category>

		<guid isPermaLink="false">http://tagide.com/blog/?p=416</guid>
		<description><![CDATA[Is there still research to be done in Programming Languages? This essay touches both on the topic of programming languages and on the nature of research work. I am mostly concerned with analyzing this question in the context of Academia, &#8230; <a href="http://tagide.com/blog/2012/03/research-in-programming-languages/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><strong>Is there still research to be done in Programming Languages?</strong> This essay touches both on the topic of programming languages and on the nature of research work. I am mostly concerned with analyzing this question in the context of Academia, i.e. within the expectations of academic programs and research funding agencies that support research work in the STEM disciplines (<span class="st">Science, Technology, Engineering, and Mathematics</span>). This is not the only possible perspective, but it is the one I am taking here.</p>
<p><span id="more-416"></span>PLs are dear to my heart, and a considerable chunk of my career was made in that area. As a designer, there is something fundamentally interesting in designing a language of any kind. It&#8217;s even more interesting and gratifying when people actually start exercising those languages to create non-trivial software systems. As a user, I love to use programming languages that I haven&#8217;t used before, even when the languages in question make me curse every other line.</p>
<p>But the truth of the matter is that ever since I finished <a href="ftp://ftp.ccs.neu.edu/pub/people/crista/publications/thesis/index.html">my Ph.D.</a> in the late 90s, and especially since I joined the ranks of Academia, I have been having a hard time convincing myself that research in PLs is a worthy endeavor. I feel really bad about my rational arguments against it, though. Hence this essay. Perhaps by the time I am done with it I will have come to terms with this dilemma.</p>
<p>Back in the 50s, 60s and 70s, programming languages were a BigDeal, with large investments, upfront planning, and big drama on standardization committees (Ada was the epitome of that model). Things changed dramatically during the 80s. Since the 90s, a considerable percentage of the new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than making some routine activities easier or having plain hacking fun. Examples:</p>
<ul>
<li>PHP, by Rasmus Lerdorf circa 1994, &#8220;originally used for tracking visits to his online resume, he named the suite of scripts &#8216;Personal Home Page Tools,&#8217; more frequently referenced as &#8216;PHP Tools.&#8217; &#8221; [<a href="http://www.php.net/manual/en/history.php.php">1</a>] PHP is a marvel of how a horrible language can become the foundation of large numbers of applications&#8230; for a second time! <a href="http://www.dreamsongs.com/RiseOfWorseIsBetter.html">Worse is Better</a> redux. According to one <a href="http://langpop.com/">informal but interesting survey</a>, PHP is now the 4th most popular programming language out there, losing only to C, Java and C++.</li>
<li>JavaScript, by Brendan Eich circa 1995, &#8220;Plus, I had to be done in ten days or something worse than JS would have happened.&#8221; [<a href="http://www.jwz.org/blog/2010/10/every-day-i-learn-something-new-and-stupid/#comment-1021">2</a>] According to that same survey, JavaScript is the 5th most popular language, and I suspect it is climbing up that rank really fast. It may be #1 by now.</li>
<li>Python, by Guido van Rossum circa 1990, &#8220;I was looking for a &#8216;hobby&#8217; programming project that would keep me occupied during the week around Christmas.&#8221; [<a href="http://www.python.org/doc/essays/foreword/">3</a>] Python comes in at #6, and its strong adoption by scientific computing communities is well known.</li>
<li>Ruby, by Yukihiro &#8220;Matz&#8221; Matsumoto circa 1994, &#8220;I wanted a scripting language that was more powerful than Perl, and more object-oriented than Python. That&#8217;s why I decided to design my own language.&#8221; [<a href="http://linuxdevcenter.com/pub/a/linux/2001/11/29/ruby.html">4</a>] At #10 in that survey.</li>
</ul>
<p>Compare this mindset with the context in which the older well-known programming languages emerged:</p>
<ul>
<li>Fortran, 50s, originally developed by IBM as part of their core business in computing machines.</li>
<li>Cobol, late 50s, designed by a large committee from the onset, sponsored by the DoD.</li>
<li>Lisp, late 50s, main project occupying 2 professors at MIT and their students, with the grand goal of producing an algebraic list processing language for artificial intelligence work, also funded by the DoD.</li>
<li>C, early 70s, part of the large investment that Bell Labs was doing in the development of Unix.</li>
<li>Smalltalk, early 70s, part of a large investment that Xerox did in &#8220;inventing the future&#8221; of computers.</li>
</ul>
<p>Back then, developing a language processor was, indeed, a very big deal. Computers were slow, didn&#8217;t have a lot of memory, the language processors had to be written in low-level assembly languages&#8230; it wasn&#8217;t something someone would do in their rooms as a hobby, to put it mildly. Since the 90s, however, with the emergence of PCs and of decent low-level languages like C, developing a language processor is no longer a BigDeal. Hence, languages like PHP and JavaScript.</p>
<p>There is a lot of fun in designing new languages, but this fun is not an exclusive right of researchers with, or working towards, Ph.Ds. Given all the knowledge about programming languages these days, anyone can do it. And many do. And here&#8217;s the first itchy point: <em>there appears to be no correlation between the success of a programming language and its emergence in the form of someone&#8217;s doctoral or post-doctoral work. </em>This bothers me a lot, as an academic. It appears that deep thoughts, consistency, rigor and all other things we value as scientists aren&#8217;t that important for mass adoption of programming languages. But then again, <a href="http://www.dreamsongs.com/RiseOfWorseIsBetter.html">I&#8217;m not the first to say it</a>. It&#8217;s just that this phenomenon is hard to digest, and if you really grasp it, it has tremendous consequences. If people (the potential users) don&#8217;t care about conceptual consistency, why do we keep on trying to achieve that?</p>
<p>To be fair, some of those languages designed in the 90s as side projects, as they became important, eventually became more rigorous and consistent, and attracted a fair amount of academic attention and industry investment. For example, the Netscape JavaScript hacks quickly fell on Guy Steele&#8217;s lap resulting in the <a href="http://en.wikipedia.org/wiki/ECMAScript">ECMAScript specification</a>. Python was never a hack even if it started as a Christmas hobby. Ruby is a fun language and quite elegant from the beginning. PHP&#8230; well&#8230; it&#8217;s fun for possibly the wrong reasons. But the core of the matter is that &#8220;the right thing&#8221; was not the goal. It seems that <span style="text-decoration: underline;"><em>a reliable implementation of a language that addresses an important practical need</em></span> is the key for the popularity of a programming language. But being opportunistic isn&#8217;t what research is supposed to be about&#8230; (or is it?)</p>
<p>Also to be fair, not all languages designed in the 90s and later started as side projects. For example, Java was a relatively large investment by Sun Microsystems. So was .NET later by Microsoft.</p>
<p>And, finally, all of these new languages, even when created over a week as someone&#8217;s pet project, sit on the shoulders of all things that existed before. This leads me to the second itch: <em>one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them</em>! Without exception, including the languages developed in research groups, they all feel like mashups of concepts that already existed in programming languages in 1979, wrapped up in their own idiosyncratic syntax. (I lied: the exceptions are aspects and monads, both of which came in the 90s.)</p>
<p><a href="http://tagide.com/blog/?attachment_id=544" rel="attachment wp-att-544"><img class="alignright size-medium wp-image-544" title="PLs" src="http://tagide.com/blog/wp-content/uploads/2011/09/PLs-300x225.jpg" alt="" width="300" height="225" /></a>So one pertinent question is: given that not much seems to have emerged since 1979 (that&#8217;s 30+ years!), is there still anything to <em>innovate</em> in programming languages? Or have we reached the asymptotic plateau of innovation in this area?</p>
<p>I need to make an important detour here on the nature of research.</p>
<h3>&lt;Begin Detour&gt;</h3>
<p>Perhaps I&#8217;m completely off; perhaps <em>producing innovative new software</em> <em>is not a goal of [STEM] research</em>. Under this approach, any software work is dismissed from STEM pursuits, unless it is necessary for some specific goal &#8212; like if you want to study some far-off galaxy and you need an IT infrastructure to collect the data and make simulations (S for Science); or if you need some glue code for piecing existing systems together (T for Technology); or if you need to improve the performance of something that already exists (E for Engineering); or if you are working on some Mathematical model of computation and want to make your ideas come to life in the form of a language (M for Mathematics). This is an extremely submissive view of software systems, one that places software in the back seat of STEM and that denies the existence of value in research in/by software itself. If we want to lead something on our own, let&#8217;s just&#8230; do empirical studies of technology or become biologists/physicists/chemists/mathematicians or make existing things perform better or do theoretical/statistical models of universes that already exist or that are created by others. Right?</p>
<p>I confess I have a dysfunctional relationship with this idea. Personally, I can&#8217;t be happy without creating software things, but I have been able to make my scientist-self function both as a cold-minded analyst and, at times, as an expert passenger in someone else&#8217;s research project. The design work, for me, has moved to sabbatical time, evenings and weekends; I don&#8217;t publish it [much] other than the code itself and some informal descriptions. And yet, I loathe this situation.</p>
<p>I loathe it because it&#8217;s clear to me that software systems are something very, <em>very</em> special. Software revolutionized everything in unexpected ways, including the methods and practices that our esteemed colleagues in the &#8220;hard&#8221; sciences have held near and dear for a very long time. The evolution of information technology in the past 60 years has been <em>way</em> off from what our colleagues thought they needed. Over and over again, software systems have been created that weren&#8217;t part of any scientific project, as such, and that ended up playing a central role in Science. Instead of trying to mimic our colleagues&#8217; traditional practices, &#8220;computer scientists&#8221; ought to be showing the way to a new kind of science &#8212; maybe <em>that </em><a href="http://www.wolframscience.com/nksonline/page-1?firstview=1">new kind of science</a> or <a href="http://www.amazon.com/Sciences-Artificial-Herbert-Simon/dp/0262691914">that one</a> or maybe something else. I dare to suggest that the something else is related to the design of things that have software in them. It should not be called Science. It is a bit like Engineering, but it&#8217;s not quite that either, because we&#8217;re not dealing [just] with physical things. Technology doesn&#8217;t cut it either. It needs a new name, something that denotes &#8220;the design of things with software in them.&#8221; I will call it Design for short, even though that word is so abused that it has lost its meaning.</p>
<h3>&lt;Suspend Detour&gt;</h3>
<p>Let&#8217;s assume, then, that it&#8217;s acceptable to create/design new things &#8212; innovate &#8212; in the context of doctoral work. Now comes the real hard question.</p>
<p>If anyone &#8212; researchers, engineers, talented kids, summer interns &#8212; can design and implement programming languages, what are the actual hard goals that <em>doctoral research work</em> in programming languages seeks and that distinguish it from what anyone can do?</p>
<p>Let me attempt to answer this question, first, with some well-known goals of language design:</p>
<ul>
<li>Performance &#8212; one can always have more of this; certain application domains need it more than others. This usually involves having to come up with interesting data structures and algorithms for the implementation of PLs that weren&#8217;t easy to devise.</li>
<li>Human Productivity &#8212; one can always want more of this. There is no ending to trying to make development activities easier/faster.</li>
<li>Verifiability &#8212; in some domains this is important.</li>
</ul>
<p>There are other goals, but they are second-order. For example, languages may also need to catch up with innovations in hardware design &#8212; multi-core comes to mind. This is a second-order goal, the real goal behind it is to increase performance by taking advantage of potentially higher-performing hardware architectures.</p>
<p>In other words, someone wanting to do doctoral research work in programming languages ought to have one or more of these goals in mind, and &#8212; very important &#8212; <em>ought to be ready to demonstrate how his/her ideas meet those goals</em>. If you tell me that your language makes something run faster or consume less energy, makes some task easier, or results in programs with fewer bugs, the scientist in me demands that you show me the data that supports such claims.</p>
<p>A lot of research activity in programming languages falls under the performance goal, the Engineering side of things. I think everyone in our field understands what this entails, and is able to differentiate good work from bad work under that goal. But a considerable amount of research activity in programming languages invokes the human productivity argument; entire sub-fields have emerged focusing on the engineering of languages that are believed to increase human productivity. So I&#8217;m going to focus on the human productivity goal. The human productivity argument touches on the core of what attracts most of us to creating things: having a direct positive effect on other people. It has been carelessly invoked since the beginning of Computer Science. (I highly recommend <a href="http://www.cs.washington.edu/education/courses/cse590n/10au/hanenberg-onward2010.pdf">this excellent essay</a> by Stefan Hanenberg published at Onward! 2010 with a critique of software science&#8217;s neglect of human factors.)</p>
<p>Unfortunately, this argument is the hardest to defend. In fact, I have yet to see the first study that <em>convincingly demonstrates</em> that a programming language, or a certain feature of programming languages, makes software development a more productive process. If you know of such a study, please point me to it. I have seen many <a href="http://en.wikipedia.org/wiki/Observational_study">observational studies</a> and <a href="http://en.wikipedia.org/wiki/Experimental_control">controlled experiments</a> that try to do it [<a href="http://page.mi.fu-berlin.de/prechelt/Biblio/jccpprt_computer2000.pdf">5</a>, <a href="http://infoscience.epfl.ch/record/138586/files/dubochet2009coco.pdf">6</a>, <a href="http://dl.acm.org/citation.cfm?id=279140">7</a>, <a href="http://dl.acm.org/citation.cfm?id=359800&amp;CFID=39593267&amp;CFTOKEN=95540901">8</a>, <a href="http://www.cs.washington.edu/education/courses/cse590n/10au/hanenberg-oopsla2010.pdf">9</a>, <a href="http://haskell.cs.yale.edu/?post_type=publication&amp;p=366">10</a>, among many]. I think those studies are <em>really</em> important and there ought to be more of them, but they are always very difficult to do [well]. Unfortunately, they always fall short of giving us any definite conclusions because, even when they are done right, <a href="http://en.wikipedia.org/wiki/Correlation_does_not_imply_causation">correlation does not imply causation</a>. Hence the never-ending ping-pong between studies that focus on the same thing and seem to reach opposite conclusions, best known in the health sciences. We are starting to see that ping-pong in software science too, for example <a href="http://dl.acm.org/citation.cfm?id=279140">7</a> vs. <a href="http://www.cs.washington.edu/education/courses/cse590n/10au/hanenberg-oopsla2010.pdf">9</a>.
But at least these studies show some correlations, or lack thereof, given specific experimental conditions, and they open the healthy discussion about what conditions should be used in order to get meaningful results.</p>
<p>I have seen even more research and informal articles about programming languages that claim benefits to human productivity without providing any evidence for it whatsoever, other than the authors&#8217; or the community&#8217;s intuition, at best based on rational deductions from abstract beliefs that have never been empirically verified. Here is <a href="http://www.haskell.org/haskellwiki/Why_Haskell_Matters">one</a> that surprised me because I have the highest respect for the academic soundness of Haskell. Statements like this &#8220;<em>Haskell programs have fewer bugs because Haskell is: pure [...], strongly typed [...], high-level [...], memory managed [...], modular [...] [...] There just isn&#8217;t any room for bugs!</em>&#8221; are nothing but wishful thinking. Without the data to support this claim, this statement is deceptive; while it can be made informally in a blog post designed to evangelize the crowd, it definitely should not be made in the context of doctoral work unless that work provides <em>solid evidence</em> for such a strong statement.</p>
<p>That article is not an outlier. The Internets are full of articles claiming improved software development productivity for just about every other language. No evidence is ever provided; the argumentation is always either (a) deduced from principles that are supposed to be true but that have never been verified, or (b) extrapolated from ad-hoc, highly biased, severely skewed personal experiences.</p>
<p>This is the main reason why I stopped doing research in Programming Languages in any official capacity. Back when I was one of the main evangelists for AOP I realized at some point that I had crossed the line to saying things for which I had very little evidence. I was simply&#8230; evangelizing, i.e. convincing others of an idea that I believed strongly. At some point I felt I needed empirical evidence for what I was saying. But providing evidence for the human productivity argument is damn hard! My scientist self cannot lead doctoral students into that trap, a trap that I know too well.</p>
<p>Moreover, designing and executing the experiments that lead to uncovering such evidence requires a lot of time and a whole other set of skills that have absolutely nothing to do with the time and skills for actually designing programming languages. We need to learn the methods that experimental psychologists use. And, at the end of all that work, we will be lucky if we unveil correlations but we will not be able to draw any definite conclusions, which is&#8230; depressing.</p>
<p>But without empirical evidence of any kind, and from a scientific perspective, unsubstantiated claims pertaining to, say, Haskell or AspectJ (which are mostly developed and used by academics and have been the topic of many PhD dissertations) are as good as unsubstantiated claims pertaining to, say, PHP (which is mostly developed and used by non-academics). The PHP community is actually very honest when it comes to stating the benefits of using the language. For example, here is an <a href="http://blogs.agriya.com/benefits-of-php">honest-to-god set of reasons for using PHP</a>. Notice that there are no claims whatsoever about PHP leading to fewer bugs or higher programmer productivity (as if anyone would dare to state that!); they&#8217;re just pragmatic reasons. (Note also: I&#8217;m not implying that Haskell/AspectJ/PHP are &#8220;comparables;&#8221; they have quite different target domains. I&#8217;m just comparing the narratives surrounding those languages, the &#8220;stories&#8221; that the communities tell within themselves and to others.)</p>
<p>OK, now that I made 823 enemies by pointing out that the claims about human productivity  surrounding languages that have emerged in academic communities &#8212; and therefore ought to know better &#8212; are unsubstantiated, PLUS 865 enemies by saying that empirical user studies are inconclusive and depressing&#8230; let me try to turn my argument around.</p>
<p>Is the high bar of <em>scientific evidence</em> killing innovation in programming languages? Is this what&#8217;s causing the asymptotic behavior? It certainly is what&#8217;s keeping <em>me</em> away from that topic, but I&#8217;m just a grain of sand. What about the work of many who propose intriguing new design ideas that are then shot down in peer-review committees because of the lack of evidence?</p>
<p>This ties back to my detour on the nature of research.</p>
<h2>&lt;Join Detour&gt; Design experimentation vs. Scientific evidence</h2>
<p>So, we&#8217;re back to whether design innovation per se is an admissible first-order goal of doctoral work or not. And now that question is joined by a counterpart: is the provision of scientific evidence really required for doctoral work in programming languages?</p>
<p>If what we have in hand is not Science, we need to be careful not to blindly adopt methods that work well for Science, because that may kill the essence of our discipline. In my view, that essence has been the radical, fast-paced, off the mark design experimentation enabled by software. This rush is fairly incompatible with the need to provide scientific evidence for the design &#8220;hopes.&#8221;</p>
<p>I&#8217;ll try a parallel: drug design, the modern-day equivalent of alchemy. In terms of research it is similar to software: partly based on rigor, partly on intuitions, and now also on automated tools that simply perform an enormous amount of logical combinations of molecules and determine some objective function. When it comes to deployment, whoever is driving that work better put in place a plan for actually testing the theoretical expectations in the context of actual people. Does the drug really do what it is supposed to do without any harmful side effects? We require scientific evidence for the claimed value of experimental drugs. Should we require scientific evidence for the value of experimental software?</p>
<p>The parallel diverges significantly with respect to the consequences of failure. A failure in drug design experimentation may lead to people dying or getting even more sick. A failure in software design experimentation is only a big deal if the experiment had a huge investment from the beginning and/or pertains to safety-critical systems. There are still some projects like that, and for those, seeking solid evidence of their benefits before deploying the production version of the experiment is a good thing. But not all software systems are like that. Therefore the burden of scientific evidence may be too much to bear. It is also often the case that over time, the enormous amount of testing by real use is enough to provide assurances of all kinds.</p>
<p>One good example of design experimentation being at odds with scientific evidence is <a href="http://www.w3.org/History/1989/proposal.html">the proposal that Tim Berners-Lee made to CERN regarding the implementation of the hypertext system</a> that became the Web. Nowhere in that proposal do we find a plan for verification of claims. That&#8217;s just a solid good proposal for an intriguing &#8220;linked information system.&#8221; I can imagine TB-L&#8217;s manager thinking: &#8220;hmm, ok, this is intriguing, he&#8217;s a smart guy, he&#8217;s not asking that many resources, let&#8217;s have him do it and see what comes of it. If nothing comes of it, no big deal.&#8221; Had TB-L had to devise a scientific or engineering assessment plan for that system beyond &#8220;in the second phase, we&#8217;ll install it on many machines&#8221; maybe the world would be very different today, because he might have gotten caught in the black hole of trying to find quantifiable evidence for something that didn&#8217;t need that kind of validation.</p>
<p>Granted, this was not a doctoral topic proposal; it was a proposal for the design and implementation of a very concrete system with software in it, one that (1) clearly identified the problem, (2) built on previous ideas, including the author&#8217;s own experience, (3) had some intriguing insights in it, (4) stated expected benefits and potential applications &#8212; down to the prediction of search engines and graph-based data analysis. Should a proposal like TB-L&#8217;s be rejected if it were to be a doctoral topic proposal? When is one unproven design idea doctoral material while another isn&#8217;t? If we are to accept design ideas without validation plans as doctoral material, how do we assess them?</p>
<h2>Towards the discipline of Design</h2>
<p>In order to do experimental design research AND be scientifically honest at the same time, one needs to let go of <em>claims</em> altogether. In that dreadful part of a topic proposal where the committee asks the student &#8220;what are your claims?&#8221; the student should probably answer &#8220;none of interest.&#8221; In experimental design research, one can have <em>hopes</em> or <em>expectations</em> about the effects of the system, and those must be clearly articulated, but very few <em>certainties</em> will likely come out of such type of work. And that&#8217;s ok! It&#8217;s very important to be honest. For example, it&#8217;s not ok to claim &#8220;my language produces bug-free programs&#8221; and then defend this with a deductive argument based on unproven assumptions; but it&#8217;s ok to state &#8220;I expect that my language produces programs with fewer bugs [but I don't have data to prove it].&#8221; <a href="http://www.w3.org/History/1989/proposal.html">TB-L&#8217;s proposal</a> was really good at being honest.</p>
<p>Finally, here is an attempt at establishing rigorous criteria for design assessment in the context of doctoral and post-doctoral research:</p>
<ul>
<li><strong>Problem</strong>: how important and surprising is the problem and how good is its description? The problem space is, perhaps, the most important component for a piece of design research work. If the design is not well grounded in an interesting and important problem, then perhaps it&#8217;s not worth pursuing as research work. If it&#8217;s an old, hard problem, it should be formulated in a surprising manner. Very often, the novelty of a design lies not in the design itself but in its designer seeing the problem differently. So &#8212; surprise me with the problem. Show me insights on the nature of the problem that we don&#8217;t already know.</li>
<li><strong>Potential</strong>: what intriguing possibilities are unveiled by the design? Good design research work should open up doors for new avenues of exploration.</li>
<li><strong>Feasibility</strong>: good design research work should be grounded on what is possible to do. The ideas should be demonstrated in the form of a working system.</li>
<li>Additionally, design research work, like any other research work, needs to be placed in a solid <strong>context</strong> of what already exists.</li>
</ul>
<p>These criteria have two consequences that I really like: first, they substantiate our intuitions about proposals such as TB-L&#8217;s &#8220;linked information system&#8221; being a fine piece of [design] research work; second, they substantiate our intuitions about the difference between languages like Haskell and languages like PHP. I leave that as an exercise to the reader!</p>
<p>&nbsp;</p>
<p><a href="http://tagide.com/blog/?attachment_id=573" rel="attachment wp-att-573"><img class="aligncenter size-medium wp-image-573" title="PLsWant" src="http://tagide.com/blog/wp-content/uploads/2011/09/PLsWant-300x225.jpg" alt="" width="300" height="225" /></a></p>
<h2>Coming to terms</h2>
<p>I would love to bring design back to my daytime activities. I would love to let my students engage in designing new things such as new programming languages and environments &#8212; I have lots of ideas for what I would like to do in that area! I believe there is a path to establishing a set of rigorous criteria regarding the assessment of design that is different from scientific/quantitative validation. All this, however, doesn&#8217;t depend on me alone. If my students&#8217; papers are going to be shot down in program committees because of the lack of validation, then my wish is a curse for them. If my grant proposals are going to be rejected because they have no validation plan other than &#8220;and then we install it in many machines&#8221; or &#8220;and then we make the software open source and free of charge&#8221; then my wish is a curse for me. We need buy-in from a much larger community &#8212; in a way, <em>reverse the trend of placing software research under the auspices of science and engineering [alone]</em>.</p>
<p>This, however, should only be done <em>after</em> the community understands what science and scientific methods are all about (the engineering ones &#8212; everyone knows about them). At this point there is still a severe lack of understanding of science within the CS community. Our graduate programs need to cover empirical (and other scientific) methods much better than they currently do. If we simply continue to ignore the workings of science and the burden of scientific proof, we end up continuing to make careless religious statements about our programming languages and systems that simply will lead nowhere, under the misguided impression that we are scientists because the name says so.</p>
<p><span class="st"><em>Copyright © Crista Videira Lopes. All rights reserved.<br />
Note: this is a work-in-progress essay. I may update it from time to time. Feedback welcome.<br />
</em> </span></p>
]]></content:encoded>
			<wfw:commentRss>http://tagide.com/blog/2012/03/research-in-programming-languages/feed/</wfw:commentRss>
		<slash:comments>104</slash:comments>
		</item>
		<item>
		<title>To Dish or Not To Dish</title>
		<link>http://tagide.com/blog/2012/02/to-dish-or-not-to-dish/</link>
		<comments>http://tagide.com/blog/2012/02/to-dish-or-not-to-dish/#comments</comments>
		<pubDate>Mon, 27 Feb 2012 10:03:33 +0000</pubDate>
		<dc:creator>abby</dc:creator>
				<category><![CDATA[academia]]></category>
		<category><![CDATA[advice]]></category>
		<category><![CDATA[sarcasm]]></category>

		<guid isPermaLink="false">http://tagide.com/blog/?p=893</guid>
		<description><![CDATA[Dear @bby, I am being asked to write a recommendation letter for someone who has been working with me for 3 years and who I think sucks. What should I do? Should I simply decline to do it? Or should &#8230; <a href="http://tagide.com/blog/2012/02/to-dish-or-not-to-dish/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<blockquote><p>Dear @bby,</p>
<p>I am being asked to write a recommendation letter for someone who has been working with me for 3 years and who I think sucks. What should I do? Should I simply decline to do it? Or should I say what I honestly think about that person and his work? &#8212; because he deserves it!</p>
<p>Sincerely, Conflicted Recommender</p></blockquote>
<p><span id="more-893"></span>Dear Conflicted Recommender: By now you have probably realized that recommendation letters are the currency of your trade. A negative recommendation may stop that person&#8217;s career in its tracks. You have the power! Before you indulge yourself in dishing that poor soul and pulling the rug from under him so inelegantly, ask yourself the following questions: (a) Does that person&#8217;s work really suck, or is your dislike of him, and <em>it</em>, a consequence of a bad fit, mismatched expectations and/or personal issues? (b) Did you take the time to tell that person how displeased you were with his work, how he needed to change if he wanted to continue to work for you, and, most importantly, how his poor performance would influence your recommendation to future employers?</p>
<p>Your answers to these questions will determine the right course of action.</p>
<p>If you can&#8217;t guess, the no-no situation here is for you to write a letter that sounds like a YouTube comment. That can actually work against you, and can thwart your plans to end that person&#8217;s career. You accumulate <em>YouTube commenter vitriol </em>(a form of black energy recently discovered by astrophysicists) when you let things roll with that person without ever expressing your concerns directly to him, or take a passive-aggressive attitude, until the time comes to write that killer letter that will avenge your frustration with that person. The professor reading that letter on the other end will likely think you, not him, are the jerk. Let&#8217;s be honest: you might have been the jerk here. How could you possibly waste 3 years supervising someone who you think sucks? You should have gotten rid of him long ago.</p>
<p>So make sure that during those 3 years of unpleasant interactions you made your concerns known to that poor soul. If he was foolish enough to stay, and even more foolish to ask you for a recommendation letter, then you have gained the right to dish him. Do so carefully. Start by saying that you did all you could to get rid of him, and even told him the letter wouldn&#8217;t be good, but that he didn&#8217;t understand, so you are in the unfortunate situation of having to write this letter.</p>
<p>Better yet: don&#8217;t write it at all. If you have nothing nice to say, decline to give your recommendation. The professor on the other end will get the smoke signals and will interpret them accordingly. This is a manner of conveying what you think without having to say a word, so don&#8217;t let your emotions get in the way.</p>
<p><em>Dear @bby is written by @bb1941l v@n Bµr3n. @bby channels uncommon common sense for cynical academics. Send her your difficult questions by email to abby at this web site&#8217;s domain name.</em></p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://tagide.com/blog/2012/02/to-dish-or-not-to-dish/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Ethics in Economics</title>
		<link>http://tagide.com/blog/2011/11/ethics-in-economics/</link>
		<comments>http://tagide.com/blog/2011/11/ethics-in-economics/#comments</comments>
		<pubDate>Tue, 01 Nov 2011 18:14:09 +0000</pubDate>
		<dc:creator>crista</dc:creator>
				<category><![CDATA[academia]]></category>
		<category><![CDATA[commentary]]></category>
		<category><![CDATA[ethics]]></category>

		<guid isPermaLink="false">http://tagide.com/blog/?p=869</guid>
		<description><![CDATA[Imagine this. You have a brilliant idea for how to reverse the effects of aging in female infertility, a wonderful combination of drugs that you have been developing in your lab with your graduate students, and that will open the &#8230; <a href="http://tagide.com/blog/2011/11/ethics-in-economics/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Imagine this. You have a brilliant idea for how to reverse the effects of aging in female infertility, a wonderful combination of drugs that you have been developing in your lab with your graduate students, and that will open the possibility of motherhood to hundreds of thousands of women who waited just too long to conceive. You have done your Math, your Chemistry, you have developed the model explaining why your idea works. You have tested it in mice. You have tested it in pigs. You got 90% success. You have very little doubt that it works in humans too. If only you could test it&#8230; Now imagine that this is 1925, there are no Institutional Review Boards, no Ethics committees to go through, no clinical protocols. In order to test your ideas, you simply need to recruit women who routinely come to your medical office lamenting that they would like to have children but they are too old to conceive. You wholeheartedly believe in your cure and dream of the Nobel prize. Those women&#8217;s desperation is a powerful context for testing your ideas; they want it, they will gladly try anything!</p>
<p><span id="more-869"></span>As you deploy your experiment among those women, you observe that not only does it fail to produce the results you expected, but it also accelerates the process of aging: within 9 months, those 40-something women suddenly look like they are 60 years old. You scratch your head, and go back to the drawing board trying to understand where your beautiful model went wrong. You don&#8217;t feel bad for those women, after all they were willing volunteers; but your ego is bruised because your wonderful idea didn&#8217;t work, and the possibility of a Nobel prize is farther than ever.</p>
<p>What I just described to you is a typical scenario in routine Ethics courses that all of us academics in the US need to go through every few years.</p>
<p>Now imagine this. You have a brilliant idea for how to generate value in economic markets, a clever combination of bets on future values that you have been developing in your lab with your graduate students, and that will open the possibility of riches to a whole breed of entrepreneurial financiers, as well as to entire economies by trickle-down effects. You have done your Math, you have developed a model that explains why your idea works. You have simulated it using powerful computers. You got 90% success. You have very little doubt that it works in real economies too. If only you could test it. Well, it&#8217;s 1995, there are no Institutional Review Boards for Economics, no Ethics committees to go through &#8212; the issue of Ethics doesn&#8217;t even register in your mind as something to worry about. There are only some political barriers to your ideas, some old laws from post-Great-Depression years that vaguely prevent those ideas from being put into practice. With help from powerful friends who believe that your ideas will make them even richer, those laws are quickly and quietly deactivated, one by one. In order to test your ideas, you simply need to recruit governments, organizations and individuals who are cash-strapped. Your confidence in your Mathematical model plus those countries&#8217;, organizations&#8217; and individuals&#8217; desperation are a powerful combination for testing your ideas; they will try anything! &#8212; especially a financial &#8216;cure&#8217; that is endorsed by a world-renowned economics professor such as yourself. You dream of a Nobel prize, or maybe you have already won one.</p>
<p>As you deploy your experiment, you observe a bubble that first gives the impression of economic growth but that then comes crashing down, leaving behind it millions of people, organizations and governments with debts they can&#8217;t pay. Entire countries&#8217; economies are in ruins, social unrest is everywhere. You scratch your head and go back to the drawing board trying to understand where your beautiful model went wrong. You don&#8217;t feel bad for the economic disaster; you feel proud for trying such a clever, innovative model, and for observing that your ideas had a strong effect in real life. You don&#8217;t empathize with the millions of people who got screwed out of their hopes and desires; after all, they were willing participants in your experiment, just like the women who were too old to conceive. It&#8217;s really fun to see your mathematical model come to life at the macro-economic scale! Perhaps you need to tweak a few parameters for next time. You back up your pride with the assurance that <a href="http://www.economist.com/node/14165405">the crisis was not predicted, because economic theory predicts that such events cannot be predicted</a>. A trivial mathematical truth that applies to just about everything that is complex and stochastic but that, said by you with your awards and recognitions to back you up, sounds like something complex that lay people are unable to understand.</p>
<p>Θ</p>
<p>I don&#8217;t know about you, reader, but this state of affairs in Economics research, what is going on in world-famous Business Schools and their involvement with high-finance companies, bothers me a lot. Some of my esteemed academic colleagues seem to be completely out of control and out of touch with reality &#8212; not unlike the state of affairs in Medical research until that community felt the need to establish Institutional Review Boards in the second half of the 20th century. I have no doubt they have good intentions &#8212; most people don&#8217;t do evil on purpose, not even those physicians in the past who committed atrocities in the name of advancing knowledge. Economists happen to live and work in the <a href="http://en.wikipedia.org/wiki/Herbert_Simon">sciences of the artificial</a>, a place of mind that has very little contact with people and empirical data (how I understand the appeal!). They dwell in their stochastic, abstract models of how markets work, and, in the absence of the need for empirical validation, it&#8217;s easy to get enamored of such models. Economists would be as harmless as theoretical computer scientists&#8230; if it weren&#8217;t for the real damage that comes from their collaboration with greedy financiers, who see in those models additional chances to make a buck.</p>
<p>Perhaps it&#8217;s time for the Economics research community to start having an internal conversation about their trade, and for the rest of us academics to put some pressure on them. <a href="http://ineteconomics.org/blog/inet/john-kay-map-not-territory-essay-state-economics">Here is a start &#8212; an eye-opening essay</a> by <a href="http://www.johnkay.com/">one of them</a>. This essay may be hard to digest, so here&#8217;s a <a href="http://www.sonyclassics.com/insidejob/">documentary</a> that deconstructs the financial meltdown in simple bottom-line punches and puts some of the spotlight on some of those academics &#8212; watch it on Netflix.</p>
]]></content:encoded>
			<wfw:commentRss>http://tagide.com/blog/2011/11/ethics-in-economics/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Producing SPLASH</title>
		<link>http://tagide.com/blog/2011/10/producing-splash/</link>
		<comments>http://tagide.com/blog/2011/10/producing-splash/#comments</comments>
		<pubDate>Sat, 22 Oct 2011 19:45:07 +0000</pubDate>
		<dc:creator>crista</dc:creator>
				<category><![CDATA[academia]]></category>
		<category><![CDATA[conferences]]></category>
		<category><![CDATA[SPLASH]]></category>
		<category><![CDATA[SPLASHcon]]></category>

		<guid isPermaLink="false">http://tagide.com/blog/?p=823</guid>
		<description><![CDATA[&#160; &#160; I&#8217;m chairing SPLASH/OOPSLA this year. That means that I&#8217;m like a Producer, I get to do all the work behind the scenes in order to make the conference come to life. And it&#8217;s finally coming to life. After &#8230; <a href="http://tagide.com/blog/2011/10/producing-splash/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://splashcon.org/2011"><img class="alignleft" title="SPLASH Conference" src="http://splashcon.org/2011/templates/splash2011/images/logotype2011-2.png" alt="" width="480" height="76" /></a></p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>I&#8217;m chairing SPLASH/OOPSLA this year. That means that I&#8217;m like a Producer, I get to do all the work behind the scenes in order to make the conference come to life. And it&#8217;s finally coming to life. After a year and a half of &#8220;programming,&#8221; I just pressed &#8220;Run.&#8221; It&#8217;s a little crazy if you believe in agile. A whole year and a half of designing and &#8220;programming,&#8221; with no testing whatsoever, no small chunks, just a long process of envisioning, estimating, guessing, coordinating, signing contracts, making decisions; then we unleash the event over 5 days on almost 600 people and hope for the best!</p>
<p>So what&#8217;s involved in producing a conference like SPLASH? Read on if you want to know.</p>
<p><span id="more-823"></span>For the most part, there is a huge amount of coordination work that needs to be done. The producer needs to coordinate with the ACM (the sponsor of the conference), the hotel(s), the conference venue, the A/V and Internet people, external restaurants/entertainment, the industry supporters, and the registration people, among 6 or so other miscellaneous services. This is the administrative, logistic and operational side of the conference. I had produced one conference before SPLASH; for that one, I had no idea what I was getting myself into, so I ended up doing all this work myself, because I realized (too late) that if <em>I</em> didn&#8217;t do it, it wouldn&#8217;t happen &#8212; a duh moment for first-timers. For SPLASH, I knew better. So from the outset I brought in one wonderful person from the <a href="http://www.isr.uci.edu/">Institute for Software Research</a>, <a href="http://www.isr.uci.edu/%7Ebrodbeck/">Debi Brodbeck</a>, who is an absolute maniac when it comes to getting things done. She is an &#8220;animal&#8221; in the sense that Paul Graham uses that word in his essay &#8220;<a href="http://www.paulgraham.com/start.html">How to start a startup</a>.&#8221; If I ever started a startup and had to hire a COO, I would hire Debi in an instant. We&#8217;re so very lucky to have her at UC Irvine&#8230;! It really speaks to the environment we have at the University: we have these absolutely fantastic staff people who like to work there, when they probably would make a lot more money if they jumped to positions in companies. Not unlike us faculty, I guess&#8230; except that we&#8217;re usually not nice.</p>
<p>Anyway, besides the administrative, logistic and operational aspects of the conference, there are also strategic and content aspects to it. These are the ones I took upon myself to handle, again, with the help of Debi, who also knows a lot about these issues, and the Steering Committee of SPLASH. When I agreed to do this, I realized that the conference was in flux, trying to find its position in a context that is quite different from what we had in the 90s, when OOP was the big thing and everyone wanted to go to OOPSLA. It was this strategic challenge that made me agree to produce SPLASH 2011.</p>
<p>There are other big things now, and some of them have their own conferences; there are many developer-oriented conferences that took the ideas from OOPSLA, matured them, and made them even better for those audiences. For about two decades or so, OOPSLA has been right on the edge where academic and industrial research meets advanced development. It&#8217;s a balancing act at that edge. I didn&#8217;t think I could bring SPLASH back to the golden days of 2,000-people OOPSLA, and that wasn&#8217;t even a goal for me. Things have changed, and I loathe staring into the past. My goal here was to try to formulate a mission statement for SPLASH that goes beyond catchy, meaningless groups of words (we know OOP is not hot anymore, it&#8217;s everywhere), and that truly captures the uniqueness of this community &#8212; because I believe there&#8217;s something really unique here that we don&#8217;t find in any other conference.</p>
<p>So, what is the uniqueness of SPLASH? Why would someone attend SPLASH as opposed to, say, <a href="http://rubyconf.org/">RubyConf</a>, <a href="http://eclipsecon.org/">EclipseCon</a>, <a href="http://gotocon.com/">Goto</a>, <a href="http://www.qconferences.com/">QCon</a>, or academic conferences like <a href="http://sigplan.org/pldi.htm">PLDI</a>, <a href="http://sigplan.org/popl.htm">POPL</a>, and the like? As I said, SPLASH sits right at the edge of these two types of conferences. Look at the <a href="http://splashcon.org/2011/program">program</a> this year: the first keynote speaker is Turing Award winner <a href="http://en.wikipedia.org/wiki/Ivan_Sutherland">Ivan Sutherland</a>; the third keynote speaker is Mr. JavaScript <a href="http://en.wikipedia.org/wiki/Brendan_Eich">Brendan Eich</a>; the keynote speaker in the middle is a Swiss academic, <a href="http://www.inf.ethz.ch/personal/markusp/">Markus Puschel</a>, with some pretty wacky ideas on performance/productivity. Where else could we possibly find this combination of keynote speakers in one conference?!</p>
<p>This edge is not for everyone. Many people are better served if they go to conferences that have only Brendan Eich types of speakers, or Ivan Sutherland types of speakers, or Markus Puschel types of speakers, but not the combination of the three. And that&#8217;s OK. But this is the uniqueness of SPLASH: it&#8217;s a hybrid, a melting pot of software development approaches. As you go from session to session you may have the impression that you are traveling between distant planets!</p>
<p>The rest of the [vast] program reflects this hybrid combination, with academic research papers woven with experience reports, idea-papers and demonstrations. Even the 3 TechTalk speakers are a hybrid bunch: Jesper will geek out on &#8220;<a href="http://splashcon.org/2011/program/techtalks/197-how-to-handle-1-000-000-daily-users-without-using-a-cache">How to handle 1M daily users without a cache</a>;&#8221; Dave will entertain us with a rant on &#8220;<a href="http://splashcon.org/2011/program/techtalks/202-why-modern-application-development-sucks-death-by-objects-agile-middleware">Why modern application development sucks!</a>;&#8221; and Kresten will tell a more personal story of his involvement with <a href="http://splashcon.org/2011/program/techtalks/198-erlang-the-road-movie">Erlang</a>.  Let&#8217;s not forget the self-hybrid that is the incredible <a href="http://en.wikipedia.org/wiki/Guy_L._Steele,_Jr.">Guy Steele</a> doing a live demonstration of <a href="http://splashcon.org/2011/program/rpg/201-rpg-2011">singing calls</a>! The days preceding the main conference are also full of interesting talks and events with the same hybrid characteristic: from the <a href="http://www.dartlang.org/">Dart</a> people (at <a href="http://splashcon.org/2011/program/243">DLS</a> and <a href="http://design.cs.iastate.edu/vmil/2011/program.shtml">VMIL</a>), who have just unleashed one of the largest programming language design experiments ever over all of us, to <a href="http://ecs.victoria.ac.nz/Events/PLATEAU/Keynote">Brad Myers</a>, who studies the human aspects of programming in relatively controlled environments, to the AWS <a href="http://splashcon.org/2011/program/hackathon">Hackathon</a>.</p>
<p>You can probably sense the pride that I have in being the producer of this conference. I&#8217;m not going to hide it, I am proud of being part of this wacky hybrid! And I am extremely grateful to <a href="http://splashcon.org/2011/committee">all the people</a> who have helped put this conference together. As many said before me, the most important thing for a team leader to do is to put in place a great team and move him/herself out of the way!</p>
<p>Now, let me go monitor the execution of this test-less program&#8230; I have a few amulets in my pocket!</p>
]]></content:encoded>
			<wfw:commentRss>http://tagide.com/blog/2011/10/producing-splash/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A Theory of Aspects as Latent Topics</title>
		<link>http://tagide.com/blog/2011/09/a-theory-of-aspects-as-latent-topics/</link>
		<comments>http://tagide.com/blog/2011/09/a-theory-of-aspects-as-latent-topics/#comments</comments>
		<pubDate>Sat, 17 Sep 2011 17:26:42 +0000</pubDate>
		<dc:creator>crista</dc:creator>
				<category><![CDATA[research]]></category>
		<category><![CDATA[software repositories]]></category>
		<category><![CDATA[aop]]></category>
		<category><![CDATA[aspect-oriented programming]]></category>
		<category><![CDATA[LDA]]></category>
		<category><![CDATA[topic modeling]]></category>

		<guid isPermaLink="false">http://tagide.com/blog/?p=722</guid>
		<description><![CDATA[Underlying the work on Aspect-Oriented Programming (AOP) there is a premise that no one ever challenged: the existence of cross-cutting concerns that find their way to programs in a tangled and scattered manner. We&#8217;ve all seen it. But do tangling &#8230; <a href="http://tagide.com/blog/2011/09/a-theory-of-aspects-as-latent-topics/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.proustarchive.org/?p=60"><img class="alignleft" title="Proust Topic Model" src="http://farm6.static.flickr.com/5180/5535063164_f57b07d05a_z.jpg" alt="" width="264" height="232" /></a>Underlying the work on <a href="http://en.wikipedia.org/wiki/Aspect-oriented_programming">Aspect-Oriented Programming</a> (AOP) there is a premise that no one ever challenged: the existence of cross-cutting concerns that find their way to programs in a tangled and scattered manner. We&#8217;ve all seen it. But do tangling and scattering of program concerns really exist in real programs? Do they have a strong effect or is this one of those academic non-issues? That was the question we set out to answer in a paper we published at <a href="http://www.oopsla.org/oopsla2008/">OOPSLA 2008</a>. And the answer was: yes, these effects do exist in real programs, they are noticeable and detectable, and they reveal a few insights on the nature of those concerns. But they raise even more questions for AOP. Here is a summary of our study. For all the details, <a href="http://dl.dropbox.com/u/18483217/oopsla08.pdf">read the paper</a> [<a href="http://dl.acm.org/citation.cfm?id=1449807">1</a>].</p>
<p><span id="more-722"></span>Let me explain briefly how we detect aspects in this study. We don&#8217;t; we detect <em>topics</em> using an unsupervised topic modeling technique called <a href="http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">Latent Dirichlet Allocation</a> (LDA). This technique is used very successfully for topic modeling in text. It works by finding co-occurrences of words in bags of words (files/texts/whatever); the topics emerge given the most frequent co-occurrences of groups of words. It&#8217;s a probabilistic model, meaning that it has some nice smoothing properties and a few other goodies, such as the <em>entropy</em> of the distribution of those topics across files and within files &#8212; a direct measure of scattering and tangling of topics.</p>
<p>Let me backtrack a little. This work came to be because <a href="http://www1.chapman.edu/%7Elinstead/">Erik</a>, who, at the time, knew nothing about AOP, decided to run <a href="http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">LDA</a> on a few hundred Java projects we had lying around; then one day he showed me the latent topics that emerged from that experiment, and that&#8217;s when it hit me &#8212; holy crap!, these are aspects! So yeah, I wish I could say this was all part of a grand plan, but it wasn&#8217;t; it was an accidental finding.</p>
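<p>To make the entropy idea concrete, here is a minimal sketch in Python. It is not the code from the paper: the topic-by-file weights are invented, and the names <code>entropy</code>, <code>scattering</code> and <code>tangling</code> are hypothetical. It assumes LDA has already assigned a weight to each topic in each file, and it only illustrates how an entropy across files (scattering) and an entropy across topics within a file (tangling) could be computed.</p>

```python
import math

def entropy(weights):
    """Shannon entropy (in bits) of non-negative weights, after normalizing them."""
    total = sum(weights)
    if total == 0:
        return 0.0
    probs = [w / total for w in weights if w > 0]
    return sum(-p * math.log2(p) for p in probs)

# Hypothetical LDA output: weights[topic][file] says how much of that topic
# was assigned to that file. The numbers are invented for illustration.
weights = {
    "logging": {"A.java": 5, "B.java": 5, "C.java": 5, "D.java": 5},
    "parsing": {"A.java": 20, "B.java": 0, "C.java": 0, "D.java": 0},
}

def scattering(topic):
    """Entropy of one topic's distribution across files: higher means more scattered."""
    return entropy(list(weights[topic].values()))

def tangling(filename):
    """Entropy of the topic mix inside one file: higher means more tangled."""
    return entropy([per_file[filename] for per_file in weights.values()])

for topic in weights:
    print(topic, "scattering:", scattering(topic))
for filename in ["A.java", "B.java"]:
    print(filename, "tangling:", tangling(filename))
```

<p>In this toy data the &#8220;logging&#8221; topic is spread evenly over four files, so its scattering is 2 bits, the maximum for four files, while &#8220;parsing&#8221; lives in a single file and has zero scattering.</p>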
<div id="attachment_747" class="wp-caption alignright" style="width: 222px"><a href="http://tagide.com/blog/2011/09/a-theory-of-aspects-as-latent-topics/scattering/" rel="attachment wp-att-747"><img class="size-medium wp-image-747" title="Scattering" src="http://tagide.com/blog/wp-content/uploads/2011/09/scattering-212x300.jpg" alt="" width="212" height="300" /></a><p class="wp-caption-text">Scattering of topics in the entire repository</p></div>
<p>After that first experiment, we tuned a few things in our method of using LDA (like, for example, carefully selecting which words to use) and ran it again over a much larger collection of Java projects. Again, the surprising finding was that the topics that emerged with high entropy across that very large collection of Java programs included, among many others, the concerns that the AOP community had been using as examples of aspects in the first place: persistence, authentication, exception handling, concurrency, etc. This was very interesting, and immediately made me want to explore how the concept of latent topic relates to the concept of aspect.</p>
<p>But something was still a bit off: we ran LDA on the entire collection of 5,000 projects or so; we were treating the entire collection as one big project. As such, and on second thought, it was not surprising to find that things like concurrency cut across many projects/files. The latent topics with high entropy read almost like a &#8220;list of things you need to deal with if you program in Java.&#8221; Perhaps if we ran it on a per-project basis, things would look different.</p>
<div id="attachment_760" class="wp-caption alignright" style="width: 310px"><a href="http://tagide.com/blog/2011/09/a-theory-of-aspects-as-latent-topics/scattering-small/" rel="attachment wp-att-760"><img class="size-medium wp-image-760" title="scattering-small" src="http://tagide.com/blog/wp-content/uploads/2011/09/scattering-small-300x194.jpg" alt="" width="300" height="194" /></a><p class="wp-caption-text">Scattering of topics in some projects</p></div>
<p>Indeed, things looked different on a per-project basis. There was some overlap on generic things like string manipulation; but the interesting thing is that some of those topics were very specific to the projects in question. For example, for <a href="http://www.jhotdraw.org/">JHotDraw</a> (a graphics editor) we got topics that are clearly related to drawing; for <a href="http://jikes.sourceforge.net/">Jikes</a> (a Java compiler) we got groups of words that are quite reasonable to find in a compiler; for <a href="http://www.zimmers.net/home/mud/index.html">CoffeeMud</a> (a MUD game engine) we got this mysterious word &#8220;mob&#8221; prominently represented in several topics; we had to look into the documentation of that project to solve the mystery: it stands for Mobile OBject, a central concept in that engine &#8212; basically NPCs.</p>
<div id="attachment_765" class="wp-caption alignright" style="width: 310px"><a href="http://tagide.com/blog/2011/09/a-theory-of-aspects-as-latent-topics/scattering-distribution/" rel="attachment wp-att-765"><img class="size-medium wp-image-765" title="scattering-distribution" src="http://tagide.com/blog/wp-content/uploads/2011/09/scattering-distribution-300x249.jpg" alt="" width="300" height="249" /></a><p class="wp-caption-text">Scattering curves for selected projects</p></div>
<p>Then we plotted the topic distributions for 5 projects on the same plot and saw an interesting fact: some projects show more topic scattering than others.  CoffeeMud is noticeably more scattered. JHotDraw has a handful of very scattered topics, but most of the topics in it are a lot less scattered than all topics in all other projects. Jikes is the most regular one, with less variation.</p>
<p>Could this be a good metric for measuring the accidental complexity of projects? I.e. is the CoffeeMud code base a mess compared to, for example, the code base of Jikes? This is definitely an interesting conjecture, but it&#8217;s nothing more than a conjecture at this point. That relates to the second never-challenged assumption underlying AOP: that tangling and scattering are &#8220;bad&#8221; for software development. Our study here shows that there is scattering and tangling of topics, and gives us a nice mathematical framework for quantifying those effects, but it says absolutely nothing about whether that is a good thing or a bad thing.</p>
<p>The results raise a lot of questions for AOP, and the paper has a lengthy discussion regarding the findings. I&#8217;m going to end this summary with the issue that I find the itchiest.</p>
<p>A lot of scattered topics we saw emerging were clearly uses of specific APIs &#8212; string manipulation, list manipulation, xml, io, etc. This resonates with my experience; some things are used everywhere. &#8220;String manipulation,&#8221; for example, could be seen as an <em>aspect</em> of programs; it would be perfectly reasonable to want to understand what the entire program is doing wrt that. If AOP says that scattering is bad, does it mean that scattering of string manipulations is bad? Should we localize string manipulations in programs? Should we extract them out into a separate module and do reverse binding?</p>
<p>If that would be odd, then a question is raised about the reason for doing that for concerns such as logging and concurrency. At the very least, that practice, advocated by AOP from early on, requires a better justification.</p>
<p>I have my own view on these questions, but that&#8217;s a topic for another paper/post.</p>
<p><em>This work was a collaboration between myself, <a href="http://www.igb.uci.edu/~pfbaldi/">Pierre Baldi</a> and <a href="http://www1.chapman.edu/%7Elinstead/">Erik Linstead</a>, with help from <a href="https://plus.google.com/116688183339080430878/about">Sushil Bajracharya</a>. Work in part supported by <a href="http://nsf.gov/">National Science Foundation</a> grants EIA-0321390, <em>CCF-0347902 and CCF-0725370</em> and a Microsoft Faculty Research Award</em>.</p>
]]></content:encoded>
			<wfw:commentRss>http://tagide.com/blog/2011/09/a-theory-of-aspects-as-latent-topics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Vandalism Detection in Wikipedia</title>
		<link>http://tagide.com/blog/2011/09/vandalism-detection-in-wikipedia/</link>
		<comments>http://tagide.com/blog/2011/09/vandalism-detection-in-wikipedia/#comments</comments>
		<pubDate>Fri, 16 Sep 2011 00:22:17 +0000</pubDate>
		<dc:creator>crista</dc:creator>
				<category><![CDATA[research]]></category>
		<category><![CDATA[social software systems]]></category>
		<category><![CDATA[PAN Workshop]]></category>
		<category><![CDATA[vandalism]]></category>
		<category><![CDATA[Wikipedia]]></category>

		<guid isPermaLink="false">http://tagide.com/blog/?p=674</guid>
		<description><![CDATA[If you have to develop a classifier for detecting vandalism in Wikipedia with just a small number of features, what kind of features give the best results? According to our latest work on vandalism detection in Wikipedia, to be presented &#8230; <a href="http://tagide.com/blog/2011/09/vandalism-detection-in-wikipedia/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://en.wikipedia.org/wiki/File:Banner_whose_side_are_you_on.png"><img class="alignleft" title="Whose side are you on?" src="http://upload.wikimedia.org/wikipedia/commons/c/c7/Banner_whose_side_are_you_on.png" alt="" width="283" height="300" /></a>If you have to develop a classifier for detecting vandalism in Wikipedia with just a small number of features, what kind of features give the best results? According to our latest work on vandalism detection in Wikipedia, to be presented at <a href="http://www.wikisym.org/ws2011/start">WikiSym 2011</a>, the best features are the ones pertaining to user behavior within the system &#8212; things like the deletion of other users&#8217; content, the survivability of the user&#8217;s additions, number of words deleted by a user, whether the user has a page on Wikipedia or not, etc. Other kinds of features such as textual and language model features are routinely used in email spam filters, but it turns out that these don&#8217;t do as well as the user behavior features. That&#8217;s right, the user behavior within these systems contains a very strong signal for detecting what the users are capable of doing in the future, and therefore can detect vandalism fairly well, especially the more subtle kinds of vandalism. I&#8217;ve been wanting to write an overview of this work for a long time, finally here it is. For all the details, <a href="http://www.ics.uci.edu/%7Esjavanma/WikiSym-2011.pdf">read the paper</a>.</p>
<p><span id="more-674"></span>The ultimate goal of this line of work in my group is to get a better understanding of the social dynamics that emerged on the Web with user-generated content sites &#8212; wikis, social networks, virtual worlds, etc. Unlike interactions in the real world, these sites collect an enormous amount of data regarding the interactions that people have with each other and with the content. As such, one can now feed that data into computing machinery and hope to gain insights on social dynamics [on the Web, at least, but maybe beyond]. Wikipedia happens to be a wonderful playground for that line of investigation, because all the data is available.</p>
<p>Vandalism is a very strong word but with a <a href="http://en.wikipedia.org/wiki/Wikipedia:Vandalism">clear definition</a> within Wikipedia. Wikipedians have strict <a href="http://en.wikipedia.org/wiki/Wikipedia:Policies_and_guidelines">policies and guidelines</a> for editing articles, and they spend a lot of time fighting editorial &#8220;crimes.&#8221; They have some <a href="http://en.wikipedia.org/wiki/Wikipedia:Bots">bots</a> that do basic housekeeping; some of those bots [<a href="http://en.wikipedia.org/wiki/User:ClueBot_NG">1</a>] trigger alarms and/or revert edits in obvious cases of vandalism. But not all vandalism is obvious. Improving automatic vandalism detection is therefore a goal that many people set out to accomplish &#8212; including one of my former students, <a href="http://www.ics.uci.edu/%7Esjavanma/">Sara</a>, who just graduated last month and is now working for Microsoft. There is a research community around this topic that curates data, organizes workshops and puts up competitions from time to time &#8212; that&#8217;s <a href="http://pan.webis.de/">PAN</a>. Sara participated in this community, and <a href="http://www.uni-weimar.de/medien/webis/research/events/pan-10/task2-vandalism-detection.html#results">won 3rd prize</a> in the PAN competition in 2010.</p>
<p>The paper I&#8217;m focusing on here is the culmination of Sara&#8217;s work. Here is what we did, in a nutshell: first, we collected a relatively large number of features that had been known to be of value for vandalism detection in Wikipedia &#8212; a total of 66 features. Most of these features had been proposed and tested by other people; others were our own. These features all fell nicely into four groups: user features, textual features, metadata features and language model features. Then we trained a <a href="http://en.wikipedia.org/wiki/Random_forest">random forest classifier</a> with all those 66 features using the <a href="http://www.webis.de/research/corpora">PAN corpus</a> training set. Finally, we ran that classifier on the PAN corpus test set. With such a feature-rich model we were able to achieve the highest performance ever reported for that corpus, an <a href="http://machine-learning.blogspot.com/2008/07/auc-as-performance-metric-in-ml.html">AUC</a> of 0.9553 &#8212; the previous record was 0.9218.</p>
<p>However, feature-rich models aren&#8217;t very practical; they are slow to compute. For all practical uses of ML, especially the ones that run online, we need models with few and cheap features. Machine learning approaches tend to suffer from this problem: we throw in a large number of intuitions about what matters for classification and let the machine figure it out, but the machine doesn&#8217;t ever tell us what&#8217;s really important, what&#8217;s not so important, or how the features correlate. So we need to do extra work in order to find that out. Here&#8217;s what we did.</p>
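<p>For readers who have not met AUC before, here is a minimal sketch in Python of what the number measures: the chance that a randomly picked vandalism edit gets a higher score than a randomly picked regular edit. This is not the PAN evaluation code; the labels and scores are invented, and the function name <code>auc</code> is hypothetical.</p>

```python
def auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs that the scores rank correctly.
    Ties count as half. Quadratic in the input size, fine for a sketch."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

# Invented example: 1 marks a vandalism edit, 0 a regular edit;
# scores are a classifier's estimated probability of vandalism.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]

print(auc(labels, scores))
```

<p>Here 11 of the 12 vandalism/regular pairs are ranked correctly, so the AUC is 11/12, roughly 0.917. A value of 1.0 would mean every vandalism edit outranks every regular edit, and 0.5 is what random scoring would give on average.</p>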
<p>In order to detect and eliminate redundant features, we performed two sets of experiments. First, we studied the contribution of each of the 4 groups of features to determine if any of those groups could be eliminated without a significant drop in AUC. Then we studied the contribution of each feature individually and used the results for eliminating redundant features, using a technique called <a href="http://www-stat.stanford.edu/~tibs/lasso.html">Lasso</a> (Least Absolute Shrinkage and Selection Operator).</p>
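<p>As a rough illustration of why Lasso is handy for this, here is a minimal sketch in Python of the linear-regression form of Lasso, solved by coordinate descent on a toy dataset. It is an assumption-laden example, not the setup from the paper: the data is invented, the function names <code>soft_threshold</code> and <code>lasso</code> are hypothetical, and feature 1 is deliberately an exact copy of feature 0 so that it is redundant. The point is only that the L1 penalty drives the weights of redundant or irrelevant features to exactly zero.</p>

```python
def soft_threshold(value, alpha):
    """Shrink value toward zero by alpha; values within alpha of zero become exactly zero."""
    if value > alpha:
        return value - alpha
    if -value > alpha:
        return value + alpha
    return 0.0

def lasso(rows, targets, alpha, sweeps=50):
    """Minimize (1/2n) * sum of squared errors + alpha * sum(|w_j|) by coordinate descent."""
    n, d = len(rows), len(rows[0])
    w = [0.0] * d
    for _ in range(sweeps):
        for j in range(d):
            rho = 0.0   # correlation of feature j with the residual that ignores feature j
            norm = 0.0  # squared length of feature j
            for x, y in zip(rows, targets):
                pred_without_j = sum(w[k] * x[k] for k in range(d) if k != j)
                rho += x[j] * (y - pred_without_j)
                norm += x[j] * x[j]
            w[j] = soft_threshold(rho / n, alpha) / (norm / n)
    return w

# Invented data: feature 0 drives the target, feature 1 is an exact copy of
# feature 0 (redundant), and feature 2 is unrelated to the target.
rows = [[1, 1, 1], [1, 1, -1], [-1, -1, 1], [-1, -1, -1]]
targets = [2, 2, -2, -2]

w = lasso(rows, targets, alpha=0.5)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
print(w, selected)
```

<p>On this data only feature 0 survives. Which of two identical features survives depends on the order in which they are visited, so the sketch says that one of a redundant pair can be dropped, not which one matters more.</p>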
<p>The first set of experiments told us that the User features were the most important group &#8212; by a lot. With the User features alone, we obtained an AUC of 0.9225. In the second set of experiments, we were able to reduce the model to 28 features (down from 66) and still obtain an AUC of 0.9505. These 28 features include features from all groups, but the User features have a strong presence.</p>
<p>And there you have it, this is how we answered the question &#8220;If you have to develop a classifier for detecting vandalism in Wikipedia with just a small number of features, what kind of features do best?&#8221; But I think the result we obtained is more interesting than its practical application to vandalism detection in Wikipedia. What the result suggests is that there are very strong signals associated with the users&#8217; actions within a system, i.e. who the user is, as given by the sequence of actions of that user within the system. It&#8217;s not just that someone added text; it&#8217;s <em>who</em> that person is. This gets at the concept of reputation, but goes at it from an implicit, within-the-system perspective, rather than with an explicit thumbs-up-thumbs-down kind of approach. It suggests that it is possible to automatically build extremely accurate models of users&#8217; reputations without explicit endorsements from other users.</p>
<p><em>This work was a collaboration with <a href="http://dub.washington.edu/people/david-mcdonald">David McDonald</a>, and it was supported by the <a href="http://nsf.gov/">National Science Foundation</a> under grant No. OCI-074806.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://tagide.com/blog/2011/09/vandalism-detection-in-wikipedia/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>File cloning in open source: the good, the bad and the ugly</title>
		<link>http://tagide.com/blog/2011/09/file-cloning-in-open-source-the-good-the-bad-and-the-ugly/</link>
		<comments>http://tagide.com/blog/2011/09/file-cloning-in-open-source-the-good-the-bad-and-the-ugly/#comments</comments>
		<pubDate>Wed, 14 Sep 2011 17:52:07 +0000</pubDate>
		<dc:creator>crista</dc:creator>
				<category><![CDATA[research]]></category>
		<category><![CDATA[software repositories]]></category>
		<category><![CDATA[file cloning]]></category>
		<category><![CDATA[icsm 2011]]></category>

		<guid isPermaLink="false">http://tagide.com/blog/?p=607</guid>
		<description><![CDATA[How much copying is there in open source projects? According to our recent study soon to be presented at ICSM 2011, more than 10% of files found in open source Java projects are clones of other files. That is a &#8230; <a href="http://tagide.com/blog/2011/09/file-cloning-in-open-source-the-good-the-bad-and-the-ugly/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://tagide.com/blog/2011/09/file-cloning-in-open-source-the-good-the-bad-and-the-ugly/clones/" rel="attachment wp-att-630"><img class="alignleft size-full wp-image-630" title="clones" src="http://tagide.com/blog/wp-content/uploads/2011/09/clones.jpg" alt="" width="273" height="185" /></a>How much copying is there in open source projects? According to our recent study soon to be presented at <a href="http://www.cs.wm.edu/icsm2011/">ICSM 2011</a>, more than 10% of files found in open source <em>Java</em> projects are clones of other files. That is a lot. But those clones are only in about 15% of projects, meaning that 85% of projects don&#8217;t have clones. And, it turns out, some cloning out there is relatively harmless, but we found some uglies too. Here&#8217;s a summary of our analysis of what&#8217;s going on with these clones. For the complete details, <a href="http://dl.dropbox.com/u/18483217/ossher-icsm2011.pdf">read the paper</a>.</p>
<p><span id="more-607"></span></p>
<h3>The clones</h3>
<p>To understand the nature of these clones, we looked into a large number of projects that had clones in them and identified the circumstances under which the files ended up being clones. Here&#8217;s a categorization of what we found:</p>
<ul>
<li><strong>Demos/tutorials</strong>: these are files that were specifically intended to be used as examples of functionality, usually for a specific library or framework. These files usually originate from example or demo programs, or are working code fragments included with tutorials. We found files for SWT, JBoss, Java Servlets, Swing, and a number of other well-known libraries.<br />
We think this category is a perfectly legitimate case for file cloning. These are files designed to be copied and executed in new projects, and they do not make up any integral part of the system.</li>
<li><strong>Small library / utility files</strong>: these are clones that appeared to come from small third-party libraries or self-contained utility files.  Some examples include a single-file Java port of GNU Getopt, a tool for encoding PNG files, a Java connector for Spidermonkey, and a file for converting CVS date strings.<br />
It is difficult to classify this category as strictly good or bad. On the one hand, libraries are being reused through copying, which eliminates the connection to their original source and carries with it a whole host of maintenance issues. On the other hand, these libraries are all rather small and their functionality not overly complicated. Furthermore, especially in the case of copied utility files, there may not have been any reasonable way to reuse the functionality without copying the files. And developers might be hesitant to introduce an external dependency for the use of a handful of files.</li>
<li><strong>Library</strong>: this category differs from the previous one only in that the copied libraries are larger and better known. Examples of this category are split between those projects that copied significant portions of common libraries, and those that copied only a handful of files. The copied libraries include Jython, Apache Beanutils, SWT, JUnit, and the Java Excel API.<br />
The most interesting example we found in this category was this: the version of Apache Lenya in our dataset contained a complete copy of Apache Cocoon, a Spring-based framework that Lenya uses. In trying to discover why this copy existed, we looked at the most recent version of Lenya, which no longer contains a copy of Cocoon. Instead, it has been replaced by a script to check out Cocoon and then automatically apply a number of patches. It appears the developers of Lenya needed to modify portions of Cocoon, and originally did this by copying the entire library. Only later did they settle upon the patching mechanism to achieve the same result.<br />
Clones in this category are a bit uglier than those in the previous category. These libraries are larger and more complex, and so are more likely to contain bugs. As seen with Lenya, one possible motivation behind this copying is developers wishing to modify portions of the library. They might also wish to remove aspects that they don’t need. While this is understandable, we would hesitate to recommend such action except when absolutely necessary. Lenya’s current solution is clearly preferable.</li>
<li><strong>Related Projects</strong>: this category includes those clones that occurred between two projects that are related in some way. This includes project forks, sub-projects, and renamed projects. Roughly 1/3 of the cases in this category were due to a developer beginning a new project by copying the entirety of an older project.<br />
This type of cloning can clearly impact the maintainability of a system, but, if handled properly, forms a reasonable part of an open source project’s lifecycle.</li>
<li><strong>Duplicated Project</strong>: This category is similar to the previous category, except instead of the projects simply being related, they are exact copies. The most common cause of this is that the project has been simultaneously placed in multiple open source repositories. Usually the actual version control is handled by one repository, while the other contains a package distribution. While projects in this category are not clones in the same sense as the other categories, their duplication can be a source of confusion to those looking to find a project’s real home.</li>
<li><strong>Java Standard Library</strong>: this category was quite unexpected. We were surprised to find so many projects containing Java itself. On further investigation, we discovered a large number of applications designed to transform Java code that contained their own, often modified, versions of the standard library. We found tools for converting Java to JavaScript, multiple implementations of Java Virtual Machines, and a few Java compilers. Cloning of this type is necessary, but also very limited in the scope of projects that require it.<br />
We also found a large number of projects including files from org.xml.sax and org.w3c.dom, despite their being included with the JDK. These libraries are something of a special case: they are on a more frequent release schedule than the JDK itself, and were not included in older versions.</li>
<li><strong>Java Extensions</strong>: this category contains copies of files from common Java extensions, such as JAXP or JMX. While these extensions are now packaged with the JDK, this has not historically been the case. The sheer number of times that slightly different versions of these extensions appear as source in different projects suggests the importance of a better mechanism for handling extensions before their inclusion in the JDK.</li>
<li><strong>Other</strong>: the common theme of this category is that its members are extremely ugly examples of file cloning. There were two related projects that copied an Apache library, yet renamed every package to something slightly different. In an extreme case, a developer copied someone else’s Java application for animating images and uploaded it to a new repository under his name.</li>
</ul>
<h3>The data</h3>
<p><a href="http://tagide.com/blog/2011/09/file-cloning-in-open-source-the-good-the-bad-and-the-ugly/project-sizes/" rel="attachment wp-att-612"><img class="alignright size-medium wp-image-612" title="project-sizes" src="http://tagide.com/blog/wp-content/uploads/2011/09/project-sizes-300x192.png" alt="" width="300" height="192" /></a>The data consisted of a universe of 13,241 Java projects collected from repositories such as Apache, Java.net, Google Code, and SourceForge, totaling 3,237,910 files. The picture on the right shows the distribution of project size in this collection. Read it like this: there are about 5,000 projects with 1 to 50 files (left-most column), and so on, all the way to about 90 projects with 2,001 to 17,893 files (right-most column). Indeed, there are a few massive projects out there!</p>
<p>I&#8217;d like to hear people&#8217;s personal experiences with file cloning in their projects. Do you copy and paste code a lot? At the file level, or at a smaller or bigger granularity?</p>
<p><em><em>This study was led by <a href="http://www.ics.uci.edu/%7Ejossher/">Joel Ossher</a>, with help from <a href="http://www.isr.uci.edu/%7Ehsajnani/">Hitesh Sajnani</a> </em>and with me supervising. The work was supported by the <a href="http://www.nsf.gov/">National Science Foundation</a> under Grant No. CCF-1018374.<br />
</em></p>
]]></content:encoded>
			<wfw:commentRss>http://tagide.com/blog/2011/09/file-cloning-in-open-source-the-good-the-bad-and-the-ugly/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Graduate School Application Dos and Don&#039;ts</title>
		<link>http://tagide.com/blog/2011/09/graduate-school-application-dos-and-donts/</link>
		<comments>http://tagide.com/blog/2011/09/graduate-school-application-dos-and-donts/#comments</comments>
		<pubDate>Tue, 06 Sep 2011 03:07:12 +0000</pubDate>
		<dc:creator>crista</dc:creator>
				<category><![CDATA[academia]]></category>
		<category><![CDATA[advice]]></category>

		<guid isPermaLink="false">http://tagide.com/blog/?p=362</guid>
		<description><![CDATA[It&#8217;s the beginning of a new academic year. With it, there comes a new wave of inquiries about applying to UCI/ICS graduate programs and joining my research group. I&#8217;ve seen these waves every year for the past 9 years. The &#8230; <a href="http://tagide.com/blog/2011/09/graduate-school-application-dos-and-donts/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://tagide.com/blog/2011/09/graduate-school-application-dos-and-donts/50221978_02702_0157/" rel="attachment wp-att-409"><img class="alignleft size-thumbnail wp-image-409" title="50221978_02702_0157" src="http://tagide.com/blog/wp-content/uploads/2011/09/50221978_02702_0157-150x150.jpg" alt="" width="150" height="150" /></a>It&#8217;s the beginning of a new academic year. With it comes a new wave of inquiries about applying to <a href="http://www.ics.uci.edu">UCI/ICS</a> graduate programs and joining my research group. I&#8217;ve seen these waves every year for the past 9 years. The vast majority of these inquiries don&#8217;t pass my mental spam filter; a small percentage do; an even smaller percentage end up being accepted. I thought I&#8217;d write down my thoughts on these inquiries. I know that amongst the hordes of applicants who fail to make a good impression on prospective advisors, there are a few bright ones to whom that happens because of a lack of awareness or bad advice. This post is for them. If they find it.</p>
<h3><span id="more-362"></span>The Insider Scoop</h3>
<p>Here&#8217;s the secret: if your resume is just plain good, and not insanely awesome, the two most important pieces of the application process are (1) your direct contact with a prospective advisor, and (2) your recommendation letters. Forget about GPAs and GREs. <em>You</em> and the interactions you have with others are <em>it</em>. Professors like me are looking for <em>good people</em> who are driven to produce <em>great work</em> and who can grow into <em>independent intellectuals</em> under their supervision. GPAs and GREs give some information about you, but they fail to capture the potential you have for doing independent work. Besides the signs of potential for good, independent work (usually given by papers and/or projects), the impression that you make on your prospective advisor is the deal maker or breaker. As such, it is really important that you prepare for that, as much as (or more than) you prepare for the GRE.</p>
<p>Matt has some good <a href="http://matt.might.net/articles/how-to-apply-and-get-in-to-graduate-school-in-science-mathematics-engineering-or-computer-science/">advice on how to get into grad school</a>. My post here focuses on the interactions that you will have with prospective advisors.</p>
<h3>Your Context</h3>
<p>So, you are sitting there in front of your computer &#8212; in India, China, Iran, or even in the U.S. You just graduated, or are about to graduate, from an undergraduate institution, and you have no idea what you want to do with the rest of your life. The prospect of getting a job doesn&#8217;t entice you. Or maybe you tried it, and you think that there ought to be more intellectual stimulation in your life. You like being in school, you admired some of your professors, you enjoyed doing some projects, you have good grades, and your family values Higher Education. Some of your friends and acquaintances are doing a PhD, and you have a secret envy for them &#8212; not the least of it because they seem to continue to have the relatively care-free lifestyle that characterizes student life, and that is very different from the lifestyle that your <em>other</em> friends have, those who got a job. And then it hits you &#8212; you could do it too! I&#8217;ll leave the reasons for choosing to go to graduate school for another post. Let&#8217;s just say you made up your mind. But where to start?</p>
<p>You frantically Google for information about graduate schools. You hit the so-called &#8220;top 10&#8221; universities, and you day-dream about being a nerd-wiz at MIT. Then reality hits you, and you start looking at other options. You exhaustively browse through faculty web pages to check out what they do and imagine how it would feel working under their supervision. Finally you feel ready to shoot off the first contact messages. And this is the first opportunity you have&#8230; for your plans to fail. Right here, in the beginning.</p>
<h3>The Many Ways of Succeeding in Making a Bad Impression</h3>
<p>Want to make sure your application is ignored? Send an email that reads like spam. Here is the worst possible email you can send:</p>
<blockquote><p>Dear  Sir,</p>
<p>I am XX YY from UUU University in CCC. I am seeking an opportunity to use my background to do research in your prestigious lab. I have seen your publications and research work and they are of great interest to me. I am attaching my resume. I had acquired programming skills on C, C++, VC++, Java, Oracle, and MS SQL. Also I had conducted relevant projects during that period.</p>
<p>I am rather sorry to trouble you. It would be very helpful to me if you could convey to me the chances I have of getting into the PhD program in your university.</p>
<p>Sincerely yours,<br />
XX YY</p></blockquote>
<p>OK, stop right there. If this is the best you can do, don&#8217;t even bother doing it! And if the reason isn&#8217;t clear, let me explain what&#8217;s wrong with this email. There is nothing in this email that differentiates prospective University A from prospective University B, much less prospective advisor L from prospective advisor M. Why should <em>I</em> pay attention? It&#8217;s really easy <em>for you</em>, because it&#8217;s the exact same message, but what <em>message</em> are you <em>really</em> sending <em>me</em>, the receiver? Essentially, you are saying &#8220;I&#8217;m so lazy, I&#8217;m not bothering to write down anything that will force me to manually type a few differentiating parts in the 100 emails that I am about to spam professors with.&#8221; Professors don&#8217;t like spam and they also don&#8217;t want lazy people in their labs &#8212; at least not uninterestingly lazy like this. It&#8217;s a #SureFail.</p>
<p>(There is a slightly worse variation of this email, one that discloses your background in, say, signal processing, while spamming professors whose research areas have nothing to do with that&#8230;)</p>
<p>Here is the second-worst email that you can send: (notice the typesetting)</p>
<blockquote><p>Dear    <em>Professor Lopes</em>   ,</p>
<p>My name is XX YY from UUU University in CCC. I am writing to explore the possibility of becoming a PhD student in your lab at <em>University of California Irvine</em> .</p>
<p>I am very interested in pursuing research in    <em>Mining Software Repositories</em> . My previous educational and work experience has given me solid background in analysis/design/programming.</p>
<p>Attached please find my resume in word format. I am looking forward to your reply. Thank you.</p>
<p>XX YY</p></blockquote>
<p>This is slightly better than the first because at least you went to the trouble of writing down some differentiating bits. Unfortunately the typesetting reveals the spamming nature of this message. The differentiating parts were pasted using a different font, revealing that you have composed this message from a template with blanks in it. Then you simply went through a list that you may have compiled in Excel or something, listing professors&#8217; names and research areas, and voilà! &#8211; insta-spam again. #SureFail.</p>
<p>You may feel that I&#8217;m being unfair in calling this spam, because you have done your homework in researching professors and their research interests. But I&#8217;m not being unfair. This is still spam from where I stand. This message is telling me: &#8220;Hey, I really want to be admitted&#8230; somewhere, working with&#8230; someone, doing&#8230; something. I don&#8217;t care where, who or what, I just want to go to graduate school.&#8221; Well, professors like to think that their work matters. They aren&#8217;t just looking for generic brain power, they are looking for people who are a good fit and who show signs of being able to find out what they want.</p>
<h3>What Your First Email Should Look Like</h3>
<p>What I am about to tell you is about <em>quality</em> and is fairly incompatible with sheer quantity. If you do what I say, there&#8217;s just not enough time for you to do this for 100 professors. As such, narrow your search to just a few and target them wisely.</p>
<p>First of all, understand what kind of work goes on under that professor&#8217;s supervision. This means that you need to figure out what projects are going on in that lab, what publications are being produced, who that professor&#8217;s current graduate students are, and who the past graduate students were and where they are working. You need to read through a few of those papers. If you care about your career, you absolutely need to do this thorough background research, not just look at the professor&#8217;s home page to pick up a few keywords. You are about to embark on a 5-6 year engagement that will have a profound effect on the rest of your life. Caring about who, what and where you are committing yourself to, respecting yourself, is the basis for making a good impression on others.</p>
<p>Then write emails specifically for each professor you are interested in working with. Don&#8217;t send the application essay in that email; leave that where it belongs, in the application materials. In the emails, be short, but be specific for each of them. Just cover the main points. Tell them what it is about their labs&#8217; work that got your attention, and why you&#8217;d like to join. If you refer to a paper or two written by the professors and their students, and make some intelligent comments about them, I guarantee you will get the professors&#8217; attention.</p>
<p>As I said above, this takes time and engagement on your part. But here&#8217;s the thing: in the absence of substantial research experience, the time you take to study the prospective professors&#8217; work and the intellectual engagement you show with that work are the first real indicators of your potential as a researcher, and of how good a fit you are for the professors&#8217; labs.</p>
<h3>Coda</h3>
<p>If you read the entire post, feel free to <a href="mailto:lopes@ics.uci.edu">email me</a> if you&#8217;re considering applying to <a href="http://www.ics.uci.edu">UCI/ICS</a> for graduate work, and especially if you are interested in <a href="http://mondego.ics.uci.edu">my lab</a>. Besides the &#8216;old&#8217; program in Information and Computer Sciences, we have a shiny new Ph.D. program in <a href="http://se.ics.uci.edu/Welcome.html">Software Engineering</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://tagide.com/blog/2011/09/graduate-school-application-dos-and-donts/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

