<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-5290984163937382875</id><updated>2012-01-21T21:51:20.981+02:00</updated><category term='Merge'/><category term='step #3'/><category term='Step #1'/><category term='eclass'/><category term='rclass'/><category term='foreach'/><category term='return list'/><category term='local'/><category term='Step #4'/><category term='ereturn list'/><category term='gen'/><category term='programming'/><category term='macros'/><category term='Combine Datasets'/><category term='conditions'/><category term='forvalues'/><category term='Step #6'/><category term='replace'/><category term='Append'/><category term='string'/><category term='Data Manipulation'/><category term='keep'/><category term='Step #5'/><category term='drop'/><category term='egen'/><category term='global'/><category term='group-wise statistics'/><category term='Data Transfer'/><category term='Step #2'/><category term='Joinby'/><category term='extended functions'/><category term='if'/><category term='Automation'/><category term='_Introduction_'/><title type='text'>The Stata Project-Oriented Guide</title><subtitle type='html'>Making your Stata project work!
Data manipulation, programming and other tips to make your Stata project more reliable, robust and easier to work with</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://stataproject.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5290984163937382875/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://stataproject.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>stataman</name><uri>http://www.blogger.com/profile/13749462261829425392</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>7</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-5290984163937382875.post-5299930648861185663</id><published>2016-01-01T08:00:00.006+02:00</published><updated>2009-10-16T06:26:33.268+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='_Introduction_'/><title type='text'>Introduction</title><content type='html'>&lt;img id="BLOGGER_PHOTO_ID_5140383595041517634" style="margin: 0px 10px 10px 0px; float: left;" alt="" src="http://3.bp.blogspot.com/_iuN8kqdF7_g/R1ZPpQ-x8EI/AAAAAAAAAdM/gQ0QayWJaDo/s200/computer+on+desk+with+stata.JPG" border="0" /&gt;&lt;br /&gt;&lt;div&gt;&lt;p&gt;This blog is a free Stata tutorial. I have been using Stata for the last two years now for different applied work in economics and other fields of the social sciences. If you are in your undergraduate or graduate studies or if you are working for some agency that performs social research, you will probably need to use Stata in the context of your project. Stata has an extensive manual which is very accessible, in my opinion, but in order to know how to use it, one needs to already know the commands' names. &lt;/p&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;&lt;p&gt;However, if you are new to Stata, and you have a project to do, there is a sequence of actions you probably need to do. This tutorial is constructed to follow this sequence: data assembly and construction of additional variables. Then I deliberately skip talking about commands that perform statistical analyses and leave it to your statistics or econometrics courses. But the second part of the tutorial (steps #5-#8) are dedicated to automating those commands and the creation of tables which will report the results. In addition, there are best practices of how to write code that will be easy to follow and change if needed.&lt;/p&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;&lt;p&gt;I am assuming the reader has basic knowledge of Econometrics (regressions etc.) and I will not get into issues of how to specify an appropriate model. I will concentrate, though, on the practical steps one needs to do before and after the regressions, and how to organize the code so as to minimize mistakes.&lt;/p&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;&lt;p&gt;The tutorial is divided to steps. You might not need to go through all the steps, so feel free to move on if you see the step is irrelevant for you. You can also navigate through the tutorial with the labels on the left bar. They consist of keywords (like an index) and steps numbers (like a table of contents).&lt;/p&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;&lt;p&gt;The steps are as follows (keep in mind that the tutorial is still under construction):&lt;/p&gt;&lt;/div&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://stataproject.blogspot.com/2007/12/project-1-getting-data.html"&gt;Step #1: Getting the Data&lt;/a&gt; - if your Data isn't in .dta format - Excel as an example&lt;/li&gt;&lt;li&gt;&lt;a href="http://stataproject.blogspot.com/2007/12/combine-multiple-datasets-into-one.html"&gt;Step #2: Combine Multiple Datasets into One&lt;/a&gt; - for datasets already in .dta format&lt;/li&gt;&lt;li&gt;&lt;u&gt;&lt;a href="http://stataproject.blogspot.com/2007/12/step-3-simple-data-manipulation.html"&gt;Step #3: Simple Data Manipulation&lt;/a&gt;&lt;/u&gt; - generate variables, change values and drop variables or observations&lt;/li&gt;&lt;li&gt;&lt;u&gt;&lt;a href="http://stataproject.blogspot.com/2007/12/step-4-thank-god-for-egen-command.html"&gt;Step #4: Thank God for the &lt;span style="font-family:courier new;"&gt;egen&lt;/span&gt; Command&lt;/a&gt;&lt;/u&gt; - a very powerful command that extends the possibilities of data manipulation.&lt;/li&gt;&lt;li&gt;&lt;u&gt;&lt;a href="http://stataproject.blogspot.com/2008/02/step-5-using-commands-outputs.html"&gt;Step #5: Keeping commands' calculations&lt;/a&gt;&lt;/u&gt; - How to tell your program to use the output from reg, sum, etc.&lt;/li&gt;&lt;li&gt;&lt;u&gt;&lt;a href="http://stataproject.blogspot.com/2008/12/step-6-automation-separating-men-from.html"&gt;Step #6: Automation&lt;/a&gt;&lt;/u&gt; - macros, loops, and other sorts of fun&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;u&gt;Step #7: Exporting Results to a Spreadsheet&lt;/u&gt; - Excel as an example&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;u&gt;Step #8: Program Definition&lt;/u&gt; - if you start to see the same code in many .do files, maybe you should read this step.&lt;/li&gt;&lt;/ul&gt;I did not find time to fill in steps #7 and #8 here. For step #7, and other good things, I prepared some slides for a short Stata sequence I gave at the department. &lt;a href="http://www.stanford.edu/%7Eroymill/cgi-bin/computing/material.php"&gt;You can find them here&lt;/a&gt;.&lt;br /&gt;&lt;ul&gt;&lt;p&gt;&lt;/p&gt;&lt;/ul&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;Good luck!&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5290984163937382875-5299930648861185663?l=stataproject.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://stataproject.blogspot.com/feeds/5299930648861185663/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5290984163937382875&amp;postID=5299930648861185663' title='145 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5290984163937382875/posts/default/5299930648861185663'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5290984163937382875/posts/default/5299930648861185663'/><link rel='alternate' type='text/html' href='http://stataproject.blogspot.com/2016/01/introduction.html' title='Introduction'/><author><name>stataman</name><uri>http://www.blogger.com/profile/13749462261829425392</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_iuN8kqdF7_g/R1ZPpQ-x8EI/AAAAAAAAAdM/gQ0QayWJaDo/s72-c/computer+on+desk+with+stata.JPG' height='72' width='72'/><thr:total>145</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5290984163937382875.post-417234010002329473</id><published>2008-12-28T21:21:00.021+02:00</published><updated>2008-12-31T12:03:10.718+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='foreach'/><category scheme='http://www.blogger.com/atom/ns#' term='extended functions'/><category scheme='http://www.blogger.com/atom/ns#' term='Automation'/><category scheme='http://www.blogger.com/atom/ns#' term='global'/><category scheme='http://www.blogger.com/atom/ns#' term='programming'/><category scheme='http://www.blogger.com/atom/ns#' term='string'/><category scheme='http://www.blogger.com/atom/ns#' term='Step #6'/><category scheme='http://www.blogger.com/atom/ns#' term='if'/><category scheme='http://www.blogger.com/atom/ns#' term='local'/><category scheme='http://www.blogger.com/atom/ns#' term='forvalues'/><category scheme='http://www.blogger.com/atom/ns#' term='macros'/><title type='text'>Step #6: Automation - Separating the Men from the Boys</title><content type='html'>&lt;div style="text-align: left;"&gt;Before we start, to make sure, when I'm talking about automation I'm referring to all sorts of ways to write a program in Stata that will run, and save output from, commands in a more-or-less structured program without having to write all commands separately. In the Stata manual you will sometimes see the term Automation as reserved to OLE automation, which is making Stata available to Microsoft Office programs. I will not deal with this automation at all.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;&lt;span style="font-weight: bold;"&gt;Intro: Why?&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;/div&gt;OK, so what is automation good for? We had a preview in the previous step: One thing we can do is to calculate things based on outputs from commands we ran, before seeing the output and automatically by the program: divide mean by standard deviation, add 1.96 standard errors to the mean, and so on. Of course, one way (the boys way) is to run the command, and then calculate it with the &lt;span style="font-family:courier new;"&gt;di &lt;/span&gt;command, or in a spreadsheet or calculator. But the men don't do things manually. They tell their programs what to do.&lt;br /&gt;Another thing we can do is to avoid repeating similar commands. Instead of having 50 rows of the same reg command but with different regressors or different samples, we can write a loop that will do it in 5 rows. This is good not only to save rain forests when you print out your code, but it also puts a structure to your regressions and reduces the chances that when you need to change something in the command (report clustered standard errors for example), you'll need to do it just once and not 50 times (or 48 times and forget to change 2 of the regressions accidentally).&lt;br /&gt;Finally, what we can also do is to construct tables of the results we want to report. But we'll deal with this possibility in the following step.&lt;br /&gt;When is it better to leave automation out? Probably when you need just a few regression and you're not doing anything too time-consuming with the output. Think of automation as an investment you do in your program. It entails a fixed cost of thinking about the structure and implementing it, but above a certain threshold the benefits of having the program do most of the work for you. What I used to do in many cases when I was starting to automate do-files, was to first write the program simply and then when I saw that I'm starting to repeat almost the same code (Copying and pasting like there's no tomorrow) I started thinking of how to automate things.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold; font-style: italic;"&gt;Macros&lt;/span&gt;&lt;br /&gt;A macro is a word (or a string) that whenever we write it in a Stata command, before running the command, Stata replaces this word by a string  (a set of characters) that is set for this macro, and only after the replacement, it runs the command.&lt;br /&gt;Enough with the definitions, let's see a simple example:&lt;br /&gt;To define a macro x that contains the value 6, run the following line (without the . in the beginning):&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-family:courier new;"&gt;. local x = 6&lt;/span&gt;&lt;br /&gt;&lt;/blockquote&gt;Now Stata has assigned a place in memory called x and put 6 in it. So whenever we want to tell Stata to use this x we saved, we use backquote (usually the key in the top-left corner of the main keyboard, left to "1") and quote (single quote) characters:&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-family:courier new;"&gt;. di `x' + 5&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;11&lt;/span&gt;&lt;br /&gt;&lt;/blockquote&gt;What really happens, behind the scene, is that Stata first sees the ` followed by the ', and looks within them to find a local we previously defined. Then it replaces this referral by the value that was saved:&lt;br /&gt;First (as typed): &lt;span style="font-family:courier new;"&gt;di `x' + 5&lt;/span&gt;&lt;br /&gt;Then (replacing `x' by 6): &lt;span style="font-family:courier new;"&gt;di 6 + 5&lt;/span&gt;&lt;br /&gt;And only then when no `' are left in the command, Stata will run it and return 11.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:85%;"&gt;Note: if nothing was defined for y, then Stata will replace the `y' by nothing:&lt;br /&gt;First (as typed): &lt;span style="font-family:courier new;"&gt;di `y' + 5&lt;/span&gt;&lt;br /&gt;Then (replacing `y' by an empty string): &lt;span style="font-family:courier new;"&gt;di  + 5&lt;/span&gt;&lt;br /&gt;And then you will get an error because the &lt;span style="font-family:courier new;"&gt;di &lt;/span&gt;command can't handle "+ 5" as input. Note that you will not always get an error. If Stata has no problem running the command after replacing the macro by an empty string, then it will run. Many annoying bugs in your programs will stem from this problem.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;We can also put strings inside macros:&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-family:courier new;"&gt;. local x = "Hi there, how are you?"&lt;/span&gt;&lt;br /&gt;&lt;/blockquote&gt;or&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-family:courier new;"&gt;. local dependent_variable = "wage"&lt;/span&gt;&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;Why do we need the double quotes? Because otherwise Stata will think what we put after the = sign is part of our command and not simply a value. Specifically, when we put words (instead of numeric values) after the = sign, Stata thinks we're referring to a variable in the dataset (if there is, indeed a variable in that name, it will put the value of the first observation in this variable). Thus, to tell Stata you don't want the value of what's inside &lt;span style="font-family:courier new;"&gt;wage&lt;/span&gt;, but rather you want simply the name "wage" to be kept inside the macro dependent_variable&lt;br /&gt;&lt;br /&gt;Now, let's see how we can refer to these macros. Almost the same as when it was numeric:&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-family:courier new;"&gt;. di "`x'"&lt;br /&gt;Hi there, how are you?&lt;br /&gt;&lt;br /&gt;. sum `dependent_variable'&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/blockquote&gt;Why did we use the double quotes for the first command but not for the second?&lt;br /&gt;Remember what the macro does, it replaces the `macro' with what we've saved in it. So for the first line this would be:&lt;br /&gt;First (as typed): &lt;span style="font-family:courier new;"&gt;di "`x'"&lt;/span&gt;&lt;br /&gt;Then (replacing for the macro): &lt;span style="font-family:courier new;"&gt;di "&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;Hi there, how are you?&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;"&lt;/span&gt;&lt;br /&gt;So if we had not put the double quotes, it would had been equivalent to running:&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;. di &lt;/span&gt;&lt;span style="font-family:courier new;"&gt;Hi there, how are you?&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;br /&gt;&lt;/span&gt;But Stata will then look for a variable named Hi and won't find it. We didn't intend Stata to look up a variable, but to simply display a string as it is.&lt;br /&gt;&lt;br /&gt;However, when we ran the second line with the &lt;span style="font-family:courier new;"&gt;summarize &lt;/span&gt;(&lt;span style="font-family:courier new;"&gt;sum&lt;/span&gt;) command, we indeed wanted the command to treat wage not as just a word, but as a reference to the variable wage! In other words, we wanted to run:&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-family:courier new;"&gt;. sum wage&lt;/span&gt;&lt;br /&gt;&lt;/blockquote&gt;(and not &lt;span style="font-family:courier new;"&gt;sum "wage"&lt;/span&gt;)&lt;br /&gt;&lt;br /&gt;Three final remarks before we move on to loops (if you're still wondering why do we need all this, hang on).&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;local and global&lt;/span&gt; - you might have wondered why the command to define a macro is called local. In step #8, which hopefully will be written some day, we will deal with writing commands in Stata, and then it will matter. local means that the macro is defined within the program it was set in, and global means that all commands and programs can refer to the macro. &lt;/li&gt;&lt;/ul&gt;In any case, to define global, you will use the word global instead of local:&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-family:courier new;"&gt;. global x = 6&lt;/span&gt;&lt;br /&gt;&lt;/blockquote&gt;But to refer to the global later, we do something a little different. We use ${&lt;span style="font-style: italic;"&gt;macroname&lt;/span&gt;}&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-family:courier new;"&gt;. di ${x}&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;/blockquote&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Long strings&lt;/span&gt; - at least in Stata 9, there is a weird issue with defining macros for strings that are longer than 255 characters. You might think why do you need more than 255 characters, but it happens sometimes and unless somebody told you, this can be one of the most annoying bugs (Stata might simply cut your string after 255 characters...). To avoid that, what you need to do is define string macros without the = operator:&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-family:courier new;"&gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-family:courier new;"&gt;. local mylongstring "This is my very long string, and since it is longer than 255 characters, I omitted the = in its definition. Looks strange, but this is how it works. Good luck!"&lt;/span&gt;&lt;br /&gt;&lt;/blockquote&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Predefined macros&lt;/span&gt; - Stata has some macros within it that you might find helpful. They're not exactly macros, when I come to think of them, but we treat them as such (but without the `'). For example, _N holds the number of observations in the dataset (try to run &lt;span style="font-family:courier new;"&gt;di _N&lt;/span&gt;), _pi holds the number pi. For a few more, you can look up &lt;span style="font-family:courier new;"&gt;help _variables&lt;/span&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Nested macro reference&lt;/span&gt; - You can refer to a macro within another macro reference. What does that mean? Say you have one macro named x_a and another macro named x_b, you can define a macro named i and do the following:&lt;/li&gt;&lt;/ul&gt;&lt;blockquote&gt;&lt;span style="font-family:courier new;"&gt;local x_a = 800&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;local x_b = 43.2&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;local i = "b"&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;di `x_`i''&lt;/span&gt;&lt;br /&gt;&lt;/blockquote&gt;Note that it is not double-quote at the end of the di command but two single-quotes. What happens when Stata hits the di command is the following:&lt;br /&gt;First (as typed): &lt;span style="font-family:courier new;"&gt;di `x_`i''&lt;/span&gt;&lt;br /&gt;Then (replacing the innermost macro): &lt;span style="font-family:courier new;"&gt;di `x_b'&lt;/span&gt;&lt;br /&gt;Then (replacing the next innermost macro): &lt;span style="font-family:courier new;"&gt;di 43.2&lt;/span&gt;&lt;br /&gt;And then Stata will execute the command and shoe the number 43.2&lt;br /&gt;&lt;br /&gt;This can be handy when you have several macros and you want to alternate referring to them (i decides which of the x's to use), which sometimes you need to do inside loops.&lt;br /&gt;&lt;blockquote&gt;&lt;/blockquote&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Extended functions&lt;/span&gt; - Ever wondered how to save a variable's label? Ever wanted to count how many words there are in a string? Maybe you didn't, but sometimes there are things you need to save to a macro and you have no idea how to do that. In some of these cases, you might find your answer in the extended functions. They work a bit differently (but just like egen, you get many different features with variations on the same command):&lt;/li&gt;&lt;/ul&gt;&lt;blockquote&gt;&lt;span style="font-family:courier new;"&gt;local &amp;lt;macro_name&amp;gt; : &amp;lt;extended_function&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;/blockquote&gt;Note that we use : instead of = which tells Stata we're not using the regular functions but the extended functions. For example, say you want to keep the label of the variable w2gef (usually questionnaire data will have cryptic variable names but hopefully informative labels) inside the macro w2gef_label:&lt;br /&gt;&lt;blockquote style="font-family: courier new;" face="courier new"&gt;local w2gef_label : variable label w2gef&lt;/blockquote&gt;Another example:&lt;br /&gt;&lt;blockquote style="font-family: courier new;" face="courier new"&gt;local x "This is my string. How many words are in it?"&lt;br /&gt;local num_words : word count `x'&lt;br /&gt;local sixth_word : word 6 of `x'&lt;br /&gt;di "There are `num_words' words in x. The sixth of them is `sixth_word'"&lt;/blockquote&gt;The output will then be:&lt;br /&gt;&lt;blockquote style="font-family: courier new;" face="courier new"&gt;There are 10 words in x. The sixth of them is many&lt;/blockquote&gt;More on that in &lt;span style="font-family:courier new;"&gt;help extended_fcn&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Saving and reusing&lt;/span&gt; - I thought once that this part is obvious, but teaching Stata has taught me otherwise. So to make things clear... When you save with the "save" command or with the icon on the top left and so on, it just saves your &lt;span style="font-weight: bold;"&gt;data&lt;/span&gt;. It will not save the macros. Macros are part of programs. If you use the interpreter interface of Stata (the command line below the output window), then when you will close Stata, your macros will disappear. If you want to reuse them when you open Stata next time, you have got to work with .do files. Stata comes with a do-file editor, but you can write it in any text editor. Make sure from now on you work with .do files. Of course, experimenting commands with the interpreter is something which is always worth doing, but in the end keep the commands you liked in a do-file.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;&lt;span style="font-weight: bold;"&gt;Loops (and Conditions)&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;The power of automation comes mainly from the ability to create a loop and repeat commands with it. The next subsection will show some examples.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:85%;"&gt;* Before we start, the following examples sometimes have lines extend beyond the boundaries of the blog, so it cuts them to two. If you're not sure where each line ends and another one starts, copy the example to a text editor.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-family:courier new;" &gt;&lt;br /&gt;while&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The simplest loop is the &lt;span style="font-family:courier new;"&gt;while&lt;/span&gt;. The syntax goes like this:&lt;span style="font-family:courier new;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-family:courier new;"&gt;while &amp;lt;exp&amp;gt; {&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp; ...&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;}&lt;/span&gt;&lt;br /&gt;&lt;/blockquote&gt;where &lt;span style="font-style: italic;font-family:courier new;" &gt;&amp;lt;exp&amp;gt;&lt;/span&gt; is a condition. If you remember when we talked about creating dummy variables we said that a condition in stata is an expression that is equal to 1 if the condition holds and 0 otherwise. The while command tells Stata to keep running the same commands between the &lt;span style="font-family:courier new;"&gt;{}&lt;/span&gt; until &lt;span style="font-style: italic;font-family:courier new;" &gt;&lt;exp&gt;&lt;/exp&gt;&lt;/span&gt; is equal to 0 (that is, as long as the condition is satisfied).&lt;br /&gt;When we dealt with dummy variables we usually constructed the condition on one of the other variables (and Stata checked the condition on the values of the variables for each observation: educ&amp;gt;12 for example). But when you deal with loops and other matters of flow control (that is, how your program runs contingent upon the situations it faces), the conditions will mainly deal with macros instead of variables.*&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:85%;"&gt;* There is no technical problem with referring to variables in the condition. The thing is that as opposed to conditions when creating variables - in which Stata goes through all the observations - here referring to a variable will give its value in the first observation only, because nothing tells Stata to go through all observations. If you want to refer specifically to the value of the variable in an observation other than the first, just rever to &lt;span style="font-style: italic;"&gt;varname[observation_number].&lt;/span&gt; You can experiment with the command &lt;span style="font-family:courier new;"&gt;di&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Anyway, here's an example:&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-family:courier new;"&gt;local i = 1&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;while `i' &lt;= 4 {&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp; di "counting `i'"&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp; di "good."&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp; local i = `i' + 1&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;}&lt;/span&gt;&lt;br /&gt;&lt;/blockquote&gt;This will output:&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-family:courier new;"&gt;counting 1&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;good.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;counting 2&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;good.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;counting 3&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;good.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;counting 4&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;good.&lt;/span&gt;&lt;br /&gt;&lt;/blockquote&gt;Note that the last row within the while loop is iterating the macro i. Each time before the iteration is over, we increase i by 1. If we didn't do so, i would have stayed 1 and the condition would always be satisfied. To get out of the loop we need to make sure that after a finite number of iteration the condition is no longer satisfied.&lt;br /&gt;&lt;br /&gt;But for examples as the one I given, there is a better loop which is less cumbersome (the foreach/forvalues loop). While is good for situations in which one doesn't know in advance how many iteration one wants.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&lt;span style="font-family:courier new;"&gt;forvalues&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;When you know how many iterations you want, using a for loop is much better. The simplest for loop is the &lt;span style="font-family:courier new;"&gt;forvalues &lt;/span&gt;loop. Lets start with an example which will do the same thing as the example for the while loop:&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-family:courier new;"&gt;forvalues i=1/4&lt;/span&gt;&lt;span style="font-family:courier new;"&gt; {&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp; di "counting `i'"&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp; di "good."&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;}&lt;/span&gt;&lt;br /&gt;&lt;/blockquote&gt;Let's try to find the differences between the examples:&lt;br /&gt;First and foremost, the condition from the while loop has changed to i=1/4. Second, the initialization of the macro i before the while loop and the incrementation before the end of the iteration are both gone. This is done by the simple i=1/4 which we wrote for the forvalues. It means that we are creating a loop that will start with i=1, then increase i by 1 until it reaches 4 (including 4). We can refer to i inside the loop or we can ignore it. The loop will run 4 times with each time having the next number for i.&lt;br /&gt;&lt;br /&gt;More generally, our forvalues loop looks like this:&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-family:courier new;"&gt;forvalues &lt;span style="font-style: italic;"&gt;&amp;lt;loop_macro&amp;gt;&lt;/span&gt; = &lt;span style="font-style: italic;"&gt;&amp;lt;range&amp;gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt; {&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp; ...&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;}&lt;/span&gt;&lt;br /&gt;&lt;/blockquote&gt;&lt;br /&gt;In our example &lt;span style="font-style: italic;font-family:courier new;" &gt;&amp;lt;loop_macro&amp;gt;&lt;/span&gt; was i and &lt;span style="font-style: italic;font-family:courier new;" &gt;&amp;lt;range&amp;gt;&lt;/span&gt; was 1/4. Note that when we're in the forvalues context, 1/4 doesn't mean a quarter, but rather "from 1 until 4 in steps of 1". The range can be different both in terms of boundaries and in terms of steps. We can do this:&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-family:courier new;"&gt;forvalues proportion = 0(0.05)1 { ...&lt;/span&gt;&lt;/blockquote&gt;Which will start with proportion=0, then the next iteration will have proportion=0.05, then the next one 0.1, and so on until proportion=1.&lt;br /&gt;&lt;br /&gt;More on the possibilities of range in &lt;span style="font-family:courier new;"&gt;help forvalues&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;&lt;span style="font-family:courier new;"&gt;foreach&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;The foreach command is pretty versatile. In my experience, two of its versions are very common. The first and simplest one is this:&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-family:courier new;"&gt;foreach &amp;lt;loop_macro&amp;gt; in &amp;lt;list&amp;gt; {&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; &amp;nbsp; ... &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;}&lt;/span&gt;&lt;br /&gt;&lt;/blockquote&gt;where &lt;span style="font-style: italic;font-family:courier new;" &gt;&amp;lt;list&amp;gt;&lt;/span&gt; is simply a list of words (can also be numbers if you want) separated by white space. Let's see some examples:&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-family:courier new;"&gt;foreach regressor in educ_mom educ_dad "educ_mom educ_dad" {&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp; reg wage educ `regressor'&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;}&lt;/span&gt;&lt;br /&gt;&lt;/blockquote&gt;The loop will run the following three regressions:&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-family:courier new;"&gt;reg wage educ educ_mom&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;reg wage educ educ_dad&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;reg wage educ educ_mom educ_dad&lt;/span&gt;&lt;br /&gt;&lt;/blockquote&gt;Note that the double-quotes in the last expression are there to tell Stata we want it to treat it as a one word (one iteration in which the whole string inside the double quotes is the value that is assigned to the macro &lt;span style="font-style: italic;"&gt;regressor&lt;/span&gt;). In other words, if you don't want Stata to treat the space as a separator.&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-family:courier new;"&gt;foreach male_value in 0 1 {&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp; reg unemployed wage educ shock if male == `male_value'&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;}&lt;/span&gt;&lt;br /&gt;&lt;/blockquote&gt;This will run twice:&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-family:courier new;"&gt;reg unemployed wage educ shock if male == 0&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;reg unemployed wage educ shock if male == 1&lt;/span&gt;&lt;br /&gt;&lt;/blockquote&gt;What if you want an additional regression for both males and females? Because macros are simply text substitutions before commands are run, there are quite a few possibilities to implement this. I would try to do the one which makes the code easiest to read. One possibility is doing it this way:&lt;br /&gt;&lt;blockquote&gt;&lt;span style="font-family:courier new;"&gt;foreach male_cond in "male == 0" "male == 1" 1 {&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp; reg unemployed wage educ shock if `male_cond'&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;}&lt;/span&gt;&lt;br /&gt;&lt;/blockquote&gt;This will run the following three regressions:&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-family:courier new;"&gt;reg unemployed wage educ shock if male == 0&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;reg unemployed wage educ shock if male == 1&lt;br /&gt;reg u&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;nemployed wage educ shock if 1&lt;br /&gt;&lt;/span&gt;&lt;/blockquote&gt;&lt;span style="font-family:courier new;"&gt;&lt;/span&gt;The last 1 says that the condition will always satisfy. Thus, all observations (including, for example, those with a missing value in the variable male) will be in the last regression.&lt;br /&gt;&lt;br /&gt;Now, besides lists of strings and numbers, we can tell foreach to iterate between variables only. This is good for two reasons: (1) You can refer to a group of many variables with just one word , and, (2) If we're really interested in iterating names of variables, we can get something which is usually absent in Stata - we can get an error message if there is no such variable (error messages are definitely underrated - it is true you don't want any of them, but if you misspelled one of the variables' name, you probably want Stata to tell you).&lt;br /&gt;How do we do it?&lt;br /&gt;&lt;blockquote  style="font-family:courier new;"&gt;foreach &lt;span style="font-style: italic;"&gt;&amp;lt;loop_macro&amp;gt;&lt;/span&gt; of varlist &lt;span style="font-style: italic;"&gt;&amp;lt;varlist&amp;gt;&lt;/span&gt; {&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="font-family:courier new;"&gt; &lt;/span&gt;...&lt;br /&gt;}&lt;br /&gt;&lt;/blockquote&gt;For example (suppose the following variables exist in the loaded dataset: educ educ_dad educ_mom year1998 year1999 year2000 year2001 year2002):&lt;br /&gt;&lt;blockquote  style="font-family:courier new;"&gt;foreach var_to_sum of varlist educ* year1998-year2002 {&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="font-family:courier new;"&gt; &lt;/span&gt;sum `var_to_sum'&lt;br /&gt;}&lt;br /&gt;&lt;/blockquote&gt;The educ* will make the loop go through all variables of which names start with educ. Then, year1998-year2002 will make the loop go through all the variables between year1998 and year2002.&lt;br /&gt;&lt;br /&gt;As always, further details are to be found in &lt;span style="font-family:courier new;"&gt;help foreach&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-family:courier new;" &gt;if&lt;/span&gt;&lt;br /&gt;You are already familiar with the if condition most commands support. This &lt;span style="font-family:courier new;"&gt;if&lt;/span&gt; is meant to limit the execution of the command only to observations for which the condition is satisfied. As we said when we talked about the while loop, sometimes we would like conditions to control how our program flows. Those conditions are a bit different.&lt;br /&gt;&lt;br /&gt;Let's do an example. Suppose you want to run the loop above which iterates over different samples: male, female and all. But when you run both males and females in the regression you want to add the male dummy as a regressor (this is sometimes called adding a main-effect), or an interaction between the male dummy and a treatment variable. You only need to add those regressors to the "all-sample" iteration (actually you can put the regressors in the male-only and female-only regressions too and Stata will just drop those variables as they are multicollinear with the constant, but lets ignore this for the sake of the example). You can do something like&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-family:courier new;"&gt;foreach male_cond in "male == 0" "male == 1" 1 {&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="font-family:courier new;"&gt; &lt;/span&gt;&lt;span style="font-family:courier new;"&gt;if "`male_cond'" == "1" {&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="font-family:courier new;"&gt; &lt;/span&gt;&lt;span style="font-family:courier new;"&gt;  &lt;/span&gt;&lt;span style="font-family:courier new;"&gt; &lt;/span&gt;&lt;span style="font-family:courier new;"&gt;local add2reg "male maleXshock"&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="font-family:courier new;"&gt; &lt;/span&gt;&lt;span style="font-family:courier new;"&gt;}&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="font-family:courier new;"&gt; &lt;/span&gt;&lt;span style="font-family:courier new;"&gt;else {&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; local add2reg ""&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;/span&gt;&lt;span style="font-family:courier new;"&gt; &lt;/span&gt;&lt;span style="font-family:courier new;"&gt;}&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp; reg unemployed wage educ shock `add2reg' if `male_cond'&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;}&lt;/span&gt;&lt;br /&gt;&lt;/blockquote&gt;Note that I put double-quotes on both sides of the condition because if I wouldn't, the first and second iterations would make the if command look like this:&lt;br /&gt;if male == 0 == 1&lt;br /&gt;&lt;br /&gt;Stata would first evaluate 0 == 1 and then male == 0 (the second 0 is because 0 is not equal to 1). You didn't want this. You wanted simply to compare the string of the condition to 1 (to get the last iteration).&lt;br /&gt;This example brings me to another point. Note that we wrote 9 rows  of code for a loop that replaces 3 rows of simple regression commands. In many cases, simply writing the original regressions will do the job. In others you might be working in a greater framework, or you want to later add additional subsamples which will make it better to write the loop instead of the regressions themselves. Do your own calculation of whether complicating things with a loop (and inner conditions) is better than simply repeating your commands, however stupid it feels.&lt;br /&gt;&lt;br /&gt;For further help (this time I'm going to surprise you), look up &lt;span style="font-family:courier new;"&gt;help ifcmd&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;Additional issues for loops and conditions:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Nested loops&lt;/span&gt; - you can write a loop inside a loop. This will make the inner loop run anew for each iteration of the outer loop. This is where the whole thing really starts to pay off, because you can run many regressions and make it pretty readable, enabling easier changes in the specification when you need it. Here's an example&lt;/li&gt;&lt;/ul&gt;&lt;blockquote  style="font-family:courier new;"&gt;local control_vars "educ_dad educ_mom hh_income grade_5 grade_6"&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;foreach dep_var of varlist score pass_dummy admitted {&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp; foreach treatment of varlist hours_tutored tutored_dummy {&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; foreach sample in "male == 0" "female == 1" "male == 0 &amp;amp; educ_dad &amp;lt; 12" {&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; reg `dep_var' `treatment' `control_vars' if `sample'&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; } &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt; &amp;nbsp;&amp;nbsp;&amp;nbsp; } &lt;/span&gt; &lt;br /&gt;&lt;span style="font-family:courier new;"&gt; } &lt;/span&gt; &lt;br /&gt;&lt;/blockquote&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;font-family:courier new;" &gt;continue&lt;/span&gt;&lt;span style="font-family:courier new;"&gt; &lt;/span&gt;- if you want to exit a loop before it ends naturally (i.e murder a loop?), you can use the &lt;span style="font-family:courier new;"&gt;continue &lt;/span&gt;command. Usually it will appear inside an inner if condition. This is very uncommon, though, and makes the code less readable.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;&lt;span style="font-weight: bold;"&gt;Summary&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;So we learned how to define macros and give our regressions a structure with loops (and nested loops). I hope by now you understand how this can contribute to your project. I think the last example - for the nested loop remark - makes it very clear. As we will see in the following steps, loops and macros can help us automate not only the statistical commands, but also how we save the output we're interested in and export it to nice tables (if reading logs of Stata isn't your favorite pastime activity).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5290984163937382875-417234010002329473?l=stataproject.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://stataproject.blogspot.com/feeds/417234010002329473/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5290984163937382875&amp;postID=417234010002329473' title='21 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5290984163937382875/posts/default/417234010002329473'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5290984163937382875/posts/default/417234010002329473'/><link rel='alternate' type='text/html' href='http://stataproject.blogspot.com/2008/12/step-6-automation-separating-men-from.html' title='Step #6: Automation - Separating the Men from the Boys'/><author><name>stataman</name><uri>http://www.blogger.com/profile/13749462261829425392</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>21</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5290984163937382875.post-3166742463842641025</id><published>2008-02-13T15:47:00.015+02:00</published><updated>2008-12-31T00:51:17.923+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='eclass'/><category scheme='http://www.blogger.com/atom/ns#' term='return list'/><category scheme='http://www.blogger.com/atom/ns#' term='Step #5'/><category scheme='http://www.blogger.com/atom/ns#' term='rclass'/><category scheme='http://www.blogger.com/atom/ns#' term='ereturn list'/><title type='text'>Step #5: Using commands' outputs</title><content type='html'>&lt;div&gt;&lt;div&gt;&lt;div&gt;Now that we're done with data handling, we're skipping the part about statistical analysis commands (regressions and the like). The next four steps will deal with automation of the analyses. This step is actually a preface to the next three steps which deal with automation. As you know, Stata can run each command separately in the command window. It is also possible to save all the commands in a ".do" file and rerun all of them whenever you like. Since programs tend to run the same commands again and again, we would like to automate some of the commands. It will also help us organize our analysis and report the results in a neat table instead of with the regular text output of Stata. To do that, we will go through steps 5-8.&lt;br /&gt;&lt;br /&gt;Don't worry if you don't understand &lt;em&gt;why&lt;/em&gt; we're doing what we're doing in this chapter. That will be clearer in the following chapter.&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The most basic statistical command in Stata is probably &lt;span style="font-family:courier new;"&gt;summarize&lt;/span&gt;. I'm relying on the fact that the reader is familiar with this command. If this is not the case, I think a more basic tutorial will fit you before going through this one.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;So, let's say that I run this command (&lt;span style="font-family:courier new;"&gt;su&lt;/span&gt; is an abbreviation for &lt;span style="font-family:courier new;"&gt;summarize&lt;/span&gt;)&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;&lt;a href="http://1.bp.blogspot.com/_iuN8kqdF7_g/SL9J1plBYEI/AAAAAAAAC4M/NvxqZIdthts/s1600-h/step5_1.gif"&gt;&lt;img id="BLOGGER_PHOTO_ID_5241989677325574210" style="CURSOR: hand" alt="" src="http://1.bp.blogspot.com/_iuN8kqdF7_g/SL9J1plBYEI/AAAAAAAAC4M/NvxqZIdthts/s400/step5_1.gif" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Now, what can you do if you want to use one of the statistics, for example, to calculate something with it? Suppose we want to calculate how much is 2 standard deviations. We see how much is one standard deviation (.409255) so we can manually calculate 2 * .409255:&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;&lt;a href="http://4.bp.blogspot.com/_iuN8kqdF7_g/SL9IAi7ciiI/AAAAAAAAC3k/kCI6EytmQNw/s1600-h/step5_2.gif"&gt;&lt;img id="BLOGGER_PHOTO_ID_5241987665495886370" style="CURSOR: hand" alt="" src="http://4.bp.blogspot.com/_iuN8kqdF7_g/SL9IAi7ciiI/AAAAAAAAC3k/kCI6EytmQNw/s400/step5_2.gif" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;But what happens if we're in a program, or if we want to use a more precise measure of standard error (Stata saves more than 6 figures after the dot)? In a program, we could not write "&lt;span style="font-family:courier new;"&gt;di 2*.409255&lt;/span&gt;". We can, but it will be bad coding. We would have to first run the program, get the standard deviation and then write it &lt;em&gt;in the code&lt;/em&gt;. Moreover, if we just slightly change the data in the future, we will need to fix the code itself.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;The solution is rather simple. Many commands keep their calculations for later use as scalars, vectors or macros. To see which calculations the &lt;span style="font-family:courier new;"&gt;summarize&lt;/span&gt; command saved, lets run "&lt;span style="font-family:courier new;"&gt;return list&lt;/span&gt;":&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;&lt;span style="font-family:courier new;"&gt;&lt;a href="http://3.bp.blogspot.com/_iuN8kqdF7_g/SL9IArLThiI/AAAAAAAAC3s/JtP1E-oOboU/s1600-h/step5_3.gif"&gt;&lt;img id="BLOGGER_PHOTO_ID_5241987667709888034" style="CURSOR: hand" alt="" src="http://3.bp.blogspot.com/_iuN8kqdF7_g/SL9IArLThiI/AAAAAAAAC3s/JtP1E-oOboU/s400/step5_3.gif" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;/div&gt;&lt;/span&gt;&lt;div&gt;&lt;br /&gt;See, the command "return list" lists the scalars that the summarize command saves. Summarize, specifically, saves the calculations for the last variable specified (in our case it was just one). Now, we can run in our code, if we want to calculated generally 2 standard deviation of the variable we can simply run two commands:&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;&lt;a href="http://1.bp.blogspot.com/_iuN8kqdF7_g/SL9IoS0-ZNI/AAAAAAAAC38/SUO4W5b1zcI/s1600-h/step5_4.gif"&gt;&lt;img id="BLOGGER_PHOTO_ID_5241988348368544978" style="FLOAT: left; MARGIN: 0px 10px 10px 0px; CURSOR: hand" alt="" src="http://1.bp.blogspot.com/_iuN8kqdF7_g/SL9IoS0-ZNI/AAAAAAAAC38/SUO4W5b1zcI/s400/step5_4.gif" border="0" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;This way, we didn't use the actual s.d in our command, but rather the &lt;span style="font-family:courier new;"&gt;r(sd)&lt;/span&gt;&lt;span style="font-size:0;"&gt; &lt;/span&gt;&lt;span style="font-size:100%;"&gt;word. Stata replaced &lt;span style="font-family:courier new;"&gt;r(sd)&lt;/span&gt; with the value saved by the summarize command.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;This trick can be done with other commands. Actually with all programs that are in the rclass category. I don't know where the rclass word came from, but I do know that it means that the scalars will be saved in &lt;span style="font-family:courier new;"&gt;r(something)&lt;/span&gt; command.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;Another class of commands is the eclass commands. The command &lt;span style="font-family:courier new;"&gt;regress&lt;/span&gt; is one of them. If we will run reg, we can then look for saved values with "ereturn list":&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;a href="http://3.bp.blogspot.com/_iuN8kqdF7_g/SL9O63zCXRI/AAAAAAAAC4U/oark8SUccBE/s1600-h/step5_5.gif"&gt;&lt;img id="BLOGGER_PHOTO_ID_5241995264599940370" style="CURSOR: hand" alt="" src="http://3.bp.blogspot.com/_iuN8kqdF7_g/SL9O63zCXRI/AAAAAAAAC4U/oark8SUccBE/s400/step5_5.gif" border="0" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt; &lt;/div&gt;&lt;div&gt;&lt;a href="http://2.bp.blogspot.com/_iuN8kqdF7_g/SL9O7Lkkt7I/AAAAAAAAC4c/FzQisU_bvBs/s1600-h/step5_6.gif"&gt;&lt;img id="BLOGGER_PHOTO_ID_5241995269907986354" style="CURSOR: hand" alt="" src="http://2.bp.blogspot.com/_iuN8kqdF7_g/SL9O7Lkkt7I/AAAAAAAAC4c/FzQisU_bvBs/s400/step5_6.gif" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;We got the scalars (numbers) that are related to the regression (root MSE, number of observations, F-statistic, R-squared etc.) but also macros (which I personally never use) and most importantly: matrices. One is e(b) which is actually a vector of the coefficients (a matrix of 1 row and 2 columns) , and the other is e(V) which is the coefficients covariance matrix. Let's take a look at them:&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;&lt;a href="http://1.bp.blogspot.com/_iuN8kqdF7_g/SL9O7JwzxLI/AAAAAAAAC4k/-wq8LlV3TXA/s1600-h/step5_7.gif"&gt;&lt;img id="BLOGGER_PHOTO_ID_5241995269422433458" style="CURSOR: hand" alt="" src="http://1.bp.blogspot.com/_iuN8kqdF7_g/SL9O7JwzxLI/AAAAAAAAC4k/-wq8LlV3TXA/s400/step5_7.gif" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;And we see, indeed, the coefficients from the regression in e(b). Compare the output of the &lt;span style="font-family:courier new;"&gt;matrix list e(b)&lt;/span&gt; command and the output from the &lt;span style="font-family:courier new;"&gt;reg&lt;/span&gt; command. The e(V) matrix is a bit less self-evident. Remember that the covariance between a variable and itself is its variance. So the coefficient's standard error is actually the square-root of the value it has in the diagonal. &lt;/div&gt;&lt;div&gt;There is, however, a shorter way to reach the coefficient and the standard error. Instead of using the matrices, you can simply refer to the coefficient's value with the &lt;span style="font-family:courier new;"&gt;_b[]&lt;/span&gt; ... thing (I don't know how to call it). In order to get the value of the coefficient of x, put x between the brackets: &lt;span style="font-family:courier new;"&gt;_b[x]&lt;/span&gt; has the coefficient of x from the last regression. So now you can have your program calculate the "effect" of having 3 more rooms and not just one (or just save that number in order to report it in a table later). Also, in order to get the standard error, use _b[]'s brother, which is _se[]. That is:&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;&lt;a href="http://1.bp.blogspot.com/_iuN8kqdF7_g/SL9O7G-_f2I/AAAAAAAAC4s/LvCu9ar3bnI/s1600-h/step5_8.gif"&gt;&lt;img id="BLOGGER_PHOTO_ID_5241995268676616034" style="CURSOR: hand" alt="" src="http://1.bp.blogspot.com/_iuN8kqdF7_g/SL9O7G-_f2I/AAAAAAAAC4s/LvCu9ar3bnI/s400/step5_8.gif" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;As long as you didn't run another eclass command, the calculations from the last program will be available for you to use. Once you run the next eclass command (for example, another regression), the saved output will be replaced by the new command's output. The same goes for the rclass commands, but separately. That is, if you run an rclass command, it will not "step on" the eclass command's results, but only the last rclass command's results.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;In order to put the results aside for later use (before running the next eclass or rclass command), we will learn how to use macros - in the next chapter.&lt;/div&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div align="right"&gt;(Go on to &lt;a href="http://stataproject.blogspot.com/2008/12/step-6-automation-separating-men-from.html"&gt;Step #6&lt;/a&gt;)&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5290984163937382875-3166742463842641025?l=stataproject.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://stataproject.blogspot.com/feeds/3166742463842641025/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5290984163937382875&amp;postID=3166742463842641025' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5290984163937382875/posts/default/3166742463842641025'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5290984163937382875/posts/default/3166742463842641025'/><link rel='alternate' type='text/html' href='http://stataproject.blogspot.com/2008/02/step-5-using-commands-outputs.html' title='Step #5: Using commands&apos; outputs'/><author><name>stataman</name><uri>http://www.blogger.com/profile/13749462261829425392</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_iuN8kqdF7_g/SL9J1plBYEI/AAAAAAAAC4M/NvxqZIdthts/s72-c/step5_1.gif' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5290984163937382875.post-1462171203112159684</id><published>2007-12-19T13:30:00.002+02:00</published><updated>2009-07-10T17:43:37.069+03:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='group-wise statistics'/><category scheme='http://www.blogger.com/atom/ns#' term='Step #4'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Manipulation'/><category scheme='http://www.blogger.com/atom/ns#' term='egen'/><title type='text'>Step #4: Thank God for the egen Command</title><content type='html'>&lt;span style="font-size:130%;"&gt;&lt;u&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/u&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;What's so special, really, about the &lt;span style="font-family:courier new;"&gt;egen&lt;/span&gt; (extensions to &lt;span style="font-family:courier new;"&gt;genereate&lt;/span&gt;) command? The answer is that it lets you do lots of things to the data. Things that in other statistical programs might take a lot of commands are possible to do with a couple of &lt;span style="font-family:courier new;"&gt;egen&lt;/span&gt; commands. So this is actually the next phase of data manipulation.&lt;br /&gt;&lt;br /&gt;The syntax is pretty simple:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;egen &amp;lt;new variable&amp;gt;= &amp;lt;function&amp;gt;(&amp;lt;expression(s)&amp;gt; or &amp;lt;variable(s)&amp;gt;) [, by (&amp;lt;variables&amp;gt;)]&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The functions actually determine what the &lt;span style="font-family:courier new;"&gt;egen&lt;/span&gt; command will do. There are many of them, all described in &lt;span style="font-family:courier new;"&gt;help egen&lt;/span&gt;, and the following sectios of this step will describe the use of the most commonly used functions. These examples will hopefully clarify how to use the different functions and how can they help us.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;&lt;u&gt;mean()&lt;/u&gt;&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;egen store_mean_price = mean(price), by(store_id)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This &lt;span style="font-family:georgia;"&gt;example will create a variable in which, for each observation, the value will be the mean &lt;em&gt;price&lt;/em&gt; of all observations&lt;/span&gt; that have the same &lt;em&gt;store_id&lt;/em&gt;. See the figure under rowmean() for a graphic illustration.&lt;br /&gt;&lt;br /&gt;One can omit the &lt;span style="font-family:courier new;"&gt;by&lt;/span&gt; option - this will put the mean of the original variable for all observations in the dataset.&lt;br /&gt;&lt;br /&gt;Other examples:&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;egen mean_firm_occupation_wage = mean(wage), by(firm_id occupation_id)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This will put, for each observation, the mean &lt;em&gt;wage&lt;/em&gt; of all other observatios with the same &lt;em&gt;firm_id&lt;/em&gt; and &lt;em&gt;occupation_id&lt;/em&gt;.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;&lt;u&gt;rowmean()&lt;/u&gt;&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;The function rowmean also compute means, but instead of computing means of a variable across observations, it compute the mean across variables for each of the observations.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;egen mean_score = rowmean(math_score physics_score chemistry_score)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Suppose you had a dataset of students and their scores. This example will simply create a new variable - &lt;em&gt;mean_score&lt;/em&gt; - which will hold the mean of math, physics and chemistry score for each of the students.&lt;br /&gt;&lt;br /&gt;Note that the fact that it computes it separately for each of the observations makes the &lt;span style="font-family:courier new;"&gt;by&lt;/span&gt; option irrelevant. Take the previous example: There's no meaning to do add a &lt;span style="font-family:courier new;"&gt;by(class_id)&lt;/span&gt; option to the egen command when using the &lt;span style="font-family:courier new;"&gt;rowmean&lt;/span&gt;&lt;span style="font-family:georgia;"&gt;function&lt;/span&gt;. If you want the mean score in class for any of the subjects (mean score across students), you should use the &lt;span style="font-family:courier new;"&gt;mean()&lt;/span&gt; function instead of the &lt;span style="font-family:courier new;"&gt;rowmean()&lt;/span&gt;. If you want the mean score (across students) of the mean score (across subjects), you need to first do mean and then rowmean (or vice versa).&lt;br /&gt;&lt;br /&gt;You might ask, what's the difference between the&lt;span style="font-family:courier new;"&gt; rowmean()&lt;/span&gt; and simply using the gen command:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;gen mean_score = (math_score + physics_score + chemistry_score) / 3&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There are two main differences:&lt;br /&gt;1. You can use &lt;u&gt;wildcards&lt;/u&gt; - The same &lt;span style="font-family:courier new;"&gt;rowmean&lt;/span&gt; command can be written like this:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;egen mean_score = rowmean(*_score)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This is very useful if the list of valiables is very long, or if you think that later on you might add &lt;em&gt;english_score history_score&lt;/em&gt;, and so on, to your dataset, and you don't want to update this command every time. To learn more about wildcards, see &lt;span style="font-family:courier new;"&gt;help varlist&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;2. &lt;u&gt;Missing&lt;/u&gt; values - If one of the variables mentioned above is missing, &lt;span style="font-family:courier new;"&gt;gen&lt;/span&gt; command will not be able to sum the three variables and will therefore put missing value for &lt;em&gt;mean_score&lt;/em&gt; in that observation. This is the case even if the other two scores are not missing. &lt;span style="font-family:courier new;"&gt;egen rowmean&lt;/span&gt;, on the other hand, will disregard the missing values and compute the mean of only the nonmissing values in the variable list. Only if there are missing values in &lt;em&gt;all&lt;/em&gt; of the variables specified, egen rowmean will put missing value in the generated variable (just like the gen command). It's up to you to decide which one is better. Sometimes it is enough to have one missing value in order for the mean value to be irrelevant to what you are measuring, and sometimes you may decide that the mean of the nonmissing variables only is good enough.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;This slide shows how the mean() and rowmean() functions work:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;p align="center"&gt;&lt;a href="http://3.bp.blogspot.com/_iuN8kqdF7_g/R3OQIyaCWtI/AAAAAAAAAqk/20BjbIBaJj8/s1600-h/egen+mean+rowmean.gif"&gt;&lt;img id="BLOGGER_PHOTO_ID_5148617279658941138" style="" alt="" src="http://3.bp.blogspot.com/_iuN8kqdF7_g/R3OQIyaCWtI/AAAAAAAAAqk/20BjbIBaJj8/s400/egen+mean+rowmean.gif" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;Note: Although the observations are sorted according to the &lt;span style="font-family:courier new;"&gt;by&lt;/span&gt; variable (&lt;em&gt;class_id&lt;/em&gt;) here, it doesn't mean that you should sort them before. &lt;span style="font-family:courier new;"&gt;egen&lt;/span&gt; doesn't need the dataset to be sorted according to the &lt;span style="font-family:courier new;"&gt;by&lt;/span&gt; variable (although I'm guessing that if it's sorted, it will take less time to process).&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;&lt;u&gt;sum() and rowtotal(), max() and rowmax(), min() and rowmin()&lt;/u&gt;&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;This is the same as mean and rowmean, but instead of calculating means, these functions calculate sums. Here are some examples:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;egen team_effort = sum(effort), by(team)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;egen total_correct_answers = rowtotal(question_*)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There is a small difference between &lt;span style="font-family:courier new;"&gt;rowmean&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;rowtotal&lt;/span&gt; in the way missing values are treated. &lt;span style="font-family:courier new;"&gt;rowtotal&lt;/span&gt; simply treats missing values as zeroes. So even if the values are missing in all the specified variables, the new rowtotal variable will be 0. &lt;span style="font-family:courier new;"&gt;rowmean&lt;/span&gt; would have put missing value instead. If you want to put a missing value there, you might want to do something like that:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;egen total_correct_answers = rowtotal(question_*)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;egen nonmissing_answers = rownonmiss(question_*)&lt;/span&gt; &lt;span style="color: rgb(0, 153, 0);"&gt;// This function puts the number of variables for which this observation had a missing value&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;replace total_correct_answers = . if nonmissing_answers == 0&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;drop nonmissing_answers&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;For max and min, the function gives the maximum or minimum value, respectively, out of all the values of the specified variable (within a group, if you are specifying the by option). Let's say, for example, that you want to normalize all values of the variable wage in a group so that each observation will have 1 if its wage is the maximum wage in a group, 0 if its wage is the minimum and some number between 0 and 1 if it's in between. The following set of commands will do the job pretty easily:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;egen max_wage_f = max(wage), by(firm)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;egen min_wage_f = min(wage), by(firm)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;gen norm_wage_f = (wage - min_wage_f) / (max_wage_f - min_wage_f)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;replace norm_wage_f = 0.5 if norm_wage_f == . &lt;/span&gt;&lt;span style="color: rgb(0, 153, 0);"&gt;// This is for the firms in which the maximum wage equals the minimum wage. The command above will give them missing values, because the denominator equals zero&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;p align="center"&gt;&lt;a href="http://4.bp.blogspot.com/_iuN8kqdF7_g/R52URMbEJkI/AAAAAAAAAqs/vQu5eLjLXpc/s1600-h/normalizing+a+variable.gif"&gt;&lt;img id="BLOGGER_PHOTO_ID_5160443771149887042" style="" alt="" src="http://4.bp.blogspot.com/_iuN8kqdF7_g/R52URMbEJkI/AAAAAAAAAqs/vQu5eLjLXpc/s400/normalizing+a+variable.gif" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;Here is another example of using egen max in order to populate a nonmissing value to observations with missing values. Suppose you have a dataset of students in schools. In one of your regressions you want to take into account the number of children whose father dropped out of high school. In order to do so, you will the following commands:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;egen f_dropout_kids_only = count(student_id) if f_educ&amp;lt;12, by(school) &lt;span style="font-family:georgia;"&gt;&lt;span style="color: rgb(0, 153, 0);"&gt;// This will count the number of children for whom the condition applies. Observations in which the condition does not apply, will get a missing value to the generated variable.&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;egen f_dropout_kids = max(f_dropout_kids_only), by(school) &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;drop f_dropout_kids_only &lt;/span&gt;&lt;span style="color: rgb(0, 153, 0);font-family:georgia;" &gt;// Don't get confused. This will drop the variable f_dropout_kids_only that is no longer needed &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="font-family:georgia;"&gt;Although, in general, missing value is greater than any other nonmissing value, the egen max ignores missing values and therefore the observations with the nonmissing value in that school - i.e the kids whose father dropped out - will be given to the other observations in the same school (we need all of them in the regression: both the ones who apply and the ones who doesn't). The following figure might make it clearer.&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="font-family:georgia;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="font-family:georgia;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;p align="center"&gt;&lt;a href="http://2.bp.blogspot.com/_iuN8kqdF7_g/R52UisbEJlI/AAAAAAAAAq0/yw6FOgUzkfo/s1600-h/using+max+to+populate.gif"&gt;&lt;img id="BLOGGER_PHOTO_ID_5160444071797597778" style="" alt="" src="http://2.bp.blogspot.com/_iuN8kqdF7_g/R52UisbEJlI/AAAAAAAAAq0/yw6FOgUzkfo/s400/using+max+to+populate.gif" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="font-family:georgia;"&gt;Note: Instead of &lt;span style="font-family:courier new;"&gt;egen max&lt;/span&gt;, we could have also used &lt;span style="font-family:courier new;"&gt;egen min&lt;/span&gt; or &lt;span style="font-family:courier new;"&gt;egen mean&lt;/span&gt; again as the second command. Both &lt;span style="font-family:courier new;"&gt;egen min&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;egen mean&lt;/span&gt; ignore missing values and since the nonmissing values are equal for that by category, the functions will yield that same value. This is not true for &lt;span style="font-family:courier new;"&gt;egen sum&lt;/span&gt;, though, because the sum function will multiply the nonmissing value by the number of nonmissing observations.&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:georgia;"&gt;For additional statistical calculations of the within group, see &lt;span style="font-family:courier new;"&gt;help egen&lt;/span&gt; and look for functions such as &lt;span style="font-family:courier new;"&gt;sd()&lt;/span&gt; (for standard deviation), &lt;span style="font-family:courier new;"&gt;median()&lt;/span&gt;, &lt;span style="font-family:courier new;"&gt;mode()&lt;/span&gt; and others. You can also calculate statistics across a group of variables on an observation-basis (instead of across a group of observations on a variable-basis). See &lt;span style="font-family:courier new;"&gt;help egen&lt;/span&gt; and look for the functions that start with &lt;span style="font-family:courier new;"&gt;row&lt;/span&gt; : &lt;span style="font-family:courier new;"&gt;rowmean()&lt;/span&gt;, &lt;span style="font-family:courier new;"&gt;rowmin()&lt;/span&gt;, &lt;span style="font-family:courier new;"&gt;rowmax()&lt;/span&gt;, &lt;span style="font-family:courier new;"&gt;rowsd()&lt;/span&gt;, &lt;span style="font-family:courier new;"&gt;rowtotal()&lt;/span&gt;, etc.&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;strong&gt;&lt;u&gt;&lt;span style="font-family:georgia;"&gt;tag() and group()&lt;/span&gt;&lt;/u&gt;&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:georgia;"&gt;These two function are really useful with identifying variables that have more than one observation (group-identifying variables) - the same ones we used before in the &lt;span style="font-family:courier new;"&gt;by&lt;/span&gt; options.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:georgia;"&gt;Suppose you have a dataset of gas prices. Each observation has the type of fuel, the price per gallon, the station ID, and week in which the price was recorded. Not all stations were recorded in each of the weeks. That is, in some weeks, some stations didn't have their price taken. In your research, you decide to work only on stations for which you have full data - i.e, those which appear in each of the 50 weeks.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:georgia;"&gt;If the data had only one observation per station-and-week combination, you could have just used the &lt;span style="font-family:courier new;"&gt;count()&lt;/span&gt; function of egen:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:georgia;"&gt;&lt;span style="font-family:courier new;"&gt;egen station_count = count(week), by(station)&lt;/span&gt; &lt;span style="color: rgb(0, 153, 0);"&gt;// This will count the number of observations with non-missing values in &lt;em&gt;week, &lt;/em&gt;for each value of &lt;em&gt;station,&lt;/em&gt; and put the result for each observation of that &lt;em&gt;station&lt;/em&gt;.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:georgia;"&gt;The problem is that each station-week combination has more than one observation and the number of observations per station-week varies between stations (remember, each station-week has as many observations as the number of fuel types - a price for leaded, unleaded, premium, etc.). Simply counting observations will not work here. We need to "tag" stations: in other words, we will create an indicator (a dummy variable) which will be 1 for only one observation per station, and 0 for all other observations of the same station. &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:georgia;"&gt;Once we tag each new combination of station-week, we can count how many station-week combinations there are for each station -- this will give us the number of weeks for each station. Although I said we need to count&lt;strong&gt;, &lt;/strong&gt;we will use the &lt;span style="font-family:courier new;"&gt;sum()&lt;/span&gt; function of egen, because &lt;span style="font-family:courier new;"&gt;count()&lt;/span&gt; will add to the count observations with 0, whereas sum will not (which is good because we do not want to count the same station-week more than once):&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:georgia;"&gt;&lt;span style="font-family:courier new;"&gt;egen station_week_tag = tag(station week) &lt;/span&gt;&lt;span style="color: rgb(0, 153, 0);"&gt;// We're not using the by option since the group-identifying variables are already in the tag.&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;egen weeks_of_station = sum(station_week_tag), by(station)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:georgia;"&gt;Graphically, this is what we actually do:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;p align="center"&gt;&lt;a href="http://2.bp.blogspot.com/_iuN8kqdF7_g/R52VcsbEJmI/AAAAAAAAAq8/QUe_oETU7BE/s1600-h/tag.gif"&gt;&lt;img id="BLOGGER_PHOTO_ID_5160445068230010466" style="" alt="" src="http://2.bp.blogspot.com/_iuN8kqdF7_g/R52VcsbEJmI/AAAAAAAAAq8/QUe_oETU7BE/s400/tag.gif" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;&lt;span style="font-family:georgia;"&gt;The &lt;span style="font-family:courier new;"&gt;group()&lt;/span&gt; function is used in the same manner we use the &lt;span style="font-family:courier new;"&gt;tag()&lt;/span&gt; function, but instead of putting 1 in each new combination and 0 in combinations that it has already seen, it puts 1 for all the observations of the first combination it sees, then 2 for all the observations of the second combination, and so on. The benefit of this function is that you can create a full numeric ordered single variable that enumerates all combinations. When we will deal with loops, it might be clearer why this is good.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;&lt;u&gt;&lt;span style="font-family:georgia;"&gt;Conclusion&lt;/span&gt;&lt;/u&gt;&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:georgia;"&gt;The egen command can help you play with the data pretty easily and intuitively (once you get the trick of the function you are using). There are other functions of egen I did not describe here. As I said, you can use the help egen, and don't be afraid to experiment with the functions. &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:georgia;"&gt;To check whether your function work, browse your dataset. Sort it first by the group variables you mentioned, and then just browse the variables you want to. If you have a large dataset, you can limit the &lt;span style="font-family:courier new;"&gt;browse&lt;/span&gt; command using if conditions or in. Here are two examples:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;sort firm&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;browse firm employee wage min_wage_f max_wage_f norm_wage_f in 2000/2200 &lt;/span&gt;&lt;span style="color: rgb(0, 153, 0);font-family:georgia;" &gt;// This will browse observations #2000 through #2200&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;browse firm employee wage *_wage_f if firm &gt;= 100 &amp;amp; firm &lt;= 200&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:georgia;"&gt;You can do the same with the list command, by the way (but list is limited to the width of the output screen). &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:georgia;"&gt;Good luck!&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div align="right"&gt;(Go on to &lt;a href="http://stataproject.blogspot.com/2008/02/step-5-using-commands-outputs.html"&gt;Step #5&lt;/a&gt;)&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5290984163937382875-1462171203112159684?l=stataproject.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://stataproject.blogspot.com/feeds/1462171203112159684/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5290984163937382875&amp;postID=1462171203112159684' title='35 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5290984163937382875/posts/default/1462171203112159684'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5290984163937382875/posts/default/1462171203112159684'/><link rel='alternate' type='text/html' href='http://stataproject.blogspot.com/2007/12/step-4-thank-god-for-egen-command.html' title='Step #4: Thank God for the egen Command'/><author><name>stataman</name><uri>http://www.blogger.com/profile/13749462261829425392</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_iuN8kqdF7_g/R3OQIyaCWtI/AAAAAAAAAqk/20BjbIBaJj8/s72-c/egen+mean+rowmean.gif' height='72' width='72'/><thr:total>35</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5290984163937382875.post-1941721614902610549</id><published>2007-12-06T11:46:00.000+02:00</published><updated>2007-12-11T10:15:57.841+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Data Transfer'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Manipulation'/><category scheme='http://www.blogger.com/atom/ns#' term='Step #1'/><title type='text'>Step #1 - Getting the data</title><content type='html'>The first encounter with Stata, in many cases, can be one of the most frustrating ones. You got the first project you need to work on your own, you searched the internet for data, or got it from someone, but it's not in Stata format. Transferring the data can be very annoying if you don't know how to do it. Maybe this example can help.&lt;br /&gt;&lt;br /&gt;Suppose you need to construct a dataset for a research on OECD countries' Foreign Aid. You log into &lt;a href="http://www.oecd.org/"&gt;http://www.oecd.org/&lt;/a&gt;, go to Statistics, and get the data you want. Then you want to export the data from the website and it lets you do that to an Excel file. I got the following file:&lt;br /&gt;&lt;a href="http://4.bp.blogspot.com/_iuN8kqdF7_g/R1QoSA-x7_I/AAAAAAAAAck/H_P26hU-GCY/s1600-R/project+1+excel.GIF"&gt;&lt;img id="BLOGGER_PHOTO_ID_5139777364702654450" style="CURSOR: hand" alt="" src="http://4.bp.blogspot.com/_iuN8kqdF7_g/R1QoSA-x7_I/AAAAAAAAAck/q7Rdo-3Y6Js/s400/project+1+excel.GIF" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_iuN8kqdF7_g/R1LBDw-x76I/AAAAAAAAAb0/oyi_ccuzR7M/s1600-R/project+1+excel.GIF"&gt;&lt;/a&gt;&lt;br /&gt;There are, as you can see, four variables: Country, Year, ODA, and ODA as % of Gross National Income. A crucial point is to get the data right in the website, but I am ignoring that here. As a general advice, always ask yourself if the numbers make sense. If you won't ask yourself, others will ask you!&lt;br /&gt;&lt;br /&gt;In any case, there are a few things we better do before we transfer the data. Keep in mind that we want to get to the point where the table has one header row with the variables names, and the rest of the rows represent observations:&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;The country column have only one cell containing each country's name. The rest of the cells should be filled with the name too. Otherwise, this row - when it will be transformed to an observation in Stata - will have a missing value instead of the country's name. To solve this, select the cell with the country's name, click on the bottom-right corner of the selected rectangle and drag it down all the way to the last empty cell of this country. In my file, I also needed to unmerge the cells. Repeat this for all the countries.&lt;/li&gt;&lt;p&gt;&lt;/p&gt;&lt;br /&gt;&lt;li&gt;The cells in the year column have this weird green triangle on their top left corner. This means that the values there are stored as Text instead of as numbers. Excel recognized this pitfall and therefore warns us with the green triangle. We must reformat the values from text to numeric. Otherwise, Stata will not be able to perform any calculation on this variable. To convert the cells formats to numbers, first select the first cell, then drag the selection all the way down. Click on the small icon with the exclamation mark and click on "Convert to Number"&lt;/li&gt;&lt;p&gt;&lt;/p&gt;&lt;br /&gt;&lt;li&gt;Column C and Rows 1-6 are excessive. Copy, or type, the column headers for columns D and E in row 7, and delete rows 1-6 and column C.&lt;/li&gt;&lt;p&gt;&lt;/p&gt;&lt;br /&gt;&lt;li&gt;This is optional, but I would suggest to delete the &lt;em&gt;DAC Countries, Total&lt;/em&gt; rows too, since they are not really observations which represent countries, and we want our dataset be country-year based.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;&lt;img id="BLOGGER_PHOTO_ID_5139637924294422450" style="CURSOR: hand" alt="" src="http://1.bp.blogspot.com/_iuN8kqdF7_g/R1Opdg-x77I/AAAAAAAAAb8/x9eMgXS5VOM/s400/project+1+excel+2.GIF" border="0" /&gt;&lt;/p&gt;&lt;p&gt;Now, you got an Excel file that looks like the Stata file you want. Particularly, all numeric variables have values that are classified as numbers. To transfer them into Stata, you have three options: &lt;/p&gt;&lt;ol&gt;&lt;li&gt;The safest, but most expensive - Use a data-transfer program like StatTransfer. &lt;/li&gt;&lt;li&gt;The cheapest way - Copy &amp;amp; Paste.&lt;/li&gt;&lt;li&gt;The third way - Export from Excel, then import to Stata.&lt;/li&gt;&lt;/ol&gt;&lt;p&gt;&lt;strong&gt;&lt;u&gt;Data-transfer Software&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;These kinds of programs will enable you to transfer data from, and to, different formats, easily and reliably. The problem is that they cost money, so if you're only into a small project, they probably don't pay off. If you are working on a university computer, maybe you have this software installed somewhere. For more details about StatTransfer, visit &lt;a href="http://stattransfer.com/"&gt;http://stattransfer.com/&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;&lt;u&gt;Copy &amp;amp; Paste&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;This will work in most cases, but might be problematic sometimes. For example, if you have a column with percentage format. Go to the worksheet in Excel, select all the columns that you need (click on the first column's letter and drag to the last column). Copy the selection (Ctrl+C or right-click and then copy or any other way you can think of). Now launch Stata. In the command window type &lt;span style="font-family:courier new;"&gt;edit&lt;/span&gt; and press enter. A window with an empty matrix should appear. Now paste (Ctrl+V etc). Voila!&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;strong&gt;&lt;u&gt;&lt;a href="http://3.bp.blogspot.com/_iuN8kqdF7_g/R1Oq7A-x79I/AAAAAAAAAcM/W2yHRosw3bY/s1600-R/project+1+stata+1.GIF"&gt;&lt;img id="BLOGGER_PHOTO_ID_5139639530612191186" style="CURSOR: hand" alt="" src="http://3.bp.blogspot.com/_iuN8kqdF7_g/R1Oq7A-x79I/AAAAAAAAAcM/v2nyQReP_XA/s400/project+1+stata+1.GIF" border="0" /&gt;&lt;/a&gt;&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;&lt;u&gt;Export &amp;amp; Import&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;Go to the worksheet in Excel, select from the menu bar File-&gt;Save As. In the bottom of the window that opened you have "Save as type". Select either "Text (Tab delimited) (*.txt)" or "CSV (comma delimited) (*.csv)". Choose a name for the file and remember the folder you saved it in. You will need this file's full path (e.g "C:\Documents and Settings\bla\My Documents\OECD Aid.csv"). Now press on the Save button. The program might ask you whether you are sure you want to save in this format, select Yes. &lt;/p&gt;&lt;p&gt;Launch Stata. In the command line, write,&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:georgia;"&gt;If you selected CSV as the "Save as type":&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:courier new;"&gt;insheet using "C:\Documents and Settings\bla\My Documents\OECD Aid.csv", comma &lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="font-family:Georgia;"&gt;If you selected Text as the "Save as type":&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:courier new;"&gt;insheet using "C:\Documents and Settings\bla\My Documents\OECD Aid.txt", tab &lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;a href="http://3.bp.blogspot.com/_iuN8kqdF7_g/R1OqxA-x78I/AAAAAAAAAcE/OshKWjIMRzw/s1600-R/project+1+stata+1.GIF"&gt;&lt;/a&gt;&lt;a href="http://3.bp.blogspot.com/_iuN8kqdF7_g/R1OqxA-x78I/AAAAAAAAAcE/OshKWjIMRzw/s1600-R/project+1+stata+1.GIF"&gt;&lt;/a&gt;&lt;a href="http://3.bp.blogspot.com/_iuN8kqdF7_g/R1OqxA-x78I/AAAAAAAAAcE/OshKWjIMRzw/s1600-R/project+1+stata+1.GIF"&gt;&lt;/a&gt;&lt;p&gt;&lt;/p&gt;You can use the Import &amp;amp; Export method with other Spreadsheet programs. As long as they can export the data in a tab- or comma-delimited format.&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;strong&gt;&lt;u&gt;Before you celebrate&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;Now, make sure you got the variables right. A simple way is to type &lt;span style="font-family:courier new;"&gt;sum&lt;/span&gt; in the command window. Stata will report, for each variable, the number of observations, mean, standard error, minimum and maximum values. Look at these numbers and make sure they make sense. Make sure that all numberic variables have the number of observations in the dataset. Missing values or string values are not counted by the sum command, so it's a good way to see if you got too many of them.&lt;/p&gt;&lt;p&gt;Another way to make sure the data you have is right is the &lt;span style="font-family:courier new;"&gt;tabulate&lt;/span&gt; command, or in short: &lt;span style="font-family:courier new;"&gt;tab&lt;/span&gt;. Running this command with a variable: &lt;span style="font-family:courier new;"&gt;tab donor &lt;span style="font-family:Georgia;"&gt;will print a histogram of the variable &lt;em&gt;donor&lt;/em&gt;. This way, we can make sure each country appears 7 times. When I run tab, I usually run it with the &lt;span style="font-family:courier new;"&gt;missing&lt;/span&gt; option (or just &lt;span style="font-family:courier new;"&gt;m&lt;/span&gt;, for short). If I don't ask for that option, Stata will not count observations for which the value of the variable (&lt;span style="font-family:courier new;"&gt;country&lt;/span&gt; in our case) is missing.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;a href="http://3.bp.blogspot.com/_iuN8kqdF7_g/R1OusA-x7-I/AAAAAAAAAcU/uogNJdRTKeM/s1600-R/project+1+stata+2.GIF"&gt;&lt;img id="BLOGGER_PHOTO_ID_5139643670960664546" style="CURSOR: hand" alt="" src="http://3.bp.blogspot.com/_iuN8kqdF7_g/R1OusA-x7-I/AAAAAAAAAcU/gcpErRuIybE/s400/project+1+stata+2.GIF" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;As you can see, the countries' observations are fine, but look at the value at the bottom of the tabulation. This is the result of the last row in the Excel file we used. It was a note below the table that Stata mistakenly took as another observation. We should get rid of it. The easiest way to do it is to type &lt;span style="font-family:courier new;"&gt;edit&lt;/span&gt; in the command line, scroll all the way down, select the observation and click on the "Delete..." button at the top of the Data Editor, and ask it to delete this observation. Then, when you close the window, a popup comes up and asks you whether to accept the changes. Accept them, of course.&lt;/p&gt;&lt;p&gt;Run the &lt;span style="font-family:courier new;"&gt;tab&lt;/span&gt; command again to see you didn't ruin anything by accident.&lt;/p&gt;&lt;p&gt;If everything is all right, save the file. Either use the icon in the top bar, or just type:&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:courier new;"&gt;save "C:\myfilename"&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:georgia;"&gt;If there is already a file with that name in that folder, Stata will not let you save like this. You will need to type &lt;/span&gt;&lt;span style="font-family:courier new;"&gt;save "C:\myfilename", replace.&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;p&gt;Congratulations!&lt;/p&gt;&lt;p align="right"&gt;(go on to &lt;a href="/2007/12/combine-multiple-datasets-into-one.html"&gt;Step #2&lt;/a&gt;)&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5290984163937382875-1941721614902610549?l=stataproject.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://stataproject.blogspot.com/feeds/1941721614902610549/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5290984163937382875&amp;postID=1941721614902610549' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5290984163937382875/posts/default/1941721614902610549'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5290984163937382875/posts/default/1941721614902610549'/><link rel='alternate' type='text/html' href='http://stataproject.blogspot.com/2007/12/project-1-getting-data.html' title='Step #1 - Getting the data'/><author><name>stataman</name><uri>http://www.blogger.com/profile/13749462261829425392</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_iuN8kqdF7_g/R1QoSA-x7_I/AAAAAAAAAck/q7Rdo-3Y6Js/s72-c/project+1+excel.GIF' height='72' width='72'/><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5290984163937382875.post-4667193246924190241</id><published>2007-12-05T18:28:00.000+02:00</published><updated>2008-01-28T11:00:45.626+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='gen'/><category scheme='http://www.blogger.com/atom/ns#' term='step #3'/><category scheme='http://www.blogger.com/atom/ns#' term='drop'/><category scheme='http://www.blogger.com/atom/ns#' term='replace'/><category scheme='http://www.blogger.com/atom/ns#' term='Data Manipulation'/><category scheme='http://www.blogger.com/atom/ns#' term='keep'/><category scheme='http://www.blogger.com/atom/ns#' term='conditions'/><title type='text'>Step #3 - Simple Data Manipulation</title><content type='html'>OK, so you have all your data gathered from all sources, and now you need to define additional variables, or drop some data that is irrelevant for your sample. To do so, you will first need to think about the statistical analyses you will want to do later on. If your Econometric model is quadratic, for instance, or in other words, if you measure the effect &lt;em&gt;x&lt;/em&gt; on &lt;em&gt;y&lt;/em&gt; with both &lt;em&gt;x&lt;/em&gt; and &lt;em&gt;x-squared&lt;/em&gt;, you need to create a variable with &lt;em&gt;x-squared&lt;/em&gt;. Also, it is very common to see log-level and log-log models, in which the natural log of some variables are run instead of the variables themselves. From the project's point of view, the data handling from now on will take place in parallel with the statistical analysis.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;&lt;u&gt;&lt;span style="font-family:georgia;"&gt;Variables which are simple functions of other variables&lt;/span&gt;&lt;/u&gt;&lt;/strong&gt;&lt;br /&gt;To create a variable, you should use the gen command. Its basic syntax is very simple:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;gen &amp;lt;new variable&amp;gt; = &amp;lt;expression&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;For example:&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;gen income_squared = income ^ 2&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;gen ln_income = log(income)&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;gen gdp_per_capita = gdp / population&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Stata will generate a new variable, and for each of the observations, it will evaluate the expression specified according to the values in the variables we use in the expression. In the first example, for an observation with an &lt;em&gt;income&lt;/em&gt; of 200,&lt;em&gt; income_squared&lt;/em&gt; will contain 40,000, but for an observation with an &lt;em&gt;income&lt;/em&gt; of 100, &lt;em&gt;income_squared&lt;/em&gt; will contain 10,000.&lt;br /&gt;&lt;br /&gt;You may use an expression that is independent of any variable. This way, the variable will be constant over all observations.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;&lt;u&gt;Variables which are determined by a condition&lt;/u&gt;&lt;/strong&gt;&lt;br /&gt;Now, suppose you want to create a variable like this:&lt;br /&gt;&lt;br /&gt;(1)&lt;br /&gt;educ_parent = father's education if the student is male and mother's education if the student is female.&lt;br /&gt;&lt;br /&gt;(2)&lt;br /&gt;education_level = 0 if education equals 0&lt;br /&gt;education_level = 1 if education is higher than 0 and lower-or-equal than 6&lt;br /&gt;education_level = 2 if education is higher than 6 and lower-or-equal than 9&lt;br /&gt;education_level = 3 if education is higher than 9 and lower-or-equal than 12&lt;br /&gt;education_level = 4 if education is higher than 12&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;In order to do this, you will need to use both the &lt;span style="font-family:courier new;"&gt;gen&lt;/span&gt; and the &lt;span style="font-family:courier new;"&gt;replace&lt;/span&gt; command. Let's first look at the &lt;span style="font-family:courier new;"&gt;replace&lt;/span&gt; command.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;replace &amp;lt;existing variable&amp;gt; = &amp;lt;expression&amp;gt; &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The only difference between &lt;span style="font-family:courier new;"&gt;replace&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;gen&lt;/span&gt; is that &lt;span style="font-family:courier new;"&gt;gen&lt;/span&gt; creates a new variable and puts values in it, while &lt;span style="font-family:courier new;"&gt;replace&lt;/span&gt; takes an already existing variable and puts values in it. Actually, if you will type &lt;span style="font-family:courier new;"&gt;help gen&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;help replace&lt;/span&gt;, you'll get to the same manual page.&lt;br /&gt;&lt;br /&gt;Let's look at the first example again. We need to first generate a variable named educ_parent that will be equal to the father's education, for boys:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;gen parent_educ = father_educ if boy == 1&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Notice the condition we added. In many Stata commands, conditions can be specified so that the command will be run only on observations for which the condition holds. Notice the double equality sign in the condition as opposed to the single equality sign in the beginning of the statement. Put simply, the single sign determines a value while the double sign checks whether the values are equal. In other words, it's like the difference between saying "set x to be equal to y" (x=y) and saying "are x and y equal?" (x==y).&lt;br /&gt;In the &lt;span style="font-family:courier new;"&gt;gen&lt;/span&gt; case, observations for which the condition does not hold, missing values will be assigned. In the &lt;span style="font-family:courier new;"&gt;replace&lt;/span&gt; case, these observations will simply stay with their previous values.&lt;br /&gt;Therefore, going back to our example, all that is left now is to run the following command:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;replace parent_educ = mother_educ if boy != 1&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;em&gt;&lt;u&gt;More on Conditions&lt;/u&gt;&lt;/em&gt;&lt;br /&gt;Before we move on to the next example, here are two comments about conditions:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;You can combine several comments with logical operators (and, or, not). :&lt;br /&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-family:courier new;"&gt;if x==y x==z&lt;/span&gt; &lt;span style="color:#009900;"&gt;// This means "if x equals y OR x equals z"&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family:courier new;"&gt;if boy==1 &amp;amp; immigrant == 1&lt;/span&gt; &lt;span style="color:#009900;"&gt;// This means "if boy equals 1 AND immigrant equals 1"&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family:courier new;"&gt;if !(income&amp;gt;100000)&lt;/span&gt; &lt;span style="color:#009900;"&gt;// This means "if income is not greater than 100,000&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family:courier new;"&gt;if income&amp;lt;=100000&lt;/span&gt; &lt;span style="color:#009900;"&gt;// This is equivalent to the previous line.&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family:courier new;"&gt;if income&amp;lt;=100000 &amp;amp; (profession == "Banker" profession == "Doctor")&lt;/span&gt; &lt;span style="color:#009900;"&gt;// This selects observations of people who are either bankers or doctors with income greater than 100,000. Note the parenthesis around the OR combination. If they weren't there, the condition would selected bankers with income greater than 100,000 and all doctors (regardless of income).&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;&lt;li&gt;&lt;strong&gt;Very Very Important&lt;/strong&gt; - Missing values are a dangerous pitfall! Look at the following examples: &lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-family:courier new;"&gt;if income &amp;gt; 100000&lt;/span&gt; &lt;span style="color:#009900;"&gt;// In Stata, observations with missing values will also fulfill the condition&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family:courier new;"&gt;if boy != 1&lt;/span&gt; &lt;span style="color:#009900;"&gt;// If you meant girls, they will fulfill the condition, but also observations in which boy has a missing value&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;In order to not to capture the observations with the missing values with the ones you are really looking for, try the following conditions respectively: &lt;/p&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-family:courier new;"&gt;if income &amp;gt; 100000 &amp;amp; income != .&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family:courier new;"&gt;if boy != 1 &amp;amp; boy != .&lt;/span&gt; &lt;/li&gt;&lt;li&gt;&lt;span style="font-family:courier new;"&gt;if boy == 0&lt;/span&gt; &lt;span style="color:#009900;"&gt;// This will do the same job, but it's shorter and more readable.&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/ol&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;Back to the second example above. To create the variable education level, we will use a gen command followed by a sequence of replace commands. I will write two possible implementions that creates exactly the same variable:&lt;/p&gt;&lt;span style="color:#009900;"&gt;// One possibility&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;gen education_level = 0 if education == 0&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;replace education_level = 1 if education &amp;gt; 0 &amp;amp; education &amp;lt;= 6 &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;replace education_level = 2 if education &amp;gt; 6 &amp;amp; education &amp;lt;= 9 &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;replace education_level = 3 if education &amp;gt; 9 &amp;amp; education &amp;lt;= 12 &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;replace education_level = 4 if education &amp;gt; 12 &amp;amp; education != .&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color:#009900;"&gt;// Another possibility:&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;gen education_level = 0&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;replace education_level = 1 if education &amp;gt; 0 &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;replace education_level = 2 if education &amp;gt; 6 &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;replace education_level = 3 if education &amp;gt; 9 &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;replace education_level = 4 if education &amp;gt; 12 &lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:Courier New;"&gt;replace education_level = . if education == .&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;em&gt;&lt;u&gt;TIP:&lt;/u&gt;&lt;/em&gt;&lt;br /&gt;&lt;br /&gt;After defining a new variable, especially if it is not a simple one, it is always a good thing to tabulate it against the variable you are constructing it on:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://1.bp.blogspot.com/_iuN8kqdF7_g/R2E92A-x8GI/AAAAAAAAAd8/sSIK6PvCMt8/s1600-h/tab_educ_level.GIF"&gt;&lt;img id="BLOGGER_PHOTO_ID_5143460247619367010" style="CURSOR: hand" alt="" src="http://1.bp.blogspot.com/_iuN8kqdF7_g/R2E92A-x8GI/AAAAAAAAAd8/sSIK6PvCMt8/s400/tab_educ_level.GIF" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;&lt;u&gt;Creating Dummy Variables with a Condition&lt;/u&gt;&lt;/strong&gt;&lt;br /&gt;Dummy variables, or indicator variables, are variables which have either the value of 0 or 1 (if they are not missing). For example, the variable is_working can have either 0 if not working or 1 if working. The creation of a dummy variable is made with a condition (but without if preceding it):&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;gen &amp;lt;dummy variable&amp;gt; = &amp;lt;condition&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Examples:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;gen is_working = income_from_labor &amp;gt; 0&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;gen high_demand = demand &amp;gt; 600&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;em&gt;&lt;u&gt;Problem - Missing Values&lt;/u&gt;&lt;/em&gt;&lt;br /&gt;Take a look at the second example. What if some observations have missing values in &lt;em&gt;demand&lt;/em&gt;. In the example given, they will get 1 in &lt;em&gt;high_demand&lt;/em&gt;. So you will probably say: "Hey, you already talked about it. I'll run it like this:"&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;gen high_demand = demand &amp;gt; 600 &amp;amp; demand != .&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;But this will make the value of &lt;em&gt;high_demand&lt;/em&gt; to be 0 for observations with missing &lt;em&gt;demand&lt;/em&gt;. Is this something you want to do? Well, sometimes. But in other times, you don't want to classify missing &lt;em&gt;demand&lt;/em&gt; as "not high", but to keep &lt;em&gt;high_demand&lt;/em&gt; missing too. In order to do that, you need to run the following command instead of the one above (note the difference):&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;gen high_demand = demand &amp;gt; 600 if demand != .&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Now, why is that? Good question.&lt;br /&gt;A condition - whether defining the dummy variable or specified after the &lt;span style="font-family:courier new;"&gt;if&lt;/span&gt; word - is actually a function that returns either 1 - if the condition is met - or 0 - if the condition is unmet. For each observations the value might be different. Once you specify a condition after an &lt;span style="font-family:courier new;"&gt;if&lt;/span&gt; word, Stata understands that the command is executed only on observations which meet the condition (for which the condition returns 1). But when you specify it after the = (remember the meaning of a single equality sign?), it tells Stata to put the value that the condition returns, 1 or 0, in the variable generated.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;&lt;u&gt;Getting Rid of Excess Data&lt;/u&gt;&lt;/strong&gt;&lt;br /&gt;You might want to drop some observations or variables that you don't need. That may improve running time, so you should consider whether you really need all the variables you have. You also want to reduce the complexity of the dataset and keep it as thin as possible.&lt;br /&gt;Two opposite commands will do the job: drop and keep.&lt;br /&gt;&lt;br /&gt;For both commands, you need to think first if you want to drop variables or observations. If you want to drop variables, the syntax is:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;drop &amp;lt;variable(s) name(s)&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;keep &amp;lt;variable(s) name(s)&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Examples:&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;drop gdp_per_capita_ppp gdp_per_capita_current&lt;/span&gt; &lt;span style="color:#009900;"&gt;// This will drop the variables specified and keep the rest&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;keep id school_id treatment average_score&lt;/span&gt; &lt;span style="color:#009900;"&gt;// This will keep the variables specified and drop the rest&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;If you want to keep or drop observations, the syntax is a bit different:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;drop if &amp;lt;condition&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;keep if &amp;lt;condition&amp;gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Examples&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;drop if sales_mil &amp;lt; 100 &lt;span style="color:#009900;"&gt;// This will drop all observations (companies for example) with sales less than $100M, and keep the rest of observations.&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;keep if problematic_education == 0&lt;/span&gt; &lt;span style="color:#009900;"&gt;// This will keep observations with problematic_education == 0 and drop the rest (the problematic ones, as defined previously in the program)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:georgia;"&gt;Notice that the if here acts the same way as in the gen and replace commands: it tells Stata which observations are relevant to the command.&lt;/span&gt; &lt;/span&gt;&lt;br /&gt;&lt;p align="right"&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="font-family:georgia;"&gt;(Go on to &lt;a href="http://stataproject.blogspot.com/2007/12/step-4-thank-god-for-egen-command.html"&gt;Step #4&lt;/a&gt;)&lt;/span&gt;&lt;/p&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5290984163937382875-4667193246924190241?l=stataproject.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://stataproject.blogspot.com/feeds/4667193246924190241/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5290984163937382875&amp;postID=4667193246924190241' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5290984163937382875/posts/default/4667193246924190241'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5290984163937382875/posts/default/4667193246924190241'/><link rel='alternate' type='text/html' href='http://stataproject.blogspot.com/2007/12/step-3-simple-data-manipulation.html' title='Step #3 - Simple Data Manipulation'/><author><name>stataman</name><uri>http://www.blogger.com/profile/13749462261829425392</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_iuN8kqdF7_g/R2E92A-x8GI/AAAAAAAAAd8/sSIK6PvCMt8/s72-c/tab_educ_level.GIF' height='72' width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-5290984163937382875.post-2220507383250476962</id><published>2007-12-03T13:27:00.001+02:00</published><updated>2007-12-13T17:29:15.437+02:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Merge'/><category scheme='http://www.blogger.com/atom/ns#' term='Append'/><category scheme='http://www.blogger.com/atom/ns#' term='Joinby'/><category scheme='http://www.blogger.com/atom/ns#' term='Step #2'/><category scheme='http://www.blogger.com/atom/ns#' term='Combine Datasets'/><title type='text'>Step #2 - Combine Multiple Datasets into One</title><content type='html'>In many cases, the data needed for the statistical analyses come from different sources. For example, if you want to analyze international growth, you might find economic indicators in a dataset of the World Bank, political indicators in think tanks such as Freedom House, and climate data in another dataset. Another case is when you have one dataset which is divided into multiple files. In this post I will try to elaborate a bit on how to make it work.&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;&lt;u&gt;Types of Datasets Combinations&lt;/u&gt;&lt;/strong&gt;&lt;br /&gt;&lt;br /&gt;There are actually two main types of combinations:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;u&gt;&lt;em&gt;"Vertical" combination&lt;/em&gt;&lt;/u&gt; - You want to do this when you want to add observations from one file to another file. For instance, if you are working on a sports statistics project and you have data for players performance in four separate files, one for each year between 2001 and 2004. Another possibility is that the data is separated according to different leagues, groups, etc. As long as the variables in the files are the same and the only thing you need to do is to add observations, this is vertical combination. The command in Stata we will use is &lt;span style="font-family:courier new;"&gt;append&lt;/span&gt;. We will explore this command later.&lt;/li&gt;&lt;br /&gt;&lt;br /&gt;&lt;p align="center"&gt;&lt;a href="http://3.bp.blogspot.com/_iuN8kqdF7_g/R1Qx5w-x8AI/AAAAAAAAAcs/ttaS7CjY-gg/s1600-R/Slide1.GIF"&gt;&lt;img id="BLOGGER_PHOTO_ID_5139787943207104514" style="CURSOR: hand" alt="" src="http://3.bp.blogspot.com/_iuN8kqdF7_g/R1Qx5w-x8AI/AAAAAAAAAcs/re82FmvonSA/s400/Slide1.GIF" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;&lt;p&gt;&lt;/p&gt;&lt;li&gt;&lt;u&gt;&lt;em&gt;"Horizontal" combination&lt;/em&gt;&lt;/u&gt; - This is the kind of combinations in which you want to add variables, and not observations. The observations appear in both files (at least most of them), but in each file there is different information about them. For example, if we're dealing with high school students and we have one file with their personal information and grades, and another file with SAT scores only. If we have an identifying variable in both files (e.g Social Security Number), we can assign each student his/her SAT score. This example is a One-to-One matching. There are three types of matches of this kind:&lt;/li&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;&lt;ol type="a"&gt;&lt;li&gt;&lt;u&gt;One-to-One matching&lt;/u&gt;: If the identifying variable which appears in the files is unique in both files, then it's a one-to-one match. Unique means that for each value of this variable, there is only one observation that contains it. In the figure below, &lt;em&gt;country&lt;/em&gt; is the identifying variable. In both datasets, each country has only one observation.&lt;/li&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;&lt;a href="http://4.bp.blogspot.com/_iuN8kqdF7_g/R1Qx6A-x8BI/AAAAAAAAAc0/VjkwLlYRHLI/s1600-R/Slide2.GIF"&gt;&lt;img id="BLOGGER_PHOTO_ID_5139787947502071826" style="CURSOR: hand" alt="" src="http://4.bp.blogspot.com/_iuN8kqdF7_g/R1Qx6A-x8BI/AAAAAAAAAc0/JmKXbgAGDuk/s400/Slide2.GIF" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;&lt;li&gt;&lt;u&gt;One-to-Many matching&lt;/u&gt;: If the identifying variable is unique in one file, but not unique in the other, then it's a one-to-many matching. This is very common when you have groups of observations in one file (the file with the identifying variable which is not unique), and information regarding each group in another file (the other file). The following figure will make it clearer:&lt;/li&gt;&lt;br /&gt;&lt;p&gt;&lt;/p&gt;&lt;a href="http://2.bp.blogspot.com/_iuN8kqdF7_g/R1Qx6g-x8CI/AAAAAAAAAc8/e24ak_6ZdO4/s1600-R/Slide3.GIF"&gt;&lt;img id="BLOGGER_PHOTO_ID_5139787956092006434" style="CURSOR: hand" alt="" src="http://2.bp.blogspot.com/_iuN8kqdF7_g/R1Qx6g-x8CI/AAAAAAAAAc8/tHAkteAxsBc/s400/Slide3.GIF" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;p&gt;As you can see, one can group the individuals to housefolds. The household identifying variable (&lt;em&gt;fam_ID&lt;/em&gt;) is common to both of the files. It is not unique in the individuals file, but it is unique in the households file. This enables Stata to assign the same value, of each of the households variables, to all the members of the household. Note that although we have a unique identifier for the individuals (&lt;em&gt;indiv_ID&lt;/em&gt;), it is irrelevant for this merge of files.&lt;/p&gt;&lt;li&gt;&lt;u&gt;Many-to-Many matching&lt;/u&gt;: This is very rare. This is also problematic, since there is no unambiguous rule for the assignment of values from observations in one file to observations in the other file. I will not elaborate on this matching too much.&lt;/li&gt;&lt;/ol&gt;&lt;/ol&gt;&lt;p&gt;&lt;strong&gt;&lt;u&gt;Commands Syntax&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;There are three commands you should know if you want to combine datasets: &lt;span style="font-family:courier new;"&gt;append&lt;/span&gt;, &lt;span style="font-family:courier new;"&gt;merge&lt;/span&gt; and &lt;span style="font-family:courier new;"&gt;joinby&lt;/span&gt;. All three of them combine the dataset currently in memory with data from a file you specify. We will name the data in memory "Master Data" and the data to combine from the specified file "Using Data". It will be clear why we use the word Using here.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;&lt;u&gt;Append&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;The append command does what we called "vertical" combination. It adds observations. It's syntax, in a simple form (for options not specified in this tutorial, you can always type &lt;span style="font-family:courier new;"&gt;help append&lt;/span&gt; in the command line in order to explore more about the command), goes like this:&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:courier new;"&gt;append using &amp;lt;filename&amp;gt;&lt;/span&gt;&lt;/p&gt;&lt;p&gt;Example:&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:courier new;"&gt;append using "C:\more_observations.dta"&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:Courier New;"&gt;append using "C:\more_observations" // (this is equivalent)&lt;/span&gt;&lt;/p&gt;&lt;p&gt;This will add the observations from the file C:\more_observations.dta to the data in memory. In case no extension is specified (i.e no .dta at the end of the filename), Stata assumes it's .dta, so you can omit it. &lt;/p&gt;&lt;p&gt;Now you understand why we call the data in C:\more_observations.dta "Using Data".&lt;/p&gt;&lt;p&gt;What happens if you have variables in the Master Data which do not exist in the Using Data? The observations from the Using Data will be assigned missing values in those variables. If there are additional variables in the Using Data which do not appear in the Master Data, the observations from the Master Data will have missing values in them.&lt;/p&gt;&lt;p&gt;&lt;u&gt;Tip&lt;/u&gt;: Before you append, you might want to make sure you know the source file for each observations. For example, if you append 2008 data to 2007 data currently in memory, you might want to make sure you have the variable &lt;em&gt;year&lt;/em&gt; in each of the datasets prior to the incorporation of the Using Data.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;&lt;u&gt;Merge&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;For "horizontal" combination of datasets you will need either merge or joinby. The difference between them is the method they use in order to do the merging, but in one-to-one or one-to-many merges, they give almost the same functionality. We will start with the merge command. The syntax, in its simplest form, is:&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:courier new;"&gt;merge &amp;lt;identifying variable(s)&amp;gt; using &amp;lt;filename&amp;gt;&lt;/span&gt;&lt;/p&gt;&lt;p&gt;Examples:&lt;/p&gt;&lt;p&gt;(1)&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;use "D:\geography", clear &lt;span style="font-family:georgia;"&gt;// Assumes D:\geography.dta"&lt;/span&gt;&lt;br /&gt;merge country using "D:\economy" &lt;/span&gt;// Assumes "D:\economy.dta"&lt;/p&gt;&lt;p&gt;(2)&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;merge fam_id using "K:\households.dta"&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:courier new;"&gt;&lt;span style="font-family:georgia;"&gt;(3)&lt;/span&gt;&lt;br /&gt;merge state year using "K:\USA_data\precipitation.dta"&lt;/span&gt;&lt;/p&gt;&lt;p&gt;In the first example, Stata first loads observations from a file called geography and then matches them to observations in the economy.dta file. This will do what the figure in the one-to-one section above shows.&lt;br /&gt;Note: what comes after the double forward-slash (//) will be ignored by Stata. It's used to make the code clearer to the human reader&lt;/p&gt;&lt;p&gt;In the second example, assume the individuals dataset is already in memory. I tried to do what the figure in the one-to-many section above shows. Notice that there is no difference in the syntax of the command. The only difference is in the structure of the files you are operating on.&lt;/p&gt;&lt;p&gt;In the third example, I wanted to show you can use more than one identifying variable. In case only combination of variables is unique (and you want to identify observations uniquely), you can specify both of them. In this example, suppose you have data on state-year basis (this is called Panel Data, because you have the same subjects reappearing in different instances) - let's say car accidents data (number of accidents, injuries, etc) and you need to add data about the weather conditions in that year, you need to tell Stata to make the match between the datasets according to both state and year.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: The &lt;span style="font-family:courier new;"&gt;merge&lt;/span&gt; command requires that both the Master and Using Data will be sorted by the identifying variables. If the Master Data isn't sorted, run &lt;span style="font-family:courier new;"&gt;sort &amp;lt;identifying variable(s)&amp;gt;&lt;/span&gt;before the merge command. If the Using Data isn't sorted, open it first (&lt;span style="font-family:courier new;"&gt;use &amp;lt;filename&amp;gt;, clear&lt;/span&gt;), then run the sort command, then save it (&lt;span style="font-family:courier new;"&gt;save &amp;lt;filename&amp;gt;, replace&lt;/span&gt;), open the Master Data and run the merge command. Here's an example:&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:courier new;"&gt;use "D:\economy", clear&lt;br /&gt;sort country&lt;br /&gt;save "D:\economy", replace&lt;br /&gt;use "D:\geography", clear&lt;br /&gt;sort country&lt;br /&gt;merge country using "D:\economy"&lt;/span&gt;&lt;/p&gt;&lt;p&gt;1) Since you saved D:\economy.dta in the third line, you will not need to open D:\economy.dta and sort it again in future runs.&lt;br /&gt;2) If you are doing a one-to-one match (i.e if the identifying variable(s) are unique in both sets), you can run the &lt;span style="font-family:courier new;"&gt;merge&lt;/span&gt; command with the &lt;span style="font-family:courier new;"&gt;sort&lt;/span&gt; option. It will automatically sort the datasets within the merge command. The &lt;span style="font-family:courier new;"&gt;sort&lt;/span&gt; option will not work if the identifying variables are not unique.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;The _merge variable:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;The merge command automatically creates a variable named &lt;em&gt;_merge&lt;/em&gt;, which contains information regarding the observation's existence in each of the two datasets. In the simple cases I mentioned above, it will contain, for each of the observations, one of the following values:&lt;br /&gt;1 =&gt; the observation (the identifying variable(s) values) appeared only in the Master Data&lt;br /&gt;2 =&gt; the observation (the identifying variable(s) values) appeared only in the Using Data&lt;br /&gt;3 =&gt; the observation (the identifying variable(s) values) appeared in both datasets&lt;/p&gt;&lt;p&gt;It is up to you to decide what you want to do with each of the cases. In some projects you will not want observatios with the value 2 in the &lt;em&gt;_merge&lt;/em&gt; variable. For example, take example 2 above. If you have households data in the Using data, but your interest is individuals (in the Master Data), you don't need observations with household data but without individuals that are linked to it. If you want to get rid of it, you can either type &lt;span style="font-family:courier new;"&gt;drop if _merge == 2&lt;/span&gt; after the merge command, or, even better, run the &lt;span style="font-family:courier new;"&gt;merge&lt;/span&gt; command with the option &lt;span style="font-family:courier new;"&gt;nokeep&lt;/span&gt;. That is:&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;merge fam_id using "K:\households.dta", nokeep&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:georgia;"&gt;You can also decide that observations in the Master Data that has no corresponding observations in the Using Data are irrelevant for your research. In that case, there is no special option for the &lt;span style="font-family:courier new;"&gt;merge&lt;/span&gt; command. So you need to add the command &lt;span style="font-family:courier new;"&gt;drop if _merge == 1&lt;/span&gt; after the merge command.&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;Other options of interest&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;&lt;u&gt;update and replace&lt;/u&gt;&lt;/p&gt;&lt;p&gt;What happens if you have some overlap between the variables in the files? Say, when you are merging data from the CIA World Factbook and the World Bank, you might have GNI in both datasets. If you specify none of them&lt;span style="font-family:georgia;"&gt;, Stata will keep the values that were in the Master Data (in memory). If you specify the options &lt;span style="font-family:courier new;"&gt;update replace&lt;/span&gt; (&lt;span style="font-family:courier new;"&gt;replace&lt;/span&gt; can't be specified without update), Stata will take, instead, the values that are in the Using Data and put them in place of the Master Data values. If you just type the &lt;span style="font-family:courier new;"&gt;update&lt;/span&gt; option (without &lt;span style="font-family:courier new;"&gt;replace&lt;/span&gt;), however, Stata will put the Using Data values only in observations where the Master Data values are missing.&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:georgia;"&gt;So in case you have the same variable but different values, use neither option when you think the Master Data is more reliable. Use the &lt;span style="font-family:courier new;"&gt;update&lt;/span&gt; &lt;span style="font-family:courier new;"&gt;replace&lt;/span&gt; options if you think the Using Data is more reliable. If they are equally reliable, use just &lt;span style="font-family:courier new;"&gt;update&lt;/span&gt;.&lt;/span&gt;&lt;/p&gt;&lt;p&gt;If you specified the &lt;span style="font-family:courier new;"&gt;update&lt;/span&gt; option, &lt;em&gt;_merge&lt;/em&gt; will contain 5 possible values:&lt;br /&gt;1 =&gt; the observation (the identifying variable(s) values) appeared only in the Master Data&lt;br /&gt;2 =&gt; the observation (the identifying variable(s) values) appeared only in the Using Data&lt;br /&gt;3 =&gt; the observation (the identifying variable(s) values) appeared in both datasets and the values are the same in both&lt;br /&gt;4 =&gt; the observation (the identifying variable(s) values) appeared in both datasets and the value in the Master Data is missing.&lt;br /&gt;5 =&gt; the observation (the identifying variable(s) values) appeared in both datasets but the values in the datasets are not missing and not the same.&lt;/p&gt;&lt;p&gt;Examples:&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:courier new;"&gt;merge country using "D:\Economy", update replace&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:courier new;"&gt;merge id using "K:\second_version", update&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;u&gt;keep&lt;/u&gt;&lt;/p&gt;&lt;p&gt;If you want only some variables to be merged, instead of all of them, you can specify &lt;span style="font-family:courier new;"&gt;keep(&lt;variables&gt;)&lt;/span&gt;.&lt;/p&gt;&lt;p&gt;Example:&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:courier new;"&gt;merge country year using "F:\intl_health_stats.dta", keep(birth_rate death_rate)&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;u&gt;unique, uniqmaster, uniqusing and sort&lt;/u&gt;&lt;/p&gt;&lt;p&gt;In order to make sure the one-to-one or one-to-many matches are really unambiguously defined, you can make sure the identifying variables are unique in either the Master Data (&lt;span style="font-family:courier new;"&gt;uniqmaster&lt;/span&gt;), Using Data (&lt;span style="font-family:courier new;"&gt;uniqusing&lt;/span&gt;) or both datasets (&lt;span style="font-family:courier new;"&gt;unique&lt;/span&gt;). It is really recommended to specify them, although it won't change the functionality. The main contribution of these options is to make Stata print an error and exit if what you think is unique is not really unique. The &lt;span style="font-family:courier new;"&gt;sort&lt;/span&gt; option can make the &lt;span style="font-family:courier new;"&gt;merge&lt;/span&gt; command sort the datasets on its own, but it is only possible if you're running a one-to-one match (in other words, &lt;span style="font-family:courier new;"&gt;sort&lt;/span&gt; implies &lt;span style="font-family:courier new;"&gt;unique&lt;/span&gt;).&lt;/p&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;More than one dataset&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;You can merge more than one file in one command. Instead of specifying one filename after &lt;span style="font-family:courier new;"&gt;using&lt;/span&gt;, you can add more filenames. Unless the &lt;span style="font-family:courier new;"&gt;nosummary&lt;/span&gt; option is specified, the command will create &lt;em&gt;_merge1, _merge2, ... , _mergen&lt;/em&gt; variables in which the observation's value in each of the &lt;em&gt;_mergek&lt;/em&gt; variables will be 1 if the &lt;em&gt;k&lt;/em&gt;-th dataset had this observations and 0 otherwise. The &lt;em&gt;_&lt;/em&gt;merge variable will still be there, but now the value 3 in it means that the observations appeared in &lt;em&gt;at least&lt;/em&gt; one of the Using datasets.&lt;/p&gt;&lt;p&gt;Personally, I prefer running the merge command iteratively and adding one dataset at a time. It requires to drop the _merge variable each time, and it might take longer time, but I can better report and deal with the merging outcomes.&lt;/p&gt;&lt;p&gt;&lt;strong&gt;&lt;u&gt;Joinby&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;The &lt;span style="font-family:courier new;"&gt;joinby&lt;/span&gt; command does almost the same job &lt;span style="font-family:courier new;"&gt;merge&lt;/span&gt; does, but its internal working is different, so there might be differences in terms of processing time. Its main difference arises when you're dealing with many-to-many matches, but it can be used for one-to-one and one-to-many matches too. The simple syntax is:&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:courier new;"&gt;joinby &amp;lt;identifying variable(s)&amp;gt; using &amp;lt;filename&amp;gt; &lt;/span&gt;&lt;/p&gt;&lt;p&gt;Example:&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:courier new;"&gt;joinby country using "D:\economy"&lt;/span&gt;&lt;/p&gt;&lt;p&gt;Unlike merge, the default of &lt;span style="font-family:courier new;"&gt;joinby&lt;/span&gt; is to drop all observations that do not appear in both datasets. In order to keep those observations, you need to use the &lt;span style="font-family:courier new;"&gt;unmatched()&lt;/span&gt; option. This option has four possible variations:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;&lt;u&gt;&lt;span style="font-family:courier new;"&gt;unmatched(none)&lt;/span&gt;&lt;/u&gt; - Keep none of the unmatched observations (this is the default)&lt;/li&gt;&lt;li&gt;&lt;u&gt;&lt;span style="font-family:courier new;"&gt;unmatched(master)&lt;/span&gt;&lt;/u&gt; - Keep observations in Master Data that have no match in Using Data (but not vice versa)&lt;/li&gt;&lt;li&gt;&lt;u&gt;&lt;span style="font-family:courier new;"&gt;unmatched(using)&lt;/span&gt;&lt;/u&gt; - Keep observations from Using Data that have no match in Master Data (but not vice versa)&lt;/li&gt;&lt;li&gt;&lt;u&gt;&lt;span style="font-family:courier new;"&gt;unmatched(both)&lt;/span&gt;&lt;/u&gt; - Keep all unmatched observations, from both Using and Master Data&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;So if you want to do the same thing done in the first example of the merge command, use the following commang:&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:courier new;"&gt;joinby country using "D:\economy", unmatched(both)&lt;/span&gt;&lt;/p&gt;&lt;p&gt;There is no need for the datasets to be sorted by the identifying variable(s), which is an advantage over &lt;span style="font-family:courier new;"&gt;merge&lt;/span&gt;.&lt;/p&gt;&lt;p&gt;The update and replace options are available for joinby too.&lt;/p&gt;&lt;p&gt;As I said, more details with:&lt;/p&gt;&lt;p&gt;&lt;span style="font-family:courier new;"&gt;help joinby&lt;/span&gt;&lt;/p&gt;&lt;p&gt;&lt;strong&gt;&lt;em&gt;&lt;u&gt;Many-to-Many Merge&lt;/u&gt;&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;&lt;p&gt;Although I have never needed it, this is where merge and joinby will give you totally different results. The question is how to match values from one dataset to the other. I think the best way to explain the difference between the commands is graphically:&lt;/p&gt;&lt;p align="center"&gt;&lt;a href="http://2.bp.blogspot.com/_iuN8kqdF7_g/R1VdfQ-x8DI/AAAAAAAAAdE/uhqsCF_HAFU/s1600-h/Slide4.gif"&gt;&lt;img id="BLOGGER_PHOTO_ID_5140117341428903986" style="CURSOR: hand" alt="" src="http://2.bp.blogspot.com/_iuN8kqdF7_g/R1VdfQ-x8DI/AAAAAAAAAdE/uhqsCF_HAFU/s400/Slide4.gif" border="0" /&gt;&lt;/a&gt;&lt;/p&gt;Now you can understand the meaning of the sentence describing the &lt;span style="font-family:courier new;"&gt;joinby&lt;/span&gt; command in the help reference: "Form all pairwise combinations within groups".&lt;br /&gt;&lt;br /&gt;&lt;strong&gt;&lt;u&gt;Conclusion&lt;/u&gt;&lt;/strong&gt;&lt;br /&gt;&lt;strong&gt;&lt;u&gt;&lt;/u&gt;&lt;/strong&gt;&lt;br /&gt;If you want to add observations: &lt;span style="font-family:courier new;"&gt;append&lt;/span&gt;.&lt;br /&gt;If you want to add variables: &lt;span style="font-family:courier new;"&gt;merge&lt;/span&gt; or &lt;span style="font-family:courier new;"&gt;joinby&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;As always, before you celebrate, make sure you got the combination of the files right by looking at the means, counts, minimum and maximum values (&lt;span style="font-family:courier new;"&gt;sum&lt;/span&gt; command) and tabulations &lt;span style="font-family:courier new;"&gt;&lt;span style="font-family:georgia;"&gt;(&lt;/span&gt;tab&lt;span style="font-size:0;"&gt; &lt;/span&gt;&lt;span style="font-family:georgia;"&gt;command)&lt;/span&gt;&lt;/span&gt;&lt;span style="font-family:georgia;"&gt;. Take a special look at the &lt;em&gt;_merge&lt;/em&gt; variable.&lt;/span&gt; Look for missing values or other outlying observations. If you have too many of them, you might have made a mistake along the way. Browse the data a bit. See that the data merged correctly.&lt;br /&gt;&lt;br /&gt;Don't forget to save the file (that is, if you don't want to rerun the merge command later).&lt;br /&gt;&lt;br /&gt;&lt;p align="right"&gt;(go on to &lt;a href="http://stataproject.blogspot.com/2007/12/step-3-simple-data-manipulation.html"&gt;Step #3&lt;/a&gt;)&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/5290984163937382875-2220507383250476962?l=stataproject.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://stataproject.blogspot.com/feeds/2220507383250476962/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=5290984163937382875&amp;postID=2220507383250476962' title='36 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/5290984163937382875/posts/default/2220507383250476962'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/5290984163937382875/posts/default/2220507383250476962'/><link rel='alternate' type='text/html' href='http://stataproject.blogspot.com/2007/12/combine-multiple-datasets-into-one.html' title='Step #2 - Combine Multiple Datasets into One'/><author><name>stataman</name><uri>http://www.blogger.com/profile/13749462261829425392</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_iuN8kqdF7_g/R1Qx5w-x8AI/AAAAAAAAAcs/re82FmvonSA/s72-c/Slide1.GIF' height='72' width='72'/><thr:total>36</thr:total></entry></feed>
