Step #4: Thank God for the egen Command


What's so special, really, about the egen (extensions to genereate) command? The answer is that it lets you do lots of things to the data. Things that in other statistical programs might take a lot of commands are possible to do with a couple of egen commands. So this is actually the next phase of data manipulation.

The syntax is pretty simple:

egen <new variable>= <function>(<expression(s)> or <variable(s)>) [, by (<variables>)]

The functions actually determine what the egen command will do. There are many of them, all described in help egen, and the following sectios of this step will describe the use of the most commonly used functions. These examples will hopefully clarify how to use the different functions and how can they help us.


egen store_mean_price = mean(price), by(store_id)

This example will create a variable in which, for each observation, the value will be the mean price of all observations that have the same store_id. See the figure under rowmean() for a graphic illustration.

One can omit the by option - this will put the mean of the original variable for all observations in the dataset.

Other examples:
egen mean_firm_occupation_wage = mean(wage), by(firm_id occupation_id)

This will put, for each observation, the mean wage of all other observatios with the same firm_id and occupation_id.


The function rowmean also compute means, but instead of computing means of a variable across observations, it compute the mean across variables for each of the observations.

egen mean_score = rowmean(math_score physics_score chemistry_score)

Suppose you had a dataset of students and their scores. This example will simply create a new variable - mean_score - which will hold the mean of math, physics and chemistry score for each of the students.

Note that the fact that it computes it separately for each of the observations makes the by option irrelevant. Take the previous example: There's no meaning to do add a by(class_id) option to the egen command when using the rowmeanfunction. If you want the mean score in class for any of the subjects (mean score across students), you should use the mean() function instead of the rowmean(). If you want the mean score (across students) of the mean score (across subjects), you need to first do mean and then rowmean (or vice versa).

You might ask, what's the difference between the rowmean() and simply using the gen command:

gen mean_score = (math_score + physics_score + chemistry_score) / 3

There are two main differences:
1. You can use wildcards - The same rowmean command can be written like this:

egen mean_score = rowmean(*_score)

This is very useful if the list of valiables is very long, or if you think that later on you might add english_score history_score, and so on, to your dataset, and you don't want to update this command every time. To learn more about wildcards, see help varlist.

2. Missing values - If one of the variables mentioned above is missing, gen command will not be able to sum the three variables and will therefore put missing value for mean_score in that observation. This is the case even if the other two scores are not missing. egen rowmean, on the other hand, will disregard the missing values and compute the mean of only the nonmissing values in the variable list. Only if there are missing values in all of the variables specified, egen rowmean will put missing value in the generated variable (just like the gen command). It's up to you to decide which one is better. Sometimes it is enough to have one missing value in order for the mean value to be irrelevant to what you are measuring, and sometimes you may decide that the mean of the nonmissing variables only is good enough.

This slide shows how the mean() and rowmean() functions work:

Note: Although the observations are sorted according to the by variable (class_id) here, it doesn't mean that you should sort them before. egen doesn't need the dataset to be sorted according to the by variable (although I'm guessing that if it's sorted, it will take less time to process).

sum() and rowtotal(), max() and rowmax(), min() and rowmin()

This is the same as mean and rowmean, but instead of calculating means, these functions calculate sums. Here are some examples:

egen team_effort = sum(effort), by(team)
egen total_correct_answers = rowtotal(question_*)

There is a small difference between rowmean and rowtotal in the way missing values are treated. rowtotal simply treats missing values as zeroes. So even if the values are missing in all the specified variables, the new rowtotal variable will be 0. rowmean would have put missing value instead. If you want to put a missing value there, you might want to do something like that:

egen total_correct_answers = rowtotal(question_*)
egen nonmissing_answers = rownonmiss(question_*) // This function puts the number of variables for which this observation had a missing value
replace total_correct_answers = . if nonmissing_answers == 0
drop nonmissing_answers

For max and min, the function gives the maximum or minimum value, respectively, out of all the values of the specified variable (within a group, if you are specifying the by option). Let's say, for example, that you want to normalize all values of the variable wage in a group so that each observation will have 1 if its wage is the maximum wage in a group, 0 if its wage is the minimum and some number between 0 and 1 if it's in between. The following set of commands will do the job pretty easily:

egen max_wage_f = max(wage), by(firm)
egen min_wage_f = min(wage), by(firm)

gen norm_wage_f = (wage - min_wage_f) / (max_wage_f - min_wage_f)
replace norm_wage_f = 0.5 if norm_wage_f == . // This is for the firms in which the maximum wage equals the minimum wage. The command above will give them missing values, because the denominator equals zero

Here is another example of using egen max in order to populate a nonmissing value to observations with missing values. Suppose you have a dataset of students in schools. In one of your regressions you want to take into account the number of children whose father dropped out of high school. In order to do so, you will the following commands:

egen f_dropout_kids_only = count(student_id) if f_educ<12, by(school) // This will count the number of children for whom the condition applies. Observations in which the condition does not apply, will get a missing value to the generated variable.

egen f_dropout_kids = max(f_dropout_kids_only), by(school)

drop f_dropout_kids_only // Don't get confused. This will drop the variable f_dropout_kids_only that is no longer needed

Although, in general, missing value is greater than any other nonmissing value, the egen max ignores missing values and therefore the observations with the nonmissing value in that school - i.e the kids whose father dropped out - will be given to the other observations in the same school (we need all of them in the regression: both the ones who apply and the ones who doesn't). The following figure might make it clearer.

Note: Instead of egen max, we could have also used egen min or egen mean again as the second command. Both egen min and egen mean ignore missing values and since the nonmissing values are equal for that by category, the functions will yield that same value. This is not true for egen sum, though, because the sum function will multiply the nonmissing value by the number of nonmissing observations.

For additional statistical calculations of the within group, see help egen and look for functions such as sd() (for standard deviation), median(), mode() and others. You can also calculate statistics across a group of variables on an observation-basis (instead of across a group of observations on a variable-basis). See help egen and look for the functions that start with row : rowmean(), rowmin(), rowmax(), rowsd(), rowtotal(), etc.

tag() and group()

These two function are really useful with identifying variables that have more than one observation (group-identifying variables) - the same ones we used before in the by options.

Suppose you have a dataset of gas prices. Each observation has the type of fuel, the price per gallon, the station ID, and week in which the price was recorded. Not all stations were recorded in each of the weeks. That is, in some weeks, some stations didn't have their price taken. In your research, you decide to work only on stations for which you have full data - i.e, those which appear in each of the 50 weeks.

If the data had only one observation per station-and-week combination, you could have just used the count() function of egen:

egen station_count = count(week), by(station) // This will count the number of observations with non-missing values in week, for each value of station, and put the result for each observation of that station.

The problem is that each station-week combination has more than one observation and the number of observations per station-week varies between stations (remember, each station-week has as many observations as the number of fuel types - a price for leaded, unleaded, premium, etc.). Simply counting observations will not work here. We need to "tag" stations: in other words, we will create an indicator (a dummy variable) which will be 1 for only one observation per station, and 0 for all other observations of the same station.

Once we tag each new combination of station-week, we can count how many station-week combinations there are for each station -- this will give us the number of weeks for each station. Although I said we need to count, we will use the sum() function of egen, because count() will add to the count observations with 0, whereas sum will not (which is good because we do not want to count the same station-week more than once):

egen station_week_tag = tag(station week) // We're not using the by option since the group-identifying variables are already in the tag.

egen weeks_of_station = sum(station_week_tag), by(station)

Graphically, this is what we actually do:

The group() function is used in the same manner we use the tag() function, but instead of putting 1 in each new combination and 0 in combinations that it has already seen, it puts 1 for all the observations of the first combination it sees, then 2 for all the observations of the second combination, and so on. The benefit of this function is that you can create a full numeric ordered single variable that enumerates all combinations. When we will deal with loops, it might be clearer why this is good.


The egen command can help you play with the data pretty easily and intuitively (once you get the trick of the function you are using). There are other functions of egen I did not describe here. As I said, you can use the help egen, and don't be afraid to experiment with the functions.

To check whether your function work, browse your dataset. Sort it first by the group variables you mentioned, and then just browse the variables you want to. If you have a large dataset, you can limit the browse command using if conditions or in. Here are two examples:

sort firm

browse firm employee wage min_wage_f max_wage_f norm_wage_f in 2000/2200 // This will browse observations #2000 through #2200

browse firm employee wage *_wage_f if firm >= 100 & firm <= 200

You can do the same with the list command, by the way (but list is limited to the width of the output screen).

Good luck!

(Go on to Step #5)


me said...

Say you have a list of respondents and each respondent lists up to 10 friends. You have info on those friends and want to get the mean of all the friends on some variable. If you use the egen rowmean command and respondent 1 and respondent 2 both list friend a. Will friend a be counted only once in the mean, or will they be counted twice in the mean? If so, is there a way to know how many are not being counted because of this? Thanks a lot!

stataman said...

I'm going to need more info on how the dataset looks like - what you have as observations and what how you save friends data. I'm guessing by the way you use rowmean rather than mean that you are working on a wide dataset: i.e for each respondent you have 10 variables each with friend's ID, 10 variables with friend's variable of interest, etc.
If this is the case, then rowmean() on the list of 10 friends' variable of interest will give you the friends' mean for this respondent. If you have another observation with another respondent listing the same friend it doesn't matter.

I don't know what else you are trying to do with the project, but have you considered having 10 observations per respondent, each with a different friend and his/her variables of interest. This way you can (a) combine the dataset of respondents with the dataset of friends pretty easily according to the friend's ID (this "a" you refer to) (b) use either "egen mean_friends_var = mean(var_of_interest)" or you can simply run "collapse var_of_interest_1 var_of_interest_2 , by respondent_ID.

The former will keep a 10-obs-per-respondent dataset with new variables that will hold the mean while the latter will "collapse" the dataset to a 1 observation per respondent dataset with the friends' mean of var_of_interest_1 and var_of_interest_2 instead of each friend's original value.

Does this help?

Jane said...

I have obs for each year 2002-2006. However, 2007 only goes to June. Thus I want to generate a count for 2007 that is essentially
egen c6=count*1.5(total)if year==2007, by(home)

However, this does not work. Any suggestions? Thanks a million

stataman said...

I'm not sure you want to count. If I'm right, you are trying to interpolate somehow an annual number from 6 months of data? Then I'm not sure why 1.5 and not 2, but in any case, egen can't do mathematical functions that are possible with gen. What you can do is to first create the variable with egen and then recalculate. For example, if I want to sum up the annual income by household from monthly observations, (and I want to keep the dataset on the monthly level - otherwise I would use collapse), then I'd do:

egen annual=sum(monthly), by(year household_id)

replace annual = 2*annual if year == 2007

I think this should work.

Erik said...


I believe I want to use a command similar to this: egen station_count = count(week), by(station).

I am using xtreg, fixing on teachers, and finding that some of the teachers only had 1 student. I would like to limit the analysis to teachers who had at least 20 students. To do that, I figured I needed to write a new variable, like the station_count above (except using teacherid and studentid) and then insert an if statement on the xtreg limiting the sample to those teachers with 20 or more students.

Is that right? So far, it hasn't worked. The xtreg is still analyzing teachers with 1 student, despite using the if statement with the new egen variable >20.


P.S. Great blog!

stataman said...

Hi Erik,

Thanks. I'm afraid I won't be able to answer your question directly as I don't know how your dataset looks like. I have a guess though. You might have each observation representing some student-station-week observation (whatever station and week means). I'm not sure how you count students, but if you have more than one observation per student you should count distinct students and not observations.

This will require first tagging the students - assuming each student_id is unique (no same id for students in different schools, for example) - you can run:

egen tag_student = tag(student_id)

And only then counting distinct observations:

egen students_per_teacher = count(tag_student) , by(teacher_id)

I hope this helps. If not, try to tabulate the variable you created, see if it has something like the distribution you expect, maybe you have something else wrong along the way. You can elaborate more here.

Erik said...

Hi stataman. Thanks for the response. Unfortunately, that is still giving me the same result as I had previously. Is there no way in xtreg (with an option) to tell it to only include teachers (those I am fixing on) with 20 or more students?

As for the database, it is set up with student ids, and then each student has 4 variables (amongst all the others) that identifies their teacher at a given time point. Don't know if that helps.

stataman said...

hmmm... that's weird. then if you have four teachers how can you pick the one you do the xtreg on?

if you had just one variable for your teacher, then after constructing students_per_teacher you should have run (assuming you want Fixed Effects regression):

xtreg y x1 x2 if students_per_teacher > 20 , fe i(teacher_id)

Try to identify the problem. Is it not having a variable that says how many students the teacher of this student had? Or is it having the variable but not being able to incorporate it into the xtreg command. I'm afraid I don't know of a different way to tell xtreg to run only categories with more than N observations in them. Just constructing another variable and conditioning with "if" on this variable.

Erik said...

Yes, I have run the xtreg as you wrote it, except that I used xtset tchrid first. I think that is the same??

I limited the xtreg to students who had the same teacher throughout a school year. Thus I fix on the teacher in that year. There are two years worth of data, so I run separate xtregs for each year since the majority of students did not have the same teacher in both years.

I think the code has worked, now that I have examined it a bit more deeply. It seems to have eliminated the less than 20 groups from the analysis. Thanks for the help!

Dimension Z said...

Hi Stataman,
I am trying to run a regression using the means of my independent variables. I have been able to compress the means but I dont know how to use them in my regression.
Please advise.
Many thanks.

stataman said...

Hi Ayesha,

In compressing the means you mean by using the "collapse" command? If you want to run a group-level regression of the means you will need each observation to represent the group you calculated the mean for, and the "collapse" command does exactly this for you.

Once you are done, just run your favorite regression command. A simple linear regression (OLS) will be:

reg <independent variable> [dependent variables]

Did it work for you? If not, try to tell me what command you wanted to run and what error did you get.

Susanne said...

Hello Stataman
Thank you for your Blog.
I stumbled upon it when I was looking for a way to create a mean over several observations - and found your post on rowmean. However, I would also like to create the median over the same observations. There does not seem to be a "rowmedian" command... any suggestions?

stataman said...

Hi Susanne,

Yes, I have noticed too that there's no median by row. Here are a couple of things you can do:
1. Reshape your data to be long rather than wide. If you have lots of variables per observation each denoting a different year, say, than turning the dataset to a panel - each observation is original_obs at year t instead of original_obs with all years - will make things easier. If you will want to run different Fixed-Effects regressions in the future, for example, then you will have to do it at some point and better reshaping sooner than later. Once the data is in long format, then you can do more with the egen command. See help reshape for more details.
Even if you don't need the shape long eventually, you can reshape the data back to wide format after you calculate.

2. If you insist not to play with the shape, you can do your own calculation. Personally I think it's a worse solution than the former, and usually will be more cumbersome, because you will have to sort the values of each row's relevant variables to extract the middle value, but if you insist...

Yeah, it's a bit of a drag, but it will make you stronger ;)

Susanne said...

Me again ... thank you for your quick answer. I could not stop searching, so I found a package called "egenmore" which includes the "rowmedian" egen function. It seems to work as expected... in order to install it, type: "ssc install egenmore" into the command line. More about the functions under "help egenmore". Play around with the shape? I probably should do that anyway, so I'll know what you mean ... lol
Thanks again

stataman said...


I wasn't aware of egenmore. Thanks for that.

Michael said...

I have an observation on one of your examples in Step 4 on the -egen- command. I believe you have an error in one of your examples. You have an example with this command:
egen f_dropout_kids_only = count(student_id) if f_educ<12,>

Stata 10 would not execute that command. Further, should that command have "by(school)" at the end? It seems so in order to produce the same results as your output graphic.

Great job on this website. I'm working through your examples and learning a lot. Thanks!

stataman said...

Thanks, Michael.
You are 100% right. I fixed the typo.

Let me know if you find anything else.

Zina said...

Dear Stataman,

First let me thank you for this really nice resource! I have a small question...and being a statanewbie I would really appreciate insight on that
I know it is probably simple and I need a combination of the egen and max command...but here is what I am trying to do. I need to create new observations of

1- the number of working PEOPLE in the household to match the hh number
2- the number of working CHILDREN in the household to match the hh number
3- the order of the children in the household

Here is what I have:
1- unique hh number
2- the status if someone is working or not
3- the age of each one in the sample

(a child in my data is defined as under 18)

I would really appreciate it if you could help me on that!

Thanks a lot!


stataman said...

Good question, Zina.

Let's divide the solution to two parts: First, you need to flag observations that match what you want to count (i.e working adults and working children). Second, you want to count the flags by household.

Assuming you have the working status of a person in a household saved in the variable "working", you can flag the children with either:

gen working_child = working == 1 & age < 18


gen working_adult = 1 if working == 1 & age < 18

Both commands will put 1 for observations that satisfy the condition. However, for observations in which the condition does NOT hold, the first command will put zeroes whereas the second command will put missing values.

For the second step you just need to sum up, for each household, all the 1's that you created. To do this, run:

egen work_persons_in_hhold = sum(working), by(household_id)
egen work_children_in_hhold = sum(working), by(household_id)

For the order of birth use the rank() function:

egen birth_order = rank(age), by(household_id)

On different possibilities to rank (especially when you have two people with the same age) see help egen and look for "rank".

This will actually rank all people in household. If you just want to rank children you can first create a variable that contains the age only for children (missing value otherwise) and then do the egen rank on it:

gen age_child = age if age < 18
egen birth_order = rank(age_child), by(household_id)

Have fun

Elizabeth said...

this is so helpful! thank you.

Michael said...

I'm using -egen- as described on Stataman's Step 4 web page to (1) create a new variable to calculate the sum for groups of observations and (2) then create a tag of '1' for the first observation in each group. Here is where the problem arises. I want to get a group sum only for records in a specific date range rather than over the entire data set.
The first function for -egen- to calculate group sums as described above works fine. But when I use -egen- to create tags so I can reasonable the total for each group, I have problems because I am using a subset of the data set based on a range of dates. The problem seems to occur because observations with a tag of '1' sometimes are assigned to an observation *outside* the specified date range. Because an observation is outside the date range, the observation's value for the group total variable is '.'. Thus, the displayed value for that group (based on the observation with a tag of '1') is '.' rather than the correct group total.
Any advice? Thanks.

Rachael said...

I have a dataset with standardized test scores. I want to calculate the achievement gap. I want to subtract group scores from each other by subject, grade level, and year. So, for example, I want to subtract low-income math scores grade 3 in 2007 from white math scores grade 3 in 2007. There are so many condition, I'm not sure how to do this simple calculation. I have variables for year, group, subject, grade. Thank you so much.

stataman said...


I would suggest narrowing your groups to accommodate the date ranges. First create a variable that will have a distinct value for each date range that interests you (to do that easily, if you have Stata 10 and above look at help dates_and_times. For versions before 10 there are other functions - look for them).
Then once you have a variable for the time periods (say, months, or quarters) you can enter it to all your egens. Instead of specifying only the original group you specified before (say it was "firm_id") specify both the group identifier and the period identifier:

Instead of:
egen tag_group = tag(firm_id)

egen tag_group = tag(firm_id period)

I hope this helps.

stataman said...


There are several ways you can do this. I am assuming you want to get both the difference in mean scores and standard errors for this difference (to be able to test if they are statistically significantly different than 0).

To do that, you will first need to decide what is the dimension along which you want to calculate the difference and what is the dimension that defines your sample.
In your example: "low-income math scores grade 3 in 2007 from white math scores grade 3 in 2007", grades and years define your sample, while income level of the family is the dimension you want to measure the difference within this sample.

For this specific example you want to run something like (depends on your method of estimation, I assume OLS with regular standard errors (assuming homoskedastic residuals), assuming the variable high_inc is 1 for high income and 0 for low income families.

reg test_score high_inc if grade == 3 & year == 2007

If you need many such regressions you can automate them with a loop (see step #6)

foreach grade in 3 5 7 {
forvalues year = 2005/2009 {
reg test_score high_inc if grade == `grade' & year == `year'

Francie said...

Hi, I am trying to generate a variable that counts the number of other people in the data set who match certain characteristics of the individual. As a simplified example: if an individual is a white male, I want a variable that says how many OTHER white males are in the data set. My actual need is more complicated, but I essentially want to know how to do:
egen num=count(YourZ=MyZ & YourQ=MyK).

thanks so much!!!

stataman said...

Hi Francie,

Good question. This is something that one definitely can need sometimes, and the implementation by egen and such commands is not very intuitive because you have a condition that relates one observation to others in the same group. However, you'll see that a different approach to the problem will solve it.

Let's start by stating the problem more generally: you have M variables that define a group: groupvar_1, groupvar_2, ..., groupvar_M. For example, (school, class, year). Then you have N categorical variables for which you want to create your variable of interest to hold the some value for all observations that have the same value in the categorical variables. Name those variables: var_1, var_2, ..., var_N. In your example, white and male are the two variables (my example will give you a value for each combination of those variables' value).

The solution is to treat the categorical variables as if they are group variables too that further divides each group of observations to smaller groups. In other words, if you want to count white males in each school, make each combination of the variables white and males and the school id a group in its own, and count its members:

egen others_like_me = count(groupvar_1), by(groupvar_1 groupvar_2 ... groupvar_M var_1 var_2 ... var_N)

Since you want don't want me (the current observation) to be counted, you can simply:

replace others_like_me = others_like_me - 1

I hope it helps. If not, please come back again ;)

Liana said...

Thanks so much for this post--very helpful. Hopefully you can help with my question. I'm trying to use egen to create unique family id variables. I would like to assign new group ids to any individual who possesses the same values on three other variables (egen familyid=group(id dob name)). The problem is that I have a lot of missing values in my dataset and I'd like to include in individuals who match on all existing traits even if they have a missing value somewhere. For example if I have two observations with the same dob and name, but one observation has id as missing, I'd still like to be able to group these two variables into one family id. Do you know how I could go about doing this? Basically I just want STATA to ignore missing values and match on values where there is existing data. Thanks so much!

egen familyid=group(id dob name)

stataman said...

Hi Liana,

I'm not sure what you want when you want Stata to "ignore" missing values.
How can you say that an observation with (3, 1/12/1954, "Smith") and another one with (3, 1/12/1954, "") are in the same group? If the (3, 1/12/1954) is good enough to group observations, then you can simply group only by the first two variables (id, dob).

If you want it to be finer grouping for those that have full values, then still you need to decide how to group the missing values. Should (3, 1/12/1954, "") be grouped with (3, 1/12/1954, "Smith") or (3, 1/12/1954, "McMahon")?

By the way, if the problem is that the group code that egen is creating missing, add the "missing" option to the egen command and it will treat missing values as if they were another category.

mel111 said...
This comment has been removed by the author.
mel111 said...
This comment has been removed by the author.
mel111 said...

Hi stataman, im not sure if this will come under the egen command. I have managed to make a variable through gen newvar=1 if var1==1|var2==1|var3==1, so that i have a variable which is 1 when say, people have a dog, or have a cat, or have a hamster, but this only gives me newvar=1, and i also need to code the newvar=0 as the value of another variable, such as has an fish, so i can perform a regression comparing the likelihood of an outcome, comparing those who have a hairy pet, compared to those that have a fish.Im sorry if this is quite unclear. Basically, in my regression i want to be able to set the baseline as a complete different variable, rather than just 'do not have a hairy pet'. Thanks very much. Mel.

stataman said...

Hi Mel,

To create a dummy variable, or an indicator variable, you can do:

gen newvar1 = <condition>

For example:
gen newvar= var1==1|var2==1|var3==1

This way you will get 1 when the condition is true and 0 otherwise.

The only problem is that you might assign a 0 to variables for which you have missing values (if you don't know if the person has a cat, you don't want to say that she doesn't). Thus, if you have missing values in your variables you might want to do this:

gen newvar= var1==1|var2==1|var3==1 if var1!=. & var2!=. & var3!=.

I hope this helps

fM said...

simply to say thanks for sharing!
It's really useful.

vikas said...

Thx a lot.
It 3:15am and because of your blog my problem is solved and I am finally going to sleep.

Babi_Partho said...

Great blog, and a very useful post! Thanks!

TugbaZ said...

I use the "survey mean" command to find out the mean of the variable. The command is such as

//svyset [pweight = lrwght], psu(psu) strata(strata)
//svy: mean variable over(country occupation)

What I would like to do is simply taking this mean value and generate a new variable regarding of this mean for the corresponding group (country occupation). Note that "egen" command is not working there.

Could you please help me?

Thank you in advance...

Chelsea Smith said...

How can I get around not being able to use the replace command with egen functions? I'm creating a scale/average using longitudinal data. In 2002 all respondents were asked questions about depression but in 2004 only new respondents were asked those depression questions. So I want my average to equal the 2002 mean if they answered those questions, but to equal the 2004 mean if they were not a respondent in 2004.
I'd like the code to be:

egen depress=rowmean(dep_a02 dep_b02 dep_c02) if year02~=.
replace depress=rowmean(dep_a04 dep_b04 dep_c04) if year02==.

Is there another way to create the variable I need despite not being able to use the "replace" command? Thanks!

Chelsea Smith said...
This comment has been removed by the author.
Liana said...

This post has been helpful to me, so I thought I could quickly answer Chelsea's question.

The simplest work around would be to create multiple variables using egen and then replace using gen.

For example:

egen depress_a=rowmean(dep_a02 dep_b02 dep_c02) if year02~=.
egen depress_b=rowmean(dep_a04 dep_b04 dep_c04) if year02==.

gen depress=depress_a if year02~=.
replace depress=depress_b if year02==.
drop depress_a depress_b

It's a little more work, but should solve your problem.

Good luck!

Chelsea Smith said...

Thanks so much for the suggestion, Liana! I followed your advice and it worked to get me the very variable I needed. Thanks again!

Stata guy said...

Hi Stataman,

Is there a way to use egen rsum
(vars) such that if all vars in the rsum(vars) are missing then egen should return missing, and not zero?

Thanks a lot

Tartan said...

I have a data set where there are say 10 observations for each participant id e.g. birth, illness, outmigration, death etc. Looks a bit like this:
222 08/06/12 Birth
222 09/07/12 Illness
222 09/10/12 outmigration
Some of these observations are census enumerations and I have ranked these observations 1 to max based on the date on which enumeration was done. I would like to keep just the date on which the last enumeration(maximum rank) observation was performed. Any advice?

neha said...

Hi stataman!

I have two variables named "A" & "B", and I want to get "A-B" for a new variable called "C". For that I use, "generate C= A-B". But it gives some missing values in "C", when one of A & B variables contain missing values. I want to consider those missing values as zeros & get the answer. How can I do it?
Thanks a lot!

Palm Beach Vending Machine said...

It will treat missing values as if they were another category.
Palm Beach Vending Machine

Elijah Agyapong said...

Data for this question is located at "". This is a 1977 and 1989 data on attitudes toward working mothers ability to develop a warm relationship with their children. I ran the following basic model for linear regression, ordered logit, and ordered probit...."warm = yr89 + male + white + age + ed + prst".

Now for the Ordered Regression Models (ORM), assume that the response category (SD, D, A, or SA) with the highest predicted probability is the predicted response. How can I create a table comparing the predicted outcome with the observed outcome for each observation, for each of the three models?

I'm really new to stata and trying to learn regression models for categorical and limited dependent variables. A friend told me to use the 'max', 'predict', and 'prtab' commands but I have no clue how to solve this example I encountered. Any help and suggestions will be appreciated.

oneappcloud said...

Come to our website
search engine optimization that
is a largest site where social business outsourcing, search engine optimization,
seo, buy twitter followers, youtube views services are available.
more detail.......
buy twitter followers
youtube views services

Erica Haney said...

So I am trying to generate a variable which denotes 1 if the household has both orphans and non orphans, and 0 if the household is not mixed (only orphans or only nonorphans). Is there a way to do this with egen and flags?

Djotieka Rakhan said...


I have 2 variables namely id (a respondents id) and the year when the person filled in the questionnaire.
The data that I have is available for 3 years namely 2007, 2008 and 2009. I want to compare the year 2007 with 2008 and 2007 with 2009.
I want to know the percentage of people that have participated in 2007, whether they have participated again in 2008 or not.
What I find difficult is that the id variable for person 1 takes the value 1 for all three years. This happens with all 7728 respondents.

So can I get the participation variable with the egen command? If yes how can I make this partcipation varaible?

Thanks a million


for-each loop limitations
Three scenarios where you can not use for each loop

for each loop provides a way to iterate over a collection of objects of class implementing Iterable interface. It provides little better performance as compared to traditional for loop and is a cleaner way to iterate over a collection. It removes the chances of error that could be introduced while iterating over multiple collection through traditional for loop . Developer could got wrong in updating and traversing over changing indexes of objects in collection . This kind of error is unlikely in for each loop

But there are situations where for-each can't be used

1. If you want to remove an element of collection while iterating for-each provides no remove method. So you can't use it in such scenario

2. If you want to replace some element in collection while iterating over collection You can't do that as there is no replace or update method here neither you have control to access the element with its index value .

3.A situation where you are working in parallel with multiple collections have to do a lot of operation with current index value of iterable collection. You don't have access to element index while iterating with for-each loop So you can't use it.

Read here about - for-each loop limitations Three scenarios where you can not use for each loop

shenablogs said...


I am trying to use the hours worked per week variable (HRS2) in my dataset. It is organized by how many respondents chose 1 hr/week, 2hr/week, etc, up until 70hr/week. There are about 30-40 respondents per year, and the data is from 1973-2012.

What I am trying to do is use the (egen, cut) function in order to cut these groups into: worked from 0-10h/wk, 11-20h/wk, 21-30h/wk, 31-40h/wk, 41-50h/wk, and 50+h/wk, for a total of 6 groups. So far I have this:

egen hourswrkedperwk = cut(HRS2){at(10,20,30,40,50)|group(6)}

But I'm getting an error message so I'm sure I'm missing something.

Help! Please!

Thank you!

Sergio said...
This comment has been removed by the author.
Jalabert said...


I have just started using STATA. I would really appreciate help.

I want to create a binary variable (create a flag) which indicates from coded medical data (like 8PG.032) if patient suffers from cancer or not.

Thank you in advance for the answer

Zbigniew Cybulski said...
This comment has been removed by the author.
Josh Catalano said...

Hey Stataman,

I'm having problems with the egen command. I have a large amount of flight data, and an observation is a flight (not a route but a specific instance that route was traveled). I want to count all of the times a given route was traveled in a given day, and then I want to perform the same calculation but also count by airline. I've been using these commands:

bys FL_DATE origin dest: egen tot_route_day = count
bys FL_DATE origin dest AIRLINE_ID: egen AL_route_day = count

Stata is telling me my syntax is wrong, but it's not telling me how it is wrong. Is there anyway that you could help?

Apple said...


I have a dataset of family expenditure by individuals. This is I have for household 1, person 1, 2, 3; then for household 2 again person 1,2,3,4. One is the household reference person and 2 is the spouse. I want to keep only households where there exists a value of 1 and 2 within the household. This is, if household 7 has person 1, 3 and 5 but no person 2, I would like STATA to drop it. If household 8 has person 1,2,3 on it, I would like STATA to keep it. Any ideas? Could this egen command work somehow?

Thanks a lot!

Household Person
1 1
1 2
1 3 (Keep 1)
2 1
2 2
2 3
2 4 (Keep 2)
7 1
7 3
7 5 (Drop 7)

aneka obat said...

very good information and Inspiring & Interesting.

success always

Nurjamal said...

Hello stataman,
I have a panel data set (137 countries over 27 years). Could you please advise me how to generate a new variable which represents the average of one variable over 10 or 15 years (which I want to fix ahead)?
Thanks a lot in advance!

francesca said...

Hi there. I have a dataset that tracks people within firms. Each person and each company have a specific identifier. I want to compute a variable that codes as 1 the first year a person is in a company, as 2 the second year a person is in the same company, and so on. could you please help?

Gemma Smith said...


I have a question about missing variables. I have a large data set that includes weekly units consumed of different types of alcohol. I am trying to calculate the total weekly units. However, some types of alcohol have missing data, so when I do 'gen unittotal=X+Y+Z' I often get missing data in the place of a numerical value. I then used egen function to calculate a total using rsum. However, where there is a full dataset missing from a participant that dropped out (not just one data point relating to one specific type of alcohol) I get a zero, in the place of a missing data point '.'

I do not know how to sum the values whilst ignoring missing data, but also not counting missing data as '0'.

Thank you!

Jens C. Kruse said...

Nice work..

D Hanudin said...

Participate ser gan, Info Panacea For Hepatitis B

Anyoption6 said...

Groß Informationen! Ausgezeichnete Schreiben. Ich bin sicher, dass ich diese Site bald wieder besuchen.
Ich habe einige relevante Informationen, die Sie unten überprüfen können
Binäre Optionen handeln

Cricket Betting Badshah said...

We are provide accurate cricket betting tips for all major events like ICC tournaments, IPL, Big bash county.Get Prediction for Session, Pari and Match results.

Emily Jorgensen said...

Stata command help

bandito said...

1) a. browse
b. sum BMI

2) a. tabstat BMI if RFPAMOD == 1, statistics (median)

3) a. sum BMI
b. sum BMI if BMI > xxxxx
c. ans1/ans2

4) a. hist FRTSERV, percent
b. tabstat FRTSERV, statistics (mean median)

5) a. twoway scatter BMI FRTSERV
b. twoway (scatter BMI FRTSERV) (lfit BMI FRTSERV)
c. pwcorr BMI FRTSERV

6) a. regress BMI FRTSERV
b. slope for one unit increase in x there is an m unit change in y
c. r-squared x% of change in y can be attributed to x
d. for regression hypthoses, Ho = slope is not 0 meaning there is a significant linear relationship, Ha = slope is 0

7) a. ttest BMI, by(RFPAMOD)

8) a. recode FRTSERV (min/1=1)(1/3=2)(3/5=3)(5/max=4), generate (FRTINDX)
b. tab BMICAT FRTINDX, chi2
c. chi squared hypotheses, ho - independent, pval is next to chi

e3cleary said...


This blog is really useful - thank you. I wonder if you could help me with a problem I'm having? I have a dataset of 10,000 people in 77 locations with disease outcome 1,0. I'm trying to generate a new variable which is mean prevalence for each location adjusted by age category and sex. Any idea how I do this?

Thanks again,


Ex LeeT said...

JoinBettingSites Nice Blog

Admin said...

You really make it seem so easy with your presentation but I find this matter to be actually something that I think I would never understand. It seems too complex and extremely broad for me. I am looking forward for your next post, I'll try to get the hang of it!
Tips Game
My Chanel
Tips Android

Mariam said...

case there is anyone who has similar problem and still

looking for a way out, and he those cast all kind of spell like ::
Love Spells
Luck, Money Spells
Health, Well Being
Protection, Healing
Curses, ex, Breakups
NEW! Combo Spells
High Priestess Spells
Vampire Spells
Authentic Voodoo Spells
Custom, Other Spells
Business spells
Health/Healing spells
Cancer healing
Curse removal
Job spells
Healing from all kind of diseases
Love binding
Barrenness(need a child)
Need love
Lottery Spells
Money rituals
winning court case
Divorce spells
Low sperm count
Infertility in women
Breast enlargement/reduction
Penis enlargement/reduction
YOU CAN CONTACT HIM HERE AS ( and also his state based number text him here if you're in the US: 760-935-3804 if you need any question contact me via here as i wish you best of luck and good health.